As suggested at the beginning of the first post on neural networks, this post covers only the transformation of images into an HDF5 file. You can easily skip this post if you are not interested in this step. However, it is a short post, and the most important step here is making sure that you transpose your image matrix when creating the feature vector. As I mentioned, this analysis is written in the R programming language, but it can easily be translated into Python. I have personally gotten comfortable enough with both R and Python that I don't really have a preference. However, there were fewer examples of this type of programming in R on the Internet, so I decided to use R in the hope that it would be more useful to others. Also, if you have a better way of doing it, please share.
Preparing the Image Data and Train/Test/Validation Datasets:
There are two methods that will be used during this analysis for storing and retrieving image data. The first, discussed here, is the HDF5 file format. This is not a requirement for executing the analysis and has nothing to do with deep learning per se; however, many of the matrices can get very large, and managing memory can be challenging. The second method uses Spark to retrieve data for the analysis.
Even using HDF5, I had to limit the number of images used for analysis to 750 for the training set and 250 each for the testing and validation datasets. Image processing is memory intensive, especially when you use a large number of images. I extended the count as far as I could to test the claim that deep learning neural networks become even more accurate with more data. I will use 750 images for the HDF5 method and 15,000 for the Spark method; we will see.
Resizing Images Using R:
# Read file names/addresses and labels (cat or not-cat) from the 'train' folder
library(imager)

addrs <- list.files(path = './train', pattern = "jpg", full.names = FALSE)
labels <- lapply(addrs, function(x) ifelse(grepl('cat', x), 1, 0)) # 1 = Cat, 0 = Dog

# Could pad images to prevent distortion before resizing.
# Removed all images larger than 500 pixels in width or height.
# Deleted two images that were bigger than 500x500 ("cat.835.jpg" and
# "dog.2317.jpg") -- probably not necessary, but with 25,000 images
# I didn't want to be concerned about two outliers.
for (i in 1:length(addrs)) {
  img <- load.image(paste('./train/', addrs[i], sep = ''))
  img <- resize(img, size_x = 250, size_y = 250, size_z = 1, size_c = 3,
                interpolation_type = 1L, boundary_conditions = 0L)
  imager::save.image(img, paste('./train1/', addrs[i], sep = '')) # sep = '' avoids a space in the path
}
Creating Train, Test, and Validation Datasets:
With the image sizes standardized, the next step is to randomly shuffle the 'dog' and 'not-dog' images, then randomly separate the shuffled images into training, testing, and validation datasets. The images are then converted to feature vectors and combined into an (n x m) `X` matrix for each of the training, testing, and validation datasets.
Even using the 'bigmemory' package in R, a matrix of (n x m) = (187500 x 14988) is simply too large to maintain in memory, even though the computer being used has 64 GB of RAM. So for the HDF5 example we will use a randomly selected subset of the images to create our datasets, while the Spark example uses the complete set of 25,000 images. Deep learning is said to be more accurate with more data, and we will test that claim.
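The memory claim is easy to sanity-check with back-of-the-envelope arithmetic. A quick sketch in Python (the analysis translates directly between R and Python, as noted above); the dimensions come from the text, and 8 bytes per value assumes double precision, which is how R stores numeric matrices:

```python
# Rough memory estimate for the full (n x m) design matrix
n_features = 250 * 250 * 3   # 187,500 values per image
n_images = 14988             # columns in the full matrix

bytes_needed = n_features * n_images * 8   # 8-byte doubles
print(f"{bytes_needed / 2**30:.1f} GiB")   # ~20.9 GiB

# And that is just the final object: growing the matrix one column
# at a time with cbind() copies it on every iteration, so peak
# memory use during construction is far higher.
```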
library(caTools)  # for sample.split()

addrs <- list.files(path = './train1', pattern = "jpg", full.names = FALSE)
labels <- lapply(addrs, function(x) ifelse(grepl('cat', x), 0, 1)) # 0 = Cat, 1 = Dog
label_addrs <- as.data.frame(cbind(labels = unlist(labels), addrs = unlist(addrs)),
                             stringsAsFactors = FALSE)

# Shuffle the data
label_addrs_shuffled <- label_addrs[sample(nrow(label_addrs)), ]

# Divide the data into 60% train, 20% validation, and 20% test
spl <- sample.split(label_addrs_shuffled$labels, SplitRatio = 0.05) # using 5% of images
sampleDataSet <- subset(label_addrs_shuffled, spl == TRUE)

spl <- sample.split(sampleDataSet$labels, SplitRatio = 0.6)  # taking 60% of the 5%
Train <- subset(sampleDataSet, spl == TRUE)     # 750 images in the Train dataset
test_val <- subset(sampleDataSet, spl == FALSE) # remaining 40% split into Test and Validation

spl2 <- sample.split(test_val$labels, SplitRatio = 0.5) # splitting the 40% in half
Test <- subset(test_val, spl2 == TRUE)        # half (20%) Test
Validation <- subset(test_val, spl2 == FALSE) # the other half (20%) Validation

rm(spl, spl2, test_val, sampleDataSet, label_addrs, label_addrs_shuffled)

# 750 images in the Train dataset -- 375 dog images
# 250 images in the Test dataset  -- 125 dog images
# 250 images in the Val dataset   -- 125 dog images
train_addrs  <- Train$addrs
train_labels <- as.numeric(Train$labels)
val_addrs    <- Validation$addrs
val_labels   <- as.numeric(Validation$labels)
test_addrs   <- Test$addrs
test_labels  <- as.numeric(Test$labels)

#####################################################
# Right now the Train, Test, and Validation sets only
# contain the Y value and the name of the file/image.
# In the next step the images will be loaded and
# converted to vectors.
# The vectors will then be combined using 'cbind'
# to create one large (n x m) matrix and saved to an
# HDF5 file for later processing.
#####################################################
The important part to remember here is making sure that you properly convert regular images into feature vectors, as discussed in part 1. Notice in the code below that each channel matrix is transposed before it is flattened into a vector.
temp <- vector()
cnt <- 0
if (exists("train_x")) { rm(train_x) }

for (i in 1:length(train_addrs)) {
  img <- load.image(paste('./train1/', train_addrs[i], sep = ''))
  imgX <- imsplit(img, 'c')
  names(imgX) <- c('c1', 'c2', 'c3')

  # Check the discussion on forming the vectors properly.
  # The c1-c3 objects are still image objects, each holding
  # just one channel of the RGB spectrum:
  # imgX$c1
  # Image. Width: 250 pix Height: 250 pix Depth: 1 Colour channels: 1

  # Each channel must be converted to a matrix, transposed, and
  # converted into a vector.
  c_1 <- as.vector(t(as.matrix(imgX$c1, byrow = TRUE)))
  c_2 <- as.vector(t(as.matrix(imgX$c2, byrow = TRUE)))
  c_3 <- as.vector(t(as.matrix(imgX$c3, byrow = TRUE)))

  # The channels are appended in order to create a vector
  # of length 187500
  temp <- append(c_1, c_2, length(c_1))
  temp <- append(temp, c_3, length(temp))

  if (!exists("train_x")) {
    train_x <- matrix(nrow = length(temp), ncol = 1)
    train_x[, 1] <- temp
  } else {
    train_x <- cbind(train_x, temp)
  }

  cnt <- cnt + 1
  if (cnt %% length(train_addrs) == 0) {
    s1 <- object.size(train_x)
    cat(paste(i, " images loaded into train_x, size: "))
    print(s1, units = "MB")
  }
}

# s1 <- object.size(train_x)
# print(s1, units = "auto")
# print(s1, units = "auto", standard = "IEC")
# print(s1, units = "auto", standard = "SI")
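The effect of the transpose is easiest to see on a tiny matrix. A minimal NumPy sketch (Python is used here purely for illustration): R's `as.vector` reads a matrix column by column, so transposing first yields the row-by-row pixel order the feature vectors rely on.

```python
import numpy as np

m = np.array([[1, 2],
              [3, 4]])

# Column-major flattening, which is what R's as.vector() does:
col_major = m.flatten(order="F")     # [1, 3, 2, 4]

# Transpose first, as in t(as.matrix(...)) above, and the same
# column-major walk produces row-by-row order instead:
row_major = m.T.flatten(order="F")   # [1, 2, 3, 4]

print(col_major.tolist(), row_major.tolist())
```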
The size of the matrix is approximately 1 GB. As we will see when the data is reloaded from the HDF5 file, the object in the R session is only about 1.8 KB, since it is just a pointer to the data on disk.
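The 1 GB figure checks out with simple arithmetic (again assuming 8-byte double storage):

```python
n_features = 250 * 250 * 3               # 187,500 values per image
n_images = 750                           # columns in train_x
size_bytes = n_features * n_images * 8   # 8-byte doubles

print(f"{size_bytes / 2**30:.2f} GiB")   # ~1.05 GiB
```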
The following code demonstrates how to extract data from a matrix and recreate the image:
x <- train_x[, 1]
dim(x) <- c(250, 250, 1, 3)
dim(x)
# [1] 250 250   1   3
plot(imrotate(as.cimg(x), 90))
Perform the same operations to create the testing and validation datasets:
temp <- vector()
cnt <- 0
if (exists("val_x")) { rm(val_x) }

for (i in 1:length(val_addrs)) {
  img <- load.image(paste('./train1/', val_addrs[i], sep = ''))
  imgX <- imsplit(img, 'c')
  names(imgX) <- c('c1', 'c2', 'c3')

  c_1 <- as.vector(t(as.matrix(imgX$c1, byrow = TRUE)))
  c_2 <- as.vector(t(as.matrix(imgX$c2, byrow = TRUE)))
  c_3 <- as.vector(t(as.matrix(imgX$c3, byrow = TRUE)))

  # The channels are appended in order to create a vector
  # of length 187500
  temp <- append(c_1, c_2, length(c_1))
  temp <- append(temp, c_3, length(temp))

  if (!exists("val_x")) {
    val_x <- matrix(nrow = length(temp), ncol = 1)
    val_x[, 1] <- temp
  } else {
    val_x <- cbind(val_x, temp)
  }

  cnt <- cnt + 1
  if (cnt %% length(val_addrs) == 0) {
    s1 <- object.size(val_x)
    cat(paste(i, " images loaded into val_x, size: "))
    print(s1, units = "MB")
  }
}

temp <- vector()
cnt <- 0
if (exists("test_x")) { rm(test_x) }

for (i in 1:length(test_addrs)) {
  img <- load.image(paste('./train1/', test_addrs[i], sep = ''))
  imgX <- imsplit(img, 'c')
  names(imgX) <- c('c1', 'c2', 'c3')

  c_1 <- as.vector(t(as.matrix(imgX$c1, byrow = TRUE)))
  c_2 <- as.vector(t(as.matrix(imgX$c2, byrow = TRUE)))
  c_3 <- as.vector(t(as.matrix(imgX$c3, byrow = TRUE)))

  # The channels are appended in order to create a vector
  # of length 187500
  temp <- append(c_1, c_2, length(c_1))
  temp <- append(temp, c_3, length(temp))

  if (!exists("test_x")) {
    test_x <- matrix(nrow = length(temp), ncol = 1)
    test_x[, 1] <- temp
  } else {
    test_x <- cbind(test_x, temp)
  }

  cnt <- cnt + 1
  if (cnt %% length(test_addrs) == 0) {
    s1 <- object.size(test_x)
    cat(paste(i, " images loaded into test_x, size: "))
    print(s1, units = "MB")
  }
}
Creating an HDF5 File:
Now that the training, testing, and validation data have all been converted into matrices of (nxm) dimensions, we don’t want to have to do this every time we use this data. One solution is to save this data in a structure defined by the HDF5 standard for easy retrieval, analysis, and deletion from our work environment.
HDF5 is, in the HDF Group's words, “a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data. HDF5 is portable and is extensible, allowing applications to evolve in their use of HDF5.”
For this step of the analysis, the matrices will be stored in an HDF5 file. The next phase of the analysis will use Spark, with the objective of using all 25,000 images. As you will see, the HDF5 format is very easy to work with, it's fast, and it makes it easier to work within memory constraints on relatively large structures by quickly deleting and reloading datasets as needed.
There are plenty of online examples of using the HDF5 standard in Python, but very few resources for R. If you know of a better way of doing this, please pass it along. However, from what I can tell, the following works:
# In this block the matrices created above are saved to an HDF5 file
# using R's 'h5' package.
library(h5)

# Creating groups for training, testing, and validation enables the
# loading of multiple datasets as needed. Datasets with the suffix 'X'
# contain the input features, those with the suffix 'Addrs' contain the
# image file names, and those with the suffix 'Labels' contain the Y
# values for each image.

# mode = 'a': read/write if exists, create otherwise (default)
file <- h5file("dogvcat1.h5", mode = 'a')

# Create new DataSet 'trainX' in H5Group 'traingroup'
file["traingroup/trainX"]      <- train_x
file["traingroup/trainAddrs"]  <- train_addrs
file["traingroup/trainLabels"] <- train_labels

# Create new DataSet 'testX' in H5Group 'testgroup'
file["testgroup/testX"]      <- test_x
file["testgroup/testAddrs"]  <- test_addrs
file["testgroup/testLabels"] <- test_labels

# Create new DataSet 'valX' in H5Group 'valgroup'
file["valgroup/valX"]      <- val_x
file["valgroup/valAddrs"]  <- val_addrs
file["valgroup/valLabels"] <- val_labels

h5close(file)
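For readers following along in Python rather than R, a roughly equivalent save step using the h5py library might look like the sketch below. The tiny stand-in arrays and the file name `dogvcat1_example.h5` are placeholders for illustration, not the author's code; the real matrices are 187500 x 750 and 187500 x 250.

```python
import numpy as np
import h5py

# Tiny stand-in data shaped like (but much smaller than) the
# matrices built above -- placeholder values only.
train_x = np.random.rand(8, 4)
train_labels = np.array([0.0, 1.0, 0.0, 1.0])

# mode 'a' mirrors the R code: read/write if the file exists,
# create it otherwise. Group paths are created implicitly.
with h5py.File("dogvcat1_example.h5", "a") as f:
    f.create_dataset("traingroup/trainX", data=train_x, compression="gzip")
    f.create_dataset("traingroup/trainLabels", data=train_labels)
```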
Managing Data by Groups and Datasets:
Now that the HDF5 file has been created, the matrices can be removed from the environment and loaded as needed. Another advantage of the HDF5 format is that the data can easily be shared with others who want to repeat the analysis, simply by providing the ‘dogvcat1.h5’ file. The real advantage, however, is having rapid access to the data, which can be processed, deleted from the environment, and reloaded if necessary.
Another nice feature of HDF5 is the ability to group datasets. In the code block above you can see that the groups “traingroup”, “testgroup”, and “valgroup” were created. Each group's datasets share consistent suffixes, making it easy to retrieve data quickly.
f <- h5file("dogvcat1.h5", mode = 'r')

# Retrieve the H5Groups
traingroup <- f["traingroup"]
testgroup  <- f["testgroup"]
valgroup   <- f["valgroup"]

print(list.datasets(f, recursive = TRUE))
# [1] "/testgroup/testAddrs"   "/testgroup/testLabels"   "/testgroup/testX"
# [4] "/traingroup/trainAddrs" "/traingroup/trainLabels" "/traingroup/trainX"
# [7] "/valgroup/valAddrs"     "/valgroup/valLabels"     "/valgroup/valX"

# Retrieve datasets from H5Group 'traingroup'
train_x    <- traingroup["trainX"]
trainAddrs <- traingroup["trainAddrs"]
train_y    <- traingroup["trainLabels"]
print(list.datasets(traingroup, recursive = TRUE))
# [1] "/traingroup/trainAddrs" "/traingroup/trainLabels" "/traingroup/trainX"

# Retrieve datasets from H5Group 'testgroup'
test_x    <- testgroup["testX"]
testAddrs <- testgroup["testAddrs"]
test_y    <- testgroup["testLabels"]
print(list.datasets(testgroup, recursive = TRUE))
# [1] "/testgroup/testAddrs" "/testgroup/testLabels" "/testgroup/testX"

# Retrieve datasets from H5Group 'valgroup'
val_x    <- valgroup["valX"]
valAddrs <- valgroup["valAddrs"]
val_y    <- valgroup["valLabels"]
print(list.datasets(valgroup, recursive = TRUE))
# [1] "/valgroup/valAddrs" "/valgroup/valLabels" "/valgroup/valX"
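The same lazy-handle behavior is available from Python via h5py. The self-contained sketch below writes a small demo file first so it can run on its own; the file name and dimensions are placeholders, while the real datasets are the 187500-row matrices.

```python
import numpy as np
import h5py

# Build a small demo file so the example stands alone
with h5py.File("dogvcat1_demo.h5", "w") as f:
    f.create_dataset("traingroup/trainX", data=np.arange(12.0).reshape(4, 3))

with h5py.File("dogvcat1_demo.h5", "r") as f:
    train_x = f["traingroup/trainX"]  # a lazy dataset handle, not an in-memory array
    print(train_x.shape)              # (4, 3): metadata only, nothing read yet
    first_col = train_x[:, 0]         # reads just this slice from disk
    print(first_col.tolist())         # [0.0, 3.0, 6.0, 9.0]
```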
The data can now be accessed just as you would access the matrix. However, if you simply enter ‘train_x’, for example, you get back the following:
> train_x
DataSet 'trainX' (1.875e+05 x 750)
type: numeric
chunksize: 1.875e+05 x 750
maxdim: UNLIMITED
compression: H5Z_FILTER_DEFLATE

> s1 <- object.size(train_x)
> print(s1, units = "auto")
# 1.8 Kb

> dim(train_x[, 1])
# [1] 187500      1
The HDF5 object ‘train_x’ is an S4 object and ‘train_x@dim’ returns (187500, 750).
In the next post we will put this all together and create the deep learning neural network to identify images with dogs.