Deep learning refers to training neural networks and is a major contributor to many of the recent advances in machine learning. Deep learning has been the subject of many popular press articles on how deep neural networks have revolutionized the field of speech recognition. Even more impressive, in my opinion, is that Google’s AI algorithm mastered the game of Go, defeating the world champion. During that match, the algorithm made a brilliant move that it had not been taught; the implication being that it learned the move! In this post, we will go step by step through the process of image recognition, using logistic regression to identify images of dogs. Before you go, I am sure the thought pops up: so what? The principles applied here are the same ones used for real-world applications such as self-driving cars (distinguishing a red light from a green light, or a yield sign from a stop sign), identifying people by name, separating types of fish as they come off the conveyor belt, or medical imaging that distinguishes healthy brains from brains with tumors, as shown in the following code and figure:
library(imager)

img <- read.csv('./images/tumor.csv')
img <- as.matrix(img)
img <- imrotate(as.cimg(img), 90)
dim(img)
# [1] 512 570 1 1
plot(img, axes = FALSE)
The amazing aspect of deep learning algorithms is that they are much more accurate than many other classification algorithms, and fast enough to be designated as performing in near real time. In this post I will discuss this relatively new approach to classification, with one caveat: the idea of backpropagation is not new at all, and was first written about in 1986 by Geoffrey Hinton and colleagues. I won’t go into the details of the article, but suffice it to say that advances in computing power have enabled the concepts of backpropagation to be realized, and without further similar advances, progress in AI might quickly peak.
Caveat aside, one thing I find so fascinating about neural networks is that nobody really understands why they work so well; much like the human brain. Professor Andrew Ng from Stanford University, Ankur Moitra from MIT, and even Geoffrey Hinton concur that the analogies drawn between the brain and neural networks are somewhat exaggerated and contrived. So, given that the experts are not really sure how it works, I will not go there either. What this post does do is provide a step-by-step guide for building a deep learning algorithm.
ImageNet:
During my research on the topic of image recognition I stumbled across something called ImageNet, which is an “image database organized according to the WordNet hierarchy (currently only the nouns). Each noun node in the WordNet hierarchy is depicted by hundreds and thousands of images.” How cool is that?! This post will use images from a variety of sources, including the Kaggle web site. I can’t share the images, but if you have an account you can easily download them. For that matter, you can use this same code and process for classifying any image.
ImageNet is well known within the computer vision community, but it was new to me. There is a yearly competition to see whose algorithm can successfully identify the most images. This year, all three contests are being held by Kaggle. ImageNet contains approximately one million labeled images, each labeled with 1 of 1,000 possible categories. These categories can be very specific, identifying a particular breed of dog, for example. The objective of the contest is to train a classifier that correctly labels images with the lowest error rate. My point, and my reason for bringing up ImageNet: deep learning algorithms have won every year since 2012.
Preparing Image Data for Analysis:
The problem addressed in this series of posts is identifying images of dogs, which could just as easily be a cat, a person, a car, a street sign, or any other noun object. These processes are not limited to images, either. The same principles can be applied to audio files for voice recognition, natural language processing, and much more. However, the objective of this post is to explain as best I can how neural networks with deep learning backpropagation are applied to do some amazing things. This is a simple binary classification problem, with 1 representing an image with a dog, and 0 (zero) representing an image without a dog. These types of binary classifications, as we discussed earlier in a logistic regression post, can be used for many types of analysis where a yes/no, 1/0, TRUE/FALSE type response is required.
Image Structures and Preparing the data for Analysis in R or Python:
An image comes in various levels of resolution; the notation 480×600 represents 480 pixels on the vertical axis and 600 pixels on the horizontal axis. When working with images, matrices are used, and many processing pipelines expect a fixed square input, so the image is typically cropped or resized to a square, such as 480×480 or 250×250. Images are stored on a computer as three separate matrices corresponding to the Red, Green, and Blue color channels of the image. Each matrix of the spectrum (RGB) is flattened, and the results are combined into a single vector.
For this example, we will take an image of a cat (Petunia) that has been cropped to a 250×250 image, meaning 250 pixels high and 250 pixels wide. Each matrix for a given color is a 250×250 matrix, as shown in Figure 2, and is converted into an input feature vector. The value in each cell of the matrix represents the pixel intensity for that color, and these values are the input to the feature vector, x, of n dimensions. This feature vector represents the object, in this case a 250×250 image.
So, following the steps described above, and using the programming language of your choice, an image is read in and split into three separate matrices, each with values representing the intensity of the color image for its layer. It really doesn’t matter which programming language you use; there will be differences, but they are easy to work around. I used the ‘imager’ package in R 3.4.1, which represents images as 4D numeric arrays. Each dimension is labelled in order: x, y, z, c. You can see below in the code block that the image of the dog safely harnessed in for a ride to the park (Sam) has the dimensions 250 x 250 x 1 x 3. The third dimension, z, is used for videos, one for each frame of the video, and is referred to as the depth, or time, dimension. The last dimension, c, references each color dimension (R/G/B) and is referred to as the spectrum dimension. The first two dimensions, as you would expect, refer to the image itself and are called the spatial dimensions.
library(imager)
set.seed(1)
img <- 'Sam1.jpg'
img <- load.image(paste('./images/', img, sep = ''))
img <- resize(img, size_x = 250, size_y = 250, size_z = 1, size_c = 3,
              interpolation_type = 1L, boundary_conditions = 0L)
dim(img)
# [1] 250 250 1 3
a <- imsplit(img, 'c')  # split the image into its 3 color matrices
par(mfrow = c(2, 2))
plot(img, axes = FALSE, main = 'Sam in Full Color')
plot(a$`c = 1`, main = 'Dimension 1')
plot(a$`c = 2`, main = 'Dimension 2')
plot(a$`c = 3`, main = 'Dimension 3')
The following plot shows the image of Sam with all the dimensions for x, y, and c (250 x 250 x 3; since there is only one image, and it is not a video, we can ignore the z dimension); this is the ‘Sam in Full Color’ image. The image labeled ‘Dimension 1’ shows the intensity values of the image for the first color channel, with the dimensions 250 x 250 x 1, and so on for the images labeled ‘Dimension 2’ and ‘Dimension 3.’
The three spectrum dimensions are converted into one feature vector of dimension $$n_{x} \times 1$$, where $$n_{x} = 250 \cdot 250 \cdot 3 = 187{,}500$$
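The flattening step can be sketched in Python with NumPy (the R ‘imager’ workflow above is equivalent; the random values here are stand-ins for real pixel intensities):

```python
import numpy as np

# Simulate a 250 x 250 RGB image: three stacked 250 x 250 channel matrices,
# with random values standing in for pixel intensities.
img = np.random.rand(250, 250, 3)

# Flatten each color channel row by row, then append the channels end to end
# into one feature vector x of length n_x = 250 * 250 * 3 = 187,500.
x = np.concatenate([img[:, :, c].reshape(-1) for c in range(3)])

print(x.shape)  # (187500,)
```

Each channel contributes a contiguous block of 62,500 values, so the first 62,500 entries of x are exactly the red channel read left to right, row by row.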
Properly Creating the Input Image Vector:
As you recall from Figure 2, the vector is created starting at the top left of the matrix and moving to the right until the end of the top row is reached. The process then starts over on the next row, and so on, until all of the matrix values are in the vector. It then moves to the next dimension's matrix and repeats the process, appending the values to the same feature vector. Unless you are really accustomed to working with matrices, it is a good idea to create a small matrix and make sure the vectors are being populated as expected. The following code demonstrates the process:
# Create a small matrix:
x <- 1:16 ; dim(x) <- c(4,4)
# Display the matrix
x
#      [,1] [,2] [,3] [,4]
# [1,]    1    5    9   13
# [2,]    2    6   10   14
# [3,]    3    7   11   15
# [4,]    4    8   12   16
# Transpose the matrix (in Python this is done with x = x.T):
x <- t(x)
x
#      [,1] [,2] [,3] [,4]
# [1,]    1    2    3    4
# [2,]    5    6    7    8
# [3,]    9   10   11   12
# [4,]   13   14   15   16
# Now convert the matrix into a vector:
x_1 <- as.vector(x)
x_1
# [1]  1  5  9 13  2  6 10 14  3  7 11 15  4  8 12 16
# Notice that the order matches the original matrix from top left to right, by row.
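The same sanity check can be run in Python. One difference worth knowing: R fills and reads matrices column by column (column-major order), which is why the transpose is needed there, while NumPy's default is row-major, so no transpose is required:

```python
import numpy as np

# Build the same 4x4 matrix R produces: dim(x) <- c(4,4) fills
# column-major, so use order='F' (Fortran order) here to match.
x = np.arange(1, 17).reshape(4, 4, order='F')
print(x)
# [[ 1  5  9 13]
#  [ 2  6 10 14]
#  [ 3  7 11 15]
#  [ 4  8 12 16]]

# NumPy's default flatten is row-major (C order), which walks the matrix
# from top left to the right, row by row -- the order we want.
x_1 = x.flatten()
print(x_1)
# [ 1  5  9 13  2  6 10 14  3  7 11 15  4  8 12 16]
```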
Standard Notations and Definitions for Deep Learning:
A single training pair is (x, y), where $$x \in R^{n_{x}}$$ (an $$n_{x}$$-dimensional feature vector) and y, the label, is either 0 or 1: $$y \in \left\{0, 1\right\}$$
Sizes:
m : number of images in the dataset
$$n_{x}$$ : input size
$$n_{y}$$ : output size (or number of classes)
In Python, $$X.shape = (n_{x}, m)$$ provides the dimensions of the matrix X, whose columns are the m feature vectors. In R, dim(X) provides the dimensions.
The output labels, y, are represented similarly:
$$Y\ =\ \left[{y^{(1)},y^{(2)},\ .\ .\ .\ .,\ y^{(m)}}\right]\ $$
$$Y \in R^{1 \times m}$$
$$Y.shape\ =\ (1,\ m)$$
OR:
$$dim(Y)\ =\ (1,\ m)$$
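This notation is easy to check in code. A small NumPy sketch, using m = 5 hypothetical training examples with random values standing in for real feature vectors:

```python
import numpy as np

m = 5                 # number of images in the (hypothetical) dataset
n_x = 250 * 250 * 3   # input size: 187,500 features per image

# X stacks one feature vector per column; Y holds the 0/1 labels in a row.
X = np.random.rand(n_x, m)
Y = np.array([[1, 0, 1, 1, 0]])

print(X.shape)  # (187500, 5)
print(Y.shape)  # (1, 5)
```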
Activation Function:
The activation function is applied to the input values of a node. For example, a simple function applied to the feature vector created above is the first step, and this function is called the activation function. As we will discuss later, this function need not be the same at each layer of the network. There are a number of activation functions; in the process defined here, we will use the Rectified Linear Unit (ReLU) for the hidden layers, and the sigmoid function for the final layer (the output of which is the calculated y-hat, or predicted value), as described below. In neural network terms, each such node generalizes the linear perceptron, defined below.
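Both activation functions mentioned here are one-liners; a minimal Python sketch:

```python
import numpy as np

def relu(z):
    """Rectified Linear Unit: max(0, z), used in the hidden layers."""
    return np.maximum(0, z)

def sigmoid(z):
    """Sigmoid: 1 / (1 + e^(-z)), used in the output layer;
    squashes any real z into the interval (0, 1)."""
    return 1 / (1 + np.exp(-z))

print(relu(np.array([-2.0, 0.0, 3.0])))  # [0. 0. 3.]
print(sigmoid(np.array([0.0])))          # [0.5]
```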
Logistic Regression:
Binary classification problems: given x (in this case an image of a dog), we want to determine the probability that y = 1 given x, or:

$$\hat{y} = P(y = 1 \mid x)$$

Since we are looking for a probability for y-hat, this means that:

$$0 \le \hat{y} \le 1$$
For the logistic regression formula, we will use the parameters w, which is an $$n_{x}$$-dimensional vector, and b, which is a real number. To ensure that y-hat is a probability, we will use the sigmoid function of w transpose times x, plus b, which is equal to z:

$$\hat{y} = \sigma(w^{T}x + b) = \sigma(z), \quad \sigma(z) = \frac{1}{1 + e^{-z}}$$
From Figure 4 you can see that as z increases, $$e^{-z}$$ approaches zero, so the denominator approaches 1 and sigma approaches 1. Likewise, as z gets very small (a very large negative number), $$e^{-z}$$ grows very large and sigma approaches 0. As desired when seeking a probability, the value of y-hat always falls between 0 and 1.
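This limiting behavior is easy to verify numerically. A sketch; the values of w, x, and b below are small made-up numbers, not taken from the dog-image data:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Large positive z pushes sigmoid toward 1; large negative z toward 0.
print(sigmoid(10.0))   # ~0.99995
print(sigmoid(-10.0))  # ~0.0000454
print(sigmoid(0.0))    # 0.5

# y-hat = sigmoid(w^T x + b) for a tiny made-up example:
w = np.array([0.2, -0.1, 0.4])
x = np.array([1.0, 2.0, 3.0])
b = 0.1
z = np.dot(w, x) + b   # w transpose times x, plus b
y_hat = sigmoid(z)     # a probability between 0 and 1
print(y_hat)
```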
As discussed above, the sigmoid function is an activation function, and can be other functions as well, such as Rectified Linear Unit (ReLU) activation function, or the perceptron function. In our example here, the ReLU function is used for the inside, or hidden layers, and the sigmoid function for the output layer.
Logistic Regression Cost Function:
Loss Function:
A loss, or error, function measures how well the algorithm is performing on a single training example. y-hat is the predicted outcome, and y is the actual value. The loss function is defined by:

$$\mathcal{L}(\hat{y}, y) = -\left(y \log \hat{y} + (1 - y)\log(1 - \hat{y})\right)$$
If y = 1, then:

$$\mathcal{L}(\hat{y}, y) = -\log \hat{y}$$

and we want y-hat to be large (close to 1).
If y = 0, then:

$$\mathcal{L}(\hat{y}, y) = -\log(1 - \hat{y})$$

and we want y-hat to be small (close to 0).
The above states that if y = 0, then the loss function will push the parameters to make y-hat (the predicted value) as close to zero as possible (because there is no dog in the image, and the actual value, y, is 0, so we want our prediction to also be 0 in the training set), and the opposite holds true if y = 1.
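A minimal sketch of this loss in Python; the small epsilon clamp is my addition to guard against log(0), not part of the formula itself:

```python
import numpy as np

def loss(y_hat, y, eps=1e-15):
    """Cross-entropy loss for a single training example.
    eps keeps y_hat strictly inside (0, 1) so log never sees 0."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# y = 1: loss is small when y-hat is near 1, large when y-hat is near 0.
print(loss(0.9, 1))  # ~0.105
print(loss(0.1, 1))  # ~2.303
# y = 0: the reverse.
print(loss(0.1, 0))  # ~0.105
```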
Cost Function:
The cost function measures how well the algorithm is performing on the entire training set and is used to train your logistic regression model. You could set your loss function to be $$\frac{1}{2}(\hat{y} - y)^{2}$$, the squared error. Unfortunately, what you end up with is a non-convex function, so you end up with an optimization problem with multiple local optima, which in turn means that gradient descent might not find the global optimum.
The cost function J applied to the parameters w and b is the average of the loss function applied to each of the training examples:

$$J(w, b) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)}, y^{(i)})$$

This cost function provides us with the cost of the parameters, with the objective being to find the parameters w and b that minimize the cost of the logistic regression function.
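The cost is just that average taken over the whole training set. A sketch continuing the earlier row-vector notation for Y; the label and prediction values below are made up for illustration:

```python
import numpy as np

def cost(Y_hat, Y, eps=1e-15):
    """Average cross-entropy loss over all m training examples.
    Y and Y_hat are (1, m) row vectors; eps guards against log(0)."""
    Y_hat = np.clip(Y_hat, eps, 1 - eps)
    m = Y.shape[1]
    return -np.sum(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat)) / m

Y = np.array([[1, 0, 1]])          # actual labels
Y_hat = np.array([[0.9, 0.2, 0.8]])  # made-up predictions
print(cost(Y_hat, Y))  # ~0.184
```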
This post will continue with a discussion on Gradient Descent.