Deep learning builds on the classical neural network architectures that were all the rage in the 80s. Following a long AI winter, deep learning has emerged as a highly successful technique. In particular, it has already shown impressive results in the domain of computer vision.
Deep learning, as the name implies, uses neural networks composed of many layers. Advances in compute power and the availability of large amounts of data via the internet are the key enablers. To truly appreciate deep learning, you should pick up a book or take a class. This tutorial is more practical: I’ll show you how to build a classifier for 5 flower types: daisy, rose, sunflower, dandelion, and tulip. We are going to use a dataset of flower images that is available for download on Kaggle, and we will build a convolutional neural network using a popular open source framework called Caffe.
Most of this post is about building an image classifier from scratch. In the last part, I’ll present transfer learning, an approach that lets us take advantage of a pre-trained network. For another example of the power of transfer learning, check out my 2008 RSS paper about Relational Reinforcement Learning.
Typically, we would like to have a few thousand samples per class; here we only have 800 images per class. Still, our classifier will achieve about 80% accuracy on the test dataset for the 5 flower classes. This can certainly be improved with a more complex network architecture and more data. Nevertheless, with 5 classes random guessing yields only 20% accuracy, so 80% is a 4x improvement.
Here are some examples of the 5 types of flowers:
As you can see, some flowers look quite similar. Also, some images are better than others for the purpose of learning the appearance of each flower type.
CNN: Convolutional Neural Network
Convolutional neural networks are particularly well suited for image processing tasks because they are inspired by the organization of the visual cortex. A CNN is composed of convolutional layers and pooling layers that enable the network to respond to specific image properties. These properties are very helpful in visual recognition and image understanding tasks.
Here’s an example of a CNN called LeNet (proposed by LeCun in 1998):
A convolution layer is a set of filters that perform a convolution operation with an image. That is, we slide each convolution kernel over the image, computing the dot product between the filter and the input image. The filter’s depth is the same as the input’s. For instance, a color image has 3 channels, so our convolution kernel will be of size k*k*3.
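To make the sliding dot product concrete, here is a minimal NumPy sketch of a single convolution kernel applied to a 3-channel image (stride 1, no padding; this is for intuition only, not how Caffe actually implements convolution):

import numpy as np

def convolve_single_kernel(image, kernel):
    # Slide one k x k x C kernel over an H x W x C image (stride 1, no padding).
    # Each output value is the dot product between the kernel and the
    # image patch underneath it.
    H, W, C = image.shape
    k = kernel.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(H - k + 1):
        for j in range(W - k + 1):
            patch = image[i:i + k, j:j + k, :]   # k x k x C patch
            out[i, j] = np.sum(patch * kernel)   # dot product
    return out

image = np.random.rand(8, 8, 3)    # a toy 3-channel "image"
kernel = np.random.rand(3, 3, 3)   # one 3x3x3 convolution kernel
print(convolve_single_kernel(image, kernel).shape)  # (6, 6)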
A filter is activated (or: generates a maximum response) when interacting with specific structure in the image. Effectively, this response acts as a detector for a certain type of structure / property. For example, a convolution kernel might respond strongly to the presence of an edge in the image. And, deeper in the network, a kernel might respond to the presence of a face in the image.
A pooling layer essentially down-samples the output of the convolution layer (max pooling, described below, is itself a non-linear operation). As you can see in the above image of LeNet, the output gets progressively smaller with every pooling layer.
Pooling reduces the number of parameters in the network. As a result, not only is computation time reduced, but so is the risk of over-fitting.
There are multiple functions that implement pooling. Probably the most common is max pooling.
The image below shows an example of a pooling kernel of size 2×2. Pooling is applied to every depth slice independently. If we use such a kernel with stride 2, the input image is reduced to a quarter of its original size:
This is an example of max pooling. You can see that effectively each 2×2 region in the original image is reduced to a single cell, with the value being the maximum of the values in the 2×2 block.
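To make this concrete, here is a minimal NumPy sketch of 2×2 max pooling with stride 2 (in a real network this is applied to every depth slice independently):

import numpy as np

def max_pool_2x2(x):
    # Replace each non-overlapping 2x2 block of an H x W feature map
    # by its maximum; the output is half the height and half the width.
    H, W = x.shape
    return x[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.array([[1., 3., 2., 1.],
              [4., 6., 5., 2.],
              [7., 2., 1., 0.],
              [1., 8., 3., 4.]])
print(max_pool_2x2(x))
# [[6. 5.]
#  [8. 4.]]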
Putting It All Together
The architecture of a CNN is relatively simple. We start with an input layer (the images) and pass through a sequence of convolution layers and pooling layers. At every level the network reduces the spatial dimensions of the input while extracting increasingly abstract features. At the end of the network we use fully-connected layers. These layers are essentially a classifier, while the previous layers act as feature extractors.
Two important notes (illustrated in the sketch after this list):
- Convolution layers are typically followed by a ReLU (non-linear) activation function
- Only the convolution and fully-connected layers have weights; these weights are what we learn during training
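In Caffe’s prototxt format, this pattern (convolution, then ReLU, then pooling) looks roughly like the following. This is a sketch of the general pattern, with parameter values borrowed from the first block of AlexNet, not necessarily the exact layers of our model:

layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  convolution_param { num_output: 96  kernel_size: 11  stride: 4 }
}
layer {
  name: "relu1"
  type: "ReLU"
  bottom: "conv1"
  top: "conv1"   # ReLU is applied in place
}
layer {
  name: "pool1"
  type: "Pooling"
  bottom: "conv1"
  top: "pool1"
  pooling_param { pool: MAX  kernel_size: 3  stride: 2 }
}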
Now we’re ready to implement our CNN-based flower classifier. As mentioned above, we are going to use a dataset from Kaggle. For simplicity, I have already packaged the data and code snippets. You can download it here.
The images in the dataset are split into training images and test images. We are going to train the network using the training images. This step includes validation. Then, we will use the test images to evaluate the performance of our network.
The CNN implementation proposed here requires Caffe and Python. Please make sure both are installed on your system.
Generally speaking, there are 5 steps in training a CNN:
- Data preparation: basic image processing, computing a mean image, and storing the images in a location and format accessible for Caffe
- Model definition: choose an architecture for our network
- Solver definition: choose the solver parameters
- Training: train the model using Caffe
- Testing: use the trained model at run-time / on the test data
Let’s dive deeper into each step:
Step 1: Data Preparation
Assuming you’ve downloaded the project, go into the code folder.
In the code snippets below, I’m assuming Caffe is installed at ~/caffe and the project is at ~/flowers. It’s straightforward to use different paths.
We are now going to use the script to create the LMDB databases. The script does the following (a condensed Python sketch appears after the list):
- Performs histogram equalization on all training images
- Resizes the images to a fixed size
- Creates a validation set (5/6 of the data for training, 1/6 for validation)
- Stores the images in LMDB databases (one for training, one for validation)
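For reference, here is a condensed Python sketch of the core of such a script. The paths, target image size, and label handling are illustrative assumptions; see the downloaded script for the actual details:

import os
import glob
import cv2
import lmdb
from caffe.proto import caffe_pb2

IMG_SIZE = 227  # assumed target size; check the actual script

def preprocess(img):
    # Equalize the histogram of each color channel, then resize.
    for c in range(3):
        img[:, :, c] = cv2.equalizeHist(img[:, :, c])
    return cv2.resize(img, (IMG_SIZE, IMG_SIZE), interpolation=cv2.INTER_CUBIC)

db = lmdb.open(os.path.expanduser('~/flowers/input/train_lmdb'), map_size=int(1e12))
with db.begin(write=True) as txn:
    for idx, path in enumerate(glob.glob(os.path.expanduser('~/flowers/train/*.jpg'))):
        img = preprocess(cv2.imread(path, cv2.IMREAD_COLOR))
        datum = caffe_pb2.Datum()
        datum.channels, datum.height, datum.width = 3, IMG_SIZE, IMG_SIZE
        datum.data = img.transpose(2, 0, 1).tobytes()  # H x W x C -> C x H x W
        datum.label = 0  # in the real script, the label is derived from the file name
        txn.put('{:08d}'.format(idx).encode(), datum.SerializeToString())
db.close()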
After running the script, we have one last task in preparing the data for training: computing the mean image over all training data. We will subtract this mean image from each input image, so that the inputs to the network are zero-mean.
~/caffe/build/tools/compute_image_mean -backend=lmdb ~/flowers/input/train_lmdb ~/flowers/input/mean.binaryproto
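To use the mean image later (for example, at prediction time), it can be read back in Python and subtracted from each input. A minimal sketch, assuming the paths above:

import os
import numpy as np
import caffe
from caffe.proto import caffe_pb2

# Parse the mean image written by compute_image_mean
blob = caffe_pb2.BlobProto()
with open(os.path.expanduser('~/flowers/input/mean.binaryproto'), 'rb') as f:
    blob.ParseFromString(f.read())
mean = caffe.io.blobproto_to_array(blob)[0]  # shape: (channels, height, width)

# Subtracting the mean from an input image (same C x H x W layout)
# gives the zero-mean input the network expects; `img` is a dummy stand-in.
img = np.random.rand(*mean.shape) * 255
zero_mean_img = img - mean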
Step 2: Model definition
Defining the model of the network is where the magic happens. There are many decisions and choices we can make. Once we have decided on an architecture, we define it in caffenet_train_val.prototxt.
The Caffe library offers some popular CNN models, including AlexNet and GoogLeNet. We are going to use a model that is similar to AlexNet, known as the BVLC reference model.
The model definition file that you downloaded includes some specific changes to the original BVLC model. You may need to update them for your system:
First, we need to update the paths to our data. You can do that on lines 24, 40, and 51.
Second, we need to define the number of outputs. The original reference file has 1000 classes; in our case, we only have 5 flower classes (see line 373).
Note: Sometimes it’s useful to visually inspect the architecture of our network. Caffe provides a tool (python/draw_net.py) to do just that. Assuming the model definition lives at ~/flowers/caffe_model/caffenet_train_val.prototxt, you can run:
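python ~/caffe/python/draw_net.py ~/flowers/caffe_model/caffenet_train_val.prototxt ~/flowers/caffe_model/architecture.png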
The resulting image architecture.png is a graphical representation of the specified network.
Step 3: Solver definition
Once we’ve designed a network architecture, we also need to make decisions regarding the parameter optimization process. These are stored in the solver definition file (solver.prototxt).
Here are some choices we can make:
- Use the validation set every 1,000 iterations to measure the current model accuracy
- Allow optimization to run for 40,000 iterations before terminating
- Save a snapshot of the trained model parameters every 5,000 iterations
- Hyper-parameters: we are going to tune our optimization such that the initial learning rate is 0.001 (base_lr). The learning rate drops by a factor of 10 (gamma) every 2,500 iterations (stepsize):
- base_lr = 0.001
- lr_policy = “step”, stepsize = 2500
- gamma = 0.1
- We also need to define other parameters, such as momentum and weight_decay (see the sketch below)
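Putting these choices together, solver.prototxt might look like the following sketch. The values above come straight from this tutorial; test_iter, momentum, weight_decay, solver_mode, and the net path are typical values I’m assuming here:

net: "/path/to/flowers/caffe_model/caffenet_train_val.prototxt"  # absolute path to the model definition
test_iter: 100        # assumed number of validation batches per test pass
test_interval: 1000   # validate every 1,000 iterations
base_lr: 0.001        # initial learning rate
lr_policy: "step"
gamma: 0.1            # multiply the learning rate by 0.1...
stepsize: 2500        # ...every 2,500 iterations
max_iter: 40000       # stop after 40,000 iterations
momentum: 0.9         # assumed typical value
weight_decay: 0.0005  # assumed typical value
snapshot: 5000        # save a snapshot every 5,000 iterations
snapshot_prefix: "/path/to/flowers/caffe_model/snapshot"
solver_mode: GPU      # or CPU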
To learn more about the various strategies for the optimization process, check out Caffe’s documentation here.
Steps 4 & 5: Training and Testing
We are now ready to train the network. The following command will start training our network and store the training log as model_train.log:
~/caffe/build/tools/caffe train \
    --solver ~/flowers/caffe_model/solver.prototxt \
    2>&1 | tee ~/flowers/caffe_model/model_train.log
As training proceeds, Caffe reports the training loss and model accuracy. We can stop the process at any point and use the parameters in the most recent snapshot; with our settings, a snapshot is saved every 5,000 iterations.
To analyze the performance of our network architecture we can plot the learning results. We can quickly see how long it takes for our validation accuracy to plateau.
Here’s how you create the plot, followed by an example plot for our flower classifier:
python ~/flowers/code/plot_learning_curve.py ~/flowers/caffe_model/model_train.log ~/flowers/caffe_model/learning_curve.png
You can see that learning plateaus after about 3,000 iterations at about 80% accuracy. This is actually not a terrible result: it represents a 4x improvement over random guessing, given that we have 5 classes. But we can do better!
Transfer Learning
Transfer learning is a very powerful method. The intuition is that what we learn for one task can be helpful for other tasks. For instance, once a baby learns how to open a door, learning to open a window happens much faster because of the similarities.
The same applies to visual tasks. Many tasks require extracting features or structure from an image: think corners, edges, or even more complex hand-crafted features such as SIFT. A network can learn such features using data from one task and then reuse them for another. Intuitively, all we have to do is replace the last few layers of the network and re-train them for the new task.
Here’s how you train a network starting from pre-existing weights (assuming the BVLC reference weights have been downloaded to ~/flowers/caffe_model/bvlc_reference_caffenet.caffemodel):
~/caffe/build/tools/caffe train \
    --solver ~/flowers/caffe_model/solver.prototxt \
    --weights ~/flowers/caffe_model/bvlc_reference_caffenet.caffemodel \
    2>&1 | tee ~/flowers/caffe_model/model_train.log
The BVLC reference file contains the weights of the pre-trained BVLC network. We are simply asking Caffe to start from those weights and retrain the last two layers (the fully connected ones) for our task.
Note: Caffe copies pre-trained weights only into layers whose names match the reference model. So, to retrain the last two layers while keeping the weights of the earlier layers, we give the fully connected layers new names in the configuration file.
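For example, the final fully-connected layer might be redefined like this (a sketch; in the BVLC reference model the original layer is called fc8):

layer {
  name: "fc8_flowers"   # new name, so Caffe does NOT copy the pre-trained weights
  type: "InnerProduct"
  bottom: "fc7"
  top: "fc8_flowers"
  inner_product_param {
    num_output: 5       # 5 flower classes instead of the original 1000
  }
}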
And here’s the performance plot for transfer learning:
The plot shows an impressive improvement in test accuracy: from 80% to about 90%. Interestingly, we reach this level of performance after only 1,000 iterations. This shows the power of transfer learning.
Finally, it is plausible that with more data we would see even more improvement. And that, unfortunately, is often the bottleneck for deep learning: getting enough data to train the network.