Reinventing Food

The world's population is growing faster than ever. Scientists predict we will cross 9 billion people this century. And while there are arguments about when that growth will stop and how quickly, one thing is clear: we are going to run out of food soon.

Recent research proposes that in order to accommodate such a large population, we are going to have to change our diet. Specifically, move away from meat and consume mostly (if not exclusively) fruits and vegetables.

Meat consumption has dire effects on the environment. It is expensive to produce and raises moral concerns, but primarily it is extremely inefficient. An average cow will consume 30 times more calories in its lifetime than it will provide. The ratio is much smaller (about 5 to 1) for poultry. And of course, the most efficient option is to consume plant-based calories.

The worldwide trend is clear: as countries get wealthier, their meat consumption increases. Together with the fact that the world’s population is larger than ever and growing, we are heading towards an environmental crisis and likely famine.

What can we do? Other than switching to a vegetarian diet, we also need to start thinking about optimizing our farming. This is where robotics, computer vision and machine learning can help. For example, fully automated farms controlled by algorithms can be extremely efficient, minimizing waste and space, and therefore cost and environmental footprint.


Reconstructing a Room from a Single Image

I’ve been involved in the current generation of Virtual Reality from early on. You can read more about that on my website [Dov Katz]. There are many exciting problems related to VR and computer vision, but one that has always fascinated me is reconstructing a room from a single image. All in all, humans seem to be pretty good at it. We are good at estimating the true structure of a room because we have so much experience with rooms and the objects in them. Now, can machines do the same?

In a recent paper titled “LayoutNet: Reconstructing the 3D Room Layout from a Single RGB Image” (you can read it here), the authors propose a deep learning architecture, accompanied by a few computer vision techniques, to reconstruct a room’s boundaries from a single image.

The idea is to first align the image with gravity, so that floors are parallel to the ground plane and walls are perpendicular to it. This reduces the complexity associated with tilted views. Then, Manhattan lines are computed and, together with the image, fed into a CNN. The network returns corners and boundaries, which are the building blocks of a 3D model of the room.

The final step is fitting surfaces to satisfy the corners and edges the network suggested. This completes a 3D model of the room. The results are actually quite good. Check out the video above to see some examples.

6D Pose Estimation using CNN

Tracking the 6D pose of an object is an important task with many applications. For example, it’s a key component for AR, VR and many robotics systems.

There are many approaches to this problem. Some assume you know something about the size of the object and can track features on it. Others work when the tracked object is a moving camera (i.e., SLAM). These techniques all rely, directly or indirectly, on the principle of stereoscopic vision: they relate multiple views to compute depth. The main issues with such methods are weakly textured objects, low-resolution video and low-light conditions.

An alternative approach is using depth sensing data acquired by RGB-D cameras. Such solutions are quite reliable but expensive in power, cost, weight, etc. This makes them less than optimal for mobile devices.

People, however, seem able to determine the pose of an object effortlessly from a single image, with no depth sensor. This is likely because we have vast knowledge of object sizes and appearances. Which, of course, raises the question: can we estimate pose using machine learning?

In a recent paper, the authors propose an extension of YOLO to 3D. YOLO is a deep learning algorithm that detects objects by determining the correct 2D bounding box. Extending YOLO is therefore pretty straightforward: we want to learn the 2D corners of a projected 3D bounding box. If we can do that, reconstructing the 3D pose of the bounding box is simple.

The input to the system is a single RGB image. The output is the object’s 6D pose. The proposed solution runs at an impressive 50 frames per second, which makes it suitable for real-time object tracking. The key component of this system is a new CNN architecture that directly predicts the 2D image locations of the projected vertices of the object’s 3D bounding box. The object’s 6D pose is then estimated using a PnP algorithm.
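The PnP step recovers the pose that best explains the predicted 2D corners, by inverting the pinhole projection model (this is what a solver such as OpenCV's cv2.solvePnP does). Here is a rough NumPy sketch of that forward model; the intrinsics K, the pose R, t and the box size are made-up values for illustration:

```python
import numpy as np

def project_box_corners(K, R, t, size):
    """Project the 8 corners of an axis-aligned 3D box (centered at the
    origin, edge lengths in `size`) into the image with a pinhole model."""
    sx, sy, sz = size
    # The 8 corners of the box in object coordinates
    corners = np.array([[x, y, z]
                        for x in (-sx / 2, sx / 2)
                        for y in (-sy / 2, sy / 2)
                        for z in (-sz / 2, sz / 2)])
    cam = corners @ R.T + t        # object -> camera coordinates
    uv = cam @ K.T                 # camera -> homogeneous pixel coordinates
    return uv[:, :2] / uv[:, 2:3]  # perspective divide

K = np.array([[500., 0., 320.],    # hypothetical camera intrinsics
              [0., 500., 240.],
              [0., 0., 1.]])
R = np.eye(3)                      # identity rotation (assumption)
t = np.array([0., 0., 2.])         # box 2 meters in front of the camera
pts = project_box_corners(K, R, t, (0.2, 0.2, 0.2))  # 8 corners, 2D each
```

A PnP solver does the reverse: given the eight 2D points and the known 3D box, it searches for the R and t that reproduce them.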

The authors show some impressive results. Here’s an image showing the computed bounding box for a few objects:


Experiencing real estate in 3D

When my wife Gili Katz and I were looking for a house, we would have loved the option to check out a few listings in VR. You can visit my website (Dov Katz) for more info about Virtual Reality. Here are some recent developments that make virtual real estate shopping a possibility:


Shopping for real estate can be quite frustrating. You need to search ads for what seems like the right place, then contact agents, schedule appointments, drive over, and finally walk through the house for a few minutes. If that doesn’t seem very efficient, it’s because it isn’t.

Fortunately, computer vision and Virtual Reality can save the day. Recently, a YC-backed startup called “Send Reality” developed a system that offers a full 3D model of a real estate listing. With Send Reality, you can simply walk through a house in Virtual Reality.

Send Reality sends photographers out to the listed property. The photographer only needs to carry an iPad fitted with an off-the-shelf depth sensor. Send Reality’s application does the rest. The photographer walks through the property, snapping hundreds of thousands of photos. Next, the software stitches those photos together to create a complete 3D model of the property.

Send Reality claims to do the stitching particularly fast. They claim that “… what this means is that the 3D models we create are so much more realistic than anything else anyone else has made”. I’m not sure why the speed of stitching has anything to do with the quality of the model. However, being able to handle hundreds of thousands of photos certainly helps in creating a detailed 3D model.

The company is currently focusing on luxury residential markets. Websites listing such properties can now include a 3D tour of the properties. Early results indicate people spend much more time viewing a property when a 3D virtual tour is available. The cost of creating a 3D virtual tour of a single property is about $500.

While this technology is exciting, it seems like a problem that is already being solved by Virtual Reality hardware manufacturers. For instance, Oculus’ Go device tracks the environment in much the same way Send Reality does to build a 3D model. It wouldn’t be a big step to add the automatic construction of such models to the device’s tracking system. And, of course, the potential of scanning and sharing 3D environments goes beyond real estate.

Semantic Adversarial Examples

Deep Neural Networks have enabled tremendous progress in computer vision. They help us solve many detection, recognition and classification tasks that seemed out of reach not too long ago. However, DNNs are known to be vulnerable to adversarial examples. That is, one can tweak the input to mislead the network. For example, it is well documented that small perturbations can dramatically increase prediction errors. Of course, the resulting image looks like it was artificially manipulated, and the attack can be mitigated with de-noising techniques.

In this paper, the authors propose a new class of adversarial examples. The approach is quite simple: convert the image from RGB to HSV (Hue, Saturation, Value), then randomly shift hue and saturation while keeping the value fixed.
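In code, the manipulation is just a per-pixel change in HSV space. Here is a minimal single-pixel sketch using Python's standard colorsys module; the shift amounts dh and ds are my stand-ins for the paper's random draws, which are applied once per image:

```python
import colorsys
import random

def shift_hue_sat(rgb, dh, ds):
    """Shift hue by dh (wrapping around) and scale saturation by ds,
    keeping value fixed, for a single RGB pixel in [0, 1]."""
    h, s, v = colorsys.rgb_to_hsv(*rgb)
    h = (h + dh) % 1.0                 # hue is periodic
    s = min(max(s * ds, 0.0), 1.0)     # clamp saturation to [0, 1]
    return colorsys.hsv_to_rgb(h, s, v)

random.seed(0)
dh, ds = random.random(), random.random()   # one shift per image
pixel = shift_hue_sat((0.8, 0.2, 0.2), dh, ds)
```

Because the value channel is untouched, the brightest component of the pixel stays the same; only its color changes.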

Here’s an example of manipulating Hue and Saturation for a single image:


The resulting image looks very much like the original to the human visual system. At the same time, it has a dramatic negative impact on a pre-trained network. In this paper, the CIFAR10 dataset was manipulated as described and then tested on a VGG16 network. The resulting accuracy dropped from ~93% to about 6%.

Here are some classification results. You can clearly see how the change in hue and saturation throws off the network and leads to pretty much random classification:


Deep Learning: building a flower classifier

Deep learning is a development on top of the classical neural network architecture that was all the rage in the 80s. Following a long AI winter, deep learning has emerged as a successful technique. In particular, it has already shown impressive results in the domain of computer vision.

Deep learning, as the name implies, is a neural network architecture composed of multiple layers. Advances in compute and the availability of large amounts of data via the internet are the key enablers. To appreciate deep learning, you really should pick up a book or take a class. This tutorial is more practical: I’ll show you how to build a classifier for 5 flower types: daisy, rose, sunflower, dandelion and tulip. We are going to use a dataset of flower images that is available for download on Kaggle, and build a convolutional neural network using a popular open-source framework called Caffe.

Most of this post is about building an image classifier from scratch. In the last part, I’ll present transfer learning, an approach that lets us take advantage of a pre-trained network. For another example of the power of transfer learning, check out my 2008 RSS paper about Relational Reinforcement Learning.

Typically, we would like a few thousand samples per class. Here we only have 800 images per class. Still, our classifier will achieve about 80% accuracy on the test dataset for the 5 flower classes. This can certainly be improved with a more complex network architecture and more data. Nevertheless, 80% is 4 times better than random guessing.

Here are some examples of the 5 types of flowers:

As you can see, some flowers look quite similar. Also, some images are better than others for the purpose of learning the appearance of each flower type.

CNN: Convolutional Neural Network

Convolutional neural networks are particularly well suited for image processing tasks because they are modeled after the visual cortex. A CNN is composed of convolutional layers and pooling layers that enable the network to respond to specific image properties. These properties are very helpful in visual recognition and image understanding tasks.

Here’s an example of a CNN called LeNet (proposed by LeCun in 1998):

Convolution Layer

A convolution layer is a set of filters that perform a convolution operation with an image. That is, we slide each convolution kernel over the image, computing the dot product between the filter and the input. The filter’s depth is the same as the input’s. For instance, a color image has 3 channels, so our convolution kernel will be of size k×k×3.

A filter is activated (i.e., generates a maximal response) when it interacts with a specific structure in the image. Effectively, this response acts as a detector for a certain type of structure or property. For example, a convolution kernel might respond strongly to the presence of an edge in the image. And, deeper in the network, a kernel might respond to the presence of a face in the image.
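As a concrete illustration of that edge-detecting behavior, here is a minimal NumPy sketch of the sliding-window operation a convolution layer performs (strictly speaking, cross-correlation, which is what CNN frameworks compute). The image and the vertical-edge kernel are toy examples, not taken from any trained network:

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' cross-correlation: slide the kernel over the image and take
    the dot product at each position."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A step image: dark on the left, bright on the right
image = np.zeros((5, 6))
image[:, 3:] = 1.0

# A simple vertical-edge kernel responds strongly at the step
edge_kernel = np.array([[-1., 0., 1.],
                        [-1., 0., 1.],
                        [-1., 0., 1.]])
response = conv2d(image, edge_kernel)
```

The response is largest exactly where the dark-to-bright transition sits, which is the "detector" behavior described above.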

Pooling Layer

A pooling layer introduces non-linearity: it essentially down-samples the output of the convolution layer. As you can see in the above image of LeNet, the output gets progressively smaller with every pooling layer.

Pooling reduces the number of parameters in the network. As a result, not only computation time is reduced, but also the risk of over-fitting is reduced.

There are multiple functions that implement pooling. Probably the most common is max pooling.

The image below shows an example of a pooling kernel of size 2×2. Pooling is applied at every depth slice. If we use such a kernel with stride 2, the input image is reduced to a quarter of its original size:

This is an example of max pooling. You can see that effectively each 2×2 region in the original image is reduced to a single cell, with the value being the maximum of the values in the 2×2 block.
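The max-pooling operation described above can be sketched in a few lines of NumPy; the 4×4 input below is a made-up example:

```python
import numpy as np

def max_pool(x, k=2, stride=2):
    """Max pooling over k x k blocks (non-overlapping here, since stride == k)."""
    h, w = x.shape
    out = np.zeros((h // stride, w // stride))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = x[i * stride:i * stride + k,
                          j * stride:j * stride + k].max()
    return out

x = np.array([[1., 3., 2., 1.],
              [4., 6., 5., 0.],
              [7., 2., 9., 8.],
              [0., 1., 3., 4.]])
pooled = max_pool(x)   # 4x4 -> 2x2, a quarter of the original size
```

Each 2×2 block collapses to its maximum value, so `pooled` is `[[6, 5], [7, 9]]`.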

Putting It All Together

The architecture of a CNN is relatively simple. We start with an input layer (images) and go through a sequence of convolution layers and pooling layers. At every level the network reduces the dimensionality of the input while increasing the generalization. At the end of the network we use fully-connected layers. These layers are essentially a classifier, while the previous layers are feature extractors.
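One practical consequence of this stacking is that the spatial size shrinks predictably from layer to layer. Here is a small sketch of the standard output-size formula, applied to a LeNet-style stack on a 32×32 input (the layer choices are illustrative):

```python
def out_size(n, k, stride=1, pad=0):
    """Spatial output size of a conv/pool layer on an n x n input."""
    return (n - k + 2 * pad) // stride + 1

n = 32
n = out_size(n, k=5)             # conv 5x5        -> 28
n = out_size(n, k=2, stride=2)   # pool 2x2, /2    -> 14
n = out_size(n, k=5)             # conv 5x5        -> 10
n = out_size(n, k=2, stride=2)   # pool 2x2, /2    -> 5
```

By the time the fully-connected layers take over, each feature map is only 5×5: small enough to flatten and classify.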

Two important notes:

  1. Convolution layers are typically followed by a ReLU (non-linear) activation function
  2. Only the convolution and fully-connected layers have weights; these weights are what we learn during training

Flower Classifier

Now we’re ready to implement our CNN based flower classifier. As mentioned above, we are going to use a dataset from Kaggle. For simplicity, I already packaged the data and code snippets. You can download it here.

The images in the dataset are split into training images and test images. We are going to train the network using the training images. This step includes validation. Then, we will use the test images to evaluate the performance of our network.

The CNN implementation proposed here requires Caffe and Python.
Please make sure both are installed on your system.

Generally speaking, there are 5 steps in training a CNN:

  1. Data preparation: basic image processing, computing a mean image, and storing the images in a location and format accessible for Caffe
  2. Model definition: choose an architecture for our network
  3. Solver definition: choose the solver parameters
  4. Training: train the model using Caffe
  5. Test: use the results of training in run-time / with test data

Let’s dive deeper into each step:

Step 1: Data Preparation

Assuming you’ve downloaded the project, go into the code folder.
In the code snippets below, I’m assuming Caffe is installed at ~/caffe and the project is at ~/flowers. It’s straightforward to use different paths.

We are now going to use the script to create the LMDB databases. The script does the following:

  1. Histogram equalization on all training images
  2. Resizes images to a fixed size
  3. Creates a validation set (5/6 of the data for training, 1/6 for validation)
  4. Stores the images in an LMDB database (one for training, one for validation)

After running the script, we have one last task in preparing the data for training: creating an average image over all training data. The idea is to subtract that average image from each input image, yielding zero-mean images.

~/caffe/build/tools/compute_image_mean -backend=lmdb ~/flowers/input/train_lmdb ~/flowers/input/mean.binaryproto
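Conceptually, the mean subtraction this enables looks like the following NumPy sketch (the random arrays are stand-ins for the real training images; in Caffe itself this happens via the mean file referenced in the data layer):

```python
import numpy as np

# Stand-in for the training set: 10 tiny 3-channel "images" (assumption)
rng = np.random.default_rng(0)
images = rng.random((10, 3, 8, 8))

mean_image = images.mean(axis=0)   # what compute_image_mean produces
centered = images - mean_image     # zero-mean inputs for the network
```

After subtraction, the per-pixel average over the training set is zero, which tends to make optimization better behaved.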

Step 2: Model definition

Defining the network model is where the magic happens. There are many decisions and choices to make. After we decide on an architecture, we define it in caffenet_train_val.prototxt.

The Caffe library offers some popular CNN models, including AlexNet and GoogLeNet. We are going to use a model that is similar to AlexNet, called the BVLC reference model.

The model definition file that you downloaded includes some specific changes to the original BVLC model. You may need to update them for your system:

First, we need to change the path for our data. You can do that in lines 24,40 and 51.

Second, we need to define the number of outputs. In the original reference file there were 1000 classes; in our case, we only have 5 flower classes (see line 373).
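For reference, the relevant part of the layer definition looks roughly like this (layer names may differ in your file; the field that matters is num_output):

```protobuf
# Last fully-connected layer (around line 373 in the prototxt)
layer {
  name: "fc8"
  type: "InnerProduct"
  bottom: "fc7"
  top: "fc8"
  inner_product_param {
    num_output: 5   # was 1000 in the original reference model
  }
}
```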

Note: Sometimes it’s useful to visually inspect the architecture of our network. Caffe provides a tool to do just that. You can run:

python ~/caffe/python/ 

The resulting image architecture.png is a graphical representation of the specified network.

Step 3: Solver definition

Once we’ve designed a network architecture, we also need to make decisions regarding the parameter optimization process. These are stored in the solver definition file (solver.prototxt).

Here are some choices we can make:

  • Use the validation set every 1000 iterations to measure the current model’s accuracy
  • Allow optimization to run for 40,000 iterations before terminating
  • Save a snapshot of the trained model parameters every 5,000 iterations
  • Hyper-parameters: we are going to tune our optimization such that the initial learning rate is 0.001 (base_lr). The learning rate drops by a factor of 10 (gamma) every 2,500 iterations (stepsize):
    • base_lr = 0.001
    • lr_policy = “step”, stepsize = 2500
    • gamma = 0.1
  • We also need to define other parameters such as momentum and weight_decay
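Putting the hyper-parameters above together, the “step” policy makes the learning rate a simple function of the iteration number. A quick sketch:

```python
def step_lr(iteration, base_lr=0.001, gamma=0.1, stepsize=2500):
    """Learning rate under Caffe's "step" policy:
    lr = base_lr * gamma ^ floor(iteration / stepsize)"""
    return base_lr * gamma ** (iteration // stepsize)

# The learning rate drops by a factor of 10 every 2500 iterations
rates = [step_lr(i) for i in (0, 2499, 2500, 5000)]
```

So training starts at 0.001, drops to 0.0001 at iteration 2500, to 0.00001 at iteration 5000, and so on.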

To learn more about various strategies for the optimization process, check out Caffe’s documentation here.

Steps 4 & 5: Training and Testing

We are now ready to train the network. The following command will start training our network and store the training log as model_train.log:

~/caffe/build/tools/caffe train 
   --solver ~/flowers/caffe_model/solver.prototxt 2>&1 
   | tee ~/flowers/caffe_model/model_train.log

As training proceeds, Caffe will report the training loss and model accuracy. We can stop the process at any point and use the parameters in the most recent snapshot. In our settings, a snapshot is saved every 5,000 iterations.

To analyze the performance of our network architecture we can plot the learning results. We can quickly see how long it takes for our validation accuracy to plateau.

Here’s how you create the plot, followed by an example plot for our flower classifier:

python ~/flowers/code/ ~/flowers/caffe_model/model_train.log ~/flowers/caffe_model/learning_curve.png


You can see that after about 3,000 iterations, learning plateaus at about 80% accuracy. This is actually not a terrible result: it represents a 4× improvement over random guessing, given that we have 5 classes. But we can do better!

Transfer Learning

Transfer learning is a very powerful method. The intuition is that what we learn for one task could be helpful for other tasks. For instance, once a baby learns how to open a door, learning to open a window can happen much faster because of the similarities.

The same applies to visual tasks. Many tasks require extracting features or structure in an image. Think about corners, edges, or even more complex hand-crafted features such as SIFT. A network can learn those features using data for one task, then reuse them for another task. Intuitively, all we have to do is replace the last few layers of the network and re-train it for the specific task.

Here’s how you train a network with pre-existing weights:

~/caffe/build/tools/caffe train 
   --solver ~/flowers/caffe_model/solver.prototxt 
   --weights ~/flowers/caffe_model/bvlc_reference_caffenet.caffemodel 
   2>&1 | tee ~/flowers/caffe_model/model_train.log

The BVLC reference file contains the weights of the pre-trained BVLC network. We are simply asking Caffe to retrain the last two layers (the fully connected ones) for our task.

Note: To indicate that we want to retrain the last two layers while keeping the weights for the earlier layers, we need to make sure the fully connected layers get a new name in the configuration file.
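Caffe copies weights from the .caffemodel by matching layer names, so a renamed layer is initialized from scratch instead. Roughly (the name fc8_flowers is my own choice; any new name works):

```protobuf
# Renamed from "fc8": no name match in the .caffemodel, so this layer
# gets fresh weights and is trained for our 5 flower classes.
layer {
  name: "fc8_flowers"
  type: "InnerProduct"
  bottom: "fc7"
  top: "fc8_flowers"
  inner_product_param {
    num_output: 5
  }
}
```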

And here’s the performance plot for transfer learning:


The plot shows an impressive improvement in test accuracy: from 80% to about 90%. Interestingly, we reach this level of performance after only 1,000 iterations. This shows the power of transfer learning.

Finally, it is plausible that with more data we’d get even more improvement. And that, unfortunately, is often the bottleneck for deep learning: getting enough data to train the network.

Vision in Agriculture

There are so many cool areas where computer vision can make a difference. We typically think about autonomous cars, robotic manipulators, or face recognition. But there are so many other areas. One of them is agriculture. Computer vision can make certain tasks easier, more efficient and more accurate. This could lead to cheaper and more available food supply.

Here’s a short review of a paper by Oppenheim et al. from Ben-Gurion University in Israel. You can find the publication here.

Paper overview

Oppenheim et al. write about using computer vision techniques to detect and count yellow tomato flowers in a greenhouse. This work is more practical than academic. They develop algorithms that can handle real-world conditions such as uneven illumination, complex growth conditions and different flower sizes.

Interestingly, the proposed solution is designed to run on a drone. This has significant implications for SWaP-C: size, weight, power and cost. It also greatly affects what is computationally feasible.

The proposed algorithm begins with computing the lighting conditions and transforming the RGB image into the HSV color space. Then, the image is segmented into background and foreground using color cues, and finally a simple classification is performed on the foreground patches to determine whether they are flowers or not.

Converting the image to HSV is straightforward. Computing the lighting conditions is about figuring out whether the image is dark or bright. The authors use two indicators: the median value of saturation (the S channel) and the skew of the saturation histogram.
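A minimal NumPy sketch of these two indicators (the dark/bright thresholds themselves are not reproduced here, and the random image is a stand-in for a real photo):

```python
import numpy as np

def lighting_indicators(hsv):
    """Median saturation and saturation-histogram skewness for an HSV image.
    (A sketch of the two indicators only; how they map to "dark" vs
    "bright" would have to come from the paper.)"""
    s = hsv[:, :, 1].ravel()
    median_s = np.median(s)
    # Sample skewness: E[(s - mean)^3] / std^3
    skew = np.mean((s - s.mean()) ** 3) / (s.std() ** 3 + 1e-12)
    return median_s, skew

rng = np.random.default_rng(0)
hsv = rng.random((4, 4, 3))   # stand-in HSV image with channels in [0, 1]
median_s, skew = lighting_indicators(hsv)
```

A low median saturation with a strongly skewed histogram suggests washed-out, bright lighting; the segmentation thresholds are then adjusted accordingly.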

Here’s how the flowers are imaged from different perspectives by the drone:


During segmentation, and after considering the lighting conditions, the paper simply suggests thresholding yellow pixels. It helps, of course, that the flowers’ yellow color is very distinguishable. The idea is to keep pixels that are yellow with low saturation, while removing very bright pixels from consideration.
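A rough NumPy sketch of such a threshold; the hue band and limits below are my guesses at plausible values, not the paper's:

```python
import numpy as np

# Hypothetical "yellow" band (OpenCV-style hue in [0, 180)) and limits --
# the paper's actual, lighting-dependent thresholds may differ.
H_MIN, H_MAX = 20, 35
S_MIN = 40    # low threshold, so weakly saturated yellows survive
V_MAX = 250   # drop near-saturated, very bright pixels

def yellow_mask(hsv):
    """Boolean mask of pixels considered yellow-flower candidates."""
    h, s, v = hsv[..., 0], hsv[..., 1], hsv[..., 2]
    return (h >= H_MIN) & (h <= H_MAX) & (s >= S_MIN) & (v <= V_MAX)

hsv = np.zeros((2, 2, 3), dtype=np.uint8)
hsv[0, 0] = (25, 120, 200)   # a yellow-ish pixel
hsv[1, 1] = (100, 120, 200)  # a green-ish pixel, not yellow
mask = yellow_mask(hsv)
```

The resulting binary mask is what the morphological clean-up in the next step operates on.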

Next, morphological operations are performed to eliminate small disconnected regions and “close” holes in otherwise large segments. This creates more coherent image patches in which the algorithm believes it detected a flower.

The last step is classification. The algorithm goes over all connected components / image patches it extracted during segmentation and cleaned up with morphological operations. Small connected components are discarded. The remaining connected components are considered good exemplars of yellow tomato flowers.

Here’s the algorithm’s performance according to the paper:


The plots show that the algorithm performs best with a front view of the flower. That’s not surprising, as this is the clearest perspective.


I like the fact that this work tackles a practical problem. Solving it has clear applications. What I found a bit missing is a discussion of the more challenging cases of flower arrangement (e.g., flowers overlapping in the image). In addition, I’d be curious to know how this method compares to a machine learning approach that learns a model of the tomato flower from examples.

All images and data are from the paper.
I highly recommend reading it for a more thorough understanding of the work.