3D Sensing

Image processing was for a long time focused on analyzing individual images. With early successes and more compute power, researchers began looking into video and 3D reconstruction of the world. This post will focus on 3D. There are many ways to get depth from visual data; I’ll discuss the most popular categories.

There are many dimensions along which to compare 3D sensing designs. These include compute, power consumption, manufacturing cost, accuracy and precision, weight, and whether the system is passive or active. For each design choice below, we’ll discuss these dimensions.

Prior Knowledge

The most basic technique for depth reconstruction is having prior knowledge. In a previous post (Position Tracking for Virtual Reality), I showed how this can be done for a Virtual Reality headset. The main idea is simple: if we know the shape of an object, we know what its projected image will look like at a fixed distance. Therefore, if we have the projected image of a known object, it’s straightforward to determine its distance from the camera.
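
As a rough sketch of the idea, here is what the math looks like under a simple pinhole camera model (the focal length and object size below are made-up example values, not from any real device):

# Distance to an object of known size under a pinhole camera model (illustrative sketch).
def distance_from_known_size(object_height_m, image_height_px, focal_length_px):
    # Pinhole projection: image_height_px = focal_length_px * object_height_m / distance_m
    return focal_length_px * object_height_m / image_height_px

# Example: a 0.2 m tall object that appears 50 px tall with a 500 px focal length
print(distance_from_known_size(0.2, 50, 500))  # -> 2.0 meters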

This technique might not seem very important at first glance. However, keep in mind how good people are at estimating distance with one eye shut. The reason for that is our incredible amount of prior knowledge about the world around us. If we are building robots that perform many tasks, it’s reasonable to expect them to build models of the world over time. These models would make 3D vision easier than starting from scratch every time.

But, what if we don’t have any prior knowledge? Following are several techniques and sensor categories that are suitable for unknown objects.

Stereo

Stereo vision uses two cameras to reconstruct depth. The key is knowing the calibration of each camera (intrinsics) and the relative pose between the cameras (extrinsics). With calibration, when we view a point in both cameras, we can determine its distance using simple geometry.

Here’s an illustration of imaging a point (x,y,z) by two cameras with a known baseline:

[Figure: a point (x,y,z) imaged by two cameras with a known baseline]

We don’t know the coordinate z of the point, but we do know where it gets projected on the plane of each camera. This gives us a ray starting at the plane of each camera with known angles. We also know the distance between the two projection points. What remains is determining where the two rays intersect.

It’s worth noting that to compute each ray’s angle we need to know the intrinsics of the camera, specifically the focal length.
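
For a rectified stereo pair, the geometry reduces to a single formula: depth equals focal length times baseline divided by disparity. A minimal sketch (the numbers are illustrative, not from a real rig):

# Depth from disparity for a rectified stereo pair (illustrative sketch).
def depth_from_disparity(focal_length_px, baseline_m, disparity_px):
    # Similar triangles: z = f * B / d
    return focal_length_px * baseline_m / disparity_px

# Example: f = 700 px, baseline = 0.06 m, disparity = 21 px -> z = 2.0 m
print(depth_from_disparity(700.0, 0.06, 21.0))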

Stereo vision is intuitive and simple. Image sensors are getting cheaper while resolution keeps going up. Stereo vision also has a lot in common with human vision, which means such systems tend to see what people expect them to see. And finally, stereo systems are passive (they don’t emit a signal into the environment), which makes them easier to build and energy efficient.

But stereo isn’t a silver bullet. First, matching points/features between the two views can be computationally expensive and not very reliable. Second, some surfaces either don’t have features (white walls, a shiny silver spoon, …) or have many features that are indistinguishable from each other. And finally, stereo doesn’t work when it’s too dark or too bright.

Structure from Motion

Structure from motion replaces the fixed second camera in the stereo design with a single moving camera. The idea is that as the camera moves around the world, we can use views from two different perspectives to reconstruct depth much like we did in the stereo case.

[Figure: a moving camera observing the same 6 points from 3 different positions]

Now’s a great time to say: “wait what?! didn’t you say stereo requires knowledge of the extrinsics?”. Great question! Structure from motion is more difficult in that we not only want to determine the depth of a point, but we also need to compute the camera motion. And no, it’s not an impossible problem to solve.

The key realization is that as the camera moves, we can try to match many points between two views, not just one. We can assume that most points are part of the static environment (ie. also not moving). Our camera motion estimate must therefore provide a consistent explanation for all points. Furthermore, with structure from motion we can easily obtain multiple perspectives (not just two like stereo).

Structure from motion can be understood as a large optimization problem where we have N points in k frames and we’re trying to determine 6*(k-1) camera motion parameters and N depth values for our points from N*k observations. It’s easy to see how, with enough points, this problem quickly becomes overconstrained and therefore very robust. For example, in the above image N=6 and k=3: we need to solve for 18 unknowns from 18 observations.

The key advantages of structure from motion vs. stereo include simple hardware (just one camera), simple calibration (we only need intrinsics), and simple point matching (because we can rely on tracking frame-to-frame).

There are two main disadvantages. First, structure from motion is more complex mathematically. Second, structure from motion provides a 3D reconstruction only up to a scaling factor. That is, we can always multiply the depth of all our points by some number, similarly scale the camera’s translation, and everything remains consistent. In stereo, this scaling ambiguity is resolved through calibration: we know the baseline between the two cameras.

Structure from motion is typically solved as either an optimization problem (eg. bundle adjustment) or a filtering problem (eg. an extended Kalman filter). It’s also quite common to mix the two: run bundle adjustment at a lower frequency and the Kalman filter at the (higher) measurement frequency.
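
To make the two-view building block concrete, here is a minimal sketch using OpenCV. It assumes we already have matched pixel coordinates pts1/pts2 and a calibrated intrinsics matrix K; a full pipeline would add frame-to-frame tracking and bundle adjustment on top of this:

import cv2
import numpy as np

def two_view_reconstruction(pts1, pts2, K):
    # pts1, pts2: Nx2 arrays of matched pixel coordinates in two frames (assumed given)
    # K: 3x3 camera intrinsics matrix
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    # Recover the relative camera motion; the translation t is only known up to scale
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)
    # Triangulate the points using the two projection matrices
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])
    pts4d = cv2.triangulatePoints(P1, P2, pts1.T.astype(float), pts2.T.astype(float))
    pts3d = (pts4d[:3] / pts4d[3]).T  # homogeneous -> Euclidean, up to global scale
    return R, t, pts3d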


3D Sensors

There are several technologies that try to measure depth directly. In contrast with the above methods, the goal is to recover depth from a single frame. Let’s discuss the most prominent technologies:

Active Stereo

We already know the advantages and disadvantages of stereoscopic vision. But what if our hardware could make some disadvantages go away?

In active stereo systems, such as Intel RealSense (https://realsense.intel.com/stereo/), a noisy infrared pattern is projected onto the environment. The system doesn’t know anything about the pattern, but its presence creates lots of features for stereo matching.

In addition, stereo matching is computed in hardware, making it faster and eliminating the processing burden on the host.

However, as its name implies, active stereo is an active sensor. Projecting the IR pattern requires additional power and imposes limitations on range. The addition of a projector also makes the system more complex and expensive to manufacture.

Structured Light

Structured light shares some similarities with active stereo. Here too a pattern is projected onto the scene, but this time the pattern is known. Knowing how the pattern deforms when it hits surfaces enables depth reconstruction.

Typical structured light sensors (eg. PrimeSense) project infrared light and have no impact on the visible range. That is, humans cannot see the pattern, and other computer vision tasks are not affected.

[Figure: a PrimeSense structured light sensor]

The advantages of structured light are that depth can be computed from a single frame using only one camera. The known pattern enables algorithmic shortcuts and results in accurate and precise depth reconstruction.

Structured light, however, has some disadvantages: the system complexity increases because of the projector, computation is expensive and typically requires hardware acceleration, and infrared interference is possible.

Time of Flight

Time of flight sensors derive their name from the physics behind the device: light travels at a known speed through a known medium (eg. air). The sensor emits a light beam and measures the time it takes for the light to return to the sensor from different surfaces in the scene.
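
The underlying arithmetic is simply distance = speed of light times the round-trip time, divided by two. A tiny sketch (the timing value is just an illustrative example):

# Time-of-flight distance: the beam travels to the surface and back.
SPEED_OF_LIGHT = 299_792_458.0  # m/s

def tof_distance(round_trip_seconds):
    return SPEED_OF_LIGHT * round_trip_seconds / 2.0

# A round trip of ~6.67 nanoseconds corresponds to roughly 1 meter
print(tof_distance(6.67e-9))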

Kinect One is an example of such a sensor:

[Figure: the Kinect One time of flight sensor]

Time of Flight sensors are more expensive to manufacture and require careful calibration. They also have a higher power consumption compared to the other technologies.

However, time of flight sensors do all depth computation on chip. That is, every pixel measures the time of flight and returns a computed depth. There is practically no computational impact on the host system.

Summary

We reviewed different technologies for depth reconstruction. It’s obvious that each has advantages and disadvantages. Ultimately, if you are designing a system requiring depth, you’re going to have to make the right trade-offs for your setup.

A typical set of parameters to consider when choosing your solution is SWaP-C (Size, Weight, Power and Cost). Early on, it’s often better to choose a simple HW solution even if it requires significant computation and power. As your algorithmic solution stabilizes, it’s easy to improve SWaP-C later on with dedicated hardware.

Where’s my point?

A recurring problem in computer vision is finding the precise location of an object in the image. In this post, I want to discuss specifically the problem of finding the location of a blob or a point in an image. This is what you get when a light source is imaged.

Finding the center of the blob is a question I love asking in job interviews. It’s simple, can be solved with good intuition and without prior knowledge, and it shows a lot about how one grasps the process of image formation.

What’s a blob?

A blob is what’s formed on an image plane when imaging a light source. For instance, here’s how you might imagine a set of LEDs would appear in an image:

[Figure: blobs formed by a set of LEDs]

This is an interesting problem because if we know the location of the center of each blob, and we know the 3D pattern that generated it (eg. the 3D shape of the LED array), then we can reconstruct its position in 3D space.

There are many possible approaches here. Most of them are either too expensive computationally, unnecessarily complex, or simply inaccurate. Let’s discuss a few:

I) Averaging

This one is rather simple. We take all the pixels associated with a blob and find their average. Here are two examples:

[Figure: a 3x3 block of lit pixels centered at (3,3)]

Computing the average of all pixels above gives us x=3 and y=3: the sum of the x coordinates and the sum of the y coordinates are each 27, and there are 9 pixels.

[Figure: the same blob with an additional lit pixel at (5,3)]

Here, however, the average is different. Now the average x is 3.2 and the average y is still 3. This isn’t surprising — the center is shifting to the right because of the additional pixel.
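
In code, this is just the mean of the lit pixel coordinates. A small sketch reproducing the second example above (pixel coordinates assumed from the figure):

import numpy as np

# Lit pixels of the second example: a 3x3 block around (3,3) plus an extra pixel at (5,3)
pixels = np.array([(x, y) for x in (2, 3, 4) for y in (2, 3, 4)] + [(5, 3)], dtype=float)
center = pixels.mean(axis=0)
print(center)  # -> [3.2, 3.0]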

This method is simple and requires very little computation. However, it’s obviously not very precise. We can easily see that our average has roughly half-pixel resolution: the center tends to land on a pixel center or on the edge between two pixels. This is terrible accuracy for 3D reconstruction (unless you have a super narrow field of view and/or an extremely high resolution image sensor).

But wait, it gets worse. This method doesn’t take into account the brightness of the pixels. For example, imagine pixel (5,3) above was extremely dim. So dim, in fact, that it sits just at the threshold of our image sensor. As a result, the pixel would appear and disappear in consecutive frames due to image noise, and the computed center of the blob would be greatly affected by this phenomenon. Not good.

II) Circle Fit

How about fitting a circle to the blob? It seems like this would give us much higher precision and aren’t circles pretty robust to noise?

Well yes in theory but not so much in practice.

First, let’s consider the image of even a really small light source. The light source isn’t going to be a point source in practice; we can imagine it has some small circular shape. But when that light source gets projected onto the image plane, the result is an ellipse, not a circle. Fitting an ellipse is much more complex and less robust than fitting a circle.

Second, the image formation process is unlikely to result in pixels that are all saturated. We are going to get shades of white as the distance from the true center increases. Here’s an example:

[Figure: a blob whose pixel brightness falls off away from its center]

As can be seen in the image, pixels get dimmer the further they are from the blob center. In fact, pixels at the edge of the blob might not always pass the sensor’s threshold and may remain black. These pixels are going to have a significant impact on the fit.

And finally, in many cases blobs are going to be pretty small. Circle fit is much less robust with fewer pixels.

III) Weighted Average

A weighted average is almost as simple as the averaging method discussed above. However, it takes into account the strength of the signal in each pixel and it doesn’t assume the projected shape is a circle. Weighted average takes the brightness level of each pixel, and uses it as a weight for the average.

Now imagine the 2nd example above, where all pixels have a brightness level of 255, except for pixel (5,3) which is very dim, say a brightness level of 10. Our weighted average yields x ≈ 3.009 while y remains 3. This makes perfect sense: the dim pixel pulls the center only slightly towards it. If that pixel were brighter, the pull would be stronger and the computed center would shift further towards it.
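
Here’s a sketch of that computation (pixel coordinates and brightness values are as assumed in the example above):

import numpy as np

# Same pixels as before; the 3x3 block is saturated (255), the extra pixel at (5,3) is dim (10)
pixels = np.array([(x, y) for x in (2, 3, 4) for y in (2, 3, 4)] + [(5, 3)], dtype=float)
weights = np.array([255.0] * 9 + [10.0])
center = (pixels * weights[:, None]).sum(axis=0) / weights.sum()
print(center)  # -> approximately [3.009, 3.0]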

Weighted average is also a great way to eliminate noise in the image. It’s simple, precise and much more robust.

Summary

I’ve used this method to solve multiple computer vision problems over time. Check out my patents page. Some of them use a version of this technique.

Position Tracking for Virtual Reality

I developed the original position tracking for the Oculus Rift headset and controllers.
In this post, I will discuss why position tracking is important for Virtual Reality and how we realized it at Oculus.

What is Position Tracking?

Position tracking in the context of VR is about tracking the user’s head motion in 6D. That is, we want to know the user’s head position and orientation. Ultimately, we will expand this requirement to tracking other objects / devices such as controllers or hands. But for now, I’ll focus on head motion.

Tracking the head lets us render the right content for the user. For instance, as you move your head forward, you expect the 3D virtual world to change: objects get closer to you, certain objects leave the field of view, etc. Similarly, when you turn around, you expect to see what was previously behind you.

Virtual reality can cause motion sickness. If the position and orientation updates are wrong, shaky or just imprecise, our brain senses the conflict between proprioception and vision: we know how the head, neck and body are moving and what we expect to see, and the mismatch causes discomfort.

To give you a sense of the requirements: we’re looking for a mean precision of about 0.05[mm] and a latency of about 1[ms]. In the case of a consumer electronics device, we also want the system cost to be low, both in dollars and in computation.

Note: the following assumes tracking of a single headset using a single camera. Mathematically, using stereo is easier. However, the associated costs in manufacturing and calibration are significant.

Basic Math

We can compute the pose of an object if we know its structure. In fact, we need as few as 3 points to compute a 6D pose:

[Figure: three points P1, P2, P3 and their projections p1, p2, p3 on the image plane]

If we know the distances between P1, P2 and P3 in metric units and the camera properties (intrinsics), we can compute the pose of the object (P1,P2,P3) that creates the projection (p1,p2,p3).
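
This is the classic perspective-n-point (PnP) problem. Here is a hedged OpenCV sketch, assuming we already know which image point matches which 3D point (a real system also handles the two hurdles listed next):

import cv2
import numpy as np

def estimate_pose(object_points, image_points, K):
    # object_points: Nx3 array of 3D points in the object's own frame (eg. LED positions)
    # image_points:  Nx2 array of their detected projections, in the same order
    # K: 3x3 camera intrinsics matrix; lens distortion is ignored here for simplicity
    # The default iterative solver needs at least 4 correspondences; minimal 3-point
    # solvers (P3P) exist but can return several candidate poses.
    ok, rvec, tvec = cv2.solvePnP(object_points.astype(np.float64),
                                  image_points.astype(np.float64), K, None)
    return ok, rvec, tvec  # rotation (as a Rodrigues vector) and translation of the object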

There are just two hurdles:
(1) we need to be able to tell which point P(i) corresponds to a projected point p(j)
(2) in the above formulation, there can actually be more than one solution (pose) that creates the same projection

Here’s an illustration of #2:

[Figure: two different poses that produce the same projection]

ID’ing points

There are many ways to solve #1. For instance, we could color code the points so they are easy to identify. At Oculus we created the points using infrared LEDs on the headset. We used modulation (each LED switched between high and low brightness) to create a unique sequence per LED.

Later on, we were able to make the computation efficient enough that we didn’t need modulation. But that’s for another post 🙂

Dealing with multiple solutions

There are two primary ways to handle multiple solutions. First, we have many LEDs on the headset. This means that we can create many pose hypotheses from subsets of 3 points, and we expect the right pose to explain all of them. Second, we track the pose of the headset over time: the wrong solution will break tracking within the next few frames, because the right answer changes smoothly from frame to frame while the alternative pose solution jumps around erratically.

Tracking

Many problems in computer vision can be solved by either tracking or re-identification. Simply put, you can either use the temporal correlation between solutions or recompute the solution from scratch every frame.

Usually, taking advantage of the temporal correlation leads to significant computational savings and robustness to noise. However, the downside is that we end up with more complex software: we use a “from scratch” solution for the first frame and a tracking solution for the following frames. When you design a real time computer vision system, you often have to balance the trade-off between the two options.

The Oculus position tracking solution uses tracking. We compute the pose for frame i as discussed above, and then track the pose of the headset in the following frames. When we compute the pose in frame n, we know where the headset was in frames n-1 and n-2. This means that not only do we have a good guess for the solution (close to where it was in the previous frame), but we also know the linear and angular velocities of the headset.

The following illustration shows the process of refining the pose of the headset based on temporal correlation. We make a guess for the pose of the headset (the blue point), then compute where the projected LEDs should be in the image given that pose. Of course, there will be some error; we can then adjust the pose locally to reach the red solution below (a local minimum).

[Figure: refining an initial pose guess (blue) into the locally optimal pose (red)]
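
In OpenCV terms, this refinement step can be sketched as re-running the pose solver with the previous frame's pose as the initial guess (an illustration of the idea, not the actual Oculus code):

import cv2

def refine_pose(object_points, image_points, K, rvec_prev, tvec_prev):
    # Start the optimization from the previous frame's pose instead of solving from scratch
    ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, None,
                                  rvec=rvec_prev.copy(), tvec=tvec_prev.copy(),
                                  useExtrinsicGuess=True)
    return ok, rvec, tvec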

Latency

So far so good — we have a clean solution to compute the pose of the headset over time using only vision data. Unfortunately, a reasonable camera runs at about 30[fps]. This means that we’re going to get pose measurements every 33[ms]. This is very far from the requirement of 1[ms] of latency…

Let me introduce you to the IMU (Inertial Measurement Unit). It’s a cheap little device that is part of every cellphone and tablet, and it’s typically used to determine the orientation of your phone. The IMU measures linear acceleration and angular velocity, and a typical IMU runs at about 1[KHz].

Our next task, therefore, is to use the IMU together with vision observations to get pose estimation at 1[KHz]. This will satisfy our latency requirements.

Complementary Filter

The complementary filter is a well known method for integrating multiple sources of information. We are going to use it to both filter IMU measurements to track the orientation of the headset, and to fuse together vision and IMU data to get full 6D pose.

IMU: orientation

[Figure: extracting the direction of gravity by low-pass filtering the accelerometer measurements]

To compute the orientation of the headset from IMU measurements we need to know the direction of gravity. We can compute that from the acceleration measurements. The above figure shows we can do that by low-pass filtering the acceleration vector. As the headset moves around, the only constant acceleration the IMU senses is going to be gravity.

orientation = (orientation + gyro * dt) * (1 - gain) + accelerometer * gain

With knowledge of gravity, we can lock the orientation along two dimensions: tilt and roll. We can also estimate the change in yaw, but this degree of freedom is subject to drift and cannot be corrected with gravity. Fortunately, yaw drift is slow and can be corrected with vision measurements.
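
As a concrete, simplified illustration, here is a one-axis version of this filter that estimates pitch from a gyro rate and an accelerometer-derived angle (the gain value and axis convention are just assumed examples):

import math

def complementary_pitch(pitch, gyro_rate, accel, dt, gain=0.02):
    # gyro_rate: angular velocity around the pitch axis [rad/s]
    # accel: (ax, ay, az) accelerometer reading; when the headset isn't accelerating,
    #        it is dominated by gravity and therefore encodes the tilt
    accel_pitch = math.atan2(accel[0], math.sqrt(accel[1] ** 2 + accel[2] ** 2))
    # Trust the integrated gyro in the short term, the noisy accelerometer in the long term
    return (pitch + gyro_rate * dt) * (1.0 - gain) + accel_pitch * gain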

IMU: full 6D pose

The final step in our system integrates pose computed from the camera (vision) at a low frequency with IMU measurements at a high frequency.

err = camera_position - filter_position
filter_position += velocity * dt + err * gain1
velocity += (accelerometer + bias) * dt + err * gain2
bias += err * gain3
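
Written out as a small runnable sketch (the gain values are assumptions, and the accelerometer is assumed to already be rotated into the world frame with gravity removed; the real system tunes all of this carefully):

import numpy as np

class PositionFilter:
    # Complementary-style fusion of low-rate camera positions with high-rate IMU data
    def __init__(self, gain1=0.1, gain2=0.01, gain3=0.001):
        self.position = np.zeros(3)
        self.velocity = np.zeros(3)
        self.bias = np.zeros(3)
        self.gain1, self.gain2, self.gain3 = gain1, gain2, gain3

    def update(self, accel_world, dt, camera_position=None):
        # err is only non-zero when a (low frequency) camera measurement is available
        err = np.zeros(3) if camera_position is None else camera_position - self.position
        self.position += self.velocity * dt + err * self.gain1
        self.velocity += (accel_world + self.bias) * dt + err * self.gain2
        self.bias += err * self.gain3
        return self.position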


Summary

Now we have a fully integrated system that combines the strengths of the camera (position and no drift in yaw) with those of the IMU (high frequency orientation and knowledge of gravity).

The complementary filter is a simple solution that provides us with smooth pose tracking at very low latency and with minimal computational cost.

Here’s a video of a presentation I gave together with Michael Abrash about VR at Carnegie Mellon University (CMU). In my part, I cover the position tracking system design.