Position Tracking for Virtual Reality

I developed the original position tracking for the Oculus Rift headset and controllers.
In this post, I will discuss why position tracking is important for Virtual Reality and how we realized it at Oculus.

What is Position Tracking?

Position tracking in the context of VR is about tracking the user’s head motion in 6D. That is, we want to know the user’s head position and orientation. Ultimately, we will expand this requirement to tracking other objects / devices such as controllers or hands. But for now, I’ll focus on head motion.

Tracking the head lets us render the right content for the user. For instance, as you move your head forward, you expect the 3D virtual world to change: objects get closer to you, some objects leave the field of view, and so on. Similarly, when you turn around, you expect to see what was previously behind you.

Virtual reality can cause motion sickness. If the position and orientation updates are wrong, shaky, or simply imprecise, the brain senses a conflict between proprioception and vision: we know how the head, neck, and body are moving, and we know what we expect to see, so the mismatch causes discomfort.

To give you a sense of the requirements: we’re looking for a mean precision of about 0.05[mm] and a latency of about 1[ms]. In the case of a consumer electronics device, we also want the system cost to be low, both in dollars and in computation.

Note: the following assumes tracking of a single headset using a single camera. Mathematically, using stereo is easier. However, the associated costs in manufacturing and calibration are significant.

Basic Math

We can compute the pose of an object if we know its structure. In fact, we need as few as 3 points to compute a 6D pose:

[Figure: three 3D points P1, P2, P3 on the object projected through the camera to image points p1, p2, p3]

If we know the distance between P1, P2 and P3 in metric units and the camera properties (intrinsics), we can compute the pose of the object (P1,P2,P3) that creates the projection (p1,p2,p3).
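
To make this concrete, here is a minimal sketch of the computation using OpenCV’s solvePnP. The model points, image points, and intrinsics below are made-up placeholder values, and the sketch uses four coplanar points with the generic solver because a pure three-point (P3P) solve returns several candidate poses (more on that below).

import numpy as np
import cv2

# Known 3D positions of the points on the object, in meters (placeholder values, coplanar for simplicity).
object_points = np.array([
    [0.00, 0.00, 0.0],
    [0.05, 0.00, 0.0],
    [0.00, 0.05, 0.0],
    [0.05, 0.05, 0.0],
], dtype=np.float64)

# Their detected projections in the image, in pixels (placeholder values).
image_points = np.array([
    [320.0, 240.0],
    [380.0, 238.0],
    [322.0, 300.0],
    [381.0, 303.0],
], dtype=np.float64)

# Camera intrinsics: focal lengths and principal point (placeholder values), no lens distortion.
camera_matrix = np.array([
    [600.0,   0.0, 320.0],
    [  0.0, 600.0, 240.0],
    [  0.0,   0.0,   1.0],
])
dist_coeffs = np.zeros(5)

# Solve for the rotation (rvec) and translation (tvec) that map the object points onto the image points.
ok, rvec, tvec = cv2.solvePnP(object_points, image_points, camera_matrix, dist_coeffs)
print(ok, rvec.ravel(), tvec.ravel())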

There are just two hurdles:
(1) we need to be able to tell which 3D point P(i) corresponds to which projected point p(j)
(2) in the above formulation, there can actually be more than one solution (pose) that creates the same projection

Here’s an illustration of #2:

[Figure: two different poses that produce the same projected image points]

ID’ing points

There are many ways to solve #1. For instance, we could color-code the points so they are easy to identify. At Oculus, we created the points using infrared LEDs on the headset. We used modulation, switching each LED between high and low brightness over time, to give every LED a unique sequence.
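
As a rough illustration of the idea (not the actual Oculus decoder), the sketch below thresholds a blob’s brightness over a few frames into a bit pattern and looks it up in a table of known per-LED codes. The codes, threshold, and readings are invented for the example.

# Hypothetical 3-bit modulation codes assigned to each LED (invented for illustration).
LED_CODES = {
    (1, 0, 1): "LED_0",
    (0, 1, 1): "LED_1",
    (1, 1, 0): "LED_2",
}

def identify_led(brightness_over_frames, threshold=128):
    """Threshold a blob's brightness samples into a high/low bit pattern and look up the LED ID."""
    bits = tuple(1 if value > threshold else 0 for value in brightness_over_frames)
    return LED_CODES.get(bits)  # None if the pattern doesn't match any known LED

# Example: the same blob observed over three consecutive frames.
print(identify_led([200, 40, 190]))  # -> "LED_0"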

Later on, we were able to make the computation efficient enough that we didn’t need modulation. But that’s for another post 🙂

Dealing with multiple solutions

There are two primary ways to handle multiple solutions. First, we have many LEDs on the headset. This means we can create many pose theories from subsets of 3 points, and we expect the right pose to explain all of them. Second, we are going to track the pose of the headset over time: the wrong solution will make tracking break within the next few frames, because the right answer changes smoothly from frame to frame while the alternative pose solutions jump around erratically.
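
In code, the first idea might look something like the sketch below: score every candidate pose by how well it reprojects all of the detected LEDs, not just the three points used to generate it, and keep the best one. The helper functions and data layout here are illustrative assumptions.

import numpy as np
import cv2

def reprojection_error(rvec, tvec, led_model_points, detected_points, camera_matrix, dist_coeffs):
    """Mean pixel distance between where a candidate pose predicts the LEDs and where we detected them."""
    projected, _ = cv2.projectPoints(led_model_points, rvec, tvec, camera_matrix, dist_coeffs)
    return float(np.mean(np.linalg.norm(projected.reshape(-1, 2) - detected_points, axis=1)))

def pick_best_pose(candidate_poses, led_model_points, detected_points, camera_matrix, dist_coeffs):
    """candidate_poses is a list of (rvec, tvec) pairs; keep the one that explains all the LEDs best."""
    return min(
        candidate_poses,
        key=lambda pose: reprojection_error(pose[0], pose[1], led_model_points, detected_points,
                                            camera_matrix, dist_coeffs),
    )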

Tracking

Many problems in computer vision can be solved by either tracking or re-identification. Simply put, you can either use the temporal correlation between solutions or recompute the solution from scratch every frame.

Usually, taking advantage of the temporal correlation leads to significant computational savings and robustness to noise. The downside is that we end up with more complex software: we use a “from scratch” solution for the first frame and a tracking solution for the following frames. When you design a real-time computer vision system, you often have to balance the trade-off between the two options.

The Oculus position tracking solution uses tracking. We compute the pose for the first frame as discussed above, and then track the pose of the headset in the following frames. When we compute the pose in frame n, we know where the headset was in frames n-1 and n-2. This means that not only do we have a good guess for the solution (close to where it was in the previous frame), but we also know the linear and angular velocities of the headset.

The following illustration shows the process of refining the pose of the headset based on temporal correlation. We start with a guess for the pose of the headset (the blue point). We then compute where the projected LEDs should be in the image given that pose. Of course, there are going to be some errors, so we adjust the pose locally to reach the red solution below (the local minimum).

[Figure: pose refinement, starting from an initial guess (blue) and converging locally to the refined pose (red)]
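
Here is a sketch of that refinement step, under the simplifying assumptions of a constant-velocity prediction, a pose parameterized as a 6-vector (rotation vector and translation), and SciPy’s generic least-squares solver standing in for the real optimizer.

import numpy as np
import cv2
from scipy.optimize import least_squares

def predict_pose(prev_pose, pose_velocity, dt):
    """Constant-velocity prediction. Poses are 6-vectors: rotation vector (3) followed by translation (3)."""
    return prev_pose + pose_velocity * dt

def refine_pose(initial_pose, led_model_points, detected_points, camera_matrix, dist_coeffs):
    """Locally adjust a predicted pose so the projected LEDs line up with the detected image points."""
    initial_pose = np.asarray(initial_pose, dtype=np.float64)

    def residuals(pose):
        rvec, tvec = pose[:3], pose[3:]
        projected, _ = cv2.projectPoints(led_model_points, rvec, tvec, camera_matrix, dist_coeffs)
        return (projected.reshape(-1, 2) - detected_points).ravel()

    # Because we start close to the previous frame's solution, this converges in a few iterations.
    return least_squares(residuals, initial_pose).x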

Latency

So far so good — we have a clean solution to compute the pose of the headset over time using only vision data. Unfortunately, a reasonable camera runs at about 30[fps]. This means that we’re going to get pose measurements every 33[ms]. This is very far from the requirement of 1[ms] of latency…

Let me introduce you to the IMU (Inertial Measurement Unit). It’s a cheap little device found in every cellphone and tablet, typically used to determine the orientation of your phone. The IMU measures linear acceleration and angular velocity, and a typical IMU runs at about 1[KHz].

Our next task, therefore, is to use the IMU together with vision observations to get pose estimation at 1[KHz]. This will satisfy our latency requirements.

Complementary Filter

The complementary filter is a well known method for integrating multiple sources of information. We are going to use it to both filter IMU measurements to track the orientation of the headset, and to fuse together vision and IMU data to get full 6D pose.

IMU: orientation

[Figure: estimating the direction of gravity by low-pass filtering the accelerometer measurements]

To compute the orientation of the headset from IMU measurements we need to know the direction of gravity. We can compute that from the acceleration measurements. The above figure shows we can do that by low-pass filtering the acceleration vector. As the headset moves around, the only constant acceleration the IMU senses is going to be gravity.

orientation = (orientation + gyro * dt) * (1 - gain) + accelerometer * gain

With knowledge of gravity, we can lock the orientation along two dimensions: tilt and roll. We can also estimate the change in yaw, but this degree of freedom is subject to drift and cannot be corrected with gravity. Fortunately, yaw drift is slow and can be corrected with vision measurements.
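
As a minimal sketch (one possible axis convention; the gain and the frame assumptions are mine, not the shipped tracker’s), a tilt-and-roll complementary filter can be written as:

import math

def update_tilt_roll(pitch, roll, gyro, accel, dt, gain=0.02):
    """One complementary-filter step for pitch and roll, in radians.

    Assumes a right-handed IMU frame with Z up, so the accelerometer reads roughly (0, 0, +g) when level.
    gyro  = (gx, gy, gz) angular velocity in rad/s
    accel = (ax, ay, az) measured acceleration in m/s^2
    """
    # Integrate the gyro: fast and smooth, but drifts over time.
    pitch_gyro = pitch + gyro[1] * dt
    roll_gyro = roll + gyro[0] * dt

    # Gravity direction from the accelerometer: noisy, but drift-free on average.
    pitch_accel = math.atan2(-accel[0], math.sqrt(accel[1] ** 2 + accel[2] ** 2))
    roll_accel = math.atan2(accel[1], accel[2])

    # Blend the two: mostly trust the gyro, let gravity slowly pull out the drift.
    pitch = (1 - gain) * pitch_gyro + gain * pitch_accel
    roll = (1 - gain) * roll_gyro + gain * roll_accel
    return pitch, roll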

IMU: full 6D pose

The final step in our system integrates pose computed from the camera (vision) at a low frequency with IMU measurements at a high frequency.

err = camera_position - filter_position
filter_position += velocity * dt + err * gain1
velocity += (accelerometer + bias) * dt + err * gain2
bias += err * gain3
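
Written out as a runnable sketch, with the update split into a high-rate IMU step and a low-rate camera step, one axis shown, gravity assumed already removed from the acceleration, and illustrative gains:

class PositionFilter:
    """Fuses low-rate camera positions with high-rate accelerometer data along one axis.

    Simplifications: a single axis, world-frame acceleration with gravity already removed,
    and illustrative gains.
    """

    def __init__(self, gain1=0.1, gain2=0.01, gain3=0.001):
        self.position = 0.0   # filter_position in the pseudocode above
        self.velocity = 0.0
        self.bias = 0.0       # slowly-varying accelerometer bias estimate
        self.gain1, self.gain2, self.gain3 = gain1, gain2, gain3

    def update_imu(self, accel, dt):
        """High-rate step (about 1 kHz): dead-reckon with the bias-corrected acceleration."""
        self.position += self.velocity * dt
        self.velocity += (accel + self.bias) * dt

    def update_camera(self, camera_position):
        """Low-rate step (about 30 Hz): pull position, velocity, and bias toward the vision measurement."""
        err = camera_position - self.position
        self.position += err * self.gain1
        self.velocity += err * self.gain2
        self.bias += err * self.gain3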

 

Summary

Now we have a fully integrated system that combines the strengths of the camera (position, and no drift in yaw) and the IMU (high-frequency orientation and knowledge of gravity).

The complementary filter is a simple solution that provides us with smooth pose tracking at very low latency and minimal computational cost.

Here’s a video of a presentation I gave together with Michael Abrash about VR at Carnegie Mellon University (CMU). In my part, I cover the position tracking system design.

Interactive Perception

Introduction

My PhD research was in robotics, machine learning and computer vision. My focus was on closing the loop between autonomous manipulation and perception. A result of this research was the concept of Interactive Perception.

In this post, I will explain Interactive Perception and give some concrete examples.

Here’s a quick introduction to the main idea (pulled out of my dissertation [Dov Katz’s Thesis]):

Explaining By Doing

Human children in the first three years of life are consumed by a desire to explore and experiment with objects. They are fascinated by causal relations between objects, and quite systematically explore the way one object can influence another object. They persistently explore the properties of objects using all their senses. A child might gently tap a new toy car against the floor, listening to the sounds it makes, then try banging it loudly, and then try banging it against the soft sofa. In fact, this kind of playing around with the world, while observing the outcome of their own actions, actually contributes to babies’ ability to solve the big, deep problems of disappearance, causality, and categorization.

Action and Perception

A child’s explanatory drive tightly couples action and perception. This coupling was observed decades ago by the psychologist James J. Gibson, who describes perception as an active process, highly coupled with motor activities. Motor activities are necessary to perform perception, and perception is geared towards detecting opportunities for motor activities. Gibson called these opportunities “affordances”. The philosopher and cognitive scientist Alva Noë describes an “enactive” approach to perception. He argues that perception is an embodied activity that cannot be separated from motor activities, and that it can only succeed if the perceiver possesses an understanding of motor activities and their consequences.


Interactive Perception

Perceiving the world, making decisions, and acting to change the state of the world seem to be three independent processes, so separating action and perception is intuitive. However, an “enactive” approach to perception may be essential for surviving in a high-dimensional and uncertain world. Interactive Perception provides a straightforward way to formulate theories about the state of the world and directly test these theories through interactions. Interactive Perception imposes structure, limiting significantly what needs to be perceived and explained. Researchers have been following the Interactive Perception paradigm, developing robots that explain their environment by coupling action and perception.

Further Reading

There’s been a lot of research in Interactive Perception over the past few years. Take a look at http://interactive-perception.org for more details and pointers to good research papers.

Examples

Let’s take a look at four representative examples of interactive perception algorithms solving challenging problems. I’ll cover: image segmentation, manipulating articulated objects, clearing debris, and sorting laundry.

1. Interactive Segmentation

Image segmentation is an old computer vision problem: identifying the boundaries of an object of interest in an image. There are many techniques and approaches. Some methods are based on color or texture analysis, others assume motion of the target object, and more recent methods leverage machine learning.

Interactive segmentation closely couples vision with robotic manipulation. The robot’s interaction with objects in the environment creates a perceptual signal (motion) that is then used to perform segmentation.

Here’s a great illustration of the interactive segmentation process from Kenney et al.:

[Figure: image sequence of a robot pushing a box, with the resulting motion-based segmentation]

The robot is interacting with the box, and in the process creates motion in the sequence of images. The robot’s hand is moving, and so are the box and notebook it is pushing. By computing the differences between frames, we can easily (and with little computation) obtain a high-quality segmentation of each object.
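
A toy version of that motion-based segmentation, assuming two grayscale frames captured before and after the push, could look like this OpenCV sketch:

import cv2

def segment_by_motion(frame_before, frame_after, threshold=25):
    """Return contours of the regions that moved between two grayscale frames."""
    diff = cv2.absdiff(frame_before, frame_after)                     # pixels that changed
    _, mask = cv2.threshold(diff, threshold, 255, cv2.THRESH_BINARY)  # keep only large changes
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)             # clean up small specks of noise
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return contours  # each contour outlines something that moved: the hand, the box, the notebook
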
Here’s a cool video showing interactive segmentation in action (by Bersch et al.):

2. Manipulating Articulated Objects

Articulated objects are objects that have internal degrees of freedom. Look around you and you will see them everywhere: doors, door knobs, scissors, windows, drawers, wheels, and so on.
This is an important category of objects because an object’s degrees of freedom tell us a lot about its intended use.

Extracting the degrees of freedom of an object from an image is difficult. It may even be impossible: can I actuate the garden shears or is the joint glued shut? Can I push open the door or is it locked? This information is simply not available in an image of the object.

Fortunately, interactive perception creates exactly that missing information: by poking objects and monitoring how they move, the robot can model their degrees of freedom. You can read more about how this is done for planar objects here.
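
To give a flavor of how this can work (a simplified stand-in for the actual method), the sketch below fits both a line and a circle to the trajectory of a tracked feature while the robot pushes the object, and classifies the joint as prismatic or revolute depending on which model fits better.

import numpy as np

def line_fit_error(points):
    """RMS distance of 2D points to their best-fit line (via PCA on the centered points)."""
    centered = points - points.mean(axis=0)
    singular_values = np.linalg.svd(centered, full_matrices=False)[1]
    return float(singular_values[-1] / np.sqrt(len(points)))

def circle_fit_error(points):
    """RMS distance of 2D points to a least-squares (Kasa) circle fit."""
    x, y = points[:, 0], points[:, 1]
    A = np.column_stack([x, y, np.ones(len(points))])
    c, d, e = np.linalg.lstsq(A, x ** 2 + y ** 2, rcond=None)[0]
    cx, cy = c / 2.0, d / 2.0
    radius = np.sqrt(e + cx ** 2 + cy ** 2)
    return float(np.sqrt(np.mean((np.hypot(x - cx, y - cy) - radius) ** 2)))

def classify_joint(trajectory):
    """trajectory: Nx2 array of a tracked feature's positions while the robot pushes the object."""
    return "prismatic" if line_fit_error(trajectory) < circle_fit_error(trajectory) else "revolute"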

Here’s a video showing the process for 3D objects like a train toy:

3. Clearing Debris

Interactive perception is also useful for complex tasks such as clearing a pile of unknown objects. Think about saving people after an earthquake or a terror attack. We want our robot to figure out which piece to pick up first, with little to no disturbance to the pile. But piles are complex even for people, and the objects the robot will encounter will likely be broken and dusty, which makes it difficult to have a prior model for how to handle them.

With interactive perception the robot can learn on the go and use its experience to get better over time. The robot gets to test theories about what to do next, poke an object a little to verify, and then decide whether to go for it or try a different action. Here’s a video:

4. Sorting Laundry

I’ll leave you with this cool video by Li Sun of a robot using interactive perception to identify clothing articles in a pile. The idea is to pick up an unknown item and turn it around to get several perspectives. The robot can repeat this process to get different views until it is certain of the item. Then, the robot picks up the item one last time and drops it into the right bin.

Conclusion

Interactive perception is an intuitive paradigm for perception. It removes the artificial boundary between action and perception, enabling us to create better computer vision systems. Interactive perception is more complex in that you need a robot to manipulate the environment, not just a camera. However, in exchange you get to solve problems that are otherwise extremely difficult, if not impossible, to solve. In addition, because we deliberately create a motion signal that we then look for in the image stream, the computational complexity associated with interactive perception is often quite low.