The purpose of this project is to find a way to move through a virtual environment that differs from the usual one, that is, without using the traditional devices available for any personal computer, such as keyboards, mice, joysticks or game controllers. Here the user’s head is “the device” used to communicate with the interface, and all the user’s movements are translated into navigation through a 3D environment. Generally speaking, the main idea is to get an approximation to a Perceptual User Interface (PUI).

The main goal is to recreate a three-dimensional world that the user can navigate using head movements, so face detection algorithms and the subsequent tracking and analysis of the user’s movements are the main challenges to deal with. The core task of the project is to capture the user’s head movements and use them to operate an interface, in this case a three-dimensional world through which the user can move with complete freedom. Only a webcam is needed to capture the digital video and process it. To reach this target, several of the most popular algorithms of Image Processing and Computer Vision are used.

Here you can see the final result:

Another sample, this time showing the zones and marks of the user’s face that are being tracked:

The core of the development is based on OpenCV, an open source library of programming functions for real-time computer vision that offers a wide variety of algorithms for image interpretation. The following techniques have been used to implement face detection and tracking:

  • Haar Cascade classifiers to detect the face of the user.
  • The CamShift algorithm to keep track of the user’s face.
  • The Optical Flow algorithm to track facial features.

The application displays a 3D world built with OpenGL as the user interface. The user can move along this environment in a similar way to most first-person shooter computer games, that is, going forward, going back, sidestepping and looking around. A collision system is also present to add some realism to the user’s movements along the 3D environment.


The first target of the application is to detect the user in front of the webcam, so each frame of the live video is processed to detect a face within it. I’ve already explained in more detail how to find faces in this post, so I will just outline how the general process works:
OpenCV uses a type of object detector called a Haar Cascade Classifier, which can be used through the function cvHaarDetectObjects. This function finds rectangular regions in a given image that are likely to contain the objects the cascade has been trained for, and returns those regions as a sequence of rectangles. During the search, the function scans the image several times at different scales and returns the set of rectangular regions that have passed the cascade classifier (that is, regions that are considered faces). The cascade classifier is trained to find samples of a specific size; in our case, it has been trained to find frontal faces with a size of 20x20 px.
Since we obtain a sequence of regions for each frame, we could end up with, for example, two regions per frame if two different users are in front of the webcam. For the sake of simplicity, I’m assuming just one user in front of the cam, so only the first region of the sequence is considered.

In general, the cascade classifier gives good results detecting faces at several positions and distances, as well as with different facial gestures:

Even with the influence of occlusion and soft light sources, the classifier is still able to detect faces:

Worse results are obtained under extreme light conditions and with faces that vary in rotation and size. This can be explained by how the detection works: the classifier uses simple rectangular features (called Haar features; imagine them as a pair of adjacent rectangles, one light and one dark). The classifier evaluates the presence of these features by subtracting the average dark-region pixel value from the average light-region pixel value. If the difference is above a threshold, the feature is considered to be present.
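The rectangle-feature test just described can be illustrated in a few lines of plain numpy (the rectangle coordinates and threshold here are made up for the toy example, not values from a real cascade):

```python
import numpy as np

def haar_feature_present(gray, light_rect, dark_rect, threshold):
    """Evaluate a two-rectangle Haar feature on a grayscale image.

    light_rect / dark_rect are (x, y, w, h) tuples of adjacent rectangles.
    The feature is "present" when the light region is brighter than the
    dark region by more than `threshold`.
    """
    def mean(rect):
        x, y, w, h = rect
        return gray[y:y + h, x:x + w].mean()
    return (mean(light_rect) - mean(dark_rect)) > threshold

# Toy image: left half bright, right half dark (a vertical edge feature).
img = np.zeros((20, 20), dtype=np.uint8)
img[:, :10] = 200
print(haar_feature_present(img, (0, 0, 10, 20), (10, 0, 10, 20), 50))  # True
```

A real cascade chains thousands of such tests, computed efficiently with integral images, but the per-feature decision is exactly this subtraction and threshold.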

So once we are able to detect a face, the next step is to track it. Although we could achieve that just by locating the face over and over in each frame of the webcam’s video stream using the Haar detector, some issues have to be considered: what if another user appears in front of the camera? Which face should we follow? Besides, our Haar classifier has been trained to detect frontal faces; what if the user turns towards a profile view? To handle these issues, we can make use of CamShift, a color-based tracking algorithm.


To track the face we will use CamShift, an algorithm that uses color information to track an object across an image sequence. At this point we have already located the user’s face, so this region of interest will be analysed by CamShift to start tracking the face. These are the steps CamShift performs to track the desired object:

  • Create a color histogram of the image, that is, a distribution of the colors in the image. The histogram is built in the HSV color space, which separates the hue (color) from the saturation (how concentrated the color is) and from the brightness, and therefore handles light variations better than RGB does. Another reason is that, contrary to what one might think, no different color spaces are needed to track faces with different skin tones: all human faces share roughly the same hue and differ mainly in saturation (darker skin has a higher saturation), and precisely these components are stored separately in the HSV color space.
  • Calculate a “face probability” for each pixel in the incoming video frames, which results in a flesh-probability distribution for each new frame, as you can see in the image below (black pixels have the lowest probability value and white pixels the highest).

  • For each new video frame, CamShift “shifts” its estimate of the face location, keeping it centered over the area with the highest concentration of bright pixels in the face-probability image.
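The “shift” step above is a mean-shift iteration: recenter the search window on the centroid of the probability mass inside it. This is a pure-numpy sketch of that core step only (OpenCV’s cv2.CamShift additionally adapts the window size and orientation on each frame):

```python
import numpy as np

def mean_shift_step(prob, window):
    """One mean-shift iteration: recenter `window` (x, y, w, h) on the
    centroid of the face-probability mass inside it."""
    x, y, w, h = window
    patch = prob[y:y + h, x:x + w]
    total = patch.sum()
    if total == 0:
        return window  # no probability mass inside: stay put
    ys, xs = np.mgrid[0:h, 0:w]
    cx = (xs * patch).sum() / total
    cy = (ys * patch).sum() / total
    # Move the window so its center sits on the centroid.
    return (int(x + cx - w / 2), int(y + cy - h / 2), w, h)

# Toy probability image: a bright "face" blob centered near (29.5, 39.5).
prob = np.zeros((60, 60))
prob[35:45, 25:35] = 1.0
window = (20, 30, 12, 12)
for _ in range(10):
    window = mean_shift_step(prob, window)
print(window)
```

Iterating the step a few times makes the window settle on the blob, which is exactly how the tracker follows the face from frame to frame.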


An important issue that appeared during the development of the application was that, although the CamShift algorithm gives enough information to emulate side-shift movements as well as moving forward and back, there is no way to detect when the user is looking around, that is, when the user is rotating his head around the vertical axis.

I made several attempts using different approaches, such as eye tracking (so the position the user is looking at could be captured) or detecting simple geometric shapes using the Hough transform, but neither of them gave good results.

Finally I obtained good results using Optical Flow, an algorithm to track points (“features”) across multiple images. That is, given a set of points in an image, Optical Flow tries to find those same points in another image (in my case, across the frames of the digital video coming from the webcam).

So, what points are considered good features to track? Points with distinctive texture, such as corners, work best, so I chose to track the central point of the nose.
But how can I locate this point within the image? Again, Haar classifiers can be extremely helpful. Remember that they can be trained to detect any object, so I just used a classifier trained to detect noses. Once the nose is located, I define the central point of the rectangular region detected by the classifier as the point to be tracked.

At that point, to detect where the user is looking, we can compare the location of the point being tracked with Optical Flow against its initial position when the nose of the user was detected:
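The comparison itself can be as simple as thresholding the horizontal displacement of the tracked point. A sketch with a made-up dead-zone value (the post does not state the actual thresholds used; note also that a raw webcam image is mirrored, so the sign may need flipping depending on the setup):

```python
def look_direction(initial_x, current_x, dead_zone=15):
    """Map the tracked nose point's horizontal displacement (in pixels)
    to a look direction, with a dead zone to ignore small jitter."""
    dx = current_x - initial_x
    if dx > dead_zone:
        return "look right"
    if dx < -dead_zone:
        return "look left"
    return "center"

# Nose point started at x=160 and is now at x=200.
print(look_direction(160, 200))  # look right
```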

It is worth mentioning that more points are good candidates to be tracked with Optical Flow, such as points located within the eyes and some others (see image below). Nevertheless, to maintain simplicity I just keep track of the point located in the center of the nose.


At this point we are close to navigating through a 3D world using our head as the input device!
We are able to detect the user, to capture his movements and to translate them into real movements in a 3D world. How do we translate these movements? For the sake of simplicity, I decided not to spend too much time on that, so I performed some basic calculations, as follows:
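The post does not give the exact formulas, so the following is one plausible mapping under stated assumptions: the tracked face rectangle grows as the user leans toward the camera (move forward) and shrinks when leaning back, while its horizontal offset from the frame center maps to sidestepping. The frame width and dead-zone value are hypothetical:

```python
def head_to_motion(face, ref_face, frame_width=320, dead_zone=0.15):
    """Translate the tracked face rectangle into movement commands.

    face / ref_face are (x, y, w, h); ref_face is the rectangle captured
    when tracking started. Returns (forward, sideways), each in {-1, 0, 1}.
    """
    x, _, w, _ = face
    ref_w = ref_face[2]

    # Leaning toward the camera makes the face bigger -> move forward.
    scale = w / ref_w
    forward = 1 if scale > 1 + dead_zone else (-1 if scale < 1 - dead_zone else 0)

    # Horizontal offset of the face center from the frame center -> sidestep.
    offset = (x + w / 2 - frame_width / 2) / frame_width
    sideways = 1 if offset > dead_zone else (-1 if offset < -dead_zone else 0)
    return forward, sideways

# Face got bigger and moved right of center -> forward + sidestep right.
print(head_to_motion((200, 80, 90, 90), (140, 80, 70, 70)))  # (1, 1)
```

The resulting (forward, sideways) pair would then be scaled by a movement speed and applied to the camera position each frame.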

The next task is to draw a 3D world, so for that purpose I used OpenGL to recreate a three-dimensional environment. I will not go into details; suffice it to say that I keep all the geometry of the world (that is, floor, walls, etc.) in a data file. This geometry is redrawn continuously to keep the scene updated according to the movements of the user.

Regarding those movements, the 3D scene displayed in the application represents what the user sees based on his position in the 3D world. When we detect a movement of the user’s head, the application translates this movement into a translation/rotation of the 3D world to simulate that the user is navigating through this three-dimensional environment.

Last but not least, we need a collision detection system to add some realism when navigating through the world. For my purposes, I chose a basic collision detection approach (it is not the best performance-wise, but it is enough for my purposes), where the idea is to model the user as a sphere and prevent this sphere from getting too close to any element. Before doing this, we should make sure that every plane has its own normal and D value (from the plane equation). The elements in the world (walls, obstacles, etc.) are treated as planes, so we check the distance between the sphere’s center point and each plane in the scene.

This distance is calculated as: distance = dotProduct(plane.normal, sphere.pos) + plane.D
This distance is the blue line in the image above, and basically we consider that a collision occurs when the distance is less than the radius of the sphere. This way of detecting collisions is far from optimal because a lot of unnecessary calculations are done (all the planes are checked all the time), but it is just fine to simulate some realism.
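A minimal sketch of that check, assuming each plane is stored as a unit-length normal plus its D term from the plane equation ax + by + cz + D = 0 (the data layout here is illustrative, not the project's actual structures):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def collides(sphere_pos, sphere_radius, planes):
    """Return True if the sphere touches any plane.

    Each plane is (normal, D) with a unit-length normal, so the signed
    distance from a point P to the plane is dot(normal, P) + D.
    """
    for normal, d in planes:
        distance = dot(normal, sphere_pos) + d
        if abs(distance) < sphere_radius:
            return True
    return False

wall = ((1.0, 0.0, 0.0), -5.0)  # the plane x = 5
print(collides((4.5, 0.0, 0.0), 1.0, [wall]))  # True: 0.5 from the wall
print(collides((2.0, 0.0, 0.0), 1.0, [wall]))  # False: 3.0 from the wall
```

As noted above, looping over every plane each frame is wasteful; a spatial partitioning scheme would cut the number of checks, but for a small scene the brute-force loop is fine.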

We are done! At this point we detect and track a face, and translate the movements of the user’s head into movements in a 3D world. Here you can see another sample of the application running:

