Synopsis

Recently at Holler, we launched the all-new R8 campaign website. The idea was simple: the user watches a video of the R8 racing 1 km. During that time we track how many times they blink and the duration of each blink. Based on the car's speed and the total duration of those blinks, we tell the user how many metres of the race they missed just by blinking.

At the time of writing there is no out-of-the-box solution, so our approach to the challenge went something like this:

Ok, we need to detect the eye, so we'll need some sort of face tracking; then we'll need to check if the user closed their eyes, and we can achieve this by comparing our current sample data with our last sample data.

So with this idea in mind, we’ve decided to start exploring.

After some brief research into what was out there for face tracking, we decided to go with Audun Øygard's face tracking library (clmtrackr.js), because it performed the best of all the ones we tried.

Webcam Request

The first thing you'll need to do is request access to the user's webcam. Below is an example using the HTML5 Media Capture and Streams API.
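A minimal sketch of the request, using the promise-based navigator.mediaDevices.getUserMedia and assuming a `<video id="webcam">` element on the page:

```javascript
// Request the user's webcam and pipe the stream into a <video> element.
// Assumes <video id="webcam"> exists in the page.
var video = document.getElementById('webcam');

navigator.mediaDevices.getUserMedia({ video: true, audio: false })
  .then(function (stream) {
    video.srcObject = stream; // attach the live camera stream
    video.play();
  })
  .catch(function (err) {
    console.error('Webcam access denied or unavailable:', err);
  });
```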

I will add to this example as I continue this article.

Face Tracking

clmtrackr is a JavaScript library for fitting facial models to faces in videos or images. It is an implementation of constrained local models fitted by regularized landmark mean-shift, as described in Jason M. Saragih's paper.

In order to start the tracker, you need to provide a CLM model, which contains the shape information and patch appearance information. There are several pre-built models included. The models are loaded under the variable name pModel.

All of the models are trained on the same dataset (MUCT Database). The difference between them is the type of classifier, the number of components in the facial model, and how the components were extracted (Sparse PCA or PCA). A model with fewer components will be slightly faster, with some loss of precision.
For this project we decided to go with the SVM kernel classifier and a 10-component PCA model.
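Starting the tracker then looks something like this (a sketch assuming the webcam `video` element from earlier and a loaded pModel):

```javascript
// Initialise clmtrackr with a pre-built model and start tracking
// the webcam <video> element from the earlier example.
var ctrack = new clm.tracker();
ctrack.init(pModel);
ctrack.start(video);
```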

Eye Tracking

Once started, clmtrackr will try to detect a face on the given element. If a face is found, clmtrackr will start to fit the facial model, and the positions can be returned via getCurrentPosition().

The facial models included in the library follow this annotation:
[Figure: clmtrackr facial model point numbering]

Since we normally blink with both eyes at the same time, we decided to focus only on the right eye (points 23, 24, 25 and 26 in the positions array).
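Grabbing those points might look like this (a sketch; the corner/lid roles in the comments are our reading of the annotation diagram):

```javascript
// Pull the right-eye landmarks out of the tracker's current fit.
// Each position is an [x, y] pair in video coordinates.
var positions = ctrack.getCurrentPosition();

if (positions) {
  var corner1  = positions[23]; // one eye corner
  var upperLid = positions[24]; // upper eyelid
  var corner2  = positions[25]; // other eye corner
  var lowerLid = positions[26]; // lower eyelid
  // These four points give the bounding box of the eye region to sample.
}
```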

Now that we've grabbed the eye limits, we can start to compare the sampled data over time and identify changes in it (eye closed/open).

Improving Sampled Data

In order to maximise the accuracy of the sampled data, we need to apply some image processing. One very effective way of doing this is to apply a threshold to the image. There's an excellent tutorial by Ilmari Heikkinen online which covers the basics of image processing at the pixel level. You can choose whether to do this on a 2D canvas or in a shader. If your sampled data is small, like ours (80 × 60 px), your app should perform fine with canvas 2D; but if you're aiming for high resolution and precision, you're probably better off letting the GPU handle it and performing the image processing in a shader.

Grayscale

Since we're trying to threshold the image to either white (1) or black (0) pixels, the first thing to do is apply a grayscale filter.
You can read a little more about this process by looking at luminance and relative luminance, but the basic idea is that our eyes perceive colours with different sensitivities, so when converting to grayscale the RGB channels aren't weighted equally but follow a formula that reflects the luminosity function (L = 0.2126 R + 0.7152 G + 0.0722 B).

We can improve the grayscale conversion further by applying contrast and brightness adjustments to it.
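A sketch of the conversion on canvas ImageData, with illustrative brightness/contrast parameters (not the exact values from our build):

```javascript
// Convert ImageData to grayscale using the luminosity function,
// then apply simple brightness/contrast adjustments.
function toGrayscale(imageData, brightness, contrast) {
  var d = imageData.data;
  for (var i = 0; i < d.length; i += 4) {
    // L = 0.2126 R + 0.7152 G + 0.0722 B
    var l = 0.2126 * d[i] + 0.7152 * d[i + 1] + 0.0722 * d[i + 2];
    // Contrast pivots around mid-gray; brightness is a flat offset.
    l = (l - 128) * contrast + 128 + brightness;
    l = Math.max(0, Math.min(255, l));
    d[i] = d[i + 1] = d[i + 2] = l;
  }
  return imageData;
}
```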

Threshold

The simplest thresholding method is to replace each pixel with either white or black, depending on whether its intensity is above or below a fixed constant. There are other thresholding methods (local, spatial, object attribute, entropy, clustering and histogram based), but for the sake of simplicity we chose a fixed threshold.
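A sketch of the fixed threshold applied to the grayscaled ImageData (the constant here is illustrative and will depend on your lighting):

```javascript
// Replace each grayscale pixel with pure white or pure black.
var THRESHOLD = 100; // illustrative constant; tune for your conditions

function threshold(imageData) {
  var d = imageData.data;
  for (var i = 0; i < d.length; i += 4) {
    var v = d[i] >= THRESHOLD ? 255 : 0;
    d[i] = d[i + 1] = d[i + 2] = v;
  }
  return imageData;
}
```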

Comparing Sampled Data

As you can see in the example above, there's a clear change in the sampled data when you close your eyes, but how do we measure it? We need to compare the current sampled data with the previous sampled data in order to identify tangible differences. We could do this on every frame or at a fixed interval. Because this detection was part of a big website build and we wanted to optimise overall performance, we decided to sample at a fixed interval of 100 ms.

Now there are advantages and disadvantages to this approach. Since JavaScript is a single threaded language, choosing this approach means fewer calculations being made, leaving more room for video rendering, DOM animation, and so on. However, it limits us in that we can't accurately measure the blink duration, because we're sampling data every 100 ms rather than every frame (~16 ms at 60 fps). You can still measure the duration by saving a timestamp when you detect the eye closing and subtracting it from one taken when you detect the eye opening, but the value will be an approximation.
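A sketch of that timestamp approach, where isEyeClosed() is a hypothetical stand-in for the comparison described in the next section:

```javascript
// Approximate blink duration from samples taken every 100 ms.
var SAMPLE_INTERVAL = 100; // ms
var blinkStart = null;

setInterval(function () {
  var closed = isEyeClosed(); // hypothetical: current vs. previous sample
  if (closed && blinkStart === null) {
    blinkStart = Date.now(); // eye just closed
  } else if (!closed && blinkStart !== null) {
    var duration = Date.now() - blinkStart; // approximation, ±100 ms
    blinkStart = null;
    console.log('Blink lasted ~' + duration + 'ms');
  }
}, SAMPLE_INTERVAL);
```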

Correlation

Now that we have two sampled data sets to compare, we need to identify how different they are from each other. To do this we calculate the correlation between the two data sets: it should be 0% if the data sets (image data) are completely different at the pixel level, and 100% if they are identical.
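A sketch of the per-pixel comparison on two thresholded frames (since the pixels are binary after thresholding, comparing a single channel is enough):

```javascript
// Fraction of matching pixels between two thresholded ImageData frames:
// 1.0 means identical, 0.0 means completely different.
function correlation(current, previous) {
  var matches = 0;
  var total = current.data.length / 4;
  for (var i = 0; i < current.data.length; i += 4) {
    if (current.data[i] === previous.data[i]) matches++;
  }
  return matches / total;
}
```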

Unfortunately, since we're sampling data from a webcam rather than still images, noise is naturally introduced into the data sets, so we will never get a perfect result. This noise can come from small light changes in the room, from the screen illuminating the user's face, from movement of the user, or from position adjustments at the tracker level. Even so, the results are very satisfying despite the noise.

As with the threshold calculation, there is a fixed constant for the correlation: if the difference between the two data sets is above that constant, we have two substantially different images; if it is below, we probably have the same image. For this example the constant was set to 17%, so if more than 17% of the current data set differs from the previous data set, the image has changed substantially and we can conclude that a blink occurred.
On the R8 website this value depends on the user's distance from the monitor, so you might need to adjust it in the example below.
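Tying it together, the blink decision looks something like this sketch (reusing the correlation() helper above; the 17% constant is the one from the R8 build):

```javascript
// A frame counts as "different" when more than 17% of its pixels changed.
var DIFFERENCE_THRESHOLD = 0.17;

function framesDiffer(current, previous) {
  return (1 - correlation(current, previous)) > DIFFERENCE_THRESHOLD;
}
```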

Improvements

There are several improvements we could pursue from here, the most important being improving the quality of our sampled data and reducing the noise to a minimum.

Include rotation

If you take a look at the examples, you'll notice that they don't take rotation into consideration: you can slightly rotate your head and see the sampled eye data rotate along with it. We can counter this by calculating the angle between points 23 and 25, then rotating the canvas by that angle so the sampled data's rotation is always ~0°.
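A sketch of the counter-rotation, using the two eye-corner points and a 2D canvas context:

```javascript
// Counter-rotate the eye sample so it is always level, using the
// angle of the line through the two eye corners (points 23 and 25).
var p23 = positions[23];
var p25 = positions[25];
var angle = Math.atan2(p25[1] - p23[1], p25[0] - p23[0]);

ctx.save();
ctx.translate(canvas.width / 2, canvas.height / 2);
ctx.rotate(-angle); // undo the head tilt
ctx.translate(-canvas.width / 2, -canvas.height / 2);
// ...draw the eye region into the canvas here...
ctx.restore();
```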

Dynamic threshold

Another win for our sampled data would be a dynamic threshold (see Categorizing Thresholding Methods).
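One of the simplest dynamic approaches is to threshold against the mean intensity of the sample; a sketch (this particular method is our illustration, not what the R8 build shipped):

```javascript
// Threshold each pixel against the mean intensity of the whole sample,
// so the cut-off adapts to the overall lighting of the frame.
function dynamicThreshold(imageData) {
  var d = imageData.data;
  var sum = 0;
  for (var i = 0; i < d.length; i += 4) sum += d[i];
  var mean = sum / (d.length / 4);
  for (var j = 0; j < d.length; j += 4) {
    var v = d[j] >= mean ? 255 : 0;
    d[j] = d[j + 1] = d[j + 2] = v;
  }
  return imageData;
}
```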

clmtrackr optimisations

If you have a look at the source code of clmtrackr, at the runnerFunction method, you'll notice that it samples data at 60 fps, and that every frame there's also a while loop. This helps the tracker quickly adapt to changes in the user's position (in the webcam), but it can be a downfall if you have several elements running or animating in your application. You can minimise this by creating a render method in your application that bypasses the built-in clmtrackr one, giving you full control over when you run the runnerFunction yourself. In this example I'm only running it at 24 fps.
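A sketch of the throttling idea, where trackerStep() is a hypothetical stand-in for however you invoke the tracker's runnerFunction in your build:

```javascript
// Drive our own requestAnimationFrame loop and only step the tracker
// ~24 times per second instead of the full 60 fps.
var TRACK_FPS = 24;
var last = 0;

function loop(now) {
  requestAnimationFrame(loop);
  if (now - last < 1000 / TRACK_FPS) return;
  last = now;
  trackerStep(); // hypothetical: one iteration of clmtrackr's runnerFunction
}
requestAnimationFrame(loop);
```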

Conclusion

Though the techniques described above can make a creative concept work, they aren't as precise as I wish they were. It's your job as a developer to compromise and optimise in order to achieve the best results possible. As JavaScript continues to grow and evolve, I'm sure some of the techniques here will soon be improved or extended, allowing more precision in the tracking. Let me know if you have any ideas, I'm always open to discussion! Hit me up on Twitter, I'd love to hear your thoughts.

Thank you

A big thank you to the fellow developers whose work I've long admired, for building tools and sharing knowledge with the open source community: Zeh Fernando, Ilmari Heikkinen and Audun Øygard.