Recently at Holler, we launched the all-new R8 campaign website. The idea was simple: the user watches a video of the R8 racing 1 km. During that time we track how many times they blink and the duration of each blink. Based on the car's speed and the total duration of all the blinks, we tell the user how many metres of the race they missed just by blinking.
At the time of this article, there is no out-of-the-box solution, so our approach to this challenge was:
Ok, we need to detect the eye, so we'll need some sort of face tracking; then we'll need to check if the user closed their eyes, which we can do by comparing the current sample data with the last sample data.
So with this idea in mind, we’ve decided to start exploring.
After some brief research into what was out there for face tracking, we decided to go with Audun Øygard's face tracking library (clmtrackr.js), because it performed the best of all the ones we tried.
The first thing you'll need to do is request access to the user's webcam. Below is an example using the HTML5 Stream API.
I will add to this example as I continue this article.
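A minimal sketch of that request using the modern promise-based `getUserMedia` (the `webcam` element id is an assumption; the exact markup on the R8 site may differ):

```javascript
// Hypothetical sketch: request the webcam and pipe the stream into a
// <video> element (assumes a <video id="webcam"> exists in the page).
function startWebcam() {
  const video = document.getElementById('webcam');
  return navigator.mediaDevices.getUserMedia({ video: true })
    .then((stream) => {
      video.srcObject = stream; // attach the live stream to the element
      return video.play();
    })
    .catch((err) => {
      console.error('Webcam access was denied or failed:', err);
    });
}
```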
In order to start the tracker, you need to provide a CLM model, which contains the shape information and patch appearance information. There are several pre-built models included. The models will be loaded with the variable name pModel.
All of the models are trained on the same dataset (MUCT Database). The difference between them is the type of classifier, the number of components in the facial model, and how the components were extracted (Sparse PCA or PCA). A model with fewer components will be slightly faster, with some loss of precision.
For this project we decided to go with the SVM kernel classifier and a 10-component PCA model.
Once started, clmtrackr will try to detect a face in the given element. If a face is found, clmtrackr will start to fit the facial model, and the fitted positions can be retrieved via getCurrentPosition().
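A sketch of starting the tracker, based on clmtrackr's documented usage (it assumes the library and a model file exposing `pModel` have already been loaded, and that `video` is the webcam element from before):

```javascript
// Sketch: start clmtrackr against the webcam <video> element.
function startTracking(video) {
  const ctracker = new clm.tracker();
  ctracker.init(pModel);  // shape + patch appearance model
  ctracker.start(video);  // begins detecting/fitting each frame

  // ctracker.getCurrentPosition() returns an array of [x, y] points
  // once a face has been fitted, or false while it hasn't.
  return ctracker;
}
```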
The facial models included in the library follow this annotation:
Since we normally blink with both eyes at the same time, we decided to focus only on the right eye (points 23, 24, 25 and 26 in the positions array).
Now that we've grabbed the eye limits, we can start to compare the sampled data over time and identify changes in it (eye closed/open).
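To illustrate grabbing those eye limits, here's a hypothetical helper that computes a bounding box from points 23–26 (the `padding` parameter is an assumption, just to give the sample some margin around the eye):

```javascript
// Bounding box for the right eye from the tracker's positions array
// (points 23-26 in clmtrackr's annotation). Each position is [x, y].
function getEyeBounds(positions, padding = 4) {
  const eye = [23, 24, 25, 26].map((i) => positions[i]);
  const xs = eye.map((p) => p[0]);
  const ys = eye.map((p) => p[1]);
  const minX = Math.min(...xs) - padding;
  const minY = Math.min(...ys) - padding;
  return {
    x: minX,
    y: minY,
    width: Math.max(...xs) + padding - minX,
    height: Math.max(...ys) + padding - minY,
  };
}
```

The resulting rectangle is what you'd pass to `ctx.drawImage()` to copy just the eye region onto a small sampling canvas.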
Improving Sampled Data
In order to maximise the accuracy of the sampled data, we need to apply some image processing. One very effective way of doing this is applying a threshold to the image. There's an excellent tutorial by Ilmari Heikkinen online which covers the basics of image processing at the "pixel" level. You can choose whether to do this on a 2D canvas or in a shader. If your sampled data is small enough, like ours is (80 × 60 px), your app should perform fine with canvas 2D, but if you're aiming for high resolution and precision you're probably better off letting the GPU handle this and performing the image processing in a shader.
Since we’re trying to threshold the image to either a white (1) or black (0) pixel, the first thing to do is to apply a grayscale filter.
You can read a little bit more about this process by having a look at Luminance and Relative Luminance, but the basic idea is that our eyes perceive colours differently, so when converting to grayscale the RGB channels aren't weighted equally but follow a formula that reflects the luminosity function (L = 0.2126 R + 0.7152 G + 0.0722 B).
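Applied to raw canvas ImageData, that formula might look like this sketch (operating on the flat RGBA byte array):

```javascript
// Convert RGBA pixel data to grayscale in place, using the luminance
// weights above: L = 0.2126 R + 0.7152 G + 0.0722 B.
function toGrayscale(data) {
  for (let i = 0; i < data.length; i += 4) {
    const l = 0.2126 * data[i] + 0.7152 * data[i + 1] + 0.0722 * data[i + 2];
    data[i] = data[i + 1] = data[i + 2] = l; // alpha (i + 3) is untouched
  }
  return data;
}
```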
We can improve the grayscale conversion further by applying contrast and brightness adjustments to it.
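The article doesn't give its exact mapping, but one common contrast/brightness formula scales each value around the midpoint and then offsets it — a sketch (the default `contrast` and `brightness` values here are illustrative assumptions):

```javascript
// One common contrast/brightness mapping: scale around the 128 midpoint
// for contrast, then add a brightness offset, clamping to 0-255.
function adjust(value, contrast = 1.2, brightness = 10) {
  const v = (value - 128) * contrast + 128 + brightness;
  return Math.max(0, Math.min(255, v));
}
```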
The simplest thresholding method is replacing each pixel with either white or black depending on whether its intensity is above or below a fixed constant. There are other thresholding methods (Local, Spatial, Object Attribute, Entropy, Clustering and Histogram), but for the sake of simplicity we chose a fixed threshold.
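A minimal fixed-constant threshold over grayscale RGBA data (the threshold value of 90 here is an illustrative assumption, not the value used on the site):

```javascript
// Fixed threshold: pixels above the constant become white (255),
// everything else becomes black (0). Assumes grayscale RGBA input,
// so reading the red channel is enough.
function applyThreshold(data, threshold = 90) {
  for (let i = 0; i < data.length; i += 4) {
    const v = data[i] > threshold ? 255 : 0;
    data[i] = data[i + 1] = data[i + 2] = v;
  }
  return data;
}
```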
Comparing Sampled Data
As you can see in the example above, there's a clear change in the sampled data when you close your eyes, but how do we measure it? We need to compare the current sampled data with the previous sampled data in order to identify tangible differences. We could do this on every frame or at a fixed interval. Because this detection was part of a big website build, we wanted to optimise website performance, so we decided to sample at a fixed interval of 100ms.
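A sketch of that fixed-interval sampling (`sampleEye` and `onCompare` are hypothetical callbacks standing in for the site's actual sampling and comparison code):

```javascript
// Sample at a fixed 100 ms interval rather than on every frame.
// Returns the interval id so the loop can be stopped with clearInterval.
function startSampling(sampleEye, onCompare) {
  let previousSample = null;
  return setInterval(() => {
    const currentSample = sampleEye(); // grab the thresholded eye pixels
    if (previousSample) {
      onCompare(previousSample, currentSample);
    }
    previousSample = currentSample;
  }, 100);
}
```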
Now that we have two sampled data sets to compare, we need to identify how different they are from each other. To do this we need to calculate the correlation between the two data sets.
The correlation should be 0% if the data sets (image data) are completely different at the pixel level, or 100% if the data sets are identical.
Unfortunately since we’re sampling data from a webcam, and not still images, there is noise naturally introduced to the data sets, so we will never have a perfect result. This noise can be introduced by small light changes in the room, or the screen which illuminates the user’s face, or movement of the user, or position adjustments at the tracker level. However, the results are very satisfying even with the noise.
As with the threshold calculation, there is a fixed constant for the correlation: if the difference between two samples is above that constant, we treat them as two different images; if it is below, we assume they are probably the same image. For this example the constant was set to 17%, so if more than 17% of the current data set differs from the previous data set, we have a substantially different image, and therefore we can conclude that a blink occurred.
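A sketch of the comparison, under the assumption that both samples are same-sized RGBA arrays of thresholded black/white pixels (so checking the red channel alone is enough):

```javascript
const BLINK_THRESHOLD = 17; // percent of changed pixels

// Pixel-level correlation between two same-sized samples:
// the percentage of pixels whose values match.
function correlate(a, b) {
  let matches = 0;
  const pixels = a.length / 4;
  for (let i = 0; i < a.length; i += 4) {
    if (a[i] === b[i]) matches++;
  }
  return (matches / pixels) * 100;
}

// More than 17% change between samples is treated as a blink.
function isBlink(previous, current) {
  return 100 - correlate(previous, current) > BLINK_THRESHOLD;
}
```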
This correlation value on the R8 website is dependent on the user's distance from the monitor. You might need to adjust this value in the example below.
There are several improvements we can pursue from here. The most important is improving the quality of our sampled data and reducing the noise to a minimum.
If you take a look at the examples, you will notice that they don't take rotation into consideration. You can slightly rotate your head and see the sampled eye data rotate along with it. We can stop this by calculating the angle between points 23 and 25, and then rotating the canvas based on that angle so the sampled data's rotation will always be ≈ 0°.
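The angle calculation is straightforward with `Math.atan2`:

```javascript
// Angle of the eye line, from the two corner points (23 and 25)
// of the tracker's positions array. Returns radians.
function getEyeAngle(positions) {
  const [x1, y1] = positions[23];
  const [x2, y2] = positions[25];
  return Math.atan2(y2 - y1, x2 - x1);
}
```

Before drawing the eye region to the sampling canvas, you could then rotate the context by the negative of this angle (around the eye's centre) so the sample stays level.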
Another way to improve our sampled data is to use a dynamic threshold (see Categorizing Thresholding Methods).
If you have a look at the source code of clmtrackr, in the runnerFunction method, you'll notice that it samples data at 60 FPS, and every frame there's also a while loop. This helps the tracker adapt quickly to changes in the user's position (in the webcam) but can be a downfall if you have several elements running or animating in your application. You can minimise this by creating a render method in your application that bypasses clmtrackr's built-in one, giving you full control over when to run runnerFunction yourself. In this example I'm only rendering at 24 FPS.
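A sketch of such a throttled loop (the `step` callback is a stand-in for invoking the tracker's runnerFunction once; hooking into clmtrackr's internals this way is what the article describes, not a documented API):

```javascript
// Run `step` at roughly `fps` frames per second using our own
// requestAnimationFrame loop. Returns a function that stops the loop.
function createThrottledLoop(step, fps = 24) {
  const frameInterval = 1000 / fps;
  let last = 0;
  let running = true;

  function loop(now) {
    if (!running) return;
    if (now - last >= frameInterval) {
      last = now;
      step(); // e.g. run the tracker's runnerFunction once
    }
    requestAnimationFrame(loop);
  }

  requestAnimationFrame(loop);
  return () => { running = false; };
}
```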
A big thank you to the fellow developers whose work I've long admired, for building the tools and sharing knowledge with the open source community: Zeh Fernando, Ilmari Heikkinen and Audun Øygard.