How to Build a Face Detection System Using the Viola-Jones Algorithm


Object detection locates objects in an image and identifies the class of each located object. Face detection is a particular case of object detection: its objective is to find and locate faces in an image, and it is the first step in automatic face recognition applications. Face detection has been well studied for frontal and near-frontal faces, and many techniques exist, for example the Viola-Jones face detector, R-CNN, and YOLO. The Viola-Jones detector, which is based on Haar-like features and a cascade of AdaBoost classifiers, is among the best-known face detection algorithms.

                                        VIOLA JONES ALGORITHM 

Paul Viola and Michael Jones proposed the algorithm in 2001. It was the first object detection framework to give viable results in real-time situations. Although it was motivated by the problem of face detection, it can be trained to detect other object classes. Its implementation is available in OpenCV as cvHaarDetectObjects() in the legacy C API; newer OpenCV releases expose the same detector through the CascadeClassifier class and its detectMultiScale() method. It is preferred for its robustness and its fast detection of full-frontal, upright faces in practical situations.

It comprises four stages:

  • Haar Feature Selection
  • Creating an Integral Image
  • AdaBoost Training
  • Cascading Classifiers

Given an image, the algorithm looks at many smaller subregions and tries to find a face by looking for specific features in each subregion.
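To make the idea of "many smaller subregions" concrete, here is a minimal sketch of a multi-scale sliding window. The function name and the window size, step, and scale factor are illustrative assumptions and do not come from this article.

```python
def sliding_windows(width, height, window=24, step=4, scale=1.25):
    """Yield (x, y, size) for square subregions of an image at several scales.

    window, step, and scale are illustrative defaults (assumptions).
    """
    size = window
    while size <= min(width, height):
        # Slide a size-by-size window over the image, top to bottom, left to right
        for y in range(0, height - size + 1, step):
            for x in range(0, width - size + 1, step):
                yield x, y, size
        # Grow the window so that larger faces can be found as well
        size = int(size * scale)


windows = list(sliding_windows(48, 48))
print(len(windows), "subregions to examine in a 48x48 image")
```

Even a tiny image produces many candidate subregions, which is why the later stages focus so heavily on making each check cheap.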

1. Haar-Like Features

A Haar-like feature considers adjacent rectangular regions at a specific location in a detection window, sums the pixel intensities in each region, and calculates the difference between these sums. This difference is used to categorize subsections of an image. For example, in human faces the region of the eyes is commonly darker than the region of the cheeks, so a common Haar feature for face detection is a set of two adjacent rectangles that lie over the eye region and the cheek region. The position of these rectangles is defined relative to a detection window, which acts like a bounding box around the target object (the face in this case).

The difference between the sums of the pixels inside the rectangles, at any position and scale within the image, is known as a simple rectangular Haar-like feature. A feature built from two rectangles is called a 2-rectangle feature; Viola and Jones also defined 3-rectangle and 4-rectangle features. The values indicate certain characteristics of a particular area of the image.

Value = Σ (pixels in black area) – Σ (pixels in white area) 
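As an illustration of the formula above, the following sketch computes a vertical 2-rectangle feature directly from pixel values. The function names and the toy 4x4 image are assumptions made for this example; the upper half plays the role of the black area and the lower half the white area.

```python
def region_sum(image, x, y, w, h):
    """Sum the pixel intensities in the w-by-h rectangle with top-left corner (x, y)."""
    return sum(image[row][col] for row in range(y, y + h) for col in range(x, x + w))


def two_rectangle_feature(image, x, y, w, h):
    """Vertical 2-rectangle feature: sum of upper (black) half minus lower (white) half."""
    half = h // 2
    black = region_sum(image, x, y, w, half)         # e.g. the eye band
    white = region_sum(image, x, y + half, w, half)  # e.g. the cheek band
    return black - white


# Toy 4x4 grayscale image: a dark band (10) above a bright band (200)
image = [
    [10, 10, 10, 10],
    [10, 10, 10, 10],
    [200, 200, 200, 200],
    [200, 200, 200, 200],
]
print(two_rectangle_feature(image, 0, 0, 4, 4))  # -1520
```

A value far from zero means the two bands differ strongly in brightness, while a uniform region gives a value of zero.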

2. Creating an Integral Image

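A summed-area table, or integral image, is a data structure and algorithm for quickly and efficiently computing the sum of values in a rectangular subset of a grid. Once the table is built, the sum over any rectangle can be obtained from just four table lookups, which is what makes evaluating Haar-like features fast. In an integral image, the value at each point is the sum of all pixels above and to the left of it, including the target pixel itself. Writing i for the original image and I for the integral image:

I(x, y) = Σ i(x′, y′) over all x′ ≤ x and y′ ≤ y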
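The definition can be sketched in a few lines of plain Python. The function names are assumptions for this example, and the table is padded with an extra row and column of zeros (a common convention, not something stated in this article), so that ii[y + 1][x + 1] holds the sum up to and including pixel (x, y).

```python
def integral_image(image):
    """Build a zero-padded summed-area table.

    ii[y + 1][x + 1] is the sum of all pixels above and to the left of
    (x, y), including (x, y) itself.
    """
    height, width = len(image), len(image[0])
    ii = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += image[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the w-by-h rectangle with top-left corner (x, y), using four lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
ii = integral_image(image)
print(ii[3][3])                 # 45, the sum of the whole image
print(rect_sum(ii, 1, 1, 2, 2))  # 28 = 5 + 6 + 8 + 9
```

Because rect_sum costs the same regardless of rectangle size, a Haar-like feature at any scale can be evaluated in constant time once the table exists.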


3. AdaBoost Training

AdaBoost is a type of ensemble technique (Boosting) in Machine Learning. It combines a set of weak learners to form a strong learner. Boosting is a sequential process, wherein each subsequent learner attempts to rectify the errors made by the previous learner in the sequence.

In the Viola-Jones framework, each Haar-like feature corresponds to a weak learner. AdaBoost checks the performance of all classifiers to decide the type and the size of a feature that goes into the final classifier.

To compute the performance of the classifier, we evaluate it on all subregions of all the images used for training.

  • Some subregions will produce a strong response in the classifier and will be classified as positives. It means the classifier thinks that it contains a human face.
  • Subregions that do not produce a strong response are classified as negative. The classifier thinks that they do not contain a human face.

The classifiers that perform better are assigned a higher importance, or weight. The final (strong) classifier obtained is called a boosted classifier, and it contains the best-performing weak classifiers.

It is an adaptive algorithm because, as training progresses, it gives more emphasis to the examples that were misclassified. The weak classifiers that perform better on these hard examples are assigned a higher weight than the others.
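The reweighting described above can be sketched with textbook discrete AdaBoost over threshold classifiers ("stumps"). This is not necessarily the exact variant used in the Viola-Jones paper. All names are assumptions, and the toy one-feature dataset simply stands in for Haar feature values computed on training subregions (label 1 for face, -1 for non-face).

```python
import math


def stump_predict(sample, feature, threshold, polarity):
    """Weak classifier: vote `polarity` if the feature value reaches the threshold."""
    return polarity if sample[feature] >= threshold else -polarity


def best_stump(samples, labels, weights):
    """Return (error, feature, threshold, polarity) with the lowest weighted error."""
    best = None
    for feature in range(len(samples[0])):
        for threshold in sorted({s[feature] for s in samples}):
            for polarity in (1, -1):
                error = sum(
                    w for w, s, y in zip(weights, samples, labels)
                    if stump_predict(s, feature, threshold, polarity) != y
                )
                if best is None or error < best[0]:
                    best = (error, feature, threshold, polarity)
    return best


def adaboost(samples, labels, rounds):
    """Train an ensemble of (alpha, feature, threshold, polarity) tuples."""
    weights = [1.0 / len(samples)] * len(samples)
    ensemble = []
    for _ in range(rounds):
        error, feature, threshold, polarity = best_stump(samples, labels, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)  # better stumps get more say
        ensemble.append((alpha, feature, threshold, polarity))
        # Increase the weight of misclassified samples, decrease the rest
        for i, (s, y) in enumerate(zip(samples, labels)):
            pred = stump_predict(s, feature, threshold, polarity)
            weights[i] *= math.exp(-alpha * y * pred)
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, sample):
    """Strong classifier: sign of the weighted vote of all weak classifiers."""
    score = sum(a * stump_predict(sample, f, t, p) for a, f, t, p in ensemble)
    return 1 if score >= 0 else -1


# Toy data: no single threshold separates it, but three stumps together do
samples = [[x] for x in range(10)]
labels = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
model = adaboost(samples, labels, rounds=3)
print([predict(model, s) for s in samples] == labels)
```

No single stump classifies this data correctly, yet the weighted vote of three does, which is the sense in which weak learners combine into a strong one.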

4. Cascading Classifiers

Viola and Jones evaluated thousands of classifiers that specialized in finding faces in images. Since it was computationally expensive to run all these classifiers on every region in every image, they introduced the concept of cascading classifiers. The job of the cascade is to quickly discard non-faces and avoid wasting time and computation, which is necessary to achieve the speed needed for real-time face detection.

In an image, most of the area is non-face region. It is therefore better to first check whether a window is not a face region; if it is not, discard it in a single shot and do not process it again. Instead, focus on regions where there could be a face.

For this, they introduced the concept of a cascade of classifiers. Instead of applying all the features to a window, group the features into different stages of classifiers and apply them one by one. (Normally the first few stages contain very few features.) If a window fails the first stage, discard it; the remaining features are not considered for it. If it passes, apply the second stage of features and continue the process. A window that passes all stages is a face region.
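The early-rejection logic can be sketched as follows. The stage layout, feature names, weights, and thresholds are all made up for illustration. In a real detector they would come from AdaBoost training, and the feature values would be computed from the integral image.

```python
def make_stump(feature_name, threshold):
    """Weak classifier: votes 1 if the named feature value reaches the threshold."""
    return lambda window: 1 if window[feature_name] >= threshold else 0


# Hypothetical cascade: the first stage is cheap, later stages use more features
STAGES = [
    {"classifiers": [(1.0, make_stump("eye_band", 50))], "threshold": 1.0},
    {"classifiers": [(0.6, make_stump("nose_bridge", 30)),
                     (0.4, make_stump("mouth_band", 20))], "threshold": 0.5},
]


def evaluate(window, stages=STAGES):
    """Return (is_face, stages_run), stopping at the first stage the window fails."""
    for index, stage in enumerate(stages, start=1):
        score = sum(alpha * clf(window) for alpha, clf in stage["classifiers"])
        if score < stage["threshold"]:
            return False, index  # rejected: later stages are never evaluated
    return True, len(stages)


# Windows are represented here as dicts of precomputed feature values (an assumption)
background = {"eye_band": 5, "nose_bridge": 80, "mouth_band": 40}
face_like = {"eye_band": 90, "nose_bridge": 45, "mouth_band": 10}
print(evaluate(background))  # (False, 1): discarded after the first stage
print(evaluate(face_like))   # (True, 2): passed every stage
```

Since most windows in an image are background, most of them exit at the first, cheapest stage, and only promising windows pay for the full set of features.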

A link to the implementation of this algorithm will be posted shortly. The output of the face detection module is shown below.