Can We Use Deep Learning to Recognize Human Emotions by Only Looking at Eyes?

John Liao, May 6

Chew, if only you could see what I’ve seen with your eyes

Using machines to recognize human affect (a psychology term referring to feelings and emotions) has always been a challenging topic that attracts huge interest from both academia and industry.

There are currently a variety of approaches to recognizing human affect and emotions, such as the analysis of body language and voice intonation, and more involved methods like MRI and EEG.

However, to me, the more interesting and practical approach to affect and emotion recognition in the age of deep neural networks is computer vision: primarily looking at images of faces to analyze facial expressions.

In a world where portable devices such as smartphones, VR and AR headsets like Oculus Rift and HoloLens, and gaming consoles like Nintendo Switch keep gaining popularity, there is huge potential to change how people interact with these devices (or vice versa) if it is possible to accurately predict or recognize their feelings or emotions while they use such electronics.

However, acquiring a facial image for accurate recognition usually requires the full attention of the human subject, a view of the whole face, and precise timing when capturing the image (capturing a face in transition can result in only a partial facial image).

Furthermore, when a subject’s face is partially occluded with masks, glasses, objects, or even a beard, the accuracy of recognizing affect and emotions can decrease drastically with current models.

With such challenges in mind, in this post, we will explore if it is possible to predict the affect of a human being by aiming our attention on just part of the face.

More precisely, what if the ocular region of a subject can convey just as much about an individual’s affect as looking at their whole face, especially through a portable device?

I. About Data Set

The AffectNet data set was used. As its name suggests, it is one of the largest collections of face images labeled with affect and emotion, with a size of 122 GB.

It provides approximately one million labeled images, which were obtained by querying three different search engines using 1250 emotion-related keywords in six different languages.

Facial images with the same expression/emotion are put into the same category (happy, sad, etc.).


It is not freely available and requires consent from the owners.

The data is split into two groups: manually and automatically annotated.

Instead of the facial expression label, such as “happy” or “angry”, which is a categorical (discrete) variable, the affect labels we are interested in and will use for each facial image are two continuous numerical variables: valence and arousal.

Valence describes how unpleasant/negative to pleasant/positive an event is, while arousal ranges from soothing/calming to exciting/agitating.

In both cases, the values have the following range: [-1, 1].

Using such continuous variables seems to better represent the human […]

You can read in detail about how they collected and labeled the data here.


II. Preprocessing

The first image (top) demonstrates visually the constructed bounding box before and after expanding its size, as well as the calculated rotation θ.

The second image (bottom) is the eye slot derived from preprocessing.

A new data set was derived by extracting ocular regions from AffectNet.

Using the facial landmark points provided with the data set for both eyes, both eyebrows, and the closest nose point (such landmarks are usually obtained through another deep neural network, but luckily the data set already included them), an initial bounding box, the minimum-area box enclosing these points, was calculated for each image.

The bounding box was then enlarged by 10% horizontally and 25% vertically while keeping its center point the same.

Since the center line of a face rarely aligns to the horizontal of the image, many of the bounding boxes needed to be rotated.

Therefore, the degree of rotation θ was also determined and the image was rotated around the center point of the bounding box before cropping.

The process is depicted by the image on the left.
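The geometry of this step can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the box is taken to be axis-aligned, θ is assumed to be the angle of the line through the two eye centers (the post does not say how θ was determined), and the landmark coordinates and function names are hypothetical.

```python
import math

def bounding_box(points):
    """Smallest axis-aligned box (x_min, y_min, x_max, y_max) enclosing the points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)

def expand_box(box, grow_w=0.10, grow_h=0.25):
    """Grow the box by the given fractions while keeping its center fixed."""
    x_min, y_min, x_max, y_max = box
    cx, cy = (x_min + x_max) / 2, (y_min + y_max) / 2
    half_w = (x_max - x_min) * (1 + grow_w) / 2
    half_h = (y_max - y_min) * (1 + grow_h) / 2
    return cx - half_w, cy - half_h, cx + half_w, cy + half_h

def rotation_degrees(left_eye_center, right_eye_center):
    """Angle of the eye line against the image horizontal (assumed definition of theta)."""
    dx = right_eye_center[0] - left_eye_center[0]
    dy = right_eye_center[1] - left_eye_center[1]
    return math.degrees(math.atan2(dy, dx))

# Hypothetical landmark coordinates in pixels (x, y); not taken from AffectNet.
landmarks = [(40, 60), (80, 50), (120, 48), (160, 55), (100, 90)]
box = expand_box(bounding_box(landmarks))
theta = rotation_degrees((60, 58), (140, 52))
```

With `box` and `theta` in hand, the image would be rotated around the box center and then cropped, using whichever image library is at hand.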

Due to the serial nature of the preprocessing of the images (loading one image at a time and performing cropping and rotation), the process could be incredibly slow; it took a total of 6 hours to preprocess both training and validation data.

In case a change is needed for some part of our preprocessing, we would have to wait another 6 hours for a new run, which is certainly not desirable in terms of cost and time efficiencies.

Hence, the whole process needed to be parallelized: specifically, we wanted to partition our input across multiple workers so that each worker could process a sub-portion of the images.

In the end, Apache Spark (through PySpark), backed by Pandas, was used to orchestrate and distribute the image processing.

We rented a few compute-optimized EC2 instances (set up through some awesome CloudFormation scripts my peer wrote) and set up the Spark framework. The code had to be changed to take advantage of Spark’s parallel computing capability and data locality, but the preprocessing in the cloud then took just under 10 minutes, which is quite impressive.

The processing speed certainly can be scaled up even further by adding more computing cores to our instance(s).
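To make the partitioning idea concrete, here is a minimal sketch. It is not the Spark code used for this project: it swaps in a thread pool from Python’s standard library purely to show the input being split into chunks, with each worker handling one chunk. The file names and the `preprocess` placeholder are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(items, num_parts):
    """Split items into num_parts roughly equal, contiguous chunks."""
    size, extra = divmod(len(items), num_parts)
    chunks, start = [], 0
    for i in range(num_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

def preprocess(path):
    """Placeholder for the per-image crop-and-rotate step (hypothetical)."""
    return path.replace(".jpg", "_eyes.jpg")

def process_chunk(chunk):
    """Each worker handles its own sub-portion of the images."""
    return [preprocess(path) for path in chunk]

# Hypothetical file names standing in for the AffectNet images.
image_paths = [f"img_{i:04d}.jpg" for i in range(10)]
chunks = partition(image_paths, num_parts=4)

# pool.map keeps the chunk order, so results line up with image_paths.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = [out for part in pool.map(process_chunk, chunks) for out in part]
```

Threads in CPython will not speed up CPU-bound image work by themselves; the point of a framework like Spark is that the same chunk-per-worker pattern runs across many cores and machines.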

Using Spark to process this huge amount of data made me realize that data science workloads are quite different from web-scale workloads.

According to my peers, in many web-scale workloads engineers are solving for high availability through load balancing, distribution of compute across multiple resources, and fault tolerance.

For data science workloads, such as processing data or modeling, we seem to be solving more for getting in and out as fast as possible using really powerful, expensive machinery.

Many of the tools are similar in that you still want the repeatability that tools like AWS CloudFormation afford you.

However, you only want to use the resources for the amount of time that the processing is actually occurring.


III. Modeling with Convolution Neural Network

VGG16 Configuration

VGG16

We focused on VGGNet-like architectures, as they have been proven to yield good results in image classification and are computationally less expensive than models with deeper layers.

The architectures are composed of groups of layers called blocks, and we implemented the configurations shown on the left.

A block consists of two convolution layers, and each convolution used a kernel of 3 x 3.

In a block, a convolution layer was followed by a batch normalization layer and a rectified linear unit (ReLU) activation layer.

Following a block, a max pool of 2 by 2 was used.

After the blocks, fully connected (or dense) layers of different sizes were added.

Finally, at the end of each DCNN (deep convolutional neural network) configuration, a fully connected layer consisting of two units with a linear activation was added, one unit for each response variable (valence and arousal).

The model was built with TensorFlow Keras.
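As a rough feel for how such blocks transform an eye slot, the sketch below traces the tensor shape and counts trainable parameters through a stack of blocks. It is not the Keras model itself. It assumes 3 x 3 kernels, stride-1 convolutions with “same” padding and biases, and a 2 x 2 max pool after every block; the input size and filter counts are hypothetical because the post does not state them.

```python
def conv_params(kernel, c_in, c_out):
    """Weights plus biases of one 2D convolution layer."""
    return kernel * kernel * c_in * c_out + c_out

def trace_blocks(height, width, channels, block_filters, kernel=3, pool=2):
    """Follow a tensor through VGG-style blocks.

    Each block is two (conv -> batch norm -> ReLU) stages followed by one
    max pool. With 'same' padding and stride 1 (assumed), only the pooling
    changes the spatial size.
    """
    total = 0
    for filters in block_filters:
        total += conv_params(kernel, channels, filters)  # first convolution
        total += conv_params(kernel, filters, filters)   # second convolution
        total += 2 * 2 * filters  # trainable scale and shift of two batch-norm layers
        channels = filters
        height, width = height // pool, width // pool    # 2 x 2 max pool
    return (height, width, channels), total

# Hypothetical input size and filter counts; the post does not state them.
shape, n_params = trace_blocks(64, 128, 3, block_filters=[32, 64])
```

Here `shape` comes out as `(16, 32, 64)`: each max pool halves the height and width, while the number of channels follows the filter count of the latest block.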

Wide ResNet with Cut-out Transformation on Input

A separate model inspired by Wide ResNet was built using PyTorch, together with the “Cut out” input method inspired by this paper. The idea is to randomly cut out a small patch of each input image during training to encourage the model to look at more subtle features instead of only the major ones. I never got a chance to run it, though: training the VGG already took a full 6 days on the data set, and from my personal experience Wide ResNet with cut-out took about 4 times as long to train on the CIFAR10 data set as a VGG16 model.

I hope I can run it in the future to see if it can achieve better results.
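The cut-out idea itself fits in a few lines. The sketch below is a plain-Python illustration on a 2D list, not the PyTorch implementation mentioned above. The function name, the patch size, and the choice to let the patch overhang the image border are assumptions made for this example.

```python
import random

def cutout(image, patch_size, rng=random):
    """Return a copy of a 2D image with one random square patch set to zero."""
    height, width = len(image), len(image[0])
    # Pick the patch center anywhere in the image; the patch may overhang the border.
    cy, cx = rng.randrange(height), rng.randrange(width)
    top, bottom = max(0, cy - patch_size // 2), min(height, cy + patch_size // 2)
    left, right = max(0, cx - patch_size // 2), min(width, cx + patch_size // 2)
    masked = [row[:] for row in image]  # copy so the original stays untouched
    for y in range(top, bottom):
        for x in range(left, right):
            masked[y][x] = 0
    return masked

# Hypothetical 8 x 8 grayscale image filled with ones.
image = [[1] * 8 for _ in range(8)]
augmented = cutout(image, patch_size=4, rng=random.Random(0))
```

Because a fresh patch position is drawn on every call, the model sees a differently masked version of the same eye slot in each epoch.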


IV. Training the Models

The more data we train our model with, and the more varied it is, the better the model becomes with respect to accuracy and robustness.

One technique used in order to increase the size of the data set as well as introduce perturbations to images is data augmentation.

While preprocessing constructs standardized eye slots offline by leveraging facial landmarks, augmentation perturbs and standardizes the eye slot size at run time, constructing a practically unlimited number of different eye slots from one.

During training, eye slot augmentation with different types of perturbations (transformations) was applied to the original eye slot.

For each type, a random value within a defined range was generated and used to transform the image.

The transformations include brightness change, rotation, width shift, height shift, shear, and horizontal flip. The previously mentioned cut-out was not used when training our two VGG models.
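Drawing one random value per transformation type can be sketched as follows. The ranges are hypothetical placeholders, since the post lists the transformation types but not their limits, and the sampled values would still have to be handed to an image library to actually transform the eye slot.

```python
import random

# Hypothetical ranges; the post lists the transformation types but not their limits.
AUGMENTATION_RANGES = {
    "brightness": (0.8, 1.2),     # multiplicative factor
    "rotation": (-10.0, 10.0),    # degrees
    "width_shift": (-0.1, 0.1),   # fraction of image width
    "height_shift": (-0.1, 0.1),  # fraction of image height
    "shear": (-5.0, 5.0),         # degrees
}

def sample_augmentation(rng):
    """Draw one random value per transformation type, plus a coin flip for mirroring."""
    params = {name: rng.uniform(lo, hi) for name, (lo, hi) in AUGMENTATION_RANGES.items()}
    params["horizontal_flip"] = rng.random() < 0.5
    return params

rng = random.Random(42)
params = sample_augmentation(rng)
```

Sampling a new `params` dictionary for every image in every batch is what turns one stored eye slot into many slightly different training examples.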

Since this is a dual regression problem (we are trying to learn both valence and arousal using the same filters in one model), the mean squared error was selected as the loss function and minimized using Adam [20].

We varied the optimizer per run to use different learning rates α ∈ {10⁻³, 10⁻⁴, 10⁻⁵}, with β1 = 0.9 and β2 = 0.…


For each run, training was performed with a batch size γ = 32.

While model M1 was run for 35 epochs, model M2 was run for 50 epochs.

Training and Validation Loss for Model M1

Training and Validation Loss for Model M2

V. Validation

To evaluate the models on the test data set, we calculated not only the root mean squared error (RMSE), which can be heavily influenced by outliers, but also Pearson’s correlation coefficient (CORR), which addresses that issue; the concordance correlation coefficient (CCC), which is an improved version of CORR; and the sign agreement metric (SAGR), as the predicted sign of valence is extremely important here.
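For readers who want to see how these four metrics differ, here is a small plain-Python sketch using their standard definitions. The function names and the sample values are made up for illustration; they are not results from our models.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(a, b):
    """Population covariance; covariance(a, a) is the variance of a."""
    ma, mb = mean(a), mean(b)
    return mean([(x - ma) * (y - mb) for x, y in zip(a, b)])

def rmse(y_true, y_pred):
    return math.sqrt(mean([(t - p) ** 2 for t, p in zip(y_true, y_pred)]))

def corr(y_true, y_pred):
    """Pearson's correlation coefficient."""
    return covariance(y_true, y_pred) / math.sqrt(
        covariance(y_true, y_true) * covariance(y_pred, y_pred))

def ccc(y_true, y_pred):
    """Concordance correlation coefficient; also penalizes offsets in mean and scale."""
    mean_gap = (mean(y_true) - mean(y_pred)) ** 2
    return 2 * covariance(y_true, y_pred) / (
        covariance(y_true, y_true) + covariance(y_pred, y_pred) + mean_gap)

def sign(value):
    return (value > 0) - (value < 0)

def sagr(y_true, y_pred):
    """Fraction of samples whose predicted sign matches the true sign."""
    return mean([1.0 if sign(t) == sign(p) else 0.0 for t, p in zip(y_true, y_pred)])

# Hypothetical valence values in [-1, 1], for illustration only.
truth = [-0.5, 0.0, 0.5, 1.0]
preds = [-0.3, 0.2, 0.7, 1.2]  # every prediction is shifted by +0.2
```

On this toy data CORR is 1, because the predictions follow the truth perfectly up to a constant shift, while CCC drops to about 0.94 because it penalizes that shift, and SAGR is 0.75 because one predicted sign is wrong.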

Summarized results for valence (V) and arousal (A) can be found in the tables below for models M1 and M2.

Evaluation for Model M1

Evaluation for Model M2

The results here are comparable to some major research results on AffectNet that use the whole facial features.






It seems that it is indeed possible to use the ocular region of the face to infer the affect of a person.

Within the constraints of the data set and the models used, the results show that predicting arousal from the ocular region is more accurate than predicting valence.

This seems to make sense if we take into consideration that the mouth is another common conveyor of valence, especially when we smile.

I’m more than excited to see where future research in DNNs can lead us in recognizing human emotions through computer vision.

Can we eventually realize the future world that Blade Runner 2049 depicts (well, hopefully just the advanced technology, not the dystopian cyberpunk world), where a machine is capable of detecting the subtlest emotions from the eyes of a human, or maybe, a robot?
