Adaptive Network Boosts Crucial Facial Expression Regions

Beijing Zhongke Journal Publishing Co. Ltd.

Emotion is a complex state that integrates people's feelings, thoughts and behaviors, and facial expression is one of the most direct signals of a person's innermost thoughts. Facial expression recognition (FER) has therefore attracted many researchers because of its role in practical applications such as human-computer interaction, recommendation systems and patient monitoring. In general, facial expressions are encoded into facial action units through the Facial Action Coding System, and any expression can be described by a set of facial action units. Some facial action units are crucial for FER, such as those located around the eyes and mouth, since they show more obvious movement than other facial regions (such as the cheeks and forehead). In a paper published in Machine Intelligence Research, a team of researchers from Xidian University treats these crucial facial action units as facial crucial regions (FCRs).

Given the significance of FCRs, many studies exploit information from local facial regions, using facial landmarks as prior information to locate the facial crucial regions. However, landmark information is obtained by manual annotation. Early FER studies mostly focused on lab-collected expression datasets, such as CK+, MMI, JAFFE and Oulu-CASIA. In these datasets, facial expression images were collected from several or dozens of individuals under similar conditions (illumination, angle, posture, etc.), generally with few uncontrollable factors. Manually annotating the landmarks of FCRs is therefore relatively easy for lab-collected datasets.

However, compared with lab-controlled datasets, in-the-wild expression datasets such as RAF-DB, AffectNet and EmotionNet are collected under more complex and uncontrollable conditions. For these datasets, especially those containing a vast number of images, manually annotating FCRs is complicated and time-consuming. Moreover, facial postures vary greatly in wild datasets, and even a simple change in posture can shift landmarks by multiple pixels at the image level. One figure in the paper gives an example of landmarks moving as the posture changes, using two expression images and their landmarks from the RAF-DB dataset. The example shows that the position of FCRs varies with facial posture, which further increases the complexity of manually annotating landmarks for FER. It is therefore important to consider whether FCRs or their features could be enhanced spontaneously during the training of a deep FER model, without any prior information such as FCR landmarks.

A further problem is that some FCRs from different expression categories look similar, whereas some FCRs from the same category look very different; one figure in the paper illustrates both cases. Clearly, local information alone is insufficient to build an effective FER model, especially for wild datasets. It therefore remains important to use the global information of the facial expression while FCRs are enhanced in deep facial expression recognition.

Based on these analyses, the researchers propose a new FER method: a local non-local joint network, abbreviated LNLAttenNet, that adaptively enhances the facial crucial regions during deep feature learning. LNLAttenNet considers the local and non-local information of facial expressions simultaneously and is built from two parts, a local multi-network ensemble and a non-local attention network. The local and non-local feature vectors they generate are integrated and jointly optimized in feature learning. Specifically, the attention weights obtained by the non-local part are treated as the significance of the local facial regions and are fed into the local multi-network ensemble to combine the multiple local networks. The researchers find that some FCRs are enhanced automatically during deep feature learning with the proposed method. Moreover, U-Net is employed to generate feature maps in which each pixel has a large receptive field, so that each local region also contains global information. One figure in the paper gives a simplified view of LNLAttenNet and shows that some crucial regions are given higher weights by the network.
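The weighting idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the softmax normalization, the weighted sum and the concatenation with the global vector are assumptions chosen for illustration, and the function names (softmax, fuse_local_features, joint_feature) are hypothetical. Plain Python lists stand in for the feature vectors that the local networks and the non-local network would produce.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_local_features(local_features, region_scores):
    """Combine per-region feature vectors with attention weights.

    local_features: one feature vector (list of floats) per local region.
    region_scores:  one raw attention score per region (assumed to come
                    from a non-local attention branch).
    Returns the normalized weights and the weighted sum of the vectors.
    """
    if len(local_features) != len(region_scores):
        raise ValueError("need exactly one score per local region")
    weights = softmax(region_scores)
    dim = len(local_features[0])
    fused = [0.0] * dim
    for weight, feature in zip(weights, local_features):
        for i in range(dim):
            fused[i] += weight * feature[i]
    return weights, fused


def joint_feature(fused_local, global_feature):
    """Concatenate the fused local vector with the global vector.

    Concatenation is an assumption; the article only says the two kinds
    of feature vectors are integrated and jointly optimized.
    """
    return list(fused_local) + list(global_feature)


# Toy example: four regions with two-dimensional features (made-up values).
local_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
region_scores = [2.0, 0.5, 0.1, 0.1]  # the first region scores highest
weights, fused = fuse_local_features(local_features, region_scores)
joint = joint_feature(fused, [0.2, 0.8, 0.4])
```

In this toy setup the region with the largest score dominates the fused vector, which mirrors the article's point that some regions receive higher weights than others.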

Compared with state-of-the-art methods, the paper makes three main contributions: 1) The researchers propose LNLAttenNet to automatically enhance facial crucial regions in deep feature learning by using the local and non-local information of facial expressions simultaneously. It is believed to be the first study of how to explore and enhance FCRs in CNNs for FER, with FCRs enhanced automatically and without any prior information about facial crucial regions or landmarks. This alleviates the difficulty of annotating facial landmarks in wild facial datasets. 2) In LNLAttenNet, an attention mechanism is used to construct a non-local attention network that explores the significance of local regions for FER from a global view of the facial expression. The attention weights obtained for the local regions are fed into the local multi-network ensemble to integrate the multiple local features, and the integrated local feature is then jointly optimized with the global facial feature. 3) Experimental results demonstrate that FCRs can be enhanced in deep feature learning by LNLAttenNet, which supports the view that FCRs are the more discriminative local regions for FER. The results also imply that a deep FER model can spontaneously focus on some crucial regions during training, which may inspire new designs for deep FER methods.

The rest of the paper is organized as follows. Section 2 reviews related work on deep facial expression recognition. Section 3 details the proposed method, a local non-local attention joint network for FER named LNLAttenNet that adaptively enhances the more crucial local regions of a facial expression. The section consists of three parts: the non-local attention network, the local multi-network ensemble and the joint optimization of LNLAttenNet.

In Section 4, the researchers validate the proposed method from several perspectives: 1) a performance comparison with state-of-the-art methods on benchmark datasets, 2) an analysis of non-local attention, 3) a visualization of local attention, 4) the effect of changing the parameter α, 5) the performance of LNLAttenNet with different values of M, and 6) an analysis of overlapped pixels between local regions.
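To make the last two items concrete, the sketch below shows one way a face image could be divided into overlapping local regions. It rests on assumptions the article does not state: that M denotes the number of local regions, that the regions form a square grid, and that overlap is obtained by extending each cell by a fixed number of pixels. The function name patch_boxes and its parameters are hypothetical.

```python
def patch_boxes(height, width, grid, overlap):
    """Return (top, left, bottom, right) boxes for a grid x grid split.

    Each grid cell is extended by `overlap` pixels on every side and then
    clipped to the image, so neighbouring patches share pixels.
    The number of patches is grid * grid (assumed here to play the role
    of M in the article).
    """
    if grid <= 0 or overlap < 0:
        raise ValueError("grid must be positive and overlap non-negative")
    if height % grid or width % grid:
        raise ValueError("image size must be divisible by the grid size")
    cell_h, cell_w = height // grid, width // grid
    boxes = []
    for row in range(grid):
        for col in range(grid):
            top = max(0, row * cell_h - overlap)
            left = max(0, col * cell_w - overlap)
            bottom = min(height, (row + 1) * cell_h + overlap)
            right = min(width, (col + 1) * cell_w + overlap)
            boxes.append((top, left, bottom, right))
    return boxes


# A 100 x 100 image split into M = 4 regions, without and with overlap.
plain = patch_boxes(100, 100, grid=2, overlap=0)
overlapped = patch_boxes(100, 100, grid=2, overlap=5)
```

With an overlap of 5 pixels, the top-left patch grows from 50 x 50 to 55 x 55 pixels, so pixels near the cell borders belong to more than one local region.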

In summary, the researchers propose LNLAttenNet to explore the significance of facial crucial regions in feature learning for FER without any landmark information. In LNLAttenNet, the global information of the facial expression is used to construct the non-local attention network, and the local information is used to supervise self-information. Through joint optimization of the non-local and local facial feature vectors, LNLAttenNet can adaptively enhance the more crucial regions during deep feature learning. Specifically, an ensemble of multiple networks corresponding to the local regions integrates the local features with the non-local weights, which achieves interactive guidance between global and local facial information. The experimental results also demonstrate that some crucial local regions are effectively enhanced in feature learning by LNLAttenNet even though no landmark information is given to the model during training. Because the proposed method enhances facial crucial regions at the level of multiple patches, the researchers plan to explore pixel-level enhancement of facial expressions in future work.

See the article:

Adaptively Enhancing Facial Expression Crucial Regions via a Local Non-local Joint Network

http://doi.org/10.1007/s11633-023-1417-9
