Multimodal emotion recognition with disentangled representations: private-shared multimodal variational autoencoder and long short-term memory framework
Aims: This study proposes a multimodal emotion recognition framework that combines a private-shared disentangled multimodal variational autoencoder (DMMVAE) with a long short-term memory (LSTM) network, herein referred to as DMMVAE-LSTM. The primary objective is to improve the robustness and generalizability of emotion recognition by effectively leveraging the complementary features of electroencephalogram (EEG) signals and facial expression data.
Methods: We first trained a variational autoencoder using a ResNet-101 architecture on a large-scale facial dataset to develop a robust and generalizable facial feature extractor. This pre-trained model was then integrated into the DMMVAE framework, together with a convolutional neural network-based encoder and decoder for EEG data. The DMMVAE model was trained to disentangle shared and modality-specific latent representations across the EEG and facial data. Following this, the outputs of the encoders were concatenated and fed into an LSTM classifier for emotion recognition.
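The pipeline described above can be sketched as follows. This is a minimal illustrative implementation, not the authors' code: the encoder backbones, latent dimensions, window length, and class count are all assumptions chosen for brevity (the paper uses a ResNet-101 facial encoder and a CNN EEG encoder, simplified here to small linear backbones).

```python
# Hypothetical sketch of a private-shared DMMVAE feeding an LSTM classifier.
# All layer sizes and the 4-class output are illustrative assumptions.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps one modality to private (modality-specific) and shared Gaussian latents."""
    def __init__(self, in_dim, private_dim, shared_dim):
        super().__init__()
        # Stand-in backbone; the paper uses ResNet-101 (face) / a CNN (EEG).
        self.backbone = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU())
        self.mu_p = nn.Linear(128, private_dim)      # private mean
        self.logvar_p = nn.Linear(128, private_dim)  # private log-variance
        self.mu_s = nn.Linear(128, shared_dim)       # shared mean
        self.logvar_s = nn.Linear(128, shared_dim)   # shared log-variance

    def forward(self, x):
        h = self.backbone(x)
        return self.mu_p(h), self.logvar_p(h), self.mu_s(h), self.logvar_s(h)

def reparameterize(mu, logvar):
    # Standard VAE reparameterization trick.
    return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

class DMMVAELSTM(nn.Module):
    def __init__(self, eeg_dim=64, face_dim=256, private_dim=16,
                 shared_dim=32, num_classes=4):
        super().__init__()
        self.eeg_enc = Encoder(eeg_dim, private_dim, shared_dim)
        self.face_enc = Encoder(face_dim, private_dim, shared_dim)
        # Private + shared latents from both modalities are concatenated.
        latent = 2 * (private_dim + shared_dim)
        self.lstm = nn.LSTM(latent, 64, batch_first=True)
        self.head = nn.Linear(64, num_classes)

    def forward(self, eeg_seq, face_seq):
        # eeg_seq: (batch, time, eeg_dim); face_seq: (batch, time, face_dim)
        zs = []
        for t in range(eeg_seq.size(1)):
            mp_e, lp_e, ms_e, ls_e = self.eeg_enc(eeg_seq[:, t])
            mp_f, lp_f, ms_f, ls_f = self.face_enc(face_seq[:, t])
            z = torch.cat([reparameterize(mp_e, lp_e),
                           reparameterize(ms_e, ls_e),
                           reparameterize(mp_f, lp_f),
                           reparameterize(ms_f, ls_f)], dim=-1)
            zs.append(z)
        out, _ = self.lstm(torch.stack(zs, dim=1))
        return self.head(out[:, -1])  # classify from the final time step

model = DMMVAELSTM()
logits = model(torch.randn(2, 10, 64), torch.randn(2, 10, 256))
print(logits.shape)  # torch.Size([2, 4])
```

In training, a reconstruction loss and KL terms on the private and shared latents would be added alongside the classification loss; they are omitted here to keep the sketch focused on the encoder-to-LSTM data flow.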
Results: Two sets of experiments were conducted. First, we trained and evaluated our model on the full dataset, comparing its performance with state-of-the-art methods and a baseline LSTM model employing a late fusion strategy to combine EEG and facial features. Second, to assess robustness, we tested the DMMVAE-LSTM framework under data-limited and modality-dropout conditions by training with partial data and simulating missing modalities. The results demonstrate that the DMMVAE-LSTM framework consistently outperforms the baseline, especially in scenarios with limited data, indicating its capacity to learn structured and resilient latent representations.
Conclusion: Our findings underscore the benefits of multimodal generative modeling for emotion recognition, particularly in enhancing classification performance when training data are scarce or partially missing. By effectively learning both shared and private representations, the DMMVAE-LSTM framework facilitates more reliable emotion classification and presents a promising solution for real-world applications where acquiring large labeled datasets is challenging.
Behzad Mahaseni, Naimul Mefraz Khan
DOI: https://doi.org/10.70401/ec.2025.0010 - June 29, 2025