We propose a novel multimodal-transformer-based architecture (BPMulT), which improves on the MulT model by applying the Crossmodal Transformer twice (biprojection) and by introducing dynamic fusion modules (Fusion-GMU) to combine the information from the modalities.
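For readers unfamiliar with gated fusion, the sketch below illustrates a two-modality gated unit in the spirit of the GMU paper referenced at the end of this README. It is an illustrative sketch only: the names, dimensions, and number of modalities are assumptions, and the actual Fusion-GMU modules in this repo may differ.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Minimal two-modality gated fusion sketch (illustrative, not the repo's Fusion-GMU)."""
    def __init__(self, dim_a: int, dim_b: int, dim_out: int):
        super().__init__()
        self.proj_a = nn.Linear(dim_a, dim_out)          # h_a = tanh(W_a x_a)
        self.proj_b = nn.Linear(dim_b, dim_out)          # h_b = tanh(W_b x_b)
        self.gate = nn.Linear(dim_a + dim_b, dim_out)    # z = sigmoid(W_z [x_a; x_b])

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        h_a = torch.tanh(self.proj_a(x_a))
        h_b = torch.tanh(self.proj_b(x_b))
        z = torch.sigmoid(self.gate(torch.cat([x_a, x_b], dim=-1)))
        return z * h_a + (1.0 - z) * h_b                 # gate weighs each modality per dimension
```

For example, `GatedFusion(768, 300, 768)` would combine a 768-d text vector with a 300-d visual vector into a single 768-d fused representation.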
Movie genre classification (multimodal-4): Uses the Moviescope dataset; from the movie trailers we obtain video frames and audio spectrograms, and we additionally use the movie plot (text) and the movie poster (image).
Movie genre classification (bimodal): Uses the movie plot (text) and the movie poster (image) from the MM-IMDb dataset.
Emotion detection (multimodal-3): Uses the aligned (20) and unaligned versions of the IEMOCAP dataset, which contains monologues performed by actors and provides facial features (vision), transcripts (text), and voice features (audio).
Emotion detection (multimodal-3): Uses the unaligned, polished version of the CMU-MOSEI dataset provided by the Multimodal-End2end-Sparse paper (iemocap_data_noalign.pkl). It contains YouTube monologues on a variety of topics and provides preprocessed video features, transcripts (text), and voice features (audio).
This repo contains the code used for the preprint version of a publication in submission to Elsevier Information Fusion: "Biprojection Multimodal Transformer (BPMulT) for Multimodal Data Classification".
The paper above is a polished version of my MS thesis. If you want more details about our research, we recommend reading the thesis document.
Command example to run the training script on Moviescope:
>> python3 bpmult/train.py --from_seed 1 --model mmtrvapt --batch_sz 8 --gradient_accumulation_steps 8 --savedir /home/User/BPMulT_moviescope_examplePath --data_path /home/User/datasets --task moviescope --visual both --task_type multilabel --num_image_embeds 3 --lr_patience 2 --patience 6 --dropout 0.1 --lr 5e-5 --warmup 0.1 --max_epochs 100 --num_heads 6 --layers 5 --orig_d_v 4096 --hidden_sz 768
Command example to run the training script on MM-IMDb:
>> python3 bpmult/train.py --from_seed 1 --model mmtrvapt --batch_sz 6 --gradient_accumulation_steps 6 --savedir /home/User/BPMulT_mmimdb_examplePath --data_path /home/User/datasets --task mmimdb --visual both --task_type multilabel --num_image_embeds 3 --lr_patience 2 --patience 6 --dropout 0.1 --lr 5e-5 --warmup 0.1 --max_epochs 100 --num_heads 6 --orig_d_v 300 --orig_d_a 1
Command example to run the training script on CMU-MOSEI:
>> python3 bpmult/train.py --from_seed 1 --model mmtrvat --batch_sz 8 --gradient_accumulation_steps 8 --savedir /home/User/BPMulT_cmu-mosei_examplePath --data_path /home/User/datasets --task cmu-mosei --visual both --task_type multilabel --num_image_embeds 3 --lr_patience 2 --patience 5 --dropout 0.1 --lr 5e-5 --warmup 0.1 --max_epochs 100 --num_heads 12 --nlevels 8 --layers 8 --orig_d_v 35 --orig_d_a 74 --hidden_sz 300
Command example to run the training script on IEMOCAP:
>> python3 bpmult/train.py --from_seed 1 --model mmtrvat --batch_sz 8 --gradient_accumulation_steps 8 --savedir /home/User/BPMulT_iemocap_examplePath --data_path /home/User/datasets --task iemocap --visual both --task_type multilabel --num_image_embeds 3 --lr_patience 2 --patience 5 --dropout 0.1 --lr 5e-5 --warmup 0.1 --max_epochs 100 --num_heads 12 --nlevels 8 --layers 8 --orig_d_v 35 --orig_d_a 74 --hidden_sz 300
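All commands combine --batch_sz with --gradient_accumulation_steps, so the effective batch size is roughly their product (e.g. 8 x 8 = 64 on Moviescope). The snippet below is a minimal, self-contained sketch of that accumulation pattern; the model, optimizer, and data are toy placeholders, not the repo's actual training loop.

```python
import torch
import torch.nn as nn

# Toy model/data so the snippet runs on its own; in the repo this would be
# BPMulT and its dataloaders (names here are illustrative only).
model = nn.Linear(10, 4)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
loader = [(torch.randn(8, 10), torch.randint(0, 2, (8, 4)).float()) for _ in range(16)]

accum_steps = 8  # plays the role of --gradient_accumulation_steps
optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = nn.functional.binary_cross_entropy_with_logits(model(x), y)
    (loss / accum_steps).backward()        # scale so gradients average over the group
    if (step + 1) % accum_steps == 0:
        optimizer.step()                   # one update per accumulated group
        optimizer.zero_grad()              # effective batch = batch_sz * accum_steps
```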
- MulT-GMU: Multimodal Weighted Fusion of Transformers for Movie Genre Classification.
- Mult End2end: Multimodal End-to-End Sparse Model for Emotion Recognition.
- Mod-Trans: Modality-Transferable Emotion Embeddings for Low-Resource Multimodal Emotion Recognition.
- MulT: Multimodal Transformer for Unaligned Multimodal Language Sequences.
- MMBT: Supervised Multimodal Bitransformers for Classifying Images and Text.
- Moviescope Dataset: Moviescope: Large-scale Analysis of Movies using Multiple Modalities.
- GMU and MM-IMDb: Gated Multimodal Units for Information Fusion.
- python 3.7.6
- torch 1.5.1
- tokenizers 0.9.4
- transformers 4.2.2
- Pillow 7.0.0
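Assuming a Python 3.7.6 environment is already available, the pinned package versions above can be installed with pip, e.g.:
>> pip install torch==1.5.1 tokenizers==0.9.4 transformers==4.2.2 Pillow==7.0.0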