We propose a novel multimodal-transformer-based architecture (BPMulT), which improves on the MulT model by applying the Crossmodal Transformer twice (biprojection) and by introducing dynamic fusion modules (Fusion-GMU) to combine the information from the modalities.
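For readers unfamiliar with gated fusion, the sketch below illustrates a two-modality gated unit in the spirit of the GMU paper referenced at the end of this README. It is an illustrative sketch only: the names, dimensions, and number of modalities are assumptions, and the actual Fusion-GMU modules in this repo may differ.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Minimal two-modality gated fusion sketch (illustrative, not the repo's Fusion-GMU)."""
    def __init__(self, dim_a: int, dim_b: int, dim_out: int):
        super().__init__()
        self.proj_a = nn.Linear(dim_a, dim_out)          # h_a = tanh(W_a x_a)
        self.proj_b = nn.Linear(dim_b, dim_out)          # h_b = tanh(W_b x_b)
        self.gate = nn.Linear(dim_a + dim_b, dim_out)    # z = sigmoid(W_z [x_a; x_b])

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        h_a = torch.tanh(self.proj_a(x_a))
        h_b = torch.tanh(self.proj_b(x_b))
        z = torch.sigmoid(self.gate(torch.cat([x_a, x_b], dim=-1)))
        return z * h_a + (1.0 - z) * h_b                 # gate weighs each modality per dimension
```

For example, `GatedFusion(768, 300, 768)` would combine a 768-d text vector with a 300-d visual vector into a single 768-d fused representation.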
Movie genre classification (multimodal-4): Uses the Moviescope dataset; from the movie trailers we obtain video frames and audio spectrograms, and we additionally use the movie plot (text) and the movie poster (image).
Movie genre classification (bimodal): Uses the movie plot (text) and the movie poster (image) from the MM-IMDb dataset.
Emotion detection (multimodal-3): Uses the aligned (20) and unaligned versions of the IEMOCAP dataset, which contains monologues performed by actors and provides facial features (vision), transcripts (text), and voice features (audio).
Emotion detection (multimodal-3): Uses the unaligned, polished version of the CMU-MOSEI dataset provided by the Multimodal-End2end-Sparse paper (iemocap_data_noalign.pkl). It contains YouTube monologues on a variety of topics and provides preprocessed video features, transcripts (text), and voice features (audio).
This repo contains the code used for the preprint version of a publication in submission to Elsevier Information Fusion: "Biprojection Multimodal Transformer (BPMulT) for Multimodal Data Classification".
The paper above is a polished version of my MS thesis. If you want more details about our research, we recommend reading the thesis document.
Command example to run the training script on Moviescope:
>> python3 bpmult/train.py --from_seed 1 --model mmtrvapt --batch_sz 8 --gradient_accumulation_steps 8 --savedir /home/User/BPMulT_moviescope_examplePath --data_path /home/User/datasets --task moviescope --visual both --task_type multilabel --num_image_embeds 3 --lr_patience 2 --patience 6 --dropout 0.1 --lr 5e-5 --warmup 0.1 --max_epochs 100 --num_heads 6 --layers 5 --orig_d_v 4096 --hidden_sz 768
Command example to run the training script on MM-IMDb:
>> python3 bpmult/train.py --from_seed 1 --model mmtrvapt --batch_sz 6 --gradient_accumulation_steps 6 --savedir /home/User/BPMulT_mmimdb_examplePath --data_path /home/User/datasets --task mmimdb --visual both --task_type multilabel --num_image_embeds 3 --lr_patience 2 --patience 6 --dropout 0.1 --lr 5e-5 --warmup 0.1 --max_epochs 100 --num_heads 6 --orig_d_v 300 --orig_d_a 1
Command example to run the training script on CMU-MOSEI:
>> python3 bpmult/train.py --from_seed 1 --model mmtrvat --batch_sz 8 --gradient_accumulation_steps 8 --savedir /home/User/BPMulT_cmu-mosei_examplePath --data_path /home/User/datasets --task cmu-mosei --visual both --task_type multilabel --num_image_embeds 3 --lr_patience 2 --patience 5 --dropout 0.1 --lr 5e-5 --warmup 0.1 --max_epochs 100 --num_heads 12 --nlevels 8 --layers 8 --orig_d_v 35 --orig_d_a 74 --hidden_sz 300
Command example to run the training script on IEMOCAP:
>> python3 bpmult/train.py --from_seed 1 --model mmtrvat --batch_sz 8 --gradient_accumulation_steps 8 --savedir /home/User/BPMulT_iemocap_examplePath --data_path /home/User/datasets --task iemocap --visual both --task_type multilabel --num_image_embeds 3 --lr_patience 2 --patience 5 --dropout 0.1 --lr 5e-5 --warmup 0.1 --max_epochs 100 --num_heads 12 --nlevels 8 --layers 8 --orig_d_v 35 --orig_d_a 74 --hidden_sz 300
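All commands combine --batch_sz with --gradient_accumulation_steps, so the effective batch size is roughly their product (e.g. 8 x 8 = 64 on Moviescope). The snippet below is a minimal, self-contained sketch of that accumulation pattern; the model, optimizer, and data are toy placeholders, not the repo's actual training loop.

```python
import torch
import torch.nn as nn

# Toy model/data so the snippet runs on its own; in the repo this would be
# BPMulT and its dataloaders (names here are illustrative only).
model = nn.Linear(10, 4)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
loader = [(torch.randn(8, 10), torch.randint(0, 2, (8, 4)).float()) for _ in range(16)]

accum_steps = 8  # plays the role of --gradient_accumulation_steps
optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = nn.functional.binary_cross_entropy_with_logits(model(x), y)
    (loss / accum_steps).backward()        # scale so gradients average over the group
    if (step + 1) % accum_steps == 0:
        optimizer.step()                   # one update per accumulated group
        optimizer.zero_grad()              # effective batch = batch_sz * accum_steps
```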
- MulT-GMU: Multimodal Weighted Fusion of Transformers for Movie Genre Classification.
- Mult End2end: Multimodal End-to-End Sparse Model for Emotion Recognition.
- Mod-Trans: Modality-Transferable Emotion Embeddings for Low-Resource Multimodal Emotion Recognition.
- MulT: Multimodal Transformer for Unaligned Multimodal Language Sequences.
- MMBT: Supervised Multimodal Bitransformers for Classifying Images and Text.
- Moviescope Dataset: Moviescope: Large-scale Analysis of Movies using Multiple Modalities.
- GMU and MM-IMDb: Gated Multimodal Units for Information Fusion.
- python 3.7.6
- torch 1.5.1
- tokenizers 0.9.4
- transformers 4.2.2
- Pillow 7.0.0
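Assuming a Python 3.7.6 environment is already available, the pinned package versions above can be installed with pip, e.g.:
>> pip install torch==1.5.1 tokenizers==0.9.4 transformers==4.2.2 Pillow==7.0.0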