This repository contains the dataset and the implementation code for the paper MultiEmo: multi-task framework for emoji prediction.
- data/ contains the raw and preprocessed datasets for train, validation, and test. preprocessing.py preprocesses the raw text. The pre-trained model weights and the tokenizer vocabulary are also expected here.
- scripts/ contains the code that implements the model and reproduces the results in the paper.
- checkpoints/ is the directory where checkpoints of the trained model, such as weights and optimizer state, are saved.
- README.md
- requirements.txt
Environment setup
requirements.txt lists the dependencies needed to run the code in this repository. Note that a CUDA device is required.
The dependencies can be installed with:
pip install -r requirements.txt
Training setup
You can download the pre-trained model checkpoints from the TorchMoji repo. We also use the TorchMoji tokenizer, so you should download its vocabulary here. Both the pre-trained weights and the vocabulary file are expected to be located in data/.
Our experiments need two datasets, one for each kind of task.
The first is the Twitter emoji dataset, which we collected using the REST API. To account for the imbalance in actual emoji usage on social media, we only considered the top 64 emojis.
✅✨🌚🎉🎶👀👇👌👍👏👑💀💔💕💖💗💙💚💛💜💞💪💯🔥😀😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏😐😑😒😔😕😘😜😝😞😡😢😣😤😩😪😫😬😭😱😳😴🙈🙌🙏
These are the top 64 emojis, which we took from emojitracker.
The Twitter dataset can be found in ./data/Twitter.csv. Each row represents a Twitter post that includes at least one of the top 64 emojis.
The data we used for training includes only posts with a single emoji. You can preprocess the raw data by running the following script. After running ./data/preprocessing.py, the preprocessed dataset is split into 3 files for train, validation, and test, respectively.
python ./data/preprocessing.py --data_path Twitter.csv
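The filtering and splitting described above can be sketched roughly as follows. This is a minimal illustration, not the contents of preprocessing.py: the function names, the abbreviated emoji set, the 80/10/10 split ratios, and the reading of "only one emoji" as "exactly one occurrence of a top-64 emoji" are all assumptions.

```python
import random

# Abbreviated stand-in for the top-64 set listed above (assumption: the
# real script would use all 64 emojis).
TOP_EMOJIS = {"😂", "😍", "🔥", "💕"}


def extract_single_emoji(post):
    """Return (text, emoji) if the post has exactly one top emoji, else None."""
    found = [ch for ch in post if ch in TOP_EMOJIS]
    if len(found) != 1:
        return None
    emoji = found[0]
    # Remove the emoji from the text and normalize whitespace.
    text = " ".join(post.replace(emoji, " ").split())
    return text, emoji


def split_dataset(samples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle and split samples into train, validation, and test lists."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n_train = int(len(samples) * ratios[0])
    n_valid = int(len(samples) * ratios[1])
    train = samples[:n_train]
    valid = samples[n_train:n_train + n_valid]
    test = samples[n_train + n_valid:]  # the remainder goes to test
    return train, valid, test


posts = ["so funny 😂", "love it 😍😍", "this is fire 🔥", "no emoji here"]
samples = [s for s in map(extract_single_emoji, posts) if s is not None]
train, valid, test = split_dataset(samples)
```

In this sketch the emoji becomes the label and is stripped from the input text, so the classifier cannot read the answer from the post itself.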
For emotion detection, we used GoEmotions, which was released here. GoEmotions is a dataset of 58,000 Reddit comments labeled with 28 emotions. All the comments are also labeled with a hierarchical grouping (positive, negative, ambiguous + neutral) and with Ekman emotions (anger, disgust, fear, joy, sadness, surprise + neutral). To exclude ambiguous data as much as possible, we removed all comments labeled as neutral during the training stage. Since the GoEmotions dataset is already published as separate train, validation, and test sets, we only applied the same preprocessing pipeline that we used for the Twitter dataset.
You can run train.py with the following arguments:
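As a rough sketch of the neutral-removal step and of regrouping fine-grained labels into Ekman emotions, consider the following. The data layout (a list of `(text, labels)` pairs), the function names, and the small mapping excerpt are assumptions for illustration; the full label mapping should be taken from the GoEmotions release rather than from this snippet.

```python
# Excerpt of a fine-grained -> Ekman mapping (illustrative only; the
# complete mapping ships with the GoEmotions release).
EKMAN_EXCERPT = {
    "anger": "anger",
    "annoyance": "anger",
    "joy": "joy",
    "amusement": "joy",
    "fear": "fear",
    "grief": "sadness",
}


def drop_neutral(comments):
    """Remove every comment whose label list contains 'neutral'."""
    return [(text, labels) for text, labels in comments
            if "neutral" not in labels]


def to_ekman(labels, mapping=EKMAN_EXCERPT):
    """Map fine-grained labels to sorted, de-duplicated Ekman labels."""
    return sorted({mapping[label] for label in labels if label in mapping})


# Hypothetical toy comments; GoEmotions comments can carry several labels.
comments = [
    ("That made my day!", ["joy", "amusement"]),
    ("It is what it is.", ["neutral"]),
    ("Stop doing that.", ["annoyance"]),
]
kept = drop_neutral(comments)
ekman = [(text, to_ekman(labels)) for text, labels in kept]
```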
Name | Required | Type | Default | Options |
---|---|---|---|---|
aux_num | Yes | int | - | 1,2,3 |
aux_task | Yes | str | - | 'emo', 'emo sent' |
gpu_num | Yes | int | - | 1,2 |
learning_rate | No | float | 1e-4 | - |
batch_size | No | int | 64 | - |
num_epoch | No | int | 50 | - |
save_cp | No | bool | True | - |
patience | No | int | 0 | - |
early_stop | No | int | 2 | - |
decay | No | bool | False | - |
fine_tuning | No | bool | False | - |
pre_trained | No | bool | True | - |
There is one single-task emoji classifier and three types of multi-task classifiers, which we call "MultiEmo". You can select the type of model to train through the argument "aux_task". Each auxiliary task passed to aux_task can be one of:
- emo: emotion detection labeled for 27 emotions
- Ekman: emotion detection labeled for 6 Ekman emotions
- sent: sentiment detection labeled for 3 sentiment groups (positive, negative, ambiguous)
If you want a multi-task model with more than 1 auxiliary task, you can give several tasks as follows:
python ./scripts/train.py --aux_num 2 --aux_task emo sent --gpu_num 1
Note that the number of tasks passed to aux_task must equal aux_num.
You can test our model with this simple demo:
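To illustrate how several tasks can be passed to one flag and checked against aux_num, here is a minimal argparse sketch. It is not the actual argument handling in train.py: only a subset of the arguments from the table is reproduced, and the helper names `build_parser` and `parse_args` are hypothetical.

```python
import argparse

# Task names as listed in the options above.
AUX_TASKS = ("emo", "Ekman", "sent")


def build_parser():
    """Build a parser covering a subset of the documented arguments."""
    parser = argparse.ArgumentParser(description="Sketch of train.py arguments")
    parser.add_argument("--aux_num", type=int, required=True, choices=[1, 2, 3])
    # nargs="+" collects one or more space-separated task names into a list.
    parser.add_argument("--aux_task", type=str, nargs="+", required=True,
                        choices=AUX_TASKS)
    parser.add_argument("--gpu_num", type=int, required=True, choices=[1, 2])
    parser.add_argument("--learning_rate", type=float, default=1e-4)
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--num_epoch", type=int, default=50)
    return parser


def parse_args(argv=None):
    """Parse arguments and reject a mismatch between aux_num and aux_task."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if len(args.aux_task) != args.aux_num:
        parser.error("the number of aux_task values must equal aux_num")
    return args


args = parse_args(["--aux_num", "2", "--aux_task", "emo", "sent",
                   "--gpu_num", "1"])
```

With this layout, a call such as `--aux_num 1 --aux_task emo sent` exits with a usage error instead of silently training the wrong model.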
python run_multiemo.py
If you find our framework useful in your research, please cite our paper:
@article{lee2022multiemo,
title={MultiEmo: Multi-task framework for emoji prediction},
author={Lee, SangEun and Jeong, Dahye and Park, Eunil},
journal={Knowledge-Based Systems},
pages={108437},
year={2022},
publisher={Elsevier}
}