This project explores and leverages advanced language models to enhance sentiment prediction beyond BERT's capabilities, focusing on two architectures:
- RoBERTa: A Robustly Optimized BERT Pretraining Approach
- XLNet: Generalized Autoregressive Pretraining for Language Understanding
We study both architectures, investigate their training and optimization techniques, and apply them to classify human emotions into distinct categories.
The dataset, named "Emotion," comprises English Twitter messages annotated with six basic emotions: anger, fear, joy, love, sadness, and surprise. Sourced from the Hugging Face library, it consists of three splits:
- Train: 16,000 rows, 2 columns
- Validation: 2,000 rows, 2 columns
- Test: 2,000 rows, 2 columns
The two columns represent labels and text, with labels corresponding to different emotions (0: sadness, 1: joy, 2: love, 3: anger, 4: fear, 5: surprise).
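The label-to-emotion mapping and the DataFrame conversion can be sketched as follows (a minimal sketch: the `ID2LABEL` name, the `emotion` column, and the `to_dataframe` helper are illustrative, not part of the project code):

```python
import pandas as pd

# Label ids used by the "Emotion" dataset on the Hugging Face hub.
ID2LABEL = {0: "sadness", 1: "joy", 2: "love", 3: "anger", 4: "fear", 5: "surprise"}

def to_dataframe(rows):
    """Convert a split of {'text', 'label'} records to a DataFrame and
    add a readable emotion name as a new feature column."""
    df = pd.DataFrame(rows)
    df["emotion"] = df["label"].map(ID2LABEL)
    return df

# Loading the real splits requires network access to the Hugging Face hub:
#   from datasets import load_dataset
#   ds = load_dataset("emotion")           # train / validation / test
#   train_df = to_dataframe(ds["train"])
sample = [{"text": "i feel great", "label": 1},
          {"text": "i am so scared", "label": 4}]
print(to_dataframe(sample)["emotion"].tolist())  # → ['joy', 'fear']
```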
The project aims to build and evaluate two emotion classification models: RoBERTa and XLNet.
- Language: Python
- Libraries: datasets, numpy, pandas, matplotlib, seaborn, ktrain, transformers, tensorflow, sklearn
- Jupyter Notebook
- Google Colab Pro (Recommended)
- Install Required Libraries
- Load 'Emotion' Dataset
- Read Dataset Across Categories
- Convert Dataset to Dataframe and Create a New Feature
- Data Visualization
- Histogram Plots
- RoBERTa Model
- Create RoBERTa model instance
- Split train and validation data
- Perform Data Pre-processing
- Wrap RoBERTa in a ktrain learner object
- Find optimal learning rate
- Fine-tune RoBERTa on the dataset
- Evaluate performance metrics
- Save RoBERTa model
- Apply RoBERTa on test data and assess performance
- Understanding Autoregressive and Autoencoder Models
- XLNet Model
- Load required libraries
- Create XLNet model instance
- Split train and validation data
- Perform Data Pre-processing
- Wrap XLNet in a ktrain learner object
- Find optimal learning rate
- Fine-tune XLNet on the dataset
- Evaluate performance metrics
- Save XLNet model
- Apply XLNet on test data and assess performance
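The RoBERTa and XLNet steps above follow the same ktrain workflow, differing only in the Hugging Face model name. A minimal sketch is shown below; the function name, hyperparameter defaults, and save path are illustrative assumptions, and imports are deferred inside the function because ktrain and TensorFlow are heavy dependencies:

```python
def train_emotion_classifier(x_train, y_train, x_val, y_val, class_names,
                             model_name="roberta-base", maxlen=128,
                             batch_size=16, lr=3e-5, epochs=3):
    """Fine-tune a transformer on the Emotion data with ktrain.
    For XLNet, pass model_name="xlnet-base-cased" instead."""
    import ktrain
    from ktrain import text

    # Create the model instance and pre-process the train/validation splits.
    t = text.Transformer(model_name, maxlen=maxlen, class_names=class_names)
    trn = t.preprocess_train(x_train, y_train)
    val = t.preprocess_test(x_val, y_val)

    # Wrap the classifier in a ktrain learner object.
    learner = ktrain.get_learner(t.get_classifier(),
                                 train_data=trn, val_data=val,
                                 batch_size=batch_size)

    # learner.lr_find(show_plot=True)  # optional: locate a good learning rate
    learner.fit_onecycle(lr, epochs)           # fine-tune on the dataset
    learner.validate(class_names=class_names)  # performance metrics

    # Save a predictor that can be reloaded later without retraining.
    predictor = ktrain.get_predictor(learner.model, preproc=t)
    predictor.save(f"models/{model_name}-emotion")
    return predictor
```

A saved predictor can later be restored with `ktrain.load_predictor(...)` and applied to the held-out test texts via `predictor.predict(texts)`.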
Src Folder
- Engine.py
- ML_Pipeline Folder
ML_Pipeline Folder
- Contains one appropriately named Python file per step; the functions they define are called from Engine.py.
Output Folder
- Contains the best-fitted model trained on this data. This model can be loaded for future use without retraining. Note: the saved model is built on a subset of the data; running Engine.py with the full dataset retrains the models.
Lib Folder
- Contains the original IPython notebooks.