This project aims to classify Twitter posts regarding COVID-19 vaccines into supportive, opposed, or irrelevant categories using various machine learning models including Naive Bayes, Random Forest, and XGBoost. The project leverages embedding models such as COVID-Twitter-BERT (CT-BERT) and Sentence-BERT (SBERT) for feature extraction.
```
LLM-Vaccine-Sentiment-Classifier/
│
├── data/
│   └── processed/
│       └── final_embeddings.xlsx
│
├── src/
│   ├── classifiers/
│   │   ├── naive_bayes_classifier.py
│   │   ├── random_forest_classifier.py
│   │   ├── xgboost_classifier.py
│   │   └── voting_classifier.py
│   └── evaluation/
│       └── cross_validation.py
│
├── .gitignore
├── README.md
└── requirements.txt
```
The data used in this project is stored in an Excel file at `data/processed/final_embeddings.xlsx`, which contains the tweet embeddings and their corresponding labels.
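As a minimal sketch of how this file might be loaded for training (the `label` column name and the one-column-per-embedding-dimension layout are assumptions, not confirmed by the repository):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the precomputed tweet embeddings and their labels.
# NOTE: the "label" column name and the layout (one numeric column
# per embedding dimension) are assumptions about the file format.
df = pd.read_excel("data/processed/final_embeddings.xlsx")

X = df.drop(columns=["label"]).values  # embedding vectors
y = df["label"].values                 # supportive / opposed / irrelevant

# Hold out a stratified test split for the single-split metrics below.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```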
COVID-Twitter-BERT (CT-BERT) is a transformer-based model pretrained on a large corpus of COVID-19-related tweets. It provides contextual embeddings that capture the nuances of language used in tweets about COVID-19 vaccines. The model is available on Hugging Face under the `digitalepidemiologylab` organization.
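The snippet below sketches one common way to extract a tweet embedding with CT-BERT via Hugging Face Transformers; the checkpoint version (`covid-twitter-bert-v2`) and the mean-pooling strategy are assumptions, not the project's confirmed configuration:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# The checkpoint version (v2) and mean pooling are assumptions,
# not this project's confirmed configuration.
model_name = "digitalepidemiologylab/covid-twitter-bert-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

inputs = tokenizer(
    "Vaccines are safe and effective.",
    return_tensors="pt", truncation=True, padding=True,
)
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the last hidden state, ignoring padding tokens.
mask = inputs["attention_mask"].unsqueeze(-1)
embedding = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
print(embedding.shape)  # (1, 1024): CT-BERT is based on BERT-large
```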
Sentence-BERT (SBERT) modifies BERT with siamese and triplet network structures to produce semantically meaningful sentence embeddings that can be compared using cosine similarity. Pretrained SBERT models are available on Hugging Face under the `sentence-transformers` organization.
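A minimal SBERT sketch using the `sentence-transformers` library; the specific checkpoint (`all-MiniLM-L6-v2`) is only an example, not necessarily the one used here:

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# "all-MiniLM-L6-v2" is an example checkpoint, not necessarily
# the one used in this project.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["I got my booster today!", "Vaccine mandates are wrong."]
embeddings = model.encode(sentences)

# SBERT embeddings are designed for cosine-similarity comparison.
print(cosine_similarity([embeddings[0]], [embeddings[1]]))
```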
The Naive Bayes classifier achieved the following performance metrics:
- Accuracy: 0.7107
- Macro F1-score: 0.7067
- Micro F1-score: 0.7107
The Random Forest classifier achieved the following performance metrics:
- Accuracy: 0.7619
- Macro F1-score: 0.7573
- Micro F1-score: 0.7619
The XGBoost classifier achieved the following performance metrics:
- Accuracy: 0.7862
- Macro F1-score: 0.7843
- Micro F1-score: 0.7862
The Ensemble Voting classifier achieved the following performance metrics:
- Accuracy: 0.7655
- Macro F1-score: 0.7622
- Micro F1-score: 0.7655
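For reference, these metrics can be computed with scikit-learn; a sketch assuming a hypothetical fitted classifier `clf` and the held-out split from the loading sketch above:

```python
from sklearn.metrics import accuracy_score, f1_score

# `clf` stands for any fitted classifier; X_test / y_test come from
# the loading sketch above.
y_pred = clf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Macro F1:", f1_score(y_test, y_pred, average="macro"))
# For single-label multiclass predictions, micro F1 equals accuracy,
# which is why those two columns match in the tables below.
print("Micro F1:", f1_score(y_test, y_pred, average="micro"))
```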
Naive Bayes:
- 10-Fold Cross-Validation Macro F1-scores:
  - 0.6869506146908032
  - 0.7057253862391563
  - 0.7082123843455866
  - 0.7154541095017644
  - 0.7081590134676249
  - 0.7506295938657551
  - 0.6943318608661774
  - 0.7183845466880182
  - 0.7140522779181637
  - 0.7203213011698222
- Mean Macro F1-score: 0.7122221088752871
Random Forest:
- 10-Fold Cross-Validation Macro F1-scores:
  - 0.7500277601221228
  - 0.7672225972178861
  - 0.7544733193993615
  - 0.7493114200049827
  - 0.7716516421431496
  - 0.7954943356032286
  - 0.7452530546208895
  - 0.7401998053484454
  - 0.764328231292517
  - 0.758727175921576
- Mean Macro F1-score: 0.7596689341674159
XGBoost:
- 10-Fold Cross-Validation Macro F1-scores:
  - 0.779121962706227
  - 0.7883579530836989
  - 0.7806530126618464
  - 0.7721074197750838
  - 0.7973539872921398
  - 0.8246499105628962
  - 0.7850879434342879
  - 0.8210243046034328
  - 0.7978669817097942
  - 0.814729786451319
- Mean Macro F1-score: 0.7960953262280726
Ensemble Voting:
- 10-Fold Cross-Validation Macro F1-scores:
  - 0.7464253325564107
  - 0.7780922279139973
  - 0.755819448010008
  - 0.7567971660892505
  - 0.7762816890430445
  - 0.8039189027258297
  - 0.7502319668747216
  - 0.7691881495649714
  - 0.7726332069466385
  - 0.7768332848978009
- Mean Macro F1-score: 0.7686221374622674
- Naive Bayes:
  - Mean Macro F1-score: 0.7122
- Random Forest:
  - Mean Macro F1-score: 0.7597
- XGBoost:
  - Mean Macro F1-score: 0.7961
- Ensemble Voting:
  - Mean Macro F1-score: 0.7686
- XGBoost performed best among the individual classifiers, with the highest mean Macro F1-score of 0.7961.
- The Ensemble Voting classifier performed well with a mean Macro F1-score of 0.7686, though it did not surpass XGBoost alone.
- Random Forest and Naive Bayes followed with mean Macro F1-scores of 0.7597 and 0.7122, respectively.
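The project's exact ensemble setup lives in `src/classifiers/voting_classifier.py`; as a rough sketch of how such an ensemble could be built with scikit-learn (the hyperparameters and soft-voting mode here are assumptions):

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier

# Hyperparameters and the soft-voting mode are assumptions, not the
# repository's confirmed configuration.
ensemble = VotingClassifier(
    estimators=[
        ("nb", GaussianNB()),
        ("rf", RandomForestClassifier(n_estimators=300, random_state=42)),
        ("xgb", XGBClassifier(eval_metric="logloss", random_state=42)),
    ],
    voting="soft",  # average predicted class probabilities
)
ensemble.fit(X_train, y_train)  # splits from the loading sketch above
```

Soft voting averages the predicted class probabilities rather than taking a hard majority vote, which tends to help when the base models produce reasonably calibrated probabilities.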
| Model | Accuracy | Macro F1-score | Micro F1-score |
|---|---|---|---|
| Naive Bayes | 0.7107 | 0.7067 | 0.7107 |
| Random Forest | 0.7619 | 0.7573 | 0.7619 |
| XGBoost | 0.7862 | 0.7843 | 0.7862 |
| Ensemble Voting | 0.7655 | 0.7622 | 0.7655 |
| Model | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 | Fold 6 | Fold 7 | Fold 8 | Fold 9 | Fold 10 | Mean Macro F1-score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Naive Bayes | 0.6870 | 0.7057 | 0.7082 | 0.7155 | 0.7082 | 0.7506 | 0.6943 | 0.7184 | 0.7141 | 0.7203 | 0.7122 |
| Random Forest | 0.7500 | 0.7672 | 0.7545 | 0.7493 | 0.7717 | 0.7955 | 0.7453 | 0.7402 | 0.7643 | 0.7587 | 0.7597 |
| XGBoost | 0.7791 | 0.7884 | 0.7807 | 0.7721 | 0.7974 | 0.8246 | 0.7851 | 0.8210 | 0.7979 | 0.8147 | 0.7961 |
| Ensemble Voting | 0.7464 | 0.7781 | 0.7558 | 0.7568 | 0.7763 | 0.8039 | 0.7502 | 0.7692 | 0.7726 | 0.7768 | 0.7686 |
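The per-fold scores above can be reproduced with scikit-learn's cross-validation utilities; a sketch assuming stratified 10-fold splitting (the splitter and seed are assumptions about the original setup) and the `ensemble` model from the sketch above:

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stratified 10-fold splitting and the seed are assumptions about
# the original setup; `ensemble`, X, and y come from the sketches above.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(ensemble, X, y, cv=cv, scoring="f1_macro")

for i, score in enumerate(scores, start=1):
    print(f"Fold {i}: {score:.4f}")
print(f"Mean Macro F1-score: {scores.mean():.4f}")
```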
To run this project, you need Python installed. Install the required dependencies with:

```bash
pip install -r requirements.txt
```

To run the Naive Bayes classifier:

```bash
python src/classifiers/naive_bayes_classifier.py
```

To run the Random Forest classifier:

```bash
python src/classifiers/random_forest_classifier.py
```

To run the XGBoost classifier:

```bash
python src/classifiers/xgboost_classifier.py
```

To run the ensemble voting classifier:

```bash
python src/classifiers/voting_classifier.py
```

To run the cross-validation script for all classifiers:

```bash
python src/evaluation/cross_validation.py
```
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License.