- Introduction
- Installation
- Usage
- Exploratory Data Analysis
- Vectorization
- Model Comparison
- Contributing
- Acknowledgments
The project identifies online toxic comments and computes a probability score for each of six categories: `toxic`, `severe_toxic`, `obscene`, `threat`, `insult`, and `identity_hate`.
To install and run this project, follow these steps:
- Clone the repository:
```bash
git clone https://github.com/username/Toxic-Comment-Classification-Challenge.git
cd Toxic-Comment-Classification-Challenge
```
Load the training and test data files using the `read_csv` function of the pandas library.
The training data has 7 columns: `comment_text` and the six label columns.
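A minimal loading sketch, assuming the Kaggle CSV files are saved as `train.csv` and `test.csv` in the working directory (file names are an assumption, not fixed by this README):

```python
import pandas as pd

# Assumed file names for the downloaded Kaggle data.
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# The six label columns described above.
label_cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
print(train[["comment_text"] + label_cols].head())
```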
The data is cleaned using the `re` library.
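One possible cleaning function built on `re`; the exact regular-expression patterns used in the project may differ:

```python
import re

def clean_text(text):
    """Lowercase a comment and strip noisy characters; patterns are illustrative."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"[^a-z\s]", " ", text)       # keep letters and whitespace only
    text = re.sub(r"\s+", " ", text).strip()    # collapse repeated whitespace
    return text

train["comment_text"] = train["comment_text"].apply(clean_text)
test["comment_text"] = test["comment_text"].apply(clean_text)
```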
The `comment_text` column is vectorized using the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm.
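A sketch of the TF-IDF step with scikit-learn's `TfidfVectorizer`; the n-gram range and vocabulary size shown here are assumptions, not the project's exact settings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Example settings; tune ngram_range and max_features for your own runs.
vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents="unicode",
    ngram_range=(1, 2),
    max_features=50000,
)
X_train = vectorizer.fit_transform(train["comment_text"])
X_test = vectorizer.transform(test["comment_text"])
```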
The logistic regression model outperformed Support Vector Machines, a Multi-layer Perceptron, and a MiniLM model from Sentence Transformers; a per-label training sketch follows the table below.
| Model | Leaderboard Score |
|---|---|
| Logistic Regression | 0.97461 |
| Support Vector Machines | 0.8242 |
| Multi-layer Perceptron | 0.9092 |
| Sentence Transformers | 0.9596 |
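A sketch of how one binary logistic regression per label can produce the submitted probability scores; the solver and regularization strength are assumptions, not the project's exact configuration:

```python
from sklearn.linear_model import LogisticRegression

# Train one binary classifier per label and keep the positive-class probability.
predictions = {}
for label in label_cols:
    clf = LogisticRegression(C=1.0, solver="liblinear")
    clf.fit(X_train, train[label])
    predictions[label] = clf.predict_proba(X_test)[:, 1]
```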
- The Scikit-Learn library and its contributors
- Kaggle for providing the Toxic Comment Classification Challenge
- The open-source community for their invaluable contributions