This project implements a simple spam comment detection system using machine learning. The system classifies YouTube comments as either Spam or Not Spam based on their content. The model is a Naive Bayes classifier trained on a labeled dataset.
- Preprocesses and vectorizes text data using CountVectorizer.
- Trains a Bernoulli Naive Bayes model for binary classification.
- Provides predictions for new sample comments.
- Evaluates the model’s performance with accuracy on a test dataset.
The dataset used for this project is Youtube01-Psy.csv. It contains YouTube comments with the following columns:
- CONTENT: The text of the comment.
- CLASS: A binary label where 0 represents Not Spam and 1 represents Spam Comment.
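As a rough illustration of how these two columns might be read and labeled, here is a minimal sketch. The inline rows are made-up stand-ins for the real file, and the variable names are assumptions rather than the names used in the project script.

```python
import io

import pandas as pd

# Hypothetical rows standing in for Youtube01-Psy.csv (not real dataset content).
SAMPLE_CSV = io.StringIO(
    "CONTENT,CLASS\n"
    "Check out my channel for free gift cards,1\n"
    "This song never gets old,0\n"
    "Subscribe to me and I will subscribe back,1\n"
)

# With the real dataset this would be: pd.read_csv("Youtube01-Psy.csv")
data = pd.read_csv(SAMPLE_CSV)

# Keep only the two columns described above.
data = data[["CONTENT", "CLASS"]]

# Map the binary label to the human-readable values.
data["CLASS"] = data["CLASS"].map({0: "Not Spam", 1: "Spam Comment"})

print(data)
```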
Make sure you have the following Python libraries installed:
pandas
numpy
scikit-learn
You can install the dependencies using pip:
pip install pandas numpy scikit-learn
Clone this repository to your local machine:
git clone https://github.com/gawadx1/spam-detection-ml.git
cd spam-detection-ml
Run the Python script to use the model and test sample comments:
python spam_comments_detection_gui.py
The script includes example comments to test the model. You can modify these or add your own samples to see how the model performs.
- Data Loading:
  - Reads the dataset and filters the relevant columns (CONTENT and CLASS).
- Preprocessing:
  - Converts labels to human-readable values (Spam Comment, Not Spam).
  - Converts text data to a bag-of-words representation using CountVectorizer.
- Model Training and Evaluation:
  - Splits the data into training and testing sets.
  - Trains a Bernoulli Naive Bayes classifier.
  - Evaluates the model’s accuracy on the test set.
- Sample Prediction:
  - Tests the model with predefined sample comments and predicts whether they are spam or not.
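The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's script: the toy comments, the 25% test split, the random seed, and all variable names are made up for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy data standing in for the CONTENT and CLASS columns.
comments = [
    "Check out my channel for free gift cards",
    "Subscribe to my channel and win a free phone",
    "Click this link for free followers",
    "Free subscribers, visit my channel now",
    "This song never gets old",
    "I love the dance in this video",
    "Great music, still listening in the car",
    "The beat in this song is amazing",
]
labels = np.array(["Spam Comment"] * 4 + ["Not Spam"] * 4)

# Bag-of-words counts. BernoulliNB binarizes them at its default
# threshold (binarize=0.0), so only word presence or absence matters.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(comments)

# Assumed split: 25% held out for testing, stratified by label.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42, stratify=labels
)

model = BernoulliNB()
model.fit(X_train, y_train)
print("Model Accuracy:", model.score(X_test, y_test))

# Predict a new comment; it must go through the same fitted vectorizer.
sample = "Free gift cards on my channel"
prediction = model.predict(vectorizer.transform([sample]))[0]
print("Sample Comment:", sample)
print("Prediction:", prediction)
```

The sketch fits the vectorizer before splitting, following the order of the steps listed above. Fitting it on the training split only is a common alternative that keeps test-set vocabulary out of training.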
- Accuracy of the model on the test data.
- Predictions for sample comments.
Model Accuracy: 0.95
Sample Comment: "Check this out: https://thecleverprogrammer.com /"
Prediction: Spam Comment
Sample Comment: "Lack of information!"
Prediction: Not Spam
- Add additional features (e.g., comment length, special characters).
- Experiment with other machine learning models.
- Use a larger and more diverse dataset for better generalization.
- Implement a user interface for real-time predictions.
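As one possible starting point for the first idea in this list, hand-crafted features such as comment length and special-character count could be computed per comment. The helper name `extra_features` and the choice of regular expression are assumptions for illustration only; nothing like this exists in the project yet.

```python
import re


def extra_features(comment: str) -> dict:
    """Hypothetical helper: simple hand-crafted features for one comment."""
    return {
        # Total number of characters in the comment.
        "length": len(comment),
        # Characters that are neither word characters nor whitespace.
        "special_chars": len(re.findall(r"[^\w\s]", comment)),
    }


print(extra_features("Lack of information!"))
```

Features like these would still need to be combined with the bag-of-words matrix before training, which this sketch leaves out.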
This project is licensed under the MIT License. See the LICENSE file for more details.