This project focuses on detecting spam messages using machine learning techniques to classify textual data as spam or not spam. The dataset, sourced from Kaggle, contains labeled examples of both types of messages. Key steps in this project include preprocessing the text data, applying natural language processing techniques, and building a classification model. The model's performance was improved through hyperparameter tuning and evaluation to ensure accurate and reliable predictions.
Exploratory data analysis helped me understand how the two classes are distributed. Visualization methods were used to create plots comparing the frequencies of spam and non-spam messages, providing a clear overview of their distribution. Word clouds were generated to highlight the most common words in spam and non-spam messages, offering insights into the language patterns of each category.
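The snippet below is a minimal sketch of these EDA plots. It assumes the Kaggle CSV is loaded into a pandas DataFrame with `label` and `message` columns and that the `wordcloud` package is installed; the file path and column names are placeholders, not the project's actual code.

```python
# EDA sketch: class-distribution plot and a spam word cloud.
# File path and column names ("label", "message") are assumptions about the Kaggle CSV.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

df = pd.read_csv("spam.csv")  # placeholder path

# Bar plot comparing spam vs. non-spam message counts
sns.countplot(data=df, x="label")
plt.title("Spam vs. Non-Spam Message Counts")
plt.show()

# Word cloud of the most frequent words in spam messages
spam_text = " ".join(df.loc[df["label"] == "spam", "message"])
wordcloud = WordCloud(width=800, height=400, background_color="white").generate(spam_text)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.title("Most Common Words in Spam Messages")
plt.show()
```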
The data was cleaned to ensure quality and consistency through the following steps (a preprocessing sketch follows the list):
- Handling Null Values:
  - Checked the dataset for missing values and handled them.
- Text Standardization:
  - Converted text to lowercase.
  - Removed punctuation.
  - Eliminated stopwords.
- Tokenization and Lemmatization:
  - Tokenized the text into meaningful units.
  - Applied lemmatization for better semantic representation.
- Input Validation:
  - Required input messages to contain at least five words.
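Below is a minimal sketch of this cleaning pipeline using NLTK. The function names, the five-word constant, and the example message are illustrative, not taken from the project code.

```python
# Text-preprocessing sketch with NLTK: lowercase, strip punctuation,
# remove stopwords, tokenize, and lemmatize. Names are illustrative.
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the required NLTK resources
nltk.download("punkt")
nltk.download("punkt_tab")  # needed by newer NLTK versions for word_tokenize
nltk.download("stopwords")
nltk.download("wordnet")

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()
MIN_WORDS = 5  # minimum number of words required in an input message

def preprocess(text: str) -> str:
    """Return a cleaned, lemmatized version of the input message."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = word_tokenize(text)
    tokens = [LEMMATIZER.lemmatize(tok) for tok in tokens if tok not in STOP_WORDS]
    return " ".join(tokens)

def is_valid_input(text: str) -> bool:
    """Enforce the minimum-length rule: at least five words."""
    return len(text.split()) >= MIN_WORDS

print(preprocess("Congratulations! You have WON a free prize, claim it now"))
```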
Meaningful features were extracted using:
- Word Frequency Analysis: Compared the distribution of words in spam and non-spam messages (a frequency-count sketch follows this list).
- Term Importance: Used statistical measures such as TF-IDF to identify the words most indicative of spam.
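As a rough illustration of the frequency analysis, the sketch below counts the top words per class. It assumes the cleaned text lives in a `clean_message` column and the labels are `spam`/`ham`; both names are assumptions.

```python
# Word-frequency sketch: top words per class.
# Assumes df["clean_message"] holds preprocessed text and labels are "spam"/"ham".
from collections import Counter

spam_counts = Counter(" ".join(df.loc[df["label"] == "spam", "clean_message"]).split())
ham_counts = Counter(" ".join(df.loc[df["label"] == "ham", "clean_message"]).split())

print("Top spam words:", spam_counts.most_common(10))
print("Top non-spam words:", ham_counts.most_common(10))
```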
Text data was transformed into a format suitable for machine learning models using TF-IDF (Term Frequency-Inverse Document Frequency). This representation encodes text numerically, giving more weight to words that are frequent in a message but rare across the rest of the dataset.
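A sketch of this step with scikit-learn, assuming the same DataFrame and column names as above; the split ratio and random seed are assumptions.

```python
# TF-IDF vectorization sketch; the vectorizer is fit on the training split only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df["clean_message"], df["label"],
    test_size=0.2, random_state=42, stratify=df["label"],
)

vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)  # learn vocabulary and IDF weights
X_test_tfidf = vectorizer.transform(X_test)        # reuse the fitted vocabulary
```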
Several machine learning algorithms were tested, and the Random Forest Classifier delivered the best performance. Key metrics included:
- Accuracy: ~97%
- Precision: ~100%
Precision was prioritized to minimize false positives, as misclassifying non-spam messages as spam can cause significant issues. The spam detection threshold was adjusted to 75% to further enhance precision.
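The sketch below shows one way to apply a 0.75 spam threshold on top of a Random Forest. The hyperparameters and the `spam`/`ham` label names are assumptions; the metrics quoted above come from the project itself, not from this snippet.

```python
# Random Forest with a raised spam-decision threshold (0.75).
# Hyperparameters and the "spam"/"ham" label names are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train_tfidf, y_train)

# Predict spam only when the model's spam probability is at least 0.75
spam_col = list(clf.classes_).index("spam")
spam_proba = clf.predict_proba(X_test_tfidf)[:, spam_col]
y_pred = np.where(spam_proba >= 0.75, "spam", "ham")

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, pos_label="spam"))
```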
- NLTK: For text cleaning, tokenization, and lemmatization.
- Scikit-learn: For building and evaluating the model.
- Pandas: For data manipulation.
- Matplotlib/Seaborn: For visualizations during EDA.
- High precision and accuracy in spam detection.
- Threshold-based classification to reduce false positives.
- Robust preprocessing pipeline for textual data.
This project demonstrates how a well-structured pipeline can achieve exceptional results in spam detection tasks. The Random Forest Classifier, combined with TF-IDF and robust preprocessing techniques, delivered a highly reliable solution.
- Experimenting with other algorithms like XGBoost.
- Fine-tuning hyperparameters for further optimization.
Feel free to explore and improve this project!