Shivmalge/Email-Sms-Spam-Classifier

Email-Sms-Spam-Classifier

Objective : Predict whether an email or SMS is spam using machine learning

Introduction :

Spam email is unwanted junk email sent in bulk to an indiscriminate list of recipients. Spam is generally sent for commercial purposes, in massive volumes, by botnets (networks of infected computers). Spam email can often be a malicious attempt to gain access to your system. Spam also wastes CPU time, storage capacity and network bandwidth. It becomes a serious problem when spam arrives in between important business emails. It is therefore necessary to address the problems caused by spam email, and machine learning methods that detect and filter spam can do this.

Problem Statement :

The person responsible for sending spam messages is referred to as the spammer. Such a person gathers email addresses from different websites, chat rooms, etc. The huge volume of spam flowing through computer networks has destructive effects on the memory space of email servers, communication bandwidth, CPU power and user time. Existing systems do not detect spam effectively, which results in low prediction accuracy, weaker security, loss of data and untold financial losses for many users.

To address this problem, I built a machine learning model that predicts whether an email or SMS is spam or not spam.

Python libraries used to build the model:

  • Numpy
  • Pandas
  • Matplotlib
  • Seaborn
  • Scikit-learn

The dataset contains two columns:

  • TEXT (input)
  • Target (output: Spam or Not Spam)

The shape of the dataset is (5572, 2).

The model is built using the following steps :

  • Data cleaning
  • EDA
  • Text Preprocessing
  • Model building
  • Evaluation
  • Improvement

1. Data Cleaning :

  • I renamed the columns, because the original column names in the dataset were v1 and v2: v1 became 'target' and v2 became 'text'.

  • The target variable is categorical, so I converted it into a numerical variable using label encoding from the scikit-learn library: the spam category is mapped to 1 and the ham category to 0.

  • After that, I checked for null values and found no missing values in the dataset.

  • I checked for duplicate rows. After dropping all duplicates, the shape of the dataset becomes (5169, 2).
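The cleaning steps above can be sketched as follows. This is a minimal illustration on a small made-up DataFrame, not the project's actual notebook code: the sample rows are invented, and only the column names v1/v2, the new names 'target'/'text' and the ham = 0, spam = 1 mapping come from the description above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the raw data (assumption: the real file has 5572 rows).
df = pd.DataFrame({
    "v1": ["ham", "spam", "ham", "ham", "spam"],
    "v2": [
        "See you at lunch",
        "Win a free prize now",
        "See you at lunch",  # exact duplicate of the first row
        "Call me when you can",
        "Claim your reward today",
    ],
})

# Rename the columns to descriptive names.
df = df.rename(columns={"v1": "target", "v2": "text"})

# LabelEncoder sorts the classes alphabetically, so ham -> 0 and spam -> 1.
encoder = LabelEncoder()
df["target"] = encoder.fit_transform(df["target"])

# Count missing values per column and duplicate rows.
missing = df.isnull().sum()
n_duplicates = df.duplicated().sum()

# Keep the first occurrence of each duplicated row.
df = df.drop_duplicates(keep="first")

print(missing)
print(n_duplicates)
print(df.shape)
```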

2. EDA (Exploratory Data Analysis):

  • I checked the distribution of the ham and spam categories in the target variable (i.e. how many messages are ham and how many are spam) and visualized it as a pie chart using the matplotlib library.

  • After that, I created three columns holding the number of characters, number of words and number of sentences in every row. For this I used NLTK, a standard Python library that provides a diverse set of algorithms for NLP and is one of the most widely used libraries for NLP and computational linguistics.

  • I created histplots to see how the number of characters and the number of words are distributed in the input.

  • Then I checked how the number of characters, number of words and number of sentences are correlated, using the seaborn library.
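A rough sketch of the three length features and their correlation is shown below. It is an approximation under stated assumptions: it uses plain pandas string methods instead of the NLTK tokenizers mentioned above (so it runs without downloading NLTK data), and the column names num_characters, num_words and num_sentences are hypothetical. NLTK counts punctuation marks as separate tokens, so its word counts would differ from the whitespace-based counts here.

```python
import pandas as pd

# Made-up sample messages (assumption: not taken from the real dataset).
df = pd.DataFrame({
    "target": [0, 1, 0],
    "text": [
        "Ok. See you at 5?",
        "WINNER!! Claim your free prize now. Call 0800 today!",
        "I am home. Call me later.",
    ],
})

# Number of characters in each message.
df["num_characters"] = df["text"].str.len()

# Approximate word count: split on whitespace.
df["num_words"] = df["text"].str.split().str.len()

# Approximate sentence count: runs of ., ! or ? followed by a space or end of text.
df["num_sentences"] = df["text"].str.count(r"[.!?]+(?:\s|$)")

# Pairwise correlation between the three features; this matrix is what a
# seaborn heatmap would visualize.
correlation = df[["num_characters", "num_words", "num_sentences"]].corr()

print(df[["num_characters", "num_words", "num_sentences"]])
print(correlation)
```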

3. Data Preprocessing:

  • Lower case
  • Tokenization
  • Removing special characters
  • Removing stop words and punctuation
  • Stemming

1. Lower case:

I converted all the text to lower case so that the same word written in different cases (for example 'Free' and 'free') is not treated as two different words.

2. Tokenization:

Tokenization is the process of splitting a paragraph into a list of sentences and sentences into a list of words. I converted the text of every row into a list of words.

3. Removing special characters:

Special characters such as '!', '%', '*' and '$' are removed from the sentences.

4. Removing Stop words:

Words like 'the', 'is', 'am' and 'it' are removed because they carry little meaning and do not affect the output.

5. Stemming:

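Stemming is the process of reducing each word in the text data to its root form. For example, 'calling' and 'called' both reduce to the root word 'call'. A stemmer works by stripping suffixes, so irregular forms such as 'gone' are generally not mapped to 'go'; that would require lemmatization.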
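The five preprocessing steps above can be combined into a single function, sketched below. Several details are assumptions made for illustration: the function name transform_text is hypothetical, tokenization is done with a regular expression rather than an NLTK tokenizer, and the stop word list is a tiny hand-written set so that the sketch runs without downloading the NLTK stopwords corpus. Stemming uses NLTK's PorterStemmer; the source does not say which stemmer was used.

```python
import re

from nltk.stem import PorterStemmer

# Assumption: a small hand-written stop word list, for illustration only.
STOP_WORDS = {"the", "is", "am", "it", "a", "an", "you", "to", "and"}

stemmer = PorterStemmer()


def transform_text(text):
    """Lower-case, tokenize, drop special characters and stop words, then stem."""
    # 1. Lower case.
    text = text.lower()
    # 2 and 3. Keeping only runs of letters and digits both splits the text
    # into tokens and drops special characters and punctuation.
    tokens = re.findall(r"[a-z0-9]+", text)
    # 4. Remove stop words.
    tokens = [token for token in tokens if token not in STOP_WORDS]
    # 5. Reduce each remaining token to its stem.
    return " ".join(stemmer.stem(token) for token in tokens)


print(transform_text("Calling!!! You called it"))  # prints: call call
```

Returning a single space-joined string keeps the output ready for a vectorizer, which expects one string per document.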

6. I created word cloud charts that show the most frequent words in spam and in ham emails or SMS messages

For Spam

image

For ham

image

7. Our final dataset will be:

image

8. Using TF-IDF (term frequency-inverse document frequency), I converted the text input into vectors. This gives as many columns as there are unique words in the input dataset

image

The shape of the dataset after preprocessing becomes:

image
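As a small illustration of the vectorization step, the sketch below applies scikit-learn's TfidfVectorizer with default settings to three made-up, already preprocessed messages. The corpus is invented and the default settings are an assumption; the source does not state which parameters were used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up, already preprocessed messages (assumption: not from the real dataset).
corpus = [
    "free prize call now",
    "call me later",
    "free entri win prize",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix: one row per message

# One column per unique word seen during fitting.
vocabulary = sorted(vectorizer.vocabulary_)

print(vocabulary)
print(X.shape)  # (3, 8): 3 messages, 8 unique words
```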

4. Model Building and Evaluation :

Since this is a text classification problem, the Naive Bayes classifier is a good fit, as it works very well on text data.

The Naive Bayes classifier has three types :

  • Multinomial Naive Bayes : Works on a feature vector in which each term is represented by the number of times it appears, i.e. its frequency. It is a widely used classifier for document classification, based on the counts of the words present in the documents.

  • Bernoulli Naive Bayes : Works on binary features, i.e. whether a feature is present or not.

  • Gaussian Naive Bayes : Assumes a continuous (Gaussian) distribution and is used when we are dealing with continuous data.

I calculated the accuracy, confusion matrix and precision score for each of the three classifiers

image

Of the three, Multinomial Naive Bayes gave the best results, with an accuracy of 97.1 % and a precision of 100 %. A precision of 100 % means that no ham message in the test set was flagged as spam.
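The evaluation of the three Naive Bayes variants can be sketched as below. This is an illustration on a tiny invented dataset, so its scores say nothing about the 97.1 % and 100 % figures reported above. The split ratio, random seed and variable names are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Made-up messages: 1 = spam, 0 = ham.
texts = [
    "win free prize now", "claim free cash reward", "free entry win cash",
    "urgent prize claim now", "cash prize call now", "win reward call free",
    "see you at lunch", "call me later today", "are you home yet",
    "meeting moved to monday", "thanks for the lift", "lunch at noon today",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=2, stratify=labels
)

# Fit the vocabulary on the training messages only.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts).toarray()  # GaussianNB needs dense input
X_test = vectorizer.transform(test_texts).toarray()

models = {
    "GaussianNB": GaussianNB(),
    "MultinomialNB": MultinomialNB(),
    "BernoulliNB": BernoulliNB(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        # Precision for the spam class; zero_division=0 avoids a warning
        # when a model predicts no spam at all.
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "confusion_matrix": confusion_matrix(y_test, y_pred, labels=[0, 1]),
    }

for name, scores in results.items():
    print(name, scores["accuracy"], scores["precision"])
```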

Then I tried different algorithms: Logistic Regression, Decision Tree, Random Forest Classifier, AdaBoost, XGBoost and K-Neighbors Classifier. Their accuracy and precision are shown below; among all of them, the Multinomial Naive Bayes Classifier performs best:

image
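A comparison of this kind can be sketched as a loop over several scikit-learn classifiers. The data below is invented and all hyperparameters are assumptions, so the printed scores are illustrative only. XGBoost is left out because it lives in a separate package; it could be added to the dictionary in the same way.

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Made-up messages: 1 = spam, 0 = ham.
texts = [
    "win free prize now", "claim free cash reward", "free entry win cash",
    "urgent prize claim now", "cash prize call now", "win reward call free",
    "see you at lunch", "call me later today", "are you home yet",
    "meeting moved to monday", "thanks for the lift", "lunch at noon today",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=2, stratify=labels
)
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Hyperparameters here are illustrative assumptions.
classifiers = {
    "MultinomialNB": MultinomialNB(),
    "LogisticRegression": LogisticRegression(),
    "DecisionTree": DecisionTreeClassifier(max_depth=5, random_state=2),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=2),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=2),
    "KNeighbors": KNeighborsClassifier(n_neighbors=3),
}

scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    scores[name] = (
        accuracy_score(y_test, y_pred),
        precision_score(y_test, y_pred, zero_division=0),
    )

# Sort by precision first, then accuracy, highest first.
for name, (accuracy, precision) in sorted(
    scores.items(), key=lambda item: (item[1][1], item[1][0]), reverse=True
):
    print(f"{name}: accuracy={accuracy:.3f}, precision={precision:.3f}")
```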

Let's see a demo of the model, or test the model:

Website Link:

Testing 1 image

Testing 2 image

Conclusion :

This is how I created the email/SMS spam classifier model using the Naive Bayes machine learning algorithm.

Project By - Shivsharan Malage

Github

Linkedin