Custom Naive Bayes classifier for spam and ham email filtering.

GoodGuyGregory/spam_filter


Naïve Bayes Classification

The Naïve Bayes classifier is built within the main() function of the attached program. It starts by calling buildTrainingTestData(), which takes the original dataset of spam and ham emails and splits it into trainingSpambaseData and testingSpamBaseData. Both sets keep a class balance close to 60% non-spam (ham) and 40% spam, and each holds the email features together with their labels.
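The body of buildTrainingTestData() is not shown here, so the following is only a minimal sketch of one way to split a dataset while keeping the class ratio in both parts. The function name `split_preserving_class_ratio`, the `test_fraction` default of 0.5, and the toy rows are assumptions made for illustration, and the sketch uses only the standard library rather than the pandas and NumPy calls the program itself relies on.

```python
import random


def split_preserving_class_ratio(rows, test_fraction=0.5, seed=0):
    """Split rows of (features..., label) so both parts keep the class ratio.

    Hypothetical helper: the label is assumed to be the last item of each row.
    """
    rng = random.Random(seed)

    # Group the rows by their class label.
    by_class = {}
    for row in rows:
        by_class.setdefault(row[-1], []).append(row)

    train, test = [], []
    for members in by_class.values():
        members = members[:]          # copy so the caller's list is untouched
        rng.shuffle(members)
        cut = int(round(len(members) * test_fraction))
        test.extend(members[:cut])
        train.extend(members[cut:])

    # Shuffle again so the classes are interleaved within each part.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Toy data (made up): 6 ham rows (label 0) and 4 spam rows (label 1).
toy = [(i * 0.1, 0) for i in range(6)] + [(i * 0.2, 1) for i in range(4)]
train, test = split_preserving_class_ratio(toy)
```

Splitting each class separately is what keeps the 60/40 balance in both parts; a single shuffle of the whole dataset would only preserve it on average.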

The classifier class for this implementation is emailClassifier. Its constructor calls a few methods that compute the mean and standard deviation of each training feature, separately for hams (non-spam) and spams.

Once the means and standard deviations are calculated, they are stored as properties on the class and serve as the Gaussian parameters assumed by the Naïve Bayes classification.
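As a rough illustration of this step, the sketch below computes a per-feature mean and standard deviation for the rows of one class. The helper name `class_feature_stats`, the use of the population standard deviation, and the `min_std` floor (which avoids a zero standard deviation for constant features) are all assumptions; the methods emailClassifier actually uses are not shown in this document.

```python
from statistics import fmean, pstdev


def class_feature_stats(feature_rows, min_std=1e-4):
    """Return (means, stds), one value per feature column.

    Hypothetical helper: feature_rows holds the feature vectors of a single
    class (all hams or all spams), without the label column.
    """
    columns = list(zip(*feature_rows))            # transpose rows into columns
    means = [fmean(col) for col in columns]
    # Floor the standard deviation so a constant feature never yields 0.
    stds = [max(pstdev(col), min_std) for col in columns]
    return means, stds


# Toy ham rows (made up): the second feature is constant.
ham_rows = [[1.0, 0.0], [3.0, 0.0]]
ham_means, ham_stds = class_feature_stats(ham_rows)
```

The same call would be repeated for the spam rows, giving one mean vector and one standard deviation vector per class.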

The next step is to call prepareTestData(), which shuffles the test data and separates the feature vectors from the target labels so the model can classify the former and be checked against the latter.

import numpy as np

def prepareTestData(testingData):
    # testingData is a pandas DataFrame whose last column holds the class label.
    # Shuffle the test data.
    testingData = testingData.sample(frac=1)

    # Split the feature columns from the label column.
    inputFeatures = np.array(testingData.iloc[:, :-1])

    # Pull the true classifications for later comparison.
    inputTargets = np.array(testingData.iloc[:, -1].to_list())

    return inputFeatures, inputTargets

The emailClassifier object’s classifyEmails() method is then called to classify the emails and count the true positives, true negatives, false positives, and false negatives. These counts correspond to the cells of the confusion matrix built once classification is complete, which allows for comparative benchmarks and overall performance metrics.

    def classifyEmails(self, emailFeatures, emailTargets):
        predictedValues = []

        for email in range(len(emailFeatures)):
            self.totalEmailsClassified += 1
            emailFeatureVector = emailFeatures[email]
            targetClass = emailTargets[email]
            # log-posterior score for each class: index 0 = ham, index 1 = spam
            posteriors = []

            # classify ham: sum the per-feature log-likelihoods, then add the log prior once
            hamPosterior = np.sum(np.log(self.gaussianNB(emailFeatureVector, self.hamsMean, self.hamsStd))) + np.log(self.trainingHamClassPrior)
            posteriors.append(hamPosterior)

            # classify spam: same computation with the spam statistics and prior
            spamPosterior = np.sum(np.log(self.gaussianNB(emailFeatureVector, self.spamsMean, self.spamsStd))) + np.log(self.trainingSpamClassPrior)
            posteriors.append(spamPosterior)

            prediction = np.argmax(posteriors, axis=0)

            predictedValues.append(prediction)

            # update the counters used for the accuracy metrics (1 = spam, 0 = ham)
            if prediction == targetClass:
                # case where the email is correctly classified as spam
                if prediction == 1 and targetClass == 1:
                    self.truePositives += 1
                # case where the email is correctly classified as non-spam
                elif prediction == 0 and targetClass == 0:
                    self.trueNegatives += 1
            # case where it wasn't correctly predicted
            else:
                # case where a non-spam email is falsely classified as spam
                if prediction == 1 and targetClass == 0:
                    self.falsePositives += 1
                # case where a spam email is classified as non-spam
                elif prediction == 0 and targetClass == 1:
                    self.falseNegatives += 1

        self.confusionMatrix = confusion_matrix(emailTargets, predictedValues)
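The gaussianNB() method called above is not shown, so the sketch below illustrates the general technique rather than the repository's code: each feature contributes a Gaussian log-density, the contributions are summed, the log prior is added once per class, and the class with the larger score wins. The function names and the toy class statistics are assumptions, and plain Python is used in place of NumPy.

```python
import math


def gaussian_log_density(x, mean, std):
    """Log of the normal density N(x; mean, std^2)."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def log_posterior(features, means, stds, prior):
    """Unnormalised log posterior: log prior plus summed log-likelihoods."""
    log_likelihood = sum(
        gaussian_log_density(x, m, s) for x, m, s in zip(features, means, stds)
    )
    return math.log(prior) + log_likelihood


def predict(features, ham, spam):
    """Return 0 for ham or 1 for spam; ham and spam are (means, stds, prior)."""
    scores = [log_posterior(features, *ham), log_posterior(features, *spam)]
    return scores.index(max(scores))


# Toy class statistics (made up), with priors matching a 60/40 ham/spam split.
ham = ([0.0, 0.0], [1.0, 1.0], 0.6)
spam = ([3.0, 3.0], [1.0, 1.0], 0.4)

near_ham = predict([0.1, -0.2], ham, spam)
near_spam = predict([2.9, 3.1], ham, spam)
```

Working in log space matters once there are many features, because multiplying many small densities directly can underflow to zero.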

The Gaussian Naïve Bayes classifier was very accurate on the test data. The resulting accuracy, precision, and recall can be seen below.

Accuracy Precision Recall

[Figure: Accuracy Precision Recall]

Confusion Matrix

[Figure: Confusion_Matrix]
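For reference, the three metrics follow directly from the four counters kept by classifyEmails(). The sketch below shows the standard formulas; the function name `classification_metrics` is hypothetical and the counts passed in are invented for the example, not the results of this classifier.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall) from confusion-matrix counts.

    Spam is treated as the positive class. Each ratio falls back to 0.0
    when its denominator is zero.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # of predicted spam, how much was spam
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # of actual spam, how much was caught
    return accuracy, precision, recall


# Made-up counts for illustration only.
accuracy, precision, recall = classification_metrics(tp=8, tn=10, fp=2, fn=0)
```

With labels 0 (ham) and 1 (spam), scikit-learn's confusion_matrix(y_true, y_pred) lays the same counts out as [[TN, FP], [FN, TP]], with rows for the true class and columns for the predicted class.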
