GitHub - GoodGuyGregory/spam_filter: Custom Naive Bayes classifier for spam and ham email filtering.

Naïve Bayes Classification

The Naïve Bayes classifier is built in the main() method of the attached program. main() first calls buildTrainingTestData(), which takes the original dataset of spam and ham emails and splits it into trainingSpambaseData and testingSpamBaseData. Both sets keep roughly the same class balance: about 60% non-spam (ham) and 40% spam email features and labels.
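The body of buildTrainingTestData() is not shown here, so the following is only a minimal sketch of one way to get a class-balanced split. The function name stratified_split, the test_fraction value, and the toy data are assumptions for illustration, not the repository's code.

```python
import numpy as np


def stratified_split(features, labels, test_fraction=0.5, seed=0):
    """Split rows into train/test sets that both keep the original class ratio."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        # Shuffle the row indices of one class, then cut them at the same fraction.
        idx = rng.permutation(np.flatnonzero(labels == cls))
        cut = int(round(len(idx) * test_fraction))
        test_idx.extend(idx[:cut])
        train_idx.extend(idx[cut:])
    train_idx, test_idx = np.array(train_idx), np.array(test_idx)
    return features[train_idx], labels[train_idx], features[test_idx], labels[test_idx]


# Toy data (hypothetical): 60 ham rows (label 0) and 40 spam rows (label 1).
X = np.arange(300, dtype=float).reshape(100, 3)
y = np.array([0] * 60 + [1] * 40)
X_train, y_train, X_test, y_test = stratified_split(X, y)
print(y_train.mean(), y_test.mean())  # fraction of spam in each split
```

Splitting each class separately is what keeps the 60/40 ratio in both halves; a single shuffle over all rows would only preserve it approximately.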

The classifier class for this implementation is emailClassifier. It has a few constructor methods that compute the mean and standard deviation of each training feature, separately for hams (non-spam) and spams.

Once the means and standard deviations are calculated, they are stored as class properties and used as the parameters of the Gaussian assumptions behind our Naïve Bayes classification.
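The constructor methods themselves are not reproduced above. As a rough sketch of the statistics they are described as computing, the example below derives per-class means, standard deviations, and priors with NumPy. The function name fit_class_statistics and the min_std floor are assumptions introduced here, not part of the original class.

```python
import numpy as np


def fit_class_statistics(features, labels, min_std=1e-4):
    """Return per-class feature means, standard deviations, and class priors."""
    stats = {}
    for cls in (0, 1):  # 0 = ham, 1 = spam, matching the labels used by the classifier
        rows = features[labels == cls]
        # Assumption: floor the standard deviation so a constant feature
        # cannot cause a division by zero in the Gaussian density later on.
        std = np.maximum(rows.std(axis=0), min_std)
        stats[cls] = {
            "mean": rows.mean(axis=0),
            "std": std,
            "prior": len(rows) / len(features),
        }
    return stats


# Toy data (hypothetical): two ham rows and two spam rows with two features.
X = np.array([[1.0, 2.0], [3.0, 2.0], [10.0, 0.0], [14.0, 4.0]])
y = np.array([0, 0, 1, 1])
stats = fit_class_statistics(X, y)
print(stats[0]["mean"], stats[0]["std"], stats[0]["prior"])
```

The floor on the standard deviation is a design choice worth considering for sparse features: a feature that never varies within a class would otherwise give a standard deviation of zero.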

The next step is to call the prepareTestData() method, which shuffles the test data and splits the target labels and feature vectors into separate arrays for the model's classification process.

def prepareTestData(testingData):
    # shuffle the test data.
    testingData = testingData.sample(frac=1)

    # split the testing features from the label.
    inputFeatures = np.array(testingData.iloc[:, :-1])

    # pull the class labels for comparison against the predictions.
    inputTargets = np.array(testingData.iloc[:, -1].to_list())

    return inputFeatures, inputTargets

The emailClassifier object's classifyEmails() method is then called to classify the emails and to count the true positives, true negatives, false positives, and false negatives. These values are used to build the confusion matrix once classification is complete, which allows for comparative benchmarks and overall performance metrics.

    def classifyEmails(self, emailFeatures, emailTargets):
        predictedValues = []

        for email in range(len(emailFeatures)):
            self.totalEmailsClassified += 1
            emailFeatureVector = emailFeatures[email]
            targetClass = emailTargets[email]
            # adds a posteriors list
            posteriors = []

            # classify ham
            # sum the per-feature log likelihoods, then add the log prior once
            hamPosterior = np.sum(np.log(self.gaussianNB(emailFeatureVector, self.hamsMean, self.hamsStd))) + np.log(self.trainingHamClassPrior)
            posteriors.append(hamPosterior)

            # classify spam
            spamPosterior = np.sum(np.log(self.gaussianNB(emailFeatureVector, self.spamsMean, self.spamsStd))) + np.log(self.trainingSpamClassPrior)
            posteriors.append(spamPosterior)

            prediction = np.argmax(posteriors, axis=0)

            predictedValues.append(prediction)

            if prediction == targetClass:
                # tally the correct prediction for the accuracy metrics:
                # case where the email is correctly classified as spam
                if prediction == 1 and targetClass == 1:
                    self.truePositives += 1
                # case where classified as not spam and is non-spam
                if prediction == 0 and targetClass == 0:
                    self.trueNegatives += 1
            # case where it wasn't correctly predicted.
            else:
                # case where the email is falsely classified as spam
                if prediction == 1 and targetClass == 0:
                    self.falsePositives += 1
                # case where classified as non spam but is spam email
                if prediction == 0 and targetClass == 1:
                    self.falseNegatives += 1

        self.confusionMatrix = confusion_matrix(emailTargets, predictedValues)
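The gaussianNB() helper called above is not shown in this README. The sketch below illustrates what such a scoring step could look like, assuming a normal density per feature and working directly in log space. The function names gaussian_log_density and log_posterior and all numeric values are hypothetical.

```python
import numpy as np


def gaussian_log_density(x, mean, std):
    """Log of the normal density, evaluated independently for each feature."""
    return -0.5 * np.log(2.0 * np.pi * std**2) - (x - mean) ** 2 / (2.0 * std**2)


def log_posterior(x, mean, std, prior):
    """Unnormalized log posterior: log prior plus the summed feature log likelihoods."""
    return np.log(prior) + np.sum(gaussian_log_density(x, mean, std))


# Hypothetical class statistics for a two-feature problem.
ham_mean, ham_std = np.array([0.0, 0.0]), np.array([1.0, 1.0])
spam_mean, spam_std = np.array([5.0, 5.0]), np.array([1.0, 1.0])

email = np.array([4.5, 5.2])
scores = [
    log_posterior(email, ham_mean, ham_std, prior=0.6),
    log_posterior(email, spam_mean, spam_std, prior=0.4),
]
prediction = int(np.argmax(scores))  # index 0 = ham, index 1 = spam
print(prediction)
```

Computing the log density directly, rather than taking the log of a density value, avoids underflow to zero when a feature lies far from the class mean. The log prior is added once per class, outside the sum over features.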

The classifier proved accurate when implementing the Gaussian Naïve Bayes algorithm. The resulting accuracy, precision, and recall, along with the confusion matrix, are reported below.
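These three metrics follow directly from the four counters that classifyEmails() maintains. The sketch below shows the standard formulas; the function name classification_metrics is an assumption, and the counts in the usage line are made up for illustration and are not this project's results.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: of the emails flagged as spam, how many really were spam.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the real spam emails, how many were flagged.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical counts, not measured results.
accuracy, precision, recall = classification_metrics(tp=40, tn=50, fp=10, fn=0)
print(accuracy, precision, recall)
```

For a spam filter, precision is often the metric to watch, since a false positive means a legitimate email was filed as spam.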

Accuracy Precision Recall

Confusion Matrix
