Use Bayesian classification and K-means clustering to analyze SMS messages.
After completing this assignment, you should understand:
- The uses of classification and clustering
- Good feature extraction
After completing this assignment, you should be able to:
- Parse text into features
- Classify textual data
- Cluster textual data
- A Git repo called sms-spam containing at least:
  - a README.md file explaining how to run your project
  - a requirements.txt file
- No PEP8 or Pyflakes warnings or errors
Download the SMS Spam collection from the UCI Machine Learning Repository.
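A minimal sketch of loading the data is shown here. It assumes the download contains a plain-text file named `SMSSpamCollection` in which each line holds a label (`ham` or `spam`), a tab, and the message text; check the file you actually get before relying on this. The function name `load_sms` and the two sample lines are invented for illustration.

```python
def load_sms(lines):
    """Return a list of (label, text) pairs from tab-separated lines."""
    messages = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, in case the message contains tabs.
        label, _, text = line.partition("\t")
        messages.append((label, text))
    return messages


# Made-up sample lines in the assumed format, not taken from the dataset.
sample = [
    "ham\tSee you at lunch tomorrow\n",
    "spam\tWINNER! Claim your prize now\n",
]
messages = load_sms(sample)
```

With the real file, something like `with open("SMSSpamCollection", encoding="utf-8") as f: messages = load_sms(f)` should work, assuming the file name and encoding match.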
Choose a set of features to use in order to separate SMS ham from spam.
Write a program to extract the features you want from each SMS message and then classify each SMS as ham or spam. Iterate on your feature extraction until you reach a classification accuracy you are comfortable with (more than 75% at a minimum).
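One possible shape for the classification step is a small Bernoulli naive Bayes classifier over true/false features, sketched here with only the standard library. The function names, the feature names in the toy data, and the use of add-one smoothing are all assumptions made for illustration; a library implementation such as the naive Bayes classes in scikit-learn would be a reasonable substitute.

```python
import math
from collections import defaultdict


def train_bernoulli_nb(examples):
    """Train on a list of (features_dict, label) pairs with boolean features."""
    label_counts = defaultdict(int)
    true_counts = defaultdict(lambda: defaultdict(int))
    feature_names = set()
    for features, label in examples:
        label_counts[label] += 1
        for name, value in features.items():
            feature_names.add(name)
            if value:
                true_counts[label][name] += 1
    total = len(examples)
    model = {"priors": {}, "probs": {}, "features": sorted(feature_names)}
    for label, count in label_counts.items():
        model["priors"][label] = math.log(count / total)
        # Add-one smoothing keeps every probability strictly between 0 and 1.
        model["probs"][label] = {
            name: (true_counts[label][name] + 1) / (count + 2)
            for name in feature_names
        }
    return model


def predict(model, features):
    """Return the label with the highest log-probability score."""
    best_label, best_score = None, -math.inf
    for label, prior in model["priors"].items():
        score = prior
        for name in model["features"]:
            p = model["probs"][label][name]
            score += math.log(p if features.get(name) else 1 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, examples):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(predict(model, f) == label for f, label in examples)
    return correct / len(examples)


# Toy training data with hypothetical feature names.
train = [
    ({"has_money_symbol": True, "has_number": True}, "spam"),
    ({"has_money_symbol": True, "has_number": False}, "spam"),
    ({"has_money_symbol": False, "has_number": False}, "ham"),
    ({"has_money_symbol": False, "has_number": True}, "ham"),
]
model = train_bernoulli_nb(train)
```

To judge whether you have passed the 75% mark, it is probably more informative to call `accuracy` on messages that were held out of training than on the training set itself.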
In addition to the requirements from Normal Mode:
Write a program to cluster the SMS messages into differing numbers of groups. Examine each cluster to see whether it has a recognizable meaning. Write up your findings.
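The clustering step could be sketched as a plain K-means loop over numeric feature vectors, using only the standard library. The function names, the fixed random seed, and the toy vectors are assumptions for illustration; in practice each message would first be turned into a list of numbers (for example 1.0 or 0.0 per feature), and a library K-means implementation would work just as well.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def kmeans(vectors, k, iterations=100, seed=0):
    """Return (assignments, centroids) after running K-means."""
    rng = random.Random(seed)
    # Start from k distinct input positions chosen at random.
    centroids = rng.sample(vectors, k)
    assignments = None
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the clustering has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:
                centroids[c] = mean(members)
    return assignments, centroids


# Toy vectors forming two obvious groups.
vectors = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
for k in (2, 3):
    assignments, _ = kmeans(vectors, k)
    sizes = [assignments.count(c) for c in range(k)]
    print(k, "clusters, sizes:", sizes)
```

Running this for several values of `k` and then reading a handful of messages from each cluster is one way to look for meaning in the groups.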
Some features you might want to try:
- Presence of the following words: claim, winner
- Presence of money symbols
- Presence of numbers
- Presence of first-person words
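The features listed above can be sketched as a function that maps one message to a dictionary of true/false values. The function name, the feature names, the particular currency symbols, and the list of first-person words are all assumptions chosen for illustration; adjust them as you iterate.

```python
import re

SPAM_WORDS = ("claim", "winner")
# Assumed set of first-person words; extend as needed.
FIRST_PERSON = {"i", "me", "my", "mine", "we", "us", "our", "ours"}
# Assumed set of money symbols.
MONEY_RE = re.compile(r"[$£€]")
WORD_RE = re.compile(r"[a-z]+")


def extract_features(text):
    """Return a dict of boolean features for one SMS message."""
    words = set(WORD_RE.findall(text.lower()))
    features = {"has_" + word: word in words for word in SPAM_WORDS}
    features["has_money_symbol"] = bool(MONEY_RE.search(text))
    features["has_number"] = any(ch.isdigit() for ch in text)
    features["has_first_person"] = bool(words & FIRST_PERSON)
    return features


# Made-up example message.
example = extract_features("WINNER! Claim your £500 prize")
```

Returning a dictionary keeps each feature named, which makes it easier to see which ones help when you examine misclassified messages.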