Sklearn Tweet Classification
This project performs text classification on a tweet dataset with three classes. I try several machine learning algorithms provided by sklearn and cross-validate each one to pick the best. Before doing the classification, though, I have to deal with the state of the dataset. The raw content cannot be used directly because it still contains 'trash' information (@, #, RT, usernames, punctuation, local abbreviations, etc.), so I have to clean the data first. Here are the main steps I followed in this project:
- Load dataset
- Clean dataset of unnecessary characters (@, #, etc.)
- Word normalization (yg > yang, dgn > dengan)
- Stopword removal
- Stemming
- Train
- Test
- Report
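
The cleaning, normalization, and stopword steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the `NORMALIZATION` table, and the `STOPWORDS` set are hypothetical stand-ins, and the real lists are presumably much longer. Stemming is left out because the stemmer used here is not named.

```python
import re

# Hypothetical lookup tables for illustration only; the project's real
# abbreviation and stopword lists are not shown in this README.
NORMALIZATION = {"yg": "yang", "dgn": "dengan"}
STOPWORDS = {"yang", "dengan", "di", "dan"}


def clean_tweet(text):
    """Lowercase a tweet, strip the 'trash' tokens, and return a word list."""
    text = text.lower()
    text = re.sub(r"\brt\b", " ", text)    # retweet marker
    text = re.sub(r"@\w+", " ", text)      # @username mentions
    text = text.replace("#", " ")          # drop '#', keep the hashtag word
    text = re.sub(r"[^\w\s]", " ", text)   # remaining punctuation
    return text.split()


def normalize(tokens):
    """Expand local abbreviations, e.g. 'yg' -> 'yang'."""
    return [NORMALIZATION.get(token, token) for token in tokens]


def remove_stopwords(tokens):
    """Drop words that appear in the stopword set."""
    return [token for token in tokens if token not in STOPWORDS]


def preprocess(text):
    """Run clean -> normalize -> stopword removal and rejoin into one string."""
    return " ".join(remove_stopwords(normalize(clean_tweet(text))))


print(preprocess("RT @budi: Makan siang dgn teman yg baik! #kuliner"))
# makan siang teman baik kuliner
```

Normalization runs before stopword removal so that abbreviated stopwords such as "yg" are expanded first and then filtered out. The resulting strings are what would be passed on to the stemming, training, and testing steps.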