We developed a text classification model for the Persian language using the ArmanEmo dataset and Hugging Face models.
The ArmanEmo dataset was collected from various sources, including Persian tweets from Twitter, user comments on Instagram, and customer comments on Digikala (an online shopping platform). These sources were chosen to capture a diverse range of text reflecting individuals' emotions and opinions on social and political topics, so that the dataset represents the online platforms commonly used in Iran where people express their ideas and emotions.
The ArmanEmo dataset consists of seven classes with the following distribution:
We utilized the Hugging Face platform, which provides access to a wide range of powerful pretrained models and supporting libraries, such as its tokenizers. We experimented with various models, including ParsBERT, ALBERT, roberta_facebook, and roberta-base-ft-udpos28, to reach the desired performance. Ultimately, we achieved the highest accuracy on the test data with the persian_xlm_roberta_large model.
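As a rough illustration (not the exact notebook code), the model can be loaded for sequence classification as sketched below. The base checkpoint `xlm-roberta-large` is a stand-in for the persian_xlm_roberta_large checkpoint used in the notebooks, and `num_labels=7` reflects the seven ArmanEmo classes; both are assumptions about the setup rather than the repository's exact code.

```python
# Minimal sketch: loading a tokenizer and a sequence-classification head.
# The checkpoint ID below is a placeholder; swap in the Persian
# XLM-RoBERTa-large checkpoint used in the notebooks.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "xlm-roberta-large"  # placeholder for persian_xlm_roberta_large

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=7,  # ArmanEmo has seven emotion classes
)

# Tokenize a sample Persian sentence and pick the highest-scoring class.
inputs = tokenizer("این یک جمله‌ی نمونه است", return_tensors="pt", truncation=True)
predicted_class = model(**inputs).logits.argmax(dim=-1).item()
```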
We went through the following four steps:
- Preprocessed the train and test datasets.
- Tested different models and selected the best model.
- Saved the best model.
- Evaluated the performance using metrics such as F1 score, precision, recall, and accuracy (see the sketch below).
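The evaluation metrics can be computed roughly as sketched below. This is not the notebooks' exact code; the macro averaging and the `compute_metrics` signature expected by `transformers.Trainer` are assumptions.

```python
# Hedged sketch of the metric computation (F1, precision, recall, accuracy).
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average="macro", zero_division=0  # averaging mode is an assumption
    )
    return {
        "accuracy": accuracy_score(labels, predictions),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Passed as `Trainer(compute_metrics=compute_metrics, ...)`, this reports the same metrics on the evaluation split during training and testing.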
The evaluation results on the test data were as follows:
- Precision: 0.7328
- Recall: 0.7116
- F1 score: 0.7095
The confusion matrix on the test data was as follows:
Refer to the accompanying PDF document for detailed implementation steps and a comprehensive guide on how to execute the code. To reproduce the results:
- Download the ArmanEmo dataset using the provided link.
- Run the notebook in the "preprocess" folder [Notebook].
- Run the notebook in the "main_model" folder [Notebook].
- You can also apply the text classification to your own dataset by running the notebook in the "predict_evaluate" folder: first apply the preprocessing step to your data, then use the best model saved in the "save_best_model" folder to classify your texts (see the sketch after this list) [Notebook].
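A minimal sketch of that last step, assuming the best model was saved with `save_pretrained` into the "save_best_model" folder and that your preprocessed data is a CSV with a `text` column (the file and column names here are hypothetical):

```python
# Hedged sketch: classifying your own preprocessed texts with the saved model.
import pandas as pd
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="save_best_model",      # directory produced by the training notebook
    tokenizer="save_best_model",
)

df = pd.read_csv("my_dataset.csv")  # hypothetical preprocessed dataset
predictions = classifier(df["text"].tolist(), truncation=True)
df["predicted_label"] = [p["label"] for p in predictions]
df.to_csv("my_dataset_labeled.csv", index=False)
```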
Reference: https://huggingface.co/docs/transformers/en/tasks/sequence_classification