sabirdvd/Topic-modeling-for-Arabic-Tweets-

Topic modeling for Arabic Tweets

This code is adapted from the study "BERT for Arabic Topic Modeling: An Experimental Study on BERTopic Technique" (code).

Please refer to this blog post for more details about this repository.

huggingface

Interactive demo

Create a conda environment

conda create -n AraTop python=3.7 anaconda
conda activate AraTop

Install requirements

pip install bertopic
pip install flair

Dataset

The dataset is based on the ArabGend dataset (2022) [1]: 108,053 tweets.

Get the tweet IDs from the data file or from [1], then retrieve the tweets using the Twitter API:

pip install twarc
twarc2 hydrate ids.txt tweets.json
twarc2 hydrate twitt_ID.txt tweets.json

Convert the JSON file to CSV with twarc-csv, then select the id and text columns (csvcut is part of csvkit):

pip3 install --upgrade twarc-csv
twarc2 csv --no-json-encode-all tweets.json tweets_CSV.csv
csvcut --columns id,text tweets_CSV.csv
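If csvkit is not available, the same column selection can be sketched with Python's standard csv module. This is a minimal sketch, not part of the repository: the function name select_columns and the output file name tweets_id_text.csv are hypothetical, and it assumes the CSV produced by twarc-csv has id and text columns as used in the csvcut command above.

```python
import csv


def select_columns(src_path, dst_path, columns=("id", "text")):
    """Copy only the named columns from src_path to dst_path.

    Returns the number of data rows written. Columns missing from a row
    are written as empty strings; all other columns are dropped.
    """
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=list(columns),
                                extrasaction="ignore")
        writer.writeheader()
        count = 0
        for row in reader:
            writer.writerow(row)
            count += 1
    return count


if __name__ == "__main__":
    # Hypothetical output name; the input name matches the twarc2 csv step.
    n = select_columns("tweets_CSV.csv", "tweets_id_text.csv")
    print(f"wrote {n} rows")
```

Unlike the csvcut call, which prints to standard output, this sketch writes the reduced table to a file so that later steps can read it directly.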

To clean and pre-process the dataset

python arabic_cleaner.py
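The contents of arabic_cleaner.py are not shown here, so the following is only an illustrative sketch of the kind of cleaning commonly applied to Arabic tweets before topic modeling. The function name clean_tweet and the specific steps (removing URLs and mentions, splitting hashtags, stripping diacritics and tatweel, normalizing alef variants) are assumptions, not the repository's actual pipeline.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Arabic short-vowel marks (fathatan .. sukun) plus superscript alef.
DIACRITICS_RE = re.compile("[\u064B-\u0652\u0670]")
# Alef with madda / hamza above / hamza below.
ALEF_VARIANTS_RE = re.compile("[\u0622\u0623\u0625]")
TATWEEL = "\u0640"  # elongation character


def clean_tweet(text):
    """Return a normalized copy of one tweet (hypothetical cleaning steps)."""
    text = URL_RE.sub(" ", text)          # drop links
    text = MENTION_RE.sub(" ", text)      # drop @user mentions
    text = text.replace("#", " ").replace("_", " ")  # keep hashtag words
    text = DIACRITICS_RE.sub("", text)    # strip diacritics
    text = text.replace(TATWEEL, "")      # strip elongation
    text = ALEF_VARIANTS_RE.sub("\u0627", text)  # normalize to bare alef
    return re.sub(r"\s+", " ", text).strip()     # collapse whitespace
```

Hashtag words are kept rather than deleted because they often carry the topic of a tweet; whether the original script does the same is not stated.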

[1] ArabGend: Gender Analysis and Inference on Arabic Twitter

Training

For topic modeling via UMAP

run_umap.sh

For topic modeling via HDBSCAN

run_hdbscan.sh

For the joint model (UMAP + HDBSCAN)

run joint.sh 
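The shell scripts themselves are not reproduced here. As a rough illustration of what a joint UMAP + HDBSCAN setup can look like in BERTopic, the sketch below passes explicit UMAP and HDBSCAN instances to the BERTopic constructor. Everything specific in it is an assumption: the parameter values are illustrative, the embedding model name is a multilingual sentence-transformers model chosen for the example, and the save path arabtop_model and the function names are hypothetical.

```python
# Illustrative values only; the repository's scripts may use different ones.
UMAP_PARAMS = {
    "n_neighbors": 15,
    "n_components": 5,
    "min_dist": 0.0,
    "metric": "cosine",
    "random_state": 42,  # fixed seed for repeatable reductions
}
HDBSCAN_PARAMS = {
    "min_cluster_size": 15,
    "metric": "euclidean",
    "cluster_selection_method": "eom",
    "prediction_data": True,  # needed to assign topics to new documents
}


def build_joint_model(embedding_model="paraphrase-multilingual-MiniLM-L12-v2"):
    """Build a BERTopic model with explicit UMAP and HDBSCAN components."""
    # Imported here so the module can be loaded without the heavy packages.
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from umap import UMAP

    return BERTopic(
        embedding_model=embedding_model,
        umap_model=UMAP(**UMAP_PARAMS),
        hdbscan_model=HDBSCAN(**HDBSCAN_PARAMS),
    )


def train(docs, model_path="arabtop_model"):
    """Fit the model on a list of cleaned tweets and save it (hypothetical path)."""
    topic_model = build_joint_model()
    topics, _ = topic_model.fit_transform(docs)
    topic_model.save(model_path)
    return topic_model, topics
```

In this arrangement UMAP reduces the sentence embeddings to a few dimensions and HDBSCAN clusters the reduced vectors, so the two sets of parameters can be tuned independently.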

Inference

Load the trained model:

python infer.py
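The body of infer.py is not shown, so the following is a hedged sketch of inference with a saved BERTopic model rather than the script itself. BERTopic.load and transform are part of the BERTopic API; the path arabtop_model and the helper names load_model and assign_topics are hypothetical.

```python
def load_model(model_path="arabtop_model"):
    """Load a previously saved BERTopic model (hypothetical path)."""
    # Imported here so the helper below can be used without bertopic installed.
    from bertopic import BERTopic

    return BERTopic.load(model_path)


def assign_topics(topic_model, docs):
    """Pair each document with its predicted topic id.

    Works with any object whose transform(docs) returns
    (topic_ids, probabilities), as BERTopic's does.
    """
    topics, _ = topic_model.transform(docs)
    return [(doc, int(topic)) for doc, topic in zip(docs, topics)]


if __name__ == "__main__":
    model = load_model()
    for doc, topic in assign_topics(model, ["example cleaned tweet"]):
        print(topic, doc)
```

In BERTopic, a topic id of -1 marks a document that was treated as an outlier and not assigned to any topic.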

Citation

If you cite this blog, please use the following reference:

@misc{ahmed2022arabtop,
  author       = {Ahmed},
  title        = {ArabTop 2023: Insights into Arab World Trends},
  year         = {2023},
  howpublished = {\url{https://ahmed.jp/blog/2022-12-ArabTop/ArabTop_2022.html}},
}

Acknowledgment

The implementation of the project relies on resources from BERTopic, Huggingface Transformers, and SBERT. We thank the original authors for their well-organized codebase.