Topic modeling for Arabic Tweets

This code is adapted from the study "BERT for Arabic Topic Modeling: An Experimental Study on BERTopic Technique" (code).

Please refer to this blog post for more details about this repository.

Dataset

The dataset is based on the ArabGend dataset (2022) [1] and contains 108,053 tweets.

Get the tweet IDs from the data file or from [1], then retrieve the tweets using the Twitter API:

pip install twarc
twarc2 hydrate ids.txt tweets.json
twarc2 hydrate twitt_ID.txt tweets.json

Convert the JSON file to CSV with twarc-csv:

pip3 install --upgrade twarc-csv
twarc2 csv --no-json-encode-all tweets.json tweets_CSV.csv
csvcut --columns id,text tweets_CSV.csv
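If csvcut is not available, the same id/text extraction can be sketched in plain Python. This is a minimal sketch, not part of the repository: it assumes each line of tweets.json is a JSON object that is either an API response page with a `data` list or a single tweet object, and the function names `iter_tweets` and `write_id_text_csv` are hypothetical.

```python
import csv
import io
import json


def iter_tweets(lines):
    """Yield (id, text) pairs from hydrated tweet JSON lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumption: a line is either a response page with a "data" field
        # or a bare tweet object; both shapes are handled here.
        data = record.get("data", record)
        tweets = data if isinstance(data, list) else [data]
        for tweet in tweets:
            if "id" in tweet and "text" in tweet:
                yield str(tweet["id"]), tweet["text"]


def write_id_text_csv(lines, out):
    """Write an id,text CSV to the file-like object `out`; return the row count."""
    writer = csv.writer(out)
    writer.writerow(["id", "text"])
    count = 0
    for tweet_id, text in iter_tweets(lines):
        writer.writerow([tweet_id, text])
        count += 1
    return count


# Small in-memory demonstration with made-up records.
sample = [
    '{"data": [{"id": "1", "text": "first"}, {"id": "2", "text": "second"}]}',
    '{"id": "3", "text": "single tweet"}',
]
buffer = io.StringIO()
n_rows = write_id_text_csv(sample, buffer)

# With real files (file names taken from the commands above):
# with open("tweets.json", encoding="utf-8") as src, \
#         open("tweets_CSV.csv", "w", newline="", encoding="utf-8") as dst:
#     write_id_text_csv(src, dst)
```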

To clean and preprocess the dataset:

python arabic_cleaner.py
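The contents of arabic_cleaner.py are not shown here, so the following is only a sketch of the kind of cleaning commonly applied to Arabic tweets before topic modeling: dropping URLs and mentions, stripping diacritics and tatweel, unifying alef variants, and removing non-Arabic characters. The function name `clean_tweet` and the exact rules are assumptions, not the repository's implementation.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Arabic diacritics (U+064B to U+0652) plus tatweel (U+0640).
DIACRITICS_RE = re.compile("[\u064B-\u0652\u0640]")
# Alef with madda, hamza above, and hamza below.
ALEF_VARIANTS_RE = re.compile("[\u0622\u0623\u0625]")
# Anything that is not an Arabic letter (U+0621 to U+064A) or whitespace.
NON_ARABIC_RE = re.compile("[^\u0621-\u064A\\s]")


def clean_tweet(text):
    """Return a normalized, Arabic-only version of a tweet."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = DIACRITICS_RE.sub("", text)
    text = ALEF_VARIANTS_RE.sub("\u0627", text)  # map variants to bare alef
    text = NON_ARABIC_RE.sub(" ", text)          # drops emoji, digits, Latin text
    return " ".join(text.split())                # collapse repeated whitespace
```

Removing URLs and mentions before the non-Arabic filter keeps Arabic usernames out of the cleaned text; hashtags lose the `#` sign but keep their Arabic words.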

[1] ArabGend: Gender Analysis and Inference on Arabic Twitter

Training

For topic modeling via UMAP:

run_umap.sh

For topic modeling via HDBSCAN:

run_hdbscan.sh

For the joint model (UMAP + HDBSCAN):

joint.sh
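For readers unfamiliar with how UMAP and HDBSCAN plug into BERTopic, a joint configuration might look like the sketch below. The parameter values mirror BERTopic's documented defaults rather than whatever the training scripts actually use, the embedding model name is an assumption (a multilingual SBERT model), and `joint_params` and `build_topic_model` are hypothetical helpers. The third-party imports are deferred until the model is built.

```python
def joint_params(n_neighbors=15, n_components=5, min_cluster_size=10):
    """Return (umap_params, hdbscan_params) as plain dictionaries."""
    umap_params = {
        "n_neighbors": n_neighbors,    # size of the local neighborhood
        "n_components": n_components,  # dimensionality after reduction
        "min_dist": 0.0,
        "metric": "cosine",
        "random_state": 42,            # assumed seed, for reproducible runs
    }
    hdbscan_params = {
        "min_cluster_size": min_cluster_size,  # smallest allowed topic
        "metric": "euclidean",
        "cluster_selection_method": "eom",
        "prediction_data": True,       # needed to assign new documents later
    }
    return umap_params, hdbscan_params


def build_topic_model(umap_params, hdbscan_params,
                      embedding_model="paraphrase-multilingual-MiniLM-L12-v2"):
    """Build a BERTopic model with custom UMAP and HDBSCAN sub-models."""
    # Imported here so the parameter helper works without these packages.
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from umap import UMAP

    return BERTopic(
        embedding_model=embedding_model,
        umap_model=UMAP(**umap_params),
        hdbscan_model=HDBSCAN(**hdbscan_params),
    )


# Usage, given a list of cleaned tweets `docs`:
# topic_model = build_topic_model(*joint_params())
# topics, probs = topic_model.fit_transform(docs)
```

In this setup, `n_neighbors` and `n_components` control the dimensionality reduction and `min_cluster_size` controls how small a topic may be, so UMAP-only, HDBSCAN-only, and joint runs can be expressed by varying one group of parameters, the other, or both.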

Inference

Load the trained model:

python infer.py
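What infer.py does internally is not shown here. As a rough sketch, assuming the model was saved with BERTopic, inference could load it with `BERTopic.load` and call `transform` on new tweets; the helpers `load_topic_model` and `group_by_topic` and the model path in the usage comment are hypothetical.

```python
from collections import defaultdict


def load_topic_model(path):
    """Load a saved BERTopic model from `path`."""
    from bertopic import BERTopic  # deferred third-party import

    return BERTopic.load(path)


def group_by_topic(topic_model, docs):
    """Assign each document to a topic and group documents by topic id."""
    # transform() returns (topics, probabilities); topic -1 marks outliers.
    topics, _probs = topic_model.transform(docs)
    grouped = defaultdict(list)
    for doc, topic in zip(docs, topics):
        grouped[int(topic)].append(doc)
    return dict(grouped)


# Usage (the path is a placeholder, not a file in this repository):
# model = load_topic_model("path/to/saved_model")
# print(group_by_topic(model, cleaned_tweets))
```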

Citation

If you cite this blog, please use the following reference:

@misc{ahmed2022arabtop,
  author       = {Ahmed},
  title        = {ArabTop 2023: Insights into Arab World Trends},
  year         = {2023},
  howpublished = {\url{https://ahmed.jp/blog/2022-12-ArabTop/ArabTop_2022.html}},
}

Acknowledgment

This project relies on resources from BERTopic, Hugging Face Transformers, and SBERT. We thank the original authors for their well-organized codebases.

sabirdvd/Topic-modeling-for-Arabic-Tweets-

Folders and files

Latest commit

History

Repository files navigation

Topic modeling for Arabic Tweets

Table of Contents

Dataset

Training

Inference

Citation

Acknowledgment

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages