SEERa*: An Open-Source Framework for Future Community Prediction

*Seer: A person who practiced divination in ancient Greece; to foresee, foretell, predict, or prophesy.

SEERa is an open-source, extensible, end-to-end Python framework for predicting future user communities in a text-streaming social network (e.g., Twitter) based on users' temporal topics of interest. We model inter-user topical affinities at each time interval via streams of temporal graphs. Because users' topics of interest, and hence their inter-user topical affinities, change over time, the framework uses temporal graph embedding methods to learn temporal vector representations for users. We then predict user communities in future time intervals based on the final locations of users' vectors in the latent space. The framework employs a layered software design that adds modularity, maintainability, ease of extensibility, and stability against customization and ad hoc changes to its components, including topic modeling, user modeling, temporal user embedding, user community prediction, and evaluation. It also offers one-stop-shop access to future communities to improve recommendation systems and advertising campaigns. The framework has been benchmarked on a Twitter dataset and showed improvements over the state of the art in underlying applications such as news article recommendation and user prediction (see Benchmark Result below).

  1. Demo
  2. Structure
  3. Setup
  4. Quickstart
  5. Benchmark Result
  6. License
  7. Citation

1. 🎥 Demo

Tutorials: 1) Overview 2) Quickstart (Colab Notebook) 3) Extension

Workflow Layers

2. Structure

Framework Structure

Our framework has six major layers: Data Access Layer (dal), Topic Modeling Layer (tml), User Modeling Layer (uml), Graph Embedding Layer (gel), Community Prediction Layer (cpl), and Application Layer (apl), which is the last layer, as shown in the figure above.
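
To make the user modeling step more concrete, the sketch below builds one user similarity graph per time interval from users' topic-interest vectors. It is a minimal illustration rather than the framework's actual code: the function name `user_similarity_graph`, the use of cosine similarity, and the `threshold` value are all assumptions, and the input is taken with users as rows (transpose it if your matrix is stored as #Topics × #Users).

```python
import numpy as np

def user_similarity_graph(topic_interests: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Return a weighted adjacency matrix for one time interval.

    topic_interests: (#users x #topics) matrix of topic weights per user.
    Cosine similarity and the threshold are illustrative assumptions.
    """
    norms = np.linalg.norm(topic_interests, axis=1, keepdims=True)
    norms[norms == 0] = 1.0            # users with no activity keep a zero vector
    unit = topic_interests / norms
    sim = unit @ unit.T                # pairwise cosine similarity
    np.fill_diagonal(sim, 0.0)         # drop self-loops
    sim[sim < threshold] = 0.0         # keep only sufficiently similar pairs
    return sim

# Hypothetical toy input: 3 users, 2 topics, 2 consecutive days.
days = [
    np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]),
    np.array([[0.5, 0.5], [0.1, 0.9], [0.2, 0.8]]),
]
graphs = [user_similarity_graph(day) for day in days]  # stream of temporal graphs
```

On day 0 only users 0 and 1 are connected; on day 1 all three users are connected, which is the kind of change over time that the temporal embedding layer is meant to capture.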

Each layer processes the output of the previous layer and produces new data for the next layer, as explained below. Sample outputs on toy data are available at ./output/toy:

├── {#Topics}topics.csv                           -> N topics with their top 10 vocabulary set and probabilities
├── {#Topics}topics.model                         -> The topic model (e.g., LDA, GSDMM, BTM, ...)
├── {#Topics}TopicsDictionary.mm                  -> Dictionary of tokens/words
├── graphs
│   ├── Day{K}userSimilarities.npz  ->
│   ├── graphs.npz[.pkl]            ->
├── Day{K}UserIDs.pkl               -> User IDs for K-th day [Size: #Users × 1]
├── Day{K}UsersTopicInterests.pkl   -> Matrix of users to topics [Size: #Topics × #Users]
├── Users.npy                       -> User IDs [Size: #Users × 1]
├── Embeddings.pkl -> Embedded user graphs [Size: #Days-lookback × #Users × Embedding dim]
├── cluster2user.csv[.pkl]      -> 
├── ClusterTopic.csv[.pkl]      -> 
├── Graph.adjlist               -> Final predicted user graph for the future from last embeddings
├── Pred_users_similarity.npz   -> 
├── PredUserClusters.npy[.csv]  -> Cluster ID for each user [Size: #Users × 1]
├── user2cluster.csv[.pkl]      -> 
├── evl                                     ->
|   ├── Pred.Eval.csv                       ->
|   ├── Pred.Eval.Mean.csv                  ->
|   ├── UserMentions.pkl                    ->
├── NewsIds_ExpandedURLs.npy                ->
├── NewsTopics.pkl                          ->
├── RecommendationTableUser.pkl             ->
├── topRecommendationMentionerUser.pkl      ->
├── TopRecommendationsUser.pkl              ->
├── users_mentions_mentioned_user.pkl       ->

------------------*Previous*
├── ClusterNumbers.npy              -> Cluster IDs [Size: #Communities]
├── NewsIds.npy                     -> News IDs [Size: #News × 1]
├── CommunitiesTopicInterests.npy   -> Topic vector for each community [Size: #Communities × #Topics]
├── NewsTopics.npy                  -> Topic vector for each news article [Size: #News × #Topics]
├── RecommendationTable.npy         -> Recommendations scores of news articles for each community [Size: #Communities × #News]
├── TopRecommendations.npy          -> TopK recommendations scores of news articles for each community [Size: #Communities × TopK]
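
As a rough illustration of how a cluster ID per user can be derived from the last embeddings, the sketch below thresholds the cosine similarity between users' final vectors and labels connected components as communities. This is a stand-in under stated assumptions, not necessarily the clustering the framework uses: `predict_communities`, the `threshold` value, and the toy user IDs are hypothetical, and the variable names only loosely mirror the output file names listed above.

```python
import numpy as np

def predict_communities(embeddings: np.ndarray, threshold: float = 0.8) -> np.ndarray:
    """Assign a community ID to each user from the last time step.

    embeddings: array of shape (#time steps, #users, embedding dim).
    Communities are connected components of the thresholded
    cosine-similarity graph (an illustrative choice).
    """
    last = embeddings[-1]
    norms = np.linalg.norm(last, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    unit = last / norms
    adjacency = (unit @ unit.T) >= threshold

    n_users = last.shape[0]
    labels = np.full(n_users, -1, dtype=int)
    next_label = 0
    for start in range(n_users):
        if labels[start] != -1:
            continue
        # Depth-first traversal labels one connected component.
        labels[start] = next_label
        stack = [start]
        while stack:
            u = stack.pop()
            for v in np.flatnonzero(adjacency[u]):
                if labels[v] == -1:
                    labels[v] = next_label
                    stack.append(v)
        next_label += 1
    return labels

# Hypothetical toy input: 2 time steps, 4 users, 2-dimensional embeddings.
embeddings = np.array([
    [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.1, 1.0]],
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]],
])
user_ids = np.array([101, 102, 103, 104])
pred_user_clusters = predict_communities(embeddings)

# Two lookup tables: user -> cluster and cluster -> users.
user2cluster = {int(u): int(c) for u, c in zip(user_ids, pred_user_clusters)}
cluster2user = {}
for user, cluster in user2cluster.items():
    cluster2user.setdefault(cluster, []).append(user)
```

Only the final time step decides the communities here, so users 101 and 102 end up together even though they were far apart at the first step.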

Code Structure

+---data
|   +---toy.synthetic
|   |   +---News.csv
|   |   +---TweetEntities.csv
|   |   \---Tweets.csv
|   |
|   +---toy
|   |   +---News.csv
|   |   +---TweetEntities.csv
|   |   +---Tweets.csv
|   |   \---readme.md
|   |
+---src
|   +---cmn (common functions)
|   |   \---Common.py
|   |
|   +---dal  (data access layer)
|   |   +---DataPreparation.py
|   |   \---DataReader.py
|   |
|   +---tml  (topic modeling layer)
|   |   \---TopicModeling.py
|   |
|   +---uml (user modeling layer)
|   |   +---UsersGraph.py
|   |   \---UserSimilarities.py
|   |
|   +---gel (graph embedding layer)
|   |   +---CppWrapper.py
|   |   +---GraphEmbedding.py
|   |   \---GraphToText.py
|   |
|   +---cpl (community prediction layer)
|   |   +---GraphClustering.py
|   |   \---GraphReconstruction_main.py
|   |
|   +---apl (application layer)
|   |   +---NewsTopicExtraction.py
|   |   +---NewsRecommendation.py
|   |   +---News.py
|   |   +---NewsCrawler.py
|   |   \---ModelEvaluation.py
|   |
|   +---params.py
|   +---ParamsTemplate.py
|   \---main.py
|
+---environment.yml
+---quickstart.ipynb
\---requirements.txt

3. Setup

SEERa has been developed on Python 3.6 and can be installed by conda or pip:

# Option 1: conda
git clone https://github.com/fani-lab/seera.git
cd seera
conda env create -f environment.yml
conda activate seera

# Option 2: pip
git clone https://github.com/fani-lab/seera.git
cd seera
pip install -r requirements.txt

Either installation route installs compatible versions of the following libraries:

  • tml: gensim, tagme, nltk, pandas, requests, bitermplus
  • gel: networkx
  • others: scikit-network, scikit-learn, sklearn, numpy, scipy, matplotlib

Additionally, you need to install the following libraries from their source:

git clone https://github.com/rwalk/gsdmm.git
cd gsdmm
python setup.py install
cd ..
git clone https://github.com/palash1992/DynamicGEM.git
cd DynamicGEM
python setup.py install
pip install tensorflow==1.11.0 --force-reinstall #may be needed

4. Quickstart

Data

We crawled and stored ~2.9M Twitter posts (tweets) over 2 consecutive months, from 2010-11-01 to 2010-12-31. Tweet IDs are provided at ./data/TweetIds.csv so that the tweets can be retrieved from Twitter using tools like hydrator.

For quickstart purposes, a toy sample of tweets between 2010-12-01 and 2010-12-04 has been provided at ./data/toy/Tweets.csv.

Run

This framework contains six layers. Each layer is controlled by several parameters, e.g., the number of topics, which you can adjust via ./src/params_template.py.

You can run the framework via ./src/main.py with the following command:

cd ../src
python -u main.py -r toy -t lda.gensim lda.mallet gsdmm btm -g AE DynAE DynAERNN

where the input arguments are:

-r: A unique description for the run, for example test1, required.

-t: A list of topic modeling methods among {lda.gensim, lda.mallet, gsdmm, btm}, required, case-insensitive.

-g: A list of graph embedding methods among {AE, DynAE, DynRNN, DynAERNN}, required, case-insensitive.

-p: A flag for the run to be time-profiled, optional.
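
The flags above can be mirrored with a small `argparse` parser. This is a hypothetical sketch for illustration only; the actual parsing in main.py may differ. In particular, lowercasing the values to get case-insensitive matching and the names `build_parser`, `TOPIC_METHODS`, and `EMBEDDING_METHODS` are assumptions.

```python
import argparse

# Method names taken from the argument descriptions, stored lowercase
# so that matching can be case-insensitive.
TOPIC_METHODS = ["lda.gensim", "lda.mallet", "gsdmm", "btm"]
EMBEDDING_METHODS = ["ae", "dynae", "dynrnn", "dynaernn"]

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Hypothetical mirror of the documented flags")
    parser.add_argument("-r", required=True, help="unique description for the run")
    parser.add_argument("-t", required=True, nargs="+", type=str.lower,
                        choices=TOPIC_METHODS, help="topic modeling methods")
    parser.add_argument("-g", required=True, nargs="+", type=str.lower,
                        choices=EMBEDDING_METHODS, help="graph embedding methods")
    parser.add_argument("-p", action="store_true", help="time-profile the run")
    return parser

# Parse a command line similar to the one shown above.
args = build_parser().parse_args("-r toy -t lda.gensim GSDMM -g AE DynAERNN".split())
```

Because `type=str.lower` is applied before `choices` is checked, `GSDMM` and `gsdmm` are accepted alike, while an unknown method name makes the parser exit with an error.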

A run produces an output folder at ./output/{r} with one subfolder per pair of topic modeling and graph embedding methods as baselines, e.g., lda.AE, lda.DynAE, and lda.DynAERNN. The final evaluation results are aggregated in ./output/{r}/pred.eval.mean.csv. See an example run on the toy dataset at ./output/toy.

5. Benchmark Result

| Method | News Recommendation | | | User Prediction | | |
|---|---|---|---|---|---|---|
| | mrr | ndcg5 | ndcg10 | Precision | Recall | f1-measure |
| **Community Prediction** | | | | | | |
| Fani et al. [ECIR'20] | 0.255 | 0.108 | 0.105 | 0.012 | 0.035 | 0.015 |
| Appel et al. [PKDD'18] | 0.176 | 0.056 | 0.055 | 0.007 | 0.094 | 0.0105 |
| **Temporal community detection** | | | | | | |
| Hu et al. [SIGMOD'15] | 0.173 | 0.056 | 0.049 | 0.007 | 0.136 | 0.013 |
| Fani et al. [CIKM'17] | 0.065 | 0.040 | 0.040 | 0.007 | 0.136 | 0.013 |
| **Non-temporal link-based community detection** | | | | | | |
| Ye et al. [CIKM'18] | 0.139 | 0.056 | 0.055 | 0.008 | 0.208 | 0.014 |
| Louvain [JSTAT'08] | 0.108 | 0.048 | 0.055 | 0.004 | 0.129 | 0.007 |
| **Collaborative filtering** | | | | | | |
| rrn [WSDM'17] | 0.173 | 0.073 | 0.08 | 0.004 | 0.740 | 0.008 |
| timesvd++ [KDD'08] | 0.141 | 0.058 | 0.064 | 0.003 | 0.657 | 0.005 |
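
For readers unfamiliar with the reported metrics, the sketch below shows one common way to compute reciprocal rank (averaged over users to obtain mrr), nDCG at a cutoff k, and precision, recall, and f1. It assumes binary relevance and uses hypothetical item IDs; the framework's own evaluation may compute these differently, so treat the definitions here as an illustration rather than the exact procedure behind the table.

```python
import math

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    """Normalized discounted cumulative gain at cutoff k with binary relevance."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, item in enumerate(ranked[:k], start=1)
              if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

def precision_recall_f1(predicted, actual):
    """Set-based precision, recall, and f1 between predicted and actual items."""
    predicted, actual = set(predicted), set(actual)
    hits = len(predicted & actual)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# Hypothetical example: one user's ranked news recommendations.
ranked_news = ["n3", "n1", "n7"]
relevant_news = {"n1"}
rr = reciprocal_rank(ranked_news, relevant_news)
ndcg5 = ndcg_at_k(ranked_news, relevant_news, k=5)

# Hypothetical example: predicted versus actual users.
precision, recall, f1 = precision_recall_f1(["u1", "u2", "u3", "u4"], ["u2", "u9"])
```

In this example the single relevant article appears at rank 2, so the reciprocal rank is 0.5 and nDCG@5 is 1 / log2(3).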

6. License

©2021. This work is licensed under a CC BY-NC-SA 4.0 license.

Authors

Soroush Ziaeinejad¹,², Hossein Fani¹,³

¹School of Computer Science, Faculty of Science, University of Windsor, ON, Canada.

²ziaeines@uwindsor.ca, soroushziaeinejad@gmail.com
³hfani@uwindsor.ca

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Acknowledgments

In this work, we use bitermplus, dynamicgem, mallet, pytrec_eval and other libraries. We would like to thank the authors of these libraries.

7. Citation

@inproceedings{DBLP:conf/cikm/ZiaeinejadSF22,
  author    = {Soroush Ziaeinejad and Saeed Samet and Hossein Fani},
  title     = {SEERa: {A} Framework for Community Prediction},
  booktitle = {Proceedings of the 31st {ACM} International Conference on Information {\&} Knowledge Management, Atlanta, GA, USA, October 17-21, 2022},
  pages     = {4762--4766},
  publisher = {{ACM}},
  year      = {2022},
  url       = {https://doi.org/10.1145/3511808.3557529},
  doi       = {10.1145/3511808.3557529},
  biburl    = {https://dblp.org/rec/conf/cikm/ZiaeinejadSF22.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
@inproceedings{DBLP:conf/ecir/FaniBD20,
  author    = {Hossein Fani and Ebrahim Bagheri and Weichang Du},
  title     = {Temporal Latent Space Modeling for Community Prediction},
  booktitle = {Advances in Information Retrieval - 42nd European Conference on {IR} Research, {ECIR} 2020, Lisbon, Portugal, April 14-17, 2020, Proceedings, Part {I}},
  series    = {Lecture Notes in Computer Science},
  volume    = {12035},
  pages     = {745--759},
  publisher = {Springer},
  year      = {2020},
  url       = {https://doi.org/10.1007/978-3-030-45439-5\_49},
  doi       = {10.1007/978-3-030-45439-5\_49},
  biburl    = {https://dblp.org/rec/conf/ecir/FaniBD20.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}