NMF implementation for two cases:
- if you know the number of topics: nmf_fixed_k.py
- if you don't know the number of topics: nmf_unknown_k.py
LDA implementation, including a grid search for an unknown number of topics (k): lda.py
It is included for the sake of completeness, since
- most of the code is the same as for NMF (hence a little redundant)
- the LDA results are not as good as those of NMF
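A grid search over k can be sketched with scikit-learn's `GridSearchCV`, which ranks `LatentDirichletAllocation` models by their built-in `score` (an approximate log-likelihood). This assumes the approach of the linked tutorials; lda.py may differ, and the toy corpus and candidate values below are made up for illustration.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

# Hypothetical toy corpus, repeated so that each CV fold sees every theme.
docs = [
    "henry king england throne marriage",
    "elizabeth queen england throne king",
    "architect building design furniture",
    "design architecture building office",
    "film actress hollywood cinema star",
    "actor film hollywood star cinema",
] * 2

# LDA works on raw term counts rather than TF-IDF weights.
counts = CountVectorizer().fit_transform(docs)

search = GridSearchCV(
    LatentDirichletAllocation(random_state=0, max_iter=10),
    param_grid={"n_components": [2, 3, 4]},  # candidate values for k
    cv=2,
)
search.fit(counts)

best_k = search.best_params_["n_components"]
print("best k:", best_k)
```

On a corpus this small the selected k is not meaningful; the sketch only shows the mechanics of the search.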
A simple Keras implementation of a text multiclass classifier (with known classes): keras_simple_classifier.py
A topic can be represented, or interpreted, by the most important tokens / phrases of its documents. Sometimes this is not as clear as one would like.
These scripts:
- identify_topic.py
- _request_wikipedia.py
try to solve this problem by querying Wikipedia with the top tokens of each document and processing the returned categories for each topic.
The results are quite satisfying, as shown in the following example:
Top phrases from each topic:
[
[
"henry", "england", "elizabeth", "king", "anne", "marriage", "death", "son", "throne", "college"
],
[
"design", "architect", "architecture", "niemeyer", "building", "office", "movement", "designer", "furniture", "site"
],
[
"film", "swanson", "keaton", "bow", "hollywood", "actress", "cinema", "star", "pickford", "actor"
]
]
Top 3 phrases for the same topics from Wikipedia category phrase processing:
topic 0: 16th century | english | monarchs
topic 1: american | architects | 20th century
topic 2: american | actresses | 20th century
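The category processing step can be pictured roughly as counting the most frequent words across all category names collected for a topic. The sketch below is an assumption about the general idea, not the logic of identify_topic.py: the function name, the stopword list and the category strings are hypothetical, and no request to Wikipedia is made.

```python
import re
from collections import Counter

# Hypothetical stopword list for category names.
STOPWORDS = {"of", "the", "in", "and", "from", "by"}


def top_category_words(categories, n=3):
    """Return the n most frequent non-stopword tokens across category names."""
    counts = Counter()
    for category in categories:
        # Keep hyphenated tokens such as "16th-century" together.
        for word in re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", category.lower()):
            if word not in STOPWORDS:
                counts[word] += 1
    return [word for word, _ in counts.most_common(n)]


# Made-up category names, as they might be collected for one topic.
categories_topic_0 = [
    "16th-century English monarchs",
    "English monarchs",
    "House of Tudor",
    "16th-century English people",
]

print(" | ".join(top_category_words(categories_topic_0)))
```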
The directories in /data:
- source_texts:
Excerpts of Wikipedia biographies falling into 3 broad topics:
- Tudor dynasty (marked with "a")
- Midcentury architects / designers (marked with "b")
- Stars of the silent movie era (marked with "c")
- target_texts:
Very short texts based on the source texts with varying similarity, marked according to the source texts. Also, one text about a movie star not included in the source texts and one text about "Charlie Brown" without any topic affiliation (marked with "d").
This data corresponds to: https://github.com/zushicat/text-similarity-extractive
- "Topic Analysis": https://monkeylearn.com/topic-analysis/
- "Latent Semantic Analysis using Python": https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python
- "Stemming and Lemmatization in Python": https://www.datacamp.com/community/tutorials/stemming-lemmatization-python
- "An Intuitive Understanding of Word Embeddings: From Count Vectors to Word2Vec": https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
- Scikit Learn documentation
- "LDA": https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html
- "NMF": https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html
- "NMF Example": https://scikit-learn.org/0.15/auto_examples/applications/topics_extraction_with_nmf.html
- "Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation": https://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html
- General
- "Topic Modeling - Intro & Implementation": https://www.kaggle.com/akashram/topic-modeling-intro-implementation
- "Topic Modelling with Scikit-learn": http://derekgreene.com/slides/topic-modelling-with-scikitlearn.pdf
- "Using Machine Learning to Analyze Taylor Swift's Lyrics": https://news.codecademy.com/taylor-swift-lyrics-machine-learning/
- "LDA in Python – How to grid search best topic models?": https://www.machinelearningplus.com/nlp/topic-modeling-python-sklearn-examples/
- "Topic Modeling Quora Questions with LDA & NMF": https://towardsdatascience.com/topic-modeling-quora-questions-with-lda-nmf-aff8dce5e1dd
- Hyperparameter Tuning
- "Topic Modeling using NMF and LDA using sklearn": https://shravan-kuchkula.github.io/topic-modeling/#gridsearch-the-best-lda-model
- "LDA in Python – How to grid search best topic models?": https://www.machinelearningplus.com/nlp/topic-modeling-python-sklearn-examples/#11howtogridsearchthebestldamodel
- Topic Coherence (unknown number of topics k)
- "Evaluation of Topic Modeling: Topic Coherence": https://datascienceplus.com/evaluation-of-topic-modeling-topic-coherence/
- "Evaluate Topic Models: Latent Dirichlet Allocation (LDA)": https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0
- "derekgreene/topic-model-tutorial": https://github.com/derekgreene/topic-model-tutorial/blob/master/3%20-%20Parameter%20Selection%20for%20NMF.ipynb
- "Topic modelling with NMF": https://nbviewer.jupyter.org/urls/gitlab8.trifork.nl/sofiah/topic-modelling-blog/raw/master/notebooks/topic-modelling-nmf.ipynb