---
title: "A Hyperparameter-Optimized Latent Topic Model"
author: "Siddharth Nand"
bibliography: references.bib
geometry: "left=0.9in,right=0.9in,top=.9in,bottom=.9in"
output:
  pdf_document:
    latex_engine: xelatex
    toc: false
    number_sections: false
    fig_caption: true
    fig_height: 2
    fig_width: 5
    highlight: arrow
    df_print: kable
header-includes:
  - \usepackage{float}
---
```{r setup, include=FALSE}
library(tidyverse)
library(knitr)
library(stringr)
library(rstan)
library(tidytext)
library(textdata)
library(SnowballC)
library(tm)
library(caret)
library(topicmodels)
library(ggplot2)
library(magrittr)
library(mcmcse)
library(bayesplot)
setwd("/Users/siddharthnand/Library/CloudStorage/OneDrive-UBC/Undergrad Courses/W2023/STAT 477C Special Topics in Statistics - BAYESIAN STATS/Project")
```
# Introduction
Topic modeling uncovers hidden thematic structures within a set of documents. It represents documents as mixtures of topics, where each topic is a probability distribution over words. This project introduces a novel topic modeling approach with a custom likelihood function and flexible parameterization. Users can adjust how word frequencies influence topics and the model's robustness to rare words.
The GitHub repository for this project can be found at [https://github.com/sidnand/Hyperparameter-Optimized-Latent-Topic-Model](https://github.com/sidnand/Hyperparameter-Optimized-Latent-Topic-Model).
# Literature Review
The origins of topic models lie in Latent Dirichlet Allocation (LDA) [@blei2003latent], a seminal work that introduced the probabilistic framework for unsupervised discovery of latent thematic structures. LDA represents documents as mixtures of probability distributions over words, where these distributions correspond to topics. Through Bayesian inference techniques, LDA estimates both the topic-word distributions (defining the vocabulary associated with each topic) and the topic proportions within each document (indicating how strongly a document relates to different topics). The Bayesian approach provides a natural way to incorporate prior beliefs, handle uncertainty, and prevent overfitting.
Building upon LDA's foundation, subsequent research has greatly enhanced the flexibility and capabilities of Bayesian topic modeling. Non-parametric models like the Hierarchical Dirichlet Process (HDP) [@teh2004sharing] address the constraint of pre-specifying the number of topics. HDP infers an optimal number of topics directly from the data, allowing models to adapt to the inherent complexity of the corpus. Further advancements have focused on capturing the relationships between topics, such as the Correlated Topic Model (CTM) [@blei2006correlated], which relaxes LDA's assumption of topic independence and allows for richer representations. For corpora where the content evolves over time, Dynamic Topic Models (DTM) [@blei2006dynamic] introduce a temporal dimension, modeling the evolution of topics and their changing word associations.
The power and adaptability of Bayesian topic models have led to their widespread adoption across a multitude of domains. They are invaluable tools in information retrieval and text classification [@blei2003latent], sentiment analysis and opinion mining [@titov2008modeling], social network analysis [@chang2009relational], and the exploration of scientific literature [@hall2008studying]. Bayesian topic modeling remains an active and vibrant research area, with ongoing developments in areas such as deep generative models, topic model interpretability, and applications to new data modalities.
# Problem Formulation
## Data Definitions
Our analysis begins with a word-document matrix $X$ of observed word frequencies, where each entry $X_{i,j}$ denotes the number of occurrences of word $i$ in document $j$. Throughout, $m$ denotes the number of topics, $n$ the size of the vocabulary (number of unique words), and $d$ the total number of documents in the corpus.
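As a concrete (hypothetical) illustration, a corpus of $d = 3$ short documents over a vocabulary of $n = 4$ words might yield the following word-document matrix:
```{r, toy_word_doc_matrix}
# Hypothetical 4-word x 3-document matrix: X[i, j] counts word i in document j
X_toy <- matrix(
  c(2, 0, 1,   # "market"
    0, 3, 0,   # "profit"
    1, 1, 2,   # "share"
    0, 0, 1),  # "growth"
  nrow = 4, byrow = TRUE,
  dimnames = list(
    word = c("market", "profit", "share", "growth"),
    document = paste0("doc", 1:3)
  )
)
X_toy
```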
## Model
### Prior Distribution
The topic distribution for each word $i$ is modeled using a Dirichlet distribution, $\theta_i \sim \mathrm{Dirichlet}(\mathbf{a})$. This distribution is ideal for topic modeling because it represents probability distributions over categories (like our topics). It ensures that topic probabilities for each word sum to 1 and allows words to have varying probabilities of belonging to multiple topics. The Dirichlet concentration parameter, $\mathbf{a}$ (a vector with one value per topic), controls the sparsity of topic distributions. Higher values of $\mathbf{a}$ lead to more even probability distributions across topics, resulting in broader topics. Conversely, lower values of $\mathbf{a}$ encourage words to concentrate their probability mass on a smaller number of topics, yielding more distinct and focused thematic groups.
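As a minimal base-R sketch of this effect (sampling a Dirichlet by normalizing independent Gamma draws, so no packages beyond those already loaded are assumed), compare a draw under a small versus a large concentration parameter:
```{r, dirichlet_concentration_sketch}
# Draw one Dirichlet(a) sample by normalizing independent Gamma(a_k, 1) draws
rdirichlet_one <- function(a) {
  g <- rgamma(length(a), shape = a, rate = 1)
  g / sum(g)
}

set.seed(1)
round(rdirichlet_one(rep(0.1, 3)), 3) # sparse: mass concentrates on few topics
round(rdirichlet_one(rep(10, 3)), 3)  # dense: mass spread evenly across topics
```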
### Likelihood Function
The likelihood function models how likely it is to observe our word-document matrix $X$, given a particular set of topic distributions $\theta$. $\theta$ is a matrix where each row corresponds to a word and each column corresponds to a topic. The core assumption is that words that appear frequently within a topic are more likely to be strongly associated with that topic. We define the likelihood function as follows:
$$\mathbb{P}(X \mid \theta) = \prod_{i=1}^{n} \prod_{k=1}^{m} \left(\frac{\theta_{i,k} + \beta}{\sum_{k'=1}^{m} \theta_{i,k'} + \beta \cdot m}\right)^{\alpha \cdot \sum_{j=1}^{d} X_{i,j}}$$
The log-likelihood function is used for computational efficiency and numerical stability:
$$\log \mathbb{P}(X \mid \theta) = \sum_{i=1}^{n} \sum_{k=1}^{m} \left(\alpha \cdot \sum_{j=1}^{d} X_{i,j}\right) \log\left(\frac{\theta_{i,k} + \beta}{\sum_{k'=1}^{m} \theta_{i,k'} + \beta \cdot m}\right)$$
The model includes two key parameters that control how word frequencies and topic assignments are related. The influence parameter ($\alpha$) controls the influence of word frequencies on the calculation of the likelihood. The smoothing parameter ($\beta$) adds a small amount to word counts, preventing zero probabilities and making the model more robust, especially for less frequent words.
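To make the roles of $\alpha$ and $\beta$ concrete, the following is a minimal R sketch of the log-likelihood above, written for a word-by-document count matrix to match the $X_{i,j}$ notation; it is an illustrative re-implementation, not the `Stan` code used for inference (given in the appendix):
```{r, loglik_sketch}
# Log-likelihood of a word-by-document count matrix X (words in rows),
# given a word-by-topic matrix theta, influence alpha, and smoothing beta
log_lik <- function(X, theta, alpha, beta) {
  m <- ncol(theta)
  w <- alpha * rowSums(X)                             # alpha * sum_j X[i, j]
  p <- (theta + beta) / (rowSums(theta) + beta * m)   # smoothed topic shares
  sum(w * rowSums(log(p)))                            # sum over words i, topics k
}

# Example: uniform theta over 3 topics for the toy matrix above
theta_unif <- matrix(1 / 3, nrow = 4, ncol = 3)
log_lik(X_toy, theta_unif, alpha = 0.5, beta = 0.5)
```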
# Case Study
We'll apply our hyperparameter-optimized latent topic model to a real-world dataset, uncovering hidden thematic structures in a collection of financial news headlines. We'll begin with a simple document classifier, then compare our model to LDA and examine the differences. Finally, we will run inference diagnostics to assess the validity of our inference. Our goal is to organize the news headlines into coherent topics reflecting underlying themes. The `Stan` code for the model and the data preprocessing details are provided in the appendix.
## Data Preprocessing
```{r, handle_data, echo=FALSE}
set.seed(123)

# Load the data
data <- read_csv("data/sentiment/sentiment.csv", show_col_types = FALSE) %>%
  select(text)

# Convert data to UTF-8
data$text <- iconv(data$text, to = "UTF-8")

# Clean the data
clean_data <- data %>%
  mutate(text = removeNumbers(text)) %>%
  mutate(text = removePunctuation(text,
    preserve_intra_word_contractions = TRUE,
    preserve_intra_word_dashes = TRUE
  )) %>%
  mutate(text = tolower(text)) %>% # lowercase before stopword removal
  mutate(text = removeWords(text, stopwords("en"))) %>%
  mutate(text = str_replace_all(text, "�", "")) %>%
  mutate(text = str_replace_all(text, "’", "")) %>%
  mutate(text = str_replace_all(text, "\\b\\w{1,2}\\b", "")) %>% # drop 1-2 letter tokens
  mutate(text = str_replace_all(text, "\\s+", " ")) %>%
  mutate(text = str_trim(text)) %>%
  filter(!is.na(text) & text != "" & text != "NA") %>%
  sample_n(25) %>%
  mutate(document_id = row_number())

# Tokenization and stemming
token_data <- clean_data %>%
  unnest_tokens(word, text) %>%
  mutate(word = wordStem(word))

# Create the word-document matrix with documents as rows and words as columns
X <- token_data %>%
  count(document_id, word) %>%
  cast_dtm(document_id, word, n) %>%
  as.matrix()
```
```{r, get_model, echo=FALSE, warning=FALSE, message=FALSE}
model <- stan_model("model.stan")
```
After preprocessing, we create a word-document matrix with rows representing documents, columns representing unique (stemmed) words, and cells indicating word frequencies; note that this is the transpose of the $X_{i,j}$ indexing in the problem formulation, matching the `Stan` declaration `X[d, n]`. Due to computational constraints, we randomly select a subset of 25 documents for our case study.
```{r, topic_word_dist, echo=FALSE}
# Bar chart of word-topic probabilities for a set of top words
plt_word_topic <- function(top_words) {
  ggplot(
    top_words,
    aes(
      x = term,
      y = probability,
      fill = factor(topic)
    )
  ) +
    geom_bar(stat = "identity", position = "dodge") +
    labs(
      y = "Probability",
      fill = "Topic"
    ) +
    theme_light() +
    theme(
      axis.title.x = element_blank(),
      axis.text.x = element_text(angle = 45, hjust = 1),
      strip.background = element_blank(),
      strip.text = element_blank()
    ) +
    scale_fill_brewer(palette = "Set2")
}
```
```{r, inference, echo=FALSE, results='hide', warning=FALSE}
fit <- sampling(model,
  data = list(
    n = ncol(X),
    d = nrow(X),
    X = X,
    m = 3,
    a = rep(0.5, 3),
    alpha = 0.5,
    beta = 0.5
  ),
  iter = 1000,
  chains = 3
)

lda_model <- LDA(X, k = 3, method = "Gibbs", control = list(seed = 123))
```
## Categorizing Documents
One application of topic modeling is identifying similar documents based on their thematic content, which can support document clustering or recommendation systems. Since each $\theta_i$ is a probability distribution over topics, one way to categorize a document $j$ is as follows: if topic $k$ contains the largest number of a document's highly probable words, then document $j$ is most likely about topic $k$. The table below shows a subset of the resulting categorizations of documents into topics. Due to the unsupervised nature of topic modeling, there is no "true" interpretation of the topics, and from this subset no clear interpretation emerges; the full table might offer a clearer picture.
```{r, doc_sim, echo=FALSE, warning=FALSE, message=FALSE}
fit_summary <- summary(fit)

# Posterior means for the theta parameters only (drop lp__ and any other rows)
theta_rows <- grep("^theta\\[", rownames(fit_summary$summary))
theta_means <- matrix(fit_summary$summary[theta_rows, "mean"],
  nrow = ncol(X), ncol = 3, byrow = TRUE
)

dist_word <- data.frame(
  term = colnames(X),
  topic = rep(1:3, each = ncol(X)),
  probability = as.vector(theta_means)
)

# Most probable topic for each word
prob_words <- dist_word %>%
  group_by(term) %>%
  summarise(topic = which.max(probability), probability = max(probability))

# Assign each document the topic of its most probable words
dist_document <- token_data %>%
  left_join(prob_words, by = c("word" = "term")) %>%
  group_by(document_id, topic) %>%
  summarise(probability = max(probability)) %>%
  top_n(1, probability)

topic_table <- dist_document %>%
  group_by(topic) %>%
  arrange(desc(probability)) %>%
  slice(1:3) %>%
  select(topic, document_id, probability) %>%
  left_join(clean_data, by = "document_id") %>%
  rename(document = text) %>%
  select(topic, document) %>%
  as.data.frame()

topic_table
```
## Comparison with LDA
Now we compare our model to Latent Dirichlet Allocation (LDA) by plotting the word-topic distribution of the 10 most probable words for each topic. For our model, the hyperparameters $\alpha$ and $\beta$ are set to 0.5, and each component of the Dirichlet concentration parameter $\mathbf{a}$ is set to 0.5. For LDA, we used the Gibbs sampling method. We use 3 topics for this comparison. Figure 1 shows the word-topic distribution for our model, while Figure 2 shows the word-topic distribution for LDA. The y-axis is the probability of a word belonging to a specific topic. Note that the words on the x-axis appear after stemming (where we reduce words to their root form, e.g., "captures, capturing, captured" all become "captur"). The most noticeable difference is that in our model, a word's probability of belonging to a topic is more evenly distributed across topics than in LDA, and those probabilities are higher. This is most likely because our model only looks at word frequencies within documents labeled with a topic, while LDA considers the entire corpus. Another difference is the choice of words in each topic. Due to the unsupervised nature of topic modeling, there is no objective way to compare the quality of the categorizations, because the two models use different likelihood functions and thus yield different interpretations of the topics.
```{r, echo=FALSE, fig.cap="Hyperparameter-Optimized Word-Topic Distribution", fig.show='asis', warning=FALSE, fig.align='center', fig.pos='H', fig.width=6}
top_words <- dist_word %>%
group_by(topic) %>%
arrange(desc(probability)) %>%
slice(1:10)
plt_word_topic(top_words)
```
```{r, echo=FALSE, fig.cap="LDA Word-Topic Distribution", fig.show='asis', warning=FALSE, fig.align='center', fig.pos='H', fig.width=6}
lda_topics <- tidy(lda_model, matrix = "beta")
top_words <- lda_topics %>%
group_by(topic) %>%
arrange(desc(beta)) %>%
slice(1:10)
top_words <- top_words %>%
rename(probability = beta)
plt_word_topic(top_words)
```
## Effective Sample Size
For our model, the parameter of interest, $\theta$, is a matrix, so each chain yields an effective sample size (ESS) for every entry of $\theta$. We summarize these with a boxplot of the per-parameter ESS values for each chain. In Figure 3, the ESS values are quite high, indicating that the chains have mixed well and that autocorrelation between draws is low, which suggests our inference is reliable.
```{r, echo=FALSE, fig.cap="Effective Sample Size per Chain", fig.show='asis', fig.width=3, fig.height=2.25, warning=FALSE, fig.align='center', fig.pos='H'}
# Extract the effective sample size (ESS) values
ess_values <- fit_summary$summary[, "n_eff"]
n_words <- ncol(X)
n_chains <- 3
n_topics <- 3
# Calculate the total number of parameters and chains
num_parameters <- n_words * n_topics # Number of words * Number of topics
num_chains <- n_chains # Number of chains
# Calculate the total number of iterations
num_iterations <- nrow(fit_summary$summary)
# Calculate the expected length of ess_values
expected_length <- num_parameters * num_chains
# Adjust expected_length to match the actual length of ess_values
if (expected_length != length(ess_values)) {
ess_values <- ess_values[1:expected_length]
}
ess_per_chain <- matrix(ess_values, nrow = num_parameters, ncol = num_chains, byrow = TRUE)
# Assuming ess_per_chain is your matrix of ESS values
# Convert ess_per_chain into a data frame
ess_df <- data.frame(chain = rep(1:num_chains, each = num_parameters),
parameter = rep(1:num_parameters, times = num_chains),
ess = as.vector(ess_per_chain))
ggplot(ess_df, aes(x = factor(chain), y = ess)) +
geom_boxplot() +
xlab("Chain") +
ylab("ESS") +
theme_minimal()
```
## Trace Plots
We can also examine trace plots for the parameters in $\theta$ per chain. Due to the large number of parameters, we only plot the traces for the first three rows of $\theta$ (the first three words) across all three topics. In Figure 4, the parameter values show no clear trend as the number of iterations increases, and the chains overlap well, indicating good mixing.
```{r, echo=FALSE, fig.cap="Trace Plots per Chain", fig.show='asis', fig.width=7, fig.height=4, warning=FALSE, fig.align='center', fig.pos='H'}
pars <- c()
for (i in 1:3) {
for (j in 1:3) {
pars <- c(pars, paste("theta[", i, ",", j, "]", sep = ""))
}
}
mcmc_trace(fit, pars = pars) +
theme_minimal() +
theme(plot.margin = margin(0.25, 0.25, 0.25, 0.25, "cm"))
```
# Discussion \& Conclusion
From our case study, we can see that the inference behavior of our model is good, with Figures 3 and 4 showing high effective sample sizes and well-mixed chains. The word-topic distributions in Figure 1 show that our model clusters words into topics differently than LDA, with more evenly distributed probabilities. The categorization of documents into topics in Table 1 does not provide a clear interpretation of the topics, which is expected given the unsupervised nature of topic modeling.
Our model is not without limitations. The choice of hyperparameters $\alpha$ and $\beta$ can significantly impact the model's performance, and the Dirichlet concentration parameter $\mathbf{a}$ affects the sparsity of the topic distributions because it controls how words are distributed across topics. Because our parameter of interest, $\theta$, is a matrix, the number of parameters can be quite large, which hurts computational efficiency: fitting our matrix of 25 documents took about five minutes, whereas LDA took a few seconds, so our model will be difficult to scale to larger datasets. Another significant limitation is our likelihood function. It is quite simple and does not account for the order of words in a document or the context in which they appear, which can lead to topics that are not coherent or meaningful.
Despite these limitations, our model offers a flexible and interpretable approach to topic modeling. By allowing users to adjust the influence of word frequencies and the robustness to rare words, it can be adapted to a wide range of textual data. Future work could focus on improving the likelihood function to capture more complex relationships between words and topics, as well as exploring more efficient computational methods to scale the model to larger datasets.
\newpage
# Appendix
## Data Preprocessing
```{r, handle_data, echo=TRUE}
```
\newpage
## Model Code
```{stan, output.var="model", echo=TRUE}
data {
  int<lower=1> n;        // Number of words
  int<lower=1> d;        // Number of documents
  int<lower=0> X[d, n];  // Word-document matrix
  int<lower=1> m;        // Number of topics
  vector<lower=0>[m] a;  // Dirichlet concentration parameter
  real<lower=0> alpha;   // Word frequency influence
  real<lower=0> beta;    // Smoothing parameter
}

parameters {
  simplex[m] theta[n];   // Topic distribution for each word
}

model {
  // Dirichlet prior on each word's topic distribution
  for (i in 1:n) {
    theta[i] ~ dirichlet(a);
  }

  // Custom likelihood: frequency-weighted, smoothed log topic shares
  for (i in 1:n) {
    for (j in 1:d) {
      for (k in 1:m) {
        real w = alpha * X[j, i];
        real u = log((theta[i, k] + beta) / (sum(theta[i]) + beta * m));
        target += w * u;
      }
    }
  }
}
```
# References