---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)
```
[![Travis-CI Build Status](https://travis-ci.org/mikajoh/stmprinter.svg?branch=master)](https://travis-ci.org/)
[![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/mikajoh/stmprinter?branch=master&svg=true)](https://ci.appveyor.com/project/mikajoh/stmprinter)
[![MIT licensed](https://img.shields.io/badge/license-MIT-blue.svg)](https://raw.githubusercontent.com/mikajoh/stmprinter/master/LICENSE)
## stmprinter: Print multiple [stm](http://www.structuraltopicmodel.com/) model dashboards to a pdf file for inspection
Estimate multiple [stm](http://www.structuraltopicmodel.com/) models and print a dashboard for each run on a separate pdf page for inspection. These functions are designed for models with 15 or fewer topics (such as with survey data) and can be particularly useful when it is difficult to find a qualitatively good model on the first run.
The package includes two main functions:
| Function | Explanation |
|:----------------|:---------------|
| `many_models()` | Runs `stm::selectModel()` for each provided number of topics K (in parallel). Unlike `stm::manyTopics()`, it retains every run that `stm::selectModel()` keeps for each K. |
| `print_models()`| Prints all runs produced by either `many_models()` or `stm::manyTopics()` to a pdf file, which makes it easy to look through several runs for several numbers of topics manually. Does not work well with more than 15 topics. An example is shown below. |
![Example 1](stmprinter_example_1.png) ![Example 2](stmprinter_example_2.png)
### Installation
You can install `stmprinter` from GitHub with:
```{r gh-installation, eval = FALSE}
# install.packages("devtools")
devtools::install_github("mikajoh/stmprinter")
```
### Example
Here is an example with the `gadarian` data that is included with the `stm` package.
First let's prep the data as usual with `stm::textProcessor()` and `stm::prepDocuments()`.
```{r prep}
library(stm)
library(stmprinter)
processed <- textProcessor(
  documents = gadarian$open.ended.response,
  metadata = gadarian
)
out <- prepDocuments(
  documents = processed$documents,
  vocab = processed$vocab,
  meta = processed$meta
)
```
We can then run the `many_models()` function included in this package for several values of K. It runs `stm::selectModel()` for each K (in parallel) and returns a list with the output. This is convenient if you wish to estimate several models but, unlike with `stm::manyTopics()` (which keeps only one model per number of topics), want to keep several runs per number of topics. Note, though, that the `print_models()` function is also compatible with output from `stm::manyTopics()`.
`many_models()` takes the same arguments as `stm::selectModel()`, with the exception of `K` and `cores`. Here, `K` should be a vector of *all* the desired numbers of topics to run `stm::selectModel()` for. The `cores` argument lets you choose how many cores to use (defaults to the number of cores available on the machine).
With our `gadarian` example, we could run the following to estimate stm models for 3 to 12 topics.
```{r many_models, eval = FALSE}
set.seed(2018)
stm_models <- many_models(
  K = 3:12,
  documents = out$documents,
  vocab = out$vocab,
  prevalence = ~ treatment + s(pid_rep),
  data = out$meta,
  N = 4,
  runs = 100
)
```
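If you would rather not use every available core, the `cores` argument can be set explicitly. The following is only a minimal sketch: it assumes `cores` accepts a plain integer count, and the name `stm_models_small` and the reduced `K` and `runs` values are illustrative choices to keep the run short, not recommendations.

```{r many_models_cores, eval = FALSE}
# Leave one core free; parallel::detectCores() can return NA on some platforms.
n_cores <- parallel::detectCores()
n_cores <- if (is.na(n_cores)) 1L else max(1L, n_cores - 1L)

set.seed(2018)
stm_models_small <- many_models(
  K = 3:5,
  documents = out$documents,
  vocab = out$vocab,
  prevalence = ~ treatment + s(pid_rep),
  data = out$meta,
  N = 4,
  runs = 20,
  cores = n_cores  # assumption: an integer number of cores
)

# Generic base R inspection; no particular list structure is assumed here.
str(stm_models_small, max.level = 1)
```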
You can then print all N runs for each of the provided K topics using `print_models()` with the following code.
Here, `stm_models` must be the output from either `many_models()` or `stm::manyTopics()`. The second argument is the texts used to print the most representative text (see `?stm::findThoughts`). You can also provide the file name (`file`) and the title shown at the top of the first page (`title`).
```{r print_models, eval = FALSE}
print_models(
  stm_models, gadarian$open.ended.response,
  file = "gadarian_stm_runs.pdf",
  title = "gadarian project"
)
```
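Since `print_models()` also accepts output from `stm::manyTopics()`, the same call can be sketched for that case. This is a rough sketch that assumes the `out` object from the prep step above; the object name `topic_models`, the output file name, and the small `K` and `runs` values are illustrative only.

```{r print_many_topics, eval = FALSE}
library(stm)
library(stmprinter)

# stm::manyTopics() keeps a single model per number of topics.
set.seed(2018)
topic_models <- manyTopics(
  documents = out$documents,
  vocab = out$vocab,
  K = 3:5,
  prevalence = ~ treatment + s(pid_rep),
  data = out$meta,
  runs = 10
)

# Same arguments as before, but with the manyTopics() output.
print_models(
  topic_models, gadarian$open.ended.response,
  file = "gadarian_manytopics_runs.pdf",  # illustrative file name
  title = "gadarian project (manyTopics)"
)
```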
Example pages from the output are shown in the images near the top of this README.
Note that the `text` argument takes the full text responses, which must correspond to the documents in `out$documents` (see `?stm::findThoughts`). *If* documents are removed during `stm::textProcessor()` or `stm::prepDocuments()`, you will need to remove the same texts from the original. You can typically do that with the following code.
```{r text_thingy, eval = FALSE}
text <- gadarian$open.ended.response[-c(as.integer(processed$docs.removed))][-c(as.integer(out$docs.removed))]
```
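One caveat with negative indexing in R: if no documents were removed, `x[-integer(0)]` returns an empty vector rather than `x`. A small guard avoids that. The helper `drop_removed()` below is hypothetical (it is not part of `stmprinter` or `stm`) and is shown on toy data so that it runs on its own.

```{r text_safe, eval = FALSE}
# Hypothetical helper: drop the elements of `x` at positions `removed`,
# returning `x` unchanged when nothing was removed.
drop_removed <- function(x, removed) {
  removed <- as.integer(removed)
  if (length(removed) == 0L) {
    return(x)
  }
  x[-removed]
}

responses <- c("first", "second", "third", "fourth")

drop_removed(responses, NULL)     # nothing removed: all four texts are kept
drop_removed(responses, c(2, 4))  # keeps "first" and "third"
responses[-as.integer(NULL)]      # the unguarded form returns character(0)
```

With the objects from the example above, this could be applied in two steps, `text <- drop_removed(gadarian$open.ended.response, processed$docs.removed)` followed by `text <- drop_removed(text, out$docs.removed)`, after which `length(text)` should equal `length(out$documents)`.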
<!-- ### The `print_models()` output explained. -->
---
Pull requests, questions, suggestions, etc., are welcome!