-
Notifications
You must be signed in to change notification settings - Fork 9
/
Copy pathcorrelations.Rmd
211 lines (157 loc) · 6.34 KB
/
correlations.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
---
title: 'Correlations'
---
```{r, include=FALSE}
library(tidyverse)
library(pander)
library(lavaan)
panderOptions('digits', 2)
panderOptions('round', 3)
panderOptions('keep.trailing.zeros', TRUE)
```
## Correlations {- #correlations}
The base R `cor()` function provides a simple way to get Pearson correlations,
but to get a correlation matrix as you might expect from SPSS or Stata it's best
to use the `corr.test()` function in the `psych` package.
Before you start though, plotting the correlations might be the best way of
getting to grips with the patterns of relationship in your data. A pairs plot is
a nice way of doing this:
```{r}
airquality %>%
select(-Month, -Day) %>%
pairs
```
If we were satisfied the relationships were (reasonably) linear, we could also
visualise correlations themselves with a 'corrgram', using the `corrgram`
library:
```{r, fig.cap="A corrgram, showing pearson correlations (above the diagonal), variable distributions (on the diagonal) and ellipses and smoothed lines of best fit (below the diagnonal). Long, narrow ellipses denote large correlations; circular ellipses indicate small correlations."}
library("corrgram")
airquality %>%
select(-Month, -Day) %>%
corrgram(lower.panel=corrgram::panel.ellipse,
upper.panel=panel.cor,
diag.panel=panel.density)
```
The ggpairs function from the `GGally` package is also a nice way of plotting
relationships between a combination of categorical and continuous data - it
packs a lot of information into a limited space:
```{r, message=F}
mtcars %>%
mutate(cyl = factor(cyl)) %>%
select(mpg, wt, drat, cyl) %>%
GGally::ggpairs()
```
### Creating a correlation matrix {- #correlation-matrix}
The `psych::corr.test()` function is a quick way to obtain a pairwise
correlation matrix for an entire dataset, along with p values and confidence
intervals which the base R `cor()` function will not provide:
```{r}
mycorrelations <- psych::corr.test(airquality)
mycorrelations
```
One thing to be aware of is that by default `corr.test()` produces p values that
are adjusted for multiple comparisons in the top right hand triangle (i.e. above
the diagonal). If you want the uncorrected values use the values below the
diagonal (or pass `adjust=FALSE` when calling the function).
### Working with correlation matrices {-}
It's important to realise that, as with all R objects, we can work with
correlation matrices to continue our data ananalyses.
For example, as part of exploring your data, you might want to know whether
correlations you observe in one sample are similar to those from another sample,
when using the same questions. For example, let's say we ran a survey measuring
variables from the theory of planned behaviour first in students, and later in
older adults:
```{r, include=F, echo=F, message=F}
tbp <- 'behaviour ~ .5*intention + 0.8*control
intention ~ .5*social.norm + .5*attitude + .5*control'
tbp2 <- 'behaviour ~ .5*intention
intention ~ .5*social.norm + .5*attitude + .5*control'
set.seed(1234)
students <- simulateData(tbp, sample.nobs=100L)
public <- simulateData(tbp2, sample.nobs=250L)
```
We could run correlations for each sample separately:
```{r}
corr.students <- cor(students)
corr.public <- cor(public)
```
And we could 'eyeball' both of these correlation matrices and try and spot
patterns or differences between them, but this is quite hard:
```{r}
corr.students %>%
pander()
```
```{r}
corr.public %>%
pander
```
But we could also simply _subtract_ one matrix from the other to show the
difference directly:
```{r}
(corr.students - corr.public) %>%
pander()
```
Now it's much more obvious that the behaviour/control correlation differs
between the samples (it's higher in the students).
The point here is not that this is an analysis you are likely to actually report
--- although you might find it useful when exploring the data and interpreting
your findings.
But rather this show that a correlation matrix, in common with the results of
all the statistical tests we run, are themselves _just data points_. We can do
whatever we like with our results — storing them in data frames to display
later, or process as we need.
In reality, if you wanted to test the difference in correlations (slopes) in two
groups for one outcome variable you probably want to use
[multiple regression](#regression), and if you wanted to test a complex model
like the theory of planned behaviour, you might consider [CFA](#cfa) and/or
[SEM](#sem)).
### Tables for publication {- #correlation-tables-for-publication}
#### Using `apaTables` {-}
If you want to produce nice correlation tables for publication the `apaTables`
package might be useful. This block saves an APA formatted correlation table to
an [external Word document like this](Table1_APA.doc).
Note though, that the APA table format does encourage 'star gazing' to some
degree. Try to avoid interpreting correlation tables solely based on the
significance (or not) of the _r_ values. The `pairs` or `corrgram` plots shown
above are a much better summary of the data, and are can be just as compact.
```{r}
library(apaTables)
apa.cor.table(airquality, filename="Table1_APA.doc", show.conf.interval=F)
```
#### By hand {-}
If you're not bothered about strict APA format, you might still want to extract
_r_ and _p_ values as dataframes which can then be saved to a csv and opened in
Excel, or converted to a table some other way.
You can do this by storing the `corr.test` output in a variable, and the
accessing the `$r` and `$p` values within it.
First, we create the `corr.test` object:
```{r}
mycorrelations <- psych::corr.test(airquality)
```
Then extract the _r_ values as a table:
```{r}
mycorrelations$r %>%
pander()
```
And we can also extract p values:
```{r}
mycorrelations$p %>%
pander()
```
Saving as a .csv is the same as for other dataframes:
```{r}
write.csv(mycorrelations$r, file="airquality-r-values.csv")
```
And can also access the CI for each pariwise correlation as a table:
```{r}
mycorrelations$ci %>%
head() %>%
pander(caption="First 6 rows of the table of CI's for the correlation matrix.")
```
### Other methods for correlation {- #correlation-methods}
By default `corr.test` produces Pearson correlations, but You can pass the
`method` argument `psych::corr.test()`:
```{r, echo=T, eval=F}
psych::corr.test(airquality, method="spearman")
psych::corr.test(airquality, method="kendall")
```