forked from isi-avbulimen/R-cookbook
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path05-factors.Rmd
181 lines (146 loc) · 5.58 KB
/
05-factors.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
```{r include=FALSE}
knitr::opts_chunk$set(collapse=TRUE, comment="#>")
```
# Working with factors {#factors}
## Common uses
Within the department there are three main ways you are likely to make use of
factors:
* Tabulation of data (in particular when you want to illustrate zero occurences
of a particular value)
* Ordering of data for output (e.g. bars in a graph)
* Statistical models (e.g. in conjunction with `contrasts` when encoding
categorical data in formulae)
### Tabulation of data
```{r}
# define a simple character vector
vehicles_observed <- c("car", "car", "bus", "car")
class(vehicles_observed)
table(vehicles_observed)
# convert to factor with possible levels
possible_vehicles <- c("car", "bus", "motorbike", "bicycle")
vehicles_observed <- factor(vehicles_observed, levels = possible_vehicles)
class(vehicles_observed)
table(vehicles_observed)
```
### Ordering of data for output
```{r}
# example 1
vehicles_observed <- c("car", "car", "bus", "car")
possible_vehicles <- c("car", "bus", "motorbike", "bicycle")
vehicles_observed <- factor(vehicles_observed, levels = possible_vehicles)
table(vehicles_observed)
possible_vehicles <- c("bicycle", "bus", "car", "motorbike")
vehicles_observed <- factor(vehicles_observed, levels = possible_vehicles)
table(vehicles_observed)
# example 2
df <- iris[sample(1:nrow(iris), 100), ]
ggplot(df, aes(Species)) + geom_bar()
df$Species <- factor(df$Species, levels = c("versicolor", "virginica", "setosa"))
ggplot(df, aes(Species)) + geom_bar()
```
### Statistical models
When building a regression model, R will automatically encode your independent
character variables using `contr.treatment` contrasts. This means that each
level of the vector is contrasted with a baseline level (by default the first
level once the vector has been converted to a factor). If you want to change
the baseline level or use a different encoding methodology then you need to
work with factors. To illustrate this we use the `Titanic` dataset.
```{r}
# load data and convert to one observation per row
data("Titanic")
df <- as.data.frame(Titanic)
df <- df[rep(1:nrow(df), df[ ,5]), -5]
rownames(df) <- NULL
head(df)
# For this example we convert all variables to characters
df[] <- lapply(df, as.character)
# save to temporary folder
filename <- tempfile(fileext = ".csv")
write.csv(df, filename, row.names = FALSE)
# reload data with stringsAsFactors = FALSE
new_df <- read.csv(filename, stringsAsFactors = FALSE)
str(new_df)
```
First lets see what happens if we try and build a logistic regression model for
survivals but using our newly loaded dataframe
```{r, error = TRUE}
model_1 <- glm(Survived ~ ., family = binomial, data = new_df)
```
This errors due to the **Survived** variable being a character vector. Let's
convert it to a factor.
```{r}
new_df$Survived <- factor(new_df$Survived)
model_2 <- glm(Survived ~ ., family = binomial, data = new_df)
summary(model_2)
```
This works, but the baseline case for **Class** is `1st`. What if we wanted it
to be `3rd`. We would first need to convert the variable to a factor and choose
the appropriate level as a baseline
```{r}
new_df$Class <- factor(new_df$Class)
levels(new_df$Class)
contrasts(new_df$Class) <- contr.treatment(levels(new_df$Class), 3)
model_3 <- glm(Survived ~ ., family = binomial, data = new_df)
summary(model_3)
```
## Other things to know about factors
Working with factors can be tricky to both the new, and the experienced `R`
user. This is as their behaviour is not always intuitive. Below we illustrate
three common areas of confusion
### Renaming factor levels
```{r, error=TRUE}
my_factor <- factor(c("Dog", "Cat", "Hippo", "Hippo", "Monkey", "Hippo"))
my_factor
# change Hippo to Giraffe
## DO NOT DO THIS
my_factor[my_factor == "Hippo"] <- "Giraffe"
my_factor
## reset factor
my_factor <- factor(c("Dog", "Cat", "Hippo", "Hippo", "Monkey", "Hippo"))
# change Hippo to Giraffe
## DO THIS
levels(my_factor)[levels(my_factor) == "Hippo"] <- "Giraffe"
my_factor
```
### Combining factors does not result in a factor
```{r}
names_1 <- factor(c("jon", "george", "bobby"))
names_2 <- factor(c("laura", "claire", "laura"))
c(names_1, names_2)
# if you want concatenation of factors to give a factor than the help page for
# c() suggest the following method is used:
c.factor <- function(..., recursive=TRUE) unlist(list(...), recursive=recursive)
c(names_1, names_2)
# if you only wanted the result to be a character vector then you could also use
c(as.character(names_1), as.character(names_2))
```
```{r, echo=FALSE}
# stop it messing up later code
rm(c.factor)
```
### Numeric vectors that have been read as factors
Sometimes we find a numeric vector is being stored as a factor
(a common occurence when reading a csv from Excel with #N/A values)
```{r}
# example data set
pseudo_excel_csv <- data.frame(names = c("jon", "laura", "ivy", "george"),
ages = c(20, 22, "#N/A", "#N/A"))
# save to temporary file
filename <- tempfile(fileext = ".csv")
write.csv(pseudo_excel_csv, filename, row.names = FALSE)
# reload data
df <- read.csv(filename)
str(df)
```
to transform this to a numeric variable we can proceed as follows
```{r check1234, error = TRUE}
df$ages <- as.numeric(levels(df$ages)[df$ages])
str(df)
```
## Helpful packages
If you find yourself having to manipulate factors often, then it may be worth
spending some time with the tidyverse package
[forcats](https://forcats.tidyverse.org). This was designed to make working
with factors simpler. There are many tutorials available online but a good
place to start is the official
[vignette](https://forcats.tidyverse.org/articles/forcats.html).