```{r, include = FALSE}
source("common.R")
```
# Vectors
<!-- 3 -->
\stepcounter{section}
## Atomic vectors
<!-- 3.2 -->
__[Q1]{.Q}__: How do you create raw and complex scalars? (See `?raw` and `?complex`.)
__[A]{.solved}__: In R, scalars are represented as vectors of length one. However, there's no built-in syntax like there is for logicals, integers, doubles, and character vectors to create individual raw and complex values. Instead, you have to create them by calling a function.
For raw vectors you can use either `as.raw()` or `charToRaw()` to create them from numeric or character values.
```{r}
as.raw(42)
charToRaw("A")
```
In the case of complex numbers, real and imaginary parts may be provided directly to the `complex()` constructor.
```{r}
complex(length.out = 1, real = 1, imaginary = 1)
```
You can create purely imaginary numbers directly (e.g. `1i`), but a complex number with a non-zero real part has to be built with `+` (e.g. `1i + 1`) or with `complex()`.
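As a quick check, the following sketch (base R only; the variable names are purely illustrative) confirms that these calls return vectors of length one with the expected types, and that `1 + 1i` matches the result of the constructor:

```{r}
raw_scalar <- as.raw(42)
cplx_scalar <- complex(real = 1, imaginary = 1)

typeof(raw_scalar)
typeof(cplx_scalar)
length(cplx_scalar)

# `1 + 1i` is a call to `+` with a double and a purely imaginary number
identical(1 + 1i, cplx_scalar)
```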
__[Q2]{.Q}__: Test your knowledge of vector coercion rules by predicting the output of the following uses of `c()`:
```{r, eval = FALSE}
c(1, FALSE) # will be coerced to double -> 1 0
c("a", 1) # will be coerced to character -> "a" "1"
c(TRUE, 1L) # will be coerced to integer -> 1 1
```
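One way to check these predictions is to ask for the type of each result with `typeof()`:

```{r}
typeof(c(1, FALSE))
typeof(c("a", 1))
typeof(c(TRUE, 1L))
```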
__[Q3]{.Q}__: Why is `1 == "1"` true? Why is `-1 < FALSE` true? Why is `"one" < 2` false?
__[A]{.solved}__: These comparisons are carried out by operator functions (`==`, `<`), which coerce their arguments to a common type. In the examples above, these types will be character, double and character: `1` will be coerced to `"1"`, `FALSE` is represented as `0`, and `2` turns into `"2"`. The last comparison is false because digits precede letters in lexicographic order (this may depend on the locale).
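To make the implicit coercion visible, we can perform it by hand before comparing. This is only a sketch of what the operators do internally; the result of the last comparison may depend on the locale.

```{r}
# 1 == "1": the double is coerced to character
as.character(1) == "1"

# -1 < FALSE: the logical is coerced to double
-1 < as.numeric(FALSE)

# "one" < 2: the double is coerced to character, then compared as strings
"one" < as.character(2)
```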
__[Q4]{.Q}__: Why is the default missing value, `NA`, a logical vector? What's special about logical vectors? (Hint: think about `c(FALSE, NA_character_)`.)
__[A]{.solved}__: The presence of missing values shouldn't affect the type of an object. Recall that there is a type-hierarchy for coercion from character → double → integer → logical. When combining `NA`s with other atomic types, the `NA`s will be coerced to integer (`NA_integer_`), double (`NA_real_`) or character (`NA_character_`) and not the other way round. If `NA` were a character vector and were combined with other values, all of these would be coerced to character as well.
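A short experiment illustrates this: the logical `NA` adopts the type of its neighbours, whereas a typed missing value such as `NA_character_` drags the other elements up the hierarchy.

```{r}
typeof(NA)
typeof(c(FALSE, NA))
typeof(c(1L, NA))
typeof(c(1.5, NA))

# The hint from the question: a character NA coerces FALSE to "FALSE"
typeof(c(FALSE, NA_character_))
```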
__[Q5]{.Q}__: Precisely what do `is.atomic()`, `is.numeric()`, and `is.vector()` test for?
__[A]{.solved}__: The documentation states that:
- `is.atomic()` tests if an object is an atomic vector (as defined in *Advanced R*) or is `NULL` (!). Note that since R 4.4.0, `is.atomic(NULL)` returns `FALSE`.
- `is.numeric()` tests if an object has type integer or double and is not of class `factor`, `Date`, `POSIXt` or `difftime`.
- `is.vector()` tests if an object is a vector (as defined in *Advanced R*) or an expression and has no attributes, apart from names.
Atomic vectors are defined in *Advanced R* as objects of type logical, integer, double, complex, character or raw. Vectors are defined as atomic vectors or lists.
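A few experiments make these definitions more tangible. The objects below are illustrative choices and not taken from the documentation:

```{r}
# is.numeric() is about numeric behaviour, not only the underlying type
typeof(factor("a"))
is.numeric(factor("a"))
is.numeric(Sys.Date())
is.numeric(1L)

# is.vector() allows names, but no other attributes
is.vector(c(a = 1, b = 2))
is.vector(structure(1:3, my_attr = "x"))
is.vector(list(1, "a"))

# is.atomic() is FALSE for lists
is.atomic(1:3)
is.atomic(list(1, "a"))
```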
## Attributes
<!-- 3.3 -->
__[Q1]{.Q}__: How is `setNames()` implemented? How is `unname()` implemented? Read the source code.
__[A]{.solved}__: `setNames()` is implemented as:
```{r, eval = FALSE}
setNames <- function(object = nm, nm) {
  names(object) <- nm
  object
}
```
Because the data argument comes first, `setNames()` also works well with the magrittr pipe operator. When no first argument is given, `object` defaults to `nm`, so the result is a vector named by its own values (this is rather unusual, as required arguments usually come first):
```{r}
setNames( , c("a", "b", "c"))
```
`unname()` is implemented in the following way:
```{r, eval = FALSE}
unname <- function(obj, force = FALSE) {
  if (!is.null(names(obj)))
    names(obj) <- NULL
  if (!is.null(dimnames(obj)) && (force || !is.data.frame(obj)))
    dimnames(obj) <- NULL
  obj
}
```
`unname()` removes existing names (or dimnames) by setting them to `NULL`.
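As a small illustration (with made-up example objects), both branches of `unname()` can be observed on a named vector and on a matrix with dimnames:

```{r}
named_vec <- setNames(1:3, c("a", "b", "c"))
names(named_vec)
names(unname(named_vec))

named_mat <- matrix(
  1:4,
  nrow = 2,
  dimnames = list(c("r1", "r2"), c("c1", "c2"))
)
dimnames(unname(named_mat))
```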
__[Q2]{.Q}__: What does `dim()` return when applied to a 1-dimensional vector? When might you use `NROW()` or `NCOL()`?
__[A]{.solved}__: From `?nrow`:
> `dim()` will return `NULL` when applied to a 1d vector.
One may want to use `NROW()` or `NCOL()` to handle atomic vectors, lists and `NULL` values in the same way as one-column matrices or data frames. For these objects `nrow()` and `ncol()` return `NULL`:
```{r}
x <- 1:10
# Return NULL
nrow(x)
ncol(x)
# Pretend it's a column vector
NROW(x)
NCOL(x)
```
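The same holds for lists and `NULL`, which can be checked with a short sketch (the inputs are arbitrary examples):

```{r}
# A list is treated like a single column with one row per element
NROW(list(1, "a", TRUE))
NCOL(list(1, "a", TRUE))

# NULL is treated like a column without rows
NROW(NULL)
NCOL(NULL)
```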
__[Q3]{.Q}__: How would you describe the following three objects? What makes them different to `1:5`?
```{r}
x1 <- array(1:5, c(1, 1, 5)) # 1 row, 1 column, 5 in third dim.
x2 <- array(1:5, c(1, 5, 1)) # 1 row, 5 columns, 1 in third dim.
x3 <- array(1:5, c(5, 1, 1)) # 5 rows, 1 column, 1 in third dim.
```
__[A]{.solved}__: These are all "one dimensional". If you imagine a 3d cube, `x1` extends along the third dimension, `x2` along the second dimension (columns), and `x3` along the first dimension (rows). In contrast to `1:5`, `x1`, `x2` and `x3` have a `dim` attribute.
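We can verify this by inspecting the attributes. The sketch below recreates two of the arrays under illustrative names, so that it runs on its own:

```{r}
arr1 <- array(1:5, c(1, 1, 5))
arr3 <- array(1:5, c(5, 1, 1))

attributes(1:5)
attributes(arr1)
attributes(arr3)

# Same data, different shape
identical(as.vector(arr1), 1:5)
identical(arr1, arr3)
```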
__[Q4]{.Q}__: An early draft used this code to illustrate `structure()`:
```{r}
structure(1:5, comment = "my attribute")
```
But when you print that object you don't see the comment attribute. Why? Is the attribute missing, or is there something else special about it? (Hint: try using help.)
__[A]{.solved}__: The documentation states (see `?comment`):
> Contrary to other attributes, the comment is not printed (by print or print.default).
Also, from `?attributes`:
> Note that some attributes (namely class, comment, dim, dimnames, names, row.names and tsp) are treated specially and have restrictions on the values which can be set.
We can retrieve comment attributes by calling them explicitly:
```{r}
foo <- structure(1:5, comment = "my attribute")
attributes(foo)
attr(foo, which = "comment")
```
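Base R also provides the dedicated accessor `comment()` and its replacement form, which may be more convenient than `attr()`. A brief sketch with an illustrative object:

```{r}
obj <- structure(1:5, comment = "my attribute")
comment(obj)

# The replacement form updates the attribute in place
comment(obj) <- "an updated comment"
attr(obj, "comment")
```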
## S3 atomic vectors
<!-- 3.4 -->
__[Q1]{.Q}__: What sort of object does `table()` return? What is its type? What attributes does it have? How does the dimensionality change as you tabulate more variables?
__[A]{.solved}__: `table()` returns a contingency table of its input variables. It is implemented as an integer vector with class `table` and dimensions (which makes it act like an array). Its attributes are `dim` (dimensions), `dimnames` (one name for each input column) and `class`. The number of dimensions equals the number of input variables, and the extent of each dimension corresponds to the number of unique values (factor levels) in that variable.
```{r}
x <- table(mtcars[c("vs", "cyl", "am")])
typeof(x)
attributes(x)
# Subset x like it's an array
x[ , , 1]
x[ , , 2]
```
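To see how the dimensionality grows, we can tabulate one, two and three variables of `mtcars` and compare the `dim` attribute of each result:

```{r}
dim(table(mtcars$vs))
dim(table(mtcars[c("vs", "cyl")]))
dim(table(mtcars[c("vs", "cyl", "am")]))
```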
__[Q2]{.Q}__: What happens to a factor when you modify its levels?
```{r, eval = FALSE}
f1 <- factor(letters)
levels(f1) <- rev(levels(f1))
```
__[A]{.solved}__: The underlying integer values stay the same, but the levels are changed, making it look like the data has changed.
```{r}
f1 <- factor(letters)
f1
as.integer(f1)
levels(f1) <- rev(levels(f1))
f1
as.integer(f1)
```
__[Q3]{.Q}__: What does this code do? How do `f2` and `f3` differ from `f1`?
```{r, results = "hide"}
f2 <- rev(factor(letters))
f3 <- factor(letters, levels = rev(letters))
```
__[A]{.solved}__: For `f2` and `f3` either the order of the factor elements *or* its levels are being reversed. For `f1` both transformations are occurring.
```{r}
# Reverse element order
(f2 <- rev(factor(letters)))
as.integer(f2)
# Reverse factor levels (when creating factor)
(f3 <- factor(letters, levels = rev(letters)))
as.integer(f3)
```
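Placing the three factors side by side may help. The following sketch recreates them so that it runs on its own, and assumes a locale in which `letters` is already sorted:

```{r}
f1 <- factor(letters)
levels(f1) <- rev(levels(f1))
f2 <- rev(factor(letters))
f3 <- factor(letters, levels = rev(letters))

# Elements as they are printed
head(as.character(f1), 3)
head(as.character(f2), 3)
head(as.character(f3), 3)

# Level order
head(levels(f1), 3)
head(levels(f2), 3)
head(levels(f3), 3)
```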
## Lists
<!-- 3.5 -->
__[Q1]{.Q}__: List all the ways that a list differs from an atomic vector.
__[A]{.solved}__: To summarise:
- Atomic vectors are always homogeneous (all elements must be of the same type). Lists may be heterogeneous (the elements can be of different types) as described in the [introduction of the vectors chapter](https://adv-r.hadley.nz/vectors-chap.html#introduction).
- Atomic vectors point to one address in memory, while lists contain a separate reference for each element. (This was described in the list sections of the [vectors](https://adv-r.hadley.nz/vectors-chap.html#lists) and the [names and values](https://adv-r.hadley.nz/names-values.html#list-references) chapters.)
```{r}
lobstr::ref(1:2)
lobstr::ref(list(1:2, 2))
```
- Subsetting with out-of-bounds and `NA` values leads to different output. For example, `[` returns `NA` for atomics and `NULL` for lists. (This is described in more detail within the [subsetting chapter](https://adv-r.hadley.nz/subsetting.html).)
```{r}
# Subsetting atomic vectors
(1:2)[3]
(1:2)[NA]
# Subsetting lists
as.list(1:2)[3]
as.list(1:2)[NA]
```
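The first point, homogeneity versus heterogeneity, can be sketched in the same way (the inputs are arbitrary examples):

```{r}
# c() coerces to a common type; list() keeps each element's type
typeof(c(1L, "a", TRUE))
vapply(list(1L, "a", TRUE), typeof, character(1))
```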
__[Q2]{.Q}__: Why do you need to use `unlist()` to convert a list to an atomic vector? Why doesn't `as.vector()` work?
__[A]{.solved}__: A list is already a vector, though not an atomic one, so `as.vector()` leaves a list as a list. Note that `as.vector()` and `is.vector()` use different definitions of "vector"!
```{r}
is.vector(as.vector(mtcars))
```
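A minimal sketch with a plain list (an illustrative object) shows the difference between the two functions:

```{r}
lst <- list(1, 2, 3)

# as.vector() keeps the list
is.list(as.vector(lst))
is.atomic(as.vector(lst))

# unlist() flattens it into an atomic vector
is.atomic(unlist(lst))
unlist(lst)
```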
__[Q3]{.Q}__: Compare and contrast `c()` and `unlist()` when combining a date and date-time into a single vector.
__[A]{.solved}__: Date and date-time objects are both built upon doubles. While dates store the number of days since the reference date 1970-01-01 (also known as “the Epoch”), date-time objects (POSIXct) store the time difference to this date in seconds.
```{r}
date <- as.Date("1970-01-02")
dttm_ct <- as.POSIXct("1970-01-01 01:00", tz = "UTC")
# Internal representations
unclass(date)
unclass(dttm_ct)
```
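Because the units differ, converting between the two classes has to rescale the underlying number. The sketch below uses separate, illustrative variable names so that it does not interfere with the objects above:

```{r}
day_one <- as.Date("1970-01-02")
hour_one <- as.POSIXct("1970-01-01 01:00", tz = "UTC")

as.numeric(day_one)
as.numeric(hour_one)

# Explicit conversion rescales between days and seconds
as.numeric(as.POSIXct(day_one))
as.Date(hour_one)
```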
As the `c()` generic only dispatches on its first argument, combining date and date-time objects via `c()` could lead to surprising results in older R versions (pre R 4.0.0):
```{r, eval = FALSE}
# Output in R version 3.6.2
c(date, dttm_ct) # equal to c.Date(date, dttm_ct)
#> [1] "1970-01-02" "1979-11-10"
c(dttm_ct, date) # equal to c.POSIXct(dttm_ct, date)
#> [1] "1970-01-01 02:00:00 CET" "1970-01-01 01:00:01 CET"
```
In the first statement above `c.Date()` is executed, which incorrectly treats the underlying double of `dttm_ct` (3600) as days instead of seconds. Conversely, when `c.POSIXct()` is called on a date, one day is counted as one second only.
We can highlight these mechanics by the following code:
```{r, eval = FALSE}
# Output in R version 3.6.2
unclass(c(date, dttm_ct)) # internal representation
#> [1] 1 3600
date + 3599
#> [1] "1979-11-10"
```
As of R 4.0.0 these issues have been resolved: `c.POSIXct()` and `c.Date()` now first convert their inputs into `POSIXct` and `Date`, respectively.
```{r}
c(dttm_ct, date)
unclass(c(dttm_ct, date))
c(date, dttm_ct)
unclass(c(date, dttm_ct))
```
However, as `c()` may strip the time zone (and other attributes) of `POSIXct` objects, depending on the R version, some caution is still recommended.
```{r}
(dttm_ct <- as.POSIXct("1970-01-01 01:00", tz = "HST"))
attributes(c(dttm_ct))
```
A package that deals with these kinds of problems in more depth and provides a structural solution for them is the [`{vctrs}` package](https://github.com/r-lib/vctrs) [@vctrs] which is also used throughout the tidyverse [@tidyverse].
Let's look at `unlist()`, which operates on list input.
```{r}
# Attributes are stripped
unlist(list(date, dttm_ct))
```
We see again that dates and date-times are internally stored as doubles. Unfortunately, this is all we are left with when `unlist()` strips the attributes of the list.
To summarise: `c()` coerces types and can strip time zones. Errors may have occurred in older R versions because of inappropriate method dispatch or immature methods. `unlist()` strips attributes.
## Data frames and tibbles
<!-- 3.6 -->
__[Q1]{.Q}__: Can you have a data frame with zero rows? What about zero columns?
__[A]{.solved}__: Yes, you can create these data frames easily, either during creation or via subsetting. Both dimensions can even be zero.
Create a 0-row, 0-column, or an empty data frame directly:
```{r}
data.frame(a = integer(), b = logical())
data.frame(row.names = 1:3) # or data.frame()[1:3, ]
data.frame()
```
Create similar data frames via subsetting the respective dimension with either `0`, `NULL`, `FALSE` or a valid 0-length atomic (`logical(0)`, `character(0)`, `integer(0)`, `double(0)`). Negative integer sequences would also work. The following example uses a zero:
```{r}
mtcars[0, ]
mtcars[ , 0] # or mtcars[0]
mtcars[0, 0]
```
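Calling `dim()` on these results confirms which dimension has been emptied:

```{r}
dim(mtcars[0, ])
dim(mtcars[, 0])
dim(mtcars[0, 0])
dim(data.frame())
```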
__[Q2]{.Q}__: What happens if you attempt to set rownames that are not unique?
__[A]{.solved}__: Matrices can have duplicated row names, so this does not cause problems.
Data frames, however, require unique row names, and you get different results depending on how you attempt to set them. If you set them directly or via `row.names()`, you get an error:
```{r, error = TRUE}
data.frame(row.names = c("x", "y", "y"))
df <- data.frame(x = 1:3)
row.names(df) <- c("x", "y", "y")
```
If you use subsetting, `[` automatically deduplicates:
```{r}
row.names(df) <- c("x", "y", "z")
df[c(1, 1, 1), , drop = FALSE]
```
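To contrast both behaviours, here is a small sketch with illustrative objects: a matrix that keeps duplicated row names, and a data frame whose repeated row names are made unique when subsetting.

```{r}
# Matrices accept duplicated row names
mat_dup <- matrix(1:3, nrow = 3, dimnames = list(c("x", "y", "y"), NULL))
rownames(mat_dup)

# Subsetting a data frame makes repeated row names unique
df_rn <- data.frame(x = 1:3, row.names = c("x", "y", "z"))
row.names(df_rn[c(1, 1, 1), , drop = FALSE])
```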
__[Q3]{.Q}__: If `df` is a data frame, what can you say about `t(df)`, and `t(t(df))`? Perform some experiments, making sure to try different column types.
__[A]{.solved}__: Both `t(df)` and `t(t(df))` will return matrices:
```{r}
df <- data.frame(x = 1:3, y = letters[1:3])
is.matrix(df)
is.matrix(t(df))
is.matrix(t(t(df)))
```
The dimensions will respect the typical transposition rules:
```{r}
dim(df)
dim(t(df))
dim(t(t(df)))
```
Because the output is a matrix, every column is coerced to the same type. (This is implemented within `t.data.frame()` via `as.matrix()`, which is described below.)
```{r}
df
t(df)
```
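Trying different column types, as the exercise suggests, shows how the resulting type depends on the input. The two data frames below are illustrative examples:

```{r}
df_mixed <- data.frame(x = 1:3, y = letters[1:3])
df_num <- data.frame(x = 1:3, y = c(1.5, 2.5, 3.5))

# A character column turns the whole matrix into character
typeof(t(df_mixed))

# Integer and double columns are combined into a double matrix
typeof(t(df_num))

# The round trip does not restore the data frame
is.data.frame(t(t(df_mixed)))
```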
__[Q4]{.Q}__: What does `as.matrix()` do when applied to a data frame with columns of different types? How does it differ from `data.matrix()`?
__[A]{.solved}__: The type of the result of `as.matrix()` depends on the types of the input columns (see `?as.matrix`):
> The method for data frames will return a character matrix if there is only atomic columns and any non-(numeric/logical/complex) column, applying as.vector to factors and format to other non-character columns. Otherwise the usual coercion hierarchy (logical < integer < double < complex) will be used, e.g. all-logical data frames will be coerced to a logical matrix, mixed logical-integer will give an integer matrix, etc.
On the other hand, `data.matrix()` will always return a numeric matrix (see `?data.matrix`):
> Return the matrix obtained by converting all the variables in a data frame to numeric mode and then binding them together as the columns of a matrix. Factors and ordered factors are replaced by their internal codes. [...] Character columns are first converted to factors and then to integers.
We can illustrate and compare the mechanics of these functions using a concrete example. `as.matrix()` makes it possible to retrieve most of the original information from the data frame but leaves us with characters. To retrieve all information from `data.matrix()`'s output, we would need a lookup table for each column.
```{r}
df_coltypes <- data.frame(
  a = c("a", "b"),
  b = c(TRUE, FALSE),
  c = c(1L, 0L),
  d = c(1.5, 2),
  e = factor(c("f1", "f2"))
)
as.matrix(df_coltypes)
data.matrix(df_coltypes)
```
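The lookup idea can be sketched for a factor column, whose levels serve as the lookup table. The smaller data frame `df_types` is an illustrative stand-in, and the sketch assumes R 4.0.0 or later, where `data.frame()` keeps strings as character columns:

```{r}
df_types <- data.frame(
  a = c("a", "b"),
  d = c(1.5, 2),
  e = factor(c("f1", "f2"))
)

typeof(as.matrix(df_types))
typeof(data.matrix(df_types))

# The factor levels recover the labels from the integer codes
levels(df_types$e)[data.matrix(df_types)[, "e"]]
```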