-
Notifications
You must be signed in to change notification settings - Fork 220
/
Copy patheda_rowwise.Rmd
251 lines (172 loc) · 5 KB
/
eda_rowwise.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
# tidyverse中行方向的操作 {#eda-rowwise}
```{r, include=FALSE}
knitr::opts_chunk$set(
echo = TRUE,
warning = FALSE,
message = FALSE,
fig.showtext = TRUE
)
```
dplyr 1.0 推出之后,数据框**行方向**的操作得到完美解决,因此本章的内容已经过时,大家可以跳出本章,直接阅读第\@ref(tidyverse-colwise) 章。(留着本章,主要是让自己时常回顾下之前的探索。让自己最难忘的,或许就是曾经的痛点吧)
```{r rowwise-1, message = FALSE, warning = FALSE}
library(tidyverse)
```
tidyverse 喜欢数据框,因为一列就是一个向量,一列一列的处理起来很方便。然而我们有时候也要,完成行方向的操作,所以有必要介绍tidyverse中行方向的处理机制。
## 问题
```{r rowwise-2, eval=FALSE}
df <- tibble(x = 1:3, y = 4:6)
df
```
对每行的求和、求均值、最小值或者最大值?
## rowwise函数
dplyr提供了rowwise()函数
```{r rowwise-3, eval=FALSE}
df %>%
rowwise() %>%
mutate(i = sum(x, y))
```
```{r rowwise-4, eval=FALSE}
df %>%
rowwise() %>%
mutate(i = mean(c(x, y)))
```
```{r rowwise-5, eval=FALSE}
df %>%
rowwise() %>%
mutate(
min = min(x, y),
max = max(x, y)
)
```
```{r rowwise-6, eval=FALSE}
df %>%
rowwise() %>%
do(i = mean(c(.$x, .$y))) %>%
unnest(i)
```
## Row-wise Summaries
```{r rowwise-7, eval=FALSE}
df %>% mutate(row_sum = rowSums(.[1:2]))
```
```{r rowwise-8, eval=FALSE}
df %>% mutate(row_mean = rowMeans(.[1:2]))
```
```{r rowwise-9, eval=FALSE}
df %>% mutate(t_sum = rowSums(select_if(., is.numeric)))
```
固然可解决问题, 然而,却不是一个很好的办法,比如除了求和与计算均值,可能还要计算每行的中位数、方差等等, 因为,不是每种计算都对应的row_函数? 既然是tidyverse ,还是用tidyverse 的方法解决
## purrr::map方案
按照Jenny Bryan的方案
```{r rowwise-10, eval=FALSE}
df %>% mutate(t_sum = pmap_dbl(list(x, y), sum))
```
```{r rowwise-11, eval=FALSE}
df %>%
mutate(t_sum = pmap_dbl(select_if(., is.numeric), sum))
```
计算均值的时候, 然而报错了
```{r rowwise-12, eval=FALSE}
df %>% mutate(t_sum = pmap_dbl(select_if(., is.numeric), mean))
```
tidyverse 总会想出办法来解决,把`mean()` 变成 `lift_vd(mean)`
```{r rowwise-13, eval=FALSE}
df %>%
mutate(data = pmap_dbl(select_if(., is.numeric), lift_vd(mean)))
```
同理
```{r rowwise-14, eval=FALSE}
df %>% mutate(t_median = pmap_dbl(select_if(., is.numeric), lift_vd(median)))
```
```{r rowwise-15, eval=FALSE}
df %>% mutate(t_sd = pmap_dbl(select_if(., is.numeric), lift_vd(sd)))
```
## tidy 的方案
我个人推荐的方法(Gather, group, summarize, left_join)
```{r rowwise-16, eval=FALSE}
new_df <- df %>%
mutate(id = row_number())
s <- new_df %>%
gather("time", "val", -id) %>%
group_by(id) %>%
summarize(
t_avg = mean(val),
t_sum = sum(val)
)
s
```
```{r rowwise-17, eval=FALSE}
new_df %>%
left_join(s)
```
有点繁琐,但思路清晰
```{r rowwise-18, eval=FALSE}
ss <- new_df %>%
group_by(id) %>%
summarise(t_avg = mean(c(x, y)))
ss
```
```{r rowwise-19, eval=FALSE}
new_df %>%
left_join(ss)
```
之所以有这么多的搞法,是因为没有一个很好的搞法
## 用slide方案
[slide](https://github.com/DavisVaughan/slide)很强大,可以滚动喔
- 如果第一个参数是数据框,`slide`把数据框看作a vector of rows, 然后行方向的滚动,事实上, .x是一个个的小数据框(如下)
- 与`purrr::map`不同,因为map把数据框看作列方向的向量, 然后迭代
- 如果第一个参数是原子型向量的话,还是依次迭代逗号分隔的元素,只不过这里是slide比map更强大的是,还可以是滚动
```{r rowwise-20, eval=FALSE}
library(slider)
df <- tibble(a = 1:3, b = 4:6)
slide(
select_if(df, is.numeric),
~.x,
.before = 1
)
```
```{r rowwise-21, eval=FALSE}
df %>%
mutate(
r_mean = slide_dbl(
select_if(df, is.numeric),
~ mean(unlist(.x)),
.before = 1
)
)
```
## rowwise() + c_across()
```{r rowwise-22, eval=FALSE}
df <- tibble(id = 1:6, w = 10:15, x = 20:25, y = 30:35, z = 40:45)
df
df %>%
rowwise(id) %>%
summarise(mean = mean(c(w, x, y, z)))
df %>%
rowwise(id) %>%
mutate(mean = mean(c(w, x, y, z)))
df %>%
rowwise(id) %>%
mutate(total = mean(c_across(w:z)))
df %>%
rowwise(id) %>%
mutate(mean = mean(c_across(is.numeric)))
# across()
df %>% mutate(mean = rowMeans(across(is.numeric & -id)))
```
## 用lay方案
[lay包](https://github.com/romainfrancois/lay)解决方案
```{r rowwise-23, eval = FALSE}
library(lay)
library(dplyr, warn.conflicts = FALSE)
iris <- as_tibble(iris)
# apply mean to each "row"
iris %>%
mutate(sepal = lay(across(starts_with("Sepal")), mean))
```
```{r rowwise-24, echo = F}
# remove the objects
# rm(df, new_df, s, ss)
```
```{r rowwise-25, echo = F, message = F, warning = F, results = "hide"}
pacman::p_unload(pacman::p_loaded(), character.only = TRUE)
```