-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathdata-analysis.Rmd
130 lines (109 loc) · 2.34 KB
/
data-analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
---
title: "Snowflake Data Analysis"
output: html_notebook
---
```{r setup}
# Packages ----
library(DBI)
library(tidyverse)
# Plotting ----
theme_set(theme_bw())
```
Create connection to Snowflake DB. In this case, using a DSN named `snowflake`.
```{r connect}
con <- dbConnect(odbc::odbc(), "snowflake")
```
## Explore
List the tables available
```{r}
odbc::dbListTables(con, catalog = "DEMO_DATA", schema = "PUBLIC")
```
Query a table using SQL code chunk
```{sql, connection = con}
SELECT * FROM AIRPORTS LIMIT 10;
```
Query a table using `dplyr`
```{r}
tbl(con, "AIRPORTS") %>%
head(10)
```
Show SQL query generated by `dbplyr`
```{r}
tbl(con, "AIRPORTS") %>%
head(10) %>%
show_query()
```
## More `dplyr`
Create local *references* to DB tables
```{r}
db_airlines <- tbl(con, "AIRLINES")
db_airports <- tbl(con, "AIRPORTS")
db_flights <- tbl(con, "FLIGHTS")
db_planes <- tbl(con, "PLANES")
db_weather <- tbl(con, "WEATHER")
```
Check class of new objects
```{r}
class(db_airlines)
```
Check object size of table reference
```{r}
lobstr::obj_size(db_flights)
lobstr::obj_size(nycflights13::flights)
```
Query and visualize flights
```{r}
db_flights %>%
head()
```
Visualize flight distance
```{r}
db_flights %>%
select(carrier, distance) %>%
collect() %>%
ggplot(aes(x = distance)) +
geom_histogram()
```
```{r}
airlines_delays <- db_flights %>%
mutate(dist_bucket = case_when(
distance < 500 ~ "<500",
distance < 1000 ~ "500-1000",
distance < 1500 ~ "1000-1500",
distance < 2000 ~ "1500-2000",
distance < 2500 ~ "2000-2500",
distance >= 2500 ~ "2500+"
)) %>%
group_by(carrier, dist_bucket) %>%
summarize(n = n(),
avg_dep_delay = mean(dep_delay, na.rm = TRUE),
avg_arr_delay = mean(arr_delay, na.rm = TRUE),
avg_distance = mean(distance, na.rm = TRUE)) %>%
mutate(avg_diff = avg_dep_delay - avg_arr_delay) %>%
left_join(db_airlines)
```
```{r}
show_query(airlines_delays)
```
```{r}
local_airline_delays <- airlines_delays %>%
collect()
```
```{r}
local_airline_delays %>%
ggplot(aes(x = name, y = avg_diff)) +
geom_col() +
coord_flip() +
facet_wrap(~dist_bucket)
```
```{r}
local_airline_delays %>%
ggplot(aes(x = name, y = n)) +
geom_col() +
coord_flip() +
facet_wrap(~dist_bucket)
```
## Database Disconnect
```{r disconnect}
dbDisconnect(con)
```