-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathlab0.Rmd
98 lines (68 loc) · 2.57 KB
/
lab0.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
---
title: 'MA 415/615 -- Lab 0: Workflow example'
author: Spring 2023
date: "Week of 01-23-2023"
output: html_document
subtitle: Based on Lecture 2
---
```{r setup, include=FALSE}
options(show.signif.stars = FALSE)
knitr::opts_chunk$set(tidy = FALSE) # turn off formatR's tidy
knitr::opts_chunk$set(fig.path="figures/", out.width="70%", fig.align = "center")
```
# Introduction
This is a simple notebook documenting the basic analyses in Lecture 2 of MA 415 / MA 615.
## Setting up R
After starting RStudio, we load packages in `tidyverse`:
```{r tidyverse}
library(tidyverse)
```
If the package is not installed, we can do it with:
```{r install, eval=FALSE}
install.packages("tidyverse") # it may take a while!
```
# Exploring the data
We will be using the `mpg` dataset on fuel economy data:
```{r mpg}
mpg
```
We get a description of the dataset by issuing `help(mpg)` or just `?mpg`.
Our main *question of interest* is to find a relationship between `displ` (engine displacement in litres) and `hwy` (highway mileage). Initial visualization with a scatterplot:
```{r displhwy1}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point()
```
Mileage seems to *decrease* with engine size; what is the effect of `cyl` (number of cylinders)?
```{r displhwy2}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(aes(color = cyl))
```
We notice that `cyl` has only a few values so it's best seen as a *factor* (categorical variable):
```{r displhwy3}
ggplot(data = mpg %>% mutate(cyl = factor(cyl)),
aes(x = displ, y = hwy)) + geom_point(aes(color = cyl))
```
## A bit of modeling
There seems to be a functional relationship between `displ` and `hwy`:
```{r displhwy4, message=FALSE}
ggplot(mpg %>% mutate(cyl = factor(cyl)), aes(displ, hwy)) +
geom_point(aes(color = cyl)) +
geom_smooth()
```
We can attempt to find such a relationship using a linear model. We assume that if $H$ is `hwy` and $D$ is `displ` then, at least approximately on average, we have
$$H \propto D^{\beta}$$
After taking the log of both sides, this yields the following *linear model*:
$$\log(H) = \alpha + \beta \cdot \log(D)$$
Fitting the model yields
```{r lm}
fit <- lm(log(hwy) ~ log(displ), data = mpg)
coef(fit)
```
and thus, $\beta \approx -0.5$, which yields
$$H \propto 1 / \sqrt{D}.$$
A visual check seems to confirm that `hwy` has an approximately linear relationship with the inverse square root of `displ`:
```{r displhwy5, message=FALSE}
ggplot(mpg %>% mutate(cyl = factor(cyl)),
aes(1 / sqrt(displ), hwy)) +
geom_point(aes(color = cyl)) +
geom_smooth()
```