---
title: "Learn-R-tutorial"
author: "R-Ladies-Vancouver"
date: "April 2, 2018"
output:
  html_document:
    keep_md: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Overview
---
Notes from a hands-on 3-hour workshop for participants with no prior programming experience.
Learning materials used in this tutorial were adapted from:
* [Intro into R by Research Bazar](https://resbaz.github.io/RLadies-Intro-R-workshop/)
* [Software Carpentry - "R for reproducible scientific analysis"](https://swcarpentry.github.io/r-novice-gapminder/)
* [2-hour intro into R by Mine Cetinkaya-Rundel](https://github.com/mine-cetinkaya-rundel/rworkshop-mem)
**Required**
* a laptop
* R and RStudio installed. [Click here](https://github.com/R-Ladies-Vancouver/April2018-Learn-R-Beginner/blob/master/Setup.md) for instructions.
* we will also be installing the `dplyr` and `ggplot2` packages during the tutorial
**Other resources**
* [Slides](https://github.com/R-Ladies-Vancouver/April2018-Learn-R-Beginner/blob/master/Slides/2018-April_event-presentation.pdf)
* The open-source dataset of passengers onboard the Titanic from the [Kaggle website](https://www.kaggle.com/c/titanic). Download the csv file with the dataset [here](https://drive.google.com/open?id=1iK6tiBsb4cabyi7mP5FLCH6LWYs4hOSI). We will use it to learn the `dplyr` and `ggplot2` packages
**Topics covered**
1. RStudio orientation & good practices for data analysis
2. Basics of coding in R
3. Working with functions and packages
4. Loading and inspecting the titanic.csv dataset
5. Basics of data wrangling with the `dplyr` package
6. Basics of data plotting with the `ggplot2` package
---
## What is R and RStudio?
What is R and why learn it?
* one of the leading languages in statistics and data science
* open-source and free
* powerful for data wrangling and visualization
* large library of tools (packages and functions) for diverse applications (e.g. [Bioconductor for genomic research](https://www.bioconductor.org/), [rOpenGov for social sciences](http://ropengov.github.io/about/))
* increasing number of organizations use R, find out more in this [StackOverflow post](https://stackoverflow.blog/2017/10/10/impressive-growth-r/)
* with a programming language such as R you can document the process of your data analysis, making it easier for you or another user to reproduce. This is challenging with tools such as an Excel spreadsheet, where calculations are hidden in individual cells.
* R has a highly active, supportive and fast-growing user and developer community.
What is RStudio?
* powerful and productive user interface for R
* free and open-source.
---
## Data analysis workflow
Adapted from R for Data Science by Hadley Wickham, the maker of many popular R packages, including the `dplyr` and `ggplot2` packages we will be working with today
![](https://github.com/R-Ladies-Vancouver/April2018-Learn-R-Beginner/blob/master/Img/2018-April_event-presentation.jpg)
---
## RStudio orientation & best practices for data analysis
This is what you should see when you launch RStudio:
* the interactive console (left); the version of R you are working with is printed there every time you start a new session
* Environment/History (top right)
* Files/Plots/Packages/Help/Viewer (bottom right)
![](https://github.com/R-Ladies-Vancouver/April2018-Learn-R-Beginner/blob/master/Img/RStudio-launch.jpg)
---
## Setting yourself up for reproducible R work
Before we begin producing R code, we need to familiarize ourselves with where our R work is going to be located on our computer. To find out, type `getwd()` into the console. This will prompt R to print out the path to the location on your computer where any scripts, plots and data you generate will be saved.
```{r }
getwd()
```
It is a good practice to keep all of your data analysis for a particular project in a single location. RStudio comes with a built-in framework called R projects. To create a new project in RStudio:
1. Click the “File” menu button, then “New Project”.
2. Click “New Directory”.
3. Click “Empty Project”.
4. Type in the name of the directory to store your project
5. Click the “Create Project” button.
A recommendation on how to organize your R project from [Good Enough Practices for Scientific Computing](https://github.com/swcarpentry/good-enough-practices-in-scientific-computing/blob/gh-pages/good-enough-practices-for-scientific-computing.pdf) as summarized by [Software Carpentry](https://swcarpentry.github.io/r-novice-gapminder/02-project-intro/)
1. Put each project in its own directory, which is named after the project.
2. Put text documents associated with the project in the doc directory.
3. Put raw data and metadata in the data directory, and files generated during cleanup and analysis in a results directory.
4. Put source for the project’s scripts and programs in the src directory, and programs brought in from elsewhere or compiled locally in the bin directory.
5. Name all files to reflect their content or function.
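The layout above can also be created from the R console. The sketch below is only an illustration and was not part of the workshop: it builds the folders inside a temporary directory (`tempdir()`) so nothing in your real project is touched, and the project name `my-project` is a made-up example.

```{r}
# hypothetical project name, created inside a temporary folder for safety
project_dir <- file.path(tempdir(), "my-project")

# the sub-folders recommended in the list above
sub_dirs <- c("doc", "data", "results", "src", "bin")

# create the project folder, then each sub-folder inside it
dir.create(project_dir, showWarnings = FALSE)
for (sub_dir in sub_dirs) {
  dir.create(file.path(project_dir, sub_dir), showWarnings = FALSE)
}

# list what was created
list.files(project_dir)
```

In a real project you would replace `tempdir()` with the location where you keep your work, or simply create the folders with the "New Folder" button in the Files pane.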
> Challenge
1. Create an R project called "R-beginner-workshop". Click **File**>**New Project**>**New directory**>**Empty project**
2. Create a folder called "data"
3. Put the titanic.csv into folder data
4. Create a folder called "scr"
---
## Basics of R coding
You can clear your console with the keyboard shortcut `Ctrl + L`
We can use R to do simple arithmetic:
* division `/`
* multiplication `*`
* addition `+`
* exponent `^`
* subtraction `-`
For instance, type the following in your console:
```{r }
5+5
10/5
```
The answer to this is printed below the command, preceded by [1]. Right now, none of our work exists anywhere besides the console.
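The remaining operators from the list can be tried in the console in the same way. The numbers below are arbitrary examples; the last two lines show that brackets change the order in which R does the calculation.

```{r}
# multiplication, subtraction and exponent
4 * 3
10 - 7
2 ^ 3

# without brackets, multiplication is done before addition
2 + 3 * 4
# with brackets, the addition is done first
(2 + 3) * 4
```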
To create a script:
1. Click File
2. Click New File
3. Click R script
4. Give your script a name
> Challenge
* Create a script called “R-beginner-code” in your current project and put it in `scr` folder
---
We can annotate a script with comments using the `#`. If you precede anything with this sign, R will ignore it. For instance
```{r }
# to do an addition in R, we do the following
5+5
```
To store a result or a piece of data, we assign it a name using the assignment operator `<-`
```{r }
# now I tell R to store my height
height_cm <- 172
```
Now, position your cursor on the line of the above command and press `CTRL + ENTER` (Windows) or `COMMAND + ENTER` (Mac). In the Environment pane (top right) you should see the variable `height_cm` and its value appear.
It is not always possible to view what is stored in a variable by looking at your environment. To find out what value is stored in the variable `height_cm`, simply type its name on a new line of the script and run it with `CTRL + ENTER`/`COMMAND + ENTER`.
```{r }
# what is stored in height_cm?
height_cm
```
Note that R is case sensitive. If we were to type `Height_cm` we would get an error. Try it!
```{r }
# R is case sensitive: remove the leading # on the last line and run it
# to see the error "object 'Height_cm' not found"
# Height_cm
```
Say you wanted to convert your height to inches.
```{r }
# what is my height in inches?
height_cm * 0.39
# oh wait, I want to store this value!
height_inches <- height_cm * 0.39
# what is it again?
height_inches
```
If we wanted to clear the environment of the variables we have 2 options:
* run `rm(list = ls())`, which removes every variable from the environment
* start a new R session - essentially a refresh button. In RStudio click "Session" > "Restart R"
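Before clearing everything, it can help to see how removing a single variable works. In this small sketch `scratch_value` is a made-up variable used only for illustration; `ls()` lists the names currently in the environment and `rm()` deletes the variable you name.

```{r}
# create a throwaway variable (hypothetical name)
scratch_value <- 42

# is it in the environment?
"scratch_value" %in% ls()

# remove just this one variable, leaving everything else in place
rm(scratch_value)

# check again: it is no longer listed
"scratch_value" %in% ls()
```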
> Challenge
1. Create a variable called `mass_kg` with a value of 50
2. Convert `mass_kg` to mass in pounds (multiply by 2.2) and store the answer in `mass_lbs`
## What are functions and packages?
**Functions:**
* fundamental building blocks of R
* a list of "built-in" R functions (they come with the installation) can be found [here](http://www.sr.bham.ac.uk/~ajrs/R/r-function_list.html)
* functions are used by specifying their name followed by `()`, into which we put the arguments that the function takes.
`log()` is an example of a function to compute a logarithm.
You can also get information on how to use a function in RStudio by typing
`?log()`
This will prompt a Help page to appear in the bottom right pane, with a description of the function, the arguments it takes and some examples of how to apply it.
```{r }
# find out log of height_cm
log(height_cm)
# find out exp of height_cm
exp(height_cm)
```
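Many functions take more than one argument, and arguments can be given by name. The sketch below uses two base R functions with arbitrary example numbers: `log()` with its `base` argument and `round()` with its `digits` argument.

```{r}
# log() computes the natural logarithm by default
log(100)

# the base argument changes the base of the logarithm
log(100, base = 10)

# round() takes the number of decimal places in the digits argument
round(3.14159, digits = 2)
```

The Help page opened with `?log` lists these arguments and their default values.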
Objects can store more than one value, and these values can be
* numeric (e.g. 1, 2, -3, .2)
* character (e.g. "vancouver", "victoria", "nanaimo")
* logical (TRUE or FALSE)
We will see shortly that an object can also be a large dataset with a variety of data types.
```{r }
# an object with a few numeric values
heights_cm <- c(172, 143, 152, 180)
mean(heights_cm)
sd(heights_cm)
length(heights_cm)
```
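The other data types can be stored in the same way. In the sketch below all object names and values are made up for illustration; `class()` reports which type of data an object holds.

```{r}
# a character vector (text always goes in quotes)
cities <- c("vancouver", "victoria", "nanaimo")
class(cities)

# a logical vector
visited <- c(TRUE, FALSE, TRUE)
class(visited)

# a numeric vector
temperatures_c <- c(12.5, 14, 13.2)
class(temperatures_c)

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE values
sum(visited)
```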
**Packages**:
* Packages are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them, and (often) sample data. (From: http://r-pkgs.had.co.nz)
* Packages need to be **installed** only once, so we don't put these commands into the script. To install a package we use the command `install.packages("package-name")`. Today, we will be working with two packages: `dplyr` for data wrangling and `ggplot2` for data visualization
* Packages need to be **loaded** every time we start a new R session. Typically this command goes at the top of a script. To do so we use the `library("package-name")` command
> Challenge
Install the `dplyr` and `ggplot2` packages.
Type the following into console.
```{r}
# install.packages("ggplot2")
# install.packages("dplyr")
```
Now load these packages.
```{r}
library("ggplot2")
library("dplyr")
```
## Loading and inspecting the titanic.csv dataset
---
To load a csv file into RStudio we will use the function `read.csv()` and assign our dataset a name (create an object). The object created in the R environment is known as a data frame (a type of data structure in R), which has:
* variables of a data set as columns
* observations as rows
* and is rectangular
```{r}
library("dplyr")
library("ggplot2")
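# the path is relative to your project folder and must match the folder name exactly, including capitalisation
titanic <- read.csv("Data/titanic.csv")
# convert to a tibble: it prints only the first 10 rows and as many columns as fit on the screen
titanic <- as_tibble(titanic)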
```
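To see the rows-and-columns structure on a small scale, here is a tiny data frame built by hand. The names and values are made up and are not taken from the titanic file.

```{r}
# a made-up data frame with 3 observations (rows) and 3 variables (columns)
passengers_toy <- data.frame(
  Name = c("Anna", "Ben", "Cara"),
  Age = c(29, 41, 8),
  Survived = c(1, 0, 1)
)

passengers_toy
nrow(passengers_toy)   # number of rows (observations)
ncol(passengers_toy)   # number of columns (variables)

# the $ sign pulls out a single column
passengers_toy$Age
```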
Before we dive into extracting information from this dataset, it is important to know what variables we are working with.
There are several functions that will provide information about the structure of the dataframe: `str()` and `head()` (part of base R) or `glimpse()` (part of the `dplyr` package).
Note the differences in the output in your R console when you run these three functions.
```{r}
head(titanic)
glimpse(titanic)
str(titanic)
```
It is also possible to view a dataframe in a spreadsheet-like view. Click on `titanic` in your Environment to check it out.
**What is in the titanic dataset?**
Variable | Definition | Key |
---------|:--------------|:-------------------------- |
survival | Survival | 0 = No, 1 = Yes |
pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
sex | Sex | |
Age | Age in years | |
sibsp | # of siblings / spouses aboard the Titanic | |
parch | # of parents / children aboard the Titanic | |
ticket | Ticket number | |
fare | Passenger fare| |
cabin | Cabin number | |
embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
Other useful commands to explore a dataset:
```{r}
names(titanic) # lists the variable names
dim(titanic) # dimensions of the dataset, it has 891 observations by 12 columns
nrow(titanic) # number of observations
summary(titanic) # gives me summary statistics for every column/variable
```
---
## Data wrangling with the `dplyr` package
The `dplyr` package is built around functions that act as verbs for manipulating data frames. Its core consists of 5 main verbs:
* `filter()`: pick rows matching criteria
* `select()`: pick columns by name
* `group_by()`: group rows based on columns
* `summarise()`: reduce groups to values
* `mutate()`: add new variables
Each one of these functions can be performed individually but the real power comes in when we string them together using the pipe (`%>%`) operator. This allows us to perform multiple operations in one hit and return a single tibble with our results at the end.
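To make the pipe less abstract, the sketch below runs the same two steps with and without it. The data frame `pipe_toy` is made up for illustration (it is not the titanic data); `filter()` and `select()` are the `dplyr` verbs listed above.

```{r}
library("dplyr")

# a tiny, made-up data frame
pipe_toy <- data.frame(
  Sex = c("female", "male", "female", "male"),
  Age = c(30, 40, 22, 35),
  Fare = c(50, 20, 10, 80)
)

# without the pipe: functions are nested and read from the inside out
nested <- select(filter(pipe_toy, Sex == "female"), Age, Fare)

# with the pipe: the same steps read from top to bottom
piped <- pipe_toy %>%
  filter(Sex == "female") %>%
  select(Age, Fare)

# compare the two results
all.equal(nested, piped)
```

The pipe passes whatever is on its left as the first argument of the function on its right, which is why `pipe_toy` does not need to be repeated inside `filter()` or `select()`.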
Let's explore some of these functions!
### select()
Use the `select()` function to extract specific variables from a dataframe. You can store the variables extracted with `select()` in a new tibble if you want a small subset of the data to work on separately.
Here's an example.
```{r}
# this 'pipes' the titanic dataset to the select command and prints only the Age, Sex and Survived columns
titanic %>%
  select(Age, Sex, Survived)
```
I am happy with the output, so now let's create a separate object containing just these 3 columns (aka variables)
```{r}
gender_survival <- titanic %>%
  select(Age, Sex, Survived)
```
There are various ways to use `select()` function. For instance, we can extract variables that start or end with certain letters or strings.
```{r}
# select columns that begin with A
titanic %>%
  select(starts_with("A"))
# select columns that end with s
titanic %>%
  select(ends_with("s"))
```
> ### Challenge
Create a subset of the titanic dataframe called `passanger_info` containing `Name`, `Age` and `Sex`
### filter()
The function `filter()` extracts data **row wise**.
An example: Column (variable) **Sex** contains two categories: females and males. Let's extract all of the information (all variables) contained in the `titanic` dataset but only for females
```{r}
# extract all of the variables but only for females
titanic %>%
  filter(Sex == "female")
```
We can also do multiple levels of filtering.
```{r}
# extract all of the variables but only for females older than 20
titanic %>%
  filter(Sex == "female", Age > 20)
```
The trick with the `filter()` function is that each statement must return a `TRUE` or a `FALSE` for every row.
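You can see these `TRUE`/`FALSE` values by running a comparison on its own. The vector `ages` below holds made-up values used only for illustration.

```{r}
ages <- c(15, 20, 35)   # made-up values

# each comparison returns one TRUE or FALSE per value
ages > 20
ages == 20

# comparisons can be combined: & means "and", | means "or"
ages > 18 & ages < 30
ages < 18 | ages > 30
```

Inside `filter()`, conditions separated by a comma are combined with "and", so a row is kept only when every condition is `TRUE`.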
> Challenge
Create a new object called 'males' containing only male passengers and only the columns Age, Sex, Name and Fare. What dimensions does your object have?
### group_by() and summarize()
So far we learnt how to extract pieces of data from the `titanic` dataset, but we still have lots of rows of data. How do we summarize it so that we can find out something useful, say the average age of people who survived the tragedy? This is where two powerful functions come in: `group_by()` and `summarize()`.
To perform operations on the `titanic` dataset, we first need to specify which variable to group the data by, using the `group_by()` function.
In the example below we specify that the operation will be grouped by Sex for the `titanic` data. Because we have 2 categories in `Sex`, female and male, any calculation we perform after the `group_by()` will produce one value for females and one value for males.
```{r}
titanic %>%
  group_by(Sex)
```
Not much happens when running `group_by()` by itself, although above the dataframe printed in the console we can see a new line, `# Groups: Sex [2]`. To perform an operation we need to use `group_by()` in conjunction with `summarise()` (`summarize()` is the same function with a different spelling).
With `summarize()` we can use functions such as `mean()`, `sd()`, `median()` and `n()`, or we can perform calculations like we would in a cell of an Excel spreadsheet.
To calculate the mean age for females and males in the `titanic` dataset:
1. Group your data by variable **Sex**
2. Define what to average with the `mean()` function and put this inside the `summarize()`
3. Note, there are some missing values (NA's) in the variable **Age**, we must tell R to ignore NA's in those calculations using `na.rm = TRUE` (i.e. na remove = yes)
```{r}
titanic %>%
  group_by(Sex) %>%
  summarize(mean(Age, na.rm = TRUE))
```
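To see why `na.rm = TRUE` matters, try `mean()` on a small vector that contains a missing value. The values below are made up for illustration.

```{r}
ages_with_na <- c(22, NA, 38)   # made-up values, one of them is missing

# a single missing value makes the whole mean missing
mean(ages_with_na)

# na.rm = TRUE drops the missing value first, so this is the mean of 22 and 38
mean(ages_with_na, na.rm = TRUE)
```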
Now, let's find out the average age of females and males that survived (1) and died (0) by adding **Survived** into the `group_by()`
```{r}
titanic %>%
  group_by(Sex, Survived) %>%
  summarize(mean_age = mean(Age, na.rm = TRUE))
```
Just like with `group_by()`, where we can have many variables, we can apply many operations in `summarize()`.
```{r}
titanic %>%
  group_by(Sex, Survived) %>%
  summarize(mean_age = mean(Age, na.rm = TRUE),
            stdev = sd(Age, na.rm = TRUE),
            num_of_obs = n())
```
> Challenge 2
1. Calculate the mean, standard deviation and number of observations for each combination of Sex and passenger class (`Pclass`). Hint: follow the example we just did.
2. Calculate the average Survived value per Pclass and Sex. Which combination of Pclass and Sex had the highest Survived value and which combination had the lowest?
### `mutate()`
The `mutate()` function allows you to create brand new variables by performing operations on the existing data in the dataframe. For example, let's create a new variable called `status` that combines the information in the `Pclass` variable with the `Age` variable.
```{r}
titanic %>%
  mutate(status = Pclass * Age)
```
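A new variable can also be built by adding columns together. The sketch below is only an illustration: it uses made-up rows, assumes the column names `SibSp` and `Parch` for the siblings/spouses and parents/children counts described in the table earlier, and introduces a hypothetical new column called `family_size`.

```{r}
library("dplyr")

# made-up rows that mimic two columns of the titanic data (column names are assumed)
family_toy <- data.frame(
  SibSp = c(1, 0, 3),
  Parch = c(0, 0, 2)
)

# add a new column: siblings/spouses + parents/children + the passenger themselves
family_result <- family_toy %>%
  mutate(family_size = SibSp + Parch + 1)

family_result
```

Note that `mutate()` does not change the original data frame; to keep the new column, the result has to be assigned to a name, as done here with `family_result`.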
> Advanced dplyr Challenge
Calculate the average Survived value for a group of 20 randomly selected females from each Pclass group. Then arrange the classes in descending order. Hint: Use the dplyr functions arrange() and sample_n(), they have similar syntax to other dplyr functions. Look at the help!
---
## Data plotting with the `ggplot2` package
The `ggplot2` package is built on the [Grammar of Graphics](http://www.springer.com/gb/book/9780387245447). It was created by [Hadley Wickham](http://hadley.nz/).
The building blocks of a ggplot include:
* a dataset
* an aesthetic (position (x and y), color, shape, size, linetype, fill)
* a set of geoms, which define how data will be plotted (e.g. `geom_point()` is for scatterplots, while `geom_bar()` is for bar plots)
* statistical transformation (optional)
* scales
* a coordinate system
* faceting (for creating subplots)
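As a rough sketch of how these building blocks fit together, the plot below adds one layer per block. It uses `mtcars`, a small dataset that ships with R, rather than the titanic data, and the axis label is just an example.

```{r}
library("ggplot2")

# mtcars is a built-in R dataset (used here instead of titanic)
building_blocks_plot <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +  # dataset + aesthetics
  geom_point() +                                    # geom: draw each car as a point
  scale_x_continuous(name = "Weight (1000 lbs)") +  # scale: relabel the x axis
  coord_cartesian() +                               # coordinate system (the default one)
  facet_wrap(~am)                                   # faceting: one subplot per value of am

building_blocks_plot
```

Only the dataset, the aesthetics and a geom are needed to get a plot; the remaining layers have sensible defaults and are added when you want to change them.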
There is a handy [cheatsheet](https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf) for using ggplot2.
And a crazy amount of graphs and options you can customise - [check it out!](http://ggplot2.tidyverse.org/index.html)
Examples of geoms:
* `geom_point()` - will create a scatterplot
* `geom_bar()` - will create a bar plot
* `geom_boxplot()` - you get the idea...
* `geom_line()`
* `geom_errorbar()`
* `geom_dotplot()`
## Scatter plot example
Below we make a simple scatterplot of Age versus Fare and colour the points by the passenger's sex.
```{r}
titanic %>%
  ggplot(aes(x = Age, y = Fare, color = Sex)) +
  geom_point()
```
## Dplyr + ggplot
We can mix in some of the `dplyr` verbs we learnt earlier to zoom in on the data of interest within our graph.
```{r}
titanic %>%
  filter(Fare < 200) %>%
  ggplot(aes(x = Age, y = Fare, color = Sex)) +
  geom_point()
```
## Bar plots
Creating bar plots : `geom_bar()`
```{r}
titanic %>%
  ggplot(aes(x = Pclass, fill = Sex)) +
  geom_bar(position = "dodge")
```
## Avoiding overplotting
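Visualizing data points with `geom_jitter()`, which adds a small amount of random noise to each point so that overlapping points become visible
```{r}
titanic %>%
  ggplot(aes(x = Pclass, y = Age, color = Sex)) +
  geom_jitter()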
```
You can also try `geom_hex()`, `geom_count()` or `geom_bin2d()`.
## Faceting
Creating multiple plots using `facet_grid()`
```{r}
titanic %>%
  ggplot(aes(x = Pclass, y = Age, color = Sex)) +
  geom_jitter() +
  facet_grid(~Survived)
```
---
# Useful Resources: local and remote places to practice R
Cheatsheets: R for data science:
* [for various R tools - RStudio cheatsheets](https://www.rstudio.com/resources/cheatsheets/)
* [RStudio - IDE](https://github.com/rstudio/cheatsheets/blob/master/rstudio-ide.pdf)
* [data import](https://github.com/rstudio/cheatsheets/blob/master/data-import.pdf)
* [dplyr](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)
* [ggplot2](https://github.com/rstudio/cheatsheets/blob/master/data-visualization-2.1.pdf)
R-Ladies resources: [global repository](https://github.com/rladies/resources)
* Stay tuned for future beginner R meetups with R-Ladies Vancouver!
* Ecoscope offers a variety of R workshops for all levels [Ecoscope](http://ecoscope.ubc.ca/events/) - next one is on reproducible research [check out tickets here](http://ecoscope.ubc.ca/events/)
* Check out a series of workshops on [Intro into Data Science](https://www.meetup.com/Women-Who-Code-Vancouver/events/249035344/) by [Women Who Code Vancouver](https://www.meetup.com/Women-Who-Code-Vancouver/)