restructure tidyverse and use of R materials
MonkmanMH committed Mar 23, 2019
1 parent 294847c commit b56f591
Showing 3 changed files with 135 additions and 101 deletions.
100 changes: 0 additions & 100 deletions 03_data_science_practice.Rmd
@@ -304,107 +304,7 @@ Louisa Smith , [epi quals study calendar](https://docs.google.com/spreadsheets/d
* https://twitter.com/louisahsmith/status/1081955868864901120


***

## The tidyverse

All too often, data are messy. There are rows with no contents, colour-coded cells, and inconsistent values.

One important way that data can be cleaned is to ensure that the structure is tidy. What do we mean by tidy data?

> There are three interrelated rules which make a dataset tidy:
> * Each variable must have its own column.
> * Each observation must have its own row.
> * Each value must have its own cell.

And

> Why ensure that your data is tidy? There are two main advantages:
>
> 1. There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity.
>
> 2. There’s a specific advantage to placing variables in columns because it allows R’s vectorised nature to shine. As you learned in mutate and summary functions, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.

(from Hadley Wickham & Garrett Grolemund, [_R for Data Science_](http://r4ds.had.co.nz/))

This won't solve things like inconsistent values and colour-coded cells, but it will solve some other messiness.

For more about the principles of tidy data, see:

* Hadley Wickham, ["Tidy data", _Journal of Statistical Software_, vol. 59, 2014.](https://www.jstatsoft.org/article/view/v059i10)
+ [alternate link](http://vita.had.co.nz/papers/tidy-data.html)
+ [informal and code-heavy version](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html)


### Other references

Karl Broman and Kara Woo, ["Data organization in spreadsheets"](https://github.com/kbroman/Paper_DataOrg) (GitHub page with source manuscript) -- application of tidy principles to spreadsheets.

* see also Karl Broman's tutorial, ["Data organization: organizing data in spreadsheets"](http://kbroman.org/dataorg/)


Bruno Rodrigues, [Modern R with the tidyverse](https://b-rodrigues.github.io/modern_R/)


Jesse Sadler, [Excel vs R: A Brief Introduction to R (With examples using dplyr and ggplot)](https://www.jessesadler.com/post/excel-vs-r/) (2017-10-02)





## Tidy Tools

[The tidyverse style guide](http://style.tidyverse.org/functions.html)


### tidyverse R packages

[The tidyverse](http://tidyverse.org/)

[The tidyverse R packages on GitHub](https://github.com/hadley/tidyverse)

#### `broom`


#### `dplyr`

`dplyr` now gets its own page, labelled [**Data Wrangling**](DataWrangling.md)


#### `purrr`

* [A purrr tutorial](https://github.com/Cascadia-R/purrr-tutorial) -- Cascadia-R, 2017-06-03

* Charlotte Wickham, [purrr tutorial](https://github.com/cwickham/purrr-tutorial) -- GitHub

### More about tidy data

* Hadley Wickham & Garrett Grolemund, [_R for Data Science_](http://r4ds.had.co.nz/)

* Hadley Wickham
+ [Tidy data and tidy tools (video of presentation, December 2011)](https://vimeo.com/33727555)

* Garrett Grolemund
+ [Data Tidying](http://garrettgman.github.io/tidying/) (part of [Data Science with R](http://garrettgman.github.io/))

* Chester Ismay and Ted Laderas, [A gRadual-intRoduction to the tidyverse](https://github.com/Cascadia-R/gRadual-intRoduction-tidyverse?utm_content=buffer98896&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer)


***

## Categorical data

Emily Robinson, DataCamp course, [Categorical Data in the Tidyverse](https://www.datacamp.com/courses/categorical-data-in-the-tidyverse)

***

## Tidy Text

If you're going to undertake text mining and natural language processing, your text (i.e. your data) needs to be tidy. Fortunately, there's an R package for that: `tidytext`.

* Julia Silge, [Term Frequency and tf-idf Using Tidy Data Principles](http://juliasilge.com/blog/Term-Frequency-tf-idf/), 2016-06-27

(See the companion page on the topics of [text analysis and text mining](https://github.com/MonkmanMH/DataScienceResources/blob/master/TextAnalysis.md)).



113 changes: 113 additions & 0 deletions 04_data_science_tools.Rmd
@@ -0,0 +1,113 @@
# Using R {#using_r}



## Introduction

Many resources explain how to use R to solve statistical and data science problems; many others focus on how to work more effectively in R itself.

This chapter compiles what I consider to be the essential texts; articles and blog posts are kept to a minimum.


## R as a programming environment

Colin Gillespie and Robin Lovelace, _Efficient R Programming_ [@Gillespie_Lovelace_2017]

Hadley Wickham, _Advanced R_ [@Wickham_advancedR]

***

## The tidyverse

All too often, data are messy. There are rows with no contents, colour-coded cells, and inconsistent values.

One important way that data can be cleaned is to ensure that the structure is tidy. What do we mean by tidy data?

> There are three interrelated rules which make a dataset tidy:
> * Each variable must have its own column.
> * Each observation must have its own row.
> * Each value must have its own cell.

And

> Why ensure that your data is tidy? There are two main advantages:
>
> 1. There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity.
>
> 2. There’s a specific advantage to placing variables in columns because it allows R’s vectorised nature to shine. As you learned in mutate and summary functions, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.

(from [@Wickham_Grolemund2016])


This won't solve things like inconsistent values and colour-coded cells, but it will solve some other messiness.
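
As a minimal sketch of the three rules, the example below reshapes a small, made-up table in which the years are spread across column names into one row per observation. It assumes the `tidyr` package (version 1.0.0 or later, for `pivot_longer()`); the data frame `messy` and its column names are hypothetical, chosen only for illustration.

```r
library(tidyr)

# Hypothetical "wide" table: one row per country, one column per year.
# The year is a variable, but here it is hidden in the column names.
messy <- data.frame(
  country = c("A", "B"),
  `2018` = c(10, 20),
  `2019` = c(11, 22),
  check.names = FALSE
)

# Move the year columns into a single `year` variable so that each
# row holds exactly one observation (country, year, cases).
tidy <- pivot_longer(
  messy,
  cols = c(`2018`, `2019`),
  names_to = "year",
  values_to = "cases"
)

# With `cases` in a single column, one vectorised expression
# covers every observation at once.
tidy$share <- tidy$cases / sum(tidy$cases)
```

Because `cases` is now a single column, the last line applies one vectorised expression to all observations, which is the second advantage described in the quotation above.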

For more about the principles of tidy data, see Hadley Wickham's article "Tidy data", in the _Journal of Statistical Software_ [@tidydata].

+ [alternate link](http://vita.had.co.nz/papers/tidy-data.html)
+ [informal and code-heavy version](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html)


### Other tidyverse references

Karl Broman and Kara Woo, ["Data organization in spreadsheets"](https://github.com/kbroman/Paper_DataOrg) (GitHub page with source manuscript) -- application of tidy principles to spreadsheets.

* see also Karl Broman's tutorial, ["Data organization: organizing data in spreadsheets"](http://kbroman.org/dataorg/)


Bruno Rodrigues, [Modern R with the tidyverse](https://b-rodrigues.github.io/modern_R/)


Jesse Sadler, [Excel vs R: A Brief Introduction to R (With examples using dplyr and ggplot)](https://www.jessesadler.com/post/excel-vs-r/) (2017-10-02)


#### Categorical data

Emily Robinson, DataCamp course, [Categorical Data in the Tidyverse](https://www.datacamp.com/courses/categorical-data-in-the-tidyverse)


#### Tidy text

If you're going to undertake text mining and natural language processing, your text (i.e. your data) needs to be tidy. Fortunately, there's an R package for that: `tidytext`.

See the companion chapter on the topics of [Text Analysis and Text Mining].
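
A minimal sketch of what "tidy" means for text, assuming the `tidytext` package is installed: `unnest_tokens()` splits a text column into one token (here, one word) per row. The two sentences in `text_df` are invented for illustration.

```r
library(tidytext)

# Hypothetical input: one row per line of text.
text_df <- data.frame(
  line = 1:2,
  text = c("Tidy data is consistent.",
           "Tidy text has one token per row."),
  stringsAsFactors = FALSE
)

# One word per row; by default the tokens are lower-cased and
# punctuation is dropped. The `line` column is kept alongside `word`.
tidy_words <- unnest_tokens(text_df, word, text)
```

The result has the same one-observation-per-row shape as any other tidy table, so the usual counting and summarising tools apply to it directly.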




### tidyverse R packages

[The tidyverse](http://tidyverse.org/)

[The tidyverse R packages on GitHub](https://github.com/hadley/tidyverse)

#### `broom`


#### `dplyr`

`dplyr` now gets its own page, labelled [**Data Wrangling**](DataWrangling.md)


#### `purrr`

* [A purrr tutorial](https://github.com/Cascadia-R/purrr-tutorial) -- Cascadia-R, 2017-06-03

* Charlotte Wickham, [purrr tutorial](https://github.com/cwickham/purrr-tutorial) -- GitHub

### More about tidy data

* Hadley Wickham & Garrett Grolemund, [_R for Data Science_](http://r4ds.had.co.nz/)

* Hadley Wickham
+ [Tidy data and tidy tools (video of presentation, December 2011)](https://vimeo.com/33727555)

* Garrett Grolemund
+ [Data Tidying](http://garrettgman.github.io/tidying/) (part of [Data Science with R](http://garrettgman.github.io/))

* Chester Ismay and Ted Laderas, [A gRadual-intRoduction to the tidyverse](https://github.com/Cascadia-R/gRadual-intRoduction-tidyverse?utm_content=buffer98896&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer)





-30-
23 changes: 22 additions & 1 deletion book.bib
@@ -86,6 +86,15 @@ @Book{Gelman_etal_2014
}


@Book{Gillespie_Lovelace_2017,
title = {Efficient R Programming: A Practical Guide to Smarter Programming},
author = {Colin Gillespie and Robin Lovelace},
publisher = {O'Reilly},
year = {2017},
isbn = {978-1-491-95078-4},
url = {https://csgillespie.github.io/efficientR/}
}


@Book{Graham_etal_2015,
title = {Exploring Big Historical Data: The Historian's Macroscope},
@@ -266,8 +275,20 @@ @article{tidydata
issn = {1548-7660},
pages = {1--23},
doi = {10.18637/jss.v059.i10},
url = {https://www.jstatsoft.org/index.php/jss/article/view/v059i10}
url = {https://www.jstatsoft.org/index.php/jss/article/view/v059i10},
}


@Book{Wickham_advancedR,
title = {Advanced R},
author = {Hadley Wickham},
publisher = {CRC Press},
year = {2015},
isbn = {978-1-4665-8696-3},
url = {https://adv-r.hadley.nz/},
}


@Book{Wickham_Grolemund2016,
author = {Hadley Wickham and