diff --git a/03_data_science_practice.Rmd b/03_data_science_practice.Rmd
index 294976a..34f607c 100644
--- a/03_data_science_practice.Rmd
+++ b/03_data_science_practice.Rmd
@@ -304,107 +304,7 @@ Louisa Smith , [epi quals study calendar](https://docs.google.com/spreadsheets/d
 * https://twitter.com/louisahsmith/status/1081955868864901120
-***
-
-## The tidyverse
-
-All too often, data are messy. There are rows with no contents, colour-coded cells, and inconsistent values.
-
-One important way that data can be cleaned is to ensure that the structure is tidy. What do we mean by tidy data?
-
-> There are three interrelated rules which make a dataset tidy:
-> * Each variable must have its own column.
-> * Each observation must have its own row.
-> * Each value must have its own cell.
-
-And
-
-> Why ensure that your data is tidy? There are two main advantages:
->
-> 1. There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity.
->
-> 2. There’s a specific advantage to placing variables in columns because it allows R’s vectorised nature to shine. As you learned in mutate and summary functions, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.
-
-(from Hadley Wickham & Garrett Grolemund, [_R for Data Science_](http://r4ds.had.co.nz/))
-
-This won't solve things like inconsistent values and colour-coded cells, but it will solve some other messiness.
-
-For more about the principles of tidy data, see:
-
-* Hadley Wickham, ["Tidy data", _The Journal of Statistical Software_, vol. 59, 2014.](https://www.jstatsoft.org/article/view/v059i10)
- + [alternate link:](http://vita.had.co.nz/papers/tidy-data.html)
- + [informal and code-heavy version](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html)
-
-
-### Other references
-
-Karl Broman and Kara Woo, ["Data organization in spreadsheets"](https://github.com/kbroman/Paper_DataOrg) (github page with source manuscript) -- application of tidy principles to spreadsheets.
-
-* see also Karl Broman's tutorial, ["Data organization: organizing data in spreadsheets)
-
-
-Bruno Rodriguez, [Modern R with the tidyverse](https://b-rodrigues.github.io/modern_R/)
-
-
-Jesse Sadler, [Excel vs R: A Brief Introduction to R (With examples using dplyr and ggplot](http://kbroman.org/dataorg/](https://www.jessesadler.com/post/excel-vs-r/) (2017-10-02)
-
-
-
-
-
-## Tidy Tools
-
-[The tidyverse style guide](http://style.tidyverse.org/functions.html)
-
-
-### tidyverse R packages
-
-[The tidyverse: ](http://tidyverse.org/)
-
-[The tidyverse R packages on github](https://github.com/hadley/tidyverse)
-
-#### `broom`
-
-
-#### `dplyr`
-
-`dplyr` now gets its own page, labelled [**Data Wrangling**](DataWrangling.md)
-
-
-#### `purrr`
-
-* [A purrr tutorial](https://github.com/Cascadia-R/purrr-tutorial) -- Cascadia-R, 2017-06-03
-
-* Charlotte Wickham, [purr tutorial](https://github.com/cwickham/purrr-tutorial) -- github
-
-### more about tidy data
-
-* Hadley Wickham & Garrett Grolemund, [_R for Data Science_](http://r4ds.had.co.nz/)
-
-* Hadley Wickham
- + [Tidy data and tidy tools (video of presentation, December 2011)](https://vimeo.com/33727555)
-
-* Garrett Grolemund
- + [Data Tidying](http://garrettgman.github.io/tidying/) (part of [Data Science with R](http://garrettgman.github.io/))
-
-* Chester Ismay and Ted Laderas, [A gRadual-intRoduction to the tidyverse](https://github.com/Cascadia-R/gRadual-intRoduction-tidyverse?utm_content=buffer98896&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer)
-
-
-***
-
-## Categorical data
-
-Emily Robinson, DataCamp course, [Categorical Data in the Tidyverse](https://www.datacamp.com/courses/categorical-data-in-the-tidyverse)
-
-***
-
-## Tidy Text
-
-If you're going to undertake text mining and natural language processing, your text (i.e. your data) needs to be tidy. Fortunately, there's an R package for that: `tidytext`.
-
-* Julia Silge, [Term Frequency and tf-idf Using Tidy Data Principles](http://juliasilge.com/blog/Term-Frequency-tf-idf/), 2016-06-27
-(See the companion page on the topics of [text analysis and text mining](https://github.com/MonkmanMH/DataScienceResources/blob/master/TextAnalysis.md)).
diff --git a/04_data_science_tools.Rmd b/04_data_science_tools.Rmd
new file mode 100644
index 0000000..2bcafd9
--- /dev/null
+++ b/04_data_science_tools.Rmd
@@ -0,0 +1,113 @@
+# Using R {#using_r}
+
+
+
+## Introduction
+
+In addition to there being many resources available for using R to solve statistical and data science challenges, there are also many resources on how to maximize your effectiveness using R.
+
+This chapter compiles what I consider to be the essential texts; articles and blog posts will be at a minimum.
+
+
+## R as a programming environment
+
+Colin Gillespie and Robin Lovelace, _Efficient R Programming_ [@Gillespie_Lovelace_2017]
+
+Hadley Wickham, _Advanced R_ [@Wickham_advancedR]
+
+***
+
+## The tidyverse
+
+All too often, data are messy. There are rows with no contents, colour-coded cells, and inconsistent values.
+
+One important way that data can be cleaned is to ensure that the structure is tidy. What do we mean by tidy data?
+
+> There are three interrelated rules which make a dataset tidy:
+> * Each variable must have its own column.
+> * Each observation must have its own row.
+> * Each value must have its own cell.
+
+And
+
+> Why ensure that your data is tidy? There are two main advantages:
+>
+> 1. There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity.
+>
+> 2. There’s a specific advantage to placing variables in columns because it allows R’s vectorised nature to shine. As you learned in mutate and summary functions, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.
+
+(from [@Wickham_Grolemund2016])
+
+
+This won't solve things like inconsistent values and colour-coded cells, but it will solve some other messiness.
+
+For more about the principles of tidy data, see Hadley Wickham's article "Tidy data", in _The Journal of Statistical Software_ [@tidydata]
+
+ + [alternate link:](http://vita.had.co.nz/papers/tidy-data.html)
+ + [informal and code-heavy version](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html)
+
+
+### Other tidyverse references
+
+Karl Broman and Kara Woo, ["Data organization in spreadsheets"](https://github.com/kbroman/Paper_DataOrg) (github page with source manuscript) -- application of tidy principles to spreadsheets.
+
+* see also Karl Broman's tutorial, ["Data organization: organizing data in spreadsheets)
+
+
+Bruno Rodriguez, [Modern R with the tidyverse](https://b-rodrigues.github.io/modern_R/)
+
+
+Jesse Sadler, [Excel vs R: A Brief Introduction to R (With examples using dplyr and ggplot](http://kbroman.org/dataorg/](https://www.jessesadler.com/post/excel-vs-r/) (2017-10-02)
+
+
+#### Categorical data
+
+Emily Robinson, DataCamp course, [Categorical Data in the Tidyverse](https://www.datacamp.com/courses/categorical-data-in-the-tidyverse)
+
+
+#### Tidy text
+
+If you're going to undertake text mining and natural language processing, your text (i.e. your data) needs to be tidy. Fortunately, there's an R package for that: `tidytext`.
+
+See the companion chapter on the topics of [Text Analysis and Text Mining].
+
+
+
+
+### tidyverse R packages
+
+[The tidyverse: ](http://tidyverse.org/)
+
+[The tidyverse R packages on github](https://github.com/hadley/tidyverse)
+
+#### `broom`
+
+
+#### `dplyr`
+
+`dplyr` now gets its own page, labelled [**Data Wrangling**](DataWrangling.md)
+
+
+#### `purrr`
+
+* [A purrr tutorial](https://github.com/Cascadia-R/purrr-tutorial) -- Cascadia-R, 2017-06-03
+
+* Charlotte Wickham, [purr tutorial](https://github.com/cwickham/purrr-tutorial) -- github
+
+### more about tidy data
+
+* Hadley Wickham & Garrett Grolemund, [_R for Data Science_](http://r4ds.had.co.nz/)
+
+* Hadley Wickham
+ + [Tidy data and tidy tools (video of presentation, December 2011)](https://vimeo.com/33727555)
+
+* Garrett Grolemund
+ + [Data Tidying](http://garrettgman.github.io/tidying/) (part of [Data Science with R](http://garrettgman.github.io/))
+
+* Chester Ismay and Ted Laderas, [A gRadual-intRoduction to the tidyverse](https://github.com/Cascadia-R/gRadual-intRoduction-tidyverse?utm_content=buffer98896&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer)
+
+
+
+
+
+-30-
diff --git a/book.bib b/book.bib
index 596f142..19f1769 100644
--- a/book.bib
+++ b/book.bib
@@ -86,6 +86,15 @@ @Book{Gelman_etal_2014
 }
+@Book{Gillespie_Lovelace_2017,
+ title = {Efficient R Programming: A Practical Guide to Smarter Programming},
+ author = {Colin Gillespie and Robin Lovelace},
+ publisher = {O'Reilly},
+ year = {2017},
+ isbn = {ISBN 978-1-491-95078-4},
+ url = {https://csgillespie.github.io/efficientR/}
+}
+
 @Book{Graham_etal_2015,
  title = {Exploring Big Historical Data: The Historian's Macroscope},
@@ -266,8 +275,20 @@ @article{tidydata
  issn = {1548-7660},
  pages = {1--23},
  doi = {10.18637/jss.v059.i10},
- url = {https://www.jstatsoft.org/index.php/jss/article/view/v059i10}
+ url = {https://www.jstatsoft.org/index.php/jss/article/view/v059i10},
 }
+
+
+@Book{Wickham_advancedR,
+ title = {Advanced R},
+ author = {Hadley Wickham},
+ publisher = {CRC Press},
+ year = {2015},
+ isbn = {ISBN 978-1-4665-8696-3},
+ url = {https://adv-r.hadley.nz/},
+}
+
+
 @Book{Wickham_Grolemund2016,
  author = {Hadley Wickham and