From 20a31a210630777cd335fcb60aa562deddf32b51 Mon Sep 17 00:00:00 2001 From: Monkman Date: Thu, 18 Jul 2019 08:18:55 -0700 Subject: [PATCH] wrangling Data Wrangling (what have I done?!) --- 03_data_science_practice.Rmd | 14 +- 04_using_R.Rmd | 107 --------------- 20_data_wrangling.Rmd | 127 +++++++++++++++--- ...rmat.rmd => 21_data_reading_fileformat.rmd | 30 +++-- 4 files changed, 144 insertions(+), 134 deletions(-) rename 12_data_reading_fileformat.rmd => 21_data_reading_fileformat.rmd (79%) diff --git a/03_data_science_practice.Rmd b/03_data_science_practice.Rmd index c6db7c2..c36d6e2 100644 --- a/03_data_science_practice.Rmd +++ b/03_data_science_practice.Rmd @@ -78,7 +78,7 @@ Ray Li (2016) ["7 habits of highly effective data analysis"](http://dataconomy.c Noble, William Stafford (2009-07-31) [A Quick Guide to Organizing Computational Biology Projects](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000424), _PLoS Comput Biol_ 5(7): e1000424. -### Version Control with Git & GitHub +## Version Control with Git & GitHub Jenny Bryan, the STAT 545 TAs, Jim Hester [Happy Git and GitHub for the useR](https://happygitwithr.com/) @@ -112,9 +112,9 @@ Emily Robinson, [Red Flags in Data Science Interviews](http://hookedondata.org/R -### R packages supporting robust workflow +## R packages supporting robust workflow -#### {janitor} +### {janitor} - [{janitor}](sfirke.github.io/janitor/index.html) -- "has simple functions for examining and cleaning dirty data. It was built with beginning and intermediate R users in mind and is optimized for user-friendliness. Advanced R users can already do everything covered here, but with janitor they can do it faster and save their thinking for the fun stuff." 
@@ -123,7 +123,7 @@ CRAN: [janitor: Simple Tools for Examining and Cleaning Dirty Data](https://cran GitHub: [sfirke/janitor](https://github.com/sfirke/janitor) -#### {packrat} +### {packrat} - [{Packrat} is a dependency management system for R](http://rstudio.github.io/packrat/) @@ -137,6 +137,12 @@ GitHub: [rstudio/packrat](https://github.com/rstudio/packrat) Miles McBain (2019-04-09) [A workflow for lightweight R dependency management](https://milesmcbain.xyz/packrat-lite/) +### {usethis} - + +"usethis is a workflow package: it automates repetitive tasks that arise during project setup and development, both for R packages and non-package projects" + +[usethis 1.5.0](https://www.tidyverse.org/articles/2019/04/usethis-1.5.0/) (April 2019) + *** diff --git a/04_using_R.Rmd b/04_using_R.Rmd index 11756e1..903d36a 100644 --- a/04_using_R.Rmd +++ b/04_using_R.Rmd @@ -67,113 +67,6 @@ Jennifer Bryan and Jim Hester, ["Debugging R code"](https://whattheyforgot.org/d -## Data wrangling - -(See {The tidyverse}, below) - -#### {datapasta} - -Vignette: [How to Datapasta](https://cran.r-project.org/web/packages/datapasta/vignettes/how-to-datapasta.html) - -*** - -## The tidyverse - -All too often, data are messy. There are rows with no contents, colour-coded cells, and inconsistent values. - -One important way that data can be cleaned is to ensure that the structure is tidy. What do we mean by tidy data? - -> There are three interrelated rules which make a dataset tidy: -> * Each variable must have its own column. -> * Each observation must have its own row. -> * Each value must have its own cell. - -And - -> Why ensure that your data is tidy? There are two main advantages: -> -> 1. There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity. -> -> 2. 
There’s a specific advantage to placing variables in columns because it allows R’s vectorised nature to shine. As you learned in mutate and summary functions, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural. - -(from [@Wickham_Grolemund2016]) - - -This won't solve things like inconsistent values and colour-coded cells, but it will solve some other messiness. - -For more about the principles of tidy data, see Hadley Wickham's article "Tidy data", in _The Journal of Statistical Software_ [@tidydata] - - + [alternate link:](http://vita.had.co.nz/papers/tidy-data.html) - + [informal and code-heavy version](https://cran.r-project.org/package=tidyr/vignettes/tidy-data.html) - - -### Other tidyverse references - -Karl Broman and Kara Woo, ["Data organization in spreadsheets"](https://github.com/kbroman/Paper_DataOrg) (github page with source manuscript) -- application of tidy principles to spreadsheets. - -* see also Karl Broman's tutorial, ["Data organization: organizing data in spreadsheets) - - -Bruno Rodriguez, [Modern R with the tidyverse](https://b-rodrigues.github.io/modern_R/) - - -Jesse Sadler, [Excel vs R: A Brief Introduction to R (With examples using dplyr and ggplot](http://kbroman.org/dataorg/](https://www.jessesadler.com/post/excel-vs-r/) (2017-10-02) - - -#### Categorical data - -Emily Robinson, _Categorical data in the tidyverse_ {link to DataCamp course removed} - - -#### Tidy text - -If you're going to undertake text mining and natural language processing, your text (i.e. your data) needs to be tidy. Fortunately, there's an R package for that: `tidytext`. - -See the companion chapter on the topics of [Text Analysis and Text Mining]. 
- - - - -### tidyverse R packages - -[The tidyverse: ](http://tidyverse.org/) - -[The tidyverse R packages on github](https://github.com/hadley/tidyverse) - -#### {broom} - - -#### {dplyr} - -`dplyr` now gets its own page, labelled [**Data Wrangling**](DataWrangling.md) - - -#### {purrr} - -* [A purrr tutorial](https://github.com/Cascadia-R/purrr-tutorial) -- Cascadia-R, 2017-06-03 - -* Charlotte Wickham, [purr tutorial](https://github.com/cwickham/purrr-tutorial) -- github - - -##### {usethis} - -[usethis 1.5.0](https://www.tidyverse.org/articles/2019/04/usethis-1.5.0/) (April 2019) - - -*** - -### more about tidy data - -* Hadley Wickham & Garrett Grolemund, [_R for Data Science_](http://r4ds.had.co.nz/) - -* Hadley Wickham - + [Tidy data and tidy tools (video of presentation, December 2011)](https://vimeo.com/33727555) - -* Garrett Grolemund - + [Data Tidying](http://garrettgman.github.io/tidying/) (part of [Data Science with R](http://garrettgman.github.io/)) - -* Chester Ismay and Ted Laderas, [A gRadual-intRoduction to the tidyverse](https://github.com/Cascadia-R/gRadual-intRoduction-tidyverse?utm_content=buffer98896&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer) - *** diff --git a/20_data_wrangling.Rmd b/20_data_wrangling.Rmd index 00f351f..2da2c47 100644 --- a/20_data_wrangling.Rmd +++ b/20_data_wrangling.Rmd @@ -1,11 +1,52 @@ -# Data Wrangling (emphasis on **dplyr**) {#datawrangling} - +# Data Wrangling (emphasis on tidy data) {#datawrangling} ## Introduction Data is rarely in condition to use it...there's invariably something amiss. Data wrangling (a.k.a. data carpentry) is the process of getting it ready for analysis. +And all too often, data are messy. There are rows with no contents, colour-coded cells, and inconsistent values. + +One important way that data can be cleaned is to ensure that the structure is tidy. What do we mean by tidy data? 
+ +> There are three interrelated rules which make a dataset tidy: +> * Each variable must have its own column. +> * Each observation must have its own row. +> * Each value must have its own cell. + +And + +> Why ensure that your data is tidy? There are two main advantages: +> +> 1. There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity. +> +> 2. There’s a specific advantage to placing variables in columns because it allows R’s vectorised nature to shine. As you learned in mutate and summary functions, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural. + +(from [@Wickham_Grolemund2016]) + + +This won't solve things like inconsistent values and colour-coded cells, but it will solve some other messiness. + +For more about the principles of tidy data, see Hadley Wickham's article "Tidy data", in _The Journal of Statistical Software_ [@tidydata] + + + [alternate link:](http://vita.had.co.nz/papers/tidy-data.html) + + [informal and code-heavy version](https://cran.r-project.org/package=tidyr/vignettes/tidy-data.html) + + +### Other tidyverse references + +Karl Broman and Kara Woo, ["Data organization in spreadsheets"](https://github.com/kbroman/Paper_DataOrg) (github page with source manuscript) -- application of tidy principles to spreadsheets. + +* see also Karl Broman's tutorial, ["Data organization: organizing data in spreadsheets) + + +Bruno Rodriguez, [Modern R with the tidyverse](https://b-rodrigues.github.io/modern_R/) + + +Jesse Sadler, [Excel vs R: A Brief Introduction to R (With examples using dplyr and ggplot](http://kbroman.org/dataorg/](https://www.jessesadler.com/post/excel-vs-r/) (2017-10-02) + + + ## Theory and methods @@ -13,28 +54,30 @@ Data is rarely in condition to use it...there's invariably something amiss. 
Dat [Stat 545: Data wrangling, exploration, and analysis with R](http://stat545.com/index.html) -- course materials associated with the University of British Columbia's Statistics 545 course. Prepared in large part by Dr. Jenny Bryan. -### Tidy evaluation -* programming with **dplyr** +*** -Edwin Thoen, 2017-08-25 [Tidy evaluation, most common actions](https://edwinth.github.io/blog/dplyr-recipes/) +## Tools -### Reading messy files -Luis D. Verde, 2018-12-14, [Tidyeval meets PDF table hell](http://luisdva.github.io/rstats/Tidyeval-pdf-hell/) -- great solution to the common problem of broken rows ("values that are broken up into two lines for whatever reason (often to optimize space on a page in a table in a typeset pdf)"). +### {datapasta} - +Vignette: [How to Datapasta](https://cran.r-project.org/web/packages/datapasta/vignettes/how-to-datapasta.html) -### Working with dates -

Updated Turing Test concept:
A spreadsheet of dates, hand-entered by interns more than a decade ago, featuring such well-known time formats as "1996ish", "1941/xd01944", "1955?" and "WWII."
I'm not worried about AI until someone shows me the algorithm that can make sense of this. pic.twitter.com/IhzofigX2b

— Brooke Watson (@brookLYNevery1) January 19, 2018
- +### {janitor} - + +*** + +## The tidyverse + +[The tidyverse: ](http://tidyverse.org/) -## R +[The tidyverse R packages on github](https://github.com/hadley/tidyverse) -Arranged by package -### **dplyr** +### {dplyr} - **package** @@ -51,7 +94,8 @@ github: [hadley/dplyr](https://github.com/hadley/dplyr) -### **forcats** + +### {forcats} - [reference page](https://forcats.tidyverse.org/) @@ -60,9 +104,12 @@ Working with factors [Be the boss of your factors](https://stat545.com/block029_factors.html#change-order-of-the-levels-because-i-said-so) +Emily Robinson, _Categorical data in the tidyverse_ {link to DataCamp course removed} + -### **purrr** +### {purrr} - + [reference page](https://purrr.tidyverse.org/) @@ -84,4 +131,54 @@ Emorie D Beck, [Intro to purrr](https://emoriebeck.github.io/R-tutorials/purrr/) Sharon Machlis, [R Tip: Access nested list items with purrr](https://www.infoworld.com/video/90327/r-tip-access-nested-list-items-with-purrr) {video} +[A purrr tutorial](https://github.com/Cascadia-R/purrr-tutorial) -- Cascadia-R, 2017-06-03 + +Charlotte Wickham, [purr tutorial](https://github.com/cwickham/purrr-tutorial) -- github + + + +*** + +### more about tidy data + +* Hadley Wickham & Garrett Grolemund, [_R for Data Science_](http://r4ds.had.co.nz/) + +* Hadley Wickham + + [Tidy data and tidy tools (video of presentation, December 2011)](https://vimeo.com/33727555) + +* Garrett Grolemund + + [Data Tidying](http://garrettgman.github.io/tidying/) (part of [Data Science with R](http://garrettgman.github.io/)) + +* Chester Ismay and Ted Laderas, [A gRadual-intRoduction to the tidyverse](https://github.com/Cascadia-R/gRadual-intRoduction-tidyverse?utm_content=buffer98896&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer) + + + + +## Working with dates + +

Updated Turing Test concept:
A spreadsheet of dates, hand-entered by interns more than a decade ago, featuring such well-known time formats as "1996ish", "1941/xd01944", "1955?" and "WWII."
I'm not worried about AI until someone shows me the algorithm that can make sense of this. pic.twitter.com/IhzofigX2b

— Brooke Watson (@brookLYNevery1) January 19, 2018
+ + + +{lubridate} + + + + +### Tidy evaluation + +* programming with **dplyr** + +Edwin Thoen, 2017-08-25 [Tidy evaluation, most common actions](https://edwinth.github.io/blog/dplyr-recipes/) + + + +### Tidy text + +If you're going to undertake text mining and natural language processing, your text (i.e. your data) needs to be tidy. Fortunately, there's an R package for that: `tidytext`. + +See the companion chapter on the topics of [Text Analysis and Text Mining]. + + + -30- diff --git a/12_data_reading_fileformat.rmd b/21_data_reading_fileformat.rmd similarity index 79% rename from 12_data_reading_fileformat.rmd rename to 21_data_reading_fileformat.rmd index 0d8f009..e6d0fef 100644 --- a/12_data_reading_fileformat.rmd +++ b/21_data_reading_fileformat.rmd @@ -23,7 +23,7 @@ In particular, survey data [R database interfaces](http://www.burns-stat.com/r-database-interfaces/) -### **rio** +### {rio} - **package** @@ -33,7 +33,7 @@ vignette: [Import, Export, and Convert Data Files](https://cran.r-project.org/pa -### **googledrive** +### {googledrive} - **package** @@ -43,7 +43,7 @@ tidyverse page: [`googledrive`](https://tidyverse.github.io/googledrive/) -### **foreign** +### {foreign} - **package** @@ -54,17 +54,30 @@ CRAN page: [foreign: Read Data Stored by Minitab, S, SAS, SPSS, Stata, Systat, W * [How to open an SPSS file into R](http://www.milanor.net/blog/how-to-open-an-spss-file-into-r/), by Davide Massidda (2014-03-26) +### {haven} - -### Stata files +**package** + +**articles** + + +## PDF files + +Luis D. Verde, 2018-12-14, [Tidyeval meets PDF table hell](http://luisdva.github.io/rstats/Tidyeval-pdf-hell/) -- great solution to the common problem of broken rows ("values that are broken up into two lines for whatever reason (often to optimize space on a page in a table in a typeset pdf)"). -**`read.dta`** + + + +## Stata files + +### {read.dta} - Reads a file in Stata version 5–12 binary format into a data frame. 
CRAN page: [`read.dta`: Read Stata Binary Files](http://stat.ethz.ch/R-manual/R-devel/library/foreign/html/read.dta.html) -**readstata13** +### {readstata13} - Function to read and write the 'Stata' file format. @@ -72,9 +85,9 @@ CRAN Page: [readstata13: Import 'Stata' Data Files](readstata13: Import 'Stata' +## Time series database files - -### **TSdbi** and related packages +### {TSdbi} and related packages **package** @@ -90,3 +103,4 @@ Note: `TSdbi` has some related extension packages: * CRAN page: [TSsdmx: 'TSdbi' Extension to Connect with 'SDMX'](https://cran.r-project.org/package=TSsdmx) +-30- \ No newline at end of file