wrangling Data Wrangling (what have I done?!)
Monkman authored and Monkman committed Jul 18, 2019
1 parent 8f78d77 commit 20a31a2
Showing 4 changed files with 144 additions and 134 deletions.
14 changes: 10 additions & 4 deletions 03_data_science_practice.Rmd
@@ -78,7 +78,7 @@ Ray Li (2016) ["7 habits of highly effective data analysis"](http://dataconomy.c
Noble, William Stafford (2009-07-31) [A Quick Guide to Organizing Computational Biology Projects](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000424), _PLoS Comput Biol_ 5(7): e1000424.


### Version Control with Git & GitHub
## Version Control with Git & GitHub

Jenny Bryan, the STAT 545 TAs, and Jim Hester, [Happy Git and GitHub for the useR](https://happygitwithr.com/)

@@ -112,9 +112,9 @@ Emily Robinson, [Red Flags in Data Science Interviews](http://hookedondata.org/R
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>


### R packages supporting robust workflow
## R packages supporting robust workflow

#### {janitor}
### {janitor} -

[{janitor}](https://sfirke.github.io/janitor/index.html) -- "has simple functions for examining and cleaning dirty data. It was built with beginning and intermediate R users in mind and is optimized for user-friendliness. Advanced R users can already do everything covered here, but with janitor they can do it faster and save their thinking for the fun stuff."
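
To make "cleaning dirty data" concrete, here is a minimal base-R sketch of one common chore: turning awkward column names into consistent snake_case. The data frame and the `to_snake_case()` helper are hypothetical, made up for illustration. This is not how {janitor} is implemented, and the package's own `clean_names()` covers far more cases than this sketch does.

```r
# Hypothetical data frame with the kind of column names spreadsheets produce.
messy <- data.frame(
  `First Name` = c("Ada", "Grace"),
  `% Complete` = c(90, 100),
  `ID#`        = 1:2,
  check.names = FALSE
)

# Illustrative helper (an assumption, not a {janitor} function).
to_snake_case <- function(x) {
  x <- tolower(x)
  x <- gsub("%", "percent", x, fixed = TRUE)  # spell out the percent sign
  x <- gsub("[^a-z0-9]+", "_", x)             # other character runs become "_"
  gsub("^_+|_+$", "", x)                      # trim leading/trailing underscores
}

names(messy) <- to_snake_case(names(messy))
print(names(messy))
```

Consistent names like these are easier to type, and they survive a round trip through most file formats.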

@@ -123,7 +123,7 @@ CRAN: [janitor: Simple Tools for Examining and Cleaning Dirty Data](https://cran
GitHub: [sfirke/janitor](https://github.com/sfirke/janitor)


#### {packrat}
### {packrat} -

[{Packrat} is a dependency management system for R](http://rstudio.github.io/packrat/)

@@ -137,6 +137,12 @@ GitHub: [rstudio/packrat](https://github.com/rstudio/packrat)
Miles McBain (2019-04-09) [A workflow for lightweight R dependency management](https://milesmcbain.xyz/packrat-lite/)


### {usethis} -

"usethis is a workflow package: it automates repetitive tasks that arise during project setup and development, both for R packages and non-package projects"

[usethis 1.5.0](https://www.tidyverse.org/articles/2019/04/usethis-1.5.0/) (April 2019)



***
107 changes: 0 additions & 107 deletions 04_using_R.Rmd
@@ -67,113 +67,6 @@ Jennifer Bryan and Jim Hester, ["Debugging R code"](https://whattheyforgot.org/d



## Data wrangling

(See {The tidyverse}, below)

#### {datapasta}

Vignette: [How to Datapasta](https://cran.r-project.org/web/packages/datapasta/vignettes/how-to-datapasta.html)

***

## The tidyverse

All too often, data are messy. There are rows with no contents, colour-coded cells, and inconsistent values.

One important way that data can be cleaned is to ensure that the structure is tidy. What do we mean by tidy data?

> There are three interrelated rules which make a dataset tidy:
>
> * Each variable must have its own column.
> * Each observation must have its own row.
> * Each value must have its own cell.

And

> Why ensure that your data is tidy? There are two main advantages:
>
> 1. There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity.
>
> 2. There’s a specific advantage to placing variables in columns because it allows R’s vectorised nature to shine. As you learned in mutate and summary functions, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.

(from [@Wickham_Grolemund2016])


This won't solve things like inconsistent values and colour-coded cells, but it will solve some other messiness.

For more about the principles of tidy data, see Hadley Wickham's article "Tidy data", in _The Journal of Statistical Software_ [@tidydata]

+ [alternate link](http://vita.had.co.nz/papers/tidy-data.html)
+ [informal and code-heavy version](https://cran.r-project.org/package=tidyr/vignettes/tidy-data.html)


### Other tidyverse references

Karl Broman and Kara Woo, ["Data organization in spreadsheets"](https://github.com/kbroman/Paper_DataOrg) (github page with source manuscript) -- application of tidy principles to spreadsheets.

* see also Karl Broman's tutorial, ["Data organization: organizing data in spreadsheets"](http://kbroman.org/dataorg/)


Bruno Rodrigues, [Modern R with the tidyverse](https://b-rodrigues.github.io/modern_R/)


Jesse Sadler, [Excel vs R: A Brief Introduction to R (With examples using dplyr and ggplot)](https://www.jessesadler.com/post/excel-vs-r/) (2017-10-02)


#### Categorical data

Emily Robinson, _Categorical data in the tidyverse_ {link to DataCamp course removed}


#### Tidy text

If you're going to undertake text mining and natural language processing, your text (i.e. your data) needs to be tidy. Fortunately, there's an R package for that: `tidytext`.

See the companion chapter on the topics of [Text Analysis and Text Mining].




### tidyverse R packages

[The tidyverse](http://tidyverse.org/)

[The tidyverse R packages on github](https://github.com/hadley/tidyverse)

#### {broom}


#### {dplyr}

`dplyr` now gets its own page, labelled [**Data Wrangling**](DataWrangling.md)


#### {purrr}

* [A purrr tutorial](https://github.com/Cascadia-R/purrr-tutorial) -- Cascadia-R, 2017-06-03

* Charlotte Wickham, [purrr tutorial](https://github.com/cwickham/purrr-tutorial) -- github


##### {usethis}

[usethis 1.5.0](https://www.tidyverse.org/articles/2019/04/usethis-1.5.0/) (April 2019)


***

### More about tidy data

* Hadley Wickham & Garrett Grolemund, [_R for Data Science_](http://r4ds.had.co.nz/)

* Hadley Wickham
+ [Tidy data and tidy tools (video of presentation, December 2011)](https://vimeo.com/33727555)

* Garrett Grolemund
+ [Data Tidying](http://garrettgman.github.io/tidying/) (part of [Data Science with R](http://garrettgman.github.io/))

* Chester Ismay and Ted Laderas, [A gRadual-intRoduction to the tidyverse](https://github.com/Cascadia-R/gRadual-intRoduction-tidyverse?utm_content=buffer98896&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer)



***
127 changes: 112 additions & 15 deletions 20_data_wrangling.Rmd
@@ -1,40 +1,83 @@
# Data Wrangling (emphasis on **dplyr**) {#datawrangling}

# Data Wrangling (emphasis on tidy data) {#datawrangling}


## Introduction

Data are rarely in a condition to be used as-is; there's invariably something amiss. Data wrangling (a.k.a. data carpentry) is the process of getting them ready for analysis.

And all too often, data are messy. There are rows with no contents, colour-coded cells, and inconsistent values.

One important way that data can be cleaned is to ensure that the structure is tidy. What do we mean by tidy data?

> There are three interrelated rules which make a dataset tidy:
>
> * Each variable must have its own column.
> * Each observation must have its own row.
> * Each value must have its own cell.

And

> Why ensure that your data is tidy? There are two main advantages:
>
> 1. There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity.
>
> 2. There’s a specific advantage to placing variables in columns because it allows R’s vectorised nature to shine. As you learned in mutate and summary functions, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.

(from [@Wickham_Grolemund2016])


This won't solve things like inconsistent values and colour-coded cells, but it will solve some other messiness.
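
The three rules are easier to see on a small example. The sketch below uses a made-up table (the `wide` data frame and its values are hypothetical) in which the year, a variable, is hidden in the column names, and reshapes it with base R so that each row is one country-year observation. In practice the reshaping functions in {tidyr} do this job; base R is used here only so the example runs without extra packages.

```r
# Hypothetical "wide" table: one row per country, one column per year.
# The variable "year" is stored in the column names, which breaks rule 1.
wide <- data.frame(
  country = c("A", "B"),
  `1999`  = c(745, 37737),
  `2000`  = c(2666, 80488),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

year_cols <- setdiff(names(wide), "country")

# Stack the year columns: one row per observation, one column per variable.
tidy <- data.frame(
  country = rep(wide$country, times = length(year_cols)),
  year    = rep(as.integer(year_cols), each = nrow(wide)),
  cases   = unlist(wide[year_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

print(tidy)

# With variables in columns, vectorised operations apply directly.
aggregate(cases ~ year, data = tidy, FUN = sum)
```

The last line illustrates the second advantage quoted above: once `year` and `cases` are ordinary columns, summarising by year is a single vectorised call.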

For more about the principles of tidy data, see Hadley Wickham's article "Tidy data", in _The Journal of Statistical Software_ [@tidydata]

+ [alternate link](http://vita.had.co.nz/papers/tidy-data.html)
+ [informal and code-heavy version](https://cran.r-project.org/package=tidyr/vignettes/tidy-data.html)


### Other tidyverse references

Karl Broman and Kara Woo, ["Data organization in spreadsheets"](https://github.com/kbroman/Paper_DataOrg) (github page with source manuscript) -- application of tidy principles to spreadsheets.

* see also Karl Broman's tutorial, ["Data organization: organizing data in spreadsheets"](http://kbroman.org/dataorg/)


Bruno Rodrigues, [Modern R with the tidyverse](https://b-rodrigues.github.io/modern_R/)


Jesse Sadler, [Excel vs R: A Brief Introduction to R (With examples using dplyr and ggplot)](https://www.jessesadler.com/post/excel-vs-r/) (2017-10-02)




## Theory and methods


[Stat 545: Data wrangling, exploration, and analysis with R](http://stat545.com/index.html) -- course materials associated with the University of British Columbia's Statistics 545 course. Prepared in large part by Dr. Jenny Bryan.


### Tidy evaluation

* programming with **dplyr**
***

Edwin Thoen, 2017-08-25 [Tidy evaluation, most common actions](https://edwinth.github.io/blog/dplyr-recipes/)
## Tools

### Reading messy files

Luis D. Verde, 2018-12-14, [Tidyeval meets PDF table hell](http://luisdva.github.io/rstats/Tidyeval-pdf-hell/) -- great solution to the common problem of broken rows ("values that are broken up into two lines for whatever reason (often to optimize space on a page in a table in a typeset pdf)").
### {datapasta} -

Vignette: [How to Datapasta](https://cran.r-project.org/web/packages/datapasta/vignettes/how-to-datapasta.html)

### Working with dates

<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Updated Turing Test concept:<br>A spreadsheet of dates, hand-entered by interns more than a decade ago, featuring such well-known time formats as &quot;1996ish&quot;, &quot;1941/xd01944&quot;, &quot;1955?&quot; and &quot;WWII.&quot;<br>I&#39;m not worried about AI until someone shows me the algorithm that can make sense of this. <a href="https://t.co/IhzofigX2b">pic.twitter.com/IhzofigX2b</a></p>&mdash; Brooke Watson (/@/brookLYNevery1) <a href="https://twitter.com/brookLYNevery1/status/954368989181902848?ref_src=twsrc%5Etfw">January 19, 2018</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

### {janitor} -

***

## The tidyverse

[The tidyverse](http://tidyverse.org/)

## R
[The tidyverse R packages on github](https://github.com/hadley/tidyverse)

Arranged by package

### **dplyr**
### {dplyr} -

**package**

@@ -51,7 +94,8 @@ github: [hadley/dplyr](https://github.com/hadley/dplyr)



### **forcats**

### {forcats} -

[reference page](https://forcats.tidyverse.org/)

@@ -60,9 +104,12 @@ Working with factors
[Be the boss of your factors](https://stat545.com/block029_factors.html#change-order-of-the-levels-because-i-said-so)


Emily Robinson, _Categorical data in the tidyverse_ {link to DataCamp course removed}



### **purrr**
### {purrr} -


[reference page](https://purrr.tidyverse.org/)

@@ -84,4 +131,54 @@ Emorie D Beck, [Intro to purrr](https://emoriebeck.github.io/R-tutorials/purrr/)
Sharon Machlis, [R Tip: Access nested list items with purrr](https://www.infoworld.com/video/90327/r-tip-access-nested-list-items-with-purrr) {video}


[A purrr tutorial](https://github.com/Cascadia-R/purrr-tutorial) -- Cascadia-R, 2017-06-03

Charlotte Wickham, [purrr tutorial](https://github.com/cwickham/purrr-tutorial) -- github



***

### More about tidy data

* Hadley Wickham & Garrett Grolemund, [_R for Data Science_](http://r4ds.had.co.nz/)

* Hadley Wickham
+ [Tidy data and tidy tools (video of presentation, December 2011)](https://vimeo.com/33727555)

* Garrett Grolemund
+ [Data Tidying](http://garrettgman.github.io/tidying/) (part of [Data Science with R](http://garrettgman.github.io/))

* Chester Ismay and Ted Laderas, [A gRadual-intRoduction to the tidyverse](https://github.com/Cascadia-R/gRadual-intRoduction-tidyverse?utm_content=buffer98896&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer)




## Working with dates

<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Updated Turing Test concept:<br>A spreadsheet of dates, hand-entered by interns more than a decade ago, featuring such well-known time formats as &quot;1996ish&quot;, &quot;1941/xd01944&quot;, &quot;1955?&quot; and &quot;WWII.&quot;<br>I&#39;m not worried about AI until someone shows me the algorithm that can make sense of this. <a href="https://t.co/IhzofigX2b">pic.twitter.com/IhzofigX2b</a></p>&mdash; Brooke Watson (/@/brookLYNevery1) <a href="https://twitter.com/brookLYNevery1/status/954368989181902848?ref_src=twsrc%5Etfw">January 19, 2018</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>


{lubridate}
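
As a rough illustration of the problem in the tweet above, the base-R sketch below tries a short list of candidate formats in turn and flags whatever is left for a human to review. The input values, the format list, and the `parse_messy_date()` helper are all hypothetical. {lubridate} provides more forgiving parsers for real work; nothing here reflects how that package is implemented.

```r
# Hypothetical hand-entered values, including some that no format will match.
raw <- c("2018-01-19", "19/01/2018", "1996ish", "1955?", "WWII")

# Candidate formats, tried in order; extend the list to suit the data at hand.
formats <- c("%Y-%m-%d", "%d/%m/%Y")

# Illustrative helper: fill in each value with the first format that parses it.
parse_messy_date <- function(x, formats) {
  out <- rep(as.Date(NA), length(x))
  for (fmt in formats) {
    todo <- is.na(out)                        # only retry what is still missing
    out[todo] <- as.Date(x[todo], format = fmt)
  }
  out
}

dates <- parse_messy_date(raw, formats)

# Anything still NA needs a person, not an algorithm.
print(data.frame(raw, dates, needs_review = is.na(dates)))
```

Keeping the raw value next to the parsed one makes it easy to audit what the parser did and to fix the leftovers by hand.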




### Tidy evaluation

* programming with **dplyr**

Edwin Thoen, 2017-08-25 [Tidy evaluation, most common actions](https://edwinth.github.io/blog/dplyr-recipes/)



### Tidy text

If you're going to undertake text mining and natural language processing, your text (i.e. your data) needs to be tidy. Fortunately, there's an R package for that: `tidytext`.

See the companion chapter on the topics of [Text Analysis and Text Mining].
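
For text, "tidy" usually means one token per row. The sketch below builds that structure with base R on two made-up lines of text (the `docs` data frame is hypothetical). The {tidytext} package's `unnest_tokens()` is the usual tool for this and handles punctuation and other cases that this simplified split ignores.

```r
# Hypothetical input: one row per line of text.
docs <- data.frame(
  line = 1:2,
  text = c("Data are rarely tidy", "Tidy text is one token per row"),
  stringsAsFactors = FALSE
)

# Lower-case each line and split it on runs of non-letter characters.
tokens <- strsplit(tolower(docs$text), "[^a-z]+")

# One row per word, keeping track of the line each word came from.
tidy_text <- data.frame(
  line = rep(docs$line, times = lengths(tokens)),
  word = unlist(tokens),
  stringsAsFactors = FALSE
)

# Word counts fall out of the tidy structure directly.
word_counts <- sort(table(tidy_text$word), decreasing = TRUE)
print(head(word_counts, 3))
```

Once the text is in this shape, counting, filtering, and joining against word lists work the same way as on any other tidy table.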



-30-