03_data_science_practice.Rmd

# Statistical & Data Science Practice {#practice}


```{r echo = FALSE}
library(knitr)
opts_chunk$set(message = FALSE, warning = FALSE, cache = TRUE)
options(width = 100, dplyr.width = 100)
library(ggplot2)
theme_set(theme_light())
```


First, an excellent list of data science resources:

* [Practical Data Science for Stats - a PeerJ Collection](https://peerj.com/collections/50-practicaldatascistats/)


## Introduction

How does one approach a statistics or data science project?

### A theory of data analysis

Roger Peng, 2018-12-11, ["The Role of Theory in Data Analysis"](https://simplystatistics.org/2018/12/11/the-role-of-theory-in-data-analysis/)


### The function of data science: solving business problems

Emily Robinson, 2017-09-27, [Managing Business Challenges In Data Science](https://robinsones.github.io/Managing-Business-Challenges-in-Data-Science/)


### Design thinking >> data science context

Roger Peng, 2019-01-09, [How Data Scientists Think - A Mini Case Study](https://simplystatistics.org/2019/01/09/how-data-scientists-think-a-mini-case-study/)


### Opinionated Analysis Development

An over-arching structure of what a project could (or should?) look like can be boiled down into three features: it is

1. Reproducible and auditable

2. Accurate

3. Collaborative


* Hilary Parker, 2017-08-30, [Opinionated Analysis Development](https://peerj.com/preprints/3210/)

  - some of Hilary Parker's earlier / supporting thoughts on this topic can be found in [her talk "Opinionated Analysis Development" from rstudio::conf2017 (2017-01-14)](https://www.rstudio.com/resources/videos/opinionated-analysis-development/) ([slides alone at slideshare](https://www.slideshare.net/hilaryparker/opinionated-analysis-development)), as well as the slides from her [keynote at EARL SF (2017-06-15)](https://www.slideshare.net/hilaryparker/opinionated-analysis-development-earl-sf-keynote)


### General practice and workflow

**Jenny Bryan on workflow:**

* Jenny Bryan, [Project-oriented workflow](https://www.tidyverse.org/articles/2017/12/workflow-vs-script/)

* Jenny Bryan, [Ode to the here package](https://github.com/jennybc/here_here)

* Wilson G, Bryan J, Cranston K, Kitzes J, Nederbragt L, Teal TK (2017) Good enough practices in scientific computing. PLoS Comput Biol 13(6): e1005510. https://doi.org/10.1371/journal.pcbi.1005510

* Jenny Bryan (2017) [Workflow: you should have one](https://speakerdeck.com/jennybc/workflow-you-should-have-one), Keynote talk at EARL London 2017.  

  - [Supporting documentation: earl-london-2017-bryan](https://github.com/jennybc/earl-london-2017-bryan#readme)


**other authors**

* Keiran Healy, [_The Plain Person's Guide to Plain Text Social Science_](https://kieranhealy.org/files/papers/plain-person-text.pdf) {pdf}

* Kass RE, Caffo BS, Davidian M, Meng X-L, Yu B, Reid N (2016) ["Ten Simple Rules for Effective Statistical Practice"](http://journals.plos.org/ploscompbiol/article?id=10.1371%2Fjournal.pcbi.1004961). PLoS Comput Biol 12(6): e1004961. doi:10.1371/journal.pcbi.1004961

* Ray Li (2016) ["7 habits of highly effective data analysis"](http://dataconomy.com/2016/02/7-habits-of-highly-effective-data-analysis/). [Dataconomy.com](Dataconomy.com), 2016-02-16.


* Noble, William Stafford (2009-07-31) [A Quick Guide to Organizing Computational Biology Projects](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000424),  _PLoS Comput Biol_ 5(7): e1000424. 

* Gabe Becker (2017) [Enhancing reproducibility, comparability and discoverability of results in multi-analyst settings](https://www.slideshare.net/GabrielBecker11/enhancing-reproducibility-comparability-and-discoverability-of-results-in-multianalyst-settings), presentation at EARL (Enterprise Applications of R Language), San Francisco, June 5-7, 2017

  -- this came to my attention via the Not So Standard Deviations podcast, [episode 40 "It's the CDs All Over Again"](https://www.patreon.com/posts/episode-40-its-11713845) (2017-06-13). The discussion of Gabe Becker's presentation begins at ~13' 10".
  
  - some of Hilary Parker's observations: the topic doesn't get much air time, his talk takes wide view of issues in an organization, trade-offs in collaborative environment (not one analyst); in multi-analyst system (e.g. where both a researcher/biologist and a statistician are both working with the same data) have to reconcile results, there might be parallel studies where results need to be reconciled, >> this creates a need for the data to be created in similar environments. 
  
  - Concerns: reproducibity, comparability,  discoverability (finding the results), and empowerment. Data scientists skew to emppowerment!  But the concerns are in tension--you can increase reproducibility but at cost of empowerment, etc.
  
  - "Most organizations aren't making that judgement call ahead of time...any organization ... if data scientists were there early, you're going to be skewed to agility and high empowerment. Because that's like what we want! ... Data scientists are allergic to process...having any template for results, or ... even having to put things into a specific tool."
  
  - Different systems need different constraints.
  
  - Roger Peng: most organizations don't realize that they are explicitly making these trade-offs, can't maximize all. Have to make a choice, and that's very unsatisfactory for people.


* Emily Robinson, [Red Flags in Data Science Interviews](http://hookedondata.org/Red-Flags-in-Data-Science-Interviews/)

<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">New blog post, co-written with <a href="https://twitter.com/skyetetra?ref_src=twsrc%5Etfw">@skyetetra</a>! 12 red flags to watch out for in data science interviews 🚩<a href="https://t.co/hM2E7I46Da">https://t.co/hM2E7I46Da</a> <a href="https://t.co/jFVA7mmjjU">pic.twitter.com/jFVA7mmjjU</a></p>&mdash; Emily Robinson (@robinson_es) <a href="https://twitter.com/robinson_es/status/1014164367292747778?ref_src=twsrc%5Etfw">July 3, 2018</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>


## Reproducible research

<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">This is the future. Show. Your. Damn. Work. <a href="https://t.co/4GWFdXSs17">https://t.co/4GWFdXSs17</a></p>&mdash; Chris Albon (@chrisalbon) <a href="https://twitter.com/chrisalbon/status/953078784084754433?ref_src=twsrc%5Etfw">January 16, 2018</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>


* [reproducibleresearch.net](http://reproducibleresearch.net/)

* Roger Peng (2014-06-06) [The Real Reason Reproducible Research is Important](https://simplystatistics.org/2014/06/06/the-real-reason-reproducible-research-is-important/)

* Roger Peng, 2015-06-15, ["The reproducibility crisis in science: A statistical counterattack"](http://onlinelibrary.wiley.com/doi/10.1111/j.1740-9713.2015.00827.x/full), [_Significance_](http://rss.onlinelibrary.wiley.com/hub/journal/10.1111/(ISSN)1740-9713/)

* Rich FitzJohn, Matt Pennell, Amy Zanne, Will Cornwell (2014-06-09) [Reproducible research is still a challenge](https://ropensci.org/blog/2014/06/09/reproducibility/) (at [rOpenSci](https://ropensci.org/))

* Melissa Assel, MS; Andrew J. Vickers, PhD (2018-02-06) ["Statistical Code for Clinical Research Papers in a High-Impact Specialist Medical Journal"](http://annals.org/aim/article-abstract/2671924/statistical-code-clinical-research-papers-high-impact-specialist-medical-journal), _Annals of Internal Medicine_

* V.	Orozco,	C.	Bontemps,	E.	Maigné,	V.	Piguet,	A.	Hofstetter,	A.	Lacroix,		
F.	Levert,	J.M.	Rousselle	(2018-07) ["How	To	Make	A	Pie:	Reproducible	Research	for	Empirical	
Economics	&	Econometrics"](https://www.tse-fr.eu/sites/default/files/TSE/documents/doc/wp/2018/wp_tse_933.pdf)

* Jeffrey M. Perkel, ["A toolkit for data transparency takes shape"](https://www.nature.com/articles/d41586-018-05990-5), _Nature_, 2018-08-20.

* Daniel Barron (2018-08-13) [How Freely Should Scientists Share Their Data?](https://blogs.scientificamerican.com/observations/how-freely-should-scientists-share-their-data/), _Scientific American_ blog


### Reproducible research with R

* Jeremy Anglin, [Reproducible analysis with knitr, R Markdown, and RStudio](https://github.com/jeromyanglim/rmarkdown-rmeetup-2012)
 
* Ben Marwick, 20 July 2017, [Reproducible Research Compendia via R packages](https://rawgit.com/benmarwick/Marwick-Berlin-R-users-2017/master/Marwick-Berlin-R-users-2017.html#1), presentation at Berlin R Users


### Reproducible data

* Greg Finak (2018-09-18) [Building Reproducible Data Packages with DataPackageR](https://ropensci.org/blog/2018/09/18/datapackager/)


### spreadsheets: the anti-reproducible research

<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">I read &quot;Data analysis without scripting&quot; as &quot;Dystopian moonscape of unrecorded user actions&quot;. I may not be Tableau&#39;s target market. <a href="https://twitter.com/hashtag/rstats?src=hash">#rstats</a></p>&mdash; Gordon Shotwell (@gshotwell) <a href="https://twitter.com/gshotwell/status/577485681146097664">March 16, 2015</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>


* Jenny Bryan's [spreadsheets](https://speakerdeck.com/jennybc/spreadsheets) talk given May & June 2016 reframes Shotwell as "Spreadsheets: a dystopian moonscape of unrecorded user actions."
  - live in-person (https://pbs.twimg.com/media/CmDykgRWAAE-_MP.jpg)
  
* Ignasi Bartomeus and F Rodriguez-Sanchez,  _Non-reproducible workflows: a horror movie_ :

<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">That awesome video on reproducibility with <a href="https://twitter.com/hashtag/rstats?src=hash&amp;ref_src=twsrc%5Etfw">#rstats</a> by <a href="https://twitter.com/ibartomeus?ref_src=twsrc%5Etfw">@ibartomeus</a> and <a href="https://twitter.com/frod_san?ref_src=twsrc%5Etfw">@frod_san</a> you can find here: <a href="https://t.co/indBflvupv">https://t.co/indBflvupv</a> <a href="https://t.co/QcdonwVTk8">https://t.co/QcdonwVTk8</a></p>&mdash; David Smith (@revodavid) <a href="https://twitter.com/revodavid/status/917779772385656832?ref_src=twsrc%5Etfw">October 10, 2017</a></blockquote>

    - with more at [Reproducibilidad](http://ecoinfaeet.github.io/2016/07/06/reproducibilidad/)


* Gordon Shotwell, 2017-02-02, [R for Excel Users](http://blog.shotwell.ca/post/r_for_excel_users/)

* Luis A. Apiolaza, 2017-11-11, [Reducing friction in R to avoid Excel](http://www.quantumforest.com/2017/11/reducing-friction-to-avoid-excel/)


## Collaboration

* Amit Bhattacharyya, 2017-11-01, [Become a Better Statistician by Actively Collaborating](http://magazine.amstat.org/blog/2017/11/01/collaborating/) (at [_Amstatnews_](http://magazine.amstat.org/))


* Peter Seibel, 2017-11-19, [Repo style wars: mono vs multi](http://www.gigamonkeys.com/mono-vs-multi/)


## Data Informed or Data Driven? Data Science at Work

Ricardo Bion, Robert Chang, and Jason Goodman (2017-08-23) [How R Helps Airbnb Make the Most of Its Data](https://peerj.com/preprints/3182.pdf)

[The Stitch Fix Algorithms Tour](http://algorithms-tour.stitchfix.com/)


Behavioural Insights Team (UK), 2017-12-14, [Using Data Science in Policy](http://www.behaviouralinsights.co.uk/publications/using-data-science-in-policy/)


## Agile practice

[Agile Scrum Guide](http://agile-guide.pathfinder.bcgov/), BC Government DevEx


## Data Quality & Context Compatibility

Or, do your data really mean what you hope they do?

Jacob Harris (2014) ["Distrust your data"](https://source.opennews.org/en-US/learning/distrust-your-data/), 2014-05-24.

Roger Peng (2018) [Context Compatibility in Data Analysis](https://simplystatistics.org/2018/05/24/context-compatibility-in-data-analysis/), 2018-05-30


## File storage and naming conventions

* [Sustainability of Digital Formats: Planning for Library of Congress Collections](http://www.digitalpreservation.gov/formats/index.shtml)

* Jenny Bryan, [naming things](http://www2.stat.duke.edu/~rcs46/lectures_2015/01-markdown-git/slides/naming-slides/naming-slides.pdf),
Reproducible Science Workshop, 2015


***

## Data practice

* Karl Broman, [data organization](http://kbroman.org/dataorg/)

* Karl Broman and Kara Woo, ["Data organization in spreadsheets"](http://www.tandfonline.com/doi/full/10.1080/00031305.2017.1375989), _The American Statistician_, 2017-09-29.


### Versioned data

* Daniel Falster, Richard G FitzJohn, Matthew W. Pennell, William K. Cornwell (2017-11-10) [Versioned data: why it is needed and how it can be achieved (easily and cheaply)](https://peerj.com/preprints/3401/)


***

## Coding practice

Hadley Wickham, [The tidyverse style guide](https://style.tidyverse.org/)

Hadley Wickham, ["Style Guide"](http://adv-r.had.co.nz/Style.html) chapter from [_Advanced R_](http://adv-r.had.co.nz/)

[Coding etiquette](https://ourcodingclub.github.io/2017/04/25/etiquette.html), a guide to writing clear, informative and easy-to-use #RStats code by @CodingClub 


Joel Lee, 2017-12=22, [The Weirdest Programming Principles You’ve Never Heard Of](https://www.makeuseof.com/tag/weird-programming-principles/)


### Version control

Jenny Bryan and the STAT 545 TAs, [_Happy Git and GitHub for the useR_](http://happygitwithr.com/)

John D. Blischak, Emily R. Davenport, Greg Wilson (2016-01-19) ["A Quick Introduction to Version Control with Git and GitHub"](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004668), _PLoS Computational Biology_.

### Documentation

<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">&quot;Writing documentation is all about making future you remember things that present you knows future you will forget&quot; -- <a href="https://twitter.com/data_stephanie?ref_src=twsrc%5Etfw">@data_stephanie</a> <a href="https://twitter.com/hashtag/rstats?src=hash&amp;ref_src=twsrc%5Etfw">#rstats</a> <a href="https://twitter.com/hashtag/Rladies?src=hash&amp;ref_src=twsrc%5Etfw">#Rladies</a></p>&mdash; R-Ladies Chicago (@RLadiesChicago) <a href="https://twitter.com/RLadiesChicago/status/963576859152744456?ref_src=twsrc%5Etfw">February 14, 2018</a></blockquote>


### Literate programming

The practice of explaining the program logic in a natural language"; it goes beyond what might be called "documentation". In the world of R, the RMarkdown functionality within [RStudio](https://www.rstudio.com/products/rstudio/), including [R notebooks](http://rmarkdown.rstudio.com/r_notebooks.html), is a way to program in this manner.

The concept was introduced by Donald Knuth in 1984; the original article is
> Knuth, Donald E. (1984). ["Literate Programming"](http://www.literateprogramming.com/knuthweb.pdf) (PDF). _The Computer Journal_. British Computer Society. 27 (2): 97–111. 

* [literateprogramming.com](http://www.literateprogramming.com/)

* [Wikipedia entry](https://en.wikipedia.org/wiki/Literate_programming)


### Functions in R

Colin Fay, [Playing with R, infix functions, and pizza ](http://colinfay.me/playing-r-infix-functions/)


### Naming variables

"There are only two hard things in Computer Science: cache invalidation and naming things."
-- Phil Karlton

Andy Lester, ["The World's Two Worst Variable Names"](http://archive.oreilly.com/pub/post/the_worlds_two_worst_variable.html)


### Clean coding 

Robert C. Martin, 2008, _Clean Code: A Handbook of Agile Software Development_, Prentice Hall.

Robert C. Martin, 2011, _The Clean Coder: A Code of Conduct for Professional Programmers_, Prentice Hall.


### Further reading

Aimee Gott, 2015-07-16, [Developing a R Validation Framework](https://www.baselr.org/wp-content/uploads/sites/4/presentations/BaselR_-_Formalising_R_Development_-_Aimee_Gott_-_20150716.pdf), BaselR

Dani Marillas, 2017-01-25, ["Don’t document your code. Code your documentation."](https://dev.to/raddikx/dont-document-your-code-code-your-documentation)


## General research practice

<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">I didn&#39;t think my &quot;start-up&quot; document for my grad students (the resources I wish *I* had when I started a PhD) would prove to be so popular, so I figured I would share them here. So, here&#39;s what I&#39;ve compiled so far. <a href="https://twitter.com/hashtag/hiddencurriculum?src=hash&amp;ref_src=twsrc%5Etfw">#hiddencurriculum</a> 1/12</p>&mdash; Matt Hauer (@thehauer) <a href="https://twitter.com/thehauer/status/1021179403680862218?ref_src=twsrc%5Etfw">July 22, 2018</a></blockquote>


***

### things that are no doubt useful and/or interesting but don't really fit anywhere in the existing typology

Louisa Smith , [epi quals study calendar](https://docs.google.com/spreadsheets/d/1HyyfGZhAzVtyJ-tcuBWxKbmsaJ3uMkuXyfzAUYsfemw/edit#gid=0)

* https://twitter.com/louisahsmith/status/1081955868864901120


***

## The tidyverse

All too often, data are messy. There are rows with no contents, colour-coded cells, and inconsistent values.

One important way that data can be cleaned is to ensure that the structure is tidy. What do we mean by tidy data?

> There are three interrelated rules which make a dataset tidy:
> * Each variable must have its own column.
> * Each observation must have its own row.
> * Each value must have its own cell.

And 

> Why ensure that your data is tidy? There are two main advantages:
> 
> 1. There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity.
>
> 2. There’s a specific advantage to placing variables in columns because it allows R’s vectorised nature to shine. As you learned in mutate and summary functions, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.

(from Hadley Wickham & Garrett Grolemund, [_R for Data Science_](http://r4ds.had.co.nz/))

This won't solve things like inconsistent values and colour-coded cells, but it will solve some other messiness.

For more about the principles of tidy data, see:

* Hadley Wickham, ["Tidy data", _The Journal of Statistical Software_, vol. 59, 2014.](https://www.jstatsoft.org/article/view/v059i10)
  + [alternate link:](http://vita.had.co.nz/papers/tidy-data.html)
  + [informal and code-heavy version](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html)
 

### Other references

Karl Broman and Kara Woo, ["Data organization in spreadsheets"](https://github.com/kbroman/Paper_DataOrg) (github page with source manuscript) -- application of tidy principles to spreadsheets.

* see also Karl Broman's tutorial, ["Data organization: organizing data in spreadsheets)


Bruno Rodriguez, [Modern R with the tidyverse](https://b-rodrigues.github.io/modern_R/)


Jesse Sadler, [Excel vs R: A Brief Introduction to R  (With examples using dplyr and ggplot](http://kbroman.org/dataorg/](https://www.jessesadler.com/post/excel-vs-r/) (2017-10-02)


## Tidy Tools

[The tidyverse style guide](http://style.tidyverse.org/functions.html)


### tidyverse R packages

[The tidyverse: ](http://tidyverse.org/)

[The tidyverse R packages on github](https://github.com/hadley/tidyverse)

#### `broom`


#### `dplyr`

`dplyr` now gets its own page, labelled [**Data Wrangling**](DataWrangling.md)


#### `purrr`

* [A purrr tutorial](https://github.com/Cascadia-R/purrr-tutorial) -- Cascadia-R, 2017-06-03

* Charlotte Wickham, [purr tutorial](https://github.com/cwickham/purrr-tutorial) -- github

### more about tidy data

* Hadley Wickham & Garrett Grolemund, [_R for Data Science_](http://r4ds.had.co.nz/)

* Hadley Wickham
  + [Tidy data and tidy tools (video of presentation, December 2011)](https://vimeo.com/33727555)

* Garrett Grolemund
  + [Data Tidying](http://garrettgman.github.io/tidying/) (part of [Data Science with R](http://garrettgman.github.io/))
  
* Chester Ismay and Ted Laderas, [A gRadual-intRoduction to the tidyverse](https://github.com/Cascadia-R/gRadual-intRoduction-tidyverse?utm_content=buffer98896&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer)  
  
  
***

## Categorical data

Emily Robinson, DataCamp course, [Categorical Data in the Tidyverse](https://www.datacamp.com/courses/categorical-data-in-the-tidyverse)

***

## Tidy Text

If  you're going to undertake text mining and natural language processing, your text (i.e. your data) needs to be tidy.  Fortunately, there's an R package for that: `tidytext`.

* Julia Silge, [Term Frequency and tf-idf Using Tidy Data Principles](http://juliasilge.com/blog/Term-Frequency-tf-idf/), 2016-06-27

(See the companion page on the topics of [text analysis and text mining](https://github.com/MonkmanMH/DataScienceResources/blob/master/TextAnalysis.md)).


-30-