Cleaner ETL #1

Open
wants to merge 4 commits into
base: master
5 changes: 5 additions & 0 deletions .gitignore
@@ -0,0 +1,5 @@
.Rproj.user
.Rhistory
.RData
.Ruserdata
*.Rproj
278 changes: 146 additions & 132 deletions Shootings_html_version.Rmd
@@ -1,7 +1,9 @@
---
title: "US Mass Shootings 1982 to 2019"
date: 15/04/2019
author: Nic Fox
date: 06/06/2019
author:
- Nic Fox
- Mark Goble (minor additions)
output:
html_document:
toc: true # table of content true
@@ -11,25 +13,68 @@ output:
css: "Shootings.css"
---

Updated to use the "marks little helper" package, including the refactorer function.

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r import_data, include=FALSE}
library(tidyverse)
library(operators)
library(magrittr)
library(dplyr)
library(knitr)
library(sf)
library(usmap)
library(waffle)
library(treemapify)
library(gganimate)
library(tweenr)
library(gifski)
library(png)
library(viridis)
# MG: Updated package list as things like dplyr are loaded with tidyverse
# not sure you explicitly need magrittr as the common pipe functions are included as part of dplyr etc.
# also you might not need knitr?
# MG: Also took the liberty of using a package helper function which will install and load any missing packages and check that all packages are loaded before proceeding.

# this version requires pacman to load / install packages

pacman::p_load(tidyverse, operators, knitr, sf, usmap, waffle, treemapify, gganimate,
tweenr, gifski, png, viridis, rlang, lubridate)



```

```{r}
refactorer <- function(data_frame,
                       column,
                       level_vector,
                       label_vector,
                       other_label = "Unknown"){

  checkmate::assert_data_frame(data_frame)
  checkmate::assert_string(column)
  checkmate::assert_vector(level_vector)
  # check the two vectors are of the same length
  level_vector_len <- length(level_vector)
  checkmate::assert_vector(label_vector, len = level_vector_len)
  checkmate::assert_string(other_label)

  if(!column %in% colnames(data_frame)){
    stop(paste(column, "is not a column in the data frame"))
  }


  level_vector <- tolower(level_vector)
  column_sym <- rlang::sym(column)
  # note the special syntax i.e. !!column_sym := ...
  # https://www.tidyverse.org/articles/2018/03/rlang-0.2.0/
  data_frame %>%
    dplyr::mutate(!!column_sym := trimws(!!column_sym)) %>% # get rid of whitespace...
    dplyr::mutate(!!column_sym := tolower(!!column_sym)) %>% # change to lower case for better comparisons..
    dplyr::mutate(!!column_sym := factor(!!column_sym,
                                         levels = level_vector,
                                         labels = label_vector)) %>%
    dplyr::mutate(!!column_sym := forcats::fct_explicit_na(!!column_sym,
                                                           na_level = other_label)) # use forcats to turn unmatched values (NA) into an explicit level...

}

```
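
The core idea behind `refactorer` is mapping messy raw values onto a small set of labels and sending everything unmatched to a catch-all level. The sketch below illustrates that idea in base R on a made-up vector; `raw_gender` and its values are illustrative assumptions rather than rows from the shootings data, and the NA handling uses plain indexing instead of `forcats`, so it is not a drop-in replacement for the function above.

```r
# Hypothetical raw values, including stray whitespace, mixed case and junk
raw_gender <- c(" M", "male", "F ", "Female", "Male & Female", "-", NA)

# Levels are compared in lower case, as in refactorer()
lvls <- tolower(c("M", "Male", "F", "Female", "Male & Female"))
lbls <- c("Male", "Male", "Female", "Female", "Male & Female")

# Duplicated labels collapse several raw spellings into one level (needs R >= 3.4.0);
# values that match no level become NA
cleaned <- factor(tolower(trimws(raw_gender)), levels = lvls, labels = lbls)

# Replace the NAs with an explicit "Unknown" level
cleaned_chr <- as.character(cleaned)
cleaned_chr[is.na(cleaned_chr)] <- "Unknown"
cleaned <- factor(cleaned_chr, levels = c(unique(lbls), "Unknown"))

# Quick tabulation for sanity checking
table(cleaned)
```

Calling `table()` on the result is a cheap way to confirm that no unexpected spellings ended up in "Unknown".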



```{r}
# MG: Split out the data load from package load - saves having to reload all the packages if you debug a faulty load...

# Import the shootings data from the CSV file and save it as a data frame called imported_data
imported_data <- read.csv("ShootingsData.csv", strip.white = TRUE)
@@ -39,144 +84,110 @@ cleansed_data <- imported_data

# Create the colour palette for the data visualisations using the magma colour palette from the viridis package
colour_palette <- magma(101)

```


```{r cleanse_data, include=FALSE}

# Create new column containing 1 for each row, to use when showing number of shootings per grouping
cleansed_data$for_count <- 1

cleansed_data <- cleansed_data %>%
mutate(for_count = 1)
# Update the gender column so that there is only one version of "Female" etc
cleansed_data$gender = case_when(
cleansed_data$gender == "M" ~ "Male",
cleansed_data$gender == "Male" ~ "Male",
cleansed_data$gender == "F" ~ "Female",
cleansed_data$gender == "Female" ~ "Female",
cleansed_data$gender == "Male & Female" ~ "Male & Female"
)

# MG using factor labels to deal with the different naming...
# MG also extracting the process of sorting out the factors into a function

# MG note don't need two levels for M and m for example as taken care of in function.
lvls_gender <- c("M", "Male", "F", "Female", "Male & Female")
lbls_gender <- c("Male", "Male", "Female", "Female", "Male & Female")

cleansed_data <- refactorer(cleansed_data, "gender", lvls_gender, lbls_gender, "Unknown")

rm(lvls_gender, lbls_gender)

#MG: sanity check - check all levels accounted for - should be 0
# NOT RUN
# sum(is.na(cleansed_data$gender))

# Update the race column so that there is only one version of Unknown, Black, White etc
cleansed_data$race = case_when(
cleansed_data$race == "Other" ~ "Unknown",
cleansed_data$race == "-" ~ "Unknown",
cleansed_data$race == "unclear" ~ "Unknown",
cleansed_data$race == "" ~ "Unknown",
cleansed_data$race == "black" ~ "Black",
cleansed_data$race == "Black" ~ "Black",
cleansed_data$race == "Latino" ~ "Latino",
cleansed_data$race == "Native American" ~ "Native American",
cleansed_data$race == "white" ~ "White",
cleansed_data$race == "White" ~ "White",
cleansed_data$race == "White " ~ "White"
)

# For rows where race is "Asian" the approach in the code above is not recognising race as "Asian" for some reason and so its saving race as "Unknown". So I'm setting the race for those based on the case description.
cleansed_data[cleansed_data$case == "Yountville veterans home shooting", "race"] <- "Asian"
cleansed_data[cleansed_data$case == "San Francisco UPS shooting", "race"] <- "Asian"
cleansed_data[cleansed_data$case == "Oikos University killings", "race"] <- "Asian"
cleansed_data[cleansed_data$case == "Su Jung Health Sauna shooting", "race"] <- "Asian"
cleansed_data[cleansed_data$case == "Binghamton shootings", "race"] <- "Asian"
cleansed_data[cleansed_data$case == "Virginia Tech massacre", "race"] <- "Asian"
cleansed_data[cleansed_data$case == "Xerox killings", "race"] <- "Asian"
cleansed_data[cleansed_data$case == "University of Iowa shooting", "race"] <- "Asian"
lvls_race <- c("Black", "Latino", "Native American", "White", "Asian")
lbls_race <- c("Black", "Latino", "Native American", "White", "Asian")

# For rows where race is NA, set it to Unknown
cleansed_data[is.na(cleansed_data$race), "race"] <- "Unknown"
cleansed_data <- refactorer(cleansed_data, "race", lvls_race, lbls_race, "Unknown")

rm(lvls_race, lbls_race)

#MG: sanity check - check all levels accounted for - should be 0
# NOT RUN
# sum(is.na(cleansed_data$race))
# MG might also want to consider calling table() to see the results for sanity checking...
#MG removed the handling code for Asian / residual NA as handled in the above factor code...

# Update the "Weapons Obtained Legally" column so that there is only one version of "Yes" etc
cleansed_data$weapons_obtained_legally = case_when(
cleansed_data$weapons_obtained_legally == "No" ~ "No",
cleansed_data$weapons_obtained_legally == "TBD" ~ "Unknown",
cleansed_data$weapons_obtained_legally == "-" ~ "Unknown",
cleansed_data$weapons_obtained_legally == "Unknown" ~ "Unknown",
cleansed_data$weapons_obtained_legally == "Kelley passed federal criminal background checks; the US Air Force failed to provide information on his criminal history to the FBI" ~ "Unknown",
cleansed_data$weapons_obtained_legally == "Yes" ~ "Yes",
cleansed_data$weapons_obtained_legally == "\nYes" ~ "Yes",
cleansed_data$weapons_obtained_legally == "Yes " ~ "Yes",
cleansed_data$weapons_obtained_legally %!in% c("Yes", "\nYes", "No", "Unknown", "-", "TBD") ~ "Unknown"
)

# Set weapons_obtained_legally to "Yes" for the Chattanooga military recruitment center shooting. The value in the imported data set contains quotes and I haven't figured out how to get the match condition to work with that value yet so this is a temporary workaround.
cleansed_data[cleansed_data$case == "Chattanooga military recruitment center", "weapons_obtained_legally"] <- "Yes"

#MG moved to before the other processing so it's clean before going into the function; also converted to tidyverse...

cleansed_data <- cleansed_data %>%
  mutate(weapons_obtained_legally =
           ifelse(case == "Chattanooga military recruitment center",
                  "Yes",
                  as.character(weapons_obtained_legally))) # as.character() stops ifelse() returning integer codes if read.csv() imported the column as a factor


lvls_weapons <- c("No","Yes")
lbls_weapons <- c("No","Yes")

cleansed_data <- refactorer(cleansed_data, "weapons_obtained_legally", lvls_weapons, lbls_weapons, "Unknown")

rm(lvls_weapons, lbls_weapons)

# Create a new column containing the text after the comma in the location column
cleansed_data$state <- sub('.*\\,', '', cleansed_data$location)

# Strip leading white spaces from state
cleansed_data$state <- sub("^\\s+", "", cleansed_data$state)

# Replace state names with their codes e.g. California becomes CA
cleansed_data$state <- ifelse(cleansed_data$state == "California", "CA", cleansed_data$state)
cleansed_data$state <- ifelse(cleansed_data$state == "Pennsylvania", "PA", cleansed_data$state)
cleansed_data$state <- ifelse(cleansed_data$state == "Florida", "FL", cleansed_data$state)
cleansed_data$state <- ifelse(cleansed_data$state == "Ohio", "OH", cleansed_data$state)
cleansed_data$state <- ifelse(cleansed_data$state == "Washington", "WA", cleansed_data$state)
cleansed_data$state <- ifelse(cleansed_data$state == "Lousiana", "LA", cleansed_data$state)
cleansed_data$state <- ifelse(cleansed_data$state == "Texas", "TX", cleansed_data$state)
cleansed_data$state <- ifelse(cleansed_data$state == "Kansas", "KS", cleansed_data$state)
cleansed_data$state <- ifelse(cleansed_data$state == "Michigan", "MI", cleansed_data$state)
cleansed_data$state <- ifelse(cleansed_data$state == "Colorado", "CO", cleansed_data$state)
cleansed_data$state <- ifelse(cleansed_data$state == "Oregon", "OR", cleansed_data$state)
cleansed_data$state <- ifelse(cleansed_data$state == "Tennessee", "TN", cleansed_data$state)
cleansed_data$state <- ifelse(cleansed_data$state == "South Carolina", "SC", cleansed_data$state)
cleansed_data$state <- ifelse(cleansed_data$state == "Wisconsin", "WI", cleansed_data$state)
cleansed_data$state <- ifelse(cleansed_data$state == "New York", "NY", cleansed_data$state)
cleansed_data$state <- ifelse(cleansed_data$state == "Connecticut", "CT", cleansed_data$state)
cleansed_data$state <- ifelse(cleansed_data$state == "Minnesota", "MN", cleansed_data$state)
cleansed_data$state <- ifelse(cleansed_data$state == "Georgia", "GA", cleansed_data$state)
cleansed_data$state <- ifelse(cleansed_data$state == "Nevada", "NV", cleansed_data$state)
cleansed_data$state <- ifelse(cleansed_data$state == "Arizona", "AZ", cleansed_data$state)
cleansed_data$state <- ifelse(cleansed_data$state == "North Carolina", "NC", cleansed_data$state)
cleansed_data$state <- ifelse(cleansed_data$state == "Kentucky", "KY", cleansed_data$state)
cleansed_data$state <- ifelse(cleansed_data$state == "Illinois", "IL", cleansed_data$state)
cleansed_data$state <- ifelse(cleansed_data$state == "Missouri", "MO", cleansed_data$state)
cleansed_data$state <- ifelse(cleansed_data$state == "Nebraska", "NE", cleansed_data$state)
cleansed_data$state <- ifelse(cleansed_data$state == "Virginia", "VA", cleansed_data$state)
cleansed_data$state <- ifelse(cleansed_data$state == "Utah", "UT", cleansed_data$state)
cleansed_data$state <- ifelse(cleansed_data$state == "Mississippi", "MS", cleansed_data$state)
cleansed_data$state <- ifelse(cleansed_data$state == "Massachusetts", "MA", cleansed_data$state)
cleansed_data$state <- ifelse(cleansed_data$state == "Hawaii", "HI", cleansed_data$state)
cleansed_data$state <- ifelse(cleansed_data$state == "Arkansas", "AR", cleansed_data$state)
cleansed_data$state <- ifelse(cleansed_data$state == "Iowa", "IA", cleansed_data$state)
cleansed_data$state <- ifelse(cleansed_data$state == "Oklahoma", "OK", cleansed_data$state)

# Strip leading white spaces from location.1 so that there's only one instance of "Workplace"
cleansed_data$location.1 <- sub("^\\s+", "", cleansed_data$location.1)

# Strip trailing white spaces from location.1 so that there's only one instance of "Other"
cleansed_data$location.1 <- sub("\\s+$", "", cleansed_data$location.1)
# MG used a safer method (using imported data) to split and take the state and trim whitespace

# Change "-", "TBD" and "Unclear" to "Unknown"
cleansed_data$prior_signs_mental_health_issues = case_when(
cleansed_data$prior_signs_mental_health_issues == "-" ~ "Unknown",
cleansed_data$prior_signs_mental_health_issues == "No" ~ "No",
cleansed_data$prior_signs_mental_health_issues == "TBD" ~ "Unknown",
cleansed_data$prior_signs_mental_health_issues == "Unclear" ~ "Unknown",
cleansed_data$prior_signs_mental_health_issues == "Unknown" ~ "Unknown",
cleansed_data$prior_signs_mental_health_issues == "Yes" ~ "Yes"
)
# Copy date to a new column called international_date as a character variable (rather than a factor)
cleansed_data$international_date <- as.character.Date(cleansed_data$date) # copy the dates to the new column
#MG tidyverse way to split out the state. In this case ok to take from cleansed as leaving original column untouched...

cleansed_data <-cleansed_data %>%
separate(location, c(NA, "state"), remove = FALSE, sep = ",") %>%
mutate(state = str_squish(state)) # squish is like trimws but takes out multiple whitespace between words

# MG deal with Washington D.C. and misspelt Louisiana..
# D.C. is not part of Washington state, so it gets its own code ("DC") rather than "WA"

cleansed_data <- cleansed_data %>%
  mutate(state = ifelse(state == "D.C.", "DC", state)) %>%
  mutate(state = ifelse(state == "Lousiana", "Louisiana", state))


# MG: base R ships built-in vectors of US state names and abbreviations (state.name, state.abb) - create levels from both full names and abbreviations, plus DC, which those vectors omit
lvls_state <- c(state.name, state.abb, "DC")
lbls_state <- c(state.abb, state.abb, "DC")

# Trim leading and trailing white spaces
cleansed_data$international_date <- trimws(cleansed_data$international_date)
cleansed_data <- refactorer(cleansed_data, "state", lvls_state, lbls_state, "Unknown")

# change /17 at the end of the string to /2017
cleansed_data$international_date <- str_replace(cleansed_data$international_date, "^(.*/)[1]{1}[7]{1}$", "\\12017")
rm(lvls_state, lbls_state)

# change /18 at the end of the string to /2018
cleansed_data$international_date <- str_replace(cleansed_data$international_date, "^(.*/)[1]{1}[8]{1}$", "\\12018")
# Strip leading white spaces from location.1 so that there's only one instance of "Workplace". # Strip trailing white spaces from location.1 so that there's only one instance of "Other"
# MG can do both cases with trimws <- didn't bother changing to tidyverse

# change /19 at the end of the string to /2019
cleansed_data$international_date <- str_replace(cleansed_data$international_date, "^(.*/)[1]{1}[9]{1}$", "\\12019")
cleansed_data$location.1 <- trimws(cleansed_data$location.1)

# Change the date from US format (month-day-year) to international date format (year-month-day)
cleansed_data$international_date <- format(as.Date(cleansed_data$international_date, format = "%m/%d/%Y"), "%Y-%m-%d")
# Change "-", "TBD" and "Unclear" to "Unknown"
# MG same trick as above

lvls_health <- c("No", "Yes")
lbls_health <- c("No", "Yes")

cleansed_data <- refactorer(cleansed_data, "prior_signs_mental_health_issues", lvls_health, lbls_health, "Unknown")

# Change international_date to date format
cleansed_data$international_date <- as.Date(cleansed_data$international_date)
rm(lvls_health, lbls_health)

# MG lubridate will radically reduce the processing needed to standardise the date
cleansed_data$international_date <- lubridate::mdy(cleansed_data$date)

#
```
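
The move to `lubridate::mdy()` replaces the hand-written year fix-ups, so it is worth checking that nothing silently fails to parse. The following is a minimal sketch of such a check, assuming `lubridate` is installed; the vector `raw_dates` is made up for illustration and is not taken from `ShootingsData.csv`.

```r
library(lubridate)

# Hypothetical US-format dates with two- and four-digit years, plus one bad value
raw_dates <- c("4/15/19", "12/31/1982", "6/20/1994", "TBD")

# quiet = TRUE suppresses the "failed to parse" warning; failures come back as NA
parsed <- mdy(raw_dates, quiet = TRUE)

# Line the raw and parsed values up so failures are easy to spot
check <- data.frame(raw = raw_dates, parsed = parsed, failed = is.na(parsed))
print(check)

# Number of values that could not be parsed
sum(check$failed)
```

Running the same `sum(is.na(...))` check on `cleansed_data$international_date` would follow the pattern of the other "should be 0" sanity checks in this chunk.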

# Data
@@ -195,10 +206,13 @@ The download is available on [motherjones.com](https://www.motherjones.com/polit
```{r create_state_victims_summary, include=FALSE}

# Create a new data frame that shows number of victims by state
victims_by_state <- aggregate(cleansed_data$total_victims, by=list(cleansed_data$state), FUN = sum)

# Change the column heading names to "state" and "victims"
colnames(victims_by_state) <- c("state", "victims")
# tidyverse way to aggregate & tally victims..
victims_by_state <- cleansed_data %>%
group_by(state) %>%
tally(total_victims, name = "victims")

#MG - refactoring ends here!!!!!!

```
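
To see that the `group_by()` plus `tally()` version gives the same totals as the `aggregate()` call it replaces, here is a small side-by-side sketch. The data frame `toy` is invented for illustration, and the example assumes a `dplyr` version recent enough to support the `name` argument of `tally()`.

```r
library(dplyr)

# Hypothetical miniature version of cleansed_data
toy <- data.frame(state = c("CA", "CA", "TX", "NY", "TX"),
                  total_victims = c(3, 5, 2, 7, 1))

# Base R approach from the original code
base_result <- aggregate(toy$total_victims, by = list(toy$state), FUN = sum)
colnames(base_result) <- c("state", "victims")

# Tidyverse approach from this change: total_victims is used as the weight
tidy_result <- toy %>%
  group_by(state) %>%
  tally(total_victims, name = "victims")

# Both give one row per state with the summed victims
all.equal(base_result$victims, tidy_result$victims)
```

One difference to keep in mind: `aggregate()` drops rows whose grouping value is `NA`, whereas `group_by()` keeps them as their own group. Because `state` has already been passed through `refactorer()`, there should be no `NA` values left at this point.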
