Merge pull request #12 from ecohealthalliance/feature/documentation
Feature/documentation
deanmarchiori authored Apr 17, 2024
2 parents 9e34caa + 197d0c5 commit 77f8ead
Showing 16 changed files with 615 additions and 19 deletions.
3 changes: 3 additions & 0 deletions .Rbuildignore
@@ -3,3 +3,6 @@
^README\.Rmd$
^LICENSE\.md$
^\.github$
^_pkgdown\.yml$
^docs$
^pkgdown$
48 changes: 48 additions & 0 deletions .github/workflows/pkgdown.yaml
@@ -0,0 +1,48 @@
# Workflow derived from https://github.com/r-lib/actions/tree/v2/examples
# Need help debugging build failures? Start at https://github.com/r-lib/actions#where-to-find-help
on:
  push:
    branches: [main, master]
  pull_request:
    branches: [main, master]
  release:
    types: [published]
  workflow_dispatch:

name: pkgdown

jobs:
  pkgdown:
    runs-on: ubuntu-latest
    # Only restrict concurrency for non-PR jobs
    concurrency:
      group: pkgdown-${{ github.event_name != 'pull_request' || github.run_id }}
    env:
      GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
    permissions:
      contents: write
    steps:
      - uses: actions/checkout@v4

      - uses: r-lib/actions/setup-pandoc@v2

      - uses: r-lib/actions/setup-r@v2
        with:
          use-public-rspm: true

      - uses: r-lib/actions/setup-r-dependencies@v2
        with:
          extra-packages: any::pkgdown, local::.
          needs: website

      - name: Build site
        run: pkgdown::build_site_github_pages(new_process = FALSE, install = FALSE)
        shell: Rscript {0}

      - name: Deploy to GitHub pages 🚀
        if: github.event_name != 'pull_request'
        uses: JamesIves/github-pages-deploy-action@v4.5.0
        with:
          clean: false
          branch: gh-pages
          folder: docs
1 change: 1 addition & 0 deletions .gitignore
@@ -3,3 +3,4 @@
.RData
.Ruserdata
inst/doc
docs
1 change: 1 addition & 0 deletions DESCRIPTION
@@ -42,3 +42,4 @@ Remotes:
    ecohealthalliance/containerTemplateUtils,
    fcampelo/rdrop2,
    ropensci/ruODK
URL: https://ecohealthalliance.github.io/ohcleandat/
8 changes: 8 additions & 0 deletions README.Rmd
@@ -29,3 +29,11 @@ You can install the development version of ohcleandat from [GitHub](https://gith
# install.packages("devtools")
devtools::install_github("ecohealthalliance/ohcleandat")
```

## Getting Started

For help guides, check out the package vignettes.


## Getting Help
If you encounter a clear bug, please file a minimal reproducible example on [github](https://github.com/ecohealthalliance/ohcleandat/issues).
9 changes: 9 additions & 0 deletions README.md
@@ -20,3 +20,12 @@ You can install the development version of ohcleandat from
# install.packages("devtools")
devtools::install_github("ecohealthalliance/ohcleandat")
```

## Getting Started

For help guides, check out the package vignettes.

## Getting Help

If you encounter a clear bug, please file a minimal reproducible example
on [github](https://github.com/ecohealthalliance/ohcleandat/issues).
4 changes: 4 additions & 0 deletions _pkgdown.yml
@@ -0,0 +1,4 @@
url: https://ecohealthalliance.github.io/ohcleandat/
template:
  bootstrap: 5

54 changes: 54 additions & 0 deletions vignettes/idcheck.Rmd
@@ -0,0 +1,54 @@
---
title: "ID Correction and Autobot"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{ID Correction and Autobot}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(ohcleandat)
```

The data cleaning and validation pipeline lets you specify rules that are applied to the data to produce a validation log for manual correction. In some cases, however, particularly with ID columns, formatting errors can be corrected automatically.

Examples include missing prefixes, incorrect case, or non-standard formatting in columns that should follow a predictable, fixed format. In these cases, we want an automated cleaning step that corrects the data and also produces a validation log for our records. This is done in two steps.

The first step applies the automatic corrections using an `id_check()` function (or a family of checking functions). These operate on the semi-clean data set to produce a new column containing the proposed corrections. These functions are designed and implemented by users based on their requirements.

Once these corrections are made, both the original ID column and the new corrected ID column are passed to an `autobot()` function in the pipeline. The `autobot()` function compares the two columns and keeps only the records where the original and the new values differ, indicating that some form of automatic correction has been made.

A validation log is generated in exactly the same format as other validation logs. The key difference is that this log does not require manual review: the proposed changes are automatically accepted by the autobot. The log is produced to persist the changes and to keep a record of how IDs have changed due to automatic corrections.

## Example

Below is an example of a (fake) `farm_id` identifier. The ID checker functions have corrected an 'O' to a '0' in record 2 and corrected the case in record 3. Records that still do not conform to the required pattern after correction are set to NA for manual review.

```
# A tibble: 6 × 2
  farm_id    farm_id_new
  <chr>      <chr>
1 123ABC0007 NA
2 1O3ABC010  103ABC010
3 143abc010  143ABC010
4 13DEFH005  NA
5 243DLF803  243DLF803
6 243DPF911  243DPF911
```
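
As a rough illustration of how a user-written checker might produce `farm_id_new`, the sketch below assumes the required pattern is three digits, three upper-case letters, then three digits. That pattern is inferred from the example rows above rather than stated by the package, and `check_farm_id()` is a hypothetical name, not an ohcleandat function.

```r
# Hypothetical ID checker; the pattern is an assumption inferred from the example.
check_farm_id <- function(id) {
  id <- toupper(trimws(id))  # case correction
  n <- nchar(id)
  # Replace the letter O with zero only in the leading and trailing
  # three characters, which are assumed to be numeric.
  prefix <- chartr("O", "0", substr(id, 1, 3))
  middle <- substr(id, 4, n - 3)
  suffix <- chartr("O", "0", substr(id, n - 2, n))
  fixed <- paste0(prefix, middle, suffix)
  # Anything that still fails the pattern is set to NA for manual review.
  ifelse(grepl("^[0-9]{3}[A-Z]{3}[0-9]{3}$", fixed), fixed, NA_character_)
}

test <- data.frame(
  farm_id = c("123ABC0007", "1O3ABC010", "143abc010",
              "13DEFH005", "243DLF803", "243DPF911"),
  stringsAsFactors = FALSE
)
test$farm_id_new <- check_farm_id(test$farm_id)
```

Restricting the O-to-0 substitution to the numeric positions avoids altering a genuine letter O in the middle block.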

```
> ohcleandat::autobot(data = test, old_col = "farm_id", new_col = "farm_id_new", key = "farm_id")
# A tibble: 2 × 8
  entry     field   issue                               old_value is_valid new_val  user_initials comments
  <chr>     <chr>   <chr>                               <chr>     <chr>    <chr>    <chr>         <chr>
1 1O3ABC010 farm_id Automated field format check failed 1O3ABC010 FALSE    103ABC0… autobot       ""
2 143abc010 farm_id Automated field format check failed 143abc010 FALSE    143ABC0… autobot       ""
```
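
The comparison that `autobot()` performs can be approximated in base R. This is only a sketch of the logic described in this vignette, not the package's implementation: `sketch_autobot_log()` is a hypothetical helper, the column names are copied from the output shown, and dropping rows whose proposed value is `NA` is an assumption based on records 1 and 4 not appearing in that output.

```r
# Hypothetical helper illustrating the comparison; not ohcleandat's autobot().
sketch_autobot_log <- function(data, old_col, new_col, key) {
  old <- data[[old_col]]
  new <- data[[new_col]]
  # Keep rows where a correction was proposed and it differs from the original.
  changed <- !is.na(old) & !is.na(new) & old != new
  n <- sum(changed)
  data.frame(
    entry = as.character(data[[key]][changed]),
    field = rep(old_col, n),
    issue = rep("Automated field format check failed", n),
    old_value = old[changed],
    is_valid = rep("FALSE", n),
    new_val = new[changed],
    user_initials = rep("autobot", n),
    comments = rep("", n),
    stringsAsFactors = FALSE
  )
}

test <- data.frame(
  farm_id = c("123ABC0007", "1O3ABC010", "143abc010",
              "13DEFH005", "243DLF803", "243DPF911"),
  farm_id_new = c(NA, "103ABC010", "143ABC010",
                  NA, "243DLF803", "243DPF911"),
  stringsAsFactors = FALSE
)
auto_log <- sketch_autobot_log(test, "farm_id", "farm_id_new", "farm_id")
```

Under these assumptions, `auto_log` has two rows, one for each ID that was changed.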
Binary file added vignettes/img/erd.png
Binary file added vignettes/img/html.png
Binary file added vignettes/img/pipeline.png
Binary file added vignettes/img/targets.png
62 changes: 62 additions & 0 deletions vignettes/integration.Rmd
@@ -0,0 +1,62 @@
---
title: "Integrating Different Datasets"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Integrating Different Datasets}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(ohcleandat)
```

## Integrating Data Sets

Once an individual data set has passed through the cleaning and validation process, it may need to be combined (or joined) with other data sets. The {targets} pipeline handles this during the *integration* steps. There are two phases: the first performs checks, and the second performs the join.

The checks test that the records in both data sets are compatible and match the expected relationship between the two data sets. The data sets are then integrated using an SQL join operation, typically a left join, a right join, an inner join, or a full join.
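
A minimal sketch of the kind of pre-join checks described above, written in base R. The function name `check_join_keys()` and the toy data are hypothetical, the key column names are borrowed from the join target shown later in this vignette, and the specific checks (a unique, non-missing primary key and no orphaned foreign keys) are assumptions about what "compatible" means here, not the pipeline's actual tests.

```r
# Hypothetical pre-join checks on a primary key (pk) and foreign key (fk).
check_join_keys <- function(base, joined, pk, fk) {
  pk_vals <- base[[pk]]
  fk_vals <- joined[[fk]]
  list(
    # Number of rows in the base table with no primary key
    pk_missing = sum(is.na(pk_vals)),
    # Primary key values that occur more than once
    pk_duplicated = unique(pk_vals[duplicated(pk_vals) & !is.na(pk_vals)]),
    # Foreign key values with no matching primary key
    fk_orphans = setdiff(fk_vals[!is.na(fk_vals)], pk_vals)
  )
}

# Toy data for illustration only
field <- data.frame(Batch_ID = c("B1", "B2", "B2", NA), stringsAsFactors = FALSE)
ident <- data.frame(batch_id = c("B1", "B3"), stringsAsFactors = FALSE)
res <- check_join_keys(field, ident, pk = "Batch_ID", fk = "batch_id")
```

Here `res` reports one missing key, the duplicated key `"B2"`, and the orphaned foreign key `"B3"`. Any of these would need resolving in the validation log before the join can be trusted.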

The type of join selected depends on the relationship between the data sets. The critical information is the primary key (unique identifier) of the base table and the foreign key of the table to be joined, which is the attribute that should match the primary key. The cardinality of the relationship is also important for understanding the expected result of the join. Below is an example of an entity relationship diagram that shows the relationship between two example data sets. Crow's foot notation illustrates an optional 1:Many relationship from the left table to the right table, and a mandatory 1:1 relationship from the right table to the left table.

![](img/erd.png)

Below is the target that performs the join and integrates these data sets.

```{r eval=FALSE}
tar_target(integrated_mosq_field,
  left_join(
    x = fs_mosquito_field_semiclean,
    y = longitudinal_identification_semiclean,
    by = c("Batch_ID" = "batch_id")
  )
)
```
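
To see how cardinality affects the result of a left join, the sketch below uses base R `merge()` with `all.x = TRUE` as a stand-in for `dplyr::left_join()` so that it runs without extra packages. The two small data frames and their `site` and `species` columns are invented for illustration; only the key names `Batch_ID` and `batch_id` come from the target above.

```r
# Toy left table: one row per batch (primary key Batch_ID)
field <- data.frame(
  Batch_ID = c("B1", "B2"),
  site = c("north", "south"),
  stringsAsFactors = FALSE
)

# Toy right table: zero or many rows per batch (foreign key batch_id)
ident <- data.frame(
  batch_id = c("B1", "B1"),
  species = c("Aedes", "Culex"),
  stringsAsFactors = FALSE
)

# Left join: keep every row of `field`, matching rows of `ident` where present
joined <- merge(field, ident, by.x = "Batch_ID", by.y = "batch_id", all.x = TRUE)
nrow(joined)
```

The result has three rows: batch `B1` is repeated once per matching identification record, and batch `B2` is kept with a missing `species` because it has no match. This row expansion is expected for an optional 1:Many relationship.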

It is critical that the data validation steps are performed correctly so that the integration of multiple data sets succeeds. If primary keys are missing, malformed, or duplicated, the expected relationship between the data sets will not hold.

## Types of Data

Throughout the data cleaning pipeline, we take in raw data and convert it to some form of clean data. There are several intermediate steps in this process. A standard terminology has been adopted to describe the steps in this process.

- **raw data**: Data read in directly from the source systems.
- **combined data**: If the raw data are spread across multiple files or data frames, the single compatible data set formed by uniting them is termed 'combined'.
- **semi-clean**: Data are semi-clean once they have been corrected using the values provided in the validation log.
- **integrated**: Data are integrated when they are joined to other data sets.
- **clean**: Data are termed clean when they are integrated, and records that are still pending validation in the logs are removed, thereby leaving only a clean subset of validated data.

## Tips for data management

Over the course of a long data collection exercise, standards and formats can diverge. This makes the data cleaning steps difficult and will slow down the ability to integrate data as above. Some general strategies can help to mitigate these risks:

- Design and enforce a Primary Key or Unique Identifier for each data set that will be meaningful and immutable.
- Think about storing data in a 'tidy' format where possible. See here: <https://www.jstatsoft.org/article/view/v059i10>
- Store raw data in a machine-readable format (e.g. CSV).
- Set some metadata standards at the start of the project around columns and data types. It is understandable that these might change over time, but having these standards will help plan how to best accommodate changes without breaking existing work.