Commit

Add files via upload

alistair-jones committed Nov 14, 2022
1 parent ee64f81 commit 94d8ea0
Showing 150 changed files with 14,451 additions and 0 deletions.
12 changes: 12 additions & 0 deletions .gitignore
@@ -0,0 +1,12 @@
# Byte-compiled / optimized / DLL files
__pycache__/

# Environments
*.bak
*.env
.venv

.vscode

# Notebooks containing credentials
dotenv.py
172 changes: 172 additions & 0 deletions README.md
@@ -0,0 +1,172 @@
# Artificial Data Generator

Pipelines and reusable code for generating anonymous artificial versions of NHS Digital assets in Databricks.

> **This material is maintained by the [NHS Digital Data Science team](mailto:nhsdigital.artificialdata@nhs.net)**.
>
> See our other work here: [NHS Digital Analytical Services](https://github.com/NHSDigital/data-analytics-services).
To contact us, raise an issue on GitHub or send us an [email](mailto:nhsdigital.artificialdata@nhs.net) and we will respond promptly.

## Overview

### What is artificial data?

#### Artificial data is an anonymous representation of real data
- Artificial data provides an anonymous representation of some of the properties of real datasets.
- Artificial data preserves the formatting and structure of the original dataset, but may otherwise be unrealistic.
- Artificial data reproduces some of the statistical properties and content complexity of fields in the real data, while excluding cross-dependencies between fields to prevent risks of reidentification.
- Artificial data is completely isolated from any record-level data.
- It is not possible to use artificial data to reidentify individuals, gain insights, or build statistical models that would transfer onto real data.

### How is artificial data generated?

There are three stages involved in generating the artificial data:

1. The Metadata Scraper extracts anonymised, high-level aggregates from real data at a national level.
- At this stage key identifiers (such as patient ID) are removed and small number suppression is applied in order to prevent reidentification at a later stage.
1. The Data Generator samples from the aggregates generated by the Metadata Scraper on a field-by-field basis and puts the sampled values together to create artificial records (a rough illustration of this field-by-field sampling is sketched after this list).
1. Postprocessing tweaks the output of the Data Generator to make the data appear more realistic (such as swapping randomly generated birth and death dates to ensure sensible ordering).
This also includes adding in randomly generated identifying fields (such as ‘patient’ ID) which were removed at the Metadata Scraper stage.
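
As a rough, hedged illustration of what stages 1 and 2 mean in practice, the sketch below aggregates a single field with small number suppression and then samples artificial values from those aggregates. The table, field and variable names are hypothetical, not the actual pipeline API (`spark` is provided automatically in a Databricks notebook).

```
# Illustrative sketch only -- not the actual pipeline code.
import random
from pyspark.sql import functions as F

SUPPRESSION_THRESHOLD = 10   # assumed small-number suppression cut-off
N_ARTIFICIAL_ROWS = 1000

# Stage 1 (Metadata Scraper): high-level aggregates for one field,
# with small counts suppressed before anything leaves the secure scope
field_frequencies = (
    spark.table("hes.real_table")          # hypothetical source table
    .groupBy("treatment_code")             # hypothetical field
    .count()
    .where(F.col("count") >= SUPPRESSION_THRESHOLD)
    .collect()
)

# Stage 2 (Data Generator): sample each field independently from its
# aggregates, so no cross-field dependencies can leak through
values = [row["treatment_code"] for row in field_frequencies]
weights = [row["count"] for row in field_frequencies]
artificial_column = random.choices(values, weights=weights, k=N_ARTIFICIAL_ROWS)
```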

### How do we ensure artificial data is anonymous?
Between steps 1 and 2 above, a manual review is performed to check for any accidental disclosure of personally identifiable information (PII) in the metadata.
The review is carried out against a checklist that has been approved by the Statistical Disclosure Control Panel at NHS Digital, chaired by the Chief Statistician.
The outcomes of the review are signed-off by a senior manager.

The data generator in step 2 uses only reviewed / signed off metadata and is completely isolated from any record-level data in the original dataset.

## Dependencies & environment
The code is designed to be run within the Databricks environment.

> **Warning**
>
> Python files represent Databricks notebooks, not Python scripts / modules!
>
> This means things like imports don't necessarily work how you may expect!

The codebase was developed and tested on [Databricks Runtime 6.6](https://docs.databricks.com/release-notes/runtime/6.6.html).
We have packaged up the dependencies with the code, so the code should run on Databricks Runtime 6.6 without installing additional packages.

> **Note**
>
> We have plans to pull out the core logic into a Python package to make it reusable by others (outside of Databricks), but we're not there yet!
>
> Look out for future updates, or feel free to reach out to us via [email](mailto:nhsdigital.artificialdata@nhs.net) and we'd be happy to talk.

## Repo structure
### Top-level structure
The repo has the following structure when viewed from the top level:
```
root
|-- projects # Code Promotion projects (see full description below)
| |-- artificial_hes # For generating artificial HES data
| |
| |-- artificial_hes_meta # For scraping HES metadata
| |
| |-- iuod_artificial_data_generator # Reusable code library & dataset-specific pipelines
| |
| |-- iuod_artificial_data_admin # For managing reviewed metadata
|
|-- docs # Extended documentation for users
|
|-- notebooks # Helper notebooks
| |-- admin # Admin helpers
| |-- user # User helpers
|
|-- utils # Databricks API helper scripts
```

### Reusable logic
The common logic shared across different pipelines is stored within `projects/iuod_artificial_data_generator/notebooks`.
- The entry-points for scraping metadata are the `driver.py` files within `scraper_pipelines`.
- The entry-points for generating artificial data are the `driver.py` files within `generator_pipelines`.
- The remaining notebooks / folders in this directory store reusable code.

Note: in the NHS Digital Databricks environment, the driver notebooks are not triggered directly; rather, they are executed as ephemeral notebook jobs by the `run_notebooks.py` notebooks in one of the `projects` (for example, `artificial_hes_meta` executes the driver notebook within `scraper_pipelines/hes`).
See below for more details on Code Promotion.
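
For context, the sketch below shows the kind of ephemeral notebook call `run_notebooks.py` makes; the relative path and argument names are assumptions for illustration, not the project's actual values (`dbutils` is available automatically inside a Databricks notebook).

```
# Illustrative only: the path and widget arguments are hypothetical.
result = dbutils.notebook.run(
    "./notebooks/scraper_pipelines/hes/driver",              # hypothetical path to the driver notebook
    3600,                                                    # timeout in seconds
    {"database": "hes", "meta_database": "artificial_hes_meta"},  # hypothetical widget names
)
```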

### Code Promotion projects
The children of the `projects` directory follow the 'Code Promotion' structure, a layout specific to NHS Digital's Code Promotion process.
For more information on Code Promotion, see the subsection below.

For each dataset, there are two 'Code Promotion' projects:
- One to extract metadata (for example `artificial_hes_meta`)
- One to publish the generated artificial data (for example `artificial_hes`)

There are two general projects that act across datasets:
- `iuod_artificial_data_generator` is responsible for generating artificial data for a specified dataset.
- `iuod_artificial_data_admin` is used to move metadata between access scopes (specifically from the sensitive to the non-sensitive scope after review and sign-off).

#### What is Code Promotion?
Code Promotion is a process designed by Data Processing Services (DPS) at NHS Digital to allow users of Databricks in DAE to promote code between environments and run jobs on an automated schedule.

Jobs inside a Code Promotion project have read/write access to the project's own database (which shares the project's name) and may have read or read/write access to a number of other databases or tables.

#### Code Promotion structure
Each Code Promotion project must adhere to the following structure.
```
{project_name}
|-- notebooks # Library code called by run_notebooks.py
|
|-- schemas # Library code called by init_schemas.py
|
|-- tests # Tests for notebooks
|
|-- cp_config.py # Configures the Databricks jobs (a strict set
| # of variables must be defined based on the Code
| # Promotion specification)
|
|-- init_schemas.py # Sets up / configures the database associated to the project
|
|-- run_notebooks.py # Entry-point for the main processing for the project
|
|-- run_tests.py # Run all the tests, executed during the build process
|
```

#### Code Promotion jobs
Within Databricks, each Code Promotion project is associated with three jobs which trigger the `init_schemas.py`, `run_tests.py` and `run_notebooks.py` notebooks.
These jobs have a specific set of permissions to control their access scopes - the databases they can select from, modify and so on.
This is why driver notebooks are not triggered directly within Databricks (as per the note above): the jobs are set up with exactly the permissions needed to perform the tasks they are designed to do.

We have designed the jobs and their permissions in such a way as to completely isolate the access scopes for pipelines that scrape metadata from those that generate artificial data.
It is not possible for the pipelines that generate artificial data to read from any sensitive data.

## Utilities

There are three scripts in the `utils` folder which allow developers to sync changes to the Code Promotion projects (`iuod_artificial_data_generator`, `iuod_artificial_data_admin`, `artificial_hes` and `artificial_hes_meta`) across environments.
Users will need to set up the Databricks CLI in order to use these scripts.

- `export.ps1`: exports all workspaces in Databricks from staging/dev to projects in your local environment.
The script will stash local changes before exporting. The stash is not reapplied until the user runs `git stash apply`.
- `import.ps1`: imports all directories in your local version of projects to staging/dev in Databricks.
Importing will overwrite the version of any of these projects in staging, so the script includes user warnings and confirmation to prevent accidental overwriting.
If the overwrite is intentional then the user will need to confirm by typing the project name.
- `list_releases.ps1`: returns the most recent version of each Code Promotion project in code-promotion.

### Setting up the Databricks CLI
We use the Databricks API to import / export code to / from our development environment.
In your environment, install the Databricks CLI using

```
pip install databricks-cli
```

Then set up authentication using

```
databricks configure --token
```

You will be prompted to enter the host and your Databricks personal access token.

If you have not previously generated a Databricks personal access token: in Databricks go to Settings > User Settings > Access Tokens and click Generate New Token.
Make note of the token somewhere secure, and copy it into the prompt.
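
On success, the CLI typically writes these values to a `~/.databrickscfg` file that looks roughly like this (placeholder values shown):

```
[DEFAULT]
host = https://<your-databricks-host>
token = <your-personal-access-token>
```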

## Licence
The Artificial Data Generator codebase is released under the MIT License.

The documentation is © Crown copyright and available under the terms of the [Open Government 3.0](https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/) licence.

28 changes: 28 additions & 0 deletions docs/artificial_data_admin_guide.md
@@ -0,0 +1,28 @@
## Guide for artificial data admins
This section is aimed at administrators of artificial data, working in an environment where they can trigger the different stages of the process.

### How do I extract metadata?
To extract metadata for a dataset, run the `run_notebooks.py` notebook in the corresponding `artificial_{dataset_name}_meta` project (for example `artificial_hes_meta/run_notebooks.py`).

By default, this will read data from a database named according to the `{dataset_name}` parameter in the template string above and write to a database called `artificial_{dataset_name}_meta`. These can be changed via the notebook widgets.
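
As a hedged sketch of what configuring these databases via widgets looks like in a Databricks notebook (the widget names below are illustrative; check `run_notebooks.py` for the real parameters):

```
# Hypothetical widget names -- the actual notebook may differ.
dbutils.widgets.text("source_database", "hes", "Database to read real data from")
dbutils.widgets.text("meta_database", "artificial_hes_meta", "Database to write metadata to")

source_database = dbutils.widgets.get("source_database")
meta_database = dbutils.widgets.get("meta_database")
```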

When running in production on live data, this should be done by triggering the `run_notebooks` job for the respective project, as this will have the privileges to access the live data.
Approved human users will only have access to aggregates for review, not to the underlying record-level data.

### What happens after metadata is extracted?
Once the metadata has been extracted, it should be manually reviewed by a member of staff working in a secure environment to ensure no personally identifiable information (PII) is disclosed. This should be signed off by a senior member of staff.

At NHS Digital we have a checklist that was approved by the Statistical Disclosure Control Panel, chaired by the Chief Statistician.

Once we have checked the metadata and signed it off we move it into a database which is inaccessible to the metadata scraper and so is completely isolated from the database containing the real data.
This is done by executing the `run_notebooks.py` notebook in the `iuod_artificial_data_admin` project.
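
Conceptually, this move amounts to copying the reviewed metadata tables into a database that only the generator can read. A minimal sketch under that assumption (the table name is hypothetical):

```
# Illustrative only: the real admin notebook handles many tables and the
# associated permissions.
reviewed_metadata = spark.table("artificial_hes_meta.field_aggregates")   # hypothetical table
(
    reviewed_metadata.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("iuod_artificial_data_generator.field_aggregates")
)
```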

When running in production on live data, this should be done by triggering the `run_notebooks` job for the `iuod_artificial_data_admin` project, as this will have the privileges to read from / write to the appropriate databases.

### How do I generate artificial data?
To generate data and run postprocessing for a dataset, run the `run_notebooks.py` notebook in the `iuod_artificial_data_generator` project, passing the name of the dataset to generate artificial data for.
For example, to generate artificial HES data, set the 'artificial_dataset' parameter to 'hes'.

By default, this process will read metadata from and write artificial data to a database called `iuod_artificial_data_generator`, but this parameter can be changed via the notebook widgets.

When running in production on live data, this should be done by triggering the `run_notebooks` job for the `iuod_artificial_data_generator` project, as this will have the privileges to access the aggregated data.
58 changes: 58 additions & 0 deletions docs/artificial_data_user_notice.md
@@ -0,0 +1,58 @@
<h1> Notice For Artificial Data Users </h1>

<h2> What is artificial data? </h2>

<h3> Artificial data is an anonymous representation of real data </h3>
<ul>
<li> Artificial data provides an anonymous representation of some of the properties of real datasets. </li>
<li> Artificial data preserves the formatting and structure of the original dataset, but may otherwise be unrealistic. </li>
<li> Artificial data reproduces some of the statistical properties and content complexity of fields in the real data, while excluding cross-dependencies between fields to prevent risks of reidentification. </li>
<li> Artificial data is completely isolated from any record-level data. </li>
<li> It is not possible to use artificial data to reidentify individuals, gain insights, or build statistical models that would transfer onto real data. </li>
</ul>

<h2> How is it generated? </h2>

There are three stages involved in generating the artificial data:

<ol>
<li> The Metadata Scraper: extracts anonymised, high-level aggregates from real data at a national level. At this stage key identifiers (such as patient ID) are removed and small number suppression is applied in order to prevent reidentification at a later stage. </li>
<li> The Data Generator: samples from the aggregates generated by the Metadata Scraper on a field-by-field basis and puts the sampled values together to create artificial records. </li>
<li> Postprocessing: using the output of the Data Generator, dataset-specific tweaks are applied to make the data appear more realistic (such as swapping randomly generated birth and death dates to ensure sensible ordering). This also includes adding in randomly generated identifying fields (such as ‘patient’ ID) which were removed at the Metadata Scraper stage. </li>
</ol>

<h2> What is it used for? </h2>

The purpose of artificial data is twofold.

<h3> 1. Artificial data enables faster onboarding for new data projects </h3>

Users can begin work on a project in advance of access to real data being granted. There are multiple use cases for the artificial data.
<ul>
<li> For TRE/DAE users who have submitted (or intend to submit) an access request for a given dataset: artificial data can give a feel for the format and layout of the real data prior to accessing it. It also allows users to create and test pipelines before accessing real data, or perhaps without ever accessing real data at all. </li>
<li> For TRE/DAE users who are unsure which datasets are relevant to their project: artificial data allows users to understand which datasets would be useful for them prior to applying for access. Artificial data can complement technical information found in the Data Dictionary for a given dataset, and give users a feel for how they would work with that dataset. </li>
</ul>

<h3> 2. Artificial data minimises the amount of personal data being processed </h3>

The activities mentioned above can all be completed without handling personal data, improving the ability of NHS Digital to protect patient privacy by minimising access to sensitive data.


<h2> What are its limitations? </h2>

Artificial data is not real data, and is not intended to represent something real or link to real records. Artificial records are not based on any specific records found in the original data, only on high-level, anonymised aggregates. As outlined above, it is intended to improve efficiency and protect patient data.

It is crucial to note that artificial data is not synthetic data.

Synthetic data is generated using sophisticated methods to create realistic records. Usually synthetic data aims to enable the building of statistical models, or the gaining of insights, that transfer onto real data. This is not possible with artificial data. The downside of synthetic data is that it is associated with non-negligible risks to patient privacy through reidentification. It is not possible to reidentify individuals using artificial data.


<h2> Additional Information </h2>

<h3> Support from senior leadership </h3>
This work has undergone due process to fully assess any potential risks to patient privacy and has been approved by senior leadership within NHS Digital, including: the Senior Information Risk Owner (SIRO); the Caldicott Guardian; the Data Protection Officer (DPO); the Executive Director of Data and Analytics Services; and the IAOs for the datasets represented by the Artificial Data assets. A full DPIA has been completed and is available upon request.

For further details, please get in touch via the mailbox linked below.

<h2> Contact </h2>
For further details, please get in touch via: <a href="mailto:nhsdigital.artificialdata@nhs.net">nhsdigital.artificialdata@nhs.net</a>
5 changes: 5 additions & 0 deletions docs/build_docs/README.md
@@ -0,0 +1,5 @@

# How do I add user documentation to the production DAE?
There are 2 steps to building the documentation and adding it to DAE:
1. (Desktop Python) Run the Python file `notebooks/user/build_docs/make_create_user_docs.py`: the full documentation exists in the `docs` folder at the top level of the repo; this step takes that documentation and puts the contents into a file that can be run on DAE to make it readable by users (a sketch of this substitution is given after this list).
1. (DAE Prod) Copy the notebook `notebooks/user/collab/create_user_docs.py` created in step 1 into DAE Prod and run it: this will add the documentation to a user-facing table in DAE Prod.
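
Step 1 is essentially a template substitution: the contents of the docs are spliced into `create_user_docs_template.py` in place of placeholders such as `{{artificial_data_user_notice.md}}` (visible in the template below). A rough sketch of that substitution, under the assumption that the docs are inserted verbatim (the actual build script may differ):

```
# Illustrative sketch of the build step, not the actual make_create_user_docs.py.
from pathlib import Path

template = Path("docs/build_docs/create_user_docs_template.py").read_text()
notice = Path("docs/artificial_data_user_notice.md").read_text()

# Splice the doc contents into the placeholder seen in the template
built = template.replace("{{artificial_data_user_notice.md}}", notice)
Path("notebooks/user/collab/create_user_docs.py").write_text(built)
```
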
58 changes: 58 additions & 0 deletions docs/build_docs/create_user_docs_template.py
@@ -0,0 +1,58 @@
# Databricks notebook source
import os

from pyspark.sql import functions as F

# COMMAND ----------

dbutils.widgets.text("db", "", "0.1 Project Database")

# COMMAND ----------

# Config
database_name = dbutils.widgets.get("db")
table_name = "user_docs"
table_path = f"{database_name}.{table_name}"

# Check database exists
if spark.sql(f"SHOW DATABASES LIKE '{database_name}'").first() is None:
# Database not found - exit
dbutils.notebook.exit({})

# Template variables replaced during build
user_notice_file_name = "artificial_data_user_notice"

# COMMAND ----------

user_notice_html = """{{artificial_data_user_notice.md}}"""

# COMMAND ----------

# Create and upload the docs
docs_data = [
[user_notice_file_name, user_notice_html]
]
docs_schema = "file_name: string, content_html: string"
docs_df = spark.createDataFrame(docs_data, docs_schema)
(
docs_df.write
.format("delta")
.mode("overwrite")
.saveAsTable(table_path)
)

# Make sure users can select from the table but not overwrite
if os.getenv("env", "ref") == "ref":
owner = "data-managers"
else:
owner = "admin"

spark.sql(f"ALTER TABLE {table_path} OWNER TO `{owner}`")

# Check the uploads
for file_name, content_html in docs_data:
result_content_html = (
spark.table(table_path)
.where(F.col("file_name") == file_name)
.first()
.content_html
)
assert result_content_html == content_html