Overall Test Plan for 1000 Genomes Whole-Genome PheWAS Analysis on Polaris

We will have access to approximately 10,000 whole-genome sequences from the MVP project. In the past, the VA performed mapping and SNP calling on the Trellis platform, but SV calling at this scale has not been attempted before. Whole-genome sequence analysis is computationally intensive and time-consuming, so we would like to use ALCF's Polaris machine to run the NGS sequence analysis workflows. This will also give us the opportunity to perform a population analysis, concluding with a PheWAS analysis on the whole-genome results.

Here, we will test the workflow analysis plan on a subset of data from the 1000 Genomes Project (1KG): EUR samples from the GBR cohort. We will concentrate only on SNP calling; no structural variant (SV) analysis will be performed. Below is a draft of the steps needed to achieve the goals of this project. Please modify as appropriate.

Get familiar with Polaris

Get user accounts
Login
```
ssh arodriguez@polaris.alcf.anl.gov
```

Submit interactive and script jobs

```
qsub -A covid-ct -I -l select=1 -l walltime=1:00:00 -l filesystems=home:eagle -q debug
```
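
Since the plan covers both interactive and script jobs, a small helper can keep the two kinds of submission consistent. The following is only a sketch: `build_qsub_command` and the script name `align_sample.sh` are hypothetical, and it uses only the `qsub` options that appear in the command above, plus the standard PBS form of passing a job script as the last argument.

```python
import shlex


def build_qsub_command(project, queue, walltime, nodes=1,
                       filesystems="home:eagle", interactive=False, script=None):
    """Assemble a PBS qsub command as a list of arguments.

    Exactly one of interactive=True or script=<path> must be given.
    """
    if interactive == (script is not None):
        raise ValueError("pass either interactive=True or a script path, but not both or neither")
    cmd = ["qsub", "-A", project]
    if interactive:
        cmd.append("-I")
    cmd += [
        "-l", f"select={nodes}",
        "-l", f"walltime={walltime}",
        "-l", f"filesystems={filesystems}",
        "-q", queue,
    ]
    if script is not None:
        # The job script goes last; "align_sample.sh" below is a hypothetical name.
        cmd.append(script)
    return cmd


# Rebuild the interactive command shown above.
interactive_cmd = build_qsub_command("covid-ct", "debug", "1:00:00", interactive=True)
print(shlex.join(interactive_cmd))

# Same resources, but as a script job.
batch_cmd = build_qsub_command("covid-ct", "debug", "1:00:00", script="align_sample.sh")
print(shlex.join(batch_cmd))
```

The list form can be passed directly to `subprocess.run` on a login node, which avoids shell quoting issues when project or queue names change.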

Download 1KG whole-genome GBR datasets:
- 92 GBR samples from 1KG (EUR) datasets [Progress]
- Reference download and index build for hg38 [Progress]

Gather a set of NGS and PheWAS/GWAS tools to test on Polaris. Each tool will most likely run into its own issues, which we will have to address case by case.

Alignment
- NVIDIA Clara-Parabricks
SNP Callers
- DeepVariant - included with Parabricks
- GATK HaplotypeCaller - included with Parabricks
Annotations
- Annotate with VEP
Population analysis
- SAIGE
- Regenie [Progress]
- Hail-batch [Progress]

Test each of the tools from the previous set
- NVIDIA Clara-Parabricks - Successfully ran workflow on low-coverage and 30X whole-genome FASTQ sequences [Progress]
Evaluate outputs - Determine which set of tools is best for our analysis, while staying flexible enough to shift focus if new tools come up.
Create/test submission engine (e.g. Parsl, Balsam, etc.)
- Look here
- Parsl on Polaris
Create workflow for submitting the genomes [Progress]
Generate statistics on runtime to determine how large our allocation on Polaris should be
Start processing 1KG whole-genome through workflow pipeline [Progress]
Convert VCF to PLINK format for use with SAIGE and filter BED files [Progress]
Perform post-processing analysis on VCFs (e.g. QC, annotations, etc.) [Progress]
Write paper on how VCFs were generated and what was found (computational and scientific?)
Create simulated phenotype files for 1KG VCF results. Use link to create phenotypes. To use Hail for phenotype creation look here. [Progress]
Set up SAIGE on Polaris [Progress]
Set up Regenie on Polaris [Progress]
Run SAIGE for 1KG WG BED data [Progress]
Run Regenie for 1KG WG PGEN data
Perform QC on SAIGE analysis
Package up the tools to perform the same tests in the cloud
Compare cloud results with Polaris results
Share data
Write paper on findings comparing different tools
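
For the runtime-statistics step, the allocation estimate is simple arithmetic once per-sample runtimes are available. The sketch below assumes the allocation is accounted in node-hours and that each genome runs as one job; `estimate_node_hours`, the `safety_factor` padding, and the runtimes in the usage example are illustrative placeholders, not measurements from this project.

```python
import math
import statistics


def estimate_node_hours(runtimes_hours, n_genomes, nodes_per_job=1, safety_factor=1.2):
    """Project the node-hours needed for n_genomes from measured per-sample runtimes.

    runtimes_hours: wall-clock hours of completed test runs, one value per sample.
    safety_factor:  assumed padding for reruns and failed jobs (an assumption).
    """
    if not runtimes_hours:
        raise ValueError("need at least one measured runtime")
    mean_h = statistics.mean(runtimes_hours)
    # Sample standard deviation is only defined for two or more runs.
    sd_h = statistics.stdev(runtimes_hours) if len(runtimes_hours) > 1 else 0.0
    node_hours = math.ceil(mean_h * nodes_per_job * n_genomes * safety_factor)
    return {"mean_hours": mean_h, "stdev_hours": sd_h, "node_hours": node_hours}


# Placeholder runtimes (hours) for three test samples, scaled to the 92 GBR genomes.
estimate = estimate_node_hours([1.0, 1.5, 2.0], n_genomes=92)
print(estimate)
```

The same function can be re-run with `n_genomes` set to the full cohort size once the 1KG test runs give realistic per-sample timings.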

References

This is what others are doing:

Take a look at what the Broad folks are doing here. They call whole genomes using the Broad workflow, and SV calling is done by consensus.
Genomics England's very first initiative – sequencing 100,000
