exascale-genomics/mvp-wgs-sv

Overall Plan for MVP Whole Genome Structural Variant and PheWAS Analysis on Polaris and Frontier

We will have access to approximately 10,000 whole-genome sequences from the MVP project. In the past, the VA performed mapping and SNP calling on the Trellis platform, but SV calling at this scale has not been attempted before. Analyzing whole-genome sequences is computationally intensive and time-consuming, so we would like to use ALCF's Polaris machine to run the NGS analysis workflows. We will also have the opportunity to perform a population-level analysis, concluding with a PheWAS on the results of the whole-genome analysis.

Below is a draft of the steps needed to achieve the goals of this project. Please modify as appropriate.

  1. Get familiar with Polaris
    • Get user accounts
    • Login
      ssh arodriguez@polaris.alcf.anl.gov
      
    • Submit interactive and script jobs
      qsub -A covid-ct -I -l select=1 -l walltime=1:00:00 -l filesystems=home:eagle -q debug
      
  2. Download data [progress]:
  3. Gather a set of NGS and PheWAS/GWAS tools to test on Polaris. Each tool will most likely run into its own issues, which will have to be handled case by case.
  4. Test each of the tools from the previous set
    • NVIDIA Clara-Parabricks - Successfully ran workflow on low-coverage and 30X whole-genome FASTQ sequences [Progress]
    • SVision - Build tool and test on 30X whole-genome BAM file [Progress]
  5. Evaluate outputs - Determine which set of tools is best for our analysis, while staying flexible enough to shift focus if new tools come up.
  6. Create/test submission engine (e.g., Parsl, Balsam)
  7. Create workflow for submitting the genomes
  8. Generate statistics on runtime to determine how large our allocation on Polaris should be
  9. Start process to move MVP whole-genome data
  10. Start processing MVP whole-genome data through the workflow pipeline
  11. Convert VCFs to PGEN format for use with SAIGE
  12. Share VCFs as results become available
  13. Perform post-process analysis on VCFs (e.g., QC, annotations)
  14. Write paper on how VCFs were generated and what was found (computational and scientific?)
  15. Set up SAIGE on Polaris
  16. Run SAIGE on MVP WG PGEN data
  17. Perform QC on SAIGE analysis
  18. Post-process analysis - Will need Anurag and Jenny for this
    • What is different compared to gwPhewas analysis?
    • Novel SNPs, SVs (quantities)
  19. Share data
  20. Write paper on findings
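
The "Submit interactive and script jobs" and "Convert VCF to PGENs" steps above could be combined along the following lines. This is only a minimal sketch: it reuses the project, queue, and resource settings from the interactive qsub example, and it assumes plink2 is installed and on the PATH of the compute node, which this plan does not state. The function name make_pgen_job and the file names are hypothetical placeholders.

```shell
#!/bin/bash
# Sketch: generate a PBS batch script that converts one VCF to PGEN with plink2.
# Assumptions (not from the plan): plink2 is available on the compute node;
# the VCF name and output prefix are placeholders.

# Print a PBS job script to stdout for the given VCF file and output prefix.
make_pgen_job() {
    local vcf="$1" prefix="$2"
    # Unquoted heredoc: ${vcf} and ${prefix} expand now, while the escaped
    # \${PBS_O_WORKDIR} is left for PBS to resolve when the job runs.
    cat <<EOF
#!/bin/bash
#PBS -A covid-ct
#PBS -l select=1
#PBS -l walltime=1:00:00
#PBS -l filesystems=home:eagle
#PBS -q debug

cd "\${PBS_O_WORKDIR}"
plink2 --vcf "${vcf}" --make-pgen --out "${prefix}"
EOF
}

# Example: show the job script that would be generated for one sample.
make_pgen_job sample.vcf.gz sample
```

Redirecting the output to a file (for example, make_pgen_job sample.vcf.gz sample > convert_sample.pbs) gives a script that can be handed to qsub from a login node. Generating the script from a function keeps a single template for all genomes.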

Summary of SV callers

| Caller/web link | Types of SVs | AI based? | Actively developed? | Programming environment | GPU acceleration? |
|---|---|---|---|---|---|
| BreakDancer* | Deletions, insertions, inversions, intra-chromosomal, inter-chromosomal translocations | N | N | C++ | N |
| BreakSeq | Insertions, deletions, translocations, inversions, duplications | N | N | Python | N |
| ClipCrop | Insertions, deletions, translocations, inversions, duplications | N | N | Node.js | N |
| CREST | Insertions, deletions, translocations, inversions, duplications | N | N | Perl | N |
| DELLY | Deletions, inversions, duplications, interchromosomal translocations | N | Y | C++ | N |
| GRIDSS | Insertions, deletions, translocations, inversions, duplications | N | N | Java/R | N |
| Gustaf | Deletions, inversions, duplications, translocations | N | N | C++ | N |
| LUMPY | Deletions, duplications, inversions, translocations | N | Y (June 2022) | C++ | N |
| Manta | Insertions, deletions, translocations, inversions, duplications | N | N | C++ | N |
| Meerkat | Insertions, deletions, translocations, inversions, duplications | N | N | Perl | N |
| Pindel | Insertions, deletions, translocations, inversions, duplications | N | N | C++ | N |
| TARDIS | Tandem and interspersed segmental duplications | N | N | C | N |
| TIGRA | Insertions and deletions | N | N | C++ | N |
| Ulysses | Insertions, deletions, translocations, inversions, duplications | N | N | Python/R | N |
| SvABA | Insertions, deletions, somatic rearrangements | N | N | C++/R | N |
| Socrates | | N | N | Java | N |
| SVSeq2 | | N | N | - | N |
| Cue | Deletions, tandem duplications, inversions, deletion-flanked inversions, inverted duplications larger than 5 kbp | Y | Y | Python | Y |
| Strvctvre | Deletions and duplications | Y | N | Python | - |
| Dysgu | | Y | Y | Python | - |
| CNNgeno | Deletions | Y | N | Python | Y |
| DeepSV | Deletions | Y | N | Python | Y |
| sv-channels | Deletions | Y | Y | Python/R | Y |

* BreakDancer has two modes, BreakDancerMax and BreakDancerMini. While the former is for large SVs, the latter is designed for calling small indels (of 10-100 base pairs) using normally mapped read pairs.

References

This is what others are doing:
