Date | Version |
---|---|
2023-08-28 |
0.9.5 |
This repository documents our Standard of Procedure for processing wastewater data.
Although it can transitionally contain segments of code, these are merely notebooks showing temporary procedures in place or results visualization (e.g.: uploading JSON formatted data to Cov-Spectrum). Any computation (e.g.: variants deconvolution) are usually integrated into the software component we rely upon.
The current software includes:
- V-pipe: Our main Virus NGS Analysis workflow.
- COJAC: Integrated into V-pipe, component tool for early detection based on combination of mutations
- LolliPop: Integrated into V-pipe, component tool for kernel-based deconvolution of variants
We're currently in the process of integrating most of the processing into V-pipe. A prototype of this integration is available in branch ninjaturtles, and the latest version is now config file-based and can be adapted to other sites.
There is no tutorial, yet. In the mean time, a draft HOWTO can help you understand the details and deploy V-pipe in your own setting.
This readme file describes the procedure that is used since 2023-04-08 to prepare the wastewater-based SARS-CoV-2 prevalence display on CoV-Spectrum and provide the data to BAG/FOPH. The main references are doi:10.1038/s41564-022-01185-x (preprint: doi:10.1101/2021.01.08.21249379), and doi:10.1101/2022.11.02.22281825.
It relies on V-pipe for most of the processing. The key steps are:
- Base processing: reads QC, alignments, etc. classic SARS-CoV-2 pipeline
- Cooccurrence analysis: detect variants early using cooccurrences of mutations.
- Wastewater analysis: build variant curves, by deconvoluting for the variants found in the previous step.
A separate procedure documents the steps to add a new variant.
TODO (TO BE DOCUMENTED, check HOWTO for now)
Sample data is processed using the same pipeline configuration as was used to process NGS data for the Swiss SARS-CoV-2 Sequencing Consortium (S3C).
Note this differs from the procedure documented in the HOWTO, as we still do a split process: first base-processing in one working directory with one branch (rubicon), then wastewater processing in a different working directory with a different branch (ninjaturtles).
*[NGS]: Next Generation Sequencing *[S3C]: Swiss SARS-CoV-2 Sequencing Consortium
Summary: all the necessary setup is already contained in the pangolin repository. The repository needs to be cloned with its submodules, and the user needs to provide a bioconda installation in
pangolin/miniconda3
with packages snakemake-minimal and mamba
The pipeline is configured as described in the repository https://github.com/cbg-ethz/pangolin and its submodules.
- branch: master, tip: 1c819ee42b5e112b8db07bc79d38f793981530dd
The V-pipe version provided by pangolin is set up following the same layout as produced by running the installer (it is equivalent to the results of running bash quick_install.sh -b rubicon -p pangolin -w working
, see tutorial and V-pipe's main README.md) and relies on the three directories:
pangolin/miniconda3
-- user-provided installation of miniconda3 with packages snakemake-minimal and mamba installed (see link for installation instruction).pangolin/V-pipe
-- installation of V-pipe as specified by the pangolin repository’s sub-module (simply checkout the submodule), namely:- repository: https://github.com/cbg-ethz/V-pipe
- branch: rubicon, tip: e8be5f1fad7a970e3299e142f329f037f3890fe4
pangolin/working
-- working directory provided by the pangolin repository. Of note:pangolin/working/vpipe.config
-- specific configuration used (the virus-specific base configuration for SARS-CoV-2 relies on bwa for alignment with reference NC_045512 and ShoRAH for SNV and local haplotype calling. In addition, the resource parameters have been fine-tuned to allow the processing of very large cohorts).pangolin/working/
*.sbatch
and.bsub
- jobs used to perform the processing
*[SNV]: Single Nucleotide Variant
This V-pipe setup will store snakemake environments in pangolin/snake-envs
. It is possible to pre-download them in advance by running cd pangolin/V-pipe; ./vpipe --jobs 16 --conda-create-envs-only
(see tutorial). Our installation on ETHZ's Euler cluster performs this download with the script pangolin/working/create_envs
which also takes care of HTTP proxy.
V-pipe takes care of performing quality controls, read filtering, alignments and tallying basecounts. More details about the internal functioning of V-pipe can be found in its primary publication doi:10.1093/bioinformatics/btab015.
V-pipe expects its input in subdirectories of pangolin/working/samples
following a two-level hierarchy, and a 3-column TSV file pangolin/working/samples.tsv
, as demonstrated in the tutorial.
Note for legacy reasons, we are still currently relying on a different layout than modern Snakemake Workflow recommendations and V-pipe official defaults:
- we process everything in the samples
samples/
directory, (V-pipe & Snakemake default: input goes insamples/
, output foes inresults/
)- we keep a copy of raw reads in top-level
sampleset/
that we hard-link intoworking/samples/
(V-pipe & Snakemake default: separate input directory can be directly accessed).- cohort-wide results (reports, etc. that aren't specific to a single samples) are currently in
variants/
(V-pipe default: goes into the base ofresults/
)- we still rely on the legacy INI-like format used by older V-pipe 1.0 and 2.0 (Snakemake default: uses
config.yaml
)
In the current procedure, this is provided autonomously by the setup used for S3C, and is similar to the output generated by scripts provided in pangolin/sort_samples_dumb
and pangolin/sort_samples_demultiplexstats
.
The present procedure relies on the wastewater from Eawag samples following this naming schema, specifically providing a numerical ID for wastewater treatment plant (WWTP), and a sampling date:
*[WWTP]: Wastewater Treatment Plant
┌──────────────── Wastewater Treatment Plant:
│ 05 - CDA Lugano
│ 10 - ARA Werdhölzli in Zurich
│ 12 - STEP Vidy in Lausanne
│ 17 - ARA Chur
│ 19 - ARA Altenrhein
│ 25 - ARA Sensetal
│ ┌───────────── Date
│ │ ┌── Sample properties
┴─ ┴───────── ┴─
09_2020_03_24_B
10_2020_03_03_B
10_2020_03_24_A
10_2020_04_26_30kd
The Kanton Zürich data follow a different schema:
┌───────── Date: YYmmdd)
│ ┌── (optional: alternate kit)
┴───── ┴─
KLZHCov210815_2
Basel has another schema:
┌──────────────────── (internal id)
│ ┌───────────── Date: YYYY-mm-dd
│ │ ┌── (optional: alternate kit)
┴───── ┴───────── ┴─
Ba210449_2021-11-10
- These name can be automatically parsed by regular expressions defined inside
regex.yaml
. - A
ww_locations.tsv
look-up table maps WWTP codes to full names.
(see Wastewater analysis below).
The configuration for running V-pipe resides in the file pangolin/working/vpipe.config
. The configuration as used presently in the procedure is provided in the repository pangolin as mentioned in installation.
In the directory pangolin/working
, submit the job vpipe-noshorah.bsub
(LSF) or vpipe-noshorah.sbatch
(SLURM) to the cluster.
This will run the initial steps of V-pipe but stop before calling SNV and local haplotypes with ShoRAH.
Note COJAC is now part of V-pipe.
Processing uses the V-pipe.
- branch: ninjaturtles, tip: e8be5f1fad7a970e3299e142f329f037f3890fe4
- this branch relies on COJAC 0.9 released on bioconda.
Because the above is still in an experimental state, we run in a separate working area (work-vp-test/
instead of working/
), the prototype branch is in the V-pipe-test
submodule of repo pangolin. Scripts (cowabunga.sh
) assist handling these directories.
The following document assumes user have already performed the base analysis with V-pipe, but because Snakemake is dependency-based, it is also possible to automatically regenerate any intermediate output.
(The script cowabunga.sh
will help you generate those)
/pangolin/work-vp-test/samples.wastewateronly.tsv
-- TSV table of the samples (the subset samples.tsv with wastewater samples)/pangolin/work-vp-test/reference/voc/
*.yaml
-- definition of the variants- a copy of this collection of YAML is in the
voc/
subdirectory - TODO procedure for creating and curating new one
- a copy of this collection of YAML is in the
/pangolin/work-vp-test/reference/primers.yaml
and/pangolin/work-vp-test/reference/primers/
: multiplex PCR amplicon definitions- a copy of this is provided in SARS-CoV-2 specific resource packaged as part of V-pipe
- at the time of writing, this includes ARTIC protocol version 3, 4, and 4.1
- documentation about amplicons is found in section amplicons of the config's README file
/pangolin/work-vp-test/results/
*/
*/alignment/REF_aln_trim.bam
-- Alignments generated by V-pipe (see base processing above).- Note that analysis is run on the trimmed alignments for convenience, but due to its nature (it's looking at the exact amplicons of the multiplex PCR), it's not affected by trimming.
Summary: actual processing is handled by V-pipe, here we document the exact commands that we use locally on our cluster.
Because the above is still in an experimental state, we run in a separate working area (work-vp-test/
instead of working/
, and the V-pip installed in V-pipe-test/
instead of V-pipe/
).
In the directory pangolin
, there is a tool called cowabunga.sh
(that can assist in preparing data for running V-pipe).
- run
./cowabunga.sh autoaddwastewater
- this adds the wastewater samples at top of file
/pangolin/work-vp-test/samples.wastewateronly.tsv
. The wastewater samples are recognized based on the project ID that the sequencing labb -- the Functional Genomic Center Zürich (FGCZ) -- assigns them to.
- this adds the wastewater samples at top of file
- run
./cowabunga.sh bring_results
- this creates the necessary hard-links inside
/pangolin/work-vp-test/results
so that V-pipe can reuse the alignments previously done as part of the base processing.
- this creates the necessary hard-links inside
- submit V-pipe to the cluster with SLURM:
- for interactive output on the terminal:
cd work-vp-test/ srun --pty --job-name=COWWID-vpipe --mem-per-cpu=4096 --mail-type=END --mail-user="ivan.topolsky@bsse.ethz.ch" --time=23:00:00 -- vpipe-test.sbatch
- for batch submission:
cd work-vp-test/ sbatch vpipe-test.sbatch
-
Note this will run both the cooccurrence and the deconvolution, see content of
vpipe-test.sbatch
:to only run COJAC use rules such as:… exec ./vpipe --profile ${SNAKEMAKE_PROFILE} --restart-times=0 \ allCooc variants/cohort_cooc_report.v41.csv tallymut deconvolution \ --groups sigmut=group0 cooc=group1 \ --group-components group0=10 group1=5
variants/amplicon.v41.yaml
-- amplicon description for protocol Artic v4.1.allCooc
-- runs COJAC on all samples.variants/cohort_cooc_report.v41.csv
-- assembles all COJAC results from all samples processed with Artic v4.1 into a table as featured in the additonnal materials of articles.
- for interactive output on the terminal:
*[FGCZ]: Functional Genomic Center Zürich
/pangolin/work-vp-test/variants/amplicons.v
*.yaml
-- amplicon description (generated from COJAC's definitions)/pangolin/work-vp-test/results/
*/
*/signatures/cooc.yaml
-- internal COJAC format with coocurrences/pangolin/work-vp-test/variants/cohort_cooc.
*.yaml
and.csv
-- COJAC results of all samples, aggregated by protocol./pangolin/work-vp-test/variants/cohort_cooc_report.
*.csv
-- COJAC results, in a publication-like table
You can load cohort_cooc_report.v41.csv in Libre-Calc, Excel, etc. or, for visualization on the terminal….
…refer to COJAC's README.md to display output:
# on the terminal of your laptop:
cojac cooc-colourmut --amplicons amplicons.v41.yaml --yaml cooc.v41.yaml | less -SR
Advices:
- Based on the curate information gathered from the background on Cov-Spectrum, you might want to edit and make a subset out of amplicon.v41.yaml that concentrates on the interesting amplicons and mutations for the variants you are seeking to detect early.
less -SR
will enable horizontal scrolling and will support the ANSI colour codes
Manually edit and adapt work-vp-test/var_dates.yaml
(the list of all variant will be correctly auto-guessed, no need to use one in work-vp-test/variants_config.yaml
) to require deconvolution to include these newly detected variants (see next section "Wastewater analysis").
Note
- LolliPop is now part of V-pipe and runs the deconvolution
- unlike the notebook
cojac/notebooks/ww_smoothing.ipynb
used in the Nature Microbiology article, posterior probabilities of the local haplotype generated by ShoRAH are not taken into account, and the curves are built using LolliPop's kernel based deconvolution which is both more robust against variants that share mutations in their exhaustive lists, and able to leverage the time component to compensate the extreme over-dispersion of typical wastewater samples. See LolliPop's preprint for details.
Processing uses the V-pipe.
- branch: ninjaturtles, tip: e8be5f1fad7a970e3299e142f329f037f3890fe4
- this branch relies on LolliPop 0.3 released on bioconda.
Because the above is still in an experimental state, we run in a separate working area (work-vp-test/
instead of working/
), the prototype branch is in the V-pipe-test
submodule of repo pangolin. Scripts (cowabunga.sh
) assist handling these directories.
You still need to update manually work-vp-test/var_dates.yaml
and re-run V-pipe if coocurrences have revealed early presence of another variant. See LolliPop's README.md for details.
/pangolin/work-vp-test/variant_config.yaml
-- general information (variants, locations) used for deconvolution- locations_list (fomerly cities_list): full names of the locations for which to generate curves (see right column of
ww_locations.tsv
). - variants_pangolin - dictionary mapping short names to the official PANGO lineages (see: "pangolin:" and "short:" fields in the
voc/
*.yaml
description)- V-pipe will auto-generate such a list in
/pangolin/work-vp-test/variants/variants_pangolin.yaml
you can copy-paste the section)
- V-pipe will auto-generate such a list in
- start_date: first date for the deconvolution
- to_drop: category of mutation to skip from tallymut (e.g.:
subset
), - (variants_not_reported and variants_list aren't provided manually anymore as LolliPop can autoguess them correctly)
- locations_list (fomerly cities_list): full names of the locations for which to generate curves (see right column of
/pangolin/work-vp-test/var_dates.yaml
-- which variants LolliPop needs to deconvolute for over which period of time.- at each starting date, list the combination of variants that will be deconvoluted for the next period of time, as determined by COJAC in the section above.
Configuration file regex.yaml
defines regular expressions that help parse the samples names as per section Processing raw-reads with V-pipe above.
sample
(and optionallybatch
) define regular expressions that are run against the first (and optionally second) column of V-pipe'ssamples.tsv
. They define the following named-groupslocation
: this named-group gives the code for the location (e.g.: Ewag's number code,Ba
,KLZH
, etc.)year
: year (inYYYY
orYY
format.YY
are automatically expanded to20YY
--- Yes, I am optimistic with the duration of this pandemic. Or pessimistic with long term use of V-pipe after the turn of century ;-) ).month
: monthday
: daydate
: an alternative to the year/month/day groups, if dates aren't in a standard format.- regex are parsed with the Python regex library, and multiple named groups can use the same name.
You can thus have a construction where you use
|
to give multiple alternative as long as each provide named-groupslocation
and eitheryear
,month
, andday
ordate
:(I swear I have personally typed the line above. It has nothing to do with cats walking on my keyboard).(?:(?P<location>\d+)_(?P<year>20\d{2})_(?:(?:(?P<month>[01]?\d)_(?P<day>[0-3]?\d))|(?:R_(?P<repeat>\d+))))|^(?:(?P<location>KLZHCo[vV])(?P<year>\d{2})(?P<month>[01]?\d)(?P<day>[0-3]?\d)(?:_(?P<location_extra>\w+))?)|^(?:(?P<location>B[aA])(?P<BAsam>\d{6})(?:[-_](?P<year>20\d{2})-(?P<month>[01]?\d)-(?P<day>[0-3]?\d))?)
datefmt
: strftime/strptime format string to be used on regex named groupdate
(e.g.: use"%Y%m%d"
to parse YYYYMMDD).- This is most useful for date formats that don't split nicely into the
year
,month
, andday
regex named groups: e.g. if your date format uses week number, day of the week, or day of year. In that case, write a regular expression that provides a named-groupdate
, and then use, e.g.,%W%w
or%j
in yourdatefmt
.
- This is most useful for date formats that don't split nicely into the
A look-up table ww_locations.tsv
maps the location code (see location
regex named group in the previous file) to their full description.
You can find a copy of these configuration in this repo.
The following document assumes user have already performed the base analysis and the cooccurrence with V-pipe, but because Snakemake is dependency-based, it is also possible to automatically regenerate any intermediate output.
(the script cowabunga.sh
will have already generated most of it for you)
/pangolin/work-vp-test/samples.wastewateronly.tsv
-- TSV table of the samples (the subset samples.tsv with wastewater samples)/pangolin/work-vp-test/reference/voc/
*.yaml
-- definition of the variants- a copy of this collection of YAML is in the
voc/
subdirectory - TODO procedure for creating and curating new one
- a copy of this collection of YAML is in the
/pangolin/work-vp-test/work-vp-results/
*/
*/alignment/basecnt.tsv.gz
-- Basecounts (pileup-like coverage plot) generated by V-pipe (see base processing above)./pangolin/work-vp-test/references/gffs/Genes_NC_045512.2.GFF3
-- genes table (used to add Gene names into the tallymut)./pangolin/work-vp-test/deconv_bootstrap_cowwid.yaml
-- preset for running the deconvolution with bootstrapping to generate confidence intervals (slower).
Note When variant curves approach 0, there are instabilities with the error margins generated by the current deconvolution methods when using Wald confidence interval. We use the slower bootstrap deconvolution preset instead.
Summary: actual processing is handled by V-pipe, here we document the exact commands that we use locally on our cluster.
Because the above is still in an experimental state, we run in a separate working area (work-vp-test/
instead of working/
, and the V-pip installed in V-pipe-test/
instead of V-pipe/
).
In the directory pangolin
, there is a tool called cowabunga.sh
(that can assist in preparing data for running V-pipe).
- run
./cowabunga.sh autoaddwastewater
- this adds the wastewater samples at top of file
/pangolin/work-vp-test/samples.wastewateronly.tsv
- this adds the wastewater samples at top of file
- run
./cowabunga.sh bring_results
- this creates the necessary hard-links inside
/pangolin/work-vp-test/results
so that V-pipe can reuse the alignments previously done as part of the base processing.
- this creates the necessary hard-links inside
- submit V-pipe to the cluster with SLURM:
- for interactive output on the terminal:
cd work-vp-test/ srun --pty --job-name=COWWID-vpipe --mem-per-cpu=4096 --mail-type=END --mail-user="ivan.topolsky@bsse.ethz.ch" --time=23:00:00 -- vpipe-test.sbatch
- for batch submission:
cd work-vp-test/ sbatch vpipe-test.sbatch
-
Note this will run both the cooccurrence and the deconvolution, see content of
vpipe-test.sbatch
:to only run LolliPop use rules such as:… exec ./vpipe --profile ${SNAKEMAKE_PROFILE} --restart-times=0 \ allCooc variants/cohort_cooc_report.v41.csv tallymut deconvolution \ --groups sigmut=group0 cooc=group1 \ --group-components group0=10 group1=5
tallymut
-- will run all the way to generating the heatmap-like structure (a tally of all mutations over dates and location) that serves as an input of the deconvolution.deconvolution
-- runs the deconvolution itself
- for interactive output on the terminal:
/pangolin/work-vp-test/variants/tallymut.tsv.zstd
-- table of all variant-characteristic mutations found/pangolin/work-vp-test/variants/deconvolute.tsv.zstd
-- output of the deconvolution/pangolin/work-vp-test/variants/deconvolute_upload.json
-- curves in a JSON format for uploading onto dashboards.
Note since switching to the exhaustive mutation list, the mutation heatmap display has become unusable and is not generated anymore.
Todo better visualization of the presence of variants.
Notebook DeconvolutionPrediagnostics.ipynb
help diagnose problems related to data (drop-out affecting mutations) and/or signatures (too much similarity) that could affect the deconvolution.
Note Previsualisation of the curves won't be done here, but is done before uploading to dashboard instead.
Configuration of the notebook relies on the same files work-vp-test/variants.yaml
and work-vp-test/var_dates.yaml
as used by V-pipe.
./work-vp-test/variants//tallymut.tsv.zst
-- table of all variant-characteristic mutations generated by V-pipe
Run all cells of DeconvolutionPrediagnostics.ipynb.
Notebook ww_cov_uploader_V-pipe.ipynb
previsualizes the curves generated by V-pipe before uploading and handles the upload onto CoV-Spectrum and BAG/FOPH's dashboards.
/pangolin/work-vp-test/variants/deconvolute_upload.json
-- V-pipe generated curves in a JSON format for uploading onto dashboards.
In addition the actual upload itself requires database credentials. These are currently retrieved from ~/.netrc
standard Unix credential storage, but alternative methods are commented-out in the cells under the "Upload to Cov-Spectrum" section.
Configuration of graph colors and input/output path is in section "Globals".
For better uniformity of all Swiss reporting, it is good practice to keep colors in sync between reports. Check upstream configuration of covariants.org for colors used by variants. Conversely, send pull requests on Cov-Spectrum's configuration file to update colours of the dashboard.
- Run up to and including "Plot"
- inspect the plots for anomalies.
- Share the generated plot
- Run cells in section "Upload to Cov-Spectrum" with the following:
- the cell tagged "
## Abort DB update !
" is used only to abort the update and rollback to the initial state in case or problems. It is skipped otherwise. - the cell tagged "
## Save to DB !
" is used instead to make the changes permanent in the database.
- the cell tagged "
- Run the remaining cells, section "Upload to FOPH/BAG's Polybox"
plots/combined-vpipe.pdf
-- previsualisation of the plotsww_update_data_combined.json
-- back-up of the data before upload (note: as mentioned before, the heatmap part is currently empty)./tmp/ww_update_data_smooth_kernel_rob.json
-- JSON reformated in the older format (order variants => locations => data, instead of the current locations => variants => data)
The curves are now loaded up on CoV-Spectrum. The curves are now shared on Polybox for BAG/FOPH to pick up.
Due to the excessively long list of mutations generated from Cov-Spectrum, and the excessively complex amplicon-cooccurrence that results of it, heatmaps aren't displayed anymore as they are unreadable. We are working toward fixing this.
Unlike the pre-print doi:10.1101/2021.01.08.21249379, the current procedure skips running ShoRAH: Instead of trying to build confidence at the level of single mutation, current method in LolliPop relies instead on a kernel-based deconvolution that enables us to leverage the time component and gives us higher confidence in our call despite drop-outs and extreme dispersion of data.
As an additional measure to avoid outliers, mutations which aren’t exclusive to a single variant aren’t taken into account in B.1.351/beta and P.1/gamma. E.g.: mutation nuc: A23063T (aa: N501Y) will not be used there as it is also present in B.1.1.7/alpha and would artificially inflate the numbers. Other similar signature mutations which do not follow the general trend of a variant are also filtered out.
'Undetermined' is defined by the absence of signature mutations, thus it doesn't have its own heatmap (it is defined by the complement of all other heatmaps).
The values of the curves are currently clipped to the [0;1] interval as the smoothing can currently generate values outside this range.
As the procedure is improved these caveats are subject to change in future revision of this document.