Comprehensive cancer signatures with reusable modules written in Python, integrating SNV, SV and MSI profiles into signatures decomposed using non-negative matrix factorisation, and producing production-ready PDF reports.
Dependencies - Currently, feature extraction of structural variants is based on data generated by FindSV, and feature extraction of microsatellite instability is based on data generated by MSIsensor.
Install the dependencies, then download and install pyCancerSig
git clone https://github.com/jessada/pyCancerSig.git
cd pyCancerSig
python setup.py install
echo -e "# set pyCancerSig environment variable\nexport PYCANCERSIG=`pwd`\n" >> ~/.bashrc
source ~/.bashrc # or logout and re-login
The workflow consists of 4 steps
- Data preprocessing - The purpose of this step is to generate a list of variants and/or related information. This step has to be performed by third-party software.
- Single nucleotide variant (SNV) - MuTect2 is recommended; alternatives are Muse, VarScan2, or SomaticSniper.
- Structural variant (SV) - depends on FindSV
- Microsatellite instability (MSI) - depends on MSIsensor
A note regarding VCF files generated by FindSV: even though the VCF standard supports SVs, the output of different callers may not always be fully interchangeable. Specifically, the “END” tag added by many callers and a “CHR2” tag are parsed out from the INFO field. Other information not evident from the VCF definition can be parsed by replacing or modifying the custom parseVCFLine function, as was done for FindSV. If any other SV caller is used, we advise users to develop a parser to replace
cancersig profile sv
For MSIsensor,
cancersig profile msi
will look at the *_somatic output files. Each line represents one MSI locus. The fifth column indicates the repeat pattern.
- Profiling (Feature extraction) -
cancersig profile
- The purpose of this step is to turn the information generated in the first step into matrix features usable by the model in the next step. The output of this stage has a format similar to https://cancer.sanger.ac.uk/cancergenome/assets/signatures_probabilities.txt, which consists of at least 3 columns.
- Column 1, Variant type (Substitution Type in COSMIC)
- Column 2, Variant subgroup (Trinucleotide in COSMIC)
- Column 3, Feature ID (Somatic Mutation Type in COSMIC)
- From column 4 onward, each column represents one sample
There is a subcommand for each type of genetic variation
cancersig profile snv
is for extracting single nucleotide variant features
cancersig profile sv
is for extracting structural variant features
cancersig profile msi
is for extracting microsatellite instability features
cancersig profile merge
is for merging all feature profiles into a single profile ready to be used by the next step
- Deciphering mutational signatures -
cancersig signature decipher
- The purpose of this step is to use an unsupervised learning model to find mutational signature components in the tumors.
- Visualizing profiles -
cancersig signature visualize
- The purpose of this step is to visualize the mutational signature components of each tumor.
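The profile layout described in step 2 (three annotation columns followed by one column per sample) can be read with a few lines of Python. This is only a minimal sketch: it assumes a tab-separated file with a header row, and `read_profile` is a hypothetical helper, not a function provided by pyCancerSig. The header names and values in the example are made up.

```python
import csv
import io


def read_profile(handle):
    """Return (sample_ids, {feature_id: [one value per sample]})."""
    reader = csv.reader(handle, delimiter="\t")  # assumption: tab-separated
    header = next(reader)
    sample_ids = header[3:]  # columns 4 onward are samples
    features = {}
    for row in reader:
        if not row:
            continue
        # Column 3 is the Feature ID; the rest are per-sample values.
        features[row[2]] = [float(value) for value in row[3:]]
    return sample_ids, features


# Toy input in the three-columns-plus-samples layout (values are invented).
example = io.StringIO(
    "Variant type\tVariant subgroup\tFeature ID\tsample1\tsample2\n"
    "C>A\tACA\tA[C>A]A\t0.011\t0.020\n"
    "C>A\tACC\tA[C>A]C\t0.009\t0.015\n"
)
samples, features = read_profile(example)
print(samples)
print(features["A[C>A]A"])
```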
usage: cancersig <command> [options]
Key commands:
profile extract mutational profile
signature decipher mutational signature components and visualize them from mutational profiles
cancersig profile
key commands:
snv extract SNV mutational profile
sv extract SV mutational profile
msi extract MSI mutational profile
merge merge all mutational profiles into a single profile
cancersig signature
key commands:
decipher perform unsupervised learning model to find mutational signature components
visualize visualize mutational signatures identified in tumors
cancersig profile snv
[options]:
-i {file} input VCF file (required)
-r {file} path to genome reference (required)
-o {file} output snv feature file (required)
cancersig profile sv
[options]:
-i {file} input VCF file (required)
-o {file} output sv feature file (required)
cancersig profile msi
[options]:
--raw_msisensor_report {file} an output from "msisensor msi" that has only the MSI score (percentage of MSI loci) (required)
--raw_msisensor_somatic {file} an output from "msisensor msi" that has the suffix "_somatic" (required)
--sample_id {id} a sample id to be used as a column header in the output file (required)
-o {file} output msi feature file (required)
cancersig profile merge
[options]:
-i {directories} comma-separated directories containing feature files to be merged (required)
-o {file} output merged feature file (required)
--profile_types [SV,SNV,MSI] profile types to be merged (default: SV,SNV,MSI)
cancersig signature decipher
[options]:
--mutation_profiles {file} input mutation catalog to be deciphered (required)
--min_signatures minimum number of signatures to be deciphered (default=2)
--max_signatures maximum number of signatures to be deciphered (default=15)
--out_prefix output file prefix (required)
cancersig signature visualize
[options]:
--mutation_profiles {file} input mutation catalog to be reconstructed (required)
--signatures_probabilities {file} input file with deciphered cancer signatures probabilities (required)
--output_dir {directory} output directory (required)
As data preprocessing is performed by third-party software, please check the website of each tool for its documentation.
cancersig profile snv
will
- scan the VCF (or vcf.gz) file in the genotype field for SNV changes on both strands
- then, use the genomic coordinates to look up the 5' and 3' base in the reference fasta (using samtools)
- then, perform SNV profiling of the sample by counting the number of SNVs in each category and dividing it by the total number of variants in the sample.
The sample id in the output feature file will be the same as sample id in the input VCF file.
Example run:
cancersig profile snv -i input.vcf.gz -r /path/to/reference.fa -o snv_feature.txt
Example SNV feature output from Example SNV input.vcf.gz
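The counting and normalisation steps above can be sketched as follows. This is an illustration, not the package's implementation: it assumes the 5' and 3' bases have already been looked up in the reference, and it assumes the COSMIC-style convention of folding every change onto the pyrimidine strand (so the reference base is always C or T). The function names `snv_category` and `snv_profile` are hypothetical.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def snv_category(five_prime, ref, alt, three_prime):
    """Return a feature ID such as 'A[C>T]G' on the pyrimidine strand."""
    if ref in ("A", "G"):
        # Reverse-complement so that the reference base becomes C or T.
        five_prime, three_prime = COMPLEMENT[three_prime], COMPLEMENT[five_prime]
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{five_prime}[{ref}>{alt}]{three_prime}"


def snv_profile(snvs):
    """Count SNVs per category and divide by the total number of SNVs."""
    counts = Counter(snv_category(*snv) for snv in snvs)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.items()}


# Toy calls as (5' base, ref, alt, 3' base); the values are invented.
calls = [("A", "C", "T", "G"), ("C", "G", "A", "T"), ("T", "C", "A", "A")]
profile = snv_profile(calls)
print(profile)
```

In this toy input the second call (G>A in a C_T context) is the reverse complement of the first, so both land in the same category.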
cancersig profile sv
will
- check INFO field "SVTYPE" to determine type of structural variation
- check INFO field "END" for calculating the length of the event
- then, perform SV profiling of the sample by counting the number of SVs in each category and dividing it by the total number of variants in the sample.
The sample id in the output feature file will be the same as sample id in the input VCF file (column 10).
Example run:
cancersig profile sv -i input.vcf -o sv_feature.txt
Note: Currently, cancersig profile sv
only accepts uncompressed VCF files
Example SV feature output from Example SV input.vcf
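For users writing their own SV parser, the steps above can be sketched roughly as below. The sketch reads "SVTYPE", "END" and "CHR2" from the INFO field, derives the event length, and normalises the counts. The length bins, the category labels and the function names are all assumptions made for illustration; the categories actually used by pyCancerSig may differ.

```python
from collections import Counter

# Hypothetical length bins as (upper bound in bp, label).
LENGTH_BINS = [(10_000, "<10kb"), (1_000_000, "10kb-1Mb"), (float("inf"), ">=1Mb")]


def parse_info(info):
    """Turn 'SVTYPE=DEL;END=6000' into {'SVTYPE': 'DEL', 'END': '6000'}."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def sv_category(vcf_line):
    cols = vcf_line.rstrip("\n").split("\t")
    chrom, pos, info = cols[0], int(cols[1]), parse_info(cols[7])
    svtype = info.get("SVTYPE", "UNKNOWN")
    # No meaningful length without END, or when the mate is on another chromosome.
    if "END" not in info or info.get("CHR2", chrom) != chrom:
        return svtype
    length = abs(int(info["END"]) - pos)
    for upper, label in LENGTH_BINS:
        if length < upper:
            return f"{svtype}_{label}"


def sv_profile(vcf_lines):
    """Count SVs per category and divide by the total number of SVs."""
    records = (l for l in vcf_lines if l.strip() and not l.startswith("#"))
    counts = Counter(sv_category(line) for line in records)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()} if total else {}


# Toy records (invented), with the sample genotype in column 10.
lines = [
    "##fileformat=VCFv4.1",
    "1\t1000\t.\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=6000\tGT\t0/1",
    "1\t9000\t.\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=9500\tGT\t0/1",
    "1\t50000\t.\tN\t<DUP>\t.\tPASS\tSVTYPE=DUP;END=250000\tGT\t0/1",
    "2\t300\t.\tN\tN[5:100[\t.\tPASS\tSVTYPE=BND;CHR2=5;END=100\tGT\t0/1",
]
profile = sv_profile(lines)
print(profile)
```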
cancersig profile msi
will
- scan for all possible repeat patterns of repeat units of size 1 to 3
- for repeat units of size 4 to 5, just count them, with no sub-categories
- then, perform MSI profiling of the sample by counting the number of repeats in each category and dividing it by the total number of repeats.
The sample id in the output feature file has to be supplied as an input argument (--sample_id).
Example run:
cancersig profile msi --raw_msisensor_report msisensor_out --raw_msisensor_somatic msisensor_out_somatic --sample_id example_sample -o msi_feature.txt
Example MSI feature output from Example msisensor_out and Example msisensor_out_somatic
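A rough sketch of the categorisation described above is given below. It assumes that each line of the *_somatic file is whitespace-separated with the repeat unit in the fifth column and that there is no header row. The category label for longer repeat units and the function names are hypothetical.

```python
from collections import Counter


def msi_category(repeat_unit):
    """Units of size 1 to 3 keep their pattern; longer units are grouped by size."""
    size = len(repeat_unit)
    if size <= 3:
        return repeat_unit
    return f"repeat_unit_length_{size}"  # hypothetical label


def msi_profile(somatic_lines):
    """Count loci per category and divide by the total number of loci."""
    counts = Counter()
    for line in somatic_lines:
        cols = line.split()
        if len(cols) < 5:
            continue  # skip blank or truncated lines
        counts[msi_category(cols[4])] += 1
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()} if total else {}


# Toy rows (invented); only the fifth column matters for this sketch.
rows = [
    "1\t1000\tACGTA\t12\tA\tTTGCA",
    "1\t2000\tGGCTA\t8\tA\tCCATG",
    "2\t3000\tTTACG\t6\tAC\tGGATC",
    "3\t4000\tCATGG\t5\tAAAG\tTCCGA",
]
profile = msi_profile(rows)
print(profile)
```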
cancersig profile merge
will
- scan for *feature.txt or *profile.txt files in the input folder(s)
- if a sample has features for all mutation types (SNV, SV, MSI), they will be merged into one profile. The percentage weights of SNV, SV and MSI are 70%, 30% and 10% respectively, which can be redefined in features.py.
Example run:
cancersig profile merge -i /path/to/first/dir,/path/to/second/dir -o merged_feature.txt
Example run for merging certain profile types (SV and SNV in this case):
cancersig profile merge -i /path/to/first/dir,/path/to/second/dir -o merged_feature.txt --profile_types SNV,SV
Example merged feature file from example input directories -i /path1,/path2,/path3,/path4,/path5,/path6
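Weighted merging of per-type profiles can be illustrated as below. This is a sketch under stated assumptions, not the logic in features.py: the weights are illustrative placeholders rather than the package defaults, the weights of the selected profile types are rescaled to sum to 1, and the `TYPE:feature` key prefix and the name `merge_profiles` are hypothetical.

```python
def merge_profiles(profiles, weights):
    """Merge {type: {feature_id: fraction}} into one weighted profile.

    Each per-type profile is expected to sum to 1. The weights of the
    types actually present are rescaled so the merged profile sums to 1.
    """
    total_weight = sum(weights[ptype] for ptype in profiles)
    merged = {}
    for ptype, profile in profiles.items():
        scale = weights[ptype] / total_weight
        for feature_id, fraction in profile.items():
            merged[f"{ptype}:{feature_id}"] = fraction * scale
    return merged


# Illustrative placeholder weights and toy profiles (all values invented).
weights = {"SNV": 0.5, "SV": 0.3, "MSI": 0.2}
profiles = {
    "SNV": {"A[C>T]G": 1.0},
    "SV": {"DEL_<10kb": 0.5, "BND": 0.5},
    "MSI": {"A": 1.0},
}
merged = merge_profiles(profiles, weights)
print(merged)

# Merging only a subset of types, as with --profile_types SNV,SV.
subset = merge_profiles({k: profiles[k] for k in ("SNV", "SV")}, weights)
print(subset)
```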
cancersig signature decipher
will
- load mutational matrix profile
- identify underlying mutational signatures following the Experimental Procedures of Alexandrov et al., Cell Reports, 2013
Example run:
cancersig signature decipher --mutation_profiles merged_mutational_profile.txt --out_prefix deciphered_output_file_prefix
Example output:
- Example of signature probabilities (2 signatures)
- Example of deciphered signatures (2 signatures) from Example input mutation_profile
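The core factorisation step can be illustrated with a small, dependency-free sketch of non-negative matrix factorisation using multiplicative updates. It is deliberately simplified: it does not include the resampling, clustering or selection of the number of signatures described in the cited procedure, and it is not the code used by cancersig signature decipher. The toy signatures, exposures and function names are invented for the example.

```python
import random


def matmul(a, b):
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def nmf(v, n_signatures, n_iter=500, seed=0, eps=1e-12):
    """Factorise v (features x samples) into w (features x k) and h (k x samples)."""
    rng = random.Random(seed)
    n_features, n_samples = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(n_signatures)] for _ in range(n_features)]
    h = [[rng.random() + eps for _ in range(n_samples)] for _ in range(n_signatures)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_samples)]
             for i in range(n_signatures)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_signatures)]
             for i in range(n_features)]
    return w, h


def frobenius_error(v, w, h):
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5


# Toy catalog built from two invented signatures over four features.
true_signatures = [[0.7, 0.0], [0.2, 0.1], [0.1, 0.3], [0.0, 0.6]]
true_exposures = [[10, 2, 5, 8, 1], [1, 9, 5, 2, 7]]
v = matmul(true_signatures, true_exposures)

w, h = nmf(v, n_signatures=2)
print("reconstruction error:", frobenius_error(v, w, h))
```

Running such a factorisation for each candidate number of signatures between --min_signatures and --max_signatures and comparing the results is one way to picture what the command explores, although the selection criterion itself is not shown here.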
cancersig signature visualize
will
- display mutational signature composition of the sample
- display the original mutational profile
- display the reconstructed mutational profile (based on the recomposition)
- display the reconstruction error
Example run:
cancersig signature visualize --mutation_profiles merged_mutational_profile.txt --signatures_probabilities signatures_probabilities.txt --output_dir /path/to/output/dir
Example cancersig profile of sample1,sample2,sample3,sample4, and normalized_weights from input mutation_profile and signatures_probabilities
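The quantities shown in the report (normalised weights, reconstructed profile and reconstruction error) are related as in the sketch below. It assumes the error is measured as the Euclidean distance between the original and the reconstructed profile, which may not be the metric used by the tool. All numbers and function names are invented for illustration.

```python
def normalise(weights):
    """Scale signature weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def reconstruct(signatures, weights):
    """signatures: features x k matrix; weights: k values for one sample."""
    return [sum(s * w for s, w in zip(row, weights)) for row in signatures]


def reconstruction_error(original, reconstructed):
    # Assumption: Euclidean distance between the two profiles.
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed)) ** 0.5


# Two toy signatures over four features, and one sample's raw weights.
signatures = [[0.7, 0.0], [0.2, 0.1], [0.1, 0.3], [0.0, 0.6]]
weights = normalise([3, 1])
original = [0.5, 0.2, 0.15, 0.15]

reconstructed = reconstruct(signatures, weights)
error = reconstruction_error(original, reconstructed)
print(weights, reconstructed, error)
```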
The time needed to process variants depends on the size of the data and the configuration of the machine. The following performance figures were measured on the Uppsala Multidisciplinary Center for Advanced Computational Science computational cluster “bianca”, on a single Intel Xeon E5-2630 v3 core with 8 GB RAM allocated.
cancersig profile snv
can process 3523 variants/second
cancersig profile sv
can process 17550 variants/second
cancersig profile msi
can process 218450 loci/second
cancersig signature decipher
took 33 minutes to process the combined profiles of 130 samples
cancersig signature visualize
took 9 seconds to generate the pdf file of one sample.
If users rely on variant callers whose output format cannot be recognized by cancersig profile snv, cancersig profile sv, or cancersig profile msi, they can replace any profiler in this package with their own parser. We have provided example input and output files for every process in the example sections. As long as the new parser generates output files in the same format as the given examples, the workflow should continue to work correctly.
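A replacement parser mainly needs to emit a feature file in the same layout as the provided examples. The sketch below writes a single-sample file with three annotation columns followed by the sample column. The tab delimiter, the header names and the helper `write_feature_file` are assumptions; the provided example files should be treated as the authoritative format.

```python
import csv
import io


def write_feature_file(handle, sample_id, rows):
    """Write rows of (variant_type, variant_subgroup, feature_id, value)."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    # Assumed header names; check them against the provided example files.
    writer.writerow(["Variant type", "Variant subgroup", "Feature ID", sample_id])
    for variant_type, subgroup, feature_id, value in rows:
        writer.writerow([variant_type, subgroup, feature_id, value])


# Toy output for one sample (categories and values are invented).
buffer = io.StringIO()
write_feature_file(
    buffer,
    "my_sample",
    [("DEL", "<10kb", "DEL_<10kb", 0.5), ("BND", "BND", "BND", 0.5)],
)
print(buffer.getvalue())
```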
If you need more information or have any questions, please don't hesitate to contact jessada.thutkawkorapin@gmail.com