Skip to content

Latest commit

 

History

History
304 lines (257 loc) · 17.3 KB

README.md

File metadata and controls

304 lines (257 loc) · 17.3 KB

plot

pySpade: Single cell Perturbations - Analysis of Differential gene Expression

Overview


pySpade performs the whole transcriptome analysis of single cell perturbation datasets. With the direct output of Cellranger, pySpade utilizes hypergeomtric test to analyze the whole transcriptome differential expression and generates hits table csv file and Manhattan plots. Currently we support human genome.

Requirement


  • Python (=3.7)
  • Numpy (=1.21)
  • Pandas (=1.3.5)
  • Scipy (<=1.6.2)
  • Matplotlib(=3.5)

Installation


pySpade can be installed with pip

pip install pySpade

Usage

Over-view


$pySpade
usage: pySpade [-h]  ...

pySpade 
Version: 0.1.5

optional arguments:
  -h, --help  show this help message and exit

functions:
  
    process   process mapping output and reformat for downstream analysis.
    explevel  check the average expression level of query genes in single cell matrix
    fc        check the fold change of sgrna
    DEobs     perform differential expression analysis of observed cells
    DErand    perform differential expression analysis of random selection background
    local     perform local hit analysis with observation data and random background
    global    perform global hit analysis with observation data and random background
    manhattan   generate Manhattan plots for each perturbation region
    
  • process : Process transcriptome output and sgrna output to remove experimental doublets and sgrna outlier cells.
    • Input 1: Transcriptome matrix path from Cellranger output (The folder contains filtered_feature_bc_matrix.h5 file).
    • Input 2: sgrna matrix column: cell barcodes consistent with transcriptome matrix, rows: sgrna sequence or sgrna names. The sgrna matrix is already filtered out potential noise sgrna. Boolean values of the sgrna matrix are used for analysis. Acceptable format: pkl or csv.
    • The final output format is h5 file. The final output can be compressed to save disk space, but it may take more time to write the final output file.
usage: pySpade process [-h] -f FEATURE_BC -s INPUT_SGRNA [-m CELL_MULTIPLEX] -o OUTPUT_DIRECTORY [-c COMP]

Process transcriptome output and sgrna output to remove experimental doublets. Transcriptome matrix is from Cellranger
output (outs folder),sgrna matrix is from fba output, other accepted format include pkl and csv file.sgrna matrix column:
cell barcodes consistent with transcriptome matrix, rows: sgrna sequence.The final output format is h5 file.

optional arguments:
  -h, --help            show this help message and exit
  -f FEATURE_BC, --feature_bc FEATURE_BC
                        Specify the output folder from cellranger pipeline (outs folder)
  -s INPUT_SGRNA, --sgrna INPUT_SGRNA
                        Specify the sgrna matrix file.Please make sure the barcodes are consistent with transcriptome if
                        there are multiple libraries. File format: pkl or csv
  -m CELL_MULTIPLEX, --cell_multiplex CELL_MULTIPLEX
                        specify the used antibody name in txt file separate with new line. Make sure that the antibody is
                        mapped together with transcriptome. Cellranger is recommended.
  -o OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
                        Specify output directory. The output file format will be h5
  -c COMP, --comp COMP  Final h5 file compress or not. "True" or "False", default is False.
  • explevel : Check the average expression level of query genes in single cell matrix.
    • Input 1: processed transcriptome matrix from the process output.
    • Input 2: Query genes list has to be txt file, genes are seperated with new line.
usage: pySpade explevel [-h] -t TRANSCRIPTOME_DIR -g GENE [-o OUTPUT_FILE]

Check the average expression level of query genes in single cell matrixInput: processed transcriptome matrix from the
process output,query genes list has to be txt file, genes are seperated with new line.

optional arguments:
  -h, --help            show this help message and exit
  -t TRANSCRIPTOME_DIR, --transcriptome_dir TRANSCRIPTOME_DIR
                        specify the directory from process function.
  -g GENE, --gene GENE  specify the query genes.
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
                        specify output file.
  • fc : Check the fold change of perturbed region and individual sgRNA for query region and gene. Good for test if positive controls work. p-value are calculated with Student's t-test.
    • Input 1: processed transcriptome and sgrna matrix from the process output
    • Inout 2: sgrna dict file (perturbation region hg38 coordinates and the sgrna name targeting that region. Region and sgrnas separated by tab, and sgrnas separated by comma. The sgrna name must match the index of sgrna matrix.)
    • Example:
    • chr1:1234567-1235067 sg1;sg2;sg3;sg4;sg5
    • chr2:1234567-1235067 sg6;sg7;sg8;sg9;sg10
    • Input 3: Query file, the query region and query test, separate by tab.
    • Example:
    • chr1:1234567-1235067 GENE1
    • chr2:1234567-1235067 GENE2
usage: pySpade fc [-h] -t TRANSCRIPTOME_DIR -d DICT -r REGION [-b BG] -o OUTPUT_FOLDER
Check the fold change of perturbed region and individual sgRNA for query region and gene.Input: processed transcriptome
matrix and sgrna matrix from the process output,sgrna dict file: perturbation region hg38 coordinates and the sgrna
sequence targeting that region.Region and sgrnas separated by tab, and sgrnas separated by comma

optional arguments:
  -h, --help            show this help message and exit
  -t TRANSCRIPTOME_DIR, --transcriptome_dir TRANSCRIPTOME_DIR
                        specify the processed transcriptome matrix file
  -d DICT, -dict DICT   specify the sgRNA annotation file: perturbation coordinates (hg38) and the sgRNA name.
  -r REGION, --region REGION
                        specify the query regions and their target genes to calculate repression efficiency.
  -b BG, --bg BG        the background cells for comparason. Default is complementary (all the other cells). Specify the key in sgRNA txt file.
  -o OUTPUT_FOLDER, --output_folder OUTPUT_FOLDER
                        specify output folder directory.
  • DEobs : Perfrom the genome wide differential expression analysis of all the perturbation regions.
    • Input 1: processed transcriptome and sgrna matrix from the process output
    • Input 2: sgrna dict file (perturbation region hg38 coordinates and the sgrna sequence targeting that region. Region and sgrnas separated by tab, and sgrnas separated by comma. The sgrna name must match the index of sgrna matrix).
    • Output files: up regulation p-value, downregulation p-value, fold change(compare with all the other cells) and average cpm.
    • The default background is all the other cells (complement). Self-defined background (ex. cells with non-target sgRNAs) should be described in the sgrna dict file.
    • Example:
    • NonTarget sgNC1;sgNC2;sgNC3;sgNC4
usage: pySpade DEobs [-h] -t TRANSCRIPTOME_DF -s INPUT_SGRNA -d DICT [-r THREADS] [-n NORM_METHOD] [-b BG] -o OUTPUT_DIR

Perfrom the genome wide differential expression analysis of all the perturbation regions.Input: processed transcriptome
matrix and sgrna matrix from the process outputsgrna dict file: perturbation region hg38 coordinates and the sgrna
sequence targeting that region.region and sgrnas separated by tab, and sgrnas separated by commaOutput: up regulation
p-value, downregulation p-value, fold change(compare with all the otehr cells) and average cpm

optional arguments:
  -h, --help            show this help message and exit
  -t TRANSCRIPTOME_DF, --transcriptome_df TRANSCRIPTOME_DF
                        specify the processed transcriptome matrix file (.h5)
  -s INPUT_SGRNA, --sgrna INPUT_SGRNA
                        specify the processed sgrna matrix file. (.h5)
  -d DICT, -dict DICT   specify the perturbation coordinates (hg38) and the sgRNA txt file.
  -r THREADS, --threads THREADS
                        set number of barcode comparison threads. The default is 1
  -n NORM_METHOD, --norm NORM_METHOD
                        choose normalization methods: "cpm" or "metacell".
  -b BG, --bg BG        the background cells for comparason. Default is complementary (all the other cells). Specify the
                        key in sgRNA txt file.
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        specify an output directory.
  • DErand : Perfrom the genome wide differential expression analysis of 1000 random selection cells.
    • There are two options for random selection: all cells with equal probability or probability based on sgrna number in the cells. User should specify the cell number to select randomly. It is recommended with either exact cell number or bins (with large amount of perturbation experiment in order to reduce computational overhead).
usage: pySpade DErand [-h] -t TRANSCRIPTOME_DF -s INPUT_SGRNA -d DICT -m NUM [-i ITERATION] [-r THREADS] [-n NORM_METHOD]
                      -a RANDOMIZATION_METHOD [-b BG] -o OUTPUT_DIR

Perfrom the genome wide differential expression analysis of 1000 random selection cells.There are two options for random
selection: all cells with equal probability or probability based on sgrna number in the cellsUser should specify the cell
number to select randomly.It is recommended with either exact cell number or bins (large amount of perturbation region).

optional arguments:
  -h, --help            show this help message and exit
  -t TRANSCRIPTOME_DF, --transcriptome_df TRANSCRIPTOME_DF
                        specify the processed transcriptome matrix file
  -s INPUT_SGRNA, --sgrna INPUT_SGRNA
                        specify the processed sgrna matrix file.
  -d DICT, -dict DICT   specify the perturbation coordinates (hg38) and the sgRNA txt file.
  -m NUM, --num NUM     specify the number of cells to do random iteration.
  -i ITERATION, --iteration ITERATION
                        specify the number of iteration to perform.
  -r THREADS, --threads THREADS
                        set number of barcode comparison threads. The default is 1
  -n NORM_METHOD, --norm NORM_METHOD
                        choose normalization methods: "cpm" or "metacell".
  -a RANDOMIZATION_METHOD, --randomization_method RANDOMIZATION_METHOD
                        choose randomization methods: "equal" or "sgrna".
  -b BG, --bg BG        the background cells for comparason. Default is complementary (all the other cells). Specify the
                        key in sgRNA txt file.
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        specify an output directory.
  • local : Using the observed p-value and randomization background p-value to calculate the Significance score based on gamma distribution approximation. Local hits calculation includes the genes within +/- 2 Mb of the perturbation region. The output is a csv file with all hits information.
usage: pySpade local [-h] -f FILE_DIR -d DATA_DIR -t DISTR -s SGRNA_DICT -o OUTPUT_FILE

Using the observation p-value and randomization bavckground p-valueto calculate the adjusted p-value based on gamma
distribution approximationLocal hits calculation includes the genes within plus and minus 2 Mb of the perturbation
region.The output is a csv file with all hits information.

optional arguments:
  -h, --help            show this help message and exit
  -f FILE_DIR, --file_dir FILE_DIR
                        specify the file directory of "process" function output, the Trans_genome_seq.npy file is
                        required at this step.
  -d DATA_DIR, --data_dir DATA_DIR
                        specify the p-value matrix directory of observation test.
  -t DISTR, --distr DISTR
                        specify the random cell mean/std/10_perc file directory.
  -s SGRNA_DICT, -sgrna_dict SGRNA_DICT
                        specify the perturbation coordinates (hg38) and the sgRNA txt file.
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
                        specify an output file name incluseing the directory, it has to be in csv format.
  • global : Using the observed p-value and randomization background p-value to calculate the Significance score based on gamma distribution approximation. The output is a csv file with all hits information.
usage: pySpade global [-h] -f FILE_DIR -d DATA_DIR -s SGRNA_DICT -t DISTR -o OUTPUT_FILE

Using the observation p-value and randomization bavckground p-valueto calculate the adjusted p-value based on gamma
distribution approximationThe output is a csv file with all hits information.

optional arguments:
  -h, --help            show this help message and exit
  -f FILE_DIR, --file_dir FILE_DIR
                        specify the file directory of "process" function output, the Trans_genome_seq.npy file is
                        required at this step.
  -d DATA_DIR, --data_dir DATA_DIR
                        specify the p-value matrix directory of observation test. (DEobs output folder)
  -s SGRNA_DICT, -sgrna_dict SGRNA_DICT
                        specify the perturbation coordinates (hg38) and the sgRNA txt file.
  -t DISTR, --distr DISTR
                        specify the random cell file directory. (DErand output folder)
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
                        specify an output file name including the directory, it has to be in csv format.
  • manhattan: Using the output csv file from global to systematically generate Manhattan plot. The default cutoffs are:
    • gene expression cutoff: genes expressed in more than 5% of cells (0.05)
    • fold change cutoff: more than 20% fold change (0.2)
    • significance score cutoff: smaller than -5 (-5)
usage: pySpade manhattan [-h] -f FILE_DIR -g GLOBAL_CSV [-cx CUTOFF_EXPRESSION] [-cf CUTOFF_FC] [-cs CUTOFF_SIGNIFICANCE]
                         -o OUTPUT_FOLDER

Use the output csv file from global function to generate Manhattan plots for each perturbation region

optional arguments:
  -h, --help            show this help message and exit
  -f FILE_DIR, --file_dir FILE_DIR
                        specify the file directory of "process" function output, the Trans_genome_seq.npy file is
                        required at this step.
  -g GLOBAL_CSV, --global_csv GLOBAL_CSV
                        specify the csv file directory from the output of global function.
  -cx CUTOFF_EXPRESSION, --cutoff_expression CUTOFF_EXPRESSION
                        specify the cutoff of expressed genes. Default is 0.05 (genes expressed in more than 5 percent of
                        cells)
  -cf CUTOFF_FC, --cutoff_fc CUTOFF_FC
                        specify the cutoff of fold change. Default is 0.2 (fold change is more than 20 percent)
  -cs CUTOFF_SIGNIFICANCE, --cutoff_significance CUTOFF_SIGNIFICANCE
                        specify the cutoff of Significance_score. Default is -5 (Significance score is smaller than -5)
  -o OUTPUT_FOLDER, --output_folder OUTPUT_FOLDER
                        specify an output folder directory.

Data Interpretation

  • exp

    • Average expression (cpm): The counts per million read of the query gene. This parameter can determine the expression level.
    • Median cpm: The cpm of the median cell. Due to the dropout, median cpm for many genes could be 0.
    • Portion of cell express: The portion of cells that expressed the query gene. 1 means 100%.
  • fc

    • fold change: Perturb cells divide by background cells. 1 means unchange, more than 1 means up-regulation, less than 1 means down-regulation.
    • log(p-val), t-test: The result of student's t-test comparing perturbed cells and background cells.
    • Perturb cpm: The expression level (cpm) of perturbation cells.
    • Background cpm: The expression level (cpm) of background cells.
  • local and global

    • Each row is one DE gene in the perturbation region.
    • gene_names: The differential expressed genes.
    • region: The perturbation region.
    • num_cell: The number of cells with sgRNAs targeting that perturbation region.
    • distance (only in local): The distance (base pair) between perturbation region and differential expressed gene.
    • log(pval)-hypergeom: The raw p-value output from DEobs, this does not adjust the background.
    • fc: fold change calculating from DEobs.
    • Significance_score: The p-value adjusting background from DErand. The higher the absolute value, the more significant that DE gene is.
    • fc_by_rand_dist_cpm: (perturb cells cpm)/(random select background cells cpm). Based on our experience, usualy fc and fc_by_rand_dist_cpm is pretty similar. Either one can be used as reference.
    • pval-empirical: The proportion of random select background p-value that is smaller than observation p-value.
    • cpm_perturb: The expression level (cpm) of perturbation cells.
    • cpm_bg: The expression level (cpm) of background cells.
  • manhattan

    • The genes upper part in red are up-regulated genes upon perturbation, while the genes in the lower part in blue are down-regulated genes upon perturbation.
    • The DE genes showing in the Manhattan plots are filtered based on the parameters running manhattan.

How to cite

Wang, Y. et al. Enhancer regulatory networks globally connect non-coding breast cancer loci to cancer genes. Genome Biol 26, 10 (2025). Link to paper

Contacts


  • Yihan Wang Yihan.Wang@UTSouthwestern.edu
  • Gary Hon Gary.Hon@UTSouthwestern.edu