Skip to content

sgupta1524/PheMED

 
 

Repository files navigation

PheMED

A statistical method and scalable software, PheMED, Phenotypic Measurement of Effective Dilution, leverages GWAS summary statistics to quantify genome-wide effect size dilution across different phenotypic definitions and cohorts to quantify phenotypic misclassification across GWAS using only summary statistics.

Dependencies

After installing Anaconda or Miniconda, running the following commands will create and activate an environment suitable to run PheMED.

conda env create --file environment.yml
source activate phemed

Note : There is no native build for python below 3.8 for Apple Silicon. As a workaround you can follow this steps to emulate x86_64 architecture in M1 Apple Silicon (ARM) and above.

## create and activate empty environment
conda create -n py37
conda activate py37

## use x86_64 architecture channel(s)
conda config --env --set subdir osx-64

## install python, numpy, etc. (add more packages here...)
conda install python=3.7.4

## Create and activate an environment suitable to run PheMED
conda env create --file environment.yml
conda activate phemed

Running PheMED

To prepare input file for PheMED file before running PheMED using summary statistics of a sample. Please ensure the same reference build for the summary stats files:

python merge_summary_stats.py --inputs "data/sum_stats1,data/sum_stats2" --output test
  • --inputs comma separated paths to input summary statistics files
  • --output path to output munged summary statistics file (optional)
  • --log path to log file (optional)
  • --effect-allele-col comma separated effect allele columns per file (optional)
  • --non-effect-allele-col comma separated non effect allele columns per file (optional)

In case the summary statistics are from different reference genome builds, use LiftOver to convert genetic variants across assemblies.

To run PheMED on the sample data, run the following command:

python phemed.py --sum_stats data/sim_data.csv --n_studies 2 --out output/local_test
  • --sum_stats identifies the csv with the merged log odds ratios and standard errors for each study
  • --n_studies denotes the number of studies being analyzed
  • --out denotes prefix for output files

For the sum_stats csv, PheMED expects the input csv file to have columns: SNP, CHR, and POS corresponding to the rsid, chromosome and base pair position of the SNP. The columns that follow are assumed to contain effect sizes {log(OR)1,log(OR)2,...,log(OR)n} followed by the SEs {SE1,SE2,...,SEn}, when analyzing n studies. Furthermore, the first study listed is used as the reference study when measuring effective dilution. (See the sim_data.csv in the data directory for 2 studies or an example with three studies below.)

SNP CHR POS STUDY1 STUDY2 STUDY3 SE1 SE2 SE3
rs1 1 1 0.069 0.010 0.040 0.044 0.01 0.027
rs2 1 2 0.038 0.015 0.026 0.044 0.01 0.027
rs3 1 3 0.045 -0.0004 0.022 0.044 0.01 0.027

Currently, the phemed.py script assumes that study samples do not overlap. For guidance on running PheMED when studies have sample overlap, please see the FAQ.

Understanding Outputs: PheMED produces both a log file and a csv <out>_Summary.csv with the relevant outputs. The csv file includes the effective dilution values (in the PheMED column). If CI's are computed, users can find the bounds of the bootstrapped 95% confidence interval with the columns CI_.025 and CI_.975. Furthermore, if CI's are computed, <out>_Summary.csv will also contain p-values for the effective dilution, denoted by the column P (see section P Values below for details). Finally, if the user provides sample size information for each of the studies, the csv will also produce a column estimating the dilution adjusted effective sample size in the DilutionAdjNEff column. For examples of output, see the output directory.

P Values: For computing the p-value we leverage three different p-value methodologies. For each of the methodologies, there is a corresponding qc column that indicates if it is appropriate to use that methodology to compute p-value. As such, we do not require that all three methodologies pass the QC check; instead, only one such methodology needs to pass the QC check to estimate the p-value.

Nevertheless, for naive count, if the p-value does not pass the QC check (e.g. the p-value is very small and becomes hard to estimate from the bootstrap simulation), if the number of bootstrap samples is sufficiently large (e.g. = 2000, the default), we can still use the confidence intervals to infer if p < .05. See our paper for details.

Brief Overview of Other Arguments Computing the CIs when measuring the effective dilution between two GWA studies can take around an hour. If you would like to run the pipeline without computing the CIs, use --compute_cis False or the --n_CIs <smaller integer> to reduce the number of bootstrap samples; however decreasing the number of samples below the default value can result in noisy p-values. While sample sizes are not required to run PheMED, if you would like to compute a dilution adjusted effective sample size, you will need a csv indicating the sample sizes for each study maintained in the same order as the input file and add either the argument --eff_sample_sizes <path_to_file> or --sample_sizes <path_to_file>, where the former argument expects a CSV with a column labeled N detailing the effective sample sizes and the later argument expects a CSV with columns labeled Cases and Controls, that detail the number of cases and controls for each study. For examples of these files, see data/eff_sample_size_sim.csv and data/sample_size_sim.csv respectively.

For additional arguments run

python phemed.py -h

Performing Dilution Adjusted Weights (DAW) Meta-Analysis with PheMED

After running the phemed script above, users can run a dilution adjusted weights meta-analysis with the following command

python daw_meta.py --sum_stats data/sim_data.csv \
                  --n_studies 2 \
                  --out output/local_test \
                  --dilution_weights output/local_test_Summary.csv
  • --sum_stats identifies the csv with the merged log odds ratios and standard errors for each study
  • --n_studies denotes the number of studies being analyzed
  • --out denotes the prefix for output files
  • --dilution_weights path to the summary csv outputted from the phemed script

The script will then output a csv containing meta-analyzed Z-scores, p-values and effect sizes as measured according to the reference study. Example output below (for the 3 studies input):

SNP CHR POS Z_META BETA_META P_META
rs1 1 1 2.216573171014446 0.04237889772244553 0.026652272616020767
rs2 1 2 2.001733259609271 0.03827135065330354 0.04531342729038265
rs3 1 3 0.8086750926075479 0.015461144927879799 0.41870205861603993

FAQ: For FAQ and troubleshooting, please see the FAQ here.

Support: If you have any outstanding questions not addressed by the docs or FAQ, feel free to post your question as a GitHub Issue here or send an email to david.burstein2 {at} mssm.edu

Citing our Paper: If you use our software, please cite our preprint on medRxiv: Burstein et al. Detecting and Adjusting for Hidden Biases due to Phenotype Misclassification in Genome-Wide Association Studies

About

Adding features, fixing bugs

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 51.2%
  • Jupyter Notebook 48.8%