- Extract Wellderly genotypes only, remove variants that are not found in the Wellderly, and transform to VCF
python Create_job_extract_wellderly_vcf.py
- Parse all of the data to VCF
python Create_jobs_parse_genomeComb.py
- Sort wellderly vcf
python Create_jobs_sort_vcf.py
- Extract the variants with at least one VQHIGH in white individuals
python remove_vqlow.py
*Extract variants that are clustered in >0.1 of the Wellderly or Inova samples
python create_jobs_remove_clustered.py
*Extract the variants with AF >0.01: python create_jobs_0.01AF.py ---> DIDN'T work with vcftools, will have to do it manually
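Since the vcftools route failed, a minimal sketch of the manual AF filter (assumes the INFO column carries an AF tag; file names are illustrative, not what create_jobs_0.01AF.py actually does):

    import gzip

    # Keep only sites with AF > 0.01; assumes an AF=<value[,value...]> tag in INFO.
    with gzip.open("vcf_snps.chr1.vcf.gz", "rt") as vcf_in, \
         gzip.open("vcf_snps_AF0.01.chr1.vcf.gz", "wt") as vcf_out:
        for line in vcf_in:
            if line.startswith("#"):            # copy header lines unchanged
                vcf_out.write(line)
                continue
            info = line.split("\t")[7]
            af = 0.0
            for field in info.split(";"):
                if field.startswith("AF="):
                    # use the highest AF across alternate alleles
                    af = max(float(x) for x in field[3:].split(","))
                    break
            if af > 0.01:
                vcf_out.write(line)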
*Extract the repeats, homopolymers, etc.: python Create_jobs_extractRepeats_etc.py
*Count the number of VQHIGH variants that passed the filters, by AF: python create_jobs_count_totalVQHIGH_byAF.py
*Count the rest of the filters (in the Count_filters folder)
*Remove variants with >10% missing in either wellderly or inova python Create_jobs_extract_missing.py
*Remove variants with coverage <10 or >100 python create_jobs_remove_coverage.py
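A combined sketch of the two site filters above (>10% missing genotypes in either cohort, coverage outside 10-100), assuming GT and DP FORMAT subfields, per-cohort sample lists, and a mean-depth reading of the coverage cutoff; all names are illustrative, not what the Create_jobs scripts do:

    import gzip

    # Illustrative cohort files: one sample ID per line; every VCF sample is
    # assumed to belong to exactly one of the two cohorts.
    wellderly = set(open("wellderly_samples.txt").read().split())
    inova = set(open("inova_samples.txt").read().split())

    def keep_site(fields, sample_names):
        """True if the site passes the missingness (<=10% per cohort) and depth (10-100) filters."""
        fmt = fields[8].split(":")
        gt_i, dp_i = fmt.index("GT"), fmt.index("DP")
        depths = []
        missing = {"wellderly": 0, "inova": 0}
        for name, call in zip(sample_names, fields[9:]):
            sub = call.split(":")
            if sub[gt_i].replace("|", "/") in (".", "./."):
                missing["wellderly" if name in wellderly else "inova"] += 1
            if dp_i < len(sub) and sub[dp_i] not in (".", ""):
                depths.append(int(sub[dp_i]))
        mean_dp = sum(depths) / float(len(depths)) if depths else 0
        return (missing["wellderly"] <= 0.1 * len(wellderly)
                and missing["inova"] <= 0.1 * len(inova)
                and 10 <= mean_dp <= 100)

    with gzip.open("input.chr1.vcf.gz", "rt") as fin, \
         gzip.open("filtered.chr1.vcf.gz", "wt") as fout:
        samples = []
        for line in fin:
            if line.startswith("#CHROM"):
                samples = line.rstrip("\n").split("\t")[9:]
            if line.startswith("#") or keep_site(line.rstrip("\n").split("\t"), samples):
                fout.write(line)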
- Extract snp position based on rsID python ./snps_of_interest/Extract_position_of_snp.py
*Extract the snps of interest python Exract_snps_of_interest.py
*Extracting separately SNPs and delins with AF>0.01: python Extract_snpsOnly_AFmoreThen0.01.py
*Concatenating the VCF files by chromosome into a final one: vcf-concat vcf_snps_AF0.01.chr1.vcf.gz vcf_snps_AF0.01.chr2.vcf.gz vcf_snps_AF0.01.chr3.vcf.gz vcf_snps_AF0.01.chr4.vcf.gz vcf_snps_AF0.01.chr5.vcf.gz vcf_snps_AF0.01.chr6.vcf.gz vcf_snps_AF0.01.chr7.vcf.gz vcf_snps_AF0.01.chr8.vcf.gz vcf_snps_AF0.01.chr9.vcf.gz vcf_snps_AF0.01.chr10.vcf.gz vcf_snps_AF0.01.chr11.vcf.gz vcf_snps_AF0.01.chr12.vcf.gz vcf_snps_AF0.01.chr13.vcf.gz vcf_snps_AF0.01.chr14.vcf.gz vcf_snps_AF0.01.chr15.vcf.gz vcf_snps_AF0.01.chr16.vcf.gz vcf_snps_AF0.01.chr17.vcf.gz vcf_snps_AF0.01.chr18.vcf.gz vcf_snps_AF0.01.chr19.vcf.gz vcf_snps_AF0.01.chr20.vcf.gz vcf_snps_AF0.01.chr21.vcf.gz vcf_snps_AF0.01.chr22.vcf.gz | gzip -c >final_vcf_allChrom_snps_AF0.01.vcf.gz
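The same command can be generated instead of typed out; a small sketch (file names follow the command above):

    import subprocess

    # Build and run the vcf-concat call for chr1-chr22 programmatically.
    files = " ".join("vcf_snps_AF0.01.chr%d.vcf.gz" % c for c in range(1, 23))
    cmd = "vcf-concat %s | gzip -c > final_vcf_allChrom_snps_AF0.01.vcf.gz" % files
    subprocess.run(cmd, shell=True, check=True)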
*Run the first step of the association on all data: python create_job_association_final.py
*Second step: python run_association.py
*The association shows the biggest p-values in repeat regions; removing all of them --> skipped this for the final analysis: python ./association/create_jobs_remove_ALLrepeats_association.py
*Extract the related individuals with vcftools <-- this doesn't work (run from /gpfs/group/stsi/data/projects/wellderly/GenomeComb/vcf_snps_AFmore0.01): vcftools --gzvcf final_vcf_allChrom_snps_AF0.01.vcf.gz --remove eliminate_individuals.txt --out final_vcf_allChroms_snps_AD0.01_noRelated.vcf.gz
*Extract the related individuals, better version: python Excluded_related.py
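A minimal sketch of what dropping the related samples straight from the VCF could look like (not taken from Excluded_related.py; the sample list and file names follow the vcftools attempt above):

    import gzip

    # Samples to drop, one ID per line.
    related = set(open("eliminate_individuals.txt").read().split())
    keep_cols = None

    with gzip.open("final_vcf_allChrom_snps_AF0.01.vcf.gz", "rt") as fin, \
         gzip.open("final_vcf_allChrom_snps_AF0.01.noRelated.vcf.gz", "wt") as fout:
        for line in fin:
            if line.startswith("##"):
                fout.write(line)
                continue
            fields = line.rstrip("\n").split("\t")
            if line.startswith("#CHROM"):
                # keep the 9 fixed VCF columns plus every sample not in the related list
                keep_cols = [i for i, name in enumerate(fields) if i < 9 or name not in related]
            fout.write("\t".join(fields[i] for i in keep_cols) + "\n")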
*Concatenate files python concatenate.py
*Extracted inova coverage
*Concatenating v1 (still need to concatenate chr1-chr4) vcf-concat final_vcf_nokmer_snps_AF0.01.noRelated.chr5.vcf.gz final_vcf_nokmer_snps_AF0.01.noRelated.chr6.vcf.gz final_vcf_nokmer_snps_AF0.01.noRelated.chr7.vcf.gz final_vcf_nokmer_snps_AF0.01.noRelated.chr8.vcf.gz final_vcf_nokmer_snps_AF0.01.noRelated.chr9.vcf.gz final_vcf_nokmer_snps_AF0.01.noRelated.chr10.vcf.gz final_vcf_nokmer_snps_AF0.01.noRelated.chr11.vcf.gz final_vcf_nokmer_snps_AF0.01.noRelated.chr12.vcf.gz final_vcf_nokmer_snps_AF0.01.noRelated.chr13.vcf.gz final_vcf_nokmer_snps_AF0.01.noRelated.chr14.vcf.gz final_vcf_nokmer_snps_AF0.01.noRelated.chr16.vcf.gz final_vcf_nokmer_snps_AF0.01.noRelated.chr17.vcf.gz final_vcf_nokmer_snps_AF0.01.noRelated.chr18.vcf.gz final_vcf_nokmer_snps_AF0.01.noRelated.chr19.vcf.gz final_vcf_nokmer_snps_AF0.01.noRelated.chr20.vcf.gz final_vcf_nokmer_snps_AF0.01.noRelated.chr21.vcf.gz final_vcf_nokmer_snps_AF0.01.noRelated.chr22.vcf.gz | gzip -c >final_vcf_nokmer_snps_AF0.01.noRelated.temp.vcf.gz
*Extract snps of interest from vcf file python Extract_snps_of_interest.vcf.py
*Extract coverage by individual for the SNPs of interest: python Extract_coverage_SnpSOfInterest_wellderly.py
*Calculate AF, median coverage (from the median coverage file), missing genotypes, and VQHIGH for the SNPs of interest, for both Wellderly and Inova: python Calculate_AF_welldVsInova.py
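For the AF part of this step, a small self-contained sketch of a per-cohort allele frequency from GT strings (illustrative genotypes, not the script's actual I/O):

    # Allele frequency = alt allele count / called allele count, ignoring missing alleles.
    def allele_freq(genotypes):
        alt = called = 0
        for gt in genotypes:
            for allele in gt.replace("|", "/").split("/"):
                if allele == ".":
                    continue
                called += 1
                alt += allele != "0"
        return alt / float(called) if called else float("nan")

    print(allele_freq(["0/1", "1/1", "0/0", "./."]))   # 3 alt / 6 called -> 0.5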
- Match p-values from association based on location final_association.sh
*Extract rare variants python Create_jobs_extract_rareVariants.py
*Extract AF and p-values (from the association) for the SNPs of interest: python Extract_AF_p-values.py
ASSOCIATION ./association/final_association.sh
PATHWAY ANALYSIS *Extract genes/positions for the pathway analysis (in ~/wellderly/resources): mysql -h genome-mysql.cse.ucsc.edu -u genome -D hg19 -N -A -e 'select kgXref.kgID, kgXref.geneSymbol,knownGene.name,knownGene.chrom,knownGene.txStart,knownGene.txEnd from kgXref, knownGene where knownGene.name=kgXref.kgID' >genes_positions.txt
- Extract genes with 100kb interval (from pathway analysis folder) python split_UCSC_geneByChrom.py
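A sketch of the 100kb padding and per-chromosome split (column order assumed to match the mysql SELECT above; output names are illustrative, not split_UCSC_geneByChrom.py itself):

    # Pad each gene interval by 100 kb on both sides and write one file per chromosome.
    PAD = 100000
    by_chrom = {}
    with open("genes_positions.txt") as fh:
        for line in fh:
            kg_id, symbol, name, chrom, tx_start, tx_end = line.rstrip("\n").split("\t")[:6]
            by_chrom.setdefault(chrom, []).append((symbol, max(0, int(tx_start) - PAD), int(tx_end) + PAD))

    for chrom, genes in by_chrom.items():
        with open("genes_%s_100kb.txt" % chrom, "w") as out:
            for symbol, start, end in sorted(genes, key=lambda g: g[1]):
                out.write("%s\t%d\t%d\n" % (symbol, start, end))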
*Add gene to bim file (not needed in the end) python add_gene_to_bim.py
*Generate 10k *.pheno files and run the simulation python generate_pheno_files.py
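A sketch of the phenotype-permutation idea behind the 10k simulations (the .fam file name and output pattern are assumptions, not generate_pheno_files.py itself):

    import random

    # Read FID, IID and phenotype (column 6) from the original .fam file.
    samples = []
    with open("wellderly_inova.fam") as fam:
        for line in fam:
            f = line.split()
            samples.append((f[0], f[1], f[5]))

    # Write 10k PLINK .pheno files (FID IID PHENO) with permuted case/control labels.
    labels = [pheno for _, _, pheno in samples]
    for i in range(10000):
        random.shuffle(labels)
        with open("sim_%05d.pheno" % i, "w") as out:
            for (fid, iid, _), pheno in zip(samples, labels):
                out.write("%s %s %s\n" % (fid, iid, pheno))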
*For pathway analysis read ./pathway_analysis/pathway_analysis.sh
TABLE 1 and 2 *Extract all filters: python Create_jobs_apply_all_filters.py
RARE VARIANTS *Extracting ALL clustered variants: python ./Rare_variant_analysis/Create_jobs_remove_ALL_clustered.py
*Extract the variants removed by allele depth filter python ./Table2/Create_jobs_filter_by_AD.py
*Extract the AF after all filters except the AD filter, after removing all of the variants with 0.0 AF in both populations: python ./Table2/Create_jobs_extract_AF_by_var.py
*Generate table2 counts python ./Table2/Create_jobs_table2.py
*Extracting snp position for cognitive snps python ./pathway_analysis/CognitiveSnps/Extract_snp_position.py
*Extract cognitive SNP p-values from all 10k simulations: python ./pathway_analysis/CognitiveSnps/Create_jobs_extract_sim_pvalues.py
*Count 36mers: python ./Count_filters/Create_jobs_count_36mers.py
*Count hwe: python ./Count_filters/Create_jobs_count_hwe.py
*Filter out HWE: python ./Create_jobs/Create_job_filter_HWE.py
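For reference, a minimal chi-square Hardy-Weinberg check from genotype counts (the actual filter script may use an exact test instead; this needs scipy, and the example counts are made up):

    from scipy.stats import chi2

    def hwe_pvalue(n_hom_ref, n_het, n_hom_alt):
        """Chi-square HWE test (1 df) from the three genotype counts at one site."""
        n = n_hom_ref + n_het + n_hom_alt
        p = (2.0 * n_hom_ref + n_het) / (2 * n)        # reference allele frequency
        q = 1 - p
        expected = (n * p * p, 2 * n * p * q, n * q * q)
        observed = (n_hom_ref, n_het, n_hom_alt)
        chisq = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
        return chi2.sf(chisq, df=1)

    print(hwe_pvalue(300, 150, 50))   # strong departure from HWE -> small p-value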
*Split annotations by chromosomes python ./Create_jobs/Create_jobs_split_annotation_by_chrom.py
*Calculate the AF on the filtered out dataset with plink: python ./Create_jobs/Create_jobs_calcAF_cases_controls.py
*Supplemental table 2: python ./Create_jobs/Count_wellderly_characteristics.py
*Pathway analysis with variants inside genes only: python ./pathway_analysis/Reassign_genes/combine_simulations.py
*Redo pathway analysis
- Apply the filters: missing/uncertain genotype >10% in either Wellderly or Inova, coverage <10 or >100, whites only (testing 0.85 and 0.95 white)
python Extract_white_filter.py
- Run the plink analysis, maf > 0.01, in LD
python data_analysis.py
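A sketch of what the plink side of this step could look like (MAF filter, LD pruning, then association); the input prefix, pruning window, and r2 cutoff are assumptions, not read from data_analysis.py:

    import subprocess

    def run(cmd):
        subprocess.run(cmd, shell=True, check=True)

    # MAF > 0.01 filter plus LD pruning (50-SNP window, step 5, r^2 0.5).
    run("plink --bfile wellderly_inova --maf 0.01 --indep-pairwise 50 5 0.5 --out pruned")
    # Association on the pruned variant set.
    run("plink --bfile wellderly_inova --extract pruned.prune.in --assoc --out assoc_maf0.01_pruned")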
- Add filters to the association file (repeat, homopolymer, segDup, microsat)
python Add_filters.py
- Test different filters to see which ones work better