Skip to content

This repository hosts the software package for BalLeRMix+, an extension of BalLeRMix that can jointly detect recent positive selection and long-term balancing selection.

License

Notifications You must be signed in to change notification settings

bioXiaoheng/BallerMixPlus

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

62 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

BalLeRMix+

---Balancing selection Likelihood Ratio Mixture models Plus

This repository hosts the software package for BalLeRMix+, an extension of BalLeRMix that can jointly detect recent positive selection and long-term balancing selection.

To cite this software: Xiaoheng Cheng, Michael DeGiorgio (2021) BalLeRMix+: Mixture model approaches for robust joint identification of both positive selection and long-term balancing selection. Bioinformatics btab720


In BalLeRMix+, we introduce the optional --findPos and --findBal, as well as --fixAlpha <abeta>, arguments to help specify the disperson parameter in beta-binomial distributions, with a value between 0 and 1 indicating positive selection, and a value larger than 1 indicating balancing selection. Based on feedback from BalLeRMix users, we also added a --minCount argument to indicate the smallest number of allele counts in the input data, in case the user chooses to remove rare variants from the data.

To help visualize the output, we also added an R script in the test/ folder in case you need.

Note that BalLeRMix+ does not consider multi-allelic balancing selection as does BalLeRMix_v2.2.


User Guide

usage: BalLeRMix+_v1.py [-h]  -i INFILE [-o OUTFILE] --spect SPECTFILE [--minCount MINCOUNT] [--getSpect] [--getConfig] 
                        [--noFreq] [--MAF] [--findBal] [--findPos] [--usePhysPos] [--rec RRATE]
                        [--fixWinSize] [-w W] [--noCenter] [-s STEP] [--fixX X] [--fixAlpha ABETA] [--rangeA SEQA] [--listA LISTA]
                       

You can use python BalLeRMix+_v1.py -h to see a more detailed help page.

0. Installation and dependency

To install BalLeRMix+, simply navigate to your working directory and clone this repository:

 git clone https://github.com/bioXiaoheng/BallerMixPlus.git

The current release supports Python version 3.6 and above, and needs to have numpy (>=1.19.1) and scipy (>=1.5.0) installed.

Commands to install requirements:

The requirements.txt file allow pip (pip 6 and above) to check whether your system already has the required packages installed, and if not, install them. Note that it doesn't check whether your python version meets the requirement. To check your python version, use python --version or pip --version.

 # navigate to the cloned repository
 cd BallerMixPlus/
  
 # check for required packages
 pip install -r requirements.txt

Alternatively, you can also install the packages (of their latest version) manually through

 pip install numpy
 pip install scipy

1. Quick start and examples

To run the program, the user can build commands based on python BalLeRMix+_v1.py -i <input file> [other arguments], with no particular order required for each argument. In particular, useful argument combinations are

B variant Input data type To generate helper file To run scan
B2 Derived allele frequency (DAF) plus substitutions (x==n) --getSpect --spect <spect file> -o <output file>
B2,MAF Minor allele frequency (MAF) plus substitutions (x==0) --getSpect --MAF``--spect <spect file> --MAF -o <output file>
B1 Polymorphism (0<x<n) or substitution (x==n) calls --getConfig --spect <config file> --noFreq -o <output file>
B0 Derived allele frequency (DAF) --getSpect --noSub --spect <spect file> --noSub -o <output file>
B0,MAF Minor allele frequency (MAF) --getSpect --noSub --MAF --spect <spect file> --noSub --MAF -o <output file>

Note that when generating helper files, the input must be concatenated from all sequences in the genome. One command-line option to concatenate them is (suppose your input files are named input_chr<n>.txt):

cat input_chr*.txt | awk '(NR == 1 || $1 != "position")' > Concatenated_input_for_helper_file.txt

There are two examples in test/ folder, including their input files as well as some helper files. Both examples are simulated using SLiM3.3 (Haller & Messer 2019), and their allele frequencies are polarized by the sequence from an outgroup that diverged from the main population 5 million years ago (similar to the human-chimpanzee divergence). Example 1 is an 100kb sequence evolving under the CEU demography (inferred by SMC++; Terhorst et al. 2017), carrying a de novo beneficial mutation that occurred 10,000 generations ago, with a selective advantage of s=0.01. Example 2 is a 50kb sequence carrying an ancient locus under over-dominance balancing selection (s=0.001, dominance h=20) occuring 5 million years before the outgroup diverges.

Note that the sequence in Example 2 actually evolved under a constant population size and, in reality, should not use the same set of helper files as does Example 1. However, because 1) footprints of ancient balancing selection are in general not sensitive to recent population size changes, and 2) the two examples here are only for demonstration purpose, we will use the set of helper files for Example 1 here throughout for convenience.

To obtain helper files from the concatenated input (test/HC_CEU_Neut_Concatenated-DAF.txt, concatenated from all the replicates evoling neutrally under the same demographic history as Example 1), you can run:

Click to check sample commands
# generate helper file for B1
python BalLeRMix+_v1.py -i test/HC_CEU_Neut_Concatenated-DAF.txt --getConfig --spect test_config.txt

# generate helper file for B2
python BalLeRMix+_v1.py -i test/HC_CEU_Neut_Concatenated-DAF.txt --getSpect --spect test_DFS.txt

# generate helper file for B2maf
python BalLeRMix+_v1.py -i test/HC_CEU_Neut_Concatenated-DAF.txt --getSpect --MAF --spect test_MFS.txt

# generate helper file for B0
python BalLeRMix+_v1.py -i test/HC_CEU_Neut_Concatenated-DAF.txt --getSpect --noSub --spect test_DFS-noSub.txt

# generate helper file for B0maf
python BalLeRMix+_v1.py -i test/HC_CEU_Neut_Concatenated-DAF.txt --getSpect --noSub --MAF --spect test_MFS-noSub.txt

The resulting files should be the same as test/HC_CEU_Neut_config_for_B1.txt, test/HC_CEU_Neut_DAF_spect_for_B2.txt, test/HC_CEU_Neut_MAF_spect_for_B2maf.txt, test/HC_CEU_Neut_DAF-noSub_spect_for_B0.txt, and test/HC_CEU_Neut_MAF-noSub_spect_for_B0maf.txt, respectively. Past BalLeRMix users may need to be aware that the output's header of BalLeRMix+ is named slightly differently from before.

To run B1, B2, and B2,MAF, respectively, on the first example:
#run B1
python BalLeRMix+_v1.py -i test/Example1_fullSweep_200kya_DAF.txt -o testout_ex1_B1.txt --noFreq --spect test/HC_CEU_Neut_config_for_B1.txt

# run B2
python BalLeRMix+_v1.py -i test/Example1_fullSweep_200kya_DAF.txt -o testout_ex1_B2.txt --spect test/HC_CEU_Neut_DAF_spect_for_B2.txt

# run B2maf
python BalLeRMix+_v1.py -i test/Example1_fullSweep_200kya_MAF.txt -o testout_ex1_B2maf.txt --MAF --spect test/HC_CEU_Neut_MAF_spect_for_B2maf.txt

You will find that the output should be close, if not identical, to the files test/output/Example1_B1.txt, test/output/Example1_B2.txt, and test/output/Example1_B2maf.txt, respectively. Example1 is simulated to have undergone a selective sweep of s=0.01. Because B0 and B0,MAF have little power for positive selection, we do not recommend the user to use them for inferring positive selection.

To visualize the B1, B2, and B2,MAF scores, you can try:

Rscript test/plotScores.r <output from BalLeRMix+_v1.py> <image name.png>

With the above command, you should be able to generate images resembling test/ScorePlot_example1_B2.png. Note that this R script requires the ggplot2 package and will output images in PNG format.

The same operations can be repeated on example 2, and you can check your output with test/output/Example2_B1.txt, test/output/Example2_B2.txt, and test/output/Example2_B2maf.txt accordingly.

By default, the BalLeRMix+_v1.py script uses all informative sites provided in the input. With additional arguments, users can customize the size and centers of their sliding windows for the analyses. Details on these areguments are listed in section 4 As an example, one can run

# compute B0maf (--noSub --MAF) while: 
## using physical positions as coordinates (--usePhysPos) 
## scan with sliding windows of a fixed physical size (--fixWinSize) of 1000 nt (-w 1000)
## centered on every other (--step 2) informative sites
python BalLeRMix+_v1.py --noSub --MAF --spect test/HC_CEU_Neut_MAF-nosub_spect_for_B0maf.txt -i test/Example2_balancing_10MYA_MAF_nosub.txt -o testout_B0maf_1kb-500b.txt --usePhysPos --fixWinSize -w 1000 --step 2  

The output should be close, if not identical, to test/output/Example2_B0maf_1kb-2sites.txt.

2. Input and helper files

The input file should have four columns, presenting physical positions, genetic positions, number of derived (or minor) alleles observed, and total number of alleles observed (*i.e.* sample size). We ask that a sample size be provided for each locus so that loci with missing data (and hence a difference sample size) can be better accounted for. Meanwhile, all (x,n) combinations in your input must also exist in the helper file.

The input file should be tab-delimited and should have a header, e.g.:

physPos genPos x n
16 0.000016 50 50
35 0.000035 12 50
...

Note that for B2,MAF, it is recommended to only have minor allele counts in the data. If, however, the user choose to run it on derived allele counts, as long as --MAF argument is given, the script will automatically fold the allele count in your data. For B1, the script will read all loci with allele counts different from their sample sizes (i.e., x != n ) as polymorphisms, and all that has identical values to the samples sizes as substitution (also known as fixed differences).

In addition to the allele count input file, the user also need to provide one helper file that records the genome-wide variation level. In particular:

For B2 and B2,MAF statistics, a site frequency spectrum file is needed, and should be tab-delimited and without header, e.g.:

<k> <sample size n> <proportion in the genome>
1 50 0.03572
2 50 0.02024
...

For B1, the helper file (also tab-delimited) records the genome-wide polymorphism/substitution ratio, without header, e.g.:

<sample size n> <% of substitutions> <% of polymorphisms>
50 0.7346 0.2654

The BalLeRMix+_v1.py program can help generate helper files from the concatenated input files. For site frequency spectrum:

  python BalLeRMix+_v1.py -i <concatenated input file> --getSpect --spect <spectrum file name>

For minor allele frequency spectrum:

  python BalLeRMix+_v1.py -i <concatenated input file> --getSpect --MAF --spect <spectrum file name>

For polymorphism-substitution configuration file:

  python BalLeRMix+_v1.py -i <concatenated input file> --getConfig --spect <config file name>

3. Running the B statistics

To perform B2 scans on your input data, use
python BalLeRMix+_v1.py -i <input> --spect <derived allele frequency spectrum> -o <output>

To perform B2,MAF scans on your input data, use

python BalLeRMix+_v1.py -i <input> --spect <minor allele frequency spectrum> -o <output> --MAF

To perform B1 scans on your input data, use

python BalLeRMix+_v1.py -i <input> --config <sub/poly configuration file> -o <output> --noFreq

4. Customizing the sliding window for your scan

All arguments besides the aforementioned ones are for customizing the scan. Their usage is largely based on the previous BalLeRMix (find out more in its detailed manual here ). Note that these arguments are not necessary for the scan to complete.

In general, they can be grouped into the following three categories:
  • [--usePhysPos] [--rec RRATE] :

    Because BalLeRMix uses genetic distances (in cM) to compute likelihood, to direct the software to use physical positions instead, you should use --physPos, and indicate the uniform recombination rate (cM/nt) in your species of interest with --rec. The default value is 10-6 cM/nt.

    This argument will be automatically incurred if you choose to fix the window size (e.g., 1000bp, 5kb, etc. ), in which case yuou want to make sure the software is correctly informed of the recombination rate. Using physical positions will also change how you define window sizes and step sizes, if you were to customize the scanning window.

  • [--fixX X] [--fixAlpha ABETA] [--rangeA SEQA] [--listA LISTA]:

    These areguments allow you to customize the parameter space that the software optimizes over. The presumed equilibrium frequency is x, the extent of dispersion is α, and the rate of decay in linkage disequilibrium is A.

  • [--fixWinSize] [-w R] [--noCenter] [-s STEP] [--usePhysPos]:

    These areguments are for customizing the scanning window. You probably won't need them because BalLeRMix and BalLeRMix+ are robust to window sizes. The software help page (python BalLeRMix+_v1.py --help) provides more details on how these arguments work. The table below lists combinations the users can go with to customize their scans:

Window Span Position tested Required Commands Optional Commands
Fixed sequence length (nt) Informative sites --fixWinSize -w <nt> [-s <sites>] [--usePhysPos] [--rec <cM/nt>]
Fixed sequence length (nt) Evenly-spaced arbitrary positions --fixWinSize --noCenter``-w <nt> -s <nt> [--usePhysPos] [--rec <cM/nt>]
Fixed number of informative sites Informative sites -w <sites in radius> -s <sites per step> [--usePhysPos] [--rec <cM/nt>]

About

This repository hosts the software package for BalLeRMix+, an extension of BalLeRMix that can jointly detect recent positive selection and long-term balancing selection.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages