msestak/PsiBlastHelper

NAME

PsiBlastHelper - a modulino that splits a FASTA input file into a number of chunks for BLAST, PSI-BLAST and HMMER. It also writes SGE and HTCondor scripts to run these jobs on a cluster or grid.

SYNOPSIS

# run separately for BLAST and HMMER because the application path differs

# test example for BLAST and PSI-BLAST
lib/PsiBlastHelper.pm --infile=t/data/dm_splicvar \
--out=t/data/dm_chunks/ --chunk_name=dm --chunk_size=50 --fasta_size 3000 \
--cpu 5 --cpu_l 5 \
--db_name=dbfull --db_path=/shared/msestak/db_full_plus --db_gz_name=dbfull_plus_format_new.tar.gz \
--email=msestak@irb.hr --app_path=/home/msestak/ncbi-blast-2.5.0+/bin/

# test example for HMMER
lib/PsiBlastHelper.pm --infile=t/data/dm_splicvar \
--out=t/data/dm_chunks/ --chunk_name=dm --chunk_size=50 --fasta_size 3000 \
--cpu 5 --cpu_l 5 \
--db_name=dbfull --db_path=/shared/msestak/dbfull --db_gz_name=dbfull.gz \
--email=msestak@irb.hr --app_path=/home/msestak/hmmer-3.1b2-linux-intel-x86_64/binaries/

# possible options for BLAST database
--db_name=dbfull  --db_path=/shared/msestak/db_full_plus --db_gz_name=dbfull_plus_format_new.tar.gz
--db_name=db90    --db_path=/shared/msestak/db90_plus    --db_gz_name=db90_plus_format_new.tar.gz
--db_name=db90old --db_path=/shared/msestak/db90old      --db_gz_name=db90old_format.tar.gz

# options for HMMER database
--db_name=dbfull  --db_path=/shared/msestak/dbfull --db_gz_name=dbfull.gz

DESCRIPTION

PsiBlastHelper is a modulino that splits a FASTA input file into a number of chunks for high-throughput BLAST+, PSI-BLAST+ or HMMER runs. Each chunk file is named with a short name plus a running number, with the suffix '_large' added if it holds a sequence longer than --fasta_size or one of the top N sequences by size. Large sequences are handled this way because BLAST is very slow on them, so they are processed separately, one by one. You therefore need to provide the input file, the chunk size, the chunk name, and either the top N or the sequence length above which sequences are run separately. You also need to provide --cpu or --cpu_l, which set how many jobs each SGE or HTCondor script runs. The idea is to reduce the number of BLAST database copies, because too many copies can fill the disk on a node and cause jobs to fail. In other words, one script means one database copy and multiple BLAST processes. Use -a (--append) to append the remainder of the sequences to the last chunk file; by default the remainder is written to a new file. After splitting the sequences, the modulino writes SGE and HTCondor bash scripts. All paths are hardcoded for the ISABELLA cluster at tannat.srce.hr and the CRO-NGI grid.
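The splitting rule described above can be illustrated with a minimal shell/awk sketch. This is not the module's own code: the function name split_fasta, the shared file counter and the exact file names are assumptions made for illustration, and --top and --append are not modeled.

```shell
#!/usr/bin/env bash
# Sketch only: split a FASTA file into chunks of N sequences and write every
# sequence longer than a cutoff to its own "_large" file.
# split_fasta and the naming scheme (dm1, dm2_large, ...) are hypothetical.
split_fasta() {
    local infile=$1 outdir=$2 name=$3 chunk_size=$4 max_len=$5
    mkdir -p "$outdir"
    awk -v out="$outdir" -v name="$name" -v size="$chunk_size" -v max="$max_len" '
        # Write the record collected so far to the right file.
        function flush_record() {
            if (header == "") return
            if (length(seq) > max) {
                # long sequence: one file per sequence
                file = out "/" name (++n_files) "_large"
                printf "%s\n%s\n", header, seq > file
                close(file)
            } else {
                # normal sequence: start a new chunk file when needed
                if (in_chunk == 0) chunk_file = out "/" name (++n_files)
                printf "%s\n%s\n", header, seq > chunk_file
                if (++in_chunk == size) { close(chunk_file); in_chunk = 0 }
            }
        }
        /^>/ { flush_record(); header = $0; seq = ""; next }
             { seq = seq $0 }   # join wrapped sequence lines
        END  { flush_record() }
    ' "$infile"
}
```

A call such as `split_fasta t/data/dm_splicvar /tmp/dm_chunks dm 50 3000` would mirror the --chunk_size=50 and --fasta_size 3000 values from the SYNOPSIS; the output directory here is just an example. Any leftover sequences end up in a final, smaller chunk file, which corresponds to the default behaviour without --append.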

For help write:

perl lib/PsiBlastHelper.pm -h
perl lib/PsiBlastHelper.pm -m

Summary of options:

--infile       fasta file with proteins to be split
--out          directory where chunks and scripts will be written, recreated if it exists
--chunk_name   first part of the fasta chunk name, e.g., "dm" means that chunks dm1, dm2, dm3, ... will be created
--chunk_size   number of fasta sequences per chunk
--fasta_size   sequence length above which sequences are run one by one due to problems with BLAST buffers, usually 3000
--cpu          number of BLAST jobs to run per script for "normal" sequences, i.e., sequences shorter than 3000 amino acids
--cpu_l        number of BLAST jobs to run per script for "long" sequences, i.e., sequences longer than 3000 amino acids
--db_name      name of the BLAST database to use in the BLAST command
--db_path      path to the BLAST database on tannat.srce.hr; the recommendation is to put it under /shared/user/ because of the InfiniBand connection to the nodes; Isabella-specific
--db_gz_name   name of the BLAST database in the home directory on the grid for CRO-NGI jobs; not used on Isabella
--email        email address to which notifications are sent when jobs start, abort or end
--app_path     path to the blastp, phmmer or psiblast executable

Optional:

--append       append the remainder of the sequences to the last chunk file; by default a new file is created
--top          top N largest sequences to run one by one
--v            verbose; by default logging level is INFO; -v sets it to DEBUG; -v -v sets it to TRACE
--q            quiet; opposite of verbose; run without logging to terminal; it still writes full log to file
--grid_address specify address of grid center other than ce.srce.cro-ngi.hr, specific to HTCondor
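To get a feel for how --cpu relates to the number of database copies, the arithmetic can be sketched as below. It assumes one job per chunk file and scripts that are filled up to --cpu jobs each; the helper name scripts_needed is hypothetical, and the module's actual grouping may differ.

```shell
#!/usr/bin/env bash
# Sketch: estimate how many submit scripts (and therefore database copies)
# result from grouping chunk files into scripts of --cpu jobs each.
# Assumption: one job per chunk file, each script holds up to --cpu jobs.
scripts_needed() {
    local chunks=$1 cpu=$2
    echo $(( (chunks + cpu - 1) / cpu ))   # ceiling division
}

scripts_needed 23 5   # 23 chunk files with --cpu 5
```

Under these assumptions, 23 chunk files with --cpu 5 give 5 scripts, so the database would be copied 5 times instead of 23.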

NOTE:

This script hardcodes the $HOME path into all .submit scripts. If you generate the input files and runner scripts on one computer under one user and then copy them to Isabella or the grid, where you run them as a different user, the jobs will fail. As a workaround, run a sed command that replaces the username written in the scripts, like this:

sed -i 's/user_local_desktop/user_isabella/g' *

where * means scripts generated by this module.
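A slightly more careful variant of the same workaround is sketched below: it only touches files that actually contain the old username and reports each one. The function name fix_username is hypothetical, GNU sed is assumed (as in the command above), and usernames are assumed to contain no characters that are special to sed.

```shell
#!/usr/bin/env bash
# Sketch: replace a hardcoded username in generated scripts under a directory.
# fix_username is a hypothetical helper; assumes GNU sed and plain usernames.
fix_username() {
    local old=$1 new=$2 dir=$3
    # grep -rl lists only the files that contain the old username
    grep -rl -- "$old" "$dir" | while IFS= read -r f; do
        sed -i "s/$old/$new/g" "$f"
        echo "updated: $f"
    done
}
```

For example, `fix_username user_local_desktop user_isabella t/data/dm_chunks/` would rewrite the scripts in the output directory used in the SYNOPSIS.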

INSTALL

First install cpanm (a package manager for the Perl programming language) and optionally set up local::lib as suggested.

curl -L https://cpanmin.us | perl - App::cpanminus

Add ~/perl5/bin to PATH by adding the export line below to ~/.bashrc, then reload the file:

export PATH=$HOME/perl5/bin:$PATH
. ~/.bashrc

Clone the module from the GitHub repository:

git clone https://github.com/msestak/PsiBlastHelper

Install dependencies using cpanm:

cpanm --installdeps PsiBlastHelper/ --force
# or one by one
cpanm -n Path::Tiny

LICENSE

Copyright (C) Martin Sebastijan Šestak.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

AUTHOR

Martin Sebastijan Šestak martin.s.sestak@gmail.com
