Copyright © 2021 Quantori. Custom Software Solutions. All rights reserved.
This program (wrapper) helps download data from the SRA database or the European Nucleotide Archive. It uses one of three methods, depending on the user's choice: fasterq-dump to download runs from NCBI's Sequence Read Archive (SRA), or FTP or Aspera to download runs from the European Nucleotide Archive (ENA). This program also uses the ENA Portal API to retrieve metadata.
Author: Quantori
This program takes ena or ncbi as the data source and a study identifier or run IDs from its arguments and/or a text file. It will then download the relevant files directly, or delegate downloading to fasterq-dump or Aspera CLI.
This program also obtains the required metadata, verifies checksums of the downloaded files, and retries failed downloads. A text file containing study identifiers or run IDs should list one ID per line. IDs passed as program arguments should be separated by commas (,). For more information, see CLI usage.
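The checksum verification and retry behaviour described above can be illustrated with a minimal sketch. This is not FastqHeat's actual implementation: the names md5_of and download_with_retries are hypothetical, the fetch callable stands in for whichever transport is used, and treating attempts as the total number of tries is an assumption.

```python
import hashlib
import time
from pathlib import Path
from typing import Callable, Optional


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download_with_retries(
    fetch: Callable[[], Path],
    expected_md5: str,
    attempts: int = 2,
    attempts_interval: float = 0,
) -> Path:
    """Call fetch() until the file it returns matches expected_md5.

    Assumption: `attempts` counts every try, including the first one.
    """
    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        try:
            path = fetch()
            if md5_of(path) == expected_md5:
                return path
            last_error = ValueError(f"checksum mismatch for {path}")
        except OSError as error:  # network and filesystem errors
            last_error = error
        if attempt < attempts:
            time.sleep(attempts_interval)  # wait before the next try
    raise RuntimeError(f"download failed after {attempts} attempts") from last_error
```

The two keyword arguments mirror the --attempts and --attempts_interval options shown in the CLI usage, with the same defaults.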
FastqHeat is being developed and tested under Python 3.9.x.
This is how to install the project and its external Python dependencies. Depending on the method you choose for downloading data, you may have to install additional command-line utilities, as explained in the supported methods section.
- Make sure you have installed a supported version of Python.
- Clone this project from GitHub or download it as an archive.
- Optional, but recommended: create and activate a fresh virtual environment.
- Install it directly with pip.
Full example for Linux systems:
$ git clone git@github.com:quantori/FastqHeat.git
$ python3 -m venv env
$ . env/bin/activate
$ pip install FastqHeat/
This project supports command line usage. You can use --help to get information about the CLI.
Usage: python -m fastqheat [OPTIONS] COMMAND [ARGS]...
This help message is also accessible via `python3 -m fastqheat --help`.
Run 'python3 -m fastqheat COMMAND --help' for more information on a
command.
For more info see README.MD
Options:
--version Show the version and exit.
--help Show this message and exit.
Commands:
ena
ncbi
Usage: fastqheat ena [OPTIONS]
Options:
--accession TEXT List of accessions separated by comma. E.g
"SRP163674,SRR7969880,SRP163674" [default:
]
--accession-file FILE File with accessions separated by a newline.
--metadata-file FILE Metadata filepath [default: (dynamic)]
--working-dir DIRECTORY Working directory. [default: <built-in
function getcwd>]
--attempts INTEGER RANGE Retry attempts in case of network error.
[default: 2]
--attempts_interval INTEGER RANGE
Retry attempts interval in seconds in case
of network error. [default: 0]
--transport [binary|ftp] Transport (method) to be user to download
data. [default: binary]
--skip-download BOOLEAN Skip data download step. Data check (if not
skipped) will expect data to be in the
working directory [default: False]
--skip-check BOOLEAN Skip data check step. [default: False]
--skip-download-metadata BOOLEAN
Skip metadata download step [default:
False]
--config FILE Configuration file path. [default:
(dynamic)]
--log-level [CRITICAL|ERROR|WARNING|INFO|DEBUG]
Logging level. [default: INFO]
--help Show this message and exit.
Usage: fastqheat ncbi [OPTIONS]
Options:
--accession TEXT List of accessions separated by comma. E.g
"SRP163674,SRR7969880,SRP163674" [default:
]
--accession-file FILE File with accessions separated by a newline.
--working-dir DIRECTORY Working directory. [default: <built-in
function getcwd>]
--attempts INTEGER RANGE Retry attempts in case of network error.
[default: 2]
--attempts_interval INTEGER RANGE
Retry attempts interval in seconds in case
of network error. [default: 0]
--skip-download BOOLEAN Skip data download step. Data check (if not
skipped) will expect data to be in the
working directory [default: False]
--skip-check BOOLEAN Skip data check step. [default: False]
--cpu-count INTEGER RANGE Sets the amount of cpu-threads used by
fasterq-dump (binary that downloads files
from NCBI) and pigz (binary that zips files)
[default: (dynamic)]
--config FILE Configuration file path. [default:
(dynamic)]
--log-level [CRITICAL|ERROR|WARNING|INFO|DEBUG]
Logging level. [default: INFO]
--help Show this message and exit.
For every study or run given, FastqHeat will download data for all runs and place them in a specific hierarchical directory structure.
For example, if you download data for SRP163674 to /some/output/directory, FastqHeat will arrange the downloaded files for its runs in the following directory structure:
/some/output/directory/
├── SRR7969880
│ └── SRR7969880.fastq.gz
├── SRR7969881
│ └── SRR7969881.fastq.gz
├── SRR7969882
│ └── SRR7969882.fastq.gz
├── SRR7969883
│ └── SRR7969883.fastq.gz
├── SRR7969884
│ └── SRR7969884.fastq.gz
...
Here's an example for SRX4720625:
/some/output/directory/
└── SRR7882015
├── SRR7882015_1.fastq.gz
└── SRR7882015_2.fastq.gz
If instead you download data just for SRR7969880:
/some/output/directory/
└── SRR7969880
└── SRR7969880.fastq.gz
Note that the directory structure will always be exactly the same, regardless of the method you selected.
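The layout shown above can be expressed as a small path helper. This is only a sketch of the naming pattern visible in the examples; the function name expected_files and the paired flag are hypothetical and not part of FastqHeat.

```python
from pathlib import Path
from typing import List


def expected_files(working_dir: str, run_id: str, paired: bool = False) -> List[Path]:
    """Return the paths where a run's FASTQ files would be expected.

    Each run gets its own directory named after the run accession.
    Paired-end runs produce <run>_1 and <run>_2 files, as in the
    SRR7882015 example; single-end runs produce a single <run> file.
    """
    run_dir = Path(working_dir) / run_id
    if paired:
        return [run_dir / f"{run_id}_{mate}.fastq.gz" for mate in (1, 2)]
    return [run_dir / f"{run_id}.fastq.gz"]
```

For instance, expected_files("/some/output/directory", "SRR7882015", paired=True) yields the two paths listed for SRX4720625.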
Requires the fasterq-dump executable to be installed and added to PATH. Consult the official SRA Toolkit documentation for detailed instructions. After downloading files, FastqHeat will compress them with pigz (which can be installed with apt on Debian-based systems).
Both fasterq-dump and pigz support parallel execution, and it is enabled by default. The --cpu-count argument (see CLI usage) controls exactly how many threads these programs will spawn.
The default number of threads is equal to the number of logical CPUs in the system.
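As a rough illustration of how a thread count might be passed to both tools, the sketch below builds the two command lines without running them. The helper names are hypothetical, and how FastqHeat actually invokes these programs is an assumption; only the --threads and --outdir options of fasterq-dump and the -p option of pigz are taken from those tools themselves.

```python
import os
from typing import List, Optional, Sequence


def resolve_threads(cpu_count: Optional[int] = None) -> int:
    """Fall back to the number of logical CPUs when no count is given."""
    return cpu_count or os.cpu_count() or 1


def fasterq_dump_command(run_id: str, outdir: str, cpu_count: Optional[int] = None) -> List[str]:
    """Build a fasterq-dump invocation for one run (hypothetical helper)."""
    threads = resolve_threads(cpu_count)
    return ["fasterq-dump", run_id, "--outdir", outdir, "--threads", str(threads)]


def pigz_command(fastq_files: Sequence[str], cpu_count: Optional[int] = None) -> List[str]:
    """Build a pigz invocation that compresses the given files in place."""
    threads = resolve_threads(cpu_count)
    return ["pigz", "-p", str(threads), *fastq_files]
```

Either list could then be handed to subprocess.run(command, check=True).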
Refer to the following sections for usage examples:
Requires that you have Aspera Connect installed and added to your PATH.
Specifically, FastqHeat will invoke the ascp executable to transfer files.
To install Aspera Connect:
wget -qO- https://d3gcli72yxqn2z.cloudfront.net/downloads/connect/latest/bin/ibm-aspera-connect_4.2.1.116_linux.tar.gz | tar xvz
You can find the latest version by going to the official website, right-clicking "Download Aspera Connect for Linux", and choosing "Copy link address".
chmod +x ibm-aspera-connect_<version number>-linux_x86_64.sh
./ibm-aspera-connect_<version number>-linux_x86_64.sh
export PATH=$PATH:~/.aspera/connect/bin/
echo 'export PATH=$PATH:~/.aspera/connect/bin/' >> ~/.bash_profile
ascp --version
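Besides running ascp --version by hand, the presence of the external tools on PATH can be checked from Python. The following is a minimal sketch using only the standard library; the helper names find_tools and missing_tools are hypothetical and are not part of FastqHeat.

```python
import shutil
from typing import Dict, Iterable, List, Optional

# External executables mentioned in this README.
DEFAULT_TOOLS = ("fasterq-dump", "pigz", "ascp")


def find_tools(names: Iterable[str] = DEFAULT_TOOLS) -> Dict[str, Optional[str]]:
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names: Iterable[str] = DEFAULT_TOOLS) -> List[str]:
    """Return the names of tools that could not be found on PATH."""
    return [name for name, path in find_tools(names).items() if path is None]
```

An empty result from missing_tools(["ascp"]) indicates that the PATH export above took effect in the current shell.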
Refer to the following sections for usage examples:
FastqHeat will download files directly from ENA.
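A direct download needs the file locations and checksums for each run, which the ENA Portal API mentioned earlier can supply. The sketch below only builds a file report URL and splits a returned record; it makes no network call. The chosen fields, the helper names, and the assumption that multi-file values are semicolon-separated are illustrative and may differ from what FastqHeat requests.

```python
from typing import Dict, List, Sequence, Tuple
from urllib.parse import urlencode

ENA_FILEREPORT_URL = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def filereport_url(
    accession: str,
    fields: Sequence[str] = ("run_accession", "fastq_ftp", "fastq_md5"),
) -> str:
    """Build a file report query for a study or run accession."""
    query = urlencode(
        {
            "accession": accession,
            "result": "read_run",
            "fields": ",".join(fields),
            "format": "json",
        }
    )
    return f"{ENA_FILEREPORT_URL}?{query}"


def files_with_checksums(record: Dict[str, str]) -> List[Tuple[str, str]]:
    """Pair each file location with its MD5 checksum.

    Assumption: paired-end runs list several values separated by ';'.
    """
    urls = [item for item in record.get("fastq_ftp", "").split(";") if item]
    sums = [item for item in record.get("fastq_md5", "").split(";") if item]
    return list(zip(urls, sums))
```

The JSON response to such a query would be a list of records, one per run, each of which can be passed to files_with_checksums.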
Refer to the following sections for usage examples:
# Download SRP163674 data to the current directory using fasterq-dump
$ python3 -m fastqheat ncbi --accession=SRP163674
# Same, but output files to /tmp instead
$ python3 -m fastqheat ncbi --accession=SRP163674 --working-dir=/tmp
# Download data for SRR7969880 to the current directory. Sets the number of cores
# to use by fasterq-dump and pigz, overriding the default setting
$ python3 -m fastqheat ncbi --accession=SRR7969880 --cpu-count=8
# Download data related to SRP163674 to the current directory using Aspera CLI
$ python3 -m fastqheat ena --accession=SRP163674
# Same, but output files to /tmp instead
$ python3 -m fastqheat ena --accession=SRP163674 --working-dir=/tmp
# Download data for SRR7969880 to /tmp
$ python3 -m fastqheat ena --accession=SRR7969880 --working-dir=/tmp
# Download data related to SRP163674 to the current directory using FTP
$ python3 -m fastqheat ena --transport=ftp --accession=SRP163674
# Same, but output files to /tmp instead
$ python3 -m fastqheat ena --transport=ftp --accession=SRP163674 --working-dir=/tmp
# Download data for SRR7969880 to /tmp
$ python3 -m fastqheat ena --transport=ftp --accession=SRR7969880 --working-dir=/tmp
# Download data for several accessions, separated by commas, to /tmp
$ python3 -m fastqheat ena --accession=SRR7969880,SRP150545 --working-dir=/tmp
Or create a .txt file containing identifiers of SRA studies or runs.
# Download data for every entry in input_file.txt using fasterq-dump with 6 threads
$ python3 -m fastqheat ncbi --accession-file=/path/to/input_file.txt --cpu-count=6
Each identifier should be placed on a separate line. Example of a valid file:
$ cat /path/to/input_file.txt
SRP163674
SRX4720625
SRP150545
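The two input forms (a comma-separated argument and a file with one identifier per line) could be merged along the lines of the sketch below. The function collect_accessions is hypothetical, and dropping blank entries and duplicates is an assumption rather than documented FastqHeat behaviour.

```python
from pathlib import Path
from typing import List, Optional


def collect_accessions(accession: str = "", accession_file: Optional[str] = None) -> List[str]:
    """Merge accessions from a comma-separated string and an optional file."""
    # Identifiers given on the command line are separated by commas.
    ids = [item.strip() for item in accession.split(",")]
    if accession_file:
        # Identifiers in the file are placed one per line.
        lines = Path(accession_file).read_text().splitlines()
        ids += [line.strip() for line in lines]
    # Drop blanks and duplicates while keeping the original order (assumption).
    return list(dict.fromkeys(item for item in ids if item))
```

With the example file above, collect_accessions(accession_file="/path/to/input_file.txt") would return the three identifiers in file order.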
Development happens on the dev branch. master is the stable branch.
Clone the project, enter the project directory, and switch to the development branch:
~$ git clone git@github.com:quantori/FastqHeat.git
~$ cd FastqHeat/
~/FastqHeat$ git checkout dev
Install poetry, then install the project:
~/FastqHeat$ poetry install # NOTE: includes dev dependencies
NOTE: to run commands within the project's virtual environment, you will have to activate Poetry's shell (poetry shell) or run them via poetry run. It is also possible to install the project via pip in editable mode (-e) instead, and then install the project's dependencies with poetry install --no-root.
Make sure you've installed optional command-line utilities as well.
If you add new Python dependencies, they should be included in pyproject.toml in the relevant sections (don't forget to recreate poetry.lock after you're done).
To check that everything is in order:
~/FastqHeat$ make format # Formats code
~/FastqHeat$ make lint # Runs linters against code
~/FastqHeat$ make test # Runs unit tests
We welcome participation from all members of the community. We ask that all interactions conform to our Code of Conduct.
Feel free to open an issue!