Commit

add warning on large db for training step, switch suggestion for build custom db
ShaopengLiu1 committed Oct 25, 2023
1 parent 9201e5f commit 12db289
Showing 1 changed file with 16 additions and 6 deletions.
22 changes: 16 additions & 6 deletions README.md
@@ -85,6 +85,8 @@ The workflow for YACHT is as follows:
2. Preprocess the reference genomes by removing the "too similar" genomes based on `ANI` using the `ani_thresh` parameter
3. Run YACHT to detect the presence of reference genomes in your sample

</br>

### Creating sketches of your reference database genomes

You will need a reference database in the form of [Sourmash](https://sourmash.readthedocs.io/en/latest/) sketches of a collection of microbial genomes. There are a variety of pre-created databases available at: https://sourmash.readthedocs.io/en/latest/databases.html. Our code uses the "Zipfile collection" format, and we suggest using the [GTDB genomic representatives database](https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db/gtdb-rs214/gtdb-rs214-reps.k31.zip):
@@ -104,18 +106,20 @@ sourmash sketch dna -f -p k=31,scaled=1000,abund --singleton <your multi-FASTA f
If you have a directory of FASTA files, one per genome:

```bash
## Method 1
# cd into the relevant directory
sourmash sketch dna -f -p k=31,scaled=1000,abund *.fasta -o ../training_database.sig.zip
# cd back to YACHT

## Method 2
## Method 1 (suggested)
# put the full paths of all FASTA/FASTQ files into a file, one path per line
find <path of folder containing FASTA/FASTQ files> > dataset.csv
sourmash sketch fromfile dataset.csv -p dna,k=31,scaled=1000,abund -o ../training_database.sig.zip
# cd back to YACHT

## Method 2
# cd into the relevant directory
sourmash sketch dna -f -p k=31,scaled=1000,abund *.fasta -o ../training_database.sig.zip
# cd back to YACHT
```
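
For the `fromfile` method, note that `sourmash sketch fromfile` reads a CSV with a header row and the columns `name`, `genome_filename`, and `protein_filename`, so a bare list of paths may need to be converted first. The following is a minimal sketch, assuming one genome per `.fasta` file and using the file name without its extension as the signature name. The helper `make_fromfile_csv` and the directory path are hypothetical and not part of YACHT or sourmash:

```bash
# Hypothetical helper: build a CSV for `sourmash sketch fromfile` from a directory.
# Assumes file names contain no commas or newlines.
make_fromfile_csv() {
    local genome_dir="$1"
    # Header row with the three columns fromfile looks for
    echo "name,genome_filename,protein_filename"
    find "$genome_dir" -type f -name '*.fasta' | sort | while IFS= read -r path; do
        # name = file name without the .fasta extension; protein column left empty
        printf '%s,%s,\n' "$(basename "$path" .fasta)" "$path"
    done
}

# Usage (the directory is a placeholder; pass an absolute path to get full paths):
# make_fromfile_csv /path/to/genomes > dataset.csv
# sourmash sketch fromfile dataset.csv -p dna,k=31,scaled=1000,abund -o training_database.sig.zip
```

Restricting `find` to regular files with `-type f` keeps the directory itself out of the list, and sorting makes the CSV reproducible between runs.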

</br>

### Creating sketches of your sample

You will then create a sketch of your sample metagenome, using the same k-mer size and scale factor
@@ -155,6 +159,12 @@ In the two preceding steps, you will obtain a k-mer sketch file in zip format (i

### Preprocess the reference genomes (Training Step)

##### Warning: the training process is time-consuming on large databases

In our benchmark with the `GTDB representative genomes`, it takes `15 minutes` using `16 threads, 50GB of MEM` on a system equipped with a `3.5GHz AMD EPYC 7763 64-Core Processor`. The processing time can be significant when run on all GTDB genomes or with limited resources. If only a subset of the genomes is needed, you can use the `sourmash sig` command to extract only the signatures of interest.
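
One possible way to extract a subset is a picklist, assuming a sourmash 4.x release with picklist support. The sketch below writes a one-column CSV of genome identifiers and shows, in comments, how it could be passed to `sourmash sig extract`. The helper `make_picklist`, the file names `picklist.csv` and `subset.sig.zip`, and the accessions are illustrative placeholders:

```bash
# Hypothetical helper: write a one-column picklist CSV from a list of identifiers.
make_picklist() {
    # Column header; referenced below as picklist.csv:ident:ident
    echo "ident"
    printf '%s\n' "$@"
}

# Usage (accessions and file names are placeholders):
# make_picklist GCF_000005845.2 GCF_000006945.2 > picklist.csv
# sourmash sig extract --picklist picklist.csv:ident:ident gtdb-rs214-reps.k31.zip -o subset.sig.zip
```

The resulting `subset.sig.zip` could then be used as the reference database for the training step in place of the full collection.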

#### </br>

The script `make_training_data_from_sketches.py` extracts the sketches from the Zipfile-format reference database, and then turns them into a form usable by YACHT. In particular, it removes one of any two organisms that have ANI greater than the user-specified threshold as these two organisms are too close to be "distinguishable".

```bash