Merge pull request #50 from KoslickiLab/shaopeng
Shaopeng
dkoslicki authored Oct 25, 2023
2 parents 9201e5f + e245168 commit 1afc478
Showing 1 changed file with 16 additions and 6 deletions.
22 changes: 16 additions & 6 deletions README.md
@@ -85,6 +85,8 @@ The workflow for YACHT is as follows:
2. Preprocess the reference genomes by removing the "too similar" genomes based on `ANI` using the `ani_thresh` parameter
3. Run YACHT to detect the presence of reference genomes in your sample

</br>

### Creating sketches of your reference database genomes

You will need a reference database in the form of [Sourmash](https://sourmash.readthedocs.io/en/latest/) sketches of a collection of microbial genomes. There are a variety of pre-created databases available at: https://sourmash.readthedocs.io/en/latest/databases.html. Our code uses the "Zipfile collection" format, and we suggest using the [GTDB genomic representatives database](https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db/gtdb-rs214/gtdb-rs214-reps.k31.zip):
@@ -104,18 +106,20 @@ sourmash sketch dna -f -p k=31,scaled=1000,abund --singleton <your multi-FASTA f
If you have a directory of FASTA files, one per genome:

```bash
## Method 1
# cd into the relevant directory
sourmash sketch dna -f -p k=31,scaled=1000,abund *.fasta -o ../training_database.sig.zip
# cd back to YACHT

## Method 2
## Method 1 (suggested)
# put the full paths of all FASTA/FASTQ files into a file, one path per line
find <path of folder containing FASTA/FASTQ files> > dataset.csv
sourmash sketch fromfile dataset.csv -p dna,k=31,scaled=1000,abund -o ../training_database.sig.zip
# cd back to YACHT

## Method 2
# cd into the relevant directory
sourmash sketch dna -f -p k=31,scaled=1000,abund *.fasta -o ../training_database.sig.zip
# cd back to YACHT
```
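
As far as we understand the sourmash documentation, `sourmash sketch fromfile` reads a CSV with the header `name,genome_filename,protein_filename` rather than a bare list of paths. If the plain `find` output is rejected, a minimal sketch like the one below could build such a CSV. The helper name `make_fromfile_csv`, the file extensions it looks for, and the choice of the file name (minus extension) as the genome name are all assumptions for illustration, not part of YACHT or sourmash.

```bash
# Hypothetical helper (not part of YACHT or sourmash): write a CSV with the
# columns that `sourmash sketch fromfile` is documented to expect.
make_fromfile_csv() {
    local genome_dir out_csv path base
    genome_dir="$(cd "$1" && pwd)"   # resolve to an absolute path
    out_csv="$2"

    # Header row: name, genome file, protein file (left empty for DNA-only input)
    printf 'name,genome_filename,protein_filename\n' > "$out_csv"

    # One row per FASTA file. The genome name is the file name without its
    # extension (an assumption; names containing commas would need quoting).
    find "$genome_dir" -type f \( -name '*.fasta' -o -name '*.fa' -o -name '*.fna' \) | sort |
    while IFS= read -r path; do
        base="$(basename "$path")"
        printf '%s,%s,\n' "${base%.*}" "$path" >> "$out_csv"
    done
}

# Example (placeholder paths):
# make_fromfile_csv ./genomes dataset.csv
# sourmash sketch fromfile dataset.csv -p dna,k=31,scaled=1000,abund -o ../training_database.sig.zip
```

Keeping one row per genome file mirrors the "one path per line" idea of Method 1 while adding the name column.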

</br>

### Creating sketches of your sample

You will then create a sketch of your sample metagenome, using the same k-mer size and scale factor
@@ -155,6 +159,12 @@ In the two preceding steps, you will obtain a k-mer sketch file in zip format (i

### Preprocess the reference genomes (Training Step)

##### Warning: the training process is time-consuming on large databases

In our benchmark with the `GTDB representative genomes`, training takes `15 minutes` using `16 threads, 50GB of MEM` on a system equipped with a `3.5GHz AMD EPYC 7763 64-Core Processor`. The processing time can be significant when run on all GTDB genomes or with limited resources. If only a subset of the genomes is needed, you can use the `sourmash sig` commands to extract only the signatures of interest.
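
One way to take such a subset is sketched below. It assumes the `sourmash sig extract` subcommand with its `--name` and `-o` options; the wrapper `extract_by_name`, the file names, and the search string are hypothetical placeholders.

```bash
# Hypothetical wrapper (not part of YACHT): copy the signatures whose name
# matches a given string from a sourmash collection into a smaller one.
extract_by_name() {
    local db="$1" pattern="$2" out="$3"
    sourmash sig extract "$db" --name "$pattern" -o "$out"
}

# Example (placeholder file names and search string):
# extract_by_name gtdb-rs214-reps.k31.zip "Escherichia" subset.sig.zip
```

The smaller collection can then be passed to the training step in place of the full database.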

</br>

The script `make_training_data_from_sketches.py` extracts the sketches from the Zipfile-format reference database, and then turns them into a form usable by YACHT. In particular, it removes one of any two organisms that have ANI greater than the user-specified threshold as these two organisms are too close to be "distinguishable".

```bash