-
Notifications
You must be signed in to change notification settings - Fork 21
Searching proteins
This search and alignment tutorial is building off of the previous tutorial on how to build search databases. Here, we will provide guidance on how to perform structured queries against protein sequence datasets. We will first show how to pull down prebuilt databases, and how to build your own queries.
To get started first download the TM-Vec CATH model and the prebuilt database
wget https://users.flatironinstitute.org/thamamsy/public_www/tm_vec_cath_model.ckpt
wget https://users.flatironinstitute.org/thamamsy/public_www/tm_vec_cath_model_params.json
wget https://users.flatironinstitute.org/jmorton/public_www/deepblast-public-data/checkpoints/deepblast-v3.ckpt
To search against the embeddings_cath_s100_final.npy
database, you just need to provide the proteins you want to query in the form a of fasta formatted file. Say that you have a file called my_proteins.fasta
that you want to query against the database. You can perform this query using the following command
tmvec-search \
--query bagel.fa \
--tm-vec-model tm_vec_cath_model.ckpt \
--tm-vec-config tm_vec_cath_model_params.json \
--database bagel_database/db.npy \
--metadata bagel_database/meta.npy \
--database-fasta bagel.fa \
--device 'gpu' \
--output-format tabular \
--output tabular.txt \
--output-embeddings test.npy
This will output a table of top K-nearest neighbors for each query protein, in addition to their TM-scores under tabular.txt
. The first few lines of this file should look as follows
query_id rank database_id tm-score
68-1147 0 68-1147 0.9999999
68-1147 1 Thiostrepton_B 0.98808634
68-1147 2 Siomycin_A_(Siomycin_B)_(Siomycin_C)_(Siomycin_D) 0.9833769
68-1147 3 Thiostrepton_A 0.96945214
68-1147 4 Nosiheptide 0.9661771
A10255_B_(A10255_E)_(A10255_G) 0 A10255_B_(A10255_E)_(A10255_G) 1.0000002
A10255_B_(A10255_E)_(A10255_G) 1 Thioxamycin 0.9883827
A10255_B_(A10255_E)_(A10255_G) 2 Promoinducin 0.9829995
A10255_B_(A10255_E)_(A10255_G) 3 A10255_J 0.97695875
The query id
refers to the id of a protein in the query fasta file. The database id
refers to the id of the protein in the database fasta file. The rank here specifies the ordering due to the TM-score. For instance 68-1147
vs 68-1147
have a TM-score close to 1 since they are identical (rank 0), and 68-1147
vs Thiostrepton_B
have a TM-score of 0.98808634, which making Thiostrepton_B
the second most structurally similar protein to the protein 68-1147
in the database. Because the query fasta file and the database fasta file are identical, we expect the top hits to correspond to identical proteins.
The --output-embeddings
contains the embeddings of the query proteins. With this file, it is possible to compute TSNE-representations
and directly visualize the similarity between proteins. Opening test.npy
will require the numpy.load function.
The GPU flag is needed to specify that a GPU will be used for computation -- this will drastically reduce the runtime for encoding the proteins (which is the slowest operation).
There are three different formats that TM-vec can output, (1) a tabular format, (2) an embedding format and (3) an alignment format. Generating the tabular format was demonstrated above. The embedding format takes the input query sequences and converts them to protein vectors. The alignment format computes structural alignments amongst the top K nearest neighbors using DeepBLAST. If computing alignments, you will need to specify additional files in order to retrieve the original protein sequences (namely the fasta file and the fasta index file). Specifying alignments can be done as follows
tmvec-search \
--query bagel.fa \
--tm-vec-model tm_vec_cath_model.ckpt \
--tm-vec-config tm_vec_cath_model_params.json \
--database bagel_database/db.npy \
--metadata bagel_database/meta.npy \
--database-fasta bagel.fa \
--database-faidx bagel.fai \
--deepblast-model deepblast-pt-l8.ckpt \
--device 'gpu' \
--k-nearest-neighbors 1 \
--output-format alignment \
--output alignments.txt \
--output-embeddings test.npy
The alignments file contains a list of gapped alignments (no statistics or TM-scores). In this tutorial the first 5 alignments are formatted below.
Tricyclic CLGVGSCNNFAGCGYAIVCFW
Tricyclic CLGVGSCNNFAGCGYAIVCFW
Capistruin MVRLLAKLLRSTIHGSNGVSLDAVSSTHGTPGFQTPDARVISRFGFN
Capistruin MVRLLAKLLRSTIHGSNGVSLDAVSSTHGTPGFQTPDARVISRFGFN
Chaxapeptin MEPQMTELQPEAYEAPSLIEVGEFSEDTLGFGSKPLDSFGLNFF
Chaxapeptin MEPQMTELQPEAYEAPSLIEVGEFSEDTLGFGSKPLDSFGLNFF
Enterocin MGNSILNKMTVEEMEAVKGGNLVCPPMPDYIKRLSTGKGVSSVYMAWQIANCKSSGSCMKGQTNRTC
Enterocin MGNSILNKMTVEEMEAVKGGNLVCPPMPDYIKRLSTGKGVSSVYMAWQIANCKSSGSCMKGQTNRTC
Caulonodin_V MTQVSPSPLRLIRVGRALDLTRSIGDSGLRESMSSQTYWP
Caulonodin_V MTQVSPSPLRLIRVGRALDLTRSIGDSGLRESMSSQTYWP
Gassericin MKNFNTLSFETLANIVGGRNNLAANIGGVGGATVAGWALGNAVCGPACGFVGAHYVPIAWAGVTAATGGFGKIRK
Gassericin MKNFNTLSFETLANIVGGRNNLAANIGGVGGATVAGWALGNAVCGPACGFVGAHYVPIAWAGVTAATGGFGKIRK
Because we are DeepBLASTing bagel.fa
against itself, most of the proteins will align against themselves (see Other things to consider for more discussion).
One thing to note about these alignments is that if two proteins are too dissimilar, then DeepBLAST will not return an alignment. As a result, TM-vec and DeepBLAST will not always agree, it won't be uncommon to have hits from TM-vec that have no DeepBLAST alignments. This is especially the case when the TM-score is slow (around 0.5), DeepBLAST will likely not report any alignments.
If you want to run these operations on a GPU, make sure to specify the --device
flag (i.e. --device 'gpu'
). Also keep in mind that performing alignments can be resource intensive (i.e. 4 alignments / second in this tutorial on a RTX 4090 GPU), so if you just want to identify remotely homologous proteins on large queries, you may want to avoid the alignment format.
Finally, we want to point out that DeepBLAST is not a perfect aligner -- it will not perform as well as traditional sequence aligners when the sequence identity is >90%. This is partially due to the training dataset, DeepBLAST was trained on proteins who have sequence identity <40% and TM-score > 0.6. As a result, there are edge cases where DeepBLAST will not be able to correctly rank proteins with high sequence identity (even if they are identical). This is apparent in this tutorial involving bacteriocins.