- Depending on how far through the pipeline you want to run the code from, download the relevant data from Dropbox.
  - `data/colabfold_output` contains the colabfold output files.
  - `data/fastas_for_colabfold` contains the input fasta files for colabfold.
  - `data/foldseek_output` contains the foldseek output files.
  - `data/raw` contains the protein sequences from each source.
- Data is backed up on Dropbox here.
- rclone can make backing up to Dropbox easier.
  - `sudo -v ; curl https://rclone.org/install.sh | sudo bash` to install.
  - `rclone config` to configure.
  - `rclone copy colabfold_output "remote:/Shared Data/AF2/colabfold_output" -P --ignore-existing` as an example, to copy the `colabfold_output` folder to Dropbox, showing progress and skipping files that already exist at the destination.
- Install colabfold
- Install foldseek
- Install foldseek databases to `data/foldseek_databases`. In this example only PDB was used.
- If using multiple local GPUs, such as via a compute engine on GCP, then `pip install simple-gpu-scheduler`. `other/GCP_doc.pdf` includes setup of compute engines on GCP.
- If some sequences have already been run on other machines and you want to avoid duplicating effort, rather than editing fastas, it's easier to copy the `.done.txt` files from the `data/colabfold_output` folder. These files tell colabfold whether predictions have already been generated for a sequence, and colabfold will skip those sequences. `code/other/move_done_txt.py` can assist with this.
- The scripts assume there's a conda environment on Barkla called `my_gpu` that is a copy of `gpu`, and a venv called `af2venv`.
`code/preprocess/gen_seq_data_table.sh`
- Merges sources and gets unique sequences.
- Saves as `source_unique_df.csv`.
`code/preprocess/gen_priority_1_fastas.sh`
- For priority one sequences, gets the chromosome of each sequence and writes the sequences into fasta files of 20 sequences each, where each header is the corresponding id in `source_unique_df.csv`.
- See `code/other/gen_fasta.sh` and `gen_fasta.py` for an example of how to generate fastas for all sequences, not just those that were priority for this project.
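
The batching step described above can be sketched as follows. This is only an illustration and not the code in `gen_fasta.py`: the function name `write_fasta_batches`, the `batch_XXXX.fasta` file naming and the in-memory dict of sequences are all assumptions made for the example.

```python
from pathlib import Path
from typing import Dict, List


def write_fasta_batches(
    sequences: Dict[str, str], out_dir: Path, batch_size: int = 20
) -> List[Path]:
    """Write sequences to numbered fasta files with at most batch_size records each.

    `sequences` maps a sequence id (used as the fasta header) to its sequence.
    The output file names are hypothetical, chosen only for this sketch.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    items = list(sequences.items())
    paths = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        path = out_dir / f"batch_{start // batch_size:04d}.fasta"
        with path.open("w") as handle:
            for seq_id, seq in batch:
                # One header line and one sequence line per record.
                handle.write(f">{seq_id}\n{seq}\n")
        paths.append(path)
    return paths
```

Smaller batches mean that an interrupted colabfold job loses less work, at the cost of more files to track.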
`code/colabfold/run_colabfold_until_done.sh`
- Runs colabfold on Barkla.
- Sets up an array of 8 jobs (the max that can be requested at one time on Barkla) on the first GPU partition that becomes available.
- Edit the variables `fasta_path` and/or `fasta_names` in `colabfold/run_colabfold_barkla.sh` if you want to use different fastas.
- Saves to `data/colabfold_output/`.
`code/colabfold/run_colabfold_gcp.sh`
- Runs colabfold on a compute engine on GCP.
- Edit the variables `fasta_path` and/or `fasta_names` in `colabfold/run_colabfold_barkla.sh` if you want to use different fastas.
- Saves to `data/colabfold_output/`.
`code/foldseek/foldseek_query.sh`
- Runs foldseek against the highest ranked relaxed model for each sequence in `data/colabfold_output/`.
- Outputs to `data/foldseek_output/`.
- This script was run on the PGB HPC.
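
As a rough sketch of what "highest ranked relaxed model for each sequence" could look like in practice, the snippet below picks one file per sequence. It assumes, and this is not confirmed by the repository, that relaxed models are named `<name>_relaxed_rank_<n>_....pdb` with lower `<n>` meaning a better rank. The function name `best_relaxed_models` is hypothetical.

```python
import re
from pathlib import Path
from typing import Dict

# Assumed file naming: "<name>_relaxed_rank_<n>_<anything>.pdb".
RELAXED = re.compile(r"^(?P<name>.+?)_relaxed_rank_(?P<rank>\d+)_.*\.pdb$")


def best_relaxed_models(output_dir: Path) -> Dict[str, Path]:
    """Return, per sequence name, the relaxed model with the lowest rank number."""
    best = {}
    for path in sorted(output_dir.glob("*_relaxed_rank_*.pdb")):
        match = RELAXED.match(path.name)
        if match is None:
            continue
        name, rank = match["name"], int(match["rank"])
        # Keep the file with the smallest rank number seen so far.
        if name not in best or rank < best[name][0]:
            best[name] = (rank, path)
    return {name: path for name, (_rank, path) in best.items()}
```

If your colabfold version names its output differently, the regular expression is the only part that should need changing.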
`data/process_output/get_results.sh`
- Calculates some results from the colabfold and foldseek outputs and saves them as a csv.
- Don't expect these scripts to work as part of the pipeline, but here are some other scripts I used to solve problems I ran into.
`code/other/move_done_txt.py`
- Copies `*done.txt` files from `colabfold_output` into a new directory called `done_txt`. `done.txt` files indicate to colabfold that a sequence has already been predicted and that these sequences can be skipped.
- These can then be moved to `colabfold_output` on a different system to ensure that those fastas are skipped and results aren't duplicated when processing.
- This allows all fastas to be moved, rather than selecting only those that need moving.
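
A minimal sketch of the copy step described above, for readers who want to adapt it. This is not the contents of `move_done_txt.py`: the function name `copy_done_markers` and its arguments are assumptions for illustration, and only the `*done.txt` pattern comes from the description.

```python
import shutil
from pathlib import Path
from typing import List


def copy_done_markers(output_dir: Path, dest_dir: Path) -> List[str]:
    """Copy every *done.txt marker from output_dir into dest_dir.

    Returns the names of the copied files in sorted order.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for marker in sorted(output_dir.glob("*done.txt")):
        # copy2 keeps timestamps, which helps when auditing runs later.
        shutil.copy2(marker, dest_dir / marker.name)
        copied.append(marker.name)
    return copied
```

For example, `copy_done_markers(Path("data/colabfold_output"), Path("done_txt"))` would collect the markers locally, ready to be transferred to the `colabfold_output` folder on another machine.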
`code/other/convert_old_sequences.sh`
- Used to change the names of files that had already been generated but were incorrectly named.
- This opens the `.csv` files with the incorrect names (`old_source_df`) and the correct names (`new_source_df`) and maps each old name to the new one.
- Then renames the files in the incorrectly named folder (`colabfold_output`), and copies them to a new folder (`new_colabfold_output`).
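
The rename-and-copy step can be sketched as below. This is an illustration under assumptions, not the script itself: it takes the old-to-new name mapping as a ready-made dict (how the two csv files are joined is not shown here), and it assumes each output file name starts with the sequence name followed by `_` or `.`. The function name `copy_renamed` is hypothetical.

```python
import shutil
from pathlib import Path
from typing import Dict, List


def copy_renamed(src_dir: Path, dst_dir: Path, name_map: Dict[str, str]) -> List[str]:
    """Copy files from src_dir to dst_dir, swapping an old name prefix for the new one.

    Files whose names do not start with a known old name are left alone.
    """
    dst_dir.mkdir(parents=True, exist_ok=True)
    # Try longer names first so that "10" is matched before "1".
    old_names = sorted(name_map, key=len, reverse=True)
    copied = []
    for path in sorted(src_dir.iterdir()):
        if not path.is_file():
            continue
        for old in old_names:
            rest = path.name[len(old):]
            # Require a separator after the prefix to avoid partial matches.
            if path.name.startswith(old) and rest[:1] in {"_", "."}:
                target = dst_dir / (name_map[old] + rest)
                shutil.copy2(path, target)
                copied.append(target.name)
                break
    return copied
```

Copying into a new folder rather than renaming in place keeps the original output intact until the new names have been checked.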
`code/other/figs_for_presentation.py`
- Opens the csv generated by `code/process_output/get_results.py`.
- In a folder called `plots`, violin plots are generated of the distribution of each feature by which dataset the sequence was obtained from.
`code/other/count_sequences.py`
- Gets the names of sequences that need running and checks whether results have been generated for them in `colabfold_output`.
- Prints the count of sequences left to process.
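
One way to do this kind of check is sketched below. It assumes, which may not match `count_sequences.py`, that a finished sequence is marked by a `<name>.done.txt` file in the output folder. The function name `sequences_left` is hypothetical.

```python
from pathlib import Path
from typing import Iterable, List


def sequences_left(sequence_names: Iterable[str], output_dir: Path) -> List[str]:
    """Return the names that have no "<name>.done.txt" marker in output_dir."""
    return [
        name
        for name in sequence_names
        if not (output_dir / f"{name}.done.txt").exists()
    ]
```

Calling `len(sequences_left(names, Path("data/colabfold_output")))` then gives the number of sequences still to process.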
`code/other/gen_fasta.sh`
- Similar to `code/preprocess/gen_priority_1_fasta.sh`, though it doesn't check whether a sequence is priority 1.
`code/other/rename_results.py`
- Some fasta records were incorrectly named based on count, rather than a sequence id. This script renames those files.