Realize custom pipeline on server #11

Open
Bontempogianpaolo1 opened this issue Jul 22, 2021 · 20 comments

Comments

@Bontempogianpaolo1
Collaborator

  1. Check the genefusion repo to see whether it is resolved.
  2. Check whether the tools are working on the HPC.
  3. If they work, build custom pipelines on the HPC, using genefusion as a draft.
@federicacitarrella
Owner

federicacitarrella commented Jul 29, 2021

UPDATE:

To check whether the tools on the HPC work properly, I wanted to run them directly on the HPC, but the input files were deleted from the working group to free up space. So I am trying to install the same tools used on the HPC (Arriba, EricScript and FusionCatcher) on Philae.

To install and run ARRIBA, the following commands were used (https://arriba.readthedocs.io/en/latest/quickstart/#installation-using-bioconda):
conda install -c conda-forge -c bioconda arriba=2.1.0
/home/citarrella/miniconda3/envs/Arriba2/var/lib/arriba/download_references.sh GRCh38+ENSEMBL93

run_arriba.sh STAR_index_GRCh38_ENSEMBL93/ ENSEMBL93.gtf GRCh38.fa $ARRIBA_FILES/blacklist_hg19_hs37d5_GRCh37_v2.1.0.tsv.gz $ARRIBA_FILES/known_fusions_hg19_hs37d5_GRCh37_v2.1.0.tsv.gz $ARRIBA_FILES/protein_domains_hg19_hs37d5_GRCh37_v2.1.0.gff3 8 /home/citarrella/sample/SRR064286/SRR064286_1.fastq.gz /home/citarrella/sample/SRR064286/SRR064286_2.fastq.gz

It works fine ✅
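
The run_arriba.sh call above combines a GRCh38+ENSEMBL93 reference with database files whose names mention hg19_hs37d5_GRCh37. The thread does not say whether this pairing is intentional. As a minimal sketch (the helper name `db_matches_assembly` is hypothetical and not part of Arriba), a file-name check like this could flag such a pairing before a long run:

```shell
# Hypothetical helper, not part of Arriba: succeed only if the database
# file name mentions the given assembly string (e.g. GRCh37 or GRCh38).
db_matches_assembly() {
    assembly="$1"
    db_file="$2"
    case "$(basename "$db_file")" in
        *"$assembly"*) return 0 ;;
        *) echo "warning: $db_file does not mention $assembly" >&2
           return 1 ;;
    esac
}

# Example with a file name taken from the command above:
db_matches_assembly GRCh38 blacklist_hg19_hs37d5_GRCh37_v2.1.0.tsv.gz || true
```

This only compares names; it says nothing about whether the file contents actually fit the reference.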

To install EricScript, the following commands were used (https://sourceforge.net/projects/ericscript/files/):
conda install -c bioconda ericscript
chmod +x /home/citarrella/miniconda3/envs/EricScript/bin/ericscript.pl
/home/citarrella/miniconda3/envs/EricScript/bin/ericscript.pl --printdb
But the last command fails with an error:
[Screenshot 2021-07-29 at 13.10.10]

To install FusionCatcher, the following instructions were used (https://github.com/ndaniel/fusioncatcher):
wget http://sf.net/projects/fusioncatcher/files/bootstrap.py -O bootstrap.py && python bootstrap.py --download -y
./bin/download-human_v102.sh

Now I am waiting for the genome data download to finish (90 min left).

@Bontempogianpaolo1
Collaborator Author

EricScript:
Try two things:

  1. I suppose the installation did not go well, so try reinstalling ericscript (maybe with -c bioconda at the end of the line) and updating it.
  2. Check whether the --help command works.
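
The second check can be sketched as a small shell function. This is an illustration only: `probe_tool` is a hypothetical name, and it assumes the tool returns exit status 0 for `--help`, which not every tool does.

```shell
# Hypothetical helper: report whether a tool is on PATH and whether
# calling it with --help exits with status 0.
probe_tool() {
    tool="$1"
    if ! command -v "$tool" >/dev/null 2>&1; then
        echo "missing"
        return 1
    fi
    if "$tool" --help >/dev/null 2>&1; then
        echo "help-ok"
    else
        echo "help-failed"
    fi
}
```

For example, `probe_tool ericscript.pl` could be run inside the activated conda environment.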

@Bontempogianpaolo1
Collaborator Author

Final result?

@federicacitarrella
Owner

I tried to reinstall EricScript using conda install ericscript -c bioconda, but it still gives me the same error. The --help command works and shows the following message:

[Screenshot 2021-07-30 at 03.47.39]

@Bontempogianpaolo1
Collaborator Author

This person is having the same problem as you: databio/ericscript#2

@Bontempogianpaolo1
Collaborator Author

image

So the problem could be related to the FTP client...

@Bontempogianpaolo1
Collaborator Author

@Bontempogianpaolo1
Collaborator Author

If they don't work just skip it for the moment. We will find a way 😉

@federicacitarrella
Owner

About FusionCatcher: once the download finished, I tried to run the test bash script with these commands:
cd ~
/home/citarrella/fusioncatcher/test/test.sh

But it gives this error:

/home/citarrella/fusioncatcher/tools/biopython/Bio/__init__.py:128: BiopythonWarning: You may be importing Biopython from inside the source tree. This is bad practice and might lead to downstream issues. In particular, you might encounter ImportErrors due to missing compiled C extensions. We recommend that you try running your code from outside the source tree. If you are outside the source tree then you have a setup.py file in an unexpected directory: /home/citarrella/fusioncatcher/tools/biopython.
  format(_parent_dir), BiopythonWarning)
Traceback (most recent call last):
  File "/home/citarrella/fusioncatcher/bin/sra2illumina.py", line 48, in <module>
    import phred
  File "/home/citarrella/fusioncatcher/bin/phred.py", line 58, in <module>
    import Bio.SeqIO
  File "/home/citarrella/fusioncatcher/tools/biopython/Bio/SeqIO/__init__.py", line 387, in <module>
    from Bio.Align import MultipleSeqAlignment
  File "/home/citarrella/fusioncatcher/tools/biopython/Bio/Align/__init__.py", line 22, in <module>
    from Bio.Align import _aligners
ImportError: cannot import name _aligners


################################################################################
################################################################################
TOTAL RUNNING TIME: 0 day(s), 0 hour(s), 0 minute(s), and 3 second(s)
################################################################################
################################################################################
sort: cannot read: test_fusioncatcher/summary_candidate_fusions.txt: No such file or directory



   WARNING: Test is NOT ok! There is something wrong with FusionCatcher installation!   

The manual (https://github.com/ndaniel/fusioncatcher/blob/master/doc/manual.md#4---installation-and-usage-examples) says:

Please, do not forget to build/download the organism data after this is done running (please notice the last lines displayed by bootstrap.py after it finished running and execute the commands suggested there, e.g. use download.sh)!

But when bootstrap.py finished I did not see any suggested command, so I checked the available bash scripts and found a download-human_v102.sh file, which I ran. Afterwards I got the error above. I also tried running the build.sh file, but it gives the same error.

@Bontempogianpaolo1
Collaborator Author

Regarding the last comment, have you already tried with conda? [image]

@federicacitarrella
Owner

About the two EricScript links you sent before: the first one refers to a version 0.5.5b that I cannot find, so I tried downloading the latest version (0.5.5) without using Conda, but I got the same error.
For the second link, I was trying to follow the instructions provided, but I cannot find the file he suggests deleting in the first step:
[Screenshot 2021-07-30 at 11.07.20]

Now I will try to install FusionCatcher using Conda.

@Bontempogianpaolo1
Collaborator Author

Would you mind opening a separate issue for each tool? It's becoming confusing.

@Bontempogianpaolo1
Collaborator Author

In the meantime, you can use this helpful link to find files: https://www.google.it/amp/s/winaero.com/find-files-linux-terminal/amp/

@federicacitarrella
Owner

federicacitarrella commented Jul 31, 2021

UPDATE:

EricScript installation:

conda create --name EricScript
conda activate EricScript
conda install -c bioconda ericscript
chmod +x /home/fcitarrella/miniconda3/envs/EricScript/bin/ericscript.pl

An error occurs when running:
/home/fcitarrella/miniconda3/envs/EricScript/bin/ericscript.pl --printdb
To avoid this problem, download the EricScript database from this link and move the refid folder (e.g. homo_sapiens) into /home/fcitarrella/miniconda3/envs/EricScript/share/ericscript-0.5.5-5/lib/data/
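
The move step can be sketched as a guarded shell function. `install_refid` is a hypothetical helper name; both paths come from the caller, and nothing here is part of EricScript itself.

```shell
# Sketch: move a downloaded refid folder into EricScript's data directory.
# Hypothetical helper; refuses to overwrite an existing folder.
install_refid() {
    src="$1"       # e.g. path to the unpacked homo_sapiens folder
    data_dir="$2"  # e.g. .../share/ericscript-0.5.5-5/lib/data
    if [ ! -d "$src" ]; then
        echo "refid folder not found: $src" >&2
        return 1
    fi
    mkdir -p "$data_dir" || return 1
    if [ -e "$data_dir/$(basename "$src")" ]; then
        echo "already present: $data_dir/$(basename "$src")" >&2
        return 1
    fi
    mv "$src" "$data_dir"/
}
```

Refusing to overwrite keeps a partially downloaded database from silently replacing a working one.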

FusionCatcher installation:

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda create -n fusioncatcher fusioncatcher
source activate fusioncatcher

Before proceeding, check the download-human-db.sh file in miniconda3/envs/fusioncatcher/bin/:
if needed, change the download version from v98 to v102 (just change the number) and change the line "ln -sv current human_v102" to "ln -s human_v102 current"
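
The two manual edits could also be applied with sed. This is a sketch under assumptions: it presumes the script spells the release as "v98" and contains the symlink line quoted above, and `patch_download_script` is a hypothetical name. It writes to stdout so the result can be inspected before replacing anything.

```shell
# Sketch: print a patched copy of the download script to stdout.
# Assumes the release appears as "v98" and that the symlink line
# matches the one quoted in the thread; adjust if the real file differs.
patch_download_script() {
    sed -e 's/v98/v102/g' \
        -e 's/ln -sv current human_v102/ln -s human_v102 current/' \
        "$1"
}
```

A possible use is `patch_download_script download-human-db.sh > patched.sh`, followed by a diff against the original.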

Then run:
download-human-db.sh

Download fastqtk v0.27:

wget https://github.com/ndaniel/fastqtk/archive/refs/tags/v0.27.zip
unzip v0.27.zip
cd fastqtk-0.27
make

Move "fastqtk" to fusioncatcher/bin/ directory.

You may also need to run:
conda install tbb=2020.2 (hactar)

@Bontempogianpaolo1
Collaborator Author

Bontempogianpaolo1 commented Jul 31, 2021

So now there are three tools correctly installed on the server?

@federicacitarrella
Owner

Exactly! Actually there are four, but I installed STAR-Fusion a few weeks ago, so I would like to check again that it works correctly.

@federicacitarrella
Owner

federicacitarrella commented Aug 1, 2021

UPDATE:

INTEGRATE installation (https://sourceforge.net/p/integrate-fusion/wiki/Home/):
Download INTEGRATE.0.2.6.tar from https://sourceforge.net/projects/integrate-fusion/files/.
Run the following commands:

tar -xvf INTEGRATE.0.2.6.tar
cd INTEGRATE_0_2_6
mkdir INTEGRATE-build 
cd INTEGRATE-build
conda install -c anaconda cmake
cmake ../Integrate/ -DCMAKE_BUILD_TYPE=release 
make

[I tried the test and it worked. Then I tried to run a real example, but the data preparation is quite complicated. I used the following commands but got some errors:

> wget ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz
> gzip -d Homo_sapiens.GRCh37.75.gtf.gz
> conda install -c bioconda ucsc-gtftogenepred
> gtfToGenePred -genePredExt -geneNameAsName2 Homo_sapiens.GRCh37.75.gtf Homo_sapiens.GRCh37.75.genePred
> cut -f 1-10,12 Homo_sapiens.GRCh37.75.genePred > tmp.txt
> echo -e "#GRCh37.ensGene.name\tGRCh37.ensGene.chrom\tGRCh37.ensGene.strand\tGRCh37.ensGene.txStart\tGRCh37.ensGene.txEnd\tGRCh37.ensGene.cdsStart\tGRCh37.ensGene.cdsEnd\tGRCh37.ensGene.exonCount\tGRCh37.ensGene.exonStarts\tGRCh37.ensGene.exonEnds\tGRCh37.ensemblToGeneName.value" > annot.ensembl.GRCh37.txt
> cat tmp.txt >> annot.ensembl.GRCh37.txt
> wget https://genome-idx.s3.amazonaws.com/bt/GRCh37.zip
> wget https://ccb.jhu.edu/software/tophat/downloads/tophat-2.1.1.Linux_x86_64.tar.gz
> tar -xvzf tophat-2.1.1.Linux_x86_64.tar.gz 
> cd tophat-2.1.1.Linux_x86_64
> cp * /home/citarrella/miniconda3/envs/INTEGRATE/bin/
> conda install -c bioconda bowtie
> conda install -c bioconda bowtie2

edit the first line in your_tophat_directory/tophat from:
#!/usr/bin/env python to: #!/usr/bin/env python2

> tophat2 --no-coverage-search /home/citarrella/GeneFusionTools/INTEGRATE_0_2_6/preparation/GRCh37/GRCh37 /home/citarrella/sample/SRR064286/SRR064286_1.fastq,/home/citarrella/sample/SRR064286/SRR064286_2.fastq
> fastq-dump --split-files SRR15290663
> wget http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/GRCh37.p13.genome.fa.gz
> gzip -d GRCh37.p13.genome.fa.gz
> conda install -c bioconda bwa
> bwa index /home/citarrella/GeneFusionTools/INTEGRATE_0_2_6/preparation/GRCh37.p13.genome.fa
> bwa mem /home/citarrella/GeneFusionTools/INTEGRATE_0_2_6/preparation/GRCh37.p13.genome.fa /home/citarrella/sample/SRR064286/SRR064286_1.fastq /home/citarrella/sample/SRR064286/SRR064286_2.fastq | samtools sort -o dna.tumor.bam - [[ERROR: invalid BAM binary header (this is not a BAM file)]]
> samtools index *.bam
> ./Integrate mkbwt /home/citarrella/GeneFusionTools/INTEGRATE_0_2_6/preparation/GRCh37.p13.genome.fa [[ERROR: the .fa file was not accepted]]
> ./Integrate fusion /home/citarrella/GeneFusionTools/INTEGRATE_0_2_6/test-data//reference.fasta annot.ensembl.GRCh37.txt ./bwts accepted_hits.bam unmapped.bam dna.tumor.bam dna.normal.bam

]

@federicacitarrella
Owner

I built a simple pipeline using EricScript.
The pipeline.nf file was structured as follows:

#!/usr/bin/env nextflow

fasta1 = file(params.in1)
fasta2 = file(params.in2)

process arriba{
    conda '/home/fcitarrella/miniconda3/envs/EricScript'

    """
    #!/bin/bash
    /home/fcitarrella/miniconda3/envs/EricScript/bin/ericscript.pl /home/fcitarrella/sample/SRR064286/SRR064286_1.fastq /home/fcitarrella/sample/SRR064286/SRR064286_2.fastq
    """

}

print "DONE"

I ran it using the following command:

./nextflow pipeline.nf --in1 '/home/fcitarrella/sample/SRR064286/SRR064286_1.fastq' --in2 '/home/fcitarrell/sample/SRR064286/SRR064286_2.fastq'

It worked fine.

I also tried to use the input files passed as parameters, but it gives the errors shown below.

pipeline.nf:

#!/usr/bin/env nextflow

fasta1 = file(params.in1)
fasta2 = file(params.in2)

process arriba{
    conda '/home/fcitarrella/miniconda3/envs/EricScript'

    """
    #!/bin/bash
    /home/fcitarrella/miniconda3/envs/EricScript/bin/ericscript.pl ${fasta1} ${fasta2}                                                            
    """

}

print "DONE"

error:

Error executing process > 'arriba'

Caused by:
  Process `arriba` terminated with an error exit status (2)

Command executed:

  #!/bin/bash
  /home/fcitarrella/miniconda3/envs/EricScript/bin/ericscript.pl /home/fcitarrella/sample/SRR064286/SRR064286_1.fastq /home/fcitarrell/sample/SRR064286/SRR064286_2.fastq

Command exit status:
  2

Command output:
  (empty)

Command error:
              Subcommands:
              --checkdb                       Check if your database is up-to-date, based on the latest Ensembl release.
              --downdb                        Download, build database. refid parameter need to be specified.
              --simulator                     Generate synthetic gene fusions with the same recipe of the ericscript's paper
              --calcstats                     Calculate the statistics that we used in our paper to evaluate the performance of the algorithms.         
 
              --------
              arguments for databases subcommands (downdb, checkdb):
  
                      -db, --dbfolder <string>        where database is stored. Default is ERICSCRIPT_FOLDER/lib/
                      --refid                         Genome reference identification. Run ericscript.pl --printdb to see available refid [homo_sapiens].
                      --printdb                       Print a list of available genomes and exit.
                      --ensversion            Download data of a specific Ensembl version (>= 70). Default is the latest one.
   
              -------
              arguments for simulator:
                      -o, --outputfolder <string>     where synthetic datasets will be stored [HOME/ericscript_simulator]
                      -rl, --readlength <int>         length of synthetic reads [75]
                      --refid                         Genome reference identification. Run ericscript.pl --printdb to see available refid [homo_sapiens].
                      -v, --verbose                   use verbose output
                      --insize                        parameter of wgsym. Outer distance between the two ends [200]
                      --sd_insize                     parameter of wgsym. Standard deviation [50]
                      --ngenefusion                   The number of synthetic gene fusions per dataset? [50]
                      --min_cov                       Minimum coverage to simulate [1]
                      --max_cov                       Maximum coverage to simulate [50]
                      --nsims                         The number of synthetic datasets to simulate [10]
                      --be                            Use --be to generate Broken Exons (BE) data [no]
                      --ie                            Use --ie to generate Intact Exons (IE) data [yes]
                      -db, --dbfolder                 where database is stored. Default is ERICSCRIPT_FOLDER/lib/ 
                      --background_1                  Fastq file (forward)  for generating background reads. 
                      --background_2                  Fastq file (reverse) for generating background reads. 
                      --nreads_background             The number of reads to extract from background data [200e3].
  
              -------
              arguments for calcstats:
                      -o, --outputfolder <string>     where statistics file will be stored [HOME/ericscript_calcstats]
                      --resultsfolder <string>        path to folder containing algorithm results.
                      --datafolder <string>           path to folder containing synthetic data generated by ericscript simulator.
                      --algoname <string>             name of the algorithm that generated results. 
                      --dataset <string>              type of synthetic data to considered for calculating statistics. IE or BE? 
                      -rl, --readlength <int>         length of synthetic reads 
                      --normroc <int>                 factor to normalize the score given by the algorithm.
...

@Bontempogianpaolo1
Collaborator Author

Uhmm strange... have you already tried to print parameters inside that process?

@federicacitarrella
Owner

There was a typo in the executed command: 'fcitarrell' instead of 'fcitarrella'. Now it works too!
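
A mistyped path like this one can be caught before the pipeline starts. As a minimal sketch (`check_inputs` is a hypothetical helper, not a Nextflow feature), a wrapper could verify that every input file is readable and name the offending path:

```shell
# Sketch: fail early if any input file is missing or unreadable,
# printing each bad path to stderr. Hypothetical helper.
check_inputs() {
    status=0
    for f in "$@"; do
        if [ ! -r "$f" ]; then
            echo "input not readable: $f" >&2
            status=1
        fi
    done
    return "$status"
}
```

It could guard the launch, for example `check_inputs "$IN1" "$IN2" && ./nextflow pipeline.nf --in1 "$IN1" --in2 "$IN2"`, where IN1 and IN2 are illustrative variable names.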
