Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GTF entries on sequence which is not present in FASTA #151

Closed
yshen16 opened this issue Jun 18, 2024 · 11 comments · Fixed by #157
Closed

GTF entries on sequence which is not present in FASTA #151

yshen16 opened this issue Jun 18, 2024 · 11 comments · Fixed by #157
Labels
bug Something isn't working

Comments

@yshen16
Copy link

yshen16 commented Jun 18, 2024

Description of the bug

recently I tried to help a user to run the nf-core/circRNA pipeline, but met tons of errors. And this is one of the latest after I downloaded the reference genome from aws, and rerun everything from scratch, that is, I removed all the residual ~/.nextflow, ~/.singularity and all that in my working directory when I run the command. Here is the command line I ran:

module purge
module load java/17.0.8
module load nextflow/23.04.3 # old 21.10.6 doesn't work

#Set the new Singularity cache directory path
export SING_DIR=/projectnb/zhangh/Arshiya_Saxena/HumanBrain_circRNA/nf-core_circrna_pipeline-NAC/yun_test3

# Export the NXF_SINGULARITY_CACHEDIR environment variable to point to the new cache directory
export NXF_SINGULARITY_CACHEDIR=$SING_DIR/.singularity

# Create the new Singularity cache directory if it doesn't already exist
mkdir -p $NXF_SINGULARITY_CACHEDIR

# Remove the existing ~/.singularity symbolic link
rm -rf ~/.singularity

# Create a new symbolic link for ~/.singularity that points to the new cache directory
ln -s $NXF_SINGULARITY_CACHEDIR ~/.singularity

cd $SING_DIR

# Yun Test_rerun on 06/16/2024:
nextflow run nf-core/circrna -profile singularity --genome GRCh37 --igenomes_base /projectnb/zhangh/Arshiya_Saxena/HumanBrain_circRNA/ref/references --input samples.csv --module 'circrna_discovery' --outdir test3_out -r dev

And here is my local references folder structure:
./Homo_sapiens
./Homo_sapiens/Ensembl
./Homo_sapiens/Ensembl/GRCh37
./Homo_sapiens/Ensembl/GRCh37/Annotation
./Homo_sapiens/Ensembl/GRCh37/Annotation/SmallRNA
./Homo_sapiens/Ensembl/GRCh37/Annotation/SmallRNA/mature.fa
./Homo_sapiens/Ensembl/GRCh37/Annotation/SmallRNA/hairpin.fa
./Homo_sapiens/Ensembl/GRCh37/Annotation/Genes
./Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf
./Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.bed
./Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/refFlat.txt.gz
./Homo_sapiens/Ensembl/GRCh37/Sequence
./Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta
./Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/genome.dict
./Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/genome.fa
./Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/genome.fa.index
./Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/genome.fa.fai
./Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/GenomeSize.xml
./Homo_sapiens/Ensembl/GRCh37/Sequence/BowtieIndex
./Homo_sapiens/Ensembl/GRCh37/Sequence/BowtieIndex/genome.rev.2.ebwt
./Homo_sapiens/Ensembl/GRCh37/Sequence/BowtieIndex/genome.2.ebwt
./Homo_sapiens/Ensembl/GRCh37/Sequence/BowtieIndex/genome.1.ebwt
./Homo_sapiens/Ensembl/GRCh37/Sequence/BowtieIndex/genome.fa
./Homo_sapiens/Ensembl/GRCh37/Sequence/BowtieIndex/genome.3.ebwt
./Homo_sapiens/Ensembl/GRCh37/Sequence/BowtieIndex/genome.4.ebwt
./Homo_sapiens/Ensembl/GRCh37/Sequence/BowtieIndex/genome.rev.1.ebwt
./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex
./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/chrName.txt
./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/exonInfo.tab
./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/SAindex
./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/Genome
./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/geneInfo.tab
./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/sjdbList.out.tab
./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/sjdbInfo.txt
./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/transcriptInfo.tab
./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/sjdbList.fromGTF.out.tab
./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/genome.fa
./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/chrLength.txt
./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/Log.out
./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/genes.gtf
./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/genomeParameters.txt
./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/chrStart.txt
./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/SA
./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/chrNameLength.txt
./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/exonGeTrInfo.tab
./Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index
./Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome.3.bt2
./Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome.1.bt2
./Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome.rev.1.bt2
./Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome.fa
./Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome.4.bt2
./Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome.2.bt2
./Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome.rev.2.bt2
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/genome.fa.pac
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/genome.fa.ann
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/genome.fa
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.6.0
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.6.0/genome.fa.pac
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.6.0/genome.fa.ann
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.6.0/genome.fa
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.6.0/genome.fa.sa
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.6.0/genome.fa.amb
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.6.0/genome.fa.bwt
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.5.x
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.5.x/genome.fa.pac
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.5.x/genome.fa.ann
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.5.x/genome.fa
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.5.x/genome.fa.rpac
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.5.x/genome.fa.rsa
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.5.x/genome.fa.sa
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.5.x/genome.fa.amb
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.5.x/genome.fa.rbwt
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.5.x/genome.fa.bwt
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/genome.fa.sa
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/genome.fa.amb
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/genome.fa.bwt

and the error I met is as below:

Jun-17 17:41:27.965 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 11; name: NFCORE_CIRCRNA:CIRCRNA:QUANTIFICATION:TRANSCRIPTOME (transcriptome); status: COMPLETED; exit: 1; error: -; workDir: /projectnb/zhangh/Arshiya_Saxena/HumanBrain_circRNA/nf-core_circrna_pipeline-NAC/yun_test3/work/82/1063296a24d1d4c319a6bd85bdeb20]
Jun-17 17:41:27.970 [Task monitor] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
task: name=NFCORE_CIRCRNA:CIRCRNA:QUANTIFICATION:TRANSCRIPTOME (transcriptome); work-dir=/projectnb/zhangh/Arshiya_Saxena/HumanBrain_circRNA/nf-core_circrna_pipeline-NAC/yun_test3/work/82/1063296a24d1d4c319a6bd85bdeb20
error [nextflow.exception.ProcessFailedException]: Process NFCORE_CIRCRNA:CIRCRNA:QUANTIFICATION:TRANSCRIPTOME (transcriptome) terminated with an error exit status (1)
Jun-17 17:41:27.987 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'NFCORE_CIRCRNA:CIRCRNA:QUANTIFICATION:TRANSCRIPTOME (transcriptome)'

Caused by:
Process NFCORE_CIRCRNA:CIRCRNA:QUANTIFICATION:TRANSCRIPTOME (transcriptome) terminated with an error exit status (1)

Command executed:

gffread
-g genome.fa
-w transcriptome.fasta

transcriptome.filtered.gtf

cat <<-END_VERSIONS > versions.yml
"NFCORE_CIRCRNA:CIRCRNA:QUANTIFICATION:TRANSCRIPTOME":
gffread: $(gffread --version 2>&1)
END_VERSIONS

Command exit status:
1

Command output:
(empty)

Command error:
FASTA index file genome.fa.fai created.
Warning: couldn't find fasta record for 'GL000191.1'!
Error: no genomic sequence available (check -g option!).

Work dir:
/projectnb/zhangh/Arshiya_Saxena/HumanBrain_circRNA/nf-core_circrna_pipeline-NAC/yun_test3/work/82/1063296a24d1d4c319a6bd85bdeb20

Looking into a big details about 'gffread' command, I think it is expecting a 'gff' file, while the pipeline provided a 'gtf' as an input file (transcriptome.filtered.gtf) and would this be the cause of the error? if so, how could this happen? is there anything missing in my command line? We used to be able to run the pipeline around Oct 2023, or maybe even this April, but now everything seems broken.

Please help. Thank you very much!!!

@yshen16 yshen16 added the bug Something isn't working label Jun 18, 2024
@nictru
Copy link
Collaborator

nictru commented Jun 18, 2024

Hey - thanks for the issue.
The error occurs in a relatively new pipeline section which probably was not there when you last executed the pipeline. It was tested a bit already, but there are always edge cases that have not been considered yet.

Breaking this down

  1. The TRANSCRIPTOME process tries to extract the sequences of the detected circRNAs into a FASTA file based on an input GTF file. This GTF file is pipeline-internal and not provided by you.
  2. The gffread command is used for this and receives the FASTA file (genome.fa) and the transcriptome GTF file (transcriptome.filtered.gtf) as input. The differences between GTF and GFF should not be relevant here, works in other test cases.
  3. The error says, Warning: couldn't find fasta record for 'GL000191.1'!. This suggests that there are entries in the GTF file that are located on a chromosome that are not present in the FASTA file. So now the question arises: How can we detect circRNAs that are not present in the provided reference genome sequence?
  4. Answer: They can't. But in the QUANTIFICATION subworkflow, a common quantification of the circular and linear transcriptome is performed (the goal is to allow unbiased correlation analyses). So before the GFFREAD step, the detected circRNAs are merged with all the linear transcript locations you provide in the input GTF file. Most likely, the problematic regions originate from the input GTF file.

TL,DR:

The GTF file you provide as input most likely contains entries that are not present in the input FASTA file.

How to fix?

This is something that the pipeline should be able to handle, and it is not complicated to implement. Can probably do this later this week. If you need this more urgently, you could manually remove the problematic entries from the GTF file.

Anything else?

As the pipeline is still under a lot of development, please make sure you use the latest version (nextflow pull nf-core/circrna). Recently, the module parameter has been deprecated (still present in your CLI call).

@yshen16
Copy link
Author

yshen16 commented Jun 21, 2024

Thank you very much for taking time to write me back. Really appreciate it. I wonder if you can provide me some programable way to exclude those transcriptome entries in gtf file that are not in genome.fa? I feel there must be some tools to handle it. I'll look for such tool myself as well. And I'll look forward to hearing the good news from you that the pipeline is fixed.

@nictru nictru changed the title "Error: no genomic sequence available (check -g option!)." error in 'gffread' command for task NFCORE_CIRCRNA:CIRCRNA:QUANTIFICATION:TRANSCRIPTOME error GTF entries on sequence which is not present in FASTA Jun 23, 2024
@nictru
Copy link
Collaborator

nictru commented Jun 23, 2024

Hey, could you check out the branch associated with this issue and give it a try? You can do this by adding -r 151-error-no-genomic-sequence-available-check-g-option-error-in-gffread-command-for-task-nfcore_circrnacircrnaquantificationtranscriptome-error to the nextflow run command. Sorry for the long name

@yshen16
Copy link
Author

yshen16 commented Jun 26, 2024

Hey, could you check out the branch associated with this issue and give it a try? You can do this by adding -r 151-error-no-genomic-sequence-available-check-g-option-error-in-gffread-command-for-task-nfcore_circrnacircrnaquantificationtranscriptome-error to the nextflow run command. Sorry for the long name

Thank you, Nictru, I will test it out and let you know how it goes.

@nictru
Copy link
Collaborator

nictru commented Jul 2, 2024

Any updates?

@yshen16
Copy link
Author

yshen16 commented Jul 2, 2024

Sorry for the delay, and I ran the new dev version of the pipeline and here are the results:

@yshen16
Copy link
Author

yshen16 commented Jul 2, 2024

  1. the test only run finished successfully:
    nextflow run nf-core/circrna -profile test,singularity --outdir test_afterfix -r 151-error-no-genomic-sequenc e-available-check-g-option-error-in-gffread-command-for-task-nfcore_circrnacircrnaquantificationtranscriptome- error

@nictru
Copy link
Collaborator

nictru commented Jul 2, 2024

Okay great! Will create a PR

@nictru nictru linked a pull request Jul 2, 2024 that will close this issue
@yshen16
Copy link
Author

yshen16 commented Jul 2, 2024

However, the actual pipeline run with the user data and configuration failed at the 'trimgalore', or more specifically, 'cutadapt' stage:
error info:
`ERROR: Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/cutadapt/pipeline.py", line 626, in run
raise e
EOFError: Compressed file ended before the end-of-stream marker was reached

cutadapt: error: Compressed file ended before the end-of-stream marker was reached

Cutadapt terminated with exit signal: '256'.
Terminating Trim Galore run, please check error message(s) to get an idea what went wrong...
forDebug.zip
`
I attached a zip file which contains the .command.sh and .command.err in it. The actual sequence files are too big to attach as a file here. If needed I can send it to you some other way.

@nictru
Copy link
Collaborator

nictru commented Jul 2, 2024

Hey, could you verify that the corresponding input fastq file has not been corrupted? Maybe a download problem or a problem during compression?

@yshen16
Copy link
Author

yshen16 commented Jul 3, 2024

Yes, you are right. It turned out the region2 read file was truncated. I will try to find if I can get the good sequence files and rerun the test. Thank you for pointing this out to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants