GTF entries on sequence which is not present in FASTA #151

yshen16 · 2024-06-18T17:08:55Z

Description of the bug

recently I tried to help a user to run the nf-core/circRNA pipeline, but met tons of errors. And this is one of the latest after I downloaded the reference genome from aws, and rerun everything from scratch, that is, I removed all the residual ~/.nextflow, ~/.singularity and all that in my working directory when I run the command. Here is the command line I ran:

module purge
module load java/17.0.8
module load nextflow/23.04.3 # old 21.10.6 doesn't work

#Set the new Singularity cache directory path
export SING_DIR=/projectnb/zhangh/Arshiya_Saxena/HumanBrain_circRNA/nf-core_circrna_pipeline-NAC/yun_test3

# Export the NXF_SINGULARITY_CACHEDIR environment variable to point to the new cache directory
export NXF_SINGULARITY_CACHEDIR=$SING_DIR/.singularity

# Create the new Singularity cache directory if it doesn't already exist
mkdir -p $NXF_SINGULARITY_CACHEDIR

# Remove the existing ~/.singularity symbolic link
rm -rf ~/.singularity

# Create a new symbolic link for ~/.singularity that points to the new cache directory
ln -s $NXF_SINGULARITY_CACHEDIR ~/.singularity

cd $SING_DIR

# Yun Test_rerun on 06/16/2024:
nextflow run nf-core/circrna -profile singularity --genome GRCh37 --igenomes_base /projectnb/zhangh/Arshiya_Saxena/HumanBrain_circRNA/ref/references --input samples.csv --module 'circrna_discovery' --outdir test3_out -r dev

And here is my local references folder structure:
./Homo_sapiens
./Homo_sapiens/Ensembl
./Homo_sapiens/Ensembl/GRCh37
./Homo_sapiens/Ensembl/GRCh37/Annotation
./Homo_sapiens/Ensembl/GRCh37/Annotation/SmallRNA
./Homo_sapiens/Ensembl/GRCh37/Annotation/SmallRNA/mature.fa
./Homo_sapiens/Ensembl/GRCh37/Annotation/SmallRNA/hairpin.fa
./Homo_sapiens/Ensembl/GRCh37/Annotation/Genes
./Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf
./Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.bed
./Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/refFlat.txt.gz
./Homo_sapiens/Ensembl/GRCh37/Sequence
./Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta
./Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/genome.dict
./Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/genome.fa
./Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/genome.fa.index
./Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/genome.fa.fai
./Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/GenomeSize.xml
./Homo_sapiens/Ensembl/GRCh37/Sequence/BowtieIndex
./Homo_sapiens/Ensembl/GRCh37/Sequence/BowtieIndex/genome.rev.2.ebwt
./Homo_sapiens/Ensembl/GRCh37/Sequence/BowtieIndex/genome.2.ebwt
./Homo_sapiens/Ensembl/GRCh37/Sequence/BowtieIndex/genome.1.ebwt
./Homo_sapiens/Ensembl/GRCh37/Sequence/BowtieIndex/genome.fa
./Homo_sapiens/Ensembl/GRCh37/Sequence/BowtieIndex/genome.3.ebwt
./Homo_sapiens/Ensembl/GRCh37/Sequence/BowtieIndex/genome.4.ebwt
./Homo_sapiens/Ensembl/GRCh37/Sequence/BowtieIndex/genome.rev.1.ebwt
./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex
./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/chrName.txt
./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/exonInfo.tab
./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/SAindex
./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/Genome
./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/geneInfo.tab
./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/sjdbList.out.tab
./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/sjdbInfo.txt
./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/transcriptInfo.tab
./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/sjdbList.fromGTF.out.tab
./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/genome.fa
./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/chrLength.txt
./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/Log.out
./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/genes.gtf
./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/genomeParameters.txt
./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/chrStart.txt
./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/SA
./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/chrNameLength.txt
./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/exonGeTrInfo.tab
./Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index
./Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome.3.bt2
./Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome.1.bt2
./Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome.rev.1.bt2
./Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome.fa
./Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome.4.bt2
./Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome.2.bt2
./Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome.rev.2.bt2
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/genome.fa.pac
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/genome.fa.ann
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/genome.fa
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.6.0
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.6.0/genome.fa.pac
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.6.0/genome.fa.ann
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.6.0/genome.fa
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.6.0/genome.fa.sa
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.6.0/genome.fa.amb
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.6.0/genome.fa.bwt
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.5.x
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.5.x/genome.fa.pac
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.5.x/genome.fa.ann
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.5.x/genome.fa
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.5.x/genome.fa.rpac
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.5.x/genome.fa.rsa
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.5.x/genome.fa.sa
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.5.x/genome.fa.amb
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.5.x/genome.fa.rbwt
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.5.x/genome.fa.bwt
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/genome.fa.sa
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/genome.fa.amb
./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/genome.fa.bwt

and the error I met is as below:

Jun-17 17:41:27.965 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 11; name: NFCORE_CIRCRNA:CIRCRNA:QUANTIFICATION:TRANSCRIPTOME (transcriptome); status: COMPLETED; exit: 1; error: -; workDir: /projectnb/zhangh/Arshiya_Saxena/HumanBrain_circRNA/nf-core_circrna_pipeline-NAC/yun_test3/work/82/1063296a24d1d4c319a6bd85bdeb20]
Jun-17 17:41:27.970 [Task monitor] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
task: name=NFCORE_CIRCRNA:CIRCRNA:QUANTIFICATION:TRANSCRIPTOME (transcriptome); work-dir=/projectnb/zhangh/Arshiya_Saxena/HumanBrain_circRNA/nf-core_circrna_pipeline-NAC/yun_test3/work/82/1063296a24d1d4c319a6bd85bdeb20
error [nextflow.exception.ProcessFailedException]: Process NFCORE_CIRCRNA:CIRCRNA:QUANTIFICATION:TRANSCRIPTOME (transcriptome) terminated with an error exit status (1)
Jun-17 17:41:27.987 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'NFCORE_CIRCRNA:CIRCRNA:QUANTIFICATION:TRANSCRIPTOME (transcriptome)'

Caused by:
Process NFCORE_CIRCRNA:CIRCRNA:QUANTIFICATION:TRANSCRIPTOME (transcriptome) terminated with an error exit status (1)

Command executed:

gffread
-g genome.fa
-w transcriptome.fasta

transcriptome.filtered.gtf

cat <<-END_VERSIONS > versions.yml
"NFCORE_CIRCRNA:CIRCRNA:QUANTIFICATION:TRANSCRIPTOME":
gffread: $(gffread --version 2>&1)
END_VERSIONS

Command exit status:
1

Command output:
(empty)

Command error:
FASTA index file genome.fa.fai created.
Warning: couldn't find fasta record for 'GL000191.1'!
Error: no genomic sequence available (check -g option!).

Work dir:
/projectnb/zhangh/Arshiya_Saxena/HumanBrain_circRNA/nf-core_circrna_pipeline-NAC/yun_test3/work/82/1063296a24d1d4c319a6bd85bdeb20

Looking into a big details about 'gffread' command, I think it is expecting a 'gff' file, while the pipeline provided a 'gtf' as an input file (transcriptome.filtered.gtf) and would this be the cause of the error? if so, how could this happen? is there anything missing in my command line? We used to be able to run the pipeline around Oct 2023, or maybe even this April, but now everything seems broken.

Please help. Thank you very much!!!

The text was updated successfully, but these errors were encountered:

nictru · 2024-06-18T18:02:35Z

Hey - thanks for the issue.
The error occurs in a relatively new pipeline section which probably was not there when you last executed the pipeline. It was tested a bit already, but there are always edge cases that have not been considered yet.

Breaking this down

The TRANSCRIPTOME process tries to extract the sequences of the detected circRNAs into a FASTA file based on an input GTF file. This GTF file is pipeline-internal and not provided by you.
The gffread command is used for this and receives the FASTA file (genome.fa) and the transcriptome GTF file (transcriptome.filtered.gtf) as input. The differences between GTF and GFF should not be relevant here, works in other test cases.
The error says, Warning: couldn't find fasta record for 'GL000191.1'!. This suggests that there are entries in the GTF file that are located on a chromosome that are not present in the FASTA file. So now the question arises: How can we detect circRNAs that are not present in the provided reference genome sequence?
Answer: They can't. But in the QUANTIFICATION subworkflow, a common quantification of the circular and linear transcriptome is performed (the goal is to allow unbiased correlation analyses). So before the GFFREAD step, the detected circRNAs are merged with all the linear transcript locations you provide in the input GTF file. Most likely, the problematic regions originate from the input GTF file.

TL,DR:

The GTF file you provide as input most likely contains entries that are not present in the input FASTA file.

How to fix?

This is something that the pipeline should be able to handle, and it is not complicated to implement. Can probably do this later this week. If you need this more urgently, you could manually remove the problematic entries from the GTF file.

Anything else?

As the pipeline is still under a lot of development, please make sure you use the latest version (nextflow pull nf-core/circrna). Recently, the module parameter has been deprecated (still present in your CLI call).

yshen16 · 2024-06-21T19:24:18Z

Thank you very much for taking time to write me back. Really appreciate it. I wonder if you can provide me some programable way to exclude those transcriptome entries in gtf file that are not in genome.fa? I feel there must be some tools to handle it. I'll look for such tool myself as well. And I'll look forward to hearing the good news from you that the pipeline is fixed.

nictru · 2024-06-23T07:04:34Z

Hey, could you check out the branch associated with this issue and give it a try? You can do this by adding -r 151-error-no-genomic-sequence-available-check-g-option-error-in-gffread-command-for-task-nfcore_circrnacircrnaquantificationtranscriptome-error to the nextflow run command. Sorry for the long name

yshen16 · 2024-06-26T21:00:09Z

Hey, could you check out the branch associated with this issue and give it a try? You can do this by adding -r 151-error-no-genomic-sequence-available-check-g-option-error-in-gffread-command-for-task-nfcore_circrnacircrnaquantificationtranscriptome-error to the nextflow run command. Sorry for the long name

Thank you, Nictru, I will test it out and let you know how it goes.

nictru · 2024-07-02T13:48:23Z

Any updates?

yshen16 · 2024-07-02T14:20:23Z

Sorry for the delay, and I ran the new dev version of the pipeline and here are the results:

yshen16 · 2024-07-02T14:23:12Z

the test only run finished successfully:
nextflow run nf-core/circrna -profile test,singularity --outdir test_afterfix -r 151-error-no-genomic-sequenc e-available-check-g-option-error-in-gffread-command-for-task-nfcore_circrnacircrnaquantificationtranscriptome- error

nictru · 2024-07-02T14:43:17Z

Okay great! Will create a PR

yshen16 · 2024-07-02T14:48:01Z

However, the actual pipeline run with the user data and configuration failed at the 'trimgalore', or more specifically, 'cutadapt' stage:
error info:
`ERROR: Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/cutadapt/pipeline.py", line 626, in run
raise e
EOFError: Compressed file ended before the end-of-stream marker was reached

cutadapt: error: Compressed file ended before the end-of-stream marker was reached

Cutadapt terminated with exit signal: '256'.
Terminating Trim Galore run, please check error message(s) to get an idea what went wrong...
forDebug.zip
`
I attached a zip file which contains the .command.sh and .command.err in it. The actual sequence files are too big to attach as a file here. If needed I can send it to you some other way.

nictru · 2024-07-02T17:09:42Z

Hey, could you verify that the corresponding input fastq file has not been corrupted? Maybe a download problem or a problem during compression?

yshen16 · 2024-07-03T20:05:26Z

Yes, you are right. It turned out the region2 read file was truncated. I will try to find if I can get the good sequence files and rerun the test. Thank you for pointing this out to me.

yshen16 added the bug Something isn't working label Jun 18, 2024

nictru changed the title ~~"Error: no genomic sequence available (check -g option!)." error in 'gffread' command for task NFCORE_CIRCRNA:CIRCRNA:QUANTIFICATION:TRANSCRIPTOME error~~ GTF entries on sequence which is not present in FASTA Jun 23, 2024

nictru mentioned this issue Jun 24, 2024

Problems with empty annotation intersection #150

Closed

nictru linked a pull request Jul 2, 2024 that will close this issue

Remove GTF regions on unknown chromosomes #157

Merged

nictru closed this as completed Jul 4, 2024

nictru mentioned this issue Jul 17, 2024

FATAL: the requested image was not found #160

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GTF entries on sequence which is not present in FASTA #151

GTF entries on sequence which is not present in FASTA #151

yshen16 commented Jun 18, 2024 •

edited

Loading

nictru commented Jun 18, 2024 •

edited

Loading

yshen16 commented Jun 21, 2024

nictru commented Jun 23, 2024

yshen16 commented Jun 26, 2024

nictru commented Jul 2, 2024

yshen16 commented Jul 2, 2024

yshen16 commented Jul 2, 2024 •

edited

Loading

nictru commented Jul 2, 2024

yshen16 commented Jul 2, 2024 •

edited

Loading

nictru commented Jul 2, 2024

yshen16 commented Jul 3, 2024

GTF entries on sequence which is not present in FASTA #151

GTF entries on sequence which is not present in FASTA #151

Comments

yshen16 commented Jun 18, 2024 • edited Loading

Description of the bug

nictru commented Jun 18, 2024 • edited Loading

Breaking this down

TL,DR:

How to fix?

Anything else?

yshen16 commented Jun 21, 2024

nictru commented Jun 23, 2024

yshen16 commented Jun 26, 2024

nictru commented Jul 2, 2024

yshen16 commented Jul 2, 2024

yshen16 commented Jul 2, 2024 • edited Loading

nictru commented Jul 2, 2024

yshen16 commented Jul 2, 2024 • edited Loading

nictru commented Jul 2, 2024

yshen16 commented Jul 3, 2024

yshen16 commented Jun 18, 2024 •

edited

Loading

nictru commented Jun 18, 2024 •

edited

Loading

yshen16 commented Jul 2, 2024 •

edited

Loading

yshen16 commented Jul 2, 2024 •

edited

Loading