filter_vep is too slow and should be replaced #522

fellen31 · 2025-01-10T14:47:24Z

Description of feature

filter_vep is currently painfully slow for SNVs and INDELs (~2h 45m).

Other tools should be able to filter on HGNC ID, for example bcftools, which completes in a couple of minutes:

bcftools +split-vep pr_000_000_snv_annotated_ranked.vcf.gz --gene-list hgnc_ids_with_hgnc_prefix.txt --gene-list-fields HGNC_ID -c HGNC_ID -x

It keeps the exact same sites, but does not remove annotations for HGNC IDs that are not a match. E.g. only HGNC:4053 is kept with filter_vep:

CSQ=T|upstream_gene_variant|MODIFIER|ISG15|ENSG00000187608|Transcript|ENST00000624652|protein_coding|||||||||||4977|1|cds_end_NF|HGNC|HGNC:4053||3|||ENSP00000485313||A0A096LNZ9.50|UPI00053BD5BA|||Ensembl||G|G|||||||||||0.501|0.01|||||||||,T|upstream_gene_variant|MODIFIER|ISG15|ENSG00000187608|Transcript|ENST00000624697|protein_coding|||||||||||4970|1||HGNC|HGNC:4053||3|||ENSP00000485643||A0A096LPJ4.45|UPI0004F23698|||Ensembl||G|G|||||||||||0.501|0.01|||||||||;

but with split-vep both HGNC:4053 and HGNC:24149 are kept, even though only HGNC:4053 was in hgnc_ids_with_hgnc_prefix.txt:

CSQ=T|downstream_gene_variant|MODIFIER|HES4|ENSG00000188290|Transcript|ENST00000304952|protein_coding|||||||||||2796|-1||HGNC|HGNC:24149|YES|1|P1|CCDS5.1|ENSP00000304595|Q9HCC6.162||UPI000006EC19|||Ensembl||G|G||||||||||||0.00|||||||||,T|downstream_gene_variant|MODIFIER|HES4|ENSG00000188290|Transcript|ENST00000428771|protein_coding|||||||||||2794|-1||HGNC|HGNC:24149||2||CCDS44034.1|ENSP00000393198||E9PB28.80|UPI0001881B51|||Ensembl||G|G||||||||||||0.00|||||||||,T|downstream_gene_variant|MODIFIER|HES4|ENSG00000188290|Transcript|ENST00000481869|retained_intron|||||||||||2798|-1||HGNC|HGNC:24149||2|||||||||Ensembl||G|G||||||||||||0.00|||||||||,T|downstream_gene_variant|MODIFIER|HES4|ENSG00000188290|Transcript|ENST00000484667|protein_coding|||||||||||2802|-1||HGNC|HGNC:24149||3||CCDS90837.1|ENSP00000425085||D6REB3.79|UPI0001D3BBEE|||Ensembl||G|G||||||||||||0.00|||||||||,T|non_coding_transcript_exon_variant|MODIFIER||ENSG00000272512|Transcript|ENST00000606034|lncRNA|1/1||ENST00000606034.1:n.1884C>A||1884|||||||-1||||YES||||||||||Ensembl||G|G|||||||||||||||||||||,T|upstream_gene_variant|MODIFIER|ISG15|ENSG00000187608|Transcript|ENST00000624652|protein_coding|||||||||||4977|1|cds_end_NF|HGNC|HGNC:4053||3|||ENSP00000485313||A0A096LNZ9.50|UPI00053BD5BA|||Ensembl||G|G|||||||||||0.501|0.01|||||||||,T|upstream_gene_variant|MODIFIER|ISG15|ENSG00000187608|Transcript|ENST00000624697|protein_coding|||||||||||4970|1||HGNC|HGNC:4053||3|||ENSP00000485643||A0A096LPJ4.45|UPI0004F23698|||Ensembl||G|G|||||||||||0.501|0.01|||||||||,T|downstream_gene_variant|MODIFIER|HES4|57801|Transcript|NM_001142467.2|protein_coding|||||||||||2796|-1||EntrezGene|HGNC:24149|||||NP_001135939.1||||||RefSeq||G|G||||||||||||0.00|||||||||,T|downstream_gene_variant|MODIFIER|HES4|57801|Transcript|NM_021170.4|protein_coding|||||||||||2796|-1||EntrezGene|HGNC:24149|YES||||NP_066993.1||||||RefSeq||G|G||||||||||||0.00|||||||||;
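If the extra annotations turn out to be a problem, one option could be to prune them after the split-vep site filtering. Below is a minimal sketch of that idea, not something that exists in the pipeline: `csq_fields_from_header` and `prune_csq` are hypothetical helper names, and the four-field CSQ layout in the usage example is shortened for illustration (a real layout should be read from the `Format:` part of the CSQ header line in the VCF).

```python
def csq_fields_from_header(header_line):
    """Return the CSQ sub-field names from a VEP '##INFO=<ID=CSQ,...>' header line.

    Assumes the description ends with 'Format: A|B|C">', as VEP writes it.
    """
    format_part = header_line.split("Format: ", 1)[1]
    return format_part.rstrip('">\n').split("|")


def prune_csq(csq_value, csq_fields, keep_ids, id_field="HGNC_ID"):
    """Keep only the CSQ entries whose HGNC_ID is in keep_ids.

    csq_value is the text after 'CSQ=' (comma-separated entries,
    pipe-separated sub-fields). Entries without a matching ID are dropped.
    """
    idx = csq_fields.index(id_field)
    kept = []
    for entry in csq_value.split(","):
        values = entry.split("|")
        if len(values) > idx and values[idx] in keep_ids:
            kept.append(entry)
    return ",".join(kept)


# Usage with a shortened, made-up field layout (illustration only).
header = '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|SYMBOL|HGNC_ID">'
fields = csq_fields_from_header(header)
csq = "T|downstream_gene_variant|HES4|HGNC:24149,T|upstream_gene_variant|ISG15|HGNC:4053"
pruned = prune_csq(csq, fields, {"HGNC:4053"})
print(pruned)
```

With the panel containing only HGNC:4053, the HES4 entry is dropped and the ISG15 entry is kept, which is the behaviour described for filter_vep above. Anything derived from the CSQ field beforehand (such as the most severe consequence) is not touched by this.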

jemten · 2025-01-10T14:53:16Z

Wow that's quite the difference! One issue might be that when getting the most severe consequence we might get it from the "wrong" hgnc id and then genmod will score that consequence rather than the intended. Just needs some thought to make sure we get it right 😅

fellen31 · 2025-01-10T14:57:53Z

> Wow that's quite the difference! One issue might be that when getting the most severe consequence we might get it from the "wrong" hgnc id and then genmod will score that consequence rather than the intended. Just needs some thought to make sure we get it right 😅

I added filtering at the very end (one reason why it might be slower than in raredisease, although it looks like it still takes an hour or so there), so the most severe consequence is already set and everything else should be the same I think.

dnil · 2025-01-10T15:10:42Z

Well, agree with @jemten here - as long as the most severe consequences are not touched we should be reasonably good. I'm guessing the cutoff distances for "upstream" etc. should be ok, but if they are generous we may be seeing more variants with multiple genes matched. Those do take longer for the analysts to look at - not a whole lot, but one has to consider each of them, note that they are not in the panel and see that the consequence is just nothing. So it is man-hours spent vs. the runtime, if more than just a handful of variants are affected.
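To get a feel for whether this is a handful of variants or more, one could count how many records carry annotations for genes outside the panel. A rough sketch, assuming the CSQ values have already been pulled out of the VCF as strings; the function names and the shortened four-field layout are made up for illustration:

```python
def has_off_panel_gene(csq_value, csq_fields, panel_ids, id_field="HGNC_ID"):
    """True if any CSQ entry has a non-empty HGNC_ID that is not in the panel."""
    idx = csq_fields.index(id_field)
    for entry in csq_value.split(","):
        values = entry.split("|")
        hgnc_id = values[idx] if len(values) > idx else ""
        if hgnc_id and hgnc_id not in panel_ids:
            return True
    return False


def count_affected(csq_values, csq_fields, panel_ids):
    """Count the records with at least one off-panel gene annotation."""
    return sum(has_off_panel_gene(v, csq_fields, panel_ids) for v in csq_values)


# Shortened, made-up CSQ layout and records (illustration only).
fields = ["Allele", "Consequence", "SYMBOL", "HGNC_ID"]
records = [
    "T|upstream_gene_variant|ISG15|HGNC:4053",
    "T|downstream_gene_variant|HES4|HGNC:24149,T|upstream_gene_variant|ISG15|HGNC:4053",
    "T|non_coding_transcript_exon_variant||",
]
affected = count_affected(records, fields, {"HGNC:4053"})
print(affected)
```

Entries with an empty HGNC_ID (such as the lncRNA one) are not counted as off-panel here; whether they should be is a judgment call.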

fellen31 added the enhancement label Jan 10, 2025

github-project-automation bot added this to Nallo Jan 10, 2025
