filter_vep is too slow and should be replaced #522

Open
fellen31 opened this issue Jan 10, 2025 · 3 comments
Labels
enhancement New feature or request

Comments

@fellen31
Collaborator

Description of feature

filter_vep is currently painfully slow for SNVs and INDELs (~2h 45m).

Other tools should be able to filter on HGNC ID, for example bcftools, which completes in a couple of minutes.

bcftools +split-vep pr_000_000_snv_annotated_ranked.vcf.gz --gene-list hgnc_ids_with_hgnc_prefix.txt --gene-list-fields HGNC_ID -c HGNC_ID -x

It keeps the exact same sites, but does not remove annotations for HGNC IDs that are not a match. For example, only HGNC:4053 is kept with filter_vep:

CSQ=T|upstream_gene_variant|MODIFIER|ISG15|ENSG00000187608|Transcript|ENST00000624652|protein_coding|||||||||||4977|1|cds_end_NF|HGNC|HGNC:4053||3|||ENSP00000485313||A0A096LNZ9.50|UPI00053BD5BA|||Ensembl||G|G|||||||||||0.501|0.01|||||||||,T|upstream_gene_variant|MODIFIER|ISG15|ENSG00000187608|Transcript|ENST00000624697|protein_coding|||||||||||4970|1||HGNC|HGNC:4053||3|||ENSP00000485643||A0A096LPJ4.45|UPI0004F23698|||Ensembl||G|G|||||||||||0.501|0.01|||||||||;

but with split-vep both HGNC:4053 and HGNC:24149 are kept, even though only HGNC:4053 was in hgnc_ids_with_hgnc_prefix.txt:

CSQ=T|downstream_gene_variant|MODIFIER|HES4|ENSG00000188290|Transcript|ENST00000304952|protein_coding|||||||||||2796|-1||HGNC|HGNC:24149|YES|1|P1|CCDS5.1|ENSP00000304595|Q9HCC6.162||UPI000006EC19|||Ensembl||G|G||||||||||||0.00|||||||||,T|downstream_gene_variant|MODIFIER|HES4|ENSG00000188290|Transcript|ENST00000428771|protein_coding|||||||||||2794|-1||HGNC|HGNC:24149||2||CCDS44034.1|ENSP00000393198||E9PB28.80|UPI0001881B51|||Ensembl||G|G||||||||||||0.00|||||||||,T|downstream_gene_variant|MODIFIER|HES4|ENSG00000188290|Transcript|ENST00000481869|retained_intron|||||||||||2798|-1||HGNC|HGNC:24149||2|||||||||Ensembl||G|G||||||||||||0.00|||||||||,T|downstream_gene_variant|MODIFIER|HES4|ENSG00000188290|Transcript|ENST00000484667|protein_coding|||||||||||2802|-1||HGNC|HGNC:24149||3||CCDS90837.1|ENSP00000425085||D6REB3.79|UPI0001D3BBEE|||Ensembl||G|G||||||||||||0.00|||||||||,T|non_coding_transcript_exon_variant|MODIFIER||ENSG00000272512|Transcript|ENST00000606034|lncRNA|1/1||ENST00000606034.1:n.1884C>A||1884|||||||-1||||YES||||||||||Ensembl||G|G|||||||||||||||||||||,T|upstream_gene_variant|MODIFIER|ISG15|ENSG00000187608|Transcript|ENST00000624652|protein_coding|||||||||||4977|1|cds_end_NF|HGNC|HGNC:4053||3|||ENSP00000485313||A0A096LNZ9.50|UPI00053BD5BA|||Ensembl||G|G|||||||||||0.501|0.01|||||||||,T|upstream_gene_variant|MODIFIER|ISG15|ENSG00000187608|Transcript|ENST00000624697|protein_coding|||||||||||4970|1||HGNC|HGNC:4053||3|||ENSP00000485643||A0A096LPJ4.45|UPI0004F23698|||Ensembl||G|G|||||||||||0.501|0.01|||||||||,T|downstream_gene_variant|MODIFIER|HES4|57801|Transcript|NM_001142467.2|protein_coding|||||||||||2796|-1||EntrezGene|HGNC:24149|||||NP_001135939.1||||||RefSeq||G|G||||||||||||0.00|||||||||,T|downstream_gene_variant|MODIFIER|HES4|57801|Transcript|NM_021170.4|protein_coding|||||||||||2796|-1||EntrezGene|HGNC:24149|YES||||NP_066993.1||||||RefSeq||G|G||||||||||||0.00|||||||||;
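If the extra annotations turn out to matter, one option would be a small post-processing step that drops the non-matching CSQ entries after split-vep has selected the sites. The following is only a rough sketch of that idea, not something that exists in the pipeline: the function names `csq_field_index` and `prune_csq` are made up for illustration, and it assumes the CSQ header carries the usual `Format: A|B|C` description and that the HGNC ID column is called `HGNC_ID`.

```python
def csq_field_index(csq_header: str, name: str) -> int:
    """Return the position of `name` in the 'Format: A|B|C' part of a CSQ header line."""
    fmt = csq_header.split("Format: ", 1)[1].strip().rstrip('">')
    return fmt.split("|").index(name)


def prune_csq(info: str, keep_ids: set, hgnc_index: int) -> str:
    """Keep only CSQ entries whose HGNC ID is in keep_ids; other INFO keys are untouched."""
    fields = []
    for field in info.split(";"):
        if field.startswith("CSQ="):
            entries = field[len("CSQ="):].split(",")
            kept = [e for e in entries if e.split("|")[hgnc_index] in keep_ids]
            if not kept:
                continue  # no entry matched: drop the CSQ key entirely
            field = "CSQ=" + ",".join(kept)
        fields.append(field)
    # An empty INFO column is written as "." in VCF
    return ";".join(fields) or "."


if __name__ == "__main__":
    # Toy header with a shortened field list (assumption: real headers have many more fields)
    header = ('##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations '
              'from Ensembl VEP. Format: Allele|Consequence|SYMBOL|HGNC_ID">')
    idx = csq_field_index(header, "HGNC_ID")
    info = ("DP=10;CSQ=T|downstream_gene_variant|HES4|HGNC:24149,"
            "T|upstream_gene_variant|ISG15|HGNC:4053")
    print(prune_csq(info, {"HGNC:4053"}, idx))
```

Looking the column up from the header rather than hard-coding a position keeps the sketch independent of which VEP fields are enabled. Whether a pure-Python pass like this would stay fast enough on a full SNV file is untested.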

@fellen31 fellen31 added the enhancement New feature or request label Jan 10, 2025
@jemten
Collaborator

jemten commented Jan 10, 2025

Wow that's quite the difference! One issue might be that when getting the most severe consequence we might get it from the "wrong" hgnc id and then genmod will score that consequence rather than the intended. Just needs some thought to make sure we get it right 😅

@fellen31
Collaborator Author

> Wow that's quite the difference! One issue might be that when getting the most severe consequence we might get it from the "wrong" hgnc id and then genmod will score that consequence rather than the intended. Just needs some thought to make sure we get it right 😅

I added filtering at the very end (one reason why it might be slower than in raredisease, although it looks like it still takes an hour or so there), so the most severe consequence is already set and everything else should be the same I think.

@dnil

dnil commented Jan 10, 2025

Well, I agree with @jemten here: as long as the most severe consequences are not touched, we should be reasonably good. I'm guessing the cutoff distances for "upstream" etc. should be OK, but if they are generous we may see more variants with multiple genes matched. Those take longer for the analysts to look at. Not a whole lot, but one has to consider each of them, note that they are not in the panel, and see that the consequence is just nothing. So it becomes man-hours spent versus runtime if this affects more than just a handful of variants.
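To get a feel for how many variants this concerns, one could count the sites that match the panel but also carry annotations for genes outside it. Below is a minimal sketch under stated assumptions: `count_off_panel_sites` is a hypothetical helper, it takes the raw CSQ strings (without the `CSQ=` prefix) plus the position of the HGNC ID column, and it ignores entries with an empty HGNC ID, which may or may not be the right call for the analysts.

```python
def count_off_panel_sites(csq_values, keep_ids, hgnc_index):
    """Count sites that match the panel but also carry annotations for other genes."""
    affected = 0
    for csq in csq_values:
        ids = {entry.split("|")[hgnc_index] for entry in csq.split(",")}
        ids.discard("")  # assumption: entries without an HGNC ID are not counted
        if ids & keep_ids and ids - keep_ids:
            affected += 1
    return affected


if __name__ == "__main__":
    # Toy CSQ strings with a shortened layout: Allele|Consequence|SYMBOL|HGNC_ID
    sites = [
        "T|downstream_gene_variant|HES4|HGNC:24149,T|upstream_gene_variant|ISG15|HGNC:4053",
        "T|upstream_gene_variant|ISG15|HGNC:4053",
    ]
    print(count_off_panel_sites(sites, {"HGNC:4053"}, 3))
```

Run on the split-vep output, a count like this would show whether it really is just a handful of variants.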
