Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A short "genome" sequence causes a hang, not a fail #9

Open
rmcolq opened this issue Feb 15, 2019 · 8 comments
Open

A short "genome" sequence causes a hang, not a fail #9

rmcolq opened this issue Feb 15, 2019 · 8 comments
Labels

Comments

@rmcolq
Copy link

rmcolq commented Feb 15, 2019

I realise this is not the typical use case, but I was trying to simulate reads with the error profile of nanopore and did not care about the length. I ran the example case just fine, but found that when I simulated reads from a short sequence (2000bps) this caused nanosim-h to start and just hang until my server timed out. Adding 15,000 "A"s to the beginning and end of my short sequence enabled it to work. A sanity check on user input would be a good idea.

nanosim-h -p ecoli_R9_2D -n 500 random_path.fa --unalign-rate 0 --max-len 10000
@karel-brinda
Copy link
Owner

Hi Rachel, thanks for the report! I completely agree that this needs to be fixed.

@mbhall88
Copy link
Contributor

mbhall88 commented Mar 5, 2019

For anyone coming across this, I found a workaround.
I was working with a small toy reference 300bp long. If run with default parameters it hangs forever due to the length of one of the simulated reads being longer than my actual reference. So if you set --max-len to be the length of your reference then it works fine.

@karel-brinda I wonder if max_readlength should be set by default to the length of the reference rather than infinity?

max_readlength = float("inf")

It doesn't make sense to be able to generate reads longer than the reference?

@mbhall88
Copy link
Contributor

mbhall88 commented Mar 5, 2019

Actually, I have been digging into this a little more and found something interesting. If I indeed set --max-len to the length of the reference (300) it gets part of the way through and then hangs. If I take that down to about 290 it runs fine.
But if I also set --min-len to the same as --max-len then it just hangs.

However, if I make --min-len 1 less than --max-len it has an ETA of 2 hours for 1000 reads - this seems to be expected though as mentioned in the nanosim README

If you want the fasta file I am working with I'll attach it below (it says it's a .txt file but it is actually fasta).

test.txt

@ExplodingCabbage
Copy link
Contributor

ExplodingCabbage commented Jun 21, 2019

Pretty sure there's a bug in extract_read. In the use-case where your reference sequence is a single short non-circular chromosome (perhaps because the "chromosome" is really a PCR amplicon you want to simulate reads of), it is highly likely that get_length will return some values greater than your sequence length. When extract_read gets called with such a length value, length gets set to max(seq_len.values()) which in turn equals genome_len. Then we repeatedly try to generate a read start position with this line:

ref_pos = random.randint(0, genome_len)

Then we hit this if/elif/else block:

if ref_pos + length < seq_len[key]:
    new_read = seq_dict[key][ref_pos:ref_pos + length]
    read_info = (key, ref_pos)
    break
elif ref_pos < seq_len[key]:
    break
else:
    ref_pos -= seq_len[key]

But even in the best case where ref_pos is 0, ref_pos + length < seq_len[key] is guaranteed to be false in the single-short-chromosome scenario because length equals seq_len[key]. So new_read never gets set and we stay in the while True: loop generating new random start positions forever.

I'll have a fiddle around and see if there's an easy fix.

@ExplodingCabbage
Copy link
Contributor

#16 fixes the hang. However, when running against a short reference with that fix applied, I now see this crash instead:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/bin/nanosim-h", line 11, in <module>
    load_entry_point('NanoSim-H', 'console_scripts', 'nanosim-h')()
  File "/Users/markamery/NanoSim-H/nanosimh/__init__.py", line 19, in nanosimh_simulate
    simulate.main()
  File "/Users/markamery/NanoSim-H/nanosimh/simulate.py", line 1010, in main
    min_l=min_readlength, merge=merge, rnf=rnf, rnf_cigar=rnf_cigar
  File "/Users/markamery/NanoSim-H/nanosimh/simulate.py", line 455, in simulation
    read_mutated, cigar = mutate_read(new_read, new_read_name, out_error, error_dict, kmer_bias)
  File "/Users/markamery/NanoSim-H/nanosimh/simulate.py", line 702, in mutate_read
    tmp_bases.remove(read[key + i])
IndexError: string index out of range

I'll keep debugging. :)

@ExplodingCabbage
Copy link
Contributor

I've pushed another commit to #16 to fix the error mentioned above.

@ExplodingCabbage
Copy link
Contributor

I think this is fixed on devel and can be closed?

Not yet released to master or PyPI, though.

@karel-brinda
Copy link
Owner

It's included in the devel branch, but not yet in master (sorry for the delay).

You can install the devel version by:

pip install git+https://github.com/karel-brinda/NanoSim-H@devel

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants