FR: Use temporary directories for sra and temporary fastq files #961
Comments
I guess this could work, and it makes sense. I do think there are some issues though, specifically when running on e.g. Slurm. I'm not sure how to do this nicely. Caveats: …
I don't think …. An alternative without using tmpdir would be to set the ….
Also, the sra files and the fastq files are all removed when no longer needed in the newer seq2science versions.
The problem is not disk space, as the files are indeed removed when no longer needed, but IO on the NFS, which leads to slow response of our whole server. If I understand it correctly, a rule along the lines of `rule download_fastq_whatever:` could solve this.
The $TMPDIR can be set using the `tmpdir` resource. This should work in a Slurm context, as it will delay the evaluation of that variable until the job runs.
The effect is that IO on the NFS is only for the final fastq output (which is inevitable anyway), and the rest happens on a local $TMPDIR.
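A minimal sketch of what such a rule could look like (the rule name, paths, and single-end assumption are mine, and it presumes sra-tools' prefetch and fasterq-dump are on the PATH):

```python
import os

# Hypothetical sketch: download and convert inside the per-job tmpdir,
# so only the final gzipped fastq is ever written to the NFS mount.
rule download_fastq_whatever:
    output:
        "fastq/{sample}.fastq.gz"
    resources:
        # the lambda delays evaluation, so $TMPDIR is resolved on the
        # compute node rather than on the submit host
        tmpdir=lambda wildcards: os.environ.get("TMPDIR", "/tmp")
    shell:
        """
        prefetch {wildcards.sample} --output-directory {resources.tmpdir}
        fasterq-dump {resources.tmpdir}/{wildcards.sample} --outdir {resources.tmpdir}
        gzip -c {resources.tmpdir}/{wildcards.sample}.fastq > {output}
        """
```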
Thinking a bit about it again, I guess your original proposal actually could work: (dynamically) "grouping" and "shadowing" the rules, and setting the shadow-prefix to …. I guess a new config variable should specify whether or not the temp-dir setting is needed, and where that would be: https://github.com/vanheeringen-lab/seq2science/blob/master/seq2science/schemas/config/download.schema.yaml
Those rules then need dynamic …. The only question is whether it works nicely with the tmpdir resource.
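For illustration, a sketch of what the shadow variant might look like (names are made up; `shadow: "minimal"` symlinks the inputs and keeps all other writes inside the shadow directory, which Snakemake cleans up afterwards):

```python
# Hypothetical sketch: the rule runs inside a shadow directory, so the
# intermediate .sra and .fastq files never touch the shared filesystem
# when the shadow prefix points at local storage.
rule sra2fastq:
    output:
        "fastq/{sample}.fastq.gz"
    shadow:
        "minimal"
    shell:
        "prefetch {wildcards.sample} && "
        "fasterq-dump {wildcards.sample} && "
        "gzip -c {wildcards.sample}.fastq > {output}"
```

combined with something like `snakemake --shadow-prefix "$TMPDIR"` so the shadow directories are created on local disk; declared outputs are moved back to their real location when the rule finishes.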
See #963 for an initial PR. I think shadowing is not even necessary. As far as I understand it, snakemake will set the $TMPDIR for the job based on the tmpdir resource. The …. Due to merging the SRA download and fastq-dump into one rule, using …. Let me know if there are other potential issues with this solution.
I don't think I like this approach 😇. For example: …. Doesn't changing … ?
That's why you set/use ….
Kinda.. but...
I see your point. I'm not sure this is always the case in practice though, due to slow NFS performance in the current configuration. I will think a bit more on it :).
I'll do some benchmarking of this PR relative to the develop branch. It could be that the improved performance of the SRA to fastq conversion outweighs the downside of not having as many parallel downloads? Other than that, I'm not sure there is another viable solution. (This fix does solve my immediate problems, so at the least it was good for that purpose ;))
I do, yes. But I don't think everyone does. The problem you're having seems specific to your hardware setup, as I've never gotten any of these issues. Way back on sara I think I set parallel-downloads to +/- 20 and all went fine. Similarly, when using /scratch on the servers it doesn't cause latency issues. Perhaps using …. I still think setting ….
Or just run it with fewer parallel downloads, ….
I don't think the parallel downloads are the problem, but the SRA to fastq conversion (limiting parallel downloads is just an unfortunate side effect of grouping the download and conversion as I have done here). I could be wrong, and it could indeed be an issue with our NFS config. In my case …. Feel free to close the PR. I'm happy with it on my side (everything is running smoothly now), but I can see why you would not be happy with it being a default.
I'm all for fixing needless IO, but if prefetch cannot be piped into fastq-dump, we're still stuck with writing sras. That said, we can limit how many of those exist at any one moment. #968 prioritizes finishing a fastq over downloading sras, and can easily be expanded to include trimming and beyond: #969. And this PR is not mutually exclusive with that.
Our version of sra-tools was old. #970 has the latest version, which received a few bugfixes. That should mean fewer retries and less file moving. To optimize for your setup, I think we should go with PRs #969 and #970, and you should try …. While looking into prefetch I noticed something else: you could consider downloading SRA Lite files (with simplified quality scores, and thus smaller) instead of the full SRAs. Their docs.
This will all help, but don't worry too much about my use case. I'm happy with the code of my own PR on my side ;) I just wonder if this is a broader issue, but that would need to be tested. If you don't experience slowdown (as in the server really becoming unresponsive, waiting seconds for just an SSH login), then this may be due to our configuration. Don't spend too much (or any) time on it for my sake. The thing is, I'm not 100% sure this is the underlying issue. What I do know is that with the code of this PR I don't experience it anymore.
Useful pipeline you have here ;)
Is your feature request related to a problem? Please describe.
When working on an NFS-mounted system, it seems like downloading SRA files and converting them to fastq takes longer than needed or expected. In addition, IO becomes very slow on that NFS system, to the point that it's noticeable in other applications and while working on the command line. My (lightly tested) hunch is that this happens when SRA files are converted to FASTQ, as this means reading and writing to disk with potentially many parallel processes. Most likely, you will not want to keep the SRA files anyway.
Describe the solution you'd like
I think this could be solved by using $TMPDIR with the snakemake `tmpdir` resource and/or shadow rules. However, you may have already tried this, and refrained from implementing it due to other issues that occur. If this would be relevant, it would maybe also make sense to couple the SRA and the FASTQ rules, so that the SRA file is only used internally to the rule, and can be kept on the temporary filesystem and deleted when the rule finishes successfully. The downside is that you may get duplicated code, as the SRA download would be copied. On the other hand, this could also be implemented in a bash script, which could then be re-used.
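Such a re-usable script could look roughly like this (a sketch only; the script name, arguments, and single-end assumption are all mine, and it assumes sra-tools provides prefetch and fasterq-dump):

```bash
#!/usr/bin/env bash
# Hypothetical sketch: fetch one SRA accession into node-local scratch,
# convert it there, and only move the final fastq onto shared storage.
set -euo pipefail

accession="$1"   # e.g. an SRR accession
final_dir="$2"   # directory on the shared (NFS) filesystem

scratch="$(mktemp -d "${TMPDIR:-/tmp}/sra2fastq.XXXXXX")"
trap 'rm -rf "$scratch"' EXIT    # the SRA file is deleted either way

prefetch "$accession" --output-directory "$scratch"
fasterq-dump "$scratch/$accession" --outdir "$scratch"
gzip "$scratch/$accession.fastq"        # single-end case for brevity
mv "$scratch/$accession.fastq.gz" "$final_dir/"
```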
Would this be worth considering (I'd be happy to supply a PR)?