Very slow scheduling #47
Just to add some detail - I think I've come a bit closer to the source of the problem and may have found an inefficiency in scheduling. First, the exported dataset is rather large, possibly as large as 100 MB. Second, the folder containing the .future directory is on a small drive that went from 87% full before the job started to 100% when the job crashed (that being the cause of the crash). There's a very good chance that the slowdown has to do with the home folder filling up.

Assuming that this was in fact the problem, I want to note a few things, with the caveat that I know little about the future.batchtools internal architecture and how doFuture interacts with it. When I run a number of chunks that all share the same exported data, there seem to be 100 copies of the 100 MB object being staged in the same folder before being pushed to the remote node. Would it be possible to stage a single copy instead? Another possibility is that these stored files belong to many separate, consecutive foreach runs. If that is the case, foreach doesn't seem to clean up after itself properly, and the doFuture package doesn't seem to have a visible cleanup command (an equivalent of stopCluster()). Does any of this seem like a possible cause of the problem?
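To make the staging behaviour concrete, here is a minimal sketch of the kind of loop I am running (sizes and job counts are placeholders, and I am assuming doFuture on top of future.batchtools with a Slurm plan, not my actual code):

```r
library(foreach)
library(doFuture)
library(future.batchtools)

registerDoFuture()        # let foreach's %dopar% dispatch via futures
plan(batchtools_slurm)    # one batchtools/Slurm job per future

big <- matrix(rnorm(1.25e7), ncol = 100)   # roughly 100 MB of doubles

res <- foreach(i = 1:60) %dopar% {
  # 'big' is picked up as a global and serialized for each chunk, so a
  # copy of it gets staged under the .future/ registry before submission
  sum(big) + i
}
```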
Sorry about adding to this thread all the time, but as I continue to research what's going on, other things come to light. I found this illuminating thread (thank you, Henrik):
Further update - after a few iterations of foreach, saved objects accumulate and eventually fill the disk. In the early runs, which I monitored for an hour, all files were being deleted despite the slowness, but after a while a large number of runs accumulated: 12 full runs of my foreach loop (each run being 60 jobs, so more than 660 jobs in total) remain in the .future folder. There's some sort of cleanup issue that I would welcome help in troubleshooting.
I have been able to reproduce the problem in small scripts, and also to narrow it down quite a lot. Specifically, the following order of operations works:
The following orders of operations DO NOT work; they correspond to scripts 2.1, 2.2, and 2.3 (attached), respectively. testCluster.R works.
Chances are I need some variant of #3, but some variables set by plan() are not visible to doFuture in that case, so it runs locally and in a single process.
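Roughly, the shape of the setup that does work for me is this (a simplified sketch, not the attached scripts themselves; `run_chunks()` is a placeholder for my real code):

```r
library(foreach)
library(doFuture)
library(future.batchtools)

# Works for me: register the adaptor and set the plan once, at the top
# level, before any function that runs a foreach() loop.
registerDoFuture()
plan(batchtools_slurm)

run_chunks <- function(n = 60) {
  foreach(i = seq_len(n)) %dopar% sqrt(i)
}

res <- run_chunks()
```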
Thanks for this. Could you please update your comments to use Markdown code blocks, to make this a bit easier to read? See https://guides.github.com/features/mastering-markdown/ (the 'M↓' icon in the lower right of every comment field here), specifically the 'code' panel under the 'Examples' section.
Done. Any thoughts on the reason for the problem? In summary, the main symptoms are the slowdown (one job at a time) and the failure to clean up files in the .future folder.
I have a (hackish) solution. My suspicion was that plan() was creating a variable in the parent environment - and it is, I can see it in the %plan% function. Therefore, instead of the invocation
and this solves the problem! So this is a variant of #3, but with a globally stored plan. What's weird is that if I try to create a new plan each time, jobs collide and fill up the hard drive. This is a real (and serious) bug, and I also think that my solution of using an external variable is quite hackish and should probably be addressed inside the plan() function somehow. In any case, I think this narrows down the scope of work for the package and provides a work-around in the meantime.
I wrote too soon. My "fix" results in always-local execution. |
And now it works. The work-around is ugly: the first invocation of plan() returns the sequential strategy, so I have to call it twice to get the "latest and greatest" value of the stack global variable from the package environment. It would really help to have an accessor method to get the current stack after plan() is executed. That said, apparently my strategy of making the plan a global variable is exactly what's implemented under the hood... hmm...
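In code, the work-around looks roughly like this (a sketch; `.myPlan` and `run_chunks()` are my own names, not part of any package, and re-asserting the stored plan inside the function is how I reuse it):

```r
library(foreach)
library(doFuture)
library(future.batchtools)

registerDoFuture()

# plan(<strategy>) returns the *previous* strategy (sequential on the first
# call), so a second call with no arguments is needed to fetch the strategy
# that is now actually in effect.
plan(batchtools_slurm)
.myPlan <- plan()          # my own global; not part of the future API

run_chunks <- function(n = 60) {
  plan(.myPlan)            # re-assert the stored plan before the loop
  foreach(i = seq_len(n)) %dopar% sqrt(i)
}
```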
I am sorry not to have a reproducible example yet. My code base is very large and was running just fine until the job size became small, so I'll put together a reproducible example after hearing some suggestions on what to test.

In my case the jobs are rather small, about 10 s each. The problem I'm seeing is that they don't get scheduled very quickly. In fact, at any given time only one Slurm job, or at most two, are running (the machine they run on can handle ~15 jobs by RAM and CPU requirements). I'm trying to run 60 chunks, and to address this I set scheduling to 5, which did bump up the number of running jobs to 2-3. The main problem, however, is that the chunks seem to take 10-15 seconds to launch, and I don't know what changed; a few days ago, with larger jobs, this was not the case. So to my specific questions, before I try to generate a small reproducible example:
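For reference, this is roughly how I am setting scheduling (assuming doFuture's `.options.future` argument to foreach(); the loop body is a placeholder for my real work):

```r
library(foreach)
library(doFuture)
library(future.batchtools)

registerDoFuture()
plan(batchtools_slurm)

# 60 small (~10 s) chunks; scheduling = 5 asks for roughly five futures per
# worker, i.e. more and smaller Slurm jobs
res <- foreach(i = 1:60,
               .options.future = list(scheduling = 5)) %dopar% {
  Sys.sleep(10)
  i
}
```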