as.cluster run asynchronous parallel? #13

Open
Gibbsdavidl opened this issue Sep 19, 2018 · 1 comment

Comments

@Gibbsdavidl

Hi there,
Great package! Really cool. I am using the plan() function along with as.cluster(), where Docker images are pulled on the VMs. The issue is that this process runs sequentially over the list of VMs, when it seems like it could/should happen asynchronously in parallel. That would save a lot of time if you have a long list of VMs (as made using googleComputeEngineR).

Here's what I'm talking about:

my_rscript <- c("docker", 
                "run", c("--net=host","--shm-size=10G"),
                "gibbsdavidl/google_r:v3", 
                "Rscript")

plan(cluster, 
     workers = as.cluster(vms, 
                          docker_image="gibbsdavidl/google_r:v3",
                          rscript=my_rscript)
     )

The as.cluster() call pulls the Docker image for each VM in order, one at a time, and it's really slow. I could probably make a 'skinnier' Docker image, though.
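One possible workaround (sketched here in Python rather than R, since the idea is language-agnostic) is to pre-pull the image on every VM concurrently before calling as.cluster(), so that its sequential per-VM pulls hit the local Docker cache. `pull_image()` and the hostnames below are placeholders; real code would shell out to something like `ssh <host> docker pull gibbsdavidl/google_r:v3`.

```python
from concurrent.futures import ThreadPoolExecutor

def pull_image(host):
    # Stand-in for: subprocess.run(["ssh", host, "docker", "pull",
    #                               "gibbsdavidl/google_r:v3"], check=True)
    return f"pulled on {host}"

hosts = ["vm1", "vm2", "vm3"]  # placeholder VM names

# Run every pull at the same time; total wall time is roughly the
# slowest single pull rather than the sum of all pulls.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(pull_image, hosts))
```

With the images already cached, the per-VM setup done by as.cluster() should be much faster even though it still runs sequentially.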

Thanks!
-dave

@HenrikBengtsson
Collaborator

Sorry for the slow reply. Yes, this would be a neat feature, especially when setting up remote connections is "slow" or when setting up a lot of parallel workers. I actually have an old note of mine on this:

Would it make sense to "parallelize" makeClusterPSOCK(), i.e.

  1. launch all workers
  2. then connect to each of them

instead of as now:

  1. for all workers,
    a. launch it
    b. connect to it

A downside of this approach is when the setup of the connections fails: by then we might have launched lots of zombie workers. The risk of this happening could be mitigated by first making sure that a single worker can be set up, and only after that has been confirmed, launching the rest in parallel.
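The launch-all-then-connect idea, including the mitigation above, can be sketched as follows (in Python rather than R, purely for illustration; `launch()` and `connect()` are stand-ins for the real PSOCK worker setup steps, and the hostnames are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor

def launch(host):
    # Stand-in for starting the remote Rscript worker process.
    return {"host": host, "proc": f"worker@{host}"}

def connect(worker):
    # Stand-in for opening the socket connection back to the worker.
    return {**worker, "connected": True}

def make_cluster(hosts):
    # Mitigation: set up one worker end-to-end first, so a configuration
    # error fails fast instead of leaving a fleet of zombie workers.
    first = connect(launch(hosts[0]))

    # 1. Launch all remaining workers in parallel ...
    with ThreadPoolExecutor() as pool:
        rest = list(pool.map(launch, hosts[1:]))

    # 2. ... then connect to each of them.
    rest = [connect(w) for w in rest]
    return [first] + rest

cluster = make_cluster(["vm1", "vm2", "vm3"])
```

This contrasts with the current behavior, where each worker is launched and connected to before the next one is touched.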

PS. Note that future::makeClusterPSOCK() has nothing to do with the Future API per se. I've deliberately implemented it such that it could be moved elsewhere, e.g. incorporated into the parallel package itself. In other words, the same feature request applies to the sibling parallel::makePSOCKcluster() as well.

@HenrikBengtsson HenrikBengtsson transferred this issue from futureverse/future Oct 20, 2020