add mpi and distributed
JBlaschke committed Oct 25, 2024
1 parent 577809f commit 70074b0
Showing 30 changed files with 45,705 additions and 0 deletions.
1,133 changes: 1,133 additions & 0 deletions parts/distributed/explanation/01_distributed.ipynb

Large diffs are not rendered by default.

8,318 changes: 8,318 additions & 0 deletions parts/distributed/explanation/01_distributed.slides.html

Large diffs are not rendered by default.

865 changes: 865 additions & 0 deletions parts/distributed/explanation/02_dagger.ipynb

Large diffs are not rendered by default.

8,372 changes: 8,372 additions & 0 deletions parts/distributed/explanation/02_dagger.slides.html

Large diffs are not rendered by default.

6 changes: 6 additions & 0 deletions parts/distributed/explanation/Project.toml
@@ -0,0 +1,6 @@
[deps]
ClusterManagers = "34f1f09b-3a8b-5176-ab39-66d58a4d544e"
Dagger = "d58978e5-989f-55fb-8d15-ea34adc7bf54"
DistributedArrays = "aaf54ef3-cdf8-58ed-94cc-d582ad619b94"
NetworkInterfaceControllers = "6f74fd91-2978-43ad-8164-3af8c0ec0142"
Plots = "91a5bcdd-55d7-5caf-9e0b-520d859cae80"
55 changes: 55 additions & 0 deletions parts/mpi/README.md
@@ -0,0 +1,55 @@
# Diffusion 2D - MPI

In this part, we want to use MPI (distributed-memory parallelism) to parallelize our Diffusion 2D example.

The starting point is (once again) the serial loop version [`diffusion_2d_loop.jl`](./../diffusion_2d/diffusion_2d_loop.jl). The file [`diffusion_2d_mpi.jl`](./diffusion_2d_mpi.jl) in this folder is a modified copy of this variant. While the computational kernel `diffusion_step!` is essentially untouched, we have added MPI setup code at the beginning of the `run_diffusion` function and introduced the key function `update_halo!`, which is supposed to take care of the data exchange between MPI ranks. As of now, however, the function doesn't communicate anything, and it will be (one of) your tasks to fix that 😉.


## Task 1 - Running the MPI code

Although incomplete from a semantic point of view, the code in `diffusion_2d_mpi.jl` is perfectly runnable as is. It won't compute the right thing, but it runs 😉. So **let's run it**. But how?

The first thing to realize is that, on Perlmutter, **you can't run MPI on a login node**. You have two options for working on a compute node:

1) **Interactive session**: You can try to get an interactive session on a compute node by running `sh get_compute_node_interactive.sh`. Unfortunately, we don't have a node for everyone, so you might not get one (sorry!). **If you can get one**, you can use `mpiexecjl --project -n 4 julia diffusion_2d_mpi.jl` to run the code (see the note after this list if `mpiexecjl` is not available). Alternatively, you can run `sh job_mpi_singlenode.sh`.

2) **Compute job**: You can always submit a job that runs the code: `sbatch job_mpi_singlenode.sh`. The output will land in `slurm_mpi_singlenode.out`. Check out the [Perlmutter cheatsheet](../../help/perlmutter_cheatsheet.md) to learn more about jobs.
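
Note: `mpiexecjl` is the MPI launcher wrapper provided by MPI.jl. If it is not already available in your environment, it can typically be installed with `julia --project -e 'using MPI; MPI.install_mpiexecjl()'`, which places the wrapper in `~/.julia/bin` by default (make sure that directory is on your `PATH`).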

Irrespective of which option you choose, **go ahead and run the code** (with 4 MPI ranks).

To see that the code is currently not working properly (in the sense of computing the right thing), run `julia --project visualize_mpi.jl` to combine the results of different MPI ranks (`*.jld2` files) into a visualization (`visualization.png`). Inspect the visualization and notice the undesired dark lines.

## Task 2 - Halo exchange

Take a look at the general MPI setup (the beginning of `run_diffusion`) and at the parts of the `update_halo!` function that are already in place, and try to understand them.

Afterwards, implement the necessary MPI communication. To that end, find the "TODO" block in `update_halo!` and follow the instructions. Note that we want to use **non-blocking** communication, i.e. you should use the functions `MPI.Irecv!` and `MPI.Isend`.
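
For orientation, the non-blocking pattern for the first (x) dimension could look roughly like the sketch below. This is only a sketch and not necessarily identical to the reference solution in `./solution/`: it reuses the buffer and neighbor names from the existing code and assumes a recent MPI.jl (≥ v0.20), where `MPI.Irecv!` and `MPI.Isend` accept a request object plus `source`/`dest` keyword arguments.

```julia
# x-dimension exchange only; the y-dimension is analogous
reqs = MPI.MultiRequest(4)
# post the receives first, then the sends, then wait on all four requests
(neighbors.x[1] != MPI.PROC_NULL) && MPI.Irecv!(bufs.recv_1_1, comm, reqs[1]; source=neighbors.x[1])
(neighbors.x[2] != MPI.PROC_NULL) && MPI.Irecv!(bufs.recv_1_2, comm, reqs[2]; source=neighbors.x[2])
(neighbors.x[1] != MPI.PROC_NULL) && MPI.Isend(bufs.send_1_1, comm, reqs[3]; dest=neighbors.x[1])
(neighbors.x[2] != MPI.PROC_NULL) && MPI.Isend(bufs.send_1_2, comm, reqs[4]; dest=neighbors.x[2])
MPI.Waitall(reqs)  # all halo data has arrived in the recv buffers after this call
```

Posting the receives before the sends is a common way to avoid unnecessary buffering; `MPI.Waitall` then guarantees that all four operations have completed before the receive buffers are copied back into `A`.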

Check that your code is working by comparing the `visualization.png` that you get to this (basic "eye test"):

<img src="./solution/visualization.png" width=500px>

## Task 3 - Benchmark

### Part A

Our goal is to perform a rough and basic scaling analysis with 4, 8, 12, and 16 MPI ranks distributed across multiple nodes. Specifically, we want to run 4 MPI ranks per node and increase the number of nodes to get up to 16 ranks in total.

The file `job_mpi_multinode.sh` is a job script that currently requests a single node (see the line `#SBATCH --nodes=1`) running 4 MPI ranks (see the line `#SBATCH --ntasks-per-node=4`), and then runs our Julia MPI code with `do_save=false` (for simplicity) and `ns=6144`.
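
For reference, a job script of this kind might look roughly like the sketch below. This is a hypothetical illustration rather than the actual contents of `job_mpi_multinode.sh`: the constraint, time limit, output file name, and exact launch line are placeholders and may differ.

```bash
#!/bin/bash
#SBATCH --nodes=1               # increase to 2, 3, and 4 for the scaling runs below
#SBATCH --ntasks-per-node=4     # 4 MPI ranks per node
#SBATCH --constraint=cpu        # placeholder: CPU nodes on Perlmutter
#SBATCH --time=00:15:00         # placeholder time limit
#SBATCH --output=slurm_mpi_multinode.out

# run the solver with do_save=false and a local grid resolution of ns=6144
srun julia --project -e 'do_save = false; include("diffusion_2d_mpi.jl")' 6144
```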

Submit this file to SLURM via `sbatch job_mpi_multinode.sh`. Once the job has run, the output will land in `slurm_mpi_multinode.out`. Write the output down somewhere (copy & paste), change the number of nodes to 2 (= 8 MPI ranks in total), and rerun the experiment. Repeat the same thing, this time requesting 3 nodes (= 12 MPI ranks in total) and then 4 nodes (= 16 MPI ranks in total).

### Part B

Inspect the results that you've obtained and compare them.

**Questions**
* What do you observe?
* Is this what you expected?

Note that in setting up our MPI ranks, we split the global grid into local grids. In the process, the meaning of the input parameter `ns` has changed compared to the previous codes (serial & multithreaded): it now determines the resolution of the **local grid** held by each MPI rank rather than the resolution of the global grid. Since we keep `ns` fixed (6144 in `job_mpi_multinode.sh`), we therefore increase the problem size (the total grid resolution) when we increase the number of MPI ranks. For example, 4 ranks arranged as a 2 × 2 process grid hold a global grid of roughly 12288 × 12288 points, while 16 ranks arranged as 4 × 4 hold roughly 24576 × 24576 points. This is known as a "weak scaling" analysis.

**Questions**

* Given the comment above, what does "ideal parallel scaling" mean in the context of a "weak scaling" analysis?
* What do the observed results tell you?
124 changes: 124 additions & 0 deletions parts/mpi/diffusion_2d_mpi.jl
@@ -0,0 +1,124 @@
# 2D linear diffusion solver - MPI
using Printf
using JLD2
using MPI
include(joinpath(@__DIR__, "../shared.jl"))

# convenience macros simply to avoid writing out the nested finite-difference expressions
macro qx(ix, iy) esc(:(-D * (C[$ix+1, $iy] - C[$ix, $iy]) / dx)) end
macro qy(ix, iy) esc(:(-D * (C[$ix, $iy+1] - C[$ix, $iy]) / dy)) end

function diffusion_step!(params, C2, C)
    (; dx, dy, dt, D) = params
    for iy in 1:size(C, 2)-2
        for ix in 1:size(C, 1)-2
            @inbounds C2[ix+1, iy+1] = C[ix+1, iy+1] - dt * ((@qx(ix+1, iy+1) - @qx(ix, iy+1)) / dx +
                                                             (@qy(ix+1, iy+1) - @qy(ix+1, iy)) / dy)
        end
    end
    return nothing
end

# MPI functions
@views function update_halo!(A, bufs, neighbors, comm)
    #
    # !!! TODO
    #
    # Complete the halo exchange implementation. Specifically, use non-blocking
    # MPI communication (MPI.Irecv! and MPI.Isend) at the positions marked by "TODO..." below.
    #
    # Help:
    #   left neighbor:  neighbors.x[1]
    #   right neighbor: neighbors.x[2]
    #   up neighbor:    neighbors.y[1]
    #   down neighbor:  neighbors.y[2]
    #

    # dim-1 (x)
    (neighbors.x[1] != MPI.PROC_NULL) && copyto!(bufs.send_1_1, A[2    , :])
    (neighbors.x[2] != MPI.PROC_NULL) && copyto!(bufs.send_1_2, A[end-1, :])

    reqs = MPI.MultiRequest(4)
    (neighbors.x[1] != MPI.PROC_NULL) && # TODO... receive from left neighbor into bufs.recv_1_1
    (neighbors.x[2] != MPI.PROC_NULL) && # TODO... receive from right neighbor into bufs.recv_1_2

    (neighbors.x[1] != MPI.PROC_NULL) && # TODO... send bufs.send_1_1 to left neighbor
    (neighbors.x[2] != MPI.PROC_NULL) && # TODO... send bufs.send_1_2 to right neighbor
    MPI.Waitall(reqs) # blocking

    (neighbors.x[1] != MPI.PROC_NULL) && copyto!(A[1  , :], bufs.recv_1_1)
    (neighbors.x[2] != MPI.PROC_NULL) && copyto!(A[end, :], bufs.recv_1_2)

    # dim-2 (y)
    (neighbors.y[1] != MPI.PROC_NULL) && copyto!(bufs.send_2_1, A[:, 2    ])
    (neighbors.y[2] != MPI.PROC_NULL) && copyto!(bufs.send_2_2, A[:, end-1])

    reqs = MPI.MultiRequest(4)
    (neighbors.y[1] != MPI.PROC_NULL) && # TODO... receive from up neighbor into bufs.recv_2_1
    (neighbors.y[2] != MPI.PROC_NULL) && # TODO... receive from down neighbor into bufs.recv_2_2

    (neighbors.y[1] != MPI.PROC_NULL) && # TODO... send bufs.send_2_1 to up neighbor
    (neighbors.y[2] != MPI.PROC_NULL) && # TODO... send bufs.send_2_2 to down neighbor
    MPI.Waitall(reqs) # blocking

    (neighbors.y[1] != MPI.PROC_NULL) && copyto!(A[:, 1  ], bufs.recv_2_1)
    (neighbors.y[2] != MPI.PROC_NULL) && copyto!(A[:, end], bufs.recv_2_2)
    return nothing
end

function init_bufs(A)
    return (; send_1_1=zeros(size(A, 2)), send_1_2=zeros(size(A, 2)),
              send_2_1=zeros(size(A, 1)), send_2_2=zeros(size(A, 1)),
              recv_1_1=zeros(size(A, 2)), recv_1_2=zeros(size(A, 2)),
              recv_2_1=zeros(size(A, 1)), recv_2_2=zeros(size(A, 1)))
end

function run_diffusion(; ns=64, nt=100, do_save=false)
    MPI.Init()
    comm = MPI.COMM_WORLD
    nprocs = MPI.Comm_size(comm)
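    # let MPI choose a balanced 2D process grid for nprocs ranks (e.g. 4 ranks -> dims = (2, 2))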
    dims = MPI.Dims_create(nprocs, (0, 0)) |> Tuple
    comm_cart = MPI.Cart_create(comm, dims)
    me = MPI.Comm_rank(comm_cart)
    coords = MPI.Cart_coords(comm_cart) |> Tuple
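    # neighbor ranks along each dimension from Cart_shift; MPI.PROC_NULL marks a side without a neighbor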
    neighbors = (; x=MPI.Cart_shift(comm_cart, 0, 1), y=MPI.Cart_shift(comm_cart, 1, 1))
    (me == 0) && println("nprocs = $(nprocs), dims = $dims")

    params = init_params_mpi(; dims, coords, ns, nt, do_save)
    C, C2 = init_arrays_mpi(params)
    bufs = init_bufs(C)
    t_tic = 0.0
    # time loop
    for it in 1:nt
        # time after warmup (ignore first 10 iterations)
        (it == 11) && (t_tic = Base.time())
        # diffusion
        diffusion_step!(params, C2, C)
        update_halo!(C2, bufs, neighbors, comm_cart)
        C, C2 = C2, C # pointer swap
    end
    t_toc = (Base.time() - t_tic)
    # "master" prints performance
    (me == 0) && print_perf(params, t_toc)
    # save to (maybe) visualize later
    if do_save
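        # strip the one-cell halo on each side and save only the local interior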
        jldsave(joinpath(@__DIR__, "out_$(me).jld2"); C = Array(C[2:end-1, 2:end-1]), lxy = (; lx=params.L, ly=params.L))
    end
    MPI.Finalize()
    return nothing
end

# Running things...

# enable save to disk by default
(!@isdefined do_save) && (do_save = true)
# enable execution by default
(!@isdefined do_run) && (do_run = true)
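# e.g. set `do_run = false` before include-ing this file to load the functions without executing,
# or launch it directly with `mpiexecjl --project -n 4 julia diffusion_2d_mpi.jl [ns]`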

if do_run
    if !isempty(ARGS)
        run_diffusion(; ns=parse(Int, ARGS[1]), do_save)
    else
        run_diffusion(; ns=256, do_save)
    end
end
