add mpi and distributed
JBlaschke committed Oct 25, 2024
1 parent 577809f commit 70074b0
Showing 30 changed files with 45,705 additions and 0 deletions.
1,133 changes: 1,133 additions & 0 deletions parts/distributed/explanation/01_distributed.ipynb

Large diffs are not rendered by default.

8,318 changes: 8,318 additions & 0 deletions parts/distributed/explanation/01_distributed.slides.html

Large diffs are not rendered by default.

865 changes: 865 additions & 0 deletions parts/distributed/explanation/02_dagger.ipynb

Large diffs are not rendered by default.

8,372 changes: 8,372 additions & 0 deletions parts/distributed/explanation/02_dagger.slides.html

Large diffs are not rendered by default.

6 changes: 6 additions & 0 deletions parts/distributed/explanation/Project.toml
@@ -0,0 +1,6 @@
[deps]
ClusterManagers = "34f1f09b-3a8b-5176-ab39-66d58a4d544e"
Dagger = "d58978e5-989f-55fb-8d15-ea34adc7bf54"
DistributedArrays = "aaf54ef3-cdf8-58ed-94cc-d582ad619b94"
NetworkInterfaceControllers = "6f74fd91-2978-43ad-8164-3af8c0ec0142"
Plots = "91a5bcdd-55d7-5caf-9e0b-520d859cae80"
55 changes: 55 additions & 0 deletions parts/mpi/README.md
@@ -0,0 +1,55 @@
# Diffusion 2D - MPI

In this part, we want to use MPI (distributed-memory parallelism) to parallelize our Diffusion 2D example.

The starting point is (once again) the serial loop version [`diffusion_2d_loop.jl`](./../diffusion_2d/diffusion_2d_loop.jl). The file [`diffusion_2d_mpi.jl`](./diffusion_2d_mpi.jl) in this folder is a modified copy of this variant. While the computational kernel `diffusion_step!` is essentially untouched, we have added MPI setup code at the beginning of the `run_diffusion` function and introduced the key function `update_halo!`, which is supposed to take care of the data exchange between MPI ranks. As of now, however, the function doesn't communicate anything, and it will be (one of) your tasks to fix that 😉.


## Task 1 - Running the MPI code

Although incomplete from a semantic point of view, the code in `diffusion_2d_mpi.jl` is perfectly runnable as is. It won't compute the right thing, but it runs 😉. So **let's run it**. But how?

The first thing to realize is that, on Perlmutter, **you can't run MPI on a login node**. You have two options for working on a compute node:

1) **Interactive session**: You can try to get an interactive session on a compute node by running `sh get_compute_node_interactive.sh`. Unfortunately, we don't have a node for everyone, so you might not get one (sorry!). **If you can get one**, you can use `mpiexecjl --project -n 4 julia diffusion_2d_mpi.jl` to run the code (see the note after this list if `mpiexecjl` is not available). Alternatively, you can run `sh job_mpi_singlenode.sh`.

2) **Compute job**: You can always submit a job that runs the code: `sbatch job_mpi_singlenode.sh`. The output will land in `slurm_mpi_singlenode.out`. Check out the [Perlmutter cheatsheet](../../help/perlmutter_cheatsheet.md) to learn more about jobs.
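
Note: `mpiexecjl` is the MPI launcher wrapper provided by MPI.jl. If it is not already available in your environment, it can typically be installed with `julia --project -e 'using MPI; MPI.install_mpiexecjl()'`, which places the wrapper in `~/.julia/bin` by default (make sure that directory is on your `PATH`).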

Irrespective of which option you choose, **go ahead and run the code** (with 4 MPI ranks).

To see that the code is currently not working properly (in the sense of computing the right thing), run `julia --project visualize_mpi.jl` to combine the results of different MPI ranks (`*.jld2` files) into a visualization (`visualization.png`). Inspect the visualization and notice the undesired dark lines.

## Task 2 - Halo exchange

Take a look at the general MPI setup (the beginning of `run_diffusion`) and at the parts of the `update_halo!` function that are already in place, and try to understand them.

Afterwards, implement the necessary MPI communication. To that end, find the "TODO" block in `update_halo!` and follow the instructions. Note that we want to use **non-blocking** communication, i.e. you should use the functions `MPI.Irecv!` and `MPI.Isend`.
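
For orientation, the non-blocking pattern for the first (x) dimension could look roughly like the sketch below. This is only a sketch and not necessarily identical to the reference solution in `./solution/`: it reuses the buffer and neighbor names from the existing code and assumes a recent MPI.jl (≥ v0.20), where `MPI.Irecv!` and `MPI.Isend` accept a request object plus `source`/`dest` keyword arguments.

```julia
# x-dimension exchange only; the y-dimension is analogous
reqs = MPI.MultiRequest(4)
# post the receives first, then the sends, then wait on all four requests
(neighbors.x[1] != MPI.PROC_NULL) && MPI.Irecv!(bufs.recv_1_1, comm, reqs[1]; source=neighbors.x[1])
(neighbors.x[2] != MPI.PROC_NULL) && MPI.Irecv!(bufs.recv_1_2, comm, reqs[2]; source=neighbors.x[2])
(neighbors.x[1] != MPI.PROC_NULL) && MPI.Isend(bufs.send_1_1, comm, reqs[3]; dest=neighbors.x[1])
(neighbors.x[2] != MPI.PROC_NULL) && MPI.Isend(bufs.send_1_2, comm, reqs[4]; dest=neighbors.x[2])
MPI.Waitall(reqs)  # all halo data has arrived in the recv buffers after this call
```

Posting the receives before the sends is a common way to avoid unnecessary buffering; `MPI.Waitall` then guarantees that all four operations have completed before the receive buffers are copied back into `A`.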

Check that your code is working by comparing the `visualization.png` that you get to this (basic "eye test"):

<img src="./solution/visualization.png" width=500px>

## Task 3 - Benchmark

### Part A

Our goal is to perform a rough and basic scaling analysis with 4, 8, 12, and 16 MPI ranks distributed across multiple nodes. Specifically, we want to run 4 MPI ranks per node and increase the number of nodes to get up to 16 ranks in total.

The file `job_mpi_multinode.sh` is a job script that currently requests a single node (see the line `#SBATCH --nodes=1`) running 4 MPI ranks (see the line `#SBATCH --ntasks-per-node=4`), and then runs our Julia MPI code with `do_save=false` (for simplicity) and `ns=6144`.
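
For reference, a job script of this kind might look roughly like the sketch below. This is a hypothetical illustration rather than the actual contents of `job_mpi_multinode.sh`: the constraint, time limit, output file name, and exact launch line are placeholders and may differ.

```bash
#!/bin/bash
#SBATCH --nodes=1               # increase to 2, 3, and 4 for the scaling runs below
#SBATCH --ntasks-per-node=4     # 4 MPI ranks per node
#SBATCH --constraint=cpu        # placeholder: CPU nodes on Perlmutter
#SBATCH --time=00:15:00         # placeholder time limit
#SBATCH --output=slurm_mpi_multinode.out

# run the solver with do_save=false and a local grid resolution of ns=6144
srun julia --project -e 'do_save = false; include("diffusion_2d_mpi.jl")' 6144
```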

Submit this file to SLURM via `sbatch job_mpi_multinode.sh`. Once the job has run, the output will land in `slurm_mpi_multinode.out`. Write the output down somewhere (copy & paste), change the number of nodes to 2 (= 8 MPI ranks in total), and rerun the experiment. Repeat the same thing, this time requesting 3 nodes (= 12 MPI ranks in total) and then 4 nodes (= 16 MPI ranks in total).

### Part B

Inspect the results that you've obtained and compare them.

**Questions**
* What do you observe?
* Is this what you expected?

Note that in setting up our MPI ranks, we split the global grid into local grids. In the process, the meaning of the input parameter `ns` has changed compared to the previous codes (serial & multithreaded): it now determines the resolution of the **local grid** held by each MPI rank rather than the resolution of the global grid. Since we keep `ns` fixed (6144 in `job_mpi_multinode.sh`), we therefore increase the problem size (the total grid resolution) when we increase the number of MPI ranks. For example, 4 ranks arranged as a 2 × 2 process grid hold a global grid of roughly 12288 × 12288 points, while 16 ranks arranged as 4 × 4 hold roughly 24576 × 24576 points. This is known as a "weak scaling" analysis.

**Questions**

* Given the comment above, what does "ideal parallel scaling" mean in the context of a "weak scaling" analysis?
* What do the observed results tell you?
124 changes: 124 additions & 0 deletions parts/mpi/diffusion_2d_mpi.jl
@@ -0,0 +1,124 @@
# 2D linear diffusion solver - MPI
using Printf
using JLD2
using MPI
include(joinpath(@__DIR__, "../shared.jl"))

# convenience macros simply to avoid writing out the nested finite-difference expressions
macro qx(ix, iy) esc(:(-D * (C[$ix+1, $iy] - C[$ix, $iy]) / dx)) end
macro qy(ix, iy) esc(:(-D * (C[$ix, $iy+1] - C[$ix, $iy]) / dy)) end

function diffusion_step!(params, C2, C)
    (; dx, dy, dt, D) = params
    for iy in 1:size(C, 2)-2
        for ix in 1:size(C, 1)-2
            @inbounds C2[ix+1, iy+1] = C[ix+1, iy+1] - dt * ((@qx(ix+1, iy+1) - @qx(ix, iy+1)) / dx +
                                                             (@qy(ix+1, iy+1) - @qy(ix+1, iy)) / dy)
        end
    end
    return nothing
end

# MPI functions
@views function update_halo!(A, bufs, neighbors, comm)
    #
    # !!! TODO
    #
    # Complete the halo exchange implementation. Specifically, use non-blocking
    # MPI communication (MPI.Irecv! and MPI.Isend) at the positions marked by "TODO..." below.
    #
    # Help:
    #   left neighbor:  neighbors.x[1]
    #   right neighbor: neighbors.x[2]
    #   up neighbor:    neighbors.y[1]
    #   down neighbor:  neighbors.y[2]
    #

    # dim-1 (x)
    (neighbors.x[1] != MPI.PROC_NULL) && copyto!(bufs.send_1_1, A[2    , :])
    (neighbors.x[2] != MPI.PROC_NULL) && copyto!(bufs.send_1_2, A[end-1, :])

    reqs = MPI.MultiRequest(4)
    (neighbors.x[1] != MPI.PROC_NULL) && # TODO... receive from left neighbor into bufs.recv_1_1
    (neighbors.x[2] != MPI.PROC_NULL) && # TODO... receive from right neighbor into bufs.recv_1_2

    (neighbors.x[1] != MPI.PROC_NULL) && # TODO... send bufs.send_1_1 to left neighbor
    (neighbors.x[2] != MPI.PROC_NULL) && # TODO... send bufs.send_1_2 to right neighbor
    MPI.Waitall(reqs) # blocking

    (neighbors.x[1] != MPI.PROC_NULL) && copyto!(A[1  , :], bufs.recv_1_1)
    (neighbors.x[2] != MPI.PROC_NULL) && copyto!(A[end, :], bufs.recv_1_2)

    # dim-2 (y)
    (neighbors.y[1] != MPI.PROC_NULL) && copyto!(bufs.send_2_1, A[:, 2    ])
    (neighbors.y[2] != MPI.PROC_NULL) && copyto!(bufs.send_2_2, A[:, end-1])

    reqs = MPI.MultiRequest(4)
    (neighbors.y[1] != MPI.PROC_NULL) && # TODO... receive from up neighbor into bufs.recv_2_1
    (neighbors.y[2] != MPI.PROC_NULL) && # TODO... receive from down neighbor into bufs.recv_2_2

    (neighbors.y[1] != MPI.PROC_NULL) && # TODO... send bufs.send_2_1 to up neighbor
    (neighbors.y[2] != MPI.PROC_NULL) && # TODO... send bufs.send_2_2 to down neighbor
    MPI.Waitall(reqs) # blocking

    (neighbors.y[1] != MPI.PROC_NULL) && copyto!(A[:, 1  ], bufs.recv_2_1)
    (neighbors.y[2] != MPI.PROC_NULL) && copyto!(A[:, end], bufs.recv_2_2)
    return nothing
end

function init_bufs(A)
    return (; send_1_1=zeros(size(A, 2)), send_1_2=zeros(size(A, 2)),
              send_2_1=zeros(size(A, 1)), send_2_2=zeros(size(A, 1)),
              recv_1_1=zeros(size(A, 2)), recv_1_2=zeros(size(A, 2)),
              recv_2_1=zeros(size(A, 1)), recv_2_2=zeros(size(A, 1)))
end

function run_diffusion(; ns=64, nt=100, do_save=false)
    MPI.Init()
    comm = MPI.COMM_WORLD
    nprocs = MPI.Comm_size(comm)
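    # let MPI choose a balanced 2D process grid for nprocs ranks (e.g. 4 ranks -> dims = (2, 2))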
    dims = MPI.Dims_create(nprocs, (0, 0)) |> Tuple
    comm_cart = MPI.Cart_create(comm, dims)
    me = MPI.Comm_rank(comm_cart)
    coords = MPI.Cart_coords(comm_cart) |> Tuple
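    # neighbor ranks along each dimension from Cart_shift; MPI.PROC_NULL marks a side without a neighbor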
    neighbors = (; x=MPI.Cart_shift(comm_cart, 0, 1), y=MPI.Cart_shift(comm_cart, 1, 1))
    (me == 0) && println("nprocs = $(nprocs), dims = $dims")

    params = init_params_mpi(; dims, coords, ns, nt, do_save)
    C, C2 = init_arrays_mpi(params)
    bufs = init_bufs(C)
    t_tic = 0.0
    # time loop
    for it in 1:nt
        # time after warmup (ignore first 10 iterations)
        (it == 11) && (t_tic = Base.time())
        # diffusion
        diffusion_step!(params, C2, C)
        update_halo!(C2, bufs, neighbors, comm_cart)
        C, C2 = C2, C # pointer swap
    end
    t_toc = (Base.time() - t_tic)
    # "master" prints performance
    (me == 0) && print_perf(params, t_toc)
    # save to (maybe) visualize later
    if do_save
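        # strip the one-cell halo on each side and save only the local interior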
        jldsave(joinpath(@__DIR__, "out_$(me).jld2"); C = Array(C[2:end-1, 2:end-1]), lxy = (; lx=params.L, ly=params.L))
    end
    MPI.Finalize()
    return nothing
end

# Running things...

# enable save to disk by default
(!@isdefined do_save) && (do_save = true)
# enable execution by default
(!@isdefined do_run) && (do_run = true)
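# e.g. set `do_run = false` before include-ing this file to load the functions without executing,
# or launch it directly with `mpiexecjl --project -n 4 julia diffusion_2d_mpi.jl [ns]`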

if do_run
    if !isempty(ARGS)
        run_diffusion(; ns=parse(Int, ARGS[1]), do_save)
    else
        run_diffusion(; ns=256, do_save)
    end
end
