# Lab 4 for COMP4300: Distributed Memory Parallel Programming with MPI

The aim of this lab is to introduce and practise using MPI for the distributed memory parallel programming paradigm.
Distributed memory programming utilises multiple cores which do not have a shared memory system. This lab does require the use of the Gadi system; you should complete it there. Login instructions are contained within Lab 1, so if you’re not sure how to do this you should refer to those notes.
Once you’ve got access to Gadi, you should fork this repo, and then make a clone on Gadi.
# Getting Started with MPI
There are quite a few moving parts when it comes to MPI programs, so we're going to start simple, with a Hello, World program. Before you run this program, you need to tell Gadi that you're going to be using MPI, which can be done by entering the `module load openmpi` command into the terminal (Open MPI is the specific implementation of MPI we use). This is done so that the correct libraries are loaded into your interactive session.
Next, take a look at the provided `mpi_hello.c` file.
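The listing below is a minimal sketch of the kind of program you'll find there; it was written for this handout rather than copied from the repo, so the provided file may differ in its details. It assumes the classic pattern in which every rank other than 0 sends a greeting string to rank 0, which prints them:

```c
#include <stdio.h>
#include <string.h>
#include <mpi.h>

#define MAX_STRING 100

int main(void) {
    char greeting[MAX_STRING];
    int comm_sz;   /* number of processes */
    int my_rank;   /* rank of this process */

    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    if (my_rank != 0) {
        /* Every non-zero rank sends a greeting to rank 0 */
        sprintf(greeting, "Hello from process %d of %d!", my_rank, comm_sz);
        MPI_Send(greeting, strlen(greeting) + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    } else {
        /* Rank 0 prints its own greeting, then collects one from each other rank */
        printf("Hello from process %d of %d!\n", my_rank, comm_sz);
        for (int q = 1; q < comm_sz; q++) {
            MPI_Recv(greeting, MAX_STRING, MPI_CHAR, q, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("%s\n", greeting);
        }
    }

    MPI_Finalize();
    return 0;
}
```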
Take note of a couple of things. Firstly, the `MPI_Init` call; your code will compile, but complain, without this. We also get our first look at the `MPI_Send` and `MPI_Recv` calls, which are the backbone of parallel programming with MPI. Looking at the man pages, we see that the arguments to these calls are the following:

```c
int MPI_Send(const void *buf, int count, MPI_Datatype datatype, int dest,
             int tag, MPI_Comm comm)
int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source,
             int tag, MPI_Comm comm, MPI_Status *status)
```

You can see a couple of things immediately: we can only send and receive data of a type that MPI has implemented as an `MPI_Datatype`, and we need to know the length of the message at both send and receive time. The receiver has an argument for who it's listening for messages from, but this isn't a restriction: the special argument `MPI_ANY_SOURCE` will allow you to listen for any messages directed to you.
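For instance, reusing `greeting` and `MAX_STRING` from the sketch above, a receive that accepts a message from whichever rank sends first, and then reports who it came from, could look like this (an illustrative fragment, not part of the provided file):

```c
MPI_Status status;
MPI_Recv(greeting, MAX_STRING, MPI_CHAR, MPI_ANY_SOURCE, 0,
         MPI_COMM_WORLD, &status);
/* status.MPI_SOURCE records which rank the message actually came from */
printf("Received \"%s\" from process %d\n", greeting, status.MPI_SOURCE);
```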
Having read through the program (and asked your tutor about anything that's unclear!), now it's time to compile and run it. The compilation can be done with `mpicc`, a wrapper for the gcc compiler, in the same fashion as normal, e.g. `mpicc -g -Wall -o mpi_hello mpi_hello.c`. When we run an MPI program, we need to specify how many processes we want; this is done by utilising the `mpirun` command to run our program, like so: `mpirun -n 8 ./mpi_hello`.
A couple of tasks before we move on to more interesting MPI programs:
- Using only this program, along with command line arguments, determine how many processes are available to you on a Gadi login node. How many are there?
- When you did the last question, you might have found that there were some arguments you could enter without getting errors, but that took a long time (potentially causing your terminal to disconnect). Why?
- (Theory question) What is the `MPI_COMM_WORLD` thing floating around this code? Can you think of a circumstance where you would ever want to use a communicator other than `MPI_COMM_WORLD`? (For a taste of what constructing another communicator looks like, see the sketch below.)
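To give the last question a little context, the following is a minimal sketch, not something you need for this lab and not taken from the provided code, of how `MPI_Comm_split` can carve `MPI_COMM_WORLD` into smaller communicators (for example, to run a collective over only half of the ranks):

```c
int world_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

/* Split the world into two halves: every rank picks a "colour", and
 * ranks with the same colour end up in the same new communicator. */
int colour = world_rank % 2;
MPI_Comm half_comm;
MPI_Comm_split(MPI_COMM_WORLD, colour, world_rank, &half_comm);

int half_rank, half_size;
MPI_Comm_rank(half_comm, &half_rank);
MPI_Comm_size(half_comm, &half_size);
printf("World rank %d is rank %d of %d in its half\n",
       world_rank, half_rank, half_size);

MPI_Comm_free(&half_comm);
```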
# Collective Communication
The main thing that differentiates distributed memory paradigms from shared memory ones is the need for processes to explicitly communicate, so it's worth our time to look over some of the collective communication options that exist besides `MPI_Send` and `MPI_Recv`.
To start with, you've seen a number of operations throughout this course that are some sort of reduction: global sums, global maxima, etc. MPI has implementations of many of these through the `MPI_Reduce` function, which has a signature like this:

```c
int MPI_Reduce(const void *sendbuf, void *recvbuf, int count,
               MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
```
Many of these arguments you've seen before in other functions: the count is the amount of data being transmitted from process to process at any given time, the datatype is the type of that data, the root is the process where you want the output to end up, etc. The main new one is the `MPI_Op op` argument, which is taken from the following list:
- `MPI_MAX`: Maximum
- `MPI_MIN`: Minimum
- `MPI_SUM`: Sum
- `MPI_PROD`: Product
- `MPI_LAND`: Logical AND
- `MPI_BAND`: Bitwise AND
- `MPI_LOR`: Logical OR
- `MPI_BOR`: Bitwise OR
- `MPI_LXOR`: Logical XOR
- `MPI_BXOR`: Bitwise XOR
- `MPI_MAXLOC`: Maximum and location of maximum
- `MPI_MINLOC`: Minimum and location of minimum
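As a purely illustrative fragment (not taken from the provided code, and with the hypothetical helper `compute_local_part()` standing in for whatever each process calculates), a global sum of one `double` per process, delivered to the root, looks like this:

```c
double local_value = compute_local_part();  /* hypothetical per-process result */
double global_sum = 0.0;

/* Combine every process's local_value with MPI_SUM; only the root
 * (process 0 here) ends up with the result in global_sum. */
MPI_Reduce(&local_value, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
```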
Your task is this: start by reading the provided `mpi_trap.c` file. This file contains a relatively inefficient method for computing the trapezoidal rule-based integration of a function `f`. By replacing the appropriate section of code with an `MPI_Reduce` operation, make the program more efficient at computing the trapezoidal rule. You will know that you have succeeded if your program is consistently more performant for higher numbers of processes (>=16). This means that to know whether what you've done worked, you will need to take some baseline measurements of runtime first. If you're struggling to see a speedup, see the next section for details on how to submit to the PBS queue, which will give you access to many more CPUs to play with.
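One simple way to take those baseline measurements is with `MPI_Wtime`. The fragment below is a sketch only, assuming you bracket whichever region of `mpi_trap.c` you want to time and that `my_rank` holds the calling process's rank:

```c
MPI_Barrier(MPI_COMM_WORLD);   /* line the processes up before timing */
double start = MPI_Wtime();

/* ... the section of the program being timed ... */

double elapsed = MPI_Wtime() - start;
double max_elapsed;
/* Report the slowest process's time, since that is what limits the run */
MPI_Reduce(&elapsed, &max_elapsed, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
if (my_rank == 0)
    printf("Elapsed time: %f seconds\n", max_elapsed);
```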
# PBS Queues, and Accessing Gadi
In the previous two exercises, you’ve so far used the interactive nodes on Gadi. These are great, because they don’t use up the CPU time budget, and let you easily test your parallel code for small numbers of processors. However, you don’t get access to all that many CPUs on the interactive nodes, and you have to compete for them with others using that node.
Enter the PBS queue system! This is a system where you submit batch jobs to a queue, which determines whose code is allowed to run on Gadi at what time. This is done through a submission script: a bit of bash which specifies what you want Gadi to do, and which resources you want to use. Let's take a look at the preamble of one (provided as `batch_trap`), which looks something like this:

```bash
#!/bin/bash
#PBS -P c37
#PBS -q normal
#PBS -j oe
#PBS -l walltime=00:01:00,mem=16GB
#PBS -l wd
#PBS -l ncpus=8
```
You can see a quick guide which summarises all available directives on the NCI support page, as well as a more detailed guide on directives specifically, but we'll take you through the ones used here.

- `-P c37` indicates that this job should be attached to the `c37` project (which is the one for our course). This line is only essential if you utilise Gadi for multiple projects, but it's good to know about in case you do further research in HPC.
- The `-q normal` line says that we are submitting to the `normal` queue; other options include an `express` queue (which uses up more of the grant, but is less busy), and a `gpuvolta` queue, which allows you to utilise GPUs as well as CPUs. You will only ever need the `normal` queue in this course, unless you're doing extension GPU work on the major project.
- `-j oe` merges STDOUT and STDERR into STDOUT, so any errors are printed to STDOUT.
- `-l walltime=00:01:00,mem=16GB` is probably the most important line, and you must never leave it out. It limits the amount of walltime used to 1 minute, and the amount of memory used to 16GB. This line is essential, as without a timeout limit, an infinite loop could use up the entire compute budget of the course!
- `-l wd` causes Gadi to enter the directory from which the job was submitted, meaning your code can access files in the same directory as the submission script.
- `-l ncpus=8` indicates we want to request 8 CPUs for this job.
The rest of the script is bash code indicating what Gadi should do. You don't need to understand this in great depth (this isn't a bash scripting course, after all), but do read it and try to get a sense for what it's doing. In particular, on the line `mpirun -np $p ./mpi_trap`, you need to replace `./mpi_trap` with whatever the name of your compiled executable is.
Once you've got a submission script, the next step is to submit it to the queue! This is done using the `qsub` instruction in your terminal, e.g. `qsub batch_trap`. If you want to get details on your running PBS jobs, you can use the `qstat` command: `qstat -a` returns details of all your active jobs, while `qstat <job number>` returns information for just that job. The job number is provided to you in the terminal when you run the `qsub` command.

An example of what the `qstat -a` output might look like is below:
```
gadi-pbs:
                                                              Req'd  Req'd   Elap
Job ID            Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
35648347.gadi-pbs tw3382   normal-* batch_trap 808926   2  96   16gb 00:01 R 00:00
```
Once the job is complete, a new file will appear which contains the STDOUT/STDERR produced by the program during its time running on Gadi, along with a summary of the execution at the bottom. For example:

```
Point-to-point broadcast: 52 msecs
MPI broadcast: 32 msecs
Point-to-point reduce: 48 msecs
MPI reduce: 34 msecs
```
#### If you need to use other versions of MPI
To change the version of MPI, one can use `module load openmpi/X.X.X` instead of `module load openmpi`. Adding the same command near the start of the job script will load the specified version on the compute node.
To list the available versions of openmpi, one can use `module avail openmpi`. Note that you are always encouraged to use the default version of MPI, so use this instruction only if you need to.
# Submission Guidelines
Your assignment will be submitted via your GitLab repo as follows:
- [Fork](https://gitlab.cecs.anu.edu.au/comp4300/student-comp4300-2022-lab4/-/blob/master/README.md#forking-and-cloning-the-lab) the lab repo, then add the marker user `comp4300-2022-s1-marker` as a developer to your repo.
**How to add user**: Click “Members” on the left panel -> input `comp4300-2022-s1-marker` under “Gitlab member or Email address” -> Select the role permission as `Developer` -> Skip the "access expiration date" -> Invite
- Clone the lab to your own machine and complete all tasks. Once you finish your answers, please commit and push your code to your repo.
We will mark your answers and provide feedback via your repo.
# Submission Deadline
You must commit and push your code to your repo before **11:59pm on 4 May**. Any late updates after the deadline will not be considered.
# References
[MPI Documentation](https://www.open-mpi.org/doc/)