# Lab 4 for COMP4300: Distributed Memory Parallel Programming with MPI

The aim of this lab is to introduce and practise using MPI for the distributed memory parallel programming paradigm.
Distributed memory programming utilises multiple cores which do not have a shared memory system. This lab does require the use of the Gadi system; you should complete it there. Login instructions are contained within Lab 1, so if you’re not sure how to do this you should refer to those notes.
Once you’ve got access to Gadi, you should fork this repo, and then make a clone on Gadi.
# Getting Started with MPI
There are quite a few moving parts when it comes to MPI programs, so we're going to start simple, with a Hello, World program. Before you run this program, you need to tell Gadi that you're going to be using MPI, which can be done by entering the `module load openmpi` command into the terminal (Open MPI is the specific implementation of MPI we use). This is done so that the correct libraries are loaded into your interactive session.
Next, take a look at the provided `mpi_hello.c` file.
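The listing below is a minimal sketch of the kind of program you'll find there; it was written for this handout rather than copied from the repo, so the provided file may differ in its details. It assumes the classic pattern in which every rank other than 0 sends a greeting string to rank 0, which prints them:

```c
#include <stdio.h>
#include <string.h>
#include <mpi.h>

#define MAX_STRING 100

int main(void) {
    char greeting[MAX_STRING];
    int comm_sz;   /* number of processes */
    int my_rank;   /* rank of this process */

    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    if (my_rank != 0) {
        /* Every non-zero rank sends a greeting to rank 0 */
        sprintf(greeting, "Hello from process %d of %d!", my_rank, comm_sz);
        MPI_Send(greeting, strlen(greeting) + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    } else {
        /* Rank 0 prints its own greeting, then collects one from each other rank */
        printf("Hello from process %d of %d!\n", my_rank, comm_sz);
        for (int q = 1; q < comm_sz; q++) {
            MPI_Recv(greeting, MAX_STRING, MPI_CHAR, q, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("%s\n", greeting);
        }
    }

    MPI_Finalize();
    return 0;
}
```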
Take note of a couple of things. Firstly, the `MPI_Init` call; your code will compile, but complain, without this. We also get our first look at the `MPI_Send` and `MPI_Recv` calls, which are the backbone of parallel programming with MPI. Looking at the man pages, we see that the arguments to these calls are the following:

```c
int MPI_Send(const void *buf, int count, MPI_Datatype datatype, int dest,
             int tag, MPI_Comm comm)
int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source,
             int tag, MPI_Comm comm, MPI_Status *status)
```

You can see a couple of things immediately: we can only send and receive data of a type that MPI has implemented as an `MPI_Datatype`, and we need to know the length of the message at both send and receive time. The receiver has an argument for who it's listening for messages from, but this isn't a restriction: the special argument `MPI_ANY_SOURCE` will allow you to listen for any messages directed to you.
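For instance, reusing `greeting` and `MAX_STRING` from the sketch above, a receive that accepts a message from whichever rank sends first, and then reports who it came from, could look like this (an illustrative fragment, not part of the provided file):

```c
MPI_Status status;
MPI_Recv(greeting, MAX_STRING, MPI_CHAR, MPI_ANY_SOURCE, 0,
         MPI_COMM_WORLD, &status);
/* status.MPI_SOURCE records which rank the message actually came from */
printf("Received \"%s\" from process %d\n", greeting, status.MPI_SOURCE);
```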
Having read through the program (and asked your tutor about anything that's unclear!), now it's time to compile and run it. The compilation can be done with `mpicc`, a wrapper for the gcc compiler, in the same fashion as normal, e.g. `mpicc -g -Wall -o mpi_hello mpi_hello.c`. When we run an MPI program, we need to specify how many processes we want; this is done by utilising the `mpirun` command to run our program, like so: `mpirun -n 8 ./mpi_hello`.
A couple of tasks before we move on to more interesting MPI programs:
- Using only this program, along with command line arguments, determine how many processes are available to you on a Gadi login node. How many are there?
- When you did the last question, you might have found that there were some arguments you could enter without getting errors, but that took a long time (potentially causing your terminal to disconnect). Why?
- (Theory question) What is the `MPI_COMM_WORLD` thing floating around this code? Can you think of a circumstance where you would ever want to use a communicator other than `MPI_COMM_WORLD`? (For a taste of what constructing another communicator looks like, see the sketch below.)
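To give the last question a little context, the following is a minimal sketch, not something you need for this lab and not taken from the provided code, of how `MPI_Comm_split` can carve `MPI_COMM_WORLD` into smaller communicators (for example, to run a collective over only half of the ranks):

```c
int world_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

/* Split the world into two halves: every rank picks a "colour", and
 * ranks with the same colour end up in the same new communicator. */
int colour = world_rank % 2;
MPI_Comm half_comm;
MPI_Comm_split(MPI_COMM_WORLD, colour, world_rank, &half_comm);

int half_rank, half_size;
MPI_Comm_rank(half_comm, &half_rank);
MPI_Comm_size(half_comm, &half_size);
printf("World rank %d is rank %d of %d in its half\n",
       world_rank, half_rank, half_size);

MPI_Comm_free(&half_comm);
```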
# Collective Communication
The main thing that differentiates distributed memory paradigms from shared memory ones is the need for processes to explicitly communicate, so it's worth our time to look over some of the collective communication options that exist besides `MPI_Send` and `MPI_Recv`.
To start with, you've seen a number of operations throughout this course that are some sort of reduction: global sums, global maxima, etc. MPI has implementations of many of these through the `MPI_Reduce` function, which has a signature like this:

```c
int MPI_Reduce(const void *sendbuf, void *recvbuf, int count,
               MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
```
Many of these arguments you've seen before in other functions: the count is the amount of data being transmitted from process to process at any given time, the datatype is the type of that data, the root is the process where you want the output to end up, etc. The main new one is the `MPI_Op op` argument, which is taken from the following list:
- `MPI_MAX`: Maximum
- `MPI_MIN`: Minimum
- `MPI_SUM`: Sum
- `MPI_PROD`: Product
- `MPI_LAND`: Logical AND
- `MPI_BAND`: Bitwise AND
- `MPI_LOR`: Logical OR
- `MPI_BOR`: Bitwise OR
- `MPI_LXOR`: Logical XOR
- `MPI_BXOR`: Bitwise XOR
- `MPI_MAXLOC`: Maximum and location of maximum
- `MPI_MINLOC`: Minimum and location of minimum
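As a purely illustrative fragment (not taken from the provided code, and with the hypothetical helper `compute_local_part()` standing in for whatever each process calculates), a global sum of one `double` per process, delivered to the root, looks like this:

```c
double local_value = compute_local_part();  /* hypothetical per-process result */
double global_sum = 0.0;

/* Combine every process's local_value with MPI_SUM; only the root
 * (process 0 here) ends up with the result in global_sum. */
MPI_Reduce(&local_value, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
```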
Your task is this: start by reading the provided `mpi_trap.c` file. This file contains a relatively inefficient method for computing the trapezoidal rule-based integration of a function `f`. By replacing the appropriate section of code with an `MPI_Reduce` operation, make the program more efficient at computing the trapezoidal rule. You will know that you have succeeded if your program is consistently more performant for higher numbers of processes (>=16). This means that to know whether what you've done worked, you will need to take some baseline measurements of runtime first. If you're struggling to see a speedup, see the next section for details on how to submit to the PBS queue, which will give you access to many more CPUs to play with.
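One simple way to take those baseline measurements is with `MPI_Wtime`. The fragment below is a sketch only, assuming you bracket whichever region of `mpi_trap.c` you want to time and that `my_rank` holds the calling process's rank:

```c
MPI_Barrier(MPI_COMM_WORLD);   /* line the processes up before timing */
double start = MPI_Wtime();

/* ... the section of the program being timed ... */

double elapsed = MPI_Wtime() - start;
double max_elapsed;
/* Report the slowest process's time, since that is what limits the run */
MPI_Reduce(&elapsed, &max_elapsed, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
if (my_rank == 0)
    printf("Elapsed time: %f seconds\n", max_elapsed);
```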
# PBS Queues, and Accessing Gadi
In the previous two exercises, you’ve so far used the interactive nodes on Gadi. These are great, because they don’t use up the CPU time budget, and let you easily test your parallel code for small numbers of processors. However, you don’t get access to all that many CPUs on the interactive nodes, and you have to compete for them with others using that node.
Enter the PBS queue system! This is a system where you submit batch jobs to a queue, which determines whose code is allowed to run on Gadi at what time. This is done through a submission script: a bit of bash which specifies what you want Gadi to do, and which resources you want to use. Let's take a look at the preamble of one (provided as `batch_trap`), which looks something like this:

```bash
#!/bin/bash
#PBS -P c37
#PBS -q normal
#PBS -j oe
#PBS -l walltime=00:01:00,mem=16GB
#PBS -l wd
#PBS -l ncpus=8
```
You can see a quick guide which summarises all available directives on the NCI support page, as well as a more detailed guide on directives specifically, but we'll take you through the ones used here.

- `-P c37` indicates that this job should be attached to the `c37` project (which is the one for our course). This line is only essential if you utilise Gadi for multiple projects, but it's good to know about in case you do further research in HPC.
- The `-q normal` line says that we are submitting to the `normal` queue; other options include an `express` queue (which uses up more of the grant, but is less busy), and a `gpuvolta` queue, which allows you to utilise GPUs as well as CPUs. You will only ever need the `normal` queue in this course, unless you're doing extension GPU work on the major project.
- `-j oe` merges STDOUT and STDERR into STDOUT, so any errors are printed to STDOUT.
- `-l walltime=00:01:00,mem=16GB` is probably the most important line, and you must never leave it out. It limits the amount of walltime used to 1 minute, and the amount of memory used to 16GB. This line is essential, as without a timeout limit, an infinite loop could use up the entire compute budget of the course!
- `-l wd` causes Gadi to enter the directory from which the job was submitted, meaning your code can access files in the same directory as the submission script.
- `-l ncpus=8` indicates we want to request 8 CPUs for this job.
The rest of the script is bash code indicating what Gadi should do. You don't need to understand this in great depth (this isn't a bash scripting course, after all), but do read it and try to get a sense for what it's doing. In particular, on the line `mpirun -np $p ./mpi_trap`, you need to replace `./mpi_trap` with whatever the name of your compiled executable is.
Once you've got a submission script, the next step is to submit it to the queue! This is done using the `qsub` instruction in your terminal, e.g. `qsub batch_trap`. If you want to get details on your running PBS jobs, you can use the `qstat` command: `qstat -a` returns details of all your active jobs, while `qstat <job number>` returns information for just that job. The job number is provided to you in the terminal when you run the `qsub` command.

An example of what the `qstat -a` output might look like is below:
```
gadi-pbs:
                                                              Req'd  Req'd   Elap
Job ID            Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
35648347.gadi-pbs tw3382   normal-* batch_trap 808926   2  96   16gb 00:01 R 00:00
```
Once the job is complete, a new file will appear which contains the STDOUT/STDERR produced by the program during its time running on Gadi, along with a summary of the execution at the bottom. For example:

```
Point-to-point broadcast: 52 msecs
MPI broadcast: 32 msecs
Point-to-point reduce: 48 msecs
MPI reduce: 34 msecs
```
#### If you need to use other versions of MPI
To change the version of MPI, one can use `module load openmpi/X.X.X` instead of `module load openmpi`. Adding the same command near the start of the job script will load the specified version on the compute node.
To list the available versions of openmpi, one can use `module avail openmpi`. Note that you are always encouraged to use the default version of MPI, so use this instruction only if you need to.
# Submission Guidelines
Your assignment will be submitted via your GitLab repo as follows:
- [Fork](https://gitlab.cecs.anu.edu.au/comp4300/student-comp4300-2022-lab4/-/blob/master/README.md#forking-and-cloning-the-lab) the lab repo, then add the marker user `comp4300-2022-s1-marker` as a developer to your repo.
**How to add user**: Click “Members” on the left panel -> input `comp4300-2022-s1-marker` under “Gitlab member or Email address” -> Select the role permission as `Developer` -> Skip the "access expiration date" -> Invite
- Clone the lab to your own machine and complete all tasks. Once you finish your answers, please commit and push your code to your repo.
We will mark your answers and provide feedback via your repo.
# Submission Deadline
You must commit and push your code to your repo before **11:59pm on 4 May**. Any late updates after the deadline will not be considered.
# References
[MPI Documentation](https://www.open-mpi.org/doc/)