Lab 5 for COMP4300 - CUDA and GPU Programming

Lab Tasks

Welcome to the fifth lab for COMP4300! Today, we will take a brief look at programming using the CUDA API discussed in lectures, enabling us to harness the considerable power of GPU accelerators.

Firstly, a quick overview of the systems available to you with GPUs installed.

  • The easiest system to access (and the one we recommend you complete this lab on) is the stugpu2.anu.edu.au server administered by CECS. You can access this server via ssh in a similar fashion to partch (instructions here), using your ANU ID as your username. NB: We have seen students struggle with firewall issues connecting to this server - if this happens to you, try logging in to partch.anu.edu.au first, and then log in to stugpu2.anu.edu.au from there.
  • Gadi has a special queue for GPU nodes, called gpuvolta (see the queue structure page). All the normal caveats about using Gadi apply - in particular, make sure to limit the walltime for any jobs you submit. You will also need to specify the number of GPUs for the job to use. One additional restriction is that you must request 12 times as many CPUs as GPUs; so if you request 2 GPUs, you need 24 CPUs.
  • If your own computer has an NVIDIA GPU, you can run all of the CUDA code provided on that card! NVIDIA provides relatively good instructions on how to install the CUDA compilers here - if you have conda installed, you can do it in a single command!

As mentioned in the list, we recommend that you use stugpu2 for this lab - it allows you to focus on GPU programming without having to worry about Gadi queues, which are known to be quite long for the GPUs.

Hello GPU!

To start with, let’s take a simple Hello, World program to introduce ourselves to the compilation tools, and look at how to make a kernel call. The following code may be found in hello_cuda.cu (the .cu extension is used for files containing CUDA code).

#include <stdio.h>
#include <stdlib.h>   // for strtol
#include <cuda.h>

__global__ void Hello(void) {
    printf("Hello from thread %d in block %d\n", threadIdx.x, blockIdx.x);
} /* Hello */

int main(int argc, char* argv[]) {
    int blk_ct;        // Number of thread blocks
    int th_per_block;  // Number of threads per block

    blk_ct = strtol(argv[1], NULL, 10);
    th_per_block = strtol(argv[2], NULL, 10);

    Hello <<<blk_ct, th_per_block>>>();

    cudaDeviceSynchronize();

    return 0;
} /* main */

We use the NVIDIA CUDA Compiler (nvcc) to compile this code in a way that makes sense to both the GPU and the host machine. Specifically, you can write
nvcc -o hello_cuda hello_cuda.cu

You might notice this program takes additional runtime arguments to main - these can be provided in the standard way, e.g. ./hello_cuda 4 5 (for an execution with 4 blocks of threads and 5 threads per block).

Also of interest in this program is the __global__ specifier in front of the Hello function. This indicates that the function can be called from either the host (the CPU) or the device (the GPU), and is executed on the device. Such a function (known as a kernel) also requires an execution configuration, which is provided in the line Hello <<<blk_ct, th_per_block>>>(); - the first value is the number of thread blocks and the second is the number of threads per block. The other function qualifiers available are __host__ and __device__.
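To make the distinction concrete, here is a small hypothetical sketch (none of these functions are in hello_cuda.cu): a __device__ function can only be called from code running on the GPU, while a function marked both __host__ and __device__ is compiled for both sides.

// Hypothetical example only - not part of hello_cuda.cu
__device__ float Square(float x) {          // callable only from device code
    return x * x;
}

__host__ __device__ float Halve(float x) {  // compiled for, and callable from, both CPU and GPU
    return 0.5f * x;
}

__global__ void Demo(float* out) {          // launched from the host, runs on the device
    out[threadIdx.x] = Halve(Square((float) threadIdx.x));
}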

You should compile and run this program with a couple of different arguments to check that your chosen method of accessing a GPU is working correctly.

A couple of questions follow to check your understanding of basic CUDA concepts from the lectures; try answering them first, and then check that your answers are correct by experimenting with the program. As always, ask your tutor lots of questions if you’re not sure of an answer!

  • If you were to comment out the call to cudaDeviceSynchronize();, what would happen?
  • What is the largest number of threads you can start? What happens if you exceed that number?
  • For small values of blk_ct and th_per_block (try 4,4), you will notice that the thread outputs within a block are always ordered; thread 0 says hello, then thread 1, etc. Is this always the case? If not, what is the smallest value of th_per_block that we can provide to get outputs that are not ordered within a block?

Vector Addition and Scheduling

Hello world is great, but having covered how to compile and run a CUDA program, it's time to move on to more complex applications. Specifically, we're going to start with a program to implement vector addition. In programming terms, we're going to take two arrays x and y of size n, and return another array z, also of size n, where for every index i, z[i] = x[i] + y[i].
You will find in the file vector_add.cu a number of functions to assist with the serial parts of this task: initialising the arrays with random values, a serial function to perform the vector addition on the CPU, and a Two_norm_diff function, which computes the two-norm of the difference between the GPU output and the CPU output. The mathematical details of this function are not important, other than that if your code is correct, the two-norm difference should be 0.
What is missing is the body of three functions:

  • Allocate_vectors
  • Free_vectors
  • Vec_add

The Allocate_vectors and Free_vectors functions are self-explanatory; there are four vectors of length n in this program. x, y, and z need to be accessible from both the GPU and the CPU, while cz only needs to exist in CPU memory, as cz will contain the result of the CPU computing z. You can do this by using cudaMallocManaged and cudaFree, as well as the standard malloc and free calls. Further details on how to use these functions can be found in lectures, and in the man pages accessible from your terminal using (e.g.) man cudaMallocManaged.
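As a rough guide (the argument names and signatures here are assumptions - check the declarations in vector_add.cu), the allocation and deallocation might look something like this:

// Sketch only: argument names and signatures are assumptions, see vector_add.cu
void Allocate_vectors(float** x_p, float** y_p, float** z_p, float** cz_p, int n) {
    cudaMallocManaged(x_p, n * sizeof(float));   // managed memory: visible to host and device
    cudaMallocManaged(y_p, n * sizeof(float));
    cudaMallocManaged(z_p, n * sizeof(float));
    *cz_p = (float*) malloc(n * sizeof(float));  // host-only copy of the result
}

void Free_vectors(float* x, float* y, float* z, float* cz) {
    cudaFree(x);
    cudaFree(y);
    cudaFree(z);
    free(cz);
}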

The Vec_add method is the main kernel of this code, and where you will put the work of computing z on the GPU. The idea here is to give each thread a single index to operate on (for now you can assume that the number of threads is equal to the number of elements in the arrays). So, you will need to decide which index the current thread should work on, calculate the correct value at that index, and write it to z. Recall that to help you with this task, you have the threadIdx, blockIdx, gridDim, and blockDim of the thread at hand.
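For the one-thread-per-element case, a minimal sketch (the signature is an assumption - match whatever vector_add.cu declares) simply computes a global index from the block and thread indices:

// Sketch: one thread handles exactly one element
__global__ void Vec_add(const float x[], const float y[], float z[], const int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;  // global thread index
    if (i < n) {                                    // defensive guard for any extra threads
        z[i] = x[i] + y[i];
    }
}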

Your first task is to fill in these functions so that the GPU computes the correct sum into z[], evidenced by compiling and running the program and seeing it report that the two-norm of the difference is 0.

Having got the code correctly computing the sum of two vectors, the next step is to relax the earlier assumption that the number of threads equals the number of elements in each array. There are many ways of doing this, but the basic idea is to choose, for each thread, a set of indices to operate on, and then loop through those indices.
One common approach is to decide how many elements each thread should handle and give each thread a contiguous chunk of that size; for instance, if each thread gets 3 elements, thread (0,0) operates on indices 0, 1, and 2. This is an example of a block distribution.
Another approach is a cyclic distribution, where each thread strides through the array by the total number of threads; if there are 4 threads in total, thread (0,0) operates on elements 0, 4, 8, and so on.
Decide on one of these methods (or try implementing both) and implement it in your Vec_add kernel, so that the program can handle cases where n is larger or smaller than the total number of threads available. A sketch of both schedules is given below.
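As a rough sketch (the kernel signatures are assumptions - adapt the loop bodies to whatever vector_add.cu uses), the two schedules could look like this:

// Sketch only: both schedules assume a 1-D grid of 1-D blocks
__global__ void Vec_add_cyclic(const float x[], const float y[], float z[], const int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;   // this thread's first index
    int stride = gridDim.x * blockDim.x;             // total number of threads
    for (; i < n; i += stride) {                     // cyclic distribution
        z[i] = x[i] + y[i];
    }
}

__global__ void Vec_add_block(const float x[], const float y[], float z[], const int n) {
    int total = gridDim.x * blockDim.x;              // total number of threads
    int per_thread = (n + total - 1) / total;        // elements per thread, rounded up
    int start = (blockDim.x * blockIdx.x + threadIdx.x) * per_thread;
    int end = (start + per_thread < n) ? start + per_thread : n;
    for (int i = start; i < end; i++) {              // block distribution
        z[i] = x[i] + y[i];
    }
}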

Trapezoids, Trees, and Warp Shuffles

For the final part of this lab, we will move away from simply getting parallel algorithms correct, and instead measure the walltime taken to compute the integral of some function f over an interval from a to b using the trapezoidal rule.
For those not familiar, the trapezoidal rule approximates the integral by splitting [a, b] into n equal sub-intervals of width h = (b - a)/n and summing the areas of the trapezoids formed under the curve, giving approximately h * (f(a)/2 + f(a + h) + f(a + 2h) + ... + f(b - h) + f(b)/2).
For those of us who prefer to think in code rather than mathematics, here’s a serial implementation of the trapezoidal rule, which can be found alongside some other serial functions in the file trap.cu:

float Serial_trap(const float a, const float b, const int n) {
    float x, h = (b - a) / n;
    float trap = 0.5 * (f(a) + f(b));

    for (int i = 1; i <= n - 1; i++) {
        x = a + i * h;
        trap += f(x);
    }
    trap = trap * h;

    return trap;
}

Note that for our program, the number of trapezoids used in the approximation is parameterised as n.

As is hopefully becoming more familiar, the first part of improving the performance of a program like this with CUDA is to write a device kernel. Somewhat unhelpfully, the code for the kernel has not been provided to you, so your first task is to write the kernel function Dev_trap. Two gotchas to be aware of here:

  • You can’t just add to the float pointed to by trap_p directly, as that would create a race condition. Helpfully, CUDA provides a function atomicAdd() that you should use for this task. It takes a pointer to the shared variable as its first argument, and the value to add as its second (a minimal usage sketch is given below).
  • Reading the Trap_wrapper code carefully, you’ll notice that trap_p is already filled in for the first value; make sure not to compute the 0th trapezoid again inside your kernel!

You may assume that the number of threads will always equal n - no need to create a scheduling system this time!
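To illustrate the shape of such a kernel, here is a sketch only - the real signature of Dev_trap and the exact work done by Trap_wrapper are defined in trap.cu, and f is assumed to be callable from device code:

// Sketch: assumes one thread per trapezoid, and that *trap_p already holds
// 0.5*(f(a) + f(b)) from Trap_wrapper; the final multiplication by h is assumed
// to happen back in the wrapper, as in Serial_trap
__global__ void Dev_trap(const float a, const float b, const int n, float* trap_p) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    float h = (b - a) / n;
    if (i > 0 && i < n) {                    // skip the 0th trapezoid, already counted
        atomicAdd(trap_p, f(a + i * h));     // race-free addition into the shared total
    }
}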

Now that you’ve got a nice shiny kernel, you should try running it for a number of values for n, and with a couple of different block and thread counts. a and b aren’t super important, because we’re not checking for correctness, but -3 and 3 respectively are good values.
Your values may vary, but one of your tutors got these average times for a couple of runs of the program:

n        blk_count  th_per_blk  Serial Time (s)  Parallel Time (s)
256      16         16          0.0000114        0.000218
1024     32         32          0.0000515        0.000246
4096     64         64          0.000182         0.000243
16384    128        128         0.000704         0.000261
65536    256        256         0.00173          0.000375
262144   512        512         0.00330          0.000737
1048576  1024       1024        0.00879          0.00227

These results show a speed-up, especially for larger n, but getting that speed-up requires launching 1,048,576 threads against the single core the CPU version gets! Clearly we can do better.

Warp Shuffles

As you’ve probably guessed by now, the problem with this kernel is that every single thread has to synchronise with every other thread to write their contribution to the overall sum into the shared variable trap_p. So, if we want to improve our function, we need to reduce the number of atomic operations involved. As you’ve seen in lectures, the core unit of execution on an NVIDIA GPU is a warp, and helpfully for us, there are some built-in functions that allow threads inside a warp to talk to each other.
Specifically, we’re interested in the warp shuffle - a tree-based communication method. In this method, threads within a warp pair off with each other, with one thread sending data and the other receiving it. This is implemented in CUDA through the built-in function __shfl_down_sync, and the diagram below (taken from the course textbook) demonstrates how repeated shuffles in a loop, with each thread adding the value it reads, build up a global sum.
[Diagram from the course textbook: tree-structured warp shuffle sum for a warp of size 8.]
This figure is for a hypothetical system with a warp size of 8 (our GPUs have a warp size of 32) for readability. To understand how this produces a global sum, we can trace the values received by thread 0.
On the first iteration, it receives the value held by thread 4.
On the second iteration, it receives the value held by thread 2, which by then is the combined contribution of threads 2 and 6.
On the third iteration, it receives the value held by thread 1. Thread 1 combined with thread 5 on the first iteration and with thread 3 (which had itself combined with thread 7) on the second, so this final exchange brings in the contributions of threads 1, 3, 5, and 7.
At this point, thread 0 has information from every other thread in the warp, so it holds the global sum.

To get a little more technical, the signature of __shfl_down_sync is as follows:

__device__ float __shfl_down_sync(
    unsigned mask,
    float var,
    unsigned delta,
    int width = warpSize
);

The mask argument specifies which of the 32 threads in the warp participate in this communication; it is a 32-bit bitmask in which bit i corresponds to lane i. So, if you want all threads to participate, that's 0xFFFFFFFF; if you want only the first thread to participate, that's 0x00000001; and so on.
The var argument is simply the value that is being communicated to the other thread.
The delta argument is the lane offset we are communicating across: the calling thread reads var from the thread delta lanes above it. So if thread 0 should read from thread 4, set delta=4.
Finally, the width argument (which is optional) specifies the width of the communication. You will rarely want to change this from the width of a warp.
The value returned from the call is the corresponding value from the thread we are paired with. So if we are thread 0 paired with thread 4, the float returned is the var held by thread 4.
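Putting this together, a tree-structured sum across a single warp could look like the following sketch (an assumption about one reasonable structure, with all 32 threads of the warp participating in every call):

// Sketch: tree sum across one full warp; only lane 0 ends up holding the complete total
__device__ float Warp_sum(float var) {
    for (unsigned delta = warpSize / 2; delta > 0; delta /= 2) {
        var += __shfl_down_sync(0xFFFFFFFF, var, delta);  // add the value held delta lanes up
    }
    return var;  // only meaningful on lane 0 of the warp
}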

Your final task for this lab is to modify your trap.cu file to utilise the warp shuffle method outlined above to obtain an improvement in the computation time taken by the CUDA trap function. Some pointers are below:

  • It is highly recommended that you define another function to do the summation, Warp_sum, with a signature like __device__ float Warp_sum(float var). Get this working to sum up values within a warp, and then think about how to use it to speed up your Dev_trap kernel.
  • Remember that only one thread will have the correct result from the Warp_sum call - don’t add the result from every thread to trap_p!
  • You may assume that thread blocks will contain only one warp. As such, you should only test this code with th_per_blk=32.

Take-home Tasks

For this week’s take-home tasks, we will be exploring the performance of a program to compute the dot product of two vectors. This is a reduction operation on two vectors x and y, where dot = x[0]*y[0] + x[1]*y[1] + ... + x[n-1]*y[n-1] over all indices of the two vectors. As usual, serial code to implement this is provided, and it looks like this:

float Serial_dot_prod(const float x[], const float y[], const int n) {
    float cdot = 0.0;

    for (int i = 0; i < n; i++) {
        cdot += x[i] * y[i];
    }
    return cdot;
}

Exercise 1 - Basic CUDA Implementation (1 mark)

As in the lab exercises, you’ve been provided in dot_prod.cu a file with serial functions and some helper functions to manage inputs and outputs. Fill in this file with three things:

  • A kernel that will compute the dot product in parallel. You should use atomicAdd, and may assume that the number of threads always equals n.
  • Functions to allocate and de-allocate the memory used in your kernel. Note carefully that one of the values you’ll need to pass into your kernel is not an array, and is just a single float.
  • Timing calls to time both the provided serial implementation and your parallel kernel.

Note that because of the large number of atomic operations, your code may well be slower than the serial version. This does not necessarily indicate a problem with your submission for this exercise.

Exercise 2 - Scheduling (1 mark)

As before, we would like to relax the assumption that the number of threads always equals n. For this task, implement two additional kernels; one which implements a block distribution of array indices, and one which implements a cyclic distribution of array indices. Also record which of these kernels is more performant, utilising a suitable experiment. Describe your experimental procedure, your results, and a brief explanation of why the more performant kernel performs better in the provided dot_prod_observations.md file.

Exercise 3 - Shared memory (1 mark)

Finally, we explore one more method of increasing performance, in addition to the warp shuffle described in the lab. CUDA allows an array to be shared among the threads of a block using fast on-chip (shared) memory, which gives an alternative way of accessing values held by other threads. Such an array should be declared inside the kernel using the __shared__ keyword. For instance, __shared__ int my_shared_int_array[10]; declares a shared array holding 10 ints. Note that the number of elements in the array must be known at compile time; you can't dynamically allocate these arrays.
One other piece of information that might be useful depending on how you do this exercise is that you can implement a barrier for all the threads in a block using __syncthreads().
You may also make the same assumption as when we were doing the warp shuffles, that a block will always have 32 threads.
Utilising shared arrays to access values held by other threads in a warp, implement an additional kernel, built on the more performant scheduling method from Exercise 2, that outperforms both of the Exercise 2 kernels.
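To illustrate the mechanics only (a hypothetical demo kernel, not a solution to this exercise), a block can stage per-thread values in a shared array, synchronise, and then read another thread's slot:

// Hypothetical demo of __shared__ and __syncthreads(); not the dot-product kernel
__global__ void Neighbour_demo(float out[]) {
    __shared__ float vals[32];                  // one slot per thread in a 32-thread block
    int t = threadIdx.x;

    vals[t] = (float) t;                        // each thread writes its own slot
    __syncthreads();                            // barrier: all writes now visible to the block

    out[blockIdx.x * blockDim.x + t] = vals[(t + 1) % 32];  // read a neighbouring thread's slot
}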

NB: You might need to use large values for n and compute the dot product many times in order to see a performance difference. Try experimenting with values that make the CUDA version take around a second.

References

CUDA Programming Guide