Lab 5: Matrix Multiply – Improved Scheduling
Prologue: Logistics
Due Dates
For this lab, you’ll be turning in the following deliverables:
- Checkpoint: Due Monday, October 6, 11:59pm. For this checkpoint, submit your responses to the prelab (Part 0), and tell us how you're doing in optimizing your implementations for Parts 1 and 2.
- Final Submission: Due Friday, October 10, 11:59pm. Submit your completed code for matmul_2.cu, as well as a write-up containing your answers to Questions 1-3.
Starter Code
You can get the starter code for this lab by cloning the lab repository:
Introduction
Goals for This Lab
In Lab 4, we started developing an implementation of matrix multiplication, and looked at how to design our matrix multiplication kernel to exploit data reuse at the level of the L1 SRAM and the register file. In this lab, we’ll be optimizing our implementation further, mostly by improving the scheduling of our computation to increase hardware utilization.
This lab has three parts:
- First, we'll study how occupancy on our SMs actually impacts performance as a function of parallelism. You will have to submit this part of the lab as part of your checkpoint.
- Second, we'll try to improve the performance of our kernel on the 3072 * 3072 * 3072 problem size that we've already seen. This part of the lab will be relatively open-ended, but we'll have some suggestions for things you might want to try.
- Finally, we'll extend our implementation to be better at handling varying problem sizes, and analyze the ways in which problem size affects performance.
Part 0: Understanding Occupancy
So far, we’ve used parallelism — both through multi-threading and instruction level parallelism — to hide latency.
However, simply increasing the amount of parallelism does not guarantee better performance beyond a certain threshold. To understand why, we can turn to Little's Law, which provides a useful framework for reasoning about the relationship between throughput, latency, and required parallelism. Little's Law is a fundamental result from queueing theory, which states that:

required parallelism = throughput × latency
Little’s Law captures an important relationship: to maintain a steady rate of completing work (throughput) when each piece of work takes time to finish (latency), you need to have multiple pieces of work in progress simultaneously (parallelism). Intuitively, if operations take longer to complete, you need more of them happening in parallel to keep results flowing out at a constant rate.
Prelab Question 1: Using Little’s law, how many bytes must be in flight simultaneously (or, “required parallelism”) to saturate DRAM bandwidth? You can approximate DRAM latency as 800 cycles (as we saw from Lab 3). The GPU clock speed is 2175 MHz, and DRAM bandwidth is 360 GB/s.
In this part of the lab, we will put Little’s Law into practice by examining how occupancy affects our ability to saturate memory bandwidth.
Occupancy is a metric of how well we're utilizing an SM, telling us the percentage of an SM's capacity that we're actually using:

occupancy = (number of active warps per SM) / (maximum number of warps per SM)
To calculate occupancy, we can do some simple hand calculations. On our RTX 4000 Ada GPU, each SM has 4 warp schedulers, and each scheduler can manage up to 12 warps. This gives us a maximum of 4 × 12 = 48 active warps per SM. Then, to find the number of active warps per SM, we determine how many blocks can be co-resident on an SM (as limited by thread count, shared memory, and register pressure) and multiply by the number of warps each block uses.
Prelab Question 2: Assuming the kernel is not limited by register usage, how many blocks can be active simultaneously on an SM when each block requests 1000, 10000, or 30000 bytes of shared memory? What percentage occupancy does the kernel achieve in each case if each block is running 64 threads?
Now, we'll run a simple kernel in occupancy.cu. This file contains a memcpy function that copies data from a source array to a destination array and reports the memory bandwidth achieved. To control occupancy, we use a simple trick: by reserving different amounts of shared memory, we artificially limit how many blocks can be co-resident on an SM. Our goal is to determine the minimum occupancy percentage that achieves peak bandwidth.
Prelab Question 3: In occupancy.cu, fill out the amounts of shared memory in shared_memory_configuration required to run the kernel with 20%, 50%, and 80% occupancy. What sizes did you choose and why?
Prelab Question 4: What is the minimum occupancy needed to reach peak bandwidth? How many threads were active, how much data did each thread load, and how much total data was in flight?
Note: You may also experiment with the TILE variable and vectorized loads – just make sure it
divides the array size evenly.
Part 1: Improving Performance
Our first goal in this lab will be to improve the performance of our implementation from Lab 4, as measured in the same setting in which we previously tested it: multiplying square matrices with side length 3072. Specifically, we aim to reduce our kernel's run time to under 5 ms:
Deliverable: In the file matmul_2.cu, implement the functions matmul_improved and launch_matmul_improved so that they can complete the 3072 * 3072 * 3072 benchmark in 5 ms or less. Feel free to use any techniques you like to achieve this run time target.
(If your implementation from Lab 4 already runs in under 5 ms, you don’t need to do anything, although we encourage you to try to optimize it further if you want.)
Depending on exactly how you wrote your implementation for Lab 4, your kernel’s run time may be bottlenecked on different things. We don’t know exactly what you’ll need to do to make your specific implementation run faster, but we do have some suggestions for things you can try that may help.
Suggestion: Revisiting Issues from Lab 4
In Lab 4, we discussed a number of factors that could affect the performance of your kernel. Before trying any more sophisticated techniques to improve performance, you may want to check that your kernel is already doing well according to these criteria:
- Maximize data reuse: Try to tune the numerical parameters of your kernel so that you minimize the number of global-to-L1 and L1-to-register loads you need to perform.
- Avoid register spills: Stack variables can spill from registers to local memory if they exceed the capacity of the register file, or if you access them in the wrong way (see “Using Registers Effectively”).
- Avoid bank conflicts: Whenever multiple lanes in a warp try to simultaneously access different shared memory addresses mapping to the same word index mod 32, you will pay a performance penalty (see “Using L1 Effectively”).
- Avoid touching multiple cache lines: Whenever multiple lanes in a warp try to simultaneously access addresses on different L1 cache lines (when using the L1 SRAM as an implicit cache), you will pay a performance penalty (see “Using L1 Effectively”).
Suggestion: Hiding Memory Access Latency
In Lab 4, we mostly discussed the bandwidth of different memory resources (DRAM, L1) as a limiting factor in determining the performance achievable with different kernel designs.
On top of bandwidth, it is also sometimes important to consider the latency of individual memory requests. From the prelab in Lab 3, we learned that:

- Requests to global memory typically complete within ~700-800 cycles.
- Requests from L1 typically complete with a latency of ~35 cycles.

The fact that memory requests execute with high latencies means that it's important not to create long chains of serial dependencies between small memory requests and small units of computation. Instead, it's better to fire off a batch of memory requests all at once, and then compute on their results all at once.
Concretely, this means that if you’re explicitly loading data from global memory
to shared memory in your kernel, you may want to consider loading data from
multiple positions along the k dimension all at once, rather than loading
single columns/rows of the A/B matrices at a time.
If you’re already loading from multiple k positions at a time, you may want
to think about what benefits that design might be providing, and how you should
set your kernel’s numerical parameters so that you minimize latency overheads
while also making effective use of the SM’s limited shared memory capacity.
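To illustrate the idea, here is a minimal sketch of a kernel whose inner loop loads a BK-deep slab of A and B in one batch before computing on it. This is not meant as a drop-in replacement for your Lab 4 kernel: all names and tile sizes are placeholders, and for brevity it assumes a 32 × 32 thread block and matrix dimensions divisible by 32, with no bounds checking or register tiling.

```cuda
#include <cstdint>

constexpr int BM = 32, BN = 32, BK = 32;  // placeholder tile sizes

__global__ void matmul_slab_sketch(
    int32_t size_i, int32_t size_j, int32_t size_k,
    float const *a, float const *b, float *c)
{
    __shared__ float a_tile[BM][BK];
    __shared__ float b_tile[BK][BN];

    int32_t i = blockIdx.y * BM + threadIdx.y;  // output row
    int32_t j = blockIdx.x * BN + threadIdx.x;  // output column
    float acc = 0.0f;

    for (int32_t k0 = 0; k0 < size_k; k0 += BK) {
        // Batch the global-to-shared loads for BK consecutive k positions,
        // rather than fetching a single column/row of A/B at a time.
        a_tile[threadIdx.y][threadIdx.x] = a[i * size_k + (k0 + threadIdx.x)];
        b_tile[threadIdx.y][threadIdx.x] = b[(k0 + threadIdx.y) * size_j + j];
        __syncthreads();

        // Then compute on the whole BK-deep slab at once.
        for (int32_t kk = 0; kk < BK; ++kk) {
            acc += a_tile[threadIdx.y][kk] * b_tile[kk][threadIdx.x];
        }
        __syncthreads();
    }

    c[i * size_j + j] = acc;
}

// Possible launch (host side):
//   dim3 block(BN, BM);                   // 32 x 32 threads
//   dim3 grid(size_j / BN, size_i / BM);
//   matmul_slab_sketch<<<grid, block>>>(size_i, size_j, size_k, a, b, c);
```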
Suggestion: Overlapping Data Movement and Computation
Memory requests and floating-point operations make use of separate physical resources, and are able to execute simultaneously. To make the most effective use of the hardware, it is desirable to overlap memory requests with computation as much as possible.
The compiler will generally do a good job of scheduling instructions to overlap
L1-to-register loads with computation wherever possible, by leveraging ILP.
However, overlapping global-to-shared-memory data movement with computation can
be trickier, because writing to shared memory typically involves explicit
synchronization with __syncthreads(). If you’re not careful, writing code
using the straightforward design pattern of “load to shared memory, synchronize,
compute” will erase all opportunities for overlapping.
There are three main strategies you might consider pursuing to overlap loads from global memory with computation in your matrix multiply kernel.
The first strategy is to rely on the fact that under certain circumstances, multiple blocks can be scheduled on a single SM simultaneously. If you have two blocks running on each SM, there is a good chance that as the SM settles into its steady-state behavior, each block’s computation will run opportunistically while the other block’s warps are stalled waiting on loads from global memory.
If you choose to go this route, it may be helpful to keep in mind that in order
for two blocks to be scheduled on the same SM, each block must use less than
half of the SM's shared memory and register capacity. To hint to the compiler
that you want it to allocate registers in a way which leaves room for at least
two blocks per SM, you can pass a value of 2 as the second argument to
__launch_bounds__(...).
You can check how many blocks of your kernel invocation can run on a single SM
using the CUDA API function
cudaOccupancyMaxActiveBlocksPerMultiprocessor.
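For example, the two mechanisms above might be used as follows. The kernel name, block size, and shared memory amount here are placeholders, while __launch_bounds__ and cudaOccupancyMaxActiveBlocksPerMultiprocessor are the standard CUDA constructs:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Ask the compiler to limit register usage so that at least 2 blocks of up to
// 256 threads each can be resident on one SM.
__global__ void __launch_bounds__(256, 2) matmul_improved_sketch() {
    // ... kernel body ...
}

void report_blocks_per_sm() {
    int max_blocks_per_sm = 0;
    // How many blocks of this kernel, launched with 256 threads and 16 KB of
    // dynamic shared memory per block, can be co-resident on one SM?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &max_blocks_per_sm,
        matmul_improved_sketch,
        /* blockSize = */ 256,
        /* dynamicSMemSize = */ 16 * 1024);
    printf("Max active blocks per SM: %d\n", max_blocks_per_sm);
}
```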
Achieving overlap by running multiple blocks on each SM may be easier to program, since you can rely on the hardware to perform the overlapping, but it gives you less direct control; in particular, it can be hard to guarantee that multiple blocks of your kernel invocation will actually fit on a single SM. One alternative that ensures overlap happens is to take explicit control of how warps within a single block are assigned to tasks, which leads us to the second strategy.
The second strategy is to use different warps to carry out different tasks. You could program your kernel so that some warps in each block are responsible solely for loading data from global memory to L1, while other warps in the block are responsible solely for performing computations on data already in L1. Because warps run concurrently with respect to each other, performing data movement and computation on different warps will result in those operations being overlapped. This technique is sometimes known as “warp specialization.”
If you do adopt such a warp-specialized design, you will likely need to double-buffer the storage for your A and B tiles, to avoid overwriting data while you’re still computing on it.
The final strategy you might adopt for overlapping data movement and computation is to use the relatively new “async copy” instructions introduced in NVIDIA's Ampere generation of GPUs (and also supported on our Ada-generation GPU). Async copy instructions are like ordinary load instructions, with two key differences:
- Async copies can transfer data directly from global memory to shared memory, without going through registers.
- As the name suggests, async copies execute asynchronously. To be sure that the data transfer initiated by an async copy instruction has completed, you need to execute an explicit “wait” instruction which blocks on completion of the copy.
Unlike the prior two overlapping strategies discussed in this section, by using async copies you can overlap data movement and computation without any warps ever being stalled on memory requests. Because register space is a precious resource, and because every warp consumes register space whether it’s currently executing or not, reducing the number of warps required to implement overlapping can potentially deliver significant improvements in efficiency.
Similarly to the warp-specialized overlapping design, when using async copies to implement overlapping you will likely need some kind of double buffering.
On top of double buffering, you may want to organize the async copies that you create, which you can do by bundling multiple asynchronous copies into a commit group. The idea of a commit group is that you can queue up several async copies before telling the hardware to “commit” them as a batch. Critically, within a group, there are no guarantees about the order in which individual copies will complete, but ordering between groups is enforced: all copies in group N will complete before any copies in group N+1 are made visible.
To wait for a particular commit group to complete, you need to specify how many commit groups are allowed to be pending at that point in time. The wait instruction will block until the number of outstanding groups falls below the threshold you provide.
For example, in the timeline below, three asynchronous groups are issued at once. The code then waits until no more than 1 group is still pending before proceeding. Finally, it waits for all asynchronous operations to complete by blocking until zero groups remain pending.
Alternatively, we could first wait until at most 2 groups are pending, then start computing as soon as the first group arrives. This is shown in the timeline below.
To let you leverage this functionality of the hardware, we've provided three functions in the starter code which you can use to access the async copy instructions [1]:
- cp_async4, which initiates an asynchronous copy of 4 words (16 bytes) per CUDA thread from an address in global memory to an address in shared memory. The source and target addresses must each be 16-byte aligned.
- async_commit_group, which commits the current group of asynchronous copy operations initiated by the current CUDA thread.
- async_wait_pending, which blocks until there are at most N pending asynchronous commit groups (note that N is a template parameter, and must be specified at compile time). Note that this does not guarantee completion of any copies initiated by other CUDA threads; to synchronize across CUDA threads, you still need to separately call __syncthreads().
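To show how these helpers might fit together, here is a sketch of a double-buffered load/compute pipeline. This is only a fragment, not complete code: TILE_FLOAT4S, num_slabs, global_src (assumed here to be a float4 pointer into global memory), and slab_offset(...) are placeholders, each thread copies just one float4 per slab for simplicity, and you should check the starter code for the helpers' exact signatures.

```cuda
__shared__ float4 tiles[2][TILE_FLOAT4S];  // two buffers: fill one, compute on the other

int buf = 0;
// Prime the pipeline: start fetching slab 0 into buffer 0.
cp_async4(&tiles[buf][threadIdx.x], &global_src[slab_offset(0) + threadIdx.x]);
async_commit_group();

for (int slab = 0; slab < num_slabs; ++slab) {
    if (slab + 1 < num_slabs) {
        // Kick off the copy of the *next* slab into the other buffer, so it
        // proceeds in the background while we compute on the current one.
        cp_async4(&tiles[buf ^ 1][threadIdx.x],
                  &global_src[slab_offset(slab + 1) + threadIdx.x]);
        async_commit_group();
        // Two groups may now be pending; waiting until at most 1 remains
        // guarantees the *older* group (the current slab) has arrived.
        async_wait_pending<1>();
    } else {
        // Last slab: nothing left to prefetch, so wait for everything.
        async_wait_pending<0>();
    }
    __syncthreads();  // make the current buffer visible to all warps

    // ... compute on tiles[buf] ...

    __syncthreads();  // all warps must finish reading tiles[buf] before it is
                      // refilled on a later iteration
    buf ^= 1;
}
```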
Note: Because of the balance between memory bandwidth and compute throughput on our machine, many warp-specialized or asynchronous variants may not give you performance improvements. The benefit depends on your baseline performance and how the warp specialization or asynchrony is actually implemented. We suggest starting with simpler, easier-to-program optimizations before moving on to these heavier-weight techniques for overlapping data movement and computation.
Suggestion: Vectorized Loads
So far in this course, we’ve always loaded data from memory to registers in denominations of 1 word (4 bytes) per CUDA thread. However, the hardware also supports loading 2 words (8 bytes) or 4 words (16 bytes) per CUDA thread in a single instruction. This applies both when accessing global memory and when accessing shared memory.
These wide load instructions are known as “vectorized loads,” and using them can both improve global memory bandwidth utilization and reduce instruction issue overhead (Source). Reducing instruction issue overhead for loads from shared memory is especially important in the inner loop of a matrix multiply kernel, where every cycle not spent issuing FMA instructions is a cycle wasted.
You can access vectorized load instructions by using the special built-in
float2
and
float4
types, which represent a vector of 2 floats and a vector of 4 floats,
respectively. To perform a vectorized load, just cast the source pointer to a
pointer to the appropriate vector type:
float *buf = /* ... */;
int32_t idx = /* ... */;
float4 data = *reinterpret_cast<float4*>(&buf[idx]); // note: must be aligned!
When performing a vectorized load, the address you load from must be aligned to
the width of the vector type. You can access the individual word-sized values
inside the vector types using the fields .x, .y, .z, .w (in that order).
When using vectorized loads to access shared memory, you may wonder about the interaction between vectorized loads and bank conflicts. After all, strided access patterns like the following will typically cause bank conflicts:
float *buf_in_shmem = /* ... */;
// 2-way bank conflict on every even-numbered bank:
float x = buf_in_shmem[2 * threadIdx.x];
// 2-way bank conflict on every odd-numbered bank:
float y = buf_in_shmem[2 * threadIdx.x + 1];
You may ask: is loading a float2 from shared memory fundamentally doing the
same thing as the code above, and therefore incurring a 2-way bank conflict? Do
we get a bank conflict when we run code like the following?
float *buf_in_shmem = /* ... */;
// bank conflict? (Spoiler: no!)
float2 xy = *reinterpret_cast<float2*>(&buf_in_shmem[2 * threadIdx.x]);
The short answer is: no, vectorized loads do not incur bank conflicts in cases like the above, where all the lanes in a warp access contiguous addresses. Rather, the L1 SRAM can first spend one cycle servicing the first 16 lanes’ requests, and then spend a subsequent cycle servicing the next 16 lanes’ requests [2]. On each cycle, we use every bank of the L1 SRAM exactly once, so we can make use of the L1’s full bandwidth with no conflicts.
Finally, note that, thanks to compiler optimizations, your code may
already implicitly contain vectorized loads. To see if the compiler is
inserting vectorized loads into your program, look for instructions like
ld.shared.v2 and
ld.shared.v4
in the generated PTX.
Suggestion: Block Scheduling for L2 Reuse
When your kernel invocation has many more blocks than SMs, not all blocks will
run at the same time. Instead, the blocks launched by your kernel invocation
will go into a queue, and will only begin executing when resources become
available. Although CUDA provides no formal guarantees about the order in which
blocks will be scheduled, in practice blocks are scheduled in ascending order of
linear block index (with blockIdx.x being the fastest-moving dimension).
By thinking about the order in which blocks are scheduled on the GPU, you may be
able to find ways to improve L2-level data reuse by designing your matrix
multiplication kernel such that it is likely that blocks running at the same
time will access the same data in L2 as often as possible. To do so, you would
need to change how each block in your kernel computes its logical i, j, k
indices based on its blockIdx.
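As a concrete illustration, here is one possible remapping, similar in spirit to the threadblock swizzling used by some matmul libraries. All names are placeholders, and this is only a sketch: it assumes a 1-D grid with exactly num_tiles_i * num_tiles_j blocks, and packs consecutive block indices into GROUP_ROWS-tall groups so that the set of blocks running at any given time touches fewer distinct panels of A and B in L2.

```cuda
constexpr int GROUP_ROWS = 8;  // a tuning parameter

__device__ void block_to_tile(int num_tiles_i, int num_tiles_j,
                              int &tile_i, int &tile_j)
{
    int linear = blockIdx.x;
    int group_size = GROUP_ROWS * num_tiles_j;   // blocks per group
    int group = linear / group_size;
    int first_row = group * GROUP_ROWS;
    // The final group may be shorter if num_tiles_i isn't a multiple of GROUP_ROWS.
    int rows_in_group = min(GROUP_ROWS, num_tiles_i - first_row);
    int within = linear % group_size;
    tile_i = first_row + within % rows_in_group;  // fastest-moving: row within the group
    tile_j = within / rows_in_group;
}
```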
Questions
Once you’ve implemented your matmul_improved kernel and achieved a run time
under 5 ms on the 3072 * 3072 * 3072 problem size, you can answer the
following question:
Question 1 for final write-up: What was the best run time you were able to achieve on the 3072 * 3072 * 3072 problem size? What percentage of peak FLOP/s utilization does that correspond to? What techniques did you try to use to optimize your kernel, and which of those did you find actually deliver a speedup in practice? Where did the biggest wins come from? Did you encounter any interesting bugs along the way? Are there any remaining puzzles about your kernel’s performance that you haven’t yet resolved?
Part 2: Handling More Problem Sizes
So far, we’ve focused on optimizing the performance of our kernel for just a
single problem size: 3072 * 3072 * 3072. In real applications, matrix
multiplications can consist of a wide range of different problem sizes, some
of which have fundamentally different performance characteristics than the
problem size we’ve been studying until now.
In this section, we’ll be extending our kernel to be able to handle a greater variety of problem sizes efficiently. Designing matrix multiplication algorithms to be robust to truly arbitrary problem sizes is an entire art unto itself, and we’ll be investigating it in more detail in Lab 6. For now, we’ll be focusing on just this restricted set of problem sizes:
| size_i | size_j | size_k |
|---|---|---|
| 3072 | 3072 | 3072 |
| 512 | 3072 | 3072 |
| 256 | 3072 | 3072 |
| 128 | 3072 | 3072 |
| 64 | 3072 | 3072 |
| 32 | 3072 | 3072 |
| 16 | 3072 | 3072 |
| 1 | 3072 | 3072 |
| 256 | 256 | 256 |
| 256 | 256 | 1024 |
| 256 | 256 | 8192 |
| 128 | 128 | 32768 |
(You can also find this table in machine-readable JSON format in the file
sizes.json included with the starter code.)
Analysis
Before we start trying to optimize our code for these problem sizes, it may be helpful to have the answers to a few questions in mind:
Question 2 for final write-up: For each of the problem sizes in this lab, walk through the following analysis (you’re welcome to use a script to automate the calculations!):
1. How many total FLOPs does the problem size require us to perform?
2. Considering (1), what is the fastest we could process this problem size if FLOPs were the only constraint?
3. How many unique bytes of data does this problem size require us to load from / store to DRAM?
4. Considering (3), what is the fastest we could process this problem size if data movement were the only constraint?
5. Considering (2) and (4) together, what lower bound does this imply for the actual run time of our algorithm? Is the workload compute-bound or bandwidth-bound?
6. Considering (5), what is the maximum TFLOP/s we could achieve on this problem size? (This is just (1) divided by (5).)
7. In your matmul_improved implementation from Part 1, how many threadblocks will be launched for this problem size?
8. How does (7) compare to the number of SMs on the machine?
9. What might (8) imply about how your matmul_improved kernel could be expected to perform on this problem size?
10. How does your matmul_improved kernel perform on this problem size in reality? How does that compare to (6) and to your hypothesis in (9)?
Implementation
For some of the problem sizes we’re considering, it’s difficult to fill the
whole machine with useful work if each threadblock is responsible for an
independent subset of the output matrix C. The fundamental problem is that
the C matrix in some problems (e.g. the 128 * 128 C matrix in 128 * 128 * 32768) simply isn't large enough to be divided into as many tiles as we have SMs.
To address this, we need to somehow adjust our partitioning scheme so that we
are able to launch more parallel threadblocks than we have output tiles, in
order to give each SM enough work to do. Recalling our three-dimensional
visualization of the i, j, k iteration space, one relatively simple way to
create more units of parallel work is to split the iteration space along the
k dimension:

Perhaps unsurprisingly, this partitioning strategy for matrix multiplications is often referred to by the name “split-k.”
Of course, splitting on the k dimension introduces a complication we didn’t
previously have to deal with: now there can be multiple threadblocks computing
different additive contributions to the same output tile.

That means that in order to implement our split-k partitioning strategy, we’re going to need to somehow perform a reduction to combine the partial sums contributed by each threadblock.
There are many ways to perform reductions on GPUs in general, and there are even many types of reduction strategies which are used specifically in the context of matrix multiplication. Some matrix multiplication reduction strategies in contemporary use are quite complicated, and involve communication between actively-running threadblocks in a single kernel. However, it’s possible to do quite well using a relatively simple approach.
The simple approach we suggest you follow in this lab is to adopt a two-kernel design:
- The main kernel computes the partial sums contributed by each threadblock, without reducing them.
- The reduction kernel computes the final output by combining the partial sums.

The advantage of this two-kernel design is that it requires no intra-kernel synchronization. In both the main kernel and the reduction kernel, each CUDA thread can write its output to an independent location in global memory. The only synchronization required is that the reduction kernel must not start running until the main kernel has completed, but (as you might recall from Lab 3) that is already guaranteed by default by the kernel launch mechanism.
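To make this concrete, here is a minimal sketch of what the reduction kernel might look like. It assumes, purely as an illustration, that the main kernel writes its partial results into the workspace as num_splits consecutive size_i x size_j matrices stored back to back; the layout and names are placeholders, not something dictated by the starter code.

```cuda
__global__ void splitk_reduce_sketch(
    int size_i, int size_j, int num_splits,
    float const *workspace,  // num_splits * size_i * size_j floats
    float *c)                // size_i * size_j floats
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int n = size_i * size_j;
    if (idx >= n) {
        return;
    }
    // Each CUDA thread owns one output element and sums its partial values.
    float acc = 0.0f;
    for (int s = 0; s < num_splits; ++s) {
        acc += workspace[s * n + idx];
    }
    c[idx] = acc;
}

// Possible launch (after the main kernel, inside launch_matmul_improved_reduce):
//   int n = size_i * size_j;
//   int threads = 256;
//   int blocks = (n + threads - 1) / threads;
//   splitk_reduce_sketch<<<blocks, threads>>>(size_i, size_j, num_splits, workspace, c);
```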
With this design, we’re ready to extend our implementation to handle more problem sizes:
Deliverable: In the file matmul_2.cu, implement the function launch_matmul_improved_reduce, and any associated CUDA kernels, to efficiently process the full suite of problem sizes we're considering in this lab. Specifically, for each problem size, aim to achieve the percentage of the maximum available throughput shown in the table below. You may ignore 256 * 256 * 256 and 256 * 256 * 1024.
| size_i | size_j | size_k | throughput % |
|---|---|---|---|
| 3072 | 3072 | 3072 | 40% |
| 512 | 3072 | 3072 | 40% |
| 256 | 3072 | 3072 | 35% |
| 128 | 3072 | 3072 | 35% |
| 64 | 3072 | 3072 | 40% |
| 32 | 3072 | 3072 | 40% |
| 16 | 3072 | 3072 | 40% |
| 1 | 3072 | 3072 | 40% |
| 256 | 256 | 256 | n/a |
| 256 | 256 | 1024 | n/a |
| 256 | 256 | 8192 | 25% |
| 128 | 128 | 32768 | 40% |
(Tip: You can find a machine-readable JSON report containing all benchmark
results in ./telerun-out/<job_id>/.)
When you write your implementation, you will probably find it necessary to have
access to a temporary buffer of GPU memory in which to store the partial
sums. To support this, the starter code will provide your
launch_matmul_improved_reduce function with a pointer to a pre-allocated
workspace buffer. The size of this workspace buffer is up to you: it will be allocated with however many bytes are returned by your implementation of the function get_workspace_size.
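As an illustration, here is one way get_workspace_size might be written, consistent with the partial-sum layout sketched earlier. The actual signature expected by the starter code may differ, and choose_num_splits is a hypothetical helper standing in for whatever heuristic you use to pick the number of k splits.

```cuda
#include <cstddef>

// Hypothetical helper: your heuristic for picking the number of k splits.
int choose_num_splits(int size_i, int size_j, int size_k);

// One possible sizing: num_splits partial (size_i x size_j) float matrices
// stored back to back in the workspace.
size_t get_workspace_size(int size_i, int size_j, int size_k) {
    int num_splits = choose_num_splits(size_i, size_j, size_k);
    return static_cast<size_t>(num_splits) *
           static_cast<size_t>(size_i) *
           static_cast<size_t>(size_j) * sizeof(float);
}
```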
Questions
Once you’ve finished your implementation, you can answer the final question of the lab:
Question 3 for final write-up: What fraction of theoretical peak throughput were you able to achieve for each problem size? How does the performance of your implementation for Part 2 compare to your original matmul_improved kernel from Part 1? What numerical parameters did you need to define for your implementation, and how did you go about setting them? What performance did you achieve on the specific problem sizes 256 * 256 * 256 and 256 * 256 * 1024, and what, if anything, do you think might make those problem sizes different from the others? Did you encounter any interesting bugs along the way?
Acknowledgments
Much of the content in Lab 4 and Lab 5 in this course was adapted from Simon Boehm’s article “How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance,” which the course staff found to be an exceptionally useful reference when learning about this material themselves.
[1] CUDA C++ includes a built-in high-level interface to access async copy functionality (Reference), but it can be quite unreliable. In particular, if you pass the high-level async copy API an unsupported size in bytes, it will silently degrade to use an ordinary synchronous copy instruction. In the starter code, we've exposed async copies to you using inline PTX in order to take more direct control.
[2] Our exact description of what the hardware is doing here is an educated guess on our part; to the best of our knowledge there is no documentation on precisely how NVIDIA GPUs decompose vectorized accesses into shared memory transactions.