Lab 10: H100 Matrix Multiplication

Prologue: Logistics

Due Dates

For this lab, you’ll be turning in the following deliverables:

Starter Code

You can get the starter code for this lab by cloning the lab repository:

git clone git@github.com:accelerated-computing-class/lab10.git

Goals for This Lab

In Lab 9, we explored the H100’s Tensor Memory Accelerator (TMA), learning how to move data from global to shared memory and building software-managed pipelines to hide latency. In this lab, we’ll turn our attention to the second major feature of the H100, the warp-group level tensor core instructions, or wgmma.

Tensor cores account for the vast majority of our machine’s computational capability. On the H100, the tensor cores provide approximately 1000 TFLOPs of half-precision matrix multiply throughput. To put this in perspective, the standard FMA units can run at a half-precision throughput of approximately 120 TFLOPs. This means that almost 90% of our machine’s half-precision throughput comes from tensor cores! To write high-performance kernels on the H100, we will need to understand how to use these tensor core instructions effectively.

Like in Lab 6, using tensor cores will require understanding not just how to invoke the instructions themselves, but also how to organize data in memory to match the expectations of these specialized units. Ultimately, our goal is to combine TMA (for data movement) and wgmma (for compute) to write a fast matrix multiplication kernel.

This lab has two main parts:

  1. Part 0 (Prelab) introduces the wgmma instruction. You’ll learn how to invoke this instruction and set up data for it, working with both swizzled and non-swizzled memory formats.

  2. Part 1 combines TMA and wgmma to implement a fast matrix multiplication kernel on the H100.

Part 0: The wgmma instruction

To use the wgmma instruction, we first need to understand how it differs from the mma instruction we used in Lab 6.

In fact, most of these differences are visible in the PTX instruction itself. Here is a concrete example of the wgmma instruction with size m64n256k16:

$$
\underbrace{\texttt{wgmma}}_{\text{warp group \texttt{mma}}}
\texttt{.}\underbrace{\texttt{mma\_async}}_{\text{asynchronous}}
\texttt{.}\underbrace{\texttt{sync}}_{\substack{\text{synchronize warp group} \\ \text{before issue}}}
\texttt{.}\underbrace{\texttt{aligned}}_{\substack{\text{all threads in warp group} \\ \text{execute the same \texttt{wgmma} instruction}}}
\texttt{.}\underbrace{\texttt{m64n256k16}}_{\text{bigger tile size}}
\texttt{.f32.bf16.bf16}
$$

$$
\underbrace{\texttt{d}}_{\substack{\text{output (in registers),} \\ \text{also serves as accumulator} \\ \text{(notice the missing } C \text{ param)}}}\texttt{,}\quad
\underbrace{\texttt{a-desc, b-desc}}_{\substack{\text{shared memory descriptors,} \\ \text{not registers}}}\texttt{,}\quad
\underbrace{\texttt{scale-d}}_{\substack{\text{scale for } D \\ \text{(0 or 1)}}}\texttt{,}
$$

$$
\underbrace{\texttt{imm-scale-a, imm-scale-b, imm-trans-a, imm-trans-b}}_{\text{configuration parameters for scaling and transposition}}
$$

Now that we understand how wgmma differs from mma, let’s try invoking it. We’ll start with non-swizzled data layouts before moving on to swizzled layouts.

Part 0: Invoking the wgmma instruction

The unswizzled wgmma instruction

Input Matrix Memory Layout

Since the A and B matrices reside in shared memory, their memory layout (row-major or column-major) becomes important. Unlike the register operands of mma, where each register maps to a fixed logical element regardless of how the data was originally laid out in memory, shared memory layouts must be explicitly specified through descriptors so the hardware knows how to interpret the matrix data.

While wgmma supports both row-major and column-major variants for the input matrices, we’ll use a single convention throughout this lab for matrices in both global and shared memory: A is stored in row-major order, while B is stored in column-major order. In both cases the K dimension is the contiguous one, so we will refer to A and B as being K-major. In all cases, we also make the output matrix M-major (i.e. stored as NxM) for consistency with the cuBLAS API.
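To make this convention concrete, here is a minimal sketch of how (m, k), (k, n), and (m, n) indices map to linear offsets in global memory under this layout. The helper names are purely illustrative and are not part of the starter code.

    // Illustrative index helpers for the K-major / M-major conventions above.
    // A is M x K, row-major (K contiguous); B is K x N, column-major (K contiguous);
    // C is M-major, i.e. stored as N x M with M contiguous.
    #include <cstddef>

    __host__ __device__ inline size_t idx_a(size_t m, size_t k, size_t K) { return m * K + k; }
    __host__ __device__ inline size_t idx_b(size_t k, size_t n, size_t K) { return n * K + k; }
    __host__ __device__ inline size_t idx_c(size_t m, size_t n, size_t M) { return n * M + m; }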

Core Matrices

The wgmma instruction operates on data made up of “core matrices” (as they are referred to in the PTX documentation). A core matrix is the fundamental unit of data that must be contiguous, i.e. tightly packed together in memory. For the non-swizzled case, each core matrix is 8 × 16 bytes in size. Importantly, the values within each core matrix must be contiguous and K-major, which means you can’t simply copy the whole matrix from global memory into shared memory as-is: a direct copy would leave each core matrix’s values scattered rather than contiguous.

While each core matrix must itself be contiguous (8 × 16 bytes with no gaps), there can be gaps between different core matrices. The spacing between consecutive core matrices in the K dimension is given by the leading byte offset, while the spacing between consecutive core matrices in the M or N dimension is given by the stride byte offset.

In practice, we tell wgmma about our shared memory using shared memory descriptors. The descriptor is set up in a special format. We’ve provided a wrapper function make_smem_desc that encodes descriptors in this format. This function takes three parameters in this order:

  1. The address of the shared memory buffer.

  2. The leading byte offset.

  3. The stride byte offset.

Now, let’s examine what these core matrices look like in an example, and how wgmma expects these matrices to be laid out in shared memory. We’ll start with the smallest wgmma variant, where M=64, K=16, and N=8 and the matrices are K-major.

In this case, let’s calculate the leading byte offset and the stride byte offset for the M × K matrix. Moving in the K direction, consecutive core matrices are adjacent, so the spacing is one core matrix, or 128 bytes; hence, the leading byte offset is 128 bytes. Moving in the M/N direction, consecutive core matrices are separated by the two core matrices that span the K dimension, so the spacing is 2 × 128 = 256 bytes; hence, the stride byte offset is 256 bytes. The important thing to take away is that the leading byte offset describes movement in the K dimension, while the stride byte offset describes movement in the M/N dimension.
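As a hedged sketch, the descriptors for this unswizzled M = 64, K = 16 case might be built with the provided make_smem_desc wrapper roughly as follows, using the offsets we just computed. The variable names are illustrative, and the exact signature and return type should be taken from the starter code.

    // Sketch only: build shared memory descriptors for the unswizzled m64n8k16 case.
    // Assumes the starter code's make_smem_desc is available (e.g. via its header).
    #include <cuda_bf16.h>

    __global__ void m64n8k16_example() {
        // A: 64 x 16 bf16, B: 8 x 16 bf16, both already repacked so that each
        // 8 x 16-byte core matrix is contiguous in shared memory.
        __shared__ __nv_bfloat16 smem_a[64 * 16];
        __shared__ __nv_bfloat16 smem_b[8 * 16];

        // ... fill smem_a / smem_b in core-matrix order ...

        // Leading byte offset: 128 bytes between consecutive core matrices in K.
        // Stride byte offset: 256 bytes between consecutive core matrices in M/N.
        auto desc_a = make_smem_desc(smem_a, 128, 256);
        auto desc_b = make_smem_desc(smem_b, 128, 256);
        // desc_a / desc_b are then passed to the wgmma call.
        (void)desc_a; (void)desc_b;
    }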

Note: Like TMA, operations on wgmma happen in the async proxy, and you must use an async proxy fence to ensure that the wgmma engine sees the most up-to-date values.

Output Matrix Register Layout

The output matrix D, on the other hand, resides in registers. Like before, the values are scattered across registers per thread. In our example, each thread holds (64 × 8) / 128 = 4 values. These values are distributed across threads as follows (figure from PTX documentation):

Mathematically, lane x in warp y handles the values with M = 16y + {0, 8} + floor(x/4) and the N values satisfying floor(N/2) mod 4 = x mod 4. Intuitively, this means that each warp handles a 16-row slice of M. Within each warp, each set of four threads handles two values of M, with each thread holding two consecutive output values per row, so the four threads together cover the 8 columns as a strided 32-byte span of f32 accumulators.
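As a sketch of that mapping (worth checking against the PTX documentation’s figure before relying on it), the four accumulator values of each thread could be written back to the M-major output as follows; the function name and register ordering are assumptions consistent with the formula above.

    // Sketch: write this thread's four m64n8k16 accumulator values to the output
    // matrix C (M-major storage, so element (m, n) lives at C[n * M + m]).
    __device__ void store_m64n8_accum(const float d[4], float* C, int M) {
        int lane = threadIdx.x % 32;        // lane within the warp
        int warp = (threadIdx.x / 32) % 4;  // warp within the warp group
        for (int r = 0; r < 4; r++) {
            int m = 16 * warp + 8 * (r / 2) + lane / 4;  // r/2 selects the +0 or +8 row
            int n = 2 * (lane % 4) + (r % 2);            // two consecutive columns per thread
            C[n * M + m] = d[r];
        }
    }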

Synchronizing Asynchronous wgmma Operations

As we mentioned earlier, wgmma is asynchronous. The last piece to put in place is the ability to synchronize wgmma operations. wgmma uses a commit group style mechanism, but there are two significant differences from the commit group mechanisms we’ve used before.

First, there is ordering within a group: wgmma operations committed in the same group that write to the same output registers are guaranteed to execute in the order they were issued. You can therefore wait on a whole sequence of operations by issuing them into the same group and then, as before, committing and waiting on pending groups.

Second, the commit operations happen at warp group granularity. For both cp.async and cp.async_bulk, the operations were issued per thread, and as such the commit group was committed per thread. With wgmma, the warp group issues the commit group instruction together. These instructions must be aligned across all threads in the warp group. Similarly, the wait for pending wgmma operations must also happen at warp group granularity.

We provide the following helper functions to synchronize wgmma operations:

  1. warpgroup_arrive() performs a memory fence across the warp group and ensures that all registers and shared memory written by the warp group are visible before the wgmma operations that follow.

  2. wgmma_commit() commits the current group of wgmma operations.

  3. wgmma_wait<N>() waits until all but N pending groups of wgmma operations have completed.

Putting It All Together

Using these APIs, you now have the tools to implement a simple matrix multiplication kernel using the wgmma interface:
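To sketch the overall flow (hedged: wgmma_m64n8k16 is a placeholder name for whatever wrapper around the PTX instruction you use, and desc_a/desc_b are the descriptors built earlier; only the synchronization helpers come from the list above):

    // Rough call sequence for a single unswizzled wgmma, inside your kernel after
    // the A and B tiles have been repacked into shared memory.
    float d[4] = {0.f, 0.f, 0.f, 0.f};

    // (If the tiles were written with ordinary stores, an async proxy fence as in
    // Lab 9 may also be needed; see the note above.)

    // 1. Make the warp group's shared memory writes visible before issuing wgmma.
    warpgroup_arrive();

    // 2. Issue the (asynchronous) matrix multiply; scale-d = 1 accumulates into d.
    wgmma_m64n8k16(d, desc_a, desc_b);   // placeholder wrapper around the PTX instruction

    // 3. Commit the pending wgmma operations into a group, then wait until no
    //    groups remain outstanding before reading d.
    wgmma_commit();
    wgmma_wait<0>();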

Deliverable: Call a single M=64, N=8, K=16 wgmma in 0-m64n8k16-wgmma.cu. This should involve only making a single call, but you will need to reorganize the data in shared memory to meet the core matrix layout.

Prelab Question 1: How did you lay out your shared memory to be compatible with the wgmma interface (what were the corresponding leading byte offset and stride byte offset)? How did you manage synchronization?

Invoking the 64-byte swizzled wgmma instruction

In the previous section, we worked with the unswizzled wgmma instruction, which organizes data at an 8 × 16 byte granularity. The swizzled variants instead operate on larger core matrices, depending on the swizzle mode: 8 × 32 bytes for the 32-byte swizzle, 8 × 64 bytes for the 64-byte swizzle, and 8 × 128 bytes for the 128-byte swizzle.

In this section, we’ll invoke the 64-byte swizzled variant. Within each 64-byte swizzled core matrix, elements are packed into 128-bit chunks, which are then swizzled. As an example, we show the swizzling layout below (figure from PTX documentation). Please note that each number corresponds to a 128-bit chunk (or correspondingly, 8 packed bf16 elements).

Outside of the size of the core matrix, the main difference between the swizzled case and the unswizzled case is how we set up the shared memory descriptor. For swizzled descriptors, the leading byte offset no longer matters and is ignored. The stride byte offset remains conceptually the same, but now represents the offset needed to move from one 8 × 64 byte core matrix to the next in the M/N dimension.

Because the 64-byte swizzle mode requires at least 64 bytes in the K dimension, we will now use K = 32 (32 elements × 2 bytes per bf16 = 64 bytes). With this choice, you will not need to repack the data as we did in the unswizzled case—it’s already in the correct form, since we don’t need multiple core matrices to fill the entire K dimension! However, you will need to swizzle each core matrix. You are free to use a TMA load if you wish, or you can use your own swizzle function from Lab 9.

Also, since K = 32, you will now require two wgmma operations: the first computes K = 0 to 15, and the second computes K = 16 to 31. To make these two calls, you will need to invoke the first wgmma at offset 0 and the second at an offset of 16 elements (32 bytes). You do not need to compute these offsets in swizzled coordinates; because of the way the WGMMA unit is built, they work directly as byte offsets. You can synchronize these two operations as you wish, either by waiting on each individually, or by committing them in the same group and waiting on them together.
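A hedged sketch of this two-call pattern is below. Here sbo stands for whatever stride byte offset you determined for the 64-byte-swizzled layout, wgmma_m64n8k16_sw64 is a placeholder name for the swizzled wrapper, and passing 0 for the leading byte offset reflects that it is ignored in swizzled mode.

    // Sketch: two 64-byte-swizzled wgmma calls covering K = 0..15 and K = 16..31.
    // sbo = the stride byte offset you determined for the swizzled layout.
    char* a_base = reinterpret_cast<char*>(smem_a);
    char* b_base = reinterpret_cast<char*>(smem_b);

    warpgroup_arrive();

    // First half: K = 0..15 (byte offset 0 into the swizzled tiles).
    wgmma_m64n8k16_sw64(d, make_smem_desc(a_base, 0, sbo),
                           make_smem_desc(b_base, 0, sbo));
    // Second half: K = 16..31 (byte offset 16 elements = 32 bytes).
    wgmma_m64n8k16_sw64(d, make_smem_desc(a_base + 32, 0, sbo),
                           make_smem_desc(b_base + 32, 0, sbo));

    // Both operations accumulate into d; commit them as one group and wait.
    wgmma_commit();
    wgmma_wait<0>();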

Deliverable: Call two M=64, N=8, K=16 64-byte-swizzled wgmma operations in 1-swizzle-m64n8k32-wgmma.cu to compute a full M=64, N=8, K=32 matrix multiplication. You will need to swizzle the input data appropriately before invoking the wgmma instructions.

Prelab Question 2: How did you lay out your shared memory to be compatible with the swizzled wgmma interface (what were the corresponding leading byte offset and stride byte offset)? How did you manage synchronization?

Part 1: Matrix Multiplication

Now that you have experience with both TMA and wgmma, it’s time to write an H100 matrix multiplication kernel!

In this part, you will implement a matrix multiplication kernel for matrices of size M = 8192, N = 8192, K = 8192. The goal is to achieve performance close to cuBLAS peak performance, which is around 798 TFLOPs.

Deliverable: Implement a high-performance matrix multiplication kernel in h100-matmul.cu that computes C = A × B for M = 8192, N = 8192, K = 8192. As before, matrices A and B are both K-major, and matrix C is M-major. Your implementation will likely need to leverage both TMA for efficient data movement and wgmma for computation. For performance cutoffs, your code grade for this part will be calculated as 100 + (%speedup - 95) * 1.33. This means 95% speedup will count as full credit, 80% speedup will earn a grade of 80, and any percentage above 95% speedup counts toward extra credit.

Question 1 for final write-up: What percentage of peak cuBLAS performance were you able to achieve? How did you manage the TMA and wgmma interfaces along with your shared memory and registers? What was your model for synchronization? What were the key tradeoffs that you encountered in designing your kernel? Any interesting bugs to report?

Overall, this part is open-ended, and you should use your experience from Labs 4, 5, and 6 to guide your exploration. The rest of this section contains some suggestions that we think will be useful to consider:

Suggestion 1: Calling bigger wgmma operations

In Part 0, we worked with the smallest wgmma operation size, where N = 8. However, for a high-performance matrix multiplication kernel, you’ll likely want to use larger tile sizes to improve data reuse and reduce the overhead of instruction issue. The wgmma instruction supports N values ranging from 8 to 256 in steps of 8 (i.e., 8, 16, 24, 32, …, 256), all with M = 64 and K = 16 for bf16 data.

Using larger N dimensions allows each warp group to compute more output values per instruction, which can significantly improve performance by:

  1. Increasing data reuse: As N increases, the B matrix tile (16 × N elements) grows larger, while the A matrix tile (64 × 16 elements) remains the same size. Each element loaded from A is reused across all N output rows, and by loading a larger B tile (larger N), we amortize the cost of loading from shared memory over more output values.

  2. Reducing instruction overhead: Fewer wgmma instructions are needed to cover the same output matrix, reducing the cycles spent on instruction issue.

The shared memory layout for A and B follows the same pattern regardless of the N dimension you choose. The setup we used for N = 8 generalizes directly: A remains in row-major order and B in column-major order, both organized as K-major tiles with the same core matrix structure (8 × 16 bytes for unswizzled, or 8 × 32/64/128 bytes for swizzled variants).

The key challenge when working with larger N values is understanding how the output matrix D is distributed across registers. As N increases, each thread in the warp group must hold more output values. In general, each thread holds (64 × N) / 128 = N/2 accumulator values: N = 8 gives 4 values per thread, N = 64 gives 32, and N = 256 gives 128.

The pattern of how these values map to thread indices becomes more complex as N grows. Understanding this register layout is essential for correctly writing the results back to global memory.
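One plausible generalization of the m64n8 pattern, in which each additional group of four registers covers the next slab of 8 output columns, is sketched below. This is an assumption about the layout, not a statement from the PTX documentation, so verify it against the documentation’s layout figures (or optional-wgmma-setup.cu) before relying on it.

    // Sketch: map accumulator register r of this thread to (m, n) for an
    // m64nNk16 wgmma, assuming the n8 pattern repeats every 8 columns.
    // Each thread holds N/2 registers in total; C is M-major with M rows.
    __device__ void store_m64nN_accum(const float* d, int N, float* C, int M) {
        int lane = threadIdx.x % 32;
        int warp = (threadIdx.x / 32) % 4;
        for (int r = 0; r < N / 2; r++) {
            int n_block = r / 4;   // which 8-column slab this register belongs to
            int within  = r % 4;   // position within the slab (same as the n8 case)
            int m = 16 * warp + 8 * (within / 2) + lane / 4;
            int n = 8 * n_block + 2 * (lane % 4) + (within % 2);
            C[n * M + m] = d[r];
        }
    }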

In the starter code, we provide implementations for a few representative wgmma sizes as examples. However, we cannot provide wrappers for all possible granularities. If you want to experiment with a different N value, you’ll need to:

  1. Define the inline PTX assembly for the appropriate instruction variant (following the pattern in wgmma-interface.cuh).

  2. Allocate the correct number of registers for the output matrix based on your chosen N.

  3. Understand the register distribution pattern to correctly map output values back to their positions in the result matrix.

You can use the optional file optional-wgmma-setup.cu as a testing ground to experiment with different wgmma configurations before integrating them into your full matrix multiplication kernel. The PTX ISA documentation provides the complete specification for all supported wgmma variants.

Suggestion 2: Using TMA and WGMMA to get overlap

Recalling the following diagram from Lecture 8, you may find it beneficial to overlap TMAs with two sets of consumers so as to keep the tensor cores busy.

As discussed in Lab 5, there are a number of ways to implement this, from overlapping multiple blocks on the same SM to warp specialization. You may find it valuable to review the relevant sections in Lab 5, and in the subsequent section, we detail one of the possible strategies, warp specialization.

Warp Specialized Kernel

One approach to overlapping TMA and WGMMA operations is to manually assign warps to tasks using warp specialization. As in the figure above, you may want to adopt a producer-consumer strategy in this warp specialized kernel, where one warpgroup serves as the producer and issues all TMA load operations. You can then choose an appropriate number of consumer warpgroups that use the TMA load outputs to perform wgmma operations, as sketched below.
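The following is a structural sketch only, not a working kernel: the pipeline depth, tile counts, and barrier mechanics are assumptions, and the barrier details are assumed to follow the mbarrier-based pipelines from Lab 9.

    // Structural sketch: one producer warp group fills a ring of shared-memory
    // stages with TMA loads; the remaining warp groups consume them with wgmma.
    constexpr int NUM_STAGES  = 4;          // illustrative pipeline depth
    constexpr int NUM_K_TILES = 8192 / 64;  // illustrative K tiling

    __global__ void warp_specialized_matmul(/* ... */) {
        int warpgroup_id = threadIdx.x / 128;

        if (warpgroup_id == 0) {
            // Producer: for each K tile, wait until a stage is free, issue the TMA
            // loads for the A and B tiles into it, and arrive on the stage's "full"
            // barrier with the expected transaction byte count.
            for (int kt = 0; kt < NUM_K_TILES; kt++) {
                int stage = kt % NUM_STAGES;
                // wait(empty[stage]); tma_load_a(stage, kt); tma_load_b(stage, kt);
                (void)stage;
            }
        } else {
            // Consumers: for each K tile, wait for its stage to fill, run the wgmma
            // sequence over it, then mark the stage empty again.
            for (int kt = 0; kt < NUM_K_TILES; kt++) {
                int stage = kt % NUM_STAGES;
                // wait(full[stage]); warpgroup_arrive(); wgmma(...); wgmma_commit();
                // wgmma_wait<0>(); arrive(empty[stage]);
                (void)stage;
            }
            // Finally, write the accumulators out to C.
        }
    }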

The key advantage of this warp specialized kernel is that it allows you to explicitly overlap memory and compute operations, maximizing the utilization of the relevant hardware on our H100 machine.

Suggestion 3: Using TMA with Swizzled Layouts

In Part 0, you had to prepare data for wgmma by manually repacking matrices into the specific core matrix layout expected by the instruction. For the unswizzled case, this meant reorganizing data at an 8 × 16 byte granularity, and for the swizzled variants you may have needed to swizzle data at 8 × 32, 8 × 64, or 8 × 128 byte granularities depending on the swizzle mode.

Using swizzled layouts in the TMA provides two key benefits:

  1. Avoiding shared memory bank conflicts when accessing data for wgmma operations.
  2. Enabling larger TMA loads that can write data directly into the format wgmma expects, eliminating the need for manual repacking.

By combining TMA with the appropriate swizzle mode (32B, 64B, or 128B), you can load tiles directly from global memory into shared memory in the swizzled layout that wgmma can work with. This eliminates the extra work of reorganizing data in shared memory and lets you issue coarser-grained TMA transfers that better saturate memory bandwidth.
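As a hedged host-side sketch, one way to encode such a descriptor for a K-major bf16 matrix with a 64-byte swizzle is shown below, using the CUDA driver API. The tile shape, dimension ordering, and other parameters are assumptions for illustration; match them to however you set up TMA descriptors in Lab 9, and add error checking in real code.

    // Sketch: host-side creation of a TMA descriptor with a 64-byte swizzle for a
    // K-major bf16 matrix (K contiguous). Tile sizes and dimension ordering are
    // illustrative and must match how your kernel indexes the tensor map.
    #include <cuda.h>
    #include <cuda_bf16.h>

    CUtensorMap make_swizzled_tensor_map(void* global_ptr, uint64_t K, uint64_t MN) {
        CUtensorMap tensor_map{};
        // Innermost dimension is K (contiguous); outer dimension is M or N.
        uint64_t global_dim[2]     = {K, MN};
        uint64_t global_strides[1] = {K * sizeof(__nv_bfloat16)};  // row stride in bytes
        // Box (tile) shape: 32 bf16 = 64 bytes in K, matching the 64B swizzle width.
        uint32_t box_dim[2]      = {32, 64};
        uint32_t elem_strides[2] = {1, 1};
        cuTensorMapEncodeTiled(
            &tensor_map,
            CU_TENSOR_MAP_DATA_TYPE_BFLOAT16,
            /*tensorRank=*/2,
            global_ptr,
            global_dim,
            global_strides,
            box_dim,
            elem_strides,
            CU_TENSOR_MAP_INTERLEAVE_NONE,
            CU_TENSOR_MAP_SWIZZLE_64B,
            CU_TENSOR_MAP_L2_PROMOTION_NONE,
            CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
        return tensor_map;
    }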

Suggestion 4: Dynamic Register Allocation

If you adopt a warp-specialized design, you may notice an asymmetry in register usage between producer and consumer warps. Producer warps primarily issue TMA loads and use relatively few registers, while consumer warps execute wgmma operations that require many registers to store accumulator values. This imbalance can limit the size of wgmma operations you can perform, since the compiler allocates the same number of registers per thread across all warps in a block.

On the H100, NVIDIA introduced the ability to dynamically reallocate registers at a warp group granularity using the setmaxnreg PTX instruction. This allows producer warp groups to release registers they don’t need, making them available for consumer warp groups to use for larger wgmma tiles.

The setmaxnreg instruction operates at warp group granularity, meaning all threads in a warp group must execute the same setmaxnreg instruction with the same immediate value. The instruction requires a compile-time constant for the register count.

It’s important to note that setmaxnreg is a hint, not a directive. In our experience, whether the compiler can successfully apply your hint depends on the structure of your code. If it fails to apply the hint, the compiler does helpfully print out a warning message. Additionally, for setmaxnreg to work, the kernel must be launched with a valid maximum number of per-thread registers, which you can set with the __launch_bounds__ attribute.
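To illustrate, here is a hedged sketch of what setmaxnreg wrappers and their use might look like. The starter code provides its own helpers, so these names and the specific register counts are purely hypothetical; the underlying PTX requires a compile-time constant register count that is a multiple of 8 in the range 24 to 256.

    // Sketch: hypothetical wrappers around setmaxnreg (the starter code provides
    // its own helpers; these just illustrate the underlying PTX instruction).
    template <int RegCount>
    __device__ void warpgroup_reg_dealloc() {
        asm volatile("setmaxnreg.dec.sync.aligned.u32 %0;\n" :: "n"(RegCount));
    }

    template <int RegCount>
    __device__ void warpgroup_reg_alloc() {
        asm volatile("setmaxnreg.inc.sync.aligned.u32 %0;\n" :: "n"(RegCount));
    }

    // Example usage in a warp-specialized kernel: __launch_bounds__ gives the
    // compiler a valid per-thread register cap, and each warp group then adjusts
    // its own allocation. The specific counts here are only illustrative.
    __global__ void __launch_bounds__(384) example_kernel() {
        int warpgroup_id = threadIdx.x / 128;
        if (warpgroup_id == 0) {
            warpgroup_reg_dealloc<40>();   // producer: give back registers
        } else {
            warpgroup_reg_alloc<232>();    // consumers: claim more for accumulators
        }
    }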

For completeness, note that after executing a setmaxnreg instruction, all warps in the warp group must synchronize explicitly before executing any subsequent setmaxnreg instructions. If threads in a warp group execute different setmaxnreg instructions or fail to synchronize properly, the behavior is undefined. While it’s unlikely you’ll need to issue multiple setmaxnreg instructions back-to-back, keeping this requirement in mind can save you some debugging time if you do.

In the starter code, we provide two helper functions that wrap setmaxnreg instructions:

Finally, you can pass the -X-ptxas-options=-v option to get more information about register spills: uvx telerun submit -X-ptxas-options=-v your_code.cu.