Lab 10: H100 Matrix Multiplication

Prologue: Logistics

Due Dates

For this lab, you’ll be turning in the following deliverables:

Starter Code

You can get the starter code for this lab by cloning the lab repository:

git clone git@github.com:accelerated-computing-class/lab10.git

Goals for This Lab

In Lab 9, we explored the H100’s Tensor Memory Accelerator (TMA), learning how to move data from global to shared memory and building software-managed pipelines to hide latency. In this lab, we’ll turn our attention to the second major feature of the H100, the warp-group level tensor core instructions, or wgmma.

Tensor cores account for the vast majority of our machine’s computational capability. On the H100, the tensor cores provide approximately 1000 TFLOPs of half-precision matrix multiply throughput. To put this in perspective, the standard FMA units can run at a half-precision throughput of approximately 120 TFLOPs. This means that almost 90% of our machine’s half-precision throughput comes from tensor cores! To write high-performance kernels on the H100, we will need to understand how to use these tensor core instructions effectively.

Like in Lab 6, using tensor cores will require understanding not just how to invoke the instructions themselves, but also how to organize data in memory to match the expectations of these specialized units. Ultimately, our goal is to combine TMA (for data movement) and wgmma (for compute) to write a fast matrix multiplication kernel.

This lab has two main parts:

  1. Part 0 (Prelab) introduces the wgmma instruction. You’ll learn how to invoke this instruction and set up data for it, working with both swizzled and non-swizzled memory formats.

  2. Part 1 combines TMA and wgmma to implement a fast matrix multiplication kernel on the H100.

Part 0: The wgmma instruction

To use the wgmma instruction, we first need to understand how it differs from the mma instruction we used in Lab 6.

In fact, most of these differences are visible in the PTX instruction itself. Here is a concrete example of the wgmma instruction with size m64n256k16:

$$
\underbrace{\texttt{wgmma}}_{\text{warp group \texttt{mma}}}
\texttt{.}\underbrace{\texttt{mma\_async}}_{\text{asynchronous}}
\texttt{.}\underbrace{\texttt{sync}}_{\substack{\text{synchronize warp group} \\ \text{before issue}}}
\texttt{.}\underbrace{\texttt{aligned}}_{\substack{\text{all threads in warp group} \\ \text{execute the same \texttt{wgmma} instruction}}}
\texttt{.}\underbrace{\texttt{m64n256k16}}_{\text{bigger tile size}}
\texttt{.f32.bf16.bf16}
$$

$$
\underbrace{\texttt{d}}_{\substack{\text{output (in registers),} \\ \text{also serves as accumulator} \\ \text{(notice the missing } C \text{ param)}}}\texttt{,}\quad
\underbrace{\texttt{a-desc, b-desc}}_{\substack{\text{shared memory descriptors,} \\ \text{not registers}}}\texttt{,}\quad
\underbrace{\texttt{scale-d}}_{\substack{\text{scale for } D \\ \text{(0 or 1)}}}\texttt{,}
$$

$$
\underbrace{\texttt{imm-scale-a, imm-scale-b, imm-trans-a, imm-trans-b}}_{\text{configuration parameters for scaling and transposition}}
$$

Now that we understand how wgmma differs from mma, let’s try invoking it. We’ll start with non-swizzled data layouts before moving on to swizzled layouts.

Part 0: Invoking the wgmma instruction

The unswizzled wgmma instruction

Input Matrix Memory Layout

Since the A and B matrices reside in shared memory, their memory layout (row-major or column-major) becomes important. Unlike the register operands of mma, where each register maps to a fixed logical element regardless of how the data was originally laid out in memory, shared memory layouts must be explicitly specified through descriptors so the hardware knows how to interpret the matrix data.

While wgmma supports both row-major and column-major variants for the input matrices, we’ll use a single convention throughout this lab for matrices in both global and shared memory: A is stored in row-major order, while B is stored in column-major order. In both cases the K dimension is the contiguous one, so we will refer to A and B as being K-major. In all cases, we also make the output matrix M-major (i.e. stored as NxM) for consistency with the cuBLAS API.
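To make this convention concrete, here is a minimal sketch of how (m, k), (k, n), and (m, n) indices map to linear offsets in global memory under this layout. The helper names are purely illustrative and are not part of the starter code.

    // Illustrative index helpers for the K-major / M-major conventions above.
    // A is M x K, row-major (K contiguous); B is K x N, column-major (K contiguous);
    // C is M-major, i.e. stored as N x M with M contiguous.
    #include <cstddef>

    __host__ __device__ inline size_t idx_a(size_t m, size_t k, size_t K) { return m * K + k; }
    __host__ __device__ inline size_t idx_b(size_t k, size_t n, size_t K) { return n * K + k; }
    __host__ __device__ inline size_t idx_c(size_t m, size_t n, size_t M) { return n * M + m; }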

Core Matrices

The wgmma instruction operates on data made up of “core matrices” (as they are referred to in the PTX documentation). A core matrix is the fundamental unit of data that must be contiguous, i.e. tightly packed together in memory. For the non-swizzled case, each core matrix is 8 × 16 bytes in size. Importantly, the values within each core matrix must be contiguous and K-major, which means you can’t simply copy the whole matrix from global memory into shared memory as-is: a direct copy would leave each core matrix’s values scattered rather than contiguous.

While each core matrix must itself be contiguous (8 × 16 bytes with no gaps), there can be gaps between different core matrices. The spacing between consecutive core matrices in the K dimension is given by the leading byte offset, while the spacing between consecutive core matrices in the M or N dimension is given by the stride byte offset.

In practice, we tell wgmma about our shared memory using shared memory descriptors. The descriptor is set up in a special format. We’ve provided a wrapper function make_smem_desc that encodes descriptors in this format. This function takes three parameters in this order:

  1. The address of the shared memory buffer.

  2. The leading byte offset.

  3. The stride byte offset.

Now, let’s examine what these core matrices look like in an example, and how wgmma expects these matrices to be laid out in shared memory. We’ll start with the smallest wgmma variant, where M=64, K=16, and N=8 and the matrices are K-major.

In this case, let’s calculate the leading byte offset and the stride byte offset for the M × K matrix. Moving in the K direction, consecutive core matrices are adjacent, so the spacing is one core matrix, or 128 bytes; hence, the leading byte offset is 128 bytes. Moving in the M/N direction, consecutive core matrices are separated by the two core matrices that span the K dimension, so the spacing is 2 × 128 = 256 bytes; hence, the stride byte offset is 256 bytes. The important thing to take away is that the leading byte offset describes movement in the K dimension, while the stride byte offset describes movement in the M/N dimension.
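As a hedged sketch, the descriptors for this unswizzled M = 64, K = 16 case might be built with the provided make_smem_desc wrapper roughly as follows, using the offsets we just computed. The variable names are illustrative, and the exact signature and return type should be taken from the starter code.

    // Sketch only: build shared memory descriptors for the unswizzled m64n8k16 case.
    // Assumes the starter code's make_smem_desc is available (e.g. via its header).
    #include <cuda_bf16.h>

    __global__ void m64n8k16_example() {
        // A: 64 x 16 bf16, B: 8 x 16 bf16, both already repacked so that each
        // 8 x 16-byte core matrix is contiguous in shared memory.
        __shared__ __nv_bfloat16 smem_a[64 * 16];
        __shared__ __nv_bfloat16 smem_b[8 * 16];

        // ... fill smem_a / smem_b in core-matrix order ...

        // Leading byte offset: 128 bytes between consecutive core matrices in K.
        // Stride byte offset: 256 bytes between consecutive core matrices in M/N.
        auto desc_a = make_smem_desc(smem_a, 128, 256);
        auto desc_b = make_smem_desc(smem_b, 128, 256);
        // desc_a / desc_b are then passed to the wgmma call.
        (void)desc_a; (void)desc_b;
    }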

Note: Like TMA, operations on wgmma happen in the async proxy, and you must use an async proxy fence to ensure that the wgmma engine sees the most up-to-date values.

Output Matrix Register Layout

The output matrix D, on the other hand, resides in registers. Like before, the values are scattered across registers per thread. In our example, each thread holds (64 × 8) / 128 = 4 values. These values are distributed across threads as follows (figure from PTX documentation):

Mathematically, lane x in warp y handles the values with M = 16y + {0, 8} + floor(x/4) and the N values satisfying floor(N/2) mod 4 = x mod 4. Intuitively, this means that each warp handles a 16-row slice of M. Within each warp, each set of four threads handles two values of M, with each thread holding two consecutive output values per row, so the four threads together cover the 8 columns as a strided 32-byte span of f32 accumulators.
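As a sketch of that mapping (worth checking against the PTX documentation’s figure before relying on it), the four accumulator values of each thread could be written back to the M-major output as follows; the function name and register ordering are assumptions consistent with the formula above.

    // Sketch: write this thread's four m64n8k16 accumulator values to the output
    // matrix C (M-major storage, so element (m, n) lives at C[n * M + m]).
    __device__ void store_m64n8_accum(const float d[4], float* C, int M) {
        int lane = threadIdx.x % 32;        // lane within the warp
        int warp = (threadIdx.x / 32) % 4;  // warp within the warp group
        for (int r = 0; r < 4; r++) {
            int m = 16 * warp + 8 * (r / 2) + lane / 4;  // r/2 selects the +0 or +8 row
            int n = 2 * (lane % 4) + (r % 2);            // two consecutive columns per thread
            C[n * M + m] = d[r];
        }
    }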

Synchronizing Asynchronous wgmma Operations

As we mentioned earlier, wgmma is asynchronous. The last piece to put in place is the ability to synchronize wgmma operations. wgmma uses a commit group style mechanism, but there are two significant differences from the commit group mechanisms we’ve used before.

First, there is ordering within a group: wgmma operations committed in the same group that write to the same output registers are guaranteed to execute in the order they were issued. You can therefore wait on a whole sequence of operations by issuing them into the same group and then, as before, committing and waiting on pending groups.

Second, the commit operations happen at warp group granularity. For both cp.async and cp.async_bulk, the operations were issued per thread, and as such the commit group was committed per thread. With wgmma, the warp group issues the commit group instruction together. These instructions must be aligned across all threads in the warp group. Similarly, the wait for pending wgmma operations must also happen at warp group granularity.

We provide the following helper functions to synchronize wgmma operations:

  1. warpgroup_arrive() performs a memory fence across the warp group and ensures that all registers and shared memory written by the warp group are visible before the wgmma operations that follow.

  2. wgmma_commit() commits the current group of wgmma operations.

  3. wgmma_wait<N>() waits until all but N pending groups of wgmma operations have completed.

Putting It All Together

Using these APIs, you now have the tools to implement a simple matrix multiplication kernel using the wgmma interface:
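To sketch the overall flow (hedged: wgmma_m64n8k16 is a placeholder name for whatever wrapper around the PTX instruction you use, and desc_a/desc_b are the descriptors built earlier; only the synchronization helpers come from the list above):

    // Rough call sequence for a single unswizzled wgmma, inside your kernel after
    // the A and B tiles have been repacked into shared memory.
    float d[4] = {0.f, 0.f, 0.f, 0.f};

    // (If the tiles were written with ordinary stores, an async proxy fence as in
    // Lab 9 may also be needed; see the note above.)

    // 1. Make the warp group's shared memory writes visible before issuing wgmma.
    warpgroup_arrive();

    // 2. Issue the (asynchronous) matrix multiply; scale-d = 1 accumulates into d.
    wgmma_m64n8k16(d, desc_a, desc_b);   // placeholder wrapper around the PTX instruction

    // 3. Commit the pending wgmma operations into a group, then wait until no
    //    groups remain outstanding before reading d.
    wgmma_commit();
    wgmma_wait<0>();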

Deliverable: Call a single M=64, N=8, K=16 wgmma in 0-m64n8k16-wgmma.cu. This should involve only making a single call, but you will need to reorganize the data in shared memory to meet the core matrix layout.

Prelab Question 1: How did you lay out your shared memory to be compatible with the wgmma interface (what were the corresponding leading byte offset and stride byte offset)? How did you manage synchronization?

Invoking the 64-byte swizzled wgmma instruction

In the previous section, we worked with the unswizzled wgmma instruction, which organizes data at an 8 × 16 byte granularity. The swizzled variants instead operate on larger core matrices, depending on the swizzle mode: 8 × 32 bytes for the 32-byte swizzle, 8 × 64 bytes for the 64-byte swizzle, and 8 × 128 bytes for the 128-byte swizzle.

In this section, we’ll invoke the 64-byte swizzled variant. Within each 64-byte swizzled core matrix, elements are packed into 128-bit chunks, which are then swizzled. As an example, we show the swizzling layout below (figure from PTX documentation). Please note that each number corresponds to a 128-bit chunk (or correspondingly, 8 packed bf16 elements).

Outside of the size of the core matrix, the main difference between the swizzled case and the unswizzled case is how we set up the shared memory descriptor. For swizzled descriptors, the leading byte offset no longer matters and is ignored. The stride byte offset remains conceptually the same, but now represents the offset needed to move from one 8 × 64 byte core matrix to the next in the M/N dimension.

Because the 64-byte swizzle mode requires at least 64 bytes in the K dimension, we will now use K = 32 (32 elements × 2 bytes per bf16 = 64 bytes). With this choice, you will not need to repack the data as we did in the unswizzled case—it’s already in the correct form, since we don’t need multiple core matrices to fill the entire K dimension! However, you will need to swizzle each core matrix. You are free to use a TMA load if you wish, or you can use your own swizzle function from Lab 9.

Also, since K = 32, you will now require two wgmma operations: the first computes K = 0 to 15, and the second computes K = 16 to 31. To make these two calls, you will need to invoke the first wgmma at offset 0 and the second at an offset of 16 elements (32 bytes). You do not need to compute these offsets in swizzled coordinates; because of the way the WGMMA unit is built, they work directly as byte offsets. You can synchronize these two operations as you wish, either by waiting on each individually, or by committing them in the same group and waiting on them together.
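A hedged sketch of this two-call pattern is below. Here sbo stands for whatever stride byte offset you determined for the 64-byte-swizzled layout, wgmma_m64n8k16_sw64 is a placeholder name for the swizzled wrapper, and passing 0 for the leading byte offset reflects that it is ignored in swizzled mode.

    // Sketch: two 64-byte-swizzled wgmma calls covering K = 0..15 and K = 16..31.
    // sbo = the stride byte offset you determined for the swizzled layout.
    char* a_base = reinterpret_cast<char*>(smem_a);
    char* b_base = reinterpret_cast<char*>(smem_b);

    warpgroup_arrive();

    // First half: K = 0..15 (byte offset 0 into the swizzled tiles).
    wgmma_m64n8k16_sw64(d, make_smem_desc(a_base, 0, sbo),
                           make_smem_desc(b_base, 0, sbo));
    // Second half: K = 16..31 (byte offset 16 elements = 32 bytes).
    wgmma_m64n8k16_sw64(d, make_smem_desc(a_base + 32, 0, sbo),
                           make_smem_desc(b_base + 32, 0, sbo));

    // Both operations accumulate into d; commit them as one group and wait.
    wgmma_commit();
    wgmma_wait<0>();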

Deliverable: Call two M=64, N=8, K=16 64-byte-swizzled wgmma operations in 1-swizzle-m64n8k32-wgmma.cu to compute a full M=64, N=8, K=32 matrix multiplication. You will need to swizzle the input data appropriately before invoking the wgmma instructions.

Prelab Question 2: How did you lay out your shared memory to be compatible with the swizzled wgmma interface (what were the corresponding leading byte offset and stride byte offset)? How did you manage synchronization?

Part 1: Matrix Multiplication

Now that you have experience with both TMA and wgmma, it’s time to write an H100 matrix multiplication kernel!

In this part, you will implement a matrix multiplication kernel for matrices of size M = 8192, N = 8192, K = 8192. The goal is to achieve performance close to cuBLAS peak performance, which is around 798 TFLOPs.

Deliverable: Implement a high-performance matrix multiplication kernel in h100-matmul.cu that computes C = A × B for M = 8192, N = 8192, K = 8192. As before, matrices A and B are both K-major, and matrix C is M-major. Your implementation will likely need to leverage both TMA for efficient data movement and wgmma for computation. For performance cutoffs, your code grade for this part will be calculated as 100 + (%speedup - 95) * 1.33. This means 95% speedup will count as full credit, 80% speedup will earn a grade of 80, and any percentage above 95% speedup counts toward extra credit.

Question 1 for final write-up: What percentage of peak cuBLAS performance were you able to achieve? How did you manage the TMA and wgmma interfaces along with your shared memory and registers? What was your model for synchronization? What were the key tradeoffs that you encountered in designing your kernel? Any interesting bugs to report?

Overall, this part is open-ended, and you should use your experience from Labs 4, 5, and 6 to guide your exploration. The rest of this section contains some suggestions that we think will be useful to consider:

Suggestion 1: Calling bigger wgmma operations

In Part 0, we worked with the smallest wgmma operation size, where N = 8. However, for a high-performance matrix multiplication kernel, you’ll likely want to use larger tile sizes to improve data reuse and reduce the overhead of instruction issue. The wgmma instruction supports N values ranging from 8 to 256 in steps of 8 (i.e., 8, 16, 24, 32, …, 256), all with M = 64 and K = 16 for bf16 data.

Using larger N dimensions allows each warp group to compute more output values per instruction, which can significantly improve performance by:

  1. Increasing data reuse: As N increases, the B matrix tile (16 × N elements) grows larger, while the A matrix tile (64 × 16 elements) remains the same size. Each element loaded from A is reused across all N output rows, and by loading a larger B tile (larger N), we amortize the cost of loading from shared memory over more output values.

  2. Reducing instruction overhead: Fewer wgmma instructions are needed to cover the same output matrix, reducing the cycles spent on instruction issue.

The shared memory layout for A and B follows the same pattern regardless of the N dimension you choose. The setup we used for N = 8 generalizes directly: A remains in row-major order and B in column-major order, both organized as K-major tiles with the same core matrix structure (8 × 16 bytes for unswizzled, or 8 × 32/64/128 bytes for swizzled variants).

The key challenge when working with larger N values is understanding how the output matrix D is distributed across registers. As N increases, each thread in the warp group must hold more output values. In general, each thread holds (64 × N) / 128 = N/2 accumulator values: N = 8 gives 4 values per thread, N = 64 gives 32, and N = 256 gives 128.

The pattern of how these values map to thread indices becomes more complex as N grows. Understanding this register layout is essential for correctly writing the results back to global memory.
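One plausible generalization of the m64n8 pattern, in which each additional group of four registers covers the next slab of 8 output columns, is sketched below. This is an assumption about the layout, not a statement from the PTX documentation, so verify it against the documentation’s layout figures (or optional-wgmma-setup.cu) before relying on it.

    // Sketch: map accumulator register r of this thread to (m, n) for an
    // m64nNk16 wgmma, assuming the n8 pattern repeats every 8 columns.
    // Each thread holds N/2 registers in total; C is M-major with M rows.
    __device__ void store_m64nN_accum(const float* d, int N, float* C, int M) {
        int lane = threadIdx.x % 32;
        int warp = (threadIdx.x / 32) % 4;
        for (int r = 0; r < N / 2; r++) {
            int n_block = r / 4;   // which 8-column slab this register belongs to
            int within  = r % 4;   // position within the slab (same as the n8 case)
            int m = 16 * warp + 8 * (within / 2) + lane / 4;
            int n = 8 * n_block + 2 * (lane % 4) + (within % 2);
            C[n * M + m] = d[r];
        }
    }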

In the starter code, we provide implementations for a few representative wgmma sizes as examples. However, we cannot provide wrappers for all possible granularities. If you want to experiment with a different N value, you’ll need to:

  1. Define the inline PTX assembly for the appropriate instruction variant (following the pattern in wgmma-interface.cuh).

  2. Allocate the correct number of registers for the output matrix based on your chosen N.

  3. Understand the register distribution pattern to correctly map output values back to their positions in the result matrix.

You can use the optional file optional-wgmma-setup.cu as a testing ground to experiment with different wgmma configurations before integrating them into your full matrix multiplication kernel. The PTX ISA documentation provides the complete specification for all supported wgmma variants.

Suggestion 2: Using TMA and WGMMA to get overlap

Recalling the following diagram from Lecture 8, you may find it beneficial to overlap TMAs with two sets of consumers so as to keep the tensor cores busy.

As discussed in Lab 5, there are a number of ways to implement this, from overlapping multiple blocks on the same SM to warp specialization. You may find it valuable to review the relevant sections in Lab 5, and in the subsequent section, we detail one of the possible strategies, warp specialization.

Warp Specialized Kernel

One approach to overlapping TMA and WGMMA operations is to manually assign warps to tasks using warp specialization. As in the figure above, you may want to adopt a producer-consumer strategy in this warp specialized kernel, where one warpgroup serves as the producer and issues all TMA load operations. You can then choose an appropriate number of consumer warpgroups that use the TMA load outputs to perform wgmma operations, as sketched below.
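The following is a structural sketch only, not a working kernel: the pipeline depth, tile counts, and barrier mechanics are assumptions, and the barrier details are assumed to follow the mbarrier-based pipelines from Lab 9.

    // Structural sketch: one producer warp group fills a ring of shared-memory
    // stages with TMA loads; the remaining warp groups consume them with wgmma.
    constexpr int NUM_STAGES  = 4;          // illustrative pipeline depth
    constexpr int NUM_K_TILES = 8192 / 64;  // illustrative K tiling

    __global__ void warp_specialized_matmul(/* ... */) {
        int warpgroup_id = threadIdx.x / 128;

        if (warpgroup_id == 0) {
            // Producer: for each K tile, wait until a stage is free, issue the TMA
            // loads for the A and B tiles into it, and arrive on the stage's "full"
            // barrier with the expected transaction byte count.
            for (int kt = 0; kt < NUM_K_TILES; kt++) {
                int stage = kt % NUM_STAGES;
                // wait(empty[stage]); tma_load_a(stage, kt); tma_load_b(stage, kt);
                (void)stage;
            }
        } else {
            // Consumers: for each K tile, wait for its stage to fill, run the wgmma
            // sequence over it, then mark the stage empty again.
            for (int kt = 0; kt < NUM_K_TILES; kt++) {
                int stage = kt % NUM_STAGES;
                // wait(full[stage]); warpgroup_arrive(); wgmma(...); wgmma_commit();
                // wgmma_wait<0>(); arrive(empty[stage]);
                (void)stage;
            }
            // Finally, write the accumulators out to C.
        }
    }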

The key advantage of this warp specialized kernel is that it allows you to explicitly overlap memory and compute operations, maximizing the utilization of the relevant hardware on our H100 machine.

Suggestion 3: Using TMA with Swizzled Layouts

In Part 0, you had to prepare data for wgmma by manually repacking matrices into the specific core matrix layout expected by the instruction. For the unswizzled case, this meant reorganizing data at an 8 × 16 byte granularity, and for the swizzled variants you may have needed to swizzle data at 8 × 32, 8 × 64, or 8 × 128 byte granularities depending on the swizzle mode.

Using swizzled layouts in the TMA provides two key benefits:

  1. Avoiding shared memory bank conflicts when accessing data for wgmma operations.
  2. Enabling larger TMA loads that can write data directly into the format wgmma expects, eliminating the need for manual repacking.

By combining TMA with the appropriate swizzle mode (32B, 64B, or 128B), you can load tiles directly from global memory into shared memory in the swizzled layout that wgmma can work with. This eliminates the extra work of reorganizing data in shared memory and lets you issue coarser-grained TMA transfers that better saturate memory bandwidth.
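As a hedged host-side sketch, one way to encode such a descriptor for a K-major bf16 matrix with a 64-byte swizzle is shown below, using the CUDA driver API. The tile shape, dimension ordering, and other parameters are assumptions for illustration; match them to however you set up TMA descriptors in Lab 9, and add error checking in real code.

    // Sketch: host-side creation of a TMA descriptor with a 64-byte swizzle for a
    // K-major bf16 matrix (K contiguous). Tile sizes and dimension ordering are
    // illustrative and must match how your kernel indexes the tensor map.
    #include <cuda.h>
    #include <cuda_bf16.h>

    CUtensorMap make_swizzled_tensor_map(void* global_ptr, uint64_t K, uint64_t MN) {
        CUtensorMap tensor_map{};
        // Innermost dimension is K (contiguous); outer dimension is M or N.
        uint64_t global_dim[2]     = {K, MN};
        uint64_t global_strides[1] = {K * sizeof(__nv_bfloat16)};  // row stride in bytes
        // Box (tile) shape: 32 bf16 = 64 bytes in K, matching the 64B swizzle width.
        uint32_t box_dim[2]      = {32, 64};
        uint32_t elem_strides[2] = {1, 1};
        cuTensorMapEncodeTiled(
            &tensor_map,
            CU_TENSOR_MAP_DATA_TYPE_BFLOAT16,
            /*tensorRank=*/2,
            global_ptr,
            global_dim,
            global_strides,
            box_dim,
            elem_strides,
            CU_TENSOR_MAP_INTERLEAVE_NONE,
            CU_TENSOR_MAP_SWIZZLE_64B,
            CU_TENSOR_MAP_L2_PROMOTION_NONE,
            CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
        return tensor_map;
    }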

Suggestion 4: Dynamic Register Allocation

If you adopt a warp-specialized design, you may notice an asymmetry in register usage between producer and consumer warps. Producer warps primarily issue TMA loads and use relatively few registers, while consumer warps execute wgmma operations that require many registers to store accumulator values. This imbalance can limit the size of wgmma operations you can perform, since the compiler allocates the same number of registers per thread across all warps in a block.

On the H100, NVIDIA introduced the ability to dynamically reallocate registers at a warp group granularity using the setmaxnreg PTX instruction. This allows producer warp groups to release registers they don’t need, making them available for consumer warp groups to use for larger wgmma tiles.

The setmaxnreg instruction operates at warp group granularity, meaning all threads in a warp group must execute the same setmaxnreg instruction with the same immediate value. The instruction requires a compile-time constant for the register count.

It’s important to note that setmaxnreg is a hint, not a directive. In our experience, whether the compiler can successfully apply your hint depends on the structure of your code. If it fails to apply the hint, the compiler does helpfully print out a warning message. Additionally, for setmaxnreg to work, the kernel must be launched with a valid maximum number of per-thread registers, which you can set with the __launch_bounds__ attribute.
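To illustrate, here is a hedged sketch of what setmaxnreg wrappers and their use might look like. The starter code provides its own helpers, so these names and the specific register counts are purely hypothetical; the underlying PTX requires a compile-time constant register count that is a multiple of 8 in the range 24 to 256.

    // Sketch: hypothetical wrappers around setmaxnreg (the starter code provides
    // its own helpers; these just illustrate the underlying PTX instruction).
    template <int RegCount>
    __device__ void warpgroup_reg_dealloc() {
        asm volatile("setmaxnreg.dec.sync.aligned.u32 %0;\n" :: "n"(RegCount));
    }

    template <int RegCount>
    __device__ void warpgroup_reg_alloc() {
        asm volatile("setmaxnreg.inc.sync.aligned.u32 %0;\n" :: "n"(RegCount));
    }

    // Example usage in a warp-specialized kernel: __launch_bounds__ gives the
    // compiler a valid per-thread register cap, and each warp group then adjusts
    // its own allocation. The specific counts here are only illustrative.
    __global__ void __launch_bounds__(384) example_kernel() {
        int warpgroup_id = threadIdx.x / 128;
        if (warpgroup_id == 0) {
            warpgroup_reg_dealloc<40>();   // producer: give back registers
        } else {
            warpgroup_reg_alloc<232>();    // consumers: claim more for accumulators
        }
    }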

For completeness, note that after executing a setmaxnreg instruction, all warps in the warp group must synchronize explicitly before executing any subsequent setmaxnreg instructions. If threads in a warp group execute different setmaxnreg instructions or fail to synchronize properly, the behavior is undefined. While it’s unlikely you’ll need to issue multiple setmaxnreg instructions back-to-back, keeping this requirement in mind can save you some debugging time if you do.

In the starter code, we provide two helper functions that wrap setmaxnreg instructions:

Finally, you can pass the -X-ptxas-options=-v option to get more information about register spills: uvx telerun submit -X-ptxas-options=-v your_code.cu.