Lab 10: H100 Matrix Multiplication
Prologue: Logistics
Due Dates
For this lab, you’ll be turning in the following deliverables:
- Checkpoint: Due Friday, November 14, 11:59 pm. Submit your completed code for Part 0, your responses to Prelab Questions 1-2, and a discussion of your progress on Part 1.
- Final Submission: Due Friday, November 21, 11:59 pm. Submit your completed code for all parts of the lab (including Part 0), as well as a PDF writeup containing your answers for Question 1 of the final writeup.
Starter Code
You can get the starter code for this lab by cloning the lab repository:
Goals for This Lab
In Lab 9, we explored the H100’s Tensor Memory Accelerator (TMA),
learning how to move data from global to shared memory and building
software-managed pipelines to hide latency. In this lab, we’ll turn our
attention to the second major feature of the H100, the warp-group level tensor
core instructions, or wgmma.
Tensor cores account for the vast majority of our machine’s computational capability. On the H100, the tensor cores provide approximately 1000 TFLOPs of half-precision matrix multiply throughput. To put this in perspective, the standard FMA units can run at a half-precision throughput of approximately 120 TFLOPs. This means that almost 90% of our machine’s half-precision throughput comes from tensor cores! To write high-performance kernels on the H100, we will need to understand how to use these tensor core instructions effectively.
Like in Lab 6, using tensor cores will require understanding not
just how to invoke the instructions themselves, but also how to organize data in
memory to match the expectations of these specialized units. Ultimately, our
goal is to combine TMA (for data movement) and wgmma (for compute) to write a
fast matrix multiplication kernel.
This lab has two main parts:
- Part 0 (Prelab) introduces the wgmma instruction. You'll learn how to invoke this instruction and set up data for it, working with both swizzled and non-swizzled memory formats.
- Part 1 combines TMA and wgmma to implement a fast matrix multiplication kernel on the H100.
Part 0: The wgmma instruction
To use the wgmma instruction, we first need to understand how it differs from
the mma instruction we used in Lab 6.
- The first difference is semantic. In mma, the operation computes D = A × B + C, where the accumulator C is separate from the output D. In wgmma, the operation instead computes D = A × B + D, meaning the output serves as its own accumulator. Additionally, wgmma provides optional transformations that mma does not: you can transpose A or B, scale either matrix by 1 or -1, and set the D scale to either 0 (to clear the accumulator) or 1 (to accumulate).
- The wgmma instruction operates at the level of a warp group, a new organizational unit consisting of 4 warps, or 128 threads total. mma, on the other hand, operates on a single warp of 32 threads.
- The matrix tile sizes that wgmma can handle are significantly larger than those supported by mma. For bf16, the data type we'll focus on in this lab, wgmma supports tiles with M = 64, K = 16, and N ranging from 8 to 256 in steps of 8. For comparison, the mma instruction in Lab 6 used M = 16, N = 8, and K = 16 (or could also have been set to K = 8). These larger tiles enable greater data reuse.
- The supported layouts of the input and output matrices are different, including optional fine-grained "swizzling" of the data for more efficient memory access.
- There are also differences in where the input and output matrices can reside. The A matrix can be in either registers or shared memory, while the B matrix must be in shared memory. Unlike mma, the B matrix no longer needs to be loaded into registers before use. The output matrix D must still reside in registers.
- Finally, unlike mma, the wgmma instruction is asynchronous. When you issue a wgmma instruction, the warp group can immediately continue executing other instructions while the matrix multiply completes in the background, but you must explicitly synchronize before using the result in other (non-wgmma) operations.
In fact, most of these differences are visible in the PTX instruction itself.
Here is a concrete example of the wgmma instruction with size m64n256k16:
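Below is a sketch of the instruction's general form, using the operand names from the PTX ISA documentation rather than concrete registers (the starter code's inline-assembly wrappers handle the exact operand encoding). Note how the differences above show up: the shared memory descriptors, the accumulate-into-D behavior controlled by scale-d, and the optional scale/transpose modifiers.

```
wgmma.mma_async.sync.aligned.m64n256k16.f32.bf16.bf16
    {d0, d1, ..., d127},        // D: 64*256/128 = 128 f32 accumulator values per thread
    a-desc,                     // 64-bit descriptor for A in shared memory (A may also come from registers)
    b-desc,                     // 64-bit descriptor for B in shared memory
    scale-d,                    // 0 = clear the accumulator, 1 = accumulate into D
    imm-scale-a, imm-scale-b,   // scale A / B by 1 or -1
    imm-trans-a, imm-trans-b;   // optionally transpose A / B
```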
Now that we understand how wgmma differs from mma, let’s try invoking it.
We’ll start with non-swizzled data layouts before moving on to swizzled layouts.
Part 0: Invoking the wgmma instruction
The unswizzled wgmma instruction
Input Matrix Memory Layout
Since the A and B matrices reside in shared memory, their memory layout
(row-major or column-major) becomes important. Unlike with registers, where each
register is mapped to the same logical index regardless of the original layout,
shared memory layouts must be explicitly specified through the descriptors so
the hardware knows how to interpret the matrix data.
While wgmma supports both row-major and column-major variants for the input
matrices, for simplicity we’ll use a consistent convention throughout this lab for
matrices in both global and shared memory:
A is stored in row-major order, while B is stored in column-major order; we will refer to this as both A and B being K-major. In all cases, we also make the output matrix M-major (i.e., stored as N×M) for consistency with the cuBLAS API.
Core Matrices
The wgmma instruction operates on data made up of “core matrices” (as referred to in the PTX
documentation). A core matrix represents the
fundamental unit of data that must be contiguous, or tightly packed together in
memory. For the non-swizzled case, each core matrix is 8 × 16 bytes in size. It is important to note that the values within each core matrix must be contiguous and K-major, meaning that you can’t directly copy the whole matrix
from global memory into shared, since the condition that core matrices must be contiguous will be violated.
While each core matrix must itself be contiguous (8 × 16 bytes with no gaps),
there can be gaps between different core matrices. The spacing between
consecutive core matrices in the K dimension is given by the leading byte offset,
while the spacing between consecutive core matrices in the M or N dimension is given by the stride byte offset.
In practice, we tell wgmma about our shared memory using
shared memory descriptors. The descriptor is set up in a special
format. We’ve
provided a wrapper function make_smem_desc that encodes descriptors in
this format. This function takes three parameters in this order:
- The address of the shared memory buffer.
- The leading byte offset.
- The stride byte offset.
Now, let’s examine what these core matrices look
like in an example, and how wgmma expects these matrices to be laid out in shared memory.
We’ll start with the smallest wgmma variant, where M=64, K=16, and N=8 and the matrices are K-major.

In this case, let's calculate the leading byte offset and the stride byte offset for the M × K matrix. Moving in the K direction, consecutive core matrices are adjacent, so the spacing is 128 bytes; hence, the leading byte offset is 128 bytes. Moving in the M direction, consecutive core matrices are separated by the two core matrices that span K, giving a spacing of 2 × 128 = 256 bytes; hence, the stride byte offset is 256 bytes. The important thing to take away is that the leading byte offset describes movement in the K dimension, while the stride byte offset describes movement in the M/N dimension.
Note: Like TMA, operations on wgmma happen in the async proxy, and you
must use an async proxy fence to ensure that the wgmma engine sees the most
up-to-date values.
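For concreteness, here is a hedged sketch of how a kernel might repack the 64 × 16 bf16 A tile from the example above into this core-matrix layout and then build its descriptor. The names gmem_a and lda are illustrative; make_smem_desc is the provided helper (assumed here to return the 64-bit descriptor value); and the explicit fence may already be folded into the starter code's helpers.

```cuda
#include <cstdint>
#include <cuda_bf16.h>

// Repack a row-major 64 x 16 bf16 tile into the unswizzled core-matrix layout
// (leading byte offset 128, stride byte offset 256), then build its descriptor.
__device__ uint64_t load_a_tile(const __nv_bfloat16* gmem_a, int lda) {
    __shared__ __nv_bfloat16 smem_a[64 * 16];
    for (int idx = threadIdx.x; idx < 64 * 16; idx += blockDim.x) {
        int m = idx / 16, k = idx % 16;
        // Each 8 x 8-element (8 x 16-byte) core matrix is contiguous; core matrices
        // are 128 bytes (64 elements) apart in K and 256 bytes (128 elements) apart in M.
        int dst = (m / 8) * 128 + (k / 8) * 64 + (m % 8) * 8 + (k % 8);
        smem_a[dst] = gmem_a[m * lda + k];
    }
    __syncthreads();
    asm volatile("fence.proxy.async;");  // make the ordinary stores visible to the async proxy
    return make_smem_desc(&smem_a[0], /*leading byte offset*/ 128, /*stride byte offset*/ 256);
}
```

The B tile is handled analogously, with offsets derived from its own core-matrix layout.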
Output Matrix Register Layout
The output matrix D, on the other hand, resides in registers. Like before, the
values are scattered across registers per thread. In our example, each thread
holds (64 × 8) / 128 = 4 values. These values are distributed across threads as
follows (figure from PTX documentation):
Mathematically, lane x of warp y holds the values at rows M = 16y + floor(x/4) and M = 16y + 8 + floor(x/4), and at the columns N satisfying (N // 2) % 4 = x % 4 (for our N = 8 case, columns 2(x % 4) and 2(x % 4) + 1). Intuitively, each warp covers a cluster of 16 rows of M; within a warp, each set of four consecutive threads covers one pair of rows, with each thread holding two adjacent values in N (so the four threads together cover a full 8-wide row).
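As a concrete (and hedged) illustration of this layout, the following sketch writes one thread's four accumulator values back to a row-major 64 × 8 tile, assuming an f32 accumulator and the same register ordering as the mma fragments from Lab 6 (two adjacent columns in the lower row group, then the same two columns in the upper row group):

```cuda
// d holds this thread's 4 accumulator values; out points to a 64 x 8 row-major tile.
__device__ void store_d_m64n8(const float d[4], float* out) {
    int tid  = threadIdx.x % 128;                      // index within the warp group
    int warp = tid / 32, lane = tid % 32;
    for (int i = 0; i < 4; i++) {
        int row = 16 * warp + 8 * (i / 2) + lane / 4;  // M = 16y + (0, 8) + floor(x/4)
        int col = 2 * (lane % 4) + (i % 2);            // N with (N // 2) % 4 == x % 4
        out[row * 8 + col] = d[i];
    }
}
```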
Synchronizing Asynchronous wgmma Operations
As we mentioned earlier, wgmma is asynchronous. The last piece to put in place
is the ability to synchronize wgmma operations. wgmma uses a commit group
style mechanism, but there are two significant differences from the commit
group mechanisms we’ve used before.
First, there is ordering within a group. Between wgmma operations committed in the same group that operate on the same output registers, execution is guaranteed to happen in the order in which the operations were issued. You can choose to wait on a whole sequence of operations by issuing them into the same group. Then, as before, you commit and wait on pending groups.
Second, the commit operations happen at warp group granularity. For both
cp.async and cp.async_bulk, the operations were issued per thread, and as
such the commit group was committed per thread. With wgmma, the warp group
issues the commit group instruction together. These instructions must be aligned
across all threads in the warp group. Similarly, the wait for pending wgmma
operations must also happen at warp group granularity.
We provide the following helper functions to synchronize wgmma operations:
- warpgroup_arrive() performs a memory fence across the warp group and ensures that all register/shared memory writes from the warp group are visible to subsequent wgmma operations.
- wgmma_commit() commits the current group of wgmma operations.
- wgmma_wait<N>() waits until all but N pending groups of wgmma operations have completed.
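These helpers combine into the following minimal issue/synchronize sequence (a sketch assuming an f32 accumulator, descriptors desc_a and desc_b built as in the earlier example, and a hypothetical wrapper wgmma_m64n8k16 around the PTX instruction; see wgmma-interface.cuh for the wrappers actually provided):

```cuda
float d[4] = {0.f, 0.f, 0.f, 0.f};  // this thread's share of the 64 x 8 output tile
warpgroup_arrive();                  // make prior register/shared memory writes visible to wgmma
wgmma_m64n8k16(d, desc_a, desc_b);   // issue the asynchronous matrix multiply
wgmma_commit();                      // close the current group of wgmma operations
wgmma_wait<0>();                     // block until zero groups remain pending
// d[] can now be read safely by ordinary (non-wgmma) instructions.
```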
Putting It All Together
Using these APIs, you now have the tools to implement a simple matrix multiplication kernel using the
wgmma interface:
Deliverable: Call a single M=64, N=8, K=16 wgmma in 0-m64n8k16-wgmma.cu. This should involve only making a single call, but you will need to reorganize the data in shared memory to meet the core matrix layout.
Prelab Question 1: How did you lay out your shared memory to be compatible with the wgmma interface (what were the corresponding leading byte offset and stride byte offset)? How did you manage synchronization?
Invoking the 64-byte swizzled wgmma instruction
In the previous section, we worked with the unswizzled wgmma instruction,
which organizes data at an 8 × 16 byte granularity. The swizzled variants instead process
different granularities depending on the swizzle mode:
- 32-byte swizzled requires repacking at 8 × 32 bytes (with each 32 bytes swizzled)
- 64-byte swizzled requires repacking at 8 × 64 bytes (with each 64 bytes swizzled)
- 128-byte swizzled requires repacking at 8 × 128 bytes (with each 128 bytes swizzled)
In this section, we’ll invoke the 64-byte swizzled variant. Within each 64-byte swizzled core matrix, elements are packed into 128-bit chunks, which are then swizzled. As an
example, we show the swizzling layout below (figure from PTX documentation).
Please note that each number corresponds to a 128-bit chunk (or correspondingly, 8 packed bf16 elements).

Outside of the size of the core matrix, the main difference between the swizzled case
and the unswizzled case is how we set up the shared memory descriptor. For
swizzled descriptors, the leading byte offset no longer matters and is ignored. The stride byte offset remains conceptually the same, but now
represents the offset needed to move from one 8 × 64 byte core matrix to the
next in the M/N dimension.
Because the 64-byte swizzle mode requires at least 64 bytes in the K dimension,
we will now use K = 32 (32 elements × 2 bytes per bf16 = 64 bytes). With this
choice, you will not need to repack the data as we did in the unswizzled
case—it’s already in the correct form, since we don’t need multiple core matrices to fill the entire K dimension!
However, you will need to swizzle each core matrix. You are free to use a TMA load if you wish, or you can use your own swizzle function
from Lab 9.
Also, since K = 32, you will now require 2 wgmma operations. The first computes K = 0 to 15, and the second computes K = 16 to 31. To make these two calls, you will
need to invoke the first wgmma at offset 0, and then invoke the second at
offset 16 elements (32 bytes). You do not need to compute these offsets in swizzled coordinates; they work directly as byte offsets because of the way the WGMMA unit is built. You can synchronize these two operations as you wish, either by waiting on each
individually, or by committing them in the same group and waiting on them
together.
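For instance, the two calls might look like the following hedged sketch, assuming bf16 shared-memory buffers smem_a and smem_b that have already been 64-byte swizzled, and a hypothetical wrapper wgmma_m64n8k16_sw64 that builds the 64-byte-swizzle descriptors from a base address internally; the second call simply advances each base address by 16 elements (32 bytes):

```cuda
warpgroup_arrive();
wgmma_m64n8k16_sw64(d, smem_a,      smem_b);       // K = 0..15
wgmma_m64n8k16_sw64(d, smem_a + 16, smem_b + 16);  // K = 16..31, same accumulator d
wgmma_commit();   // both operations in one group; ordering on d is guaranteed
wgmma_wait<0>();  // wait for the whole group at once
```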
Deliverable: Call two M=64, N=8, K=16 64-byte-swizzled wgmma operations in 1-swizzle-m64n8k32-wgmma.cu to compute a full M=64, N=8, K=32 matrix multiplication. You will need to swizzle the input data appropriately before invoking the wgmma instructions.
Prelab Question 2: How did you lay out your shared memory to be compatible with the swizzled wgmma interface (what were the corresponding leading byte offset and stride byte offset)? How did you manage synchronization?
Part 1: Matrix Multiplication
Now that you have experience with both TMA and wgmma, it’s time to write an
H100 matrix multiplication kernel!
In this part, you will implement a matrix multiplication kernel for matrices of size M = 8192, N = 8192, K = 8192. The goal is to achieve performance close to peak cuBLAS performance, which is around 798 TFLOPs.
Deliverable: Implement a high-performance matrix multiplication kernel in h100-matmul.cu that computes C = A × B for M = 8192, N = 8192, K = 8192. As before, matrices A and B are both K-major, and matrix C is M-major. Your implementation will likely need to leverage both TMA for efficient data movement and wgmma for computation. For performance cutoffs, your code grade for this part will be calculated as 100 + (%speedup - 95) * 1.33. This means 95% speedup will count as full credit, 80% speedup will count as 80%, and any percentage above 95% speedup counts as extra credit.
Question 1 for final write-up: What percentage of peak cuBLAS performance were you able to achieve? How did you manage the TMA and wgmma interfaces along with your shared memory and registers? What was your model for synchronization? What were the key tradeoffs that you encountered in designing your kernel? Any interesting bugs to report?
Overall, this part is open-ended, and you should use your experience from Labs 4, 5, and 6 to guide your exploration. The rest of this section below contains some suggestions that we think will be useful to consider:
Suggestion 1: Calling bigger wgmma operations
In Part 0, we worked with the smallest wgmma operation size, where N = 8.
However, for a high-performance matrix multiplication kernel, you’ll likely want
to use larger tile sizes to improve data reuse and reduce the overhead of
instruction issue. The wgmma instruction supports N values ranging from 8
to 256 in steps of 8 (i.e., 8, 16, 24, 32, …, 256), all with M = 64 and K = 16 for bf16 data.
Using larger N dimensions allows each warp group to compute more output values
per instruction, which can significantly improve performance by:
- Increasing data reuse: As N increases, the B matrix tile (16 × N elements) grows larger, while the A matrix tile (64 × 16 elements) remains the same size. Each element loaded from A is reused across N output values, and by loading a larger B tile (larger N), we amortize the cost of loading from shared memory over more output values.
- Reducing instruction overhead: Fewer wgmma instructions are needed to cover the same output matrix, reducing the cycles spent on instruction issue.
The shared memory layout for A and B follows the same pattern regardless
of the N dimension you choose. The setup we used for N = 8 generalizes
directly: A remains in row-major order and B in column-major order, both
organized as K-major tiles with the same core matrix structure (8 × 16 bytes for
unswizzled, or 8 × 32/64/128 bytes for swizzled variants).
The key challenge when working with larger N values is understanding how the
output matrix D is distributed across registers. As N increases, each
thread in the warp group must hold more output values. For example:
- With N = 8: each thread holds (64 × 8) / 128 = 4 output values
- With N = 16: each thread holds (64 × 16) / 128 = 8 output values
- With N = 32: each thread holds (64 × 32) / 128 = 16 output values
- With N = 64: each thread holds (64 × 64) / 128 = 32 output values
The pattern of how these values map to thread indices becomes more
complex as N grows. Understanding this register layout is essential for
correctly writing the results back to global memory.
In the starter code, we provide implementations for a few representative wgmma
sizes as examples. However, we cannot provide wrappers for all possible
granularities. If you want to experiment with a different N value, you’ll need
to:
- Define the inline PTX assembly for the appropriate instruction variant (following the pattern in wgmma-interface.cuh).
- Allocate the correct number of registers for the output matrix based on your chosen N.
- Understand the register distribution pattern to correctly map output values back to their positions in the result matrix.
You can use the optional file optional-wgmma-setup.cu as a testing ground to
experiment with different wgmma configurations before integrating them into
your full matrix multiplication kernel. The PTX ISA
documentation
provides the complete specification for all supported wgmma variants.
Suggestion 2: Using TMA and WGMMA to get overlap
Recalling the diagram from Lecture 8, you may find it beneficial to overlap TMAs with two sets of consumers so as to keep the tensor cores busy.
As discussed in Lab 5, there are a number of ways to implement this, from overlapping multiple blocks on the same SM to warp specialization. You may find it valuable to review the relevant sections in Lab 5, and in the subsequent section, we detail one of the possible strategies, warp specialization.
Warp Specialized Kernel
One approach to overlapping TMA and WGMMA operations is to manually assign warps
to tasks using warp specialization. As in the figure above, in this warp specialized
kernel, you may want to adopt a producer-consumer strategy, where one warpgroup
issues all TMA load operations and serves as the producer. Then, you can choose a relevant
number of consumer warpgroups that use the TMA load outputs to perform wgmma operations.
The key advantage of this warp-specialized kernel is that it allows you to explicitly overlap memory and compute operations, maximizing the utilization of the relevant hardware on our H100 machine.
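Structurally, such a kernel might look like the following sketch (names, stage counts, and the mbarrier/pipeline details are placeholders to be filled in with the machinery from Lab 9):

```cuda
__global__ void h100_matmul_ws(int num_k_tiles /* , tensor maps, output pointer, ... */) {
    int warpgroup_id = threadIdx.x / 128;
    if (warpgroup_id == 0) {
        // Producer warp group: issue TMA loads of A and B tiles into a circular
        // buffer of shared-memory stages, blocking (via mbarriers) when all stages are full.
        for (int k_tile = 0; k_tile < num_k_tiles; k_tile++) {
            // wait for a free stage; issue cp.async.bulk.tensor; arrive on the stage's mbarrier
        }
    } else {
        // Consumer warp group(s): wait for each stage to be filled, run wgmma over it,
        // then release the stage back to the producer.
        for (int k_tile = 0; k_tile < num_k_tiles; k_tile++) {
            // wait for the stage; warpgroup_arrive(); issue wgmma; wgmma_commit(); wgmma_wait<N>()
        }
        // epilogue: write the accumulators back to global memory
    }
}
```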
Suggestion 3: Using TMA with Swizzled Layouts
In Part 0, you had to prepare data for wgmma by manually
repacking matrices into the specific core matrix layout expected by the
instruction. For the unswizzled case, this meant reorganizing data at an 8 × 16
byte granularity, and for larger operations you may have needed to repack at 8 × 32,
8 × 64, or 8 × 128 bytes depending on the variant.
Using swizzled layouts in the TMA provides two key benefits:
- Avoiding shared memory bank conflicts when accessing data for wgmma operations.
- Enabling larger TMA loads that can write data directly into the format wgmma expects, eliminating the need for manual repacking.
By combining TMA with the appropriate swizzle mode (32B, 64B, or 128B), you can
load tiles directly from global memory into shared memory in the swizzled layout
that wgmma can work with. This eliminates the extra work of reorganizing data
in shared memory and lets you issue coarser-grained TMA transfers that better
saturate memory bandwidth.
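As a host-side example, the following hedged sketch encodes a TMA descriptor for A (8192 × 8192 bf16, K-major) with the 64-byte swizzle mode using the CUDA driver API directly; your Lab 9 wrapper may expose this differently, and the 64 × 32-element box shape is an illustrative choice (its inner dimension is 64 bytes to match the swizzle).

```cuda
#include <cuda.h>
#include <cuda_bf16.h>

void make_tma_desc_a(CUtensorMap* map, void* d_A) {
    // TMA dimensions are listed fastest-varying first: dim 0 = K, dim 1 = M.
    cuuint64_t global_dim[2]    = {8192, 8192};
    cuuint64_t global_stride[1] = {8192 * sizeof(__nv_bfloat16)};  // bytes between consecutive rows
    cuuint32_t box_dim[2]       = {32, 64};   // 32 bf16 = 64 bytes of K per row, 64 rows of M
    cuuint32_t elem_stride[2]   = {1, 1};
    cuTensorMapEncodeTiled(map, CU_TENSOR_MAP_DATA_TYPE_BFLOAT16, 2, d_A,
                           global_dim, global_stride, box_dim, elem_stride,
                           CU_TENSOR_MAP_INTERLEAVE_NONE,
                           CU_TENSOR_MAP_SWIZZLE_64B,   // tiles land in shared memory pre-swizzled
                           CU_TENSOR_MAP_L2_PROMOTION_NONE,
                           CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
}
```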
Suggestion 4: Dynamic Register Allocation
If you adopt a warp-specialized design, you may notice an asymmetry in
register usage between producer and consumer warps. Producer warps primarily
issue TMA loads and use relatively few registers, while consumer warps execute
wgmma operations that require many registers to store accumulator values. This
imbalance can limit the size of wgmma operations you can perform, since the
compiler allocates the same number of registers per thread across
all warps in a block.
On the H100, NVIDIA introduced the ability to dynamically reallocate registers
at a warp group granularity using the setmaxnreg PTX
instruction.
This allows producer warp groups to release registers they don’t need, making
them available for consumer warp groups to use for larger wgmma tiles.
The setmaxnreg instruction operates at warp group granularity, meaning all
threads in a warp group must execute the same setmaxnreg instruction with the
same immediate value. The instruction requires a compile-time constant for
the register count.
It’s important to note that setmaxnreg is a hint, not a directive. In our
experience, whether the compiler can successfully apply your hint depends on the
structure of your code. If it fails to apply the hint, the compiler does
helpfully print out a warning message. Additionally, for setmaxnreg to work,
the kernel must be launched with a valid maximum number of per-thread registers,
which you can set with the __launch_bounds__ attribute.
For completeness, note that after executing a setmaxnreg instruction, all
warps in the warp group must synchronize explicitly before executing any
subsequent setmaxnreg instructions. If threads in a warp group execute
different setmaxnreg instructions or fail to synchronize properly, the
behavior is undefined. While it's unlikely you'll need to issue multiple setmaxnreg instructions back-to-back, keeping this requirement in mind can save you some debugging time if you do.
In the starter code, we provide two helper functions that wrap setmaxnreg instructions:
- warpgroup_reg_alloc<N>(): Hint to increase the maximum number of registers per thread in a warp group to N.
- warpgroup_reg_dealloc<N>(): Hint to decrease the maximum number of registers per thread in a warp group to N.
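A typical (illustrative) use in a warp-specialized kernel with one producer and two consumer warp groups might look like the sketch below; the specific register counts are assumptions to tune, subject to setmaxnreg's constraints and the SM's total register budget.

```cuda
__global__ void __launch_bounds__(384) h100_matmul_ws(/* ... */) {
    int warpgroup_id = threadIdx.x / 128;
    if (warpgroup_id == 0) {
        warpgroup_reg_dealloc<40>();   // producer: release registers it won't need
        // ... issue TMA loads ...
    } else {
        warpgroup_reg_alloc<232>();    // consumers: claim extra registers for wgmma accumulators
        // ... wgmma main loop and epilogue ...
    }
}
```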
Finally, you can pass the -X-ptxas-options=-v option to get more information about register usage and spills, e.g. with the uvx telerun submit -X-ptxas-options=-v your_code.cu command.