Note: L1 Cache Behavior
Caching Mutable Data in L1
While it is generally a good mental and programming model to treat the L1 cache as incoherent, with typical load instructions bypassing the L1 cache and only loads of read-only data being eligible for caching, this is not the full story:
- Although NVIDIA's documentation and public communications are somewhat opaque on this matter, at least on recent NVIDIA GPUs, it appears that even data which the compiler believes may change in the future can still benefit from being cached in L1.
- At the PTX level, this means that even `ld.global` instructions without the `.nc` qualifier (or any other caching qualifiers) can still make use of the L1 cache.
- At the SASS level, this means that `LDG` instructions without the `.CONSTANT` qualifier can make use of the L1 cache.
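To make the distinction concrete, here is a minimal CUDA sketch (our own illustration, not drawn from NVIDIA's documentation; the kernel names are hypothetical) contrasting a plain global load with a load explicitly routed through the read-only path. Compiling with `nvcc -ptx` should show the first kernel's load as a plain `ld.global` and the second's as `ld.global.nc`, and `cuobjdump -sass` should show the corresponding `LDG` instructions without and with the `.CONSTANT` qualifier.

```cuda
__global__ void plain_copy(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Compiles to a plain ld.global load (no .nc). Per the notes
        // above, this load may still be served from L1 on recent GPUs.
        out[i] = in[i];
    }
}

__global__ void readonly_copy(const float *__restrict__ in,
                              float *__restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // __ldg explicitly requests the read-only path (ld.global.nc);
        // the const + __restrict__ qualifiers alone usually let the
        // compiler choose that path on its own.
        out[i] = __ldg(&in[i]);
    }
}
```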
L1 Cache Hit Granularity
Once again, while it is still a generally good mental model to treat the L1 cache as only supplying data from at most one contiguous 128-byte (32-word) cache line on each cycle, we’ve now seen some evidence online suggesting that the L1 cache can in fact perform multiple tag lookups in parallel per cycle. However, we still recommend trying to touch only one cache line at a time per load instruction as a reasonable policy to adopt by default when designing high-performance kernels.
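As a concrete illustration of that default policy, the sketch below (our own, with hypothetical kernel names) shows an access pattern in which each warp-wide load touches exactly one 128-byte line, alongside a strided pattern that spreads a single load across many lines.

```cuda
__global__ void one_line_per_load(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Threads 0..31 of a warp read 32 consecutive floats:
        // 32 * 4 B = 128 B, i.e. exactly one cache line, assuming `in`
        // is 128-byte aligned (cudaMalloc allocations are aligned to
        // at least 256 B).
        out[i] = in[i];
    }
}

__global__ void many_lines_per_load(const float *in, float *out,
                                    int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // With stride >= 32 (floats), each warp-wide load touches up
        // to 32 distinct cache lines, stressing the L1 tag-lookup path
        // discussed above.
        out[i] = in[((long long)i * stride) % n];
    }
}
```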
Further Reading
If you’re interested in investigating this topic yourself, here are some references which may be relevant:
- Useful background reading to understand the correctness contract which the compiler and hardware are trying to uphold. It does not directly describe the performance characteristics of the microarchitecture, but it imposes constraints on the design which may help one make educated guesses at how it must be implemented.
  - See also: "A Formal Analysis of the NVIDIA PTX Memory Consistency Model" (Lustig et al., 2019)
- CUDA global memory caching documentation
  - Documentation consistent with our simpler description of the L1 cache behavior.
  - This documentation was originally published for NVIDIA's Maxwell generation of GPUs (released 2014), but the CUDA documentation claims elsewhere that this description of the L1 cache is still relevant on Ampere and Hopper. It is unclear to us whether the documentation is fully correct.
- Suggests that on Ampere, all loads, even loads of non-read-only data, are cached by default in L1:
  > …caching at all levels (what ca hint means) is the default behavior, at least for cc 8.0.
- Suggests that on Volta (released 2017), data is retrieved from L2 and DRAM at the granularity of 32-byte "sectors," and that the L1 can service up to 4 tag lookups per cycle (see the sketch after this list):
  > The Volta L1 data cache has 128 byte cache lines divided into 4 sectors. For local and global accesses the tag stage can compare all 32 threads at a time. The tag stage can look up 4 tags per cycle resolving a maximum of 16 sectors (4 tags x 4 sectors). On miss the cache will only fetch the unique 32 byte sectors that missed. The full cache line is not automatically fetched from L2.
- "Dissecting the Turing GPU Architecture through Microbenchmarking" (Jia et al., 2019)
  - Fascinating presentation on experimentally observing microarchitectural details of NVIDIA's Turing generation of GPUs (released 2018).
  - Provides clear, direct evidence of the existence of 32-byte sectors in the L1 cache (see slide 32).
  - See also: the accompanying 66-page technical report on the Volta architecture, from the same authors.
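To connect the sector-granularity claims above to something testable, here is a minimal sketch (our own; the kernel name and sizes are hypothetical) of an access pattern whose memory traffic should distinguish sectored from non-sectored fetches. Each load touches one float per 128-byte line; on sectored hardware only the 32-byte sector containing that float should be fetched from L2 or DRAM, which sector-level metrics in a profiler such as Nsight Compute can confirm.

```cuda
__global__ void one_float_per_line(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int idx = i * 32;  // 32 floats = 128 B apart: one element per cache line
    if (idx < n) {
        // If full 128-byte lines were fetched on each miss, this kernel
        // would move ~32x more data from L2/DRAM than it uses; with
        // 32-byte sectors, the overfetch is only ~8x (32 B fetched per
        // 4 B used).
        out[i] = in[idx];
    }
}
```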