Note: L1 Cache Behavior
Caching Mutable Data in L1
While it is generally a good mental and programming model to treat the L1 cache as incoherent, with typical load instructions bypassing the L1 cache and only loads of read-only data being eligible for caching, this is not the full story:
- Although NVIDIA's documentation and public communications are somewhat opaque on this matter, at least on recent NVIDIA GPUs, it appears that even data which the compiler believes may change in the future can still benefit from being cached in L1.
- At the PTX level, this means that even `ld.global` instructions without the `.nc` qualifier (or any other caching qualifiers) can still make use of the L1 cache.
- At the SASS level, this means that `LDG` instructions without the `.CONSTANT` qualifier can make use of the L1 cache.
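To make the distinction concrete, here is a minimal CUDA sketch (our own illustration, not drawn from NVIDIA's documentation; the kernel names are hypothetical) contrasting a plain global load with a load explicitly routed through the read-only path. Compiling with `nvcc -ptx` should show the first kernel's load as a plain `ld.global` and the second's as `ld.global.nc`, and `cuobjdump -sass` should show the corresponding `LDG` instructions without and with the `.CONSTANT` qualifier.

```cuda
__global__ void plain_copy(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Compiles to a plain ld.global load (no .nc). Per the notes
        // above, this load may still be served from L1 on recent GPUs.
        out[i] = in[i];
    }
}

__global__ void readonly_copy(const float *__restrict__ in,
                              float *__restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // __ldg explicitly requests the read-only path (ld.global.nc);
        // the const + __restrict__ qualifiers alone usually let the
        // compiler choose that path on its own.
        out[i] = __ldg(&in[i]);
    }
}
```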
L1 Cache Hit Granularity
Once again, while it is still a generally good mental model to treat the L1 cache as only supplying data from at most one contiguous 128-byte (32-word) cache line on each cycle, we’ve now seen some evidence online suggesting that the L1 cache can in fact perform multiple tag lookups in parallel per cycle. However, we still recommend trying to touch only one cache line at a time per load instruction as a reasonable policy to adopt by default when designing high-performance kernels.
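As a concrete illustration of that default policy, the sketch below (our own, with hypothetical kernel names) shows an access pattern in which each warp-wide load touches exactly one 128-byte line, alongside a strided pattern that spreads a single load across many lines.

```cuda
__global__ void one_line_per_load(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Threads 0..31 of a warp read 32 consecutive floats:
        // 32 * 4 B = 128 B, i.e. exactly one cache line, assuming `in`
        // is 128-byte aligned (cudaMalloc allocations are aligned to
        // at least 256 B).
        out[i] = in[i];
    }
}

__global__ void many_lines_per_load(const float *in, float *out,
                                    int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // With stride >= 32 (floats), each warp-wide load touches up
        // to 32 distinct cache lines, stressing the L1 tag-lookup path
        // discussed above.
        out[i] = in[((long long)i * stride) % n];
    }
}
```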
Further Reading
If you’re interested in investigating this topic yourself, here are some references which may be relevant:
- Useful background reading to understand the correctness contract which the compiler and hardware are trying to uphold. It does not directly describe the performance characteristics of the microarchitecture, but it imposes constraints on the design which may help one make educated guesses at how it must be implemented.
  - See also: "A Formal Analysis of the NVIDIA PTX Memory Consistency Model" (Lustig et al., 2019)
- CUDA global memory caching documentation
  - Documentation consistent with our simpler description of the L1 cache behavior.
  - This documentation was originally published for NVIDIA's Maxwell generation of GPUs (released 2014), but the CUDA documentation claims elsewhere that this description of the L1 cache is still relevant on Ampere and Hopper. It is unclear to us whether the documentation is fully correct.
- Suggests that on Ampere, all loads, even loads of non-read-only data, are cached by default in L1:
  > …caching at all levels (what ca hint means) is the default behavior, at least for cc 8.0.
- Suggests that on Volta (released 2017), data is retrieved from L2 and DRAM at the granularity of 32-byte "sectors," and that the L1 can service up to 4 tag lookups per cycle (see the sketch after this list):
  > The Volta L1 data cache has 128 byte cache lines divided into 4 sectors. For local and global accesses the tag stage can compare all 32 threads at a time. The tag stage can look up 4 tags per cycle resolving a maximum of 16 sectors (4 tags x 4 sectors). On miss the cache will only fetch the unique 32 byte sectors that missed. The full cache line is not automatically fetched from L2.
- "Dissecting the Turing GPU Architecture through Microbenchmarking" (Jia et al., 2019)
  - Fascinating presentation on experimentally observing microarchitectural details of NVIDIA's Turing generation of GPUs (released 2018).
  - Provides clear, direct evidence of the existence of 32-byte sectors in the L1 cache (see slide 32).
  - See also: the accompanying 66-page technical report on the Volta architecture, from the same authors.
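To connect the sector-granularity claims above to something testable, here is a minimal sketch (our own; the kernel name and sizes are hypothetical) of an access pattern whose memory traffic should distinguish sectored from non-sectored fetches. Each load touches one float per 128-byte line; on sectored hardware only the 32-byte sector containing that float should be fetched from L2 or DRAM, which sector-level metrics in a profiler such as Nsight Compute can confirm.

```cuda
__global__ void one_float_per_line(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int idx = i * 32;  // 32 floats = 128 B apart: one element per cache line
    if (idx < n) {
        // If full 128-byte lines were fetched on each miss, this kernel
        // would move ~32x more data from L2/DRAM than it uses; with
        // 32-byte sectors, the overfetch is only ~8x (32 B fetched per
        // 4 B used).
        out[i] = in[idx];
    }
}
```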