Note: L1 Cache Behavior

Caching Mutable Data in L1

It is generally a good mental and programming model to treat the L1 cache as incoherent: typical load instructions bypass the L1 cache, and only loads of read-only data are eligible for caching there. However, this is not the full story.
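
To make the read-only distinction concrete, here is a minimal CUDA sketch, assuming the standard read-only hints (`const __restrict__` and the `__ldg()` intrinsic) as the mechanism for marking loads as eligible for the L1 / read-only cache path. The kernel names are hypothetical, not from this note.

```cuda
#include <cuda_runtime.h>

// Plain loads: the compiler cannot prove `in` is read-only for the lifetime
// of the kernel, so these loads may take the default (non-L1-cached) path.
__global__ void scale_plain(float* out, float* in, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = alpha * in[i];
}

// Read-only loads: `const __restrict__` (reinforced here by the __ldg()
// intrinsic) promises the data is immutable for the kernel's duration,
// making the loads eligible for caching on the read-only path.
__global__ void scale_readonly(float* __restrict__ out,
                               const float* __restrict__ in,
                               float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = alpha * __ldg(&in[i]);
}

int main() {
    const int n = 1 << 20;
    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));

    scale_plain<<<(n + 255) / 256, 256>>>(out, in, 2.0f, n);
    scale_readonly<<<(n + 255) / 256, 256>>>(out, in, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```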

L1 Cache Hit Granularity

Once again, it is still a generally good mental model to treat the L1 cache as supplying data from at most one contiguous 128-byte (32-word) cache line per cycle. That said, we have seen some evidence online suggesting that the L1 cache can in fact perform multiple tag lookups in parallel in a single cycle. Even so, touching only one cache line per load instruction remains a reasonable default policy when designing high-performance kernels.
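
As a rough illustration of that default policy, the following CUDA sketch (kernel names are hypothetical) contrasts a warp-contiguous copy, where a load instruction's 32 four-byte accesses stay within one 128-byte line, with a 128-byte-strided copy, where a single load instruction can touch up to 32 distinct lines.

```cuda
#include <cuda_runtime.h>

// One line per load: thread t of a warp reads word (warp base + t), so the
// 32 threads' 4-byte loads all fall within a single 128-byte cache line.
__global__ void copy_contiguous(float* __restrict__ out,
                                const float* __restrict__ in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Many lines per load: with a stride of 32 words (128 bytes), each thread of
// a warp lands in a different cache line, so a single load instruction can
// touch up to 32 lines -- the pattern to avoid by default.
__global__ void copy_strided(float* __restrict__ out,
                             const float* __restrict__ in,
                             int n, int stride) {
    long i = (long)(blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}

int main() {
    const int n = 1 << 22;
    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));

    copy_contiguous<<<(n + 255) / 256, 256>>>(out, in, n);
    copy_strided<<<(n / 32 + 255) / 256, 256>>>(out, in, n, 32);
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```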

Further Reading

If you’re interested in investigating this topic yourself, here are some references which may be relevant: