## 6.894 Accelerated Computing Lecture 4: Memory Continued...

Jonathan Ragan-Kelley



1 load / 150 ops

360 GB/sec
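A back-of-the-envelope reading of the two numbers above, assuming each load is a 4-byte (32-bit) word (the word size is an assumption, not stated on the slide):

$$
\frac{360\ \text{GB/s}}{4\ \text{B/load}} = 90 \times 10^{9}\ \text{loads/s},
\qquad
90 \times 10^{9}\ \text{loads/s} \times 150\ \frac{\text{ops}}{\text{load}} = 13.5 \times 10^{12}\ \text{ops/s}
$$

Under that assumption, a kernel needs on the order of 150 arithmetic operations per word fetched from DRAM to keep the compute units busy.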



high **bandwidth**, limited **capacity** 

high clocks & wide interface

## Memory is striped across channels for high bandwidth on contiguous access
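A minimal sketch of why striping rewards contiguous access. The channel count, stripe size, and the names `channel_of` and `channels_touched` are hypothetical, chosen only for illustration; real GPUs use undocumented (often hashed) address-to-channel mappings.

```cuda
#include <cstdint>

// Hypothetical parameters, for illustration only.
constexpr uint64_t kNumChannels = 8;
constexpr uint64_t kStripeBytes = 256;

// Simple interleaving: consecutive stripes map to consecutive channels.
__host__ __device__ inline uint64_t channel_of(uint64_t byte_addr) {
    return (byte_addr / kStripeBytes) % kNumChannels;
}

// Count the distinct channels touched by `n` accesses spaced
// `stride_bytes` apart, starting at address 0.
inline int channels_touched(uint64_t n, uint64_t stride_bytes) {
    bool seen[kNumChannels] = {};
    int count = 0;
    for (uint64_t i = 0; i < n; ++i) {
        uint64_t c = channel_of(i * stride_bytes);
        if (!seen[c]) { seen[c] = true; ++count; }
    }
    return count;
}
```

With these assumed parameters, 512 consecutive 4-byte words spread across all 8 channels, while 512 accesses spaced 2048 bytes apart all land on a single channel and see only that channel's share of the bandwidth.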







gather: 32 × 32 bits/cycle

scatter: 32 × 32 bits/cycle

## How can we turn gather/scatter into dense load/store to DRAM?

## Approach 1: "coalescing" at the memory controller
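A sketch of what coalescing buys, assuming 4-byte words, a 128-byte transaction size (32 lanes × 32 bits), and a 128-byte-aligned base address. The helper `segments_per_warp` and both kernels are illustrative names, not part of any library.

```cuda
#include <cstdint>

constexpr int kWarpSize = 32;
constexpr uint64_t kSegmentBytes = 128;  // assumed transaction size

// Number of distinct 128-byte segments touched when lane i of a warp
// loads the 4-byte word at element index i * stride_elems.
// Addresses are non-decreasing in the lane id, so distinct segments can
// be counted by comparing each lane's segment with the previous one.
inline int segments_per_warp(uint64_t stride_elems) {
    int count = 0;
    uint64_t prev = UINT64_MAX;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        uint64_t seg = (lane * stride_elems * sizeof(float)) / kSegmentBytes;
        if (seg != prev) { ++count; prev = seg; }
    }
    return count;
}

// Contiguous: neighbouring lanes read neighbouring words, so a warp's
// 32 loads fall in one segment and can be served as one transaction.
__global__ void copy_contiguous(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Gather: lane i reads in[idx[i]]; the number of segments (and hence
// DRAM transactions) depends entirely on the contents of idx.
__global__ void copy_gather(const float* in, const int* idx, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[idx[i]];
}
```

Under these assumptions a unit-stride warp touches 1 segment, a stride of 2 touches 2, and a stride of 32 touches 32, one per lane, which is the worst case.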











Block replacement amortized over potentially many accesses to the same line while cached.




## Rule of thumb: we can often idealize GPUs in terms of aggregate throughputs



high **bandwidth**, limited **capacity** 

high clocks & wide interface

**aggregate** large transactions for DRAM

streaming access

large-scale reuse




#### L1 SRAM




128 KB per SM ($\times$ 48 SMs = 6 MB)

128 bytes / cycle / SM: 1 warp-wide ld/st per cycle (per core: 1 every 4 cycles)

128 bytes / cycle / SM $\times$ 48 SMs $\times$ 2.18 GHz $\approx$ 13.4 TB/s




Allows data sharing & communication across warps running simultaneously on the same SM.

In **CUDA**, grouping warps to allow this sharing is the role of **thread blocks**.
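A minimal sketch of warps in one thread block communicating through the SM's shared SRAM. The block size of 128 threads (4 warps) and the names `block_sum` and `run_block_sum` are illustrative choices; error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlockSize = 128;  // 4 warps per block (illustrative choice)

// Each block sums kBlockSize inputs. Warps exchange partial sums through
// a __shared__ array, which is visible to every thread in the block.
__global__ void block_sum(const float* in, float* block_out) {
    __shared__ float buf[kBlockSize];
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * kBlockSize + tid];
    __syncthreads();  // make every warp's writes visible to the others

    // Tree reduction: halve the number of active threads each step.
    for (int s = kBlockSize / 2; s > 0; s /= 2) {
        if (tid < s) buf[tid] += buf[tid + s];
        __syncthreads();
    }
    if (tid == 0) block_out[blockIdx.x] = buf[0];
}

// Host helper: returns one partial sum per block.
// Assumes in.size() is a non-zero multiple of kBlockSize.
std::vector<float> run_block_sum(const std::vector<float>& in) {
    int blocks = static_cast<int>(in.size()) / kBlockSize;
    float *d_in, *d_out;
    cudaMalloc(&d_in, in.size() * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, in.data(), in.size() * sizeof(float), cudaMemcpyHostToDevice);
    block_sum<<<blocks, kBlockSize>>>(d_in, d_out);
    std::vector<float> out(blocks);
    cudaMemcpy(out.data(), d_out, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

The `__syncthreads()` barriers are what make the sharing safe: without them, one warp could read `buf` before another warp has written its element.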



#### L1 Cache is incoherent between SMs




**Conventional processors** enforce **cache coherence** via complex protocols built into the hardware.

Accelerators often forgo coherence in exchange for performance & scalability, at the cost of programming complexity.

## L1 Cache: opt-in via explicit instructions

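```
ld.local        // per-thread local loads
st.local        // per-thread local stores

ld.global.ca    // cache at all levels (L1 and L2)
ld.global.cg    // cache at global level (L2 only, bypassing L1)
```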

#### From CUDA C:

Read-only data: `const __restrict__` pointers, or the `__ldg()` intrinsic

Local (per-thread) data

Textures
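A sketch of the two read-only routes from CUDA C, using SAXPY as the example. The kernel names and the host helper `run_saxpy` are illustrative; whether the compiler actually emits a cached read-only load for the `const __restrict__` version depends on the compiler and target architecture, so treat that as a hint rather than a guarantee.

```cuda
#include <cuda_runtime.h>
#include <vector>

// `const` plus `__restrict__` tells the compiler that x is read-only and
// not aliased by y, so it may route loads of x through the read-only path.
__global__ void saxpy_restrict(int n, float a,
                               const float* __restrict__ x,
                               float* __restrict__ y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// __ldg() requests the read-only cached load explicitly. The data behind
// x must not be written for the lifetime of the kernel.
__global__ void saxpy_ldg(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * __ldg(&x[i]) + y[i];
}

// Host helper: runs one of the two kernels and returns the updated y.
// Assumes x and y are non-empty and the same length; error checks omitted.
std::vector<float> run_saxpy(bool use_ldg, float a,
                             const std::vector<float>& x, std::vector<float> y) {
    int n = static_cast<int>(x.size());
    size_t bytes = n * sizeof(float);
    float *d_x, *d_y;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);
    cudaMemcpy(d_x, x.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y.data(), bytes, cudaMemcpyHostToDevice);
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    if (use_ldg) saxpy_ldg<<<blocks, threads>>>(n, a, d_x, d_y);
    else         saxpy_restrict<<<blocks, threads>>>(n, a, d_x, d_y);
    cudaMemcpy(y.data(), d_y, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);
    return y;
}
```

Both kernels compute the same result; they differ only in how the loads of `x` are requested.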

### L1 SRAM: also used as explicit scratchpad

Scratchpad Address Space

Each block (SM) only sees its own scratchpad
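A small sketch of the per-block scratchpad address space. Every block declares the same `__shared__` variable, yet each block reads back only the value it wrote itself. The names `scratchpad_is_private` and `run_scratchpad_demo` are illustrative, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

__global__ void scratchpad_is_private(int* out) {
    __shared__ int slot;                      // one copy per thread block
    if (threadIdx.x == 0) slot = blockIdx.x;  // each block stores its own id
    __syncthreads();                          // wait for thread 0's write
    // Every thread sees its own block's value, never another block's.
    out[blockIdx.x * blockDim.x + threadIdx.x] = slot;
}

// Host helper: launches `blocks` blocks of `threads` threads and returns
// what each thread read from its block's scratchpad.
std::vector<int> run_scratchpad_demo(int blocks, int threads) {
    int n = blocks * threads;
    int* d_out;
    cudaMalloc(&d_out, n * sizeof(int));
    scratchpad_is_private<<<blocks, threads>>>(d_out);
    std::vector<int> out(n);
    cudaMemcpy(out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return out;
}
```

Because no block can address another block's scratchpad, data exchanged between blocks has to go through global memory instead.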