GPU Memory Hierarchy

#cuda #memory #shared-memory #unified-memory #concept

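In heterogeneous computing systems, using memory effectively is as important as maximizing the utilization of the compute units. Besides caches, a GPU contains several kinds of programmable on-chip memory. This note covers DRAM, on-chip memory, caches, and unified memory, in that order.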

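Key Points Overview

Item	Key point
DRAM	The GPU and the CPU each have directly attached DRAM. From the device-code point of view, the GPU's DRAM is called global memory and the CPU's DRAM is called system memory or host memory.
Unified virtual memory space	The CPU and all GPUs share a single virtual address space; the memory an address belongs to can be determined from the address itself.
Register file	Private to each SM; holds thread-local variables; allocated by the compiler, per thread.
Shared memory	Private to each SM; lets threads within a thread block or cluster exchange data; allocated per block.
Capacity limits	Register file and shared memory capacity is finite; if a block needs more registers in total than the SM's register file provides, the kernel cannot launch.
L1 cache	One per SM; part of the unified data cache.
L2 cache	Larger; shared by all SMs on the GPU.
Constant cache	One per SM; caches global memory values declared constant; kernel parameters may also be placed there.
Unified memory	Allocations are accessible from the CPU or the GPU; the runtime or hardware migrates data when needed.
Mapped memory (exception)	CPU memory that the GPU can access directly, but over PCIe or NVLink, so latency is high; not an efficient substitute.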

DRAM Memory in Heterogeneous Systems

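Both the GPU and the CPU have their own directly attached DRAM chips; in a multi-GPU system, each GPU has its own memory. From the point of view of device code, the DRAM attached to the GPU is called global memory, because every SM in the GPU can access it. The DRAM attached to the CPU is called system memory or host memory.

The name "global memory" only means that all SMs within the GPU can access it; it does not mean the memory is accessible from everywhere in the system.
Like CPUs, GPUs use virtual memory addressing.
On all currently supported systems, the CPU and the GPUs use a single unified virtual memory space:
- Each GPU's virtual address range is unique and does not overlap with the CPU's range or with any other GPU's range.
- For any virtual address, it is possible to tell whether it belongs to GPU memory or system memory; on a multi-GPU system, it is also possible to tell which GPU it belongs to.
CUDA provides APIs to allocate GPU memory and CPU memory, and to copy data between CPU and GPU, within a GPU, and between GPUs; data locality can be controlled explicitly when needed.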

       Unified Virtual Memory Space (single virtual address space)
  ┌──────────────────────────────┬──────────────────────────────┐
  │             CPU              │             GPU              │
  │   ┌──────────────────────┐   │   ┌──────────────────────┐   │
  │   │ system / host memory │   │   │    global memory     │   │
  │   │        (DRAM)        │   │   │        (DRAM)        │   │
  │   └──────────────────────┘   │   │  visible to all SMs  │   │
  │                              │   └──────────────────────┘   │
  └──────────────────────────────┴──────────────────────────────┘
     Address ranges are unique and non-overlapping, so an address
     alone identifies where the data resides

Tip

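Because of the unified virtual memory space, a pointer value is enough to tell whether it points to CPU memory or to a particular GPU's memory. This is what allows CUDA to move and access data correctly.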
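
The following is a minimal sketch of how a program can ask the runtime where a pointer lives, using `cudaPointerGetAttributes`. The helper names `memoryTypeName` and `describe` are made up for this note, error handling is reduced to a single check, and the behavior for plain `malloc` pointers assumes CUDA 11 or newer (where such pointers are reported as unregistered rather than as an error).

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Maps the memory type reported by the runtime to a readable label.
const char* memoryTypeName(cudaMemoryType type) {
    switch (type) {
        case cudaMemoryTypeDevice:       return "device (global memory)";
        case cudaMemoryTypeHost:         return "host (pinned)";
        case cudaMemoryTypeManaged:      return "managed";
        case cudaMemoryTypeUnregistered: return "host (not registered with CUDA)";
        default:                         return "unknown";
    }
}

// Hypothetical helper: prints where `ptr` lives according to the runtime.
void describe(const char* label, const void* ptr) {
    cudaPointerAttributes attr{};
    cudaError_t err = cudaPointerGetAttributes(&attr, ptr);
    if (err != cudaSuccess) {
        std::printf("%-14s error: %s\n", label, cudaGetErrorString(err));
        return;
    }
    std::printf("%-14s type=%s, device=%d\n", label, memoryTypeName(attr.type), attr.device);
}

int main() {
    const size_t bytes = 1024 * sizeof(float);
    float* onGpu = nullptr;               // global memory of the current GPU
    float* pinned = nullptr;              // page-locked system memory
    void* pageable = std::malloc(bytes);  // ordinary system memory

    cudaMalloc(&onGpu, bytes);
    cudaMallocHost(&pinned, bytes);

    describe("cudaMalloc", onGpu);
    describe("cudaMallocHost", pinned);
    describe("malloc", pageable);

    cudaFree(onGpu);
    cudaFreeHost(pinned);
    std::free(pageable);
    return 0;
}
```

The same query is what lets `cudaMemcpy` with `cudaMemcpyDefault` infer the copy direction from the two pointers.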

On-Chip Memory in GPUs

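In addition to global memory, each GPU has some on-chip memory. Each SM has its own register file and shared memory. These memories are part of the SM and can be accessed extremely quickly by threads executing on that SM.

Register file: stores thread-local variables, usually allocated by the compiler; allocation is per thread.
Shared memory: accessible by all threads within the same thread block or cluster, and used to exchange data between those threads; allocation is per block (one allocation common to the whole thread block, unlike registers, which are per thread).
The SM's register file, shared memory space, and L1 cache are shared among all threads within a thread block.
The register file and the unified data cache have finite capacity. Their sizes, and how the unified data cache can be split between L1 and shared memory, depend on the Compute Capability (see Memory Information per Compute Capability).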
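
To make the per-thread versus per-block distinction concrete, here is a small sketch of a block-level sum. The local variable `i` lives in a register (one copy per thread), while `tile` is a single shared-memory array per thread block that the threads use to exchange partial sums. The kernel name `blockSum`, the wrapper `sumOnGpu`, and the block size of 256 are assumptions of this example; error checking is omitted and the input is assumed to be non-empty.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 256;  // threads per block; the reduction below needs a power of two

// Each block sums up to kThreads elements with a tree reduction in shared memory.
__global__ void blockSum(const float* in, float* blockTotals, int n) {
    __shared__ float tile[kThreads];                 // one allocation per thread block
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // register, one copy per thread
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                                 // make all writes to tile visible

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) {
        blockTotals[blockIdx.x] = tile[0];
    }
}

// Hypothetical host wrapper: runs blockSum and adds the per-block totals on the CPU.
double sumOnGpu(const std::vector<float>& values) {
    const int n = static_cast<int>(values.size());
    const int blocks = (n + kThreads - 1) / kThreads;
    std::vector<float> totals(blocks);

    float* dIn = nullptr;
    float* dTotals = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dTotals, blocks * sizeof(float));
    cudaMemcpy(dIn, values.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, kThreads>>>(dIn, dTotals, n);

    cudaMemcpy(totals.data(), dTotals, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dTotals);

    double total = 0.0;
    for (float t : totals) total += t;
    return total;
}

int main() {
    std::vector<float> ones(1 << 20, 1.0f);
    std::printf("sum = %.0f\n", sumOnGpu(ones));
    return 0;
}
```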

                        GPU
  ┌───────────────────────────────────────────────┐
  │   SM 0                    SM 1          ...   │
  │  ┌─────────────────┐     ┌─────────────────┐  │
  │  │ register file   │     │ register file   │  │ ← per thread, very fast
  │  │ shared memory   │     │ shared memory   │  │ ← per block, very fast
  │  │ L1 cache        │     │ L1 cache        │  │ ← part of the unified data cache
  │  │ constant cache  │     │ constant cache  │  │ ← separate from L1
  │  └─────────────────┘     └─────────────────┘  │
  │    L2 cache (shared by all SMs on the GPU)    │
  │             global memory (DRAM)              │
  └───────────────────────────────────────────────┘
     Faster and smaller toward the top; slower and larger toward the bottom

Important

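Register allocation determines whether a block can launch. To schedule a thread block onto an SM, the following must hold: (registers needed per thread) × (threads in the block) ≤ (registers available in the SM). If the thread block needs more registers than the register file holds, the kernel is not launchable, and the number of threads in the thread block must be reduced before it can launch.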
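
The condition above can be checked from host code. The sketch below reads the per-thread register count of a kernel with `cudaFuncGetAttributes` and compares it against `cudaDeviceProp::regsPerBlock` for a few candidate block sizes. The kernel `scale` and the helper `fitsRegisterFile` are invented for this note, and the plain multiplication is a simplification: real hardware allocates registers in coarser units, so treat the result as an estimate.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel whose register usage we want to inspect.
__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Launch condition from the note: registers per thread * threads per block <= available registers.
bool fitsRegisterFile(int regsPerThread, int threadsPerBlock, int regsAvailable) {
    return static_cast<long long>(regsPerThread) * threadsPerBlock <= regsAvailable;
}

int main() {
    cudaFuncAttributes attr{};
    cudaDeviceProp prop{};
    if (cudaFuncGetAttributes(&attr, scale) != cudaSuccess ||
        cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::printf("no usable CUDA device\n");
        return 1;
    }

    std::printf("scale: %d registers per thread, %d registers available per block\n",
                attr.numRegs, prop.regsPerBlock);

    const int candidates[] = {256, 512, 1024};
    for (int threads : candidates) {
        bool ok = fitsRegisterFile(attr.numRegs, threads, prop.regsPerBlock);
        std::printf("  %4d threads per block: %s\n", threads, ok ? "fits" : "too many registers");
    }
    return 0;
}
```

Besides shrinking the block, the compiler can be asked to use fewer registers per thread, for example with `__launch_bounds__` on the kernel or the nvcc option `--maxrregcount`, at the possible cost of spilling to slower memory.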

Warning

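Registers and shared memory are allocated at different granularities: registers are per thread, while shared memory is per thread block (the whole block shares a single allocation). Confusing the two allocation levels is a common mistake.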

Caches

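In addition to the programmable memories, GPUs have both L1 and L2 caches.

L1 cache: one per SM; it is part of that SM's unified data cache.
L2 cache: larger, and shared by all SMs in the GPU (visible in the GPU block diagram).
Constant cache: each SM also has a separate constant cache, used to cache values in global memory that have been declared constant for the lifetime of a kernel.
- The compiler may also place kernel parameters in constant memory.
- Benefit: kernel parameters can be cached in the SM separately from the L1 data cache, which can improve kernel performance.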
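
As an illustration of data declared constant for the lifetime of a kernel, the sketch below stores polynomial coefficients in a `__constant__` array and fills it from the host with `cudaMemcpyToSymbol`. The names `kCoeffs`, `horner`, and `polyEval` and the coefficient values are invented for this note; error checking is omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical coefficient table; read-only while a kernel runs, served through the constant cache.
__constant__ float kCoeffs[4];

// Evaluates c[0] + c[1]*x + c[2]*x^2 + c[3]*x^3 with Horner's rule.
__host__ __device__ inline float horner(const float* c, float x) {
    return ((c[3] * x + c[2]) * x + c[1]) * x + c[0];
}

__global__ void polyEval(const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = horner(kCoeffs, x[i]);  // every thread reads the same coefficients
}

int main() {
    const int n = 4;
    const float hCoeffs[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    const float hX[n] = {0.0f, 1.0f, 2.0f, 3.0f};
    float hY[n] = {};

    // Copy the coefficients into constant memory before launching the kernel.
    cudaMemcpyToSymbol(kCoeffs, hCoeffs, sizeof(hCoeffs));

    float* dX = nullptr;
    float* dY = nullptr;
    cudaMalloc(&dX, sizeof(hX));
    cudaMalloc(&dY, sizeof(hY));
    cudaMemcpy(dX, hX, sizeof(hX), cudaMemcpyHostToDevice);

    polyEval<<<1, n>>>(dX, dY, n);
    cudaMemcpy(hY, dY, sizeof(hY), cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; ++i) {
        std::printf("p(%.0f): gpu=%.0f, host check=%.0f\n", hX[i], hY[i], horner(hCoeffs, hX[i]));
    }
    cudaFree(dX);
    cudaFree(dY);
    return 0;
}
```

Constant memory suits this access pattern because all threads read the same address at the same time; whether it actually helps a given kernel is something to measure.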

   thread → L1 cache (per-SM, unified data cache)
                 │
                 └──────→ L2 cache (shared by the whole GPU)
                               │
                               └──────→ global memory (DRAM)

   constant data / kernel parameters → constant cache (per-SM, separate from L1)

Tip

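When kernel parameters are placed in constant memory (the compiler may do this on its own), they are cached in the SM on a path separate from the L1 data cache, so they do not take up L1 capacity and can be accessed efficiently.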

Unified Memory

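When an application explicitly allocates memory on the GPU or on the CPU, only code running on the corresponding device can access it: CPU memory can only be accessed by CPU code, and GPU memory can only be accessed by kernels running on the GPU. In that case, CUDA APIs must be used to copy data explicitly between CPU and GPU so that the data is in the right memory at the right time.

Unified memory makes an allocation accessible from either the CPU or the GPU.
The CUDA runtime or the underlying hardware enables the access, or relocates the data to the right place when needed.
Even with unified memory, the best performance comes from keeping migration to a minimum and, as far as possible, accessing data from the processor that is directly attached to the memory holding it.
How data is accessed and exchanged between memory spaces depends on the hardware characteristics of the system (for the different classes of unified memory systems, see the Unified Memory chapter).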
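
A minimal sketch of the explicit model described above: the data starts in system memory, is copied to global memory, is modified by a kernel, and is copied back. The kernel `addOne` and the wrapper `addOneOnGpu` are made-up names, the input is assumed to be non-empty, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void addOne(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Hypothetical wrapper: explicit allocate, copy in, compute, copy out.
std::vector<float> addOneOnGpu(std::vector<float> host) {
    const int n = static_cast<int>(host.size());
    const size_t bytes = n * sizeof(float);

    float* dev = nullptr;
    cudaMalloc(&dev, bytes);                                      // GPU memory, not usable from CPU code
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);  // system memory -> global memory

    addOne<<<(n + 255) / 256, 256>>>(dev, n);

    // This copy runs after the kernel because both use the default stream.
    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);  // global memory -> system memory
    cudaFree(dev);
    return host;
}

int main() {
    std::vector<float> result = addOneOnGpu(std::vector<float>(1024, 1.0f));
    std::printf("result[0] = %.1f\n", result[0]);
    return 0;
}
```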
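
For comparison, a sketch of the same kind of work with unified memory: a single pointer from `cudaMallocManaged` is written by the CPU, updated by a kernel, and read again by the CPU, with no explicit copy. The names `doubleAll` and `doubledSum` are invented for this note. The `cudaDeviceSynchronize` call before the CPU reads the data is the conservative choice that works across unified memory system classes.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void doubleAll(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical helper: fills a managed array on the CPU, doubles it on the GPU,
// and returns the sum computed on the CPU. Returns -1 if allocation fails.
float doubledSum(int n, float value) {
    float* data = nullptr;
    if (cudaMallocManaged(&data, n * sizeof(float)) != cudaSuccess) {
        return -1.0f;
    }

    for (int i = 0; i < n; ++i) data[i] = value;   // CPU writes through the pointer

    doubleAll<<<(n + 255) / 256, 256>>>(data, n);  // GPU uses the same pointer
    cudaDeviceSynchronize();                       // wait before the CPU touches the data again

    float sum = 0.0f;
    for (int i = 0; i < n; ++i) sum += data[i];    // CPU reads the results
    cudaFree(data);
    return sum;
}

int main() {
    std::printf("sum = %.1f\n", doubledSum(1024, 1.5f));
    return 0;
}
```

The code is shorter than the explicit version, but each change of accessing processor may trigger a migration, which is why the note above about minimizing data movement still applies.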

  Explicit allocation:             Unified memory:
  ┌────────┐  copy ┌────────┐    ┌──────────────────────┐
  │CPU mem │ <───> │GPU mem │    │ One allocation for   │
  │CPU only│       │GPU only│    │ both CPU and GPU;    │
  └────────┘       └────────┘    │ runtime/HW migrates  │
   manual cudaMemcpy required    │ data when needed     │
                                 └──────────────────────┘

Warning

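Exception, mapped memory: mapped memory is CPU memory allocated with special properties that let the GPU access it directly. These accesses go over PCIe or NVLink, however, and the GPU cannot use parallelism to hide the higher latency and lower bandwidth. Mapped memory is therefore not an efficient replacement for unified memory or for placing data in the appropriate memory space.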
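
A hedged sketch of what mapped memory looks like in code: system memory is allocated with `cudaHostAlloc` and the `cudaHostAllocMapped` flag, and `cudaHostGetDevicePointer` returns the address the kernel should use. The names `increment` and `lastAfterIncrement` are made up for this note, and `n` is assumed to be positive. Every access the kernel makes here crosses the CPU-GPU interconnect, so this is shown for completeness rather than as a recommended pattern.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void increment(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;  // reads and writes system memory over PCIe/NVLink
}

// Hypothetical helper: fills mapped host memory with 0..n-1, increments it on the GPU,
// and returns the last element as seen by the CPU. Returns -1 on failure.
int lastAfterIncrement(int n) {
    cudaDeviceProp prop{};
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess || !prop.canMapHostMemory) {
        return -1;
    }

    int* hostPtr = nullptr;
    if (cudaHostAlloc((void**)&hostPtr, n * sizeof(int), cudaHostAllocMapped) != cudaSuccess) {
        return -1;
    }
    for (int i = 0; i < n; ++i) hostPtr[i] = i;

    int* deviceView = nullptr;  // device-side address of the same system memory
    cudaHostGetDevicePointer((void**)&deviceView, hostPtr, 0);

    increment<<<(n + 255) / 256, 256>>>(deviceView, n);
    cudaDeviceSynchronize();    // make sure the GPU writes have landed

    int last = hostPtr[n - 1];
    cudaFreeHost(hostPtr);
    return last;
}

int main() {
    std::printf("last element = %d\n", lastAfterIncrement(256));
    return 0;
}
```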

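Exam / Quiz Key Points

Scenario / keyword	Answer
What is the GPU's DRAM called from the device-code point of view?	Global memory (accessible by all SMs within the GPU)
What is the CPU's DRAM called?	System memory or host memory
Is global memory accessible from everywhere in the system?	No; the name only means all SMs within the GPU can access it
How do the CPU and GPU virtual address spaces relate?	They share a single unified virtual memory space; each range is unique and non-overlapping
What can be determined from a virtual address?	Whether it belongs to GPU memory or system memory; with multiple GPUs, also which GPU
What does the register file store? Who allocates it? At what granularity?	Thread-local variables; the compiler; per thread
What is shared memory for? At what granularity?	Exchanging data between threads in a thread block or cluster; allocated per block
What happens if a block needs more registers than the register file holds?	The kernel cannot launch; the thread count must be reduced
What is the register condition for scheduling a block onto an SM?	Registers per thread × number of threads ≤ registers available in the SM
What is the scope of L1 and L2?	L1 is per SM (part of the unified data cache); L2 is larger and shared by the whole GPU
What does the constant cache cache?	Global memory declared constant; the compiler may also place kernel parameters there
Who can access explicitly allocated memory?	Only the corresponding device (CPU or GPU)
What characterizes unified memory?	Accessible from CPU and GPU; the runtime or hardware migrates data when needed
What is the rule for best performance with unified memory?	Minimize migration and access data from the directly attached processor
What is mapped memory? Is it efficient?	CPU memory the GPU can access directly; it goes over PCIe or NVLink, has high latency, and is not an efficient substitute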
