Execution Model and SIMT

#cuda #execution-model #thread-hierarchy #simt #warp #concept

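Key Points Overview

Item	Key points
Heterogeneous system	The system contains both a CPU and a GPU; host = CPU + host memory, device = GPU + device memory; an application always starts executing on the CPU
Kernel and launch	A function that runs on the GPU is called a kernel; starting a kernel is called a launch, which amounts to starting many threads on the GPU that execute the kernel code in parallel
GPU hardware model	GPU = a collection of SMs, grouped into GPCs; each SM has a register file, a unified data cache (which backs shared memory and L1), and functional units
Thread blocks and grids	threads → thread block → grid; each can be 1-, 2-, or 3-dimensional; threads of the same block execute on a single SM, can synchronize, and share shared memory
Block scheduling	No scheduling order is guaranteed among the blocks of a grid, so blocks must not have data dependencies on one another; they must be executable in any order (in parallel or in series)
Thread block clusters	Optional level on CC 9.0+; the blocks of a cluster are scheduled simultaneously on the same GPC and can communicate through distributed shared memory
Warps and SIMT	Threads within a block are grouped into warps of 32; SIMT lets each thread follow its own control flow; on divergence, lanes are masked
Warp divergence	Threads of the same warp taking different branches is called warp divergence; lanes that do not take the branch are masked off, which lowers utilization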

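Heterogeneous Systems

The CUDA programming model assumes a heterogeneous computing system, that is, a system that contains both GPUs and CPUs. The CPU and the memory directly attached to it are called the host and host memory; the GPU and the memory directly attached to it are called the device and device memory. Part of an application's code runs on the GPU, but execution always starts on the CPU.

host code: code that runs on the CPU. It can call CUDA APIs to copy data between host memory and device memory, start code on the GPU, and wait for data copies or GPU code to finish.
device code: code that runs on the GPU.
kernel: a function invoked for execution on the GPU (the name is historical).
launch: the act of starting a kernel; think of it as starting many threads on the GPU that all execute the same kernel code in parallel.
The CPU and GPU can execute code at the same time; the best performance usually comes from maximizing the utilization of both.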

Tip

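GPU threads work much like CPU threads, but with some differences that matter for both correctness and performance (the most important being the warp / SIMT grouping and scheduling behavior described later).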

       Host (CPU side)                            Device (GPU side)
  ┌────────────────────────┐                ┌───────────────────────────┐
  │ host code (CPU)        │                │ device code (kernel)      │
  │                        │ launch kernel  │                           │
  │ 1. allocate/prepare ───┼───────────────▶│ many threads in parallel  │
  │ 2. cudaMemcpy H→D ─────┼───────────────▶│ read/write device memory  │
  │ 3. launch kernel       │                │                           │
  │ 4. wait for finish ◀───┼────────────────┤ kernel finishes           │
  │ 5. cudaMemcpy D→H ◀────┼────────────────┤                           │
  └───────────┬────────────┘                └─────────────┬─────────────┘
              │                                           │
         host memory                                device memory
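
The five steps in the diagram can be sketched as a small vector addition. This is a minimal illustration, not a reference implementation: the names `vecAdd`, `runVecAdd`, and `check`, the block size of 256, and the input values are all assumptions made for the example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails (illustrative helper).
static void check(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Device code: each thread adds one pair of elements.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Host code: runs the five steps and reports whether every result is correct.
bool runVecAdd(int n) {
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);

    // 1. Prepare data in host memory and allocate device memory.
    std::vector<float> hA(n, 1.0f), hB(n, 2.0f), hC(n, 0.0f);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    check(cudaMalloc(&dA, bytes), "cudaMalloc dA");
    check(cudaMalloc(&dB, bytes), "cudaMalloc dB");
    check(cudaMalloc(&dC, bytes), "cudaMalloc dC");

    // 2. Copy the inputs from host memory to device memory.
    check(cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice), "copy A");
    check(cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice), "copy B");

    // 3. Launch the kernel. The launch returns to the CPU right away, so the
    //    host could do other work here while the GPU runs.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(dA, dB, dC, n);
    check(cudaGetLastError(), "kernel launch");

    // 4. Wait for the GPU to finish.
    check(cudaDeviceSynchronize(), "cudaDeviceSynchronize");

    // 5. Copy the result from device memory back to host memory.
    check(cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost), "copy C");

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);

    for (float v : hC) {
        if (v != 3.0f) return false;
    }
    return true;
}

int main() {
    bool ok = runVecAdd(1 << 20);
    std::printf("%s\n", ok ? "vecAdd ok" : "vecAdd mismatch");
    return ok ? 0 : 1;
}
```

Built with something like `nvcc vec_add.cu -o vec_add` (the file name is arbitrary), the program starts on the CPU, as every CUDA application does, and only the body of `vecAdd` runs on the GPU.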

Important

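In some system-on-chip (SoC) systems, the CPU and GPU may sit in a single package; larger systems may have multiple CPUs or multiple GPUs. Host and device are logical roles and do not necessarily correspond to two separate chips.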

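GPU Hardware Model

Like any programming model, CUDA relies on a conceptual model of the underlying hardware. For CUDA programming purposes, a GPU can be viewed as a collection of Streaming Multiprocessors (SMs), which are organized into groups called Graphics Processing Clusters (GPCs).

Each SM contains:

local register file: holds thread-local variables.
unified data cache: the physical resource that backs both shared memory and the L1 cache; the split between the two can be configured at runtime.
several functional units: perform the actual computation.

The sizes of the various memories and the number of functional units per SM vary by GPU architecture.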

              GPU
  ┌────────────────────────────────────┐
  │  GPC                  GPC          │
  │ ┌──────────┐        ┌──────────┐   │      Inside each SM:
  │ │ SM  SM   │        │ SM  SM   │   │   ┌────────────────────┐
  │ │ SM  SM   │  ...   │ SM  SM   │   │   │ register file      │
  │ └──────────┘        └──────────┘   │   │ unified data cache │
  │            connected to            │   │  → shared memory   │
  │            GPU memory              │   │  → L1 cache        │
  └────────────────────────────────────┘   │ functional units   │
                                           └────────────────────┘
   GPU = a set of GPCs; GPC = a set of SMs
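
Because these quantities vary by architecture, it can help to query them at runtime instead of hard-coding them. The sketch below reads a few fields of `cudaDeviceProp` and then passes a shared-memory carveout hint for one kernel. The names `placeholderKernel` and `printDeviceSummary` and the 50% value are assumptions for illustration, and the carveout is only a preference that the driver may not honor.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel, used only as the target of the carveout hint below.
__global__ void placeholderKernel() {}

// Prints a few per-SM resources of the given device; returns false if the
// device properties cannot be queried.
bool printDeviceSummary(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return false;

    std::printf("name:                    %s\n", prop.name);
    std::printf("compute capability:      %d.%d\n", prop.major, prop.minor);
    std::printf("SM count:                %d\n", prop.multiProcessorCount);
    std::printf("32-bit registers per SM: %d\n", prop.regsPerMultiprocessor);
    std::printf("shared memory per SM:    %zu bytes\n",
                prop.sharedMemPerMultiprocessor);
    std::printf("warp size:               %d\n", prop.warpSize);

    // Hint that this kernel would like roughly half of the unified data cache
    // as shared memory. The value is a percentage; the result is reported but
    // not treated as fatal, since support depends on the device.
    cudaError_t err = cudaFuncSetAttribute(
        placeholderKernel, cudaFuncAttributePreferredSharedMemoryCarveout, 50);
    std::printf("carveout hint:           %s\n", cudaGetErrorString(err));
    return true;
}

int main() {
    return printDeviceSummary(0) ? 0 : 1;
}
```

The printed values describe the conceptual model only; as the warning below says, code should not rely on how a particular GPU lays these resources out physically.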

Warning

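The actual hardware layout of a GPU, and the way it physically executes the programming model, may differ from this conceptual model. Such differences do not affect the correctness of software written against the CUDA programming model, so code should not depend on specific hardware implementation details.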

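Thread Blocks and Grids

A kernel is typically launched with a very large number of threads (often millions). These threads are organized into blocks (thread blocks), and the thread blocks are in turn organized into a grid. All thread blocks in a grid have the same size and dimensions.

Thread blocks and grids can be 1-, 2-, or 3-dimensional, which makes it easy to map individual threads to units of work or data items.
execution configuration: the settings specified when launching a kernel, which define the dimensions of the grid and of the thread blocks; they may also include optional parameters such as cluster size, stream, and SM configuration settings.
built-in variables: each thread uses these to determine its position within its block, its block's position within the grid, and the dimensions of the block and grid. Together these give the thread a unique identity among all threads, which it uses to decide which data or computation it is responsible for.
All threads of a block execute on a single SM, so they can communicate and synchronize efficiently and exchange information through on-chip shared memory.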
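
As a sketch of how the execution configuration and the built-in variables fit together, the example below launches a 2D grid of 2D blocks on a stream and has each thread write its own flattened index. The names `fillIndex` and `runFillIndex`, the 16x16 block shape, and the data sizes are assumptions chosen for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread derives a unique (x, y) coordinate from the built-in variables.
__global__ void fillIndex(int* out, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column within the grid
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row within the grid
    if (x < width && y < height) {                  // the grid may overshoot
        out[y * width + x] = y * width + x;
    }
}

bool runFillIndex(int width, int height) {
    const size_t count = static_cast<size_t>(width) * height;
    int* dOut = nullptr;
    if (cudaMalloc(&dOut, count * sizeof(int)) != cudaSuccess) return false;

    cudaStream_t stream;
    if (cudaStreamCreate(&stream) != cudaSuccess) {
        cudaFree(dOut);
        return false;
    }

    // Execution configuration: 16x16 threads per block and enough blocks to
    // cover the data in both dimensions.
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);

    // <<<grid, block, dynamic shared memory bytes, stream>>>
    fillIndex<<<grid, block, 0, stream>>>(dOut, width, height);
    bool ok = cudaGetLastError() == cudaSuccess &&
              cudaStreamSynchronize(stream) == cudaSuccess;

    std::vector<int> hOut(count, -1);
    ok = ok && cudaMemcpy(hOut.data(), dOut, count * sizeof(int),
                          cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaStreamDestroy(stream);
    cudaFree(dOut);

    for (size_t i = 0; ok && i < count; ++i) {
        ok = (hOut[i] == static_cast<int>(i));
    }
    return ok;
}

int main() {
    bool ok = runFillIndex(100, 37);  // deliberately not multiples of 16
    std::printf("%s\n", ok ? "fillIndex ok" : "fillIndex mismatch");
    return ok ? 0 : 1;
}
```

The bounds check matters because the grid is rounded up to whole blocks, so some threads fall outside the data.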

   thread ──┐
   thread   ├─ 32 threads ───────▶  warp         (hardware scheduling unit)
   thread ──┘
     ⋮
   group of threads ─────────────▶  thread block (same SM, can synchronize, share shared memory)
     ⋮
   group of blocks ──────────────▶  grid         (all threads of one kernel launch)
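
One way to see block-level cooperation is a per-block sum: the threads of a block stage values in shared memory, synchronize with `__syncthreads()`, and reduce them to one partial result. This is a minimal sketch that assumes a power-of-two block size of 256; the names `blockSum`, `runBlockSum`, and `kBlockSize` are illustrative. Each block writes only its own partial sum, and the host adds the partial sums afterward.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kBlockSize = 256;  // assumed power of two, used for the launch

// Each block reduces its slice of the input to one partial sum.
__global__ void blockSum(const float* in, float* partial, int n) {
    __shared__ float tile[kBlockSize];  // visible to all threads of the block
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    tile[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                    // all loads are done before reducing

    // Tree reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) tile[tid] += tile[tid + stride];
        __syncthreads();                // every thread reaches this barrier
    }
    if (tid == 0) partial[blockIdx.x] = tile[0];
}

// Sums n ones on the GPU and returns the total (or -1 on a CUDA error).
float runBlockSum(int n) {
    const int blocks = (n + kBlockSize - 1) / kBlockSize;
    std::vector<float> hIn(n, 1.0f), hPartial(blocks, 0.0f);

    float *dIn = nullptr, *dPartial = nullptr;
    if (cudaMalloc(&dIn, n * sizeof(float)) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&dPartial, blocks * sizeof(float)) != cudaSuccess) {
        cudaFree(dIn);
        return -1.0f;
    }
    cudaMemcpy(dIn, hIn.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, kBlockSize>>>(dIn, dPartial, n);
    cudaError_t err = cudaDeviceSynchronize();

    cudaMemcpy(hPartial.data(), dPartial, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dPartial);
    if (err != cudaSuccess) return -1.0f;

    float total = 0.0f;                 // combine per-block results on the host
    for (float p : hPartial) total += p;
    return total;
}

int main() {
    std::printf("sum = %.1f\n", runBlockSum(1000));
    return 0;
}
```

Note that `__syncthreads()` sits outside the `if (tid < stride)` branch so that every thread of the block reaches it.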

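Scheduling Across Blocks: No Ordering Guarantee

A grid may contain millions of thread blocks, while the GPU executing it may have only tens to hundreds of SMs. Each thread block is executed by a single SM and, in most cases, runs to completion on that SM. No scheduling order is guaranteed among blocks, so a block must not depend on the results of other blocks: those blocks might not be scheduled until this block has finished.

Blocks are assigned to available SMs and may execute in any order.
The CUDA model requires that thread blocks be executable in any order, in parallel or in series. This lets a grid of any size run on a GPU of any size, whether it has a single SM or a very large number of them.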
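
A grid-stride loop is one common way to write a kernel whose result does not depend on how many blocks there are or in what order they run. The sketch below runs the same kernel with three different grid sizes and checks that the output is identical each time. The names `scale` and `runScale` and the sizes used are assumptions for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread handles elements i, i + stride, i + 2 * stride, ...
// No thread reads anything written by another block.
__global__ void scale(float* data, int n, float factor) {
    int stride = gridDim.x * blockDim.x;  // total number of threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        data[i] *= factor;
    }
}

// Doubles n ones using the given number of blocks; true if all results are 2.
bool runScale(int n, int blocks) {
    std::vector<float> h(n, 1.0f);
    float* d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return false;
    cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<blocks, 128>>>(d, n, 2.0f);
    bool ok = cudaDeviceSynchronize() == cudaSuccess;

    cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);

    for (int i = 0; ok && i < n; ++i) ok = (h[i] == 2.0f);
    return ok;
}

int main() {
    const int gridSizes[] = {1, 3, 64};   // same work, different grid sizes
    bool ok = true;
    for (int blocks : gridSizes) {
        bool pass = runScale(10000, blocks);
        std::printf("blocks = %2d: %s\n", blocks, pass ? "ok" : "mismatch");
        ok = ok && pass;
    }
    return ok ? 0 : 1;
}
```

Because every element is touched by exactly one thread, the blocks could run one after another on a single SM or all at once, and the result would be the same.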

Warning

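Rule of thumb: blocks run to completion, and there are no data dependencies between blocks.
Exception 1: with features such as CUDA Dynamic Parallelism, a thread block may be suspended to memory: the SM state is saved to a system-managed region of GPU memory and the SM is freed to run other blocks (similar to a context switch on a CPU). This is uncommon.
Exception 2: the CUDA model requires, "with some exceptions", that threads in different blocks have no data dependencies on one another; a thread should not depend on, or synchronize with, a thread in another block of the same grid. Cross-block synchronization has to go through mechanisms such as clusters (see below).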

Thread Block Clusters

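In addition to thread blocks, GPUs with compute capability 9.0 and above have an optional extra level of grouping called a cluster. A cluster is a group of thread blocks and, like thread blocks and grids, can be laid out in 1, 2, or 3 dimensions.

Specifying clusters does not change the grid dimensions or the index of a thread block within the grid; it only groups adjacent thread blocks into clusters.
All thread blocks of a cluster execute on a single GPC and are scheduled simultaneously.
Because they are scheduled simultaneously on the same GPC, threads in different blocks of the same cluster can communicate and synchronize with one another through the software interface provided by Cooperative Groups.
Threads in a cluster can access the shared memory of every block in the cluster; this is called distributed shared memory.
The thread blocks of a cluster are always adjacent to one another in the grid.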

   Plain grid (no clusters)        With clusters specified
   ┌───┬───┬───┬───┐               ┌───────┬───────┐
   │ B │ B │ B │ B │               │ B   B │ B   B │   ← cluster (adjacent blocks),
   ├───┼───┼───┼───┤               │ B   B │ B   B │     scheduled together on one GPC,
   │ B │ B │ B │ B │               └───────┴───────┘     can communicate through
   └───┴───┴───┴───┘               grid dimensions and   distributed shared memory
   no communication guarantee      indices unchanged
   between blocks
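
The following sketch shows one possible use of clusters through Cooperative Groups: each block fills its own shared memory, the cluster synchronizes, and each block then reads its neighbor's shared memory through `map_shared_rank`. It assumes a GPU with compute capability 9.0 or higher, a toolkit recent enough to provide `cluster_group`, and compilation for that architecture (for example `nvcc -arch=sm_90`). The kernel name `swapWithNeighbor`, the host function `runClusterSwap`, the cluster size of 2, and the data values are illustrative choices.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

constexpr int kThreads = 128;

// Compile-time cluster shape: 2 blocks per cluster along x.
__global__ void __cluster_dims__(2, 1, 1) swapWithNeighbor(int* out) {
    __shared__ int value[kThreads];
    cg::cluster_group cluster = cg::this_cluster();
    unsigned int rank = cluster.block_rank();  // 0 or 1 within the cluster

    value[threadIdx.x] = blockIdx.x * 1000 + threadIdx.x;
    cluster.sync();  // every block of the cluster has filled its shared memory

    // Pointer to the other block's array (distributed shared memory).
    int* peer = cluster.map_shared_rank(&value[0], rank ^ 1u);
    out[blockIdx.x * blockDim.x + threadIdx.x] = peer[threadIdx.x];

    cluster.sync();  // do not exit while the neighbor may still be reading
}

// Returns true on success, or when the device lacks cluster support.
bool runClusterSwap() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return false;
    if (prop.major < 9) {
        std::printf("clusters need compute capability 9.0+, found %d.%d; "
                    "skipping\n", prop.major, prop.minor);
        return true;
    }

    const int blocks = 4;  // grid size must be a multiple of the cluster size
    const int count = blocks * kThreads;
    int* dOut = nullptr;
    if (cudaMalloc(&dOut, count * sizeof(int)) != cudaSuccess) return false;

    swapWithNeighbor<<<blocks, kThreads>>>(dOut);
    bool ok = cudaDeviceSynchronize() == cudaSuccess;

    std::vector<int> hOut(count, -1);
    cudaMemcpy(hOut.data(), dOut, count * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dOut);

    // Block b should hold the values written by its neighbor, block b ^ 1.
    for (int b = 0; ok && b < blocks; ++b) {
        for (int t = 0; ok && t < kThreads; ++t) {
            ok = (hOut[b * kThreads + t] == (b ^ 1) * 1000 + t);
        }
    }
    return ok;
}

int main() {
    bool ok = runClusterSwap();
    std::printf("%s\n", ok ? "cluster swap ok" : "cluster swap failed");
    return ok ? 0 : 1;
}
```

The grid is still launched with the ordinary `<<<blocks, threads>>>` syntax and block indices are unchanged, which matches the point above that clusters only group adjacent blocks.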

Important

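Clusters are a supported way for blocks to communicate and synchronize with one another, but the maximum cluster size depends on the hardware and varies by device. Without clusters, the rule that blocks must have no data dependencies on one another still holds.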

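Warps and SIMT

Within a thread block, threads are grouped into sets of 32, each called a warp. A warp executes kernel code in the SIMT (Single-Instruction Multiple-Threads) paradigm: all threads of the warp execute the same kernel code, but each thread may take different branches, so they run the same code without necessarily following the same execution path.

warp lane: each thread is assigned a lane in its warp, numbered 0 to 31; threads are assigned to warps in a predictable fashion.
same instruction at the same time: in the SIMT model, all threads of a warp execute the same instruction simultaneously.
masking: if some threads of a warp take a branch and the rest do not, the lanes that do not take it are masked off while the threads that do take it execute that stretch of instructions. Example: if a condition holds for only half of the threads, the other half is masked off for that duration.
warp divergence: the situation where threads of the same warp follow different code paths. GPU utilization is highest when all threads of a warp follow the same control-flow path.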

  warp (32 lanes) executing if (threadIdx.x % 2 == 0) { ... }

  lane:  0   1   2   3   4   5  ...  31
         ●   ○   ●   ○   ●   ○       ○      ● = condition true, executes
         ▲   ╳   ▲   ╳   ▲   ╳       ╳      ╳ = masked off (idle)
         └── even lanes run the if body, odd lanes are masked, utilization drops
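
To make the lane picture concrete, the sketch below computes each thread's lane and warp index in a 1D block and uses `__ballot_sync` to collect, per warp, which lanes satisfy a condition. It compares a per-lane condition (even threads) with a per-warp condition (even warps). The names `inspectWarps` and `runInspectWarps` and the block size of 64 are assumptions for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kThreads = 64;            // two full warps in a 1D block
constexpr int kWarps = kThreads / 32;

__global__ void inspectWarps(unsigned int* evenLaneMask,
                             unsigned int* evenWarpMask) {
    int lane = threadIdx.x % warpSize;  // position inside the warp, 0..31
    int warp = threadIdx.x / warpSize;  // index of the warp within the block

    // Bit k of the result is set when lane k's predicate is true.
    unsigned int perLane = __ballot_sync(0xffffffffu, threadIdx.x % 2 == 0);
    unsigned int perWarp = __ballot_sync(0xffffffffu, warp % 2 == 0);

    if (lane == 0) {
        evenLaneMask[warp] = perLane;   // mixed bits: a branch would diverge
        evenWarpMask[warp] = perWarp;   // all ones or all zeros: no divergence
    }
}

// Fills the two host arrays (kWarps entries each); false on a CUDA error.
bool runInspectWarps(unsigned int* laneMasks, unsigned int* warpMasks) {
    unsigned int *dLane = nullptr, *dWarp = nullptr;
    if (cudaMalloc(&dLane, kWarps * sizeof(unsigned int)) != cudaSuccess) {
        return false;
    }
    if (cudaMalloc(&dWarp, kWarps * sizeof(unsigned int)) != cudaSuccess) {
        cudaFree(dLane);
        return false;
    }

    inspectWarps<<<1, kThreads>>>(dLane, dWarp);
    bool ok = cudaDeviceSynchronize() == cudaSuccess;

    cudaMemcpy(laneMasks, dLane, kWarps * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaMemcpy(warpMasks, dWarp, kWarps * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(dLane);
    cudaFree(dWarp);
    return ok;
}

int main() {
    unsigned int laneMasks[kWarps] = {0}, warpMasks[kWarps] = {0};
    if (!runInspectWarps(laneMasks, warpMasks)) return 1;
    for (int w = 0; w < kWarps; ++w) {
        std::printf("warp %d: even-lane mask 0x%08x, even-warp mask 0x%08x\n",
                    w, laneMasks[w], warpMasks[w]);
    }
    return 0;
}
```

For the even-lane condition both warps report `0x55555555`, the alternating pattern drawn in the diagram. For the even-warp condition, warp 0 reports `0xffffffff` and warp 1 reports `0x00000000`: a branch on that condition is taken by a whole warp or by none of it, so no lanes sit idle.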

SIMT vs SIMD

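Aspect	SIMT (CUDA)	SIMD
Control flow	Each thread can have its own control-flow path	A single control-flow path overall
Data width	No fixed data width	Fixed data width
Branch handling	Divergence is allowed and handled by masking	Per-thread branching is not part of the model
Programmer's view	Per-thread code	Per-vector operations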

Tip

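The number of threads in a thread block should preferably be a multiple of 32. Any number is legal, but if the total is not a multiple of 32, the last warp has some lanes that stay idle for the whole kernel, which leads to suboptimal functional-unit utilization and memory access.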
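
The arithmetic behind this tip is small enough to check directly. The host-only sketch below computes, for a few block sizes, how many warps a block occupies and how many lanes of the last warp have no thread. The helper names `warpsPerBlock` and `idleLanes` are illustrative, and the warp size of 32 is taken from the text above.

```cuda
#include <cstdio>

constexpr int kWarpSize = 32;

// Number of warps needed for a block with the given thread count (rounded up).
int warpsPerBlock(int threads) {
    return (threads + kWarpSize - 1) / kWarpSize;
}

// Lanes in the last warp that have no thread assigned to them.
int idleLanes(int threads) {
    return warpsPerBlock(threads) * kWarpSize - threads;
}

int main() {
    const int sizes[] = {100, 128, 250, 256};
    for (int threads : sizes) {
        std::printf("%3d threads -> %d warps, %2d idle lanes\n",
                    threads, warpsPerBlock(threads), idleLanes(threads));
    }
    return 0;
}
```

A block of 100 threads occupies 4 warps with 28 idle lanes, the same number of warps as a block of 128 threads, which has none idle.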

Warning

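In the SIMT model, all threads of a warp advance together in lock step, but actual hardware execution may differ.
Writing code that exploits knowledge of how warps map onto real hardware is discouraged. As long as the programming model is followed, the hardware can optimize masked lanes in ways that are transparent to the program; violating the model (see Independent Thread Execution) results in undefined behavior that may differ from one GPU to another.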

Tip

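You do not have to think about warps to write CUDA code, but understanding the warp execution model helps with global memory coalescing and shared memory bank access patterns. Some advanced techniques deliberately specialize the warps within a block to reduce divergence and improve utilization.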

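Exam / Quiz Key Points

Scenario / keyword	Answer
Where does an application start executing?	Always on the CPU (host)
What do host and device refer to?	host = CPU + host memory; device = GPU + device memory
What is a function that runs on the GPU called?	kernel
What does "launch a kernel" mean?	Starting many threads on the GPU that execute the kernel code in parallel
What is a GPU made of?	A collection of SMs, grouped into GPCs
Where do shared memory and L1 come from?	The SM's unified data cache; the split is configurable at runtime
Thread hierarchy from smallest to largest?	thread → warp → thread block → grid (cluster is an optional level)
Dimensions of thread blocks and grids?	1, 2, or 3 dimensions
What does the execution configuration specify?	Grid and thread block dimensions (optional: cluster size, stream, SM settings)
Where do the threads of one block execute?	On a single SM; they can synchronize and share shared memory
May different blocks have data dependencies?	No; there is no scheduling order guarantee, and blocks must be executable in any order
Exception: does a block always run to completion?	Usually, but with Dynamic Parallelism it can be suspended to memory (uncommon)
What do clusters require?	Compute capability 9.0+; they are optional
Where are the blocks of a cluster scheduled?	On the same GPC, simultaneously
How do blocks in a cluster share memory?	distributed shared memory (through Cooperative Groups)
How many threads in a warp? Lane numbering?	32; lanes are numbered 0–31
What does SIMT stand for?	Single-Instruction Multiple-Threads
What is warp divergence?	Threads of the same warp taking different branches; lanes that do not take a branch are masked off
When is GPU utilization highest?	When the threads of a warp follow the same control-flow path
Two main differences between SIMT and SIMD?	In SIMT each thread can have its own control flow, and there is no fixed data width
Recommended number of threads per block?	A multiple of 32 (otherwise the last warp has idle lanes and utilization is suboptimal)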
