使用 PTX 與硬體模型 (Using PTX and Hardware Model)

#cuda #cuda-platform #ptx #execution-model #independent-thread-scheduling #concept

重點總覽

項目	重點
Using PTX	PTX 是 CUDA 用來抽象硬體 ISA 的虛擬機 ISA；手寫 PTX 是最後手段，只用於極度效能敏感處。兩種用法：`cuda::ptx` namespace 與 inline PTX (`asm volatile`)
Hardware Implementation	SM 用 SIMT 模型並行執行數百 threads；指令 pipeline 化、in-order issue；不做 branch prediction 與 speculative execution；little-endian
SIMT Execution Model	SM 以 32 threads 為一組 = warp；一次執行一條共同指令；同 warp 內 branch divergence 會序列化各路徑，跨 warp 互不影響
Independent Thread Scheduling	CC 7.0+ 每 thread 有獨立 program counter 與 call stack，可 sub-warp 粒度 diverge/reconverge；不再保證 warp lockstep；warp-synchronous code 需用 `__syncwarp()` 修正
Hardware Multithreading	warp context 常駐 on-chip，warp 切換 zero-overhead；warp scheduler 每 cycle 選 ready warp 發射指令；resident block/warp 數受 register 與 shared memory 限制
Asynchronous Execution Features	CC 8.0 引入 async copy + async barrier；CC 9.0 引入 TMA；async operation 由 CUDA thread 發起、由 async thread 執行，以 barrier/pipeline 同步
Async Thread / Async Proxy	normal 存取走 generic proxy；LDGSTS/STAS 用 async thread on generic proxy；TMA/tensor core 用 async thread on async proxy，跨 proxy 需 proxy fence

Using PTX（手寫 PTX）

PTX（Parallel Thread Execution） 是 CUDA 用來抽象各代硬體 ISA 的虛擬機指令集架構（virtual machine ISA）。直接以 PTX 撰寫程式是高度進階的最佳化手法，對大多數開發者並非必要，應視為最後手段（tool of last resort）。

適用時機：應用程式中極度效能敏感的部分，連 0.x% 的效能提升都有顯著效益。
完整指令列於 PTX ISA document。
兩種在 C++ 中使用 PTX 的方式：

方式	說明
`cuda::ptx` namespace（來自 libcu++）	提供直接對應 PTX 指令的 C++ 函式，簡化在 C++ 應用中的使用
Inline PTX	直接把 PTX 內嵌進 C++，類似在 CPU 上寫組合語言

inline PTX 透過 asm volatile 嵌入，語法上很像 CPU 的 inline assembly：

__device__ int add_via_ptx(int a, int b) {
    int r;
    // 將兩個 .s32 相加，輸出綁到 %0，輸入綁到 %1、%2
    asm volatile("add.s32 %0, %1, %2;" : "=r"(r) : "r"(a), "r"(b));
    return r;
}

重點：volatile 阻止編譯器將該指令最佳化掉或重排；"=r"/"r" 為輸出/輸入 register constraint。

Tip

一般情況優先用既有 intrinsic 或 cuda::ptx，可讀性與可攜性都更好；只有在 profiler 證實熱點、且確需細粒度控制時才手寫 inline PTX。

Hardware Implementation（硬體實作概觀）

一個 streaming multiprocessor（SM） 設計上可同時並行執行數百個 threads。為管理如此大量 threads，它採用獨特的並行計算模型 SIMT（Single-Instruction, Multiple-Thread）。

指令是 pipeline 化的：同時利用單一 thread 內的 instruction-level parallelism (ILP)，以及透過 hardware multithreading 的 thread-level parallelism。
與 CPU core 不同：SM 按順序（in order）發射指令，不做 branch prediction，也不做 speculative execution。
SIMT Execution Model 與 Hardware Multithreading 描述所有裝置共通的架構特性；各 compute capability 的細節在 Compute Capabilities。
NVIDIA GPU 架構使用 little-endian 表示法。

SIMT Execution Model

每個 SM 以 32 個並行 threads 為一組來建立、管理、排程與執行，這一組稱為 warp。

同一 warp 的各 thread 起始於同一程式位址，但各自擁有獨立的 instruction address counter 與 register state，因此可自由 branch 與獨立執行。
名詞溯源：warp 來自織布（weaving）；half-warp 是 warp 的前半或後半（16 threads），quarter-warp 是四等分之一（8 threads）。
warp 一次執行一條共同指令，當 32 threads 都走同一執行路徑時達到全效率。
若 warp 內 threads 因 data-dependent conditional branch 而 diverge，warp 會逐條執行每個被取用的 branch 路徑，並停用不在該路徑上的 threads。
Branch divergence 只發生在 warp 內；不同 warp 互相獨立，無論是否走相同路徑。

warp（32 threads）遇到 if(cond) 分歧：
  cond=true 路徑 :  T0 T1 .. T15  ----執行---->  (T16..T31 停用)
  cond=false 路徑:                              T16..T31 ----執行---->  (T0..T15 停用)
  reconverge     :  ============= 全 32 threads 再次一起 =============
  → 兩條路徑「序列化」，divergence 越多效能越差

比較	SIMD（向量機）	SIMT
SIMD width 是否暴露給軟體	是，需軟體 coalesce loads 成向量、手動管理 divergence	否，指令描述單一 thread 的執行與 branch 行為
程式模型	資料平行	可寫獨立 scalar threads 的 thread 平行，也可寫資料平行

Tip

就正確性而言可忽略 SIMT 行為；但就效能而言應讓 code 盡量少要求 warp 內 threads diverge。此關係類比於 cache line：設計正確性時可忽略，設計尖峰效能時必須考慮。

Independent Thread Scheduling（CC 7.0+）

項目	CC < 7.0（Volta 之前）	CC ≥ 7.0（Volta 之後）
Program counter	整個 warp 共用一個 PC + active mask	每個 thread 各自一個 PC
Call stack	共用	每個 thread 各自一個 call stack
排程粒度	warp 層級	可在 per-thread 粒度 yield 執行
diverge/reconverge	受限	可在 sub-warp 粒度 diverge 與 reconverge

CC < 7.0：因共用 PC + active mask，divergent region 中的 threads 無法互相 signal 或交換資料；需要 lock/mutex 細粒度共享資料的演算法可能 deadlock（視競爭 threads 來自哪個 warp 而定）。
CC 7.0+：independent thread scheduling 允許 thread 間（不分 warp）完全並行。GPU 維護 per-thread 執行狀態（含 program counter 與 call stack），可讓一個 thread 等待另一 thread 產生的資料。
schedule optimizer 決定如何把同 warp 的 active threads 分組進 SIMT units，保留 SIMT 高吞吐但更有彈性。

共用 PC (CC<7.0):     [PC]  → 全 warp 一個指令指標 + active mask
每 thread PC (CC≥7.0): [PC0][PC1][PC2]...[PC31] + 各自 call stack
                        ↓ schedule optimizer 重新分組成 SIMT units
                        → 可在 sub-warp 粒度 diverge / reconverge

Warning

Independent thread scheduling 會破壞依賴隱式 warp-synchronous 行為的舊 code。Warp-synchronous code 假設同 warp threads 在每條指令都 lockstep，這在 sub-warp diverge/reconverge 下不再成立，可能導致參與執行的 thread 集合與預期不同。任何 CC 7.0 之前的 warp-synchronous code（例如免同步的 intra-warp reduction）都應重新檢視，並用 __syncwarp() 顯式同步以確保跨世代正確。

Important

active vs inactive threads：參與當前指令的是 active threads；不在當前指令的是 inactive（disabled）。thread inactive 的原因包括：比同 warp 其他 threads 提早 exit、走了不同 branch 路徑、或是 block thread 數非 warp size 倍數時的尾端 threads。

Warning

同址寫入未定義行為：

非 atomic 指令若 warp 內多個 threads 寫入 global/shared 的同一位置，序列化寫入的次數因 CC 而異；且哪個 thread 做最後寫入是 undefined。
atomic 指令若 warp 內多個 threads 對 global 同址做 read-modify-write，每筆都會發生且全部序列化，但發生順序是 undefined。

Hardware Multithreading

當 SM 拿到一或多個 thread block，會把它們切分成 warps，每個 warp 由一個 warp scheduler 排程執行。

切分方式固定：每個 warp 含連續遞增 thread ID，第一個 warp 含 thread 0。
每個 block 的 warp 數：

warps_per_block = ceil(T / Wsize)
  T     = 每 block 的 threads 數
  Wsize = warp size = 32
  ceil(x, y) = x 向上取整到 y 的最近倍數

每個 warp 的 execution context（program counters、registers 等） 在 warp 整個生命週期中常駐 on-chip，因此 warp 之間切換零成本（zero-overhead），可隱藏延遲。
每個 instruction issue cycle，warp scheduler 挑選一個有 ready threads 的 warp，把指令發射給該 warp 的 active threads。

issue cycle:  cycle0   cycle1   cycle2   cycle3 ...
scheduler →   warpA    warpC    warpA    warpB   （挑 ready 者，無切換開銷）
              ↑ 某 warp 在等記憶體時，立刻切到另一 ready warp 隱藏延遲

每個 SM 有一組 32-bit registers（在 warps 間分配）與 shared memory（在 thread blocks 間分配）。
一個 kernel 可同時常駐的 block/warp 數，取決於 kernel 用掉的 register 與 shared memory，以及 SM 上可用量；另有每 SM 最大常駐 block 數與 warp 數上限。

Warning

若 SM 資源連一個 block 都放不下，kernel 直接 launch 失敗。block 配置的 register/shared memory 總量可用 Occupancy 一節的方法估算。

Asynchronous Execution Features

近代 NVIDIA GPU 加入非同步執行能力，讓資料搬移、計算、同步在 GPU 內更能重疊。這些能力讓從 GPU code 發起的某些操作，可相對於同一 thread block 內的其他 GPU code 非同步執行。

Important

此處的「非同步」不要與 Section 2.5 的 asynchronous CUDA API 混淆。後者讓 kernel launch 或 memory 操作彼此之間、或相對於 CPU 非同步；本節是裝置端 GPU code 內部的非同步。

Compute Capability	架構	新增非同步能力
8.0	Ampere	硬體加速的 global→shared 非同步資料複製；asynchronous barriers
9.0	Hopper	TMA（搬大塊資料與多維 tensor，global↔shared）；asynchronous transaction barriers；asynchronous matrix multiply-accumulate

async operation 由一個 CUDA thread 發起，但「彷彿由另一個 thread 執行」，那個 thread 稱為 async thread。
well-formed 程式中，一或多個 CUDA threads 會與該 async operation 同步；發起的 thread 不必屬於同步它的 threads。async thread 永遠關聯到發起它的 CUDA thread。
async operation 用一個 synchronization object 來 signal 完成，可以是 barrier 或 pipeline（詳見 Advanced Synchronization Primitives 與 Asynchronous Data Copies）。

Async Thread and Async Proxy

非同步操作存取記憶體的方式可能與一般操作不同。為區分，CUDA 引入三個概念：async thread、generic proxy、async proxy。一般 load/store 走 generic proxy。

指令類型	模型	proxy
Normal loads/stores	（一般執行）	generic proxy
`LDGSTS`、`STAS`/`REDAS`	async thread	generic proxy
TMA bulk-async copy、tensor core（`tcgen05.`、`wgmma.mma_async.`）	async thread	async proxy

Async thread on generic proxy：發起時關聯一個有別於發起者的 async thread。對同位址而言，先前的 normal load/store 保證排在 async op 之前；但後續的 normal load/store 不保證維持順序，在 async thread 完成前可能 race condition。
Async thread on async proxy：對同位址而言，先前與後續的 normal load/store 都不保證維持順序。需要 proxy fence 跨 proxy 同步，以確保正確的 memory ordering（TMA 一節示範用 proxy fence 確保非同步複製正確）。

generic proxy:  [前 normal store] --ordered--> [async op (LDGSTS)] ...[後 normal load 不保證順序]
async proxy:    [前 normal store]  ?無序?  [async op (TMA)]  ?無序?  [後 normal load]
                                    └────────── 需 proxy fence 才能跨 proxy 排序 ──────────┘

Warning

更多細節見 PTX ISA documentation。涉及 TMA / tensor core 的非同步路徑時，缺少 proxy fence 會造成跨 proxy 的記憶體順序錯誤。

考試/測驗重點

情境/關鍵字	答案
手寫 PTX 的定位	高度進階最佳化，最後手段，只用於極度效能敏感處
在 C++ 用 PTX 的兩種方式	`cuda::ptx` namespace (libcu++) / inline PTX (`asm volatile`)
inline PTX 為何加 volatile	防止編譯器最佳化掉或重排該段 PTX
warp 大小	32 threads；half-warp=16；quarter-warp=8
SM 是否做 branch prediction / speculative execution	否；in-order issue
SIMD vs SIMT 關鍵差異	SIMD 把 width 暴露給軟體；SIMT 指令描述單一 thread 行為
branch divergence 範圍	只在 warp 內；跨 warp 互不影響、各路徑序列化執行
CC < 7.0 warp 的 PC	32 threads 共用一個 PC + active mask
CC ≥ 7.0 每 thread 擁有	獨立 program counter 與 call stack
independent thread scheduling 起始 CC	7.0（Volta）
修正 warp-synchronous code 的正確做法	顯式呼叫 `__syncwarp()`，不可依賴隱式 lockstep
warp 切換成本	zero-overhead（context 常駐 on-chip）
warp 數公式	ceil(T / 32)
資源不足放不下一個 block	kernel launch 失敗
多 thread 寫 global 同址（非 atomic）	序列化次數依 CC，最終寫入者 undefined
多 thread atomic 對 global 同址 RMW	全部序列化但順序 undefined
async copy + async barrier 引入 CC	8.0（Ampere）
TMA 引入 CC	9.0（Hopper）
async operation 由誰執行	async thread（關聯到發起的 CUDA thread）
async op 用什麼 signal 完成	synchronization object：barrier 或 pipeline
normal load/store 走哪個 proxy	generic proxy
LDGSTS / STAS / REDAS 用哪個 proxy	async thread on generic proxy
TMA / `tcgen05.` / `wgmma.mma_async.` 用哪個 proxy	async thread on async proxy
跨 generic/async proxy 排序需要什麼	proxy fence
裝置端非同步 vs Section 2.5 async API	前者是 GPU code 內部非同步；後者是 launch/memory 相對 CPU 或彼此非同步，勿混淆

重點總覽

Using PTX（手寫 PTX）

Hardware Implementation（硬體實作概觀）

SIMT Execution Model

Independent Thread Scheduling（CC 7.0+）

Hardware Multithreading

Asynchronous Execution Features

Async Thread and Async Proxy

考試/測驗重點

Related Notes