第三章練習題 (Practice - Advanced CUDA)

重點模式（關鍵字對照表）

作答前可先用這張表喚起記憶；每列是「情境/關鍵字 → 答案」。

情境 / 關鍵字	答案
不改 kernel 原始碼附加額外啟動屬性	`cudaLaunchKernelEx` + `cudaLaunchConfig_t` + `attrs`/`numAttrs`
`<<<>>>` 四個固定參數	block 維度 / grid 維度 / dynamic shared memory / stream
thread block clusters 起始 CC / 保證	9.0；同 cluster blocks 同時在單一 GPC，可跨 block 同步
grid 與 cluster 維度關係	grid 各維度須可被 cluster 維度整除
PDL 三要件	`cudaTriggerProgrammaticLaunchCompletion` + `cudaGridDependencySynchronize` + `programmaticStreamSerializationAllowed=1`
ephemeral stack buffer 的 srcAccessOrder	`cudaMemcpySrcAccessOrderDuringApiCall`（pinned/managed 用 `...Stream`）
減少 stream false dependency 序列化	增加 `CUDA_DEVICE_MAX_CONNECTIONS`（MPS 預設較低）
warp 大小 / independent thread scheduling	32；CC 7.0+ 每 thread 各自 PC + call stack，須用 `__syncwarp()`
thread scope 由窄到寬	thread → block(.cta,L1) → cluster(.cluster,L2) → device(.gpu,L2) → system(.sys,L2+caches)
scoped atomics 兩組成 / 效能	Thread Scope + Memory Ordering；scope 最窄、ordering 最弱、shared > global
producer-consumer ordering	producer `release` store + consumer `acquire` load（relaxed 會 data race）
async barrier 與 `__syncthreads()`	把 arrive（不阻塞、回 token）與 wait（阻塞）分離，重疊等待
LDGSTS / TMA / STAS 起始 CC	8.0 / 9.0 / 9.0
carveout API / 性質	`cudaFuncSetAttribute` + `cudaFuncAttributePreferredSharedMemoryCarveout`，是 hint
driver API 前綴 / 首呼叫 / context 類比	`cu`；先 `cuInit()`；context ≈ CPU process（獨立 address space）
跑未來架構載入什麼	PTX（載入時 driver JIT 成 binary）；binary 架構特定不相容
切換 current device / 預設	`cudaSetDevice()`；首次呼叫前預設 device 0
跨裝置同步用哪個 API	`cudaStreamWaitEvent()`（跨裝置仍成功）
Linux bare-metal P2P 的 IOMMU	必須關閉，否則 silent memory corruption
work stealing / 引入架構	cluster launch control；CC 10.0 (Blackwell)
跨 host process 共享 GPU buffer	CUDA IPC

Question 1 - cudaLaunchKernelEx 的價值 [recall]

你想在「不修改 kernel 原始碼」的前提下，於每次啟動附加 cluster 維度或 shared memory carveout 等額外屬性，但 triple chevron <<<>>> 只能表達固定參數——該用什麼機制？<<<>>> 的四個可程式化參數又是哪些？

查看答案

用 cudaLaunchKernelEx：以 cudaLaunchConfig_t（含 gridDim/blockDim/dynamicSmemBytes/stream）設定執行配置，再透過 attrs 指標與 numAttrs 傳入零或多個 cudaLaunchAttribute，可在不改 kernel 下附加任意數量屬性/提示。
<<<>>> 只能表達四個固定參數：block 維度、grid 維度、dynamic shared memory（預設 0）、stream（預設 default stream）。

Question 2 - Thread Block Clusters 的保證與約束 [recall]

Thread block clusters 從哪個 compute capability 開始支援？它提供什麼執行保證？grid 維度與 cluster 維度之間有何約束？

查看答案

compute capability 9.0+。保證同一 cluster 內所有 thread blocks 同時在單一 GPC（GPU Processing Cluster）上執行，使超過單一 SM 容量的一群 threads 能跨 block 交換資料與同步。
grid 不受 cluster 影響，仍以 thread block 數枚舉，但 grid 各維度必須可被對應 cluster 維度整除。

Question 3 - Programmatic Dependent Launch 的重疊機制 [analysis]

第一個 kernel 已把後續所需資料寫入 global memory 卻仍有其他工作要做，第二個 kernel 在用到那份資料前也有一段獨立工作——請說明 PDL 如何讓兩者部分重疊，以及缺一不可的三個組成。

查看答案

primary 在資料就緒時呼叫 cudaTriggerProgrammaticLaunchCompletion()；secondary 做完獨立工作後才呼叫 cudaGridDependencySynchronize() 阻塞等待；secondary 須以 cudaLaunchKernelEx 帶 cudaLaunchAttributeProgrammaticStreamSerialization（programmaticStreamSerializationAllowed = 1）啟動。
三者讓 primary 觸發後的剩餘工作與 secondary 的獨立段（含其 launch overhead）重疊；但能否真正重疊仍取決於硬體資源是否足夠與兩 kernel「何時產出/何時消費」的結構。

Question 4 - 異質批次傳輸的 srcAccessOrder [application]

你用 cudaMemcpyBatchAsync 做一個異質批次，一部分 source 來自 pinned host memory、另一部分是當前 scope 的 stack buffer——各應設哪種 srcAccessOrder？為什麼？

查看答案

pinned（或 managed）memory 用 cudaMemcpySrcAccessOrderStream：以正常 stream order 存取，memcpy 會 block 直到先前處理該資料的 kernel 完成。
stack buffer 的位址是 ephemeral pointer，async copy 真正執行前可能失效，必須用 cudaMemcpySrcAccessOrderDuringApiCall（只在 API call 期間存取 source），否則指標會在複製前失效。

Question 5 - SIMT 與 Independent Thread Scheduling [recall]

SM 以多少 threads 為一組執行一條共同指令？compute capability 7.0 (Volta) 的 independent thread scheduling 改變了什麼？對舊的 warp-synchronous code 有何要求？

查看答案

32 threads 為一個 warp，一次執行一條共同指令（divergence 只發生在 warp 內並序列化各路徑）。
CC 7.0+ 每個 thread 擁有獨立的 program counter 與 call stack，可在 sub-warp 粒度 diverge/reconverge，不再保證 warp lockstep。依賴隱式 lockstep 的 warp-synchronous code（如免同步 intra-warp reduction）須改用 __syncwarp() 顯式同步。

Question 6 - Thread Scopes 與 Scoped Atomics [recall]

thread scope 由窄到寬有哪些、各自的 point of coherency 為何？scoped atomics 由哪兩個概念組成，效能三原則是什麼？

查看答案

由窄到寬：thread（無）→ block（.cta，L1）→ cluster（.cluster，L2）→ device（.gpu，L2）→ system（.sys，L2 + connected caches）。
scoped atomics = Thread Scope（誰能觀察到效果）+ Memory Ordering（相對其他記憶體操作的順序約束）。效能口訣：scope 最窄、ordering 最弱、位置最近（shared memory atomics 比 global 快）。

Question 7 - Producer-Consumer 的 Memory Ordering [application]

thread 0 寫 data = 42 後設置 ready 旗標，其餘 thread 自旋等到 ready 再讀 data——ready 的 store/load 該用哪種 memory ordering？若兩端都改用 relaxed 會發生什麼？

查看答案

producer 對 ready 用 memory_order_release store、consumer 用 memory_order_acquire load，形成 happens-before，保證看到 ready==true 的 consumer 也看到 data=42（只有 ready 旗標需 atomic，data 可為一般 int）。
若兩端都用 relaxed，只保證 ready 本身的原子性，不保證 consumer 讀到 ready==true 時也看到 data=42，會產生資料競爭。

Question 8 - Asynchronous Barrier 為何分離 arrive/wait [analysis]

解釋 asynchronous barrier 相對單階段 __syncthreads() 把 arrive 與 wait 分離為何能提升效率，並說明 bar.arrive() 與 bar.wait() 各自的行為。

查看答案

bar.arrive() 不阻塞、只回傳標記目前 barrier phase 的 arrival_token，thread 可立刻去做不依賴他人 arrive 前更新的獨立工作（如 compute()），把同步等待時間重疊掉（latency hiding）；真正阻塞的是 bar.waitmove(token)，等到參與 threads 完成 arrive 達 init 設定的 expected arrival count。
arrive 點帶有 seq_cst / thread_scope_block 的隱含 fence，使跨 thread 可見性與 __syncthreads() 一致，差別只在於把等待延後到 wait()。

Question 9 - 非同步複製硬體機制與 L1/Shared Carveout [recall]

列出 LDGSTS / TMA / STAS 各自起始的 compute capability 與資料路徑；設定 L1/shared carveout 用哪個 API，它是 hint 還是硬性需求？

查看答案

LDGSTS（8.0+，global → shared::cta，小規模）、TMA（9.0+，大型多維 bulk-async，global ↔ shared）、STAS（9.0+，registers → shared::cluster）。
用 cudaFuncSetAttribute + cudaFuncAttributePreferredSharedMemoryCarveout 設定，是 hint（driver 可改用其他配置）；相對地 cudaFuncSetCacheConfig 是硬性需求（hard requirement），會在切換 shared 配置時不必要地序列化 launch。

Question 10 - CUDA Driver API 基本流程 [recall]

CUDA driver API 的入口點前綴是什麼？任何 driver API 呼叫前必做什麼？context 類比於什麼？runtime 隱式建立的 primary context 如何從 driver API 取得？

查看答案

入口點皆以 cu 為前綴（runtime API 為 cuda）。任何 driver API 呼叫前必須先 cuInit()，再建立並 current 一個 context。
context 類比一個 CPU process，每個 context 有自己獨立的 address space（不同 context 的 CUdeviceptr 指向不同記憶體）。可用 cuDevicePrimaryCtxRetain() 從 driver API 取得 runtime 隱式建立、per-device 共享的 primary context。

Question 11 - Driver API 傳參時的 struct 對齊差異 [analysis]

用 driver API 以單一 parameter buffer（CU_LAUNCH_PARAM_BUFFER_POINTER）傳參時，struct { float f; float4 f4; } 為何在 host code 完全不 padding，但在 device code 卻會在 f 之後 padding？

查看答案

struct 的對齊需求等於其各 field 對齊需求的最大值。float4 在 device 端對齊需求為 16 bytes，因此 device code 必須在 f（4 bytes）之後 padding 12 bytes，讓 f4 落在 16-byte 邊界；host 編譯器無此要求故不 padding。
因此用 buffer 傳參時必須按 device 端對齊規則正確計算每個參數的 offset，否則參數會錯位。

Question 12 - Multi-GPU 的 current device 語意 [recall]

在 host thread 中如何切換 current device？device memory 分配、kernel launch、stream/event 綁定各自落在哪個裝置？把 kernel launch 到綁定別的裝置的 stream 會如何？

查看答案

用 cudaSetDevice()（任意時刻可呼叫；第一次呼叫前 current device 預設為 device 0）。cudaMalloc 與 kernel launch 都作用在 current device；stream 與 event 在建立時即綁定當時的 current device。
kernel launch 必須送往綁定 current device 的 stream，否則 launch 失敗（但 memory copy 送往非 current device 的 stream 仍會成功）。

Question 13 - 跨裝置同步與 P2P 環境設定 [application]

你要讓 device 1 上某 stream 的工作等待 device 0 上記錄的 event 才開始——該用哪個 API？另外，在 Linux bare-metal 上做 PCIe P2P 前，IOMMU 必須如何設定？

查看答案

用 cudaStreamWaitEvent()：即使 stream 與 event 綁定不同裝置仍會成功，因此可作為跨裝置同步工具。
Linux bare-metal 上 CUDA + display driver 不支援 IOMMU-enabled 的 PCIe P2P，必須關閉 IOMMU，否則會 silent device memory corruption（VM pass-through 情況則相反，須開啟 IOMMU 並使用 VFIO driver）。

Question 14 - Features 導覽：Kernel 效能與延遲 [recall]

「work stealing」靠哪個功能、哪個架構引入？async data copies 指的是哪一種搬移（別與什麼混淆）？CUDA graphs 有哪兩種建立方式與兩大效益？

查看答案

work stealing 靠 cluster launch control，compute capability 10.0（Blackwell）引入：一個 block 可 cancel 尚未啟動的 block/cluster、奪取其 index 並立即執行。async data copies 指 kernel 內 shared memory ↔ GPU DRAM 的搬移，勿與 CPU ↔ GPU 之間的 async memcpy 混淆。
CUDA graphs 可由 stream capture 或 graphs API 建立；兩大效益為降低 CPU 端 launch 成本，以及啟用「整個 workload 預先已知」才可能的最佳化。

Question 15 - Features 導覽：功能性與細緻控制 [recall]

green context 的作用為何？dynamic parallelism 是什麼？跨不同 host process 共享 GPU buffer 用什麼？啟用 CUDA 錯誤紀錄的環境變數是哪個？

查看答案

green context（execution context）限制 kernel 只在部分 SM 上執行，被保留的 SM 其他 context（含 primary context）不會把 block 排上去，為高優先/延遲敏感工作保留資源（CUDA runtime 自 13.1 起支援）。dynamic parallelism 讓 GPU 上執行的 kernel 直接發起新的 kernel 啟動。
跨 host process 共享 GPU buffer 用 CUDA IPC（interprocess communication）；設定環境變數 CUDA_LOG_FILE（stderr / stdout / 檔案）可啟用錯誤紀錄，並可註冊 error callback。

模式總結

第三章「Advanced CUDA」的核心模式可歸納為五條主線：

進階啟動：cudaLaunchKernelEx + cudaLaunchConfig_t 用可擴充的 cudaLaunchAttribute 取代 <<<>>> 的四個固定參數；clusters（CC 9.0+，單一 GPC）與 PDL（trigger + sync + serialization attribute 三要件）都靠此擴充。
進階 kernel：SIMT/warp 與 CC 7.0+ independent thread scheduling（須 __syncwarp()）；scoped atomics 用「最窄 scope + 最弱 ordering + 最近位置」；async barrier 把 arrive/wait 分離以重疊等待；async copy（LDGSTS/TMA/STAS）與 L1/shared carveout（hint）隱藏記憶體延遲。
Driver API：cu 前綴、cuInit 先行、context ≈ process、跑未來架構載 PTX；buffer 傳參須遵守 device 端對齊（struct 取 field 對齊最大值）。
Multi-GPU：cudaSetDevice 決定 current device 歸屬；kernel launch 須配對綁定的 stream；跨裝置同步用 cudaStreamWaitEvent；bare-metal P2P 須關 IOMMU。
Features 導覽：依「瓶頸類型」選功能——kernel 效能、launch 延遲、功能性、互通、細緻控制。
共同精神：多為 host 端、不需改 kernel 即可影響 GPU work execution 與 CPU/GPU 兩側效能；正確性要求（scope 涵蓋、ordering 配對、stream 綁定、IOMMU）絕不能為了效能而妥協。

Related Concepts

Question 1 - cudaLaunchKernelEx 的價值 [recall]

Question 2 - Thread Block Clusters 的保證與約束 [recall]

Question 3 - Programmatic Dependent Launch 的重疊機制 [analysis]

Question 4 - 異質批次傳輸的 srcAccessOrder [application]

Question 5 - SIMT 與 Independent Thread Scheduling [recall]

Question 6 - Thread Scopes 與 Scoped Atomics [recall]

Question 7 - Producer-Consumer 的 Memory Ordering [application]

Question 8 - Asynchronous Barrier 為何分離 arrive/wait [analysis]

Question 9 - 非同步複製硬體機制與 L1/Shared Carveout [recall]

Question 10 - CUDA Driver API 基本流程 [recall]

Question 11 - Driver API 傳參時的 struct 對齊差異 [analysis]

Question 12 - Multi-GPU 的 current device 語意 [recall]

Question 13 - 跨裝置同步與 P2P 環境設定 [application]

Question 14 - Features 導覽：Kernel 效能與延遲 [recall]

Question 15 - Features 導覽：功能性與細緻控制 [recall]