CUDA Dynamic Parallelism

#cuda #execution-model #dynamic-parallelism #concept

重點總覽

CUDA Dynamic Parallelism（CDP）讓「在 GPU 上執行的 kernel」能直接從 device code 再啟動新的 kernel，把 launch 決策搬到 runtime、減少 host↔device 來回，特別適合遞迴、不規則迴圈等不符合單層平坦平行性的演算法。

項目	重點
本質	device code 透過 `<<< >>>` 啟動 child grid，產生資料相依的動態平行工作
版本	CDP2 為 CUDA 12.0+ 預設、CC 9.0+ 唯一版本；CDP1 為 legacy（`-DCUDA_FORCE_CDP1_IF_SUPPORTED`，將被移除）
Parent/Child	巢狀關係：parent 必須等所有 child 完成才算結束（implicit synchronization）
物件範圍	streams／events 由 device API 建立者只在「建立它的 grid」內有效
Stream 排序	同一 stream 內 in-order（fire-and-forget 例外）；named stream 全 grid 共用、NULL stream 僅 block 內共用
並行	CDP 不引入任何新的並行保證；不可依賴 block 間或 parent/child 間並行
記憶體共享	global／mapped 共用指標；local／shared 不共用；texture 唯讀共用
Global 一致性	weak consistency；唯一完全一致點是 child 被 invoke 的瞬間
取回 child 結果	移除 `cudaDeviceSynchronize()` 後，parent 退出前只能靠 `cudaStreamTailLaunch` kernel
Device Runtime	per-thread API、語意比照 host runtime，可在 divergent code 中無死鎖呼叫
限制	連結 device runtime 即有 overhead；pending launch pool 大小可設
PTX 層	`cudaLaunchDevice()` + `cudaGetParameterBuffer()`，需連結 `cudadevrt`

Important

本筆記描述的是 CDP2（CUDA 12.0+ 預設、CC 9.0+ 唯一版本）。CDP2 沒有 cudaDeviceSynchronize()，改用 cudaStreamTailLaunch／cudaStreamFireAndForget 兩種特殊 stream 達成同步與 fire-and-forget。CDP1 文件已不在本指南，且預計於未來版本移除。

Introduction 與 Overview

CDP 是 CUDA programming model 的功能：GPU 上執行的 code 可以「新增 GPU 工作」，即從 device code 再發出 kernel launch。
好處：launch configuration 可在 runtime 由 device 上的 thread 決定，減少 host／device 之間轉移執行控制與資料的需求。
適用情境：遞迴、不規則迴圈結構等「不符合單層平坦平行性」的程式結構，用 CDP 表達更自然。
名詞：一次 kernel launch 的實例稱為 grid（含 thread block 形狀與 grid 形狀）；要區分「kernel function 本身」與「該 kernel 的特定 invocation（grid）」。

Execution Environment：Parent / Child Grids

設定並啟動新 grid 的 device thread 屬於 parent grid；被啟動的新 grid 稱為 child grid。
invocation 與 completion 是「正確巢狀（properly nested）」：parent 在其所有 child 完成前不算完成，runtime 保證 parent 與 child 之間的 implicit synchronization。

parent grid ┌─────────────────────────────────────────┐
            │ thread0: child_launch<<<...>>>          │
            │            └─ child grid ┌──────────┐    │
            │                          │  ......  │    │
            │                          └────┬─────┘    │
            │   (parent 不算完成，直到 child 全部完成) │
            └──────────────────────┬──────────────────┘
                          implicit synchronization 後 parent 結束

Scope of CUDA Primitives

CDP 依賴 CUDA Device Runtime：在 device code 中可呼叫一組「語法類似 CUDA Runtime API、但數量受限」的 API，行為與 host 版相近但有差異（詳列於 API Reference 表）。
host 與 device 都提供 launch kernel 與用 streams／events 追蹤相依性的 API。
在 device 上，被啟動的 kernel 與 CUDA 物件對「發出該 launch 的 grid 內所有 thread」可見：例如一個 thread 建立的 stream，可被同 grid 的其他 thread 使用。
但 device API 建立的 streams／events 只在「建立它的 grid」內有效。

Streams and Events

同一 stream 的 kernel launch in-order 執行；events 用於建立 stream 之間的相依。
grid 內建立的 streams／events 屬 grid scope，在建立它的 grid 之外使用為 undefined behavior。
grid 退出時，其發出的所有工作（含送進 streams 的工作）都會 implicit synchronize，相依性會被正確解析。
在 grid scope 之外被修改過的 stream，後續對它的操作為 undefined。

Warning

跨界使用一律 undefined：host 建立的 streams／events 在任何 kernel 內使用為 undefined；parent grid 建立的 streams／events 在 child grid 內使用同樣為 undefined。

Ordering and Concurrency

device runtime 的 launch 排序遵循 CUDA Stream ordering 語意。
同 grid 內送進「同一 stream」的 launch（fire-and-forget stream 例外）會 in-order 執行。
多個 thread 送進同一 stream 時，stream 內順序取決於 grid 內 thread 排程，可用 __syncthreads() 等同步原語控制。

Stream 種類	共享範圍	多 thread 同時送入的結果
named stream	全 grid 共用	由 thread 排程決定順序
implicit NULL stream（同 block）	僅 thread block 內共用	in-order
implicit NULL stream（跨 block）	不共用	可能並行

若想讓「同一 block 內多個 thread 的 launch」並行，應使用顯式的 named stream。
CDP 不在 CUDA execution model 中引入任何「新的並行保證」：device 上任意數量的不同 block 之間沒有並行保證。
parent 與 child 之間也沒有並行保證：child 在 stream 相依滿足且硬體資源可用時可能開始，但「不保證」在 parent 抵達 implicit synchronization point 前開始執行。

Warning

並行程度會隨 device 配置、應用 workload、runtime 排程而變動。依賴不同 block 之間（或 parent/child 之間）的並行是不安全的，會造成 undefined behavior。

Memory Coherence and Consistency

Parent 與 child 共享同一份 global 與 constant memory，但各自擁有獨立的 local 與 shared memory。Child 永遠不能存取 parent 的 local／shared，反之亦然。

Memory Space	Parent/Child 可用相同指標？
Global Memory	Yes
Mapped Memory	Yes
Local Memory	No
Shared Memory	No
Texture Memory	Yes（read-only）

Global Memory

parent 與 child 對 global memory 有 coherent access，但 child 與 parent 之間是 weak consistency。
child 整個執行期間，唯一「對 memory 的視角與 parent thread 完全一致」的時間點，就是 child 被 parent invoke 的那一刻。
parent thread 在 child invocation 之前的所有 global memory 操作，對 child 都可見。
由於 CDP2 移除了 cudaDeviceSynchronize()，parent 無法再直接讀到 child thread 做的修改；唯一在 parent 退出前取得 child 修改的方式，是把一個 kernel 送進 cudaStreamTailLaunch stream。

__global__ void parent_launch(int *data) {
    data[threadIdx.x] = threadIdx.x;
    __syncthreads();                 // 讓 child 看到 data[0..255] 全部寫入
    if (threadIdx.x == 0) {
        child_launch<<< 1, 256 >>>(data);
        // tail launch：等 child 完成，才能在 parent 退出前讀到 child 結果
        tail_launch<<< 1, 256, 0, cudaStreamTailLaunch >>>(data);
    }
}

重點：因 thread 0 負責 launch，child 與「thread 0 看到的 memory」一致；__syncthreads() 確保 child 看到 data[0]=0 … data[255]=255（若無此同步只保證看到 data[0]=0）。child 只在 implicit synchronization 時返回，其修改不保證對 parent 可見，要靠 tail launch 取回。

時間軸（parent 視角）
 寫 data + __syncthreads ──▶ child_launch ──▶ (parent 抵達 implicit sync)
                                  │                     │
                                  └─ child 跑完 ─────────┴─▶ tail_launch 讀到 child 結果

Mapped Memory

mapped system memory 的 coherence／consistency 保證與 global memory 完全相同，遵循上述語意。
kernel 不能 allocate 或 free mapped memory，但可使用「由 host program 傳入」的 mapped memory 指標。

Shared and Local Memory

shared／local memory 分別私有於 thread block／thread，parent 與 child 之間既不可見也不一致。
在所屬範圍外引用這類物件為 undefined behavior，可能造成 error。
NVIDIA compiler 會嘗試在偵測到「local／shared 指標被當成 launch 參數」時提出警告。
runtime 可用 __isGlobal() intrinsic 判斷指標是否指向 global memory（即是否能安全傳給 child launch）。
cudaMemcpy*Async() 或 cudaMemset*Async() 為維持 stream 語意，可能在 device 上 invoke 新的 child kernel；因此把 shared／local 指標傳給這些 API 是非法的，會回傳 error。

Local Memory（細節）

local memory 是執行中 thread 的私有儲存，對外不可見；把 local memory 指標當 child launch 參數是非法的，從 child grid 解參考這種指標為 undefined。

// 非法：x_array 在 parent 的 local memory
int x_array[10];
child_launch<<< 1, 1 >>>(x_array);

通用原則：傳給 child kernel 的所有儲存都應「明確從 global-memory heap 配置」——用 cudaMalloc()、new()，或在 global scope 宣告 __device__ 儲存。

__device__ int value;                  // 正確：global 儲存
__device__ void x() { value = 5; child<<<1,1>>>(&value); }

__device__ void y() { int value = 5;   // 非法：local 儲存
    child<<<1,1>>>(&value); }

Texture Memory

對「被 texture 映射的 global 區域」的寫入，與 texture 存取之間是 incoherent 的。
texture 的 coherence 在 child grid invocation 與 child grid completion 兩個時點被強制：child launch 之前的寫入會反映到 child 的 texture 存取。
同 Global Memory，child 的寫入不保證反映到 parent 的 texture 存取；parent 退出前取回唯一方式是 cudaStreamTailLaunch kernel。parent 與 child 同時存取可能產生不一致資料。

Programming Interface：Basics

__global__ void parentKernel() {
    childKernel<<<1,1>>>();                          // launch child
    if (cudaSuccess != cudaGetLastError()) return;
    // tail launch：隱含同步，等 child 完成
    tailKernel<<<1,1,0,cudaStreamTailLaunch>>>();
}
// host: parentKernel<<<1,1>>>(); ... cudaDeviceSynchronize();

重點：device 端用 cudaGetLastError() 檢查 launch 錯誤；cudaStreamTailLaunch 讓 tail kernel 等 child 完成後才跑。編譯需開啟 relocatable device code 並連結 cudadevrt：

nvcc -arch=sm_75 -rdc=true hello_world.cu -o hello -lcudadevrt

C++ Language Interface for CDP

提供給 kernel 使用的語言介面／API 稱為 CUDA Device Runtime；盡量保留 CUDA Runtime API 的語法與語意，方便 host／device 共用程式碼。
與所有 CUDA C++ 一樣，這些 API 是 per-thread code：每個 thread 可獨立、動態決定下一步要執行哪個 kernel 或操作。
thread 之間「無需同步」即可呼叫任一 device runtime API，因此可在任意 divergent kernel code 中呼叫而不會 deadlock。

Device-Side Kernel Launch

kernel_name<<< Dg, Db, Ns, S >>>([kernel arguments]);

參數	型別	說明
`Dg`	`dim3`	grid 的維度與大小
`Db`	`dim3`	每個 thread block 的維度與大小
`Ns`	`size_t`	動態 shared memory bytes（選填，預設 0）
`S`	`cudaStream_t`	關聯 stream，必須在「發出此呼叫的同一 grid」內配置（選填，預設 NULL stream）

Launches are Asynchronous：device 端 launch 對發出的 thread 是非同步的——<<<>>> 立即返回，發出 thread 繼續執行，直到碰到 implicit launch-synchronization point（例如送進 cudaStreamTailLaunch 的 kernel）。child 可在 launch 後任意時間開始，但不保證在發出 thread 抵達 implicit launch-sync point 前開始。送進不同 stream 的工作「可能」並行，但實際並行不被保證；依賴 child 之間並行為 undefined behavior。
Launch Environment Configuration：所有 global device 設定（如 cudaDeviceGetCacheConfig() 回傳的 shared/L1 配置、cudaDeviceGetLimit() 回傳的 device limits，含 stack size）都由 parent 繼承。host 設定的 per-kernel 配置優先於 global 設定，且 device 端 launch 同 kernel 時沿用。無法從 device 重新配置 kernel 的環境。

Events

只支援 events 的 inter-stream 同步能力：支援 cudaStreamWaitEvent()；不支援 cudaEventSynchronize()、cudaEventElapsedTime()、cudaEventQuery()。
因不支援 cudaEventElapsedTime()，event 必須以 cudaEventCreateWithFlags() 並帶 cudaEventDisableTiming flag 建立。
同 streams，event 可被建立它的 grid 內所有 thread 共享，但對該 grid 為 local、不能傳給其他 kernel。event handle 在不同 grid 之間不保證唯一，於「非建立者 grid」內使用某 event handle 會 undefined。

Synchronization

若要讓某 thread 與「其他 thread 發出的 child grid」同步，需由程式自行做足夠的 inter-thread 同步（例如透過 CUDA Event）。
因為無法從 parent thread 顯式同步 child 工作，所以無法保證 child grid 的變更對 parent grid 內的 thread 可見。

Device Management

只能從 kernel 控制「該 kernel 正在執行的 device」；因此 device runtime 不支援 cudaSetDevice()。
cudaGetDevice() 看到的 active device 號與 host 看到的相同。
cudaDeviceGetAttribute() 可查詢其他 device（API 允許以 device ID 為參數）。
不提供 catch-all 的 cudaGetDeviceProperties()，屬性必須逐項查詢。

Programming Guidelines

Performance

DP-enabled Kernel Overhead：當 system software 在管理 dynamic launch 時，會對「當時正在執行的任何 kernel」造成 overhead——無論該 kernel 本身是否發出 launch。此 overhead 來自 device runtime 的執行追蹤與管理軟體，可能降低效能；一般而言，凡是連結 device runtime library 的應用都會付出此成本。

Implementation Restrictions and Limitations

CDP 保證本文件描述的所有語意，但某些硬體與軟體資源是 implementation-dependent，會限制使用 device runtime 程式的規模、效能等。
Memory Footprint：device runtime system software 會為管理用途保留 memory，特別是用於追蹤 pending grid launches 的保留區；有 configuration 可縮小此保留區以換取某些 launch 限制。
Pending Kernel Launches：kernel launch 時，其全部 configuration 與 parameter 資料會被追蹤到 kernel 完成，存放在 system 管理的 launch pool。此固定大小 launch pool 可由 host 端呼叫 cudaDeviceSetLimit() 並指定 cudaLimitDevRuntimePendingLaunchCount 來設定。

Compatibility and Interoperability

CDP2 為預設；可用 -DCUDA_FORCE_CDP1_IF_SUPPORTED 在 CC < 9.0 的 device 上退回 CDP1。

情境	CDP2（CUDA 12.0+ 預設）	CDP1（`-DCUDA_FORCE_CDP1_IF_SUPPORTED`）
編譯	device code 引用 `cudaDeviceSynchronize` 會編譯錯誤	引用 `cudaStreamTailLaunch`／`cudaStreamFireAndForget` 編譯錯誤；引用 `cudaDeviceSynchronize` 且為 sm_90+ 編譯也錯誤
CC < 9.0	使用新介面	使用 legacy 介面
CC ≥ 9.0	使用新介面	仍用新介面；若 device code 引用 `cudaDeviceSynchronize`，function load 回傳 `cudaErrorSymbolNotFound`（為 CC<9.0 編譯但以 JIT 跑在 9.0+ 時可能發生）

CDP1 與 CDP2 的 function 可在同一 context 中同時 load 與執行：CDP1 用其專屬功能（如 cudaDeviceSynchronize），CDP2 用其專屬功能（如 tail launch、fire-and-forget launch）。

Warning

CDP1 的 function 不能 launch CDP2 的 function，反之亦然。若 CDP1 function 的 call graph 內含 CDP2 function（或相反），function load 時會得到 cudaErrorCdpVersionMismatch。

Device-side Launch from PTX

針對以 PTX 為目標、要在自家語言支援 dynamic parallelism 的編譯器／語言實作者，device 端 launch 由兩個 PTX 可存取的 API 完成：cudaLaunchDevice() 與 cudaGetParameterBuffer()。前者用「由後者取得並填入參數」的 parameter buffer 來啟動指定 kernel；若 kernel 不帶參數，parameter buffer 可為 NULL（不需呼叫 cudaGetParameterBuffer()）。

cudaLaunchDevice

extern "C" __device__
cudaError_t cudaLaunchDevice(void *func, void *parameterBuffer,
                             dim3 gridDimension, dim3 blockDimension,
                             unsigned int sharedMemSize,
                             cudaStream_t stream);

第一參數為要啟動的 kernel 指標，第二參數為存放實際參數的 parameter buffer，其餘為 launch configuration（grid／block 維度、shared memory 大小、stream）。
PTX 層需先以對應 .address_size（如 64-bit）的 .extern .func 形式宣告；CUDA 層宣告位於 cuda_device_runtime_api.h，函式定義在 cudadevrt system library，使用 device-side launch 必須連結它。

cudaGetParameterBuffer

extern "C" __device__
void *cudaGetParameterBuffer(size_t alignment, size_t size);

第一參數為 parameter buffer 的 alignment 需求、第二為 size（bytes）。
目前實作回傳的 buffer 一律保證 64-byte 對齊，alignment 參數被忽略；但仍建議傳入正確 alignment（即 buffer 中任一參數的最大 alignment）以利未來可攜性。

Parameter Buffer Layout

禁止在 parameter buffer 中重排參數，每個參數都必須對齊：參數須放在第 n byte，n 是「大於前一參數最後一個 byte 偏移」的最小、可被該參數大小整除的倍數。
parameter buffer 最大為 4KB。更詳細的 CUDA compiler 產生的 PTX，請參考 PTX 3.5 spec。

parameter buffer (max 4KB)
┌──────────┬─pad─┬───────────────┬─────────────┐
│ param0   │     │ param1(對齊)  │ param2(對齊) │ ...
└──────────┴─────┴───────────────┴─────────────┘
 每個 param 偏移 = 大於前一 param 末端、且為自身 size 倍數的最小值

考試/測驗重點

主題	常考點 / 陷阱
版本	CDP2 = CUDA 12.0+ 預設、CC 9.0+ 唯一；CDP1 需 -DCUDA_FORCE_CDP1_IF_SUPPORTED，將被移除
同步機制	CDP2 無 cudaDeviceSynchronize；改用 cudaStreamTailLaunch（等 child）與 cudaStreamFireAndForget
取回 child 結果	parent 退出前只能靠送進 cudaStreamTailLaunch 的 kernel；child 寫入「不保證」對 parent 可見
Global 一致性	weak consistency；唯一完全一致點 = child 被 invoke 當下；launch 前 parent 寫入對 child 可見
記憶體可傳指標	global / mapped / texture(唯讀) 可；local / shared 不可（傳了 undefined / error）
判斷指標	__isGlobal() 判斷是否 global，可安全傳給 child；compiler 會嘗試對 local/shared 指標警告
非法傳遞	把 shared/local 指標傳給 cudaMemcpyAsync / cudaMemsetAsync 非法（它們可能 invoke child kernel）
Stream 共享	named stream 全 grid 共用；NULL stream 僅 block 內共用；同 stream in-order（fire-and-forget 例外）
並行保證	CDP 無新並行保證；block 間、parent/child 間皆不可依賴並行
Events	僅支援 cudaStreamWaitEvent；須用 cudaEventCreateWithFlags + cudaEventDisableTiming
Device 管理	不支援 cudaSetDevice / cudaGetDeviceProperties；cudaDeviceGetAttribute 可查他 device
物件範圍	device 建立的 stream/event 僅在「建立它的 grid」有效；跨 grid / host↔device 皆 undefined
編譯連結	nvcc -rdc=true ... -lcudadevrt；device-side launch 需連結 cudadevrt
Pending launch	cudaDeviceSetLimit + cudaLimitDevRuntimePendingLaunchCount 設定 launch pool 大小
版本不相容	CDP1 與 CDP2 互相 launch / 同一 call graph 混用 → cudaErrorCdpVersionMismatch
PTX 層	cudaLaunchDevice + cudaGetParameterBuffer；buffer 64-byte 對齊、不可重排、最大 4KB

重點總覽

Introduction 與 Overview

Execution Environment：Parent / Child Grids

Scope of CUDA Primitives

Streams and Events

Ordering and Concurrency

Memory Coherence and Consistency

Global Memory

Mapped Memory

Shared and Local Memory

Local Memory（細節）

Texture Memory

Programming Interface：Basics

C++ Language Interface for CDP

Device-Side Kernel Launch

Events

Synchronization

Device Management

Programming Guidelines

Performance

Implementation Restrictions and Limitations

Compatibility and Interoperability

Device-side Launch from PTX

cudaLaunchDevice

cudaGetParameterBuffer

Parameter Buffer Layout

考試/測驗重點

Related Notes