CUDA Graphs: Graph Structure and Capture

#cuda #async-execution #cuda-graphs #concept

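Key Points Overview

CUDA Graphs provide an alternative work-submission model: a series of operations (kernel launches, data movement, and so on) connected by dependencies is defined once and then launched repeatedly. Separating definition from execution reduces CPU launch cost, because much of the setup is done in advance, and presenting the whole workflow to CUDA enables optimizations that piecewise stream submission may not allow.

Item	Key point
Why use a graph	For short-running kernels, the host driver's per-launch setup overhead is a large share of the total time; a graph pays most of that cost once, at instantiation
Node	One operation is one node; 12 node types in total
Edge	A dependency between operations that constrains execution order; a node can be scheduled once all of its dependencies are complete
Edge data (12.3+)	Three parts: outgoing port / incoming port / type; the only non-default use today is PDL
Three stages	definition → instantiation → execution
Creation methods	Explicit graph API (cudaGraphCreate / cudaGraphAddNode) or stream capture
Stream capture	Bracket existing stream code with cudaStreamBeginCapture / cudaStreamEndCapture
Instantiation	cudaGraphInstantiate produces a cudaGraphExec_t (executable graph)
Execution	cudaGraphLaunch submits the executable graph to a stream; can be repeated any number of times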

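Core value of graphs

For a fixed workflow that is launched many times, instantiation pays the validation and initialization cost once, so each subsequent cudaGraphLaunch has far less CPU overhead than re-submitting the operations one by one.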

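Graph Structure: Nodes and Edges

A graph consists of nodes (operations) and edges (dependencies). Edges constrain execution order: an operation may be scheduled at any time once all the nodes it depends on are complete, and the scheduling itself is left to the CUDA system.

Node Types

A graph node can be one of the following:

Node type	Description
kernel	Kernel launch
CPU function call	Host-side function call (host node)
memory copy	Memory copy
memset	Memory set
empty node	Empty node (used purely to fan dependencies in or out)
waiting on a CUDA Event	Waits on a CUDA Event
recording a CUDA Event	Records a CUDA Event
signalling an external semaphore	Signals an external semaphore
waiting on an external semaphore	Waits on an external semaphore
conditional node	Conditional node
memory node	Memory node (allocation / free)
child graph	Executes a separate nested graph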

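Use of the empty node

An empty node does no work itself. It is commonly used to gather several dependencies into a single node, or to fan out from a single point, which simplifies the dependency topology.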
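To make the fan-in / fan-out idea concrete, here is a minimal sketch that uses only empty nodes. The helper name build_fan_in_out is hypothetical, and the three "producers" and two "consumers" stand in for real work nodes. Without the join node, making every consumer wait for every producer would take 3 × 2 = 6 edges; with it, 3 + 2 = 5 suffice.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper: builds 3 producers -> join -> 2 consumers using only
// empty nodes and returns the number of nodes in the graph (-1 on error).
int build_fan_in_out() {
    cudaGraph_t graph;
    if (cudaGraphCreate(&graph, 0) != cudaSuccess) return -1;

    cudaGraphNode_t producers[3], join, consumers[2];
    bool ok = true;

    // Root nodes: no dependencies.
    for (auto& p : producers)
        ok = ok && cudaGraphAddEmptyNode(&p, graph, nullptr, 0) == cudaSuccess;

    // The join node depends on all three producers (3 incoming edges).
    ok = ok && cudaGraphAddEmptyNode(&join, graph, producers, 3) == cudaSuccess;

    // Each consumer depends only on the join node (2 outgoing edges).
    for (auto& c : consumers)
        ok = ok && cudaGraphAddEmptyNode(&c, graph, &join, 1) == cudaSuccess;

    // Passing a null node array asks only for the node count.
    size_t numNodes = 0;
    ok = ok && cudaGraphGetNodes(graph, nullptr, &numNodes) == cudaSuccess;

    cudaGraphDestroy(graph);
    return ok ? static_cast<int>(numNodes) : -1;
}
```

In a real graph the producers and consumers would be kernel or memcpy nodes; only the join node would remain empty.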

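Edge Data (introduced in CUDA 12.3)

Edge data modifies a dependency edge and consists of three parts:

outgoing port: specifies when the edge is triggered on the upstream node.
incoming port: specifies which portion of the downstream node depends on the edge.
type: modifies the relation between the two endpoints.

Port values are specific to the node type and direction, and edge types may be restricted to specific node types. Zero-initialized edge data represents the default behavior:

Field	Meaning of the default value (0)
outgoing port 0	Waits on the entire upstream task
incoming port 0	Blocks the entire downstream task
edge type 0	Full dependency with memory synchronizing behavior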

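In practice:

Edge data is passed optionally to the various graph APIs as an array parallel to the associated nodes; if omitted as an input, zero-initialized data is used.
If omitted as an output (query) parameter, the call is accepted when all of the edge data being ignored is zero-initialized; otherwise it returns cudaErrorLossyQuery so that no information is silently lost.
Some stream capture APIs also support edge data: cudaStreamBeginCaptureToGraph(), cudaStreamGetCaptureInfo(), and cudaStreamUpdateCaptureDependencies(). In these cases there is no downstream node yet, so the data is attached to a dangling edge (half edge), which is later either connected to a captured node or discarded when capture ends.
The only non-default dependency type is cudaGraphDependencyTypeProgrammatic, which enables Programmatic Dependent Launch (PDL) between two kernel nodes; enabling PDL is currently the only use of non-default edge data.
No node type defines additional incoming ports; only kernel nodes define additional outgoing ports.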

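Exception: edges that do not wait for full completion

Some edge types do not wait for the upstream node to finish completely. Such edges are ignored when determining whether a stream capture has fully rejoined the origin stream, and they cannot be discarded at the end of capture.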
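As a small illustration of the two cases described in this section, the sketch below fills in a cudaGraphEdgeData value for the default edge and for a programmatic (PDL) edge. It assumes CUDA 12.3 or newer headers; the helper names default_edge and programmatic_edge are hypothetical. The call that attaches edge data to a graph is deliberately left out, because its signature differs between toolkit versions.

```cuda
#include <cuda_runtime.h>

// Zero-initialized edge data: outgoing port 0, incoming port 0, type 0,
// i.e. a full dependency on the whole upstream task.
cudaGraphEdgeData default_edge() {
    cudaGraphEdgeData e = {};
    return e;
}

// Edge data for a programmatic dependency between two kernel nodes.
// The upstream kernel's programmatic port is used as the outgoing port;
// the incoming port stays at its default because no node type defines
// additional incoming ports.
cudaGraphEdgeData programmatic_edge() {
    cudaGraphEdgeData e = {};
    e.from_port = cudaGraphKernelNodePortProgrammatic;
    e.to_port   = 0;
    e.type      = cudaGraphDependencyTypeProgrammatic;
    return e;
}
```

An array of such values, parallel to the dependency node array, is what the graph APIs accept as their optional edge-data argument.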

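Building and Running Graphs: Three Stages

  ┌────────────────┐    ┌──────────────────────┐    ┌────────────────────┐
  │   Definition   │ →  │    Instantiation     │ →  │     Execution      │
  │  create nodes  │    │ cudaGraphInstantiate │    │  cudaGraphLaunch   │
  │   and edges    │    │       produces       │    │       (×N)         │
  │ (cudaGraph_t)  │    │   cudaGraphExec_t    │    │   into a stream    │
  └────────────────┘    └──────────────────────┘    └────────────────────┘

Definition / Creation: build a description of the operations in the graph and the dependencies between them (this produces the cudaGraph_t template).
Instantiation: take a snapshot of the graph template, validate it, and perform most of the setup and initialization so that as little as possible remains to be done at launch; the result is an executable graph.
Execution: an executable graph can be launched into a stream like any other CUDA work, and can be launched any number of times without re-instantiating.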

Graph Creation, Method 1: Explicit Graph API

Create an empty graph with cudaGraphCreate(), then add nodes and their dependencies one at a time with cudaGraphAddNode() (or cudaGraphAddKernelNode / cudaGraphAddMemcpyNode / cudaGraphAddMemsetNode / cudaGraphAddHostNode and so on).

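// Create an empty graph
cudaGraph_t graph;
cudaGraphCreate(&graph, 0);
cudaGraphNode_t nodes[4];
cudaGraphNodeParams kParams = { cudaGraphNodeTypeKernel };
// ... set func / gridDim / blockDim ...
// Arguments: node (out), graph, dependency array, edge data (NULL = default), dependency count, params
cudaGraphAddNode(&nodes[0], graph, NULL,      NULL, 0, &kParams);  // root, no dependencies
cudaGraphAddNode(&nodes[1], graph, &nodes[0], NULL, 1, &kParams);  // depends on nodes[0]
cudaGraphAddNode(&nodes[2], graph, &nodes[0], NULL, 1, &kParams);  // depends on nodes[0]
cudaGraphAddNode(&nodes[3], graph, &nodes[1], NULL, 2, &kParams);  // depends on nodes[1] and nodes[2]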

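The dependency arguments of cudaGraphAddNode (the 3rd and 5th arguments in the six-argument form used here; the 4th is the optional edge-data array) specify which existing nodes the new node depends on. The example builds a diamond: nodes[0] is the root and fans out to nodes[1] and nodes[2]; nodes[3] is given &nodes[1] with a count of 2, so it depends on both nodes[1] and nodes[2]. A real application would also add memory operation nodes such as memcpy and memset.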
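For a version that can actually be compiled and checked, the following sketch builds the same diamond with the type-specific cudaGraphAddKernelNode call and then runs it once. It assumes CUDA 12 or newer (for the three-argument cudaGraphInstantiate); the kernels add_const and add_pair and the helpers add_kernel and run_diamond are hypothetical names made up for this illustration, and error handling is reduced to the minimum.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

__global__ void add_const(int* dst, const int* src, int c) { *dst = *src + c; }
__global__ void add_pair(int* dst, const int* a, const int* b) { *dst = *a + *b; }

// Adds a single-thread kernel node with the given dependencies.
static bool add_kernel(cudaGraphNode_t* node, cudaGraph_t g,
                       const cudaGraphNode_t* deps, size_t numDeps,
                       void* func, void** args) {
    cudaKernelNodeParams p = {};
    p.func = func;
    p.gridDim = dim3(1);
    p.blockDim = dim3(1);
    p.kernelParams = args;  // argument values are copied when the node is added
    return cudaGraphAddKernelNode(node, g, deps, numDeps, &p) == cudaSuccess;
}

// Builds A -> {B, C} -> D, launches it once, returns the value D wrote.
int run_diamond() {
    int* v = nullptr;
    if (cudaMalloc(&v, 4 * sizeof(int)) != cudaSuccess) return -1;
    cudaMemset(v, 0, 4 * sizeof(int));

    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);
    cudaGraphNode_t n[4];

    int *v0 = v, *v1 = v + 1, *v2 = v + 2, *v3 = v + 3;
    int one = 1, ten = 10, hundred = 100;
    void* argsA[] = {&v0, &v0, &one};      // A: v[0] = v[0] + 1
    void* argsB[] = {&v1, &v0, &ten};      // B: v[1] = v[0] + 10
    void* argsC[] = {&v2, &v0, &hundred};  // C: v[2] = v[0] + 100
    void* argsD[] = {&v3, &v1, &v2};       // D: v[3] = v[1] + v[2]

    bool ok = add_kernel(&n[0], graph, nullptr, 0, (void*)add_const, argsA)
           && add_kernel(&n[1], graph, &n[0], 1, (void*)add_const, argsB)
           && add_kernel(&n[2], graph, &n[0], 1, (void*)add_const, argsC)
           // &n[1] with a count of 2 names n[1] and n[2]: the join of the diamond.
           && add_kernel(&n[3], graph, &n[1], 2, (void*)add_pair, argsD);

    int result = -1;
    cudaGraphExec_t exec = nullptr;
    if (ok && cudaGraphInstantiate(&exec, graph, 0) == cudaSuccess) {
        cudaStream_t s;
        cudaStreamCreate(&s);
        cudaGraphLaunch(exec, s);
        cudaStreamSynchronize(s);
        cudaMemcpy(&result, v3, sizeof(int), cudaMemcpyDeviceToHost);
        cudaStreamDestroy(s);
        cudaGraphExecDestroy(exec);
    }
    cudaGraphDestroy(graph);
    cudaFree(v);
    return result;
}
```

With v zeroed beforehand, A writes 1, B writes 11, C writes 101, and D writes 112, which is only correct if D really waited for both B and C.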

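Characteristics of the explicit API

Full control over the topology and over each node's parameters; a good fit when the workflow is naturally designed as a graph, since nodes and dependencies are all specified explicitly.

Graph Creation, Method 2: Stream Capture

Stream capture builds a graph from existing stream-based APIs: wrap a section of code that submits work to a stream between cudaStreamBeginCapture() and cudaStreamEndCapture().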

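cudaGraph_t graph;
// The capture mode argument is required; global mode is the strictest setting
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
kernel_A<<< ..., stream >>>(...);
kernel_B<<< ..., stream >>>(...);
libraryCall(stream);            // existing library calls can be captured too
kernel_C<<< ..., stream >>>(...);
cudaStreamEndCapture(stream, &graph);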

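Once in capture mode, work submitted to the stream is not enqueued for execution; instead it is appended to an internal graph that is built up progressively (the capture graph), which cudaStreamEndCapture() returns as it ends the capture.
Capture can be used on any stream except cudaStreamLegacy (the NULL stream); it can be used on cudaStreamPerThread. If a program uses the legacy stream, it may be possible to redefine stream 0 as the per-thread stream with no functional change.
cudaStreamIsCapturing() queries whether a stream is being captured.
cudaStreamBeginCaptureToGraph() captures work into an existing user-provided graph instead of a newly created internal one.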
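The first and third points above can be checked with a short experiment: kernels submitted during capture are recorded rather than run, and cudaStreamIsCapturing() reports the active capture. This is a sketch only; the kernel increment and the helper capture_does_not_execute are hypothetical names.

```cuda
#include <cuda_runtime.h>

__global__ void increment(int* x) { *x += 1; }

// Captures two kernel launches and returns the counter value seen right
// after capture ends, before any graph launch (-1 on error).
int capture_does_not_execute() {
    int* d = nullptr;
    if (cudaMalloc(&d, sizeof(int)) != cudaSuccess) return -1;
    cudaMemset(d, 0, sizeof(int));  // done before capture starts

    cudaStream_t s;
    cudaStreamCreate(&s);

    cudaStreamCaptureStatus during = cudaStreamCaptureStatusNone;
    cudaGraph_t graph = nullptr;

    cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
    increment<<<1, 1, 0, s>>>(d);        // recorded as a node, not executed
    increment<<<1, 1, 0, s>>>(d);        // recorded as a second node
    cudaStreamIsCapturing(s, &during);   // should report an active capture
    cudaError_t end = cudaStreamEndCapture(s, &graph);

    // Synchronous copy is fine here: the capture has already ended.
    int host = -1;
    cudaMemcpy(&host, d, sizeof(int), cudaMemcpyDeviceToHost);

    if (graph) cudaGraphDestroy(graph);
    cudaStreamDestroy(s);
    cudaFree(d);

    bool ok = end == cudaSuccess && during == cudaStreamCaptureStatusActive;
    return ok ? host : -1;
}
```

The counter is still 0 after capture because nothing ran; the two increments exist only as nodes in the returned graph until it is instantiated and launched.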

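Value of capture

Existing stream code, including third-party library calls, can be "recorded" into a repeatedly launchable graph without being rewritten against the graph API.

Cross-stream Dependencies and Events

Stream capture can handle cross-stream dependencies expressed with cudaEventRecord() and cudaStreamWaitEvent(), provided the event being waited on was recorded into the same capture graph.

Recording an event in a stream that is in capture mode produces a captured event, which represents a set of nodes in the capture graph.
When a stream waits on a captured event, the stream is placed into capture mode if it is not already, and the next item in that stream gains an additional dependency on the nodes the captured event represents; from then on the two streams are captured into the same capture graph.
origin stream: the stream on which cudaStreamBeginCapture() was called. cudaStreamEndCapture() must be called on the origin stream, and every other stream that joined the same capture graph through event dependencies must be joined back to the origin stream, otherwise the whole capture fails. All streams leave capture mode together at cudaStreamEndCapture().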

// stream1 is the origin stream
cudaStreamBeginCapture(stream1, cudaStreamCaptureModeGlobal);
kernel_A<<< ..., stream1 >>>(...);
cudaEventRecord(event1, stream1);      // fork: record after A
cudaStreamWaitEvent(stream2, event1);  // stream2 waits for A
kernel_B<<< ..., stream1 >>>(...);
kernel_C<<< ..., stream2 >>>(...);
cudaEventRecord(event2, stream2);      // join: record after C
cudaStreamWaitEvent(stream1, event2);  // stream1 waits for C
kernel_D<<< ..., stream1 >>>(...);
cudaStreamEndCapture(stream1, &graph); // both streams leave capture mode

Topology captured by the fork/join above (a diamond dependency):

            kernel_A          (origin / root)
           ╱        ╲
      kernel_B    kernel_C    (B on stream1, C on stream2)
           ╲        ╱
            kernel_D          (depends on B and C, back on origin)
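The fork/join pattern can be turned into a self-checking sketch. As before, this assumes CUDA 12 or newer; the kernels add_const and add_pair and the helper run_captured_fork_join are hypothetical names. The values in the comments are what each kernel writes when the graph is launched, not during capture.

```cuda
#include <cuda_runtime.h>

__global__ void add_const(int* dst, const int* src, int c) { *dst = *src + c; }
__global__ void add_pair(int* dst, const int* a, const int* b) { *dst = *a + *b; }

// Captures A -> {B, C} -> D across two streams, runs the graph once, and
// returns the value written by D (-1 on error).
int run_captured_fork_join() {
    int* v = nullptr;
    if (cudaMalloc(&v, 4 * sizeof(int)) != cudaSuccess) return -1;
    cudaMemset(v, 0, 4 * sizeof(int));

    cudaStream_t s1, s2;
    cudaEvent_t forkEvent, joinEvent;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaEventCreate(&forkEvent);
    cudaEventCreate(&joinEvent);

    cudaGraph_t graph = nullptr;
    cudaStreamBeginCapture(s1, cudaStreamCaptureModeGlobal);  // s1 is the origin
    add_const<<<1, 1, 0, s1>>>(v, v, 1);                       // A: v[0] = 1
    cudaEventRecord(forkEvent, s1);                            // fork after A
    cudaStreamWaitEvent(s2, forkEvent, 0);                     // brings s2 into the capture
    add_const<<<1, 1, 0, s1>>>(v + 1, v, 10);                  // B: v[1] = 11
    add_const<<<1, 1, 0, s2>>>(v + 2, v, 100);                 // C: v[2] = 101
    cudaEventRecord(joinEvent, s2);                            // join after C
    cudaStreamWaitEvent(s1, joinEvent, 0);                     // s2 rejoins the origin
    add_pair<<<1, 1, 0, s1>>>(v + 3, v + 1, v + 2);            // D: v[3] = 112
    cudaError_t end = cudaStreamEndCapture(s1, &graph);        // must be the origin stream

    int result = -1;
    cudaGraphExec_t exec = nullptr;
    if (end == cudaSuccess && cudaGraphInstantiate(&exec, graph, 0) == cudaSuccess) {
        cudaGraphLaunch(exec, s1);
        cudaStreamSynchronize(s1);
        cudaMemcpy(&result, v + 3, sizeof(int), cudaMemcpyDeviceToHost);
        cudaGraphExecDestroy(exec);
    }
    if (graph) cudaGraphDestroy(graph);
    cudaEventDestroy(forkEvent);
    cudaEventDestroy(joinEvent);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(v);
    return result;
}
```

Removing the second cudaStreamWaitEvent would leave s2 unjoined, and the capture would be expected to fail at cudaStreamEndCapture, as described for the origin stream above.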

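Stream ordering is preserved after leaving capture mode

After a stream leaves capture mode, the next non-captured item in the stream (if any) still depends on the most recent prior non-captured item, even though the captured items in between have been removed.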

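Prohibited and Unhandled Operations

Operation	Why it is invalid
Synchronizing or querying a stream that is being captured, or a captured event	They do not represent items scheduled for execution
Synchronizing or querying a broader handle that encompasses an active capture (device / context)	Invalid whenever any associated stream is in capture mode
Using the legacy stream while a stream in the same context is being captured	Applies when the captured stream was not created with `cudaStreamNonBlocking`: the legacy stream handle always encompasses such streams
Synchronous APIs (such as `cudaMemcpy()`)	They enqueue work to the legacy stream and synchronize it before returning, so they are invalid for the same reason
Merging two separate capture graphs	Waiting, from a stream being captured, on a captured event that belongs to a different capture graph is invalid
Waiting on a non-captured event	Waiting on a non-captured event from a stream being captured requires the `cudaEventWaitExternal` flag
A small number of asynchronous enqueue APIs (such as `cudaStreamAttachMemAsync()`)	Not currently supported in graphs; calling them on a stream that is being captured returns an error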

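General rule in CUDA: return an error rather than silently ignore a dependency

When a dependency would connect something that is captured to something that was not captured and was instead enqueued for execution, CUDA prefers to return an error over ignoring the dependency. The one exception is moving a stream into or out of capture mode: this severs the dependency between the items added to the stream immediately before and after the mode transition.

Invalidation

When an invalid operation is attempted during capture, the associated capture graphs are invalidated. After that, any further use of the streams being captured or of the captured events associated with the graph is invalid and returns an error, until the capture is ended with cudaStreamEndCapture(). That call takes the associated streams out of capture mode, but it also returns an error value and a NULL graph.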
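The invalidation behavior can be observed directly by attempting one of the prohibited operations. The sketch below synchronizes a stream that is being captured and then checks the outcomes described above; the helper name invalid_capture_is_reported is hypothetical, and the specific error codes are intentionally not asserted.

```cuda
#include <cuda_runtime.h>

// Provokes an invalid operation during capture and reports whether the
// capture was invalidated in the way described above.
bool invalid_capture_is_reported() {
    cudaStream_t s;
    if (cudaStreamCreate(&s) != cudaSuccess) return false;
    if (cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal) != cudaSuccess)
        return false;

    // Synchronizing a stream that is being captured is invalid.
    cudaError_t syncErr = cudaStreamSynchronize(s);

    // Ending the capture leaves capture mode but reports the failure.
    cudaGraph_t graph = nullptr;
    cudaError_t endErr = cudaStreamEndCapture(s, &graph);

    // After EndCapture the stream should no longer be capturing.
    cudaStreamCaptureStatus after = cudaStreamCaptureStatusActive;
    cudaError_t statusErr = cudaStreamIsCapturing(s, &after);

    cudaGetLastError();  // clear the recorded error so later calls start clean
    cudaStreamDestroy(s);

    return syncErr != cudaSuccess && endErr != cudaSuccess && graph == nullptr
        && statusErr == cudaSuccess && after == cudaStreamCaptureStatusNone;
}
```

The practical consequence is that code running inside a capture region should check the result of cudaStreamEndCapture() before instantiating anything.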

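Capture Introspection

cudaStreamGetCaptureInfo() inspects an active capture: it returns the capture status, a capture ID that is unique within the process, the underlying graph object, and the dependencies / edge data for the next node to be captured in the stream. The dependency information can be used to obtain handles to the node(s) most recently captured in the stream.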
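A minimal query might look like the sketch below. It assumes CUDA 12 or newer, where the trailing output parameters of cudaStreamGetCaptureInfo() are optional in C++, and it requests only the status, the ID, and the graph; the dependency and edge-data outputs are left out because their position in the argument list differs between toolkit versions. The kernel noop and the helper inspect_active_capture are hypothetical names.

```cuda
#include <cuda_runtime.h>

__global__ void noop() {}

// Starts a capture, records one kernel, and queries the capture state.
// Returns true if the capture is reported active with a graph attached.
bool inspect_active_capture() {
    cudaStream_t s;
    if (cudaStreamCreate(&s) != cudaSuccess) return false;

    cudaStreamCaptureStatus status = cudaStreamCaptureStatusNone;
    unsigned long long captureId = 0;    // unique within this process
    cudaGraph_t captureGraph = nullptr;  // graph currently being built

    cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
    noop<<<1, 1, 0, s>>>();
    cudaError_t query = cudaStreamGetCaptureInfo(s, &status, &captureId, &captureGraph);

    cudaGraph_t graph = nullptr;
    cudaStreamEndCapture(s, &graph);
    if (graph) cudaGraphDestroy(graph);
    cudaStreamDestroy(s);

    return query == cudaSuccess
        && status == cudaStreamCaptureStatusActive
        && captureGraph != nullptr;
}
```

A library function that accepts a caller's stream could use the same query to detect that it is being captured and avoid prohibited calls.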

Graph Instantiation

Once a graph has been created (through the graph API or through stream capture), it must be instantiated into an executable graph before it can be launched:

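cudaGraphExec_t graphExec;
// Legacy five-argument form; CUDA 12 and later also accept cudaGraphInstantiate(&graphExec, graph, 0)
cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);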

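Instantiation takes a snapshot of the graph template, validates it, and completes the setup, producing a launchable cudaGraphExec_t.

Graph Execution

An executable graph is submitted to a given stream with cudaGraphLaunch(); instantiate once, launch many times: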

cudaGraphLaunch(graphExec, stream);

Full flow (using stream capture as the example): capture → end capture to obtain the graph → instantiate → launch:

cudaGraph_t graph;
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);  // capture mode is required
kernel_A<<< ..., stream >>>(...);
kernel_B<<< ..., stream >>>(...);
libraryCall(stream);
kernel_C<<< ..., stream >>>(...);
cudaStreamEndCapture(stream, &graph);                         // returns the captured graph
cudaGraphExec_t graphExec;
cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);       // one-time setup
cudaGraphLaunch(graphExec, stream);                           // can be repeated
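The "instantiate once, launch many times" pattern can be sketched end to end as follows, assuming CUDA 12 or newer. The kernel increment and the helper launch_many are hypothetical names; successive launches of the same executable graph into one stream run in order, so the counter ends up equal to the number of launches.

```cuda
#include <cuda_runtime.h>

__global__ void increment(int* x) { *x += 1; }

// Captures one increment, instantiates once, launches `launches` times,
// and returns the final counter value (-1 on error).
int launch_many(int launches) {
    int* d = nullptr;
    if (cudaMalloc(&d, sizeof(int)) != cudaSuccess) return -1;
    cudaMemset(d, 0, sizeof(int));

    cudaStream_t s;
    cudaStreamCreate(&s);

    // Definition: record the workflow once.
    cudaGraph_t graph = nullptr;
    cudaGraphExec_t exec = nullptr;
    cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
    increment<<<1, 1, 0, s>>>(d);
    bool ok = cudaStreamEndCapture(s, &graph) == cudaSuccess
           // Instantiation: the one-time validation and setup cost.
           && cudaGraphInstantiate(&exec, graph, 0) == cudaSuccess;

    // Execution: the same executable graph is launched repeatedly.
    for (int i = 0; ok && i < launches; ++i)
        ok = cudaGraphLaunch(exec, s) == cudaSuccess;
    ok = ok && cudaStreamSynchronize(s) == cudaSuccess;

    int host = -1;
    if (ok) cudaMemcpy(&host, d, sizeof(int), cudaMemcpyDeviceToHost);

    if (exec) cudaGraphExecDestroy(exec);
    if (graph) cudaGraphDestroy(graph);
    cudaStreamDestroy(s);
    cudaFree(d);
    return host;
}
```

Calling launch_many(100) exercises a single instantiation followed by 100 launches.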

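Only worth it with repeated launches

The instantiation cost is paid up front, on first use; for work that runs only once, the graph's setup cost may outweigh the savings. Use graphs for fixed workflows that run repeatedly.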
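As a rough way to reason about this trade-off, consider a simple linear cost model. The model and its symbols are assumptions made for illustration, not figures from the CUDA documentation: let $c_s$ be the CPU cost of submitting the workflow once through streams, $c_i$ the one-time cost of defining and instantiating the graph, $c_g$ the CPU cost of one cudaGraphLaunch, and $N$ the number of times the workflow runs. The graph comes out ahead when:

```latex
\[
  N\,c_s \;>\; c_i + N\,c_g
  \quad\Longleftrightarrow\quad
  N \;>\; \frac{c_i}{c_s - c_g}
  \qquad (c_s > c_g)
\]
```

For $N = 1$ this holds only if $c_i < c_s - c_g$, which is why a one-shot workflow may not benefit; the actual costs have to be measured for a given workload.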

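Exam / Quiz Key Points

Topic	Frequently tested / easy to get wrong
Three stages	definition / instantiation / execution; instantiate once, launch many times
Number of node types	12 in total, including empty node, conditional node, memory node, child graph, and the event / semaphore nodes
Edge data defaults	outgoing port 0 = wait on the entire task; incoming port 0 = block the entire task; type 0 = full dependency with memory synchronization
Edge data usage	The only non-default use today is PDL; only kernel nodes have additional outgoing ports; no node has additional incoming ports
Omitting edge data in a query	Returns cudaErrorLossyQuery when the ignored data is not all zero
Streams usable for capture	Not cudaStreamLegacy / the NULL stream; cudaStreamPerThread is allowed
origin stream	EndCapture must be called on the origin stream; other streams must join back to the origin, otherwise the capture fails
Leaving capture mode	The next non-captured item still depends on the previous non-captured item (the captured items in between have been removed)
Prohibited operations	Synchronizing / querying a stream being captured, device / context handles, the legacy stream, and synchronous APIs (cudaMemcpy) are all invalid
Merging across graphs	Waiting on a captured event that belongs to a different capture graph is invalid
Non-captured events	Waiting on a non-captured event from a stream being captured requires cudaEventWaitExternal
Invalidation	After invalidation the streams / events cannot be used; EndCapture still returns an error plus a NULL graph
Introspection API	cudaStreamGetCaptureInfo returns status / ID / graph / dependencies of the next node
Key APIs	cudaGraphCreate / cudaGraphAddNode / cudaStreamBeginCapture / cudaStreamEndCapture / cudaGraphInstantiate / cudaGraphLaunch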
