CUDA Graphs：記憶體節點與裝置端啟動 (Memory Nodes and Device Launch)

#cuda #async-execution #cuda-graphs #concept

重點總覽

項目	重點
Graph memory node	讓 graph 自己擁有記憶體配置；具 GPU ordered lifetime 語意，對應 cudaMallocAsync/cudaFreeAsync
固定虛擬位址	graph allocation 的 VA 在 graph 的整個生命（含多次 instantiate/launch）固定，免 graph update
兩種節點	cudaGraphNodeTypeMemAlloc（alloc node）、cudaGraphNodeTypeMemFree（free node）
三種建立法	cudaGraphAddNode 顯式建立、stream capture 擷取 cudaMallocAsync/cudaFreeAsync、child graph 內含節點
Optimized reuse	圖內：非重疊生命共用 VA；圖間：virtual aliasing 共用 physical memory
AutoFreeOnLaunch	允許帶未釋放配置重啟 graph，啟動時自動插入 async free
Physical footprint	cudaDeviceGraphMemTrim 釋還 OS；cudaDeviceGetGraphMemAttribute 查詢 reserved/used
Peer access	accessDescs 陣列指定可存取的 peer GPU；resident device 隱含可存取
Device graph launch	從裝置端啟動 graph 做 data-dependent control flow；需 cudaGraphInstantiateFlagDeviceLaunch
三種裝置啟動模式	fire-and-forget、tail launch、sibling launch（各對應一個 named stream）
Using Graph APIs	cudaGraph_t 非執行緒安全；cudaGraphExec_t 不能與自身並行
CUDA User Objects	以 refcount（似 shared_ptr）綁定 destructor，管理 graph/capture 使用的資源生命

Graph Memory Nodes：簡介與生命語意

Graph memory node 讓 graph 能建立並擁有記憶體配置，使用 GPU ordered lifetime 語意（決定何時可在裝置上存取）。這些語意讓 driver 能管理記憶體重用，並與 stream ordered 配置 API（cudaMallocAsync/cudaFreeAsync，可被 capture）一致。

graph allocation 的虛擬位址在整個 graph 生命（含重複 instantiation 與 launch）固定，因此可被圖內其他操作直接引用，即使底層 physical memory 改變也不需 graph update。
圖內生命期不重疊的配置，可共用同一塊底層 physical memory。
跨多個 graph 時，CUDA 可用 virtual aliasing 把同一 physical memory 映射到不同 VA（例如多個 graph 進同一 stream，配置都是 single-graph lifetime）。

node 生命 ≠ allocation 生命：CUDA 在 node 建立時就指派 VA（VA 在 alloc node 生命內固定），但配置內容在 free 後不保留、可能被其他配置覆寫。每次 graph 執行都視為重新建立 allocation，其生命始於 GPU 執行抵達 alloc node，終於以下任一：

GPU 執行抵達 free graph node
GPU 執行抵達 cudaFreeAsync() stream 呼叫
cudaFree() 呼叫時立即結束

Graph 銷毀不會自動釋放記憶體

銷毀含 memory node 的 graph 雖結束了 alloc node 的生命，但不會自動釋放仍存活的 graph 記憶體。必須事後用另一個 graph、或 cudaFreeAsync()/cudaFree() 釋放，否則洩漏。

alloc node ──► a ──► b ──┐
              └──► c ──┐ ├──► free node     (b, c 排在 alloc 後、free 前 → 可存取)
                       d              d 不排在 free 前 → 不可存取
   e (不依賴 alloc node)              → 不可存取（即使 free 依賴 e 也一樣）

存取 graph 記憶體的操作必須：排在 alloc node 之後、且排在 free 操作之前（用 dependency edge 保證）。

API Fundamentals：建立記憶體節點的三種方式

1) Graph Node API（顯式）：用 cudaGraphAddNode 加入 cudaGraphNodeTypeMemAlloc 節點，配置位址由 params.alloc.dptr 回傳；free 節點用 cudaGraphNodeTypeMemFree。

cudaGraphNodeParams params = { cudaGraphNodeTypeMemAlloc };
params.alloc.poolProps.allocType   = cudaMemAllocationTypePinned;
params.alloc.poolProps.location.type = cudaMemLocationTypeDevice;
params.alloc.poolProps.location.id = 0;     // resident device 0
params.alloc.bytesize = size;
cudaGraphAddNode(&allocNode, graph, NULL, NULL, 0, &params);
// kernel node 以 params.alloc.dptr 為參數，並依賴 allocNode
cudaGraphNodeParams freeNodeParams = { cudaGraphNodeTypeMemFree };
freeNodeParams.free.dptr = params.alloc.dptr;   // free node 依賴所有使用者

free node 須排在圖內所有使用該配置之後；間接依賴（b 依賴 a）即足夠，free 不必再顯式依賴 a。

2) Stream Capture：擷取 cudaMallocAsync/cudaFreeAsync，captured API 回傳的 VA 可被圖內其他操作使用；stream ordered 依賴會一併被擷取，保證正確排序（前提是 stream code 寫對）。

cudaMallocAsync(&dptr, size, stream1);
kernel_A<<<..., stream1>>>(dptr, ...);
// event 依賴會轉成 graph 依賴
cudaFreeAsync(dptr, stream1);          // free 依賴所有存取記憶體的工作
cudaStreamEndCapture(stream1, &graph);

圖外存取/釋放：graph 不必由配置它的 graph 來釋放；未被釋放的配置會存活到 graph 執行之後，可被後續 CUDA 操作存取（只要透過 event/stream ordering 排在 alloc 之後）。可由 cudaFree、cudaFreeAsync、另一個含 free node 的 graph、或帶 AutoFreeOnLaunch 的同 graph 再次 launch 來釋放。

free 必須在所有裝置操作完成後排序

因 graph allocation 可能共用底層 physical memory，free 操作必須排在所有裝置操作完成之後。kernel 內以記憶體為基礎的 out-of-band 同步不足以為「記憶體寫入 ↔ free」排序。可用 compute sanitizer 抓越界/生命期外存取（可能靜默讀寫他人活資料）。

Child graph 內的記憶體節點（CUDA 12.9）：可將 child graph 的所有權移交 parent（cudaGraphChildGraphOwnershipMove），讓含 alloc/free 節點的 child 能獨立建構後再加入。移交後 child 的限制：

不能獨立 instantiate 或銷毀
不能再被加為另一 parent 的 child graph
不能作為 cuGraphExecUpdate 的引數
不能再新增 alloc/free 節點

AutoFreeOnLaunch：帶未釋放配置重啟

正常情況下，graph 若有未釋放配置會被禁止重啟（同位址多次配置會洩漏）。以 cudaGraphInstantiateFlagAutoFreeOnLaunch instantiate 後，可在仍有未釋放配置時重啟，launch 會自動插入這些配置的 async free。

適合 single-producer/multiple-consumer：每輪 producer graph 建立數個配置，依執行期條件由不同 consumer 集合存取；consumer 無法 free（後續 consumer 可能還要用），auto free 讓 launch loop 不需追蹤 producer 的配置。

仍須手動釋放避免洩漏

AutoFreeOnLaunch 不改變 graph 銷毀的行為。即使用此 flag instantiate，應用程式仍須顯式釋放未釋放的記憶體（例：迴圈結束後對 data1/data2 呼叫 cudaFreeAsync）。

Optimized Memory Reuse

CUDA 以兩種方式重用記憶體：

層級	機制	說明
圖內（virtual + physical）	address assignment	把同一 VA range 指派給生命期不重疊的不同配置；故不同（生命不相交）配置的指標不保證唯一
圖間（physical）	virtual aliasing	不同 graph 把同一 physical memory 映射到各自獨有的 VA

圖內位址重用：新 alloc node 可重用被依賴 free node 釋放的位址；若新 alloc 不依賴某 free node，則不能重用該 free 釋放的位址，需要新位址。
Physical memory 管理與共用：CUDA 負責在 GPU order 抵達 alloc node 前，把 physical memory 映射到 VA。多個不會同時執行的 graph 可共用同一 physical memory；但若 physical page 同時綁到多個執行中 graph、或綁到尚未釋放的 graph allocation，則不可重用。CUDA 可在 instantiation/launch/execution 任何時刻更新映射，並可在未來 launch 間插入同步以避免活配置指向同一 physical memory。

Performance Considerations 與 First Launch

多個 graph 進同一 stream 時（執行不會重疊），CUDA 會嘗試配給同一 physical memory，並在多次 launch 間保留映射以省去 remap 成本。若某 graph 改成可能與其他並行（如進不同 stream），就必須 remap（並行 graph 需各自獨立記憶體以免資料毀損）。

remap 通常由以下觸發：

改變 graph 啟動的 stream
對 graph memory pool 做 trim（顯式釋放未用記憶體）
在另一 graph 的未釋放配置仍映射到同記憶體時重啟 graph

把含記憶體節點的 graph 一致地啟動到同一 stream

remap 必須依執行順序、且在該 graph 前次執行完成後進行（否則可能 unmap 仍在用的記憶體）；加上映射是 OS 呼叫，故相對昂貴。固定使用同一 stream 可避免此成本。

First Launch/cudaGraphUpload：instantiation 時 stream 未知，無法配置/映射 physical memory，映射改在 launch 時做。呼叫 cudaGraphUpload 可把配置成本與 launch 分離——立即完成所有映射並把 graph 關聯到 upload stream；之後若 launch 進同一 stream，即無額外 remap。upload 與 launch 用不同 stream 的行為類似換 stream（很可能 remap），且不相關的 pool 管理可能從 idle stream 抽走記憶體而抵銷 upload 效益。

Physical Memory Footprint

非同步配置的 pool 管理意味著：銷毀含 memory node 的 graph（即使配置已 free）不會立刻把 physical memory 還給 OS。

API/Attribute	作用
cudaDeviceGraphMemTrim	unmap 並釋放未在用的 graph 預留 physical memory 還給 OS（下次 launch 會重新配置/映射）
cudaMemPoolTrimTo	不同 pool；graph memory pool 不暴露給 stream ordered allocator
cudaGraphMemAttrReservedMemCurrent	driver 為本 process graph 配置預留的 physical memory 量
cudaGraphMemAttrUsedMemCurrent	目前被至少一個 graph 映射的 physical memory 量

trim 不影響：未釋放的配置、以及已排程或執行中的 graph（視為正在使用）。
兩個 attribute 可用 cudaDeviceGetGraphMemAttribute 查詢，用以追蹤新 physical memory 取得時機、評估共用機制省下多少記憶體。

Peer Access

graph allocation 可設定供多個 GPU 存取，CUDA 會按需把配置映射到 peer GPU。當需要不同映射的配置重用同一 VA 時，該位址範圍會被映射到所有需要的 GPU——這代表配置有時允許比建立時請求更多的 peer access，但依賴這些額外映射仍屬錯誤。

Graph Node API：在 alloc node 參數的 accessDescs 陣列指定映射請求；poolProps.location 指定 resident device。對 resident device 的存取被假定需要，故不必在 accessDescs 中列入該 device。

allocNodeParams.alloc.poolProps.location.id = 1;   // resident device 1
accessDescs[0].flags = cudaMemAccessFlagsProtReadWrite;
accessDescs[0].location.type = cudaMemLocationTypeDevice;
accessDescs[0].location.id = 0;                     // 請求 device 0
accessDescs[1].location.id = 2;                     // 請求 device 2
allocNodeParams.accessDescCount = 2;
allocNodeParams.accessDescs = accessDescs;
// 結果：可從 device 0、1、2 存取（0、2 來自 desc，1 為 resident）

add node API 只支援 ReadWrite 與 Device 存取

accessDescs 經 add node API 僅支援 cudaMemAccessFlagsProtReadWrite 與 cudaMemLocationTypeDevice。

Stream Capture：alloc node 記錄擷取當下 pool 的 peer accessibility；capture 之後再改 pool 的 accessibility（cudaMemPoolSetAccess）不影響 graph 為該配置所做的映射。

Device Graph Launch：裝置端啟動

許多工作流程需在執行期做 data-dependent 決策。與其卸載給 host（可能需裝置往返），不如在裝置上完成。CUDA 提供從裝置 launch graph 的機制，方便做動態控制流（從簡單迴圈到裝置端 work scheduler）。

device graph：可從裝置 launch 的 graph；host graph：只能從 host launch。device graph 可從 host 與 device 兩端 launch。
在前次 launch 仍執行時從裝置再 launch 同一 device graph → 回傳 cudaErrorInvalidValue（裝置端不能同時 launch 同一 graph 兩次）。
從 host 與 device 同時 launch 同一 device graph → undefined behavior。

Device Graph Creation 與要求

需以 cudaGraphInstantiateFlagDeviceLaunch 傳給 cudaGraphInstantiate() 顯式為裝置啟動 instantiate。與 host graph 一樣，結構在 instantiation 時固定，不重新 instantiate 不能更新，且只能在 host 上 instantiate。要求：

類別	要求
General	所有節點須在單一 device；只能含 kernel、memcpy、memset、child graph 節點
Kernel	不可使用 CUDA Dynamic Parallelism；cooperative launch 允許（除非用 MPS）
Memcpy	僅 device memory 與/或 pinned device-mapped host memory；不可含 CUDA array；兩運算元在 instantiation 時須對 current device 可存取（copy 由 graph 所在 device 執行，即使目標在他 device）

Device Graph Upload：launch 前須先 upload 以填裝置資源，三種方式：(1) cudaGraphUpload()；(2) instantiation 時用 cudaGraphInstantiateWithParams() 請求 upload（cudaGraphInstantiateFlagUpload + uploadStream）；(3) 先從 host launch，隱式完成 upload。

Device Graph Update：只能從 host 更新，更新後須重新 upload 才生效。更新進行中從裝置 launch → undefined behavior。

Device Launch 與三種模式

device graph 從 host 與 device 皆以 cudaGraphLaunch() 啟動（裝置端簽章相同、同一 handle）。從裝置 launch 時必須從另一個 graph 內啟動。裝置端 launch 是 per-thread，多執行緒可同時 launch，使用者需自選單一執行緒來 launch 給定 graph。device graph 不能進一般 CUDA stream，只能進下列具名 stream（各代表一種啟動模式）：

Stream	模式	語意
cudaStreamGraphFireAndForget	Fire and forget	立即提交，獨立於 launching graph 執行；child 環境
cudaStreamGraphTailLaunch	Tail launch	在 launching graph 環境完成後執行（序列依賴）
cudaStreamGraphFireAndForgetAsSibling	Sibling	作為 launching graph 的父環境的 child（等同從父環境 fire-and-forget）

Fire and Forget：立即提交、獨立執行；launching graph 為 parent，launched graph 為 child。一個 graph 在其執行期間最多 120 個 fire-and-forget graph（每次同 parent graph 重啟會重置此計數）。

__global__ void launchFireAndForgetGraph(cudaGraphExec_t graph) {
    cudaGraphLaunch(graph, cudaStreamGraphFireAndForget);
}
// device graph 須先 cudaGraphInstantiate(..., cudaGraphInstantiateFlagDeviceLaunch) 並 cudaGraphUpload

Execution Environments：graph 從裝置 launch 時進入自己的 execution environment，封裝該 graph 全部工作與其所有 fire-and-forget 工作；graph 在自身完成且所有 child 工作完成時才算 complete。環境是階層式的（可有多層 child 環境）。從 host launch 時，存在一個 stream environment 作為 launched graph 執行環境的 parent，封裝整體 launch 的所有工作；stream environment 標記完成時，下游依賴工作方可執行。

stream environment (host launch)
└── graph A environment
    ├── A 的工作 + fire-and-forget 工作
    └── child fire-and-forget environment ...
        tail launch：A 環境 complete 後，下一 tail graph 環境取代 A 成為 parent 的 child

Tail Launch：裝置端無法以 cudaDeviceSynchronize()/cudaStreamSynchronize() 與 device graph 同步；tail launch 提供類似的序列依賴。tail launch 在 graph 環境 complete（graph 及其所有 child 都完成）時執行；一個 graph 可排多個 tail launch。

同一 graph 排的多個 tail launch 依入列順序逐一執行。
tail graph 排的 tail launch 會先於前面 graph 排的 tail launch 執行（新的這些再依入列順序）。
一個 graph 最多 255 個 pending tail launch。

Tail Self-launch：device graph 可把自己排入 tail launch，但同時最多只能有一個 self-launch。用 cudaGraphExec_t cudaGetCurrentGraphExec(); 取得當前執行中的 device graph handle（若當前 kernel 不在 device graph 內則回傳 NULL）。

__global__ void relaunchSelf() {
    if (threadIdx.x == 0 && relaunchCount < relaunchMax)
        cudaGraphLaunch(cudaGetCurrentGraphExec(), cudaStreamGraphTailLaunch);
    relaunchCount++;
}

Sibling Launch：fire-and-forget 的變體，launched graph 不是 launching graph 環境的 child，而是其父環境的 child（等同從父環境 fire-and-forget）。因不在 launching graph 的環境內，sibling launch 不會 gate launching graph 排的 tail launch。

Using Graph APIs

規則	說明
cudaGraph_t 非執行緒安全	使用者須確保多執行緒不並行存取同一 cudaGraph_t
cudaGraphExec_t 不能與自身並行	一次 launch 會排在同一 executable graph 的前次 launch 之後
stream 僅用於排序	graph 在 stream 中執行只為與其他非同步工作排序；不限制圖內並行度、也不影響節點在何處執行

CUDA User Objects

CUDA User Object 用來管理非同步工作所用資源的生命期，對 CUDA Graphs 與 stream capture 特別有用。許多資源管理方案（如 event-based pool、synchronous-create/asynchronous-destroy）與 graph 不相容，因為資源指標/handle 不固定（需 indirection 或 graph update），且每次提交都需同步 CPU 程式碼，亦會在 capture 期間使用不被允許的 API。

User object 把使用者指定的 destructor callback 與內部 refcount 綁定（似 C++ shared_ptr）。參考可由 CPU user code 與 CUDA graph 持有。

User-owned reference 須手動追蹤

不同於 C++ smart pointer，user-owned reference 沒有代表它的物件，使用者須手動追蹤。典型用法是建立 user object 後立刻把唯一的 user-owned reference 移交給 graph。

reference 與 graph 關聯後，CUDA 自動管理 graph 操作：

clone 的 cudaGraph_t 保留來源每個 reference 的副本（含相同重數）。
instantiate 的 cudaGraphExec_t 保留來源 cudaGraph_t 每個 reference 的副本。
cudaGraphExec_t 未經同步即被銷毀時，reference 會保留到執行完成。

cudaUserObjectCreate(&cuObject, object,
    1,                              // 初始 refcount
    cudaUserObjectNoDestructorSync);// callback 不可經 CUDA 等待
cudaGraphRetainUserObject(graph, cuObject, 1,
    cudaGraphUserObjectMove);       // 移交呼叫者持有的 reference（不改總 refcount）
// graphExec instantiate 會保留一份 reference；graph 銷毀後 graphExec 仍持有
// launch 後 cudaGraphExecDestroy：未同步則延後 release
// cudaStreamSynchronize 後最後一個 reference 被 release，destructor 非同步執行

child graph node 內 graph 持有的 reference 關聯到 child graph（非 parent）。child 被更新/刪除時 reference 隨之變動。
以 cudaGraphExecUpdate 或 cudaGraphExecChildGraphNodeSetParams 更新時，新來源 graph 的 reference 被 clone 並取代目標 graph 的 reference；若前次 launch 未同步，待釋放的 reference 會保留到 launch 完成。

destructor 內不可呼叫 CUDA API

目前無 CUDA API 可等待 user object destructor；使用者可在 destructor 內手動 signal 同步物件。destructor 內不可呼叫 CUDA API（同 cudaLaunchHostFunc 的限制），以免阻塞 CUDA 內部共享執行緒、妨礙 forward progress。可 signal 另一執行緒去做 API 呼叫（須為單向依賴、且該執行緒不會阻塞 CUDA 工作的 forward progress）。

考試/測驗重點

主題	必記重點
固定 VA	graph allocation 的虛擬位址在整個 graph 生命固定，底層 physical memory 變動不需 graph update
生命期	allocation 生命始於 GPU 抵達 alloc node，終於 free node / cudaFreeAsync / 立即 cudaFree
銷毀不釋放	graph 銷毀不自動釋放活配置；AutoFreeOnLaunch 也不改變銷毀行為
node type	cudaGraphNodeTypeMemAlloc / cudaGraphNodeTypeMemFree；dptr 在 alloc.dptr / free.dptr
free 排序	free 須排在所有裝置操作完成後；kernel 內記憶體同步不足以排序
重用	圖內共用 VA（指標可能不唯一）；圖間 virtual aliasing 共用 physical memory
remap 觸發	換 stream / trim / 在他 graph 未釋放配置映射時重啟
trim / 查詢	cudaDeviceGraphMemTrim（異於 cudaMemPoolTrimTo）；cudaGraphMemAttrReservedMemCurrent / UsedMemCurrent
upload	cudaGraphUpload 把映射成本與 launch 分離；instantiation 時無法映射（stream 未知）
peer access	accessDescs 不含 resident device；add node 僅支援 ReadWrite + Device
device graph flag	cudaGraphInstantiateFlagDeviceLaunch；只能 host 端 instantiate / update
device graph 限制	單一 device；僅 kernel/memcpy/memset/child；禁 CDP；cooperative 除非 MPS；memcpy 禁 CUDA array
同時 launch	裝置端重複 launch → cudaErrorInvalidValue；host+device 同時 → undefined behavior
啟動模式	FireAndForget / TailLaunch / FireAndForgetAsSibling（各對應一具名 stream）
數量上限	fire-and-forget 最多 120（每次重啟重置）；pending tail launch 最多 255
tail 順序	tail graph 排的 tail launch 先於前面 graph 排的；self-launch 同時僅一個（cudaGetCurrentGraphExec）
sibling	等同從父環境 fire-and-forget；不 gate launching graph 的 tail launch
thread safety	cudaGraph_t 非執行緒安全；cudaGraphExec_t 不能與自身並行
user object	refcount 似 shared_ptr；clone/instantiate 保留 reference 副本；destructor 內禁呼叫 CUDA API

重點總覽

Graph Memory Nodes：簡介與生命語意

API Fundamentals：建立記憶體節點的三種方式

AutoFreeOnLaunch：帶未釋放配置重啟

Optimized Memory Reuse

Performance Considerations 與 First Launch

Physical Memory Footprint

Peer Access

Device Graph Launch：裝置端啟動

Device Graph Creation 與要求

Device Launch 與三種模式

Using Graph APIs

CUDA User Objects

考試/測驗重點

Related Notes