Graphics Interoperability (OpenGL/Direct3D/SLI)

#cuda #cuda-platform #graphics-interop #concept

重點總覽

CUDA 與其他 API（圖形 API、Vulkan 等）互通有兩條路徑：直接法（Graphics Interoperability，把 OpenGL/Direct3D 資源直接映射進 CUDA 位址空間）與更彈性的 External resource interoperability（透過 OS 層 handle 匯入/匯出記憶體與同步物件）。本筆記聚焦前者。

項目	重點
適用對象	OpenGL 與 Direct3D（D3D9/10/11，不含 D3D12）的 buffer、texture、surface、renderbuffer
核心型別	`struct cudaGraphicsResource`，註冊後持有 device pointer 或 CUDA array
基本流程	register → map → 取 pointer/array → kernel 使用 → unmap → unregister
取得位址	buffer 用 `cudaGraphicsResourceGetMappedPointer()`；array 用 `cudaGraphicsSubResourceGetMappedArray()`
OpenGL 註冊	buffer 用 `cudaGraphicsGLRegisterBuffer()`（→device pointer）；texture/renderbuffer 用 `cudaGraphicsGLRegisterImage()`（→CUDA array）
Direct3D 註冊	`cudaGraphicsD3D11RegisterResource()`，僅支援 `D3D_DRIVER_TYPE_HARDWARE` 裝置
成本提示	register 昂貴，理想上每資源每 context 只做一次；map/unmap 可任意次數
SLI	僅支援 explicit SLI；每 GPU 各建 context、各自註冊資源

唯一事實來源

本筆記只涵蓋原文 4.19 與 4.19.1（OpenGL / Direct3D / SLI 互通）。External resource interoperability（OS handle 匯入/匯出，支援 Direct3D11/12、Vulkan、NVSCI）屬於另一概念，見 - 04-CUDA-Features/24-External-Resource-Interop。

兩種互通概念與資源生命週期

在用 CUDA 存取 Direct3D 或 OpenGL 資源（例如 VBO，vertex buffer object）前，必須先註冊（register）並映射（map）。註冊會回傳 struct cudaGraphicsResource，內部持有一個 CUDA device pointer 或 array。kernel 要存取裝置資料前，資源必須先被 map；只要資源仍處於 registered 狀態，就可以反覆 map / unmap 任意次數。用完後再 unregister。

六個主要步驟：

Register the graphics buffer with CUDA
Map the resource
Access the device pointer or array of the mapped resource
Use device pointer or array in a CUDA kernel
Unmap the resource
Unregister the resource

 register                  map                      kernel use
   (一次)                (可多次)        ┌──────────────────────────┐
GL/D3D 資源 ──► [registered] ──► [mapped] ──► GetMappedPointer/Array
                    ▲   │            │              │
                    │   │            │              ▼
                    │   └──unmap─────┘          CUDA kernel
                    │
                 unregister ◄── (不再需要時)

map 時取位址：buffer → cudaGraphicsResourceGetMappedPointer()、array → cudaGraphicsSubResourceGetMappedArray()。
cudaGraphicsResourceSetMapFlags() 可指定使用提示（write-only / read-only），讓 driver 做資源管理最佳化。

map 狀態下的存取規則

當資源仍處於 mapped 狀態時，若透過 OpenGL、Direct3D 或另一個 CUDA context 去存取它，結果是 undefined。map 區間應視為 CUDA 獨佔。

成本與 per-context 註冊

OpenGL Interoperability

可映射進 CUDA 位址空間的 OpenGL 資源為 buffer、texture、renderbuffer 物件。

cudaGraphicsGLRegisterBuffer()：註冊 buffer object，在 CUDA 端呈現為一般的 device pointer。
cudaGraphicsGLRegisterImage()：註冊 texture 或 renderbuffer，在 CUDA 端呈現為 CUDA array。
若 texture/renderbuffer 以 cudaGraphicsRegisterFlagsSurfaceLoadStore 旗標註冊，則可被寫入。
cudaGraphicsGLRegisterImage() 支援所有 1、2、4 通道、內部型別為 float（如 GL_RGBA_FLOAT32）、normalized integer（如 GL_RGBA8、GL_INTENSITY16）、unnormalized integer（如 GL_RGBA8UI）的格式。

simpleGL 範例的核心：kernel 動態改寫存於 VBO 的 width × height 頂點網格。

// createVBO：建立 VBO 並向 CUDA 註冊（步驟 1）
glGenBuffers(1, vbo);
glBindBuffer(GL_ARRAY_BUFFER, *vbo);
glBufferData(GL_ARRAY_BUFFER, size, 0, GL_DYNAMIC_DRAW);
cudaGraphicsGLRegisterBuffer(vbo_res, *vbo, vbo_res_flags);

註冊時可帶 cudaGraphicsMapFlagsWriteDiscard（此範例只寫不讀），讓 driver 知道不需保留舊內容。

// display()：每幀 map → 取 pointer → kernel → unmap（步驟 2~5）
cudaGraphicsMapResources(1, &cuda_vbo_resource, 0);
cudaGraphicsResourceGetMappedPointer((void**)&dptr, &num_bytes, cuda_vbo_resource);
simple_vbo_kernel<<<grid, block>>>(dptr, mesh_width, mesh_height, g_fAnim);
cudaGraphicsUnmapResources(1, &cuda_vbo_resource, 0);
// 之後再用 glDrawArrays 等 OpenGL 呼叫把更新後的頂點畫出

kernel 直接把 dptr 當 float4* 寫入頂點位置（sine 波形）；unmap 後 OpenGL 才接手繪製，符合上一節「map 區間 CUDA 獨佔」原則。

// deleteVBO：unregister 後再刪除 GL buffer（步驟 6）
cudaGraphicsUnregisterResource(vbo_res);
glDeleteBuffers(1, vbo);

OpenGL 互通限制

進行任何 OpenGL 互通 API 呼叫時，被共享資源所屬的 OpenGL context 必須是目前 host thread 的 current context。
一旦 OpenGL texture 被設為 bindless（例如以 glGetTextureHandle / glGetImageHandle 取得 handle），它就無法再向 CUDA 註冊。必須先註冊互通，再去取得 image/texture handle。

Direct3D Interoperability

Direct3D 互通支援 Direct3D9、Direct3D10、Direct3D11，但不支援 Direct3D12（原文聚焦 D3D11；D3D9/D3D10 細節請參考 CUDA Programming Guide 12.9）。可映射進 CUDA 位址空間的 Direct3D 資源為 buffer、texture、surface，以 cudaGraphicsD3D11RegisterResource() 註冊。

CUDA context 只能與 DriverType 設為 D3D_DRIVER_TYPE_HARDWARE 所建立的 Direct3D11 裝置互通。

simpleD3D11Texture 範例（CUDA 側）：kernel cuda_kernel_texture_2d 在閃爍藍底上畫移動的紅/綠交織紋路，且會讀取也會寫入 texture，因此註冊時不可用 WriteDiscard。底層是 2D CUDA array，列偏移由 pitch 決定。

// 註冊 D3D11 texture：因會讀又會寫，不設特殊 map flag
cudaGraphicsD3D11RegisterResource(&g_texture_2d.cudaResource,
                                  g_texture_2d.pTexture,
                                  cudaGraphicsRegisterFlagsNone);
// CUDA 無法直接寫入 texture：它被視為 cudaArray、只能以 texture 映射，
// 故另配一塊 linear memory 供 CUDA 寫入，pitch 對齊 D3D 列 pitch
cudaMallocPitch(&g_texture_2d.cudaLinearMemory, &g_texture_2d.pitch,
                g_texture_2d.width * sizeof(float) * 4, g_texture_2d.height);

重點：D3D texture 映射為 CUDA array 後無法被 kernel 直接寫入，故通常另配 cudaMallocPitch 的線性記憶體，CUDA 算完再拷回；CUDA array 的 width/height/pitch 要與 D3D 像素格式（此例 DXGI_FORMAT_R32G32B32A32_FLOAT）相符。

// 渲染迴圈：一次 map 多個資源 → 跑 kernel → unmap（可帶 stream）
cudaGraphicsResource *ppResources[3] = {
    g_texture_2d.cudaResource, g_texture_3d.cudaResource,
    g_texture_cube.cudaResource };
cudaGraphicsMapResources(nbResources, ppResources, stream);
RunKernels();
cudaGraphicsUnmapResources(nbResources, ppResources, stream);

cudaGraphicsMapResources / cudaGraphicsUnmapResources 可一次處理多個資源，並接受 cudaStream_t 做串流排序。
收尾：cudaGraphicsUnregisterResource() 解註冊，再 cudaFree() 釋放輔助的 linear memory。

OpenGL vs Direct3D 映射心智模型

不論 GL 或 D3D：buffer 類 → CUDA device pointer（可直接讀寫）；texture/surface 類 → CUDA array（多半唯讀，要寫常需 SurfaceLoadStore 旗標或另配 linear memory 中轉）。

Interoperability in SLI configuration

在多 GPU 系統中，所有 CUDA-enabled GPU 一般會被 driver/runtime 視為獨立裝置。但在 SLI 模式下不同：SLI 是硬體層的 multi-GPU 設定，把渲染工作分散到多顆 GPU 以提升效能。

Implicit SLI（driver 自行假設）已不再支援；目前僅支援 explicit SLI——應用程式透過 Vulkan/DirectX/GL 等 API 自行得知並管理 SLI group 內所有裝置的 SLI 狀態。

SLI 模式下的特別考量：

考量	說明
記憶體放大	在 SLI 設定中，某一 GPU 上的 CUDA 配置會在其他同組 GPU 上一併消耗記憶體，故配置可能比預期更早失敗
多 context	建議每個 GPU 各建一個 CUDA context（非強制），以避免裝置間不必要的資料傳輸
找對裝置	用 `cudaD3D[9\|10\|11]GetDevices()`（D3D）/ `cudaGLGetDevices()`（GL）查出負責 current/next frame 渲染的 CUDA device handle，`deviceList` 設為 `...DeviceListCurrentFrame` 等
資源綁裝置	`cudaGraphicsD3D[9\|10\|11]RegisterResource` 與 `cudaGraphicsGLRegister[Buffer\|Image]` 回傳的資源只能在註冊它的那顆裝置上使用

SLI group (explicit)
 ┌──────────┐   ┌──────────┐
 │  GPU 0   │   │  GPU 1   │   不同 frame 在不同 GPU 計算
 │ context0 │   │ context1 │   → 必須各自註冊資源
 │ resource0│   │ resource1│   （資源不可跨裝置使用）
 └──────────┘   └──────────┘
   frame N         frame N+1

跨 frame 分散運算時的註冊

當不同 frame 的資料在不同 CUDA 裝置上計算時，必須為每個裝置分別註冊資源；無法把在裝置 A 註冊的 graphics resource 拿到裝置 B 使用。

考試/測驗重點

主題	常考點
生命週期順序	register → map → GetMappedPointer/Array → kernel → unmap → unregister；map/unmap 可多次，register/unregister 各一次
取位址 API	buffer → cudaGraphicsResourceGetMappedPointer()；array → cudaGraphicsSubResourceGetMappedArray()
OpenGL 註冊對應	RegisterBuffer → device pointer；RegisterImage → CUDA array（texture/renderbuffer）
可寫 texture	需 cudaGraphicsRegisterFlagsSurfaceLoadStore 旗標才能寫入 image
Direct3D 版本	支援 D3D9/10/11，不支援 D3D12；context 僅與 D3D_DRIVER_TYPE_HARDWARE 互通
map 期間規則	mapped 狀態下用 GL/D3D/另一 context 存取 = undefined results
per-context 註冊	register 昂貴，每資源每 context 各註冊一次，不可跨 context 共用
bindless 限制	已 bindless 的 GL texture 無法註冊；須先註冊再取 handle
WriteDiscard 用途	只寫不讀可用 WriteDiscard；又讀又寫（如 D3D 範例）不可設
SLI 模式	只支援 explicit SLI；配置會放大其他 GPU 記憶體；資源綁定註冊裝置
SLI 找裝置	cudaD3D[9/10/11]GetDevices()、cudaGLGetDevices() + DeviceListCurrentFrame
兩種互通法	Graphics Interop（直接映射）vs External resource interop（OS handle，更彈性、含 Vulkan/NVSCI/D3D12）

重點總覽

兩種互通概念與資源生命週期

OpenGL Interoperability

Direct3D Interoperability

Interoperability in SLI configuration

考試/測驗重點

Related Notes