Interprocess Communication (IPC)

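Key Points Overview

Item	Key point
Core problem	Device pointers and event handles are valid only inside the process that created them; another process cannot reference them directly
Solution	Use the IPC API or the VMM API to create a process-portable handle, exchange it through an OS IPC mechanism, then convert it back into a process-local pointer
Two approaches	Legacy CUDA IPC API, Virtual Memory Management (VMM) IPC API
Legacy IPC API	`cudaIpcGetMemHandle()` obtains the handle and `cudaIpcOpenMemHandle()` restores a pointer; events have corresponding entry points
Legacy limitations	Linux only, no `cudaMallocManaged` support, both sides should use the same driver/runtime
Sub-allocation risk	`cudaMalloc()` may sub-allocate from a larger block, and IPC shares the entire underlying block; a 2 MiB-aligned size is recommended
VMM IPC	Creates IPC-shareable allocations and supports multiple operating systems through OS-specific handle structures
Multi-node	Multi-node NVLink clusters use "fabric" handles to communicate across OS instances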

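The Basic IPC Problem and Model

GPUs managed by different host processes communicate through interprocess communication (IPC) APIs and IPC-shareable memory buffers. The owning process creates a process-portable handle, and the other side uses it to obtain a process-local device pointer to the peer GPU's device memory.

A device memory pointer or event handle created by any host thread can be referenced directly by any other thread in the same process.
These pointers and event handles are not valid outside the process that created them, so threads belonging to another process cannot reference them directly.
To access device memory and CUDA events across processes, an application must use the CUDA IPC API or the VMM API to create process-portable handles.
The handle is exchanged through a standard host OS IPC mechanism (for example interprocess shared memory or a file). The receiving process then uses the CUDA IPC / VMM API to recover a process-local device pointer, which it uses exactly as it would a pointer created within a single process.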

Process A (owner)                         Process B (consumer)
  cudaMalloc -> devPtr_A                    (no pointer yet)
       |                                          ^
  cudaIpcGetMemHandle(devPtr_A)                   |
       |  IPC handle                              |
       +----> OS IPC (shm / file) ----------------+
                                            cudaIpcOpenMemHandle(handle)
                                                  -> devPtr_B (process-local)

Important

The handle is the portable key; the pointer is not. Always exchange the handle and let the other side open its own pointer inside its own process. Never pass a raw device pointer between processes.

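Multi-Node Extension

The portable-handle approach used within a single node and a single OS instance also applies to peer-to-peer communication among GPUs in multi-node NVLink-connected clusters.
In the multi-node case, the communicating GPUs are managed by processes running in independent OS instances on each node, which requires an additional abstraction above the OS-instance level.
The solution is to create and exchange fabric handles, then obtain, inside each participating process and OS instance, a process-local device pointer that corresponds to the multi-node rank.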
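
As a rough sketch of the fabric-handle exchange described above, and assuming the handle is produced through the VMM Driver API (which this section does not spell out), the export and import steps might look like the following. The helper names `exportFabricHandle` and `importFabricHandle` are hypothetical, the allocation is assumed to have been created with `requestedHandleTypes` set to `CU_MEM_HANDLE_TYPE_FABRIC`, and the transport that carries the handle bytes between nodes is left to the application.

```cpp
#include <cuda.h>

// Hypothetical helper: export an existing VMM allocation as a fabric handle.
// Assumes `alloc` was created by cuMemCreate with
// prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_FABRIC.
CUresult exportFabricHandle(CUmemGenericAllocationHandle alloc,
                            CUmemFabricHandle* out) {
    return cuMemExportToShareableHandle(out, alloc,
                                        CU_MEM_HANDLE_TYPE_FABRIC, 0);
}

// Hypothetical helper: on another node, turn the received handle bytes back
// into a local allocation handle. Mapping it into the address space
// (reserve, map, set access) is a separate step.
CUresult importFabricHandle(CUmemFabricHandle* received,
                            CUmemGenericAllocationHandle* alloc) {
    return cuMemImportFromShareableHandle(alloc, received,
                                          CU_MEM_HANDLE_TYPE_FABRIC);
}
```

Unlike an OS-specific handle, the fabric handle is a plain block of bytes, so it can travel over whatever channel the job already uses between nodes.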

IPC using the Legacy Interprocess Communication API (4.15.1)

To share device memory pointers and events across processes, use the CUDA Interprocess Communication API:

cudaIpcGetMemHandle(): obtains the IPC handle for a given device memory pointer.
The IPC handle is passed to another process through a standard host OS IPC mechanism (interprocess shared memory or a file).
cudaIpcOpenMemHandle(): converts the IPC handle back into a device pointer that is valid inside the receiving process.
Event handles are shared through similar entry points.

// Process A: obtain the handle and send it
cudaIpcMemHandle_t handle;
cudaIpcGetMemHandle(&handle, devPtr);   // devPtr comes from cudaMalloc
// ... pass the handle to Process B through shm/file ...

// Process B: restore a locally usable device pointer
void* peerPtr;
cudaIpcOpenMemHandle(&peerPtr, handle, cudaIpcMemLazyEnablePeerAccess);
// ... use peerPtr, then unmap it when done ...
cudaIpcCloseMemHandle(peerPtr);

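Typical use case: a single primary process generates a batch of input data and multiple secondary processes consume it directly, without regenerating or copying the data.

Tip

Key point of the code above: the get side turns a physical allocation into a portable handle, and the open side maps that handle to a pointer in its own process. Both sides share the same underlying device memory, not a copy of the data.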
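
The event entry points mentioned earlier follow the same get/open pattern. A minimal sketch, assuming the hypothetical helper names `exportEvent` and `openPeerEvent` and leaving the handle transport to the application:

```cpp
#include <cuda_runtime.h>

// Process A (hypothetical helper): create an event that can be shared and
// export its IPC handle. Interprocess events must be created with
// cudaEventInterprocess together with cudaEventDisableTiming.
cudaError_t exportEvent(cudaEvent_t* event, cudaIpcEventHandle_t* handle) {
    cudaError_t err = cudaEventCreateWithFlags(
        event, cudaEventInterprocess | cudaEventDisableTiming);
    if (err != cudaSuccess) return err;
    return cudaIpcGetEventHandle(handle, *event);
}

// Process B (hypothetical helper): open the received handle as a local
// event. The caller can pass it to cudaStreamWaitEvent() and should release
// it with cudaEventDestroy() when it is no longer needed.
cudaError_t openPeerEvent(cudaIpcEventHandle_t handle, cudaEvent_t* peerEvent) {
    return cudaIpcOpenEventHandle(peerEvent, handle);
}
```

Process B would typically call `cudaStreamWaitEvent(stream, peerEvent, 0)` before launching work that reads the shared buffer. Waiting on an event that has not been recorded yet returns immediately, so the two processes still need a host-side signal that Process A has called `cudaEventRecord()`.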

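Limitations and Caveats of the Legacy IPC API

Warning

The following are important exceptions and limitations when using the Legacy IPC API:

Supported on Linux only.
cudaMallocManaged allocations are not supported.
Communicating applications should be compiled, linked, and run with the same CUDA driver and runtime.

Sub-allocation risk: for performance reasons, cudaMalloc() may sub-allocate from a larger block of memory. In that case the CUDA IPC API shares the entire underlying memory block, which can expose other sub-allocations as well and cause information disclosure between processes.
- Mitigation: share only allocations whose size is 2 MiB aligned.
Tegra / embedded platforms: on L4T and embedded Linux Tegra devices (compute capability 7.x and higher), only the IPC event-sharing API is supported; the IPC memory-sharing API is not supported on Tegra platforms.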
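
The 2 MiB recommendation above comes down to rounding the requested size up before allocating. A minimal sketch, assuming a hypothetical helper named `roundUpTo2MiB`:

```cpp
#include <cstddef>

// 2 MiB expressed in bytes.
constexpr std::size_t kTwoMiB = 2u * 1024u * 1024u;

// Round a requested byte count up to the next multiple of 2 MiB.
// A request of 0 stays 0.
constexpr std::size_t roundUpTo2MiB(std::size_t bytes) {
    return ((bytes + kTwoMiB - 1) / kTwoMiB) * kTwoMiB;
}
```

A buffer intended for sharing would then be allocated as `cudaMalloc(&devPtr, roundUpTo2MiB(requestedBytes))`. This follows the recommendation as stated; it is not presented here as a guarantee about how the allocator behaves internally.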

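Aspect	Legacy IPC API behavior
Platform	Linux only
Managed memory	`cudaMallocManaged` not supported
Driver / runtime	Should match on both sides
Sub-allocation	Shares the entire underlying block; a 2 MiB-aligned size is recommended
Tegra (L4T / cc 7.x+)	Event sharing only, no memory sharing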

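IPC using the Virtual Memory Management API (4.15.2)

The CUDA Virtual Memory Management API can create IPC-shareable memory allocations, and it supports multiple operating systems through OS-specific IPC handle data structures.

Compared with the Legacy IPC API, the VMM API gives fine-grained, per-allocation control over peer accessibility and sharing at the time the memory is allocated.
The cost is that it requires the CUDA Driver API (VMM entry points such as cuMemCreate) rather than the Runtime API.
Because it uses OS-specific handle structures, VMM IPC is not restricted to Linux the way Legacy IPC is and can work across multiple operating systems (see the VMM chapter for details).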
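
To make the per-allocation control concrete, here is a hedged sketch of the Driver API flow on Linux, using a POSIX file descriptor as the OS-specific handle. The helper names `createAndExport` and `importAndMap` are hypothetical. The sketch assumes `cuInit()` has been called and a context is current in each process, and that the file descriptor is transferred by a mechanism that can pass descriptors between processes (for example a Unix domain socket), since the integer value alone is not meaningful in another process.

```cpp
#include <cuda.h>
#include <cstddef>
#include <cstdint>

// Exporter (hypothetical helper): create a shareable physical allocation on
// `device` and export it as a POSIX file descriptor.
CUresult createAndExport(int device, size_t requestedBytes,
                         CUmemGenericAllocationHandle* alloc,
                         size_t* paddedBytes, int* fd) {
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = device;
    // Sharing is requested per allocation, at creation time.
    prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR;

    size_t granularity = 0;
    CUresult rc = cuMemGetAllocationGranularity(
        &granularity, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    if (rc != CUDA_SUCCESS) return rc;
    // cuMemCreate requires a size that is a multiple of the granularity.
    *paddedBytes = ((requestedBytes + granularity - 1) / granularity) * granularity;

    rc = cuMemCreate(alloc, *paddedBytes, &prop, 0);
    if (rc != CUDA_SUCCESS) return rc;
    return cuMemExportToShareableHandle(
        fd, *alloc, CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR, 0);
}

// Importer (hypothetical helper): turn a received descriptor into a mapped,
// read/write device pointer for `device`.
CUresult importAndMap(int fd, int device, size_t paddedBytes, CUdeviceptr* ptr) {
    CUmemGenericAllocationHandle alloc;
    CUresult rc = cuMemImportFromShareableHandle(
        &alloc, reinterpret_cast<void*>(static_cast<uintptr_t>(fd)),
        CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR);
    if (rc != CUDA_SUCCESS) return rc;

    rc = cuMemAddressReserve(ptr, paddedBytes, 0, 0, 0);
    if (rc != CUDA_SUCCESS) return rc;
    rc = cuMemMap(*ptr, paddedBytes, 0, alloc, 0);
    if (rc != CUDA_SUCCESS) return rc;

    // Access is granted explicitly, per mapping and per device.
    CUmemAccessDesc access = {};
    access.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    access.location.id = device;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    return cuMemSetAccess(*ptr, paddedBytes, &access, 1);
}
```

The exporter also needs its own reserve, map, and set-access steps before it can use the memory. Cleanup (`cuMemUnmap`, `cuMemAddressFree`, `cuMemRelease`) is omitted to keep the sketch short.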

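Important

Choosing between them: the Legacy IPC API is simple and uses the Runtime API, but it is Linux only and offers coarse control. VMM IPC controls sharing and peer access per allocation and works across operating systems, but it requires switching to the Driver API.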

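Exam / Quiz Key Points

Question	Answer
Scope in which a device pointer / event handle is valid	Only inside the process that created it; invalid across processes
The two API families for sharing memory across processes	CUDA IPC API / Virtual Memory Management (VMM) API
Functions that obtain / restore an IPC memory handle	cudaIpcGetMemHandle() / cudaIpcOpenMemHandle()
How handles are passed between processes	Standard host OS IPC mechanisms: interprocess shared memory or a file
Which OS the Legacy IPC API supports	Linux only
Does Legacy IPC support cudaMallocManaged	No
Driver / runtime requirement for both sides	Should be the same
Risk of cudaMalloc sub-allocation and its mitigation	Sharing the entire underlying block can leak other sub-allocations; a 2 MiB-aligned size is recommended
Which kind of IPC Tegra / L4T (cc 7.x+) supports	Event sharing only; memory sharing is not supported
Advantages of VMM IPC	Per-allocation control of peer access / sharing, multi-OS support (requires the Driver API)
Which handle multi-node NVLink clusters use	Fabric handle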
