Memory Synchronization Domains

#cuda #memory #memory-sync-domain #concept

重點總覽

項目	重點
解決的問題	memory fence / flush 因 cumulativity（累積性）而等待比實際所需更多的 in-flight transactions，造成 fence interference（效能退化）
核心機制	每個 kernel launch 取得一個 domain ID；writes 與 fences 都標記該 ID，fence 只 order 同 domain 的 writes
硬體需求	compute capability 9.0（Hopper） + CUDA 12.0；Hopper 有 4 個 physical domains
跨 domain 規則	不同 domain 間的 ordering / synchronization 需 system-scope fencing；同 domain 內 device-scope 即足夠
邏輯 / 實體分層	logical domain（`Default` / `Remote`）→ 透過 mapping → physical domain；便於跨層 library 組合
使用方式	launch attributes `cudaLaunchAttributeMemSyncDomain`（選 logical）與 `cudaLaunchAttributeMemSyncDomainMap`（logical→physical 映射）
向後相容	kernel 預設落在 domain 0；pre-9.0 裝置 domain count 回報 1，程式碼可可攜

4.14.1 Memory Fence Interference

部分 CUDA 應用會因 memory fence / flush 操作等待超出 CUDA memory consistency model 所必需的 transactions 而效能退化。

考慮以下三執行緒範例（thread 1、2 在 SM 上，thread 3 在 CPU）：

__managed__ int x = 0;
__device__   cuda::atomic<int, cuda::thread_scope_device> a(0);
__managed__  cuda::atomic<int, cuda::thread_scope_system> b(0);

// Thread 1 (SM)      Thread 2 (SM)        Thread 3 (CPU)
x = 1;                while (a != 1);      while (b != 1);
a = 1;                assert(x == 1);      assert(x == 1);
                      b = 1;

memory consistency model 保證兩個 assert 都成立：thread 1 對 x 的寫入，必須在 thread 2 對 b 的寫入「之前」對 thread 3 可見。

a 是 device-scope 的 release/acquire：只足以讓 x 對 thread 2 可見，對 thread 3 不夠。
b 是 system-scope 的 release/acquire：不僅要讓 thread 2 自己的寫入對 thread 3 可見，還要讓「所有對 thread 2 可見的其他執行緒寫入」也對 thread 3 可見 — 這就是 cumulativity（累積性）。

為什麼會 interference

GPU 在執行當下無法分辨哪些 in-flight write 是 source level 真正保證要可見的、哪些只是恰好時序碰巧可見。為了正確性，它只能「保守地撒一張很大的網」，把所有 in-flight memory operations 都納入等待。結果 fence/flush 花的時間比實際必要的更久。

fence 可能顯式出現（如範例中的 intrinsics / atomics），也可能在 task 邊界隱式產生（實作 synchronizes-with 關係）。

典型情境：一個 kernel 在 local GPU memory 做運算，另一個並行 kernel（例如 NCCL）正與 peer 做通訊。local kernel 完成時會隱式 flush 它的寫入，以滿足下游 work 的 synchronizes-with；這可能不必要地等待通訊 kernel 較慢的 NVLink / PCIe 寫入（完全或部分）。

   compute kernel (local mem)        communication kernel (NCCL, peer)
   ───────────────────────────       ─────────────────────────────────
   fast local writes ──┐                 slow NVLink/PCIe writes
                       │  implicit flush at completion
                       └──► 保守網把「慢通訊寫入」也納入等待 → 變慢

4.14.2 Isolating Traffic with Domains

從 compute capability 9.0（Hopper）+ CUDA 12.0 起，memory synchronization domains 用來緩解上述 interference：以「程式碼提供的顯式協助」換取「GPU 縮小 fence 撒網範圍」。

每個 kernel launch 取得一個 domain ID。
writes 與 fences 都標記該 domain ID。
一個 fence 只 order 與其 domain 相符的 writes。
在 compute vs communication 範例中，把通訊 kernel 放到不同 domain 即可隔離流量。

            domain 0 (compute)              domain 1 (remote/comm)
            ┌────────────────────┐          ┌────────────────────┐
  writes ──►│ tag=0   tag=0       │          │ tag=1   tag=1      │
  fence ───►│ 只等待 tag==0 的 write│          │ 只等待 tag==1 的 write│
            └────────────────────┘          └────────────────────┘
        跨 domain ordering ⇒ 必須用 system-scope fence

跨 domain 的規則（cumulativity 仍須維持）

同一 GPU 上不同 domain 之間的 ordering 或 synchronization，必須使用 system-scope fencing；同一 domain 內 device-scope fencing 仍然足夠。原因是：一個 kernel 的 writes 不會被「另一個 domain 的 kernel 所發出的 fence」涵蓋。本質上，cumulativity 是靠「讓 cross-domain traffic 提前 flush 到 system scope」來滿足。

thread_scope_device 的定義被修改

此機制改變了 thread_scope_device 的語意。但因為 kernel 預設落在 domain 0（見下節），向後相容性仍維持。

4.14.3 Using Domains in CUDA

Domains 透過兩個新的 launch attributes 存取：

Attribute	作用
`cudaLaunchAttributeMemSyncDomain`	選擇 logical domain：`cudaLaunchMemSyncDomainDefault` 或 `cudaLaunchMemSyncDomainRemote`
`cudaLaunchAttributeMemSyncDomainMap`	提供 logical → physical domain 的映射

remote domain 用於做 remote memory access 的 kernel，用來把它們的 traffic 與 local kernel 隔離。
選擇某個 domain 不會改變 kernel 在法律上可做的記憶體存取，只影響 fence ordering。
domain 數量以 device attribute cudaDevAttrMemSyncDomainCount 查詢：Hopper（9.0）有 4 個；為了可攜，所有裝置都能使用此功能，pre-9.0 裝置回報 count = 1。

logical domain 簡化應用組合：stack 底層的單一 launch（如 NCCL）可選一個語意上的 logical domain，不需在意上層架構；上層再用 mapping 去操控 logical domain 的落點。

logical domain 未設定時，預設為 default domain。
預設 mapping：default → 0，remote → 1（在 domain 數 > 1 的 GPU 上）。
特定 library 會替 launch 標記 remote domain，例如 NCCL 2.16 起會這麼做 → 開箱即用、其他元件不需改 code。
替代用法（如使用 NVSHMEM、或 kernel 類型無清楚區分時）：切分平行 streams — stream A 把兩個 logical domain 都映到 physical 0、stream B 映到 1，依此類推。

// 以 remote logical domain 啟動 kernel
cudaLaunchAttribute domainAttr;
domainAttr.id  = cudaLaunchAttrMemSyncDomain;
domainAttr.val = cudaLaunchMemSyncDomainRemote;

cudaLaunchConfig_t config;
// Fill out other config fields
config.attrs    = &domainAttr;
config.numAttrs = 1;
cudaLaunchKernelEx(&config, myKernel, kernelArg1, kernelArg2 /* ... */);

上面在 launch level 選定 logical domain（remote）。

// 為 stream 設定 mapping（9.0+ 未顯式設定時這即為預設，僅供示意）
cudaLaunchAttributeValue mapAttr;
mapAttr.memSyncDomainMap.default_ = 0;
mapAttr.memSyncDomainMap.remote   = 1;
cudaStreamSetAttribute(stream, cudaLaunchAttributeMemSyncDomainMap, &mapAttr);

// 把不同 stream 映到不同 physical domain（忽略 logical 設定）
mapAttr.memSyncDomainMap.default_ = 0;
mapAttr.memSyncDomainMap.remote   = 0;
cudaStreamSetAttribute(streamA, cudaLaunchAttributeMemSyncDomainMap, &mapAttr);
mapAttr.memSyncDomainMap.default_ = 1;
mapAttr.memSyncDomainMap.remote   = 1;
cudaStreamSetAttribute(streamB, cudaLaunchAttributeMemSyncDomainMap, &mapAttr);

第二段把 default 與 remote 都映到同一 physical domain，達成「用 stream 切分 physical domain、忽略 logical 設定」的效果。

典型用法與 graph 行為

這些 attribute 在 CUDA streams、cudaLaunchKernelEx 個別 launch、以及 CUDA graphs 的 kernel nodes 上一致暴露。典型做法：在 stream level 設 mapping，在 launch level 設 logical domain（或框住一段 stream 使用）。

stream capture 時兩個 attribute 都會複製到 graph node。
graph 執行時，兩個 attribute 都取自 node 本身（等同於間接指定 physical domain）。
graph 啟動時所注入 stream 上的 domain 相關 attribute 不會用於 graph 的執行。

graph / launch attribute 的更多脈絡見 04-CUDA-Features/03-CUDA-Graphs-Structure-and-Capture 與 03-Advanced-CUDA/02-Advanced-Streams-and-Dependent-Launch。

考試/測驗重點

問題	答案
Memory fence interference 的根因？	cumulativity：GPU 無法分辨 in-flight write 是必要可見還是碰巧可見，只能保守等待全部
domain 機制如何縮小 fence 撒網？	每個 launch 有 domain ID，write/fence 標記 ID，fence 只 order 同 domain 的 write
跨 domain ordering 需要什麼 scope 的 fence？	system-scope（同 domain 內 device-scope 即足夠）
起始硬體 / CUDA 版本？	compute capability 9.0（Hopper）/ CUDA 12.0
Hopper 有幾個 physical domain？pre-9.0 回報多少？	Hopper 4 個 / pre-9.0 回報 1（功能仍可用，利於可攜）
查 domain 數量用哪個 attribute？	`cudaDevAttrMemSyncDomainCount`
兩個 launch attribute 各做什麼？	`MemSyncDomain` 選 logical domain；`MemSyncDomainMap` 設 logical→physical 映射
兩個 logical domain？預設 mapping？	`Default` / `Remote`；default→0、remote→1（domain 數 > 1 時）
未設定 logical domain 時預設為何？	default domain（對應 physical 0），維持向後相容
哪個 library 會標記 remote domain？	NCCL 2.16 起
選某 domain 會改變 kernel 可做的記憶體存取嗎？	不會，只影響 fence ordering
graph 的 domain attribute 來源？	取自 node 本身；launch 進去的 stream 上的 domain attribute 不生效