Cooperative Groups 深入 (Cooperative Groups Deep Dive)

#cuda #execution-model #cooperative-groups #concept

重點總覽

項目	重點
動機	取代過往各自手刻、脆弱不安全的 warp/跨 block 同步原語；提供安全且跨 GPU 世代可維護（future-proof）的機制
Group handle	管理 group 的把手，讓參與 thread 查詢自己的位置與 group 資訊（`thread_rank`、`num_threads`、`thread_index`、`dim_threads`）
Implicit groups	依 launch 配置隱式建立：`this_thread_block()`、`this_grid()`、`coalesced_threads()`、`this_cluster()`
建立 group	partition 母 group 成子 group：`tiled_partition`、`labeled_partition`、`binary_partition`（皆為 collective）
同步 sync	`sync()` 等價 `__syncthreads()`；保證記憶體可見性 + 全員到齊才放行；可同步整個 grid（CUDA 13 起不再支援多 device）
Barriers	`barrier_arrive` 回傳 `arrival_token`，傳入 `barrier_wait` 消耗；自動初始化；每 phase 須 arrive+wait 一次
Collective ops	`reduce`（plus/less/greater/bit_*）、`inclusive_scan`/`exclusive_scan`、`invoke_one`/`invoke_one_broadcast`
memcpy_async	global→shared 的非同步 prefetch，搭配 `wait`；需來源 global、目的 shared、雙方 ≥4B 對齊（建議 16B）
Large scale groups	跨整個 grid 的 group；要同步整個 grid 必須用 `cudaLaunchCooperativeKernel`（CC 6.0+）

4.4.1 介紹：為什麼需要 Cooperative Groups

Cooperative Groups 是 CUDA 程式模型的擴充，用來組織「協同合作的 thread 群組」，讓開發者控制 thread 協作的粒度（granularity），表達更豐富、更高效的平行分解，並內建常見平行原語（如 scan、parallel reduce）。

歷史上 CUDA 只提供單一簡單機制做 thread 同步：對整個 thread block 的 barrier，即 __syncthreads()。
為表達更廣的平行互動模式，許多人自行手刻warp 內或跨 block 的同步原語——效能雖有提升，卻累積出脆弱、難寫難調難維護、且跨 GPU 世代易壞的程式碼。
Cooperative Groups 正是為此提供安全且 future-proof 的寫法。

一句話定位

Cooperative Groups = 「可控粒度的 thread 群組 + 安全的群組同步/集合運算」，把人們各自亂寫的 ad hoc 原語標準化。

4.4.2 Group Handle 與 Member Functions

Cooperative Group 透過 Cooperative Group Handle 管理。Handle 讓參與的 thread 得知自己在 group 中的位置、group 大小與其他資訊。

Accessor	回傳
`thread_rank()`	呼叫 thread 在 group 內的 rank（序號）
`num_threads()`	group 內的 thread 總數
`thread_index()`	thread 在 launched block 內的 3D index
`dim_threads()`	launched block 的 3D 維度（以 thread 為單位）

完整清單見 Cooperative Groups API。

rank vs index

thread_rank() 是 group 內的「一維線性序號」（做 reduce/scan 取資料常用）；thread_index() / dim_threads() 則對應 launched block 的三維座標與維度，注意兩者語意不同。

4.4.3 預設行為 / Groupless Execution

代表 grid 與 thread block 的 group 會依 kernel launch 配置隱式建立。這些「implicit groups」是進一步細分的起點。

Accessor	Group 範圍
`this_thread_block()`	目前 thread block 內所有 thread
`this_grid()`	grid 內所有 thread
`coalesced_threads()`	一個 warp 內當下 active 的 thread 群
`this_cluster()`	目前 cluster 內的 thread 群

coalesced_threads 與 this_cluster 的注意事項

coalesced_threads() 回傳「該時間點」的 active thread 集合，不保證回傳哪些 thread（只要是 active）、也不保證它們在後續執行中維持 coalesced。
this_cluster() 在啟動的是非 cluster grid 時，假設為 1x1x1 cluster；需 Compute Capability 9.0 以上。

4.4.3.1 盡早建立 implicit group handle

為求最佳效能，建議在任何分支發生之前、盡可能早建立 implicit group 的 handle，並在整個 kernel 沿用同一個 handle。

4.4.3.2 只用 reference 傳遞 group handle

把 group handle 傳入函式時，建議以 reference 傳遞。
Group handle 沒有 default constructor，必須在宣告時就初始化。
不鼓勵 copy-construct group handle。

namespace cg = cooperative_groups;
// 盡早取得 handle，並以 reference 傳給其他函式
cg::thread_block block = cg::this_thread_block();

重點：handle 是「輕量把手」但語意上不該到處複製，宣告即初始化、傳參用 reference。

4.4.4 建立 Cooperative Groups

Group 由切分（partition）母 group 而來；切分後會產生一個新 handle 管理子 group。

Partition 類型	說明
`tiled_partition`	把母 group 切成一系列固定大小的子 group，呈一維 row-major 排列
`labeled_partition`	依條件 label（任意整數型別）把母 group 切成一維子 group
`binary_partition`	labeled partition 的特例，label 只能是 `0` 或 `1`

namespace cg = cooperative_groups;
// 取得目前 thread 的 cooperative group
cg::thread_block my_group = cg::this_thread_block();
// 切成大小為 8 的 tile
cg::thread_block_tile<8> my_subgroup = cg::tiled_partition<8>(my_group);
// 以 my_subgroup 進行運算

tiled_partition<8> 把 block 切成多個 8-thread 的 tile，每個 thread 取得管理自己 tile 的 handle。

  母 group (this_thread_block, 例: 32 threads)
  ┌──────────────────────────────────────────────────────────┐
  │ t0..t7 │ t8..t15 │ t16..t23 │ t24..t31                    │
  └──────────────────────────────────────────────────────────┘
   tiled_partition<8>  → 4 個固定大小子 group (1D, row-major)
   [tile0]    [tile1]     [tile2]     [tile3]

4.4.4.1 避免 Group 建立的 Hazards

partition 是 collective 操作

切分 group 是 collective 操作，group 內所有 thread 都必須參與。若 group 是在「不是所有 thread 都會到達的條件分支」中建立，會導致 deadlock 或資料毀損（data corruption）。建立 group 的程式碼必須讓全員都會執行到。

4.4.5 同步（Synchronization）

在 Cooperative Groups 之前，CUDA 只允許 thread block 之間在 kernel 完成邊界才同步。Cooperative Groups 讓開發者可在不同粒度同步協作的 thread 群。

4.4.5.1 Sync

呼叫 collective 的 sync() 即可同步一個 group。與 __syncthreads() 相同，sync() 保證：

同步點之前group 內 thread 做的所有記憶體存取（讀/寫），對同步點之後的所有 thread 皆可見。
group 內所有 thread 都到達同步點後，才允許任何 thread 越過。

namespace cg = cooperative_groups;
cg::thread_block my_group = cg::this_thread_block();
// 同步 block 內的 thread（等價 __syncthreads()）
cg::sync(my_group);

grid 同步與 CUDA 13 變更

Cooperative Groups 可用來同步整個 grid。但自 CUDA 13 起，不再支援用 cooperative groups 做 multi-device（跨裝置）同步；相關 multi-device launch API 已移除（見 Large Scale Groups）。

4.4.5.2 Barriers

Cooperative Groups 提供類似 cuda::barrier 的 barrier API 做更進階的同步，與 cuda::barrier 的關鍵差異：

Cooperative Groups barrier 自動初始化。
group 內所有 thread 每個 phase 都必須 arrive 並 wait 各一次。
barrier_arrive 回傳一個 arrival_token，必須傳入對應的 barrier_wait，在那裡被消耗（consumed）且不可再用。

namespace cg = cooperative_groups;
cg::thread_block my_group = this_block();
cg::cluster_group cluster = this_cluster();
auto token = cluster.barrier_arrive();
// 選用：做些 local 處理以隱藏同步延遲
local_processing(my_group);
// 確保 cluster 內其他 block 都已執行並初始化 shared data 後，才存取 dsmem
cluster.barrier_waitmove(token);

barrier_arrive 與 barrier_wait 之間的空檔可拿來做不依賴同步結果的 local 工作，藉此隱藏同步延遲。

  phase 內每個 thread 的時序：
    barrier_arrive() ──► [token]
        │  (此空檔可做 local_processing，但不可做 collective op)
        ▼
    barrier_wait(move(token))   ← token 在此被消耗

Barrier 的 hazards

在 barrier_arrive 之後、barrier_wait 之前，group 不可使用任何 collective 操作。
barrier_wait 只保證「所有 thread 都已呼叫 barrier_arrive」，不保證所有 thread 都已呼叫 barrier_wait。

4.4.6 集合運算（Collective Operations）

Collective 操作需 group 內所有 thread 參與才能完成。除非 API 明確允許，所有 thread 對對應參數必須傳入相同值，否則行為未定義。

4.4.6.1 Reduce

reduce 對 group 內每個 thread 提供的資料做平行歸約，須指定運算子：

運算子	回傳
`plus`	group 內所有值的總和
`less`	最小值
`greater`	最大值
`bit_and`	位元 AND 歸約
`bit_or`	位元 OR 歸約
`bit_xor`	位元 XOR 歸約

namespace cg = cooperative_groups;
cg::thread_block my_group = cg::this_thread_block();
int val = data[threadIdx.x];
int sum = cg::reduce(my_group, val, cg::plus<int>());
if (my_group.thread_rank() == 0) {
    result[blockIdx.x] = sum;   // 由 rank 0 寫出結果
}

硬體加速條件

Reduction 在可用時會用硬體加速（需 Compute Capability 8.0 以上）；較舊硬體有 software fallback。只有 4-byte（4B）型別會被硬體加速。

4.4.6.2 Scans

提供 inclusive_scan 與 exclusive_scan，適用任意 group 大小，對 group 內每個 thread 的資料做掃描；可選擇性指定上表的歸約運算子。

namespace cg = cooperative_groups;
cg::thread_block my_group = cg::this_thread_block();
int val = data[my_group.thread_rank()];
int exclusive_sum = cg::exclusive_scan(my_group, val, cg::plus<int>());
result[my_group.thread_rank()] = exclusive_sum;

exclusive_scan 不含自身、inclusive_scan 含自身，是 prefix-sum 類前綴運算的標準工具。

4.4.6.3 Invoke One

當「只需單一 thread 代表整個 group 做一段序列工作」時使用：

invoke_one：從呼叫 group 中選一個任意 thread，用該 thread 以給定參數呼叫 invocable 函式。
invoke_one_broadcast：同 invoke_one，但會把呼叫結果廣播給 group 內所有 thread。
thread 選擇機制不保證 deterministic。

namespace cg = cooperative_groups;
cg::thread_block my_group = cg::this_thread_block();
// 確保 block 內只有一個 thread 印出訊息
cg::invoke_one(my_group, []() {
    printf("Hello from one thread in the block!");
});
// 同步讓所有 thread 等訊息印完
cg::sync(my_group);

invocable 內的限制

在 invocable 函式內部，不允許對「呼叫 group」做通訊或同步；但與「呼叫 group 之外」的 thread 通訊是允許的。

4.4.7 非同步資料搬移（memcpy_async）

memcpy_async 提供 global ↔ shared memory 的非同步記憶體複製，特別適合最佳化記憶體傳輸、把計算與資料傳輸重疊以提升效能。

memcpy_async：啟動從 global memory 到 shared memory 的非同步載入，當作「prefetch」使用——在資料被需要之前先載入。
wait：強制 group 內所有 thread 等到非同步傳輸完成。所有 thread 都必須先呼叫 wait，才能存取 shared memory 中的資料。

namespace cg = cooperative_groups;
cg::thread_group my_group = cg::this_thread_block();
__shared__ int shared_data[];
// 從 global 非同步複製到 shared
cg::memcpy_async(my_group, shared_data + my_group.rank(),
                 input + my_group.rank(), sizeof(int));
// 此處做其他工作以隱藏延遲，但不可使用 shared_data
cg::wait(my_group);
// prefetch 的資料現在可用

  時間 ──►
  memcpy_async (global→shared) ████████████  (背景進行)
  其他計算 (不碰 shared_data)   ▓▓▓▓▓▓        ← 與傳輸重疊
  wait() ─ barrier ─ 之後才安全存取 shared_data

4.4.7.1 Memcpy Async 對齊需求

何時才真的「非同步」

memcpy_async 只有在來源是 global memory、目的是 shared memory、且兩者皆至少 4-byte 對齊時才是非同步的。追求最佳效能時，建議來源與目的都做 16-byte 對齊。

4.4.8 大規模群組（Large Scale Groups）

Cooperative Groups 支援橫跨整個 grid 的大型 group。前述所有功能對大型 group 皆可用，但有一個關鍵例外：

同步整個 grid 必須用 cudaLaunchCooperativeKernel

要同步整個 grid，必須透過 cudaLaunchCooperativeKernel runtime launch API 啟動 kernel。一般 <<<>>> 啟動只能在 thread block 內同步，無法做 inter-block 的全 grid 同步。
此外，Cooperative Groups 的 multi-device launch API 與相關參考已於 CUDA 13 移除。

4.4.8.1 何時使用 cudaLaunchCooperativeKernel

用於啟動「需要 inter-block 同步」的單一 device kernel，使整個 grid 的所有 thread 都能跨 block 同步、協作。
保證 launch 是 atomic：若 API 呼叫成功，則指定數量的 thread block 必定會在指定 device 上啟動。
最佳實務：先查詢 device attribute cudaDevAttrCooperativeLaunch 確認支援。

int dev = 0;
int supportsCoopLaunch = 0;
cudaDeviceGetAttribute(&supportsCoopLaunch, cudaDevAttrCooperativeLaunch, dev);
// supportsCoopLaunch == 1 表示 device 0 支援 cooperative launch

支援條件：僅 Compute Capability 6.0 以上，且需在以下其中一種平台執行：

平台	條件
Linux（無 MPS）	直接支援
Linux（有 MPS）	device 需 Compute Capability 7.0 以上
Windows	需最新版 Windows 平台

考試/測驗重點

情境/關鍵字	答案
Cooperative Groups 的核心價值	可控協作粒度 + 安全/future-proof，取代手刻 ad hoc 同步原語
`thread_rank()` vs `thread_index()`	rank 是 group 內一維序號；index 是 launched block 內 3D 座標
`num_threads()` / `dim_threads()`	group 內 thread 總數 / launched block 的 3D 維度
四個 implicit group accessor	`this_thread_block()` / `this_grid()` / `coalesced_threads()` / `this_cluster()`
`coalesced_threads()` 的保證	只回「當下 active」的 thread，不保證是哪些、不保證持續 coalesced
`this_cluster()` 需求	CC 9.0+；非 cluster grid 時假設 1x1x1 cluster
handle 建立/傳遞建議	盡早建立（分支前）；以 reference 傳；無 default ctor、宣告即初始化、勿 copy
三種 partition	`tiled_partition`（固定大小 1D）/ `labeled_partition`（依整數 label）/ `binary_partition`（label 0 或 1）
partition 的 hazard	是 collective，全員須參與；放在非全員到達的分支 → deadlock 或 data corruption
`sync()` 兩大保證	同步前的記憶體存取對同步後全員可見；全員到齊才放行（等價 `__syncthreads()`）
CUDA 13 對 grid/multi-device sync 的變更	仍可同步整個 grid；但不再支援 multi-device 同步、相關 API 已移除
barrier 的 token 規則	`barrier_arrive` 回傳 `arrival_token`，傳入 `barrier_wait` 被消耗、不可再用
barrier_arrive 與 barrier_wait 之間禁忌	不可使用任何 collective 操作
`barrier_wait` 的保證範圍	只保證全員已 `barrier_arrive`；不保證全員已 `barrier_wait`
reduce 運算子	plus / less（min）/ greater（max）/ bit_and / bit_or / bit_xor
reduce 硬體加速條件	CC 8.0+，且只加速 4B 型別；舊硬體有 software fallback
scan 種類	`inclusive_scan` / `exclusive_scan`，任意 group 大小，可選歸約運算子
`invoke_one` vs `invoke_one_broadcast`	皆選一任意 thread 執行；後者額外把結果廣播給全 group；選擇非 deterministic
invoke_one invocable 內限制	不可對呼叫 group 通訊/同步；可與 group 外 thread 通訊
`memcpy_async` 方向與用途	global→shared 的非同步 prefetch，搭配 `wait`，全員須先 wait 才能存取
`memcpy_async` 對齊需求	來源 global、目的 shared、雙方 ≥4B 對齊才非同步；最佳建議 16B
同步整個 grid 的條件	必須用 `cudaLaunchCooperativeKernel` 啟動
cudaLaunchCooperativeKernel 平台/CC	CC 6.0+；Linux 無 MPS / Linux+MPS 需 CC 7.0+ / 最新 Windows
查詢是否支援 cooperative launch	`cudaDeviceGetAttribute(..., cudaDevAttrCooperativeLaunch, dev)` 回 1 表支援

重點總覽

4.4.1 介紹：為什麼需要 Cooperative Groups

4.4.2 Group Handle 與 Member Functions

4.4.3 預設行為 / Groupless Execution

4.4.3.1 盡早建立 implicit group handle

4.4.3.2 只用 reference 傳遞 group handle

4.4.4 建立 Cooperative Groups

4.4.4.1 避免 Group 建立的 Hazards

4.4.5 同步（Synchronization）

4.4.5.1 Sync

4.4.5.2 Barriers

4.4.6 集合運算（Collective Operations）

4.4.6.1 Reduce

4.4.6.2 Scans

4.4.6.3 Invoke One

4.4.7 非同步資料搬移（memcpy_async）

4.4.7.1 Memcpy Async 對齊需求

4.4.8 大規模群組（Large Scale Groups）

4.4.8.1 何時使用 cudaLaunchCooperativeKernel

考試/測驗重點

Related Notes