Atomics、Cooperative Groups 與 Occupancy

#cuda #execution-model #atomics #cooperative-groups #occupancy #concept

重點總覽

項目	重點
為何需要 atomics	grid 層級沒有全域同步機制；atomic functions 對 global memory 位置提供同步的 read-modify-write
atomic 運作原理	thread 取得該記憶體位置的 lock，鎖定期間其他 thread 不可存取，完成 RMW 後釋放
C++ atomics	`cuda::std::atomic` / `cuda::std::atomic_ref`（類標準庫）；`cuda::atomic` / `cuda::atomic_ref`（可指定 thread scope）
Python atomics	`numba.cuda.atomic` namespace：`add`、`sub`、`max`、`min`、`compare_and_swap`
避免 race condition	非 atomic 的 `s[0] = s[0] + x` 會遺失更新、結果隨執行/GPU 而變；atomic 才能保證正確性
效能代價	atomic 會強制 thread 同步，應節制使用
Cooperative Groups	CUDA C++ 軟體工具，可定義跨 block／跨 grid／跨 GPU 同步的 thread group，也可定義小於 block 的 group
Occupancy 定義	active warps ÷ SM 支援的最大 active warps；越高越能隱藏延遲
限制 occupancy 的資源	block size（threads/block）、shared memory/block、registers/thread；受 per-SM 與 per-block 上限約束
查詢與調整工具	`cudaGetDeviceProperties` 查上限；`nvcc --resource-usage` 查用量；`nvcc --maxrregcount` 限制 register

2.3.5 Atomics

效能良好的 CUDA kernel 仰賴盡量表達演算法平行度，而 GPU kernel 非同步執行要求 thread 盡可能彼此獨立。但 thread 無法總是完全獨立。

block 內：同一 thread block 的 thread 可用 shared memory 交換資料並以 __syncthreads() 同步。
grid 層級：沒有同步整個 grid 所有 thread 的機制（cooperative groups 為例外，見後）。
取代方案：可透過 atomic functions 對 global memory 位置提供同步存取。

Atomic function 的語意：一個 thread 對某 global memory 位置取得 lock，執行一次 read-modify-write（讀-改-寫）。在 lock 持有期間，沒有其他 thread 能存取同一位置。

非 atomic（race condition）         atomic（正確）
T0: read s=0                       T0: lock s ─ read 0 ─ +5 ─ write 5 ─ unlock
T1: read s=0   (兩者都讀到 0)      T1:        (等待)    lock s ─ read 5 ─ +3 ─ write 8 ─ unlock
T0: write 5                        結果 = 8（正確）
T1: write 3   結果 = 3（遺失 T0 更新）

效能代價

Atomic function 會強制 thread 同步，可能損及效能，應節制使用（used sparingly）。盡量把競爭降到最低，例如先在 shared memory 做局部歸約，再由單一 thread 做一次 atomic。

2.3.5.1 C++ std::atomic-like Atomics

CUDA C++ 提供兩組 atomic，語法/行為對應 C++ 標準庫：

形式	說明
`cuda::std::atomic`、`cuda::std::atomic_ref`	與同名 C++ 標準庫 atomics 語法與行為相似
`cuda::atomic`、`cuda::atomic_ref`	擴充版，允許指定 atomic 操作的 thread scope

以 cuda::atomic_ref 做 device-wide 原子加總（result 指向 global memory 中存放總和的位置）：

__global__ void sumReduction(int n, float *array, float *result) {
    // ...
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    cuda::atomic_ref<float, cuda::thread_scope_device> result_ref(result);
    result_ref.fetch_add(array[tid]);
    // ...
}

模板第二參數 cuda::thread_scope_device 指定此 atomic 在整個 device 範圍內保證原子性。
fetch_add 即原子的 read-modify-write 加法。

C++ 最佳實務

C++ 也有底層 atomic functions（在原文 5.4.5.1 討論），但官方建議優先使用 std::atomic-like 的 cuda::std::atomic / cuda::atomic 介面，視為 CUDA C++ 的 best practice。

2.3.5.2 Memory Atomics in Python

Python 中由 numba.cuda.atomic namespace 提供 atomic 記憶體操作，常見有 add、sub、max、min、compare_and_swap（完整清單見 Numba CUDA 文件）。

下例為 sum reduction：每個 block 把陣列片段載入 shared memory，由 block 內單一 thread（thread 0） 算局部總和，再做一次 atomic add 到結果 s。因為資料就在離 SM 計算資源很近的 shared memory，單一 thread 做加總通常仍合理快。

from numba import cuda
import numpy as np

@cuda.jit
def sum_reduce(a, s):
    shared_staging = cuda.shared.array(shape=512, dtype=np.float32)
    shared_staging[cuda.threadIdx.x] = a[cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x]
    cuda.syncthreads()  # 確保所有載入完成

    local_sum = float(0.0)
    if cuda.threadIdx.x == 0:               # 只有 thread 0 做局部加總
        for i in range(cuda.blockDim.x):
            local_sum = local_sum + shared_staging[i]
        cuda.atomic.add(s, 0, local_sum)    # 每個 block 一次 atomic add

sum_reduce[grid_size, block_size](a, s)     # 啟動方式：kernel[grid, block](args)

atomic 是正確性的必要條件

若把 cuda.atomic.add(s, 0, local_sum) 換成非 atomic 的 s[0] = s[0] + local_sum，對 s[0] 的存取就不是原子的，最終 s[0] 會小於正確值，且結果每次執行可能不同、在不同 SM 數量的 GPU 上也可能改變。這正說明 atomic memory access 是此程式正確性所必需。

別自己重造輪子

上例僅示意，並非 GPU 上效能最佳的 reduction。NVIDIA 的 CCCL（CUDA Core Compute Libraries） 提供高效能 primitives（含 reductions），Python 可透過 cuda.coop package 取得。為兼顧生產力與效能，應優先使用這些高度調校過的實作。

2.3.6 Cooperative Groups

Cooperative Groups 是 CUDA C++ 中的軟體工具，讓應用程式定義可彼此同步的 thread group——即使該 group 橫跨多個 thread block、單一 GPU 上的多個 grid，甚至跨多個 GPU。

CUDA 程式模型本身的兩個限制，正是 cooperative groups 要補足的：

預設 CUDA 程式模型	Cooperative Groups 補足
可高效同步 thread block 或 cluster 內的 thread	可定義小於 block／cluster 的 thread group 並同步
不提供小於 block/cluster 的 group 機制	—
不保證跨 thread block 的同步	可建立跨 block／cluster 邊界的 group 並同步

            ┌── 預設模型可同步 ──┐
  Grid ┌─────────────────────────────────────┐
       │  Block0[__syncthreads()]  Block1 ... │  ← 跨 block 預設無同步
       └─────────────────────────────────────┘
  Cooperative Groups ──► 可同步：block 內子群 / 跨 block / 跨 grid / 跨 GPU

跨邊界的代價

建立跨越 thread block 與 cluster 邊界的 group，會帶來語意限制與效能影響（細節見專門的 cooperative groups 章節）。並非免費的全域同步。

2.3.7 Kernel Launch and Occupancy

排程：block 如何被指派到 SM

kernel 啟動時，thread 依 execution configuration 被組成 thread block 與 grid，接著 scheduler 把 thread block 指派給 SM。

應用程式無法控制或查詢哪個 block 排到哪個 SM，scheduler 不保證任何順序，因此程式不可依賴特定排程順序來保證正確性。
一個 SM 能同時容納幾個 block，取決於 block 所需資源與 SM 可用資源。

                    scheduler
  待排 blocks  ─────────────────►  SM0 [block][block]  (資源已滿)
  [b0][b1][b2]...                  SM1 [block]          (還有空間 → 繼續塞)
                                   SM2 [block][block]
   無 SM 有空位 → scheduler 等待既有 block 完成 → SM 釋出 → 再指派
   直到所有 block 都被排程並執行完畢

用 `cudaGetDeviceProperties` 查資源上限

注意上限分為 per-SM 與 per-block 兩類：

Device property	意義	範例值 (CC 10.0)
`maxBlocksPerMultiProcessor`	每 SM 最多常駐 block 數	32
`sharedMemPerMultiprocessor`	每 SM 可用 shared memory（bytes）	233472
`regsPerMultiprocessor`	每 SM 的 32-bit register 數	65536
`maxThreadsPerMultiProcessor`	每 SM 最多常駐 thread 數	2048
`sharedMemPerBlock`	單一 block 可配置的 shared memory 上限（bytes）	49152
`regsPerBlock`	單一 block 可配置的 32-bit register 上限	65536
`maxThreadsPerBlock`	單一 block 的 thread 上限	1024

Occupancy 定義

Occupancy = active warps ÷ SM 支援的最大 active warps。

一般而言 occupancy 越高越好：能隱藏延遲（hide latency）、提升效能。
計算 occupancy 需知道：(1) SM 的資源上限（上表）；(2) kernel 所需資源。
查 kernel 資源用量：編譯時用 nvcc --resource-usage，會列出 kernel 所需的 register 數與 shared memory。

計算範例（CC 10.0，maxActiveWarps = 2048/32 = 64）

testKernel<<<512, 768>>>()   // 768 threads/block
  受限於 maxThreadsPerMultiProcessor = 2048
  → 每 SM 最多 2 blocks  (2*768=1536 ≤ 2048，第 3 個會超過)
  → occupancy = (768 * 2) / 2048 = 75%

testKernel<<<512, 32>>>()    // 32 threads/block
  未碰到 thread 上限，但 maxBlocksPerMultiProcessor = 32
  → 每 SM 最多 32 blocks  → 32*32 = 1024 threads
  → occupancy = 1024 / 2048 = 50%

block 太小反而傷 occupancy

第二例的 block 只有 32 threads，雖然 thread 總量沒滿，卻先撞到「每 SM 最多 32 個 block」的上限，導致只能跑滿 1024 threads（50%）。block size 過小會浪費 SM 容量。

Shared memory 同理：若 kernel 每 block 用 100KB shared memory，因第 3 個 block 將使總量達 300KB，超過每 SM 的 233472 bytes，故每 SM 只能排 2 個 block。

程式設計者能控制什麼

資源	可控程度	調整方式
Threads per block	完全可控	調整 launch 的 block size
Shared memory per block	完全可控	調整 kernel 內 shared 配置量
Registers per thread	有限控制	compiler/runtime 會自動最佳化；可用 `nvcc --maxrregcount` 設定上限

限制 register 的取捨

用 --maxrregcount 限制 register 後，若 kernel 需要的 register 超過上限，很可能spill 到 local memory，改變效能特性。但即使發生 spilling，限制 register 有時能讓更多 block 被排程、提高 occupancy，反而帶來淨效能提升。需實測權衡。

考試/測驗重點

情境/關鍵字	答案
grid 內所有 thread 能否用內建機制同步？	預設不能；只有 block 內可用 `__syncthreads`。跨 grid 需 cooperative groups（軟體）
atomic function 做的事	對 global memory 位置 lock → read-modify-write → unlock，期間他人不可存取
`s[0] = s[0] + x` vs `atomic.add`	前者有 race condition，結果偏小且每次/每 GPU 可能不同；atomic 才正確
指定 thread scope 的 C++ atomic	`cuda::atomic` / `cuda::atomic_ref`（如 `cuda::thread_scope_device`）；`cuda::std::*` 不指定 scope
C++ atomic 最佳實務	優先用 `std::atomic`-like（`cuda::std::atomic` / `cuda::atomic`），而非底層 atomic functions
Python atomic 在哪	`numba.cuda.atomic`：`add`/`sub`/`max`/`min`/`compare_and_swap`
高效能 reduction 該用什麼	CCCL primitives；Python 用 `cuda.coop`，勿自己重寫
Cooperative Groups 屬於哪個語言	CUDA C++；可跨 block / 跨 grid / 跨 GPU 同步，也可定義小於 block 的 group
Cooperative Groups 跨邊界代價	有語意限制與效能影響，非免費全域同步
Occupancy 公式	active warps ÷ SM 最大 active warps；越高越能隱藏延遲
限制 occupancy 的三類資源	threads/block、shared memory/block、registers/thread（受 per-SM 與 per-block 上限）
`<<<512,768>>>` 的 occupancy（CC10.0）	2 blocks/SM（受 2048 thread 限）→ (7682)/2048 = 75%*
`<<<512,32>>>` 的 occupancy（CC10.0）	32 blocks/SM（受 maxBlocksPerMultiProcessor=32 限）→ 1024/2048 = 50%
100KB shared/block（SM 233472B）→ 每 SM 幾 block	2（第 3 個需 300KB，超過上限）
查 kernel register/shared 用量	`nvcc --resource-usage`
限制每 thread register 數	`nvcc --maxrregcount`；過低會 spill 到 local memory（但有時反而提升 occupancy）
block 排到哪個 SM 可控嗎	不可控、不可查、無順序保證；勿依賴排程順序

重點總覽

2.3.5 Atomics

2.3.5.1 C++ std::atomic-like Atomics

2.3.5.2 Memory Atomics in Python

2.3.6 Cooperative Groups

2.3.7 Kernel Launch and Occupancy

排程：block 如何被指派到 SM

用 cudaGetDeviceProperties 查資源上限

Occupancy 定義

計算範例（CC 10.0，maxActiveWarps = 2048/32 = 64）

程式設計者能控制什麼

考試/測驗重點

Related Notes

用 `cudaGetDeviceProperties` 查資源上限