Thread Scopes 與 Scoped Atomics (Thread Scopes and Scoped Atomics)

#cuda #execution-model #thread-scope #scoped-atomics #memory-ordering #concept

重點總覽

項目	重點
Thread Scope	定義「哪些執行緒能觀察到某執行緒的 load/store」以及「哪些執行緒能彼此同步」；每個 scope 對應記憶體階層中一個 point of coherency
五個 scope	`thread` → `block`(.cta) → `cluster`(.cluster) → `device`(.gpu) → `system`(.sys)，涵蓋範圍由窄到寬
Point of Coherency	thread=無 / block=L1 / cluster=L2 / device=L2 / system=L2 + connected caches
暴露介面	thread scope 暴露於 CUDA PTX（`.cta/.cluster/.gpu/.sys`），並以擴充形式提供於 libcu++（`cuda::thread_scope_*`）
Scoped Atomics	結合 C++ 標準 atomic 記憶體語意與 CUDA thread scope，可安全在 block/cluster/device/system 層級溝通
兩大組成	Thread Scope（誰能觀察到效果）+ Memory Ordering（相對其他記憶體操作的順序約束）
兩種 API	libcu++ 的 `cuda::atomic` / `cuda::atomic_ref`；或編譯器 built-in `__nv_atomic_*`
Memory Ordering	`relaxed`(只保原子性) / `acquire`(load) / `release`(store) / `acq_rel`(RMW) / `seq_cst`(最強，全域總順序)
Performance	scope 越窄越快；ordering 越弱越快；shared memory atomics 比 global memory atomics 快

3.2.3 Thread Scopes（執行緒範圍）

CUDA 執行緒構成 Thread Hierarchy；在此階層中，記憶體操作的「可見性」與「同步範圍」並非一致。為描述這種非一致性，CUDA 程式設計模型引入 thread scope 的概念。

thread scope 的定義：決定哪些執行緒能觀察到某執行緒的 loads 與 stores，並指定哪些執行緒能透過 atomic operations、barriers 等 synchronization primitives 互相同步。
每個 scope 都對應記憶體階層中的一個 point of coherency（一致性點）。
thread scope 暴露於 CUDA PTX，同時以擴充形式提供於 libcu++ 函式庫（cuda::thread_scope_*）。

CUDA C++ Thread Scope	CUDA PTX	可見範圍	Point of Coherency
`cuda::thread_scope_thread`	–	僅本執行緒可見	– （無）
`cuda::thread_scope_block`	`.cta`	同一 thread block 內其他執行緒	L1
（`cuda::thread_scope_cluster`）	`.cluster`	同一 thread block cluster 內其他執行緒	L2
`cuda::thread_scope_device`	`.gpu`	同一 GPU device 內其他執行緒	L2
`cuda::thread_scope_system`	`.sys`	同一系統內（CPU、其他 GPU）	L2 + connected caches

Note

原文表格中 .cluster 列未列出 CUDA C++ 名稱（與 .cta 共用 cuda::thread_scope_block 列的對應），libcu++ 對應的 scope 為 cuda::thread_scope_cluster。cluster 的 point of coherency 為 L2。

scope 涵蓋範圍與一致性點由窄到寬：

 scope               PTX        coherency point        可見對象
 thread          (private)        –                    僅自己
   │
 block            .cta            L1                    同 block 的執行緒
   │
 cluster          .cluster        L2                    同 cluster 的執行緒
   │
 device           .gpu            L2                    同 GPU 的執行緒
   │
 system           .sys            L2 + connected caches CPU + 其他 GPU
   ▼  (越往下涵蓋越廣、需穿越的快取階層越多 → 越慢)

Tip

「選擇能滿足正確性的最窄 scope」是 scoped atomics 的核心原則。scope 越窄，coherency point 越靠近核心（L1），同步成本越低。

Important

章節 Advanced Synchronization Primitives 與 Asynchronous Data Copies 都會示範 thread scopes 的使用。

3.2.4 Advanced Synchronization Primitives 概觀

本節介紹三大類同步原語：

原語	用途
Scoped Atomics	將 C++ memory ordering 與 CUDA thread scope 配對，於 block/cluster/device/system scope 安全跨執行緒溝通
Asynchronous Barriers	把同步拆成 arrival 與 wait 兩階段，可追蹤非同步操作的進度
Pipelines	分段安排工作、協調多緩衝 producer–consumer 模式，常用於 compute 與 async data copy 的重疊

本筆記聚焦 Scoped Atomics；Asynchronous Barriers 與 Pipelines 見 03-Advanced-CUDA/06-Asynchronous-Barriers-and-Pipelines。

3.2.4.1 Scoped Atomics

Scoped atomics 支援 C++ 標準 atomic 記憶體語意，可透過 libcu++ 函式庫或 編譯器 built-in functions 使用。它在 CUDA thread hierarchy 的適當層級提供高效同步，兼顧正確性與效能。

兩大組成（3.2.4.1.1）

Scoped atomics 結合兩個關鍵概念：

Thread Scope：定義哪些執行緒能觀察到此 atomic 操作的效果。
Memory Ordering：定義此操作相對其他記憶體操作的順序約束。

範例一：block-scoped 原子計數器

#include <cuda/atomic>
__global__ void block_scoped_counter() {
    // 只在本 block 內可見的 shared atomic counter
    __shared__ cuda::atomic<int, cuda::thread_scope_block> counter;
    if (threadIdx.x == 0) {                       // 只由 thread 0 初始化
        counter.store(0, cuda::memory_order_relaxed);
    }
    __syncthreads();                              // 確保所有執行緒看到初始化結果
    int old_value = counter.fetch_add(1, cuda::memory_order_relaxed);
    // 使用 old_value...
}

等價的 built-in 寫法以 __nv_atomic_* 搭配 __NV_ATOMIC_RELAXED、__NV_THREAD_SCOPE_BLOCK：

__shared__ int counter;
if (threadIdx.x == 0) {
    __nv_atomic_store_n(&counter, 0, __NV_ATOMIC_RELAXED, __NV_THREAD_SCOPE_BLOCK);
}
__syncthreads();
int old_value = __nv_atomic_fetch_add(&counter, 1,
                                      __NV_ATOMIC_RELAXED, __NV_THREAD_SCOPE_BLOCK);

此例展示 scoped atomics 的基本要點：

Shared Variable：以 __shared__ 記憶體在 block 內共享單一 counter。
Atomic Type Declaration：cuda::atomic<int, cuda::thread_scope_block> 建立具 block 層級可見性的 atomic 整數。
Single Initialization：只由 thread 0 初始化，避免設定期間的 race condition。
Block Synchronization：__syncthreads() 確保所有執行緒在繼續前都看到已初始化的 counter。
Atomic Increment：每個執行緒原子地遞增 counter 並取回先前值。

Tip

此處選 cuda::memory_order_relaxed，因為我們只需要 原子性（不可分割的 read-modify-write），不需要跨不同記憶體位置的順序約束。單純計數時遞增的先後順序不影響正確性。

範例二：producer-consumer 用 acquire-release 確保順序

__global__ void producer_consumer() {
    __shared__ int data;
    __shared__ cuda::atomic<bool, cuda::thread_scope_block> ready;
    if (threadIdx.x == 0) {
        // Producer：先寫 data，再 signal ready
        data = 42;
        ready.store(true, cuda::memory_order_release);  // release 確保 data 寫入可見
    } else {
        // Consumer：等 ready 訊號後再讀 data
        while memory_order_acquire) { // acquire 確保讀到該寫入
            // spin wait
        }
        int value = data;
        // 處理 value...
    }
}

Important

release/acquire 配對形成 happens-before：producer 在 release store 之前的所有寫入（data = 42），對於以 acquire load 看到該旗標的 consumer 都保證可見。built-in 版本對應 __NV_ATOMIC_RELEASE / __NV_ATOMIC_ACQUIRE，且只有 ready 旗標需要 atomic 操作（data 為一般 int）。

producer-consumer 的 release/acquire 時序：

 Producer (thread 0)                 Consumer (thread != 0)
 ───────────────────                 ──────────────────────
 data = 42;                          while(!ready.load(acquire)) { spin }
 ready.store(true, release) ─────────►          │
        │   release: 之前的寫入                  │  acquire: 之後的讀取
        │   不能重排到 store 之後                 │  不能重排到 load 之前
        ▼                                        ▼
   (data 寫入對 consumer 可見) ───── 配對 ───► int value = data;  // 保證讀到 42

Warning

若兩端都用 memory_order_relaxed，則只保證 ready 本身的原子性，不保證 consumer 讀到 ready==true 時也能看到 data=42，會產生資料競爭。producer-consumer 一定要用 acquire-release（或更強）。

Memory Ordering 語意對照（C++ 標準）

memory order	適用操作	語意
`memory_order_relaxed`	load/store/RMW	只保證該 atomic 本身原子性，無跨位置順序約束（最弱、最快）
`memory_order_acquire`	load	此 load 之後的記憶體存取不得重排到它之前；可看見配對 release 之前的所有寫入
`memory_order_release`	store	此 store 之前的記憶體存取不得重排到它之後；對配對 acquire 公開先前寫入
`memory_order_acq_rel`	read-modify-write	同時具 acquire + release 語意（適用 `fetch_add`、CAS 等 RMW）
`memory_order_seq_cst`	load/store/RMW	最強：除 acq_rel 外，所有 seq_cst 操作共享單一全域總順序（最慢）

3.2.4.1.2 Performance Considerations（效能考量）

原則	說明
使用最窄的 scope	block-scoped atomics 遠快於 system-scoped atomics
優先用較弱的 ordering	僅在正確性需要時才用較強 ordering（relaxed < acquire/release < acq_rel < seq_cst）
考量記憶體位置	shared memory atomics 比 global memory atomics 快

Tip

三條原則合在一起的口訣：scope 最窄、ordering 最弱、位置最近（shared > global），在保證正確的前提下選最便宜的組合。

Warning

「scope 越窄越快」是效能取向，但不能為了快而選錯 scope：若兩個執行緒不在同一 block，卻用 thread_scope_block 同步，行為將不正確。scope 必須涵蓋所有需要互相觀察/同步的執行緒。

考試/測驗重點

情境/關鍵字	答案
thread scope 定義了什麼	哪些執行緒能觀察 load/store + 哪些執行緒能用 atomics/barriers 互相同步
scope 由窄到寬排序	thread → block → cluster → device → system
`.cta` 對應	`cuda::thread_scope_block`，coherency point = L1
`.cluster` 對應 / coherency	thread block cluster，coherency point = L2
`.gpu` 對應	`cuda::thread_scope_device`，coherency point = L2
`.sys` 對應 / coherency	`cuda::thread_scope_system`（CPU、其他 GPU），coherency point = L2 + connected caches
thread scope 的 coherency point	thread = 無；block = L1；cluster/device = L2；system = L2 + connected caches
scoped atomics 兩大組成	Thread Scope + Memory Ordering
scoped atomics 兩種 API	libcu++ `cuda::atomic`/`cuda::atomic_ref`；built-in `__nv_atomic_*`
單純計數該用哪種 ordering	`memory_order_relaxed`（只需原子性，不需跨位置順序）
producer-consumer 該用哪種 ordering	producer `release` store + consumer `acquire` load
release 保證什麼	release 之前的寫入，對看見此旗標（acquire）的執行緒可見
acquire 保證什麼	acquire 之後的讀取能看到配對 release 之前的寫入
RMW（fetch_add/CAS）要 acquire+release 用哪個	`memory_order_acq_rel`
最強 ordering / 全域總順序	`memory_order_seq_cst`
哪個 atomic 較快：block vs system	block-scoped 遠快於 system-scoped
哪個 atomic 較快：shared vs global	shared memory atomics 較快
易混：scope 越窄越快，可否亂選	不可，scope 必須涵蓋所有需互相同步的執行緒
易混：relaxed 能否做 producer-consumer 同步	不能，會資料競爭；需 acquire-release
初始化 + 同步樣板	只 thread 0 初始化 → `__syncthreads()` → 全部執行緒 atomic 存取

重點總覽

3.2.3 Thread Scopes（執行緒範圍）

3.2.4 Advanced Synchronization Primitives 概觀

3.2.4.1 Scoped Atomics

兩大組成（3.2.4.1.1）

範例一：block-scoped 原子計數器

範例二：producer-consumer 用 acquire-release 確保順序

Memory Ordering 語意對照（C++ 標準）

3.2.4.1.2 Performance Considerations（效能考量）

考試/測驗重點

Related Notes