非同步 Barriers 與 Pipelines (Asynchronous Barriers and Pipelines)

#cuda #async-execution #async-barrier #pipeline #concept

重點總覽

項目	重點
Asynchronous barrier	把「到達 (arrive)」與「等待 (wait)」分離的多階段 barrier，相對於單階段 `__syncthreads()`
為何分離	thread 在 arrive 後可繼續做與 barrier 無關的工作，重疊 (overlap) 等待時間，提升效率
硬體支援	CC 7.0+ 可用；CC 8.0+ 在 shared memory 硬體加速，且可同步 block 內「任意 subset」的 threads
主要 API	`cuda::barrier` (libcu++ ISO C++)、低階 `cuda::ptx`、shared-memory primitives
Thread scope	block / cluster (local 與 remote shared) / device / system；scope 影響加速與 wait 是否允許
arrival_token	`arrive()` 回傳的 token 標記目前 barrier phase，`waitmove(token)` 等待該 phase 完成
五階段模型	before-arrive → arrive(含 fence) → between → wait → after-wait
Producer-consumer	用 warp specialization 讓不同 thread subset 分工，靠 arrive/wait 協調
Pipeline	把多筆非同步 memory copy 排成多階段 (double/multi-buffering) 的 FIFO 協調物件
cuda::pipeline API	`producer_acquire` / `producer_commit` / `consumer_wait` / `consumer_release`
Pipeline primitives	`__pipeline_memcpy_async` / `__pipeline_commit` / `__pipeline_wait_prior(N)`
Pipeline vs 單一 barrier	多階段可預取 (prefetch) 多個 buffer，重疊 copy 與運算，遠勝單 barrier 雙緩衝

Asynchronous Barriers（非同步 barrier）

Asynchronous barrier 與典型的單階段 barrier（__syncthreads()）不同：thread 發出「我到了」的通知（arrival）被從「等待其他 thread 到達」的動作（wait）中分離出來。這個分離提升執行效率，因為 thread 可在等待期間執行與 barrier 無關的額外運算，更有效利用等待時間。

關鍵用途：

用 CUDA threads 實作 producer-consumer 模式。
讓記憶體階層內的非同步資料複製在完成時對 barrier 發出訊號（「arrive on」barrier）。

硬體與可用性

在 compute capability 7.0 或更高的裝置上可用。
CC 8.0+ 對 shared-memory 的 async barrier 提供硬體加速，且大幅提升同步粒度：可硬體加速地同步 block 內任意 subset 的 CUDA threads。
先前架構只能加速 whole-warp（__syncwarp()）或 whole-block（__syncthreads()）層級的同步。

API 層級

CUDA 透過 cuda::std::barrier（libcu++ 中符合 ISO C++ 的 barrier）提供 async barrier；常用別名為 cuda::barrier。
除實作 std::barrier 外，函式庫提供 CUDA 專屬擴充，可選擇 barrier 的 thread scope 以改善效能，並暴露較低階的 cuda::ptx API。
cuda::barrier 可透過 friend function cuda::device::barrier_native_handle() 取得 native handle，傳給 cuda::ptx 函式而互通。
CUDA 另提供一組 primitives API，針對 thread-block scope 的 shared-memory async barrier。

Tip

重點不是「更快的 __syncthreads()」，而是把同步點切成兩半，在中間塞入獨立工作，把同步等待時間藏起來（latency hiding）。

不同 thread scope 的 async barrier 對照

Thread Scope	Memory Location	Arrive on Barrier	Wait on Barrier	Hardware-accelerated	CUDA APIs
block	local shared memory	allowed	allowed	yes (8.0+)	`cuda::barrier`, `cuda::ptx`, primitives
cluster	local shared memory	allowed	allowed	yes (9.0+)	`cuda::barrier`, `cuda::ptx`
cluster	remote shared memory	allowed	not allowed	yes (9.0+)	`cuda::barrier`, `cuda::ptx`
device	global memory	allowed	allowed	no	`cuda::barrier`
system	global/unified memory	allowed	allowed	no	`cuda::barrier`

Warning

對 cluster remote shared memory 的 barrier，只允許 arrive，不允許 wait。其餘 scope 兩者皆可。device / system scope 沒有硬體加速，且只有 cuda::barrier 支援。

同步的時間性切分（Temporal Splitting）

沒有 arrive-wait barrier 時，thread block 內同步靠 __syncthreads() 或 Cooperative Groups 的 block.sync()：

#include <cooperative_groups.h>
__global__ void simple_sync(int iteration_count) {
    auto block = cooperative_groups::this_thread_block();
    for (int i = 0; i < iteration_count; ++i) {
        /* code before arrive */
        // Wait for all threads to arrive here.
        block.sync();
        /* code after wait */
    }
}

threads 被阻擋在同步點直到所有 thread 都到達；且同步點之前的記憶體更新保證在同步點之後對 block 內所有 thread 可見。此模式有三階段：sync 前的更新 → 同步點 → sync 後（可見前面的更新）。

改用 async barrier 後，同步被切成 arrive 點與 wait 點：

#include <cuda/barrier>
#include <cooperative_groups.h>
__global__ void split_arrive_wait(int iteration_count, float *data) {
    using barrier_t = cuda::barrier<cuda::thread_scope_block>;
    __shared__ barrier_t bar;
    auto block = cooperative_groups::this_thread_block();
    if (block.thread_rank() == 0) {
        init(&bar, block.size());   // expected arrival count = block 大小
    }
    block.sync();
    for (int i = 0; i < iteration_count; ++i) {
        /* code before arrive */
        barrier_t::arrival_token token = bar.arrive();  // arrive 不阻擋 thread
        compute(data, i);                                // 與等待重疊的獨立工作
        bar.waitmove(token);                      // 在此阻擋直到全員 arrive
        /* code after wait */
    }
}

重點：barrier 用 init(&bar, count) 設定預期到達數 (expected arrival count)（通常為 block.size()），且只由一個 thread 初始化再 block.sync()。bar.arrive() 回傳 arrival_token；中間的 compute() 重疊掉等待時間；bar.waitmove(token) 才真正阻擋。

Important

init() 傳入的 expected arrival count 決定 wait() 要等到參與 thread 完成 arrive() 多少次。thread 以第一次呼叫 bar.arrive() 開始參與該 barrier。

token 與 barrier phase

bar.arrive() 回傳的 arrival_token 擷取了 barrier 目前的 phase。
bar.waitmove(token)) 會阻擋，直到參與 threads 完成 bar.arrive( 達到預期次數（即該 phase 完成）。
phase 完成後 barrier 自動進入下一個 phase 並重設計數，因此可在迴圈中重複使用同一個 barrier。

記憶體可見性與五階段模型

arrive() 之前發生的記憶體更新，保證在參與 threads 呼叫 wait() 之後對它們可見。arrive() 不會阻擋 thread——它可以接著做不依賴其他 thread arrive 前更新的工作。

arrive-wait 模式的五階段：

階段	內容
1. before arrive	做出將在 wait 後被讀取的記憶體更新
2. arrive 點	含隱含 memory fence（等同 `cuda::atomic_thread_fencememory_order_seq_cst, cuda::thread_scope_block`）
3. between	arrive 與 wait 之間的獨立工作（重疊等待）
4. wait 點	阻擋直到全員 arrive
5. after wait	可見 arrive 之前所做的更新

Thread A                         Thread B
────────                         ────────
code before arrive               code before arrive
   │  (更新將被 wait 後讀取)         │
 arrive() ─┐  token              arrive() ─┐  token
   │       │ (非阻擋, 隱含 fence)    │       │
compute()  │  ← 重疊區 →          compute()  │
   │       │                       │       │
 wait(tok)◄┘                     wait(tok)◄┘  (阻擋直到全員 arrive)
   │  (可見對方 arrive 前更新)        │
code after wait                  code after wait

Tip

階段 2 的 arrive 帶有 seq_cst / thread_scope_block 的隱含 fence，因此跨 thread 的可見性與一般 __syncthreads() 一致——差別只在於把等待延後到 wait()。

較低階 API：cuda::ptx 與 primitives

三套 API 是同一模式的不同層級寫法，差異在於 wait 的呼叫方式：

// cuda::ptx：手動輪詢 try_wait
cuda::ptx::mbarrier_init(&bar, block.size());          // bar 為 __shared__ uint64_t
uint64_t token = cuda::ptx::mbarrier_arrive(&bar);
while mbarrier_try_wait(&bar, token) {}  // 自旋等待

// C primitives：__mbarrier_*，try_wait 帶逾時(ns)上限
__mbarrier_init(&bar, block.size());                   // bar 為 __shared__ __mbarrier_t
__mbarrier_token_t token = __mbarrier_arrive(&bar);
while (!__mbarrier_try_wait(&bar, token, 1000)) {}

重點：cuda::ptx 與 primitives 用 try_wait 迴圈取代 bar.wait()；primitives 的 token 型別為 __mbarrier_token_t，且 __mbarrier_try_wait 第三參數是逾時上限。三者語意相同。

Producer-consumer 與 warp specialization

由於 CC 8.0+ 可硬體加速地同步 block 內任意 subset 的 threads，async barrier 很適合 warp specialization：把不同 warp（thread subset）指派成 producer 或 consumer，各自只在自己的 barrier 上 arrive/wait，彼此用 barrier 的 arrive 訊號協調，而非全 block 一起 __syncthreads()。

warp 0 (producer)            warp 1 (consumer)
─────────────────            ─────────────────
load → shared buffer
 arrive(full) ──────────────► wait(full)
       │ (繼續預取下一批)         │  consume shared buffer
 wait(empty) ◄────────────── arrive(empty)
 (buffer 釋放後再 load)

Important

warp specialization 的核心是「分離 arrive 與 wait」：producer arrive 後不必停下來，可立刻預取下一批，consumer 同時消費前一批，達成 compute 與 data movement 重疊。

Pipelines（pipeline 同步物件）

CUDA 提供 pipeline 同步物件作為協調機制，把非同步 memory copy 排序成多個階段 (stages)，方便實作 double-buffering / multi-buffering 的 producer-consumer 模式。

pipeline 是一個有 head 與 tail 的雙端佇列 (deque)，以 **FIFO（先進先出）**順序處理工作。
producer threads 把工作 commit 到 head；consumer threads 從 tail 取出工作。
透過 libcu++ 的 cuda::pipeline API 與一組 primitives API 暴露。

     producer 端 (head)                          consumer 端 (tail)
  producer_acquire ─► 發 memcpy_async ─► producer_commit
        │                                                │
        ▼            FIFO ring of stages                 ▼
   ┌──────┬──────┬──────┬──────┐
   │ s0   │ s1   │ s2   │ s3   │   ◄ 多 buffer 同時在飛
   └──────┴──────┴──────┴──────┘
        ▲                                                │
        └──── consumer_release ◄─ 用 shared 資料 ◄─ consumer_wait

cuda::pipeline API

API	說明
`producer_acquire`	取得 pipeline 內部佇列中一個可用的 stage
`producer_commit`	把在該次 `producer_acquire` 之後、於目前已取得 stage 上發出的非同步操作 commit
`consumer_wait`	等待 pipeline 中最舊 stage 的非同步操作完成
`consumer_release`	釋放最舊 stage 回 pipeline 物件供重用；釋放後的 stage 可被 producer 再次 acquire

Primitives API

API	說明
`__pipeline_memcpy_async`	請求一筆 global→shared memory 的複製，提交給非同步執行
`__pipeline_commit`	把此呼叫之前在目前 stage 上發出的非同步操作 commit
`__pipeline_wait_prior(N)`	等待「除了最後 N 次 commit 以外」所有 commit 的非同步操作完成

兩套 API 的比較

面向	`cuda::pipeline` API	Primitives API
介面	較豐富、限制較少	較精簡
追蹤範圍	一般非同步操作	僅追蹤 global→shared 的非同步複製
限制	較寬鬆	有特定 size / alignment 需求
等價性	—	等同於 `cuda::thread_scope_thread` 的 `cuda::pipeline` 物件

Tip

相對於單一 async barrier 只能做雙緩衝（一邊算、一邊載），多階段 pipeline 能同時讓多個 buffer「在飛」，預取 (prefetch) 後面好幾批資料，把 copy 延遲完全藏在運算之下。

Warning

Primitives API 只支援 global→shared 的非同步複製，且有 size/alignment 限制；需要更一般化或不同 thread scope 時，應改用 cuda::pipeline。

考試/測驗重點

情境/關鍵字	答案
async barrier 與 `__syncthreads()` 差別	async 把 arrive 與 wait 分離；`__syncthreads()` 是單階段、arrive 即阻擋
為何要分離 arrive/wait	thread 在 arrive 後可做獨立工作，重疊等待時間提升效率
async barrier 最低 compute capability	7.0；shared-memory 硬體加速需 8.0+
CC 8.0+ 新增能力	可硬體加速同步 block 內「任意 subset」threads（不再只有 warp/block 粒度）
先前架構只加速哪兩種粒度	whole-warp (`__syncwarp()`) 與 whole-block (`__syncthreads()`)
`bar.arrive()` 是否阻擋	否；只回傳 token，thread 可繼續
真正阻擋的呼叫	`bar.waitmove(token)`
`init(&bar, count)` 的 count 意義	expected arrival count，wait 要等到累計 arrive 達此數
arrival_token 作用	標記目前 barrier phase，供 `wait` 等待該 phase 完成
arrive 點的隱含 fence	等同 `cuda::atomic_thread_fence(memory_order_seq_cst, thread_scope_block)`
cluster remote shared memory barrier	只允許 arrive，不允許 wait
device / system scope 是否硬體加速	否；且只有 `cuda::barrier` 支援
`cuda::barrier` 來自哪個函式庫	libcu++（ISO C++ `std::barrier`，加 CUDA scope 擴充）
取得 native handle 傳給 ptx 的函式	`cuda::device::barrier_native_handle()`
cuda::ptx / primitives 如何 wait	`mbarrier_try_wait` / `__mbarrier_try_wait` 迴圈自旋
pipeline 資料結構與順序	雙端佇列 (deque)，FIFO，producer 進 head、consumer 出 tail
producer 兩步驟	`producer_acquire` → `producer_commit`
consumer 兩步驟	`consumer_wait` → `consumer_release`
`consumer_release` 後 stage 去處	回 pipeline 供 producer 再次 acquire
`__pipeline_wait_prior(N)` 語意	等待除最後 N 次 commit 外的所有 commit 完成
primitives pipeline 等價 scope	`cuda::thread_scope_thread`
primitives pipeline 限制	僅 global→shared 複製，有 size/alignment 需求
多階段 pipeline 勝過單 barrier 之處	可同時預取多個 buffer，多 buffer 並行隱藏 copy 延遲

重點總覽

Asynchronous Barriers（非同步 barrier）

不同 thread scope 的 async barrier 對照

同步的時間性切分（Temporal Splitting）

token 與 barrier phase

記憶體可見性與五階段模型

較低階 API：cuda::ptx 與 primitives

Producer-consumer 與 warp specialization

Pipelines（pipeline 同步物件）

cuda::pipeline API

Primitives API

兩套 API 的比較

考試/測驗重點

Related Notes