Asynchronous Barriers 深入 (Async Barriers Deep Dive)

#cuda #async-execution #async-barrier #concept

Asynchronous barriers 把 CUDA 的同步能力延伸到 __syncthreads() 與 __syncwarp() 之外，提供細粒度、非阻塞的協調方式，讓通訊與計算能更好地重疊。其核心是把「到達 (arrive)」與「等待 (wait)」拆成兩個獨立操作 (split arrive/wait)，主要透過 cuda::barrier API 使用，並可下探到 cuda::ptx 與 primitives。

重點總覽

項目	重點
Initialization	`init(&bar, count)` 設定 expected arrival count；必須在任何 thread `arrive()` 之前完成，用 `block.sync()` / `__syncthreads()` bootstrap
Phase 倒數	countdown 從 expected arrival count 倒數，每次 `arrive()` 遞減；歸零即 phase 完成
自動 reset	使 countdown 歸零的那次 `arrive()` 會「自動且原子地」reset，countdown 重設為 expected count 並進入下一 phase
arrival_token	`token = bar.arrive()` 綁定當前 phase；`bar.waitmove(token)` 在 phase 未翻轉時阻塞
Warp entanglement	完全收斂的 warp 只更新 barrier 一次；完全發散則 32 次；建議 `__syncwarp` 後再 arrive
Explicit phase tracking	`mbarrier_try_wait_parity(&bar, parity)`；even phase parity=0、odd=1，初始 0
Early exit	`arrive_and_drop()` 履行當前 arrive，並把下一 phase 的 expected count 減 1
Completion function	每 phase 最後一個 arrive 後、任何 wait 解除前，執行一次，具記憶體可見性保證
Transaction barrier	compute capability 9.0+，shared memory 的 block/cluster scope，額外追蹤 transaction count
Producer-Consumer	warp specialization + 雙緩衝；每個 buffer 配兩個 barrier (ready/filled)

Initialization：必須先初始化

任何 thread 開始參與 barrier 之前，barrier 都必須先用 cuda::barrier::init() friend function 初始化。init() 的第二個參數是 expected arrival count，即「參與 thread 在某個 thread 從 bar.waitmove(token)) 解除阻塞前，會呼叫 bar.arrive() 的總次數」。下例以整個 thread block 參與，count 設為 block.size(。

#include <cuda/barrier>
#include <cooperative_groups.h>
__global__ void init_barrier() {
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;
    auto block = cooperative_groups::this_thread_block();
    if (block.thread_rank() == 0) {
        // 單一 thread 初始化 total expected arrival count
        init(&bar, block.size());
    }
    block.sync();   // bootstrap：所有 thread 都看到已初始化的 barrier
}

對應的低階寫法：cuda::ptx::mbarrier_init(&bar, block.size())（barrier 型別為 uint64_t），或 primitives 的 __mbarrier_init(&bar, block.size())（型別為 __mbarrier_t）。

Bootstrapping 兩難：thread 必須先同步才能參與 barrier，但建立 barrier 的目的正是為了同步。解法是讓參與的 thread 屬於同一個 cooperative group，先用 block.sync()（或 __syncthreads()）完成初始化的同步。
由單一 thread (thread_rank() == 0) 呼叫 init()，其餘 thread 在 block.sync() 處等待。

完整 block / 完整 warp 該用什麼

async barrier 的彈性在於可指定「哪些 thread」與「怎麼參與」(split arrive/wait)。但若目的只是同步整個 thread block 或整個 warp，建議改用 __syncthreads() 與 __syncwarp(mask)，效能較佳。

A Barrier's Phase：Arrival / Countdown / Completion / Reset

barrier 在每個 phase 從 expected arrival count 倒數到 0。參與 thread 呼叫 bar.arrive() 使 countdown 遞減；當倒數到零，當前 phase 即 complete。讓 countdown 歸零的那一次 arrive() 會自動且原子地執行 reset：把 countdown 重設為 expected arrival count，並把 barrier 推進到下一個 phase。

expected arrival count = N
┌───────────────── phase P ─────────────────┐
 arrive() ↓  arrive() ↓ ... (共 N 次)
 countdown:  N → N-1 → ... → 1 → 0
                                  │  最後一個 arrive 使 countdown 歸零
                                  ▼
                    [completion function 執行一次 (若有)]
                                  ▼
         原子 reset: countdown ← N，phase 推進到 P+1
         所有阻塞在 wait(token_P) 的 thread 一齊解除

token = bar.arrive() 回傳一個 cuda::barrier::arrival_token，綁定 barrier 的當前 phase。bar.waitmove(token) 在「token 的 phase 等於 barrier 的 phase」期間阻塞呼叫 thread：

若 phase 在呼叫 wait() 之前就已推進（countdown 已歸零），thread 不阻塞、直接通過。
若 phase 在 thread 正阻塞於 wait() 時推進，thread 被解除阻塞。

arrive 不阻塞、wait 才阻塞

arrive() 只回報「我到了」並使 countdown 遞減，不會阻塞 thread，呼叫後可立刻去做獨立計算 (split arrive/wait)。真正的阻塞發生在 wait()。

使用規則（尤其在非平凡的 arrive/wait 模式中必須遵守）：

同一 thread 的 token=bar.arrive() 與 bar.waitmove(token) 必須有序：arrive 發生在當前 phase，wait 發生在同一個或下一個 phase。
bar.arrive() 必須在 counter 非零時呼叫。初始化後，若某 thread 的 arrive() 使 countdown 歸零，則必須先有一次 bar.waitmove(token))，barrier 才能被後續的 arrive( 重用。

token 的有效範圍

bar.wait() 只能用「當前 phase」或「緊接的前一個 phase」的 token 物件。傳入任何其他 phase 的 token，行為未定義 (undefined behavior)。

Warp Entanglement

warp 的發散程度會影響一次 arrive-on 操作對 barrier 的更新次數：

warp 完全收斂 (converged) → barrier 只被更新 1 次。
warp 完全發散 (diverged) → barrier 被施加 32 次 個別更新。

收斂後再 arrive

建議由收斂的 thread 發出 arrive-on(bar)，以最小化對 barrier 物件的更新次數。若 arrive 之前的程式碼使 thread 發散，應先以 __syncwarp 重新收斂，再進行 arrive-on 操作。

Explicit Phase Tracking：用 parity 取代 token

barrier 會隨著被用來同步的次數而有多個 phase。除了用 token 追蹤 phase 翻轉，也可用 mbarrier_try_wait_parity() 系列（cuda::ptx 與 primitives API）直接追蹤 phase 的 parity。

#include <cuda/ptx>
#include <cooperative_groups.h>
__global__ void split_arrive_wait(int iteration_count, float *data) {
    __shared__ uint64_t bar;
    int parity = 0;   // 初始 phase parity 為 0
    auto block = cooperative_groups::this_thread_block();
    if (block.thread_rank() == 0)
        cuda::ptx::mbarrier_init(&bar, block.size());
    block.sync();
    for (int i = 0; i < iteration_count; ++i) {
        mbarrier_arrive(&bar;   // arrive 不阻塞
        compute(data, i);                          // 可重疊獨立計算
        while mbarrier_try_wait_parity(&bar, parity) {}
        parity ^= 1;                               // 翻轉 parity
    }
}

mbarrier_try_wait_parity(uint64_t* bar, const uint32_t& phaseParity) 等待具特定 parity 的 phase。
phaseParity 是「當前 phase」或「緊接前一個 phase」的整數 parity：even phase = 0、odd phase = 1。初始化時 phase 的 parity 為 0，故有效值只有 0 與 1。
好處：追蹤 async memory 操作時，可讓單一 thread arrive 並設定 transaction count，其他 thread 只等待 parity 翻轉，比所有 thread 都 arrive 並用 token 更有效率。

範圍限制

explicit phase tracking 只適用於 shared-memory barrier，scope 為 thread-block 或 cluster。

Early Exit：提前退出要先 drop

若某個正在參與一連串同步的 thread 必須提前退出，它必須先明確放棄參與再離開，其餘 thread 才能照常繼續後續的 arrive/wait。

for (int i = 0; i < N; ++i) {
    if (condition_check()) {
        bar.arrive_and_drop();   // 履行當前 arrive 並退出參與
        return;
    }
    auto token = bar.arrive();
    /* code between arrive and wait */
    bar.waitmove(token);  // 等所有 thread 到達
}

bar.arrive_and_drop() 先 arrive 以履行該 thread 在當前 phase的到達義務，接著把下一個 phase 的 expected arrival count 遞減，使該 thread 之後不再被期待 arrive。
primitives 版用 __mbarrier_arrive_and_drop(&bar)，等待用帶逾時的 __mbarrier_try_wait(&bar, token, 1000)。

Completion Function

cuda::barrier<Scope, CompletionFunction> 支援可選的 completion function，每個 phase 執行一次：在最後一個 thread arrive 之後、任何 thread 從 wait 解除之前執行。

auto completion_fn = [&] {                 // capture，故 barrier 非 default-constructible
    int sum = 0;
    for (int i = 0; i < BlockSize; ++i) sum += smem[i];
    *acc += sum;
};
using barrier_t = cuda::barrier<cuda::thread_scope_block, decltype(completion_fn)>;
// new (bar) barrier_t{block.size(), completion_fn};
// 等同 init(bar, block.size(), completion_fn);

記憶體可見性保證（這是 completion function 的關鍵價值）：

本 phase 中各 arrive thread 所做的記憶體操作，對執行 completion function 的 thread 可見。
completion function 內所做的所有記憶體操作，待各 thread 從 wait 解除後，對所有等待 thread 可見。
因 lambda 帶 capture 而非 default-constructible，範例以 placement new 在 shared 儲存上建構 barrier。

Tracking Asynchronous Memory Operations

async barrier 也能追蹤 async memory copy：當一個 async copy 被綁定到 barrier，該 copy 在啟動時自動增加當前 phase 的 expected count、在完成時遞減。如此 wait() 會阻塞直到所有相關 async copy 完成，是同步多個並行記憶體操作的便利機制。

從 compute capability 9.0 起，shared memory 中 thread-block 或 cluster scope 的 async barrier 可顯式追蹤 async memory 操作，稱為 asynchronous transaction barrier。除了 expected arrival count，barrier 還能接受一個 transaction count，以 async 操作指定的單位（通常是 bytes）追蹤尚未完成的 transaction 數。

auto block = cooperative_groups::this_thread_block();
if (block.thread_rank() == 0) init(&bar, block.size());
block.sync();
// 構造 token，arrival count 減 1，expected transaction count 加 0
auto token = cuda::device::barrier_arrive_tx(bar, 1, 0);
bar.waitmove(token);

transaction count 可在 arrive 時用 cuda::device::barrier_arrive_tx() 設定，或直接用 cuda::device::barrier_expect_tx() 設定。
ptx 版用 cuda::ptx::mbarrier_arrive_expect_tx(sem_release, scope_cluster, space_shared, &bar, 1, 0) 搭配 mbarrier_try_wait。
當 barrier 使用 transaction count，wait 會阻塞 thread，直到所有 producer thread 都已 arrive 且所有 transaction count 的總和達到期望值。
上例 transaction 更新為 0，故 barrier 未追蹤任何 transaction；真正的 TMA 追蹤範例見 async copy / TMA 章節。

兩個門檻

transaction barrier 的 wait 需同時滿足：(1) 所有 producer 都已 arrive；(2) transaction count 總和達到期望值。兩者皆滿足才解除阻塞。

Producer-Consumer Pattern Using Barriers

可把 thread block 空間切分 (spatial partitioning)，讓不同 warp 的 thread 負責不同任務，這稱為 warp specialization。在 producer-consumer 中，一組 thread 生產資料，另一組（互斥）thread 同時消費，需要兩個單向同步來管理 buffer。

Producer                              Consumer
  wait buffer ready to be filled  <-- signal buffer ready to be filled
  produce & fill buffer
  signal buffer is filled         --> wait buffer to be filled
                                      consume data in filled buffer

Producer 等 consumer 發出「buffer 可填」訊號；consumer 不等此訊號。
Consumer 等 producer 發出「buffer 已填」訊號；producer 不等此訊號。
為達完整並行，至少要雙緩衝 (double buffering)，每個 buffer 需 兩個 barrier（共 4 個：bar[0]/bar[1] 追蹤 buffer 是否 ready-to-fill，bar[2]/bar[3] 追蹤是否 filled）。

using barrier_t = cuda::barrier<cuda::thread_scope_block>;
__device__ void produce(barrier_t ready[], barrier_t filled[], ...) {
    for (int i = 0; i < N / buffer_len; ++i) {
        ready[i % 2].arrive_and_wait();         // 等 buffer_(i%2) 可被填
        /* fill buffer_(i%2) */
        filled[i % 2].arrive();                 // 標記已填，不等
    }
}
__device__ void consume(barrier_t ready[], barrier_t filled[], ...) {
    ready[0].arrive(); ready[1].arrive();       // 兩 buffer 一開始皆可填
    for (int i = 0; i < N / buffer_len; ++i) {
        filled[i % 2].arrive_and_wait();        // 等 buffer_(i%2) 被填
        /* consume buffer_(i%2) */
        ready[i % 2].arrive();                  // 標記可重填
    }
}

範例中第一個 warp 專責 producer、其餘 warp 專責 consumer；所有 producer 與 consumer thread 都參與四個 barrier，故 expected arrival count 皆等於 block.size()。要在 barrier 上等待，需先 arrive() 取得 token 再 wait(token)；arrive_and_wait() 把兩步合一：

bar.arrive_and_wait();
// 等同
bar.wait(bar.arrive());

Producer 填完後以 filled[i%2].arrive() 標記，但不在此等待，而是去等下一輪 (double buffering) 的 buffer 可填。
Consumer 先標記兩 buffer 皆可填（不等），之後每輪以 filled[i%2].arrive_and_wait() 等該 buffer 被填，消費後以 ready[i%2].arrive() 標記可重填。
ptx / primitives 版以 mbarrier_arrive + mbarrier_try_wait（primitives 帶逾時值如 1000）展開同樣的 arrive/wait。

考試/測驗重點

問題	答案
`init()` 第二個參數是什麼	expected arrival count，即參與 thread 會呼叫 `arrive()` 的總次數
為何 init 前要先做一次同步	bootstrapping 兩難：thread 須先同步才能參與 barrier，用 `block.sync()` / `__syncthreads()` 解決
`arrive()` 會阻塞 thread 嗎	不會，只有 `wait()` 會阻塞；arrive 後可立即做獨立計算
countdown 歸零後發生什麼	最後一次 arrive 自動且原子地 reset：countdown 重設為 expected count 並進入下一 phase
even / odd phase 的 parity 值	even = 0、odd = 1，初始 parity = 0，有效值只有 0 / 1
`wait()` 可用哪些 token	只能用當前 phase 或緊接前一個 phase 的 token，其他為 undefined behavior
warp 發散對 arrive 的影響	完全收斂 → 更新 1 次；完全發散 → 更新 32 次；建議先 `__syncwarp` 收斂
`arrive_and_drop()` 的作用	履行當前 phase 的 arrive，並把下一 phase 的 expected count 減 1，退出參與
completion function 何時執行	每 phase 最後一個 arrive 之後、任何 thread 從 wait 解除之前，執行一次
transaction barrier 的需求	compute capability 9.0+，限 shared memory 的 block / cluster scope
transaction count 的單位	由 async 操作指定，通常是 bytes
雙緩衝 producer-consumer 需要幾個 barrier	每 buffer 2 個 (ready / filled)，雙緩衝共 4 個
`arrive_and_wait()` 等同什麼	`wait(arrive())`，把 arrive 與 wait 合為一步
explicit phase tracking 限制	只適用 shared-memory、block 或 cluster scope 的 barrier