Work Stealing 與 Cluster Launch Control

#cuda #execution-model #work-stealing #concept

重點總覽

項目	重點
問題背景	資料量與計算量可變時，傳統有兩種 grid 配置策略，各有取捨
Fixed Work per Thread Block	block 數由問題大小決定、每 block 工作量固定；優點是 load balancing 與 preemption
Fixed Number of Thread Blocks	block 數固定（通常依 SM 數），用 grid-stride loop；優點是降低 block overhead
Cluster Launch Control (CLC)	Blackwell（compute capability 10.0）新增，結合上述兩者優點
核心機制	一個 block 嘗試「取消」尚未開始執行的另一個 block，成功則竊取其 index 來做事
Work stealing	閒置處理器主動從忙碌處理器的工作佇列「偷」任務，而非等待派工
API 形式	透過 libcu++ 的非同步 try-cancel 指令，搭配 shared memory mbarrier 同步
Async proxy	CLC 操作被建模為 async proxy 操作，需要 proxy fence 確保可見性
Cluster 變體	由單一 cluster thread 提交 multicast 取消，結果廣播給 cluster 內所有 block

背景：三種 grid 配置策略

開發 CUDA 應用時，處理可變的資料與計算規模是核心課題。傳統上決定要 launch 多少 thread block 有兩種主要做法。

Fixed Work per Thread Block：block 數量由問題大小決定，每個 block 的工作量固定。
- load balancing：當 block 執行時間有差異、或 block 數遠多於 GPU 同時可執行量（low-tail effect）時，scheduler 可在某些 SM 上多跑幾個 block。
- preemption：即使較高優先權的 kernel 在較低優先權 kernel 開始後才 launch，scheduler 仍可在低優先權 block 完成時插入高優先權的 block；高優先權跑完再續跑低優先權。
Fixed Number of Thread Blocks：通常以 block-stride / grid-stride loop 實作，block 數不隨問題大小變化，而是每個 block 的工作量隨問題大小變化。block 數通常依 GPU 的 SM 數與期望 occupancy 決定。
- 降低 block overhead：不只攤平 block launch latency，也減少所有 block 共用運算的計算開銷（這些開銷可能遠大於 launch latency）。
- 例如 convolution kernel 中，計算 convolution 係數的 prologue（與 block index 無關）因 block 數固定而只需算較少次，減少冗餘計算。

兩種策略的取捨

Fixed Work 擅長 load balancing 與 preemption，但無法攤平共用 prologue 的成本；Fixed Number 反之。CLC 的目標就是把兩者的好處合而為一。

Cluster Launch Control 機制

Cluster Launch Control 在 NVIDIA Blackwell GPU 架構（compute capability 10.0）引入，讓開發者能取消（cancel）尚未開始執行的 thread block 或 thread block cluster，藉此實現 work stealing。

工作流程：

一個 block 嘗試取消另一個「尚未開始執行」的 block。
取消成功 → 用被取消 block 的 index 來做它的工作（竊取工作）。
取消失敗 → 可能因為「沒有可用的 block index 了」，或「有更高優先權的 kernel 被排程」等原因。
失敗後若 block 結束（exit），scheduler 可先執行高優先權 kernel，之後再繼續排程目前 kernel 剩餘的 block。

        ┌─────────────────────────────┐
        │  block 開始（用自己的 index 做事）│
        └──────────────┬──────────────┘
                       v
              ┌──────────────────┐
        ┌───> │  try_cancel 另一個 │
        │     │  尚未啟動的 block  │
        │     └────────┬─────────┘
        │              v
        │      ┌───────────────┐  fail（無剩餘 index /
        │  success      │ 取消成功?     ├──────────────────┐
        │      └───────┬───────┘  高優先權 kernel）       │
        │              │ success                          v
        │              v                          ┌───────────────┐
        └──── 竊取其 index，做它的工作               │ block exit；    │
                                                  │ scheduler 可插入│
                                                  │ 高優先權 kernel  │
                                                  └───────────────┘

下表彙整三種策略的優缺點（V 表具備、X 表不具備）：

能力	Fixed Work per Thread Block	Fixed Number of Thread Blocks	Cluster Launch Control
Reduced overheads	X	V	V
Preemption	V	X	V
Load balancing	V	X	V

API Details

透過 cluster launch control API 取消一個 block 是非同步完成的，並用一個 shared memory barrier 來同步，programming pattern 與 04-CUDA-Features/13-Async-Copies-LDGSTS 等非同步資料複製類似。此 API 由 libcu++ 提供：

一個 request 指令：把編碼後的取消結果寫入一個 __shared__ 變數。
一組 decoding 指令：從結果中萃取 success/failure 狀態，以及被取消 block 的 index。

Async proxy

Cluster launch control 操作被建模為 async proxy 操作（參見 Async Thread 與 Async Proxy）。因此 generic proxy 與 async proxy 之間需要 proxy fence 來確保 shared memory 操作的可見性。

Thread Block Cancellation 五步驟

建議從單一 thread使用 CLC，即一次只發一個 request。取消流程分為 Setup Phase（步驟 1-2，宣告並初始化取消結果與同步變數）與 Work-Stealing Loop（步驟 3-5，反覆 request、同步、處理結果）。

// 1. 宣告 block 取消所需變數：
__shared__ uint4 result;     // request 結果
__shared__ uint64_t bar;     // 同步用 barrier
int phase = 0;               // barrier phase

// 2. 以單一 arrival count 初始化 shared memory barrier：
if thread_rank() == 0
    ptx::mbarrier_init(&bar, 1);
__syncthreads();

// 3. 由單一 thread 提交非同步取消 request，並設定 transaction count：
if thread_rank() == 0 {
    cg::invoke_onecoalesced_threads(), [&](
        { ptx::clusterlaunchcontrol_try_cancel(&result, &bar); });
    ptx::mbarrier_arrive_expect_tx(ptx::sem_relaxed, ptx::scope_cta,
        ptx::space_shared, &bar, sizeof(uint4));
}

// 4. 同步（完成）非同步取消 request：
while mbarrier_try_wait_parity(&bar, phase) {}
phase ^= 1;

// 5. 取出取消狀態與被取消 block 的 index：
bool success = ptx::clusterlaunchcontrol_query_cancel_is_canceled(result);
if (success) {
    int bx = ptx::clusterlaunchcontrol_query_cancel_get_first_ctaid_x(result);
    // 1D/2D block 不需取齊 x/y/z 三個座標
}
// 6. 確保 async 與 generic proxy 間 shared memory 操作的可見性，
//    並防止 work-stealing loop 迭代之間的 data race。

重點：transaction count 設為 sizeof(uint4)，因為 request 結果寫入的是一個 uint4；mbarrier 以 transaction-based 方式追蹤非同步寫入完成，與 04-CUDA-Features/14-Async-Copies-TMA 的 expect_tx 模式一致。

用 invoke_one 包住 try_cancel

因為 thread block cancellation 是 uniform 指令，建議放進 cg::invoke_one thread selector 內，讓編譯器能把 peeling loop 最佳化掉。

Constraints on Thread Block Cancellation

這些約束都與「失敗的取消 request」有關：

觀察到失敗後再發 request 是 undefined behavior（UB）。關鍵在於是否存在「觀察（observation，即 query 結果）」介於兩個 request 之間。
取出失敗 request 的 block index 是 UB（只有 success 時 index 才有效）。
從多個 thread 提交取消 request 不建議，會取消多個 block，需要小心處理：
- 每個提交的 thread 必須提供唯一的 __shared__ result 指標以避免 data race。
- 若共用同一個 barrier，arrival count 與 transaction count 必須對應調整。

// 無效（UB）：在兩個 request 之間「觀察」到第一個失敗
ptx::clusterlaunchcontrol_try_cancel(&result0, &bar0);
/* synchronize bar0 */
bool success0 = ptx::clusterlaunchcontrol_query_cancel_is_canceled(result0);
assert(!success0);                  // 觀察到失敗 → 下一個 request 無效
ptx::clusterlaunchcontrol_try_cancel(&result1, &bar1);  // UB

// 有效：兩個 request 之間沒有觀察
ptx::clusterlaunchcontrol_try_cancel(&result0, &bar0);
ptx::clusterlaunchcontrol_try_cancel(&result1, &bar1);  // 合法
/* synchronize bar0 */
bool ok0 = ptx::clusterlaunchcontrol_query_cancel_is_canceled(result0);
assert(!ok0);                       // 此時觀察 → 第二個 request 仍合法

UB 的判定關鍵是「觀察順序」

第一例的 try_cancel(&result1, ...) 出現在「已 query 並斷定第一個失敗」之後，故為 UB；第二例兩個 request 連續發出、之後才 query，因此合法。差別不在於是否失敗，而在於 request 之間有沒有插入 query 結果的觀察。

Example：Vector-Scalar Multiplication（Use-case Thread Blocks）

以 v := αv 示範三種策略。前兩者作為對照，第三者展示 CLC work stealing。

// Fixed Work per Thread Block：每 thread 一個元素
__global__ void kernel_fixed_work(float* data, int n) {
    float alpha = compute_scalar();             // prologue
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= alpha;
}
// Launch: kernel_fixed_work<<<(n + 1023) / 1024, 1024>>>(data, n);

// Fixed Number of Thread Blocks：grid-stride loop，block 數 = SM_COUNT
__global__ void kernel_fixed_blocks(float* data, int n) {
    float alpha = compute_scalar();             // prologue 只算一次/ block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    while (i < n) { data[i] *= alpha; i += gridDim.x * blockDim.x; }
}
// Launch: kernel_fixed_blocks<<<SM_COUNT, 1024>>>(data, n);

CLC 版本則在 work-stealing loop 中反覆嘗試取消其他 block，並用被竊取的 bx 繼續計算：

__global__ void kernel_cluster_launch_control(float* data, int n) {
    __shared__ uint4 result; __shared__ uint64_t bar; int phase = 0;
    if thread_rank() == 0) ptx::mbarrier_init(&bar, 1;
    float alpha = compute_scalar();             // prologue
    int bx = blockIdx.x;                        // 假設 1D x 軸 block
    while (true) {
        __syncthreads();                        // 保護 result 不被下一輪覆寫
        if thread_rank() == 0 {
            // acquire：在 async proxy 取得 result 的寫入
            ptx::fence_proxy_async_generic_sync_restrict(
                ptx::sem_acquire, ptx::space_cluster, ptx::scope_cluster);
            cg::invoke_onecoalesced_threads(), [&](
                { ptx::clusterlaunchcontrol_try_cancel(&result, &bar); });
            ptx::mbarrier_arrive_expect_tx(ptx::sem_relaxed, ptx::scope_cta,
                ptx::space_shared, &bar, sizeof(uint4));
        }
        int i = bx * blockDim.x + threadIdx.x;  // 用目前 bx 計算
        if (i < n) data[i] *= alpha;
        while (!ptx::mbarrier_try_wait_parity(
            ptx::sem_acquire, ptx::scope_cta, &bar, phase)) {}
        phase ^= 1;
        bool success = ptx::clusterlaunchcontrol_query_cancel_is_canceled(result);
        if (!success) break;                    // 取消失敗 → 離開（無更多工作）
        bx = ptx::clusterlaunchcontrol_query_cancel_get_first_ctaid_x<int>(result);
        // release：把 result 的讀取釋出到 async proxy
        ptx::fence_proxy_async_generic_sync_restrict(
            ptx::sem_release, ptx::space_shared, ptx::scope_cluster);
    }
}
// Launch: kernel_cluster_launch_control<<<(n + 1023) / 1024, 1024>>>(data, n);

重點：launch 配置與 Fixed Work 相同（block 數隨問題大小），但每個 block 不會做完一份就退出，而是透過 try_cancel 竊取尚未啟動 block 的 index 繼續做，等於把 prologue（compute_scalar）攤平到較少的實際執行 block 上。fence_proxy_async_generic_sync_restrict 的 acquire/release 配對負責跨 proxy 的順序與可見性。

Example：Use-case Thread Block Clusters

當使用 thread block cluster 時，取消步驟與非 cluster 情況相同，但有幾項調整：

取消由單一 cluster thread提交（用 cg::cluster_group::thread_rank() == 0）。
每個 cluster 內所有 block 的 shared result 會收到相同的編碼值（multicast 廣播）；該值對應 cluster 內 local block index {0,0,0}，因此各 block 需自行加上 local block index。
同步由每個 cluster 的 block 用各自的 local __shared__ barrier 進行；barrier 操作必須使用 ptx::scope_cluster scope。
Cluster 情況的取消要求所有 block 都已存在；可用 cg::cluster_group::sync() 保證所有 block 都在執行。

__global__ __cluster_dims__(2, 1, 1)
void kernel_cluster_launch_control(float* data, int n) {
    __shared__ uint4 result; __shared__ uint64_t bar; int phase = 0;
    if thread_rank() == 0 {
        ptx::mbarrier_init(&bar, 1);
        ptx::fence_mbarrier_initsem_release, ptx::scope_cluster; // CGA fence
    }
    float alpha = compute_scalar();
    int bx = blockIdx.x;
    while (true) {
        cg::cluster_group::sync();              // 保證所有 block 都已啟動
        if thread_rank() == 0 {
            ptx::fence_proxy_async_generic_sync_restrict(
                ptx::sem_acquire, ptx::space_cluster, ptx::scope_cluster);
            cg::invoke_onecoalesced_threads(), [&](
                { ptx::clusterlaunchcontrol_try_cancel_multicast(&result, &bar); });
        }
        if thread_rank() == 0 // 完成由每個 block 各自追蹤
            ptx::mbarrier_arrive_expect_tx(ptx::sem_relaxed, ptx::scope_cluster,
                ptx::space_shared, &bar, sizeof(uint4));
        int i = bx * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= alpha;
        while (!ptx::mbarrier_try_wait_parity(
            ptx::sem_acquire, ptx::scope_cluster, &bar, phase)) {}
        phase ^= 1;
        bool success = ptx::clusterlaunchcontrol_query_cancel_is_canceled(result);
        if (!success) break;
        bx = ptx::clusterlaunchcontrol_query_cancel_get_first_ctaid_x<int>(result);
        bx += cg::cluster_group::block_index().x; // 加上 local offset
        ptx::fence_proxy_async_generic_sync_restrict(
            ptx::sem_release, ptx::space_shared, ptx::scope_cluster);
    }
}

重點差異：取消指令改用 clusterlaunchcontrol_try_cancel_multicast；barrier 與 fence 全部提升到 scope_cluster；用 cg::cluster_group::sync() 取代 __syncthreads()；解碼出的 bx 需再加 cg::cluster_group::block_index().x 才是該 block 真正要處理的 index。

Cluster 取消的前提條件

在 cluster 情況下取消，要求整個 cluster 的所有 block 都存在並執行中。務必先用 cg::cluster_group::sync() 同步，否則取消行為不正確。同樣地，從 cluster 內多個 thread 提交取消會嘗試取消多個 cluster，不建議。

考試/測驗重點

主題	你必須記得的點
引入版本	Cluster Launch Control 是 Blackwell（compute capability 10.0）新功能
解決什麼	結合 Fixed Work（load balancing/preemption）與 Fixed Number（reduced overheads）的優點
Work stealing 定義	閒置處理器主動從忙碌處理器佇列偷任務，而非等待派工
核心動作	取消「尚未開始執行」的 block，成功則用其 index 做事
取消失敗原因	沒有剩餘 block index，或有更高優先權 kernel 被排程
同步機制	非同步 try_cancel + shared memory mbarrier；pattern 類似 async copy
transaction count	`expect_tx` 用 `sizeof(uint4)`，因結果寫入一個 uint4
為何用 invoke_one	try_cancel 是 uniform 指令，包進 invoke_one 讓編譯器最佳化掉 peeling loop
UB 規則 1	觀察到失敗後再發 request 是 UB（關鍵是 request 間有無 query 觀察）
UB 規則 2	取用失敗 request 的 block index 是 UB
多 thread 提交	會取消多個 block；需唯一 result 指標、調整 arrival/transaction count
async proxy	CLC 是 async proxy 操作，需 fence_proxy_async_generic_sync_restrict 的 acquire/release
Cluster：誰提交	單一 cluster thread，用 try_cancel_multicast
Cluster：結果	multicast 給所有 block，值對應 local index {0,0,0}，各 block 需加 block_index().x
Cluster：scope	barrier/fence 用 scope_cluster；同步用 cg::cluster_group::sync()
Cluster：前提	取消前所有 block 必須存在（cluster_group::sync 保證）

重點總覽

背景：三種 grid 配置策略

Cluster Launch Control 機制

API Details

Thread Block Cancellation 五步驟

Constraints on Thread Block Cancellation

Example：Vector-Scalar Multiplication（Use-case Thread Blocks）

Example：Use-case Thread Block Clusters

考試/測驗重點

Related Notes