非同步資料複製：LDGSTS (Async Data Copies with LDGSTS)

#cuda #async-execution #async-copy #concept

重點總覽

項目	重點
LDGSTS 是什麼	CC 8.0+ 的非同步資料搬移指令，從 global memory 直接複製到 shared memory，繞過 register
適用場景	較小、element-wise、或不規則存取模式的資料搬移（大批量/多維請改用 TMA）
複製大小	4、8 或 16 bytes；4/8 走 L1 ACCESS（資料同時快取於 L1），16 走 L1 BYPASS（不污染 L1）
方向限制	只支援 global → shared，且來源/目的需依大小對齊 4/8/16 bytes
最佳對齊	shared 與 global 都對齊 128 bytes 時效能最佳
非同步性	模型為 async thread operation；發起 thread 可繼續計算，硬體在背景搬資料
完成訊號	透過 shared memory barrier 或 pipeline；預設每個 thread 只等自己的複製
跨 thread 共享	prefetch 給別的 thread 用時，等待完成後仍需 `__syncthreads()`
三種 API	`cuda::memcpy_async`（C++）、`cooperative_groups::memcpy_async`、CUDA C primitives（`__pipeline_*`）
三大範例	條件式批次載入、資料 prefetch（multi-stage pipeline）、warp specialization 的 producer-consumer

LDGSTS 基本概念

許多 CUDA 應用需要在 global 與 shared memory 之間頻繁搬移資料，常見的是複製較小的資料元素或執行不規則存取模式。LDGSTS（CC 8.0+，詳見 PTX 文件）的主要目標是：為較小、element-wise 的傳輸提供高效的非同步 global→shared 搬移機制，並透過重疊執行（overlapped execution）讓計算資源獲得更好的利用率。

維度（Dimensions）：支援複製 4、8 或 16 bytes。複製 4 或 8 bytes 一律走 L1 ACCESS 模式（資料同時被快取在 L1）；複製 16 bytes 則啟用 L1 BYPASS 模式（不污染 L1）。
來源與目的（Source and destination）：唯一支援的方向是 global → shared。指標需依複製大小對齊 4/8/16 bytes；當 shared 與 global memory 皆對齊 128 bytes 時效能最佳。
非同步性（Asynchronicity）：傳輸被建模為 async thread operation（見 Async Thread 與 Async Proxy），讓發起 thread 可繼續計算，硬體在背景非同步搬資料。實務上是否真的非同步取決於硬體實作，未來可能改變。
完成訊號（Completion signal）：LDGSTS 必須在操作完成時提供訊號，可用 shared memory barrier 或 pipeline 作為完成機制。

預設只等自己的複製

預設情況下，每個 thread 只會等待它自己發出的 LDGSTS 複製。因此若你用 LDGSTS 來 prefetch 將與其他 thread 共享的資料，在與 LDGSTS 完成機制同步之後，還必須額外加一個 __syncthreads() 才能保證所有 thread 都看到完整資料。

Table 18：支援的來源/目的與完成機制

Direction	Source	Destination	完成機制 / API
Async Copy (LDGSTS, CC 8.0+)	global	shared::cta	shared memory barrier、cuda::memcpy_async、pipeline、cooperative_groups::memcpy_async、__pipeline_memcpy_async

空格表示不支援

原文表格中空白儲存格代表該來源-目的配對不被支援。LDGSTS 不支援 global→global、shared→shared、shared→global 等方向；那些情境屬於 TMA / STAS 的範疇。

範例一：條件式程式碼中的批次載入 (Batching Loads in Conditional Code)

stencil 範例中，thread block 的第一個 warp 負責協同載入中心（center）與左右 halo 的所有資料。若用同步複製，因為程式碼的條件性質，編譯器可能產生一連串「LDG（load-from-global）→ STS（store-to-shared）」交錯的指令，而不是最佳的「3 個 LDG 後接 3 個 STS」——後者才能有效隱藏 global memory 延遲。

// 同步版本：條件分支可能讓編譯器交錯 LDG/STS，無法讓所有 load 同時 in-flight
if (tid < 8)            buffer[tid]      = left[tid];   // Left halo
else if (tid >= 32 - 8) buffer[tid + 16] = right[tid];  // Right halo
if (tid < 32)           buffer[tid + 8]  = center[tid]; // Center
__syncthreads();

改用非同步複製，把資料直接從 global 載入 shared，可同時達到兩個好處：降低 register 用量（資料直接落 shared，不經 register），並確保所有 global load 都 in-flight。

API 寫法對照

// (1) cuda::memcpy_async + cuda::barrier
using barrier_t = cuda::barrier<cuda::thread_scope_block>;
__shared__ barrier_t barrier;            // block-wide barrier
// Version 1：個別 thread 各自發複製
cuda::memcpy_async(buffer + tid, left + tid,
                   cuda::aligned_size_t<4>(sizeof(float)), barrier);
// Version 2：整個 block 協同發複製（API 自行決定底層如何切分）
cuda::memcpy_async(block, buffer, left,
                   cuda::aligned_size_t<4>(8 * sizeof(float)), barrier);
barrier.arrive_and_wait();               // 合併 arrive 與 wait
__syncthreads();

cuda::barrier 的 cuda::memcpy_async overload 讓非同步傳輸能用 asynchronous barrier 同步：它「彷彿由另一個綁定到 barrier 的 thread 執行複製」——建立時把當前 phase 的 expected count 加一，複製完成時減一，因此 barrier 的 phase 唯有在所有參與 thread 都 arrive 且所有綁到此 phase 的 memcpy_async 都完成後才前進。範例用 block-wide barrier 並以 arrive_and_wait 把 arrive 與 wait 合併（phase 之間沒有額外工作）。

// (2) cooperative_groups::memcpy_async：協同搬移，但用 cg::wait 同步
cg::memcpy_async(block, buffer,      left,   8  * sizeof(float));
cg::memcpy_async(block, buffer + 8,  center, 32 * sizeof(float));
cg::memcpy_async(block, buffer + 40, right,  8  * sizeof(float));
cg::wait(block);                         // 等待所有複製完成
__syncthreads();

// (3) CUDA C primitives：最底層、最囉嗦、但保證用 LDGSTS
__pipeline_memcpy_async(buffer + tid, left + tid, sizeof(float));
__pipeline_commit();                     // 提交這一批複製
__pipeline_wait_prior(0);                // 等待 pipeline 中全部完成
__syncthreads();

cuda::aligned_size_t<4>() 同時告訴編譯器「資料對齊 4 bytes」且「複製大小是 4 的倍數」，以便啟用 LDGSTS。與 cuda::barrier 互通時，要用 cuda/barrier header 的 cuda::memcpy_async。
CUDA C primitives 提供最直接的控制（代價是程式碼較冗長），並保證底層使用 LDGSTS——這在高階 API 中並無保證。

cooperative_groups::memcpy_async 在此例較沒效率

cooperative_groups::memcpy_async 會在每個複製啟動時立即 commit，因而無法享有「先批次多個複製、再單次 commit」的最佳化（其他 API 可以做到）。

範例二：資料 Prefetch (Prefetching Data)

在「複製→計算」的迭代模式中，用非同步複製把未來迭代的資料搬移，與當前迭代的計算重疊，可隱藏傳輸延遲、提高 bytes-in-flight。核心手法是 multi-stage pipeline（見 04-CUDA-Features/12-Pipelines-Deep-Dive）。

// cuda::memcpy_async + thread-local pipeline（num_stages 個 stage）
cuda::pipeline<cuda::thread_scope_thread> pipeline = cuda::make_pipeline();

// 1) 先用前 num_stages 批把 pipeline 填滿
for (int s = 0; s < num_stages; ++s) {
  pipeline.producer_acquire();
  cuda::memcpy_async(shared + shared_offset[s] + tid,
                     global_in + block_batch(s) + tid,
                     cuda::aligned_size_t<4>(sizeof(int)), pipeline);
  pipeline.producer_commit();
}
// 2) 主迴圈：等最早的 stage 完成 → 計算 → 釋放 → 預取未來 stage
for (size_t cb = 0, fb = num_stages; cb < batch_size; ++cb, ++fb) {
  cuda::pipeline_consumer_wait_prior<num_stages - 1>(pipeline);
  __syncthreads();                       // 若每個 thread 只用自己複製的資料則非必要
  compute(/* current batch */);
  pipeline.consumer_release();
  pipeline.producer_acquire();           // 載入領先 num_stages 的未來 stage
}

執行流程（三種 API 概念一致）：

fill: [stage0][stage1]...[stage_{n-1}]   ← 先排 num_stages 個 memcpy_async
loop iteration:
   wait_prior(n-1) ─ 等最舊 stage 完成
        │
        ▼
   compute(current) ◀── 同時 ──▶ memcpy_async(future) 在背景搬移
        │
        ▼
   release / commit ─ 推進 pipeline，回收 buffer

cuda::memcpy_async 版：用 thread-local cuda::pipeline，先排 num_stages 個複製暖機，再於迴圈中「等當前 batch → 計算 → 排下一個複製」。
cooperative_groups::memcpy_async 版：不使用 pipeline 物件，改由 cg API 在底層自行分階段排程。
CUDA C primitives 版：用 __pipeline_memcpy_async / __pipeline_commit / __pipeline_wait_prior，作法與第一種非常類似。

永遠保持 num_stages 批在 pipeline 中

高效碼產生的關鍵是：即使已無更多 batch 可取，仍要持續 commit（pipeline.producer_commit() 或 __pipeline_commit()），讓 pipeline 內維持 num_stages 批。但 cooperative groups API 無法做到這點，因為我們無法存取其內部 pipeline。

範例三：Warp Specialization 的 Producer-Consumer

把單一 warp 專門化（specialize）為 producer，負責 global→shared 的非同步複製；其餘 warp 為 consumer，從 shared 讀資料並計算。為了讓 producer 與 consumer 並行，shared memory 採 double-buffering（雙緩衝）：consumer 在處理其中一個 buffer 時，producer 非同步把下一批資料抓進另一個 buffer。

shared buffer：[ buffer_0 ][ buffer_1 ]   （大小 = 2 * buffer_len）

  Producer warp        Consumer warps
  ───────────          ─────────────
  fill  buffer_0  ──▶  consume buffer_0
  fill  buffer_1  ──▶  consume buffer_1   （交替乒乓）
   ▲                         │
   └──────── ready ◀─────────┘

// cuda::memcpy_async：partitioned pipeline，第一個 warp 當 producer
constexpr int num_stages = 2;
cuda::std::size_t producer_count = warpSize;   // 32：producer 數量
__shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, num_stages> shared_state;
pipeline pipe = cuda::make_pipeline(block, &shared_state, producer_count);

if (block.thread_rank() < producer_count)      // producer 先把兩個 stage 都填滿
  for (int s = 0; s < num_stages; ++s)
    produce(pipe, num_stages, s, num_batches, s, buffer, buffer_len, in, N);

produce 內 pipe.producer_acquire() → cuda::memcpy_async(...) → pipe.producer_commit()；consume 內 pipe.consumer_wait() → 處理 → pipe.consumer_release()。這是抽象層最高的寫法：producer 先填滿兩個 stage，主迴圈中 consumer 處理當前 batch 的同時，producer 抓未來 batch，維持穩定的工作流。

// CUDA C primitives：__pipeline_memcpy_async + shared memory barrier
__mbarrier_token_t token = __mbarrier_arrive(&ready[i % 2]);   // 等 buffer 可填
while (!__mbarrier_try_wait(&ready[i % 2], token, 1000)) {}
__pipeline_memcpy_async(buffer + i * buffer_len + threadIdx.x, /*...*/);
__pipeline_arrive_on(filled[i % 2]);          // 把這個複製關聯到 barrier
__mbarrier_arrive(filled[i % 2]);             // buffer 已填好

範例用 4 個 __mbarrier_t：bar[0]/bar[1] 追蹤 buffer_0/buffer_1 是否可填，bar[2]/bar[3] 追蹤是否已填好。
__pipeline_arrive_on() 把 memory copy 關聯到 barrier：它先把 barrier 的 arrival count 加一，當該複製之前序列的所有非同步操作完成時，arrival count 自動減一——淨效果為零。正因如此，還需要用 __mbarrier_arrive() 顯式抵達 barrier，才能真正推進 phase。

三種 API 的取捨

高階的 cuda::memcpy_async + cuda::pipeline 抽象最高、可做 partitioned pipeline；cooperative_groups::memcpy_async 簡潔但每次自動 commit、彈性最低；CUDA C primitives 最囉嗦但控制力最強，且保證用 LDGSTS。

考試/測驗重點

主題	必記重點
計算能力	LDGSTS 需 CC 8.0+
方向	只支援 global → shared（單向）；其他方向見 TMA / STAS
大小與模式	4/8 bytes = L1 ACCESS（快取於 L1）；16 bytes = L1 BYPASS（不污染 L1）
對齊	依大小對齊 4/8/16 bytes；最佳對齊為 128 bytes
register	直接 global→shared，繞過 register，降低 register 用量
預設等待語意	每 thread 只等自己的複製；共享給他人時需額外 __syncthreads()
完成機制	shared memory barrier 或 pipeline
對齊提示	cuda::aligned_size_t<4>() 同時告知對齊與大小為倍數，才會用 LDGSTS
保證 LDGSTS	只有 CUDA C primitives（_pipeline*）保證底層用 LDGSTS，高階 API 不保證
cg::memcpy_async 缺點	每次複製立即 commit，無法批次後單次 commit，較沒效率
prefetch 關鍵	即使無更多 batch 仍持續 commit，維持 num_stages 批在 pipeline；cg API 做不到
warp specialization	第一個 warp 當 producer，其餘當 consumer，搭配 double-buffering
__pipeline_arrive_on	對 barrier 淨效果為零，故仍需顯式 __mbarrier_arrive()
Table 18 空格	空白儲存格代表該 source-destination 配對不支援