非同步資料複製與 L1/Shared 配置 (Async Data Copies and L1/Shared Config)

#cuda #async-execution #async-copy #shared-memory #concept

重點總覽

項目	重點
非同步資料複製目的	把 global → shared memory 的搬移與計算重疊，threads 不需 idle 等待，提升頻寬與運算資源利用率
與前章非同步的差別	前章是 kernel/`cudaMemcpyAsync` 間的非同步；本節是單一 kernel 內部的 global ↔ on-SM memory 搬移非同步
copy-and-compute pattern	fetch global → store shared → compute on shared；同步版需在 copy 後與 compute 後各做一次 `block.sync()`
同步複製的代價	`shared[i] = global[j]` 被編譯為 global → register → shared，需經過 register file 中轉
`memcpy_async` 語意	整個 thread group 協同搬移，「彷彿由另一條 thread 執行」，由 `wait` 同步；未完成前讀寫資料是 data race
API 層級	Cooperative Groups、libcu++ memcpy_async、低階 `cuda::ptx` 與 primitives
硬體機制	LDGSTS (8.0+, 小規模 global→shared)、TMA (9.0+, 大型多維 bulk)、STAS (9.0+, register→distributed shared)
Unified data cache	L1 與 shared memory 共用同一塊物理資源，可在 per-kernel 基礎上調整切分比例
設定 carveout	`cudaFuncSetAttribute` + `cudaFuncAttributePreferredSharedMemoryCarveout`，整數百分比或便利 enum
carveout 是 hint	driver 可改用不同配置；對照 `cudaFuncSetCacheConfig` 是硬性需求
>48 KB shared	須用 dynamic shared memory 並以 `cudaFuncAttributeMaxDynamicSharedMemorySize` 明確 opt-in

非同步資料複製 (Asynchronous Data Copies)

GPU 本就靠並行隱藏記憶體延遲：SM 在某個 warp 等待記憶體時切換去執行其他 warp。但即使如此，記憶體延遲仍可能成為頻寬利用率與運算資源效率的瓶頸。非同步資料複製把「發起搬移」與「等待完成」解耦 (decouple)，讓 threads 在資料於背景搬移時繼續做有用的工作。

Important

本節的非同步與前章不同。前章談的是 kernel 與 cudaMemcpyAsync 等不同元件之間的非同步；本節指的是單一 kernel launch 內部，GPU 的 DRAM (global memory) 與 on-SM memory (shared / tensor memory) 之間的搬移不阻塞 GPU threads。

copy-and-compute pattern 與同步版本

CUDA 常見的 copy-and-compute 模式：從 global memory 取資料 → 寫入 shared memory → 在 shared memory 上計算 (可能再寫回 global)。其 copy 階段 shared[local_idx] = global[global_idx] 會被編譯器展開為「先讀 global 進 register，再從 register 寫入 shared」。在迭代演算法中，每次都需要兩次同步。

shared[local_idx] = global_in[global_idx];
block.sync();                          // 等所有 copy 完成，compute 才能開始
compute(global_out + block_batch_idx, shared);
block.sync();                          // 等 compute 完成，才能覆寫 shared

要點：同步路徑經過 register file 中轉，且 thread block 必須阻塞等待整批 copy 完成後才能進入 compute。

非同步版本：`memcpy_async` + `wait`

#include <cooperative_groups/memcpy_async.h>
// 整個 thread group 協同把整批 batch 複製到 shared memory
cooperative_groups::memcpy_async(block, shared, global_in + block_batch_idx, block.size());
// 等待期間可對其他資料計算
cooperative_groups::wait(block);       // 等所有 copy 完成
compute(global_out + block_batch_idx, shared);

cooperative_groups::memcpy_async 把 block.size() 個元素從 global 複製到 shared。此操作彷彿由另一條 thread 執行，並在完成後與目前 thread 對 cooperative_groups::wait 的呼叫同步。

Warning

在 copy 完成前，修改 global 資料、或讀寫 shared 資料都會造成 data race。必須在 wait 之後才能安全使用 shared memory。

所有非同步複製的核心概念都一致：將搬移的發起與完成解耦。CUDA 提供多層 API 存取此能力：Cooperative Groups 與 libcu++ 的 memcpy_async、以及較低階的 cuda::ptx 與 primitives API。它們語意相近——彷彿由另一條 thread 執行複製，完成時可用不同的 completion 機制同步。

同步 copy：  [── load global ──][── store shared ──][block.sync 等待][── compute ──]
                         (threads 在搬移期間 idle)

非同步 copy：[memcpy_async 發起]┄┄┄(背景搬移)┄┄┄[wait]
             └──────── 同時可對其他資料 compute ────────┘   (隱藏延遲)

硬體機制與資料路徑

現代 GPU 架構提供多種非同步搬移硬體機制，支援不同資料路徑、傳輸大小與對齊需求：

機制	Compute Capability	資料路徑 (Source → Destination)	定位
LDGSTS	8.0+	global → `shared::cta`	小規模 global→shared 非同步傳輸
TMA (Tensor Memory Accelerator)	9.0+	global ↔ shared (cta / cluster)、global→global 等 bulk 路徑	針對大型多維資料最佳化的 bulk-async 複製
STAS	9.0+	registers → `shared::cluster`	cluster 內 register 到 distributed shared memory 的小規模非同步傳輸

Tip

入門最常用的是 LDGSTS (8.0+) 驅動的 cooperative_groups::memcpy_async，它讓 global→shared 直接搬移、繞過 register file 中轉。TMA 與 STAS 為 9.0+ 的進階 bulk / cluster 場景。Table 5 中空白的 source-destination 格代表該配對不支援。

配置 L1/Shared Memory 平衡 (Configuring L1/Shared Memory Balance)

同一個 SM 上的 L1 與 shared memory 使用相同的物理資源，稱為 unified data cache。在多數架構上，若 kernel 幾乎不用 shared memory，unified data cache 可被配置為提供該架構允許的最大 L1 cache。

            ┌──────────────── Unified Data Cache (固定總量) ────────────────┐
MaxL1   →   │ L1 cache                                      │ shared (最小) │
Default →   │ L1 cache              │ shared memory                         │
MaxShared → │ L1 (最小) │ shared memory                                     │
            └────────────────────────────────────────────────────────────┘
                        carveout 調整這條分界線 (per-kernel)

設定 carveout

reserved 給 shared memory 的比例可在 per-kernel 基礎上配置。在 kernel launch 前用 cudaFuncSetAttribute 設定 carveout (偏好的 shared memory 容量)：

cudaFuncSetAttribute(kernel_name,
                     cudaFuncAttributePreferredSharedMemoryCarveout, carveout);

carveout 可設為「該架構最大支援 shared memory 容量的整數百分比」，也可用三個便利 enum：

cudaSharedmemCarveoutDefault
cudaSharedmemCarveoutMaxL1
cudaSharedmemCarveoutMaxShared

最大 shared memory 與支援的 carveout 大小因架構而異。

Warning

當所選整數百分比無法剛好對應到支援的 shared memory 容量時，會採用下一個更大的容量。例如 compute capability 12.0 最大 shared 為 100 KB、支援大小為 {0, 8, 16, 32, 64, 100} KB；設定 carveout = 50% 會得到 64 KB 而非 50 KB。

要點：

傳入 cudaFuncSetAttribute 的函式必須以 __global__ 宣告。
carveout 被 driver 解讀為 hint；若執行 kernel 需要，driver 可選擇不同的 carveout 大小。

`cudaFuncSetAttribute` vs `cudaFuncSetCacheConfig`

項目	`cudaFuncSetAttribute` (PreferredSharedMemoryCarveout)	`cudaFuncSetCacheConfig`
性質	hint，driver 可改用其他配置	hard requirement (硬性需求)
交錯不同 shared 設定的 kernel	driver 可避免重配置 / thrashing	會在 shared memory 重配置後不必要地序列化 launch
建議	preferred	不建議用於需頻繁切換的情境

Tip

偏好 cudaFuncSetAttribute，因為 driver 可在「需要執行該函式」或「避免 thrashing」時自行選擇不同配置；而 cudaFuncSetCacheConfig 的硬性需求會讓不同 shared 配置的 kernel 互相序列化。

超過 48 KB shared memory 的 opt-in

依賴每 block 超過 48 KB shared memory 的 kernel 是架構特定的，必須使用 dynamic shared memory (而非靜態大小陣列)，並以 cudaFuncSetAttribute 明確 opt-in：

// Device code
__global__ void MyKernel(...) {
    extern __shared__ float buffer[];
    // ...
}

// Host code
int maxbytes = 98304; // 96 KB
cudaFuncSetAttribute(MyKernel, cudaFuncAttributeMaxDynamicSharedMemorySize, maxbytes);
MyKernel<<<gridDim, blockDim, maxbytes>>>(...);

要點：靜態 __shared__ 陣列無法超過 48 KB；超過需 extern __shared__ + cudaFuncAttributeMaxDynamicSharedMemorySize，並在 launch 的 <<<...>>> 第三參數傳入動態 shared 大小。

考試/測驗重點

情境/關鍵字	答案
本節非同步 vs 前章非同步	前章：kernel 與 `cudaMemcpyAsync` 等元件間；本節：單一 kernel 內 global↔on-SM memory
同步 `shared[i]=global[j]` 的隱含路徑	global → register → shared (經 register file 中轉)
`memcpy_async` 之後忘了 `wait` 就讀 shared	data race
`memcpy_async` 的執行語意	彷彿由另一條 thread 執行，`wait` 與其同步
LDGSTS / 起始 compute capability / 路徑	8.0+ / global → shared::cta / 小規模
TMA / 起始 cc / 用途	9.0+ / 大型多維 bulk-async copy
STAS / 起始 cc / 路徑	9.0+ / registers → distributed shared (cluster 內)
L1 與 shared 的關係	共用同一物理資源 unified data cache
設定偏好 shared 容量的 API	`cudaFuncSetAttribute` + `cudaFuncAttributePreferredSharedMemoryCarveout`
carveout 三個 enum	Default / MaxL1 / MaxShared
carveout 50% 對 100KB (cc 12.0)	64 KB (取下一個更大支援容量，非 50KB)
carveout 是 hint 還是硬性	hint；driver 可改
`cudaFuncSetCacheConfig` 缺點	硬性需求，交錯不同 shared 設定會序列化 launch
傳入 `cudaFuncSetAttribute` 的函式需何宣告	`__global__`
超過 48KB shared 的條件	dynamic shared + `cudaFuncAttributeMaxDynamicSharedMemorySize` opt-in，且架構特定

重點總覽

非同步資料複製 (Asynchronous Data Copies)

copy-and-compute pattern 與同步版本

非同步版本：memcpy_async + wait

硬體機制與資料路徑

配置 L1/Shared Memory 平衡 (Configuring L1/Shared Memory Balance)

設定 carveout

cudaFuncSetAttribute vs cudaFuncSetCacheConfig

超過 48 KB shared memory 的 opt-in

考試/測驗重點

Related Notes

非同步版本：`memcpy_async` + `wait`

`cudaFuncSetAttribute` vs `cudaFuncSetCacheConfig`