非同步資料複製：TMA (Async Data Copies with TMA)

#cuda #async-execution #async-copy #tma #concept

重點總覽

項目	重點
TMA 是什麼	Tensor Memory Accelerator，CC 9.0 (Hopper)+ 的硬體引擎，把多維陣列 sub-tile 在 global ↔ shared 之間高效搬移，卸載 address 計算
兩種模式	bulk-asynchronous copy（1D 連續陣列，免 tensor map）/ bulk-tensor asynchronous copy（多維最多 5D，需 tensor map）
「bulk」一詞	用以與前一節 LDGSTS 的 `cp.async`（non-bulk）區別；硬體上是不同的 async-group
async proxy	TMA 屬 async proxy 操作，發起 thread 可繼續計算，硬體在背景搬移
完成機制（讀）	global → shared：用 shared memory barrier 的 transaction count，block 內任何 thread 可等待
完成機制（寫）	shared → global：用 bulk async-group，僅發起 thread 可等待（`commit_group` + `wait_group`）
tensor map	`CUtensorMap` 物件，描述多維陣列在 global/shared 的 layout；host 端用 `cuTensorMapEncodeTiled` 建立
傳遞 tensor map	首選 `const __grid_constant__` kernel 參數；亦可 `__constant__` 變數或 global memory（需 fence）
device 端編碼	用 template + `tensormap_replace*` 改欄位 + `tensormap_cp_fenceproxy` 寫回 global，適合一個 kernel 處理多種尺寸的 batch
release-acquire	global memory 的 tensor map 需在「修改方 release」與「使用方 acquire」間建立 proxy 順序
bank swizzling	tensor map 編 swizzle mode（none/32B/64B/128B），TMA 存入 shared 時重排佈局以消除 bank conflict
對齊（1D）	global/shared 位址 16B、barrier 8B、傳輸大小 16B 倍數
對齊（多維）	shared 位址 128B（swizzle 128B 時 1024B）、global stride 16B 倍數、傳輸 16B 倍數

Important

TMA 的核心價值是把「重複又易錯的 multi-dimensional address 計算」搬到硬體。一旦 tensor map 描述好陣列 layout，device 端只需給出 tile 的座標 {x, y, ...}，TMA engine 就會自動算出每個元素的 global 位址並搬移，連 out-of-bounds 補零都由硬體處理。

TMA 概觀：兩種模式與完成機制

TMA（在 PTX 中稱 bulk / bulk-tensor 非同步複製）的 source 與 destination 可在 shared 或 global，支援：global → shared、shared → global，以及 cluster 內 block 之間的 shared → distributed shared。在 cluster 中還可指定 multicast，把 global 的資料一次廣播到多個 block 的 shared memory（為 sm_90a 最佳化，其他 target 效能可能大幅下降）。

                       TMA 完成機制
  ┌─────────────────────────────────────────────────────────┐
  │ global → shared    : shared memory barrier (transaction)  │  讀：block 內任一 thread 可等
  │ global → s::cluster: shared memory barrier (multicast)    │
  │ shared → global    : bulk async-group (commit/wait)       │  寫：只有發起 thread 可等
  │ s::cta → s::cluster: shared memory barrier                │
  └─────────────────────────────────────────────────────────┘

讀（資料進 shared）：硬體完成讀取後更新 barrier 的 transaction count，barrier flip 後資料才安全可讀。
寫（資料出 shared）：用 thread-local 的 bulk async-group；多個寫可批次成一個 group，再等該 group「讀完 shared」或「寫完 global」。

Warning

bulk 與 non-bulk 的 async-group 彼此獨立：存在 cp.async.wait_group（LDGSTS）與 cp.async.bulk.wait_group（TMA）兩套不同指令，不可混用等待。

4.11.2.1 用 TMA 搬 1D 陣列

1D 連續陣列不需要 tensor map，只用指標與大小即可在 device 端發起。對應 API：cuda::memcpy_async、cuda::device::memcpy_async_tx、cuda::ptx::cp_async_bulk。

Tip

cuda::memcpy_async 只有在 source/destination 都 16B 對齊且大小是 16B 倍數時才用 TMA，否則 fallback 成同步複製；而 cuda::device::memcpy_async_tx 與 cuda::ptx::cp_async_bulk 永遠用 TMA，不滿足需求即為 undefined behavior。

1D read-modify-write 的七個步驟（global → shared → +1 → global）：

 1. init barrier(blockDim.x)                  ── 完成機制
 2. cp_async_bulk: global → shared            ── 發起讀（單一 thread）
 3. bar.arrive() ; bar.wait(token)            ── 等資料到齊
 4. smem_data[i] += 1                          ── 計算（所有 thread）
 5. fence_proxy_async(space_shared)+syncthreads── 讓 generic 寫對 async proxy 可見
 6. cp_async_bulk: shared → global            ── 發起寫（單一 thread）
 7. commit_group ; wait_group_read(0)          ── 等寫端讀完 shared

// 步驟 2/6 的三種寫法（擇一）
cuda::memcpy_async(smem_data, data + offset,
                   cuda::aligned_size_t<16>(sizeof(smem_data)), bar);   // 自動設 tx count
// 或 memcpy_async_tx / cp_async_bulk，需自行：
// cuda::ptx::mbarrier_expect_tx(...) 告知預期 transaction（以 byte 計）
ptx::cp_async_bulk(ptx::space_global, ptx::space_shared,
                   data + offset, smem_data, sizeof(smem_data));        // shared → global
ptx::cp_async_bulk_commit_group();
ptx::cp_async_bulk_wait_group_readn32_t<0>();

重點：memcpy_async 會自動更新 barrier 的預期 transaction count；用 memcpy_async_tx / cp_async_bulk 則必須手動 mbarrier_expect_tx。barrier 要等「所有 thread 已 arrive」且「所有 byte 已到」兩條件都成立才 flip。

Warning

應由單一 thread 發起 TMA。只寫 if (threadIdx.x == 0) 不夠：編譯器無法確認只有一條 thread，可能插入 peeling loop 造成 warp serialization。改用 is_elected()（內部用 cuda::ptx::elect_sync 從 warp 0 選一條 thread）或 cooperative_groups::invoke_one，讓編譯器生出更高效的程式碼。

步驟 5 的 fence_proxy_async 不可省：計算階段是透過 generic proxy 寫 shared，而 TMA 透過 async proxy 讀，必須用 proxy fence 把 generic 寫排在 async 讀之前，再以 __syncthreads() 讓所有 thread 的 fence 對發起 thread 0 生效。

1D 對齊需求：global 位址 16B、shared 位址 16B、barrier 位址 8B（cuda::barrier 保證）、傳輸大小須為 16B 倍數。

4.11.2.1.1 Prefetching（多階段預取）

在「複製-計算」迭代中，提前預取未來迭代的資料，可用當前迭代的計算掩蓋未來資料的搬移延遲，提高 bytes-in-flight。範例用 cuda::device::memcpy_async_tx 做 TMA 複製，並用 explicit phase tracking 的 shared memory barrier 同步：

初始化階段：每個 stage 一個 barrier（init(&bar[i], 1)），先載入前 num_stages 個 batch 到不同 shared 區段。
主迴圈：(a) mbarrier_try_wait_parity() 等當前 batch 到齊 → (b) 計算 → (c) 預取領先 num_stages 的下一筆 → (d) 用 rotating buffer 輪替 stage 並追蹤 barrier parity。

Tip

explicit phase tracking（parity 翻轉）讓「只有一條 thread arrive 設定 transaction count，其餘 thread 等 parity flip」成為可能，比 token-based bar.wait() 更省。詳見 04-CUDA-Features/11-Asynchronous-Barriers-Deep-Dive。

4.11.2.2 用 TMA 搬多維陣列（tensor map）

多維（最多 5D）與 1D 的關鍵差異：必須在 host 端建立 tensor map 並傳給 kernel。device code 用 cuda::ptx::cp_async_bulk_tensor 發起，完成機制與 1D 相同（讀用 barrier、寫用 bulk async-group）。

用 Driver API 建立 tensor map

cuTensorMapEncodeTiled 是 driver API，需直接連 -lcuda 或經 cudaGetDriverEntryPointByVersion 取得函式指標（見 04-CUDA-Features/25-Driver-Entry-Point-Access）：

PFN_cuTensorMapEncodeTiled_v12000 get_cuTensorMapEncodeTiled() {
  void* p = nullptr; cudaDriverEntryPointQueryResult st;
  cudaGetDriverEntryPointByVersion("cuTensorMapEncodeTiled", &p, 12000,
                                   cudaEnableDefault, &st);   // version 12000
  return reinterpret_cast<PFN_cuTensorMapEncodeTiled_v12000>(p);
}

建立需要：base 指標、各維 size（元素數）、stride（byte，且須 16B 倍數）、shared buffer 的 box_size（元素數）、elem_stride。注意最快變動維排第一。其他欄位包含資料型別、interleave、swizzle、L2 promotion（見 04-CUDA-Features/17-L2-Cache-Control）、OOB fill 等：

constexpr uint32_t rank = 2;
uint64_t size[rank]       = {GMEM_WIDTH, GMEM_HEIGHT};   // 最快維在前
uint64_t stride[rank-1]   = {GMEM_WIDTH * sizeof(int)};  // 須 16B 倍數
uint32_t box_size[rank]   = {SMEM_WIDTH, SMEM_HEIGHT};
uint32_t elem_stride[rank]= {1, 1};                      // 2 可只取複數的實部
cuTensorMapEncodeTiled(&tensor_map, CU_TENSOR_MAP_DATA_TYPE_INT32, rank,
    tensor_ptr, size, stride, box_size, elem_stride,
    CU_TENSOR_MAP_INTERLEAVE_NONE, CU_TENSOR_MAP_SWIZZLE_NONE,
    CU_TENSOR_MAP_L2_PROMOTION_NONE, CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);

Host-to-device：三種傳遞方式

方式	寫法	說明
`__grid_constant__` 參數（首選）	`kernel(const __grid_constant__ CUtensorMap m)`	直接當 const kernel 參數傳；GCC 可能警告 64-byte 對齊 ABI，可忽略
`__constant__` 變數	`cudaMemcpyToSymbol(global_tensor_map, &m, sizeof(CUtensorMap))`	複製到 device constant memory
global memory	`cudaMemcpy(&global_tensor_map, ...)` + 每 block fence	較慢；每個 block 在任一 thread 使用前需 `fence_proxy_tensormap_generic`（host 改的用 `.sys` scope）

使用 tensor map（2D tile 搬移）

device 端只給 tile 左上角座標 {x, y}，TMA 自動算位址：

__shared__ alignas(128) int smem_buffer[SMEM_HEIGHT][SMEM_WIDTH];  // 多維需 128B 對齊
int32_t coords[2] = { x, y };
ptx::cp_async_bulk_tensor(ptx::space_shared, ptx::space_global,
    &smem_buffer, &tensor_map, coords, cuda::device::barrier_native_handle(bar));
token = cuda::device::barrier_arrive_tx(bar, 1, sizeof(smem_buffer)); // 設 tx count

Tip

負索引與越界：讀（global → shared）時越界區域會 zero-fill，左上角座標可為負；寫（shared → global）時可部分越界，但左上角座標不可為負。

size / stride：size 是某維的元素數（全部需 ≥ 1）；stride 是同維元素間的 byte 數。例：4×3 row-major int 矩陣，因對齊需求 stride 仍是 {4, 16}——每列補 4 byte 使下一列起點對齊 16B。

多維對齊需求：global 位址 16B、global stride 16B 倍數、shared 位址 128B、barrier 8B、傳輸大小 16B 倍數；global size 不必是 16B 倍數但須 ≥ 1。

4.11.2.2.1 / 4.11.2.2.4 在 device 上編碼 tensor map（template 模式）

當「典型的 __grid_constant__ 傳遞」不理想時（例如一個 kernel launch 要處理一批大小各異的 tensor），可在 device 上編碼 tiled-type tensor map。推薦流程：

host:   make_tensormap_template()  ── 用 cuTensorMapEncodeTiled 建最小 template
          │  (傳給 device：參數 / 指標 / __constant__ 皆可)
device: encode_tensor_map<<<1,32>>>(template, params, global_tensor_map)
          │  copy template → shared，tensormap_replace* 改欄位，cp_fenceproxy 寫回 global
device: consume_tensor_map<<<1,1>>>(global_tensor_map)
          │  fence acquire → cp_async_bulk_tensor 使用

Template via Driver API（4.11.2.2.4）：用 cuTensorMapEncodeTiled 建立一個 rank=1、UINT8 的最小可用 tensor map 當 template，之後在 device 修改其欄位即可。

Tip

在 sm_90a 上，shared memory 中一塊 zero-initialized 的 buffer 也可當初始 tensor map 值，於是可完全在 device 端編碼，無需 driver API 建 template。

4.11.2.2.2 device 端編碼與修改

步驟：① 把 template 傳進 kernel（任意方式）→ ② copy-initialize 到 shared 的 CUtensorMap（128B 對齊）→ ③ 用 cuda::ptx::tensormap_replace* 改欄位 → ④ 用 tensormap_cp_fenceproxy 從 shared 寫回 global 並 fence。

__launch_bounds__(32) __global__
void encode_tensor_map(const __grid_constant__ CUtensorMap template_map,
                       tensormap_params p, CUtensorMap* out) {
  __shared__ alignas(128) CUtensorMap smem_tmap;
  if (threadIdx.x == 0) {
    smem_tmap = template_map;                                  // copy template
    ptx::tensormap_replace_global_addressspace_shared, &smem_tmap, p.global_address;
    ptx::tensormap_replace_rankspace_shared, &smem_tmap, p.rank - 1; // 零基！
    // tensormap_replace_box_dim / global_dim / global_stride / element_size ...
  }
  __syncwarp();                                                // warp 內同步修改
  ptx::n32_t<128> bytes_128;
  ptx::tensormap_cp_fenceproxysem_release, ptx::scope_gpu, out, &smem_tmap, bytes_128;
}

要點：.rank 欄位採零基編號，要填「想要的 rank 減一」。tensormap_cp_fenceproxy 由 warp 內 threads 集體把 tensor map 複製到 global，同時以 release 語意讓更新對其他 thread（之後的 cp.async.bulk.tensor）可見。

Warning

tensor map 格式未來可能變動，故 tensormap_replace（tensormap.replace.tile PTX）標記為 sm_90a 專屬，須以 nvcc -arch sm_90a 編譯。且 device 端修改僅支援 tiled-type tensor map，其他型別不可在 device 修改。

4.11.2.2.3 使用被修改過的 tensor map（release-acquire）

與 __grid_constant__ 參數不同，使用 global memory 中的 tensor map 必須在「修改方」與「使用方」之間明確建立 tensor map proxy 的 release-acquire 關係：

 修改 thread ── tensormap_cp_fenceproxy(sem_release, ...) ──► global tensor map
                                                                 │
 使用 thread ── fence_proxy_tensormap_generic(sem_acquire,...) ◄─┘
                  │  (同 device 用 .gpu scope；跨 device 用 .sys)
                  ▼  __syncthreads() 後同 block 其他 thread 才可用
              cp_async_bulk_tensor(...)

__global__ void consume_tensor_map(CUtensorMap* tensor_map) {
  ptx::n32_t<128> size_bytes;
  ptx::fence_proxy_tensormap_generic(ptx::sem_acquire, ptx::scope_sys,
                                     tensor_map, size_bytes);   // acquire
  // 之後安全使用 tensor_map 發 cp_async_bulk_tensor
}

Warning

執行 fence 的 thread 與使用 tensor map 的 thread 必須在同一 block。若位於同 cluster / 同 grid 的不同 block、或不同 kernel，cluster.syncsync(、stream-order 同步都不足以建立 tensor map 更新的順序——那些 block 仍需各自以正確 scope acquire。若中間沒有再修改，fence 不必在每次 cp.async.bulk.tensor 前重複。

4.11.2.2.5 Shared-Memory Bank Swizzling

shared memory 有 32 個 bank，連續的 32-bit word 映到連續 bank，每 bank 每 cycle 32-bit。若同一 transaction 內重複用到同一 bank 即發生 bank conflict，頻寬下降。TMA 預設「shared 佈局 = global 佈局」，但這對某些存取模式會造成衝突。tensor map 可編入 swizzle mode，讓 TMA 存入 shared 時 'swizzle'、寫回 global 時 'unswizzle'。

矩陣轉置範例：8×8 的 int4（每元素 16B）row-major 載入，每組 8 條 thread 讀一列、寫到轉置 buffer 的一行 → 八路 bank conflict。改用 CU_TENSOR_MAP_SWIZZLE_128B（128B 正好對應一列長度）後，列存取與行存取都不再撞同一組 bank，衝突消除。轉置時用 swizzle 過的索引：

__shared__ alignas(1024) int4 smem_buffer[8][8];     // 128B swizzle 須 1024B 對齊
__shared__ alignas(1024) int4 smem_buffer_tr[8][8];
// ... cp_async_bulk_tensor 載入（與無 swizzle 寫法相同）...
for (int sidx_j = threadIdx.x; sidx_j < 8; sidx_j += blockDim.x)
  for (int sidx_i = 0; sidx_i < 8; ++sidx_i) {
    const int swiz_j     = (sidx_i % 8) ^ sidx_j;     // 異或得 swizzle 索引
    const int swiz_i_tr  = (sidx_j % 8) ^ sidx_i;
    smem_buffer_tr[sidx_j][swiz_i_tr] = smem_buffer[sidx_i][swiz_j];
  }
// host 端 encode 時改 swizzle 欄位為 CU_TENSOR_MAP_SWIZZLE_128B

Warning

此範例僅示範 swizzle 用法，'as-is' 既不高效也無法擴展到更大尺寸。

swizzle 規則（CC 9）：

Pattern	swizzle 寬	shared box 內維	重複週期	shared 對齊	global 對齊
`SWIZZLE_128B`	128 bytes	≤ 128 bytes	1024 bytes	128 bytes	128 bytes
`SWIZZLE_64B`	64 bytes	≤ 64 bytes	512 bytes	128 bytes	128 bytes
`SWIZZLE_32B`	32 bytes	≤ 32 bytes	256 bytes	128 bytes	128 bytes
`SWIZZLE_NONE`（預設）	-	-	-	-	16 bytes

考量：swizzle 映射粒度固定 16B；shared box 的內維須 ≤ swizzle span；global 須 128B 對齊。若 shared buffer 未按「重複週期」對齊（但仍 128B 對齊），swizzle pattern 與 shared 間會有 row offset，需用 offset 公式校正：

// CU_TENSOR_MAP_SWIZZLE_128B 的偏移與索引關係
int offset = (reinterpret_cast<uintptr_t>(smem_ptr) / 128) % 8;
smem[y][((y + offset) % 8) ^ x] = ...;   // 64B 用 %4、32B 用 %2

要點：offset 反映 shared buffer 相對 swizzle pattern 被「位移」幾次，加在 row index y 上，再與 column index x 做 XOR 得到真正的 swizzled 索引。

考試/測驗重點

主題	必記重點
TMA 起始 CC	Compute Capability 9.0 (Hopper)+；swizzle/replace 標記 sm_90a 專屬
兩種模式	1D 連續 = bulk-asynchronous（免 tensor map）/ 多維 ≤5D = bulk-tensor asynchronous（需 tensor map）
讀完成機制	global → shared 用 shared memory barrier + transaction count，block 內任一 thread 可等
寫完成機制	shared → global 用 bulk async-group：commit_group 後 wait_group_read / wait_group
bulk vs non-bulk	async-group 獨立：cp.async.wait_group (LDGSTS) / cp.async.bulk.wait_group (TMA) 不可混
誰設 tx count	memcpy_async 自動設；memcpy_async_tx / cp_async_bulk 需手動 mbarrier_expect_tx
何時用真 TMA	memcpy_async 需 16B 對齊且大小 16B 倍數否則 fallback；memcpy_async_tx / cp_async_bulk 永遠用 TMA
單一 thread 發起	用 is_elected() / elect_sync / invoke_one，避免 if(tid==0) 觸發 peeling loop
proxy fence	計算經 generic proxy 寫、TMA 經 async proxy 讀 → 需 fence_proxy_async(space_shared)
tensor map 建立	host 端 cuTensorMapEncodeTiled（driver API），最快變動維排第一
傳遞首選	const grid_constant 參數；其次 constant；再者 global memory（需 fence）
多維 shared 對齊	128B（swizzle 128B 時 1024B）；1D shared 對齊 16B
OOB 行為	讀越界 zero-fill 且左上角可負；寫可部分越界但左上角不可負
device 編碼 rank	.rank 欄位零基，填 desired_rank - 1
release-acquire	修改用 tensormap_cp_fenceproxy(release)，使用前 fence_proxy_tensormap_generic(acquire)，fence 與 use 須同 block
scope 選擇	host 改 tensor map → acquire 用 .sys；同 device thread 間 → .gpu
swizzle 對齊	128B/64B/32B 重複後須對齊 1024/512/256 bytes，shared 一律 ≥128B、global ≥128B
swizzle 索引	smem[y][((y+offset)%N)^x]，128B 用 %8、64B 用 %4、32B 用 %2

重點總覽

TMA 概觀：兩種模式與完成機制

4.11.2.1 用 TMA 搬 1D 陣列

4.11.2.1.1 Prefetching（多階段預取）

4.11.2.2 用 TMA 搬多維陣列（tensor map）

用 Driver API 建立 tensor map

Host-to-device：三種傳遞方式

使用 tensor map（2D tile 搬移）

4.11.2.2.1 / 4.11.2.2.4 在 device 上編碼 tensor map（template 模式）

4.11.2.2.2 device 端編碼與修改

4.11.2.2.3 使用被修改過的 tensor map（release-acquire）

4.11.2.2.5 Shared-Memory Bank Swizzling

考試/測驗重點

Related Notes