Tile Load/Store and Control Flow

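Key Points Overview

Item	Key point
Two ways to move data	tile-space load/store (uses a view object; regular, predictable pattern) vs gather/scatter (uses an index/pointer tile; arbitrary locations)
tile-space load	Build a view that divides the array into a grid of tile-sized cells (the tile-space), then load/store one cell at a time by tile-space index
partition view	A tile-space with non-overlapping, gap-free, fixed-size tiles; currently the only supported view type
C++ two-step construction	`ct::tensor_span` (pointer + `ct::extents`) → `ct::partition_view` (partitions into cells, provides `.load`/`.store`)
Python view	`Array.tiled_view(tile_shape)` returns a `TiledView` that provides `.load(index)`/`.store(index, tile)`
Python one-call	`ct.load(array, index, shape)` / `ct.store(array, index, tile)`; the tile shape is written on every call
C++ boundaries	unmasked (`.load`/`.store`; a partially out-of-bounds (OOB) tile is UB) vs masked (`.load_masked`/`.store_masked`; safe)
Python boundaries	`ct.load` takes `padding_mode` (`ZERO` / `UNDETERMINED`); `ct.store` always silently drops OOB writes
gather/scatter	Python: integer index tile + built-in bounds check; C++: pointer tile + manual boolean mask
TMA performance	The compiler can lower tile-space loads to TMA, which is much faster than per-element gather
Control flow	One control-flow path per block; scalars drive branches/loops, and the compiler distributes tile operations across threads
Loops	C++ uses `ct::irange` (range-for) to give the compiler structured bounds; Python `range`/`for`/`while` are supported natively
Conditionals	`if/else` works normally; with a single control-flow path, warp divergence does not apply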

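2.4.6 Loading and Storing Tiles: Tiles and Arrays

The tile programming model has two key memory objects:

array: a multidimensional container that lives in global memory and is visible to every block of a tile kernel.
tile: a multidimensional container local to a single block, usually a subset of an array's elements.

This topic covers loading a region of an array into a tile and storing a tile back to an array. There are two ways to move the data:

Method	Located by	Pattern
Tile-Space Loads/Stores	view object + tile-space index	regular, predictable array→tile mapping
Gather/Scatter	index tile / pointer tile	arbitrary indices or addresses, per element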

Tip

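Performance: on supported hardware the compiler can lower a tile-space load to the TMA (Tensor Memory Accelerator), which is significantly faster than a per-element gather. Prefer tile-space loads over gather whenever the access pattern allows it.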

Important

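The programmer must decide what value out-of-bounds elements take on a load. Out-of-bounds writes are always silently dropped in Python; in C++ they are silently dropped only when the masked variants are used.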

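2.4.6.1 Tile-Space Loads and Stores and the Partition View

Create a view object that specifies how the array is divided into a grid of tile-sized cells. This mapping is called the tile-space; the kernel loads or stores one cell at a time, addressed by a tile-space index.

tiled view: specifies how array elements map onto tiles of a given size.
partition view: a tiled view whose tiles have a fixed size, do not overlap, and leave no gaps.
When an array dimension is not evenly divisible by the tile size, a tile that crosses the array boundary is only partly filled (a partial tile); how it is loaded is determined by boundary handling.

Figure 19 example: a 2-D array of shape (10, 16) partitioned with tile shape (2, 4) yields a tile grid of shape (5, 4); tile-space index (1, 2) covers elements (2, 8) through (3, 11).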

Array (10, 16)                       Tile-space (5, 4), tile = (2, 4)
 col→ 0        8  11                 ┌─────┬─────┬─────┬─────┐
row  ┌──────────────────┐            │(0,0)│(0,1)│(0,2)│(0,3)│
 0   │                  │            ├─────┼─────┼─────┼─────┤
 2   │        ┌────┐    │ ──maps──►  │(1,0)│(1,1)│(1,2)│(1,3)│ ← (1,2)
 3   │        └────┘    │            ├─────┼─────┼─────┼─────┤   = elements
 ... │  (1,2) covers    │            │(2,0)│ ... │     │     │   (2,8) to
 9   │  (2,8) to (3,11) │            └─────┴─────┴─────┴─────┘   (3,11)
     └──────────────────┘
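
The index arithmetic behind Figure 19 can be sketched in plain Python. This is a host-side illustration only, not part of the CUDA Tile API; the helper names `tile_grid_shape` and `tile_element_range` are hypothetical.

```python
def tile_grid_shape(array_shape, tile_shape):
    """Number of tiles along each dimension (ceiling division, so partial tiles count)."""
    return tuple(-(-n // t) for n, t in zip(array_shape, tile_shape))


def tile_element_range(array_shape, tile_shape, tile_index):
    """Inclusive (first, last) element coordinates covered by one tile.

    The last coordinate is clamped to the array, so a partial edge tile
    reports only its in-bounds part.
    """
    first = tuple(i * t for i, t in zip(tile_index, tile_shape))
    last = tuple(
        min((i + 1) * t, n) - 1
        for i, t, n in zip(tile_index, tile_shape, array_shape)
    )
    return first, last


# Figure 19: a (10, 16) array with (2, 4) tiles gives a (5, 4) tile grid,
# and tile-space index (1, 2) covers elements (2, 8) through (3, 11).
grid = tile_grid_shape((10, 16), (2, 4))
covered = tile_element_range((10, 16), (2, 4), (1, 2))
```

With a width of 18 instead of 16, the same helpers report a fifth tile column whose in-bounds part is only two elements wide, which is the partial-tile case that boundary handling addresses.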

Warning

Partition views are used in the current examples because they are the first view type CUDA Tile supports; future releases are expected to add other view types.

2.4.6.1.1 Partition View Loads and Stores

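Structured tile-space loads are the preferred way to move data between global memory and tiles: first build a view that defines the tile-space, then load/store one tile at a time by tile-space index.

C++ two-step construction:
- ct::tensor_span: pairs a raw pointer with ct::extents to give it multidimensional structure.
- ct::partition_view: partitions the span into a grid of fixed-size tiles and provides .load(idx...) / .store(tile, idx...), which operate on tile-space coordinates.
Python: Array.tiled_view(tile_shape) returns a TiledView that provides .load(index) / .store(index, tile); it corresponds directly to the C++ partition_view.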

__tile_global__ void vec_add(float* __restrict__ a, float* __restrict__ b,
                             float* __restrict__ out) {
  namespace ct = cuda::tiles;
  using namespace ct::literals;
  a   = ct::assume_aligned(a,   16_ic);
  b   = ct::assume_aligned(b,   16_ic);
  out = ct::assume_aligned(out, 16_ic);

  // Step 1: attach a shape to the raw pointer; 128_ic marks a compile-time constant
  auto aSpan = ct::tensor_span{a,   ct::extents{128_ic}};
  auto bSpan = ct::tensor_span{b,   ct::extents{128_ic}};
  auto oSpan = ct::tensor_span{out, ct::extents{128_ic}};
  // Step 2: partition each span into a tile-space of fixed 8-element tiles
  auto aView = ct::partition_view{aSpan, ct::shape{8_ic}};
  auto bView = ct::partition_view{bSpan, ct::shape{8_ic}};
  auto oView = ct::partition_view{oSpan, ct::shape{8_ic}};

  int bx = ct::bid().x;          // this block's tile-space index
  auto aTile = aView.load(bx);   // load tile number bx of a
  auto bTile = bView.load(bx);
  oView.store(aTile + bTile, bx);// store into tile position bx of out
}

Key point: the construction order is tensor_span (gives the pointer its dimensions) → partition_view (partitions into tiles). The signature of .store is .store(tile, idx...): tile first, index after.

@ct.kernel
def vec_add(a, b, c, TILE: ct.Constant[int]):
    a_view = a.tiled_view((TILE,))
    b_view = b.tiled_view((TILE,))
    c_view = c.tiled_view((TILE,))
    bid = ct.bid(0)
    a_tile = a_view.load((bid,))
    b_tile = b_view.load((bid,))
    c_view.store((bid,), a_tile + b_tile)   # store(index, tile)

Tip

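The __restrict__ qualifiers and ct::assume_aligned(ptr, 16_ic) calls at the top of the C++ example are performance annotations (see 2.4.12 / C++ Performance Tips). The _ic suffix on numeric literals (such as 128_ic and 8_ic) marks them as compile-time constants. Note that Python's .store is store(index, tile) while C++'s .store is store(tile, idx...): the argument order is reversed.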

2.4.6.1.2 Python One-Call Load and Store

Python also offers a one-call form that needs no explicit view object; the tile shape is written directly on each load/store:

ct.load(array, index, shape): reads a tile of the given shape at the given tile-space index.
ct.store(array, index, tile): the corresponding write.

@ct.kernel
def vec_add(a, b, c, TILE: ct.Constant[int]):
    bid = ct.bid(0)
    a_tile = ct.load(a, index=(bid,), shape=(TILE,))  # load the bid-th TILE-sized block of a
    b_tile = ct.load(b, index=(bid,), shape=(TILE,))
    ct.store(c, index=(bid,), tile=a_tile + b_tile)

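ct.load/ct.store and Array.tiled_view express the same tile-space access pattern; they differ only in where the tile shape is written:

Form	Where the tile shape is bound	When to use
`Array.tiled_view`	Bound once, on the view object	The same partitioning is reused across several loads/stores
`ct.load` / `ct.store`	Supplied inline on every call	More concise for a single, one-off load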

2.4.6.1.3 Tile-Space Boundary Handling

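When the array is not evenly divisible by the tile size, edge tiles are partially out-of-bounds (OOB), and you must choose between the masked and unmasked variants.

C++ (partition_view):

Variant	Behavior
`.load(idx...)` / `.store(tile, idx...)`	Assumes the tile is fully in-bounds; a partially OOB tile is undefined behavior
`.load_masked(idx...)`	Handles partial edge tiles safely; OOB positions are zero-filled by default (float tiles can select other padding modes such as NaN)
`.store_masked(tile, idx...)`	Handles partial edge tiles safely; OOB writes are silently dropped

Prefer the unmasked variants when the tile size divides the array evenly. Use the masked variants when boundaries must be handled; they are also valid when every tile is completely filled.
This is the first C++ example in this guide whose array dimension is a runtime value: ct::extents{N} accepts a runtime dimension, and ct::extents supports any mix of compile-time (_ic) and runtime values.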

__tile_global__ void edge_safe(float* __restrict__ in, float* __restrict__ out, int N) {
  namespace ct = cuda::tiles;
  using namespace ct::literals;
  in  = ct::assume_aligned(in,  16_ic);
  out = ct::assume_aligned(out, 16_ic);
  // ct::extents{N} is a runtime dimension; 128_ic is still compile-time
  auto inView  = ct::partition_view{ct::tensor_span{in,  ct::extents{N}}, ct::shape{128_ic}};
  auto outView = ct::partition_view{ct::tensor_span{out, ct::extents{N}}, ct::shape{128_ic}};
  int bx = ct::bid().x;
  auto tile = inView.load_masked(bx);   // masked load: OOB lanes default to 0
  outView.store_masked(tile, bx);       // masked store: OOB writes are silently dropped
}

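Python: ct.load accepts a padding_mode that controls the value of out-of-bounds elements:

padding_mode	Behavior
`PaddingMode.ZERO`	OOB elements are zero-filled
`PaddingMode.UNDETERMINED` (default)	OOB values are left to the implementation; suitable when the tile is known to be fully in-bounds

ct.store always silently drops out-of-bounds writes and takes no padding_mode parameter.
The same rules apply to tiled_view, except that its padding_mode is fixed when the view is created.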

@ct.kernel
def edge_safe(arr_in, arr_out, TILE: ct.Constant[int]):
    bid = ct.bid(0)
    tile = ct.load(arr_in, index=(bid,), shape=(TILE,),
                   padding_mode=ct.PaddingMode.ZERO)  # OOB lanes of a partial edge tile become 0
    ct.store(arr_out, index=(bid,), tile=tile)        # OOB writes are silently dropped
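
The zero-padding and drop-on-store rules can be modeled on the host with plain Python lists. The sketch below is only a model of the semantics for a 1-D array, not the CUDA Tile API: the function names are hypothetical, and raising `IndexError` for a tile that lies entirely outside the array is a choice of this model (the real operation is undefined in that case).

```python
import math


def load_tile_zero_padded(array, index, tile_size):
    """Model of a zero-padded tile-space load on a 1-D array."""
    start = index * tile_size
    if index < 0 or start >= len(array):
        # Model choice: fail loudly where the real load would be undefined.
        raise IndexError("tile lies entirely outside the array")
    return [array[i] if i < len(array) else 0.0
            for i in range(start, start + tile_size)]


def store_tile_dropping_oob(array, index, tile):
    """Model of a tile-space store that silently drops out-of-bounds lanes."""
    start = index * len(tile)
    if index < 0 or start >= len(array):
        raise IndexError("tile lies entirely outside the array")
    for lane, value in enumerate(tile):
        if start + lane < len(array):
            array[start + lane] = value


TILE = 4
data = [float(i) for i in range(1, 11)]      # 10 elements: not divisible by 4
num_tiles = math.ceil(len(data) / TILE)      # 3 tiles, the last one is partial

edge = load_tile_zero_padded(data, num_tiles - 1, TILE)
out = [-1.0] * len(data)
store_tile_dropping_oob(out, num_tiles - 1, edge)
```

Here `edge` holds the two in-bounds values followed by two zeros, and the store writes only the two in-bounds lanes of `out`.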

Warning

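Boundary handling applies only to tiles that are partially out-of-bounds. Loading or storing a tile that lies entirely outside the array is undefined, and the masked variants do not make it safe.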

2.4.6.2 Gather and Scatter

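When the access pattern is irregular or data-dependent (for example lookup tables or permutations), the regular partitioning of a partition view does not fit. Gather/scatter lets a tile read from, or write to, non-uniform and non-contiguous array elements located by arbitrary indices or addresses.

Language	gather	scatter	Boundaries
Python	`ct.gather(array, index_tile)`	`ct.scatter(array, index_tile, values)`	Built-in bounds check
C++	`ct::load(ptr_tile)`	`ct::store(ptr_tile, tile)`	Use `ct::load_masked`/`ct::store_masked` with a boolean mask tile

C++ idiom: form a pointer tile (one pointer per element) and pass it to ct::load / ct::store. Arithmetic between a scalar pointer and an integer tile is performed element-wise and yields a pointer tile; this is the standard way to build the per-element addresses for gather/scatter.
Python: ct.gather loads the element at each index in the index tile. The bounds check is on by default: an OOB index returns the padding value (zero by default, configurable with padding_value=), and the check can be disabled with check_bounds=False. ct.scatter stores one value per index; OOB writes are silently dropped.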

__tile_global__ void vec_add_gather(int* __restrict__ a, int* __restrict__ b,
                                    int* __restrict__ out) {
  namespace ct = cuda::tiles;
  using namespace ct::literals;
  using i32x8 = ct::tile<int, ct::shape<8>>;
  a = ct::assume_aligned(a, 16_ic); b = ct::assume_aligned(b, 16_ic);
  out = ct::assume_aligned(out, 16_ic);
  int bx = ct::bid().x;
  auto offsets = 8 * bx + ct::iota<i32x8>();  // one element-level offset per lane
  // scalar pointer + int tile = pointer tile (one pointer per offset)
  auto aPtrs = a + offsets;
  auto bPtrs = b + offsets;
  auto aTile = ct::load(aPtrs);               // gather: one load per pointer
  auto bTile = ct::load(bPtrs);
  ct::store(out + offsets, aTile + bTile);     // scatter: one store per pointer
}

ct::iota<i32x8>() produces the index tile 0, 1, ..., 7; a + offsets builds the pointer tile through element-wise pointer arithmetic.

@ct.kernel
def vec_add_gather(a, b, c, TILE: ct.Constant[int]):
    bid = ct.bid(0)
    indices = bid * TILE + ct.arange(TILE, dtype=ct.int32)  # one index per lane
    a_tile = ct.gather(a, indices)            # each lane loads a[indices[i]]
    b_tile = ct.gather(b, indices)
    ct.scatter(c, indices, a_tile + b_tile)   # store one value per index into c
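
The vec_add example uses contiguous indices, so it does not show why gather/scatter exists. A data-dependent case, such as a permutation, can be modeled in plain Python as below. These `gather` and `scatter` functions are hypothetical host-side stand-ins written for this note, not the `ct.gather`/`ct.scatter` implementations; they only mirror the bounds-checked behavior described above.

```python
def gather(array, indices, padding_value=0):
    """One read per index; an out-of-bounds index yields the padding value."""
    return [array[i] if 0 <= i < len(array) else padding_value
            for i in indices]


def scatter(array, indices, values):
    """One write per index; out-of-bounds writes are silently dropped."""
    for i, value in zip(indices, values):
        if 0 <= i < len(array):
            array[i] = value


table = [10, 20, 30, 40]
perm = [3, 0, 2, 1]                    # arbitrary, non-contiguous indices

permuted = gather(table, perm)         # reads table[3], table[0], table[2], table[1]

restored = [0] * len(table)
scatter(restored, perm, permuted)      # writes each value back to its source slot

with_oob = gather(table, [1, 7])       # index 7 is out of bounds
```

Scattering the gathered values through the same index tile restores the original order, and the out-of-bounds read returns the padding value instead of failing.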

2.4.6.2.1 Gather/Scatter Boundary Handling

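The boundary rules for gather/scatter differ from those for tile-space load/store:

Language	Default behavior	How to control it
Python	Bounds-safe: OOB reads return padding (zero by default), OOB writes are silently dropped	The check can be disabled when every index is provably in range (after that, OOB = UB)
C++	No automatic bounds check	Build a boolean mask yourself (e.g. `offsets < N`) and pass it to `ct::load_masked`/`ct::store_masked`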

__tile_global__ void gather_safe(int* __restrict__ arr, int* __restrict__ out, int N) {
  namespace ct = cuda::tiles;
  using namespace ct::literals;
  using i32x8 = ct::tile<int, ct::shape<8>>;
  arr = ct::assume_aligned(arr, 16_ic); out = ct::assume_aligned(out, 16_ic);
  int bx = ct::bid().x;
  auto offsets = 8 * bx + ct::iota<i32x8>();
  auto mask = offsets < N;                       // boolean tile: true where in-bounds
  auto ptrs = arr + offsets;
  auto tile = ct::load_masked(ptrs, mask, 0);    // masked-off lanes take the pad value 0
  ct::store_masked(out + offsets, tile, mask);   // masked-off lanes are skipped by the store
}
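
To see what the mask does lane by lane, the gather_safe kernel can be traced on the host for one block. The sketch below is a plain-Python model under stated assumptions: memory is a list, offsets stand in for pointers, and the `load_masked`/`store_masked` functions are hypothetical stand-ins with the same argument order as the C++ calls above.

```python
def load_masked(memory, offsets, mask, pad):
    """Read memory[offset] where the mask is true; masked-off lanes take pad."""
    # memory[o] is only evaluated when m is true, so masked-off lanes
    # never touch an out-of-bounds location.
    return [memory[o] if m else pad for o, m in zip(offsets, mask)]


def store_masked(memory, offsets, values, mask):
    """Write values where the mask is true; masked-off lanes are skipped."""
    for o, value, m in zip(offsets, values, mask):
        if m:
            memory[o] = value


N = 12                                   # array length, not a multiple of 8
TILE = 8
arr = list(range(100, 100 + N))
out = [0] * N

bx = 1                                   # the second block owns offsets 8..15
offsets = [TILE * bx + lane for lane in range(TILE)]
mask = [o < N for o in offsets]          # true for offsets 8..11 only

tile = load_masked(arr, offsets, mask, 0)
store_masked(out, offsets, tile, mask)
```

Only the four in-bounds lanes are read and written; the other four lanes carry the pad value and are never stored.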

Warning

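Easy to confuse: Python gather/scatter is safe by default; C++ gather/scatter is unsafe by default, and you must build the mask yourself. For tile-space loads, the zero-filling C++ masked variant is load_masked(idx); for gather, load_masked(ptrs, mask, pad) requires an explicit mask and pad value.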

2.4.7 Control Flow

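From the programmer's point of view, a tile kernel has a single control-flow path per block. Scalar values (in conditions and loop bounds) drive the control flow, while the tile operations in the body are distributed across hardware threads by the compiler.

            Control flow of a single block
   scalar condition / loop bound ──drives──► one control-flow path
                                                 │
                                                 ▼
   tile operations (body) ──► distributed by the compiler across all threads of the block, executed in parallel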

Warning

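Not every control-flow construct is supported. For example, returning from inside a loop is not allowed in tile code. See each language's API reference for the full list of restrictions.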

2.4.7.1 Loops

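A common pattern: process the tiles of an array one by one.

C++: ct::irange is a forward range representing an increasing integer sequence that starts at a lower bound, ends before an (exclusive) upper bound, and takes an optional step. Using ct::irange gives the compiler structured information about the iteration bounds, which can help optimization, provided the loop variable is bound through a range-for over ct::irange.
Python: the built-in range(), for, while, and nested loops are all natively supported in tile code. The step must be strictly positive; ranges with a negative step are not supported.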

__tile_global__ void tile_sum(float* __restrict__ arr, float* __restrict__ out, int num_tiles) {
  namespace ct = cuda::tiles;
  using namespace ct::literals;
  using f32x8 = ct::tile<float, ct::shape<8>>;
  arr = ct::assume_aligned(arr, 16_ic); out = ct::assume_aligned(out, 16_ic);
  auto inView  = ct::partition_view{ct::tensor_span{arr, ct::extents{8 * num_tiles}}, ct::shape{8_ic}};
  auto outView = ct::partition_view{ct::tensor_span{out, ct::extents{8_ic}},          ct::shape{8_ic}};
  auto acc = ct::full<f32x8>(0.0f);
  // range-for over ct::irange gives the compiler structured iteration bounds
  for (auto k : ct::irange(0, num_tiles)) {
    auto tile = inView.load(k);
    acc = acc + tile;             // accumulate tile k into acc
  }
  outView.store(acc, 0);          // write the final result as tile 0 of out
}

@ct.kernel
def tile_sum(arr, out, TILE: ct.Constant[int], N_TILES: ct.Constant[int]):
    # Expected grid is (1,): a single block sums every tile of arr
    acc = ct.zeros((TILE,), dtype=ct.float32)
    for k in range(N_TILES):                       # range() is natively available in tile code
        tile = ct.load(arr, index=(k,), shape=(TILE,))
        acc = acc + tile
    ct.store(out, index=(0,), tile=acc)
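
When checking a kernel like tile_sum, a host-side reference that follows the same loop structure is useful. The following is a minimal sketch in plain Python, assuming the array length is an exact multiple of the tile size (as the unmasked loads in the kernels require); `tile_sum_reference` is a hypothetical helper name.

```python
def tile_sum_reference(arr, tile_size):
    """Element-wise sum of all tiles of arr, mirroring the kernel's loop."""
    if len(arr) % tile_size != 0:
        raise ValueError("array length must be a multiple of the tile size")
    acc = [0.0] * tile_size                       # same role as the zero tile
    for k in range(len(arr) // tile_size):        # one iteration per tile
        tile = arr[k * tile_size:(k + 1) * tile_size]
        acc = [a + x for a, x in zip(acc, tile)]  # element-wise accumulate
    return acc


values = [float(i) for i in range(8)]             # two tiles of four elements
expected = tile_sum_reference(values, 4)
```

For this input the two tiles are [0, 1, 2, 3] and [4, 5, 6, 7], so lane i of the result is values[i] + values[i + 4].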

2.4.7.2 Conditionals

Standard if/else works normally. Because each block follows a single control-flow path, the usual concern about branch divergence within a warp does not apply to tile kernels.

__tile_global__ void conditional_load(float* __restrict__ arr, float* __restrict__ out, int N) {
  namespace ct = cuda::tiles;
  using namespace ct::literals;
  using f32x8 = ct::tile<float, ct::shape<8>>;
  arr = ct::assume_aligned(arr, 16_ic); out = ct::assume_aligned(out, 16_ic);
  auto inView  = ct::partition_view{ct::tensor_span{arr, ct::extents{N}}, ct::shape{8_ic}};
  auto outView = ct::partition_view{ct::tensor_span{out, ct::extents{N}}, ct::shape{8_ic}};
  int bx = ct::bid().x;
  int nb_x = ct::num_blocks().x;
  auto tile = ct::full<f32x8>(0.0f);  // default value for the last-block branch
  // scalar condition -> one control-flow path per block, no divergence to reason about
  if (bx < nb_x - 1) {
    tile = inView.load(bx);           // every block except the last
  }
  outView.store_masked(tile, bx);     // masked: handles a possibly partial last tile
}

@ct.kernel
def conditional_load(arr, out, TILE: ct.Constant[int]):
    bid = ct.bid(0)
    # scalar condition -> one control-flow path per block, no divergence to reason about
    if bid < ct.num_blocks(0) - 1:                       # every block except the last
        tile = ct.load(arr, index=(bid,), shape=(TILE,))
    else:
        tile = ct.zeros((TILE,), dtype=ct.float32)       # last block: write zeros
    ct.store(out, index=(bid,), tile=tile)
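
The per-block branch can be traced on the host by treating each loop iteration as one block. This is a sketch of the expected result only, written in plain Python with a hypothetical function name; it assumes every block except the last owns a full tile, and that the store drops out-of-bounds lanes.

```python
def conditional_load_reference(arr, tile_size, num_blocks):
    """Expected contents of `out` after the conditional_load kernel."""
    out = [None] * len(arr)
    for bid in range(num_blocks):              # each iteration models one block
        start = bid * tile_size
        if bid < num_blocks - 1:               # one scalar decision per block
            tile = arr[start:start + tile_size]
        else:
            tile = [0.0] * tile_size           # last block writes zeros
        for lane, value in enumerate(tile):
            if start + lane < len(arr):        # out-of-bounds lanes are dropped
                out[start + lane] = value
    return out


source = [float(i) for i in range(1, 11)]      # 10 elements, tile size 4
result = conditional_load_reference(source, 4, 3)
```

The first two blocks copy their tiles unchanged, and the last block zeroes the two remaining in-bounds elements.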

Tip

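Mnemonic: if/for in a tile kernel are controlled by scalars, so you can reason about control flow as you would for a single-threaded program; divergence is a concern only in the SIMT (per-thread) model.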

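Exam/Quiz Key Points

Scenario/keyword	Answer
The two ways to move tiles	tile-space load/store (view + tile-space index) and gather/scatter (index/pointer tile)
Which hardware can tile-space loads be lowered to?	TMA (Tensor Memory Accelerator), much faster than per-element gather
C++ two steps to build a partition view	`ct::tensor_span` (pointer + `ct::extents`) → `ct::partition_view` (partitions into tiles)
Building a view in Python	`Array.tiled_view(tile_shape)` returns a `TiledView`
Python one-call API	`ct.load(array, index, shape)` / `ct.store(array, index, tile)`
Argument order of C++ `.store` vs Python `.store`	C++: `store(tile, idx...)`; Python: `store(index, tile)` (reversed)
Meaning of the `_ic` suffix	Marks a numeric literal as a compile-time constant
C++ unmasked load on a partially OOB tile	undefined behavior
Default OOB value of C++ `load_masked`	Zero (float tiles can switch to other padding modes such as NaN)
OOB writes with C++ `store_masked`	Silently dropped
Default padding_mode of Python `ct.load`	`PaddingMode.UNDETERMINED` (value left to the implementation)
Python `ct.store` out of bounds	Always silently dropped; no padding_mode parameter
When is the padding_mode of `tiled_view` fixed?	When the view is created
Load/store of a tile entirely outside the array	Undefined; masked variants do not help (they handle only partial OOB)
Is Python gather/scatter safe by default?	Yes (OOB reads return padding, zero by default; OOB writes are dropped)
Is C++ gather/scatter safe by default?	No; build a boolean mask yourself and pass it to `load_masked`/`store_masked`
How to build the gather address tile in C++	Element-wise arithmetic of scalar pointer + int tile = pointer tile (together with `ct::iota`)
Disabling the Python gather bounds check	`check_bounds=False`; after that, OOB = UB
What drives control flow in a tile kernel?	Scalar values (conditions/loop bounds); tile operations are distributed across threads by the compiler
Example of control flow not allowed in tile code	return from inside a loop
Structured loops in C++	`ct::irange` + range-for, which gives the compiler structured bounds
Python loop step restriction	Must be strictly positive; negative steps are not supported
Do tile kernels have warp divergence?	No; with a single control-flow path, divergence concerns do not apply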
