Tile Kernel 結構與啟動 (Tile Kernel Structure)

#cuda #tile-programming #kernel-launch #concept

重點總覽

CUDA Tile 提供有別於 SIMT 的 kernel 撰寫方式：programmer 以整個 block（tile）為單位思考（load 一整塊 tile、對整塊做運算、再 store），把最底層的 thread 平行對應交給 compiler。Kernel 外圍的 host 程式（cudaMalloc、cudaMemcpy、launch 排序）與 SIMT 完全相同，只有 kernel 內部寫法改變。Python 透過 cuda.tile 套件，C++ 從 CUDA Toolkit 13.3 起提供，兩者共用 compiler 後端 CUDA Tile IR，慣例上 API 別名為 ct。

項目	重點
Kernel/Function 宣告	C++ `__tile_global__`/`__tile__`；Python `@ct.kernel`/`@ct.function`（後者可省略）
啟動 kernel	C++ triple-chevron 第二參數必須是 1；Python `ct.launch(stream, grid, kernel, args)`
Grid-sizing pattern	ceil 除法 `(N + tile_size - 1) / tile_size` 覆蓋整個 array（含不滿的尾端 block）
查詢 block 位置	C++ `ct::bidnum_blocks()`（uint3/dim3）；Python `ct.bid(axis)`/`ct.num_blocks(axis`
建立 tile	固定大小、多維、shape 與 dtype 編譯期已知、每維為 2 的次方、value semantics；factory `zeros`/`ones`/`full`/`iota`/`arange`
Compile-time constants	Python `ct.Constant[T]`；C++ `ct::integral_constant` 與 `_ic` literal

Important

Tile programming 的核心差異：SIMT 以單一 thread 思考（算 global index、load 自己的元素、運算、store）；Tile 以整個 block 思考。compiler 負責把 tile 運算映射到 block 內的 hardware threads——這正是 SIMT programmer 必須手動處理的部分。

SIMT 思維                          Tile 思維
┌─────────────────────┐           ┌─────────────────────┐
│ idx = blockIdx*N +  │           │ load 一整塊 tile     │
│       threadIdx     │           │   (多個元素)         │
│ load 我這一格元素   │   vs.     │ 對整塊 tile 做運算   │
│ 對單一元素運算      │           │ store 整塊 tile      │
│ store 我這一格      │           │ (thread 對應 → 編譯器)│
└─────────────────────┘           └─────────────────────┘
   programmer 管 thread              compiler 管 thread

API 別名與引入

慣例上 tile API 在兩種語言皆別名為 ct，後續所有 ct. / ct:: 前綴都指 tile API。

#include "cuda_tile.h"
namespace ct = cuda::tiles;   // C++：API 位於 cuda::tiles namespace

import cuda.tile as ct        # Python：API 位於 cuda.tile 模組

C++ 的 tile API 由 cuda_tile.h header 暴露，位於 cuda::tiles namespace。
Python 的 tile API 位於 cuda.tile 模組。
兩語言共用 compiler 後端 CUDA Tile IR，因此執行語意相同。
Tile 程式相較 SIMT 提供更簡單的途徑去使用近期 NVIDIA GPU 的效能功能，如 TMA（tensor memory accelerator） 單元與 tensor cores。

2.4.1 Kernel 與 Function 宣告

Tile kernel 是 GPU entry point，每個 block 執行一次；tile function 可被 tile kernel 或另一個 tile function 呼叫，但本身不是 entry point。與 SIMT kernel 一樣，tile kernel 不能從 host code 直接呼叫，必須 launch。

角色	C++ 修飾符	Python 裝飾器	類比 SIMT
Kernel entry point	`__tile_global__`	`@ct.kernel`	`__global__`
Device-side function	`__tile__`	`@ct.function`（可省略）	`__device__`

#include "cuda_tile.h"
// Tile kernel entry point：不能直接呼叫，必須 launch
__tile_global__ void my_kernel(float* a, float* b, float* c) { ... }

// Tile function：可被 tile kernel 與 tile function 呼叫
__tile__ float helper(float x, float y) { return x + y; }

import cuda.tile as ct
@ct.kernel                      # entry point
def my_kernel(a, b, c): ...

@ct.function                    # 可省略：被 tile code 呼叫的函式會自動編成 tile code
def helper(x, y):
    return x + y

Array 與 scalar 參數的傳遞方式與 SIMT kernel 相同。
Tile code 與 SIMT code 可共存：單一 .cu 檔可同時定義 __tile_global__ 與 __global__ kernel，單一 host 程式可同時 launch 兩者。
Python 中 @ct.function 是可省略的：任何被 kernel 呼叫的函式都會自動編譯為 tile code。
Python 的 array 參數接受任何暴露 DLPack 或 CUDA Array Interface 的 device-resident array（例如 PyTorch tensor、CuPy array）；scalar 參數直接傳入。

目前的互呼叫限制（未來版本可能解除）

__tile__ 函式目前無法從 __global__ 或 __device__ 函式呼叫；反之 __device__ 函式也無法從 __tile_global__ 或 __tile__ 函式呼叫。兩種程式可在同檔共存並各自 launch，但目前不能跨界呼叫。

2.4.2 啟動 Kernel

Tile kernel 啟動於一個 tile blocks 的 grid，如同 SIMT kernel 啟動於 thread blocks 的 grid。Programmer 指定 grid 形狀（最多三維）。從 programmer 視角，每個 tile block 由「單一邏輯 thread」執行；block 內的平行由 compiler 管理。

// C++：沿用 SIMT 的 triple-chevron；第二個 chevron 參數必須是 1
my_kernel<<<dim3(num_blocks_x, num_blocks_y), 1>>>(a, b, c);
//                                            ↑ 第二參數（per-block thread 數）必須是 1

import torch
stream = torch.cuda.current_stream()          # CUDA stream 物件
grid = (num_blocks_x, num_blocks_y, 1)         # tile-block grid (x, y, z)
ct.launch(stream, grid, my_kernel, (a, b, c))  # 四個位置參數

C++：第一個 chevron 參數是 grid 形狀（tile block 數量）；第二個原本是 SIMT 的 per-block thread 數，但 tile kernel 由 compiler 內部決定 thread 數，故此參數必須為 1。
Tile kernel 同時也是普通 CUDA kernel，可透過 runtime 既有的 cudaLaunchKernel 與 cudaLaunchKernelEx 以相同的 grid, 1 設定啟動——便於整合進已用這些 API 驅動 launch 的程式庫。
Python：ct.launch 取四個位置參數——CUDA stream、grid tuple（各維 tile block 數）、kernel 物件、kernel 參數 tuple。

C++ 第二個 chevron 參數恆為 1

在 SIMT 中第二個 chevron 參數是 per-block thread 數；在 tile kernel 中 thread 數由 compiler 決定，因此寫任何非 1 的值都是錯誤。這是最容易寫錯的點。

2.4.2.1 Grid-Sizing Pattern（只指定 grid，thread 數交給 compiler）

常見模式是啟動足夠多的 block 以覆蓋整個 array，其中最後一個 block 可能在一或多維上超出 array 大小。用 ceil（向上取整）除法計算 block 數。

int num_blocks = (N + tile_size - 1) / tile_size;  // ceil 除法 → 覆蓋不滿的尾端
kernel<<<num_blocks, 1>>>(in, out, N);

import math
grid = (math.ceil(N / TILE),)                      # ceil 除法 → 覆蓋不滿的尾端
ct.launch(stream, grid, my_kernel, (arr_in, arr_out, TILE))

N = 10, TILE = 4  →  num_blocks = ceil(10/4) = 3
 array:  [0 1 2 3][4 5 6 7][8 9 . .]
 block:   block0   block1   block2  ← 最後一個 block 超出 array（尾端 2 格無效）

Tip

array 大小無法被 tile 大小整除時的處理（尾端 block 的 bounds 處理）詳見 2.4.6 各小節，於 02-Programming-GPUs/11-Tile-Load-Store-and-Control-Flow 說明。

2.4.3 查詢 Block 位置

每個 block 需知道自己在 grid 中的位置，才能決定要處理 array 的哪一部分。SIMT 中 programmer 結合 blockIdx 與 threadIdx 算出 global thread index；tile code 只需要 block index，所有 block 內的 thread-level indexing 由 compiler 處理。

用途	C++	Python
本 block 索引	`ct::bid()` → `uint3`（`.x/.y/.z`）	`ct.bid(axis)` → `int32` scalar
各維 block 總數	`ct::num_blocks()` → `dim3`	`ct.num_blocks(axis)` → 該軸總數

__tile_global__ void my_kernel(float* a, float* b, float* c) {
    namespace ct = cuda::tiles;
    int bid_x = ct::bid().x;            // .x 方向的 block index
    int bid_y = ct::bid().y;            // .y 方向的 block index
    int num_x = ct::num_blocks().x;     // .x 方向 block 總數
}

@ct.kernel
def my_kernel(a, b, c):
    bid_x = ct.bid(0)         # axis 0 的 block index
    bid_y = ct.bid(1)         # axis 1 的 block index
    num_x = ct.num_blocks(0)  # axis 0 的 block 總數

C++ ct::bid() 回傳含三維 block index 的 uint3；ct::num_blocks() 回傳含各維 block 總數的 dim3（依 launch 參數決定）。
Python ct.bid(axis) 取 axis（0、1、2）回傳該軸 block index（int32 scalar）；ct.num_blocks(axis) 回傳該軸 block 總數——適合用於 bounds check 與 loop 計數。

Tip

Tile code 沒有 threadIdx 的概念。記住對照：SIMT blockIdx + threadIdx → tile 只需 ct::bid()。

2.4.4 建立 Tile

Tile 是 tile kernel 真正操作的對象：一個固定大小、多維的 scalar 元素陣列，其 shape 與 element type 在編譯期已知。

每一維必須是 2 的次方。
Tile 具 value semantics：複製 tile 會複製其元素，兩份副本完全獨立；但因 compiler 控制 tile 在 hardware 內的表示方式，複製成本低。
Programmer 不需也不能 為 tile 配置或釋放記憶體。
建立方式：(1) 從 array load 資料；(2) 用 factory function 產生填滿特定 pattern 的 tile。

C++ tile 型別是顯式的：ct::tile<T, ct::shape<dims...>>，T 是元素型別，ct::shape<dims...> 把各維大小編碼成 template 引數（編譯期已知）。

__tile__ void factories() {
    namespace ct = cuda::tiles;
    using i32x8   = ct::tile<int,   ct::shape<8>>;     // 1-D：8 個 int
    using f32x4x4 = ct::tile<float, ct::shape<4, 4>>;  // 2-D：4x4 float

    auto z      = ct::zeros<f32x4x4>();    // 全 0
    auto o      = ct::ones<f32x4x4>();     // 全 1
    auto filled = ct::full<f32x4x4>(3.14f);// 全 3.14
    auto seq    = ct::iota<i32x8>();       // {0,1,2,3,4,5,6,7}
}

@ct.function
def factories():
    zeros  = ct.zeros((64, 64), dtype=ct.float32)   # 64x64 全 0.0
    ones   = ct.ones((128,),   dtype=ct.float16)    # 128 元素全 1.0
    filled = ct.full((32, 32), 3.14, dtype=ct.float32)  # 32x32 全 3.14
    seq    = ct.arange(8, dtype=ct.int32)           # [0,1,2,3,4,5,6,7]

Factory（填滿 pattern）	C++	Python
全 0 / 全 1	`ct::zeros<Tile>()` / `ct::ones<Tile>()`	`ct.zeros(shape, dtype)` / `ct.ones(shape, dtype)`
全為定值	`ct::full<Tile>(val)`	`ct.full(shape, fill_value, dtype)`
遞增序列	`ct::iota<Tile>()` → `(0,1,...,N-1)`	`ct.arange(size, dtype=...)` → `[0,...,size-1]`

C++ 因 shape 是型別的一部分，永遠在編譯期已知；範例慣用 using alias（如 using f32x4x4 = ct::tile<float, ct::shape<4, 4>>）讓 call site 可讀。
Python 中 shape tuple 與 dtype 都是編譯期值；Python literal（如 (64, 64)、ct.float32）自然滿足，也可用 Constant-annotated 參數提供（見下節）。產生的 tile 暴露 .shape、.dtype、.ndim 屬性，反映其編譯期屬性。

Important

Tile 的 shape 與 dtype 必須編譯期已知，且每維為 2 的次方。這是後續 compile-time constant 機制存在的根本原因。

2.4.5 Compile-Time Constants（編譯期常數）

Tile compiler 會為每一種 tile shape、data type 與其他結構參數的組合產生專屬機器碼。因此任何影響產生碼的值都必須在編譯期已知——也就是 tile 的 shape 與 dtype 必須是編譯期值。除了直接寫 literal，shape 也能透過 kernel 介面以「編譯期已知值」傳入。

2.4.5.1 Python `Constant[T]`

在 kernel 參數上加 ct.Constant[T] type hint，將其標記為 constant-embedded：該參數在 kernel 內的每一次使用，都如同把 literal 值寫在該處一樣（compiler 看到的是具體 literal，並據此產生 specialized code）。

@ct.kernel
def my_kernel(TILE: ct.Constant[int]):
    # TILE 被 constant-embedded：compiler 看到其 literal 值（如 128）並產生專屬碼，
    # 此處 TILE 驅動 factory tile 的 shape。
    zeros = ct.zeros((TILE,), dtype=ct.float32)

型別引數可省略：ct.Constant（不帶型別）可嵌入任意型別的常數。
最常用於整數 ct.Constant[int]，用在驅動 tile shape 與 loop bound 的參數上。

2.4.5.2 C++ `integral_constant` 與 `_ic` Literal

CUDA Tile C++ 透過 ct::integral_constant 表達編譯期值——其數值編碼在型別本身。ct::literals namespace 的 _ic literal 提供簡寫：0_ic 產生一個 ct::integral_constant<0> 值。

接受編譯期值的 API 同時接受兩種形式：non-type template parameter（NTTP）形式，與 _ic literal 形式。

__tile__ void concat_demo() {
    namespace ct = cuda::tiles;
    using namespace ct::literals;
    using T = ct::tile<int, ct::shape<4, 8>>;
    T lhs = ct::full<T>(0);
    T rhs = ct::full<T>(1);
    auto a = ct::cat<0>(lhs, rhs);       // NTTP 形式（軸 0 是編譯期值）
    auto b = ct::cat(lhs, rhs, 0_ic);    // _ic 形式（同一個編譯期軸）
}

ct::extents 與 ct::shape 各有 NTTP 形式與 brace 形式。與 NTTP 形式不同，brace 形式接受 runtime 值——當一或多個維度只在 launch 時才知道時，就用它：編譯期維度寫 _ic literal、runtime 維度寫普通變數。ct::tensor_span、ct::partition_view（見 Tile-Space Loads and Stores）即用此形式包裝這類 array。

auto shape2d = ct::extents{8_ic, length};  // 8 是編譯期；length 是 runtime

編譯期值的兩種寫法（C++）
 NTTP 形式：  ct::cat<0>(...)            ct::extents<std::uint32_t, 4, 8>
 _ic  形式：  ct::cat(..., 0_ic)         ct::extents{8_ic, length}
                                                    ↑      ↑
                                              編譯期   runtime（brace 形式才行）

Tip

_ic literal 是「在 value-form API 引數需要編譯期值之處」的統一簡寫——無論是 ct::cat 的 dimension，或 extents／shape 的分量。

Python vs C++ 編譯期常數對照

Python：ct.Constant[T] 標記 kernel 參數為 constant-embedded。
C++：ct::integral_constant / _ic literal 把值編碼進型別；NTTP 與 _ic 兩種形式可互換，但只有 extents／shape 的 brace 形式能混搭 runtime 維度。

考試/測驗重點

情境/關鍵字	答案
C++ tile kernel entry point 修飾符	`__tile_global__`（類比 `__global__`）
C++ tile device function 修飾符	`__tile__`（類比 `__device__`）
Python tile kernel / function 裝飾器	`@ct.kernel` / `@ct.function`（後者可省略）
triple-chevron 第二參數	必須為 1（thread 數由 compiler 決定）
也可用哪些 runtime API 啟動 tile kernel	`cudaLaunchKernel`、`cudaLaunchKernelEx`（同樣 `grid, 1`）
Python 啟動 API 與其四參數	`ct.launch(stream, grid, kernel, args_tuple)`
每個 tile block 由幾個邏輯 thread 執行（programmer 視角）	單一邏輯 thread；block 內平行由 compiler 管
Grid-sizing ceil 除法	`(N + tile_size - 1) / tile_size`；Python `math.ceil(N/TILE)`
tile code 是否需要 threadIdx	否，只需 block index（`ct::bid()` / `ct.bid(axis)`）
C++ `ct::bid()` / `ct::num_blocks()` 回傳型別	`uint3` / `dim3`（`.x/.y/.z`）
Python `ct.bid(axis)` 回傳型別	`int32` scalar
tile 每一維的限制	必須是 2 的次方
tile 的記憶體管理	programmer 不配置/釋放；value semantics、複製成本低
C++ tile 型別寫法	`ct::tile<T, ct::shape<dims...>>`
`iota` / `arange` 內容	`(0,1,...,N-1)`
`ct::full` vs `ct.full`	C++ `ct::full<Tile>(val)`；Python `ct.full(shape, fill_value, dtype)`
Python 編譯期常數參數	`ct.Constant[T]`（型別可省略，最常 `ct.Constant[int]`）
C++ 編譯期常數型別 / literal	`ct::integral_constant` / `_ic`（`0_ic` → `integral_constant<0>`）
哪種形式可混搭 runtime 維度	`ct::extents`/`ct::shape` 的 brace 形式（NTTP 不行）
易混淆：tile vs array	tile = block 本地、shape 編譯期固定；array = global memory、所有 block 可見
Tile 共用的 compiler 後端	CUDA Tile IR（Python 與 C++ 執行語意相同）

常見陷阱

C++ triple-chevron 第二參數不是 thread 數而是固定 1，寫成其他值即錯。
__tile__ 與 __device__ 函式目前不能互相呼叫（限制未來可能解除），但可在同一 .cu 共存。
tile 維度只能是 2 的次方；shape／dtype 必須編譯期已知，runtime 維度需用 brace 形式的 extents。