Tile Programming

#cuda #programming-model #tile-programming #concept

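Key Points Overview

Item	Key point
Tile programming model	Code is written at the level of a whole thread block and describes operations on multidimensional collections of data (tiles); the compiler maps those operations onto the threads of the block
Programmer's responsibility	Specify only the grid dimensions; the compiler chooses the number of threads per block from the kernel's tile operations
Single control flow	A block follows a single control flow; conditionals and loops are supported, but there is no notion of warp divergence
Array (global array)	Multidimensional, stored in device memory, mutable, has a shape and a dtype
Tile	Exists only in tile code, local to a block, immutable, every dimension is a power of two known at compile time, may have no memory representation, cannot be a kernel parameter
Tile space	An array is conceptually partitioned into equal-sized, non-overlapping tiles; a tile-space index selects a tile
Load / Store	load reads a region of an array into a tile (out-of-bounds elements can be zero-filled); store is the inverse, and out-of-bounds writes are discarded; gather/scatter are also supported
Tile operations	Elementwise, matrix multiply, reductions, reshape/transpose, type conversion; when shapes differ, the smaller tile is broadcast automatically
Relationship to SIMT	The two models coexist and are chosen per kernel; SIMT is fine-grained, while tile is a higher-level abstraction that runs across architectures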

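Tile Programming Model Overview (1.2.2.3)

In addition to the SIMT model described earlier, CUDA supports a tile programming model. In this model the programmer no longer writes code for each thread. Code is written at the level of a whole thread block and describes operations on multidimensional collections of data called tiles. The compiler is responsible for mapping those operations onto the threads of the block.

A tile kernel is still launched on a grid of blocks; each block executes the tile kernel and can query its own position in the grid to decide which piece of the data it is responsible for.
The programmer specifies only the grid dimensions; the number of threads per block is chosen automatically by the compiler from the tile operations in the kernel.
A block follows a single control flow: conditionals and loops are still supported, but because the whole block shares one control flow, there is no notion of warp divergence.
Scalar operations (such as computing an index or a loop bound) are executed by a single thread of the block; tile operations (such as adding two tiles elementwise) are executed cooperatively and in parallel by all threads of the block.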
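
The launch structure described above can be sketched in plain Python. This is a sequential model of the semantics only, not a CUDA API: `launch`, `load_tile`, `store_tile`, and `vector_add_kernel` are hypothetical names introduced for illustration, and a real tile kernel would run its blocks in parallel on the GPU.

```python
# Plain-Python model of a tile kernel launch; not a CUDA API.
TILE = 4  # tile size: a power of two, fixed before the "kernel" runs


def load_tile(array, tile_index):
    """Return tile number `tile_index` of `array`, zero-filling past the end."""
    start = tile_index * TILE
    return tuple(array[i] if i < len(array) else 0
                 for i in range(start, start + TILE))


def store_tile(array, tile_index, tile):
    """Write `tile` back; elements that fall past the end are dropped."""
    start = tile_index * TILE
    for offset, value in enumerate(tile):
        if start + offset < len(array):
            array[start + offset] = value


def vector_add_kernel(block_id, a, b, c):
    """Body run once per block: one control flow, operations on whole tiles."""
    tile_a = load_tile(a, block_id)  # the block's grid position selects its data
    tile_b = load_tile(b, block_id)
    tile_c = tuple(x + y for x, y in zip(tile_a, tile_b))  # elementwise tile op
    store_tile(c, block_id, tile_c)


def launch(kernel, grid_size, *args):
    """Stand-in for a launch: the caller supplies only the grid size."""
    for block_id in range(grid_size):
        kernel(block_id, *args)


a = [1, 2, 3, 4, 5, 6]
b = [10, 20, 30, 40, 50, 60]
c = [0] * len(a)
grid_size = -(-len(a) // TILE)  # ceiling division: 6 elements need 2 tiles
launch(vector_add_kernel, grid_size, a, b, c)
print(c)
```

Note that the kernel body never mentions a thread index: `block_id` arithmetic is the scalar part, and the loads, the addition, and the store are the tile part.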

Important

Do not confuse blocks with tiles: a block is a unit of execution, and a tile is a unit of data. A single block can create and operate on many tiles of different shapes and data types.

SIMT model                              Tile model
─────────────                           ─────────────
programmer writes per-thread code       programmer writes per-block code
controls how each thread accesses data  describes operations on tiles
                                              ↓
                                        compiler maps operations onto the block's threads
                                        compiler chooses threads per block

Tip

Mnemonic: the tile model hands thread-level decisions to the compiler; the programmer only has to think about what operation a block performs on a piece of data.

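Arrays and Tiles (1.2.2.3.1)

A tile kernel works with two kinds of data: arrays and tiles. Their properties are nearly opposite, which makes this the easiest contrast in the topic to mix up; the table below is worth memorizing.

Aspect	Array (global array)	Tile
Location	Device memory	Exists only in tile code, local to a block
Mutability	Mutable; can be modified by store	Immutable; every operation produces a new tile
Dimensions	Multidimensional, described by a shape	Every dimension must be a power of two known at compile time
Memory representation	Always (in device memory)	Not necessarily; the compiler decides (registers / shared memory / other SM resources)
Usable as a kernel parameter	Yes	No; created and consumed entirely inside tile code
Attributes	Has a shape and a data type	A multidimensional collection of values

Array (global array): a multidimensional container whose elements live in device memory. Its contents can be modified by store operations inside a kernel, and it has a shape and a dtype.
Tile: a multidimensional collection of values that exists only in tile code and is local to a single block. It is immutable: every operation produces a new tile and never modifies an existing one.
A tile does not necessarily have a memory representation: the compiler decides how to hold it, possibly in registers, shared memory, or other SM resources.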

Warning

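Every dimension of a tile must be a power of two and known at compile time (the value has to be determinable before the kernel runs, not computed at run time). In addition, a tile cannot be passed as a kernel parameter. Neither restriction applies to ordinary arrays.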
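
The power-of-two rule is easy to check mechanically. The helpers below are a small illustrative sketch; `is_power_of_two`, `is_valid_tile_shape`, and `next_power_of_two` are hypothetical names, not part of any CUDA API. The last one shows one way to pick a legal tile dimension that covers an awkward array extent, under the assumption that the surplus elements are handled by the out-of-bounds rules of load and store.

```python
def is_power_of_two(n):
    """True for 1, 2, 4, 8, ...; exactly one bit is set in such numbers."""
    return isinstance(n, int) and n > 0 and (n & (n - 1)) == 0


def is_valid_tile_shape(shape):
    """A tile shape is acceptable here when every dimension is a power of two."""
    return all(is_power_of_two(dim) for dim in shape)


def next_power_of_two(n):
    """Smallest power of two that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


print(is_valid_tile_shape((16, 32)))  # both dimensions are powers of two
print(is_valid_tile_shape((16, 24)))  # 24 is not a power of two
print(next_power_of_two(100))         # a tile dimension that covers 100 elements
```

In a real tile kernel the shape would also have to be a compile-time constant; this Python check only models the arithmetic side of the rule.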

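Tile Space and Data Movement (1.2.2.3.2)

Data moves between arrays and tiles through load and store, both of which are built on the concept of tile space.

Tile space: the index space formed by conceptually partitioning an array into equal-sized, non-overlapping tiles.
- Example: if the array shape is (M, N) and the load specifies tile shape (tm, tn), the array is conceptually divided into ⌈M/tm⌉ rows × ⌈N/tn⌉ columns of tiles.
- A tile-space index such as (i, j) selects which tile to load; the load returns a tile of shape (tm, tn).
Out-of-bounds handling: when an array dimension is not an integer multiple of the tile dimension, edge tiles extend past the array boundary; the load specifies how out-of-bounds elements are treated, for example filling them with zeros.
Store: the inverse of load. Given a tile and a tile-space index, it writes the tile's elements back to the corresponding region of the array; writes that fall outside the array boundary are silently discarded.
Gather / Scatter: tile programs also support these, so a kernel can load from arbitrary positions in an array or store to arbitrary positions.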
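
To make the tile-space arithmetic concrete, here is a minimal plain-Python model of a 2-D load with zero fill. It is a sketch of the semantics described above, not the CUDA API: `tile_space_shape` and `load` are hypothetical names, the array is a list of rows, and the tile is returned as nested tuples to mirror immutability.

```python
def tile_space_shape(array_shape, tile_shape):
    """Number of tiles along each axis: ceil(M / tm), ceil(N / tn)."""
    return tuple(-(-n // t) for n, t in zip(array_shape, tile_shape))


def load(array, index, tile_shape, fill=0):
    """Read tile (i, j) of shape (tm, tn); out-of-bounds elements become `fill`."""
    (i, j), (tm, tn) = index, tile_shape
    rows, cols = len(array), len(array[0])
    return tuple(
        tuple(
            array[r][c] if r < rows and c < cols else fill
            for c in range(j * tn, (j + 1) * tn)
        )
        for r in range(i * tm, (i + 1) * tm)
    )


# A 3 x 5 array whose element at (r, c) is r * 10 + c.
array = [[r * 10 + c for c in range(5)] for r in range(3)]

print(tile_space_shape((3, 5), (2, 4)))  # 2 x 2 tiles cover the array
print(load(array, (0, 0), (2, 4)))       # interior tile, fully in bounds
print(load(array, (1, 1), (2, 4)))       # corner tile, mostly zero fill
```

The corner tile (1, 1) covers rows 2 to 3 and columns 4 to 7, of which only element (2, 4) exists in the array; every other position comes back as the fill value.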

Array (M, N)                        Tile space  (⌈M/tm⌉ × ⌈N/tn⌉)
┌──────────────┐                    ┌────┬────┬────┐
│              │    load (i,j)      │0,0 │0,1 │0,2 │
│   array in   │  ───────────────►  ├────┼────┼────┤
│ device mem   │   store (i,j)      │1,0 │1,1 │1,2 │ ← edge tile out of bounds → zero fill
│              │  ◄───────────────  ├────┼────┼────┤
└──────────────┘                    │... │... │... │
   (mutable)         tile (tm,tn)   └────┴────┴────┘
                     (immutable)

Warning

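The out-of-bounds behavior of load and store is asymmetric: an out-of-bounds read in a load is handled as specified (for example, zero-filled), while an out-of-bounds write in a store is discarded. This is a common point of confusion on exams.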
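
The store side of that asymmetry can be modeled the same way. The sketch below is plain Python with a hypothetical `store` helper, not the CUDA API; it writes a full (2, 4) tile to a corner of a 3 x 5 array and shows that only the in-bounds element lands.

```python
def store(array, index, tile):
    """Write `tile` at tile-space index (i, j); out-of-bounds writes are dropped."""
    i, j = index
    tm, tn = len(tile), len(tile[0])
    rows, cols = len(array), len(array[0])
    for dr in range(tm):
        for dc in range(tn):
            r, c = i * tm + dr, j * tn + dc
            if r < rows and c < cols:
                array[r][c] = tile[dr][dc]
            # Otherwise the write is skipped: no error, and the array never grows.


array = [[0] * 5 for _ in range(3)]   # 3 x 5 array of zeros
tile = ((1, 2, 3, 4), (5, 6, 7, 8))   # a full (2, 4) tile
store(array, (1, 1), tile)            # targets rows 2-3, columns 4-7
for row in array:
    print(row)
```

Seven of the eight tile elements target positions outside the array and vanish; the array keeps its 3 x 5 shape.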

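Tile Operations (1.2.2.3.3)

Tile programs provide a set of built-in operations that act on tiles:

Elementwise arithmetic: arithmetic applied element by element.
Matrix multiplication: the matrix product of two tiles.
Reductions: reduce along one or more axes, for example sum or maximum.
Shape manipulation: operations on shape, for example reshape and transpose.
Type conversion: converting a tile's elements to another data type.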
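
A few of these operations can be modeled on nested tuples to show what "produces a new tile" means in practice. This is an illustrative plain-Python sketch with hypothetical helper names (`reduce_axis`, `transpose`), not the built-in tile operations themselves.

```python
from functools import reduce as fold
from operator import add


def reduce_axis(tile, axis, op):
    """Reduce a 2-D tile along `axis` with binary `op`; returns a new 1-D tile."""
    if axis == 0:
        lanes = zip(*tile)   # each lane is one column
    elif axis == 1:
        lanes = tile         # each lane is one row
    else:
        raise ValueError("axis must be 0 or 1 for a 2-D tile")
    return tuple(fold(op, lane) for lane in lanes)


def transpose(tile):
    """Swap the two axes of a 2-D tile, returning a new tile."""
    return tuple(zip(*tile))


tile = ((1, 7, 3, 4), (5, 2, 9, 8))                        # shape (2, 4)

print(reduce_axis(tile, 0, add))                           # column sums
print(reduce_axis(tile, 1, max))                           # row maxima
print(transpose(tile))                                     # shape (4, 2)
print(tuple(tuple(float(v) for v in row) for row in tile)) # type conversion
print(tile)                                                # input is unchanged
```

Every call returns a fresh tuple and leaves `tile` untouched, which mirrors the immutability of tiles.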

Tip

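Broadcasting: when two tiles of different shapes are combined in one operation, the smaller tile is automatically expanded to match the larger one before the operation is applied. This removes the need to copy data by hand.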
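
As a rough model of that expansion, the sketch below assumes NumPy-style rules for 2-D tiles, where a dimension of size 1 is stretched to the matching size of the other operand. That rule is an assumption made for illustration; the note itself only says that the smaller tile is expanded. `broadcast_to` and `add_tiles` are hypothetical names.

```python
def broadcast_to(tile, shape):
    """Expand a 2-D tile to `shape` by repeating any dimension of size 1."""
    rows, cols = shape
    src_rows, src_cols = len(tile), len(tile[0])
    if src_rows not in (1, rows) or src_cols not in (1, cols):
        raise ValueError("shapes are not broadcast-compatible")
    return tuple(
        tuple(
            tile[r if src_rows > 1 else 0][c if src_cols > 1 else 0]
            for c in range(cols)
        )
        for r in range(rows)
    )


def add_tiles(a, b):
    """Elementwise add after broadcasting both operands to a common shape."""
    shape = (max(len(a), len(b)), max(len(a[0]), len(b[0])))
    a, b = broadcast_to(a, shape), broadcast_to(b, shape)
    return tuple(
        tuple(x + y for x, y in zip(row_a, row_b))
        for row_a, row_b in zip(a, b)
    )


data = ((1, 2, 3, 4), (5, 6, 7, 8))   # shape (2, 4)
row_bias = ((10, 20, 30, 40),)        # shape (1, 4), repeated down the rows
col_bias = ((100,), (200,))           # shape (2, 1), repeated across the columns

print(add_tiles(data, row_bias))
print(add_tiles(data, col_bias))
```

The caller never builds the expanded (2, 4) copies of `row_bias` or `col_bias`; the operation does it, which is the convenience the tip describes.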

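Relationship to SIMT (1.2.2.3.4)

Tile programming and SIMT programming coexist in CUDA; neither replaces the other.

Aspect	SIMT program	Tile program
Abstraction level	Fine-grained: control over individual threads	Higher-level abstraction: simplifies kernel development
Thread-level decisions	Made by the programmer	Left to the compiler
Across architectures	May need tuning	The same tile kernel runs on different GPU architectures without source changes
When to use	Still required by some algorithms and optimization techniques	When a higher-level, portable description is wanted

An application can contain both SIMT and tile kernels, and the two kinds of kernel can operate on the same data in device memory.
The choice of programming model is a per-kernel decision.
Because thread-level decisions are left to the compiler, the same tile kernel can run on different GPU architectures without changes to its source code.
Both models are built on the same underlying hardware (SMs, thread blocks, grids) and use the same device memory spaces.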
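
The per-kernel choice can be illustrated with a small plain-Python model in which one SIMT-style kernel and one tile-style kernel update the same buffer in turn. This is a sequential sketch of the two programming styles, not CUDA code; both kernel functions and the launch loops are hypothetical stand-ins.

```python
TILE = 4               # tile size for the tile-style kernel (a power of two)
THREADS_PER_BLOCK = 4  # chosen by the programmer for the SIMT-style kernel


def simt_scale_kernel(thread_index, data, factor):
    """SIMT style: the body is written for one thread, which guards its own index."""
    if thread_index < len(data):
        data[thread_index] *= factor


def tile_increment_kernel(block_id, data, amount):
    """Tile style: the body is written for one block and works on a whole tile."""
    start = block_id * TILE
    tile = tuple(data[i] if i < len(data) else 0
                 for i in range(start, start + TILE))   # load with zero fill
    result = tuple(x + amount for x in tile)            # elementwise tile op
    for offset, value in enumerate(result):             # store, dropping overruns
        if start + offset < len(data):
            data[start + offset] = value


data = [1, 2, 3, 4, 5, 6]   # stands in for one array in device memory
num_blocks = -(-len(data) // TILE)

# SIMT launch: the programmer picks both the grid and the threads per block.
for block_id in range(num_blocks):
    for thread_id in range(THREADS_PER_BLOCK):
        simt_scale_kernel(block_id * THREADS_PER_BLOCK + thread_id, data, 10)

# Tile launch: the programmer picks only the grid.
for block_id in range(num_blocks):
    tile_increment_kernel(block_id, data, 1)

print(data)
```

Both kernels read and write the same `data` list, mirroring how SIMT and tile kernels in one application can share arrays in device memory.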

             Same device memory (global memory)
        ┌──────────────────────────────────────────┐
        │ SMs / thread blocks / grids (shared HW)  │
        └──────────────────────────────────────────┘
            ▲                              ▲
            │ per-kernel choice            │
      ┌─────┴──────┐                ┌──────┴──────┐
      │ SIMT kernel│  can coexist   │ Tile kernel │
      │fine-grained│  in the same   │ high-level, │
      │  control   │  application   │  portable   │
      └────────────┘                └─────────────┘

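Exam/Quiz Key Points

Scenario/keyword	Answer
At what level is a tile program written?	The whole thread block
Who decides the number of threads per block?	The compiler (based on the tile operations); the programmer gives only the grid dimensions
Does the tile model have warp divergence?	No; a block follows a single control flow
block vs tile	block = unit of execution; tile = unit of data; one block can hold many tiles
Is an array mutable or immutable?	Arrays are mutable; tiles are immutable (every operation produces a new tile)
Constraint on tile dimensions	Each must be a power of two known at compile time
Can a tile be a kernel parameter?	No
Does a tile always live in memory?	Not necessarily; the compiler decides (registers / shared memory / SM resources)
What is tile space?	The index space formed by partitioning an array into equal-sized, non-overlapping tiles
Out-of-bounds load vs out-of-bounds store	load handles it as specified (for example, zero fill); out-of-bounds store writes are discarded
Operations for access at arbitrary positions	gather / scatter
Combining tiles of different shapes	The smaller one is automatically broadcast (expanded)
Can the same tile kernel run across architectures?	Yes; thread-level decisions are left to the compiler, so no source changes are needed
Can SIMT and tile be used together?	Yes; the choice is per kernel, and both share the same device memory and hardware
Who executes scalar vs tile operations?	Scalar operations run on a single thread; tile operations run in parallel on all threads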
