Tile Atomics and Optimization

#cuda #tile-programming #atomics #concept

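Key Points Overview

Item	Key points
Tile atomics semantics	One atomic update per tile element; each element's update is atomic, but the call as a whole is not; the order among elements is unspecified
Cross-block contention	Each block produces a partial result and merges it into the same global memory location with an atomic; requires device scope
Intra-block contention	Multiple elements of one tile write to the same location; block scope suffices; addition is commutative, so the order does not affect the result
Python atomics	Addressed by index (same convention as `ct.gather`/`ct.scatter`); defaults: bounds checking on, `ACQ_REL`, device scope; `TiledView.atomic_add` returns no value and lowers to a PTX atomic reduction
C++ atomics	Take a pointer + value; memory order and thread scope are compile-time type tags; scope defaults to system-wide when omitted
Supported atomic operations	`atomic_and/or/xor/max/min/add/xchng/cas`, all element-wise
Optimization hints	Metadata attached to a source construct; does not change semantics and the compiler may ignore it; per-construct, optionally per-architecture
C++ hint syntax	`[[ cutile::hint(arch, kind=value, ...) ]]`; arch follows the `__CUDA_ARCH__` convention (900/1000); `0` = architecture-agnostic
Python hints	Kernel-level hints are `@ct.kernel(...)` kwargs; per-call hints are call-site kwargs; per-arch values use `ByTarget(...)`; `replace_hints` supports autotuning
Hint kinds	CTAs per cluster, occupancy, memory access latency, allow TMA
`__restrict__`	Guarantees the memory is accessed only through this pointer; the compiler need not synchronize conservatively for possible overlap
16-byte alignment	`ct::assume_aligned(p, 16_ic)`; prerequisite for `partition_view` to use TMA; `cudaMalloc` guarantees at least 16-byte alignment
`partition_view` / `ct::irange`	The view form can lower to TMA; `ct::irange` lets the compiler pipeline and vectorize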

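2.4.10 Atomic Memory Operations

Tile code needs memory atomics in two situations:

Cross-block contention: each block computes a partial result, then uses an atomic to merge it into the same global memory location.
Intra-block contention: multiple elements within one tile write to the same memory location.

Important

An atomic on a tile performs one atomic update per tile element. Each element's operation is atomic, but the call as a whole is not, and the order in which the per-element atomics execute is unspecified.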
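
The consequence of "one atomic per element, order unspecified" can be checked with a small host-side model. This is only a sketch in plain Python, not the cuTile API: `tile_atomic` is a hypothetical helper that applies one read-modify-write per lane in a caller-chosen order, and the test enumerates every possible order.

```python
import itertools

def tile_atomic(memory, slot, values, op, order):
    """Apply one read-modify-write per tile element, in the given lane order."""
    for lane in order:
        memory[slot] = op(memory[slot], values[lane])
    return memory[slot]

values = [1, 2, 3]
orders = list(itertools.permutations(range(len(values))))

# Addition is commutative and associative: every ordering gives the same total.
add_results = {tile_atomic([10], 0, values, lambda old, v: old + v, o)
               for o in orders}

# Exchange simply overwrites: the survivor is whichever lane ran last.
xchg_results = {tile_atomic([10], 0, values, lambda old, v: v, o)
                for o in orders}
```

In this model an add lands on a single result for all orderings, while an exchange to one contended location can leave any lane's value behind. That is why the unspecified order is harmless for accumulators but matters for order-sensitive operations.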

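Call Differences Between Python and C++

Aspect	Python	C++
Addressing	Array index (same convention as `ct.gather`/`ct.scatter`)	Pointer + value: a raw pointer + scalar for a single location; a tile of pointers + tile of values for multiple locations
Memory order	Optional parameter, defaults to `ACQ_REL`	Compile-time type tag at the call site, e.g. `ct::memory_order_relaxed_t{}`
Thread scope	Optional parameter, defaults to device scope	Type tag (same form); defaults to system-wide when omitted
Bounds checking	Optional parameter, on by default	n/a
`TiledView` method	`TiledView.atomic_add(index, update)`: addressed by tile-space index, returns no value, lowers to a PTX atomic reduction	n/a

Tip

In Python, prefer the TiledView form (e.g. TiledView.atomic_add) when the prior value is not needed. It performs better because it lowers directly to a PTX atomic reduction.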

2.4.10.1 Cross-block Contention

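Different blocks write to the same memory location, out. Without an atomic, blocks running in parallel would produce a wrong result. Device thread scope is required here (ct::thread_scope_device_t{} in C++; the Python default thread scope is already device) because the result must be visible to every block on the device.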

__tile_global__ void block_sum(int* __restrict__ arr, int* __restrict__ out,
                               std::size_t N) {
  namespace ct = cuda::tiles;
  using namespace ct::literals;
  constexpr auto TILE = 16_ic;
  arr = ct::assume_aligned(arr, 16_ic);
  out = ct::assume_aligned(out, 16_ic);
  auto aView = ct::partition_view{ct::tensor_span{arr, ct::extents{N}},
                                  ct::shape{TILE}};
  int bid = ct::bid().x;
  auto tile    = aView.load_masked(bid);   // partial final tile -> OOB lanes = 0
  auto partial = ct::sum(tile, 0_ic);      // reduce to a 1-element tile
  ct::atomic_add(out, (int)partial,        // accumulate into out[0]
                 ct::memory_order_relaxed_t{},  // accumulator -> relaxed is enough
                 ct::thread_scope_device_t{});  // visible across the whole device
}

@ct.kernel
def block_sum(arr, out, TILE: ct.Constant[int]):
    bid = ct.bid(0)
    # partial final tile -> OOB lanes default to 0
    tile = ct.load(arr, index=(bid,), shape=(TILE,),
                   padding_mode=ct.PaddingMode.ZERO)
    partial = ct.sum(tile)                            # reduce to a scalar
    out.tiled_view((1,)).atomic_add((0,), partial)    # atomically accumulate into out[0]

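Key point: each block's partial sum is discarded once it has been accumulated into out[0], and the prior value is not needed, so the Python version uses TiledView.atomic_add. Relaxed memory order is sufficient for an accumulator, which is why the C++ version passes ct::memory_order_relaxed_t{}.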

Block 0 ─ partial0 ─┐
Block 1 ─ partial1 ─┼─ atomic_add ──> out[0]   (device scope: visible across blocks)
Block 2 ─ partial2 ─┘
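
The same pattern can be exercised on the host as a rough model. The sketch below uses plain Python threads, not cuTile: each thread plays one block, a `threading.Lock` stands in for the hardware atomic, and the final partial tile is zero-padded the way the kernels above pad out-of-bounds lanes. `block_sum_model` and its parameters are illustrative names only.

```python
import threading

TILE = 16

def block_sum_model(arr, tile=TILE):
    """Sum arr with one thread per tile, merging partials into out[0]."""
    n_blocks = (len(arr) + tile - 1) // tile       # last block may be partial
    out = [0]
    out_lock = threading.Lock()                    # stands in for the atomic

    def block(bid):
        chunk = arr[bid * tile:(bid + 1) * tile]
        chunk = chunk + [0] * (tile - len(chunk))  # zero-pad out-of-bounds lanes
        partial = sum(chunk)                       # block-local reduction
        with out_lock:                             # read-modify-write on out[0]
            out[0] += partial

    threads = [threading.Thread(target=block, args=(b,)) for b in range(n_blocks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out[0]

total = block_sum_model(list(range(100)))  # 7 blocks, the last one partial
```

Removing the lock turns `out[0] += partial` into an unprotected read-modify-write, which is the host-side analogue of the wrong results the text warns about when blocks merge without an atomic.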

2.4.10.2 Intra-block Contention

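Every value in a tile is atomically added to the same memory location. In the example below, every element of the ptrs tile points to the same location, slot, and each element produced by ct::iota<i32x16>() is atomically added to it. Because contention is confined to a single block, block thread scope (thread_scope_block_t{}) suffices.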

using i32x16 = ct::tile<int, ct::shape<16>>;
int* slot = /* pointer to the contended location */;
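// All 16 lanes target the same address; addition is commutative, so the
// unspecified order does not affect the sum, and contention stays within one
// block, so block scope suffices.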
auto ptrs = ct::full<ct::tile<int*, ct::shape<16>>>(slot);
ct::atomic_add(ptrs, ct::iota<i32x16>(),
               ct::memory_order_relaxed_t{},
               ct::thread_scope_block_t{});

tile lanes:  iota -> [0][1][2] ... [15]
                       \  \  |  ... /
                        atomic_add ──> *slot   (block scope)
   One atomic per element; order unspecified; addition is commutative, so the result is stable

Warning

This is only a demonstration. To sum a tile into a scalar within a block, use a tile reduction instead (see 02-Programming-GPUs/12-Tile-Operations-and-Primitives); it performs better than forcing the job through intra-block atomics.

2.4.10.3 Supported Atomic Operations

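All operations are element-wise; they differ in how the incoming value is combined with the value currently in memory:

Operation	Behavior
`atomic_and` / `atomic_or` / `atomic_xor`	Element-wise bitwise AND / OR / XOR of the incoming value with the memory value
`atomic_max` / `atomic_min`	Compares element-wise and stores the larger / smaller value
`atomic_add`	Adds the incoming value to the memory value and stores the result back
`atomic_xchng`	Writes the incoming value and returns the prior value
`atomic_cas`	Compares the memory value with the expected value element-wise; replaces it with the desired value only on a match

Tip

For complete documentation of the atomic operations, see the memory operations chapters of the CUDA Tile C++ API Reference or the cuTile Python API Reference.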
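
The combine rules in the table can be written down as a single-element reference model. This is a sketch in plain Python rather than the tile API: `atomic_rmw` and `atomic_cas` here are hypothetical host-side helpers, and having every helper return the prior value is a modeling convenience (the table above only states that for `atomic_xchng`).

```python
import operator

# combine(old, value) -> value stored back to memory
_COMBINE = {
    "atomic_and": operator.and_,
    "atomic_or": operator.or_,
    "atomic_xor": operator.xor,
    "atomic_max": max,
    "atomic_min": min,
    "atomic_add": operator.add,
    "atomic_xchng": lambda old, value: value,  # overwrite unconditionally
}

def atomic_rmw(memory, index, name, value):
    """Model one element's read-modify-write; returns the prior value."""
    old = memory[index]
    memory[index] = _COMBINE[name](old, value)
    return old

def atomic_cas(memory, index, expected, desired):
    """Store desired only if memory holds expected; returns the prior value."""
    old = memory[index]
    if old == expected:
        memory[index] = desired
    return old

mem = [0b1100]
trace = [atomic_rmw(mem, 0, "atomic_and", 0b1010),  # 12 & 10 -> 8
         atomic_rmw(mem, 0, "atomic_or", 0b0001),   # 8 | 1   -> 9
         atomic_rmw(mem, 0, "atomic_xchng", 42)]    # 9       -> 42
```

A tile-level call would apply the chosen helper once per element, in an unspecified order, as described at the start of this section.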

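2.4.11 Optimization Hints

An optimization hint is metadata attached to a source construct (a tile kernel function, a load/store call site, and so on) that guides the compiler's code generation.

Important

Hints do not change program semantics: a kernel compiles and produces the same results with or without them, so they can be added, removed, or tuned freely without affecting correctness. The compiler may also ignore any hint.

Two properties are common to all hints:

Per-construct: a hint applies only to the kernel function or call expression it is attached to and does not affect surrounding code.
Optionally per-architecture: each hint can take different values for different GPU architectures, or a single value that applies to every target.

The two languages expose hints differently: C++ uses C++ attributes placed on declarations and statements; Python uses keyword arguments on the kernel decorator and on each memory-operation call site. The hint kinds (what each hint controls) are shared by both languages.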

2.4.11.1 C++: `cutile::hint` Attribute

[[ cutile::hint(arch, kind1=value1, kind2=value2, ...) ]]

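The first argument is the target architecture, encoded as an integer following the __CUDA_ARCH__ convention (for example, 900 for sm_90 and 1000 for sm_100).
The special value 0 means architecture-agnostic (applies to every target architecture).
Each remaining argument is a kind=value pair.

Placement rules:

Construct	Placement
Tile kernel function	On the function declaration
Memory operation (`ct::load`, `ct::store`, `partition_view` load/store)	On the expression-statement containing the call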

[[ cutile::hint(900,  num_cta_in_cga=4),    // sm_90:  prefer 4 CTAs/cluster
   cutile::hint(1000, num_cta_in_cga=8) ]]  // sm_100: prefer 8 CTAs/cluster
__tile_global__ void optimization_hints(float* __restrict__ in,
                                         float* __restrict__ out) {
  // ...
  ct::tile<float, ct::shape<8>> tile;
  [[ cutile::hint(0, latency=8) ]]          // expression-statement hint: this load is bandwidth-heavy
  tile = inView.load(bx);
  outView.store(tile, bx);
}

Important

When one construct carries multiple hints of the same kind, an architecture-specific hint overrides the architecture-agnostic one.
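
The override rule can be expressed as a small lookup. The following is a sketch of the rule as stated, in plain Python; `resolve_hint` and the list-of-pairs representation are hypothetical and are not how the compiler stores hints.

```python
ARCH_AGNOSTIC = 0  # the special arch value that applies to every target

def resolve_hint(hints, kind, target_arch):
    """Pick the value of one hint kind for one target architecture.

    hints is a list of (arch, {kind: value}) pairs as written on a single
    construct; arch uses the __CUDA_ARCH__-style integer (900, 1000, or 0).
    """
    specific = None
    generic = None
    for arch, kinds in hints:
        if kind not in kinds:
            continue
        if arch == target_arch:
            specific = kinds[kind]
        elif arch == ARCH_AGNOSTIC:
            generic = kinds[kind]
    # An architecture-specific value wins over the architecture-agnostic one.
    return specific if specific is not None else generic

hints = [(0, {"num_cta_in_cga": 2, "occupancy": 4}),
         (1000, {"num_cta_in_cga": 8})]
```

With these hints, a build for sm_100 resolves `num_cta_in_cga` to 8, a build for sm_90 falls back to the agnostic 2, and `occupancy` resolves to 4 on both.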

2.4.11.2 Python: Decorator Arguments and Call-Site Keywords

Kernel-level hints: passed as keyword arguments to the @ct.kernel(...) decorator. The compiled kernel object also has a .replace_hints(**hints) method that returns a new kernel with the given hints overridden; the new kernel has its own JIT cache, which makes replace_hints a natural building block for autotuning loops.
Per-call hints: passed as keyword arguments at memory-operation call sites: ct.load/ct.store, TiledView.load/TiledView.store, and ct.gather/ct.scatter.

Wrap per-architecture values in cuda.tile.ByTarget(*, default=..., sm_XXX=..., sm_YYY=...); architecture keys must be strings of the form "sm_<major><minor>" (e.g. "sm_100", "sm_120"). A plain value not wrapped in ByTarget applies to every target, equivalent to arch=0 in C++.

@ct.kernel(num_ctas=ByTarget(sm_90=4, sm_100=8))   # kernel-level, per-arch hint
def optimization_hints(in_, out, TILE: ct.Constant[int]):
    bid = ct.bid(0)
    tile = ct.load(in_, index=(bid,), shape=(TILE,), latency=8)  # per-call hint
    ct.store(out, index=(bid,), tile=tile)

# Autotuning: produce a new kernel (with its own JIT cache) without changing the source
tuned_kernel = optimization_hints.replace_hints(num_ctas=8)
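
An autotuning loop built on replace_hints might look roughly like the sketch below. It runs without a GPU because `StubKernel` is a hypothetical stand-in that models only the "returns a new kernel with overridden hints" behavior, and `fake_cost` is a made-up cost function, not a measurement; with a real kernel, `measure` would launch and time the variant.

```python
class StubKernel:
    """Hypothetical stand-in for a compiled kernel; models replace_hints only."""

    def __init__(self, **hints):
        self.hints = dict(hints)

    def replace_hints(self, **hints):
        # Return a new object; the original kernel is left untouched.
        return StubKernel(**{**self.hints, **hints})

def autotune(kernel, candidates, measure):
    """Try each hint combination and keep the variant with the lowest cost."""
    best_kernel, best_cost = None, float("inf")
    for hints in candidates:
        variant = kernel.replace_hints(**hints)
        cost = measure(variant)
        if cost < best_cost:
            best_kernel, best_cost = variant, cost
    return best_kernel, best_cost

def fake_cost(kernel):
    # Made-up cost surface with its minimum at num_ctas=4, occupancy=2.
    return abs(kernel.hints["num_ctas"] - 4) + abs(kernel.hints["occupancy"] - 2)

base = StubKernel(num_ctas=1, occupancy=1)
grid = [{"num_ctas": c, "occupancy": o} for c in (1, 2, 4, 8) for o in (1, 2, 4)]
best, cost = autotune(base, grid, fake_cost)
```

Because hints do not change semantics, every variant in the sweep computes the same result; only the cost differs, so the loop never needs to re-validate outputs.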

2.4.11.3 Hint Kinds

Both languages share the same set of hints; the C++ and Python names are just different spellings of the same underlying hint.

Hint	C++ name	Python name	Allowed values	Applies to / meaning
CTAs per cluster	`num_cta_in_cga` (kernel attr)	`num_ctas` (`@ct.kernel` argument)	`1, 2, 4, 8, 16`; only 1 applies on sm_80	Preferred number of CTAs to launch per CGA
Occupancy	`occupancy` (kernel attr)	`occupancy` (`@ct.kernel` argument)	Integer in `[1, 32]`	Target number of active CTAs per SM; the compiler treats it as advice and follows it where it can
Memory access latency	`latency` (on the expression-statement containing the call)	`latency` (call-site kwarg)	Integer in `[1, 10]`; 1 = light DRAM traffic, 10 = heavy	Applies to tile-space load/store and gather/scatter; larger values usually schedule deeper prefetching
Allow TMA	`allow_tma` (on the expression-statement)	`allow_tma` (call-site kwarg)	`true`/`false` (C++), `True`/`False` (Python); TMA is allowed by default	Tile-space load/store only (gather/scatter do not accept it); false tells the compiler not to lower this load/store to TMA

Warning

Allow TMA and latency differ in scope: latency applies to both tile-space load/store and gather/scatter, whereas allow_tma applies only to tile-space load/store; gather/scatter do not accept it.

Hint resolution precedence (same kind, same construct):
  arch-specific (e.g. sm_100=8)  ──overrides──>  arch-agnostic (arch=0 / plain value)
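
The allowed values and scopes from the table can be collected into one checker, which is handy when generating candidate hint sets for a sweep. This is a sketch that only encodes the table above; `validate_hint`, the construct labels, and the use of Python-side hint names are illustrative choices, not part of either API.

```python
TILE_SPACE_OPS = {"load", "store"}      # tile-space load/store call sites
INDEXED_OPS = {"gather", "scatter"}     # gather/scatter call sites

def _is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)

def validate_hint(kind, value, construct, arch="sm_100"):
    """Return True if (kind, value) is allowed on this construct and arch."""
    if kind == "num_ctas":
        if construct != "kernel" or not _is_int(value):
            return False
        allowed = (1,) if arch == "sm_80" else (1, 2, 4, 8, 16)
        return value in allowed
    if kind == "occupancy":
        return construct == "kernel" and _is_int(value) and 1 <= value <= 32
    if kind == "latency":
        return (construct in TILE_SPACE_OPS | INDEXED_OPS
                and _is_int(value) and 1 <= value <= 10)
    if kind == "allow_tma":
        return construct in TILE_SPACE_OPS and isinstance(value, bool)
    return False  # unknown hint kind
```

For example, `validate_hint("latency", 10, "gather")` passes while `validate_hint("allow_tma", False, "scatter")` does not, mirroring the scope difference called out in the warning.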

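2.4.12 C++ Performance Tips

The C++ kernels in this guide all use the same set of annotations and idioms; this section explains what they do and why they matter.

2.4.12.1 Use `__restrict__` Pointers for Memory Arrays

__restrict__ tells the compiler that, for the lifetime of the pointer, the memory region it accesses is accessed only through that pointer. In tile C++, marking qualifying array pointers __restrict__ is key to good memory-operation performance.

Why it matters, using an element-wise copy as the example:

Case	Compiler behavior
Arrays do not overlap (`__restrict__`)	Loads can be parallelized into multiple independent reads and stores into multiple writes, each write depending only on the load of the element it writes, so reads and writes can interleave
Arrays may overlap (no `__restrict__`)	All loads for the whole tile must complete before any store is issued; otherwise a store could overwrite an element before it is read. The result is conservative code that cannot interleave

Non-overlapping (__restrict__):   load─┐ load─┐ load─┐   reads and writes can interleave and pipeline
                                  store┘ store┘ store┘

May overlap (no __restrict__): [ all loads ] ──barrier──> [ all stores ]

Warning

Marking a pointer __restrict__ when the memory region can in fact be accessed through another pointer results in undefined behavior.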
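
The hazard behind both the conservative schedule and the warning can be reproduced on the host. The sketch below is plain Python with lists standing in for device memory; the two copy functions are hypothetical models of the "all loads, then all stores" schedule and the interleaved schedule that __restrict__ permits.

```python
def copy_loads_first(buf, src, dst, n):
    """Conservative schedule: every load completes before any store."""
    tile = [buf[src + i] for i in range(n)]
    for i in range(n):
        buf[dst + i] = tile[i]
    return buf

def copy_interleaved(buf, src, dst, n):
    """Interleaved schedule: each store follows only its own load."""
    for i in range(n):
        buf[dst + i] = buf[src + i]
    return buf

# Overlapping ranges: copy elements 0..3 onto 1..4 of the same buffer.
overlap_conservative = copy_loads_first([0, 1, 2, 3, 4, 5], 0, 1, 4)
overlap_interleaved = copy_interleaved([0, 1, 2, 3, 4, 5], 0, 1, 4)

# Disjoint ranges: copy elements 0..1 onto 4..5.
disjoint_conservative = copy_loads_first([1, 2, 3, 4, 0, 0], 0, 4, 2)
disjoint_interleaved = copy_interleaved([1, 2, 3, 4, 0, 0], 0, 4, 2)
```

With disjoint ranges the two schedules agree, so the compiler is free to pick the faster one. With overlapping ranges the interleaved schedule overwrites elements before they are read, which is the kind of silent corruption that a wrongly applied __restrict__ makes possible.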

2.4.12.2 Mark Array Pointers as 16-Byte Aligned

Use ct::assume_aligned to mark an array pointer as 16-byte aligned:

__tile_global__ void foo(float* __restrict__ in) {
  namespace ct = cuda::tiles;
  using namespace ct::literals;
  in = ct::assume_aligned(in, 16_ic);
  ct::tensor_span t{in, ct::extents{256_ic, 256_ic}};
  ct::partition_view{t, ct::shape{4_ic, 4_ic}};
}

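This alignment guarantee is required for ct::partition_view to use the Tensor Memory Accelerator (TMA).
The pointer supplied at run time must really be 16-byte aligned; otherwise the behavior is undefined.
Pointers returned by CUDA allocators such as cudaMalloc are guaranteed to be at least 16-byte aligned.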
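
A common way to break the run-time requirement is to pass a pointer into the middle of an allocation. The arithmetic can be checked with a few lines of plain Python; `base` below is a hypothetical address chosen for illustration, and the 4-byte item size assumes a float array.

```python
def is_aligned(address, alignment=16):
    """True if address is a multiple of alignment (in bytes)."""
    return address % alignment == 0

def element_address(base, index, itemsize=4):
    """Byte address of element `index` in an array of itemsize-byte elements."""
    return base + index * itemsize

base = 0x7F00_0000_0100  # hypothetical allocation address, 16-byte aligned

# Which starting elements of a float array keep 16-byte alignment?
aligned_starts = [i for i in range(9) if is_aligned(element_address(base, i))]
```

Only every fourth float offset stays 16-byte aligned, so a sub-array starting at, say, element 1 must not be passed through ct::assume_aligned(..., 16_ic) even though the underlying allocation is aligned.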

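2.4.12.3 Prefer `ct::partition_view` for Structured Access

For structured memory access, prefer ct::partition_view over the gather/scatter forms of ct::load/ct::store. The view-based form can lower to TMA on supporting hardware, which is much faster than per-element gathers.

2.4.12.4 Use `ct::irange` for Bounded Loops

When iterating over a fixed range, use ct::irange instead of a plain for loop. The structured form lets the compiler pipeline and vectorize the loop, optimizations it cannot perform when the loop bound and step are opaque integer expressions.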

for (auto idx : ct::irange(lowerBound, upperBound, step)) {
  // ...
}

Tip

The three TMA-related techniques depend on one another: only when __restrict__, 16-byte alignment, and partition_view are used together can the compiler lower memory accesses to the fastest TMA path.

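Exam / Quiz Key Points

Scenario / keyword	Answer
Is a tile atomic call atomic as a whole?	No. Each element gets its own atomic; the call as a whole is not atomic, and the order among elements is unspecified
Which scope for cross-block contention?	Device scope (`ct::thread_scope_device_t{}`); the result must be visible to all blocks
Which scope for intra-block contention?	Block scope (`ct::thread_scope_block_t{}`) suffices
Python atomic defaults (3)	Bounds checking on, memory order ACQ_REL, thread scope device
C++ atomic thread scope when omitted?	System-wide (note: differs from Python's device default)
How are C++ memory order / scope specified?	Compile-time type tags, e.g. `ct::memory_order_relaxed_t{}`, `ct::thread_scope_block_t{}`
Which Python atomic when the prior value is not needed?	The `TiledView` form (returns no value, lowers to a PTX atomic reduction, faster)
Should atomics be used to sum a tile into a scalar?	No; use a tile reduction
What does `atomic_xchng` return?	The value in memory before the write
`atomic_cas` behavior	Compares the memory value with expected; replaces it with desired only on a match
Do hints change program correctness?	No; they only guide codegen, and the compiler may ignore them
Meaning of C++ hint `arch=0`	Architecture-agnostic; applies to all target architectures
Arch integers for sm_90 / sm_100	900 / 1000 (`__CUDA_ARCH__` convention)
How are conflicting hints of the same kind resolved?	Arch-specific overrides arch-agnostic
What do Python per-arch hints use?	`ByTarget(default=..., sm_XXX=...)`; keys are `"sm_<major><minor>"` strings
Which API for autotuning?	`kernel.replace_hints(**hints)`; returns a new kernel with its own JIT cache
CTAs per cluster allowed values / sm_80 restriction	`1,2,4,8,16`; sm_80 allows only 1
occupancy range / latency range	occupancy `[1,32]`; latency `[1,10]` (1 light, 10 heavy)
`allow_tma` scope pitfall	Applies only to tile-space load/store; gather/scatter do not accept it
`latency` scope	Both tile-space load/store and gather/scatter
Why does `__restrict__` improve performance?	It guarantees no overlap, so the compiler need not finish all loads before storing and can interleave reads and writes
Consequence of misusing `__restrict__`	Marking a region that can actually be accessed through another pointer → undefined behavior
Why is 16-byte alignment needed?	Lets `partition_view` use TMA; the pointer must really be aligned at run time or the behavior is undefined
`cudaMalloc` alignment guarantee	At least 16-byte alignment
`partition_view` vs gather/scatter	partition_view can lower to TMA and is much faster than per-element gathers
Why `ct::irange` instead of `for`?	Lets the compiler pipeline / vectorize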

