L2 Cache Control

重點總覽

當 kernel 反覆存取同一塊 global memory（persisting），可把一部分 L2 cache「保留」（set-aside）給這些存取優先使用，藉此提升頻寬、降低延遲；只被讀一次的資料則屬於 streaming。此能力需要 compute capability 8.0 以上。

項目	重點
適用裝置	compute capability 8.0 以上
兩種 API	CUDA runtime API（CUDA 11.0 起，程式化控制）／`cuda::annotated_ptr`（libcu++，CUDA 11.5 起，指標標註）
set-aside	從 L2 撥出一塊保留給 persisting 存取優先使用；空閒時 normal/streaming 才可借用
set-aside 上限	`cudaDeviceProp::persistingL2CacheMaxSize`
調整 set-aside 大小	`cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, size)`
access policy window	指定一段連續 global memory 範圍 `[base_ptr .. base_ptr+num_bytes)` 的 L2 持久化屬性
設定途徑	CUDA Stream 屬性或 CUDA Graph Kernel Node 屬性
三種存取屬性	`cudaAccessPropertyStreaming` / `cudaAccessPropertyPersisting` / `cudaAccessPropertyNormal`
`hitRatio`	約以該機率挑選窗口內存取套用 `hitProp`，可避免 cache line 抖動（thrashing）
reset 回 normal	三種方式：`Normal` 屬性窗口、`cudaCtxResetPersistingL2Cache()`、自動（不建議依賴）
查詢屬性	`cudaGetDeviceProperties`：`l2CacheSize`、`persistingL2CacheMaxSize`、`accessPolicyMaxWindowSize`

persisting vs streaming

同一份資料被 kernel 多次存取 → persisting（值得留在 L2）；只用一次 → streaming（用完即丟，優先被驅逐）。L2 cache control 的核心就是把這個語意明確告訴硬體。

L2 Cache Set-Aside for Persisting Accesses

可從 L2 cache 撥出一塊「保留區」（set-aside）給 persisting 存取使用。

persisting 存取對這塊保留區有優先使用權；normal 或 streaming 存取只有在保留區未被 persisting 使用時才能借用。
保留大小可在上限內調整，上限為 prop.persistingL2CacheMaxSize。

cudaGetDeviceProperties(&prop, device_id);
size_t size = min(int(prop.l2CacheSize * 0.75), prop.persistingL2CacheMaxSize);
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, size); // 保留 3/4 L2 或最大允許值

上面取「L2 的 75%」與「硬體允許上限」中較小者，作為 set-aside 大小。

MIG 與 MPS 的限制

GPU 處於 Multi-Instance GPU (MIG) 模式時，L2 set-aside 功能被停用。
使用 Multi-Process Service (MPS) 時，無法用 cudaDeviceSetLimit 改變 set-aside 大小；只能在 MPS server 啟動時透過環境變數 CUDA_DEVICE_DEFAULT_PERSISTING_L2_CACHE_PERCENTAGE_LIMIT 指定。

L2 Policy for Persisting Accesses

access policy window 指定一段連續的 global memory 範圍與其在 L2 的持久化屬性；落在窗口內的存取依機率套用 hitProp，其餘套用 missProp。可掛在 CUDA Stream 或 CUDA Graph Kernel Node 上。

cudaStreamAttrValue stream_attribute;                       // Stream 層級屬性結構
stream_attribute.accessPolicyWindow.base_ptr  = reinterpret_cast<void*>(ptr);
stream_attribute.accessPolicyWindow.num_bytes = num_bytes;  // < accessPolicyMaxWindowSize
stream_attribute.accessPolicyWindow.hitRatio  = 0.6;        // cache 命中比例提示
stream_attribute.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
stream_attribute.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute);

此後在該 stream 執行的 kernel，對 [ptr..ptr+num_bytes) 的存取比其他位置更可能持久留在 L2。Graph 版本則用 cudaKernelNodeAttrValue 搭配 cudaGraphKernelNodeSetAttribute(node, cudaKernelNodeAttributeAccessPolicyWindow, &node_attribute)，欄位完全相同。

num_bytes 必須小於 cudaDeviceProp::accessPolicyMaxWindowSize。
上例 hitRatio = 0.6：窗口內約 60% 存取套用 persisting，40% 套用 streaming。
哪些存取被歸為 persisting 是隨機的，機率約等於 hitRatio；實際機率分布取決於硬體架構與記憶體範圍。

hitRatio 如何避免 thrashing

以 set-aside 為 16KB、num_bytes 為 32KB 的窗口為例：

`hitRatio`	硬體行為
0.5	隨機挑 32KB 中的 16KB 標為 persisting，剛好放進 16KB 保留區
1.0	嘗試把整個 32KB 都快取；因保留區較小，會驅逐 cache line 以保留最近用到的 16KB

set-aside L2 = 16KB        window = 32KB
hitRatio=1.0  ┌──────────────┬──────────────┐  整窗 32KB 想塞進 16KB
              │  ←──── 不斷互相驅逐(thrash) ──→ │
              └──────────────┴──────────────┘
hitRatio=0.5  ┌──────────────┐ 只標 16KB persisting，剛好放下，不抖動
              └──────────────┘

hitRatio < 1.0 可手動控制不同 stream 的 accessPolicyWindow 能在 L2 快取多少資料。
例：兩個並行 kernel 各有 16KB 窗口、共用 16KB 保留區；都用 hitRatio = 1.0 會互相驅逐 cache line，改用 0.5 則較不會驅逐自己或對方的 persisting cache line。

L2 Access Properties

針對不同 global memory 存取定義三種屬性：

屬性	語意
`cudaAccessPropertyStreaming`	較不易留在 L2，會被優先驅逐
`cudaAccessPropertyPersisting`	較易留在 L2，會在 set-aside 區被優先保留
`cudaAccessPropertyNormal`	強制把先前套用的 persisting 屬性重設為 normal，等同該存取從未套過屬性

為何需要 Normal 來重設

先前 kernel 留下的 persisting cache line 可能在用完後仍長期占住 L2（persistence-after-use），壓縮到後續不使用 persisting 的 kernel 可用的 L2。用 cudaAccessPropertyNormal 重設窗口，可移除前一次存取的「優先保留」狀態。

L2 Persistence Example

完整流程：撥出 set-aside → 設窗口 → kernel 反覆使用 → 關閉窗口並 reset L2。

cudaStreamCreate(&stream);
cudaGetDeviceProperties(&prop, device_id);
size_t size = min(int(prop.l2CacheSize * 0.75), prop.persistingL2CacheMaxSize);
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, size);            // 保留 set-aside
size_t window_size = min(prop.accessPolicyMaxWindowSize, num_bytes); // 取較小者

stream_attribute.accessPolicyWindow.base_ptr  = reinterpret_cast<void*>(data1);
stream_attribute.accessPolicyWindow.num_bytes = window_size;
stream_attribute.accessPolicyWindow.hitRatio  = 0.6;
stream_attribute.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
stream_attribute.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute);

for (int i = 0; i < 10; i++)
    cuda_kernelA<<<grid_size, block_size, 0, stream>>>(data1); // data1 被多次使用 → 受惠
cuda_kernelB<<<grid_size, block_size, 0, stream>>>(data1);     // 同 stream 另一 kernel 也受惠

stream_attribute.accessPolicyWindow.num_bytes = 0;             // 窗口大小設 0 即停用
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute);
cudaCtxResetPersistingL2Cache();                               // 清除 L2 中的 persistent line
cuda_kernelC<<<grid_size, block_size, 0, stream>>>(data2);     // data2 可用完整 L2(normal 模式)

重點：用完 persisting 後務必先把 num_bytes 設 0 關閉窗口，再呼叫 cudaCtxResetPersistingL2Cache()，後續 kernel 才能以 normal 優先權完整使用 L2。

set-aside ──► set window(data1) ──► kernelA ×10 ──► kernelB
                                                       │
                          num_bytes=0(關窗) ◄──────────┘
                                  │
                  cudaCtxResetPersistingL2Cache() ──► kernelC(data2, 完整 L2)

Reset L2 Access to Normal

persisting cache line 可能在用完後仍長期留在 L2，因此「重設回 normal」對 streaming/normal 存取以正常優先權使用 L2 很重要。三種重設方式：

以 cudaAccessPropertyNormal 屬性重設先前的 persisting 記憶體區。
呼叫 cudaCtxResetPersistingL2Cache() 把所有 persisting cache line 重設為 normal。
未被觸碰的 line 最終會自動重設為 normal。

不要依賴自動重設

自動重設發生的時間長度不確定，強烈不建議依賴；應主動用方式 1 或 2 明確重設。

Manage Utilization of L2 Set-Aside Cache

多個並行（不同 stream）的 kernel 可各自指定 access policy window，但保留區是被所有並行 kernel 共用的：總使用量是各 kernel 個別用量的加總。當 persisting 存取總量超過保留區容量時，持久化的好處會遞減。

管理保留區利用率須一併考量：

L2 set-aside 的大小。
可能並行執行的 CUDA kernel。
這些並行 kernel 各自的 access policy window。
何時、如何 reset L2，讓 normal/streaming 存取能以相同優先權使用先前保留的 L2。

Query L2 Cache Properties

L2 相關屬性屬於 cudaDeviceProp，用 cudaGetDeviceProperties 查詢：

欄位	意義
`l2CacheSize`	GPU 上可用的 L2 cache 總量
`persistingL2CacheMaxSize`	可撥給 persisting 存取的 L2 最大量（set-aside 上限）
`accessPolicyMaxWindowSize`	access policy window 的最大尺寸（`num_bytes` 上限）

Control L2 Cache Set-Aside Size for Persisting Memory Access

set-aside 大小以 cudaLimit 形式管理：用 cudaDeviceGetLimit 查詢、用 cudaDeviceSetLimit 設定，limit 名稱為 cudaLimitPersistingL2CacheSize，最大值為 cudaDeviceProp::persistingL2CacheMaxSize。

enum cudaLimit {
    /* other fields not shown */
    cudaLimitPersistingL2CacheSize
};

cudaLimitPersistingL2CacheSize 即是控制 L2 保留大小的旋鈕，搭配 set/get Limit 一對即可調整與讀回。

考試/測驗重點

題型	關鍵答案
L2 持久化最低 compute capability？	8.0 以上
兩種 API？	CUDA runtime API（11.0 起）/ cuda::annotated_ptr（libcu++，11.5 起）
設定 set-aside 大小的 API 與 limit？	cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, size)
set-aside 上限欄位？	cudaDeviceProp::persistingL2CacheMaxSize
MIG 模式下 set-aside？	被停用
MPS 下如何設 set-aside？	不能用 cudaDeviceSetLimit；改用 CUDA_DEVICE_DEFAULT_PERSISTING_L2_CACHE_PERCENTAGE_LIMIT 環境變數於 MPS 啟動時設定
access policy window 三大屬性欄位？	base_ptr / num_bytes / hitRatio + hitProp / missProp
num_bytes 上限？	必須 < accessPolicyMaxWindowSize
三種 access property？	Streaming（優先驅逐）/ Persisting（優先保留）/ Normal（重設為一般）
hitRatio 意義？	約以該機率挑選窗口內存取套 hitProp；隨機，分布依硬體與範圍而定
hitRatio=0.5 vs 1.0（16KB set-aside, 32KB window）？	0.5：隨機 16KB 標 persisting 剛好放下；1.0：嘗試塞整 32KB → 驅逐保留最近 16KB
hitRatio<1.0 的用途？	避免 thrashing、控制並行 stream 在 L2 的快取量、減少互相驅逐
設定窗口可掛在哪兩種物件？	CUDA Stream（cudaStreamAttributeAccessPolicyWindow）/ Graph Kernel Node（cudaKernelNodeAttributeAccessPolicyWindow）
重設回 normal 三方式？	Normal 屬性窗口 / cudaCtxResetPersistingL2Cache() / 自動（不建議依賴）
為何不可依賴自動重設？	自動重設所需時間長度不確定
如何停用一個 access policy window？	把 num_bytes 設為 0 後再 SetAttribute
並行 kernel 與 set-aside 的關係？	保留區被所有並行 kernel 共用，總用量為各自加總，超量則好處遞減
查 L2 屬性的 API？	cudaGetDeviceProperties → l2CacheSize / persistingL2CacheMaxSize / accessPolicyMaxWindowSize

重點總覽

L2 Cache Set-Aside for Persisting Accesses

L2 Policy for Persisting Accesses

hitRatio 如何避免 thrashing

L2 Access Properties

L2 Persistence Example

Reset L2 Access to Normal

Manage Utilization of L2 Set-Aside Cache

Query L2 Cache Properties

Control L2 Cache Set-Aside Size for Persisting Memory Access

考試/測驗重點

Related Notes