SIMT 裝置記憶體空間 (SIMT Device Memory Spaces)

重點總覽

CUDA device 有多種記憶體空間，kernel 內的 thread 可依其 scope（範圍）、lifetime（生命週期） 與 physical location（實體位置） 存取。下表為原文 Table 1 的核心摘要，後續每個概念各有一節說明。

項目	重點
Global memory	Grid scope、Application lifetime、實體在 Device；主要資料儲存、所有 thread 可見、persistent
Constant memory	Grid scope、Application lifetime、實體在 Device；read-only、每 device 一份、典型 64KB
Shared memory	Block scope、Kernel lifetime、實體在 SM；user-managed scratchpad、與 L1 共用實體空間
Local memory	Thread scope、Kernel lifetime、實體在 Device（global space）；register spilling 的去處
Register	Thread scope、Kernel lifetime、實體在 SM；最快、由 compiler 管理、thread local
static `__shared__`	編譯期決定大小、宣告在 kernel body 內
dynamic `extern __shared__`	launch 時用第三個 `<<<>>>` 參數決定大小、一個 kernel 僅能有一個
Caches (L2/L1)	L2 全 SM 共用、L1 在每個 SM 且與 shared memory 共用實體空間
Texture/Surface	現代 GPU 對 non-texture load 已無效能優勢、僅渲染情境有用
Distributed Shared Memory	CC 9.0+、thread block cluster 內跨 block 存取彼此 shared memory

Scope / Lifetime / Location 三維度

記住每種記憶體要同時掌握三件事：誰能看到它（scope）、它存活多久（lifetime）、實體在哪（location）。考試最愛混淆 Local memory：它的 scope 是 thread，但實體位置在 device（global）space，不在 SM。

存取速度（快 → 慢）           實體位置
Register ──────────────────  SM 晶片上
Shared / L1 ───────────────  SM 晶片上
L2 cache ──────────────────  Device（全 SM 共用）
Constant（透過 cache）─────  Device
Local / Global ────────────  Device（off-chip DRAM）

2.3.3.1 Global Memory

Global memory（又稱 device memory）是儲存 kernel 內所有 thread 皆可存取資料的主要空間，類似 CPU 系統中的 RAM。GPU 上的 kernel 直接存取 global memory，如同 CPU 程式存取系統記憶體。

Scope = Grid、Lifetime = Application、Location = Device。
Persistent（持久）：一旦配置，資料會保留到該 allocation 被釋放或 application 結束；cudaDeviceReset 也會釋放所有 allocation。
以 cudaMalloc、cudaMallocManaged 配置；以 cudaMemcpy 從 host 複製資料進來；以 cudaFree 釋放。
流程：kernel launch 前由 CUDA API 配置並初始化 → kernel 執行時 thread 可讀寫 → kernel 完成後結果可複製回 host 或供其他 kernel 使用。
kernel 回傳型別為 void，唯一把數值結果回傳 host 的方式就是寫入 global memory。
因所有 thread 皆可存取，必須小心避免 data race。

__global__ void vecAdd(float* A, float* B, float* C, int vectorLength) {
    int workIndex = threadIdx.x + blockIdx.x * blockDim.x;
    if (workIndex < vectorLength)
        C[workIndex] = A[workIndex] + B[workIndex];
}

A、B、C 三個陣列都在 global memory，這個 vector add kernel 透過全域索引存取它們。

2.3.3.2 Shared Memory

Shared memory 是 thread block 內所有 thread 皆可存取的空間，實體位於每個 SM 上，與 L1 cache 共用同一塊實體資源（unified data cache）。

Scope = Block、Lifetime = Kernel、Location = SM。資料在整個 kernel 執行期間持續存在。
可視為 user-managed scratchpad：容量遠小於 global memory，但因在 SM 上，頻寬更高、延遲更低。
block 內 thread 同步用 __syncthreads()（C++）／cuda.syncthreads()（Python）：阻擋 block 內所有 thread，直到全部抵達該呼叫點。
shared memory 大小依架構而異；與 L1 共用實體空間 → 用 shared memory 會減少可用 L1，反之若不用 shared memory，整塊空間都給 L1。
查詢：cudaGetDeviceProperties → sharedMemPerMultiprocessor（每 SM）、sharedMemPerBlock（每 block）。
cudaFuncSetCacheConfig 可向 runtime 表達「偏好較大 shared memory 或較大 L1」，但只是 preference，不保證被採納。

// 假設 blockDim.x 為 128
__global__ void example_syncthreads(int* input_data, int* output_data) {
    __shared__ int shared_data[128];
    shared_data[threadIdx.x] = input_data[blockDim.x * blockIdx.x + threadIdx.x];
    // 所有 thread 同步，保證對 shared_data 的寫入都排在解除阻擋之前
    __syncthreads();
    // 由單一 thread 安全讀取 shared_data
    if (threadIdx.x == 0) {
        float sum = 0;
        for (int i = 0; i < blockDim.x; ++i) sum += shared_data[i];
        output_data[blockIdx.x] = sum;
    }
}

寫入 shared memory 後，必須 __syncthreads() 才能讓其他 thread 安全讀取，否則為 data race。

L1 與 shared memory 共用實體空間

兩者搶同一塊 unified data cache。多用 shared memory，L1 可用量就變小；完全不用 shared memory 時整塊都歸 L1。

2.3.3.2.1 Static Allocation of Shared Memory

在 kernel body 內用 __shared__（C++）或 cuda.shared.array()（Python）宣告變數，大小必須在編譯期決定，存活整個 kernel 執行期間。

__shared__ float sharedArray[1024];

宣告後，block 內所有 thread 都能存取這個 shared 陣列。

2.3.3.2.2 Dynamic Allocation of Shared Memory

C++ 在 launch 時把每 block 所需 shared memory 位元組數放在 triple chevron 的第三個（可選）參數：functionName<<<grid, block, sharedMemoryBytes>>>()；未指定時預設為 0。Python 用 cuda.core.launch() 搭配 LaunchConfig 的 shmem_size 欄位。

kernel 內以 extern __shared__ 搭配空 [] 宣告，於 launch 時動態配置。
Python 對應為 cuda.shared.array(shape=0, ...)。

extern __shared__ float sharedArray[];

只能有一個動態 shared 陣列

一個 kernel 僅能有單一動態配置的 shared 陣列。若需多個，必須配置一個夠大的單一陣列再手動切割（partition）。

extern __shared__ float array[];
short* array0 = (short*)array;
float* array1 = (float*)&array0[128];
int*   array2 = (int*)&array1[64];

切割時 pointer 必須對齊到所指型別：若改成 &array0[127]，array1 未對齊 4 bytes 就會出錯。Python 無 pointer，故無此 type punning。

2.3.3.3 Registers

Register 位於 SM 上，具 thread local scope，是 kernel 執行期間最快的 thread 私有儲存。

Scope = Thread、Lifetime = Kernel、Location = SM。
使用量由 compiler 管理。
查詢：regsPerMultiprocessor（每 SM）、regsPerBlock（每 block）device property。
NVCC 可用 -maxrregcount 限制每 kernel 的 register 上限。
降低 register 數 → 可能讓更多 block 同時排程於 SM（提高 occupancy），但也可能造成更多 register spilling（值被寫出至 global memory 後再讀回，騰出空間）。

少用 register 可塞更多 block 進 SM；但壓太兇會 spill 到 global memory，反而拖慢速度。這是 occupancy 調校的核心 trade-off。

2.3.3.4 Local Memory

Local memory 是與 register 類似、由 NVCC 管理的 thread local 儲存，但實體位置在 global memory space。「local」指的是邏輯 scope，不是實體位置。

Scope = Thread、Lifetime = Kernel、Location = Device（global space）。
compiler 可能放進 local memory 的 automatic 變數：
- 無法判定以常數索引的陣列；
- 大型 struct 或會吃掉太多 register 空間的陣列；
- 任何在 register 不足時溢出的變數，即 register spilling。
因實體在 device memory，延遲與頻寬等同 global memory，同樣受 memory coalescing 規範約束。
但 local memory 的佈局是「連續的 32-bit word 由連續 thread ID 存取」，所以只要 warp 內所有 thread 存取相同相對位址（同一陣列索引或同一 struct 成員），就能完全 coalesced。

別被「local」誤導

Local memory 名為 local，實體卻在 off-chip 的 device memory，速度跟 global memory 一樣慢——它不是晶片上的快取。

2.3.3.5 Constant Memory

Constant memory 具 grid scope，在整個 application lifetime 可用；位於 device，對 kernel 唯讀（read-only）。

Scope = Grid、Lifetime = Application、Location = Device。
C++ 在任何 kernel／function 之外用 __constant__ 宣告；Python 在 kernel 內用 numba.cuda.const.array_like(ary)。
特性：駐於 constant memory space、每個 device 一份獨立物件、grid 內所有 thread 與 host 皆可存取（C++ 透過 cudaGetSymbolAddress()／cudaGetSymbolSize()／cudaMemcpyToSymbol()／cudaMemcpyFromSymbol()）。
Lifetime 細節：C++ 為其建立 context 的 lifetime；Python 為宣告它的 kernel 的 lifetime。
容量查詢：totalConstMem device property；通常每 device 約 64KB，相對其他記憶體很小。
適合：每個 thread 以唯讀方式使用的少量資料。

// 在 .cu 檔
__constant__ float coeffs[4];
__global__ void compute(float* out) {
    int idx = threadIdx.x;
    out[idx] = coeffs[0] * idx + coeffs[1];
}
// 在 host 程式
float h_coeffs[4] = {1.0f, 2.0f, 3.0f, 4.0f};
cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));
compute<<<1, 10>>>(device_out);

host 端用 cudaMemcpyToSymbol 把資料寫入 __constant__ 符號，kernel 端唯讀存取。

2.3.3.6 Caches

GPU 具多層 cache 結構，包含 L2 與 L1。

Cache	位置	共用範圍	查詢
L2	Device	所有 SM 共用	`l2CacheSize`（`cudaGetDeviceProperties`）
L1	每個 SM	與 shared memory 共用實體空間	（見 shared memory）

若 kernel 未使用 shared memory，整塊實體空間都給 L1。
L2／L1 行為可透過進階函式控制（Configuring L1/Shared Memory Balance、L2 Cache Control、Low-Level Load and Store Functions）。
若不用這些 hint，compiler 與 runtime 會自行盡力有效利用 cache。

2.3.3.7 Texture and Surface Memory

GPU 可能有「從影像載入資料作為 3D 渲染 texture」的特化指令，CUDA 以 texture object API 與 surface object API 暴露。

在目前所有受支援的 NVIDIA GPU 上，texture／surface memory 對非圖形（non-graphics）應用無任何效能優勢。
過往舊 GPU 在某些情境用 texture memory 有好處；現在這些情境可直接用 load／store 指令處理。
仍有用的場景：讀取 texture／surface 資料做渲染，例如為 NVIDIA OptiX 撰寫 hit shader（OptiX 以 CUDA 為 shader 語言）。
舊 code base 若仍用這些 API 做 non-texture load，可參考 legacy CUDA C++ Programming Guide。

新程式碼不要用 texture 來加速一般載入

現代 GPU 直接 load/store 即可，texture/surface 對非渲染用途已無 performance benefit。

2.3.3.8 Distributed Shared Memory

前提

Distributed shared memory 需搭配 thread block cluster，使用 cooperative_groups API，目前僅 C++ 可用。Thread Block Clusters 於 compute capability 9.0 引入。

讓 thread block cluster 內的 thread 能存取「該 cluster 中所有參與 block」的 shared memory；這塊分散的 shared memory 稱為 Distributed Shared Memory，其位址空間稱 Distributed Shared Memory address space。
cluster 內 thread 可對該位址空間 read／write／atomics，不論位址屬於本地或遠端 block。
不論是否用 distributed shared memory，shared memory 大小規格（static 或 dynamic）仍以 per thread block 計；distributed shared memory 總大小 = 每 cluster 的 block 數 × 每 block 的 shared memory 大小。
存取前提：所有 block 都必須存在。用 cluster.sync()（cluster_group）保證所有 block 已開始執行；也須確保所有 distributed shared memory 操作在 block 退出前完成（遠端 block 讀本地 shared memory 時，需確保讀取完成才可退出）。

#include <cooperative_groups.h>
namespace cg = cooperative_groups;
__global__ void clusterHist_kernel(int* bins, /* ... */) {
    extern __shared__ int smem[];
    cg::cluster_group cluster = cg::this_cluster();
    unsigned int clusterBlockRank = cluster.block_rank();
    // ... 初始化 smem 為 0 ...
    cluster.sync();                       // 確保全 cluster 的 shared memory 都已歸零且 block 都已啟動
    // 計算目標 block rank 與 offset
    int* dst_smem = cluster.map_shared_rank(smem, dst_block_rank);
    atomicAdd(dst_smem + dst_offset, 1);  // 跨 block atomic 更新 histogram bin
    cluster.sync();                       // 確保所有 distributed 操作完成後才退出
}

cluster.map_shared_rank() 取得遠端 block 的 shared memory 指標，是 distributed shared memory 的關鍵 API。

應用範例：histogram。傳統做法在各 block 的 shared memory 累積再做 global atomics，受限於 shared memory 容量；distributed shared memory 提供中介層，依 bins 大小選擇用 shared、distributed shared 或直接 global memory 計算。
launch 時 cluster size 依所需 distributed shared memory 而定；cluster_size == 1 等於不用 distributed shared memory（只用 block 本地 shared memory）。

cudaLaunchConfig_t config = {0};
config.gridDim = array_size / threads_per_block;
config.blockDim = threads_per_block;
config.dynamicSmemBytes = nbins_per_block * sizeof(int);   // 動態 shared memory 大小是 per block
cudaFuncSetAttribute(clusterHist_kernel,
    cudaFuncAttributeMaxDynamicSharedMemorySize, config.dynamicSmemBytes);
cudaLaunchAttribute attribute[1];
attribute[0].id = cudaLaunchAttributeClusterDimension;
attribute[0].val.clusterDim.x = cluster_size;              // y, z 設 1
config.numAttrs = 1; config.attrs = attribute;
cudaLaunchKernelEx(&config, clusterHist_kernel, /* args... */);

用 cudaLaunchKernelEx 搭配 cudaLaunchAttributeClusterDimension 在 runtime 動態決定 cluster 維度；注意 dynamic shared memory 大小仍以 per block 指定。

考試/測驗重點

情境/關鍵字	答案
Local memory 實體在哪？	在 device（global）memory space，不是 SM；速度等同 global memory
「local」指的是什麼？	邏輯 scope（thread local），不是實體位置
Shared memory 與哪個 cache 共用空間？	L1 cache（unified data cache，在每個 SM 上）
不用 shared memory 時 L1 如何？	整塊實體空間都給 L1 cache
一個 kernel 能有幾個 dynamic shared 陣列？	僅一個；多個需自行配置大陣列再手動 partition
dynamic shared memory 大小在哪指定？	C++ `<<<grid, block, bytes>>>` 第三參數（預設 0）；Python `LaunchConfig.shmem_size`
static shared memory 大小何時決定？	編譯期（compile time）
Constant memory 典型大小／可寫嗎？	約 64KB/device、對 kernel read-only
host 如何寫入 `__constant__`？	`cudaMemcpyToSymbol`（另有 `cudaMemcpyFromSymbol` 等）
kernel 回傳值如何回傳 host？	kernel 為 `void`，只能寫入 global memory
`-maxrregcount` 的副作用	可能更多 block 同時排程，但也可能更多 register spilling
register spilling 是什麼？	register 值被寫出至 global memory 再讀回，以騰出空間
Distributed shared memory 需求的 CC？	compute capability 9.0，搭配 thread block cluster + cooperative groups（僅 C++）
distributed shared memory 總大小公式	cluster 內 block 數 × 每 block shared memory 大小
跨 block 取遠端 shared memory 指標的 API	`cluster.map_shared_rank(smem, rank)`
Texture/surface memory 在現代 GPU 對一般 load？	無效能優勢；僅渲染（如 OptiX hit shader）有用
`cudaFuncSetCacheConfig` 保證生效嗎？	否，只是 preference，runtime 可不採納
哪些記憶體實體在 SM 上？	Register、Shared memory（L1）
哪些記憶體實體在 Device 上？	Global、Constant、Local、L2
Global memory scope / lifetime？	Grid / Application（persistent，到 free 或程式結束）