CUDA C++ Kernel 與啟動 (CUDA C++ Kernels and Launch)

#cuda #cuda-cpp #kernel-launch #compilation #concept

重點總覽

項目	重點
NVCC 編譯	`nvcc` 是 compiler driver，封裝 C++/PTX 編譯各階段，提供熟悉的 command line 選項
Kernel 定義	用 `__global__` 修飾、回傳型別必為 `void`，由大量平行 threads 同時執行
啟動 kernel	透過 execution configuration 指定 thread 數；兩種方式：triple chevron `<<<grid, block>>>` 與 `cudaLaunchKernelEx`
Triple chevron	`<<<grid, block>>>` 前兩參數為 grid 維度與 block 維度；1D 可用 integer，多維用 `dim3`
Block 上限	一個 block 內所有 threads 駐於同一 SM 共享資源，現行 GPU 每 block 上限 1024 threads
非同步啟動	kernel launch 對 host thread 為 asynchronous，host 不等 kernel 完成即繼續，需另行同步
Index intrinsics	`threadIdx` / `blockIdx` / `blockDim` / `gridDim`，皆為 `.x/.y/.z` 三分量向量
Global index	`workIndex = threadIdx.x + blockDim.x * blockIdx.x` 決定每個 thread 處理哪個元素
Bounds checking	用 `if(workIndex < length)` 防越界，可啟動多於需要的 threads；blocks 數用 ceiling 除法

NVCC 編譯 (Compilation with NVCC)

用 C++ 撰寫的 GPU 程式碼由 NVIDIA Cuda Compiler (nvcc) 編譯。nvcc 本身是一個 compiler driver（編譯驅動器），它不直接做全部編譯，而是把編譯切成多個階段、再呼叫底層工具集去執行。

nvcc 簡化了 C++ 或 PTX 程式碼的編譯流程，提供簡單熟悉的 command line 選項。
同一套 nvcc 指令可用於：安裝 CUDA Toolkit 的任何 Linux、Windows command line / PowerShell、或 WSL。
本筆記聚焦 CUDA runtime API（建構於較低階的 CUDA driver API 之上），是 C++ 中最常用的 CUDA 使用方式。

  .cu source
      │  nvcc (compiler driver)
      ▼
  ┌─────────────┬──────────────┐
  │  host C++   │  device code │
  │  → host 編譯 │  → PTX/cubin │
  └─────────────┴──────────────┘
            ▼
        executable / fatbin

Tip

nvcc 完整選項與各編譯階段細節在 02-Programming-GPUs/17-NVCC-Compiler，平台層（PTX/cubin/JIT）見 01-Introduction-to-CUDA/05-CUDA-Platform。

Kernel 定義 (Kernels / Specifying Kernels)

Kernel 是在 GPU 上執行、可由 host 呼叫的函式，設計上由大量平行 threads 同時執行。

用 __global__ declaration specifier 標記，告訴 compiler 此函式要編譯給 GPU，並可被 kernel launch 呼叫。
kernel launch 是「啟動一個 kernel 開始執行」的操作，通常由 CPU 發起。
Kernel 的回傳型別必為 void。

// Kernel definition
__global__ void vecAdd(float* A, float* B, float* C)
{
}

最精簡的 kernel 骨架：__global__ + void 回傳。後面會在函式體內用 thread index 決定每個 thread 的工作。

Important

__global__ 函式回傳值一定是 void；要回傳結果只能透過參數（指標）寫入 device 可存取的記憶體。函式/變數修飾符的完整分類見 02-Programming-GPUs/04-CUDA-Cpp-Errors-and-Specifiers。

啟動 Kernel 與 Execution Configuration (Launching Kernels)

平行執行 kernel 的 thread 數量，是在 kernel launch 時透過 execution configuration（執行配置）指定。同一個 kernel 的不同呼叫可使用不同的 execution configuration（不同 thread 數或 block 數）。

從 CPU 啟動 kernel 有兩種方式：

方式	說明
Triple chevron `<<< >>>`	最常用，CUDA C++ 語言擴充，本筆記重點
`cudaLaunchKernelEx`	進階方式，範例與細節見 Section 3.1.1

Triple Chevron Notation

之所以叫 triple chevron，是因為它用三個角括號 <<< >>> 包住 execution configuration；參數以逗號分隔，類似函式呼叫的參數。

__global__ void vecAdd(float* A, float* B, float* C) { }

int main()
{
    // Kernel invocation: 1 個 block、每 block 256 threads
    vecAdd<<<1, 256>>>(A, B, C);
}

前兩個參數分別是 grid 維度 與 thread block 維度。
上例啟動單一 block、含 256 threads；每個 thread 執行完全相同的 kernel 程式碼。
使用 1D block / grid 時，維度可直接用 integer 指定。

多維：dim3

當 grid 或 block 為 2D / 3D 時，用 CUDA 型別 dim3 當作 grid 與 block 維度參數。

int main()
{
    dim3 grid(16, 16);   // 16x16 的 block 網格
    dim3 block(8, 8);    // 每個 block 為 8x8 threads
    MatAdd<<<grid, block>>>(A, B, C);
}

dim3(16,16) 未指定的 .z 會自動預設為 1。

Block 上限 (1024)

Important

一個 block 內所有 threads 都駐留在同一個 SM (streaming multiprocessor) 並共享該 SM 的資源，因此每 block 有 thread 數上限。現行 GPU 每個 block 最多 1024 threads。若資源允許，多個 block 可同時排程在同一個 SM 上。

Kernel launch 是非同步的

Warning

Kernel launch 對 host thread 為 asynchronous：kernel 只是被「設定好」要在 GPU 上執行，host 程式碼不會等待 kernel 完成（甚至不等它開始）就繼續往下跑。要確認 kernel 已完成，必須使用某種 CPU↔GPU 同步機制。

host thread:  launch ──► (立即返回，繼續執行 host 程式) ...
GPU:                  └─► kernel 在背景排程/執行
                          ▲
                          └ 需同步 (如全域同步) 才能確定已完成

最基本的同步是「完全同步整個 GPU」，進階方法見 02-Programming-GPUs/03-CUDA-Cpp-Sync-and-Workflow 與 02-Programming-GPUs/14-Async-Streams-and-Events。

Thread 與 Grid Index Intrinsics

在 kernel 內，CUDA 提供 intrinsics 讓每個 thread 取得 execution configuration 參數與自身索引，藉此判斷自己該做哪份工作。

Intrinsic	意義	範圍
`threadIdx`	thread 在其 block 內的索引	`.x` 為 `0 ~ blockDim.x-1`
`blockIdx`	block 在 grid 內的索引	`.x` 為 `0 ~ gridDim.x-1`
`blockDim`	block 的維度（launch 時指定）	—
`gridDim`	grid 的維度（launch 時指定）	—

四者皆為三分量向量，有 .x / .y / .z 成員；launch 未指定的維度預設為 1。
threadIdx 與 blockIdx 皆從 0 開始（zero indexed）。

計算 global index

__global__ void vecAdd(float* A, float* B, float* C)
{
    // 此 thread 負責計算哪個元素
    int workIndex = threadIdx.x + blockDim.x * blockIdx.x;
    C[workIndex] = A[workIndex] + B[workIndex];
}

// A, B, C 為 1024 元素向量：4 個 block x 256 threads
vecAdd<<<4, 256>>>(A, B, C);

workIndex = threadIdx.x + blockDim.x * blockIdx.x 是 1D 平行化的慣用公式，把 (block 索引, block 內索引) 攤平成全域唯一索引。

blockIdx:      0              1              2              3
threadIdx: 0..255         0..255         0..255         0..255
workIndex: 0..255     256+(0..255)   512+(0..255)   768+(0..255)
           └──────── global index 0 .. 1023 ────────┘

block 0：workIndex = threadIdx.x（blockIdx.x = 0）。
block 1：workIndex = threadIdx.x + 256（blockDim.x * blockIdx.x = 256）。
block 2：workIndex = threadIdx.x + 512，依此類推。
擴展到 2D / 3D 時，每個維度通常套用相同模式。

Bounds Checking 與計算 block 數

上面的範例假設向量長度是 block size 的整數倍。要處理任意長度，需在 kernel 內加上邊界檢查，避免越界存取陣列。

__global__ void vecAdd(float* A, float* B, float* C, int vectorLength)
{
    int workIndex = threadIdx.x + blockDim.x * blockIdx.x;
    if (workIndex < vectorLength) {     // bounds check
        C[workIndex] = A[workIndex] + B[workIndex];
    }
}

加上 if(workIndex < vectorLength) 後，可以啟動多於需要的 threads 而不會越界；超出範圍的 threads 直接結束、不做事。

Warning

在 block 內啟動多餘但不做事的 threads 開銷很小，可接受；但**「整個 block 都沒有 thread 在工作」應該避免**——也就是不要啟動完全閒置的 block。

計算需要幾個 block（ceiling 除法）

block 數 = ⌈需要的 thread 數 / 每 block thread 數⌉，即「向上取整的整數除法」。

int threads = 256;
// 加上 threads-1 再整數除法，等同 ceiling
int blocks = (vectorLength + threads - 1) / threads;
vecAdd<<<blocks, threads>>>(devA, devB, devC, vectorLength);

+ (threads - 1) 的技巧：只有當 vectorLength 不能被 threads 整除時，才會多加一個 block。

CCCL (CUDA Core Compute Library) 提供更易讀的工具 cuda::ceil_div（含括 <cuda/cmath>）做同樣的事：

#include <cuda/cmath>
int threads = 256;
int blocks  = cuda::ceil_div(vectorLength, threads);
vecAdd<<<blocks, threads>>>(devA, devB, devC, vectorLength);

Tip

每 block 256 threads 只是任意但常見的合理起點值，實務上常以此為調校起點。

考試/測驗重點

情境/關鍵字	答案
kernel 用什麼修飾符宣告？	`__global__` declaration specifier
kernel 回傳型別必為？	`void`（只能透過指標參數輸出結果）
`nvcc` 是什麼？	compiler driver，封裝並呼叫各編譯階段工具，非單一編譯器
啟動 kernel 的兩種方式	triple chevron `<<<grid,block>>>` 與 `cudaLaunchKernelEx`
`<<<a, b>>>` 中 a、b 分別是？	a = grid 維度（block 數），b = block 維度（thread 數）
每 block thread 數上限	現行 GPU 為 1024（同一 block 共享同一 SM 資源）
多維 grid/block 用什麼型別？	`dim3`（未指定維度預設為 1）
kernel launch 對 host 是同步還是非同步？	非同步 (asynchronous)；host 不等 kernel 完成即繼續
global index 公式	`threadIdx.x + blockDim.x * blockIdx.x`
`threadIdx.x` 範圍	`0 ~ blockDim.x - 1`（zero indexed）
`blockIdx.x` 範圍	`0 ~ gridDim.x - 1`（zero indexed）
`blockDim` / `gridDim` 由誰決定？	由 launch 時的 execution configuration 指定
為何要 bounds check？	向量長度非 block size 整數倍時，防止越界存取陣列
算 block 數的 ceiling 寫法	`(N + threads - 1) / threads` 或 `cuda::ceil_div(N, threads)`
多餘 threads vs 多餘 block	啟動閒置 threads 開銷小；應避免「整個 block 都閒置」
`cuda::ceil_div` 需含哪個 header？	`<cuda/cmath>`（來自 CCCL）

重點總覽

NVCC 編譯 (Compilation with NVCC)

Kernel 定義 (Kernels / Specifying Kernels)

啟動 Kernel 與 Execution Configuration (Launching Kernels)

Triple Chevron Notation

多維：dim3

Block 上限 (1024)

Kernel launch 是非同步的

Thread 與 Grid Index Intrinsics

計算 global index

Bounds Checking 與計算 block 數

計算需要幾個 block（ceiling 除法）

考試/測驗重點

Related Notes