GPU Computing Foundations

#cuda #gpu-architecture #gpu-vs-cpu #concept

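Key Points Overview

Item	Key point
GPU origins	Originally fixed-function hardware for 3D graphics; by 2003 some pipeline stages were fully programmable
Birth of CUDA	NVIDIA introduced CUDA in 2006, letting arbitrary computation use GPU throughput without going through a graphics API
GPU strengths	High instruction throughput and memory bandwidth; excels at running thousands of threads in parallel
CPU strengths	Low-latency sequential execution (finish one thread as fast as possible); only a few tens of threads in parallel
Transistor allocation	GPU devotes more to data processing; CPU devotes more to cache and flow control
Getting started	Libraries (cuBLAS/cuFFT/cuDNN/CUTLASS), AI frameworks, DSLs (Warp/Triton)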

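GPU Evolution and the Origins of CUDA

The GPU (Graphics Processing Unit) began as a special-purpose processor for 3D graphics, using fixed-function hardware to accelerate the parallel operations in real-time 3D rendering. Over successive generations GPUs became more programmable; by 2003, some stages of the graphics pipeline were fully programmable and could run custom code in parallel for each component of a 3D scene or image.

In 2006, NVIDIA introduced CUDA (Compute Unified Device Architecture), which lets any computational workload use the GPU's throughput independently of graphics APIs.
Since then, GPU computing has been applied to nearly every type of computation, from scientific simulation (fluid dynamics, energy transport) to business applications (databases, analytics).
The GPU's capability and programmability have also underpinned new algorithms and technologies, from image classification to generative AI (diffusion, large language models).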

Tip

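"2006 (CUDA)" and "2003 (programmable pipeline)" are two years that are often confused: 2003 = some stages of the graphics pipeline became fully programmable; 2006 = CUDA opened the GPU to non-graphics computation.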

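Evolution of GPU programmability
─────────────────────────────────────────────────────────────►  time
[early]                  [2003]                     [2006]
fixed-function           parts of the pipeline      CUDA introduced
3D graphics hardware  →  fully programmable      →  arbitrary compute uses GPU throughput
                         (graphics only)            (independent of graphics APIs)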

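Benefits of Using GPUs

Within a similar price and power envelope, a GPU provides much higher instruction throughput and memory bandwidth than a CPU, so many applications run noticeably faster on the GPU. The two are designed with different goals: a CPU aims to execute a single sequence of operations (that is, one thread) as fast as possible and can run a few tens of threads in parallel; a GPU aims to run thousands of threads in parallel, accepting lower single-thread performance in exchange for higher total throughput.

Design orientation: CPU = low latency, sequential, few threads; GPU = high throughput, massive parallelism, at the cost of single-thread performance.
Transistor allocation: the GPU devotes more transistors to data processing units; the CPU devotes more to data caching and flow control.
The GPU is specialized for **highly parallel computations**.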

Important

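Core mental model: the GPU trades latency for throughput. It does not make any single thread faster; it keeps a huge number of threads advancing at once and builds overall performance from that parallelism.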
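The trade-off can be made concrete with a toy model. The sketch below is only an illustration: the function `batch_time` and the "cpu-like" and "gpu-like" numbers are hypothetical, chosen to show the shape of the trade-off, and are not measurements of any real device. It assumes independent, equal-cost tasks that run in waves and ignores memory bandwidth, scheduling, and data transfer.

```python
import math


def batch_time(num_tasks: int, task_latency: float, parallel_threads: int) -> float:
    """Time to finish num_tasks independent, equal-cost tasks run in waves.

    Each wave runs up to parallel_threads tasks at once and takes task_latency.
    """
    waves = math.ceil(num_tasks / parallel_threads)
    return waves * task_latency


# Hypothetical numbers for illustration only (arbitrary time units).
cpu_like = dict(task_latency=1.0, parallel_threads=16)     # fast threads, few of them
gpu_like = dict(task_latency=10.0, parallel_threads=4096)  # slow threads, thousands of them

for n in (1, 100_000):
    cpu = batch_time(n, **cpu_like)
    gpu = batch_time(n, **gpu_like)
    print(f"{n:>7} tasks: cpu-like {cpu:>8.1f}, gpu-like {gpu:>8.1f}")
```

Under these assumed numbers, a single task finishes 10x sooner on the cpu-like device (1.0 vs 10.0), while 100,000 tasks finish 25x sooner on the gpu-like device (250.0 vs 6250.0), even though each of its threads is slower.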

Warning

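"The GPU is always faster than the CPU" is an oversimplification. The source only states that GPUs win on highly parallel workloads within a similar price and power envelope; for low-latency sequential work, a CPU is the better fit. In addition, devices such as FPGAs are also very energy efficient but offer far less programming flexibility than GPUs.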

Chip resource allocation (schematic, corresponding to Figure 1 in the source)
        CPU                              GPU
 ┌───────────────┐            ┌───────────────────────┐
 │ Control │Cache│            │ most area = many ALUs │
 │ ┌─┐ ┌─┐ │     │            │ ▢▢▢▢▢▢▢▢ ▢▢▢▢▢▢▢▢      │
 │ │ALU│ALU│large│            │ ▢▢▢▢▢▢▢▢ ▢▢▢▢▢▢▢▢      │
 │ └─┘ └─┘ │cache│            │ ▢▢▢▢▢▢▢▢ ▢▢▢▢▢▢▢▢      │
 └───────────────┘            └───────────────────────┘
 transistors → cache/control  transistors → data processing
 few threads, low latency     thousands of threads, high throughput

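Quick Start Paths

There are many ways to tap GPU compute power. This guide focuses on writing programs for the CUDA GPU platform in high-level languages such as C++, but many applications can use the GPU without writing GPU code directly. Three routes stand out: using optimized libraries directly, adopting AI frameworks, or writing higher-level GPU programs in DSLs.

Libraries: existing implementations of algorithms (especially those provided by NVIDIA) are usually more productive and more performant than rewriting them yourself. Examples: cuBLAS, cuFFT, cuDNN, CUTLASS; they are optimized for each GPU architecture and balance productivity, performance, and portability.
Frameworks: AI frameworks in particular provide GPU-accelerated building blocks; much of their acceleration comes from calling the GPU-accelerated libraries above under the hood.
DSLs (domain-specific languages): examples include NVIDIA's Warp and OpenAI's Triton, which compile to run directly on the CUDA platform and offer an even higher-level way to program the GPU than the high-level languages this guide covers.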

Tip

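Intuition for choosing: if a library covers it, don't write your own kernel; if you need customized model training or inference, use a framework; if you want to write high-level custom kernels without dealing with C++ details, consider a DSL such as Warp or Triton.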

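GPU programming abstraction layers (approximate, high to low)
  DSLs (Warp / Triton)        ← higher level than CUDA C++; compiled to the CUDA platform
        │
  AI Frameworks               ← often call GPU-accelerated libraries underneath
        │
  Libraries (cuBLAS/cuFFT/    ← use optimized libraries directly
   cuDNN/CUTLASS)
        │
  High-level CUDA (C++)       ← focus of this guide; write GPU code directly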

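Exam / Quiz Key Points

Scenario / keyword	Answer
Year CUDA was introduced	2006 (NVIDIA)
Year some graphics pipeline stages became programmable	2003
Full name of CUDA	Compute Unified Device Architecture
Two main GPU advantages over the CPU	Higher instruction throughput and memory bandwidth
Basis of comparison (premise)	A similar price and power envelope
What a thread refers to	A sequence of operations; the CPU excels at finishing a single thread as fast as possible
GPU vs CPU parallel scale	GPU: thousands of threads; CPU: only a few tens of threads
Where GPU transistors go	Data processing units
Where CPU transistors go	Data caching and flow control
Device that is also energy efficient but less flexible than a GPU	FPGA
Three ways to get started	Libraries, AI frameworks, DSLs
Four NVIDIA library examples	cuBLAS, cuFFT, cuDNN, CUTLASS
Two DSL examples	NVIDIA Warp, OpenAI Triton
How DSLs run on the GPU	They compile to run directly on the CUDA platform
Why AI frameworks are fast	They call GPU-accelerated libraries under the hood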
