CUDA Learning Map (MOC)
Overview
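This learning map covers the CUDA Programming Guide:
- Chapter 1, Introduction to CUDA: a language-agnostic overview of the CUDA programming model and platform, covering why to use a GPU, the execution model, programming paradigms, and the memory hierarchy, and closing with the CUDA software and hardware platform.
- Chapter 2, Programming GPUs in CUDA: hands-on GPU programming, covering kernels, memory, synchronization, and error handling in CUDA C++ and Python; SIMT kernels (memory spaces, coalescing, bank conflicts, atomics, occupancy); tile kernels; asynchronous execution (streams, events, graphs); unified and system memory; and the NVCC compiler.
Why each topic matters:
- Chapter 1 builds the mental model and terminology (throughput vs latency, thread → warp → block → grid, the memory hierarchy, CC/PTX/cubin).
- Chapter 2 turns those models into C++/Python code and goes deep on performance (coalescing, bank conflicts, occupancy), asynchronous overlap, and the compilation toolchain; it is the core of writing correct and efficient CUDA programs.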
Topic Map
Chapter 1: Introduction to CUDA
| Section | Notes | Status |
|---|---|---|
| 1.1 Benefits of Using GPUs and Ways to Get Started | GPU Computing Basics | [ ] |
| 1.2.1–1.2.2.2 Execution Model and SIMT | Execution Model and SIMT | [ ] |
| 1.2.2.3 Tile Programming Model | Tile Programming | [ ] |
| 1.2.3 GPU Memory Hierarchy | GPU Memory Hierarchy | [ ] |
| 1.3 CUDA Platform | CUDA Platform | [ ] |
Chapter 2: Programming GPUs in CUDA
| Section | Notes | Status |
|---|---|---|
| 2.1.1–2.1.2 CUDA C++ Kernels and Launch | CUDA C++ Kernels and Launch | [ ] |
| 2.1.3 CUDA C++ Memory Management | CUDA C++ Memory Management | [ ] |
| 2.1.4–2.1.6 Synchronization and End-to-End Workflow | Synchronization and End-to-End Workflow | [ ] |
| 2.1.7–2.1.10 Error Checking and Specifiers | Error Checking and Specifiers | [ ] |
| 2.2 CUDA Python | Introduction to CUDA Python | [ ] |
| 2.3.1–2.3.2 SIMT Basics and Thread Hierarchy | SIMT Basics and Thread Hierarchy | [ ] |
| 2.3.3 SIMT Device Memory Spaces | SIMT Device Memory Spaces | [ ] |
| 2.3.4 SIMT Memory Performance | SIMT Memory Performance | [ ] |
| 2.3.5–2.3.7 Atomics/Cooperative/Occupancy | Atomics/Cooperative/Occupancy | [ ] |
| 2.4.1–2.4.5 Tile Kernel Structure | Tile Kernel Structure | [ ] |
| 2.4.6–2.4.7 Tile Loads, Stores, and Control Flow | Tile Loads, Stores, and Control Flow | [ ] |
| 2.4.8–2.4.9 Tile Computation and Basic Operations | Tile Computation and Basic Operations | [ ] |
| 2.4.10–2.4.12 Tile Atomics and Optimization | Tile Atomics and Optimization | [ ] |
| 2.5.1–2.5.3 Asynchronous Streams and Events | Asynchronous Streams and Events | [ ] |
| 2.5.4–2.5.10 Callbacks/Ordering/Graphs | Callbacks/Ordering/Graphs | [ ] |
| 2.6 Unified and System Memory | Unified and System Memory | [ ] |
| 2.7 NVCC Compiler | NVCC Compiler | [ ] |
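Chapter 3: Advanced CUDA
| Section | Notes | Status |
|---|---|---|
| 3.1.1–3.1.2 Advanced Launch and Clusters | Advanced Launch and Clusters | [ ] |
| 3.1.3–3.1.4 Advanced Streams and Dependent Launch | Advanced Streams and Dependent Launch | [ ] |
| 3.1.5–3.1.6 Batched Transfers and Environment Variables | Batched Transfers and Environment Variables | [ ] |
| 3.2.1–3.2.2 Using PTX and the Hardware Model | Using PTX and the Hardware Model | [ ] |
| 3.2.3–3.2.4.1 Thread Scopes and Scoped Atomics | Thread Scopes and Scoped Atomics | [ ] |
| 3.2.4.2–3.2.4.3 Asynchronous Barriers and Pipelines | Asynchronous Barriers and Pipelines | [ ] |
| 3.2.5–3.2.6 Asynchronous Data Copies and L1/Shared Configuration | Asynchronous Data Copies and L1/Shared Configuration | [ ] |
| 3.3 CUDA Driver API | CUDA Driver API | [ ] |
| 3.4 Multi-GPU Programming | Multi-GPU Programming | [ ] |
| 3.5 CUDA Feature Tour | CUDA Feature Tour | [ ] |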
Chapter 4: CUDA Features
| Section | Notes | Status |
|---|---|---|
| 4.1.1 Unified Memory: Full Support | Unified Memory: Full Support | [ ] |
| 4.1.2–4.1.4 Unified Memory: Platforms and Performance Hints | Unified Memory: Platforms and Performance Hints | [ ] |
| 4.2.1–4.2.2 CUDA Graphs: Structure and Capture | CUDA Graphs: Structure and Capture | [ ] |
| 4.2.3–4.2.4 CUDA Graphs: Updates and Conditional Nodes | CUDA Graphs: Updates and Conditional Nodes | [ ] |
| 4.2.5–4.2.8 CUDA Graphs: Memory Nodes and Device Launch | CUDA Graphs: Memory Nodes and Device-Side Launch | [ ] |
| 4.3 Stream-Ordered Memory Allocator | Stream-Ordered Memory Allocator | [ ] |
| 4.4 Cooperative Groups in Depth | Cooperative Groups in Depth | [ ] |
| 4.5 Programmatic Dependent Launch in Depth | PDL in Depth | [ ] |
| 4.6 Green Contexts | Green Contexts | [ ] |
| 4.7–4.8 Lazy Loading and Error Log | Lazy Loading and Error Log | [ ] |
| 4.9 Asynchronous Barriers in Depth | Asynchronous Barriers in Depth | [ ] |
| 4.10 Pipelines in Depth | Pipelines in Depth | [ ] |
| 4.11.1 Asynchronous Copy: LDGSTS | Asynchronous Copy: LDGSTS | [ ] |
| 4.11.2 Asynchronous Copy: TMA | Asynchronous Copy: TMA | [ ] |
| 4.11.3 Asynchronous Copy: STAS | Asynchronous Copy: STAS | [ ] |
| 4.12 Work Stealing and Cluster Launch Control | Work Stealing and Cluster Launch Control | [ ] |
| 4.13 L2 Cache Control | L2 Cache Control | [ ] |
| 4.14 Memory Synchronization Domains | Memory Synchronization Domains | [ ] |
| 4.15 Interprocess Communication | Interprocess Communication | [ ] |
| 4.16 Virtual Memory Management | Virtual Memory Management | [ ] |
| 4.17 Extended GPU Memory | Extended GPU Memory | [ ] |
| 4.18 CUDA Dynamic Parallelism | CUDA Dynamic Parallelism | [ ] |
| 4.19.1 Graphics Interoperability | Graphics Interoperability | [ ] |
| 4.19.2 External Resource Interop | External Resource Interop | [ ] |
| 4.20 Driver Entry Point Access | Driver Entry Point Access | [ ] |
Practice Notes
| Practice Set | Notes | Questions | Status |
|---|---|---|---|
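| Chapter 1: Introduction to CUDA Practice | Chapter 1 Practice Questions | 10 | [ ] |
| Chapter 2: Programming GPUs Practice | Chapter 2 Practice Questions | 20 | [ ] |
| Chapter 3: Advanced CUDA Practice | Chapter 3 Practice Questions | 15 | [ ] |
| Chapter 4: CUDA Features Practice | Chapter 4 Practice Questions | 25 | [ ] |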
Study Tools
Tag Index
| Tag | Related Topics | Level |
|---|---|---|
| cuda | All chapters: CUDA programming model, platform, and implementation | top |
| gpu-architecture | GPU vs CPU, SM/GPC conceptual hardware model | domain |
| execution-model | heterogeneous systems, thread hierarchy, warp scheduling, occupancy | domain |
| programming-model | tile programming and its relationship to SIMT | domain |
| memory | DRAM/global, on-chip, cache, unified, memory performance | domain |
| cuda-platform | Compute Capability, PTX, compilation and compatibility | domain |
| cuda-cpp | CUDA C++ kernels/memory/synchronization/error handling/specifiers | domain |
| cuda-python | CUDA Python kernels/memory/synchronization/error handling | domain |
| async-execution | streams, events, callbacks, stream ordering, graphs | domain |
| driver-api | CUDA Driver API, context, module, cuLaunchKernel | domain |
| multi-gpu | multi-device management, peer-to-peer, device selection | domain |
| cuda-graphs | graph structure, stream capture, conditional/memory nodes, device launch | domain |
| gpu-vs-cpu | throughput vs latency, differences in transistor allocation | detail → gpu-architecture |
| thread-hierarchy | thread → warp → block → cluster → grid | detail → execution-model |
| simt | SIMT paradigm, SIMT vs SIMD, __syncthreads | detail → execution-model |
| warp | 32 threads/warp, lane masking, warp divergence | detail → execution-model |
| atomics | race avoidance, atomicAdd, C++/Python/tile atomics | detail → execution-model / programming-model |
| cooperative-groups | group handle, collective, cluster.sync | detail → execution-model |
| occupancy | ratio of active warps to the maximum, resource limits, maxrregcount | detail → execution-model |
| kernel-launch | <<<>>>, cudaLaunchKernelEx, grid-sizing, launch | detail → cuda-cpp / cuda-python / programming-model |
| memory-management | cudaMalloc/cudaMemcpy/cudaMallocManaged | detail → cuda-cpp / memory |
| error-handling | cudaGetLastError, async error, CUDA_LOG_FILE | detail → cuda-cpp |
| tile-programming | array/tile, tile space, broadcasting, load/store | detail → programming-model |
| tile-primitives | matmul, reductions, transpose, selection | detail → programming-model |
| shared-memory | per-block on-chip memory, unified data cache, L1 | detail → memory |
| unified-memory | cudaMallocManaged, UVA, HMM/ATS, mapped-memory exception | detail → memory |
| memory-coalescing | 32-byte transactions, coalescing conditions, utilization | detail → memory |
| bank-conflict | 32 banks, stride, padding as a fix | detail → memory |
| page-locked-memory | cudaMallocHost, pinned, prerequisite for async copies | detail → memory |
| cuda-stream | creation/synchronization, overlap, ordering, default stream | detail → async-execution |
| cuda-event | timing, status query, cross-stream dependencies | detail → async-execution |
| compute-capability | CC X.Y, SM version, binary compatibility | detail → cuda-platform |
| ptx | virtual ISA, compute_XY, forward compatibility | detail → cuda-platform |
| compilation | cubin/fatbin, JIT, NVRTC | detail → cuda-platform |
| nvcc | compilation flow, PTX/cubin generation, separate compilation, LTO | detail → cuda-platform |
| programmatic-dependent-launch | PDL, primary/secondary kernel overlap | detail → async-execution |
| independent-thread-scheduling | Volta+ per-thread program counter, __syncwarp | detail → execution-model |
| thread-scope | five levels: thread/block/cluster/device/system | detail → execution-model |
| scoped-atomics | cuda::atomic with an explicit scope, coherency point | detail → execution-model / memory |
| memory-ordering | relaxed/acquire/release/acq_rel/seq_cst | detail → execution-model |
| async-barrier | cuda::barrier, arrive/wait, producer-consumer | detail → async-execution |
| pipeline | cuda::pipeline, multi-stage prefetch | detail → async-execution |
| async-copy | memcpy_async, LDGSTS, global→shared | detail → async-execution / memory |
| cuda-context | primary context, context stack, cuCtxPush/Pop | detail → driver-api |
| peer-to-peer | cudaMemcpyPeer, cudaDeviceEnablePeerAccess, P2P access | detail → multi-gpu |
| green-context | SM partitioning, low latency, resource descriptor | detail → cuda-platform |
| tma | Tensor Memory Accelerator, tensor map, bank swizzle | detail → async-execution |
| work-stealing | cluster launch control, block cancellation | detail → execution-model |
| l2-cache | set-aside, persisting access, access policy | detail → memory |
| memory-sync-domain | fence interference, traffic isolation | detail → memory |
| ipc | cudaIpcGetMemHandle, VMM IPC, cross-process sharing | detail → memory |
| virtual-memory | cuMemCreate/cuMemMap, reserve/map, multicast | detail → memory |
| extended-gpu-memory | EGM, socket id, NUMA | detail → memory |
| dynamic-parallelism | CDP2, parent/child grid, device-side launch | detail → execution-model |
| graphics-interop | OpenGL/Direct3D/SLI interop | detail → cuda-platform |
| external-interop | Vulkan/Direct3D/NVSCI, external memory/semaphore | detail → cuda-platform |
| concept | 57 concept notes (5 + 17 + 10 + 25) | note-type |
| practice | practice question files | note-type |
| dashboard | this MOC, interview trap questions, cheat sheets | note-type |
| exam-traps | interview trap question files (Exam Traps) | note-type |
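Tag hierarchy rule: the hierarchy is top → domain → detail, and every detail tag must also carry its parent domain tag. Cross-paradigm technique tags (such as kernel-launch, atomics, and memory-management) take the domain that matches the context in which they appear: cuda-cpp in C++ notes, cuda-python in Python notes, programming-model in tile notes, and so on.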
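The tag hierarchy rule can be checked mechanically. The sketch below is only an illustration, not part of the vault tooling: the `PARENTS` map and the `missing_parents` helper are hypothetical names, and the map copies just a few rows from the Tag Index. It assumes that a tag with several possible parents is satisfied by any one of them, which is how the cross-paradigm clause reads.

```python
# Hypothetical parent map, copied by hand from a few Tag Index rows.
# Tags not listed here (top, domain, note-type) are never checked.
PARENTS = {
    "warp": {"execution-model"},
    "occupancy": {"execution-model"},
    "memory-coalescing": {"memory"},
    "kernel-launch": {"cuda-cpp", "cuda-python", "programming-model"},
}


def missing_parents(note_tags):
    """Return {detail_tag: [allowed parents]} for detail tags with no parent present."""
    tags = set(note_tags)
    problems = {}
    for tag in tags:
        parents = PARENTS.get(tag)
        if parents is None:
            continue  # not a known detail tag, nothing to enforce
        if not (parents & tags):  # assumption: any one listed parent is enough
            problems[tag] = sorted(parents)
    return problems


if __name__ == "__main__":
    # A C++ note on kernel launch that forgot its domain tag.
    print(missing_parents(["cuda", "kernel-launch"]))
```

A note tagged `cuda`, `warp`, `execution-model` passes, while one tagged only `cuda`, `warp` is reported as missing `execution-model`.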
Weak Areas
Chapter 1
Chapter 2
Chapter 3
Chapter 4
Non-core Topic Policy
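| Topic Scope | Category | Handling | Rationale |
|---|---|---|---|
| All Chapter 1 topics | Core teaching content | Included in full | The whole chapter covers core, language-agnostic concepts of the CUDA programming model and platform; nothing is non-academic or needs excluding |
| All Chapter 2 topics | Core teaching content | Included in full | C++, Python, SIMT, tile, async, memory, and NVCC are all core to implementing CUDA programs |
| All Chapter 3 topics | Core teaching content | Included in full | Advanced APIs, advanced kernels, the Driver API, multi-GPU, and the feature tour are all advanced core material |
| All Chapter 4 topics | Core teaching content | Included in full | Unified memory, graphs, virtual memory, dynamic parallelism, interop, and the rest are all core advanced CUDA features |
None of the four chapters contains non-academic or excluded content; every chapter of the guide covered so far is core teaching material and is included in the learning map in full.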