CUDA C++ Programming Guide

mutourend, published 2021-12-19 20:50:21

1. Introduction

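At a similar price and within a similar power envelope, a Graphics Processing Unit (GPU) provides much higher instruction throughput and memory bandwidth than a CPU. Other computing devices, such as FPGAs, are also energy efficient, but they offer less programming flexibility than GPUs.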

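GPUs and CPUs are designed with different goals:

1) A CPU is designed to excel at executing a sequence of operations, called a thread, as fast as possible. It can execute a few tens of such threads in parallel.
2) A GPU is designed to excel at executing thousands of threads in parallel (amortizing the slower single-thread performance to achieve greater throughput).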

The GPU specializes in highly parallel computation, so it devotes more transistors to data processing (such as floating-point computation) than to data caching and flow control. A GPU can hide memory access latencies with computation instead of relying on large data caches and complex flow control to avoid long memory access latencies, because both data caches and flow control are expensive in terms of transistors.

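Applications usually contain both serial and parallel parts, so systems are typically designed as a mix of GPUs and CPUs to maximize overall performance. Applications with a high degree of parallelism can exploit the parallel nature of the GPU to achieve higher performance than on a CPU.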

2. CUDA®: A General-Purpose Parallel Computing Platform and Programming Model

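In November 2006, NVIDIA® introduced CUDA®, a general-purpose parallel computing platform and programming model that lets developers use C++ (among other languages) as a high-level programming language. To run CUDA code you can buy a CUDA-capable graphics card or rent a GPU cloud server.

The advent of multicore CPUs and manycore GPUs means that mainstream processor chips are now parallel systems. The challenge for software development is to transparently scale an application's parallelism so that it takes advantage of the increasing number of processor cores. CUDA is designed to keep the learning curve low for programmers familiar with C.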

At the core of CUDA are three key abstractions:

1）a hierarchy of thread groups
2）shared memories
3）barrier synchronization

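These abstractions provide fine-grained data parallelism and thread parallelism, nested within coarse-grained data parallelism and task parallelism. They guide the programmer to partition the problem into coarse sub-problems that can be solved independently in parallel by blocks of threads, and to partition each sub-problem into finer pieces that can be solved cooperatively in parallel by all threads within the block. This decomposition preserves language expressivity by allowing threads to cooperate when solving each sub-problem, and at the same time enables automatic scalability. That is: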

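Coarse-grained sub-problems: solved independently in parallel by blocks of threads. Each block can be scheduled on any of the available multiprocessors within a GPU, in any order, concurrently or sequentially.
Fine-grained pieces within each sub-problem: solved cooperatively in parallel by the threads of a single block.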
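
The independence of blocks is what makes this scheme scale, and it can be illustrated without a GPU. The following is a minimal host-side C++ sketch, not CUDA code: `runBlock` and `squareInBlocks` are hypothetical names invented for this illustration, the inner loop stands in for the threads of one block, and the shuffled loop stands in for a scheduler that is free to pick blocks in any order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical stand-in for one thread block: squares the elements in
// [begin, end) of `in` and writes them to `out`. Each loop iteration plays
// the role of one thread inside the block.
void runBlock(const std::vector<int>& in, std::vector<int>& out,
              std::size_t begin, std::size_t end)
{
    for (std::size_t t = begin; t < end; ++t)
        out[t] = in[t] * in[t];
}

// Splits the input into blocks of `blockSize` elements and runs the blocks
// in a pseudo-random order derived from `seed` (an assumption of this
// sketch, standing in for an arbitrary scheduling order).
std::vector<int> squareInBlocks(const std::vector<int>& in,
                                std::size_t blockSize, unsigned seed)
{
    assert(blockSize > 0);
    const std::size_t numBlocks = (in.size() + blockSize - 1) / blockSize;

    // Build the list of block indices and shuffle it.
    std::vector<std::size_t> order(numBlocks);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::mt19937 rng(seed);
    std::shuffle(order.begin(), order.end(), rng);

    std::vector<int> out(in.size());
    for (std::size_t b : order) {
        const std::size_t begin = b * blockSize;
        const std::size_t end = std::min(begin + blockSize, in.size());
        runBlock(in, out, begin, end);  // blocks never depend on each other
    }
    return out;
}
```

Because no block reads what another block writes, calling `squareInBlocks` with different seeds or block sizes gives the same output. That is the property that lets the same program run on GPUs with different numbers of multiprocessors.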

2.1 kernels

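CUDA C++ extends C++ by letting the programmer define functions called kernels. Unlike a regular C++ function, which runs once per call, a kernel is executed N times in parallel by N different CUDA threads when it is called.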

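A kernel is defined with the __global__ declaration specifier, and the number of CUDA threads that execute it for a given call is specified with an execution configuration, written <<<...>>>. Each thread that executes the kernel is given a unique thread ID, which the kernel can read through built-in variables.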

The following example uses the built-in variable threadIdx to add two vectors A and B of size N and store the result in vector C.

// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
 int i = threadIdx.x;
 C[i] = A[i] + B[i];
}
int main()
{
 ...
 // Kernel invocation with N threads
 VecAdd<<<1, N>>>(A, B, C);
 ...
}
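
To make the "executed N times by N threads" idea concrete, here is a minimal host-side C++ sketch that emulates the launch sequentially. It is not CUDA code and runs nothing in parallel: `ThreadIdx`, `vecAddBody` and `launchVecAdd` are hypothetical names made up for this illustration, and the loop simply calls the kernel body once per emulated thread with a distinct index.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for CUDA's built-in threadIdx (x component only).
struct ThreadIdx { unsigned x; };

// The body of the VecAdd kernel, with the thread index passed explicitly
// instead of being supplied by the CUDA runtime.
void vecAddBody(ThreadIdx threadIdx, const float* A, const float* B, float* C)
{
    unsigned i = threadIdx.x;
    C[i] = A[i] + B[i];
}

// Sequential emulation of launching one block with A.size() threads:
// the body runs once per thread, each time with a unique thread index.
std::vector<float> launchVecAdd(const std::vector<float>& A,
                                const std::vector<float>& B)
{
    assert(A.size() == B.size());
    std::vector<float> C(A.size());
    for (unsigned t = 0; t < A.size(); ++t)
        vecAddBody(ThreadIdx{t}, A.data(), B.data(), C.data());
    return C;
}
```

On a real GPU the iterations of this loop would be separate threads running concurrently, which is why the kernel body contains no loop of its own.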

2.2 thread hierarchy

threadIdx is a 3-component vector, so threads can be identified using a one-dimensional, two-dimensional, or three-dimensional thread index. The resulting one-dimensional, two-dimensional, or three-dimensional block of threads is called a thread block. This provides a natural way to invoke computation across the elements in a domain such as a vector, matrix, or volume.

For example, the following code adds two matrices A and B of size N x N and stores the result in matrix C:

// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N],
 float C[N][N])
{
 int i = threadIdx.x;
 int j = threadIdx.y;
 C[i][j] = A[i][j] + B[i][j];
}
int main()
{
 ...
 // Kernel invocation with one block of N * N * 1 threads
 int numBlocks = 1;
 dim3 threadsPerBlock(N, N);
 MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
 ...
}

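The number of threads per block is limited, because all threads of a block run on the same processor core and must share that core's limited memory resources. On current GPUs, a thread block can contain at most 1024 threads.

However, a kernel can be executed by multiple thread blocks, so the total number of threads equals the number of blocks times the number of threads per block (numBlocks * threadsPerBlock). The number of blocks is therefore determined by the amount of data to process. Blocks are organized into a one-dimensional, two-dimensional, or three-dimensional grid of thread blocks.

When numBlocks > 1, the kernel needs the built-in variables blockIdx and blockDim to compute each thread's global index: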

// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N],
float C[N][N])
{
 int i = blockIdx.x * blockDim.x + threadIdx.x;
 int j = blockIdx.y * blockDim.y + threadIdx.y;
 // Skip threads that fall outside the N x N matrix
 if (i < N && j < N)
  C[i][j] = A[i][j] + B[i][j];
}
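
The index arithmetic blockIdx * blockDim + threadIdx can be checked on the host. The sketch below is plain C++ rather than CUDA, and the names `blocksFor` and `coverage` are hypothetical helpers invented for this illustration. It assumes the grid size is chosen by rounding up (so the grid may contain more threads than matrix elements) and that the kernel guards its work with an `i < n && j < n` check; both are assumptions of this sketch, not something the text above states.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of blocks needed to cover n elements when each block holds
// threadsPerBlock threads (integer division rounded up).
int blocksFor(int n, int threadsPerBlock)
{
    assert(n >= 0 && threadsPerBlock > 0);
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Emulates a 2-D grid of 2-D blocks over an n x n matrix and returns, for
// every matrix cell, how many emulated threads mapped to it.
std::vector<int> coverage(int n, int blockDimX, int blockDimY)
{
    const int gridDimX = blocksFor(n, blockDimX);
    const int gridDimY = blocksFor(n, blockDimY);
    std::vector<int> hits(static_cast<std::size_t>(n) * n, 0);

    for (int blockIdxY = 0; blockIdxY < gridDimY; ++blockIdxY)
      for (int blockIdxX = 0; blockIdxX < gridDimX; ++blockIdxX)
        for (int threadIdxY = 0; threadIdxY < blockDimY; ++threadIdxY)
          for (int threadIdxX = 0; threadIdxX < blockDimX; ++threadIdxX) {
              // Same global index formula as in the kernel.
              const int i = blockIdxX * blockDimX + threadIdxX;
              const int j = blockIdxY * blockDimY + threadIdxY;
              // Assumed guard: surplus threads at the edges do nothing.
              if (i < n && j < n)
                  ++hits[static_cast<std::size_t>(i) * n + j];
          }
    return hits;
}
```

For instance, `blocksFor(100, 16)` gives 7 blocks per dimension, which supplies 112 threads for 100 elements. Every entry returned by `coverage` is 1, meaning each matrix element is handled by exactly one thread and the surplus threads are filtered out by the guard.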
