1. Introduction to CUDA
CUDA is a platform and programming model for CUDA-enabled GPUs. The platform performs computation on the GPU. CUDA provides C/C++ language extensions and APIs for programming and managing GPUs.
In CUDA programming, both the CPU and the GPU are used for computation:
- The CPU system is called the host.
- The GPU system is called the device.
The CPU and GPU are separate systems, each with its own memory space. Typically, serial work runs on the CPU, while parallel computation is offloaded to the GPU.
2. Comparing CUDA and C
Using a hello-world program as the example, the two biggest differences are the __global__ specifier and the <<<...>>> syntax:
- __global__ specifier: marks a function that runs on the device (GPU). Such functions are called from host code, e.g. from main(), and are also referred to as "kernels".
- <<<...>>> syntax: when a kernel is called, its execution configuration is supplied through this language extension, e.g. cuda_hello<<<1,1>>>(). In CUDA terminology, this is called a "kernel launch".
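As an illustration, here is a minimal hello-world sketch (the kernel name cuda_hello comes from the launch example above; the printf body and overall layout are assumptions):

#include <stdio.h>

// __global__ marks a kernel: it runs on the device but is launched from host code
__global__ void cuda_hello() {
    printf("Hello World from GPU!\n");
}

int main() {
    // kernel launch with an execution configuration of 1 block x 1 thread
    cuda_hello<<<1, 1>>>();
    cudaDeviceSynchronize();   // wait for the device-side printf to finish
    return 0;
}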
Compiling a CUDA program is similar to compiling a C program. NVIDIA's CUDA toolkit provides a CUDA compiler named nvcc for compiling CUDA code; source files conventionally use the .cu extension.
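For example, assuming the hello-world sketch above is saved as cuda_hello.cu:
$> nvcc cuda_hello.cu -o cuda_hello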
Take vector addition as an example. The corresponding C implementation (vector_add.c) is:
#define N 10000000

void vector_add(float *out, float *a, float *b, int n) {
    for(int i = 0; i < n; i++){
        out[i] = a[i] + b[i];
    }
}
Its runtime can be measured with:
$> time ./vector_add
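The nvprof results below profile a CUDA port of this function. As a sketch of that port (assuming the serial loop is kept and executed by a single GPU thread; the host-side setup mirrors the vector_add_thread.cu listing later in this section):

// vector_add.cu: the same serial loop, executed by one GPU thread
__global__ void vector_add(float *out, float *a, float *b, int n) {
    for(int i = 0; i < n; i++){
        out[i] = a[i] + b[i];
    }
}

// launched from main() with a single block containing a single thread:
// vector_add<<<1, 1>>>(d_out, d_a, d_b, N);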
NVIDIA also provides a command-line profiler named nvprof, which gives more detailed performance information:
$> nvprof ./vector_add
On a Tesla M2050, for example, the profiling result is:
==6326== Profiling application: ./vector_add
==6326== Profiling result:
Time(%) Time Calls Avg Min Max Name
97.55% 1.42529s 1 1.42529s 1.42529s 1.42529s vector_add(float*, float*, float*, int)
1.39% 20.318ms 2 10.159ms 10.126ms 10.192ms [CUDA memcpy HtoD]
1.06% 15.549ms 1 15.549ms 15.549ms 15.549ms [CUDA memcpy DtoH]
3. Parallelizing with CUDA
CUDA uses the kernel execution configuration to tell the CUDA runtime how many threads to launch on the GPU. CUDA organizes threads into groups called "thread blocks"; a kernel can launch multiple thread blocks, which are organized into a "grid" structure.
The syntax of the kernel execution configuration is:
<<< M, T >>>
which indicates that the kernel launches with a grid of M thread blocks, where each thread block has T parallel threads.
Next, multithreading is used to parallelize the vector addition above, e.g. with one thread block of 256 threads. The corresponding kernel execution configuration is:
vector_add<<<1, 256>>>(d_out, d_a, d_b, N);
CUDA provides built-in variables for accessing thread information. This example uses the following two:
- threadIdx.x: the index of the thread within the block. In this example, the index ranges from 0 to 255.
- blockDim.x: the size of the thread block (the number of threads in the block). In this example, the value is 256.
The complete vector_add_thread.cu source code is:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <assert.h>
#include <cuda.h>
#include <cuda_runtime.h>
#define N 10000000
#define MAX_ERR 1e-6
__global__ void vector_add(float *out, float *a, float *b, int n) {
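// index: this thread's position within the block (0..255 here);
// stride: the block size, so the 256 threads cooperatively cover all n elements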
int index = threadIdx.x;
int stride = blockDim.x;
for(int i = index; i < n; i += stride){
out[i] = a[i] + b[i];
}
}
int main(){
float *a, *b, *out;
float *d_a, *d_b, *d_out;
// Allocate host memory
a = (float*)malloc(sizeof(float) * N);
b = (float*)malloc(sizeof(float) * N);
out = (float*)malloc(sizeof(float) * N);
// Initialize host arrays
for(int i = 0; i < N; i++){
a[i] = 1.0f;
b[i] = 2.0f;
}
// Allocate device memory
cudaMalloc((void**)&d_a, sizeof(float) * N);
cudaMalloc((void**)&d_b, sizeof(float) * N);
cudaMalloc((void**)&d_out, sizeof(float) * N);
// Transfer data from host to device memory
cudaMemcpy(d_a, a, sizeof(float) * N, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, sizeof(float) * N, cudaMemcpyHostToDevice);
// Executing kernel
vector_add<<<1, 256>>>(d_out, d_a, d_b, N);
// Transfer data back to host memory
cudaMemcpy(out, d_out, sizeof(float) * N, cudaMemcpyDeviceToHost);
// Verification
for(int i = 0; i < N; i++){
assert(fabs(out[i] - a[i] - b[i]) < MAX_ERR);
}
printf("PASSED\n");
// Deallocate device memory
cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_out);
// Deallocate host memory
free(a);
free(b);
free(out);
}
$> nvcc vector_add_thread.cu -o vector_add_thread
$> nvprof ./vector_add_thread
The corresponding performance is:
==6430== Profiling application: ./vector_add_thread
==6430== Profiling result:
Time(%) Time Calls Avg Min Max Name
39.18% 22.780ms 1 22.780ms 22.780ms 22.780ms vector_add(float*, float*, float*, int)
34.93% 20.310ms 2 10.155ms 10.137ms 10.173ms [CUDA memcpy HtoD]
25.89% 15.055ms 1 15.055ms 15.055ms 15.055ms [CUDA memcpy DtoH]
The above uses a single thread block. CUDA GPUs have multiple parallel processors called Streaming Multiprocessors (SMs); each SM contains multiple parallel execution units and can run several concurrent thread blocks. To take full advantage of a CUDA GPU, a kernel should therefore launch multiple thread blocks. For this, CUDA provides two more built-in variables:
- blockIdx.x: the index of the block within the grid.
- gridDim.x: the size of the grid (the number of blocks).
If a total of N threads is needed and each thread block has 256 threads, at least N/256 thread blocks (rounded up) are required. Each thread must have a unique index, computed as:
int tid = blockIdx.x * blockDim.x + threadIdx.x;
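For example, with blockDim.x = 256, thread 3 of block 2 gets tid = 2 * 256 + 3 = 515.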
The multi-block vector_add_grid.cu source code is:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <assert.h>
#include <cuda.h>
#include <cuda_runtime.h>
#define N 10000000
#define MAX_ERR 1e-6
__global__ void vector_add(float *out, float *a, float *b, int n) {
int tid = blockIdx.x * blockDim.x + threadIdx.x;
// Handling arbitrary vector size
if (tid < n){
out[tid] = a[tid] + b[tid];
}
}
int main(){
float *a, *b, *out;
float *d_a, *d_b, *d_out;
// Allocate host memory
a = (float*)malloc(sizeof(float) * N);
b = (float*)malloc(sizeof(float) * N);
out = (float*)malloc(sizeof(float) * N);
// Initialize host arrays
for(int i = 0; i < N; i++){
a[i] = 1.0f;
b[i] = 2.0f;
}
// Allocate device memory
cudaMalloc((void**)&d_a, sizeof(float) * N);
cudaMalloc((void**)&d_b, sizeof(float) * N);
cudaMalloc((void**)&d_out, sizeof(float) * N);
// Transfer data from host to device memory
cudaMemcpy(d_a, a, sizeof(float) * N, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, sizeof(float) * N, cudaMemcpyHostToDevice);
// Executing kernel
int block_size = 256;
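// enough blocks to cover all N elements; any surplus threads are masked out
// by the tid < n check in the kernel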
int grid_size = ((N + block_size) / block_size);
vector_add<<<grid_size, block_size>>>(d_out, d_a, d_b, N);
// Transfer data back to host memory
cudaMemcpy(out, d_out, sizeof(float) * N, cudaMemcpyDeviceToHost);
// Verification
for(int i = 0; i < N; i++){
assert(fabs(out[i] - a[i] - b[i]) < MAX_ERR);
}
printf("PASSED\n");
// Deallocate device memory
cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_out);
// Deallocate host memory
free(a);
free(b);
free(out);
}
Compile and profile:
$> nvcc vector_add_grid.cu -o vector_add_grid
$> nvprof ./vector_add_grid
Performance on a Tesla M2050:
==6564== Profiling application: ./vector_add_grid
==6564== Profiling result:
Time(%) Time Calls Avg Min Max Name
55.65% 20.312ms 2 10.156ms 10.150ms 10.162ms [CUDA memcpy HtoD]
41.24% 15.050ms 1 15.050ms 15.050ms 15.050ms [CUDA memcpy DtoH]
3.11% 1.1347ms 1 1.1347ms 1.1347ms 1.1347ms vector_add(float*, float*, float*, int)
4. Performance comparison
| Version | Execution Time (ms) | Speedup |
| --- | --- | --- |
| 1 thread | 1425.29 | 1.00x |
| 1 block | 22.78 | 62.56x |
| Multiple blocks | 1.13 | 1261.32x |
5. OpenCL
OpenCL stands for Open Computing Language. OpenCL is:
- An open, royalty-free standard C-language extension
- For parallel programming of heterogeneous systems using GPUs, CPUs, CBE, DSPs, and other processors, including embedded mobile devices
- Initially proposed by Apple, which put OpenCL into OS X Snow Leopard and is active in the working group; the working group also includes NVIDIA, Intel, AMD, IBM, and others
- Managed by the Khronos Group, which also manages the OpenGL standard
The basic OpenCL program structure is:
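A sketch of the typical steps, using the standard OpenCL C API:
- Discover a platform and device (clGetPlatformIDs, clGetDeviceIDs)
- Create a context and a command queue (clCreateContext, clCreateCommandQueue)
- Build the kernel source and create a kernel object (clCreateProgramWithSource, clBuildProgram, clCreateKernel)
- Create device buffers and copy input data (clCreateBuffer, clEnqueueWriteBuffer)
- Set kernel arguments and enqueue the kernel (clSetKernelArg, clEnqueueNDRangeKernel)
- Read results back and release resources (clEnqueueReadBuffer, clReleaseMemObject)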