
Getting Started with CUDA

mutourend · published 2021-12-12 23:05:06

1. Introduction

CUDA is a platform and programming model for CUDA-enabled GPUs, which uses the GPU to perform computation. CUDA provides C/C++ language extensions and APIs for GPU programming and management.

A CUDA program uses both the CPU and the GPU for computation:

  • The CPU system is called the host.
  • The GPU system is called the device.

The CPU and GPU are independent systems with their own memory spaces. Typically, serial work runs on the CPU, while parallel computation is offloaded to the GPU.

2. CUDA vs. C

Take the Hello World program as an example (the original side-by-side figure is omitted; a sketch of the CUDA version follows the list below). The two biggest differences are the __global__ specifier and the <<<...>>> syntax:

  • __global__ specifier: marks a function that runs on the device (GPU). Such functions are invoked from host code, e.g. from main(), and are referred to as "kernels".
  • <<<...>>> syntax: when a kernel is called, its execution configuration is supplied with this syntax, e.g. cuda_hello<<<1,1>>>(). In CUDA terminology, this is called a "kernel launch".
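As a concrete illustration, here is a minimal sketch of the CUDA Hello World program (standing in for the omitted figure; the file name hello.cu is assumed):

#include <stdio.h>

// __global__ marks a kernel: called from host code, executed on the device.
__global__ void cuda_hello() {
    printf("Hello World from GPU!\n");
}

int main() {
    // <<<1, 1>>>: an execution configuration of 1 thread block with 1 thread.
    cuda_hello<<<1, 1>>>();
    // Kernel launches are asynchronous; synchronize so the device-side
    // printf output is flushed before the program exits.
    cudaDeviceSynchronize();
    return 0;
}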

Compiling a CUDA program is similar to compiling a C program. NVIDIA provides a CUDA compiler called nvcc in its CUDA toolkit to compile CUDA code; source files conventionally use the .cu extension.
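For example, assuming the hello.cu sketch above:

$> nvcc hello.cu -o hello
$> ./hello
Hello World from GPU!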

Take vector addition as an example. The C implementation (vector_add.c) is:

#define N 10000000

void vector_add(float *out, float *a, float *b, int n) {
    for(int i = 0; i < n; i++){
        out[i] = a[i] + b[i];
    }
}

Measure its running time with the time command:

$> time ./vector_add
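The surrounding driver code is not shown in the post; a minimal host-only main(), mirroring the initialization used in the CUDA versions below, might look like:

#include <stdio.h>
#include <stdlib.h>

int main(){
    // Allocate and initialize host arrays (same values as the CUDA versions).
    float *a   = (float*)malloc(sizeof(float) * N);
    float *b   = (float*)malloc(sizeof(float) * N);
    float *out = (float*)malloc(sizeof(float) * N);
    for(int i = 0; i < N; i++){
        a[i] = 1.0f;
        b[i] = 2.0f;
    }

    vector_add(out, a, b, N);   // serial computation, entirely on the CPU

    printf("out[0] = %f\n", out[0]);

    free(a);
    free(b);
    free(out);
    return 0;
}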

NVIDIA also provides a command-line profiler called nvprof, which reports more detailed performance information:

$> nvprof ./vector_add

On a Tesla M2050, the profiling result is:

==6326== Profiling application: ./vector_add
==6326== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 97.55%  1.42529s         1  1.42529s  1.42529s  1.42529s  vector_add(float*, float*, float*, int)
  1.39%  20.318ms         2  10.159ms  10.126ms  10.192ms  [CUDA memcpy HtoD]
  1.06%  15.549ms         1  15.549ms  15.549ms  15.549ms  [CUDA memcpy DtoH]

3. Parallelizing with CUDA

CUDA uses the kernel execution configuration to tell the CUDA runtime how many threads to launch on the GPU. CUDA organizes threads into groups called "thread blocks", and a kernel can launch multiple thread blocks, organized into a "grid" structure.

The syntax of the kernel execution configuration is:

<<< M , T >>>

This launches the kernel with a grid of M thread blocks, each containing T parallel threads.

Next, we parallelize the vector addition above with multiple threads, e.g. one thread block with 256 threads. The corresponding kernel execution configuration is:

vector_add<<<1, 256>>>(d_out, d_a, d_b, N);

CUDA provides built-in variables for accessing thread information; this example uses the following two:

  • threadIdx.x: the index of the thread within the block. In this example, it ranges over 0~255.
  • blockDim.x: the size of the thread block (the number of threads in it). Here it is 256.

[Figure omitted: each of the 256 threads striding over the N-element array.]

The complete vector_add_thread.cu source:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <assert.h>
#include <cuda.h>
#include <cuda_runtime.h>

#define N 10000000
#define MAX_ERR 1e-6

__global__ void vector_add(float *out, float *a, float *b, int n) {
    int index = threadIdx.x;   // this thread's index within the block: 0..255
    int stride = blockDim.x;   // number of threads in the block: 256

    // Each thread processes elements index, index + 256, index + 512, ...
    for(int i = index; i < n; i += stride){
        out[i] = a[i] + b[i];
    }
}

int main(){
    float *a, *b, *out;
    float *d_a, *d_b, *d_out; 

    // Allocate host memory
    a   = (float*)malloc(sizeof(float) * N);
    b   = (float*)malloc(sizeof(float) * N);
    out = (float*)malloc(sizeof(float) * N);

    // Initialize host arrays
    for(int i = 0; i < N; i++){
        a[i] = 1.0f;
        b[i] = 2.0f;
    }

    // Allocate device memory 
    cudaMalloc((void**)&d_a, sizeof(float) * N);
    cudaMalloc((void**)&d_b, sizeof(float) * N);
    cudaMalloc((void**)&d_out, sizeof(float) * N);

    // Transfer data from host to device memory
    cudaMemcpy(d_a, a, sizeof(float) * N, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, sizeof(float) * N, cudaMemcpyHostToDevice);

    // Executing kernel 
    vector_add<<<1, 256>>>(d_out, d_a, d_b, N);
    
    // Transfer data back to host memory
    cudaMemcpy(out, d_out, sizeof(float) * N, cudaMemcpyDeviceToHost);

    // Verification
    for(int i = 0; i < N; i++){
        assert(fabs(out[i] - a[i] - b[i]) < MAX_ERR);
    }

    printf("PASSED\n");

    // Deallocate device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);

    // Deallocate host memory
    free(a); 
    free(b); 
    free(out);
}
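Note that, like the original post, this listing ignores CUDA API return codes. A common hardening pattern (the CUDA_CHECK macro below is a hypothetical helper, not part of the original code) is:

// Hypothetical helper: abort with a message if a CUDA API call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage: CUDA_CHECK(cudaMemcpy(d_a, a, sizeof(float) * N, cudaMemcpyHostToDevice));

Compile and profile as before: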

$> nvcc vector_add_thread.cu -o vector_add_thread
$> nvprof ./vector_add_thread

The resulting performance:

==6430== Profiling application: ./vector_add_thread
==6430== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 39.18%  22.780ms         1  22.780ms  22.780ms  22.780ms  vector_add(float*, float*, float*, int)
 34.93%  20.310ms         2  10.155ms  10.137ms  10.173ms  [CUDA memcpy HtoD]
 25.89%  15.055ms         1  15.055ms  15.055ms  15.055ms  [CUDA memcpy DtoH]

The above uses a single thread block. CUDA GPUs contain multiple parallel processors called Streaming Multiprocessors (SMs); each SM holds many parallel execution units and can run multiple concurrent thread blocks. To take full advantage of the GPU, a kernel should therefore launch multiple thread blocks. For this, CUDA provides two more built-in variables:

  • blockIdx.x: the index of the block within the grid.
  • gridDim.x: the size of the grid (the number of blocks).

If a total of N threads are needed and each thread block holds 256 threads, then at least N/256 thread blocks (rounded up) are required. Each thread also needs a unique index, computed as:

int tid = blockIdx.x * blockDim.x + threadIdx.x;
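For example, with 256-thread blocks, the thread with threadIdx.x = 3 in block blockIdx.x = 2 gets tid = 2 * 256 + 3 = 515, and therefore handles element 515 of the vectors.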

The multi-block vector_add_grid.cu source:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <assert.h>
#include <cuda.h>
#include <cuda_runtime.h>

#define N 10000000
#define MAX_ERR 1e-6

__global__ void vector_add(float *out, float *a, float *b, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    
    // Handling arbitrary vector size
    if (tid < n){
        out[tid] = a[tid] + b[tid];
    }
}

int main(){
    float *a, *b, *out;
    float *d_a, *d_b, *d_out; 

    // Allocate host memory
    a   = (float*)malloc(sizeof(float) * N);
    b   = (float*)malloc(sizeof(float) * N);
    out = (float*)malloc(sizeof(float) * N);

    // Initialize host arrays
    for(int i = 0; i < N; i++){
        a[i] = 1.0f;
        b[i] = 2.0f;
    }

    // Allocate device memory 
    cudaMalloc((void**)&d_a, sizeof(float) * N);
    cudaMalloc((void**)&d_b, sizeof(float) * N);
    cudaMalloc((void**)&d_out, sizeof(float) * N);

    // Transfer data from host to device memory
    cudaMemcpy(d_a, a, sizeof(float) * N, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, sizeof(float) * N, cudaMemcpyHostToDevice);


    // Executing kernel 
    int block_size = 256;
    int grid_size = (N + block_size - 1) / block_size;   // ceiling division: just enough blocks to cover N
    vector_add<<<grid_size, block_size>>>(d_out, d_a, d_b, N);
    
    // Transfer data back to host memory
    cudaMemcpy(out, d_out, sizeof(float) * N, cudaMemcpyDeviceToHost);

    // Verification
    for(int i = 0; i < N; i++){
        assert(fabs(out[i] - a[i] - b[i]) < MAX_ERR);
    }

    printf("PASSED\n");

    // Deallocate device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);

    // Deallocate host memory
    free(a); 
    free(b); 
    free(out);
}
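With N = 10000000 and block_size = 256, grid_size = (10000000 + 255) / 256 = 39063, so the launch creates 39063 × 256 = 10000128 threads; the if (tid < n) guard in the kernel simply masks off the extra 128.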

Compile and profile:

$> nvcc vector_add_grid.cu -o vector_add_grid
$> nvprof ./vector_add_grid

Performance on a Tesla M2050:

==6564== Profiling application: ./vector_add_grid
==6564== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 55.65%  20.312ms         2  10.156ms  10.150ms  10.162ms  [CUDA memcpy HtoD]
 41.24%  15.050ms         1  15.050ms  15.050ms  15.050ms  [CUDA memcpy DtoH]
  3.11%  1.1347ms         1  1.1347ms  1.1347ms  1.1347ms  vector_add(float*, float*, float*, int)
4. Performance Comparison

Version            Execution Time (ms)    Speedup
1 thread           1425.29                1.00x
1 block            22.78                  62.56x
Multiple blocks    1.13                   1261.32x

5. OpenCL

OpenCL stands for Open Computing Language. OpenCL is:

  • An open, royalty-free, standard C-language extension
  • For parallel programming of heterogeneous systems using GPUs, CPUs, CBE, DSPs, and other processors, including embedded mobile devices
  • Initially proposed by Apple, which put OpenCL in OS X Snow Leopard and is active in the working group; the working group includes NVIDIA, Intel, AMD, IBM, and others
  • Managed by the Khronos Group, which also manages the OpenGL standard

[Figures omitted: OpenCL platform overview and the basic OpenCL program structure.]
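As a rough outline of that program structure, here is a minimal sketch assuming the OpenCL 1.x C API (error checking and buffer setup elided):

#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    // 1. Create a context and a command queue for the device.
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

    // 2. Build a kernel from OpenCL C source at run time.
    const char *src =
        "__kernel void vector_add(__global float *out,"
        "                         __global const float *a,"
        "                         __global const float *b) {"
        "    int i = get_global_id(0);"
        "    out[i] = a[i] + b[i];"
        "}";
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "vector_add", NULL);

    // 3. Create device buffers, set kernel arguments, enqueue the kernel,
    //    then read results back -- the same host/device pattern as CUDA.
    //    (Elided for brevity.)

    clReleaseKernel(kernel);
    clReleaseProgram(prog);
    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
    return 0;
}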

