
Cooperative Groups: More Flexible CUDA Thread Synchronization

mutourend · Published 2022-01-16 13:37:29

1. Introduction

__syncthreads() only synchronizes threads within a single block. Cooperative Groups, by contrast, supports synchronization across an entire grid, and even across multiple GPU devices. (A minimal grid-sync sketch follows the link below.)

Related sample code is available at:

  • https://github.com/olcf/cuda-training-series
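
As a quick illustration (not from the original post; the kernel and variable names such as gridSum, in, partial, and out are hypothetical), the sketch below shows a grid-wide two-phase reduction. The pattern is only correct if every block can synchronize mid-kernel, which cooperative groups' grid_group::sync() provides; grid-wide sync requires compute capability 6.0+, compiling with -rdc=true, and a cooperative launch as described in the next section.

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Phase 1: each block reduces its slice into partial[blockIdx.x].
// grid.sync() then guarantees every partial result is visible
// before phase 2 combines them.
__global__ void gridSum(const float* in, float* partial, float* out, int n) {
    cg::grid_group grid = cg::this_grid();

    // Phase 1: per-block partial sum (one thread per block, for brevity)
    if (threadIdx.x == 0) {
        float s = 0.f;
        for (int i = blockIdx.x; i < n; i += gridDim.x) s += in[i];
        partial[blockIdx.x] = s;
    }

    grid.sync();  // all blocks in the grid have written partial[] past here

    // Phase 2: block 0 safely combines every block's partial result
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        float s = 0.f;
        for (unsigned b = 0; b < gridDim.x; ++b) s += partial[b];
        *out = s;
    }
}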

The parameters of cudaLaunchCooperativeKernel are defined as:

template <class T>
__host__ cudaError_t cudaLaunchCooperativeKernel(const T* func, dim3 gridDim, dim3 blockDim, void** args, size_t sharedMem = 0, cudaStream_t stream = 0) [inline]

Launches a device function.

Parameters:
  func      - Device function symbol
  gridDim   - Grid dimensions
  blockDim  - Block dimensions
  args      - Arguments
  sharedMem - Shared memory (defaults to 0)
  stream    - Stream identifier (defaults to NULL)

cudaLaunchCooperativeKernel invokes the kernel func on a grid of gridDim (gridDim.x × gridDim.y × gridDim.z) blocks, each of which contains blockDim (blockDim.x × blockDim.y × blockDim.z) threads. [Currently, a block may contain at most 1024 threads, and a grid at most 65535 blocks.] The device running the kernel must have a non-zero value for the device attribute cudaDevAttrCooperativeLaunch.
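
A hedged host-side sketch of such a launch (reusing the hypothetical gridSum kernel above) might look like the following. Because a cooperative launch requires all blocks to be resident on the device at once, the grid is sized from an occupancy query rather than from the problem size:

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int dev = 0, supported = 0;
    cudaDeviceGetAttribute(&supported, cudaDevAttrCooperativeLaunch, dev);
    if (!supported) { printf("cooperative launch not supported\n"); return 1; }

    // Size the grid so every block can be co-resident on the GPU
    int numSM = 0, blocksPerSM = 0;
    cudaDeviceGetAttribute(&numSM, cudaDevAttrMultiProcessorCount, dev);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, gridSum, 128, 0);
    dim3 grid(numSM * blocksPerSM), block(128);

    int n = 1 << 20;
    float *in, *partial, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&partial, grid.x * sizeof(float));
    cudaMalloc(&out, sizeof(float));

    // Kernel arguments are passed as an array of pointers to the arguments
    void* args[] = { &in, &partial, &out, &n };
    cudaLaunchCooperativeKernel((void*)gridSum, grid, block, args, 0, 0);
    cudaDeviceSynchronize();
    return 0;
}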

2. Multi-Block Thread Synchronization

When using the cooperative groups approach, the kernel must be launched via cudaLaunchCooperativeKernel, as in the host-side sketch above.

Alternatively, multi-block synchronization can be implemented in a simpler way using atomics with a bitfield, e.g.:

// A global var with 64 bits can track 64 blocks,
// use an array if you need to track more blocks
// (unsigned long long matches CUDA's 64-bit atomic overloads)
__device__ unsigned long long CompleteMask;

//This is where we put in all the smarts
//from the CPU reference solver
__global__ void doWork() {
    atomicAnd(&CompleteMask, 0ull);
    //do lots of work

    // Each block owns one bit of the mask; all bits set = all blocks arrived
    const auto SollMask = (1ull << gridDim.x) - 1;
    if (threadIdx.x == 0) {
        // Publish this block's bit, then spin until every block has done so
        while (atomicOr(&CompleteMask, 1ull << blockIdx.x) != SollMask) {}
    }
    __syncthreads(); // rest of the block waits on thread 0's spin
    // work below here may assume every block finished the section above
}
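
A few caveats on this pattern (editorial notes, not from the original post): like a cooperative launch, it only works when all blocks are resident on the GPU simultaneously, otherwise scheduled blocks spin forever waiting on blocks that cannot be scheduled. The single 64-bit mask limits the grid to at most 64 blocks, as the comment notes; use an array of masks for larger grids. Finally, it is safer to clear CompleteMask from the host (e.g. via cudaMemcpyToSymbol) before launch, since the atomicAnd reset inside the kernel can race with blocks that have already set their bits.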