GPU/CUDA programming models
As the last piece of preparation for CUDA programming, the key points of the CUDA programming model are introduced.
None of the pictures or content here is original; the material in this series is mainly collected from:
| Outline |
|---|
| CPU-GPU interaction model |
| GPU memory model |
| GPU thread organization model |
| Programming model |
CPU-GPU communication

Each device has its own physical memory space; the CPU and GPU communicate through PCIe I/O with a bandwidth of roughly 8~16 GB/s.
There are two main memory standards: DDR has low latency, while HBM has high bandwidth.
The most commonly used GPU memory standard is GDDR; for CPU memory, DDR is the common choice.
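A minimal host-side sketch of this transfer path, using the standard CUDA runtime calls cudaMalloc and cudaMemcpy (the buffer name and size are arbitrary choices for illustration):

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 20;                 // 1M floats (~4 MB), arbitrary size
    const size_t bytes = n * sizeof(float);

    float *h_data = (float *)malloc(bytes);   // host (CPU) memory, DDR
    float *d_data = nullptr;                  // device (GPU) memory, GDDR/HBM
    cudaMalloc(&d_data, bytes);

    // Host -> device copy travels over the PCIe link (~8-16 GB/s)
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // ... launch kernels that work on d_data ...

    // Device -> host copy, also over PCIe
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    free(h_data);
    return 0;
}
```

Because every crossing of the PCIe link is expensive compared to on-device bandwidth, data is usually kept on the GPU for as long as possible between copies.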
GPU memory hierarchy

Register - dedicated HW - single cycle
Shared Memory - dedicated HW - single cycle
Local Memory - DRAM (an abstraction, actually stored in graphics memory), no cache - slow
Global Memory - DRAM, no cache - slow
Constant Memory - DRAM, cached - 1...10s...100s of cycles, depending on cache locality
Texture Memory - DRAM, cached - 1...10s...100s of cycles, depending on cache locality
Instruction Memory (invisible) - DRAM, cached
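In code, these memory spaces show up as declaration qualifiers. A small kernel sketch, with made-up names and sizes, showing where variables land in this hierarchy:

```cuda
// Constant memory: off-chip DRAM but cached; written by the host, read-only in kernels.
__constant__ float scale;

// in/out point into global memory (DRAM, slow).
__global__ void memorySpaces(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // scalar locals normally live in registers
                                                    // (sketch assumes the launch exactly covers the arrays)
    __shared__ float tile[256];                     // shared memory: on-chip, one copy per block
    tile[threadIdx.x] = in[i];                      // global (slow) -> shared (fast); assumes blockDim.x <= 256

    // Large per-thread arrays may be placed in "local" memory, which is an
    // abstraction backed by the same DRAM as global memory.
    out[i] = tile[threadIdx.x] * scale;             // result goes back to global memory
}
```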
In reality:
GPU thread hierarchies

Grid >> Block >> Thread
One block contains several threads, and one grid contains several blocks.
Note: these are all abstract concepts; how many threads or blocks to use is up to the programmer.
When a program (also called a Kernel) is launched, a Grid is activated, containing several Blocks. Inside one block there is one Shared Memory region and several Threads that can read the shared memory and synchronize with __syncthreads().
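A hedged sketch of this organization: a hypothetical kernel in which each block cooperates through its shared memory and synchronizes with __syncthreads() (the 256-thread block size is an arbitrary choice):

```cuda
// Each block reverses its own 256-element chunk of the input using shared memory.
__global__ void reverseInBlock(const int *in, int *out)
{
    __shared__ int buf[256];                         // one shared buffer per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global index of this thread

    buf[threadIdx.x] = in[i];                        // every thread writes one element
    __syncthreads();                                 // wait until the whole block has written

    out[i] = buf[blockDim.x - 1 - threadIdx.x];      // now safe to read another thread's element
}

// Host side: a grid of 128 blocks, 256 threads per block (128 * 256 elements in total).
// reverseInBlock<<<128, 256>>>(d_in, d_out);
```

Note that __syncthreads() only synchronizes the threads of one block; there is no equivalent barrier across the whole grid within a kernel.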
Thread mapping to the hardware
- grid <-> Device
- block <-> SM
- thread <-> ALU
On the CPU side, there is a similar mapping:
- grid <-> multi-core
- block <-> vector SSE
- thread <-> scalar SSE
Software thread memory accessibility
- Every thread has its own local memory.
- Every block has a shared memory.
- There is a global memory available to all kernels/grids.
- There is a constant memory available to all kernels/grids; it is READ ONLY for the kernels but read/write for the host (CPU), as sketched after this list.
- There is a host memory available to all devices, if multiple devices are connected to the host.

Note that traditionally the kernels run serially on each device, so each kernel gets access to the one global memory. Modern devices support multiple kernels running in parallel; in that case the global memory is divided among the kernels.
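A sketch of the constant-memory access pattern described in the list above, assuming a hypothetical filter[] symbol: the host writes it through the runtime API, while kernels may only read it:

```cuda
#include <cuda_runtime.h>

__constant__ float filter[8];              // constant memory: visible to all threads, read-only in kernels

__global__ void applyFilter(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * filter[i % 8];    // device code can only READ filter[]
}

void setupFilter(const float *h_filter)    // host code owns the write path
{
    // The host (CPU) writes constant memory with cudaMemcpyToSymbol.
    cudaMemcpyToSymbol(filter, h_filter, 8 * sizeof(float));
}
```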
Programming model
Concepts
Conventionally, the GPU is used for graphics processing: it works on pixels and pixel blocks. The instructions applied to each pixel are the same, so the work is naturally parallel and well suited to SIMD.
SIMD (Single Instruction Multiple Data)
- the vector is divided into pieces and the same instruction runs on each piece
SIMT (Single Instruction Multiple Thread)
SIMD is a rather low-level concept; SIMT is slightly higher level, but logically they are equivalent.
- the same instruction runs across multiple threads
- the GPU version of SIMD
- the massive-thread model gives high parallelism
- thread switching gives latency hiding
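A classic SIMT sketch: a vector-add kernel in which every thread executes exactly the same instructions, but on its own element of the data (names and block size are arbitrary):

```cuda
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // each thread picks a different element
    if (i < n)
        c[i] = a[i] + b[i];                         // same instruction, multiple data/threads
}

// Launch enough threads to cover all n elements; surplus threads
// simply fail the bounds check and do nothing.
// vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```

While one group of threads waits on a slow global-memory load, the hardware switches to another group; this is the latency hiding mentioned above.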
Programming language: Extended C
A decorated C language: plain C plus declaration qualifiers, keywords, intrinsics, and runtime APIs.
- Declspecs
  - global, device, shared, local, constant
  - these mark the different memory places
- Keywords
  - threadIdx, blockIdx: indexes for threads and blocks
- Intrinsics
  - __syncthreads
- APIs
  - memory, symbol, execution management
- Calling functions
  - a kernel call must specify how many blocks and how many threads per block to use (see the launch sketch below)
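A sketch of the calling convention, using a hypothetical 2D image kernel: the <<<grid, block>>> execution configuration is how the block and thread counts are passed along with the call:

```cuda
// Hypothetical kernel: invert every pixel of a width x height image.
__global__ void invertImage(unsigned char *pixels, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        pixels[y * width + x] = 255 - pixels[y * width + x];
}

void launchInvert(unsigned char *d_pixels, int width, int height)
{
    // Execution configuration: how many blocks, and how many threads per block.
    dim3 block(16, 16);                                   // 256 threads per block, laid out 16 x 16
    dim3 grid((width  + block.x - 1) / block.x,           // enough blocks to cover every pixel
              (height + block.y - 1) / block.y);

    invertImage<<<grid, block>>>(d_pixels, width, height);
    cudaDeviceSynchronize();                              // kernel launches are asynchronous; wait here
}
```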
CUDA Declarations
| Declaration | Operating Location | Calling Location |
|---|---|---|
| __device__ float DeviceFunc() | device | device |
| __global__ void KernelFunc() | device (GPU) | host |
| __host__ float HostFunc() | host (CPU) | host |
__global__ defines a kernel function
- the entry function: called by the CPU, running on the GPU
- must return void
- the launch must define the number of blocks for the kernel and the number of threads per block
__device__ and __host__ can be used together on one function
- useful when the same operation is needed on both the device and the host
- be careful with the indexing when calling this kind of function (see the sketch below)
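A sketch of such a combined qualifier (clampf is a made-up helper): the shared function contains no thread indexing of its own, so the device-side caller is responsible for computing the indexes:

```cuda
// One definition, compiled for both sides: callable from host code and from kernels.
__host__ __device__ inline float clampf(float x, float lo, float hi)
{
    return x < lo ? lo : (x > hi ? hi : x);
}

__global__ void clampKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // indexing lives in the caller, not in clampf
    if (i < n)
        data[i] = clampf(data[i], 0.0f, 1.0f);      // called on the GPU
}

// On the host the very same function is called like ordinary C code:
// float y = clampf(x, 0.0f, 1.0f);
```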
