GPU/CUDA programming models

As the last preparation before CUDA programming, this post introduces the key models behind CUDA programming.

None of the pictures and contents here are original. The contents of the whole series are mainly collected from:

NVIDIA CUDA introductory tutorial videos (NVIDIA CUDA初级教程视频)

CIS 565 2019

CMU 15-418/618 (2018)

Outline

  • CPU-GPU interaction model
  • GPU memory model
  • GPU threads organization model
  • Programming model

CPU-GPU communication

(Figure: CPU-GPU communication)

Each device has its own physical memory space; the CPU and GPU communicate through PCIe I/O, with a bandwidth of roughly 8-16 GB/s.

There are two main memory standards: DDR has low latency, while HBM has high bandwidth.

The most commonly used GPU memory standard is GDDR, while DDR is commonly used for CPU memory.
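As a rough illustration of this interaction model, here is a minimal sketch (buffer names and sizes are made up for illustration) that allocates device memory and copies data across the PCIe bus in both directions with the standard CUDA runtime API:

```c
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void) {
    const int n = 1 << 20;                  // 1M floats (hypothetical size)
    size_t bytes = n * sizeof(float);

    float *h_data = (float *)malloc(bytes); // host (CPU, DDR) buffer
    float *d_data = NULL;                   // device (GPU, GDDR/HBM) buffer

    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    cudaMalloc(&d_data, bytes);                                 // allocate GPU global memory
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  // CPU -> GPU over PCIe
    // ... launch kernels that work on d_data ...
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);  // GPU -> CPU over PCIe

    cudaFree(d_data);
    free(h_data);
    return 0;
}
```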

GPU memory hierarchy

(Figure: schematic of the CUDA memory hierarchy in a GPU)

  • Register - dedicated HW - single cycle
  • Shared Memory - dedicated HW - single cycle
  • Local Memory - DRAM (an abstract concept; actually stored in graphics memory), no cache - slow
  • Global Memory - DRAM, no cache - slow
  • Constant Memory - DRAM, cached - 1...10s...100s of cycles, depending on cache locality
  • Texture Memory - DRAM, cached - 1...10s...100s of cycles, depending on cache locality
  • Instruction Memory (invisible) - DRAM, cached

A short sketch of how these spaces are declared in CUDA C follows below.
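This is a hypothetical sketch only; the variable and kernel names are made up, but each declaration qualifier maps to one of the spaces above:

```c
__device__   float d_global[256];   // global memory: visible to all threads, lives in DRAM
__constant__ float d_coeff[16];     // constant memory: read-only on the device, cached

__global__ void memorySpacesDemo(float *in)   // 'in' points to global memory
{
    int   idx = threadIdx.x;        // small scalar locals usually live in registers
    float tmp = in[idx];

    __shared__ float tile[128];     // shared memory: on-chip, visible to the whole block
    tile[idx] = tmp * d_coeff[0];

    float big[64];                  // large per-thread arrays may spill to local memory (DRAM)
    big[idx % 64] = tile[idx];

    in[idx] = big[idx % 64] + d_global[idx];
}
```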

In reality:

(Figure: GTX 980 SMM unit, i.e. one core)

(Figure: GTX 980 architecture with 16 SMMs)

GPU thread hierarchies

(Figure: CUDA thread hierarchies)

Grid >> Block >> Thread

One block contains several threads, one grid contains several blocks.

Note: these are all abstract concepts; how many threads and blocks to use is up to the programmer.

When a program (also called a kernel) runs, a grid is activated, containing several blocks. Inside one block, there is one shared memory region and several threads that can access the shared memory and synchronize via __syncthreads().
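As a minimal sketch of this hierarchy (the kernel and array names are assumptions for illustration), each block below stages a tile of data in its shared memory and synchronizes before using it:

```c
__global__ void reverseTile(float *data)
{
    __shared__ float tile[256];              // one shared memory region per block
                                             // assumes a launch with 256 threads per block
    int local  = threadIdx.x;                            // index within the block
    int global = blockIdx.x * blockDim.x + threadIdx.x;  // index within the grid

    tile[local] = data[global];              // every thread loads one element
    __syncthreads();                         // wait until the whole block has loaded

    // reverse the tile inside the block using shared memory
    data[global] = tile[blockDim.x - 1 - local];
}
```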

thread mapping to the hardware

  • grid <-> Device
  • block <-> SM
  • thread <-> ALU

In terms of the CPU, there is a similar mapping:

  • grid <-> multi-core
  • block <-> vector (SSE)
  • thread <-> scalar (SSE)

software thread memory accessibility

  • Every thread has its own local memory.
  • Every block has a shared memory.
  • A global memory is available to all kernels/grids.
  • A constant memory is available to all kernels/grids; it is READ ONLY for the kernels but read/write for the host (CPU).
  • A host memory is available to all devices, if multiple devices are connected to the host.

Note that traditionally the kernels run serially on each device, so each kernel has access to the whole global memory. Modern devices, however, support multiple kernels running in parallel; in that case the global memory is divided among the kernels.
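The read-only/read-write asymmetry of constant memory can be sketched as follows (the symbol and kernel names are hypothetical): the host writes the symbol with cudaMemcpyToSymbol, while device code may only read it.

```c
#include <cuda_runtime.h>

__constant__ float scale;                 // constant memory symbol, read-only on the device

__global__ void applyScale(float *data)   // data lives in global memory
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= scale;                     // kernels can read, but never write, 'scale'
}

void setup(float *d_data, int nBlocks, int nThreads)
{
    float s = 2.0f;
    cudaMemcpyToSymbol(scale, &s, sizeof(float));   // host side: read/write access to the symbol
    applyScale<<<nBlocks, nThreads>>>(d_data);
}
```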

Programming model

Concepts

Conventionally, the GPU is used for graphics processing, working on pixels and pixel blocks. The instructions for each pixel are the same and naturally parallel, so the workload is well suited to SIMD.

SIMD (Single Instruction Multiple Data)

  • A vector is divided into pieces, and the same instruction runs on each piece.

SIMT (Single Instruction Multiple Thread)

  • the same instruction runs on multiple threads
  • the GPU version of SIMD
  • the massive-thread model gives high parallelism
  • thread switching gives latency hiding

SIMD is a rather low-level concept and SIMT is slightly higher level, but logically they are identical. A minimal SIMT sketch is given below.
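This sketch is not from the original slides, just an assumed minimal example: every thread executes the same instruction stream, but on its own data element selected by its thread index.

```c
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // each thread picks its own element
    if (i < n)
        c[i] = a[i] + b[i];   // same instruction, executed by many threads at once
}
```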

Programming language: Extended C

CUDA C is essentially a "decorated" C language.

  • Declspecs
    • __global__, __device__, __shared__, __local__, __constant__
    • indicate different memory/execution places
  • Keywords
    • threadIdx, blockIdx: indexes for threads and blocks
  • Intrinsics
    • __syncthreads()
  • APIs
    • memory, symbol, and execution management
  • Calling (launching) kernels
    • need to specify how many blocks and how many threads per block are used for one kernel launch (see the sketch after this list)
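A hedged sketch of that last item (function and variable names are illustrative): the execution configuration between <<< >>> tells the runtime how many blocks and how many threads per block to use.

```c
#include <cuda_runtime.h>

__global__ void scaleKernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;                 // guard against extra threads in the last block
}

void launch(float *d_data, int n)
{
    dim3 threadsPerBlock(256);                    // threads in each block
    dim3 numBlocks((n + 255) / 256);              // enough blocks to cover n elements
    scaleKernel<<<numBlocks, threadsPerBlock>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();                      // kernel launches are asynchronous
}
```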

CUDA Declarations

Declaration                      Operating location   Calling location
__device__ float DeviceFunc()    device (GPU)         device (GPU)
__global__ void KernelFunc()     device (GPU)         host (CPU)
__host__ float HostFunc()        host (CPU)           host (CPU)

__global__ defines a kernel function

  • Entry function: called by the CPU, runs on the GPU
  • Must return void
  • Need to define the number of blocks for each kernel and the number of threads for each block

__device__ and __host__ can be used together for one function

  • when the same operation is needed on both the device and the host
  • indexing needs to be handled carefully when calling this kind of function (see the sketch below)
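A minimal sketch (the helper and kernel names are hypothetical) of a function compiled for both sides; the caller, not the helper, decides the indexing, which is why indexing has to be handled with care:

```c
__host__ __device__ float clampValue(float x, float lo, float hi)
{
    return x < lo ? lo : (x > hi ? hi : x);      // the same code runs on CPU and GPU
}

__global__ void clampKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // device-side indexing
    if (i < n) data[i] = clampValue(data[i], 0.0f, 1.0f);
}

void clampOnHost(float *data, int n)
{
    for (int i = 0; i < n; ++i)                      // host-side indexing
        data[i] = clampValue(data[i], 0.0f, 1.0f);
}
```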
