CUDA programming 2

More details of the basics: the matrix multiply example is carefully analysed and an optimised version is coded.

None of the pictures or contents are original. The contents of the whole series are mainly collected from:

NVIDIA CUDA Beginner Tutorial Videos (NVIDIA CUDA初级教程视频)

CIS 565 2019

CMU 15-418/618 (2018)

Built-ins and functions

Declarations

|            | Executed on the | Only callable from the |
| ---------- | --------------- | ---------------------- |
| __global__ | device          | host                   |
| __device__ | device          | device                 |
| __host__   | host            | host                   |

note:

  • __host__ and __device__ can be combined to declare one function, compiled for both host and device (a fuller sketch follows these notes)

    __host__ __device__ void func()
    {
        // ...
    }
  • A __global__ function must return void

  • __device__ functions are inline by default (this may have changed in recent toolkits)

  • Recursion is allowed for some device functions; be careful with it

  • No static variables

  • Avoid malloc(): memory will run out quickly if every thread calls it

  • Be careful when using pointers
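
For illustration, here is a minimal sketch (the function and kernel names are made up) of how the qualifiers fit together: a __device__ helper callable only from device code, a __host__ __device__ function compiled for both sides, and a __global__ kernel launched from the host.

__device__ float square(float x)         // runs on the device, callable from device code only
{
    return x * x;
}

__host__ __device__ float halve(float x) // compiled for both host and device
{
    return 0.5f * x;
}

__global__ void scaleKernel(float* data, int n) // runs on the device, launched from the host
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = halve(square(data[i]));
}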

Vector datatypes

Including:

char[1-4], uchar[1-4], short[1-4], ushort[1-4], int[1-4], uint[1-4], longlong[1-4], ulonglong[1-4], float[1-4], double1, double2.

Instantiate with the function make_<type name>:

int2 i2 = make_int2(1, 2);
float4 f4 = make_float4(1.0f, 2.0f, 3.0f, 4.0f);

Reference with .x, .y, .z, .w :

int2 i2 = make_int2(1, 2);
int x = i2.x;
int y = i2.y;

Math

Including basic built-in math functions such as:

sqrt, rsqrt, exp, log, sin, cos, tan, sincos, asin, acos, atan2, trunc, ceil, floor...

Besides, there is another type of math function: intrinsic functions:

  • Device only
  • 10x faster but lower precision
  • Prefixed with __
  • names such as __expf, __logf, __sinf, __powf... (see the sketch after this list)
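
As a sketch (the kernel and buffer names are made up), the same expression can be written with the standard functions and with the intrinsic versions; the nvcc option --use_fast_math maps the standard calls to the intrinsics automatically.

__global__ void fastMathKernel(const float* in, float* accurate, float* fast, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        accurate[i] = sinf(in[i]) * expf(in[i]);     // full-precision math functions
        fast[i]     = __sinf(in[i]) * __expf(in[i]); // device-only intrinsics: faster, lower precision
    }
}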

Thread Synchronization

Recall Thread hierarchy

Recall the CUDA thread hierarchy on the GPU: a thread can be indexed with blockIdx and threadIdx (1D indexing, for example):

int threadID = blockIdx.x * blockDim.x + threadIdx.x;
float x = input[threadID];
float y = func(x);
output[threadID] = y;
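
Wrapped into a complete kernel (a sketch with a made-up per-element function), the same indexing pattern looks like this:

__device__ float func(float x) { return 2.0f * x + 1.0f; } // hypothetical per-element operation

__global__ void mapKernel(const float* input, float* output, int n)
{
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadID < n)                  // guard threads that fall outside the data
    {
        float x = input[threadID];
        float y = func(x);
        output[threadID] = y;
    }
}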

Synchronizing threads

Threads in the same block (not grid) can be synchronized with the function __syncthreads(), which creates a barrier:

Mds[i] = Md[j];
__syncthreads();
func(Mds[i], Mds[i+1]);

  • Works best when each thread has a similar execution time.
  • Block-level synchronization is a compromise, because global synchronization would cost too much.
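
For example, a fuller sketch of this staging pattern (the array and kernel names are made up): each thread copies one element of a global array into shared memory, all threads wait at the barrier, and only then does each thread read the element written by its neighbour.

#define BLOCK_SIZE 256

__global__ void neighbourSumKernel(const float* Md, float* out, int n)
{
    __shared__ float Mds[BLOCK_SIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x;

    if (i < n)
        Mds[t] = Md[i];                    // stage one element into shared memory
    __syncthreads();                       // barrier: the whole tile is now loaded

    if (i + 1 < n && t + 1 < BLOCK_SIZE)
        out[i] = Mds[t] + Mds[t + 1];      // safe: the neighbour's element is guaranteed to be there
}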

Problem of synchronizing

Synchronizing actually breaks the parallelization and, more severely, it can lead to thread deadlock, such as below:

if (someFunc())
{
    __syncthreads();
}
else
{
    __syncthreads();
}

No __syncthreads() in divergent branches!
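
One safe way to restructure such code (a sketch) is to keep the divergent work inside the branch but move the barrier to a point that every thread in the block reaches:

if (someFunc())
{
    // ... branch-specific work, no barrier here ...
}
else
{
    // ... other branch-specific work, no barrier here ...
}
__syncthreads();   // reached by all threads of the block, whichever branch was taken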

Scheduling model

Example

Take the old G80 as an example:

  • 16 SMs, 8 SPs per SM
    • 128 SPs in total
  • up to 768 threads on an SM, not all executing at once
    • 12,288 threads in total (on 16 SMs)

Warp

A warp is a logical concept: it represents a group of threads in a block (see the snippet after this list).

  • Thread indices are consecutive.
    • For example, if 64 threads run in the same block and a warp has 32 threads, the block has 2 warps, with thread indices [0-31] and [32-63].
  • The warp is the basic unit of thread scheduling.
  • A warp runs entirely on one SM (as does its block).
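
As a quick illustration for a 1D block (warpSize is a built-in variable, 32 on current hardware), a thread can compute its warp and its position within the warp like this:

int warpId = threadIdx.x / warpSize;   // e.g. threads 0-31 are warp 0, threads 32-63 are warp 1
int laneId = threadIdx.x % warpSize;   // position (lane) of the thread inside its warp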

warp on the same SM

  • At any time, only one of the warps is executed by an SM.
    • The threads in the same warp are synchronized by nature, and they execute the same instruction.
  • The warps that are not running are kept in context storage.
    • The warps whose next instruction has its operands ready for consumption are eligible for execution.
    • They can be switched (scheduled) every clock with no overhead, in order to hide latency.

Branches

Recall branch divergence; it is also called warp divergence.

warp on branch

In extreme cases the performance drops drastically, yet branching has been highly optimised, so it is usually unnecessary to worry about it unless massive, complicated branches exist, which is very rare.
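
For example, in this sketch the threads of a warp take different paths depending on their index, so the warp executes the two paths one after the other:

__global__ void divergentKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0)
        data[i] = data[i] * 2.0f;   // even lanes execute while odd lanes wait
    else
        data[i] = data[i] + 1.0f;   // then odd lanes execute while even lanes wait
}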

Cycles

More threads than SPs

On G80 there are 32 threads per warp but only 8 SPs per SM. The number of threads in one warp is larger than the number of SPs (ALUs), so the warp has to be issued serially over several cycles.

When an SM schedules a warp:

  • Its instruction is ready
  • 8 threads enter the SPs on the 1st cycle
  • 8 more on the 2nd, 3rd, and 4th cycles
  • Therefore, 4 cycles are required to dispatch a warp

In newer architectures, the number of SPs per SM is far larger than the number of threads per warp.

Number of warps to hide latency

Example: a kernel has 1 global memory read (200 cycles) and 4 independent multiplies/adds. How many warps are required to hide the memory latency on G80?

  1. Each warp has 4 MUL/ADDs, which take 16 cycles, because one instruction takes 4 cycles per warp.
  2. A global memory read causes a 200-cycle stall. We need to cover it with MUL/ADD cycles.
    • Warps needed: \(\lceil 200/16 \rceil = 13\).
  3. So 13 warps are needed for one block.

Memory model

Table

| Memory    | Device | Host | Scope  | Locality               | Latency           | Bandwidth                       | Note                                                                     |
| --------- | ------ | ---- | ------ | ---------------------- | ----------------- | ------------------------------- | ------------------------------------------------------------------------ |
| registers | R/W    | ×    | thread | on-chip                | short             |                                 | *More registers per thread, fewer threads per grid                       |
| local     | R/W    | ×    | thread | global memory          | long              |                                 | Automatic arrays                                                          |
| shared    | R/W    | ×    | block  | on-chip                | short             |                                 | *More memory per block, fewer blocks per grid. Full-speed random access  |
| global    | R/W    | R/W  | grid   | off-chip               | long (100 cycles) | GT200 ~150 GB/s, G80 ~86.4 GB/s | Up to 4 GB. Random access causes a performance hit                        |
| constant  | R      | R/W  | grid   | global memory (cached) | short             | high                            | Up to 64 KB. When all threads access the same location                    |

* Increasing the number of registers used by a kernel beyond the limit reduces the number of threads, one block at a time.

For example, there are 8K registers and up to \(768\) threads per SM. With \(256\) threads per block, \(768/256 = 3\) blocks fit on an SM.

  • \(8K/768 \approx 10\) registers per thread.

  • If each thread uses 11 registers:

    • Only 2 blocks fit within the register limit, so only \(2*256 = 512\) threads run on the SM, and \(512*11 = 5.6K\) registers are used.

* For shared memory: there are \(16\) KB per SM. With \(8\) blocks per SM, that is \(16/8 = 2\) KB per block.

If each block uses 5 KB, an SM can host only 3 blocks.

* There are also special registers used for built-in variables, such as:

  • threadIdx
  • blockIdx
  • blockDim
  • gridDim

Declaration

| Variable declaration                  | Memory   | Scope  | Lifetime    |
| ------------------------------------- | -------- | ------ | ----------- |
| Automatic variables other than arrays | register | thread | kernel      |
| Automatic array variables             | local    | thread | kernel      |
| __shared__ int sharedVar;             | shared   | block  | kernel      |
| __device__ int globalVar;             | global   | grid   | application |
| __constant__ int constantVar;         | constant | grid   | application |
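
For illustration, a sketch of how the declarations in the table above appear in code (the kernel name and the dummy logic are made up):

__device__   int globalVar;          // global memory, grid scope, lives for the whole application
__constant__ int constantVar;        // constant memory, grid scope, lives for the whole application

__global__ void exampleKernel(int* out)
{
    int tid = threadIdx.x;           // automatic scalar -> register, per thread, kernel lifetime
    float temp[4];                   // automatic array  -> local memory, per thread, kernel lifetime
    __shared__ int sharedVar;        // shared memory, one copy per block, kernel lifetime

    if (tid == 0)
        sharedVar = constantVar + globalVar;  // only one thread writes the shared variable
    __syncthreads();

    temp[0] = (float)sharedVar;      // dummy use of the local array
    out[tid] = (int)temp[0];
}
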
Global and constant variables:

The host can access them with:

  • cudaGetSymbolAddress()
  • cudaGetSymbolSize()
  • cudaMemcpyToSymbol()
  • cudaMemcpyFromSymbol()

Constant variables are better declared (previously they had to be declared) outside of any function body:

__constant__ float constData[256]; // global scope
float data[256];
cudaMemcpyToSymbol(constData, data, sizeof(data));
cudaMemcpyFromSymbol(data, constData, sizeof(data));

Matrix multiply Again

Problems that Still Exist

Matrix Multiply Version 1

Recall the limitations of the code from last time:

  • Limited matrix size
    • Only uses one block
    • G80 and GT200 – up to 512 threads per block
  • Lots of global memory access
    • Two global memory reads per MUL-ADD
    • The data in the same column of the N matrix is read again and again; the same holds for the data in the same row of M.
    • It would be better if the result could be computed with only a one-time read of each column of N.

Solve problems

Remove size limitation

Matrix Multiply V2: remove the size limitation

  • Divide the matrix into tiles, assign each tile to a block.
  • Use threadIdx and blockIdx for indexing.
  • Able to use multiple blocks.
Implement the kernel
__global__ void MatrixMultiplyKernel(const float* devM, const float* devN,
float* devP, const int width)
{
    // Calculate row and col index of P and M 
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    
    // Initialize accumulator to 0
    float pValue = 0;
    
    // Multiply and add
    for (int k = 0; k < width; k++) 
    {
        float m = devM[row * width + k];
        float n = devN[k * width + col];
        pValue += m * n;
    }
    // Write value to device memory - each thread has unique index to write to
    devP[row * width + col] = pValue;
}
Invoke the kernel
void MatrixMultiplyOnDevice(float* hostP,
const float* hostM, const float* hostN, const int width)
{
    int sizeInBytes = width * width * sizeof(float);
    float *devM, *devN, *devP;
    
    // Allocate M and N on device
    cudaMalloc((void**)&devM, sizeInBytes);
    cudaMalloc((void**)&devN, sizeInBytes);
    
    // Allocate P
    cudaMalloc((void**)&devP, sizeInBytes);
    
    // Copy M and N from host to device
    cudaMemcpy(devM, hostM, sizeInBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(devN, hostN, sizeInBytes, cudaMemcpyHostToDevice);
    
    // Call the kernel here
    // Setup thread/block execution configuration
    dim3 dimBlocks(TILE_WIDTH, TILE_WIDTH);
    dim3 dimGrid(width / TILE_WIDTH, width / TILE_WIDTH);
    // Launch the kernel
    MatrixMultiplyKernel<<<dimGrid, dimBlocks>>>(devM, devN, devP, width);
    
    // Copy P matrix from device to host
    cudaMemcpy(hostP, devP, sizeInBytes, cudaMemcpyDeviceToHost);
    
    // Free allocated memory
    cudaFree(devM); cudaFree(devN); cudaFree(devP);
}

global memory access

Limitation

Limited by global memory bandwidth

  • G80 peak GFLOPS: 346.5

  • Require 346.5*4 bytes = 1386 GB/s to achieve this

    4 bytes for each float datatype

  • G80 memory bandwidth: 86.4 GB/s

    • Limits code to 86.4/4 = 21.6 GFLOPS
    • In practice, the code runs at about 15 GFLOPS, far below the peak performance!
  • Must drastically reduce global memory access

Solve

Matrix Multiply V3: phases

Because each input element of the matrix multiply is read by Width threads, we can use shared memory to reduce the global memory traffic.

Break kernel into phases

  • Each phase accumulates Pd using a subset of Md and Nd
  • Each phase has good data locality
__global__ void MatrixMultiplyKernel(const float* devM, const float* devN,
float* devP, const int width)
{
    // Shared memory of M and N submatrices, size of tile width
    __shared__ float sM[TILE_WIDTH][TILE_WIDTH];
    __shared__ float sN[TILE_WIDTH][TILE_WIDTH];
    int bx = blockIdx.x; 
    int by = blockIdx.y;
    int tx = threadIdx.x; 
    int ty = threadIdx.y;
    int col = bx * TILE_WIDTH + tx;
    int row = by * TILE_WIDTH + ty;
    // Initialize accumulator to 0
    float pValue = 0;
    // Multiply and add
    // m is index of current phase
    // width / TILE_WIDTH is the number of phases
    for (int m = 0; m < width / TILE_WIDTH; m++) 
    {
        // Bring one element from each devM and devN into shared memory
        sM[ty][tx] = devM[row * width + (m * TILE_WIDTH + tx)];
        sN[ty][tx] = devN[col + (m * TILE_WIDTH + ty) * width];
        __syncthreads();
        
        // Accumulate subset of dot product
        for (int k = 0; k < TILE_WIDTH; ++k)
        {
            pValue += sM[ty][k] * sN[k][tx]; 
        }
        __syncthreads();
        
    }
    devP[row * width + col] = pValue;
}
pick TILE_WIDTH

TILE_WIDTH is limited by the maximum number of threads per block:

  • G80 and GT200 – 512
  • Fermi & newer – 1024

TILE_WIDTH is also limited by the shared memory size:

  • G80: 16KB per SM and up to 8 blocks per SM
    • 2KB per block
    • 1KB for Nds and 1KB for Mds (16*16*4)
    • TILE_WIDTH = 16
    • A larger TILE_WIDTH would result in fewer blocks (see the quick check after this list)
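
As a quick check of this arithmetic (a sketch using a compile-time assertion), two TILE_WIDTH x TILE_WIDTH float tiles per block take exactly 2 KB when TILE_WIDTH is 16:

#define TILE_WIDTH 16

// 2 tiles * 16 * 16 elements * 4 bytes = 2048 bytes = 2 KB per block,
// so 16 KB of shared memory per SM allows up to 8 such blocks on G80.
static_assert(2 * TILE_WIDTH * TILE_WIDTH * sizeof(float) == 2048,
              "shared memory per block with TILE_WIDTH = 16");
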
Benefits

Reduces global memory access by a factor of TILE_WIDTH

  • 16x16 tiles reduces by a factor of 16

G80

  • Now global memory supports 21.6 * 16 = 345.6 GFLOPS
  • Close to maximum of 346.5 GFLOPS

First-order Size Considerations

  • Each thread block should have many threads
    • TILE_WIDTH of 16 gives 16*16 = 256 threads
  • There should be many thread blocks
    • A 1024*1024 Pd gives 64*64 = 4096 Thread Blocks
  • Each thread block performs 2*256 = 512 float loads from global memory for 256 * (2*16) = 8192 mul/add operations.
    • Memory bandwidth no longer a limiting factor

Atomic functions

| Arithmetic operations | Bit operations |
| --------------------- | -------------- |
| atomicAdd()           | atomicAnd()    |
| atomicSub()           | atomicOr()     |
| atomicExch()          | atomicXor()    |
| atomicMin()           |                |
| atomicMax()           |                |
| atomicDec()           |                |
| atomicCAS()           |                |

Atomic operations are operations that are performed without interference from any other threads. Atomic operations are often used to prevent race conditions, which are common problems in multithreaded applications.

For example, suppose you have two threads named A and B. Now suppose each thread wants to increase the value of memory location 0x1234 by one. Suppose the value at memory location 0x1234 is 5. If A and B both want to increase the value at location 0x1234 at the same time, each thread will first have to read the value. Depending on when the reads occur, it is possible that both A and B will read a value of 5. After adding a value of 1, both A and B will want to write 6 into the memory location, which is not correct! The value, 5, should have been increased twice (once by each thread), but instead, the value was only increased once! This is called a race condition, and can happen in any multi-threaded program if the programmer is not careful.

Atomic operations in CUDA generally work for both shared memory and global memory. Atomic operations in shared memory are generally used to prevent race conditions between different threads within the same thread block. Atomic operations in global memory are used to prevent race conditions between threads in different blocks.
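
For example, a sketch of a kernel (the names are made up) that counts positive elements: threads first accumulate into a shared-memory counter with atomicAdd, then one thread per block folds the block's total into a global counter, again with atomicAdd, so no increment is ever lost to a race.

__global__ void countPositive(const float* data, int n, int* globalCount)
{
    __shared__ int blockCount;
    if (threadIdx.x == 0)
        blockCount = 0;
    __syncthreads();                          // make sure the counter is initialised

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > 0.0f)
        atomicAdd(&blockCount, 1);            // shared-memory atomic: within the block
    __syncthreads();                          // wait until every thread has contributed

    if (threadIdx.x == 0)
        atomicAdd(globalCount, blockCount);   // global-memory atomic: across blocks
}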

