CUDA Performance

The limits of performance and way of optimisation are introduced with examples, mostly memory optimisation and warp divergence.

All the contents are not original, all collected from

NVIDIA CUDA初级教程视频

CIS 565 2019

One rule

Efficient data-parallel algorithms + Optimizations based on GPU Architecture = Maximum Performance

Memory Optimisation

Minimise CPU-GPU data transfer

  • The data transfer between CPU and GPU depends on PCIe bus. The host <-> device bandwidth are far lower than that of global memory.

  • 8GB/s (PCIe x16 Gen2) vs 156GB/s & 515Ginst/s (C2050 (Fermi))

  • Minimise data transfer

    • Allocate, operating, free the intermediate variables directly on device

    • Sometimes it isfaster to compute multiple times on GPU than compute one time on CPU with two data transfers.

      • The cost by data transfer might be bigger than the multiple computational cost on GPU.
    • Directly transfer CPU code to GPU might not get a higher performance, if the data transfer is not optimised.

    • Transfering bigger data is better than smaller data. If the data < 80KB, the latency dominants the performance.

    • data transfer at the same time with the computation

      • double buffering, one for the data transfer and the other for the computation

        The term double buffering is used for copying data between two buffers for direct memory access (DMA) transfers, not for enhancing performance, but to meet specific addressing requirements of a device.

Memory Coalescing (most important)

global memory

Although the bandwidth of the global memory is big, the latency is as high as 400-800 cycles. Load and store multiple times can be very time consuming and inefficient.

L1 and L2 Cache
  • L1 cache is 128-bytes aligned
    • Multiples of 128B are read
  • L2 cache is 32-bytes aligned
    • Multiples of 32-bytes are read

L1 L2 cache memory access

Memory Coalescing (caching or non-caching)

Global Memory -> caching

  • For Fermi, Global Memory -> L1 cache allowed (default)
    • It can be changed into Global Memory -> L2 cache
    • by adding flags wen compiling, as nvcc -Xptxas -dlem=cg
  • For Kepler and newer, L1 is reserved for Local Memory
    • Only Global Memory -> L2 caching
conditions

When a warp requests 32 aligned, 4-byte words, there are several conditions:

warp word requested threads word requested caching cache-lines data move across the bus Bus utilization
32 aligned, 4-byte 32 threads requesting 1 float each (continuous in memory) L1 1 128 bytes 100%
32 aligned, 4-byte 32 threads requesting 1 float each (not sequentially indexed) L1 1 128 bytes 100%
32 aligned, 4-byte 32 threads requesting 1 float each (not all continuous in memory) L1 2 256 bytes 50%
32 aligned, 4-byte 32 threads requesting 1 float each (not all continuous in memory) L2 5 160 bytes 80%
1 4-byte ALL 32 threads requesting 1 float value L1 1 128 bytes 3.125%
1 4-byte ALL 32 threads requesting 1 float value L2 1 32 bytes 12.5%
32 scattered 4-byte 32 threads requesting 1 float each (randomly) L1 N N * 128 bytes 128 / (N * 128)
32 scattered 4-byte 32 threads requesting 1 float each (randomly) L2 N N * 32 bytes 128 / (N * 32)

32 threads requesting 1 float each (32 threads requesting 1 float each (continuous in memory).png)

32 threads requesting 1 float each (32 threads requesting 1 float each (not sequentially indexed).png)

32 threads requesting 1 float each (32 threads requesting 1 float each (not all continuous in memory).png)

32 threads requesting 1 float each (32 threads requesting 1 float each (not all continuous in memory) on L2.png) on L2

32 threads requesting 1 float each (32 threads requesting 1 float each (randomly).png)

32 threads requesting 1 float each (32 threads requesting 1 float each (randomly) L2.png) L2

principle

requesting large, consecutive locations instead of random locations

Memory coalescing – rearrange access patterns to improve performance

  • Useful today but will be less useful with large on-chip caches

The GPU coalesces consecutive reads in a full-warp into a single read

example

When accessing matrix data, avoid single thread requesting consecutive locations.

suitable memory requesting model for row-major matrix

suitable memory requesting model for column-major matrix

Share memory

What if it is impossible to request memory concecutively? It happens especially for the high dimensional conditions.

Strategy:

  • load global memory in a coalesce-able fashion into shared memory

  • Then access shared memory randomly at maximum bandwidth

Much lower latency then global memory.

Reduce the access of global memory

Inter-thread communication within a block

Bank Conflicts

When called a parallel data cache, multiple threads can access shared memory at the same time.

Share memory is divided into 32 32-bit banks banks

  • Can be accessed simultaneously
  • Requests to the same bank are serialized

Banks

  • Each bank can service one address per two cycles
  • Per-bank bandwidth: 32-bits per two cycles
  • Successive 32-bit words are assigned to successive banks

share memory bank illustration

example
__shared__ float tile[64];

share memory bank storage example, note the bank illustration is transposed as the upper one

Bank Conflict

Two simultaneous accesses to the same bank, but not the same address,

  • The memory accessing will be run serialized
    • G80-GT200: 16 banks, with 8 SPs concurrently executing
    • Fermi & Newer: 32 banks, with 16 SPs concurrently executing
conditions

Bank no conflicts illustration

Bank conflicts illustration

Bank broadcasting illustration

The latency of share memory is comparable with the registers, when no bank conflicts.

The conflicts can be detected by warp_serialize profiler.

examples
  1. Access to different banks by a warp executes in parallel.
__shared__ float tile[64];
int tidx = threadidx.x;
float foo = tile[tidx] - 3;

share memory bank storage requested concecutively

  1. Access to same element of bank by a warp executes in parallel.

    __shared__ float tile[64];
    int tidx = threadidx.x;
    int bar = tile[tidx - tidx % 2];

    same element in share memory bank storage requested

  2. Access to defferent element of bank by a warp executes in serial

    • 2-way conflict
    __shared__ float tile[64];
    int tidx = threadidx.x;
    int bar = tile[tidx + tidx % 2 * 31];

    different elements in share memory bank storage requested, 2-way conflict

Fermi vs Kepler & Newer

Fermi (Compute 2.x)

  • Bank width: 32-bits for 2 clock cycles

Kepler (Compute >= 3.x)and newer

  • Bank width: 64-bits per 1 clock cycle
  • Also 2 modes - either successive 32-bit words (in 32-bit mode) or successive 64-bit words (64-bit mode) are assigned to successive banks. So bank conflicts can still occur.

Note: (*) However, devices of compute capability 3.x typically have lower clock frequencies than devices of compute capability 2.x for improved power efficiency

Example Matrix Transpose

Inherently parallel - Each element independent of another

Simple to implement

Matrix transpose illustration

CPU

for(int i = 0; i < rows; i++)
{
	for(int j = 0; j < cols; j++)
    {
		transpose[i][j] = matrix[j][i]       
    }
}
  • O(n2 ) slow

GPU V1

__global__ void matrixTranspose(float *_A, float *_A_t)
{
    int row = blockDim.y*blockIdx.y+threadIdx.y;
	int col = blockDim.x*blockIdx.x+threadIdx.x;

    _A_t[row*sizeX+col] = _A[col*sizeY+row];
}
  • O(1) - Launch 1 thread per element

  • Essentially one memcpy from global-to-global

should be fast YET:

Problem

Memory coalescing, recall the matrix accessing example,

  • when row-major: consecutive locations when loading yet scattered locations on writing
  • when col-major: consecutive locations when writing yet scattered locations on loading

memory access modes on READ and WRITE with row major matrix

Improvement

Recall the share memory, we can make both the global memory <--> share memory coalesced in both directions.

GPU V2

  1. Compute input index (same as in naive transpose)

  2. Copy data to shared memory

  3. Compute output index

    • Remember, coalesced memory access

    • Hint, transpose only in shared memory

  4. Copy data from shared memory to output

memory access modes on READ and WRITE via share memory

Matrix transpose via share memory

__global__ void matrixTransposeShared(const float *_a, float *_b)
{
    __shared__ float mat[BLOCK_SIZE_Y][BLOCK_SIZE_X];
    int bx = blockIdx.x * BLOCK_SIZE_X;
    int by = blockIdx.y * BLOCK_SIZE_Y;
    int i = bx + threadIdx.x; //input
    int j = by + threadIdx.y; //input
    int ti = by + threadIdx.x; //output
    int tj = bx + threadIdx.y; //output
    
    if(i < sizeX && j < sizeY)
    {
    	mat[threadIdx.y][threadIdx.x] = a[j * sizeX + i]; // read
    }
    
    __syncthreads(); //Wait for all data to be copied
    
    if(tj < sizeY && ti < sizeX)
    {
    	b[tj * sizeY + ti] = mat[threadIdx.x][threadIdx.y]; // write
    }
}
Problem

Recall bank conflict,

Access to defferent element of bank by a warp executes in serial, as a result ,when writing, there is a

  • 32-way conflict
b[tj * sizeY + ti] = mat[threadIdx.x][threadIdx.y];

different elements in share memory bank storage requested 32 way conflict

Improvement

Resolving bank conflict by stuffing blank element at the end of each bank.

GPU V3

// __shared__ float mat[BLOCK_SIZE_Y][BLOCK_SIZE_X];
__shared__ float mat[BLOCK_SIZE_Y][BLOCK_SIZE_X + 1];
// ...
b[tj * sizeY + ti] = mat[threadIdx.x][threadIdx.y]

Elements per row = 32

Shared Mem per row = 33

1 empty element per row

share memory bank storage requested with no conflicts

Now it is very very close to production ready!

Furthere improvements

More work per thread - Do more than one element

Loop unrolling

GPU V4

More work per thread:

  • Threads should be kept light
  • But they should also be saturated
  • Give them more operations

Loop unrolling

  • Allocate operation in a way that loops can be unrolled by the compiler for faster execution

  • Warp scheduling

  • Kernels can execute 2 instructions simultaneously as long as they are independent

  • Use same number of blocks, shared memory

    • Reduce threads per block by factor (SIDE)

    loop unrolled illustration

HOST DEVICE
Same number of blocks Allocate same shared memory
Compute new threads per block Compute input indices similar to before
Copy data to shared mem using loop (k)
- Unrolled index: add k to y
Compute output indices similar to before
Copy data from shared memory into global memory
- Unrolled index: add k to y
template<int TILE, int SIDE> //TILE = 32, SIDE = 8
__global__ void matrixTransposeUnrolled(const float* a, float* b)
{
    //Allocate appropriate shared memory (avoid bank conflict)
    __shared__ float mat[TILE][TILE + 1];
    
    //Compute input index
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    //Copy data from input to shared memory. Multiple copies per thread.
    #pragma unroll
    for (int k = 0; k < TILE; k += SIDE)
    {
        if (x < sizeX && y + k < sizeY)
        {
        	mat[threadIdx.y + k][threadIdx.x] = a[((y + k) * sizeX) + x];
        }
    }
    
    __syncthreads();
    
    //Compute output index
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    //Copy data from shared memory to global memory. Multiple copies per thread.
    #pragma unroll
    for (int k = 0; k < TILE; k += SIDE)
    {
        if (x < sizeY && y + k < sizeX)
        {
        	b[(y + k) * sizeY + x] = mat[threadIdx.x][threadIdx.y + k];
        }
    }
}

Benchmark result

Matrix transpose benchmark

Divergent Optimisation

Example Parallel Reduction

Example first this time

Reduction: An operation that computes a single result from a set of data. For example the sum and max

sum

With data $[a] = $: \[ \left[0\right] \qquad \left[1\right] \qquad \left[2\right] \qquad \left[3\right] \qquad \left[4\right] \qquad \left[5\right] \qquad \left[6\right] \qquad \left[7\right] \qquad \]

Need to calculate the in parallel \[ s = \sum_ {i=0}^N a_i \]

Serial vs Parallel

serial and parallel reduction comparison, the serial (left) and parallel codes (right) have complicity n and O(log2n) respectively

It is obvious, the merits of parallel reduce are:

  • Binary
    • example: a * b, a + b, a & b, a | b
    • not binary: !(a), (a)!
  • Associative
    • example: a * b, a + b, a & b, a | b
    • non associative: a / b, a - b

GPU V1: Interleaved Addressing

Parallel binary reduce is applied to a part of the whole array in each block.

Multiple blocks help in:

  • Maximizing Occupancy by keeping SMs busy
  • Processing very large arrays.

Parallel reduce is not arithmetic intensive, it takes only 1 Flop per thread(1 add) so it is completely memory bandwidth bounded.

Need a way to communicate partial results between blocks

  • Global sync is not practical due to the overhead of sync across so many cores
  • Solution: Call the reduce kernel recursively to reduce the results from previous reduce

Parallel reduction example

\(\log_2(n)\) passes for n elements

\(O(\log_2n)\) complexity

code
  1. Declare dynamic shared memory and compute index
  2. Load input into shared memory
  3. Reduce in shared memory
  4. Copy result of each block into global memory

We only modify Parts 2 and 3 for optimization

__global__ void reduce_v1(const float* d_idata, float* d_odata, int n)
{
    // step 1
    // Dynamic allocation of shared memory - See kernel call in host code
    extern __shared__ float smem[];
    int idx = blockIdx.x * blockDim.x + threadIdx.x; // Calculate 1D Index
    
    // step 2
    if(idx < n)
    {
	    smem[threadIdx.x] = d_idata[idx]; // Copy input data to shared memory
    }
    __syncthreads();
    
    // step 3
    // Reduce within block
    // Start from c = 1, up to block size, each time doubling the offset
    for(int c = 1; c < blockDim.x; c *= 2)
    {
        if (threadIdx.x % (2 * c) == 0)// Add only on left index of each level
        {
        	smem[threadIdx.x] += smem[threadIdx.x + c];
        }
        __syncthreads();
    }
    
    // step 4
    // Copy result of reduction to global memory
    if(threadIdx.x == 0)
    {
	    d_odata[blockIdx.x] = smem[0];
    }
}

Reduction with interleaved addressing

1st pass: threadid.x = 0, 2, 4, 6 ... are working while the others keep idle.

2st pass: threadid.x = 0, 4 ... are working while the others keep idle.

3st pass: threadid.x = 0 ... are working while the others keep idal.

At each pass, number of threads required is halved. The idle threads doubled.

note that the as stride increases, the not-working threads do the same work with the working threads. Because threads in the same warp need to run exact the same computations. Yet no registered are allocated to them.

problems:
  • Interleaved addressing

  • Too many divergent branches

Improvement
  • Avoiding using modulo operator %

  • Non-divergent branches

Warp revision

A thread block is broken down to 32-thread warps, warps are executed physically in a SM(X)

Total number of warps in a block: ceil(T/Wsize)

  • Each thread in a warp execute one common instruction at a time

  • Warps with diverging threads execute each branch serially

Warp Partitioning

Definitions

Warp Partitioning: how threads from a block are divided into warps - based on consecutive increasing threadIdx.

Knowledge of warp partitioning can be used to:

  • Minimize divergent branches
  • Retire warps early

Recall the divergent branches inside one warp: the performance can be largely deterited.

examples

For warpSize = 32,

  • the code below will cause divergence inside a warp.

    if (threadIdx.x > 15)
    {
    	// ...
    }
  • the code below suffers no from the branch divergence

    if (threadIdx.x > warpSize - 1)
    {
    	// ...
    }

    because the divergence happens across the warps.

Rule

Make threads per blocks to be a multiple of a warp (32)

  • Incomplete warps waste unused cores
  • 256 threads per blocks is a good starting point

Try to have all threads in warp execute in lock step

  • Divergent warps will use time to compute all paths as if they were in serial order

GPU V2: Removing divergence branching

Revise the problems existing

  • Interleaved addressing

  • Too many divergent branches

Now we are dealing with the second one.

__global__ void reduce_v2(const float* d_idata, float* d_odata, int n)
{
    // step 1 and 2 ...
    // step 3
    for(int c = 1; c < blockDim.x; c *= 2)
    {
        int index = threadIdx.x * 2 * c;
        if (index < blockDim.x)		// No divergence except last warp
        {
        	smem[index] += smem[index + c];
        }
        __syncthreads();
    }
    // step 4 ...
}

Simple modification but

  • Need to change the for-loop structure

  • Need the same values, just not the modulo etc

  • Restructure

GPU V3: Sequential Addressing

Now we are dealing with the first problem.

__global__ void reduce_v2(const float* d_idata, float* d_odata, int n)
{
    // step 1 and 2 ...
    // step 3
    // Reduce within block
    for(int c = blockDim.x / 2; c > 0; c >>= 1)
    {
        // No need for index or modulo. It is replaced by c – similar to stage 0
        if (threadIdx.x < c)
        {
        	smem[threadIdx.x] += smem[threadIdx.x + c];
        }
        __syncthreads();
    }
    // step 4 ...
}

note that bitwise shifting >> substitutes the / operation.

Reduction with sequential addressing

1st pass: threadid.x = 0, 1, 2, 3 ... are working while ( 1/2) keep idle.

2st pass: threadid.x = 0, 1 ... are working while ( 3/4) keep idle.

3st pass: threadid.x = 0 ... are working while ( 7/8) keep idal.

...

At each pass, number of threads required is halved.

The idle threads are able to release because the indexes of the working threads are changed.

Compared with V1/2:

Suppose the warpSize = 2

Algorithms comparison

Given the knowledge of warp partitioning

Pass number Original Optimised
1st pass 4 divergences, 0 warp can be retired 0 divergence, 2 warps can be retired
2st pass 2 divergences, 2 warp can be retired 0 divergences, 3 warps can be retired
3st pass 1 divergence, 3 warps can be retired 1 divergence, 3 warps can be retired
Problem

Always think of:

  • Last warp divergence?
  • Memory Coalescing?
  • Bank conflicts?
  • How is thread usage?
Imporvements

How to reduce the idel warps?

  • Launch only half the threads per block
    • reduce the threads load in step 2
    • Get’s rid of maximum wastage
  • How to load data into shared memory?
    • adding on Load

GPU V4.1: Single add on load

Each thread reads multiple values into shared memory. So why not add on load too?

  • Doesn’t increase our global memory usage
  • Also reduces the amount of shared memory we need
    • Or keep shared memory same and do more work per block
__global__ void reduce_v1(const float* d_idata, float* d_odata, int n)
{
    // step 1  
    // step 2
    if(idx < n)
    {
        smem[threadIdx.x] = d_idata[idx];
        //Copy and add block data into shared memory
        if(idx + blockDim.x < n)
        {
	        smem[threadIdx.x] += d_idata[idx + blockDim.x];
        }
	} 
    __syncthreads();
    
    // step 3
    // step 4
}

Reduction with interleaved addressing add on load

Problem

Single - the threads needed is halved only in the first pass

Improvement

Multiple adds per thread on load

  • Replace single add with a loop.
  • Use a counter TILE to define the number to adds per thread
    • defining TILE as global constant will allow loop unrolling
    • preferable set TILE as power of 2

GPU V4.2: Multiple adds on load

__global__ void reduce_v1(const float* d_idata, float* d_odata, int n)
{
    // step 1  
    // step 2
    if(idx < n)
    {
        smem[threadIdx.x] = 0; // Start with identity
        for(int c = 0; c < TILE; c++)
        {
            // Copy and add block data with block offset into shared memory
            if(idx + c * blockDim.x < n)
            {
            	smem[threadIdx.x] += d_idata[idx + c * blockDim.x];
            }
        }
	} 
    __syncthreads();
    
    // step 3
    // step 4
}
problem

Last warp is still divergent

solution

Unroll the last warp

GPU V5.1: Last warp unroll

Split the step 3 into 2 steps:

__global__ void reduce_v2(const float* d_idata, float* d_odata, int n)
{
    // step 1 and 2 ...
    // step 3
    // step 3A
	for(int c = blockDim.x / 2; c > 32; c >>= 1)
    {
        // No need for index or modulo. It is replaced by c – similar to stage 0
        if (threadIdx.x < c)
        {
        	smem[threadIdx.x] += smem[threadIdx.x + c];
        }
        __syncthreads();
    }
    
    // step 3B
    if(threadIdx.x < 32)
    {
	    warpReduce(smem, threadIdx.x);
    }
    // step 4 ...
}

with

__device__ void warpReduce(volatile float* smem, int tid)
{
    //Write code for warp reduce here
    smem[tid] += smem[tid + 32];
    smem[tid] += smem[tid + 16];
    smem[tid] += smem[tid + 8 ];
    smem[tid] += smem[tid + 4 ];
    smem[tid] += smem[tid + 2 ];
    smem[tid] += smem[tid + 1 ];
}

note that volatile is used to declare smem

The compiler doesn't reorder stores to it and induce incorrect behavior.

Basically – Tell compiler we know what we are doing

There is no need for if(threadIdx.x < c), Essentially, when we write to the Nth part of the warp/block shared memory, we don’t really care about that data anymore.

The cost is nothing – We are executing threads that were sitting idle before.

Improvement

Able to unroll compeletely?

GPU V5.2: Complete unroll

Taking inspiration from Stage 4, unroll the for loop entirely in step 3A

#pragma unroll requires sizes to be known at compile time

  • Use Templates

  • CUDA supports C++ template parameters on device and host functions

  • Specify block size as a function template parameter:

    reduce_v5<threads><<<dims.dimBlocks,
    					 dims.dimThreads,
    					 sizeof(float) * dims.dimThreads
                       >>>
                         (d_idata, d_odata, n_elements);

Block size in GPU limited to 512 or 1024 threads.

  • Only a limited number of if conditions we have to write

Also make block sizes power of 2 (preferably multiples of 32).

if(blockSize >= 1024)
{
	if(threadIdx.x < 512) { smem[tid] += smem[tid + 512]; } __syncthreads();
}
if(blockSize >= 512)
{
	if(threadIdx.x < 256) { smem[tid] += smem[tid + 256]; } __syncthreads();
}
if(blockSize >= 256)
{
	if(threadIdx.x < 128) { smem[tid] += smem[tid + 128]; } __syncthreads();
}
if(blockSize >= 128)
{
	if(threadIdx.x < 64) { smem[tid] += smem[tid + 64]; } __syncthreads();
}

Note that the blockSize is known at compile time

  • All the if conditions related to blockSize are evaluated at compile time.
  • No problem with other if conditions either
  • All guarantee no warp divergence

Benchmark

Reduction benchmark

Further improvements

  1. RECURSIVE REDUCTION - Call recursion on result of block reduction – instead of CPU
  2. CONSTANT POINTER - use const float* x;, float* const x;, const float* const x;
  3. __restrict__ FLAG
    • assure the compiler that the pointer marked by it do not overlap
    • Basically, a and b are exclusive memory
    • And that you will not do indexing that will overflow into the other
    • Compiler can make optimizations based on this
      • Removes checks

Bank conflict

Conclusion

  1. Understand CUDA performance characteristics
    • Memory coalescing
    • Divergent branching
    • Bank conflicts
    • Latency hiding
  2. Use peak performance metrics to guide optimization
    • Know peak GFLOPs and Memory Bandwidth of GPU
  3. Know how to identify type of bottleneck
    • e.g. memory, core computation, or instruction overhead
  4. Optimise your algorithm, then unroll loops
  5. Use template gracefully

Final guide

  1. parallelisation
  2. memory coalsecing
  3. share memory
  4. other memory
    • texture
    • constant
  5. reduce bank conflicts
Further reading

CUDA C++ Programming Guide - NVIDIA Developer

CUDA C++ Best Practices Guide - NVIDIA Developer


CUDA Performance
https://daydreamatnight.github.io/2022/09/09/CUDA-fundamental-7/
Author
Ryan LI
Posted on
September 9, 2022
Licensed under