CUDA Programming 1

Basic CUDA programming methods, with a simple matrix multiply example built from scratch.

The figures and content here are not original; the material in this series is mainly collected from:

NVIDIA CUDA beginner tutorial videos (NVIDIA CUDA初级教程视频)

CIS 565 2019

CMU 15-418/618 (2018)

Introduction

GPU Computing History: APIs

  • 2001/2002 - researchers see GPU as data-parallel coprocessor

    The GPGPU field is born, extending GPUs to general-purpose computing

  • 2007 - NVIDIA releases CUDA

    • CUDA: Compute Unified Device Architecture
    • GPGPU shifts to GPU Computing
  • 2008 - Khronos releases OpenCL specification

  • 2013 - Khronos releases OpenGL compute shaders

  • 2015 - Khronos releases Vulkan and SPIR-V


CUDA Terminology

  • Host - typically the CPU
    • Code written in ANSI C, or other languages
  • Device - typically the GPU (data-parallel)
    • Code written in extended ANSI C (CUDA C), Fortran, or other languages
  • Host and device have separate memories (except on SoCs with integrated GPUs)
  • CUDA Program
    • Contains both host and device code

Kernel

A CUDA Kernel represents a data-parallel function

  • Invoking a kernel creates lightweight threads on the device
  • Threads are generated and scheduled by hardware
  • Similar to a shader in OpenGL/WebGL/Vulkan
Example

Execute \(C = A + B\) N times in parallel, using N different CUDA threads (one per element)

// Kernel Definition
__global__ void vectorAdd(const float* A, const float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    // ...
    // Kernel invocation with N threads
    vectorAdd<<<1, N>>>(A, B, C);
}
  • __global__ - declaration specifier
  • threadIdx.x - thread index
  • <<<1, N>>> - kernel execution configuration
CUDA Program Execution

When a CUDA program is executed, serial code runs on the host until a kernel invocation. The parallel code then runs on the device, and the resulting data is gathered back to the host, where serial execution continues...

(Figure: CUDA execution flow)

Thread Hierarchies

Grid - one or more thread blocks

  • 1D, 2D or 3D
  • Example: indexing into a vector, matrix, or volume; 2D and 3D layouts map naturally to 2D and 3D arrays

Block - array of threads

  • 1D, 2D, or 3D
  • Each block in a grid has the same number of threads
  • Each thread in a block can
    • Synchronize with the other threads in its block
    • Access the block's shared memory hazard-free (see the sketch below)

Two threads from two different blocks cannot cooperate
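
Block-level cooperation in practice: the short sketch below (the kernel name reverseInBlock and BLOCK_SIZE are illustrative, not from the original material) uses __shared__ memory and a __syncthreads() barrier so that threads of one block can safely exchange data. No such barrier exists between blocks.

// Minimal sketch: threads of one block cooperatively reverse their tile of data.
// Launch with BLOCK_SIZE threads per block.
#define BLOCK_SIZE 256

__global__ void reverseInBlock(const float* in, float* out)
{
    __shared__ float tile[BLOCK_SIZE];  // per-block shared memory

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];          // each thread loads one element

    __syncthreads();                    // barrier: wait until all loads finish

    // Read another thread's element; safe only because of the barrier above
    out[i] = tile[blockDim.x - 1 - threadIdx.x];
}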

Thread ID, one block

Thread ID: Scalar thread identifier

Thread Index: threadIdx

  • 1D: Thread ID == Thread Index
  • 2D block with size \((D_x, D_y)\)
    • Thread ID of index \((x, y)\) is \(x + D_x \cdot y\)
  • 3D block with size \((D_x, D_y, D_z)\)
    • Thread ID of index \((x, y, z)\) is \(x + D_x \cdot y + D_x \cdot D_y \cdot z\)
Example
// Kernel Definition
__global__ void matrixAddition(const float* A, const float* B, float* C)
{
    // 2D Thread Index to 1D memory index
    int idx = threadIdx.y * blockDim.x + threadIdx.x;
    C[idx] = A[idx] + B[idx];
}
int main()
{
    // ....
    // Kernel invocation with one block of N * N * 1 threads
    int blocks = 1;
    dim3 threadsPerBlock(N, N); // N rows x N columns
    // 1 block, 2D block of threads
    matrixAddition<<<blocks, threadsPerBlock>>>(A, B, C); 
}

Here, blockDim is a built-in dim3 variable holding the dimensions of the thread block; in this example, blockDim.x and blockDim.y are both N.
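
To see these built-in variables concretely, a minimal sketch (assuming a device of compute capability 2.0 or newer, where device-side printf is available) can print each thread's index together with the block dimensions:

#include <cstdio>

// Illustrative kernel: print the built-in index variables of a 2D block
__global__ void showIndices()
{
    printf("threadIdx=(%u,%u) blockDim=(%u,%u)\n",
           threadIdx.x, threadIdx.y, blockDim.x, blockDim.y);
}

int main()
{
    dim3 threadsPerBlock(2, 3);      // one block of 2 x 3 = 6 threads
    showIndices<<<1, threadsPerBlock>>>();
    cudaDeviceSynchronize();         // wait so the device output is flushed
    return 0;
}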

(Figure: thread indexing within one block)

A thread block is a group of threads
  • G80 and GT200: up to 512 threads per block
  • Fermi, Kepler, Maxwell, Pascal: up to 1024 threads per block
  • (these limits can be queried at runtime; see the sketch below)

All threads of a block are located on the same processor core (SM)

They share the memory of that core (SM)
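
A minimal sketch of querying these limits at runtime with the standard cudaGetDeviceProperties call, instead of hard-coding them per architecture:

#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    printf("Device: %s\n", prop.name);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}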

Thread ID, multiple blocks

1D, 2D, or 3D Grid

Block Index: blockIdx

Example
// Kernel Definition
__global__ void matrixAddition(const float* A, const float* B, float* C)
{
    // 2D block and thread indices to a 1D memory index
    int blockId_2D = blockIdx.y * gridDim.x + blockIdx.x;
    int threadId_2D = threadIdx.y * blockDim.x + threadIdx.x;
    int idx = blockId_2D * (blockDim.x * blockDim.y) + threadId_2D;
    C[idx] = A[idx] + B[idx];
}
int main()
{
    // ....
    // Kernel invocation with computed configuration
    dim3 threadsPerBlock(N, N); // N*N threads per block
    dim3 blocks(M, M);          // M*M blocks per grid
    // 2D grid of blocks, 2D block of threads
    matrixAddition<<<blocks, threadsPerBlock>>>(A, B, C); 
}

(Figure: block indexing within one grid)

A grid is a group of blocks

Blocks execute independently

  • In any order: in parallel or in series
    • A block runs whenever it is ready
  • Scheduled in any order onto any number of cores (SMs)
    • This allows code to scale with the core count
    • For example, with 8 blocks on a GPU with 2 SMs, 2 blocks run in parallel at a time; on a GPU with 4 SMs, 4 blocks run in parallel at a time (see the sketch below)
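
Because the hardware schedules blocks onto however many SMs exist, the launch configuration is normally derived from the problem size rather than from the core count. A minimal sketch (the names vectorAddN and launchVectorAdd are illustrative) of the earlier vector addition extended to many blocks:

// Launch enough blocks to cover n elements; the GPU schedules them across
// however many SMs it has, so the same code scales from 2 to many cores.
__global__ void vectorAddN(const float* A, const float* B, float* C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                  // the last block may be only partially full
        C[i] = A[i] + B[i];
}

void launchVectorAdd(const float* A, const float* B, float* C, int n)
{
    int threadsPerBlock = 256;                                 // a typical choice
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division
    vectorAddN<<<blocks, threadsPerBlock>>>(A, B, C, n);
}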

CUDA Memory Transfers

Memory Accessibility

Device code can:
  • R/W per-thread registers
  • R/W per-thread local memory
  • R/W per-block shared memory
  • R/W per-grid global memory
  • Read only per-grid constant memory
  • Read only per-grid texture memory

Host code can:
  • R/W per-grid global memory
  • R/W per-grid constant memory
  • R/W per-grid texture memory
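
One concrete case from the lists above: the host can write per-grid constant memory, while device code can only read it. A minimal sketch (the symbol name scale and the kernel scaleArray are made up for illustration) using the standard cudaMemcpyToSymbol call:

// __constant__ memory: written by the host, read-only in device code
__constant__ float scale;

__global__ void scaleArray(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = scale * in[i];   // device code only reads 'scale'
}

void setScale(float value)
{
    // Host side: write the constant before launching the kernel
    cudaMemcpyToSymbol(scale, &value, sizeof(float));
}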

Memory transfers

The host can transfer data to/from the device over PCIe:

  • Global memory
  • Constant memory

cudaMalloc()
  • Allocates global memory on the device

cudaFree()
  • Frees device memory
Example
float *deviceMemory = NULL; // Pointer to device memory
int size = width * height * sizeof(float); // size in bytes
cudaMalloc((void**)&deviceMemory, size); // Allocate memory
// Do work
cudaFree(deviceMemory); // Free memory

Note that deviceMemory points to memory on the device, not on the host, so host code cannot dereference it directly. (This does not contradict the idea of global memory: "global" refers to its scope across all device threads, not accessibility from the host; the host reaches it only through calls such as cudaMemcpy.)

cudaMemcpy()

CUDA memory copy: transfers data between host and device. The four copy directions are listed below; a combined round-trip sketch follows the list.

  • Host to host

    cudaMemcpy(destPtr, sourcePtr, size, cudaMemcpyHostToHost);

  • Host to device

    cudaMemcpy(devicePtr, hostPtr, size, cudaMemcpyHostToDevice);

    (destination (device pointer), source (host pointer), size in bytes, copy direction)

  • Device to host

    cudaMemcpy(hostPtr, devicePtr, size, cudaMemcpyDeviceToHost);

  • Device to device

    cudaMemcpy(destPtr, sourcePtr, size, cudaMemcpyDeviceToDevice);
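
Putting cudaMalloc, cudaMemcpy, and a kernel launch together, a minimal round trip for the vectorAdd kernel shown earlier might look like this (error checking and host-side initialization omitted for brevity):

#include <cstdlib>

int main()
{
    const int N = 1024;
    const int size = N * sizeof(float);

    // Host allocations (initialization omitted)
    float* hostA = (float*)malloc(size);
    float* hostB = (float*)malloc(size);
    float* hostC = (float*)malloc(size);

    // Device allocations
    float *devA, *devB, *devC;
    cudaMalloc((void**)&devA, size);
    cudaMalloc((void**)&devB, size);
    cudaMalloc((void**)&devC, size);

    // Host to device
    cudaMemcpy(devA, hostA, size, cudaMemcpyHostToDevice);
    cudaMemcpy(devB, hostB, size, cudaMemcpyHostToDevice);

    // Run the vectorAdd kernel defined earlier (one block of N threads)
    vectorAdd<<<1, N>>>(devA, devB, devC);

    // Device to host
    cudaMemcpy(hostC, devC, size, cudaMemcpyDeviceToHost);

    // Clean up
    cudaFree(devA); cudaFree(devB); cudaFree(devC);
    free(hostA); free(hostB); free(hostC);
    return 0;
}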

Example: matrix multiply

\[ \mathbf{P} = \mathbf{M}\mathbf{N} \]

\(\mathbf{P,M,N}\) are all square matrices with the same width.

CPU implementation

void MatrixMultiplyOnHost(float* M, float* N, float* P, int width)
{
    for (int i = 0; i < width; ++i)
    {
        for (int j = 0; j < width; ++j)
        {
            float sum = 0;
            for (int k = 0; k < width; ++k)
            {
                float a = M[i * width + k];
                float b = N[k * width + j];
                sum += a * b;
            }
            P[i * width + j] = sum;
        }
    }
}

GPU implementation

CUDA skeleton

This is a general framework that is suitable for almost all CUDA programs:

int main() {
    // 1. Allocate and initialize the M, N, and result P matrices on the host
    //    Copy the M and N matrices to the device
    // 2. Compute M * N on the device
    // 3. Copy the P matrix back to the host and output it
    //    Free device memory and clean up
    return 0;
}

Filling in the skeleton

Step 1: Add CUDA memory transfers to the skeleton
void MatrixMultiplyOnDevice(float* hostP,
const float* hostM, const float* hostN, const int width)
{
    int sizeInBytes = width * width * sizeof(float);
    float *devM, *devN, *devP;
    
    // Allocate M and N on device
    cudaMalloc((void**)&devM, sizeInBytes);
    cudaMalloc((void**)&devN, sizeInBytes);
    
    // Allocate P
    cudaMalloc((void**)&devP, sizeInBytes);
    
    // Copy M and N from host to device
    cudaMemcpy(devM, hostM, sizeInBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(devN, hostN, sizeInBytes, cudaMemcpyHostToDevice);
    
    // Call the kernel here - Look back at these lines in step 3 later
    // Setup thread/block execution configuration
    dim3 threads(width, width);
    dim3 blocks(1, 1);
    // Launch the kernel
    MatrixMultiplyKernel<<<blocks, threads>>>(devM, devN, devP, width);
    
    // Copy P matrix from device to host
    cudaMemcpy(hostP, devP, sizeInBytes, cudaMemcpyDeviceToHost);
    
    // Free allocated memory
    cudaFree(devM); cudaFree(devN); cudaFree(devP);
}
Step 2: Implement the kernel in CUDA C
__global__ void MatrixMultiplyKernel(const float* devM, const float* devN,
                                     float* devP, const int width)
{
    // Accessing a Matrix, use 2D threads
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    
    // Initialize accumulator to 0
    float pValue = 0;
    // Multiply and add
    for (int k = 0; k < width; k++)
    {
        float m = devM[ty * width + k];
        float n = devN[k * width + tx];
        pValue += m * n;
        // no synchronization is needed in this loop
    }
    
    // Write value to device memory
    // each thread has unique index to write to
    devP[ty * width + tx] = pValue;
}
Step 3: Invoke the kernel

Look back at the "Call the kernel here" comment in Step 1:

// Call the kernel here - Look NOW
// Setup thread/block execution configuration
dim3 threads(width, width);
dim3 blocks(1, 1); // 1 block
// Launch the kernel
MatrixMultiplyKernel<<<blocks, threads>>>(devM, devN, devP, width);

Analysis of the example

  • One block of threads computes the matrix devP

    • Each thread computes one element of devP
  • Each thread

    • Loads a row of matrix devM
    • Loads a column of matrix devN
    • Performs one multiply and one addition for each pair of devM and devN elements
    • The compute to off-chip memory access ratio is close to 1:1 (not very high; the kernel is load/store intensive)
  • The matrix size is limited by the number of threads allowed in a thread block (a multi-block variant that lifts this limit is sketched below)

  • The performance bottleneck is:

    • memory bandwidth, due to the heavy global memory access
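
The size limit above can be lifted by tiling the output over multiple blocks. A hedged sketch (illustrative only, not the original post's code) in which each thread derives its row and column from both blockIdx and threadIdx, with a bounds check so that the width need not be a multiple of the tile size:

// Multi-block variant: the output matrix is tiled over a 2D grid of blocks.
// TILE is an illustrative tile width, not from the original post.
#define TILE 16

__global__ void MatrixMultiplyKernelMultiBlock(const float* devM, const float* devN,
                                               float* devP, const int width)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    if (row < width && col < width)
    {
        float pValue = 0;
        for (int k = 0; k < width; k++)
            pValue += devM[row * width + k] * devN[k * width + col];
        devP[row * width + col] = pValue;
    }
}

// Launch configuration: enough TILE x TILE blocks to cover the whole matrix
// dim3 threads(TILE, TILE);
// dim3 blocks((width + TILE - 1) / TILE, (width + TILE - 1) / TILE);
// MatrixMultiplyKernelMultiBlock<<<blocks, threads>>>(devM, devN, devP, width);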
