CUDA Programming 1
Basic CUDA programming methods, with a simple matrix multiply example constructed from scratch.
The pictures and content here are not original; the material of the whole series is mainly collected from:
Introduction
GPU Computing History: APIs
2001/2002 - researchers see the GPU as a data-parallel coprocessor
The GPGPU field is born, extending GPUs to general-purpose computing
2007 - NVIDIA releases CUDA
- CUDA: Compute Unified Device Architecture
- GPGPU shifts to GPU Computing
2008 - Khronos releases OpenCL specification
2013 - Khronos releases OpenGL compute shaders
2015 - Khronos releases Vulkan and SPIR-V
CUDA Terminology
- Host - typically the CPU
  - Code written in ANSI C, or other languages
- Device - typically the GPU (data-parallel)
  - Code written in extended ANSI C, or Fortran
- Host and device have separate memories
  - Except for SoCs (System on a Chip, with a unified memory architecture)
- CUDA Program
  - Contains both host and device code
Kernel
A CUDA Kernel represents a data-parallel function
- Invoking a kernel creates lightweight threads on the device
- Threads are generated and scheduled by the hardware
- Similar to a shader in OpenGL/WebGL/Vulkan
Example
Execute \(C = A + B\) N times in parallel using N different CUDA threads
// Kernel Definition
__global__ void vectorAdd(const float* A, const float* B, float* C)
{
int i = threadIdx.x;
C[i] = A[i] + B[i];
}
int main()
{
// ...
// Kernel invocation with N threads
vectorAdd<<<1, N>>>(A, B, C);
}

| Syntax | Meaning |
|---|---|
| `__global__` | Declaration specifier |
| `threadIdx.x` | Thread index |
| `<<<1, N>>>` | Kernel execution configuration |
CUDA Program Execution
When a CUDA program is executed, serial code runs on the host until a kernel invocation is reached. The parallel code then runs on the device, and the resulting data is gathered back to the host, where serial execution resumes.
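A minimal end-to-end sketch of this flow (the scaleKernel kernel and its data are illustrative assumptions; the memory-transfer calls used here are explained in a later section):

#include <cuda_runtime.h>

__global__ void scaleKernel(float* data)          // parallel part, runs on the device
{
    data[threadIdx.x] *= 2.0f;
}

int main()
{
    const int N = 8;
    float host[N] = {0, 1, 2, 3, 4, 5, 6, 7};     // serial host code: prepare data

    float* dev = NULL;
    cudaMalloc((void**)&dev, N * sizeof(float));  // serial host code: allocate device memory
    cudaMemcpy(dev, host, N * sizeof(float), cudaMemcpyHostToDevice);

    scaleKernel<<<1, N>>>(dev);                   // parallel code runs on the device

    cudaMemcpy(host, dev, N * sizeof(float), cudaMemcpyDeviceToHost); // gather result back
    cudaFree(dev);                                // serial host code resumes
    return 0;
}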
Thread Hierarchies
Grid - one or more thread blocks
- 1D, 2D or 3D
- Example: index into a vector, matrix, or volume; 2D and 3D grids are convenient for representing 2D and 3D arrays
Block - array of threads
- 1D, 2D, or 3D
- Each block in a grid has the same number of threads
- Each thread in a block can (see the sketch after this list)
  - Synchronize with the other threads of its block
  - Access the block's shared memory hazard-free
Two threads from two different blocks cannot cooperate
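As an illustration of this intra-block cooperation, here is a minimal sketch (the reverseBlock kernel is a hypothetical example, not from the original material) in which the threads of one block stage data in shared memory and synchronize before reading it back:

// Hypothetical kernel: threads of one block cooperate through shared memory
// to reverse an n-element array (n == blockDim.x is assumed, n <= 1024)
__global__ void reverseBlock(float* data, int n)
{
    extern __shared__ float buffer[];      // per-block shared memory
    int i = threadIdx.x;
    buffer[i] = data[i];                   // each thread stages one element
    __syncthreads();                       // barrier: wait for the whole block
    data[i] = buffer[n - 1 - i];           // read a value written by another thread
}

// Launch with n threads and n * sizeof(float) bytes of dynamic shared memory:
// reverseBlock<<<1, n, n * sizeof(float)>>>(devData, n);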
Thread ID, one block
Thread ID: Scalar thread identifier
Thread Index: threadIdx
- 1D: Thread ID == Thread Index
- 2D block of size \((D_x, D_y)\)
  - Thread ID of the thread at index \((x, y)\): \(x + D_x \cdot y\)
- 3D block of size \((D_x, D_y, D_z)\)
  - Thread ID of the thread at index \((x, y, z)\): \(x + D_x \cdot y + D_x \cdot D_y \cdot z\)
Example
// Kernel Definition
__global__ void matrixAddition(const float* A, const float* B, float* C)
{
// 2D Thread Index to 1D memory index
int idx = threadIdx.y * blockDim.x + threadIdx.x;
C[idx] = A[idx] + B[idx];
}
int main()
{
// ....
// Kernel invocation with one block of N * N * 1 threads
int blocks = 1;
dim3 threadsPerBlock(N, N); // N rows x N columns
// 1 block, 2D block of threads
matrixAddition<<<blocks, threadsPerBlock>>>(A, B, C);
}

Here, `blockDim` is a built-in variable (of type `dim3`) holding the dimensions of the thread block, so `blockDim.x` is the number of threads along x (one row of the thread matrix).
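The 2D example above follows the \(index_{(x,y)}\) formula; for completeness, a minimal sketch of the 3D case (the volumeAdd kernel name and layout are illustrative assumptions) might look like:

// Hypothetical kernel: flatten a 3D thread index into a 1D memory index,
// following Thread ID = x + Dx*y + Dx*Dy*z from above
__global__ void volumeAdd(const float* A, const float* B, float* C)
{
    int idx = threadIdx.x
            + blockDim.x * threadIdx.y
            + blockDim.x * blockDim.y * threadIdx.z;
    C[idx] = A[idx] + B[idx];
}

// Launch with one block of N x N x N threads (N^3 must not exceed the
// per-block thread limit discussed below):
// dim3 threadsPerBlock(N, N, N);
// volumeAdd<<<1, threadsPerBlock>>>(A, B, C);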

Group of threads
- G80 and GT200: Up to 512 threads
- Fermi, Kepler, Maxwell, Pascal: Up to 1024 threads
Threads of a block are located on the same processor core (streaming multiprocessor, SM) and share the memory of that SM.
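These per-block limits can be queried at runtime; a minimal sketch using cudaGetDeviceProperties (querying device 0 is an arbitrary choice):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                 // properties of device 0
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max block dimensions:  %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Number of SMs:         %d\n", prop.multiProcessorCount);
    return 0;
}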
Thread ID, multiple blocks
1D or 2D Grid
Block Index: blockIdx
Example
// Kernel Definition
__global__ void matrixAddition(const float* A, const float* B, float* C)
{
// 2D block index and 2D thread index to a 1D memory index
int blockId_2D = blockIdx.y * gridDim.x + blockIdx.x;
int threadId_2D = threadIdx.y * blockDim.x + threadIdx.x;
int idx = blockId_2D * (blockDim.x * blockDim.y) + threadId_2D;
C[idx] = A[idx] + B[idx];
}
int main()
{
// ....
// Kernel invocation with computed configuration
dim3 threadsPerBlock(N, N); // N*N threads per block
dim3 blocks(M, M); // M*M blocks per grid
// 2D grid of blocks, 2D block of threads
matrixAddition<<<blocks, threadsPerBlock>>>(A, B, C);
}
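An alternative, and arguably more common, indexing pattern for a row-major matrix computes a global row and column directly instead of a flattened block/thread ID. A minimal sketch (the matrixAddition2D name, and the assumption that the grid exactly covers a matrix of the given width, are illustrative):

// Hypothetical variant: index a row-major matrix by global row/column,
// assuming width = gridDim.x * blockDim.x = gridDim.y * blockDim.y
__global__ void matrixAddition2D(const float* A, const float* B, float* C, int width)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // global column index
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // global row index
    int idx = row * width + col;                       // row-major 1D index
    C[idx] = A[idx] + B[idx];
}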
Group of blocks
Blocks execute independently
- In any order: in parallel or in series
- A block can run whenever an SM becomes available for it
- Scheduled in any order, on any number of cores (SMs)
- This allows the code to scale with the core count
- For example, with 8 blocks on a GPU with 2 cores (SMs), 2 blocks run in parallel at a time; on a GPU with 4 cores, 4 blocks run in parallel at a time (see the sketch after this list)
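Because blocks are scheduled independently, a typical pattern is simply to launch as many blocks as the problem needs and let the hardware spread them over the available SMs. A minimal sketch that sizes the grid with ceiling division (reusing the hypothetical matrixAddition2D kernel sketched earlier; a bounds check inside the kernel would handle any leftover threads):

// Hypothetical launch-configuration sketch: cover width x width elements
// with 16x16 thread blocks, rounding the grid size up
dim3 threadsPerBlock(16, 16);
dim3 blocks((width + threadsPerBlock.x - 1) / threadsPerBlock.x,
            (width + threadsPerBlock.y - 1) / threadsPerBlock.y);
matrixAddition2D<<<blocks, threadsPerBlock>>>(A, B, C, width);

// Inside the kernel, threads outside the matrix should return early:
// if (row >= width || col >= width) return;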
CUDA Memory Transfers

Memory Accessibility
| Device code can: | Host code can: |
|---|---|
| R/W per-thread registers | |
| R/W per-thread local memory | |
| R/W per-block shared memory | |
| R/W per-grid global memory | R/W per-grid global memory |
| Read only per-grid constant memory | R/W per-grid constant memory |
| Read only per-grid texture memory | R/W per-grid texture memory |
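A minimal sketch of the constant-memory row of this table (the coeffs symbol and its values are illustrative assumptions): the host writes per-grid constant memory with cudaMemcpyToSymbol, while device code can only read it:

#include <cuda_runtime.h>

__constant__ float coeffs[4];                      // per-grid constant memory (read-only on device)

__global__ void applyCoeffs(float* data)
{
    data[threadIdx.x] *= coeffs[threadIdx.x % 4];  // device code reads constant memory
}

int main()
{
    float hostCoeffs[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    // Host code writes per-grid constant memory
    cudaMemcpyToSymbol(coeffs, hostCoeffs, sizeof(hostCoeffs));
    // ... allocate data, launch applyCoeffs, copy back ...
    return 0;
}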
Memory transfers
The host can transfer data to and from the device through PCIe I/O
- Global memory
- Constant memory
cudaMalloc()
- Allocates global memory on the device
cudaFree()
- Frees allocated device memory
Example
float *deviceMemory = NULL; // Pointer to device memory
int size = width * height * sizeof(float); // size in bytes
cudaMalloc((void**)&deviceMemory, size); // Allocate memory
// Do work
cudaFree(deviceMemory); // Free memory

Note that `deviceMemory` is a host-side pointer variable whose value refers to memory on the device, not the host, so the host cannot dereference it directly. It is global memory on the device; the host reaches it only through API calls such as cudaMemcpy().
cudaMemcpy()
CUDA memory copy: transfers memory between host and device. The arguments are (destination pointer, source pointer, size in bytes, copy direction).

Host to host
cudaMemcpy(destPtr, sourcePtr, size, cudaMemcpyHostToHost);

Host to device
cudaMemcpy(devicePtr, hostPtr, size, cudaMemcpyHostToDevice);

Device to host
cudaMemcpy(hostPtr, devicePtr, size, cudaMemcpyDeviceToHost);

Device to device
cudaMemcpy(destPtr, sourcePtr, size, cudaMemcpyDeviceToDevice);
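A minimal round-trip sketch combining cudaMalloc, both copy directions, and a basic error check (the buffer size and contents are arbitrary assumptions):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const int n = 256;
    const int size = n * sizeof(float);
    float hostData[n];
    for (int i = 0; i < n; ++i) hostData[i] = (float)i;

    float* devData = NULL;
    cudaMalloc((void**)&devData, size);                          // allocate device global memory
    cudaMemcpy(devData, hostData, size, cudaMemcpyHostToDevice); // host -> device
    // ... launch kernels that read/write devData ...
    cudaMemcpy(hostData, devData, size, cudaMemcpyDeviceToHost); // device -> host

    cudaError_t err = cudaGetLastError();                        // basic error check
    if (err != cudaSuccess)
        printf("CUDA error: %s\n", cudaGetErrorString(err));

    cudaFree(devData);
    return 0;
}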
Example: matrix multiply
\[ \mathbf{P} = \mathbf{M}\mathbf{N} \]
\(\mathbf{P,M,N}\) are all square matrices with the same width.
CPU implementation
void MatrixMultiplyOnHost(float* M, float* N, float* P, int width)
{
for (int i = 0; i < width; ++i)
{
for (int j = 0; j < width; ++j)
{
float sum = 0;
for (int k = 0; k < width; ++k)
{
float a = M[i * width + k];
float b = N[k * width + j];
sum += a * b;
}
P[i * width + j] = sum;
}
}
}

GPU implementation
CUDA skeleton
This is a general framework that is suitable for almost all CUDA programs.
int main() {
// 1. Allocate and Initialize M, N, and result P matrices
// Copy M, N matrices to device
// 2. M * N on device
// 3. Copy P matrix to host and output
// Free device memory and clean up
return 0;
}

Fill in the skeleton
Step 1: Add CUDA memory transfers to the skeleton
void MatrixMultiplyOnDevice(float* hostP,
const float* hostM, const float* hostN, const int width)
{
int sizeInBytes = width * width * sizeof(float);
float *devM, *devN, *devP;
// Allocate M and N on device
cudaMalloc((void**)&devM, sizeInBytes);
cudaMalloc((void**)&devN, sizeInBytes);
// Allocate P
cudaMalloc((void**)&devP, sizeInBytes);
// Copy M and N from host to device
cudaMemcpy(devM, hostM, sizeInBytes, cudaMemcpyHostToDevice);
cudaMemcpy(devN, hostN, sizeInBytes, cudaMemcpyHostToDevice);
// Call the kernel here - Look back at these lines in step 3 later
// Setup thread/block execution configuration
dim3 threads(width, width);
dim3 blocks(1, 1);
// Launch the kernel
MatrixMultiplyKernel<<<blocks, threads>>>(devM, devN, devP, width);
// Copy P matrix from device to host
cudaMemcpy(hostP, devP, sizeInBytes, cudaMemcpyDeviceToHost);
// Free allocated memory
cudaFree(devM); cudaFree(devN); cudaFree(devP);
}

Step 2: Implement the kernel in CUDA C
__global__ void MatrixMultiplyKernel(const float* devM, const float* devN,
float* devP, const int width)
{
// Accessing a Matrix, use 2D threads
int tx = threadIdx.x;
int ty = threadIdx.y;
// Initialize accumulator to 0
float pValue = 0;
// Multiply and add
for (int k = 0; k < width; k++)
{
float m = devM[ty * width + k];
float n = devN[k * width + tx];
pValue += m * n;
// no need synchronization for this code
}
// Write value to device memory
// each thread has unique index to write to
devP[ty * width + tx] = pValue;
}

Step 3: Invoke the kernel
Look back at the lines marked "Call the kernel here" in Step 1:
// Call the kernel here - Look NOW
// Setup thread/block execution configuration
dim3 threads(width, width);
dim3 blocks(1, 1); // 1 block
// Launch the kernel
MatrixMultiplyKernel<<<blocks, threads>>>(devM, devN, devP, width);

Analysis of the example
One block of threads computes the matrix `devP`
- Each thread computes one element of `devP`

Each thread
- Loads a row of matrix `devM`
- Loads a column of matrix `devN`
- Performs one multiply and one addition for each pair of `devM` and `devN` elements
- The compute to off-chip memory access ratio is close to 1:1 (not very high; the kernel is load/store intensive)

The size of the matrix is limited by the number of threads allowed in a thread block (e.g. at most 32 x 32 with a 1024-thread limit)
The performance bottleneck is:
- memory bandwidth, due to the heavy global memory access
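One way to sanity-check the device result is to compare it against the CPU version defined earlier. A minimal sketch (the width of 16 and the tolerance are illustrative assumptions; width must stay within the per-block thread limit since the kernel uses a single block):

#include <cmath>
#include <cstdio>
#include <cstdlib>

// Assumes MatrixMultiplyOnHost and MatrixMultiplyOnDevice from above
int main()
{
    const int width = 16;                  // 16*16 = 256 threads, within the per-block limit
    const int count = width * width;
    float *M = (float*)malloc(count * sizeof(float));
    float *N = (float*)malloc(count * sizeof(float));
    float *Pcpu = (float*)malloc(count * sizeof(float));
    float *Pgpu = (float*)malloc(count * sizeof(float));
    for (int i = 0; i < count; ++i)
    {
        M[i] = rand() / (float)RAND_MAX;
        N[i] = rand() / (float)RAND_MAX;
    }

    MatrixMultiplyOnHost(M, N, Pcpu, width);
    MatrixMultiplyOnDevice(Pgpu, M, N, width);

    // Compare element-wise with a small tolerance
    int errors = 0;
    for (int i = 0; i < count; ++i)
        if (std::fabs(Pcpu[i] - Pgpu[i]) > 1e-4f) ++errors;
    printf("%s (%d mismatches)\n", errors == 0 ? "PASSED" : "FAILED", errors);

    free(M); free(N); free(Pcpu); free(Pgpu);
    return 0;
}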
