Test GPU servers: single GPU
Perform a cross-benchmark between an 8× NVIDIA A800 GPU server and a GeForce RTX 3090: single-GPU performance on basic matrix operations (Python and C++) and real-life large-scale neural networks.
Benchmark configuration:
- Targets:
  - Floating point operations per second (FLOPS)
  - Bandwidth
  - State monitoring
    - pcie.link.gen.current
    - pcie.link.gen.max
    - pstate
    - clocks.current.graphics [MHz]
    - clocks.max.graphics [MHz]
  - Accuracy (future)
- Platforms:
  - 8× NVIDIA A800
  - GeForce RTX 3090
  - Intel® Xeon® Gold 6234 Processor (CPU)
- Implementations:
  - Python (PyTorch, backed by cuBLAS)
  - C++ (cuBLAS directly)
- Precisions:
  - Float64 (double)
  - Float32 (single)
  - Float16 (half)
- Tasks (single GPU):
  - Basic:
    - Matrix-matrix multiplication
    - Vector times a number
  - Neural network layer:
    - BERT
    - GPT-2
    - T5
  - Simulations:
    - N-body (following blogs)
    - SPH
    - FVM (suspended)
Single GPU cross-benchmark (basic: GEMM, SCAL; NN layer: BERT, GPT2, T5; simulation: nbody, SPH, FVM):

| | | | GEMM | SCAL | BERT | GPT2 | T5 | nbody | SPH | FVM | note |
|---|---|---|---|---|---|---|---|---|---|---|---|
| A800*8 (w/ NVLink) | torch | FP64 | √ | | | | | | | | |
| | | FP32 | √ | √# | √ | √ | √ | | | | |
| | | FP16 | √ | | √ | √ | √ | | | | |
| | cuda | FP64 | √ | o | | | | √ | | | |
| | | FP32 | √ | o | | | | √ | | | |
| | | FP16 | √ | | | | | | | | |
| 3090 | torch | FP64 | √ | | | | | | | | |
| | | FP32 | √ | √# | √ | √ | √ | | | | |
| | | FP16 | √ | | √ | √ | √ | | | | |
| | cuda | FP64 | √ | o | | | | o | | | |
| | | FP32 | √ | o | | | | o | | | |
| | | FP16 | √ | | | | | | | | |
| cpu | torch | FP64 | √ | | | | | | | | |
| | | FP32 | √ | √# | | | | | | | |
| | | FP16 | | | | | | | | | |
| | c++ | FP64 | o | o | | | | o | | | |
| | | FP32 | o | o | | | | o | | | |
| | | FP16 | | | | | | | | | |
√ indicates that FLOPS has been benchmarked, # indicates that bandwidth has been benchmarked, and o indicates benchmarks we plan to do but that need more support. All GPU benchmarks are performed under state monitoring.
In the Specifications section below, the on-paper performance data is collected and analyzed. In the benchmark process section, only Python code snippets and partial results are included for illustration and explanation. The full data is in the full report sections.
Specifications
Specifications and other useful information related to the scientific computing performance of the A800, A100, RTX 3090, and Xeon® Gold 6234 (CPU) are listed below:
| | Architecture | Memory (GB) | ECC | Bandwidth (GB/s) | CUDA Cores | FP16 Tensor Cores | FP64 GFLOPS | FP32 TFLOPS | FP16 TFLOPS | Compute Capability | PCIe (Gen) | PCIe (GB/s) | NVLink (Gen) | NVLink (GB/s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A100 | Ampere | 80 | √ | 2039 | 6912 | 432 | 19500 | 19.5 | 312 | 8.0 | 4.0 | 2*32 | 3 | 50*12 |
| A800 | Ampere | 80 | √ | 2039 | 6912 | 432 | 19500 | 19.5 | 312 | 8.0 | 4.0 | 2*32 | 3 | 50*8 |
| RTX 3090 | Ampere | 24 | x | 936 | 10496 | 328 | 556 | 35.6 | 142 | 8.6 | 4.0 | 2*16 | 3 | 50*4 |
| Xeon® Gold 6234 | CPU | N/A | √ | N/A | x | x | N/A | 0.2158 | N/A | x | 3.0 | x | x | x |
Data marked in italics is from the (unofficial) TechPowerUp GPU Database; see the reference list. Data of the A4000 marked in orange italics is inaccessible or likely not true. The Xeon® Gold 6234 (CPU) is added to this list; only the Geekbench (not the theoretical) FP32 performance of the similar Xeon® Gold 6230 was found. Data of the A100 80GB PCIe and A800 80GB PCIe variants are selected.
| Instruction throughput (results per clock cycle per multiprocessor) / Compute Capability | 5.0, 5.2 | 5.3 | 6.0 | 6.1 | 6.2 | 7.x | 8.0 | 8.6 | 8.9 | 9.0 |
|---|---|---|---|---|---|---|---|---|---|---|
| 16-bit floating-point add, multiply, multiply-add | N/A | 256 | 128 | 2 | 256 | 128 | 256[3] | 256[3] | 128 | 256 |
| 32-bit floating-point add, multiply, multiply-add | 128 | 128 | 64 | 128 | 128 | 64 | 64 | 128 | 128 | 128 |
| 64-bit floating-point add, multiply, multiply-add | 4 | 4 | 32 | 4 | 4 | 32[5] | 32 | 2 | 2 | 64 |
[3] 128 for __nv_bfloat16.
[5] 2 for compute capability 7.5 GPUs.
The TFLOPS at each precision depend on the number of CUDA cores, the related Tensor Cores, and the instruction throughput.
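As a quick sanity check, the theoretical FP32 number is roughly the number of CUDA cores × 2 (one fused multiply-add per core per cycle) × the boost clock: \(6912 \times 2 \times 1.41\ \mathrm{GHz} \approx 19.5\) TFLOPS for the A100/A800 and \(10496 \times 2 \times 1.695\ \mathrm{GHz} \approx 35.6\) TFLOPS for the RTX 3090, matching the table above.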
For the A100, the Float64 performance is much better because of its extra Float64 Tensor Cores and its high instruction throughput. See also "why cuBLASHgemm is slower more than cuBLASSgemm when I use?" (the GTX 1070, compute capability 6.1, has very low FP16 throughput; the only Pascal GPU with "fast" FP16 is the P100).
Starting with cuBLAS version 11.0.0, there is no longer any restriction on matrix dimensions and memory alignments to use Tensor Cores. However, the best performance can be obtained using Tensor Cores when the matrix size and pointers meet certain memory alignment requirements. Specifically, all of the following conditions must be met to get the maximum performance from Tensor Cores.
- m % 8 == 0
- k % 8 == 0
- op_B == CUBLAS_OP_N || n % 8 == 0
- intptr_t(A) % 16 == 0
- intptr_t(B) % 16 == 0
- intptr_t(C) % 16 == 0
- intptr_t(A+lda) % 16 == 0
- intptr_t(B+ldb) % 16 == 0
- intptr_t(C+ldc) % 16 == 0
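As an illustration (not from the original benchmark code), these conditions can be roughly checked for PyTorch tensors before launching a half-precision GEMM; the helper below is a hypothetical sketch using data_ptr() for the pointer checks:

```python
import torch

def tensor_core_aligned(A, B, C):
    """Hypothetical helper: rough check of the cuBLAS Tensor Core conditions for C = A @ B."""
    m, k = A.shape
    _, n = B.shape
    def ptr_ok(t):
        # base pointer and the byte stride between rows should both be 16-byte aligned
        return t.data_ptr() % 16 == 0 and (t.stride(0) * t.element_size()) % 16 == 0
    return m % 8 == 0 and k % 8 == 0 and n % 8 == 0 and all(map(ptr_ok, (A, B, C)))

A = torch.randn(8192, 8192, dtype=torch.float16, device='cuda')
B = torch.randn(8192, 8192, dtype=torch.float16, device='cuda')
C = torch.empty(8192, 8192, dtype=torch.float16, device='cuda')
print(tensor_core_aligned(A, B, C))  # True for these sizes
```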
Bandwidth is also a very important factor and is often the key bottleneck in real-life performance; we will see this repeatedly in the next sections. That is why large-bandwidth (fast) memory can be really expensive. In parallel computation, the bandwidth of the interconnect is even more important. Take the A40 as an example: the GPU-to-GPU bandwidth through PCIe can only reach 64 GB/s in the absence of NVLink, and this speed highly depends on the PCIe configuration of the motherboard; if the motherboard only supports 16 PCIe lanes, the bandwidth can only be 32 GB/s. With NVLink, GPU-to-GPU communication can be much faster: on the A40 it can reach 112.5 GB/s.
(Figures: GPU interconnect topologies with NVLink; right: an NVSwitch-connected 8-GPU NVIDIA DGX A100.)
Note that an NVLink bridge only connects 2 GPUs. In a 4-GPU machine, PCIe is still used for GPUs not connected by NVLink. Another configuration uses NVSwitch plus NVLink to connect multiple cards, for example the NVIDIA DGX A100 (right figure above) containing 8× NVIDIA A100 GPUs: 12 NVLinks per GPU provide 600 GB/s of bidirectional GPU-to-GPU bandwidth, and 6 NVIDIA NVSwitches give 4.8 TB/s of bidirectional bandwidth across all GPUs.
PCIe 4.0 doubles the data transfer speed of the previous generation (PCIe 3.0) from 1 GB/s per lane to 2 GB/s per lane, providing a total of 32 GB/s in a 16-lane configuration. Furthermore, PCIe 4.0 provides up to 16 GT/s per lane, compared to the previous generation's 8 GT/s. Each new generation of PCIe doubles the data transfer rate and total bandwidth per lane of the prior generation, paving the way for new, faster PCIe devices.
NVLink is a direct GPU-to-GPU interconnect that scales multi-GPU input/output (IO) within the server. NVSwitch connects multiple NVLinks to provide all-to-all GPU communication at full NVLink speed within a single node and between nodes. NVLink and NVSwitch for the NVIDIA Ampere architecture provide up to 600 GB/s of GPU-to-GPU bandwidth.
Benchmark process of a single GPU
Prerequisites
Create a new conda environment and install PyTorch, the Hugging Face transformers package (for the models under test), Jupyter Notebook (not required), and pandas.
Preparations
The benchmarks run on the GPU server (A800) and on the GeForce RTX 3090 in my own computer. First, check the environment:
import torch
torch.cuda.set_device(1) # for gpu server
print('Pytorch version\t:', torch.__version__)
print('CUDA version\t:', torch.version.cuda)
print('GPU\t\t:', torch.cuda.get_device_name())

(Environment output screenshots for the A800 and the 3090.)
Define benchmark functions:
import inspect
from collections import defaultdict
import pandas as pd
from torch.utils import benchmark
pd.options.display.precision = 3
# collect the given variables into a {name: value} dict by inspecting the caller's frame
def var_dict(*args):
    callers_local_vars = inspect.currentframe().f_back.f_locals.items()
    return dict([(name, val) for name, val in callers_local_vars if val is arg][0]
                for arg in args)

# return the median of multiple benchmark runs
def walltime(stmt, arg_dict, duration=60):
    return benchmark.Timer(stmt=stmt, globals=arg_dict).blocked_autorange(
        min_run_time=duration).median

Matrix operations
Matrix-matrix multiplication (GEMM)
The performance of matrix multiplication represents the upper bound of what the hardware can deliver.
Matrix multiplication performance is a main topic in HPC, with a large number of research papers and libraries devoted to it. For example, on the CPU, BLAS (Fortran) / CBLAS (C) is the mainstream API covering the Basic Linear Algebra Subprograms, with backend libraries including the open-source ATLAS and OpenBLAS and the closed-source Intel® MKL. For NVIDIA GPUs, the backend library, cuBLAS, is not open source; you may check CUTLASS, which claims similar performance to cuBLAS, for some implementation details. There is also a newer library from OpenAI, openai-gemm, aimed only at faster GEMM operations.
The example code for measuring the TFLOPS of matrix multiplication is shown below. Four sizes of square matrices in three precisions are tested.
matmul_tflops = defaultdict(lambda: {})
for n in [128, 512, 2048, 8192]:
    for dtype in (torch.float64, torch.float32, torch.float16):
        a = torch.randn(n, n, dtype=dtype).cuda()
        b = torch.randn(n, n, dtype=dtype).cuda()
        t = walltime('a @ b', var_dict(a, b))
        matmul_tflops[f'n={n}'][dtype] = 2*n**3 / t / 1e12  # *
        del a, b
pd.DataFrame(matmul_tflops)

*The number of floating point operations of:
- multiplication of two \(n \times n\) matrices: \(2n^3\)
- multiplication of two \(n\)-dimensional vectors: \(2n^2\)

For example, the inner product of two vectors \(a=[a_0, a_1, ..., a_n]\) and \(b=[b_0, b_1, ..., b_n]\) can be expanded as \[\langle a,b\rangle = a_0 b_0 + a_1 b_1 + \dots + a_n b_n\] Although in hardware one multiplication and one addition are done at once, referred to as a multiply-accumulate operation (MAC), they are normally treated as two separate floating point operations when counting FLOPs.
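For example, a single multiplication of two \(8192 \times 8192\) matrices costs \(2 \times 8192^3 \approx 1.1 \times 10^{12}\) floating point operations, so at the A800's measured 18.023 Float64 TFLOPS (table below) it takes roughly 0.06 s.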
A selection of results on different platforms is shown below:
| A800 | n=128 | n=512 | n=2048 | n=8192 | Theory |
|---|---|---|---|---|---|
| Float64 | 0.45773 (+359.49%) | 10.662 (+2430.5%) | 17.188 (+3059.44%) | 18.023 (+3209.95%) | 19.5 (+3407.19%) |
| Float32 | 0.34396 (-12.28%) | 9.7233 (-49.8%) | 17.575 (-34.89%) | 18.993 (-43.89%) | 19.5 (-45.22%) |
| Float16 | 0.20737 (-49.62%) | 18.949 (-30.22%) | 201.77 (+242.0%) | 257.4 (+242.28%) | 312 (+119.72%) |
| 3090 | n=128 | n=512 | n=2048 | n=8192 | Theory |
|---|---|---|---|---|---|
| Float64 | 0.099616 (100.0%) | 0.42134 (100.0%) | 0.54402 (100.0%) | 0.54451 (100.0%) | 0.556 (100.0%) |
| Float32 | 0.3921 (100.0%) | 19.37 (100.0%) | 26.991 (100.0%) | 33.852 (100.0%) | 35.6 (100.0%) |
| Float16 | 0.41158 (100.0%) | 27.157 (100.0%) | 58.997 (100.0%) | 75.201 (100.0%) | 142 (100.0%) |
| cpu | n=128 | n=512 | n=2048 | n=8192 | Standard |
|---|---|---|---|---|---|
| Float64 | 0.0975 (-2.09%) | 0.09918 (-76.46%) | 0.09824 (-81.94%) | 0.09729 (-82.1%) | N/A |
| Float32 | 0.1853 (-52.7%) | 0.2042 (-98.9%) | 0.2080 (-99.2%) | 0.2066 (-99.4%) | 0.216 (-99.4%) |
| Float16 | NAN | NAN | NAN | NAN | N/A |
The general rules can be summarized as:
- the performance increases with the matrix size, i.e. the occupancy;
- a big performance jump is seen when going from 64-bit to 32-bit floating point, and from 32-bit to 16-bit floating point;
- the best TFLOPS obtained is still far from the theoretical TFLOPS, because the performance is bottlenecked by the memory bandwidth.
From the comparison, we can see that for the A800:
- it has much higher Float64 and Float16 FLOPS than the others;
- its Float32 performance is slightly lower, consistent with the theoretical numbers.
Vector times a number (SCAL)
The number of computations (FLOPs) is only part of the story. Memory bandwidth is the other part, and most of the time it is even more important! On current computer architectures, a single access to main memory is much slower than a single computation, by a factor of about 100 or more. As a result, the amount of memory reads and writes has a big impact on speed, perhaps bigger than the FLOPs.
In order to test the bandwidth, we designed a task that scales a vector by a constant. The scaling operation itself is as fast as a snap; the main bottleneck is the bandwidth.
Element-wise tasks like this are very common, for example the time-integration step in simulations, or the activation function layers and model updates in neural networks. Such tasks are usually ignored when calculating the complexity (FLOPs), but they can actually be very slow.
It’s actually common to not count these operations, as they only take up a small fraction of the overall time. We’re mostly interested in the (big) matrix multiplies and dot products, and we’ll simply assume that the activation function is free.
We only test with torch (code below) to illustrate the problem, and we only use FP32, because the calculation overhead of a large FP64 vector is still big and we are seeking an operation with negligible calculation overhead compared to the memory access. Besides, FP16 scaling is not supported in cuBLAS, neither by the level-1 routine cublasSscal nor by the BLAS-like extension routine cublasScalEx().
For a more detailed pure bandwidth benchmark, refer to the beginning of the next blog.
vector = defaultdict(lambda: {})
for n in [1024*64, 1024*256, 1024*1024, 1024*1024*4]:
    a = torch.randn(n).cuda()
    t = walltime('a * 1.2', var_dict(a))
    vector[n]['TFLOPS'] = n / t / 1e12
    vector[n]['GB/s'] = 8 * n / t / 1e9  # **
pd.DataFrame(vector)

**The size of an \(n\)-dimensional Float32 vector is \(4n\) bytes. In one scaling operation, each element of the vector needs to be distributed to a thread and moved back to main memory after computing. In a memory-bound task like this, assuming the time spent on arithmetic is much lower than the time spent on memory traffic, the bandwidth is therefore \(8n/t\) bytes/s.
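As a consistency check with the code above: on the A800 at \(n = 4194304\), the measured 0.21234 TFLOPS means \(n/t \approx 2.12 \times 10^{11}\) elements/s, so the estimated bandwidth is \(8 \times 2.12 \times 10^{11} \approx 1.7 \times 10^{12}\) bytes/s, i.e. the 1698.7 GB/s reported in the table below.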
| A800 | n=65536 | n=262144 | n=1048576 | n=4194304 | Peak | Theory |
|---|---|---|---|---|---|---|
| F32 TFLOPS | 0.0081949 | 0.033527 | 0.13034 | 0.21234 | 18.993 | 19.5 |
| Bandwidth (GB/s) | 65.559 (-19.37%) | 268.22 (-20.5%) | 1042.7 (+19.53%) | 1698.7 (+114.33%) | N/A | 2039 (+117.8%) |
| RTX3090 | n=65536 | n=262144 | n=1048576 | n=4194304 | Peak | Theory |
|---|---|---|---|---|---|---|
| F32 TFLOPS | 0.010163 | 0.042172 | 0.10904 | 0.09907 | 33.852 | 35.6 |
| Bandwidth (GB/s) | 81.306 (100.0%) | 337.38 (100.0%) | 872.31 (100.0%) | 792.56 (100.0%) | N/A | 936 (100.0%) |
| CPU | n=65536 | n=262144 | n=1048576 | n=4194304 | Peak | Standard |
|---|---|---|---|---|---|---|
| F32 TFLOPS | 0.0068965 | 0.0024492 | 0.002599 | 0.0021311 | 0.2066 | 0.216 |
| Bandwidth (GB/s) | 55.172 (-32.14%) | 19.594 (-94.19%) | 20.792 (-97.62%) | 17.048 (-97.85%) | N/A | N/A |
The TFLOPS is far from the peak performance even with a large vector size.
From the comparison, the bandwidth of the A800 shows as much as a +114.33% gain over the 3090.
GPU state monitoring
This task was originally designed to investigate a weird result when running GEMM with CUDA. It is not surprising that running cuBLAS directly gives better peak results than the indirect approach, i.e. torch calling cuBLAS. However, there is a large drop in performance for Float32 on the RTX 3090 with CUDA compared to its torch counterpart, and this result is repeatable.
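The state monitoring itself can be reproduced with nvidia-smi's query interface; a minimal sketch (the field list and the 5-second interval match the log headers and timestamps below, while the subprocess wrapper is my own assumption):

```python
import subprocess

# Poll the same GPU state fields as in the logs below, once every 5 seconds (Ctrl-C to stop).
fields = ("timestamp,index,name,pcie.link.gen.current,pcie.link.gen.max,"
          "pstate,clocks.current.graphics,clocks.max.graphics")
subprocess.run(["nvidia-smi", f"--query-gpu={fields}", "--format=csv", "-l", "5"])
```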
To check the reason, run the same GEMM benchmark again with CUDA and torch, in Float32 on the RTX 3090, while recording the GPU state. The result is shown below:
================== CUDA ==================
INFO: Running test for all 1 GPU deivce(s) on host tehp1308
==================
INFO: testing GPU0
==================
timestamp, index, name, pcie.link.gen.current, pcie.link.gen.max, pstate, clocks.current.graphics [MHz], clocks.max.graphics [MHz]
2022/12/22 14:26:29.629, 0, GeForce RTX 3090, 1, 3, P8, 210 MHz, 2100 MHz
2022/12/22 14:26:34.631, 0, GeForce RTX 3090, 3, 3, P2, 1695 MHz, 2100 MHz
2022/12/22 14:26:39.633, 0, GeForce RTX 3090, 3, 3, P2, 1695 MHz, 2100 MHz
2022/12/22 14:26:44.635, 0, GeForce RTX 3090, 3, 3, P2, 1575 MHz, 2100 MHz
2022/12/22 14:26:49.636, 0, GeForce RTX 3090, 3, 3, P2, 1530 MHz, 2100 MHz
2022/12/22 14:26:54.638, 0, GeForce RTX 3090, 3, 3, P2, 1440 MHz, 2100 MHz
2022/12/22 14:26:59.640, 0, GeForce RTX 3090, 3, 3, P2, 1515 MHz, 2100 MHz
2022/12/22 14:27:04.643, 0, GeForce RTX 3090, 3, 3, P2, 1485 MHz, 2100 MHz
cuBLASSgemm test result:
8.5635e-07
1.6361e-05
0.00013088
0.0010519
0.0085015
0.081756
0.29769
2.6873
9.7986
12.518
18.074
16.447
20.891
20.691

================== TORCH ==================
INFO: Running test for all 1 GPU deivce(s) on host tehp1308
==================
INFO: testing GPU0
==================
timestamp, index, name, pcie.link.gen.current, pcie.link.gen.max, pstate, clocks.current.graphics [MHz], clocks.max.graphics [MHz]
2022/12/22 14:42:38.925, 0, GeForce RTX 3090, 1, 3, P8, 255 MHz, 2100 MHz
2022/12/22 14:42:43.927, 0, GeForce RTX 3090, 3, 3, P2, 1695 MHz, 2100 MHz
2022/12/22 14:42:48.929, 0, GeForce RTX 3090, 3, 3, P2, 1950 MHz, 2100 MHz
2022/12/22 14:42:53.930, 0, GeForce RTX 3090, 3, 3, P2, 1950 MHz, 2100 MHz
2022/12/22 14:42:58.931, 0, GeForce RTX 3090, 3, 3, P2, 1950 MHz, 2100 MHz
2022/12/22 14:43:03.933, 0, GeForce RTX 3090, 3, 3, P2, 1950 MHz, 2100 MHz
2022/12/22 14:43:08.934, 0, GeForce RTX 3090, 3, 3, P2, 1935 MHz, 2100 MHz
2022/12/22 14:43:13.950, 0, GeForce RTX 3090, 3, 3, P2, 1890 MHz, 2100 MHz
2022/12/22 14:43:18.952, 0, GeForce RTX 3090, 3, 3, P2, 1800 MHz, 2100 MHz
2022/12/22 14:43:23.953, 0, GeForce RTX 3090, 3, 3, P2, 1740 MHz, 2100 MHz
2022/12/22 14:43:28.954, 0, GeForce RTX 3090, 3, 3, P2, 1695 MHz, 2100 MHz
2022/12/22 14:43:33.956, 0, GeForce RTX 3090, 3, 3, P2, 1935 MHz, 2100 MHz
Pytorch version : 1.8.0+cu111
CUDA version : 11.1
GPU : GeForce RTX 3090
8.5315e-07
1.6369e-05
0.00013096
0.0010552
0.0084168
0.06718
0.39649
2.9674
20.268
27.827
29.288
35.074
35.671
31.058

We can see that, except for the initial moment, the ranges of GPU frequencies are:
- cuBLAS: [1485, 1695] MHz
- torch: [1695, 1950] MHz
The peak frequency when running cuBLAS directly is the same as the lowest frequency when running torch!
Check the benchmark logs of the A800:
================== CUDA ==================
Running test for all 8 GPU deivce(s) on host localhost.localdomain
==================
testing GPU0
==================
timestamp, index, name, pcie.link.gen.current, pcie.link.gen.max, pstate, clocks.current.graphics [MHz], clocks.max.graphics [MHz]
2023/02/13 00:46:32.488, 0, NVIDIA A800-SXM4-80GB, 4, 4, P0, 210 MHz, 1410 MHz
2023/02/13 00:46:37.491, 0, NVIDIA A800-SXM4-80GB, 4, 4, P0, 1155 MHz, 1410 MHz
2023/02/13 00:46:42.495, 0, NVIDIA A800-SXM4-80GB, 4, 4, P0, 1155 MHz, 1410 MHz
2023/02/13 00:46:47.496, 0, NVIDIA A800-SXM4-80GB, 4, 4, P0, 1155 MHz, 1410 MHz
2023/02/13 00:46:52.938, 0, NVIDIA A800-SXM4-80GB, 4, 4, P0, 1155 MHz, 1410 MHz
2023/02/13 00:46:57.940, 0, NVIDIA A800-SXM4-80GB, 4, 4, P0, 1155 MHz, 1410 MHz
2023/02/13 00:47:02.941, 0, NVIDIA A800-SXM4-80GB, 4, 4, P0, 1155 MHz, 1410 MHz
2023/02/13 00:47:07.943, 0, NVIDIA A800-SXM4-80GB, 4, 4, P0, 1155 MHz, 1410 MHz
2023/02/13 00:47:12.944, 0, NVIDIA A800-SXM4-80GB, 4, 4, P0, 1155 MHz, 1410 MHz
2023/02/13 00:47:18.320, 0, NVIDIA A800-SXM4-80GB, 4, 4, P0, 1155 MHz, 1410 MHz
2023/02/13 00:47:23.321, 0, NVIDIA A800-SXM4-80GB, 4, 4, P0, 1410 MHz, 1410 MHz
2023/02/13 00:47:28.425, 0, NVIDIA A800-SXM4-80GB, 4, 4, P0, 1410 MHz, 1410 MHz
2023/02/13 00:47:33.427, 0, NVIDIA A800-SXM4-80GB, 4, 4, P0, 1410 MHz, 1410 MHz
2023/02/13 00:47:38.428, 0, NVIDIA A800-SXM4-80GB, 4, 4, P0, 1410 MHz, 1410 MHz
2023/02/13 00:47:43.879, 0, NVIDIA A800-SXM4-80GB, 4, 4, P0, 1410 MHz, 1410 MHz
2023/02/13 00:47:48.880, 0, NVIDIA A800-SXM4-80GB, 4, 4, P0, 1410 MHz, 1410 MHz
2023/02/13 00:47:53.953, 0, NVIDIA A800-SXM4-80GB, 4, 4, P0, 1410 MHz, 1410 MHz
2023/02/13 00:47:58.955, 0, NVIDIA A800-SXM4-80GB, 4, 4, P0, 1410 MHz, 1410 MHz
2023/02/13 00:48:03.957, 0, NVIDIA A800-SXM4-80GB, 4, 4, P0, 1410 MHz, 1410 MHz
2023/02/13 00:48:08.959, 0, NVIDIA A800-SXM4-80GB, 4, 4, P0, 1410 MHz, 1410 MHz
2023/02/13 00:48:13.962, 0, NVIDIA A800-SXM4-80GB, 4, 4, P0, 1410 MHz, 1410 MHz

No major problem is found.
Below is some more information extracted from the GPU state comparison when running GEMM:
| | pcie state (current/max gen) | pstate | freq (peak/max) | freq stability |
|---|---|---|---|---|
| A800 | 4/4 | 8-0 | 1410/1410 MHz | high |
| RTX3090 | 3/3 | 8-2 | 1950/2100 MHz | low |
It is worth mentioning that, although the max frequency of the A800 is not comparable with that of the RTX 3090, its frequency sticks to the maximum during the entire period of calculation. On the contrary, the frequency of the RTX 3090 reaches its peak, which is already lower than the max frequency, only for a short time and then deteriorates because of its weaker heat dissipation.
Also, the pstate represents the current performance state of the GPU, ranging from P0 (max) to P12 (min). The A800 immediately reaches P0, but the RTX 3090 only reaches P2. Maybe that is why the frequency of the RTX 3090 never reaches its highest value. Forcing the graphics card into the P0 state is a kind of overclocking, which can cause a crash, so I did not try it. See Issues with P-state "--pstate p0" on 3080ti & 0.24.8.
Last but not least, note that the A800 has a higher PCIe generation than the RTX 3090. In fact, the RTX 3090 supports PCIe 4.0, but the connected CPU and motherboard only support PCIe 3.0 (see Intel® Xeon® Gold 6234 Processor). This means that the A800 has twice the CPU-GPU bandwidth of the RTX 3090 per PCIe lane, which is a benefit for multi-GPU computing, especially in the absence of NVLink.
Large scale neural networks
Now, test the performance of a series of large-scale transformer[1] neural networks, then split them into pieces to analyze the performance of each kernel.
The original transformer is an encoder-decoder structure containing several transformer blocks. Each transformer block contains one multi-head attention layer (or two in a decoder block) and a feed-forward layer, with residual connections and layer norm layers, as shown below.

Define a function to benchmark both the forward and the forward-plus-backward performance using different sequence lengths and batch sizes:
def layer_benchmark(layer, hidden_size, seq_lens, batch_sizes, cross_attention=False):
    h = hidden_size
    results = defaultdict(lambda: {})
    encoder_state = 'encoder_hidden_states=X' if cross_attention else ''
    for s in seq_lens:
        for b in batch_sizes:
            ffn = 16*b*s*h*h / 1e12  # TFLOPs (operation count) of the feed-forward network
            atten = (4*b*h*s*s + 8*b*s*h*h) / 1e12  # TFLOPs of one attention block
            forward = ffn + (2 if cross_attention else 1) * atten
            X = torch.randn(b, s, h).half().cuda()
            results[f'batch={b}'][f'fwd seq_len={s}'] = forward / walltime(
                f'layer(X, {encoder_state})', var_dict(layer, X))
            results[f'batch={b}'][f'fwd+bwd seq_len={s}'] = 3 * forward / walltime(
                f'layer(X, {encoder_state})[0].sum().backward()', var_dict(layer, X))
    return pd.DataFrame(results)
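The operation counts above follow from the layer shapes: the FFN consists of two dense layers of size \(h \to 4h\) and \(4h \to h\), i.e. \(2 \times (2 \cdot b \cdot s \cdot h \cdot 4h) = 16bsh^2\) FLOPs; one attention block has four \(h \times h\) projections (\(8bsh^2\)) plus the \(QK^T\) and attention-times-\(V\) products (\(4bhs^2\)). A decoder's cross-attention adds a second attention block, hence the factor of 2 when cross_attention is set, and the backward pass is counted as roughly twice the forward pass, hence the factor of 3 for fwd+bwd.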
BERT
The architecture of BERT[2] is a multi-layer bidirectional transformer encoder (stages 1-3 in the figure above). There are two types of BERT:
- BERT Base, 12 (Nx) transformer (encoder) blocks, 12 attention heads, and 110 million parameters
- BERT Large, 24 (Nx) transformer (encoder) blocks, 16 attention heads and 340 million parameters
BERT large is adopted in this test. The hidden size is 1024, the sequence length is chosen from [128, 512], and the selected batch sizes are [4, 16, 32, 64].
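The layer and config objects passed to layer_benchmark are built from the Hugging Face transformers package; a minimal sketch of one way to construct them (the bert-large-uncased checkpoint name is an assumption):

```python
from transformers import AutoConfig
from transformers.models.bert.modeling_bert import BertLayer

config = AutoConfig.from_pretrained('bert-large-uncased')  # hidden_size = 1024
layer = BertLayer(config).half().cuda()                    # a single BERT encoder block in FP16
```

The benchmark call and the results are shown below: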
print(layer_benchmark(layer, config.hidden_size, [128, 512], [4, 16, 32, 64]).to_markdown())

| A800_0 | batch=4 | batch=16 | batch=32 | batch=64 |
|---|---|---|---|---|
| fwd seq_len=128 | 19.2772 (+8.7%) | 73.497 (+98.49%) | 111.482 (+194.8%) | 117.696 (+194.67%) |
| fwd+bwd seq_len=128 | 16.5694 (-30.17%) | 88.1707 (+192.16%) | 107.347 (+150.18%) | 135.874 (+197.25%) |
| fwd seq_len=512 | 77.2078 (+139.87%) | 94.9238 (+171.26%) | 97.682 (+171.89%) | 101.759 (+178.88%) |
| fwd+bwd seq_len=512 | 50.4813 (+57.09%) | 111.566 (+179.85%) | 117.27 (+183.52%) | 122.327 (+188.88%) |
| 3090 | batch=4 | batch=16 | batch=32 | batch=64 |
|---|---|---|---|---|
| fwd seq_len=128 | 17.7345 (100.0%) | 37.0276 (100.0%) | 37.8156 (100.0%) | 39.9415 (100.0%) |
| fwd+bwd seq_len=128 | 23.729 (100.0%) | 30.1785 (100.0%) | 42.9077 (100.0%) | 45.7111 (100.0%) |
| fwd seq_len=512 | 32.1873 (100.0%) | 34.9941 (100.0%) | 35.9264 (100.0%) | 36.4884 (100.0%) |
| fwd+bwd seq_len=512 | 32.1356 (100.0%) | 39.8661 (100.0%) | 41.3627 (100.0%) | 42.3451 (100.0%) |
No surprise that a large batch size helps, but the best TFLOPS number is still below the pure matrix-multiplication TFLOPS.
To see the reason, take A40_0 and inspect the performance of each part of a BERT layer. The hidden size is 1024, the sequence length is 128, and the batch size is 64.
h, b, s = config.hidden_size, 64, 128
X = torch.randn(b, s, h).half().cuda()
print('Dense layer TFLOPS: %.3f' % (8*b*s*h*h / 1e12 / walltime(
'layer.intermediate.dense(X)', var_dict(layer, X))))
print('Dense+Activation TFLOPS: %.3f' % (8*b*s*h*h / 1e12 / walltime(
'layer.intermediate(X)', var_dict(layer, X))))
ffn = 16*b*s*h*h / 1e12
print('FFN TFLOPS: %.3f'%(ffn / walltime(
'layer.output(layer.intermediate(X),X)', var_dict(layer, X))))
att = (4*b*h*s*s + 8*b*s*h*h) / 1e12
print('Attention TFLOPS: %.3f'%(
att / walltime('layer.attention(X)', var_dict(layer, X))))
print(att / ffn)

Dense layer TFLOPS: 95.531
Dense+Activation TFLOPS: 73.538
FFN TFLOPS: 72.570
Attention TFLOPS: 37.845
0.53125

We first benchmark the first dense layer in the feed-forward network (FFN). The TFLOPS number is quite good, 95.531, against a peak GEMM TFLOPS of 101.26.
Then run this dense layer together with the GeLU activation. Even though the activation function has negligible complexity (\(n\)), it reduces the TFLOPS, because the element-wise operation of the activation function is bounded by the memory bandwidth.
Then we test the entire FFN, including the residual block and normalization, and the TFLOPS number changes only a little.
The other part of the BERT layer is the multi-head self-attention. Even though the main computation in the attention block is still matrix multiplication, it has more memory-bound operators than the FFN, so we see a lower TFLOPS.
GPT-2
Instead of the encoder, GPT-2[3] takes the decoder part of the original transformer architecture. There are four sizes of GPT-2:
- GPT-2 small, 12 (Nx) transformer (decoder) blocks, 117 million parameters
- GPT-2 medium, 24 (Nx) transformer (decoder) blocks, 345 million parameters
- GPT-2 large, 36 (Nx) transformer (decoder) blocks, 762 million parameters
- GPT-2 XL, 48 (Nx) transformer (decoder) blocks, 1542 million parameters
GPT-2 medium is adopted in this test, with an equivalent size to BERT large. The hidden size is 1024 as well, but GPT-2 is trained with a sequence length of 1024.
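No construction code is shown in the original for this part; a minimal sketch of how a single GPT-2 medium block could be benchmarked with the same layer_benchmark function (the gpt2-medium checkpoint name is an assumption, and the sequence lengths and batch sizes mirror the tables below):

```python
from transformers import AutoConfig
from transformers.models.gpt2.modeling_gpt2 import GPT2Block

config = AutoConfig.from_pretrained('gpt2-medium')  # n_embd = 1024
layer = GPT2Block(config).half().cuda()             # a single GPT-2 decoder block in FP16
print(layer_benchmark(layer, config.n_embd, [512, 1024], [4, 16, 32, 64]).to_markdown())
```

The results are shown below: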
| A800_0 | batch=4 | batch=16 | batch=32 | batch=64 |
|---|---|---|---|---|
| fwd seq_len=512 | 58.4751 (+125.14%) | 65.4225 (+136.48%) | 68.0979 (+150.45%) | 69.3883 (+154.31%) |
| fwd+bwd seq_len=512 | 49.0019 (+86.13%) | 76.031 (+148.75%) | 78.9392 (+157.0%) | 80.8344 (+168.09%) |
| fwd seq_len=1024 | 49.5448 (+121.18%) | 52.8486 (+125.33%) | 53.5561 (+137.26%) | 53.6964 (+142.32%) |
| fwd+bwd seq_len=1024 | 57.3693 (+131.83%) | 63.1987 (+140.25%) | 64.1421 (+144.76%) | 64.4187 (+141.87%) |
| 3090 | batch=4 | batch=16 | batch=32 | batch=64 |
|---|---|---|---|---|
| fwd seq_len=512 | 25.973 (100.0%) | 27.6646 (100.0%) | 27.19 (100.0%) | 27.2853 (100.0%) |
| fwd+bwd seq_len=512 | 26.3274 (100.0%) | 30.5657 (100.0%) | 30.716 (100.0%) | 30.1517 (100.0%) |
| fwd seq_len=1024 | 22.4006 (100.0%) | 23.4543 (100.0%) | 22.573 (100.0%) | 22.1597 (100.0%) |
| fwd+bwd seq_len=1024 | 24.7468 (100.0%) | 26.305 (100.0%) | 26.2063 (100.0%) | 26.634 (100.0%) |
As we can see, despite GPT-2 and BERT having the same complexity, GPT-2 has worse TFLOPS at the same batch size and sequence length, and the performance of the 3090 nearly reaches that of the A40. Also, using the larger sequence length of 1024 further hurts the performance.
T5
T5[4] takes the full transformer architecture. The encoder and decoder performance are tested separately. The performance of the encoder is similar to BERT. The decoder has an additional cross-attention, which increases the time complexity and also hurts TFLOPS.
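Again, no construction code is shown in the original; a minimal sketch of how the T5 encoder and decoder blocks could be built for layer_benchmark (the t5-large checkpoint name and the config tweaks are assumptions):

```python
from transformers import AutoConfig
from transformers.models.t5.modeling_t5 import T5Block

config = AutoConfig.from_pretrained('t5-large')  # d_model = 1024
config.use_cache = False

config.is_decoder = False
encoder = T5Block(config, has_relative_attention_bias=False).half().cuda()

config.is_decoder = True
decoder = T5Block(config, has_relative_attention_bias=False).half().cuda()
# benchmark the decoder with cross_attention=True so encoder_hidden_states is passed in
```

The results are shown below: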
| A800_0 | batch=4 | batch=16 | batch=32 | batch=64 |
|---|---|---|---|---|
| fwd encoder | 38.8697 (+74.31%) | 57.0796 (+123.16%) | 59.3724 (+126.56%) | 61.1527 (+130.66%) |
| fwd+bwd encoder | 40.6477 (+71.46%) | 70.3433 (+139.4%) | 75.9491 (+150.57%) | 78.273 (+148.64%) |
| fwd decoder | 31.965 (+64.45%) | 46.5513 (+114.58%) | 48.3616 (+112.88%) | 49.7627 (+118.77%) |
| fwd+bwd decoder | 33.5017 (+62.1%) | 59.5139 (+128.58%) | 62.4874 (+132.72%) | 64.1493 (+134.82%) |
| 3090 | batch=4 | batch=16 | batch=32 | batch=64 |
|---|---|---|---|---|
| fwd encoder | 22.2991 (100.0%) | 25.5779 (100.0%) | 26.206 (100.0%) | 26.5121 (100.0%) |
| fwd+bwd encoder | 23.7068 (100.0%) | 29.3832 (100.0%) | 30.3106 (100.0%) | 31.4801 (100.0%) |
| fwd decoder | 19.4381 (100.0%) | 21.6942 (100.0%) | 22.7182 (100.0%) | 22.7463 (100.0%) |
| fwd+bwd decoder | 20.6679 (100.0%) | 26.0369 (100.0%) | 26.8509 (100.0%) | 27.3184 (100.0%) |
In conclusion, to achieve the best performance for a transformer layer, we need to use a fast data type and a large batch size. For further improvement, we may need to rewrite the code, for example fusing multiple kernels into a single one.
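As a generic illustration of the kernel-fusion idea (not code from this benchmark), TorchScript can fuse chains of pointwise operations into a single GPU kernel so that intermediate results do not make extra round trips through memory:

```python
import torch

@torch.jit.script
def fused_bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    # two pointwise operations (add + GeLU) that a fusing compiler can merge into one kernel
    return torch.nn.functional.gelu(x + bias)

x = torch.randn(64, 128, 1024, device='cuda', dtype=torch.float16)
bias = torch.randn(1024, device='cuda', dtype=torch.float16)
y = fused_bias_gelu(x, bias)
```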
Future
The accuracy is not checked in this benchmark.
Note that the memory of the A800 is equipped with ECC, unlike the 3090. Theoretically, the accuracy of the A800 is therefore better than that of the 3090.
A800: GPU memory 80 GB GDDR6 with error-correcting code (ECC). 3090: memory type GDDR6X.
The FLOPS of particle methods remains unknown; their benchmark is suspended.
Multi-GPU performance will be tested in the next blog.
References
https://www.viperatech.com/product/nvidia-a800-customer-deck/
NVIDIA A40 PCIe Specs | TechPowerUp GPU Database
Quadro RTX 8000 Specs | TechPowerUp GPU Database
NVIDIA GeForce RTX 3090 Specs | TechPowerUp GPU Database
NVIDIA RTX A4000 Specs | TechPowerUp GPU Database
NVIDIA A100 PCIe 80 GB Specs | TechPowerUp GPU Database
PCIe 4.0 (Ultimate Guide to Understanding PCI Express Gen ...
NVLink & NVSwitch:Advanced Multi-GPU Systems | NVIDIA
DGX A100 : Universal System for AI Infrastructure
BLAS (Basic Linear Algebra Subprograms)
Automatically Tuned Linear Algebra Software (ATLAS)
OpenBLAS : An optimized BLAS library
Get Started with Intel® oneAPI Math Kernel Library
GitHub - openai/openai-gemm: Open single and half …
cuBLAS Level-3 Function Reference
Matrix Multiplication Background User's Guide - NVIDIA
Multiply–accumulate operation (MAC), wikipedia
fusing pointwise operations, PyTorch tutorials
How fast is my model? - Machine, Think
BERT Explained | Papers With Code
GPT-2 Explained | Papers With Code
T5 Explained | Papers With Code
why cuBLASHgemm is slower more than cuBLASSgemm when I use?
Issues with P-state "--pstate p0" on 3080ti & 0.24.8.
Intel® Xeon® Gold 6234 Processor
Intel® Xeon® Gold 6330 Processor
Intel Xeon Gold GFLOPS performance
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30. ↩︎
- Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. ↩︎
- Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9. ↩︎
- Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., ... & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(140), 1-67. ↩︎
- Choi, Y. R., & Stegailov, V. (2022). Multi-GPU GEMM Algorithm Performance Analysis for Nvidia and AMD GPUs Connected by NVLink and PCIe. In International Conference on Mathematical Modeling and Supercomputer Technologies (pp. 281-292). Springer, Cham. ↩︎

