Test GPU server multi-GPU performance
Following the last blog post, a benchmark was performed on a server with 8 NVIDIA A800 GPUs. The P2P bandwidth and the multi-GPU performance on the N-body problem are tested.
1. Peer-to-peer bandwidth test
First check the GPU connection configuration.
```
$ nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity NUMA Affinity
GPU0 X NV8 NV8 NV8 NV8 NV8 NV8 NV8 0-27,56-83 0
GPU1 NV8 X NV8 NV8 NV8 NV8 NV8 NV8 0-27,56-83 0
GPU2 NV8 NV8 X NV8 NV8 NV8 NV8 NV8 0-27,56-83 0
GPU3 NV8 NV8 NV8 X NV8 NV8 NV8 NV8 0-27,56-83 0
GPU4 NV8 NV8 NV8 NV8 X NV8 NV8 NV8 28-55,84-111 1
GPU5 NV8 NV8 NV8 NV8 NV8 X NV8 NV8 28-55,84-111 1
GPU6 NV8 NV8 NV8 NV8 NV8 NV8 X NV8 28-55,84-111 1
GPU7 NV8 NV8 NV8 NV8 NV8 NV8 NV8 X 28-55,84-111 1
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
```
Except for X, the link speed decreases from NV# down to SYS.
Note that each pair of GPUs is connected by NVLink (NV8, i.e. a bonded set of 8 NVLinks).
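The P2P connectivity matrix printed by the bandwidth test below can be reproduced with a couple of CUDA runtime calls. A minimal sketch (our own file name and formatting, not the tool's source):

```cpp
// p2p_check.cu - query P2P accessibility between all visible GPU pairs.
// Build: nvcc p2p_check.cu -o p2p_check
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    printf("D\\D ");
    for (int j = 0; j < n; ++j) printf("%2d ", j);
    printf("\n");
    for (int i = 0; i < n; ++i) {
        printf("%3d ", i);
        for (int j = 0; j < n; ++j) {
            int ok = 1;  // a device can always access its own memory
            if (i != j) cudaDeviceCanAccessPeer(&ok, i, j);
            printf("%2d ", ok);
        }
        printf("\n");
    }
    return 0;
}
```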
Use the p2pBandwidthLatencyTest tool (a CUDA sample) to get an estimate of the bandwidth. P2P copies of 40,000,000 integer elements are performed between all visible GPUs, and the results are compared with the NVLink P2P connection enabled and disabled.
```
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA A800-SXM4-80GB, pciBusID: 10, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA A800-SXM4-80GB, pciBusID: 16, pciDeviceID: 0, pciDomainID:0
Device: 2, NVIDIA A800-SXM4-80GB, pciBusID: 49, pciDeviceID: 0, pciDomainID:0
Device: 3, NVIDIA A800-SXM4-80GB, pciBusID: 4d, pciDeviceID: 0, pciDomainID:0
Device: 4, NVIDIA A800-SXM4-80GB, pciBusID: 8a, pciDeviceID: 0, pciDomainID:0
Device: 5, NVIDIA A800-SXM4-80GB, pciBusID: 8f, pciDeviceID: 0, pciDomainID:0
Device: 6, NVIDIA A800-SXM4-80GB, pciBusID: c6, pciDeviceID: 0, pciDomainID:0
Device: 7, NVIDIA A800-SXM4-80GB, pciBusID: ca, pciDeviceID: 0, pciDomainID:0
P2P Connectivity Matrix
D\D 0 1 2 3 4 5 6 7
0 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1
3 1 1 1 1 1 1 1 1
4 1 1 1 1 1 1 1 1
5 1 1 1 1 1 1 1 1
6 1 1 1 1 1 1 1 1
7 1 1 1 1 1 1 1 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3 4 5 6 7
0 1519.65 15.83 20.62 20.70 20.65 20.73 20.86 20.71
1 15.86 1564.06 20.69 20.71 20.57 20.64 20.87 20.67
2 20.33 20.74 1562.66 15.70 20.59 20.65 20.81 20.65
3 20.38 20.68 15.71 1562.66 20.72 20.62 20.71 20.61
4 20.35 20.45 20.47 20.51 1564.30 15.69 21.18 20.61
5 20.50 20.53 20.55 20.58 15.71 1566.10 21.11 20.64
6 20.53 20.59 20.60 20.65 20.85 20.69 1566.02 15.66
7 20.55 20.56 20.59 20.65 20.84 20.71 15.79 1564.46
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1 2 3 4 5 6 7
0 1520.09 186.35 186.23 186.32 186.48 186.60 186.53 186.43
1 186.31 1533.74 186.54 186.57 186.46 186.63 186.49 186.56
2 186.27 186.27 1524.61 186.26 186.56 186.56 186.54 186.53
3 186.25 186.32 186.53 1562.89 186.63 186.55 186.64 186.60
4 186.30 186.39 186.38 186.45 1564.06 186.55 186.62 186.64
5 186.33 186.25 186.40 186.41 186.52 1565.16 186.61 186.60
6 186.36 186.29 186.34 186.57 186.49 186.49 1566.81 186.62
7 186.44 186.33 186.39 186.62 186.55 186.49 186.51 1564.61
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3 4 5 6 7
0 1555.85 15.83 20.85 20.86 20.84 20.73 20.89 20.73
1 15.71 1596.51 20.91 20.98 20.92 20.82 20.80 20.81
2 20.84 20.97 1594.18 15.80 20.78 20.70 20.85 20.73
3 20.88 20.81 15.81 1596.55 20.91 20.82 20.95 20.83
4 20.82 20.90 20.84 20.88 1594.23 15.81 21.14 20.99
5 20.77 20.85 20.80 20.85 15.84 1596.95 21.11 20.86
6 20.90 20.85 20.92 20.95 21.19 21.06 1596.55 15.82
7 20.85 20.82 20.89 20.90 21.02 20.94 15.81 1596.02
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3 4 5 6 7
0 1555.58 363.84 363.87 363.86 363.83 363.84 364.42 364.45
1 364.49 1597.28 363.90 363.87 364.17 364.25 364.47 364.47
2 364.37 364.37 1579.72 364.23 364.22 364.42 364.44 364.44
3 364.42 364.47 364.46 1595.41 363.77 364.39 364.47 364.42
4 364.30 364.26 364.38 364.47 1595.45 364.39 364.44 364.52
5 364.43 364.45 364.46 364.45 364.42 1596.67 364.47 364.43
6 364.50 364.45 364.43 364.51 364.45 364.39 1597.44 364.43
7 364.46 364.47 364.47 364.45 364.45 364.47 364.44 1595.53
P2P=Disabled Latency Matrix (us)
GPU 0 1 2 3 4 5 6 7
0 2.42 21.58 22.35 22.45 22.31 20.91 20.38 19.78
1 22.15 2.24 23.31 23.16 23.41 22.57 22.45 22.27
2 22.68 23.28 2.45 22.83 23.11 22.34 21.96 22.31
3 23.16 23.31 23.03 2.45 23.38 22.44 21.96 22.37
4 22.63 23.54 23.28 23.77 2.88 22.65 22.49 22.65
5 21.04 22.69 22.25 23.32 23.54 2.86 22.45 20.89
6 21.55 21.86 22.23 22.70 22.66 23.07 2.76 20.37
7 21.88 22.75 21.92 22.51 21.80 22.27 21.56 2.30
CPU 0 1 2 3 4 5 6 7
0 3.19 20.69 21.46 21.56 21.42 20.01 19.49 18.89
1 21.32 2.54 22.47 22.32 22.53 21.70 21.60 21.43
2 21.77 22.38 2.56 22.00 22.19 21.43 21.06 21.41
3 22.26 22.40 22.33 2.48 22.47 21.53 21.05 21.48
4 21.72 22.61 22.38 24.26 2.76 23.30 23.09 23.23
5 21.58 23.15 22.81 23.93 23.88 3.29 22.95 21.20
6 22.18 22.22 22.56 23.04 22.96 23.38 3.29 21.03
7 20.86 21.53 20.90 21.33 20.59 21.05 20.68 3.28
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1 2 3 4 5 6 7
0 3.12 3.22 3.24 3.23 3.24 3.22 3.24 3.24
1 3.20 2.77 3.24 3.24 3.23 3.21 3.24 3.22
2 3.23 3.18 2.98 3.22 3.23 3.24 3.19 3.20
3 3.23 3.21 3.21 2.98 3.23 3.24 3.23 3.23
4 3.25 3.28 3.24 3.25 2.79 3.27 3.28 3.27
5 3.25 3.26 3.24 3.21 3.25 2.68 3.24 3.25
6 3.10 3.12 3.11 3.11 3.11 3.13 2.67 3.11
7 3.12 3.11 3.12 3.09 3.12 3.11 3.11 2.66
CPU 0 1 2 3 4 5 6 7
0 3.20 2.20 2.26 2.25 2.25 2.25 2.26 2.26
1 2.02 2.69 1.93 1.93 1.95 1.96 1.88 1.88
2 2.02 1.98 2.69 1.98 1.99 1.99 1.92 1.93
3 1.99 1.95 1.96 2.65 1.96 2.09 1.89 1.89
4 2.23 2.12 2.13 2.12 2.80 2.13 2.05 2.03
5 2.56 2.47 2.48 2.46 2.46 3.39 2.40 2.40
6 2.49 2.43 2.44 2.43 2.44 2.46 3.25 2.38
7 2.61 2.52 2.53 2.53 2.53 2.55 2.54 3.38
```
We can see that:
- With NVLink (P2P) enabled, the bidirectional GPU-to-GPU write bandwidth reaches about 22.8% of the self (device-to-itself) bandwidth (~364 GB/s vs. ~1596 GB/s).
- With NVLink (P2P) disabled, the bidirectional GPU-to-GPU write bandwidth through PCIe drops dramatically, to roughly 1.3% of the self bandwidth (~21 GB/s).
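For reference, the core of such a measurement can be sketched with the runtime API: enable peer access on both devices, then time repeated `cudaMemcpyPeerAsync` copies with CUDA events. This is only a sketch under our own naming, not the p2pBandwidthLatencyTest source:

```cpp
// p2p_bw.cu - time unidirectional P2P copies between two GPUs.
// Build: nvcc p2p_bw.cu -o p2p_bw
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int src = 0, dst = 1;                    // an arbitrary device pair
    const size_t bytes = 40000000u * sizeof(int);  // same element count as above
    const int reps = 100;

    void *srcBuf = nullptr, *dstBuf = nullptr;
    cudaSetDevice(dst);
    cudaDeviceEnablePeerAccess(src, 0);
    cudaMalloc(&dstBuf, bytes);

    cudaSetDevice(src);
    cudaDeviceEnablePeerAccess(dst, 0);            // let src address dst's memory
    cudaMalloc(&srcBuf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        cudaMemcpyPeerAsync(dstBuf, dst, srcBuf, src, bytes, 0);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("GPU%d -> GPU%d: %.2f GB/s\n", src, dst,
           (double)bytes * reps / (ms * 1e-3) / 1e9);
    return 0;
}
```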
2. Multi-GPU nbody benchmark
An N-body simulation numerically approximates the evolution of a system of bodies in which each body continuously interacts with every other body.
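The heart of the direct method is an O(n^2) all-pairs force accumulation. Below is a minimal single-GPU sketch of that kernel (ours, not the tuned CUDA-samples implementation, which additionally tiles bodies through shared memory and splits them across GPUs):

```cpp
// nbody_naive.cu - naive all-pairs gravitational acceleration.
// Build: nvcc nbody_naive.cu -o nbody_naive
#include <cuda_runtime.h>

__global__ void bodyForce(const float4 *pos, float4 *acc, int n, float soft2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float4 pi = pos[i];
    float ax = 0.f, ay = 0.f, az = 0.f;
    for (int j = 0; j < n; ++j) {                 // every body interacts with every other
        float4 pj = pos[j];
        float dx = pj.x - pi.x, dy = pj.y - pi.y, dz = pj.z - pi.z;
        float r2 = dx * dx + dy * dy + dz * dz + soft2;  // softening avoids r = 0
        float invR = rsqrtf(r2);
        float s = pj.w * invR * invR * invR;      // m_j / r^3 (G folded into the masses)
        ax += dx * s; ay += dy * s; az += dz * s;
    }
    acc[i] = make_float4(ax, ay, az, 0.f);
}

int main() {
    const int n = 4096;
    float4 *pos, *acc;
    cudaMallocManaged(&pos, n * sizeof(float4));
    cudaMallocManaged(&acc, n * sizeof(float4));
    for (int i = 0; i < n; ++i)                   // arbitrary initial positions, unit masses
        pos[i] = make_float4(i % 97, i % 89, i % 83, 1.f);
    bodyForce<<<(n + 255) / 256, 256>>>(pos, acc, n, 1e-9f);
    cudaDeviceSynchronize();
    return 0;
}
```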
The nbody simulation benchmark was run over the following parameter ranges (an example invocation is shown after the list):
- number of particles: from 100000 to 15000000
- number of iterations: 100
- number of devices: from 1 to 8
- precision: float32 (float)
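A typical invocation of the nbody binary from the CUDA samples for one point in this grid looks like the following (the flags are the sample's; the values shown are just an example):

```
$ ./nbody -benchmark -numbodies=1000000 -numdevices=4
```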
The line charts of performance versus the number of devices are shown below, followed by a table of the average efficiency for `nbodies > 50000`. Other figures and the full result data are available in the Appendix.



| devices | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| average efficiency | 1 | 0.982481 | 0.968182 | 0.96815 | 0.965076 | 0.779486 | 0.676789 | 0.675233 |
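Here "average efficiency" is presumably the usual parallel efficiency, i.e. the speedup over the single-device runtime divided by the number of devices: `efficiency(N) = T(1) / (N * T(N))`, where `T(N)` is the runtime on `N` GPUs for the same problem size, so 1 means perfect linear scaling.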
We can see that:
- The performance sticks to the linear scaling line (dashed) well when `ndevices < 6`. The efficiency can still reach more than 99% with large `nbodies`.
- With `ndevices >= 6`, the efficiency drops badly; the runtime lines tilt up at `ndevices = 6, 7`. For example, the runtime for `nbodies = 8000000, ndevices = 6` is longer than that for `nbodies = 8000000, ndevices = 5`.
Appendix
nbody Benchmark
Other figures:


