CPU architecture

This is a very quick review of CPU fundamentals, as a point of comparison with GPUs when learning CUDA.

The pictures and content here are not original; the material for the whole series is mainly collected from:

NVIDIA CUDA introductory tutorial videos (NVIDIA CUDA初级教程视频)

CIS 565 2019

CMU 15-418/618 (2018)

Outline:

  • Pipelining (流水线)
  • Branch Prediction / Speculation (分支预测)
  • Superscalar (超标量)
  • Out-of-Order (OoO) Execution (乱序执行)
  • Memory Hierarchy (存储器层次)
  • Vector Operations (矢量操作)
  • Multi-Core (多核处理)

CPU Introduction

What is a CPU: a machine that processes instructions.

Examples of CPU instruction types:

  • arithmetic: add r3,r4 -> r4

  • load-store: load [r4] -> r7

  • sequencing/control: jz end

The quantity to optimize is time per instruction:

\[ \frac{\text{cycles}}{\text{instruction}} \times \frac{\text{seconds}}{\text{cycle}} \]

i.e. cycles per instruction (CPI) times the clock period (seconds per cycle).

These two parameters are not independent; there is a trade-off between them.
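For example (illustrative numbers, not from the source): a 3 GHz clock has a period of about 0.33 ns, so at a CPI of 1.5,

\[ 1.5\ \frac{\text{cycles}}{\text{instruction}} \times 0.33\ \frac{\text{ns}}{\text{cycle}} = 0.5\ \frac{\text{ns}}{\text{instruction}} \]

Deepening the pipeline shortens the clock period but tends to raise CPI through stalls and mispredictions, which is exactly the trade-off.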

CPU programs are typically:
  • lightly threaded
  • heavy on memory accesses
  • heavy on branches

with relatively few arithmetic instructions.

CPU structure

Intel Core Die Map

  • Very large memory (shared L3 cache)
  • Controller (memory controller) - also for memory access
  • Relatively few compute units (cores)
  • Instruction-level parallelism (ILP) (covered shortly)
    • Superscalar
    • Out-of-Order
  • Data-level parallelism (DLP)
    • vectors
  • Thread-level parallelism (TLP)
    • simultaneous multithreading (SMT)
    • multicore

Pipelining

Introduction

Standard 5-stage pipeline:

Fetch-->Decode-->Execute-->Memory-->Writeback

CPU classic pipeline

instruction-level parallelism (ILP)
  • √ reduces the clock period

  • × adds latency and area

  • ? dependent instructions -- handled by bypassing, stalls, and OoO

  • ? branches -- handled by branch prediction / branch predication

  • ? pipeline lengths in practice

    • Core - 14 stages, Pentium 4 (Prescott) - >20 stages, Sandy Bridge - in between

Increase ILP: Superscalar, scheduling, out-of-order

Bypassing
sub R2, R3 -> R7
add R1, R7 -> R2

With bypassing (forwarding), line 2 can execute immediately after line 1's Execute stage: the value of R7 is forwarded straight from the ALU instead of waiting for line 1's Memory and Writeback stages to finish.

Stall
load [R3] -> R7
add R1, R7 -> R2

Here line 2 cannot execute until line 1's load finishes its Memory stage, so the pipeline has to stall and wait.

Branches
jeq loop

The CPU must either predict which way the branch will go (branch prediction), or compute both paths and discard one (branch predication).

Branches

Branch Prediction

Guess what's next, based on branch history.

Modern predictors >90% accuracy

  • √ raises performance and energy efficiency

  • × adds area and potentially latency (the guess takes time)
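A small illustration of why history-based prediction works (my own sketch, not from the source): the branch below is easy to predict when the data is sorted, since the predictor then sees long runs of taken / not-taken, and hard when the data is random.

#include <stddef.h>

/* Sums only the large elements. The if is a data-dependent branch:
 * on sorted input the predictor sees long runs and guesses well;
 * on random input it mispredicts roughly half the time. */
long sum_big(const int *data, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        if (data[i] >= 128)   /* predictable only if the data has a pattern */
            s += data[i];
    return s;
}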

Branch Predication

Compute the results for both outcomes of the condition.

Replace branches with conditional (predicated) instructions; this avoids the branch predictor entirely.

  • √ no mispredictions
  • √ no area increase
  • × wasted work: instructions on the untaken path become nops.

GPUs also use predication
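A minimal sketch of predication in C (my own illustration; whether the compiler actually emits a conditional move depends on the target and optimization level):

/* Branchy version: control flow depends on the data,
 * so the branch predictor has to guess. */
int max_branch(int a, int b)
{
    if (a > b)
        return a;
    return b;
}

/* Predicated version: both operands are evaluated and one is
 * selected; compilers often lower this to a conditional move
 * (cmov on x86), leaving no branch to mispredict. */
int max_select(int a, int b)
{
    return (a > b) ? a : b;
}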

Increasing IPC

Superscalar

A scalar pipeline is limited to an IPC of at most 1, since only 1 instruction issues per clock.

Superscalar designs increase the pipeline width so that N instructions can issue per clock.

  • √ IPC ×N
  • × area increase
    • resources ×N
    • bypassing networks ×N²
    • more register and memory bandwidth

Sandy Bridge pipeline

Sandy Bridge has 6 execution ports, i.e. up to 6 instructions can issue per clock, managed by the scheduler.

Scheduling

Register Renaming/Aliasing

Check out this piece of code:

xor r1, r2 -> r3
add r3, r4 -> r4
sub r5, r2 -> r3
addi r3, 1 -> r1

Lines 1 and 2 are dependent (Read-After-Write, RAW).

Lines 3 and 4 are dependent (RAW).

Lines 1 and 3 are logically independent; they only share a name (Write-After-Write, WAW). They could execute in parallel, but since both write register r3, they cannot without some modification.

So we introduce register renaming, mapping the two uses of r3 onto different physical registers:

xor p1, p2 -> p6
add p6, p4 -> p7
sub p5, p2 -> p8
addi p8, 1 -> p9

Now the two dependence chains (lines 1-2 and lines 3-4) can execute in parallel.

Out-of-Order (OoO) Execution

"Shuffle" the instructions to avoid waiting.

From "A Journey Through the CPU Pipeline":

Micro-ops can be processed out of order as long as their data is ready, sometimes skipping over unready micro-ops for a long time while working on other micro-ops that are ready. This way a long operation does not block quick operations and the cost of pipeline stalls is greatly reduced.

OoO core illustration

Fetch -> Decode -> Rename -> Dispatch -> Issue -> Register read -> Execute -> Memory -> Writeback -> Commit

Hardware structures for OoO include:

  • Reorder Buffer (ROB): records the state of every in-flight instruction so results can commit in program order.
  • Issue Queue / Scheduler: chooses the next ready instruction to execute.

Advantages and disadvantages:

  • √ pushes IPC close to the ideal
  • × increased area
  • × increased power consumption

In-order CPUs: Intel Atom, ARM Cortex-A8, Qualcomm Scorpion

OoO CPUs: Intel Pentium Pro and onwards, ARM Cortex-A9 (Apple A5, NVIDIA Tegra 2/3), Qualcomm Krait

OoO is omitted on some mobile chips, which prioritize small area and low power consumption.

Memory Hierarchy

The bigger a memory is, the slower it is: higher latency and lower bandwidth.

Memory type          Latency       Bandwidth      Size
SRAM (L1, L2, L3)    1 - 2 ns      200 GB/s       1 - 20 MB
DRAM (memory)        70 ns         20 GB/s        1 - 20 GB
Flash (disk)         70 - 90 μs    200 MB/s       100 - 1000 GB
HDD (disk)           10 ms         1 - 150 MB/s   500 - 3000 GB

Caching

Keep the most frequently used data as close to the CPU as possible.

Cache hierarchy

Hardware-managed:
  • L1 instruction / data caches
  • L2 unified cache
  • L3 unified cache
Software-managed:
  • Main memory
  • Disk

CPU memory hierarchy
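Why caching matters in practice (my own illustrative C, not from the source): the two functions below do the same arithmetic, but the first walks memory sequentially and reuses cache lines, while the second jumps 4 KB between accesses and misses constantly in a small cache.

#define N 1024

float m[N][N];

/* Cache-friendly: C stores rows contiguously, so the inner loop
 * reads consecutive addresses and mostly hits in L1. */
float sum_row_major(void)
{
    float s = 0.0f;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

/* Cache-hostile: consecutive accesses are N * sizeof(float) = 4 KB
 * apart, so a small cache misses on almost every one. */
float sum_col_major(void)
{
    float s = 0.0f;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];
    return s;
}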

Other memory considerations

  • Banking: split memory into multiple banks so that several accesses can be served in parallel instead of conflicting on one array.
  • Coherency
  • Memory controller

Increasing Data-level parallelism (DLP)

vector

motivation

for(int i = 0; i < N; i++)
{
    A[i] = A[i] + B[i] + C[i];
}

When N is very large this needs to be faster. Since every iteration of the loop is independent, we can rewrite the code as:

for(int i = 0; i < N; i+=4)
{
    // the 4 statements are independent and can run in parallel, like 4 "threads"
    A[i]   = A[i]   + B[i]   + C[i];
    A[i+1] = A[i+1] + B[i+1] + C[i+1];
    A[i+2] = A[i+2] + B[i+2] + C[i+2];
    A[i+3] = A[i+3] + B[i+3] + C[i+3];
}

Single Instruction Multiple Data (SIMD)

  • the ALU is wide (operates on several elements at once)
  • the registers are wide (hold several elements)
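The same loop written with SSE intrinsics looks like this (a minimal sketch, assuming float arrays and N a multiple of 4):

#include <xmmintrin.h>   /* SSE intrinsics */

/* One _mm_add_ps adds 4 floats with a single instruction. */
void add_vectors(float *A, float *B, float *C, int N)
{
    for (int i = 0; i < N; i += 4) {
        __m128 a = _mm_loadu_ps(&A[i]);   /* load 4 floats */
        __m128 b = _mm_loadu_ps(&B[i]);
        __m128 c = _mm_loadu_ps(&C[i]);
        __m128 r = _mm_add_ps(_mm_add_ps(a, b), c);
        _mm_storeu_ps(&A[i], r);          /* store 4 results */
    }
}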

x86 vector sets

SSE2
  • 4-width float or integer instruction
  • Intel Pentium 4 onwards
  • AMD Athlon 64 onwards
AVX
  • 8-width float or integer instruction

  • Intel Sandy Bridge

  • AMD Bulldozer

Increasing Thread-level parallelism (TLP)

simultaneous multithreading (SMT)

A thread includes:

  • its own instruction stream

  • a private PC, registers, and stack

  • shared globals and heap

Threads can be created and destroyed dynamically.

Both the programmer and the OS can schedule them.
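A minimal POSIX-threads sketch of these properties (my own illustration, not from the source): each thread has its own stack variables, while globals and the heap are shared, so shared updates need a lock (previewing the last section):

#include <pthread.h>
#include <stdio.h>

int shared_counter = 0;                      /* global: shared by all threads */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg)
{
    int local = *(int *)arg;                 /* lives on this thread's private stack */
    pthread_mutex_lock(&lock);               /* shared data needs a lock */
    shared_counter += local;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    int args[2] = {1, 2};
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, &args[i]);   /* create threads */
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);                        /* wait for them to finish */
    printf("%d\n", shared_counter);          /* prints 3 */
    return 0;
}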

multicore

Simply replicate the whole pipeline (e.g., Sandy Bridge-E: 6 cores).

  • √ full cores, sharing nothing but the last-level (L3) cache
  • √ keeps Moore's Law delivering performance
  • × efficiency problems of multithreaded programs (see below)

Locks, Coherence and Consistency

Multiple threads read and write the same data -- add locks.

Whose copy of the data is correct -- coherence.

In what order do memory operations become visible -- consistency.

