SLIDE 1

Multi-Processors and GPU

Philipp Koehn 2 May 2018

Philipp Koehn Computer Systems Fundamentals: Multi-Processors and GPU 2 May 2018

SLIDE 2

Predicted CPU Clock Speed

Clock speed: 740 kHz in 1971, 45 GHz predicted for 2018
Source: Kurzweil, "The Singularity is Near" (2005)

SLIDE 3

Actual CPU Clock Speed

Clock speed 2018: 3 GHz

SLIDE 4

Why?

Intel estimate, around 2000: CPU power consumption of 400 kW by 2018?

SLIDE 5

Moore’s Law

Number of transistors per chip is still growing exponentially

SLIDE 6

What to do with the Transistors?

  • More parallelism → faster execution of instructions
  • More processors on a chip

SLIDE 7

multi-processors

SLIDE 8

Intel Core i7: Quad-Core

SLIDE 9

Intel Xeon Phi: 72 cores (2017)

SLIDE 10

Handling Multiple Processes

  • Kernel can keep multiple processes running
  • Each process is assigned to a core

      – each core has a local cache
      – all cores share a common cache, common memory

  • Synchronization between cores not trivial (e.g., cache coherence)

SLIDE 11

More Parallelism

  • Multiple processes are not always the best way to parallelize
  • Often, parallel execution within a process would be helpful
  • Example: matrix multiplication

      – loops over different parts of the data
      – instructions highly independent → can be executed in parallel
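
The independence of the loop iterations can be seen in a minimal C++ sketch. It assumes square matrices stored in row-major order in a std::vector; the function name matmul and this layout are illustrative choices, not taken from the slides. Each (i, j) iteration reads only row i of a and column j of b and writes only one element of c, so the iterations do not depend on each other.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Multiply two n x n matrices stored in row-major order.
// Hypothetical helper for illustration; not from the slides.
std::vector<float> matmul(const std::vector<float>& a,
                          const std::vector<float>& b,
                          std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    std::vector<float> c(n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            // This iteration writes only c[i * n + j] and reads nothing
            // that another (i, j) iteration writes.
            float sum = 0.0f;
            for (std::size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
    return c;
}
```

Because no iteration of the two outer loops touches another iteration's output, they could be distributed over threads or GPU cores without any locking.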

SLIDE 12

Multi-Threading

  • Parallel execution within process
  • No switching of process context (e.g., virtual address space)
  • Supported by various libraries

      – pthread in C
      – thread in C++11
      – threading in Python

  • Programmer has to take care of conflicts
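
A minimal sketch of multi-threading with the C++11 thread library, including one conflict the programmer has to handle. The function parallel_sum and its parameters are hypothetical names chosen for this illustration. Each thread sums its own slice into a private variable, and only the final update of the shared total is protected by a mutex.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Sum a vector using several threads (illustrative helper, not from the slides).
long parallel_sum(const std::vector<int>& data, std::size_t num_threads) {
    assert(num_threads > 0);
    long total = 0;          // shared by all threads
    std::mutex total_mutex;  // guards updates to total
    std::vector<std::thread> workers;
    // Slice length, rounded up so that all elements are covered.
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;

    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = std::min(t * chunk, data.size());
            const std::size_t end = std::min(begin + chunk, data.size());
            long local = 0;  // private to this thread, so no conflict
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            // Unsynchronized concurrent updates of total would be a data race.
            std::lock_guard<std::mutex> lock(total_mutex);
            total += local;
        });
    }
    for (std::thread& w : workers) w.join();  // wait for all threads to finish
    return total;
}
```

All threads share the address space of the process, which is why data and total are visible to every thread without copying.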

SLIDE 13

computer graphics

SLIDE 14

Computer Graphics

SLIDE 15

Computer Graphics

SLIDE 16

tl;dr

  • Given

      – 3D models of objects
      – lighting, textures
      – ray tracing

  • Lots of vector and matrix operations
  • Color value for each pixel on the screen has to be computed
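
To give a feel for the vector operations involved, here is a small C++ sketch that computes the brightness of one pixel. The slides do not name a shading model; the Lambertian (diffuse) term used here is a standard technique swapped in for illustration, and Vec3, dot, normalize and diffuse are hypothetical names.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>

using Vec3 = std::array<float, 3>;

// Dot product of two 3D vectors.
float dot(const Vec3& a, const Vec3& b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

// Scale a non-zero vector to length 1.
Vec3 normalize(const Vec3& v) {
    const float len = std::sqrt(dot(v, v));
    assert(len > 0.0f);
    return {v[0] / len, v[1] / len, v[2] / len};
}

// Diffuse brightness in [0, 1] for one pixel: the cosine of the angle
// between the surface normal and the direction to the light, clamped at 0
// for surfaces facing away from the light. (Assumed model, not from the slides.)
float diffuse(const Vec3& normal, const Vec3& to_light) {
    return std::max(0.0f, dot(normalize(normal), normalize(to_light)));
}
```

A renderer would evaluate something like this once per pixel, with each pixel independent of the others, which is the kind of workload the following slides map onto graphics hardware.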

SLIDE 17

High Demand

  • Computer games on regular PCs
  • Game consoles

      – Atari (1972-1996)
      – Nintendo/Wii (since 1977)
      – PlayStation (since 1994)
      – Xbox (since 2001)

  • 100s of millions sold

SLIDE 18

history

SLIDE 19

VGA Controller

SLIDE 20

GPU

SLIDE 21

Co-Processor

  • CPU handles the bulk of the complexity
  • GPU focuses on specific problems

SLIDE 22

Graphics Pipeline

Initially: dedicated hardware for core steps

SLIDE 23

Unified GPU Architecture

SLIDE 24

gpu

SLIDE 25

Streaming Multiprocessor (SM)

  • Fetches instruction (I-Cache)
  • Has to apply it over a vector of data
  • Each vector element is processed in one thread (MT Issue)

  • Thread is handled by scalar processor (SP)
  • Special function units (SFU)

SLIDE 26

Taxonomy

  • SISD (single instruction, single data)

      – uni-processors (6502, Intel until 1990s)

  • MIMD (multiple instruction, multiple data)

      – Intel Core i7
      – multiple cores on a chip
      – each core runs instructions that operate on its own data

  • SIMD (single instruction, multiple data)

      – Streaming Multi-Processors
      – multiple cores on a chip
      – same instruction executed on different data

SLIDE 27

GPU Architecture

SLIDE 28

Graphics Programming

  • Libraries that support all steps of graphics pipeline
  • Open standard: OpenGL
  • Microsoft: Direct3D

  • Libraries handle mapping to GPU hardware

SLIDE 29

Direct3D Pipeline

SLIDE 30

more uses for gpus

SLIDE 31

Deep Learning

SLIDE 32

Deep Learning

  • The latest machine learning hype
  • Computationally

      – lots of matrix multiplications
      – lots of vector operations
      – massive data sets

  • Just what GPUs are good at

SLIDE 33

CUDA

  • Extension of C++ to support general GPU programming
  • Fairly low-level

      – identify parts of program to be handled by GPU
      – define function to be executed by a thread
      – define how many threads are used

  • Key concepts

      – kernel = function to be executed by a thread
      – thread block = set of threads to be executed in parallel
      – thread grid = set of thread blocks
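
How threads, blocks and the grid relate can be sketched in plain C++ with a little index arithmetic. LaunchShape, launch_shape and global_index are hypothetical names for this illustration; the formula in global_index mirrors the usual one-dimensional CUDA idiom built from the built-in variables blockIdx, blockDim and threadIdx.

```cpp
#include <cassert>

// Shape of a one-dimensional launch (illustrative type, not a CUDA type).
struct LaunchShape {
    int blocks;             // number of thread blocks in the grid
    int threads_per_block;  // number of threads in each block
};

// Smallest grid that provides at least n threads: n / threads_per_block, rounded up.
LaunchShape launch_shape(int n, int threads_per_block) {
    assert(n >= 0 && threads_per_block > 0);
    return {(n + threads_per_block - 1) / threads_per_block, threads_per_block};
}

// Position of a thread within the whole grid, mirroring
// blockIdx.x * blockDim.x + threadIdx.x in a CUDA kernel.
int global_index(int block_idx, int block_dim, int thread_idx) {
    return block_idx * block_dim + thread_idx;
}
```

For 1000 elements and 256 threads per block this gives 4 blocks, that is 1024 threads, so the last 24 threads have no element to process and a kernel has to skip them.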

SLIDE 34

Example

  • Serial loop

      void example(int n, float alpha, float *x, float *y) {
        for (int i = 0; i < n; i++)      // one iteration per element
          y[i] = alpha * x[i] + y[i];
      }
      example(n, 2.0, x, y);

  • Parallel with CUDA

      #define THREADS 256
      // kernel: each thread handles one element
      __global__ void cuda_example(int n, float alpha, float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = alpha * x[i] + y[i];     // last block may overshoot n
      }
      int nblocks = (n + THREADS - 1) / THREADS;   // n / THREADS, rounded up
      cuda_example<<< nblocks, THREADS >>>(n, 2.0, x, y);   // x, y must be in GPU memory
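
The parallel version can be checked on a CPU with a small emulation. This is only a sketch under stated assumptions: saxpy_thread and emulated_launch are hypothetical names, the thread coordinates are passed as ordinary arguments instead of CUDA built-ins, and the "threads" run one after another in nested loops rather than in parallel.

```cpp
#include <cassert>
#include <vector>

// Work of a single thread, with its coordinates passed explicitly
// (on a GPU they would come from blockIdx, blockDim and threadIdx).
void saxpy_thread(int block_idx, int block_dim, int thread_idx,
                  int n, float alpha, const float* x, float* y) {
    const int i = block_idx * block_dim + thread_idx;
    if (i < n) y[i] = alpha * x[i] + y[i];  // threads past the end do nothing
}

// CPU stand-in for a kernel launch: visits every (block, thread) pair in turn.
void emulated_launch(int threads, float alpha,
                     const std::vector<float>& x, std::vector<float>& y) {
    assert(threads > 0 && x.size() == y.size());
    const int n = static_cast<int>(x.size());
    const int nblocks = (n + threads - 1) / threads;  // rounded up
    for (int b = 0; b < nblocks; ++b)
        for (int t = 0; t < threads; ++t)
            saxpy_thread(b, threads, t, n, alpha, x.data(), y.data());
}
```

With 5 elements and 4 threads per block, two blocks are launched and the last three threads fail the i < n test, which is why the guard is needed.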

SLIDE 35

Memory Levels

SLIDE 36

multiprocessor architecture

SLIDE 37

Nvidia Titan V

  • 80 streaming multiprocessors, 5120 cores, 640 tensor cores
  • Clock speed 1455 MHz
  • Memory size 12 GB, bandwidth 650 GB/sec
  • Retail price $2999
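
A back-of-the-envelope calculation shows what these figures imply. The constants for SMs, cores, clock and bandwidth are taken from the slide; the factor of 2 floating-point operations per core per cycle (one fused multiply-add) is an assumption that is not stated there, and all names are hypothetical.

```cpp
#include <cassert>

// Figures from the slide.
constexpr int kStreamingMultiprocessors = 80;
constexpr int kCores = 5120;
constexpr double kClockHz = 1455e6;               // 1455 MHz
constexpr double kBandwidthBytesPerSec = 650e9;   // 650 GB/sec

// Assumption (not on the slide): each core finishes one multiply-add
// per cycle, counted as 2 floating-point operations.
constexpr int kFlopsPerCorePerCycle = 2;

constexpr int cores_per_sm() { return kCores / kStreamingMultiprocessors; }

// Theoretical peak in floating-point operations per second.
constexpr double peak_flops() {
    return kCores * kClockHz * kFlopsPerCorePerCycle;
}

// Operations the chip could perform per byte it can fetch from memory.
constexpr double flops_per_byte() {
    return peak_flops() / kBandwidthBytesPerSec;
}
```

Under that assumption the card has 64 cores per streaming multiprocessor, a peak of roughly 14.9 TFLOPS, and about 23 operations available per byte of memory traffic, so code that does much less arithmetic per byte is likely limited by memory bandwidth rather than by the cores.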

SLIDE 38

Multithreaded Multiprocessor

SLIDE 39

Single Instruction, Multiple Thread

  • Each scalar processor

      – executes same instruction
      – on different data
      – has own register file

  • Branch synchronization

      – if threads diverge on conditional branches → execute different paths separately

  • Shared memory
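
Executing diverging paths separately can be imitated on a CPU. The sketch below is an emulation under simplifying assumptions, not how the hardware is implemented: the lanes of one thread group are elements of a vector, a boolean mask records which lanes take the "then" path, and each path is run as its own pass over the active lanes. The function name run_diverging_branch is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Emulates a group of threads executing, one lane per element:
//     if (x < 0) x = -x; else x = 2 * x;
// Returns the number of passes needed: 1 if all lanes take the same path,
// 2 if they diverge (0 for an empty group).
int run_diverging_branch(std::vector<int>& x) {
    std::vector<bool> take_then(x.size());
    bool any_then = false;
    bool any_else = false;
    for (std::size_t lane = 0; lane < x.size(); ++lane) {
        take_then[lane] = x[lane] < 0;  // evaluate the condition per lane
        if (take_then[lane]) any_then = true; else any_else = true;
    }

    int passes = 0;
    if (any_then) {  // pass over the "then" path: other lanes sit idle
        for (std::size_t lane = 0; lane < x.size(); ++lane)
            if (take_then[lane]) x[lane] = -x[lane];
        ++passes;
    }
    if (any_else) {  // pass over the "else" path: other lanes sit idle
        for (std::size_t lane = 0; lane < x.size(); ++lane)
            if (!take_then[lane]) x[lane] = 2 * x[lane];
        ++passes;
    }
    return passes;
}
```

The return value makes the cost visible: a group whose lanes agree needs one pass, while a group that diverges needs two, with part of the lanes idle in each.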

SLIDE 40

instructions

SLIDE 41

Basics

  • Design more similar to MIPS than x86
  • Various data types, each in different sizes

      – untyped bit arrays (8, 16, 32, 64 bits)
      – unsigned integers (8, 16, 32, 64 bits)
      – signed integers (8, 16, 32, 64 bits)
      – floating point (16, 32, 64 bits)

SLIDE 42

Basic Instructions

  • Arithmetic instructions operate on registers

      – add d, a, b → d = a+b
      – mul d, a, b → d = a*b
      – mad d, a, b, c → d = a*b+c
      – mov d, a → d = a

  • Special functions handled by SFU processors

      – square root (sqrt)
      – sine (sin)
      – cosine (cos)
      – binary logarithm (lg2)
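
The semantics of these instructions can be mirrored with ordinary C++ functions. The functions below are stand-ins named after the mnemonics on the slide, not GPU instructions, and horner is a hypothetical example added to show why a combined multiply-add is convenient: evaluating a polynomial takes one mad per coefficient.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// C++ stand-ins for the instruction semantics listed above.
float mad(float a, float b, float c) { return a * b + c; }  // d = a*b+c
float lg2(float a) { return std::log2(a); }                 // binary logarithm

// Evaluate c[0] + c[1]*x + c[2]*x^2 + ... with Horner's rule,
// using one multiply-add per coefficient (illustrative example).
float horner(const std::vector<float>& c, float x) {
    float result = 0.0f;
    for (std::size_t k = c.size(); k > 0; --k)
        result = mad(result, x, c[k - 1]);
    return result;
}
```

For the coefficients 1, 2, 3 and x = 2 the loop computes ((0*2 + 3)*2 + 2)*2 + 1 = 17, which equals 1 + 2*2 + 3*4.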

SLIDE 43

Memory Access

  • Different memory spaces (global, shared, local, const)
  • Different data sizes (8, 16, 32, 64 bits)
  • Load (ld) and store (st)
  • Atomic memory read, write, add, min, max, and, ...
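
What an atomic add or max guarantees can be tried out on a CPU with std::atomic from the C++ standard library. This is an analogy, not GPU code: Stats, atomic_max and atomic_stats are hypothetical names, std::atomic provides fetch_add directly, and the maximum is built here from a compare-exchange loop.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <limits>
#include <thread>
#include <vector>

struct Stats {
    int sum;
    int max;
};

// Raise target to value if value is larger, as one indivisible update.
void atomic_max(std::atomic<int>& target, int value) {
    int current = target.load();
    // On failure compare_exchange_weak stores the latest value in current,
    // so the loop re-checks against what another thread may have written.
    while (current < value && !target.compare_exchange_weak(current, value)) {
    }
}

// Several threads update the same two memory locations without locks.
Stats atomic_stats(const std::vector<int>& values, int num_threads) {
    assert(num_threads > 0);
    std::atomic<int> sum{0};
    std::atomic<int> max{std::numeric_limits<int>::min()};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            // Thread t handles elements t, t + num_threads, t + 2*num_threads, ...
            for (std::size_t i = static_cast<std::size_t>(t); i < values.size();
                 i += static_cast<std::size_t>(num_threads)) {
                sum.fetch_add(values[i]);    // atomic add
                atomic_max(max, values[i]);  // atomic max
            }
        });
    }
    for (std::thread& w : workers) w.join();
    return {sum.load(), max.load()};
}
```

Without atomic updates, two threads could read the same old value of sum and one of the additions would be lost.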

SLIDE 44

Control Flow

  • Branch (conditional on register value = 0)
  • Subroutine call: call, ret
  • Synchronization: bar.sync forces all threads in a thread block to synchronize
  • Terminate thread: exit

SLIDE 45

memory

SLIDE 46

Overview

  • Memory has to be very fast
  • Graphics card has several DRAM chips outside the GPU (fast access, high bandwidth, lots of pins)
  • Cache on chip: L2 cache associated with each DRAM chip

  • Virtual memory addresses handled by memory management unit (MMU)

SLIDE 47

Levels

  • Global: external DRAM (not on chip)
  • Shared: per streaming multiprocessor
  • Local: in DRAM, but cached on chip
  • Constant: read-only, in DRAM

SLIDE 48

Graphics-Related Optimizations

  • Texture memory for read-only texture maps
  • There are also special instructions to deal with textures
