

SLIDE 1

POLITECNICO DI MILANO

Advanced Topics on Heterogeneous System Architectures

Multiprocessors

Politecnico di Milano, Seminar Room, Bld 20, 30 November 2017. Antonio Miele, Marco Santambrogio, Politecnico di Milano.

SLIDE 2

Outline

- Multiprocessors
  - Flynn taxonomy
  - SIMD architectures
  - Vector architectures
  - MIMD architectures
- A real life example
- What's next

SLIDE 3

Supercomputers

Definitions of a supercomputer:
- The fastest machine in the world at a given task
- A device to turn a compute-bound problem into an I/O-bound problem
- Any machine costing $30M+
- Any machine designed by Seymour Cray

The CDC 6600 (Cray, 1964) is regarded as the first supercomputer.

SLIDE 4

The Cray XD1 example

The XD1 uses AMD Opteron 64-bit CPUs and incorporates Xilinx Virtex-II FPGAs.

Performance gains from the FPGAs:
- RC5 cipher breaking: 1000x faster than a 2.4 GHz P4
- Elliptic curve cryptography: 895-1300x faster than a 1 GHz P3
- Vehicular traffic simulation: 300x faster on an XC2V6000 than a 1.7 GHz Xeon; 650x faster on an XC2VP100 than a 1.7 GHz Xeon
- Smith-Waterman DNA matching: 28x faster than a 2.4 GHz Opteron

SLIDE 5

Supercomputer Applications

Typical application areas:
- Military research (nuclear weapons, cryptography)
- Scientific research
- Weather forecasting
- Oil exploration
- Industrial design (car crash simulation)

All involve huge computations on large data sets. In the '70s-'80s, Supercomputer ≡ Vector Machine.

SLIDE 6

Parallel Architectures

Definition: “A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast” (Almasi and Gottlieb, Highly Parallel Computing, 1989).

The aim is to replicate processors to add performance, rather than to design a faster processor.

Parallel architecture extends traditional computer architecture with a communication architecture:
- abstractions (HW/SW interface)
- different structures to realize the abstraction efficiently

SLIDE 7

Beyond ILP

ILP architectures (superscalar, VLIW, ...):
- support fine-grained, instruction-level parallelism;
- fail to support large-scale parallel systems.

Multiple-issue CPUs are very complex, and the returns (as far as extracting greater parallelism) are diminishing => extracting parallelism at higher levels becomes more and more attractive.

A further step: process- and thread-level parallel architectures.

To achieve ever greater performance: connect multiple microprocessors in a complex system.

SLIDE 8

Beyond ILP

Most recent microprocessor chips are multiprocessors on-chip: Intel i5, i7, IBM Power 8, Sun Niagara.

The major difficulty in exploiting parallelism in multiprocessors is suitable software => this is being (at least partially) overcome, in particular for servers and for embedded applications, which exhibit natural parallelism without the need of rewriting large software chunks.

SLIDE 9

Flynn Taxonomy (1966)

- SISD - Single Instruction Single Data: uniprocessor systems
- MISD - Multiple Instruction Single Data: no practical configuration and no commercial systems
- SIMD - Single Instruction Multiple Data: simple programming model, low overhead, flexibility, custom integrated circuits
- MIMD - Multiple Instruction Multiple Data: scalable, fault tolerant, off-the-shelf micros

SLIDE 10

Flynn


SLIDE 11

SISD

A serial (non-parallel) computer.

Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle.

Single data: only one data stream is being used as input during any one clock cycle.

Deterministic execution. This is the oldest and, even today, the most common type of computer.

SLIDE 12

SIMD

A type of parallel computer.

Single instruction: all processing units execute the same instruction at any given clock cycle.

Multiple data: each processing unit can operate on a different data element.

Best suited for specialized problems characterized by a high degree of regularity, such as graphics/image processing.
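A minimal sketch of the SIMD idea in C (illustrative, not from the slides), using GCC/Clang vector extensions; the type `v4sf` and the function name are assumptions for illustration. The single `+` below applies one instruction to four data elements at once; it assumes n is a multiple of 4.

```c
#include <string.h>

typedef float v4sf __attribute__((vector_size(16)));  /* 4 packed floats */

/* c[i] = a[i] + b[i], four elements per instruction. */
void add_arrays(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 4) {
        v4sf va, vb, vc;
        memcpy(&va, &a[i], sizeof va);   /* load 4 elements */
        memcpy(&vb, &b[i], sizeof vb);
        vc = va + vb;                    /* single instruction, 4 data */
        memcpy(&c[i], &vc, sizeof vc);   /* store 4 elements */
    }
}
```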


SLIDE 13

MISD

A single data stream is fed into multiple processing units.

Each processing unit operates on the data independently via independent instruction streams.

SLIDE 14

MIMD

Nowadays, the most common type of parallel computer.

Multiple instruction: every processor may be executing a different instruction stream.

Multiple data: every processor may be working with a different data stream.

Execution can be synchronous or asynchronous, deterministic or non-deterministic.

SLIDE 15

Which kind of multiprocessors?

Many of the early multiprocessors were SIMD. The SIMD model received great attention in the '80s; today it is applied only in very specific instances (vector processors, multimedia instructions).

MIMD has emerged as the architecture of choice for general-purpose multiprocessors.

Let's see these architectures in more detail...

SLIDE 16

SIMD - Single Instruction Multiple Data

- The same instruction is executed by multiple processors using different data streams.
- Each processor has its own data memory.
- A single instruction memory and a control processor fetch and dispatch instructions.
- Processors are typically special-purpose.
- Simple programming model.

SLIDE 17

SIMD Architecture

The central controller broadcasts instructions to multiple processing elements (PEs).

[Figure: an array controller distributes control and data through an inter-PE connection network to the PEs, each with its own memory.]

- Only requires one controller for the whole array
- Only requires storage for one copy of the program
- All computations are fully synchronized

SLIDE 18

SIMD model

Synchronized units: a single Program Counter.

Each unit has its own addressing registers, so it can use different data addresses.

Motivations for SIMD:
- The cost of the control unit is shared by all execution units
- Only one copy of the code in execution is necessary

Real life:
- SIMD machines run a mix of SISD and SIMD instructions
- A host computer executes the sequential operations
- SIMD instructions are sent to all the execution units, each of which has its own memory and registers and exploits an interconnection network to exchange data

SLIDE 19

SIMD Machines Today

Distributed-memory SIMD failed as a large-scale general-purpose computer platform:
- required huge quantities of data parallelism (>10,000 elements)
- required programmer-controlled distributed data layout

Vector supercomputers (shared-memory SIMD) are still successful in high-end supercomputing:
- reasonable efficiency on short vector lengths (10-100 elements)
- single memory space

Distributed-memory SIMD is popular for special-purpose accelerators:
- image and graphics processing

Renewed interest in Processor-in-Memory (PIM):
- memory bottlenecks => put some simple logic close to memory
- viewed as enhanced memory for a conventional system
- technology push from new merged DRAM + logic processes
- commercial examples, e.g., graphics in the Sony Playstation-2/3

SLIDE 20

Reality: Sony Playstation 2000


SLIDE 21

Playstation 2000

Sample Vector Unit:
- 2-wide VLIW
- Includes microcode memory
- High-level instructions like matrix-multiply

Emotion Engine:
- Superscalar MIPS core
- Vector coprocessor pipelines
- RAMBUS DRAM interface

SLIDE 22


Alternative Model: Vector Processing

Vector processors have high-level operations that work on linear arrays of numbers: "vectors".

SCALAR (1 operation): add r3, r1, r2 computes r3 = r1 + r2.

VECTOR (N operations): add.vv v3, v1, v2 computes v3[i] = v1[i] + v2[i] for every element up to the vector length.

SLIDE 23

Vector Supercomputers

Epitomized by the Cray-1 (1976): Scalar Unit + Vector Extensions
- Load/store architecture
- Vector registers
- Vector instructions
- Hardwired control
- Highly pipelined functional units
- Interleaved memory system
- No data caches
- No virtual memory

SLIDE 24

Properties of Vector Instructions

A single vector instruction specifies a great deal of work:
- it is equivalent to executing an entire loop
- each instruction represents tens or hundreds of operations

The fetch and decode unit bandwidth needed to keep multiple deeply pipelined FUs busy is dramatically reduced.

Vector instructions indicate that the computation of each result in the vector is independent of the computation of the results of the other elements of the vector:
- no need to check for data hazards within the vector
- hardware needs to check for data hazards only between two vector instructions, once per vector operand

SLIDE 25

Properties of Vector Instructions

- Each result is independent of the previous result => long pipeline, the compiler ensures no dependencies => high clock rate.
- Vector instructions access memory with a known pattern => highly interleaved memory to fetch the vector from a set of memory banks => memory latency is amortized over ~64 elements => no (data) caches required! (Do use an instruction cache.)
- Reduces branches and branch problems in pipelines: an entire loop is replaced by a vector instruction, so the control hazards that would arise from the loop branch are avoided.

SLIDE 26

Styles of Vector Architectures

A vector processor consists of a pipelined scalar unit (may be out-of-order or VLIW) plus a vector unit.

Memory-memory vector processors: all vector operations are memory-to-memory (the first machines, such as the CDC ones).

Vector-register processors: all vector operations are between vector registers (except load and store).
- The vector equivalent of load-store architectures
- Includes all vector machines since the late 1980s: Cray, Convex, Fujitsu, Hitachi, NEC

SLIDE 27

Components of Vector Processor

Vector registers: fixed-length banks, each holding a single vector
- at least 2 read ports and 1 write port
- typically 8-32 vector registers, each holding 64-128 64-bit elements

Vector functional units (FUs): fully pipelined, can start a new operation every clock cycle
- typically 4 to 8 FUs: FP add, FP multiply, FP reciprocal (1/X), integer add, logical, shift; there may be multiple units of the same type

A control unit to detect hazards (control hazards for the FUs, data hazards from register accesses)

Scalar operations may use either the vector functional units or a dedicated set.

SLIDE 28

Components of Vector Processor

Vector load-store units (LSUs): fully pipelined units that load or store a vector to and from memory
- Pipelining allows moving words between the vector registers and memory with a bandwidth of 1 word per clock cycle
- Also handle scalar loads and stores; there may be multiple LSUs

Scalar registers: single elements for FP scalars or addresses

A cross-bar to connect the FUs, LSUs, and registers

SLIDE 29

Vector programming model

[Figure: the vector programming model. Scalar registers r0-r15; vector registers v0-v15, each holding elements [0] to [VLRMAX-1]; a vector length register VLR. A vector arithmetic instruction such as ADDV v3, v1, v2 adds elements [0] to [VLR-1] of v1 and v2 pairwise into v3. A vector load LV v1, r1, r2 fills v1 from memory starting at base address r1 with stride r2.]
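As a behavioral sketch in plain C (an illustration of the semantics, not the hardware), LV v1, r1, r2 gathers VLR elements starting at the base address, spaced one stride apart:

```c
#include <stddef.h>

/* Behavioral model of LV v1, r1, r2 (illustrative only):
   vreg   - destination vector register, modeled as an array
   base   - base address (contents of r1)
   stride - distance between consecutive elements, in bytes (r2)
   vlr    - current value of the vector length register */
void lv(double *vreg, const char *base, ptrdiff_t stride, int vlr) {
    for (int i = 0; i < vlr; i++)
        vreg[i] = *(const double *)(base + i * stride);
}
```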

SLIDE 30

Vector Code Example

    # C code
    for (i = 0; i < 64; i++)
        C[i] = A[i] + B[i];

    # Scalar code
          LI     R4, #64
    loop: L.D    F0, 0(R1)
          L.D    F2, 0(R2)
          ADD.D  F4, F2, F0
          S.D    F4, 0(R3)
          DADDIU R1, 8
          DADDIU R2, 8
          DADDIU R3, 8
          DSUBIU R4, 1
          BNEZ   R4, loop

    # Vector code
          LI     VLR, #64
          LV     V1, R1
          LV     V2, R2
          ADDV.D V3, V1, V2
          SV     V3, R3

SLIDE 31

Vector Instruction Set Advantages

Compact:
- one short instruction encodes N operations

Expressive; it tells hardware that these N operations:
- are independent
- use the same functional unit
- access disjoint registers
- access registers in the same pattern as previous instructions
- access a contiguous block of memory (unit-stride load/store)
- access memory in a known pattern (strided load/store)

Scalable:
- can run the same code on more parallel pipelines (lanes)

SLIDE 32

Vector Arithmetic Execution

- Use a deep pipeline (=> fast clock) to execute element operations
- Control of the deep pipeline is simple because the elements in a vector are independent (=> no hazards!)

[Figure: a six-stage multiply pipeline computing V3 <- V1 * V2.]

SLIDE 33

Vector Instruction Execution

ADDV C, A, B

[Figure: execution of ADDV using one pipelined functional unit (one element completes per cycle: C[0], C[1], C[2], ...) versus four pipelined functional units (four elements per cycle: C[0]-C[3], then C[4]-C[7], ...).]

SLIDE 34

Vector Memory System

[Figure: an address generator combines a base and a stride to feed a vector register from 16 memory banks, labeled 0-F.]

Cray-1: 16 banks, 4-cycle bank busy time, 12-cycle latency.
- Bank busy time: the time before a bank is ready to accept the next request
- To avoid conflicts, the stride and the number of banks should be relatively prime
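A toy calculation (not from the slides) of which bank each access hits, mapping element addresses to banks with a simple modulo: a stride that shares a factor with the bank count hammers a single bank, while a relatively prime stride spreads accesses across all banks.

```c
#include <stdio.h>

int main(void) {
    const int nbanks = 16;
    /* Compare stride 16 (shares factor 16 with the bank count)
       against stride 17 (relatively prime to 16). */
    for (int stride = 16; stride <= 17; stride++) {
        printf("stride %2d -> banks:", stride);
        for (int i = 0; i < 8; i++)
            printf(" %2d", (i * stride) % nbanks);
        printf("\n");
    }
    return 0;
}
/* Output:
   stride 16 -> banks:  0  0  0  0  0  0  0  0   (all conflicts)
   stride 17 -> banks:  0  1  2  3  4  5  6  7   (no conflicts)  */
```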

SLIDE 35

Vector Unit Structure

[Figure: a four-lane vector unit. Each lane holds a slice of the vector registers, its own functional-unit pipelines, and its own port to the memory subsystem; lane 0 holds elements 0, 4, 8, ..., lane 1 holds elements 1, 5, 9, ..., and so on.]

SLIDE 36

T0 Vector Microprocessor (UCB/ICSI, 1995)

[Figure: die photo of the T0 with one lane highlighted. Vector register elements are striped over the lanes: lane 0 holds elements [0], [8], [16], [24]; lane 1 holds [1], [9], [17], [25]; ...; lane 7 holds [7], [15], [23], [31].]
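A toy sketch (not from the slides) of the striping rule the figure shows: with NLANES lanes, element i of a vector register lives in lane i % NLANES.

```c
#include <stdio.h>

#define NLANES 8   /* lanes in the T0 figure */
#define VLEN   32  /* elements per vector register */

int main(void) {
    for (int lane = 0; lane < NLANES; lane++) {
        printf("lane %d holds elements:", lane);
        for (int i = lane; i < VLEN; i += NLANES)
            printf(" [%d]", i);
        printf("\n");
    }
    return 0;
}
/* lane 0 holds elements: [0] [8] [16] [24]
   lane 1 holds elements: [1] [9] [17] [25]  ... */
```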

SLIDE 37

Vector Applications

Limited to scientific computing?
- Multimedia processing (compression, graphics, audio synthesis, image processing)
- Standard benchmark kernels (matrix multiply, FFT, convolution, sort)
- Lossy compression (JPEG, MPEG video and audio)
- Lossless compression (zero removal, RLE, differencing, LZW)
- Cryptography (RSA, DES/IDEA, SHA/MD5)
- Speech and handwriting recognition
- Operating systems/networking (memcpy, memset, parity, checksum)
- Databases (hash/join, data mining, image/video serving)
- Language run-time support (stdlib, garbage collection)
- ... even SPECint95

SLIDE 38

MIMD - Multiple Instruction Multiple Data

Each processor fetches its own instructions and operates on its own data.

Processors are often off-the-shelf microprocessors.

Scalable to a variable number of processor nodes.

Flexible:
- single-user machines focusing on high performance for one specific application,
- multi-programmed machines running many tasks simultaneously,
- or some combination of these functions.

Cost/performance advantages due to the use of off-the-shelf microprocessors.

Fault tolerance issues.

SLIDE 39

Why MIMD?

MIMDs are flexible: they can function as single-user machines for high performance on one application, as multiprogrammed multiprocessors running many tasks simultaneously, or as some combination of such functions.

They can be built starting from standard CPUs (as is presently the case for nearly all multiprocessors!).

SLIDE 40

MIMD

To exploit a MIMD with n processors:
- at least n threads or processes are needed to execute
- independent threads are typically identified by the programmer or created by the compiler
- the parallelism is contained in the threads: thread-level parallelism

A thread can range from a large, independent process to parallel iterations of a loop.

Important: the parallelism is identified by the software (not by the hardware, as in superscalar CPUs!)... keep this in mind, we'll use it!
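A minimal sketch of software-identified parallelism (illustrative pthreads C, not from the slides; names and sizes are made up): the programmer splits a loop's independent iterations across the threads.

```c
#include <pthread.h>

#define N        64
#define NTHREADS 4

static float A[N], B[N], C[N];

/* Each thread handles its own contiguous chunk of iterations. */
static void *worker(void *arg) {
    long t = (long)arg;                  /* thread index 0..NTHREADS-1 */
    int chunk = N / NTHREADS;
    for (int i = t * chunk; i < (t + 1) * chunk; i++)
        C[i] = A[i] + B[i];              /* iterations are independent */
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, (void *)t);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    return 0;
}
```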


SLIDE 41

MIMD Machines

Existing MIMD machines fall into 2 classes, depending on the number of processors involved, which in turn dictates the memory organization and interconnection strategy.

Centralized shared-memory architectures:
- at most a few dozen processor chips (< 100 cores)
- large caches, a single memory with multiple banks
- often called symmetric multiprocessors (SMP), and the style of architecture called Uniform Memory Access (UMA)

Distributed-memory architectures:
- to support large processor counts
- require a high-bandwidth interconnect
- disadvantage: data communication among processors

[Figures: the centralized organization, with processors P0-P3 and their caches sharing memory modules MM0-MM2 through an interconnection network, and the distributed organization, with nodes N0-N3 each containing a processor, cache, and local main memory, linked by an interconnection network.]

SLIDE 42

Key issues to design multiprocessors

- How many processors?
- How powerful are the processors?
- How do parallel processors share data?
- Where to place the physical memory?
- How do parallel processors cooperate and coordinate?
- What type of interconnection topology?
- How to program the processors?
- How to maintain cache coherency?
- How to maintain memory consistency?
- How to evaluate system performance?

SLIDE 43

The Tilera example


SLIDE 44

Create the most amazing game console


SLIDE 45

One core

A 64-bit Power Architecture core:
- Two-issue superscalar execution
- Two-way multithreaded core
- In-order execution

Cache:
- 32 KB instruction and 32 KB data Level 1 caches
- 512 KB Level 2 cache
- The size of a cache line is 128 bytes

One core to rule them all.

SLIDE 46

Cell: PS3

Cell is a heterogeneous chip multiprocessor:
- One 64-bit Power core
- 8 specialized co-processors, based on a novel single-instruction multiple-data (SIMD) architecture called the SPU (Synergistic Processor Unit)

SLIDE 47

Duck Demo SPE Usage


SLIDE 48

Xenon: XBOX360

Three symmetrical cores:
- each two-way SMT-capable and clocked at 3.2 GHz
- SIMD: VMX128 extension for each core
- 1 MB L2 cache (lockable by the GPU) running at half speed (1.6 GHz) with a 256-bit bus

SLIDE 49

Microsoft vision

Microsoft envisions a procedurally rendered game as having at least two primary components:
- Host thread: a game's host thread contains the main thread of execution for the game
- Data generation thread: where the actual procedural synthesis of object geometry takes place

These two threads could run on the same PPE, or they could run on two separate PPEs.

In addition to these two threads, the game could make use of separate threads for handling physics, artificial intelligence, player input, etc.

SLIDE 50

The Xenon architecture


SLIDE 51

From ILP to TLP: from the processor to the programmer

Keep it simple:
- Strip out hardware that is intended to optimize instruction scheduling at runtime
- Neither the Xenon nor the Cell has an instruction window
- Instructions pass through the processor in the order in which they are fetched
- Two adjacent, non-dependent instructions are executed in parallel where possible

Static execution:
- is simple to implement
- takes up much less die space than dynamic execution, since the processor doesn't need to spend a lot of transistors on the instruction window and related hardware
- the transistors that the lack of an instruction window frees up can be used to put more actual execution units on the die

Rethink how you organize the processor:
- you can't just eliminate the instruction window and replace it with more execution units

SLIDE 52

Regrouping the execution units

- No hardware is spent on an instruction window that looks for ILP at run-time
- The programmer has to structure the code stream at compile time so that it contains a high level of thread-level parallelism (TLP)
- Three separate cores, each of which individually contains a relatively small number of execution units
- The many parallel threads out of which the programmer has woven the code stream are then scheduled to run on those separate cores

This TLP strategy will work extremely well for tasks like procedural synthesis that can be parallelized at the thread level. However, it won't work as well as an old-fashioned wide execution core plus large instruction window for inherently single-threaded tasks.

In particular, three types of game-oriented tasks are likely to suffer from the lack of out-of-order processing and core width:
- Game control
- Artificial intelligence (AI)
- Physics

SLIDE 53

Procedural Synthesis in a nutshell

Procedural synthesis is about making optimal use of system bandwidth and main memory by dynamically generating lower-level geometry data from statically stored higher-level scene data.

For 3D games:
- Artists use a 3D rendering program to produce content for the game
- Each model is translated into a collection of polygons
- Each polygon is represented in the computer's memory as a collection of vertices

When the computer is rendering a scene in a game in real time:
- Models being displayed on the screen start out in main memory as stored vertex data
- That vertex data is fed from main memory into the GPU, where it is rendered into a 3D image and output to the monitor as a sequence of frames
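A toy sketch of the idea (illustrative C, not the Xbox 360 API; the type and function names are made up): a compact high-level description kept in main memory is expanded on the fly into low-level vertex data.

```c
#include <math.h>
#include <stddef.h>

typedef struct { float x, y, z; } Vertex;

/* Expand a high-level description (center, radius, vertex count)
   into count vertices on a circle; only the few parameters are
   stored, and the vertex data is generated when needed. */
void synthesize_circle(float cx, float cy, float radius,
                       Vertex *out, size_t count) {
    const float two_pi = 6.28318530f;
    for (size_t i = 0; i < count; i++) {
        float angle = two_pi * (float)i / (float)count;
        out[i].x = cx + radius * cosf(angle);
        out[i].y = cy + radius * sinf(angle);
        out[i].z = 0.0f;
    }
}
```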


SLIDE 54

Limitations

There are two problems:
- The costs of creating art assets for a 3D game are going through the roof, along with the size and complexity of the games themselves
- Console hardware's limited main memory size and limited bus bandwidth

SLIDE 55

The Xbox 360's solution

Store high-level descriptions of objects in main memory, and have the CPU procedurally generate the geometry (i.e., the vertex data) of the objects on the fly:
- Main memory stores the high-level information
- This information is passed into the Xbox 360's Xenon CPU, where the vertex data are generated by one or more running threads
- These threads then feed that vertex data directly into the GPU, by way of a special set of write buffers in the L2 cache
- The GPU then takes that vertex information and renders the objects normally, just as if it had gotten that information from main memory

SLIDE 56

The Xbox 360's solution


SLIDE 57

Questions

"RISK more than others think is safe, CARE more than others think is wise, DREAM more than others think is practical, EXPECT more than others think is possible." (cadet maxim)