POLITECNICO DI MILANO
Advanced Topics on Heterogeneous System Architectures
Multiprocessors
Politecnico di Milano, Seminar Room, Bld 20, 30 November 2017
Antonio Miele, Marco Santambrogio, Politecnico di Milano
Outline
- Multiprocessors
  - Flynn taxonomy
  - SIMD architectures
  - Vector architectures
  - MIMD architectures
- A real life example
- What's next
Definitions of a supercomputer:
- The fastest machine in the world at a given task
- A device for turning a compute-bound problem into an I/O-bound problem
- Any machine costing $30M+
- Any machine designed by Seymour Cray
The Cray XD1 uses AMD Opteron 64-bit CPUs and incorporates Xilinx Virtex-II FPGAs.
Performance gains from the FPGAs:
- RC5 cipher breaking: 1000x faster than a 2.4 GHz P4
- Elliptic curve cryptography: 895-1300x faster than a 1 GHz P3
- Vehicular traffic simulation: 300x faster on an XC2V6000 and 650x faster on an XC2VP100 than a 1.7 GHz Xeon
- Smith-Waterman DNA matching: 28x faster than a 2.4 GHz Opteron
Definition: “A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.” (Almasi and Gottlieb, Highly Parallel Computing, 1989)
The aim is to replicate processors to add performance.
Parallel architecture extends traditional computer architecture with a communication architecture: abstractions (the HW/SW interface) and different structures to realize the abstraction efficiently.
ILP architectures (superscalar, VLIW...):
- support fine-grained, instruction-level parallelism;
- fail to support large-scale parallel systems.
Multiple-issue CPUs are very complex, and the returns (as additional performance) diminish as issue width grows.
A further step: process- and thread-level parallelism. To achieve ever greater performance, connect multiple microprocessors together in a single system.
Most recent microprocessor chips are multiprocessors (multicores). The major difficulty in exploiting parallelism in multiprocessors is the software: applications must be written to expose thread-level parallelism.
Flynn taxonomy:
- SISD (Single Instruction, Single Data): uniprocessor systems
- MISD (Multiple Instruction, Single Data): no practical configuration and no commercial systems
- SIMD (Single Instruction, Multiple Data): simple programming model, low overhead, flexibility, custom integrated circuits
- MIMD (Multiple Instruction, Multiple Data): scalable, fault tolerant, off-the-shelf micros
SISD: a serial (non-parallel) computer.
- Single instruction: only one instruction stream is acted on by the CPU during any one clock cycle.
- Single data: only one data stream is used as input during any one clock cycle.
- Deterministic execution.
This is the oldest and, even today, the most common type of computer.
SIMD: a type of parallel computer.
- Single instruction: all processing units execute the same instruction at any given clock cycle.
- Multiple data: each processing unit can operate on a different data element.
Best suited for specialized problems characterized by a high degree of regularity, such as graphics and image processing.
MISD: a single data stream is fed into multiple processing units. Each processing unit operates on the data independently, via its own instruction stream.
MIMD: nowadays, the most common type of parallel computer.
- Multiple instruction: every processor may be executing a different instruction stream.
- Multiple data: every processor may be working with a different data stream.
Execution can be synchronous or asynchronous, deterministic or non-deterministic.
Many of the early multiprocessors were SIMD, but MIMD has emerged as the architecture of choice for general-purpose multiprocessors. Let's look at these architectures in more detail.
SIMD architectures: the same instruction is executed by multiple processors using different data streams.
- Each processor has its own data memory.
- A single instruction memory and control processor fetches and dispatches instructions.
- Processors are typically special-purpose.
- Simple programming model.
A central controller broadcasts instructions to multiple processing elements (PEs).
[Diagram: an array controller distributes control and data over an inter-PE connection network to the PEs, each with its own local memory.]
- Only requires one controller for the whole array.
- Only requires storage for one copy of the program.
- All computations are fully synchronized.
The units are synchronized by a single program counter, but each unit has its own addressing registers and can therefore use different data addresses.
Motivations for SIMD:
- the cost of the control unit is shared by all execution units;
- only one copy of the code in execution is necessary.
In real life, SIMD machines run a mix of SISD and SIMD instructions: a host computer executes the sequential operations, while SIMD instructions are broadcast to all the execution units, each of which has its own memory and registers and exploits an interconnection network to exchange data. A modern analogue is sketched below.
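The same one-instruction-many-elements idea survives in the SIMD extensions of today's CPUs. As a minimal sketch (assuming x86 SSE intrinsics, which the slides do not cover), a single _mm_add_ps instruction performs four float additions at once, while the surrounding scalar loop plays the role of the host code:

    #include <xmmintrin.h>  /* x86 SSE intrinsics */

    /* Element-wise add of two float arrays; n is assumed to be a
       multiple of 4 to keep the sketch short. */
    void simd_add(const float *a, const float *b, float *c, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);  /* load 4 elements        */
            __m128 vb = _mm_loadu_ps(&b[i]);
            __m128 vc = _mm_add_ps(va, vb);   /* ONE instruction, 4 adds */
            _mm_storeu_ps(&c[i], vc);         /* store 4 results         */
        }
    }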
Distributed-memory SIMD failed as a large-scale general-purpose computer platform:
- it required huge quantities of data parallelism (>10,000 elements);
- it required programmer-controlled distributed data layout.
Vector supercomputers (shared-memory SIMD) are still successful in high-end supercomputing:
- reasonable efficiency on short vector lengths (10-100 elements);
- single memory space.
Distributed-memory SIMD remains popular for special-purpose accelerators, e.g., image and graphics processing.
Renewed interest for Processor-in-Memory (PIM):
- memory bottlenecks => put some simple logic close to memory;
- viewed as enhanced memory for a conventional system;
- technology push from new merged DRAM + logic processes;
- commercial examples, e.g., graphics in the Sony PlayStation 2/3.
Example: the Sony PlayStation 2 Emotion Engine.
- Superscalar MIPS core
- Vector coprocessor pipelines
- RAMBUS DRAM interface
Sample vector unit:
- 2-wide VLIW
- includes microcode memory
- high-level instructions like matrix-multiply
Vector processors have high-level operations that work on linear arrays of numbers (vectors).
Scalar (one operation):

    add r3, r1, r2        # r3 = r1 + r2

Vector (N operations, where N is the vector length):

    add.vv v3, v1, v2     # v3[i] = v1[i] + v2[i] for every element i
Typical vector-machine ingredients:
- load/store architecture
- vector registers
- vector instructions
- hardwired control
- highly pipelined functional units
- interleaved memory system
- no data caches
- no virtual memory
A single vector instruction specifies a great deal of work: it is equivalent to executing an entire loop, and each instruction represents tens or hundreds of operations. The fetch and decode bandwidth needed to keep multiple deeply pipelined FUs busy is therefore dramatically reduced.
Vector instructions indicate that the computation of each result is independent of the computation of the other results in the same vector, so there is no need to check for data hazards within a vector instruction: hardware needs to check for data hazards only between vector instructions.
- Each result is independent of previous results => long pipelines, with the compiler ensuring there are no dependencies => high clock rate.
- Vector instructions access memory with a known pattern => highly interleaved memory fetches the vector from a set of memory banks => memory latency is amortized over, e.g., 64 elements => no data caches required (an instruction cache is still used).
- Branches and branch problems in pipelines are reduced: an entire loop is replaced by a vector instruction, so the control hazards that would arise from the loop branch are avoided (the loops below illustrate both cases).
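These properties are exactly what must hold in the source loop. An illustrative pair of C loops (not from the slides) shows the difference:

    /* Vectorizable: every iteration is independent and memory is
       accessed with a known, unit stride. */
    void vec_ok(const float *a, const float *b, float *c, int n)
    {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    /* Not expressible as one vector instruction: the loop-carried
       dependence means element i needs the freshly computed
       element i - 1. */
    void vec_bad(float *a, int n)
    {
        for (int i = 1; i < n; i++)
            a[i] = a[i] + a[i - 1];
    }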
A vector processor consists of a pipelined scalar unit plus a vector unit. Two basic styles exist:
- memory-memory vector processors: all vector operations go memory to memory;
- vector-register processors: all vector operations are between vector registers (except load and store). This is the vector equivalent of a load-store architecture, and it includes all vector machines since the late 1980s: Cray, Convex, Fujitsu, Hitachi, NEC.
Vector registers: each is a fixed-length bank holding a single vector; it has at least 2 read ports and 1 write port; there are typically 8-32 vector registers, each holding 64-128 64-bit elements.
Vector functional units (FUs): fully pipelined, able to start a new operation every clock cycle; typically 4 to 8 FUs: FP add, FP multiply, FP reciprocal (1/X), integer add, logical, shift; there may be multiple units of the same type.
A control unit detects hazards (structural conflicts for the FUs and data hazards on register accesses). Scalar operations may use either the vector functional units or a dedicated set.
Vector load-store units (LSUs): fully pipelined units that load a vector from or store a vector to memory. Pipelining allows moving words between the vector registers and memory with a bandwidth of 1 word per clock cycle. They also handle scalar loads and stores, and there may be multiple LSUs.
Scalar registers: single elements holding FP scalars or addresses. A cross-bar connects the FUs, LSUs, and registers.
[Diagram: the vector programming model. Scalar registers r0-r15; vector registers v0-v15, each holding elements [0] to [VLRMAX-1]; a vector length register VLR. A vector arithmetic instruction such as ADDV v3, v1, v2 adds the element pairs [0] to [VLR-1]; a vector load such as LV v1, r1, r2 fills a vector register from memory using base address r1 and stride r2.]
    # C code
    for (i = 0; i < 64; i++)
        C[i] = A[i] + B[i];

    # Scalar code
            LI      R4, #64
    loop:   L.D     F0, 0(R1)
            L.D     F2, 0(R2)
            ADD.D   F4, F2, F0
            S.D     F4, 0(R3)
            DADDIU  R1, 8
            DADDIU  R2, 8
            DADDIU  R3, 8
            DSUBIU  R4, 1
            BNEZ    R4, loop

    # Vector code
            LI      VLR, #64
            LV      V1, R1
            LV      V2, R2
            ADDV.D  V3, V1, V2
            SV      V3, R3
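Counting dynamic instructions makes the point: the scalar version executes 9 instructions per element, nearly 580 in total for 64 elements, while the vector version covers the same loop with just 5 instructions.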
Vector instructions are:
- Compact: one short instruction encodes N operations.
- Expressive: they tell the hardware that these N operations are independent, use the same functional unit, access disjoint registers, access registers in the same pattern as previous instructions, and access either a contiguous block of memory (unit-stride load/store) or memory in a known pattern (strided load/store); see the sketch after this list.
- Scalable: the same code can run on more parallel pipelines (lanes).
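To make unit-stride versus strided concrete, here is an illustrative C fragment (not from the slides): traversing a row of a row-major matrix is unit-stride, while traversing a column touches memory with a fixed stride of one row.

    #define N 256

    /* Unit-stride: consecutive addresses, one element after another. */
    float sum_row(float m[N][N], int r)
    {
        float s = 0.0f;
        for (int j = 0; j < N; j++)
            s += m[r][j];          /* addresses advance by sizeof(float) */
        return s;
    }

    /* Strided: a fixed step of N elements between accesses -- still a
       known pattern, expressible as a strided vector load. */
    float sum_col(float m[N][N], int c)
    {
        float s = 0.0f;
        for (int i = 0; i < N; i++)
            s += m[i][c];          /* stride of N elements */
        return s;
    }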
[Diagram: a six-stage multiply pipeline computing V3 <- V1 * V2; a new pair of vector elements enters the pipeline every clock cycle.]
[Diagram: execution of ADDV C, A, B with one pipelined functional unit (one result per cycle: C[0], C[1], C[2], ...) versus four pipelined functional units (four results per cycle: C[0]-C[3], then C[4]-C[7], and so on).]
[Diagram: a vector memory system. An address generator takes a base and a stride and spreads the element accesses across a set of memory banks.]
Cray-1: 16 banks, 4-cycle bank busy time, 12-cycle latency. The bank busy time is the time before a bank is ready to accept the next request. To avoid conflicts, the stride and the number of banks should be relatively prime, as the sketch below demonstrates.
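A tiny C model (illustrative, not from the slides) shows why relative primality matters with low-order interleaving: with 16 banks, a stride of 16 hammers a single bank, while a stride of 17 rotates through all of them.

    #include <stdio.h>

    #define NBANKS 16                /* as in the Cray-1 example */

    /* Low-order interleaving: word address modulo the bank count. */
    static int bank_of(long word_addr) { return (int)(word_addr % NBANKS); }

    int main(void)
    {
        for (int i = 0; i < 4; i++)  /* stride 16: every access -> bank 0 */
            printf("stride 16, element %d -> bank %d\n", i, bank_of(16L * i));
        for (int i = 0; i < 4; i++)  /* stride 17: banks 0, 1, 2, 3, ... */
            printf("stride 17, element %d -> bank %d\n", i, bank_of(17L * i));
        return 0;
    }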
[Diagram: a vector unit organized as four lanes. Each lane contains one slice of the vector register file, its own functional-unit pipelines, and its own port into the memory subsystem. The vector register elements are striped across the lanes: lane 0 holds elements 0, 4, 8, ...; lane 1 holds 1, 5, 9, ...; lane 2 holds 2, 6, 10, ...; lane 3 holds 3, 7, 11, ...]
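In code terms the striping is just a modulo mapping; a one-line C helper (illustrative) captures it, and it implies that an element-wise vector operation never moves data between lanes, since element i of every operand lives in the same lane:

    #define NLANES 4

    /* Element i of every vector register resides in lane i % NLANES,
       so lane k can compute all results k, k+4, k+8, ... locally. */
    static inline int lane_of(int element) { return element % NLANES; }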
Applications with vector potential:
- multimedia processing (compression, graphics, audio synthesis, image processing)
- standard benchmark kernels (matrix multiply, FFT, convolution, sort)
- lossy compression (JPEG, MPEG video and audio)
- lossless compression (zero removal, RLE, differencing, LZW)
- cryptography (RSA, DES/IDEA, SHA/MD5)
- speech and handwriting recognition
- operating systems/networking (memcpy, memset, parity, checksum)
- databases (hash/join, data mining, image/video serving)
- language run-time support (stdlib, garbage collection)
- even SPECint95
MIMD: each processor fetches its own instructions and operates on its own data.
- Processors are often off-the-shelf microprocessors.
- Scalable to a variable number of processor nodes.
- Flexible: single-user machines focusing on high performance for one specific application, multi-programmed machines running many tasks simultaneously, or some combination of these functions.
- Cost/performance advantages due to the use of off-the-shelf microprocessors.
- Fault tolerance issues.
To exploit an MIMD machine with n processors, we need at least n threads or processes to execute. These independent threads are typically identified by the programmer or created by the compiler. Since the parallelism is contained in the threads, it is called thread-level parallelism; a minimal sketch follows.
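As an illustrative sketch (POSIX threads, not part of the slides), spawning one thread per processor gives the machine n independent instruction streams to run:

    #include <pthread.h>
    #include <stdio.h>

    #define NPROC 4   /* assume a 4-processor MIMD machine */

    /* Each thread is an independent instruction stream working on its
       own part of the problem: thread-level parallelism. */
    static void *worker(void *arg)
    {
        int id = *(int *)arg;
        printf("thread %d running on its own processor\n", id);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NPROC];
        int id[NPROC];
        for (int i = 0; i < NPROC; i++) {   /* n processors => n threads */
            id[i] = i;
            pthread_create(&t[i], NULL, worker, &id[i]);
        }
        for (int i = 0; i < NPROC; i++)
            pthread_join(t[i], NULL);
        return 0;
    }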
Existing MIMD machines fall into two classes, depending on the number of processors involved, which in turn dictates the memory organization and interconnection strategy.
Centralized shared-memory architectures:
- at most a few dozen processor chips (< 100 cores);
- large caches and a single memory with multiple banks;
- often called symmetric multiprocessors (SMP), with the style of architecture called Uniform Memory Access (UMA).
Distributed-memory architectures:
- support large processor counts;
- require a high-bandwidth interconnect;
- disadvantage: data communication among processors.
[Diagrams: a centralized shared-memory multiprocessor, where processors P0-P3 with caches C0-C3 reach shared memory modules MM0-MM3 through an interconnection network; and a distributed-memory multiprocessor, where each node pairs a processor and cache with its own local main memory and the nodes communicate over the interconnection network.]
Key design questions:
- How many processors?
- How powerful are the processors?
- How do parallel processors share data?
- Where to place the physical memory?
- How do parallel processors cooperate and coordinate?
- What type of interconnection topology?
- How to program the processors?
- How to maintain cache coherency?
- How to maintain memory consistency?
- How to evaluate system performance?
A real life example
A 64-bit Power Architecture core:
- two-issue superscalar execution;
- two-way multithreaded core;
- in-order execution;
- caches: a 32 KB instruction and a 32 KB data Level 1 cache, plus a 512 KB Level 2 cache; the size of a cache line is 128 bytes.
One core to rule them all.
Cell is a heterogeneous chip multiprocessor: one 64-bit Power core plus 8 specialized co-processors, based on a novel single-instruction multiple-data (SIMD) architecture called SPU (Synergistic Processor Unit).
The Xbox 360's Xenon CPU: three symmetrical cores, each two-way SMT-capable and clocked at 3.2 GHz; SIMD: a VMX128 extension for each core; 1 MB of L2 cache (lockable by the GPU) running at half the core clock.
Microsoft envisions a procedurally rendered game as having at least two threads:
- Host thread: a game's host thread contains the main thread of execution for the game.
- Data generation thread: where the actual procedural synthesis of object geometry takes place.
These two threads could run on the same PPE, or on two separate PPEs. In addition to these two threads, the game could run further threads on the remaining hardware contexts; a sketch of the two-thread split follows.
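A minimal pthreads sketch of this producer/consumer split (the names and the double-buffering scheme are illustrative assumptions, not Xbox APIs):

    #include <pthread.h>

    #define NBUF 2                   /* hypothetical double buffering */
    #define NVERT 1024

    static float vertices[NBUF][NVERT];
    static int ready[NBUF];
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cv   = PTHREAD_COND_INITIALIZER;

    /* Data generation thread: procedurally synthesizes geometry into a
       free buffer, then flags it for the host thread. */
    static void *data_generation_thread(void *arg)
    {
        (void)arg;
        for (int frame = 0; frame < 60; frame++) {
            int b = frame % NBUF;
            pthread_mutex_lock(&lock);
            while (ready[b])              /* wait until buffer consumed */
                pthread_cond_wait(&cv, &lock);
            pthread_mutex_unlock(&lock);

            /* ... fill vertices[b] with synthesized vertex data ... */

            pthread_mutex_lock(&lock);
            ready[b] = 1;                 /* hand buffer to host thread */
            pthread_cond_signal(&cv);
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    /* The host thread does the converse: it waits on `cv` for a ready
       buffer, hands it to the GPU, clears ready[b], and signals back. */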
Keep it simple:
- strip out hardware that's intended to optimize instruction scheduling at runtime;
- neither the Xenon nor the Cell has an instruction window;
- instructions pass through the processor in the order in which they're fetched;
- two adjacent, non-dependent instructions are executed in parallel where possible.
Static execution:
- is simple to implement;
- takes up much less die space than dynamic execution, since the processor doesn't need to spend a lot of transistors on the instruction window and related hardware;
- the transistors that the lack of an instruction window frees up can be used to put more actual execution units on the die.
The sketch below shows what this shift asks of the code.
Rethink how you organize the processor: you can't just eliminate the instruction window and replace it with more execution units; the design has to be reorganized around explicit parallelism.
No hardware is spent on an instruction window that looks for ILP at run-time; instead, the programmer has to structure the code stream at compile time so that it contains a high level of thread-level parallelism (TLP). There are three separate cores, each of which individually contains a relatively small number of execution units; the many parallel threads out of which the programmer has woven the code stream are then scheduled to run on those separate cores.
This TLP strategy works extremely well for tasks like procedural synthesis that can be parallelized at the thread level. However, it won't work as well as an old-fashioned wide execution core plus a large instruction window for inherently single-threaded tasks.
In particular, three types of game-oriented tasks are likely to suffer from the lack of out-of-order processing and core width: game control, artificial intelligence (AI), and physics.
Procedural synthesis is about making optimal use of system bandwidth and main memory by dynamically generating lower-level geometry data from statically stored higher-level scene data.
For 3D games:
- artists use a 3D rendering program to produce content for the game;
- each model is translated into a collection of polygons;
- each polygon is represented in the computer's memory as a collection of vertices.
When the computer renders a scene of a game in real time:
- the models being displayed on the screen start out in main memory as stored vertex data;
- that vertex data is fed from main memory into the GPU, where it is rendered into a 3D image and output to the monitor as a sequence of frames.
There are two problems:
- the costs of creating art assets for a 3D game are going through the roof, along with the size and complexity of the games themselves;
- console hardware has limited main memory sizes and limited bus bandwidth.
The procedural-synthesis answer: store high-level descriptions of objects in main memory and have the CPU procedurally generate the geometry from them. Main memory stores the high-level information; this information is passed into the Xbox 360's Xenon CPU, where data generation threads expand it into vertex data. These threads then feed that vertex data directly into the GPU by way of a special set of write buffers in the L2 cache. The GPU then takes that vertex information and renders the scene, as in the sketch below.
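As a final illustrative C sketch (the function and its parameters are invented for illustration), procedural synthesis trades a compact description, here a circle's center, radius, and tessellation level, for vertex data generated on the fly:

    #include <math.h>

    typedef struct { float x, y, z; } Vertex;

    /* Expand a compact high-level description (center, radius,
       tessellation level) into per-vertex data on demand, instead of
       storing every vertex in main memory up front. */
    int make_ring(Vertex *out, int max_vertices,
                  float cx, float cy, float cz,
                  float radius, int segments)
    {
        const float PI = 3.14159265f;
        if (segments > max_vertices)
            return 0;                       /* output buffer too small */
        for (int i = 0; i < segments; i++) {
            float a = 2.0f * PI * (float)i / (float)segments;
            out[i].x = cx + radius * cosf(a);
            out[i].y = cy + radius * sinf(a);
            out[i].z = cz;
        }
        return segments;                    /* vertices ready for the GPU */
    }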