POLITECNICO DI MILANO
Advanced Topics on Heterogeneous System Architectures
Multiprocessors
Politecnico di Milano, Seminar Room, Bld 20, 30 November 2017
Antonio Miele, Marco Santambrogio, Politecnico di Milano
Outline
- Multiprocessors
  - Flynn taxonomy
  - SIMD architectures
  - Vector architectures
  - MIMD architectures
- A real life example
- What's next
Definitions of a supercomputer:
- The fastest machine in the world at a given task
- A device for turning a compute-bound problem into an I/O-bound problem
- Any machine costing $30M+
- Any machine designed by Seymour Cray
The Cray XD1 uses AMD Opteron 64-bit CPUs and incorporates Xilinx Virtex-II FPGAs.
Performance gains from the FPGAs:
- RC5 cipher breaking: 1000x faster than a 2.4 GHz P4
- Elliptic curve cryptography: 895-1300x faster than a 1 GHz P3
- Vehicular traffic simulation: 300x faster on an XC2V6000 and 650x faster on an XC2VP100 than a 1.7 GHz Xeon
- Smith-Waterman DNA matching: 28x faster than a 2.4 GHz Opteron
Definition: “A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.” (Almasi and Gottlieb, Highly Parallel Computing, 1989)
The aim is to replicate processors to add performance.
Parallel architecture extends traditional computer architecture with a communication architecture: abstractions (the HW/SW interface) and different structures to realize the abstraction efficiently.
ILP architectures (superscalar, VLIW...):
- support fine-grained, instruction-level parallelism;
- fail to support large-scale parallel systems.
Multiple-issue CPUs are very complex, and the returns (as additional performance) diminish as issue width grows.
A further step: process- and thread-level parallelism. To achieve ever greater performance, connect multiple microprocessors together in a single system.
Most recent microprocessor chips are multiprocessors (multicores). The major difficulty in exploiting parallelism in multiprocessors is the software: applications must be written to expose thread-level parallelism.
Flynn taxonomy:
- SISD (Single Instruction, Single Data): uniprocessor systems
- MISD (Multiple Instruction, Single Data): no practical configuration and no commercial systems
- SIMD (Single Instruction, Multiple Data): simple programming model, low overhead, flexibility, custom integrated circuits
- MIMD (Multiple Instruction, Multiple Data): scalable, fault tolerant, off-the-shelf micros
SISD: a serial (non-parallel) computer.
- Single instruction: only one instruction stream is acted on by the CPU during any one clock cycle.
- Single data: only one data stream is used as input during any one clock cycle.
- Deterministic execution.
This is the oldest and, even today, the most common type of computer.
SIMD: a type of parallel computer.
- Single instruction: all processing units execute the same instruction at any given clock cycle.
- Multiple data: each processing unit can operate on a different data element.
Best suited for specialized problems characterized by a high degree of regularity, such as graphics and image processing.
MISD: a single data stream is fed into multiple processing units. Each processing unit operates on the data independently, via its own instruction stream.
MIMD: nowadays, the most common type of parallel computer.
- Multiple instruction: every processor may be executing a different instruction stream.
- Multiple data: every processor may be working with a different data stream.
Execution can be synchronous or asynchronous, deterministic or non-deterministic.
Many of the early multiprocessors were SIMD, but MIMD has emerged as the architecture of choice for general-purpose multiprocessors. Let's look at these architectures in more detail.
SIMD architectures: the same instruction is executed by multiple processors using different data streams.
- Each processor has its own data memory.
- A single instruction memory and control processor fetches and dispatches instructions.
- Processors are typically special-purpose.
- Simple programming model.
A central controller broadcasts instructions to multiple processing elements (PEs).
[Diagram: an array controller distributes control and data over an inter-PE connection network to the PEs, each with its own local memory.]
- Only requires one controller for the whole array.
- Only requires storage for one copy of the program.
- All computations are fully synchronized.
The units are synchronized by a single program counter, but each unit has its own addressing registers and can therefore use different data addresses.
Motivations for SIMD:
- the cost of the control unit is shared by all execution units;
- only one copy of the code in execution is necessary.
In real life, SIMD machines run a mix of SISD and SIMD instructions: a host computer executes the sequential operations, while SIMD instructions are broadcast to all the execution units, each of which has its own memory and registers and exploits an interconnection network to exchange data. A modern analogue is sketched below.
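The same one-instruction-many-elements idea survives in the SIMD extensions of today's CPUs. As a minimal sketch (assuming x86 SSE intrinsics, which the slides do not cover), a single _mm_add_ps instruction performs four float additions at once, while the surrounding scalar loop plays the role of the host code:

    #include <xmmintrin.h>  /* x86 SSE intrinsics */

    /* Element-wise add of two float arrays; n is assumed to be a
       multiple of 4 to keep the sketch short. */
    void simd_add(const float *a, const float *b, float *c, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);  /* load 4 elements        */
            __m128 vb = _mm_loadu_ps(&b[i]);
            __m128 vc = _mm_add_ps(va, vb);   /* ONE instruction, 4 adds */
            _mm_storeu_ps(&c[i], vc);         /* store 4 results         */
        }
    }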
Distributed-memory SIMD failed as a large-scale general-purpose computer platform:
- it required huge quantities of data parallelism (>10,000 elements);
- it required programmer-controlled distributed data layout.
Vector supercomputers (shared-memory SIMD) are still successful in high-end supercomputing:
- reasonable efficiency on short vector lengths (10-100 elements);
- single memory space.
Distributed-memory SIMD remains popular for special-purpose accelerators, e.g., image and graphics processing.
Renewed interest for Processor-in-Memory (PIM):
- memory bottlenecks => put some simple logic close to memory;
- viewed as enhanced memory for a conventional system;
- technology push from new merged DRAM + logic processes;
- commercial examples, e.g., graphics in the Sony PlayStation 2/3.
Example: the Sony PlayStation 2 Emotion Engine.
- Superscalar MIPS core
- Vector coprocessor pipelines
- RAMBUS DRAM interface
Sample vector unit:
- 2-wide VLIW
- includes microcode memory
- high-level instructions like matrix-multiply
Vector processors have high-level operations that work on linear arrays of numbers (vectors).
Scalar (one operation):

    add r3, r1, r2        # r3 = r1 + r2

Vector (N operations, where N is the vector length):

    add.vv v3, v1, v2     # v3[i] = v1[i] + v2[i] for every element i
Typical vector-machine ingredients:
- load/store architecture
- vector registers
- vector instructions
- hardwired control
- highly pipelined functional units
- interleaved memory system
- no data caches
- no virtual memory
A single vector instruction specifies a great deal of work: it is equivalent to executing an entire loop, and each instruction represents tens or hundreds of operations. The fetch and decode bandwidth needed to keep multiple deeply pipelined FUs busy is therefore dramatically reduced.
Vector instructions indicate that the computation of each result is independent of the computation of the other results in the same vector, so there is no need to check for data hazards within a vector instruction: hardware needs to check for data hazards only between vector instructions.
- Each result is independent of previous results => long pipelines, with the compiler ensuring there are no dependencies => high clock rate.
- Vector instructions access memory with a known pattern => highly interleaved memory fetches the vector from a set of memory banks => memory latency is amortized over, e.g., 64 elements => no data caches required (an instruction cache is still used).
- Branches and branch problems in pipelines are reduced: an entire loop is replaced by a vector instruction, so the control hazards that would arise from the loop branch are avoided (the loops below illustrate both cases).
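These properties are exactly what must hold in the source loop. An illustrative pair of C loops (not from the slides) shows the difference:

    /* Vectorizable: every iteration is independent and memory is
       accessed with a known, unit stride. */
    void vec_ok(const float *a, const float *b, float *c, int n)
    {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    /* Not expressible as one vector instruction: the loop-carried
       dependence means element i needs the freshly computed
       element i - 1. */
    void vec_bad(float *a, int n)
    {
        for (int i = 1; i < n; i++)
            a[i] = a[i] + a[i - 1];
    }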
A vector processor consists of a pipelined scalar unit plus a vector unit. Two basic styles exist:
- memory-memory vector processors: all vector operations go memory to memory;
- vector-register processors: all vector operations are between vector registers (except load and store). This is the vector equivalent of a load-store architecture, and it includes all vector machines since the late 1980s: Cray, Convex, Fujitsu, Hitachi, NEC.
Vector registers: each is a fixed-length bank holding a single vector; it has at least 2 read ports and 1 write port; there are typically 8-32 vector registers, each holding 64-128 64-bit elements.
Vector functional units (FUs): fully pipelined, able to start a new operation every clock cycle; typically 4 to 8 FUs: FP add, FP multiply, FP reciprocal (1/X), integer add, logical, shift; there may be multiple units of the same type.
A control unit detects hazards (structural conflicts for the FUs and data hazards on register accesses). Scalar operations may use either the vector functional units or a dedicated set.
Vector load-store units (LSUs): fully pipelined units that load a vector from or store a vector to memory. Pipelining allows moving words between the vector registers and memory with a bandwidth of 1 word per clock cycle. They also handle scalar loads and stores, and there may be multiple LSUs.
Scalar registers: single elements holding FP scalars or addresses. A cross-bar connects the FUs, LSUs, and registers.
[Diagram: the vector programming model. Scalar registers r0-r15; vector registers v0-v15, each holding elements [0] to [VLRMAX-1]; a vector length register VLR. A vector arithmetic instruction such as ADDV v3, v1, v2 adds the element pairs [0] to [VLR-1]; a vector load such as LV v1, r1, r2 fills a vector register from memory using base address r1 and stride r2.]
    # C code
    for (i = 0; i < 64; i++)
        C[i] = A[i] + B[i];

    # Scalar code
            LI      R4, #64
    loop:   L.D     F0, 0(R1)
            L.D     F2, 0(R2)
            ADD.D   F4, F2, F0
            S.D     F4, 0(R3)
            DADDIU  R1, 8
            DADDIU  R2, 8
            DADDIU  R3, 8
            DSUBIU  R4, 1
            BNEZ    R4, loop

    # Vector code
            LI      VLR, #64
            LV      V1, R1
            LV      V2, R2
            ADDV.D  V3, V1, V2
            SV      V3, R3
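Counting dynamic instructions makes the point: the scalar version executes 9 instructions per element, nearly 580 in total for 64 elements, while the vector version covers the same loop with just 5 instructions.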
Vector instructions are:
- Compact: one short instruction encodes N operations.
- Expressive: they tell the hardware that these N operations are independent, use the same functional unit, access disjoint registers, access registers in the same pattern as previous instructions, and access either a contiguous block of memory (unit-stride load/store) or memory in a known pattern (strided load/store); see the sketch after this list.
- Scalable: the same code can run on more parallel pipelines (lanes).
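To make unit-stride versus strided concrete, here is an illustrative C fragment (not from the slides): traversing a row of a row-major matrix is unit-stride, while traversing a column touches memory with a fixed stride of one row.

    #define N 256

    /* Unit-stride: consecutive addresses, one element after another. */
    float sum_row(float m[N][N], int r)
    {
        float s = 0.0f;
        for (int j = 0; j < N; j++)
            s += m[r][j];          /* addresses advance by sizeof(float) */
        return s;
    }

    /* Strided: a fixed step of N elements between accesses -- still a
       known pattern, expressible as a strided vector load. */
    float sum_col(float m[N][N], int c)
    {
        float s = 0.0f;
        for (int i = 0; i < N; i++)
            s += m[i][c];          /* stride of N elements */
        return s;
    }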
[Diagram: a six-stage multiply pipeline computing V3 <- V1 * V2; a new pair of vector elements enters the pipeline every clock cycle.]
[Diagram: execution of ADDV C, A, B with one pipelined functional unit (one result per cycle: C[0], C[1], C[2], ...) versus four pipelined functional units (four results per cycle: C[0]-C[3], then C[4]-C[7], and so on).]
[Diagram: a vector memory system. An address generator takes a base and a stride and spreads the element accesses across a set of memory banks.]
Cray-1: 16 banks, 4-cycle bank busy time, 12-cycle latency. The bank busy time is the time before a bank is ready to accept the next request. To avoid conflicts, the stride and the number of banks should be relatively prime, as the sketch below demonstrates.
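A tiny C model (illustrative, not from the slides) shows why relative primality matters with low-order interleaving: with 16 banks, a stride of 16 hammers a single bank, while a stride of 17 rotates through all of them.

    #include <stdio.h>

    #define NBANKS 16                /* as in the Cray-1 example */

    /* Low-order interleaving: word address modulo the bank count. */
    static int bank_of(long word_addr) { return (int)(word_addr % NBANKS); }

    int main(void)
    {
        for (int i = 0; i < 4; i++)  /* stride 16: every access -> bank 0 */
            printf("stride 16, element %d -> bank %d\n", i, bank_of(16L * i));
        for (int i = 0; i < 4; i++)  /* stride 17: banks 0, 1, 2, 3, ... */
            printf("stride 17, element %d -> bank %d\n", i, bank_of(17L * i));
        return 0;
    }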
[Diagram: a vector unit organized as four lanes. Each lane contains one slice of the vector register file, its own functional-unit pipelines, and its own port into the memory subsystem. The vector register elements are striped across the lanes: lane 0 holds elements 0, 4, 8, ...; lane 1 holds 1, 5, 9, ...; lane 2 holds 2, 6, 10, ...; lane 3 holds 3, 7, 11, ...]
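In code terms the striping is just a modulo mapping; a one-line C helper (illustrative) captures it, and it implies that an element-wise vector operation never moves data between lanes, since element i of every operand lives in the same lane:

    #define NLANES 4

    /* Element i of every vector register resides in lane i % NLANES,
       so lane k can compute all results k, k+4, k+8, ... locally. */
    static inline int lane_of(int element) { return element % NLANES; }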
Applications with vector potential:
- multimedia processing (compression, graphics, audio synthesis, image processing)
- standard benchmark kernels (matrix multiply, FFT, convolution, sort)
- lossy compression (JPEG, MPEG video and audio)
- lossless compression (zero removal, RLE, differencing, LZW)
- cryptography (RSA, DES/IDEA, SHA/MD5)
- speech and handwriting recognition
- operating systems/networking (memcpy, memset, parity, checksum)
- databases (hash/join, data mining, image/video serving)
- language run-time support (stdlib, garbage collection)
- even SPECint95
MIMD: each processor fetches its own instructions and operates on its own data.
- Processors are often off-the-shelf microprocessors.
- Scalable to a variable number of processor nodes.
- Flexible: single-user machines focusing on high performance for one specific application, multi-programmed machines running many tasks simultaneously, or some combination of these functions.
- Cost/performance advantages due to the use of off-the-shelf microprocessors.
- Fault tolerance issues.
To exploit an MIMD machine with n processors, we need at least n threads or processes to execute. These independent threads are typically identified by the programmer or created by the compiler. Since the parallelism is contained in the threads, it is called thread-level parallelism; a minimal sketch follows.
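As an illustrative sketch (POSIX threads, not part of the slides), spawning one thread per processor gives the machine n independent instruction streams to run:

    #include <pthread.h>
    #include <stdio.h>

    #define NPROC 4   /* assume a 4-processor MIMD machine */

    /* Each thread is an independent instruction stream working on its
       own part of the problem: thread-level parallelism. */
    static void *worker(void *arg)
    {
        int id = *(int *)arg;
        printf("thread %d running on its own processor\n", id);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NPROC];
        int id[NPROC];
        for (int i = 0; i < NPROC; i++) {   /* n processors => n threads */
            id[i] = i;
            pthread_create(&t[i], NULL, worker, &id[i]);
        }
        for (int i = 0; i < NPROC; i++)
            pthread_join(t[i], NULL);
        return 0;
    }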
Existing MIMD machines fall into two classes, depending on the number of processors involved, which in turn dictates the memory organization and interconnection strategy.
Centralized shared-memory architectures:
- at most a few dozen processor chips (< 100 cores);
- large caches and a single memory with multiple banks;
- often called symmetric multiprocessors (SMP), with the style of architecture called Uniform Memory Access (UMA).
Distributed-memory architectures:
- support large processor counts;
- require a high-bandwidth interconnect;
- disadvantage: data communication among processors.
[Diagrams: a centralized shared-memory multiprocessor, where processors P0-P3 with caches C0-C3 reach shared memory modules MM0-MM3 through an interconnection network; and a distributed-memory multiprocessor, where each node pairs a processor and cache with its own local main memory and the nodes communicate over the interconnection network.]
Key design questions:
- How many processors?
- How powerful are the processors?
- How do parallel processors share data?
- Where to place the physical memory?
- How do parallel processors cooperate and coordinate?
- What type of interconnection topology?
- How to program the processors?
- How to maintain cache coherency?
- How to maintain memory consistency?
- How to evaluate system performance?
A real life example
A 64-bit Power Architecture core:
- two-issue superscalar execution;
- two-way multithreaded core;
- in-order execution;
- caches: a 32 KB instruction and a 32 KB data Level 1 cache, plus a 512 KB Level 2 cache; the size of a cache line is 128 bytes.
One core to rule them all.
Cell is a heterogeneous chip multiprocessor: one 64-bit Power core plus 8 specialized co-processors, based on a novel single-instruction multiple-data (SIMD) architecture called SPU (Synergistic Processor Unit).
The Xbox 360's Xenon CPU: three symmetrical cores, each two-way SMT-capable and clocked at 3.2 GHz; SIMD: a VMX128 extension for each core; 1 MB of L2 cache (lockable by the GPU) running at half the core clock.
Microsoft envisions a procedurally rendered game as having at least two threads:
- Host thread: a game's host thread contains the main thread of execution for the game.
- Data generation thread: where the actual procedural synthesis of object geometry takes place.
These two threads could run on the same PPE, or on two separate PPEs. In addition to these two threads, the game could run further threads on the remaining hardware contexts; a sketch of the two-thread split follows.
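A minimal pthreads sketch of this producer/consumer split (the names and the double-buffering scheme are illustrative assumptions, not Xbox APIs):

    #include <pthread.h>

    #define NBUF 2                   /* hypothetical double buffering */
    #define NVERT 1024

    static float vertices[NBUF][NVERT];
    static int ready[NBUF];
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cv   = PTHREAD_COND_INITIALIZER;

    /* Data generation thread: procedurally synthesizes geometry into a
       free buffer, then flags it for the host thread. */
    static void *data_generation_thread(void *arg)
    {
        (void)arg;
        for (int frame = 0; frame < 60; frame++) {
            int b = frame % NBUF;
            pthread_mutex_lock(&lock);
            while (ready[b])              /* wait until buffer consumed */
                pthread_cond_wait(&cv, &lock);
            pthread_mutex_unlock(&lock);

            /* ... fill vertices[b] with synthesized vertex data ... */

            pthread_mutex_lock(&lock);
            ready[b] = 1;                 /* hand buffer to host thread */
            pthread_cond_signal(&cv);
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    /* The host thread does the converse: it waits on `cv` for a ready
       buffer, hands it to the GPU, clears ready[b], and signals back. */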
Keep it simple:
- strip out hardware that's intended to optimize instruction scheduling at runtime;
- neither the Xenon nor the Cell has an instruction window;
- instructions pass through the processor in the order in which they're fetched;
- two adjacent, non-dependent instructions are executed in parallel where possible.
Static execution:
- is simple to implement;
- takes up much less die space than dynamic execution, since the processor doesn't need to spend a lot of transistors on the instruction window and related hardware;
- the transistors that the lack of an instruction window frees up can be used to put more actual execution units on the die.
The sketch below shows what this shift asks of the code.
Rethink how you organize the processor: you can't just eliminate the instruction window and replace it with more execution units; the design has to be reorganized around explicit parallelism.
No hardware is spent on an instruction window that looks for ILP at run-time; instead, the programmer has to structure the code stream at compile time so that it contains a high level of thread-level parallelism (TLP). There are three separate cores, each of which individually contains a relatively small number of execution units; the many parallel threads out of which the programmer has woven the code stream are then scheduled to run on those separate cores.
This TLP strategy works extremely well for tasks like procedural synthesis that can be parallelized at the thread level. However, it won't work as well as an old-fashioned wide execution core plus a large instruction window for inherently single-threaded tasks.
In particular, three types of game-oriented tasks are likely to suffer from the lack of out-of-order processing and core width: game control, artificial intelligence (AI), and physics.
Procedural synthesis is about making optimal use of system bandwidth and main memory by dynamically generating lower-level geometry data from statically stored higher-level scene data.
For 3D games:
- artists use a 3D rendering program to produce content for the game;
- each model is translated into a collection of polygons;
- each polygon is represented in the computer's memory as a collection of vertices.
When the computer renders a scene of a game in real time:
- the models being displayed on the screen start out in main memory as stored vertex data;
- that vertex data is fed from main memory into the GPU, where it is rendered into a 3D image and output to the monitor as a sequence of frames.
There are two problems:
- the costs of creating art assets for a 3D game are going through the roof, along with the size and complexity of the games themselves;
- console hardware has limited main memory sizes and limited bus bandwidth.
The procedural-synthesis answer: store high-level descriptions of objects in main memory and have the CPU procedurally generate the geometry from them. Main memory stores the high-level information; this information is passed into the Xbox 360's Xenon CPU, where data generation threads expand it into vertex data. These threads then feed that vertex data directly into the GPU by way of a special set of write buffers in the L2 cache. The GPU then takes that vertex information and renders the scene, as in the sketch below.
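As a final illustrative C sketch (the function and its parameters are invented for illustration), procedural synthesis trades a compact description, here a circle's center, radius, and tessellation level, for vertex data generated on the fly:

    #include <math.h>

    typedef struct { float x, y, z; } Vertex;

    /* Expand a compact high-level description (center, radius,
       tessellation level) into per-vertex data on demand, instead of
       storing every vertex in main memory up front. */
    int make_ring(Vertex *out, int max_vertices,
                  float cx, float cy, float cz,
                  float radius, int segments)
    {
        const float PI = 3.14159265f;
        if (segments > max_vertices)
            return 0;                       /* output buffer too small */
        for (int i = 0; i < segments; i++) {
            float a = 2.0f * PI * (float)i / (float)segments;
            out[i].x = cx + radius * cosf(a);
            out[i].y = cy + radius * sinf(a);
            out[i].z = cz;
        }
        return segments;                    /* vertices ready for the GPU */
    }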