

SIMD+ Overview

Early machines
Illiac IV (first SIMD)
Cray-1 (vector processor, not a SIMD)
SIMDs in the 1980s and 1990s
Thinking Machines CM-2 (1980s)
General characteristics
Host computer to interact with user and execute scalar instructions, control unit to send parallel instructions to PE array
100s or 1000s of simple custom PEs, each with its own private memory
PEs connected by 2D torus, maybe also by row/column bus(es) or hypercube
Broadcast / reduction network

2 Fall 2005, MIMD

Illiac IV History

First massively parallel (SIMD) computer
Sponsored by DARPA, built by various companies, assembled by Burroughs, under the direction of Daniel Slotnick at the University of Illinois
Plan was for 256 PEs, in 4 quadrants of 64 PEs, but only one quadrant was built
Used at NASA Ames Research Center in mid-1970s

3 Fall 2005, MIMD

Illiac IV Architectural Overview

CU (control unit) + 64 PUs (processing units)
PU = 64-bit PE (processing element) + PEM (PE memory)
CU operates on scalars, PEs operate on vector-aligned arrays (A[1] on PE 1, A[2] on PE 2, etc.)
All PEs execute the instruction broadcast by the CU, if they are in active mode (sketched below)
Each PE can perform various arithmetic and logical instructions on data in 64-bit, 32-bit, and 8-bit formats
Each PEM contains 2048 64-bit words
Data routed between PEs in various ways
I/O is handled by a separate Burroughs B6500 computer (stack architecture)
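To make the execution model concrete, here is a minimal C sketch, not Illiac IV code: 64 PEs each hold one element of A and B in their private PEM, and a broadcast operation is performed only by PEs whose mode bit is set. The PEM addresses, operation, and activity pattern are illustrative assumptions.

    /* Minimal sketch of lock-step SIMD execution with an activity mask.
     * 64 PEs, each with a private PEM; the CU broadcasts one operation
     * and every active PE applies it to its own memory words.
     * Sizes and names are illustrative, not actual Illiac IV details. */
    #include <stdio.h>

    #define NPE       64     /* one quadrant: 64 processing elements   */
    #define PEM_WORDS 2048   /* words of private memory per PE         */

    static double pem[NPE][PEM_WORDS];  /* PEM i belongs to PE i        */
    static int    active[NPE];          /* mode bit: 1 = obey broadcast */

    /* CU broadcasts "c = a + b" (PEM addresses); active PEs execute it. */
    static void broadcast_add(int a, int b, int c)
    {
        for (int pe = 0; pe < NPE; pe++)    /* conceptually simultaneous */
            if (active[pe])
                pem[pe][c] = pem[pe][a] + pem[pe][b];
    }

    int main(void)
    {
        /* Vector-aligned layout: element i of A and B lives on PE i. */
        for (int pe = 0; pe < NPE; pe++) {
            pem[pe][0] = pe;              /* A[i] at PEM address 0      */
            pem[pe][1] = 100 - pe;        /* B[i] at PEM address 1      */
            active[pe] = (pe % 2 == 0);   /* e.g. only even PEs enabled */
        }
        broadcast_add(0, 1, 2);           /* C = A + B on active PEs    */
        printf("PE 0: %g   PE 1 (inactive): %g\n", pem[0][2], pem[1][2]);
        return 0;
    }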

4 Fall 2005, MIMD

Illiac IV Routing and I/O

Data routing
CU bus — instructions or data can be fetched from a PEM and sent to the CU
CDB (Common Data Bus) — broadcasts information from CU to all PEs
PE Routing network — 2D torus
Laser memory
1 Tb write-once read-only laser memory
Thin film of metal on a polyester sheet, on a rotating drum
DFS (Disk File System)
1 Gb, 128 heads (one per track)
ARPA network link (50 Kbps)
Illiac IV was a network resource available to other members of the ARPA network


5 Fall 2005, MIMD

Cray-1 History

First famous vector (not SIMD) processor
In January 1978 there were only 12 non-Cray-1 vector processors worldwide: Illiac IV, TI ASC (7 installations), CDC STAR 100 (4 installations)

6 Fall 2005, MIMD

Cray-1 Vector Operations

Vector arithmetic
8 vector registers, each holding a 64-element vector (64 64-bit words)
Arithmetic and logical instructions operate on 3 vector registers
Vector C = vector A + vector B
Decode the instruction once, then pipeline the load, add, store operations
Multiple functional units
12 pipelined functional units in 4 groups: address, scalar, vector, and floating point
Scalar add = 3 cycles, vector add = 3 cycles, floating-point add = 6 cycles, floating-point multiply = 7 cycles, reciprocal approximation = 14 cycles
Vector chaining
Use pipelining with data forwarding to bypass vector registers and send result of one functional unit to input of another
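A rough C sketch of what chaining buys, using an illustrative multiply-then-add pair of vector instructions; plain C cannot express the hardware pipelining, so the timing point lives in the comments.

    /* Sketch of the idea behind vector chaining, not Cray-1 code.
     * Vector registers are modeled as 64-element arrays.            */
    #include <stdio.h>

    #define VLEN 64
    typedef double vreg[VLEN];

    int main(void)
    {
        vreg v0, v1, v2, v3;
        for (int i = 0; i < VLEN; i++) {
            v0[i] = i;
            v1[i] = 2.0 * i;
            v2[i] = 1.0;
        }

        /* Without chaining: V3 = V0 * V1 must complete for all 64 elements
         * before V3 = V3 + V2 can start.  With chaining, each product is
         * forwarded straight from the multiply pipeline into the add
         * pipeline, so the two operations overlap element by element,
         * which is what the fused loop below suggests.                  */
        for (int i = 0; i < VLEN; i++) {
            double prod = v0[i] * v1[i];   /* multiply unit output ...      */
            v3[i] = prod + v2[i];          /* ... forwarded to the add unit */
        }

        printf("v3[5] = %g\n", v3[5]);     /* 5 * 10 + 1 = 51 */
        return 0;
    }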

7 Fall 2005, MIMD

Cray-1 Physical Architecture

Custom implementation
Register chips, memory chips, low-speed and high-speed gates
Physical architecture
“Cylindrical tower (6.5 ft tall, 4.5 ft diameter) with 8.5 ft diameter seat
Composed of 12 wedge-like columns in a 270° arc, so a “reasonably trim individual” can get inside to work
World's most expensive love-seat”
“Love seat” hides power supplies and plumbing for the Freon cooling system
Freon cooling system
Vertical cooling bars line each wall; modules have a copper heat transfer plate that attaches to the cooling bars
Freon is pumped through a stainless steel tube inside an aluminum casing

10 Fall 2005, MIMD

Thinking Machines Corporation's Connection Machine CM-2

Distributed-memory SIMD (bit-serial)
Thinking Machines Corp. founded 1983
CM-1, 1986 (1000 MIPS, 4K processors)
CM-2, 1987 (2500 MFLOPS, 64K…)
Programs run on one of 4 Front-End Processors, which issue instructions to the Parallel Processing Unit (PE array)
Control flow and scalar operations run on Front-End Processors, while parallel operations run on the PPU
A 4x4 crossbar switch (Nexus) connects the 4 Front-Ends to 4 sections of the PPU
Each PPU section is controlled by a Sequencer (control unit), which receives assembly language instructions and broadcasts micro-instructions to each processor in that PPU section


11 Fall 2005, MIMD

CM-2 Nodes / Processors

CM-2 constructed of “nodes”, each with:
32 processors (implemented by 2 custom processor chips), 2 floating-point accelerator chips, and memory chips
2 processor chips (each 16 processors)
Contains ALU, flag registers, etc.
Contains NEWS interface, router interface, and I/O interface
16 processors are connected in a 4x4 mesh to their N, E, W, and S neighbors
2 floating-point accelerator chips
First chip is interface, second is FP execution unit
RAM memory
64 Kbits, bit addressable

12 Fall 2005, MIMD

CM-2 Interconnect

Broadcast and reduction network
Broadcast, Spread (scatter)
Reduction (e.g., bitwise OR, maximum, sum), Scan (e.g., collect cumulative results over a sequence of processors, such as parallel prefix)
Sort elements
NEWS grid can be used for nearest-neighbor communication
Communication in multiple dimensions: 256x256, 1024x64, 8x8192, 64x32x32, 16x16x16x16, 8x8x4x8x8x4
The 16-processor chips are also linked by a 12-dimensional hypercube
Good for long-distance point-to-point communication
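As an illustration of the scan operation mentioned above, here is a small C sketch of a prefix sum computed with the standard step-doubling scheme; the processor count is arbitrary and the loops are serialized, whereas the real machine performs each step on all processors at once. This is a generic parallel-prefix algorithm, not CM-2 microcode.

    /* Scan (inclusive parallel prefix sum) across P "processors".
     * At step d, processor i adds in the value held by processor i - d,
     * for d = 1, 2, 4, ...; after log2(P) steps every processor holds
     * the sum of all values up to and including its own.              */
    #include <stdio.h>
    #include <string.h>

    #define P 16

    int main(void)
    {
        int val[P], prev[P];
        for (int i = 0; i < P; i++) val[i] = 1;   /* each processor holds 1 */

        for (int d = 1; d < P; d <<= 1) {
            memcpy(prev, val, sizeof val);        /* values before this step */
            for (int i = d; i < P; i++)           /* conceptually in parallel */
                val[i] = prev[i] + prev[i - d];
        }

        /* With all-ones input, processor i now holds i + 1. */
        for (int i = 0; i < P; i++) printf("%d ", val[i]);
        printf("\n");
        return 0;
    }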

16 Fall 2005, MIMD

MIMD Overview

MIMDs in the 1980s and 1990s
Distributed-memory multicomputers
Thinking Machines CM-5
IBM SP2
Distributed-memory multicomputers with hardware to look like shared-memory
nCUBE 3
NUMA shared-memory multiprocessors
Cray T3D
Silicon Graphics POWER & Origin
General characteristics
100s of powerful commercial RISC PEs
Wide variation in PE interconnect network
Broadcast / reduction / synch network

20 Fall 2005, MIMD

Thinking Machines CM-5 Overview

Distributed-memory MIMD multicomputer
SIMD or MIMD operation
Configurable with up to 16,384 processing nodes and 512 GB of memory
Divided into partitions, each managed by a control processor
Processing nodes use SPARC CPUs


21 Fall 2005, MIMD

CM-5 Partitions / Control Processors

Processing nodes may be divided into (communicating) partitions, and are supervised by a control processor
Control processor broadcasts blocks of instructions to the processing nodes
SIMD operation: control processor broadcasts instructions and nodes are closely synchronized
MIMD operation: nodes fetch instructions independently and synchronize only as required by the algorithm
Control processors in general
Schedule user tasks, allocate resources, service I/O requests, accounting, etc.
In a small system, one control processor may play a number of roles
In a large system, control processors are often dedicated to particular tasks (partition manager, I/O control processor, etc.)

22 Fall 2005, MIMD

CM-5 Nodes and Interconnection

Processing nodes
SPARC CPU (running at 22 MIPS)
8–32 MB of memory
(Optional) 4 vector processing units
Each control processor and processing node connects to two networks
Control Network — for operations that involve all nodes at once
Broadcast, reduction (including parallel prefix), barrier synchronization
Optimized for fast response & low latency
Data Network — for bulk data transfers between specific source and destination
4-ary hypertree
Provides point-to-point communication for tens of thousands of items simultaneously
Special cases for nearest neighbor
Optimized for high bandwidth
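A sketch of the kind of combining-tree reduction a control network provides, here a global sum over 8 nodes; the node count, tree shape, and operation are illustrative, not CM-5 hardware details.

    /* Combining-tree reduction: values are summed pairwise, level by
     * level, and the single result is then available to every node,
     * which also acts as a barrier (no node sees the sum before all
     * have contributed).                                              */
    #include <stdio.h>

    #define NODES 8

    static long reduce_sum(const long contrib[NODES])
    {
        long level[NODES];
        int n = NODES;
        for (int i = 0; i < n; i++) level[i] = contrib[i];
        while (n > 1) {                     /* one tree level per iteration */
            for (int i = 0; i < n / 2; i++)
                level[i] = level[2 * i] + level[2 * i + 1];
            n /= 2;
        }
        return level[0];                    /* broadcast back to all nodes  */
    }

    int main(void)
    {
        long contrib[NODES] = {1, 2, 3, 4, 5, 6, 7, 8};
        printf("global sum = %ld\n", reduce_sum(contrib));  /* 36 */
        return 0;
    }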

24 Fall 2005, MIMD

IBM SP2 Overview

Distributed-memory MIMD multicomputer
Scalable POWERparallel 1 (SP1)
Scalable POWERparallel 2 (SP2)
RS/6000 workstation plus 4–128 POWER2 processors
POWER2 processors used in IBM's RS/6000 workstations, compatible with existing software

25 Fall 2005, MIMD

SP2 System Architecture

RS/6000 as system console
SP2 runs various combinations of serial, parallel, interactive, and batch jobs
Partition between types can be changed
High nodes — interactive nodes for code development and job submission
Thin nodes — compute nodes
Wide nodes — configured as servers, with extra memory, storage devices, etc.
A system “frame” contains 16 thin processor or 8 wide processor nodes
Includes redundant power supplies, nodes are hot swappable within frame
Includes a high-performance switch for low-latency, high-bandwidth communication


26 Fall 2005, MIMD

SP2 Processors and Interconnection

POWER2 processor
RISC processor, load-store architecture, various versions from 20 to 62.5 MHz
Comprised of 8 semi-custom chips: Instruction Cache, 4 Data Cache, Fixed-Point Unit, Floating-Point Unit, and Storage Control Unit
Interconnection network
Routing
Packet switched = each packet may take a different route
Cut-through = if the output is free, starts sending without buffering first (compared with store-and-forward in the sketch below)
Wormhole routing = buffer on a subpacket basis if buffering is necessary
Multistage High Performance Switch (HPS) network, scalable via extra stages to keep bandwidth to each processor constant
Guaranteed fairness of message delivery
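The standard textbook latency formulas behind the cut-through vs. store-and-forward distinction, plugged into a small C calculation; the packet size, header size, bandwidth, and hop count are made-up numbers for illustration, not SP2 figures.

    /* Back-of-the-envelope latency comparison (no contention assumed). */
    #include <stdio.h>

    int main(void)
    {
        double packet = 4096;   /* bytes in the packet                  */
        double header = 16;     /* bytes examined to choose the output  */
        double bw     = 40e6;   /* bytes per second per link            */
        int    hops   = 4;      /* switch stages traversed              */

        /* Store-and-forward: the whole packet is buffered at every hop. */
        double sf = hops * (packet / bw);

        /* Cut-through / wormhole: only the header pays the per-hop cost;
         * the rest of the packet streams along behind it.               */
        double ct = hops * (header / bw) + packet / bw;

        printf("store-and-forward: %6.1f us\n", sf * 1e6);
        printf("cut-through:       %6.1f us\n", ct * 1e6);
        return 0;
    }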

27 Fall 2005, MIMD

nCUBE 3 Overview

Distributed-memory MIMD multicomputer (with hardware to make it look like a shared-memory multiprocessor)
If access is attempted to a virtual memory page marked as “non-resident”, the system will generate messages to transfer that page to the local node (sketched below)
nCUBE 3 could have 8–65,536 processors and up to 65 TB of memory
Can be partitioned into “subcubes”
Multiple programming paradigms: SPMD, inter-subcube processing, client/server
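A conceptual C sketch of the non-resident-page mechanism described above: every access checks a page-table entry, and the first touch of a non-resident page triggers a (simulated) message to the page's home node. All structures and function names here are invented for the illustration; they are not nCUBE 3 interfaces.

    /* Shared memory built on message passing, in caricature. */
    #include <stdio.h>
    #include <string.h>

    #define PAGE_SIZE 4096
    #define NPAGES    8

    struct page {
        int  resident;              /* is a local copy present?        */
        int  home_node;             /* node that owns the master copy  */
        char data[PAGE_SIZE];
    };

    static struct page table[NPAGES];

    /* Stand-in for a network request to the page's home node. */
    static void fetch_page_from(int node, char *buf)
    {
        memset(buf, 0, PAGE_SIZE);  /* pretend the reply arrived       */
        printf("  (sent page request to node %d)\n", node);
    }

    /* Every access goes through this check, as a page-fault handler would. */
    static char *access_page(int pno)
    {
        if (!table[pno].resident) {
            fetch_page_from(table[pno].home_node, table[pno].data);
            table[pno].resident = 1;
        }
        return table[pno].data;
    }

    int main(void)
    {
        table[3].home_node = 42;    /* page 3 lives on node 42         */
        char *p = access_page(3);   /* first touch triggers a fetch    */
        p[0] = 'x';
        access_page(3);             /* already resident: no message    */
        return 0;
    }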

28 Fall 2005, MIMD

nCUBE 3 Processor and Interconnect

Processor
64-bit custom processor
0.6 µm, 3-layer CMOS, 2.7 million transistors, 50 MHz, 16 KB data cache, 16 KB instruction cache, 100 MFLOPS
ALU, FPU, virtual memory management unit, caches, SDRAM controller, 18-port message router, and 16 DMA channels
– ALU for integer operations, FPU for floating-point operations
Argument against an off-the-shelf processor: shared memory, vector floating-point units, and aggressive caches are necessary in the workstation market but superfluous here
Interconnect
Hypercube interconnect
Wormhole routing + adaptive routing around blocked or faulty nodes
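A small C sketch of routing on a hypercube: the set bits of (current XOR destination) are the dimensions still to be crossed, and a deterministic e-cube router fixes the lowest one on each hop; an adaptive router may pick any remaining dimension instead, e.g. to steer around a blocked or faulty neighbor. The 64-node (6-dimensional) cube is an example size, not a claim about an nCUBE 3 configuration.

    /* Hypercube routing by correcting address bits one hop at a time. */
    #include <stdio.h>

    static void route(unsigned cur, unsigned dst)
    {
        printf("route %u -> %u:", cur, dst);
        while (cur != dst) {
            unsigned diff = cur ^ dst;    /* dimensions still wrong       */
            int dim = 0;
            while (!((diff >> dim) & 1u))
                dim++;                    /* e-cube: lowest differing dim;
                                             an adaptive router could pick
                                             any set bit of diff instead  */
            cur ^= 1u << dim;             /* one hop across dimension dim */
            printf(" %u", cur);
        }
        printf("\n");
    }

    int main(void)
    {
        route(5, 60);   /* 4 hops: the XOR of 5 and 60 has 4 bits set */
        return 0;
    }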

29 Fall 2005, MIMD

nCUBE 3 I/O

ParaChannel I/O array
Separate network of nCUBE processors
8 computational nodes connect directly to one ParaChannel node
ParaChannel nodes can connect to RAID mass storage, SCSI disks, etc.
One I/O array can be connected to more than 400 disks

MediaCUBE Overview

For delivery of interactive video to client devices over a network (from LAN-based training to video-on-demand to homes)
MediaCUBE 30 = 270 1.5 Mbps data streams, 750 hours of content
MediaCUBE 3000 = 20,000 & 55,000


32 Fall 2005, MIMD

Cray T3D Overview

NUMA shared-memory MIMD multiprocessor
Each processor has a local memory, but the memory is globally addressable
DEC Alpha 21064 processors arranged into a virtual 3D torus (hence the name)
32–2048 processors, 512 MB – 128 GB of memory
Parallel vector processor (Cray Y-MP / C90) used as host computer, runs the scalar / vector parts of the program
3D torus is virtual, includes redundant nodes

33 Fall 2005, MIMD

T3D Nodes and Interconnection

Node contains 2 PEs; each PE contains:
DEC Alpha 21064 microprocessor
150 MHz, 64 bits, 8 KB L1 I&D caches
Support for L2 cache, not used in favor of improving latency to main memory
16–64 MB of local DRAM
Access local memory: latency 87–253 ns
Access remote memory: 1–2 µs (~8x)
Alpha has 43 bits of virtual address space but only 32 bits of physical address space — external registers in the node provide 5 more bits, for a 37-bit physical address
3D torus connects PE nodes and I/O gateways
Dimension-order routing: when a message leaves a node, it first travels in the X dimension, then Y, then Z
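A C sketch of dimension-order routing on a 3D torus, following the X-then-Y-then-Z rule above; the torus dimensions and the choice of the shorter way around each ring are assumptions made for the example, not T3D specifics.

    /* Dimension-order routing on an SX x SY x SZ torus. */
    #include <stdio.h>

    #define SX 8
    #define SY 4
    #define SZ 4

    /* One hop toward dst along a ring of `size` nodes, taking the
     * shorter direction around the ring.                            */
    static int step(int cur, int dst, int size)
    {
        int fwd = (dst - cur + size) % size;     /* hops going "+"   */
        if (fwd == 0) return cur;
        return (fwd <= size - fwd) ? (cur + 1) % size
                                   : (cur - 1 + size) % size;
    }

    int main(void)
    {
        int cur[3]  = {7, 0, 1}, dst[3] = {1, 3, 2};
        int size[3] = {SX, SY, SZ};
        const char *name = "XYZ";

        for (int d = 0; d < 3; d++)              /* X first, then Y, then Z */
            while (cur[d] != dst[d]) {
                cur[d] = step(cur[d], dst[d], size[d]);
                printf("hop in %c -> (%d,%d,%d)\n",
                       name[d], cur[0], cur[1], cur[2]);
            }
        return 0;
    }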

36 Fall 2005, MIMD

Silicon Graphics POWER CHALLENGEarray Overview

ccNUMA shared-memory MIMD
“Small” supercomputers
POWER CHALLENGE — up to 144 MIPS R8000 processors or 288 MIPS R10000 processors, with up to 128 GB of memory and 28 TB of disk
POWERnode system — shared-memory multiprocessor of up to 18 MIPS R8000 processors or 36 MIPS R10000 processors, with up to 16 GB of memory
POWER CHALLENGEarray consists of up to 8 POWER CHALLENGE or POWERnode systems
Programs that fit within a POWERnode can use the shared-memory model
Larger programs can span POWERnodes

37 Fall 2005, MIMD

Silicon Graphics Origin 2000 Overview

ccNUMA shared-memory MIMD
SGI says they supply 95% of ccNUMA systems worldwide
Various models, 2–128 MIPS R10000 processors, 16 GB – 1 TB of memory
Processing node board contains two R10000 processors, part of the shared memory, a directory for cache coherence, plus node and I/O interfaces
File serving, data mining, media serving, high-performance computing