Challenges of Parallel Processor Design, Martti Forsell (VTT Oulu) - PowerPoint PPT Presentation



SLIDE 1

Challenges of Parallel Processor Design

Martti Forsell (VTT Oulu), Ville Leppänen (University of Turku), Martti Penttonen (University of Kuopio), May 18, 2009

Forsell-Leppänen-Penttonen

SLIDE 2

Contents

  • Moore’s law
  • Latency
  • Slackness
  • PRAMs on Chip

– Paraleap – Eclipse – Moving threads

SLIDE 3

Moore’s law

  • 1 component on an IC in 1959
  • 50 components on an IC in 1965

Moore: maybe 65,000 components on an IC in 1975 (16 years, a 2^16-fold increase)

  • 2^32 (not 2^48) components on an IC in 2007

“Packing density doubles every 18 months”

  • “Laws” for clock cycles, bandwidth, ...
  • Not forever: size, heat, and quantum effects set limits
  • What to do with all those components? Multiple cores?
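The doubling periods implied by these numbers can be checked directly (a sketch in Python, assuming the component counts are exact powers of two and the years are as stated):

```python
# Quick check of the doubling periods behind Moore's law, using the
# slide's figures: 1 component (1959), 2^16 (1975), 2^32 (2007).
from math import log2

def doubling_period_years(n0, n1, years):
    """Average time for the component count to double over the interval."""
    return years / log2(n1 / n0)

# Moore's 1965 extrapolation: 16 years, 2^16-fold -> doubling every year
print(doubling_period_years(1, 2**16, 1975 - 1959))   # 1.0

# Observed by 2007: 48 years, 2^32-fold -> doubling every 18 months
print(doubling_period_years(1, 2**32, 2007 - 1959))   # 1.5
```

The 2^32 (rather than 2^48) count in 2007 is exactly what stretches the doubling period from Moore's original one year to the popular "18 months".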

SLIDE 4

Latency

  • moving data takes time
  • component overheads add up
  • memory latency of about 100 clock cycles
  • the processor wants to compute but must wait for data
  • caches - are they clever enough?
  • multiple cores - what to do with them?
  • threads become important

SLIDE 5

Slackness

Does latency imply inefficiency?

  • What to do instead of waiting? Some other thread
  • Are there parallel threads? Yes, PRAM algorithmics

Multiple threads per processor core: slackness

  • Is it technically possible to run multiple threads?

Bandwidth requirements for internal network

  • Any number of processors
  • Different structure of computer
  • New software (at least libraries)

SLIDE 6

PRAM

[Figure: PRAM model - processors P connected synchronously to a shared memory]

Multiple processors running synchronously on a shared memory:

proc compact(A)
  for i=0..n-1 pardo
    if A[i]=0 then C[i]=0 else C[i]=1
  E=prefix-sum(C)
  for i=0..n-1 pardo
    if A[i]<>0 then B[E[i]]=A[i]
  return B
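The compact procedure runs directly in sequential Python (the pardo loops become ordinary loops; on a PRAM all their iterations execute in one parallel step). The 1-based positions produced by the inclusive prefix sum are shifted by one for 0-based indexing:

```python
# Sequential rendering of the slide's PRAM compact procedure:
# remove the zeros from A, preserving the order of the non-zeros.

def prefix_sum(C):
    """Inclusive prefix sums: (C[0], C[0]+C[1], C[0]+C[1]+C[2], ...)."""
    out, total = [], 0
    for c in C:
        total += c
        out.append(total)
    return out

def compact(A):
    n = len(A)
    C = [0 if A[i] == 0 else 1 for i in range(n)]   # pardo: mark non-zeros
    E = prefix_sum(C)                               # target positions, 1-based
    B = [0] * (E[-1] if n else 0)
    for i in range(n):                              # pardo: scatter non-zeros
        if A[i] != 0:
            B[E[i] - 1] = A[i]                      # shift to 0-based index
    return B

print(compact([3, 0, 7, 0, 0, 5]))   # [3, 7, 5]
```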

SLIDE 7

PRAM continued

O(1) time, assuming prefix-sum runs in O(1) time:

prefix-sum(C) = (C[1], C[1]+C[2], C[1]+C[2]+C[3], ...)

A lot of progress in the '80s and '90s. Hypothesis: NC = P, where

NC = ∪_k ParTime(log^k n)

Hence, for most problems there are highly parallel algorithms.

Culler et al. 1993: PRAM is not realistic. Synchronous immediate access to memory is not possible. PRAM is passé! Try DMM!

Now: DMM is passé? Try PRAM!
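The O(1) prefix-sum assumption is a model choice; without it, a PRAM computes an inclusive prefix sum in O(log n) rounds by doubling. A sequential Python sketch of the doubling scheme (within each round all assignments are independent, so a PRAM executes the round in one step):

```python
# Inclusive prefix sum by doubling (Hillis-Steele scheme):
# after round d, x[i] holds the sum of the last min(i+1, 2d) inputs,
# so ceil(log2 n) rounds suffice.

def prefix_sum_doubling(C):
    x = list(C)
    d = 1
    while d < len(x):
        # all positions i >= d update simultaneously on a PRAM
        x = [x[i] + (x[i - d] if i >= d else 0) for i in range(len(x))]
        d *= 2
    return x

print(prefix_sum_doubling([1, 2, 3, 4]))   # [1, 3, 6, 10]
```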

SLIDE 8

Slackness

  • Assume the program uses sp virtual processors, while the computer has p real processors: we have slackness s in the computation.
  • Assume each data fetch requires φ hops in the network. In a time unit, a bandwidth need of pφ is created.
  • φ is not constant, therefore the network must be sparse, for example a sparse torus.
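A back-of-the-envelope check of the pφ argument (a sketch; the constants and the exact link count depend on the real topology and routing, and the O(p)-links figure below is an assumption for illustration):

```python
# p processors each keep one memory reference in flight; a reference
# travels phi hops, so the network must carry p * phi link-transmissions
# per time unit in steady state.

def demand(p, phi):
    return p * phi

# On a sqrt(p) x sqrt(p) mesh the average distance phi grows like
# sqrt(p), while the mesh has only about 2p links, so the per-link
# load grows like sqrt(p) - the network congests as p grows.
p = 64
phi = int(p ** 0.5)                 # ~8 hops on an 8x8 mesh
links = 2 * p                       # rough O(p) link count
print(demand(p, phi) / links)       # 4.0 per-link load; grows with p
```

Spending silicon on extra links instead of extra processors (a sparse torus) keeps the per-link load bounded, which is the point of the bullet above.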

SLIDE 9

PRAM on Chip

What changed in fifteen years?

  • DMM never became very popular
  • Dead end in commodity processor speedup
  • Space on chip ⇒ PRAM on chip becomes possible

PRAM on chip

  • Paraleap (Vishkin et al.)
  • our Eclipse (Forsell et al.)
  • our Moving threads (Leppänen et al.)

SLIDE 10

PRAM on Chip design challenges

  1. Enough parallelism to cover latency? Yes, by PRAM theory
  2. Enough communication bandwidth? Use a sparse network
  3. Efficient management of slackness in hardware?
  4. Programming not too difficult?

SLIDE 11

Paraleap

Vishkin’s XMT (Explicit Multi-Threading) model. Not as tightly synchronous as PRAM.

SLIDE 12

PRAM and XMT are similar

SLIDE 13

PRAM and XMT are different

SLIDE 14

Structure of Paraleap

[Figure: Paraleap architecture - processors (P) with caches (C) and memory modules (M), MTCU, prefix-sum unit (PSU), interconnection network, registers]

SLIDE 15

How does Paraleap work?

  • At a spawn, the TCU gets the number of parallel threads, and the TPUs get the code for running the threads
  • At the beginning, and whenever a thread completes, a TPU asks the TCU for a new thread
  • The TCU uses the prefix-sum for pointing to the next thread, if any remain
  • When all threads have completed, control returns to the MTCU
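The dispatch step above can be sketched with an atomic counter standing in for the hardware prefix-sum instruction (a sketch; the class and method names here are illustrative, and real XMT does the allocation in a single hardware ps operation rather than with a lock):

```python
# Sketch of prefix-sum thread dispatch: idle workers (playing the role
# of TPUs) repeatedly claim the next thread id from a shared counter.
import threading

class Spawner:
    def __init__(self, n_threads):
        self.n = n_threads
        self.next_id = 0
        self.lock = threading.Lock()

    def get_thread(self):
        """Called by an idle worker; returns a thread id, or None if done."""
        with self.lock:            # hardware: one atomic prefix-sum op
            if self.next_id >= self.n:
                return None        # all virtual threads dispatched
            tid = self.next_id
            self.next_id += 1
            return tid

results = []
def tpu(spawner):
    while (tid := spawner.get_thread()) is not None:
        results.append(tid * tid)  # the thread's body: square its id

sp = Spawner(100)
workers = [threading.Thread(target=tpu, args=(sp,)) for _ in range(4)]
for w in workers: w.start()
for w in workers: w.join()
print(sorted(results) == [i * i for i in range(100)])   # True
```

Every virtual thread is claimed exactly once regardless of how the workers interleave, which is the property the hardware prefix-sum provides.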

SLIDE 16

Implementation issues

  • Prefix-sum is actually implemented sequentially. It is claimed to be fast enough. Really? How scalable?
  • The internal network is a mesh of trees
  • Implemented on an FPGA (Field Programmable Gate Array) at 75 MHz
  • The current version has 64 TPUs in 4 clusters of 16 TPUs sharing some functional units and network access

SLIDE 17

Paraleap exists

SLIDE 18

Paraleap goes ASIC

SLIDE 19

Eclipse

  • strong PRAM models on chip
  • interleaved multithreading exploits the slackness of algorithms
  • chained sequential functional units
  • supports instruction-level parallelism of sequential code
  • sparse mesh
  • local memories and “scratchpads” (used for multioperations)
  • compiler and simulated runs exist
  • FPGA implementation planned

SLIDE 20

Structure of Eclipse

[Figure: Eclipse architecture - a sparse mesh of switches (S) connecting processor (P) / memory (M) nodes; scratchpad holding data, address, thread, and pending fields; fast memory bank, reply handling, ALU and mux logic]

SLIDE 21

Moving threads

  • Processors have local memory
  • For data access, the process moves with its environment registers to the processor that has the data
  • No two-way traffic for a read; fewer but bigger data packets
  • A tentative design exists, with software simulations
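The moving-threads idea can be sketched as follows (a toy model, not the actual design; the data layout, packet accounting, and all names are illustrative): memory is partitioned across nodes, and on a read the thread's context migrates to the owning node instead of sending a request and waiting for a reply.

```python
# Toy sketch of moving threads: a read costs one one-way migration
# packet (context + registers) instead of a request/reply round trip.

NODES = 4
WORDS_PER_NODE = 256
memory = [[0] * WORDS_PER_NODE for _ in range(NODES)]

def home_node(addr):
    """Which node owns this address (block-partitioned memory)."""
    return addr // WORDS_PER_NODE

def read(thread, addr):
    """Migrate the thread to the data's home node, then read locally."""
    target = home_node(addr)
    if thread["node"] != target:
        thread["packets"] += 1     # one migration packet, no reply needed
        thread["node"] = target
    return memory[target][addr % WORDS_PER_NODE]

memory[3][5] = 42
t = {"node": 0, "packets": 0}
print(read(t, 3 * WORDS_PER_NODE + 5), t["node"], t["packets"])   # 42 3 1
```

A subsequent read of nearby data is then local and free of network traffic, which is where the "fewer but bigger packets" trade-off pays off.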

SLIDE 22

CUDA project

  • use an NVIDIA graphics processor as a shared-memory parallel computer
  • cheap processing power
  • special libraries written

SLIDE 23

Conclusions

  • PRAM on chip seems feasible
  • Breakthrough?
  • A lot of work remains to be done
  • For a popular introduction in Karelian (“luvekkua karjalakse”, roughly “read it in Karelian”), see http://opastajat.net (the same appeared in Finnish in Tietojenkäsittelytiede)
