HiDISC: A De coup l ed A r c h i t e c t u r e f o - PowerPoint PPT Presentation

HiDISC: A De coup l ed A r c h i t e c t u r e f o r App l i c a t i on s i n Da t a I n t e n s i ve Comput i n g Alvin M. Despain, Jean-Luc Gaudiot, Manil Makhija and W onwoo Ro University of Southern California http:/ / www-pdpc.usc.edu 19 May 2000

USC USC HiDISC: Hierarchical Decoupled Instruction Set Computer UNIVERSITY UNIVERSITY OF SOUTHERN OF SOUTHERN Senso r CALIFORNIA CALIFORNIA I npu t s New Ideas App l i c a t i on App l i c a t i on (FL IR SAR V IDEO ATR / SLD Sc i en t i f i c ) (FL IR SAR V IDEO ATR / SLD Sc i en t i f i c ) • A dedicated processor for each level of the Decoup l i ng Comp i l e r Decoup l i ng Comp i l e r memory hierarchy • Explicitly manage each level of the memory hierarchy using instructions generated by the compiler • Hide memory latency by converting data Dynamic P roce s so r P roce s so r P roce s so r Databas e access predictability to data access locality Reg i s t e r s • Exploit instruction-level parallelism without Cache HiDISC P roc e s so r extensive scheduling hardware Memory • Zero overhead prefetches for maximal computation throughput S i t ua t i ona l Awarene s s Schedule Impact • 2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor • De f i n ed b enchma rk s • Con t i nu e s imu l a t i on s • Deve l op a nd t e s t a • Comp l e t e d s imu l a t o r o f mo r e b enchma rk s f u l l d e coup l i ng c omp i l e r • 7.4x speedup for matrix multiply over an in-order • P e r f o rmed i n s t r u c t i on - l e v e l ( SAR) • Gene r a t e p e r f o rmance s imu l a t i on s on h and - comp i l e d • Def i n e H iD ISC s t a t i s t i c s a nd ev a l u a t e issue superscalar processor b e n chma rk s a r c h i t e c t u r e d e s i gn • Upda t e S imu l a t o r • 2.6x speedup for matrix decomposition/substitution over an in-order issue superscalar processor • Reduced memory latency for systems that have high memory bandwidths (e.g. PIMs, RAMBUS) Apr i l 9 8 Apr i l 9 9 Apr i l 0 0 May 01 S t a r t End • Allows the compiler to solve indexing functions for irregular applications • Reduced system cost for high-throughput scientific codes University of Southern California, Alvin M. Despain, Jean-Luc Gaudiot, Manil Makhija and W onwoo Ro

USC USC HiDISC: Hierarchical Decoupled Instruction Set Computer UNIVERSITY UNIVERSITY OF SOUTHERN OF SOUTHERN CALIFORNIA CALIFORNIA Senso r I npu t s App l i c a t i on App l i c a t i on (FL IR SAR V IDEO ATR / SLD Sc i en t i f i c ) (FL IR SAR V IDEO ATR / SLD Sc i en t i f i c ) Decoup l i ng Comp i l e r Decoup l i ng Comp i l e r Dynamic P roce s so r P roce s so r P roce s so r Databas e Reg i s t e r s Cache HiDISC P roc e s so r Memory S i t ua t i ona l Awarene s s Technological Trend: Memory latency is getting longer relative to microprocessor speed (40% per year) Problem: Some SPEC benchmarks spend more than half of their time stalling [Lebeck and Wood 1994] Domain: benchmarks with large data sets: symbolic, signal processing and scientific programs Present Solutions: Multithreading (Homogenous), Larger Caches, Prefetching, Software Multithreading University of Southern California, Alvin M. Despain, Jean-Luc Gaudiot, Manil Makhija and W onwoo Ro

USC USC Present Solutions UNIVERSITY UNIVERSITY OF SOUTHERN OF SOUTHERN CALIFORNIA CALIFORNIA Solution Limitations Larger Caches — Slow — W orks well only if working set fits cache and there is temporal locality. — Cannot be tailored for each application Hardware Prefetching — Behavior based on past and present execution- time behavior — Ensure overheads of prefetching do not outweigh Software Prefetching the benefits > conservative prefetching — Adaptive software prefetching is required to change prefetch distance during run-time — Hard to insert prefetches for irregular access patterns — Solves the throughput problem, not the memory latency problem M ultithreading University of Southern California, Alvin M. Despain, Jean-Luc Gaudiot, Manil Makhija and W onwoo Ro

USC USC The HiDISC Approach UNIVERSITY UNIVERSITY OF SOUTHERN OF SOUTHERN CALIFORNIA CALIFORNIA Obse rv a t i on : • So f twa r e p r e f e t ch i ngimpac t s c o mpu t e p e r f o rmance • P IMs and RA M BUS o f f e r a h i g h -bandw id t h memory s y s t em -u s e f u l f o r s p e cu l a t i v e p r e f e t c h i n g App roach : • Add a p r o c e s so r t o manage p r e f e t c h i ng - > h i d e ove r h e ad • Comp i l e r e xp l i c i t l y manage s t h e memory h i e r a r c hy • P r e f e t c h d i s t a n c e a d ap t s t o t h e p r og r am r un t ime behav i o r University of Southern California, Alvin M. Despain, Jean-Luc Gaudiot, Manil Makhija and W onwoo Ro

USC USC W hat’s HiDISC UNIVERSITY UNIVERSITY OF SOUTHERN OF SOUTHERN CALIFORNIA CALIFORNIA Computation • A dedicated processor for each Instructions Compu t a t i o n Compu t a t i o n level of the memory hierarchy P roc e s s o r (CP ) P roc e s s o r (CP ) • Explicitly manage each level of Reg i s t e r s the memory hierarchy using instructions generated by the Access compiler Instructions Acce s s P r o c e s s o r P rog r am Comp i l e r Acce s s P r o c e s s o r (AP ) (AP ) • Hide memory latency by converting data access predictability to data access Cache locality (Just in Time Fetch) Cache Mgmt. Instructions • Exploit instruction-level Cache Mgmt . parallelism without extensive Cache Mgmt . P roce s so r P roce s so r scheduling hardware (CMP) (CMP) • Zero overhead prefetches for maximal computation throughput 2nd -Leve l Cache and Ma in Memory University of Southern California, Alvin M. Despain, Jean-Luc Gaudiot, Manil Makhija and W onwoo Ro

USC USC Decoupled Architectures UNIVERSITY UNIVERSITY OF SOUTHERN OF SOUTHERN CALIFORNIA CALIFORNIA 5-way 3-way 2-way 8-way Computation Computation Processor Processor Processor (CP) Processor (CP) Registers Registers Registers Registers Store Address Store Data Store Address Store Data Load Load Queue Queue Queue Queue Queue Queue 3-way Access Access Processor (AP) Processor (AP) 5-way Slip Control Slip Control Cache Cache Queue Cache Queue Cache 3-way 3-way Cache Management Cache Management Processor (CMP) Processor (CMP) Second-Level Cache Second-Level Cache Second-Level Cache Second-Level Cache and Main Memory and Main Memory and Main Memory and Main Memory MIPS DEAP CAPP HiDISC ( n ew) (Conven t i ona l ) (Decoup l ed ) (Decoup l ed ) (Decoup l ed ) University of Southern California, Alvin M. Despain, Jean-Luc Gaudiot, Manil Makhija and W onwoo Ro

USC USC Slip Control Queue UNIVERSITY UNIVERSITY OF SOUTHERN OF SOUTHERN CALIFORNIA CALIFORNIA • The Slip Control Queue (SCQ) adapts dynamically if (prefetch_buffer_full ()) Don’t change size of SCQ; else if ((2* late_prefetches) > useful_prefetches) Increase size of SCQ; else Decrease size of SCQ; – Late prefetches = prefetched data arrived after load had been issued – Useful prefetches = prefetched data arrived before load had been issued University of Southern California, Alvin M. Despain, Jean-Luc Gaudiot, Manil Makhija and W onwoo Ro

Decoupling Programs for HiDISC-3 USC USC (Discrete Convolution - Inner Loop) UNIVERSITY UNIVERSITY OF SOUTHERN OF SOUTHERN CALIFORNIA CALIFORNIA while (not end of loop) y = y + (x * h); send y to SDQ Computation Processor Code for (j = 0; j < i; ++j) { load (x[j]); load (h[i-j-1]); for (j = 0; j < i; ++j) GET_SCQ; y[i]=y[i]+(x[j]* h[i-j-1]); } send (EOD token) Inner Loop Convolution send address of y[i] to SAQ A ccess Processor Code for (j = 0; j < i; ++j) { prefetch (x[j]); prefetch (h[i-j-1]; PUT_SCQ; } Cache Management Code University of Southern California, Alvin M. Despain, Jean-Luc Gaudiot, Manil Makhija and W onwoo Ro

USC USC Benchmarks UNIVERSITY UNIVERSITY OF SOUTHERN OF SOUTHERN CALIFORNIA CALIFORNIA Benchmark Source of Lines of Description Data Benchmark Source Set Size Code LLL1 Livermore Loops 20 1024-element 24 KB [45] arrays, 100 iterations LLL2 Livermore Loops 24 1024-element 16 KB arrays, 100 iterations LLL3 Livermore Loops 18 1024-element 16 KB arrays, 100 iterations LLL4 Livermore Loops 25 1024-element 16 KB arrays, 100 iterations LLL5 Livermore Loops 17 1024-element 24 KB arrays, 100 iterations Tomcatv SPECfp95 [68] 190 33x33-element <64 KB matrices, 5 iterations MXM NAS kernels [5] 113 Unrolled matrix 448 KB multiply, 2 iterations CHOLSKY NAS kernels 156 Cholesky matrix 724 KB decomposition VPENTA NAS kernels 199 Invert three 128 KB pentadiagonals simultaneously Qsort Quicksort sorting 58 Quicksort 128 KB algorithm [14] University of Southern California, Alvin M. Despain, Jean-Luc Gaudiot, Manil Makhija and W onwoo Ro

HiDISC: A De coup l ed A r c h i t e c t u r e f o - PowerPoint PPT Presentation

HiDISC: A De coup l ed A r c h i t e c t u r e f o r App l i c a t i on s i n Da t a I n t e n s i ve Comput i n g Alvin M. Despain, Jean-Luc Gaudiot, Manil Makhija and W onwoo Ro University of

App App App App App App App App App App App App App App App App App App App App App App

H-COUP 1. H-COUP ? 2. H-COUP ? 3. H-COUP

H-COUP toward version 2 1. Introduction: top-down vs. bottom-up Kentarou Mawatari

Horizontal Vertically integrated Open interfaces Closed, proprietary Rapid innovation Slow

Hindsight to Foresight Where have we come from, where are we heading? | 14/06/2015 Andy

Sefos A self-aware factored operating system A Traditional OS App 1 App 2 App 3 System call

WI-FI SMART APPLICATION Download the app Use this QR code tho download Wifi Smart app for OS

MOBILE APP PDF Client Presentation FREE APP FOR ALL The ultimate Mobile App for Fans, Drivers

MOBILE APP TUTORIAL TPL Trakker Mobile App How to Download the Mobile App? Step 1: Tap on

Todays Agenda: Discuss College Application Process: Common App, Coalition App, School App

Communis Smart App is an app that is both revolutionary and convenient to use. It is an app that

Google App Engine Guido van Rossum Stanford EE380 Colloquium, Nov 5, 2008 Google App Engine

The Four Steps 1 Solve the problem. 2 Write the app. 3 Compile the app. 4 Run the app. CSE 1020

Membership # 11BCAM831609 TURKEY: ZIKA SOUTH SUDAN CRISIS ATTEMPTED COUP SICKNESS EUROPE

La coqueluche mondiale des espaces collaboratifs ( coworking) a choisi la Place Ville Marie pour

Structural Power of Thai Internet System before 2006-Coup d' Etat Chanchai Chaisukkosol

MATHEMATICAL THINKING A guest lecture by Mr. Chase Is mathematics invented or discovered?

Specifying and Analysing Networks of Processes in CSP T (or In Search of Associativity) Paul

performant category-theory library in Coq Jason Gross, Adam Chlipala, David I. Spivak

Solution of Algebraic Equations by Using Autonomous Computational Methods Andrew Pownuk 1 , Jose

A Brief Exploration of Normed Division Algebras From R to O (and beyond?) Riley Potts May 2020

Enumerating quasitrivial semigroups MALOTEC Jimmy Devillet in collaboration with Miguel Couceiro

Algebra in Flatland Connie Dennis, Cassie Stamper Department of Mathematics Kansas State

Associativity on Numerical Computations on Massively Multithreaded Systems Daniel Chavarra,

HiDISC: A De coup l ed A r c h i t e c t u r e f o - PowerPoint PPT Presentation

HiDISC: A De coup l ed A r c h i t e c t u r e f o r App l i c a t i on s i n Da t a I n t e n s i ve Comput i n g Alvin M. Despain, Jean-Luc Gaudiot, Manil Makhija and W onwoo Ro University of

App App App App App App App App App App App App App App App App App App App App App App

H-COUP 1. H-COUP ? 2. H-COUP ? 3. H-COUP

H-COUP toward version 2 1. Introduction: top-down vs. bottom-up Kentarou Mawatari

Horizontal Vertically integrated Open interfaces Closed, proprietary Rapid innovation Slow

Hindsight to Foresight Where have we come from, where are we heading? | 14/06/2015 Andy

Sefos A self-aware factored operating system A Traditional OS App 1 App 2 App 3 System call

WI-FI SMART APPLICATION Download the app Use this QR code tho download Wifi Smart app for OS

MOBILE APP PDF Client Presentation FREE APP FOR ALL The ultimate Mobile App for Fans, Drivers

MOBILE APP TUTORIAL TPL Trakker Mobile App How to Download the Mobile App? Step 1: Tap on

Todays Agenda: Discuss College Application Process: Common App, Coalition App, School App

Communis Smart App is an app that is both revolutionary and convenient to use. It is an app that

Google App Engine Guido van Rossum Stanford EE380 Colloquium, Nov 5, 2008 Google App Engine

The Four Steps 1 Solve the problem. 2 Write the app. 3 Compile the app. 4 Run the app. CSE 1020

Membership # 11BCAM831609 TURKEY: ZIKA SOUTH SUDAN CRISIS ATTEMPTED COUP SICKNESS EUROPE

La coqueluche mondiale des espaces collaboratifs ( coworking) a choisi la Place Ville Marie pour

Structural Power of Thai Internet System before 2006-Coup d' Etat Chanchai Chaisukkosol

MATHEMATICAL THINKING A guest lecture by Mr. Chase Is mathematics invented or discovered?

Specifying and Analysing Networks of Processes in CSP T (or In Search of Associativity) Paul

performant category-theory library in Coq Jason Gross, Adam Chlipala, David I. Spivak

Solution of Algebraic Equations by Using Autonomous Computational Methods Andrew Pownuk 1 , Jose

A Brief Exploration of Normed Division Algebras From R to O (and beyond?) Riley Potts May 2020

Enumerating quasitrivial semigroups MALOTEC Jimmy Devillet in collaboration with Miguel Couceiro

Algebra in Flatland Connie Dennis*, Cassie Stamper* Department of Mathematics Kansas State

Associativity on Numerical Computations on Massively Multithreaded Systems Daniel Chavarra,

Algebra in Flatland Connie Dennis, Cassie Stamper Department of Mathematics Kansas State