
COSC 5351 Advanced Computer Architecture Slides modified from Hennessy CS252 course slides

• MP Motivation
• SISD v. SIMD v. MIMD
• Centralized vs. Distributed Memory
• Challenges to Parallel Programming
• Consistency, Coherency, Write Serialization
• Write Invalidate Protocol
• Example
• Conclusion

COSC5351 Advanced Computer Architecture, 3/19/2012

[Figure: Performance (vs. VAX-11/780) on a log scale from 1 to 10000, plotted by year from 1978 to 2006, with growth segments of 25%/year, 52%/year, and ??%/year and a "3X" annotation. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006.]

• VAX: 25%/year 1978 to 1986
• RISC + x86: 52%/year 1986 to 2002
• RISC + x86: ??%/year 2002 to present

• Growth in data-intensive applications
  ◦ Databases, file servers, …
• Growing interest in servers and server performance
• Increasing desktop performance is less important
  ◦ Outside of graphics
• Improved understanding of how to use multiprocessors effectively
  ◦ Especially servers, where there is significant natural TLP
• Advantage of leveraging design investment by replication
  ◦ Rather than a unique design
• Power consumption concerns
  ◦ Increasing ILP => less efficient

M.J. Flynn, "Very High-Speed Computing Systems", Proc. of the IEEE, V 54, 1901-1909, Dec. 1966.

• Flynn classified by data & control streams (1966)
  ◦ SISD: Single Instruction, Single Data (uniprocessor)
  ◦ SIMD: Single Instruction, Multiple Data (single PC: Vector, CM-2)
  ◦ MISD: Multiple Instruction, Single Data (????)
  ◦ MIMD: Multiple Instruction, Multiple Data (clusters, SMP servers)
• SIMD ⇒ Data Level Parallelism
• MIMD ⇒ Thread Level Parallelism
• MIMD popular because
  ◦ Flexible: runs N independent programs or 1 multithreaded program
  ◦ Cost-effective: same MPU in desktop & MIMD

• “A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.”
• Parallel Architecture = Computer Architecture + Communication Architecture
• 2 classes of multiprocessors WRT memory:
  1. Centralized Memory Multiprocessor
     • < few dozen processor chips (and < 100 cores) in 2006
     • Small enough to share a single, centralized memory
  2. Physically Distributed-Memory Multiprocessor
     • Larger number of chips and cores than 1.
     • BW demands ⇒ memory distributed among processors

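[Figure: Centralized Memory (processors P1 … Pn, each with a cache $, reaching shared Mem through an interconnection network) beside Distributed Memory (processors P1 … Pn, each with a cache $ and its own Mem, joined by an interconnection network), with a "Scale" arrow between the two organizations.]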

Centralized memory multiprocessors:

• Also called symmetric multiprocessors (SMPs) because the single main memory has a symmetric relationship to all processors
• Large caches ⇒ a single memory can satisfy the memory demands of a small number of processors
• Can scale to a few dozen processors by using a switch and many memory banks
• Although scaling beyond that is technically conceivable, it becomes less attractive as the number of processors sharing the centralized memory increases

Distributed-memory multiprocessors:

• Pro: Cost-effective way to scale memory bandwidth
  ◦ If most accesses are to local memory
• Pro: Reduces latency of local memory accesses
• Con: Communicating data between processors is more complex
• Con: Must change software to take advantage of the increased memory BW

1. Communication occurs by explicitly passing messages among the processors: message-passing multiprocessors
2. Communication occurs through a shared address space (via loads and stores): shared memory multiprocessors, either
   • UMA (Uniform Memory Access time) for shared address, centralized memory MP
   • NUMA (Non-Uniform Memory Access time multiprocessor) for shared address, distributed memory MP

• First challenge is the % of the program that is inherently sequential
• Suppose 80X speedup from 100 processors. What fraction of the original program can be sequential?
  a. 10%
  b. 5%
  c. 1%
  d. <1%

80X with 100 CPUs:

$$\text{Speedup}_{\text{overall}} = \frac{1}{\left(1 - \text{Fraction}_{\text{enhanced}}\right) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$$

Assume parallel operations use all processors and the others use one processor, so the parallel speedup would be the number of processors (100):

$$80 = \frac{1}{\left(1 - \text{Fraction}_{\text{parallel}}\right) + \dfrac{\text{Fraction}_{\text{parallel}}}{100}}$$

$$80 \times \left( \left(1 - \text{Fraction}_{\text{parallel}}\right) + \frac{\text{Fraction}_{\text{parallel}}}{100} \right) = 1$$

$$79 = 80 \times \text{Fraction}_{\text{parallel}} - 0.8 \times \text{Fraction}_{\text{parallel}}$$

$$\text{Fraction}_{\text{parallel}} = 79 / 79.2 = 99.75\%$$
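The algebra above can be checked numerically. The following is a minimal sketch, assuming the same model as the slide (the parallel part runs on all 100 processors, the rest on one); the function names are illustrative and not part of the original slides.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Overall speedup when parallel_fraction of the work uses all processors."""
    sequential_fraction = 1.0 - parallel_fraction
    return 1.0 / (sequential_fraction + parallel_fraction / processors)


def required_parallel_fraction(target_speedup: float, processors: int) -> float:
    """Invert Amdahl's Law for the parallel fraction needed to hit a target."""
    # From S = 1 / ((1 - F) + F / N):  F = (1 - 1/S) / (1 - 1/N)
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / processors)


fraction = required_parallel_fraction(80, 100)
print(f"parallel fraction:   {fraction:.4%}")
print(f"sequential fraction: {1 - fraction:.4%}")
print(f"speedup check:       {amdahl_speedup(fraction, 100):.2f}")
```

Under these assumptions the sequential fraction comes out well below 1%, which corresponds to answer d in the question.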

• Second challenge is the long latency to remote memory
• Suppose a 32 CPU MP, 2 GHz, 200 ns remote memory, all local accesses hit in the memory hierarchy, and base CPI is 0.5. (Remote access = 200 ns / 0.5 ns per cycle = 400 clock cycles.)
• What is the performance impact if 0.2% of instructions involve a remote access?
  a. 1.5X
  b. 2.0X
  c. 2.5X

• 32 CPU MP, 2 GHz, 200 ns remote memory, all local accesses hit in the memory hierarchy, and base CPI is 0.5.
• Remote access = 400 cycles (200 ns × 2 GHz = 200 ns × 2 cycles/ns = 400 cycles)
  ◦ What is the performance impact if 0.2% of instructions involve a remote access?
• CPI = Base CPI + Remote request rate × Remote request cost
• CPI = 0.5 + 0.2% × 400 = 0.5 + 0.8 = 1.3
• The machine with no communication is 1.3/0.5 = 2.6X faster than the one where 0.2% of instructions involve a remote access
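The same arithmetic as a short script. This is a sketch under the slide's assumptions (every non-remote access hits locally and each remote access stalls for the full latency); `effective_cpi` and its parameter names are hypothetical.

```python
def effective_cpi(base_cpi: float, remote_fraction: float,
                  remote_latency_ns: float, clock_ghz: float) -> float:
    """CPI = base CPI + remote request rate x remote request cost (in cycles)."""
    # 1 GHz is one cycle per nanosecond, so ns * GHz gives cycles.
    remote_cost_cycles = remote_latency_ns * clock_ghz
    return base_cpi + remote_fraction * remote_cost_cycles


base_cpi = 0.5
cpi = effective_cpi(base_cpi, remote_fraction=0.002,
                    remote_latency_ns=200, clock_ghz=2.0)
slowdown = cpi / base_cpi

print(f"remote access cost: {200 * 2.0:.0f} cycles")
print(f"effective CPI:      {cpi:.1f}")
print(f"slowdown:           {slowdown:.1f}X")
```

Varying `remote_fraction` shows how sensitive the result is: even a fraction of a percent of remote accesses dominates the base CPI at this latency.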

1. Application parallelism ⇒ primarily via new algorithms that have better parallel performance
2. Long remote latency impact ⇒ both by the architect and by the programmer
   ◦ For example, reduce the frequency of remote accesses either by
     • Caching shared data (HW)
     • Restructuring the data layout to make more accesses local (SW)
   ◦ We will learn about how to use HW to help latency via caches

• From multiple boards on a shared bus to multiple processors inside a single chip
• Caches hold both
  ◦ Private data, used by a single processor
  ◦ Shared data, used by multiple processors
• Caching shared data ⇒ reduces latency to shared data, memory bandwidth for shared data, and interconnect bandwidth, but creates the cache coherence problem

[Figure: processors P1, P2, and P3, each with a private cache ($), sharing Memory and I/O devices. Memory holds u:5. Events 1 and 2 load u:5 into two of the caches, event 3 writes u = 7 in one cache, and events 4 and 5 are later reads of u by other processors that return u = ?]

• Processors see different values for u after event 3
• With write-back caches, the value written back to memory depends on the happenstance of which cache flushes or writes back its value, and when
  ◦ Processes accessing main memory may see a very stale value
• Unacceptable for programming, and it's frequent!
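The stale-value scenario can be reproduced with a toy model. This is a minimal sketch, not a real cache simulator: it assumes private write-back caches with no coherence protocol at all, and it assumes events 1 and 2 are reads by P1 and P3, event 3 is a write by P3, and events 4 and 5 are reads by P1 and P2. The class and variable names are hypothetical.

```python
class Memory:
    """Main memory holding the shared variable u, initially 5."""
    def __init__(self):
        self.data = {"u": 5}


class Cache:
    """Private write-back cache with NO coherence protocol (toy model)."""
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}     # address -> locally cached value
        self.dirty = set()  # addresses modified locally, not yet written back

    def read(self, addr):
        if addr not in self.lines:           # miss: fetch from main memory
            self.lines[addr] = self.memory.data[addr]
        return self.lines[addr]              # hit: return the local copy

    def write(self, addr, value):
        self.lines[addr] = value             # only the local copy is updated
        self.dirty.add(addr)

    def flush(self, addr):
        if addr in self.dirty:               # write back on eviction/flush
            self.memory.data[addr] = self.lines[addr]
            self.dirty.discard(addr)


memory = Memory()
p1, p2, p3 = Cache(memory), Cache(memory), Cache(memory)

p1.read("u")                 # event 1: P1 caches u = 5
p3.read("u")                 # event 2: P3 caches u = 5
p3.write("u", 7)             # event 3: P3 writes u = 7 in its own cache only
seen_by_p1 = p1.read("u")    # event 4: P1 hits its stale cached copy
seen_by_p2 = p2.read("u")    # event 5: P2 misses and reads stale memory

print("P1 sees", seen_by_p1, "| P2 sees", seen_by_p2,
      "| P3 sees", p3.read("u"), "| memory holds", memory.data["u"])
```

Calling `p3.flush("u")` afterwards updates memory, but P1 and P2 keep returning their cached copies, which is the gap a coherence protocol has to close.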

[Figure: processor P above a hierarchy of L1 (100:67), L2 (100:35), Memory (100:34), and Disk, annotated "This process should see value written immediately".]

• Coherent if: reading an address should return the last value written to that address
  ◦ Easy in uniprocessors, except for I/O
• Too vague and simplistic; 2 issues
  1. Coherence defines the values returned by a read
  2. Consistency determines when a written value will be returned by a read
• Coherence defines behavior to the same location; Consistency defines behavior to other locations

/* Assume initial value of A and flag is 0 */

P1:  A = 1;
     flag = 1;

P2:  while (flag == 0);  /* spin idly */
     print A;

An analogy for the P1/P2 flag example:

• Burak is meeting Lina at a restaurant and he arrives first
  ◦ He walks by the specials board, which says Tuna
• The tuna sells out, so the restaurant changes the sign to Salmon
• Lina shows up and sees Salmon
• Burak waits for Lina to decide; she says she'll have the special
• What does Burak think she is ordering?

For the P1/P2 flag example:

• The intuition (P2 prints 1) is not guaranteed by coherence
• We expect memory to respect the order between accesses to different locations issued by a given process
  ◦ and to preserve order among accesses to the same location by different processes
• Coherence is not enough!
  ◦ It pertains only to a single location

[Figure: conceptual picture of processors P1 … Pn sharing a single Mem.]
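A small enumeration makes the point concrete. This is a toy model, not a description of any particular hardware: it assumes P1's two writes become visible to P2 independently and in either order, and that P2 reads A as soon as it observes flag == 1. The function name `p2_prints` is hypothetical.

```python
from itertools import permutations


def p2_prints(arrival_order):
    """Value P2 prints if P1's writes become visible to P2 in arrival_order."""
    view = {"A": 0, "flag": 0}        # P2's view of memory; both start at 0
    for location in arrival_order:
        view[location] = 1            # this write from P1 reaches P2
        if view["flag"] == 1:         # P2's spin loop exits ...
            return view["A"]          # ... and P2 immediately prints A
    raise AssertionError("flag never became visible to P2")


# Coherence orders writes to one location only, so both arrival orders
# for the two different locations A and flag are allowed in this model.
outcomes = {order: p2_prints(order) for order in permutations(["A", "flag"])}
for order, value in outcomes.items():
    print(" then ".join(order), "-> P2 prints", value)
```

Each location on its own behaves coherently in both orderings, yet one ordering prints the old value of A. Ruling that ordering out is the job of the consistency model, not of coherence.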
