Parallel Computing: Opportunities and Challenges
Victor Lee, Parallel Computing Lab (PCL), Intel
Who We Are: Parallel Computing Lab
- Parallel Computing ‐‐ Research to Realization
– Worldwide leadership in throughput/parallel computing; industry role‐model for application‐driven architecture research, ensuring Intel leadership for this application segment
– Dual charter: application‐driven architecture research and multicore/manycore product‐intercept opportunities
- Workload focus:
– Multimodal real‐time physical simulation, Behavioral simulation, Interventional medical imaging, Large‐scale optimization (FSI), Massive data computing, non‐numeric computing
- Industry and academic co‐travelers
– Mayo, HPI, CERN, Stanford (Prof. Fedkiw), UNC (Prof. Manocha), Columbia (Prof. Broadie)
- Architectural focus:
– “Feeding the beast” (memory) challenge, unstructured accesses, domain‐specific support, massively threaded machines
- Recent accomplishments:
- First TFlop SGEMM and highest performing SparseMVM on KNF silicon demo’ed at SC’09
- Fastest LU/Linpack demo on KNF at ISC’10
- Fastest search, sort, and relational join – Best Paper Award for Tree Search at SIGMOD 2010
Victor.W.Lee@intel.com 2
Motivations
- Exponential growth of digital devices
  – Explosion of the amount of digital data
- Popularity of the World Wide Web
  – Changing the demographics of computer users
- Limited frequency scaling for a single core
  – Performance improvement via increasing core count
What these lead to
Massive data needs massive computing to process
=> Birth of multi‐/many‐core architecture
=> Parallel computing
The Opportunities
What can parallel computing do for us?
Semantic Barrier
Norman’s Gulfs: between the Human’s Conceptual Model and the Computer’s Simulated Model
  – Gulf of Evaluation (evaluation gap)
  – Gulf of Execution (execution gap)
- Lowering the semantic barrier => make computers solve problems the human way => makes it easier for humans to use computers
Model Driven Analytics
- Data‐driven models are now tractable and usable
  – We are not limited to analytical models any more
  – No need to rely on heuristics alone for unknown models
  – Massive data offers new algorithmic opportunities
- Many traditional compute problems worth revisiting
- Web connectivity significantly speeds up model training
- Real‐time connectivity enables continuous model refinement
  – A poor model is an acceptable starting point
  – Classification accuracy improves over time
Interactive RMS Loop
Recognition – Mining – Synthesis
What is …? Is it …? What if …?
Find an existing model instance / Create a new model instance
Most RMS apps are about enabling an interactive (real‐time) RMS Loop (iRMS)
Feb 7, 2007, Pradeep K. Dubey, pradeep.dubey@intel.com
RMS Example: Future Medicine
Recognition Mining Synthesis
What is a tumor? Is there a tumor here?
What if the tumor progresses?
It is all about dealing efficiently with complex multimodal datasets
Images courtesy: http://splweb.bwh.harvard.edu:8000/pages/images_movies.html
RMS Example: Future Entertainment
Recognition Synthesis Mining
Who are Shrek, Fiona, and Prince Charming? What is the story‐net? When does Shrek first meet Fiona’s parents? What if Shrek were to reach late? What if Fiona didn’t believe Prince Charming?
Tomorrow’s interactions and collaborations: interactive story‐nets, multi‐party real‐time collaboration in movies, games, and strategy simulations
Opportunities (Summary)
- More data
  – Model‐driven analytics
- More computing
  – Interactive RMS loops
- Lower computing barrier
  – Computers easier to use for the masses
The Challenges
Why is parallel computing hard?
Multi‐Core / Many‐Core Era
Single Core → Multi‐Core → Many‐Core
Multi‐core / many‐core provides more compute capability within the same area / power
4/21/2011 Intel Confidential 15
Architecture Trends
- Rapidly Increasing Compute
– Core scaling: Nhm (4 cores) → Wsm (6 cores) → … → Intel Knights Ferry (32 cores) → …
– Data‐level parallelism (SIMD) scaling: SSE (128 bits) → AVX (256 bits) → … → LRBNI (512 bits) → …
- Increasing Memory Bandwidth, But…
– Not keeping pace with the compute increase
– Used to be 1 byte/flop
– Current: Wsm (0.21 bytes/flop); AMD Magny Cours (0.20 bytes/flop); NVIDIA GTX 480 (0.13 bytes/flop)
– Future: 0.05 bytes/flop (GPUs, 2017) (ref: Bill Dally, SC’09)
One clear trend: More cores in processors
Architecture Trend

                         Intel Core i7 990X (a.k.a. Westmere)   Intel KNF
Sockets                  2                                      1
Cores/socket             6                                      32
Core Frequency (GHz)     3.3                                    1.2
SIMD Width               4                                      16
Peak Compute             316 GFLOPS                             1,228 GFLOPS
Increase in compute comes from more cores and wider SIMD
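As a sanity check on the table, peak compute follows from sockets × cores × frequency × SIMD width; a small sketch (the factor of 2 is our assumption of one multiply plus one add issuing per SIMD lane per cycle, which is not stated on the slide):

```python
# Peak GFLOPS = sockets * cores/socket * GHz * SIMD width * flops-per-lane.
# flops_per_lane=2 assumes a multiply and an add can issue each cycle (assumption).
def peak_gflops(sockets, cores_per_socket, ghz, simd_width, flops_per_lane=2):
    return sockets * cores_per_socket * ghz * simd_width * flops_per_lane

print(round(peak_gflops(2, 6, 3.3, 4), 1))    # Core i7 990X pair: 316.8, matching ~316 GFLOPS
print(round(peak_gflops(1, 32, 1.2, 16), 1))  # KNF: 1228.8, matching ~1,228 GFLOPS
```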
Implication: Need to start programming for parallel architectures
Parallel Programming
- What’s hard about it?
  – We don’t think in parallel
  – Parallel algorithms are after‐thoughts
Parallel Programming
- The best serial code doesn’t always scale well to a large number of processors
Scalability for Multi‐Core
- Amdahl’s law for multi‐core architecture: total runtime splits into a serial component and a parallel component; only the parallel component shrinks as cores are added
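A minimal sketch of the law the slide refers to, in its standard form, with P the parallel fraction and N the core count (the numbers below are purely illustrative):

```python
# Amdahl's law: speedup is capped by the serial fraction (1 - P).
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# Even with 95% of the work parallelized, 32 cores give far less than 32x:
print(round(amdahl_speedup(0.95, 32), 1))  # 12.5
```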
Scalability of Many‐Core
- Amdahl’s law for many‐core architecture: again a serial component and a parallel component, now scaled by the perf. ratio between one core of a single‐core processor and one core of a many‐core processor
A significant portion of an application must be parallelized to achieve good scaling
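One common textbook form of the many‐core variant (e.g. Hill and Marty’s formulation; the exact formula on the original slide is not recoverable, so this is an assumption): each small core runs at a fraction r of a big core’s speed, so the serial part pays 1/r while the parallel part gets N small cores.

```python
# Many-core Amdahl sketch (assumed Hill-and-Marty-style form, not from the slide).
# r is the per-core perf. ratio of a many-core core to a single-core processor's core.
def manycore_speedup(p, n, r):
    return 1.0 / ((1.0 - p) / r + p / (n * r))

# 32 small cores at 1/3 the per-core speed, 95% parallel work (illustrative):
print(round(manycore_speedup(0.95, 32, 1.0 / 3.0), 2))  # 4.18
```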
Challenges (Summary)
- Architecture changes for many‐core
  – Compute density vs. compute efficiency
  – Data management: feeding the beast
- Algorithms
  – Is the best scalar algorithm suitable for parallel computing?
- Programming model
  – Humans tend to think in sequential steps; parallel computing is not natural
  – Non‐ninja parallel programming
Our Approach
Application‐Specific HW/SW Co‐design
Our Approach: App‐Arch Co‐Design
Architecture‐aware analysis of the computational needs of parallel applications
- Workloads and programming environments: workload requirements drive design decisions; workloads are used to validate designs
- Focus on specific co‐travelers and domains: HPC / Imaging / Finance / Physical Simulations / Medical / …
- Platform stack under study: execution environments; I/O, network, storage; platform firmware/ucode; memory; on‐die fabric; cache; cores
- Goal: multi‐/many‐core features that accelerate applications in a power‐efficient manner (bonus point: simplify programming)
Steps
- 1. Understand the algorithms behind applications
- 2. Analyze the characteristics of key kernels of those algorithms
- 3. Evaluate the sensitivities to various architecture parameters
- 4. Develop an architecture straw‐man
- 5. Adjust algorithms to the target architecture
Repeat Step 1 if necessary
Workload Convergence
Domains: Computer Vision, Physical Simulation, (Financial) Analytics, Data Mining, Rendering
Applications: Body Tracking, Face Detection, CFD, Face/Cloth/Rigid Body simulation, Portfolio Mgmt, Option Pricing, Global Illumination, Cluster/Classify, Text Index, Collision Detection, Media Synthesis, Machine Learning
Kernels: FIMI, PDE, NLP, Level Set, SVM Classification, SVM Training, IPM (LP, QP), K‐Means, LCP, Particle Filtering, Fast Marching Method, Text Indexer, Monte Carlo, Filter/Transform, Basic Iterative Solvers (Jacobi, GS, SOR), Direct Solver (Cholesky), Krylov Iterative Solvers (PCG), Non‐Convex Methods
Primitives: basic matrix primitives (dense/sparse, structured/unstructured); basic geometry primitives (partitioning structures, primitive tests)
Case‐Study‐I (3‐D Stencil Operations)1

Algorithm/Optimization                                               Incremental Speedup
SIMDification                                                        1.8X
Multi‐threading (non‐blocked version is bandwidth‐bound)             2.1X
Cache‐blocking (2.5D spatial + 1D temporal)2                         1.7X
Multi‐threading (blocked version is compute‐bound, scales further)   1.8X
SIMD (further scaling of compute‐bound code)                         1.9X
ILP optimization                                                     1.1X
Overall Speedup                                                      24.1X

1. Performance data on Intel Core i7 975, 4c at 3.33 GHz
2. Details in the SC’10 paper “3.5‐D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs” by Nguyen et al.
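A toy sketch of the cache‐blocking idea on a 1‐D, 3‐point stencil (the paper’s 2.5‐D spatial + 1‐D temporal scheme is considerably more involved; the sizes and `block` parameter here are illustrative): blocking reorders the traversal so each chunk of the grid stays cache‐resident while it is worked on.

```python
# Spatially blocked sweep of a 3-point averaging stencil (toy example).
def stencil_blocked(a, block=64):
    n = len(a)
    out = a[:]  # boundary points are copied through unchanged
    for start in range(1, n - 1, block):
        end = min(start + block, n - 1)
        for i in range(start, end):
            out[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
    return out
```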
Case‐Study‐II (FFT)1

Algorithm/Optimization                                      Incremental Speedup
Algorithm (Radix‐4 vs. Radix‐2)                             1.72X
Multi‐threading (naïve partitioning)                        3.05X
Multi‐threading (intelligent partitioning: load balanced)   1.23X
SIMDfication (full vs. partial SIMD)                        1.18X
Memory management (double buffering)                        1.32X
Overall Speedup                                             10.1X

1. Performance data on Intel Core i7 975, 4c at 3.33 GHz
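To ground the algorithm‐choice row, a minimal radix‐2 Cooley–Tukey FFT (power‐of‐two length only); a radix‐4 variant computes the same transform with fewer twiddle‐factor multiplies, which is where a gain like the 1.72X above comes from. This is a generic sketch, not the tuned kernel from the study.

```python
import cmath

# Minimal recursive radix-2 Cooley-Tukey FFT (len(x) must be a power of two).
def fft(x):
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor times odd term
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```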
Case‐Study‐III (Sparse Matrix‐Vector Multiplication)1

Algorithm/Optimization                                      Incremental Speedup
Multi‐threading (naïve partitioning)                        1.72X
Multi‐threading (intelligent partitioning: load balanced)   2.2X
SIMDfication                                                1.13X
Cache blocking                                              1.15X
Register tiling                                             1.2X
Overall Speedup                                             6.0X

1. Performance data on Intel Core i7 975, 4c at 3.33 GHz
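The baseline kernel being tuned is sparse matrix‐vector multiply; a plain CSR sketch using the standard values/col_idx/row_ptr triplet (the data in the test is a made‐up example, and none of the table’s optimizations are applied here):

```python
# y = A @ x for a sparse A stored in CSR form.
def spmv_csr(values, col_idx, row_ptr, x):
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for i in range(row_ptr[r], row_ptr[r + 1]):  # nonzeros of row r
            acc += values[i] * x[col_idx[i]]
        y.append(acc)
    return y
```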
Case‐Study‐IV (Graph Traversal)1

Algorithm/Optimization                                   Incremental Speedup
Efficient layout (cache‐line friendly)                   10.1X
Hierarchical blocking (cache/TLB friendly)               3.1X
SIMD                                                     1.29X
ILP                                                      1.35X
Multi‐threading (linear scaling for compute‐bound code)  3.9X
Overall Speedup                                          212.6X

1. Performance data on Intel Core i7 975, 4c at 3.33 GHz
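The traversal kernel can be grounded with breadth‐first search over a CSR‐style adjacency layout (offsets plus a flat edge array); storing each vertex’s neighbors contiguously is the kind of cache‐line‐friendly layout the first row of the table credits. The graph below is a made‐up example:

```python
from collections import deque

# BFS distances from src over a CSR-style graph: edges[offsets[u]:offsets[u+1]]
# are the neighbors of vertex u, stored contiguously.
def bfs(offsets, edges, src):
    dist = [-1] * (len(offsets) - 1)
    dist[src] = 0
    q = deque([src])
    while q:
        u = q.popleft()
        for i in range(offsets[u], offsets[u + 1]):
            v = edges[i]
            if dist[v] == -1:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist
```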
Case‐Study‐V (Tree Search)1,2

Algorithm/Optimization                   Incremental Speedup
Efficient layout (memory page‐blocking)  1.53X
Cache‐line blocking                      1.4X
SIMD                                     1.8X
ILP                                      2X
Multi‐threading                          3.9X
Overall Speedup                          30.1X

1. Performance data on Intel Core i7 975, 4c at 3.33 GHz
2. Details in the SIGMOD’10 paper “FAST: Fast Architecture Sensitive Tree Search on Modern CPUs and GPUs” by Kim et al.
Case‐Study‐VI (Matrix Multiply)1,2

Algorithm/Optimization   Incremental Speedup
Loop inversion           9X
Cache‐tiling             1.33X
Multithreading           2.4X
SIMD                     2.2X
Overall Speedup          64X

1. Performance data on Intel Core i7 975, 4c at 3.33 GHz
2. HiPC’2010 (Goa, India) tutorial “Architecture Specific Optimizations for Modern Processors” by Dhiraj Kalamkar et al.
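The first two rows can be sketched together: inverting the loops to i‑k‑j order streams rows of B and C sequentially instead of striding down B’s columns, and tiling keeps blocks of all three matrices cache‐resident. The tile size below is illustrative; this is a sketch of the technique, not the tuned kernel.

```python
# C = A @ B with i-k-j loop order (loop inversion) and cache tiling.
def matmul_tiled(A, B, tile=32):
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, m, tile):
            for jj in range(0, p, tile):
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, m)):
                        a = A[i][k]  # hoisted: reused across the whole j loop
                        for j in range(jj, min(jj + tile, p)):
                            C[i][j] += a * B[k][j]
    return C
```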
Learning
- Parallel algorithms offer the best speedup‐effort RoI
  – The algorithmic core needs to evolve from the pre‐multicore era
- Technology‐aware algorithmic improvements offer the next best speedup‐effort RoI
  – Increasing compute density and data‐level parallelism
- Pay special attention to the least‐scaling part of modern architectures: BW/op will be increasingly critical to performance
  – Locality‐aware transformations
- Architecture‐specific speedup is orders of magnitude less than commonly believed
  – The 100–1000x CPU‐GPU speedup myth
Summary
Massive Data Computing
  – Insatiable appetite for compute
  – It’s all about three C’s: Content – Connect – Compute
Algorithmic Opportunity
  – The algorithmic core needs to evolve from serial to parallel
  – Massive‐data approach to traditional compute problems
  – Data … data everywhere, … not a bit of sense …
Performance Challenge
  – Performance variability on the rise with parallel architectures
  – Feeding the Beast: increasingly a performance bottleneck
  – Programmer productivity key to market success