SLIDE 1 Data Analytics & High Performance Computing: When Worlds Collide
Bruce Hendrickson
Senior Manager for Math & Computer Science
Sandia National Laboratories, Albuquerque, NM
University of New Mexico, Computer Science Dept.
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
SLIDE 2
What’s Left to Say!?
SLIDE 3
Worlds Apart
                      High Performance Computing    Data Analytics
Programming Model     MPI                           SQL / MapReduce
Performance Metric    Single Application Runtime    Throughput
Performance Limiter   Processor                     Memory System
Execution Model       Batch                         Interactive
Architecture Driver   Performance                   Resilience
Data Volumes          Small in, Large out           Large in, Small out
…
SLIDE 4 Outline
- Today’s HPC landscape
- HPC Applications are changing
– Evolution
– Revolution
- Architectures are changing
– Evolution
– Revolution
– Organic forces will make HPC more data friendly
– External forces will make HPC more data-centric
SLIDE 5 Enablers for Mainstream HPC
- “Killer micros” enable commodity-based parallel computing
– Attractive price and price/performance
– Stable model for algorithms & software
- Portable and stable programming model and language (MPI)
– Allowed for huge investment in software
- Bulk-Synchronous Parallel Programming (BSP)
– Basic approach to almost all successful MPI programs
– Compute locally; communicate; repeat (see the sketch below)
– Excellent match for clusters + MPI
– Good fit for many scientific applications
– Stability of the above allows for sustained algorithmic research
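To make the superstep structure concrete, here is a minimal BSP sketch, assuming a hypothetical 1-D domain decomposition with one ghost value per side; the data and kernel are placeholders, but the compute-communicate-repeat shape is the one nearly all successful MPI programs share.

```cpp
// Minimal BSP superstep sketch: compute locally, exchange halo data,
// repeat. The "physics" is a placeholder; the structure is the point.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    std::vector<double> u(1000, rank);          // locally owned data
    double left_ghost = 0.0, right_ghost = 0.0; // neighbor boundary values
    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    for (int step = 0; step < 100; ++step) {
        // 1. Compute locally (placeholder for the real kernel).
        for (double& x : u) x *= 0.99;

        // 2. Communicate: exchange boundary values with both neighbors.
        MPI_Sendrecv(&u.front(), 1, MPI_DOUBLE, left, 0,
                     &right_ghost, 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u.back(), 1, MPI_DOUBLE, right, 1,
                     &left_ghost, 1, MPI_DOUBLE, left, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        // 3. The paired sends/receives synchronize the ranks; the next
        //    superstep starts only once this one's data has arrived.
    }
    MPI_Finalize();
    return 0;
}
```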
SLIDE 6
A Virtuous Circle…
[Diagram: a cycle linking Architectures (commodity clusters), Programming Models (explicit message passing / MPI), Algorithms (bulk synchronous parallel), and Software]
…but also a suffocating embrace
SLIDE 7 Applications Are Evolving
- Leading edge scientific applications increasingly include:
– Adaptive, unstructured data structures
– Complex, multiphysics simulations
– Multiscale computations in space and time
– Complex synchronizations (e.g. discrete events)
- These raise significant parallelization challenges
– Limited by memory, not processor performance
– Unsolved micro-load balancing problems
– Finite degree of coarse-grained parallelism
– Bulk synchronous parallel not always appropriate
- These changes will stress existing approaches to parallelism
SLIDE 8 Revolutionary Applications
- What is “Computational Science”?
- We often equate it with modeling and simulation.
– But this is unnecessarily limited. From the dictionary:
– science – (noun) A branch of knowledge or study dealing with a body of facts or truths systematically arranged and showing the operation of general laws.
– com·pu·ta·tion·al – (adjective) Of or involving computation
SLIDE 9 Emerging Uses of Computing in Science
- Science is increasingly data-centric
– Biology, astrophysics, particle physics, earth science
– Social sciences
– Experimental, computational and literature data
- Sophisticated computing often required to extract knowledge from this data
- Computing challenges are different from mod/sim
– Data sets can be huge (I/O is a priority)
– Response time may be short (throughput is key metric)
– Computational kernels have different character
- What abstractions, paradigms and algorithms are needed?
SLIDE 10 Example: Network Science
- Graphs are ideal for representing entities and relationships
- Rapidly growing use in biological, social, environmental, and other sciences
[Figure: “The way it was” – Zachary’s karate club (|V|=34) vs. “The way it is now” – Twitter social network (|V|≈200M)]
SLIDE 11 Computational Challenges for Network Science
- Unlike meshes, complex networks aren’t partitionable
- Minimal computation to hide access time
- Runtime is dominated by latency
– Random accesses to global address space
– Parallelism is very fine grained and dynamic
- Access pattern is data dependent (see the sketch below)
– Prefetching unlikely to help
– Usually only want small part of cache line
- Potentially abysmal locality at all levels of memory hierarchy
- Many algorithms are not bulk synchronous
- Approaches based on virtuous circle don’t work!
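A minimal sketch of the data-dependent access problem: in a level-synchronous breadth-first search over a compressed sparse row (CSR) graph, every neighbor access is an indirect load whose address comes from the data itself. The xadj/adj arrays here are hypothetical inputs; the point is the access pattern, not this particular algorithm.

```cpp
// Level-synchronous BFS over a CSR graph. Each neighbor load is a
// data-dependent indirect access that caches and prefetchers handle badly.
#include <cstdint>
#include <queue>
#include <vector>

std::vector<int> bfs_levels(const std::vector<std::int64_t>& xadj, // row ptrs
                            const std::vector<int>& adj,           // neighbors
                            int source) {
    std::vector<int> level(xadj.size() - 1, -1);
    std::queue<int> frontier;
    level[source] = 0;
    frontier.push(source);
    while (!frontier.empty()) {
        int v = frontier.front();
        frontier.pop();
        // Which addresses get touched next depends entirely on the graph,
        // so the pattern is unpredictable at every level of the hierarchy.
        for (std::int64_t e = xadj[v]; e < xadj[v + 1]; ++e) {
            int w = adj[e];
            if (level[w] == -1) {        // scattered read/write into level[]
                level[w] = level[v] + 1;
                frontier.push(w);
            }
        }
    }
    return level;
}
```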
SLIDE 12 Locality Challenges
[Figure: spatial vs. temporal locality of benchmark applications, contrasting what we traditionally care about (HPC codes), what industry cares about, and emerging codes]
From: Murphy and Kogge, On The Memory Access Patterns of Supercomputer Applications: Benchmark Selection and Its Implications, IEEE Trans. on Computers, July 2007
SLIDE 13 Outline
- Today’s HPC landscape
- HPC Applications are changing
– Evolution
– Revolution
- Architectures are changing
– Evolution
– Revolution
– Organic forces will make HPC more data friendly
– External forces will make HPC more data-centric
SLIDES 14–18
Example: AMD Opteron
[Figure sequence: an annotated Opteron die photo built up over five slides. Nearly all of the chip area serves latency avoidance (L1 D-cache, L1 I-cache, L2 cache) and latency tolerance (out-of-order execution, load/store unit, memory/coherency logic, I-fetch/scan/align), plus memory and I/O interfaces (DDR, HyperTransport, memory controller). Only the FPU and integer execution units (the part labeled “COMPUTER”) actually compute.]
Thanks to Thomas Sterling
SLIDE 19 A Renaissance in Architecture Research
- Good news
– Moore’s Law marches on
– Real estate on a chip is essentially free
- Major paradigm change – huge opportunity for innovation
- Bad news
– Power considerations limit the improvement in clock speed
– Parallelism is the only viable route to improved performance
- Current response: multicore processors
– Computation/communication ratio will get worse
- Makes life harder for applications
- Long-term consequences unclear
SLIDE 20 Architectural Wish List for Graphs
- Low latency / high bandwidth
– For small messages!
- Light-weight synchronization mechanisms for fine-grained parallelism
- No graph partitioning required
– Avoid memory-consuming profusion of ghost nodes
– No local/global numbering conversions
- One machine with these properties is the Cray XMT
– Descendant of the Tera MTA
SLIDE 21 How Does the XMT Work?
- Latency tolerance via massive multi-threading
– Context switch in a single tick
– Global address space, hashed to reduce hot-spots
– No cache or local memory
– Multiple outstanding loads
- Remote memory request doesn’t stall processor
– Other streams work while your request gets fulfilled
- Light-weight, word-level synchronization (see the sketch below)
– Minimizes conflicts, enables parallelism
- Flexible dynamic load balancing
- Slower clock (400 MHz)
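A rough analogue of the word-level synchronization point, assuming only commodity C++ atomics: on the XMT, per-word full/empty bits let the hardware serialize conflicting updates to a single word; here a compare-and-swap loop stands in. The atomic_min helper is a hypothetical illustration, not XMT code.

```cpp
// Lower a shared tentative value (e.g. a vertex distance) without any
// coarse-grained lock: contention is confined to the one word touched,
// so millions of fine-grained threads can relax values concurrently.
#include <atomic>

inline void atomic_min(std::atomic<long>& slot, long candidate) {
    long current = slot.load(std::memory_order_relaxed);
    while (candidate < current &&
           !slot.compare_exchange_weak(current, candidate)) {
        // On failure, current is reloaded with the latest value; retry
        // until our candidate is no longer smaller or the swap succeeds.
    }
}
```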
SLIDE 22 Case Study: Single Source Shortest Path
- Parallel Boost Graph Library (PBGL)
– Lumsdaine, et al., on Opteron cluster
– Some graph algorithms can scale on some inputs
- PBGL – MTA-2 comparison on SSSP (see the SSSP sketch below)
– Erdős–Rényi random graph (|V| = 2^28)
– PBGL SSSP can scale on non-power-law graphs
– Order of magnitude speed difference
– 2 orders of magnitude efficiency difference
[Figure: time (s) vs. # processors for PBGL SSSP and MTA SSSP]
- Big difference in power consumption
– [Lumsdaine, Gregor, H., Berry, 2007]
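For reference, the sketch below pins down what SSSP computes, in serial Dijkstra form. The codes compared above use parallel delta-stepping-style relaxations rather than a priority queue, so this is only a semantic baseline, not the benchmarked implementation.

```cpp
// Serial Dijkstra SSSP as a reference baseline. The inner "relax" step
// is what parallel variants (e.g. delta-stepping) perform concurrently.
#include <limits>
#include <queue>
#include <utility>
#include <vector>

using Edge = std::pair<int, long>;  // (neighbor, edge length)

std::vector<long> sssp(const std::vector<std::vector<Edge>>& g, int source) {
    const long INF = std::numeric_limits<long>::max();
    std::vector<long> dist(g.size(), INF);
    using Item = std::pair<long, int>;  // (tentative distance, vertex)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;
    dist[source] = 0;
    pq.push({0, source});
    while (!pq.empty()) {
        auto [d, v] = pq.top();
        pq.pop();
        if (d > dist[v]) continue;  // stale queue entry
        for (auto [w, len] : g[v]) {
            if (d + len < dist[w]) {  // the relax step
                dist[w] = d + len;
                pq.push({dist[w], w});
            }
        }
    }
    return dist;
}
```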
SLIDE 23 Longer Term Architectural Opportunities
– Multithreading for latency tolerance on commodity processors
– Growing heterogeneity in our compute nodes
– Specialized machines targeting market segments
– Application-specific circuitry?
- E.g. common scientific kernels
– Reconfigurable hardware?
- Adapt circuits to the application at run time
- Disruptive changes to our virtuous circle!
SLIDE 24 Role for HPC in Data Analysis Pipeline
[Diagram: pipeline stages, top to bottom: Compute Cloud, Data Appliance, HPC Platform, Workstation]
- Graph created from raw data, explored and studied under analyst direction
- Analysis done at every stage in pipeline
– Data size decreasing downwards
– Algorithmic complexity increasing downwards
- Data flow not fully unidirectional
– Analysis results get updated in database
– Interactive, not batch computing
– Orchestration is immensely complicated
– Usability and human factors are big challenges
SLIDE 25
HPC Analysis System Architecture
[Diagram: HPC Data Server, Cray XMT, MPP]
SLIDE 26 Exemplar: Sandia’s Networks Grand Challenge
- Large internally funded R&D project
- Goals:
– Enable exploratory analytics at scale
– Support rich combinations of analytical methods
- Graph analytics, algebraic methods, statistics, info-viz, …
– Focus on usability and usefulness of capabilities
- Significant human-factors investment
- Close collaboration between researchers and target customers
– Create open-source foundation for further R&D
SLIDE 27 Use Case for First Prototype: What ‘payloads’ were contained in the network transfers?
[Pipeline diagram with linked selection: Database → vtkSQLQuery → Domain Mappers → vtkTableToSparseArray → vtkPARAFAC + more → views (vtkGeoView, vtkTermView*, vtkConceptTermView*; * just vtkGraphViews)]
SLIDE 28 Another Use Case: Analyst needs suspicious (out of the ‘norm’) behaviors “flagged” for further exploration
[Pipeline diagram: Database → vtkSQLQuery → vtkTableView and vtkContingencyStats → vtkTableToGraph → vtkGraphView; a hedged wiring sketch follows below]
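A hedged sketch of how such a Titan/VTK pipeline is typically wired up. The classes below (vtkSQLDatabase, vtkRowQueryToTable, vtkTableToGraph, vtkGraphLayoutView) are stock VTK infovis classes, but the database URL, SQL text, and column names ("src", "dst") are hypothetical, and the prototype's custom views (vtkGraphView, vtkContingencyStats) are approximated here by the stock layout view.

```cpp
// Sketch: SQL query -> table -> graph -> interactive view, using stock
// VTK/Titan classes. Schema and URL are hypothetical placeholders.
#include <vtkGraphLayoutView.h>
#include <vtkRenderWindowInteractor.h>
#include <vtkRowQueryToTable.h>
#include <vtkSQLDatabase.h>
#include <vtkSQLQuery.h>
#include <vtkSmartPointer.h>
#include <vtkTableToGraph.h>

int main() {
    // Pull rows out of the relational store.
    vtkSQLDatabase* db = vtkSQLDatabase::CreateFromURL("sqlite://flows.db");
    db->Open("");
    vtkSQLQuery* query = db->GetQueryInstance();
    query->SetQuery("SELECT src, dst FROM transfers"); // hypothetical schema

    auto table = vtkSmartPointer<vtkRowQueryToTable>::New();
    table->SetQuery(query);

    // Turn table rows into graph vertices and edges.
    auto graph = vtkSmartPointer<vtkTableToGraph>::New();
    graph->SetInputConnection(table->GetOutputPort());
    graph->AddLinkVertex("src");
    graph->AddLinkVertex("dst");
    graph->AddLinkEdge("src", "dst");

    // Interactive view of the resulting graph.
    auto view = vtkSmartPointer<vtkGraphLayoutView>::New();
    view->AddRepresentationFromInputConnection(graph->GetOutputPort());
    view->ResetCamera();
    view->Render();
    view->GetInteractor()->Start();

    query->Delete();
    db->Delete();
    return 0;
}
```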
SLIDE 29
Second Prototype
SLIDE 30 Conclusions
- Organic forces will make HPC more data friendly
– BSP/MPI hegemony is breaking down
- Multicore, more complex applications (traditional & emerging)
– Memory performance is already the key to HPC
- External forces will make HPC more data-centric
– Unprecedented opportunities for architectural innovation
– Data-rich applications of growing importance to science
– Opportunities to impact applications currently not on the HPC radar
- Enormous challenges ahead
– Some will need to be addressed by HPC community anyway
– Cultural challenges may prove the most daunting
SLIDE 31 Thanks
- Cevdet Aykanat, Jon Berry, Rob Bisseling, Erik Boman, Bill Carlson, Ümit Çatalyürek, Edmond Chow, Karen Devine, Iain Duff, Danny Dunlavy, Alan Edelman, Jean-Loup Faulon, John Feo, John Gilbert, Assefaw Gebremedhin, Mike Heath, Paul Hovland, Simon Kahan, Pat Knupp, Tammy Kolda, Gary Kumfert, Fredrik Manne, Michael Mahoney, Mike Merrill, Richard Murphy, Esmond Ng, Ali Pınar, Cindy Phillips, Steve Plimpton, Alex Pothen, Robert Preis, Padma Raghavan, Steve Reinhardt, Suzanne Rountree, Rob Schreiber, Viral Shah, Jonathan Shewchuk, Horst Simon, Dan Spielman, Shang-Hua Teng, Sivan Toledo, Keith Underwood, etc.