SLIDE 1 Data Analytics & High Performance Computing: When Worlds Collide
Bruce Hendrickson
Senior Manager for Math & Computer Science
Sandia National Laboratories, Albuquerque, NM
University of New Mexico, Computer Science Dept.
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
SLIDE 2
What’s Left to Say!?
SLIDE 3
Worlds Apart
                      High Performance Computing    Data Analytics
Programming Model     MPI                           SQL / MapReduce
Performance Metric    Single Application Runtime    Throughput
Performance Limiter   Processor                     Memory System
Execution Model       Batch                         Interactive
Architecture Driver   Performance                   Resilience
Data Volumes          Small in, Large out           Large in, Small out
…
SLIDE 4 Outline
- Today’s HPC landscape
- HPC Applications are changing
– Evolution
– Revolution
- Architectures are changing
– Evolution
– Revolution
– Organic forces will make HPC more data friendly
– External forces will make HPC more data-centric
SLIDE 5 Enablers for Mainstream HPC
- “Killer micros” enable commodity-based parallel computing
– Attractive price and price/performance
– Stable model for algorithms & software
- Portable and stable programming model and language (MPI)
– Allowed for huge investment in software
- Bulk-Synchronous Parallel Programming (BSP)
– Basic approach to almost all successful MPI programs
– Compute locally; communicate; repeat (see the sketch below)
– Excellent match for clusters + MPI
– Good fit for many scientific applications
– Stability of the above allows for sustained algorithmic research
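To make the superstep structure concrete, here is a minimal BSP sketch, assuming a hypothetical 1-D domain decomposition with one ghost value per side; the data and kernel are placeholders, but the compute-communicate-repeat shape is the one nearly all successful MPI programs share.

```cpp
// Minimal BSP superstep sketch: compute locally, exchange halo data,
// repeat. The "physics" is a placeholder; the structure is the point.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    std::vector<double> u(1000, rank);          // locally owned data
    double left_ghost = 0.0, right_ghost = 0.0; // neighbor boundary values
    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    for (int step = 0; step < 100; ++step) {
        // 1. Compute locally (placeholder for the real kernel).
        for (double& x : u) x *= 0.99;

        // 2. Communicate: exchange boundary values with both neighbors.
        MPI_Sendrecv(&u.front(), 1, MPI_DOUBLE, left, 0,
                     &right_ghost, 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u.back(), 1, MPI_DOUBLE, right, 1,
                     &left_ghost, 1, MPI_DOUBLE, left, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        // 3. The paired sends/receives synchronize the ranks; the next
        //    superstep starts only once this one's data has arrived.
    }
    MPI_Finalize();
    return 0;
}
```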
SLIDE 6
A Virtuous Circle…
[Diagram: a cycle linking Architectures (commodity clusters), Programming Models (explicit message passing / MPI), Algorithms (bulk synchronous parallel), and Software]
…but also a suffocating embrace
SLIDE 7 Applications Are Evolving
- Leading edge scientific applications increasingly include:
– Adaptive, unstructured data structures
– Complex, multiphysics simulations
– Multiscale computations in space and time
– Complex synchronizations (e.g. discrete events)
- These raise significant parallelization challenges
– Limited by memory, not processor performance
– Unsolved micro-load balancing problems
– Finite degree of coarse-grained parallelism
– Bulk synchronous parallel not always appropriate
- These changes will stress existing approaches to parallelism
SLIDE 8 Revolutionary Applications
- What is “Computational Science”?
- We often equate it with modeling and simulation.
– But this is unnecessarily limited. From the dictionary:
– science – (noun) A branch of knowledge or study dealing with a body of facts or truths systematically arranged and showing the operation of general laws.
– com·pu·ta·tion·al – (adjective) Of or involving computation
SLIDE 9 Emerging Uses of Computing in Science
- Science is increasingly data-centric
– Biology, astrophysics, particle physics, earth science
– Social sciences
– Experimental, computational and literature data
- Sophisticated computing often required to extract knowledge from this data
- Computing challenges are different from mod/sim
– Data sets can be huge (I/O is a priority)
– Response time may be short (throughput is key metric)
– Computational kernels have different character
- What abstractions, paradigms and algorithms are needed?
SLIDE 10 Example: Network Science
- Graphs are ideal for representing entities and relationships
- Rapidly growing use in biological, social, environmental, and other sciences
[Figure: “The way it was” – Zachary’s karate club (|V|=34) vs. “The way it is now” – Twitter social network (|V|≈200M)]
SLIDE 11 Computational Challenges for Network Science
- Unlike meshes, complex networks aren’t partitionable
- Minimal computation to hide access time
- Runtime is dominated by latency
– Random accesses to global address space
– Parallelism is very fine grained and dynamic
- Access pattern is data dependent (see the sketch below)
– Prefetching unlikely to help
– Usually only want small part of cache line
- Potentially abysmal locality at all levels of memory hierarchy
- Many algorithms are not bulk synchronous
- Approaches based on virtuous circle don’t work!
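A minimal sketch of the data-dependent access problem: in a level-synchronous breadth-first search over a compressed sparse row (CSR) graph, every neighbor access is an indirect load whose address comes from the data itself. The xadj/adj arrays here are hypothetical inputs; the point is the access pattern, not this particular algorithm.

```cpp
// Level-synchronous BFS over a CSR graph. Each neighbor load is a
// data-dependent indirect access that caches and prefetchers handle badly.
#include <cstdint>
#include <queue>
#include <vector>

std::vector<int> bfs_levels(const std::vector<std::int64_t>& xadj, // row ptrs
                            const std::vector<int>& adj,           // neighbors
                            int source) {
    std::vector<int> level(xadj.size() - 1, -1);
    std::queue<int> frontier;
    level[source] = 0;
    frontier.push(source);
    while (!frontier.empty()) {
        int v = frontier.front();
        frontier.pop();
        // Which addresses get touched next depends entirely on the graph,
        // so the pattern is unpredictable at every level of the hierarchy.
        for (std::int64_t e = xadj[v]; e < xadj[v + 1]; ++e) {
            int w = adj[e];
            if (level[w] == -1) {        // scattered read/write into level[]
                level[w] = level[v] + 1;
                frontier.push(w);
            }
        }
    }
    return level;
}
```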
SLIDE 12 Locality Challenges
[Figure: spatial vs. temporal locality of benchmark applications, contrasting what we traditionally care about (HPC codes), what industry cares about, and emerging codes]
From: Murphy and Kogge, On The Memory Access Patterns of Supercomputer Applications: Benchmark Selection and Its Implications, IEEE Trans. on Computers, July 2007
SLIDE 13 Outline
- Today’s HPC landscape
- HPC Applications are changing
– Evolution
– Revolution
- Architectures are changing
– Evolution
– Revolution
– Organic forces will make HPC more data friendly
– External forces will make HPC more data-centric
SLIDES 14–18
Example: AMD Opteron
[Figure sequence: an annotated Opteron die photo built up over five slides. Nearly all of the chip area serves latency avoidance (L1 D-cache, L1 I-cache, L2 cache) and latency tolerance (out-of-order execution, load/store unit, memory/coherency logic, I-fetch/scan/align), plus memory and I/O interfaces (DDR, HyperTransport, memory controller). Only the FPU and integer execution units (the part labeled “COMPUTER”) actually compute.]
Thanks to Thomas Sterling
SLIDE 19 A Renaissance in Architecture Research
- Good news
– Moore’s Law marches on
– Real estate on a chip is essentially free
- Major paradigm change – huge opportunity for innovation
- Bad news
– Power considerations limit the improvement in clock speed
– Parallelism is the only viable route to improved performance
- Current response: multicore processors
– Computation/communication ratio will get worse
- Makes life harder for applications
- Long-term consequences unclear
SLIDE 20 Architectural Wish List for Graphs
- Low latency / high bandwidth
– For small messages!
- Light-weight synchronization mechanisms for fine-grained parallelism
- No graph partitioning required
– Avoid memory-consuming profusion of ghost nodes
– No local/global numbering conversions
- One machine with these properties is the Cray XMT
– Descendant of the Tera MTA
SLIDE 21 How Does the XMT Work?
- Latency tolerance via massive multi-threading
– Context switch in a single tick
– Global address space, hashed to reduce hot-spots
– No cache or local memory
– Multiple outstanding loads
- Remote memory request doesn’t stall processor
– Other streams work while your request gets fulfilled
- Light-weight, word-level synchronization (see the sketch below)
– Minimizes conflicts, enables parallelism
- Flexible dynamic load balancing
- Slower clock (400 MHz)
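A rough analogue of the word-level synchronization point, assuming only commodity C++ atomics: on the XMT, per-word full/empty bits let the hardware serialize conflicting updates to a single word; here a compare-and-swap loop stands in. The atomic_min helper is a hypothetical illustration, not XMT code.

```cpp
// Lower a shared tentative value (e.g. a vertex distance) without any
// coarse-grained lock: contention is confined to the one word touched,
// so millions of fine-grained threads can relax values concurrently.
#include <atomic>

inline void atomic_min(std::atomic<long>& slot, long candidate) {
    long current = slot.load(std::memory_order_relaxed);
    while (candidate < current &&
           !slot.compare_exchange_weak(current, candidate)) {
        // On failure, current is reloaded with the latest value; retry
        // until our candidate is no longer smaller or the swap succeeds.
    }
}
```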
SLIDE 22 Case Study: Single Source Shortest Path
- Parallel Boost Graph Library (PBGL)
– Lumsdaine, et al., on Opteron cluster
– Some graph algorithms can scale on some inputs
- PBGL – MTA-2 comparison on SSSP (see the SSSP sketch below)
– Erdős–Rényi random graph (|V| = 2^28)
– PBGL SSSP can scale on non-power-law graphs
– Order of magnitude speed difference
– 2 orders of magnitude efficiency difference
[Figure: time (s) vs. # processors for PBGL SSSP and MTA SSSP]
- Big difference in power consumption
– [Lumsdaine, Gregor, H., Berry, 2007]
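For reference, the sketch below pins down what SSSP computes, in serial Dijkstra form. The codes compared above use parallel delta-stepping-style relaxations rather than a priority queue, so this is only a semantic baseline, not the benchmarked implementation.

```cpp
// Serial Dijkstra SSSP as a reference baseline. The inner "relax" step
// is what parallel variants (e.g. delta-stepping) perform concurrently.
#include <limits>
#include <queue>
#include <utility>
#include <vector>

using Edge = std::pair<int, long>;  // (neighbor, edge length)

std::vector<long> sssp(const std::vector<std::vector<Edge>>& g, int source) {
    const long INF = std::numeric_limits<long>::max();
    std::vector<long> dist(g.size(), INF);
    using Item = std::pair<long, int>;  // (tentative distance, vertex)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;
    dist[source] = 0;
    pq.push({0, source});
    while (!pq.empty()) {
        auto [d, v] = pq.top();
        pq.pop();
        if (d > dist[v]) continue;  // stale queue entry
        for (auto [w, len] : g[v]) {
            if (d + len < dist[w]) {  // the relax step
                dist[w] = d + len;
                pq.push({dist[w], w});
            }
        }
    }
    return dist;
}
```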
SLIDE 23 Longer Term Architectural Opportunities
– Multithreading for latency tolerance on commodity processors
– Growing heterogeneity in our compute nodes
– Specialized machines targeting market segments
– Application-specific circuitry?
- E.g. common scientific kernels
– Reconfigurable hardware?
- Adapt circuits to the application at run time
- Disruptive changes to our virtuous circle!
SLIDE 24 Role for HPC in Data Analysis Pipeline
[Diagram: pipeline stages, top to bottom: Compute Cloud, Data Appliance, HPC Platform, Workstation]
- Graph created from raw data, explored and studied under analyst direction
- Analysis done at every stage in pipeline
– Data size decreasing downwards
– Algorithmic complexity increasing downwards
- Data flow not fully unidirectional
– Analysis results get updated in database
– Interactive, not batch computing
– Orchestration is immensely complicated
– Usability and human factors are big challenges
SLIDE 25
HPC Analysis System Architecture
[Diagram: HPC Data Server, Cray XMT, MPP]
SLIDE 26 Exemplar: Sandia’s Networks Grand Challenge
- Large internally funded R&D project
- Goals:
– Enable exploratory analytics at scale
– Support rich combinations of analytical methods
- Graph analytics, algebraic methods, statistics, info-viz, …
– Focus on usability and usefulness of capabilities
- Significant human-factors investment
- Close collaboration between researchers and target customers
– Create open-source foundation for further R&D
SLIDE 27 Use Case for First Prototype: What ‘payloads’ were contained in the network transfers?
[Pipeline diagram with linked selection: Database → vtkSQLQuery → Domain Mappers → vtkTableToSparseArray → vtkPARAFAC + more → views (vtkGeoView, vtkTermView*, vtkConceptTermView*; * just vtkGraphViews)]
SLIDE 28 Another Use Case: Analyst needs suspicious (out of the ‘norm’) behaviors “flagged” for further exploration
[Pipeline diagram: Database → vtkSQLQuery → vtkTableView and vtkContingencyStats → vtkTableToGraph → vtkGraphView; a hedged wiring sketch follows below]
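A hedged sketch of how such a Titan/VTK pipeline is typically wired up. The classes below (vtkSQLDatabase, vtkRowQueryToTable, vtkTableToGraph, vtkGraphLayoutView) are stock VTK infovis classes, but the database URL, SQL text, and column names ("src", "dst") are hypothetical, and the prototype's custom views (vtkGraphView, vtkContingencyStats) are approximated here by the stock layout view.

```cpp
// Sketch: SQL query -> table -> graph -> interactive view, using stock
// VTK/Titan classes. Schema and URL are hypothetical placeholders.
#include <vtkGraphLayoutView.h>
#include <vtkRenderWindowInteractor.h>
#include <vtkRowQueryToTable.h>
#include <vtkSQLDatabase.h>
#include <vtkSQLQuery.h>
#include <vtkSmartPointer.h>
#include <vtkTableToGraph.h>

int main() {
    // Pull rows out of the relational store.
    vtkSQLDatabase* db = vtkSQLDatabase::CreateFromURL("sqlite://flows.db");
    db->Open("");
    vtkSQLQuery* query = db->GetQueryInstance();
    query->SetQuery("SELECT src, dst FROM transfers"); // hypothetical schema

    auto table = vtkSmartPointer<vtkRowQueryToTable>::New();
    table->SetQuery(query);

    // Turn table rows into graph vertices and edges.
    auto graph = vtkSmartPointer<vtkTableToGraph>::New();
    graph->SetInputConnection(table->GetOutputPort());
    graph->AddLinkVertex("src");
    graph->AddLinkVertex("dst");
    graph->AddLinkEdge("src", "dst");

    // Interactive view of the resulting graph.
    auto view = vtkSmartPointer<vtkGraphLayoutView>::New();
    view->AddRepresentationFromInputConnection(graph->GetOutputPort());
    view->ResetCamera();
    view->Render();
    view->GetInteractor()->Start();

    query->Delete();
    db->Delete();
    return 0;
}
```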
SLIDE 29
Second Prototype
SLIDE 30 Conclusions
- Organic forces will make HPC more data friendly
– BSP/MPI hegemony is breaking down
- Multicore, more complex applications (traditional & emerging)
– Memory performance is already the key to HPC
- External forces will make HPC more data-centric
– Unprecedented opportunities for architectural innovation
– Data-rich applications of growing importance to science
– Opportunities to impact applications currently not on the HPC radar
- Enormous challenges ahead
– Some will need to be addressed by HPC community anyway
– Cultural challenges may prove the most daunting
SLIDE 31 Thanks
- Cevdet Aykanat, Jon Berry, Rob Bisseling, Erik Boman, Bill Carlson, Ümit Çatalyürek, Edmond Chow, Karen Devine, Iain Duff, Danny Dunlavy, Alan Edelman, Jean-Loup Faulon, John Feo, John Gilbert, Assefaw Gebremedhin, Mike Heath, Paul Hovland, Simon Kahan, Pat Knupp, Tammy Kolda, Gary Kumfert, Fredrik Manne, Michael Mahoney, Mike Merrill, Richard Murphy, Esmond Ng, Ali Pınar, Cindy Phillips, Steve Plimpton, Alex Pothen, Robert Preis, Padma Raghavan, Steve Reinhardt, Suzanne Rountree, Rob Schreiber, Viral Shah, Jonathan Shewchuk, Horst Simon, Dan Spielman, Shang-Hua Teng, Sivan Toledo, Keith Underwood, etc.