Data Analytics & High Performance Computing: When Worlds Collide - PowerPoint PPT Presentation



SLIDE 1

Data Analytics & High Performance Computing: When Worlds Collide

Bruce Hendrickson

Senior Manager for Math & Computer Science
Sandia National Laboratories, Albuquerque, NM
University of New Mexico, Computer Science Dept.

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

SLIDE 2

What’s Left to Say!?

SLIDE 3

Worlds Apart

                       High Performance Computing    Data Analytics
Programming Model      MPI                           SQL / MapReduce
Performance Metric     Single-application runtime    Throughput
Performance Limiter    Processor                     Memory system
Execution Model        Batch                         Interactive
Architecture Driver    Performance                   Resilience
Data Volumes           Small in, large out           Large in, small out
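The analytics-side programming model in the table can be made concrete with a minimal, framework-free sketch of the MapReduce pattern in Python. The map, shuffle, and reduce phases below are schematic illustrations, not any particular engine's API, and the input records are invented for the example:

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit (key, value) pairs -- here (word, 1) for each word.
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle groups values by key; reduce sums each group.
    groups = defaultdict(int)
    for key, value in pairs:
        groups[key] += value
    return dict(groups)

# Hypothetical input records.
records = ["data analytics meets HPC", "HPC meets data"]
counts = reduce_phase(map_phase(records))
```

Real engines distribute the map and reduce phases across nodes and spill the shuffle to disk, but the data flow has exactly the large-in, small-out shape the table notes.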

SLIDE 4

Outline

  • Today’s HPC landscape
  • HPC Applications are changing

– Evolution
– Revolution

  • Architectures are changing

– Evolution
– Revolution

  • Conclusions:

– Organic forces will make HPC more data-friendly
– External forces will make HPC more data-centric

SLIDE 5

Enablers for Mainstream HPC

  • Clusters

– “Killer micros” enable commodity-based parallel computing
– Attractive price and price/performance
– Stable model for algorithms & software

  • MPI

– Portable and stable programming model and language
– Allowed for huge investment in software

  • Bulk-Synchronous Parallel Programming (BSP)

– Basic approach to almost all successful MPI programs
– Compute locally; communicate; repeat
– Excellent match for clusters+MPI
– Good fit for many scientific applications

  • Algorithms

– Stability of the above allows for sustained algorithmic research
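The compute-locally / communicate / repeat pattern above can be sketched as a serial Python simulation of BSP supersteps. The "ranks", their 1-D slices with ghost cells, and the three-point averaging kernel are hypothetical stand-ins for a real MPI code:

```python
# Each "rank" owns a slice of a 1-D array plus one ghost cell per side.
# One superstep = bulk communication (halo exchange) + local compute.

def superstep(slices):
    # Communication phase: refresh ghost cells from neighboring ranks
    # (boundary ranks reflect their own edge value).
    last = len(slices) - 1
    for i, s in enumerate(slices):
        s[0] = slices[i - 1][-2] if i > 0 else s[1]
        s[-1] = slices[i + 1][1] if i < last else s[-2]
    # Compute phase: three-point averaging on owned cells only.
    for s in slices:
        s[1:-1] = [(s[j - 1] + s[j] + s[j + 1]) / 3
                   for j in range(1, len(s) - 1)]
    return slices

# Two ranks splitting the global array [1..6]; 0.0 marks ghost cells.
slices = [[0.0, 1.0, 2.0, 3.0, 0.0], [0.0, 4.0, 5.0, 6.0, 0.0]]
superstep(slices)
```

In a real MPI code the halo exchange would be message passing (e.g. send/receive pairs), with the superstep barrier implicit in message completion; the stability of exactly this structure is what enabled the sustained software and algorithmic investment described above.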

SLIDE 6

A Virtuous Circle…

[Diagram: a virtuous circle linking Architectures (commodity clusters), Programming Models (explicit message passing, MPI), Algorithms (bulk-synchronous parallel), and Software]

…but also a suffocating embrace

SLIDE 7

Applications Are Evolving

  • Leading edge scientific applications increasingly include:

– Adaptive, unstructured data structures
– Complex, multiphysics simulations
– Multiscale computations in space and time
– Complex synchronizations (e.g. discrete events)

  • These raise significant parallelization challenges

– Limited by memory, not processor performance
– Unsolved micro-load balancing problems
– Finite degree of coarse-grained parallelism
– Bulk synchronous parallel not always appropriate

  • These changes will stress existing approaches to parallelism
SLIDE 8

Revolutionary Applications

  • What is “Computational Science”?


  • We often equate it with modeling and simulation.

– But this is unnecessarily limited.

  • From Dictionary.com:

– science – (noun) A branch of knowledge or study dealing with a body of facts or truths systematically arranged and showing the operation of general laws.
– com·pu·ta·tion·al – (adjective) Of or involving computation or computers.
SLIDE 9

Emerging Uses of Computing in Science

  • Science is increasingly data-centric

– Biology, astrophysics, particle physics, earth science
– Social sciences
– Experimental, computational and literature data

  • Sophisticated computing often required to extract knowledge from this data

  • Computing challenges are different from mod/sim


– Data sets can be huge (I/O is a priority)
– Response time may be short (throughput is key metric)
– Computational kernels have different character

  • What abstractions, paradigms, and algorithms are needed?
SLIDE 10

Example: Network Science

  • Graphs are ideal for representing entities and relationships
  • Rapidly growing use in biological, social, environmental, and other sciences

The way it was … The way it is now …

Zachary’s karate club (|V|=34)    Twitter social network (|V|≈200M)
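To make the entities-and-relationships framing concrete, here is a minimal Python sketch; the tiny social network is invented for illustration (real instances range from the 34-vertex karate club to the ~200M-vertex Twitter graph):

```python
from collections import defaultdict

# Hypothetical edge list: each pair is a relationship between entities.
edges = [("ann", "bob"), ("bob", "cal"), ("ann", "cal"),
         ("cal", "dee"), ("dee", "eve")]

# Adjacency-list representation of the (undirected) graph.
adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

# Vertex degree: the most basic statistic network science computes.
degree = {v: len(nbrs) for v, nbrs in adj.items()}
```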

SLIDE 11

Computational Challenges for Network Science

  • Unlike meshes, complex networks aren’t partitionable
  • Minimal computation to hide access time
  • Runtime is dominated by latency

– Random accesses to global address space
– Parallelism is very fine-grained and dynamic

  • Access pattern is data dependent

– Prefetching unlikely to help
– Usually only want a small part of each cache line

  • Potentially abysmal locality at all levels of memory hierarchy
  • Many algorithms are not bulk synchronous
  • Approaches based on virtuous circle don’t work!
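Even the simplest graph kernel exhibits the data-dependent access pattern listed above. A sketch of breadth-first search in Python (the example graph is hypothetical): the vertices visited next, and hence the memory addresses read next, are determined by the data itself, which is exactly why prefetching and cache lines buy so little:

```python
from collections import deque

def bfs_levels(adj, source):
    """Return the BFS level (hop distance) of every reachable vertex."""
    level = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:          # reads driven by the data, not a stride
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
    return level

# Hypothetical adjacency lists.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
levels = bfs_levels(adj, 0)
```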
SLIDE 12

Locality Challenges

[Figure: spatial vs. temporal locality of benchmark codes: what we traditionally care about, what industry cares about, and emerging codes]

From: Murphy and Kogge, On The Memory Access Patterns of Supercomputer Applications: Benchmark Selection and Its Implications, IEEE T. on Computers, July 2007

SLIDE 13

Outline

  • Today’s HPC landscape
  • HPC Applications are changing

– Evolution
– Revolution

  • Architectures are changing

– Evolution
– Revolution

  • Conclusions:

– Organic forces will make HPC more data-friendly
– External forces will make HPC more data-centric

SLIDE 14

Example: AMD Opteron

SLIDE 15

Example: AMD Opteron

[Die photo: latency-avoidance structures highlighted: L1 D-Cache, L1 I-Cache, L2 Cache, Memory]

SLIDE 16

Example: AMD Opteron

[Die photo: latency-tolerance structures added: Out-of-Order Exec, Load/Store Unit, Mem/Coherency, I-Fetch Scan Align, Memory Controller]

SLIDE 17

Example: AMD Opteron

[Die photo: memory and I/O interfaces added: Bus, DDR, HyperTransport (HT)]

SLIDE 18

Example: AMD Opteron

[Die photo, full annotation: the execution units proper, FPU Execution and Int Execution, labeled “COMPUTER”, alongside the latency-avoidance caches, the latency-tolerance structures (Out-of-Order Exec, Load/Store Unit, Mem/Coherency, I-Fetch Scan Align), and the memory and I/O interfaces (Memory Controller, Bus, DDR, HT)]

Thanks to Thomas Sterling

SLIDE 19

A Renaissance in Architecture Research

  • Good news

– Moore’s Law marches on
– Real estate on a chip is essentially free

  • Major paradigm change – huge opportunity for innovation
  • Bad news

– Power considerations limit the improvement in clock speed
– Parallelism is the only viable route to improved performance

  • Current response: multicore processors

– Computation/communication ratio will get worse

  • Makes life harder for applications


  • Long-term consequences unclear
SLIDE 20

Architectural Wish List for Graphs

  • Low latency / high bandwidth

– For small messages!

  • Latency tolerant


  • Light-weight synchronization mechanisms for fine-grained parallelism

  • Global address space

– No graph partitioning required
– Avoid memory-consuming profusion of ghost nodes
– No local/global numbering conversions

  • One machine with these properties is the Cray XMT

– Descendant of the Tera MTA

SLIDE 21

How Does the XMT Work?

  • Latency tolerance via massive multi-threading

– Context switch in a single tick
– Global address space, hashed to reduce hot-spots
– No cache or local memory
– Multiple outstanding loads

  • Remote memory request doesn’t stall processor


– Other streams work while your request gets fulfilled

  • Light-weight, word-level synchronization

– Minimizes conflicts, enables parallelism

  • Flexible dynamic load balancing


  • Slow clock, 400 MHz
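The multithreading strategy above can be quantified with a Little's-law style estimate: enough independent instruction streams must be in flight so that each memory reference's latency is covered by other streams' work. The cycle counts below are hypothetical illustrative numbers, not measured XMT figures:

```python
import math

def streams_to_hide_latency(latency_cycles, work_cycles_per_ref):
    # To keep the processor busy, the latency of each outstanding
    # reference must be overlapped with work from other streams.
    return math.ceil(latency_cycles / work_cycles_per_ref)

# E.g. a 100-cycle memory latency with 1 cycle of work per reference
# needs on the order of 100 concurrent streams.
needed = streams_to_hide_latency(100, 1)
```

This is why a slow 400 MHz clock is acceptable: throughput comes from concurrency, not from racing a single thread against memory.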
SLIDE 22

Case Study: Single Source Shortest Path

  • Parallel Boost Graph Library (PBGL)

– Lumsdaine et al., on Opteron cluster
– Some graph algorithms can scale on some inputs

  • PBGL – MTA-2 comparison on SSSP

[Figure: SSSP time (s) vs. # processors for PBGL SSSP and MTA SSSP]

– Erdős-Rényi random graph (|V|=2^28)
– PBGL SSSP can scale on non-power-law graphs
– Order of magnitude speed difference
– 2 orders of magnitude efficiency difference

  • Big difference in power consumption


– [Lumsdaine, Gregor, H., Berry, 2007]
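For reference, the problem being measured can be stated as code. Below is a textbook serial Dijkstra SSSP in Python; PBGL and the MTA implementation use parallel variants (e.g. delta-stepping), and the small weighted graph is invented for illustration:

```python
import heapq

def sssp(adj, source):
    """Single-source shortest paths over a weighted adjacency list."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                      # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Hypothetical graph: vertex -> list of (neighbor, edge weight) pairs.
adj = {0: [(1, 4), (2, 1)], 1: [(3, 1)], 2: [(1, 2), (3, 5)], 3: []}
dist = sssp(adj, 0)
```

The priority queue serializes this formulation; the parallel codes relax many vertices per step, which is where the architectural differences above dominate.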

SLIDE 23

Longer Term Architectural Opportunities

  • Near future trends

– Multithreading for latency tolerance on commodity processors
– Growing heterogeneity in our compute nodes
– Specialized machines targeting market segments

  • Further out

– Application-specific circuitry?

  • E.g. common scientific kernels

– Reconfigurable hardware?

  • Adapt circuits to the application at run time
  • Disruptive changes to our virtuous circle!


SLIDE 24

Role for HPC in Data Analysis Pipeline

[Pipeline diagram: Compute Cloud → Data Appliance → HPC Platform → Workstation]

  • Graph created from raw data, explored and studied under analyst direction
  • Analysis done at every stage in pipeline

– Data size decreasing downwards
– Algorithmic complexity increasing downwards

  • Data flow not fully unidirectional

– Analysis results get updated in database

  • Human-centric HPC

– Interactive, not batch computing
– Orchestration is immensely complicated
– Usability and human factors are big challenges

SLIDE 25

HPC Analysis System Architecture

[Architecture diagram: HPC Data Server, Cray XMT, MPP]

SLIDE 26

Exemplar: Sandia’s Networks Grand Challenge

  • Large internally funded R&D project
  • Goals:

– Enable exploratory analytics at scale
– Support rich combinations of analytical methods

  • Graph analytics, algebraic methods, statistics, info-viz, …

– Focus on usability and usefulness of capabilities

  • Significant human-factors investment


  • Close collaboration between researchers and target customers

– Create open-source foundation for further R&D

  • Builds on ParaView
SLIDE 27

Use Case for First Prototype: What ‘payloads’ were contained in the network transfers?

[Pipeline diagram: Database → vtkSQLQuery → Domain Mappers → vtkGeoView, vtkTermView*, and vtkConceptTermView*; vtkTableToSparseArray → vtkPARAFAC + more; linked selection across views. (* just vtkGraphViews)]

SLIDE 28

Another Use Case: Analyst needs suspicious (out of the ‘norm’) behaviors “flagged” for further exploration

[Pipeline diagram: Database → vtkSQLQuery → vtkTableView and vtkContingencyStats → vtkTableToGraph → vtkGraphView]

SLIDE 29

Second Prototype

SLIDE 30

Conclusions

  • Organic forces will make HPC more data friendly

– BSP/MPI hegemony is breaking down

  • Multicore, more complex applications (traditional & emerging)

– Memory performance is already the key to HPC

  • External forces will make HPC more data-centric

– Unprecedented opportunities for architectural innovation
– Data-rich applications of growing importance to science
– Opportunities to impact applications currently not on the HPC radar

  • Enormous challenges ahead

– Some will need to be addressed by HPC community anyway
– Cultural challenges may prove the most daunting

SLIDE 31

Thanks

  • Cevdet Aykanat, Jon Berry, Rob Bisseling, Erik Boman,

Bill Carlson, Ümit Çatalyürek, Edmond Chow, Karen Devine, Iain Duff, Danny Dunlavy, Alan Edelman, Jean-Loup Faulon, John Feo, John Gilbert, Assefaw Gebremedhin, Mike Heath, Paul Hovland, Simon Kahan, Pat Knupp, Tammy Kolda, Gary Kumfert, Fredrik Manne, Michael Mahoney, Mike Merrill, Richard Murphy, Esmond Ng, Ali Pınar, Cindy Phillips, Steve Plimpton, Alex Pothen, Robert Preis, Padma Raghavan, Steve Reinhardt, Suzanne Rountree, Rob Schreiber, Viral Shah, Jonathan Shewchuk, Horst Simon, Dan Spielman, Shang-Hua Teng, Sivan Toledo, Keith Underwood, etc.