Modeling Resource-Coupled Computations

slide-1
SLIDE 1

Modeling Resource-Coupled Computations

Mark Hereld
Computation Institute
Mathematics and Computer Science
Argonne Leadership Computing Facility
Argonne National Laboratory
University of Chicago

slide-2
SLIDE 2

Roadmap

  • issues and ideas
  • models and measurements
  • implications and work in progress

slide-3
SLIDE 3

Issue

  • Given increasingly massive (and complex) datasets…
  • how to connect them to computational and display resources that support visualization and analysis?

  • holistic approaches to allocating simulation, analysis, visualization, display, storage, and network resources
  • create and exploit ways to optimally couple these resources in real time

slide-4
SLIDE 4

Common sense

  • Analysis engines must be co-located with simulation engines
  • …or even, analysis code must be co-located with simulation code, i.e., in situ
  • Display resources must be integrated locally with HPC resources
  • In general, wide-area applications will become impossible…
  • But, maybe the situation isn't so dire.

slide-5
SLIDE 5

ideas

  • Ideas
  • Models
  • Measurements
  • Consequences
  • Future


slide-6
SLIDE 6

Mitigation

  • More efficient I/O practices
    – Many (most) inefficiencies in R/W rates amenable to better practices by application developer
    – In addition to improvements in performance of I/O libraries
  • Better data management
    – Better data layout
  • Better brute-force compression methods
    – Uncertainty aware; domain aware
  • Leveraging limitations at the destination
    – Pixel real estate
    – Perceptual limitations (and features)

slide-7
SLIDE 7

Coupled Resources

  • remote visualization: couple data and large computational resources to remote display hardware
  • in situ analysis and visualization: merge simulation and analysis code on single machine
  • co-analysis: couple simulation on supercomputer to live analysis on visualization and analysis platform

slide-8
SLIDE 8

models

  • Ideas
  • Models
  • Measurements
  • Consequences
  • Future


slide-9
SLIDE 9

ALCF Network Architecture (Tbps = terabits/sec)

[Diagram: 40K BGP compute nodes connect over the 4.3 Tbps tree network to 640 BGP I/O nodes (10G MX). The I/O nodes (640 x 10G = 6.4 Tbps), the 100-node Eureka cluster (10GE), and 128 file-server nodes all attach to a 5-stage CLOS Myrinet switch complex, with 10GE<->MX and MX<->MX conversion.]

  • Theoretical max bandwidth from I/O nodes to Eureka (memory to memory): 100 x 10G = 1 Tbps; bi-directional = 2 Tbps
  • Theoretical max bandwidth from I/O nodes to file servers (memory to memory): 128 x 10G = 1.28 Tbps; bi-directional = 2.56 Tbps
  • Theoretical max bandwidth from Eureka to file servers (memory to memory): 1 Tbps; bi-directional = 2 Tbps (see the sketch below)

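A minimal sketch of the arithmetic behind these ceilings, assuming each memory-to-memory path is limited by the smaller endpoint's aggregate of 10 Gbps links into the switch complex (the min-of-endpoints model and the names below are this transcript's gloss, not from the talk):

# Back-of-envelope check of the slide's theoretical bandwidth ceilings.
# Assumed model: each path is limited by the smaller endpoint's
# aggregate of 10 Gbps links into the Myrinet switch complex.

LINK_GBPS = 10  # each 10GE / 10G MX link

endpoint_links = {
    "BGP I/O nodes": 640,   # 640 x 10G = 6.4 Tbps into the switch
    "Eureka":        100,   # 100 nodes x 1 NIC each
    "file servers":  128,   # 128 file-server nodes
}

def path_tbps(a, b):
    # One-directional ceiling: min of the two endpoints' aggregate links.
    return min(endpoint_links[a], endpoint_links[b]) * LINK_GBPS / 1000.0

for a, b in [("BGP I/O nodes", "Eureka"),
             ("BGP I/O nodes", "file servers"),
             ("Eureka", "file servers")]:
    one_way = path_tbps(a, b)
    print(f"{a} <-> {b}: {one_way:.2f} Tbps one-way, "
          f"{2 * one_way:.2f} Tbps bi-directional")
# Prints 1.00, 1.28, and 1.00 Tbps one-way, matching the slide.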
slide-10
SLIDE 10

Data Analytics Resource: Eureka

  • Data analytics and visualization cluster at ALCF
  • (2) head nodes, (100) compute nodes
    – (2) Nvidia Quadro FX5600 graphics cards
    – (2) Xeon E5405 2.00 GHz quad-core processors
    – 32 GB RAM: (8) 4-rank 4 GB DIMMs
    – (1) Myricom 10G CX4 NIC
    – (2) 250 GB local disks: (1) system, (1) minimal scratch
    – 32 GFlops per server (aggregates sketched below)

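For scale, a quick sketch multiplying the per-node specs above into aggregate nameplate numbers (an illustrative calculation by this transcript, not figures quoted in the talk):

# Aggregate nameplate capability of Eureka from the per-node specs.

nodes = 100
gflops_per_node = 32    # 2 x quad-core Xeon E5405 @ 2.00 GHz
ram_gb_per_node = 32
nic_gbps_per_node = 10  # 1 x Myricom 10G CX4 NIC

print(f"peak compute : {nodes * gflops_per_node / 1000:.1f} TFlops")
print(f"total RAM    : {nodes * ram_gb_per_node / 1000:.1f} TB")
print(f"aggregate NIC: {nodes * nic_gbps_per_node / 1000:.1f} Tbps")
# 3.2 TFlops, 3.2 TB of RAM, and 1 Tbps of network -- the same 1 Tbps
# ceiling that appears in the ALCF network diagram on the previous slide.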
slide-11
SLIDE 11

Application

  • FLASH
    – Multi-physics code: gravitation, nuclear chemistry, MHD
    – Laboratory to Universe
  • Multiple (~20) simulations
    – 8 km resolution, 10K to 100K blocks each of (16 x 16 x 16) voxels
    – 2 racks (8K cores) of ANL's Intrepid (BGP)
    – typical simulation is 10 runs of 12 hours each
  • O(hour) per checkpoint cycle
    – 66% of time spent simulating
    – 33% of time spent in non-overlapping I/O

slide-12
SLIDE 12

measurements

  • Ideas
  • Models
  • Measurements
  • Consequences
  • Future


slide-13
SLIDE 13

FLASH I/O for 1 run (12 hours)

  • Total run time = 41,557 secs
    – I/O time during run = 14,325 secs (34% of the time; fractions rechecked in the sketch below)
    – Circa March 2009
  • Particle data:
    – 417 files (0.1 GB each) = 41.7 GB
    – Time spent writing = 9,047 secs (22% of the run time)
  • Plot files:
    – 104 files (2.5 GB each); total = 260 GB
    – Time spent writing = 3,897 secs (9% of the run time)
  • Checkpoint files:
    – 10 files (8 GB each); total = 80 GB
    – Time spent writing = 1,144 secs (3% of the run time)
0me)

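The quoted fractions can be rechecked directly from the raw seconds; a small sketch (the effective write rates are this transcript's derived figures, not numbers from the talk):

# Recompute the I/O fractions on this slide from the raw seconds.

total_run_s = 41_557
io = {
    "particle":   (9_047, 41.7),    # secs writing, GB written
    "plot":       (3_897, 260.0),
    "checkpoint": (1_144, 80.0),
}

for kind, (secs, gb) in io.items():
    frac = secs / total_run_s
    rate_mb_s = gb * 1000 / secs    # decimal MB/s, derived here
    print(f"{kind:>10}: {frac:5.1%} of run, ~{rate_mb_s:5.1f} MB/s effective")

total_io_s = sum(secs for secs, _ in io.values())   # 14,088 s
print(f"  total I/O: {total_io_s / total_run_s:5.1%} of run")
# ~21.8% + ~9.4% + ~2.8% = ~33.9%, consistent with the slide's
# 14,325 s (34%); the small gap is I/O time not attributed to a file type.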
slide-14
SLIDE 14

FLASH Supernova Explosion Project

  • multiple (~20) simulations
    – 8 km resolution
    – 10K to 100K blocks each of (16 x 16 x 16) voxels
    – 2 racks (8K cores) of ANL's Intrepid (BGP)
    – typical simulation is 10 runs of 12 hours each
    – Circa November 2009

  =========================================================
  File Type    File Size   #files/Run   #files/Sim   Data Size
  =========================================================
  Particle     ~131 MB     ~500         5000         500 GB
  Plot         ~13 GB      40-90        800          10 TB
  Checkpoint   ~42 GB      5-10         100          4.2 TB
  =========================================================
  (campaign totals from this table are worked out below)
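Using the rounded per-simulation totals in the last column, a quick estimate of the full campaign's data volume (the multiplication over ~20 simulations is this transcript's extrapolation from the slide's numbers):

# Campaign-scale totals from the table's rounded Data Size column,
# extrapolated over the ~20 planned simulations.

per_sim_gb = {"particle": 500, "plot": 10_000, "checkpoint": 4_200}
n_sims = 20

per_sim_total = sum(per_sim_gb.values())   # 14,700 GB per simulation
campaign = n_sims * per_sim_total
print(f"per simulation : {per_sim_total / 1000:.1f} TB")   # ~14.7 TB
print(f"~{n_sims} simulations: {campaign / 1000:.0f} TB")  # ~294 TB
# Roughly 15 TB per simulation and ~300 TB overall -- the 'increasingly
# massive datasets' that motivate resource-coupled analysis.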
slide-15
SLIDE 15

Internal Network Experiments

[Diagram: BGP compute nodes -> tree network -> BGP I/O node -> switch -> analysis node]

slide-16
SLIDE 16

Toward middleware to facilitate co-analysis

[Diagram: BGP compute nodes]

slide-17
SLIDE 17

consequences

  • Ideas
  • Models
  • Measurements
  • Consequences
  • Future


slide-18
SLIDE 18

Map Intrepid I/O to Eureka

  • Speed up the application
    – Offload data organization and disk writes (speedup estimate below)
  • Free co-analysis
    – Produce several high-resolution movies
    – Data compression
    – Multi-time-step caching for window analysis
  • Eureka is an accelerator and co-analysis engine at only 1-2% of the cost of Intrepid

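A back-of-envelope estimate of the "speed up the application" claim, assuming the 66% simulate / 33% I/O checkpoint-cycle split from the Application slide and that offloaded writes overlap completely with computation (an idealized bound, not a measured result):

# Idealized speedup from offloading FLASH's non-overlapping I/O
# to Eureka, assuming perfect overlap after the offload.

sim_frac = 0.66   # fraction of wall time spent simulating

# With writes handed off to Eureka and hidden behind computation,
# the run shrinks to its simulation portion alone:
speedup = 1.0 / sim_frac
print(f"ideal speedup from offload: {speedup:.2f}x")   # ~1.52x

# The slide's cost claim: this acceleration comes from a machine
# that is only 1-2% of the cost of Intrepid.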
slide-19
SLIDE 19

future

  • Ideas
  • Models
  • Measurements
  • Consequences
  • Future


slide-20
SLIDE 20

Works in Progress

  • Footprints
    – System-level use pattern data collection
    – Booting up a mini-consortium of resource monitoring enthusiasts
  • in situ
    – Papka parallel software rendering
    – Tom Peterka and Rob Ross scaling software rendering algorithms
    – HW-SW rendering comparison experiments
  • Co-analysis
    – StarGate experiments
    – Intrepid <> Eureka communication experiments
    – FLASH test
  • Remote Visualization
    – Pixel shipping experiments and frameworks

slide-21
SLIDE 21

[Figure: log-log plots of time (secs) vs. number of processors, each panel showing Full Frame Time, Render Time, Composite Network Time, Composite Render Time, and Sync State Time. Panels: Eureka Rendering Times at 256x256x256, 512x512x512, 1024x1024x1024, and 2048x2048x2048; Surveyor Rendering Times at 256x256x256 and 512x512x512.]
slide-22
SLIDE 22

Wide Area Experiments

Simulation
  • 4K uniform grid cube
  • Single variable, float
  • 257 GB per time step
  • 577 time steps
  • 150 TB total (transfer arithmetic below)

Visualization
  • Volume rendering
  • 4K x 4K pixels

Interactive Display
  • Large tiled display
  • Navigation
  • Manipulation

[Diagram: RAW DATA, RESULTS, and CONTROL links between the simulation, visualization, and display stages]
DETAILS AND DEMO IN SDSU BOOTH

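A sketch of the data-movement arithmetic for this experiment; the 10 Gbps wide-area link below is a hypothetical figure chosen for illustration, not one stated in the talk:

# Data-movement arithmetic for the wide-area experiment.

step_gb = 257           # one time step: 4K cube, one float variable
n_steps = 577

total_tb = step_gb * n_steps / 1000
print(f"dataset size : {total_tb:.0f} TB")    # ~148 TB, i.e. the ~150 TB quoted

wan_gbps = 10           # assumed dedicated wide-area link
step_s = step_gb * 8 / wan_gbps               # GB -> gigabits, then / Gbps
print(f"one time step: {step_s / 60:.1f} min over {wan_gbps} Gbps")  # ~3.4 min
print(f"full dataset : {step_s * n_steps / 3600:.0f} hours")         # ~33 h
# Shipping the raw cube is painful even on a fast dedicated link --
# hence the emphasis on moving rendered results and pixels instead.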
slide-23
SLIDE 23

Summary

  • Discussion of the issues, with illuminating example
    – Presumed impending Doom outlined
  • Discussion of the ideas, with examples
    – Resource-coupled computations
      • In situ couples simulation and analysis in real time on shared compute
      • Remote vis couples compute and data resources to remote display clients
      • Co-analysis couples two compute resources in real time
  • Discussion of the work in progress, with status
    – Suite of experiments underway to characterize system components
    – Strawman use cases in place provide challenging and exciting goals
    – Stunning results and paradigm shifts forthcoming

slide-24
SLIDE 24

Acknowledgements

  • Venkat Vishwanath
  • Michael Papka
  • Eric Olson
  • Joe Insley
  • Tom Uram
  • Tom Peterka
  • Rob Ross
  • Rick Stevens
  • Rick Wagner, UCSD
  • Michael Norman, UCSD
  • Robert Harkess, UCSD
  • Narayan Desai
  • David Ressman
  • William Scullin
  • Loren Wilson
  • Linda Winkler
  • ESNET2
slide-25
SLIDE 25

end

  • Questions?