SLIDE 1

Germán Llort gllort@bsc.es

IPDPS - Atlanta, April 2010

SLIDE 2

  • >10k processes + long runs = large traces
  • Blind tracing is not an option
  • Profilers also start presenting issues
  • Can you even store the data?
  • How patient are you?
SLIDE 3

  • Past methodology: Filters driven by the expert
  • Get the whole trace
  • Summarize for a global view
  • Focus on a representative region
  • Goal: Transfer the expertise to the run-time
SLIDE 4

  • Traces of “100 MB”
  • Best describe the application behavior
  • Trade-off: Maximize information / data ratio
  • The challenge?
  • Intelligent selection of the information
  • How?
  • On-line analysis framework

– Decide at run-time what is most relevant

SLIDE 5

  • Data acquisition
  • MPItrace (BSC)

– PMPI wrappers (sketched below)

  • Data transmission
  • MRNet (U. of Wisconsin)

– Scalable master / worker
– Tree topology

  • Data analysis
  • Clustering (BSC)

– Find structure of computing regions

[Figure: MRNet reduction network. MPItrace attaches to application tasks T0…Tn; the clustering analysis runs at the MRNet front-end.]
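A minimal sketch of the PMPI interception mechanism referenced above, assuming a hypothetical record_event helper in place of MPItrace's real buffering and hardware-counter reads; it is illustrative, not MPItrace's actual code.

    /* Minimal PMPI wrapper sketch: defining MPI_Send shadows the MPI library's
     * symbol, and the real implementation is reached through PMPI_Send. */
    #include <mpi.h>
    #include <stdio.h>

    /* stand-in for appending an event to the local trace buffer */
    static void record_event(const char *what, double t)
    {
        fprintf(stderr, "[trace] %s at t=%.6f s\n", what, t);
    }

    int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        record_event("MPI_Send enter", MPI_Wtime());
        int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);
        record_event("MPI_Send exit", MPI_Wtime());
        return rc;
    }

In practice such a wrapper library is linked or preloaded so that it shadows the MPI symbols, which is how the tracer attaches without modifying the application.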

SLIDE 6

  • Local trace buffers
  • Back-end (BE) threads blocked
  • Front-end (FE) periodically collects data (control flow sketched below)
  • Automatic / fixed interval
  • Reduction on tree
  • Global analysis
  • Propagate results
  • Locally emit trace events

[Figure: collection protocol. Back-end threads aggregate data up the MRNet tree; the front-end runs the clustering analysis and broadcasts results back down to tasks T0…Tn.]
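A compact control-flow sketch of the collection protocol above; every type and function here is a hypothetical placeholder for the MRNet/MPItrace machinery, shown only to make the ordering of the steps explicit.

    /* Front-end loop at the root of the MRNet tree. Placeholder types and
     * functions; real aggregation and clustering live in MPItrace + MRNet. */
    #include <stdbool.h>

    typedef struct { int n_bursts; }   snapshot_t;  /* aggregated counter data */
    typedef struct { int n_clusters; } clusters_t;  /* result of the analysis  */

    static bool       app_finished(void)              { return true; /* stub */ }
    static void       wait_collection_interval(void)  { /* automatic or fixed */ }
    static snapshot_t gather_on_tree(void)            { snapshot_t s = {0}; return s; }
    static clusters_t run_clustering(snapshot_t s)    { clusters_t c = {0}; (void)s; return c; }
    static void       broadcast_results(clusters_t c) { (void)c; /* down the tree */ }

    void frontend_loop(void)
    {
        while (!app_finished()) {
            wait_collection_interval();             /* back-end threads stay blocked   */
            snapshot_t data = gather_on_tree();     /* reduction on the tree           */
            clusters_t res  = run_clustering(data); /* global analysis at the FE       */
            broadcast_results(res);                 /* leaves emit local trace events  */
        }
    }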

SLIDE 7

  • Density-based clustering algorithm (see the sketch at the end of this slide)
  • J. Gonzalez, J. Gimenez, J. Labarta – IPDPS'09

“Automatic detection of parallel applications computation phases”

  • Characterize structure of computing regions
  • Using hardware counters data
  • Instructions + IPC

– Complexity & Performance

  • Any other metric

– e.g. L1, L2 cache misses
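A brute-force sketch of density-based clustering (DBSCAN-style) over the per-burst hardware-counter metrics named above. The cited IPDPS'09 work defines its own DBSCAN variant; the data layout, parameters and O(n²) neighbour search below are illustrative assumptions.

    /* Each computing burst is a point in the (Instructions, IPC) plane; dense
     * groups of points become clusters, sparse points are noise. */
    #include <stddef.h>

    #define UNCLASSIFIED -2
    #define NOISE        -1

    typedef struct { double instr; double ipc; } burst;

    static double dist2(burst a, burst b)
    {
        double di = a.instr - b.instr, dp = a.ipc - b.ipc;
        return di * di + dp * dp;
    }

    /* Indices of all points within 'eps' of point i (includes i itself). */
    static size_t neighbours(const burst *p, size_t n, size_t i, double eps,
                             size_t *out)
    {
        size_t k = 0;
        for (size_t j = 0; j < n; j++)
            if (dist2(p[i], p[j]) <= eps * eps)
                out[k++] = j;
        return k;
    }

    /* Labels each burst with a cluster id (>= 0) or NOISE. 'seeds' and 'nbr'
     * are caller-provided scratch arrays of n entries each. */
    void dbscan(const burst *p, size_t n, double eps, size_t min_pts,
                int *label, size_t *seeds, size_t *nbr)
    {
        for (size_t i = 0; i < n; i++) label[i] = UNCLASSIFIED;
        int cluster = 0;
        for (size_t i = 0; i < n; i++) {
            if (label[i] != UNCLASSIFIED) continue;
            size_t ns = neighbours(p, n, i, eps, seeds);
            if (ns < min_pts) { label[i] = NOISE; continue; } /* not a core point */
            for (size_t s = 0; s < ns; s++)
                if (label[seeds[s]] < 0)          /* unclassified or noise */
                    label[seeds[s]] = cluster;
            /* grow the cluster from every density-reachable point */
            for (size_t s = 0; s < ns; s++) {
                size_t nq = neighbours(p, n, seeds[s], eps, nbr);
                if (nq < min_pts) continue;       /* border point, do not expand */
                for (size_t t = 0; t < nq; t++) {
                    size_t r = nbr[t];
                    if (label[r] == UNCLASSIFIED)
                        seeds[ns++] = r;          /* newly reachable point       */
                    if (label[r] < 0)
                        label[r] = cluster;       /* claim unclassified / noise  */
                }
            }
            cluster++;
        }
    }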

SLIDE 8

[Figures: scatter plot of the clustering metrics, clusters distribution over time, clusters performance, code linking]

SLIDE 9

  • Trigger clustering analysis periodically
  • Sequence of structure snapshots
  • Compare subsequent clusterings
  • See changes in the application behavior
  • Find a representative region
  • Most applications are highly iterative
SLIDE 10

  • Compare 2 clusterings, cluster by cluster (see the sketch below)
  • Inscribe clusters into a rectangle
  • Match those that overlap with a 5% variance
  • Matched clusters must cover 85% of the total computing time
  • Stability = N equivalent clusterings “in-a-row”
  • Keep on looking for differences
  • Gradually lower the requirements if they cannot be met
  • Best possible region based on “seen” results

[Figure: examples of matching (OK) and non-matching (KO) clusters]
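One possible reading of the comparison heuristic above, as a sketch: each cluster is inscribed in a bounding rectangle in the clustering metrics, rectangles match when they overlap within a 5% tolerance, and two clusterings are equivalent when the matched clusters cover 85% of the computing time. The struct layout, the exact overlap test and the parameterisation are assumptions.

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct {
        double min_x, max_x;    /* bounding box in the clustering metrics, */
        double min_y, max_y;    /* e.g. Instructions (x) and IPC (y)       */
        double time_share;      /* fraction of total computing time, 0..1  */
    } cluster_box;

    /* Two boxes match if they overlap once each edge is relaxed by 'tol'
     * (5% of the box extent). */
    static bool boxes_match(cluster_box a, cluster_box b, double tol)
    {
        double dx = tol * (a.max_x - a.min_x), dy = tol * (a.max_y - a.min_y);
        return a.min_x - dx <= b.max_x && b.min_x <= a.max_x + dx &&
               a.min_y - dy <= b.max_y && b.min_y <= a.max_y + dy;
    }

    /* Two clusterings are equivalent if matched clusters cover at least
     * 'coverage' (85%) of the total computing time. */
    bool clusterings_equivalent(const cluster_box *prev, size_t n_prev,
                                const cluster_box *cur, size_t n_cur,
                                double tol, double coverage)
    {
        double covered = 0.0;
        for (size_t j = 0; j < n_cur; j++)
            for (size_t i = 0; i < n_prev; i++)
                if (boxes_match(prev[i], cur[j], tol)) {
                    covered += cur[j].time_share;
                    break;
                }
        return covered >= coverage;
    }

    /* Stability: stop refining once N equivalent clusterings arrive in a row. */
    bool update_stability(int *in_a_row, bool equivalent, int required)
    {
        *in_a_row = equivalent ? *in_a_row + 1 : 0;
        return *in_a_row >= required;
    }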

SLIDE 11

  • 60 MB, 6 iterations
SLIDE 12

  • Clustering time grows with the number of points
  • 5k pts → 10 sec, 50k pts → 10 min
  • Sample a subset of data to cluster (SDBScan)
  • Space: Select a few processes. Full time sequence.
  • Time: Random sampling. Wide coverage.
  • Classify remaining data (see the sketch below)
  • Nearest neighbor algorithm

– Reusing clustering structures
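A sketch of the classification step: bursts left out of the sampled clustering set take the cluster id of their nearest already-clustered burst. The struct, the plain Euclidean distance over (Instructions, IPC) and the brute-force search are simplifying assumptions.

    #include <stddef.h>

    typedef struct { double instr; double ipc; int cluster; } sample_t;

    static double dist2(sample_t a, sample_t b)
    {
        double di = a.instr - b.instr, dp = a.ipc - b.ipc;
        return di * di + dp * dp;
    }

    /* Return the cluster id of the nearest point in the clustered sample set.
     * Assumes n > 0. */
    int classify_nearest(sample_t p, const sample_t *clustered, size_t n)
    {
        size_t best = 0;
        for (size_t i = 1; i < n; i++)
            if (dist2(p, clustered[i]) < dist2(p, clustered[best]))
                best = i;
        return clustered[best].cluster;
    }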

SLIDE 13

[Figures: clustering quality with all processes vs. 32, 16 and 8 representatives combined with 25%, 15% and 10% random records. Using 8 representatives + 15% random records keeps good quality with a fast analysis: 75% less data, clustering time down to 6 s from 2 min.]

SLIDE 14

  • Important trace size reductions
  • Results before the application finishes
  • Final trace is representative
SLIDE 15

  • Compared vs. Profiles for the whole run
  • TAU Performance System (U. of Oregon)
  • Same overall structure
  • Same relevant functions, Avg. HWCs & Time %
  • Most measurement differences under 1%

GROMACS user functions:

                       Full run profile (TAU)        Trace segment (MPItrace)
                       % Time   Kinstr   Kcycles     % Time   Kinstr   Kcycles
  do_nonbonded         23.72%   24,709   22,349      23.94%   24,700   22,533
  solve_pme            10.47%    6,795    9,913      10.52%    6,776    9,898
  gather_f_bsplines     5.69%    5,286    5,387       5.64%    5,248    5,302

SLIDE 16

[Figure: ∑ % time covered by matched clusters]

SLIDE 17

  • Study load balancing

[Figures: IPC imbalance, Instructions imbalance]

SLIDE 18

  • Initial development
  • All data centralized
  • Sampling, clustering & classification at front-end
  • Bad scaling at large processor counts
  • >10k tasks
  • Sampling at leaves
  • Only the sampled clustering set is assembled centrally
  • Broadcast clustering results, classify at leaves
SLIDE 19

  • On-line automatic analysis framework
  • Identify structure and see how it evolves
  • Determine a representative region
  • Detailed small trace + Periodic reports
  • Reductions in the time dimension
  • Scalable infrastructure supports other analyses
  • Current work
  • Spectral analysis (M. Casas): Better delineate the traced region
  • Parallel clustering in the tree
  • Finer stability heuristic
SLIDE 20