
Slide 1

Performance Analysis and Prediction for distributed homogeneous Clusters

Heinz Kredel, Hans-Günther Kruse, Sabine Richling, Erich Strohmaier

IT-Center, University of Mannheim, Germany
IT-Center, University of Heidelberg, Germany
Future Technology Group, LBNL, Berkeley, USA

ISC’12, Hamburg, 18. June 2012

Kredel, Kruse, Richling, Strohmaier (ISC’12) Performance Analysis Hamburg, June 2012 1 / 26

Slide 2

Outline

1. Background and Motivation
   - D-Grid and bwGRiD
   - bwGRiD Mannheim/Heidelberg
   - Next generation bwGRiD

2. Performance Modeling
   - The Roofline Model
   - Analysis of a single Region
   - Analysis of two identical interconnected Regions
   - Application to bwGRiD

3. Conclusions

Slide 3

Background and Motivation D-Grid and bwGRiD

D-Grid and bwGRiD

bwGRiD Virtual Organization (VO)
- Community project of the German Grid Initiative D-Grid
- Project partners are the universities in Baden-Württemberg

bwGRiD Resources
- Compute clusters at 8 locations
- Central storage unit in Karlsruhe

bwGRiD Objectives
- Verifying the functionality and the benefit of Grid concepts for the HPC community in Baden-Württemberg
- Managing organizational, security, and license issues
- Development of new cluster and Grid applications

Slide 4

Background and Motivation D-Grid and bwGRiD

bwGRiD – Resources

Compute Cluster

Site            Nodes
Mannheim          140
Heidelberg        140
Karlsruhe         140
Stuttgart         420
Tübingen          140
Ulm/Konstanz      280
Freiburg          140
Esslingen         180
Total            1580

Central Storage

with backup      128 TB
without backup   256 TB
Total            384 TB

[Map of the sites in Baden-Württemberg: Mannheim and Heidelberg are interconnected to a single cluster; Ulm operates a joint cluster with Konstanz.]


Slide 5

Background and Motivation bwGRiD Mannheim/Heidelberg

bwGRiD MA/HD – Hardware

Hardware               Mannheim   Heidelberg   Total
Blade Center                 10           10      20
Blades (Nodes)              140          140     280
CPUs (Cores)               1120         1120    2240
Login Server                  2            2       4
Admin Server                  1            –       1
InfiniBand Switches           1            1       2
HP Storage System         32 TB        32 TB   64 TB

Blade Configuration
- 2 Intel Xeon CPUs, 2.8 GHz (each CPU with 4 cores)
- 16 GB memory
- 140 GB hard drive (since January 2009)
- Gigabit Ethernet (1 Gbit)
- InfiniBand network (20 Gbit)

Slide 6

Background and Motivation bwGRiD Mannheim/Heidelberg

bwGRiD MA/HD – Overview

[Diagram: Mannheim and Heidelberg clusters, 140 nodes each, every cluster with its own InfiniBand network, Lustre storage system (bwFS MA / bwFS HD), login servers, and directory service (LDAP in Mannheim, AD in Heidelberg); a shared admin server, PBS batch system, and passwd; the two InfiniBand fabrics are coupled over 28 km by a pair of Obsidian range extenders; users (User MA, User HD) log in at their local site.]

Slide 7

Background and Motivation bwGRiD Mannheim/Heidelberg

bwGRiD MA/HD – Interconnection

Network Technology
- InfiniBand over Ethernet over fibre optics (28 km)
- 2 Obsidian Longbow range extenders (150 TEUR)

MPI Performance
- Latency is high: 145 µs = 143 µs light transit time + 2 µs
- Bandwidth is as expected: 930 MB/sec (local: 1200-1400 MB/sec)

Operating Considerations
- Operating the two clusters as a single system image
- Fast InfiniBand interconnection to the storage systems
- MPI performance not sufficient for all kinds of parallel jobs → keep all nodes of a job on one side
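The quoted latency is dominated by the speed of light in the fibre. A quick back-of-the-envelope check (the effective refractive index 1.53 is an assumption chosen to match the quoted figure; typical fibre values are close to this):

```python
# Back-of-the-envelope check of the 143 µs light transit time over 28 km of fibre.
C = 2.998e8          # speed of light in vacuum, m/s
N_FIBRE = 1.53       # assumed effective refractive index of the fibre
length_m = 28e3      # Mannheim-Heidelberg fibre length

transit_us = length_m * N_FIBRE / C * 1e6
print(round(transit_us))   # ~143 µs, dwarfing the ~2 µs of switching latency
```

This makes clear that no hardware upgrade can reduce the inter-region latency; only the bandwidth can improve.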

Slide 8

Background and Motivation Next generation bwGRiD

Next generation bwGRiD

Questions
- What bandwidth is required to allow all parallel jobs to run across two cluster regions?
- Is the expected bandwidth of the new system sufficient?
- Is there an optimal size for a cluster region?

Performance Characteristics      bwGRiD 1         bwGRiD 2
Bandwidth between two nodes      1.5 GByte/sec    6 GByte/sec
Bandwidth between two regions    1.0 GByte/sec    15 – 45 GByte/sec
Performance of a single core     8.5 GFlop/sec    10 – 16 GFlop/sec

Slide 9

Performance Modeling

Outline

1. Background and Motivation
   - D-Grid and bwGRiD
   - bwGRiD Mannheim/Heidelberg
   - Next generation bwGRiD

2. Performance Modeling
   - The Roofline Model
   - Analysis of a single Region
   - Analysis of two identical interconnected Regions
   - Application to bwGRiD

3. Conclusions

Slide 10

Performance Modeling The Roofline Model

(Slide from EECS – Electrical Engineering and Computer Sciences, Berkeley Par Lab)
Basic Roofline

Performance is upper-bounded by both the peak flop rate and the product of streaming bandwidth and the flop:byte ratio:

  Gflop/s = min(Peak Gflop/s, Stream BW × actual flop:byte ratio)
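The roofline bound above is just a min of two terms; a minimal numeric sketch (the peak and bandwidth figures below are assumed for illustration, not measured values from the talk):

```python
# Minimal sketch of the basic roofline bound.
def roofline_gflops(peak_gflops: float, stream_bw_gbs: float,
                    flop_per_byte: float) -> float:
    """Attainable Gflop/s = min(peak Gflop/s, stream BW * flop:byte ratio)."""
    return min(peak_gflops, stream_bw_gbs * flop_per_byte)

peak, bw = 75.0, 15.0                     # assumed machine: 75 Gflop/s, 15 GB/s
low = roofline_gflops(peak, bw, 0.25)     # memory-bound kernel
high = roofline_gflops(peak, bw, 16.0)    # compute-bound kernel
print(low, high)                          # 3.75 75.0
```

Below the ridge point (arithmetic intensity peak/BW = 5 flop/byte here) the bandwidth term wins; above it, the peak flop rate caps performance.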

Slide 11

Performance Modeling The Roofline Model

Roofline model for the AMD Opteron 2356 (Barcelona), adding ceilings.

[Figure: attainable Gflop/s (4-128, log scale) vs. flop:DRAM byte ratio (1/8-16); the peak roofline is based on the manual's single-precision peak and a hand-tuned stream read for the bandwidth.]

Slide 12

Performance Modeling The Roofline Model

A Performance Model based on the Roofline Model

Roofline principles: bottleneck analysis, bounded by the peak flop rate and the measured bandwidth.

The following steps are used to develop a performance model for single and multiple regions:
- Transform basic scales to dimensionless quantities to arrive at a universal scaling law
- Assume optimal floating-point operations and scaling with system size
- Introduce effective bandwidth scaling with system size
- Formulate the result with dimensionless code-to-system balance factors

Slide 13

Performance Modeling The Roofline Model

Performance Model – Overall System Abstraction

Hardware
- Region I1: number of cores n, core performance l_th, bandwidth b_I
- Region I2: number of cores n, core performance l_th, bandwidth b_I
- The two regions are coupled with aggregate bandwidth B_E

Application (Load)
- #op: number of arithmetic operations, performed on
- #b: number of bytes (data)

Slide 14

Performance Modeling Analysis of a single Region

Analysis of a single Region

Total time = Computation time + Communication time. With ideal floating-point operations the total time t_V is:

  t_V ~ #op/d_th + #b/b_I = (#op/d_th) · (1 + (#b/#op)·(d_th/b_I))        (additive)
  t_V ~ max(#op/d_th, #b/b_I) = (#op/d_th) · max(1, (#b/#op)·(d_th/b_I))  (overlapping)

Identify a code-to-system balance factor x based on:
- a = #op/#b: arithmetic intensity (roofline model, Williams et al. 2009)
- a* = d_th/b_I: operational balance ("architectural intensity")

  x = a/a* = (#op/#b) / (d_th/b_I) = (#op · b_I) / (#b · d_th)

Throughput:

  d = #op/t_V ≤ d_th · x/(x+1)     (additive)
  d = #op/t_V ≤ d_th · min(1, x)   (overlapping)
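The two throughput bounds can be sketched directly (a minimal model with assumed workload numbers, not the authors' code):

```python
# Single-region throughput bounds: d <= d_th * x/(x+1) (additive) and
# d <= d_th * min(1, x) (overlapping), with x = (#op/#b) * (b_I/d_th).

def balance_factor(n_op: float, n_b: float, b_i: float, d_th: float) -> float:
    """Code-to-system balance x = a / a* = (#op/#b) / (d_th/b_I)."""
    return (n_op / n_b) * (b_i / d_th)

def throughput_additive(d_th: float, x: float) -> float:
    return d_th * x / (x + 1)

def throughput_overlapping(d_th: float, x: float) -> float:
    return d_th * min(1.0, x)

# Assumed load: 1e12 ops on 1e9 bytes; b_I = 1.5 GB/s, d_th = 100 GFlop/s.
x = balance_factor(1e12, 1e9, 1.5e9, 100e9)    # x = 15: compute-bound
print(throughput_additive(100e9, x) / 1e9)     # 93.75 GFlop/s
print(throughput_overlapping(100e9, x) / 1e9)  # 100.0 GFlop/s
```

For x >> 1 both bounds approach the peak d_th; for x << 1 both collapse to the bandwidth-limited d_th · x.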

Slide 15

Performance Modeling Analysis of a single Region

Single Region – Throughput

Throughput d for additive (green) and overlapping (red) concepts.


Slide 16

Performance Modeling Analysis of a single Region

Single Region – Speed-up

Ideal floating-point throughput d_th = n · l_th and effective bandwidth scaling z = b_I/b_I0 with a reference bandwidth b_I0 give:

  x = (#op/#b) · (b_I/d_th) = (1/n) · (#op/#b) · (b_I0/l_th) · (b_I/b_I0) = x′ · z/n

where x′ is the balance factor of the core (or node, unit, ...). The parallel speed-up is then:

  S_p = d(n)/d(1) = (1 + x′z) / (1 + x′z/n) → 1 + x′z   (n → ∞)
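The saturation behaviour of this speed-up formula is easy to see numerically (a sketch with illustrative x′ and z, not values fitted to bwGRiD):

```python
# Single-region parallel speed-up S_p = (1 + x'z) / (1 + x'z/n),
# which saturates at 1 + x'z as n -> infinity.

def speedup_single(n: float, x_prime: float, z: float) -> float:
    return (1 + x_prime * z) / (1 + x_prime * z / n)

x_p, z = 100.0, 1.0                 # assumed illustrative values
print(speedup_single(10, x_p, z))   # well below saturation
print(speedup_single(1e9, x_p, z))  # approaches 1 + x'z = 101
```

So the product x′z, not the core count, caps the achievable speed-up of one region.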

Slide 17

Performance Modeling Analysis of a single Region

Single Region – Speed-up

Speed-up S_p for different values of x′ and z.

Slide 18

Performance Modeling Analysis of two identical interconnected Regions

Analysis of two interconnected Regions

Total time = Time (1 region, 1/2 computational load) + Communication time between regions. For #x bytes exchanged over a channel of bandwidth B_E:

  t_V ~ t_V^(1) + #x/B_E ≥ (#op/2)/d_th · (1 + a*/a) + #x/B_E

Throughput:

  d ≤ 2·d_th / (1 + a*/a + 2·(d_th/B_E)·(#x/#op))
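This bound can be evaluated directly; a sketch with assumed numbers (they are illustrative, not the bwGRiD measurements):

```python
# Two-region throughput bound:
# d <= 2*d_th / (1 + a*/a + 2*(d_th/B_E)*(#x/#op))

def throughput_two_regions(d_th, a, a_star, n_op, n_x, B_E):
    return 2 * d_th / (1 + a_star / a + 2 * (d_th / B_E) * (n_x / n_op))

# Assumed: d_th = 100 GFlop/s per region, a = 15 flop/byte, a* = 1,
# 1e12 ops with 1e9 bytes exchanged between regions over B_E = 1 GB/s.
d = throughput_two_regions(100e9, 15.0, 1.0, 1e12, 1e9, 1e9)
print(d / 1e9)   # GFlop/s; the inter-region term adds visible overhead
```

The extra term 2·(d_th/B_E)·(#x/#op) is what distinguishes this from the single-region bound: a larger region (larger d_th) needs proportionally more inter-region bandwidth B_E to keep it small.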

Slide 19

Performance Modeling Analysis of two identical interconnected Regions

Two Regions – Speed-up

Balance factors within (x′) and between (y′) regions:

  x = a/a* = x′/n
  y = (#op/#x) · (B_E/(2·d_th)) = (1/2) · (x′/n) · (#b/#x) · (B_E/b_I) = y′/n

The interconnection is a shared medium with a constant aggregate bandwidth B_E and an effective load factor p(n):

  b_E = B_E/p(n)

This gives for the overall speed-up:

  S_p2 = (x′ + y′ + x′y′) / (p(n)·x′ + y′ + x′y′/n) → 0   (n → ∞)
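Unlike the single-region case, this speed-up eventually decays because the contention p(n) grows with n; a sketch with assumed parameter values:

```python
# Two-region speed-up S_p2 = (x' + y' + x'y') / (p(n)x' + y' + x'y'/n),
# with bandwidth contention p(n); vanishes as n -> infinity.

def speedup_two(n: float, x_p: float, y_p: float, p) -> float:
    return (x_p + y_p + x_p * y_p) / (p(n) * x_p + y_p + x_p * y_p / n)

p = lambda n: n / 20               # the contention model used later for bwGRiD
print(speedup_two(100, 100.0, 50.0, p))   # x', y' here are assumed values
print(speedup_two(1e6, 100.0, 50.0, p))   # -> 0 for large n
```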

Slide 20

Performance Modeling Analysis of two identical interconnected Regions

Two Regions – Speed-up

Speed-up S_p2 for different values of x′.

Slide 21

Performance Modeling Analysis of two identical interconnected Regions

Two Regions – Speed-up

Focus on application and interconnection bandwidth:

  z′ = 2y′/x′ = r · z′′  with  r = #b/#x  and  z′′ = B_E/b_I

z′ is the ratio of the balance factor "between regions" to the one "between cores" and should be as large as possible. The overall speed-up can be rewritten as:

  S_p2 = (2 + (1 + x′)z′) / (2·p(n) + (1 + x′/n)·z′) ≤ x′z′ / (2·p(n) + x′z′/n)
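With y′ = x′z′/2 the rewritten form is algebraically identical to the original expression from the previous slide; a quick numeric cross-check (both functions are sketches of the model, with assumed parameter values):

```python
# Numeric check that the rewritten speed-up agrees with the original form
# when y' = x'z'/2.

def sp2_original(n, x_p, y_p, p_n):
    return (x_p + y_p + x_p * y_p) / (p_n * x_p + y_p + x_p * y_p / n)

def sp2_rewritten(n, x_p, z_p, p_n):
    return (2 + (1 + x_p) * z_p) / (2 * p_n + (1 + x_p / n) * z_p)

n, x_p, z_p, p_n = 200, 100.0, 3.0, 10.0   # assumed illustrative values
y_p = x_p * z_p / 2
print(sp2_original(n, x_p, y_p, p_n))
print(sp2_rewritten(n, x_p, z_p, p_n))     # identical by construction
```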

Slide 22

Performance Modeling Analysis of two identical interconnected Regions

Two Regions – Speed-up

Speed-up S_p2 for x′ = 100 with increasing bandwidth b_E (and consequently z′) and an assumed p(n) = n/20.

Slide 23

Performance Modeling Analysis of two identical interconnected Regions

Two Regions – Max. Speedup

Maximum speed-up of S_p2 for linear p(n) = αn as a function of the bandwidth ratio z′.
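For linear contention p(n) = αn the optimal region size has a closed form, since only the denominator of S_p2 depends on n; a sketch with assumed α, x′, z′ that compares the closed form against a brute-force scan:

```python
# Maximum of S_p2(n) for linear contention p(n) = alpha*n.
# The denominator 2*alpha*n + z' + x'z'/n is minimal at
# n* = sqrt(x'z' / (2*alpha)).
import math

def sp2(n, x_p, z_p, alpha):
    return (2 + (1 + x_p) * z_p) / (2 * alpha * n + (1 + x_p / n) * z_p)

x_p, z_p, alpha = 100.0, 3.0, 0.05   # assumed illustrative values
n_star = math.sqrt(x_p * z_p / (2 * alpha))
best_scan = max(sp2(n, x_p, z_p, alpha) for n in range(1, 2001))
print(n_star, sp2(n_star, x_p, z_p, alpha), best_scan)
```

The resulting n* is the upper bound on useful region size mentioned in the conclusions: beyond it, contention outgrows the benefit of more cores.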

Slide 24

Performance Modeling Application to bwGRiD

Application to bwGRiD

Performance Characteristics             bwGRiD 1         bwGRiD 2
Bandwidth between two nodes b_I         1.5 GByte/sec    6 GByte/sec
Bandwidth between two regions B_E       1.0 GByte/sec    15 GByte/sec
Performance of a single core l_th       8.5 GFlop/sec    10 GFlop/sec

Reference bandwidth: b_I0 = 1.0 GByte/sec

Application = LinPack with n_p = 10000, 20000, 30000, 40000:

  #op ~ (2/3)·n_p³  and  #b ~ 2·n_p²·w
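Plugging the LinPack operation and byte counts into the balance factor gives the model inputs; a sketch for the bwGRiD 1 numbers (the word size w = 8 bytes per double is an assumption, standard for HPL):

```python
# LinPack balance factors for the bwGRiD 1 numbers above (sketch).

def linpack_ops_bytes(n_p: int, w: int = 8):
    """HPL-style counts: #op ~ (2/3) n_p^3 flops, #b ~ 2 n_p^2 * w bytes."""
    ops = (2.0 / 3.0) * n_p**3
    byts = 2.0 * n_p**2 * w
    return ops, byts

b_i, l_th = 1.5e9, 8.5e9   # bwGRiD 1: node bandwidth, core performance
for n_p in (10000, 20000, 30000, 40000):
    ops, byts = linpack_ops_bytes(n_p)
    a = ops / byts                 # arithmetic intensity, grows linearly in n_p
    x_prime = a * b_i / l_th       # per-core balance factor x'
    print(n_p, round(a, 1), round(x_prime, 1))
```

The linear growth of the intensity a = n_p/(3w) explains why the larger problem sizes in the following plots scale so much better.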

Slide 25

Performance Modeling Application to bwGRiD

bwGRiD – Single Region

Speed-up comparison of measurements and model for one region.

[Figure: HPL 1.0a, local; speed-up (50-300) vs. number of processors p (500-2000) for n_p = 10000, 20000, 30000, 40000; measured ("Real") and modeled ("Model") curves.]

Slide 26

Performance Modeling Application to bwGRiD

bwGRiD – Two Regions

Speed-up comparison of measurements and model for two regions for an estimated bandwidth contention of p(n) = n/20

[Figure: HPL 1.0a, MA-HD; speed-up (20-100) vs. number of processors p (500-2000) for n_p = 10000, 20000, 30000, 40000; measured ("Real") and modeled ("Model") curves.]

Slide 27

Performance Modeling Application to bwGRiD

bwGRiD – Two Regions

Speed-up of bwGRiD 1 for two regions and varying bandwidth B_E.

Slide 28

Performance Modeling Application to bwGRiD

bwGRiD – Speedup prediction

Speed-up of bwGRiD 1 and bwGRiD 2 for one and two regions with n_p = 40000.

Slide 29

Conclusions

Conclusions

- The performance model is based on the roofline model
- Throughput and speed-up are described by 2-3 scaling parameters which depend on important hardware and software characteristics
- The model reproduces LinPack measurements for one and two regions (bwGRiD 1)
- The model predicts the performance of the next-generation system (bwGRiD 2)
- Upper bounds for region sizes are derived by analyzing the maximal speed-up
- Lower bounds for region sizes are derived by analyzing the n_1/2 values (see paper)

Next steps:
- More detailed model for the communication within a region
- Investigation of other applications