Performance Analysis and Prediction for Distributed Homogeneous Clusters


  1. Performance Analysis and Prediction for Distributed Homogeneous Clusters
  Heinz Kredel, Hans-Günther Kruse, Sabine Richling, Erich Strohmaier
  IT-Center, University of Mannheim, Germany; IT-Center, University of Heidelberg, Germany; Future Technology Group, LBNL, Berkeley, USA
  ISC'12, Hamburg, 18 June 2012

  2. Outline
  1 Background and Motivation: D-Grid and bwGRiD; bwGRiD Mannheim/Heidelberg; next generation bwGRiD
  2 Performance Modeling: the Roofline Model; analysis of a single region; analysis of two identical interconnected regions; application to bwGRiD
  3 Conclusions

  3. Background and Motivation: D-Grid and bwGRiD
  bwGRiD Virtual Organization (VO): a community project of the German Grid Initiative D-Grid; project partners are the universities in Baden-Württemberg.
  bwGRiD Resources: compute clusters at 8 locations; central storage unit in Karlsruhe.
  bwGRiD Objectives: verifying the functionality and the benefit of Grid concepts for the HPC community in Baden-Württemberg; managing organizational, security, and license issues; development of new cluster and Grid applications.

  4. Background and Motivation: bwGRiD Resources
  Compute clusters:
    Site            Nodes
    Mannheim         140  (Mannheim and Heidelberg interconnected to a single cluster)
    Heidelberg       140
    Karlsruhe        140
    Stuttgart        420
    Tübingen         140
    Ulm/Konstanz     280  (joint cluster with Konstanz)
    Freiburg         140
    Esslingen        180
    Total           1580
  Central storage (Karlsruhe):
    with backup      128 TB
    without backup   256 TB
    Total            384 TB

  5. Background and Motivation: bwGRiD MA/HD Hardware
    Hardware             Mannheim   Heidelberg   Total
    Blade Center         10         10           20
    Blades (Nodes)       140        140          280
    CPUs (Cores)         1120       1120         2240
    Login Server         2          2            4
    Admin Server         1          -            1
    InfiniBand Switches  1          1            2
    HP Storage System    32 TB      32 TB        64 TB
  Blade configuration: 2 Intel Xeon CPUs, 2.8 GHz (each CPU with 4 cores); 16 GB memory; 140 GB hard drive (since January 2009); Gigabit Ethernet (1 Gbit); InfiniBand network (20 Gbit).

  6. Background and Motivation: bwGRiD MA/HD Overview
  [Diagram: the Mannheim and Heidelberg clusters (140 nodes each, with InfiniBand network, login/admin servers, directory services PBS/LDAP/AD/passwd, and Lustre bwFS storage) are coupled over 28 km of InfiniBand via Obsidian devices; Grid users on each side enter through VORM Grid front-ends.]

  7. Background and Motivation: bwGRiD MA/HD Interconnection
  Network technology: InfiniBand over Ethernet over fibre optics (28 km); 2 Obsidian Longbow devices (150 TEUR).
  MPI performance: latency is high: 145 µs = 143 µs light transit time + 2 µs. Bandwidth is as expected: 930 MB/s (local: 1200-1400 MB/s).
  Operating considerations: the two clusters are operated as a single system image; fast InfiniBand interconnection to the storage systems; MPI performance is not sufficient for all kinds of parallel jobs, so all nodes of a job are kept on one side.
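As a back-of-the-envelope check of the quoted latency, the 143 µs light transit time is of the same order as signal propagation through 28 km of fibre. The refractive index used below (about 1.47, typical for single-mode fibre) is an assumption, not a figure from the slides:

```python
# Rough sanity check of the quoted MPI latency: light in fibre travels at
# roughly c / 1.47, so 28 km of fibre alone costs on the order of 140 us
# one way. The refractive index is an assumed typical value.
C_VACUUM = 299_792_458.0   # speed of light in vacuum, m/s
FIBRE_INDEX = 1.47         # assumed refractive index of single-mode fibre
DISTANCE_M = 28_000.0      # Mannheim-Heidelberg fibre length from the slides

transit_us = DISTANCE_M / (C_VACUUM / FIBRE_INDEX) * 1e6
print(f"one-way fibre transit: {transit_us:.0f} us")
```

This yields about 137 µs, close to the measured 143 µs; the actual fibre route is presumably somewhat longer than the straight 28 km.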

  8. Background and Motivation: Next generation bwGRiD
  Questions:
  - What bandwidth is required to allow all parallel jobs to run across two cluster regions?
  - Is the expected bandwidth for the new system sufficient?
  - Is there an optimal size for a cluster region?
  Performance characteristics:
                                     bwGRiD 1         bwGRiD 2
    Bandwidth between two nodes      1.5 GByte/sec    6 GByte/sec
    Bandwidth between two regions    1.0 GByte/sec    15-45 GByte/sec
    Performance of a single core     8.5 GFlop/sec    10-16 GFlop/sec

  9. Outline (repeated from slide 2, marking the start of the Performance Modeling section)

  10. Performance Modeling: The Roofline Model (Basic Roofline)
  (Slide credit: EECS, Electrical Engineering and Computer Sciences, Berkeley Par Lab)
  Performance is upper-bounded by both the peak flop rate and the product of streaming bandwidth and the flop:byte ratio:
    Gflop/s = min(Peak Gflop/s, Stream BW x actual flop:byte ratio)
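The min-of-two-bounds rule above can be sketched in a few lines; the machine numbers in the example are illustrative, not measurements from the slides:

```python
def roofline_gflops(peak_gflops, stream_bw_gbs, flops_per_byte):
    """Attainable Gflop/s = min(peak flop rate, streaming BW * flop:byte ratio)."""
    return min(peak_gflops, stream_bw_gbs * flops_per_byte)

# Illustrative (assumed) machine: 50 Gflop/s peak, 10 GB/s memory bandwidth.
low_intensity = roofline_gflops(50.0, 10.0, 0.25)   # memory-bound: 2.5 Gflop/s
high_intensity = roofline_gflops(50.0, 10.0, 8.0)   # compute-bound: 50 Gflop/s
```

Below the ridge point the bandwidth term is the binding bound; above it the peak flop rate caps performance.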

  11. Performance Modeling: Roofline model for Opteron (adding ceilings)
  (Slide credit: EECS, Electrical Engineering and Computer Sciences, Berkeley Par Lab)
  [Figure: roofline for an AMD Opteron 2356 (Barcelona); y-axis: attainable Gflop/s, 1 to 128 on a log scale; x-axis: flop:DRAM byte ratio, 1/8 to 16 on a log scale. Peak roofline based on the manual's single-precision peak and a hand-tuned STREAM read bandwidth.]

  12. Performance Modeling: A Performance Model based on the Roofline Model
  Roofline principles: bottleneck analysis; performance bound by peak flop rate and measured bandwidth.
  The following steps are used to develop a performance model for single and multiple regions:
  - Transform basic scales to dimensionless quantities to arrive at a universal scaling law
  - Assume optimal floating-point operations and scaling with system size
  - Introduce effective bandwidth scaling with system size
  - Formulate the result with dimensionless code-to-system balance factors

  13. Performance Modeling: Performance Model, Overall System Abstraction
  Hardware: two regions I_1 and I_2 connected by an aggregate bandwidth B_E; each region has n cores, core performance l_th, and internal bandwidth b_I.
  Application (load): #op = number of arithmetic operations, performed on #b = number of bytes (data).

  14. Performance Modeling: Analysis of a single Region
  Total time = computation time + communication time.
  Total time with ideal floating-point performance d_th:
    additive:     t_V ~ #op/d_th + #b/b_I = (#op/d_th) (1 + (#b/#op)(d_th/b_I))
    overlapping:  t_V >= max(#op/d_th, #b/b_I) = (#op/d_th) max(1, (#b/#op)(d_th/b_I))
  Identify a code-to-system balance factor x based on:
    a  : arithmetic intensity a = #op/#b (roofline model, Williams et al. 2009)
    a* : operational balance ('architectural intensity') a* = d_th/b_I
    x  = a/a* = (#op b_I)/(#b d_th)
  Throughput:
    d = #op/t_V = d_th x/(x+1)    (additive)
    d <= d_th min(1, x)           (overlapping)
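A minimal sketch of the single-region throughput model, using the slide's symbols (d_th ideal compute rate, b_I internal bandwidth, #op operations, #b bytes); the sample numbers are assumptions for illustration, not bwGRiD measurements:

```python
def throughput_single_region(num_ops, num_bytes, d_th, b_i, overlapping=False):
    """Throughput d = #op / t_V for a single region.

    x = a / a*, with arithmetic intensity a = #op/#b and
    operational balance ('architectural intensity') a* = d_th/b_i.
    """
    x = (num_ops / num_bytes) * (b_i / d_th)
    if overlapping:
        # t_V ~ max(compute time, communication time)
        return d_th * min(1.0, x)
    # additive: t_V ~ compute time + communication time
    return d_th * x / (x + 1.0)

# Illustrative values (assumed): a balanced code (x = 1) reaches only half
# of d_th under the additive model, but all of d_th when overlapping.
d_add = throughput_single_region(1e9, 1e9, 8.5, 8.5)
d_ovl = throughput_single_region(1e9, 1e9, 8.5, 8.5, overlapping=True)
```

The gap between the two variants is largest exactly at the balance point x = 1 and vanishes for strongly compute-bound or strongly bandwidth-bound codes.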

  15. Performance Modeling: Single Region, Throughput
  [Figure: throughput d for the additive (green) and overlapping (red) concepts.]

  16. Performance Modeling: Single Region, Speed-up
  Ideal floating-point scaling d_th = n * l_th and effective bandwidth scaling z = b_I/b_I0 (with a reference bandwidth b_I0) give:
    x = (#op b_I)/(#b d_th) = (#op b_I0)/(#b l_th) * z/n = x' * z/n
  where x' is the balance factor of the core (or node, unit, ...).
  The parallel speed-up is then:
    Sp = d(n)/d(1) = (1 + x'z)/(1 + x'z/n) -> 1 + x'z   (n -> infinity)
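The speed-up law can be evaluated directly. This sketch assumes the form Sp(n) = (1 + x'z)/(1 + x'z/n) reconstructed from the slide, which saturates at 1 + x'z:

```python
def speedup(n, x_prime, z):
    """Parallel speed-up Sp(n) = (1 + x'z) / (1 + x'z / n).

    Equivalent to n (1 + x'z) / (n + x'z); saturates at 1 + x'z as n grows.
    """
    return (1.0 + x_prime * z) / (1.0 + x_prime * z / n)

# With x'z = 2 the speed-up can never exceed 3, however many cores are added.
sp_small = speedup(1, 2.0, 1.0)
sp_large = speedup(10**9, 2.0, 1.0)
```

This makes the model's central point concrete: the saturation level depends only on the product of the core balance factor x' and the bandwidth scaling z, not on the core count.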

  17. Performance Modeling: Single Region, Speed-up
  [Figure: speed-up Sp for different values of x' and z.]

  18. Performance Modeling: Analysis of two identical interconnected Regions
  Total time = time (1 region, 1/2 computational load) + communication time between regions.
  Total time for #x bytes exchanged over a channel of bandwidth B_E:
    t_V ~ t_V^(1) + #x/B_E
    t_V >= (#op/2)/d_th * (1 + a*/a) + #x/B_E
  Throughput:
    d <= 2 d_th / (1 + a*/a + 2 d_th #x / (B_E #op))
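The two-region throughput bound can be sketched with all inputs as free parameters; no bwGRiD measurements are baked in:

```python
def throughput_two_regions(num_ops, num_x_bytes, d_th, a, a_star, b_e):
    """Upper bound on throughput for two identical interconnected regions:

    d <= 2 d_th / (1 + a*/a + 2 d_th #x / (B_E #op))
    """
    return 2.0 * d_th / (1.0 + a_star / a
                         + 2.0 * d_th * num_x_bytes / (b_e * num_ops))

# Sanity check: with no inter-region traffic (#x = 0) and a balanced code
# (a = a*), the bound reduces to d_th, i.e. twice the additive single-region
# throughput d_th/2, as expected for two coupled regions.
d_bound = throughput_two_regions(1e9, 0.0, 8.5, 1.0, 1.0, 1.0)
```

The third term in the denominator is the extra penalty of the inter-region channel: it grows with the exchanged volume #x and shrinks with the aggregate bandwidth B_E, which is exactly the quantity the next-generation bwGRiD questions ask about.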
