  1. A simple Concept for the Performance Analysis of Cluster-Computing H. Kredel 1, S. Richling 2, J.P. Kruse 3, E. Strohmaier 4, H.G. Kruse 1 — 1 IT-Center, University of Mannheim, Germany; 2 IT-Center, University of Heidelberg, Germany; 3 Institute of Geosciences, Goethe University Frankfurt, Germany; 4 Future Technology Group, LBNL, Berkeley, USA. ISC'13, Leipzig, 18 June 2013

  2. Outline
◮ Introduction
◮ Performance Model
◮ Applications: Scalar-Product of Vectors, Matrix Multiplication, Linpack
◮ TOP500
◮ Conclusions

  3. Introduction
Motivation
◮ Sophisticated mathematical models for performance analysis cannot keep up with rapid hardware development.
◮ There is a lack of reliable rules of thumb to estimate the size and performance of clusters.
Goals
◮ Development of a simple and transparent model.
◮ Restriction to few parameters describing hardware and software.
◮ Using speed-up as a dimensionless metric.
◮ Finding the optimal size of a cluster for a given application.
◮ Validation of the results by modeling of standard kernels.

  4. Related Work
◮ Roofline model for multi-cores (Williams et al. 2009)
◮ Performance models by Hockney:
  ◮ Model with few hardware and software parameters, focus on benchmark runtimes and performance (Hockney 1987, Hockney & Jesshope 1988)
  ◮ Model based on similarities to fluid dynamics (Hockney 1995)
◮ Performance models by Numrich:
  ◮ Based on Newton's classical mechanics (Numrich 2007)
  ◮ Based on dimension analysis (Numrich 2008)
  ◮ Based on the Pi theorem (Numrich 2010)
◮ Linpack performance model (Luszczek & Dongarra 2011)
◮ Performance model based on a stochastic approach (Kruse 2009, Kredel et al. 2010)
◮ Performance model for interconnected clusters (Kredel et al. 2012)

  5. Model Parameters
Hardware parameters:
◮ p — number of processing units (PUs)
◮ l_peak,k — theoretical peak performance of each PU, k = 1, …, p
◮ b_c — bandwidth of the network
Software parameters:
◮ #op — total number of arithmetic operations
◮ #b — total number of bytes involved
◮ #x — total number of bytes communicated between the PUs

  6. Distribution of the work load (#op, #b) — homogeneous case
◮ Distribution of operations: o_k = #op / p (or ω_k = 1/p)
◮ Distribution of data: d_k = #b / p (or δ_k = 1/p)

  7. Distribution of the work load (#op, #b) — heterogeneous case → additional parameters (ω_k, δ_k)
◮ Distribution of operations: o_k = ω_k · #op with Σ_{k=1}^{p} ω_k = 1
◮ Distribution of data: d_k = δ_k · #b with Σ_{k=1}^{p} δ_k = 1

  8. Performance Indicators
Primary performance measure:
◮ t — total time to process the work load (#op, #b)
Derived performance measures:
◮ Performance: l(p) = #op / t
◮ Speed-up (dimensionless): S = l(p) / l(1)
Goal: speed-up as a function of
◮ total work load (#op, #b) [Flop, Byte]
◮ work distribution (ω_k, δ_k)
◮ communication requirements #x [Byte]
◮ hardware parameters (p, l_peak,k, b_c) [-, Flop/s, Byte/s]

  9. Total execution time
Computation time: t_r = max{ t_1(o_1, d_1), …, t_p(o_p, d_p) } ≃ max_k (o_k / l_k) ≥ max_k (o_k / l_peak,k)
Communication time: t_c ≃ #x / b_c
Total execution time: t ≃ t_r + t_c ≥ max_k (o_k / l_peak,k) + #x / b_c

  10. Total execution time
t ≥ ω_k · #op / l_peak,k + #x / b_c = (ω_k · #op / l_peak,k) · (1 + 1/x_k)
One dimensionless parameter for "hardware + software": x_k = ω_k · a · r / a*_k
◮ a = #op / #b — computational intensity of the software [Flop/Byte]
◮ a*_k = l_peak,k / b_c — "computational intensity" of the hardware [Flop/Byte]
◮ r = #b / #x — "inverse communication intensity" [-]
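As a numeric illustration, the dimensionless parameter x_k can be computed directly from the slide's definitions. This is a minimal sketch; the function name and the sample values used below are ours, not from the presentation:

```python
def x_k(omega_k, num_op, num_b, num_x, l_peak_k, b_c):
    """x_k = omega_k * a * r / a*_k combines software and hardware intensities."""
    a = num_op / num_b        # computational intensity of the software [Flop/Byte]
    a_star = l_peak_k / b_c   # "computational intensity" of the hardware [Flop/Byte]
    r = num_b / num_x         # inverse communication intensity [-]
    return omega_k * a * r / a_star
```

For example, with #op = 2·10^6, #b = 1.6·10^7, #x = 8·10^3, l_peak = 8 GFlop/s and b_c = 1.5 GByte/s (bwGRiD-like values), a single PU (ω_k = 1) gives x_k ≈ 46.9.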

  11. Performance and Speed-up
Performance: l = #op / t ≤ (l_peak,k / ω_k) · x_k / (1 + x_k)
Speed-up: S = l(p) / l(1) = l_k(ω_k < 1) / l_k(ω_k = 1) = [1 + x_k(ω_k = 1)] / [1 + ω_k · x_k(ω_k = 1)]
with x_k(ω_k = 1) = a · r / a*_k = a · (b_c / l_peak,k) · r = x̂_k · z · r, where x̂_k = a · b_c^0 / l_peak,k and z = b_c / b_c^0
General case with ω_k = ω(k, p) / p: S = [1 + x̂_k · r · z] / [1 + ω(k, p) · x̂_k · r · z / p]
Homogeneous case with ω(k, p) = 1: S = [1 + x̂ · r · z] / [1 + x̂ · r · z / p]
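The homogeneous speed-up formula can be sketched in a few lines (the function name is ours; any consistent values of x̂, r, z may be substituted):

```python
def speedup(p, x_hat, r, z):
    """Homogeneous case: S = (1 + x̂·r·z) / (1 + x̂·r·z / p)."""
    x = x_hat * r * z
    return (1.0 + x) / (1.0 + x / p)
```

Two sanity checks follow directly from the formula: S = 1 for p = 1, and S saturates at 1 + x̂·r·z as p → ∞.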

  12. Application-oriented Analysis
Application characterized by problem size n.
Software parameters: #op → #op(n), #b → #b(n), #x → #x(n, p)
Analysis of the performance of a homogeneous cluster:
l ≤ p · l_peak · x / (x + 1) = l_peak · y · r(n, p) / [1 + y · r(n, p) / p]
with x = x̂ · z · r(n, p) / p = y · r(n, p) / p ≃ y · c(n) / (d(p) · p)
◮ Number of PUs p_1/2 necessary to reach half of the maximum performance of all p PUs:
  l(p_1/2) = p_1/2 · l_peak / 2 → y · r(n, p_1/2) = p_1/2
◮ Number of PUs p_max to obtain the maximum of the performance:
  dl/dp = 0 → p_max² · d′(p_max) = y · c(n), with y = x̂ · z
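The p_1/2 condition y · r(n, p_1/2) = p_1/2 generally has no closed form once r(n, p) = c(n)/d(p) depends on p, but it is easy to solve numerically. A minimal sketch (the helper name and the bisection approach are ours; d is assumed monotone non-decreasing):

```python
def p_half(y, c_n, d, lo=1.0, hi=1e6):
    """Solve y * c_n / d(p) = p for p by bisection.

    f(p) = y*c_n/d(p) - p is monotone decreasing for non-decreasing d,
    so a sign change between lo and hi brackets the unique root.
    """
    f = lambda p: y * c_n / d(p) - p
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With a constant d(p) = 1 the condition reduces to p_1/2 = y · c(n), which the solver reproduces.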

  13. Compute resources for the simulations
bwGRiD cluster sites:
Site          Nodes
Mannheim      140  (interconnected with Heidelberg to a single cluster)
Heidelberg    140
Karlsruhe     140
Stuttgart     420
Tübingen      140
Ulm/Konstanz  280  (joint cluster with Konstanz)
Freiburg      140
Esslingen     180
Total         1580

  14. bwGRiD – Hardware
Node configuration:
◮ 2 Intel Xeon CPUs, 2.8 GHz (each CPU with 4 cores)
◮ 16 GB memory
◮ 140 GB hard drive (since January 2009)
◮ InfiniBand network (20 Gbit/sec)
Hardware parameters for our model:
◮ l_peak = 8 GFlop/sec (for one core)
◮ b_c = 1.5 GByte/sec (node-to-node)
◮ b_c^0 = 1.0 GByte/sec (reference bandwidth)

  15. Scalar-Product of two Vectors
(u, v) = Σ_k u_k · v_k
Software parameters: #op = 2n − 1 ≃ 2n if n ≫ 1; #b = 2nw; #x = p · w = 8p
Speed-up: S = (1 + x) / (1 + x/p) with x = 3n / (64p)
Simulations:
◮ Vector sizes up to n = 10^7
◮ 20 runs for each configuration (p, n)
◮ Speed-up calculated from mean run-times
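The scalar-product speed-up model is a one-liner once x is fixed. A small sketch (function name ours), using the slide's x = 3n/(64p):

```python
def speedup_scalar(n, p):
    """Scalar-product model: S = (1 + x)/(1 + x/p) with x = 3n/(64p)."""
    x = 3.0 * n / (64.0 * p)
    return (1.0 + x) / (1.0 + x / p)
```

Because x itself shrinks with p, the speed-up eventually falls off again for large p at fixed n, which is the behavior visible in the measurements on the next slide.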

  16. Speed-up for Scalar Product
[Figure: experimental vs. theoretical speed-up S(p) for the scalar product, p up to 500, for vector sizes n = 10^5, 5×10^5, 10^6 and 10^7]

  17. Matrix Multiplication
A_{n×n} · B_{n×n} = C_{n×n} on a √p × √p processor grid
Software parameters: #op = 2n³ − n² ≃ 2n³; #b = 2n²w; #x = 2n² · √p · (1 − 1/√p) · w ≃ 2n²w · √p
Speed-up: S = (1 + x) / (1 + x/p) with x = 3n / (2048 · √p)
Simulations:
◮ Matrix sizes up to n = 40000
◮ Cannon's algorithm
◮ Runs with 8 and 4 cores per node
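The byte counts above make the communication trend explicit: #x grows like √p, so the inverse communication intensity r = #b/#x shrinks like 1/√p. A small sketch of these software parameters (function name ours; the large-p approximation #x ≃ 2n²w√p is used):

```python
import math

def matmul_params(n, p, w=8):
    """Software parameters for Cannon's algorithm on a sqrt(p) x sqrt(p) grid."""
    num_op = 2 * n**3                    # ~2n^3 arithmetic operations
    num_b = 2 * n**2 * w                 # bytes of the two input matrices
    num_x = 2 * n**2 * math.sqrt(p) * w  # bytes communicated, large-p approximation
    r = num_b / num_x                    # inverse communication intensity ~ 1/sqrt(p)
    return num_op, num_b, num_x, r
```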

  18. Speed-up for Matrix Multiplication

  19. Linpack
Solution of Ax = b
Software parameters: #op = (2/3)n³; #b = 2n² · w; #x = 3α · (1 + log₂(p)/12) · n² · w
Speed-up: S ∼ (1 + x) / (1 + x/p) with x = n/128 and α = 1/3
Simulations:
◮ Matrix sizes up to n = 40000.
◮ Smaller α would lead to better fits for small p.
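The Linpack speed-up model can be sketched the same way as the kernels before it (function name ours; x = n/128 is the slide's value for α = 1/3):

```python
def speedup_linpack(n, p):
    """Linpack model: S ~ (1 + x)/(1 + x/p) with x = n/128 (alpha = 1/3)."""
    x = n / 128.0
    return (1.0 + x) / (1.0 + x / p)
```

For n = 10000, x ≈ 78, so around p ≈ 78 the model predicts roughly half of linear speed-up, consistent with the p_1/2 ≈ 80 quoted two slides later.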

  20. Speed-up for Linpack

  21. Linpack on bwGRiD
Half of peak performance at: p_1/2 = y · n with y = 3α/128
Maximum performance at: p_max = (24 · ln(2) / 128) · n = 24 · ln(2) · p_1/2
Region with 'good' performance for n = 10000: p = [p_1/2, p_max] = [80, 1300]
Maximum performance: l_max ∼ (9/10) · l_peak · y · n / (3α)
l_max = 560 GFlop/sec for n = 10000
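These numbers can be reproduced from the bwGRiD parameters. A minimal sketch (function name ours; l_peak = 8 GFlop/s per core and α = 1/3 as on the hardware and Linpack slides):

```python
import math

def linpack_region(n, l_peak=8e9, alpha=1.0/3.0):
    """'Good' performance region [p_half, p_max] and l_max for the Linpack model."""
    y = 3 * alpha / 128.0                      # dimensionless slope, p_half = y * n
    p_half = y * n
    p_max = 24 * math.log(2) * p_half          # p_max = 24*ln(2)*p_half
    l_max = 0.9 * l_peak * y * n / (3 * alpha) # ~ (9/10)*l_peak*y*n/(3*alpha)
    return p_half, p_max, l_max
```

For n = 10000 this yields p_1/2 ≈ 78, p_max ≈ 1300 and l_max ≈ 562 GFlop/s, matching the rounded slide values [80, 1300] and 560 GFlop/s.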

  22. TOP500
Maximum performance: l_max = (9/10) · n · b_c / (3w)
In the TOP500 list: l_max → R_max and n → N_max
Bandwidth b_c is not in the list. Derive an effective bandwidth:
b_c^eff = (10/9) · 3w · R_max / N_max
Analyze which parameter predicts the ranking best:
◮ first 100 systems
◮ excluding systems with accelerators and missing N_max
◮ comparison with single-core performance l_peak = R_max / p_max
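Inverting the l_max formula for b_c gives the effective bandwidth directly from the two TOP500 columns. A one-line sketch (function name ours; w = 8 bytes per double):

```python
def b_c_eff(r_max, n_max, w=8):
    """Effective bandwidth from TOP500 entries: b_c^eff = (10/9) * 3w * R_max / N_max."""
    return (10.0 / 9.0) * 3 * w * r_max / n_max
```

As a consistency check, feeding back the bwGRiD Linpack estimate (R_max ≈ 562.5 GFlop/s at N_max = 10000) recovers the node-to-node bandwidth b_c = 1.5 GByte/s.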

  23. TOP500 – November 2011
Blue: Linpack performance per core, l^th [GFlop/sec]; Red: derived effective bandwidth, b_c^eff [GByte/sec]
[Figure: both quantities plotted against the rank of the selected systems in the TOP500 list (Nov. 2011)]

  24. TOP500 – November 2012
Blue: Linpack performance per core, l^th [GFlop/sec]; Red: derived effective bandwidth, b_c^eff [GByte/sec]
[Figure: both quantities plotted against the rank of the selected systems in the TOP500 list (November 2012)]

  25. Conclusions
◮ Developed a performance model which integrates the characteristics of hardware and software with a few parameters.
◮ Model provides simple formulae for performance and speed-up.
◮ Results compare reasonably well with simulations of standard applications.
◮ Model allows estimation of the optimal size of a cluster for a given class of applications.
◮ Model allows estimation of the maximum performance for a given class of applications.
◮ Identified effective bandwidth as a key performance indicator for Linpack (TOP500) on compute clusters.
◮ Future work:
  ◮ Analysis of inhomogeneous clusters with asymmetric load distribution
  ◮ Further applications: sparse matrix-vector operations and FFT
