Hierarchical Parallel Matrix Multiplication on Large-Scale Distributed Memory Platforms (PowerPoint presentation)
  1. Hierarchical Parallel Matrix Multiplication on Large-Scale Distributed Memory Platforms
Jean-Noël Quintin, Khalid Hasanov, Alexey Lastovetsky
Heterogeneous Computing Laboratory, School of Computer Science and Informatics, University College Dublin, Belfield, Dublin 4, Ireland
http://hcl.ucd.ie, 2013

  2-4. Outline
◮ Problem Outline: Motivation and Introduction; Previous Work: SUMMA; Our Work: HSUMMA
◮ Experiments: Experiments on Grid5000; Experiments on BlueGene

  5-6. Motivation
◮ The majority of HPC algorithms for scientific applications were introduced between the 1970s and the 1990s.
◮ They were designed for and tested on up to hundreds (at most a few thousand) of processors.
◮ In June 1995, the number of cores in the top 10 supercomputers ranged from 42 to 3,680 (see http://www.top500.org/).
◮ Nowadays, in June 2013, this number ranges from 147,456 to 3,120,000.

  7. Motivation
The increasing scale of HPC platforms creates new research questions that need to be addressed:
◮ Scalability
◮ Communication cost
◮ Energy efficiency
◮ etc.

  8. Introduction
We focus on the communication cost of scientific applications on large-scale distributed memory platforms.
◮ Example application: parallel matrix multiplication.
◮ Why matrix multiplication?
  ◮ Matrix multiplication is important in its own right as the computational kernel of many scientific applications.
  ◮ It is a popular representative of other scientific applications.
  ◮ If an optimization method works well for matrix multiplication, it will also work well for many other related scientific applications.

  9. Introduction
◮ Example algorithm: SUMMA, the Scalable Universal Matrix Multiplication Algorithm.
◮ Introduced by Robert A. van de Geijn and Jerrell Watts, University of Texas at Austin, 1995.
◮ Implemented in ScaLAPACK.

  10. Outline (section start: Previous Work: SUMMA)

  11. SUMMA
Figure: two 6 × 6 processor grids (P = 36); the pivot column of A is broadcast horizontally along processor rows, and the pivot row of B is broadcast vertically along processor columns.
◮ Number of steps: n/b (n × n matrices, b is the block size, √P × √P processor grid, P = 36 in the figure).
◮ The pivot column A^b_{•k} of (n/√P) × b blocks of matrix A is broadcast horizontally.
◮ The pivot row B^b_{k•} of b × (n/√P) blocks of matrix B is broadcast vertically.
◮ Then each (n/√P) × (n/√P) block c_ij of matrix C is updated: c_ij = c_ij + a_ik × b_kj.
◮ Size of data broadcast vertically and horizontally in each step: 2 × (n/√P) × b.
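
To make the step structure concrete, here is a minimal serial sketch in Python/NumPy (the function name is invented for illustration; it reproduces the loop over width-b pivot panels in one address space instead of broadcasting them with MPI):

```python
import numpy as np

def summa_serial_sketch(A, B, b):
    """Serial illustration of SUMMA's step structure.

    In the parallel algorithm, each step broadcasts a pivot column panel of A
    (n/sqrt(P) x b per processor) along processor rows and a pivot row panel
    of B (b x n/sqrt(P) per processor) along processor columns, then every
    processor updates its local block of C. Here the whole matrices live in
    one address space and only the loop over pivot panels is reproduced.
    """
    n = A.shape[0]
    assert A.shape == (n, n) and B.shape == (n, n) and n % b == 0
    C = np.zeros((n, n))
    for k in range(0, n, b):        # n/b steps, one per pivot panel
        A_col = A[:, k:k + b]       # pivot column panel ("broadcast horizontally")
        B_row = B[k:k + b, :]       # pivot row panel ("broadcast vertically")
        C += A_col @ B_row          # rank-b update of every block of C
    return C

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((12, 12))
    B = rng.standard_normal((12, 12))
    assert np.allclose(summa_serial_sketch(A, B, b=3), A @ B)
```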

  12. Outline (section start: Our Work: HSUMMA)

  13. Our Contribution
◮ We introduce an application-level hierarchical optimization of SUMMA.
◮ Hierarchical SUMMA (HSUMMA) is a platform-independent optimization of SUMMA.
◮ We show, both theoretically and experimentally, that HSUMMA reduces the communication cost of SUMMA.

  14. SUMMA vs. HSUMMA: Arrangement of Processors
Figure: the flat 6 × 6 processor grid (P = 36) used by SUMMA.

  15. SUMMA vs. HSUMMA: Arrangement of Processors
Figure: the 6 × 6 processor grid used by SUMMA (left) and the same processors partitioned into a grid of processor groups for HSUMMA (right).
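
To make the grouping concrete, the short Python sketch below (illustrative helper name, not from the paper) maps a processor's coordinates on the √P × √P grid to its group on the √G × √G grid of groups and its local coordinates inside that group, assuming P and G are perfect squares and √G divides √P:

```python
from math import isqrt

def group_of(i, j, P, G):
    """Map processor (i, j) on the sqrt(P) x sqrt(P) grid to
    (group row, group col, local row, local col) for a
    sqrt(G) x sqrt(G) arrangement of processor groups."""
    p, g = isqrt(P), isqrt(G)
    assert p * p == P and g * g == G and p % g == 0
    s = p // g                       # each group is an s x s block of processors
    return i // s, j // s, i % s, j % s

# Example with the slides' numbers: P = 36 processors, G = 9 groups,
# so each group is a 2 x 2 block of the 6 x 6 grid.
print(group_of(4, 3, P=36, G=9))     # -> (2, 1, 0, 1)
```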

  16. Horizontal Communications Between Groups in HSUMMA
Figure: the pivot column of A is broadcast horizontally between groups.
◮ P: number of processors (P = 36 in the figure)
◮ G: number of groups (G = 9 in the figure)
◮ √P × √P: processor grid
◮ √G × √G: grid of processor groups
◮ M: block size between groups
◮ n/M: number of steps between groups
◮ Size of data broadcast horizontally in each step: (n/√P) × M
The pivot column A^M_{•k} of (n/√P) × M blocks of matrix A is broadcast horizontally between groups.

  17. Horizontal Communications Inside Groups in HSUMMA
Figure: the local pivot column of A is broadcast horizontally inside each group.
◮ √(P/G) × √(P/G): grid of processors inside each group
◮ b: block size inside one group
◮ M/b: number of steps inside one group
◮ n/M: number of steps between groups
◮ Size of data broadcast horizontally in each step: (n/√P) × b
Upon receipt of the pivot column data from the other groups, the local pivot column A^b_{•k} (b ≤ M) of (n/√P) × b blocks of matrix A is broadcast horizontally inside each group.

  18. Vertical Communications Between Groups in HSUMMA
Figure: the pivot row of B is broadcast vertically between groups.
◮ P: number of processors (P = 36 in the figure)
◮ G: number of groups (G = 9 in the figure)
◮ √P × √P: processor grid
◮ √G × √G: grid of processor groups
◮ M: block size between groups
◮ n/M: number of steps between groups
◮ Size of data broadcast vertically in each step: M × (n/√P)
The pivot row B^M_{k•} of M × (n/√P) blocks of matrix B is broadcast vertically between groups.

  19. Vertical Communications Inside Groups in HSUMMA
Figure: the local pivot row of B is broadcast vertically inside each group.
◮ √(P/G) × √(P/G): grid of processors inside each group
◮ b: block size inside one group
◮ M/b: number of steps inside one group
◮ n/M: number of steps between groups
◮ Size of data broadcast vertically in each step: b × (n/√P)
Upon receipt of the pivot row data from the other groups, the local pivot row B^b_{k•} (b ≤ M) of b × (n/√P) blocks of matrix B is broadcast vertically inside each group.
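
Slides 16-19 describe a two-level loop nest: n/M steps between groups, each refined into M/b steps inside the groups. Below is a minimal serial sketch of that nesting in Python/NumPy (invented function name; a real HSUMMA implementation performs the width-M broadcast between groups and the width-b broadcast inside groups over separate communicators rather than slicing shared arrays):

```python
import numpy as np

def hsumma_serial_sketch(A, B, M, b):
    """Serial illustration of HSUMMA's two-level step structure:
    outer steps of width M ("between groups"), inner steps of width b
    ("inside groups"), with b <= M, b dividing M, and M dividing n."""
    n = A.shape[0]
    assert n % M == 0 and M % b == 0 and b <= M
    C = np.zeros((n, n))
    for K in range(0, n, M):                 # n/M steps between groups:
        A_outer = A[:, K:K + M]              #   pivot column panel of width M
        B_outer = B[K:K + M, :]              #   pivot row panel of height M
        for k in range(0, M, b):             # M/b steps inside each group:
            C += A_outer[:, k:k + b] @ B_outer[k:k + b, :]   # rank-b update
    return C

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((24, 24))
    B = rng.standard_normal((24, 24))
    assert np.allclose(hsumma_serial_sketch(A, B, M=6, b=2), A @ B)
```

The arithmetic is exactly that of SUMMA; the point of the hierarchy is that, in the parallel algorithm, the large width-M exchanges cross group boundaries only n/M times, while the finer width-b broadcasts stay inside the smaller groups.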

  20. Communication Model Used to Analyse SUMMA and HSUMMA
The time to send a message of size m between two processors is modelled as
α + m β    (1)
where
◮ α is the latency,
◮ β is the reciprocal bandwidth,
◮ m is the message size.
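
As a back-of-the-envelope illustration (my simplification, not the paper's analysis): counting a single point-to-point message per pivot panel, with β taken per matrix element, model (1) applied to the panel sizes from the previous slides gives:

```latex
% Applying model (1) to the panel sizes defined on the previous slides
% (one point-to-point message per panel, beta counted per matrix element):
\begin{align*}
  T_{\text{SUMMA panel}}            &= \alpha + \frac{n}{\sqrt{P}}\, b\, \beta
     && \text{panel of } \tfrac{n}{\sqrt{P}} \times b \text{ elements},\\
  T_{\text{HSUMMA, between groups}} &= \alpha + \frac{n}{\sqrt{P}}\, M\, \beta
     && \text{panel of } \tfrac{n}{\sqrt{P}} \times M \text{ elements},\\
  T_{\text{HSUMMA, inside a group}} &= \alpha + \frac{n}{\sqrt{P}}\, b\, \beta
     && \text{panel of } \tfrac{n}{\sqrt{P}} \times b \text{ elements}.
\end{align*}
```

The full cost then depends on how the broadcast among √P processors, or among √G groups and then inside a group of P/G processors, is implemented; that is what the model is used to compare for SUMMA and HSUMMA.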
