Communication Lower Bounds for Matrix-Matrix Multiplication
Dagstuhl Seminar #15281, July 6-9, 2015
Julien Langou

MOTIVATIONS
- Motivations
- Communication Lower Bound for Sequential Matrix-Matrix Multiplication
- Application to Parallel Distributed

Getting up to speed: The Future of Supercomputing, Eds. Susan L. Graham, Marc Snir, and Cynthia A. Patterson, National Research Council, 227 pages, 2004.

Annual improvement
  Time per flop: 59%
               Bandwidth   Latency
  Network         26%        15%
  DRAM            23%         5%

[FIGURE 5.3 Arithmetic performance (Mflops), memory bandwidth, and DRAM chip bandwidth per calendar year.]
[FIGURE 5.4 Decrease in memory latency (in nanoseconds) per calendar year.]

http://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/

Data Movement Cost: Energy Trends
- FLOPs almost free; data movement cost is dominant.
- Minimizing amount of data movement increasingly critical.
[Figure: energy in picojoules (log scale, 1 to 10000), 45 nm versus 11 nm (2018). Source: Jim Demmel, John Shalf]

COMMUNICATION LOWER BOUND FOR SEQUENTIAL MATRIX-MATRIX MULTIPLICATION
- Motivations
- Communication Lower Bound for Sequential Matrix-Matrix Multiplication
- Application to Parallel Distributed

One core Intel Xeon Processor E5520 (Nehalem)
  $\beta^{-1} = 580 \cdot 10^{6}$ words/sec
  $\gamma^{-1} = 10.12 \cdot 10^{9}$ flops/sec
  $M = 10^{6}$ words
[Figure: DGEMM on one core Intel Xeon Processor E5520 (Nehalem); GFlops/sec (0 to 10) versus matrix order (0 to 5000)]

Mission Statement
We study communication costs for the ordinary dense (OD) matrix-matrix multiplication in the sequential model.

- dense matrix-matrix multiplication
    input: A an n-by-n matrix, B an n-by-n matrix
    output: C an n-by-n matrix
    % starting from C = 0
    for i = 1 : n,
      for j = 1 : n,
        for k = 1 : n,
          c_ijk = a_ik b_kj
          c_ij = c_ij + c_ijk
  » $2n^3$ operations
  » any order of creation of the c_ijk results in the correct answer
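The order-independence claim can be checked with a small sketch. The function name `matmul_any_order` and its `shuffled` and `seed` parameters are illustrative assumptions, not part of the talk. Integer entries are used so that the comparison is exact; with floating-point entries, different orders can differ by rounding.

```python
import random


def matmul_any_order(A, B, shuffled=False, seed=0):
    """Form every c_ijk = a_ik * b_kj and accumulate it into c_ij.

    The (i, j, k) triples are visited in loop order, or in a random
    order when shuffled=True. Returns (C, number of operations).
    """
    n = len(A)
    triples = [(i, j, k) for i in range(n) for j in range(n) for k in range(n)]
    if shuffled:
        random.Random(seed).shuffle(triples)

    C = [[0] * n for _ in range(n)]  # starting from C = 0
    ops = 0
    for i, j, k in triples:
        c_ijk = A[i][k] * B[k][j]  # one multiplication
        C[i][j] += c_ijk           # one addition
        ops += 2
    return C, ops


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
in_order, ops = matmul_any_order(A, B)
any_order, _ = matmul_any_order(A, B, shuffled=True, seed=1)
print(in_order == any_order, ops)  # same C either way; ops equals 2 * n**3
```

The operation count returned is $2n^3$ because each of the $n^3$ triples costs one multiplication and one addition.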

- dense matrix-matrix multiplication
- sequential: two levels of memory
  » sequential = not parallel!
  » fast memory of size M
  » slow memory
  » computation happens in fast memory
[Figure: Intel Xeon Processor E5520 (Nehalem): CPU at 10.12 GFLOP/sec/core, cache (8 MB), 25.6 GB/sec link to main memory (16 GB)]

- dense matrix-matrix multiplication
- sequential: two levels of memory
- communication cost: time, energy, etc.
- ordinary: we compute all $n^3$ products $c_{ijk} = a_{ik} \cdot b_{kj}$ (consequence: Strassen-like matrix-matrix multiplication algorithms are not allowed.)

Sequential Lower Bounds for Matrix-Matrix Multiplication
Consider any ordinary dense matrix-matrix multiplication algorithm for multiplying an n-by-n matrix with an n-by-n matrix, and consider a computer with fast memory of size M. Then:

Upper bound (square tile matrix-matrix multiplication): the number of words transferred between slow and fast memory is at most
$$3.46 \, \frac{n^3}{\sqrt{M}}.$$

Lower bound (Irony, Toledo, and Tiskin, 2004): the number of words transferred between slow and fast memory is at least
$$0.35 \, \frac{n^3}{\sqrt{M}} - M.$$

Note: $3.46 \approx 2\sqrt{3}$.
Note: $0.35 \approx (2\sqrt{2})^{-1}$.
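To get a feel for the two bounds, here is a minimal sketch that evaluates them and counts the traffic of a square-tile multiplication. The function names and the tiling details are assumptions made for illustration: tile size b with three b-by-b tiles filling fast memory (M = 3b²), each C tile created in fast memory and written back once. The slides do not say whether the constant 3.46 also covers the C writes, so the sketch reports reads and writes separately.

```python
import math


def upper_bound_words(n, M):
    """Upper bound from the slide: 2*sqrt(3) * n^3 / sqrt(M)."""
    return 2.0 * math.sqrt(3.0) * n**3 / math.sqrt(M)


def lower_bound_words(n, M):
    """Lower bound from the slide: n^3 / (2*sqrt(2)*sqrt(M)) - M."""
    return n**3 / (2.0 * math.sqrt(2.0) * math.sqrt(M)) - M


def tiled_traffic(n, b):
    """Count words moved by a square-tile multiply (assumed scheme).

    For each b-by-b tile of C, loop over K, reading one tile of A and
    one tile of B per step; the finished C tile is written once.
    Requires that b divides n.
    """
    if n % b != 0:
        raise ValueError("tile size b must divide n")
    t = n // b
    reads = 0
    writes = 0
    for _I in range(t):
        for _J in range(t):
            for _K in range(t):
                reads += 2 * b * b  # one tile of A and one tile of B
            writes += b * b         # write the finished tile of C
    return reads, writes


n, b = 64, 16
M = 3 * b * b  # three tiles fit in fast memory
reads, writes = tiled_traffic(n, b)
print(reads, writes, upper_bound_words(n, M), lower_bound_words(n, M))
```

Under these assumptions the reads of A and B total $2n^3/b = 2\sqrt{3}\,n^3/\sqrt{M}$, which is the leading term of the upper bound, while the $n^2$ writes of C are of lower order.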

What is an algorithm?
- An algorithm is defined by a sequence of the following instructions:
  » Read an element from slow memory to fast memory.
  » Create an element in fast memory.
  » Write an element from fast memory to slow memory.
  » Delete an element from fast memory.
  » Perform a floating-point operation in fast memory.
- Example:
    Read a_11
    Read b_11
    Create c_111 = a_11 b_11
    Read a_12
    Read b_21
    Create c_112 = a_12 b_21
    Write c_11
    Delete c_11, a_11, b_11
    ...
- Split the instructions into segments so exactly M reads and writes occur in each segment.
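The segmentation step can be sketched as follows. The tuple encoding of instructions and the name `split_into_segments` are assumptions for illustration. The sketch closes a segment immediately after its M-th read or write; the slides do not specify which segment the non-I/O instructions in between belong to, so that placement is a choice made here.

```python
def split_into_segments(trace, M):
    """Split an instruction trace into segments of exactly M reads/writes.

    trace is a list of (op, operand) pairs, where op is one of
    "read", "create", "write", "delete", "flop". Every returned
    segment contains exactly M reads and writes, except possibly the
    last one, which may contain fewer.
    """
    segments = []
    current = []
    io_count = 0
    for instr in trace:
        current.append(instr)
        if instr[0] in ("read", "write"):
            io_count += 1
            if io_count == M:  # segment is full: close it
                segments.append(current)
                current = []
                io_count = 0
    if current:  # trailing, possibly incomplete segment
        segments.append(current)
    return segments


# The example trace from the slide, in the assumed encoding.
trace = [
    ("read", "a11"), ("read", "b11"), ("create", "c111"),
    ("read", "a12"), ("read", "b21"), ("create", "c112"),
    ("write", "c11"), ("delete", "c11,a11,b11"),
]

for segment in split_into_segments(trace, M=2):
    print(segment)
```

With this encoding, the number of full segments is the total number of reads and writes divided by M, rounded down, which is why counting segments gives a bound on communication.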

What do we want to compute?
We want to compute the $n^3$ products $c_{ijk} = a_{ik} b_{kj}$. The computation of $c_{ijk}$ requires $a_{ik}$, $b_{kj}$, and a $c_{ij}^{*}$ to be in cache. In order to compute $c_{ijk}$:
- either $a_{ik}$ is in cache at the start of the segment ($M_a$), or we have to read $a_{ik}$ ($R_a$) from slow memory to cache during the segment;
- either $b_{kj}$ is in cache at the start of the segment ($M_b$), or we have to read $b_{kj}$ ($R_b$) from slow memory to cache during the segment;
- either a $c_{ij}^{*}$ is in cache at the end of the segment ($N_c$), or we have to write it back ($W_c$) during the segment.
[Figure: multiplication (i,j,k), $c_{ij} = c_{ij} + a_{ik} b_{kj}$, with axes i, j, k and entries $a_{ik}$, $b_{kj}$, $c_{ij}$]

- Split the instructions into segments so exactly M reads and writes occur in each segment.
- M reads and writes.
  » $R_a$ = number of reads for A.
  » $W_a$ = number of writes for A.
[Figure: the instruction sequence Read a_11, Read b_11, Create c_111 = a_11 b_11, Read a_12, Read b_21, Create c_112 = a_12 b_21, Write c_11, Delete c_11, a_11, b_11, ..., bracketed as one segment]
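The transcript ends before the counting step. What follows is a sketch of how the segment argument is usually completed (the Loomis-Whitney step used by Irony, Toledo, and Tiskin); it is not a reconstruction of the missing slides. The symbols $N_A$, $N_B$, $N_C$, and $F$ are names introduced here for illustration.

```latex
% Entries available to one segment (fast memory holds at most M words,
% and a segment performs exactly M reads and writes):
\begin{align*}
  N_A &\le M_a + R_a \le 2M, &
  N_B &\le M_b + R_b \le 2M, &
  N_C &\le N_c + W_c \le 2M.
\end{align*}
% Loomis--Whitney: the number F of products c_{ijk} formed in one segment
% is bounded by the sizes of its three projections.
\[
  F \;\le\; \sqrt{N_A \, N_B \, N_C} \;\le\; \sqrt{(2M)^3} \;=\; 2\sqrt{2}\, M^{3/2}.
\]
% All n^3 products must be formed, and only the last segment may be
% incomplete, so the number of full segments is at least
\[
  \frac{n^3}{2\sqrt{2}\, M^{3/2}} - 1 .
\]
% Each full segment transfers exactly M words, hence
\[
  \text{words transferred} \;\ge\; M \left( \frac{n^3}{2\sqrt{2}\, M^{3/2}} - 1 \right)
  \;=\; \frac{1}{2\sqrt{2}} \, \frac{n^3}{\sqrt{M}} - M .
\]
```

The constant $(2\sqrt{2})^{-1} \approx 0.35$ agrees with the lower bound stated earlier in the talk.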
