Algorithm Engineering (aka. How to Write Fast Code), CS260 Lecture 2: Case Study on Matrix Multiplication


SLIDE 1

Algorithm Engineering

(aka. How to Write Fast Code) Case Study: Matrix Multiplication

CS260 – Lecture 2 Yan Gu

Many slides in this lecture are borrowed from the first lecture of 6.172 Performance Engineering of Software Systems at MIT. The credit is to Prof. Charles E. Leiserson, and the instructor appreciates the permission to use them in this course. The runtime numbers and more details of the experiments can be found in Tao Schardl's dissertation, Performance Engineering of Multicore Software: Developing a Science of Fast Code for the Post-Moore Era.
SLIDE 2

Technology Scaling

[Figure: normalized transistor count, clock speed (MHz), and processor cores, plotted by year from 1970 to 2015; data from Stanford's CPU DB [DKM12]]

SLIDE 3

Performance Is No Longer Free

[Chip photos: a 2008 NVIDIA GT200 GPU and an Intel Skylake processor]

∙ Moore's Law continues to increase computer performance.
∙ But now that performance looks like big multicore processors with complex cache hierarchies, wide vector units, GPUs, FPGAs, etc.
∙ Generally, algorithms must be adapted to utilize this hardware efficiently!

SLIDE 4

Square-Matrix Multiplication

$$\begin{pmatrix} c_{11} & c_{12} & \cdots & c_{1n} \\ c_{21} & c_{22} & \cdots & c_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ c_{n1} & c_{n2} & \cdots & c_{nn} \end{pmatrix}
= \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix}
\begin{pmatrix} b_{11} & b_{12} & \cdots & b_{1n} \\ b_{21} & b_{22} & \cdots & b_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ b_{n1} & b_{n2} & \cdots & b_{nn} \end{pmatrix}$$

That is, $C = A \cdot B$ with

$$c_{ij} = \sum_{k=1}^{n} a_{ik}\, b_{kj}$$

Assume for simplicity that $n = 2^k$.

SLIDE 5

AWS c4.8xlarge Machine Specs

Feature              Specification
Microarchitecture    Haswell (Intel Xeon E5-2666 v3)
Clock frequency      2.9 GHz
Processor chips      2
Processing cores     9 per processor chip
Hyperthreading       2-way
Floating-point unit  8 double-precision operations, including fused multiply-add, per core per cycle
Cache-line size      64 B
L1-icache            32 KB private, 8-way set associative
L1-dcache            32 KB private, 8-way set associative
L2-cache             256 KB private, 8-way set associative
L3-cache (LLC)       25 MB shared, 20-way set associative
DRAM                 60 GB

Peak = (2.9 × 10⁹) × 2 × 9 × 16 ≈ 836 GFLOPS (16 double-precision FLOPs per core per cycle, counting each fused multiply-add as two operations)

SLIDE 6

Version 1: Nested Loops in Python

import sys, random
from time import *

n = 4096
A = [[random.random() for row in xrange(n)] for col in xrange(n)]
B = [[random.random() for row in xrange(n)] for col in xrange(n)]
C = [[0 for row in xrange(n)] for col in xrange(n)]

start = time()
for i in xrange(n):
    for j in xrange(n):
        for k in xrange(n):
            C[i][j] += A[i][k] * B[k][j]
end = time()

print '%0.6f' % (end - start)

Running time = 21,042 seconds ≈ 6 hours. Is this fast? Should we expect more?

SLIDE 7

Version 1: Nested Loops in Python

(Same code and running time as the previous slide.)

Back-of-the-envelope calculation:

2n³ = 2 · (2¹²)³ = 2³⁷ floating-point operations
Running time = 21,042 seconds
∴ Python gets 2³⁷ / 21,042 ≈ 6.5 MFLOPS
Peak ≈ 836 GFLOPS, so Python gets ≈ 0.00078% of peak

SLIDE 8

Version 2: Java

import java.util.Random;

public class mm_java {
  static int n = 4096;
  static double[][] A = new double[n][n];
  static double[][] B = new double[n][n];
  static double[][] C = new double[n][n];

  public static void main(String[] args) {
    Random r = new Random();
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < n; j++) {
        A[i][j] = r.nextDouble();
        B[i][j] = r.nextDouble();
        C[i][j] = 0;
      }
    }
    long start = System.nanoTime();
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < n; j++) {
        for (int k = 0; k < n; k++) {
          C[i][j] += A[i][k] * B[k][j];
        }
      }
    }
    long stop = System.nanoTime();
    double tdiff = (stop - start) * 1e-9;
    System.out.println(tdiff);
  }
}

Running time = 2,387 seconds ≈ 40 minutes, about 8.8× faster than Python.

SLIDE 9

Version 3: C

#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>

#define n 4096
double A[n][n];
double B[n][n];
double C[n][n];

float tdiff(struct timeval *start, struct timeval *end) {
  return (end->tv_sec - start->tv_sec) + 1e-6 * (end->tv_usec - start->tv_usec);
}

int main(int argc, const char *argv[]) {
  for (int i = 0; i < n; ++i) {
    for (int j = 0; j < n; ++j) {
      A[i][j] = (double)rand() / (double)RAND_MAX;
      B[i][j] = (double)rand() / (double)RAND_MAX;
      C[i][j] = 0;
    }
  }
  struct timeval start, end;
  gettimeofday(&start, NULL);
  for (int i = 0; i < n; ++i) {
    for (int j = 0; j < n; ++j) {
      for (int k = 0; k < n; ++k) {
        C[i][j] += A[i][k] * B[k][j];
      }
    }
  }
  gettimeofday(&end, NULL);
  printf("%0.6f\n", tdiff(&start, &end));
  return 0;
}

Using the Clang/LLVM 5.0 compiler.

Running time = 1,156 seconds ≈ 19 minutes, about 2× faster than Java and about 18× faster than Python.

SLIDE 10

Where We Stand So Far

Why is Python so slow and C so fast?

∙ Python is interpreted.
∙ C is compiled directly to machine code.
∙ Java is compiled to byte-code, which is then interpreted and just-in-time (JIT) compiled to machine code.

Version  Implementation  Running time (s)  Relative speedup  Absolute speedup  GFLOPS  Percent of peak
1        Python          21041.67          1.00              1                 0.007   0.001
2        Java            2387.32           8.81              9                 0.058   0.007
3        C               1155.77           2.07              18                0.119   0.014

SLIDE 11

Interpreters are versatile, but slow

  • The interpreter reads, interprets, and performs each program statement and updates the machine state.
  • Interpreters can easily support high-level programming features, such as dynamic code alteration, at the cost of performance.

Interpreter loop: read the next statement, interpret it, perform it, update the machine state, and repeat.
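To make the loop concrete, here is a minimal sketch (an illustration added here, not from the slides) of a switch-based dispatch loop for a toy stack-machine bytecode; the opcodes and the little program are invented for the example:

// Illustrative interpreter dispatch loop for a toy bytecode (assumption:
// not the lecture's code). Each iteration reads the next instruction,
// interprets it, performs it, and updates the machine state.
#include <stdio.h>

enum opcode { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

int main(void) {
  int code[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
  int stack[16], sp = 0, pc = 0;
  for (;;) {
    int op = code[pc++];                                   // read next statement
    switch (op) {                                          // interpret it
      case OP_PUSH:  stack[sp++] = code[pc++]; break;      // perform + update state
      case OP_ADD:   --sp; stack[sp - 1] += stack[sp]; break;
      case OP_PRINT: printf("%d\n", stack[sp - 1]); break; // prints 5
      case OP_HALT:  return 0;
    }
  }
}

Every statement pays this read/decode/dispatch cost on every execution, which is one reason the interpreted Python loop above is so much slower than the compiled C version.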

SLIDE 12

JIT Compilation

∙ JIT compilers can recover some of the performance lost by interpretation.
∙ When code is first executed, it is interpreted.
∙ The runtime system keeps track of how often the various pieces of code are executed.
∙ Whenever some piece of code executes sufficiently frequently, it gets compiled to machine code in real time.
∙ Future executions of that code use the more-efficient compiled version.

SLIDE 13

Loop Order

for (int i = 0; i < n; ++i) {
  for (int j = 0; j < n; ++j) {
    for (int k = 0; k < n; ++k) {
      C[i][j] += A[i][k] * B[k][j];
    }
  }
}

We can change the order of the loops in this program without affecting its correctness

SLIDE 14

Loop Order (continued)

We can change the order of the loops in this program without affecting its correctness. But does the order of the loops matter for performance?

SLIDE 15

Performance of Different Orders

  • Loop order affects running time by a factor of 18!
  • What's going on?!

Loop order (outer to inner)  Running time (s)
i, j, k                      1155.77
i, k, j                       177.68
j, i, k                      1080.61
j, k, i                      3056.63
k, i, j                       179.21
k, j, i                      3032.82

SLIDE 16

Hardware Caches

[Diagram: a processor P with a cache of M/B cache lines of B bytes each, backed by main memory]

Each processor reads and writes main memory in contiguous blocks, called cache lines.

∙ Previously accessed cache lines are stored in a smaller memory, called a cache, that sits near the processor.
∙ Cache hits (accesses to data in the cache) are fast.
∙ Cache misses (accesses to data not in the cache) are slow.

SLIDE 17

Memory Layout of Matrices

In this matrix-multiplication code, matrices are laid out in memory in row-major order

[Diagram: a matrix with rows Row 1, Row 2, Row 3, … stored in memory as one contiguous sequence Row 1, Row 2, Row 3, Row 4, …]

What does this layout imply about the performance of different loop orders?
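As a concrete check on what row-major layout means for the loop orders that follow, here is a minimal sketch (not from the slides) that prints the distance in memory between horizontally and vertically adjacent elements:

// Illustrative: in row-major order, A[i][j] lives at offset (i*n + j)
// doubles from the base of the array, so consecutive elements of a row
// are 8 bytes apart while consecutive elements of a column are n*8
// bytes apart.
#include <stdio.h>

#define n 4096
static double A[n][n];

int main(void) {
  printf("row stride:    %td bytes\n", (char *)&A[0][1] - (char *)&A[0][0]); // 8
  printf("column stride: %td bytes\n", (char *)&A[1][0] - (char *)&A[0][0]); // 32768
  return 0;
}

Walking along a row therefore gets eight doubles from each 64-byte cache line, while walking down a column touches a new cache line on every access.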

SLIDE 18

Access Pattern for Order i, j, k

for (int i = 0; i < n; ++i)
  for (int j = 0; j < n; ++j)
    for (int k = 0; k < n; ++k)
      C[i][j] += A[i][k] * B[k][j];

Running time: 1155.77 s

[Diagram of the in-memory access pattern: in the inner loop, C[i][j] stays fixed (excellent spatial locality), A[i][k] walks along a row (good spatial locality), and B[k][j] walks down a column, with consecutive accesses 4096 elements apart (poor spatial locality)]

SLIDE 19

Access Pattern for Order i, k, j

for (int i = 0; i < n; ++i)
  for (int k = 0; k < n; ++k)
    for (int j = 0; j < n; ++j)
      C[i][j] += A[i][k] * B[k][j];

Running time: 177.68 s

[Diagram of the in-memory access pattern: in the inner loop, A[i][k] stays fixed (excellent spatial locality), while C[i][j] and B[k][j] walk along rows (good spatial locality)]

SLIDE 20

Access Pattern for Order j, k, i

for (int j = 0; j < n; ++j)
  for (int k = 0; k < n; ++k)
    for (int i = 0; i < n; ++i)
      C[i][j] += A[i][k] * B[k][j];

Running time: 3056.63 s

[Diagram of the in-memory access pattern: in the inner loop, B[k][j] stays fixed, while C[i][j] and A[i][k] walk down columns (poor spatial locality)]

SLIDE 21

Performance of Different Orders

We can measure the effect of different access patterns using the Cachegrind cache simulator:

$ valgrind --tool=cachegrind ./mm

Loop order (outer to inner)  Running time (s)  Last-level-cache miss rate
i, j, k                      1155.77            7.7%
i, k, j                       177.68            1.0%
j, i, k                      1080.61            8.6%
j, k, i                      3056.63           15.4%
k, i, j                       179.21            1.0%
k, j, i                      3032.82           15.4%
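For reference, Cachegrind writes its counts to a file named cachegrind.out.<pid> by default, and cg_annotate summarizes that file (the exact file name depends on the run):

$ cg_annotate cachegrind.out.<pid>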

SLIDE 22

Version 4: Interchange Loops

Version  Implementation       Running time (s)  Relative speedup  Absolute speedup  GFLOPS  Percent of peak
4        + interchange loops  177.68            6.50              118               0.774   0.093
(Rows 1–3 as in the previous table.)

What other simple changes can we try?

SLIDE 23

Compiler Optimization

Clang provides a collection of optimization switches. You can specify a switch to the compiler to ask it to optimize

Opt. level  Meaning             Time (s)
-O0         Do not optimize     177.54
-O1         Optimize             66.24
-O2         Optimize even more   54.63
-O3         Optimize yet more    55.58
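The timings above come from recompiling the same source at each level and rerunning it; an illustrative invocation (using the mm.c file name that appears later in the vectorization report):

$ clang -O2 -std=c99 mm.c -o mm
$ ./mm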

SLIDE 24

Version 5: Optimization Flags

Version  Implementation        Running time (s)  Relative speedup  Absolute speedup  GFLOPS  Percent of peak
5        + optimization flags  54.63             3.25              385               2.516   0.301
(Rows 1–4 as in the previous table.)

With simple code and compiler technology, we can achieve 0.3% of the peak performance of the machine

What’s causing the low performance?

SLIDE 25

Multicore Parallelism

We're running on just 1 of the 18 parallel-processing cores on this system. Let's use them all!

[Photo: Intel Haswell E5, 9 cores per chip; the AWS test machine has 2 of these chips]

SLIDE 26

Parallel Loops

The cilk_for loop allows all iterations of the loop to execute in parallel. Starting from the serial i, k, j ordering:

for (int i = 0; i < n; ++i)
  for (int k = 0; k < n; ++k)
    for (int j = 0; j < n; ++j)
      C[i][j] += A[i][k] * B[k][j];

we can parallelize the outer loop, or both the outer and inner loops:

cilk_for (int i = 0; i < n; ++i)
  for (int k = 0; k < n; ++k)
    for (int j = 0; j < n; ++j)
      C[i][j] += A[i][k] * B[k][j];

cilk_for (int i = 0; i < n; ++i)
  for (int k = 0; k < n; ++k)
    cilk_for (int j = 0; j < n; ++j)
      C[i][j] += A[i][k] * B[k][j];

These loops can be (easily) parallelized.

Which parallel version works best?

SLIDE 27

Experimenting with Parallel Loops

Parallel i loop (running time: 3.18 s):

cilk_for (int i = 0; i < n; ++i)
  for (int k = 0; k < n; ++k)
    for (int j = 0; j < n; ++j)
      C[i][j] += A[i][k] * B[k][j];

Parallel i and j loops (running time: 10.64 s):

cilk_for (int i = 0; i < n; ++i)
  for (int k = 0; k < n; ++k)
    cilk_for (int j = 0; j < n; ++j)
      C[i][j] += A[i][k] * B[k][j];

Parallel j loop only (running time: 531.71 s):

for (int i = 0; i < n; ++i)
  for (int k = 0; k < n; ++k)
    cilk_for (int j = 0; j < n; ++j)
      C[i][j] += A[i][k] * B[k][j];

Rule of Thumb: parallelize outer loops rather than inner loops.

SLIDE 28

Version 6: Parallel Loops

Version  Implementation  Running time (s)  Relative speedup  Absolute speedup  GFLOPS  Percent of peak
6        Parallel loops  3.04              17.97             6,921             45.211  5.408
(Rows 1–5 as in the previous table.)

Using parallel loops gets us almost 18× speedup on 18 cores! (Disclaimer: Not all code is so easy to parallelize effectively.)

Why are we still getting just 5% of peak?

SLIDE 29

Hardware Caches, Revisited

[Diagram: a processor P with a cache of M/B cache lines of B bytes each, backed by main memory, as on Slide 16]

  • Cache misses are slow, and cache hits are fast.
  • Try to make the most of the cache by reusing the data that's already there.

IDEA: Restructure the computation to reuse data in the cache as much as possible.

SLIDE 30

Data Reuse: Loops

[Diagram: C = A × B]

How many memory accesses must the looping code perform to fully compute 1 row of C?

  • 4096 * 1 = 4096 writes to C,
  • 4096 * 1 = 4096 reads from A, and
  • 4096 * 4096 = 16,777,216 reads from B, which is
  • 16,785,408 memory accesses total
SLIDE 31

Data Reuse: Blocks

[Diagram: C = A × B]

How many memory accesses does the looping code need to compute a 64 × 64 block of C?

  • 64 ⋅ 64 = 4096 writes to C,
  • 64 ⋅ 4096 = 262,144 reads from A, and
  • 4096 ⋅ 64 = 262,144 reads from B, or
  • 528,384 memory accesses total
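The two counts above are instances of the same formula: a p × q block of C needs p·q writes to C, p·n reads from A, and n·q reads from B. A minimal sketch (not from the slides) that evaluates it:

// Illustrative: memory accesses needed to compute a p x q block of C
// with the straightforward loops, for n = 4096.
#include <stdio.h>

static long accesses(long p, long q, long n) {
  return p * q + p * n + n * q;   // writes to C + reads from A + reads from B
}

int main(void) {
  const long n = 4096;
  printf("1 row of C:    %ld accesses\n", accesses(1, n, n));    // 16,785,408
  printf("64 x 64 block: %ld accesses\n", accesses(64, 64, n));  // 528,384
  return 0;
}

A row of C and a 64 × 64 block of C contain the same 4096 entries and require the same number of multiply-adds, but the blocked computation makes roughly 32× fewer memory accesses.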
SLIDE 32

Tiled Matrix Multiplication

[Diagram: the n × n matrices partitioned into s × s tiles]

cilk_for (int ih = 0; ih < n; ih += s)
  cilk_for (int jh = 0; jh < n; jh += s)
    for (int kh = 0; kh < n; kh += s)
      for (int il = 0; il < s; ++il)
        for (int kl = 0; kl < s; ++kl)
          for (int jl = 0; jl < s; ++jl)
            C[ih+il][jh+jl] += A[ih+il][kh+kl] * B[kh+kl][jh+jl];

SLIDE 33

Tiled Matrix Multiplication

(Same tiled code as on the previous slide.) The tile size s is a tuning parameter. How do we find the right value of s? Experiment!

Tile size s  Running time (s)
4            6.74
8            2.76
16           2.49
32           1.74
64           2.33
128          2.13
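One added sanity check (an observation, not from the slides): the inner three loops touch one s × s tile of each of C, A, and B, and 3 · 32² · 8 B = 24 KB fits in the 32 KB L1-dcache, whereas 3 · 64² · 8 B = 96 KB does not, which is consistent with s = 32 performing best.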

SLIDE 34

Version 7: Tiling

Implementation  Cache references (millions)  L1-d cache misses (millions)  Last-level cache misses (millions)
Parallel loops  104,090                      17,220                        8,600
+ tiling        64,690                       11,777                        416

Version  Implementation  Running time (s)  Relative speedup  Absolute speedup  GFLOPS  Percent of peak
7        + tiling        1.74              1.70              11,772            76.782  9.184
(Rows 1–6 as in the previous table.)

The tiled implementation performs only about 62% as many cache references, incurs about 68% as many L1-d cache misses, and incurs about 95% fewer last-level-cache misses.

SLIDE 35

Multicore Cache Hierarchy

Level  Size    Assoc.  Latency (ns)
Main   60 GB           50
LLC    25 MB   20      12
L2     256 KB  8       4
L1-d   32 KB   8       2
L1-i   32 KB   8       2

64-byte cache lines

[Diagram: each processor core P has private L1-data, L1-instruction, and L2 caches; the cores share the LLC (L3), which connects through the memory controller and on-chip network to DRAM]

SLIDE 36

Tiling for a Two-Level Cache

[Diagram: the n × n matrices tiled with outer tiles of size s and inner tiles of size t]

∙ Two tuning parameters, s and t.
∙ Multidimensional tuning optimization cannot be done with binary search.

SLIDE 37

Tiling for a Two-Level Cache

cilk_for (int ih = 0; ih < n; ih += s)
  cilk_for (int jh = 0; jh < n; jh += s)
    for (int kh = 0; kh < n; kh += s)
      for (int im = 0; im < s; im += t)
        for (int jm = 0; jm < s; jm += t)
          for (int km = 0; km < s; km += t)
            for (int il = 0; il < t; ++il)
              for (int kl = 0; kl < t; ++kl)
                for (int jl = 0; jl < t; ++jl)
                  C[ih+im+il][jh+jm+jl] += A[ih+im+il][kh+km+kl] * B[kh+km+kl][jh+jm+jl];

SLIDE 38

Recursive Matrix Multiplication

8 multiplications of n/2 × n/2 matrices, plus 1 addition of n × n matrices.

IDEA: Tile for every power of 2 simultaneously.

$$\begin{pmatrix} C_{00} & C_{01} \\ C_{10} & C_{11} \end{pmatrix}
= \begin{pmatrix} A_{00} & A_{01} \\ A_{10} & A_{11} \end{pmatrix}
\cdot \begin{pmatrix} B_{00} & B_{01} \\ B_{10} & B_{11} \end{pmatrix}
= \begin{pmatrix} A_{00}B_{00} & A_{00}B_{01} \\ A_{10}B_{00} & A_{10}B_{01} \end{pmatrix}
+ \begin{pmatrix} A_{01}B_{10} & A_{01}B_{11} \\ A_{11}B_{10} & A_{11}B_{11} \end{pmatrix}$$

SLIDE 39

Recursive Parallel Matrix Multiply

void mm_dac(double *restrict C, int n_C,
            double *restrict A, int n_A,
            double *restrict B, int n_B,
            int n)
{ // C += A * B
  assert((n & (-n)) == n);   // n must be a power of 2
  if (n <= 1) {
    *C += *A * *B;
  } else {
#define X(M,r,c) (M + (r*(n_ ## M) + c)*(n/2))
    cilk_spawn mm_dac(X(C,0,0), n_C, X(A,0,0), n_A, X(B,0,0), n_B, n/2);
    cilk_spawn mm_dac(X(C,0,1), n_C, X(A,0,0), n_A, X(B,0,1), n_B, n/2);
    cilk_spawn mm_dac(X(C,1,0), n_C, X(A,1,0), n_A, X(B,0,0), n_B, n/2);
               mm_dac(X(C,1,1), n_C, X(A,1,0), n_A, X(B,0,1), n_B, n/2);
    cilk_sync;
    cilk_spawn mm_dac(X(C,0,0), n_C, X(A,0,1), n_A, X(B,1,0), n_B, n/2);
    cilk_spawn mm_dac(X(C,0,1), n_C, X(A,0,1), n_A, X(B,1,1), n_B, n/2);
    cilk_spawn mm_dac(X(C,1,0), n_C, X(A,1,1), n_A, X(B,1,0), n_B, n/2);
               mm_dac(X(C,1,1), n_C, X(A,1,1), n_A, X(B,1,1), n_B, n/2);
    cilk_sync;
  }
}

cilk_spawn: the spawned child call may execute in parallel with the parent caller. cilk_sync: control may not pass this point until all spawned children have returned.

SLIDE 40

Recursive Parallel Matrix Multiply

(Same mm_dac code as on the previous slide.)

Running time: 93.93s … about 50× slower than the last version! The base case is too small. We must coarsen the recursion to overcome function-call overheads.

SLIDE 41

Coarsening The Recursion

void mm_dac(double *restrict C, int n_C,
            double *restrict A, int n_A,
            double *restrict B, int n_B,
            int n)
{ // C += A * B
  assert((n & (-n)) == n);
  if (n <= THRESHOLD) {
    mm_base(C, n_C, A, n_A, B, n_B, n);
  } else {
#define X(M,r,c) (M + (r*(n_ ## M) + c)*(n/2))
    cilk_spawn mm_dac(X(C,0,0), n_C, X(A,0,0), n_A, X(B,0,0), n_B, n/2);
    cilk_spawn mm_dac(X(C,0,1), n_C, X(A,0,0), n_A, X(B,0,1), n_B, n/2);
    cilk_spawn mm_dac(X(C,1,0), n_C, X(A,1,0), n_A, X(B,0,0), n_B, n/2);
               mm_dac(X(C,1,1), n_C, X(A,1,0), n_A, X(B,0,1), n_B, n/2);
    cilk_sync;
    cilk_spawn mm_dac(X(C,0,0), n_C, X(A,0,1), n_A, X(B,1,0), n_B, n/2);
    cilk_spawn mm_dac(X(C,0,1), n_C, X(A,0,1), n_A, X(B,1,1), n_B, n/2);
    cilk_spawn mm_dac(X(C,1,0), n_C, X(A,1,1), n_A, X(B,1,0), n_B, n/2);
               mm_dac(X(C,1,1), n_C, X(A,1,1), n_A, X(B,1,1), n_B, n/2);
    cilk_sync;
  }
}

Just one tuning parameter: the size of the base case (THRESHOLD).
SLIDE 42

Coarsening The Recursion

(Same mm_dac code as on the previous slide.)

void mm_base(double *restrict C, int n_C,
             double *restrict A, int n_A,
             double *restrict B, int n_B,
             int n)
{ // C += A * B
  for (int i = 0; i < n; ++i)
    for (int k = 0; k < n; ++k)
      for (int j = 0; j < n; ++j)
        C[i*n_C+j] += A[i*n_A+k] * B[k*n_B+j];
}
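For reference: with the static arrays double A[n][n], B[n][n], C[n][n] from the earlier C version, each matrix's row stride equals n, so the top-level call would be mm_dac(&C[0][0], n, &A[0][0], n, &B[0][0], n, n). (This call site is an assumption for illustration; it is not shown on the slides.)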

SLIDE 43

Coarsening The Recursion

Base-case size  Running time (s)
4               3.00
8               1.34
16              1.34
32              1.30
64              1.95
128             2.08

(Same coarsened mm_dac code as on the previous slides.)

SLIDE 44
Version 8: Parallel Divide-and-Conquer

Implementation               Cache references (millions)  L1-d cache misses (millions)  Last-level cache misses (millions)
Parallel loops               104,090                      17,220                        8,600
+ tiling                     64,690                       11,777                        416
Parallel divide-and-conquer  58,230                       9,407                         64

Version  Implementation               Running time (s)  Relative speedup  Absolute speedup  GFLOPS   Percent of peak
7        + tiling                     1.79              1.70              11,772            76.782   9.184
8        Parallel divide-and-conquer  1.30              1.38              16,197            105.722  12.646
(Rows 1–6 as in the previous tables.)

SLIDE 45

Vector Hardware

[Diagram: a 4-lane vector unit; each vector register holds Word 0..Word 3, one word per lane (Lane 0..Lane 3), with a vector load/store unit, an ALU per lane, instruction decode and sequencing, and memory and caches]

Modern microprocessors incorporate vector hardware to process data in single-instruction stream, multiple-data stream (SIMD) fashion

Each vector register holds multiple words of data. Parallel vector lanes operate synchronously on the words in a vector register.

SLIDE 46

Compiler Vectorization

Clang/LLVM uses vector instructions automatically when compiling at optimization level -O2 or higher. This can be checked in a vectorization report as follows:

$ clang -O3 -std=c99 mm.c -o mm -Rpass=vector
mm.c:42:7: remark: vectorized loop (vectorization width: 2, interleaved count: 2) [-Rpass=loop-vectorize]
      for (int j = 0; j < n; ++j) {
      ^

Many machines don't support the newest set of vector instructions, however, so the compiler uses vector instructions conservatively by default.

SLIDE 47

Vectorization Flags

Programmers can direct the compiler to use modern vector instructions using compiler flags such as the following:

  • -mavx: Use Intel AVX vector instructions
  • -mavx2: Use Intel AVX2 vector instructions
  • -mfma: Use fused multiply-add vector instructions
  • -march=<string>: Use whatever instructions are available on the specified architecture
  • -march=native: Use whatever instructions are available on the architecture of the machine doing the compilation

Due to restrictions on floating-point arithmetic, additional flags, such as -ffast-math, might be needed for these vectorization flags to have an effect.
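An illustrative invocation combining these flags with the earlier settings (the file name mm.c matches the vectorization report above):

$ clang -O3 -std=c99 -march=native -ffast-math mm.c -o mm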

SLIDE 48

Version 9: Compiler Vectorization

Using the flags -march=native -ffast-math nearly doubles the program's performance!

Version  Implementation            Running time (s)  Relative speedup  Absolute speedup  GFLOPS   Percent of peak
9        + compiler vectorization  0.70              1.87              30,272            196.341  23.486
(Rows 1–8 as in the previous table.)

Can we be smarter than the compiler?

SLIDE 49

AVX Intrinsic Instructions

  • Intel provides C-style functions, called intrinsic instructions, that provide direct access to hardware vector operations:

https://software.intel.com/sites/landingpage/IntrinsicsGuide/
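For a flavor of what the intrinsics look like, here is a minimal sketch (an illustration, not the lecture's actual base-case code) that updates four consecutive doubles of a row of C with one fused multiply-add; it assumes a machine with AVX2 and FMA support:

// Illustrative AVX2/FMA kernel (assumption: not the lecture's code).
// Updates C[i][j..j+3] += A[i][k] * B[k][j..j+3] with one fused multiply-add.
// Compile with a flag that enables FMA, e.g. -mavx2 -mfma or -march=native.
#include <immintrin.h>
#include <stdio.h>

static inline void update4(double *c, const double *b, double a) {
  __m256d va = _mm256_set1_pd(a);      // broadcast the scalar A[i][k]
  __m256d vb = _mm256_loadu_pd(b);     // load B[k][j..j+3]
  __m256d vc = _mm256_loadu_pd(c);     // load C[i][j..j+3]
  vc = _mm256_fmadd_pd(va, vb, vc);    // vc = va * vb + vc
  _mm256_storeu_pd(c, vc);             // store the updated C values
}

int main(void) {
  double b[4] = {1, 2, 3, 4}, c[4] = {0, 0, 0, 0};
  update4(c, b, 2.0);
  printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);  // prints: 2 4 6 8
  return 0;
}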

SLIDE 50

Plus More Optimizations

We can apply several more insights and performance-engineering tricks to make this code run faster, including:

  • Preprocessing
  • Matrix transposition
  • Data alignment
  • Memory-management optimizations
  • A clever algorithm for the base case that uses AVX intrinsic instructions explicitly

SLIDE 51

Plus Performance Engineering

Think, code, run, run, run… to test and measure many different implementations.

SLIDE 52

Version 10: AVX Intrinsics

Version  Implementation    Running time (s)  Relative speedup  Absolute speedup  GFLOPS   Percent of peak
10       + AVX intrinsics  0.39              1.76              53,292            352.408  41.677
(Rows 1–9 as in the previous table.)

SLIDE 53

Version 11: Final Reckoning

Version  Implementation               Running time (s)  Relative speedup  Absolute speedup  GFLOPS   Percent of peak
1        Python                       21041.67          1.00              1                 0.006    0.001
2        Java                         2387.32           8.81              9                 0.058    0.007
3        C                            1155.77           2.07              18                0.118    0.014
4        + interchange loops          177.68            6.50              118               0.774    0.093
5        + optimization flags         54.63             3.25              385               2.516    0.301
6        Parallel loops               3.04              17.97             6,921             45.211   5.408
7        + tiling                     1.79              1.70              11,772            76.782   9.184
8        Parallel divide-and-conquer  1.30              1.38              16,197            105.722  12.646
9        + compiler vectorization     0.70              1.87              30,272            196.341  23.486
10       + AVX intrinsics             0.39              1.76              53,292            352.408  41.677
11       Intel MKL                    0.41              0.97              51,497            335.217  40.098

Version 10 is competitive with Intel’s professionally engineered Math Kernel Library!

SLIDE 54

Engineering the Performance of your Algorithms

[Figure: the overall 53,292× speedup, illustrated by analogy with gas economy (MPG)]

∙ You won't generally see the magnitude of performance improvement we obtained for matrix multiplication.
∙ But in this course, you will learn how to print the currency of performance all by yourself.

SLIDE 55

Overall Structure in this Course

Performance Engineering
• Parallelism
• I/O efficiency
• New Bentley rules
• Brief overview of architecture

Algorithm Engineering
• Sorting / semisorting
• Matrix multiplication
• Graph algorithms
• Geometry algorithms

Related courses: EE/CS217 GPU Architecture and Parallel Programming, CS211 High Performance Computing, CS213 Multiprocessor Architecture and Programming (Stanford CS149), CS247 Principles of Distributed Computing