Lean programs, branch mispredictions, and sorting

Amr Elmasry & Jyrki Katajainen
Department of Computer Science, University of Copenhagen

These slides are available at http://www.cphstl.dk

© Performance Engineering Laboratory. 6th Conference on Fun with Algorithms, Venice, June 2012

Problem: Pipelining

Code:

    if (x < y) goto λ;
    I1;
    I2;
    ...
    λ:
    J1;
    J2;
    ...

[Figure: pipelined execution. After the conditional jump on (x < y) has been fetched, which instruction enters the pipeline next: I1 or J1?]

Here instructions are carried out in five steps:
• Instruction fetch
• Register read
• Execution
• Data access
• Register write

History table → prediction → speculation; if the prediction is wrong → cycles wasted

Early work

Call for research: “the frequency of conditional jump instructions might also be a factor” [Knuth 1993; The Stanford GraphBase, p. 497]

Mergesort: O(n lg n) work, n lg n + O(n) element comparisons, and O(n) branch mispredictions, where n is the number of elements being sorted; the stronger claims made in the thesis are wrong. [Mortensen 2001; Master’s Thesis]

Main idea

Decouple element comparisons from conditional branches!

C++ code:

    if (less(*q, *p)) {
        *r = *q;
        ++q;
    }
    else {
        *r = *p;
        ++p;
    }
    ++r;

Assembly-language code:

    movl   (%eax), %edx
    leal   4(%eax), %edi
    movl   (%ebx), %ecx
    leal   4(%ebx), %ebp
    cmpl   %ecx, %edx
    cmovge %ecx, %edx
    cmovge %ebp, %ebx
    cmovl  %edi, %eax
    movl   %edx, (%esi)
    addl   $4, %esi

Aha! Conditional move: if (c) x = y
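The same decoupling can be sketched in portable C++ by using the comparison result as an integer instead of as a branch condition. This is a minimal sketch, not the authors' code: the name `merge_branch_lean` and the fixed element type `int` are assumptions made for illustration, and whether a compiler really emits conditional moves for it depends on the compiler and its flags.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: merges two sorted vectors into a new sorted vector.
// The main loop has no if statement on an element comparison; the outcome of
// the comparison is used as 0 or 1 in ordinary index arithmetic.
std::vector<int> merge_branch_lean(const std::vector<int>& a,
                                   const std::vector<int>& b) {
    std::vector<int> out(a.size() + b.size());
    std::size_t i = 0, j = 0, k = 0;
    // Runs while both inputs are non-empty; only the loop test branches.
    while (i < a.size() && j < b.size()) {
        const bool take_b = b[j] < a[i];   // element comparison
        out[k++] = take_b ? b[j] : a[i];   // candidate for a conditional move
        j += take_b;                       // exactly one cursor advances
        i += !take_b;
    }
    // Copy the leftovers; these loop branches are easy to predict.
    while (i < a.size()) out[k++] = a[i++];
    while (j < b.size()) out[k++] = b[j++];
    assert(k == out.size());
    return out;
}
```

On ties the element from `a` is taken, which matches the slide's code, where `*p` is copied unless `*q` is strictly smaller.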

Later work (confuses me)

Samplesort: O(n lg n) work, n lg n + O(n) element comparisons, and O(n) branch mispredictions on average [Sanders & Winkel 2004]

Lower bound: Branch mispredictions are unavoidable in sorting [Brodal & Moruz 2005]

Quicksort: A skewed pivot-selection strategy can lead to better performance than the exact-median pivot-selection strategy [Kaligosi & Sanders 2006]

Search trees: Skewed binary search trees can perform better than perfectly balanced search trees [Brodal & Moruz 2006]

Appetizer: Heap construction

     1  template <typename position, typename index, typename comparator>
     2  void siftdown(position a, index i, index n, comparator less) {
     3      typedef typename std::iterator_traits<position>::value_type element;
     4      element copy = a[i];
     5  loop:
     6      index j = 2 * i;
     7      if (j <= n) {
     8          if (j < n)
     9              if (less(a[j], a[j + 1]))
    10                  j = j + 1;
    11          if (less(copy, a[j])) {
    12              a[i] = a[j];
    13              i = j;
    14              goto loop;
    15          }
    16      }
    17      a[i] = copy;
    18  }
    19
    20  template <typename position, typename comparator>
    21  void make_heap(position first, position beyond, comparator less) {
    22      typedef typename std::iterator_traits<position>::difference_type index;
    23      position const a = first - 1;
    24      index const n = beyond - first;
    25      for (index i = n / 2; i > 0; --i)
    26          siftdown(a, i, n, less);
    27  }

[Floyd 1964]

[Figure: complete binary tree with nodes numbered 1 to 10, holding the values 80, 49, 75, 53, 46, 27, 47, 12, 10, 24]

Optimization 1

opt 1: Make sure that siftdown is always called with an odd n.

New code (marked with >):

    > template <typename position, typename index, typename comparator>
    > void siftup(position a, index j, comparator less) {
    >     . . .
    > }

    > index const m = (n & 1) ? n : n - 1;
    > for (index i = m / 2; i > 0; --i)
    >     siftdown(a, i, m, less);
    > siftup(a, n, less);

Lines of the original program that are removed or replaced:

     8          if (j < n)
    25      for (index i = n / 2; i > 0; --i)
    26          siftdown(a, i, n, less);

Execution time/n [ns]

    n       Program F   Program F1
    2^10    11.4        10.3
    2^15    11.4        10.5
    2^20    16.2        16.1
    2^25    16.4        15.6

Aha! An unnecessary if
Aha! Cache effects

Optimization 2

opt 2: Interpret the result of a comparison as an integer and use this value in normal index arithmetic.

New code (marked with >):

    > j = j + less(a[j], a[j + 1]);

Lines of the original program that are replaced:

     9              if (less(a[j], a[j + 1]))
    10                  j = j + 1;

Execution time/n [ns]

    n       Program F1  Program F12
    2^10    10.3        7.1
    2^15    10.5        7.6
    2^20    16.1        11.0
    2^25    15.6        14.0

Aha! A mixture of ints and bools
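Optimizations 1 and 2 can be combined into a self-contained sketch. The names `at`, `siftdown_opt12`, `heapified`, and `is_max_heap` are hypothetical, and the element type is fixed to `int` for brevity. The body of `siftup` is elided on the slide, so the one here is a standard sift-up filled in as an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// 1-based access into a 0-based vector; plays the role of a = first - 1
// without forming a pointer before the start of the array.
inline int& at(std::vector<int>& v, std::size_t k) { return v[k - 1]; }

// Sift-down restricted to v[1..n]. n must be odd (opt 1), so every node
// with a left child also has a right child and a[j + 1] always exists.
void siftdown_opt12(std::vector<int>& v, std::size_t i, std::size_t n) {
    assert(n % 2 == 1 && n <= v.size());
    const int copy = at(v, i);
    std::size_t j = 2 * i;
    while (j <= n) {
        j += at(v, j) < at(v, j + 1);    // opt 2: comparison used as 0 or 1
        if (!(copy < at(v, j))) break;   // the remaining conditional branch
        at(v, i) = at(v, j);
        i = j;
        j = 2 * i;
    }
    at(v, i) = copy;
}

// Assumed body: moves the element at position j up towards the root.
void siftup(std::vector<int>& v, std::size_t j) {
    const int copy = at(v, j);
    while (j > 1 && at(v, j / 2) < copy) {
        at(v, j) = at(v, j / 2);
        j /= 2;
    }
    at(v, j) = copy;
}

// Builds a max-heap: heapify the largest odd-sized prefix, then insert
// the last element with siftup (a no-op when the size is already odd).
std::vector<int> heapified(std::vector<int> v) {
    const std::size_t n = v.size();
    if (n == 0) return v;
    const std::size_t m = (n & 1) ? n : n - 1;
    for (std::size_t i = m / 2; i > 0; --i) siftdown_opt12(v, i, m);
    siftup(v, n);
    return v;
}

// Checks the max-heap property for every parent/child pair.
bool is_max_heap(const std::vector<int>& v) {
    for (std::size_t k = 2; k <= v.size(); ++k)
        if (v[k / 2 - 1] < v[k - 1]) return false;
    return true;
}
```

For instance, `is_max_heap(heapified({12, 80, 49, 75, 53}))` should hold, with 80 ending up at the front of the returned vector.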

Optimization 3

opt 3: Do not make any element moves when the element at the root stays in its original location.

New code (marked with >):

    > element copy;
    > index k = 2 * i;
    > k = k + less(a[k], a[k + 1]);
    > if (less(a[i], a[k])) {
    >     copy = a[i];
    >     a[i] = a[k];
    > }
    > else {
    >     return;
    > }
    > i = k;

Line of the original program that is replaced:

     4      element copy = a[i];

Execution time/n [ns]

    n       Program F12  Program F123
    2^10    7.1          6.4
    2^15    7.6          6.8
    2^20    11.0         10.0
    2^25    14.0         12.9

Element moves/n

    n       Program F   Program F123
    2^10    1.73        1.52
    2^15    1.74        1.53
    2^20    1.74        1.53
    2^25    1.74        1.52

Aha! Loop unrolling

Ultimate goal: Lean programs

Referee comment: Conditional-branch-lean would be a better term!

• A program that has a constant number of unnested loops.
• Each loop is branch-free, except the final conditional branch at the end.
• A branch predictor is static, assuming that forward branches are not taken and backward branches are taken.
• Each such program induces O(1) branch mispredictions in this model.
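A small routine in this style might look as follows. It is only a sketch of the definition above: the name `max_index_lean` is hypothetical, and at the source level nothing guarantees that the compiler produces a branch-free loop body.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the index of the first maximum element of a non-empty vector.
// One unnested loop; the body is straight-line arithmetic and the only
// conditional branch is the backward loop-end test.
std::size_t max_index_lean(const std::vector<int>& v) {
    assert(!v.empty());
    const std::size_t n = v.size();
    std::size_t best = 0;
    for (std::size_t i = 1; i < n; ++i) {
        const std::size_t better = v[best] < v[i];   // 0 or 1
        best += better * (i - best);                 // best = better ? i : best
    }
    return best;
}
```

Under the static predictor of the model, the loop-end branch is predicted taken on every iteration and is mispredicted only once, on exit.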

Our main result: Program transformation

Theorem. Let P be a program of length κ, measured in the number of assembly-language instructions. Assume that the running time of P is t(n) for an input of size n. There exists a program Q of length O(κ) that is equivalent to P, runs in O(κ t(n)) time for the same input as P, and induces O(1) branch mispredictions. [this paper]
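To make the statement more tangible, the sketch below flattens one small program by hand: insertion sort, normally two nested loops, becomes a single loop whose body is straight-line code followed by the loop-end test. This is an illustration under assumptions, not the general construction behind the theorem, which the slide does not spell out. The name `insertion_sort_lean` is hypothetical, and whether the `?:` expressions become conditional moves depends on the compiler.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hand-flattened insertion sort. State: i is the element being inserted,
// j is the current hole position, x is the value being inserted. Every
// iteration executes the same statements; a flag selects their effect.
std::vector<int> insertion_sort_lean(std::vector<int> a) {
    const std::size_t n = a.size();
    if (n < 2) return a;
    std::size_t i = 1, j = 1;
    int x = a[1];
    while (i < n) {
        const std::size_t prev = j - (j > 0);         // j - 1, clamped at 0
        const bool shift = (j > 0) & (x < a[prev]);   // inner-loop condition
        a[j] = shift ? a[prev] : x;                   // shift right, or place x
        i += !shift;                                  // outer-loop step
        j = shift ? prev : i;                         // move hole, or restart
        const std::size_t next = i - (i == n);        // clamp the final load
        x = shift ? x : a[next];                      // fetch next element
    }
    assert(std::is_sorted(a.begin(), a.end()));
    return a;
}
```

Each iteration pays for both the "inner" and the "outer" work whether or not it is needed, which gives a feel for why the running-time bound carries a factor that depends on the program length.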

An improvement

Referee comment: It seems that the bound on the running time could be improved.

Yes. Instead of program length κ, one could express the running time as a function of the number of basic blocks. A basic block is a piece of code with at most one branch or branch target; branch targets start a block and branches end a block.

Example: The control-flow graph of siftdown

[Figure: control-flow graph whose nodes are the line ranges 1-4, 5-7, 8, 9, 10, 11, 12-14, and 17 of siftdown]

Other results: Hand-tailoring

Heapsort: O(n lg n) work, 2n lg n + O(n) element comparisons, O(1) extra space, and O(1) branch mispredictions [this paper]

Mergesort: O(n lg n) work, n lg n + O(n) element comparisons, O(n) extra space, and O(1) branch mispredictions [this paper]

In-situ mergesort: O(n lg n) work, n lg n + O(n) element comparisons, O(lg n) extra space, and O(n) branch mispredictions [Elmasry, Katajainen & Stenmark 2012]

Criticism

Theory
1) We used C++ to describe the programs.
2) We relied on conditional-move instructions.
3) We assumed that the branch predictor was static.
4) On paper everything worked smoothly.

Practice
1) Assembly code written by us was much slower than that generated by the compiler.
2) We could not force the compiler to translate them as we wanted.
3) Real branch-prediction hardware is more complicated.
4) We got test results that we could not explain.

Advice for practitioners

• Write programs as before if speed is not the primary concern.
• Keep easy-to-predict branches, since they have a small overhead on modern processors.
• Eliminate hard-to-predict branches if the elimination will not cause too much overhead.

Concluding remarks

• Welcome to the world of paranoid programming!

Referee comment: How architecture-dependent are the results?

Referee comment: The fun factor is pretty much non-existent.

• It was fun to tailor the programs until we saw the pattern of how to write them.
• Still, we do not know what the most efficient way of avoiding if statements is.

Aha! Creativity still needed
