SLIDE 1
Helsinki, 8 December 2003

Title: The current truth about heaps
Speaker: Jyrki Katajainen
Co-workers: Claus Jensen and Fabio Vitale

This talk is about the heaps we all love. I will explain how the heap functions are implemented in the CPH STL.
SLIDE 2
SLIDE 3
© Performance Engineering Laboratory
SLIDE 4
Heap functions in the STL
void push_heap(position A, position Z, ordering f);
Effect: [diagram: a new last element is sifted into an existing heap]
at most log₂ n comparisons
void pop_heap(position A, position Z, ordering f);
Effect: [diagram: the top element is removed and the heap is re-established]
at most 2 log₂ n comparisons
void make_heap(position A, position Z, ordering f);
Effect: [diagram: an arbitrary sequence is turned into a heap]
at most 3n comparisons
void sort_heap(position A, position Z, ordering f);
Effect: [diagram: a heap is turned into a sorted sequence]
at most n log₂ n comparisons
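In current C++ these four functions are std::push_heap, std::pop_heap, std::make_heap, and std::sort_heap; a minimal sketch of how they compose, using the default ordering std::less (i.e. a max-heap):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Build a heap, push one extra element, pop the maximum, then sort.
std::vector<int> demo(std::vector<int> v, int extra) {
    std::make_heap(v.begin(), v.end());   // arbitrary sequence -> heap
    v.push_back(extra);
    std::push_heap(v.begin(), v.end());   // sift the new last element up
    std::pop_heap(v.begin(), v.end());    // move the maximum to the back...
    v.pop_back();                         // ...and discard it
    std::sort_heap(v.begin(), v.end());   // heap -> ascending order
    return v;
}
```

For example, demo({3, 1, 2}, 42) builds a heap of {3, 1, 2}, pushes 42, pops it again (it is the maximum), and returns {1, 2, 3}.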
SLIDE 5
How would you do it?
SLIDE 6
Jones 1986
Operation sequence (hold model):
push()^N [pop() push()]^K

e ← pop()
increase the priority of e by −ln(drand())
push(e)

Input data: element size: 4 B; #elements: 1–2^13.5
Environment: computer: VAX 11/780 running UNIX (BSD 4.2); cache: 8 kB; TLB: 64 entries; compiler: Berkeley Pascal with optimization enabled
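A sketch of this hold model in modern C++, with std::priority_queue standing in for the heap and a uniform draw in (0, 1) standing in for drand() (both substitutions are mine, not Jones's setup):

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>
#include <functional>
#include <queue>
#include <vector>

// Hold model: N initial pushes, then K rounds of e <- pop();
// push(e) with the priority of e increased by -ln(drand()).
// A min-heap, so pop() returns the smallest priority.
using min_heap =
    std::priority_queue<double, std::vector<double>, std::greater<double>>;

min_heap simulate_hold(int N, int K) {
    min_heap pq;
    for (int i = 0; i < N; ++i)
        pq.push(static_cast<double>(std::rand()) / RAND_MAX);
    for (int k = 0; k < K; ++k) {
        double e = pq.top();                                // e <- pop()
        pq.pop();
        double u = (std::rand() + 1.0) / (RAND_MAX + 2.0);  // drand() in (0, 1)
        pq.push(e - std::log(u));                           // priority += -ln(u)
    }
    return pq;
}
```

The heap size stays at N throughout the measured phase, which is what makes the hold model a steady-state benchmark.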
SLIDE 7
LaMarca & Ladner 1996
Operation sequence: hold model?
#define NOTSORANDNUM(x) (x + RANDNUM())
Input data: element size: 8 B; #elements: 2^10–2^23
Environment: computer: DEC Alphastation 250; processor: Alpha 21064A, 266 MHz; L1 cache: 8 kB; L2 cache: direct-mapped, 2 MB, 32 B per line; compiler?: cc
SLIDE 8
Sanders 1999
Operation sequence: [push() pop() push()]^N [pop() push() pop()]^N
Input data: element size: 4 B, drawn randomly; satellite data: 4 B; #elements: 2^8–2^23
Environment: computer: Pentium II, 300 MHz; compiler: g++ -O6
SLIDE 9
Brengel et al. 1999
Operation sequence: push()^N pop()^N
Input data: element size: 4 B, drawn randomly from [0..10^7]; #elements: 1·10^6–200·10^6
Environment: computer: Sparc Ultra 1/143; main memory: 256 MB, 8 kB per page; local disk: 9 GB fast-wide SCSI; logical block size: 64 kB; buffer size: 16 MB
SLIDE 10
Edelkamp & Stiegeler 2002
Operation sequence: make(N) [pop()]^N
Input data: element size: 4 B, floating-point numbers drawn randomly; #elements: 10^6; ordering: f^0(x) = x and f^i(x) = ln(f^{i−1}(x+1)) for i > 0
Environment: computer: Pentium III, 450 MHz; compiler: g++ -O2
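The iterated orderings can be coded directly from the definition above; a small sketch (the function name is mine):

```cpp
#include <cassert>
#include <cmath>

// f^0(x) = x and f^i(x) = ln(f^(i-1)(x + 1)) for i > 0.
// Each extra level of i makes a comparison key more expensive to
// evaluate while preserving the order, since ln is increasing.
double f(int i, double x) {
    if (i == 0)
        return x;
    return std::log(f(i - 1, x + 1.0));
}
```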
SLIDE 11
How would you do it now?
SLIDE 12
Sanders’ programs on Pentium II
[plot: execution time per element in nanoseconds vs. n (10^3–10^7) for the sequence [push()]^N [pop()]^N; curves: 2-ary heap, 4-ary heap]
SLIDE 13
Sanders’ programs on Pentium III
[plot: execution time per element in nanoseconds vs. n (10^3–10^7) for the sequence [push()]^N [pop()]^N; curves: 2-ary heap, 4-ary heap]
SLIDE 14
Sanders’ programs on Pentium IV
[plot: execution time per element in nanoseconds vs. n (10^3–10^7) for the sequence [push()]^N [pop()]^N; curves: 2-ary heap, 4-ary heap]
SLIDE 15
Cost of unsigned int operations
initializations (all rows): a[i] ← 0; x ← 2^20; p as listed

instruction               p        n             time per instruction
a[i] ← x                  p ← 1    2^10..2^24    4.1–4.7 ns
a[i] ← x                  p ← 617  2^10..2^14    7.3–8.9 ns
                                   2^15          12 ns
                                   2^16          29 ns
                                   2^17..2^22    62–63 ns
x ← a[i]                  p ← 1    2^10..2^24    3.3–3.8 ns
x ← a[i]                  p ← 617  2^10..2^15    3.3–4.1 ns
                                   2^16          23 ns
                                   2^17..2^22    45–55 ns
r ← (a[i] < x)            p ← 1    2^10..2^24    5.3–5.8 ns
r ← (ln(a[i]) < ln(x))    p ← 1    2^10..2^24    580–610 ns
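The p ← 1 versus p ← 617 initializations suggest that the array index advances with stride p, so the same instruction is measured once with cache-friendly and once with cache-hostile accesses. A hypothetical re-creation of such a micro-benchmark (the function name and timing harness are mine):

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Time the store a[i] <- x over n steps, advancing i by the stride p
// modulo n.  With n a power of two and p odd, every slot is visited
// exactly once; p = 1 is sequential, p = 617 defeats the cache.
double ns_per_store(std::size_t n, std::size_t p) {
    std::vector<unsigned int> a(n, 0);
    unsigned int x = 1u << 20;
    auto start = std::chrono::steady_clock::now();
    std::size_t i = 0;
    for (std::size_t k = 0; k < n; ++k) {
        a[i] = x;           // the timed instruction
        i = (i + p) % n;    // next index, stride p
    }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count() / n;
}
```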
SLIDE 16
Cost of bigint operations
initializations (all rows): a[i] ← 0; x ← 2^20; p as listed

instruction       p        n             time per instruction
a[i] ← x          p ← 1    2^10..2^21    60–66 ns
                           2^22          290 ns
a[i] ← x          p ← 617  2^10..2^12    75–78 ns
                           2^13          117 ns
                           2^14          229 ns
                           2^15..2^20    297–318 ns
                           2^21..2^22    748–752 ns
x ← a[i]          p ← 1    2^10..2^22    18–21 ns
x ← a[i]          p ← 617  2^10..2^12    24 ns
                           2^13          83 ns
                           2^14          180 ns
                           2^15..2^22    230–260 ns
r ← (a[i] < x)    p ← 1    2^10..2^22    13–16 ns
SLIDE 17
Other current research
– Pointer-based methods: hopelessly slow → theoretical computer science
– Methods with good amortized bounds: terrible worst case → not relevant for us
– Methods with few element moves: bad cache behaviour → not good for us
– External-memory methods: high constants → relevant only for very large data sets
– Cache-oblivious methods: huge constants → theoretical computer science
SLIDE 18
Our policy-based framework
template <arity d, typename position, typename ordering>
class heap_policy {
public:
    typedef typename std::iterator_traits<position>::difference_type index;
    typedef typename std::iterator_traits<position>::difference_type level;
    typedef typename std::iterator_traits<position>::value_type element;

    template <typename integer>
    heap_policy(integer n = 0);

    bool is_root(index) const;
    bool is_first_child(index) const;
    index size() const;
    level depth(index) const;
    index root() const;
    index leftmost_leaf() const;
    index last_leaf() const;
    index first_child(index) const;
    index parent(index) const;
    index ancestor(index, level) const;
    index top_some_absent(position, index, const ordering&) const;
    index top_all_present(position, index, const ordering&) const;
    void update(position, index, const element&);
    void erase_last_leaf(position, const ordering&);
    void insert_new_leaf(position, const ordering&);

private:
    index n;
};
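The navigation members leave the storage layout to the policy. Assuming the usual implicit layout of a complete d-ary heap with the root at index 0 (an assumption; the CPH STL policy may differ), the index arithmetic would be:

```cpp
#include <cassert>
#include <cstddef>

// Implicit d-ary layout: the children of node i occupy indices
// d*i + 1 .. d*i + d, so the parent of i > 0 is (i - 1) / d.
template <std::size_t d>
struct implicit_layout {
    static std::size_t first_child(std::size_t i) { return d * i + 1; }
    static std::size_t last_child(std::size_t i)  { return d * i + d; }
    static std::size_t parent(std::size_t i)      { return (i - 1) / d; }
    static bool is_root(std::size_t i)            { return i == 0; }
    static bool is_first_child(std::size_t i)     { return i % d == 1; }
};
```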
SLIDE 19
Input data
                        cheap move                     expensive move
cheap comparison        unsigned int                   bigint
expensive comparison    unsigned int, ln comparison    (int, bigint), ln comparison
SLIDE 20
One new old idea: local heaps
SLIDE 21
Our solution for sort_heap()
– In-place mergesort by Katajainen, Pasanen, and Teuhola [1996]
– Fine-tuning not yet implemented
– Almost as fast as quicksort; see CPH STL Report 2003-2
SLIDE 22
Our solution for make_heap()
– Depth-first heap construction by Bojesen, Katajainen, and Spork [2000]
– Almost optimal in all respects
– Other work: fewer element comparisons → theoretical computer science
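The cited paper gives the depth-first construction itself; for a baseline, here is the classic bottom-up (Floyd) construction that it is measured against, sketched for a binary max-heap:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Sift a[i] down within a[0..n) until both children are no larger.
void sift_down(std::vector<int>& a, std::size_t i, std::size_t n) {
    while (2 * i + 1 < n) {
        std::size_t c = 2 * i + 1;              // left child
        if (c + 1 < n && a[c + 1] > a[c]) ++c;  // pick the larger child
        if (a[i] >= a[c]) return;
        std::swap(a[i], a[c]);
        i = c;
    }
}

// Floyd's construction: handle the internal nodes from right to left.
void build_heap(std::vector<int>& a) {
    for (std::size_t i = a.size() / 2; i-- > 0; )
        sift_down(a, i, a.size());
}
```

Floyd's method achieves the at-most-3n comparison bound quoted earlier; the depth-first variant improves mainly on its cache behaviour.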
SLIDE 23
Various approaches for pop_heap()
– top-down → many element comparisons
– bottom-up → good in the typical case
– move-saving bottom-up → theoretical computer science
– binary-search top-down
– two-levels-at-a-time top-down
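As an illustration, a sketch of the bottom-up strategy for a binary max-heap (mine, not the CPH STL code): the hole at the root walks down along the larger children with one comparison per level, the last leaf refills the hole, and a usually short sift-up finishes the job.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Remove and return the maximum of a non-empty binary max-heap.
int pop_bottom_up(std::vector<int>& a) {
    int top = a[0];
    std::size_t n = a.size() - 1;   // index of the last leaf
    std::size_t hole = 0;
    while (2 * hole + 1 < n) {      // walk the hole down to a leaf
        std::size_t c = 2 * hole + 1;
        if (c + 1 < n && a[c + 1] > a[c]) ++c;
        a[hole] = a[c];
        hole = c;
    }
    a[hole] = a[n];                 // refill with the last leaf
    a.pop_back();
    while (hole != 0 && a[(hole - 1) / 2] < a[hole]) {  // sift up
        std::swap(a[hole], a[(hole - 1) / 2]);
        hole = (hole - 1) / 2;
    }
    return top;
}
```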
SLIDE 24
Various approaches for push_heap()
– move-saving top-down → slow
– bottom-up → good in the typical case
– bottom-up with buffering → complicated
– binary-search bottom-up
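For contrast with the variants above, the plain bottom-up push for a binary max-heap (a sketch; the binary-search variant would binary-search the root-to-leaf ancestor path instead of stepping level by level):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Append x and sift it up until its parent is at least as large.
void push_bottom_up(std::vector<int>& a, int x) {
    a.push_back(x);
    std::size_t i = a.size() - 1;
    while (i != 0 && a[(i - 1) / 2] < a[i]) {
        std::swap(a[i], a[(i - 1) / 2]);
        i = (i - 1) / 2;
    }
}
```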
SLIDE 25
Efficiency of 2-, 3-, 4-ary heaps
[plot: execution time per element in nanoseconds vs. n (10^3–10^7) for sorting random integers; curves: SGI::partial_sort(), bottom-up 2-, 3-, and 4-ary heaps, SGI::sort()]
SLIDE 26
Efficiency of 2-, 3-, 4-ary heaps
[plot: execution time per element in nanoseconds vs. n (10^3–10^7) for sorting random integers with the ln comparison; curves: bottom-up 2-, 3-, and 4-ary heaps, SGI::sort(), SGI::partial_sort()]
SLIDE 27
Efficiency of local heaps
[plot: execution time per element in nanoseconds vs. n (10^3–10^7) for sorting random integers; curves: two-by-two top-down 1- to 5-local heaps, SGI::partial_sort(), SGI::sort()]
SLIDE 28
Efficiency of local heaps
[plot: execution time per element in nanoseconds vs. n (10^3–10^7) for sorting random integers with the ln comparison; curves: two-by-two top-down 1- to 5-local heaps, SGI::sort(), SGI::partial_sort()]
SLIDE 29
Conclusions
– In 40 years: not much progress.
– At the moment it is not clear how big the overhead of local heaps is for small problem sizes.
– Some combinations of the various approaches still have to be tested.
– Code-tuning of the best approaches is still to be done.
– It takes time to develop fast library routines.
– How does technology influence the efficiency of library routines?
SLIDE 30