SLIDE 1

The hash tables

  • Google’s dense and sparse hash tables

– Use open addressing
– Quadratic probing (see the sketch after this list)

  • SGI

– Is a chained hash table
– Referred to as gnu in the graphs

  • One-table and Two-table

– Doubly linked: Rather large compartments
– Buckets: Two pointers delimit a section of a circular list
– One-table uses an alternative vector implementation
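
To make the probing concrete, here is a minimal sketch of lookup under open addressing with quadratic probing; the triangular probe sequence and all names are illustrative assumptions, not Google's actual code.

```cpp
#include <cstddef>
#include <vector>

// Minimal sketch of lookup under open addressing with quadratic probing.
// The probe offsets 1, 2, 3, ... accumulate into triangular steps
// h, h+1, h+3, h+6, ..., which visit every slot when the table size is a
// power of two. Illustrative only, not Google's actual implementation.
struct ProbingTable {
    enum State { EMPTY, FULL, DELETED };
    struct Slot { int key = 0; State state = EMPTY; };
    std::vector<Slot> slots;                       // size is a power of two

    explicit ProbingTable(std::size_t n) : slots(n) {}

    bool contains(int key) const {
        const std::size_t mask = slots.size() - 1;
        std::size_t i = static_cast<std::size_t>(key) & mask;
        for (std::size_t step = 1; ; ++step) {
            const Slot& s = slots[i];
            if (s.state == EMPTY) return false;    // never-used slot: key absent
            if (s.state == FULL && s.key == key) return true;
            i = (i + step) & mask;                 // quadratic probe
        }
    }
};
```

With a max load factor of 0.8 (slide 7) there is always an empty slot, so the probe loop terminates.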

SLIDE 2
  • Chained

– Singly linked: small compartments
– Lookup uses a tight inner loop by copying the sought key into a sentinel (see the sketch after this list)

  • The same hash functions were used for all hash tables: a string-table lookup method for string data and a division method for integer data.
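
Here is a hedged sketch of the sentinel trick mentioned above, combined with the division method for integer keys; the node layout is an assumption for illustration.

```cpp
#include <cstddef>
#include <vector>

// Sketch of the sentinel trick for the chained table's lookup: every chain
// ends at a shared sentinel node, and the sought key is copied into the
// sentinel before the scan, so the inner loop contains a single test.
// Layout and names are illustrative assumptions.
struct Node { int key; Node* next; };

struct ChainedTable {
    std::vector<Node*> buckets;   // empty chains point directly at sentinel
    Node sentinel{0, nullptr};

    bool contains(int key) {
        sentinel.key = key;                        // guarantees termination
        Node* p = buckets[static_cast<std::size_t>(key) % buckets.size()];
        while (p->key != key) p = p->next;         // tight single-test loop
        return p != &sentinel;                     // real node or sentinel?
    }
};
```

Because the sentinel always matches, the scan needs no end-of-chain test; the final pointer comparison distinguishes a real hit from the sentinel.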

SLIDE 3

The machines

  • 32-bit Intel machines at DIKU, without PAPI
  • A 64-bit AMD machine with PAPI extensions
SLIDE 4

Kinds of benchmarks

  • Memory allocated

– Not measured for the Google tables

  • Timing

– CPU time and total CPU cycles
– Variability measured as std. deviation
– The CPU time and CPU cycle graphs are very alike, so CPU cycles are used.

  • Cache behaviour

– Number of L1 cache misses.
– L1 cache miss ratio: the percentage of cache accesses that are misses, i.e. 100 · misses / accesses (see the sketch below).
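
A rough sketch of these two metric kinds, assuming the x86 time-stamp counter for cycle counts and hypothetical counter values for the miss ratio (the thesis read real counts via PAPI on the AMD machine):

```cpp
#include <cstdint>
#include <cstdio>
#include <x86intrin.h>   // __rdtsc (GCC/Clang, x86 only)

// Rough illustration of the two metric kinds above: total CPU cycles read
// from the time-stamp counter, and the L1 miss ratio computed from two
// counter values. The thesis read real counts via PAPI; the miss/access
// numbers here are hypothetical placeholders.
int main() {
    std::uint64_t start = __rdtsc();
    volatile long sink = 0;
    for (long i = 0; i < 1000000; ++i) sink = sink + i;  // stand-in workload
    std::uint64_t cycles = __rdtsc() - start;

    std::uint64_t misses = 12345, accesses = 1000000;    // hypothetical counts
    double ratio = 100.0 * static_cast<double>(misses) / accesses;
    std::printf("cycles = %llu, L1 miss ratio = %.2f%%\n",
                static_cast<unsigned long long>(cycles), ratio);
    return 0;
}
```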

SLIDE 5

Benchmarks skipped

[Table: for each operation (insert; lookup, existing; lookup, non-existing; erase; iterate fwd.; iterate bwd.) and each data set (random and genome-based integers; random words and paths as by-reference strings; random words and paths as by-value strings), the table marks which benchmarks were skipped: CPU time, memory allocation, response time, or cache.]

  • The measurements using “by value” string data are not included here because they were problematic.

SLIDE 6

Data used for the benchmarks

  • Random strings (10 bytes) and integers reflect the behavior of the data structures under a perfect distribution of input data.
  • Ordered data: gene sequences, words, and the output of the locate command.
  • Realism of data: range 100,000 – 800,000 elements.
  • Data was loaded into memory from a file and then scanned.

SLIDE 7
  • The maximum load factor

– Google: 0.8
– SGI (gnu on graphs): 1.0
– Us: initially 5.0, then 1.0; we focus on a max load factor of 1.0

  • Timing

– The linear hash tables all save the hash value, which saves time on string data (see the sketch after this list).
– Iteration on the linear hash tables should be efficient because of the circular chain of elements.
– The chained hash table uses a tight inner loop containing only one test. The first compartment in the chain is stored in the vector.
– Google has a cache advantage in using open addressing.
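
A minimal sketch of the stored-hash optimisation from the first timing item, assuming a simple chain of entries; names are illustrative, not the one-table or two-table code.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Sketch of the stored-hash optimisation: each entry keeps its full hash,
// so string lookups compare the cheap integer first and only compare the
// strings on a hash match. Illustrative only, not the one-/two-table code.
struct Entry { std::size_t hash; std::string key; };

bool chain_contains(const std::vector<Entry>& chain,
                    std::size_t h, const std::string& key) {
    for (const Entry& e : chain)
        if (e.hash == h && e.key == key)   // string compared only on hash hit
            return true;
    return false;
}
```

The stored hash also lets a resize redistribute elements without rehashing the strings, at the cost of the extra word per element noted on slide 8.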

SLIDE 8
  • Allocation

– Saving the hash value takes more memory.
– Doubly linked lists vs. singly linked lists.
– Two tables vs. one table.
– Alternative allocation scheme used by one-table.

SLIDE 9

The Lookup operation

  • Lookup, non-existing is called for each insert.
  • Lookup, existing is called for each delete.
  • Saving the hash value saves time on lookup of string data.
  • Lookup, non-existing: each odd entry was inserted, each even entry was looked up (see the sketch below).
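
A sketch of that lookup, non-existing protocol, with std::unordered_set standing in for the benchmarked tables (an assumption for illustration):

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Sketch of the lookup, non-existing protocol: entries at odd positions
// are inserted, entries at even positions are then looked up, so
// essentially every probe misses. std::unordered_set is a stand-in for
// the tables actually benchmarked.
std::size_t count_hits(const std::vector<int>& data) {
    std::unordered_set<int> table;
    for (std::size_t i = 1; i < data.size(); i += 2)
        table.insert(data[i]);                 // odd entries inserted
    std::size_t hits = 0;
    for (std::size_t i = 0; i < data.size(); i += 2)
        hits += table.count(data[i]);          // even entries probed
    return hits;                               // ~0 for distinct data
}
```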

SLIDE 10

[Graphs: Total CPU cycles for lookup, existing of random ints, and of genome-based ints. Series: google_dense, google_sparse, gnu, onetable, twotable, chained.]

  • One-table and two-table are very similar.
  • There is a factor-10 difference between the y-axes of the two graphs.
  • A non-uniform distribution particularly affects the sparse hash table.

SLIDE 11

Cache

  • Few cache misses with open addressing.
  • Lookup of non-existing elements causes more cache misses because more elements are traversed on average.
  • The linear hash tables have a quite high cache miss ratio.
  • The good cache miss ratio of the chained hash table implementation is likely due to the first compartment being stored within the vector (see the sketch below).
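
A hedged sketch of the layout this refers to: the bucket vector holds the first compartment of each chain inline, so a lookup that hits it touches only the bucket's own cache line and follows no pointer. The struct is an illustrative assumption, not the actual implementation.

```cpp
#include <cstddef>
#include <vector>

// Sketch of storing the first compartment inside the bucket vector: a hit
// on the first compartment touches only the bucket's cache line; further
// compartments are ordinary heap nodes. Illustrative layout only.
struct Bucket {
    bool occupied = false;
    int  first_key = 0;                 // first compartment, stored inline
    struct Overflow { int key; Overflow* next; };
    Overflow* rest = nullptr;           // remaining compartments on the heap
};

bool contains(const std::vector<Bucket>& buckets, int key) {
    const Bucket& b = buckets[static_cast<std::size_t>(key) % buckets.size()];
    if (!b.occupied) return false;
    if (b.first_key == key) return true;            // no pointer chase
    for (const Bucket::Overflow* p = b.rest; p != nullptr; p = p->next)
        if (p->key == key) return true;             // cold path
    return false;
}
```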

SLIDE 12

[Graphs: Cache misses and cache miss ratio for lookup, existing and non-existing, of random ints. Series: google_dense, google_sparse, gnu, onetable, twotable, chained.]

SLIDE 13
  • When using data that is not uniformly distributed, the cache miss ratio is higher for all hash tables, and significantly higher for the linear hash tables.

[Graphs: Cache miss ratio for lookup, existing and non-existing, of genome-based ints. Series: google_dense, google_sparse, gnu, onetable, twotable, chained.]

SLIDE 14

Variability

  • The chained hash table exhibits the least variability. Again, the storing of the first element within the vector may be the reason.
  • Google sparse fluctuates a lot.

[Graph: Standard deviation of CPU cycle count per operation for lookup, existing of pointers to random strings. Series: google_dense, google_sparse, gnu, onetable, twotable, chained.]

SLIDE 15

[Graph: Standard deviation of CPU cycle count per operation for lookup, existing of pointers to filenames. Series: google_dense, google_sparse, gnu, onetable, twotable, chained.]

SLIDE 16

Memory allocation

  • The graphs are similar for the different data types.
  • The graphs from the amd64 machine are similar, but more memory is allocated.
  • The onetable hash table uses a constant amount of memory per element.
  • The small decline in memory allocated per element for the onetable implementation is due to duplicates in the genome data (one plausible measurement scheme is sketched below).
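
The slides do not say how allocation was measured; one plausible scheme, sketched here under that caveat, is to override the global allocation operator and keep a running total of bytes requested:

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

// One plausible way to obtain "allocated bytes per element" numbers (the
// slides do not describe the actual measurement): override global new and
// keep a running total of bytes requested during insertion.
static std::size_t g_allocated = 0;

void* operator new(std::size_t n) {
    g_allocated += n;                       // count every allocation request
    if (void* p = std::malloc(n)) return p;
    throw std::bad_alloc();
}
void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

// After inserting k elements into a table:
//   double bytes_per_element = double(g_allocated) / k;
```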

SLIDE 17

Max load factor 1

[Graphs: Allocated bytes per element for insertion of random ints and of genome-based ints. Series: gnu, onetable, onetable (linear fit), twotable, chained, chained (linear fit).]

Max load factor 5

  • All our implementations use less memory at a max load factor of 5 because fewer buckets need to be allocated.

SLIDE 18

[Graphs: Allocated bytes per element for insertion of random ints and of genome-based ints. Series: gnu, onetable, onetable (linear fit), twotable, chained, chained (linear fit).]

SLIDE 19

The insert operation

  • The sparse hash table uses a lot of time, especially on the genome-based integers.
  • The chained hash table does well on string data.
  • The Google hash tables lose their cache benefits when inserting strings.

[Graphs: Total CPU cycles for insertion of random ints and of genome-based ints. Series: google_dense, google_sparse, gnu, onetable, twotable, chained.]

SLIDE 20

[Graphs: Total CPU cycles for insertion of pointers to random strings and of pointers to filenames. Series: google_dense, google_sparse, gnu, onetable, twotable, chained.]

SLIDE 21

Cache

[Graphs: Cache miss ratio for insertion of random ints, genome-based ints, pointers to random strings, and pointers to filenames. Series: google_dense, google_sparse, gnu, onetable, twotable, chained.]

SLIDE 22

[Graphs: Cache misses for insertion of pointers to random strings and of pointers to filenames. Series: google_dense, google_sparse, gnu, onetable, twotable, chained.]

SLIDE 23

Variability

  • The linear hash tables show very little variability.

[Graphs: Standard deviation of CPU cycle count per operation for insertion of genome-based ints and of random ints. Series: google_dense, google_sparse, gnu, onetable, twotable, chained.]

SLIDE 24

The delete operation

  • Deletion is surprisingly efficient on the Google dense hash table.
  • The chained hash table is also very efficient.

[Graphs: Total CPU cycles for deletion of random ints and of genome-based ints. Series: google_dense, google_sparse, gnu, onetable, twotable, chained.]

SLIDE 25

[Graphs: Total CPU cycles for deletion of pointers to random strings and of pointers to filenames. Series: google_dense, google_sparse, gnu, onetable, twotable, chained.]

SLIDE 26

Variability

  • Sparse does very badly on genome-based data.

[Graphs: Standard deviation of CPU cycle count per operation for deletion of random ints and of genome-based ints. Series: google_dense, google_sparse, gnu, onetable, twotable, chained.]

SLIDE 27

[Graphs: Cache miss ratio for deletion of random ints and of genome-based ints. Series: google_dense, google_sparse, gnu, onetable, twotable, chained.]

SLIDE 28

Forward Iteration

[Graphs: Total CPU cycles for iteration of random ints and of genome-based ints, and standard deviation of CPU cycle count per operation for iteration of random ints. Series: google_dense, google_sparse, gnu, onetable, twotable, chained.]

SLIDE 29

[Graph: Standard deviation of CPU cycle count per operation for iteration of genome-based ints. Series: google_dense, google_sparse, gnu, onetable, twotable, chained.]