SLIDE 1

The hash tables

  • Google’s dense and sparse hash tables

– Use open addressing
– Quadratic probing (see the sketch after this list)

  • SGI

– Is a chained hash table
– Referred to as gnu in the graphs

  • One-table and Two-table

– Doubly linked: Rather large compartments
– Buckets: Two pointers delimit a section of a circular list
– One-table uses an alternative vector implementation
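
To make the probing concrete, here is a minimal sketch of lookup under open addressing with quadratic probing; the triangular probe sequence and all names are illustrative assumptions, not Google's actual code.

```cpp
#include <cstddef>
#include <vector>

// Minimal sketch of lookup under open addressing with quadratic probing.
// The probe offsets 1, 2, 3, ... accumulate into triangular steps
// h, h+1, h+3, h+6, ..., which visit every slot when the table size is a
// power of two. Illustrative only, not Google's actual implementation.
struct ProbingTable {
    enum State { EMPTY, FULL, DELETED };
    struct Slot { int key = 0; State state = EMPTY; };
    std::vector<Slot> slots;                       // size is a power of two

    explicit ProbingTable(std::size_t n) : slots(n) {}

    bool contains(int key) const {
        const std::size_t mask = slots.size() - 1;
        std::size_t i = static_cast<std::size_t>(key) & mask;
        for (std::size_t step = 1; ; ++step) {
            const Slot& s = slots[i];
            if (s.state == EMPTY) return false;    // never-used slot: key absent
            if (s.state == FULL && s.key == key) return true;
            i = (i + step) & mask;                 // quadratic probe
        }
    }
};
```

With a max load factor of 0.8 (slide 7) there is always an empty slot, so the probe loop terminates.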

SLIDE 2
  • Chained

– Singly linked: small compartments
– Lookup uses a tight inner loop by copying the sought key into a sentinel (see the sketch after this list)

  • The same hash functions were used for all hash tables: a string-table lookup method for string data and a division method for integer data.
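
Here is a hedged sketch of the sentinel trick mentioned above, combined with the division method for integer keys; the node layout is an assumption for illustration.

```cpp
#include <cstddef>
#include <vector>

// Sketch of the sentinel trick for the chained table's lookup: every chain
// ends at a shared sentinel node, and the sought key is copied into the
// sentinel before the scan, so the inner loop contains a single test.
// Layout and names are illustrative assumptions.
struct Node { int key; Node* next; };

struct ChainedTable {
    std::vector<Node*> buckets;   // empty chains point directly at sentinel
    Node sentinel{0, nullptr};

    bool contains(int key) {
        sentinel.key = key;                        // guarantees termination
        Node* p = buckets[static_cast<std::size_t>(key) % buckets.size()];
        while (p->key != key) p = p->next;         // tight single-test loop
        return p != &sentinel;                     // real node or sentinel?
    }
};
```

Because the sentinel always matches, the scan needs no end-of-chain test; the final pointer comparison distinguishes a real hit from the sentinel.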

SLIDE 3

The machines

  • 32-bit Intel machines at DIKU, without PAPI
  • A 64-bit AMD machine with PAPI extensions
SLIDE 4

Kinds of benchmarks

  • Memory allocated

– Not measured for the Google tables

  • Timing

– CPU time and total CPU cycles
– Variability measured as std. deviation
– The CPU time and CPU cycle graphs are very alike, so CPU cycles are used.

  • Cache behaviour

– Number of L1 cache misses.
– L1 cache miss ratio: the percentage of cache accesses that are misses, i.e. 100 · misses / accesses (see the sketch below).
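
A rough sketch of these two metric kinds, assuming the x86 time-stamp counter for cycle counts and hypothetical counter values for the miss ratio (the thesis read real counts via PAPI on the AMD machine):

```cpp
#include <cstdint>
#include <cstdio>
#include <x86intrin.h>   // __rdtsc (GCC/Clang, x86 only)

// Rough illustration of the two metric kinds above: total CPU cycles read
// from the time-stamp counter, and the L1 miss ratio computed from two
// counter values. The thesis read real counts via PAPI; the miss/access
// numbers here are hypothetical placeholders.
int main() {
    std::uint64_t start = __rdtsc();
    volatile long sink = 0;
    for (long i = 0; i < 1000000; ++i) sink = sink + i;  // stand-in workload
    std::uint64_t cycles = __rdtsc() - start;

    std::uint64_t misses = 12345, accesses = 1000000;    // hypothetical counts
    double ratio = 100.0 * static_cast<double>(misses) / accesses;
    std::printf("cycles = %llu, L1 miss ratio = %.2f%%\n",
                static_cast<unsigned long long>(cycles), ratio);
    return 0;
}
```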

SLIDE 5

Benchmarks skipped

[Table: for each operation (insert; lookup, existing; lookup, non-existing; erase; iterate fwd.; iterate bwd.) and each data set (random and genome-based integers; random words and paths as by-reference strings; random words and paths as by-value strings), the table marks which benchmarks were skipped: CPU time, memory allocation, response time, or cache.]

  • The measurements using “by value” string data are not included here because they were problematic.

SLIDE 6

Data used for the benchmarks

  • Random strings (10 bytes) and integers reflect the behavior of the data structures under a perfect distribution of input data.
  • Ordered data: gene sequences, words, and the output of the locate command.
  • Realism of data: range 100,000 – 800,000 elements.
  • Data was loaded into memory from a file and then scanned.

SLIDE 7
  • The maximum load factor

– Google: 0.8
– SGI (gnu on graphs): 1.0
– Us: initially 5.0, then 1.0; we focus on a max load factor of 1.0

  • Timing

– The linear hash tables all save the hash value, which saves time on string data (see the sketch after this list).
– Iteration on the linear hash tables should be efficient because of the circular chain of elements.
– The chained hash table uses a tight inner loop containing only one test. The first compartment in the chain is stored in the vector.
– Google has a cache advantage in using open addressing.
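
A minimal sketch of the stored-hash optimisation from the first timing item, assuming a simple chain of entries; names are illustrative, not the one-table or two-table code.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Sketch of the stored-hash optimisation: each entry keeps its full hash,
// so string lookups compare the cheap integer first and only compare the
// strings on a hash match. Illustrative only, not the one-/two-table code.
struct Entry { std::size_t hash; std::string key; };

bool chain_contains(const std::vector<Entry>& chain,
                    std::size_t h, const std::string& key) {
    for (const Entry& e : chain)
        if (e.hash == h && e.key == key)   // string compared only on hash hit
            return true;
    return false;
}
```

The stored hash also lets a resize redistribute elements without rehashing the strings, at the cost of the extra word per element noted on slide 8.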

SLIDE 8
  • Allocation

– Saving the hash value takes more memory.
– Doubly linked lists vs. singly linked lists.
– Two tables vs. one table.
– Alternative allocation scheme used by one-table.

SLIDE 9

The Lookup operation

  • Lookup, non-existing is called for each insert.
  • Lookup, existing is called for each delete.
  • Saving the hash value saves time on lookup of string data.
  • Lookup, non-existing: each odd entry was inserted, each even entry was looked up (see the sketch below).
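
A sketch of that lookup, non-existing protocol, with std::unordered_set standing in for the benchmarked tables (an assumption for illustration):

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Sketch of the lookup, non-existing protocol: entries at odd positions
// are inserted, entries at even positions are then looked up, so
// essentially every probe misses. std::unordered_set is a stand-in for
// the tables actually benchmarked.
std::size_t count_hits(const std::vector<int>& data) {
    std::unordered_set<int> table;
    for (std::size_t i = 1; i < data.size(); i += 2)
        table.insert(data[i]);                 // odd entries inserted
    std::size_t hits = 0;
    for (std::size_t i = 0; i < data.size(); i += 2)
        hits += table.count(data[i]);          // even entries probed
    return hits;                               // ~0 for distinct data
}
```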

SLIDE 10

[Graphs: Total CPU cycles for lookup, existing of random ints, and of genome-based ints. Series: google_dense, google_sparse, gnu, onetable, twotable, chained.]

  • One-table and two-table are very similar.
  • There is a factor-10 difference between the y-axes of the two graphs.
  • A non-uniform distribution particularly affects the sparse hash table.

SLIDE 11

Cache

  • Few cache misses with open addressing.
  • Lookup of non-existing elements causes more cache misses because more elements are traversed on average.
  • The linear hash tables have a quite high cache miss ratio.
  • The good cache miss ratio of the chained hash table implementation is likely due to the first compartment being stored within the vector (see the sketch below).
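
A hedged sketch of the layout this refers to: the bucket vector holds the first compartment of each chain inline, so a lookup that hits it touches only the bucket's own cache line and follows no pointer. The struct is an illustrative assumption, not the actual implementation.

```cpp
#include <cstddef>
#include <vector>

// Sketch of storing the first compartment inside the bucket vector: a hit
// on the first compartment touches only the bucket's cache line; further
// compartments are ordinary heap nodes. Illustrative layout only.
struct Bucket {
    bool occupied = false;
    int  first_key = 0;                 // first compartment, stored inline
    struct Overflow { int key; Overflow* next; };
    Overflow* rest = nullptr;           // remaining compartments on the heap
};

bool contains(const std::vector<Bucket>& buckets, int key) {
    const Bucket& b = buckets[static_cast<std::size_t>(key) % buckets.size()];
    if (!b.occupied) return false;
    if (b.first_key == key) return true;            // no pointer chase
    for (const Bucket::Overflow* p = b.rest; p != nullptr; p = p->next)
        if (p->key == key) return true;             // cold path
    return false;
}
```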

SLIDE 12

[Graphs: Cache misses and cache miss ratio for lookup, existing and non-existing, of random ints. Series: google_dense, google_sparse, gnu, onetable, twotable, chained.]

SLIDE 13
  • When using data that is not uniformly distributed, the cache miss ratio is higher for all hash tables, and significantly higher for the linear hash tables.

[Graphs: Cache miss ratio for lookup, existing and non-existing, of genome-based ints. Series: google_dense, google_sparse, gnu, onetable, twotable, chained.]

SLIDE 14

Variability

  • The chained hash table exhibits the least variability. Again, the storing of the first element within the vector may be the reason.
  • Google sparse fluctuates a lot.

[Graph: Standard deviation of CPU cycle count per operation for lookup, existing of pointers to random strings. Series: google_dense, google_sparse, gnu, onetable, twotable, chained.]

SLIDE 15

[Graph: Standard deviation of CPU cycle count per operation for lookup, existing of pointers to filenames. Series: google_dense, google_sparse, gnu, onetable, twotable, chained.]

SLIDE 16

Memory allocation

  • The graphs are similar for the different data types.
  • The graphs from the amd64 machine are similar, but more memory is allocated.
  • The onetable hash table uses a constant amount of memory per element.
  • The small decline in memory allocated per element for the onetable implementation is due to duplicates in the genome data (one plausible measurement scheme is sketched below).
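
The slides do not say how allocation was measured; one plausible scheme, sketched here under that caveat, is to override the global allocation operator and keep a running total of bytes requested:

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

// One plausible way to obtain "allocated bytes per element" numbers (the
// slides do not describe the actual measurement): override global new and
// keep a running total of bytes requested during insertion.
static std::size_t g_allocated = 0;

void* operator new(std::size_t n) {
    g_allocated += n;                       // count every allocation request
    if (void* p = std::malloc(n)) return p;
    throw std::bad_alloc();
}
void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

// After inserting k elements into a table:
//   double bytes_per_element = double(g_allocated) / k;
```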

SLIDE 17

Max load factor 1

[Graphs: Allocated bytes per element for insertion of random ints and of genome-based ints. Series: gnu, onetable, onetable (linear fit), twotable, chained, chained (linear fit).]

Max load factor 5

  • All our implementations use less memory at a max load factor of 5 because fewer buckets need to be allocated.

SLIDE 18

[Graphs: Allocated bytes per element for insertion of random ints and of genome-based ints. Series: gnu, onetable, onetable (linear fit), twotable, chained, chained (linear fit).]

SLIDE 19

The insert operation

  • The sparse hash table uses a lot of time, especially on the genome-based integers.
  • The chained hash table does well on string data.
  • The Google hash tables lose their cache benefits when inserting strings.

[Graphs: Total CPU cycles for insertion of random ints and of genome-based ints. Series: google_dense, google_sparse, gnu, onetable, twotable, chained.]

SLIDE 20

[Graphs: Total CPU cycles for insertion of pointers to random strings and of pointers to filenames. Series: google_dense, google_sparse, gnu, onetable, twotable, chained.]

SLIDE 21

Cache

[Graphs: Cache miss ratio for insertion of random ints, genome-based ints, pointers to random strings, and pointers to filenames. Series: google_dense, google_sparse, gnu, onetable, twotable, chained.]

SLIDE 22

[Graphs: Cache misses for insertion of pointers to random strings and of pointers to filenames. Series: google_dense, google_sparse, gnu, onetable, twotable, chained.]

SLIDE 23

Variability

  • The linear hash tables show very little variability.

[Graphs: Standard deviation of CPU cycle count per operation for insertion of genome-based ints and of random ints. Series: google_dense, google_sparse, gnu, onetable, twotable, chained.]

SLIDE 24

The delete operation

  • Deletion is surprisingly efficient on the Google dense hash table.
  • The chained hash table is also very efficient.

[Graphs: Total CPU cycles for deletion of random ints and of genome-based ints. Series: google_dense, google_sparse, gnu, onetable, twotable, chained.]

SLIDE 25

[Graphs: Total CPU cycles for deletion of pointers to random strings and of pointers to filenames. Series: google_dense, google_sparse, gnu, onetable, twotable, chained.]

SLIDE 26

Variability

  • Sparse does very badly on genome-based data.

[Graphs: Standard deviation of CPU cycle count per operation for deletion of random ints and of genome-based ints. Series: google_dense, google_sparse, gnu, onetable, twotable, chained.]

SLIDE 27

[Graphs: Cache miss ratio for deletion of random ints and of genome-based ints. Series: google_dense, google_sparse, gnu, onetable, twotable, chained.]

SLIDE 28

Forward Iteration

[Graphs: Total CPU cycles for iteration of random ints and of genome-based ints, and standard deviation of CPU cycle count per operation for iteration of random ints. Series: google_dense, google_sparse, gnu, onetable, twotable, chained.]

SLIDE 29

[Graph: Standard deviation of CPU cycle count per operation for iteration of genome-based ints. Series: google_dense, google_sparse, gnu, onetable, twotable, chained.]