
Exercises: Special Case Search Structures
Design super-fast search structures for the following special cases:
a) Data consists of ints in the range 0 … 100
b) Data consists of chars in the range A-Z, a-z, 0-9
c) Data consists of a few dozen Strings, each of which has a unique last character
d) Data consists of a few dozen ints, most of which have a unique last two digits
e) Data consists of a few hundred Points, such that if you add the two coordinates modulo 1000 you get a mostly unique number in the range 0 … 999

Hash Tables & Dictionaries

Goal
• Arrays are too static
  – Don’t grow dynamically (but can use doubling trick)
  – Hard to insert/remove in the middle
  – Index has to be a natural number
• Recursive structures are too dynamic
  – Hard to enforce nice structure (e.g., balanced trees)
  – Complicated code
• Compromise:
  – Use an array most of the time
  – Allow arbitrary key for indexing
  – Make special cases for inserting/removing, if needed

Arrays using arbitrary keys
c) Data consists of a few dozen Strings, each of which has a unique last character
• Operations: insert(“Bandana”), search(“Bubba”), remove(“Banana”)
• To find the index: take the last char, convert to ASCII, then to an int in the range 0..255
• Array of String (indices 0 … 255), as drawn:
  – 97: (empty)
  – 98: “Baobob”
  – 99: “Blanc”
  – 100: (empty)
  – 101: “Bee”
• hash: verb. To chop into pieces, mince; to make a mess of, mangle
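The last-character indexing for case (c) can be sketched as follows. This is only an illustration: the class name LastCharTable and its method names are assumptions, and the sketch relies on the slide's premise that every stored string is non-empty and has a unique last character in the range 0..255.

```java
// Sketch for case (c): index a String array by the last character of each key.
// Assumes non-empty strings whose last characters are unique and lie in 0..255.
class LastCharTable {
    private final String[] slots = new String[256];

    // "Hash" a string by taking its last character as an int.
    private static int index(String s) {
        char last = s.charAt(s.length() - 1);
        if (last > 255) throw new IllegalArgumentException("last char out of range: " + s);
        return last;
    }

    void insert(String s) { slots[index(s)] = s; }

    boolean search(String s) { return s.equals(slots[index(s)]); }

    void remove(String s) {
        int i = index(s);
        if (s.equals(slots[i])) slots[i] = null;
    }

    public static void main(String[] args) {
        LastCharTable t = new LastCharTable();
        t.insert("Baobob");   // 'b' = 98
        t.insert("Blanc");    // 'c' = 99
        t.insert("Bee");      // 'e' = 101
        System.out.println(t.search("Bee"));    // true
        System.out.println(t.search("Bubba"));  // false: slot 97 ('a') is empty
    }
}
```

Every operation touches exactly one array slot, which is what makes this special case so fast.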

Hash Tables
• Generalized:
  – need “hash function” for data objects
  – need linked lists (or something) for dealing with collisions
• hash function: mangle it all up, then get an int (then do mod N)
[Diagram: do_something(data) goes through the hash function to an Array of Lists with buckets 0 … N-1; each bucket holds a list of zero or more objects (x, z, w, r, a, u in the drawing).]

Hash Table Algorithms
• Need a hash function that can convert from data objects (the “key”) to int (the “bucket number”)
  – Call it the hash code for the object
• Insert(obj):
  – compute hash code for obj
  – Append obj to list at that bucket
• Search(obj):
  – compute hash code for obj
  – Look for obj in list at that bucket
• Remove(obj):
  – compute hash code for obj
  – Remove obj from list at that bucket
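The three algorithms above can be sketched with an array of linked lists. This is a minimal illustration, not the course's actual SearchStructure code: the class name ChainedHashSet is an assumption, and the duplicate check in insert is an addition that the slide does not mention.

```java
import java.util.LinkedList;

// Minimal separate-chaining sketch; class and method names are illustrative only.
class ChainedHashSet<T> {
    private final LinkedList<T>[] buckets;

    @SuppressWarnings("unchecked")
    ChainedHashSet(int n) {
        buckets = new LinkedList[n];
        for (int i = 0; i < n; i++) buckets[i] = new LinkedList<>();
    }

    // Convert the object's hash code to a bucket number in 0..N-1.
    private int bucket(T obj) {
        return Math.floorMod(obj.hashCode(), buckets.length);
    }

    // Insert: compute the bucket, then append to the list there.
    void insert(T obj) {
        LinkedList<T> list = buckets[bucket(obj)];
        if (!list.contains(obj)) list.add(obj);   // assumption: do not store duplicates
    }

    // Search: compute the bucket, then look through the list there.
    boolean search(T obj) { return buckets[bucket(obj)].contains(obj); }

    // Remove: compute the bucket, then remove from the list there.
    boolean remove(T obj) { return buckets[bucket(obj)].remove(obj); }
}
```

For example, after `insert("Banana")` on a `ChainedHashSet<String>`, `search("Banana")` is true and `search("Bubba")` is false.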

Performance
• Wildly optimistic assumption:
  – Assume all hash codes turn out to be unique and in the range 0 … N-1
• Insert: O(1)
• Search: O(1)
• Remove: O(1)
Neat trick, huh?

Key Issues
• What should the hash function be?
• How big should the array be?
• How should we organize the linked lists?

Hash Functions
• Goal:
  – Every (different) object should get a different hash code
• Realistic Goal:
  – Different objects should be likely to get different hash codes
  – Collisions should be more or less random
  – Use the whole range 0 … N-1, evenly

Hash Functions: 1st attempts
• Data is student ID numbers:
  – hash(ID) = 55 (for any ID)
    • bad → hash table reverts to linked list behavior
  – hash(ID) = two most significant digits
    • bad → all data ends up in just a few buckets
  – hash(ID) = two least significant digits
    • pretty good → collisions happen sort of randomly (373333 and 375933 collide)
  – hash(ID) = add pairs of digits, mod 100
    • e.g., 375933 → (37 + 59 + 33) mod 100
    • better → even less of a pattern to the collisions
  – hash(ID) = square the number, then take middle digits
    • popular in practice … what is the pattern of collisions?
• Of course, we want it to scale to arbitrary N, though!
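Two of these attempts are easy to try out directly. The sketch below assumes non-negative int IDs; the class name IdHashes and its method names are made up for illustration.

```java
// Sketch of two hash attempts for student ID numbers (assumes non-negative IDs).
class IdHashes {
    // Two least significant digits.
    static int lastTwoDigits(int id) {
        return id % 100;
    }

    // Add successive pairs of digits, mod 100: 375933 -> (37 + 59 + 33) mod 100.
    static int addPairs(int id) {
        int sum = 0;
        while (id > 0) {
            sum += id % 100;   // take the lowest pair of digits
            id /= 100;         // drop that pair
        }
        return sum % 100;
    }

    public static void main(String[] args) {
        System.out.println(lastTwoDigits(373333));  // 33
        System.out.println(lastTwoDigits(375933));  // 33 (collides with the line above)
        System.out.println(addPairs(373333));       // 3  = (37 + 33 + 33) mod 100
        System.out.println(addPairs(375933));       // 29 = (37 + 59 + 33) mod 100
    }
}
```

The two IDs that collide under the last-two-digits hash land in different buckets once the middle digits also contribute.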

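Example: Multiplicative Hash Function
• Suppose N is a power of 2 (say N = 1024)
• Suppose keys are ints in range 0 … 32767
• Use H(k) = (((32768*0.6125423371*k)%32768)%1024)
  – Where do these constants come from?
• Try it out:
    // H(k) as defined above; the result is truncated to an int bucket number.
    static int H(int k) {
        return (int) (((32768 * 0.6125423371 * k) % 32768) % 1024);
    }

    // Count how many keys land in each of the 1024 buckets.
    int[] histogram = new int[1024];
    for (int k = 0; k < 32768; k++) {
        int bucket = H(k);
        histogram[bucket]++;
    }
    // Print one line per bucket: bucket number, then its count.
    for (int i = 0; i < 1024; i++)
        System.out.println(i + " " + histogram[i]);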

Output of the histogram program (bucket number, then count), first 28 buckets:
  0 34    1 33    2 32    3 32    4 32    5 32    6 31
  7 32    8 32    9 34   10 32   11 31   12 30   13 33
 14 33   15 32   16 32   17 33   18 32   19 30   20 31
 21 33   22 34   23 32   24 32   25 30   26 32   27 32
 …
[Figure: histogram with x-axis “Collisions per bucket” (29 to 34) and y-axis “Count” (0 to 450). Annotations: most buckets had 32 collisions; bucket 25 had 30 collisions.]

What if we use 0.61254 instead?
[Figure: two histograms with x-axis “Collisions per bucket” and y-axis “Count”. The chart for 0.6125423371 has x-axis ticks from 29 to 34; the chart for 0.61254 has x-axis ticks from 30 to 44.]
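One way to rerun this comparison is to make the multiplier a parameter of H(k) and build the histogram for each value. The sketch below is an assumption about how the experiment could be set up, not the code used to produce the slides; the class name MultiplicativeHashDemo and its method names are made up.

```java
// Sketch: compare bucket counts for two multipliers in H(k) = ((32768*A*k) % 32768) % 1024.
class MultiplicativeHashDemo {
    // H(k) from the example, with the multiplier A passed in.
    static int hash(int k, double a) {
        return (int) (((32768 * a * k) % 32768) % 1024);
    }

    // Count how many of the keys 0..32767 land in each of the 1024 buckets.
    static int[] histogram(double a) {
        int[] counts = new int[1024];
        for (int k = 0; k < 32768; k++) counts[hash(k, a)]++;
        return counts;
    }

    public static void main(String[] args) {
        for (double a : new double[] {0.6125423371, 0.61254}) {
            int[] counts = histogram(a);
            int min = Integer.MAX_VALUE;
            int max = 0;
            for (int c : counts) {
                min = Math.min(min, c);
                max = Math.max(max, c);
            }
            // A narrow min..max range means the keys are spread evenly over the buckets.
            System.out.println(a + ": min=" + min + " max=" + max);
        }
    }
}
```

With 32768 keys and 1024 buckets, a perfectly even spread would put 32 keys in every bucket, so the gap between min and max is a quick measure of how well each multiplier does.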

Hash Functions in Practice
• Need to handle arbitrary object types
  – Strings, Integers, doubles, ChessGames, etc.
  – Each type has a different way of getting an integer
• Need to convert from int to bucket #
• Two step process: for arbitrary object x
  – (1) Call x.hashCode() to get an integer k
  – (2) Use k % N as the bucket number (in Java, k may be negative, so the result must be adjusted to land in 0 … N-1)
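The two-step process can be sketched in a few lines. One detail worth showing: hashCode() may return a negative int, and Java's % keeps the sign of the dividend, so this sketch uses Math.floorMod to stay in the range 0 … N-1. The class name BucketIndex and the method name bucketFor are illustrative.

```java
// Sketch of the two-step process: hashCode() first, then reduce to a bucket number.
class BucketIndex {
    static int bucketFor(Object x, int n) {
        int k = x.hashCode();          // step 1: get an integer from the object
        return Math.floorMod(k, n);    // step 2: reduce to 0..n-1, even when k is negative
    }

    public static void main(String[] args) {
        System.out.println(bucketFor(42, 10));   // 2
        System.out.println(-7 % 5);              // -2: plain % can go negative
        System.out.println(bucketFor(-7, 5));    // 3
    }
}
```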

public int hashCode()
• Instance method, defined in java.lang.Object
• Computes the int hash code of this object
• Specification:
  – Must return the same value repeatedly for the same object
  – Two objects that are equal according to equals() must have the same hash code
  – Otherwise, they should have different hash codes
• Easy examples:
  – Integer, Short, Byte → just returns value as an int
  – Float → returns the bit pattern of the value as an int
  – Double → folds the 64-bit pattern of the value into an int (XOR of its two 32-bit halves)
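A small class that obeys this specification might look like the sketch below. GridPoint is a hypothetical class invented for illustration; java.util.Objects.hash is one convenient way to combine fields, not the only one.

```java
import java.util.Objects;

// Hypothetical value class: equals() and hashCode() use the same fields,
// so equal objects always get equal hash codes.
final class GridPoint {
    private final int x;
    private final int y;

    GridPoint(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof GridPoint)) return false;
        GridPoint p = (GridPoint) o;
        return x == p.x && y == p.y;
    }

    @Override
    public int hashCode() {
        return Objects.hash(x, y);   // combines both fields into one int
    }
}
```

Because the fields are final, the hash code of a GridPoint never changes after construction, which satisfies the "same value repeatedly" requirement.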

hashCode() for non-primitives
• Strings
  – Compute based on each character (arithmetic) as: s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
• Other user-defined types:
  – Usually based on String.hashCode(), Integer.hashCode(), etc.
  – You have to write your own hashCode for your own classes! Obey the contract…
• Default:
  – Dynamic binding resolves to Object.hashCode()
  – Typically derived from the object’s identity (historically described as its address in memory)
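The String formula can be evaluated with Horner's rule, one character at a time. The sketch below recomputes it by hand and compares against String.hashCode(); the class name StringHashDemo and the method name polyHash are illustrative.

```java
// Sketch: evaluate s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] with Horner's rule.
class StringHashDemo {
    static int polyHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);   // int overflow simply wraps around
        }
        return h;
    }

    public static void main(String[] args) {
        String s = "Banana";
        // Both values agree, since String.hashCode() is specified by the same formula.
        System.out.println(polyHash(s) == s.hashCode());   // true
    }
}
```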

Perfect Hashing
• If we knew what the keys were ahead of time, we could design a perfect hash function
  – Each key gets a different hash code
  – Hash codes start at 0, count upwards with no gaps
• When is this used?
  – Compiler: keywords are known ahead of time:
    • public = 0, private = 1, protected = 2, void = 3, int = 4, …
    • Hash table lookup is much faster than String.equals
    • Compiler never does String.equals on keywords any more
  – if (keywordTable.search(someword)) { … } else { … }

Hash Table Worst-Case Performance
• If hash codes are terribly chosen:
  – All buckets have 0 objects
  – Except one bucket that has all the objects
  – Revert to linked-list behavior
  – get: O(n)
  – put: O(1)
• If hash codes are perfectly chosen:
  – Each bucket has exactly the same number of objects
    • N buckets, D objects → λ = D/N objects per bucket
  – get: O(λ)
  – put: O(1)

Load Factor: λ
• We usually assume that hash codes are randomly distributed
  – gives close to “perfectly chosen” performance
  – statistics & probability can tell you how many collisions to expect, what the graph would look like, etc.
    • won’t get exactly λ = D / N objects per bucket
    • but will get λ ± 33% or so with very high probability
• So performance is:
  – get: O(λ) where λ = D/N = # objects / # buckets
  – put: O(1)
• Want no more than one object per bucket
  – Want λ = 1 so let # buckets = # objects
  – Or better yet, use λ = 0.75 so let # buckets = # objects * 1.33

Controlling Load Factor
• Want to keep λ = 0.75, but don’t know # objects a priori
• Use array doubling trick again
  – These tricks are used and reused and reused again
• Gives in practice:
  – O(1) performance for all operations
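The doubling trick for a chained table can be sketched as follows: before an insert would push the load factor above 0.75, double the number of buckets and rehash everything. The class name ResizingHashSet, the initial size of 4 buckets, and the helper names are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Sketch of a chained hash set that keeps its load factor at or below 0.75 by doubling.
class ResizingHashSet<T> {
    private static final double MAX_LOAD = 0.75;
    private List<LinkedList<T>> buckets = newBuckets(4);   // assumed initial size
    private int size = 0;

    private static <E> List<LinkedList<E>> newBuckets(int n) {
        List<LinkedList<E>> b = new ArrayList<>(n);
        for (int i = 0; i < n; i++) b.add(new LinkedList<>());
        return b;
    }

    private LinkedList<T> bucketFor(T obj, List<LinkedList<T>> b) {
        return b.get(Math.floorMod(obj.hashCode(), b.size()));
    }

    boolean search(T obj) { return bucketFor(obj, buckets).contains(obj); }

    void insert(T obj) {
        if (search(obj)) return;                               // already present
        if (size + 1 > MAX_LOAD * buckets.size()) grow();      // would exceed the load factor
        bucketFor(obj, buckets).add(obj);
        size++;
    }

    // Double the bucket count and rehash every stored object into the new array.
    private void grow() {
        List<LinkedList<T>> bigger = newBuckets(2 * buckets.size());
        for (LinkedList<T> list : buckets)
            for (T obj : list) bucketFor(obj, bigger).add(obj);
        buckets = bigger;
    }

    int size() { return size; }
    int bucketCount() { return buckets.size(); }
}
```

Each grow() costs time proportional to the number of stored objects, but because the table doubles, that cost is spread over many cheap inserts.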

Hash Table Flavors
• So far we have hash tables with separate chaining
• Could do hash table with open-addressing
  – Don’t chain, instead try multiple buckets
  – Start with bucket h, generate sequence of bucket #’s
• Example: linear probing
  – Start with bucket h, then try h+1, h+2, …
  – Analysis gets tricky, but no extra memory allocation
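Linear probing can be sketched for int keys as below. This is a simplified illustration: the class name LinearProbingSet is made up, the table has a fixed capacity, and removal is left out because deleting from an open-addressed table needs extra care (for example, marking slots as deleted) that the slide does not cover.

```java
// Sketch of open addressing with linear probing for int keys (insert and search only).
class LinearProbingSet {
    private final Integer[] slots;   // null means "empty slot"

    LinearProbingSet(int capacity) {
        slots = new Integer[capacity];
    }

    // Start at bucket h, then try h+1, h+2, ... (wrapping around).
    // Returns the slot holding key, or the first empty slot found, or -1 if the table is full.
    private int probe(int key) {
        int n = slots.length;
        int h = Math.floorMod(Integer.hashCode(key), n);
        for (int i = 0; i < n; i++) {
            int j = (h + i) % n;
            if (slots[j] == null || slots[j] == key) return j;
        }
        return -1;
    }

    boolean insert(int key) {
        int j = probe(key);
        if (j < 0) return false;   // no free slot left
        slots[j] = key;
        return true;
    }

    boolean search(int key) {
        int j = probe(key);
        return j >= 0 && slots[j] != null;
    }
}
```

With capacity 8, the keys 3, 11, and 19 all start at bucket 3, so they end up in slots 3, 4, and 5; no list nodes are allocated.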

Hash Table Summary
• Supports SearchStructure interface
  – insert, delete, search all in O(1) “expected” worst case
  – “expected” = if we assume hash codes are random, then with very high probability
• What else can we do with them?

Dictionaries
• Dictionary structure is a set of <key, value> pairs
• Want fast search for value given a key
• Easy with existing search structure interface:
  – Define a class Pair to store key and value
  – Define equals that only compares key half of two pairs
  – Define compareTo that only compares key half of pairs
    • Used in the BST search structure
  – Define hashCode that only hashes key half of a pair
    • Used in the hash table search structure
• Better yet, modify BST and hash table to maintain key and value
  – Will use inner classes BST.KeyValuePair and HashTable.KeyValuePair
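The Pair idea can be sketched with a pair class whose equals and hashCode look only at the key, stored in a small chained table. The names KeyOnlyPair and TinyDictionary are invented for this illustration and are not the course's BST.KeyValuePair or HashTable.KeyValuePair classes; null keys are not supported.

```java
import java.util.LinkedList;

// A pair whose equals() and hashCode() consider only the key half.
class KeyOnlyPair<K, V> {
    final K key;
    V value;

    KeyOnlyPair(K key, V value) {
        this.key = key;
        this.value = value;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof KeyOnlyPair && key.equals(((KeyOnlyPair<?, ?>) o).key);
    }

    @Override
    public int hashCode() {
        return key.hashCode();
    }
}

// A tiny chained hash table of pairs, used as a dictionary.
class TinyDictionary<K, V> {
    private final LinkedList<KeyOnlyPair<K, V>>[] buckets;

    @SuppressWarnings("unchecked")
    TinyDictionary(int n) {
        buckets = new LinkedList[n];
        for (int i = 0; i < n; i++) buckets[i] = new LinkedList<>();
    }

    private LinkedList<KeyOnlyPair<K, V>> bucketFor(KeyOnlyPair<K, V> pair) {
        return buckets[Math.floorMod(pair.hashCode(), buckets.length)];
    }

    void put(K key, V value) {
        KeyOnlyPair<K, V> probe = new KeyOnlyPair<>(key, value);
        LinkedList<KeyOnlyPair<K, V>> list = bucketFor(probe);
        int i = list.indexOf(probe);            // indexOf uses the key-only equals
        if (i >= 0) list.get(i).value = value;  // key already present: overwrite the value
        else list.add(probe);
    }

    // Returns the value stored for key, or null if the key is absent.
    V get(K key) {
        KeyOnlyPair<K, V> probe = new KeyOnlyPair<>(key, null);
        LinkedList<KeyOnlyPair<K, V>> list = bucketFor(probe);
        int i = list.indexOf(probe);
        return i >= 0 ? list.get(i).value : null;
    }
}
```

Searching with a probe pair that carries only the key works because the value half never takes part in equals or hashCode.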

HashSet and HashMap
• Java provides HashSet as a search structure
  – Uses separate chained hashing, as we did above
  – Aims for load factor λ = 0.75, and doubles size when exceeded, as we did
  – Uses x.hashCode() instance method, as we did
• Java provides HashMap as a dictionary structure
  – Similar, but maintains key and value pairs
  – Slightly different interface:
    • search by key (fast)
    • search by value (slow)
    • enumerate keys
    • enumerate pairs
    • enumerate values
    • etc.
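A short usage sketch of the two library classes follows. The class name JavaHashDemo and the helper wordLengths are made up for illustration; the java.util calls themselves (add, contains, put, get, containsValue, keySet, values, entrySet) are standard.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class JavaHashDemo {
    // Build a dictionary from each word to its length (an arbitrary example value).
    static Map<String, Integer> wordLengths(String... words) {
        Map<String, Integer> lengths = new HashMap<>();
        for (String w : words) lengths.put(w, w.length());
        return lengths;
    }

    public static void main(String[] args) {
        Set<String> set = new HashSet<>();
        set.add("Baobob");
        set.add("Blanc");
        System.out.println(set.contains("Blanc"));   // true
        System.out.println(set.contains("Bubba"));   // false

        Map<String, Integer> lengths = wordLengths("Baobob", "Blanc", "Bee");
        System.out.println(lengths.get("Bee"));          // 3: search by key (fast)
        System.out.println(lengths.containsValue(5));    // true: search by value (slow, scans entries)

        // Enumeration order of a HashMap is unspecified.
        for (String key : lengths.keySet()) System.out.println(key);       // enumerate keys
        for (int value : lengths.values()) System.out.println(value);      // enumerate values
        for (Map.Entry<String, Integer> e : lengths.entrySet())            // enumerate pairs
            System.out.println(e.getKey() + " -> " + e.getValue());
    }
}
```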
