3134 Data Structures in Java, Lecture 13, Mar 7 2007, Shlomo Hershkop (PowerPoint presentation)



Slide 1

3134 Data Structures in Java

Lecture 13, Mar 7 2007, Shlomo Hershkop

Slide 2

Announcements

Done grading midterms.

Reading: chapters on hash tables and sorting (basics).

Slide 3

Outline

  • Hash DS: overview, collisions, applications
  • Sorting: basics, more complicated algorithms

Slide 4

Hash Table DS

This data structure is for organizing an unordered set of items.

It supports the following operations in O(1) average time: find, insert, delete.
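These operations can be tried directly with Java's built-in hash table, `java.util.HashMap` (a quick sketch, not the table we will build ourselves):

```java
import java.util.HashMap;
import java.util.Map;

public class HashTableDemo {
    public static void main(String[] args) {
        Map<String, Integer> table = new HashMap<>();

        table.put("apple", 3);                          // insert
        table.put("pear", 5);
        System.out.println(table.get("apple"));         // find: prints 3
        table.remove("pear");                           // delete
        System.out.println(table.containsKey("pear"));  // prints false
    }
}
```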

Slide 5

Comparison of average runtime

Best tree: AVL
  • find, insert, delete: O(log N)

Hash table
  • find, insert, delete: O(1) average

Slide 6

Hash Function

A mapping function between items and locations (array indices) in the hash table DS.

Examples

Slide 7

Issues

What hash function to use ? What do you do about collisions??

Slide 8

Example

Let's say you need a dictionary. For each word, insert it into the hash table.
  • runtime?

When I need to look up a word, call find on the hash table.
  • runtime?

Slide 9

hash functions

The truth is that hash functions should be based on the data.

Let's step through some examples.

Slide 10

Option 1: integral keys

If the items are numbers, we can use them directly to compute the hash:

Hash(key) = key % tableSize

Example. Question: why not use randomness to make sure to avoid collisions?
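A minimal sketch of this option in Java. `Math.floorMod` keeps the result in range even for negative keys, a safeguard beyond what the slide shows:

```java
public class IntHash {
    // Hash(key) = key % tableSize, kept in [0, tableSize)
    static int hash(int key, int tableSize) {
        return Math.floorMod(key, tableSize);
    }

    public static void main(String[] args) {
        System.out.println(hash(1234, 101));  // prints 22
        System.out.println(hash(-7, 101));    // prints 94, not -7
    }
}
```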

Slide 11

Option 2: String key

Hash(key) = sum of the ASCII values of the characters
Hash("abc") = 97 + 98 + 99 = 294

Any idea if this will work?
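A sketch of the ASCII-sum hash, which also shows one of its weaknesses: anagrams always collide.

```java
public class SumHash {
    // Hash(key) = sum of character codes, mod the table size.
    static int hash(String key, int tableSize) {
        int sum = 0;
        for (int i = 0; i < key.length(); i++) {
            sum += key.charAt(i);
        }
        return sum % tableSize;
    }

    public static void main(String[] args) {
        System.out.println(hash("abc", 40000));  // prints 294
        // Anagrams collide: "cab" also sums to 294.
        System.out.println(hash("cab", 40000));  // prints 294
    }
}
```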

Slide 12

Counter example:

Dictionary with table size 40,000: what is the maximum word size? What would be the maximum value returned by the hash?

Slide 13

Option 3: power

Let's add some spread to the summation:

Hash(key) = key[0] * 26^0 + key[1] * 26^1 + ... + key[i] * 26^i

Slide 14

issues

Because of the non-uniform distribution of characters in the English language, only 28% of your table will actually be reached.

Collisions!

Slide 15

Option 4: Adjusted power

Hash(key) = (key[0] * 37^0 + key[1] * 37^1 + ... + key[i] * 37^i) % tableSize

  • need to make sure the result is positive
  • Java uses powers of 31 (31^i)
  • performs well on general strings
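A sketch of the adjusted power hash, evaluated with Horner's rule (the exponents end up reversed relative to the slide's formula, which doesn't matter for spreading keys). Masking off the sign bit keeps the index positive even after integer overflow:

```java
public class PolyHash {
    static int hash(String key, int tableSize) {
        int h = 0;
        for (int i = 0; i < key.length(); i++) {
            h = 37 * h + key.charAt(i);   // Horner's rule for the 37^i sum
        }
        return (h & 0x7fffffff) % tableSize;  // force a non-negative index
    }

    public static void main(String[] args) {
        System.out.println(hash("abc", 10007));  // prints 6427
    }
}
```

Java's own `String.hashCode` works the same way but multiplies by 31 instead of 37.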

Slide 16

OK, so now we know how to get things into the table.

What do you do when 2 things map to the same array location?

Slide 17

Option 1: Separate Chaining

At each array location, keep a linked list.
  • How would the insert into the linked list work?
  • How do you perform a find on the hash table?
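A minimal separate-chaining sketch (class and method names are my own): hash to a bucket, then do list operations inside it.

```java
import java.util.LinkedList;
import java.util.List;

public class ChainedHashSet {
    private final List<String>[] buckets;   // one linked list per slot

    @SuppressWarnings("unchecked")
    public ChainedHashSet(int size) {
        buckets = new LinkedList[size];
        for (int i = 0; i < size; i++) {
            buckets[i] = new LinkedList<>();
        }
    }

    private int index(String key) {
        return (key.hashCode() & 0x7fffffff) % buckets.length;
    }

    // Insert: hash to a bucket, then add to its list if not already there.
    public void insert(String key) {
        List<String> bucket = buckets[index(key)];
        if (!bucket.contains(key)) {
            bucket.add(key);
        }
    }

    // Find: hash to a bucket, then scan its (hopefully short) list.
    public boolean find(String key) {
        return buckets[index(key)].contains(key);
    }
}
```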

Slide 18

Option 2: open addressing

If a collision occurs, we will try to find an alternate cell in the array to store the item.

Let's see how this works.

Slide 19

strategy

First try hash(x); if that cell is full, try (hash(x) + f(i)) % tableSize for i = 1, 2, ...

f is used to move around the array to find a location to use.

Different options for f; any ideas?

Slide 20

Linear probing

f(i) = i

Example. Can you think of any issues?
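A linear-probing sketch with f(i) = i (names invented; deletion is deliberately left out because it needs lazy deletion, covered later):

```java
public class ProbingHashSet {
    private final String[] cells;

    public ProbingHashSet(int size) {
        cells = new String[size];
    }

    private int home(String key) {
        return (key.hashCode() & 0x7fffffff) % cells.length;
    }

    // Try hash(x), hash(x)+1, hash(x)+2, ... mod tableSize (f(i) = i).
    public void insert(String key) {
        int h = home(key);
        for (int i = 0; i < cells.length; i++) {
            int slot = (h + i) % cells.length;
            if (cells[slot] == null || cells[slot].equals(key)) {
                cells[slot] = key;
                return;
            }
        }
        throw new IllegalStateException("table full");
    }

    // Probe the same sequence; an empty cell means the key is absent.
    public boolean find(String key) {
        int h = home(key);
        for (int i = 0; i < cells.length; i++) {
            int slot = (h + i) % cells.length;
            if (cells[slot] == null) {
                return false;
            }
            if (cells[slot].equals(key)) {
                return true;
            }
        }
        return false;
    }
}
```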

Slide 21

clustering

Linear probing suffers from a problem called clustering: a domino effect.

Slide 22

Quadratic probing

f(i) = i^2

How will this affect clusters?

Slide 23

Theorem

If quadratic probing is used, the table size is prime, and the table is at least half empty, then we will always find a spot for a new element.

Slide 24

Option 3: Double Hashing

  • Apply a second hash function, hash2, and probe at distance i * hash2(x)
  • f(i) = i * hash2(x)
  • probe sequence: (hash(x) + i * hash2(x)) % tableSize
  • Note:

1. hash2 can never return 0

2. the probe sequence must be able to address the entire table

Slide 25

Load factor

The number of elements divided by the table size (often written λ).

Slide 26

growing

So how do you resize a hash table?
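One answer, sketched here for a chained table of integer keys (helper names are mine): allocate a larger array, ideally a prime roughly double the old size, and re-insert every key, because each key's slot depends on the table size.

```java
import java.util.ArrayList;
import java.util.List;

public class Rehash {
    // Re-insert every key: key % oldSize and key % newSize generally differ.
    static List<Integer>[] grow(List<Integer>[] old, int newSize) {
        @SuppressWarnings("unchecked")
        List<Integer>[] fresh = new List[newSize];
        for (int i = 0; i < newSize; i++) {
            fresh[i] = new ArrayList<>();
        }
        for (List<Integer> bucket : old) {
            for (int key : bucket) {
                fresh[key % newSize].add(key);   // recompute every position
            }
        }
        return fresh;
    }
}
```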

Slide 27

deletion

How would deletion work? Any issues?

Slide 28

Extendible Hashing

Setup is similar to a B+ tree: a hashing routine with growth built in.
  • use partial bits of the keys
  • when the table needs to grow, use more bits

Slide 29

question

Of the data structures we have covered, which is the most space efficient?

Slide 30

Wrapping up

Say you want to add a new operation to heaps:

DecreasePriority(p, d)
  • want to subtract d from the priority at position p
  • any ideas on run time?
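One way to see the cost: DecreasePriority lowers one entry and then percolates it up, so it takes O(log N), the height of the heap. A sketch on an array-based min-heap (names are mine):

```java
public class MinHeap {
    private final int[] a;   // a[1..n] holds the heap; a[0] is unused
    private int n = 0;

    public MinHeap(int capacity) {
        a = new int[capacity + 1];
    }

    public void insert(int x) {
        a[++n] = x;
        percolateUp(n);
    }

    // Subtract d from the priority at position p, then restore heap
    // order by swapping upward along the path to the root: O(log N).
    public void decreasePriority(int p, int d) {
        a[p] -= d;
        percolateUp(p);
    }

    private void percolateUp(int i) {
        while (i > 1 && a[i / 2] > a[i]) {
            int t = a[i];
            a[i] = a[i / 2];
            a[i / 2] = t;
            i /= 2;
        }
    }

    public int min() {
        return a[1];
    }
}
```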

Slide 31

Switching gears

Slide 32

When we come back from break, we will be doing much more programming, background, etc.:
  • Inheritance
  • Class relationships
  • Viruses
  • a virus-checking program

Slide 33

Application

Does anyone know how Google works from a data structure point of view?

Runtime?

Slide 34

Search engine technology

Generally, search engines work in the following way:
  • collect documents (e.g., webpages)
  • index the information
  • wait for a search:
    • understand the query
    • search and match
    • scoring system
Slide 35

Any ideas on how to design a search engine so that you can quickly find results?

Slide 36

A hash table of search words: an inverted index table.
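A minimal inverted-index sketch (names invented for illustration): a hash table from each word to the set of documents containing it, so a query word costs one O(1)-average lookup regardless of collection size.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class InvertedIndex {
    // word -> ids of the documents that contain it
    private final Map<String, Set<Integer>> index = new HashMap<>();

    public void addDocument(int docId, String text) {
        for (String word : text.toLowerCase().split("\\s+")) {
            index.computeIfAbsent(word, w -> new HashSet<>()).add(docId);
        }
    }

    // One hash lookup per query word.
    public Set<Integer> find(String word) {
        return index.getOrDefault(word.toLowerCase(), Set.of());
    }
}
```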

Slide 37

Vector Model

Each document is a vector in an n-dimensional vector space of search terms.
  • take the query and find the closest points
  • the vectors are (very) sparse
  • with one-word tokens, word order is ignored

Slide 38

algorithm

First we generate a master word list; we can strip out stop words.

Stemming: we can also collapse related words, e.g. runs and run, worry and worrying.

Slide 39

master word list

  • cat
  • dog
  • fine
  • good
  • got
  • hat
  • make
  • pet

# A cat is a fine pet  $vec = [ 1, 0, 1, 0, 0, 0, 0, 1 ];
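The mapping from document to vector can be sketched like this (a naive word-boundary check, for illustration only):

```java
import java.util.List;

public class DocVector {
    // One slot per master-list word: 1 if the document contains it, else 0.
    static int[] vectorize(List<String> masterList, String document) {
        String padded = " " + document.toLowerCase() + " ";
        int[] vec = new int[masterList.size()];
        for (int i = 0; i < masterList.size(); i++) {
            vec[i] = padded.contains(" " + masterList.get(i) + " ") ? 1 : 0;
        }
        return vec;
    }
}
```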

Slide 40

There are many ways of calculating similarity between a search term and documents.

The cosine of the angle between the vectors can generate a relevance score.
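Cosine similarity is the dot product divided by the product of the vector lengths; 0 means no shared terms, and values near 1 mean nearly the same term mix. A sketch:

```java
public class Cosine {
    // cos(a, b) = (a . b) / (|a| * |b|)
    static double similarity(int[] a, int[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0.0;   // an all-zero vector matches nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```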

Slide 41

General issues

  • Better parsing
  • Non-English collections
    • stemming
    • stop words
  • Similarity search
    • can combine a few docs to find similarity
  • Term weighting
  • Incorporating metadata
  • Exact phrase matching
Slide 42

More DS

Sorting

Slide 43

Simple

So it's straightforward to sort in O(N^2) time:
  • Insertion sort
  • Selection sort
  • Bubble sort

Slide 44

More complicated

Shell Sort

This is an O(N^1.5) algorithm that is simple and efficient in practice.
  • originally presented as an O(N^2) algorithm
  • complicated to analyze; it took many years to get better bounds

Slide 45

More Complex

O(N log N) algorithms:
  • merge sort
  • heapsort

Slide 46

Quicksort

Worst case O(N^2), average case O(N log N).

We will learn how to make the worst case occur with such low probability that we end up dealing with the average case.

Slide 47

Selection sort

Anyone remember how this one works? Keep two arrays, sorted and unsorted; repeatedly choose the minimum from the unsorted list and append it to the sorted one.
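The two-array description is usually done in place: the sorted "array" is the prefix, the unsorted one is the suffix. A sketch:

```java
public class SelectionSort {
    // Pass i finds the minimum of a[i..] and swaps it into position i,
    // so the prefix a[0..i] is always sorted: O(N^2) comparisons.
    static void sort(int[] a) {
        for (int i = 0; i < a.length - 1; i++) {
            int min = i;
            for (int j = i + 1; j < a.length; j++) {
                if (a[j] < a[min]) {
                    min = j;
                }
            }
            int t = a[i];
            a[i] = a[min];
            a[min] = t;
        }
    }
}
```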

Slide 48

Bubble Sort

Anyone? Iterate and swap out-of-order adjacent elements.

Slide 49

Insertion sort

This is the quickest of the O(N^2) algorithms for small sets.

Slide 50

Insertion

Sort the 1st element, then the first 2, then the first 3, etc.
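Those passes ("sort the first 1, first 2, first 3, ...") look like this in code:

```java
public class InsertionSort {
    // After pass i the prefix a[0..i] is sorted: element i slides left
    // past every larger neighbor until it lands in position.
    static void sort(int[] a) {
        for (int i = 1; i < a.length; i++) {
            int x = a[i];
            int j = i - 1;
            while (j >= 0 && a[j] > x) {
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = x;
        }
    }
}
```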