

  1. Covid-19 genome variation graph
     Something very different: we can use this graph to find mutations in new cases,
     perhaps mutations which increase virulence.
     Covid-19 phylogeny:
     - Inferred using post-order traversal and probability
     - We can use this to understand how the virus spreads
     De Bruijn graph:
     - Genome assembly: the graph represents overlapping sequences
     - Assemble the genome by graph traversal
     - Build the graph using a perfect hash
     Sources: https://nextstrain.org/narratives/ncov/sit-rep/2020-03-04 ,
     http://data-science-sequencing.github.io/Win2018/lectures/lecture7/ ,
     http://virological.org/t/response-to-on-the-origin-and-continuing-evolution-of-sars-cov-2/418

  2. Back to hashing
     ((ax + b) mod p) mod m. Ignoring this for now, but p should be bigger than m so that we map
     to all parts of the hash table.
     Warmup: find the largest set of keys that collide.
     hash(x) = (3x + 2) mod 9, table slots 0-8:
     - Any value of the form 3i, where i is an integer, collides
     - This is bad, as we could only map to part of the hash table (slots 2, 5 and 8)
     hash(x) = (3x + 2) mod 11, table slots 0-10:
     - Any value of the form 11i collides
     - This is better because we map to all values from 0 to 10
     Which is a better hash function? The second function is better because 3 is relatively prime
     to 11, which results in fewer collisions: it avoids degeneracy, that is, getting trapped in
     cycles that do not map to all the possible values.
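     One way to see the difference is to enumerate which slots each function can actually reach.
     A minimal Python sketch using the two functions from this slide (the helper name
     slots_reached is my own):

         # Which slots can hash(x) = (a*x + b) mod m actually reach?
         def slots_reached(a, b, m, sample=range(100)):
             return sorted({(a * x + b) % m for x in sample})

         print(slots_reached(3, 2, 9))   # [2, 5, 8]: only part of the table, since gcd(3, 9) = 3
         print(slots_reached(3, 2, 11))  # all of 0..10: 3 is relatively prime to 11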

  3. Hashing with chaining
     hash(AT) = hash(GA) = 1, so we store multiple keys in each array slot. We can store lots of
     keys, but performance degrades. How?
     - We will consider linked lists
     - Any dictionary ADT could be used, provided we consider whether keys are comparable (ordered)
     Result (using a linked list):
     - We can hash more than m things into an array of size m
     - Worst-case runtime depends on the length of the largest chain
     - Memory is allocated on each insert, which could lead to bad memory/cache performance
     Example: d["A"] = 1, then d["A"] = 2; what does print d["A"] give? We could insert at the
     head or the tail of the linked list, since we need to check whether the key already exists
     and so must traverse the chain anyway. A minimal sketch follows this slide.
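     A minimal separate-chaining sketch in Python. Plain lists stand in for linked lists, and the
     class name ChainedHashTable is mine, not from the slides. It shows why insert traverses the
     chain: the key may already be present, which is what makes the second d["A"] assignment
     overwrite rather than duplicate.

         class ChainedHashTable:
             """Minimal separate-chaining hash table; Python lists stand in for linked lists."""

             def __init__(self, m=8):
                 self.table = [[] for _ in range(m)]   # one chain per slot

             def _slot(self, key):
                 return hash(key) % len(self.table)

             def insert(self, key, value):
                 chain = self.table[self._slot(key)]
                 for i, (k, _) in enumerate(chain):    # traverse: the key may already exist
                     if k == key:
                         chain[i] = (key, value)       # overwrite, like d["A"] = 2
                         return
                 chain.append((key, value))            # not found: head or tail both work

             def find(self, key):
                 for k, v in self.table[self._slot(key)]:
                     if k == key:
                         return v
                 raise KeyError(key)

         d = ChainedHashTable()
         d.insert("A", 1)
         d.insert("A", 2)
         print(d.find("A"))   # 2: the second insert overwrote the first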

  4. Access time for chaining
     Load factor: α = (# items hashed) / (size of array) = n / m
     Why do we need this to think about runtime? Assuming a uniform hash function (the
     probability of hashing to any slot is equal), a key hashes to a location with an average
     of α items.
     Search cost:
     - An unsuccessful search examines α items
     - A successful search examines 1 + (n - 1)/(2m) = 1 + α/2 - α/(2n) items: we pay for the
       first item, and our query key is on average halfway through the remaining items in its
       chain
     For good performance we want a small load factor.
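     A small simulation can sanity-check the successful-search formula. This is a sketch with
     made-up parameters (m = 64 slots and n = 256 keys, so α = 4), hashing random keys with
     k mod m and comparing the measured average against 1 + (n - 1)/(2m):

         import random

         m, n = 64, 256                        # table size and item count, so alpha = 4
         table = [[] for _ in range(m)]
         keys = random.sample(range(10**6), n)
         for k in keys:
             table[k % m].append(k)            # chaining with hash(k) = k mod m

         # A successful search for k examines the items before k in its chain, plus k itself.
         measured = sum(table[k % m].index(k) + 1 for k in keys) / n
         predicted = 1 + (n - 1) / (2 * m)
         print(measured, predicted)            # the two values should be close (around 3.0)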

  5. Open addressing
     hash(AT) = hash(GA) = hash(GT) = 1
     Each array element contains one item. The hash function specifies a sequence of slots to
     try, wrapping around using mod m.
     - Insert: if the first slot is occupied, check the next location in the hash function's
       sequence.
     - Find: if the slot does not match, keep trying the next slot in the sequence until either
       the item is found or an empty slot is visited (item not found).
     - Remove: find the item and replace it with a tombstone. Example: remove(AA).
     In the example table, TA is in slot 0, AT in 1, GA in 2, GT in 3, CT in 4 and AA in 6: GT
     hashed to 1 but was pushed along to slot 3. This is clustering.
     Result:
     - We cannot hash more than m items, by the pigeonhole principle
     - Hash table memory is allocated once
     - Performance will depend on how many times we check slots
     A sketch of find and remove with tombstones follows this slide.
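     A sketch of find and remove with tombstones, using linear probing as the probe sequence.
     The names TOMBSTONE and OpenAddressTable are mine, and a full implementation would also
     check for an existing key before reusing a tombstone slot on insert. The point is that find
     skips over tombstones rather than stopping at them, so items inserted past a removed slot
     stay reachable.

         TOMBSTONE = object()   # marks a slot whose item was removed
         EMPTY = None

         class OpenAddressTable:
             def __init__(self, m=7):
                 self.slots = [EMPTY] * m

             def _probe(self, key):
                 """Yield the sequence of slots to try for this key (linear probing here)."""
                 m = len(self.slots)
                 h = hash(key) % m
                 for i in range(m):
                     yield (h + i) % m                  # wrap around using mod m

             def insert(self, key):
                 for s in self._probe(key):
                     if self.slots[s] is EMPTY or self.slots[s] is TOMBSTONE:
                         self.slots[s] = key
                         return
                 raise RuntimeError("table full")       # cannot hash more than m items

             def find(self, key):
                 for s in self._probe(key):
                     if self.slots[s] is EMPTY:         # a truly empty slot ends the search
                         return False
                     if self.slots[s] == key:           # tombstones never match a key
                         return True
                 return False

             def remove(self, key):
                 for s in self._probe(key):
                     if self.slots[s] is EMPTY:
                         return
                     if self.slots[s] == key:
                         self.slots[s] = TOMBSTONE      # keep later keys in the chain findable
                         return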

  6. Linear probing
     Try (h(k) + i) mod m for i = 0, 1, 2, ..., m - 1
     For this example h(k) = k mod 7 and m = 7. Inserting 76, 14, 42 and 83 in that order:
     - insert(76): 76 mod 7 = 6, slot 6 is empty, so 76 goes in slot 6
     - insert(14): 14 mod 7 = 0, slot 0 is empty, so 14 goes in slot 0
     - insert(42): 42 mod 7 = 0, slot 0 is taken, so probe slot 1, which is empty; 42 goes in slot 1
     - insert(83): 83 mod 7 = 6, slot 6 is taken; probe slots 0 and 1 (wrapping around), both
       taken; 83 goes in slot 2
     Final table: 14 in slot 0, 42 in slot 1, 83 in slot 2, 76 in slot 6.
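     A short sketch that reproduces this trace; the function name linear_probe_insert is mine.

         def linear_probe_insert(table, k):
             """Insert k using linear probing: try (h(k) + i) mod m for i = 0, 1, 2, ..."""
             m = len(table)
             for i in range(m):
                 slot = (k % m + i) % m
                 if table[slot] is None:
                     table[slot] = k
                     return slot
             raise RuntimeError("table full")

         table = [None] * 7
         for k in (76, 14, 42, 83):
             print(k, "->", linear_probe_insert(table, k))   # 76 -> 6, 14 -> 0, 42 -> 1, 83 -> 2
         print(table)   # [14, 42, 83, None, None, None, 76]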

  7. Double hashing
     Try (h(k) + i · h2(k)) mod m for i = 0, 1, 2, ..., m - 1
     For this example h(k) = k mod 7, h2(k) = 5 - (k mod 5), and m = 7. Note that h2 never
     evaluates to 0. Inserting 76, 14, 42 and 83 in that order:
     - insert(76): 76 mod 7 = 6, slot 6 is empty, so 76 goes in slot 6
     - insert(14): 14 mod 7 = 0, slot 0 is empty, so 14 goes in slot 0
     - insert(42): 42 mod 7 = 0, slot 0 is taken; h2(42) = 3, so try (0 + 3) mod 7 = 3; 42 goes
       in slot 3
     - insert(83): 83 mod 7 = 6, slot 6 is taken; h2(83) = 2, so try (6 + 2) mod 7 = 1; 83 goes
       in slot 1
     Final table: 14 in slot 0, 83 in slot 1, 42 in slot 3, 76 in slot 6.
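     The same trace with the second hash function; again a sketch, where the step
     h2(k) = 5 - (k mod 5) is always between 1 and 5 and therefore never 0.

         def double_hash_insert(table, k):
             """Insert k by trying (h(k) + i * h2(k)) mod m for i = 0, 1, 2, ..."""
             m = len(table)
             h, h2 = k % m, 5 - (k % 5)   # h2 is in 1..5, never 0
             for i in range(m):
                 slot = (h + i * h2) % m
                 if table[slot] is None:
                     table[slot] = k
                     return slot
             raise RuntimeError("table full")

         table = [None] * 7
         for k in (76, 14, 42, 83):
             print(k, "->", double_hash_insert(table, k))   # 76 -> 6, 14 -> 0, 42 -> 3, 83 -> 1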

  8. Rehashing
     Sometimes we need to resize the hash table:
     - For open addressing this has to happen when we fill the table
     - For separate chaining we want to do this when the load factor gets big
     To resize we:
     - Resize the hash table: Θ(1) amortized time per insert if we double the size
     - Get a new hash function: one reason is that we need a new value for mod m, since the
       table size changes
     Result (because we have a new hash function):
     - The keys are spread out again
     - Tombstones are removed (open addressing)
     - This allows arbitrarily large tables
     A resize sketch follows this slide.
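     A sketch of rehashing for a chained table stored as a list of chains; the function name
     rehash and the example keys are mine. For open addressing the rebuild works the same way
     and drops tombstones, since only live items are re-inserted.

         def rehash(table):
             """Double the table and re-insert every key under the new modulus (a new hash function)."""
             new_m = 2 * len(table)
             new_table = [[] for _ in range(new_m)]
             for chain in table:
                 for key, value in chain:
                     new_table[hash(key) % new_m].append((key, value))   # keys spread out again
             return new_table

         # With doubling, rehashes over n inserts cost at most n + n/2 + n/4 + ... < 2n total,
         # so the amortized cost per insert stays Θ(1).
         old = [[("AT", 1), ("GA", 2)], [("CT", 3)]]   # a tiny 2-slot chained table
         print(rehash(old))                            # the same items, now in a 4-slot table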

  9. Hashing summary
     What collision resolution strategy is best?
     - At high load factors, chaining is better; open addressing suffers from clustering
     - Open addressing can have better memory performance: fewer memory allocations
     What is the best implementation of a dictionary ADT?
     - AVL tree: worst case Θ(log2(n)), average case Θ(log2(n))
     - Hash table: worst case Θ(n) (lots of collisions), average case Θ(1)
     Why did we talk about trees? AVL trees can make use of the fact that keys are comparable
     for fast operations:
     - Find the max: keep going right, Θ(log2(n))
     - Range queries: find all values less than a key
     More in-depth info: http://jeffe.cs.illinois.edu/teaching/algorithms/notes/05-hashing.pdf

  10. Something new
      Is this a BST? No, it does not have the search property.
      What is interesting about this tree?
      1. It is complete: all layers are full except the last, where all nodes are as far left
         as possible.
      2. Every node is less than its children, which implies the root is the minimum.
      In level order the tree holds: 2; 5, 6; 9, 8, 7, 14; 29, 21, 42, 15, 33.
      Where should we insert next?
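      A complete tree is usually stored as an array in level order, which also answers the
      question: the next insert fills the next free array position. A minimal sketch with the
      values from this tree (the parent/child index arithmetic is the standard array encoding,
      not something specific to these slides):

          # Level-order array encoding of the tree on this slide.
          heap = [2, 5, 6, 9, 8, 7, 14, 29, 21, 42, 15, 33]

          # The node at index i has children at 2*i + 1 and 2*i + 2.
          ok = all(heap[i] < heap[c]
                   for i in range(len(heap))
                   for c in (2 * i + 1, 2 * i + 2) if c < len(heap))
          print(ok)   # True: every node is less than its children, so the root is the minimum

          # The next insert keeps the tree complete by filling index len(heap) = 12,
          # which is the right child of the node at index (12 - 1) // 2 = 5, the node holding 7.
          print((len(heap) - 1) // 2, heap[(len(heap) - 1) // 2])   # 5 7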
