
Hans-Joachim Böckenhauer and Dennis Komm

Digital Medicine I: Introduction to Programming

Dynamic Programming and HMMs

Autumn 2019 – December 5, 2019

Searching

Complexity of Binary Search

Binary Search

def binsearch(data, searched):
    left = 0
    right = len(data) - 1
    while left <= right:
        current = (left + right) // 2    # middle of the remaining search space
        if data[current] == searched:
            return current               # found: return the index
        elif data[current] > searched:
            right = current - 1          # continue in the left half
        else:
            left = current + 1           # continue in the right half
    return -1                            # not contained
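For illustration (my example, not from the slides), a quick call on a sorted list:

data = [2, 3, 5, 7, 11, 13]        # binary search requires sorted input
print(binsearch(data, 7))          # prints 3, the index of 7
print(binsearch(data, 4))          # prints -1: 4 is not contained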


Complexity of Binary Search

• At first, there are n elements.
• With every iteration, the search space is halved.
• After the first iteration, n/2 elements remain.
• After the second iteration, n/4 elements remain.
• ...
• After how many iterations x does only one element remain?

n/2^x = 1 ⟺ n = 2^x ⟺ x = log2 n
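As a sanity check (my example): for n = 1 000 000 this gives x = log2(1 000 000) ≈ 19.93, so binary search needs about 20 iterations.

import math
print(math.log2(1_000_000))   # 19.93..., i.e., about 20 halvings suffice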



Complexity of Binary Search

[Plot: number of comparisons as a function of the input length n (up to 100), for linear search and binary search.]


Complexity of Binary Search

• We again use a variable counter to count the comparisons.
• The algorithm is executed on sorted lists with values 1 to n.
• The value of n grows by 1 with every iteration.
• Initially, n is 1; at the end, 1 000 000.
• The first element, 1, is always sought.
• The results are stored in a list and plotted using matplotlib.


Complexity of Binary Search

Worst Case

import matplotlib.pyplot as plt
# assumes binsearch has been modified to return the comparison counter

values = []
data = [1]
for i in range(1, 1000001):
    data.append(data[-1] + 1)           # add an element that is larger by 1 than the current last
    values.append(binsearch(data, 1))   # search for the first element, record the count
plt.plot(values, color="red")
plt.show()


Complexity of Binary Search

• What happens if the data is unsorted?
• Linear search always works for unsorted lists and is in O(n).
• Sorting can pay off when multiple searches are made: sorting is in O(n log n) and therefore slower than a single linear search, but binary search is in O(log n) and consequently much faster than linear search.
• When does sorting pay off? If more than log n searches are made (see the sketch below).
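A rough back-of-the-envelope sketch of this break-even point (my illustration; constant factors are ignored):

import math

def linear_total(n, k):
    # comparisons for k linear searches on an unsorted list of length n
    return k * n

def sort_then_search_total(n, k):
    # comparisons for one sort plus k binary searches
    return n * math.log2(n) + k * math.log2(n)

n = 1_000_000
for k in [5, 20, 100]:
    print(k, linear_total(n, k), round(sort_then_search_total(n, k)))
# sorting starts to pay off once k exceeds roughly log2(n), here about 20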



A Question from Medical Research: Observing a Bacterial Culture

• How does a bacterial culture change over time?
• Nutrient solution in a changing environment (temperature increase, addition of chemicals, ...)
• We want to study: when are the bacteria sick, and when are they healthy?
• If healthy, a certain protein is produced.
• The bacterial DNA is altered (fluorescence marker, GFP).
• Observe the fluorescence level over time.


Observing a Bacterial Culture

[Figure: observed fluorescence levels at times 60–600 minutes; panels a and b. Source: Elowitz, Leibler, 2000, Nature 403.]

For the fluorescence level, we observe: Low, Medium, Medium, High, Medium, High, Medium, High. What follows for the health condition of the bacteria?


Hidden Markov Models: The Unfair Casino


The Unfair Casino

• A fair die (all results with probability 1/6)
• An unfair (loaded) die (one result with probability 1/2)
• At the beginning, one of the two dice is chosen (each with probability 1/2)
• After each roll, the die can be (secretly) switched (with probability 1/20)

Markov model (a numpy sketch follows below):
• State: the die currently used
• Transition probability: the probability to switch the die
• Emission probability: the probability to observe a concrete number, depending on the current state
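This model can be written down directly as numpy arrays (a sketch; the variable names are mine, and I assume the loaded face is the six, as the diagram below suggests):

import numpy as np

start_probs = np.array([1/2, 1/2])             # fair, unfair
transition_probs = np.array([[19/20, 1/20],    # fair -> fair, fair -> unfair
                             [1/20, 19/20]])   # unfair -> fair, unfair -> unfair
emission_probs = np.array([[1/6] * 6,            # fair die: faces 1..6
                           [1/10] * 5 + [1/2]])  # unfair die: the six has probability 1/2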


Markov Model for the Unfair Casino

[State diagram: from Start, the fair and the unfair die are each chosen with probability 1/2. Fair die: every face has probability 1/6 (≈ 0.16). Unfair die: five faces have probability 0.1 each, the loaded face has probability 0.5. Each die is kept with probability 19/20 and switched with probability 1/20.]


Probability of a Concrete Output

Suppose we observe the loaded face four times in a row (the slide shows four dice images). Moreover, we know that first the unfair die was used twice, and then the fair one was used twice. This is a fixed walk through our Markov model. How large is its probability?


Probability of a Concrete Output

Consider the first roll:

  • 1. Event A = “The unfair die is chosen at the beginning”
  • 2. Event B = “The loaded face is rolled”

Prob(The unfair die is chosen at the beginning and the loaded face is rolled)
= Prob(A and B)
= Prob(A) · Prob(B under the condition A)
= Prob(A) · Prob(B | A)
= 1/2 · 1/2



Probability of a Concrete Output

We can continue in this fashion for all four rolls:

1/2 · 1/2 · 19/20 · 1/2 · 1/20 · 1/6 · 19/20 · 1/6 = 361/2 304 000

The factors, in order:
• 1/2 — the unfair die is chosen at the beginning
• 1/2 — the loaded face is rolled
• 19/20 — the die is not switched
• 1/2 — the loaded face is rolled again
• 1/20 — the die is switched
• 1/6 — the loaded face is rolled again
• 19/20 — the die is not switched
• 1/6 — the loaded face is rolled once more
Total probability: 361/2 304 000
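The product can be checked exactly, for instance with Python's fractions module (my illustration):

from fractions import Fraction as F

p = F(1, 2) * F(1, 2) * F(19, 20) * F(1, 2) * F(1, 20) * F(1, 6) * F(19, 20) * F(1, 6)
print(p)   # 361/2304000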


Probability of a Concrete Output

Hidden Markov Model (HMM)
• But what happens if we only observe the outputs?
• Suppose we observe five rolls (shown as dice images on the slide).
• What is the most probable walk?
• For three of the numbers, the unfair die is “better”; for two, the fair one is “better”.
• Is it worth switching the die?


Probability of a Concrete Output

• The number of possible walks for n rolls is 2^n.
• For 300 rolls, this is more than 10^90.
• Do we really need to try them all? No

➯ Dynamic programming


Hidden Markov Models: Dynamic Programming


Dynamic Programming

Idea:
• The solution for the complete input is put together from solutions for subproblems, starting with the smallest subproblems.
• Problem: find suitable subproblems.

Viterbi Algorithm (Viterbi, 1967):
• The partial solutions are all most probable walks for prefixes of the given sequence that end in a particular state.
• Compute the most probable walks for larger prefixes from the most probable walks for smaller ones.


Viterbi Algorithm

[Table sketch: rows “fair” and “unfair”, columns for the observations X1, ..., X7; the cells up to X3 are shaded.]

• Suppose the shaded cells are already filled out.
• The most probable walks of length 3 are then known (e.g., the walk that ends with observation X3 in state “fair” is: “unfair”, “fair”, “fair”).
• Compute such a walk of length 4 ending in “fair”:
• Extend the known walks and take the more probable one (formalized below).
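Written as a formula (my notation, not from the slides): if V(s, t) denotes the probability of the most probable walk for the first t observations x1, ..., xt that ends in state s, the table is filled by

V(s, t) = e_s(x_t) · max over all states s' of ( V(s', t−1) · δ(s', s) ),

where e_s(x_t) is the probability to emit x_t in state s and δ(s', s) is the transition probability from s' to s.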


Viterbi Algorithm

• Observe a sequence of five rolls (shown as dice images on the slide).
• Create a table with rows “fair” and “unfair” and one column per observation.
• In each cell, store which walk is the most probable so far.

First column (probability that the first face is observed and the state is “fair” or “unfair”, respectively):

fair: 1/2 · 1/6 = 1/12, unfair: 1/2 · 1/2 = 1/4

Example step: the probability of the most probable walk of length 2 ending in “fair” is the probability that the observed face is emitted in “fair” times the better of the two possible predecessors:

1/6 · max(1/12 · 19/20, 1/4 · 1/20) = 1/6 · 19/240 = 19/1440

The completed table (values as printed on the slide, paired per column as fair/unfair):

          1     2        3           4              5
fair      1/12  19/1440  361/172800  6859/20736000  6859/76800000
unfair    1/4   19/800   361/16000   6859/640000    130321/25600000

Most probable walk: only the unfair die is used.


Hidden Markov Models for Bacterial Culture


Observing a Bacterial Culture

A bacterium can be in one of three states:

  • 1. Healthy
  • 2. OK
  • 3. Sick

The DNA is changed so that the bacteria fluoresce: the healthier they are, the more they fluoresce. Fluorescence level:

  • 1. High
  • 2. Medium
  • 3. Low


Observing a Bacterial Culture

[Figure (repeated from above): observed fluorescence levels at times 60–600 minutes; panels a and b. Source: Elowitz, Leibler, 2000, Nature 403.]

For the fluorescence level, we observe: Low, Medium, Medium, High, Medium, High, Medium, High. What follows for the health condition of the bacteria?


Observing a Bacterial Culture

Emission probabilities (empirical results)

  • 1. Healthy: Prob(High) = 0.5, Prob(Medium) = 0.3, Prob(Low) = 0.2
  • 2. OK: Prob(High) = 0.2, Prob(Medium) = 0.6, Prob(Low) = 0.2
  • 3. Sick: Prob(High) = 0.2, Prob(Medium) = 0.3, Prob(Low) = 0.5


Observing a Bacterial Culture

Transition probabilities (empirical results)

From \ To   Healthy   OK    Sick
Healthy     0.4       0.3   0.3
OK          0.2       0.6   0.2
Sick        0.0       0.4   0.6


Observing a Bacterial Culture

We can fill out the following table (values rounded):

          Low    Medium  Medium  High      Medium    High      Medium    High
Healthy   0.06   0.008   0.0024  0.00144   0.000172  0.000062  0.000007  0.000002
OK        0.06   0.04    0.0144  0.001728  0.000622  0.000074  0.000026  0.000003
Sick      0.16   0.03    0.0054  0.000648  0.000129  0.000024  0.000005  0.000001


How Fast is the Viterbi Algorithm?

• Count comparisons (as with sorting).
• Number of states: q
• Length of the observed sequence: n
• The table has size q × n.
• For each cell, q values need to be compared.
• We need roughly q · q · n comparisons.
• The time complexity is in O(q² · n).


How Fast is the Viterbi Algorithm?

Here, we have q = 3 states (Healthy, OK, Sick); let n = 300: fewer than 100 000 comparisons (in fact 3 · 3 · 300 = 2 700). The naive approach would inspect all 3^300 walks, i.e., more than 10^143 of them.
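A quick check of these orders of magnitude (my sketch; Python integers are exact here):

q, n = 3, 300
print(q * q * n)             # 2700 comparisons for the Viterbi algorithm
print(q ** n > 10 ** 143)    # True: more than 10^143 walks for the naive approach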


How Fast is the Viterbi Algorithm?

[Plot: number of comparisons (up to 100,000) against the length of the sequence (1 to 10), for the naive approach and for Viterbi.]



Hidden Markov Models: The Viterbi Algorithm in Python

Viterbi Algorithm – Initialization

• Store all probabilities in matrices.
• Rows of the emission matrix: Healthy, OK, Sick.
• Columns of the emission matrix: High, Medium, Low.
• Rows and columns of the transition matrix: Healthy, OK, Sick; transition from row to column.

import numpy as np

# emission_probs[i, j]: probability to observe symbol j (High, Medium, Low)
# in state i (Healthy, OK, Sick), e.g., emission_probs[0, 1] = 0.3 is the
# probability to observe "Medium" in state Healthy, and
# emission_probs[1, 0] = 0.2 the probability to observe "High" in state OK
emission_probs = np.array([[0.5, 0.3, 0.2],
                           [0.2, 0.6, 0.2],
                           [0.2, 0.3, 0.5]])

# transition_probs[i, j]: probability to transition from state i to state j,
# e.g., transition_probs[0, 2] = 0.3 is the probability to transition from
# state Healthy to state Sick, and transition_probs[1, 0] = 0.2 the
# probability to transition from state OK to state Healthy
transition_probs = np.array([[0.4, 0.3, 0.3],
                             [0.2, 0.6, 0.2],
                             [0.0, 0.4, 0.6]])


Viterbi Algorithm – Initialization

Convert letters to numbers ➯ corresponding to positions in the matrices:

seq = "LMMHMHMH"
n = len(seq)
seq2 = np.zeros(n)           # create a vector of size n that only contains zeros
for i in range(0, n):
    if seq[i] == "H":
        seq2[i] = 0
    elif seq[i] == "M":
        seq2[i] = 1
    else:
        seq2[i] = 2


Viterbi Algorithm – Probability Matrix

Create a numpy matrix to store the probabilities:

probs = np.zeros((3, n))     # matrix of size 3 × n that only contains zeros

• Fill the first column with emission probabilities.
• The first letter is seq[0]; seq2[0] gives the corresponding column in the emission matrix.
• Assumption: initially, every state is equally likely.
• The probabilities for Healthy, OK, and Sick are thus all 1/3.
• Multiply the respective emission probability by 1/3:

for i in range(0, 3):
    probs[i][0] = 1/3 * emission_probs[i, int(seq2[0])]



Viterbi Algorithm – Probability Matrix

Fill the table entry by entry:

for j in range(1, n):                    # for every column (except the first)
    for i in range(0, 3):                # for every row
        # probability to observe seq2[j] while the state is i
        eprob = emission_probs[i, int(seq2[j])]
        # the different probabilities to transition to state i
        tprob1 = transition_probs[0, i]
        tprob2 = transition_probs[1, i]
        tprob3 = transition_probs[2, i]
        # store the three candidate probabilities in the variable x
        x = [probs[0, j-1] * tprob1 * eprob,
             probs[1, j-1] * tprob2 * eprob,
             probs[2, j-1] * tprob3 * eprob]
        smax = np.amax(x)                # compute the largest probability
        probs[i, j] = smax               # store it in the matrix


Viterbi Algorithm – Tracing Matrix

• Tracing “arrows” in the table: additionally store which value was used for the maximum.
• Create a matrix to store the walk through the table:

probs = np.zeros((3, n))
back = np.zeros((3, n))      # same size as the probability matrix

Add in the loop:

smax = np.amax(x)            # compute the largest probability
probs[i, j] = smax           # store the value
imax = np.argmax(x)          # compute the corresponding index
back[i, j] = imax            # store the index


Viterbi Algorithm – Tracing Matrix

• Now run backwards through the matrix back.
• Compute the largest probability in the last column.
• Output H, O, or S correspondingly.
• Then consider the value at the corresponding position of back:
  – a “0” means the step came from row 0 (upper left),
  – a “1” means the step came from row 1 (middle left),
  – a “2” means the step came from row 2 (lower left).
• Continue in the column to the left, in row back[i, j].


Viterbi Algorithm – Tracing Matrix

The resulting state sequence as a vector:

result = ["X"] * n

Search for the maximum entry in the last column:

x = [probs[0, n-1], probs[1, n-1], probs[2, n-1]]
imax = np.argmax(x)

The last entry of the result has to be set correspondingly:

if imax == 0:
    result[n-1] = "H"
elif imax == 1:
    result[n-1] = "O"
else:
    result[n-1] = "S"



Viterbi Algorithm – Tracing Matrix

Now fill result from right to left:

i = imax              # start in the row corresponding to the last state
j = n - 1             # and in the last column
while j >= 1:         # repeat for every column; stop at j == 1, since
                      # column 0 has no predecessor and result[0] is set
                      # from back[i, 1]
    if back[i, j] == 0:
        result[j-1] = "H"      # step from the upper left; predecessor state was Healthy
    elif back[i, j] == 1:
        result[j-1] = "O"      # step from the middle left; predecessor state was OK
    else:
        result[j-1] = "S"      # step from the lower left; predecessor state was Sick
    i = int(back[i, j])        # continue in the row indicated by back
    j -= 1                     # and in the column to the left
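Finally, the decoded sequence can be printed (my addition; the expected output comes from tracing the table above by hand, so treat it as an assumption):

print("".join(result))   # for seq = "LMMHMHMH" this should print "SOOOOOOO":
                         # Sick for the initial Low observation, then OK throughout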


CG Islands

Coding Regions in the DNA

• DNA is interpreted as a string over the letters A, C, G, T.
• Goal: find coding regions (genes).
• CG islands are regions of the string in which the sequence CG appears more often.
• They frequently appear close to genes, and less frequently elsewhere.

➯ Find CG islands in a given string


The Unfair Casino revisited

Variant: coin toss instead of dice. The unfair coin shows “heads” with probability 1/3 and “tails” with probability 2/3.

[State diagram: from Start, the fair and the unfair coin are each chosen with probability 1/2. Fair coin: Prob(heads) = Prob(tails) = 0.5. Unfair coin: Prob(heads) = 1/3, Prob(tails) = 2/3. Each coin is kept with probability 9/10 and switched with probability 1/10.]


The Unfair Casino revisited

Alternative modeling:
• In every state, only one result, “heads” or “tails”, is output, with probability 1.
• There are separate states for the fair and the unfair coin for each output.
• The transition probabilities model the respective emission probabilities.
• Transitions into the unfair “tails” state are more likely than into the unfair “heads” state.


Alternative HMM for the Unfair Coin Toss

States: Start, Fair “heads”, Fair “tails”, Unfair “heads”, Unfair “tails”. Transition probabilities (from row to column; a construction sketch follows below):

From \ To        Fair “heads”  Fair “tails”  Unfair “heads”  Unfair “tails”
Start            1/4           1/4           1/6             1/3
Fair “heads”     9/20          9/20          1/30            1/15
Fair “tails”     9/20          9/20          1/30            1/15
Unfair “heads”   1/20          1/20          9/30            9/15
Unfair “tails”   1/20          1/20          9/30            9/15
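These values can be generated from the original two-coin model (a sketch with my own names; each row folds the next coin's emission probabilities into the transitions):

import numpy as np

stay, switch = 9/10, 1/10
fair = np.array([1/2, 1/2])      # emission probabilities of the fair coin: heads, tails
unfair = np.array([1/3, 2/3])    # emission probabilities of the unfair coin

# state order: fair-heads, fair-tails, unfair-heads, unfair-tails
from_fair = np.concatenate([stay * fair, switch * unfair])     # [9/20, 9/20, 1/30, 1/15]
from_unfair = np.concatenate([switch * fair, stay * unfair])   # [1/20, 1/20, 9/30, 9/15]
transition = np.array([from_fair, from_fair, from_unfair, from_unfair])
start = np.concatenate([1/2 * fair, 1/2 * unfair])             # [1/4, 1/4, 1/6, 1/3]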

HMM to Identify CG islands

Idea (analogous to the second model of the coin toss):
• Four states each for the four letters A, C, G, and T: one set within and one set outside the CG islands (eight states in total).
• Every state emits exactly one letter with probability 1.
• The transition probabilities differ within and outside the CG islands.


HMM to Identify CG islands

[State diagram: states A+, C+, G+, T+ within a CG island and A−, C−, G−, T− outside. Transitions δ(X+, Y+) within the island, δ(X−, Y−) outside, and δ(X+, Y−), δ(X−, Y+) for leaving and entering an island.]



Determining the Transition Probabilities

Empirical results from known DNA sequences.

Within a CG island:
• δ(C+, G+) ≈ 0.3
• δ(C+, X+) ≈ 0.23 for X ∈ {A, C, T}
• δ(X+, Y+) ≈ 0.25 for X ∈ {A, G, T}, Y ∈ {A, C, G, T}

Outside a CG island:
• δ(C−, G−) ≈ 0.04
• δ(C−, X−) ≈ 0.32 for X ∈ {A, C, T}
• δ(X−, Y−) ≈ 0.25 for X ∈ {A, G, T}, Y ∈ {A, C, G, T}


Determining the Transition Probabilities

• Switching between CG islands and their environment.
• The transition probabilities depend on the average length of CG islands and on the distance between two CG islands.
• Typical length of a CG island: 500 to 2000, so roughly 1000 on average; the probability of leaving, about 1/1000 per step, is split over the four outside states:

➯ δ(X+, Y−) ≈ 1/4000 = 0.00025

• Typical distance between two CG islands: 25 000; again split over the four island states:

➯ δ(X−, Y+) ≈ 1/100 000 = 0.00001
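Putting the pieces together, the full 8 × 8 transition matrix can be sketched as follows (my state order and names; the rows only sum to approximately 1 because the slide's values are rounded empirical estimates):

import numpy as np

# state order: A+, C+, G+, T+ (within an island), A-, C-, G-, T- (outside)
inside = np.full((4, 4), 0.25)
inside[1] = [0.23, 0.23, 0.3, 0.23]      # from C+: G+ is favored
outside = np.full((4, 4), 0.25)
outside[1] = [0.32, 0.32, 0.04, 0.32]    # from C-: G- is avoided

transition = np.block([
    [inside, np.full((4, 4), 0.00025)],   # leaving an island: delta(X+, Y-)
    [np.full((4, 4), 0.00001), outside],  # entering an island: delta(X-, Y+)
])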
