SLIDE 1

Lecture 8

• Agenda:
  • String matching
  • How to evaluate a pattern recognition system

SLIDE 2

String Matching (note 1)

• Definitions:
  • Pattern: x = "movi". Text: "zlatanibrahimovic"
  • Shift: s = offset from the start of the text to the start position of x
  • Valid shift: s = an offset at which x matches the text completely
• Applications: find a word in a text, count words, etc.

SLIDE 3

String Matching - Algorithm

• Naive string matching: brute force (a sketch follows below)
  • OK, but slow for large texts
• Alternative: Boyer-Moore string matching
  • Faster because s = s + k, where k > 1
  • k = 1 for the naive algorithm
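A minimal sketch of the brute-force matcher, using the pattern/text pair from the previous slide (function and variable names are my own):

    def naive_match(text, x):
        """Return every valid shift s at which pattern x occurs in text.

        Tries all shifts s = 0 .. len(text) - len(x) and compares the
        full pattern at each one: O(len(text) * len(x)) worst case.
        """
        shifts = []
        for s in range(len(text) - len(x) + 1):
            if text[s:s + len(x)] == x:   # compare at shift s
                shifts.append(s)          # s is a valid shift
        return shifts

    print(naive_match("zlatanibrahimovic", "movi"))  # [12]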

SLIDE 4

Boyer-Moore: Definitions

• Algorithm (on the blackboard)
• Good suffix:
  • The elements (from the right) that match
• Bad character:
  • The first (from the right) mismatching element
• Calculate the shift suggested by each and apply the maximum.

SLIDE 5

Boyer-Moore: Definitions

• F(x): Last-occurrence function (bad character)
  • Look-up table containing each letter of the alphabet together with its right-most location in x
  • Example: x = "bror". F(x): o = 3, r = 2, b = 1, the rest = 0
  • Example: x = "estimates". F(x): e = 8, t = 7, a = 6, m = 5, i = 4, s = 2, the rest = 0
• NB: the right-most element of x is ignored, since it corresponds to the current shift (see the sketch below)
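A direct tabulation of F(x) as just defined (a sketch; positions are 1-based and the right-most element of x is skipped, as the slide notes):

    def last_occurrence(x):
        """Bad-character table F(x): 1-based position of the right-most
        occurrence of each letter in x, ignoring x's last element
        (it corresponds to the current shift). Missing letters map to 0.
        """
        F = {}
        for i, c in enumerate(x[:-1], start=1):  # skip the last element
            F[c] = i                             # a later position wins
        return F

    print(last_occurrence("bror"))       # {'b': 1, 'r': 2, 'o': 3}
    print(last_occurrence("estimates"))  # e=8, s=2, t=7, i=4, m=5, a=6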

SLIDE 6

Boyer-Moore: Definitions

• G(x): Good-suffix function
  • Look-up table containing the second right-most position of each suffix that can be (re)found in x
  • Example: x = "bror". G(x): r = 2, the rest = 0; hence or = 0, ror = 0
  • Example: x = "estimates". G(x): s = 2, es = 1, the rest = 0 (tabulated in the sketch below)
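G(x) can be tabulated the same way. This sketch follows the slide's definition literally: for each proper suffix, the 1-based start of its second right-most occurrence in x, or 0 if it does not reoccur.

    def good_suffix_table(x):
        """G(x) per the slide: for each proper suffix of x, the 1-based
        start of its second right-most occurrence in x (0 if none).
        """
        G = {}
        for k in range(1, len(x)):
            suffix = x[len(x) - k:]
            # the right-most occurrence is the suffix itself, so look
            # for a match that ends before the last character of x
            pos = x.rfind(suffix, 0, len(x) - 1)
            G[suffix] = pos + 1 if pos >= 0 else 0
        return G

    print(good_suffix_table("bror"))       # {'r': 2, 'or': 0, 'ror': 0}
    print(good_suffix_table("estimates"))  # 's': 2, 'es': 1, the rest 0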

SLIDE 7

Boyer-Moore String Matching
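As a compact illustration, here is a sketch using only the bad-character rule (Horspool's simplification, not the full algorithm of these slides, which would also compute the good-suffix shift from G(x) and advance by the maximum of the two):

    def boyer_moore_bad_char(text, x):
        """Right-to-left comparison with bad-character shifts only
        (Horspool's simplification of Boyer-Moore)."""
        m, n = len(x), len(text)
        # Distance from each character's right-most occurrence in x
        # (last position excluded) to the end of the pattern.
        jump = {c: m - 1 - i for i, c in enumerate(x[:-1])}
        shifts, s = [], 0
        while s <= n - m:
            j = m - 1
            while j >= 0 and text[s + j] == x[j]:  # compare from the right
                j -= 1
            if j < 0:
                shifts.append(s)                   # valid shift found
                s += 1
            else:
                # align the text character under the pattern's last slot
                # with its right-most occurrence in x; unseen characters
                # allow a full-length jump (this is where k > 1 pays off)
                s += jump.get(text[s + m - 1], m)
        return shifts

    print(boyer_moore_bad_char("zlatanibrahimovic", "movi"))  # [12]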

SLIDE 8

Distance measure for Strings

• We know what to do for features...
• x = "hej", y = "her", z = "haj"
• Dist(x,y)? Is Dist(x,y) > Dist(x,z)?
• Applications: spell-checking, speech recognition, DNA analysis, copy-cat detection, ...
• Hamming distance: requires |x| = |z|
  • Measures the number of positions where a difference occurs (a sketch follows below)
  • Dist(x,y) = 1, Dist(y,z) = 2, and Dist(x,y) = Dist(x,z) = 1
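A minimal sketch matching the example values above:

    def hamming(a, b):
        """Number of positions where a and b differ; needs |a| == |b|."""
        if len(a) != len(b):
            raise ValueError("Hamming distance needs equal-length strings")
        return sum(ca != cb for ca, cb in zip(a, b))

    print(hamming("hej", "her"))  # Dist(x,y) = 1
    print(hamming("her", "haj"))  # Dist(y,z) = 2
    print(hamming("hej", "haj"))  # Dist(x,z) = 1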

SLIDE 9

Distance measure for Strings

• Levenshtein distance
  • |x| = |z| is not required => a better measure
  • Aka edit distance, since the distance is defined as the number of operations that need to be performed on x in order to obtain y
SLIDE 10

Edit Distance (change x to y)

• Cost matrix C (fill the 1st row and 1st column first, thereafter one column at a time; transcribed into code below):

  C[i,j] = min( C[i-1,j] + 1,                      (deletion)
                C[i,j-1] + 1,                      (insertion)
                C[i-1,j-1] + 1 - δ(x[i],y[j]) )    (no change / exchange)

  where δ(x[i],y[j]) = 1 if x[i] = y[j], otherwise 0
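A direct transcription of this recurrence (a sketch; the slide's δ is written out inline):

    def edit_distance(x, y):
        """Levenshtein distance via the cost matrix C from the slide.

        C[i][j] = cost of turning the first i characters of x into the
        first j characters of y (unit cost per insert/delete/exchange).
        """
        m, n = len(x), len(y)
        C = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            C[i][0] = i      # delete all i characters
        for j in range(n + 1):
            C[0][j] = j      # insert all j characters
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                delta = 1 if x[i - 1] == y[j - 1] else 0
                C[i][j] = min(C[i - 1][j] + 1,              # deletion
                              C[i][j - 1] + 1,              # insertion
                              C[i - 1][j - 1] + 1 - delta)  # no change / exchange
        return C[m][n]

    print(edit_distance("hej", "her"))  # 1
    print(edit_distance("hej", "haj"))  # 1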
SLIDE 11

Recognition rate

• In some system specifications you need technical success criteria for your project (product)
  • HW, SW, real-time, recognition rate, ...
• Recognition rate = (number of correctly classified / number of tested samples)
  • Multiply by 100% and you have it as a percentage
• How do you test a system?
• How do you present and interpret the results?

SLIDE 12

Methods for test

• Cross-validation
  • Train on α % of the samples (α > 50) and test on the rest
  • α is typically 90, depending on the number of samples and the complexity of the system
• M-fold cross-validation
  • Divide (randomly) all samples into M equally sized groups
  • Use M-1 groups to train the system and test on the remaining group
  • Do this M times and average the results (a code sketch follows below)
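A sketch of the procedure; train_fn and accuracy_fn are hypothetical placeholders for your classifier's own training and evaluation code:

    import random

    def m_fold_cross_validation(samples, labels, M, train_fn, accuracy_fn):
        """M-fold cross-validation as on the slide: split randomly into
        M equally sized groups, train on M-1 groups, test on the
        held-out group, repeat M times, and average the rates.
        """
        order = list(range(len(samples)))
        random.shuffle(order)                    # divide randomly
        folds = [order[m::M] for m in range(M)]  # M (almost) equal groups
        rates = []
        for m in range(M):
            held_out = set(folds[m])
            train = [i for i in order if i not in held_out]
            model = train_fn([samples[i] for i in train],
                             [labels[i] for i in train])
            rates.append(accuracy_fn(model,
                                     [samples[i] for i in folds[m]],
                                     [labels[i] for i in folds[m]]))
        return sum(rates) / M                    # average the results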

SLIDE 13

Interpretation of the results

• Recognition rate = (number of correctly classified / number of tested samples)
  • Multiply by 100% and you have it as a percentage
• Error % = 100% - (recognition rate x 100%)
• Distribution of errors? Use a confusion matrix (rates computed in the sketch below)
• Example: 3 classes, 25 samples per class

  Rows: Input (the truth). Columns: Output (from the system).

        P1   P2   P3
  P1    19    5    1
  P2     0   24    1
  P3     1    4   20
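With the matrix reconstructed above (each row summing to the 25 samples per class), the recognition and error rates fall out directly (a sketch):

    # Recognition rate and error rate from the confusion matrix above.
    confusion = [[19, 5, 1],   # input P1
                 [0, 24, 1],   # input P2
                 [1, 4, 20]]   # input P3

    correct = sum(confusion[i][i] for i in range(3))   # the diagonal
    total = sum(sum(row) for row in confusion)         # 75 samples
    rate = correct / total
    print(f"Recognition rate: {100 * rate:.1f}%")      # 84.0%
    print(f"Error rate:       {100 * (1 - rate):.1f}%")  # 16.0%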

SLIDE 14

General Representation of errors

• Number of errors = incorrectly recognized + not recognized
• The total number of errors can be represented in a 2x2 table of Input (the truth) versus Output (from the system), with Yes/No on both axes:
  • Incorrectly recognized (Type I error): output Yes for a true No
    (False positive = FP) (False accept = FA) (False accept rate = FAR) (Ghost object) (False alarm)
  • Not recognized (Type II error): output No for a true Yes
    (False negative = FN) (False reject = FR) (False reject rate = FRR) (Miss)

SLIDE 15

General Representation of errors

• Example: SETI
  • Find intelligent signals in input data
  • FN versus FP: are they equally important?
• In the 2x2 table above: a false positive (false alarm) can simply be examined again: Ok
• A false negative, a missed genuine signal, loses the discovery: No !!

SLIDE 16

General Representation of errors

• Example: access control to nuclear weapons
  • Is the person trying to enter ok?
  • FN versus FP: are they equally important?
• Here the priorities flip: a false reject (FN) of an authorized person is merely inconvenient: Ok
• A false accept (FP) that admits the wrong person: No !!

SLIDE 17

Receiver Operating Characteristic Methodology

SLIDE 18

Introduction to ROC curves

• ROC = Receiver Operating Characteristic
• Started in electronic signal detection theory (1940s - 1950s)
• Has become very popular in biomedical applications, particularly radiology and imaging
• Also used in machine learning applications to assess classifiers
• Can be used to compare tests/procedures
SLIDE 19

ROC curve

SLIDE 20

ROC curves: simplest case

• Consider a diagnostic test for a disease
• The test has 2 possible outcomes:
  • 'positive' = suggesting presence of the disease
  • 'negative'
• An individual can test either positive or negative for the disease

SLIDE 21

True disease state vs. Test result

                       Test: not rejected            Test: rejected
  No disease (D = 0)   correct (specificity)         Type I error (False +), α
  Disease (D = 1)      Type II error (False -), β    correct (Power = 1 - β; sensitivity)
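In code form, the two rates defined by this table (a minimal sketch; the argument names are my own):

    def sensitivity(tp, fn):
        """True positive rate = power = 1 - β: diseased cases caught."""
        return tp / (tp + fn)

    def specificity(tn, fp):
        """True negative rate = 1 - α: healthy cases correctly cleared."""
        return tn / (tn + fp)

    print(sensitivity(tp=24, fn=1))  # 0.96 (toy numbers)
    print(specificity(tn=19, fp=6))  # 0.76 (toy numbers)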

SLIDE 22

Specific Example

[Figure: overlapping distributions of the test result for patients with the disease and patients without the disease.]

SLIDE 23

[Figure: the two test-result distributions with a decision threshold; patients below the threshold are called "negative", above it "positive".]

SLIDE 24

Some definitions ...

[Figure: same plot; the part of the with-disease distribution called "positive" is the True Positives.]

SLIDE 25

[Figure: the part of the without-disease distribution called "positive" is the False Positives.]

SLIDE 26

[Figure: the part of the without-disease distribution called "negative" is the True Negatives.]

SLIDE 27

[Figure: the part of the with-disease distribution called "negative" is the False Negatives.]

SLIDE 28

Moving the Threshold: right

[Figure: the threshold moved right: fewer patients are called "+", so both false positives and true positives decrease.]

SLIDE 29

Moving the Threshold: left

[Figure: the threshold moved left: more patients are called "+", so both true positives and false positives increase.]

SLIDE 30

ROC curve

[Figure: ROC curve; x-axis: False Positive Rate (1 - specificity), 0% to 100%; y-axis: True Positive Rate (sensitivity), 0% to 100%.]
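A minimal sketch of how such a curve is produced from raw test results (names are my own): sweep the threshold and record one (FPR, TPR) point per setting.

    def roc_points(scores_diseased, scores_healthy):
        """Trace the ROC curve by sweeping the decision threshold over
        all observed test results; a result >= threshold is "positive".
        Each threshold yields one (FPR, TPR) point.
        """
        points = []
        thresholds = sorted(set(scores_diseased + scores_healthy))
        for t in thresholds + [float("inf")]:
            tpr = sum(s >= t for s in scores_diseased) / len(scores_diseased)
            fpr = sum(s >= t for s in scores_healthy) / len(scores_healthy)
            points.append((fpr, tpr))   # (1 - specificity, sensitivity)
        return points

    # Toy data: diseased patients tend to score higher on the test.
    print(roc_points([4, 5, 6, 7], [1, 2, 3, 5]))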

SLIDE 31

ROC curve comparison

[Figure: two ROC plots (True Positive Rate vs. False Positive Rate, both 0-100%). A good test: the curve bows toward the upper-left corner. A poor test: the curve stays near the diagonal.]

SLIDE 32

ROC curve extremes

[Figure: two ROC plots. Best test: the two distributions don't overlap at all, so the curve goes straight up the y-axis and across the top. Worst test: the distributions overlap completely, so the curve is the diagonal.]

SLIDE 33

Area under ROC curve (AUC)

• Overall measure of test performance
• Comparisons between two tests are based on differences between their (estimated) AUCs
• For continuous data, AUC is equivalent to the Mann-Whitney U-statistic (a nonparametric test of difference in location between two populations); see the sketch below
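A minimal sketch exploiting exactly that equivalence (names are my own): for each (diseased, healthy) pair, count whether the diseased score is higher, with ties counting one half.

    def auc_mann_whitney(scores_diseased, scores_healthy):
        """AUC via its Mann-Whitney interpretation: the probability that
        a randomly chosen diseased patient scores higher than a randomly
        chosen healthy one, ties counting 1/2 (equals U / (n1 * n2)).
        """
        wins = 0.0
        for d in scores_diseased:
            for h in scores_healthy:
                if d > h:
                    wins += 1.0
                elif d == h:
                    wins += 0.5
        return wins / (len(scores_diseased) * len(scores_healthy))

    print(auc_mann_whitney([4, 5, 6, 7], [1, 2, 3, 5]))  # 0.90625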

SLIDE 34

AUC for ROC curves

[Figure: four ROC plots with AUC = 100%, 90%, 65%, and 50%; the better the test, the larger the area and the closer the curve hugs the upper-left corner.]

SLIDE 35

K-fold Cross-Validation

• Randomly sort the data
• Divide it into k folds (e.g. k = 10)
• Use one fold for validation and the remaining folds for training
• Repeat for each fold and average the accuracy (the same procedure as M-fold cross-validation above, with k = M)
