SLIDE 1

Lecture 8

• Agenda:
  • String matching
  • How to evaluate a pattern recognition system

SLIDE 2

String Matching (note 1)

• Definitions:
  • Pattern: x = "movi". Text: "zlatanibrahimovic"
  • Shift: s = offset from the start of the text to the start position of x
  • Valid shift: s = an offset at which x matches the text completely
• Applications: find a word in a text, count words, etc.

SLIDE 3

String Matching - Algorithm

• Naive string matching: brute force (a sketch follows below)
  • OK, but slow for large texts
• Alternative: Boyer-Moore string matching
  • Faster because s = s + k, where k > 1
  • k = 1 for the naive algorithm
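A minimal sketch of the brute-force matcher, using the pattern/text pair from the previous slide (function and variable names are my own):

    def naive_match(text, x):
        """Return every valid shift s at which pattern x occurs in text.

        Tries all shifts s = 0 .. len(text) - len(x) and compares the
        full pattern at each one: O(len(text) * len(x)) worst case.
        """
        shifts = []
        for s in range(len(text) - len(x) + 1):
            if text[s:s + len(x)] == x:   # compare at shift s
                shifts.append(s)          # s is a valid shift
        return shifts

    print(naive_match("zlatanibrahimovic", "movi"))  # [12]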

SLIDE 4

Boyer-Moore: Definitions

• Algorithm (on the blackboard)
• Good suffix:
  • The elements (from the right) that match
• Bad character:
  • The first (from the right) mismatching element
• Calculate the shift suggested by each and apply the maximum.

SLIDE 5

Boyer-Moore: Definitions

• F(x): Last-occurrence function (bad character)
  • Look-up table containing each letter of the alphabet together with its right-most location in x
  • Example: x = "bror". F(x): o = 3, r = 2, b = 1, the rest = 0
  • Example: x = "estimates". F(x): e = 8, t = 7, a = 6, m = 5, i = 4, s = 2, the rest = 0
• NB: the right-most element of x is ignored, since it corresponds to the current shift (see the sketch below)
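A direct tabulation of F(x) as just defined (a sketch; positions are 1-based and the right-most element of x is skipped, as the slide notes):

    def last_occurrence(x):
        """Bad-character table F(x): 1-based position of the right-most
        occurrence of each letter in x, ignoring x's last element
        (it corresponds to the current shift). Missing letters map to 0.
        """
        F = {}
        for i, c in enumerate(x[:-1], start=1):  # skip the last element
            F[c] = i                             # a later position wins
        return F

    print(last_occurrence("bror"))       # {'b': 1, 'r': 2, 'o': 3}
    print(last_occurrence("estimates"))  # e=8, s=2, t=7, i=4, m=5, a=6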

SLIDE 6

Boyer-Moore: Definitions

• G(x): Good-suffix function
  • Look-up table containing the second right-most position of each suffix that can be (re)found in x
  • Example: x = "bror". G(x): r = 2, the rest = 0; hence or = 0, ror = 0
  • Example: x = "estimates". G(x): s = 2, es = 1, the rest = 0 (tabulated in the sketch below)
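G(x) can be tabulated the same way. This sketch follows the slide's definition literally: for each proper suffix, the 1-based start of its second right-most occurrence in x, or 0 if it does not reoccur.

    def good_suffix_table(x):
        """G(x) per the slide: for each proper suffix of x, the 1-based
        start of its second right-most occurrence in x (0 if none).
        """
        G = {}
        for k in range(1, len(x)):
            suffix = x[len(x) - k:]
            # the right-most occurrence is the suffix itself, so look
            # for a match that ends before the last character of x
            pos = x.rfind(suffix, 0, len(x) - 1)
            G[suffix] = pos + 1 if pos >= 0 else 0
        return G

    print(good_suffix_table("bror"))       # {'r': 2, 'or': 0, 'ror': 0}
    print(good_suffix_table("estimates"))  # 's': 2, 'es': 1, the rest 0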

SLIDE 7

Boyer-Moore String Matching
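As a compact illustration, here is a sketch using only the bad-character rule (Horspool's simplification, not the full algorithm of these slides, which would also compute the good-suffix shift from G(x) and advance by the maximum of the two):

    def boyer_moore_bad_char(text, x):
        """Right-to-left comparison with bad-character shifts only
        (Horspool's simplification of Boyer-Moore)."""
        m, n = len(x), len(text)
        # Distance from each character's right-most occurrence in x
        # (last position excluded) to the end of the pattern.
        jump = {c: m - 1 - i for i, c in enumerate(x[:-1])}
        shifts, s = [], 0
        while s <= n - m:
            j = m - 1
            while j >= 0 and text[s + j] == x[j]:  # compare from the right
                j -= 1
            if j < 0:
                shifts.append(s)                   # valid shift found
                s += 1
            else:
                # align the text character under the pattern's last slot
                # with its right-most occurrence in x; unseen characters
                # allow a full-length jump (this is where k > 1 pays off)
                s += jump.get(text[s + m - 1], m)
        return shifts

    print(boyer_moore_bad_char("zlatanibrahimovic", "movi"))  # [12]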

SLIDE 8

Distance measure for Strings

• We know what to do for features...
• x = "hej", y = "her", z = "haj"
• Dist(x,y)? Is Dist(x,y) > Dist(x,z)?
• Applications: spell-checking, speech recognition, DNA analysis, copy-cat detection, ...
• Hamming distance: requires |x| = |z|
  • Measures the number of positions where a difference occurs (a sketch follows below)
  • Dist(x,y) = 1, Dist(y,z) = 2, and Dist(x,y) = Dist(x,z) = 1
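A minimal sketch matching the example values above:

    def hamming(a, b):
        """Number of positions where a and b differ; needs |a| == |b|."""
        if len(a) != len(b):
            raise ValueError("Hamming distance needs equal-length strings")
        return sum(ca != cb for ca, cb in zip(a, b))

    print(hamming("hej", "her"))  # Dist(x,y) = 1
    print(hamming("her", "haj"))  # Dist(y,z) = 2
    print(hamming("hej", "haj"))  # Dist(x,z) = 1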

SLIDE 9

Distance measure for Strings

• Levenshtein distance
  • |x| = |z| is not required => a better measure
  • Aka edit distance, since the distance is defined as the number of operations that need to be performed on x in order to obtain y
SLIDE 10

Edit Distance (change x to y)

• Cost matrix C (fill the 1st row and 1st column first, thereafter one column at a time; transcribed into code below):

  C[i,j] = min( C[i-1,j] + 1,                      (deletion)
                C[i,j-1] + 1,                      (insertion)
                C[i-1,j-1] + 1 - δ(x[i],y[j]) )    (no change / exchange)

  where δ(x[i],y[j]) = 1 if x[i] = y[j], otherwise 0
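A direct transcription of this recurrence (a sketch; the slide's δ is written out inline):

    def edit_distance(x, y):
        """Levenshtein distance via the cost matrix C from the slide.

        C[i][j] = cost of turning the first i characters of x into the
        first j characters of y (unit cost per insert/delete/exchange).
        """
        m, n = len(x), len(y)
        C = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            C[i][0] = i      # delete all i characters
        for j in range(n + 1):
            C[0][j] = j      # insert all j characters
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                delta = 1 if x[i - 1] == y[j - 1] else 0
                C[i][j] = min(C[i - 1][j] + 1,              # deletion
                              C[i][j - 1] + 1,              # insertion
                              C[i - 1][j - 1] + 1 - delta)  # no change / exchange
        return C[m][n]

    print(edit_distance("hej", "her"))  # 1
    print(edit_distance("hej", "haj"))  # 1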
SLIDE 11

Recognition rate

• In some system specifications you need technical success criteria for your project (product)
  • HW, SW, real-time, recognition rate, ...
• Recognition rate = (number of correctly classified / number of tested samples)
  • Multiply by 100% and you have it as a percentage
• How do you test a system?
• How do you present and interpret the results?

SLIDE 12

Methods for test

• Cross-validation
  • Train on α % of the samples (α > 50) and test on the rest
  • α is typically 90, depending on the number of samples and the complexity of the system
• M-fold cross-validation
  • Divide (randomly) all samples into M equally sized groups
  • Use M-1 groups to train the system and test on the remaining group
  • Do this M times and average the results (a code sketch follows below)
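A sketch of the procedure; train_fn and accuracy_fn are hypothetical placeholders for your classifier's own training and evaluation code:

    import random

    def m_fold_cross_validation(samples, labels, M, train_fn, accuracy_fn):
        """M-fold cross-validation as on the slide: split randomly into
        M equally sized groups, train on M-1 groups, test on the
        held-out group, repeat M times, and average the rates.
        """
        order = list(range(len(samples)))
        random.shuffle(order)                    # divide randomly
        folds = [order[m::M] for m in range(M)]  # M (almost) equal groups
        rates = []
        for m in range(M):
            held_out = set(folds[m])
            train = [i for i in order if i not in held_out]
            model = train_fn([samples[i] for i in train],
                             [labels[i] for i in train])
            rates.append(accuracy_fn(model,
                                     [samples[i] for i in folds[m]],
                                     [labels[i] for i in folds[m]]))
        return sum(rates) / M                    # average the results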

SLIDE 13

Interpretation of the results

• Recognition rate = (number of correctly classified / number of tested samples)
  • Multiply by 100% and you have it as a percentage
• Error % = 100% - (recognition rate x 100%)
• Distribution of errors? Use a confusion matrix (rates computed in the sketch below)
• Example: 3 classes, 25 samples per class

  Rows: Input (the truth). Columns: Output (from the system).

        P1   P2   P3
  P1    19    5    1
  P2     0   24    1
  P3     1    4   20
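With the matrix reconstructed above (each row summing to the 25 samples per class), the recognition and error rates fall out directly (a sketch):

    # Recognition rate and error rate from the confusion matrix above.
    confusion = [[19, 5, 1],   # input P1
                 [0, 24, 1],   # input P2
                 [1, 4, 20]]   # input P3

    correct = sum(confusion[i][i] for i in range(3))   # the diagonal
    total = sum(sum(row) for row in confusion)         # 75 samples
    rate = correct / total
    print(f"Recognition rate: {100 * rate:.1f}%")      # 84.0%
    print(f"Error rate:       {100 * (1 - rate):.1f}%")  # 16.0%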

SLIDE 14

General Representation of errors

• Number of errors = incorrectly recognized + not recognized
• The total number of errors can be represented in a 2x2 table of Input (the truth) versus Output (from the system), with Yes/No on both axes:
  • Incorrectly recognized (Type I error): output Yes for a true No
    (False positive = FP) (False accept = FA) (False accept rate = FAR) (Ghost object) (False alarm)
  • Not recognized (Type II error): output No for a true Yes
    (False negative = FN) (False reject = FR) (False reject rate = FRR) (Miss)

SLIDE 15

General Representation of errors

• Example: SETI
  • Find intelligent signals in input data
  • FN versus FP: are they equally important?
• In the 2x2 table above: a false positive (false alarm) can simply be examined again: Ok
• A false negative, a missed genuine signal, loses the discovery: No !!

SLIDE 16

General Representation of errors

• Example: access control to nuclear weapons
  • Is the person trying to enter ok?
  • FN versus FP: are they equally important?
• Here the priorities flip: a false reject (FN) of an authorized person is merely inconvenient: Ok
• A false accept (FP) that admits the wrong person: No !!

SLIDE 17

Receiver Operating Characteristic Methodology

SLIDE 18

Introduction to ROC curves

• ROC = Receiver Operating Characteristic
• Started in electronic signal detection theory (1940s - 1950s)
• Has become very popular in biomedical applications, particularly radiology and imaging
• Also used in machine learning applications to assess classifiers
• Can be used to compare tests/procedures
SLIDE 19

ROC curve

SLIDE 20

ROC curves: simplest case

• Consider a diagnostic test for a disease
• The test has 2 possible outcomes:
  • 'positive' = suggesting presence of the disease
  • 'negative'
• An individual can test either positive or negative for the disease

SLIDE 21

True disease state vs. Test result

                       Test: not rejected            Test: rejected
  No disease (D = 0)   correct (specificity)         Type I error (False +), α
  Disease (D = 1)      Type II error (False -), β    correct (Power = 1 - β; sensitivity)
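In code form, the two rates defined by this table (a minimal sketch; the argument names are my own):

    def sensitivity(tp, fn):
        """True positive rate = power = 1 - β: diseased cases caught."""
        return tp / (tp + fn)

    def specificity(tn, fp):
        """True negative rate = 1 - α: healthy cases correctly cleared."""
        return tn / (tn + fp)

    print(sensitivity(tp=24, fn=1))  # 0.96 (toy numbers)
    print(specificity(tn=19, fp=6))  # 0.76 (toy numbers)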

SLIDE 22

Specific Example

[Figure: overlapping distributions of the test result for patients with the disease and patients without the disease.]

SLIDE 23

[Figure: the two test-result distributions with a decision threshold; patients below the threshold are called "negative", above it "positive".]

SLIDE 24

Some definitions ...

[Figure: same plot; the part of the with-disease distribution called "positive" is the True Positives.]

SLIDE 25

[Figure: the part of the without-disease distribution called "positive" is the False Positives.]

SLIDE 26

[Figure: the part of the without-disease distribution called "negative" is the True Negatives.]

SLIDE 27

[Figure: the part of the with-disease distribution called "negative" is the False Negatives.]

SLIDE 28

Moving the Threshold: right

[Figure: the threshold moved right: fewer patients are called "+", so both false positives and true positives decrease.]

SLIDE 29

Moving the Threshold: left

[Figure: the threshold moved left: more patients are called "+", so both true positives and false positives increase.]

SLIDE 30

ROC curve

[Figure: ROC curve; x-axis: False Positive Rate (1 - specificity), 0% to 100%; y-axis: True Positive Rate (sensitivity), 0% to 100%.]
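A minimal sketch of how such a curve is produced from raw test results (names are my own): sweep the threshold and record one (FPR, TPR) point per setting.

    def roc_points(scores_diseased, scores_healthy):
        """Trace the ROC curve by sweeping the decision threshold over
        all observed test results; a result >= threshold is "positive".
        Each threshold yields one (FPR, TPR) point.
        """
        points = []
        thresholds = sorted(set(scores_diseased + scores_healthy))
        for t in thresholds + [float("inf")]:
            tpr = sum(s >= t for s in scores_diseased) / len(scores_diseased)
            fpr = sum(s >= t for s in scores_healthy) / len(scores_healthy)
            points.append((fpr, tpr))   # (1 - specificity, sensitivity)
        return points

    # Toy data: diseased patients tend to score higher on the test.
    print(roc_points([4, 5, 6, 7], [1, 2, 3, 5]))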

SLIDE 31

ROC curve comparison

[Figure: two ROC plots (True Positive Rate vs. False Positive Rate, both 0-100%). A good test: the curve bows toward the upper-left corner. A poor test: the curve stays near the diagonal.]

SLIDE 32

ROC curve extremes

[Figure: two ROC plots. Best test: the two distributions don't overlap at all, so the curve goes straight up the y-axis and across the top. Worst test: the distributions overlap completely, so the curve is the diagonal.]

SLIDE 33

Area under ROC curve (AUC)

• Overall measure of test performance
• Comparisons between two tests are based on differences between their (estimated) AUCs
• For continuous data, AUC is equivalent to the Mann-Whitney U-statistic (a nonparametric test of difference in location between two populations); see the sketch below
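A minimal sketch exploiting exactly that equivalence (names are my own): for each (diseased, healthy) pair, count whether the diseased score is higher, with ties counting one half.

    def auc_mann_whitney(scores_diseased, scores_healthy):
        """AUC via its Mann-Whitney interpretation: the probability that
        a randomly chosen diseased patient scores higher than a randomly
        chosen healthy one, ties counting 1/2 (equals U / (n1 * n2)).
        """
        wins = 0.0
        for d in scores_diseased:
            for h in scores_healthy:
                if d > h:
                    wins += 1.0
                elif d == h:
                    wins += 0.5
        return wins / (len(scores_diseased) * len(scores_healthy))

    print(auc_mann_whitney([4, 5, 6, 7], [1, 2, 3, 5]))  # 0.90625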

SLIDE 34

AUC for ROC curves

[Figure: four ROC plots with AUC = 100%, 90%, 65%, and 50%; the better the test, the larger the area and the closer the curve hugs the upper-left corner.]

SLIDE 35

K-fold Cross-Validation

• Randomly sort the data
• Divide it into k folds (e.g. k = 10)
• Use one fold for validation and the remaining folds for training
• Repeat for each fold and average the accuracy (the same procedure as M-fold cross-validation above, with k = M)
