Lecture 8


  1. Lecture 8
  Agenda:
  • String matching
  • How to evaluate a pattern recognition system

  2. String Matching (note 1)
  Definitions:
  • Pattern: x = "movi". Text: "zlatanibrahimovic"
  • Shift: s = offset from the start of the text to the start position of x
  • Valid shift: s = an offset giving a complete match
  • Applications: find a word in a text, count words, etc.

  3. String Matching - Algorithm
  • Naive string matching: brute force
  • OK, but slow for large texts
  • Alternative: Boyer-Moore string matching
  • Faster because s = s + k with k > 1 possible
  • k = 1 always holds for the naive algorithm (see the sketch below)
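A minimal sketch of the naive matcher (Python, my own illustration, not from the slides; strings are 0-indexed here, so the valid shift for x = "movi" in "zlatanibrahimovic" is 12):

```python
def naive_match(text, pattern):
    """Brute force: try every shift s and compare character by character."""
    n, m = len(text), len(pattern)
    valid_shifts = []
    for s in range(n - m + 1):         # s always advances by k = 1
        if text[s:s + m] == pattern:   # complete match at this shift?
            valid_shifts.append(s)     # s is a valid shift
    return valid_shifts

print(naive_match("zlatanibrahimovic", "movi"))  # [12]
```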

  4. Boyer-Moore: Definitions
  • Algorithm (done on the blackboard)
  • Good suffix: the elements (from the right) which match
  • Bad character: the first (from the right) mismatching element
  • Calculate the shift suggested by each and apply the maximum.

  5. Boyer-Moore: Definitions
  • F(x): last occurrence function (bad character)
  • A look-up table with each letter in the alphabet and its right-most location in x
  • Example: x = "bror". F(x): o = 3, r = 2, b = 1, the rest = 0
  • Example: x = "estimates". F(x): e = 8, t = 7, a = 6, m = 5, i = 4, s = 2, the rest = 0
  • NB: the right-most element is ignored, since it corresponds to the current shift
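A one-pass sketch of the table (my own illustration; positions are 1-indexed as on the slide, and letters that never occur are simply absent, i.e. treated as 0):

```python
def last_occurrence(x):
    """F(x): right-most 1-indexed position of each character in x,
    ignoring the final character (it corresponds to the current shift)."""
    F = {}
    for i, c in enumerate(x[:-1], start=1):  # skip the right-most element
        F[c] = i                             # later positions overwrite earlier ones
    return F

print(last_occurrence("bror"))       # {'b': 1, 'r': 2, 'o': 3}
print(last_occurrence("estimates"))  # {'e': 8, 's': 2, 't': 7, 'i': 4, 'm': 5, 'a': 6}
```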

  6. Boyer-Moore: Definitions
  • G(x): good-suffix function
  • A look-up table with, for each suffix of x, the second right-most position where that suffix can be (re)found in x
  • Example: x = "bror". G(x): r = 2, the rest = 0 (hence or = 0, ror = 0)
  • Example: x = "estimates". G(x): s = 2, es = 1, the rest = 0
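A direct sketch of this simplified definition (my own illustration; textbook Boyer-Moore computes an equivalent shift table with different bookkeeping):

```python
def good_suffix(x):
    """G(x): for each suffix of x, the 1-indexed start of its second
    right-most occurrence in x (0 if it only occurs as the suffix itself)."""
    n = len(x)
    G = {}
    for i in range(1, n):
        suffix = x[i:]
        pos = x.rfind(suffix, 0, n - 1)   # look left of the final occurrence
        G[suffix] = pos + 1 if pos >= 0 else 0
    return G

print(good_suffix("bror"))       # {'ror': 0, 'or': 0, 'r': 2}
print(good_suffix("estimates"))  # {..., 'es': 1, 's': 2}
```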

  7. Boyer-Moore String Matching
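A runnable sketch of the matching loop (my own simplification using only the bad-character rule and 0-indexed positions; the full algorithm also computes the good-suffix shift and applies the maximum of the two):

```python
def boyer_moore(text, pattern):
    """Compare right-to-left; on a mismatch, shift by s = s + k with k >= 1
    chosen by the bad-character heuristic."""
    n, m = len(text), len(pattern)
    last = {c: i for i, c in enumerate(pattern)}   # 0-indexed last occurrences
    shifts, s = [], 0
    while s <= n - m:
        j = m - 1
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1                                 # the good suffix grows
        if j < 0:
            shifts.append(s)                       # valid shift: complete match
            s += 1
        else:
            # align the bad character with its last occurrence in the pattern
            s += max(1, j - last.get(text[s + j], -1))
    return shifts

print(boyer_moore("zlatanibrahimovic", "movi"))    # [12]
```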

  8. Distance measures for Strings
  • We know what to do for features...
  • x = "hej", y = "her", z = "haj"
  • Dist(x,y)? Is Dist(x,y) > Dist(x,z)?
  • Applications: spell-checking, speech recognition, DNA analysis, copy-cat detection, ...
  • Hamming distance: requires equal lengths, |x| = |y|
  • Measures the number of positions where a difference occurs
  • Dist(x,y) = 1, Dist(y,z) = 2, Dist(x,y) = Dist(x,z)
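Hamming distance is a one-liner (a sketch of the definition above):

```python
def hamming(x, y):
    """Number of positions where x and y differ (requires |x| = |y|)."""
    assert len(x) == len(y)
    return sum(a != b for a, b in zip(x, y))

x, y, z = "hej", "her", "haj"
print(hamming(x, y), hamming(y, z), hamming(x, z))  # 1 2 1
```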

  9. Distance measures for Strings
  • Levenshtein distance
  • |x| = |y| is not required => a better measure
  • Aka edit distance, since the distance is defined as the number of operations that need to be performed on x in order to obtain y

  10. Edit Distance (change x to y)
  • Cost matrix C: fill the first row and first column, hereafter one column at a time:

      C[i,j] = min( C[i-1,j] + 1,                       (deletion)
                    C[i,j-1] + 1,                       (insertion)
                    C[i-1,j-1] + 1 - δ(x[i],y[j]) )     (no change / exchange)

    where δ(x[i],y[j]) = 1 if x[i] = y[j], otherwise 0
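The recurrence translates directly into code (my own sketch; the slide's 1-indexed x[i] becomes the 0-indexed x[i-1] in Python):

```python
def edit_distance(x, y):
    """Levenshtein distance via the cost matrix C from slide 10."""
    m, n = len(x), len(y)
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        C[i][0] = i                                   # first column: i deletions
    for j in range(n + 1):
        C[0][j] = j                                   # first row: j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            delta = 1 if x[i - 1] == y[j - 1] else 0
            C[i][j] = min(C[i - 1][j] + 1,            # deletion
                          C[i][j - 1] + 1,            # insertion
                          C[i - 1][j - 1] + 1 - delta)  # no change / exchange
    return C[m][n]

print(edit_distance("hej", "haj"))     # 1
print(edit_distance("movi", "movie"))  # 1 (lengths may differ)
```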

  11. Recognition rate
  • Some system specifications need technical success criteria for your project (product)
  • HW, SW, real-time, recognition rate, ...
  • Recognition rate = (number of correctly classified samples) / (number of tested samples)
  • Multiply by 100% to get it as a percentage
  • How do you test a system?
  • How do you present and interpret the results?

  12. Methods for testing
  • Cross-validation
  • Train on α% of the samples (α > 50) and test on the rest
  • α is typically 90, depending on the number of samples and the complexity of the system
  • M-fold cross-validation
  • Divide (randomly) all samples into M equally sized groups
  • Use M-1 groups to train the system and test on the remaining group
  • Do this M times and average the results (see the sketch below)
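A sketch of M-fold cross-validation (my own illustration; train_fn and test_fn are hypothetical stand-ins for the system's training routine and its recognition-rate test):

```python
import random

def m_fold_cv(samples, labels, train_fn, test_fn, M=10):
    """Train on M-1 groups, test on the remaining group, M times; average."""
    indices = list(range(len(samples)))
    random.shuffle(indices)                      # divide the samples randomly
    folds = [indices[i::M] for i in range(M)]    # M (almost) equally sized groups
    rates = []
    for m in range(M):
        train_idx = [i for f in range(M) if f != m for i in folds[f]]
        model = train_fn([samples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        rates.append(test_fn(model,
                             [samples[i] for i in folds[m]],
                             [labels[i] for i in folds[m]]))
    return sum(rates) / M                        # average the results
```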

  13. Interpretation of the results
  • Recognition rate = (number of correctly classified samples) / (number of tested samples); multiply by 100% to get it as a percentage
  • Error % = 100% - (recognition rate x 100%)
  • Distribution of errors?
  • Confusion matrix: 3 classes, 25 samples per class

                       Output (from the system)
                       P1    P2    P3
    Input         P1   19     5     1
    (the truth)   P2    0    24     1
                  P3    1     4    20
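Reading the rates off the confusion matrix above (the diagonal holds the correctly classified samples):

```python
# Rows = input (the truth), columns = output (from the system)
conf = [[19,  5,  1],
        [ 0, 24,  1],
        [ 1,  4, 20]]
correct = sum(conf[i][i] for i in range(3))   # 19 + 24 + 20 = 63
total = sum(sum(row) for row in conf)         # 3 classes x 25 samples = 75
print(f"Recognition rate: {100 * correct / total:.0f}%")      # 84%
print(f"Error rate: {100 * (total - correct) / total:.0f}%")  # 16%
```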

  14. General Representation of errors
  • Number of errors = incorrectly recognized + not recognized
  • The total number of errors can be represented like this:

                        Output (from the system)
                        Yes                          No
    Input        Yes    (correct)                    Not recognized
    (the truth)                                      (Type II error)
                                                     (False negative = FN)
                                                     (False reject = FR)
                                                     (False reject rate = FRR)
                                                     (Miss)
                 No     Incorrectly recognized       (correct)
                        (Type I error)
                        (False positive = FP)
                        (False accept = FA)
                        (False accept rate = FAR)
                        (Ghost object)
                        (False alarm)

  15. General Representation of errors
  • Example: SETI - find intelligent signals in input data
  • FN versus FP - are they equally important?
  • In the table above, the FN cell (a missed signal) is marked "No!!" while the FP cell (a false alarm) is marked "Ok": missing a real signal is far worse than investigating a false one

  16. General Representation of errors
  • Example: access control to nuclear weapons - is the person trying to enter OK?
  • FN versus FP - are they equally important?
  • Here the marks are reversed: the FN cell (rejecting an authorized person) is "Ok", while the FP cell (accepting an intruder) is "No!!"

  17. Receiver Operating Characteristic Methodology

  18. Introduction to ROC curves
  • ROC = Receiver Operating Characteristic
  • Started in electronic signal detection theory (1940s-1950s)
  • Has become very popular in biomedical applications, particularly radiology and imaging
  • Also used in machine learning applications to assess classifiers
  • Can be used to compare tests/procedures

  19. ROC curve

  20. ROC curves: simplest case
  • Consider a diagnostic test for a disease
  • The test has 2 possible outcomes: 'positive' (suggesting presence of the disease) or 'negative'
  • An individual can test either positive or negative for the disease

  21. True disease state vs. Test result

                          Test result
                          not rejected (negative)       rejected (positive)
    No disease (D = 0)    correct (specificity)         Type I error (False +): α
    Disease (D = 1)       Type II error (False -): β    correct (sensitivity, power = 1 - β)

  22. Specific Example
  (Figure: two distributions of the test result, one for patients without the disease and one for patients with the disease.)

  23. Threshold
  (Figure: a threshold on the test result axis; patients below it are called "negative", patients above it are called "positive".)

  24. Some definitions...
  (Figure: patients with the disease whose test result lies above the threshold are the true positives.)

  25. (Figure: patients without the disease whose test result lies above the threshold are the false positives.)

  26. (Figure: patients without the disease whose test result lies below the threshold are the true negatives.)

  27. (Figure: patients with the disease whose test result lies below the threshold are the false negatives.)

  28. Moving the Threshold: right
  (Figure: the threshold moved to the right; fewer patients are called "positive", giving fewer false positives but more false negatives.)

  29. Moving the Threshold: left
  (Figure: the threshold moved to the left; more patients are called "positive", giving more true positives but also more false positives.)

  30. ROC curve
  (Figure: the ROC curve plots the True Positive Rate (sensitivity), 0%-100%, against the False Positive Rate (1 - specificity), 0%-100%.)
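A sketch of how the curve is traced by sweeping the threshold of slides 23-29 over two sets of test results (my own illustration with made-up numbers):

```python
import numpy as np

def roc_points(diseased, healthy):
    """(FPR, TPR) pairs, one per candidate threshold."""
    thresholds = np.unique(np.concatenate([diseased, healthy]))
    points = [(1.0, 1.0)]                        # threshold below all results
    for t in thresholds:
        tpr = float(np.mean(diseased >= t))      # sensitivity
        fpr = float(np.mean(healthy >= t))       # 1 - specificity
        points.append((fpr, tpr))
    return points

diseased = np.array([4.2, 5.1, 6.3, 7.0])        # hypothetical test results
healthy  = np.array([1.0, 2.2, 3.1, 4.5])
print(roc_points(diseased, healthy))
```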

  31. ROC curve comparison
  (Figures: two ROC curves on True Positive Rate vs. False Positive Rate axes; "A poor test" stays near the diagonal, "A good test" bows toward the top-left corner.)

  32. ROC curve extremes
  (Figures: the best test reaches the top-left corner because the two distributions don't overlap at all; the worst test follows the diagonal because the distributions overlap completely.)

  33. Area under ROC curve (AUC)
  • Overall measure of test performance
  • Comparisons between two tests are based on differences between their (estimated) AUCs
  • For continuous data, AUC is equivalent to the Mann-Whitney U-statistic (a nonparametric test of difference in location between two populations)
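A sketch of the Mann-Whitney view of AUC: the probability that a randomly chosen diseased score exceeds a randomly chosen healthy one (my own illustration; ties count half):

```python
def auc(diseased, healthy):
    """Fraction of (diseased, healthy) pairs ranked correctly."""
    wins = sum(1.0 if d > h else 0.5 if d == h else 0.0
               for d in diseased for h in healthy)
    return wins / (len(diseased) * len(healthy))

print(auc([4.2, 5.1, 6.3, 7.0], [1.0, 2.2, 3.1, 4.5]))  # 0.9375
```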

  34. AUC for ROC curves
  (Figures: four example ROC curves with AUC = 100%, 90%, 65%, and 50%, all plotted as True Positive Rate vs. False Positive Rate.)

  35. K-fold Cross-Validation
  • Randomly sort the data
  • Divide it into k folds (e.g. k = 10)
  • Use one fold for validation and the remaining folds for training
  • Average the accuracy over the k runs
