Understanding string distances

  1. Understanding string distances. Intermediate Regular Expressions in R. Angelo Zehr, Data Journalist

  2. What is a string distance?

  3. What is a string distance?

  4. Real world applications

  5. (no text on this slide)

  6. String distances in R

         library(stringdist)
         stringdist("saturday", "sunday", method = "lv")

     Returns: 3

     The distance is symmetric, so swapping the arguments gives the identical result:

         stringdist("sunday", "saturday", method = "lv")
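To make the result on slide 6 concrete, here is a minimal base-R sketch of the Levenshtein distance as a dynamic program. The function name `levenshtein` is made up for this illustration and is not part of the stringdist package; `utils::adist()` from base R is used only as a cross-check.

```r
# Hypothetical helper for illustration: Levenshtein distance in base R.
levenshtein <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  # d[i + 1, j + 1] is the distance between the first i characters of a
  # and the first j characters of b.
  d <- matrix(0L, nrow = length(x) + 1, ncol = length(y) + 1)
  d[, 1] <- 0:length(x)
  d[1, ] <- 0:length(y)
  for (i in seq_along(x)) {
    for (j in seq_along(y)) {
      cost <- as.integer(x[i] != y[j])
      d[i + 1, j + 1] <- min(
        d[i, j + 1] + 1L,  # delete a character from a
        d[i + 1, j] + 1L,  # insert a character into a
        d[i, j] + cost     # substitute (free when the characters match)
      )
    }
  }
  d[length(x) + 1, length(y) + 1]
}

levenshtein("saturday", "sunday")          # 3
levenshtein("sunday", "saturday")          # 3, the distance is symmetric
utils::adist("saturday", "sunday")[1, 1]   # 3, base-R cross-check
```

The three edits are: delete "a", delete "t", and substitute "r" with "n".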

  7. Finding a match

         amatch(
           x = "Sonday",
           table = c("Friday", "Saturday", "Sunday"),
           maxDist = 1,
           method = "lv"
         )

     Returns: 3 (the position of "Sunday" in table)
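A small sketch of what happens when no entry lies within `maxDist`, assuming the stringdist package is installed. The second input, "Monday", is an example added here: it is two edits away from "Sunday", so with `maxDist = 1` no match is reported.

```r
library(stringdist)

days <- c("Friday", "Saturday", "Sunday")

# amatch() is vectorised over x and returns positions in `table`.
idx <- amatch(c("Sonday", "Monday"), table = days, maxDist = 1, method = "lv")

idx        # 3 NA: "Monday" has no entry within one edit
days[idx]  # "Sunday" NA
```

Indexing the lookup table with the result is a common way to turn the positions back into cleaned values.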

  8. Let's practice!

  9. Methods of string distances. Angelo Zehr, Data Journalist

  10. Damerau-Levenshtein

  11. Method abbreviations

      Regular Levenshtein distance:

          stringdist(a, b, method = "lv")

      Damerau-Levenshtein distance:

          stringdist(a, b, method = "dl")

      Optimal String Alignment distance:

          stringdist(a, b, method = "osa")
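The three methods on slide 11 differ only in how they treat swapped neighbouring characters. The sketch below, assuming the stringdist package is installed, uses two small string pairs chosen for illustration (they are not from the slides).

```r
library(stringdist)

# A swap of adjacent characters costs two edits under plain Levenshtein,
# but one edit under OSA and Damerau-Levenshtein.
stringdist("ab", "ba", method = "lv")    # 2
stringdist("ab", "ba", method = "osa")   # 1
stringdist("ab", "ba", method = "dl")    # 1

# OSA never edits the same substring twice; full Damerau-Levenshtein may,
# so the two can disagree.
stringdist("ca", "abc", method = "osa")  # 3
stringdist("ca", "abc", method = "dl")   # 2
```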

  12. Q-Grams (or n-grams)

  13. Q-Grams (or n-grams)

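  14. Inspecting q-grams

          qgrams("Honolulu", "Hanolulu", q = 2)

      Returns:

             Ho on ul no ol lu la
          V1  1  1  1  1  1  2  0
          V2  1  1  1  1  1  1  1

      Note: the counts shown are those for "Honolulu" versus "Honolula" (the second "lu" becomes "la"). For "Hanolulu", the second row would instead have 0 for "Ho" and "on", 2 for "lu", and counts of 1 for the extra bigrams "Ha" and "an".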

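  15. Method abbreviations

      The values below are for a = "Honolulu" and b = "Hanolulu" with bigrams. stringdist() defaults to q = 1, so q = 2 has to be set explicitly.

      Sum of the q-gram counts that are not shared:

          stringdist(a, b, method = "qgram", q = 2)    # equals 4

      Jaccard distance, the distinct q-grams that are not shared divided by all distinct q-grams:

          stringdist(a, b, method = "jaccard", q = 2)  # equals 0.5

      Cosine distance between the two q-gram count vectors:

          stringdist(a, b, method = "cosine", q = 2)   # equals 0.22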
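The three q-gram based values quoted on slide 15 (4, 0.5 and 0.22) can be reproduced by hand for "Honolulu" and "Hanolulu" with bigrams (q = 2). The following base-R sketch does this; the helper `qgram_list` is a made-up name for this illustration and assumes the string has at least q characters.

```r
# Hypothetical helper: all overlapping q-grams of a single string.
qgram_list <- function(s, q = 2) {
  n <- nchar(s)
  substring(s, 1:(n - q + 1), q:n)
}

ga <- qgram_list("Honolulu")
gb <- qgram_list("Hanolulu")

# Count every q-gram that occurs in either string.
grams <- union(ga, gb)
va <- as.vector(table(factor(ga, levels = grams)))
vb <- as.vector(table(factor(gb, levels = grams)))

profile <- rbind(a = va, b = vb)
colnames(profile) <- grams
profile

# q-gram distance: sum of absolute count differences
qgram_dist <- sum(abs(va - vb))                                   # 4

# Jaccard distance: 1 - shared distinct q-grams / all distinct q-grams
jaccard_dist <- 1 - sum(va > 0 & vb > 0) / sum(va > 0 | vb > 0)   # 0.5

# Cosine distance: 1 - cosine similarity of the count vectors
cosine_dist <- 1 - sum(va * vb) / sqrt(sum(va^2) * sum(vb^2))     # 0.2222...
```

Only the bigrams "Ho", "on", "Ha" and "an" differ between the two strings, which is where the q-gram distance of 4 comes from.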

  16. Let's practice!

  17. Fuzzy joins. Angelo Zehr, Instructor

  18. A regular join

  19. A fuzzy join

  20. The fuzzyjoin package

          library(fuzzyjoin)

          stringdist_join(
            user_input,
            database,
            by = c("user_input" = "name"),
            method = "lv",
            max_dist = 1,
            distance_col = "distance"
          )
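A runnable sketch of the call on slide 20, assuming the fuzzyjoin package is installed. The two data frames are invented for illustration; only their column names (`user_input` and `name`) follow the slide.

```r
library(fuzzyjoin)

# Hypothetical data: free-text entries and a clean lookup table.
user_input <- data.frame(
  user_input = c("Sonday", "Fryday", "Wednesday"),
  stringsAsFactors = FALSE
)
database <- data.frame(
  name = c("Friday", "Saturday", "Sunday"),
  stringsAsFactors = FALSE
)

matches <- stringdist_join(
  user_input,
  database,
  by = c("user_input" = "name"),
  method = "lv",
  max_dist = 1,                # keep pairs at most one edit apart
  distance_col = "distance"    # store the computed distance in the result
)

matches
```

Because `stringdist_join()` performs an inner join by default, "Wednesday" drops out of the result: no entry in `database` is within one edit. `stringdist_left_join()` would keep it with missing values instead.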

  21. stringdist_join: Result

  22. Let's practice!

  23. Custom Fuzzy Matching. Angelo Zehr, Data Journalist

  24. Combining two fuzzy matches

  25. Combining two fuzzy matches

  26. Fuzzy matches: Helper functions

      For the string comparison:

          small_str_distance <- function(left, right) {
            stringdist(left, right) <= 5
          }

      For the number comparison:

          close_to_each_other <- function(left, right) {
            abs(left - right) <= 3
          }

  27. The fuzzy join

          fuzzy_left_join(
            a, b,
            by = c(
              "title" = "prod_title",
              "year" = "prod_year"
            ),
            match_fun = c(
              "title" = small_str_distance,
              "year" = close_to_each_other
            )
          )
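The helper functions from slide 26 and the join from slide 27 can be tried end to end as sketched below, assuming the fuzzyjoin and stringdist packages are installed. The contents of `a` and `b` are invented for illustration; only the column names follow the slides. The match functions are passed as a named `list()`, which is equivalent to the `c()` call on the slide.

```r
library(fuzzyjoin)
library(stringdist)

# Helper functions as defined on slide 26
small_str_distance <- function(left, right) {
  stringdist(left, right) <= 5
}
close_to_each_other <- function(left, right) {
  abs(left - right) <= 3
}

# Hypothetical data, made up for this example
a <- data.frame(
  title = c("The Matrix", "Inception"),
  year = c(1999, 2010),
  stringsAsFactors = FALSE
)
b <- data.frame(
  prod_title = c("The Matrx", "Inception"),
  prod_year = c(2000, 1995),
  stringsAsFactors = FALSE
)

# A row pair matches only if BOTH conditions hold.
result <- fuzzy_left_join(
  a, b,
  by = c("title" = "prod_title", "year" = "prod_year"),
  match_fun = list(
    "title" = small_str_distance,
    "year" = close_to_each_other
  )
)

result
```

In this made-up data, "The Matrix" (1999) pairs with "The Matrx" (2000), while "Inception" (2010) stays unmatched: the titles are identical, but the years are 15 apart, so the left join fills the `prod_` columns with `NA`.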

  28. The fuzzy join: The result

  29. Let's practice!

  30. Congratulations. Angelo Zehr, Data Journalist

  31. A look back

      1. Regular expressions: writing custom patterns
         str_view(), str_match(), str_detect(), ...
      2. Creating strings with data
         glue(), glue_collapse(), ...
      3. Extracting structured data from text
         str_extract_all(), extract(), ...
      4. Similarities between strings
         stringdist(), amatch(), stringdist_join()

  32. Next courses

  33. Thank you!
