Understanding string distances: Intermediate Regular Expressions in R

SLIDE 1

Understanding string distances

INTERMEDIATE REGULAR EXPRESSIONS IN R

Angelo Zehr

Data Journalist

SLIDE 2

What is a string distance?

SLIDE 3

What is a string distance?

SLIDE 4

Real world applications

SLIDE 5

SLIDE 6

String distances in R

library(stringdist)

stringdist("saturday", "sunday", method = "lv")

Returns:

3

The distance is symmetric, so swapping the arguments returns the same result:

stringdist("sunday", "saturday", method = "lv")
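The symmetry claim can be checked directly. The following is a minimal sketch, assuming the stringdist package is installed; the cross-check with base R's adist() is an addition for illustration and not part of the slides.

```r
library(stringdist)

# Levenshtein distance: the minimum number of single-character
# insertions, deletions and substitutions that turn one string
# into the other.
d_forward  <- stringdist("saturday", "sunday", method = "lv")
d_backward <- stringdist("sunday", "saturday", method = "lv")

# Base R ships its own Levenshtein implementation, utils::adist(),
# which returns a distance matrix (here 1 x 1).
d_base <- as.integer(adist("saturday", "sunday"))

print(c(forward = d_forward, backward = d_backward, base = d_base))
```

All three values should agree, since deleting "a" and "t" and substituting "r" with "n" is one shortest edit script in either direction.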

SLIDE 7

Finding a match

amatch(
  x = "Sonday",
  table = c("Friday", "Saturday", "Sunday"),
  maxDist = 1,
  method = "lv"
)

Returns:

3
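amatch() returns a position in the lookup table, not the matched string. A short sketch of how that index might be used for several inputs at once, assuming stringdist is installed; the vector of typed values is hypothetical.

```r
library(stringdist)

lookup <- c("Friday", "Saturday", "Sunday")
typed  <- c("Sonday", "Fryday", "Mnday")  # hypothetical user input

# One index per input; NA when no table entry is within maxDist edits.
idx <- amatch(typed, lookup, maxDist = 1, method = "lv")

# Use the index to pull the matched strings back out of the table.
result <- data.frame(typed = typed, matched = lookup[idx])
print(result)
```

"Mnday" is two edits away from "Sunday", so with maxDist = 1 it stays unmatched and shows up as NA.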

SLIDE 8

Let's practice!

SLIDE 9

Methods of string distances

Angelo Zehr

Data Journalist

SLIDE 10

Damerau-Levenshtein

SLIDE 11

Method abbreviations

Regular Levenshtein distance:

stringdist(a, b, method = "lv")

Damerau-Levenshtein distance:

stringdist(a, b, method = "dl")

Optimal String Alignment distance:

stringdist(a, b, method = "osa")
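The three methods differ in how they treat swapped neighbouring characters. A small sketch, assuming stringdist is installed; the example strings are chosen for illustration and do not come from the slides.

```r
library(stringdist)

# A single swap of two neighbouring characters: plain Levenshtein has no
# transposition operation and needs two substitutions, while OSA and
# Damerau-Levenshtein count the swap as one edit.
swap <- c(
  lv  = stringdist("ab", "ba", method = "lv"),
  osa = stringdist("ab", "ba", method = "osa"),
  dl  = stringdist("ab", "ba", method = "dl")
)
print(swap)

# OSA may not edit the same substring twice, full Damerau-Levenshtein may:
# "ba" -> "ab" (swap) -> "acb" (insert "c") is two edits under "dl" only.
twice <- c(
  osa = stringdist("ba", "acb", method = "osa"),
  dl  = stringdist("ba", "acb", method = "dl")
)
print(twice)
```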

SLIDE 12

Q-Grams (or n-grams)

SLIDE 13

Q-Grams (or n-grams)

SLIDE 14

Inspecting q-grams

qgrams("Honolulu", "Hanolulu", q = 2)

Returns:

   Ho on no ol lu ul Ha an
V1  1  1  1  1  2  1  0  0
V2  0  0  1  1  2  1  1  1

SLIDE 15

Method abbreviations

Sum of q-gram counts that are not shared:

stringdist(a, b, method = "qgram", q = 2) # equals 4

Number of distinct q-grams that are not shared, divided by the total number of distinct q-grams:

stringdist(a, b, method = "jaccard", q = 2) # equals 0.5

One minus the cosine of the angle between the two q-gram count vectors:

stringdist(a, b, method = "cosine", q = 2) # equals 0.22
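The three values quoted above can be reproduced by hand in base R. This sketch assumes that a and b are "Honolulu" and "Hanolulu" and that bigrams (q = 2) are used; the helper qgram_list() is hypothetical and not part of stringdist.

```r
# Split a string into its overlapping q-grams.
qgram_list <- function(s, q = 2) {
  starts <- seq_len(nchar(s) - q + 1)
  substring(s, starts, starts + q - 1)
}

ga <- qgram_list("Honolulu")
gb <- qgram_list("Hanolulu")

# Count each q-gram on a common set of levels so the vectors line up.
grams <- union(ga, gb)
va <- as.vector(table(factor(ga, levels = grams)))
vb <- as.vector(table(factor(gb, levels = grams)))

# q-gram distance: sum of absolute count differences.
qgram_dist <- sum(abs(va - vb))

# Jaccard distance: 1 - shared distinct q-grams / all distinct q-grams.
jaccard_dist <- 1 - sum(va > 0 & vb > 0) / sum(va > 0 | vb > 0)

# Cosine distance: 1 - cosine of the angle between the count vectors.
cosine_dist <- 1 - sum(va * vb) / sqrt(sum(va^2) * sum(vb^2))

print(c(qgram = qgram_dist, jaccard = jaccard_dist, cosine = round(cosine_dist, 2)))
```

Only "Ho", "on", "Ha" and "an" differ between the two strings, which gives 4 for the q-gram distance, 4/8 for Jaccard and 1 - 7/9 for cosine.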

SLIDE 16

Let's practice!

SLIDE 17

Fuzzy joins

Angelo Zehr

Instructor

SLIDE 18

A regular join

SLIDE 19

A fuzzy join

SLIDE 20

The fuzzyjoin package

library(fuzzyjoin)

stringdist_join(
  user_input, database,
  by = c("user_input" = "name"),
  method = "lv",
  max_dist = 1,
  distance_col = "distance"
)
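To see what such a call produces, here is a self-contained sketch with two tiny data frames. The data are hypothetical and only mirror the column names used in the call; it assumes the fuzzyjoin package is installed.

```r
library(fuzzyjoin)

# Hypothetical inputs: free-text entries and a clean reference table.
user_input <- data.frame(
  user_input = c("Sonday", "Fridai", "Mnday"),
  stringsAsFactors = FALSE
)
database <- data.frame(
  name = c("Friday", "Saturday", "Sunday"),
  stringsAsFactors = FALSE
)

# Inner join (the default mode): keep only pairs whose Levenshtein
# distance is at most 1 and record that distance in a new column.
matches <- stringdist_join(
  user_input, database,
  by = c("user_input" = "name"),
  method = "lv",
  max_dist = 1,
  distance_col = "distance"
)
print(matches)
```

"Mnday" has no reference entry within one edit, so it drops out of the result; a left join variant would keep it with missing values instead.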

SLIDE 21

stringdist_join: Result

SLIDE 22

Let's practice!

SLIDE 23

Custom Fuzzy Matching

Angelo Zehr

Data Journalist

SLIDE 24

Combining two fuzzy matches

SLIDE 25

Combining two fuzzy matches

SLIDE 26

Fuzzy matches: Helper functions

For the string comparison:

small_str_distance <- function(left, right) {
  stringdist(left, right) <= 5
}

For the number comparison:

close_to_each_other <- function(left, right) {
  abs(left - right) <= 3
}

SLIDE 27

The fuzzy join

fuzzy_left_join(
  a, b,
  by = c(
    "title" = "prod_title",
    "year" = "prod_year"
  ),
  match_fun = c(
    "title" = small_str_distance,
    "year" = close_to_each_other
  )
)
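A runnable sketch of the combined match, assuming stringdist and fuzzyjoin are installed. The two data frames are hypothetical stand-ins for a and b, and the match functions are passed as a named list, which is equivalent to combining them with c().

```r
library(stringdist)
library(fuzzyjoin)

# Helper predicates: each returns TRUE where a pair counts as a match.
small_str_distance  <- function(left, right) stringdist(left, right) <= 5
close_to_each_other <- function(left, right) abs(left - right) <= 3

# Hypothetical data: a holds our records, b holds production records.
a <- data.frame(
  title = c("The Godfather", "Alien"),
  year  = c(1972, 1979),
  stringsAsFactors = FALSE
)
b <- data.frame(
  prod_title = c("Godfather", "Aliens"),
  prod_year  = c(1971, 1986),
  stringsAsFactors = FALSE
)

# A row pair is joined only if BOTH predicates hold for it.
joined <- fuzzy_left_join(
  a, b,
  by = c("title" = "prod_title", "year" = "prod_year"),
  match_fun = list(title = small_str_distance, year = close_to_each_other)
)
print(joined)
```

"Alien" and "Aliens" are one edit apart, but their years differ by seven, so the left join keeps "Alien" with missing production columns.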

SLIDE 28

The fuzzy join: The result

SLIDE 29

Let's practice!

SLIDE 30

Congratulations

Angelo Zehr

Data Journalist

SLIDE 31

A look back

  • 1. Regular Expressions: Writing custom patterns

str_view(), str_match(), str_detect(), ...

  • 2. Creating strings with data

glue(), glue_collapse(), ...

  • 3. Extracting structured data from text

str_extract_all(), extract(), ...

  • 4. Similarities between strings

stringdist(), amatch(), stringdist_join()

SLIDE 32

Next courses

SLIDE 33

Thank you!