Welcome: INTERMEDIATE REGULAR EXPRESSIONS IN R, Angelo Zehr



SLIDE 1

Welcome

INTERMEDIATE REGULAR EXPRESSIONS IN R

Angelo Zehr

Data Journalist

SLIDE 2

INTERMEDIATE REGULAR EXPRESSIONS IN R

Where you might have left off

SLIDE 3

From Rebus to writing custom expressions

Does "cat" start with "c"? The rebus way:

str_detect("cat", pattern = START %R% "c")

Regular expression:

str_detect("cat", pattern = "^c")

SLIDE 4

Prerequisites: stringr

str_detect(string, pattern)
str_match(string, pattern)
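As a rough sketch of how the two prerequisite functions differ (the input "cat" is borrowed from the previous slide; the variable names are illustrative only): str_detect() returns a logical value, while str_match() returns a character matrix holding the matched text.

```r
library(stringr)

# str_detect() answers a yes/no question for each input string.
starts_with_c <- str_detect("cat", pattern = "^c")
print(starts_with_c)

# str_match() returns a character matrix; column 1 holds the full match.
first_letter <- str_match("cat", pattern = "^c")
print(first_letter)
```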

SLIDE 5

What regular expressions will help you achieve

SLIDE 6

What regular expressions will help you achieve

SLIDE 7

Our first dataset

movie_titles <- c(
  "Karate Kid", "The Twilight Saga: Eclispe", "Knight & Day",
  "Shrek Forever After (3D)", "Marmaduke.", "Predators",
  "StreetDance (3D)", "Robin Hood", "Micmacs A Tire-Larigot",
  "Sex And the City 2", ...

movie_titles[
  str_detect(
    movie_titles,
    pattern = "^K"
  )
]

"Karate Kid", "Knight & Day", ...
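The filtering step above can also be sketched with str_subset(), which keeps the matching strings directly. The four-title vector below is a hypothetical excerpt, since the slide elides the rest of the dataset.

```r
library(stringr)

# Hypothetical excerpt of the slide's (elided) movie_titles vector.
titles <- c("Karate Kid", "Knight & Day", "Predators", "Robin Hood")

# Indexing with the logical vector from str_detect(), as on the slide.
by_index <- titles[str_detect(titles, pattern = "^K")]

# str_subset() detects and subsets in one call.
by_subset <- str_subset(titles, pattern = "^K")

print(by_subset)
```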

SLIDE 8

Special characters in regular expressions

Special character   Meaning
^                   Caret: marks the beginning of a line or string
$                   Dollar sign: marks the end of a line or string
.                   Period: matches any single character: letters, numbers or white space (by default, anything except a line break)
\\.                 Two backslashes: escape the period when we search for an actual period

SLIDE 9

For example

Code                         Result
str_match("Book", "^.")      Will match "B"
str_match("Book", ".$")      Will match "k"
str_match("Book", "\\.")     No match
str_match("Book.", "\\.")    Will match "."
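A small sketch of why the escape matters, using two titles from the dataset slide. The variable names are illustrative, and fixed() is shown only as one possible alternative to escaping.

```r
library(stringr)

titles <- c("Marmaduke.", "Predators")

# Unescaped, "." matches any character, so both titles match ".$".
any_last_char <- str_detect(titles, pattern = ".$")

# Escaped with two backslashes, only a literal trailing period matches.
literal_period <- str_detect(titles, pattern = "\\.$")

# fixed() treats the pattern as plain text, so no escaping is needed.
fixed_period <- str_detect(titles, pattern = fixed("."))

print(literal_period)
```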

SLIDE 10

Let's practice!

INTERMEDIATE REGULAR EXPRESSIONS IN R

SLIDE 11

Character classes and repetitions

INTERMEDIATE REGULAR EXPRESSIONS IN R

Angelo Zehr

Data Journalist

SLIDE 12

Available character classes

Character class           Example
\\d or [:digit:]          0, 1, 2, 3, …
\\w or [:word:]           a, b, c, …, 1, 2, 3, …, _
[A-Za-z] or [:alpha:]     A, B, C, …, a, b, c, …
[aeiou]                   either a, e, i, o or u
\\s or [:space:]          " ", tabs or line breaks

SLIDE 13

A concrete example

str_match_all()               Result
"Hi John_35", "\\d"           "3", "5"
"Hi John_35", "\\w"           "H", "i", "J", "o", "h", "n", "_", "3", "5"
"Hi John_35", "[A-Za-z]"      "H", "i", "J", "o", "h", "n"
"Hi John_35", "[aeiou]"       "i", "o"
"Hi John_35", "\\s"           " "

SLIDE 14

Repetitions

Syntax       Meaning
\\w{2}       exactly 2 times
\\w{2,3}     minimum 2 times, maximum 3 times
\\w{2,}      minimum 2 times, but no maximum
\\w+         1 or more repetitions
\\w*         0, 1 or more repetitions
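To see the quantifiers at work, here is a minimal sketch on the "Hi John_35" string from the previous slide. It uses str_extract() and str_extract_all(), which return the matched text as character vectors; the variable names and the Toy Story inputs are chosen for illustration.

```r
library(stringr)

greeting <- "Hi John_35"

# Exactly two digits in a row.
two_digits <- str_extract(greeting, "\\d{2}")

# One or more word characters: whole "words".
whole_words <- str_extract_all(greeting, "\\w+")[[1]]

# Between two and three word characters; each match takes as many as it can.
short_chunks <- str_extract_all(greeting, "\\w{2,3}")[[1]]

# "*" also allows zero repetitions, so a title without a number still matches.
sequels <- str_extract(c("Toy Story", "Toy Story 3"), "Toy Story\\s*\\d*")

print(whole_words)
print(short_chunks)
```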

SLIDE 15

Inversion of character classes

Original                         Negation
\\d       match digits           \\D         match all but digits
\\w       match word characters  \\W         match all but word characters
\\s       match spaces           \\S         match all but spaces
[a-zA-Z]  match alphabet         [^a-zA-Z]   match all but alphabet
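Negated classes are handy for stripping unwanted characters. The sketch below reuses the "Hi John_35" string; str_remove_all() is one way to drop every match, and the variable names are illustrative.

```r
library(stringr)

greeting <- "Hi John_35"

# Everything that is not a letter.
non_letters <- str_extract_all(greeting, "[^a-zA-Z]")[[1]]

# Remove all non-word characters (here, only the space).
word_chars_only <- str_remove_all(greeting, "\\W")

# Remove all non-digits, leaving just the number.
digits_only <- str_remove_all(greeting, "\\D")

print(non_letters)
```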

SLIDE 16

Custom pattern with classes

str_match_all("Toy Story 3", "[\\d\\s]")

Result:

     [,1]
[1,] " "
[2,] " "
[3,] "3"
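A short sketch of how one might work with that result: str_match_all() returns a list with one matrix per input string, so the matches are pulled out of the first list element. str_count() is shown as a possible shortcut when only the number of matches is needed.

```r
library(stringr)

title <- "Toy Story 3"

# A list with one character matrix per input string.
matches <- str_match_all(title, "[\\d\\s]")

# Column 1 of the first matrix holds the full matches.
hits <- matches[[1]][, 1]

# Count the matches without extracting them.
n_hits <- str_count(title, "[\\d\\s]")

print(hits)
```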

SLIDE 17

Let's practice!

INTERMEDIATE REGULAR EXPRESSIONS IN R

SLIDE 18

The pipe and the question mark

INTERMEDIATE REGULAR EXPRESSIONS IN R

Angelo Zehr

Instructor

SLIDE 19

This or that

lines <- c(
  "Karate Kid 2, Distributor: Columbia, 58 Screens",
  "Finding Nemo, Distributors: Pixar and Disney, 10 Screens",
  "Finding Harmony, Distributor: Unknown, 1 Screen",
  "Finding Dory, Distributors: Pixar and Disney, 8 Screens"
)

str_detect(lines, "Columbia|Pixar")

TRUE TRUE FALSE TRUE
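To find out which alternative matched rather than only whether one did, the same pattern can be passed to str_extract(). This is a sketch on the slide's lines vector; the variable name found is illustrative.

```r
library(stringr)

lines <- c(
  "Karate Kid 2, Distributor: Columbia, 58 Screens",
  "Finding Nemo, Distributors: Pixar and Disney, 10 Screens",
  "Finding Harmony, Distributor: Unknown, 1 Screen",
  "Finding Dory, Distributors: Pixar and Disney, 8 Screens"
)

# Returns the first match per line, or NA where neither name occurs.
found <- str_extract(lines, "Columbia|Pixar")

print(found)
```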

SLIDE 20

Making things optional

str_view(lines, pattern = "Distributor|Distributors")
str_view(lines, pattern = "Distributors?")
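Both patterns flag the same lines, but they can capture different text: alternatives are tried from left to right, so the shorter "Distributor" wins before "Distributors" is ever attempted. A sketch with two shortened, hypothetical labels:

```r
library(stringr)

# Hypothetical shortened versions of the slide's lines.
labels <- c("Distributor: Columbia", "Distributors: Pixar and Disney")

# The optional "s" is included in the match whenever it is present.
with_optional <- str_extract(labels, "Distributors?")

# The first alternative already matches, so the trailing "s" is never captured.
with_pipe <- str_extract(labels, "Distributor|Distributors")

print(with_optional)
print(with_pipe)
```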

SLIDE 21

Greedy vs. lazy

str_view("Toy Story 3 In Disney Digital 3D", ".*3")
str_view("Toy Story 3 In Disney Digital 3D", ".*?3")
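Since str_view() only highlights matches interactively, here is the same comparison sketched with str_extract(), so the matched text can be inspected directly. The variable names are illustrative.

```r
library(stringr)

title <- "Toy Story 3 In Disney Digital 3D"

# Greedy: ".*" grabs as much as possible, so the match runs to the last "3".
greedy <- str_extract(title, ".*3")

# Lazy: ".*?" grabs as little as possible, so the match stops at the first "3".
lazy <- str_extract(title, ".*?3")

print(greedy)
print(lazy)
```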

SLIDE 22

Let's practice!

INTERMEDIATE REGULAR EXPRESSIONS IN R