Strings, string patterns using regular expressions (Steve Bagley)

SLIDE 1

Strings, string patterns using regular expressions

Steve Bagley

somgen223.stanford.edu 1

SLIDE 2

Strings

SLIDE 3

Strings in R

s <- c("a", "ab", "abcdef")
length(s)
[1] 3
str_length(s)
[1] 1 2 6

  • length is the length of the vector, not the length of the strings in the vector.
  • We will use the stringr functions (prefix str_) for string manipulation.
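
A minimal sketch of the distinction between vector length and string length, assuming the stringr package is installed. The variable names are illustrative and not from the slides.

```r
library(stringr)

s <- c("a", "ab", "abcdef")

n_elements <- length(s)      # number of elements in the vector: 3
n_chars    <- str_length(s)  # characters in each element: 1 2 6
total      <- sum(n_chars)   # characters across the whole vector: 9
```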

SLIDE 4

Building strings: str_c

str_c("abc", "def")
[1] "abcdef"
str_c("abc", 123)
[1] "abc123"
str_c("abc", 1:3)
[1] "abc1" "abc2" "abc3"
str_c("abc", 1:3, sep = "_")
[1] "abc_1" "abc_2" "abc_3"

SLIDE 5

sep vs collapse

str_c("abc", 1:3, sep = "+")
[1] "abc+1" "abc+2" "abc+3"
str_c("abc", 1:3, sep = "+", collapse = " * ")
[1] "abc+1 * abc+2 * abc+3"

  • sep separates elementwise.
  • collapse (if supplied) joins the elementwise results into a single string, and is inserted between them.
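
As a further sketch of the two arguments (the inputs here are made up for illustration), collapse can also be used on its own to join a single vector into one string:

```r
library(stringr)

# collapse alone: join the elements of one vector into a single string
joined <- str_c(c("a", "b", "c"), collapse = ", ")
# joined is "a, b, c"

# sep is applied first (elementwise), then collapse joins the results
formula_text <- str_c("x", 1:3, sep = "_", collapse = " + ")
# formula_text is "x_1 + x_2 + x_3"
```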

SLIDE 6

Converting case

str_to_upper("this is a test")
[1] "THIS IS A TEST"
str_to_lower("THIS IS A TEST")
[1] "this is a test"
str_to_title("THIS IS A TEST")
[1] "This Is A Test"
str_to_sentence("THIS IS A TEST")
[1] "This is a test"

SLIDE 7

Substrings

s <- "this is a test"
str_sub(s, 11, 14)
[1] "test"
str_sub(s, 3)
[1] "is is a test"

  • str_sub returns a substring.
  • It returns from the first position (counting from 1) to the second position (or the end of the string if not supplied).
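
Beyond what the slide shows, str_sub also accepts negative positions, which count backwards from the end of the string. A short sketch, with illustrative variable names:

```r
library(stringr)

s <- "this is a test"

first_word <- str_sub(s, 1, 4)    # "this"
last_word  <- str_sub(s, -4, -1)  # "test": -1 is the last character

# start and end are vectorized, so this returns one character per position
first_four <- str_sub(s, 1:4, 1:4)  # "t" "h" "i" "s"
```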

SLIDE 8

Regular expressions

SLIDE 9

Regular expressions: patterns in strings

  • Strings often have some kind of internal pattern.
  • Example: phone number 650-723-2300 (area code, exchange/prefix, line number)
  • Example: zip code 94305 or 94305-1234
  • We need some way to describe these patterns: regular expressions

SLIDE 10

Regular expressions

  • Regular expressions (“regexes” or “regexps”) are a powerful (but confusing) language for expressing character-level patterns.
  • The general idea is that some characters match themselves, but other characters take on special meaning and control how other characters are matched. There needs to be some way to specify that those special characters can appear literally, which leads to some odd quoting conventions.
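
To make the quoting conventions concrete, here is a small sketch using the period, a special character that these slides do not otherwise cover: in a regex it matches any single character. Two common ways to match it literally are a backslash escape (written with two backslashes inside an R string) and stringr's fixed() wrapper. The input vector is invented for illustration.

```r
library(stringr)

x <- c("a.c", "abc")

any_char <- str_detect(x, "a.c")         # "." matches any character: TRUE TRUE
escaped  <- str_detect(x, "a\\.c")       # escaped "." is literal:    TRUE FALSE
literal  <- str_detect(x, fixed("a.c"))  # no regex interpretation:   TRUE FALSE
```

The doubled backslash is needed because R's string syntax consumes one backslash before the regex engine sees the pattern.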

SLIDE 11

Most characters match only themselves

str_detect("ac", "ac") # entire pattern matches
[1] TRUE
str_detect("bc", "ac") # entire pattern does not match
[1] FALSE
str_detect("aacc", "ac") # entire pattern is contained in string
[1] TRUE
str_detect("ccaa", "ac") # pattern not contained in string
[1] FALSE
str_detect("abcde", "ac") # pattern not contained in string
[1] FALSE

  • str_detect takes two arguments: a string, and a pattern.
  • It returns TRUE if it finds the complete pattern anywhere in the string.

SLIDE 12

str_detect accepts vectors

str_detect(c("ac", "bc", "aacc", "ccaa", "abcde"), "ac")
[1] TRUE FALSE TRUE FALSE FALSE

  • These are the same examples as on the previous slide, but all in one vector for the string argument.

SLIDE 13

Matching one character from a set of characters

str_detect("ac", "[ab]c")
[1] TRUE
str_detect("bc", "[ab]c")
[1] TRUE
str_detect("cc", "[ab]c")
[1] FALSE
str_detect("abc", "[ab]c")
[1] TRUE
str_detect("bac", "[ab]c")
[1] TRUE
str_detect("cabc", "[ab]c")
[1] TRUE

  • The bracket expression [ ] matches when any one character contained in the brackets occurs in the string at that position.

  • In this example, [ab] will match either a or b.
  • This raises the question of how we would match a bracket character.
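
One possible answer to that question, offered as a sketch rather than as the approach the course takes: escape the bracket with a backslash, written as two backslashes inside an R string. The example strings are hypothetical.

```r
library(stringr)

x <- c("x[1]", "x1")

has_open  <- str_detect(x, "\\[")          # literal "[": TRUE FALSE
has_close <- str_detect(x, "\\]")          # literal "]": TRUE FALSE

# a literal "[", then one digit from the set [0-9], then a literal "]"
has_index <- str_detect(x, "\\[[0-9]\\]")  # TRUE FALSE
```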

SLIDE 14

Anchors

str_detect("aacc", "ac")
[1] TRUE
str_detect("aacc", "^aa")
[1] TRUE
str_detect("aacc", "^ac")
[1] FALSE
str_detect("aacc", "^aacc")
[1] TRUE
str_detect("aacc", "cc")
[1] TRUE
str_detect("aacc", "cc$")
[1] TRUE
str_detect("aacc", "ac$")
[1] FALSE
str_detect("aacc", "^aacc$")
[1] TRUE

  • ^ matches only at the beginning of the string
  • $ matches only at the end of the string
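
A short sketch of anchors used for filtering, with stringr's str_subset, which returns the elements of a vector that match a pattern. The vector of names is made up for illustration.

```r
library(stringr)

fruit_names <- c("apple", "banana", "grape", "pineapple")

contains_apple <- str_subset(fruit_names, "apple")    # "apple" "pineapple"
starts_with_a  <- str_subset(fruit_names, "^a")       # "apple"
ends_with_e    <- str_subset(fruit_names, "e$")       # "apple" "grape" "pineapple"
exactly_apple  <- str_subset(fruit_names, "^apple$")  # "apple"
```

Using both anchors together forces the pattern to account for the entire string, not just part of it.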

SLIDE 15

Exercise

Explain this:

str_detect("aacc", "^[ab][ac][bc][ab]$")
[1] FALSE

SLIDE 16

Answer

str_detect("aacc", "^[ab][ac][bc][ab]$")
[1] FALSE

  • Starting at beginning of the string, we need:
  • a or b (true), and
  • a or c (true), and
  • b or c (true), and
  • a or b (false)
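
The position-by-position reasoning above can be reproduced in code. This is only an illustrative sketch: it splits the string into single characters and tests each one against the bracket expression for that position, relying on str_sub and str_detect both being vectorized.

```r
library(stringr)

s <- "aacc"
classes <- c("[ab]", "[ac]", "[bc]", "[ab]")

# one character per position: "a" "a" "c" "c"
chars <- str_sub(s, 1:4, 1:4)

# test character i against bracket expression i
position_ok <- str_detect(chars, classes)  # TRUE TRUE TRUE FALSE

# the anchored pattern matches only if every position matches
whole_match <- str_detect(s, "^[ab][ac][bc][ab]$")  # FALSE
```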

SLIDE 17

Exercise:

List all the strings (values of word) that this pattern matches:

str_detect(word, "^c[aeiou]t$")

SLIDE 18

Answer

str_detect(c("cat", "cet", "cit", "cot", "cut"), "^c[aeiou]t$")
[1] TRUE TRUE TRUE TRUE TRUE

SLIDE 19

Exercise

  • How many different strings could this pattern match?

str_detect(some_string, "^[ab][bc][cd][da]$")

SLIDE 20

Answer

  • Each bracket expression, such as [ab], matches one of two characters.
  • There are four bracket expressions, each chosen independently.
  • 2^4 = 16
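
The count can be checked by brute force. The sketch below (not part of the original slides) uses base R's expand.grid to enumerate every combination of one character per position, then confirms that each candidate matches the pattern.

```r
library(stringr)

# every combination of one character from each bracket expression
combos <- expand.grid(
  c("a", "b"), c("b", "c"), c("c", "d"), c("d", "a"),
  stringsAsFactors = FALSE
)

# paste the four columns together into candidate strings
candidates <- str_c(combos[[1]], combos[[2]], combos[[3]], combos[[4]])

n_candidates <- length(unique(candidates))  # 16
all_match <- all(str_detect(candidates, "^[ab][bc][cd][da]$"))  # TRUE
```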

SLIDE 21

Optional matches

str_detect("abc", "abc")
[1] TRUE
str_detect("abc", "abc?")
[1] TRUE
## same as:
str_detect("abc", "ab") | str_detect("abc", "abc")
[1] TRUE
str_detect("abc", "abcd?")
[1] TRUE

  • ? matches the previous expression zero or one times
  • In abc?, the previous expression is c, not abc.
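
A familiar illustration of ?, added here as a sketch with made-up inputs: matching both the American and British spellings of a word. The anchors make sure the whole string is accounted for.

```r
library(stringr)

spellings <- c("color", "colour", "colouur")

# "u?" allows zero or one "u" between "colo" and "r"
is_color <- str_detect(spellings, "^colou?r$")  # TRUE TRUE FALSE
```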

SLIDE 22

Grouping

str_detect("abcd", "ab(ef)?cd")
[1] TRUE
str_detect("abcd", "abcd(ef)?")
[1] TRUE
str_detect("abcd", "ab(ef)?cd")
[1] TRUE

  • () provides grouping. Any suffix operator applies to everything inside the parentheses.
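
To show what the parentheses change, the following sketch compares a grouped and an ungrouped version of the same pattern on inputs chosen for illustration. Anchors are added so that partial matches do not hide the difference.

```r
library(stringr)

x <- c("abcd", "abefcd", "abecd")

# (ef)? makes the two-character sequence "ef" optional as a unit
grouped   <- str_detect(x, "^ab(ef)?cd$")  # TRUE TRUE FALSE

# without parentheses, ? applies only to "f"; the "e" is required
ungrouped <- str_detect(x, "^abef?cd$")    # FALSE TRUE TRUE
```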

SLIDE 23

Example: phone number

phone_number <- "[2-9][0-9]{2}-[0-9]{3}-[0-9]{4}"
str_detect(c("650-723-2300", "123-45-6789"), phone_number)
[1] TRUE FALSE
str_count(c("650-723-2300", "123-45-6789"), phone_number)
[1] 1 0

  • [2-9] any single digit in the range 2 through 9. Note that - has special meaning here and does not match itself.

  • {2} the preceding expression must occur twice in a row
  • This pattern assumes that hyphen (-) is used as punctuation in the phone number.
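
Because this pattern has no anchors, it can match inside a longer string. The sketch below uses that to pull a number out of surrounding text with str_extract, and then shows one way to require that the whole string be a phone number. The sample strings and the phone_only name are illustrative assumptions.

```r
library(stringr)

phone_number <- "[2-9][0-9]{2}-[0-9]{3}-[0-9]{4}"

messages <- c("Call 650-723-2300 today", "SSN 123-45-6789")
extracted <- str_extract(messages, phone_number)  # "650-723-2300" NA

# unanchored, the pattern is also found inside a longer run of digits
inside_longer <- str_detect("1650-723-23001", phone_number)  # TRUE

# adding anchors requires the entire string to be a phone number
phone_only <- str_c("^", phone_number, "$")
strict <- str_detect(c("650-723-2300", "1650-723-23001"), phone_only)  # TRUE FALSE
```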

SLIDE 24

Example: zip code

zip_code <- "^[0-9]{5}(-[0-9]{4})?$"
str_detect(c("94305", "94305-1234", "94305-123", "My zip code is 94305-1234"), zip_code)
[1] TRUE TRUE FALSE FALSE
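
The last example fails because the anchors require the whole string to be a zip code. As a sketch of the alternative, dropping the anchors (the zip_anywhere name below is hypothetical) lets str_extract find a zip code embedded in other text:

```r
library(stringr)

# same pattern as zip_code, but without ^ and $
zip_anywhere <- "[0-9]{5}(-[0-9]{4})?"

sentences <- c("My zip code is 94305-1234", "94305", "no zip here")
found <- str_extract(sentences, zip_anywhere)  # "94305-1234" "94305" NA
```

The trade-off is that the unanchored pattern will also accept five digits that are part of a longer number, so which version is appropriate depends on whether the input is a clean field or free text.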

SLIDE 25

Visualizing regex matching

zip_code2 <- "[0-9]{5}"
str_view_all(c("94305", "1234567", "3.14159"), zip_code2)

  • This only works in RStudio.
  • See the result in the Viewer pane (lower right), with highlights for the parts of each string that match the regex.
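
When the Viewer pane is not available, a rough text-only substitute is to list the matched pieces with str_extract_all, which returns one character vector per input string. This is a sketch, not the slide's method. Note also that, to my knowledge, stringr 1.5.0 and later deprecate str_view_all() in favor of str_view().

```r
library(stringr)

zip_code2 <- "[0-9]{5}"
inputs <- c("94305", "1234567", "3.14159")

# a list with the matched substrings for each input
matches <- str_extract_all(inputs, zip_code2)
# matches[[1]] is "94305", matches[[2]] is "12345", matches[[3]] is "14159"
```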

SLIDE 26

Reading

  • See ?regex for more
  • Read: 14 Strings | R for Data Science
  • Read: Introduction to stringr
  • Optional: Short illustrated guide to regex
  • Optional: Online regex tutorial
  • Optional: Regular expression - Wikipedia
