
Strings, string patterns using regular expressions (Steve Bagley)



  1. Strings, string patterns using regular expressions. Steve Bagley, somgen223.stanford.edu

  2. Strings

  3. Strings in R

       s <- c("a", "ab", "abcdef")
       length(s)
       [1] 3
       str_length(s)
       [1] 1 2 6

     • length is the length of the vector, not the length of the strings in the vector.
     • We will use the stringr functions (prefix str_) for string manipulation.

  4. Building strings: str_c

       str_c("abc", "def")
       [1] "abcdef"
       str_c("abc", 123)
       [1] "abc123"
       str_c("abc", 1:3)
       [1] "abc1" "abc2" "abc3"
       str_c("abc", 1:3, sep = "_")
       [1] "abc_1" "abc_2" "abc_3"

  5. sep vs collapse

       str_c("abc", 1:3, sep = "+")
       [1] "abc+1" "abc+2" "abc+3"
       str_c("abc", 1:3, sep = "+", collapse = " * ")
       [1] "abc+1 * abc+2 * abc+3"

     • sep separates the pieces elementwise, so the result still has one string per element.
     • collapse (if supplied) is inserted between those elementwise results when they are joined into a single string.
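The difference between the two arguments is easiest to see by using each one alone. A minimal sketch, assuming stringr is installed; the variable names are only for illustration:

```r
library(stringr)

letters3 <- c("a", "b", "c")

# sep only: one result string per input element
with_sep <- str_c("x", letters3, sep = "-")

# collapse only: the whole vector is joined into one string
joined <- str_c(letters3, collapse = ", ")

# both: sep is applied elementwise first, then collapse joins the results
both <- str_c("x", letters3, sep = "-", collapse = " | ")

print(with_sep)  # "x-a" "x-b" "x-c"
print(joined)    # "a, b, c"
print(both)      # "x-a | x-b | x-c"
```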

  6. Converting case

       str_to_upper("this is a test")
       [1] "THIS IS A TEST"
       str_to_lower("THIS IS A TEST")
       [1] "this is a test"
       str_to_title("THIS IS A TEST")
       [1] "This Is A Test"
       str_to_sentence("THIS IS A TEST")
       [1] "This is a test"

  7. Substrings

       s <- "this is a test"
       str_sub(s, 11, 14)
       [1] "test"
       str_sub(s, 3)
       [1] "is is a test"

     • str_sub returns a substring.
     • It returns the characters from the first position (counting from 1) through the second position, or through the end of the string if the second position is not supplied.
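str_sub also accepts negative positions, which count backward from the end of the string. This is not shown on the slide; the sketch below assumes stringr is installed and reuses the same example string:

```r
library(stringr)

s <- "this is a test"

# Positive positions count from the start (1 is the first character)
first_word <- str_sub(s, 1, 4)

# Negative positions count from the end (-1 is the last character)
last_word <- str_sub(s, -4, -1)

# With only a start position, the substring runs to the end of the string
tail_part <- str_sub(s, -6)

print(first_word)  # "this"
print(last_word)   # "test"
print(tail_part)   # "a test"
```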

  8. Regular expressions

  9. Regular expressions: patterns in strings

     • Strings often have some kind of internal pattern.
     • Example: phone number 650-723-2300 (area code, exchange/prefix, line number)
     • Example: zip code 94305 or 94305-1234
     • We need some way of describing these patterns: regular expressions.

  10. Regular expressions

     • Regular expressions (“regexes” or “regexps”) are a powerful (but confusing) language for expressing character-level patterns.
     • The general idea is that some characters match themselves, while other characters take on special meaning and control how other characters are matched.
     • There needs to be some way to specify that those special characters can also appear literally, which leads to some odd quoting conventions.
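As an illustration of those quoting conventions, consider the dot, a special character (not otherwise covered in these slides) that matches any single character. To match a literal dot it has to be escaped with a backslash, and the backslash itself has to be doubled inside an R string. A minimal sketch, assuming stringr is installed:

```r
library(stringr)

x <- c("a.c", "abc")

# Unescaped: "." matches any character, so both strings match
any_char <- str_detect(x, "a.c")

# Escaped: "\\." in R source is the regex \. and matches only a literal dot
literal_dot <- str_detect(x, "a\\.c")

# fixed() treats the whole pattern as literal text, with no special characters
literal_fixed <- str_detect(x, fixed("a.c"))

print(any_char)       # TRUE TRUE
print(literal_dot)    # TRUE FALSE
print(literal_fixed)  # TRUE FALSE
```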

  11. Most characters match only themselves

       str_detect("ac", "ac")     # entire pattern matches
       [1] TRUE
       str_detect("bc", "ac")     # entire pattern does not match
       [1] FALSE
       str_detect("aacc", "ac")   # entire pattern is contained in string
       [1] TRUE
       str_detect("ccaa", "ac")   # pattern not contained in string
       [1] FALSE
       str_detect("abcde", "ac")  # pattern not contained in string
       [1] FALSE

     • str_detect takes two arguments: a string and a pattern.
     • It returns TRUE if it finds the complete pattern anywhere in the string.

  12. str_detect accepts vectors

       str_detect(c("ac", "bc", "aacc", "ccaa", "abcde"), "ac")
       [1] TRUE FALSE TRUE FALSE FALSE

     • These are the same examples as on the previous slide, but all in one vector for the string argument.
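Because the result is a logical vector, it can be used directly to pick out the matching strings. A small sketch, assuming stringr is installed; str_subset is the stringr shortcut for the same selection:

```r
library(stringr)

strings <- c("ac", "bc", "aacc", "ccaa", "abcde")

# Logical indexing with the result of str_detect
kept_by_index <- strings[str_detect(strings, "ac")]

# str_subset does the detect-and-select in one call
kept_by_subset <- str_subset(strings, "ac")

print(kept_by_index)   # "ac" "aacc"
print(kept_by_subset)  # "ac" "aacc"
```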

  13. Matching one character from a set of characters

       str_detect("ac", "[ab]c")
       [1] TRUE
       str_detect("bc", "[ab]c")
       [1] TRUE
       str_detect("cc", "[ab]c")
       [1] FALSE
       str_detect("abc", "[ab]c")
       [1] TRUE
       str_detect("bac", "[ab]c")
       [1] TRUE
       str_detect("cabc", "[ab]c")
       [1] TRUE

     • The bracket expression [ ] matches when any one character contained in the brackets occurs in the string at that position.
     • In this example, [ab] will match either a or b.
     • This raises the question of how we would match a bracket character.
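One way to answer that last question is to escape the bracket with a backslash (doubled inside an R string) or to use fixed(). This is a sketch of common stringr usage rather than something taken from the slides; the example strings are made up:

```r
library(stringr)

x <- c("a[1]", "abc")

# "\\[" in R source is the regex \[ and matches a literal opening bracket
has_open <- str_detect(x, "\\[")

# fixed() matches the pattern as literal text
has_open_fixed <- str_detect(x, fixed("["))

# Escaped brackets can also appear inside a bracket expression
has_either <- str_detect(c("x]", "x"), "[\\[\\]]")

print(has_open)        # TRUE FALSE
print(has_open_fixed)  # TRUE FALSE
print(has_either)      # TRUE FALSE
```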

  14. Anchors

       str_detect("aacc", "^aa")
       [1] TRUE
       str_detect("aacc", "^ac")
       [1] FALSE
       str_detect("aacc", "^aacc")
       [1] TRUE
       str_detect("aacc", "cc")
       [1] TRUE
       str_detect("aacc", "ac")
       [1] TRUE
       str_detect("aacc", "cc$")
       [1] TRUE
       str_detect("aacc", "ac$")
       [1] FALSE
       str_detect("aacc", "^aacc$")
       [1] TRUE

     • ^ matches only at the beginning of the string.
     • $ matches only at the end of the string.

  15. Exercise

     Explain this:

       str_detect("aacc", "^[ab][ac][bc][ab]$")
       [1] FALSE

  16. Answer

       str_detect("aacc", "^[ab][ac][bc][ab]$")
       [1] FALSE

     • Starting at the beginning of the string, we need:
       • a or b (true), and
       • a or c (true), and
       • b or c (true), and
       • a or b (false: the fourth character is c)
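The position-by-position reasoning can be checked directly by splitting the string into single characters and testing each one against its own bracket expression. A sketch, assuming stringr is installed; it relies on str_sub and str_detect being vectorized over their arguments:

```r
library(stringr)

s <- "aacc"

# One character per position: "a" "a" "c" "c"
chars <- str_sub(s, 1:4, 1:4)

# One bracket expression per position, in pattern order
sets <- c("[ab]", "[ac]", "[bc]", "[ab]")

# Test character i against set i
per_position <- str_detect(chars, sets)

# The full anchored pattern matches only if every position matches
whole <- str_detect(s, "^[ab][ac][bc][ab]$")

print(per_position)  # TRUE TRUE TRUE FALSE
print(whole)         # FALSE
```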

  17. Exercise

       str_detect(word, "^c[aeiou]t$")

     • List all the strings that this pattern matches.

  18. Answer

       str_detect(c("cat", "cet", "cit", "cot", "cut"), "^c[aeiou]t$")
       [1] TRUE TRUE TRUE TRUE TRUE

  19. Exercise

       str_detect(some_string, "^[ab][bc][cd][da]$")

     • How many different strings could this pattern match?

  20. Answer

     • Each bracket expression, such as [ab], can match two characters.
     • There are four occurrences of [ ].
     • 2^4 = 16
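The count can be confirmed by generating every combination and testing each against the pattern. A sketch using base R's expand.grid together with stringr; the column names p1 to p4 are only for illustration:

```r
library(stringr)

# Every combination of one character per position
grid <- expand.grid(
  p1 = c("a", "b"),
  p2 = c("b", "c"),
  p3 = c("c", "d"),
  p4 = c("d", "a"),
  stringsAsFactors = FALSE
)

# Paste the four columns together into candidate strings
candidates <- str_c(grid$p1, grid$p2, grid$p3, grid$p4)

n_candidates <- length(unique(candidates))
all_match <- all(str_detect(candidates, "^[ab][bc][cd][da]$"))

print(n_candidates)  # 16
print(all_match)     # TRUE
```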

  21. Optional matches

       str_detect("abc", "abc")
       [1] TRUE
       str_detect("abc", "abc?")
       [1] TRUE
       ## same as:
       str_detect("abc", "ab") | str_detect("abc", "abc")
       [1] TRUE
       str_detect("abc", "abcd?")
       [1] TRUE

     • ? matches the previous expression zero or one times.
     • In abc?, the previous expression is c, not abc.
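The effect of ? is clearer when the pattern is anchored, because an unanchored pattern can succeed on a shorter match inside the string. A sketch with a made-up spelling example, assuming stringr is installed:

```r
library(stringr)

spellings <- c("color", "colour", "colouur")

# u? allows zero or one "u" at that position, and the anchors
# force the pattern to account for the entire string
optional_u <- str_detect(spellings, "^colou?r$")

print(optional_u)  # TRUE TRUE FALSE
```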

  22. Grouping

       str_detect("abcd", "ab(ef)?cd")
       [1] TRUE
       str_detect("abcd", "abcd(ef)?")
       [1] TRUE

     • () provides grouping. Any suffix operator, such as ?, applies to everything inside the parentheses.
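Comparing the grouped and ungrouped forms on a few made-up strings shows what the parentheses change. In this sketch (assuming stringr is installed), anchors are added so that each pattern has to account for the whole string:

```r
library(stringr)

x <- c("abcd", "abefcd", "abecd")

# Grouped: ? applies to "ef" as a unit, so "ef" is all-or-nothing
grouped <- str_detect(x, "^ab(ef)?cd$")

# Ungrouped: ? applies only to "f", so "e" is always required
ungrouped <- str_detect(x, "^abef?cd$")

print(grouped)    # TRUE TRUE FALSE
print(ungrouped)  # FALSE TRUE TRUE
```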

  23. Example: phone number

       phone_number <- "[2-9][0-9]{2}-[0-9]{3}-[0-9]{4}"
       str_detect(c("650-723-2300", "123-45-6789"), phone_number)
       [1] TRUE FALSE
       str_count(c("650-723-2300", "123-45-6789"), phone_number)
       [1] 1 0

     • [2-9] matches any single digit in the range 2 through 9. Note that - has special meaning here and does not match itself.
     • {2} means the preceding expression must occur twice in a row.
     • This pattern assumes that the hyphen ( - ) is used as punctuation in the phone number.
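Since this pattern has no anchors, it also matches a phone number embedded in a longer string, including one surrounded by extra digits. The sketch below uses str_extract to pull out the matching text and then adds anchors to require an exact match; the input strings are made up:

```r
library(stringr)

phone_number <- "[2-9][0-9]{2}-[0-9]{3}-[0-9]{4}"

# Extract the first match from surrounding text
found <- str_extract("Call 650-723-2300 today", phone_number)

# Unanchored, the pattern is satisfied by the middle of a longer digit run
loose <- str_detect("1650-723-23009", phone_number)

# Wrapping the pattern in ^ and $ requires the whole string to be a phone number
anchored <- str_c("^", phone_number, "$")
strict <- str_detect(c("650-723-2300", "1650-723-23009"), anchored)

print(found)   # "650-723-2300"
print(loose)   # TRUE
print(strict)  # TRUE FALSE
```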

  24. Example: zip code

       zip_code <- "^[0-9]{5}(-[0-9]{4})?$"
       str_detect(c("94305", "94305-1234", "94305-123", "My zip code is 94305-1234"), zip_code)
       [1] TRUE TRUE FALSE FALSE
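The last two strings fail only because of the anchors. Dropping ^ and $ lets the pattern find a zip code inside a longer string, at the cost of accepting partial matches. A sketch, assuming stringr is installed; the name zip_anywhere is hypothetical:

```r
library(stringr)

# Same pattern as zip_code, but without the ^ and $ anchors
zip_anywhere <- "[0-9]{5}(-[0-9]{4})?"

inputs <- c("My zip code is 94305-1234", "94305-123")
extracted <- str_extract(inputs, zip_anywhere)

# In the second string the optional group cannot match "-123"
# (it needs four digits), so only the five-digit part is returned
print(extracted)  # "94305-1234" "94305"
```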

  25. Visualizing regex matching

       zip_code2 <- "[0-9]{5}"
       str_view_all(c("94305", "1234567", "3.14159"), zip_code2)

     • This only works in RStudio.
     • See the result in the Viewer pane (lower right), with highlights for the parts of each string that match the regex.
     • In stringr 1.5.0 and later, str_view_all is deprecated in favor of str_view, which prints the highlighted matches to the console.
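Outside RStudio, the same information can be inspected as plain values. The sketch below is an alternative to the viewer, not what the slide shows: str_extract_all returns the matched text for each string, and str_count returns how many matches each string contains.

```r
library(stringr)

zip_code2 <- "[0-9]{5}"
inputs <- c("94305", "1234567", "3.14159")

# One list element per input string, holding every non-overlapping match
matches <- str_extract_all(inputs, zip_code2)

# Number of matches per input string
counts <- str_count(inputs, zip_code2)

print(matches)  # "94305", "12345", "14159"
print(counts)   # 1 1 1
```

Note that the pattern matches the first five digits of "1234567" and the digits after the decimal point in "3.14159", which is why an unanchored five-digit pattern is a poor zip code test.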

  26. Reading

     • See ?regex for more
     • Read: 14 Strings | R for Data Science
     • Read: Introduction to stringr
     • Optional: Short illustrated guide to regex
     • Optional: Online regex tutorial
     • Optional: Regular expression - Wikipedia
