Strings, string patterns using regular expressions Steve Bagley

Strings, string patterns using regular expressions Steve Bagley somgen223.stanford.edu 1

Strings

Strings in R
s <- c("a", "ab", "abcdef")
length(s)
[1] 3
str_length(s)
[1] 1 2 6
• length is the length of the vector, not the length of the strings in the vector.
• We will use the stringr functions (prefix str_) for string manipulation.

Building strings: str_c
str_c("abc", "def")
[1] "abcdef"
str_c("abc", 123)
[1] "abc123"
str_c("abc", 1:3)
[1] "abc1" "abc2" "abc3"
str_c("abc", 1:3, sep = "_")
[1] "abc_1" "abc_2" "abc_3"

sep vs collapse
str_c("abc", 1:3, sep = "+")
[1] "abc+1" "abc+2" "abc+3"
str_c("abc", 1:3, sep = "+", collapse = " * ")
[1] "abc+1 * abc+2 * abc+3"
• sep separates elementwise.
• collapse (if supplied) joins the elementwise results into a single string, with the collapse string inserted between them.
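
The difference between sep and collapse is easiest to see when collapse is used on its own. The following is a minimal sketch, not taken from the slides; the variable names are made up for illustration and it assumes the stringr package is installed.

```r
library(stringr)

# collapse alone: join the elements of one vector into a single string
letters3 <- c("a", "b", "c")
joined <- str_c(letters3, collapse = ", ")
joined
# [1] "a, b, c"

# sep pastes elementwise first, then collapse joins the three pieces
both <- str_c("x", 1:3, sep = "", collapse = "+")
both
# [1] "x1+x2+x3"
```

In both cases the result is a character vector of length 1.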

Converting case
str_to_upper("this is a test")
[1] "THIS IS A TEST"
str_to_lower("THIS IS A TEST")
[1] "this is a test"
str_to_title("THIS IS A TEST")
[1] "This Is A Test"
str_to_sentence("THIS IS A TEST")
[1] "This is a test"

Substrings
s <- "this is a test"
str_sub(s, 11, 14)
[1] "test"
str_sub(s, 3)
[1] "is is a test"
• str_sub returns a substring.
• It returns the characters from the first position (counting from 1) through the second position (or the end of the string if the second position is not supplied).
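
str_sub also accepts negative positions, which count backwards from the end of the string. This is not shown on the slide; the sketch below assumes stringr is installed and uses illustrative variable names.

```r
library(stringr)

s <- "this is a test"

# -4 is the fourth character from the end, so this is the last four characters
last_word <- str_sub(s, -4)
last_word
# [1] "test"

# from position 1 through the sixth character from the end (position 9)
head_part <- str_sub(s, 1, -6)
head_part
# [1] "this is a"
```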

Regular expressions

Regular expressions: patterns in strings
• Strings often have some kind of internal pattern.
• Example: phone number 650-723-2300 (area code, exchange/prefix, line number)
• Example: zip code 94305 or 94305-1234
• We need some way to describe these patterns: regular expressions

Regular expressions
• Regular expressions (“regexes” or “regexps”) are a powerful (but confusing) language for expressing character-level patterns.
• The general idea is that some characters match themselves, while other characters take on a special meaning and control how other characters are matched. There must also be a way to specify that those special characters appear literally, which leads to some odd quoting conventions.
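
As an illustration of the quoting conventions, consider the period, a special character that the slides do not cover: in a regex it matches any single character. The sketch below (assuming stringr is installed; variable names are illustrative) shows two ways to match a literal period.

```r
library(stringr)

x <- c("3.14", "314")

# "." is special: it matches any single character, so both strings match
any_char <- str_detect(x, ".")
any_char
# [1] TRUE TRUE

# "\\." in R source is the regex \. which matches only a literal period
literal_dot <- str_detect(x, "\\.")
literal_dot
# [1]  TRUE FALSE

# fixed() tells stringr to treat the whole pattern as literal text
fixed_dot <- str_detect(x, fixed("."))
fixed_dot
# [1]  TRUE FALSE
```

The doubled backslash is needed because the backslash is also special inside an R string literal.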

Most characters match only themselves
str_detect("ac", "ac") # entire pattern matches
[1] TRUE
str_detect("bc", "ac") # entire pattern does not match
[1] FALSE
str_detect("aacc", "ac") # entire pattern is contained in string
[1] TRUE
str_detect("ccaa", "ac") # pattern not contained in string
[1] FALSE
str_detect("abcde", "ac") # pattern not contained in string
[1] FALSE
• str_detect takes two arguments: a string, and a pattern.
• It returns TRUE if it finds the complete pattern anywhere in the string.

str_detect accepts vectors
str_detect(c("ac", "bc", "aacc", "ccaa", "abcde"), "ac")
[1] TRUE FALSE TRUE FALSE FALSE
• These are the same examples as on the previous slide, but all in one vector for the string argument.

Matching one character from a set of characters
str_detect("ac", "[ab]c")
[1] TRUE
str_detect("bc", "[ab]c")
[1] TRUE
str_detect("cc", "[ab]c")
[1] FALSE
str_detect("abc", "[ab]c")
[1] TRUE
str_detect("bac", "[ab]c")
[1] TRUE
str_detect("cabc", "[ab]c")
[1] TRUE
• The bracket expression [ ] matches when any one character contained in the brackets occurs in the string at that position.
• In this example, [ab] will match either a or b.
• This raises the question of how we would match a bracket character.
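
One way to match a literal bracket character is to escape it with a backslash. The slide leaves the question open, so the following is only a sketch under that assumption (stringr installed; variable names are illustrative).

```r
library(stringr)

x <- c("a[1]", "a1")

# "\\[" in R source is the regex \[ which matches a literal opening bracket
has_open <- str_detect(x, "\\[")
has_open
# [1]  TRUE FALSE

# escaped brackets can also appear inside a bracket expression:
# this matches either "[" or "]"
has_either <- str_detect(x, "[\\[\\]]")
has_either
# [1]  TRUE FALSE
```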

Anchors
str_detect("aacc", "^aa")
[1] TRUE
str_detect("aacc", "^ac")
[1] FALSE
str_detect("aacc", "^aacc")
[1] TRUE
str_detect("aacc", "cc")
[1] TRUE
str_detect("aacc", "ac")
[1] TRUE
str_detect("aacc", "cc$")
[1] TRUE
str_detect("aacc", "ac$")
[1] FALSE
str_detect("aacc", "^aacc$")
[1] TRUE
• ^ matches only at the beginning of the string
• $ matches only at the end of the string
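
Anchors are often used to filter a vector by how its strings begin or end. The sketch below uses str_subset, a stringr function not shown on the slides that returns the matching strings rather than TRUE/FALSE; the example words are made up.

```r
library(stringr)

words <- c("aacc", "ccaa", "acca")

# keep only the strings that begin with "a"
starts_with_a <- str_subset(words, "^a")
starts_with_a
# [1] "aacc" "acca"

# keep only the strings that end with "a"
ends_with_a <- str_subset(words, "a$")
ends_with_a
# [1] "ccaa" "acca"
```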

Exercise
Explain this:
str_detect("aacc", "^[ab][ac][bc][ab]$")
[1] FALSE

Answer
str_detect("aacc", "^[ab][ac][bc][ab]$")
[1] FALSE
• Starting at the beginning of the string, we need:
• a or b (true), and
• a or c (true), and
• b or c (true), and
• a or b (false: the fourth character is c)
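
The position-by-position reasoning in this answer can be checked directly. This is a sketch, not part of the slides: it relies on str_sub and str_detect being vectorized over their arguments, and the variable names are illustrative.

```r
library(stringr)

s <- "aacc"
sets <- c("[ab]", "[ac]", "[bc]", "[ab]")

# split the string into its four single characters
chars <- str_sub(s, 1:4, 1:4)
chars
# [1] "a" "a" "c" "c"

# test each character against the bracket expression for its position
per_position <- str_detect(chars, sets)
per_position
# [1]  TRUE  TRUE  TRUE FALSE

# the full anchored pattern matches only if every position matches
all(per_position)
# [1] FALSE
```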

Exercise
str_detect(word, "^c[aeiou]t$")
• List all the strings word that this pattern matches.

Answer
str_detect(c("cat", "cet", "cit", "cot", "cut"), "^c[aeiou]t$")
[1] TRUE TRUE TRUE TRUE TRUE

Exercise
str_detect(some_string, "^[ab][bc][cd][da]$")
• How many different strings could this pattern match?

Answer
• Each bracket expression, such as [ab], can match either of two characters.
• There are four bracket expressions.
• 2^4 = 16
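
The count can be confirmed by enumerating the candidates for the pattern ^[ab][bc][cd][da]$ from the exercise. This is a sketch using base R's expand.grid to build every combination; the column names p1 to p4 are made up for illustration.

```r
library(stringr)

# every combination of one character per bracket expression
grid <- expand.grid(p1 = c("a", "b"), p2 = c("b", "c"),
                    p3 = c("c", "d"), p4 = c("d", "a"),
                    stringsAsFactors = FALSE)
candidates <- str_c(grid$p1, grid$p2, grid$p3, grid$p4)

# number of distinct candidate strings
n_candidates <- length(unique(candidates))
n_candidates
# [1] 16

# every candidate is matched by the pattern
all_match <- all(str_detect(candidates, "^[ab][bc][cd][da]$"))
all_match
# [1] TRUE
```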

Optional matches
str_detect("abc", "abc")
[1] TRUE
str_detect("abc", "abc?")
[1] TRUE
## same as:
str_detect("abc", "ab") | str_detect("abc", "abc")
[1] TRUE
str_detect("abc", "abcd?")
[1] TRUE
• ? matches the previous expression zero or one times.
• In abc?, the previous expression is c, not abc.
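
A common use of ? is to accept two spellings of a word. The sketch below is not from the slides; the example strings are made up, and the anchors make sure the whole string is tested.

```r
library(stringr)

spellings <- c("color", "colour", "colouur")

# u? allows zero or one "u" between "colo" and "r"
matches <- str_detect(spellings, "^colou?r$")
matches
# [1]  TRUE  TRUE FALSE
```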

Grouping
str_detect("abcd", "ab(ef)?cd")
[1] TRUE
str_detect("abcd", "abcd(ef)?")
[1] TRUE
• () provides grouping. Any suffix operator applies to everything inside the parentheses.
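
Comparing the same pattern with and without parentheses shows what grouping changes. This is an illustrative sketch with made-up test strings, assuming stringr is installed.

```r
library(stringr)

x <- c("abcd", "abefcd", "abecd")

# without a group, ? applies only to f: "abe" is required, "f" is optional
without_group <- str_detect(x, "abef?cd")
without_group
# [1] FALSE  TRUE  TRUE

# with a group, ? applies to "ef" as a unit: both letters or neither
with_group <- str_detect(x, "ab(ef)?cd")
with_group
# [1]  TRUE  TRUE FALSE
```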

Example: phone number
phone_number <- "[2-9][0-9]{2}-[0-9]{3}-[0-9]{4}"
str_detect(c("650-723-2300", "123-45-6789"), phone_number)
[1] TRUE FALSE
str_count(c("650-723-2300", "123-45-6789"), phone_number)
[1] 1 0
• [2-9] matches any single digit in the range 2 through 9. Note that inside brackets - has a special meaning (it denotes a range) and does not match itself.
• {2} means the preceding expression must occur exactly twice in a row.
• This pattern assumes that hyphen (-) is used as punctuation in the phone number.
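
Because this pattern has no anchors, it can also find a phone number inside longer text. The sketch below uses str_extract, a stringr function not shown on the slide that returns the first match or NA; the input strings are made up.

```r
library(stringr)

phone_number <- "[2-9][0-9]{2}-[0-9]{3}-[0-9]{4}"
text <- c("Call 650-723-2300 today", "no number here")

# pull out the first match from each string (NA when there is none)
extracted <- str_extract(text, phone_number)
extracted
# [1] "650-723-2300" NA

# caution: without anchors, the pattern also matches inside a longer digit run
inside_longer <- str_detect("1650-723-23009", phone_number)
inside_longer
# [1] TRUE
```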

Example: zip code
zip_code <- "^[0-9]{5}(-[0-9]{4})?$"
str_detect(c("94305", "94305-1234", "94305-123", "My zip code is 94305-1234"), zip_code)
[1] TRUE TRUE FALSE FALSE
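
The last two strings fail only because of the anchors. The sketch below drops ^ and $ to show the difference; it is an illustration rather than part of the slides, and str_extract is a stringr function the slides do not introduce.

```r
library(stringr)

x <- c("94305", "94305-1234", "94305-123", "My zip code is 94305-1234")

# anchored: the whole string must be a zip code
anchored <- str_detect(x, "^[0-9]{5}(-[0-9]{4})?$")
anchored
# [1]  TRUE  TRUE FALSE FALSE

# unanchored: a zip code anywhere in the string is enough
unanchored <- str_detect(x, "[0-9]{5}(-[0-9]{4})?")
unanchored
# [1] TRUE TRUE TRUE TRUE

# what the unanchored pattern actually matched in each string
extracted <- str_extract(x, "[0-9]{5}(-[0-9]{4})?")
extracted
# [1] "94305"      "94305-1234" "94305"      "94305-1234"
```

In the third string the optional group matches zero times, because "-123" has only three digits after the hyphen.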

Visualizing regex matching
zip_code2 <- "[0-9]{5}"
str_view_all(c("94305", "1234567", "3.14159"), zip_code2)
• This only works in RStudio.
• See the result in the Viewer pane (lower right), with highlights for the parts of each string that match the regex.
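
Outside RStudio, the matched text can be inspected as plain output instead. This is a sketch using str_extract_all, a stringr function not shown on the slide, which returns a list with one character vector of matches per input string.

```r
library(stringr)

zip_code2 <- "[0-9]{5}"
x <- c("94305", "1234567", "3.14159")

# every non-overlapping match in each string
matches <- str_extract_all(x, zip_code2)
unlist(matches)
# [1] "94305" "12345" "14159"

# number of matches per string
n_matches <- str_count(x, zip_code2)
n_matches
# [1] 1 1 1
```

The second and third results show that this unanchored pattern matches any run of five digits, not only complete zip codes.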

Reading
• See ?regex for more
• Read: 14 Strings | R for Data Science
• Read: Introduction to stringr
• Optional: Short illustrated guide to regex
• Optional: Online regex tutorial
• Optional: Regular expression - Wikipedia
