strings languages and regular expressions

Strings, Languages, and Regular Expressions (Lecture 2) - PowerPoint PPT Presentation



  1. Strings, Languages, and Regular Expressions (Lecture 2)

  2. Strings

  3. Definitions for strings
     • alphabet Σ = finite set of symbols, e.g., Σ = {0,1}, Σ = { α, β, …, ω }, Σ = set of ASCII characters
     • string = finite sequence of symbols of Σ
     • length of a string w is denoted |w|, e.g., |ε| = 0 and |cat| = 3
     • empty string is denoted "ε"
     • Could formalize a string as a function w: [n] → Σ, where |w| = n
     Variable conventions (for this lecture):
     • a, b, c, ... elements of Σ (i.e., strings of length 1)
     • w, x, y, z, ... strings of length 0 or more
     • A, B, C, ... sets of strings
     CS 374
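The function view of a string can be sketched in a few lines of Python. This is only an illustration: the helper names `as_function` and `is_string_over` are made up for this example, and it assumes [n] means the positions 1, ..., n.

```python
# Illustrative sketch: model a string as a function w: [n] -> Sigma.
# Helper names are hypothetical; [n] is assumed to mean {1, ..., n}.

def as_function(w):
    """Return (n, f) where f maps positions 1..n to the symbols of w."""
    n = len(w)

    def f(i):
        if not 1 <= i <= n:
            raise ValueError("position out of range")
        return w[i - 1]  # shift from 1-based positions to Python's 0-based index

    return n, f


def is_string_over(w, sigma):
    """True if every symbol of w belongs to the alphabet sigma."""
    return all(symbol in sigma for symbol in w)


n, cat = as_function("cat")   # |cat| = 3
length_of_empty = len("")     # |eps| = 0
```

Here `cat(1)` is the first symbol of "cat", and `is_string_over("0110", {"0", "1"})` checks that a string is over the binary alphabet.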

  4. Much ado about nothing
     • ε is a string containing no symbols. It is not a set.
     • { ε } is a set containing one string: the empty string ε. It is a set, not a string.
     • Ø is the empty set. It contains no strings.

  5. Concatenation & its properties
     • xy denotes the concatenation of strings x and y (sometimes written x ⋅ y)
     • Formally, if |x| = m and |y| = n, then xy: [m+n] → Σ such that xy(i) = x(i) if i ≤ m, and xy(i) = y(i-m) otherwise
     • Associative: (uv)w = u(vw), and we write uvw
     • Identity element ε: εw = wε = w
     • Can be used to define strings (the set of all strings Σ*) inductively
     • NOT commutative: ab ≠ ba
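As a rough sketch, the position-by-position definition of concatenation and its properties can be checked in Python. The function name `concat` and the sample strings are assumptions chosen for illustration; Python's built-in `+` on strings does the same job.

```python
# Sketch of xy(i) = x(i) if i <= m, else y(i - m), with 1-based positions.
# `concat` is a hypothetical helper, not part of the lecture.

def concat(x, y):
    m, n = len(x), len(y)
    symbols = []
    for i in range(1, m + n + 1):        # positions 1..m+n
        if i <= m:
            symbols.append(x[i - 1])     # x(i)
        else:
            symbols.append(y[i - m - 1]) # y(i - m)
    return "".join(symbols)


u, v, w = "ab", "c", "de"
associative = concat(concat(u, v), w) == concat(u, concat(v, w))
identity = concat("", w) == w == concat(w, "")
commutative = concat("a", "b") == concat("b", "a")   # expected to be False
```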

  6. Substring, Prefix, Suffix, Exponents
     • v is a substring of w iff there exist strings x, y such that w = xvy.
       – If x = ε (w = vy) then v is a prefix of w.
       – If y = ε (w = xv) then v is a suffix of w.
     • If w is a string, then w^n is defined inductively by:
       – w^n = ε if n = 0
       – w^n = w w^(n-1) if n > 0
     • e.g., (blah)^4 = blahblahblahblah
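The definitions on this slide translate almost directly into code. The following is a minimal sketch with hypothetical function names; `power` follows the inductive definition of w^n rather than using Python's `w * n`.

```python
# Hypothetical helpers mirroring the slide's definitions.

def power(w, n):
    """w^n: w^0 = eps, w^n = w w^(n-1) for n > 0."""
    if n == 0:
        return ""
    return w + power(w, n - 1)


def is_substring(v, w):
    """True iff w = x v y for some strings x, y (tries every start position)."""
    return any(w[i:i + len(v)] == v for i in range(len(w) - len(v) + 1))


def is_prefix(v, w):
    """The case x = eps: w = v y."""
    return w[:len(v)] == v


def is_suffix(v, w):
    """The case y = eps: w = x v."""
    return len(v) <= len(w) and w[len(w) - len(v):] == v
```

For example, `power("blah", 4)` rebuilds the string from the slide, and the empty string is a substring, prefix, and suffix of every string.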

  7. Set Concatenation
     • If X and Y are sets of strings, then XY = { xy | x ∈ X, y ∈ Y }
     • e.g., X = { fido, rover, spot }, Y = { fluffy, tabby }; then XY = { fidofluffy, fidotabby, roverfluffy, ... } and |XY| = 6
     • A = { a, aa }, B = { ε, a }: |AB| = 3
     • A = { a, aa }, B = Ø: AB = Ø
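The three examples above can be reproduced with a one-line set comprehension. This is a small sketch; `set_concat` is a name chosen for this example.

```python
# Set concatenation XY = { xy | x in X, y in Y } as a set comprehension.

def set_concat(X, Y):
    return {x + y for x in X for y in Y}


XY = set_concat({"fido", "rover", "spot"}, {"fluffy", "tabby"})
AB = set_concat({"a", "aa"}, {"", "a"})    # "" plays the role of eps
A_empty = set_concat({"a", "aa"}, set())   # concatenating with the empty set
```

Note that |AB| is 3 rather than 4 because the string aa arises in two ways (a followed by a, and aa followed by ε), and a set keeps it only once.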

  8. Σ^n, Σ*, and Σ+
     • Σ^n is the set of all strings over Σ of length exactly n. Defined inductively as:
       – Σ^0 = { ε }
       – Σ^n = Σ Σ^(n-1) if n > 0
     • Σ* is the set of all finite length strings: Σ* = ∪_{n ≥ 0} Σ^n
     • Σ+ is the set of all nonempty finite length strings: Σ+ = ∪_{n ≥ 1} Σ^n

  9. Σ^n, Σ*, and Σ+
     • |Σ^n| = |Σ|^n
     • |Ø^n| = ?
       – Ø^0 = { ε }
       – Ø^n = Ø Ø^(n-1) = Ø if n > 0
     • |Ø^n| = 1 if n = 0, and |Ø^n| = 0 if n > 0
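The inductive definition of Σ^n, including the edge case of the empty alphabet, can be sketched as a short recursive function. The name `sigma_n` is an assumption made for this example.

```python
# Sigma^0 = {eps}; Sigma^n = Sigma Sigma^(n-1) for n > 0.

def sigma_n(sigma, n):
    if n == 0:
        return {""}
    return {a + w for a in sigma for w in sigma_n(sigma, n - 1)}


binary_3 = sigma_n({"0", "1"}, 3)   # all binary strings of length exactly 3
empty_0 = sigma_n(set(), 0)         # the empty alphabet, n = 0
empty_2 = sigma_n(set(), 2)         # the empty alphabet, n > 0
```

The sizes agree with the counts on the slide: `len(binary_3)` is 2^3, `empty_0` contains only the empty string, and `empty_2` is empty.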

  10. Σ^n, Σ*, and Σ+
      • Σ* is the set of all finite length strings: Σ* = ∪_{n ≥ 0} Σ^n
      • x is a string iff x = ε or x = au where |u| = |x| - 1 (this can be the formal definition of a "string")
      • |Σ*| = ?
        – Infinity. More precisely, ℵ0
        – |Σ*| = |Σ+| = |N| = ℵ0
      • How long is the longest string in Σ*? There is no longest string!
      • How many infinitely long strings are in Σ*? None

  11. Σ^n, Σ*, and Σ+
      • Σ+ is the set of all nonempty finite length strings: Σ+ = ∪_{n ≥ 1} Σ^n
      • Σ+ = ?
        – Σ Σ*
        – Σ* Σ
        – Σ Σ* Σ
        – Σ ∪ Σ^2 Σ*
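One way to explore the four candidates is to compare them with Σ+ on all strings up to some small length. The sketch below does this for Σ = {0,1} and length at most 4. It is a bounded experiment, not a proof, and the helper names are hypothetical.

```python
from itertools import product

# Bounded experiment: compare each candidate with Sigma+ on strings of length <= N.

def strings_up_to(sigma, max_len):
    return {"".join(t) for n in range(max_len + 1)
            for t in product(sorted(sigma), repeat=n)}


def concat_bounded(X, Y, max_len):
    """XY restricted to strings of length <= max_len."""
    return {x + y for x in X for y in Y if len(x + y) <= max_len}


SIGMA = {"0", "1"}
N = 4
star = strings_up_to(SIGMA, N)   # Sigma* up to length N
plus = star - {""}               # Sigma+ up to length N

candidates = {
    "Σ Σ*": concat_bounded(SIGMA, star, N),
    "Σ* Σ": concat_bounded(star, SIGMA, N),
    "Σ Σ* Σ": concat_bounded(concat_bounded(SIGMA, star, N), SIGMA, N),
    "Σ ∪ Σ^2 Σ*": SIGMA | concat_bounded(concat_bounded(SIGMA, SIGMA, N), star, N),
}
agrees = {name: lang == plus for name, lang in candidates.items()}
```

In this experiment only Σ Σ* Σ disagrees with Σ+, because every string in it has length at least 2.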

  12. Enumerating Strings
      • Canonical (standard) ordering is the lexicographical (dictionary) ordering
      • Order by length (starting with 0)
      • Order the |Σ|^n strings of length n by comparing characters left to right
      e.g., for Σ = {0,1}:
      index  string  length
      1      ε       0
      2      0       1
      3      1       1
      4      00      2
      5      01      2
      6      10      2
      7      11      2
      8      000     3
      9      001     3
      10     010     3
      11     011     3
      12     100     3
      13     101     3
      14     110     3
      15     111     3
      16     0000    4
      17     0001    4
      18     0010    4
      19     0011    4
      20     0100    4
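The ordering rule (shorter strings first, then left-to-right comparison within each length) can be sketched as a Python generator. The name `enumerate_strings` is an assumption, and the alphabet's symbols are ordered with `sorted`.

```python
from itertools import count, islice, product

# Enumerate Sigma* by length, and within each length in left-to-right order.

def enumerate_strings(sigma):
    symbols = sorted(sigma)
    for n in count(0):                        # lengths 0, 1, 2, ...
        for t in product(symbols, repeat=n):  # tuples come out in sorted order
            yield "".join(t)


first_eight = list(islice(enumerate_strings({"0", "1"}), 8))
```

Because the generator never terminates, `islice` is used to take a finite prefix of the enumeration.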

  13. Inductive Definitions
      • Often operations on strings are formally defined inductively
        – e.g., w^n in terms of w^(n-1)
        – Another example: w^R (w reversed), inducting on length
      • If |w| = 0, w^R = ε
      • If |w| ≥ 1, w^R = u^R a where w = au, a ∈ Σ, u ∈ Σ* (well-defined: |u| < |w|)
      • In short: ε^R = ε, (au)^R = u^R a
      • e.g., (cat)^R = (c ⋅ at)^R = (at)^R ⋅ c = (a ⋅ t)^R ⋅ c = (t)^R ⋅ a ⋅ c = (t ⋅ ε)^R ⋅ ac = ε^R ⋅ tac = tac
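The inductive definition of reversal maps directly onto a recursive function. This is a minimal sketch; the name `reverse` is chosen for this example, and in practice Python's `w[::-1]` gives the same result.

```python
# eps^R = eps; (au)^R = u^R a, recursing on the shorter string u.

def reverse(w):
    if w == "":
        return ""
    a, u = w[0], w[1:]     # w = au with a in Sigma, |u| < |w|
    return reverse(u) + a


tac = reverse("cat")
```

Each recursive call is on a strictly shorter string, which is the same reason the definition is well-defined.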

  14. Inductive Proofs
      • Inductive proofs follow inductive definitions
      • Definition of reversal: ε^R = ε, (au)^R = u^R a
      • Theorem: (uv)^R = v^R u^R
      • Proof: By induction. But on what? |u|, |v|, |uv|, double induction on |u|, |v|? |u| (or |v|) is good enough:
        Base case: |u| = 0, i.e., u = ε.
        Then (uv)^R = v^R, and v^R u^R = v^R ε^R = v^R ε = v^R (definition of reversal: base case) ☑

  15. Inductive Proofs
      • Inductive proofs follow inductive definitions
      • Definition of reversal: ε^R = ε, (au)^R = u^R a
      • Theorem: (uv)^R = v^R u^R
      • Proof: By induction. Inductive step: Let n > 0. Assume (wv)^R = v^R w^R for all w with |w| < n.
        Consider any u with |u| = n. So u = aw, a ∈ Σ, w ∈ Σ*.
        (uv)^R = (awv)^R = (a(wv))^R
               = (wv)^R a       (definition of reversal: inductive case)
               = v^R w^R a      (inductive hypothesis: |w| < n)
               = v^R (aw)^R     (definition of reversal: inductive case)
               = v^R u^R
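The theorem can also be spot-checked by brute force on short strings. The sketch below tests every pair u, v of length at most 3 over the alphabet {a, b}. This complements the induction proof but does not replace it, and the helper names are hypothetical.

```python
from itertools import product

# Brute-force check of (uv)^R = v^R u^R on all short strings over {a, b}.

def reverse(w):
    """eps^R = eps; (au)^R = u^R a."""
    if w == "":
        return ""
    return reverse(w[1:]) + w[0]


def strings_up_to(sigma, max_len):
    return ["".join(t) for n in range(max_len + 1)
            for t in product(sorted(sigma), repeat=n)]


short = strings_up_to({"a", "b"}, 3)
theorem_holds = all(
    reverse(u + v) == reverse(v) + reverse(u)
    for u in short
    for v in short
)
```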

  16. Languages

  17. Recall Computation
      • Problem: To compute a function F that maps each input (a string) to an output bit
      • Program: A finitely described process taking a string as input, and outputting a bit (or not halting)
      • P computes F if for every x, P(x) outputs F(x) and halts
      • Too restrictive?
        – Enough to compute functions with longer outputs too: P(x,i) outputs the i-th bit of F(x)
        – Enough to model interactive computation too: P*(x,state) outputs (y,new_state)

  18. Language
      • A function from Σ* to {0,1} can be identified with the set of strings mapped to 1
      • A language is a subset of Σ*
        – Computational problem for a language: given a string in Σ*, decide if it belongs to the language
      • Examples of languages: Ø, Σ*, Σ, { ε }, set of strings of odd length, set of strings encoding valid C programs, set of strings encoding valid C programs that halt, …
      • There are uncountably many languages (but each language has countably many strings)
      e.g., a function from Σ* to {0,1}:
      index  value  string
      1      0      ε
      2      0      0
      3      1      1
      4      0      00
      5      1      01
      6      1      10
      7      0      11
      8      0      000
      9      1      001
      10     1      010
      11     0      011
      12     1      100
      13     0      101
      14     0      110
      15     1      111
      16     1      1000
      17     0      1001
      18     0      1010
      19     1      1011
      20     0      1100
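The correspondence between a 0/1-valued function and a language can be illustrated with the "strings of odd length" example. In this sketch the names `odd_length` and `language_up_to` are assumptions, and since the language is infinite only the strings up to a chosen length are collected.

```python
from itertools import product

# A decider (function to {0,1}) and the finite slice of the language it defines.

def odd_length(w):
    """Characteristic function of the language of odd-length strings."""
    return 1 if len(w) % 2 == 1 else 0


def language_up_to(decider, sigma, max_len):
    """Strings of length <= max_len that the decider maps to 1."""
    symbols = sorted(sigma)
    return {
        "".join(t)
        for n in range(max_len + 1)
        for t in product(symbols, repeat=n)
        if decider("".join(t)) == 1
    }


odd_strings = language_up_to(odd_length, {"0", "1"}, 3)
```

Membership in the set and the value of the function carry the same information: `w in odd_strings` holds exactly when `odd_length(w) == 1`, for strings within the length bound.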

  19. Operations on Languages
      • Already seen concatenation: L1 L2 = { xy | x ∈ L1, y ∈ L2 }
      • Set operations:
        – Complement: L̅ = Σ* - L = { x ∈ Σ* | x ∉ L }
        – Union: L1 ∪ L2
        – Intersection, difference (can be based on the above two)
      • L^n inductively defined: L^0 = { ε }, L^n = L L^(n-1)
      • L* = ∪_{n ≥ 0} L^n, and L+ = LL*
      • { ε }* = ? Ø* = ?
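The definitions of L^n and L* can be explored with a bounded approximation of the star, taking the union of L^n only for n up to some limit. This is a sketch with hypothetical names. The true L* is an infinite union, so `star_up_to` equals L* only when larger powers add nothing new.

```python
# L^0 = {eps}; L^n = L L^(n-1); L* approximated by the union of L^0..L^max_power.

def lang_concat(L1, L2):
    return {x + y for x in L1 for y in L2}


def lang_power(L, n):
    if n == 0:
        return {""}
    return lang_concat(L, lang_power(L, n - 1))


def star_up_to(L, max_power):
    result = set()
    for n in range(max_power + 1):
        result |= lang_power(L, n)
    return result


star_of_eps = star_up_to({""}, 5)      # {eps}*
star_of_empty = star_up_to(set(), 5)   # Ø*
star_of_ab = star_up_to({"ab"}, 2)     # first three powers of {ab}
```

Both `star_of_eps` and `star_of_empty` come out as the set containing only the empty string: L^0 always contributes ε, and the higher powers of {ε} and Ø add nothing further.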

  20. Complexity of Languages
      • How computable is a language?
      • Singleton languages
        – L such that |L| = 1. Example: L = {374}
        – An algorithm can have the single string hard-coded into it
      • More generally, finite languages
        – Algorithm can have all the strings hard-coded into it
      • Many interesting languages are uncomputable
      • But many others are neither too easy nor impossible…

  21. Regular Languages

  22. Regular Languages
      • The set of regular languages over some alphabet Σ is defined inductively by:
        – Ø is a regular language
        – { ε } is a regular language
        – { a } is a regular language for each a ∈ Σ
        – If L1, L2 are regular, then L1 ∪ L2 is regular
        – If L1, L2 are regular, then L1 L2 is regular
        – If L is regular, then L* is regular

  23. Regular Languages Examples
      • L = { w } where w ∈ Σ* is any fixed string
        – e.g., L = { aba } = { a }{ b }{ a }, and { a } and { b } are both regular
        – Proof by induction on |w|, using concatenation for induction
      • L = any finite set of strings
        – e.g., L = set of all strings of length at most 10
        – Proof by induction on |L|, using union for induction (and the above)
        – Beware: Induction applicable only for |L| ∈ N, not |L| = ℵ0

  24. Regular Languages Examples
      • Infinite sets, but of strings with "regular" patterns
        – Σ* (recall: L* is regular if L is)
        – Σ+ = ΣΣ*
        – All binary integers, without leading 0's: L = {1}{0,1}* ∪ {0}
        – All binary integers which are multiples of 37 (later)
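The language L = {1}{0,1}* ∪ {0} can be tried out with Python's standard `re` module. In this sketch the pattern `1[01]*|0` is one possible transcription into Python's regex syntax, where `[01]` stands for the set {0,1} and `|` for union. The function name is made up for this example.

```python
import re

# One possible transcription of {1}{0,1}* ∪ {0} into Python regex syntax.
BINARY_INT = re.compile(r"1[01]*|0")


def is_binary_integer(s):
    """True iff the whole string s is a binary integer without leading 0's."""
    return BINARY_INT.fullmatch(s) is not None


accepted = [s for s in ["0", "1", "10", "101", "01", "00", ""] if is_binary_integer(s)]
```

`fullmatch` is used so that the whole string must belong to the language, rather than just some substring of it.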

  25. Regular Expressions

  26. Regular Expressions
      • A short-hand to denote a regular language as strings that match a pattern
      • Useful in
        – text search (editors, Unix/grep)
        – compilers: lexical analysis
      • Dates back to the 50's: Stephen Kleene, who has a star named after him*
      * The star named after him is the Kleene star "*"

  27. Inductive Definition
      A regular expression r over alphabet Σ is one of the following (L(r) is the language it represents):
      Atomic expressions (base cases):
        Ø               L(Ø) = Ø
        ε               L(ε) = { ε }
        a for a ∈ Σ     L(a) = { a }
      Inductively defined expressions:
        (r1 + r2)       L(r1 + r2) = L(r1) ∪ L(r2)      alt notation: (r1 | r2) or (r1 ∪ r2)
        (r1 r2)         L(r1 r2) = L(r1) L(r2)
        (r)*            L(r*) = L(r)*
      Any regular language has a regular expression and vice versa
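The inductive definition suggests a direct, if inefficient, membership test: represent a regular expression as a nested tuple and recurse on its structure. The sketch below is one such illustration. The tuple encoding and the name `matches` are assumptions for this example, and real regex engines use automata rather than trying every split.

```python
# Regular expressions as nested tuples:
#   ("empty",), ("eps",), ("sym", a), ("union", r1, r2), ("concat", r1, r2), ("star", r)

def matches(r, w):
    """True iff w is in L(r), following the inductive definition case by case."""
    kind = r[0]
    if kind == "empty":                  # L(Ø) = Ø
        return False
    if kind == "eps":                    # L(eps) = {eps}
        return w == ""
    if kind == "sym":                    # L(a) = {a}
        return w == r[1]
    if kind == "union":                  # L(r1) ∪ L(r2)
        return matches(r[1], w) or matches(r[2], w)
    if kind == "concat":                 # w = xy with x in L(r1), y in L(r2)
        return any(matches(r[1], w[:i]) and matches(r[2], w[i:])
                   for i in range(len(w) + 1))
    if kind == "star":                   # eps, or a nonempty piece followed by more
        if w == "":
            return True
        return any(matches(r[1], w[:i]) and matches(r, w[i:])
                   for i in range(1, len(w) + 1))
    raise ValueError("unknown expression kind: " + str(kind))


# (1(0+1)*) + 0 : binary integers without leading 0's
bit = ("union", ("sym", "0"), ("sym", "1"))
binary_int = ("union", ("concat", ("sym", "1"), ("star", bit)), ("sym", "0"))
```

In the star case the first piece is required to be nonempty, so every recursive call is on a strictly shorter string and the recursion terminates.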

  28. Regular Expressions
      • Can omit many parentheses
        – By following precedence rules: * before concatenation before +
          e.g., r*s + t ≡ ((r*)s) + t
        – By associativity: (r+s)+t ≡ r+s+t, (rs)t ≡ rst
      • More short-hand notation
        – e.g., r^+ ≡ rr* (note: + is in superscript)
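Python's `re` module follows the same precedence (star, then concatenation, then union), with union written `|` and the superscript + written as a postfix `+`. The sketch below compares patterns on all strings over {a, b, c} up to length 4. This is a bounded experiment with hypothetical helper names, not a proof of equivalence.

```python
import re
from itertools import product

# Compare two patterns on every string over {a, b, c} of length <= 4.

def all_strings(sigma, max_len):
    return ["".join(t) for n in range(max_len + 1)
            for t in product(sigma, repeat=n)]


samples = all_strings("abc", 4)


def same_language(p1, p2):
    """True if p1 and p2 accept exactly the same sample strings."""
    return all(
        (re.fullmatch(p1, s) is not None) == (re.fullmatch(p2, s) is not None)
        for s in samples
    )


precedence_ok = same_language(r"a*b|c", r"((a*)b)|c")   # r*s + t ≡ ((r*)s) + t
other_grouping = same_language(r"a*b|c", r"a*(b|c)")    # a different parse
plus_ok = same_language(r"(ab)+", r"(ab)(ab)*")         # r^+ ≡ rr*
```

The middle comparison comes out false: for instance, the string ac is accepted by `a*(b|c)` but not by `a*b|c`, which shows that the grouping matters.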
