Some useful tasks involving language


  1. Finite-State Machines and Regular Languages
     Detmar Meurers: Intro to Computational Linguistics I
     OSU, LING 684.01

     Some useful tasks involving language
     • Find all phone numbers in a text, e.g., occurrences such as
       When you call (614) 292-8833, you reach the fax machine.
     • Find multiple adjacent occurrences of the same word in a text, as in
       I read the the book.
     • Determine the language of the following utterance: French or Polish?
       Czy pasazer jadacy do Warszawy moze jechac przez Londyn?

     More useful tasks involving language
     • Look up the following words in a dictionary:
       laughs, became, unidentifiable, Thatcherization
     • Determine the part-of-speech of words like the following, even if you can’t find them in the dictionary:
       conurbation, cadence, disproportionality, lyricism, parlance
     ⇒ Such tasks can be addressed using so-called finite-state machines.
     ⇒ How can such machines be specified?

     Regular expressions
     • A regular expression is a description of a set of strings, i.e., a language.
     • They can be used to search for occurrences of these strings.
     • A variety of unix tools (grep, sed), editors (emacs), and programming languages (perl, python) incorporate regular expressions.
     • Just like any other formalism, regular expressions as such have no linguistic contents, but they can be used to refer to linguistic units.

     The syntax of regular expressions (1)
     Regular expressions consist of
     • strings of characters: c , A100 , natural language , 30 years!
     • disjunction:
       – ordinary disjunction: devoured|ate , famil(y|ies)
       – character classes: [Tt]he , bec[oa]me
       – ranges: [A-Z] (a capital letter)
     • negation: [^a] (any symbol but a)
       [^A-Z0-9] (not an uppercase letter or number)

     The syntax of regular expressions (2)
     • counters
       – optionality: ?
         colou?r
       – any number of occurrences: * (Kleene star)
         [0-9]* years
       – at least one occurrence: +
         [0-9]+ dollars
     • wildcard for any character: .
       beg.n for any character in between beg and n
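The tasks and operators listed on these slides can be tried out directly. The following is a minimal sketch using Python's standard `re` module; the sample text and all variable names are illustrative assumptions, not part of the slides. Note that the repeated-word pattern relies on a backreference (`\1`), a convenience of practical regex engines that goes beyond the regular expressions defined formally later in the deck.

```python
import re

# Sample text assembled from the example sentences on the slides.
text = ("When you call (614) 292-8833, you reach the fax machine. "
        "I read the the book.")

# Phone numbers of the form (614) 292-8833 (an assumed, US-style format).
phone = re.compile(r"\(\d{3}\) \d{3}-\d{4}")
phones = phone.findall(text)

# Adjacent repeated words: \1 refers back to the word captured by (\w+).
doubled = re.compile(r"\b(\w+)\s+\1\b")
repeats = [m.group(1) for m in doubled.finditer(text)]

# Optionality: colou?r matches both spellings, but not "colouur".
spellings = re.findall(r"colou?r", "color colour colouur")

# Wildcard: beg.n allows any single character between "beg" and "n".
wildcard = re.findall(r"beg.n", "begin began begun")

# Counter +: at least one digit followed by " dollars".
amounts = re.findall(r"[0-9]+ dollars", "It costs 30 dollars")

# Negated character class: anything but an uppercase letter or digit.
others = re.findall(r"[^A-Z0-9]", "A1b2C")

print(phones, repeats, spellings, wildcard, amounts, others)
```

As the slides note, tools differ in the exact syntax they accept; `\d`, `\w` and `\b` are Python (and Perl) shorthands and may be spelled differently in other tools.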

  2. The syntax of regular expressions (3)
     Operator precedence, from highest to lowest:
       parentheses           ()
       counters              * + ?
       character sequences
       disjunction           |
     Note: The various unix tools and languages differ w.r.t. the exact syntax of the regular expressions they allow.

     Regular languages
     How can the class of regular languages which is specified by regular expressions be characterized?
     Let Σ be the set of all symbols of the language, the alphabet, then:
     1. {} is a regular language
     2. ∀ a ∈ Σ: {a} is a regular language
     3. If L₁ and L₂ are regular languages, so are:
        (a) the concatenation of L₁ and L₂: L₁ · L₂ = { xy | x ∈ L₁, y ∈ L₂ }
        (b) the union of L₁ and L₂: L₁ ∪ L₂
        (c) the Kleene closure of a regular language L: L* = L⁰ ∪ L¹ ∪ L² ∪ ..., where Lⁱ is the language of all strings formed by concatenating i strings from L (so L⁰ = {ε})

     Properties of regular languages
     The regular languages are closed under (L₁ and L₂ regular languages):
     • concatenation: L₁ · L₂
       set of strings with beginning in L₁ and continuation in L₂
     • Kleene closure: L₁*
       set of concatenations of zero or more strings from L₁
     • union: L₁ ∪ L₂
       set of strings in L₁ or in L₂
     • complementation: Σ* − L₁
       set of all possible strings that are not in L₁
     • difference: L₁ − L₂
       set of strings which are in L₁ but not in L₂
     • intersection: L₁ ∩ L₂
       set of strings in both L₁ and L₂
     • reversal: L₁ᴿ
       set of the reversal of all strings in L₁

     Finite state machines
     Finite state machines (or automata) (FSM, FSA) recognize or generate regular languages, exactly those specified by regular expressions.
     Example:
     • Regular expression: colou?r
     • Finite state machine:
       [Figure: finite state machine for colou?r, with states 0–6 and arcs labeled c, o, l, o, u, r]

     Defining finite state automata
     A finite state automaton is a quintuple (Q, Σ, E, S, F) with
     • Q a finite set of states
     • Σ a finite set of symbols, the alphabet
     • S ⊆ Q the set of start states
     • F ⊆ Q the set of final states
     • E ⊆ Q × (Σ ∪ {ε}) × Q a set of edges
     The transition function d can be defined as
       d(q, a) = { q′ ∈ Q | ∃ (q, a, q′) ∈ E }
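To make the quintuple definition concrete, here is a minimal sketch in Python that encodes an automaton for colou?r and checks words against it. The class and function names (`FSA`, `d`, `eps_closure`, `accepts`), the state numbering 0 to 6, and the use of an ε-edge for the optional u are assumptions made for illustration; they are not taken from the figure on the slide.

```python
from typing import NamedTuple

EPS = ""  # stands in for the empty symbol ε

class FSA(NamedTuple):
    states: frozenset    # Q
    alphabet: frozenset  # Σ
    edges: frozenset     # E, a set of triples (q, a, q')
    starts: frozenset    # S
    finals: frozenset    # F

def d(fsa, q, a):
    """Transition function d(q, a) = { q' | (q, a, q') in E }."""
    return {q2 for (q1, sym, q2) in fsa.edges if q1 == q and sym == a}

def eps_closure(fsa, states):
    """All states reachable from `states` via ε-edges only."""
    seen, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for q2 in d(fsa, q, EPS):
            if q2 not in seen:
                seen.add(q2)
                stack.append(q2)
    return seen

def accepts(fsa, word):
    """Track the set of reachable states while reading `word`."""
    current = eps_closure(fsa, fsa.starts)
    for a in word:
        step = set()
        for q in current:
            step |= d(fsa, q, a)
        current = eps_closure(fsa, step)
    return bool(current & fsa.finals)

# Hypothetical encoding of colou?r: the ε-edge 4 -> 5 makes "u" optional.
colour = FSA(
    states=frozenset(range(7)),
    alphabet=frozenset("colur"),
    edges=frozenset({
        (0, "c", 1), (1, "o", 2), (2, "l", 3), (3, "o", 4),
        (4, "u", 5), (4, EPS, 5), (5, "r", 6),
    }),
    starts=frozenset({0}),
    finals=frozenset({6}),
)

for w in ["color", "colour", "colouur", "colr"]:
    print(w, accepts(colour, w))
```

Encoding the optional u as an extra edge 4 -r-> 6 instead of an ε-edge would describe the same language; the ε-edge is used here only because the definition explicitly allows it.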

  3. Language accepted by an FSA
     The extended set of edges Ê ⊆ Q × Σ* × Q is the smallest set such that
     • ∀ (q, σ, q′) ∈ E: (q, σ, q′) ∈ Ê
     • ∀ (q₀, σ₁, q₁), (q₁, σ₂, q₂) ∈ Ê: (q₀, σ₁σ₂, q₂) ∈ Ê
     The language L(A) of a finite state automaton A is defined as
       L(A) = { w | q_s ∈ S, q_f ∈ F, (q_s, w, q_f) ∈ Ê }

     Finite state transition networks (FSTN)
     Finite state transition networks are graphical descriptions of finite state machines:
     • nodes represent the states
     • start states are marked with a short arrow
     • final states are indicated by a double circle
     • arcs represent the transitions

     Example for a finite state transition network
     [Figure: transition network with states S0 (start), S1, S2 and S3 (final), and arcs S0 -a-> S1, S1 -b-> S3, S0 -c-> S2, S2 -b-> S2, S2 -b-> S3]
     Regular expression specifying the language generated or accepted by the corresponding FSM: ab|cb+

     Finite state transition tables
     Finite state transition tables are an alternative, textual way of describing finite state machines:
     • the rows represent the states
     • start states are marked with a dot after their name
     • final states with a colon
     • the columns represent the alphabet
     • the fields in the table encode the transitions

     The example specified as finite state transition table

            a     b        c     d
       S0.  S1             S2
       S1         S3:
       S2         S2,S3:
       S3:

     Some properties of finite state machines
     • Recognition problem can be solved in linear time (independent of the size of the automaton).
     • There is an algorithm to transform each automaton into a unique equivalent automaton with the least number of states.
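The transition table above translates almost directly into a nested dictionary. The following sketch, with hypothetical names `TABLE`, `START`, `FINAL` and `recognize`, reads a word once from left to right and compares the result with the regular expression ab|cb+ given on the slide. Because the table is non-deterministic (S2 has two b-successors), the sketch tracks a set of states, so its cost per symbol does depend on the number of states; a constant cost per symbol requires a deterministic table.

```python
import re

# Rows are states, columns are symbols, fields are sets of target states.
# The column d has no entries, so it is simply absent from every row.
TABLE = {
    "S0": {"a": {"S1"}, "c": {"S2"}},
    "S1": {"b": {"S3"}},
    "S2": {"b": {"S2", "S3"}},
    "S3": {},
}
START = {"S0"}   # marked with a dot in the table
FINAL = {"S3"}   # marked with a colon in the table

def recognize(word):
    """Return True if some path labeled `word` leads from a start to a final state."""
    current = set(START)
    for symbol in word:
        current = {q2 for q in current for q2 in TABLE[q].get(symbol, set())}
    return bool(current & FINAL)

# Cross-check against the regular expression from the slide.
for w in ["ab", "cb", "cbbb", "c", "abb", "d", ""]:
    assert recognize(w) == bool(re.fullmatch(r"ab|cb+", w))
    print(repr(w), recognize(w))
```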

  4. Deterministic Finite State Automata
     A finite state automaton is deterministic iff it has
     • no ε transitions and
     • for each state and each symbol there is at most one applicable transition.
     Every non-deterministic automaton can be transformed into a deterministic one:
     • Define new states representing a disjunction of old states for each non-determinacy which arises.
     • Define arcs for these states corresponding to each transition which is defined in the non-deterministic automaton for one of the disjuncts in the new state names.

     Example: Determinization of FSA
     [Figure: a non-deterministic automaton with states 1–6 and arcs labeled a–e, next to its determinized equivalent, whose new states include {3,5}, {4,5} and {5,6}]

     From Automata to Transducers
     Needed: mechanism to keep track of path taken
     A finite state transducer is a 6-tuple (Q, Σ₁, Σ₂, E, S, F) with
     • Q a finite set of states
     • Σ₁ a finite set of symbols, the input alphabet
     • Σ₂ a finite set of symbols, the output alphabet
     • S ⊆ Q the set of start states
     • F ⊆ Q the set of final states
     • E a set of edges Q × (Σ₁ ∪ {ε}) × Q × (Σ₂ ∪ {ε})

     Transducers and determinization
     A finite state transducer understood as consuming an input and producing an output cannot generally be determinized.
     Example:
     [Figure: example transducer with arcs labeled a:b, b:b, a:c and c:c]

     Summary
     • Notations for characterizing regular languages:
       – Regular expressions
       – Finite state transition networks
       – Finite state transition tables
     • Finite state machines and regular languages: Definitions and some properties
     • Finite state transducers

     Reading assignment 2
     • Ch. 1 “Finite State Techniques” of course notes
     • Ch. 2 “Regular expressions and automata”, Jurafsky & Martin (2000)
     • For a more in-depth discussion of the NLP aspects, take a look at:
       – Chapter 1 (Introduction) of E. Roche and Y. Schabes (1997): Finite State Language Processing. MIT Press.
       – Richard Sproat, “Lexical Analysis”, in Robert Dale, Hermann Moisl, and Harold Somers (eds.) Handbook of NLP. 2000.
     • Good reference books on the theoretical computer science aspects:
       – “Elements of the theory of computation” H.R. Lewis, C.H. Papadimitriou. Prentice-Hall. 2nd Ed. 1998
       – “Introduction to Automata Theory, Languages, and Computation.” John E. Hopcroft, Rajeev Motwani, Jeffrey D. Ullman. 2nd Ed. 2001. Addison-Wesley, or the 1979 version by John E. Hopcroft and Jeffrey D. Ullman.
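The determinization recipe from the slides (new states standing for sets of old states, with arcs collected from each member) is commonly called the subset construction. Below is a minimal sketch of it in Python, applied to the non-deterministic automaton for ab|cb+ from the transition table example. The function names `determinize` and `run` are hypothetical, and the sketch assumes the input automaton has no ε transitions.

```python
from collections import deque

def determinize(table, starts, finals):
    """Subset construction for an automaton without ε transitions.

    `table` maps state -> symbol -> set of target states. Each state of
    the result is a frozenset of old states (a "disjunction" of them).
    """
    start = frozenset(starts)
    dfa = {}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state in dfa:
            continue  # already expanded
        dfa[state] = {}
        symbols = {a for q in state for a in table.get(q, {})}
        for a in sorted(symbols):
            # The arc for `a` leads to the union of all members' targets.
            target = frozenset(
                q2 for q in state for q2 in table.get(q, {}).get(a, ())
            )
            dfa[state][a] = target
            queue.append(target)
    dfa_finals = {s for s in dfa if s & set(finals)}
    return start, dfa, dfa_finals

def run(start, dfa, dfa_finals, word):
    """Deterministic recognition: exactly one current state at any time."""
    state = start
    for a in word:
        state = dfa[state].get(a)
        if state is None:
            return False
    return state in dfa_finals

# Non-deterministic automaton for ab|cb+ (S2 has two b-successors).
nfa = {
    "S0": {"a": {"S1"}, "c": {"S2"}},
    "S1": {"b": {"S3"}},
    "S2": {"b": {"S2", "S3"}},
    "S3": {},
}
start, dfa, dfa_finals = determinize(nfa, {"S0"}, {"S3"})

for state, arcs in dfa.items():
    print(sorted(state), {a: sorted(t) for a, t in arcs.items()})
```

Only subsets reachable from the start state are built, which keeps the result small here; in the worst case the number of subsets can grow exponentially with the number of original states.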
