
Some useful tasks involving language; More useful tasks involving language



Finite-State Machines and Regular Languages
Detmar Meurers: Intro to Computational Linguistics I
OSU, LING 684.01

Some useful tasks involving language
• Find all phone numbers in a text, e.g., occurrences such as
  When you call (614) 292-8833, you reach the fax machine.
• Find multiple adjacent occurrences of the same word in a text, as in
  I read the the book.
• Determine the language of the following utterance: French or Polish?
  Czy pasazer jadacy do Warszawy moze jechac przez Londyn?

More useful tasks involving language
• Look up the following words in a dictionary:
  laughs, became, unidentifiable, Thatcherization
• Determine the part-of-speech of words like the following, even if you can’t find them in the dictionary:
  conurbation, cadence, disproportionality, lyricism, parlance
⇒ Such tasks can be addressed using so-called finite-state machines.
⇒ How can such machines be specified?

Regular expressions
• A regular expression is a description of a set of strings, i.e., a language.
• They can be used to search for occurrences of these strings.
• A variety of unix tools (grep, sed), editors (emacs), and programming languages (perl, python) incorporate regular expressions.
• Just like any other formalism, regular expressions as such have no linguistic contents, but they can be used to refer to linguistic units.

The syntax of regular expressions (1)
Regular expressions consist of
• strings of characters: c , A100 , natural language , 30 years!
• disjunction:
  – ordinary disjunction: devoured|ate , famil(y|ies)
  – character classes: [Tt]he , bec[oa]me
  – ranges: [A-Z] (a capital letter)
• negation: [^a] (any symbol but a ), [^A-Z0-9] (not an uppercase letter or number)

The syntax of regular expressions (2)
• counters
  – optionality: ?
    colou?r
  – any number of occurrences: * (Kleene star)
    [0-9]* years
  – at least one occurrence: +
    [0-9]+ dollars
• wildcard for any character: .
  beg.n for any character in between beg and n

The syntax of regular expressions (3)
Operator precedence, from highest to lowest:
  parentheses ()
  counters * + ?
  character sequences
  disjunction |
Note: The various unix tools and languages differ w.r.t. the exact syntax of the regular expressions they allow.

Regular languages
How can the class of regular languages which is specified by regular expressions be characterized?
Let $\Sigma$ be the set of all symbols of the language, the alphabet, then:
1. $\{\}$ is a regular language
2. $\forall a \in \Sigma$: $\{a\}$ is a regular language
3. If $L_1$ and $L_2$ are regular languages, so are:
   (a) the concatenation of $L_1$ and $L_2$: $L_1 \cdot L_2 = \{xy \mid x \in L_1, y \in L_2\}$
   (b) the union of $L_1$ and $L_2$: $L_1 \cup L_2$
   (c) the Kleene closure of $L$: $L^* = L^0 \cup L^1 \cup L^2 \cup \ldots$ where $L^i$ is the set of strings obtained by concatenating $i$ strings from $L$ (so $L^0$ contains only the empty string).

Properties of regular languages
The regular languages are closed under ($L_1$ and $L_2$ regular languages):
• concatenation: $L_1 \cdot L_2$
  set of strings with beginning in $L_1$ and continuation in $L_2$
• Kleene closure: $L_1^*$
  set of strings formed by repeatedly concatenating strings from $L_1$
• union: $L_1 \cup L_2$
  set of strings in $L_1$ or in $L_2$
• complementation: $\Sigma^* - L_1$
  set of all possible strings that are not in $L_1$
• difference: $L_1 - L_2$
  set of strings which are in $L_1$ but not in $L_2$
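The regular expression syntax from the slides, and the first two tasks they mention, can be tried out with Python's `re` module. This is a minimal sketch, not part of the original deck: the helper names `matches`, `find_phone_numbers`, and `find_repeated_words` are hypothetical, and the phone pattern assumes numbers always look exactly like `(614) 292-8833`. The repeated-word pattern uses a backreference (`\1`) and the word-boundary anchor `\b`, which are conveniences of Python's `re` that do not appear on the slides; backreferences in particular go beyond what regular languages can express.

```python
import re

# Phone numbers shaped like "(614) 292-8833" (assumed format), written with
# only the character ranges and escapes needed for the literal parentheses.
PHONE = re.compile(r"\([0-9][0-9][0-9]\) [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]")

# The same word twice in a row, e.g. "the the". The backreference \1 is a
# Python extension and not part of the regular expression syntax on the slides.
REPEATED = re.compile(r"\b(\w+) \1\b")


def matches(pattern, text):
    """Return True if the whole of `text` is described by `pattern`."""
    return re.fullmatch(pattern, text) is not None


def find_phone_numbers(text):
    """Return all substrings of `text` that look like phone numbers."""
    return PHONE.findall(text)


def find_repeated_words(text):
    """Return each word that occurs twice in direct succession."""
    return [m.group(1) for m in REPEATED.finditer(text)]


if __name__ == "__main__":
    print(find_phone_numbers("When you call (614) 292-8833, you reach the fax machine."))
    print(find_repeated_words("I read the the book."))
    # Syntax examples taken from the slides.
    for pattern, text in [
        (r"colou?r", "color"),
        (r"famil(y|ies)", "families"),
        (r"[0-9]+ dollars", "30 dollars"),
        (r"beg.n", "begun"),
        (r"[^A-Z0-9]", "A"),
    ]:
        print(pattern, text, matches(pattern, text))
```

`re.fullmatch` is used for the syntax examples so that each pattern is tested against the whole string, which corresponds to asking whether the string belongs to the language the expression describes.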

Properties of regular languages (continued)
The regular languages are also closed under:
• intersection: $L_1 \cap L_2$
  set of strings in both $L_1$ and $L_2$
• reversal: $L_1^R$
  set of the reversal of all strings in $L_1$

Finite state machines
Finite state machines (or automata) (FSM, FSA) recognize or generate regular languages, exactly those specified by regular expressions.
Example:
• Regular expression: colou?r
• Finite state machine:
  [Figure: state diagram for colou?r with states 0 to 6 and arcs labeled c, o, l, o, u, r, plus a second arc labeled r]

Defining finite state automata
A finite state automaton is a quintuple $(Q, \Sigma, E, S, F)$ with
• $Q$ a finite set of states
• $\Sigma$ a finite set of symbols, the alphabet
• $S \subseteq Q$ the set of start states
• $F \subseteq Q$ the set of final states
• $E$ a set of edges $Q \times (\Sigma \cup \{\epsilon\}) \times Q$
The transition function $d$ can be defined as
$d(q, a) = \{q' \in Q \mid \exists (q, a, q') \in E\}$

Language accepted by an FSA
The extended set of edges $\hat{E} \subseteq Q \times \Sigma^* \times Q$ is the smallest set such that
• $\forall (q, \sigma, q') \in E$: $(q, \sigma, q') \in \hat{E}$
• $\forall (q_0, \sigma_1, q_1), (q_1, \sigma_2, q_2) \in \hat{E}$: $(q_0, \sigma_1 \sigma_2, q_2) \in \hat{E}$
The language $L(A)$ of a finite state automaton $A$ is defined as
$L(A) = \{w \mid q_s \in S, q_f \in F, (q_s, w, q_f) \in \hat{E}\}$

Finite state transition networks (FSTN)
Finite state transition networks are graphical descriptions of finite state machines:
• nodes represent the states
• start states are marked with a short arrow
• final states are indicated by a double circle
• arcs represent the transitions

Example for a finite state transition network
[Figure: network with states S0, S1, S2, S3 and arcs labeled a, b, c, b, b]
Regular expression specifying the language generated or accepted by the corresponding FSM: ab|cb+

Finite state transition tables
Finite state transition tables are an alternative, textual way of describing finite state machines:
• the rows represent the states
• start states are marked with a dot after their name
• final states with a colon
• the columns represent the alphabet
• the fields in the table encode the transitions

The example specified as finite state transition table

        a     b        c     d
  S0.   S1             S2
  S1          S3:
  S2          S2,S3:
  S3:

Some properties of finite state machines
• Recognition problem can be solved in linear time (independent of the size of the automaton).
• There is an algorithm to transform each automaton into a unique equivalent automaton with the least number of states.
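The quintuple definition and the example network for ab|cb+ can be made concrete with a small simulation. The sketch below is an illustration rather than code from the course: the edge set is read off the transition table above (an assumption about the figure), the names `EDGES`, `START`, `FINAL`, `d`, and `accepts` are hypothetical, and ε transitions are left out because the example has none.

```python
# Edges E of the example network as (state, symbol, state) triples,
# read off the transition table for ab|cb+ (assumed to match the figure).
EDGES = {
    ("S0", "a", "S1"),
    ("S0", "c", "S2"),
    ("S1", "b", "S3"),
    ("S2", "b", "S2"),
    ("S2", "b", "S3"),
}
START = {"S0"}  # S, the set of start states
FINAL = {"S3"}  # F, the set of final states


def d(q, a):
    """Transition function d(q, a) = {q' | (q, a, q') in E}."""
    return {target for (source, symbol, target) in EDGES
            if source == q and symbol == a}


def accepts(word):
    """Return True if some path from a start state to a final state spells `word`."""
    current = set(START)
    for symbol in word:
        # Follow every applicable edge from every state reached so far.
        current = {target for q in current for target in d(q, symbol)}
    return bool(current & FINAL)


if __name__ == "__main__":
    for word in ["ab", "cb", "cbbb", "abb", "c", ""]:
        print(repr(word), accepts(word))
```

Tracking the set of all states reachable so far handles the non-determinism at S2, where the symbol b leads to both S2 and S3, and the input is read exactly once from left to right.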

Deterministic Finite State Automata
A finite state automaton is deterministic iff it has
• no $\epsilon$ transitions and
• for each state and each symbol there is at most one applicable transition.
Every non-deterministic automaton can be transformed into a deterministic one:
• Define new states representing a disjunction of old states for each non-determinacy which arises.
• Define arcs for these states corresponding to each transition which is defined in the non-deterministic automaton for one of the disjuncts in the new state names.

Example: Determinization of FSA
[Figure: a non-deterministic automaton with states 1 to 6 and arcs labeled a, b, c, d, e, next to its determinized counterpart, which introduces the combined states {3,5}, {4,5}, and {5,6}]

From Automata to Transducers
Needed: mechanism to keep track of path taken
A finite state transducer is a 6-tuple $(Q, \Sigma_1, \Sigma_2, E, S, F)$ with
• $Q$ a finite set of states
• $\Sigma_1$ a finite set of symbols, the input alphabet
• $\Sigma_2$ a finite set of symbols, the output alphabet
• $S \subseteq Q$ the set of start states
• $F \subseteq Q$ the set of final states
• $E$ a set of edges $Q \times (\Sigma_1 \cup \{\epsilon\}) \times Q \times (\Sigma_2 \cup \{\epsilon\})$

Transducers and determinization
A finite state transducer understood as consuming an input and producing an output cannot generally be determinized.
Example:
[Figure: transducer with arcs labeled a:b, b:b, a:b, a:c, c:c, a:c]

Summary
• Notations for characterizing regular languages:
  – Regular expressions
  – Finite state transition networks
  – Finite state transition tables
• Finite state machines and regular languages: Definitions and some properties
• Finite state transducers

Reading assignment 2
• Chapter 1 “Finite State Techniques” of course notes
• Chapter 2 “Regular expressions and automata” of Jurafsky and Martin (2000)
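The determinization procedure from the Deterministic Finite State Automata slide (new states standing for sets of old states, with arcs collected from every member of the set) is commonly called the subset construction. The sketch below is one possible rendering under stated assumptions: the automaton has no ε transitions, the function name `determinize` and the variable `NFA_EDGES` are hypothetical, and because the deck's own determinization figure is not recoverable from this transcript, the input is the ab|cb+ network from the earlier slides instead.

```python
from collections import deque

# Non-deterministic example: the ab|cb+ network, where S2 has two b arcs.
NFA_EDGES = {
    ("S0", "a", "S1"),
    ("S0", "c", "S2"),
    ("S1", "b", "S3"),
    ("S2", "b", "S2"),
    ("S2", "b", "S3"),
}


def determinize(edges, start_states, final_states):
    """Subset construction for an automaton without epsilon transitions.

    Returns the new start state, a dict mapping (state set, symbol) to the
    single successor state set, and the set of new final states.
    """
    alphabet = sorted({symbol for (_, symbol, _) in edges})
    start = frozenset(start_states)
    dfa_edges = {}
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for symbol in alphabet:
            # Collect the targets of every member of the current state set.
            target = frozenset(
                t for (s, sym, t) in edges if s in current and sym == symbol
            )
            if not target:
                continue  # no applicable transition for this symbol
            dfa_edges[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    # A new state is final if it contains at least one old final state.
    dfa_finals = {state for state in seen if state & set(final_states)}
    return start, dfa_edges, dfa_finals


if __name__ == "__main__":
    start, dfa_edges, dfa_finals = determinize(NFA_EDGES, {"S0"}, {"S3"})
    for (state, symbol), target in sorted(
        dfa_edges.items(), key=lambda item: (sorted(item[0][0]), item[0][1])
    ):
        print(sorted(state), symbol, "->", sorted(target))
```

Only state sets reachable from the start set are created, so the result stays small for this example; in the worst case the construction can still produce exponentially many states.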
