Gábor Csernyi
Department of English Linguistics, University of Debrecen
gabor.csernyi@arts.unideb.hu
http://ieas.unideb.hu/~csernyi

The architecture of the Xerox Linguistic Environment (XLE) is modular:
- (tokenizer(s));
- (morphological analyzer(s));
- a lexicon, in the form of files containing lexical entries;
- a grammar, in the form of files comprising the grammatical rules (with functional annotations).

Different modules are responsible for the different (sub)tasks of parsing:

  MODULE                    TASK
  TOKENIZER                 tokenization
  MORPHOLOGICAL ANALYZER    morphological analysis
  LEXICON                   lexical lookup
  SYNTAX, SEMANTICS         chart parsing
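The division of labour above can be pictured as a chain of small functions. The following Python sketch is only an illustration: the function names, the toy data and the tags (+Noun, +Acc, +Verb, +Pres) are hypothetical and are not XLE's implementation or API.

```python
# Illustrative sketch only: names, data and tags are hypothetical, not XLE's API.

# Toy morphological analyzer: surface form -> list of (stem, tags) analyses.
TOY_MORPHOLOGY = {
    "almát": [("alma", ["+Noun", "+Acc"])],
    "eszik": [("eszik", ["+Verb", "+Pres"])],
}

# Toy lexicon: stem -> syntactic category.
TOY_LEXICON = {"alma": "N", "eszik": "V"}


def tokenize(sentence):
    """Stage 1: split the input string into tokens (whitespace only here)."""
    return sentence.split()


def analyze(token):
    """Stage 2: return every morphological analysis known for a token."""
    return TOY_MORPHOLOGY.get(token, [])


def lookup(stem):
    """Stage 3: look the stem up in the lexicon."""
    return TOY_LEXICON.get(stem)


def preprocess(sentence):
    """Run stages 1-3 and return what a chart parser (stage 4) would consume."""
    result = []
    for token in tokenize(sentence):
        for stem, tags in analyze(token):
            result.append((token, stem, lookup(stem), tags))
    return result
```

Calling preprocess("eszik almát") yields one (token, stem, category, tags) tuple per analysis. The syntax stage is left out on purpose, since chart parsing is beyond a short sketch.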

Relevant commands in XLE:
- tokens
- analyze-string <word-form>
- parse "<(category:) string>"

DEMO HUNGARIAN CONFIG (1.0)
  CHARACTERENCODING utf-8.
  ROOTCAT ROOT.
  FILES .
  LEXENTRIES (DEMO HUNGARIAN).
  RULES (DEMO HUNGARIAN).
  TEMPLATES (DEMO HUNGARIAN).
  GOVERNABLERELATIONS SUBJ OBJ OBJ2 OBL OBL-?+ COMP XCOMP.
  SEMANTICFUNCTIONS ADJUNCT TOPIC.
  NONDISTRIBUTIVES NUM PERS.
  EPSILON e.
  OPTIMALITYORDER NOGOOD.
----

The first line is special. It specifies:
- the version of the grammar (DEMO);
- the language (HUNGARIAN);
- that this is the configuration file (CONFIG);
- the XLE version number (1.0).

Other parts of the config file:
- FILES: all the (external) files needed for parsing (and also generation): tokenizers, morphological FSTs and other transducers, the lexicon(s), the grammar file(s);
- LEXENTRIES: the list of the lexicons (if there is more than one, their order might be important);
- RULES: the grammar (if the rules are split across different files, the header of each should be listed here);
- TEMPLATES: reference to the template file;

- GOVERNABLERELATIONS: a list of grammatical functions;
- SEMANTICFUNCTIONS: attributes whose values are required to contain a PRED;
- NONDISTRIBUTIVES: attributes that do not distribute when coordinated;
- EPSILON: a category that is not overt in the c-structure;
- OPTIMALITYORDER: ranking of optimality constraints;
- ----: the configuration file is closed with four dashes.

The lexical entries, the morphology (if it exists, together with the tokenizer(s)), the rules and the templates sections follow the same pattern:
- they start with a header (by which they can be referenced in the appropriate sections (LEXENTRIES, RULES, TEMPLATES, MORPHOLOGY) of the configuration file);
- they are terminated with four dashes.

These sections can be placed in the configuration file itself, or they can be stored in separate files (e.g. a lexicon for nouns, a lexicon for verbs, etc.). In the latter case, the files must also be listed in the FILES section.
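The header-plus-four-dashes pattern is easy to check mechanically. The Python sketch below is a hypothetical helper, not part of XLE: it splits a text into sections at lines consisting of four dashes and reads the header of each one. It assumes that every section header has the same shape as the first line of the configuration file (version, language, section type, number in parentheses).

```python
import re

# Assumed header shape: VERSION LANGUAGE TYPE (N.N), e.g. the config's first line.
HEADER = re.compile(r"^(\S+)\s+(\S+)\s+(\S+)\s+\((\d+\.\d+)\)$")


def split_sections(text):
    """Split text into sections, each closed by a line of four dashes.

    Blank lines are skipped; text after the last terminator is ignored.
    """
    sections, current = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "----":
            if current:
                sections.append(current)
            current = []
        elif stripped:
            current.append(stripped)
    return sections


def parse_header(line):
    """Read the four parts of a section header line."""
    match = HEADER.match(line)
    if match is None:
        raise ValueError(f"not a section header: {line!r}")
    version, language, kind, number = match.groups()
    return {"version": version, "language": language,
            "type": kind, "number": number}
```

For example, parse_header("DEMO HUNGARIAN CONFIG (1.0)") returns the version DEMO, the language HUNGARIAN, the type CONFIG and the number 1.0 as a dictionary.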

General form of rules:

  category --> category1: schemata1;
               category2: schemata2;
               … .

- A simple rule:

    S --> NP VP.

- Each rule is terminated with a period.

- Assigning grammatical functions in the rules (see schemata above):

    S --> NP: (^ SUBJ)=!
              (! CASE)=nom;
          VP.

- When schemata are given and the order of the categories is important, the schemata of each category must be followed by a semicolon (or by a period, if they end the rule).

- Optionality is expressed with parentheses surrounding the optional element in the rule.
- When the order of the categories on the right-hand side of the rule is "not important", the categories (with their schemata) are separated with a comma:

    VP --> V: ^=!, (NP: (^ OBJ)=!).

- Disjunction: the alternatives are separated by | and enclosed in curly brackets:

    NP --> {(D) N | PRON}.

- Kleene star: attached to a category on the right-hand side of the rule to allow zero or more repetitions of that category:

    VP --> V: ^=!; (NP: (^ OBJ)=!) PP*.
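Since the right-hand side of a rule is a regular expression over categories, the effect of optionality, disjunction and the Kleene star can be tried out with an ordinary regex engine. The Python sketch below is a simplified illustration, not XLE's rule compiler: the helper names are hypothetical, schemata are left out, and the comma (free order) notation is not handled.

```python
import re


def rhs_to_regex(rhs):
    """Translate a schemata-free right-hand side into a Python regex.

    Supported: plain categories, (X) for optionality, X* for the Kleene
    star and {A | B} for disjunction. Each category matches its own name
    followed by one space.
    """
    tokens = re.findall(r"[A-Za-z]+|[(){}|*]", rhs)
    parts = []
    for tok in tokens:
        if tok in ("(", "{"):
            parts.append("(?:")          # open a group
        elif tok == ")":
            parts.append(")?")           # close an optional group
        elif tok == "}":
            parts.append(")")            # close a disjunction
        elif tok in ("|", "*"):
            parts.append(tok)            # same meaning in both notations
        else:
            parts.append(f"(?:{tok} )")  # a category name
    return re.compile("".join(parts))


def matches(rhs, categories):
    """Check whether a sequence of categories fits the right-hand side."""
    sequence = "".join(category + " " for category in categories)
    return rhs_to_regex(rhs).fullmatch(sequence) is not None
```

With the right-hand side "V (NP) PP*", the sequences V and V NP PP PP are accepted, while NP V is rejected. With "{(D) N | PRON}", D N and PRON are accepted, but D PRON is not.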

General form of a lexical entry:

  word Category1 Morphcode1 Schemata1;
       Category2 Morphcode2 Schemata2;
       … .

Here,
- Category is the category of the word;
- Morphcode tells XLE whether the analyses of the word are provided by the morphological analyzer (in which case the morphcode is XLE), or whether only the lexical entry provides information about the given form (in which case the morphcode is *);
- Schemata are similar to those in the grammar file.

Example:

  eszik V * { (^ PRED)='eszik <(^ SUBJ)>'
            | (^ PRED)='eszik <(^ SUBJ) (^ OBJ)>'
              (^ OBJ CASE)=acc }
            (^ SUBJ CASE)=nom
            (^ SUBJ NUM)=sg
            (^ SUBJ PERS)=3
            (^ TNS-ASP TENSE)=pres
            (^ TNS-ASP MOOD)=indicative
            (^ TNS-ASP PROG)=+.

Multiple entries (for the same form) are handled with special tags:
- tags related to whole entries:
  - ETC: extension of the previous entry;
  - ONLY: keep only this entry;
- tags placed in front of subentries:
  - "+" adds a new subentry;
  - "-" removes a subentry;
  - "!" overrides a subentry;
  - "=" keeps a subentry.

Example:

Base entry:

  baa N XLE @(NOUN baa);
      V XLE @(VERB baa);
      A XLE @(ADJ baa).

In some later part:

  baa +P XLE @(PREP baa);
      =N;
      ONLY.

The effective entry:

  baa N XLE @(NOUN baa);
      P XLE @(PREP baa).
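The way the tags combine can be modelled in a few lines of Python. This is a sketch of the behaviour as it can be read off the example above, not XLE's actual algorithm: the function name and the data layout are hypothetical, and the treatment of ETC (keeping the base subentries that the later entry does not mention) is an assumption based on its description as an extension of the previous entry.

```python
def merge_entry(base, update, mode):
    """Combine a base entry with a later entry for the same form.

    base:   dict mapping category -> definition
    update: list of (tag, category, definition), tag in '+', '-', '!', '='
    mode:   'ETC' keeps base subentries the update does not mention,
            'ONLY' drops them.
    """
    result = dict(base) if mode == "ETC" else {}
    for tag, category, definition in update:
        if tag == "-":
            result.pop(category, None)         # remove the subentry
        elif tag == "=":
            result[category] = base[category]  # keep the earlier subentry
        else:
            # '+' adds a new subentry, '!' overrides an existing one
            result[category] = definition
    return result


base = {"N": "@(NOUN baa)", "V": "@(VERB baa)", "A": "@(ADJ baa)"}
update = [("+", "P", "@(PREP baa)"), ("=", "N", None)]
effective = merge_entry(base, update, "ONLY")
```

Here effective contains only the N and P subentries, as in the example. With "ETC" in place of "ONLY", the V and A subentries would survive as well.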

xfst lexc

Tokenizers:
- Finite-state transducers
- Non-deterministic
- Possible additional functions (Kaplan and Newman 1997):
  - Normalization: removing additional white space
  - Editing: removing tags from annotated text
  - Capitalization handling: upper case, lower case
  - Contraction handling
  - Compound word isolation / multiword expression recognition
- More than one tokenizer can be used, through composition

Morphological analyzers:
- Finite-state transducers, which make analysis fast (other external analyzers are also possible, provided their output is properly mapped to what XLE expects at this level)
- Non-deterministic: multiple analyses when possible
- Stemming, with morphological features given as tags (e.g. +Nom)
- Composition of morphological FSTs is possible (also: union or priority union)
- Guessers (Kaplan et al. 2004)
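The difference between union and priority union, and the role of a guesser as a fallback, can be illustrated without any finite-state machinery. The Python sketch below models analyzers as plain functions from a word form to a list of analyses. All names, the toy data and the tags other than the general stem-plus-tags format are hypothetical, and real systems combine transducers rather than functions.

```python
# Toy data (hypothetical tags): Hungarian "várt" is ambiguous between the
# past tense of "vár" (to wait) and the accusative of "vár" (castle).
TOY_ANALYSES = {
    "várt": ["vár+Verb+Past", "vár+Noun+Acc"],
}


def lexical_analyzer(form):
    """Non-deterministic lookup: return every known analysis of the form."""
    return list(TOY_ANALYSES.get(form, []))


def guesser(form):
    """Fallback analyzer: guess that any form is a noun."""
    return [form + "+Noun+Guessed"]


def union(*analyzers):
    """Union: collect the analyses of all analyzers, without duplicates."""
    def combined(form):
        results = []
        for analyzer in analyzers:
            for analysis in analyzer(form):
                if analysis not in results:
                    results.append(analysis)
        return results
    return combined


def priority_union(*analyzers):
    """Priority union: use the first analyzer that returns any analysis."""
    def combined(form):
        for analyzer in analyzers:
            results = analyzer(form)
            if results:
                return results
        return []
    return combined


analyze = priority_union(lexical_analyzer, guesser)
```

With priority union, analyze("várt") returns the two lexical analyses and the guesser is consulted only for forms the lexical analyzer does not know. With plain union, the guessed analysis would be added to every result.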

- Butt, Miriam, King, Tracy H., Niño, María-Eugenia and Segond, Frédérique. 1999. A Grammar Writer's Cookbook. Stanford: CSLI Publications.
- Kaplan, Ronald M. and Newman, Paula S. 1997. Lexical Resource Reconciliation in the Xerox Linguistic Environment. In ACL/EACL'98 Computational environments for grammar development and linguistic engineering, 54-61.
- Kaplan, Ronald M., Maxwell, John T., King, Tracy H. and Crouch, Richard. 2004. Integrating Finite-state Technology with Deep LFG Grammars. In Proceedings of the ESSLLI 2004 Workshop on Combining Shallow and Deep Processing for NLP.
