Gábor Csernyi Department of English Linguistics University of - - PowerPoint PPT Presentation

SLIDE 1

Gábor Csernyi

Department of English Linguistics University of Debrecen gabor.csernyi@arts.unideb.hu http://ieas.unideb.hu/~csernyi

SLIDE 2

The architecture of the Xerox Linguistic Environment (XLE) reflects a modular pattern:

  • (tokenizer(s));
  • (morphological analyzer(s));
  • lexicon, in the form of files containing lexical entries;
  • grammar, in the form of files comprising the grammatical rules (with functional annotations).

SLIDE 3

Different modules are responsible for the different (sub)tasks of parsing:

  • tokenization: TOKENIZER
  • morphological analysis: MORPHOLOGICAL ANALYZER
  • lexical lookup: LEXICON
  • chart parsing: SYNTAX, SEMANTICS
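
The division of labour among the modules can be sketched in Python. This is only a toy illustration, not XLE itself: the function names, the tag names (+Noun, +Acc) and the miniature morphology and lexicon tables are all hypothetical, and the chart parsing step is left out.

```python
# Toy stand-ins for the first three XLE modules (hypothetical data).
TOY_MORPHOLOGY = {
    "almát": [("alma", ("+Noun", "+Acc"))],  # assumed tag names
}
TOY_LEXICON = {"János": "N", "alma": "N", "eszik": "V"}


def tokenize(sentence):
    """Tokenization: split the input string into word forms."""
    return sentence.rstrip(".").split()


def analyze(token):
    """Morphological analysis: return (stem, tags) pairs for a token.

    Unknown tokens are passed through unchanged with no tags.
    """
    return TOY_MORPHOLOGY.get(token, [(token, ())])


def lookup(stem):
    """Lexical lookup: return the category recorded for a stem."""
    return TOY_LEXICON[stem]


def pipeline(sentence):
    """Run the three steps in order; a chart parser would consume the result."""
    result = []
    for token in tokenize(sentence):
        for stem, _tags in analyze(token):
            result.append((token, stem, lookup(stem)))
    return result
```

Calling pipeline("János almát eszik.") yields one (token, stem, category) triple per word, which is the kind of input the chart parser would then work on.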

SLIDE 4

Relevant commands in XLE:

  • tokens
  • analyze-string <word-form>
  • parse “<(category:) string>”

SLIDE 5

DEMO HUNGARIAN CONFIG (1.0)
  CHARACTERENCODING utf-8.
  ROOTCAT ROOT.
  FILES .
  LEXENTRIES (DEMO HUNGARIAN).
  RULES (DEMO HUNGARIAN).
  TEMPLATES (DEMO HUNGARIAN).
  GOVERNABLERELATIONS SUBJ OBJ OBJ2 OBL OBL-?+ COMP XCOMP.
  SEMANTICFUNCTIONS ADJUNCT TOPIC.
  NONDISTRIBUTIVES NUM PERS.
  EPSILON e.
  OPTIMALITYORDER NOGOOD.

SLIDE 6

The first line is special; it specifies:

  • the version of the grammar (DEMO);
  • the language (HUNGARIAN);
  • that this is the configuration file (CONFIG);
  • the XLE version number (1.0).
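
The four parts of the first line can be pulled apart with a short Python sketch. It assumes the line always has exactly the shape shown on the slide (three whitespace-separated words followed by a parenthesised number); the function name and the dictionary keys are made up for illustration.

```python
import re

# Three words followed by a parenthesised version number,
# as in "DEMO HUNGARIAN CONFIG (1.0)".
HEADER = re.compile(r"^(\S+)\s+(\S+)\s+(\S+)\s+\(([^)]+)\)$")


def parse_header(line):
    """Split a header line into the four fields listed above."""
    match = HEADER.match(line.strip())
    if match is None:
        raise ValueError(f"not a header line: {line!r}")
    version, language, component, xle_version = match.groups()
    return {
        "version": version,
        "language": language,
        "component": component,
        "xle_version": xle_version,
    }
```

For the demo grammar, parse_header("DEMO HUNGARIAN CONFIG (1.0)") returns DEMO, HUNGARIAN, CONFIG and 1.0 under the four keys.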

SLIDE 7

Other parts of the config file:

  • FILES: all the (external) files needed for parsing (and also generation): tokenizers, morphological FSTs and other transducers, the lexicon(s), the grammar file(s);
  • LEXENTRIES: the list of lexicons (if there is more than one, the order might be important);
  • RULES: the grammar (if the rules are structured into different files, the header of each should be listed here);
  • TEMPLATES: reference to the template file;

SLIDE 8

  • GOVERNABLERELATIONS: a list of grammatical functions;
  • SEMANTICFUNCTIONS: attributes the values of which are required to contain a PRED;
  • NONDISTRIBUTIVES: attributes that do not distribute when coordinated;
  • EPSILON: a category that is not overt in the c-structure;
  • OPTIMALITYORDER: ranking of optimality constraints;
  • ----: the configuration file is closed with four dashes.

SLIDE 9

The lexical entries, the morphology (if present, together with the tokenizer(s)), the rules and the templates sections all follow the same pattern:

  • they start with a header, by which they can be called in the appropriate sections (LEXENTRIES, RULES, TEMPLATES, MORPHOLOGY) of the configuration file;
  • they should be terminated with four dashes.

These sections can either be placed in the configuration file or be stored in separate files (e.g. a lexicon for nouns, a lexicon for verbs, etc.). In the latter case, the files must also be listed in the FILES section.
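
The header-plus-four-dashes layout makes it easy to cut a file into its sections. The Python sketch below is an illustration under simple assumptions: the first non-empty line of a section is taken to be its header, a line consisting only of ---- closes the section, and anything after the last terminator is ignored. The function name is hypothetical.

```python
def split_sections(text):
    """Return (header, body) pairs for every section closed by '----'."""
    sections = []
    current = []
    for line in text.splitlines():
        if line.strip() == "----":
            if current:
                header = current[0].strip()
                body = "\n".join(current[1:])
                sections.append((header, body))
            current = []
        elif line.strip() or current:
            # Skip blank lines before a header, keep everything after it.
            current.append(line)
    return sections
```

Applied to a file that holds a rules section and a lexicon section one after the other, this returns two pairs, one per header.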

SLIDE 10

General form of rules:

    category --> category1: schemata1; category2: schemata2; … .

  • A simple rule:

    S --> NP VP.

  • Each rule is terminated with a dot.

SLIDE 11

  • Assigning grammatical functions in the rules (see schemata above):

    S --> NP: (^ SUBJ)=! (! CASE)=nom; VP.

  • When schemata are given and order is important, a semicolon must follow (or a period, if it ends the rule).

SLIDE 12

  • Optionality can be expressed with the help of parentheses (surrounding the optional element in the rule).
  • When order is “not important” regarding the categories (on the right side of the rule), they (with the schemata) are to be separated with a comma.

    VP --> V: ^=!, (NP: (^ OBJ)=!).

SLIDE 13

  • Disjunction: | (=> use of curly brackets)

    NP --> {(D) N | PRON}.

  • Use of Kleene star: on the right side of the rule, “attached” to the category in question, to account for zero or more repetitions.

    VP --> V: ^=!; (NP: (^ OBJ)=!) PP*.
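
Optionality, disjunction and the Kleene star behave much like their regular-expression counterparts. The Python sketch below makes that analogy concrete for right-hand sides without schemata. It is not how XLE processes rules: the function names are hypothetical, and annotations, semicolons and the order-free comma are deliberately not handled.

```python
import re


def rhs_to_regex(rhs):
    """Translate an unannotated right-hand side into a regular expression.

    Each category is matched together with one trailing space, so a
    sequence such as V NP PP is encoded as the string "V NP PP ".
    """
    parts = []
    for tok in re.findall(r"[A-Za-z]+|[(){}|*]", rhs):
        if tok in ("(", "{"):
            parts.append("(?:")
        elif tok == ")":
            parts.append(")?")      # parentheses mark optionality
        elif tok == "}":
            parts.append(")")       # curly brackets only group
        elif tok in ("|", "*"):
            parts.append(tok)       # same meaning in both notations
        else:
            parts.append("(?:" + re.escape(tok) + " )")
    return re.compile("".join(parts))


def licenses(rhs, categories):
    """Check whether a category sequence fits the right-hand side."""
    encoded = "".join(cat + " " for cat in categories)
    return rhs_to_regex(rhs).fullmatch(encoded) is not None
```

With the two rules above stripped of their schemata, licenses("V (NP) PP*", ["V", "NP", "PP", "PP"]) and licenses("{(D) N | PRON}", ["PRON"]) both hold, while licenses("{(D) N | PRON}", ["D", "PRON"]) does not.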

SLIDE 14

General form of a lexical entry:

    word Category1 Morphcode1 Schemata1; Category2 Morphcode2 Schemata2; … .

Here,

  • Category is the category of the word;
  • Morphcode tells XLE whether the analyses of the word are provided by the morphological analyzer (in which case the morphcode is XLE) or whether only the lexical entry provides information about the given form (in which case the morphcode is *);
  • Schemata are similar to those in the grammar file.

SLIDE 15

Example:

eszik V * { (^ PRED)='eszik <(^ SUBJ)>'
          | (^ PRED)='eszik <(^ SUBJ) (^ OBJ)>'
            (^ OBJ CASE)=acc }
          (^ SUBJ CASE)=nom
          (^ SUBJ NUM)=sg
          (^ SUBJ PERS)=3
          (^ TNS-ASP TENSE)=pres
          (^ TNS-ASP MOOD)=indicative
          (^ TNS-ASP PROG)=+.

SLIDE 17

Multiple entries (for the same form) => use of special tags:

  • related to whole entries:
      • ETC: extension of previous entry;
      • ONLY: keep only this entry;
  • placed in front of subentries:
      • +: add new subentry;
      • -: remove subentry;
      • !: override subentry;
      • =: keep subentry.

SLIDE 18

Example:

Base entry:

    baa N XLE @(NOUN baa);
        V XLE @(VERB baa);
        A XLE @(ADJ baa).

In some other, later part:

    baa +P XLE @(PREP baa);
        =N;
        ONLY.

=> the effective entry:

    baa N XLE @(NOUN baa);
        P XLE @(PREP baa).
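
How the later entry combines with the base entry can be traced in a small Python sketch. It follows only the tag descriptions given on the slides (ETC retains unmentioned subentries, ONLY drops them; +, -, ! and = act on single subentries). The data representation and the function name are assumptions made for illustration, and the real XLE behaviour may differ in details.

```python
def apply_edit(base, edits, mode):
    """Combine a base entry with a later entry for the same form.

    base  -- dict mapping category to its definition
    edits -- list of (tag, category, definition) triples, tag in + - ! =
    mode  -- "ETC" (retain unmentioned subentries) or "ONLY" (drop them)
    """
    if mode not in ("ETC", "ONLY"):
        raise ValueError(f"unknown mode: {mode}")
    kept = set(base) if mode == "ETC" else set()
    updates = {}
    for tag, category, definition in edits:
        if tag == "=":                      # keep the base subentry
            if category in base:
                kept.add(category)
        elif tag == "-":                    # remove the subentry
            kept.discard(category)
            updates.pop(category, None)
        elif tag in ("+", "!"):             # add or override
            kept.discard(category)
            updates[category] = definition
        else:
            raise ValueError(f"unknown tag: {tag}")
    effective = {c: d for c, d in base.items() if c in kept}
    effective.update(updates)
    return effective


BASE = {"N": "XLE @(NOUN baa)", "V": "XLE @(VERB baa)", "A": "XLE @(ADJ baa)"}
EDITS = [("+", "P", "XLE @(PREP baa)"), ("=", "N", None)]
EFFECTIVE = apply_edit(BASE, EDITS, "ONLY")
```

EFFECTIVE then contains only the N and P subentries, as in the effective entry shown on the slide. With "ETC" in place of "ONLY", V and A would be retained as well.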

SLIDE 19

xfst lexc

SLIDE 20

  • Finite-state transducers
  • Non-deterministic
  • Possible additional functions (Kaplan and Newman 1997):
      • Normalization: removing additional white spaces
      • Editing: removing tags from annotated text
      • Capitalization handling: upper-case, lower-case
      • Contraction handling
      • Compound word isolation / multiword expression recognition
  • Using more than one: through composition
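
Non-determinism and composition can be imitated in Python by letting every stage return a list of alternative outputs and feeding each alternative into the next stage. This is a loose analogy with ordinary functions rather than real transducers; the stage names and their behaviour are assumptions chosen to mirror two of the functions listed above.

```python
def normalize(text):
    """Normalization: collapse runs of white space (one output)."""
    return [" ".join(text.split())]


def decapitalize(text):
    """Capitalization handling: also offer a lower-cased first letter."""
    alternatives = [text]
    if text[:1].isupper():
        alternatives.append(text[0].lower() + text[1:])
    return alternatives


def compose(*stages):
    """Chain stages so that every output of one stage feeds the next."""
    def run(text):
        outputs = [text]
        for stage in stages:
            outputs = [out for current in outputs for out in stage(current)]
        return outputs
    return run


tokenizer = compose(normalize, decapitalize)
```

Here tokenizer("The  dog   barks") produces two candidate strings, one with the sentence-initial capital kept and one with it lower-cased, leaving the choice to later modules.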

SLIDE 21

  • Finite-state transducers (other external analyzers are also possible, provided their output is properly mapped to what XLE expects at this level) => effectiveness in speed
  • Non-deterministic: multiple analyses when possible
  • Stemming, morphological features as tags (e.g.: +Nom)
  • Composition of morphological FSTs is possible (also: union or priority union)
  • Guessers (Kaplan et al. 2004)
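
The difference between union and priority union, and the role of a guesser as a fallback, can be sketched in Python with analyzers modelled as functions from a word form to a list of analyses. The toy lexical analyzer, the guesser and all tag names (+Noun, +Acc, +Guessed) are hypothetical.

```python
def lexical(word):
    """Toy lexicon-based analyzer: stem followed by feature tags."""
    return {"almát": ["alma +Noun +Acc"]}.get(word, [])


def guesser(word):
    """Toy guesser: always proposes an analysis for the raw form."""
    return [word + " +Guessed"]


def union(first, second):
    """Union: collect the analyses of both analyzers."""
    def analyze(word):
        results = list(first(word))
        results += [a for a in second(word) if a not in results]
        return results
    return analyze


def priority_union(first, second):
    """Priority union: consult the second analyzer only if the first fails."""
    def analyze(word):
        return first(word) or second(word)
    return analyze
```

With priority_union(lexical, guesser), a known form such as "almát" receives only its lexical analysis, while an unknown form falls through to the guesser. With union, the guessed analysis is added even for known forms.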

SLIDE 22

  • Butt, Miriam, King, Tracy H., Niño, María-Eugenia and Segond, Frédérique. 1999. A Grammar Writer’s Cookbook. Stanford: CSLI Publications.
  • Kaplan, Ronald M. and Newman, Paula S. 1997. Lexical Resource Reconciliation in the Xerox Linguistic Environment. In ACL/EACL’98 Computational environments for grammar development and linguistic engineering, 54-61.
  • Kaplan, Ronald M., Maxwell, John T., King, Tracy H. and Crouch, Richard. 2004. Integrating Finite-state Technology with Deep LFG Grammars. In Proceedings of the ESSLLI 2004 Workshop on Combining Shallow and Deep Processing for NLP.
