Some Fine Points of Hybrid Natural Language Processing


LREC 2008 Marrakech, Morocco 28th May 2008 Some Fine Points of Hybrid Natural Language Processing Peter Adolphs, DFKI GmbH, Language Technology Lab, Berlin Stephan Oepen, Universitetet i Oslo, Department of Informatics Ulrich Callmeier,


SLIDE 1

Peter Adolphs, DFKI GmbH, Language Technology Lab, Berlin
Stephan Oepen, Universitetet i Oslo, Department of Informatics
Ulrich Callmeier, acrolinx GmbH, Berlin
Berthold Crysmann, Universität Bonn
Dan Flickinger, Stanford University, CSLI
Bernd Kiefer, DFKI GmbH, Language Technology Lab, Saarbrücken

LREC 2008 Marrakech, Morocco 28th May 2008

Some Fine Points of Hybrid Natural Language Processing

SLIDE 2

Motivation

  • hybrid processing: integrating annotations of ‘shallow’ tools into HPSG parsing
  • different tools make different assumptions
  • example: PTB-style tokenizers for English
    – e.g.: Don't you! → <do, n't, you, !>
    – contracted verb forms are split
    – punctuation is split off the preceding word form
  • we need to adapt annotations of different tools to the requirements of our grammar
  • goal: a declarative, expressive, scalable device
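
The PTB-style splitting mentioned above can be illustrated with a small Python sketch. This is a toy approximation built on one regular expression, not the actual PTB tokenizer; the function name ptb_like_tokenize is made up for this illustration, and the output is lowercased only to mirror the example on the slide.

```python
import re

# Assumption: a toy approximation of PTB-style splitting, not the real tokenizer.
# Alternatives, in order: the clitic "n't", a word stem directly before "n't",
# any other word, and any single punctuation character.
_PTB_LIKE = re.compile(r"n't|\w+(?=n't)|\w+|[^\w\s]")

def ptb_like_tokenize(text):
    """Split off "n't" and punctuation; lowercase to match the slide example."""
    return _PTB_LIKE.findall(text.lower())

tokens = ptb_like_tokenize("Don't you!")
print(tokens)
```

A grammar that expects "don't" as a single word form cannot use this output directly, which is the mismatch the following slides address.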

SLIDE 3

Token Feature Structures

  • feature structures for describing tokens
  • different annotations provided as feature structures
  • lattice of structured categories (token feature structures) as input to the parser
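
As a rough illustration of the idea, a token feature structure can be modelled as a nested dictionary and the lattice as a list of edges. The feature names FORM, FROM, TO and POS below are assumptions made for this sketch; the slides do not list the actual features used by the grammar.

```python
# Hypothetical feature names; the slides do not list the actual token features.
def make_token(form, start, end, pos_tags):
    """Build a token feature structure as a nested dict."""
    return {
        "FORM": form,                     # surface string
        "FROM": start, "TO": end,         # character offsets in the input
        "POS": {"TAGS": list(pos_tags)},  # annotation from a shallow tagger
    }

# Lattice as (source vertex, target vertex, token fs) edges for the
# PTB-style tokenization of "Don't you!".
lattice = [
    (0, 1, make_token("do", 0, 2, ["VBP"])),
    (1, 2, make_token("n't", 2, 5, ["RB"])),
    (2, 3, make_token("you", 6, 9, ["PRP"])),
    (3, 4, make_token("!", 9, 10, ["."])),
]

forms = [fs["FORM"] for _, _, fs in lattice]
print(forms)
```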

SLIDE 4

Generalized Chart

  • tools may assume different tokenization (paradigm case: input from speech recognizers)
  • chart: DAG whose vertices are abstract objects rather than indexed token boundary positions
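
A minimal sketch of such a chart, assuming a plain adjacency-list representation: vertices are opaque objects with identity only, so two competing tokenizations of different lengths can share the same start and end vertex. The class names Chart and Vertex and the speech-recognition example are illustrative, not taken from the slides.

```python
class Vertex:
    """Abstract vertex: identity only, no token index."""


class Chart:
    """DAG whose vertices are opaque objects, not token positions."""

    def __init__(self):
        self.outgoing = {}  # vertex -> list of (label, target vertex)

    def add_edge(self, source, target, label):
        self.outgoing.setdefault(source, []).append((label, target))

    def paths(self, source, target):
        """Yield the label sequence of every path from source to target."""
        if source is target:
            yield []
            return
        for label, nxt in self.outgoing.get(source, []):
            for rest in self.paths(nxt, target):
                yield [label] + rest


chart = Chart()
start, end = Vertex(), Vertex()

# Hypothesis 1 from a (hypothetical) speech recognizer: two tokens.
mid = Vertex()
chart.add_edge(start, mid, "recognize")
chart.add_edge(mid, end, "speech")

# Hypothesis 2: four tokens over the same stretch of input.
a, b, c = Vertex(), Vertex(), Vertex()
chart.add_edge(start, a, "wreck")
chart.add_edge(a, b, "a")
chart.add_edge(b, c, "nice")
chart.add_edge(c, end, "beach")

readings = list(chart.paths(start, end))
print(readings)
```

With integer token positions the two hypotheses could not be aligned, since they disagree on how many tokens there are.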

SLIDE 5

Chart Mapping

  • chart mapping: non-monotonic rewrite mechanism on feature structure chart edges
  • general format:

    [ CONTEXT : ] INPUT → OUTPUT

  • CONTEXT, INPUT, OUTPUT are sequences of feature structures (each possibly empty)
  • resource-sensitive: chart edges that let a rule fire may be removed (namely, all INPUT edges)
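
The rule format can be sketched in Python under strong simplifying assumptions: feature structures are flat dicts matched by simple inclusion, INPUT is non-empty, CONTEXT items may sit anywhere in the chart, and OUTPUT has at most one item that spans the whole matched INPUT. The names matches, find_input and apply_rule are hypothetical and do not reflect the actual implementation.

```python
def matches(pattern, fs):
    """True if every feature-value pair of pattern occurs in fs."""
    return all(fs.get(key) == value for key, value in pattern.items())

def find_input(chart, patterns, start=None):
    """Find adjacent edges matching patterns in order; return them or None."""
    if not patterns:
        return []
    for edge in chart:
        source, target, fs = edge
        if (start is None or source == start) and matches(patterns[0], fs):
            rest = find_input(chart, patterns[1:], target)
            if rest is not None:
                return [edge] + rest
    return None

def apply_rule(chart, context, inputs, output):
    """Apply [CONTEXT :] INPUT -> OUTPUT once to a list of (start, end, fs) edges."""
    # CONTEXT edges must be present but are never consumed.
    if not all(any(matches(c, fs) for _, _, fs in chart) for c in context):
        return chart
    matched = find_input(chart, inputs)
    if not matched:
        return chart
    # Resource-sensitive step: all INPUT edges are removed from the chart.
    kept = [edge for edge in chart if edge not in matched]
    if output is None:  # empty OUTPUT: pure deletion
        return kept
    return kept + [(matched[0][0], matched[-1][1], dict(output))]

chart = [(0, 1, {"FORM": "New"}), (1, 2, {"FORM": "York"}), (2, 3, {"FORM": "is"})]
result = apply_rule(
    chart,
    context=[{"FORM": "is"}],
    inputs=[{"FORM": "New"}, {"FORM": "York"}],
    output={"FORM": "New York"},
)
print(result)
```

The context edge for "is" survives the rewrite, while the two input edges are replaced by a single output edge over the same span.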

SLIDE 6

Chart Mapping – Example

  • example: recombining split contracted forms
  • rules extended with regular expression matches
  • regex capture groups can be referred to in the output
  • rules themselves described as feature structures, thus we can use re-entrancies
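
The rule shown on the original slide is not preserved in this transcript. The following Python sketch only illustrates the general idea on a flat list of word forms rather than on chart edges: two adjacent INPUT items are matched with regular expressions, and the OUTPUT form is built from their capture groups. RULE_INPUT, RULE_OUTPUT and recombine are hypothetical names.

```python
import re

# Hypothetical rule: INPUT <(\w+), (n't)> -> OUTPUT <\1\2>
RULE_INPUT = [re.compile(r"(\w+)"), re.compile(r"(n't)")]
RULE_OUTPUT = "{0}{1}"  # placeholders refer to the INPUT capture groups, in order

def recombine(forms):
    """Rewrite every adjacent pair of forms that matches RULE_INPUT."""
    result, i = [], 0
    while i < len(forms):
        window = forms[i:i + len(RULE_INPUT)]
        found = [p.fullmatch(f) for p, f in zip(RULE_INPUT, window)]
        if len(window) == len(RULE_INPUT) and all(found):
            groups = [g for m in found for g in m.groups()]
            result.append(RULE_OUTPUT.format(*groups))
            i += len(RULE_INPUT)  # both INPUT items are consumed
        else:
            result.append(forms[i])
            i += 1
    return result

recombined = recombine(["do", "n't", "you", "!"])
print(recombined)
```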

SLIDE 7

Chart Mapping – Examples

  • light-weight named entity recognition
  • fixing broken tokenization

SLIDE 8

Previous Architecture (Simplified)

[Diagram: natural language input → Preprocessing → Lexical Instantiation → Syntactic Parsing → SYN ..., SEM ...]

  • preprocessing has to provide the input chart as expected by the grammar
  • this has to be ensured by specialized conversion routines without recourse to the grammar
  • changes to the grammar have to be reflected in these data adaptation routines

SLIDE 9

Proposed Architecture (Simplified)

[Diagram: natural language input → Preprocessing → Token Mapping → Lexical Instantiation → Syntactic Parsing → SYN ..., SEM ...]

  • proposal: token mapping performs certain preprocessing steps within the grammar
  • advantages:
    – full control for the grammar writer, using the same formalism as for the grammar
    – makes assumptions by the grammar explicit
    – removes complexity from preprocessing

SLIDE 10

Hybrid Processing

  • shaping the search space of the parser:
    – widening search space (e.g. unknown word handling)
    – narrowing search space (e.g. removing / postponing the processing of edges)
  • constraints on the search space
    – hard: categorial conditions for introduction / removal of chart edges
    – soft: probabilistic disambiguation, prioritize parser's tasks on the agenda
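
The difference between hard and soft constraints can be sketched with a priority queue. This is only an illustration: the categories, scores and the function build_agenda are invented for the example and say nothing about how the parser's agenda is actually implemented.

```python
import heapq

def build_agenda(edges, blocked_categories):
    """edges: (category, score) pairs; scores stand in for model probabilities."""
    agenda = []
    for category, score in edges:
        if category in blocked_categories:
            continue  # hard constraint: the edge never enters the agenda
        # Soft constraint: heapq is a min-heap, so negate the score to pop the best first.
        heapq.heappush(agenda, (-score, category))
    return agenda

agenda = build_agenda(
    [("verb", 0.3), ("noun", 0.6), ("foreign_word", 0.1)],
    blocked_categories={"foreign_word"},
)
order = [heapq.heappop(agenda)[1] for _ in range(len(agenda))]
print(order)
```

A hard constraint changes which edges exist at all, whereas a soft constraint only changes the order in which the remaining tasks are processed.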

SLIDE 11

Lexical Instantiation

  • native and generic lexical entries (les)
  • selection of appropriate generic lexical entries originally controlled by the parser (hard-coded)
  • strategy:
    – map from part-of-speech tags to generic les
    – instantiate generic le for highest ranked POS tag where no native le is available
  • disadvantage:
    – not flexible enough (e.g. no chain of responsibility)
    – partial lexical coverage: We’ll bus to Paris.
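
The hard-coded strategy and its coverage problem can be made concrete with a short sketch. The lexicon, the tag map and all entry names are hypothetical; only the control flow follows the strategy described on the slide.

```python
# Hypothetical lexicon and tag map, for illustration only.
NATIVE_LEXICON = {
    "we": ["we_pronoun"],
    "bus": ["bus_noun"],  # the grammar knows "bus" only as a noun
    "to": ["to_preposition"],
}
POS_TO_GENERIC = {
    "NN": "generic_noun",
    "VB": "generic_verb",
    "NNP": "generic_proper_name",
}

def instantiate(form, ranked_pos_tags):
    """Native entries win; otherwise use the generic le for the top-ranked tag."""
    native = NATIVE_LEXICON.get(form.lower())
    if native:
        return list(native)
    generic = POS_TO_GENERIC.get(ranked_pos_tags[0])
    return [generic] if generic else []

paris = instantiate("Paris", ["NNP", "NN"])
bus = instantiate("bus", ["VB", "NN"])
print(paris, bus)
```

For "We'll bus to Paris." the tagger's verb reading of "bus" is ignored because a native noun entry exists, so the sentence cannot be parsed even though a generic verb entry would have covered it.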

SLIDE 12

Lexical Instantiation

  • proposal: try to instantiate all generic les for all tokens
  • token feature structure is unified into a predefined path in the lexical entry
  • selection of compatible tokens by constraints on the token feature structure
  • example:
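
The example from the original slide is not preserved in this transcript. As a stand-in, here is a minimal Python sketch of the mechanism, assuming untyped feature structures represented as nested dicts and a toy unifier without types or re-entrancies. The path name TOKEN, the feature names and the two generic entries are hypothetical.

```python
def unify(a, b):
    """Unify two nested-dict feature structures; return None on a clash."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for key, value in b.items():
            if key in result:
                sub = unify(result[key], value)
                if sub is None:
                    return None
                result[key] = sub
            else:
                result[key] = value
        return result
    return a if a == b else None

# Each generic le constrains the token it accepts under the (assumed) path TOKEN.
GENERIC_LES = [
    {"NAME": "generic_noun", "TOKEN": {"POS": "NN"}},
    {"NAME": "generic_verb", "TOKEN": {"POS": "VB"}},
]

def instantiate_generics(token_fs):
    """Try every generic le; keep those whose constraints unify with the token."""
    results = []
    for le in GENERIC_LES:
        unified = unify(le, {"TOKEN": token_fs})
        if unified is not None:
            results.append(unified)
    return results

entries = instantiate_generics({"FORM": "bus", "POS": "VB"})
print(entries)
```

Selection is driven entirely by unification failure: the noun entry clashes on POS and drops out, with no selection logic hard-coded in the parser.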

SLIDE 13

Lexical Filtering

  • after lexical instantiation, native and generic les may be available in the same chart cell
  • we can restrict lexical instantiation by positing constraints on the token feature structures
  • but we might also want to prevent some lexical chart edges in certain contexts (set operations)
  • proposal: lexical filtering phase
  • same formalism as for token mapping: chart mapping rules with empty OUTPUT list
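
One possible filtering rule, written as a Python sketch rather than in the grammar formalism: with a native edge as CONTEXT and a generic edge of the same category over the same span as INPUT, the OUTPUT is empty, so the generic edge is deleted. The rule itself, the feature names and the entry names are assumptions for illustration.

```python
def filter_lexical_edges(edges):
    """Drop a generic edge if a native edge of the same category spans the same cell."""
    # CONTEXT: native edges, identified by span and category (never removed).
    native = {(s, e, le["CAT"]) for s, e, le in edges if le["NATIVE"]}
    # INPUT: matching generic edges; OUTPUT: empty, so they are simply left out.
    return [
        (s, e, le)
        for s, e, le in edges
        if le["NATIVE"] or (s, e, le["CAT"]) not in native
    ]

# Chart cell for "bus" in "We'll bus to Paris." (hypothetical entries).
cell = [
    (1, 2, {"NAME": "bus_noun", "CAT": "noun", "NATIVE": True}),
    (1, 2, {"NAME": "generic_noun", "CAT": "noun", "NATIVE": False}),
    (1, 2, {"NAME": "generic_verb", "CAT": "verb", "NATIVE": False}),
]
kept = [le["NAME"] for _, _, le in filter_lexical_edges(cell)]
print(kept)
```

The redundant generic noun is removed, while the generic verb stays available next to the native noun entry.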

SLIDE 14

Proposed Architecture

  • use feature structures to describe tokens
  • chart mapping: resource-sensitive rewriting of feature structure items
  • chart mapping on token fs
  • generic instantiation driven by compatibility with token fs
  • lexical filtering with chart mapping

[Diagram: natural language input → Preprocessing → Token Mapping → Lexical Instantiation → Lexical Parsing → Lexical Filtering → Syntactic Parsing → SYN ..., SEM ...]

SLIDE 15

Applications

  • fine-grained control over instantiation of generic lexical entries
  • mapping external morphological information into the grammar's universe
  • chart dependency filter (optimizing parsing performance)
  • activate syntactic rules only for certain spans of the input (e.g., in hybrid grammar checking)

SLIDE 16

Conclusions

  • versatile device for many applications
  • external information is made accessible to the grammar
  • pre-processing can be better controlled with grammar-specific means
  • reduces the need for special code inside and outside the parser
  • outlook: consolidation of our current parsers and grammars

SLIDE 17

Thank you!

SLIDE 18

Acknowledgements

  • DELPH-IN community and beyond, especially Nuria Bertomeu, Ann Copestake, Remy Sanouillet, Ulrich Schäfer and Benjamin Waldron for numerous in-depth discussions
  • funding:
    – ProFIT program of the German federal state of Berlin and the EFRE program of the EU (to the DFKI project Checkpoint)
    – the University of Oslo (through its scientific partnership with CSLI)