UIMA-based Annotation Type System for a Text Mining Architecture - PowerPoint PPT Presentation

SLIDE 1

UIMA-based Annotation Type System for a Text Mining Architecture

Udo Hahn, Ekaterina Buyko, Katrin Tomanek, Scott Piao, Yoshimasa Tsuruoka, John McNaught, Sophia Ananiadou

Jena University Language and Information Engineering Lab & School of Computer Science, University of Manchester

SLIDE 2

BOOTStrep NLP Infrastructure

Bootstrapping Of Ontologies and Terminologies STrategic REsearch Project

[Diagram: NLP Components Repository with Tool 1, Tool 2, ..., Tool n; Team 1, Team 2, ..., Team n; Annotated Facts]

SLIDE 3

Annotation in Natural Language Processing (NLP)

NLP System: Tokenizer, POS Tagger, Entity Tagger, Relation Tagger, .....

..... Fred is CEO of IBM .....

<document source ../>
<sentence begin ... end../>
<token begin .. end ../>
<token begin .. end ..>
.....
<entity person begin .. end>
<entity organization begin .. end/>
<relation is_ceo_of begin .. end/>
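
The annotations on this slide are stand-off: each one refers to the text by begin and end offsets. A minimal Python sketch of that idea follows. The `Annotation` class, its field names, and the whitespace tokenizer are illustrative assumptions, not part of UIMA or of the type system presented here.

```python
from dataclasses import dataclass

TEXT = "Fred is CEO of IBM"


@dataclass(frozen=True)
class Annotation:
    """Hypothetical stand-off annotation: a labelled character span."""
    kind: str        # e.g. "token", "entity", "relation"
    begin: int       # offset of the first character (inclusive)
    end: int         # offset after the last character (exclusive)
    label: str = ""

    def covered_text(self, text: str) -> str:
        return text[self.begin:self.end]


def tokenize(text):
    """Whitespace tokenizer that records character offsets."""
    annotations, pos = [], 0
    for word in text.split():
        begin = text.index(word, pos)
        end = begin + len(word)
        annotations.append(Annotation("token", begin, end))
        pos = end
    return annotations


tokens = tokenize(TEXT)
entities = [
    Annotation("entity", 0, 4, "person"),
    Annotation("entity", 15, 18, "organization"),
]
# The relation spans the text from its first argument to its last.
relation = Annotation("relation", 0, 18, "is_ceo_of")
```

Because the text itself is never modified, several tools can add annotations over the same span without interfering with one another.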

SLIDE 4

Annotation in NLP Systems

[Diagram: NLP Tool Suites 1, 2, 3, ..., n, each with its own annotation types: Token POS1, Token POS2, ..., Token POSn]

SLIDE 5

Annotation in NLP Systems

[Diagram: NLP Tool Suites 1, 2, 3, ..., n, each with its own annotation types: Token POS1, Token POS2, ..., Token POSn, plus a Data Conversion step]

SLIDE 6

Advantages of the UIMA Framework

Interoperability between NLP systems

  • Portability of components
  • Flexible exchange of components
SLIDE 7

Annotation in NLP Systems

[Diagram: NLP Tool Suites 1, 2, 3, ..., n, each with its own annotation types: Token POS1, Token POS2, ..., Token POSn]

SLIDE 8

Annotation in NLP Systems

[Diagram: NLP Tool Suites 1, 2, 3, ..., n, each with its own annotation types: Token POS1, Token POS2, ..., Token POSn, plus a Data Conversion step]

SLIDE 9

Annotation in NLP Systems

[Diagram: NLP Tool Suites 1, 2, 3, ..., n, each with its own annotation types: Token POS1, Token POS2, ..., Token POSn, plus a Data Conversion step]
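
When every tool suite keeps its own POS type, exchanging a component means converting its output. The sketch below shows one such converter and counts how many directed converters n suites would need if each pair were bridged separately. The tag names and the mapping are hypothetical and are not taken from any of the tool suites discussed.

```python
from itertools import permutations

# Hypothetical mapping from suite 1's tagset to suite 2's tagset.
SUITE1_TO_SUITE2 = {"NNP": "PROPN", "VBZ": "VERB", "NN": "NOUN", "IN": "ADP"}


def convert(tagged_tokens, mapping, unknown="X"):
    """Rewrite (word, tag) pairs into another suite's tagset."""
    return [(word, mapping.get(tag, unknown)) for word, tag in tagged_tokens]


def converters_needed(n_suites):
    """Directed pairwise converters: one per ordered pair of distinct suites."""
    return sum(1 for _ in permutations(range(n_suites), 2))


suite1_output = [("Fred", "NNP"), ("is", "VBZ"), ("CEO", "NN"),
                 ("of", "IN"), ("IBM", "NNP")]
suite2_input = convert(suite1_output, SUITE1_TO_SUITE2)
```

With four suites, pairwise bridging would already need `converters_needed(4)` = 12 converters, and any tag missing from a mapping silently degrades to the `unknown` value.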

SLIDE 10

Advantages of the UIMA Framework

Interoperability between NLP systems

  ✔ Portability of components
  ✗ Flexible exchange of components

SLIDE 11

Exchange of components in UIMA

  • Adaptation Efforts
  • Over-write Wrappers
  • Create Matching Files
  • Define a Common Annotation Type System in advance

SLIDE 12

Annotation in NLP Systems

[Diagram: NLP Tool Suites 1, 2, 3, ..., n with their own annotation types (Token POS1, Token POS2, ..., Token POSn) alongside a Common Type System (POS)]

SLIDE 13

Annotation in NLP Systems

[Diagram: NLP Tool Suites 1, 2, 3, ..., n, all using the same Token and POS types from the Common Type System (POS)]
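
One way to picture the common type system: each tool is wrapped once so that it emits the shared type, and downstream components depend only on that type. The sketch below is a plain Python illustration under that assumption. The two taggers, their native output formats, and the `Token` class are hypothetical stand-ins, not UIMA components.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Shared type that every wrapped tool is expected to produce."""
    begin: int
    end: int
    pos: str


def suite1_tagger(text):
    """Hypothetical native output: (begin, end, tag) tuples."""
    return [(0, 4, "NNP"), (5, 7, "VBZ")]


def suite2_tagger(text):
    """Hypothetical native output: dicts with different field names."""
    return [{"start": 0, "length": 4, "cat": "NNP"},
            {"start": 5, "length": 2, "cat": "VBZ"}]


def wrap_suite1(text):
    return [Token(b, e, tag) for b, e, tag in suite1_tagger(text)]


def wrap_suite2(text):
    return [Token(d["start"], d["start"] + d["length"], d["cat"])
            for d in suite2_tagger(text)]


def proper_nouns(text, tagger):
    """Downstream component: relies only on the shared Token type."""
    return [text[t.begin:t.end] for t in tagger(text) if t.pos == "NNP"]
```

Either wrapped tagger can be passed to `proper_nouns` without changing the consumer, which is the kind of exchange the slide describes.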

SLIDE 14

Advantages of the UIMA Framework

Interoperability between NLP systems

  ✔ Portability of components
  ✔ Flexible exchange of components

SLIDE 15

Design of an Annotation Type System

  • Requirements from various NLP teams
  • Annotation guidelines and schemata
SLIDE 16

Requirements for an Annotation Type System

  • Broad coverage for information extraction
  • Compatible with “standard” NLP annotation schemata
  • Definition of an extensible core type system
  • Use of UIMA-specific features
  • Multiple annotations of the same type
  • Annotation control through the restriction of values
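
The last requirement, annotation control through restricted values, can be sketched as a feature that only accepts values from a closed set. The example below is an assumption-laden illustration in plain Python: the `ChunkLabel` values and the `Chunk` class are hypothetical and do not reproduce the actual type definitions.

```python
from dataclasses import dataclass
from enum import Enum


class ChunkLabel(Enum):
    """Closed set of allowed chunk labels (illustrative subset)."""
    NP = "NP"
    VP = "VP"
    PP = "PP"


@dataclass
class Chunk:
    begin: int
    end: int
    label: ChunkLabel

    def __post_init__(self):
        # Enum lookup raises ValueError for any value outside the allowed set.
        self.label = ChunkLabel(self.label)
        if not 0 <= self.begin <= self.end:
            raise ValueError("begin must not exceed end")


def is_valid_chunk(begin, end, label):
    """Return True if the annotation passes the value restrictions."""
    try:
        Chunk(begin, end, label)
        return True
    except ValueError:
        return False
```

A component that tries to write an unknown label is rejected when the annotation is created, rather than producing output that downstream tools cannot interpret.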
SLIDE 17

Annotation Guidelines & Schemata

Corpus Annotation

  • Annotation languages (e.g. XML, in-line or stand-off)
  • Annotation levels:
      • Document Meta (e.g. Dublin Core Metadata Initiative)
      • Linguistic Analysis (e.g. TEI, XCES (EAGLES), Penn Treebank)
      • Semantic Analysis (e.g. MUC, ACE, GENIA)
  • NLP system annotation guidelines?
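
To illustrate the difference between in-line and stand-off XML annotation mentioned above, the following sketch converts a small in-line example into plain text plus stand-off spans using Python's standard `xml.etree.ElementTree` module. The element names and the sample sentence markup are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# In-line annotation: markup is interleaved with the text.
INLINE = ("<s><person>Fred</person> is CEO of "
          "<organization>IBM</organization></s>")


def inline_to_standoff(xml_string):
    """Return (plain_text, spans), where spans are (tag, begin, end) tuples."""
    root = ET.fromstring(xml_string)
    text_parts, spans = [], []
    offset = 0

    def walk(elem):
        nonlocal offset
        begin = offset
        if elem.text:
            text_parts.append(elem.text)
            offset += len(elem.text)
        for child in elem:
            walk(child)
            if child.tail:  # text that follows the child's closing tag
                text_parts.append(child.tail)
                offset += len(child.tail)
        spans.append((elem.tag, begin, offset))

    walk(root)
    return "".join(text_parts), spans


plain_text, spans = inline_to_standoff(INLINE)
```

In the stand-off form the text stays untouched and the spans can be stored separately, which makes overlapping or competing annotations easier to keep side by side.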
SLIDE 18

Coverage

Multi-Layered Annotation Type System

  • 1. Document Meta: author, publication data, source
  • 2. Document Structure & Style: title, sections, bold text
  • 3. Morpho-Syntax: token, part-of-speech, lemma
  • 4. Syntax: chunks, constituents, dependency relations
  • 5. Semantics: entities, relations, events
  • 6. Discourse: anaphora
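
The layering above can be pictured as a small type hierarchy in which every type derives from one basic annotation type and belongs to a layer. The sketch below is a simplified Python illustration. The class names, fields, and the `layer` attribute are assumptions for the example, and only three of the six layers are shown.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Base type: every annotation covers a span of the document text."""
    begin: int
    end: int
    layer = "Basic"  # plain class attribute, not a dataclass field


@dataclass(frozen=True)
class Token(Annotation):
    pos: str = ""
    layer = "Morpho-Syntax"


@dataclass(frozen=True)
class Chunk(Annotation):
    label: str = ""
    layer = "Syntax"


@dataclass(frozen=True)
class Entity(Annotation):
    label: str = ""
    layer = "Semantics"


def group_by_layer(annotations):
    """Collect annotations under the layer their type belongs to."""
    layers = defaultdict(list)
    for ann in annotations:
        layers[ann.layer].append(ann)
    return dict(layers)


annotations = [Token(0, 4, "NNP"), Chunk(0, 4, "NP"),
               Entity(0, 4, "person"), Token(5, 7, "VBZ")]
layers = group_by_layer(annotations)
```

Components that only need one layer, for example a relation extractor reading entities, can then ignore the rest of the annotations.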
SLIDE 19

SLIDE 20

Basic Annotation Type

SLIDE 21

Document Meta

SLIDE 22

Document Meta Information I

SLIDE 23

Document Meta Information II

SLIDE 24

Document Structure

SLIDE 25

Morpho-Syntax

SLIDE 26

Morpho-Syntax I

SLIDE 27

Morpho-Syntax II

SLIDE 28

Morpho-Syntax III

SLIDE 29

Morpho-Syntax IV

SLIDE 30

Syntax

SLIDE 31

Shallow Parsing

SLIDE 32

Full Parsing (constituent-based)

SLIDE 33

Full Parsing (dependency-based)

SLIDE 34

Semantics

SLIDE 35

Resource Connection

SLIDE 36

To wrap up ..

  • Multi-layered annotation
  • Core annotation type system
  • Extended for the biomedical domain
  • Can easily be extended for other domains
  • Restriction of values for annotation control
  • Sub-Types for multiple annotation (e.g. POS, Chunk)
  • Connection to external resources
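
Two of the points above, sub-types for multiple annotation and the connection to external resources, can be sketched together. In the Python illustration below, a token carries POS tags from more than one tagset, distinguished by sub-type, and an entity carries references to entries in an external resource. All class names, the resource name `EXAMPLE-DB`, and the identifier `ID-0001` are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Type


@dataclass(frozen=True)
class POSTag:
    value: str


class PennPOSTag(POSTag):
    """Hypothetical sub-type for tags from a Penn Treebank style tagset."""


class GeniaPOSTag(POSTag):
    """Hypothetical sub-type for tags from a GENIA style tagset."""


@dataclass(frozen=True)
class ResourceEntry:
    source: str    # name of the external resource
    entry_id: str  # identifier of the entry inside that resource


@dataclass
class Token:
    begin: int
    end: int
    pos_tags: List[POSTag] = field(default_factory=list)

    def pos(self, tag_type: Type[POSTag]) -> Optional[POSTag]:
        """Return the first tag of the requested sub-type, if any."""
        return next((t for t in self.pos_tags if isinstance(t, tag_type)), None)


@dataclass
class Entity:
    begin: int
    end: int
    label: str
    resource_entries: List[ResourceEntry] = field(default_factory=list)


token = Token(15, 18, [PennPOSTag("NNP"), GeniaPOSTag("NNP")])
entity = Entity(15, 18, "organization",
                [ResourceEntry("EXAMPLE-DB", "ID-0001")])
```

A consumer asks for the sub-type it understands, for example `token.pos(PennPOSTag)`, so tags from different tagsets can coexist on the same token.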
SLIDE 37

Open Issues

  • Performance measure of the type system
  • Definitions:
      • Semantics (Relation, Event)
      • Discourse (Anaphora)
SLIDE 38

UIMA Annotation Type System Working Group?

Download: http://www.julielab.de/

Contact: buyko@coling-uni-jena.de

Sponsored by