Hybrid Information Extraction, PD Dr. Günter Neumann, DFKI GmbH (PowerPoint presentation)



SLIDE 1

Hybrid Information Extraction

PD Dr. Günter Neumann DFKI GmbH

Tuesday, February 8, 2011

SLIDE 2

Hybrid

A system is hybrid if it consists of different technologies that:

  • can be combined,
  • each provide a solution on their own, and
  • whose integration constitutes an innovative plus for the whole system.

SLIDES 3-6

Examples

  • hybrid engine (human/machine)
  • hybrid language processing

SLIDE 7

Information Extraction

  • The aim of information extraction (IE) is the identification and structuring of domain-specific information from free text, while skipping irrelevant information at the same time.
  • What counts as relevant is given to the system in the form of pre-defined domain-specific annotations, lexicon entries or rules.

SLIDES 8-12

Example: news about turnover

turnover(Company, Year, Manner, Amount, Tendency, Difference)

Eine Mixtur aus wachsendem Dienstleistungsgeschäft, Kostensenkungen und erfolgreichen Akquisitionen brachte Wettbewerber IBM im zweiten Quartal deutlich verbesserte Ergebnisse. Zwischen April und Juni stiegen der Umsatz um 10% auf 21,6 Mrd.$ und der Reingewinn auf 1,7 Mrd.$. Sonderlasten in Höhe von 1,4 Mrd.$ hatten den Vorjahresgewinn auf 56 Mill.$ gedrückt.

(English: A mixture of a growing services business, cost reductions and successful acquisitions brought competitor IBM clearly improved results in the second quarter. Between April and June, revenue rose by 10% to $21.6 bn and net profit rose to $1.7 bn. One-off charges of $1.4 bn had pushed the previous year's profit down to $56 m.)
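A toy sketch of how such a turnover template might be filled with hand-written patterns. The regular expression and the field mapping below are illustrative only, not the system described in the talk:

```python
import re

# Hypothetical, highly simplified pattern-based filler for parts of the
# turnover(Company, Year, Manner, Amount, Tendency, Difference) template.
TEXT = ("Zwischen April und Juni stiegen der Umsatz um 10% "
        "auf 21,6 Mrd.$ und der Reingewinn auf 1,7 Mrd.$.")

PATTERN = re.compile(
    r"der Umsatz um (?P<difference>\d+(?:,\d+)?%) "
    r"auf (?P<amount>\d+(?:,\d+)? Mrd\.\$)")

def fill_template(text):
    """Extract Amount/Difference/Tendency slots from a German sentence."""
    m = PATTERN.search(text)
    if not m:
        return None
    return {
        "Amount": m.group("amount"),
        "Difference": m.group("difference"),
        # "stiegen" (rose) signals an upward tendency.
        "Tendency": "up" if "stiegen" in text else "unknown",
    }

print(fill_template(TEXT))
```

Real systems would of course use far more robust linguistic analysis than a single regular expression; the point is only that relevance is encoded in pre-defined patterns, as the slides state.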

SLIDES 13-20

IE - History

  • Early IE systems were mainly rule-based (manual or learned) and the underlying methodology was specialized for specific applications, cf. the MUC systems of the 1990s.
  • One result of the MUC challenges was a systematic division of labor into IE subtasks:
  • Named-Entity Extraction (NER)
  • Relation Entity Extraction (REE)
  • Event Entity Extraction (EEE)
  • Coreferential analysis

Example (annotations built up incrementally across the slides):

The founder of Microsoft, Bill Gates, lives in Seattle, Washington, which is also the place of the company's headquarters.

  • Bill Gates is a Person
  • Microsoft is a Company
  • Seattle is a Location
  • Relations: founder_of, lives_in, hq_located_in
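The division of labor into NER and REE can be illustrated with a toy pipeline. The lexicon, the single relation rule and all names below are illustrative, not part of any real system:

```python
# Toy illustration of the NER -> REE division of labor.
ENTITY_LEXICON = {
    "Bill Gates": "Person",
    "Microsoft": "Company",
    "Seattle": "Location",
}

def ner(text):
    """Dictionary-based named-entity tagging (NER subtask)."""
    return [(name, label) for name, label in ENTITY_LEXICON.items()
            if name in text]

def ree(text, entities):
    """One hand-written relation rule (REE subtask):
    'founder of X' near a Person -> founder_of(Person, X)."""
    relations = []
    for company, label in entities:
        if label == "Company" and f"founder of {company}" in text:
            for person, plabel in entities:
                if plabel == "Person":
                    relations.append(("founder_of", person, company))
    return relations

sentence = ("The founder of Microsoft, Bill Gates, lives in Seattle, "
            "Washington, which is also the place of the company's headquarters.")
ents = ner(sentence)
rels = ree(sentence, ents)
print(ents)
print(rels)
```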

SLIDE 21

IE - the Present

  • There exist knowledge-based IE (KIE) and statistical IE (SIE).
  • SIE is the state of the art in research, KIE in industry.
  • There exist a number of different strategies for the various IE subtasks:
  • from simple gazetteers to complex ontologies
  • from supervised, to minimally supervised, to unsupervised machine-learning algorithms
  • Recently, the research focus has been on NER, REE, Web-based IE, scalability, domain adaptivity, ...
  • Open question: Which method is actually better suited for which text source, domain and application?

SLIDE 22

Hybrid IE

  • Methods and strategies for the combination of different IE components and the analysis of their plausibility.
  • What are possible combinations?

SLIDES 23-26

Multi-Strategy

[Diagram: several independent IE engines run in parallel; their outputs feed into a Combiner.]

SLIDES 27-31

Example: NER

[Diagram: the NER systems LingPipe, OpenNLP, BiQue and SProUT feed their results into a Combiner.]

Problem:

  • ambiguities
  • bracketing

[Diagram: overlapping candidate annotations (LOC, PER, ORG) with different bracketings over the same word sequence Wort1 ... Wort5.]

Solutions:

  • meta-learning
  • consider the IEi as independent black boxes

Meta-learning:

  • majority voting
  • stacking

Strategies:

  • maximum weights
  • linear regression: PC = 1 - ∏i (1 - Pi)
  • cross-validation

Good news: hybrid NER is better than each single NER with respect to recall and precision.

Reference: G. Sigletos et al., "Combining Information Extraction Systems Using Voting and Stacked Generalization", Journal of Machine Learning Research, Vol. 6 (2005), pp. 1751-1782.
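Token-level majority voting, the simplest of the combination strategies above, can be sketched as follows; the three tag sequences are hypothetical system outputs:

```python
from collections import Counter

def majority_vote(taggings):
    """Token-level majority voting over per-system NER tag sequences.

    taggings: list of tag sequences, one per NER system, all of equal
    length (one tag per token). Ties are broken by first occurrence.
    """
    combined = []
    for position_tags in zip(*taggings):
        tag, _count = Counter(position_tags).most_common(1)[0]
        combined.append(tag)
    return combined

# Three hypothetical systems disagree on the second and third token.
sys_a = ["O", "PER", "O"]
sys_b = ["O", "PER", "LOC"]
sys_c = ["O", "ORG", "LOC"]
print(majority_vote([sys_a, sys_b, sys_c]))  # ['O', 'PER', 'LOC']
```

Stacking replaces the fixed vote with a trained meta-classifier that takes the individual systems' outputs as features, which is what the Sigletos et al. paper evaluates.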

SLIDES 32-34

Example: Template Filling

MEM - Maximum Entropy Modeling; DOP - Data-Oriented Parsing; iterative tag insertion.

Corpus: German press releases about turnover (training: 4,850 tokens; testing: 1,000 tokens).

Der Gewinn <Org>der Schweppes Gmbh & Co.</Org> KG betrug <TIMEX>im ersten Quartal 1997</TIMEX> weit ueber 20 Mio. DM.

Result:

  • MEM only: 79.3%
  • DOP only: 51.9%
  • both: 85.2%

Reference: Neumann, G. (2006). A Hybrid Machine Learning Approach for Information Extraction from Free Texts. In Spiliopoulou et al. (Eds.), From Data and Information Analysis to Knowledge Engineering, Springer series: Studies in Classification, Data Analysis, and Knowledge Organization, pp. 390-397. Springer, Berlin/Heidelberg/New York.

SLIDES 35-37

Feature-based Strategies

[Diagram: several feature extraction components feed into one machine learning component.]

Idea:

  • choose an ML algorithm
  • choose manually and automatically determined feature templates
  • combination of knowledge and statistics

Proposal (Fresko et al., 2005):

  • regular grammars (hand-coded)
  • maximum entropy learning

Reference: Fresko, Rozenfeld, Feldman, "A Hybrid Approach to NER by Integrating Manual Rules into MEM", CIKM, 2005.
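The idea of injecting hand-written rule matches as additional features next to automatically derived ones can be sketched like this; the rule, the feature names and the example are illustrative, not the features used by Fresko et al.:

```python
import re

# A manually written rule: an honorific followed by a capitalized word.
HONORIFIC_RULE = re.compile(r"\b(?:Mr|Mrs|Dr)\.? [A-Z][a-z]+")

def features(token, text):
    """Mix automatic surface-shape features with a manual rule match.
    Both kinds of features would be fed to one statistical learner."""
    return {
        "is_capitalized": token[:1].isupper(),          # automatic
        "has_digit": any(c.isdigit() for c in token),   # automatic
        "rule_person": any(token in m for m in
                           HONORIFIC_RULE.findall(text)),  # manual rule
    }

text = "Yesterday, Dr Simmons presented the discovery."
print(features("Simmons", text))
```

The learner can then weigh the rule feature against the purely statistical evidence instead of trusting the rule unconditionally.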

SLIDES 38-48

Co-Training & Bootstrapping

[Diagram: a Bootstrapper coordinates two classifiers (Classifier 1, Classifier 2) that are iteratively retrained, starting from initial data (seed).]

Note: the seeds are manually specified, e.g., through reference to an ontology!

Co-training & IE:

  • NER, cf. Collins & Singer, 1999: interaction of spelling and context features
  • REE, cf. Surdeanu et al., 2006: interaction of text classifier and pattern acquisition
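A minimal sketch of the co-training loop in the spirit of the spelling/context interaction mentioned above. The two "views", the seed lexicon and the corpus are toy assumptions, not the actual Collins & Singer setup:

```python
# Toy co-training: a spelling view and a context view label each
# other's data, starting from a small manually specified seed.
seeds_spelling = {"Microsoft": "ORG", "Seattle": "LOC"}  # seed (manual)
seeds_context = {}

corpus = [
    ("Microsoft", "headquartered in"),   # (spelling, left context)
    ("Oracle", "headquartered in"),
    ("Seattle", "lives in"),
    ("Paris", "lives in"),
]

for _ in range(2):  # a couple of bootstrapping rounds
    # 1. The spelling view labels mentions it knows; the surrounding
    #    contexts inherit those labels.
    for word, ctx in corpus:
        if word in seeds_spelling:
            seeds_context.setdefault(ctx, seeds_spelling[word])
    # 2. The context view labels previously unknown spellings.
    for word, ctx in corpus:
        if ctx in seeds_context:
            seeds_spelling.setdefault(word, seeds_context[ctx])

print(seeds_spelling)  # Oracle and Paris acquired from the context view
```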

SLIDES 49-52

QA and Hybrid IE

  • Observation: answer extraction is a kind of question-driven IE (NER and REE).

  Where does Bill Gates live?  ->  lives_in(Town:?, Pers:Bill Gates)
  What is a CEO?  ->  is_a(Pos:CEO, Conc:?)

Domain-open answering of definition questions from the Web.

Problem: How to find an optimal ranking of answer candidates?

[Diagram: the question "Was ist XYZ?" ("What is XYZ?") goes into a Web-QA component, which returns many candidate snippets of the form "XYZ is a ...".]

Reference: Figueroa, A., Neumann, G. and Atkinson, J. (2009). Searching for Definitional Answers on the Web Using Surface Patterns. IEEE Computer, Vol. 42, No. 4, pp. 68-76, IEEE, 4/2009.

SLIDES 53-57

Wikipedia as Blueprint!

  • Learn from Wikipedia what a good verbalization of a definition looks like!

[Diagram: "What is XYZ?" goes into Web-QA, which returns many candidate snippets of the form "XYZ is a ...".]

Solution: rank answer candidates according to their similarity to Wikipedia definitions.

Unsupervised learning of the feature model. Properties:

  • automatic computation of POS and NEG training examples
  • lexical-semantic feature templates via dependency analysis
  • maximum entropy modeling

Remark: the method is a step towards Web-scalable ontology learning.
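The ranking idea can be sketched with a crude token-overlap similarity; the "blueprint" sentence and the candidates are invented, and the real system learns a maximum entropy model over dependency-based features rather than using raw overlap:

```python
def jaccard(a, b):
    """Token-set Jaccard similarity between two sentences."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b)

# Hypothetical "blueprint" definition harvested from Wikipedia.
wikipedia_blueprints = [
    "a CEO is the highest ranking executive officer of a company",
]

# Hypothetical candidate snippets returned by Web-QA.
candidates = [
    "a CEO is the highest ranking officer of a company",
    "CEO click here to buy now",
]

# Rank candidates by their best similarity to any blueprint.
ranked = sorted(candidates,
                key=lambda c: max(jaccard(c, w) for w in wikipedia_blueprints),
                reverse=True)
print(ranked[0])
```

Definition-like candidates float to the top, while Web noise ("click here ...") sinks, which is exactly the ranking problem the slides pose.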

SLIDES 58-61

IE: Ontology-based IE (OBIE)

  • ontology population
  • ontology learning

The ontology defines the type of information that has to be extracted from texts, e.g., types of institutions and their interrelationships. It defines the structure of the database, which is to be filled automatically with the help of OBIE.

Project: Theseus, Ordo TechWatch

SLIDE 62

TEG - Tree Extraction Grammars

Manually written extraction grammar (CFG) + annotated corpus -> trained corpus-adapted SCFG -> HMM-inspired semantic parser.

Reference: Rosenfeld, Feldman & Fresko, "TEG - a hybrid approach to information extraction", Knowledge and Information Systems (2006), 1-18.

SLIDES 63-67

TEG - Example

Hand-coded grammar:

  nonterm start Text;
  concept Person;
  ngram NGFirstName;
  ngram NGLastName;
  ngram NGNone;
  termlist TLHonorific = Mr Mrs Miss Ms Dr;
  (1) Person :- TLHonorific NGLastName;
  (2) Person :- NGFirstName NGLastName;
  (3) Text :- NGNone Text;
  (4) Text :- Person Text;
  (5) Text :- ;

Parse the corpus:

  Yesterday, <Person> Dr Simmons </Person>, the distinguished scientist presented the discovery.

Collect statistics:

  P(Dr | TLHonorific) = 1/5 (choice of one term among five equiprobable ones),
  P(Dr | NGFirstName) ≈ 1/N, where N is the number of all known words (untrained ngram behaviour).

Adapt the rules:

  termlist TLHonorific = Mr Mrs Miss Ms <2>Dr;
  Person :- <2>TLHonorific NGLastName;
  Text :- <11>NGNone Text;
  Text :- <2>Person Text;
  Text :- <2>;
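How the collected counts (the <k> annotations in the adapted rules) could turn into SCFG rule probabilities can be sketched by simple relative-frequency estimation; the rule names mirror the example above, and the normalization scheme is an assumption:

```python
# Toy SCFG probability estimation from the rule counts shown above:
# Text :- <11>NGNone Text;  Text :- <2>Person Text;  Text :- <2>;
text_rule_counts = {
    "Text -> NGNone Text": 11,
    "Text -> Person Text": 2,
    "Text -> <empty>": 2,
}

# Relative frequency: count of a rule divided by the total count
# of all rules expanding the same nonterminal (here: Text).
total = sum(text_rule_counts.values())
probs = {rule: count / total for rule, count in text_rule_counts.items()}
print(probs["Text -> NGNone Text"])  # 11/15
```

The trained parser then prefers the rule expansions (and termlist members such as <2>Dr) that were actually observed in the annotated corpus.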

SLIDE 68

TEG - Experiments

  • MUC-7 NER task
  • ACE-2 relation extraction
  • INC relation extraction

SLIDE 69

TEG - Potential

  • Advantages
  • precise rules can be specified for arbitrary IE applications
  • external knowledge sources can be integrated via termlist
  • ngram context for terminals via ngram (usable for disambiguation)
  • external systems can be integrated: "ngram ngOrgNoun featureset ExtPoS restriction Noun;"
  • Possible innovations
  • constraint-based formalism as basis for the grammar
  • specialized parsing algorithms (e.g., supertagging)
  • ontologies as basis for termlist
  • extending grammars by bootstrapping (human-controlled)
  • ...

SLIDE 70

Conclusion

  • Hybrid IE is an innovative plus for IE research and development.
  • There already exist a number of promising and exciting approaches.
  • High innovation potential in bringing language technology, knowledge-based systems and statistical systems under one umbrella:
  • e.g., multilingual information extraction
  • e.g., multi-channel information extraction