SLIDE 1

Weakly-Supervised Acquisition of Labeled Class Instances for Open-Domain Information Extraction

Partha Pratim Talukdar (UPenn), Joseph Reisinger (UT Austin), Marius Paşca (Google), Deepak Ravichandran (Google), Rahul Bhagat (USC), Fernando Pereira (Google)

Work done at Google during Summer 2008.

SLIDE 7

Motivation

  • (Class, Instance) pairs, e.g. (pain killer, aspirin), can be useful in many applications such as web search.
  • Given an entity/instance, it is often desirable to know its type.
  • A limited number of classes is not enough:
      • Web search queries include active volcanoes like Kilauea, zoonotic diseases like monkeypox, etc., demonstrating general user interest in such classes.
  • Covering one class at a time (as in standard Named Entity Extraction) is resource-intensive and not sufficient.
  • We need open-domain extraction involving a large number of classes and a large number of instances.

SLIDE 11

Previous Work

  • Named Entity Extraction: small number of classes, extensive supervision.
  • (Van Durme and Paşca, AAAI 2008): open-domain extraction with high precision but low recall; precision drops quickly as recall increases.
  • Our starting point: extractions from (Van Durme and Paşca, 2008).

Class            Size  Examples of Instances
Book Publishers  70    Crown Publishing, Kluwer Academic, Prentice Hall, Puffin, ...

SLIDE 18

Objectives

Starting with such automatically extracted (class, instance) pairs:

  • Extract additional instances for existing classes.
  • Identify additional class labels for existing instances.
  • Handle initial pairs from diverse sources and methods.
  • Require minimal human supervision.
  • Do all of this in a scalable manner.
  • Increase coverage (recall) at comparable quality (precision)!

SLIDE 24

Where do we get instances from?

  • A8: extractions from unstructured text by (Van Durme and Paşca, AAAI 2008).
  • WebTables (Cafarella et al., VLDB 2008):
      • 154M HTML tables extracted from the web.
      • Rich source of instances, already segmented by webpage creators.
      • Structured text.

SLIDE 25

Assigning class labels to WebTable instances

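[Figure: an A8 instance list (". . Bob Dylan . .") shown next to a WebTable with the column headers Artist, Albums and Year and cells that include Johnny Cash and Bob Dylan. The label "musician" appears on the WebTable. Example result: Score(musician, Johnny Cash) = 0.87.]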

SLIDE 27

Putting together tuples from first phase extractors

  • A graph-based representation is used: each tuple from A8 and WebTable is a weighted edge, with nodes representing classes and instances.

[Figure: graph with class nodes musician and singer, instance nodes Bob Dylan, Johnny Cash and Billy Joel, and edge weights 0.95, 0.87, 0.82, 0.73, 0.75.]
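The graph representation described on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation. The function name, the node encoding, and the rule for merging a pair that occurs twice are assumptions. Only the weight for (musician, Johnny Cash) comes from the slides; the other weight-to-edge pairings are guesses, since the figure layout is not recoverable.

```python
from collections import defaultdict

# (class, instance, weight) tuples. Only (musician, Johnny Cash, 0.87) is taken
# from the slides; the other pairings of weights to edges are assumed.
TUPLES = [
    ("musician", "Bob Dylan", 0.95),
    ("musician", "Johnny Cash", 0.87),
    ("singer", "Bob Dylan", 0.82),
    ("singer", "Johnny Cash", 0.73),
    ("singer", "Billy Joel", 0.75),
]


def build_graph(tuples):
    """Return an undirected weighted graph as {node: {neighbor: weight}}."""
    graph = defaultdict(dict)
    for cls, inst, weight in tuples:
        # Tag nodes so a class and an instance with the same string stay distinct.
        class_node = ("class", cls)
        instance_node = ("instance", inst)
        # Assumed merge rule: keep the larger weight if a pair occurs twice.
        merged = max(weight, graph[class_node].get(instance_node, 0.0))
        graph[class_node][instance_node] = merged
        graph[instance_node][class_node] = merged
    return graph


graph = build_graph(TUPLES)
```

Storing each edge in both directions keeps neighbor lookups cheap, which is what an iterative propagation step needs.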

SLIDE 28

Initialization: Seed Labels Marked

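[Figure: the same graph (class nodes musician and singer; instance nodes Bob Dylan, Johnny Cash and Billy Joel; edge weights 0.95, 0.87, 0.82, 0.73, 0.75) with the seed labels marked: musician 1.0 and singer 1.0.]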

SLIDE 30

Label Propagation: Adsorption (Baluja et al., 2008)

  • After 1 iteration:

[Figure: the same graph with the seed labels (musician 1.0, singer 1.0) and the derived labels after one iteration. Label scores shown, in the order extracted: musician 0.7, singer 0.3, musician 1.0, singer 1.0, singer 0.8.]

SLIDE 31

Label Propagation: Adsorption (Baluja et al., 2008)

  • After 2 iterations:

[Figure: the same graph after two iterations. Label scores shown, in the order extracted: musician 1.0, singer 1.0, musician 0.7, singer 0.3, singer 0.8, singer 0.6, musician 0.4, singer 0.9, musician 0.1, musician 0.8, singer 0.2.]

SLIDE 32

Label Propagation: Adsorption (Baluja et al., 2008)

  • After 3 iterations:

[Figure: the same graph after three iterations. Label scores shown, in the order extracted: musician 1.0, singer 1.0, musician 0.7, singer 0.3, singer 0.6, musician 0.4, singer 0.9, musician 0.1, musician 0.8, singer 0.2, singer 0.8, musician 0.2.]
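The iterations above can be illustrated with a simplified label propagation sketch. It is not the Adsorption algorithm of Baluja et al., which includes machinery these slides do not show. Here it is assumed that seed nodes keep their labels fixed, and that every other node takes the weight-averaged label scores of its neighbors. The node names and the pairing of weights to edges are hypothetical, so the scores will not match the figures.

```python
from collections import defaultdict


def propagate(edges, seeds, iterations=3):
    """Simplified label propagation.

    edges: list of (node_a, node_b, weight) for an undirected graph.
    seeds: {node: {label: score}} for nodes whose labels stay fixed.
    Returns {node: {label: score}} after the given number of iterations.
    """
    neighbors = defaultdict(dict)
    for a, b, weight in edges:
        neighbors[a][b] = weight
        neighbors[b][a] = weight

    labels = {node: dict(seeds.get(node, {})) for node in neighbors}
    for _ in range(iterations):
        updated = {}
        for node, nbrs in neighbors.items():
            if node in seeds:
                # Assumption: seed labels are clamped on every iteration.
                updated[node] = dict(seeds[node])
                continue
            scores = defaultdict(float)
            for nbr, weight in nbrs.items():
                for label, score in labels[nbr].items():
                    scores[label] += weight * score
            total = sum(scores.values())
            # Normalize so the scores on each node sum to 1.
            updated[node] = (
                {label: s / total for label, s in scores.items()} if total else {}
            )
        labels = updated
    return labels


# Hypothetical assignment of the slide's edge weights to edges.
edges = [
    ("class:musician", "Bob Dylan", 0.95),
    ("class:musician", "Johnny Cash", 0.87),
    ("class:singer", "Bob Dylan", 0.82),
    ("class:singer", "Johnny Cash", 0.73),
    ("class:singer", "Billy Joel", 0.75),
]
seeds = {
    "class:musician": {"musician": 1.0},
    "class:singer": {"singer": 1.0},
}
labels = propagate(edges, seeds)
```

In this toy graph every instance is adjacent only to seeded class nodes, so the scores stop changing after the first iteration. In a larger graph, labels would keep spreading through unseeded nodes over several iterations.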

SLIDE 36

Experimental Setup

  • Dataset A8:
      • 924K (class, instance) pairs extracted from 100M web docs.
      • Extracted from unstructured text.
      • High precision, low recall.
  • Dataset WT:
      • 74M unique additional pairs extracted from WebTables.
      • Source of new instances, extracted from structured text.
      • Low precision, high recall.
      • Set of class labels in WT is the same as in A8.
  • Graph constructed using A8 + WT had 1.4M nodes and 75M edges. This graph is used in all subsequent experiments.

SLIDE 39

Experiments

  • EXPT 1: Can we find new instances for fixed classes?
  • EXPT 2: For a fixed set of instances, can we assign better class labels?

SLIDE 40

EXPT 1: Seed (Class, Instance) Pairs

Seed Class           Seed Instances
Book Publishers      Millbrook Press, Academic Press, Springer Verlag, Chronicle Books, Shambhala Publications
NFL Players          Ike Hilliard, Isaac Bruce, Torry Holt, Jon Kitna, Jamal Lewis
Scientific Journals  American Journal of Roentgenology, PNAS, Journal of Bacteriology, American Economic Review, IBM Systems Journal

Table: Classes and seeds used to initialize Adsorption.

SLIDE 42

EXPT 1: Finding new instances for fixed classes

Class                Precision at 100 (non-A8 extractions)
Book Publishers      87.36
Federal Agencies     29.89
NFL Players          94.95
Scientific Journals  90.82
Mammal Species       84.27

Table: Precision of top 100 Adsorption extractions not present in A8.

Coverage increased at precision level comparable to A8.
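The metric in the table above can be sketched as follows: take the ranked extractions for a class, drop those already in A8, and compute the percentage of the top k that are judged correct. The function name and the example judgments are hypothetical, and the slides do not say how correctness was judged.

```python
def precision_at_k(ranked, known, is_correct, k=100):
    """Precision (%) over the top-k ranked items not already in `known`."""
    novel = [item for item in ranked if item not in known][:k]
    if not novel:
        return 0.0
    hits = sum(1 for item in novel if is_correct(item))
    return 100.0 * hits / len(novel)


# Illustrative data only. "Not A Publisher" and the judgments are made up.
ranked = [
    "Prentice Hall",
    "Copper Canyon Press",
    "Highwater Books",
    "Not A Publisher",
    "House of Anansi Press",
]
already_in_a8 = {"Prentice Hall"}
judged_correct = {"Copper Canyon Press", "Highwater Books", "House of Anansi Press"}

p_at_4 = precision_at_k(ranked, already_in_a8, judged_correct.__contains__, k=4)
```

With k=4 the function scores the four extractions not already in A8, three of which are marked correct in this made-up example.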

SLIDE 43

New extractions found by Adsorption

Seed Class           Top-Ranked Instances Discovered by Adsorption
Scientific Journals  Journal of Physics, Nature, Structural and Molecular Biology, Sciences Sociales et santé, Kidney and Blood Pressure Research, American Journal of Physiology–Cell Physiology
NFL Players          Tony Gonzales, Thabiti Davis, Taylor Stubblefield, Ron Dixon, Rodney Hannah
Book Publishers      Small Night Shade Books, House of Anansi Press, Highwater Books, Distributed Art Publishers, Copper Canyon Press

SLIDE 44

Semantically similar class labels found by Adsorption: A Byproduct

Seed Class           Non-Seed Class Labels Discovered
Book Publishers      small presses, journal publishers, educational publishers, academic publishers, commercial publishers
NFL Players          sports figures, football greats, football players, backs, quarterbacks
Scientific Journals  prestigious journals, peer-reviewed journals, refereed journals, scholarly journals, academic journals

Table: Top class labels ranked by their similarity to a given seed class in Adsorption.

SLIDE 46

EXPT 2: Class assignment for fixed instances

[Figure: Mean Reciprocal Rank (MRR, ticks 0.15 to 0.6) plotted against Recall (%, ticks 30 to 70). Evaluation against WordNet Dataset (38 classes, 8910 instances). Curves: A8, WT, Adsorption (1 seed), Adsorption (5 seeds), Adsorption (10 seeds), Adsorption (25 seeds).]

Adsorption is able to assign better class labels to more instances.
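Mean Reciprocal Rank, the quality measure in this plot, can be sketched as follows. For each gold instance, take the reciprocal of the rank of the first correct class label in the system's ranked list (0 if none is found), then average over instances. The function name and the handling of unlabeled instances are assumptions, since the slides do not give the exact evaluation protocol. The example reuses class names from the Motivation slide, with made-up rankings.

```python
def mean_reciprocal_rank(predictions, gold):
    """predictions: {instance: [labels, best first]}; gold: {instance: set of labels}."""
    if not gold:
        return 0.0
    total = 0.0
    for instance, gold_labels in gold.items():
        ranked = predictions.get(instance, [])
        for rank, label in enumerate(ranked, start=1):
            if label in gold_labels:
                total += 1.0 / rank
                break
        # Assumption: an instance with no correct label contributes 0.
    return total / len(gold)


# Illustrative data only; the rankings are made up.
gold = {
    "aspirin": {"pain killer"},
    "Kilauea": {"active volcanoes"},
    "monkeypox": {"zoonotic diseases"},
}
predictions = {
    "aspirin": ["drugs", "pain killer"],  # correct label at rank 2
    "Kilauea": ["active volcanoes"],      # correct label at rank 1
    # "monkeypox" receives no labels in this example
}
mrr = mean_reciprocal_rank(predictions, gold)
```

Here the three reciprocal ranks are 1/2, 1 and 0, which average to 0.5.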

SLIDE 52

Conclusion

  • Demonstrated a scalable graph-based label propagation algorithm.
  • Improved coverage while maintaining adequate precision.
  • Combined information from two different sources: unstructured and structured texts.
  • Future Work:
      • Class label assignment in context.
      • Scaling up further!

SLIDE 53

Thank You!
