Snowball: Extracting Relations from Large Plain-Text Collections


slide-1
SLIDE 1

Snowball: Extracting Relations from Large Plain-Text Collections

Eugene Agichtein Luis Gravano Department of Computer Science Columbia University

slide-2
SLIDE 2

Eugene Agichtein Columbia University

Extracting Relations from Documents

Text documents hide valuable structured information. If we manage to extract this information:

  • We can answer user queries more accurately
  • We can run data mining tasks (e.g., finding trends)

slide-3
SLIDE 3

GOAL: Extract All Tuples “Hidden” in the Document Collection

System must:

  • Require minimal training for each new task
  • Recover from noise
  • Exploit redundancy of information in documents

slide-4
SLIDE 4

Example Task: Organization/Location

Apple's programmers "think different" on a "campus" in Cupertino, Cal.

Nike employees "just do it" at what the company refers to as its "World Campus," near Portland, Ore.

Microsoft's central headquarters in Redmond is home to almost every product group and division.

Brent Barlow, 27, a software analyst and beta-tester at Apple Computer headquarters in Cupertino, was fired Monday for "thinking a little too different."

Organization      Location
Microsoft         Redmond
Apple Computer    Cupertino
Nike              Portland

Redundancy: the Apple/Cupertino tuple appears in more than one text snippet.

slide-5
SLIDE 5

Extracting Relations from Text Collections

  • Related Work
  • The Snowball System
  • Evaluation Metrics
  • Experimental Results
slide-6
SLIDE 6

Related Work

  • Traditional Information Extraction
      – MUCs (Message Understanding Conferences)
          • Significant (manual) training for each new task
  • Bootstrapping
      – Riloff et al. (’99), Collins & Singer (’99)
          • (Named-entity recognition)
      – Brin (DIPRE) (’98)

slide-7
SLIDE 7

Extracting Relations from Text: DIPRE

Initial Seed Tuples -> Occurrences of Seed Tuples -> Generate Extraction Patterns -> Generate New Seed Tuples -> Augment Table

Initial Seed Tuples:

ORGANIZATION    LOCATION
MICROSOFT       REDMOND
IBM             ARMONK
BOEING          SEATTLE
INTEL           SANTA CLARA

slide-8
SLIDE 8

Extracting Relations from Text: DIPRE

Initial Seed Tuples -> Occurrences of Seed Tuples -> Generate Extraction Patterns -> Generate New Seed Tuples -> Augment Table

ORGANIZATION    LOCATION
MICROSOFT       REDMOND
IBM             ARMONK
BOEING          SEATTLE
INTEL           SANTA CLARA

Occurrences of seed tuples:

  • Computer servers at Microsoft’s headquarters in Redmond…
  • In mid-afternoon trading, shares of Redmond-based Microsoft fell…
  • The Armonk-based IBM introduced a new line…
  • The combined company will operate from Boeing’s headquarters in Seattle.
  • Intel, Santa Clara, cut prices of its Pentium processor.

slide-9
SLIDE 9

Extracting Relations from Text: DIPRE

Initial Seed Tuples -> Occurrences of Seed Tuples -> Generate Extraction Patterns -> Generate New Seed Tuples -> Augment Table

DIPRE Patterns:

  • <STRING1>’s headquarters in <STRING2>
  • <STRING2>-based <STRING1>
  • <STRING1>, <STRING2>

slide-10
SLIDE 10

Extracting Relations from Text: DIPRE

Initial Seed Tuples -> Occurrences of Seed Tuples -> Generate Extraction Patterns -> Generate New Seed Tuples -> Augment Table

Generate new seed tuples; start new iteration:

ORGANIZATION    LOCATION
AG EDWARDS      ST LUIS
157TH STREET    MANHATTAN
7TH LEVEL       RICHARDSON
3COM CORP       SANTA CLARA
3DO             REDWOOD CITY
JELLIES         APPLE
MACWEEK         SAN FRANCISCO

slide-11
SLIDE 11

Extracting Relations from Text: Potential Pitfalls

  • Invalid tuples generated
      – Degrade quality of tuples on subsequent iterations
      – Must have automatic way to select high quality tuples to use as new seed
  • Pattern representation
      – Patterns must generalize

slide-12
SLIDE 12

Extracting Relations from Text Collections

  • Related Work
      – DIPRE
  • The Snowball System:
      – Pattern representation and generation
      – Tuple generation
      – Automatic pattern and tuple evaluation
  • Evaluation Metrics
  • Experimental Results
slide-13
SLIDE 13

Extracting Relations from Text: Snowball

Initial Seed Tuples -> Occurrences of Seed Tuples -> Tag Entities -> Generate Extraction Patterns -> Generate New Seed Tuples -> Augment Table

Initial Seed Tuples:

ORGANIZATION    LOCATION
MICROSOFT       REDMOND
IBM             ARMONK
BOEING          SEATTLE
INTEL           SANTA CLARA

slide-14
SLIDE 14

Extracting Relations from Text: Snowball

Initial Seed Tuples -> Occurrences of Seed Tuples -> Tag Entities -> Generate Extraction Patterns -> Generate New Seed Tuples -> Augment Table

ORGANIZATION    LOCATION
MICROSOFT       REDMOND
IBM             ARMONK
BOEING          SEATTLE
INTEL           SANTA CLARA

Occurrences of seed tuples:

  • Computer servers at Microsoft’s headquarters in Redmond…
  • In mid-afternoon trading, shares of Redmond-based Microsoft fell…
  • The Armonk-based IBM introduced a new line…
  • The combined company will operate from Boeing’s headquarters in Seattle.
  • Intel, Santa Clara, cut prices of its Pentium processor.

slide-15
SLIDE 15

Problem: Patterns Excessively General

Pattern: <STRING2>-based <STRING1>

  • Today's merger with McDonnell Douglas positions Seattle-based Boeing to make major money in space.
  • …, a producer of apple-based jelly, ...

The second sentence yields the invalid tuple <jelly, apple>.
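A minimal sketch of why the untyped pattern overgeneralizes. The regular expression is an illustrative assumption, not DIPRE's actual pattern syntax, and `apply_based_pattern` is a hypothetical helper name.

```python
import re

# Hypothetical regex rendering of the pattern "<STRING2>-based <STRING1>".
# Any two words qualify, because the pattern has no notion of entity type.
BASED = re.compile(r"(\w+)\s?-based (\w+)")

def apply_based_pattern(text):
    """Return (STRING1, STRING2) pairs matched by the untyped pattern."""
    return [(m.group(2), m.group(1)) for m in BASED.finditer(text)]

good = apply_based_pattern("positions Seattle-based Boeing to make major money")
bad = apply_based_pattern("a producer of apple-based jelly")
print(good)  # [('Boeing', 'Seattle')]
print(bad)   # [('jelly', 'apple')]
```

Both sentences match equally well; only knowledge of entity types can tell the valid tuple from the invalid one.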

slide-16
SLIDE 16

Extracting Relations from Text: Snowball

Initial Seed Tuples -> Occurrences of Seed Tuples -> Tag Entities -> Generate Extraction Patterns -> Generate New Seed Tuples -> Augment Table

Tag Entities: use MITRE’s Alembic Named Entity tagger

  • Computer servers at Microsoft’s headquarters in Redmond…
  • In mid-afternoon trading, shares of Redmond-based Microsoft fell…
  • The Armonk-based IBM introduced a new line…
  • The combined company will operate from Boeing’s headquarters in Seattle.
  • Intel, Santa Clara, cut prices of its Pentium processor.

[Himanshu] + use of types

slide-17
SLIDE 17

Extracting Relations from Text

Initial Seed Tuples -> Occurrences of Seed Tuples -> Tag Entities -> Generate Extraction Patterns -> Generate New Seed Tuples -> Augment Table

Computer servers at Microsoft's headquarters in Redmond...
Exxon, Irving, said it will boost its stake in the...
In midafternoon trading, shares of Irving-based Exxon fell…
The Armonk-based IBM has introduced a new line ...
The combined company will operate from Boeing's headquarters in Seattle.
Intel, Santa Clara, cut prices of its Pentium...

  • <ORGANIZATION>’s headquarters in <LOCATION>
  • <LOCATION>-based <ORGANIZATION>
  • <ORGANIZATION>, <LOCATION>

PROBLEM: Patterns too specific: have to match text exactly.

[Akshay] + inexact pattern matching

slide-18
SLIDE 18

Snowball: Pattern Representation

A Snowball pattern vector is a 5-tuple <left, tag1, middle, tag2, right>, where:

  – tag1, tag2 are named-entity tags
  – left, middle, and right are vectors of weighted terms.

Example: ORGANIZATION 's central headquarters in LOCATION is home to...

  tag1    ORGANIZATION
  middle  {<'s 0.5>, <central 0.5>, <headquarters 0.5>, <in 0.5>}
  tag2    LOCATION
  right   {<is 0.75>, <home 0.75>}
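One possible way to hold the 5-tuple in code, as a sketch only. Storing each context as a sparse dict from term to weight is an assumption, and the class name `Pattern` is hypothetical. The left context is left empty because the slide's example shows none.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Pattern:
    """A <left, tag1, middle, tag2, right> pattern with sparse term weights."""
    tag1: str
    tag2: str
    left: Dict[str, float] = field(default_factory=dict)
    middle: Dict[str, float] = field(default_factory=dict)
    right: Dict[str, float] = field(default_factory=dict)

# The example from the slide; weights are copied from it, not computed here.
example = Pattern(
    tag1="ORGANIZATION",
    tag2="LOCATION",
    middle={"'s": 0.5, "central": 0.5, "headquarters": 0.5, "in": 0.5},
    right={"is": 0.75, "home": 0.75},
)
print(example.tag1, sorted(example.middle), example.tag2)
```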

slide-19
SLIDE 19

Snowball: Pattern Generation

Tagged occurrences of seed tuples:

  • Computer servers at Microsoft’s central headquarters in Redmond…
  • In mid-afternoon trading, shares of Redmond-based Microsoft fell…
  • The Armonk-based IBM introduced a new line…
  • The combined company will operate from Boeing’s headquarters in Seattle.

slide-20
SLIDE 20

Snowball Pattern Generation: Cluster Similar Occurrences

Occurrences of seed tuples converted to Snowball representation:

  • {<servers 0.75> <at 0.75>} ORGANIZATION {<’s 0.5> <central 0.5> <headquarters 0.5> <in 0.5>} LOCATION
  • {<shares 0.75> <of 0.75>} LOCATION {<- 0.75> <based 0.75>} ORGANIZATION {<fell 1>}
  • {<the 1>} LOCATION {<- 0.75> <based 0.75>} ORGANIZATION {<introduced 0.75> <a 0.75>}
  • {<operate 0.75> <from 0.75>} ORGANIZATION {<’s 0.7> <headquarters 0.7> <in 0.7>} LOCATION

[Dinesh] + vector-based rep of context
slide-21
SLIDE 21

Similarity Metric

$$
\mathrm{Match}(P, S) =
\begin{cases}
L_p \cdot L_s + M_p \cdot M_s + R_p \cdot R_s & \text{if the tags match} \\
0 & \text{otherwise}
\end{cases}
$$

where $P = \langle L_p, tag_1, M_p, tag_2, R_p \rangle$ and $S = \langle L_s, tag_1, M_s, tag_2, R_s \rangle$.

[Ankit] • Could be better?

[Yash] • Semantic eq of context missing?
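The metric can be sketched in a few lines, assuming each context is a sparse dict of term weights and each 5-tuple is a plain Python tuple. The helper names `dot` and `match` are illustrative, not taken from the Snowball code.

```python
def dot(u, v):
    """Inner product of two sparse term-weight vectors (dicts)."""
    return sum(w * v[t] for t, w in u.items() if t in v)

def match(p, s):
    """Sum of left, middle and right dot products if the tags agree, else 0.

    p and s are tuples (left, tag1, middle, tag2, right).
    """
    lp, p_tag1, mp, p_tag2, rp = p
    ls, s_tag1, ms, s_tag2, rs = s
    if (p_tag1, p_tag2) != (s_tag1, s_tag2):
        return 0.0
    return dot(lp, ls) + dot(mp, ms) + dot(rp, rs)

p = ({}, "LOCATION", {"-": 0.75, "based": 0.75}, "ORGANIZATION", {})
s = ({"the": 1.0}, "LOCATION", {"-": 0.75, "based": 0.75},
     "ORGANIZATION", {"introduced": 0.75, "a": 0.75})

same_tags = match(p, s)  # only the middle contexts overlap
swapped = match(p, (s[0], "ORGANIZATION", s[2], "LOCATION", s[4]))
print(same_tags, swapped)
```

With swapped tags the score drops to zero regardless of how similar the surrounding words are.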

slide-22
SLIDE 22

Snowball Pattern Generation: Clustering

Cluster 1:

  • {<servers 0.75> <at 0.75>} ORGANIZATION {<’s 0.5> <central 0.5> <headquarters 0.5> <in 0.5>} LOCATION
  • {<operate 0.75> <from 0.75>} ORGANIZATION {<’s 0.7> <headquarters 0.7> <in 0.7>} LOCATION

Cluster 2:

  • {<shares 0.75> <of 0.75>} LOCATION {<- 0.75> <based 0.75>} ORGANIZATION {<fell 1>}
  • {<the 1>} LOCATION {<- 0.75> <based 0.75>} ORGANIZATION {<introduced 0.75> <a 0.75>}

slide-23
SLIDE 23

Snowball: Pattern Generation

Patterns are formed as centroids of the clusters. Filtered by minimum number of supporting tuples.

  • Pattern1: ORGANIZATION {<’s 0.7> <in 0.7> <headquarters 0.7>} LOCATION
  • Pattern2: LOCATION {<- 0.75> <based 0.75>} ORGANIZATION

slide-24
SLIDE 24

Snowball: Tuple Extraction

Initial Seed Tuples -> Occurrences of Seed Tuples -> Tag Entities -> Generate Extraction Patterns -> Generate New Seed Tuples -> Augment Table

Using the patterns, scan the collection to generate new seed tuples:

slide-25
SLIDE 25

Snowball: Tuple Extraction

Represent each new text segment in the collection as the context 5-tuple:

  Text:     Netscape 's flashy headquarters in Mountain View is near
  Segment:  ORGANIZATION {<'s 0.5>, <flashy 0.5>, <headquarters 0.5>, <in 0.5>} LOCATION {<is 0.75>, <near 0.75>}

Find most similar pattern (if any):

  Pattern:  ORGANIZATION {<'s 0.7>, <headquarters 0.7>, <in 0.7>} LOCATION
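A sketch of the "find most similar pattern (if any)" step, using the Netscape segment from this slide. The similarity threshold `min_sim` and its value 0.6 are hypothetical, and `best_pattern` is an illustrative helper rather than the system's own routine.

```python
def dot(u, v):
    """Inner product of two sparse term-weight vectors (dicts)."""
    return sum(w * v[t] for t, w in u.items() if t in v)

def match(p, s):
    """Similarity of two (left, tag1, middle, tag2, right) tuples."""
    if (p[1], p[3]) != (s[1], s[3]):
        return 0.0
    return dot(p[0], s[0]) + dot(p[2], s[2]) + dot(p[4], s[4])

def best_pattern(segment, patterns, min_sim):
    """Return (index, score) of the most similar pattern, or None."""
    best = None
    for idx, pattern in enumerate(patterns):
        score = match(pattern, segment)
        if score >= min_sim and (best is None or score > best[1]):
            best = (idx, score)
    return best

segment = ({}, "ORGANIZATION",
           {"'s": 0.5, "flashy": 0.5, "headquarters": 0.5, "in": 0.5},
           "LOCATION", {"is": 0.75, "near": 0.75})
patterns = [
    ({}, "ORGANIZATION", {"'s": 0.7, "headquarters": 0.7, "in": 0.7}, "LOCATION", {}),
    ({}, "LOCATION", {"-": 0.75, "based": 0.75}, "ORGANIZATION", {}),
]

hit = best_pattern(segment, patterns, min_sim=0.6)   # first pattern wins
miss = best_pattern(segment, patterns, min_sim=2.0)  # nothing is similar enough
print(hit, miss)
```

The extra word "flashy" lowers the score but does not block the match, which is the point of the vector representation.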

slide-26
SLIDE 26

Snowball: Automatic Pattern Evaluation

Automatically estimate probability of a pattern generating valid tuples:

$$\mathrm{Conf}(Pattern) = \frac{Positive}{Positive + Negative}$$

Seed tuples:

ORGANIZATION    LOCATION
MICROSOFT       REDMOND
IBM             ARMONK
BOEING          SEATTLE
INTEL           SANTA CLARA

Pattern “ORGANIZATION, LOCATION” in action:

  Boeing, Seattle, said…                                         Positive
  Intel, Santa Clara, cut prices…                                Positive
  invest in Microsoft, New York-based analyst Jane Smith said    Negative

Pattern Confidence: e.g., Conf(Pattern) = 2/3 ≈ 67%

[Yash] • Primary key assumption
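The confidence estimate can be sketched as follows, treating the organization as a key into the seed table. Skipping organizations that are not among the seeds is an assumption of this sketch, and `pattern_confidence` is a hypothetical name.

```python
def pattern_confidence(extracted, seeds):
    """Positive / (Positive + Negative) for one pattern.

    extracted: (organization, location) pairs the pattern produced.
    seeds: dict organization -> known location (organization acts as key).
    Pairs whose organization is not among the seeds are not counted.
    """
    positive = negative = 0
    for org, loc in extracted:
        if org not in seeds:
            continue
        if seeds[org] == loc:
            positive += 1
        else:
            negative += 1
    total = positive + negative
    return positive / total if total else 0.0

seeds = {"MICROSOFT": "REDMOND", "IBM": "ARMONK",
         "BOEING": "SEATTLE", "INTEL": "SANTA CLARA"}
matches = [("BOEING", "SEATTLE"), ("INTEL", "SANTA CLARA"), ("MICROSOFT", "NEW YORK")]

conf = pattern_confidence(matches, seeds)  # 2 positive, 1 negative
print(round(conf, 2))
```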

slide-27
SLIDE 27

Snowball: Automatic Tuple Evaluation

$$\mathrm{Conf}(Tuple) = 1 - \prod_i \left(1 - \mathrm{Conf}(P_i)\right)$$

  – Estimation of Probability(Correct(Tuple))
  – A tuple will have high confidence if generated by multiple high-confidence patterns (Pi).

Example: <Apple Computer, Cupertino>

  • Apple's programmers "think different" on a "campus" in Cupertino, Cal.
  • Brent Barlow, 27, a software analyst and beta-tester at Apple Computer headquarters in Cupertino, was fired Monday for "thinking a little too different."

[Barun] • Not a good metric??
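A small worked example, reading the formula as one minus the product of (1 - Conf(Pi)) over the patterns that generated the tuple. The confidence values 0.5 and 0.8 are made up for illustration.

```python
from math import prod

def tuple_confidence(pattern_confs):
    """1 minus the product of (1 - Conf(P_i)) over the generating patterns."""
    return 1.0 - prod(1.0 - c for c in pattern_confs)

single = tuple_confidence([0.5])       # one pattern of confidence 0.5
double = tuple_confidence([0.5, 0.8])  # a second, stronger pattern agrees
print(round(single, 2), round(double, 2))  # 0.5 0.9
```

Each additional supporting pattern can only raise the confidence, which is how redundancy in the collection is rewarded.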

slide-28
SLIDE 28

Snowball: Filtering Seed Tuples

Initial Seed Tuples -> Occurrences of Seed Tuples -> Tag Entities -> Generate Extraction Patterns -> Generate New Seed Tuples -> Augment Table

Generate new seed tuples:

ORGANIZATION           LOCATION         CONF
AG EDWARDS             ST LUIS          0.93
AIR CANADA             MONTREAL         0.89
7TH LEVEL              RICHARDSON       0.88
3COM CORP              SANTA CLARA      0.8
3DO                    REDWOOD CITY     0.8
3M                     MINNEAPOLIS      0.8
MACWORLD               SAN FRANCISCO    0.7
157TH STREET           MANHATTAN        0.52
15TH CENTURY EUROPE    NAPOLEON         0.3
15TH PARTY CONGRESS    CHINA            0.3
MAD                    SMITH            0.3

[Akshay] + confidence score management
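Filtering by confidence might look like the sketch below. The cutoff of 0.8 is chosen only for illustration, and `select_seeds` is a hypothetical helper; a few rows from the table above serve as input.

```python
def select_seeds(candidates, min_conf):
    """Keep the (organization, location) pairs whose confidence reaches min_conf."""
    return [(org, loc) for org, loc, conf in candidates if conf >= min_conf]

candidates = [
    ("AG EDWARDS", "ST LUIS", 0.93),
    ("AIR CANADA", "MONTREAL", 0.89),
    ("3COM CORP", "SANTA CLARA", 0.8),
    ("157TH STREET", "MANHATTAN", 0.52),
    ("MAD", "SMITH", 0.3),
]

new_seeds = select_seeds(candidates, min_conf=0.8)
print(new_seeds)
```

Low-confidence rows such as MAD / SMITH never become seeds, so they cannot contaminate later iterations.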

slide-29
SLIDE 29

Extracting Relations from Text Collections

  • Related Work
  • The Snowball System:
      – Pattern representation and generation
      – Tuple generation
      – Automatic pattern and tuple evaluation
  • Evaluation Metrics
  • Experimental Results

[Kuldeep] • Many thresholds
slide-30
SLIDE 30

Task Evaluation Methodology

  • Data: Large collection, extracted tables contain many tuples (> 80,000)
  • Need scalable methodology:
      – Ideal set of tuples
      – Automatic recall/precision estimation
  • Estimated precision using sampling

[Dinesh] + automatic eval

slide-31
SLIDE 31

Collections used in Experiments

More than 300,000 real newspaper articles

Collection    Source                     Year
Training      The New York Times         1996
              The Wall Street Journal    1996
              The Los Angeles Times      1996
Test          The New York Times         1995
              The Wall Street Journal    1995
              The Los Angeles Times      1995, ’97

slide-32
SLIDE 32

The Ideal Metric (1)

Creating the Ideal set of tuples

[Diagram: two overlapping sets, "All tuples mentioned in the collection" and "Hoover’s directory (13K+ organizations)"; the overlap is marked Ideal*.]

* A perfect (ideal) system would be able to extract all these tuples

slide-33
SLIDE 33

The Ideal Metric (2)

  • Precision:

$$\frac{|\mathrm{Correct}(Extracted \cap Ideal)|}{|Extracted \cap Ideal|}$$

  • Recall:

$$\frac{|\mathrm{Correct}(Extracted \cap Ideal)|}{|Ideal|}$$

[Diagram labels: Extracted, Ideal, Correct location found]

[Danish] + great way to compute precision
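A sketch of the two ratios, treating the overlap of Extracted and Ideal as the tuples whose organization appears in both tables, and counting a tuple as correct when the locations agree. That reading, the exact-string comparison, and the name `ideal_metric` are assumptions of this sketch.

```python
def ideal_metric(extracted, ideal):
    """Return (precision, recall) for two dicts organization -> location."""
    shared = extracted.keys() & ideal.keys()  # organizations in both tables
    correct = sum(1 for org in shared if extracted[org] == ideal[org])
    precision = correct / len(shared) if shared else 0.0
    recall = correct / len(ideal) if ideal else 0.0
    return precision, recall

ideal = {"Microsoft": "Redmond", "IBM": "Armonk",
         "Intel": "Santa Clara", "Netscape": "Mountain View"}
extracted = {"Microsoft": "Redmond", "IBM": "Armonk",
             "Intel": "New York", "Jellies": "Apple"}

precision, recall = ideal_metric(extracted, ideal)
print(round(precision, 2), recall)
```

Note that the bogus tuple for "Jellies" does not affect this precision at all, because its organization is absent from Ideal; that is one reason sampling is used as a second estimate.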

slide-34
SLIDE 34

Estimate Precision by Sampling

  • Sample extracted table
      – Random samples, each 100 tuples
  • Manually check validity of tuples in each sample

[Haroun] • fishy?
slide-35
SLIDE 35

Extracting Relations from Text Collections

  • Related Work
  • The Snowball System:
      – Pattern representation and generation
      – Tuple generation
      – Automatic pattern and tuple validation
  • Evaluation Metrics
  • Experimental Results
slide-36
SLIDE 36

Experimental results: Test Collection

[Figure: panels (a) and (b)] Recall (a) and precision (b) using the Ideal metric, plotted against the minimal number of occurrences of test tuples in the collection

[Akshay/Dinesh]

  • Eval on 1 rel
  • Prec/recall tradeoff missing

slide-37
SLIDE 37

Experimental results: Sample and Check

[Figure: panels (a) and (b)] Recall (a) and precision (b) for varying minimum confidence threshold Tt.

NOTE: Recall is estimated using the Ideal metric; precision is estimated by manually checking random samples of the result table.

slide-38
SLIDE 38

Approximate Matching of Organizations

  • Use Whirl (W. Cohen @ AT&T) to match similar organization names
  • Self-join the Extracted table on the Organization attribute
  • Join resulting table with the Test table, and compare values of Location attributes

Organization            Location
Microsoft Corp          Wash.
Microsoft Corporation   Redmond
Microsoft Corp.         WA
Apple Computer          Calif.
Apple Corp              Cupertino
Apple Computer Corp.    US

Location         Organization
Redmond          Microsoft
Armonk           IBM
Santa Clara      Intel
Mountain View    Netscape
Cupertino        Apple

Table labels on the slide: Extracted, Extracted′

[Himanshu] • NER errors
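A rough illustration of grouping name variants. This does not use Whirl: Python's `difflib.SequenceMatcher` stands in as a simple character-level similarity, and the 0.7 threshold and the greedy grouping strategy are assumptions made for the sketch.

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.7):
    """True if two organization names look alike (character-level ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def group_names(names, threshold=0.7):
    """Greedy grouping: add each name to the first group whose head it resembles."""
    groups = []
    for name in names:
        for group in groups:
            if similar(name, group[0], threshold):
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

names = ["Microsoft Corp", "Microsoft Corporation", "Microsoft Corp.",
         "Apple Computer", "Apple Corp", "Apple Computer Corp."]
groups = group_names(names)
for group in groups:
    print(group)
```

Once variants are grouped, the locations attached to each group can be compared against the test table as the bullets describe.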
slide-39
SLIDE 39

Conclusions

We presented

  • Our Snowball system:
      – Requires minimal training (handful of seed tuples)
      – Uses a flexible pattern representation
      – Achieves high recall/precision (> 80% of test tuples extracted)
  • Scalable evaluation methodology

[Haroun] + qual analysis

slide-40
SLIDE 40

Critique

  • Negation
      [Himanshu] Algorithm does not take into account the semantic meaning of words in a given pattern, which may lead to inaccurate results. For example, it may extract tuples that follow the pattern <organization> is not located in <headquarters>
  • Orgs with multiple offices //gen principle?
      [Gagan] Google’s NY office
slide-41
SLIDE 41

Recent and Future Work

  • Recent (presented in DMKD’00 workshop)
      – Alternative pattern representation
      – Combining representations
  • Future Work
      – Evaluation on other extraction tasks
      – Extensions:
          • Non-binary relations
          • Relations with no key
      – HTML documents

[Dinesh]

  • Only binary
  • Seed per relation
slide-42
SLIDE 42

Snowball: Extracting Relations from Large Plain-Text Collections

Eugene Agichtein (eugene@cs.columbia.edu) Luis Gravano Department of Computer Science Columbia University

slide-43
SLIDE 43

Backup Slides

slide-44
SLIDE 44

Snowball Solutions

  • Flexible pattern representation
  • Pattern generation
  • Automatic pattern and tuple evaluation
      – Able to recover from noise
      – Keeps only high quality tuples as new seed

slide-45
SLIDE 45

Experimental Results: Training

[Figure: panels (a) and (b)] Recall (a) and precision (b) using the Ideal metric (training collection)

slide-46
SLIDE 46

Sampling Results: Error Analysis

                                                 Type of Error
                         Correct   Incorrect   Location   Organization   Relationship
DIPRE                    74        26          3          28             5
Snowball (all tuples)    52        48          6          41             1
Snowball (t = 0.8)       93        7           3          4
Baseline                 25        75          8          62             5

The tuples in the random samples were checked by hand to pinpoint the “culprits” responsible for incorrect tuples. Sample size is 100.

slide-47
SLIDE 47

Sample Discovered Patterns

Left / Middle / Right terms, followed by Conf:

  <NEAR 0.01> <IN 0.79> <HEADQUARTERS 0.03> <, 0.20>                     0.36
  <OF 0.61> <, 0.61> <, 0.15>                                            0.37
  <- 0.53> <BASED 0.53> <, 0.25> <SAID 0.1>                              1
  <WHILE 0.01> <BASED 0.52> <IN 0.52> <, 0.43> <, 0.28>                  0.96
  <- 0.70> <, 0.08>                                                      0.63
  <FROM 0.01> <S 0.52> <' 0.52> <IN 0.24> <HEADQUARTERS 0.22> <AND 0.01> 0.69

slide-48
SLIDE 48

Convergence of Snowball and DIPRE

[Figure: panels (a) and (b)] Precision (a) and recall (b) of DIPRE and Snowball with increased iterations

slide-50
SLIDE 50

References

  • Blum & Mitchell. Combining labeled and unlabeled data with co-training. Proceedings of the 1998 Conference on Computational Learning Theory.
  • Brin. Extracting patterns and relations from the World-Wide Web. Proceedings of the 1998 International Workshop on Web and Databases (WebDB’98).
  • Collins & Singer. Unsupervised models for named entity classification. EMNLP 1999.
  • Riloff & Jones. Learning dictionaries for information extraction by multi-level bootstrapping. AAAI’99.
  • Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. ACL’95.