
1

We KnowItAll:

Lessons from a Quarter Century of Web Extraction Research

Oren Etzioni, Turing Center, University of Washington

http://turing.cs.washington.edu

2

KnowItAll Team

Michele Banko, Michael Cafarella, Tom Lin, Mausam, Alan Ritter, Stefan Schoenmackers, Stephen Soderland, Dan Weld.

Alumni: Doug Downey, Ana-Maria Popescu, Alex Yates, and others.


3

“In the Knowledge Lies the Power”

Manual "knowledge engineering": Cyc, Freebase, WordNet

Volunteers: Open Mind, DBpedia; von Ahn's games (Verbosity)

KnowItAll project: "read" the Web

4

Outline

1. Turing Center
2. Machine Reading & Open IE
3. TextRunner
4. Textual Inference
5. Conclusions

5

2. What is Machine Reading?

Unsupervised understanding of text: information extraction + inference

6

Information Extraction (IE) via Rules

(Hearst ’92)

"… Cities such as Boston, Los Angeles, and Seattle …"

("C such as NP1, NP2, and NP3") => IS-A(each(head(NPi)), C)

The rule needs applicability conditions such as ProperNoun(head(NP)); otherwise it misfires:

"Detailed information for several countries such as maps, …"

"I listen to pretty much all music but prefer country such as Garth Brooks"
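A Hearst-style rule of this kind can be sketched as a small regular-expression extractor. This is a minimal illustration, not Hearst's or KnowItAll's actual implementation; a capitalization test stands in for the real ProperNoun check.

```python
import re

# Minimal sketch of a Hearst-style "C such as NP1, NP2, and NP3" rule.
# The ProperNoun(head(NP)) applicability condition is approximated by
# requiring each extracted NP to start with a capital letter.
PATTERN = re.compile(
    r"(\w+) such as ((?:[A-Z]\w+(?: [A-Z]\w+)*(?:, and |, | and )?)+)"
)

def extract_isa(sentence):
    """Return (instance, class) pairs matched by the rule."""
    pairs = []
    for cls, nps in PATTERN.findall(sentence):
        for np in re.split(r", and |, | and ", nps):
            if np and np[0].isupper():   # crude ProperNoun check
                pairs.append((np.strip(), cls))
    return pairs

print(extract_isa("Cities such as Boston, Los Angeles, and Seattle attract tourists."))
```

Note that even with the ProperNoun condition, the "prefer country such as Garth Brooks" sentence still yields the spurious pair ("Garth Brooks", "country"), which is exactly the slide's point: surface patterns alone are not enough.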


7

IE as Supervised Learning

(E.g., Riloff ‘96, Soderland ’99)

Hand-label examples of each relation
Manual labor linear in |relations|
Learn a lexicalized extractor

Example: "S. Smith, formerly chairman of XYZ Corporation …"
ManagementSuccession slots: Person-In, Person-Out, Organization, Position
Learned pattern: <P2> formerly <POS> of <ORG>

Labeled Examples => Relation Extractor

8

Semi-Supervised Learning

Few hand-labeled examples per relation!
Limit on the number of relations
Relations are pre-specified
Problematic for Machine Reading

Alternative: self-supervised learning
Learner discovers relations on the fly (Sekine '06)
Learner automatically labels examples
"Look Ma, No Hands!" (Downey & Etzioni '08)


9

Open IE = Self-supervised IE

(Banko et al., IJCAI '07, ACL '08)

                 Traditional IE                    Open IE
Input:           Corpus + hand-labeled data        Corpus
Relations:       Specified in advance              Discovered automatically
Complexity:      O(D * R), R relations             O(D), D documents
Output:          Lexicalized, relation-specific    Relation-independent

10

How is Open IE Possible?

There is a compact set of relationship expressions in English (Banko & Etzioni, ACL '08).

"Relationship expressions" can be specified without reference to the individual relation!

Gates founded Microsoft => founded(Gates, Microsoft)


11

Relation-Independent Extraction Rules

Category       Frequency   Pattern                  Example
Verb           37.8%       E1 Verb E2               X established Y
Noun+Prep      22.8%       E1 NP Prep E2            the X settlement with Y
Verb+Prep      16.0%       E1 Verb Prep E2          X moved to Y
Infinitive     9.4%        E1 to Verb E2            X to acquire Y
Modifier       5.2%        E1 Verb E2 NP            X is Y winner
Coordinate_n   1.8%        E1 (and|,|-|:) E2 NP     X-Y deal
Coordinate_v   1.0%        E1 (and|,) E2 Verb       X, Y merge

12

Relation-Independent Observations

  • 95% of sample covered by 8 broad patterns!
  • Rules are simplified…
  • Applicability conditions are complex

"E1 Verb E2" also matches non-relations:
  1. Kentucky Fried Chicken
  2. Microsoft announced Tuesday that…

  • Learned via a CRF model
  • Effective, but far from perfect!

Lesson 1: to scale IE to the Web corpus, use Open IE.

What about extraction errors?


15

Leveraging the Web’s Redundancy

(Downey, Soderland, & Etzioni, IJCAI ’05)

1) Repetition in distinct sentences
2) Multiple, distinct extraction rules

Phrase                          Hits
"Atlanta and other cities"        980
"Canada and other cities"         286
"cities such as Atlanta"        5,860
"cities such as Canada"             7

IJCAI '05: a formal model of these intuitions

16

IJCAI ’05 Combinatorial Model

If an extraction x appears k times in a set of n distinct sentences matching a rule, what is the probability that x ∈ class C?
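The question can be illustrated with a deliberately simplified stand-in for the IJCAI '05 model: assume a true member of C is extracted from a matching sentence with probability p_true and a spurious string with probability p_noise, then apply Bayes' rule to the observed counts. The parameter values below are invented for illustration only.

```python
from math import comb

def p_in_class(k, n, prior=0.5, p_true=0.9, p_noise=0.1):
    """P(x in C | x extracted k times from n matching sentences).

    Simplified stand-in for the IJCAI '05 combinatorial ("urns") model:
    a correct member is extracted from each matching sentence with
    probability p_true, an error with probability p_noise.
    """
    like_true = comb(n, k) * p_true**k * (1 - p_true)**(n - k)
    like_false = comb(n, k) * p_noise**k * (1 - p_noise)**(n - k)
    return prior * like_true / (prior * like_true + (1 - prior) * like_false)

# "cities such as Atlanta" is frequent; "cities such as Canada" is rare.
print(p_in_class(k=9, n=10))   # near 1: repeated extraction, likely a city
print(p_in_class(k=1, n=10))   # near 0: rare extraction, likely an error
```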


Lesson 2: utilize the Web’s redundancy

18

What about complex sentences?

Paris, which has been proclaimed by many literary figures to be a great source of inspiration, is also a capital city, but not the capital city of an ordinary country, but rather the capital of a great republic --- the republic of France!

vs. Paris is the Capital of France.


19

But What About…

Relative clauses
Disjunction & conjunction
Anaphora
Quantification
Temporal qualification
Counterfactuals

See Van Durme & Schubert '08; MacCartney '08

Lesson 3: focus on ‘tractable’ sentences


21

3. TextRunner (the Web's 1st Open IE system)

1. Self-Supervised Learner: automatically labels example extractions & trains a CRF-based extractor

2. Single-Pass Extractor: makes a single pass over the corpus, identifying extractions in 'tractable' sentences

3. Search Engine: indexes and ranks extractions based on redundancy

4. Query Processor: interactive speeds
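The shape of the extractor's output can be sketched with a toy single-pass extractor. A hand-written regex and a tiny invented verb list stand in for TextRunner's CRF, and only the simplest 'tractable' sentences match.

```python
import re

# Toy single-pass extractor: find (arg1, relation, arg2) triples in simple
# "NP Verb NP" sentences. The verb lexicon is invented for illustration;
# TextRunner itself uses a learned CRF, not a fixed list.
TRIPLE = re.compile(
    r"([A-Z][\w ]*?)\s+(founded|acquired|moved to|taught at)\s+([A-Z][\w ]*)"
)

def extract_triples(corpus):
    """One pass over the corpus, emitting (arg1, rel, arg2) tuples."""
    triples = []
    for sentence in corpus:
        for arg1, rel, arg2 in TRIPLE.findall(sentence):
            triples.append((arg1.strip(), rel, arg2.strip()))
    return triples

corpus = ["Gates founded Microsoft.", "Oppenheimer taught at Berkeley."]
print(extract_triples(corpus))
```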

22

Application: Information Fusion

What kills bacteria?
What west coast nano-technology companies are hiring?
Compare Obama's "buzz" versus Hillary's?
What is a quiet, inexpensive, 4-star hotel in Vancouver?


TextRunner Demo

Extraction run at Google on 500,000,000 high-quality Web pages.

24


25 26

Sample of Web pages:

Triples                        11.3 million
  With well-formed relation     9.3 million
  With well-formed entities     7.8 million
    Abstract                    6.8 million (79.2% correct)
    Concrete                    1.0 million (88.1% correct)

Concrete facts: (Oppenheimer, taught at, Berkeley)
Abstract facts: (fruit, contain, vitamins)


Lesson 4: Open IE yields a mass of isolated “nuggets”

Textual inference is the next step towards machine reading.

28

4. Examples of Textual Inference

I. Entity and predicate resolution
II. Composing facts to draw conclusions


29

I. Entity Resolution

P(String1 = String2)?
Resolver (Yates & Etzioni, HLT '07): determines synonymy based on relations found by TextRunner (cf. Pantel & Lin '01)

(X, born in, 1941)     (M, born in, 1941)
(X, citizen of, US)    (M, citizen of, US)
(X, friend of, Joe)    (M, friend of, Mary)

P(X = M) ~ shared relations

30

Relation Synonymy

(1, R, 2) (2, R, 4) (4, R, 8) etc.
(1, Q, 2) (2, Q, 4) (4, Q, 8) etc.

P(R = Q) ~ shared argument pairs

  • See paper for the probabilistic model
  • O(n log(n)) algorithm, n = |tuples|
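The shared-argument-pair intuition can be sketched with a Jaccard overlap. This is only a stand-in for Resolver's actual probabilistic model described in the paper.

```python
def argument_pairs(triples, relation):
    """Set of (arg1, arg2) pairs observed with the given relation."""
    return {(a1, a2) for a1, r, a2 in triples if r == relation}

def synonymy_score(triples, r, q):
    """Jaccard overlap of shared argument pairs: a simple stand-in
    for Resolver's probabilistic estimate of P(R = Q)."""
    pr, pq = argument_pairs(triples, r), argument_pairs(triples, q)
    return len(pr & pq) / len(pr | pq) if pr | pq else 0.0

triples = [("1", "R", "2"), ("2", "R", "4"), ("4", "R", "8"),
           ("1", "Q", "2"), ("2", "Q", "4"), ("4", "Q", "8"),
           ("1", "S", "3")]
print(synonymy_score(triples, "R", "Q"))  # identical argument pairs
print(synonymy_score(triples, "R", "S"))  # no shared argument pairs
```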

31

Functional Relations help resolution

1. married-to(Oren, X)
2. married-to(Oren, Y)

If function(married-to) then P(X = Y) is high.

Caveat: assumes both mentions of Oren refer to the same individual.

Lesson 4: leverage relations implicit in the text

Functional relations are particularly informative.
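The functional-relation heuristic can be sketched as follows. The FUNCTIONAL set and the triples are illustrative only, and the slide's caveat applies: both "Oren" mentions are assumed to denote the same person.

```python
from collections import defaultdict

# Relations assumed (for illustration) to map each arg1 to a single arg2.
FUNCTIONAL = {"married-to", "born-in"}

def synonym_candidates(triples):
    """If a functional relation maps the same arg1 to two different
    strings, those strings are candidate synonyms. Caveat: this assumes
    all mentions of arg1 denote the same individual."""
    by_key = defaultdict(set)
    for a1, r, a2 in triples:
        if r in FUNCTIONAL:
            by_key[(a1, r)].add(a2)
    return [(key, vals) for key, vals in by_key.items() if len(vals) > 1]

triples = [("Oren", "married-to", "X"), ("Oren", "married-to", "Y")]
print(synonym_candidates(triples))
```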


33

Returning to Our Caveat…

BornIn(Oren, New York)
BornIn(Oren, Maryland)

But these are two different Orens. How can you tell? Web page context.

Lesson 5: context enhances entity resolution

To date, our extraction process has been myopic.


35

III. Compositional Inference is Key

Text:
  The Tour de France is the world's largest cycle race.
  The Tour de France takes place in France.
  The illuminated helmet is used during sporting events such as cycle racing.

Information extraction:
  Is-A(Tour de France, cycle race)
  In(Tour de France, France)
  Is-A(cycle race, sporting event)

Query: sporting event in France? => Tour de France
36

Scalable Textual Inference

(Schoenmackers, Etzioni, Weld EMNLP ‘08)

Desiderata for inference:
  • In text => probabilistic inference
  • On the Web => scalable, linear in |Corpus|

Novel property of textual relations:
  • Prevalent
  • Provably linear
  • Empirically linear!

37

Inference Scalability for Holmes

38

Related Work

Weld's Intelligence in Wikipedia project; Sekine's "pre-emptive IE"; Powerset; Textual Entailment; the AAAI '07 Symposium on "Machine Reading"; a growing body of work on IE from the Web.


39

Lessons Reprised

  • Scale & diversity of the Web => Open IE
  • Redundancy helps accuracy!
  • Focus on 'tractable' sentences
  • Open IE(Web) = disjointed "nuggets"
  • Scalable inference is key
  • Leverage relations implicit in the text
  • Sentence-based IE => page-based IE

40

Directions for Future Work

Characterize 'tractable' sentences
Page-based IE
Better entity resolution
Integrate Open IE with ontologies
Investigate other domains for Open IE (e.g., news)


41

Implications for Web Search

Imagine search systems that operate over a (more) semantic space:

  • Keywords, documents => extractions
  • TF-IDF, PageRank => entities, relations
  • Web pages, links => Q/A, information fusion

Reading the Web => a new search paradigm

42

Machine Reading of the Web

Q: How is the ThinkPad T-40?
A: Found 2,300 reviews; 85% positive, key features are…

[Diagram: the World Wide Web feeding an "information food chain"]

Open IE over 10^10 Web pages.


43 44

Machine Reading at Web Scale?

Open Information Extraction + linear-time inference, facing:

  Difficult, ungrammatical sentences  => tractable sentences
  Unanticipated objects & relations   => Open Information Extraction
  Massive scale                       => linear-time inference
  Unreliable information              => redundancy (PageRank)


45

Holmes - KBMC

KBs:         0.9 Prevents(Ca, osteo.); 0.8 Contains(kale, Ca); …
Inf. rules:  weighted Horn clauses, e.g. 1.5 P(X,Z) :- C(X,Y) ^ P(Y,Z)
Query:       Q(X) :- prevent(X, osteo.)

Pipeline: find proof trees (chain backwards from the query) => construct a Markov network (convert each proof tree into a Markov network) => approximate probabilistic inference => answers
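A naive backward chainer over the slide's facts and rule gives the flavor of the "find proof trees" step. Holmes itself converts proof trees into a Markov network for approximate probabilistic inference rather than multiplying confidences, and the rule weight 1.5 is not used in this simplification.

```python
# Naive backward chaining over a tiny KB with confidences -- a stand-in
# for Holmes's KBMC pipeline. Facts and the Horn rule come from the slide;
# combining scores by product is a simplification of real inference.
FACTS = {("Prevents", "Ca", "osteo."): 0.9,
         ("Contains", "kale", "Ca"): 0.8}

# Rule: Prevents(X, Z) :- Contains(X, Y) ^ Prevents(Y, Z)
def prove_prevents(x, z):
    best = FACTS.get(("Prevents", x, z), 0.0)
    for (pred, a, y), w1 in FACTS.items():
        if pred == "Contains" and a == x:
            best = max(best, w1 * prove_prevents(y, z))
    return best

def query(z):
    """Q(X) :- prevent(X, z): return scored answers, best first."""
    entities = {a for (_, a, _) in FACTS} | {b for (_, _, b) in FACTS}
    answers = [(e, prove_prevents(e, z)) for e in entities]
    return sorted([a for a in answers if a[1] > 0], key=lambda t: -t[1])

print(query("osteo."))  # Ca directly; kale via Contains(kale, Ca)
```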

46

Test

Assess a candidate extraction E using Pointwise Mutual Information (PMI) between E and a "discriminator phrase" D.

PMI-IR (Turney '01): use hit counts for efficient PMI computation over the Web.

PMI(Yakima, City) = |Hits(Yakima + City)| / |Hits(Yakima)|


47

PMI In Action

1. E = "Yakima" (1,340,000 hits)
2. D = <class name>
3. E + D = "Yakima city" (2,760 hits)
4. PMI = 2,760 / 1.34M = 0.002

  • E = "Avocado" (1,000,000 hits)
  • E + D = "Avocado city" (10 hits)
  • PMI = 0.00001 << 0.002
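The computation can be reproduced directly; the hits() lookup below is a stand-in for a real search-engine hit-count API, seeded with the counts from the slide.

```python
# PMI-IR sketch using the hit counts from the slide. HITS stands in for a
# real search-engine hit-count API, which this sketch does not call.
HITS = {"Yakima": 1_340_000, "Yakima city": 2_760,
        "Avocado": 1_000_000, "Avocado city": 10}

def hits(phrase):
    return HITS[phrase]

def pmi(extraction, discriminator):
    """PMI(E, D) = Hits(E + D) / Hits(E), as on the slide."""
    return hits(f"{extraction} {discriminator}") / hits(extraction)

print(pmi("Yakima", "city"))   # about 0.002: Yakima passes the City test
print(pmi("Avocado", "city"))  # about 0.00001: Avocado fails
```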

48


49 50

Can IE achieve "adequate" precision/recall on the Web corpus?


51

Recall-Precision Curve

[Figure: recall-precision curves for the City, Country, and USState classes]

52

High Precision for Binary Predicates

[Figure: precision vs. recall for the binary predicates CapitalOf and CeoOf]


53

Methods for Improving Recall

RL: learn class-specific patterns, e.g. "Headquartered in <city>"

SE: recursively extract subclasses: "Scientists such as physicists, chemists, and biologists…"

LE: extract lists of items & vote

Bootstrapped from generic KnowItAll: no hand-labeled examples!

Results for City: found 10,300 cities missing from the Tipster Gazetteer.


55

Sources of ambiguity

Time: "Bush is the president" … in 2006!
Context: "common misconceptions…"
Opinion: Who killed JFK?
Multiple word senses: Amazon, Chicago, Chevy Chase, etc.

56

Critique of KnowItAll 1.0

User query ("African cities") launches search-engine queries => slow, limited

Relations specified by user; N relations => N passes over the data

Can we make a single pass over the Web corpus?


57

Extractor Overview (Banko & Etzioni, ’08)

1. Use a simple model of relationships in English to label extractions
2. Bootstrap a general model of relationships in English sentences, encoded as a CRF
3. Decompose each sentence into one or more (NP1, VP, NP2) "chunks"
4. Use the CRF model to retain the relevant parts of each NP and VP

The extractor is relation-independent!
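Step 1's self-labeling can be illustrated with a few heuristic conditions. These particular tests (span length, clause punctuation, pronoun arguments) are plausible examples, not TextRunner's exact heuristics.

```python
PRONOUNS = {"he", "she", "it", "they", "we", "you", "i"}

def self_label(np1, span, np2):
    """Heuristically label a candidate (NP1, span, NP2) tuple as a positive
    or negative training example. Conditions are illustrative only;
    TextRunner's actual heuristics differ in detail."""
    tokens = span.split()
    if not tokens or len(tokens) > 5:       # relation phrase too long/empty
        return False
    if any(ch in ".!?;" for ch in span):    # span crosses a clause boundary
        return False
    if any(w.lower() in PRONOUNS for w in np1.split() + np2.split()):
        return False                        # unresolved anaphora in an arg
    return True

print(self_label("Gates", "founded", "Microsoft"))
print(self_label("He", "founded", "Microsoft"))
print(self_label("Paris", "which has been proclaimed by many figures to be", "France"))
```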

58


59 60

Fundamental AI Questions

How to accumulate massive amounts of knowledge?

Can a program read and "understand" what it's read?


61

Massive Information Extraction from the Web

Q: How is the ThinkPad T-40?
A: Found 2,300 reviews; 85% positive.

[Diagram: the World Wide Web feeding an "information food chain"]

Stats over 10^10 web pages.

KnowItAll