Zero-shot Entity Extraction from Web Pages ACL June 23, 2014 - - PowerPoint PPT Presentation

SLIDE 1

Zero-shot Entity Extraction from Web Pages

ACL June 23, 2014

Panupong Pasupat and Percy Liang

SLIDE 3

Focus: Entity Extraction

What are the longest hiking trails near Baltimore?

Data Source (hiking trails near Baltimore):

  • Avalon Super Loop
  • Patapsco Valley State Park
  • Gunpowder Falls State Park
  • Union Mills Hike
  • Greenbury Point
  • ...

Applications: question answering / semantic parsing / taxonomy construction / ontology expansion / knowledge base population / ...

SLIDE 4

Semi-Structured Data on the Web

SLIDE 7

Challenge: Long Tail of Categories

person, location, organization

airport, battleship, acid, pitcher, settlement, headgear, metaphor, haircut, poker hand, biome, enzyme, superstition

tutorials at ACL 2014, dishes at Pu Pu Hot Pot, Stanford computer science professors

We want to generalize to unseen categories.

SLIDE 9

Relevant Approaches

Bootstrapping from Seed Examples:

seeds (Avalon Super Loop, Hilton Area) + web pages → System → answers (Avalon Super Loop, Hilton Area, Wildlands Loop, ...)

Use seed examples to specify the entity category

... but we might not have seeds (e.g. in question answering)

[Wang and Cohen, 2009; Google Sets; Sarmento et al. 2007; ...]

SLIDE 10

Our Work

query (hiking trails near Baltimore) + web page → System → answers (Avalon Super Loop, Hilton Area, Wildlands Loop, ...)

Use a natural language query to specify the entity category

SLIDE 11

Outline

  • 1. Setup
      • Problem Setup
      • Dataset
  • 2. Approach
  • 3. Results

SLIDE 15

Problem Setup

Input:

  • query x

hiking trails near Baltimore

  • web page w

Output:

  • list of entities y

[Avalon Super Loop, Patapsco Valley State Park, ...]

SLIDE 16

Dataset

We created the OpenWeb dataset with diverse queries and web pages.

  • airlines of italy
  • natural causes of global warming
  • lsu football coaches
  • bf3 submachine guns
  • badminton tournaments
  • foods high in dha
  • technical colleges in south carolina
  • songs on glee season 5
  • singers who use auto tune
  • san francisco radio stations

SLIDE 19

Query Generation

Breadth-first search on Google Suggest

list of → Google Suggest → list of Indian movies, ...

→ Template Extraction → list of movies, list of movies, list of Indian, ...

[Berant et al., 2013]
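
The breadth-first search above can be sketched roughly as follows. This is only an illustration under stated assumptions: `suggest` and `FAKE_SUGGESTIONS` are hypothetical stand-ins for Google Suggest (no real service is called), and the template step is simplified to extending the query prefix by one word.

```python
from collections import deque

# Hypothetical stand-in for Google Suggest: partial query -> completions.
FAKE_SUGGESTIONS = {
    "list of": ["list of indian movies", "list of us presidents"],
    "list of indian": ["list of indian movies", "list of indian states"],
    "list of us": ["list of us presidents"],
}

def suggest(partial_query):
    """Toy replacement for the suggestion service (an assumption, not a real API)."""
    return FAKE_SUGGESTIONS.get(partial_query, [])

def expand(query, prefix_len):
    """Simplified template step: extend the prefix by one word of the query."""
    words = query.split()
    if len(words) > prefix_len + 1:
        return " ".join(words[:prefix_len + 1])
    return None

def bfs_queries(seed, max_queries=100):
    """Breadth-first search over partial queries, collecting complete queries."""
    queue = deque([seed])
    seen_partials = {seed}
    found = []
    while queue and len(found) < max_queries:
        partial = queue.popleft()
        for full in suggest(partial):
            if full not in found:
                found.append(full)
            nxt = expand(full, len(partial.split()))
            if nxt is not None and nxt not in seen_partials:
                seen_partials.add(nxt)
                queue.append(nxt)
    return found

queries = bfs_queries("list of")
```

The queue guarantees that shorter partial queries are explored before their longer extensions, which is what makes the search breadth-first.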

SLIDE 22

Dataset Annotation

Annotate the first, second, and last entities matching the query using Amazon Mechanical Turk.

Query: airlines of italy

Annotation:

  • First: Air Dolomiti
  • Second: Air Europe
  • Last: Wind Jet

SLIDE 23

Dataset Statistics

  • 2773 examples
  • 2269 unique queries
  • 894 unique headwords ← long tail!
  • 1483 unique web domains ← long tail! (≠ wrapper induction)

SLIDE 24

Outline

  • 1. Setup
  • 2. Approach
      • Extraction Predicate
      • Framework
      • Modeling
      • Features
  • 3. Results

SLIDE 25

Extraction Predicate

How can we choose what to extract from a web page w?

[Figure: DOM tree of the web page (html, head, body, table, tr, th, td, h1 nodes)]

number of possible entity lists ≈ 2^(number of nodes)

SLIDE 27

Extraction Predicate

Idea: Entities usually share the same tag and tree level

[Figure: DOM tree of the web page (html, head, body, table, tr, th, td, h1 nodes)]

z = /html[1]/body[1]/table[2]/tr/td[1]

Captures structures such as table columns, list entries, headers of the same level, ...

Each web page has ≈ 8500 extraction predicates z

[Sahuguet and Azavant, 1999; Liu et al., 2000; Crescenzi et al., 2001]
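
To make the extraction predicate concrete, here is a minimal sketch of executing z on a page. It assumes a tiny well-formed toy page (`PAGE`, with made-up cell values) and uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath; real web pages would first need an HTML parser. The helper name `execute` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Toy, well-formed page; the contents are made up for illustration.
PAGE = """
<html>
  <head><title>Trails</title></head>
  <body>
    <table><tr><td>Home</td><td>About</td></tr></table>
    <h1>Hiking trails near Baltimore</h1>
    <table>
      <tr><th>Trail</th><th>Notes</th></tr>
      <tr><td>Avalon Super Loop</td><td>a</td></tr>
      <tr><td>Patapsco Valley State Park</td><td>b</td></tr>
    </table>
  </body>
</html>
"""

def execute(predicate, root):
    """Return the text of every node selected by the extraction predicate."""
    # ElementTree paths are relative to the root element, so strip the
    # leading "/html[1]" step and evaluate the remainder from the root.
    prefix = "/html[1]"
    if not predicate.startswith(prefix):
        raise ValueError("this sketch only handles predicates rooted at /html[1]")
    relative = "." + predicate[len(prefix):]
    return [(node.text or "").strip() for node in root.findall(relative)]

root = ET.fromstring(PAGE.strip())
z = "/html[1]/body[1]/table[2]/tr/td[1]"
y = execute(z, root)
```

Because the `tr` step carries no index, the predicate selects the first `td` of every row in the second table, which is how a single path can pick out a whole table column.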

SLIDE 32

Framework

  • x (query): hiking trails near Baltimore
  • w (web page): html head ... body ...
  • Generation: produces the candidate set Z (|Z| ≈ 8500)
  • Model: selects z = /html[1]/body[1]/table[2]/tr/td[1]
  • Execution: applies z to w to get y = [Avalon Super Loop, Patapsco Valley State Park, ...]

A graphical model with latent extraction predicate z
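
The slides do not spell out how the candidate set Z is generated. As one plausible sketch, assume candidates are formed by taking each node's indexed path and keeping or dropping the index at every step; the functions `indexed_paths` and `generate` are hypothetical names for illustration only.

```python
import itertools
import xml.etree.ElementTree as ET

def indexed_paths(root):
    """Yield each node's absolute path as a list of (tag, index) steps."""
    def walk(node, steps):
        yield steps
        counts = {}
        for child in node:
            # Index counts siblings with the same tag, as in XPath.
            counts[child.tag] = counts.get(child.tag, 0) + 1
            yield from walk(child, steps + [(child.tag, counts[child.tag])])
    yield from walk(root, [(root.tag, 1)])

def generate(root):
    """Candidate set Z: every way of keeping or dropping the index of each step."""
    candidates = set()
    for steps in indexed_paths(root):
        for keep in itertools.product([True, False], repeat=len(steps)):
            candidates.add("".join(
                "/%s[%d]" % (tag, i) if k else "/" + tag
                for (tag, i), k in zip(steps, keep)))
    return candidates

toy = ET.fromstring(
    "<html><body><table>"
    "<tr><td>a</td></tr><tr><td>b</td></tr>"
    "</table></body></html>")
Z = generate(toy)
```

This enumeration is exponential in tree depth, so it is only suitable for toy pages; it is meant to show why a page can yield thousands of candidates, not how the authors bound the set.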

SLIDE 34

Modeling

Let x be a query and w be a web page. Define a log-linear distribution over the extraction predicates z ∈ Z:

pθ(z | x, w) ∝ exp{θ⊤φ(x, w, z)}

  • θ is a parameter vector
  • φ(x, w, z) is a feature vector
  • Find θ that maximizes the log-likelihood of the training data using AdaGrad [Duchi et al., 2010]
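
A minimal sketch of this model in plain Python follows. It assumes that the annotation identifies a set of compatible predicates and that training maximizes the log of their total probability; that marginal objective, the feature names, and the learning-rate settings are assumptions for illustration, not details taken from the slides.

```python
import math

def predicate_probs(theta, feats):
    """p_theta(z | x, w) for each candidate z, given feature dicts phi(x, w, z)."""
    scores = [sum(theta.get(k, 0.0) * v for k, v in phi.items()) for phi in feats]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def expected_features(probs, feats):
    """Expectation of the feature vector under the given distribution."""
    out = {}
    for p, phi in zip(probs, feats):
        for k, v in phi.items():
            out[k] = out.get(k, 0.0) + p * v
    return out

def adagrad_step(theta, sq_sums, feats, compatible, lr=0.1, eps=1e-8):
    """One AdaGrad ascent step on log sum_{z in compatible} p_theta(z | x, w)."""
    probs = predicate_probs(theta, feats)
    mass = sum(probs[i] for i in compatible)
    post = [probs[i] / mass if i in compatible else 0.0 for i in range(len(feats))]
    e_post = expected_features(post, feats)
    e_model = expected_features(probs, feats)
    for k in set(e_post) | set(e_model):
        g = e_post.get(k, 0.0) - e_model.get(k, 0.0)  # gradient component
        sq_sums[k] = sq_sums.get(k, 0.0) + g * g
        theta[k] = theta.get(k, 0.0) + lr * g / (math.sqrt(sq_sums[k]) + eps)
    return math.log(mass)  # log-likelihood before the update

# Hypothetical features for three candidate predicates; only z0 is compatible.
feats = [
    {"is_table_column": 1.0},
    {"is_nav_menu": 1.0},
    {"is_table_column": 1.0, "all_identical": 1.0},
]
theta, sq_sums = {}, {}
log_likelihoods = [adagrad_step(theta, sq_sums, feats, {0}) for _ in range(50)]
final_probs = predicate_probs(theta, feats)
```

The gradient is the difference between the expected features under the compatible predicates and under the full model, which is the standard form for a log-linear model with a latent variable.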

SLIDE 36

Features

pθ(z | x, w) ∝ exp{θ⊤φ(x, w, z)}

Structural Features: context

SLIDE 37

Features

pθ(z | x, w) ∝ exp{θ⊤φ(x, w, z)}

Denotation Features: content

hiking trails near Baltimore Avalon Super Loop Patapsco Valley State Park Gunpowder Falls State Park Rachel Carson Conservation Park Union Mills Hike ...

>

hiking trails near Baltimore Home About Baltimore Tour Pricing Contact Online Support ...

SLIDE 39

Defining Features on Lists

  • good (identity: diverse): George Washington, John Adams, Thomas Jefferson, James Madison, ... (39 more) ..., Barack Obama
  • bad (identity: identical): John Adams, John Adams, John Adams, John Adams, John Adams, John Adams, ... (100 more) ..., John Adams
  • bad (identity: diverse): Blog, Photos and Video, Briefing Room, In the White House, Mobile Apps, Contact Us

SLIDE 40

Defining Features on Lists

  • good (identity: diverse; POS: identical): NNP NNP, NNP NNP, NNP NNP, NNP NNP, ... (39 more) ..., NNP NNP
  • bad (identity: identical; POS: identical): NNP NNP, NNP NNP, NNP NNP, NNP NNP, NNP NNP, NNP NNP, ... (100 more) ..., NNP NNP
  • bad (identity: diverse; POS: diverse): NN, NNS CC NNP, NN NN, IN DT NNP NNP, NNP NNPS, NN PRP

SLIDE 44

Defining Features on Lists

Avalon Super Loop, Patapsco Valley State Park, Gunpowder Falls State Park, Union Mills Hike, Greenbury Point

  • 1. Abstraction: Map list elements into abstract tokens (here, word counts: 3, 4, 4, 3, 2)
  • 2. Aggregation: Define features using the histogram of the abstract tokens (token 2: once, token 3: twice, token 4: twice): Entropy, Majority, MajorityRatio, Single, Mean, Variance

Use this method for both structural and denotation features
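
The two steps can be sketched as below, using word count as the abstraction. The feature names come from the slide, but their exact definitions here (for example, unnormalized natural-log entropy and population variance) are assumptions.

```python
import math
from collections import Counter

def word_count(entity):
    """Abstraction: map a list element to an abstract token (its word count)."""
    return len(entity.split())

def aggregate(tokens):
    """Aggregation: features computed from the histogram of abstract tokens."""
    hist = Counter(tokens)
    n = len(tokens)
    majority, majority_count = hist.most_common(1)[0]
    feats = {
        "Entropy": -sum((c / n) * math.log(c / n) for c in hist.values()),
        "Majority": majority,
        "MajorityRatio": majority_count / n,
        "Single": len(hist) == 1,
    }
    # Mean and Variance only make sense for numeric tokens.
    if all(isinstance(t, (int, float)) for t in tokens):
        mean = sum(tokens) / n
        feats["Mean"] = mean
        feats["Variance"] = sum((t - mean) ** 2 for t in tokens) / n
    return feats

entities = [
    "Avalon Super Loop",
    "Patapsco Valley State Park",
    "Gunpowder Falls State Park",
    "Union Mills Hike",
    "Greenbury Point",
]
tokens = [word_count(e) for e in entities]
features = aggregate(tokens)
```

Because the features depend only on the histogram, lists of any length map to a feature vector of the same shape.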

SLIDE 45

Outline

  • 1. Setup
  • 2. Approach
  • 3. Results
      • Main Results
      • Error Analysis
      • Feature Analysis

SLIDE 47

Main Results

Accuracy (%):

  • Baseline (most frequent extraction predicates): 10.3
  • Accuracy: 40.5
  • Accuracy @ 5: 55.8

SLIDE 48

Error Analysis

  • Correct: 40.5%
  • Coverage Errors: 33.4%
  • Ranking Errors: 26.1%

SLIDE 49

Examples of Correct Predictions

Query: disney channel movies

/html[1]/body/div[2]/div/div/div[3]/div[1]/div/div/div/div/b

SLIDE 50

Examples of Correct Predictions

Query: universities in canada

/html[1]/body/div/div/div/div/div/div/div/a/text

SLIDE 51

Examples of Correct Predictions

Query: nobel prize winners

/html[1]/body/div/div[2]/div/div/div/h6/a/text

SLIDE 53

Error Analysis

  • Correct: 40.5%
  • Coverage Errors: 33.4%
  • Ranking Errors: 26.1%

Coverage Errors: No extraction predicate z produces an entity list y matching the annotation.

SLIDE 54

Examples of Coverage Errors

Query: companies named after a person

/html/body/div[3]/div[3]/div[4]/ul/li/a

Need richer extraction predicates!

SLIDE 55

Examples of Coverage Errors

Query: hedge funds in new york

/html/body/div[3]/div[3]/div[4]/.../table/tbody/tr/td[2]/a

Need compositionality!

SLIDE 57

Error Analysis

  • Correct: 40.5%
  • Coverage Errors: 33.4%
  • Ranking Errors: 26.1%

Coverage Errors: No extraction predicate z produces an entity list y matching the annotation.

Ranking Errors: The system finds a list y matching the annotation, but it does not have the highest model score.
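
The three outcomes can be told apart with a short check like the one below. It assumes a list matches the annotation when its first, second, and last elements are exactly equal to the annotated strings; the actual evaluation may normalize text differently. The names `matches` and `classify` are hypothetical, and `"X"` is a placeholder entity.

```python
def matches(entities, annotation):
    """True if the list agrees with the annotated first, second, and last entities."""
    first, second, last = annotation
    return (len(entities) >= 2
            and entities[0] == first
            and entities[1] == second
            and entities[-1] == last)

def classify(ranked_lists, annotation):
    """ranked_lists: candidate entity lists y, sorted by model score, best first."""
    if ranked_lists and matches(ranked_lists[0], annotation):
        return "correct"
    if any(matches(y, annotation) for y in ranked_lists):
        return "ranking error"
    return "coverage error"

annotation = ("Air Dolomiti", "Air Europe", "Wind Jet")
good = ["Air Dolomiti", "Air Europe", "X", "Wind Jet"]
menu = ["Home", "About", "Contact"]

outcome_correct = classify([good, menu], annotation)
outcome_ranking = classify([menu, good], annotation)
outcome_coverage = classify([menu], annotation)
```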

SLIDE 58

Examples of Ranking Errors

Query: doctors at emory

/html/body/div[3]/div[4]/table/tbody/tr/td[2]

SLIDE 60

Augmenting Denotation Features

Observation: Entities of different categories have different linguistic properties.

  • mayors of Chicago: Rahm Emanuel, Richard M. Daley, Eugene Sawyer, ...
  • universities in Chicago: Aurora University, DePaul University, Illinois Institute of Technology, ...

Experiment: Augment denotation features with the query category.

POS majority = NNP NNP → (POS majority = NNP NNP, query category = people)
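
As a small illustration of the augmentation, each denotation feature can be conjoined with the query category. How the category is derived from the query is not shown on the slide, so here it is simply passed in; `augment` is a hypothetical helper.

```python
def augment(denotation_feats, query_category):
    """Conjoin each denotation feature with the query category."""
    return {(name, "query category = " + query_category): value
            for name, value in denotation_feats.items()}

feats = {"POS majority = NNP NNP": 1.0}
augmented = augment(feats, "people")
```

With conjoined features the model can learn, for example, that a two-proper-noun majority is good evidence for a list of people but weaker evidence for a list of universities.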

SLIDE 61

Augmenting Denotation Features

Accuracy (dev):

  • Denotation: 19.8
  • Augmented Denotation: 25

SLIDE 63

Augmenting Denotation Features

Accuracy (dev):

  • Structural + Denotation (default): 41.1
  • Structural + Augmented Denotation: 41.7

Hypothesis: Structural features have high influence when the web page comes from a Web search result.

SLIDE 66

Augmenting Denotation Features

Hypothesis: Structural features have high influence when the web page comes from a Web search result.

hiking trails near Baltimore

Verify the hypothesis: Concatenate a random web page

  • Creates noise: entity lists with high structural feature scores might not be the correct list

SLIDE 67

Augmenting Denotation Features

hiking trails near Baltimore

Accuracy (stitched):

  • Structural + Denotation (default): 19.3
  • Structural + Augmented Denotation: 29.2

SLIDE 68

Summary

query (hiking trails near Baltimore) + web page → System → answers (Avalon Super Loop, Hilton Area, Wildlands Loop, ...)

A framework for extracting entities from a natural language query and a single web page

SLIDE 71

Summary

  • Focus on the long tail of entity categories (e.g., tutorials at ACL)
  • Consider both structural and denotation features
  • Handle lists of different sizes with abstraction and aggregation (list → abstract tokens 3, 4, 4, 3, 2 → histogram)

SLIDE 72

Future Work

  • Model relationship between entities and category strings
  • Compositionality in natural language

SLIDE 73

Download code and dataset: http://nlp.stanford.edu/software/web-entity-extractor-ACL2014

Thank you!
