Is Fidel Castro Really an American President? On Set Expansion



slide-1
SLIDE 1

Is Fidel Castro Really an American President?

On Set Expansion

slide-2
SLIDE 2

An Example

slide-3
SLIDE 3

Google Sets

slide-4
SLIDE 4

Google Sets

slide-5
SLIDE 5

Google Sets

slide-6
SLIDE 6

Google Sets

slide-7
SLIDE 7

Notion

slide-8
SLIDE 8

Notion

  • Seeds: Barack Obama, Bill Clinton, George Bush

slide-9
SLIDE 9

Notion

  • Seeds: Barack Obama, Bill Clinton, George Bush

  • Target Set: US-American presidents
slide-10
SLIDE 10

Notion

  • Seeds: Barack Obama, Bill Clinton, George Bush
  • Target Set: US-American presidents
  • Answer to query: more probable elements of the target set

slide-11
SLIDE 11

Notion

  • Seeds: Barack Obama, Bill Clinton, George Bush
  • Target Set: US-American presidents
  • Answer to query: more probable elements of the target set

  • John F. Kennedy
slide-12
SLIDE 12

Notion

  • Seeds: Barack Obama, Bill Clinton, George Bush
  • Target Set: US-American presidents
  • Answer to query: more probable elements of the target set

  • John F. Kennedy
  • George Washington, etc.
slide-13
SLIDE 13

SEAL

slide-14
SLIDE 14

The SEAL System

slide-15
SLIDE 15

The SEAL System

Seeds

slide-16
SLIDE 16

The SEAL System

Seeds Web pages

Fetcher

Google

slide-17
SLIDE 17

The SEAL System

Seeds Web pages

Fetcher

Google

Extractor

Mentions Wrapper

slide-18
SLIDE 18

The SEAL System

Seeds Web pages

Fetcher

Google

Extractor

Mentions Wrapper

Ranker

Suggestions Graph

slide-19
SLIDE 19

The SEAL System

Seeds Web pages

Fetcher

Google

Extractor

Mentions Wrapper

Ranker

Suggestions Graph

slide-20
SLIDE 20

Fetcher

slide-21
SLIDE 21

Fetcher

slide-22
SLIDE 22
  • Simple Google search

Fetcher

slide-23
SLIDE 23
  • Simple Google search
  • Query: Concatenation of all seeds

Fetcher

slide-24
SLIDE 24
  • Simple Google search
  • Query: Concatenation of all seeds
  • Crawl top results

Fetcher
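The query construction described on the Fetcher slides can be sketched as follows. This is only an illustration: the helper name `build_query` and the choice to quote each seed as a phrase are assumptions, since the slides say no more than that the query is the concatenation of all seeds.

```python
def build_query(seeds):
    """Concatenate all seeds into one search query string."""
    # Assumption: each seed is quoted so multi-word names stay together.
    return " ".join(f'"{seed}"' for seed in seeds)


query = build_query(["Barack Obama", "Bill Clinton", "George Bush"])
print(query)
```

The resulting string would then be sent to the search engine and the top result pages crawled; that part is left out because it depends on the search API in use.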

slide-25
SLIDE 25

The SEAL System

Seeds Web pages

Fetcher

Google

Extractor

Mentions Wrapper

Ranker

Suggestions Graph

slide-26
SLIDE 26

Extractor

slide-27
SLIDE 27

Extractor

  • Idea:
slide-28
SLIDE 28

Extractor

  • Idea:
  • Find common contexts of all seeds
slide-29
SLIDE 29

Extractor

  • Idea:
  • Find common contexts of all seeds
  • Wrapper
slide-30
SLIDE 30

Extractor

  • Idea:
  • Find common contexts of all seeds
  • Wrapper
  • Derive new entities
slide-31
SLIDE 31

Extractor

  • Idea:
  • Find common contexts of all seeds
  • Wrapper
  • Derive new entities

<ul> <li>Obama</li> <li>Bush</li> <li>Kennedy</li> </ul>

slide-32
SLIDE 32

prtoBamAxxaEsdSlkprtKenNed yxSAprtCLinTOnxkhSlpUfgAob AMagzPHMAcolLgAcLIntOnpfb lRoiusWgoprtcAstrOxkLiTfgAClI nTongzUmXuSYfgAkEnneDygzil Document

slide-33
SLIDE 33

prtoBamAxxaEsdSlkprtKenNed yxSAprtCLinTOnxkhSlpUfgAob AMagzPHMAcolLgAcLIntOnpfb lRoiusWgoprtcAstrOxkLiTfgAClI nTongzUmXuSYfgAkEnneDygzil Document

Obama 0 Clinton 1 Obama 2 Clinton 3 Clinton 4

Seeds

slide-34
SLIDE 34

prtoBamAxxaEsdSlkprtKenNed yxSAprtCLinTOnxkhSlpUfgAob AMagzPHMAcolLgAcLIntOnpfb lRoiusWgoprtcAstrOxkLiTfgAClI nTongzUmXuSYfgAkEnneDygzil Document

Obama 0 Clinton 1 Obama 2 Clinton 3 Clinton 4

Seeds Right Contexts

slide-35
SLIDE 35

prtoBamAxxaEsdSlkprtKenNed yxSAprtCLinTOnxkhSlpUfgAob AMagzPHMAcolLgAcLIntOnpfb lRoiusWgoprtcAstrOxkLiTfgAClI nTongzUmXuSYfgAkEnneDygzil Document

Obama 0 Clinton 1 Obama 2 Clinton 3 Clinton 4

Seeds

  • xxaEsd...

Right Contexts

slide-36
SLIDE 36

prtoBamAxxaEsdSlkprtKenNed yxSAprtCLinTOnxkhSlpUfgAob AMagzPHMAcolLgAcLIntOnpfb lRoiusWgoprtcAstrOxkLiTfgAClI nTongzUmXuSYfgAkEnneDygzil Document

Obama 0 Clinton 1 Obama 2 Clinton 3 Clinton 4

Seeds

  • xxaEsd...
  • xkhSlp...

Right Contexts

slide-37
SLIDE 37

prtoBamAxxaEsdSlkprtKenNed yxSAprtCLinTOnxkhSlpUfgAob AMagzPHMAcolLgAcLIntOnpfb lRoiusWgoprtcAstrOxkLiTfgAClI nTongzUmXuSYfgAkEnneDygzil Document

Obama 0 Clinton 1 Obama 2 Clinton 3 Clinton 4

Seeds

  • xxaEsd...
  • xkhSlp...
  • gzPHMA...

Right Contexts

slide-38
SLIDE 38

prtoBamAxxaEsdSlkprtKenNed yxSAprtCLinTOnxkhSlpUfgAob AMagzPHMAcolLgAcLIntOnpfb lRoiusWgoprtcAstrOxkLiTfgAClI nTongzUmXuSYfgAkEnneDygzil Document

Obama 0 Clinton 1 Obama 2 Clinton 3 Clinton 4

Seeds

  • xxaEsd...
  • xkhSlp...
  • gzPHMA...
  • pfblRo...

Right Contexts

slide-39
SLIDE 39

prtoBamAxxaEsdSlkprtKenNed yxSAprtCLinTOnxkhSlpUfgAob AMagzPHMAcolLgAcLIntOnpfb lRoiusWgoprtcAstrOxkLiTfgAClI nTongzUmXuSYfgAkEnneDygzil Document

Obama 0 Clinton 1 Obama 2 Clinton 3 Clinton 4

Seeds

  • xxaEsd...
  • xkhSlp...
  • gzPHMA...
  • pfblRo...
  • gzUmXu...

Right Contexts

slide-40
SLIDE 40

prtoBamAxxaEsdSlkprtKenNed yxSAprtCLinTOnxkhSlpUfgAob AMagzPHMAcolLgAcLIntOnpfb lRoiusWgoprtcAstrOxkLiTfgAClI nTongzUmXuSYfgAkEnneDygzil

Obama 0 Clinton 1 Obama 2 Clinton 3 Clinton 4

  • xxaEsd...
  • xkhSlp...
  • gzPHMA...
  • pfblRo...
  • gzUmXu...

Document Seeds Right Contexts
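The right contexts listed on this slide can be reproduced with a few lines of code. This is a sketch, assuming that seed mentions are matched case-insensitively (the toy document mixes upper and lower case) and that the spaces in the transcribed document are line-break artifacts; the function names are illustrative.

```python
# The toy document from the slides, with line-break spaces removed.
DOCUMENT = (
    "prtoBamAxxaEsdSlkprtKenNed"
    "yxSAprtCLinTOnxkhSlpUfgAob"
    "AMagzPHMAcolLgAcLIntOnpfb"
    "lRoiusWgoprtcAstrOxkLiTfgAClI"
    "nTongzUmXuSYfgAkEnneDygzil"
)


def seed_occurrences(document, seeds):
    """Return (position, seed) for every seed mention, in document order."""
    lowered = document.lower()
    found = []
    for seed in seeds:
        start = lowered.find(seed.lower())
        while start != -1:
            found.append((start, seed))
            start = lowered.find(seed.lower(), start + 1)
    return sorted(found)


def right_contexts(document, seeds, width=6):
    """Return the first `width` characters following each seed mention."""
    return [
        document[pos + len(seed): pos + len(seed) + width]
        for pos, seed in seed_occurrences(document, seeds)
    ]


contexts = right_contexts(DOCUMENT, ["obama", "clinton"])
print(contexts)  # ['xxaEsd', 'xkhSlp', 'gzPHMA', 'pfblRo', 'gzUmXu']
```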

slide-41
SLIDE 41
  • Practical Algorithm To Retrieve Information Coded In Alphanumeric

Patricia Trie

slide-42
SLIDE 42

prtoBamAxxaEsdSlkprtKenNed yxSAprtCLinTOnxkhSlpUfgAob AMagzPHMAcolLgAcLIntOnpfb lRoiusWgoprtcAstrOxkLiTfgAClI nTongzUmXuSYfgAkEnneDygzil

Obama 0 Clinton 1 Obama 2 Clinton 3 Clinton 4

“” {0,1,2,3,4} “pfblRo...” {3} “x” {0,1} “gz” {2,4}

  • xxaEsd...
  • xkhSlp...
  • gzPHMA...
  • pfblRo...
  • gzUmXu...

“xaEsd...” {0} “khSlp...” {1} “PHMA...” {2} “UmXu...” {4}

Document Seeds Right Contexts Trie Tright
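The grouping that the trie Tright encodes can be imitated without a real Patricia trie. The sketch below is a simplification, not the data structure from the slides: it enumerates every prefix, records which seed occurrences share it, and keeps only the longest prefix per occurrence set, which mimics path compression. Node labels here are full paths from the root, whereas the slide labels each node with its edge string.

```python
from collections import defaultdict

# Right contexts of the five seed occurrences, truncated as on the slide.
RIGHT_CONTEXTS = ["xxaEsd", "xkhSlp", "gzPHMA", "pfblRo", "gzUmXu"]


def prefix_groups(contexts):
    """Map every prefix to the set of context ids that start with it."""
    groups = defaultdict(set)
    for idx, ctx in enumerate(contexts):
        for end in range(len(ctx) + 1):
            groups[ctx[:end]].add(idx)
    return groups


def compressed_nodes(contexts):
    """Keep only the longest prefix for each distinct set of ids."""
    longest = {}
    for prefix, ids in prefix_groups(contexts).items():
        key = frozenset(ids)
        if key not in longest or len(prefix) > len(longest[key]):
            longest[key] = prefix
    return {prefix: set(ids) for ids, prefix in longest.items()}


nodes = compressed_nodes(RIGHT_CONTEXTS)
for prefix, ids in sorted(nodes.items()):
    print(repr(prefix), sorted(ids))
```

The nodes shared by more than one occurrence ("x" for occurrences 0 and 1, "gz" for 2 and 4) are the candidate right contexts of a wrapper.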

slide-43
SLIDE 43

“” {0,1,2,3,4} “trp” {0,1} “Ag” {2,3,4} “ASxy...” {1} “f” {2,4} “Lloc...” {3} “” {0,1,2,3,4} “pfblRo...” {3} “x” {0,1} “gz” {2,4} “xaEsd...” {0} “khSlp...” {1} “PHMA...” {2} “UmXu...” {4}

Tright Tleft

“Upl...” {2} “TiL...” {4}

Obama 0 Clinton 1 Obama 2 Clinton 3 Clinton 4

Seeds

slide-44
SLIDE 44

“” {0,1,2,3,4} “trp” {0,1} “Ag” {2,3,4} “ASxy...” {1} “f” {2,4} “Lloc...” {3} “” {0,1,2,3,4} “pfblRo...” {3} “x” {0,1} “gz” {2,4} “xaEsd...” {0} “khSlp...” {1} “PHMA...” {2} “UmXu...” {4}

Tright Tleft

“Upl...” {2} “TiL...” {4}

Obama 0 Clinton 1 Obama 2 Clinton 3 Clinton 4

Seeds

  • 1. TopNodes

Algorithm

slide-45
SLIDE 45

“” {0,1,2,3,4} “trp” {0,1} “Ag” {2,3,4} “ASxy...” {1} “f” {2,4} “Lloc...” {3} “” {0,1,2,3,4} “pfblRo...” {3} “x” {0,1} “gz” {2,4} “xaEsd...” {0} “khSlp...” {1} “PHMA...” {2} “UmXu...” {4}

Tright Tleft

“Upl...” {2} “TiL...” {4}

Obama 0 Clinton 1 Obama 2 Clinton 3 Clinton 4

Seeds

  • 1. TopNodes
  • 2. Match

Algorithm

slide-46
SLIDE 46

“” {0,1,2,3,4} “trp” {0,1} “Ag” {2,3,4} “ASxy...” {1} “f” {2,4} “Lloc...” {3} “” {0,1,2,3,4} “pfblRo...” {3} “x” {0,1} “gz” {2,4} “xaEsd...” {0} “khSlp...” {1} “PHMA...” {2} “UmXu...” {4}

Tright Tleft

“Upl...” {2} “TiL...” {4}

Obama 0 Clinton 1 Obama 2 Clinton 3 Clinton 4

Seeds

  • 1. TopNodes
  • 2. Match
  • 3. Match back

Algorithm

slide-47
SLIDE 47

“” {0,1,2,3,4} “trp” {0,1} “Ag” {2,3,4} “ASxy...” {1} “f” {2,4} “Lloc...” {3} “” {0,1,2,3,4} “pfblRo...” {3} “x” {0,1} “gz” {2,4} “xaEsd...” {0} “khSlp...” {1} “PHMA...” {2} “UmXu...” {4}

Tright Tleft

“Upl...” {2} “TiL...” {4}

Obama 0 Clinton 1 Obama 2 Clinton 3 Clinton 4

Seeds

slide-48
SLIDE 48

prtoBamAxxaEsdSlkprtKenNed yxSAprtCLinTOnxkhSlpUfgAob AMagzPHMAcolLgAcLIntOnpfb lRoiusWgoprtcAstrOxkLiTfgAClI nTongzUmXuSYfgAkEnneDygzil

Wrapper: prt[...]x Content:

  • obama, clinton

Wrapper: fgA[...]gz Content:

  • obama, clinton

Document
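The two wrappers on this slide can be derived by brute force, which may help to check the idea on small inputs. Note that this is not the trie-based TopNodes / Match / Match back procedure from the slides: the sketch simply tries every group of seed occurrences that covers all seeds and keeps the longest left and right context the group shares. Case-insensitive seed matching and all names are assumptions.

```python
from itertools import combinations
from os.path import commonprefix

DOCUMENT = (
    "prtoBamAxxaEsdSlkprtKenNed"
    "yxSAprtCLinTOnxkhSlpUfgAob"
    "AMagzPHMAcolLgAcLIntOnpfb"
    "lRoiusWgoprtcAstrOxkLiTfgAClI"
    "nTongzUmXuSYfgAkEnneDygzil"
)


def occurrences(document, seeds):
    """Return (start, end, seed) for every case-insensitive seed mention."""
    lowered = document.lower()
    found = []
    for seed in seeds:
        start = lowered.find(seed)
        while start != -1:
            found.append((start, start + len(seed), seed))
            start = lowered.find(seed, start + 1)
    return sorted(found)


def find_wrappers(document, seeds):
    """Collect (left, right) contexts shared by at least one mention of every seed."""
    occs = occurrences(document, seeds)
    wrappers = set()
    for size in range(len(seeds), len(occs) + 1):
        for group in combinations(occs, size):
            if {seed for _, _, seed in group} != set(seeds):
                continue
            # Left contexts are compared from the seed outwards, so reverse them.
            left = commonprefix([document[:s][::-1] for s, _, _ in group])[::-1]
            right = commonprefix([document[e:] for _, e, _ in group])
            if left and right:
                wrappers.add((left, right))
    return wrappers


wrappers = find_wrappers(DOCUMENT, ["obama", "clinton"])
print(sorted(wrappers))  # [('fgA', 'gz'), ('prt', 'x')]
```

Enumerating all groups is exponential in the number of occurrences, which is presumably why the slides use tries over the left and right contexts instead.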

slide-49
SLIDE 49

prtoBamAxxaEsdSlkprtKenNed yxSAprtCLinTOnxkhSlpUfgAob AMagzPHMAcolLgAcLIntOnpfb lRoiusWgoprtcAstrOxkLiTfgAClI nTongzUmXuSYfgAkEnneDygzil

Wrapper: prt[...]x Content:

  • obama, clinton, kennedy, castro

Wrapper: fgA[...]gz Content:

  • obama, clinton, kennedy

Document
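Applying a wrapper to the document amounts to collecting every string bracketed by its left and right context. A minimal sketch with regular expressions follows; the shortest-match policy and the lower-casing of mentions are assumptions made to reproduce the lists on this slide.

```python
import re

DOCUMENT = (
    "prtoBamAxxaEsdSlkprtKenNed"
    "yxSAprtCLinTOnxkhSlpUfgAob"
    "AMagzPHMAcolLgAcLIntOnpfb"
    "lRoiusWgoprtcAstrOxkLiTfgAClI"
    "nTongzUmXuSYfgAkEnneDygzil"
)


def extract(document, left, right):
    """Return the lower-cased strings found between `left` and `right`."""
    pattern = re.escape(left) + r"(.+?)" + re.escape(right)
    return [mention.lower() for mention in re.findall(pattern, document)]


first = extract(DOCUMENT, "prt", "x")
second = extract(DOCUMENT, "fgA", "gz")
print(first)   # ['obama', 'kennedy', 'clinton', 'castro']
print(second)  # ['obama', 'clinton', 'kennedy']
```

Castro is picked up only by the first wrapper, while Kennedy is supported by both, which is the kind of evidence the ranker can use later.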

slide-50
SLIDE 50

Characters vs Tags

slide-51
SLIDE 51

Characters vs Tags

<ul> <li>Obama</li> <li>Bush</li> <li>Kennedy</li> </ul>

slide-52
SLIDE 52
  • Exploit (HTML) tags

Characters vs Tags

<ul> <li>Obama</li> <li>Bush</li> <li>Kennedy</li> </ul>

slide-53
SLIDE 53
  • Exploit (HTML) tags
  • Good idea?

Characters vs Tags

<ul> <li>Obama</li> <li>Bush</li> <li>Kennedy</li> </ul>

slide-54
SLIDE 54
  • Exploit (HTML) tags
  • Good idea?
  • Apparently not

Characters vs Tags

<ul> <li>Obama</li> <li>Bush</li> <li>Kennedy</li> </ul>

slide-55
SLIDE 55

Characters vs Tags

slide-56
SLIDE 56

Characters vs Tags

  • No parser
slide-57
SLIDE 57

Characters vs Tags

  • No parser
  • Language-independent
slide-58
SLIDE 58

Characters vs Tags

  • No parser
  • Language-independent
  • ⠟⠁⠃⠹⠕⠊⠟⠭⠛⠑⠹
slide-59
SLIDE 59

Characters vs Tags

  • No parser
  • Language-independent
  • ⠟⠁⠃⠹⠕⠊⠟⠭⠛⠑⠹
  • Other meta language (e.g. TeX)
slide-60
SLIDE 60

Characters vs Tags

  • No parser
  • Language-independent
  • ⠟⠁⠃⠹⠕⠊⠟⠭⠛⠑⠹
  • Other meta language (e.g. TeX)
  • Seeds in unusual places
slide-61
SLIDE 61

Characters vs Tags

  • No parser
  • Language-independent
  • ⠟⠁⠃⠹⠕⠊⠟⠭⠛⠑⠹
  • Other meta language (e.g. TeX)
  • Seeds in unusual places
  • <span class="obama">
slide-62
SLIDE 62

Characters vs Tags

  • No parser
  • Language-independent
  • ⠟⠁⠃⠹⠕⠊⠟⠭⠛⠑⠹
  • Other meta language (e.g. TeX)
  • Seeds in unusual places
  • <span class="obama">
  • More restrictions ⇒ lower performance
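The point about seeds in unusual places can be made concrete. In the hypothetical page fragment below the entity names occur only inside an attribute value, where a tag-level extractor looking at element text would miss them, while a character-level wrapper does not care. The fragment and the wrapper are invented for illustration.

```python
import re

# Hypothetical markup: the names appear only in the class attribute.
PAGE = (
    '<span class="obama">Barack</span>'
    '<span class="clinton">Bill</span>'
    '<span class="kennedy">John</span>'
)


def extract(document, left, right):
    """Return the strings found between the character contexts `left` and `right`."""
    pattern = re.escape(left) + r"(.+?)" + re.escape(right)
    return re.findall(pattern, document)


names = extract(PAGE, '<span class="', '">')
print(names)  # ['obama', 'clinton', 'kennedy']
```

The same function would work unchanged on TeX or any other markup, since it never parses the document.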
slide-63
SLIDE 63

The SEAL System

Seeds Web pages

Fetcher

Google

Extractor

Mentions Wrapper

Ranker

Suggestions Graph

slide-64
SLIDE 64

Ranker

slide-65
SLIDE 65

Ranker

  • Problem: Noise
slide-66
SLIDE 66

Ranker

  • Problem: Noise
  • Solution: Similarity measure between seeds and mentions

slide-67
SLIDE 67

Ranker

  • Problem: Noise
  • Solution: Similarity measure between seeds and mentions

  • Ranked output
slide-68
SLIDE 68

Ranker

  • Problem: Noise
  • Solution: Similarity measure between seeds and mentions

  • Ranked output
  • Understand relation
slide-69
SLIDE 69

Analysis

slide-70
SLIDE 70

Analysis

slide-71
SLIDE 71

Analysis

slide-72
SLIDE 72

Analysis

slide-73
SLIDE 73

Analysis

slide-74
SLIDE 74

Analysis

  • 4 source types:
  • document
  • seed
  • wrapper
  • mention
slide-75
SLIDE 75

doc seeds

prt..x Castro JFK fgA..gz

find derive derive extract extract extract

slide-76
SLIDE 76

doc seeds

prt..x Castro JFK fgA..gz

find derive derive extract extract extract

Graph-walk (Page rank)

slide-77
SLIDE 77

doc seeds

prt..x Castro JFK fgA..gz

find derive derive extract extract extract

Graph-walk (Page rank) P(find|doc) = 0.5

slide-78
SLIDE 78

doc seeds

prt..x Castro JFK fgA..gz

find derive derive extract extract extract

Graph-walk (Page rank) P(find|doc) = 0.5 P(derive|doc) = 0.5

slide-79
SLIDE 79

doc seeds

prt..x Castro JFK fgA..gz

find derive derive extract extract extract

Graph-walk (Page rank) P(find|doc) = 0.5 P(derive|doc) = 0.5 P(seeds|doc,find) = 1

slide-80
SLIDE 80

doc seeds

prt..x Castro JFK fgA..gz

find derive derive extract extract extract

Graph-walk (Page rank) P(find|doc) = 0.5 P(derive|doc) = 0.5 P(seeds|doc,find) = 1 P(prt..x|doc,derive) = 0.5

slide-81
SLIDE 81

doc seeds

prt..x Castro JFK fgA..gz

find derive derive extract extract extract

Graph-walk (Page rank) P(find|doc) = 0.5 P(derive|doc) = 0.5 P(seeds|doc,find) = 1 P(prt..x|doc,derive) = 0.5 P(fgA..gz|doc,derive) = 0.5

slide-82
SLIDE 82

doc seeds

prt..x Castro JFK fgA..gz

1/2 1/4 1/4 extract extract extract

Graph-walk (Page rank) P(find|doc) = 0.5 P(derive|doc) = 0.5 P(seeds|doc,find) = 1 P(prt..x|doc,derive) = 0.5 P(fgA..gz|doc,derive) = 0.5
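The edge weights 1/2, 1/4, 1/4 follow from the probabilities listed above: the walker first picks one of the node's relations uniformly, then one of that relation's targets uniformly. A small sketch with exact fractions, where the dictionary layout and the function name are assumptions:

```python
from fractions import Fraction

# Outgoing edges of the document node, grouped by relation as on the slide.
EDGES = {"doc": {"find": ["seeds"], "derive": ["prt..x", "fgA..gz"]}}


def step_probabilities(node, edges):
    """P(node -> target): uniform over relations, then uniform over targets."""
    relations = edges[node]
    probs = {}
    for targets in relations.values():
        share = Fraction(1, len(relations)) / len(targets)
        for target in targets:
            probs[target] = probs.get(target, 0) + share
    return probs


probs = step_probabilities("doc", EDGES)
for target, p in probs.items():
    print(target, p)
```

For the document node this gives P(seeds) = 1/2 and 1/4 for each of the two wrappers, matching the slide.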

slide-83
SLIDE 83

doc seeds

prt..x Castro JFK fgA..gz

find derive 1/2 extract 1/4 1/4

Graph-walk (Page rank) Transitions in both ways

slide-84
SLIDE 84

s d w1 w2 m1 m2 s d w1 w2 m1 m2

Transition Matrix

d s

w2 m2 m1 w1

find derive derive extract extract extract

slide-85
SLIDE 85

x y

s d w1 w2 m1 m2 s d w1 w2 m1 m2

Transition Matrix

d s

w2 m2 m1 w1

find derive derive extract extract extract

slide-86
SLIDE 86

x y

s d w1 w2 m1 m2 s d w1 w2 m1 m2

Transition Matrix (x,y) = P(x→y)

d s

w2 m2 m1 w1

find derive derive extract extract extract

slide-87
SLIDE 87

s d w1 w2 m1 m2 s d w1 w2 m1 m2

Transition Matrix

d s

w2 m2 m1 w1

find derive derive extract extract extract

slide-88
SLIDE 88

s d w1 w2 m1 m2 s d 1 w1 w2 m1 m2

Transition Matrix

d s

w2 m2 m1 w1

find derive derive extract extract extract

slide-89
SLIDE 89

s d w1 w2 m1 m2 s d 1 w1 w2 m1 m2

Transition Matrix

d s

w2 m2 m1 w1

find derive derive extract extract extract

slide-90
SLIDE 90

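(columns: source x, rows: target y; "-" = not yet filled in)

      s    d    w1   w2   m1   m2
s     -    ½    -    -    -    -
d     1    -    -    -    -    -
w1    -    ¼    -    -    -    -
w2    -    ¼    -    -    -    -
m1    -    -    -    -    -    -
m2    -    -    -    -    -    -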

Transition Matrix

d s

w2 m2 m1 w1

find derive derive extract extract extract

slide-91
SLIDE 91

(columns: source x, rows: target y)

      s    d    w1   w2   m1   m2
s     0    ½    0    0    0    0
d     1    0    ½    ½    0    0
w1    0    ¼    0    0    ½    1
w2    0    ¼    0    0    ½    0
m1    0    0    ¼    ½    0    0
m2    0    0    ¼    0    0    0

Transition Matrix

d s

w2 m2 m1 w1

find derive derive extract extract extract

slide-92
SLIDE 92

doc seeds

prt..x Castro JFK fgA..gz

find derive derive extract extract extract

(columns: source x, rows: target y)

      s    d    w1   w2   m1   m2
s     0    ½    0    0    0    0
d     1    0    ½    ½    0    0
w1    0    ¼    0    0    ½    1
w2    0    ¼    0    0    ½    0
m1    0    0    ¼    ½    0    0
m2    0    0    ¼    0    0    0

Transition Matrix

with laziness factor λ = 0.01

slide-93
SLIDE 93

s d w1 w2 m1 m2

State Vector

d s

w2 m2 m1 w1

find derive derive extract extract extract

slide-94
SLIDE 94

s 1 d w1 w2 m1 m2

State Vector

d s

w2 m2 m1 w1

find derive derive extract extract extract

slide-95
SLIDE 95

Transition Matrix and State Vector

slide-96
SLIDE 96

Transition Matrix and State Vector · =

slide-97
SLIDE 97

Transition Matrix and State Vector · =

slide-98
SLIDE 98

Iterated Multiplication · =

slide-99
SLIDE 99

Iterated Multiplication · =

1000x

slide-100
SLIDE 100

Iterated Multiplication · =

1000x

slide-101
SLIDE 101

Iterated Multiplication · =

d s

w2 m2 m1 w1

find derive derive extract extract extract

1000x
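The iterated multiplication can be sketched in plain Python. Several things here are assumptions: the matrix entries are one reading of the flattened transition matrix on the earlier slides (column = source, row = target, so m1 is the mention found by both wrappers), and the laziness factor is interpreted as the probability of staying put in each step.

```python
NODES = ["s", "d", "w1", "w2", "m1", "m2"]

# MATRIX[row][col] = P(col -> row); every column sums to 1.
MATRIX = [
    [0, 0.5,  0,    0,   0,   0],
    [1, 0,    0.5,  0.5, 0,   0],
    [0, 0.25, 0,    0,   0.5, 1],
    [0, 0.25, 0,    0,   0.5, 0],
    [0, 0,    0.25, 0.5, 0,   0],
    [0, 0,    0.25, 0,   0,   0],
]
LAZINESS = 0.01  # assumed: probability of not moving in a step


def step(state):
    """One lazy walk step: stay with probability LAZINESS, otherwise move."""
    size = len(state)
    moved = [sum(MATRIX[r][c] * state[c] for c in range(size)) for r in range(size)]
    return [LAZINESS * old + (1 - LAZINESS) * new for old, new in zip(state, moved)]


state = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # all probability mass starts on the seeds node
for _ in range(1000):
    state = step(state)

scores = dict(zip(NODES, state))
for node in sorted(scores, key=scores.get, reverse=True):
    print(node, round(scores[node], 4))
```

Under these assumptions m1 ends up with more weight than m2, so a mention extracted by two wrappers outranks one extracted by a single wrapper. Without the laziness term the walk on this bipartite graph would oscillate instead of settling.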

slide-102
SLIDE 102

Evaluation

slide-103
SLIDE 103
  • Tested English, Chinese, Japanese

Evaluation

slide-104
SLIDE 104
  • Tested English, Chinese, Japanese
  • 36 datasets, 18 classes

Evaluation

slide-105
SLIDE 105
  • Tested English, Chinese, Japanese
  • 36 datasets, 18 classes
  • 9 classes for all 3 languages (e.g. US presidents)

Evaluation

slide-106
SLIDE 106
  • Tested English, Chinese, Japanese
  • 36 datasets, 18 classes
  • 9 classes for all 3 languages (e.g. US presidents)
  • 9 language-specific classes (e.g. Chinese dynasties)

Evaluation

slide-107
SLIDE 107
  • Tested English, Chinese, Japanese
  • 36 datasets, 18 classes
  • 9 classes for all 3 languages (e.g. US presidents)
  • 9 language-specific classes (e.g. Chinese dynasties)

  • 5 runs/dataset, 3 random seeds

Evaluation

slide-108
SLIDE 108

MAP >90%
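For reference, mean average precision (MAP) can be computed as sketched below. This uses the common textbook definition; whether the evaluation on the slides used exactly this variant is not stated, and the example ranking is made up.

```python
def average_precision(ranked, relevant):
    """Average of precision@k over the ranks k at which a relevant item appears."""
    hits = 0
    total = 0.0
    seen = set()
    for rank, item in enumerate(ranked, start=1):
        if item in relevant and item not in seen:
            seen.add(item)
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Mean of the average precision over several (ranked, relevant) runs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Made-up ranking: one wrong entry between two correct ones.
ap = average_precision(["kennedy", "castro", "washington"], {"kennedy", "washington"})
print(ap)
```

Here the relevant items sit at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 2.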

slide-109
SLIDE 109

US Presidents cont.

slide-110
SLIDE 110

US Presidents cont.

  • rcwang.com/seal
slide-111
SLIDE 111

US Presidents cont.

  • rcwang.com/seal
  • boowa.com
slide-112
SLIDE 112

US Presidents cont.

  • rcwang.com/seal
  • boowa.com
  • Still strange results
slide-113
SLIDE 113

US Presidents cont.

  • rcwang.com/seal
  • boowa.com
  • Still strange results
  • Google results
slide-114
SLIDE 114

US Presidents cont.

  • rcwang.com/seal
  • boowa.com
  • Still strange results
  • Google results
  • Latest news
slide-115
SLIDE 115

US Presidents cont.

  • rcwang.com/seal
  • boowa.com
  • Still strange results
  • Google results
  • Latest news
slide-116
SLIDE 116

US Presidents cont.

  • rcwang.com/seal
  • boowa.com
  • Still strange results
  • Google results
  • Latest news
slide-117
SLIDE 117

Bootstrapping

slide-118
SLIDE 118

stats = {}, used = input, rslt = {}
for i = 1 to M do
    m = min(3, |used|)
    seeds = select_m(used) ∪ top(rslt)
    stats = expand(seeds)
    rslt = rank(stats)
    used = used ∪ seeds
rof
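The bootstrapping loop above translates fairly directly into Python. The sketch assumes that select_m draws m of the used seeds at random and that top(rslt) is the single best-ranked item so far; `toy_expand` and `toy_rank` are hypothetical stand-ins for the real expansion and ranking steps.

```python
import random


def bootstrap(initial_seeds, expand, rank, iterations, rng):
    """Iteratively expand a seed set, feeding the top result back in as a seed."""
    stats = {}
    used = list(initial_seeds)
    ranked = []
    for _ in range(iterations):
        m = min(3, len(used))
        # Assumption: random choice of m used seeds plus the current top result.
        seeds = set(rng.sample(used, m)) | set(ranked[:1])
        stats = expand(seeds)
        ranked = rank(stats)
        used = sorted(set(used) | seeds)
    return ranked


# Hypothetical stand-ins so the loop can be run end to end.
def toy_expand(seeds):
    universe = ["obama", "clinton", "bush", "kennedy", "washington"]
    return {name: len(name) for name in universe if name not in seeds}


def toy_rank(stats):
    return sorted(stats, key=stats.get, reverse=True)


result = bootstrap(["obama", "clinton", "bush"], toy_expand, toy_rank,
                   iterations=2, rng=random.Random(0))
print(result)
```

With the toy functions, the first round ranks "washington" on top, which is then promoted to a seed in the second round.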

slide-119
SLIDE 119

Binary Relations

slide-120
SLIDE 120

More Wrappers

slide-121
SLIDE 121

More Wrappers

  • New: Middle context
slide-122
SLIDE 122

More Wrappers

  • New: Middle context
  • Slight adjustments to the old algorithm
slide-123
SLIDE 123

More Wrappers

  • New: Middle context
  • Slight adjustments to the old algorithm
  • “Mayor of...”
slide-124
SLIDE 124

More Wrappers

  • New: Middle context
  • Slight adjustments to the old algorithm
  • “Mayor of...”
  • “Duet with...”, etc.
slide-125
SLIDE 125

sadhSAbcGsadutsobAMawrKu SAjkLsdFautsmeRKelwrKgErmA nyjkLsdfkuxeSAmcvDkBSs

slide-126
SLIDE 126

sadhSAbcGsadutsobAMawrKu SAjkLsdFautsmeRKelwrKgErmA nyjkLsdfkuxeSAmcvDkBSs
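For binary relations the wrapper gains a middle context between the two arguments. In the toy document above, one plausible reading is a left context "uts", a middle context "wrK" and a right context "jkLsd"; these three strings are inferred from the text and are not given on the slides. A sketch of applying such a wrapper:

```python
import re

# The toy document from the slide, with line-break spaces removed.
DOCUMENT = (
    "sadhSAbcGsadutsobAMawrKu"
    "SAjkLsdFautsmeRKelwrKgErmA"
    "nyjkLsdfkuxeSAmcvDkBSs"
)


def extract_pairs(document, left, middle, right):
    """Return lower-cased (first, second) pairs bracketed by the three contexts."""
    pattern = (re.escape(left) + r"(.+?)" + re.escape(middle)
               + r"(.+?)" + re.escape(right))
    return [(a.lower(), b.lower()) for a, b in re.findall(pattern, document)]


pairs = extract_pairs(DOCUMENT, "uts", "wrK", "jkLsd")
print(pairs)  # [('obama', 'usa'), ('merkel', 'germany')]
```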

slide-127
SLIDE 127

Much research going on

Not sure if that fits...

slide-128
SLIDE 128

Thank you

Seeds Web pages

Fetcher

Google

Extractor

Mentions

Wrapper

Ranker

Suggestions

Graph

  • Bootstrapping
  • Binary relations