Algorithms and Applications for Web-Scale Knowledge Graphs (Marco Ponza)



slide-1
SLIDE 1

Algorithms and Applications for Web-Scale Knowledge Graphs

Marco Ponza

Supervisor

  • Prof. Paolo Ferragina
slide-2
SLIDE 2

Menu

  • 1. Entity Annotation

○ The Modeling of Knowledge
○ Terminology
○ The Annotation Pipeline
○ Applications
○ A New Text Representation

  • 2. Work done in the first year

○ Entity Relatedness
○ Document Aboutness

  • 3. Future Work
slide-3
SLIDE 3

1.

Entity Annotation

slide-4
SLIDE 4

The Modeling of Knowledge

▷ Classical approaches
○ Document Knowledge = Words
○ Bag-of-words (aka BoW)
○ Vector Space Model (aka VSM) (Salton, 1971)

Document → stop-word removal, stemming, ... → Document’s Words → counting, scaling, normalization, ... → Vector Space Model (example vector: 2 2 1)
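The bag-of-words pipeline above can be sketched in a few lines. This is a minimal illustration, not the system from the talk: the stop-word list is a tiny made-up sample, stemming is omitted, and the names `bag_of_words` and `cosine` are chosen here for the example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard resource).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in"}

def bag_of_words(text):
    """Lowercase, tokenize, drop stop words, and count the remaining terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc1 = bag_of_words("The jaguar is a fast cat")
doc2 = bag_of_words("The jaguar is a fast car")
similarity = cosine(doc1, doc2)
```

The two documents share the word "jaguar" even though one is about the animal and the other about the car maker, so a word-only model rates them as similar. This is the ambiguity issue discussed on the next slides.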

slide-5
SLIDE 5

The Modeling of Knowledge

▷ Well-known issues (Jurafsky, ‘00) ○ Ambiguity (Polysemy and Synonymy)

Jaguar → ? → Jaguar (feline) or Jaguar_Cars

slide-6
SLIDE 6

The Modeling of Knowledge

▷ Well-known issues (Jurafsky, ‘00)
○ Ambiguity (Polysemy and Synonymy)
○ Semantic Connections (e.g. Barack_Obama, United_States)

slide-7
SLIDE 7

The Modeling of Knowledge

▷ Well-known issues (Jurafsky, ‘00)
○ Ambiguity (Polysemy and Synonymy)
○ Semantic Connections

▷ Algorithmic solutions
○ Latent Approaches (e.g. LSI/LSA, Word2Vec)
■ Unintelligible for humans (Gabrilovich IJCAI ‘07)
○ “Knowledge is Power” Hypothesis (Lenat, ‘91; Gabrilovich SIGIR ‘16)
■ Semantic and unambiguous concepts
■ Depend on the design of Entity Annotators

slide-8
SLIDE 8

Entity Annotation

Terminology

▷ Wikipedia Knowledge Graph ▷ Node?

slide-9
SLIDE 9
slide-10
SLIDE 10

Entity Annotation

Terminology

▷ Wikipedia Knowledge Graph ▷ Link? ▷ Node: Wikipedia Page (Entity)

slide-11
SLIDE 11
slide-12
SLIDE 12

Entity Annotation

Terminology

▷ Wikipedia Knowledge Graph
▷ Node: Wikipedia Page (Entity)
▷ Link: Wikipedia Hyperlink

Goal

Enrich a text T with proper annotations

Annotation = (mention, entity)

slide-13
SLIDE 13

Entity Annotation

The Annotation Pipeline

Input Text → Entity Annotator → Annotated Text

Spotting

1. Identify mentions (spots)
2. Retrieve candidate entities

Disambiguation

Assign the most pertinent entity to each spot

Pruning

Remove non-pertinent annotations

slide-14
SLIDE 14

Entity Annotation

The Annotation Pipeline

Yesterday Maradona won against Mexico.

Spotting: Mention Detection, Candidate Generation

  • Yesterday: Yesterday_(Time), Yesterday_(Beatles_song), Yesterday_(Guns_N_Roses_song), ...
  • Maradona: Diego_Maradona, Maradona_by_Kusturica, Diego_Sinagra, ...
  • Mexico: Mexico, Mexico_national_football_team, Mexico,_New_York, ...

Pruning

slide-15
SLIDE 15

Entity Annotation

The Annotation Pipeline

Spotting

  • 1. Mention Detection

○ Named Entity Recognition (aka NER) ○ N-gram generation

  • 2. Candidate Generation

○ Gazetteer: { mention → entities }
■ How?

slide-16
SLIDE 16
slide-17
SLIDE 17

Entity Annotation

The Annotation Pipeline

Spotting

  • 1. Mention Detection

○ Named Entity Recognition (aka NER) ○ N-gram generation

  • 2. Candidate Generation

○ Gazetteer: { mention → entities }
■ How? Wikipedia anchor texts!
■ Ranking (+ Thresholding)

  • Commonness (Ferragina, CIKM ’10; Guo, CIKM ’14)
  • Entity-context Similarity (Zwicklbauer, SIGIR ’16)
  • ...
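Candidate generation with commonness ranking can be sketched as follows. Commonness is taken here as the fraction of anchors with a given text that link to a given entity; the anchor counts are made up for illustration, and `build_gazetteer`, `candidates`, and `min_commonness` are hypothetical names, not the API of any cited annotator.

```python
from collections import Counter, defaultdict

def build_gazetteer(anchors):
    """Map each (lowercased) anchor text to a count of the entities it links to."""
    gazetteer = defaultdict(Counter)
    for mention, entity in anchors:
        gazetteer[mention.lower()][entity] += 1
    return gazetteer

def candidates(gazetteer, mention, min_commonness=0.0):
    """Rank candidate entities by commonness and drop those below the threshold."""
    counts = gazetteer.get(mention.lower(), Counter())
    total = sum(counts.values())
    ranked = [(entity, count / total) for entity, count in counts.most_common()]
    return [(entity, p) for entity, p in ranked if p >= min_commonness]

# Made-up anchor statistics: (anchor text, linked Wikipedia page).
anchors = (
    [("Maradona", "Diego_Maradona")] * 8
    + [("Maradona", "Maradona_by_Kusturica")] * 2
)
gazetteer = build_gazetteer(anchors)
ranked = candidates(gazetteer, "Maradona")
pruned = candidates(gazetteer, "Maradona", min_commonness=0.5)
```

Thresholding on commonness trades recall for speed: rare senses are dropped before disambiguation ever sees them.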
slide-18
SLIDE 18

Entity Annotation

The Annotation Pipeline

Yesterday Maradona won against Mexico.

Spotting

  • Yesterday: Yesterday_(Time), Yesterday_(Beatles_song), Yesterday_(Guns_N_Roses_song), ...
  • Maradona: Diego_Maradona, Maradona_by_Kusturica, Diego_Sinagra, ...
  • Mexico: Mexico, Mexico_national_football_team, Mexico,_New_York, ...

Disambiguation

Pruning

slide-19
SLIDE 19

Entity Annotation

The Annotation Pipeline

Yesterday Maradona won against Mexico.

Spotting

Disambiguation

Selected entities: Yesterday_(Beatles_song), Diego_Maradona, Mexico_national_football_team

Coherence scores shown: 0.1 0.8 0.7

▷ Spots have been disambiguated
○ Ambiguous lexical elements (words) are now labeled with unambiguous concepts
▷ Finally, coherence scores are assigned

Pruning

slide-20
SLIDE 20

Entity Annotation

The Annotation Pipeline

Spotting Disambiguation

(Mendes, SemSys ‘11) (Mihalcea, CIKM ‘07) (Moro, ACL ‘14)

AIDA

(Nguyen, LDOW ‘14) (Piccinno, SIGIR ‘14) (Scaiella, CIKM ‘10)

PBoH

(Ganea, WWW ‘16)

DoSeR

(Zwicklbauer, SIGIR ’16)

...

The Annotation Pipeline

LS NED

(Cucerzan, ACL ‘07)

slide-21
SLIDE 21

Entity Annotation

The Annotation Pipeline

Spotting Disambiguation

Pruning

[...] Maradona won against Mexico.

Algorithm: (Scaiella, CIKM ‘10; Piccinno, SIGIR ‘14)

  • Voting Scheme
  • M&W / Jaccard

Relatedness

slide-22
SLIDE 22

Entity Annotation

The Annotation Pipeline

Spotting Disambiguation

Pruning

[...] Maradona won against Mexico.

Algorithm: DoSeR (Zwicklbauer, SIGIR ’16)

  • Graph of

candidates

slide-23
SLIDE 23

Entity Annotation

The Annotation Pipeline

Spotting Disambiguation

Pruning

[...] Maradona won against Mexico.

  • Graph of

candidates

  • Entity2Vec

Relatedness

Algorithm: DoSeR (Zwicklbauer, SIGIR ’16)

slide-24
SLIDE 24

Entity Annotation

The Annotation Pipeline

Spotting Disambiguation

Pruning

[...] Maradona won against Mexico.

  • Graph of

candidates

  • Entity2Vec

Relatedness

  • PageRank

Algorithm: DoSeR (Zwicklbauer, SIGIR ’16)

slide-25
SLIDE 25

Entity Annotation

The Annotation Pipeline

Yesterday Maradona won against Mexico.

Spotting

Disambiguation

Selected entities: Yesterday_(Beatles_song), Diego_Maradona, Mexico_national_football_team

Coherence scores shown: 0.1 0.8 0.7

Pruning

▷ Remove non-pertinent annotations
▷ Clear the text of erroneous annotations
▷ Coherence thresholding
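Coherence thresholding can be sketched as a simple filter. The pairing of the three scores with the three entities is an assumption (the slide does not make it explicit), and the threshold value 0.5 and the function name `prune` are hypothetical.

```python
def prune(annotations, threshold=0.5):
    """Keep only annotations whose coherence score reaches the threshold."""
    return [a for a in annotations if a[2] >= threshold]

# (mention, entity, coherence score); the score assignment is assumed.
annotations = [
    ("Yesterday", "Yesterday_(Beatles_song)", 0.1),
    ("Maradona", "Diego_Maradona", 0.8),
    ("Mexico", "Mexico_national_football_team", 0.7),
]
kept = prune(annotations)
```

Under this assumed pairing, the spurious link from "Yesterday" to the Beatles song is dropped, while the two football-related annotations survive.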

slide-26
SLIDE 26

Applications

Web Search Results (Gabrilovich, SIGIR ’16)

slide-27
SLIDE 27

Applications

Web Search Results (Gabrilovich, SIGIR ’16)

slide-28
SLIDE 28

Applications

Question Answering (Gabrilovich, SIGIR ’16)

slide-29
SLIDE 29

Condition → What does it mean? Symptoms →What do they indicate?

Applications

Implicit Questions (Gabrilovich, SIGIR ’16)

slide-30
SLIDE 30

▷ Originally introduced by (Scaiella, WSDM ‘12)

Widely deployed (Dunietz, EACL ‘14; Schuhmacher, WSDM '14; Ni, WSDM ‘15), ...

▷ Text = Graph of Entities

A New Text Representation

Text

Entity Annotator Graph of Entities

▷ What about…

slide-31
SLIDE 31

▷ Originally introduced by (Scaiella, WSDM ‘12)

Widely deployed (Dunietz, EACL ‘14; Schuhmacher, WSDM '14; Ni, WSDM ‘15), ...

▷ Text = Graph of Entities

A New Text Representation

Text

Entity Annotator Graph of Entities

▷ What about…
○ ...edge weights?
○ ...node weights?
→ Work done in the first year

slide-32
SLIDE 32

2.

Work done in the first year

Entity Relatedness & Document Aboutness

slide-33
SLIDE 33

Entity Relatedness

slide-34
SLIDE 34

Entity Relatedness

Goal

Compute how related two entities are

Relatedness : Entities × Entities → Real

▷ How related are...
○ ...Bank and Money?
○ ...Wood and Book?
▷ Semantic Reasoning:
○ Humans: Background Knowledge
○ Machines: Knowledge Graph

slide-35
SLIDE 35

Entity Relatedness

(A brief list of) Algorithms and Applications

▷ Document/Word Similarity
○ WikiRelate (Strube, AAAI ‘06)
○ Explicit Semantic Analysis (Gabrilovich, IJCAI ‘07)
○ WikiWalk (Yeh, ACL ‘09)
○ Temporal Semantic Analysis (Radinsky, WWW ‘11)
○ Concept Graph Representation (Ni, WSDM ‘16)
○ Milne & Witten (Witten, AAAI ‘08)
○ Salient Semantic Analysis (Hassan, AAAI ‘11)
▷ Machine Translation (Agirre, NAACL ‘09; Rothe, ACL ‘14)
▷ Document Classification (Perozzi, WWW ‘14; Tang, WWW ‘15)
▷ ...

slide-36
SLIDE 36

▷ Two entities are related if…
○ ...they are described by related texts (Corpus-based)

■ Example: ESA (Gabrilovich, IJCAI ‘07)

  • Concepts grounded in human cognition
  • Opposite to latent concepts

Entity Relatedness

slide-37
SLIDE 37

Entity Relatedness

▷ Two entities are related if…
○ ...they are described by related texts (Corpus-based)

■ Example: ESA (Gabrilovich, IJCAI ‘07)

  • Concepts grounded in human cognition
  • Opposite to latent concepts

○ ...they are referenced by related entities (Graph-based)
■ Example: CoSimRank (Rothe, ACL ‘14)

slide-38
SLIDE 38

Entity Relatedness

CoSimRank (Rothe, ACL ‘14)

▷ Graph-based approach
▷ Relatedness algorithm for nodes in a graph
▷ Exploits Random Walks
▷ Algorithm (in brief), for e1, e2 ∈ Entities

1. Sets damping vectors for e1 and e2
2. Runs an iteration of PageRank
3. Updates relatedness score
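The three steps can be sketched in plain Python. This is a simplified reading of CoSimRank, not the authors' implementation: each entity starts from a one-hot distribution, each iteration is a plain random-walk step without teleportation, and the score accumulates decayed inner products. The `decay` value, the toy graph, and the function names are assumptions made for the example.

```python
def walk_step(adj, dist):
    """Spread each node's probability mass evenly over its out-neighbours."""
    nxt = {v: 0.0 for v in adj}
    for v, mass in dist.items():
        if adj[v]:
            share = mass / len(adj[v])
            for u in adj[v]:
                nxt[u] += share
    return nxt

def cosimrank(adj, e1, e2, decay=0.8, iterations=5):
    """Sum of decayed inner products between the two walk distributions."""
    p1 = {v: 1.0 if v == e1 else 0.0 for v in adj}  # step 1: start vectors
    p2 = {v: 1.0 if v == e2 else 0.0 for v in adj}
    score = 0.0
    for k in range(iterations + 1):
        score += decay ** k * sum(p1[v] * p2[v] for v in adj)  # step 3
        p1, p2 = walk_step(adj, p1), walk_step(adj, p2)        # step 2
    return score

# Toy undirected path graph a - b - c, given as adjacency lists.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
score_ac = cosimrank(graph, "a", "c", decay=0.8, iterations=2)
```

Nodes "a" and "c" are not adjacent, yet they receive a positive score because their walks meet at "b" after one step.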

slide-39
SLIDE 39

Entity Relatedness

CoSimRank (Rothe, ACL ‘14)

p0(e1) = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
p0(e2) = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]

Relatedness0(e1, e2) = 0.0

slide-40
SLIDE 40

Entity Relatedness

CoSimRank (Rothe, ACL ‘14)

p1(e1) = [0.2, 0.4, 0.0, 0.4, 0.0, 0.0, 0.0]
p1(e2) = [0.0, 0.4, 0.2, 0.0, 0.4, 0.0, 0.0]

Relatedness1(e1, e2) = 0.16

slide-41
SLIDE 41

Entity Relatedness

CoSimRank (Rothe, ACL ‘14)

p2(e1) = [0.52, 0.08, 0.16, 0.08, 0.0, 0.0, 0.16]
p2(e2) = [0.16, 0.08, 0.46, 0.0, 0.05, 0.21, 0.10]

Relatedness2(e1, e2) = 0.33

slide-42
SLIDE 42

Entity Relatedness

CoSimRank (Rothe, ACL ‘14)

p3(e1) = [0.26, 0.27, 0.03, 0.27, 0.08, 0.00, 0.03]
p3(e2) = [0.03, 0.25, 0.25, 0.10, 0.20, 0.04, 0.02]

Relatedness3(e1, e2) = 0.47
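The scores on the last four slides can be roughly reproduced as a running sum of inner products between the two vectors. The sketch below assumes that the first vector on each slide belongs to e1 and the second to e2, and that no decay factor is applied; neither assumption is stated on the slides.

```python
# Vectors copied from the slides for iterations 0..3 (assignment to e1/e2 assumed).
p_e1 = [
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.2, 0.4, 0.0, 0.4, 0.0, 0.0, 0.0],
    [0.52, 0.08, 0.16, 0.08, 0.0, 0.0, 0.16],
    [0.26, 0.27, 0.03, 0.27, 0.08, 0.00, 0.03],
]
p_e2 = [
    [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.4, 0.2, 0.0, 0.4, 0.0, 0.0],
    [0.16, 0.08, 0.46, 0.0, 0.05, 0.21, 0.10],
    [0.03, 0.25, 0.25, 0.10, 0.20, 0.04, 0.02],
]

running = []
total = 0.0
for a, b in zip(p_e1, p_e2):
    total += sum(x * y for x, y in zip(a, b))  # undamped inner product
    running.append(total)
```

The running sums come out close to the values on the slides (0.0, 0.16, 0.33, 0.47); the small gaps are consistent with the vectors having been rounded for display.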

slide-43
SLIDE 43

Entity Relatedness

CoSimRank (Rothe, ACL ‘14)

p0(e1) = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
p0(e3) = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]

Relatedness0(e1, e3) = 0.0

slide-44
SLIDE 44

Entity Relatedness

CoSimRank (Rothe, ACL ‘14)

p3(e1) = [0.26, 0.27, 0.03, 0.27, 0.08, 0.00, 0.03]
p3(e3) = [0.0, 0.04, 0.02, 0.04, 0.16, 0.24, 0.02]

Relatedness3(e1, e3) = 0.13

slide-45
SLIDE 45

Entity Relatedness

▷ Two entities are related if…
○ ...they are described by related texts (Corpus-based)
■ Example: ESA (Gabrilovich, IJCAI ‘07)

  • Concepts grounded in human cognition
  • Opposite to latent concepts

○ ...they are referenced by related entities (Graph-based)
■ Example: CoSimRank (Rothe, ACL ‘14)
▷ Need for a fair and meaningful comparison

slide-46
SLIDE 46

Entity Relatedness

Preliminary Results: The Relatedness Framework

▷ Design algorithms based on

○ Set Operations (Milne & Witten, Jaccard, ...) ○ Embeddings (Word2Vec, LDA, ...) ○ Random Walk (CoSimRank, PPR+Cos)

▷ Preliminary results

○ Analyse entity pairs ○ Deploy corpus-based algorithms ○ A new algorithm: LLP
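The set-operation family can be illustrated with the two measures named above. This is a sketch under assumptions: each entity is represented by the set of pages linking to it, the Milne & Witten score follows the commonly cited formulation, and the page ids and total page count are made up.

```python
import math

def jaccard(links_a, links_b):
    """Overlap of the two in-link sets relative to their union."""
    union = links_a | links_b
    return len(links_a & links_b) / len(union) if union else 0.0

def milne_witten(links_a, links_b, num_pages):
    """Milne & Witten relatedness over in-link sets; 0.0 when there is no overlap."""
    common = len(links_a & links_b)
    if common == 0:
        return 0.0
    larger = max(len(links_a), len(links_b))
    smaller = min(len(links_a), len(links_b))
    distance = (math.log(larger) - math.log(common)) / (
        math.log(num_pages) - math.log(smaller)
    )
    return max(0.0, 1.0 - distance)

# Made-up in-link sets (page ids) for two entities in a 16-page toy graph.
bank = {1, 2, 3, 4}
money = {3, 4, 5, 6}
jac = jaccard(bank, money)
mw = milne_witten(bank, money, num_pages=16)
```

Both measures only need set intersections, which is why this family is cheap compared with embeddings or random walks.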

slide-47
SLIDE 47

Entity Relatedness

A New Algorithm: Layered Label Propagation (Boldi, WWW ‘11)

▷ Standard Label Propagation (Newman, Phys. Rev. ‘04)

○ Clustering algorithm (node labeling)
○ Pro: Scales to very large graphs
○ Con: Can generate a few big clusters

1. Randomly initialize each node with a label (cluster)
2. Update each label according to a specific rule

○ Maximize nonlocal discount (Ronhovde, Phys. Rev. ‘10)

▷ Layered Label Propagation

○ Standard Label Propagation with a resolution parameter 𝜹
○ Graph compression
○ Algorithm (in brief)
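The effect of the resolution parameter can be sketched as follows. This is a simplified illustration, not the implementation of Boldi et al.: it assumes the update rule usually attributed to LLP, where a node adopts the neighbouring label that maximizes k − 𝜹(v − k), with k the number of neighbours carrying the label and v the number of nodes carrying it overall. To keep the run deterministic, every node starts with its own label and nodes are visited in sorted order, whereas the slides describe a random initialization. The name `resolution` stands in for 𝜹.

```python
from collections import Counter

def label_propagation(adj, resolution=0.0, max_rounds=20):
    """Label propagation with a volume penalty; resolution=0 is the standard rule."""
    labels = {v: v for v in adj}        # assumption: one unique label per node
    volume = Counter(labels.values())   # number of nodes carrying each label
    for _ in range(max_rounds):
        changed = False
        for v in sorted(adj):           # assumption: fixed visiting order
            if not adj[v]:
                continue                # isolated nodes keep their label
            old = labels[v]
            volume[old] -= 1            # do not count v itself
            k = Counter(labels[u] for u in adj[v])
            scores = {lab: k[lab] - resolution * (volume[lab] - k[lab]) for lab in k}
            best_score = max(scores.values())
            best = sorted(lab for lab, s in scores.items() if s == best_score)
            new = old if old in best else best[0]
            labels[v] = new
            volume[new] += 1
            changed = changed or new != old
        if not changed:
            break
    return labels

# Two triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2 - 3.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
standard = label_propagation(graph, resolution=0.0)
layered = label_propagation(graph, resolution=0.5)
```

On this toy graph the standard rule collapses everything into a single cluster, while a positive resolution keeps the two triangles apart, which mirrors the "few big clusters" drawback listed above.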

slide-48
SLIDE 48

Entity Relatedness

A New Algorithm: Layered Label Propagation (Boldi, WWW ‘11)

Node Labels 1 2 3 4 5 6 7 2 1 4 7 5 6 3

Round: 1 Step: Initialization

slide-49
SLIDE 49

Entity Relatedness

A New Algorithm: Layered Label Propagation (Boldi, WWW ‘11)

Node Labels 1 2 3 4 5 6 7 2 1 4 7 5 6 3

Round: 1 Step: Initialization

slide-50
SLIDE 50

Entity Relatedness

A New Algorithm: Layered Label Propagation (Boldi, WWW ‘11)

Node Labels 1 2 3 4 5 6 7 2 1 4 7 5 6 3

Round: 1 Step: Updating (1)

slide-51
SLIDE 51

Entity Relatedness

A New Algorithm: Layered Label Propagation (Boldi, WWW ‘11)

Node Labels 1 2 3 4 5 6 7 2 1 4 7 5 6 3

Round: 1 Step: Updating (2)

Node Labels 1 2 3 4 5 6 7

slide-52
SLIDE 52

Entity Relatedness

A New Algorithm: Layered Label Propagation (Boldi, WWW ‘11)

Round: 2 Step: Initialization

Node Labels 1 2 3 4 5 6 7 2 1 4 7 5 6 3

slide-53
SLIDE 53

Entity Relatedness

A New Algorithm: Layered Label Propagation (Boldi, WWW ‘11)

Round: 2 Step: Initialization

2 1 4 7 5 6 3 Node Labels 1 2 3 4 5 6 7

slide-54
SLIDE 54

Entity Relatedness

A New Algorithm: Layered Label Propagation (Boldi, WWW ‘11)

Round: 2 Step: Updating (1)

2 1 4 7 5 6 3 Node Labels 1 2 3 4 5 6 7

slide-55
SLIDE 55

Entity Relatedness

A New Algorithm: Layered Label Propagation (Boldi, WWW ‘11)

Round: 2 Step: Updating (2)

2 1 4 7 5 6 3 Node Labels 1 2 3 4 5 6 7 Node Labels 1 2 3 4 5 6 7

slide-56
SLIDE 56

Entity Relatedness

A New Algorithm: Layered Label Propagation (Boldi, WWW ‘11)

Round: 2 Step: Updating (2)

Node Labels 1 2 3 4 5 6 7 Node Labels 1 2 3 4 5 6 7 2 1 4 7 5 6 3

slide-57
SLIDE 57

Entity Relatedness

A New Algorithm: Layered Label Propagation (Boldi, WWW ‘11)

Round: 2 Step: Updating (2)

Node Labels 1 2 3 4 5 6 7 Node Labels 1 2 3 4 5 6 7 2 1 4 7 5 6 3

slide-58
SLIDE 58

Entity Relatedness

A New Algorithm: Layered Label Propagation (Boldi, WWW ‘11)

Round: 2 Step: Updating (2)

Node Labels 1 2 3 4 5 6 7 Node Labels 1 2 3 4 5 6 7 2 1 4 7 5 6 3

Relatedness = Similarity between signatures Signatures

slide-59
SLIDE 59

Document Aboutness

slide-60
SLIDE 60

Document Aboutness

Aboutness = Succinct representation of a document’s subject matter (Hutchins, 1977)

Goal

▷ Weight information (e.g. entities, words, ...) within a document

slide-61
SLIDE 61

Document Aboutness

Aboutness

Entity            Weight
Barack_Obama      0.85
Hillary_Clinton   0.8
Hawaii            0.3
George_W._Bush    0.2
...               ...

slide-62
SLIDE 62

Document Aboutness

Aboutness = Succinct representation of a document’s subject matter (Hutchins, 1977)

Goal

▷ Weight information (e.g. entities, words, ...) within a document
▷ Wide range of practical applications:
a. Recommendation
b. Categorization
c. Exploratory search
d. Web Ranking
e. …

slide-63
SLIDE 63

Document Aboutness

Words Proper nouns Sentences ... Aboutness Entities Dictionary POS tags ... Candidate Extraction Ranking/Classification Subject Matter Identification

Keyphrase Extraction Entity Salience

Interpretation Overgeneration Infrequency Redundancy Issues Dependency on KG ? Entity Annotation

slide-64
SLIDE 64

Keyphrase Extraction: “[...] errors could be addressed using background knowledge.” (Hasan, ACL ‘14)

Entity Salience: “[...] features more directly linked to Wikipedia [...] can provide more focused background information.” (Dunietz, EACL ‘14)

slide-65
SLIDE 65

Document Aboutness

Our Proposal

▷ Entity Salience Approach

Document Enrichment
▷ POS tagging
▷ Mention detection
▷ Dependency parsing
▷ Co-reference resolution

slide-66
SLIDE 66

Document Aboutness

Our Proposal

▷ Entity Salience Approach

Document Enrichment ▷ Entity annotation ▷ Graph of entities ▷ Relatedness

slide-67
SLIDE 67

Document Aboutness

Our Proposal

▷ Entity Salience Approach

Document Enrichment

TextRank

▷ Document Summarizer ▷ Sentence Ranker

0.5 0.23 0.07 0.2

slide-68
SLIDE 68

Document Aboutness

Our Proposal

▷ Entity Salience Approach

Feature Generation Document Enrichment

TextRank

Document

▷ Entity →Feature Vector ▷ Classical syntactic features

■ Frequency ■ Position ■ ...

▷ New syntactic and semantic features:

■ Position-based ■ Dependency-based ■ Relatedness-based ■ Centrality-based ■ ...
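The step from entities to feature vectors can be illustrated with the two classical features named above, frequency and position. This is only a sketch: the input format (entity, token offset), the feature names, and the function `entity_features` are assumptions, and the dependency-, relatedness-, and centrality-based features of the actual system are not reproduced here.

```python
def entity_features(mentions, num_tokens):
    """Build a small feature dict per entity from its mentions in one document."""
    features = {}
    for entity, position in mentions:
        feats = features.setdefault(
            entity, {"frequency": 0, "first_position": 1.0}
        )
        feats["frequency"] += 1
        # Relative offset of the earliest mention (0.0 = start of the document).
        feats["first_position"] = min(feats["first_position"], position / num_tokens)
    return features

# Made-up annotated document: (entity, token offset of the mention).
mentions = [("Barack_Obama", 0), ("Hawaii", 10), ("Barack_Obama", 20)]
features = entity_features(mentions, num_tokens=40)
```

Each resulting dict can then be turned into a fixed-order vector and passed to the classifier that separates salient from non-salient entities.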

slide-69
SLIDE 69

Document Aboutness

Our Proposal

▷ Entity Salience Approach

Feature Generation Document Enrichment Classification

TextRank

Salient Entities vs. Non-salient entities

Document

slide-70
SLIDE 70

Document Aboutness

Our Proposal: Main Contributions

▷ Fully documented system
▷ Publicly available via Web API
▷ Improvement over the state of the art (CMU-Google, F1: 61.5)
○ New York Times dataset (110,000 news articles, 1.3M entities)
○ 62.6 micro-F1 (+1.1%) and 59.5 macro-F1 (+2.5%)
○ More robust when entities are not biased toward the beginning of the document (+9%)
▷ Deep Feature and Error Analysis

slide-71
SLIDE 71

3.

Future Work

slide-72
SLIDE 72

Future Work

▷ Conclude Entity Relatedness
○ Finalize experiments
○ Related vs Non-related
○ Scalability
▷ Improve our Entity Salience System
○ Deep Learning (i.e. w2v)
○ Abstractive Summarization
○ Create and test new datasets
○ Plug in the new TagMe-Wat 2.0 (Piccinno, 2016)
▷ Entity Annotation Improvement

slide-73
SLIDE 73

Thanks!

Any questions?