Graph-based Algorithms in NLP



  1. Graph-based Representation
     • Let G(V, E) be a weighted undirected graph
       – V: the set of nodes in the graph
       – E: the set of weighted edges
     • Edge weights w(u, v) define a measure of pairwise similarity between nodes u, v
     [Figure: example graph with edge weights 0.2, 0.4, 0.4, 0.3, 0.7, 0.1]
     • In many NLP problems, entities are connected by a range of relations
     • A graph is a natural way to capture connections between entities
     • Applications of graph-based algorithms in NLP:
       – Find entities that satisfy certain structural properties defined with respect to other entities
       – Find globally optimal solutions given relations between entities
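The weighted undirected graph above can be sketched as a plain adjacency dictionary; the node labels and the helper name `add_edge` are illustrative assumptions, and the weights reuse the example values from the figure:

```python
def add_edge(graph, u, v, w):
    """Insert an undirected edge with similarity weight w(u, v)."""
    graph.setdefault(u, {})[v] = w
    graph.setdefault(v, {})[u] = w  # undirected: store both directions

# Build a small similarity graph over four nodes.
G = {}
add_edge(G, 1, 2, 0.2)
add_edge(G, 2, 3, 0.4)
add_edge(G, 3, 4, 0.7)
```

Because the graph is undirected, w(u, v) is stored symmetrically, so `G[1][2]` and `G[2][1]` return the same weight.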

  2. Examples of Graph-based Representations

     Data           Directed?   Node       Edge
     Web            yes         page       link
     Citation Net   yes         citation   reference relation
     Text           no          sentence   semantic connectivity

     Analysis of the Link Structure
     • Assumption: the creator of page p, by including a link to page q, has in some measure conferred authority on q
     • Issues to consider:
       – Some links are not indicative of authority (e.g., navigational links)
       – We need to find an appropriate balance between the criteria of relevance and popularity

     Hubs and Authorities (Kleinberg, 1998)
     • Application context: information retrieval
     • Task: retrieve documents relevant to a given query
     • Naive solution: text-based search
       – Some relevant pages omit query terms
       – Some irrelevant pages do include query terms
     • We need to take into account the authority of the page!

     Outline of the Algorithm
     • Compute a focused subgraph given a query
     • Iteratively compute hubs and authorities in the subgraph

  3. Focused Subgraph
     • Subgraph G[W] over W ⊆ V, where the edges correspond to all the links between pages in W
     • How to construct G_σ for a query string σ?
       – G_σ has to be relatively small
       – G_σ has to be rich in relevant pages
       – G_σ must contain most of the strongest authorities

     Constructing a Focused Subgraph: Notation
     • Subgraph(σ, Eng, t, d)
       – σ: a query string
       – Eng: a text-based search engine
       – t, d: natural numbers
     • Let R_σ denote the top t results of Eng on σ (the root set; S_σ is the base set)

     Constructing a Focused Subgraph: Algorithm
       Set S_σ := R_σ
       For each page p ∈ R_σ:
         Let Γ+(p) denote the set of all pages p points to
         Let Γ−(p) denote the set of all pages pointing to p
         Add all pages in Γ+(p) to S_σ
         If |Γ−(p)| ≤ d then
           Add all pages in Γ−(p) to S_σ
         Else
           Add an arbitrary set of d pages from Γ−(p) to S_σ
       End
       Return S_σ
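The Subgraph(σ, Eng, t, d) procedure can be sketched as follows. The search engine and the web's link structure are stubbed out as dictionaries (`out_links` for Γ+, `in_links` for Γ−); those stub names, and using `sorted(...)[:d]` to pick the "arbitrary" d predecessors deterministically, are assumptions of this sketch:

```python
def focused_subgraph(root_set, out_links, in_links, d):
    """Expand the root set R_sigma into the base set S_sigma."""
    s = set(root_set)                    # S_sigma := R_sigma
    for p in root_set:
        s.update(out_links.get(p, []))   # add all pages in Gamma+(p)
        preds = in_links.get(p, [])
        if len(preds) <= d:
            s.update(preds)              # add all pages in Gamma-(p)
        else:
            s.update(sorted(preds)[:d])  # an (arbitrary) set of d pages
    return s
```

For example, with `out_links = {"p1": ["a", "b"], "p2": ["b"]}`, `in_links = {"p1": ["x", "y", "z"], "p2": ["x"]}`, and d = 2, the base set for the root set `["p1", "p2"]` is `{"p1", "p2", "a", "b", "x", "y"}`: p1's three predecessors are capped at two, p2's single predecessor is added whole.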

  4. Computing Hubs and Authorities
     • Authorities should have considerable overlap in terms of the pages pointing to them
     • Hubs are pages that have links to multiple authoritative pages
     • Hubs and authorities exhibit a mutually reinforcing relationship

     I operation: given {y(p)}, compute
       x(p) ← Σ_{q:(q,p)∈E} y(q)
     (x[p] := sum of y[q] for all q pointing to p)

     O operation: given {x(p)}, compute
       y(p) ← Σ_{q:(p,q)∈E} x(q)
     (y[p] := sum of x[q] for all q pointed to by p)

     An Iterative Algorithm
     • For each page p, compute an authority weight x(p) and a hub weight y(p)
       – x(p) ≥ 0, y(p) ≥ 0
       – Σ_{p∈S_σ} (x(p))² = 1, Σ_{p∈S_σ} (y(p))² = 1
     • Report the top-ranking hubs and authorities
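The I and O operations translate directly from the definitions. Representing E as a list of (source, target) pairs and weights as dictionaries is an assumption of this sketch:

```python
def i_operation(edges, y):
    """I: x(p) <- sum of y(q) over all q with an edge (q, p) into p."""
    x = {p: 0.0 for p in y}
    for q, p in edges:
        x[p] += y[q]
    return x

def o_operation(edges, x):
    """O: y(p) <- sum of x(q) over all q that p points to."""
    y = {p: 0.0 for p in x}
    for p, q in edges:
        y[p] += x[q]
    return y
```

With edges `[("q1", "p"), ("q2", "p")]` and all weights 1, the I operation gives page `p` an authority weight of 2, reflecting its two in-links.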

  5. Convergence
     Theorem: the sequences x_1, x_2, x_3, … and y_1, y_2, y_3, … converge.
     • Let A be the adjacency matrix of G_σ
     • Authorities are computed as the principal eigenvector of AᵀA
     • Hubs are computed as the principal eigenvector of AAᵀ

     Algorithm: Iterate(G, k)
       G: a collection of n linked pages; k: a natural number
       Let z denote the vector (1, 1, 1, …, 1) ∈ Rⁿ
       Set x_0 := z; set y_0 := z
       For i = 1, 2, …, k:
         Apply the I operation to (x_{i−1}, y_{i−1}), obtaining new x-weights x′_i
         Apply the O operation to (x′_i, y_{i−1}), obtaining new y-weights y′_i
         Normalize x′_i, obtaining x_i
         Normalize y′_i, obtaining y_i
       Return (x_k, y_k)

     Algorithm: Filter(G, k, c)
       G: a collection of n linked pages; k, c: natural numbers
       (x_k, y_k) := Iterate(G, k)
       Report the pages with the c largest coordinates in x_k as authorities
       Report the pages with the c largest coordinates in y_k as hubs

     Subgraph obtained from www.honda.com:
       Honda                       http://www.honda.com
       Ford Motor Company          http://www.ford.com
       Campaign for Free Speech    http://www.eff.org/blueribbon.html
       Welcome to Magellan!        http://www.mckinley.com
       Welcome to Netscape!        http://www.netscape.com
       LinkExchange — Welcome      http://www.linkexchange.com
       Welcome to Toyota           http://www.toyota.com
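Iterate and Filter together give the whole hubs-and-authorities loop. A minimal sketch, assuming nodes are hashable labels, edges are (source, target) pairs, and normalization is the L2 norm required by the Σ(x(p))² = 1 constraint:

```python
import math

def iterate_hits(nodes, edges, k):
    """Iterate(G, k): k rounds of the I and O operations with L2 normalization."""
    x = {p: 1.0 for p in nodes}  # x_0 := z = (1, ..., 1)
    y = {p: 1.0 for p in nodes}  # y_0 := z
    for _ in range(k):
        # I operation: x'(p) = sum of y(q) over edges (q, p)
        x = {p: sum(y[q] for q, r in edges if r == p) for p in nodes}
        # O operation: y'(p) = sum of x'(q) over edges (p, q)
        y = {p: sum(x[q] for r, q in edges if r == p) for p in nodes}
        for vec in (x, y):  # normalize so squared weights sum to 1
            norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
            for p in vec:
                vec[p] /= norm
    return x, y

def filter_hits(nodes, edges, k, c):
    """Filter(G, k, c): report the c strongest authorities and hubs."""
    x, y = iterate_hits(nodes, edges, k)
    authorities = sorted(x, key=x.get, reverse=True)[:c]
    hubs = sorted(y, key=y.get, reverse=True)[:c]
    return authorities, hubs
```

On a toy graph where two hub pages both link to one page, that page's authority weight converges to 1 after normalization, mirroring how www.honda.com's neighborhood surfaces the strongest car-site authorities above.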

  6. Authorities obtained from www.honda.com:
       0.202   Welcome to Toyota            http://www.toyota.com
       0.199   Honda                        http://www.honda.com
       0.192   Ford Motor Company           http://www.ford.com
       0.173   BMW of North America, Inc.   http://www.bmwusa.com
       0.162   VOLVO                        http://www.volvo.com
       0.158   Saturn Web Site              http://www.saturncars.com
       0.155   NISSAN                       http://www.nissanmotors.com

     Intuitive Justification
     From "The Anatomy of a Large-Scale Hypertextual Web Search Engine" (Brin & Page, 1998):
       "PageRank can be thought of as a model of user behavior. We assume there is a 'random surfer' who is given a web page at random and keeps clicking on links, never hitting 'back', but eventually gets bored and starts on another random page. The probability that the random surfer visits a page is its PageRank. And, the d damping factor is the probability at each page the 'random surfer' will get bored and request another random page."

     PageRank Algorithm (Brin & Page, 1998)
     • Original Google ranking algorithm
     • Similar idea to Hubs and Authorities
     • Key differences:
       – Authority of each page is computed off-line
       – Query relevance is computed on-line
         ∗ Anchor text
         ∗ Text on the page
       – The prediction is based on the combination of authority and relevance

     PageRank Computation
     • Iterate the PR(p) computation over the pages q_1, …, q_n that point to page p:
       PR(p) = (1 − d) + d · (PR(q_1)/C(q_1) + … + PR(q_n)/C(q_n))
     • d is a damping factor (typically set to 0.85)
     • C(p) is the out-degree of p
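The PR(p) update can be sketched as a fixed number of synchronous iterations over a link dictionary. This uses the unnormalized (1 − d) form shown above with d = 0.85; the fixed iteration count is an assumption of the sketch (in practice one iterates to a convergence tolerance):

```python
def pagerank(links, d=0.85, iters=50):
    """Iterate PR(p) = (1 - d) + d * sum(PR(q) / C(q)) over in-linking pages q."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    pr = {p: 1.0 for p in pages}
    for _ in range(iters):
        new = {p: 1.0 - d for p in pages}   # every page keeps the (1 - d) base
        for q, outs in links.items():
            if outs:
                share = d * pr[q] / len(outs)  # C(q) is the out-degree of q
                for p in outs:
                    new[p] += share
        pr = new
    return pr
```

On the small cycle `{"a": ["c"], "b": ["c"], "c": ["a"]}`, page c ends up ranked highest: it is the only page with two in-links, while b, with no in-links, settles at the (1 − d) floor of 0.15.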

  7. Notes on PageRank
     • PageRank forms a probability distribution over web pages
     • PageRank corresponds to the principal eigenvector of the normalized link matrix of the web

     Text as a Graph
     [Figure: sentences S1–S6 as the nodes of a sentence-similarity graph]

     Extractive Text Summarization
     • Task: extract the important information from a text

     Centrality-based Summarization (Radev)
     • Assumption: the centrality of a node is an indication of its importance
     • Representation: a connectivity matrix based on intra-sentence cosine similarity
     • Extraction mechanism:
       – Compute a PageRank score for every sentence u:
           PageRank(u) = (1 − d)/N + d · Σ_{v∈adj[u]} PageRank(v)/deg(v)
         where N is the number of nodes in the graph
       – Extract the k sentences with the highest PageRank scores
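The centrality-based extraction mechanism can be sketched end to end: build a sentence graph from cosine similarity, run the PageRank variant above, and keep the top k sentences. The bag-of-words vectors, the 0.1 similarity threshold, and the iteration count are assumptions of this sketch, not values from the slides:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def summarize(sentences, k, d=0.85, iters=50, threshold=0.1):
    """Score sentences by PageRank(u) = (1-d)/N + d * sum over adj[u]; return top k."""
    vecs = [Counter(s.lower().split()) for s in sentences]
    n = len(sentences)
    # adj[u]: sentences whose cosine similarity with u exceeds the threshold
    adj = {u: [v for v in range(n) if v != u and cosine(vecs[u], vecs[v]) > threshold]
           for u in range(n)}
    score = {u: 1.0 / n for u in range(n)}
    for _ in range(iters):
        score = {u: (1 - d) / n
                    + d * sum(score[v] / max(len(adj[v]), 1) for v in adj[u])
                 for u in range(n)}
    top = sorted(range(n), key=score.get, reverse=True)[:k]
    return [sentences[u] for u in sorted(top)]  # report in document order
```

Sentences that share vocabulary with many others sit in densely connected parts of the graph and accumulate score, which is exactly the centrality assumption above.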

  8. Does it work?
     • Evaluation: comparison with a human-created summary
     • ROUGE measure: weighted n-gram overlap (similar to BLEU)

       Method     ROUGE score
       Random     0.3261
       Lead       0.3575
       Degree     0.3595
       PageRank   0.3666

     Min-Cut: Definitions
     • Graph cut: a partitioning of the graph into two disjoint sets of nodes A, B
     • Graph cut weight: cut(A, B) = Σ_{u∈A, v∈B} w(u, v)
       – i.e., the sum of the crossing edge weights
     • Minimum cut: the cut that minimizes cross-partition similarity
     [Figure: two cuts of the example graph with edge weights 0.2, 0.4, 0.4, 0.3, 0.7, 0.1]

     Graph-Based Algorithms in NLP (recap)
     • Applications of graph-based algorithms in NLP:
       – Find entities that satisfy certain structural properties defined with respect to other entities
       – Find globally optimal solutions given relations between entities
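The cut-weight definition translates directly; representing the undirected graph as (u, v, w) edge triples is an assumed encoding for this sketch:

```python
def cut_weight(edges, a, b):
    """cut(A, B): sum of w(u, v) over edges with one endpoint in each partition."""
    total = 0.0
    for u, v, w in edges:
        # undirected graph: an edge crosses the cut in either orientation
        if (u in a and v in b) or (u in b and v in a):
            total += w
    return total

# A path graph 1-2-3-4 with the example weights; cutting between 2 and 3
# severs only the 0.4 edge, so cut({1,2}, {3,4}) = 0.4.
edges = [(1, 2, 0.2), (2, 3, 0.4), (3, 4, 0.7)]
```

Finding the minimum cut means searching over partitions for the smallest such weight; on small graphs this can be brute-forced, while efficient algorithms (e.g., max-flow based) are used in practice.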
