Measuring Hyperlink Distances
Wikipedia case study
Rodrigo R. Paim
Daniel R. Figueiredo
Universidade Federal do Rio de Janeiro
2nd Workshop of Brazilian Institute for Web Science Research
Paim & Figueiredo - 2011
Webpages and Hyperlinks
The Web Graph
Webpages → Vertices
Hyperlinks → Directed Edges
[Figure: example web graph; vertex labels: UFRJ Campus Rio Brazil]
Webpages have some specific content
Can’t get it from structure of the web
New concept to analyze “connected webpages”
Inversely proportional to contextual similarity
Hyperlink Distance
[Figure: webpages labeled Physics, Maths, Economic Crisis, Europe, Greece, with hyperlink distances d1, d2, d3]
d1 < d2, d3
Measuring Distances
Multiple distance metrics
Variation of Jaccard distance
IDF-based
Keywords play a crucial role
Indicate context of a webpage
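The slides do not spell out the metrics, so the following is only a minimal sketch of what a keyword-based distance could look like. It shows the plain Jaccard distance and one possible IDF-weighted variant; the function names, the weighting scheme, and the example keyword sets are assumptions, not the authors' definitions.

```python
import math

def jaccard_distance(a, b):
    """Plain Jaccard distance between two keyword sets: 1 - |a & b| / |a | b|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def idf_weights(keyword_sets):
    """idf(k) = log(N / df(k)), where df(k) counts the pages that carry keyword k."""
    n = len(keyword_sets)
    df = {}
    for keywords in keyword_sets:
        for k in set(keywords):
            df[k] = df.get(k, 0) + 1
    return {k: math.log(n / count) for k, count in df.items()}

def idf_jaccard_distance(a, b, idf):
    """Assumed variant: Jaccard distance with each keyword weighted by its IDF,
    so that sharing a rare keyword counts more than sharing a common one."""
    a, b = set(a), set(b)
    union_weight = sum(idf.get(k, 0.0) for k in a | b)
    if union_weight == 0:
        return 0.0
    shared_weight = sum(idf.get(k, 0.0) for k in a & b)
    return 1.0 - shared_weight / union_weight

# Hypothetical keyword sets for three pages.
physics = {"Science", "Physics"}
maths = {"Science", "Mathematics"}
greece = {"Countries", "Europe"}

idf = idf_weights([physics, maths, greece])
d_near = jaccard_distance(physics, maths)    # pages sharing a keyword
d_far = jaccard_distance(physics, greece)    # pages sharing none
d_near_idf = idf_jaccard_distance(physics, maths, idf)
```

With these toy sets, pages that share a keyword end up closer than pages that share none, which is the behavior expected of a distance that is inversely related to contextual similarity.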
Why measure distances?
Image from The Opte Project Website
Can one go from any webpage to another using local information only?
Navigating the Web
[Figure, built up over several slides: navigating between the pages Food and Car Safety. Pages shown: Food, Food Industry, Agriculture, Restaurant, Manufacturing, Nestlé, Food Processing, Automotive Industry, Automobile, Car Safety]
Problem Formulation
Decentralized greedy algorithm
From any u to any v
Choose closest hyperlink to destination
Local information only
Can it reach destination?
In few steps
Using hyperlink distances
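The formulation above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: `links`, `dist`, and `greedy_navigate` are hypothetical names, the toy link structure and distance values are invented, and treating a step back to an already visited page as a dead end is an assumption.

```python
def greedy_navigate(links, dist, source, target, max_steps=50):
    """From the current page, always follow the out-link closest to target.

    Uses only local information: the out-links of the current page and the
    hyperlink distance from each linked page to the target.
    Returns (path, reached).
    """
    path = [source]
    visited = {source}
    current = source
    while current != target and len(path) - 1 < max_steps:
        out_links = links.get(current, [])
        if not out_links:
            return path, False      # no hyperlink to follow
        nxt = min(out_links, key=lambda page: dist(page, target))
        if nxt in visited:
            return path, False      # greedy choice loops back (assumed dead end)
        path.append(nxt)
        visited.add(nxt)
        current = nxt
    return path, current == target

# Invented toy web graph reusing page names from the figure.
links = {
    "Food": ["Food Industry", "Agriculture", "Restaurant"],
    "Food Industry": ["Food", "Manufacturing", "Nestlé", "Food Processing"],
    "Manufacturing": ["Automotive Industry"],
    "Automotive Industry": ["Automobile"],
    "Automobile": ["Car Safety"],
}

# Invented distances to the single target "Car Safety".
toy_distance = {
    "Food": 0.9, "Food Industry": 0.8, "Agriculture": 0.95, "Restaurant": 0.95,
    "Manufacturing": 0.6, "Nestlé": 0.85, "Food Processing": 0.8,
    "Automotive Industry": 0.4, "Automobile": 0.2, "Car Safety": 0.0,
}

def dist(page, target):
    # Only valid for target == "Car Safety" in this toy example.
    return toy_distance[page]

path, reached = greedy_navigate(links, dist, "Food", "Car Safety")
```

On this toy graph the walk reaches the destination because each greedy choice happens to lead toward it; nothing guarantees that in general, which is the question the slide raises.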
Case Study
Wikipedia
Two major sets
D: Documents (articles)
C: Categories
Keywords of documents
Wikipedia web graph
Vertices (~3.6 M)
Edges (~100 M)
Navigation Algorithms
BFS
Minimum distance with global knowledge
Random Walk
Next webpage chosen randomly
Greedy Algorithm
Only closest neighbor (dead ends)
Modified Greedy
Closest neighbor (not visited yet)
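A rough sketch of three of the strategies listed above, to make the comparison concrete. All names, the step limits, the toy graph, and the distance values are assumptions for illustration; the slides do not give the authors' code.

```python
import random
from collections import deque

def bfs_distance(links, source, target):
    """Minimum number of clicks from source to target (needs global knowledge)."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        page, steps = queue.popleft()
        if page == target:
            return steps
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, steps + 1))
    return None  # target not reachable

def random_walk(links, source, target, max_steps, rng):
    """Follow a uniformly random out-link at each step."""
    current = source
    for steps in range(max_steps + 1):
        if current == target:
            return steps
        out_links = links.get(current, [])
        if not out_links:
            return None
        current = rng.choice(out_links)
    return None  # step budget exhausted

def modified_greedy(links, dist, source, target, max_steps):
    """Follow the closest out-link among pages not visited yet."""
    visited = {source}
    current = source
    for steps in range(max_steps + 1):
        if current == target:
            return steps
        candidates = [p for p in links.get(current, []) if p not in visited]
        if not candidates:
            return None  # every neighbor already visited
        current = min(candidates, key=lambda p: dist(p, target))
        visited.add(current)
    return None  # step budget exhausted

# Invented graph where a plain greedy walk would bounce between A and B.
toy_links = {"A": ["B", "C"], "B": ["A", "C"], "C": ["T"]}
toy_dist = {"A": 0.3, "B": 0.2, "C": 0.5, "T": 0.0}

def toy_distance(page, target):
    # Only valid for target == "T" in this toy example.
    return toy_dist[page]

optimal = bfs_distance(toy_links, "A", "T")
greedy_steps = modified_greedy(toy_links, toy_distance, "A", "T", max_steps=10)
walk_steps = random_walk(toy_links, "A", "T", max_steps=10, rng=random.Random(0))
```

In this toy case the modified greedy walk reaches the target by skipping the visited page, at the cost of one more step than the BFS optimum.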
Results
[Figure: result plots, with the following values highlighted on successive slides]
96.45%
16.88%
5.46% Dead Ends
45.27%
Conclusion
Greedy performs worse than Random Walk
But Modified Greedy performs better
Only for big distances
However, far from optimal (BFS)
Categories are not a good “compass”
Ongoing work:
How to define a better greedy algorithm?