Measuring Hyperlink Distances: Wikipedia Case Study




SLIDE 1

Measuring Hyperlink Distances

Wikipedia case study

Rodrigo R. Paim

Daniel R. Figueiredo

Universidade Federal do Rio de Janeiro

2nd Workshop of Brazilian Institute for Web Science Research

SLIDE 2

Paim & Figueiredo - 2011

Webpages and Hyperlinks

SLIDE 3

The Web Graph

Webpages → Vertices

Hyperlinks → Directed Edges

[Figure: example web graph with the pages UFRJ, Campus, Rio, Brazil]
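The mapping of webpages to vertices and hyperlinks to directed edges can be sketched as a plain adjacency dictionary. This is only an illustration: the page names come from the slide's figure, but the specific links between them and the helper names `out_links` and `in_links` are assumptions made for the example.

```python
# Toy web graph: each key is a webpage (vertex) and each value lists the
# pages it links to (directed edges). The links below are invented.
web_graph = {
    "UFRJ": ["Campus", "Rio"],
    "Campus": ["UFRJ"],
    "Rio": ["Brazil"],
    "Brazil": ["Rio"],
}


def out_links(graph, page):
    """Pages reachable from `page` by following one hyperlink."""
    return list(graph.get(page, []))


def in_links(graph, page):
    """Pages that contain a hyperlink pointing to `page`."""
    return sorted(src for src, targets in graph.items() if page in targets)


if __name__ == "__main__":
    print(out_links(web_graph, "UFRJ"))
    print(in_links(web_graph, "Rio"))
```

Because edges are directed, a link from UFRJ to Rio does not imply a link back from Rio to UFRJ, which is why out-links and in-links are computed separately.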

SLIDE 4

Hyperlink Distance

Webpages have specific content

It cannot be obtained from the structure of the web alone

New concept to analyze “connected webpages”

Distance is inversely proportional to contextual similarity

[Figure: pages Physics, Maths, Economic Crisis, Europe, Greece, with hyperlink distances d1, d2, d3, where d1 < d2, d3]

SLIDE 5

Measuring Distances

Multiple distance metrics

Variation of Jaccard distance

IDF-based

Keywords play a crucial role

Indicate context of a webpage
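As a rough illustration of keyword-based distances, the sketch below computes the standard Jaccard distance between two keyword sets and one possible IDF-weighted variant. The slides do not specify the exact variation or the IDF formula used, so the weighting scheme and the function names here are assumptions, not the authors' definitions.

```python
import math


def jaccard_distance(a, b):
    """Standard Jaccard distance: 1 - |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty keyword sets are treated as identical
    return 1.0 - len(a & b) / len(union)


def idf_weights(keyword_sets):
    """Assumed IDF weight: log(N / df), with df = number of pages using the keyword."""
    n = len(keyword_sets)
    df = {}
    for keywords in keyword_sets:
        for k in set(keywords):
            df[k] = df.get(k, 0) + 1
    return {k: math.log(n / count) for k, count in df.items()}


def weighted_jaccard_distance(a, b, weights):
    """Jaccard distance where each keyword counts by its IDF weight (an assumption)."""
    a, b = set(a), set(b)
    shared = sum(weights.get(k, 0.0) for k in a & b)
    total = sum(weights.get(k, 0.0) for k in a | b)
    if total == 0:
        return 0.0 if a == b else 1.0
    return 1.0 - shared / total


if __name__ == "__main__":
    pages = [{"science", "physics"}, {"science", "maths"}, {"economy", "europe"}]
    w = idf_weights(pages)
    print(jaccard_distance(pages[0], pages[1]))
    print(weighted_jaccard_distance(pages[0], pages[1], w))
```

With IDF weighting, sharing a keyword that appears on many pages brings two pages less close together than sharing a rare one, which is the usual motivation for this kind of weighting.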

Why measure distances?

SLIDE 6

Navigating the Web

Can one go from any webpage to another using local information only?

[Image from The Opte Project website]

SLIDE 7

Navigating the Web

[Figure: pages Food and Car Safety]

SLIDE 8

Navigating the Web

[Figure: pages Food, Food Industry, Agriculture, Restaurant, Car Safety]

SLIDE 9

Navigating the Web

[Figure: pages Food, Food Industry, Manufacturing, Car Safety, Agriculture, Restaurant, Nestlé, Food Processing]

SLIDE 10

Navigating the Web

[Figure: pages Automotive Industry, Food, Food Industry, Manufacturing, Car Safety, Automobile, Agriculture, Restaurant, Nestlé, Food Processing]

SLIDE 11

Navigating the Web

[Figure: pages Automotive Industry, Food, Food Industry, Manufacturing, Car Safety, Automobile, Agriculture, Restaurant, Nestlé, Food Processing]

SLIDE 12

Problem Formulation

Decentralized greedy algorithm

From any u to any v

Choose the hyperlink closest to the destination

Local information only

Can it reach the destination?

In few steps

Using hyperlink distances
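The formulation above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `greedy_navigate`, the step limit, the rule that revisiting a page counts as failure, and the toy graph with its made-up distances to the destination are all hypothetical.

```python
def greedy_navigate(graph, distance, source, target, max_steps=100):
    """Follow, at each page, the out-link whose page is closest to `target`.

    Only the current page's out-links are inspected (local information).
    Returns (path, reached). The walk stops without success if the current
    page has no out-links or if the chosen page was already visited, since
    a deterministic greedy walk would then loop forever.
    """
    path = [source]
    visited = {source}
    current = source
    while current != target and len(path) <= max_steps:
        neighbors = graph.get(current, [])
        if not neighbors:
            return path, False
        nxt = min(neighbors, key=lambda page: distance(page, target))
        if nxt in visited:
            return path, False
        path.append(nxt)
        visited.add(nxt)
        current = nxt
    return path, current == target


# Toy example loosely based on the page names from the earlier slides.
# The links and the distances to "Car Safety" are invented for illustration.
toy_graph = {
    "Food": ["Restaurant", "Food Industry"],
    "Restaurant": ["Food"],
    "Food Industry": ["Manufacturing"],
    "Manufacturing": ["Automotive Industry"],
    "Automotive Industry": ["Car Safety"],
    "Car Safety": [],
}
toy_distance_to_car_safety = {
    "Food": 0.9, "Restaurant": 0.95, "Food Industry": 0.8,
    "Manufacturing": 0.6, "Automotive Industry": 0.3, "Car Safety": 0.0,
}


def toy_distance(page, target):
    # Only valid for target == "Car Safety" in this toy setup.
    return toy_distance_to_car_safety[page]


if __name__ == "__main__":
    print(greedy_navigate(toy_graph, toy_distance, "Food", "Car Safety"))
```

The walk never looks beyond the out-links of the current page, which is what makes the algorithm decentralized.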

SLIDE 13

Case Study

Wikipedia

Two major sets

D: Documents (articles)

C: Categories

Keywords of documents

Wikipedia web graph

Vertices (~3.6 M)

Edges (~100 M)

SLIDE 14

Navigation Algorithms

BFS

Minimum distance with global knowledge

Random Walk

Next webpage chosen randomly

Greedy Algorithm

Only closest neighbor (dead ends)

Modified Greedy

Closest neighbor (not visited yet)
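The BFS baseline, the random walk and the modified greedy rule can be sketched side by side. This is a hedged sketch: the function names, step limits, and the choice to give up (rather than backtrack) when every neighbor has already been visited are assumptions, since the slides do not spell out these details.

```python
import random
from collections import deque


def bfs_distance(graph, source, target):
    """Minimum number of hyperlinks from source to target (global knowledge)."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        page, hops = queue.popleft()
        for nxt in graph.get(page, []):
            if nxt == target:
                return hops + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None  # target not reachable


def random_walk(graph, source, target, max_steps=1000, rng=None):
    """Follow a uniformly random out-link at each step; return steps taken or None."""
    rng = rng or random.Random()
    current, steps = source, 0
    while current != target and steps < max_steps:
        neighbors = graph.get(current, [])
        if not neighbors:
            return None
        current = rng.choice(neighbors)
        steps += 1
    return steps if current == target else None


def modified_greedy(graph, distance, source, target, max_steps=1000):
    """Move to the closest neighbor that has not been visited yet."""
    visited = {source}
    current, steps = source, 0
    while current != target and steps < max_steps:
        candidates = [p for p in graph.get(current, []) if p not in visited]
        if not candidates:
            return None  # assumption: no backtracking
        current = min(candidates, key=lambda p: distance(p, target))
        visited.add(current)
        steps += 1
    return steps if current == target else None
```

Comparing the step counts of the local strategies with `bfs_distance` on the same source and destination pair gives a measure of how far each one is from the optimal path length.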

SLIDE 15

Results

SLIDE 16

Results

96.45%

SLIDE 17

Results

16.88%

SLIDE 18

Results

5.46% Dead Ends

SLIDE 19

Results

45.27%

SLIDE 20

Conclusion

Greedy performs worse than Random Walk

But Modified Greedy performs better

Only for big distances

However, far from optimal (BFS)

Categories are not a good “compass”

Ongoing work:

How to define a better greedy algorithm?

SLIDE 21

Thanks for your attention!

Rodrigo R. Paim

Daniel R. Figueiredo

LAND – PESC/COPPE - UFRJ

Measuring Hyperlink Distances: Wikipedia Case Study

ACM WebSci'11 (Extended Abstract)