Estimating Peer Similarity using Distance of Shared Files



SLIDE 1

Estimating Peer Similarity using Distance of Shared Files

Yuval Shavitt, Ela Weinsberg, Udi Weinsberg Tel-Aviv University

SLIDE 2

Problem Setting

Peer-to-Peer (p2p) networks are used by millions for sharing content
Increasingly difficult to find useful content

  • Noise in user generated content (meta-data)
  • Extreme dimensions
  • Sparseness

2 Udi Weinsberg, IPTPS, April 2010

SLIDE 3

Work Goal

Suggest a new metric for peer similarity

  • Overcome the sparseness problem

Improve ability to find content

  • Search algorithms: similar peers are likely to hold relevant content
  • Collaborative filtering: find “like-minded” peers

SLIDE 4

Key Concept

Build a file similarity graph

  • Use data about all shared files
  • Weights of edges = distance between files

Peer similarity is calculated using the distance between their shared files

  • No need for overlapping content between peers

SLIDE 5

Dataset

Active crawl of Gnutella in 2007
Crawled 1.2 million peers
Only 35% of songs contain meta-data
530k distinct songs

  • Identified using “title|artist”
  • Accounting for spelling mistakes with edit distance
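
The identification step above can be sketched as follows. This is only an illustration: the lowercasing, the `max_edits` threshold, and the helper names `song_key` and `canonical_key` are assumptions, not details given in the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character of a
                            curr[j - 1] + 1,      # insert a character of b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def song_key(title: str, artist: str) -> str:
    """Build a "title|artist" key (case folding is an assumption)."""
    return f"{title.strip().lower()}|{artist.strip().lower()}"


def canonical_key(key: str, known_keys: list, max_edits: int = 2) -> str:
    """Map a key to an already-seen key within max_edits (hypothetical threshold)."""
    for known in known_keys:
        if edit_distance(key, known) <= max_edits:
            return known
    known_keys.append(key)
    return key
```

A linear scan over all known keys would be slow for 530k songs; it is used here only to keep the sketch short.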

SLIDE 6

Dataset Statistics

Using a sample of 100k peers (<10%)
Over 511k songs remain (96%)

Power-law popularity distribution
98% of the peers share fewer than 50 songs

SLIDE 7

Sparseness Problem

Median maximal overlap is 20%
Peers with very few popular songs

SLIDE 8

File Similarity Graph

Files are vertices
Link weight is the number of peers sharing both

Normalize similarity with popularity: (equation not preserved)

Filter

  • Keep only top 40%
  • And no less than 10

Power-law distribution, filter causes distortion
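
A minimal sketch of building such a graph from per-peer file sets. The link weight (number of peers sharing both files) follows the slide; the normalization shown, dividing by the square root of the product of the two popularities, is an assumption, since the slide's own formula is not available here. The function name `build_similarity_graph` is hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def build_similarity_graph(peer_files):
    """peer_files maps a peer id to the set of file ids it shares.

    Returns a dict mapping each file pair (a, b) with a < b to a
    normalized similarity.
    """
    popularity = Counter()      # number of peers sharing each file
    shared_by_both = Counter()  # number of peers sharing both files of a pair
    for files in peer_files.values():
        popularity.update(files)
        for a, b in combinations(sorted(files), 2):
            shared_by_both[(a, b)] += 1
    # ASSUMED normalization (cosine-style); not taken from the slides.
    return {
        (a, b): count / sqrt(popularity[a] * popularity[b])
        for (a, b), count in shared_by_both.items()
    }
```

With this choice, two files that are always shared together get similarity 1, and a very popular file does not dominate simply because it appears in many libraries.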

SLIDE 9

Peer Similarity Estimation (1)

Create a bipartite graph connecting the files of every two peers

Connect files in the two sides with links:

  • If exact same file – weight is 1
  • Otherwise – use normalized similarity along the shortest path between the files
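
One way to compute these link weights is sketched below. The slides do not state how similarities combine along a path, so this sketch assumes the path similarity is the product of the normalized similarities on its links; maximizing that product is the same as running Dijkstra on link lengths of -log(similarity). The names `path_similarity` and `bipartite_weights` are hypothetical.

```python
import heapq
from math import exp, log


def path_similarity(graph, source, target):
    """graph maps a file to {neighbor: similarity}, similarity in (0, 1].

    Returns 1.0 for the exact same file, 0.0 if no path exists, and
    otherwise the largest product of similarities over any path
    (ASSUMED combination rule).
    """
    if source == target:
        return 1.0
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == target:
            return exp(-dist)
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, sim in graph.get(node, {}).items():
            new_dist = dist - log(sim)  # -log(sim) >= 0 because sim <= 1
            if new_dist < best.get(neighbor, float("inf")):
                best[neighbor] = new_dist
                heapq.heappush(heap, (new_dist, neighbor))
    return 0.0


def bipartite_weights(graph, files_a, files_b):
    """Link weights between every file of peer A and every file of peer B."""
    return {(fa, fb): path_similarity(graph, fa, fb)
            for fa in files_a for fb in files_b}
```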

SLIDE 10

Distance Estimation

[Figure: distance estimation example with link weights 0.2, 0.8, 0.5, 1 and 0.9]

SLIDE 11

Peer Similarity Estimation (2)

Run maximal weighted matching on the bipartite graph

  • Find the “best” matching links between files
  • The matching M is the sum of the link weights

Peer similarity: (equation not preserved)
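
The matching step can be illustrated with a brute-force search, which is only practical for very small file sets; a real implementation would use a Hungarian-style assignment solver instead. The function name `matching_weight` and the example weights are made up for illustration, and the sketch returns only the matching weight M, not the final peer similarity.

```python
from itertools import permutations


def matching_weight(weights, files_a, files_b):
    """Weight M of a maximum weighted bipartite matching, by exhaustive search.

    weights maps (file_of_a, file_of_b) to a non-negative link weight;
    missing pairs count as 0.
    """
    small, large = list(files_a), list(files_b)
    swapped = len(small) > len(large)
    if swapped:
        small, large = large, small
    best = 0.0
    # With non-negative weights it is enough to try every way of assigning
    # each file on the smaller side to a distinct file on the larger side.
    for chosen in permutations(large, len(small)):
        total = 0.0
        for x, y in zip(small, chosen):
            key = (y, x) if swapped else (x, y)
            total += weights.get(key, 0.0)
        best = max(best, total)
    return best
```

For example, with weights `{("a1", "b1"): 1.0, ("a1", "b2"): 0.9, ("a2", "b1"): 0.8, ("a2", "b2"): 0.2, ("a2", "b3"): 0.5}`, greedily taking the exact match a1-b1 first gives 1.0 + 0.5 = 1.5, while the matching a1-b2, a2-b1 gives 1.7.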

SLIDE 12

Maximal Weighted Matching

[Figure: matching example with link weights 0.2 and 0.5]

SLIDE 13

Distance Estimation Issues

File similarity graph can have several connected components

  • Some distances are infinite

All pairs shortest paths can be costly

  • Reduce the size of the similarity graph
  • Limit the search depth

SLIDE 14

Reducing Similarity Graph Size

For each file, take only the top N nearest neighboring files
Distributions almost overlap for N≥10
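
A small sketch of this pruning, assuming the graph is stored as a dict of `{file: {neighbor: similarity}}` and that "nearest" means highest similarity. The function name `keep_top_neighbors` is hypothetical.

```python
import heapq


def keep_top_neighbors(graph, n=10):
    """Keep, for every file, only its n most similar neighbors."""
    pruned = {}
    for node, neighbors in graph.items():
        top = heapq.nlargest(n, neighbors.items(), key=lambda item: item[1])
        pruned[node] = dict(top)
    return pruned
```

Note that pruning each file independently can leave a link in one direction only, so the pruned graph is best treated as directed.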

SLIDE 15

Limit Search Depth

Stop searching once the search reaches K times the distance of the first file found

  • Distance between files becomes asymmetric
  • Depends on the peer we start from

For K≥1.5 links removed are unlikely to be selected in the maximum matching

  • Asymmetric links are mostly low-similarity links
  • Hence will not be selected in the matching
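
The cutoff rule can be sketched as a Dijkstra search that stops early. This assumes the graph stores non-negative link lengths (distances, not similarities) and that `targets` are the other peer's files; the function name `bounded_search` is hypothetical.

```python
import heapq


def bounded_search(graph, source, targets, k=1.5):
    """Distances from source to target files, stopping at k times the
    distance of the first target found.

    graph maps a file to {neighbor: link_length}, lengths >= 0.
    """
    targets = set(targets)
    found = {}
    limit = float("inf")  # becomes k * (distance of first target found)
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if dist > limit:
            break  # everything left in the heap is at least this far
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry
        if node in targets and node != source and node not in found:
            found[node] = dist
            if limit == float("inf"):
                limit = k * dist
        for neighbor, length in graph.get(node, {}).items():
            new_dist = dist + length
            if new_dist < best.get(neighbor, float("inf")):
                best[neighbor] = new_dist
                heapq.heappush(heap, (new_dist, neighbor))
    return found
```

Because the limit is set by the first target reached from a given starting file, a search from file A may report B while the search from B stops before reaching A, which is the asymmetry noted above.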

SLIDE 16

Meta-data and Similarity

Similarity between peers i and j using artists
Normalized similarity matches meta-data

SLIDE 17

Geography and Similarity

Comparing the geographic distance with similarity
No direct correlation!

SLIDE 18

Conclusions

A metric for similarity between peers
Evaluation using song files shared in Gnutella

  • Metric reflects the similarity of peer preferences in music

Geography is not necessarily a good indication for peer similarity!

SLIDE 19

Thank You!

Udi Weinsberg udiw@eng.tau.ac.il