
Estimating Peer Similarity using Distance of Shared Files



  1. Estimating Peer Similarity using Distance of Shared Files
     Yuval Shavitt, Ela Weinsberg, Udi Weinsberg
     Tel-Aviv University

  2. Problem Setting
     - Peer-to-Peer (p2p) networks are used by millions for sharing content
     - Increasingly difficult to find useful content
       - Noise in user-generated content (meta-data)
       - Extreme dimensions
       - Sparseness
     Udi Weinsberg, IPTPS, April 2010

  3. Work Goal
     - Suggest a new metric for peer similarity
       - Overcome the sparseness problem
     - Improve ability to find content
       - Search algorithms
         - Similar peers are likely to hold relevant content
       - Collaborative filtering
         - Find "like-minded" peers

  4. Key Concept
     - Build a file similarity graph
       - Use data about all shared files
       - Weights of edges = distance between files
     - Peer similarity is calculated using the distance between their shared files
       - No need for overlapping content between peers

  5. Dataset
     - Active crawl of Gnutella in 2007
     - Crawled 1.2 million peers
     - Only 35% of songs contain meta-data
     - 530k distinct songs
       - Identified using "title|artist"
       - Accounting for spelling mistakes with edit distance
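
The song identification step on this slide can be illustrated with a small sketch. It builds a "title|artist" key and merges keys that differ by a few edits. The helper names (`song_key`, `canonicalize`), the lower-casing, the greedy merge order, and the `max_dist=2` threshold are all assumptions made for illustration; the slides do not state how the edit distance was applied.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def song_key(title: str, artist: str) -> str:
    """Build a "title|artist" identifier (normalization is an assumption)."""
    return f"{title.strip().lower()}|{artist.strip().lower()}"


def canonicalize(keys, max_dist=2):
    """Greedily map each key to the first earlier key within max_dist edits."""
    canonical = []
    mapping = {}
    for key in keys:
        match = next((c for c in canonical
                      if edit_distance(key, c) <= max_dist), None)
        if match is None:
            canonical.append(key)
            match = key
        mapping[key] = match
    return mapping
```

The greedy pass compares every key against all canonical keys, so it is quadratic; for a catalogue of hundreds of thousands of songs some form of blocking or indexing would be needed.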

  6. Dataset Statistics
     - Using a sample of 100k peers (<10%)
     - Over 511k songs remain (96%)
     [Figure annotations: power-law popularity distribution; 98% of the peers share less than 50 songs]

  7. Sparseness Problem
     [Figure annotations: peers with very few popular songs; median maximal overlap is 20%]

  8. File Similarity Graph
     - Files are vertices
     - Link weight is the number of peers sharing both
     - Normalize similarity with popularity (formula not preserved in the extracted text)
     [Figure annotation: power-law distribution, filter]
     - Filter causes distortion
       - Keep only top 40%
       - And no less than 10
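
A minimal sketch of the graph construction described on this slide: link weights are counted as the number of peers sharing both files, then each file keeps only its heaviest links. Reading "top 40%, and no less than 10" as a per-file rule is an assumption, as is keeping a link when either endpoint selects it. The function names and the `peer_files` input layout are hypothetical. The popularity normalization is left out because its formula does not survive in the slide text.

```python
from collections import Counter
from itertools import combinations
import math


def cooccurrence_weights(peer_files):
    """Count, for every pair of files, how many peers share both.

    peer_files: {peer_id: iterable of file ids} (assumed layout).
    """
    weights = Counter()
    for files in peer_files.values():
        for a, b in combinations(sorted(set(files)), 2):
            weights[(a, b)] += 1
    return weights


def keep_top_links(weights, fraction=0.4, min_links=10):
    """Per file, keep the top `fraction` of links by weight, at least `min_links`.

    A link survives if at least one of its endpoints keeps it (assumption).
    """
    neighbors = {}
    for (a, b), w in weights.items():
        neighbors.setdefault(a, []).append((w, b))
        neighbors.setdefault(b, []).append((w, a))
    kept = {}
    for f, links in neighbors.items():
        links.sort(key=lambda x: (-x[0], x[1]))  # heaviest first, ties by id
        k = max(min_links, math.ceil(fraction * len(links)))
        for w, other in links[:k]:
            kept[tuple(sorted((f, other)))] = w
    return kept
```

Because co-occurrence is counted over all pairs within each peer's library, the cost grows quadratically with library size; the slide's observation that 98% of peers share less than 50 songs keeps this manageable for most peers.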

  9. Peer Similarity Estimation (1)
     - Create a bipartite graph connecting the files of every two peers
     - Connect files in the two sides with links:
       - If exact same file: weight is 1
       - Otherwise: use normalized similarity along the shortest path between the files

  10. Distance Estimation
      [Figure: example link weights 0.2, 0.5, 0.8, 0.9, 1]

  11. Peer Similarity Estimation (2)
      - Run maximal weighted matching on the bipartite graph
        - Find the "best" matching links between files
        - The matching M is the sum of the link weights
      - Peer similarity (formula not preserved in the extracted text)
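
The two estimation steps (bipartite links, then maximal weighted matching) can be sketched as follows. This is a brute-force illustration that tries every assignment, so it is only usable for a handful of files per peer; a real implementation would use a polynomial-time assignment algorithm. The function names and the `similarity` callback are hypothetical, and the sketch stops at the matching sum M because the final peer-similarity formula is not preserved in the slide text.

```python
from itertools import permutations


def link_weights(files_a, files_b, similarity):
    """Weight matrix of the bipartite graph between two peers' files.

    Identical files get weight 1; otherwise `similarity(fa, fb)` is used,
    assumed to return a value in [0, 1].
    """
    return [[1.0 if fa == fb else similarity(fa, fb) for fb in files_b]
            for fa in files_a]


def max_weight_matching(weights):
    """Brute-force maximum weighted bipartite matching.

    Returns (M, pairs) where M is the sum of matched link weights and
    pairs is a list of (row_index, column_index).
    """
    if not weights or not weights[0]:
        return 0.0, []
    rows, cols = len(weights), len(weights[0])
    transposed = rows > cols
    if transposed:  # always enumerate from the smaller side
        weights = [list(col) for col in zip(*weights)]
        rows, cols = cols, rows
    best_total, best_pairs = -1.0, []
    for perm in permutations(range(cols), rows):
        total = sum(weights[r][c] for r, c in enumerate(perm))
        if total > best_total:
            best_total, best_pairs = total, list(enumerate(perm))
    if transposed:
        best_pairs = [(c, r) for r, c in best_pairs]
    return best_total, best_pairs
```

Since all weights are non-negative, matching every file on the smaller side never lowers the total, which is why the sketch only enumerates full assignments of the smaller side.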

  12. Maximal Weighted Matching
      [Figure: example link weights 0.2, 0.5]

  13. Distance Estimation Issues
      - File similarity graph can have several connected components
        - Some distances are infinite
      - All-pairs shortest paths can be costly
        - Reduce the size of the similarity graph
        - Limit the search depth

  14. Reducing Similarity Graph Size
      - For each file, take only the top N nearest neighboring files
      - Distributions almost overlap for N ≥ 10
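
The pruning step can be sketched in a few lines. The adjacency layout `{file: {neighbor: similarity}}` and the function name are assumptions, and "nearest" is read here as "highest similarity".

```python
import heapq


def top_n_neighbors(graph, n=10):
    """Keep, for each file, only its n most similar neighbors.

    graph: {file: {neighbor: similarity}} (assumed layout).
    The result is directed: a may keep b even if b does not keep a.
    """
    return {f: dict(heapq.nlargest(n, nbrs.items(), key=lambda kv: kv[1]))
            for f, nbrs in graph.items()}
```

For example, with `n=2` a file whose neighbors have similarities 0.9, 0.2 and 0.5 keeps only the 0.9 and 0.5 links.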

  15. Limit Search Depth
      - Stop searching for files once the search reaches K times the distance of the first finding
        - Distance between files becomes asymmetric
        - Depends on the peer we start from
      - For K ≥ 1.5, the links removed are unlikely to be selected in the maximum matching
        - Asymmetric links are mostly low-similarity links
        - Hence they will not be selected in the matching
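
One possible reading of the depth limit is a Dijkstra search from one of a peer's files that stops once it passes K times the distance at which the first file of the other peer was found. The sketch below follows that reading; the function name, the `{file: {neighbor: distance}}` layout, and the exact stopping rule are assumptions, since the slides give only the one-line description.

```python
import heapq


def bounded_distances(graph, source, targets, k=1.5):
    """Dijkstra from `source` with a search bound of k times the first hit.

    graph: {file: {neighbor: distance}} with non-negative distances.
    Returns {target: distance} for targets found within the bound;
    targets beyond it (or unreachable) are simply absent.
    """
    targets = set(targets)
    best = {source: 0.0}
    found = {}
    limit = float("inf")  # set once the first target is reached
    heap = [(0.0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if dist > limit:
            break  # everything left in the heap is at least this far
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry
        if node in targets and node not in found:
            found[node] = dist
            if limit == float("inf"):
                limit = k * dist
        for nbr, w in graph.get(node, {}).items():
            nd = dist + w
            if nd < best.get(nbr, float("inf")):
                best[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return found
```

Because the bound depends on the first finding from the starting file, running the search from the other peer's side can cut off a different set of links, which is the asymmetry the slide mentions.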

  16. Meta-data and Similarity
      - Similarity between peers i and j using artists (formula not preserved in the extracted text)
      - Normalized similarity matches meta-data

  17. Geography and Similarity
      - Comparing the distance with similarity
      - No direct correlation!

  18. Conclusions
      - A metric for similarity between peers
      - Evaluation using song files shared in Gnutella
        - Metric reflects the similarity of peer preferences in music
      - Geography is not necessarily a good indication for peer similarity!

  19. Thank You!
      Udi Weinsberg, udiw@eng.tau.ac.il
