SLIDE 1
Comparing Hybrid Peer-to-Peer Systems
Beverly Yang and Hector Garcia-Molina
Presented by Marco Barreno
November 3, 2003
CS 294-4: Peer-to-peer systems
Hybrid peer-to-peer systems
Pure peer-to-peer systems are hard to scale
Example: Gnutella
Look at hybrids between p2p and client-server
Servers index files; clients download from each other
Searching can be done more efficiently on a server
Example: Napster (but Napster had its own problems...)
Several other architectures
Questions for hybrid systems
Best way to organize servers?
Index replication policy?
What queries are submitted often?
How do we deal with churn?
How do query patterns affect performance?
Contributions of this paper
Presents several architectures for hybrid systems
Presents and evaluates a probabilistic model for queries
Compares architectures quantitatively, based on their models and the music sharing domain
Compares strategies in non-music-sharing domains (a bit)
SLIDE 2
General concepts: basic actions
Login
A client connects to a server and uploads metadata about the files it offers
The client is a local user of that server and a remote user of all others
Query
A list of words to search on
Satisfied once a preset maximum number of results has been found
Download
Contact peer directly after getting info from server
Goal
The goal of this study is to maximize UsersPerServer
What do you think of this goal?
Batch vs. incremental logins
Batch: on login/logout, the user's entire metadata set is added/removed
Allows the index to remain small, but login/logout is expensive
Incremental: metadata is kept in the index at all times, and only deltas are sent at login
Saves much effort on login/logout
Queries become more expensive, as the server must filter for online users
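The trade-off between the two login policies can be sketched with two toy inverted indexes. This is an illustration only, not the paper's or OpenNap's implementation: the class names, the whitespace tokenization of filenames, and the single-word query are all assumptions.

```python
from collections import defaultdict


def words_of(filename):
    # Hypothetical tokenizer: lowercase, split on whitespace.
    return filename.lower().split()


class BatchIndex:
    """Holds metadata for online users only; login/logout touch every file."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> {(user, filename)}
        self.files = {}                   # user -> list of filenames

    def login(self, user, filenames):
        self.files[user] = list(filenames)
        for name in filenames:
            for word in words_of(name):
                self.postings[word].add((user, name))

    def logout(self, user):
        # The user's entire metadata set is removed again.
        for name in self.files.pop(user, []):
            for word in words_of(name):
                self.postings[word].discard((user, name))

    def query(self, word):
        return sorted(self.postings.get(word.lower(), ()))


class IncrementalIndex:
    """Keeps metadata across sessions; login sends deltas, queries filter."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> {(user, filename)}
        self.online = set()

    def login(self, user, added=(), removed=()):
        for name in removed:
            for word in words_of(name):
                self.postings[word].discard((user, name))
        for name in added:
            for word in words_of(name):
                self.postings[word].add((user, name))
        self.online.add(user)

    def logout(self, user):
        # Cheap: the metadata stays in the index.
        self.online.discard(user)

    def query(self, word):
        # Extra per-query work: drop hits that belong to offline users.
        hits = self.postings.get(word.lower(), ())
        return sorted(hit for hit in hits if hit[0] in self.online)
```

In the batch version the index shrinks at logout, while the incremental version keeps the entry and pays for the online check on every query.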
Architectures (1)
Chained architecture
Servers are arranged in a linear chain (ring?)
Each server keeps metadata for local users
Unsatisfied queries sent along chain
Logins and downloads scalable; queries potentially expensive
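Query forwarding along the chain might look like the sketch below. It assumes the chain wraps around as a ring (the slide itself leaves this open) and models each server's local index as a plain dict from word to result list; the function name and data layout are hypothetical.

```python
def chained_query(servers, start, word, max_results):
    """Forward a query along the chain until enough results are found.

    servers: list of dicts mapping a word to a list of local results.
    start: index of the server the querying user is logged in to.
    Returns (results, number_of_servers_contacted).
    """
    results = []
    contacted = 0
    k = len(servers)
    for step in range(k):
        # Assumption: the chain wraps around, so every server is reachable.
        local_index = servers[(start + step) % k]
        contacted += 1
        results.extend(local_index.get(word, []))
        if len(results) >= max_results:
            break  # query satisfied, stop forwarding
    return results[:max_results], contacted
```

A popular query is usually satisfied at the local server, while a rare one may visit all k servers, which is why queries are the potentially expensive operation here.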
SLIDE 3
Architectures (2)
Full replication architecture
Each server keeps metadata about all users
Logins expensive
Queries cheap
Architectures (3)
Hash architecture
Metadata words are hashed so that a particular server is responsible for a particular subset of them
Queries sent to relevant servers
On login, metadata sent to all relevant servers
Only a limited number of servers need to see each query, but sending the lists may be expensive
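A minimal sketch of the word-to-server mapping follows. The choice of SHA-256, the whitespace tokenization, and all function names are assumptions made for illustration; the slides do not say which hash function is used.

```python
import hashlib


def server_for(word, num_servers):
    """Map a metadata word to the server responsible for it."""
    digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_servers


def servers_for_query(words, num_servers):
    """Only the servers owning at least one query word see the query."""
    return {server_for(word, num_servers) for word in words}


def login_messages(filenames, num_servers):
    """Group (word, filename) pairs by the server that must index them."""
    messages = {}
    for name in filenames:
        for word in name.lower().split():
            target = server_for(word, num_servers)
            messages.setdefault(target, []).append((word, name))
    return messages
```

A query with w words reaches at most w servers regardless of how many servers exist, but a login may have to send metadata to many of them.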
Architectures (4)
Unchained architecture
Servers are independent and don't communicate
A user can only search files on the server he/she connects to
Example: Napster
Disadvantage: each user's view is limited
Advantage: scales very well (as servers and users increase together)
Query model
Universe of queries: q1, q2, q3, ...; densities f, g
g(i) is the probability that a submitted query is query qi (query popularity)
f(i) is the probability that any given file will match query qi (selection power)
g tells us what queries users like to submit, while f tells us which files users like to store
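Given f and g, two useful quantities follow directly from the definitions: the expected number of results a submitted query gets from n files, and the probability that it gets fewer than R results. The sketch below assumes a finite query universe and that files match a query independently of one another; the function names are hypothetical.

```python
from math import comb


def expected_results(n, f, g):
    """Expected number of matches among n files for a random submitted query.

    By linearity of expectation this is n * sum_i g(i) * f(i).
    """
    return n * sum(gi * fi for gi, fi in zip(g, f))


def prob_fewer_than(n, p, r):
    """P(X < r) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j)
               for j in range(min(r, n + 1)))


def prob_unsatisfied(n, f, g, r):
    """Probability that a random submitted query finds fewer than r results.

    Assumption: each of the n files matches query qi independently
    with probability f(i).
    """
    return sum(gi * prob_fewer_than(n, fi, r) for gi, fi in zip(g, f))
```

For example, with two queries where f = [0.5, 0.1] and g = [0.75, 0.25], a query over 10 files expects 10 * (0.375 + 0.025) = 4 results.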
SLIDE 4
Expected results for chained
ExServ = expected number of servers needed to obtain R results (MaxResults)
If P(s) is the probability that exactly s servers are needed to return R or more results, we have: [equation not recovered from the slide]
ExLocalResults based on (UsersPerServer * FilesPerUser) files
ExTotalResults based on (ExLocalResults * k) files
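The expectation over P(s) can be sketched numerically. How to treat a query that cannot be satisfied even by all k servers is an assumption here: the leftover probability mass is charged the full chain length k, since such a query visits every server. The function name is hypothetical.

```python
def expected_servers_chained(p):
    """Expected number of servers visited in a chain of k = len(p) servers.

    p[s - 1] is P(s), the probability that exactly s servers are needed.
    Assumption: any mass not covered by p belongs to queries that are never
    satisfied, and those visit all k servers.
    """
    k = len(p)
    satisfied = sum(p)
    if satisfied > 1.0 + 1e-9:
        raise ValueError("probabilities sum to more than 1")
    weighted = sum(s * ps for s, ps in enumerate(p, start=1))
    return weighted + k * max(0.0, 1.0 - satisfied)
```

With P = [0.5, 0.3, 0.1] the sketch gives 0.5 + 0.6 + 0.3 for satisfied queries plus 3 * 0.1 for unsatisfied ones, about 1.7 servers per query.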
Expected values for others
ExServ is trivially 1 for full replication and unchained
ExServ is equivalent to balls-in-bins for hash
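The balls-in-bins view can be made concrete: treat each query word as a ball thrown into one of k servers, and count the expected number of non-empty bins. The sketch assumes words hash uniformly and independently, which the slides do not state; the function name is hypothetical.

```python
def expected_servers_hash(num_words, num_servers):
    """Expected number of distinct servers hit by a query of num_words words.

    A given server is missed by all words with probability (1 - 1/k)^w,
    so the expected number of servers hit is k * (1 - (1 - 1/k)^w).
    """
    k = num_servers
    return k * (1.0 - (1.0 - 1.0 / k) ** num_words)
```

A two-word query over two servers hits 1.5 servers on average, and the value never exceeds the number of words in the query.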
Distributions for f() and g()
Exponential distributions work well for music domain:
Monotonically decreasing
Popularity and selection power are correlated
The most popular query has the highest selection power, and so on
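A pair of exponential curves with these properties can be sketched as follows. The truncation to a finite number of queries, the scale values, and the peak match probability are all made-up parameters for illustration, not fitted values from the paper.

```python
import math


def exponential_popularity(num_queries, scale):
    """g(i) proportional to e^(-i/scale), normalised to sum to 1."""
    raw = [math.exp(-i / scale) for i in range(1, num_queries + 1)]
    total = sum(raw)
    return [x / total for x in raw]


def exponential_selection(num_queries, scale, peak):
    """f(i): per-file match probability decaying from `peak` (not normalised)."""
    return [peak * math.exp(-(i - 1) / scale)
            for i in range(1, num_queries + 1)]


# Hypothetical parameters, chosen only to make the shapes visible.
g = exponential_popularity(1000, scale=100.0)
f = exponential_selection(1000, scale=200.0, peak=0.01)
```

Both lists decrease with the query index, so the query ranked first by popularity is also the one with the highest selection power.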
Validation of query model
M(n) = expected number of results from n files
Q(n) = probability that we don't get R results
The validation data were gathered from OpenNap
SLIDE 5
Performance model
CPU cycles
Cost estimates based on examination and guesswork, plus some experiments
Matched OpenNap relatively well for batch logins
Inter-server bandwidth
Varies among architectures
Server-client bandwidth
Napster protocol: Login, AddFile, RemoveFile
Take min over resources (iterative estimation)
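One way to read "take min over resources" is that the server supports as many users as its tightest resource allows. The sketch below uses a binary search in place of the talk's iterative estimation, and it assumes each resource's load is non-decreasing in the number of users; the cost functions in the usage example are invented for illustration.

```python
def max_users_per_server(demands, capacities, upper=10_000_000):
    """Largest user count whose load fits within every resource capacity.

    demands: dict resource -> function(users) -> load on that resource.
    capacities: dict resource -> available capacity.
    Assumption: every demand function is non-decreasing in users.
    """
    def fits(users):
        return all(demands[r](users) <= capacities[r] for r in capacities)

    lo, hi = 0, upper
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if fits(mid):
            lo = mid       # mid users still fit, try more
        else:
            hi = mid - 1   # some resource is exhausted, try fewer
    return lo


# Hypothetical cost functions: CPU grows linearly, bandwidth quadratically.
example_demands = {"cpu": lambda n: 2 * n, "bandwidth": lambda n: n * n}
example_capacities = {"cpu": 100, "bandwidth": 400}
```

Here CPU alone would allow 50 users and bandwidth alone 20, so the sketch returns 20, the minimum over the resources.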
Evaluation
Metric: max users per server (throughput, not latency)
Memory requirements
Beyond music
f() and g() could be different
There may be no correlation, or negative correlation
e.g. Adding “price > 0” to a query makes it less popular but doesn't change the size of the result set
e.g. An archive system will return more results from farther in the past (queries presumably rarer)
No or negative correlation can be modeled by adjusting the ratio of the parameters to f and g
No correlation: r = 1
Negative correlation: r >> 1
SLIDE 6
CPU performance vs. r
Conclusion
Chained is the best architecture for the music domain
Full replication might be good with lots of cheap memory and stable network connections
Incremental logins do best when f and g are negatively correlated, and in short, bandwidth-limited sessions