SLIDE 1

Efficient Neighbor Selection Methodology Results Summary

Machine Learning for Efficient Neighbor Selection in Unstructured P2P Networks

Robert Beverly¹  Mike Afergan²

¹ MIT CSAIL, rbeverly@csail.mit.edu
² Akamai/MIT, afergan@alum.mit.edu

USENIX SysML, 2007

Robert Beverly, Mike Afergan Efficient Neighbor Selection in P2P Networks

SLIDE 2

Outline

1. Efficient Neighbor Selection
   Problem Overview
   Neighbor Selection and Self-Reorganization

2. Methodology
   Datasets
   Representing the Dataset
   Learning Task

3. Results
   Training Points
   Prediction Results
   Discussion

SLIDE 3

Efficient Neighbor Selection

in unstructured P2P networks

Problem Domain: unstructured P2P overlays, e.g. Kazaa, Gnutella, etc.

Problem: self-reorganization in unstructured P2P overlays promises better performance, scalability, and resilience, but the cost of reorganization may be greater than the benefit!

Neighbor Selection Problem:
- Choose neighbors efficiently → with few queries
- Choose neighbors effectively → with high success

SLIDE 6

Efficient Neighbor Selection

in unstructured P2P networks

Our Approach:
- Support Vector Machines (SVMs) and feature selection for classification
- Simulate the algorithm using live P2P datasets

Results:
- Predict "good" neighbors with over 90% accuracy using minimal knowledge of the neighbor's files or type
- Find neighbors capable of answering future queries

SLIDE 8

Unstructured P2P Networks

- Simple, popular, and widely used; e.g. Gnutella estimated at ≈ 3.5M nodes
- Typically used for file sharing
- Overlay structure: organic; nodes interconnect with minimal constraints; nodes are dynamic
- Queries: flooded through the overlay; peers answer; matches initiate a peer-to-peer download

SLIDE 10

Self-Reorganization

Because node connections are unconstrained, previous research suggests self-reorganization for improved query recall, efficiency, speed, scalability, resilience, trust, etc.

SLIDE 12

Reorganization Paradox

How can a node determine, in real time, whether or not to attach to another node? Reorganization presents a paradox: the only way to learn about another node is to issue queries, but issuing queries reduces the benefit of reorganization. Our insight: use machine-learning classification plus feature selection.

SLIDE 14

Live P2P Datasets

Want to evaluate potential algorithms on real data. Used two Gnutella datasets:

  Dataset          Nodes   Contains
  Beverly, et al.  1,500   Queries, Files, Timestamps
  Goh, et al.      4,500   Queries, Files, Timestamps

Both captured with a promiscuous UltraPeer; similar results from both datasets.

SLIDE 16

Data Preprocessing

Nodes hold and advertise files, ex:

"Red Hot Chili Peppers - Californication.mp3"

Nodes issue queries, ex:

"remember madonna i’ll" @ 1051761774

Remove non-alphanumerics, stop-words, and single characters. Per the Gnutella protocol, we tokenize queries and file names on the remaining whitespace, yielding per-node token sets q_i and f_i. Let N be the set of all nodes and n = |N|. Represent all unique query and file tokens as Q = ∪_i q_i and F = ∪_i f_i.
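A minimal sketch of this preprocessing (the stop-word list below is hypothetical; the talk does not specify one):

```python
import re

# Hypothetical stop-word list for illustration only.
STOP_WORDS = {"the", "of", "and", "mp3"}

def tokenize(text):
    """Lowercase, drop non-alphanumerics, split on whitespace,
    then remove stop-words and single characters."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return [t for t in cleaned.split() if len(t) > 1 and t not in STOP_WORDS]

print(tokenize("Red Hot Chili Peppers - Californication.mp3"))
# ['red', 'hot', 'chili', 'peppers', 'californication']
```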

SLIDE 17

Hypothetical Oracle

Dataset includes all files and queries for every node. We employ an oracle model in order to measure prediction accuracy. For every potential connection, compute a utility u_i(j); this work defines u_i(j) simply as the number of queries from i matched by j. Form an n × n adjacency matrix Y where Y_{i,j} = sign(u_i(j)).
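As a toy illustration of the oracle (the query and file sets below are hypothetical):

```python
# Hypothetical query token sets (one set per query) and file tokens per node.
queries = {0: [{"madonna"}, {"zeppelin"}], 1: [{"beatles"}]}
files   = {0: {"beatles", "abbey"},       1: {"madonna", "zeppelin"}}

def utility(i, j):
    """u_i(j): number of node i's queries matched by node j's file store."""
    return sum(1 for q in queries[i] if q & files[j])

n = 2
# Y[i][j] = sign(u_i(j)); utilities are non-negative, so sign is 0 or 1.
Y = [[1 if utility(i, j) > 0 else 0 for j in range(n)] for i in range(n)]
print(Y)  # [[0, 1], [1, 0]]
```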

SLIDE 18

Hypothetical Oracle

[Figure: (a) adjacency matrix Y, with y_{i,j} = sign(u_i(j)); (b) file store matrix X over token indices 1…k for node i]

SLIDE 19

Hypothetical Oracle

Using all file store tokens F, we assign each token a unique index, where |F| = k. Form an n × k file store matrix X where X_{i,j} = 1 ⟺ F_j ∈ f_i.
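A sketch of building X from hypothetical per-node token sets:

```python
# Hypothetical per-node file-store token sets f_i.
f = [{"red", "hot", "chili"}, {"madonna", "hot"}]

# F: all unique tokens, each assigned an index by sorted order; |F| = k.
F = sorted(set().union(*f))
# X is n x k with X[i][j] = 1  <=>  F[j] in f_i.
X = [[1 if tok in fi else 0 for tok in F] for fi in f]
print(F)  # ['chili', 'hot', 'madonna', 'red']
print(X)  # [[1, 1, 0, 1], [0, 1, 1, 0]]
```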

SLIDE 21

Representing a single node i

[Figure: row vector for node j: class label y followed by file-store features x_1 … x_k]

The i'th row of the adjacency matrix, transposed, forms the first column: node i's connection preferences (class labels). Horizontally concatenating it with the file store matrix X gives the oracle representation Z.
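A sketch of assembling Z for one node; the toy matrices below are hypothetical:

```python
# Hypothetical adjacency labels and file-store matrix for a 2-node example.
Y = [[0, 1], [1, 0]]            # Y[i][j] = sign(u_i(j))
X = [[1, 0, 1], [0, 1, 0]]      # n x k file-store matrix

i = 0
# Row j of Z pairs node j's label (from row i of Y) with node j's features.
Z = [[Y[i][j]] + X[j] for j in range(len(X))]
print(Z)  # [[0, 1, 0, 1], [1, 0, 1, 0]]
```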

SLIDE 23

Learning Task

Given the oracle representation, we turn to ML for classification. Each node faces a separate learning task: the optimal features will differ per node and need not match the node's queries.

Example: a node issues queries for "lord of the rings"; the best feature may be "elves". Intuition: a future query for "the two towers" is then more likely to succeed.

SLIDE 24

Learning Task

Overview

[Figure: rows of Z split into TRAIN and TEST; feature selection picks θ_1 … θ_d from features x_1 … x_k to predict y]

Randomly permute the rows of Z. Select ≪ n training nodes. The learner finds a small number d ≪ k of features θ that best predict y. Test the model on the remaining potential peers using θ. Note that the features do not contain any queries!
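The steps above can be sketched as follows; a nearest-centroid rule stands in for the SVM, and the rows of Z (label, feature vector) are hypothetical:

```python
import random

# Hypothetical oracle rows: (label y, file-store feature vector x).
Z = [(1, [1, 1, 0]), (1, [1, 0, 0]), (0, [0, 0, 1]), (0, [0, 1, 1])] * 5

def centroid(rows):
    return [sum(r[j] for r in rows) / len(rows) for j in range(len(rows[0]))]

def train(examples):
    """Nearest-centroid stand-in for the SVM classifier."""
    pos = [x for y, x in examples if y == 1]
    neg = [x for y, x in examples if y == 0]
    return centroid(pos), centroid(neg)

def predict(model, x):
    cp, cn = model
    dp = sum((a - b) ** 2 for a, b in zip(x, cp))
    dn = sum((a - b) ** 2 for a, b in zip(x, cn))
    return 1 if dp <= dn else 0

# Randomly permute within each class so the tiny training set has both labels.
random.seed(0)
pos_rows = [r for r in Z if r[0] == 1]
neg_rows = [r for r in Z if r[0] == 0]
random.shuffle(pos_rows)
random.shuffle(neg_rows)
train_set = pos_rows[:5] + neg_rows[:5]      # << n training nodes
test_set  = pos_rows[5:] + neg_rows[5:]      # remaining potential peers

model = train(train_set)
accuracy = sum(predict(model, x) == y for y, x in test_set) / len(test_set)
print(accuracy)  # 1.0 on this separable toy data
```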

SLIDE 25

Key ML Insight for Systems Architects

Feature selection (variable reduction) is traditionally used to reduce computational complexity. Key insight for systems architects: use feature selection to reduce communication cost.

SLIDE 27

Feature Selection

We consider mutual information (MI) and forward fitting (FF) feature selection. MI measures how well correlated individual features are with the class label, independent of the classifier. FF greedily finds features that minimize training error for a given classifier.
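A minimal sketch of the MI criterion on binary features and labels (forward fitting would instead greedily add whichever feature most reduces a classifier's training error); the data at the bottom is hypothetical:

```python
import math

def mutual_info(feature, labels):
    """MI (in bits) between a binary feature column and binary class labels."""
    n = len(labels)
    mi = 0.0
    for fv in (0, 1):
        for yv in (0, 1):
            pxy = sum(f == fv and y == yv for f, y in zip(feature, labels)) / n
            px = sum(f == fv for f in feature) / n
            py = sum(y == yv for y in labels) / n
            if pxy > 0:
                mi += pxy * math.log2(pxy / (px * py))
    return mi

def select_by_mi(X, y, d):
    """Rank feature columns by MI with the labels; keep the top d."""
    k = len(X[0])
    scores = sorted(((mutual_info([row[j] for row in X], y), j) for j in range(k)),
                    reverse=True)
    return [j for _, j in scores[:d]]

X = [[1, 0], [1, 1], [0, 0], [0, 1]]   # column 0 predicts y; column 1 is noise
y = [1, 1, 0, 0]
print(select_by_mi(X, y, 1))  # [0]
```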

SLIDE 29

Building a Model

Some questions

- Classifier: which classifier works best?
- Number of training points: what is the minimum training size that allows good predictions?

All results are the product of five trials with random data permutation. We find the best results with SVMs (we also tried Naïve Bayes).

SLIDE 32

Training Points

[Figure: test accuracy, precision, and recall vs. training size (50 to 350 samples)]

≈ 100 training samples suffice for good predictions

SLIDE 34

Prediction Results

Some questions

- Number of features: how many features are required for accurate predictions?
- Feature selection: how do FF and MI compare? How much better than random?

SLIDE 36

Test Accuracy

[Figure: test classification accuracy vs. number of features (2 to 16), for FF, MI, and random selection]

As few as 5 features give accurate predictions!

SLIDE 37

Test Precision

[Figure: test precision vs. number of features (2 to 16), for FF, MI, and random selection]

FF outperforms MI

SLIDE 38

Test Recall

[Figure: test recall vs. number of features (2 to 16), for FF, MI, and random selection]

Random features a useful baseline

SLIDE 40

Discussion

FF outperforms MI

The danger of FF is overfitting, but we do not observe any. Consider a node with songs by "Britney Spears": both "Britney" and "Spears" are good features, but once "Britney" is used, "Spears" adds no new information; MI still ranks it highly. Future work: remove such correlations using feature-to-feature MI.

Little SVM overfitting

From the training-size analysis, SVMs are robust to overfitting. We do not face the problem of too many features leading to overfitting.

SLIDE 41

Discussion (cont'd)

Computationally practical:
- FF requires training a combinatorial number of SVMs
- Can run as a background process, or use MI for comparable results

Practical in real networks:
- Use existing P2P bootstrap mechanisms
- We show that selecting ≈ 100 nodes suffices to build an effective classifier

Neighbor selection is a general problem:
- Feature selection to minimize communication overhead may generalize to other systems/network tasks

SLIDE 42

Summary

- Novel application of ML to the neighbor selection problem in self-reorganizing networks
- Use feature selection to reduce communication cost in a distributed system
- Correct predictions with over 90% accuracy while requiring minimal queries (< 2% of features)

Thanks! Questions?
