SLIDE 1

An Internet Protocol Address Clustering Algorithm

Robert Beverly Karen Sollins

MIT Computer Science and Artificial Intelligence Laboratory {rbeverly,sollins}@csail.mit.edu December 11, 2008

USENIX SysML 2008

  • R. Beverly, K. Sollins (MIT)

IP Clustering SysML 2008 1

SLIDE 2

Scope of Talk

1. Motivation: Learning to operate in an increasingly complex and malicious Internet
2. Challenges: Many at Internet scale, in a dynamic environment
3. Needed: Building blocks for network and systems designers
4. Approach (and why we didn't do X): an IP clustering algorithm as one building block with many practical applications
5. Results: Predictive performance, including the ability to detect changed network portions
6. Future: What's next; work building upon this research

SLIDE 3

Internet-Scale Learning

Outline

1. Internet-Scale Learning
2. Defining the Problem
3. Exploiting Network Structure
4. Results

SLIDE 4

Internet-Scale Learning

Evolution of Internet Architecture

The Internet is a phenomenal success, but the original assumptions underlying its design have changed, e.g.:
• Security: historically a secondary concern
• Trust: in a world of botnets, phishers, etc.
• Scale: traffic, routes, multi-homing, etc.
• Complexity: policy constraints, network demands, economics

And it continues to evolve and grow more complex, e.g.:
• Scale along a new dimension: bad hosts/users
• Support for increasingly critical services
• Trend toward content-based networking
• Devices with intermittent connectivity (sensor nets, DTNs)

SLIDE 5

Internet-Scale Learning

The Research Challenge

Apply statistical learning to embrace the Internet's natural complexity. Find predictive models that generalize to unseen data and new situations. Networking problems are a challenging learning environment:
• Non-stationary
• On-line
• Distributed
• Tradeoff between effort vs. improvement obtained vs. errors

Needed: building blocks to realize the promise of ML while mitigating these challenges.

SLIDE 6

Defining the Problem

Outline

1. Internet-Scale Learning
2. Defining the Problem
3. Exploiting Network Structure
4. Results

SLIDE 7

Defining the Problem Overview

IP Clustering as a Building Block

Internet Protocol (IP) v4 addresses are unsigned 32-bit integers

e.g. 18.26.0.230

Hosts are given addresses based on the network on which they reside.

An IP address clustering algorithm: supervised learning (change detection described later). Given, informally:
• Training samples from a portion of the IP address space
• Labels with a real or discrete property (e.g. latency, security reputation)

Find a "good" partitioning of the space. Why?
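To make the 32-bit-integer representation concrete, a quick sketch (not from the slides) using Python's standard ipaddress module:

```python
import ipaddress

# An IPv4 address is an unsigned 32-bit integer; the clustering algorithm
# operates on this integer space, not on the dotted-quad string.
addr = ipaddress.IPv4Address("18.26.0.230")
as_int = int(addr)                 # 18*2**24 + 26*2**16 + 0*2**8 + 230
back = ipaddress.IPv4Address(as_int)

print(as_int)   # 303694054
print(back)     # 18.26.0.230
```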

SLIDE 8

Defining the Problem Motivation

IPs as Identifiers:

For better or worse, IP addresses are overloaded. IPs serve as identifiers for:
• End hosts
• Location in the network topology
• Location in the physical topology

Implications of this conflation:
• Security policy (firewalls, etc.)
• Reputation (spam sources, etc.)
• Service selection, load balancing, performance optimization (P2P, CDNs, etc.)
• User-directed routing, grid computing, and more

For example...

SLIDE 9

Defining the Problem Building Intuition

Practical Example: Internet Mail Server

[Figure: a mail server receiving messages labeled Spam/Ham from hosts across the 2^32 IPv4 address space; one unlabeled source is marked "???"]

Assuming spam originates from “grouped” hosts/networks Can a mail server build a predictive model of likely spam sources/networks?
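The intuition above can be sketched as a toy per-/24 spam predictor (illustrative only: the class names and the fixed /24 granularity are my assumptions; the talk's algorithm learns the granularity rather than fixing it):

```python
from collections import defaultdict

def prefix24(ip_str):
    """Map an IP to its /24 network key (a coarse stand-in for learned clusters)."""
    a, b, c, _ = ip_str.split(".")
    return f"{a}.{b}.{c}.0/24"

class SpamModel:
    """Toy predictor of P(spam | /24 prefix) from labeled observations."""
    def __init__(self):
        self.counts = defaultdict(lambda: [0, 0])  # prefix -> [spam, total]

    def observe(self, ip, is_spam):
        entry = self.counts[prefix24(ip)]
        entry[0] += int(is_spam)
        entry[1] += 1

    def predict(self, ip, prior=0.5):
        spam, total = self.counts[prefix24(ip)]
        return spam / total if total else prior  # fall back to prior if unseen

m = SpamModel()
for ip, label in [("10.0.0.1", True), ("10.0.0.2", True),
                  ("10.0.0.9", False), ("192.0.2.5", False)]:
    m.observe(ip, label)

print(m.predict("10.0.0.77"))  # ~0.67: 2 of 3 samples in 10.0.0.0/24 were spam
print(m.predict("8.8.8.8"))    # unseen prefix -> prior 0.5
```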

SLIDE 10

Defining the Problem Building Intuition

Emulating Ideal World

Ideally, a "knowledge plane" would provide oracle information on every node in the network. Unfortunately, the size (~3B addresses, ~300K networks) and dynamics of the Internet generally preclude complete knowledge. Instead, leverage the Internet's inherent structure due to physical, logical, and administrative boundaries. How much structure exists?

SLIDE 11

Defining the Problem Building Intuition

IANA /8 Allocations by Continent

IP addressing is hierarchical, but discontinuous and fragmented. What is the correct granularity? Hosts within the same sub-network likely have consistent policy, latencies, routes, etc.

SLIDE 12

Defining the Problem Building Intuition

Learning Structure

Idea 1: Statically divide input space

Email server example:

[Figure: the 2^32 address space with Spam/Ham training points and the mail server, as in the earlier example]

SLIDE 13

Defining the Problem Building Intuition

Learning Structure

Idea 1: Statically divide input space

Email server example:

[Figure: the 2^32 address space statically divided into fixed-size bins; P(Spam|Struct) = 0.5 overall, with per-bin estimates P(S) = 0 and P(S) = 1]

SLIDE 14

Defining the Problem Building Intuition

Learning Structure

Idea 1: Statically divide input space

Email server example:

[Figure: static division of the address space with per-bin spam estimates, as on the previous slide]

Issues:
• Pre-supposes a structure; we may want to infer it
• Requires a large amount of memory to perform decently
• Static alignment with the data leads to inferior performance compared to other approaches
SLIDE 15

Defining the Problem Building Intuition

Idea 2: Leverage network routing

IP Hierarchy and Aggregation: blocks of contiguous addresses (of varying size) are assigned to networks (e.g. AT&T, UCSD, Level3).

The aggregated unit is the prefix/mask (defined precisely in the paper); e.g. 18.0.0.0/8 is a large prefix with 2^24 addresses.

Smaller blocks are further sub-delegated ("smaller" prefixes). Routers exchange aggregated prefixes and perform per-packet longest-match forwarding to move each packet closer to its destination. Implication: there is an existing source of rich data, e.g. [Balachandar & Wang]. For example...
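A minimal longest-match sketch (the prefixes and labels here are made up, and real routers use radix tries rather than this linear scan):

```python
import ipaddress

# Hypothetical routing-derived prefixes, each with an attached label
prefixes = {
    "18.0.0.0/8": "net-A",
    "18.26.0.0/16": "net-B",
    "0.0.0.0/0": "default",
}

def longest_match(ip_str):
    """Return the label of the most specific (longest) prefix containing ip."""
    ip = ipaddress.IPv4Address(ip_str)
    best = None
    for p, label in prefixes.items():
        net = ipaddress.IPv4Network(p)
        if ip in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, label)
    return best[1]

print(longest_match("18.26.0.230"))  # net-B: the /16 beats the /8
print(longest_match("8.8.8.8"))      # default
```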

SLIDE 16

Defining the Problem Building Intuition

Learning Structure

Idea 2: Leverage network routing

Email server example:

[Figure: the 2^32 address space, undivided]

SLIDE 17

Defining the Problem Building Intuition

Learning Structure

Idea 2: Leverage network routing

Email server example:

[Figure: the 2^32 address space divided into routing prefixes labeled Sprint, AT&T, Qwest]

SLIDE 18

Defining the Problem Building Intuition

Learning Structure

Idea 2: Leverage network routing

Email server example:

[Figure: routing prefixes further subdivided: Sprint, AT&T, Qwest, Seaworld, Hotel, Qualcomm]

SLIDE 19

Defining the Problem Building Intuition

Learning Structure

Idea 2: Leverage network routing

Email server example:

[Figure: routing prefixes (Sprint, AT&T, Qwest, Seaworld, Hotel, Qualcomm) over the address space, as on the previous slide]

Issues:
• Inferior to more sophisticated approaches
• Even if readily available, typically at the wrong granularity
• Similar problems arise in using registry databases

SLIDE 20

Defining the Problem Takeaways

How to Best Learn/Exploit Structure?

There is a temptation to formulate a network task as a generic learning problem (i.e. use out-of-the-box "black-box" algorithms). This is often suboptimal: e.g. how to set thresholds, the regularization parameter, the kernel, etc.?

How about Internet-specific learning algorithms?
• Leverage domain-specific knowledge
• Learn in a way amenable to a non-stationary environment and on-line directed learning

As important, the algorithm must be:
• Fast (ideally suitable for the Internet core / high-speed routers)
• Memory efficient (think FIBs, not RIBs)

SLIDE 21

Exploiting Network Structure

Outline

1. Internet-Scale Learning
2. Defining the Problem
3. Exploiting Network Structure
4. Results

SLIDE 22

Exploiting Network Structure Refining the Problem

Data Set

Latency Data Set: a reference data set drawn from live Internet measurements. We use round-trip latency as the per-IP property (label). Note the algorithm isn't specific to latency prediction; latency is evocative of many structural properties (e.g. latencies of sub-networks are often a function of the network to which they belong).

[Figure: a measurement agent pings hosts IP1, IP2, ..., IPN, recording RTT1, RTT2, ..., RTTN]

SLIDE 23

Exploiting Network Structure Refining the Problem

Data Set

Find: 30,000 random Internet hosts responding to ping.
Gather: average latency to each over 5 pings.

[Figure: histogram of latency probability vs. latency (ms), 100-500ms range]

Several modes; a non-trivial distribution.

SLIDE 24

Exploiting Network Structure Refining the Problem

Black-box Performance

Let's try out-of-the-box SVM regression: predict latency to unknown destinations. With lots of tuning it performs reasonably well; several insights from feature selection.

[Figure: predicted vs. measured latency (ms) scatter plot; points within the yellow lines represent good predictions; 75% of predictions are within 30%]


SLIDE 27

Exploiting Network Structure Network Environment

What about the network?

Cool, but... highly (unnatural) parametric models? E.g. the SVM dual:

max_α  Σ_{t=1}^{n} α_t − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j K(φ(x_i), φ(x_j))
s.t.  C ≥ α_t ≥ 0,  Σ_{t=1}^{n} α_t y_t = 0

Artificial geometries? (How "close" are 18.255.255.255 and 19.0.0.1?)

[Figure: adjacent prefixes 18.0.0.0/8 and 19.0.0.0/8]

SLIDE 28

Exploiting Network Structure Network Environment

What about the network (con’t)?

And... structural and temporal network dynamics? When, and how often, to retrain? On-line learning?

For example, latency prediction:
• Structural changes → new link, routing change
• Temporal effects → congestion, time-of-day

SLIDE 29

Exploiting Network Structure Network Environment

Change Point Detection

Change point detection:
• Assume errors are normally distributed
• Change from a known initial world θ0 = N(µ0, σ)
• To θ1 with unknown mean µ1

Generalized Likelihood Ratio: perform a double maximization:

g_k = (1 / 2σ²) max_{1≤j≤k} [ 1/(k−j+1) ] ( Σ_{i=j}^{k} (x_i − µ0) )²
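The statistic above transcribes directly into code (a sketch; variable names are mine):

```python
def g_k(x, mu0, sigma):
    """GLR statistic g_k for a mean shift away from N(mu0, sigma).

    Direct transcription of g_k = (1/2sigma^2) * max over j of
    (1/(k-j+1)) * (sum_{i=j}^{k} (x_i - mu0))^2.
    """
    best = 0.0
    s = 0.0
    # Scan j = k..1, maintaining the suffix sum s = sum_{i=j}^{k}(x_i - mu0);
    # n = k - j + 1 is the length of that suffix.
    for n, xi in enumerate(reversed(x), start=1):
        s += xi - mu0
        best = max(best, s * s / n)
    return best / (2 * sigma * sigma)

# A clean mean shift from 0 to 5 drives the statistic up sharply:
print(g_k([0, 0, 0, 5, 5, 5], mu0=0, sigma=1))  # 37.5
```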

SLIDE 31

Exploiting Network Structure Network Environment

GLR

GLR as a test statistic on our data: learn the model, predict, receive ground truth, update. Real data; synthetic errors begin at the 1000th sample.

[Figure: error (ms) and GLR statistic over time-ordered latency predictions; the change at sample 1000 is visible in the GLR trace]

SLIDE 32

Exploiting Network Structure Network Environment

GLR

GLR is traditionally used in operations management, etc. In our environment, test error = train error → g_k drifts positive. g_k drifts with slope β until a change to β′.

[Figure: g_k and wma(g_k) over time-ordered samples, with a change at sample 4000]

SLIDE 33

Exploiting Network Structure Network Environment

GLR

Overcoming the drift effect:
• Take the first derivative to get a step function
• Take the second derivative to get an impulse response

[Figure: d/dx GLR, normalized d'/dx GLR, and WMA of the normalized second derivative over time-ordered samples, with a change at 4000]

Intuition: g_k drifts with slope β until a change to β′. The first-derivative step function still requires thresholding. The second derivative of a constant is zero, so we can now edge-trigger on an impulse response.
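The derivative trick can be sketched on synthetic data (the sequence and slope values below are made up for illustration):

```python
def diffs(seq):
    """Discrete first difference of a sequence."""
    return [b - a for a, b in zip(seq, seq[1:])]

# Synthetic drifting g_k: slope 1 for the first six samples, slope 4 after.
g = [t for t in range(6)] + [5 + 4 * (t + 1) for t in range(4)]
# g = [0, 1, 2, 3, 4, 5, 9, 13, 17, 21]

d1 = diffs(g)    # step function:    [1, 1, 1, 1, 1, 4, 4, 4, 4]
d2 = diffs(d1)   # impulse response: [0, 0, 0, 0, 3, 0, 0, 0]

change = d2.index(max(d2))  # edge-trigger on the impulse
print(change)               # 4: index of the slope break in d1
```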
SLIDE 34

Exploiting Network Structure Network Environment

GLR

Result: a decision function for predicting a change in a supervised learning problem.

[Figure: GLR(GLR) over time-ordered samples, with the change point at 4000 flagged]

Take away: a principled method to detect structural network change.

SLIDE 35

Exploiting Network Structure A Network-Specific Algorithm

An IP Clustering Algorithm

Dynamics: incorporate dynamics into the model; GLR provides a means to detect change. But what portion of the network changed?

Domain knowledge: IP address blocks are assigned on 2^x boundaries. Can we incorporate this domain-specific knowledge?

SLIDE 36

Exploiting Network Structure A Network-Specific Algorithm

An IP Clustering Algorithm

Induces a partitioning over the IP address space:

[Figure: the 2^32 address space divided into variable-size partitions]

Maintain the partitioning in a binary radix trie:

[Figure: binary radix trie whose leaves map prefixes (0/1, 32/3, 64/2, 128/1, 192/2) to latency estimates between 20ms and 100ms]
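A minimal sketch of such a trie (not the paper's implementation; the inserted prefixes and latencies are illustrative):

```python
class TrieNode:
    __slots__ = ("children", "value")
    def __init__(self):
        self.children = [None, None]
        self.value = None  # latency estimate stored at this prefix, if any

class IPTrie:
    """Binary radix trie keyed by IPv4 prefixes, with longest-prefix lookup."""
    def __init__(self, bits=32):
        self.root = TrieNode()
        self.bits = bits

    def insert(self, prefix_int, masklen, value):
        node = self.root
        for i in range(masklen):
            bit = (prefix_int >> (self.bits - 1 - i)) & 1
            if node.children[bit] is None:
                node.children[bit] = TrieNode()
            node = node.children[bit]
        node.value = value

    def lookup(self, ip_int):
        """Longest-prefix match: O(bits) regardless of table size."""
        node, best = self.root, None
        for i in range(self.bits):
            if node.value is not None:
                best = node.value
            node = node.children[(ip_int >> (self.bits - 1 - i)) & 1]
            if node is None:
                break
        else:
            if node.value is not None:
                best = node.value
        return best

t = IPTrie()
t.insert(0x00000000, 1, 86)   # 0.0.0.0/1   -> 86ms
t.insert(0x40000000, 2, 50)   # 64.0.0.0/2  -> 50ms
t.insert(0x80000000, 1, 100)  # 128.0.0.0/1 -> 100ms
print(t.lookup(0x48000000))   # 72.0.0.0 falls in 64.0.0.0/2 -> 50
print(t.lookup(0x20000000))   # 32.0.0.0 matches only 0.0.0.0/1 -> 86
```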

SLIDE 37

Exploiting Network Structure A Network-Specific Algorithm

An IP Clustering Algorithm

Divisive formulation: perform a t-test on permutations of 2^i input partitionings.

[Figure: candidate partitions compared pairwise via t-tests of H0: same distribution]

This gives a strong statistical notion of whether points come from the same distribution, i.e. have common latencies. Use the t-test to drive partitioning; each partition is inserted into a radix trie → longest prefix matching. There is also an agglomerative version (build up).
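The split decision can be sketched as follows, assuming Welch's t statistic and an illustrative cutoff rather than the paper's calibrated test:

```python
from math import sqrt

def t_stat(a, b):
    """Welch's t statistic for two samples (plain-Python stand-in for a
    t-test library call)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / sqrt(va / len(a) + vb / len(b))

def should_split(left, right, threshold=2.0):
    """Split a candidate prefix into halves when their latencies look like
    different distributions (threshold is an illustrative cutoff)."""
    return abs(t_stat(left, right)) > threshold

# Latencies observed in the two halves of a candidate prefix:
low = [20, 22, 19, 21, 20]
high = [90, 95, 88, 92, 91]
print(should_split(low, high))                   # True: different regimes
print(should_split(low, [21, 20, 22, 19, 20]))   # False: same regime
```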


SLIDE 40

Exploiting Network Structure Maximal Prefixes

Maximal Partitioning

[Figure: the 76.105.0.0/16 range annotated with AS33651, AS7725, and AS33490]

The partition from 76.105.64.0 to 76.105.255.255 is not a valid prefix. Divide the /16 into four equally sized /18 prefixes (2^14 addresses each)? No: the example spans three different ASes:
• 76.105.0.0/18: Sacramento, CA
• 76.105.64.0/18: Atlanta, GA
• 76.105.128.0/17: Oregon

Take away: incorporate domain specific knowledge
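Decomposing an arbitrary address range into its maximal aligned prefixes is exactly what Python's stdlib ipaddress.summarize_address_range computes; applied to the slide's invalid range:

```python
import ipaddress

# 76.105.64.0 - 76.105.255.255 is not a single prefix; the maximal aligned
# prefixes covering it are a /18 and a /17.
first = ipaddress.IPv4Address("76.105.64.0")
last = ipaddress.IPv4Address("76.105.255.255")
nets = list(ipaddress.summarize_address_range(first, last))

print([str(n) for n in nets])  # ['76.105.64.0/18', '76.105.128.0/17']
```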

SLIDE 41

Results

Outline

1. Internet-Scale Learning
2. Defining the Problem
3. Exploiting Network Structure
4. Results

SLIDE 42

Results

Performance

[Figure: mean absolute error (ms) vs. training size, 10 to 100,000 points, log scale]

• Less than 40ms error with only 1,000 training points
• ~24ms error, with tight bounds, at 10,000 training points
• ~130kB of memory to maintain the binary trie
• Lookups in O(b) time (b = address bits)

SLIDE 43

Results

An IP Clustering Algorithm

Advantages:
• A natural means to penalize model complexity
• A natural means to bound memory
• Accommodates change detection
• Allows for active learning

SLIDE 44

Results

Change Detection

[Figure: trie partitions 0.0.0.0/1, 0.0.0.0/4, 64.0.0.0/2, 128.0.0.0/1, 160.0.0.0/3, with one partition flagged "Change Detected?"]

SLIDE 47

Results

Change Detection Accuracy

[Figure: the 2^32 address space with real vs. inferred change regions labeled TP, FP, FN, TN]

[Figure: accuracy, precision, and recall vs. size of network change (/x), for changes from /2 to /14]

The induced change-point game: accuracy is high across change sizes. Large changes are detected very well; more samples are required to perform well on smaller changes.
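For reference, the metrics plotted above, computed from hypothetical confusion counts (the counts below are illustrative, not results from the paper):

```python
def scores(tp, fp, fn, tn):
    """Standard classification metrics used in the change-detection game."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Hypothetical counts for one induced change experiment:
acc, prec, rec = scores(tp=8, fp=1, fn=2, tn=89)
print(round(acc, 2), round(prec, 2), round(rec, 2))  # 0.97 0.89 0.8
```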

SLIDE 48

Parting Thoughts

Further Research

Improving the algorithm:
• The agglomerative version has appealing properties
• Address stability of the optimal split in the sequential t-test with a random forest algorithm
• Variability change point detection
• Better understand the tradeoff between pruning stale data and the cost of retraining
• Perform active learning on poorly performing or sparse portions of the tree
• Coping with adversarial agents that disrupt learning?

SLIDE 49

Parting Thoughts

Summary

Learning is useful for many problems in a complex Internet, but one must be cognizant of difficult issues when employing learning in an Internet context. IP address clustering is one building block with wide applicability:
• Learns underlying structure
• Leverages domain-specific knowledge
• Detects environment dynamics
• Provides a means to penalize model complexity and memory in a network-natural way

Thanks! Questions?

SLIDE 50

Backup Slides


SLIDE 51

Backup Slides Background

IP Prefixes: prefix/mask p/m := [p, p + 2^(b−m) − 1], where b = 32 for IPv4; p/m contains 2^(b−m) addresses.
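A direct transcription of this definition (a sketch; the function name is mine):

```python
def prefix_range(p, m, b=32):
    """Address range covered by prefix/mask p/m: [p, p + 2**(b-m) - 1]."""
    size = 1 << (b - m)
    assert p % size == 0, "prefix must be aligned to its mask"
    return p, p + size - 1

lo, hi = prefix_range(18 << 24, 8)  # 18.0.0.0/8
print(hi - lo + 1)                  # 16777216 = 2**24 addresses
```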

SLIDE 52

Backup Slides Verifying Extant Structure

Examining the hypothesis of structure

Before trying to learn, let's sanity check. Let d = |IP1 − IP2|, the numerical "distance" between two addresses. For a random pair of d-distant IPs, how well do their RTTs agree given d?

[Figure: the agent pings IP1 and IP2, obtaining RTT1 and RTT2]

i.e. Pr(RTT1 = ε·RTT2 | d)?

SLIDE 53

Backup Slides Verifying Extant Structure

Examining the hypothesis of structure

d = |IP1 − IP2|, numerical "distance". What is the probability that the RTTs of a pair of log2(d)-distant IP addresses disagree?

[Figure: % RTT disagreement vs. log2(pair distance), with a random-pair baseline ("rnd")]

Take away: structure is present to learn upon.

SLIDE 54

Backup Slides Regression Performance

Feature Selection

b  Range     US  Euro
1  0-127     56  21
   128-255   51  10

SLIDE 55

Backup Slides Regression Performance

Feature Selection

b  Range     US  Euro
1  0-127     56  21
   128-255   51  10
2  0-63      77   7
   64-127    30  24
   128-191
   192-255

SLIDE 57

Backup Slides Regression Performance

Performance

Using support vector regression: predict latency to unknown destinations.

[Figure: predicted vs. measured latency (ms) scatter with the ideal 45° line; ideally, predictions cluster on the 45° line]

SLIDE 58

Backup Slides Regression Performance

Coping with Network Dynamics

Comparative results:
• Outperforms the SVR approach
• Does not require SVR parametric "tuning"
• Lookup time linear in the number of IP address bits; fast in practice

[Figure: mean absolute error (ms) vs. training size (thousands) for SVR and IP-clust]
