Identifying Frequent Items in Sliding Windows over On-Line Packet Streams


SLIDE 1

Identifying Frequent Items in Sliding Windows over On-Line Packet Streams

Alejandro López-Ortiz School of Computer Science University of Waterloo

Joint work with Lukasz Golab (Waterloo), David DeHaan (Waterloo), Erik Demaine (MIT), and J. Ian Munro (Waterloo)

SLIDE 2

Alejandro López-Ortiz, IMC ’03, Miami, Florida

Application

Real-time analysis of network traffic

find frequently appearing packet types

Packet type: port #, protocol type, source IP.

But, interested in recent usage trends

E.g. for routing system analysis or anomaly detection

So, want to find frequently appearing packets in a sliding window of N most recent packets

SLIDE 3

If we could store the entire window:

Maintain frequency counts of each category in the window

Update counters as new packets arrive and old packets are expired out of the window

Periodically scan counters and return the packet types corresponding to the k largest counters (and possibly the actual counts too)
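
The full-storage baseline above can be sketched in Python as follows. This is a minimal illustration, assuming packet types are hashable labels; the class name `ExactWindowCounter` and its methods are hypothetical and not taken from the talk.

```python
from collections import Counter, deque


class ExactWindowCounter:
    """Exact per-type counts over the N most recent packets (stores the whole window)."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.window = deque()    # packet types currently in the window, oldest first
        self.counts = Counter()  # frequency count of each type in the window

    def add(self, packet_type):
        # A new packet arrives: record it and bump its counter.
        self.window.append(packet_type)
        self.counts[packet_type] += 1
        # If the window overflows, expire the oldest packet and decrement its counter.
        if len(self.window) > self.window_size:
            old = self.window.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]

    def top_k(self, k):
        # Scan the counters and return the k largest (type, count) pairs.
        return self.counts.most_common(k)


counter = ExactWindowCounter(window_size=5)
for p in ["a", "b", "a", "c", "a", "b", "b"]:
    counter.add(p)
print(counter.top_k(2))
```

The memory cost is proportional to the window size N, which is what motivates the sub-window approach on the next slide.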

SLIDE 4

What if we can’t store the entire window?

Idea from [Zhu, Shasha, VLDB ’02]:

Divide the sliding window into sub-windows, i.e. use a coarser time grain of T packets

Store a summary for each sub-window

Every T packets:

  • Expire the oldest sub-window

  • Add the most recent sub-window

  • Update the answer

Space req: (window size / T) × (summary size)

SLIDE 5

Example: windowed SUM

5 8 4 9 11 6 8 5 3 20 8 7 3
SUM = 5 + … + 3 = 97

8 4 9 11 6 8 5 3 20 8 7 3 7
SUM = SUM_OLD - 5 + 7 = 99
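
The incremental update in this example can be reproduced with a short Python sketch. It treats each number as one stored value (it could equally be a per-sub-window partial sum); the class name `SlidingSum` is an assumption made for illustration.

```python
from collections import deque


class SlidingSum:
    """Sum over a fixed-size window, updated in O(1) per slide."""

    def __init__(self, initial_values):
        self.window = deque(initial_values)
        self.total = sum(self.window)

    def slide(self, new_value):
        # Subtract the value that leaves the window and add the one that enters.
        old_value = self.window.popleft()
        self.window.append(new_value)
        self.total = self.total - old_value + new_value
        return self.total


s = SlidingSum([5, 8, 4, 9, 11, 6, 8, 5, 3, 20, 8, 7, 3])
print(s.total)     # 97
print(s.slide(7))  # 99
```

SUM works this way because the contribution of the expired value is always known exactly.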

SLIDE 6

Updating Top-k counters

T_b = current count for packets of type b

Update: T_b = T_b - T_b(old sub-window) + T_b(new sub-window)

Problem: T_b(old sub-window) might not be part of the summary of the old sub-window
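
A small Python sketch of why the subtraction can fail, assuming each sub-window keeps only its k most frequent types. The helper `top_k_summary` and the sample data are hypothetical.

```python
from collections import Counter


def top_k_summary(packets, k):
    """Keep only the k most frequent types of one sub-window (a lossy summary)."""
    return dict(Counter(packets).most_common(k))


old_sub_window = ["a", "a", "a", "b", "b", "c"]
summary = top_k_summary(old_sub_window, k=1)  # only 'a' survives

true_count_b = old_sub_window.count("b")      # what we would need to subtract
stored_count_b = summary.get("b")             # None: the count was discarded

print(summary, true_count_b, stored_count_b)
```

Once the sub-window expires, the amount to subtract for type b is no longer available, so T_b cannot be maintained exactly.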

SLIDE 7

…but, let’s use the technique anyway

Sub-window summary: IDs and counts of the k most frequent categories

Threshold = sum of the occurrence counts of the least frequent item in the summary of each sub-window

Compute the overall occurrence count for each packet type from the sub-window summaries

Packet types whose count exceeds the threshold are reported as top-k

SLIDE 8

The algorithm

  • Let a, b, c, … be distinct packet types, let k = 3

Top-3 lists of the 13 sub-windows (one column per sub-window, most frequent item on top):

a:17  a:14  d:16  c:22  e:15  b:24  b:21  e:13  c:18  c:12  d:20  f:15  f:17
g:9   c:6   g:12  f:10  k:12  h:8   f:6   a:8   n:11  a:6   e:13  d:7   d:6
e:4   h:4   a:3   j:3   b:4   c:4   m:6   k:4   b:4   b:4   p:8   h:3   r:5

  • Threshold = 4+4+3+…+8+3+5 = 56
  • Total frequency counts from the top-k lists: a=48, b=57, c=62, d=49, e=45, f=48, g=21, h=15, j=3, k=16, m=6, n=11, p=8, r=5

  • Return b and c as frequent items in this window (the only totals above 56)
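
The example can be checked with a short Python sketch. The function name `frequent_items` and the list-of-pairs layout are assumptions made for illustration; the counts are the ones listed above, one inner list per sub-window.

```python
from collections import Counter

# Top-3 summaries of the 13 sub-windows from the example (most to least frequent).
summaries = [
    [("a", 17), ("g", 9), ("e", 4)],
    [("a", 14), ("c", 6), ("h", 4)],
    [("d", 16), ("g", 12), ("a", 3)],
    [("c", 22), ("f", 10), ("j", 3)],
    [("e", 15), ("k", 12), ("b", 4)],
    [("b", 24), ("h", 8), ("c", 4)],
    [("b", 21), ("f", 6), ("m", 6)],
    [("e", 13), ("a", 8), ("k", 4)],
    [("c", 18), ("n", 11), ("b", 4)],
    [("c", 12), ("a", 6), ("b", 4)],
    [("d", 20), ("e", 13), ("p", 8)],
    [("f", 15), ("d", 7), ("h", 3)],
    [("f", 17), ("d", 6), ("r", 5)],
]


def frequent_items(summaries):
    # Threshold: sum of the smallest count kept in each sub-window summary.
    threshold = sum(min(count for _, count in s) for s in summaries)
    # Overall count of each type, using only what the summaries recorded.
    totals = Counter()
    for s in summaries:
        for item, count in s:
            totals[item] += count
    # Report the types whose recorded total exceeds the threshold.
    frequent = sorted(item for item, total in totals.items() if total > threshold)
    return threshold, totals, frequent


threshold, totals, frequent = frequent_items(summaries)
print(threshold)  # 56
print(frequent)   # ['b', 'c']
```

The recorded totals are lower bounds on the true counts, since a type may also occur in sub-windows where it did not make the top-k list.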
SLIDE 9

Hypothesis

If categories are roughly equally distributed, the previous method may not work

But, in a power-law distribution, we expect a few heavy flows, which should register on many top-k lists

Experimented with a TCP trace

1 month of traffic from Lawrence Berkeley Lab to the rest of the world; almost 800 000 packets in total

1647 distinct source IP addresses, which we treated as distinct categories

SLIDE 10

Results: accuracy

[Figure: percentage of identified over-threshold items (y-axis: percent, up to 100) versus k (x-axis: 1 to 10), one curve per sub-window size b = 20, 100, 500. Window size = 100 000 packets.]

SLIDE 11

Results: precision of the reported frequencies

[Figure: relative error in the reported frequencies (y-axis: relative error in percent, up to 14) versus k (x-axis: 1 to 10), one curve per sub-window size b = 20, 100, 500. Window size = 100 000 packets.]

SLIDE 12

Conclusions

Extended sub-window model to a holistic aggregate

Good results due to the non-uniform distribution of Internet traffic

Low space requirements