

High-Throughput Sketch Update on a Low-Power Stream Processor

Yu-Kuen Lai
Dept. of Electrical Engineering
Chung-Yuan Christian Univ.
Chung-Li, Taiwan

Greg Byrd
Dept. of Electrical and Computer Engineering
Center for Embedded Systems Research
NC State University
Raleigh, NC

ANCS 2006

Outline

  • Motivation
  • Data Stream Processing & Imagine Stream Architecture
  • The Sketch Data Structure
  • Implementation and Performance
  • Conclusion & Future Work

Motivation

[Slide from “Querying and Mining Data Streams: You Only Get One Look”, VLDB’02, by Garofalakis et al.]

A growing number of applications operate on streams of data

  • Traffic measurement & analysis for infrastructure planning, capacity forecasting, accounting, and security

Application characteristics

  • Massive volumes of data (several terabytes)
  • Data arrives at a rapid rate

How can we process queries and compute statistics on data streams in real time?

Data Stream Model

A data stream is a massive sequence of elements, $\phi = (a_1, a_2, \ldots)$

  • An input stream arrives sequentially, item by item
  • Each item $a_t = (k_t, u_t)$ consists of a key $k_t \in \{0, \ldots, n-1\}$ (the key space) and an update $u_t$
  • A time-varying signal A[] (an array of n buckets)
  • The arrival of each item causes the signal A[] to be updated:

$$A[k_t] = A[k_t] + u_t$$
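The data stream model can be illustrated with a minimal sketch that keeps the exact signal A[] in memory. The function name `apply_stream` and the sample items are hypothetical and only serve the illustration; the point of the rest of the talk is that the exact array is too large to maintain at line rate.

```python
from typing import Iterable, List, Tuple

def apply_stream(n: int, stream: Iterable[Tuple[int, int]]) -> List[int]:
    """Apply items (k_t, u_t) to a signal A of n buckets: A[k_t] += u_t."""
    A = [0] * n
    for key, update in stream:
        if not 0 <= key < n:
            raise ValueError(f"key {key} outside key space [0, {n})")
        A[key] += update
    return A

# Hypothetical items: keys are flow IDs, updates are packet sizes in bytes.
stream = [(2, 40), (0, 1500), (2, 40), (3, 576)]
signal = apply_stream(4, stream)
print(signal)  # [1500, 0, 80, 576]
```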

What Do We Need?

Fast Counter Update

  • A smaller but faster SRAM

Programmability

  • Accommodate different algorithms
  • Different thresholds and approximation accuracy for different applications

Approximate Answers

  • May not be able to produce exact answers due to limited resources (memory and processing power)

High Computation Power

  • The statistical operations are computationally intensive

Imagine Stream Processor

  • Built by Dr. William J. Dally’s research team at Stanford University
  • Designed for streaming media processing
  • A programmable VLIW co-processor that supports the stream programming model in SIMD fashion
  • Has three levels of memory hierarchy
  • Key modules are organized around the Stream Register File (SRF) through stream buffers

Stream Processing

What is a Stream?

  • An important data representation in the stream programming model
  • A collection of data records of variable length
  • Streams are inputs to kernels, where computation is performed on their elements

[Figure: Stream (www.etplanet.com)]

Programming Model

Stream Level

  • Runs on the host processor
  • Kernel invocation
  • Orchestrates the flow of streams

Kernel Level

  • Computations are done locally within the kernel
  • Operates on streams as inputs and produces streams as outputs
  • No arbitrary memory references
  • Loops are the ONLY control-flow operations
  • Conditionals = predicates
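The split between the two levels can be imitated in plain Python. This is only an analogy under stated assumptions, not Imagine's actual StreamC/KernelC code: the kernel names `scale_kernel` and `clamp_kernel` are hypothetical, each kernel is a single loop over its input stream, and the conditional is expressed as a predicated select rather than a branch.

```python
from typing import Iterable, Iterator, List

def scale_kernel(in_stream: Iterable[int], factor: int) -> Iterator[int]:
    """Kernel level: a single loop that maps an input stream to an output stream."""
    for x in in_stream:
        yield x * factor

def clamp_kernel(in_stream: Iterable[int], limit: int) -> Iterator[int]:
    """Kernel level: the conditional is a predicate that selects between two values."""
    for x in in_stream:
        over = x > limit                      # predicate, evaluated for every element
        yield over * limit + (not over) * x   # select without branching

def run_pipeline(data: List[int]) -> List[int]:
    """Stream level: the host invokes kernels and routes streams between them."""
    scaled = scale_kernel(data, 2)
    clamped = clamp_kernel(scaled, 10)
    return list(clamped)

result = run_pipeline([1, 4, 7, 3])
print(result)  # [2, 8, 10, 6]
```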

Some Performance Highlights

AES (2 Gbps)

  • Advanced Encryption Standard, Rijndael
  • In ECB and OCB modes with key agility
  • Our best/worst case is 32/76 cycles, 41 cycles for variable-sized packets (AIX-1054837521-1)

Bloom filter based Content Inspection Engine (400 Mbps)

  • Matching 2000+ signatures at up to 400 Mbps with a 500 MHz system clock (1500-byte packets)
  • The false positive error rate is 9.8e-7

MMH Message Authentication (7 Gbps)

  • Multilinear Modular Hashing
  • Achieves multi-gigabit throughput; ultra fast and unconditionally secure
  • No matter how much computing power the adversary has, the probability that the adversary compromises the scheme is lower than a probability p

The Count-Min Sketch

[Cormode & Muthukrishnan, Dec 2003]

It’s a probabilistic, approximate algorithm
The BEST existing sketch scheme [2005 G. M. Lee et al.]
The operation is based on a two-dimensional array, count[d][w]

[Figure: hash functions H1(k), H2(k), ..., Hd(k) index the d rows count[1][], count[2][], ..., count[d][], each holding w counters of z bits]

The Count-Min Sketch (Cont.)

Update

  • For each arrived item $a_k = (k, u)$
  • Hash the key by d different independent hash functions
  • Use those d hash values as indexes to update the value into each array:

$$count[j][h_j(k)] \mathrel{+}= u, \qquad 1 \le j \le d$$

Point Query Q(k)

  • Given a key k, again we hash the key
  • Use these hash values as indexes to look up the value stored in each array
  • The answer to Q(k) is the minimum of all these d values:

$$\hat{a}_k = \min_{1 \le j \le d} \{ count[j][h_j(k)] \}$$

[Figure: H1(k), H2(k), ..., Hd(k) index the rows count[1][], count[2][], ..., count[d][] of w z-bit counters]
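The update and point-query steps can be sketched in a few lines of Python. This is an illustrative software model, not the Imagine kernel from the talk: the class name, the seed, and the hash family h(k) = ((a·k + b) mod p) mod w with p = 2^61 − 1 are assumptions chosen for the example.

```python
import random
from typing import List

class CountMinSketch:
    """d rows of w counters; row j is indexed by hash function h_j."""

    PRIME = (1 << 61) - 1  # Mersenne prime larger than any 32-bit key (assumed)

    def __init__(self, d: int, w: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.d, self.w = d, w
        self.count: List[List[int]] = [[0] * w for _ in range(d)]
        # Assumed hash family: one random (a, b) pair per row.
        self.coeffs = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(d)]

    def _hash(self, j: int, key: int) -> int:
        a, b = self.coeffs[j]
        return ((a * key + b) % self.PRIME) % self.w

    def update(self, key: int, u: int = 1) -> None:
        """count[j][h_j(key)] += u for every row j."""
        for j in range(self.d):
            self.count[j][self._hash(j, key)] += u

    def query(self, key: int) -> int:
        """Point query: the minimum of the d counters the key maps to."""
        return min(self.count[j][self._hash(j, key)] for j in range(self.d))

sketch = CountMinSketch(d=8, w=1024)
for key, u in [(0x0A000001, 40), (0x0A000002, 1500), (0x0A000001, 40)]:
    sketch.update(key, u)
estimate = sketch.query(0x0A000001)  # at least 80; larger only if every row collides
```

With non-negative updates the estimate never falls below the true value, because collisions can only add to a counter; taking the minimum over the d rows picks the least-polluted counter.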

Many Applications [Cormode & Muthu, SIAM ICDM05]

Significant differences [Charikar et al, ICALP’02] and relative changes [Cormode & Muthu INFOCOM’04]
Anomaly Detections [Krishnamurthy et al, SIGCOMM’03]
Heavy hitters [Cormode & Muthu ACM PDS 03]
Top-K items [Manku & Motwani ICDM’05]
Estimating frequent items [Manku & Motwani ICVLDB’02]

Change Detection

Monitoring the significant differences in traffic attributes over two observation intervals

Attributes of interest

  • Number of packets
  • Flows
  • Total bytes

Typical Approaches

  • Brute force (store & sort)
  • Sampling
  • Sketch

Sketch-based Change Detection

Absolute difference based on the K-ary sketch by Krishnamurthy et al.

K-ary sketch

  • Same sketch update process
  • Requires 4-universal hash functions
  • Different point query procedure

Three major modules:

  • Sketch update
  • Forecasting
  • Detection

[Figure: the three modules along a time axis. Sketch Update (Kernel A): keys, observed sketch So(t). Forecasting (Kernel B): So(t+1), Sf(t+1), So(t-w+2), So(t+2), forecast sketch Sf(t+2), error sketch Se(t+2). Detection (Kernel C): Sf(t+2), keys, alarms]
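The slides do not say how the 4-universal hash functions are built. One standard construction, shown here as an assumption rather than as the authors' method, evaluates a random polynomial of degree k − 1 over a prime field, which yields k-wise independent values before the final reduction to w buckets. The helper `make_poly_hash` and the choice p = 2^61 − 1 are hypothetical.

```python
import random
from typing import Callable

PRIME = (1 << 61) - 1  # Mersenne prime, assumed field size for this sketch

def make_poly_hash(k: int, w: int, rng: random.Random) -> Callable[[int], int]:
    """Return h(x) = ((c_{k-1} x^{k-1} + ... + c_1 x + c_0) mod p) mod w."""
    coeffs = [rng.randrange(PRIME) for _ in range(k)]

    def h(x: int) -> int:
        acc = 0
        for c in coeffs:              # Horner's rule: k - 1 multiplications by x
            acc = (acc * x + c) % PRIME
        return acc % w

    return h

rng = random.Random(2006)
h2 = make_poly_hash(2, 1024, rng)   # degree-1 polynomial (2-universal style)
h4 = make_poly_hash(4, 1024, rng)   # degree-3 polynomial (4-universal style)
buckets = [h4(key) for key in (1, 2, 3, 0xFFFFFFFF)]
```

The higher-degree polynomial needs more multiplications per key, which is in line with the larger cycle counts the talk reports for 4-universal hashing.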

The Update Module

Sketch-based Change Detection (cont.)

Same update procedure as that in the CM sketch
Updating keys for a time interval $\Delta t$
The update has to be quick enough to process the incoming packets at line rate

  • For every packet: key extraction, hashing, updating the counters
  • It’s 32 ns for processing a minimum-sized IP packet at 10 Gbps

[Figure: Kernel A takes the keys and updates the observed sketch So(t); H1(k), H2(k), ..., Hd(k) index the rows count[1][], count[2][], ..., count[d][] of w z-bit counters]
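The 32 ns figure follows from the packet size and the line rate. A small worked example, assuming a 40-byte minimum-sized IP packet (the size used later in the talk) and the 500 MHz clock quoted for Imagine; the function names are hypothetical.

```python
def packet_time_ns(packet_bytes: int, line_rate_gbps: float) -> float:
    """Time one packet occupies the link: bits / (Gbit/s) gives nanoseconds."""
    return packet_bytes * 8 / line_rate_gbps

def cycle_budget(packet_bytes: int, line_rate_gbps: float, clock_mhz: float) -> float:
    """Clock cycles available per packet at the given line rate."""
    return packet_time_ns(packet_bytes, line_rate_gbps) * clock_mhz / 1000.0

budget_ns = packet_time_ns(40, 10.0)           # 32.0 ns per 40-byte packet
budget_cycles = cycle_budget(40, 10.0, 500.0)  # 16.0 cycles at 500 MHz
```

Under these assumptions, key extraction, hashing, and counter updates together have roughly 16 cycles per packet to keep up with a 10 Gbps link.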

The Forecast Module

Sketch-based Change Detection (cont.)

Forecast model based on the moving average
The forecast sketch

$$S_f(t+1) = \frac{1}{w} \sum_{i=0}^{w-1} S_o(t-i)$$

  • Compute the forecast sketch incrementally

$$S_f(t+2) = S_f(t+1) + \frac{1}{w} \left( S_o(t+1) - S_o(t-w+1) \right)$$

[Figure: Initial State: Kernel B1 takes So(t), So(t-1), ..., So(t-w+1) and produces Sf(t+1). Steady State (time t+2): Kernel B2 takes So(t+1), Sf(t+1), So(t-w+2) and produces Sf(t+2)]
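Because sketches are linear, the moving average can be applied counter by counter. The following minimal sketch mirrors the two cases on the slide (a full sum for the initial state, an incremental update in steady state); the function names and the tiny 1 x 2 sketches are hypothetical.

```python
from collections import deque
from typing import Deque, List

Sketch = List[List[float]]  # d rows of w counters

def combine(a: Sketch, b: Sketch, ca: float, cb: float) -> Sketch:
    """Return ca*a + cb*b counter by counter (sketches are linear)."""
    return [[ca * x + cb * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def initial_forecast(window: Deque[Sketch]) -> Sketch:
    """S_f(t+1) = (1/w) * sum of the last w observed sketches."""
    w = len(window)
    total = [[0.0] * len(window[0][0]) for _ in window[0]]
    for s in window:
        total = combine(total, s, 1.0, 1.0)
    return [[x / w for x in row] for row in total]

def next_forecast(sf: Sketch, newest: Sketch, oldest: Sketch, w: int) -> Sketch:
    """S_f(t+2) = S_f(t+1) + (1/w) * (S_o(t+1) - S_o(t-w+1))."""
    delta = combine(newest, oldest, 1.0 / w, -1.0 / w)
    return combine(sf, delta, 1.0, 1.0)

# Window of w = 2 observed sketches, oldest first.
window: Deque[Sketch] = deque([[[2.0, 4.0]], [[4.0, 8.0]]])
sf1 = initial_forecast(window)                                   # [[3.0, 6.0]]
sf2 = next_forecast(sf1, newest=[[6.0, 0.0]], oldest=window[0], w=2)
print(sf2)  # [[5.0, 4.0]]
```

The incremental form touches only two observed sketches per interval instead of w, at the cost of keeping the oldest sketch of the window around until it is subtracted.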

The Change Detection Module

Sketch-based Change Detection (cont.)

Alarm threshold $T_A$

$$T_A = T \cdot \left[ \mathrm{Estimate\_F2}(S_e(t)) \right]^{1/2}, \qquad S_e(t) = S_o(t) - S_f(t)$$

$$\mathrm{Estimate\_F2}(S_e(t)) = \operatorname*{median}_{1 \le i \le |H|} \{ F_2^{h_i} \}, \qquad F_2^{h_i} = \frac{k}{k-1} \sum_{j \in [k]} (T[i][j])^2 - \frac{1}{k-1} (\mathrm{sum}(S))^2$$

Given a key, raise the alarm if

$$\mathrm{Estimate}(S_e(t), key) > T_A$$

$$\mathrm{Estimate}(S_e(t), key) = \operatorname*{median}_{1 \le i \le |H|} \left\{ \frac{T[i][h_i(key)] - \mathrm{sum}(S)/k}{1 - 1/k} \right\}$$
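The detection step can be sketched as follows, under several assumptions: the table T is the error sketch with one row per hash function and k counters per row, sum(S) is the total of one row, the toy hash functions are placeholders (not 4-universal), and the alarm compares the absolute value of the estimate with the threshold, which is an interpretation of the "absolute difference" wording rather than something the slide states.

```python
from statistics import median
from typing import Callable, List, Sequence

def estimate(table: List[List[float]], hashes: Sequence[Callable[[int], int]],
             key: int) -> float:
    """Median over rows of (T[i][h_i(key)] - sum(S)/k) / (1 - 1/k)."""
    k = len(table[0])
    total = sum(table[0])  # sum(S): every row holds the same total
    return median((row[h(key)] - total / k) / (1 - 1 / k)
                  for row, h in zip(table, hashes))

def estimate_f2(table: List[List[float]]) -> float:
    """Median over rows of k/(k-1) * sum_j T[i][j]^2 - sum(S)^2 / (k-1)."""
    k = len(table[0])
    total = sum(table[0])
    return median(k / (k - 1) * sum(v * v for v in row) - total * total / (k - 1)
                  for row in table)

def alarm(table: List[List[float]], hashes: Sequence[Callable[[int], int]],
          key: int, T: float) -> bool:
    """Raise the alarm when |Estimate| exceeds T_A = T * sqrt(Estimate_F2)."""
    threshold = T * max(estimate_f2(table), 0.0) ** 0.5
    return abs(estimate(table, hashes, key)) > threshold

# Toy error sketch: 3 rows, k = 4, a single key (1) with forecast error 12.
hashes = [lambda key, i=i: (key + i) % 4 for i in range(3)]  # placeholder hashes
table = [[0.0] * 4 for _ in range(3)]
for row, h in zip(table, hashes):
    row[h(1)] += 12.0

changed = alarm(table, hashes, key=1, T=0.5)    # True: estimate 12.0 vs T_A near 6
unchanged = alarm(table, hashes, key=2, T=0.5)  # False: estimate -4.0 vs T_A near 6
```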

The Bottleneck

Sketch-based Change Detection (cont.)

The sketch processing (forecasting and detection) is based on a time interval $\Delta t$

  • Usually the interval is set in minutes, say 1 or 5 minutes
  • Kernels B and C run once per $\Delta t$

The bottleneck is in the sketch update module

Sketch Update Performance

For hashing and updating a 32-bit key

  • 15 cycles (2-universal)
  • 33 cycles (4-universal)

This corresponds to 10.6 Gbps and 4.8 Gbps for processing 40-byte packets with a system clock of 500 MHz
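The throughput figures can be reproduced from the cycle counts, assuming the quoted cycles are the full per-packet cost (one hash-and-update pass per 40-byte packet). The function name is hypothetical.

```python
def throughput_gbps(cycles_per_packet: int, clock_mhz: float,
                    packet_bytes: int = 40) -> float:
    """Sustained throughput when each packet costs a fixed number of cycles."""
    time_ns = cycles_per_packet * 1000.0 / clock_mhz  # cycles -> nanoseconds
    return packet_bytes * 8 / time_ns                 # bits per ns = Gbit/s

two_universal = throughput_gbps(15, 500.0)   # about 10.67 Gbps
four_universal = throughput_gbps(33, 500.0)  # about 4.85 Gbps
```

These values round to the 10.6 Gbps and 4.8 Gbps reported on the slide.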

Intel IXP2800 Network Processor Architecture

Memory Hierarchies (approximate access latency in cycles in parentheses)

Local Memory (3)

  • 2,560 bytes

Scratchpads (60)

  • 16K bytes

SRAM (150)

  • 4 channels, each supports 4GBps @250MHz

DRAM (300)

  • Total size of 2G bytes
  • Bandwidth 50Gbps

Sketch Update on IXP2800

Decisions

  • For a 1,024 x 8 array of 32-bit counters

Local Memory of 2,560 bytes is too small to hold the array
Scratchpad of 16K bytes is too small
DRAM has ~300 cycles of access latency

  • SRAM is the choice

Arrangement

  • Each thread in the ME fetches the packet header and calculates 8 hashes

[Figure: ME 1 computes h0(p1), h1(p1), ..., h7(p1); ME 2 computes h0(p2), h1(p2), ..., h7(p2); ...; ME 8 computes h0(p8), h1(p8), ..., h7(p8)]
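The placement decision is simple arithmetic: the counter array needs 1,024 x 8 x 4 bytes = 32 KB. A small check, assuming the capacities from the architecture slide, treating the parenthesized numbers there as access latencies in cycles, and assuming SRAM is large enough since its capacity is not given; the names `MEMORIES` and `choose_memory` are hypothetical.

```python
from typing import Dict, Optional, Tuple

ROWS, COLS, COUNTER_BYTES = 8, 1024, 4
ARRAY_BYTES = ROWS * COLS * COUNTER_BYTES  # 32,768 bytes

# name -> (capacity in bytes, or None if not stated; access latency in cycles)
MEMORIES: Dict[str, Tuple[Optional[int], int]] = {
    "Local Memory": (2_560, 3),
    "Scratchpad": (16 * 1024, 60),
    "SRAM": (None, 150),            # capacity assumed sufficient
    "DRAM": (2 * 1024**3, 300),
}

def choose_memory(array_bytes: int) -> str:
    """Pick the lowest-latency memory that can hold the whole array."""
    candidates = [(latency, name)
                  for name, (capacity, latency) in MEMORIES.items()
                  if capacity is None or capacity >= array_bytes]
    return min(candidates)[1]

placement = choose_memory(ARRAY_BYTES)
print(ARRAY_BYTES, placement)  # 32768 SRAM
```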

Performance

Atomic Access

  • Sram-test-and-add()
  • The ME issues a single command with address and update value
  • 120 cycles of latency for memory access

2-Universal Hashing (single thread)

  • 70 cycles to hash a 32-bit key

4-Universal Hashing (single thread)

  • 220 cycles to hash a 32-bit key

Performance (cont.)

Can the computation be shortened?

  • Hardware assists

CRC unit may not provide the quality needed
Bus arbitration overhead on shared resources (hash on SHaC unit)

  • Table lookup

Large tables are needed

  • Three 64K x 4-byte tables are needed for tabulation of a 4-universal hash function

May introduce more memory access latency

  • Still bounded by memory access latency

Hide the latency with more threads

Performance (cont.)

Double the number of MEs: 16 instead of 8
Same number of hash calculations and counter updates
36% decrease in total cycles

[Figure: Variance of Performance Statistics (5,120 counter updates and hash calculations based on 640 keys). Cycles, on an axis from 0 to 5.0E+04, for configurations M8T8 and M16T8_half, broken down into Idle, Stalled, Aborted, and Executing]

Worst-case Scenario

Collisions due to hash and update on 1,280 keys of the same source IP address

Comparison

                                  IXP2800        Imagine
Avg. Hash&Update (2-universal)    3ns            3.68ns
Clock                             1,400MHz       500MHz
Hardware                          16 MEs         8 Clusters
Typical Power                     21~26W         2.2W
Max Power                         30W            4W
Technology                        Intel 0.13um   TI 0.15um
Throughput (40-byte packet)       13.34Gbps      10.86Gbps

Some Performance Highlights

AES (2 Gbps)
Bloom filter based Content Inspection Engine (400 Mbps)
MMH Message Authentication (7 Gbps)
Sketch Update

  • 10 Gbps (2-universal)
  • 4.8 Gbps (4-universal)

Conclusion

Low power consumption
High computation power
Flexibility

  • The data structure can serve as the basis for various networking applications
  • Capable of accommodating many dynamic aspects

Adjusting thresholds
Changing statistical models
Changing methodologies due to accuracy requirements

Future Work

Sketch/Stream Manipulations

  • For better accuracy

Integrate the simulation framework

  • Construct the network interfaces at the Media Access Control (MAC) layer
  • Facilitate the exploration of many network system designs

Multi-SIMD hybrid architecture

  • A group of SIMD clusters shares the same microcontroller issuing the instructions, while different groups of these entities behave in MIMD mode