Towards High-performance Flow-level Packet Processing on Multi-core Network Processors - PowerPoint PPT Presentation

Yaxuan Qi (presenter), Bo Xu, Fei He, Baohua Yang, Jianming Yu and Jun Li


SLIDE 1

Towards High-performance Flow-level Packet Processing on Multi-core Network Processors

Yaxuan Qi (presenter), Bo Xu, Fei He, Baohua Yang, Jianming Yu and Jun Li
ANCS 2007, Orlando, USA

SLIDE 2

Outline

  • Introduction
  • Related Work on Multi-core NP
  • Flow-level Packet Processing
      • Flow Classification
      • Flow State Management
      • Per-flow Packet Ordering
  • Summary

SLIDE 3

Outline

  • Introduction
  • Related Work on Multi-core NP
  • Flow-level Packet Processing
      • Flow Classification
      • Flow State Management
      • Per-flow Packet Ordering
  • Summary

SLIDE 4

Introduction

Why high-performance flow-level packet processing?

  • Increasing sophistication of applications: stateful firewalls, deep inspection in IDS/IPS, flow-based scheduling in load balancers
  • Continual growth of network bandwidth: OC-192 or higher link speed, 1 million or more concurrent connections

SLIDE 5

Introduction

Problems in flow-level packet processing:

  • Flow classification. Importance: access control and protocol analysis. Difficulty: high-speed classification with modest memory.
  • Flow state management. Importance: stateful firewalls and anti-DoS. Difficulty: fast updates with a large number of connections.
  • Per-flow packet order-preserving. Importance: content inspection. Difficulty: mutual exclusion and workload distribution.

SLIDE 6

Outline

  • Introduction
  • Related Work on Multi-core NP
  • Flow-level Packet Processing
      • Flow Classification
      • Flow State Management
      • Per-flow Packet Ordering
  • Summary

SLIDE 7

Related Work on Multi-core NP

Intel IXP2850

SLIDE 8

Related Work on Multi-core NP

Programming Challenges:

  • Achieving a deterministic bound on packet processing operations: under the line-rate constraint, the clock cycles to process a packet must have an upper bound.
  • Masking memory latency through multi-threading: memory latencies are typically much higher than the processing budget.
  • Preserving packet order in spite of parallel processing: extremely critical for applications like media gateways and traffic management.

SLIDE 9

Outline

  • Introduction
  • Related Work on Multi-core NP
  • Flow-level Packet Processing
      • Flow Classification
      • Flow State Management
      • Per-flow Packet Ordering
  • Summary

SLIDE 10

Flow Classification

Related work:

  • D. Srinivasan and W. Feng, Lucent Bit-Vector: runs on the Intel IXP1200 NP; supports only 512 rules.
  • D. Liu, B. Hua, X. Hu and X. Tang, Bitmap RFC: achieves near line speed on the Intel IXP2800; needs 100MB+ SRAM for thousands of rules.
  • Our study, Aggregated Cuttings (AggreCuts): near line speed on the IXP2850; consumes less than 10MB of SRAM.

SLIDE 11

Flow Classification

(Algorithm taxonomy diagram, flattened by extraction.) Flow classification algorithms divide into field-independent search algorithms (trie-based: BV, ABV, AFBV, with bit-map and folded bit-map aggregation; table-based: CP, RFC, B-RFC using bit-maps to store rules, and HSM with prefix match, equivalent match, index search, and binary search) and field-dependent search algorithms (trie-based: H-Trie, SP-Trie, GoT, EGT, offering no back-tracking, no rule duplication, and extension to multiple fields; decision-tree: HiCuts with single-field bit-test and range-test cuttings, the multi-field HyperCuts, and the modular AggreCuts, which adds bit-map aggregation).

SLIDE 12

Flow Classification

Why not HiCuts?

  • Non-deterministic worst-case search time, due to the heuristics used to choose the number of cuts.
  • Excessive memory access, due to linear search on leaf nodes (8 rules per leaf, <3Gbps on IXP28xx).

Our motivations:

  • Fix the number of cuttings at internal nodes: if the number of cuttings is fixed to 2^w, a worst-case bound of O(W/w) is achieved (where W is the header width and w is the stride).
  • Eliminate linear search at leaf nodes: linear search can be eliminated if we "keep cutting" until every sub-space is fully covered by a certain set of rules.

Consider the common 5-tuple flow classification problem: with W=104 and w=8, the worst-case search time is 104/8 = 13 memory accesses (nearly the same as RFC), and no linear search is required.
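The depth bound can be sketched in a few lines. This is an illustrative toy (my own simplification, not the paper's data structure) showing only that a fixed stride w caps the tree depth at ceil(W/w), independent of the rule set:

```python
# Toy sketch: a decision tree whose internal nodes always make 2**w
# cuttings. Each lookup step consumes w bits of the W-bit header, so
# search depth is bounded by ceil(W / w) regardless of the rules.
import math

W = 104  # 5-tuple header width in bits (src IP, dst IP, ports, proto)
w = 8    # stride: bits consumed per tree node

def worst_case_accesses(header_width: int, stride: int) -> int:
    """Worst-case number of node visits for a fixed-stride tree."""
    return math.ceil(header_width / stride)

def lookup_depth(key: int, header_width: int = W, stride: int = w) -> int:
    """Walk a (hypothetical) full tree: one node per stride of the key."""
    depth = 0
    remaining = header_width
    while remaining > 0:
        # each node would index its 2**stride child array with these bits
        _child = (key >> max(remaining - stride, 0)) & ((1 << stride) - 1)
        depth += 1
        remaining -= stride
    return depth

print(worst_case_accesses(W, w))  # 13, matching the slide's 104/8
```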

SLIDE 13

Flow Classification

Space Aggregation

SLIDE 14

Flow Classification

Data structure (32-bit internal-node word):

  Bits   Field                         Value
  31:30  dimension to cut (d2c)        00: src IP; 01: dst IP; 10: src port; 11: dst port
  29:28  bit position to cut (b2c)     00: bits 31~24; 01: 23~16; 10: 15~8; 11: 7~0
  27:20  8-bit HABS                    if w=8, each bit represents 32 cuttings; if w=4, each bit represents 2 cuttings
  19:0   next-node CPA base address    the minimum memory block is 2^w/8*4 bytes, so with w=8 a 20-bit base address supports a 128MB memory address space; with w=4 it supports 8MB
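A minimal sketch of packing and unpacking this node word, assuming the bit layout in the table above (function names are mine, not from the paper):

```python
# Hypothetical encoder/decoder for the 32-bit internal-node word:
# [31:30] d2c, [29:28] b2c, [27:20] 8-bit HABS, [19:0] next-node base.
def pack_node(d2c: int, b2c: int, habs: int, base: int) -> int:
    assert 0 <= d2c < 4 and 0 <= b2c < 4
    assert 0 <= habs < 256 and 0 <= base < (1 << 20)
    return (d2c << 30) | (b2c << 28) | (habs << 20) | base

def unpack_node(word: int):
    return ((word >> 30) & 0x3,      # d2c: which 5-tuple dimension to cut
            (word >> 28) & 0x3,      # b2c: which byte of that dimension
            (word >> 20) & 0xFF,     # HABS aggregation bitmap
            word & 0xFFFFF)          # next-node array base address

# round-trip check
word = pack_node(0, 1, 0xFF, 0x00100)
assert unpack_node(word) == (0, 1, 0xFF, 0x00100)
```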

SLIDE 15

Flow Classification

  • Performance Evaluation
  • Memory usage: an order of magnitude less
  • Memory accesses: 3~8 times fewer
  • Throughput on IXP2850: 3~5 times faster

(Charts: memory accesses in 32-bit words, memory usage in MB, and throughput in Mbps over rule sets SET01~SET07, comparing HiCuts, AggreCuts-4, and AggreCuts-8.)

SLIDE 16

Outline

  • Introduction
  • Related Work on Multi-core NP
  • Flow-level Packet Processing
      • Flow Classification
      • Flow State Management
      • Per-flow Packet Ordering
  • Summary

SLIDE 17

Flow State Management

Flow state management:

  • Problem: a large number of updates over a short period of time; updates must happen at line speed.
  • Solution: hashing with exact match; the costs are collisions and computation.
  • Our aim:
      • Support a large number of concurrent sessions with an extremely low collision rate: more than 10M sessions, less than 1% collision rate.
      • Achieve fast update speed using both SRAM and DRAM: near line-speed update rate.

SLIDE 18

Flow State Management

Signature-based Hashing (SigHash)

  • m signatures for m different states with the same hash value
  • Resolving collisions in SRAM (fast, word-oriented)
  • Storing states in DRAM (large, burst-oriented)
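The scheme can be sketched in a few lines. This is an illustrative model, not the paper's code: the bucket count, m=4, and the CRC-derived signature are my assumptions.

```python
# Illustrative SigHash-style table: each SRAM bucket stores m small
# signatures; full flow state lives in a DRAM array. A lookup checks
# fast SRAM first and reads DRAM only on a signature hit.
import zlib

M = 4                     # signatures (and state slots) per bucket
NUM_BUCKETS = 1 << 16

sram = [[None] * M for _ in range(NUM_BUCKETS)]   # signature words
dram = [[None] * M for _ in range(NUM_BUCKETS)]   # flow states

def _hashes(flow_key: bytes):
    h = zlib.crc32(flow_key)
    bucket = h & 0xFFFF               # low bits pick the bucket
    sig = (h >> 16) & 0xFFFF          # high bits form the signature
    return bucket, sig

def insert(flow_key: bytes, state) -> bool:
    bucket, sig = _hashes(flow_key)
    for i in range(M):
        if sram[bucket][i] is None or sram[bucket][i] == sig:
            sram[bucket][i] = sig
            dram[bucket][i] = state
            return True
    return False                      # all m slots taken: collision exception

def lookup(flow_key: bytes):
    bucket, sig = _hashes(flow_key)
    for i in range(M):
        if sram[bucket][i] == sig:
            return dram[bucket][i]    # single DRAM read on a hit
    return None
```

Two distinct flows that share both bucket and signature would alias; the slide's "less than 1% collision rate" is exactly the frequency of such exceptions at a given load factor.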
SLIDE 19

Flow State Management

Performance Evaluation

  • Throughput: 10Gbps
  • Connections: 10M
  • Collision: less than 1%, depending on the load factor

(Charts: throughput in Gbps vs. number of threads (8~64) for DirectHash and SigHash; exception rate (0%~25%) vs. load factor (4 down to 0.125).)

SLIDE 20

Outline

  • Introduction
  • Related Work on Multi-core NP
  • Flow-level Packet Processing
      • Flow Classification
      • Flow State Management
      • Per-flow Packet Ordering
  • Summary

SLIDE 21

Per-flow Packet Ordering

Packet Order-preserving

  • Typically only required between packets of the same flow.

External Packet Order-preserving (EPO)

  • Sufficient for devices that process packets at the network layer.
  • Fine-grained workload distribution (packet-level); needs locking.

Internal Packet Order-preserving (IPO)

  • Required by applications that process packets at semantic levels.
  • Coarse-grained workload distribution (flow-level); does not need locking.

SLIDE 22

Per-flow Packet Ordering

External Packet Order-preserving (EPO)

  • Ordered-thread execution:
      • An ordered critical section reads the packet handles off the scratch ring.
      • The threads then process the packets, which may get out of order during processing.
      • Another ordered critical section writes the packet handles to the next stage.
  • Mutual exclusion by atomic operations:
      • Packets belonging to the same flow may be allocated to different threads.
      • Mutual exclusion can be implemented by locking, using SRAM atomic instructions.
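A minimal sketch of ordered-thread execution, using a condition variable in place of the IXP's inter-thread signaling (the class and all names are mine, for illustration only):

```python
# Threads may finish packet processing in any order, but each enters
# the "write out" critical section strictly in packet-arrival order.
import threading

class OrderedSection:
    """Admits callers one at a time, in ascending sequence order."""
    def __init__(self):
        self._turn = 0
        self._cv = threading.Condition()

    def enter(self, seq: int):
        with self._cv:
            while self._turn != seq:
                self._cv.wait()

    def leave(self):
        with self._cv:
            self._turn += 1
            self._cv.notify_all()

out_section = OrderedSection()
results = []

def worker(seq: int, pkt: str):
    # ... unordered packet processing would happen here ...
    out_section.enter(seq)    # ordered critical section begins
    results.append(pkt)       # e.g. write handle to the next-stage ring
    out_section.leave()

threads = [threading.Thread(target=worker, args=(i, f"pkt{i}"))
           for i in range(8)]
for t in reversed(threads):   # start in reverse order to provoke reordering
    t.start()
for t in threads:
    t.join()
print(results)                # packet handles emitted in arrival order
```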

SLIDE 23

Per-flow Packet Ordering

Internal Packet Order-preserving (IPO)

  • SRAM Q-Array
  • Workload Allocation by CRC Hashing on Headers
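The allocation step can be sketched as follows, assuming CRC-32 over the 5-tuple (the IXP computes CRC in hardware; the queue count and byte packing here are illustrative):

```python
# Flow-level workload allocation: hash the 5-tuple with CRC so that
# every packet of a flow lands on the same queue. Per-flow packet
# order is then preserved without locking.
import struct
import zlib

NUM_QUEUES = 8   # e.g. one SRAM queue per processing thread group

def queue_for(src_ip: int, dst_ip: int,
              src_port: int, dst_port: int, proto: int) -> int:
    """Map a 5-tuple to a queue index via CRC-32 of the packed header."""
    header = struct.pack("!IIHHB", src_ip, dst_ip, src_port, dst_port, proto)
    return zlib.crc32(header) % NUM_QUEUES

# the mapping is deterministic: same flow, same queue, every time
q = queue_for(0x0A000001, 0xC0A80001, 1234, 80, 6)
assert all(queue_for(0x0A000001, 0xC0A80001, 1234, 80, 6) == q
           for _ in range(3))
```

Because real traffic is Zipf-like, a few heavy flows can still skew per-queue load, which is the imbalance noted on the next slide.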

SLIDE 24

Per-flow Packet Ordering

Performance Evaluation

  • Throughput: EPO is faster, 10Gbps; IPO has linear speed-up, 7Gbps.
  • Workload allocation: CRC hashing works well (even on Zipf-like traffic), though it could be better.

(Charts: packet drop rate over time for queue lengths 512, 1024, and 2048; throughput in Gbps vs. number of threads (8~64) for IPO and EPO.)

SLIDE 25

Outline

  • Introduction
  • Related Work on Multi-core NP
  • Flow-level Packet Processing
      • Flow Classification
      • Flow State Management
      • Per-flow Packet Ordering
  • Summary

SLIDE 26

Summary

Contribution:

  • An NP-optimized flow classification algorithm: explicit worst-case search time, 9Gbps; hierarchical bitmap space aggregation, up to 16:1.
  • An efficient flow state management scheme: fast update rate, 10Gbps; exploits the memory hierarchy, 10M connections with a low collision rate.
  • Two hardware-supported packet order-preserving schemes: EPO via ordered-thread execution, 10Gbps; IPO via SRAM queue-array, 7Gbps.

Future work:

  • Adaptive decision-tree algorithms for different memory hierarchies?
  • SRAM SYN-cookie for fast session creation?
  • Flow-let workload distribution?

SLIDE 27

Thanks ☺ Questions?