Dynamic Pipelining: Making IP-Lookup Truly Scalable, Jahangir Hasan (PowerPoint presentation)



Slide 1

Dynamic Pipelining: Making IP-Lookup Truly Scalable

Jahangir Hasan T. N. Vijaykumar School of Electrical & Computer Engineering Purdue University

SIGCOMM 2005

Slide 2

Internet Growth and Router Design

Number of hosts and total traffic growing exponentially
More hosts → larger routing tables
Higher line-rates → faster IP-lookup
Need for worst-case guarantees:

  • Robust system design / testing
  • Network stability / security

Exponential demand yet worst-case guarantee needed

Slide 3

Background on IP-lookup

Routing table: (prefix, next-hop) pairs
IP-lookup: find the longest matching prefix for the destination address
[Figure: incoming packets → IP-lookup → output queues → outgoing packets]

Slide 4

Challenge of Scalable IP-lookup

IP-lookup should scale well in:

  • 1. Space – grow slowly with #prefixes
  • 2. Throughput – match line rates
  • 3. Power – grow slowly with #prefixes, line rates
  • 4. Updates – O(1), independent of #prefixes
  • 5. Cost – reasonable chip area

Many IP-lookup proposals to date
None addresses all factors with worst-case guarantees
This work is the first to attempt worst-case guarantees for all factors

Slide 5

Previous Work

As line-rates grow

  • Packet inter-arrival time < memory access times
  • Throughput matters more than latency

→ Must overlap multiple lookups using pipelining

[Table: comparison of TCAMs, HLP [Varghese et al. – ISCA ’03], DLP [Basu, Narlikar – Infocom ’03], and Our Scheme on Space, Throughput, Updates, Power, and Area]

Slide 6

Contributions

Scalable Dynamic Pipelining (SDP)
First to address all 5 factors under worst-case guarantees
→ Size 4x better than previous schemes
→ Throughput matches future line rates
Pipeline at both the hardware and the data-structure level
→ Optimum updates: not just O(1) but exactly 1 write per update
→ Low power and cost for future line rates

Slide 7

Outline

Introduction
Previous pipelined IP-lookup schemes
Our Scheme: Scalable Dynamic Pipelining
Experimental Results
Conclusions

Slide 8

Background: Trie-based IP-lookup

Tree data-structure with prefixes at the leaves
Process the destination address level-by-level to find the longest match
Each level in a different stage → overlap multiple packets

[Figure: example 1-bit trie with prefixes P1–P7 at the leaves, e.g. P4 = 10010*]
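The lookup the slide describes is easy to sketch; below is our own toy 1-bit trie (class and function names are invented, and the example prefixes mirror the slide's P1/P2/P4), not the paper's pipelined data structure:

```python
# A toy 1-bit trie and longest-prefix match. Our own minimal sketch, not
# the paper's hardware-pipelined structure.

class Node:
    def __init__(self):
        self.children = {}   # '0' / '1' -> Node
        self.nexthop = None  # set if a prefix ends at this node

def insert(root, prefix_bits, nexthop):
    node = root
    for b in prefix_bits:
        node = node.children.setdefault(b, Node())
    node.nexthop = nexthop

def longest_match(root, addr_bits):
    """Walk the address bit by bit, remembering the deepest prefix seen."""
    node, best = root, None
    for b in addr_bits:
        if node.nexthop is not None:
            best = node.nexthop
        node = node.children.get(b)
        if node is None:
            return best
    return node.nexthop if node.nexthop is not None else best

root = Node()
for bits, hop in [("1", "P1"), ("10", "P2"), ("10010", "P4")]:
    insert(root, bits, hop)

print(longest_match(root, "10010110"))  # P4: the longest matching prefix wins
print(longest_match(root, "10110000"))  # P2
```

Note how the match for "10010110" passes shorter matches P1 and P2 before settling on P4, which is exactly the level-by-level walk the slide pipelines.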

Slide 9

Closest Previous Work: [Infocom’03] Data Structure Level Pipelining (DLP)

Map trie level to pipeline stage, but this is a static mapping
Updates change the prefix distribution, but the mapping persists
In the worst case any one stage can hold all the prefixes
→ Large worst-case memory needed for each stage
[Figure: example updates with prefixes 0* (P1), 00* (P2), 000* (P3) shifting prefixes across levels while the level-to-stage mapping stays fixed]

Slide 10

Closest Previous Work: [Infocom’03] Data Structure Level Pipelining (DLP)

No bound on worst-case update time
→ Could be made O(1) using Tree Bitmap
But the constant is huge: 1852 memory accesses per update [SIGCOMM Computer Communication Review ’04]

Slide 11

Outline

Introduction
Previous pipelined IP-lookup schemes
Our Scheme: Scalable Dynamic Pipelining
Experimental Results
Conclusions

Slide 12

Key Idea: Use Dynamic Mapping

Map node height to stage (instead of level to stage)
Height changes with updates, so it captures the distribution → hence the name dynamic mapping
#Nodes at a given height is limited (but not at a given level)
→ Limited #nodes per stage → small per-stage memory
[Figure: the same example updates with 0* (P1), 00* (P2), 000* (P3), now re-mapped to stages by node height]
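The height-to-stage rule can be sketched on a toy trie; this is our own dict-based illustration (the width W = 8 and all names are invented), not the paper's code:

```python
# A sketch of SDP's dynamic mapping: a node of height h (distance to the
# deepest leaf below it) is assigned to stage W - h for W-bit addresses.
# Tries are nested dicts; leaves are empty dicts.

W = 8  # toy address width

def add(trie, bits):
    """Insert a prefix given as a bit-string, creating interior nodes as needed."""
    for b in bits:
        trie = trie.setdefault(b, {})

def stage_population(trie):
    """Count nodes per pipeline stage under the height -> (W - height) map."""
    counts = {}
    def height(node):
        h = 1 + max((height(c) for c in node.values()), default=-1)  # leaf: 0
        counts[W - h] = counts.get(W - h, 0) + 1
        return h
    height(trie)
    return counts

trie = {}
for prefix in ["0", "10", "11"]:
    add(trie, prefix)

print(stage_population(trie))  # {8: 3, 7: 1, 6: 1}: leaves in stage W, taller nodes earlier
```

When an update changes a node's height, the same function would simply place it in a different stage, which is the "dynamic" part of the mapping.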

Slide 13

But Updates Inefficient?

Updates may change the height of arbitrarily many nodes
Must migrate all affected nodes to new stages
Does this mean updates are inefficient? Surprisingly, no …
We leverage the very mapping that causes the problem to achieve optimum updates

Slide 14

A Problematic Peculiarity of Tries

In some cases height does not capture the prefix distribution:

  • Strings of one-child nodes

Such strings artificially distort the relation between height and distribution
Jump nodes compress away such strings, restoring the relation between height and distribution
[Figure: prefixes 1* (P4) and 1010* (P5); the one-child string below P4 is collapsed into a jump node labeled “Jump 010”]
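Jump-node compression can be sketched on a nested-dict trie; the representation and names below are ours, not the paper's (a real router would also store the skipped bits for matching):

```python
# Jump-node sketch: collapse every chain of one-child nodes into a single
# edge labeled with the skipped bits, so a node's height again tracks how
# many prefixes sit below it. Tries are dicts keyed by bit-strings.

def compress(trie):
    """Return a copy of `trie` with every one-child chain merged into one edge."""
    out = {}
    for bits, child in trie.items():
        child = compress(child)
        # after recursion, a child with exactly one edge is a chain head: absorb it
        while len(child) == 1:
            (more_bits, grandchild), = child.items()
            bits += more_bits
            child = grandchild
        out[bits] = child
    return out

# prefixes 00001* and 1*: the long one-child chain under '0' gets jumped
trie = {"0": {"0": {"0": {"0": {"1": {}}}}}, "1": {}}
print(compress(trie))  # {'00001': {}, '1': {}}
```

Before compression the leaf under "00001" has an ancestor path of height 5 holding a single prefix; after compression every internal node has two children, which is exactly the property the next slide relies on.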

Slide 15

1-bit Tries with Jump Nodes

Key properties:
  • (1) Number of leaves = number of prefixes
    No replication; avoids the inflation of prefix expansion and leaf-pushing
  • (2) Updates do not propagate to subtrees
    Because there is no replication
  • (3) Each internal node has 2 children
    Jump nodes collapse away single-child nodes

Slide 16

Total versus Per-Stage Memory

Jump nodes bound the total size by 2N
Would DLP + jump nodes → small per-stage memory?
No: DLP is still a static mapping → large worst-case per-stage memory
The total is bounded, but not the per-stage size
[Figure: trie of depth W split into the top log2 N levels and the bottom W - log2 N levels, with N prefixes]

Slide 17

SDP’s Per-Stage Memory Bound

Proposition: Map all nodes of height h to the (W-h)th pipeline stage
Result: Size of kth stage = min( N / (W-k) , 2^k )

Slide 18

Key Observation #1

A node of height h has at least h prefixes in its subtree:
At least one path of length h leads from the node to some leaf, with h-1 nodes along the path
Each of those nodes leads to at least 1 other leaf
So the path accounts for (h-1) + 1 = h leaves = h prefixes

Slide 19

Key Observation #2

No more than N/h nodes of height h, for any prefix distribution:
Assume more than N/h nodes of height h
Each accounts for at least h prefixes (obs #1), and their subtrees are disjoint since equal-height nodes cannot be ancestors of one another
The total number of prefixes would then exceed N
By contradiction, obs #2 is true
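Observation #2 can also be checked mechanically. The sketch below is our own code, under the slide-16 assumption that every internal node has exactly two children; it grows random full binary tries and verifies the N/h bound:

```python
# Structural check of obs #2: in a binary trie whose internal nodes all
# have two children, the number of nodes of height h never exceeds N / h,
# where N is the number of leaves (prefixes).

import random

def random_full_binary(n_leaves):
    """Grow a tree with n_leaves leaves by repeatedly splitting a random leaf."""
    root = {}
    leaves = [root]
    while len(leaves) < n_leaves:
        leaf = leaves.pop(random.randrange(len(leaves)))
        leaf["0"], leaf["1"] = {}, {}
        leaves += [leaf["0"], leaf["1"]]
    return root

def count_heights(node, counts):
    """Record every node's height (leaves are height 0) into `counts`."""
    h = 1 + max((count_heights(c, counts) for c in node.values()), default=-1)
    counts[h] = counts.get(h, 0) + 1
    return h

random.seed(0)
N = 200
counts = {}
count_heights(random_full_binary(N), counts)

# obs #2: at most N/h nodes of height h, for every h >= 1
assert all(c <= N / h for h, c in counts.items() if h >= 1)
print("obs #2 holds on a random", N, "leaf trie")
```

The assertion never fires, for any seed: it is the contradiction argument above made concrete, since same-height nodes own disjoint subtrees of at least h leaves each.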

Slide 20

Main Result of the Proposition

Map all nodes of height h to the (W-h)th pipeline stage
The kth stage holds only N / (W-k) nodes, from obs #2
A 1-bit trie has binary fanout → at most 2^k nodes in the kth stage
Size of kth stage = min( N / (W-k) , 2^k ) nodes
Results in ~20 MB for 1 million prefixes, 4x better than DLP
[Figure: per-stage memory under static pipelining (DLP) vs dynamic pipelining (SDP)]
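Plugging numbers into the bound, with our own arithmetic for N = 1 million and W = 32 (the node-byte figure at the end is our back-of-envelope, not a number from the paper):

```python
# Evaluating the proposition's bound: stage k holds at most
# min(N/(W-k), 2^k) nodes, so the whole pipeline needs only a few million
# nodes, while a static (DLP-style) mapping must provision for up to N
# nodes in *every* stage.

N, W = 1_000_000, 32

per_stage = [min(N // (W - k), 2 ** k) for k in range(W)]
total = sum(per_stage)

print(f"total nodes across all stages: {total:,}")   # ~3.4 million
print(f"largest single stage: {max(per_stage):,}")   # the last stage, N/1
print(f"static worst case: {N * W:,} nodes ({N:,} in each of {W} stages)")
```

At the slide's ~20 MB total, ~3.4 million worst-case nodes works out to roughly 6 bytes per node, consistent with small fixed-size nodes and no variable striding.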

Slide 21

Optimum Incremental Updates

1 update → may change the height, and hence the stage, of many nodes
Must migrate all affected nodes → does this make updates inefficient?
Key: only the ancestors’ heights can be affected
Each ancestor is in a different stage = 1 node-write per stage = 1 write bubble for any update
Updating SDP is not just O(1) but exactly 1 write bubble
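The "only ancestors" claim can be seen in a toy: the sketch below (our own dict trie, ignoring jump nodes and stored next-hops) inserts a new leaf and checks which nodes' heights changed:

```python
# Why an update is one write bubble: inserting a leaf can change the height
# only of that leaf's ancestors, and since every ancestor has a distinct
# height it sits in a distinct stage, at most one node-write per stage.

def insert(trie, bits):
    node = trie
    for b in bits:
        node = node.setdefault(b, {})

def all_heights(trie):
    """Map each node's path (as a bit-string) to its height."""
    out = {}
    def rec(node, path):
        h = 1 + max((rec(c, path + b) for b, c in node.items()), default=-1)
        out[path] = h
        return h
    rec(trie, "")
    return out

trie = {}
for p in ["00", "01", "1"]:
    insert(trie, p)
before = all_heights(trie)

insert(trie, "11")  # hypothetical new prefix
after = all_heights(trie)

changed = {p for p in before if before[p] != after[p]}
ancestors = {"11"[:d] for d in range(len("11"))}   # "" (root) and "1"
assert changed <= ancestors   # only ancestors can change height/stage
print(sorted(changed))        # only '1' changes here; the root stays at height 2
```

Since ancestor depths are all distinct, the changed nodes land in distinct stages, and the update rides down the pipeline as a single bubble writing one node per stage.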

Slide 22

Efficient Memory Management

No variable striding / compression → all nodes are the same size
No fragmentation or compaction upon updates

Slide 23

Scaling SDP for Throughput

Each SDP stage can be further pipelined in hardware
HLP [ISCA ’03] pipelines only in hardware, without DLP → too deep at high line-rates
Combining HLP + SDP gives a feasibly deep hardware pipeline
Throughput matches future line rates
[Figure: SDP stages of size N / (W-k) or 2^k, each subdivided into 1 to 3 HLP hardware stages]

Slide 24

Outline

Introduction
Previous pipelined IP-lookup schemes
Our Scheme: Scalable Dynamic Pipelining
Experimental Results
Conclusions

Slide 25

Experimental Methodology

Worst-case prefix distribution and packet arrival rate
CACTI 2.0 for simulating memories
Modified CACTI to model TCAM, HLP, DLP and SDP

Slide 26

Dynamic Pipelining: Tighter Memory Bound

[Graph: total memory (MB, 20-100) vs number of prefixes (thousands, up to 1000) for TCAM, DLP, HLP and SDP]

Slide 27

Dynamic Pipelining: Low Power

[Graph: power (Watts, 20-100) vs line rate (2.5, 10, 40, 160 Gbps) for TCAM, DLP, HLP and SDP]

SDP’s small memory + shallow hardware pipeline: Low power

Slide 28

Dynamic Pipelining: Small Area

[Graph: chip area (cm sq, 10-80) vs line rate (2.5, 10, 40, 160 Gbps) for TCAM, DLP, HLP and SDP]

* TCAM: pipelining overhead ignored, an unfair advantage
SDP’s small memory + shallow hardware pipeline: small area

Slide 29

Outline

Introduction
Previous pipelined IP-lookup schemes
Our Scheme: Scalable Dynamic Pipelining
Experimental Results
Conclusions

Slide 30

Conclusions

Previous schemes use a static level-to-stage mapping
We proposed a dynamic height-to-stage mapping
Dynamic mapping enables SDP’s scalability:

  • Worst-case memory size 4x better
  • Scales well up to 160 Gbps
  • Optimum update, 1 write-bubble per update
  • Efficient memory management
  • Low power
  • Low implementation cost
Slide 31

Questions?

The following slides are the actual questions and answers asked after the presentation at SIGCOMM ’05.

Slide 32

Did you use real routing tables in the experiments? Which ones?

No. We used the worst-case prefix distribution for all experiments
The distribution is shown in the paper, Section 3.2.1, which gives an intuitive “proof” that the distribution is the worst case
The same worst case is used by previous work: DLP [Basu, Narlikar – INFOCOM ’03]

Slide 33

Doesn’t “Tree Based Router Architecture” in ISCA ’05 solve the same problem?

“A Tree Based Router Search Engine Architecture with Single Port Memories”, Baboescu et al., International Symposium on Computer Architecture (ISCA) 2005
The same question was the main complaint in a review
ISCA ’05 was in June, well after the SIGCOMM submission deadline
Having said that, the ISCA paper:

  • Does not show how to size stages for N prefixes
  • Makes stage sizes equal given a particular distribution
  • Shows that building this balanced pipeline is O(N)
  • Does not address how to maintain balance upon updates
  • Does not address throughput scalability
  • Has no worst-case analysis for size, throughput, or update cost
Slide 34

Doesn’t the large number of banks (32 to 128) imply a high implementation cost?

1 million prefixes = 20 MB = 160 Mbits
We are talking about an on-chip implementation, so there is no pin-count issue
We show actual area estimates for 100 nm
Of course, by the time we reach 1 million prefixes, technology will scale to allow 160 Mbit on-chip
Scaling our 100 nm area to 50 nm gives < 4 cm sq

Slide 35

Did you assume a 1-bit trie for all schemes and all experiments?

HLP and DLP are multi-bit trie schemes; we address this issue in Section 6.1
We first explore the design space over all possible strides
We pick the optimum stride for HLP and for DLP
All experiments were performed using these optimum strides