SLIDE 1 Towards 10Gb/s open-source routing
Olof Hagsand (KTH), Robert Olsson (Uppsala U), Bengt Görden (KTH)
Linuxkongress 2008
SLIDE 2 Introduction
- Investigate packet forwarding performance of new PC hardware:
– Multi-core CPUs
– Multiple PCI-e buses
– 10G NICs
- Can we obtain enough performance to use open-source
routing also in the 10Gb/s realm?
SLIDE 3 Measuring throughput
- Per-packet costs
– CPU processing, I/O and memory latency, clock frequency
- Per-byte costs
– Bandwidth limitations of bus and memory
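As a rough worked example (standard Ethernet framing figures, not numbers from the measurements): at 10 Gb/s, minimum-size 64-byte frames occupy 84 bytes on the wire including preamble and inter-frame gap, i.e. about 14.88 Mpps, whereas 1500-byte frames (1538 bytes on the wire) correspond to only about 0.81 Mpps. Small packets therefore stress the per-packet costs, large packets the per-byte costs.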
SLIDE 4 Measuring throughput
[Graph: offered load vs. throughput, showing the breakpoint where drops begin and the capacity level]
SLIDE 5 Inside a router, HW style
Specialized hardware: ASICs, NPUs, backplane with switching stages or crossbars
[Diagram: four line cards, each with buffer memory and a forwarder, attached to a switched backplane together with a CPU card holding the RIB]
SLIDE 6 Inside a router, PC-style
- Every packet goes twice over shared bus to the CPU
- Cheap, but low performance
- But let's increase the # of CPUs and # of buses!
[Diagram: several line cards and buffer memory on a shared-bus backplane, with a single CPU holding the RIB]
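A back-of-the-envelope consequence (derived from the slide, not a measured figure): since every forwarded packet crosses the shared bus twice, sustaining 10 Gb/s of forwarding requires at least 20 Gb/s of raw bus bandwidth, which is one reason multiple PCIe buses are interesting.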
SLIDE 7 Block hw structure (set 1)
SLIDE 8 Hardware – Box (set 2)
AMD Opteron 2356 with one quad-core 2.3GHz Barcelona CPU on a Tyan 2927 motherboard (2U)
SLIDE 9 Hardware – NIC
Intel 10G board, 82598 chipset. Open chip specs – thanks Intel!
SLIDE 10 Lab
SLIDE 11 Equipment summary
- Hardware needs to be carefully selected
- Bifrost Linux on kernel 2.6.24-rc7 with LC-trie forwarding
- Tweaked pktgen
- Set 1: AMD Opteron 2222 with two dual-core 3GHz CPUs on a Tyan Thunder n6650W (S2915) motherboard
- Set 2: AMD Opteron 2356 with one quad-core 2.3GHz Barcelona CPU on a Tyan 2927 motherboard (2U)
- Dual PCIe buses
- 10GE network interface cards.
– PCI Express x8 lanes based on Intel 82598 chipset
SLIDE 12 Experiments
– Upper limits on (hw) platform
– Realistic forwarding performance
SLIDE 13 Tx Experiments
– Just to see how much the hw can handle – upper limit
- Loopback tests over fibers
- Don't process RX packets; just let the MAC count them
- These numbers give an indication of what forwarding capacity is possible
– Single CPU TX on a single interface
– Four CPUs TX on one interface each
[Diagram: tested device with its interfaces looped back over fibers]
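The kernel pktgen module used for these Tx tests is driven by writing text commands into files under /proc/net/pktgen. Below is a minimal sketch of such a setup (run as root, with the pktgen module loaded); the interface name, destination address, MAC and packet count are placeholders, and none of the authors' pktgen tweaks are reflected here.

/*
 * Minimal sketch of driving pktgen the way the Tx tests above do:
 * bind a device to the pktgen thread on CPU 0 and blast fixed-size packets.
 */
#include <stdio.h>
#include <stdlib.h>

/* Write one pktgen command to a /proc/net/pktgen control file. */
static void pgset(const char *path, const char *cmd)
{
    FILE *f = fopen(path, "w");
    if (!f) {
        perror(path);
        exit(1);
    }
    fprintf(f, "%s\n", cmd);
    fclose(f);
}

int main(void)
{
    /* Attach eth0 to the pktgen kernel thread running on CPU 0. */
    pgset("/proc/net/pktgen/kpktgend_0", "rem_device_all");
    pgset("/proc/net/pktgen/kpktgend_0", "add_device eth0");

    /* Configure the stream: packet size, count, destination (placeholders). */
    pgset("/proc/net/pktgen/eth0", "pkt_size 64");
    pgset("/proc/net/pktgen/eth0", "count 10000000");
    pgset("/proc/net/pktgen/eth0", "dst 10.0.0.2");
    pgset("/proc/net/pktgen/eth0", "dst_mac 00:11:22:33:44:55");

    /* Start transmission; results appear in /proc/net/pktgen/eth0. */
    pgset("/proc/net/pktgen/pgctrl", "start");
    return 0;
}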
SLIDE 14 Tx single sender: Packets per second
SLIDE 15 Tx single sender: Bandwidth
SLIDE 16 Tx – Four CPUs: Bandwidth
[Graph: per-CPU and summed Tx bandwidth for CPUs 1–4, packet length 1500 bytes]
SLIDE 17 TX experiments summary
- Single Tx sender is primarily limited by PPS at around
3.5Mpps
- A bandwidth of 25.8 Gb/s and a packet rate of 10 Mpps using four CPU cores and two PCIe buses
- This shows that the hw itself allows 10Gb/s performance
- We also see nice symmetric Tx between the CPU cores.
SLIDE 18 Forwarding experiments
– Realistic forwarding performance
- Overload measurements (packets are lost)
- Single forwarding path from one traffic source to one traffic
sink
– A single IP flow was forwarded using a single CPU.
– A realistic multiple-flow stream with varying destination addresses and packet sizes, using a single CPU.
– Multi-queues on the interface cards were used to dispatch
different flows to four different CPUs.
[Diagram: test generator → tested device → sink device]
SLIDE 19 Single flow, single CPU: Packets per second
SLIDE 20 Single flow, single CPU: Bandwidth
SLIDE 21 Single sender forwarding summary
- Virtually wire-speed for 1500-byte packets
- Little difference between forwarding on same card, different
ports, or between different cards
– Performance seems slightly better on the same card, but the difference is not significant
- The primary limiting factor is pps, around 900 Kpps
- TX has a small effect on overall performance
SLIDE 22 Introducing realistic traffic
- For the rest of the experiments we introduce a more realistic
traffic scenario
– Simple model based on realistic packet distribution data
- Multiple flows (multiple dst IPs)
– This is also necessary for the multi-core experiments, since NIC classification is done with a hash algorithm over the packet headers
SLIDE 23 Packet size distribution (cdf)
Real data from www.caida.org, WIDE, Aug 2008
SLIDE 24 Flow distribution
- Flows have size and duration distributions
- 8000 simultaneous flows
- Each flow 30 packets long
– Mean flow duration is 258 ms (by Little's law: 8000 simultaneous flows / 31000 new flows per second ≈ 0.258 s)
- 31000 new flows per second
– Measured by dst cache misses
- Destinations spread randomly over 11.0.0.0/8
- FIB contains ~ 280K entries
– 64K entries in 11.0.0.0/8
- This flow distribution is relatively aggressive
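A minimal sketch of this traffic model, as a consistency check: the constants below come from this slide, while the random destination draw and the printed derivations are illustrative only and do not reproduce the authors' generator configuration.

/* Illustrative sketch of the multi-flow traffic model on this slide. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

#define NEW_FLOWS_PER_SEC    31000.0
#define PACKETS_PER_FLOW     30
#define MEAN_FLOW_DURATION_S 0.258

/* Draw a random destination in 11.0.0.0/8, as the experiments do. */
static uint32_t random_dst(void)
{
    uint32_t host = (uint32_t)rand() & 0x00FFFFFFu;  /* 24 random host bits */
    return (11u << 24) | host;
}

int main(void)
{
    srand((unsigned)time(NULL));

    /* Little's law: flows in flight = arrival rate * mean duration (~8000). */
    printf("expected simultaneous flows: %.0f\n",
           NEW_FLOWS_PER_SEC * MEAN_FLOW_DURATION_S);

    /* Offered packet rate implied by the model: flows/s * packets/flow. */
    printf("offered packet rate: %.0f pps\n",
           NEW_FLOWS_PER_SEC * PACKETS_PER_FLOW);

    /* A few example destinations spread over 11.0.0.0/8. */
    for (int i = 0; i < 4; i++) {
        uint32_t d = random_dst();
        printf("dst %u.%u.%u.%u\n",
               (d >> 24) & 0xff, (d >> 16) & 0xff,
               (d >> 8) & 0xff, d & 0xff);
    }
    return 0;
}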
SLIDE 25 Multi-flow and single-CPU: PPS & BW
[Graphs: max/min PPS and bandwidth for Set 1 and Set 2, with a small routing table vs. 280K entries, and with ipfilters disabled vs. enabled]
SLIDE 26 Multi-Q experiments
- Use more CPU cores to handle forwarding
- NIC classification (Receive Side Scaling, RSS) uses a hash algorithm to select the input queue
- Allocate several interrupt channels, one for each CPU.
- Flows are distributed evenly between CPUs
– need aggregated traffic with multiple flows
– Is the processing of flows dispatched evenly?
– Will performance increase as CPUs are added?
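To make the classification step concrete, here is a sketch of the usual RSS scheme: a Toeplitz hash over the IPv4 addresses and TCP ports, with the low-order bits of the hash indexing into a table of RX queues (each queue tied to a CPU via its own interrupt). The key, queue count and final queue mapping below are illustrative examples, not the 82598's actual configuration.

/* Sketch of RSS flow dispatch: Toeplitz hash over the 4-tuple -> RX queue. */
#include <stdio.h>
#include <stdint.h>

#define KEY_LEN    40          /* RSS secret key length in bytes */
#define NUM_QUEUES 4           /* one RX queue (and IRQ) per CPU core */

/* Example key (a widely used sample RSS key); drivers may use another. */
static const uint8_t rss_key[KEY_LEN] = {
    0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2,
    0x41, 0x67, 0x25, 0x3d, 0x43, 0xa3, 0x8f, 0xb0,
    0xd0, 0xca, 0x2b, 0xcb, 0xae, 0x7b, 0x30, 0xb4,
    0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30, 0xf2, 0x0c,
    0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa,
};

/* Toeplitz hash: for every input bit that is 1, XOR in the 32-bit window
 * of the key that starts at that bit position. */
static uint32_t toeplitz(const uint8_t *data, size_t len)
{
    uint32_t hash = 0;
    uint32_t window = ((uint32_t)rss_key[0] << 24) | ((uint32_t)rss_key[1] << 16) |
                      ((uint32_t)rss_key[2] << 8)  |  (uint32_t)rss_key[3];

    for (size_t i = 0; i < len; i++) {
        for (int b = 0; b < 8; b++) {
            if (data[i] & (0x80u >> b))
                hash ^= window;
            /* Slide the key window one bit to the left. */
            window <<= 1;
            size_t kbit = 32 + i * 8 + b;
            if (kbit < KEY_LEN * 8 &&
                (rss_key[kbit / 8] & (0x80u >> (kbit % 8))))
                window |= 1;
        }
    }
    return hash;
}

/* Map a flow's 4-tuple to an RX queue the way RSS does. */
static int rss_queue(uint32_t saddr, uint32_t daddr,
                     uint16_t sport, uint16_t dport)
{
    uint8_t input[12];
    /* Hash input in network byte order: saddr, daddr, sport, dport. */
    input[0] = saddr >> 24; input[1] = saddr >> 16;
    input[2] = saddr >> 8;  input[3] = saddr;
    input[4] = daddr >> 24; input[5] = daddr >> 16;
    input[6] = daddr >> 8;  input[7] = daddr;
    input[8] = sport >> 8;  input[9] = sport;
    input[10] = dport >> 8; input[11] = dport;

    uint32_t hash = toeplitz(input, sizeof(input));
    /* Example indirection: low bits of the hash mapped onto the queues. */
    return (int)(hash & 0x7f) % NUM_QUEUES;
}

int main(void)
{
    /* Two example flows into 11.0.0.0/8 usually land on different queues. */
    printf("flow 1 -> queue %d\n", rss_queue(0x0a000001, 0x0b000001, 1024, 80));
    printf("flow 2 -> queue %d\n", rss_queue(0x0a000002, 0x0b0000ff, 2048, 80));
    return 0;
}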
SLIDE 27 Multi-flow and Multi-CPU (set 1)
[Graph: 1 CPU vs. 4 CPUs, with per-CPU breakdown (CPU #1–#4); 64-byte packets only]
SLIDE 28 Results MultiQ
- Packets are evenly distributed between the four CPUs.
- But forwarding using one CPU is better than using four CPUs!
- Why is this?
SLIDE 29 Profiling.
Single CPU:
samples   %        symbol name
396100    14.8714  kfree
390230    14.6510  dev_kfree_skb_irq
300715    11.2902  skb_release_data
156310     5.8686  eth_type_trans
142188     5.3384  ip_rcv
106848     4.0116  __alloc_skb
 75677     2.8413  raise_softirq_irqoff
 69924     2.6253  nf_hook_slow
 69547     2.6111  kmem_cache_free
 68244     2.5622  netif_receive_skb

Multiple CPUs:
samples   %        symbol name
1087576   22.0815  dev_queue_xmit
 651777   13.2333  __qdisc_run
 234205    4.7552  eth_type_trans
 204177    4.1455  dev_kfree_skb_irq
 174442    3.5418  kfree
 158693    3.2220  netif_receive_skb
 149875    3.0430  pfifo_fast_enqueue
 116842    2.3723  ip_finish_output
 114529    2.3253  __netdev_alloc_skb
 110495    2.2434  cache_alloc_refill
SLIDE 30 Multi-Q analysis
- With multiple CPUs, TX processing consumes a large share of the CPU time, making the use of more CPUs sub-optimal
– The profile above (dev_queue_xmit, __qdisc_run) is consistent with all cores contending on the device's single transmit queue and qdisc
- It turns out that the Tx and Qdisc code needs to be adapted
to scale up performance
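To illustrate the scaling problem in isolation, the toy program below (a sketch, not kernel code; all names and structure are invented) has four threads "transmit" either through one shared, locked queue or through per-thread queues, and times both. The shared-lock case mirrors how all cores end up serializing on a single device Tx queue and qdisc.

/* Toy contrast: one shared Tx queue vs. per-CPU Tx queues. */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define NCPUS 4
#define PKTS_PER_CPU 1000000

static pthread_mutex_t shared_lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_len;

static pthread_mutex_t percpu_lock[NCPUS] = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};
static long percpu_len[NCPUS];

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

/* All "CPUs" enqueue on one device queue: every packet takes the same lock. */
static void *tx_shared(void *arg)
{
    (void)arg;
    for (long i = 0; i < PKTS_PER_CPU; i++) {
        pthread_mutex_lock(&shared_lock);
        shared_len++;                      /* stands in for enqueue + xmit */
        pthread_mutex_unlock(&shared_lock);
    }
    return NULL;
}

/* Each "CPU" has its own Tx queue: no cross-CPU serialization. */
static void *tx_percpu(void *arg)
{
    long cpu = (long)arg;
    for (long i = 0; i < PKTS_PER_CPU; i++) {
        pthread_mutex_lock(&percpu_lock[cpu]);
        percpu_len[cpu]++;
        pthread_mutex_unlock(&percpu_lock[cpu]);
    }
    return NULL;
}

static double run(void *(*fn)(void *))
{
    pthread_t t[NCPUS];
    double start = now_sec();
    for (long i = 0; i < NCPUS; i++)
        pthread_create(&t[i], NULL, fn, (void *)i);
    for (int i = 0; i < NCPUS; i++)
        pthread_join(t[i], NULL);
    return now_sec() - start;
}

int main(void)
{
    printf("one shared Tx queue: %.3f s\n", run(tx_shared));
    printf("per-CPU Tx queues:   %.3f s\n", run(tx_percpu));
    return 0;
}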
SLIDE 31 MultiQ: Updated drivers
- We recently made new measurements (not in paper) using
updated driver code
- We also used hw set 2 (Barcelona) to get better results
- We now see an actual improvement when we add one
processor
SLIDE 32 Multi-flow and Multi-CPU (set 2)
[Graph: forwarding performance with 1, 2 and 4 CPUs]
SLIDE 33 Conclusions
- Tx and forwarding results towards 10Gb/s performance using
Linux and selected hardware
- For optimal results, hw and sw must be carefully selected.
- >25Gb/s Tx performance
- Near 10Gb/s wirespeed forwarding for large packets
- Identified bottleneck for multi-q and multi-core forwarding.
- If this bottleneck is removed, scaling performance up to 10Gb/s and beyond using several CPU cores is possible.