SLIDE 1 FIRST Technical Colloquium
February 10-11, 2003 @ Uppsala, Sweden
Bifrost: a high performance router & firewall
Robert Olsson, Hans Wassen
SLIDE 2 Bifrost concept
- Small-size Linux distribution targeted at flash disks (20 MB)
- Optimized for networking/firewalling
- Tested with selected drivers and hardware
- Open platform for development and collaboration
- Results and experiences shared
SLIDE 3 Bifrost concept
- Linux kernel collaboration
  * FASTROUTE, HW_FLOWCONTROL, new NAPI for the network stack
- Performance testing, development of tools and testing techniques
- Hardware validation, support from big vendors
- Detect and cure problems in the lab, not in the network infrastructure
- Test deployment (often in our own network)
SLIDE 4
Collaboration/development: The New API
SLIDE 5 Core Problems
- Heavy net load: system congestion collapse
  * High interrupt rates
    + Livelock and cache locality effects
    + Interrupts are simply expensive
  * CPU
    + Interrupt driven: takes too long to drop a bad packet
  * Bus (PCI)
    + Packets are still being DMAed when the system is overloaded
  * Memory bandwidth
    + Continuous allocs and frees to fill the DMA rings
- Unfairness in case of a hogger netdev
SLIDE 6 Overall Effect
- Inelegant handling of heavy net loads
  * System collapse
  * System and number of NICs
    + A single hogger netdev can bring the system to its knees and deny service to others
[Chart: summary of forwarding rates, Linux 2.4 vs. feedback]
- March 15 report on lkml
- Thread: "How to optimize routing perfomance", reported by Marten.Wikstron@framsfab.se
- Linux 2.4 peaks at 27 Kpps
- Pentium Pro 200, 64 MB RAM
SLIDE 7 Looking inside the box
[Diagram: backlog queue processing. Incoming packets from devices are enqueued to the backlog queue at IRQ time if the queue is not full; at a later time the net RX softirq delivers them to the stack. Forwarding and locally generated packets take the transmit path.]
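To make the old path concrete, here is a rough sketch of the enqueue step in the diagram above: the interrupt handler hands every received packet to netif_rx(), which puts it on the per-CPU backlog queue if there is room and drops it otherwise. This is a simplified illustration of the 2.4-era behaviour, not the verbatim kernel code.

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/interrupt.h>

/* Simplified sketch of the pre-NAPI netif_rx() enqueue step. */
static int netif_rx_old_sketch(struct sk_buff *skb)
{
        /* One backlog queue per CPU. */
        struct softnet_data *queue = &softnet_data[smp_processor_id()];

        if (skb_queue_len(&queue->input_pkt_queue) <= netdev_max_backlog) {
                /* Room left: enqueue and let the net RX softirq deliver
                 * the packet to the stack at a later time. */
                __skb_queue_tail(&queue->input_pkt_queue, skb);
                cpu_raise_softirq(smp_processor_id(), NET_RX_SOFTIRQ);
                return NET_RX_SUCCESS;
        }

        /* Queue full: drop, but the interrupt and DMA work is already spent. */
        kfree_skb(skb);
        return NET_RX_DROP;
}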
SLIDE 8 BYE BYE Backlog queue
- Packet stays in the original queue (e.g. the DMA ring)
- Net RX softirq (see the driver-side sketch below)
  * foreach dev in poll list:
    + Calls dev->poll() to grab up to quota packets
    + Device drivers are polled from the softirq; packets are pulled and delivered to the network stack
    + The driver indicates done/not done:
      . Done ==> we go back to IRQ mode
      . Not done ==> the device remains on the polling list
  * The net RX softirq breaks off after one jiffy or netdev_max_backlog packets
  * This is to ensure that other tasks get to run
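A minimal sketch of the driver-side pattern just listed, using the NAPI interface of the 2.4.20/2.5 kernels (netif_rx_schedule(), dev->poll(), netif_rx_complete(), netif_receive_skb()). The mydev_* helpers are hypothetical hardware accessors, not part of any real driver.

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Interrupt handler: mask RX interrupts and schedule the device for polling. */
static void mydev_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
        struct net_device *dev = dev_id;

        mydev_disable_rx_irq(dev);      /* hypothetical hardware helper */
        netif_rx_schedule(dev);         /* put dev on this CPU's poll list */
}

/* dev->poll(): pull up to the allowed quota of packets off the RX ring. */
static int mydev_poll(struct net_device *dev, int *budget)
{
        int quota = dev->quota < *budget ? dev->quota : *budget;
        int work = 0;
        struct sk_buff *skb;

        while (work < quota && (skb = mydev_next_rx_skb(dev)) != NULL) {
                netif_receive_skb(skb);         /* deliver straight to the stack */
                work++;
        }

        *budget -= work;
        dev->quota -= work;

        if (mydev_rx_ring_empty(dev)) {
                netif_rx_complete(dev);         /* done: leave the poll list */
                mydev_enable_rx_irq(dev);       /* back to IRQ mode */
                return 0;
        }
        return 1;                               /* not done: stay on the poll list */
}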
SLIDE 9
A high-level view of the new system
[Figure: packets (P) on the RX rings of several netdevs, split into an interrupt area and a polling area, with the quota marked.]
- P packets to deliver to the stack (on the RX ring)
- Horizontal lines show different netdevs with different input rates
- The area under the curve shows how many packets arrive before the next interrupt
- Quota enforces a fair share (see the softirq-side sketch below)
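For completeness, here is a much simplified sketch of the softirq side that enforces the per-device quota and the one-jiffy/netdev_max_backlog limits mentioned above. It illustrates the mechanism only and is not the exact net_rx_action() code.

#include <linux/netdevice.h>
#include <linux/list.h>

/* Simplified net RX softirq loop: round-robin over the poll list,
 * bounded by a global packet budget and roughly one jiffy of time. */
static void net_rx_softirq_sketch(struct softnet_data *queue)
{
        int budget = netdev_max_backlog;
        unsigned long start = jiffies;

        while (!list_empty(&queue->poll_list)) {
                struct net_device *dev;

                if (budget <= 0 || jiffies - start > 1)
                        break;  /* let other tasks run; the softirq is rerun later */

                dev = list_entry(queue->poll_list.next,
                                 struct net_device, poll_list);

                if (dev->quota <= 0 || dev->poll(dev, &budget)) {
                        /* Quota used up or driver says "not done": move the
                         * device to the tail and refresh its quota, so every
                         * netdev gets a fair share. */
                        list_del(&dev->poll_list);
                        list_add_tail(&dev->poll_list, &queue->poll_list);
                        dev->quota = dev->weight;
                }
                /* A "done" driver removes itself via netif_rx_complete(). */
        }
}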
SLIDE 10
Kernel support
The NAPI kernel part was included in 2.5.7 and backported to 2.4.20.
Current driver support:
- e1000 -- Intel GigE NICs
- tg3 -- Broadcom GigE NICs
- dl2k -- D-Link GigE NICs
- tulip (pending) -- 100 Mbit/s
SLIDE 11
NAPI: observations & issues
"Ooh, I get even more interrupts... with polling." As we have seen, NAPI is an interrupt/polling hybrid: it uses interrupts to guarantee low latency, and at high loads interrupts never get re-enabled, so consecutive polls occur. The old scheme added an interrupt delay to keep the CPU from being killed by interrupts. With NAPI we can do without this delay for the first time, but that means more interrupts in low-load situations. Should we add an interrupt delay just out of old habit?
SLIDE 12
Tested device
Flexible netlab at Uppsala University
Measured: raw packet performance, TCP, timing, variants.
[Diagram: Linux test generator --Ethernet--> device under test --Ethernet--> Linux sink. "El cheapo" hardware, highly customizable: we write the code :-)]
SLIDE 13 Hardware for high perf. Networking
Motherboard
- CPU: uni- or multi-processor
- Chipset: BX, ServerWorks, E750X
- Bus/PCI design: number of PCI buses @ 133 MHz
- Interrupt design: PIC, IO-APIC, etc.
- Standby power (Wake on LAN) can be a problem with many NICs
SLIDE 14 Hardware for high perf. Networking
- ServerWorks and Intel E750X chipsets: many PCI-X hubs/bridges and dual XEON
- PCI-X is here: bus at 8.5 Gbit/s
- Many vendors use Compact PCI already
[Diagram: two CPUs attached to a processor/I/O/memory controller, with memory and two PCI-X I/O bridges, each bridge hosting NICs.]
SLIDE 15
Hardware for high perf. Networking
Currently Intel has the advantage; Broadcom can be a dark horse. All have NAPI drivers.
GigE chipsets available for PCI:
- Intel e1000 -- e1000
- Broadcom BCM5700 -- tg3
- D-Link dl-2k -- dl2k
Some board manufacturers switch chipsets often. Chip documentation is a problem.
SLIDE 16 Some GIGE experiments/NAPI
[Chart: ping latency/fairness under extreme load, UP. Latency in microseconds for pings 1-8, through an idle router vs. through a router under a 890 kpps DoS attack.]
Very well behaved: just an increase of a couple of hundred microseconds!
SLIDE 17 Some GIGE experiments
[Chart: pktgen sending test with 11 GigE interfaces (eth0-eth10), packets/sec per interface, clone vs. alloc. 2*XEON 1.8 GHz, sending 1518-byte packets; 81300 pps is 1 Gbit/s.]
Clone = 8.5 Gbit/s, Alloc = 5.4 Gbit/s
ServerWorks X5DL8-GG, Intel e1000
SLIDE 18 Some GIGE experiments
[Chart: pktgen sending test with 11 GigE interfaces (eth0-eth10), packets/sec per interface, clone vs. alloc. 2*XEON 1.8 GHz with HyperThreading on, sending 1518-byte packets; 81300 pps is 1 Gbit/s.]
Clone = 10.0 Gbit/s, Alloc = 7.4 Gbit/s
ServerWorks X5DL8-GG, Intel e1000
SLIDE 19 Some GIGE experiments
[Chart: aggregated sending performance from pktgen with 11 GigE interfaces, in Kpps, alloc vs. clone, with and without HyperThreading. XEON 2*1.8 GHz, 64-byte packets; 1.48 Mpps = 1 Gbit/s.]
SLIDE 20 Forwarding performance
[Chart: Linux forwarding rate (Kpps) vs. packet size (64-1518 bytes), input vs. throughput. Linux 2.5.58, UP, skb recycling, 1.8 GHz XEON.]
Fills a GigE pipe starting from 256-byte packets.
SLIDE 21 R&D
[Diagram: IO-APIC distributing eth0/eth1 interrupts across CPU 0 and CPU 1; the eth1 TX ring holds skb's from different CPUs.]
- Parallelization vs. serialization: eth1 holds skb's from different CPU's
- Clearing the TX buffers (releasing skb's) causes cache bouncing
- For user apps the new scheduler handles affinity, but for packet forwarding eth0->eth1 on CPU0 we can set the affinity of eth1 to CPU0
- It would be nice to use the other CPU for forwarding too :-)
SLIDE 22 R&D
- Very high transaction packet memory system needed for GigE and upcoming 10GE
- Profiling indicates the slab is not fully per-CPU: SMP 2-CPU 300 kpps, SMP 1-CPU 302 kpps

Counter 0 counted GLOBAL_POWER_EVENTS events
vma       samples  %-age    symbol name
c0138e96  37970    8.23162  cache_alloc_refill
c0229490  37247    8.07488  alloc_skb
c0235e90  32491    7.04381  qdisc_restart
c0235b54  27891    6.04657  eth_type_trans

Note: setting input affinity helps, but we would like to work on the general problem.

vma       samples  %-age    symbol name
c02296d2  25675    8.67698  skb_release_data
c0235b54  24438    8.25893  eth_type_trans
c0235e90  24047    8.12679  qdisc_restart
c0229490  18188    6.14671  alloc_skb
c0110a1c  15741    5.31974  do_gettimeofday
SLIDE 23 R&D
[Chart: router profile, XEON no HT, 2*1.8 GHz. Routing throughput in Kpps for: V UP gcc-3.1, V SMP2 gcc-3.1, V SMP1 gcc-3.1, V SMP2 gcc-2.95.3, RC UP gcc-3.1, RC UP gcc-2.95.3, RC SMP2 gcc-3.1, RC SMP1 gcc-3.1, IA SMP2 gcc-3.1, IA RC SMP2 gcc-3.1.]
V = vanilla, UP = uniprocessor, SMP1 = SMP with 1 CPU, SMP2 = SMP with 2 CPUs, RC = skb recycling, IA = input affinity.
Profiled with P4/XEON performance counters: GLOBAL_POWER_EVENTS, MISPRED_BRANCH_RETIRED, BSQ_CACHE_REFERENCE, MACHINE_CLEAR, ITLB_REFERENCE.
SLIDE 24 NAPI/SMP in production use: uu.se
[Diagram: dual uplinks to Stockholm; DMZ (AS 2834) with routers UU-1 and UU-2; Linux routers L-uu1 and L-uu2 facing the internal UU network.]
PIII 933 MHz, 2.4.10-poll/SMP, full Internet routing via EBGP/IBGP.
SLIDE 25 Real world use: ftp.sunet.se
[Diagram: ftp servers Ftp0-Ftp2 behind a switch and Linux routers Archive-r1 and Archive-r2, connected via a GSR to Stockholm over OC-48 (AS 1653, AS 15980).]
PIII 933 MHz, NAPI/IRQ. Load sharing and redundancy with Router Discovery. Full Internet routing via EBGP/IBGP.
SLIDE 26 IP-login -- a Linux router app.
User-authenticated routing
[Diagram: user@host connected to the IP-login router.]
Users can only reach the IP-login router, which hosts a web server.
User web requests are redirected to this web server and asked for username and password, possibly checked against an authentication server (today TACACS). If the user/password is accepted: 1) forwarding is enabled for the host, and 2) monitoring with arping is started. Loss of arping replies disables forwarding again (see the sketch below).
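A hypothetical user-space sketch of the control loop described above: once authentication has succeeded (the TACACS check is not shown), forwarding is enabled for the host and arping is used to monitor it; after a few missed replies forwarding is disabled again. The firewall rule, the arping options and the loss limit are illustrative assumptions, not the actual IP-login implementation.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define ARPING_LOSS_LIMIT 3     /* consecutive missed arpings before logout */

/* Enable or disable forwarding for one host (illustrative iptables rule). */
static void set_forwarding(const char *ip, int enable)
{
        char cmd[256];

        snprintf(cmd, sizeof(cmd), "iptables -%c FORWARD -s %s -j ACCEPT",
                 enable ? 'I' : 'D', ip);
        if (system(cmd) != 0)
                fprintf(stderr, "failed: %s\n", cmd);
}

/* One liveness probe: arping exits 0 when the host answered. */
static int host_answers_arping(const char *ip, const char *dev)
{
        char cmd[256];

        snprintf(cmd, sizeof(cmd), "arping -q -c 1 -w 1 -I %s %s", dev, ip);
        return system(cmd) == 0;
}

int main(int argc, char **argv)
{
        const char *ip  = argc > 1 ? argv[1] : "192.0.2.10";  /* example host */
        const char *dev = argc > 2 ? argv[2] : "eth0";
        int lost = 0;

        set_forwarding(ip, 1);                  /* 1) enable forwarding           */
        while (lost < ARPING_LOSS_LIMIT) {      /* 2) monitor the host            */
                lost = host_answers_arping(ip, dev) ? 0 : lost + 1;
                sleep(10);
        }
        set_forwarding(ip, 0);                  /* loss of arping: forwarding off */
        return 0;
}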
Based on stolen code from: Pawel Krawczyk -- tacacs client Alexey Kuznetsov -- arping
SLIDE 27 IP-login installation at Uppsala University
Approx. 1000 outlets
SLIDE 28
A new network symbol has been seen...
The Penguin Has Landed
SLIDE 29 References and Other Stuff
- http://bifrost.slu.se
- http://www.pdos.lcs.mit.edu/click/ (claim they can do 435 Kpps on a PIII 700)
- http://www.cyberus.ca/~hadi/usenix-paper.tgz
- Some other work: http://robur.slu.se/Linux/net-development/