SLIDE 1 FIRST Technical Colloquium
February 10-11, 2003 @ Uppsala, Sweden
Bifrost: a high performance router & firewall
Robert Olsson, Hans Wassen
SLIDE 2 Bifrost concept
- Small-size Linux distribution targeted at flash disks (20 MB)
- Optimized for networking/firewalling
- Tested with selected drivers and hardware
- Open platform for development and collaboration
- Results and experiences shared
SLIDE 3 Bifrost concept
- Linux kernel collaboration
  * FASTROUTE, HW_FLOWCONTROL, new NAPI for the network stack
- Performance testing, development of tools and testing techniques
- Hardware validation, support from big vendors
- Detect and cure problems in the lab, not in the network infrastructure
- Test deployment (often in our own network)
SLIDE 4
Collaboration/development: The New API
SLIDE 5 Core Problems
- Heavy net load: system congestion collapse
  * High interrupt rates
    + Livelock and cache locality effects
    + Interrupts are simply expensive
  * CPU
    + Interrupt driven: takes too long to drop a bad packet
  * Bus (PCI)
    + Packets are still being DMAed when the system is overloaded
  * Memory bandwidth
    + Continuous allocs and frees to fill the DMA rings
- Unfairness in case of a hogger netdev
SLIDE 6 Overall Effect
- Inelegant handling of heavy net loads
  * System collapse
  * System and number of NICs
    + A single hogger netdev can bring the system to its knees and deny service to others
[Chart: summary of forwarding rates, Linux 2.4 vs. feedback]
- March 15 report on lkml
- Thread: "How to optimize routing perfomance", reported by Marten.Wikstron@framsfab.se
- Linux 2.4 peaks at 27 Kpps
- Pentium Pro 200, 64 MB RAM
SLIDE 7 Looking inside the box
[Diagram: backlog queue processing. Incoming packets from devices are enqueued to the backlog queue at IRQ time if the queue is not full; at a later time the net RX softirq delivers them to the stack. Forwarding and locally generated packets take the transmit path.]
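To make the old path concrete, here is a rough sketch of the enqueue step in the diagram above: the interrupt handler hands every received packet to netif_rx(), which puts it on the per-CPU backlog queue if there is room and drops it otherwise. This is a simplified illustration of the 2.4-era behaviour, not the verbatim kernel code.

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/interrupt.h>

/* Simplified sketch of the pre-NAPI netif_rx() enqueue step. */
static int netif_rx_old_sketch(struct sk_buff *skb)
{
        /* One backlog queue per CPU. */
        struct softnet_data *queue = &softnet_data[smp_processor_id()];

        if (skb_queue_len(&queue->input_pkt_queue) <= netdev_max_backlog) {
                /* Room left: enqueue and let the net RX softirq deliver
                 * the packet to the stack at a later time. */
                __skb_queue_tail(&queue->input_pkt_queue, skb);
                cpu_raise_softirq(smp_processor_id(), NET_RX_SOFTIRQ);
                return NET_RX_SUCCESS;
        }

        /* Queue full: drop, but the interrupt and DMA work is already spent. */
        kfree_skb(skb);
        return NET_RX_DROP;
}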
SLIDE 8 BYE BYE Backlog queue
- Packet stays in the original queue (e.g. the DMA ring)
- Net RX softirq (see the driver-side sketch below)
  * foreach dev in poll list:
    + Calls dev->poll() to grab up to quota packets
    + Device drivers are polled from the softirq; packets are pulled and delivered to the network stack
    + The driver indicates done/not done:
      . Done ==> we go back to IRQ mode
      . Not done ==> the device remains on the polling list
  * The net RX softirq breaks off after one jiffy or netdev_max_backlog packets
  * This is to ensure that other tasks get to run
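A minimal sketch of the driver-side pattern just listed, using the NAPI interface of the 2.4.20/2.5 kernels (netif_rx_schedule(), dev->poll(), netif_rx_complete(), netif_receive_skb()). The mydev_* helpers are hypothetical hardware accessors, not part of any real driver.

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Interrupt handler: mask RX interrupts and schedule the device for polling. */
static void mydev_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
        struct net_device *dev = dev_id;

        mydev_disable_rx_irq(dev);      /* hypothetical hardware helper */
        netif_rx_schedule(dev);         /* put dev on this CPU's poll list */
}

/* dev->poll(): pull up to the allowed quota of packets off the RX ring. */
static int mydev_poll(struct net_device *dev, int *budget)
{
        int quota = dev->quota < *budget ? dev->quota : *budget;
        int work = 0;
        struct sk_buff *skb;

        while (work < quota && (skb = mydev_next_rx_skb(dev)) != NULL) {
                netif_receive_skb(skb);         /* deliver straight to the stack */
                work++;
        }

        *budget -= work;
        dev->quota -= work;

        if (mydev_rx_ring_empty(dev)) {
                netif_rx_complete(dev);         /* done: leave the poll list */
                mydev_enable_rx_irq(dev);       /* back to IRQ mode */
                return 0;
        }
        return 1;                               /* not done: stay on the poll list */
}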
SLIDE 9
A high-level view of the new system
[Figure: packets (P) on the RX rings of several netdevs, split into an interrupt area and a polling area, with the quota marked.]
- P packets to deliver to the stack (on the RX ring)
- Horizontal lines show different netdevs with different input rates
- The area under the curve shows how many packets arrive before the next interrupt
- Quota enforces a fair share (see the softirq-side sketch below)
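For completeness, here is a much simplified sketch of the softirq side that enforces the per-device quota and the one-jiffy/netdev_max_backlog limits mentioned above. It illustrates the mechanism only and is not the exact net_rx_action() code.

#include <linux/netdevice.h>
#include <linux/list.h>

/* Simplified net RX softirq loop: round-robin over the poll list,
 * bounded by a global packet budget and roughly one jiffy of time. */
static void net_rx_softirq_sketch(struct softnet_data *queue)
{
        int budget = netdev_max_backlog;
        unsigned long start = jiffies;

        while (!list_empty(&queue->poll_list)) {
                struct net_device *dev;

                if (budget <= 0 || jiffies - start > 1)
                        break;  /* let other tasks run; the softirq is rerun later */

                dev = list_entry(queue->poll_list.next,
                                 struct net_device, poll_list);

                if (dev->quota <= 0 || dev->poll(dev, &budget)) {
                        /* Quota used up or driver says "not done": move the
                         * device to the tail and refresh its quota, so every
                         * netdev gets a fair share. */
                        list_del(&dev->poll_list);
                        list_add_tail(&dev->poll_list, &queue->poll_list);
                        dev->quota = dev->weight;
                }
                /* A "done" driver removes itself via netif_rx_complete(). */
        }
}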
SLIDE 10
Kernel support
The NAPI kernel part was included in 2.5.7 and backported to 2.4.20.
Current driver support:
- e1000 -- Intel GigE NICs
- tg3 -- Broadcom GigE NICs
- dl2k -- D-Link GigE NICs
- tulip (pending) -- 100 Mbit/s
SLIDE 11
NAPI: observations & issues
"Ooh, I get even more interrupts... with polling." As we have seen, NAPI is an interrupt/polling hybrid: it uses interrupts to guarantee low latency, and at high loads interrupts never get re-enabled, so consecutive polls occur. The old scheme added an interrupt delay to keep the CPU from being killed by interrupts. With NAPI we can do without this delay for the first time, but that means more interrupts in low-load situations. Should we add an interrupt delay just out of old habit?
SLIDE 12
Tested device
Flexible netlab at Uppsala University
Measured: raw packet performance, TCP, timing, variants.
[Diagram: Linux test generator --Ethernet--> device under test --Ethernet--> Linux sink. "El cheapo" hardware, highly customizable: we write the code :-)]
SLIDE 13 Hardware for high perf. Networking
Motherboard
- CPU: uni- or multi-processor
- Chipset: BX, ServerWorks, E750X
- Bus/PCI design: number of PCI buses @ 133 MHz
- Interrupt design: PIC, IO-APIC, etc.
- Standby power (Wake on LAN) can be a problem with many NICs
SLIDE 14 Hardware for high perf. Networking
- ServerWorks and Intel E750X chipsets: many PCI-X hubs/bridges and dual XEON
- PCI-X is here: bus at 8.5 Gbit/s
- Many vendors use Compact PCI already
[Diagram: two CPUs attached to a processor/I/O/memory controller, with memory and two PCI-X I/O bridges, each bridge hosting NICs.]
SLIDE 15
Hardware for high perf. Networking
Currently Intel has the advantage; Broadcom can be a dark horse. All have NAPI drivers.
GigE chipsets available for PCI:
- Intel e1000 -- e1000
- Broadcom BCM5700 -- tg3
- D-Link dl-2k -- dl2k
Some board manufacturers switch chipsets often. Chip documentation is a problem.
SLIDE 16 Some GIGE experiments/NAPI
[Chart: ping latency/fairness under extreme load, UP. Latency in microseconds for pings 1-8, through an idle router vs. through a router under a 890 kpps DoS attack.]
Very well behaved: just an increase of a couple of hundred microseconds!
SLIDE 17 Some GIGE experiments
[Chart: pktgen sending test with 11 GigE interfaces (eth0-eth10), packets/sec per interface, clone vs. alloc. 2*XEON 1.8 GHz, sending 1518-byte packets; 81300 pps is 1 Gbit/s.]
Clone = 8.5 Gbit/s, Alloc = 5.4 Gbit/s
ServerWorks X5DL8-GG, Intel e1000
SLIDE 18 Some GIGE experiments
[Chart: pktgen sending test with 11 GigE interfaces (eth0-eth10), packets/sec per interface, clone vs. alloc. 2*XEON 1.8 GHz with HyperThreading on, sending 1518-byte packets; 81300 pps is 1 Gbit/s.]
Clone = 10.0 Gbit/s, Alloc = 7.4 Gbit/s
ServerWorks X5DL8-GG, Intel e1000
SLIDE 19 Some GIGE experiments
[Chart: aggregated sending performance from pktgen with 11 GigE interfaces, in Kpps, alloc vs. clone, with and without HyperThreading. XEON 2*1.8 GHz, 64-byte packets; 1.48 Mpps = 1 Gbit/s.]
SLIDE 20 Forwarding performance
[Chart: Linux forwarding rate (Kpps) vs. packet size (64-1518 bytes), input vs. throughput. Linux 2.5.58, UP, skb recycling, 1.8 GHz XEON.]
Fills a GigE pipe starting from 256-byte packets.
SLIDE 21 R&D
[Diagram: IO-APIC distributing eth0/eth1 interrupts across CPU 0 and CPU 1; the eth1 TX ring holds skb's from different CPUs.]
- Parallelization vs. serialization: eth1 holds skb's from different CPU's
- Clearing the TX buffers (releasing skb's) causes cache bouncing
- For user apps the new scheduler handles affinity, but for packet forwarding eth0->eth1 on CPU0 we can set the affinity of eth1 to CPU0
- It would be nice to use the other CPU for forwarding too :-)
SLIDE 22 R&D
- Very high transaction packet memory system needed for GigE and upcoming 10GE
- Profiling indicates the slab is not fully per-CPU: SMP 2-CPU 300 kpps, SMP 1-CPU 302 kpps

Counter 0 counted GLOBAL_POWER_EVENTS events
vma       samples  %-age    symbol name
c0138e96  37970    8.23162  cache_alloc_refill
c0229490  37247    8.07488  alloc_skb
c0235e90  32491    7.04381  qdisc_restart
c0235b54  27891    6.04657  eth_type_trans

Note: setting input affinity helps, but we would like to work on the general problem.

vma       samples  %-age    symbol name
c02296d2  25675    8.67698  skb_release_data
c0235b54  24438    8.25893  eth_type_trans
c0235e90  24047    8.12679  qdisc_restart
c0229490  18188    6.14671  alloc_skb
c0110a1c  15741    5.31974  do_gettimeofday
SLIDE 23 R&D
[Chart: router profile, XEON no HT, 2*1.8 GHz. Routing throughput in Kpps for: V UP gcc-3.1, V SMP2 gcc-3.1, V SMP1 gcc-3.1, V SMP2 gcc-2.95.3, RC UP gcc-3.1, RC UP gcc-2.95.3, RC SMP2 gcc-3.1, RC SMP1 gcc-3.1, IA SMP2 gcc-3.1, IA RC SMP2 gcc-3.1.]
V = vanilla, UP = uniprocessor, SMP1 = SMP with 1 CPU, SMP2 = SMP with 2 CPUs, RC = skb recycling, IA = input affinity.
Profiled with P4/XEON performance counters: GLOBAL_POWER_EVENTS, MISPRED_BRANCH_RETIRED, BSQ_CACHE_REFERENCE, MACHINE_CLEAR, ITLB_REFERENCE.
SLIDE 24 NAPI/SMP in production use: uu.se
[Diagram: dual uplinks to Stockholm; DMZ (AS 2834) with routers UU-1 and UU-2; Linux routers L-uu1 and L-uu2 facing the internal UU network.]
PIII 933 MHz, 2.4.10-poll/SMP, full Internet routing via EBGP/IBGP.
SLIDE 25 Real world use: ftp.sunet.se
[Diagram: ftp servers Ftp0-Ftp2 behind a switch and Linux routers Archive-r1 and Archive-r2, connected via a GSR to Stockholm over OC-48 (AS 1653, AS 15980).]
PIII 933 MHz, NAPI/IRQ. Load sharing and redundancy with Router Discovery. Full Internet routing via EBGP/IBGP.
SLIDE 26 IP-login -- a Linux router app.
User-authenticated routing
[Diagram: user@host connected to the IP-login router.]
Users can only reach the IP-login router, which hosts a web server.
User web requests are redirected to this web server and asked for username and password, possibly checked against an authentication server (today TACACS). If the user/password is accepted: 1) forwarding is enabled for the host, and 2) monitoring with arping is started. Loss of arping replies disables forwarding again (see the sketch below).
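A hypothetical user-space sketch of the control loop described above: once authentication has succeeded (the TACACS check is not shown), forwarding is enabled for the host and arping is used to monitor it; after a few missed replies forwarding is disabled again. The firewall rule, the arping options and the loss limit are illustrative assumptions, not the actual IP-login implementation.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define ARPING_LOSS_LIMIT 3     /* consecutive missed arpings before logout */

/* Enable or disable forwarding for one host (illustrative iptables rule). */
static void set_forwarding(const char *ip, int enable)
{
        char cmd[256];

        snprintf(cmd, sizeof(cmd), "iptables -%c FORWARD -s %s -j ACCEPT",
                 enable ? 'I' : 'D', ip);
        if (system(cmd) != 0)
                fprintf(stderr, "failed: %s\n", cmd);
}

/* One liveness probe: arping exits 0 when the host answered. */
static int host_answers_arping(const char *ip, const char *dev)
{
        char cmd[256];

        snprintf(cmd, sizeof(cmd), "arping -q -c 1 -w 1 -I %s %s", dev, ip);
        return system(cmd) == 0;
}

int main(int argc, char **argv)
{
        const char *ip  = argc > 1 ? argv[1] : "192.0.2.10";  /* example host */
        const char *dev = argc > 2 ? argv[2] : "eth0";
        int lost = 0;

        set_forwarding(ip, 1);                  /* 1) enable forwarding           */
        while (lost < ARPING_LOSS_LIMIT) {      /* 2) monitor the host            */
                lost = host_answers_arping(ip, dev) ? 0 : lost + 1;
                sleep(10);
        }
        set_forwarding(ip, 0);                  /* loss of arping: forwarding off */
        return 0;
}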
Based on stolen code from: Pawel Krawczyk -- tacacs client Alexey Kuznetsov -- arping
SLIDE 27 IP-login installation at Uppsala University
Approx. 1000 outlets
SLIDE 28
A new network symbol has been seen...
The Penguin Has Landed
SLIDE 29 References and Other Stuff
- http://bifrost.slu.se
- http://www.pdos.lcs.mit.edu/click/ (claim they can do 435 Kpps on a PIII 700)
- http://www.cyberus.ca/~hadi/usenix-paper.tgz
- Some other work: http://robur.slu.se/Linux/net-development/