SLIDE 1

FIRST Technical Colloquium, February 10-11, 2003 @ Uppsala, Sweden

Bifrost: a high performance router & firewall

Robert Olsson, Hans Wassen

SLIDE 2

Bifrost concept

  • Small-size Linux distribution targeted for flash disks (20 MB)
  • Optimized for networking/firewalling
  • Tested with selected drivers and hardware
  • Open platform for development and collaboration
  • Results and experiences shared
SLIDE 3

Bifrost concept

  • Linux kernel collaboration: FASTROUTE, HW_FLOWCONTROL, new NAPI for the network stack
  • Performance testing, development of tools and testing techniques
  • Hardware validation, support from big vendors
  • Detect and cure problems in the lab, not in the network infrastructure
  • Test deployment (often in our own network)
SLIDE 4

Collaboration/development: The New API (NAPI)

slide-5
SLIDE 5

Core Problems

  • Heavy net load leads to system congestion collapse:
    - High interrupt rates; livelock and cache-locality effects; interrupts are simply expensive
    - CPU: interrupt-driven operation takes too long to drop a bad packet
    - Bus (PCI): packets are still being DMAed when the system is overloaded
    - Memory bandwidth: continuous allocs and frees to fill DMA rings
  • Unfairness in case of a hogger netdev
SLIDE 6

Overall Effect

  • Inelegant handling of heavy net loads: system collapse
  • Scalability affected: the system and the number of NICs
  • A single hogger netdev can bring the system to its knees and deny service to others

[Chart: summary, Linux 2.4 vs. the feedback-based scheme]

March 15 report on lkml, thread "How to optimize routing perfomance" reported by Marten.Wikstron@framsfab.se:

  • Linux 2.4 peaks at 27 Kpps
  • Pentium Pro 200, 64 MB RAM
SLIDE 7

Looking inside the box

Backlog queue processing: incoming packets from devices are enqueued to the backlog queue from the IRQ (if the queue is not full) and handed to the stack at a later time from the net-rx SoftIRQ; forwarded and locally generated outgoing packets take the transmit path.

[Diagram: packet flow through the backlog queue]

SLIDE 8

BYE BYE Backlog queue

  • Packet stays in its original queue (e.g. the DMA ring)
  • Net-rx softirq: for each dev on the poll list, call dev->poll() to grab up to quota packets

Device drivers are polled from the softirq; packets are pulled and delivered to the network stack. The driver indicates done/not done:

  done ==> we go back to IRQ mode
  not done ==> the device remains on the polling list

The net-rx softirq breaks out after one jiffy or netdev_max_backlog packets, to ensure other tasks get to run.

SLIDE 9

A high-level view of the new system

[Figure: interrupt area vs. polling area. P = packets to deliver to the stack (on the RX ring); horizontal lines show different netdevs with different input rates; the area under the curve shows how many packets arrive before the next interrupt; the quota enforces a fair share.]

SLIDE 10

Kernel support

The NAPI kernel part was included in 2.5.7 and backported to 2.4.20. Current driver support:

  • e1000 -- Intel GigE NICs
  • tg3 -- Broadcom GigE NICs
  • dl2k -- D-Link GigE NICs
  • tulip (pending) -- 100 Mbit/s

SLIDE 11

NAPI: observations & issues

"Ooh, I get even more interrupts... with polling." As we have seen, NAPI is an interrupt/polling hybrid: it uses interrupts to guarantee low latency, and at high loads interrupts never get re-enabled, so consecutive polls occur. The old scheme added an interrupt delay to keep the CPU from being killed by interrupts. With NAPI we can do without this delay for the first time, but that means more interrupts in low-load situations. Should we add an interrupt delay just out of old habit?

SLIDE 12

Test setup

Flexible netlab at Uppsala University, used for: raw packet performance, TCP, timing, variants.

[Diagram: Linux test generator -- Ethernet -- Linux sink device]

El cheapo -- highly customizable -- we write the code :-)

SLIDE 13

Hardware for high perf. networking

Motherboard:
  • CPU: uni- or multi-processor
  • Chipset: BX, ServerWorks, E750X
  • Bus/PCI design: number of PCI buses @ 133 MHz
  • Interrupt design: PIC, IO-APIC etc.
  • Standby power (Wake on LAN) can be a problem with many NICs

SLIDE 14

Hardware for high perf. Networking

ServerWorks and Intel E750X chipsets offer many PCI-X hubs/bridges and dual XEON. PCI-X is here: a bus at 8.5 Gbit/s. Many vendors use Compact PCI already.

[Diagram: processor, I/O and memory controller connecting two CPUs and memory to two PCI-X I/O bridges, each with a NIC]

SLIDE 15

Hardware for high perf. Networking

Currently Intel has the advantage; Broadcom can be a dark horse. All have NAPI drivers. GigE chipsets available for PCI: Intel e1000 -- e1000, Broadcom BCM5700 -- tg3, D-Link dl-2k -- dl2k. Some board manufacturers switch chipsets often. Chip documentation is a problem.

SLIDE 16

Some GIGE experiments/NAPI

Ping latency/fairness under extreme load (UP)

[Chart: ping latency in microseconds through an idle router vs. through a router under a 890 kpps DoS attack]

Very well behaved: just an increase of a couple of hundred microseconds!

SLIDE 17

Some GIGE experiments

[Chart: pktgen sending test with 11 GigE ports; 2*XEON 1.8 GHz sending 1518-byte packets (81300 pps is 1 Gbit/s); ServerWorks X5DL8-GG, Intel e1000. Clone = 8.5 Gbit/s, Alloc = 5.4 Gbit/s]

SLIDE 18

Some GIGE experiments

[Chart: pktgen sending test with 11 GigE ports; 2*XEON 1.8 GHz with HyperThreading on, sending 1518-byte packets (81300 pps is 1 Gbit/s); ServerWorks X5DL8-GG, Intel e1000. Clone = 10.0 Gbit/s, Alloc = 7.4 Gbit/s]

SLIDE 19

Some GIGE experiments

[Chart: aggregated sending performance in kpps from pktgen with 11 GigE; XEON 2*1.8 GHz @ 64-byte packets (1.48 Mpps = 1 Gbit/s); Alloc vs. Clone, with and without HyperThreading]

SLIDE 20

Forwarding performance

[Chart: Linux forwarding rate (input vs. throughput, in kpps) at packet sizes 64-1518 bytes; Linux 2.5.58 UP with skb recycling, 1.8 GHz XEON]

Fills a GigE pipe starting from 256-byte packets.

SLIDE 21

R&D

[Diagram: IO-APIC distributing eth0 and eth1 interrupts to CPU0 and CPU1; TX ring]

Parallelization vs. serialization: eth1 holds skbs from different CPUs, and clearing the TX buffer releases cache bouncing. For user apps the new scheduler does affinity, but for packet forwarding eth0 -> eth1 on CPU0 (we can set affinity eth1 -> CPU0) it would be nice to use the other CPU for forwarding too. :-)

SLIDE 22

R&D

A very high-transaction packet memory system is needed for GigE and upcoming 10GE. Profiling indicates the slab is not fully per-CPU: SMP 2-CPU 300 kpps, SMP 1-CPU 302 kpps.

Counter 0 counted GLOBAL_POWER_EVENTS events:

  vma       samples  %-age    symbol name
  c0138e96  37970    8.23162  cache_alloc_refill
  c0229490  37247    8.07488  alloc_skb
  c0235e90  32491    7.04381  qdisc_restart
  c0235b54  27891    6.04657  eth_type_trans

Note: setting input affinity helps, but we would like to work on the general problem:

  c02296d2  25675    8.67698  skb_release_data
  c0235b54  24438    8.25893  eth_type_trans
  c0235e90  24047    8.12679  qdisc_restart
  c0229490  18188    6.14671  alloc_skb
  c0110a1c  15741    5.31974  do_gettimeofday

SLIDE 23

R&D

[Chart: router profile, XEON no HT, 2*1.8 GHz; routing throughput in kpps for configurations V UP gcc-3.1, V SMP2 gcc-3.1, V SMP1 gcc-3.1, V SMP2 gcc-2.95.3, RC UP gcc-3.1, RC UP gcc-2.95.3, RC SMP2 gcc-3.1, RC SMP1 gcc-3.1, IA SMP2 gcc-3.1, IA RC SMP2 gcc-3.1]

V = vanilla, UP = uniprocessor, SMP1 = SMP 1 CPU, SMP2 = SMP 2 CPUs, RC = skb recycling, IA = input affinity.

Profiled with P4/XEON performance counters: GLOBAL_POWER_EVENTS, MISPRED_BRANCH_RETIRED, BSQ_CACHE_REFERENCE, MACHINE_CLEAR, ITLB_REFERENCE.

SLIDE 24

NAPI/SMP in production use: uu.se

PIII 933 MHz, 2.4.10-poll/SMP, full Internet routing via EBGP/IBGP

[Diagram: dual links to Stockholm; DMZ, AS 2834; routers UU-1 and UU-2; internal UU-Net; L-uu1 and L-uu2]

SLIDE 25

Real-world use: ftp.sunet.se

PIII 933 MHz, NAPI/IRQ; load sharing & redundancy with Router Discovery; full Internet routing via EBGP/IBGP

[Diagram: Ftp0, Ftp1, Ftp2 behind a switch; routers Archive-r1 and Archive-r2; GSR with OC-48 to Stockholm; AS 1653, AS 15980]

SLIDE 26

IP-login -- a Linux router app.

User-authenticated routing:

  • Users can only reach the IP-login router, which hosts a web server.
  • User web requests are directed to the web server and prompted for username and password, possibly against an authentication server (today TACACS).
  • If the user/password is accepted: 1) forwarding is enabled for the host, 2) monitoring arping is started. Loss of arping disables forwarding.

[Diagram: hosts H connected to IP-login routers R]

Based on stolen code from: Pawel Krawczyk -- TACACS client; Alexey Kuznetsov -- arping.

SLIDE 27

IP-login installation at Uppsala University: approx. 1000 outlets

SLIDE 28

A new network symbol has been seen...

The Penguin Has Landed

SLIDE 29

References and Other Stuff

  • http://bifrost.slu.se
  • http://www.pdos.lcs.mit.edu/click/ (claim they can do 435 Kpps on a PIII 700)
  • http://www.cyberus.ca/~hadi/usenix-paper.tgz

Some other work: http://robur.slu.se/Linux/net-development/