

SLIDE 1

A New Software Architecture for Core Internet Routers

Robert Broberg September 16, 2011

SLIDE 2

Disclaimers and Credits

  • This is research and no product plans are implied by any of this work.
  • r3.cis.upenn.edu
  • Early and continued support from www.vu.nl
  • A large team has generated this work and I am just one of many spokespersons for them.
    – any mistakes in this talk are mine.

SLIDE 3

Agenda

  • Overview of the evolution of core router design
  • A sampling of SW problems encountered during that evolution
  • An approach to resolving SW problems and continued evolution

SLIDE 4

Core Router Evolution

  • WAN interconnects of mainframes over telecommunication infrastructure
  • LAN/WAN interconnects
    – CORE routers (1+1 architectures)
    – Leased telco lines for customers
    – Dialup aggregation
  • As CORE routers evolved, the older generation migrated to support edge connects
  • Telco becomes a client of the IP network
SLIDE 5

  • Moore's law: ×2 / 18 months
  • DRAM access rate: ×1.1 / 18 months
  • Silicon speed: ×1.5 / 18 months
  • Router capacity: ×2.9 / 18 months

The demand for increased network system performance/scale is relentless...

[Chart: Internet traffic growing roughly 2x/year, 2004-2014]

Growth driven by increased user demand

SLIDE 6

System Scaling Problems

SLIDE 7
SLIDE 8

Some of the reasons SW problems were encountered

  • Routers started as tightly coupled embedded systems
    – speeds and feeds were the game, along with features
  • CPUs + NPUs + very aware programmers led the game
  • Evolution was very fast
    – Business customers
      • leased lines and frame relay
    – Mid 1990s: 64 kbit dialup starts
    – Core bandwidth doubling every year
  • As IP customer populations grew, feature demands increased
  • Model of SW delivery not conducive to resilience or rapid feature deployment

SLIDE 9

Intent / Goals

  – build an application-unaware fault tolerant distributed system for routers
  – always on (200 msec failover of apps)
  – allow for insertion of new features with no impact to existing operations
  – support +/- 1 versioning of key applications with zero packet loss
  – versioning to allow for live feature testing

SLIDE 10

Fault Tolerant Routing

SLIDE 11

Motivations

  • We must be able to do better than 1+1
    – Low confidence in 1+1, as it is only tested when actually upgrading/downgrading/crashing
  • Want 100% confidence in new code
    – Despite lab time, rollout often uncovers showstoppers
    – Rollback can be very disruptive
  • Aiming for sub-200ms 'outages'
    – Want to be able to recover before VOIP calls notice

SLIDE 12

Core Routers are built as Clusters but act as a single virtual machine

  • Multiple line cards, potentially with various types of interfaces, use NPUs to route/switch amongst themselves via a data plane (switch fabric)
  • A separate control plane controls all NPUs, programming switching tables and managing interface state, routing protocols, and environmental conditions
    – Control plane CPUs are typically generic and ride the commodity curve
  • The systems are heterogeneous and large
    – Current Cisco CRS-3 deployments switch 128 Tb, have ~150 x86 CPUs for the control plane along with ~1 terabyte of memory, and scale higher

SLIDE 13

Virtualization/Voting/BGP

  • BGP state is tied to TCP connection state
    – loopback interfaces
  • Process placement
  • Versioning
  • Leader election
  • HW virtualization
    – e.g. NPU virtualization???

SLIDE 14

Approach taken

  • Abstraction layers chosen to isolate applications
    – applications (e.g. protocols) isolated with wrappers
      • application-transparent checkpointing!
      • FTSS used to store state
      • SHIM used as wrapper (see the sketch after this list)
    – model to allow for voting
  • Optimize, optimize, optimize
    – experiment and prototype
  • ORCM used for process placement
  • Protocols isolated by a shim layer
    – multiple versions called siblings
  • 2 levels of operation chosen
    – no use seen for hypervisor
    – user mode for apps; kernel; abstraction layer via SHIM + FTSS
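The slides do not show the SHIM or FTSS interfaces, so the following is only a minimal sketch of application-transparent checkpointing: the FtssClient class, the key naming, and the handle()/deliver() methods are hypothetical illustrations, not the real APIs. The idea shown is that the wrapper persists every input redundantly before the unmodified application sees it, so a restarted sibling can replay the stream.

```python
# Sketch only: hypothetical FTSS client and shim wrapper, not the real code.
class FtssClient:
    """Stand-in for the fault-tolerant state store (a replicated DHT)."""
    def __init__(self):
        self._store = {}                      # real FTSS replicates across nodes

    def put(self, key: bytes, value: bytes) -> None:
        self._store[key] = value

    def get(self, key: bytes):
        return self._store.get(key)


class Shim:
    """Wraps a protocol process; checkpoints every message before delivery."""
    def __init__(self, app, ftss: FtssClient, name: str):
        self.app, self.ftss, self.name = app, ftss, name
        self.seq = 0

    def deliver(self, msg: bytes) -> None:
        # 1. Persist the raw input before the application sees it
        #    (the shim treats it as an opaque "bag of bits").
        self.ftss.put(f"{self.name}/in/{self.seq}".encode(), msg)
        self.seq += 1
        # 2. Hand the message to the unmodified application code.
        self.app.handle(msg)

    def restart(self, fresh_app) -> None:
        # Replay checkpointed inputs into a fresh (possibly newer) sibling.
        self.app = fresh_app
        for i in range(self.seq):
            blob = self.ftss.get(f"{self.name}/in/{i}".encode())
            if blob is not None:
                fresh_app.handle(blob)
```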

SLIDE 15

Protocol Virtualization

  • Existing protocol code largely untouched
  • Can run N siblings (see the sketch below)
    – Can be different versions of the protocol being virtualized
    – Allows full testing of new code
    – with seamless switchover and switch back
  • Currently we run one virtualization wrapper
    – Protected by storing state into FTSS
    – Can be restarted, thus upgradeable
    – Designed to know as little about the protocol as possible
      • Treats most of it as a 'bag of bits'
  • 'Run anywhere' – no RP/LC assumptions
    – We don't care what you call the compute resources
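A minimal sketch of the sibling model, assuming each sibling exposes a handle(msg) method returning its routes (an assumption made for illustration, not the real interface):

```python
# Sketch of N siblings behind one wrapper; names and methods are illustrative.
class SiblingPool:
    def __init__(self, siblings: dict, lead: str):
        self.siblings = siblings        # e.g. {"v4.1": bgp_old, "v4.2": bgp_new}
        self.lead = lead                # whose routes actually reach the RIB

    def dispatch(self, msg: bytes):
        # Every version processes the same input; non-lead outputs can be
        # compared (voted on) to test new code against live traffic.
        results = {v: s.handle(msg) for v, s in self.siblings.items()}
        return results[self.lead]

    def switch_lead(self, version: str) -> None:
        # Seamless switchover: the new lead already holds full state because
        # it has processed the same message stream as the old one.
        assert version in self.siblings
        self.lead = version
```

Used this way, a new version runs as a non-lead sibling against live input, and switch_lead() promotes it (or switches back) once its output matches the incumbent's.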

SLIDE 16

CRS utilisation

[Diagram: a router with 2 RPs and 4 LCs treated as compute nodes]

  • The CRS contains many CPUs which we treat as compute nodes in a cluster
  • If a node fails, the others take up its workload
  • No data is lost on a failure, and the software adapts to re-establish redundancy

SLIDE 17

CRS utilisation

[Diagram: the same 2 RPs and 4 LCs, plus an external blade server]

  • External resources (e.g. a blade server) can be added to the system to add redundancy or compute power

SLIDE 18

Placement of Components

  • Each compute node runs FTSS and ORCM – both are started by 'qn' (system process monitor)
  • FTSS stores routing data redundantly across all the systems in the router
  • ORCM manages routing processes and distributes them around the router (see the sketch below)
    – constraints can be applied via configuration
  • FTSS can run on other nodes to make use of memory if desired.

[Diagram: an RP and an LC, each running FTSS and ORCM]
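ORCM's actual configuration format is not shown on the slides; the sketch below only illustrates constraint-driven placement in the same spirit. The process names, node names, and the constraint mapping are all hypothetical.

```python
# Illustrative sketch of constrained process placement (not ORCM's syntax).
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    kind: str                               # "RP", "LC", or "external"
    procs: list = field(default_factory=list)


def place(processes: dict, nodes: list) -> None:
    """processes maps process name -> set of node kinds it may run on."""
    for proc, allowed in processes.items():
        candidates = [n for n in nodes if n.kind in allowed]
        if not candidates:
            raise RuntimeError(f"no node satisfies the constraint for {proc}")
        # Greedy: run on the least-loaded node that satisfies the constraint.
        min(candidates, key=lambda n: len(n.procs)).procs.append(proc)


nodes = [Node("RP0", "RP"), Node("RP1", "RP"),
         Node("LC0", "LC"), Node("LC1", "LC")]
place({"bgp-shim": {"RP"},                  # keep the wrapper on route processors
       "bgp-sibling-v1": {"RP", "LC"},      # siblings may run anywhere
       "bgp-sibling-v2": {"RP", "LC"},
       "isis-shim": {"RP"}},
      nodes)
for n in nodes:
    print(n.name, n.procs)
```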

SLIDE 19

BGP Virtualisation

[Diagram: a reliable TCP endpoint feeds the BGP virtualisation service (shim), which fans messages out to several BGP siblings, including a new BGP version; the shim stores state in FTSS, is placed by ORCM, and passes routes to the RIB over the distributed dataplane]

SLIDE 20

Virtualisation Layer recovery

[Diagram: the same arrangement during recovery – a new shim instance takes over, restoring its state from FTSS while the reliable TCP endpoint and the BGP siblings keep running]

SLIDE 21

IS-IS Virtualisation

[Diagram: an IS-IS L2 receiver feeds the IS-IS virtualisation service (shim), which fans frames out to several IS-IS siblings; the shim is placed by ORCM and passes routes to the RIB over the distributed dataplane]

SLIDE 22

Fault Tolerant State Storage

  • Distributed Hash Table with intelligent placement of data
  • You can decide how much replication
    – 2, 3, 4, N copies
      • More copies – more memory & slower write times
      • Fewer copies – fewer simultaneous failures tolerated
  • Virtual nodes – able to balance memory usage to space on the compute node (see the sketch below)
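As an illustration of the virtual-node idea, the sketch below gives each physical node a number of positions on a consistent-hash ring proportional to its spare memory, so larger nodes absorb more of the stored data. The memory figures, key names, and hashing scheme are assumptions for illustration, not FTSS internals.

```python
# Sketch: virtual nodes sized to node memory on a consistent-hash ring.
import hashlib


def ring_position(label: str) -> int:
    return int.from_bytes(hashlib.sha1(label.encode()).digest()[:8], "big")


def build_ring(memory_mb: dict, mb_per_vnode: int = 1024):
    """Give each physical node one virtual node per GB of spare memory."""
    ring = []                                   # (ring position, physical node)
    for node, mem in memory_mb.items():
        for i in range(max(1, mem // mb_per_vnode)):
            ring.append((ring_position(f"{node}#{i}"), node))
    return sorted(ring)


def owner(ring, key: bytes) -> str:
    """First virtual node clockwise from the key's position owns the key."""
    pos = int.from_bytes(hashlib.sha1(key).digest()[:8], "big")
    for point, node in ring:
        if point >= pos:
            return node
    return ring[0][1]                           # wrap around the ring


ring = build_ring({"RP0": 8192, "RP1": 8192, "LC0": 2048, "LC1": 2048})
print(owner(ring, b"prefix:10.0.0.0/8"))
```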

SLIDE 23

FTSS distributed storage

[Diagram: FTSS instances on RP0, RP1, LC0, LC1, and LC2]

Some data – stored redundantly in 2 places

SLIDE 24

FTSS: losing a node

[Diagram: the FTSS instances after one node is lost]

Missing data is re-replicated from its predecessor
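A generic consistent-hashing sketch of the behaviour shown on the last two slides: each key is stored on two nodes, and when a node is lost the replication factor is restored by copying from a surviving replica. Class and method names are illustrative, not FTSS code.

```python
# Sketch: redundant placement and re-replication after node loss (not FTSS).
import bisect
import hashlib


def h(x: bytes) -> int:
    return int.from_bytes(hashlib.sha1(x).digest()[:8], "big")


class Ring:
    def __init__(self, nodes, copies=2):
        self.copies = copies
        self.points = sorted((h(n.encode()), n) for n in nodes)
        self.data = {n: {} for n in nodes}          # per-node local stores

    def _owners(self, key: bytes):
        """The `copies` distinct nodes clockwise from the key's position."""
        i = bisect.bisect_left(self.points, (h(key), ""))
        owners, seen = [], set()
        while len(owners) < self.copies:
            _, node = self.points[i % len(self.points)]
            if node not in seen:
                owners.append(node)
                seen.add(node)
            i += 1
        return owners

    def put(self, key: bytes, value: bytes):
        for node in self._owners(key):
            self.data[node][key] = value

    def fail(self, dead: str):
        # Collect surviving copies, drop the node, then re-store every key so
        # the new owner set again holds the full replication factor.
        surviving = {}
        for node, store in self.data.items():
            if node != dead:
                surviving.update(store)
        self.points = [(p, n) for p, n in self.points if n != dead]
        del self.data[dead]
        for key, value in surviving.items():
            self.put(key, value)


ring = Ring(["RP0", "RP1", "LC0", "LC1", "LC2"], copies=2)
ring.put(b"prefix:192.0.2.0/24", b"nexthop: LC1 port 0")
ring.fail("LC1")          # the prefix is still held redundantly on two nodes
```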

SLIDE 25

DHT tuples

Key
  • Binary data
  • Unique in DHT

Value
  • Binary data

Link
  • Unique set of binary data items
  • Optimizations for use as a list of keys

DHT provides optimised routines for:
  • fast parallel store and deletion of multiple tuples
  • fast update of multiple links within a tuple
  • operations directly using the link list for storing related data
  • fast parallel recovery of multiple, possibly inter-linked, KVL tuples

Copies of the tuples are stored on multiple nodes for redundancy (see the sketch below)
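The exact FTSS tuple API is not given on the slides; the following is only a minimal sketch of a key-value-link (KVL) tuple and the operations listed above, with illustrative field and method names.

```python
# Sketch of a Key-Value-Link (KVL) tuple and store; names are illustrative.
from dataclasses import dataclass, field


@dataclass
class KVLTuple:
    key: bytes                          # unique in the DHT
    value: bytes = b""                  # opaque binary data
    links: list = field(default_factory=list)   # ordered list of related keys


class TupleStore:
    def __init__(self):
        self.tuples = {}                # real FTSS replicates these across nodes

    def put_many(self, tuples) -> None:
        # "fast parallel store" collapses to a simple loop in this sketch
        for t in tuples:
            self.tuples[t.key] = t

    def append_link(self, key: bytes, linked_key: bytes) -> None:
        # fast update of a link within a tuple (link used as a list of keys)
        self.tuples[key].links.append(linked_key)

    def fetch_linked(self, key: bytes):
        # recover a tuple together with everything its link list points at
        root = self.tuples[key]
        return [root] + [self.tuples[k] for k in root.links if k in self.tuples]


store = TupleStore()
store.put_many([KVLTuple(b"peer:10.0.0.4"),
                KVLTuple(b"announce:1", b"ASPATH + attrs, NLRI")])
store.append_link(b"peer:10.0.0.4", b"announce:1")
print(store.fetch_linked(b"peer:10.0.0.4"))
```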

SLIDE 26

DHT use in BGP processing

BGP Shim operations:
  • Receive incoming BGP messages
  • Acknowledge TCP
  • Create minimal message set
  • Hand to BGP siblings; routes produced
  • Pass routes from lead sibling to RIB

DHT contents:
  • Unprocessed messages – early redundant store to permit fast acknowledgement of incoming BGP messages
  • Attributes + NLRI – minimal set of incoming BGP data
  • RIB prefixes – data store for re-syncing with RIBs on restart
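Putting the shim operations and DHT writes in order, a rough sketch of the message path follows. The dht, tcp, siblings, and rib objects, the key names, and parse_update() are hypothetical stand-ins, not the real interfaces.

```python
# Sketch of the shim's per-message path; all collaborators are hypothetical.
def parse_update(raw: bytes) -> str:
    # Stand-in for real BGP parsing: keep only path attributes + NLRI.
    return raw.decode(errors="replace")


def handle_bgp_message(raw: bytes, seq: int, dht, tcp, siblings, rib, lead: str):
    # 1. Early redundant store so the raw message survives a shim/node failure...
    dht.put(f"bgp/unprocessed/{seq}".encode(), raw)
    # 2. ...which lets the reliable TCP endpoint acknowledge immediately.
    tcp.ack(seq)
    # 3. Reduce to the minimal set the siblings need (attributes + NLRI).
    minimal = parse_update(raw)
    dht.put(f"bgp/minimal/{seq}".encode(), minimal.encode())
    # 4. Every sibling processes the same input; only the lead drives the RIB.
    routes = {name: sib.process(minimal) for name, sib in siblings.items()}
    rib.install(routes[lead])
    # 5. Keep the installed prefixes in the DHT for re-syncing after a restart.
    dht.put(f"bgp/rib/{seq}".encode(), repr(routes[lead]).encode())
```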

SLIDE 27

BGP data in DHT (I)

[Diagram: example tuples – unprocessed incoming messages keyed by sequence number (126, 127, 128, ...); source peers keyed by address (10.0.0.4, 192.168.22.5, 4.1.0.77, ...); announcements from peers as a minimal set of AS paths + attributes and NLRI + peer, held as data within links]

SLIDE 28

BGP data in DHT (II)

[Diagram: sibling tuples keyed by sibling number (1, 2, 3, ...), each linking to RIB prefixes (10.0.0.4, 19.1.22.5, 4.1.0.77, ...)]

SLIDE 29

DHT use in IS‐IS processing

IS-IS Shim operations:
  • Receive incoming IS-IS frames
  • Create minimal message set
  • Hand to IS-IS siblings; routes produced
  • Pass routes from lead sibling to RIB

DHT contents:
  • LSPs – minimal set of incoming IS-IS frames
  • RIB prefixes – data store for re-syncing with RIBs on restart

SLIDE 30

Multipath IGP/EGP demo

SLIDE 31