Routing without tears: Bridging without danger Radia Perlman Sun - - PowerPoint PPT Presentation



slide-1
SLIDE 1

1

Routing without tears: Bridging without danger

Radia Perlman

Sun Microsystems Laboratories Radia.Perlman@sun.com

slide-2
SLIDE 2

2

Before we get to RBridges

  • Let’s sort out bridges, routers, switches...
slide-3
SLIDE 3

3

What are bridges, really?

  • Myth: bridges/switches are simpler devices, designed before routers
  • OSI Layers

– 1: physical

slide-7
SLIDE 7

7

Why this whole layer 2/3 thing?

  • Myth: bridges/switches are simpler devices, designed before routers
  • OSI Layers

– 1: physical
– 2: data link (neighbor-to-neighbor, e.g., Ethernet)
– 3: network (create entire path, e.g., IP)
– 4: end-to-end (e.g., TCP, UDP)
– 5 and above: boring

slide-12
SLIDE 12

12

Definitions

  • Repeater: layer 1 relay
  • Bridge: layer 2 relay
  • Router: layer 3 relay
  • OK: What is layer 2 vs layer 3?

– The “right” definition: layer 2 is neighbor-neighbor. “Relays” should only be in layer 3!
slide-13
SLIDE 13

13

Definitions

  • Repeater: layer 1 relay
  • Bridge: layer 2 relay
  • Router: layer 3 relay
  • OK: What is layer 2 vs layer 3?
  • True definition of a layer n protocol:

Anything designed by a committee whose charter is to design a layer n protocol

slide-14
SLIDE 14

14

Layer 3 (e.g., IPv4, IPv6, DECnet, Appletalk, IPX, etc.)

  • Put source, destination, hop count on packet
  • Then along came “the EtherNET”

– rethink routing algorithm a bit, but it’s a link not a NET!

  • The world got confused. Built on layer 2
  • I tried to argue: “But you might want to talk from one Ethernet to another!”
  • Thought Ethernet was a competitor to layer 3
slide-15
SLIDE 15

15

Layer 3 packet

[Diagram: packet = Layer 3 header (source, dest, hops) + data]

slide-16
SLIDE 16

16

Ethernet packet

[Diagram: packet = Ethernet header (source, dest) + data]

No hop count

slide-17
SLIDE 17

17

Layer 3 packet

[Diagram: packet = Layer 3 header (source, dest, hops) + data]

Addresses have topological significance

slide-18
SLIDE 18

18

Ethernet packet

[Diagram: packet = Ethernet header (source, dest) + data]

Addresses are “flat” (no topological significance)

slide-19
SLIDE 19

19

It’s easy to confuse “Ethernet” with “network”

  • Both are multiaccess clouds
  • Why can’t Ethernet replace IP?

– Flat addresses
– No hop count
– Missing additional protocols (such as neighbor discovery)
– Perhaps missing features (such as fragmentation, error messages, congestion feedback)

slide-20
SLIDE 20

20

So, we had layer 3, and Ethernet

  • People built protocol stacks leaving out layer 3
  • There were lots of layer 3 protocols (IP, IPX, Appletalk, CLNP), and few multi-protocol routers

slide-21
SLIDE 21

21

Problem Statement

Need something that will sit between two Ethernets, and let a station on one Ethernet talk to a station on another

[Diagram: stations A and C]

slide-22
SLIDE 22

22

Basic idea

  • Listen promiscuously
  • Learn location of source address based on source address in packet and port from which packet received
  • Forward based on learned location of destination
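
The three steps above can be sketched in a few lines. This is a toy model, not any vendor's implementation: the class name `LearningBridge`, the port numbering, and the flood-on-unknown behavior are assumptions made for illustration (the slide itself only says listen, learn, forward).

```python
from typing import Dict, List


class LearningBridge:
    """Toy transparent bridge: learns source locations, forwards by destination."""

    def __init__(self, ports: List[int]) -> None:
        self.ports = ports
        self.table: Dict[str, int] = {}  # MAC address -> port it was last seen on

    def receive(self, src: str, dst: str, in_port: int) -> List[int]:
        """Return the ports on which the frame should be sent out."""
        self.table[src] = in_port              # learn from the source address
        out = self.table.get(dst)
        if out is None:                        # unknown destination: flood (assumption)
            return [p for p in self.ports if p != in_port]
        if out == in_port:                     # destination is on the arrival link: drop
            return []
        return [out]                           # forward on the learned port only
```

A frame from A to an unknown C is flooded; once C replies, later frames from A to C go out on C's port only.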

slide-23
SLIDE 23

23

What’s different between this and a repeater?

  • no collisions
  • with learning, can use more aggregate bandwidth than on any one link
  • no artifacts of LAN technology (# of stations in ring, distance of CSMA/CD)

slide-24
SLIDE 24

24

But loops are a disaster

  • No hop count
  • Exponential proliferation

[Diagram: station S and bridges B1, B2, B3]
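
A rough way to see the proliferation, assuming (this topology is an assumption, since the figure did not survive) two LANs joined by parallel bridges, no spanning tree, and a destination that no bridge has learned: every bridge that hears a copy floods it onto the other LAN, and a bridge does not hear its own transmission. The function name `copies_per_round` is hypothetical.

```python
from typing import List


def copies_per_round(rounds: int, bridges: int = 3) -> List[int]:
    """Count frame copies on the wire after each forwarding round."""
    history = [1]          # S sends a single frame
    copies = 1
    receivers = bridges    # all bridges hear S's original frame
    for _ in range(rounds):
        copies *= receivers
        receivers = bridges - 1   # a forwarded copy is heard by every *other* bridge
        history.append(copies)
    return history


print(copies_per_round(4))  # [1, 3, 6, 12, 24]
```

With no hop count in the Ethernet header, nothing in this model ever removes a copy, so the count only grows.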

slide-29
SLIDE 29

29

What to do about loops?

  • Just say “don’t do that”
  • Or, spanning tree algorithm

– Bridges gossip amongst themselves
– Compute loop-free subset
– Forward data on the spanning tree
– Other links are backups
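
The result the bridges converge on can be sketched centrally. This is a simplification, not the distributed protocol: it assumes point-to-point links between bridges (no per-LAN designated bridge election), elects the lowest ID as Root, and breaks cost ties by lowest parent ID. The function name and data layout are illustrative.

```python
import heapq
from typing import Dict, Set, Tuple


def spanning_tree(links: Dict[int, Dict[int, int]]) -> Tuple[int, Set[Tuple[int, int]]]:
    """links[bridge][neighbor] = cost. Returns (root, set of tree links)."""
    root = min(links)                      # Root elected by lowest ID
    heap = [(0, root, root)]               # (cost to root, parent ID, bridge)
    done: Set[int] = set()
    tree: Set[Tuple[int, int]] = set()
    while heap:
        cost, parent, node = heapq.heappop(heap)
        if node in done:
            continue                       # already reached by a cheaper path
        done.add(node)
        if node != root:
            tree.add((min(parent, node), max(parent, node)))
        for nbr, link_cost in links[node].items():
            if nbr not in done:
                heapq.heappush(heap, (cost + link_cost, node, nbr))
    return root, tree
```

Links not in the returned set are the backups: they carry no data unless the tree is recomputed.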

slide-30
SLIDE 30

30

Algorhyme

I think that I shall never see
A graph more lovely than a tree.
A tree whose crucial property
Is loop-free connectivity.
A tree which must be sure to span
So packets can reach every LAN.
First the Root must be selected
By ID it is elected.
Least cost paths from Root are traced
In the tree these paths are placed.
A mesh is made by folks like me.
Then bridges find a spanning tree.

Radia Perlman

slide-31
SLIDE 31

31

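[Diagram: bridged topology with bridges 2, 3, 4, 5, 6, 7, 9, 10, 11, 14 and stations A and X. Spanning tree messages shown on the links: 2,0,2; 2,0,2; 2,1,14; 2,1,5; 2,1,7; 2,1,6; 2,2,4; 2,2,4; 2,3,3; 2,2,11]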

slide-32
SLIDE 32

32

Notice you don’t get optimal pairwise paths

slide-33
SLIDE 33

33

[Diagram: same topology (bridges 2, 3, 4, 5, 6, 7, 9, 10, 11, 14; stations A and X), showing the path taken when A talks to X]

slide-34
SLIDE 34

34

Bother with spanning tree?

  • Maybe just tell customers “don’t do loops”
  • First bridge sold...
slide-35
SLIDE 35

35

First Bridge Sold

[Diagram: stations A and C]

slide-36
SLIDE 36

36

Bridges are cool, but…

  • Routes are not optimal (spanning tree)

– STA cuts off redundant paths
– If A and B are on opposite side of path, they have to take long detour path

  • Temporary loops really dangerous

– no hop count in header
– proliferation of copies during loops

  • Traffic concentration on selected links
slide-37
SLIDE 37

37

Bridge meltdowns

  • They do occur (a Boston hospital)
  • Lack of receipt of spanning tree msgs tells bridge to turn on link

– So if bridge can’t keep up with wire speed…

  • In contrast with routers: lost messages will cause link to be brought down

– Note: original Digital bridge spec said bridges had to be wire speed

  • Also, some additions to bridging involve configuration, which if wrong…meltdowns

slide-38
SLIDE 38

38

Why are there still bridges?

  • Why not just use routers?

– Bridges plug-and-play
– Endnode addresses can be per-campus

  • IP routes to links, not endnodes

– So IP addresses are per-link
– Need to configure routers
– Need to change endnode address if change links

slide-39
SLIDE 39

39

What if you routed to endnodes, not to links?

  • Suppose you could have a whole campus, all with one prefix
  • That’s what DECnet/CLNP called “level 1 routing”
  • Used a special protocol called “ES-IS”

– Endnodes periodically announce to routers
– Routers periodically announce to endnodes

slide-40
SLIDE 40

40

True “level 1” routing

  • CLNP addresses had two parts

– “area” (14 bytes…)
– node (6 bytes)

  • An area was a whole multi-link campus
  • Two levels of routing

– level 1: routes to exact node ID within area
– level 2: longest matching prefix of “area”
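
A minimal sketch of the two-level lookup, assuming the area and node parts are already split out as strings and that each level has its own table. The function name `route` and the table shapes are hypothetical, chosen only to show exact match versus longest prefix.

```python
from typing import Dict, Optional


def route(dest_area: str, dest_node: str, my_area: str,
          level1: Dict[str, str], level2: Dict[str, str]) -> Optional[str]:
    """Return the next hop for (dest_area, dest_node), or None if unreachable."""
    if dest_area == my_area:
        # level 1: exact match on the node ID within our own area
        return level1.get(dest_node)
    # level 2: longest matching prefix of the destination area
    matches = [prefix for prefix in level2 if dest_area.startswith(prefix)]
    if not matches:
        return None
    return level2[max(matches, key=len)]
```

For example, with `level2 = {"2": "rtrX", "28": "rtrY"}`, a destination in area `"2835"` goes to `"rtrY"` because `"28"` is the longer match.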

slide-41
SLIDE 41

41

Hierarchy

[Diagram: two hierarchies side by side. IP-style: one prefix per link. CLNP-style: one prefix per campus. Prefixes shown: 2*, 25*, 28*, 292*, 22*, 2835*, 2*]

slide-42
SLIDE 42

42

CLNP level 1 routing

  • Autoconfiguration

– Rtrs discover “area” prefix
– Tell endnodes, which plug in their MAC to form their layer 3 address

  • Rtrs tell each other (using link state routing protocol), within area, location of all endnodes in area

slide-43
SLIDE 43

43

“Level 1 routing” with IP

  • IP has never had true level 1 routing

– Each link has a prefix
– Multilink node has two addresses
– Move to new link requires new address

  • Bridging is used with IP to sort of do “level 1 routing”

– But not as good: spanning tree paths rather than optimal pt-to-pt paths, meltdowns
slide-44
SLIDE 44

44

One prefix per campus vs per link

  • Advantages

– Zero configuration of routers within campus
– Move nodes within campus without changing address
– Multiple points of attachment: same address
– Don’t partition address space

  • Disadvantages

– Bigger routing tables of level 1 routers

slide-45
SLIDE 45

45

Bridging vs CLNP-style level 1 Routing

  • Better routes: optimal pt-to-pt routes, can use all links, can path split, do traffic engineering
  • Stable protocol (lost messages bring link down, not up)
  • Forwarding with safe hdr (hop count, and specify next hop)
  • But CLNP depended on ES-IS: We’ll have to do our best without it
slide-46
SLIDE 46

46

Link State Routing

  • meet nbrs
  • Construct Link State Packet (LSP)

– who you are
– list of (nbr, cost) pairs

  • Broadcast LSPs to all rtrs (“a miracle occurs”)
  • Store latest LSP from each rtr
  • Compute Routes (breadth first, i.e., “shortest path” first: well known and efficient algorithm)
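
The last step can be sketched with Dijkstra's shortest-path-first algorithm run over the stored LSPs. The database layout (`lsdb[router]` is that router's list of (nbr, cost) pairs) and the function name are assumptions for illustration; LSP flooding, the “miracle”, is not modeled.

```python
import heapq
from typing import Dict, List, Set, Tuple


def compute_routes(lsdb: Dict[str, List[Tuple[str, int]]],
                   me: str) -> Dict[str, Tuple[int, str]]:
    """Return {destination: (total cost, first hop)} as seen from router `me`."""
    routes: Dict[str, Tuple[int, str]] = {}
    visited: Set[str] = set()
    heap: List[Tuple[int, str, str]] = [(0, me, me)]   # (cost, router, first hop)
    while heap:
        cost, rtr, first = heapq.heappop(heap)
        if rtr in visited:
            continue                       # a cheaper path was already found
        visited.add(rtr)
        if rtr != me:
            routes[rtr] = (cost, first)
        for nbr, link_cost in lsdb.get(rtr, []):
            if nbr not in visited:
                hop = nbr if rtr == me else first   # first hop is fixed at our own neighbor
                heapq.heappush(heap, (cost + link_cost, nbr, hop))
    return routes
```

Every router runs the same computation on the same database, so the resulting per-hop decisions are consistent.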

slide-47
SLIDE 47

47

IS-IS

  • A specific link state protocol
  • Similar to OSPF, but more suitable for RBridges because

– No need to configure IP addresses (which OSPF depends on)
– Easy to add new fields with TLV encoding

slide-48
SLIDE 48

48

What we’d like

  • Best features of bridges

– Transparency to endnodes
– Plug-and-play switches

  • Best features of routers

– Optimal paths
– Safe header for forwarding

  • Interoperate with existing routers, endnodes, and bridges

slide-49
SLIDE 49

49

RBridges

  • Compatible with today’s bridges and routers
  • Like routers, terminate bridges’ spanning tree
  • Like bridges, glue LANs together to create one IP subnet (or for other protocols, a broadcast domain)
  • Like routers, optimal paths, fast convergence, no meltdowns

  • Like bridges, plug-and-play
slide-50
SLIDE 50

50

RBridging layer 2

  • Link state protocol among RBridges (so know how to route to other RBridges)
  • Like bridges, learn location of endnodes from receiving data traffic
  • But since traffic on optimal paths, need to distinguish originating traffic from transit

  • So encapsulate packet
slide-51
SLIDE 51

51

“Layer 2” routing

  • Think of it just like routers

– At each hop, add a layer 2 header to get to the next hop RBridge
– The “destination” is the egress RBridge

  • First RBridge

– Looks up destination endnode, finds egress RB
– Adds shim header, forwards to egress RB

  • Egress RBridge decapsulates
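
The hop-by-hop behavior described above can be sketched as three small functions. The field names, the initial hop count of 16, and the lookup tables are illustrative assumptions, not the actual header layout.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class Frame:
    dst: str
    src: str
    payload: bytes


@dataclass(frozen=True)
class Encapsulated:
    outer_dst: str    # next hop RBridge
    outer_src: str    # transmitting RBridge
    hop_count: int    # shim field
    egress: str       # shim field: egress RBridge (unicast case)
    inner: Frame      # original packet, including its L2 header


def encapsulate(frame: Frame, me: str, endnodes: Dict[str, str],
                next_hop: Dict[str, str], hops: int = 16) -> Encapsulated:
    """First RBridge: find the egress RBridge, add shim and outer header."""
    egress = endnodes[frame.dst]
    return Encapsulated(next_hop[egress], me, hops, egress, frame)


def forward(pkt: Encapsulated, me: str,
            next_hop: Dict[str, str]) -> Optional[Encapsulated]:
    """Transit RBridge: rewrite the outer header, decrement the hop count."""
    if pkt.hop_count <= 1:
        return None   # hop count exhausted: drop instead of looping forever
    return replace(pkt, outer_dst=next_hop[pkt.egress], outer_src=me,
                   hop_count=pkt.hop_count - 1)


def decapsulate(pkt: Encapsulated) -> Frame:
    """Egress RBridge: strip outer header and shim, deliver the original frame."""
    return pkt.inner
```

Only the outer header and hop count change from hop to hop; the shim's egress RBridge and the inner frame ride through untouched.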
slide-52
SLIDE 52

52

RBridging

[Diagram: RBridges R1, R2, R3, R4, R5, R6, R7 and endnodes a and c]

slide-53
SLIDE 53

53

Encapsulation Header

[Diagram: encapsulated packet]

– Outer header: S = transmitting RBridge, D = receiving RBridge, pt = “transit”
– Shim: hop count, in/out RBridge
– Original pkt (including L2 hdr)

  • Outer header: new p-t; otherwise, ordinary Ethernet hdr
  • Shim: hop count, and ingress or egress RBridge

– If unicast, it is egress RBridge
– If multicast or unknown destination, it is ingress RBridge

slide-54
SLIDE 54

54

Hdrs inside hdrs

[Diagram: endnodes S and D, RBridges R1, R2, R3]

slide-55
SLIDE 55

55

Hdrs inside hdrs

[Diagram: endnodes S and D, RBridges R1, R2, R3. Packet as sent by S: Ethernet header (D = …, S = …)]

slide-56
SLIDE 56

56

Hdrs inside hdrs

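[Diagram: endnodes S and D, RBridges R1, R2, R3. Packet as sent by R1: outer header (D = …, S = …, p-t = new), shim (TTL, R3), original header (D = …, S = …)]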

slide-57
SLIDE 57

57

Hdrs inside hdrs

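[Diagram: endnodes S and D, RBridges R1, R2, R3. Packet as sent by R2: outer header (D = …, S = …, p-t = new), shim (TTL-1, R3), original header (D = …, S = …)]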

slide-58
SLIDE 58

58

Hdrs inside hdrs

[Diagram: endnodes S and D, RBridges R1, R2, R3. Packet as sent by R3: Ethernet header (D = …, S = …)]

slide-59
SLIDE 59

59

Flooded traffic

  • Some traffic needs to be sent to all links

– Unknown destinations
– Multicast traffic

  • Could use a single spanning tree

– Spanning tree computed from link state database (not separate spanning tree protocol)

  • But we decided on per-ingress trees
slide-60
SLIDE 60

60

Why per-ingress trees?

  • Yeah, it’s more computation (but not more protocol messages…link state database gives all the info you need)
  • Advantages

– IP multicasts, and VLAN delivery not to all links, so optimized delivery
– Packets won’t get out of order when switching from unknown to known (take same path)

slide-61
SLIDE 61

61

VLANs

  • VLANs allow partitioning of a bridged campus
  • A packet marked with a VLAN tag must only be delivered to links in that VLAN
  • With bridges, VLAN membership of links is configured into the bridges
  • Usually bridges add and remove the VLAN tag

slide-62
SLIDE 62

62

Calculating a spanning tree

  • No need for separate spanning tree protocol
  • Link state protocol gives enough info
  • For per-RBridge trees: calculate a spanning tree rooted at that RBridge

slide-63
SLIDE 63

63

Need a broadcast domain per VLAN

  • That’s the definition of a VLAN
  • A packet for VLAN A must only be delivered to links in VLAN A

– Could be filtered at egress RBridges (makes forwarding info for intermediate RBridges trivial, but packet delivery more expensive)
– Or could be filtered if no downstream receivers for VLAN A

slide-64
SLIDE 64

64

[Diagram: topology with bridges 2, 3, 4, 5, 6, 7, 9, 10, 11, 14 and spanning tree messages 2,0,2; 2,0,2; 2,1,14; 2,1,5; 2,1,7; 2,1,6; 2,2,4; 2,2,4; 2,3,3; 2,2,11. Delivery path for VLAN A if one spanning tree, and pinks are VLAN A]

slide-65
SLIDE 65

65

[Diagram: topology with bridges 2, 3, 4, 5, 6, 7, 9, 10, 11, 14. Delivery path for VLAN A if per VLAN spanning tree, and pinks are VLAN A]

slide-66
SLIDE 66

66

Likewise for IP-derived multicasts

  • If receivers are on just a few links, then it would be worth calculating per-RBridge spanning tree
  • IP multicast has IGMP to let RBridges know where receivers are

slide-67
SLIDE 67

67

Filtering info

  • With spanning tree, have a set of ports
  • For each port, mark what VLANs, and IP multicasts are downstream on that branch
  • Select the spanning tree (based on ingress RBridge from shim header, or VLAN tag), then apply filtering rules
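
A sketch of "select the tree, then filter", assuming the per-tree port sets and the per-port downstream VLAN marks have already been computed from the link state database. The function name and table shapes are hypothetical, and only VLAN filtering is shown (IP multicast groups would be handled the same way).

```python
from typing import Dict, List, Set, Tuple


def flood_ports(tree_ports: Dict[str, List[int]],
                downstream_vlans: Dict[Tuple[str, int], Set[str]],
                ingress: str, vlan: str, in_port: int) -> List[int]:
    """Ports on which to forward a flooded packet for `vlan`."""
    ports = tree_ports[ingress]            # tree chosen by ingress RBridge from the shim
    return [p for p in ports
            if p != in_port                # never send back where it came from
            and vlan in downstream_vlans.get((ingress, p), set())]
```

A branch with no downstream receivers for the VLAN is simply skipped, which is the optimized delivery the per-ingress trees allow.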

slide-68
SLIDE 68

68

Shim Header

  • Information needed

– Hop count
– Ingress RBridge
– Egress RBridge

  • But never need BOTH ingress and egress RBridge, so can overload the field

slide-69
SLIDE 69

69

MPLS-like header

  • Some people argued that some hardware already adds 4-byte shim header (for MPLS)

  • MPLS contains TTL, and “label”
  • But “label” is 20 bits
  • How to fit 6-byte address into 20 bits?
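
To make the size constraint concrete, here is a sketch that packs a 20-bit value and an 8-bit TTL into a 4-byte word using the MPLS field positions (20-bit label, 3-bit traffic class, 1-bit bottom-of-stack, 8-bit TTL). It only illustrates the arithmetic and is not the RBridge shim format; the function names are hypothetical, and the traffic class and bottom-of-stack bits are left at zero.

```python
import struct
from typing import Tuple


def pack_shim(label: int, ttl: int) -> bytes:
    """Pack a 20-bit label and an 8-bit TTL into a 4-byte MPLS-style word."""
    if not 0 <= label < (1 << 20):
        raise ValueError("label must fit in 20 bits")
    if not 0 <= ttl < (1 << 8):
        raise ValueError("ttl must fit in 8 bits")
    word = (label << 12) | ttl     # bits 11..8 (TC, S) stay zero in this sketch
    return struct.pack("!I", word)


def unpack_shim(data: bytes) -> Tuple[int, int]:
    """Return (label, ttl) from a 4-byte MPLS-style word."""
    (word,) = struct.unpack("!I", data)
    return word >> 12, word & 0xFF
```

A 6-byte MAC address is 48 bits, so `pack_shim(0x0123456789AB, 64)` raises `ValueError`: whatever identifies an RBridge in this field has to be much shorter than its address.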
slide-70
SLIDE 70

70

Nicknames

  • Piggyback “nickname acquisition protocol” onto link state protocol
  • Stick “I want this nickname” into LSP
  • If someone else claims your nickname, whoever has lower system ID keeps the nickname; the other one chooses a different one
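
The tie-break rule can be sketched as a single centralized pass over all claims. In the real protocol each RBridge would react to the LSPs it receives; the function name, the 20-bit nickname space, and the random choice of a replacement nickname are assumptions made for illustration.

```python
import random
from typing import Dict, Optional


def resolve_nicknames(claims: Dict[str, int], space: int = 1 << 20,
                      rng: Optional[random.Random] = None) -> Dict[str, int]:
    """claims: system ID -> wanted nickname. Returns system ID -> final nickname."""
    rng = rng or random.Random()
    owner: Dict[int, str] = {}             # nickname -> system ID that keeps it
    for system_id, nick in claims.items():
        if nick not in owner or system_id < owner[nick]:
            owner[nick] = system_id        # lower system ID keeps the nickname
    result = {sid: nick for nick, sid in owner.items()}
    for system_id in sorted(claims):
        if system_id not in result:        # lost the tie-break: choose a different one
            nick = rng.randrange(space)
            while nick in owner:
                nick = rng.randrange(space)
            owner[nick] = system_id
            result[system_id] = nick
    return result
```

Winners never change their nickname here, so only the losing RBridge has to advertise a new claim.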
slide-71
SLIDE 71

71

Endnode Learning

  • On shared link, only one RBridge (DR) can learn and decapsulate onto link

– otherwise, a “naked” packet will look like the source is on that link
– have election to choose which RBridge

  • When DR sees naked pkt from S, announces S in its link state info to other RBridges

slide-72
SLIDE 72

72

Pkt Forwarding: Ingress RBridge

  • If D known: look up egress RBridge R2, encapsulate, and forward towards R2
  • Else, send to “destination=flood”, meaning send on spanning tree

– Which tree: “ingress RBridge”
– each DR decapsulates

slide-73
SLIDE 73

73

IP optimization: proxy ARP

  • For IP, learn (layer 3, layer 2) from ARP (ND) replies
  • Pass around (layer 3, layer 2) pairs in LSP info
  • Local RBridge can proxy ARP (i.e., answer the ARP query itself) if target (layer 3, layer 2) known

slide-74
SLIDE 74

74

Possible IP optimization: tighter aliveness check

  • Can check aliveness of attached IP endnodes by sending ARP query
  • Can assume endnode alive, until you forward traffic to it, or until someone else claims that endnode

slide-75
SLIDE 75

75

VLANs

  • VLAN A endnodes only need to be learned by RBridges attached to VLAN A
  • All RBridges must be able to forward to any other RBridge
  • Egress RBridge in the encapsulation header
slide-76
SLIDE 76

76

Endnode Information

  • Only need to know location of VLAN A endnodes if you are attached to VLAN A
  • So, different instances of IS-IS

– One for basic RBridge-RBridge connectivity
– One per VLAN, to share endnode membership for that VLAN

slide-77
SLIDE 77

77

Possible variant

  • Don’t explicitly pass around endnode information
  • Instead, egress RBridge learns (ingress RBridge, source) from data packet

slide-78
SLIDE 78

78

Decided not to do that because

  • Sometimes layer 2 is definitive about enrollment (so better than caches and timeouts)
  • Won’t need to flood as often
  • Don’t want to starve between two bales of hay

slide-79
SLIDE 79

79

Conclusions

  • Routing without tears

– Zero configuration, optimal paths

  • Bridging without danger

– No meltdowns

  • Can gradually replace bridges with RBridges

slide-80
SLIDE 80

80

Algorhyme v2

I hope that we shall one day see
A graph more lovely than a tree.
A graph to boost efficiency
While still configuration-free.
A network where RBridges can
Route packets to their target LAN.
The paths they find, to our elation,
Are least cost paths to destination.
With packet hop counts we now see,
The network need not be loop-free.
RBridges work effectively.
Without a common spanning tree.

Ray Perlner