A Network-State Management Service Peng Sun Ratul Mahajan, - - PowerPoint PPT Presentation



slide-1
SLIDE 1

A Network-State Management Service

Peng Sun Ratul Mahajan, Jennifer Rexford, Lihua Yuan, Ming Zhang, Ahsan Arefin Princeton & Microsoft

slide-2
SLIDE 2

Complex Infrastructure

Variety of vendors/models/time

Microsoft Azure

Number of           2010           2014
Data Center         A few          10s
Network Device      1,000s         10s of 1,000s
Network Capacity    10s of Tbps    Pbps

slide-3
SLIDE 3

Management Applications

  • Traffic Engineering
  • Load Balancing
  • Link Corruption Mitigation
  • Device Firmware Upgrade
  • ……

slide-4
SLIDE 4

Our Question

How to safely run multiple management applications on shared infrastructure

slide-5
SLIDE 5

Naïve Solution

  • Run independently

[Diagram: Traffic Engineering, Link Corruption Mitigation, and Firmware Upgrade each acting on the Network Devices]

slide-6
SLIDE 6

Naïve Solution

  • It does not work due to two problems

slide-7
SLIDE 7

Problem #2: Safety Violation

  • Link-corruption-mitigation shuts down faulty Agg A
  • Firmware-upgrade schedules Agg B to upgrade

[Diagram: ToRs, Agg A, Agg B, Core 1, Core 2]

slide-8
SLIDE 8

Potential Solution #1

  • One monolithic application
  • Central control of all actions

[Diagram: Traffic Engineering, Firmware Upgrade, Link Corruption Mitigation]

slide-9
SLIDE 9

Too Complex to Build

  • Difficult to develop: combines all applications that are already individually complicated
  • High maintenance cost for such huge software in practice

slide-10
SLIDE 10

Potential Solution #2

  • Explicit coordination among applications
  • Consensus over network changes

[Diagram: Traffic Engineering, Firmware Upgrade, Link Corruption Mitigation]

slide-11
SLIDE 11

Still Too Complex

  • Hard to understand each other
  • Diverse network interactions

[Table: applications (Traffic Engineering, Firmware upgrade) against the network interactions they use (Routing, Device Config)]

slide-12
SLIDE 12

Main Enemy: Complexity

  • Application development
  • Application coordination

[Diagram: Independent, Explicitly coordinated, and Monolithic approaches on an axis from Simple to Complex]

slide-13
SLIDE 13

What We Advocate

  • Loose coupling of applications
  • Design principle:
  • Simplicity with safety guarantees
  • Forgo joint optimization
  • Worthwhile tradeoff for simplicity
  • Applications could do it out-of-band


slide-14
SLIDE 14

Overview of Statesman

  • Network operating system for safe multi-application operation
  • Uses network state abstraction
  • Three views of network state
  • Dependency model of states

slide-15
SLIDE 15

The “State” in Statesman

  • Complexity of dealing with devices
  • Heterogeneity
  • Device-specific commands

[Diagram: Network Devices abstracted as Network State]

slide-16
SLIDE 16

State Variable Examples

State Variable           Value
Device Power Status      Up, down
Device Firmware          Version number
Device SDN Agent Boot    Up, down
Device Routing State     Routing rules
Link Admin Status        Up, down
Link Control Plane       BGP, OpenFlow, …
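As a rough illustration of the state variable examples on this slide, the sketch below stores each one as an entity-variable-value entry. The class name, the entity names ("AggA", "AggA-ToR1"), and the firmware value "6.0" are hypothetical and are not taken from Statesman itself.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class StateVariable:
    """One entity-variable-value entry (all names here are illustrative)."""
    entity: str    # a device or a link, e.g. "AggA" or "AggA-ToR1"
    variable: str  # e.g. "DevicePowerStatus"
    value: str     # e.g. "up", "down", or a version string


def index_by_key(entries: Iterable[StateVariable]) -> Dict[Tuple[str, str], str]:
    """Index entries by (entity, variable) so values can be looked up."""
    return {(e.entity, e.variable): e.value for e in entries}


# Hypothetical observed values for one device and one link.
observed = index_by_key([
    StateVariable("AggA", "DevicePowerStatus", "up"),
    StateVariable("AggA", "DeviceFirmware", "6.0"),
    StateVariable("AggA-ToR1", "LinkAdminStatus", "up"),
])

print(observed[("AggA", "DeviceFirmware")])
```

An application that works against such a map only reads and writes values, which is the point of the abstraction: it never issues device-specific commands.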

slide-17
SLIDE 17

Simplify Device Interaction

  • Past: the Application reads Device Statistics and writes Device-specific cmds to the Network Devices directly (SNMP, OF, vendor API, …)
  • Now: the Application reads and writes Network State, which stands between it and the Network Devices

slide-18
SLIDE 18

Views of Network State

  • Observed State: actual state of the whole network
  • Target State: desired state to be updated on the whole network

[Diagram: Network Devices, Observed State, Target State, Applications]

slide-19
SLIDE 19

Two Views Are Not Enough

[Diagram: Network Devices, Observed State, Target State, Applications]

slide-20
SLIDE 20

Two Views Are Not Enough

One More View

  • Proposed State: a group of entity-variable-values desired by an application

[Diagram: Network Devices, Observed State, Target State, Proposed State, Applications]

slide-21
SLIDE 21

How Merging Works

  • Combine multiple proposed states into a safe target state

  • Conflict resolution
  • Last-writer-wins
  • Priority-based locking
  • Sufficient for current deployment
  • Safety invariant checking
  • Partial rejection & Skip update
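
The merging steps listed above can be sketched as follows. This is a minimal model under stated assumptions: proposals carry a timestamp, a proposal is rejected as a whole when it would break the invariant (finer-grained partial rejection and locking are not modeled), and the invariant itself is a toy predicate. None of the names below come from Statesman.

```python
from typing import Callable, Dict, List, Tuple

Key = Tuple[str, str]              # (entity, variable)
State = Dict[Key, str]
Proposal = Tuple[int, str, State]  # (timestamp, app name, proposed values)


def merge(observed: State, proposals: List[Proposal],
          is_safe: Callable[[State], bool]) -> Tuple[State, List[str]]:
    """Fold proposals into a target state, oldest first.

    Later writers overwrite earlier ones (last-writer-wins). A proposal
    whose result would violate the invariant is rejected as a whole.
    """
    target = dict(observed)
    rejected = []
    for _, app, changes in sorted(proposals, key=lambda p: p[0]):
        candidate = {**target, **changes}
        if is_safe(candidate):
            target = candidate
        else:
            rejected.append(app)
    return target, rejected


def at_least_one_agg_up(state: State) -> bool:
    """Toy invariant: AggA and AggB must not both be down."""
    power = [state[(d, "DevicePowerStatus")] for d in ("AggA", "AggB")]
    return "up" in power


observed = {("AggA", "DevicePowerStatus"): "up",
            ("AggB", "DevicePowerStatus"): "up"}
proposals = [
    (1, "link-corruption-mitigation", {("AggA", "DevicePowerStatus"): "down"}),
    (2, "firmware-upgrade", {("AggB", "DevicePowerStatus"): "down"}),
]
target, rejected = merge(observed, proposals, at_least_one_agg_up)
print(target, rejected)
```

With these inputs the mitigation proposal is accepted and the upgrade proposal is skipped, which mirrors the Agg A / Agg B safety violation shown earlier in the deck.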


slide-22
SLIDE 22

Choose Safety Invariants

  • Our current choice
  • Connectivity: Every pair of ToRs in one DC is connected
  • Capacity: 99% of ToR pairs have at least 50% capacity

Tradeoff in choosing invariants:

  • Too tight: hinders applications too frequently
  • Too loose: cannot protect network operation
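
To make the two invariants concrete, here is a small sketch for a single pod. It assumes a simplified model in which the capacity between two ToRs is the share of their common Aggs that is still usable, and it ignores paths through the core. The topology and all function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, Set, Tuple

Uplinks = Dict[str, Set[str]]  # ToR name -> Aggs it can currently reach


def pair_capacity(now: Uplinks, baseline: Uplinks, a: str, b: str) -> float:
    """Share of the baseline Aggs common to ToRs a and b that is still usable."""
    before = baseline[a] & baseline[b]
    usable = now[a] & now[b] & before
    return len(usable) / len(before) if before else 0.0


def check_invariants(now: Uplinks, baseline: Uplinks,
                     pair_fraction: float = 0.99,
                     min_capacity: float = 0.5) -> Tuple[bool, bool]:
    """Return (connectivity_ok, capacity_ok) for one pod."""
    pairs = list(combinations(sorted(baseline), 2))
    caps = [pair_capacity(now, baseline, a, b) for a, b in pairs]
    connectivity_ok = all(c > 0 for c in caps)
    good_pairs = sum(1 for c in caps if c >= min_capacity)
    capacity_ok = good_pairs >= pair_fraction * len(pairs)
    return connectivity_ok, capacity_ok


tors = ("ToR1", "ToR2", "ToR3")
baseline = {t: {"Agg1", "Agg2", "Agg3", "Agg4"} for t in tors}
two_down = {t: {"Agg3", "Agg4"} for t in tors}
three_down = {t: {"Agg4"} for t in tors}

print(check_invariants(two_down, baseline))    # (True, True)
print(check_invariants(three_down, baseline))  # (True, False)
```

Raising `min_capacity` tightens the invariant and blocks more changes; lowering it lets more changes through at the cost of protection, which is the tight-versus-loose tradeoff on this slide.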

slide-23
SLIDE 23

Recap of Three-View Model

  • Simplify network management

  • Observed State: what we see from the network
  • Proposed State: what we want the network to be
  • Target State: what can actually be done on the network

[Diagram: Applications, Statesman, and the three states]

slide-24
SLIDE 24

Yet Another Problem

  • What’s in Proposed State
  • Small number of state variables that the application cares about
  • Implicit conflicts arise
  • Caused by state dependency

slide-25
SLIDE 25

Implicit Conflict

  • TE writes new value of routing state of B for tunneling traffic
  • Firmware-upgrade writes new value of firmware state of B

[Diagram: devices A, B, C, D]

slide-26
SLIDE 26

Dependency Relations

  • Device: PowerState, FirmwareVersion, ConfigurationState, RoutingState
  • Link: AdminState, ConfigurationState, PathState

slide-27
SLIDE 27

Build in Dependency Model

  • Statesman calculates it internally
  • Only exposes the result for each state variable
  • Whether the variable is controllable
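
One way to picture the controllability result is the sketch below. It assumes the device variables from the Dependency Relations slide form a simple chain in the order listed there, and that a variable is controllable only when everything before it in the chain is healthy. Both the chain order and the "healthy" flag are assumptions made for illustration; the slides do not spell out the exact edges.

```python
from typing import Dict

# Assumed dependency order for device variables (illustrative only).
DEVICE_CHAIN = ["PowerState", "FirmwareVersion",
                "ConfigurationState", "RoutingState"]


def controllable(variable: str, healthy: Dict[str, bool]) -> bool:
    """A variable is controllable only if all variables it depends on are healthy."""
    idx = DEVICE_CHAIN.index(variable)
    return all(healthy[v] for v in DEVICE_CHAIN[:idx])


# Device B while its firmware is being upgraded (hypothetical snapshot).
device_b = {"PowerState": True, "FirmwareVersion": False,
            "ConfigurationState": True, "RoutingState": True}

print(controllable("RoutingState", device_b))     # False
print(controllable("FirmwareVersion", device_b))  # True
```

Under this model, TE would see the routing state of B as not controllable during the upgrade, so the implicit conflict surfaces without TE knowing anything about firmware.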

slide-28
SLIDE 28

Statesman System

[Diagram: Monitor, Checker, Updater, and a Storage Service holding the Observed State, Proposed State, and Target State]

slide-29
SLIDE 29

Deployment Overview

  • Operational in Microsoft Azure for 10 months
  • Covers 10 DCs with 20K devices

slide-30
SLIDE 30

Production Applications

  • 3 diverse applications built
  • Device firmware upgrade
  • Link corruption mitigation
  • Traffic engineering
  • Finished within months
  • Only thousands of lines of code

slide-31
SLIDE 31

Case #1: Resolve Conflict

Inter-DC TE & Firmware-upgrade

[Diagram: four data centers linked through border routers: DC 1 (BR 1, BR 2), DC 2 (BR 3, BR 4), DC 3 (BR 5, BR 6), DC 4 (BR 7, BR 8). DC = Data Center, BR = Border Router]

slide-32
SLIDE 32

[Chart: timeline of TE and firmware-upgrade activity on BR1, built up over slides 32 to 39]

  • Firmware-upgrade acquires lock of BR1
  • TE fails to acquire lock, and moves traffic away
  • BR1 firmware upgrade starts
  • BR1 firmware upgrade ends. Lock released.
  • TE re-acquires lock, and moves traffic back

slide-40
SLIDE 40

Case #1 Summary

  • Each application:
  • Simple logic
  • Unaware of the other
  • Statesman enables:
  • Conflict resolution
  • Necessary coordination
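
The lock sequence in this case can be sketched with a toy priority lock table. The class, its methods, and the priority numbers are hypothetical: the slides do not state which application has higher priority, so the values below are an assumption chosen only to reproduce the timeline.

```python
from typing import Dict, Optional, Tuple


class LockTable:
    """Per-entity locks with priorities (a toy model, not Statesman's API)."""

    def __init__(self) -> None:
        self._holders: Dict[str, Tuple[str, int]] = {}

    def acquire(self, entity: str, app: str, priority: int) -> bool:
        """Succeed if the entity is free, already ours, or held at lower priority."""
        holder = self._holders.get(entity)
        if holder is None or holder[0] == app or priority > holder[1]:
            self._holders[entity] = (app, priority)
            return True
        return False

    def release(self, entity: str, app: str) -> None:
        """Release the lock only if this app currently holds it."""
        holder = self._holders.get(entity)
        if holder is not None and holder[0] == app:
            del self._holders[entity]

    def holder(self, entity: str) -> Optional[str]:
        entry = self._holders.get(entity)
        return entry[0] if entry else None


locks = LockTable()
timeline = []
timeline.append(locks.acquire("BR1", "firmware-upgrade", priority=2))
timeline.append(locks.acquire("BR1", "te", priority=1))   # TE moves traffic away
locks.release("BR1", "firmware-upgrade")                   # upgrade ends
timeline.append(locks.acquire("BR1", "te", priority=1))   # TE moves traffic back
print(timeline, locks.holder("BR1"))
```

Each application only asks for a lock and reacts to the answer, so neither needs to know the other exists.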


slide-41
SLIDE 41

Case #2: Maintain Capacity Invariant

Firmware-upgrade & Link-corruption-mitigation

[Diagram: Core, Agg, and ToR layers across Pod 1 through Pod 10, with Aggs numbered 1 to 4 and ToRs numbered 1 to n in each pod; one link is marked as corrupting packets]

slide-42
SLIDE 42

[Chart: firmware-upgrade progress per pod, built up over slides 42 to 46]

  • Upgrade proceeds at normal speed in Pod 3 and 5
  • Upgrade in Pod 4 is slowed down by checker due to lost capacity

slide-47
SLIDE 47

Case #2 Summary

  • Statesman:
  • Automatically adjusts application progress
  • Keeps the network within safety requirements
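
The slowdown in Pod 4 can be illustrated with a small admission sketch. It assumes a pod with 4 Aggs, treats the capacity lost to link-corruption-mitigation as one Agg's worth, and admits upgrades one at a time while at least 50% of capacity remains. The function and its parameters are hypothetical and simplify how the checker actually evaluates capacity.

```python
from typing import List


def admit_upgrades(requested: List[str], total_aggs: int,
                   already_down: int, min_capacity: float = 0.5) -> List[str]:
    """Admit requested Agg upgrades while remaining capacity stays >= min_capacity."""
    admitted = []
    down = already_down
    for agg in requested:
        remaining = (total_aggs - (down + 1)) / total_aggs
        if remaining >= min_capacity:
            admitted.append(agg)
            down += 1
    return admitted


wanted = ["Agg1", "Agg2", "Agg3"]
healthy_pod = admit_upgrades(wanted, total_aggs=4, already_down=0)
degraded_pod = admit_upgrades(wanted, total_aggs=4, already_down=1)
print(healthy_pod)   # ['Agg1', 'Agg2']
print(degraded_pod)  # ['Agg1']
```

The upgrade application submits the same request in both pods; the degraded pod simply gets fewer changes admitted per round, so its upgrade takes longer.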

slide-48
SLIDE 48

Conclusion

  • Need network operating system for multiple management applications
  • Statesman
  • Loose coupling of applications
  • Network state abstraction
  • Deployed and operational in Azure

slide-49
SLIDE 49

Thanks! Questions?

Check paper for related work