Neutron L3 Agent HA Or: How I Learned to Stop Worrying and Love the API


SLIDE 1

Neutron L3 Agent HA

Or: How I Learned to Stop Worrying and Love the API Kevin Bringard // OpenStack Juno Summit // May 2014

SLIDE 2
  • There is no “one right way”
  • The goal is to move L3 resources to a new L2 resource as quickly and seamlessly as possible
  • This is a really difficult, but important, problem to solve

SLIDE 3

Layer 3

Internet Happens

SLIDE 4

[Diagram: three L3 agents; router1 through router6; VM1 through VM7; Core Router]

SLIDE 5

[Diagram: two L3 agents; router1 through router6; VM1 through VM7; Core Router]

SLIDE 6

Layer 2

The ARPing is the hardest part

SLIDE 7
  • One L3 resource may only be tied to one L2 resource at a time
  • Many technologies exist to sort of work around this
  • HSRP
  • VRRP
  • CARP
  • Work is being done to implement VRRP-like functionality in Juno
  • https://blueprints.launchpad.net/neutron/+spec/l3-high-availability
  • Nothing is currently integrated into OpenStack
SLIDE 8

Pacemaker

http://docs.openstack.org/high-availability-guide/content/_highly_available_neutron_l3_agent.html

SLIDE 9
  • False positives caused more downtime than actual outages
  • Split brain possibilities
  • Assumes control of L3 agent start/stop functions
  • Limited Horizontal Scale
  • More difficult to run multiple Active L3 agents
  • Failover requires starting and stopping entire services
  • Active/Passive Model Requires More Hardware
  • Works on a “per agent” level
  • Akin to RAID1
SLIDE 10

[Diagram: three L3 agents; router1 through router6; VM1 through VM7; Core Router]

SLIDE 11

[Diagram: two L3 agents; router1 through router6; VM1 through VM7; Core Router]

SLIDE 12

[Diagram: three L3 agents; router1 through router6; VM1 through VM7; Core Router]

SLIDE 13

Neutron HA Tool

https://raw.githubusercontent.com/stackforge/cookbook-openstack-network/master/files/default/neutron-ha-tool.py

SLIDE 14
  • API Driven
  • Uses native API calls to perform all functions
  • Can be run externally from infrastructure or cross site
  • Supports any operation the neutron client libraries support
  • Easily Extendable
  • Written in Python
  • Leverages standard OpenStack libraries
  • Works on a “per resource” level
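
The “per resource” idea above can be sketched as a small planning step: routers hosted on a dead L3 agent are reassigned one at a time to the least-loaded surviving agent. This is an illustrative sketch, not the code of neutron-ha-tool.py. The function name `plan_router_moves` and the dictionary shapes are assumptions made for the example; in a real deployment the inputs would come from the Neutron API, and each planned move would be carried out with client calls such as python-neutronclient's `remove_router_from_l3_agent` and `add_router_to_l3_agent`.

```python
def plan_router_moves(agents, routers_by_agent):
    """Return (router_id, dead_agent_id, target_agent_id) tuples.

    agents: list of dicts with 'id' and 'alive' keys (assumed shape).
    routers_by_agent: dict mapping agent id -> list of router ids.
    """
    live = [a["id"] for a in agents if a["alive"]]
    if not live:
        raise RuntimeError("no live L3 agents to migrate routers to")

    # Track load so routers are spread across the surviving agents.
    load = {agent_id: len(routers_by_agent.get(agent_id, [])) for agent_id in live}

    moves = []
    for agent in agents:
        if agent["alive"]:
            continue
        for router_id in routers_by_agent.get(agent["id"], []):
            # Pick the surviving agent that currently hosts the fewest routers.
            target = min(live, key=lambda agent_id: load[agent_id])
            load[target] += 1
            moves.append((router_id, agent["id"], target))
    return moves


# Hypothetical state: agent-b has failed while hosting router3 and router4.
agents = [
    {"id": "agent-a", "alive": True},
    {"id": "agent-b", "alive": False},
    {"id": "agent-c", "alive": True},
]
routers_by_agent = {
    "agent-a": ["router1", "router2"],
    "agent-b": ["router3", "router4"],
    "agent-c": ["router5", "router6"],
}
moves = plan_router_moves(agents, routers_by_agent)
for router_id, source, target in moves:
    print(f"move {router_id}: {source} -> {target}")
```

Because the plan is computed per router rather than per agent, only the routers on the failed agent appear in it; routers on healthy agents are left alone.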
SLIDE 15

[Diagram: three L3 agents; router1 through router6; VM1 through VM7; Core Router]

SLIDE 16

[Diagram: two L3 agents; router1 through router6; VM1 through VM7; Core Router]

SLIDE 17

[Diagram: two L3 agents; router1 through router6; VM1 through VM7; Core Router]

SLIDE 18
  • Only routers/IPs on the affected L3 agent are impacted
  • Recovery time depends on the number of routers which need to be migrated and the number of IPs on each router
  • Migration happens quickly, but every IP on the routers must re-ARP to the upstream switch
  • Metadata proxies migrate with the routers
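
To make the re-ARP step above more concrete, here is a sketch of what a gratuitous ARP announcement looks like on the wire: an ARP request in which the sender and target IP are the same address, broadcast so that upstream devices can relearn where that IP lives. The slides do not say exactly which frames the L3 agent emits, so treat this as a generic illustration. The MAC and IP values are made up, and actually transmitting the frame (which needs a raw socket and elevated privileges) is deliberately left out.

```python
import socket
import struct


def build_gratuitous_arp(mac, ip):
    """Build a 42-byte Ethernet frame carrying a gratuitous ARP request."""
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    ip_bytes = socket.inet_aton(ip)
    broadcast = b"\xff" * 6

    # Ethernet header: destination, source, EtherType 0x0806 (ARP).
    ethernet = broadcast + mac_bytes + struct.pack("!H", 0x0806)

    # ARP header: hardware type Ethernet (1), protocol IPv4 (0x0800),
    # hardware length 6, protocol length 4, opcode 1 (request).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    # Sender IP and target IP are identical: that is what makes it gratuitous.
    arp += mac_bytes + ip_bytes + b"\x00" * 6 + ip_bytes
    return ethernet + arp


# Example values only; they do not come from the presentation.
frame = build_gratuitous_arp("fa:16:3e:00:00:01", "203.0.113.10")
print(len(frame), frame.hex())
```

One such announcement is needed per IP, which is consistent with recovery time growing with the number of IPs on each migrated router.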
SLIDE 19

OK, so what’s the catch?

SLIDE 20
  • Not seamless
  • The ARP processes happen in parallel, but generally take 60-90 seconds for all IPs to complete
  • Various *aaS offerings further complicate things
  • Currently only accounts for “l3-agent” controlled services
  • No coordination between HA tools
  • How do you HA the HA?
  • Currently not daemonized, runs from cron
  • Adds 60 seconds to total recovery time
  • Jitter protection adds further to total recovery time
  • No mechanism by which to ensure resources actually come up/work
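
The timing costs listed above can be added up in a rough back-of-the-envelope model. The 60-second cron interval and the 60-90 second ARP window come from the slides; the way jitter protection is modelled here (waiting for a number of extra consecutive cron runs before acting) is an assumption for illustration, as is the function name. The time for the migration calls themselves is ignored, since the slides describe migration as quick.

```python
def worst_case_recovery_seconds(cron_interval, jitter_checks, arp_seconds):
    """Upper bound on time from agent failure until all IPs are reachable.

    cron_interval: seconds between runs of the tool.
    jitter_checks: extra consecutive runs that must see the agent down
                   before acting (0 means act on the first detection).
    arp_seconds:   time for every IP to finish re-ARPing after migration.
    """
    detection = cron_interval * (1 + jitter_checks)
    return detection + arp_seconds


# Act on first detection, fast ARP convergence.
no_jitter = worst_case_recovery_seconds(cron_interval=60, jitter_checks=0, arp_seconds=60)
# Require one confirming run, slow ARP convergence.
with_jitter = worst_case_recovery_seconds(cron_interval=60, jitter_checks=1, arp_seconds=90)
print(no_jitter, with_jitter)
```

Under these assumptions the outage window is on the order of two to three and a half minutes, which is why the approach is described as not seamless.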

SLIDE 21

What about DHCP?

SLIDE 22
  • Multiple DHCP agents may be run Active/Active
  • DHCP agents per subnet may be specified in your agent config file
  • Each agent requires an IP in the tenant’s subnet
  • DHCP discovery is broadcast
  • All agents have the same lease file
  • The first one to reply binds to the VM
  • Any DHCP agent may reply to a DNS request and resolve all known leases
  • By default, each DHCP agent hands out a list of every agent as available resolvers

  • HA tool has an option to replicate DHCP to all agents
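
Replicating DHCP to all agents amounts to finding every (agent, network) pair that is not yet bound and then binding it. The sketch below covers only the first half, as a pure function over plain data. The function name and input shapes are assumptions for the example and are not taken from the HA tool; against a real deployment, each missing pair could then be bound with a client call such as python-neutronclient's `add_network_to_dhcp_agent`.

```python
def missing_dhcp_bindings(dhcp_agent_ids, network_ids, networks_by_agent):
    """Return sorted (agent_id, network_id) pairs that are not yet bound."""
    missing = []
    for agent_id in sorted(dhcp_agent_ids):
        hosted = set(networks_by_agent.get(agent_id, []))
        for network_id in sorted(network_ids):
            if network_id not in hosted:
                missing.append((agent_id, network_id))
    return missing


# Hypothetical state: dhcp-a serves both networks, dhcp-b one, dhcp-c none.
networks_by_agent = {"dhcp-a": ["net1", "net2"], "dhcp-b": ["net1"]}
todo = missing_dhcp_bindings(
    ["dhcp-a", "dhcp-b", "dhcp-c"], ["net1", "net2"], networks_by_agent
)
for agent_id, network_id in todo:
    print(f"bind {network_id} to {agent_id}")
```

Keep in mind the cost noted above: every additional agent bound to a network consumes one more IP in the tenant's subnet.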
SLIDE 23

Moving Forward

  • VRRP-like functionality
  • Specify number of Active L3 agents per subnet
  • Leverage conntrackd/keepalived
  • Point of diminishing returns for HA tool?
  • The beauty of open source:
  • There is no “one right way”
  • Think outside the box
  • Do cool things

SLIDE 24

Questions?