Going brokerless: The transition from Qpid to 0mq (Paul Mathews)



SLIDE 1

Going brokerless: The transition from Qpid to 0mq

Paul Mathews, Systems Architect, EIG/Bluehost
OpenStack Summit, November 2013

SLIDE 2

RPC Messaging

  • One major limiting factor we encountered when scaling OpenStack
  • Many services depend upon a reliable messaging system

    – Compute
    – Neutron
    – Conductor
    – Ceilometer
    – VNC
    – Cells

SLIDE 3

Why Qpid?

  • Used by Red Hat
  • Clustering

    – Offered the “possibility” of horizontal scaling
    – Removed in 0.18

SLIDE 4

Qpid experience

  • Single instance was not reliable

    – Unable to scale
    – Single point of failure

  • Compute connections to the broker are lost

    – Restart of the compute service required

  • Problematic due to missed messages
SLIDE 5

RabbitMQ?

  • Similar design to Qpid
  • The broker model has the same drawbacks
  • Problematic experiences reported by other users
SLIDE 6

Possible solutions for scaling broker

  • Cells
  • Clustering/HA
SLIDE 7

Cells

  • Cells do not address performance issues
  • Cells lower the load: n individual brokers
  • Cells magnify problems when they occur by chaining services

[Diagram: a tree of cells, with an AMQP broker at the API cell, at each child cell, and at each grandchild cell]

SLIDE 8

Clustering/HA

  • Qpid

    – Clustering is slow and unreliable
    – Node sync causes problems
    – New HA module is active/passive

  • RabbitMQ

    – Does have an active/active (HA) mode
    – Complicated setup, many moving pieces

  • Scaling a broker is not practical

    – At best, minimal gains with the addition of nodes
    – Loss of nodes causes further issues

SLIDE 9

No more brokers!!!

  • Brokers are a single point of failure
  • Not horizontally scalable
  • Reduced reliability with the addition of nodes
  • HA provides minimal benefit, and adds complexity

SLIDE 10

Requirements for messaging

  • No single point of failure
  • Horizontally scalable
  • Reliable at scale
SLIDE 11
  • No more centralized broker

    – Receiver on each node
    – Routing handled by matchmaker

  {
    "scheduler": [ "sched1", "sched2" ],
    "consoleauth": [ "cauth1", "cauth2" ]
  }
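The ring file above maps each RPC topic to the hosts that can service it. A toy sketch of how a ring-style matchmaker resolves topics (the class and method names here are illustrative, not the actual MatchMakerRing code):

```python
import itertools
import json

class RingMatchmaker:
    """Toy ring-file matchmaker: maps an RPC topic to candidate hosts."""

    def __init__(self, ring):
        # ring: {"topic": ["host1", "host2", ...]}, as in the JSON above
        self._ring = ring
        # one round-robin cursor per topic, so directed casts spread load
        self._cursors = {t: itertools.cycle(h) for t, h in ring.items()}

    def queues(self, topic):
        """All hosts for a topic (what a fanout cast would target)."""
        return list(self._ring.get(topic, []))

    def pick_host(self, topic):
        """One host for a directed cast, chosen round-robin."""
        return next(self._cursors[topic])

ring = json.loads('{"scheduler": ["sched1", "sched2"], '
                  '"consoleauth": ["cauth1", "cauth2"]}')
mm = RingMatchmaker(ring)
print(mm.queues("scheduler"))     # ['sched1', 'sched2']
print(mm.pick_host("scheduler"))  # sched1
print(mm.pick_host("scheduler"))  # sched2
```

Because routing is just a local dictionary lookup, there is no broker to consult and no single point of failure on the send path.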

SLIDE 12

Messaging topologies

Brokers are limited to a star topology; ZeroMQ is a partially-connected mesh.

[Diagrams: a star topology with a central broker connected to compute, conductor, API, scheduler, ceilometer, and console auth services, versus a partially-connected mesh linking those same services directly]

SLIDE 13

Flexibility

  • Brokers have a rigidly defined structure

    – Queues, exchanges, subscriptions, fanouts

  • ZeroMQ has four simple methods

    – Connect, bind, send, recv

  • ZeroMQ lets us define our own messaging models
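Those four calls are essentially the whole API surface. A minimal sketch with pyzmq; the PUSH/PULL socket pair and the in-process endpoint are illustrative choices for the demo, not what nova's driver uses:

```python
import zmq

ctx = zmq.Context()

# bind: the receiving side claims an endpoint
pull = ctx.socket(zmq.PULL)
pull.bind("inproc://rpc-demo")

# connect: the sending side attaches to that endpoint
push = ctx.socket(zmq.PUSH)
push.connect("inproc://rpc-demo")

# send / recv: the message travels point-to-point, no broker in between
push.send(b"run_instance")
print(pull.recv())  # b'run_instance'

push.close()
pull.close()
ctx.term()
```

Casts, calls, and fanouts are then just patterns composed from these same primitives, which is what makes it practical to define our own messaging models.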

SLIDE 14

Lightweight messaging

  • ZeroMQ uses simple socket connections

    – Low resource utilization

  • FAST
SLIDE 15

RPC Cast Performance

[Bar chart, measured on a single-core VM: casts per second for ZeroMQ, Qpid, and RabbitMQ, on an axis from 200 to 2,000. More is better.]

SLIDE 16

ZeroMQ Configuration

  • Edit nova.conf to use zmq
  • Configure matchmaker file
  • Start zmq-receiver

/etc/nova/matchmaker.json:

  {
    "scheduler": [ "sched1", "sched2" ],
    "consoleauth": [ "cauth1", "cauth2" ]
  }

nova.conf:

  rpc_backend = nova.openstack.common.rpc.impl_zmq
  rpc_zmq_matchmaker = nova.openstack.common.rpc.matchmaker.MatchMakerRing
  matchmaker_ringfile = /etc/nova/matchmaker.json
  rpc_zmq_ipc_dir = /var/run/zeromq

SLIDE 17

RPC Migration

  • You can't get there from here!

    – No easy way to move between messaging systems
    – No logical divisions

  • Only one backend allowed

    – All-or-nothing switch

SLIDE 18

We need a new solution

  • Moving between messaging systems is painful

    – Prior strategy will not work

  • Tens of thousands of nodes to migrate
  • Need to migrate with little or no downtime
  • Rollout must allow deployment to individual servers

SLIDE 19

Dual messaging backends

  • Nodes use both Qpid and ZeroMQ messaging backends concurrently
  • Code can be rolled out without affecting behavior, and enabled later

    – Change config, and start the ZeroMQ receiver

  • Once dual backends are enabled, ZeroMQ is attempted first, then fails over to Qpid
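The failover behavior can be sketched as follows. The function names, the TCP liveness probe, and the default port are assumptions for illustration only; the actual dual-backend patch is in the repository linked on the last slide.

```python
import socket

def zmq_receiver_listening(host, port, timeout=0.5):
    """Probe whether a ZeroMQ receiver is accepting TCP connections."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def cast(msg, host, zmq_send, qpid_send, port=9501):
    """Try ZeroMQ first; fall back to the Qpid broker on any failure.

    zmq_send / qpid_send stand in for the two real backend drivers.
    Returns which backend actually carried the message.
    """
    if zmq_receiver_listening(host, port):
        try:
            zmq_send(host, msg)
            return "zmq"
        except Exception:
            pass  # receiver vanished mid-send; use the broker path
    qpid_send(host, msg)
    return "qpid"
```

This is why the rollout order on the next slides is safe: a node whose receiver is not yet running simply keeps getting its messages through Qpid.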

SLIDE 20

Dual message backends

[Flowchart: the controller nodes have an outgoing message for Compute1. "ZMQ receiver listening?" NO, so the request is delivered through Qpid.]

  • 1. Deploy config to controller nodes
SLIDE 21

Dual message backends

[Flowchart: the controller nodes, now running a ZMQ receiver, have an outgoing message for Compute2. "ZMQ receiver listening?" YES, so the request is delivered directly to Compute2's ZMQ receiver.]

  • 1. Deploy config to controller nodes
  • 2. Deploy config to compute nodes
SLIDE 22

Dual message backends

[Flowchart: ZMQ receivers are now deployed alongside Qpid; an outgoing controller-node call to Compute2 is still delivered via Qpid until its receiver answers.]

  • 1. Deploy config to controller nodes
  • 2. Deploy config to compute nodes

SLIDE 23

Configuring dual backends

  • Change rpc_backend to impl_zmq, but retain the qpid_hostname setting
  • Nodes will leverage the qpid_hostname value and connect to Qpid, but will attempt delivery via ZeroMQ first
  • Once switched, nodes will accept incoming messages from either Qpid or ZeroMQ

nova.conf:

  qpid_hostname = qpid1
  rpc_backend = nova.openstack.common.rpc.impl_zmq
  rpc_zmq_matchmaker = nova.openstack.common.rpc.matchmaker.MatchMakerRing
  matchmaker_ringfile = /etc/nova/matchmaker.json
  rpc_zmq_ipc_dir = /var/run/zeromq

SLIDE 24

Migrating to ZeroMQ

  • Dual backend code meant minimal downtime
  • Migration was smooth, without unexpected losses in messaging
  • Connection checks to the ZeroMQ receiver do not seem to cause undue stress to nodes

SLIDE 25

ZeroMQ in production

  • More reliable than a broker
  • Faster than a broker
  • Solves scalability issues
SLIDE 26

Lingering issues

  • Occasionally nova-compute stops processing queued messages

SLIDE 27

Dual backend code

  • https://github.com/paulmathews/nova
SLIDE 28

Questions?