Nonconformist Resilience: Database-backed Job Queues John Mileham - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Nonconformist Resilience:

Database-backed Job Queues

John Mileham | @jmileham

slide-2
SLIDE 2
slide-3
SLIDE 3
slide-4
SLIDE 4
slide-5
SLIDE 5
slide-6
SLIDE 6
slide-7
SLIDE 7
slide-8
SLIDE 8
slide-9
SLIDE 9
slide-10
SLIDE 10
slide-11
SLIDE 11
slide-12
SLIDE 12
slide-13
SLIDE 13
slide-14
SLIDE 14
slide-15
SLIDE 15
slide-16
SLIDE 16

User Signup

with Email Confirmation

slide-17
SLIDE 17

User Signup

with Email Confirmation

A feature so easy we’re still fighting about how to do it in 2017

slide-18
SLIDE 18

Requirements:

  • Validate the user’s profile information
  • Store the user record to the database
  • Email a link
  • When the link is clicked, mark the user as verified

slide-19
SLIDE 19

slide-20
SLIDE 20

Take 1:

Inline the email delivery

slide-21
SLIDE 21

Take 1:

Inline the email delivery … but it’s slow

slide-22
SLIDE 22

Take 2:

Spin off a thread or use a thread pool

slide-23
SLIDE 23

Take 2:

Spin off a thread or use a thread pool … but it’s unreliable

slide-24
SLIDE 24

Take 3:

Use a grown-up message bus

slide-25
SLIDE 25

Take 3:

Use a grown-up message bus … but it’s unreliable?

slide-26
SLIDE 26

Commit-then-Enqueue

slide-27
SLIDE 27

[Timeline diagram. App Timeline: Commit to DB, Enqueue to bus, Deliver to ESP. Customer Timeline: Request, Response, Email.]

slide-28
SLIDE 28

slide-29
SLIDE 29

Enqueue-then-Commit

slide-30
SLIDE 30

[Timeline diagram. App Timeline: Enqueue to bus, Commit to DB, Deliver to ESP. Customer Timeline: Request, Response, Email.]

slide-31
SLIDE 31

[Timeline diagram. App Timeline: Enqueue to bus, Commit to DB, Deliver to ESP. Customer Timeline: Request, Email, Response.]

slide-32
SLIDE 32

[Timeline diagram with two App Timelines and a Customer Timeline. Labels: Enqueue to bus, Commit to DB, Deliver to ESP, Build Email, Request, Email, Response.]
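One reading of the two orderings above is that each leaves a failure window when the database and the bus are written separately. The sketch below simulates both with plain in-memory arrays; `commit_then_enqueue`, `enqueue_then_commit`, and the `Crash` error are hypothetical names used only for illustration, not part of any real bus or database client.

```ruby
# Toy stand-ins: `db` and `bus` are plain arrays, not real clients.
class Crash < StandardError; end

# Ordering 1: write the user first, then publish the email job.
def commit_then_enqueue(db, bus, user, crash_between: false)
  db << user
  raise Crash, "process died before enqueue" if crash_between
  bus << { job: :confirmation_email, user: user }
end

# Ordering 2: publish the email job first, then write the user.
def enqueue_then_commit(db, bus, user, commit_fails: false)
  bus << { job: :confirmation_email, user: user }
  raise Crash, "commit failed after enqueue" if commit_fails
  db << user
end

db, bus = [], []
begin
  commit_then_enqueue(db, bus, "ada@example.com", crash_between: true)
rescue Crash
end
# The user exists, but no email will ever be sent.
lost_email = db.include?("ada@example.com") && bus.empty?

db2, bus2 = [], []
begin
  enqueue_then_commit(db2, bus2, "bob@example.com", commit_fails: true)
rescue Crash
end
# A job is waiting on the bus for a user that was never stored.
orphan_job = bus2.any? && !db2.include?("bob@example.com")

puts "commit-then-enqueue: user stored but no email job? #{lost_email}"
puts "enqueue-then-commit: email job for a missing user? #{orphan_job}"
```

Neither ordering removes the window; it only changes which side is left inconsistent.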

slide-33
SLIDE 33

Distributed Transactions

You could make the enqueue and the database commit atomic via a distributed transaction manager, but:

  • Mature, robust distributed transaction managers aren’t available for all platforms
  • They’re usually proprietary
  • Even where they exist, these tools have nuanced configuration, can have operational warts, and are another subsystem that requires care and feeding
  • The additional network round trips necessary to coordinate the commit between the datastores can cause write performance problems

slide-34
SLIDE 34

Take 4:

Use the database as a queue

slide-35
SLIDE 35

Take 4:

Use the database as a queue … but it won’t scale

slide-36
SLIDE 36

[Timeline diagram. App Timeline: Commit & Enqueue, Deliver to ESP. Customer Timeline: Request, Response, Email.]
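A minimal sketch of what "Commit & Enqueue" as a single step could look like: the user row and the job row are written inside one transaction, so either both exist or neither does. `TinyDB` is a hypothetical in-memory stand-in with snapshot-based rollback, not a real database API; in a real app the same role would be played by the database's own transaction.

```ruby
# Hypothetical in-memory "database" with all-or-nothing transactions.
class TinyDB
  attr_reader :users, :jobs

  def initialize
    @users = []
    @jobs = []
  end

  # Run the block; if it raises, restore both tables and re-raise.
  def transaction
    snapshot = [@users.dup, @jobs.dup]
    yield self
  rescue StandardError
    @users, @jobs = snapshot
    raise
  end
end

# Store the user and enqueue the confirmation email in one transaction.
def sign_up(db, email, fail_before_commit: false)
  db.transaction do |tx|
    tx.users << { email: email, verified: false }
    tx.jobs << { type: :confirmation_email, email: email }
    raise "something failed before commit" if fail_before_commit
  end
end

db = TinyDB.new
sign_up(db, "ada@example.com")
begin
  sign_up(db, "bob@example.com", fail_before_commit: true)
rescue RuntimeError
  # The failed signup leaves neither a user row nor a job row behind.
end

puts db.users.map { |u| u[:email] }.inspect
puts db.jobs.map { |j| j[:email] }.inspect
```

Because the job lives in the same datastore as the user record, there is no second system to keep in sync at commit time.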

slide-37
SLIDE 37

slide-38
SLIDE 38

slide-39
SLIDE 39
slide-40
SLIDE 40

Robust By Default

slide-41
SLIDE 41

Addressing the pitfalls

Because everything is a tradeoff

slide-42
SLIDE 42
slide-43
SLIDE 43

DJ: Retry with Exponential Backoff

Two key columns: run_at and attempts.

  • Jobs are picked up oldest first
  • Only jobs with a run_at in the past are workable
  • When a job fails, a new future run_at is calculated from the previous run_at and number of previous attempts
  • After too many failures (days later), a job will stop being attempted
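The rules above can be sketched with a small in-memory model. This is an illustration, not Delayed::Job's code: `RetryJob`, `next_workable`, and `record_failure` are hypothetical names, times are plain integer seconds, the cap of 25 attempts is an assumed value, and measuring the delay from the moment of failure (`now`) is an assumption of this sketch. The `attempts**4 + 5` delay follows the backoff function quoted later in the deck.

```ruby
RetryJob = Struct.new(:name, :run_at, :attempts)

MAX_ATTEMPTS = 25 # assumed cap; after this the job is no longer attempted

# Only jobs whose run_at is in the past are workable; pick the oldest.
def next_workable(jobs, now)
  jobs.select { |job| job.run_at <= now }.min_by(&:run_at)
end

# On failure, count the attempt and push run_at into the future.
def record_failure(job, now)
  job.attempts += 1
  return :abandoned if job.attempts >= MAX_ATTEMPTS

  job.run_at = now + job.attempts**4 + 5
  :rescheduled
end

jobs = [
  RetryJob.new("welcome_email", 100, 0),
  RetryJob.new("confirmation_email", 50, 0),
  RetryJob.new("monthly_statement", 500, 0)
]
now = 200

picked = next_workable(jobs, now)
outcome = record_failure(picked, now)
puts "#{picked.name}: #{outcome}, next run_at=#{picked.run_at}"
puts "next up: #{next_workable(jobs, now).name}"
```

A failed job simply moves out of the workable window until its new run_at passes, so it never blocks the jobs behind it.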
slide-44
SLIDE 44

Message Bus Solution: DLQs

Messages don’t have a desired delivery time in a message bus, so exponential backoff isn’t feasible. Message delivery will be attempted a preconfigured number of times, and then transferred to a dead-letter queue, or a cascading set of queues to approximate exponential backoff.

slide-45
SLIDE 45

DJ: Priority

Delayed::Job works off the highest-priority jobs first. Pickup is simply a matter of sorting on priority and then run_at. We use priority to establish different service level objectives for different kinds of work. This lets developers lean on DJ instead of worrying about resourcing each kind of job, and lets DJ fully utilize its worker capacity.
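As a rough illustration of "sorting on priority and then run_at", the sketch below orders workable jobs in memory. `PriorityJob` and `pickup_order` are hypothetical names; the numeric values given to `INTERACTIVE` and `EVENTUAL` are assumptions, as is the convention that a lower number means a higher priority.

```ruby
PriorityJob = Struct.new(:name, :priority, :run_at)

# Assumed numeric values; lower number = worked off first.
INTERACTIVE = 0
EVENTUAL = 10

# Workable jobs (run_at in the past), ordered by priority, then run_at.
def pickup_order(jobs, now)
  jobs.select { |job| job.run_at <= now }
      .sort_by { |job| [job.priority, job.run_at] }
end

jobs = [
  PriorityJob.new("statement", EVENTUAL, 10),
  PriorityJob.new("signup_email", INTERACTIVE, 30),
  PriorityJob.new("password_reset", INTERACTIVE, 20),
  PriorityJob.new("scheduled_later", INTERACTIVE, 999)
]

order = pickup_order(jobs, 100).map(&:name)
puts order.inspect
```

Every worker can take any job, so idle capacity is always spent on the most urgent work available rather than being reserved per job type.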

slide-46
SLIDE 46

Message Bus Solution: Topics

Message buses can’t as easily support priority. To assure resource availability for important work, work is shunted to a specific topic or queue with its own resource pool.

Strong assurance that one job type won’t exhaust resources of another type. But you must resource each topic individually.

slide-47
SLIDE 47

DJ’s got topics too ;)

Even though it’s not the only way to organize work, if you have a mission critical work stream that must be processed no matter what, you can use a specialized queue to keep its workers separate. Opt in for as much control as you need, only when you need it.

slide-48
SLIDE 48
slide-49
SLIDE 49

More Featureful, Not Less

slide-50
SLIDE 50
slide-51
SLIDE 51

Betterment’s Schema

[Schema diagram: Users, Deposits, Bank Accounts, Goals, Investing Accounts, Auto Deposits, Statements]

slide-52
SLIDE 52

slide-53
SLIDE 53

Power Law Distribution

slide-54
SLIDE 54
slide-55
SLIDE 55

The Message Bus Isn’t a Silver Bullet

slide-56
SLIDE 56
slide-57
SLIDE 57

Coordinated Polling

  • Your application chooses a global polling interval, say a half second.
  • Every active worker process inserts itself into an active_workers table with a last_active_at timestamp and maintains it every 30 seconds or so.
  • Every few seconds, each worker queries the number of recently active workers.
  • It then multiplies the global polling interval by the number of workers and adds random jitter to prevent thundering herds.
  • Your app converges on the desired polling interval at arbitrary worker scale.
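The arithmetic behind the steps above can be sketched as follows. The `active_workers` table is modelled as a hash of worker id to last_active_at (in epoch seconds); the 60-second "recently active" window, the 10% jitter fraction, and the function names are assumptions made for this illustration.

```ruby
GLOBAL_POLL_INTERVAL = 0.5 # seconds; the half-second example from the slide
ACTIVE_WINDOW = 60         # seconds; assumed definition of "recently active"

# Count workers whose last_active_at falls inside the window.
def recently_active_count(active_workers, now)
  active_workers.count { |_id, last_active_at| now - last_active_at <= ACTIVE_WINDOW }
end

# Each worker sleeps (global interval * worker count), plus or minus jitter,
# so the fleet as a whole polls roughly once per global interval.
def sleep_interval(active_workers, now, rng: Random.new, jitter_fraction: 0.1)
  workers = [recently_active_count(active_workers, now), 1].max
  base = GLOBAL_POLL_INTERVAL * workers
  jitter = base * jitter_fraction * (rng.rand * 2 - 1)
  base + jitter
end

active_workers = { "w1" => 1000, "w2" => 990, "w3" => 700 } # w3 is stale
now = 1000

count = recently_active_count(active_workers, now)
interval = sleep_interval(active_workers, now)
puts "recently active workers: #{count}"
puts "this worker sleeps about #{interval.round(3)} s between polls"
```

With two live workers each polling about once per second, the database sees roughly one poll per half second, and the same holds as workers are added or removed.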
slide-58
SLIDE 58
slide-59
SLIDE 59
slide-60
SLIDE 60

When is a DB-backed queue the right tool?

slide-61
SLIDE 61
1. Should your app use a DB at all?

You should be using an ACID SQL DB if:

  • You have a read-heavy usage pattern
  • You value agility in supporting new use cases
  • You aren’t launching directly into #webscale
  • Or even if you are, your app doesn’t exist primarily to solve a graph problem

○ If you’re going big and still want to use SQL, your dataset must be inherently shardable

slide-62
SLIDE 62
2. Are your clients human?

If clients are interacting with your app like humans, i.e.:

  • They do individual operations at a reasonable pace
  • They don’t generate batches of 10,000 operations at once

Then you’re still looking good.

slide-63
SLIDE 63
3. Are Your Bulk Operations Cool?

  • Are there relatively few of them?
  • Are they customer experience-impacting?
  • Are they no more than daily?
slide-64
SLIDE 64

All Yes? All Set.

slide-65
SLIDE 65

Operating a DB-backed Queue

slide-66
SLIDE 66

Alerting Needs

Two key alerts:

  1. Max attempt count
  2. Max age

Both metrics are partitioned by job priority.

slide-67
SLIDE 67

Max Attempt Count

Total backoff time function (in seconds): backoff(n) = n == 0 ? 0 : n ** 4 + 5 + backoff(n - 1)

  • First retry in 6 seconds
  • Third retry in 2 minutes
  • Fifth retry in 16 minutes
  • Tenth retry in 7 hours
  • Twentieth retry in 8 days

Our thresholds:

  • INTERACTIVE errors after 2 attempts (~30 seconds)
  • EVENTUAL errors after 8 attempts (~2.5 hours)
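The figures above can be checked by evaluating the backoff function directly. The sketch below is a literal transcription of that expression into a Ruby method; only the printing loop is added for illustration.

```ruby
# Total seconds of backoff accumulated after n failed attempts.
def backoff(n)
  n == 0 ? 0 : n**4 + 5 + backoff(n - 1)
end

[1, 2, 3, 5, 8, 10, 20].each do |n|
  secs = backoff(n)
  puts format("after %2d attempts: %7d s = %8.1f min = %6.2f h = %5.2f d",
              n, secs, secs / 60.0, secs / 3600.0, secs / 86_400.0)
end
```

For example, backoff(2) is 27 seconds and backoff(8) is 8,812 seconds, which match the ~30 second and ~2.5 hour thresholds listed above.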
slide-68
SLIDE 68

Max Age

Age is defined as now() - run_at.
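A minimal sketch of how a max-age metric partitioned by priority might be computed. `QueuedJob`, the priority labels, and the threshold values are hypothetical; treating jobs with a future run_at as having no age yet is also an assumption of this sketch.

```ruby
QueuedJob = Struct.new(:priority, :run_at)

# Oldest workable job per priority, where age = now - run_at.
def max_age_by_priority(jobs, now)
  jobs.select { |job| job.run_at <= now }
      .group_by(&:priority)
      .transform_values { |group| group.map { |job| now - job.run_at }.max }
end

# Priorities whose oldest job exceeds the given threshold (in seconds).
def breached(ages, thresholds)
  ages.select { |priority, age| thresholds.key?(priority) && age > thresholds[priority] }.keys
end

jobs = [
  QueuedJob.new(:interactive, 90),
  QueuedJob.new(:interactive, 40),
  QueuedJob.new(:eventual, 10),
  QueuedJob.new(:eventual, 500) # scheduled in the future, so not counted
]
now = 100

ages = max_age_by_priority(jobs, now)
alerts = breached(ages, { interactive: 30, eventual: 9000 }) # assumed thresholds
puts ages.inspect
puts "alerting on: #{alerts.inspect}"
```

Partitioning by priority keeps a slow backlog of low-priority work from masking, or falsely triggering, an alert on interactive work.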

slide-69
SLIDE 69

This is your brain on DJ
slide-70
SLIDE 70

(your message may vary)

slide-71
SLIDE 71
slide-72
SLIDE 72
slide-73
SLIDE 73

Why not just use DJ?

slide-74
SLIDE 74

slide-75
SLIDE 75
slide-76
SLIDE 76

Date: June 27th, 2017
Author: John Mileham | @jmileham (is hiring)