getting comfortable in prod to improve your life in dev @cyen



SLIDE 1
SLIDE 2

@cyen @honeycombio

getting comfortable in prod

to improve your life in dev

SLIDE 3

first, some background…

SLIDE 4

DEV

Christine

SLIDE 5

DEV

WRITE → TEST → COMMIT → WRITE → TEST → COMMIT → WRITE → TEST → COMMIT → WRITE → TEST → COMMIT → WRITE → TEST → COMMIT → WRITE → TEST → COMMIT → WRITE → TEST → COMMIT → WRITE → TEST → COMMIT

SLIDE 6

DEV OPS

WRITE → TEST → COMMIT → RELEASE 💦 → DEBUG → FIX

SLIDE 7

"Works on my machine"

DEV

"The only good diff is a red diff"

OPS

💦

SLIDE 8

—Subbu Allamaraju, Expedia, Feb 2019
 https://m.subbu.org/incidents-trends-from-the-trenches-e2f8497d52ed

"Observation 1: Change is the most common trigger"

SLIDE 9

[Diagram: THEN vs. NOW. Labels shown: APP, API GATEWAY, USER MGMT, BILLING, WEB UI, PARTNER MGMT, PAYMENTS, INTERNAL WEB UI, TXN MGMT, NOTIFICATION SYSTEM, REST API (six times)]

SLIDE 10

"Works on my machine"

DEV

"The only good diff is a red diff"

OPS

SLIDE 11

THE FIRST WAVE: getting ops folks to code

THE SECOND WAVE: teaching devs to own code in production

OPS DEV

SLIDE 12

it’s all about sharing SOFTWARE OWNERSHIP

OPS DEV

observability
SLIDE 13
observability

a.k.a. understanding the behavior of a system based on knowledge of its external outputs.

a.k.a. "what is my software doing, and why is it behaving that way?"

SLIDE 14

monitoring

observability

The system as black box

  • magic. Thresholds, alerts,

system signals like CPU and memory. Checking and rechecking for known bad behaviors.

The system as a living, adaptable thing. A culture of instrumentation and metadata rather than strictly-defined counters. Being able to tease out previously-unknown bad behaviors and outliers.

SLIDE 15

DEV OPS

WRITE → TEST → COMMIT → RELEASE 💦 → DEBUG → FIX

SLIDE 16

WRITE → TEST → COMMIT → RELEASE → OBSERVE

DEV OPS

TEST OBSERVE

SLIDE 17

DEV OPS

MAKE HAUNTED GRAVEYARDS LESS SCARY

SLIDE 18

… why devs, again?

SLIDE 19

The Software Process

▸ Design documents
▸ Architecture review
▸ Test-driven development
▸ Integration tests
▸ Code review
▸ Continuous integration
▸ Continuous deployment
▸ 🎊🥃🍿🎋
▸ Observe our code in production

DEV

TEST

SLIDE 20
--- FAIL: TestUnitTest (0.00s)

talk_test.go:10: — expected: 4 (type int) actual: 5 (type int)

ACTUAL EXPECTED

SLIDE 21

"Works on my machine"

DEV

"The only good diff is a red diff"

OPS

💦

SLIDE 22

DEV PROD

still


observability
SLIDE 23

prod, part of the dev process?

SLIDE 24

DEV

when deciding…

WHAT to build
HOW TO build it
WHETHER it works ("test in prod")

The Software Process

▸ Design documents
▸ Architecture review
▸ Test-driven development
▸ Integration tests
▸ Code review
▸ Continuous integration
▸ Continuous deployment
▸ 🎊🥃🍿🎋
▸ (Wait for exception tracker to complain)

SLIDE 25

WHAT

▸ Locally: log lines, printfs, debuggers attached to our IDEs
▸ What’s causing our code to deviate from expectations?
▸ Stop "pulling straws": quantify pain, and start prioritizing.

when deciding …

SLIDE 26

HOW TO

▸ Know what "normal" really is
▸ Events (instrumentation) can be like DEBUG statements in prod
▸ What and how we build should be informed by reality

when deciding …

SLIDE 27

▸ Complex systems have an infinitely long list of black swan failure scenarios
▸ "Test in Production" to experiment and check hypotheses

▸ Feature flags + observability = 💜
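
One way to read "feature flags + observability" is: gate the new code path behind a flag, and record which path ran on the event for that request. The Go sketch below assumes a hypothetical in-memory `Flags` map and a made-up `new_renderer` flag; a real flag service and instrumentation library would look different.

```go
package main

import "fmt"

// Flags is a hypothetical in-memory flag store, used here only for illustration.
type Flags map[string]bool

// renderReport picks a code path from the flag and records which path ran
// in the event fields, so the two paths can be compared in production.
func renderReport(flags Flags, fields map[string]any) string {
	useNew := flags["new_renderer"]
	fields["flag.new_renderer"] = useNew
	if useNew {
		fields["renderer"] = "v2"
		return "report (v2)"
	}
	fields["renderer"] = "v1"
	return "report (v1)"
}

func main() {
	for _, on := range []bool{false, true} {
		fields := map[string]any{}
		out := renderReport(Flags{"new_renderer": on}, fields)
		fmt.Println(out, fields)
	}
}
```

With the flag value attached to every event, checking a hypothesis in production becomes a matter of comparing the flagged and unflagged groups.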

WHETHER

when deciding …

SLIDE 28

but this is hard.

SLIDE 29

make prod feel more like dev

SLIDE 30

TOOLS SHOULD SPEAK MY LANGUAGE

▸ As a dev, traditional monitoring tools don't tie back to the concepts I deal with in my code

CPU utilization AWS availability zone kafka partition Cassandra hostname payload size client OS build ID API endpoint time to render $YOUR_BIZ-relevant ID

SLIDE 31

TOOLS SHOULD SPEAK MY LANGUAGE

▸ As a dev, traditional monitoring tools don't tie back to the concepts I deal with in my code

[Figure: sample values for two fields]

AWS availability zone: us-east-1, us-west-2, us-west-1, eu-west-1, eu-central-1

customer ID: a87fcfcd, 98f1d93f, fb2ff7ca, 144afb2f, 2f67a581, 70efe4da, 7e7ea1d0, 394817e6, 1528afb3, 8bd3acf2, 4e4e1207

SLIDE 32
SLIDE 33
SLIDE 34

TOOLS SHOULD SPEAK MY LANGUAGE

▸ As a dev, traditional monitoring tools don't tie back to the concepts I deal with in my code

AND LET ME ITERATE

SLIDE 35

SHARE PATTERNS WHERE POSSIBLE

▸ Tracing helps production feel even more familiar: can map a trace directly to my code structure
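
To make the "trace maps to code structure" point concrete, here is a small Go sketch in which each function opens a child span, so the span tree ends up mirroring the call tree. The `Span` type and the function names (`processTask`, `fetchRows`, `sendEmail`) are hypothetical; real tracing libraries also record timing and IDs, which are left out here.

```go
package main

import (
	"fmt"
	"strings"
)

// Span is a hypothetical, minimal trace span: a name plus child spans.
type Span struct {
	Name     string
	Children []*Span
}

// Child starts a nested span, mirroring a nested function call.
func (s *Span) Child(name string) *Span {
	c := &Span{Name: name}
	s.Children = append(s.Children, c)
	return c
}

// Render returns the span tree as text, indented two spaces per level.
func (s *Span) Render(depth int) string {
	var b strings.Builder
	b.WriteString(strings.Repeat("  ", depth) + s.Name + "\n")
	for _, c := range s.Children {
		b.WriteString(c.Render(depth + 1))
	}
	return b.String()
}

func fetchRows(parent *Span) { parent.Child("fetchRows") }

func sendEmail(parent *Span) { parent.Child("sendEmail") }

// processTask opens its own span and passes it down to the functions it calls.
func processTask(parent *Span) {
	span := parent.Child("processTask")
	fetchRows(span)
	sendEmail(span)
}

func main() {
	root := &Span{Name: "handleRequest"}
	processTask(root)
	fmt.Print(root.Render(0))
}
```

The rendered tree has the same shape as the code: `handleRequest` contains `processTask`, which contains `fetchRows` and `sendEmail`.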

SLIDE 36

PROD SHOULD FEEL LIKE DEVELOPMENT?

SLIDE 37
SLIDE 38

2019-01-25T01:30:23.743Z Enqueued task
2019-01-25T01:30:24.120Z Task processed, returning 42 entries
2019-01-25T01:30:24.212Z Task complete (email sent to foobar@example.com)
Timestamp=2019-01-25T01:30:29.953Z message=Task timed out after 6.01 seconds task_id=72
2019-01-25T01:30:29.953Z Task timed out after 6.01 seconds task_id=72 type=process
2019-01-25T01:30:23.743Z Enqueued task task_id=72 type=enqueue target=email target=email queue_dur_ms=200 timeout_dur_ms=6010

CHANGE CAN BE INCREMENTAL
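
The incremental step in these log lines, from free text to text plus `key=value` fields, could look roughly like this in Go. The `withFields` helper is a hypothetical name for illustration; the message and field values are taken from the log lines on the slide.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// withFields appends key=value pairs to an existing log message, so a
// free-text line can gain structure one field at a time.
func withFields(msg string, fields map[string]any) string {
	keys := make([]string, 0, len(fields))
	for k := range fields {
		keys = append(keys, k)
	}
	sort.Strings(keys) // stable output regardless of map iteration order
	var b strings.Builder
	b.WriteString(msg)
	for _, k := range keys {
		fmt.Fprintf(&b, " %s=%v", k, fields[k])
	}
	return b.String()
}

func main() {
	msg := "Task timed out after 6.01 seconds"
	// Step 0: the existing free-text line.
	fmt.Println(msg)
	// Step 1: add a single identifying field.
	fmt.Println(withFields(msg, map[string]any{"task_id": 72}))
	// Step 2: add more context as it proves useful.
	fmt.Println(withFields(msg, map[string]any{
		"task_id":        72,
		"type":           "process",
		"timeout_dur_ms": 6010,
	}))
}
```

Nothing about the original message has to change, which is what lets the migration happen one call site at a time.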

SLIDE 39

2019-01-25T01:30:29.953Z Task timed out after 6.01 seconds task=72
2019-01-25T01:30:23.743Z Enqueued task task=72
2019-01-25T01:30:24.212Z Task processed, returning 42 entries task=74
2019-01-25T01:30:26.014Z Task complete (email sent to foobar@example.com) task=74
2019-01-25T01:30:24.120Z Enqueued task task=74
2019-01-25T01:30:26.214Z Enqueued task task=77
2019-01-25T01:30:24.120Z Task errored: unknown constant ::Fixnum task=77
2019-01-25T01:30:32.762Z Enqueued task task=78
2019-01-25T01:30:34.243Z Task processed, returning 0 entries task=78
2019-01-25T01:30:34.243Z Task complete, (email sent to bazqux@example.com) task=78

CHANGE CAN BE INCREMENTAL

SLIDE 40

at the end of all of this…

SLIDE 41

OPS DEV

💦

SLIDE 42

💜 OPS

DEV

SLIDE 43

WRITE → TEST → COMMIT → RELEASE → OBSERVE

TEST OBSERVE

DEV OPS

SLIDE 44

DEVS: embrace observability, bring production closer to development.

OPS: share the great responsibility (and great power!)

SLIDE 45

thanks!

@cyen @honeycombio

CURIOUS? TRY play.honeycomb.io

ASK NEW QUESTIONS SHIP BETTER SOFTWARE

SLIDE 46