CONTINUOUS DELIVERY: THE DIRTY DETAILS Mike Brittain Etsy.com - - PowerPoint PPT Presentation

slide-1
SLIDE 1

CONTINUOUS DELIVERY: THE DIRTY DETAILS

Mike Brittain

Etsy.com @mikebrittain mike@etsy.com

slide-2
SLIDE 2

a.k.a. “Continuous Deployment”

slide-3
SLIDE 3
slide-4
SLIDE 4

www.etsy.com

slide-5
SLIDE 5
slide-6
SLIDE 6

August 2012
1.4 billion page views
USD $76 million in transactions
3.8 million items sold

http://www.etsy.com/blog/news/2012/etsy-statistics-august-2012-weather-report/
slide-7
SLIDE 7

~170 Committers, everyone deploys

credit: martin_heigan (flickr)

slide-8
SLIDE 8

[Chart: time axis running from the very end of 2009 to today; value axis marked 10, 20, 30, 40]

slide-9
SLIDE 9

Continuous delivery is a pattern language in growing use in software development to improve the process of software delivery. Techniques such as automated testing, continuous integration, and continuous deployment allow software to be developed to a high standard and easily packaged and deployed to test environments, resulting in the ability to rapidly, reliably and repeatedly push out enhancements and bug fixes to customers at low risk and with minimal manual overhead. The technique was one of the assumptions of extreme programming but at an enterprise level has developed into a discipline of its own, with job descriptions for roles such as "buildmaster" calling for CD skills as mandatory.

~wikipedia

slide-10
SLIDE 10

+ DevOps
+ Working on mainline, trunk, master
+ Feature flags
+ Branching in code

slide-11
SLIDE 11

An Apology

slide-12
SLIDE 12

An Apology

We build primarily in PHP. Please don’t run away!

slide-13
SLIDE 13

The Dirty Details of...

“Continuous Deployment in Practice at Etsy”

slide-14
SLIDE 14
slide-15
SLIDE 15

Then: 2009 (just before we started using CD)
Now: 2010-today

slide-16
SLIDE 16

Then: 6-14 hours. “Deployment Army”. Highly orchestrated and infrequent.
Now: 15 mins. 1 person. Rapid release cycle.

slide-17
SLIDE 17

Then: Special event and highly disruptive
Now: Commonplace and happens so often we cannot keep up

slide-18
SLIDE 18

Then: Blocked for 6-14 hours, plus minimum of 6 hours to redeploy
Now: Blocked for 15 minutes, next deploy will only take 15 minutes. Config flags: <5 mins

slide-19
SLIDE 19

Then: Release branch, database schemas, data transforms, packaging, rolling restarts, cache purging, scheduled downtime
Now: Mainline, minimal linking and building, rsync, site up

slide-20
SLIDE 20

Then: Slow, complex, special
Now: Fast, simple, common

slide-21
SLIDE 21

Deploying code is the very first thing engineers learn to do at Etsy.

slide-22
SLIDE 22

1st day: Add your photo to Etsy.com.

slide-23
SLIDE 23

1st day: Add your photo to Etsy.com.
2nd day: Complete tax, insurance, and benefits forms.

slide-24
SLIDE 24
slide-25
SLIDE 25

WARNING

slide-26
SLIDE 26
slide-27
SLIDE 27

Continuous Deployment

Small, frequent changes. Constantly integrating into production. 30 deploys per day.

slide-28
SLIDE 28

“Wow... 30 deploys a day. How do you build features so quickly?”

slide-29
SLIDE 29

Software Deploy ≠ Product Launch

slide-30
SLIDE 30

Deploys frequently gated by config flags

(“dark” releases)

slide-31
SLIDE 31

$cfg['new_search'] = array('enabled' => 'off');
$cfg['sign_in'] = array('enabled' => 'on');
$cfg['checkout'] = array('enabled' => 'on');
$cfg['homepage'] = array('enabled' => 'on');

slide-32
SLIDE 32

$cfg['new_search'] = array('enabled' => 'off');

slide-33
SLIDE 33

$cfg['new_search'] = array('enabled' => 'off');

// Meanwhile...
# old and boring search
$results = do_grep();

slide-34
SLIDE 34

$cfg['new_search'] = array('enabled' => 'off');

// Meanwhile...
if ($cfg['new_search']['enabled'] == 'on') {
    # New and fancy search
    $results = do_solr();
} else {
    # old and boring search
    $results = do_grep();
}

slide-35
SLIDE 35

$cfg['new_search'] = array('enabled' => 'on');
// or...
$cfg['new_search'] = array('enabled' => 'staff');
// or...
$cfg['new_search'] = array('enabled' => '1%');
// or...
$cfg['new_search'] = array('enabled' => 'users', 'user_list' => 'mike,john,kellan');
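
The flag states above (on, staff, a percentage, a named user list) imply some function that turns a flag plus a user into a yes/no answer. The slides do not show that function, so the following is only a minimal sketch of how it might look. The name feature_enabled(), the shape of the $user array ('id', 'name', 'is_staff') and the crc32-based bucketing are all assumptions made for illustration, not Etsy's implementation.

```php
<?php
// Hypothetical helper, not Etsy's actual implementation.
// Decides whether a config flag is enabled for one user.
function feature_enabled(array $cfg, string $name, array $user): bool
{
    $flag  = $cfg[$name] ?? array();
    $state = $flag['enabled'] ?? 'off';

    if ($state === 'on') {
        return true;
    }
    if ($state === 'staff') {
        return !empty($user['is_staff']);
    }
    if ($state === 'users') {
        // 'user_list' is a comma-separated string, as on the slide.
        $names   = array_map('trim', explode(',', $flag['user_list'] ?? ''));
        $allowed = array_filter($names, 'strlen'); // drop empty entries
        return isset($user['name']) && in_array($user['name'], $allowed, true);
    }
    if (preg_match('/^(\d+)%$/', $state, $m)) {
        // Hash flag name + user id into a stable bucket from 0 to 99, so a
        // user keeps the same answer on every request while the ramp grows.
        $bucket = (crc32($name . ':' . $user['id']) & 0x7fffffff) % 100;
        return $bucket < (int) $m[1];
    }
    return false; // 'off' or any unrecognized state fails closed
}

$cfg = array();
$cfg['new_search'] = array('enabled' => 'users', 'user_list' => 'mike,john,kellan');
$mike = array('id' => 1, 'name' => 'mike', 'is_staff' => true);
$anon = array('id' => 2, 'name' => 'visitor', 'is_staff' => false);

var_dump(feature_enabled($cfg, 'new_search', $mike)); // bool(true)
var_dump(feature_enabled($cfg, 'new_search', $anon)); // bool(false)
```

Hashing the flag name together with the user id is one way to keep percentage ramps sticky per user while avoiding the same 1% of users landing in every experiment. Other bucketing schemes would work equally well.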

slide-36
SLIDE 36

Validate in production, hidden from public.

slide-37
SLIDE 37

What’s in a deploy?

  • Small incremental changes to the application
  • New classes, methods, controllers
  • Graphics, stylesheets, templates
  • Copy/content changes
  • Turning flags on/off, or ramping up

slide-38
SLIDE 38

Quickly responding to issues

Security, bugs, traffic, load shedding, adding/removing infrastructure. Tweaking config flags or releasing patches.

slide-39
SLIDE 39 http://www.flickr.com/photos/flyforfun/2694158656/
slide-40
SLIDE 40 http://www.flickr.com/photos/flyforfun/2694158656/

Operator Config flags Metrics

slide-41
SLIDE 41
slide-42
SLIDE 42
slide-43
SLIDE 43

“How do you continuously deploy database schema changes?”

slide-44
SLIDE 44

Code deploys: ~ every 15-20 minutes
Schema changes: Thursday

slide-45
SLIDE 45

Our web application is largely monolithic.

slide-46
SLIDE 46

Etsy.com, support tools, developer API, back-office, analytics

slide-47
SLIDE 47

External “services” are not deployed with the main application.

slide-48
SLIDE 48

Databases, Search, Photo storage

slide-49
SLIDE 49

For every config flag, there are two states we can support — forward and backward.

slide-50
SLIDE 50

Expose multiple versions in each service. Expect multiple versions in the application.

slide-51
SLIDE 51

Example: Changing a Database Schema

slide-52
SLIDE 52

Prefer ADDs over ALTERs (“non-breaking expansions”)

slide-53
SLIDE 53

Altering in-place requires coupling code and schema changes.

slide-54
SLIDE 54

Merging “users” and “users_prefs”

slide-55
SLIDE 55
  • 1. Write to both versions
  • 2. Backfill historical data
  • 3. Read from new version
  • 4. Cut-off writes to old version
slide-56
SLIDE 56
  • 0. Add new version to schema
  • 1. Write to both versions
  • 2. Backfill historical data
  • 3. Read from new version
  • 4. Cut-off writes to old version
slide-57
SLIDE 57
  • 0. Add new version to schema

Schema change to add prefs columns to “users” table.

"write_prefs_to_user_prefs_table" => "on"
"write_prefs_to_users_table" => "off"
"read_prefs_from_users_table" => "off"

slide-58
SLIDE 58
  • 1. Write to both versions

Write code for writing prefs to the “users” table.

"write_prefs_to_user_prefs_table" => "on"
"write_prefs_to_users_table" => "on"
"read_prefs_from_users_table" => "off"

slide-59
SLIDE 59
  • 2. Backfill historical data

Offline process to sync existing data from “user_prefs” to new columns in “users”
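
The slide only says the backfill is an offline process, so the following is a rough sketch under stated assumptions. Both tables are modeled as plain PHP arrays keyed by user id, the function name backfill_prefs() and the 'prefs' key are hypothetical, and a real job would issue batched SQL instead. The one design point it tries to show is that the backfill skips rows the dual-write path from step 1 has already filled, so older data never overwrites newer data and the job can be re-run safely.

```php
<?php
// Hypothetical in-memory stand-ins for the two tables, keyed by user id.
// A real backfill would run batched SQL against the database instead.
function backfill_prefs(array $user_prefs, array &$users, int $batch_size = 1000): int
{
    $copied = 0;
    // preserve_keys = true keeps the user ids as array keys in each batch.
    foreach (array_chunk($user_prefs, $batch_size, true) as $batch) {
        foreach ($batch as $user_id => $prefs) {
            if (!isset($users[$user_id])) {
                continue; // orphaned prefs row: nothing to backfill into
            }
            // Skip rows the dual-write path has already populated, so the
            // backfill never overwrites newer data with an older copy.
            if (($users[$user_id]['prefs'] ?? null) !== null) {
                continue;
            }
            $users[$user_id]['prefs'] = $prefs;
            $copied++;
        }
        // A real job might pause here or check replication lag between batches.
    }
    return $copied;
}

$users = array(
    1 => array('name' => 'mike', 'prefs' => null),
    2 => array('name' => 'john', 'prefs' => array('lang' => 'fr')), // already dual-written
);
$user_prefs = array(
    1 => array('lang' => 'en'),
    2 => array('lang' => 'de'), // older copy, must not win
);

echo backfill_prefs($user_prefs, $users), " row(s) copied\n";
```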

slide-60
SLIDE 60
  • 3. Read from new version

Data validation tests. Ensure consistency both internally and in production.

"write_prefs_to_user_prefs_table" => "on"
"write_prefs_to_users_table" => "on"
"read_prefs_from_users_table" => "staff"

slide-61
SLIDE 61
  • 3. Read from new version

Data validation tests. Ensure consistency both internally and in production.

"write_prefs_to_user_prefs_table" => "on"
"write_prefs_to_users_table" => "on"
"read_prefs_from_users_table" => "1%"

slide-62
SLIDE 62
  • 3. Read from new version

Data validation tests. Ensure consistency both internally and in production.

"write_prefs_to_user_prefs_table" => "on"
"write_prefs_to_users_table" => "on"
"read_prefs_from_users_table" => "5%"

slide-63
SLIDE 63
  • 3. Read from new version

Data validation tests. Ensure consistency both internally and in production.

"write_prefs_to_user_prefs_table" => "on"
"write_prefs_to_users_table" => "on"
"read_prefs_from_users_table" => "on" ("on" == "100%")
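
The slides do not say what the data validation tests look like. As one possible shape, here is a small sketch that compares the prefs in the old “user_prefs” table with the new columns on “users” and reports the user ids that disagree. The function name find_prefs_mismatches(), the array-based tables and the 'prefs' key are assumptions for illustration only.

```php
<?php
// Hypothetical validation helper: both tables are plain arrays keyed by
// user id. Returns the ids whose prefs differ between old and new storage.
function find_prefs_mismatches(array $user_prefs, array $users): array
{
    $mismatched = array();
    foreach ($user_prefs as $user_id => $old_prefs) {
        $new_prefs = $users[$user_id]['prefs'] ?? null;
        if (!is_array($new_prefs)) {
            $mismatched[] = $user_id; // missing in the new location
            continue;
        }
        // Sort both copies by key so column order does not count as a diff.
        ksort($old_prefs);
        ksort($new_prefs);
        if ($old_prefs !== $new_prefs) {
            $mismatched[] = $user_id;
        }
    }
    return $mismatched;
}

$user_prefs = array(
    1 => array('lang' => 'en', 'currency' => 'USD'),
    2 => array('lang' => 'de', 'currency' => 'EUR'),
);
$users = array(
    1 => array('prefs' => array('currency' => 'USD', 'lang' => 'en')),
    2 => array('prefs' => array('currency' => 'USD', 'lang' => 'de')),
);

print_r(find_prefs_mismatches($user_prefs, $users)); // only user 2 differs
```

A check like this could run before each step of the read ramp (staff, 1%, 5%, on), with an empty result as the condition for moving on.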

slide-64
SLIDE 64
  • 4. Cut-off writes to old version

After running on the new table for a significant amount of time, we can cut off writes to the old table.

"write_prefs_to_user_prefs_table" => "off"
"write_prefs_to_users_table" => "on"
"read_prefs_from_users_table" => "on"

slide-65
SLIDE 65

“Branch by Abstraction”

[Diagram: two Controllers call the Users Model (the abstraction), which maps to the old schema (“users” (old) and “user_prefs”) or the new schema (“users”).]

http://paulhammant.com/blog/branch_by_abstraction.html
http://continuousdelivery.com/2011/05/make-large-scale-changes-incrementally-with-branch-by-abstraction/
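
To make the abstraction concrete, here is a minimal sketch of a Users model that hides both schemas behind the three flags used in the migration steps. The class name UsersModel, its methods, and the array-backed "tables" are hypothetical stand-ins, and only the plain on/off states are handled (the staff and percentage ramps are left out for brevity).

```php
<?php
// Hypothetical Users model acting as the abstraction layer. Controllers call
// save_prefs()/load_prefs() and never learn which schema is in use.
class UsersModel
{
    private $flags;
    public $user_prefs_table = array(); // old schema: separate "user_prefs" table
    public $users_table = array();      // new schema: prefs columns on "users"

    public function __construct(array $flags)
    {
        $this->flags = $flags;
    }

    private function is_on(string $flag): bool
    {
        return ($this->flags[$flag] ?? 'off') === 'on';
    }

    public function save_prefs(int $user_id, array $prefs): void
    {
        // During the migration both writes may be active at once.
        if ($this->is_on('write_prefs_to_user_prefs_table')) {
            $this->user_prefs_table[$user_id] = $prefs;
        }
        if ($this->is_on('write_prefs_to_users_table')) {
            $this->users_table[$user_id] = $prefs;
        }
    }

    public function load_prefs(int $user_id): ?array
    {
        $source = $this->is_on('read_prefs_from_users_table')
            ? $this->users_table
            : $this->user_prefs_table;
        return $source[$user_id] ?? null;
    }
}

// Step 1 of the migration: write to both, still read from the old table.
$model = new UsersModel(array(
    'write_prefs_to_user_prefs_table' => 'on',
    'write_prefs_to_users_table'      => 'on',
    'read_prefs_from_users_table'     => 'off',
));
$model->save_prefs(7, array('lang' => 'en'));
var_dump($model->load_prefs(7));
```

Because controllers only touch the model, each migration step becomes a config change rather than a coordinated code and schema release.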
slide-66
SLIDE 66
“The Migration 4-Step”

  • 1. Write to both versions
  • 2. Backfill historical data
  • 3. Read from new version
  • 4. Cut-off writes to old version

slide-67
SLIDE 67
“The Migration 4-Step”

  • 1. Write to both versions
  • 2. Backfill historical data
  • 3. Read from new version
  • 4. Cut-off writes to old version
  • 5. Clean up flags, code, columns (when?)

slide-68
SLIDE 68

Architecture and Process

slide-69
SLIDE 69

Deploying is cheap.

slide-70
SLIDE 70

Some philosophies on product development...

slide-71
SLIDE 71

Gathering data should be cheap, too.

staff, opt-in prototypes, 1%

slide-72
SLIDE 72

Treat first iterations as experiments.

slide-73
SLIDE 73

Get into code as quickly as possible.

slide-74
SLIDE 74

Architecture largely doesn’t matter.

slide-75
SLIDE 75

Kill things that don’t work.

slide-76
SLIDE 76

“Terminate with extreme prejudice.”

slide-77
SLIDE 77

Is the dumb solution enough to build a product? How long will the dumb solution last?

slide-78
SLIDE 78

Your assumptions will be wrong once you’ve scaled 10x.
slide-79
SLIDE 79

“We don’t optimize for being right. We optimize for quickly detecting when we’re wrong.”

~Kellan Elliott-McCrea, CTO

slide-80
SLIDE 80

Become really good at changing your architecture.

slide-81
SLIDE 81

Invest time in architecture by the 2nd or 3rd iteration.

slide-82
SLIDE 82

Integration and Operations

slide-83
SLIDE 83

Continuous Deployment

Small, frequent changes. Constantly integrating into production. 30 deploys per day.

slide-84
SLIDE 84

Code review before commit

slide-85
SLIDE 85

Automated tests before deploy

slide-86
SLIDE 86

Why Integrate with Production?

slide-87
SLIDE 87

Dev ≠ Prod

slide-88
SLIDE 88

Verify frequently and in small batches.

slide-89
SLIDE 89

Integrating with production is a test in itself.

We do this frequently and in small batches.

slide-90
SLIDE 90

"Production is truly the only place you can validate your code."

slide-91
SLIDE 91

"Production is truly the only place you can validate your code."

~ Michael Nygard, about 40 min ago

slide-92
SLIDE 92

  • More database servers in prod.
  • Bigger database hardware in prod.
  • More web servers.
  • Various replication schemes.
  • Different versions of server and OS software.
  • Schema changes applied at different times.
  • Physical hardware in prod.
  • More data in prod.
  • Legacy data (7 years of odd user states).
  • More traffic in prod.
  • Wait, I mean MUCH more traffic in prod.
  • Fewer elves.
  • Faster disks (SSDs) in prod.

slide-93
SLIDE 93

Using a MySQL database to test an application that will eventually be deployed on Oracle:

slide-94
SLIDE 94

Using a MySQL database to test an application that will eventually be deployed on Oracle: Priceless.

slide-95
SLIDE 95

Verify frequently and in small batches.

slide-96
SLIDE 96

Dev ≠ Prod

slide-97
SLIDE 97

Dev ⇾ QA ⇾ Staging ⇾ Prod

slide-98
SLIDE 98

Dev ⇾ QA ⇾ Staging ⇾ Prod

slide-99
SLIDE 99

Dev ⇾ Pre-Prod ⇾ Prod

slide-100
SLIDE 100

Test and integrate where you’ll see value.

slide-101
SLIDE 101

Config flags (again)

off, on, staff, opt-in prototypes, user list, 0-100%
slide-102
SLIDE 102

Config flags (again)

off, on, staff, opt-in prototypes, user list, 0-100%

“canary pools”

slide-103
SLIDE 103

Automated tests after deploy

slide-104
SLIDE 104

Real-time metrics and dashboards

Network & Servers, Application, Business

slide-105
SLIDE 105
slide-106
SLIDE 106

Release Managers: 0

slide-107
SLIDE 107
slide-108
SLIDE 108

Is it broken? Or, is it just better?

slide-109
SLIDE 109
slide-110
SLIDE 110

Metrics + Configs ⇾ OODA Loop

slide-111
SLIDE 111

“Theoretical” vs. “Practical”

slide-112
SLIDE 112

Surprise!!! Turning off multi-language support improves our page generation times by up to 25%.

Homepage (95th perc.)

slide-113
SLIDE 113
Nope. It’s really broken.
slide-114
SLIDE 114 http://www.flickr.com/photos/flyforfun/2694158656/

Operator Config flags Metrics

slide-115
SLIDE 115

Thursday, Nov 22 - Thanksgiving
Friday, Nov 23 - “Black Friday”
Monday, Nov 26 - “Cyber Monday”
~30 days out from Christmas

slide-116
SLIDE 116

[Chart: value axis marked 10, 20, 30, 40]

slide-117
SLIDE 117

Thank you.

Mike Brittain mike@etsy.com @mikebrittain