SLIDE 1

Deploying large payloads at scale

Ramon van Alteren

Wednesday, November 9, 2011

SLIDE 2

Hyves

  • 9.7M Dutch members (16.7M population)
  • ~7M unique visitors / month (Comscore 09/2011)
  • ~2.3M unique visitors / day
  • 800,000 photo uploads / day
  • 7M chat messages / day

SLIDE 3

Hyves - Operational environment

  • 3,500-node server park in 3 datacenters
  • 6Gbps daily outgoing traffic
  • System Engineering team: 12
  • Development team: 33

SLIDE 5

Weekend project

Result: 4.5x speed/throughput increase

SLIDE 6

Weekend -> Company project

SLIDE 11

A few minor problems

  • Compilation took ~40-60 minutes
  • Resulting binary was 750MB
  • Code issues
  • gcc 4.5.2 required, plus added dependencies

SLIDE 13

Part I - Solving the build problem

Jenkins to the rescue
Add distcc to speed up compilation times

SLIDE 15

Part I - Solving the build problem

Add some serious hardware
Compile / build in < 6 mins

SLIDE 20

Part II: Deploying sequentially?

500MB @ 1Gb/s = 4 seconds
500MB @ 500Mb/s = 8 seconds
500MB @ 200Mb/s = 20 seconds

450 servers * 8 seconds = 3600 seconds == 1 hour
450 servers * 20 seconds = 9000 seconds == 2.5 hours

Diffs?

A binary diff would be between 10KB and 400MB, even on consecutive runs without
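
The sequential transfer arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, it treats 1MB as 8Mbit, and it ignores protocol overhead and connection setup.

```python
def transfer_seconds(payload_mb: float, link_mbit_per_s: float) -> float:
    """Seconds to move payload_mb megabytes over a link of link_mbit_per_s megabits/s."""
    return payload_mb * 8 / link_mbit_per_s


def sequential_deploy_seconds(servers: int, payload_mb: float,
                              link_mbit_per_s: float) -> float:
    """Total wall-clock time when servers receive the payload one after another."""
    return servers * transfer_seconds(payload_mb, link_mbit_per_s)


if __name__ == "__main__":
    # Payload size, server count and link rates are the figures from the slide.
    for rate in (1000, 500, 200):
        per_host = transfer_seconds(500, rate)
        total = sequential_deploy_seconds(450, 500, rate)
        print(f"{rate} Mb/s: {per_host:.0f} s per host, "
              f"{total / 3600:.2f} h for 450 hosts")
```

The total grows linearly with the number of servers, which is why a sequential push stops being practical at this fleet size.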

SLIDE 21

Part II: Deploying - Bittorrent

SLIDE 22

Bittorrent - Previous experiences

Naive run using bittorrent to transport 300MB throughout our server park:

  • Near-complete network outage due to bandwidth starvation
  • Several crucial subsystems delayed or unreachable due to network bandwidth shortage

SLIDE 25

Bittorrent - The Problem

Every server has a 1Gb/s link to every other server

They don't.

SLIDE 26

Bittorrent - Actual bandwidth available

[Diagram: Core Network, 1-4Gb/s]

SLIDE 27

Bittorrent - Actual bandwidth available

[Diagram: Core Network, Production traffic, Administration traffic]

SLIDE 30

Murder - Why not?

Murder uses two tricks:

  • Clients (including the seeder) capped to 1 upload peer
  • Every client receives every peer from the tracker
  • No download bandwidth cap (easy to add though)

Peers will still connect all over the place.
It's slow: a timing run over 25 peers took 7 minutes.

SLIDE 31

DIY: Location aware tracker

SLIDE 32

SMDB - Location metadata

We have bandwidth information available in our server management database.

Build two-tier bittorrent swarms:

  • 1 swarm with 2 peers / rack (uplink)
  • 1 additional swarm per rack (uplink)
  • Cap every client @ 96Mbit/s
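
A minimal sketch of how hosts could be grouped into the two tiers, assuming the rack of each host is already known. Only the "2 peers per rack" rule and the extra swarm per rack come from the slide; the host names, rack names and swarm labels are hypothetical, and this is not the Hyves tracker code.

```python
from collections import defaultdict

SEEDS_PER_RACK = 2  # from the slide: 2 peers per rack in the top-level swarm


def build_swarms(host_to_rack: dict[str, str]) -> dict[str, list[str]]:
    """Return swarm name -> member hosts for a two-tier layout.

    "core" holds up to SEEDS_PER_RACK hosts from each rack. "rack:<name>"
    holds every host in that rack, so most traffic stays behind the uplink.
    """
    racks = defaultdict(list)
    for host, rack in sorted(host_to_rack.items()):
        racks[rack].append(host)

    swarms = {"core": []}
    for rack, hosts in sorted(racks.items()):
        # The first hosts of each rack fetch the payload across the core
        # network, then seed it to their rack-local swarm.
        swarms["core"].extend(hosts[:SEEDS_PER_RACK])
        swarms[f"rack:{rack}"] = hosts
    return swarms


if __name__ == "__main__":
    layout = build_swarms({
        "web1": "r1", "web2": "r1", "web3": "r1",
        "web4": "r2", "web5": "r2",
    })
    for name, members in layout.items():
        print(name, members)
```

In this sketch the core peers are also members of their rack swarm, which is one plausible way to bridge the two tiers; the slide does not say how the original tracker does it.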

SLIDE 33

DIY: 2-tier swarms

SLIDE 34

DIY - Tracker

Tracker in Python + Flask:

  • 1100 lines of code (1900 with tests)
  • Stores transfer metadata in redis
  • Connects to our SMDB using REST
  • Exposes REST interface
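
For readers unfamiliar with what a tracker returns: an announce reply is a bencoded dictionary with an interval and a peer list, as described in the BitTorrent specification. The sketch below builds such a reply with the standard library only and hands out just the other members of the requester's swarm. It is an illustration, not the tracker from the talk: it leaves out Flask, redis and the SMDB lookup, omits the optional peer id field, and `announce_response` and its parameters are hypothetical names.

```python
def bencode(value) -> bytes:
    """Encode ints, str/bytes, lists and dicts using BitTorrent bencoding."""
    if isinstance(value, bool):
        raise TypeError("bencoding has no boolean type")
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings and must be emitted in sorted order.
        items = sorted(
            ((k.encode("utf-8") if isinstance(k, str) else k, v)
             for k, v in value.items()),
            key=lambda kv: kv[0],
        )
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


def announce_response(requester_ip: str, swarm_members: dict[str, int],
                      interval: int = 30) -> bytes:
    """Build a tracker reply listing only the other peers in the requester's swarm.

    swarm_members maps peer IP -> port (hypothetical data model).
    """
    peers = [
        {"ip": ip, "port": port}
        for ip, port in sorted(swarm_members.items())
        if ip != requester_ip
    ]
    return bencode({"interval": interval, "peers": peers})


if __name__ == "__main__":
    print(announce_response("10.0.0.1", {"10.0.0.1": 6881, "10.0.0.2": 6881}))
```

Restricting the peer list per swarm is what makes a tracker location aware: a client can only connect to the peers it is told about.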

SLIDE 35

DIY - Client

We use a slightly modified rtorrent client.
Same things Twitter modified:

  • Remove features related to operating on the big bad internet
  • Make various timeouts more aggressive
  • No DHT, UPnP etc.

Nice bonus: RPC remote API

SLIDE 36

Results - single deploy ~300 hosts

[Chart: classic deploy vs. bittorrent deploy, without build and with build; axis 500 to 2000]

SLIDE 37

Graphs - full webfarm cluster

SLIDE 38

Graphs - single node

SLIDE 39

Graphs - single rack (24 nodes)

SLIDE 40

Bittorrent - Statistics

== release: 101482 expected: 287 actual: 107 seeders: 0 progress: 0.00% start: 12:04:20 last_completed: none failed: 0 ==
== release: 101482 expected: 287 actual: 267 seeders: 0 progress: 0.00% start: 12:04:20 last_completed: none failed: 0 ==
== release: 101482 expected: 287 actual: 286 seeders: 0 progress: 0.00% start: 12:04:20 last_completed: none failed: 0 ==
== release: 101482 expected: 287 actual: 287 seeders: 1 progress: 0.35% start: 12:04:20 last_completed: none failed: 0 ==
== release: 101482 expected: 287 actual: 287 seeders: 2 progress: 42.11% start: 12:04:20 last_completed: 12:06:01 failed: 0 ==
== release: 101482 expected: 287 actual: 287 seeders: 5 progress: 44.48% start: 12:04:20 last_completed: 12:06:05 failed: 0 ==
== release: 101482 expected: 287 actual: 287 seeders: 34 progress: 45.80% start: 12:04:20 last_completed: 12:06:09 failed: 0 ==
== release: 101482 expected: 287 actual: 287 seeders: 95 progress: 48.17% start: 12:04:20 last_completed: 12:06:10 failed: 0 ==
== release: 101482 expected: 287 actual: 287 seeders: 95 progress: 48.17% start: 12:04:20 last_completed: 12:06:10 failed: 0 ==
== release: 101482 expected: 287 actual: 287 seeders: 97 progress: 48.46% start: 12:04:20 last_completed: 12:06:15 failed: 0 ==
== release: 101482 expected: 287 actual: 287 seeders: 101 progress: 49.23% start: 12:04:20 last_completed: 12:06:21 failed: 0 ==
== release: 101482 expected: 287 actual: 287 seeders: 108 progress: 50.72% start: 12:04:20 last_completed: 12:06:24 failed: 0 ==
== release: 101482 expected: 287 actual: 287 seeders: 118 progress: 52.95% start: 12:04:20 last_completed: 12:06:27 failed: 0 ==
== release: 101482 expected: 287 actual: 287 seeders: 131 progress: 55.88% start: 12:04:20 last_completed: 12:06:30 failed: 0 ==
== release: 101482 expected: 287 actual: 287 seeders: 172 progress: 67.77% start: 12:04:20 last_completed: 12:06:34 failed: 0 ==
== release: 101482 expected: 287 actual: 287 seeders: 213 progress: 80.53% start: 12:04:20 last_completed: 12:06:37 failed: 0 ==
== release: 101482 expected: 287 actual: 287 seeders: 239 progress: 87.36% start: 12:04:20 last_completed: 12:06:40 failed: 0 ==
== release: 101482 expected: 287 actual: 287 seeders: 246 progress: 91.24% start: 12:04:20 last_completed: 12:06:43 failed: 0 ==
== release: 101482 expected: 287 actual: 287 seeders: 265 progress: 97.55% start: 12:04:20 last_completed: 12:06:46 failed: 0 ==
== release: 101482 expected: 287 actual: 287 seeders: 274 progress: 98.79% start: 12:04:20 last_completed: 12:06:49 failed: 0 ==
== release: 101482 expected: 287 actual: 287 seeders: 280 progress: 99.07% start: 12:04:20 last_completed: 12:06:52 failed: 0 ==
== release: 101482 expected: 287 actual: 287 seeders: 283 progress: 99.26% start: 12:04:20 last_completed: 12:06:55 failed: 0 ==
== release: 101482 expected: 287 actual: 287 seeders: 284 progress: 99.36% start: 12:04:20 last_completed: 12:06:56 failed: 0 ==
== release: 101482 expected: 287 actual: 287 seeders: 284 progress: 99.36% start: 12:04:20 last_completed: 12:06:56 failed: 0 ==
== release: 101482 expected: 287 actual: 287 seeders: 285 progress: 99.52% start: 12:04:20 last_completed: 12:07:04 failed: 0 ==
== release: 101482 expected: 287 actual: 287 seeders: 286 progress: 99.85% start: 12:04:20 last_completed: 12:07:06 failed: 0 ==
== release: 101482 expected: 287 actual: 287 seeders: 287 progress: 100.00% start: 12:04:20 last_completed: 12:07:37 failed: 0 ==
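
Status lines in this shape are easy to post-process. The sketch below parses one line and computes the time between `start` and `last_completed`. The field names are taken from the output above; the regular expression and function names are assumptions, since the slides do not show how the tracker formats or consumes these lines.

```python
import re
from datetime import datetime

STATUS_RE = re.compile(
    r"release: (?P<release>\d+) expected: (?P<expected>\d+) "
    r"actual: (?P<actual>\d+) seeders: (?P<seeders>\d+) "
    r"progress: (?P<progress>[\d.]+)% start: (?P<start>[\d:]+) "
    r"last_completed: (?P<last_completed>\S+) failed: (?P<failed>\d+)"
)


def parse_status(line: str) -> dict:
    """Parse one '== release: ... ==' status line into typed fields."""
    match = STATUS_RE.search(line)
    if match is None:
        raise ValueError(f"not a status line: {line!r}")
    fields = match.groupdict()
    for key in ("release", "expected", "actual", "seeders", "failed"):
        fields[key] = int(fields[key])
    fields["progress"] = float(fields["progress"])
    return fields


def elapsed_seconds(fields: dict):
    """Seconds from start to last_completed, or None if nothing has completed."""
    if fields["last_completed"] == "none":
        return None
    fmt = "%H:%M:%S"
    delta = (datetime.strptime(fields["last_completed"], fmt)
             - datetime.strptime(fields["start"], fmt))
    return int(delta.total_seconds())


if __name__ == "__main__":
    last = ("== release: 101482 expected: 287 actual: 287 seeders: 287 "
            "progress: 100.00% start: 12:04:20 last_completed: 12:07:37 "
            "failed: 0 ==")
    fields = parse_status(last)
    print(fields["seeders"], "seeders after", elapsed_seconds(fields), "seconds")
```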

SLIDE 41

Next Steps

Enable more projects

  • currently only three main projects
  • one codebase
  • multi-transfer

Move to Continuous Delivery

  • currently doing continuous deployment
  • 6-12 deploys a day
  • want to allow feature teams to deploy

SLIDE 42

Next Steps

Open Source the tracker:

  • Very closely tied to our infra
  • Not overly clean code

We will open source it, but some refactoring is needed first. Contact me if you're interested.

Watch our github repository: https://github.com/organizations/hyves-org

SLIDE 43

Next Steps - Deployment Glue

TODO:
[] prepare stuff
[] transport stuff
[] do some more stuff
[] Activate payload
[] post check

SLIDE 44

Next Steps - Deployment Glue

Simple from a single host perspective
Complex when executed in parallel, remotely, with failure handling and proper reporting

Fabric
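
To make the "parallel, with failure handling and proper reporting" part concrete, here is a small sketch using only the Python standard library. It is not Fabric and not the Hyves glue code: `deploy_everywhere`, the step callables and the report format are hypothetical, and the remote execution itself is left to whatever each step does.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def deploy_everywhere(hosts, steps, max_workers=32):
    """Run every step on every host in parallel and return a per-host report.

    steps is an ordered list of (name, callable) pairs. Each callable takes a
    hostname and raises on failure. A host stops at its first failing step.
    """
    def run_host(host):
        for name, step in steps:
            try:
                step(host)
            except Exception as exc:
                return host, f"failed at '{name}': {exc}"
        return host, "ok"

    report = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_host, host) for host in hosts]
        for future in as_completed(futures):
            host, status = future.result()
            report[host] = status
    return report


if __name__ == "__main__":
    def prepare(host):
        pass  # placeholder for real work, e.g. creating a release directory

    def transport(host):
        if host == "web2":
            raise RuntimeError("transfer timed out")

    result = deploy_everywhere(
        ["web1", "web2", "web3"],
        [("prepare", prepare), ("transport", transport)],
    )
    for host in sorted(result):
        print(host, result[host])
```

One failing host does not abort the others, and every host ends up with exactly one status line in the report.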

SLIDE 45

Thank you, questions?

Bram Cohen - http://www.bittorrent.com/
Flask: http://flask.pocoo.org/
Rtorrent: http://libtorrent.rakshasa.no/
Twitter - Murder: https://github.com/lg/murder
Boris, Cor, Lorenzo & others at hyves.nl
Michael Tekel: https://github.com/mtekel/
http://wiki.theory.org/BitTorrentSpecification
You?

We're hiring: http://werkenbijhyves.hyves.nl
email: ramon@hyves.nl
twitter: @ramonvanalteren
