Behind the Scenes at MySpace.com Dan Farino Chief Systems - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Behind the Scenes at MySpace.com

Dan Farino Chief Systems Architect dan@myspace.com

1 Friday, November 21, 2008

slide-2
SLIDE 2

Topics

  • Architecture overview and history
  • The stuff I get to work on (in the Windows world)
  • Monitoring
  • Administration

slide-3
SLIDE 3

Topics

  • Windows?!
    • It’s a good server (now leave me alone.)
    • However, the selection of tools for large-scale management is a bit sparse...

slide-4
SLIDE 4

Where we started


slide-5
SLIDE 5

Where we started

  • The ideal growth scenario
    • Plan
    • Implement
    • Test
    • Go live
    • Monitor and collect ops data
    • Repeat

slide-6
SLIDE 6

Where we started

  • Our growth scenario:
    • Implement
    • Go live
  • And while those are happening over and over:
    • Reboot servers
    • Throw hardware at performance issues
    • “Shotgun debugging”

slide-7
SLIDE 7

Where we started

“Shotgun debugging”:

Shotgun debugging is a process of making relatively undirected changes to software in the hope that a bug will be perturbed out of existence.

slide-8
SLIDE 8

Where we started

  • Why would anyone “shotgun debug”?
    • Don’t really know how to analyze and debug a problem
    • Need to resolve the problem now and collecting data for analysis would take too long

slide-9
SLIDE 9

Where we started

  • Web servers
    • Windows 2000 Server
    • IIS 5.0
    • ColdFusion 5
  • Database servers
    • Windows 2000 Server
    • SQL Server 2000

slide-10
SLIDE 10

Where we were

  • Operationally
    • Batch files and robocopy for code deployment
    • “psexec” for remote admin script execution
    • Windows Performance Monitor for monitoring

slide-11
SLIDE 11

Where we were

  • Any sort of formal, automated QA process?
    • No.

slide-12
SLIDE 12

Current architecture


slide-13
SLIDE 13

Current architecture

  • 4,500+ web servers
    • Windows 2003/IIS 6.0/ASP.NET
  • 1,200+ “cache” servers
    • 64-bit Windows 2003
  • 500+ database servers
    • 64-bit Windows 2003
    • SQL Server 2005

slide-14
SLIDE 14

QA today

  • Unit tests/automated testing
  • We still don’t “fuzz” the site nearly as thoroughly as our users do though
  • There are still problems that happen only in production

slide-15
SLIDE 15

QA today

  • We need better operational data collection so that we know what cases we’re not testing

slide-16
SLIDE 16

Operational Data Collection

slide-17
SLIDE 17

Ops Data Collection

  • Two general types of systems:
    • Static
      • Collect, store and alert based on pre-configured rules
    • Dynamic
      • Write an ad-hoc script or application to collect data for an immediate or one-off need

slide-18
SLIDE 18

Ops Data Collection

  • Our current “static” Windows Performance counter monitor:

slide-19
SLIDE 19

Ops Data Collection

  • Cons of static system:
    • Relatively central configuration managed by a small number of administrators
    • Bad for one-off requests: change the config, apply, wait for data
    • Developers’ questions usually go unanswered

slide-20
SLIDE 20

Ops Data Collection

  • Developers looking at production?!
    • Developers like to see their creations come to life (I know I do)
    • The more a developer can see once their code goes live, the more they’re going to know for V2

slide-21
SLIDE 21

Ops Data Collection

  • Cons of the dynamic system:
    • It’s not really a “system” at all...it’s an administrator running a script
    • Is a privileged operation: scripts are powerful and can potentially make changes to the system
    • Even run as a limited user, bad scripts can still DoS the system

slide-22
SLIDE 22

Ops Data Collection

  • Cons of the dynamic system:
    • One-shot data collection is possible but learning about deltas takes a lot more code (and polling, yuck)
    • Different custom data-collection tools that request the same data point cause duplicated network traffic

slide-23
SLIDE 23

Ops Data Collection

  • A recent example of an ad-hoc task using our current “dynamic” system:

    get-adservers | run-agent ps /e '"Version: $(gcm F:\file.dll | % {$_.FileVersionInfo.FileVersion} )"' | select Host, Message

slide-24
SLIDE 24

Ops Data Collection

  • Ideally, all operational data available in the entire server farm should be able to be queried:
    • Safely
    • Instantly
    • With change-notification

slide-25
SLIDE 25

Ops Data Collection

  • I’d like to be able to do something like this:

    SELECT CpuTime.*, ExceptionsPerSecond
    WHERE WebService.Status = 'UP'
      AND (serving = 'profile.myspace.com' OR serving = 'home.myspace.com')

slide-26
SLIDE 26

Ops Data Collection

  • I’d also like to be able to leave that query “hanging” and be notified of changes like:
    • A selected field has changed for a known data point
    • A new server has come online and meets the criteria (or vice-versa)
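A "hanging" query of this kind can be pictured as a small state machine that remembers which servers currently match and emits an event only when something changes. The Python sketch below is an illustration under assumptions, not the platform's implementation: the `StandingQuery` class, the event names (`added`, `changed`, `removed`) and the row fields are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Event = Tuple[str, str, Row]  # (kind, server name, selected fields)


class StandingQuery:
    """Hypothetical standing query: remembers matches and reports what changed."""

    def __init__(self, predicate: Callable[[Row], bool], fields: List[str]) -> None:
        self.predicate = predicate
        self.fields = fields
        self.matched: Dict[str, Row] = {}

    def update(self, server: str, row: Row) -> List[Event]:
        """Feed the latest row for one server and return any resulting events."""
        selected = {name: row.get(name) for name in self.fields}
        if not self.predicate(row):
            if server in self.matched:
                # The server no longer meets the criteria.
                return [("removed", server, self.matched.pop(server))]
            return []
        if server not in self.matched:
            # A server has come online (or started matching).
            self.matched[server] = selected
            return [("added", server, selected)]
        if self.matched[server] != selected:
            # A selected field changed for a known data point.
            self.matched[server] = selected
            return [("changed", server, selected)]
        return []


def is_live_profile_server(row: Row) -> bool:
    return row["status"] == "UP" and row["serving"] == "profile.myspace.com"


query = StandingQuery(is_live_profile_server, ["exceptions_per_second"])
base = {"status": "UP", "serving": "profile.myspace.com"}
events: List[Event] = []
events += query.update("web001", {**base, "exceptions_per_second": 2})  # new match
events += query.update("web001", {**base, "exceptions_per_second": 2})  # no change
events += query.update("web001", {**base, "exceptions_per_second": 5})  # field changed
events += query.update("web001", {**base, "status": "DOWN"})            # drops out
```

The second update produces no event at all, which is the point: the consumer only hears about deltas.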

slide-27
SLIDE 27

Our new operational data collection platform

slide-28
SLIDE 28

Ops Data Collection

  • Our new operational data-subscription platform:
    • On-demand
    • Supports both “one-shot” and “persistent” modes
    • Can be used by non-privileged users

slide-29
SLIDE 29

Ops Data Collection

  • Our new operational data-subscription platform:
    • Eliminates the need for the consumer to poll for changes
    • If a data source requires polling, that operation is pushed as close to the source as possible
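One way to read "pushed as close to the source as possible" is that the polling loop lives next to the data and only changes leave the machine. The Python sketch below assumes a hypothetical `read_value` callable standing in for a local counter read; a real agent would also wait between polls, which is omitted here.

```python
from typing import Callable, Iterator, Optional


def poll_for_changes(read_value: Callable[[], float], polls: int) -> Iterator[float]:
    """Poll a local source and yield a value only when it differs from the last one sent."""
    last_sent: Optional[float] = None
    for _ in range(polls):
        value = read_value()
        if value != last_sent:
            last_sent = value
            yield value  # only deltas would go over the network


# Six local samples (made-up numbers), three of which are actual changes.
samples = iter([10.0, 10.0, 12.5, 12.5, 12.5, 9.0])
sent = list(poll_for_changes(lambda: next(samples), polls=6))
```

Six reads happen locally, but only three values are handed on to the consumer.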

slide-30
SLIDE 30

Ops Data Collection

  • A Client makes one TCP connection to a “Collector” server
  • Can receive data related to thousands of servers via this one connection
  • As long as the connection is up, the client is kept up-to-date

slide-31
SLIDE 31

Ops Data Collection

  • A little bit like:
    • Having all of the servers in a chat room and being able to talk to a selected subset of them at any time (over one connection)
  • Initial idea came from looking at using XMPP+ejabberd for command and control

slide-32
SLIDE 32

Ops Data Collection

[Diagram: Clients connect to a Collector Server, which connects to the Agents]

  • Collector Server to Agents: one lazily-established TCP connection per Agent
  • Clients to Collector Server: preferably one TCP connection per Client, although more than one is allowed (but frowned upon)
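"Lazily-established" presumably means the Collector opens a connection to an Agent only when some client first asks for data from that host. A minimal Python sketch of that idea follows; the `LazyAgentLinks` class and the `connect` callable are hypothetical, and the fake connector here records host names instead of opening sockets.

```python
from typing import Callable, Dict, List


class LazyAgentLinks:
    """Hypothetical connection table: one link per Agent, opened on first use."""

    def __init__(self, connect: Callable[[str], object]) -> None:
        self._connect = connect
        self._links: Dict[str, object] = {}

    def get(self, host: str) -> object:
        if host not in self._links:
            # First request for this host: establish the connection now.
            self._links[host] = self._connect(host)
        return self._links[host]


opened: List[str] = []


def fake_connect(host: str) -> str:
    opened.append(host)  # a real connector would open a TCP socket here
    return f"link-to-{host}"


links = LazyAgentLinks(fake_connect)
first = links.get("web001")
second = links.get("web001")  # reuses the existing link
```

Hosts that nobody asks about never cost a connection, which matters when the farm has thousands of Agents.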

slide-33
SLIDE 33

Ops Data Collection

  • Provides:
    • Windows Performance Counters
    • WMI objects
    • Event logs
    • Hardware data
    • Custom WMI objects published from out-of-process
    • Log file contents

slide-34
SLIDE 34

Ops Data Collection

  • Provides:
    • On Linux, plans are to hook into something like D-Bus so that processes can provide operational data to the Agent in a loosely-connected manner

slide-35
SLIDE 35

Ops Data Collection

  • The Collector service:
    • A Windows Service in C#
    • Completely async I/O (never blocks a thread)
    • Uses Microsoft’s “Concurrency and Coordination Runtime”
  • An Agent running on each host

slide-36
SLIDE 36

Ops Data Collection

  • Wire protocol is Google’s Protocol Buffers
  • Clients and Agents can be easily written in any of the languages for which there is a PB implementation
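Protocol Buffers messages are not self-delimiting, so sending a stream of them over one TCP connection needs some framing. The slides do not say which framing the platform uses; the Python sketch below assumes a varint length prefix, a common convention, and treats each payload as opaque bytes where a serialized message would go. The `frame` and `unframe` helpers are hypothetical names.

```python
from typing import List, Tuple


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (Protocol Buffers style)."""
    if n < 0:
        raise ValueError("varint length must be non-negative")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def frame(payload: bytes) -> bytes:
    """Prefix a serialized message with its length."""
    return encode_varint(len(payload)) + payload


def unframe(buffer: bytes) -> Tuple[List[bytes], bytes]:
    """Split a receive buffer into complete payloads plus any unfinished tail."""
    messages: List[bytes] = []
    pos = 0
    while True:
        length, shift, i = 0, 0, pos
        while True:
            if i >= len(buffer):
                return messages, buffer[pos:]  # length prefix incomplete
            byte = buffer[i]
            i += 1
            length |= (byte & 0x7F) << shift
            shift += 7
            if not byte & 0x80:
                break
        if i + length > len(buffer):
            return messages, buffer[pos:]  # payload incomplete
        messages.append(buffer[i:i + length])
        pos = i + length


# One complete frame followed by the first three bytes of a second one.
stream = frame(b"hello") + frame(b"world")[:3]
complete, leftover = unframe(stream)
```

The leftover bytes are kept and prepended to whatever arrives next, which is what lets a reader work with partial TCP reads.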

slide-37
SLIDE 37

Ops Data Collection

  • Why not use XMPP+ejabberd?
    • Wanted to use Protocol Buffers instead of XML
    • Wanted lazily-established TCP connections to the Agents
    • Wanted to see if C#+CCR could handle the load (yes it can)

slide-38
SLIDE 38

Why develop a whole new platform?

slide-39
SLIDE 39

Ops Data Collection

  • Why develop something new?
    • There doesn’t seem to be anything out there right now that fits the need
    • And my requirements also include free and open source...

slide-40
SLIDE 40

Ops Data Collection

  • To do it properly, you really need to be using 100% async I/O.
  • Libraries that make this easy are relatively new
    • CCR, Twisted, GTask, Erlang
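To make the async I/O point concrete, the sketch below uses Python's standard `asyncio` module (not CCR, Twisted, GTask or Erlang, which are the libraries named above) to wait on 100 simulated agent reads from a single thread. The `read_counter` coroutine and the host names are hypothetical, and `asyncio.sleep` stands in for a non-blocking network wait.

```python
import asyncio
import threading
from typing import List, Tuple


async def read_counter(host: str, delay: float) -> Tuple[str, int]:
    """Simulate waiting on one agent without blocking the thread."""
    await asyncio.sleep(delay)  # stand-in for a non-blocking network read
    return host, threading.get_ident()


async def collect(hosts: List[str]) -> List[Tuple[str, int]]:
    # All waits overlap; no thread is parked per connection.
    return await asyncio.gather(*(read_counter(h, 0.01) for h in hosts))


hosts = [f"web{i:03d}" for i in range(100)]
results = asyncio.run(collect(hosts))
thread_ids = {thread_id for _, thread_id in results}
```

All 100 reads complete on the same thread, so the cost per outstanding request is a small coroutine object and not a blocked thread.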

slide-41
SLIDE 41

Ops Data Collection

  • Most established products were written before the multi-core/async craze

slide-42
SLIDE 42

Ops Data Collection

  • What does it enable?
    • The individual that is actually interested in the data can gather it himself
    • No central config, no need to involve an administrator
    • This includes developers

slide-43
SLIDE 43

Ops Data Collection

  • What does it enable?
    • There is a very low “barrier to entry”
    • It’s almost like exploring a database with some ad-hoc SQL queries
    • “I wonder...” questions are easily answered without a lot of work

slide-44
SLIDE 44

Ops Data Collection

  • What does it enable?
    • Charting/alerting/data-archiving systems no longer concern themselves with the data-collection intricacies.
    • We can spend time writing the valuable code instead of rewriting the same plumbing every time

slide-45
SLIDE 45

Ops Data Collection

  • What does it provide?
    • Abstracts physical server-farm from the user
    • If you know machine names, great. But you can also say “all servers serving ‘profile.myspace.com’” or “all cache servers in Los Angeles”
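Selecting servers by what they do instead of by name amounts to matching on attributes. The Python sketch below is only an illustration: the `Server` record, its attribute names and the sample farm are hypothetical, not the platform's data model.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Server:
    """Hypothetical inventory record for one machine."""
    name: str
    role: str
    location: str
    serving: str = ""


def select_servers(farm: List[Server], **criteria: str) -> List[str]:
    """Return the names of servers whose attributes match every criterion."""
    return [
        server.name
        for server in farm
        if all(getattr(server, key) == value for key, value in criteria.items())
    ]


# Made-up farm used only for illustration.
farm = [
    Server("web001", "web", "Los Angeles", "profile.myspace.com"),
    Server("web002", "web", "Los Angeles", "home.myspace.com"),
    Server("cache001", "cache", "Los Angeles"),
    Server("cache002", "cache", "Seattle"),
]

profile_servers = select_servers(farm, serving="profile.myspace.com")
la_caches = select_servers(farm, role="cache", location="Los Angeles")
```

The caller never needs to know machine names; the names come back as the result of the query.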

slide-46
SLIDE 46

Ops Data Collection

  • What does it provide?
    • Guaranteed to keep you “up-to-date”
    • Get your initial set of data and then just wait for the deltas
    • Pushes polling as close to the source as possible

slide-47
SLIDE 47

Ops Data Collection

  • What does it provide?
    • Eliminates duplicate requests
    • Hundreds of clients can be monitoring the “% Processor Time” for a server and it will only be sent from that server once when it changes

slide-48
SLIDE 48

Ops Data Collection

  • What does it provide?
    • Only collects data that someone is currently asking for
    • This is how we avoid having explicit configuration on the server
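Both properties (no duplicate requests, and collection only while someone is asking) fall out of tracking subscribers per data point. The Python sketch below is one possible shape under assumptions: the `SubscriptionHub` class and its `start_collecting`/`stop_collecting` callbacks are hypothetical names, not the Collector's API.

```python
from typing import Callable, Dict, List, Set, Tuple

Point = Tuple[str, str]  # (server, counter name)


class SubscriptionHub:
    """Hypothetical hub: collect a data point only while it has subscribers."""

    def __init__(self, start_collecting: Callable[[Point], None],
                 stop_collecting: Callable[[Point], None]) -> None:
        self._start = start_collecting
        self._stop = stop_collecting
        self._subscribers: Dict[Point, Set[str]] = {}

    def subscribe(self, client: str, point: Point) -> None:
        clients = self._subscribers.setdefault(point, set())
        if not clients:
            self._start(point)  # first subscriber: ask the Agent once
        clients.add(client)

    def unsubscribe(self, client: str, point: Point) -> None:
        clients = self._subscribers.get(point)
        if clients is None:
            return
        clients.discard(client)
        if not clients:
            del self._subscribers[point]
            self._stop(point)  # last subscriber gone: stop collecting

    def publish(self, point: Point, value: float) -> Dict[str, float]:
        """One update from the Agent fans out to every subscribed client."""
        return {client: value for client in self._subscribers.get(point, set())}


log: List[Tuple[str, Point]] = []
hub = SubscriptionHub(lambda p: log.append(("start", p)),
                      lambda p: log.append(("stop", p)))
cpu = ("web001", "% Processor Time")
hub.subscribe("alice", cpu)
hub.subscribe("bob", cpu)          # no second request to the Agent
delivered = hub.publish(cpu, 42.0)
hub.unsubscribe("alice", cpu)
hub.unsubscribe("bob", cpu)        # collection stops here
```

Because collection is driven entirely by live subscriptions, nothing has to be configured on the server ahead of time.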

slide-49
SLIDE 49

Ops Data Collection

  • Is this really a good way to do things?
    • Having too much data pushed at you is a bad thing
    • Being able to pull from a large selection of data points is a good thing

slide-50
SLIDE 50

Ops Data Collection

  • For developers, knowing that they will have access to instrumentation data even in production encourages more detailed instrumentation

slide-51
SLIDE 51

Ops Data Collection

  • Better instrumentation = more data available = more detailed feedback to QA and developers

slide-52
SLIDE 52

Ops Data Collection

  • Ease-of-use is a very high priority
  • Easy and fun APIs encourage adoption

slide-53
SLIDE 53

Ops Data Collection

  • LINQ via C#:

    var collector = new Collector(...);
    var counters =
        from server in collector
        where server.subdomain == "www.myspace.com"
        from counter in server.WindowsPerfCounter
        where counter.category == "Processor"
        select new { server.Name, counter.Instance, counter.Value };

slide-54
SLIDE 54

Ops Data Collection

  • LINQ via C# and CLINQ (“Continuous LINQ”) = instant monitoring app (in about 10 lines of code):

    var counters = ...
    MainWpfWindow.MainGrid = counters; // Go grab a beer

slide-55
SLIDE 55

Ops Data Collection

  • Tail a file across thousands of servers
    • With the filtering expression being run on the remote machines
    • At the same time as someone else is (with no duplicate lines being sent over the network)
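Running the filter on the remote machine means non-matching lines never leave it, and a line that several people want can still be sent once. The Python sketch below illustrates that idea with regular expressions as the filter language, which is an assumption; the `filter_at_source` function, the client names and the log lines are all hypothetical.

```python
import re
from typing import Dict, Iterable, Iterator, List, Tuple


def filter_at_source(lines: Iterable[str],
                     filters: Dict[str, str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield each matching line once, with the clients whose filter it satisfies."""
    compiled = {client: re.compile(pattern) for client, pattern in filters.items()}
    for line in lines:
        interested = [client for client, rx in compiled.items() if rx.search(line)]
        if interested:
            yield line, interested  # one copy on the wire, fanned out later


log_lines = [
    "GET /home 200",
    "GET /profile 500",
    "POST /login 500",
]
client_filters = {"alice": r" 500$", "bob": r"/profile"}
outgoing = list(filter_at_source(log_lines, client_filters))
```

The line both clients care about appears once in `outgoing`, tagged with both names, and the line nobody asked for is dropped at the source.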

slide-56
SLIDE 56

Ops Data Collection

  • Open source?
    • Hopefully!
  • Other implementations?
    • I may write a GTask or Erlang version as a “weekend” project

slide-57
SLIDE 57

Thank you!
