SLIDE 1

Stateful Services on Mesos

Ankan Mukherjee (ankan@moz.com) Arunabha Ghosh (agh@moz.com)

SLIDE 2

A deployment diagram

Source: Wikipedia

SLIDE 3

Presentation / Business / Data (deployment diagram tiers)


SLIDE 5

Why run on Mesos?

  • Services are decoupled from the nodes
  • Automatic failover
  • Easier to manage/maintain
  • Simpler version management
  • Simpler environments, staging → deployment
  • Less overall system complexity
SLIDE 6

Transition

SLIDE 7

Challenges

  • Packaging/deployment
  • Naming/finding services
  • Dependency on persistent state
SLIDE 9

The problem

Examples:

  • Legacy apps
  • Single-node SQL databases (mysql, postgres)
  • Apps that depend on local storage

SLIDE 10

Potential Solutions

  • Local storage
  • Shared storage
  • Network block device
  • Mesos persistent resource primitives
  • Application specific distributed solutions
SLIDE 11

Local storage (option 1)

  • Pin to node
  • On failure

○ Manually bring the node up
○ Rely on existing process

SLIDE 12

Local storage (option 1)

  • Pros

○ Easiest (~ no changes)
○ Share free resources from node

  • Cons

○ No auto failover
○ Service still coupled to node
○ Feels like cheating!

SLIDE 13

Local storage (option 2)

(diagram: backup)

SLIDE 14

Local storage (option 2)

(diagram: backup, restore)

SLIDE 15

Local storage (option 2)

  • Periodic backups to central location
  • On failure:

○ Restore last known good state to local storage
○ Proceed as usual

SLIDE 16

Local storage (option 2)

  • When and where to backup?
  • When and where to restore?

○ Which node?
○ Which backup?

SLIDE 17

Local storage (option 2)

  • When and where to backup?
  • When and where to restore?

○ Which node?
○ Which backup?

“Automated scripted restore at process start.”
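
A minimal sketch of what such an automated restore-then-start wrapper could look like, assuming backups are synced with rsync from a central location; the backup source, data directory, and service command are placeholders, not names from the talk:

    #!/usr/bin/env python
    # Hypothetical restore-then-start wrapper for "local storage (option 2)".
    import os
    import subprocess

    BACKUP_SRC = "backups.example.com::myservice/latest/"        # placeholder central backup location
    DATA_DIR = "/var/lib/myservice"                               # placeholder local data directory
    SERVICE_CMD = ["/usr/bin/myservice", "--data-dir", DATA_DIR]  # placeholder service

    def restore_last_known_good():
        # Copy the last known good backup from central storage to local disk.
        os.makedirs(DATA_DIR, exist_ok=True)
        subprocess.run(["rsync", "-a", "--delete", BACKUP_SRC, DATA_DIR + "/"], check=True)

    if __name__ == "__main__":
        restore_last_known_good()
        # Replace this process with the service so Mesos supervises it directly.
        os.execv(SERVICE_CMD[0], SERVICE_CMD)

Mesos restarts the task elsewhere on failure; the wrapper only guarantees that the new instance starts from the last backup.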

SLIDE 18

Local storage (option 2)

  • Pros:

○ Easy to set up
○ Auto failover
○ Share free resources

  • Cons:

○ Scripted restore complexity
○ Adversely affected by system & data volume/type
○ Time to restore
○ Data loss

SLIDE 19

Shared file system - centralized

SLIDE 20

Shared file system - centralized

  • POSIX compliant centralized shared FS
  • Example: NFS
  • Mounted to same path across all nodes
  • On failure:

○ Let Mesos start new instance on any available node

SLIDE 21

Shared file system - centralized

What can go wrong?

  • What did we just do?

○ Added network between the process and the storage

SLIDE 22


Node disconnects from master

SLIDE 23


Node disconnects and reconnects

SLIDE 24


scaleTo = 2

Task is scaled to >1

SLIDE 25


Node disconnects from FS

SLIDE 26

Shared file system - centralized

To summarize, we could end up with…

  • Possibly corrupted data if

○ Node disconnects from master but is connected to FS
○ Node disconnects from network & then connects back
○ Somehow the task is “scaled” to >1 instances

  • Possibly undesired state of process/service if

○ Node is connected to master but disconnects from FS

SLIDE 27

Shared file system - centralized

How do we fix this?


SLIDE 28


Shared file system - centralized

How do we fix this?


  • Use zookeeper exclusive lock
  • The process should

○ start only if it has acquired the zk lock (exit otherwise)
○ exit at any point it loses the zk lock

  • Check for the FS mount; exit if it is not available
SLIDE 29

Shared file system - centralized

  • How do we do this without changing the original app?

○ New startup app/script (wrapper)
○ entrypoint/startup → wrapper → original app (see the sketch below)

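A sketch of such a wrapper, assuming kazoo as the ZooKeeper client; the ensemble address, lock path, mount point, and service command are placeholders. It acquires an exclusive lock before starting the original app, refuses to start if the shared FS is not mounted, and kills the task if the ZooKeeper session is suspended or lost:

    #!/usr/bin/env python
    # Hypothetical wrapper: container entrypoint -> this script -> original app.
    import os
    import subprocess
    import sys

    from kazoo.client import KazooClient, KazooState
    from kazoo.exceptions import LockTimeout

    ZK_HOSTS = "zk1:2181,zk2:2181,zk3:2181"      # placeholder ZooKeeper ensemble
    LOCK_PATH = "/locks/myservice"               # placeholder lock znode
    MOUNT_POINT = "/mnt/shared/myservice"        # placeholder shared-FS mount
    SERVICE_CMD = ["/usr/bin/myservice", "--data-dir", MOUNT_POINT]

    def on_state_change(state):
        # Runs in kazoo's connection thread: if the session is suspended or lost
        # we can no longer be sure we still hold the lock, so kill the whole task.
        if state in (KazooState.SUSPENDED, KazooState.LOST):
            os._exit(1)

    zk = KazooClient(hosts=ZK_HOSTS)
    zk.add_listener(on_state_change)
    zk.start()

    # Refuse to start unless the shared FS is actually mounted.
    if not os.path.ismount(MOUNT_POINT):
        sys.exit("shared FS not mounted at %s" % MOUNT_POINT)

    # Start only if we win the exclusive lock; exit otherwise.
    lock = zk.Lock(LOCK_PATH, identifier=os.uname()[1])
    try:
        lock.acquire(timeout=30)
    except LockTimeout:
        sys.exit("could not acquire lock; another instance is probably running")

    # We hold the lock: run the original app and propagate its exit code.
    sys.exit(subprocess.call(SERVICE_CMD))

Exiting on SUSPENDED is deliberately conservative; a gentler variant could wait briefly for the session to recover before giving up.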

SLIDE 30

Shared file system - centralized

Check:

  • Possibly corrupted data if

○ Node disconnects from master but is connected to FS
○ Node disconnects from network & then connects back
○ Somehow the task is “scaled” to >1 instances

  • Possibly undesired state of process/service if

○ Node is connected to master but disconnects from FS

SLIDE 31

Shared file system - centralized

  • Pros:

○ Easy to set up
○ Process benefits from most features (except scaling)

  • Cons:

○ Handle mutual exclusion (but this is fairly simple)
○ Depends on network speed/latency

SLIDE 32

Shared file system - distributed

  • POSIX compliant distributed shared FS
  • Examples: GlusterFS, MooseFS, Lustre
  • Mounted to same path across all nodes
  • On failure:

○ Let Mesos start new instance on any available node

SLIDE 33

Shared file system - distributed

  • Similar to centralized shared FS
  • Pros:

○ Process benefits from most features (except scaling)

  • Cons:

○ Similar to centralized shared FS
○ Setup may be complex
○ Replication, data distribution, processing overhead, etc.

SLIDE 34

Network Block Device

SLIDE 35

Network Block Device

  • Somewhat between local and shared FS
  • Device mounted to only 1 node at a time
  • On node failure:

○ Repair & mount device to new node
○ Proceed as usual

SLIDE 36

Network Block Device

  • Pros

○ Less overhead than a high-level protocol like NFS

  • Cons

○ Slightly more difficult to manage
○ Failover is not automatic
  ■ Need to mount to the new node (scripted)
○ May need to repair the FS on the NBD at startup (run fsck before mount; see the sketch below)
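
A sketch of the scripted part of that failover, assuming the block device has already been attached to the new node by whatever NBD/volume tooling is in use; the device and mount point below are placeholders:

    #!/usr/bin/env python
    # Hypothetical NBD failover helper: repair the filesystem, then mount it.
    import os
    import subprocess

    DEVICE = "/dev/nbd0"                  # placeholder: device attached by the NBD client
    MOUNT_POINT = "/mnt/myservice-data"   # placeholder mount point

    # The previous node may have died mid-write, so check/repair before mounting.
    result = subprocess.run(["fsck", "-y", DEVICE])
    if result.returncode >= 4:  # 0 = clean, 1/2 = errors corrected, >= 4 = still broken
        raise SystemExit("fsck could not repair %s" % DEVICE)

    os.makedirs(MOUNT_POINT, exist_ok=True)
    subprocess.run(["mount", DEVICE, MOUNT_POINT], check=True)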

SLIDE 37

Persistent State Resource Primitives

  • New features

○ Storage as a resource
○ Keep data across process restarts
○ Process affinity to its data on the node (on node restarts)

  • Easier to work with storage
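
As one concrete form of these primitives, Marathon exposes them as persistent local volumes plus a residency policy. A minimal sketch of an app definition using them, with field names as in the Marathon 1.x stateful-services docs; the Marathon URL, sizes, and command are placeholders:

    # Hypothetical Marathon app using a Mesos persistent local volume.
    import requests

    MARATHON = "http://marathon.example.com:8080"   # placeholder Marathon endpoint

    app = {
        "id": "/mysql-single",
        "cpus": 1,
        "mem": 1024,
        "instances": 1,
        # Keep the reserved resources and volume when the task exits, so the
        # replacement task lands on the same node with its data intact.
        "residency": {"taskLostBehavior": "WAIT_FOREVER"},
        "container": {
            "type": "MESOS",
            "volumes": [{
                "containerPath": "data",        # relative path inside the sandbox
                "mode": "RW",
                "persistent": {"size": 10240}   # MiB; survives task restarts
            }]
        },
        "cmd": "mysqld --datadir=$MESOS_SANDBOX/data"   # placeholder command
    }

    requests.post(MARATHON + "/v2/apps", json=app).raise_for_status()
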
SLIDE 38

Application Specific Solutions

  • For mysql:

○ Vitess
○ Mysos (Apache Cotton)

  • Pros

○ Replication and availability built in
○ Scalable

  • Cons

○ Relatively more involved setup
○ Not available for most applications

SLIDE 39

Stateful services we’re running

  • mysql
  • postgresql
  • mongodb (single, clustered soon)
  • redis
  • rethinkdb
  • elasticsearch (single, clustered)
SLIDE 40

Best Practices / Lessons Learnt

  • Mount dir at the same point (path)
  • Multi-level backups, since the storage may be a SPOF

○ Disk-based ones like RAID
○ App-specific ones like mysqldump (see the sketch after this list)

  • Leverage services like zookeeper for mutual exclusion
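
As an illustration of the app-specific level, a sketch of a periodic mysqldump shipped to a central backup directory (run from cron or a scheduled task); the backup path is a placeholder and credentials are assumed to come from ~/.my.cnf:

    #!/usr/bin/env python
    # Hypothetical app-level backup: logical dump shipped to central storage.
    import datetime
    import subprocess

    BACKUP_DIR = "/mnt/backups/mysql"   # placeholder central backup location

    stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
    dump_file = "%s/dump-%s.sql" % (BACKUP_DIR, stamp)

    # --single-transaction gives a consistent InnoDB snapshot without locking;
    # connection settings are read from ~/.my.cnf rather than the command line.
    with open(dump_file, "w") as out:
        subprocess.run(["mysqldump", "--single-transaction", "--all-databases"],
                       stdout=out, check=True)
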
SLIDE 41

Best Practices / Lessons Learnt

  • Isolate applications at this layer

○ Based on
  ■ disk space & usage
  ■ disk IOPS & usage
  ■ network bandwidth & usage
○ Use multiple mounts, specific allocation, etc.

  • Set up adequate monitoring & alerting
SLIDE 42

Conclusion

  • Although not a natural fit, it is possible to gainfully run stateful services on Mesos.
  • It should be approached as an engineering problem rather than one with a generic or ideal solution.

SLIDE 43

Performance Test

  • Disclaimer

○ Very much dependent on the setup, network, etc.
○ YMMV!

  • Setup

○ local*  : ~ 2000r / 1000w IOPS
○ nfs500  : ~ 500 IOPS
○ nfs1000 : ~ 1000 IOPS

*24 10k SAS disks in RAID 10

SLIDE 44

Performance Test

  • System

○ Single node mysql server
○ Buffer pool size: 128 MB

  • Tests

○ sysbench tests run for 300 seconds
  ■ default RO & RW tests
  ■ custom WO tests with no reads
  ■ single thread

SLIDE 45

Performance Test

  • Read-only queries
  • No BEGIN/COMMIT

SLIDE 46

Performance Test

  • Read-only queries
  • With BEGIN/COMMIT

SLIDE 47

Performance Test

  • Read/write queries
  • With BEGIN/COMMIT
  • 26% write queries

SLIDE 48

Performance Test

  • Write-only queries
  • With BEGIN/COMMIT

SLIDE 49

Performance Test

  • For read-heavy queries

○ Increasing the buffer pool size may compensate for the performance decrease with a network FS.

  • For write-heavy queries

○ Memory size is less relevant, as these are disk-bound.

SLIDE 50

Thanks!