Taskerman: A Distributed Cluster Task Manager
Raghavendra D Prabhu, rprabhu@yelp.com, @randomsurfer, Distributed Systems
Yelp’s Mission
Connecting people with great local businesses.
Datastore Ecosystem @ Yelp
Cassandra Elasticsearch Zookeeper PostgreSQL
- Memcached
- Redis
- Spark
- Redshift
- DynamoDB
- PaaStorm
- S3
And many more...
- Several TB in Cassandra clusters with tens of nodes each
- Close to a million messages/second in streaming pipeline
- Several TB in Elasticsearch with several hundred nodes in each
- Many PB archived to S3 every month
- Multi-AZ Multi-Region
- And growing…
Distributed Systems
“Need to run logical backup on a fleet without disruption to ingress traffic”
“Run anti-entropy repair on Cassandra cluster without spiking read latency”
“Reboot 1000 instances without taking a millennium, but without bringing down the site either”
“Upgrade an Elasticsearch cluster from m3.medium to m3.xlarge safely without downtime”
Pet vs Cattle
Maintenance Cost Engineering Efficiency Scalability
Taskerman
- Safe
- Security
- Generic and Extensible
- Distributed
- Loosely coupled
- Cluster awareness
Requirements
- Schedulable
- Reusable
- Auditability
○ Not Ad-hoc
○ More Declarative, Less Imperative
○ Config Management
- Maintainability
- Observability
- Resilience
Desirable
- Paramount*
- Serialized execution
○ ‘m’ out of ‘n’
○ Disjoint jobs
- Avoid cascade
- Privilege escalation
- Pull-based
* Unless oncall is automated too.
Safety
- Network is reliable
- Latency is zero
- Bandwidth is infinite
- Network is secure
- One administrator
- Transport cost is zero
- Network is homogeneous
- Topology doesn't change
Fallacies of Distributed Systems
Quotes
“There are 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors.” @secretGeek
“There are only two hard problems in distributed systems: 2. Exactly-once delivery 1. Guaranteed order of messages 2. Exactly-once delivery” @mathiasverraes
- Scheduler
- Router
- Co-ordinator
- Transport
- Executor
- Error handler
- Configuration
- Monitoring
- Tooling
Building Blocks
[Diagram: flow of a task. Labels: Task Scheduler, Router Queue, Workqueue (queues Q1, Q2, Q3), tasks T1, T2, T3, Cluster Node, Retries, Lease Failure, Dead Letter Queue, Zookeeper, EC2 API]
# Anatomy of a Taskerman Task
# Restart action for 2 nodes of geo_counter
# cassandra cluster owned by gsi
{
    'action': 'cassandra_task:restart',
    'version': 1.2,
    'limit': 2,
    'cluster_name': 'cassandra:geo_counter',
    'discovery': 'aws_tags',
    'owner': 'gsi',
    'task_id': 'abcd-ef123',
    'taskerman_params': {
        'action_args': {'force': true},        # force=true for restart
        'workqueue_args': {'retry_count': 3},  # retry_count for queue
    },
    'nodes': [],       # [a,b,c,d] to skip discovery
    'destnode': '',
}
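A component that receives a task like the one above would plausibly sanity-check it before acting. The sketch below shows one way to do that. The function name `validate_task`, the list of required fields, and the two rules (an `action` of the form `module:action`, a positive integer `limit`) are assumptions made for illustration; the slides do not describe Taskerman's actual schema checks.

```python
# Fields that appear at the top level of the example task on the slide.
REQUIRED_FIELDS = (
    "action", "version", "limit", "cluster_name",
    "discovery", "owner", "task_id",
)


def validate_task(task):
    """Return a list of problems found in a task dict (empty means it looks OK)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS
                if name not in task]
    # Assumed convention: 'module:action', e.g. 'cassandra_task:restart'.
    if task.get("action", "").count(":") != 1:
        problems.append("action should look like 'module:action'")
    limit = task.get("limit")
    if type(limit) is not int or limit < 1:
        problems.append("limit must be a positive integer")
    return problems


example_task = {
    "action": "cassandra_task:restart",
    "version": 1.2,
    "limit": 2,
    "cluster_name": "cassandra:geo_counter",
    "discovery": "aws_tags",
    "owner": "gsi",
    "task_id": "abcd-ef123",
    "taskerman_params": {
        "action_args": {"force": True},
        "workqueue_args": {"retry_count": 3},
    },
    "nodes": [],
    "destnode": "",
}

print(validate_task(example_task))
```

Returning a list of problems instead of raising on the first one makes it easier to report every issue in a hand-written task at once.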
- Runs on Chronos
- Emits a task
- Enqueues into global queue
- Ad-hoc invocation
- Deployment granularities
- Task tracking
- Yelpsoa-configs
Task Scheduler
PaaSTA
- AWS SQS
- Best-effort FIFO
- Reliable and cheap
- Low latency
- Properties
○ Read without delete
○ Visibility timeout
○ Retry
○ Dead Letter Queue
WorkQueue
AWS SQS
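The queue properties listed above (read without delete, visibility timeout, retry, dead letter queue) can be modelled in a few lines. The class below is a toy in-memory stand-in written for illustration only: it does not call SQS, and the names `InMemoryWorkQueue`, `max_receives`, and the injectable `clock` are assumptions, not part of Taskerman or the AWS API.

```python
import itertools
import time


class InMemoryWorkQueue:
    """Toy model of a queue with visibility timeouts and a dead letter queue."""

    def __init__(self, visibility_timeout=30.0, max_receives=3,
                 clock=time.monotonic):
        self.visibility_timeout = visibility_timeout
        self.max_receives = max_receives
        self.clock = clock
        self._handles = itertools.count(1)
        self._messages = []     # dicts: handle, body, visible_at, receives
        self.dead_letters = []  # bodies that exhausted their retries

    def send(self, body):
        self._messages.append({
            "handle": next(self._handles),
            "body": body,
            "visible_at": self.clock(),
            "receives": 0,
        })

    def receive(self):
        """Return (handle, body) for the first visible message, or None.

        The message is hidden rather than deleted; it becomes visible again
        after the timeout unless delete() is called with its handle.
        """
        now = self.clock()
        for message in list(self._messages):
            if message["visible_at"] > now:
                continue  # still being worked on by another consumer
            if message["receives"] >= self.max_receives:
                # Retries exhausted: park the message for a human to inspect.
                self._messages.remove(message)
                self.dead_letters.append(message["body"])
                continue
            message["receives"] += 1
            message["visible_at"] = now + self.visibility_timeout
            return message["handle"], message["body"]
        return None

    def delete(self, handle):
        """Acknowledge a message after it was processed successfully."""
        self._messages = [m for m in self._messages if m["handle"] != handle]
```

A consumer that crashes mid-task never calls `delete`, so the task reappears after the timeout and is retried. This is why task actions need to tolerate being run more than once.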
- Stateless Marathon worker
- Routes tasks to clusters
- Custom routing logic
- At-least-once delivery
- ‘DNS’ of Taskerman
- Pluggable discovery
○ AWS
○ Smartstack
Task Router
PaaSTA
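One way to picture "pluggable discovery" is a registry of lookup functions keyed by the task's `discovery` field. The sketch below is hypothetical: the `static` backend, its node names, and the `taskerman-<cluster>` queue naming are invented for illustration, and the real router presumably queries AWS or Smartstack instead.

```python
DISCOVERY_BACKENDS = {}


def discovery(name):
    """Decorator that registers a discovery function under a name."""
    def register(func):
        DISCOVERY_BACKENDS[name] = func
        return func
    return register


@discovery("static")
def static_discovery(task):
    # Hypothetical backend: a fixed table instead of AWS tags or Smartstack.
    table = {"cassandra:geo_counter": ["node-a", "node-b", "node-c"]}
    return table.get(task["cluster_name"], [])


def route(task):
    """Resolve target nodes and choose the per-cluster queue for a task."""
    # An explicit node list on the task skips discovery entirely.
    nodes = task.get("nodes") or DISCOVERY_BACKENDS[task["discovery"]](task)
    if not nodes:
        raise LookupError(f"no nodes found for {task['cluster_name']}")
    # Assumed naming scheme for per-cluster queues.
    queue_name = "taskerman-" + task["cluster_name"].replace(":", "-")
    return queue_name, dict(task, nodes=list(nodes))


queue_name, routed = route({
    "cluster_name": "cassandra:geo_counter",
    "discovery": "static",
    "nodes": [],
})
print(queue_name, routed["nodes"])
```

Because `route` keeps no state between calls, several router workers can run side by side, which fits the "stateless worker" bullet above.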
- The executor of Taskerman
- Dequeues a task and executes it
○ Pre-defined reviewed code.
- Cron-ed on node
- Zookeeper for coordination
- Task deleted upon success
- Task moved to the Dead Letter Queue once retries have failed
TaskRunner
class TestTaskRunner(TaskRunner):
    def __init__(self, task, ..):
        # State management and datastore-specific setup

    def pre_check(self):
        # Is the task safe to execute on this cluster?

    def execute_action(self):
        # Actual execution of task:action

    def post_check(self):
        # Is the cluster good after execution, or is it on fire?
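The three hooks on the slide suggest a template-method design: a base class fixes the order of the steps and subclasses fill them in. The following is a minimal runnable sketch under that assumption. The `run` method, its return values, and `RestartRunner` are hypothetical; the slides do not show the real `TaskRunner` base class.

```python
class TaskRunner:
    """Hypothetical base class: pre_check, execute_action, post_check in order."""

    def __init__(self, task):
        self.task = task

    def pre_check(self):
        # Is the task safe to execute on this cluster right now?
        return True

    def execute_action(self):
        raise NotImplementedError

    def post_check(self):
        # Is the cluster healthy after the action?
        return True

    def run(self):
        """Return 'skipped', 'failed', or 'succeeded'."""
        if not self.pre_check():
            return "skipped"
        self.execute_action()
        return "succeeded" if self.post_check() else "failed"


class RestartRunner(TaskRunner):
    """Illustrative subclass; the restart itself is only simulated."""

    def __init__(self, task, cluster_healthy=True):
        super().__init__(task)
        self.cluster_healthy = cluster_healthy
        self.restarted = False

    def pre_check(self):
        return self.cluster_healthy

    def execute_action(self):
        self.restarted = True  # placeholder for the real restart command

    def post_check(self):
        return self.restarted


print(RestartRunner({"action": "cassandra_task:restart"}).run())
```

Keeping the ordering in the base class means a new task type cannot accidentally skip its safety checks.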
- Distributed Coordinator
- Non Blocking Lease
○ Time-based lease
○ Global lease
- Ephemeral locks
- Atomic Counters
○ Statistics
○ Circuit breaker
Zookeeper
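To make the idea of a non-blocking, time-based lease concrete, here is an in-memory model. It is a sketch only: it does not talk to Zookeeper, and the names `LeaseTable`, `limit`, and `ttl` are assumptions. The `limit` parameter models running a task on at most 'm' out of 'n' nodes at a time.

```python
import time


class LeaseTable:
    """In-memory model of a time-based lease with at most `limit` holders."""

    def __init__(self, limit, ttl, clock=time.monotonic):
        self.limit = limit
        self.ttl = ttl
        self.clock = clock
        self._expiry = {}  # holder name -> expiry timestamp

    def try_acquire(self, holder):
        """Return True if `holder` got (or renewed) a lease; never blocks."""
        now = self.clock()
        # Drop expired leases so a crashed holder cannot block others forever.
        self._expiry = {h: t for h, t in self._expiry.items() if t > now}
        if holder in self._expiry or len(self._expiry) < self.limit:
            self._expiry[holder] = now + self.ttl
            return True
        return False

    def release(self, holder):
        """Give the lease back early, e.g. after the task finished."""
        self._expiry.pop(holder, None)
```

A node that fails to get a lease simply returns and tries again on its next scheduled run, which suits a runner that is invoked periodically from cron.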
- Staleness
○ Nodes can go down
- Garbage collection
○ Cleanup of ZK data structures
- Composition
- Starvation
- Uptime
Zookeeper: Challenges
- Puppet
- Terraform
- Yelpsoa-configs
- PaaSTA
- Jenkins
- AWS Lambda
Deployment
PaaSTA
- Multiple vectors of failure
- Idempotency
- Pessimistic approach
○ Job retry
- Separation of state
- Mutability
- Highly available components
- Circuit breakers
Failure handling
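The circuit breaker mentioned above can be sketched as a small state holder: after a run of failures it refuses further work until a cool-down has passed. The class below is illustrative only. The thresholds, method names, and in-process counter are assumptions; the slides indicate that Taskerman keeps such counters in Zookeeper.

```python
import time


class CircuitBreaker:
    """Stop issuing work after repeated failures, then retry after a pause."""

    def __init__(self, max_failures=3, reset_after=60.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def allow(self):
        """Return True if a new task may be attempted."""
        if self.opened_at is None:
            return True
        # After the cool-down, let a trial attempt through (half-open state).
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()
```

A failed trial attempt calls `record_failure` again, which restarts the cool-down, so a persistently broken cluster is probed only once per `reset_after` interval.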
Debugging
- Heartbeat ping
○ End-to-end monitoring
- Dead Letter Queue
○ Recycle bin of failed tasks
○ Hooks into human side of monitoring
- Status check
Failure detection
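A heartbeat check reduces to comparing the time of the last successful ping with the current time. The helper below is a small sketch of that comparison; the function name, the component names, and the threshold are made up for illustration.

```python
def stale_components(last_heartbeat, now, max_age):
    """Return the sorted names whose last heartbeat is older than max_age seconds."""
    return sorted(
        name for name, seen in last_heartbeat.items()
        if now - seen > max_age
    )


# Timestamps in seconds; only 'scheduler' has been silent for more than 30s.
heartbeats = {"router": 100, "scheduler": 40, "runner": 95}
print(stale_components(heartbeats, now=110, max_age=30))
```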
- End-to-end logging
○ Un/structured
- Metrics
○ Counters
○ Queue lengths
- Aggregation and dashboards
- Staleness checks
- Dead Letter Queue
- Multi-modal Alerting
Monitoring
- Restarts
- Reboots
- Instance Replacement
- Integration tests
- Kafka config reload
- Failure injection
- Backup and restore
- Search indexing
- .. and many more.
Use cases
- Safety
- Cassandra
- Elasticsearch
- Common issues
- Constraints
○ Limit
○ Healthcheck
○ Mutual exclusion
Scheduled Backups
Secure Infrastructure
$ uptime
 06:52:54 up 99 days, 19:20, 1 user, load average: 0.02, 0.03, 0.07
$ ps -eo pid,cmd,lstart | grep ..
10058 zookeeper Tue Dec 5 05:23:43 2017
www.yelp.com/careers/
We're Hiring!
@YelpEngineering fb.com/YelpEngineers engineeringblog.yelp.com github.com/yelp
Q & A
- Slides will also be uploaded to slideshare.net/slidunder.
❖ Q: What challenges remain with Taskerman?
➢ A:
❖ Q: …
➢ A: …