Stephan Erb serb@apache.org 2016.11.15 @ErbStephan
Multi-tenant Machine Learning Apache Aurora & Apache Mesos
Apache Aurora
https://aurora.apache.org
Mesos framework for the deployment and scaling of stateless and fault-tolerant services in a datacenter
Apache Mesos
https://mesos.apache.org
Cluster manager providing fault-tolerant, fine-grained multi-tenancy via containers
Apache Aurora: “distributed supervisord”
Apache Mesos: “plumbing”
Cluster Manager
webservice = Process(
  name = 'webservice',
  cmdline = './run_my_webservice.py')

task = Task(
  processes = [webservice],
  # resources reserved for each instance
  resources = Resources(cpu=4, ram=4*GB, disk=8*GB))

jobs = [
  Job(
    task = task,
    instances = 4,
    # place at most one instance per host
    constraints = {'host': 'limit:1'},
    service = True,
    cluster = 'rz1',
    role = 'www',
    environment = 'prod',
    name = 'webserver'),
]
$ aurora update start rz1/www/prod/webserver \
    webserver.aurora
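To make the mapping between the config and the command concrete, here is a small sketch. `JobSpec`, `job_key`, `total_reservation` and `min_hosts` are hypothetical helpers, not part of Aurora, and `GB` is assumed to be 1024**3 bytes. It shows that the job key passed to `aurora update start` is cluster/role/environment/name, that resources are reserved per instance, and that `'host': 'limit:1'` forces the four instances onto at least four distinct hosts.

```python
# Hypothetical helpers, not part of the Aurora client.
from dataclasses import dataclass

GB = 1024 ** 3  # assumed stand-in for Aurora's GB constant (bytes)


@dataclass(frozen=True)
class JobSpec:
    cluster: str
    role: str
    environment: str
    name: str
    instances: int
    cpu: float
    ram: int
    disk: int
    max_per_host: int  # N from constraints = {'host': 'limit:N'}


def job_key(job: JobSpec) -> str:
    """Aurora job keys have the form cluster/role/environment/name."""
    return "/".join([job.cluster, job.role, job.environment, job.name])


def total_reservation(job: JobSpec) -> dict:
    """Resources are requested per instance, so totals scale with instances."""
    return {
        "cpu": job.cpu * job.instances,
        "ram": job.ram * job.instances,
        "disk": job.disk * job.instances,
    }


def min_hosts(job: JobSpec) -> int:
    """'limit:N' on 'host' allows at most N instances per host."""
    return -(-job.instances // job.max_per_host)  # ceiling division


webserver = JobSpec("rz1", "www", "prod", "webserver",
                    instances=4, cpu=4, ram=4 * GB, disk=8 * GB, max_per_host=1)

print(job_key(webserver))                   # rz1/www/prod/webserver
print(total_reservation(webserver)["cpu"])  # 16
print(min_hosts(webserver))                 # 4
```

The per-host limit is what spreads a service across failure domains: losing one host takes down at most one of the four webserver instances.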
[Figure: architecture showing Aurora Scheduler, Mesos Master and ZooKeeper state; each Worker Node runs a Mesos Agent with an Aurora Executor, which runs User Code in a Task (Container)]
Photo by liz west https://flic.kr/p/7qYh21
[Figure: Data Delivery, showing Customer System, Historic Tenant Data, Compute Platform, and Tenant / ML model]
Key Achievement
[Figure: VM/Host, bigger VM/Host]
Implementation Choices:
implementation
infinite data is easy”)
# Compute on whole data set
#
compute_prediction(data)

# Compute on partitioned data
#
# (this is rather restrictive but tends to
# work great for many use cases)
#
for chunk in partition(data):
    compute_prediction(chunk)
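The difference between the two variants can be checked with a small runnable sketch. `partition`, `compute_prediction`, the chunk size and the linear toy model are all hypothetical stand-ins. The point is that partitioning gives the same answer only when each prediction depends on its own rows alone, which is the restriction the slide mentions; in exchange, each chunk becomes an independent unit of work.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def partition(data: Iterable[float], chunk_size: int) -> Iterator[List[float]]:
    """Yield consecutive chunks of at most chunk_size items (hypothetical helper)."""
    it = iter(data)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def compute_prediction(rows: List[float]) -> List[float]:
    """Hypothetical row-wise model: each prediction depends only on its own row."""
    return [2.0 * x + 1.0 for x in rows]


data = [float(i) for i in range(10)]

# Compute on the whole data set at once.
whole = compute_prediction(data)

# Compute chunk by chunk; because the model is row-wise, the results match.
partitioned: List[float] = []
for chunk in partition(data, chunk_size=3):
    partitioned.extend(compute_prediction(chunk))

print(whole == partitioned)  # True
```

A model that needed, say, a global mean over all rows would break this equivalence, and the partitioned loop would silently produce different predictions.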
[Figure: Master and Workers]
http://www.celeryproject.org/
http://distributed.readthedocs.io/en/latest/
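Celery and Dask distributed, linked above, both hand tasks from a central broker or scheduler to a pool of workers. The sketch below shows that master/workers pattern using only the standard library (threads and a queue); it is not the API of either project, and `run_master_workers` is a hypothetical name.

```python
import queue
import threading
from typing import Callable, Dict, List


def run_master_workers(chunks: List[List[float]],
                       work: Callable[[List[float]], float],
                       num_workers: int = 3) -> Dict[int, float]:
    """Master puts one task per chunk on a queue; workers pull tasks until
    they receive a stop marker. Results are collected by chunk index."""
    tasks = queue.Queue()
    results: Dict[int, float] = {}
    lock = threading.Lock()

    def worker() -> None:
        while True:
            item = tasks.get()
            if item is None:  # stop marker from the master
                return
            chunk_id, chunk = item
            value = work(chunk)
            with lock:
                results[chunk_id] = value

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()

    # Master: enqueue every task first, then one stop marker per worker,
    # so a worker only sees a stop marker once every task has been handed out.
    for chunk_id, chunk in enumerate(chunks):
        tasks.put((chunk_id, chunk))
    for _ in threads:
        tasks.put(None)

    for t in threads:
        t.join()
    return results


chunks = [[1.0, 2.0], [3.0, 4.0], [5.0]]
print(sorted(run_master_workers(chunks, work=sum).items()))
# [(0, 3.0), (1, 7.0), (2, 5.0)]
```

Workers finish in arbitrary order, which is why results are keyed by chunk index rather than appended to a list.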
[Figure: three Project/Tenant units, each shown with a Compute Cluster]
Key Idea
— Jay Kreps
https://www.confluent.io/blog/sharing-is-caring-multi-tenancy-in-distributed-data-systems
Aurora
Mesos
Docker/Appc containers
cgroups
network, …)
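The container layer can be requested directly from the job config. A sketch extending the `webserver` job from the earlier slide, assuming the cluster has the Docker containerizer enabled; the image name is a placeholder, and the commented line shows the alternative through the Mesos (unified) containerizer.

```python
# Sketch: extends the 'webserver' job from the earlier slide.
# Assumes the cluster enables the Docker containerizer; the image is a placeholder.
jobs = [
  Job(
    task = task,
    instances = 4,
    constraints = {'host': 'limit:1'},
    service = True,
    cluster = 'rz1',
    role = 'www',
    environment = 'prod',
    # run all processes of the task inside this image
    container = Docker(image = 'python:2.7'),
    # alternative via the Mesos containerizer:
    # container = Mesos(image = DockerImage(name = 'python', tag = '2.7')),
    name = 'webserver'),
]
```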
Merits and Pitfalls?
User
automatic rollback
in ZooKeeper
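Automatic rollback is driven by the job's update settings. A hedged sketch of Aurora's `UpdateConfig` for the webserver job; the numbers are illustrative only, and the comments paraphrase the parameters rather than quote the Aurora reference.

```python
# Sketch: illustrative values for the 'webserver' job from the earlier slide.
update_config = UpdateConfig(
  batch_size = 1,              # instances updated at the same time
  watch_secs = 45,             # how long an updated instance is watched before it counts as healthy
  max_per_shard_failures = 0,  # failures tolerated per instance
  max_total_failures = 0,      # failed instances tolerated before the whole update fails
  rollback_on_failure = True)  # on failure, roll updated instances back to the previous config

# then pass it to the job: Job(..., update_config = update_config)
```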
Operator
preemption
monitoring and debugging
https://github.com/blue-yonder/mesos-threshold-oversubscription
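Mesos oversubscription lets agents offer otherwise idle capacity as revocable resources, and a QoS controller may kill revocable tasks when a node comes under pressure. The linked module's name suggests a threshold-based policy; the sketch below is a hypothetical policy of that kind, not the module's actual logic. `RunningTask`, `tasks_to_evict` and measuring load in CPUs are all assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunningTask:
    name: str
    cpus: float
    revocable: bool  # True if launched on oversubscribed (revocable) resources


def tasks_to_evict(tasks: List[RunningTask], load: float,
                   threshold: float) -> List[str]:
    """Hypothetical threshold policy, not the linked module's logic.

    If the measured load (in CPUs) exceeds the threshold, pick revocable
    tasks, largest first, until the estimated load is back at or below the
    threshold. Non-revocable tasks are never selected.
    """
    if load <= threshold:
        return []
    evicted: List[str] = []
    remaining = load
    revocable = sorted((t for t in tasks if t.revocable),
                       key=lambda t: t.cpus, reverse=True)
    for task in revocable:
        if remaining <= threshold:
            break
        evicted.append(task.name)
        remaining -= task.cpus  # assume a task's load equals its CPU share
    return evicted


node = [
    RunningTask("webserver", cpus=4.0, revocable=False),
    RunningTask("batch-a", cpus=2.0, revocable=True),
    RunningTask("batch-b", cpus=1.0, revocable=True),
]
print(tasks_to_evict(node, load=7.5, threshold=6.0))  # ['batch-a']
```

Evicting the largest revocable tasks first frees capacity with the fewest kills, while the production webserver is protected because it never ran on revocable resources.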
heterogeneous workloads.
machine learning models into production.
team. In this talk, we have seen: