Introducing Krylov
eBay AI Platform - Machine Learning Made Easy
GPU Technology Conference, 2018
Henry Saputra Technical Lead for Krylov - eBay Unified AI Platform
Agenda
1. Data Science and Machine Learning at eBay
2. Introducing Krylov
3.
Tools
Weka, XGBoost, Moses
Patterns for ML Training
Deep Learning Distributed Training
Key takeaway = CHOICE
1. Flexibility of software
2. Flexibility of hardware configuration
1. 50%-70% is plumbing work
   a. Accessing and moving secured data
   b. Environment and tools setup
   c. Sub-optimal compute instances - NVIDIA GPU and high-memory/CPU instances
   d. Long wait times for platform and infrastructure
2. Loss of productivity and opportunities
   a. ML lifecycle management of models and features
   b. Building robust model training pipelines: data preparation, algorithm selection, hyperparameter tuning, cross-validation
3. Collaboration almost impossible
4. Research vs. Applied ML
Krylov is a powerful cloud-based data science and machine learning platform, built around secured data and eBay cloud computing resources. It provides:
○ Easy and secure access to training datasets
○ Access to compute on high-performance machines, such as GPUs or clusters of machines
○ Familiar tools and flexible software to run machine learning model training jobs
○ Interactive data analysis and visualization, with multi-tenancy support to allow quick prototyping of algorithms and data access
○ Sharing and collaboration of ML work between teams at eBay
MODEL BUILDING - Interactive, iterative
MODEL TRAINING - Automatable, repeatable, scalable
MODEL INFERENCING - Deployable, scalable
MODEL RE-FITTING - Interactive, iterative
MODEL RE-TRAINING - Interactive, iterative
Data + Lifecycle Management spans all stages
Krylov platform stack:
○ Infrastructure - GPU and tall instances, fast storage
○ Krylov AI Engine - Learning Pipelines, Model Experimentation, Data Scientist Workspaces, Model Lifecycle Management
○ Data - Preparation, Movement, Discovery, Access
○ AI Hub (Shared Repository)
○ AI Modules - Speech Recognition, Machine Translation, Computer Vision, Information Retrieval, Natural Language Understanding, …
○ Inferencing
An ML Application packages developers’ programs:
○ mlapplication.json and artifact.json configuration files
○ Source code of the programs
○ Dependency management via Dockerfile
The built artifact is used by the Run Specification file to deploy it onto one of the nodes in the cluster.
{
  "tasks": {
    "prepare_data": {
      "program": "com.ebay.oss.krylov.workflow.JvmMainProgram",
      "parameters": {
        "className": "com.ebay.krylov.helloai.HelloWorld"
      }
    },
    "train_model": {
      "program": "com.ebay.oss.krylov.workflow.PythonProgram",
      "parameters": {
        "file": "helloai-python/helloai/helloworld.py",
        "args": []
      }
    },
    ...
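The train_model task above points at helloai-python/helloai/helloworld.py, which the talk does not show. A minimal sketch of what such a PythonProgram could look like, assuming Krylov simply executes the script and passes the declared "args" array through as command-line arguments (the greeting logic is purely illustrative):

```python
import sys

def main(args):
    # Hypothetical body: the "args" array from mlapplication.json arrives
    # here as ordinary command-line arguments.
    name = args[0] if args else "AI"
    greeting = f"Hello, {name}!"
    print(greeting)
    return greeting

if __name__ == "__main__":
    main(sys.argv[1:])
```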
The Run Specification defines parameter passing for each Task in ML Application job submissions, and tells the compute cluster which resources to allocate: how many GPUs, how much CPU and memory, and which Docker image provides the dependencies used by the ML Application programs.
{
  "jobName": "job-sample",
  "artifact": "myartifact",
  "artifactTag": "latest",
  "mlApplication": "com.ebay.oss.krylov.workflow.app.GenericMLApplication",
  "applicationParameters": { },
  "tasks": {
    "prepare_data": {
      "taskParameters": {
        "prepare_data_parameter_key": "prepare_data_parameter_value"
      }
    }
  }
}
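One way to picture how this parameter passing could work on the cluster side: job-wide applicationParameters are merged with each Task's taskParameters before the Program launches. The function name, the merge precedence, and the "log_level" key are my illustration, not Krylov's actual implementation:

```python
import json

# A trimmed run specification in the same shape as the example above;
# "log_level" is an invented job-wide parameter for demonstration.
run_spec = json.loads("""
{
  "jobName": "job-sample",
  "applicationParameters": {"log_level": "info"},
  "tasks": {
    "prepare_data": {
      "taskParameters": {"prepare_data_parameter_key": "prepare_data_parameter_value"}
    }
  }
}
""")

def parameters_for(spec, task_name):
    """Merge job-wide parameters with task-specific ones (task wins on conflict)."""
    merged = dict(spec.get("applicationParameters", {}))
    merged.update(spec["tasks"][task_name].get("taskParameters", {}))
    return merged

print(parameters_for(run_spec, "prepare_data"))
```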
○ Declarative
○ Default Generic Workflow
○ Workflow - a single pipeline defined within Krylov and the unit of deployment for an ML Application
  ■ Each Workflow contains one or more Tasks
  ■ The Tasks are connected to each other in a Directed Acyclic Graph (DAG) structure
○ Task - the smallest unit of execution; runs a developer’s Program and executes on a single machine
○ Flows - contains one or more key-value pairs mapping a name to a declaration of a Task DAG
○ Flow - the key chosen to run from the possible selections in the Flows definition
{
  "tasks": { ... },
  "flows": {
    "sample_flow": {
      "prepare_data": ["train_model"],
      "train_model": ["output"]
    }
  },
  "flow": "sample_flow"
}
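A flows entry like the one above is an adjacency list: each Task name maps to the Tasks that run after it. A minimal sketch (the `topological_order` helper is mine, not part of Krylov) of how such a DAG could be validated and linearized into an execution order using Python's standard library:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Adjacency list in the same shape as the "sample_flow" entry:
# each task maps to the tasks that run after it.
sample_flow = {
    "prepare_data": ["train_model"],
    "train_model": ["output"],
}

def topological_order(flow):
    """Return tasks in a valid execution order; raises CycleError if the
    declared flow is not actually acyclic."""
    ts = TopologicalSorter()
    for task, successors in flow.items():
        for succ in successors:
            ts.add(succ, task)  # succ runs after (depends on) task
        ts.add(task)            # ensure tasks with no predecessors are present
    return list(ts.static_order())

print(topological_order(sample_flow))
```

The cycle check matters because the configuration is declarative: nothing in the JSON itself prevents a user from writing a circular flow.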
Krylov Workspaces give data scientists a browser-based environment for ML model prototyping, data preparation, and exploration, backed by high-memory or NVIDIA GPU instances. Krylov launches Jupyter Notebook servers in the Krylov compute cluster using Kubernetes, and users can customize the default parameters.
1. Download the krylovctl program from the Krylov release repository
2. Run `krylovctl project create` to create a new project on the local machine
3. Update or add code to the Krylov project for the machine learning programs
4. Register them as Programs within Tasks in mlapplication.json
5. Add a new Flow for the defined Tasks to construct the Workflow as a Directed Acyclic Graph (DAG)
6. Run `krylovctl project build` to build the project
7. Run `krylovctl artifact create` to copy the runnables of the program into an artifact file
8. Run `krylovctl artifact upload` to upload the artifact file for remote execution
9. Run `krylovctl job run` for local execution, or `krylovctl job submit` to run it in the compute cluster
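The build-and-submit portion of the steps above reduces to a fixed command sequence. A small sketch that captures it; the subcommands come from the slide, but the step ordering helper is my illustration and the flags each command takes are omitted because the talk does not show them:

```python
# krylovctl command sequence for running an ML Application remotely,
# as described in the workflow steps (steps 6-9; flags omitted).
REMOTE_JOB_STEPS = [
    "krylovctl project build",     # build the project
    "krylovctl artifact create",   # package runnables into an artifact
    "krylovctl artifact upload",   # upload the artifact for remote execution
    "krylovctl job submit",        # run in the compute cluster
]

def next_step(completed):
    """Return the next command given how many steps are already done,
    or None when the job has been submitted."""
    if completed < len(REMOTE_JOB_STEPS):
        return REMOTE_JOB_STEPS[completed]
    return None
```

For local execution, the final step would be `krylovctl job run` instead of `krylovctl job submit`.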
1. Inferencing Platform
2. Exploration and documentation of RESTful APIs for job management
3. Data Source and Dataset abstraction via Krylov SDKs
4. Managed ML Pipelines - Computer Vision, NLP, Machine Translation
5. Distributed Deep Learning
6. AutoML - Hyperparameter Tuning
7. AI Hub to share ML Applications and Datasets