[PPT] - Introducing Krylov eBay AI Platform - Machine Learning Made Easy PowerPoint Presentation

SLIDE 1

Introducing Krylov

eBay AI Platform - Machine Learning Made Easy

GPU Technology Conference, 2018

Henry Saputra, Technical Lead for Krylov - eBay Unified AI Platform

SLIDE 2

1. Data Science and Machine Learning at eBay
2. Introducing Krylov
3. Compute Cluster and Accelerator Support with Nvidia GPU
4. Quickstart Example
5. Future Roadmap
6. Q & A

Agenda

SLIDE 3

Data Science and Machine Learning at eBay

SLIDE 4

eBay Patterns - Tools and Frameworks

Tools

Languages: R, Python, Scala, C++
IDE-like: RStudio, Notebooks (Jupyter), Python IDE
Frameworks: NumPy, SciPy, matplotlib, Scikit-learn, Spark MLlib, H2O, Weka, XGBoost, Moses

Pipelines: Cron, Luigi, Apache Airflow, Apache Oozie

Patterns for ML Training

Single node
Distributed training
Deep learning (GPUs)

Deep Learning Distributed Training

Key takeaway = CHOICE

1. Flexibility of software
2. Flexibility of hardware configuration

SLIDE 5

1. 50%-70% is plumbing work
   a. Accessing and moving secured data
   b. Environment and tools setup
   c. Sub-optimal compute instances - NVIDIA GPUs and high-memory/CPU instances
   d. Long wait time from platform and infrastructure
2. Loss of productivity and opportunities
   a. ML lifecycle management of models and features
   b. Building robust model training pipelines: data preparation, algorithm, hyperparameter tuning, cross-validation
3. Collaboration almost impossible
4. Research vs Applied ML

Problems and Challenges

SLIDE 6

Introducing Krylov: Unified eBay AI Platform

SLIDE 7

Krylov is the core project of the eBay unified AI Platform initiative, which aims to provide an easy-to-use and powerful cloud-based data science and machine learning platform.

The objective of the project is to give machine learning jobs easy access to secured data and eBay cloud computing resources.

The main goals for the Krylov initiative are:

○ Easy and secure access to training datasets
○ Access to compute on high-performance machines, such as GPUs, or clusters of machines
○ Familiar tools and flexible software to run machine learning model training jobs
○ Interactive data analysis and visualization, with multi-tenancy support to allow quick prototyping of algorithms and data access
○ Sharing and collaboration of ML work between teams in eBay

Overview

SLIDE 8

ML Lifecycle Management

Lifecycle

MODEL BUILDING: Interactive, iterative
MODEL TRAINING: Automatable, repeatable, scalable
MODEL INFERENCING: Deployable, scalable
MODEL RE-FITTING: Interactive, iterative
MODEL RE-TRAINING: Interactive, iterative
Data + Lifecycle Management

SLIDE 9

Krylov Staircase Design for AI Platform

SLIDE 10

eBay AI Platform Components

Infrastructure - Krylov: GPU, Tall instances, Fast Storage
AI Engine - Krylov: Learning Pipelines, Model Experimentation, Data Scientist Workspaces, Model Lifecycle Management
Data: Preparation, Movement, Discovery, Access
AI Hub (Shared Repository)
AI Modules: Speech Recognition, Machine Translation, Computer Vision, Information Retrieval, Natural Language Understanding, …
Inferencing

SLIDE 11

Krylov High Level Architecture

SLIDE 12

1. Client Command Line Interface (CLI) via krylovctl program
2. ML Application and Run Specification
3. ML Pipelines: Workflow and Workspace
4. Namespaces - For quota and data isolation
5. Jobs and Runs - Managed by Krylov Tools and Minions
6. Secure Data Access - HDFS, NFS, OpenStack Swift, Custom

Krylov Main Features and Concepts

SLIDE 13

Krylov CLI - krylovctl

SLIDE 14

A Krylov ML Application is a versioned unit of deployment that contains the declaration of the developers' programs.

It is implemented as a client project that serves as the source for building a deployment artifact.
Three main parts:

○ mlapplication.json and artifact.json configuration files
○ Source code of the programs
○ Dependency management via Dockerfile

Supported types of programs: JVM languages (Java, Scala), Python, Shell script
Using the ML Application as the source, developers can build a deployment artifact that the Run Specification file can then use to deploy it onto one of the nodes in the cluster.

Krylov ML Application

SLIDE 15

{
  "tasks": {
    "prepare_data": {
      "program": "com.ebay.oss.krylov.workflow.JvmMainProgram",
      "parameters": {
        "className": "com.ebay.krylov.helloai.HelloWorld"
      }
    },
    "train_model": {
      "program": "com.ebay.oss.krylov.workflow.PythonProgram",
      "parameters": {
        "file": "helloai-python/helloai/helloworld.py",
        "args": []
      }
    },
    ...

Krylov ML Application Example
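
To make the shape of the task declarations concrete, here is a minimal Python sketch that loads a fragment modeled on the example above and lists each Task with the kind of Program it runs. The `summarize_tasks` helper is hypothetical and not part of Krylov; only the two Tasks shown on the slide are included, since the rest of the file is elided.

```python
import json

# Fragment modeled on the slide's example. The slide elides the rest of the
# file, so only the two tasks it shows appear here.
ML_APPLICATION = """
{
  "tasks": {
    "prepare_data": {
      "program": "com.ebay.oss.krylov.workflow.JvmMainProgram",
      "parameters": {"className": "com.ebay.krylov.helloai.HelloWorld"}
    },
    "train_model": {
      "program": "com.ebay.oss.krylov.workflow.PythonProgram",
      "parameters": {"file": "helloai-python/helloai/helloworld.py", "args": []}
    }
  }
}
"""


def summarize_tasks(app):
    """Hypothetical helper: return (task name, program type, parameters) rows."""
    rows = []
    for name, task in app["tasks"].items():
        # Keep only the class name after the last dot, e.g. "PythonProgram".
        program_type = task["program"].rsplit(".", 1)[-1]
        rows.append((name, program_type, task.get("parameters", {})))
    return rows


app = json.loads(ML_APPLICATION)
for name, program_type, params in summarize_tasks(app):
    print(f"{name}: {program_type} {params}")
```

The sketch only reads the declaration; it does not reflect how Krylov itself resolves or launches the programs.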

SLIDE 16

The Krylov Run Specification is a runtime configuration that adds configuration overrides and parameter passing for each Task in ML Application job submissions.

It tells the Krylov master API server which artifact, created from the ML Application, will be used in the compute cluster.

It is defined as a runspec.json file, or it can be passed as an argument to the krylovctl client program.
The runspec.json file also defines the compute resources, such as which NVIDIA GPUs to use, CPU, memory, and which Docker image provides the dependencies used by the ML Application programs.

Krylov Run Specification

SLIDE 17

{
  "jobName": "job-sample",
  "artifact": "myartifact",
  "artifactTag": "latest",
  "mlApplication": "com.ebay.oss.krylov.workflow.app.GenericMLApplication",
  "applicationParameters": { },
  "tasks": {
    "prepare_data": {
      "taskParameters": {
        "prepare_data_parameter_key": "prepare_data_parameter_value"
      }
    }
  }
}

Krylov Run Specification Example
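
The slides say the Run Specification overrides configuration and passes parameters per Task, but they do not show how the merge is performed. The Python sketch below illustrates one plausible reading, in which each Task's `taskParameters` are layered on top of the parameters declared in the ML Application. The `apply_run_spec` function and the "run spec wins" rule are assumptions for illustration, not Krylov's documented behavior.

```python
import copy

# Task declaration taken from the ML Application example on the slides.
ml_application = {
    "tasks": {
        "prepare_data": {
            "program": "com.ebay.oss.krylov.workflow.JvmMainProgram",
            "parameters": {"className": "com.ebay.krylov.helloai.HelloWorld"},
        }
    }
}

# Run specification fields taken from the runspec example on the slides.
run_spec = {
    "jobName": "job-sample",
    "artifact": "myartifact",
    "artifactTag": "latest",
    "tasks": {
        "prepare_data": {
            "taskParameters": {
                "prepare_data_parameter_key": "prepare_data_parameter_value"
            }
        }
    },
}


def apply_run_spec(app, spec):
    """Hypothetical merge: run spec taskParameters override task parameters."""
    tasks = copy.deepcopy(app["tasks"])  # leave the declaration untouched
    for name, override in spec.get("tasks", {}).items():
        if name not in tasks:
            raise KeyError(f"run spec refers to unknown task: {name}")
        params = tasks[name].setdefault("parameters", {})
        params.update(override.get("taskParameters", {}))
    return tasks


effective = apply_run_spec(ml_application, run_spec)
print(effective["prepare_data"]["parameters"])
```

Copying the declaration before merging keeps the versioned ML Application unchanged, so the same artifact could be reused with different run specifications.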

SLIDE 18

The Krylov ML batch lifecycle pipeline is defined as a Krylov Workflow definition

○ Declarative
○ Default Generic Workflow

Important concepts for Krylov Workflow:

○ Workflow - A single pipeline defined within Krylov and the unit of deployment for an ML Application
  ■ Each Workflow contains one or more Tasks
  ■ The Tasks are connected to each other as a Directed Acyclic Graph (DAG)
○ Task - The smallest unit of execution; it runs a developer's Program and executes on a single machine
○ Flows - One or more key-value pairs, each mapping a name to the declaration of a DAG of Tasks
○ Flow - The key, chosen from the entries in the Flows definition, that will be run

Krylov ML Pipelines: Workflow

SLIDE 19

{
  "tasks": { ... },
  "flows": {
    "sample_flow": {
      "prepare_data": ["train_model"],
      "train_model": ["output"]
    }
  },
  "flow": "sample_flow"
}

Workflow Example in mlapplication.json
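
Because the selected Flow describes a DAG, a valid execution order can be derived with a topological sort. The Python sketch below does this with the standard-library `graphlib` module (Python 3.9+). It assumes each key in a flow maps a Task to the Tasks that run after it, which is the most natural reading of the example but is not stated on the slide; `execution_order` is a hypothetical helper, not a Krylov API.

```python
from graphlib import CycleError, TopologicalSorter

# Flow definition from the slide; the "tasks" section is elided there and is
# not needed to compute an ordering.
workflow = {
    "flows": {
        "sample_flow": {
            "prepare_data": ["train_model"],
            "train_model": ["output"],
        }
    },
    "flow": "sample_flow",
}


def execution_order(wf):
    """Hypothetical helper: order the Tasks of the selected flow."""
    flow = wf["flows"][wf["flow"]]
    sorter = TopologicalSorter()
    for task, downstream in flow.items():
        sorter.add(task)  # register the task even if it has no predecessors
        for successor in downstream:
            # Assumption: `task` must finish before each listed successor.
            sorter.add(successor, task)
    return list(sorter.static_order())


try:
    print(execution_order(workflow))
except CycleError as err:
    print(f"flow is not a DAG: {err}")
```

For a linear chain such as `sample_flow` there is only one valid order; for wider graphs, Tasks with no dependency between them could be scheduled in parallel.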

SLIDE 20

Workflow Runs Flow

SLIDE 21

A Workspace is an interactive web application that lets developers use a web browser for ML model prototyping, data preparation, and exploration.

The Workspace runs as Jupyter Notebook servers launched on high-CPU/memory or NVIDIA GPU instances.

Krylov enhances the JupyterHub project to allow distributed launching of multi-tenant Jupyter Notebook servers in the Krylov compute cluster using Kubernetes.

A Krylov Workspace uses a configuration file at creation time to override and customize default parameters.

Krylov ML Pipelines: Workspace

SLIDE 22

Workspace Deployment Flow

SLIDE 23

Krylov Compute Cluster

SLIDE 24

Krylov Cluster Infrastructure

SLIDE 25

Krylov Compute Cluster Deployment

SLIDE 26

Metrics - Grafana, InfluxDB, and Telegraf for GPU monitoring

Krylov Cluster Monitoring
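
In a stack like this, metrics typically reach InfluxDB as line-protocol records (`measurement,tags fields timestamp`). The Python sketch below formats a single GPU sample in that shape to show what such a record looks like. The measurement, tag, and field names are hypothetical; the slides do not say which metrics Krylov collects or how Telegraf is configured.

```python
def _escape(text, chars):
    """Backslash-escape the given characters, as line protocol requires."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field(value):
    """Render a field value: booleans as true/false, integers with an 'i' suffix."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    raise TypeError(f"unsupported field type: {type(value).__name__}")


def to_line(measurement, tags, fields, timestamp_ns):
    """Build one line-protocol record with tags sorted by key."""
    tag_part = "".join(
        f",{_escape(k, ',= ')}={_escape(str(v), ',= ')}"
        for k, v in sorted(tags.items())
    )
    field_part = ",".join(
        f"{_escape(k, ',= ')}={_format_field(v)}" for k, v in fields.items()
    )
    return f"{_escape(measurement, ', ')}{tag_part} {field_part} {timestamp_ns}"


# Hypothetical sample: names and values are illustrative only.
line = to_line(
    "gpu",
    {"host": "node-1", "index": "0"},
    {"utilization": 87.5, "memory_used_mb": 10240},
    1520000000000000000,
)
print(line)
```

String field values and backslash handling are left out to keep the sketch short.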

SLIDE 27

Krylov Metrics Management Flow

SLIDE 28

Krylov Compute Resources Management

SLIDE 29

Quickstart Example

SLIDE 30

1. Download the krylovctl program from the Krylov release repository
2. Run `krylovctl project create` to create a new project on the local machine
3. Update or add code to the Krylov project for the machine learning programs
4. Register each of them as a Program within a Task in mlapplication.json
5. Add a new Flow for the defined Tasks to construct the Workflow as a Directed Acyclic Graph (DAG)
6. Run `krylovctl project build` to build the project
7. Run `krylovctl artifact create` to copy the runnables of the program into an artifact file
8. Run `krylovctl artifact upload` to upload the artifact file for remote execution
9. Run `krylovctl job run` for local execution, or `krylovctl job submit` to run it in the computing cluster

Steps to Submit Krylov Workflow Job with CLI
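
The command-line portion of these steps (6 through 9) can be sketched as a small Python driver. It uses only the subcommands named on the slide and defaults to a dry run that prints the commands, since any flags or arguments krylovctl needs are not shown in the deck. The `plan` and `run_steps` helpers are hypothetical, and skipping the upload step for local runs is an assumption based on the slide describing upload as being "for remote execution".

```python
import subprocess

# Subcommands as named on the slide; real invocations may need extra
# arguments that the deck does not show.
BUILD = ["krylovctl", "project", "build"]
CREATE_ARTIFACT = ["krylovctl", "artifact", "create"]
UPLOAD_ARTIFACT = ["krylovctl", "artifact", "upload"]
RUN_LOCAL = ["krylovctl", "job", "run"]
SUBMIT_REMOTE = ["krylovctl", "job", "submit"]


def plan(remote):
    """Return the ordered commands for a local run or a cluster submission."""
    steps = [BUILD, CREATE_ARTIFACT]
    if remote:
        # Assumption: the upload is only needed when running in the cluster.
        steps += [UPLOAD_ARTIFACT, SUBMIT_REMOTE]
    else:
        steps.append(RUN_LOCAL)
    return steps


def run_steps(steps, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in steps:
        print("$ " + " ".join(cmd))
        if not dry_run:
            # check=True stops the sequence at the first failing command.
            subprocess.run(cmd, check=True)


run_steps(plan(remote=True))
```

Calling `run_steps(plan(remote=False), dry_run=False)` would attempt a local run, provided krylovctl is installed and the project has been created and configured as in steps 1 through 5.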

SLIDE 31

Here we go ...

Demo Time

SLIDE 32

Future Roadmap

SLIDE 33

1. Inferencing Platform
2. Exploration and documentation of RESTful APIs for job management
3. Data Source and Dataset abstraction via Krylov SDKs
4. Managed ML Pipelines - Computer Vision, NLP, Machine Translation
5. Distributed Deep Learning
6. AutoML - Hyperparameter Tuning
7. AI Hub to share ML Applications and Datasets

Future Roadmap

SLIDE 34

Questions?