NSML : A Machine Learning Platform That Enables You Focus on Your - - PowerPoint PPT Presentation

NSML : A Machine Learning Platform That Enables You Focus on Your Models. ML-Sys WS 2017 @ NIPS Nako Sung , Minkyu Kim, Hyunwoo Jo, Youngil Yang, Jinwoong Kim, Leonard Lausen, Youngkwan Kim, Gayoung Lee, Donghyun Kwak, Jung-Woo Ha, and Sunghun


slide-1
SLIDE 1

NSML: A Machine Learning Platform That Enables You Focus on Your Models.

ML-Sys WS 2017 @ NIPS

Nako Sung, Minkyu Kim, Hyunwoo Jo, Youngil Yang, Jinwoong Kim, Leonard Lausen, Youngkwan Kim, Gayoung Lee, Donghyun Kwak, Jung-Woo Ha, and Sunghun Kim

CLOVA AI Research (CLAIR), NAVER | LINE, Search Solution, NAVER Webtoon, HKUST

slide-2
SLIDE 2

What is NSML?

  • A machine learning platform that enables you to focus on your models
  • Two options: on-premise / PaaS
slide-3
SLIDE 3

https://xkcd.com/303/

slide-4
SLIDE 4

https://www.youtube.com/watch?v=lxZyxxHOw3Y

slide-5
SLIDE 5

https://www.youtube.com/watch?v=lxZyxxHOw3Y

Wasted Time

slide-6
SLIDE 6

https://www.formula1.com/en/latest/features/2017/2/F1-cars-of-2017.html

slide-7
SLIDE 7

https://www.formula1.com/en/latest/features/2017/2/F1-cars-of-2017.html

Importance of Fast Machines (Multiple Servers and GPUs)

slide-8
SLIDE 8

https://www.sportskeeda.com/f1/what-happens-during-f1-pit-stop

slide-9
SLIDE 9

https://www.sportskeeda.com/f1/what-happens-during-f1-pit-stop

ML Research Challenges: Incidental Tasks

slide-10
SLIDE 10

[Figure: 16 GPUs, 2 busy and 14 idle, alongside four heavy models]

slide-11
SLIDE 11

ML Research Challenges: Resource Scheduling and Utilization

[Figure: 16 GPUs, 2 busy and 14 idle, alongside four heavy models]

14 GPUs are available, but only 7 of them can be used on any single machine.
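
The fragmentation problem on this slide can be made concrete with a small sketch. It assumes, as the figure suggests, two machines with 7 idle GPUs each, and it assumes a job cannot span machines. The function name `place_job` and the first-fit policy are illustrative only, not NSML's scheduler.

```python
def place_job(gpus_needed, free_gpus_per_machine):
    """Return the index of the first machine with enough free GPUs, else None.

    First-fit placement, assuming one job must fit on a single machine.
    """
    for index, free in enumerate(free_gpus_per_machine):
        if free >= gpus_needed:
            return index
    return None


# Assumed layout read off the slide: two machines, 7 idle GPUs on each.
free_gpus = [7, 7]
total_free = sum(free_gpus)

fits_seven = place_job(7, free_gpus)   # fits on the first machine
fits_eight = place_job(8, free_gpus)   # no single machine has 8 free GPUs
```

Under these assumptions an 8-GPU job cannot be placed even though 14 GPUs are idle in total, which is the utilization gap the slide points at.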

slide-12
SLIDE 12

https://livingthing.danmackinlay.name/automl.html

slide-13
SLIDE 13

ML Research Challenges: Hyperparameter Tuning

https://livingthing.danmackinlay.name/automl.html

slide-14
SLIDE 14

Tools: TensorBoard, Visdom

[Figure: four sessions. Status: DONE, DONE, TRAINING, TRAINING. Hyperparameters: γ=0.1; γ=0.2; γ=0.3, K=1; γ=1e-2]

slide-15
SLIDE 15

ML Research Challenges: Multiple Experiments

Tools: TensorBoard, Visdom

[Figure: four sessions. Status: DONE, DONE, TRAINING, TRAINING. Hyperparameters: γ=0.1; γ=0.2; γ=0.3, K=1; γ=1e-2]

slide-16
SLIDE 16

https://www.linkedin.com/pulse/protecting-workers-who-work-alone-sandie-baillargeon

slide-17
SLIDE 17

ML Research Challenges: Isolated Researchers

https://www.linkedin.com/pulse/protecting-workers-who-work-alone-sandie-baillargeon

slide-18
SLIDE 18

Challenges

  • Slack
  • Incidental Tasks
  • Inefficient resource utilization
  • Naive hyperparameter tuning
  • Painful tracking of multiple sessions
  • Isolated researchers
slide-19
SLIDE 19

Requirements of ML Platforms

  • Resource Management
      • Better computational resource management
  • Data Management
      • Post datasets once and reuse them for multiple models
      • Share datasets with others
  • Serverless Configuration
      • No framework / library lock-in
      • Easy and lightweight task submission
slide-20
SLIDE 20

Requirements of ML Platforms

  • Experiment Management and Visualization
      • Parallel runs with different job priorities
      • Automatic visualization and summarization of learning progress
  • Leaderboard
      • Leaderboard for each dataset to compare models and hyperparameters
  • AutoML
      • Prediction of experiment performance based on previously run experiments
      • Automatic hyperparameter optimization based on those performance predictions
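
The AutoML requirement above (predict performance from past experiments, then choose hyperparameters from the predictions) can be sketched in a few lines. This is a toy under stated assumptions: the predictor is a 1-nearest-neighbour lookup in log learning-rate space, the history values are made up for illustration, and the names `predict_score` and `pick_next` are hypothetical. The slides do not say which predictor NSML uses.

```python
import math


def predict_score(candidate_lr, history):
    """Predict a score as that of the nearest past run in log10(learning rate)."""
    nearest = min(
        history,
        key=lambda run: abs(math.log10(run[0]) - math.log10(candidate_lr)),
    )
    return nearest[1]


def pick_next(candidates, history):
    """Choose the candidate with the highest predicted score."""
    return max(candidates, key=lambda lr: predict_score(lr, history))


# Made-up (learning rate, accuracy) pairs standing in for previous experiments.
history = [(0.1, 0.62), (0.01, 0.91), (0.001, 0.84)]
candidates = [0.3, 0.02, 0.0005]

best_lr = pick_next(candidates, history)
```

A real system would replace the nearest-neighbour lookup with a proper surrogate model, but the loop (predict, then select) keeps the same shape.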
slide-21
SLIDE 21

Limitations of Previous Solutions

  • Vendor lock-in (Cloud service)
  • Inefficient model experiments
  • Inconsistent research environments
  • Still hard to keep track of experiments
slide-22
SLIDE 22

This work was done for NCSoft and was presented at Nvidia GTC Korea 2015.

MINI

slide-23
SLIDE 23

My Previous Work in Early 2015

This work was done for NCSoft and was presented at Nvidia GTC Korea 2015.

MINI

slide-24
SLIDE 24
slide-25
SLIDE 25

URI

  • Every dataset, session, and model has a uniform resource identifier (URI).


{Dataset} / {User id} / {Session id} / {Model id}

CIFAR_10: CIFAR 10 dataset
CIFAR_10/researcher_A/24: researcher_A’s 24th session for CIFAR_10
CIFAR_10/researcher_A/24/322: Snapshot from epoch 322
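
A minimal sketch of how such an identifier could be split into its parts, following the {Dataset} / {User id} / {Session id} / {Model id} pattern on this slide. The names `NsmlUri` and `parse_uri` are hypothetical, and accepting any prefix of one to four segments is an assumption rather than documented NSML behaviour.

```python
from typing import NamedTuple, Optional


class NsmlUri(NamedTuple):
    """Parts of a dataset/user/session/model identifier (hypothetical type)."""
    dataset: str
    user: Optional[str] = None
    session: Optional[str] = None
    model: Optional[str] = None


def parse_uri(uri: str) -> NsmlUri:
    """Split an identifier on '/' and reject empty or surplus segments."""
    parts = [part.strip() for part in uri.split("/")]
    if not 1 <= len(parts) <= 4 or not all(parts):
        raise ValueError(f"not a valid identifier: {uri!r}")
    return NsmlUri(*parts)


dataset = parse_uri("CIFAR_10")
session = parse_uri("CIFAR_10/researcher_A/24")
snapshot = parse_uri("CIFAR_10/researcher_A/24/322")
```

Because every level is a prefix of the next, a snapshot identifier also names its session and dataset.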

slide-26
SLIDE 26

Easy One-Liner CLI

slide-27
SLIDE 27

Dataset registration

Easy One-Liner CLI

slide-28
SLIDE 28

Dataset registration Train

Easy One-Liner CLI

slide-29
SLIDE 29

Dataset registration Train Serve

Easy One-Liner CLI

slide-30
SLIDE 30

Parallel Experiments to Kill Slack

Distributed responses

[Figure: timeline with Exp. #1; Exp. #2, variation 1; Exp. #2, variation 2; and Exp. #3 plotted against time]
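
The idea of running several experiments and variations at once, rather than waiting on each in turn, can be sketched with Python's standard thread pool. The experiment names mirror the figure. `run_experiment` and its made-up "loss" formula are placeholders for a real training job, not anything NSML provides.

```python
from concurrent.futures import ThreadPoolExecutor


def run_experiment(config):
    """Placeholder experiment: returns a made-up loss computed from the config."""
    name, learning_rate = config
    return name, round((learning_rate - 0.01) ** 2, 6)


# Hypothetical configurations named after the experiments in the figure.
configs = [
    ("exp1", 0.1),
    ("exp2-variation1", 0.01),
    ("exp2-variation2", 0.02),
    ("exp3", 0.001),
]

# Submit all experiments at once; results come back in submission order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(run_experiment, configs))
```

While one run is in progress the others are already running, so the time the researcher spends waiting shrinks.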

slide-31
SLIDE 31

https://www.interaction-design.org/literature/book/the-encyclopedia-of-human-computer-interaction-2nd-ed/data-visualization-for-human-perception

Need to Visualize

  • Balance your brain to understand without effort
slide-32
SLIDE 32

Flexible Analysis

[Figure: NSML, a visualization tool, and three sessions labeled Your code @1, Your code @2, and Your code @3. Status labels: DONE, TRAINING, TRAINING]

slide-33
SLIDE 33
slide-34
SLIDE 34
slide-35
SLIDE 35
slide-36
SLIDE 36
slide-37
SLIDE 37

NSML

Dynamic Control Flow

Typical training loop: forward pass, backward pass, communicate to NSML

Command queue

  1. model                          Watch a variable
  2. change_lr(0.2)                 Change a hyperparameter on the fly
  3. nsml.save('quick')             Save the current snapshot
  4. nsml.load(424)                 Load a saved snapshot
  5. vis.image(model.generate(2))   Generate an image and send it to Visdom
  …
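
The pattern on this slide, a training loop that checks a command queue after each forward and backward pass, can be sketched as follows. This is not NSML's API: the queue, the `train` function, and the `state` dictionary are hypothetical, and `change_lr` only borrows its name from the slide. In a real session the command would arrive from outside while training runs; here it is queued up front to keep the sketch deterministic.

```python
import queue


def train(num_steps, commands, state):
    """Toy training loop that drains a command queue after every step."""
    for step in range(num_steps):
        # Stand-in for the forward and backward passes.
        state["weight"] -= state["lr"] * 1.0
        state["step"] = step + 1
        # "Communicate": apply every command queued since the last step.
        while True:
            try:
                command = commands.get_nowait()
            except queue.Empty:
                break
            command(state)
    return state


def change_lr(new_lr):
    """Build a command that changes the learning rate on the fly."""
    def apply(state):
        state["lr"] = new_lr
    return apply


commands = queue.Queue()
commands.put(change_lr(0.2))
state = train(3, commands, {"lr": 0.1, "weight": 1.0, "step": 0})
```

The first step runs with the original learning rate and the remaining steps use the new one, without restarting the session.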

slide-38
SLIDE 38
slide-39
SLIDE 39
slide-40
SLIDE 40

CLI

  • Base of advanced features like save, load, infer, …
slide-41
SLIDE 41
  • (Almost) Nothing to learn
  • Cached (Fast)

Bring Your Own Workspace

slide-42
SLIDE 42
  • (Almost) Nothing to learn
  • Cached (Fast)

Bring Your Own Workspace

slide-43
SLIDE 43

No Framework Lock-in

slide-44
SLIDE 44

GPU server 10.0.0.1 python your_model.py

stdout

Interactive Mode

slide-45
SLIDE 45

GPU server 10.0.0.1 python your_model.py

stdout

Interactive Mode

slide-46
SLIDE 46
slide-47
SLIDE 47

Pragmatic Research

slide-48
SLIDE 48
slide-49
SLIDE 49
slide-50
SLIDE 50
slide-51
SLIDE 51
slide-52
SLIDE 52
slide-53
SLIDE 53
slide-54
SLIDE 54

Collaboration and Competition Leaderboard, CI-ML

slide-55
SLIDE 55

Collaboration and Competition Leaderboard, CI-ML

New Workflow for ML Research

slide-56
SLIDE 56

Collaborative Research

  • Easy to reproduce and extend others’ research.
slide-57
SLIDE 57

Collaborative Research

  • Easy to reproduce and extend others’ research.
slide-58
SLIDE 58

Cohesive and Competitive

  • Models are ranked automatically
  • Dataset-centric environment
  • Standardized and quantified
  • Easy to compete
  • Towards AutoML

slide-59
SLIDE 59

Cohesive and Competitive

  • Models are ranked automatically
  • Dataset-centric environment
  • Standardized and quantified
  • Easy to compete
  • Towards AutoML

slide-60
SLIDE 60

AutoML

  • Quantitative model analysis turns the ML workflow into a gym for AutoML
slide-61
SLIDE 61

Seamless Connection to Services

Dataset: ASR

  • Bob’s model 12: 98.2%
  • Bob’s model 13: 94.2%
  • Alice’s model 4: 92.1%

SOTA server
REST API: https://service.nsml.navercorp.com/ASR

slide-62
SLIDE 62

Seamless Connection to Services

Dataset: ASR

  • Bob’s model 12: 98.2%
  • Bob’s model 13: 94.2%
  • Alice’s model 4: 92.1%
  • Alice’s model 5: 98.3%

SOTA server
REST API: https://service.nsml.navercorp.com/ASR
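
One reading of these two slides is that the SOTA server always serves whichever model currently tops the dataset's leaderboard, so a newly submitted model that scores higher replaces the previous one. A minimal sketch of that selection, using the scores shown on the slides, is below. The function names are hypothetical, and how NSML actually switches the served model is not described in the deck.

```python
def rank(entries):
    """Sort (model name, score) pairs from best to worst score."""
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


def current_sota(entries):
    """Return the top-ranked entry, i.e. the model a SOTA server would expose."""
    return rank(entries)[0]


# Scores as listed on the slides for the ASR dataset.
leaderboard = [
    ("Bob's model 12", 98.2),
    ("Bob's model 13", 94.2),
    ("Alice's model 4", 92.1),
]
before = current_sota(leaderboard)

# The later slide adds one more submission.
leaderboard.append(("Alice's model 5", 98.3))
after = current_sota(leaderboard)
```

Clients calling the REST endpoint would not need to change anything when the top model changes, which is presumably the sense of "seamless" here.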

slide-63
SLIDE 63
  • Q1. 2018
slide-64
SLIDE 64

Thank you

https://research.clova.ai/nsml-alpha

Several hundred GPUs are available for this alpha (free)