Continuum: A Platform for Cost-Aware, Low-Latency Continual Learning



SLIDE 1

Continuum

A Platform for Cost-Aware, Low-Latency Continual Learning

Huangshi Tian, Minchen Yu, Wei Wang @ HKUST, Oct 11, 2018

SLIDE 2
Continual/Online vs. Batch/Offline Learning

When fresh data arrive,

  • offline / batch learning trains the model from scratch with all historical data:
    historical data + fresh data → updated model

  • online / continual learning updates the stale model with only the fresh data:
    stale model + fresh data → updated model
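The contrast above can be sketched with a toy "model" (a running mean); this is only an illustration of the two update styles, not Continuum's API:

```python
# Toy illustration: offline retraining rebuilds the model from ALL data,
# while online updating folds only the fresh data into the stale model.

def offline_retrain(historical_data, fresh_data):
    """Train from scratch with all historical + fresh data."""
    all_data = historical_data + fresh_data
    return sum(all_data) / len(all_data)

def online_update(stale_model, n_seen, fresh_data):
    """Incrementally incorporate only the fresh data."""
    total = stale_model * n_seen + sum(fresh_data)
    return total / (n_seen + len(fresh_data))

history = [1.0, 2.0, 3.0]
fresh = [4.0, 5.0]

batch_model = offline_retrain(history, fresh)
stale_model = sum(history) / len(history)
incr_model = online_update(stale_model, len(history), fresh)

# Both routes reach the same model here, but the online route never
# touches the historical data again.
assert abs(batch_model - incr_model) < 1e-9
```

For a running mean the two routes coincide exactly; for real models (e.g., LDA) online updating is an approximation that trades a little quality for far less work.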

SLIDE 3

Scenario: Users continuously generate tweets; we deploy topic models to detect new topics; the topic models are continually updated with new data.

[Diagram: users generate tweets, which flow through data servers into the Continual Learning System and out to prediction servers]

Case Study: Topic Monitoring

Setting: AWS EC2 (c5.4xlarge instance); Latent Dirichlet Allocation (LDA) and a dataset of real-world tweets.

SLIDE 4

Case Study: Topic Monitoring

Results: Perplexity measures model quality (lower means better). Incorporating fresh data improves model quality, and online updating takes much less time than offline retraining.

SLIDE 5

Advantage of Online Learning

  • better performance: quickly exploits data recency to improve model quality; consumes less hardware resources

  • wide application in industry: recommendation, contextual decision making, click-through rate prediction, online advertising

SLIDE 6

Why do we need a platform?

  • no support from mainstream learning systems; ad-hoc scripts become the status quo

"This becomes particularly challenging when data changes over time and fresh models need to be produced continuously. Unfortunately, such orchestration is often done ad hoc using glue code and custom scripts developed by individual teams for specific use cases, leading to duplicated effort and fragile systems with high technical debt." — Google

SLIDE 7

Why do we need a platform?

wasted effort in (re)implementing the training loop

Lines of Code in Case Studies:

  Application         Training Loop   Model Updating
  Topic Monitoring    377             56
  Friend Suggestion   211             41
  Click Prediction    558             44

SLIDE 8

In need of a general-purpose, automated solution for continual learning, we present

Continuum


SLIDE 9

System Overview

  • automated: streamlines the process of online learning
  • general-purpose: applicable to heterogeneous ML frameworks and systems
  • lightweight: a thin layer on top of existing systems

SLIDE 10

Overall Workflow



SLIDE 12

When to Retrain Models?

Setting: As data keep arriving, Continuum determines when to retrain models.

SLIDE 15

When to Retrain Models?

Setting: As data keep arriving, Continuum determines when to retrain models.

Objectives:
  • better model quality → minimize data incorporation latency
  • less hardware cost → minimize training cost (i.e., machine time)
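To make the latency objective concrete, here is a hedged sketch (names and exact semantics are assumptions, not taken from the slides): a sample's incorporation latency is the delay from its arrival until the finish of the first training run that starts at or after that arrival.

```python
import bisect

def total_incorporation_latency(arrivals, runs):
    """arrivals: sorted sample arrival times.
    runs: (start, finish) training runs, sorted by start time.
    A sample is incorporated by the first run starting at or after
    its arrival; its latency is that run's finish minus the arrival."""
    starts = [s for s, _ in runs]
    total = 0.0
    for t in arrivals:
        i = bisect.bisect_left(starts, t)
        if i == len(runs):
            continue  # sample never incorporated within this trace
        total += runs[i][1] - t
    return total

# Samples at t=0 and t=1 are picked up by the run [2, 3];
# the sample at t=4 waits for the run [6, 8].
latency = total_incorporation_latency([0, 1, 4], [(2, 3), (6, 8)])
assert latency == 9.0  # (3-0) + (3-1) + (8-4)
```

The retraining policies below differ only in where they place the `(start, finish)` runs; this metric then scores them.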

SLIDE 16

Scenario I: Seeking Fast Data Incorporation

Naive Approach: Continuous Update

SLIDE 20

Scenario I: Seeking Fast Data Incorporation

Naive Approach: Continuous Update
Proposed Approach: Best-Effort Policy

SLIDE 23

Scenario I: Seeking Fast Data Incorporation

Naive Approach: Continuous Update
Proposed Approach: Best-Effort Policy
Potential Problem: high training cost, because the machine is always occupied
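A minimal simulation of the best-effort idea, under assumed semantics (the slides do not give pseudocode): start a new training run as soon as the trainer is idle and unincorporated data exist, batching everything that arrived during the previous run.

```python
# Illustrative best-effort scheduler (assumed semantics, fixed run length).

def best_effort_schedule(arrivals, train_time):
    """arrivals: sorted data arrival times.
    train_time: fixed duration of one training run.
    Returns a list of (start, finish, batch_size) runs."""
    runs, i, now = [], 0, 0.0
    while i < len(arrivals):
        start = max(now, arrivals[i])   # idle-wait only if nothing is pending
        # batch every sample that has arrived by the start of this run
        j = i
        while j < len(arrivals) and arrivals[j] <= start:
            j += 1
        finish = start + train_time
        runs.append((start, finish, j - i))
        i, now = j, finish
    return runs

runs = best_effort_schedule([0.0, 0.5, 1.2, 3.5], train_time=1.0)
assert runs == [(0.0, 1.0, 1), (1.0, 2.0, 1), (2.0, 3.0, 1), (3.5, 4.5, 1)]
```

Because the trainer never sits idle while data wait, incorporation latency is low, but the machine is occupied almost continuously, which is exactly the cost problem the slide raises.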

SLIDE 24

Scenario II: Saving Cost of Training

Naive Approach: Periodic Update
Proposed Approach: Cost-Aware Policy
  • a regret-based online algorithm
  • jointly optimizes the weighted sum of latency and training cost
  • proven to be 2-competitive (never worse than twice the offline optimum)
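The slide does not show the exact rule, but a ski-rental-style sketch in the same spirit is: retrain once the accumulated incorporation latency of pending data outweighs the estimated training cost. Balancing the two costs at the trigger point is the classic break-even argument behind 2-competitive bounds; all names below are illustrative.

```python
# Sketch of a regret-based trigger (assumed, not the paper's exact rule).

def should_retrain(pending_arrivals, now, est_train_cost, weight=1.0):
    """pending_arrivals: arrival times of samples not yet incorporated.
    Trigger retraining once the weighted accumulated latency of the
    pending data reaches the estimated training cost (machine time)."""
    accumulated_latency = sum(now - t for t in pending_arrivals)
    return weight * accumulated_latency >= est_train_cost

# Little data has waited only briefly: keep accumulating (saves cost).
assert not should_retrain([9.0, 9.5], now=10.0, est_train_cost=5.0)
# Waiting cost now rivals the training cost: time to retrain.
assert should_retrain([4.0, 6.0], now=10.0, est_train_cost=5.0)
```

The `weight` knob expresses the latency/cost trade-off from the objectives: a larger weight triggers retraining sooner, approaching best-effort behavior.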

slide-25
SLIDE 25

Experimental Setting

Testbed: AWS EC2 (c5.4xlarge instance)
Applications:
  • Latent Dirichlet Allocation (LDA) from Mallet + Twitter dataset
  • Gradient-Boosted Decision Tree (GBDT) from XGBoost + Criteo click dataset
  • Personalized PageRank (PPR) + Twitter user dataset
Methodology: Replay data generation and update models under different policies.
Metrics:
  • incorporation latency of all data samples
  • training cost measured by machine time

SLIDE 26

Compared with Continuous Update, the Best-Effort Policy reduces latency by up to 15.2%. Compared with Periodic Update, the Cost-Aware Policy reduces latency by up to 28% and saves hardware cost by up to 32%.

Evaluation of Proposed Policies


SLIDE 27

Continuum achieves high efficiency in responding to requests and deciding when to update models, linear scalability up to a 20-node cluster, and low overhead imposed on the backend.

Evaluation of Implemented System


SLIDE 28

Conclusion

  • motivate the need for an online learning platform
  • design and implement Continuum
  • propose two policies for fast data incorporation and low cost

SLIDE 29

Source code available at Thanks for your attention!


SLIDE 30

Customized Policy

For users who want to decide when to retrain on their own, we provide two mechanisms.

  • REST API to trigger retraining
    Users can leverage external information (cluster usage, model monitors).
    Example: when model quality drops below a threshold, retrain the model.

  • abstract policy class for extension
    Users can access internal information (data amount, estimated training time) and implement their own decision logic.
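The extension mechanism might look like the following sketch; the class and method names here are hypothetical, not Continuum's actual interface.

```python
# Hypothetical policy-extension sketch (illustrative names only).
from abc import ABC, abstractmethod

class RetrainPolicy(ABC):
    """Custom policies subclass this and decide, from internal state
    exposed by the platform, whether to trigger retraining now."""

    @abstractmethod
    def should_retrain(self, data_amount, est_train_time):
        ...

class ThresholdPolicy(RetrainPolicy):
    """Example decision logic: retrain once enough fresh samples
    have accumulated."""

    def __init__(self, min_samples):
        self.min_samples = min_samples

    def should_retrain(self, data_amount, est_train_time):
        return data_amount >= self.min_samples

policy = ThresholdPolicy(min_samples=1000)
assert policy.should_retrain(data_amount=1500, est_train_time=30.0)
assert not policy.should_retrain(data_amount=200, est_train_time=30.0)
```

The REST mechanism is complementary: an external monitor (e.g., watching model quality) would call the trigger endpoint instead of subclassing anything.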

SLIDE 31

Backend Abstraction

Continuum communicates with backends through an RPC layer. The following interface abstracts away the heterogeneity of learning frameworks and systems.
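The slide's actual interface definition is not reproduced in this transcript; below is a hedged sketch of what such a backend abstraction could look like, with all names assumed for illustration.

```python
# Hypothetical backend abstraction (illustrative, not Continuum's RPC schema).
from abc import ABC, abstractmethod

class Backend(ABC):
    """Uniform facade over heterogeneous learning frameworks."""

    @abstractmethod
    def train(self, data_paths):
        """Retrain from scratch; return a handle to the new model."""

    @abstractmethod
    def update(self, model_handle, fresh_data_paths):
        """Incrementally fold fresh data into an existing model."""

class EchoBackend(Backend):
    """Stub backend used only to show the call pattern; a real backend
    would wrap e.g. Mallet, XGBoost, or a PPR engine behind RPC."""

    def train(self, data_paths):
        return {"version": 1, "trained_on": list(data_paths)}

    def update(self, model_handle, fresh_data_paths):
        return {"version": model_handle["version"] + 1,
                "trained_on": model_handle["trained_on"] + list(fresh_data_paths)}

backend = EchoBackend()
m = backend.train(["day1.csv"])
m = backend.update(m, ["day2.csv"])
assert m["version"] == 2 and m["trained_on"] == ["day1.csv", "day2.csv"]
```

Keeping only these two calls in the contract is what lets the retraining policies stay framework-agnostic: the scheduler decides *when*, the backend decides *how*.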