Coresets for Data-efficient Training of Machine Learning Models


SLIDE 1

Coresets for Data-efficient Training of Machine Learning Models

Baharan Mirzasoleiman*, Jeff Bilmes**, Jure Leskovec*

* Stanford University  ** University of Washington

SLIDE 2

Machine Learning Becomes Mainstream

Personalized medicine · Robotics · Finance · Autonomous cars

SLIDE 3

Data is the Fuel for Machine Learning

Example: object detection

[Figure: object detection performance in mAP@[.5, .95] on COCO minival]

SLIDE 4

Problem 1: Training on Large Data is Expensive

Example: training a single deep model for NLP (with NAS) [SGM'19]

  • $3.2M in compute cost
  • 11.4 days of training
  • 5.3x the yearly energy consumption of the average American
  • 5x the lifetime CO2 emissions of a car

SLIDE 5

Problem 2: What We Care About is Underrepresented

Example: self-driving data

[Figure: class distribution in self-driving data: 80%, 14%, 5%, 1%]

SLIDE 6

How can we find the "right" data for efficient machine learning?

SLIDE 7

Setting: Training Machine Learning Models

Training data: {(x_i, y_i), i ∈ V}, where x_i is the feature vector and y_i the label.

Training often reduces to minimizing a regularized empirical risk function:

    w* ∈ arg min_{w∈𝒳} f(w),   f(w) = ∑_{i∈V} f_i(w) + r(w),   f_i(w) = l(w, (x_i, y_i))

where f_i is the loss associated with training example i ∈ V and r(w) is a regularizer.

Examples:

  • Convex f(w): linear regression, logistic regression, ridge regression, regularized support vector machines (SVM)
  • Non-convex f(w): neural networks
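To make the setup concrete, here is a minimal sketch of f(w) for one convex instance, L2-regularized logistic regression; the function name and the default λ are illustrative choices, not from the talk:

```python
import numpy as np

def regularized_empirical_risk(w, X, y, lam=1e-3):
    """f(w) = sum_{i in V} f_i(w) + r(w) for L2-regularized logistic
    regression, with f_i(w) = log(1 + exp(-y_i <w, x_i>)), y_i in {-1, +1}."""
    margins = y * (X @ w)                         # y_i <w, x_i> for every i in V
    per_example = np.logaddexp(0.0, -margins)     # f_i(w), computed stably
    return per_example.sum() + 0.5 * lam * (w @ w)  # add the regularizer r(w)
```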

SLIDE 8

Setting: Training Machine Learning Models

Incremental gradient methods are used to train on large data:

  • Sequentially step along the gradient of the individual functions f_i:

      w_i^k = w_{i−1}^k − α_k ∇f_i(w_{i−1}^k)

  • Consider every ∇f_i(.) as an unbiased estimate of the full gradient ∇f(.) = ∑_{i∈V} ∇f_i(.)
  • These single-example estimates are noisy; therefore, such methods are slow to converge
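A minimal sketch of this update loop, assuming a user-supplied grad_fi(w, i) that returns ∇f_i(w); the function names and the step-size schedule are illustrative:

```python
import numpy as np

def incremental_gradient_descent(grad_fi, w0, n, epochs=10, alpha=0.1):
    """Sequential steps w <- w - alpha_k * grad f_i(w), one example at a time."""
    w = w0.copy()
    for k in range(1, epochs + 1):
        alpha_k = alpha / k                  # a simple decaying step size
        for i in np.random.permutation(n):   # visit each f_i once per epoch
            w = w - alpha_k * grad_fi(w, i)  # step along one example's gradient
    return w
```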

SLIDE 9

Problem: How to Find the "Right" Data for Machine Learning?

[Figure: full dataset V and a small selected subset S*]

  • If we can find the most informative subset S* = arg max_{S⊆V} F(S), s.t. |S| ≤ k, we get a |V|/|S*| speedup by training only on S*
  • What is a good choice for F(S)?

SLIDE 10

Finding S* is Challenging

  1. How to choose an informative subset for training?
     • Points close to the decision boundary vs. a diverse subset?
  2. Finding S* must be fast
     • Otherwise we don't get any speedup
  3. We also need to decide on the step sizes
  4. We need theoretical guarantees
     • For the quality of the trained model
     • For convergence of incremental gradient methods on the subset

SLIDE 11

Our Approach: Learning from Coresets

Training data: {(x_i, y_i), i ∈ V}

Idea: select the smallest subset S* and weights γ that closely estimate the full gradient:

    S* = arg min_{S⊆V, γ_j≥0 ∀j} |S|,  s.t.  max_{w∈𝒳} ‖ ∑_{i∈V} ∇f_i(w) − ∑_{j∈S} γ_j ∇f_j(w) ‖ ≤ ε

Solution: for every w ∈ 𝒳, S* is the set of exemplars of all the data points in the gradient space V′ = {∇f_i(w), i ∈ V}

[Figure: gradients at w; the weighted gradient of S closely tracks the full gradient]

SLIDE 12

Our Approach: Learning from Coresets

How can we find exemplars in big datasets?

  • Exemplar clustering is submodular!

      F(S*) = ∑_{i∈V} min_{j∈S*} ‖∇f_i(w) − ∇f_j(w)‖ ≤ ε

  • Submodularity is a natural diminishing-returns property: ∀ A ⊆ B and B ∌ x : F(A ∪ {x}) − F(A) ≥ F(B ∪ {x}) − F(B)
  • A simple greedy algorithm can find the exemplars S* in large datasets (see the sketch below)

However, S* depends on w! We would have to update S* after every SGD update. Slow! :(
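A sketch of that greedy algorithm for the facility-location form of the exemplar objective, assuming the pairwise gradient-space distances are given as a matrix D; this is the naive loop, not the talk's exact implementation:

```python
import numpy as np

def greedy_facility_location(D, k):
    """Greedy maximization of the submodular facility-location objective
    F(S) = sum_i max_{j in S} sim[i, j], where sim converts the pairwise
    gradient-space distances D into similarities. Returns the exemplars S."""
    sim = D.max() - D                  # nonnegative similarities
    n = sim.shape[0]
    best = np.zeros(n)                 # how well each point i is covered by S so far
    S = []
    for _ in range(k):
        # marginal gain of adding each candidate j (diminishing returns)
        gains = np.maximum(sim, best[:, None]).sum(axis=0) - best.sum()
        j = int(np.argmax(gains))
        S.append(j)
        best = np.maximum(best, sim[:, j])
    return S
```

The naive loop above costs O(n²k); in practice, lazy evaluations or stochastic greedy variants are the standard way to scale such selection to large V.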

SLIDE 13

Our Approach: Learning from Coresets

Can we find a subset S* that bounds the estimation error for all w ∈ 𝒳?

    F(S*) = ∑_{i∈V} min_{j∈S*} ‖∇f_i(w) − ∇f_j(w)‖ ≤ ε

Idea: consider the worst-case approximation of the estimation error over the entire parameter space 𝒳:

    F(S*) = ∑_{i∈V} min_{j∈S*} ‖∇f_i(w) − ∇f_j(w)‖ ≤ ∑_{i∈V} min_{j∈S*} max_{w∈𝒳} ‖∇f_i(w) − ∇f_j(w)‖ ≤ ε

d_ij := max_{w∈𝒳} ‖∇f_i(w) − ∇f_j(w)‖ is an upper bound on the gradient difference over the entire parameter space 𝒳

SLIDE 14

Our Approach: Learning from Coresets

How can we efficiently find the upper bounds d_ij?

  • Convex f(w) (linear/logistic/ridge regression, regularized SVM):

      d_ij ≤ const ⋅ ‖x_i − x_j‖   (feature vectors) [KF'19]

    S* can be found as a preprocessing step

  • Non-convex f(w) (neural networks):

      d_ij ≤ const ⋅ ‖∇_{z_i^(L)} f_i(w) − ∇_{z_j^(L)} f_j(w)‖   (gradient w.r.t. the input to the last layer)

    d_ij is cheap to compute, but we have to update S* during training
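A sketch of the convex-case preprocessing step under this bound; `const` is a problem-dependent constant, assumed given:

```python
import numpy as np
from scipy.spatial.distance import cdist

def pairwise_upper_bounds(X, const=1.0):
    """Convex case: d_ij <= const * ||x_i - x_j||, so pairwise feature
    distances serve as the upper bounds, computed once before training."""
    return const * cdist(X, X)

# Non-convex case (neural networks): replace X with the per-example gradients
# w.r.t. the input to the last layer, and recompute as training progresses.
```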

SLIDE 15

Our Approach: CRAIG

Idea: select a weighted subset that closely estimates the full gradient

[Figure: loss function and gradients of data points i ∈ V over 1 epoch; exemplar weights w = 0.05, 0.1, 0.2, 0.3]

Algorithm:

  • (1) Use greedy to find the set of exemplars S* from the dataset V
  • (2) Weight every element of S* by the size of its corresponding cluster
  • (3) Apply weighted incremental gradient descent on S* (sketched below)
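Putting the three steps together, a sketch of the CRAIG pipeline under the convex-case bound, reusing greedy_facility_location from the Slide 12 sketch; all names and defaults are illustrative, not the talk's exact implementation:

```python
import numpy as np
from scipy.spatial.distance import cdist

def craig_train(X, grad_fi, w0, k, epochs=10, alpha=0.1, const=1.0):
    """CRAIG sketch: (1) greedy exemplars on the d_ij proxy, (2) cluster-size
    weights gamma, (3) weighted incremental gradient descent on S*."""
    D = const * cdist(X, X)                        # d_ij upper-bound proxy (convex case)
    S = greedy_facility_location(D, k)             # step (1), sketched on Slide 12
    assign = np.argmin(D[:, S], axis=1)            # nearest exemplar for every point
    gamma = np.bincount(assign, minlength=len(S))  # step (2): cluster sizes as weights
    w = w0.copy()
    for t in range(1, epochs + 1):                 # step (3): weighted IG descent
        alpha_t = alpha / t                        # decaying step size
        for idx, j in enumerate(S):
            w = w - alpha_t * gamma[idx] * grad_fi(w, j)
    return w
```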

SLIDE 16

Our Approach: CRAIG

Weighted incremental gradient descent on the subset S ⊆ V of exemplars in the gradient space.

Theorem: For a μ-strongly convex loss function, CRAIG with a decaying step size converges to a 2ε/μ neighborhood of the optimal solution, at a rate of 𝒪(1/k^τ), τ < 1.

We get up to a |V|/|S| speedup!

SLIDE 17

Existing Techniques

Speeding up stochastic gradient methods:

  • Variance reduction techniques [JZ'13, DB'14, A'18]
  • Choosing better step sizes [KB'14, DHS'11, Z'12]
  • Importance sampling [NSW'13, ZZ'14, KF'18]

CRAIG is complementary to all the above methods.

SLIDE 18

Application of CRAIG to Logistic Regression

Training on subsets of size 10% of Covtype (581K points)

Up to 6x faster than training on the full data, with the same accuracy.

SLIDE 19

Application of CRAIG to Logistic Regression

Training on subsets of various sizes of Ijcnn1 (50K points, imbalanced)

[Figure: CRAIG vs. SGD on all data, for subset sizes of 10% to 90%]

Up to 7x faster than training on the full data, with the same accuracy.

SLIDE 20

Application of CRAIG to Neural Networks

Training a 2-layer neural network on MNIST (50K points)

2x to 3x faster than training on the full data, with better generalization.

SLIDE 21

Application of CRAIG to Deep Networks

Training ResNet20 on subsets of various sizes of CIFAR10 (50K points)

CRAIG is data-efficient.

SLIDE 22

Summary

  • We developed the first rigorous method for data-efficient training of general machine learning models
  • Converges to a near-optimal solution
  • Similar convergence rate as incremental gradient methods
  • Speeds up training by up to 7x for logistic regression and 3x for deep neural networks

Come to our poster for more details!