

SLIDE 1

Analysis of Distributed Learning Algorithms

Ding-Xuan Zhou
City University of Hong Kong
E-mail: mazhou@cityu.edu.hk
Supported in part by Research Grants Council of Hong Kong


November 5, 2016

SLIDE 2

Outline of the Talk

  • I. Distributed learning with big data
  • II. Least squares regression and regularization
  • III. Distributed learning with regularization schemes
  • IV. Optimal rates for regularization
  • V. Other distributed learning algorithms
  • VI. Further topics

SLIDE 3
  • I. Distributed learning with big data

Big data leads to scientific challenges: storage bottleneck, algorithmic scalability, ...

Distributed learning: based on a divide-and-conquer approach. A distributed learning algorithm consists of three steps:
(1) partitioning the data into disjoint subsets
(2) applying a learning algorithm implemented on an individual machine or processor to each data subset to produce an individual output
(3) synthesizing a global output by utilizing some average of the individual outputs

Advantages: reducing the memory and computing costs to handle big data

SLIDE 4

If we divide a sample $D = \{(x_i, y_i)\}_{i=1}^N$ of input-output pairs into disjoint subsets $\{D_j\}_{j=1}^m$, applying a learning algorithm to the much smaller data subset $D_j$ gives an output $f_{D_j}$, and the global output might be
$$\bar{f}_D = \frac{1}{m} \sum_{j=1}^m f_{D_j}.$$

The distributed learning method has been observed to be very successful in many practical applications. This raises a challenging theoretical question: if we had a "big machine" which could implement the same learning algorithm on the whole data set $D$ to produce an output $f_D$, could $\bar{f}_D$ be as efficient as $f_D$?

Recent work: Zhou-Chawla-Jin-Williams, Zhang-Duchi-Wainwright, Shamir-Srebro, ...
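A minimal sketch of this divide-and-conquer averaging, with ordinary least squares as an assumed stand-in for the base learner; the partition, per-subset fits, and averaging mirror steps (1)-(3) of the previous slide, and the data are purely illustrative:

```python
import numpy as np

def fit_ols(X, y):
    """Assumed base learner: ordinary least squares via the pseudoinverse."""
    return np.linalg.pinv(X) @ y

def distributed_fit(X, y, m):
    """Divide-and-conquer: split D into m disjoint subsets, fit each, average."""
    parts = np.array_split(np.arange(len(y)), m)        # (1) disjoint partition
    local = [fit_ols(X[idx], y[idx]) for idx in parts]  # (2) local outputs f_{D_j}
    return np.mean(local, axis=0)                       # (3) global average

# Illustrative data: N = 10000 samples in R^5
rng = np.random.default_rng(0)
X = rng.standard_normal((10000, 5))
w_true = rng.standard_normal(5)
y = X @ w_true + 0.1 * rng.standard_normal(10000)

w_bar = distributed_fit(X, y, m=10)    # distributed estimate
w_full = fit_ols(X, y)                 # "big machine" estimate
print(np.linalg.norm(w_bar - w_full))  # typically very close
```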

SLIDE 5
  • II. Least squares regression and regularization

II.1. Model for the least squares regression. Learn $f : X \to Y$ from a random sample $D = \{(x_i, y_i)\}_{i=1}^N$.

Take $X$ to be a compact metric space and $Y = \mathbb{R}$: $y \approx f(x)$. Due to noise or other uncertainty, we assume an (unknown) probability measure $\rho$ on $Z = X \times Y$ governs the sampling.

Marginal distribution $\rho_X$ on $X$: $x = \{x_i\}_{i=1}^N$ is drawn according to $\rho_X$. Conditional distribution $\rho(\cdot|x)$ at $x \in X$.

Learning the regression function:
$$f_\rho(x) = \int_Y y \, d\rho(y|x), \qquad y_i \approx f_\rho(x_i).$$
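As a concrete (assumed) instance of this sampling model: draw $x_i$ from $\rho_X$ and then $y_i$ from $\rho(\cdot|x_i)$ whose mean is $f_\rho(x_i)$; a short sketch with an illustrative choice of $f_\rho$ and noise:

```python
import numpy as np

rng = np.random.default_rng(1)
f_rho = np.sin                 # assumed regression function on X = [0, 2*pi]

N = 200
x = rng.uniform(0, 2 * np.pi, N)             # x_i drawn from rho_X (here uniform)
y = f_rho(x) + 0.2 * rng.standard_normal(N)  # y_i ~ rho(.|x_i) with mean f_rho(x_i)
# E[y|x] = f_rho(x), so y_i ≈ f_rho(x_i) up to zero-mean noise
```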

SLIDE 6

II.2. Error decomposition and ERM

$\mathcal{E}^{ls}(f) = \int_Z (f(x) - y)^2 \, d\rho$ is minimized by $f_\rho$:
$$\mathcal{E}^{ls}(f) - \mathcal{E}^{ls}(f_\rho) = \|f - f_\rho\|^2_{L^2_{\rho_X}} =: \|f - f_\rho\|^2_\rho \ge 0.$$

Classical approach of Empirical Risk Minimization (ERM). Let $\mathcal{H}$ be a compact subset of $C(X)$ called the hypothesis space (model selection). The ERM algorithm is given by
$$f_D = \arg\min_{f \in \mathcal{H}} \mathcal{E}^{ls}_D(f), \qquad \mathcal{E}^{ls}_D(f) = \frac{1}{N} \sum_{i=1}^N (f(x_i) - y_i)^2.$$

Target function $f_{\mathcal{H}}$: best approximation of $f_\rho$ in $\mathcal{H}$:
$$f_{\mathcal{H}} = \arg\min_{f \in \mathcal{H}} \mathcal{E}^{ls}(f) = \arg\inf_{f \in \mathcal{H}} \int_Z (f(x) - y)^2 \, d\rho.$$
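A minimal sketch of ERM for least squares over a simple (assumed) compact hypothesis space, here a finite grid of linear functions $f_a(x) = ax$ with $|a| \le 1$:

```python
import numpy as np

def erm_least_squares(x, y, hypotheses):
    """Return the hypothesis minimizing the empirical risk E^ls_D(f)."""
    risks = [np.mean((f(x) - y) ** 2) for f in hypotheses]
    return hypotheses[int(np.argmin(risks))]

# Assumed hypothesis space H: f_a(x) = a * x, a on a grid in [-1, 1]
H = [lambda x, a=a: a * x for a in np.linspace(-1, 1, 201)]

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 500)
y = 0.7 * x + 0.1 * rng.standard_normal(500)   # here f_rho(x) = 0.7 x
f_D = erm_least_squares(x, y, H)               # empirical risk minimizer over H
```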

SLIDE 7

II.3. Approximation error

Analysis: $\|f_D - f_\rho\|^2_{L^2_{\rho_X}} = \int_X (f_D(x) - f_\rho(x))^2 \, d\rho_X$ is bounded by
$$2 \sup_{f \in \mathcal{H}} \left| \mathcal{E}^{ls}_D(f) - \mathcal{E}^{ls}(f) \right| + \left\{ \mathcal{E}^{ls}(f_{\mathcal{H}}) - \mathcal{E}^{ls}(f_\rho) \right\}.$$

Approximation error. Smale-Zhou (Anal. Appl. 2003):
$$\mathcal{E}^{ls}(f_{\mathcal{H}}) - \mathcal{E}^{ls}(f_\rho) = \|f_{\mathcal{H}} - f_\rho\|^2_{L^2_{\rho_X}} = \inf_{f \in \mathcal{H}} \int_X (f(x) - f_\rho(x))^2 \, d\rho_X,$$
so $f_{\mathcal{H}} \approx f_\rho$ when $\mathcal{H}$ is rich.

Theorem 1. Let $B$ be a Hilbert space (such as a Sobolev space or a reproducing kernel Hilbert space). If $B \subset L^2_{\rho_X}$ is dense and $\theta > 0$, then
$$\inf_{\|f\|_B \le R} \|f - f_\rho\|_{L^2_{\rho_X}} = O(R^{-\theta})$$
if and only if $f_\rho$ lies in the interpolation space $(B, L^2_{\rho_X})_{\frac{\theta}{1+\theta}, \infty}$.

SLIDE 8

II.4. Examples of hypothesis spaces

Sobolev spaces: if $X \subset \mathbb{R}^n$, $\rho_X$ is the normalized Lebesgue measure, and $B$ is the Sobolev space $H^s$ with $s > n/2$, then $(H^s, L^2_{\rho_X})_{\frac{\theta}{1+\theta}, \infty}$ is the Besov space $B^{\frac{\theta}{1+\theta}s}_{2,\infty}$, and
$$H^{\frac{\theta}{1+\theta}s} \subset B^{\frac{\theta}{1+\theta}s}_{2,\infty} \subset H^{\frac{\theta}{1+\theta}s - \epsilon} \quad \text{for any } \epsilon > 0.$$

Range of powers of the integral operator: if $K : X \times X \to \mathbb{R}$ is a Mercer kernel (continuous, symmetric and positive semidefinite), then the integral operator $L_K$ on $L^2_{\rho_X}$ is defined by
$$L_K(f)(x) = \int_X K(x, y) f(y) \, d\rho_X(y), \qquad x \in X.$$
The $r$-th power $L_K^r$ is well defined for any $r \ge 0$. Its range $L_K^r(L^2_{\rho_X})$ gives the RKHS $\mathcal{H}_K = L_K^{1/2}(L^2_{\rho_X})$, and for $0 < r \le 1/2$,
$$L_K^r(L^2_{\rho_X}) \subset (\mathcal{H}_K, L^2_{\rho_X})_{2r,\infty} \quad \text{and} \quad (\mathcal{H}_K, L^2_{\rho_X})_{2r,\infty} \subset L_K^{r-\epsilon}(L^2_{\rho_X})$$
for any $\epsilon > 0$ when the support of $\rho_X$ is $X$. So we may assume
$$f_\rho = L_K^r(g_\rho) \quad \text{for some } r > 0, \ g_\rho \in L^2_{\rho_X}.$$
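The operator $L_K$ itself is not computable without $\rho_X$, but its eigenvalues $\{\lambda_i\}$ can be estimated empirically: the eigenvalues of the normalized Gram matrix $\frac{1}{N}(K(x_i, x_j))_{i,j}$ approximate those of $L_K$. A hedged sketch with an assumed Gaussian kernel and an assumed uniform $\rho_X$:

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=0.5):
    """Mercer kernel K(x, y) = exp(-|x - y|^2 / (2 sigma^2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, (500, 1))   # samples from rho_X (assumed uniform on [0, 1])
K = gaussian_kernel(x, x)

# Eigenvalues of (1/N) K approximate the eigenvalues lambda_i of L_K
lam = np.sort(np.linalg.eigvalsh(K / len(x)))[::-1]
print(lam[:5])                    # rapid decay for this smooth kernel
```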

SLIDE 9

II.5. Least squares regularization:
$$f_{D,\lambda} := \arg\min_{f \in \mathcal{H}_K} \left\{ \frac{1}{N} \sum_{i=1}^N (f(x_i) - y_i)^2 + \lambda \|f\|_K^2 \right\}, \qquad \lambda > 0.$$

A large literature in learning theory: books by Vapnik, Schölkopf-Smola, Wahba, Anthony-Bartlett, Shawe-Taylor-Cristianini, Steinwart-Christmann, Cucker-Zhou, ...; many papers: Cucker-Smale, Zhang, De Vito-Caponnetto-Rosasco, Smale-Zhou, Lin-Zeng-Fang-Xu, Yao, Chen-Xu, Shi-Feng-Zhou, Wu-Ying-Zhou, ...

Key ingredients of the analysis:
  • regularity of $f_\rho$
  • complexity of $\mathcal{H}_K$: covering numbers, decay of eigenvalues $\{\lambda_i\}$ of $L_K$, effective dimension, ...
  • decay of $y$: $|y| \le M$, exponential decay, moment decaying condition, $E[|y|^q] < \infty$ for some $q > 2$, $\sigma^2_\rho \in L^p_{\rho_X}$ for the conditional variance $\sigma^2_\rho(x) = \int_Y (y - f_\rho(x))^2 \, d\rho(y|x)$, ...
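By the representer theorem, the minimizer has the closed form $f_{D,\lambda} = \sum_{i=1}^N c_i K(x_i, \cdot)$ with $c = (K + \lambda N I)^{-1} y$, where $K$ here denotes the Gram matrix. A minimal sketch (the kernel choice is an assumption, e.g. the Gaussian kernel from the II.4 sketch):

```python
import numpy as np

def krr_fit(K, y, lam):
    """Kernel ridge regression coefficients c = (K + lam * N * I)^{-1} y,
    so that f_{D,lambda} = sum_i c_i K(x_i, .)."""
    N = len(y)
    return np.linalg.solve(K + lam * N * np.eye(N), y)

def krr_predict(K_test_train, c):
    """Evaluate f_{D,lambda}: row t of K_test_train is (K(x_test_t, x_i))_i."""
    return K_test_train @ c

# Usage with assumed data (gaussian_kernel as in the II.4 sketch):
# c = krr_fit(gaussian_kernel(x, x), y, lam=1e-3)
# y_hat = krr_predict(gaussian_kernel(x_test, x), c)
```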

SLIDE 10
  • III. Distributed learning with regularization schemes

Joint work with S. B. Lin and X. Guo (under major revision for JMLR).

Distributed learning with the data a disjoint union $D = \cup_{j=1}^m D_j$:
$$\bar{f}_{D,\lambda} = \sum_{j=1}^m \frac{|D_j|}{|D|} f_{D_j,\lambda}.$$

Define the effective dimension to measure the complexity of $\mathcal{H}_K$ with respect to $\rho_X$ as
$$\mathcal{N}(\lambda) = \mathrm{Tr}\left( (L_K + \lambda I)^{-1} L_K \right) = \sum_i \frac{\lambda_i}{\lambda_i + \lambda}, \qquad \lambda > 0.$$

Note that $\lambda_i = O(i^{-2\alpha})$ implies $\mathcal{N}(\lambda) = O(\lambda^{-\frac{1}{2\alpha}})$.
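$\mathcal{N}(\lambda)$ is a simple function of the eigenvalues, so it can be estimated from the empirical spectrum of the II.4 sketch; a short illustration of the $O(\lambda^{-\frac{1}{2\alpha}})$ behavior under an assumed polynomial decay $\lambda_i = i^{-2\alpha}$:

```python
import numpy as np

def effective_dimension(eigs, lam):
    """N(lambda) = sum_i lambda_i / (lambda_i + lambda)."""
    return np.sum(eigs / (eigs + lam))

# Assumed decay lambda_i = i^{-2 alpha}; then N(lambda) = O(lambda^{-1/(2 alpha)})
alpha = 1.0
eigs = np.arange(1, 10**6, dtype=float) ** (-2 * alpha)
for lam in [1e-2, 1e-3, 1e-4]:
    print(lam, effective_dimension(eigs, lam), lam ** (-1 / (2 * alpha)))
```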

SLIDE 11

III.1. Error analysis for distributed learning

Theorem 2. Assume $|y| \le M$ and $f_\rho = L_K^r(g_\rho)$ for some $0 \le r \le \frac{1}{2}$ and $g_\rho \in \mathcal{H}_K$. If $\mathcal{N}(\lambda) = O(\lambda^{-\frac{1}{2\alpha}})$ for some $\alpha > 0$, $|D_j| = \frac{N}{m}$ for $j = 1, \ldots, m$, and
$$m \le N^{\min\left\{ \frac{12\alpha r + 1}{5(4\alpha r + 2\alpha + 1)}, \, \frac{4\alpha r}{4\alpha r + 2\alpha + 1} \right\}},$$
then by taking $\lambda = N^{-\frac{2\alpha}{4\alpha r + 2\alpha + 1}}$, we have
$$E\left[ \left\| \bar{f}_{D,\lambda} - f_\rho \right\|_\rho \right] = O\left( N^{-\frac{\alpha + 2\alpha r}{2\alpha + 4\alpha r + 1}} \right).$$

If $f_\rho \in \mathcal{H}_K$ and $m \le N^{\frac{1}{4 + 6\alpha}}$, the choice $\lambda = \left( \frac{m}{N} \right)^{\frac{2\alpha}{2\alpha + 1}}$ yields
$$E\left[ \left\| \bar{f}_{D,\lambda} - f_{D,\lambda} \right\|_\rho \right] = O\left( N^{-\frac{\alpha}{2\alpha + 1}} m^{-\frac{1}{4\alpha + 2}} \right) \quad \text{and} \quad E\left[ \left\| \bar{f}_{D,\lambda} - f_{D,\lambda} \right\|_K \right] = O\left( \frac{1}{\sqrt{m}} \right).$$

SLIDE 12

III.2. Previous work: Zhang-Duchi-Wainwright (2015). If the normalized eigenfunctions $\{\varphi_i\}_i$ of $L_K$ on $L^2_{\rho_X}$ satisfy
$$\|\varphi_i\|^{2k}_{L^{2k}_{\rho_X}} = E\left[ |\varphi_i(x)|^{2k} \right] \le A^{2k}, \qquad i = 1, 2, \ldots,$$
for some constants $k > 2$ and $A < \infty$, $f_\rho \in \mathcal{H}_K$ and $\lambda_i = O(i^{-2\alpha})$ for some $\alpha > 1/2$, then
$$E\left[ \left\| \bar{f}_{D,\lambda} - f_\rho \right\|^2_\rho \right] = O\left( N^{-\frac{2\alpha}{2\alpha + 1}} \right)$$
when $\lambda = N^{-\frac{2\alpha}{2\alpha + 1}}$ and $m = O\left( \left( N^{\frac{2(k-4)\alpha - k}{2\alpha + 1}} / (A^{4k} \log^k N) \right)^{\frac{1}{k-2}} \right)$.

An example of a $C^\infty$ Mercer kernel without uniform boundedness of the eigenfunctions: Zhou (2002)

Advantages of our analysis:
(1) General results without any eigenfunction assumption
(2) Error estimates in the $\mathcal{H}_K$ metric (Smale-Zhou 2007)
(3) A novel second order decomposition applicable to other algorithms

SLIDE 13
  • IV. Optimal rates for regularization: by-product

Caponnetto-De Vito (2007): If $\lambda_i \approx i^{-2\alpha}$ with some $\alpha > 1/2$, then with $\lambda = \left( \frac{\log N}{N} \right)^{\frac{2\alpha}{2\alpha + 1}}$,
$$\lim_{\tau \to \infty} \limsup_{N \to \infty} \sup_\rho \, \mathrm{prob}\left\{ \left\| f_{D,\lambda_N} - f_\rho \right\|^2_\rho \le \tau \left( \frac{\log N}{N} \right)^{\frac{2\alpha}{2\alpha + 1}} \right\} = 1.$$

Steinwart-Hush-Scovel (2009): If $\lambda_i = O(i^{-2\alpha})$ with some $\alpha > 1/2$, and for some constant $C > 0$ the pair $(K, \rho_X)$ satisfies
$$\|f\|_\infty \le C \|f\|_K^{\frac{1}{2\alpha}} \|f\|_\rho^{1 - \frac{1}{2\alpha}}, \qquad \forall f \in \mathcal{H}_K,$$
then with $\lambda = N^{-\frac{2\alpha}{2\alpha + 1}}$,
$$E\left[ \left\| \pi_M(f_{D,\lambda}) - f_\rho \right\|^2_\rho \right] = O\left( N^{-\frac{2\alpha}{2\alpha + 1}} \right).$$
Here $\pi_M$ is the projection onto the interval $[-M, M]$.

SLIDE 14

Our result:
$$E\left[ \left\| f_{D,\lambda} - f_\rho \right\|_\rho \right] = O\left( N^{-\frac{\alpha}{2\alpha + 1}} \right).$$

Theorem 3. Assume $E[y^2] < \infty$ and $\sigma^2_\rho \in L^p_{\rho_X}$ for some $1 \le p \le \infty$. If $f_\rho = L_K^r(g_\rho)$ for some $g_\rho \in L^2_{\rho_X}$ and $0 < r \le 1$, and $\mathcal{N}(\lambda) = O(\lambda^{-\frac{1}{2\alpha}})$ for some $\alpha > 0$, then by taking $\lambda = N^{-\frac{2\alpha}{2\alpha \max\{2r, 1\} + 1}}$ we have
$$E\left[ \left\| f_{D,\lambda} - f_\rho \right\|_\rho \right] = O\left( N^{-\frac{2r\alpha}{2\alpha \max\{2r, 1\} + 1} + \frac{1}{2p} \cdot \frac{2\alpha - 1}{2\alpha \max\{2r, 1\} + 1}} \right).$$

In particular, when $p = \infty$ (the conditional variances are uniformly bounded), we have
$$E\left[ \left\| f_{D,\lambda} - f_\rho \right\|_\rho \right] = O\left( N^{-\frac{2r\alpha}{2\alpha \max\{2r, 1\} + 1}} \right).$$

The second order decomposition was also used to solve two conjectures on kernel partial least squares: S. B. Lin-Zhou.

SLIDE 15
  • V. Other distributed learning algorithms

Distributed learning with spectral algorithms based on SVD of Gramian matrices $\left( K(x_i, x_j) \right)_{i,j=1}^N$: Z. C. Guo-S. B. Lin-Zhou

Distributed learning with stochastic gradient descent: S. B. Lin-Zhou

Distributed learning with additional unlabeled data: X. Y. Chang-S. B. Lin-Zhou
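Spectral algorithms replace the ridge filter $(\sigma + \lambda)^{-1}$ of least squares regularization by other functions of the Gram matrix spectrum. As an illustrative stand-in (not necessarily the algorithms of the cited work), a hedged sketch of one member of the family, spectral cut-off (truncated SVD):

```python
import numpy as np

def spectral_cutoff_fit(K, y, lam):
    """Spectral cut-off on the normalized Gram matrix K/N: keep eigen-directions
    with eigenvalue above lam, invert them, discard the rest.
    Returns coefficients c with f = sum_i c_i K(x_i, .)."""
    N = len(y)
    sigma, U = np.linalg.eigh(K / N)   # spectral decomposition (SVD) of K/N
    filt = np.where(sigma > lam, 1.0 / np.maximum(sigma, lam), 0.0)
    return U @ (filt * (U.T @ (y / N)))

# Compare: kernel ridge regression corresponds to filt = 1 / (sigma + lam).
```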

SLIDE 16
  • VI. Further topics with distributed learning and deep nets

VI.1. Approximation theory of deep nets

Classical results on shallow nets (Cybenko 1989, Hornik 1991, Barron 1993, Mhaskar 1996): if $\sigma$ is a $C^\infty$ strictly increasing function satisfying $\lim_{x \to -\infty} \sigma(x) = 0$ and $\lim_{x \to \infty} \sigma(x) = 1$ (a sigmoidal function), and if $f$ is in the Sobolev space $W^r_2(\mathbb{R}^d)$, then for every $N \in \mathbb{N}$ there exists a function $f_N(x) = \sum_{i=1}^N c_i \sigma(w_i \cdot x + b_i)$ with $c_i \in \mathbb{R}$, $w_i \in \mathbb{R}^d$, $b_i \in \mathbb{R}$ such that
$$\|f_N - f\|^2_{L^2(\mathbb{R}^d)} = O(N^{-2r/d}).$$
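A minimal sketch of the shallow-net form $f_N$ above, with the logistic function as an assumed sigmoidal $\sigma$ and randomly chosen (not optimized) parameters; a constructive proof would instead choose $c_i, w_i, b_i$ from $f$:

```python
import numpy as np

def sigma(t):
    """An assumed sigmoidal activation: the logistic function."""
    return 1.0 / (1.0 + np.exp(-t))

def shallow_net(x, c, W, b):
    """f_N(x) = sum_{i=1}^N c_i sigma(w_i . x + b_i); x has shape (n, d)."""
    return sigma(x @ W.T + b) @ c

# Illustrative parameters for N = 50 units in dimension d = 3
rng = np.random.default_rng(4)
N, d = 50, 3
c, W, b = rng.standard_normal(N), rng.standard_normal((N, d)), rng.standard_normal(N)
print(shallow_net(rng.standard_normal((5, d)), c, W, b))
```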

Lack of localized approximation (Chui-Li-Mhaskar 1994): the neural network with the activation function $\sigma = \chi_{[0,\infty)}$ does not provide localized approximation, meaning that for every compact subset $K$ of $\mathbb{R}^d$,
$$\inf_{N \in \mathbb{N}, \, c_i, w_i, b_i} \left\| \sum_{i=1}^N c_i \sigma(w_i \cdot x + b_i) - \chi_{[-1,1]^d} \right\|_{L^1(K)} = 0.$$

SLIDE 17

Approximation by deep nets

Neural network with 2 hidden layers:
$$f(x) = \sum_{i=1}^{n_2} c_i \sigma\left( \sum_{j=1}^{n_1} a_{i,j} \sigma\left( w_{i,j} \cdot x + b_{i,j} \right) \right) + c_0$$
with $c_i, a_{i,j}, b_{i,j} \in \mathbb{R}$ and $w_{i,j} \in \mathbb{R}^d$.

Chui-Li-Mhaskar (1994): the neural network with 2 hidden layers and a measurable activation function $\sigma$ satisfying $\lim_{x \to -\infty} \sigma(x) = 0$, $\lim_{x \to \infty} \sigma(x) = 1$ and $\|\sigma\|_\infty < \frac{2d}{2d-1}$ provides localized approximation.

Eldan-Shamir (2016): an example of a function expressible by a 3-layer feedforward neural network that cannot be approximated by any 2-layer neural network to a certain accuracy unless the width is exponential in the dimension.

Telgarsky (2016): more examples

SLIDE 18

Neural network with 4 hidden layers: Shaham-Cloninger-Coifman (2016). For the rectified linear function $\sigma(x) = \max\{x, 0\}$, a depth-4 neural network with $N$ units can achieve the approximation order $O(N^{-2/d})$ if $f$ is $C^2$ on a smooth $d$-dimensional Riemannian manifold without boundary.

Robust and distributed learning with deep nets: Chui-Lin-Zhou (in progress)

SLIDE 19

VI.2. Stochastic gradient descent and mirror descent: Y. W. Lei-Zhou (Neural Computation 2016), Y. M. Ying-Zhou (ACHA 2016)

Learning with a mirror map $\Psi : \mathbb{R}^d \to \mathbb{R}$, a loss $\phi$, and a convex regularizer $r$:
$$w_{t+1} = \arg\min_{w \in \mathbb{R}^d} \left\{ \eta_t \left\langle w - w_t, \, \phi'_-(y_t, \langle w_t, x_t \rangle) x_t \right\rangle + \eta_t r(w) + D_\Psi(w, w_t) \right\},$$
where $\eta_t$ is a step size and $D_\Psi(w, \tilde{w})$ is the Bregman distance between $w$ and $\tilde{w}$.

Motivation: capture the geometry involving $\ell^p$ norms with $p \ge 1$, where $\Psi_p(x) = \frac{1}{2} \|x\|_p^2$.
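A minimal sketch of this update under assumed simplifications: least squares loss, no regularizer ($r = 0$), and mirror map $\Psi_p$ with $p \in (1, 2]$. In that case the minimization takes the standard dual form $w_{t+1} = \nabla \Psi_q(\nabla \Psi_p(w_t) - \eta_t g_t)$ with $\frac{1}{p} + \frac{1}{q} = 1$, since $\Psi_p^* = \Psi_q$:

```python
import numpy as np

def grad_psi(w, p):
    """Gradient of Psi_p(w) = 0.5 * ||w||_p^2 for p > 1:
    (grad Psi_p(w))_j = sign(w_j) * |w_j|^{p-1} * ||w||_p^{2-p}."""
    norm = np.linalg.norm(w, p)
    if norm == 0.0:
        return np.zeros_like(w)
    return np.sign(w) * np.abs(w) ** (p - 1) * norm ** (2 - p)

def mirror_descent_ls(X, y, p=1.5, eta=0.05, T=2000):
    """Mirror descent for phi(y, u) = 0.5 * (y - u)^2 with mirror map Psi_p, r = 0."""
    q = p / (p - 1)                       # conjugate exponent, 1/p + 1/q = 1
    w = np.zeros(X.shape[1])
    for t in range(T):
        xt, yt = X[t % len(y)], y[t % len(y)]
        g = (w @ xt - yt) * xt            # phi'(y_t, <w_t, x_t>) x_t
        w = grad_psi(grad_psi(w, p) - eta * g, q)  # dual step, map back via Psi_q
    return w
```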

SLIDE 20

VI.3. Compositional models for deep nets

Additive models (Stone 1985): $f(x_1, \ldots, x_d) = f_1(x_1) + \ldots + f_d(x_d)$; M. Yuan-Zhou (Ann. Stat. 2016), Christmann-Zhou (Anal. Appl. 2016)

Interaction models (Stone 1994): $f(x_1, \ldots, x_d) = \sum_{I \subseteq \{1, \ldots, d\}, |I| = d^*} f_I(x_I)$ with $d^* \in \{1, \ldots, d\}$, where for $I = \{i_1, \ldots, i_{d^*}\} \subseteq \{1, \ldots, d\}$ with $|I| = d^*$, $x_I = (x_{i_1}, \ldots, x_{i_{d^*}})$.

Single index models and projection pursuit (Härdle and Stoker 1989, Friedman and Stuetzle 1981): $f(x_1, \ldots, x_d) = \sum_{k=1}^K g_k(a_k \cdot x)$ with $K \in \mathbb{N}$, $a_k \in \mathbb{R}^d$ and univariate functions $g_k$.
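The three model classes are easy to state in code, which makes their structural restrictions explicit; a hedged sketch with assumed component functions:

```python
import numpy as np

def additive_model(x, fs):
    """f(x) = f_1(x_1) + ... + f_d(x_d); fs is a list of d univariate functions."""
    return sum(f(x[:, j]) for j, f in enumerate(fs))

def interaction_model(x, terms):
    """f(x) = sum_I f_I(x_I); terms maps index tuples I to functions of x_I."""
    return sum(fI(x[:, list(I)]) for I, fI in terms.items())

def single_index_model(x, gs, A):
    """f(x) = sum_k g_k(a_k . x); rows of A are the directions a_k."""
    return sum(g(x @ a) for g, a in zip(gs, A))

# Illustrative usage with d = 3
x = np.random.default_rng(5).standard_normal((4, 3))
print(additive_model(x, [np.sin, np.cos, np.tanh]))
print(interaction_model(x, {(0, 1): lambda z: z.prod(axis=1)}))
print(single_index_model(x, [np.sin], np.array([[1.0, -1.0, 0.5]])))
```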

SLIDE 21

Hierarchical interaction models (Kohler and Krzyzak 2016): $f(x_1, \ldots, x_d) = g\left( f_1(x_{I_1}), f_2(x_{I_2}), \ldots, f_{d^*}(x_{I_{d^*}}) \right)$ with $d^* \in \{1, \ldots, d\}$ and $I_i \subseteq \{1, \ldots, d\}$ with $|I_i| = d^*$.

Compositional functions: Mhaskar-Liao-Poggio, Mhaskar-Poggio (2016)

SLIDE 22

THANK YOU!
