SLIDE 1

A new perspective on machine learning

  • H. N. Mhaskar

Claremont Graduate University

ICERM Scientific Machine Learning January 28, 2019

SLIDE 2

Outline

◮ My understanding of the machine learning problem and its traditional solution.
◮ What bothers me about this.
◮ My own efforts to remedy the problems:
  ◮ Diffusion geometry based approach
  ◮ Application to diabetic sugar level prediction
  ◮ Problems
  ◮ Hermite polynomial based approach
  ◮ Applications

SLIDE 3

Problem of machine learning

Given data (training data) of the form $\{(x_j, y_j)\}_{j=1}^M$, where $y_j \in \mathbb{R}$ and the $x_j$'s are in some Euclidean space $\mathbb{R}^q$, find a function $P$ on a suitable domain

◮ that models the data well;
◮ in particular, $P(x_j) \approx y_j$.

SLIDE 4
1. Traditional paradigm
SLIDE 5

Basic set up

$\{(x_j, y_j)\}$ are i.i.d. samples from an unknown probability distribution $\mu$.
$f(x) = \mathbb{E}_\mu(y \mid x)$: the target function.
$\mu^*$ = marginal distribution of $x$; $\mathbb{X}$ = support of $\mu^*$.
$V_n \subset V_{n+1} \subset \cdots$ = classes of models, $V_n$ with complexity $n$ (typically, the number of parameters).

SLIDE 6

Traditional methodology

◮ Assume $f \in W_\gamma$ (smoothness class, prior, RKHS).
◮ Estimate $E_n(f) = \inf_{P \in V_n} \|f - P\|_{L^2(\mu^*)} = \|f - P^*\|_{L^2(\mu^*)}$. Decide upon the right value of $n$.
◮ Find
$$P^\# = \arg\min_{P \in V_n} \Big[ \mathrm{Loss}\big(\{y_\ell - P(x_\ell)\}\big) + \lambda \|P\|_{W_\gamma} \Big].$$
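To make this concrete: a minimal numpy sketch of the minimization above, with the common (but here merely illustrative) choices of squared loss and a Gaussian-kernel RKHS penalty, i.e., kernel ridge regression. The kernel, $\lambda$, and the toy data are my assumptions, not taken from the slides.

```python
# Sketch of the traditional paradigm with squared loss and an RKHS penalty
# (kernel ridge regression).  Kernel, lambda and toy data are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kernel(X, Z, width=0.5):
    # K[i, j] = exp(-|X_i - Z_j|^2 / (2 width^2))
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * width ** 2))

# Training data (x_j, y_j) with y_j = f(x_j) + noise, x_j in R^q, q = 1.
M = 100
X = rng.uniform(-1, 1, size=(M, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(M)

# P# = argmin_P  sum_j (y_j - P(x_j))^2 + lam * ||P||_K^2  has the closed form
# P#(x) = sum_j alpha_j K(x, x_j)  with  alpha = (K + lam I)^{-1} y.
lam = 1e-2
K = gaussian_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(M), y)

def P_sharp(Xnew):
    return gaussian_kernel(Xnew, X) @ alpha

Xtest = np.linspace(-1, 1, 200)[:, None]
print("max |P#(x) - f(x)| on a test grid:",
      round(np.max(np.abs(P_sharp(Xtest) - np.sin(3 * Xtest[:, 0]))), 3))
```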
SLIDE 7

Generalization error

$$\underbrace{\int_{\mathbb{X} \times \mathbb{R}} |y - P^\#(x)|^2 \, d\mu(y, x)}_{\text{generalization error}}
= \underbrace{\int_{\mathbb{X} \times \mathbb{R}} |y - f(x)|^2 \, d\mu(y, x)}_{\text{variance}}
+ \underbrace{\|f - P^*\|^2_{L^2(\mu^*)}}_{\text{approximation error}}
+ \underbrace{\|f - P^\#\|^2_{L^2(\mu^*)} - \|f - P^*\|^2_{L^2(\mu^*)}}_{\text{sampling error}}$$

Only the approximation error and sampling error can be controlled.
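A toy numerical check of the decomposition (assumptions, for illustration only: $\mu^*$ uniform on $[0,1]$, $f(x) = \sin 2\pi x$, Gaussian noise, $V_n$ = polynomials of degree $n$):

```python
# Toy check of the generalization-error decomposition above.
import numpy as np

rng = np.random.default_rng(1)
n, M, sigma = 5, 200, 0.3
f = lambda x: np.sin(2 * np.pi * x)

grid = np.linspace(0, 1, 20001)                      # dense grid for L2(mu*) integrals

# P*: best L2(mu*) approximation of f from V_n (least squares against f itself).
P_star = np.polynomial.Polynomial.fit(grid, f(grid), deg=n)

# P#: least-squares fit to M noisy samples.
X = rng.uniform(0, 1, M)
y = f(X) + sigma * rng.standard_normal(M)
P_sharp = np.polynomial.Polynomial.fit(X, y, deg=n)

variance = sigma ** 2                                 # E |y - f(x)|^2
approx_err = np.mean((f(grid) - P_star(grid)) ** 2)   # ||f - P*||^2
sampling_err = np.mean((f(grid) - P_sharp(grid)) ** 2) - approx_err

print("variance            ", round(variance, 4))
print("approximation error ", round(approx_err, 4))
print("sampling error      ", round(sampling_err, 4))
print("generalization error", round(variance + approx_err + sampling_err, 4))
```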

SLIDE 8

Observations on the paradigm

◮ Too complicated.
◮ Bounds on the approximation error are often obtained by explicit constructions. The approach makes no use of these constructions.
◮ Measuring errors in $L^2$ with function values makes sense only if $f$ is in some RKHS. So, the method is not universal.

SLIDE 9

Observations on the paradigm

Good is better than best

On the left, the log-plot of the absolute error between the function $x \mapsto |\cos x|^{1/4}$ and its Fourier projection. On the right, the corresponding plot with the trigonometric polynomial obtained by our summability operator. This is based on 128 equidistant samples. The order of the trigonometric polynomials is 31 in each case. The numbers on the x-axis are in multiples of $\pi$; the actual absolute errors are $10^y$.

[Figure: two log-error plots, Fourier projection (left) vs. summability operator (right).]
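The comparison can be reproduced in a few lines. The raised-cosine low-pass filter below is an illustrative stand-in for the talk's summability kernel (which is not specified here); the point it illustrates is that the filtered sum converges much faster away from the singularities of $|\cos x|^{1/4}$ than the "best" $L^2$ projection.

```python
# Fourier projection vs. a filtered (summability) sum for g(x) = |cos x|^{1/4},
# from 128 equispaced samples, trigonometric degree 31.  The raised-cosine
# filter h is an illustrative choice, not necessarily the one used in the talk.
import numpy as np

N, deg = 128, 31
x = 2 * np.pi * np.arange(N) / N
g = np.abs(np.cos(x)) ** 0.25

c = np.fft.fft(g) / N                      # discrete Fourier coefficients c_k
k = np.fft.fftfreq(N, d=1.0 / N)           # integer frequencies -64..63

def h(t):
    # smooth low-pass filter: 1 on [0, 1/2], 0 on [1, inf), raised cosine in between
    t = np.abs(t)
    return np.where(t <= 0.5, 1.0, np.where(t >= 1.0, 0.0, np.cos(np.pi * (t - 0.5)) ** 2))

xx = np.linspace(0, 2 * np.pi, 2000, endpoint=False)
E = np.exp(1j * np.outer(xx, k))           # e^{i k x} on a fine grid
gg = np.abs(np.cos(xx)) ** 0.25

fourier = (E * (c * (np.abs(k) <= deg))).sum(axis=1).real          # plain projection
filtered = (E * (c * h(np.abs(k) / (deg + 1)))).sum(axis=1).real   # summability sum

away = np.minimum(np.abs(xx - np.pi / 2), np.abs(xx - 3 * np.pi / 2)) > 0.3
print("max error away from the singularities:")
print("  Fourier projection:", np.max(np.abs(fourier - gg)[away]))
print("  filtered sum      :", np.max(np.abs(filtered - gg)[away]))
```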

SLIDE 10

Observations on the paradigm

◮ The choice of penalty functional/loss functional, kernels, etc. is often ad hoc, and assumes a prior on the target function.
◮ Performance guarantees on new and unseen data are often not easy to obtain, sometimes impossible.
◮ The optimization algorithms might not converge, or converge too slowly.
◮ The paradigm does not work in the context of deep learning.

SLIDE 11

Curse of dimensionality

The number of parameters required to get a generalization error of $\epsilon$ is at least a constant times $\epsilon^{-q/\gamma}$, where $\gamma$ = smoothness of $f$ and $q$ = number of input variables [1].

[1] Donoho, 2000; DeVore, Howard, Micchelli, 1989

SLIDE 12

Blessing of compositionality

Approximate $F(x_1, \ldots, x_4) = f(f_1(x_1, x_2), f_2(x_3, x_4))$ by $Q(x_1, \ldots, x_4) = P(P_1(x_1, x_2), P_2(x_3, x_4))$. Only functions of 2 variables are involved at each stage [2].

[2] Mhaskar, Poggio, 2016
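A small sketch of the idea: approximate each 2-variable constituent separately and compose. The constituent functions and the degree-3 bivariate polynomial model class are illustrative assumptions.

```python
# Compositional approximation: fit each 2-variable piece, then compose.
import numpy as np
from itertools import product

def poly_feats(u, v, deg=3):
    # all monomials u^a v^b with a + b <= deg
    return np.stack([u ** a * v ** b
                     for a, b in product(range(deg + 1), repeat=2) if a + b <= deg], axis=-1)

def fit2(fun, deg=3, m=40):
    # least-squares fit of a 2-variable function on a grid over [-1, 1]^2
    g = np.linspace(-1, 1, m)
    U, V = np.meshgrid(g, g)
    A = poly_feats(U.ravel(), V.ravel(), deg)
    coef, *_ = np.linalg.lstsq(A, fun(U.ravel(), V.ravel()), rcond=None)
    return lambda u, v: poly_feats(u, v, deg) @ coef

f1 = lambda a, b: np.sin(a + b)            # illustrative constituent functions
f2 = lambda a, b: a * b
f  = lambda a, b: np.cos(a - b)
F  = lambda x1, x2, x3, x4: f(f1(x1, x2), f2(x3, x4))

P1, P2, P = fit2(f1), fit2(f2), fit2(f)    # only 2-variable fits at each stage
Q = lambda x1, x2, x3, x4: P(P1(x1, x2), P2(x3, x4))

x = np.random.default_rng(2).uniform(-1, 1, size=(1000, 4))
print("max |F - Q| on random points in [-1, 1]^4:",
      round(np.max(np.abs(F(*x.T) - Q(*x.T))), 3))
```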

SLIDE 13

How to measure generalization error

$$\int |f(f_1(x_1, x_2), f_2(x_3, x_4)) - P(P_1(x_1, x_2), P_2(x_3, x_4))|^2 \, d\mu(x_1, x_2, x_3, x_4)$$
$\mu$ ignores compositionality.
$$\int |f(f_1, f_2) - P(P_1, P_2)|^2 \, d\nu(?)$$
The distributions of $(f_1, f_2)$ and $(P_1, P_2)$ are different. Must have a different notion of generalization error [3].

[3] Mhaskar, Poggio, 2016

SLIDE 14

A new look

Given data (training data) of the form $\{(x_j, y_j)\}_{j=1}^M$, where $y_j \in \mathbb{R}$ and the $x_j$'s are in some Euclidean space $\mathbb{R}^{q+1}$.

◮ Assume that there is an underlying target function $f : \mathbb{X} \to \mathbb{R}$ such that $y_j = f(x_j) + \epsilon_j$.
◮ No priors, just continuity.
◮ Use approximation theory to construct the approximation $P$.

SLIDE 15

Objectives

◮ Universal approximation with no prior assumptions.
◮ Generalization error defined pointwise, adjusting itself to the local smoothness.
◮ Optimization is substantially easier.
◮ Can be adapted to deep learning easily.

SLIDE 16

Problem: Classical approximation theory results are not adequate.

◮ They assume data distributed densely on a known domain (cube, sphere, etc.).
◮ The points $x_j$ need to be chosen judiciously; e.g., Driscoll-Healy points on the sphere or quadrature nodes on the cube.

SLIDE 17
2. Diffusion geometry based construction
SLIDE 18

Set up

Data $\{x_j\}$: an i.i.d. sample from a distribution $\mu^*$ supported on a smooth compact manifold $\mathbb{X}$ (unknown).
$\{\phi_k\}$: a system of eigenfunctions of a suitable PDE, with eigenvalues $\{\lambda_k\}$.
The $\phi_k$'s and $\lambda_k$'s are computed approximately from a "graph Laplacian" [4].

[4] Lafon, 2004; Singer, 2006; Belkin, Niyogi, 2008

SLIDE 19

Set up

Data $\{x_j\}$: an i.i.d. sample from a distribution $\mu^*$ supported on a smooth compact manifold $\mathbb{X}$ (unknown).
$\{\phi_k\}$: a system of eigenfunctions of a suitable PDE, with eigenvalues $\{\lambda_k\}$.
$\Pi_n = \mathrm{span}\{\phi_k : \lambda_k < n\}$.
$\|f\| = \sup_{x \in \mathbb{X}} |f(x)|, \qquad \|f\|_\gamma = \|f\| + \sup_{n \ge 1} n^\gamma \, \mathrm{dist}(f, \Pi_n).$
$\gamma$ is the smoothness of $f$.

SLIDE 20

Construction

$h$: a smooth low-pass filter (even, $= 1$ on $[0, 1/2]$, $= 0$ on $[1, \infty)$).
$$\Phi_n(x, y) = \sum_{0 \le k < n} h\!\left(\frac{\lambda_k}{n}\right) \phi_k(x)\, \phi_k(y).$$

Fact [4]: If the $x_j$'s are sufficiently dense, then there exist $w_j$ such that for all $P \in \Pi_n$,
$$\sum_j w_j P(x_j) = \int_{\mathbb{X}} P(x)\, d\mu^*(x), \qquad \sum_j |w_j P(x_j)| \le c \int_{\mathbb{X}} |P(x)|\, d\mu^*(x)$$
(Marcinkiewicz-Zygmund (MZ) quadrature).

[4] Filbir, Mhaskar, 2010, 2011

SLIDE 21

Algorithm

◮ Find $w_j$'s depending only on the $x_j$'s and construct [5]
$$P(x) = \sum_{j=1}^{M} w_j y_j \Phi_n(x, x_j)
= \underbrace{\sum_{j=1}^{M} w_j f(x_j) \Phi_n(x, x_j)}_{\sigma_n(f)(x)}
+ \underbrace{\sum_{j=1}^{M} w_j \epsilon_j \Phi_n(x, x_j)}_{\text{noise part}}.$$

[5] Ehler, Filbir, Mhaskar, 2012
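A minimal sketch of this estimator on synthetic data. Simplifying assumptions (mine, not the slides'): an unnormalized graph Laplacian with a Gaussian affinity, quadrature weights $w_j = 1/M$ in place of true MZ weights, an ad-hoc bandwidth $n$, and data on a circle so the answer is easy to check.

```python
# Diffusion-geometry estimator sigma_n(f): graph-Laplacian eigenpairs + a
# low-pass-filtered kernel Phi_n, evaluated at the data points themselves.
import numpy as np

rng = np.random.default_rng(3)
M = 400
theta = rng.uniform(0, 2 * np.pi, M)
X = np.column_stack([np.cos(theta), np.sin(theta)])   # samples on a circle in R^2

# Unnormalized graph Laplacian L = D - W with a Gaussian affinity.
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-d2 / 0.05)
L = np.diag(W.sum(axis=1)) - W
lam, phi = np.linalg.eigh(L)                          # approximate (lambda_k, phi_k)
phi *= np.sqrt(M)                                     # so that (1/M) sum_j phi_k(x_j)^2 = 1

def h(t):
    # smooth low-pass filter: 1 on [0, 1/2], 0 on [1, inf)
    return np.where(t <= 0.5, 1.0, np.where(t >= 1.0, 0.0, np.cos(np.pi * (t - 0.5)) ** 2))

# Noisy samples of a target function on the manifold.
f_vals = np.cos(3 * theta)
y = f_vals + 0.1 * rng.standard_normal(M)

# P(x_i) = sum_j w_j y_j Phi_n(x_i, x_j),  Phi_n = sum_k h(lambda_k / n) phi_k phi_k^T,
# with w_j = 1/M standing in for the MZ quadrature weights.
n = lam[40]                                           # illustrative bandwidth
Phi_n = (phi * h(lam / n)) @ phi.T
P = Phi_n @ (y / M)

print("mean abs error of the estimator:", round(np.mean(np.abs(P - f_vals)), 3))
```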

SLIDE 22

Theorem

$f \in W_\gamma$ if and only if $\|f - \sigma_n(f)\| = O(n^{-\gamma})$ [6]. If $f \in W_\gamma$ and $P$ is the noisy version of $\sigma_n(f)$, then with high probability, for $n \sim (M/\log M)^{1/(2q+2\gamma)}$,
$$\|f - P\| \le c\, n^{-\gamma}.$$

[6] Maggioni, Mhaskar, 2008

SLIDE 23
3. An application [7]

[7] Mhaskar, Pereverzyev, van der Walt, 2017

SLIDE 24

Continuous blood glucose monitoring

Source: http://www.dexcom.com/seven-plus

Problem: Estimate the future blood glucose level based on the past few readings, and the direction in which it is going (up or down).

SLIDE 25

PRED-EGA grid

◮ Numerical accuracy is not as critical as classification errors.
◮ Depending upon low, normal, or high blood sugar, the results are classified as accurate, wrong but with no serious consequences (benign), or outright errors.

SLIDE 26

Deep diffusion network

Given sugar levels $s(t_0), s(t_1), \ldots$ at times $t_0, t_1, \ldots$ for different patients, we form a data set $\mathcal{P} = \{(x_j, y_j)\}$, $x_j = (s(t_0), \ldots, s(t_6))$, $y_j = s(t_{12})$ (30-minute prediction), and a training set $\mathcal{C} \subset \mathcal{P}$.

SLIDE 27

Deep diffusion network

◮ Divide $\mathcal{C}$ into 3 clusters $\mathcal{C}_o$, $\mathcal{C}_e$, $\mathcal{C}_r$, depending on whether a 5-minute prediction indicates a hypo-glycemic, eu-glycemic, or hyper-glycemic condition [8].
◮ (First layer training) Use approximation theory based on $\{\lambda_k, \phi_k\}$ with 30% training data in each cluster to get predictions $f_o$, $f_e$, $f_r$ respectively [9]:
$$f_s(x) = \sum_{z \in \mathcal{C}_s} w_{z,s}\, f(z)\, \Phi_n(x, z), \qquad s = o, e, r, \quad x \in \mathcal{P}.$$
◮ (Second layer training) Using the same ideas, train a judge to decide, based on the training data, which prediction gives the best PRED-EGA grid result. (A toy sketch of this two-layer scheme follows below.)

[8] Mhaskar, Naumova, Pereverzyev, 2013
[9] Ehler, Filbir, Mhaskar, 2012
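A heavily simplified structural sketch of the two-layer scheme, on synthetic data. The Gaussian kernel standing in for $\Phi_n$, the clustering rule (thresholding the last reading instead of a 5-minute prediction), the weights $w_z = 1/|\mathcal{C}_s|$, and the trivialized "judge" (each input is simply routed to its own cluster's predictor) are my assumptions, not the published method.

```python
# Structural sketch of the "deep diffusion network" on synthetic glucose-like data.
import numpy as np

rng = np.random.default_rng(4)

def kernel(A, B, width=10.0):
    # Gaussian kernel standing in for the diffusion kernel Phi_n
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * width ** 2))

def kernel_estimate(Xtr, ytr, Xte):
    # f_s(x) = sum_z w_z y(z) Phi(x, z), normalized; true MZ weights not computed here
    K = kernel(Xte, Xtr)
    return (K @ ytr) / K.sum(axis=1)

# Synthetic trajectories: x_j = 7 past readings (5 min apart), y_j = reading 30 min later.
M = 600
base = rng.uniform(60, 220, M)                # mg/dL level per segment
drift = rng.uniform(-2, 2, M)                 # per-step trend
traj = base[:, None] + drift[:, None] * np.arange(13) + rng.normal(0, 2, (M, 13))
X, y = traj[:, :7], traj[:, 12]

# Clusters C_o, C_e, C_r: hypo- / eu- / hyper-glycemic range of the last reading.
labels = np.where(X[:, -1] < 70, 0, np.where(X[:, -1] <= 180, 1, 2))
train = rng.random(M) < 0.3                   # 30% training data

# First layer: one kernel predictor per cluster.  Second layer ("judge"),
# trivialized here: each input uses the predictor of its own cluster.
pred = np.zeros(M)
for s in range(3):
    tr, te = train & (labels == s), labels == s
    pred[te] = kernel_estimate(X[tr], y[tr], X[te])

print("mean abs 30-min prediction error (mg/dL):",
      round(np.mean(np.abs(pred[~train] - y[~train])), 1))
```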

SLIDE 28

Results

Used clinical readings for 26 patients from DirecNet data. Average percentages in each PRED-EGA category:

                                         Ho-A   Ho-B   Ho-E   Eu-A   Eu-B   Eu-E   Hr-A   Hr-B   Hr-E
Deep diffusion network (30% training)   95.81   2.80   1.40  82.96  15.26   1.79  65.65  21.56  12.79
Deep neural network (65% training)      48.33   4.51  47.16  80.26  14.94   4.80  65.38  17.41  17.21
Shallow network (65% training)          54.09   5.43  40.48  77.39  17.33   5.28  57.13  23.97  18.91

Ho = Hypo-glycemic, Eu = Eu-glycemic, Hr = Hyper-glycemic; A = Accurate, B = Benign, E = Erroneous.

SLIDE 29

Problems

◮ Out-of-sample extension
  ◮ Since the $\phi_k$'s are computed in an entirely data-dependent manner, a new computation is needed if a new datum appears.
  ◮ Nyström extension does not always have good approximation bounds.
◮ The measure $\mu^*$ is not known.

SLIDE 30
4. A more direct construction
SLIDE 31

Hermite functions

$$\psi_k(x) = \frac{(-1)^k}{\sqrt{\pi^{1/2}\, k!\, 2^k}}\, \exp(x^2/2) \left(\frac{d}{dx}\right)^{\!k} \exp(-x^2).$$
If $k = (k_1, \ldots, k_q)$, $x = (x_1, \ldots, x_q)$,
$$\psi_k(x) = \prod_{j=1}^{q} \psi_{k_j}(x_j), \qquad \int_{\mathbb{R}^q} \psi_k(x)\, \psi_m(x)\, dx = \delta_{k,m}.$$
$$\mathrm{Proj}_{m,q}(x, y) = \sum_{k : |k|_1 = m} \psi_k(x)\, \psi_k(y).$$
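The univariate $\psi_k$ can be evaluated stably with the standard three-term recurrence; a quick numerical orthonormality check (the recurrence and the grid are standard facts, not from the slides):

```python
# Normalized Hermite functions psi_k via the stable three-term recurrence,
# with a numerical check of orthonormality.
import numpy as np

def hermite_functions(x, kmax):
    """psi[k] = psi_k(x) for k = 0..kmax, evaluated at the points x."""
    x = np.asarray(x, dtype=float)
    psi = np.zeros((kmax + 1, x.size))
    psi[0] = np.pi ** (-0.25) * np.exp(-x ** 2 / 2)
    if kmax >= 1:
        psi[1] = np.sqrt(2.0) * x * psi[0]
    for k in range(1, kmax):
        psi[k + 1] = np.sqrt(2.0 / (k + 1)) * x * psi[k] - np.sqrt(k / (k + 1)) * psi[k - 1]
    return psi

x = np.linspace(-12, 12, 4001)
psi = hermite_functions(x, kmax=10)
gram = np.trapz(psi[:, None, :] * psi[None, :, :], x, axis=-1)   # int psi_j psi_k dx
print("max deviation from orthonormality:", np.abs(gram - np.eye(11)).max())
```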

SLIDE 32

Hermite functions

Mehler formula:
$$\sum_{m=0}^{\infty} w^m\, \mathrm{Proj}_{m,Q}(x, y) = \frac{1}{(\pi(1 - w^2))^{Q/2}} \exp\!\left(\frac{4w\, x \cdot y - (1 + w^2)(|x|^2 + |y|^2)}{2(1 - w^2)}\right).$$
With $(1 - w^2)^{(Q-q)/2} = \sum_{k=0}^{\infty} d_k w^k$,
$$\tilde{P}_{m,Q,q}(x, y) = \sum_{k=0}^{m} d_{m-k}\, \mathrm{Proj}_{k,Q}(x, y), \qquad \Phi_{n,Q,q}(x, y) = \sum_{m < n^2} h\!\left(\frac{\sqrt{m}}{n}\right) \tilde{P}_{m,Q,q}(x, y).$$

SLIDE 33

Recovery of functions

Let $\mathbb{X}$ be a smooth, compact, $q$-dimensional sub-manifold of $\mathbb{R}^Q$, let $\mu^*$ be its Riemannian volume measure, and let $0 < \gamma < 1$. For sufficiently smooth $f \in C(\mathbb{X})$ and $x \in \mathbb{X}$,
$$\left| \int_{\mathbb{X}} \Phi_{n,Q,q}(x, y)\, f(y)\, d\mu^*(y) - f(x) \right| \le c\, n^{-\gamma}.$$
SLIDE 34
5. An application [10]

[10] Mhaskar, Cloninger, Cheng, 2019

SLIDE 35

Discriminative model

Based on data $\{(x_j, y_j)\}_{j=1}^M$, $x_j \in \mathbb{R}^q$, with $y_j$ taking one of finitely many values, estimate the probability $p(y = k \mid x)$ for any $x$ in the domain space.

One-hot classification: for any label $k$, do a binary classification: output 1 if $x$ has the label $k$, $-1$ otherwise.

Problems
◮ Not every $x$ has a label.
◮ There may be more than one label associated with any $x$, with different probabilities.

SLIDE 36

Witness function

Marginal distribution of the $x_j$'s is $\mu^*$.

Wanted: a function $G$ such that $G(x) = 1$ if $x$ has label 1, $-1$ if $x$ has label $-1$, and $0$ if $x$ is not in the support of $\mu^*$ or has an uncertain label.

A generative network: a network $G$ with this property.

SLIDE 37

Simplification

Assume $Q = q$; write $\Phi_n$ for $\Phi_{n,q,q}$.

◮ $\mu^*$ is absolutely continuous: $d\mu^*(x) = f(x)\, dx$ ($dx$ = Lebesgue measure on $\mathbb{R}^q$).
◮ There are smooth functions $F_1(x) = 1$ if $x$ is reliably in class 1, $F_2(x) = 1$ if $x$ is reliably in class $-1$.

Class boundary: where $F(x) = F_1(x) - F_2(x)$ is small. This may mean that the label for $x$ is uncertain, or that $x$ is not in the support of $\mu^*$ and hence does not have a label.

Witness function for $F_1$, $F_2$:
$$G_n(x) = \frac{1}{M} \sum_{j=1}^{M} \big(F_1(x_j) - F_2(x_j)\big)\, \Phi_n(x, x_j).$$
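A sketch of $G_n$ with the filtered Hermite kernel in the case $q = 2$, on two toy Gaussian blobs. The filter $h$, the bandwidth $n$, and the data are illustrative assumptions; on the labelled samples $F_1 - F_2$ is just the $\pm 1$ label, which is all the estimator needs.

```python
# Witness function G_n with the Hermite kernel Phi_n (Q = q = 2 case).
import numpy as np

def hermite_functions(x, kmax):
    # normalized Hermite functions psi_0..psi_kmax via the stable recurrence
    x = np.asarray(x, dtype=float)
    psi = np.zeros((kmax + 1, x.size))
    psi[0] = np.pi ** (-0.25) * np.exp(-x ** 2 / 2)
    if kmax >= 1:
        psi[1] = np.sqrt(2.0) * x * psi[0]
    for k in range(1, kmax):
        psi[k + 1] = np.sqrt(2.0 / (k + 1)) * x * psi[k] - np.sqrt(k / (k + 1)) * psi[k - 1]
    return psi

def hermite_kernel_2d(X, Y, n):
    # Phi_n(x, y) = sum_{m < n^2} h(sqrt(m)/n) Proj_{m,2}(x, y)
    h = lambda t: np.where(t <= 0.5, 1.0, np.where(t >= 1.0, 0.0, np.cos(np.pi * (t - 0.5)) ** 2))
    kmax = n * n - 1
    px1, px2 = hermite_functions(X[:, 0], kmax), hermite_functions(X[:, 1], kmax)
    py1, py2 = hermite_functions(Y[:, 0], kmax), hermite_functions(Y[:, 1], kmax)
    K = np.zeros((X.shape[0], Y.shape[0]))
    for m in range(n * n):
        proj = sum(np.outer(px1[k1] * px2[m - k1], py1[k1] * py2[m - k1]) for k1 in range(m + 1))
        K += h(np.sqrt(m) / n) * proj
    return K

rng = np.random.default_rng(5)
M = 500
labels = rng.integers(0, 2, M) * 2 - 1                         # +1 / -1
centers = np.array([[1.5, 0.0], [-1.5, 0.0]])
data = np.where(labels[:, None] > 0, centers[0], centers[1]) + 0.4 * rng.standard_normal((M, 2))

# G_n(z) = (1/M) sum_j (F_1(x_j) - F_2(x_j)) Phi_n(z, x_j), with F_1 - F_2 = label.
# Expect: positive near (1.5, 0), negative near (-1.5, 0), ~0 in between and off-support.
Z = np.array([[1.5, 0.0], [-1.5, 0.0], [0.0, 0.0], [4.0, 4.0]])
G = hermite_kernel_2d(Z, data, n=4) @ labels / M
for z, g in zip(Z, G):
    print(f"G_n({z}) = {g:+.3f}")
```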

SLIDE 38

Algorithm

◮ Input: data sets $X$ and $Y$; points $Z = \{z_1, \ldots, z_K\}$ at which to inspect significance; level of confidence $A$.
◮ Tunable parameters: $p$, $N$, $A$.
◮ $y \leftarrow X \cup Y$; $\quad M \leftarrow |X| + |Y|$;
$$c_j \leftarrow \begin{cases} 1, & \text{if } y_j \in X, \\ -1, & \text{if } y_j \in Y. \end{cases}$$
◮ $\displaystyle \hat{F}(z_i) \leftarrow \frac{1}{M} \sum_{j=1}^{M} c_j\, \Phi_n(z_i, y_j)$ (main estimator)

SLIDE 39

Algorithm

Permutation test for significance

◮ For $k = 1$ to $N$:
   $\pi \leftarrow \mathrm{Permutation}(M)$
   $\displaystyle F_k(z_i) \leftarrow \frac{1}{M} \sum_{j=1}^{M} c_{\pi(j)}\, \Phi_n(z_i, y_j)$
  end for
◮ $T(z_i) \leftarrow \mathrm{Percentile}\big(\{F_k(z_i)\}_{k=1}^{N}, p\big)$
◮ $D(z_i) \leftarrow \mathbb{1}\big[\, |\hat{F}(z_i)| > A \cdot T(z_i) \,\big]$
  ($D(z_i) = 1$ means $|\hat{F}(z_i)|$ is significant.)
◮ return $\hat{F}(z_i)$, $D(z_i)$
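A self-contained sketch of this algorithm. A Gaussian kernel stands in for $\Phi_n$ so the snippet runs on its own (the Hermite kernel sketched after the Simplification slide could be dropped in instead); taking absolute values of the permuted statistics for the threshold $T$ is my choice of a two-sided test, and the defaults for $p$, $N$, $A$ are illustrative.

```python
# Witness-function estimator with a permutation test for significance.
import numpy as np

def witness_with_permutation_test(X, Y, Z, Phi_n, A=2.0, p=95, N=200, rng=None):
    """Return (F_hat, D): witness estimate at each z in Z and a 0/1 significance flag."""
    rng = rng or np.random.default_rng()
    data = np.vstack([X, Y])
    c = np.concatenate([np.ones(len(X)), -np.ones(len(Y))])     # +1 for X, -1 for Y
    M = len(data)
    K = Phi_n(Z, data)                                           # K[i, j] = Phi_n(z_i, y_j)
    F_hat = K @ c / M                                            # main estimator
    # Permutation null: recompute the statistic with the labels c permuted N times.
    perms = np.stack([K @ c[rng.permutation(M)] / M for _ in range(N)])
    T = np.percentile(np.abs(perms), p, axis=0)
    D = (np.abs(F_hat) > A * T).astype(int)                      # 1 = significant
    return F_hat, D

if __name__ == "__main__":
    rng = np.random.default_rng(6)
    gauss = lambda Z, Y: np.exp(-((Z[:, None, :] - Y[None, :, :]) ** 2).sum(-1) / 0.5)
    X = rng.normal([1.5, 0.0], 0.4, (300, 2))                    # class +1
    Y = rng.normal([-1.5, 0.0], 0.4, (300, 2))                   # class -1
    Z = np.array([[1.5, 0.0], [-1.5, 0.0], [0.0, 0.0], [5.0, 5.0]])
    F_hat, D = witness_with_permutation_test(X, Y, Z, gauss, rng=rng)
    for z, f, d in zip(Z, F_hat, D):
        print(f"z={z}  F_hat={f:+.3f}  significant={d}")
```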

SLIDE 40
6. Examples
SLIDE 41

MNIST

(Left) Embedding of training data into 2D VAE latent space. (Right) Reconstructed images from grid sampling in 2D VAE latent space.

SLIDE 42

MNIST

Reconstructed images only of grid points that are deemed “significant” by the kernel. (Left) Witness function with the Hermite kernel. (Right) Witness function with the Gaussian kernel.

SLIDE 43

MNIST

Gaussian mixture model based generation of class centroids: prototypical points from each class of MNIST digits, computed from the 2D VAE embedding. (Left) Reconstructions of GMM centroids computed from all points. (Right) Reconstructions of GMM centroids computed from the witness-function region.

SLIDE 44

Science news data set

◮ 1046 articles in 8 categories, using 1153 popular words (binary vector rather than count of words)
◮ Used hierarchical topic modeling
◮ Generated "fake" documents from a grid deemed significant using our algorithm

SLIDE 45

Science news data set

(Left) Hierarchical topic embedding of documents. (Right) Embedding highlighted by grid points deemed to lie significantly within one class.

SLIDE 46

Science news data set

Improving nearest neighbor search

                   Neighbor in given set of docs   Centroid computed from given set of docs
All documents                51.43%                                53.44%
Sig. documents               71.56%                                76.35%

Accuracy with nearest-neighbor classification of Science News documents, across all documents and across only the significant documents.

SLIDE 47

CIFAR10

◮ Color images in 10 classes, 50K training, 10K test
◮ Features selected from the last hidden layer of VGG-16 (512 dimensions)
◮ Error rate: 0.6%.
◮ Goal: Based on the features, detect and remove the "bad apples" in the test data without a priori knowledge of the ground truth in the test data.

SLIDE 48

CIFAR10

(Left) Classification error on points deemed "significant" as a function of the level of confidence A. (Middle) Classification error as a function of the number of points removed for being "uncertain". (Right) The relationship between the number of points dropped and the parameter A.

SLIDE 49

Conclusions

◮ Generalization error is better measured pointwise (in a probabilistic sense).
◮ Our new approximation theory techniques
  ◮ are simple to implement,
  ◮ obtain universal approximation with no priors required,
  ◮ require only minimal optimization for training.

Further work:
◮ Approximation of measures on manifolds
◮ Precise estimates on out-of-sample extensions
◮ Feature selection

SLIDE 50

Thank you.
