SLIDE 1

A new perspective on machine learning

  • H. N. Mhaskar

Claremont Graduate University

ICERM Scientific Machine Learning January 28, 2019

SLIDE 2

Outline

◮ My understanding of the machine learning problem and its traditional solution.
◮ What bothers me about this.
◮ My own efforts to remedy the problems:
  ◮ Diffusion geometry based approach
  ◮ Application to diabetic sugar level prediction
  ◮ Problems
  ◮ Hermite polynomial based approach
  ◮ Applications

SLIDE 3

Problem of machine learning

Given data (training data) of the form $\{(x_j, y_j)\}_{j=1}^M$, where $y_j \in \mathbb{R}$ and the $x_j$'s are in some Euclidean space $\mathbb{R}^q$, find a function $P$ on a suitable domain

◮ that models the data well;
◮ in particular, $P(x_j) \approx y_j$.

SLIDE 4
1. Traditional paradigm
SLIDE 5

Basic set up

$\{(x_j, y_j)\}$ are i.i.d. samples from an unknown probability distribution $\mu$.
$f(x) = \mathbb{E}_\mu(y \mid x)$: the target function.
$\mu^*$ = marginal distribution of $x$; $\mathbb{X}$ = support of $\mu^*$.
$V_n \subset V_{n+1} \subset \cdots$ = classes of models, $V_n$ with complexity $n$ (typically, the number of parameters).

SLIDE 6

Traditional methodology

◮ Assume $f \in W_\gamma$ (smoothness class, prior, RKHS).
◮ Estimate $E_n(f) = \inf_{P \in V_n} \|f - P\|_{L^2(\mu^*)} = \|f - P^*\|_{L^2(\mu^*)}$. Decide upon the right value of $n$.
◮ Find
$$P^\# = \arg\min_{P \in V_n} \Big[ \mathrm{Loss}\big(\{y_\ell - P(x_\ell)\}\big) + \lambda \|P\|_{W_\gamma} \Big].$$
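To make this concrete: a minimal numpy sketch of the minimization above, with the common (but here merely illustrative) choices of squared loss and a Gaussian-kernel RKHS penalty, i.e., kernel ridge regression. The kernel, $\lambda$, and the toy data are my assumptions, not taken from the slides.

```python
# Sketch of the traditional paradigm with squared loss and an RKHS penalty
# (kernel ridge regression).  Kernel, lambda and toy data are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kernel(X, Z, width=0.5):
    # K[i, j] = exp(-|X_i - Z_j|^2 / (2 width^2))
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * width ** 2))

# Training data (x_j, y_j) with y_j = f(x_j) + noise, x_j in R^q, q = 1.
M = 100
X = rng.uniform(-1, 1, size=(M, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(M)

# P# = argmin_P  sum_j (y_j - P(x_j))^2 + lam * ||P||_K^2  has the closed form
# P#(x) = sum_j alpha_j K(x, x_j)  with  alpha = (K + lam I)^{-1} y.
lam = 1e-2
K = gaussian_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(M), y)

def P_sharp(Xnew):
    return gaussian_kernel(Xnew, X) @ alpha

Xtest = np.linspace(-1, 1, 200)[:, None]
print("max |P#(x) - f(x)| on a test grid:",
      round(np.max(np.abs(P_sharp(Xtest) - np.sin(3 * Xtest[:, 0]))), 3))
```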
SLIDE 7

Generalization error

$$\underbrace{\int_{\mathbb{X} \times \mathbb{R}} |y - P^\#(x)|^2 \, d\mu(y, x)}_{\text{generalization error}}
= \underbrace{\int_{\mathbb{X} \times \mathbb{R}} |y - f(x)|^2 \, d\mu(y, x)}_{\text{variance}}
+ \underbrace{\|f - P^*\|^2_{L^2(\mu^*)}}_{\text{approximation error}}
+ \underbrace{\|f - P^\#\|^2_{L^2(\mu^*)} - \|f - P^*\|^2_{L^2(\mu^*)}}_{\text{sampling error}}$$

Only the approximation error and sampling error can be controlled.
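A toy numerical check of the decomposition (assumptions, for illustration only: $\mu^*$ uniform on $[0,1]$, $f(x) = \sin 2\pi x$, Gaussian noise, $V_n$ = polynomials of degree $n$):

```python
# Toy check of the generalization-error decomposition above.
import numpy as np

rng = np.random.default_rng(1)
n, M, sigma = 5, 200, 0.3
f = lambda x: np.sin(2 * np.pi * x)

grid = np.linspace(0, 1, 20001)                      # dense grid for L2(mu*) integrals

# P*: best L2(mu*) approximation of f from V_n (least squares against f itself).
P_star = np.polynomial.Polynomial.fit(grid, f(grid), deg=n)

# P#: least-squares fit to M noisy samples.
X = rng.uniform(0, 1, M)
y = f(X) + sigma * rng.standard_normal(M)
P_sharp = np.polynomial.Polynomial.fit(X, y, deg=n)

variance = sigma ** 2                                 # E |y - f(x)|^2
approx_err = np.mean((f(grid) - P_star(grid)) ** 2)   # ||f - P*||^2
sampling_err = np.mean((f(grid) - P_sharp(grid)) ** 2) - approx_err

print("variance            ", round(variance, 4))
print("approximation error ", round(approx_err, 4))
print("sampling error      ", round(sampling_err, 4))
print("generalization error", round(variance + approx_err + sampling_err, 4))
```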

SLIDE 8

Observations on the paradigm

◮ Too complicated.
◮ Bounds on the approximation error are often obtained by explicit constructions. The approach makes no use of these constructions.
◮ Measuring errors in $L^2$ with function values makes sense only if $f$ is in some RKHS. So, the method is not universal.

SLIDE 9

Observations on the paradigm

Good is better than best

On the left, the log-plot of the absolute error between the function $x \mapsto |\cos x|^{1/4}$ and its Fourier projection. On the right, the corresponding plot with the trigonometric polynomial obtained by our summability operator. This is based on 128 equidistant samples. The order of the trigonometric polynomials is 31 in each case. The numbers on the x-axis are in multiples of $\pi$; the actual absolute errors are $10^y$.

[Figure: two log-error plots, Fourier projection (left) vs. summability operator (right).]
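The comparison can be reproduced in a few lines. The raised-cosine low-pass filter below is an illustrative stand-in for the talk's summability kernel (which is not specified here); the point it illustrates is that the filtered sum converges much faster away from the singularities of $|\cos x|^{1/4}$ than the "best" $L^2$ projection.

```python
# Fourier projection vs. a filtered (summability) sum for g(x) = |cos x|^{1/4},
# from 128 equispaced samples, trigonometric degree 31.  The raised-cosine
# filter h is an illustrative choice, not necessarily the one used in the talk.
import numpy as np

N, deg = 128, 31
x = 2 * np.pi * np.arange(N) / N
g = np.abs(np.cos(x)) ** 0.25

c = np.fft.fft(g) / N                      # discrete Fourier coefficients c_k
k = np.fft.fftfreq(N, d=1.0 / N)           # integer frequencies -64..63

def h(t):
    # smooth low-pass filter: 1 on [0, 1/2], 0 on [1, inf), raised cosine in between
    t = np.abs(t)
    return np.where(t <= 0.5, 1.0, np.where(t >= 1.0, 0.0, np.cos(np.pi * (t - 0.5)) ** 2))

xx = np.linspace(0, 2 * np.pi, 2000, endpoint=False)
E = np.exp(1j * np.outer(xx, k))           # e^{i k x} on a fine grid
gg = np.abs(np.cos(xx)) ** 0.25

fourier = (E * (c * (np.abs(k) <= deg))).sum(axis=1).real          # plain projection
filtered = (E * (c * h(np.abs(k) / (deg + 1)))).sum(axis=1).real   # summability sum

away = np.minimum(np.abs(xx - np.pi / 2), np.abs(xx - 3 * np.pi / 2)) > 0.3
print("max error away from the singularities:")
print("  Fourier projection:", np.max(np.abs(fourier - gg)[away]))
print("  filtered sum      :", np.max(np.abs(filtered - gg)[away]))
```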

SLIDE 10

Observations on the paradigm

◮ The choice of penalty functional/loss functional, kernels, etc. is often ad hoc, and assumes a prior on the target function.
◮ Performance guarantees on new and unseen data are often not easy to obtain, sometimes impossible.
◮ The optimization algorithms might not converge, or converge too slowly.
◮ The paradigm does not work in the context of deep learning.

SLIDE 11

Curse of dimensionality

The number of parameters required to get a generalization error of $\epsilon$ is at least a constant times $\epsilon^{-q/\gamma}$, where $\gamma$ = smoothness of $f$ and $q$ = number of input variables [1].

[1] Donoho, 2000; DeVore, Howard, Micchelli, 1989

SLIDE 12

Blessing of compositionality

Approximate $F(x_1, \ldots, x_4) = f(f_1(x_1, x_2), f_2(x_3, x_4))$ by $Q(x_1, \ldots, x_4) = P(P_1(x_1, x_2), P_2(x_3, x_4))$. Only functions of 2 variables are involved at each stage [2].

[2] Mhaskar, Poggio, 2016
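A small sketch of the idea: approximate each 2-variable constituent separately and compose. The constituent functions and the degree-3 bivariate polynomial model class are illustrative assumptions.

```python
# Compositional approximation: fit each 2-variable piece, then compose.
import numpy as np
from itertools import product

def poly_feats(u, v, deg=3):
    # all monomials u^a v^b with a + b <= deg
    return np.stack([u ** a * v ** b
                     for a, b in product(range(deg + 1), repeat=2) if a + b <= deg], axis=-1)

def fit2(fun, deg=3, m=40):
    # least-squares fit of a 2-variable function on a grid over [-1, 1]^2
    g = np.linspace(-1, 1, m)
    U, V = np.meshgrid(g, g)
    A = poly_feats(U.ravel(), V.ravel(), deg)
    coef, *_ = np.linalg.lstsq(A, fun(U.ravel(), V.ravel()), rcond=None)
    return lambda u, v: poly_feats(u, v, deg) @ coef

f1 = lambda a, b: np.sin(a + b)            # illustrative constituent functions
f2 = lambda a, b: a * b
f  = lambda a, b: np.cos(a - b)
F  = lambda x1, x2, x3, x4: f(f1(x1, x2), f2(x3, x4))

P1, P2, P = fit2(f1), fit2(f2), fit2(f)    # only 2-variable fits at each stage
Q = lambda x1, x2, x3, x4: P(P1(x1, x2), P2(x3, x4))

x = np.random.default_rng(2).uniform(-1, 1, size=(1000, 4))
print("max |F - Q| on random points in [-1, 1]^4:",
      round(np.max(np.abs(F(*x.T) - Q(*x.T))), 3))
```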

SLIDE 13

How to measure generalization error

$$\int |f(f_1(x_1, x_2), f_2(x_3, x_4)) - P(P_1(x_1, x_2), P_2(x_3, x_4))|^2 \, d\mu(x_1, x_2, x_3, x_4)$$
$\mu$ ignores compositionality.
$$\int |f(f_1, f_2) - P(P_1, P_2)|^2 \, d\nu(?)$$
The distributions of $(f_1, f_2)$ and $(P_1, P_2)$ are different. Must have a different notion of generalization error [3].

[3] Mhaskar, Poggio, 2016

SLIDE 14

A new look

Given data (training data) of the form $\{(x_j, y_j)\}_{j=1}^M$, where $y_j \in \mathbb{R}$ and the $x_j$'s are in some Euclidean space $\mathbb{R}^{q+1}$.

◮ Assume that there is an underlying target function $f : \mathbb{X} \to \mathbb{R}$ such that $y_j = f(x_j) + \epsilon_j$.
◮ No priors, just continuity.
◮ Use approximation theory to construct the approximation $P$.

SLIDE 15

Objectives

◮ Universal approximation with no prior assumptions.
◮ Generalization error defined pointwise, adjusting itself to the local smoothness.
◮ Optimization is substantially easier.
◮ Can be adapted to deep learning easily.

SLIDE 16

Problem: Classical approximation theory results are not adequate.

◮ They assume data distributed densely on a known domain (cube, sphere, etc.).
◮ The points $x_j$ need to be chosen judiciously; e.g., Driscoll-Healy points on the sphere or quadrature nodes on the cube.

SLIDE 17
2. Diffusion geometry based construction
SLIDE 18

Set up

Data $\{x_j\}$: an i.i.d. sample from a distribution $\mu^*$ supported on a smooth compact manifold $\mathbb{X}$ (unknown).
$\{\phi_k\}$: a system of eigenfunctions of a suitable PDE, with eigenvalues $\{\lambda_k\}$.
The $\phi_k$'s and $\lambda_k$'s are computed approximately from a "graph Laplacian" [4].

[4] Lafon, 2004; Singer, 2006; Belkin, Niyogi, 2008

SLIDE 19

Set up

Data $\{x_j\}$: an i.i.d. sample from a distribution $\mu^*$ supported on a smooth compact manifold $\mathbb{X}$ (unknown).
$\{\phi_k\}$: a system of eigenfunctions of a suitable PDE, with eigenvalues $\{\lambda_k\}$.
$\Pi_n = \mathrm{span}\{\phi_k : \lambda_k < n\}$.
$\|f\| = \sup_{x \in \mathbb{X}} |f(x)|, \qquad \|f\|_\gamma = \|f\| + \sup_{n \ge 1} n^\gamma \, \mathrm{dist}(f, \Pi_n).$
$\gamma$ is the smoothness of $f$.

SLIDE 20

Construction

$h$: a smooth low-pass filter (even, $= 1$ on $[0, 1/2]$, $= 0$ on $[1, \infty)$).
$$\Phi_n(x, y) = \sum_{0 \le k < n} h\!\left(\frac{\lambda_k}{n}\right) \phi_k(x)\, \phi_k(y).$$

Fact [4]: If the $x_j$'s are sufficiently dense, then there exist $w_j$ such that for all $P \in \Pi_n$,
$$\sum_j w_j P(x_j) = \int_{\mathbb{X}} P(x)\, d\mu^*(x), \qquad \sum_j |w_j P(x_j)| \le c \int_{\mathbb{X}} |P(x)|\, d\mu^*(x)$$
(Marcinkiewicz-Zygmund (MZ) quadrature).

[4] Filbir, Mhaskar, 2010, 2011

SLIDE 21

Algorithm

◮ Find $w_j$'s depending only on the $x_j$'s and construct [5]
$$P(x) = \sum_{j=1}^{M} w_j y_j \Phi_n(x, x_j)
= \underbrace{\sum_{j=1}^{M} w_j f(x_j) \Phi_n(x, x_j)}_{\sigma_n(f)(x)}
+ \underbrace{\sum_{j=1}^{M} w_j \epsilon_j \Phi_n(x, x_j)}_{\text{noise part}}.$$

[5] Ehler, Filbir, Mhaskar, 2012
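A minimal sketch of this estimator on synthetic data. Simplifying assumptions (mine, not the slides'): an unnormalized graph Laplacian with a Gaussian affinity, quadrature weights $w_j = 1/M$ in place of true MZ weights, an ad-hoc bandwidth $n$, and data on a circle so the answer is easy to check.

```python
# Diffusion-geometry estimator sigma_n(f): graph-Laplacian eigenpairs + a
# low-pass-filtered kernel Phi_n, evaluated at the data points themselves.
import numpy as np

rng = np.random.default_rng(3)
M = 400
theta = rng.uniform(0, 2 * np.pi, M)
X = np.column_stack([np.cos(theta), np.sin(theta)])   # samples on a circle in R^2

# Unnormalized graph Laplacian L = D - W with a Gaussian affinity.
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-d2 / 0.05)
L = np.diag(W.sum(axis=1)) - W
lam, phi = np.linalg.eigh(L)                          # approximate (lambda_k, phi_k)
phi *= np.sqrt(M)                                     # so that (1/M) sum_j phi_k(x_j)^2 = 1

def h(t):
    # smooth low-pass filter: 1 on [0, 1/2], 0 on [1, inf)
    return np.where(t <= 0.5, 1.0, np.where(t >= 1.0, 0.0, np.cos(np.pi * (t - 0.5)) ** 2))

# Noisy samples of a target function on the manifold.
f_vals = np.cos(3 * theta)
y = f_vals + 0.1 * rng.standard_normal(M)

# P(x_i) = sum_j w_j y_j Phi_n(x_i, x_j),  Phi_n = sum_k h(lambda_k / n) phi_k phi_k^T,
# with w_j = 1/M standing in for the MZ quadrature weights.
n = lam[40]                                           # illustrative bandwidth
Phi_n = (phi * h(lam / n)) @ phi.T
P = Phi_n @ (y / M)

print("mean abs error of the estimator:", round(np.mean(np.abs(P - f_vals)), 3))
```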

SLIDE 22

Theorem

$f \in W_\gamma$ if and only if $\|f - \sigma_n(f)\| = O(n^{-\gamma})$ [6]. If $f \in W_\gamma$ and $P$ is the noisy version of $\sigma_n(f)$, then with high probability, for $n \sim (M/\log M)^{1/(2q+2\gamma)}$,
$$\|f - P\| \le c\, n^{-\gamma}.$$

[6] Maggioni, Mhaskar, 2008

SLIDE 23
3. An application [7]

[7] Mhaskar, Pereverzyev, van der Walt, 2017

SLIDE 24

Continuous blood glucose monitoring

Source: http://www.dexcom.com/seven-plus

Problem: Estimate the future blood glucose level based on the past few readings, and the direction in which it is going (up or down).

SLIDE 25

PRED-EGA grid

◮ Numerical accuracy is not as critical as classification errors.
◮ Depending upon low, normal, or high blood sugar, the results are classified as accurate, wrong but with no serious consequences (benign), or outright errors.

SLIDE 26

Deep diffusion network

Given sugar levels $s(t_0), s(t_1), \ldots$ at times $t_0, t_1, \ldots$ for different patients, we form a data set $\mathcal{P} = \{(x_j, y_j)\}$, $x_j = (s(t_0), \ldots, s(t_6))$, $y_j = s(t_{12})$ (30-minute prediction), and a training set $\mathcal{C} \subset \mathcal{P}$.

SLIDE 27

Deep diffusion network

◮ Divide $\mathcal{C}$ into 3 clusters $\mathcal{C}_o$, $\mathcal{C}_e$, $\mathcal{C}_r$, depending on whether a 5-minute prediction indicates a hypo-glycemic, eu-glycemic, or hyper-glycemic condition [8].
◮ (First layer training) Use approximation theory based on $\{\lambda_k, \phi_k\}$ with 30% training data in each cluster to get predictions $f_o$, $f_e$, $f_r$ respectively [9]:
$$f_s(x) = \sum_{z \in \mathcal{C}_s} w_{z,s}\, f(z)\, \Phi_n(x, z), \qquad s = o, e, r, \quad x \in \mathcal{P}.$$
◮ (Second layer training) Using the same ideas, train a judge to decide, based on the training data, which prediction gives the best PRED-EGA grid result. (A toy sketch of this two-layer scheme follows below.)

[8] Mhaskar, Naumova, Pereverzyev, 2013
[9] Ehler, Filbir, Mhaskar, 2012
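A heavily simplified structural sketch of the two-layer scheme, on synthetic data. The Gaussian kernel standing in for $\Phi_n$, the clustering rule (thresholding the last reading instead of a 5-minute prediction), the weights $w_z = 1/|\mathcal{C}_s|$, and the trivialized "judge" (each input is simply routed to its own cluster's predictor) are my assumptions, not the published method.

```python
# Structural sketch of the "deep diffusion network" on synthetic glucose-like data.
import numpy as np

rng = np.random.default_rng(4)

def kernel(A, B, width=10.0):
    # Gaussian kernel standing in for the diffusion kernel Phi_n
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * width ** 2))

def kernel_estimate(Xtr, ytr, Xte):
    # f_s(x) = sum_z w_z y(z) Phi(x, z), normalized; true MZ weights not computed here
    K = kernel(Xte, Xtr)
    return (K @ ytr) / K.sum(axis=1)

# Synthetic trajectories: x_j = 7 past readings (5 min apart), y_j = reading 30 min later.
M = 600
base = rng.uniform(60, 220, M)                # mg/dL level per segment
drift = rng.uniform(-2, 2, M)                 # per-step trend
traj = base[:, None] + drift[:, None] * np.arange(13) + rng.normal(0, 2, (M, 13))
X, y = traj[:, :7], traj[:, 12]

# Clusters C_o, C_e, C_r: hypo- / eu- / hyper-glycemic range of the last reading.
labels = np.where(X[:, -1] < 70, 0, np.where(X[:, -1] <= 180, 1, 2))
train = rng.random(M) < 0.3                   # 30% training data

# First layer: one kernel predictor per cluster.  Second layer ("judge"),
# trivialized here: each input uses the predictor of its own cluster.
pred = np.zeros(M)
for s in range(3):
    tr, te = train & (labels == s), labels == s
    pred[te] = kernel_estimate(X[tr], y[tr], X[te])

print("mean abs 30-min prediction error (mg/dL):",
      round(np.mean(np.abs(pred[~train] - y[~train])), 1))
```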

SLIDE 28

Results

Used clinical readings for 26 patients from DirecNet data. Average percentages in each PRED-EGA category:

                                         Ho-A   Ho-B   Ho-E   Eu-A   Eu-B   Eu-E   Hr-A   Hr-B   Hr-E
Deep diffusion network (30% training)   95.81   2.80   1.40  82.96  15.26   1.79  65.65  21.56  12.79
Deep neural network (65% training)      48.33   4.51  47.16  80.26  14.94   4.80  65.38  17.41  17.21
Shallow network (65% training)          54.09   5.43  40.48  77.39  17.33   5.28  57.13  23.97  18.91

Ho = Hypo-glycemic, Eu = Eu-glycemic, Hr = Hyper-glycemic; A = Accurate, B = Benign, E = Erroneous.

SLIDE 29

Problems

◮ Out-of-sample extension
  ◮ Since the $\phi_k$'s are computed in an entirely data-dependent manner, a new computation is needed if a new datum appears.
  ◮ Nyström extension does not always have good approximation bounds.
◮ The measure $\mu^*$ is not known.

SLIDE 30
4. A more direct construction
SLIDE 31

Hermite functions

$$\psi_k(x) = \frac{(-1)^k}{\sqrt{\pi^{1/2}\, k!\, 2^k}}\, \exp(x^2/2) \left(\frac{d}{dx}\right)^{\!k} \exp(-x^2).$$
If $k = (k_1, \ldots, k_q)$, $x = (x_1, \ldots, x_q)$,
$$\psi_k(x) = \prod_{j=1}^{q} \psi_{k_j}(x_j), \qquad \int_{\mathbb{R}^q} \psi_k(x)\, \psi_m(x)\, dx = \delta_{k,m}.$$
$$\mathrm{Proj}_{m,q}(x, y) = \sum_{k : |k|_1 = m} \psi_k(x)\, \psi_k(y).$$
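The univariate $\psi_k$ can be evaluated stably with the standard three-term recurrence; a quick numerical orthonormality check (the recurrence and the grid are standard facts, not from the slides):

```python
# Normalized Hermite functions psi_k via the stable three-term recurrence,
# with a numerical check of orthonormality.
import numpy as np

def hermite_functions(x, kmax):
    """psi[k] = psi_k(x) for k = 0..kmax, evaluated at the points x."""
    x = np.asarray(x, dtype=float)
    psi = np.zeros((kmax + 1, x.size))
    psi[0] = np.pi ** (-0.25) * np.exp(-x ** 2 / 2)
    if kmax >= 1:
        psi[1] = np.sqrt(2.0) * x * psi[0]
    for k in range(1, kmax):
        psi[k + 1] = np.sqrt(2.0 / (k + 1)) * x * psi[k] - np.sqrt(k / (k + 1)) * psi[k - 1]
    return psi

x = np.linspace(-12, 12, 4001)
psi = hermite_functions(x, kmax=10)
gram = np.trapz(psi[:, None, :] * psi[None, :, :], x, axis=-1)   # int psi_j psi_k dx
print("max deviation from orthonormality:", np.abs(gram - np.eye(11)).max())
```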

SLIDE 32

Hermite functions

Mehler formula:
$$\sum_{m=0}^{\infty} w^m\, \mathrm{Proj}_{m,Q}(x, y) = \frac{1}{(\pi(1 - w^2))^{Q/2}} \exp\!\left(\frac{4w\, x \cdot y - (1 + w^2)(|x|^2 + |y|^2)}{2(1 - w^2)}\right).$$
With $(1 - w^2)^{(Q-q)/2} = \sum_{k=0}^{\infty} d_k w^k$,
$$\tilde{P}_{m,Q,q}(x, y) = \sum_{k=0}^{m} d_{m-k}\, \mathrm{Proj}_{k,Q}(x, y), \qquad \Phi_{n,Q,q}(x, y) = \sum_{m < n^2} h\!\left(\frac{\sqrt{m}}{n}\right) \tilde{P}_{m,Q,q}(x, y).$$

SLIDE 33

Recovery of functions

Let $\mathbb{X}$ be a smooth, compact, $q$-dimensional sub-manifold of $\mathbb{R}^Q$, let $\mu^*$ be its Riemannian volume measure, and let $0 < \gamma < 1$. For sufficiently smooth $f \in C(\mathbb{X})$ and $x \in \mathbb{X}$,
$$\left| \int_{\mathbb{X}} \Phi_{n,Q,q}(x, y)\, f(y)\, d\mu^*(y) - f(x) \right| \le c\, n^{-\gamma}.$$
SLIDE 34
5. An application [10]

[10] Mhaskar, Cloninger, Cheng, 2019

SLIDE 35

Discriminative model

Based on data $\{(x_j, y_j)\}_{j=1}^M$, $x_j \in \mathbb{R}^q$, with $y_j$ taking one of finitely many values, estimate the probability $p(y = k \mid x)$ for any $x$ in the domain space.

One-hot classification: for any label $k$, do a binary classification: output 1 if $x$ has the label $k$, $-1$ otherwise.

Problems
◮ Not every $x$ has a label.
◮ There may be more than one label associated with any $x$, with different probabilities.

SLIDE 36

Witness function

Marginal distribution of the $x_j$'s is $\mu^*$.

Wanted: a function $G$ such that $G(x) = 1$ if $x$ has label 1, $-1$ if $x$ has label $-1$, and $0$ if $x$ is not in the support of $\mu^*$ or has an uncertain label.

A generative network: a network $G$ with this property.

SLIDE 37

Simplification

Assume $Q = q$; write $\Phi_n$ for $\Phi_{n,q,q}$.

◮ $\mu^*$ is absolutely continuous: $d\mu^*(x) = f(x)\, dx$ ($dx$ = Lebesgue measure on $\mathbb{R}^q$).
◮ There are smooth functions $F_1(x) = 1$ if $x$ is reliably in class 1, $F_2(x) = 1$ if $x$ is reliably in class $-1$.

Class boundary: where $F(x) = F_1(x) - F_2(x)$ is small. This may mean that the label for $x$ is uncertain, or that $x$ is not in the support of $\mu^*$ and hence does not have a label.

Witness function for $F_1$, $F_2$:
$$G_n(x) = \frac{1}{M} \sum_{j=1}^{M} \big(F_1(x_j) - F_2(x_j)\big)\, \Phi_n(x, x_j).$$
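A sketch of $G_n$ with the filtered Hermite kernel in the case $q = 2$, on two toy Gaussian blobs. The filter $h$, the bandwidth $n$, and the data are illustrative assumptions; on the labelled samples $F_1 - F_2$ is just the $\pm 1$ label, which is all the estimator needs.

```python
# Witness function G_n with the Hermite kernel Phi_n (Q = q = 2 case).
import numpy as np

def hermite_functions(x, kmax):
    # normalized Hermite functions psi_0..psi_kmax via the stable recurrence
    x = np.asarray(x, dtype=float)
    psi = np.zeros((kmax + 1, x.size))
    psi[0] = np.pi ** (-0.25) * np.exp(-x ** 2 / 2)
    if kmax >= 1:
        psi[1] = np.sqrt(2.0) * x * psi[0]
    for k in range(1, kmax):
        psi[k + 1] = np.sqrt(2.0 / (k + 1)) * x * psi[k] - np.sqrt(k / (k + 1)) * psi[k - 1]
    return psi

def hermite_kernel_2d(X, Y, n):
    # Phi_n(x, y) = sum_{m < n^2} h(sqrt(m)/n) Proj_{m,2}(x, y)
    h = lambda t: np.where(t <= 0.5, 1.0, np.where(t >= 1.0, 0.0, np.cos(np.pi * (t - 0.5)) ** 2))
    kmax = n * n - 1
    px1, px2 = hermite_functions(X[:, 0], kmax), hermite_functions(X[:, 1], kmax)
    py1, py2 = hermite_functions(Y[:, 0], kmax), hermite_functions(Y[:, 1], kmax)
    K = np.zeros((X.shape[0], Y.shape[0]))
    for m in range(n * n):
        proj = sum(np.outer(px1[k1] * px2[m - k1], py1[k1] * py2[m - k1]) for k1 in range(m + 1))
        K += h(np.sqrt(m) / n) * proj
    return K

rng = np.random.default_rng(5)
M = 500
labels = rng.integers(0, 2, M) * 2 - 1                         # +1 / -1
centers = np.array([[1.5, 0.0], [-1.5, 0.0]])
data = np.where(labels[:, None] > 0, centers[0], centers[1]) + 0.4 * rng.standard_normal((M, 2))

# G_n(z) = (1/M) sum_j (F_1(x_j) - F_2(x_j)) Phi_n(z, x_j), with F_1 - F_2 = label.
# Expect: positive near (1.5, 0), negative near (-1.5, 0), ~0 in between and off-support.
Z = np.array([[1.5, 0.0], [-1.5, 0.0], [0.0, 0.0], [4.0, 4.0]])
G = hermite_kernel_2d(Z, data, n=4) @ labels / M
for z, g in zip(Z, G):
    print(f"G_n({z}) = {g:+.3f}")
```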

SLIDE 38

Algorithm

◮ Input: data sets $X$ and $Y$; points $Z = \{z_1, \ldots, z_K\}$ at which to inspect significance; level of confidence $A$.
◮ Tunable parameters: $p$, $N$, $A$.
◮ $y \leftarrow X \cup Y$; $\quad M \leftarrow |X| + |Y|$;
$$c_j \leftarrow \begin{cases} 1, & \text{if } y_j \in X, \\ -1, & \text{if } y_j \in Y. \end{cases}$$
◮ $\displaystyle \hat{F}(z_i) \leftarrow \frac{1}{M} \sum_{j=1}^{M} c_j\, \Phi_n(z_i, y_j)$ (main estimator)

SLIDE 39

Algorithm

Permutation test for significance

◮ For $k = 1$ to $N$:
   $\pi \leftarrow \mathrm{Permutation}(M)$
   $\displaystyle F_k(z_i) \leftarrow \frac{1}{M} \sum_{j=1}^{M} c_{\pi(j)}\, \Phi_n(z_i, y_j)$
  end for
◮ $T(z_i) \leftarrow \mathrm{Percentile}\big(\{F_k(z_i)\}_{k=1}^{N}, p\big)$
◮ $D(z_i) \leftarrow \mathbb{1}\big[\, |\hat{F}(z_i)| > A \cdot T(z_i) \,\big]$
  ($D(z_i) = 1$ means $|\hat{F}(z_i)|$ is significant.)
◮ return $\hat{F}(z_i)$, $D(z_i)$
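A self-contained sketch of this algorithm. A Gaussian kernel stands in for $\Phi_n$ so the snippet runs on its own (the Hermite kernel sketched after the Simplification slide could be dropped in instead); taking absolute values of the permuted statistics for the threshold $T$ is my choice of a two-sided test, and the defaults for $p$, $N$, $A$ are illustrative.

```python
# Witness-function estimator with a permutation test for significance.
import numpy as np

def witness_with_permutation_test(X, Y, Z, Phi_n, A=2.0, p=95, N=200, rng=None):
    """Return (F_hat, D): witness estimate at each z in Z and a 0/1 significance flag."""
    rng = rng or np.random.default_rng()
    data = np.vstack([X, Y])
    c = np.concatenate([np.ones(len(X)), -np.ones(len(Y))])     # +1 for X, -1 for Y
    M = len(data)
    K = Phi_n(Z, data)                                           # K[i, j] = Phi_n(z_i, y_j)
    F_hat = K @ c / M                                            # main estimator
    # Permutation null: recompute the statistic with the labels c permuted N times.
    perms = np.stack([K @ c[rng.permutation(M)] / M for _ in range(N)])
    T = np.percentile(np.abs(perms), p, axis=0)
    D = (np.abs(F_hat) > A * T).astype(int)                      # 1 = significant
    return F_hat, D

if __name__ == "__main__":
    rng = np.random.default_rng(6)
    gauss = lambda Z, Y: np.exp(-((Z[:, None, :] - Y[None, :, :]) ** 2).sum(-1) / 0.5)
    X = rng.normal([1.5, 0.0], 0.4, (300, 2))                    # class +1
    Y = rng.normal([-1.5, 0.0], 0.4, (300, 2))                   # class -1
    Z = np.array([[1.5, 0.0], [-1.5, 0.0], [0.0, 0.0], [5.0, 5.0]])
    F_hat, D = witness_with_permutation_test(X, Y, Z, gauss, rng=rng)
    for z, f, d in zip(Z, F_hat, D):
        print(f"z={z}  F_hat={f:+.3f}  significant={d}")
```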

SLIDE 40
6. Examples
SLIDE 41

MNIST

(Left) Embedding of training data into 2D VAE latent space. (Right) Reconstructed images from grid sampling in 2D VAE latent space.

SLIDE 42

MNIST

Reconstructed images only of grid points that are deemed “significant” by the kernel. (Left) Witness function with the Hermite kernel. (Right) Witness function with the Gaussian kernel.

SLIDE 43

MNIST

Gaussian mixture model based generation of class centroids: prototypical points from each class of MNIST digits, computed from the 2D VAE embedding. (Left) Reconstructions of GMM centroids computed from all points. (Right) Reconstructions of GMM centroids computed from the witness-function region.

SLIDE 44

Science news data set

◮ 1046 articles in 8 categories, using 1153 popular words (binary vector rather than count of words)
◮ Used hierarchical topic modeling
◮ Generated "fake" documents from a grid deemed significant using our algorithm

SLIDE 45

Science news data set

(Left) Hierarchical topic embedding of documents. (Right) Embedding highlighted by grid points deemed to lie significantly within one class.

SLIDE 46

Science news data set

Improving nearest neighbor search

                   Neighbor in given set of docs   Centroid computed from given set of docs
All documents                51.43%                                53.44%
Sig. documents               71.56%                                76.35%

Accuracy with nearest-neighbor classification of Science News documents, across all documents and across only the significant documents.

SLIDE 47

CIFAR10

◮ Color images in 10 classes, 50K training, 10K test
◮ Features selected from the last hidden layer of VGG-16 (512 dimensions)
◮ Error rate: 0.6%.
◮ Goal: Based on the features, detect and remove the "bad apples" in the test data without a priori knowledge of the ground truth in the test data.

SLIDE 48

CIFAR10

(Left) Classification error on points deemed "significant" as a function of the level of confidence A. (Middle) Classification error as a function of the number of points removed for being "uncertain". (Right) The relationship between the number of points dropped and the parameter A.

SLIDE 49

Conclusions

◮ Generalization error is better measured pointwise (in a probabilistic sense).
◮ Our new approximation theory techniques
  ◮ are simple to implement,
  ◮ obtain universal approximation with no priors required,
  ◮ require only minimal optimization for training.

Further work:
◮ Approximation of measures on manifolds
◮ Precise estimates on out-of-sample extensions
◮ Feature selection

SLIDE 50

Thank you.
