

SLIDE 1

Learning sums of ridge functions in high dimension: a nonlinear compressed sensing model

Massimo Fornasier

Fakultät für Mathematik, Technische Universität München
massimo.fornasier@ma.tum.de
http://www-m15.ma.tum.de/

Winter School on Compressed Sensing, Technical University of Berlin, December 3-5, 2015
Collection of joint results with Ingrid Daubechies, Karin Schnass, and Jan Vybíral

SLIDE 2

Introduction on ridge functions

◮ A ridge function - in its simplest form - is a function f : R^d → R of the type f(x) = g(aᵀx) = g(a·x), where g : R → R is a scalar univariate function and a ∈ R^d is the direction of the ridge function;

◮ ridge functions are constant along the hyperplanes a·x = λ for any given level λ ∈ R and are among the simplest forms of multivariate functions;

◮ they have been extensively studied in the past couple of decades as approximation building blocks for more complicated high-dimensional functions.

SLIDE 5

Some origins of ridge functions

◮ In multivariate Fourier series the basis functions are of the form e^{in·x} for n ∈ Z^d, and of the form e^{ia·x} for arbitrary directions a ∈ R^d in the Radon transform;

◮ the term "ridge function" was actually coined by Logan and Shepp in 1975 in their work on computerized tomography, where they show how ridge functions solve the corresponding L_2-minimum norm approximation problem.

SLIDE 7

Projection pursuit of the '80s

◮ Ridge function approximation was also extensively studied during the '80s in mathematical statistics under the name of projection pursuit (Huber, 1985; Donoho-Johnstone, 1989);

◮ projection pursuit algorithms approximate a function of d variables by functions of the form

    Σ_{i=1}^m g_i(a_i · x),  x ∈ R^d,

for some functions g_i : R → R and some non-zero vectors a_i ∈ R^d.

SLIDE 9

Some relevant applications of the '90s

◮ In the early '90s there was an explosion of interest in the field of neural networks. One very popular model is the multilayer feed-forward neural network with input, hidden (internal), and output layers;

◮ the simplest case of such a network is described mathematically by a function of the form

    Σ_{i=1}^m α_i σ( Σ_{j=1}^d w_{ij} x_j + θ_i ),

where σ : R → R is given and called the activation function, and the w_{ij} are suitable weights;

SLIDE 11

Ridge functions and approximation theory

◮ In the early '90s the question of whether one can use sums of ridge functions to approximate arbitrary functions well was at the center of the attention of the approximation theory community (overviews by Li, 2002, and Pinkus, 1997);

◮ the efficiency of such an approximation compared to, e.g., spline-type approximation for smoothness classes of functions has been extensively considered (DeVore et al., 1997; Petrushev, 1999);

◮ the identification of a ridge function has also been thoroughly considered; in particular we mention the work of Pinkus and, for what concerns multilayer neural networks, the work by Fefferman, 1994;

◮ except for the work of Candès on ridgelets, there has been less attention after 2000 on the problem of approximating functions by means of ridge functions.

SLIDE 16

Capturing ridge functions from point queries

◮ The above results on the identification of such functions are based on disposing of any possible output (function values, or even derivatives);

◮ this might be, in certain practical situations, very expensive, hazardous, or impossible;

◮ in a paper of 2012, Cohen, Daubechies, DeVore, Kerkyacharian, and Picard address the approximation of ridge functions by the minimal amount of sampling queries: for g ∈ C^s([0,1]), s > 1, ‖g‖_{C^s} ≤ M_0, and ‖a‖_{ℓ_q^d} ≤ M_1, 0 < q ≤ 1,

    ‖f − f̂‖_{C(Ω)} ≤ C M_0 ( L^{−s} + M_1 ( (1 + log(d/L)) / L )^{1/q − 1} ),

using 3L + 2 sampling points, deterministically and adaptively chosen.

SLIDE 20

Capturing ridge functions from point queries: a nonlinear compressed sensing model

Compressed sensing: given a suitable sensing matrix X ∈ R^{m×d}, with m ≪ d, we wish to identify a nearly sparse vector a ∈ R^d from its measurements y ≈ Xa, by means of suitable algorithms (ℓ_1-minimization, greedy algorithms) aware of y and X. The data

    y_i ≈ x_i · a = x_iᵀ a,  i = 1, ..., m,

are linear measurements of a. If now we assume the y_i to be the values of a ridge function at the points x_i,

    y_i ≈ g(a · x_i),  i = 1, ..., m,

for some unknown or roughly given nonlinear function g, the problem of identifying the ridge direction can be understood as a nonlinear compressed sensing model ...

SLIDE 25

Ridge functions and functions of data clustered around manifolds

Figure: Functions on data clustered around a manifold can be locally approximated by k-ridge functions.

SLIDE 26

Universal random sampling for a more general ridge model

◮ M. Fornasier, K. Schnass, J. Vybíral, Learning functions of few arbitrary linear parameters in high dimensions, FoCM, 2012

f(x) = g(Ax),  A is a k × d matrix.

Rows of A are compressible: max_i ‖a_i‖_{ℓ_q} ≤ C_1, 0 < q ≤ 1.

AAᵀ is the identity operator on R^k.

The regularity condition: sup_{|α| ≤ 2} ‖D^α g‖_∞ ≤ C_2.

The matrix H^f := ∫_{S^{d−1}} ∇f(x) ∇f(x)ᵀ dµ_{S^{d−1}}(x) is a positive semi-definite matrix of rank k.

We assume that the singular values of the matrix H^f satisfy

    σ_1(H^f) ≥ · · · ≥ σ_k(H^f) ≥ α > 0.

SLIDE 29

How can we learn k-ridge functions from point queries?

SLIDE 30

MD House's differential diagnosis (or simply called "sensitivity analysis")

We rely on the numerical approximation of ∂f/∂ϕ:

    ∇g(Ax)ᵀ A ϕ = ∂f/∂ϕ(x) = [f(x + ǫϕ) − f(x)]/ǫ − (ǫ/2)[ϕᵀ ∇²f(ζ) ϕ],  ǫ ≤ ǭ.   (∗)

X = {x_j ∈ Ω : j = 1, ..., m_X} drawn uniformly at random in Ω ⊂ R^d;

Φ = {ϕ^j ∈ R^d : j = 1, ..., m_Φ}, where

    ϕ^j_ℓ = +1/√m_Φ with prob. 1/2,  −1/√m_Φ with prob. 1/2,

for every j ∈ {1, ..., m_Φ} and every ℓ ∈ {1, ..., d}.
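To make this sampling scheme concrete, here is a minimal Python sketch of it (the function name, the choice Ω = [0,1]^d, and the default step size ǫ are our own illustrative assumptions, not part of the slides):

```python
import numpy as np

def finite_difference_data(f, d, m_X, m_Phi, eps=1e-3, rng=None):
    """Sketch of the sampling scheme (*): draw m_X points x_j in Omega,
    m_Phi Bernoulli directions phi^i with entries +-1/sqrt(m_Phi), and
    form the difference quotients y_ij ~ (df/dphi^i)(x_j)."""
    rng = np.random.default_rng(rng)
    X_pts = rng.random((m_X, d))      # x_j uniform in [0,1]^d (assumed Omega)
    Phi = rng.choice([-1.0, 1.0], size=(m_Phi, d)) / np.sqrt(m_Phi)
    f_at_X = np.array([f(x) for x in X_pts])   # m_X base evaluations
    # Y is m_Phi x m_X, one column per point: m_X * (m_Phi + 1) queries total
    Y = np.array([[(f(x + eps * phi) - fx) / eps
                   for x, fx in zip(X_pts, f_at_X)]
                  for phi in Phi])
    return X_pts, Phi, Y
```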

SLIDE 32

Sensitivity analysis

Figure: We perform randomized sensitivity analysis at randomly drawn points x ∈ S^{d−1}, queried at x and x + ǫϕ.

SLIDE 33

Collecting together the differential analysis

Φ ... the m_Φ × d matrix whose rows are the ϕ^i; X ... the d × m_X matrix

    X = ( Aᵀ∇g(Ax_1) | ... | Aᵀ∇g(Ax_{m_X}) ).

The m_X × m_Φ instances of (∗) read in matrix notation as

    ΦX = Y + E,   (∗∗)

where Y and E are the m_Φ × m_X matrices defined by

    y_{ij} = [f(x_j + ǫϕ^i) − f(x_j)]/ǫ,   ε_{ij} = −(ǫ/2)[(ϕ^i)ᵀ ∇²f(ζ_{ij}) ϕ^i].

SLIDE 34

Example of active coordinates: which factor does play a role?

We assume that A = ( e_{i_1}ᵀ ; ... ; e_{i_k}ᵀ ), i.e.,

    f(x) = f(x_1, ..., x_d) = g(x_{i_1}, ..., x_{i_k}),

where f : Ω = [0,1]^d → R and g : [0,1]^k → R.

We want to identify first the active coordinates i_1, ..., i_k. Then one can apply any usual k-dimensional approximation method...

A possible algorithm chooses the sampling points at random; due to concentration of measure effects, we get the right result with overwhelming probability.
SLIDE 37

A simple algorithm based on concentration of measure

The algorithm to identify the set of active coordinates I is based on the identity

    ΦᵀΦX = ΦᵀY + ΦᵀE,

where now X has i-th row

    X_i = ( ∂g/∂z_i(Ax_1), ..., ∂g/∂z_i(Ax_{m_X}) )

for i ∈ I, and all other rows equal to zero.

In expectation: ΦᵀΦ ≈ Id : R^d → R^d, so ΦᵀΦX ≈ X, and ΦᵀE is small ⟹ ΦᵀY ≈ X.

We select the k largest rows of ΦᵀY and estimate the probability that their indices coincide with the indices of the non-zero rows of X.

SLIDE 40

A first recovery result

Theorem (Schnass and Vybíral, 2011)

Let f : R^d → R be a function of k active coordinates that is defined and twice continuously differentiable on a small neighbourhood of [0,1]^d. For a positive real number L ≤ d, the randomized algorithm described above recovers the k unknown active coordinates of f with probability at least 1 − 6 exp(−L), using only O(k(L + log k)(L + log d)) samples of f. The constants involved in the O notation depend on smoothness properties of g, namely on the ratio

    max_{j=1,...,k} ‖∂_{i_j} g‖_∞ / min_{j=1,...,k} ‖∂_{i_j} g‖_1.

SLIDE 41

Examples of active coordinate detection in dimension d = 1000

Figure: max(1 − 5√((x_3 − 1/2)² + (x_4 − 1/2)²), 0)³ and sin(6π Σ_{i=21}^{40} x_i) + Σ_{i=21}^{40} [sin(6πx_i) + 5(x_i − 1/2)²]

SLIDE 42

Learning ridge functions, k = 1

Let f(x) = g(a·x), f : B_{R^d} → R, where a ∈ R^d with ‖a‖_2 = 1 and ‖a‖_q ≤ C_1, 0 < q ≤ 1, and max_{0 ≤ |α| ≤ 2} ‖D^α g‖_∞ ≤ C_2,

    α = ∫_{S^{d−1}} ‖∇f(x)‖²_{ℓ_2^d} dµ_{S^{d−1}}(x) = ∫_{S^{d−1}} |g′(a·x)|² dµ_{S^{d−1}}(x) > 0.

We consider again the Taylor expansion (∗), with Ω = S^{d−1}. We choose the points X = {x_j ∈ S^{d−1} : j = 1, ..., m_X} generated at random on S^{d−1} with respect to µ_{S^{d−1}}. The matrix Φ is generated as before, and we obtain (∗∗) again, in the form

    Φ[g′(a·x_j) a] = y_j + ε_j,  j = 1, ..., m_X.

SLIDE 44

Algorithm 1:

◮ Given m_Φ, m_X, draw at random the sets Φ and X, and construct Y according to (∗).

◮ Set x̂_j = ∆(y_j) := argmin_{y_j = Φz} ‖z‖_{ℓ_1^d}.

◮ Find j_0 = argmax_{j=1,...,m_X} ‖x̂_j‖_{ℓ_2^d}.

◮ Set â = x̂_{j_0} / ‖x̂_{j_0}‖_{ℓ_2^d}.

◮ Define ĝ(y) := f(â y) and f̂(x) := ĝ(â · x).
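A compact numerical sketch of Algorithm 1 follows, with the ℓ_1-decoder ∆ implemented as a standard linear program via SciPy (all names, the sphere sampler, and the LP formulation are our own assumptions; this is an illustration, not the authors' implementation):

```python
import numpy as np
from scipy.optimize import linprog

def l1_decoder(Phi, y):
    """Basis pursuit Delta(y) = argmin ||z||_1 s.t. Phi z = y,
    via the usual split z = u - v with u, v >= 0."""
    m, d = Phi.shape
    res = linprog(np.ones(2 * d), A_eq=np.hstack([Phi, -Phi]), b_eq=y,
                  bounds=[(0, None)] * (2 * d))
    return res.x[:d] - res.x[d:]

def algorithm1(f, d, m_X, m_Phi, eps=1e-3, rng=None):
    rng = np.random.default_rng(rng)
    X_pts = rng.standard_normal((m_X, d))
    X_pts /= np.linalg.norm(X_pts, axis=1, keepdims=True)  # uniform on S^{d-1}
    Phi = rng.choice([-1.0, 1.0], size=(m_Phi, d)) / np.sqrt(m_Phi)
    Y = np.array([[(f(x + eps * phi) - f(x)) / eps for phi in Phi]
                  for x in X_pts])                          # m_X x m_Phi
    X_hat = np.array([l1_decoder(Phi, y) for y in Y])       # decoded columns
    j0 = np.argmax(np.linalg.norm(X_hat, axis=1))
    a_hat = X_hat[j0] / np.linalg.norm(X_hat[j0])
    g_hat = lambda t: f(t * a_hat)
    f_hat = lambda x: g_hat(a_hat @ x)
    return a_hat, f_hat
```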

SLIDE 45

Recovery result

Theorem (F., Schnass, and Vybíral, 2012)

Let 0 < s < 1 and log d ≤ m_Φ ≤ [log 6]² d. Then there is a constant c′_1 such that, using m_X · (m_Φ + 1) function evaluations of f, Algorithm 1 defines a function f̂ : B_{R^d}(1 + ǭ) → R that, with probability at least

    1 − ( e^{−c′_1 m_Φ} + e^{−√(m_Φ d)} + 2e^{−2 m_X s² α² / C_2⁴} ),

will satisfy

    ‖f − f̂‖_∞ ≤ 2C_2(1 + ǭ) ν_1 / ( √(α(1 − s)) − ν_1 ),

where

    ν_1 = C′ ( (m_Φ / log(d/m_Φ))^{1/2 − 1/q} + ǫ/√m_Φ ),

and C′ depends only on C_1 and C_2.
SLIDE 46

Ingredients of the proof

◮ compressed sensing;
◮ stability of one-dimensional subspaces;
◮ concentration inequalities (Hoeffding's inequality).

SLIDE 49

Compressed sensing

Theorem (Wojtaszczyk, 2011)

Assume that Φ is an m × d random matrix with all entries being independent Bernoulli variables scaled by 1/√m. Let us suppose that d > [log 6]² m. Then there are positive constants C, c′_1, c′_2 > 0 such that, with probability at least

    1 − e^{−c′_1 m} − e^{−√(md)},

the matrix Φ has the following property. For every x ∈ R^d, ε ∈ R^m, and every natural number K ≤ c′_2 m / log(d/m), we have

    ‖∆(Φx + ε) − x‖_{ℓ_2^d} ≤ C ( K^{−1/2} σ_K(x)_{ℓ_1^d} + max{ ‖ε‖_{ℓ_2^m}, √(log d) ‖ε‖_{ℓ_∞^m} } ),

where

    σ_K(x)_{ℓ_1^d} := inf{ ‖x − z‖_{ℓ_1^d} : # supp z ≤ K }

is the error of the best K-term approximation of x.

SLIDE 53

How does compressed sensing play a role?

For the d × m_X matrix X, i.e., X = ( g′(a·x_1)a | ... | g′(a·x_{m_X})a ), we have

    Φ x^j = y_j + ε_j,  j = 1, ..., m_X,  where x^j := g′(a·x_j) a is the j-th column of X,

and

    x̂_j = ∆(y_j) := argmin_{y_j = Φz} ‖z‖_{ℓ_1^d}.

The previous result gives - with the probability provided there -

    x̂_j = g′(a·x_j) a + n_j,

with n_j properly estimated by

    ‖n_j‖_{ℓ_2^d} ≤ C ( K^{−1/2} σ_K(g′(a·x_j)a)_{ℓ_1^d} + max{ ‖ε_j‖_{ℓ_2^m}, √(log d) ‖ε_j‖_{ℓ_∞^m} } ).
SLIDE 56

Some computations

Let us estimate the quantities. By Stechkin's inequality, for which

    σ_K(x)_{ℓ_1^d} ≤ ‖x‖_{ℓ_q^d} K^{1 − 1/q}  for all x ∈ R^d,

one obtains, for x^j = g′(a·x_j) a,

    K^{−1/2} σ_K(x^j)_{ℓ_1^d} ≤ |g′(a·x_j)| · ‖a‖_{ℓ_q^d} · K^{1/2 − 1/q} ≤ C_1 C_2 (m_Φ / log(d/m_Φ))^{1/2 − 1/q}.

Moreover,

    ‖ε_j‖_{ℓ_∞^{m_Φ}} = (ǫ/2) · max_{i=1,...,m_Φ} |(ϕ^i)ᵀ ∇²f(ζ_{ij}) ϕ^i|
                      ≤ (ǫ/(2m_Φ)) · max_{i=1,...,m_Φ} Σ_{k,l=1}^d |a_k a_l g″(a·ζ_{ij})|
                      ≤ (ǫ‖g″‖_∞/(2m_Φ)) ( Σ_{k=1}^d |a_k| )²
                      ≤ (ǫ‖g″‖_∞/(2m_Φ)) ( Σ_{k=1}^d |a_k|^q )^{2/q} ≤ C_1² C_2 ǫ / (2m_Φ),

and

    ‖ε_j‖_{ℓ_2^{m_Φ}} ≤ √m_Φ ‖ε_j‖_{ℓ_∞^{m_Φ}} ≤ C_1² C_2 ǫ / (2√m_Φ),

leading to

    max{ ‖ε_j‖_{ℓ_2^{m_Φ}}, √(log d) ‖ε_j‖_{ℓ_∞^{m_Φ}} } ≤ (C_1² C_2 ǫ/(2√m_Φ)) · max{ 1, √(log d / m_Φ) } = C_1² C_2 ǫ / (2√m_Φ),

since m_Φ ≥ log d.

SLIDE 62

Summarizing ...

With high probability,

    x̂_j = g′(a·x_j) a + n_j,

where

    ‖n_j‖_{ℓ_2^d} ≤ C ( K^{−1/2} σ_K(g′(a·x_j)a)_{ℓ_1^d} + max{ ‖ε_j‖_{ℓ_2^m}, √(log d) ‖ε_j‖_{ℓ_∞^m} } )
                 ≤ C′ ( (m_Φ / log(d/m_Φ))^{1/2 − 1/q} + ǫ/√m_Φ ) := ν_1.
SLIDE 64

Stability of one-dimensional subspaces

Lemma

Let us fix x̂ ∈ R^d, a ∈ S^{d−1}, 0 ≠ γ ∈ R, and n ∈ R^d with norm ‖n‖_{ℓ_2^d} ≤ ν_1 < |γ|. If we assume x̂ = γa + n, then

    ‖ sign(γ) x̂/‖x̂‖_{ℓ_2^d} − a ‖_{ℓ_2^d} ≤ 2ν_1 / ‖x̂‖_{ℓ_2^d}.

We recall that x̂_j = g′(a·x_j) a + n_j, and

    max_j ‖x̂_j‖_{ℓ_2^d} ≥ max_j |g′(a·x_j)| − max_j ‖x̂_j − x^j‖_{ℓ_2^d} ≥ max_j |g′(a·x_j)| − ν_1,

where the term max_j |g′(a·x_j)| still needs to be estimated from below.

SLIDE 66

Concentration inequalities I

Lemma (Hoeffding's inequality)

Let X_1, ..., X_m be independent random variables. Assume that the X_j are almost surely bounded, i.e., there exist finite scalars a_j, b_j such that P{X_j − EX_j ∈ [a_j, b_j]} = 1 for j = 1, ..., m. Then we have

    P( | Σ_{j=1}^m X_j − E[ Σ_{j=1}^m X_j ] | ≥ t ) ≤ 2 e^{−2t² / Σ_{j=1}^m (b_j − a_j)²}.

Let us now apply Hoeffding's inequality to the random variables X_j = |g′(a·x_j)|².

SLIDE 68

Probabilistic estimates from below

By applying Hoeffding's inequality to the random variables X_j = |g′(a·x_j)|², we have

Lemma

Let us fix 0 < s < 1. Then, with probability at least 1 − 2e^{−2 m_X s² α² / C_2⁴}, we have

    max_{j=1,...,m_X} |g′(a·x_j)| ≥ √(α(1 − s)),

where

    α := E_x(|g′(a·x)|²) = ∫_{S^{d−1}} |g′(a·x)|² dµ_{S^{d−1}}(x) = ∫_{S^{d−1}} ‖∇f(x)‖²_{ℓ_2^d} dµ_{S^{d−1}}(x) > 0.

SLIDE 69

Algorithm 1:

◮ Given m_Φ, m_X, draw at random the sets Φ and X, and construct Y according to (∗).
◮ Set x̂_j = ∆(y_j) := argmin_{y_j = Φz} ‖z‖_{ℓ_1^d}.
◮ Find j_0 = argmax_{j=1,...,m_X} ‖x̂_j‖_{ℓ_2^d}.
◮ Set â = x̂_{j_0} / ‖x̂_{j_0}‖_{ℓ_2^d}.
◮ Define ĝ(y) := f(â y) and f̂(x) := ĝ(â · x).

SLIDE 70

Recovery result

Theorem (F., Schnass, and Vybíral, 2012)

Let 0 < s < 1 and log d ≤ m_Φ ≤ [log 6]² d. Then there is a constant c′_1 such that, using m_X · (m_Φ + 1) function evaluations of f, Algorithm 1 defines a function f̂ : B_{R^d}(1 + ǭ) → R that, with probability at least

    1 − ( e^{−c′_1 m_Φ} + e^{−√(m_Φ d)} + 2e^{−2 m_X s² α² / C_2⁴} ),

will satisfy

    ‖f − f̂‖_∞ ≤ 2C_2(1 + ǭ) ν_1 / ( √(α(1 − s)) − ν_1 ),

where ν_1 = C′ ( (m_Φ / log(d/m_Φ))^{1/2 − 1/q} + ǫ/√m_Φ ) and C′ depends only on C_1 and C_2.
SLIDE 71

Concentration of measure phenomenon and risk of intractability

A key role is played by

    α = ∫_{S^{d−1}} |g′(a·x)|² dµ_{S^{d−1}}(x).

Due to symmetry it is ... independent of a.

Push-forward measure µ_1 on [−1, 1]:

    α = ∫_{−1}^{1} |g′(y)|² dµ_1(y) = Γ(d/2) / (π^{1/2} Γ((d−1)/2)) ∫_{−1}^{1} |g′(y)|² (1 − y²)^{(d−3)/2} dy.

µ_1 concentrates around zero exponentially fast as d → ∞.

SLIDE 74

Dependence on the dimension d

Proposition

Let us fix M ∈ N and assume that g : [−1, 1] → R is C^{M+2}-differentiable in an open neighbourhood U of 0 and

    (d^ℓ/dx^ℓ) g(0) = 0  for ℓ = 1, ..., M.

Then α(d) = O(d^{−M}) for d → ∞.
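A quick numerical illustration of this decay (our own check, not from the slides): for g(y) = y²/2 one has g′(0) = 0, i.e. M = 1, and in fact α(d) = E_{µ_1}[y²] = 1/d exactly, so d · α(d) stays constant:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gammaln

def alpha(d, g_prime):
    """alpha(d) via the push-forward formula of slide 71."""
    # log of Gamma(d/2) / (pi^{1/2} Gamma((d-1)/2)), for numerical stability
    log_c = gammaln(d / 2) - 0.5 * np.log(np.pi) - gammaln((d - 1) / 2)
    integrand = lambda y: g_prime(y) ** 2 * (1.0 - y * y) ** ((d - 3) / 2)
    val, _ = quad(integrand, -1.0, 1.0, limit=200)
    return np.exp(log_c) * val

for d in (10, 100, 1000, 10000):
    print(d, d * alpha(d, lambda y: y))   # ~1 for all d: alpha(d) = 1/d
```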

SLIDE 75

Tractability classes

(1) For 0 < q ≤ 1, C_1 > 1, and C_2 ≥ α_0 > 0, we define

    F¹_d := F¹_d(α_0, q, C_1, C_2) := { f : B_{R^d} → R : ∃a ∈ R^d, ‖a‖_{ℓ_2^d} = 1, ‖a‖_{ℓ_q^d} ≤ C_1,
        and ∃g ∈ C²(B_R), |g′(0)| ≥ α_0 > 0 : f(x) = g(a·x) }.

(2) For a neighborhood U of 0, 0 < q ≤ 1, C_1 > 1, C_2 ≥ α_0 > 0, and N ≥ 2, we define

    F²_d := F²_d(U, α_0, q, C_1, C_2, N) := { f : B_{R^d} → R : ∃a ∈ R^d, ‖a‖_{ℓ_2^d} = 1, ‖a‖_{ℓ_q^d} ≤ C_1,
        and ∃g ∈ C²(B_R) ∩ C^N(U), ∃ 0 ≤ M ≤ N − 1, |g^{(M)}(0)| ≥ α_0 > 0 : f(x) = g(a·x) }.

(3) For a neighborhood U of 0, 0 < q ≤ 1, C_1 > 1, and C_2 ≥ α_0 > 0, we define

    F³_d := F³_d(U, α_0, q, C_1, C_2) := { f : B_{R^d} → R : ∃a ∈ R^d, ‖a‖_{ℓ_2^d} = 1, ‖a‖_{ℓ_q^d} ≤ C_1,
        and ∃g ∈ C²(B_R) ∩ C^∞(U), |g^{(M)}(0)| = 0 for all M ∈ N : f(x) = g(a·x) }.

SLIDE 78

Tractability result

Corollary

The problem of learning functions f in the classes F¹_d and F²_d from point evaluations is strongly polynomially tractable (no polynomial dependence on d) and polynomially tractable (with polynomial dependence on d), respectively.

SLIDE 79

Intractability

On the one hand, let us notice that if in the class F³_d we remove the condition ‖a‖_{ℓ_q^d} ≤ C_1, then the problem actually becomes intractable. Let g ∈ C²([−1 − ǭ, 1 + ǭ]) be given by g(y) = 8(y − 1/2)³ for y ∈ [1/2, 1 + ǭ] and zero otherwise. Notice that, for every a ∈ R^d with ‖a‖_{ℓ_2^d} = 1, the function f(x) = g(a·x) vanishes everywhere on S^{d−1} outside of the cap

    U(a, 1/2) := { x ∈ S^{d−1} : a·x ≥ 1/2 }.

Figure: The function g and the spherical cap U(a, 1/2).

SLIDE 82

Intractability

The µ_{S^{d−1}} measure of U(a, 1/2) obviously does not depend on a and is known to be exponentially small in d. Furthermore, it is known that there are a constant c > 0 and unit vectors a_1, ..., a_K such that the sets U(a_1, 1/2), ..., U(a_K, 1/2) are mutually disjoint and K ≥ e^{cd}. Finally, we observe that max_{x ∈ S^{d−1}} |f(x)| = f(a) = g(1) = 1.

We conclude that any algorithm making use only of the structure f(x) = g(a·x) and of the condition ‖a‖_{ℓ_2^d} = 1 needs to use exponentially many sampling points in order to distinguish between f(x) ≡ 0 and f(x) = g(a_i·x) for some of the a_i's as constructed above.

SLIDE 84

Truly k-ridge functions for k ≫ 1

f(x) = g(Ax),  A is a k × d matrix.

Rows of A are compressible: max_i ‖a_i‖_{ℓ_q} ≤ C_1.

AAᵀ is the identity operator on R^k.

The regularity condition: sup_{|α| ≤ 2} ‖D^α g‖_∞ ≤ C_2.

The matrix H^f := ∫_{S^{d−1}} ∇f(x) ∇f(x)ᵀ dµ_{S^{d−1}}(x) is a positive semi-definite matrix of rank k.

We assume that the singular values of the matrix H^f satisfy σ_1(H^f) ≥ · · · ≥ σ_k(H^f) ≥ α > 0.

SLIDE 87

MD House's differential diagnosis (or simply called "sensitivity analysis")

We rely again on the numerical approximation of ∂f/∂ϕ:

    ∇g(Ax)ᵀ A ϕ = ∂f/∂ϕ(x) = [f(x + ǫϕ) − f(x)]/ǫ − (ǫ/2)[ϕᵀ ∇²f(ζ) ϕ],  ǫ ≤ ǭ,   (∗)

with X = {x_j ∈ Ω : j = 1, ..., m_X} drawn uniformly at random in Ω ⊂ R^d, and Φ = {ϕ^j ∈ R^d : j = 1, ..., m_Φ}, where ϕ^j_ℓ = ±1/√m_Φ with probability 1/2 each, for every j ∈ {1, ..., m_Φ} and every ℓ ∈ {1, ..., d}.

SLIDE 89

Sensitivity analysis

Figure: We perform randomized sensitivity analysis at randomly drawn points x ∈ S^{d−1}, queried at x and x + ǫϕ.

SLIDE 90

Collecting together the differential analysis

Φ ... the m_Φ × d matrix whose rows are the ϕ^i; X ... the d × m_X matrix

    X = ( Aᵀ∇g(Ax_1) | ... | Aᵀ∇g(Ax_{m_X}) ).

The m_X × m_Φ instances of (∗) read in matrix notation as ΦX = Y + E (∗∗), with Y and E the m_Φ × m_X matrices defined by

    y_{ij} = [f(x_j + ǫϕ^i) − f(x_j)]/ǫ,   ε_{ij} = −(ǫ/2)[(ϕ^i)ᵀ ∇²f(ζ_{ij}) ϕ^i].

SLIDE 91

Algorithm 2:

◮ Given m_Φ, m_X, draw at random the sets Φ and X, and construct Y according to (∗).

◮ Set x̂_j = ∆(y_j) := argmin_{y_j = Φz} ‖z‖_{ℓ_1^d} for j = 1, ..., m_X, so that X̂ = (x̂_1 | ... | x̂_{m_X}) is again a d × m_X matrix.

◮ Compute the singular value decomposition of

    X̂ᵀ = ( Û_1  Û_2 ) diag(Σ̂_1, Σ̂_2) ( V̂_1ᵀ ; V̂_2ᵀ ),

where Σ̂_1 contains the k largest singular values.

◮ Set Â = V̂_1ᵀ.

◮ Define ĝ(y) := f(Âᵀ y) and f̂(x) := ĝ(Âx).
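A minimal sketch of Algorithm 2 follows, reusing l1_decoder from the Algorithm 1 sketch (the sphere sampler and all names remain our illustrative assumptions):

```python
import numpy as np

def algorithm2(f, d, k, m_X, m_Phi, eps=1e-3, rng=None):
    rng = np.random.default_rng(rng)
    X_pts = rng.standard_normal((m_X, d))
    X_pts /= np.linalg.norm(X_pts, axis=1, keepdims=True)   # uniform on S^{d-1}
    Phi = rng.choice([-1.0, 1.0], size=(m_Phi, d)) / np.sqrt(m_Phi)
    Y = np.array([[(f(x + eps * phi) - f(x)) / eps for phi in Phi]
                  for x in X_pts])                           # m_X x m_Phi
    X_hat = np.column_stack([l1_decoder(Phi, y) for y in Y])  # d x m_X
    # top-k right singular vectors of X_hat^T span (approximately) the row space of A
    _, _, Vt = np.linalg.svd(X_hat.T, full_matrices=False)
    A_hat = Vt[:k]                                           # k x d
    g_hat = lambda y: f(A_hat.T @ y)
    f_hat = lambda x: g_hat(A_hat @ x)
    return A_hat, f_hat
```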

SLIDE 92

The control of the error

The quality of the final approximation of f by means of f̂ depends on two kinds of accuracy:

1. the error between X̂ and X, which can be controlled through the number of compressed sensing measurements m_Φ;

2. the stability of the span of V_1ᵀ, simply characterized by how well the singular values of X (or, equivalently, of G) are separated from 0, which is related to the number of random samples m_X.

To be precise, we have

SLIDE 94

Recovery result

Theorem (F., Schnass, and Vybíral)

Let log d ≤ m_Φ ≤ [log 6]² d. Then there is a constant c′_1 such that, using m_X · (m_Φ + 1) function evaluations of f, Algorithm 2 defines a function f̂ : B_{R^d}(1 + ǭ) → R that, with probability at least

    1 − ( e^{−c′_1 m_Φ} + e^{−√(m_Φ d)} + k e^{−m_X α s² / (2kC_2²)} ),

will satisfy

    ‖f − f̂‖_∞ ≤ 2C_2 √k (1 + ǭ) ν_2 / ( √(α(1 − s)) − ν_2 ),

where

    ν_2 = C ( k^{1/q} (m_Φ / log(d/m_Φ))^{1/2 − 1/q} + ǫ k² / √m_Φ ),

and C depends only on C_1 and C_2.

SLIDE 95

Ingredients of the proof

◮ compressed sensing;
◮ stability of the SVD;
◮ concentration inequalities (Chernoff bounds for sums of positive-semidefinite matrices).

SLIDE 98

Compressed sensing

Corollary (after Wojtaszczyk, 2011)

Let log d ≤ m_Φ < [log 6]² d. Then, with probability at least 1 − (e^{−c′_1 m_Φ} + e^{−√(m_Φ d)}), the matrix X̂ as calculated in Algorithm 2 satisfies

    ‖X − X̂‖_F ≤ C √m_X ( k^{1/q} (m_Φ / log(d/m_Φ))^{1/2 − 1/q} + ǫ k² / √m_Φ ),

where C depends only on C_1 and C_2.

SLIDE 99

Stability of SVD

Given two matrices B and B̂ with corresponding singular value decompositions

    B = ( U_1  U_2 ) diag(Σ_1, Σ_2) ( V_1ᵀ ; V_2ᵀ )  and  B̂ = ( Û_1  Û_2 ) diag(Σ̂_1, Σ̂_2) ( V̂_1ᵀ ; V̂_2ᵀ ),

we have:

SLIDE 100

Wedin's bound

Theorem (Stability of subspaces)

If there is an ᾱ > 0 such that

    min_{ℓ, ℓ̂} |σ_{ℓ̂}(Σ̂_1) − σ_ℓ(Σ_2)| ≥ ᾱ  and  min_{ℓ̂} |σ_{ℓ̂}(Σ̂_1)| ≥ ᾱ,

then

    ‖V_1V_1ᵀ − V̂_1V̂_1ᵀ‖_F ≤ (2/ᾱ) ‖B − B̂‖_F.

SLIDE 101

Wedin's bound

Applied to our situation, where X has rank k and thus Σ_2 = 0, we get

    ‖V_1V_1ᵀ − V̂_1V̂_1ᵀ‖_F ≤ 2√m_X ν_2 / σ_k(X̂ᵀ),

and further, since σ_k(X̂ᵀ) ≥ σ_k(Xᵀ) − ‖X − X̂‖_F, that

    ‖V_1V_1ᵀ − V̂_1V̂_1ᵀ‖_F ≤ 2√m_X ν_2 / ( σ_k(Xᵀ) − √m_X ν_2 ).

Note that

    Xᵀ = GA = U_G Σ_G [V_Gᵀ A],  for G = ( ∇g(Ax_1) | ... | ∇g(Ax_{m_X}) )ᵀ,

hence Σ_{Xᵀ} = Σ_G. Moreover σ_i(G) = √(σ_i(GᵀG)) for all i = 1, ..., k.

SLIDE 103

Concentration inequalities II

Theorem (Matrix Chernoff bounds)

Consider X_1, ..., X_m independent random, positive-semidefinite matrices of dimension k × k, and suppose σ_1(X_j) ≤ C almost surely. Compute the singular values of the sum of the expectations,

    µ_max = σ_1( Σ_{j=1}^m E X_j )  and  µ_min = σ_k( Σ_{j=1}^m E X_j ).

Then

    P( σ_1( Σ_{j=1}^m X_j ) − µ_max ≥ s µ_max ) ≤ k ( e/(1+s) )^{µ_max (1+s)/C}  for all s > e − 1,

and

    P( σ_k( Σ_{j=1}^m X_j ) − µ_min ≤ −s µ_min ) ≤ k e^{−µ_min s² / (2C)}  for all s ∈ (0, 1).

SLIDE 104

Note that

    GᵀG = Σ_{j=1}^{m_X} ∇g(Ax_j) ∇g(Ax_j)ᵀ,

and by applying the previous result to X_j = ∇g(Ax_j) ∇g(Ax_j)ᵀ we have:

Lemma

For any s ∈ (0, 1) we have that

    σ_k(Xᵀ) ≥ √( m_X α (1 − s) )

with probability at least 1 − k e^{−m_X α s² / (2kC_2²)}.

SLIDE 105

Proof of Theorem

With probability at least

    1 − ( e^{−c′_1 m_Φ} + e^{−√(m_Φ d)} + k e^{−m_X α s² / (2kC_2²)} ),

we have

    ‖V_1V_1ᵀ − V̂_1V̂_1ᵀ‖_F ≤ 2ν_2 / ( √(α(1 − s)) − ν_2 ),

and, for Â = V̂_1ᵀ and V_Gᵀ A = V_1ᵀ,

    ‖AᵀA − ÂᵀÂ‖_F = ‖AᵀV_G V_Gᵀ A − V̂_1V̂_1ᵀ‖_F ≤ 2ν_2 / ( √(α(1 − s)) − ν_2 ).

SLIDE 107

Proof of Theorem ... continued

Since A is row-orthogonal we have A = AAᵀA, and

    |f(x) − f̂(x)| = |g(Ax) − ĝ(Âx)| = |g(Ax) − g(AÂᵀÂx)|
                  ≤ C_2 √k ‖Ax − AÂᵀÂx‖_{ℓ_2^k}
                  = C_2 √k ‖A(AᵀA − ÂᵀÂ)x‖_{ℓ_2^k}
                  ≤ C_2 √k ‖AᵀA − ÂᵀÂ‖_F ‖x‖_{ℓ_2^d}
                  ≤ 2C_2 √k (1 + ǭ) ν_2 / ( √(α(1 − s)) − ν_2 ),

where we used

    ‖AᵀA − ÂᵀÂ‖_F = ‖AᵀV_G V_Gᵀ A − V̂_1V̂_1ᵀ‖_F ≤ 2ν_2 / ( √(α(1 − s)) − ν_2 ).

SLIDE 108

k-ridge functions may be too simple!

Figure: Functions on data clustered around a manifold with multiple directions can be locally approximated by sums of k-ridge functions.

SLIDE 109

Sums of ridge functions

Can we still learn functions of the type

    f(x) = Σ_{i=1}^m g_i(a_i · x),  x ∈ [−1, 1]^d ?

Our approach (Daubechies, F., Vybíral) is essentially based on the formula

    D^{α_1}_{c_1} ... D^{α_k}_{c_k} f(x) = Σ_{i=1}^m g_i^{(α_1 + ··· + α_k)}(a_i · x) (a_i · c_1)^{α_1} ... (a_i · c_k)^{α_k},

where k ∈ N, c_i ∈ R^d, α_i ∈ N for all i = 1, ..., k, and D^{α_i}_{c_i} is the α_i-th derivative in the direction c_i.

SLIDE 111

The recovery strategy: nearly orthonormal systems

We assume that the vectors a_1, ..., a_m ∈ R^m are nearly orthonormal, meaning that

    S(a_1, ..., a_m) = inf{ ( Σ_{i=1}^m ‖a_i − w_i‖_2² )^{1/2} : w_1, ..., w_m an orthonormal basis in R^m }

is small! Furthermore, we denote by

    L = span{ a_i ⊗ a_i, i = 1, ..., m } ⊂ R^{m×m}

the subspace of symmetric matrices generated by the tensor products a_i ⊗ a_i = a_i a_iᵀ.

We first recover an approximation of L, i.e., instead of L we then have at our disposal a subspace L̃ of symmetric matrices which is (in some sense) close to L. Finally, we propose the algorithm

    argmax ‖M‖_∞  s.t.  M ∈ L̃, ‖M‖_F ≤ 1,

to recover the a_i's - or good approximations â_i of them (which is of course possible only up to the sign).
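As an aside (our own remark, not from the slides), S can be evaluated exactly: by the orthogonal Procrustes theorem, the orthonormal system closest in the Frobenius sense to the columns of A is W = UVᵀ, where A = UΣVᵀ, so that S(a_1, ..., a_m) = (Σ_i (σ_i − 1)²)^{1/2}. A minimal sketch:

```python
import numpy as np

def near_orthonormality(A):
    """S(a_1, ..., a_m) for the columns of the m x m matrix A: by the
    orthogonal Procrustes theorem, the Frobenius distance to the nearest
    orthonormal basis equals sqrt(sum_i (sigma_i - 1)^2)."""
    s = np.linalg.svd(A, compute_uv=False)
    return np.sqrt(np.sum((s - 1.0) ** 2))

# example: a mildly perturbed orthonormal basis has small S
A = np.linalg.qr(np.random.randn(5, 5))[0] + 0.01 * np.random.randn(5, 5)
print(near_orthonormality(A))
```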

SLIDE 115

Nonlinear programming to recover the a_i ⊗ a_i's

Figure: The a_i ⊗ a_i are the extremal points of the matrix operator norm!

SLIDE 116

On the ambiguity of learning for nonorthogonal profiles

Let a_1 = (1, 0)ᵀ, a_2 = (√2/2, √2/2)ᵀ and b = (a_1 + a_2)/‖a_1 + a_2‖_2. We assume that

    L = span{ a_1a_1ᵀ, a_2a_2ᵀ }

and that

    L̃ = span{ ( 1  ǫ ; ǫ  −ǫ ), ( 0.5  0.5 + ǫ ; 0.5 + ǫ  0.5 − ǫ ) }.

When choosing ǫ = 0.05, we find out that

    { dist(a_1a_1ᵀ, L̃), dist(a_2a_2ᵀ, L̃), dist(bbᵀ, L̃) } ⊂ [0.07, 0.08].

Hence, looking at L̃ alone, every algorithm will have difficulties deciding which two of the three rank-1 matrices above are the generators of the true L. Nevertheless, ‖b − a_1‖_2 = ‖b − a_2‖_2 ≈ 0.39. We see that although the level of noise was rather mild, we have difficulties distinguishing between well-separated vectors.
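This small example can be checked numerically; the following sketch (the helper dist_to_span is ours) reproduces the distances quoted above:

```python
import numpy as np

def dist_to_span(M, basis):
    """Frobenius distance of M to span(basis), by orthonormalizing the
    flattened basis matrices with a QR decomposition."""
    B = np.column_stack([X.ravel() for X in basis])
    Q, _ = np.linalg.qr(B)
    v = M.ravel()
    return np.linalg.norm(v - Q @ (Q.T @ v))

e = 0.05
L_tilde = [np.array([[1.0, e], [e, -e]]),
           np.array([[0.5, 0.5 + e], [0.5 + e, 0.5 - e]])]
a1 = np.array([1.0, 0.0])
a2 = np.array([np.sqrt(2) / 2, np.sqrt(2) / 2])
b = (a1 + a2) / np.linalg.norm(a1 + a2)
for v in (a1, a2, b):
    print(dist_to_span(np.outer(v, v), L_tilde))   # all in [0.07, 0.08]
```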

SLIDE 123

The approximation to L

Define L̃ = span{ ∆f(x_j), j = 1, ..., m_X }, where

    (∆f(x))_{j,k} = [ f(x + ǫ(e_j + e_k)) − f(x + ǫe_j) − f(x + ǫe_k) + f(x) ] / ǫ²,  j, k = 1, ..., m,

is an approximation to the Hessian of f at x. For x drawn at random, and by applying the matrix Chernoff bounds in a suitable way, one derives a probabilistic error estimate, in the sense that

    ‖P_L − P_{L̃}‖_{F→F} ≤ C m^{3/2} ǫ,

with high probability.
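In code, ∆f(x) is just a table of second-order difference quotients; a minimal sketch (the function name and default step size are our assumptions):

```python
import numpy as np

def hessian_by_differences(f, x, eps=1e-3):
    """The matrix Delta f(x) of slide 123: a symmetric approximation of
    the Hessian of f at x from function values only."""
    m = x.size
    E = np.eye(m)
    fx = f(x)
    fe = np.array([f(x + eps * E[j]) for j in range(m)])
    H = np.empty((m, m))
    for j in range(m):
        for k in range(m):
            H[j, k] = (f(x + eps * (E[j] + E[k])) - fe[j] - fe[k] + fx) / eps**2
    return H

# L_tilde is then spanned by such matrices at randomly drawn points x_1, ..., x_mX.
```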

SLIDE 124

A nonlinear operator towards a gradient ascent

Let us first introduce, for a given parameter γ > 1, an operator acting on the singular values of a matrix X = UΣVᵀ as follows:

    Π_γ(X) = U [ D_γ Σ / ‖D_γ Σ‖_F ] Vᵀ,  where D_γ = diag(γ, 1, ..., 1),

so that D_γ Σ = diag(γσ_1, σ_2, ..., σ_m). Notice that Π_γ maps any matrix X onto a matrix of unit Frobenius norm, simply exalting the first singular value and damping the others. It is not a linear operator.

SLIDE 125

The nonlinear programming

We propose a projected gradient method for solving

    argmax ‖M‖_∞  s.t.  M ∈ L̃, ‖M‖_F ≤ 1.

Algorithm 3:

◮ Fix a suitable parameter γ > 1.

◮ Assume we have identified a basis for L̃ of positive semi-definite matrices; for instance, one can use the second-order finite differences ∆f(x_j), j = 1, ..., m_X, to form such a basis.

◮ Generate an initial guess X⁰ = Σ_{j=1}^{m_X} ζ_j ∆f(x_j) by choosing the ζ_j ≥ 0 at random, so that X⁰ ∈ L̃ and ‖X⁰‖_F = 1.

◮ For ℓ ≥ 0:

    X^{ℓ+1} := P_{L̃} Π_γ(X^ℓ).
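A minimal numerical sketch of Π_γ and of the iteration of Algorithm 3 (the names, the QR-based realization of P_{L̃}, and the fixed iteration count are our assumptions):

```python
import numpy as np

def pi_gamma(X, gamma):
    """Pi_gamma: boost the leading singular value by gamma, then
    renormalize to unit Frobenius norm."""
    U, s, Vt = np.linalg.svd(X)
    s[0] *= gamma
    s /= np.linalg.norm(s)
    return (U * s) @ Vt

def algorithm3(basis, gamma=2.0, n_iter=200, rng=None):
    """Projected iteration X <- P_Ltilde(Pi_gamma(X)) over the unit
    Frobenius ball of L_tilde = span(basis)."""
    rng = np.random.default_rng(rng)
    m = basis[0].shape[0]
    B = np.column_stack([M.ravel() for M in basis])
    Q, _ = np.linalg.qr(B)                       # orthonormal basis of L_tilde
    project = lambda X: (Q @ (Q.T @ X.ravel())).reshape(m, m)
    X = sum(rng.random() * M for M in basis)     # random PSD combination
    X /= np.linalg.norm(X)
    for _ in range(n_iter):
        X = project(pi_gamma(X, gamma))
    return X   # under the hypotheses below, ~ a_i (x) a_i for some i
```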

SLIDE 126

Analysis of the algorithm for L̃ = L

Proposition (Daubechies, F., Vybíral)

Assume that L̃ = L and that a_1, ..., a_m are orthonormal. Let γ > √2 and let ‖X⁰‖_∞ > 1/√(γ² − 1). Then there exists µ_0 < 1 such that

    ( 1 − ‖X^{ℓ+1}‖_∞ ) ≤ µ_0 ( 1 − ‖X^ℓ‖_∞ )

for all ℓ ≥ 0. The sequence (X^ℓ)_ℓ being made of matrices with Frobenius norm bounded by 1, we conclude that any accumulation point of it has both unit Frobenius and unit spectral norm, and therefore it has to coincide with one maximizer.

The proof is based on the following observation:

    ‖X^{ℓ+1}‖_∞ = σ_1(X^{ℓ+1}) = γσ_1(X^ℓ) / √( γ²σ_1(X^ℓ)² + σ_2(X^ℓ)² + ··· + σ_m(X^ℓ)² )
                ≥ γ‖X^ℓ‖_∞ / √( (γ² − 1)‖X^ℓ‖²_∞ + 1 ).

SLIDE 128

Analysis of the algorithm for L̃ ≈ L

Theorem (Daubechies, F., Vybíral)

Assume that ‖P_{L̃} − P_L‖_{F→F} < ǫ < 1 and that a_1, ..., a_m are orthonormal. Let ‖X⁰‖_∞ > max{ 1/√(γ² − 1), 1/√2 + ǫ + ξ } and γ > √2. Then, for the iterations (X^ℓ)_ℓ produced by Algorithm 3, there exists µ_0 < 1 such that

    lim sup_ℓ |1 − ‖X^ℓ‖_∞| ≤ ( µ_1(γ, t_0, ǫ) + 2ǫ ) / (1 − µ_0) + ǫ,  where µ_1(γ, ξ, ǫ) ≈ ǫ.

The sequence (X^ℓ)_ℓ is bounded, and its accumulation points X̄ satisfy simultaneously the following properties:

    ‖X̄‖_F ≤ 1  and  ‖X̄‖_∞ ≥ 1 − ( µ_1(γ, t_0, ǫ) + 2ǫ ) / (1 − µ_0) − ǫ,

and

    ‖P_L X̄‖_F ≤ 1  and  ‖P_L X̄‖_∞ ≥ 1 − ( µ_1(γ, t_0, ǫ) + 2ǫ ) / (1 − µ_0).

SLIDE 129

A graphical explanation of the algorithm

Figure: Objective function ‖·‖_∞ to be maximized, and iterations of Algorithm 3 converging to one of the extremal points a_i ⊗ a_i.

SLIDE 130

Nonlinear programming

Theorem (Daubechies, F., Vybíral)

Let M be any local maximizer of

    argmax ‖M‖_∞  s.t.  M ∈ L̃, ‖M‖_F ≤ 1.

Then

    u_jᵀ X u_j = 0

for all X ∈ S_{L̃} with X ⊥ M and all j ∈ {1, ..., m} with |λ_j(0)| = ‖M‖_∞. If furthermore the a_i's are nearly orthonormal, S(a_1, ..., a_m) ≤ ε, and

    3 · m · ‖P_L − P_{L̃}‖ < (1 − ε)²,

then λ_1 = ‖M‖_∞ > max{|λ_2|, ..., |λ_m|} and

    2 Σ_{k=2}^m (u_1ᵀ X u_k)² / (λ_1 − λ_k) ≤ λ_1.

SLIDE 131

Nonlinear programming

Algorithm 4:

◮ Let M be a local maximizer of the nonlinear programming problem.
◮ Take its singular value decomposition M = Σ_{j=1}^m λ_j u_j ⊗ u_j.
◮ Put â := u_1.

Theorem (Daubechies, F., Vybíral)

Let L = L̃ and S(a_1, ..., a_m) ≤ ε. Then there is j_0 ∈ {1, ..., m} such that the â found by Algorithm 4 satisfies

    ‖â − a_{j_0}‖_2 ≤ C√ε.

The proof is based on testing the optimality conditions for X = X_j = a_j ⊗ a_j and showing that λ_1(M) ≈ 1.

SLIDE 133

Learning sums of ridge functions

Algorithm 5:

◮ Let â_j be normalized approximations of the a_j, j = 1, ..., m.
◮ Let (b̂_j)_{j=1}^m be the dual basis to (â_j)_{j=1}^m.
◮ Assume that f(0) = g_1(0) = ··· = g_m(0).
◮ Put ĝ_j(t) := f(t b̂_j), t ∈ (−1/‖b̂_j‖_2, 1/‖b̂_j‖_2).
◮ Put f̂(x) := Σ_{j=1}^m ĝ_j(â_j · x), ‖x‖_2 ≤ 1.

Theorem (Daubechies, F., Vybíral)

Let

◮ S(a_1, ..., a_m) ≤ ε and S(â_1, ..., â_m) ≤ ε′;
◮ ‖a_j − â_j‖_2 ≤ η, j = 1, ..., m.

Then ‖f − f̂‖_∞ ≤ c(ε, ε′) m η.
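A minimal sketch of the reconstruction step (the function name is ours; note that for â_i · b̂_j = δ_{ij} the dual basis consists of the columns of the inverse of the matrix whose rows are the â_j):

```python
import numpy as np

def algorithm5(f, A_hat):
    """A_hat is an m x m matrix whose rows are the normalized
    approximations a_hat_j; assumes g_1(0) = ... = g_m(0) = f(0)."""
    B_hat = np.linalg.inv(A_hat)                 # column j is b_hat_j
    g_hat = [lambda t, b=B_hat[:, j]: f(t * b)
             for j in range(A_hat.shape[0])]
    def f_hat(x):
        # f_hat(x) = sum_j g_hat_j(a_hat_j . x)
        return sum(g(A_hat[j] @ x) for j, g in enumerate(g_hat))
    return f_hat

# sanity check on an exactly orthonormal pair of profiles:
a1, a2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
f = lambda x: np.sin(a1 @ x) + (a2 @ x) ** 2
f_hat = algorithm5(f, np.vstack([a1, a2]))
x = np.array([0.3, -0.2])
print(f(x), f_hat(x))                            # the two values agree
```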

SLIDE 134

Our literature

◮ I. Daubechies, M. Fornasier, and J. Vybíral, Approximation of sums of ridge functions, in preparation.

◮ M. Fornasier, K. Schnass, and J. Vybíral, Learning functions of few arbitrary linear parameters in high dimensions, Foundations of Computational Mathematics, Vol. 12, No. 2, 2012, pp. 229-262.

◮ K. Schnass and J. Vybíral, Compressed learning of high-dimensional sparse functions, ICASSP11, 2011.

◮ A. Kolleck and J. Vybíral, On some aspects of approximation of ridge functions, J. Approx. Theory 194 (2015), 35-61.

◮ S. Mayer, T. Ullrich, and J. Vybíral, Entropy and sampling numbers of classes of ridge functions, to appear in Constructive Approximation.