SLIDE 1 Learning sums of ridge functions in high dimension: a nonlinear compressed sensing model
Massimo Fornasier
Fakultät für Mathematik, Technische Universität München, massimo.fornasier@ma.tum.de, http://www-m15.ma.tum.de/
Winter School on Compressed Sensing, Technical University of Berlin, December 3-5, 2015. Collection of joint results with Ingrid Daubechies, Karin Schnass, and Jan Vybíral.
SLIDE 2 Introduction on ridge functions
◮ A ridge function, in its simplest form, is a function f : R^d → R of the type f(x) = g(a^T x) = g(a · x), where g : R → R is a scalar univariate function and a ∈ R^d is the direction of the ridge function;
◮ Ridge functions are constant along the hyperplanes a · x = λ for any given level λ ∈ R and are among the simplest forms of multivariate functions;
◮ They have been extensively studied in the past couple of decades as approximation building blocks for more complicated high-dimensional functions.
SLIDE 5 Some origins of ridge functions
◮ In multivariate Fourier series, the basis functions are of the form e^{in·x} for n ∈ Z^d, and of the form e^{ia·x} for arbitrary directions a ∈ R^d in the Radon transform;
◮ The term "ridge function" was actually coined by Logan and Shepp in 1975 in their work on computerized tomography, where they show how ridge functions solve the corresponding L²-minimum norm approximation problem.
SLIDE 7 Projection pursuit of the '80s
◮ Ridge function approximation has also been extensively studied during the '80s in mathematical statistics under the name of projection pursuit (Huber, 1985; Donoho-Johnstone, 1989);
◮ Projection pursuit algorithms approximate a function of d variables by functions of the form
$$\sum_{i=1}^m g_i(a_i \cdot x), \quad x \in \mathbb{R}^d,$$
for some functions g_i : R → R and some non-zero vectors a_i ∈ R^d.
SLIDE 9 Some relevant applications of the '90s
◮ In the early '90s there was an explosion of interest in the field of neural networks. One very popular model is the multilayer feed-forward neural network with input, hidden (internal), and output layers;
◮ The simplest case of such a network is described mathematically by a function of the form
$$\sum_{i=1}^m \alpha_i\, \sigma\Bigl(\sum_{j=1}^d w_{ij} x_j + \theta_i\Bigr),$$
where σ : R → R is somehow given and called the activation function, and the w_{ij} are suitable weights;
SLIDE 11 Ridge functions and approximation theory
◮ In the early '90s the question of whether one can use sums of ridge functions to approximate arbitrary functions well was at the center of attention of the approximation theory community (overviews by Li 2002 and Pinkus 1997);
◮ The efficiency of such an approximation compared to, e.g., spline-type approximation for smoothness classes of functions has been extensively considered (DeVore et al. 1997; Petrushev, 1999);
◮ The identification of a ridge function has also been thoroughly considered; in particular we mention the work of Pinkus, and, for what concerns multilayer neural networks, we refer to the work by Fefferman, 1994;
◮ Except for the work of Candès on ridgelets, there has been less attention after 2000 on the problem of approximating functions by means of ridge functions.
SLIDE 16 Capturing ridge functions from point queries
◮ The above results on the identification of such functions assume the availability of arbitrarily many function values or even derivatives;
◮ in certain practical situations this might be very expensive, hazardous, or impossible;
◮ In a paper of 2012, Cohen, Daubechies, DeVore, Kerkyacharian, and Picard address the approximation of ridge functions by the minimal amount of sampling queries: for $g \in C^s([0,1])$, $s > 1$, $\|g\|_{C^s} \le M_0$, $\|a\|_{\ell_q^d} \le M_1$, $0 < q \le 1$,
$$\|f - \hat f\|_{C(\Omega)} \le C M_0 \left(\frac{1 + \log(d/L)}{L}\right)^{1/q - 1},$$
using $3L + 2$ sampling points, deterministically and adaptively chosen.
SLIDE 20 Capturing ridge functions from point queries: a nonlinear compressed sensing model
Compressed sensing: given a suitable sensing matrix X ∈ R^{m×d}, with m ≪ d, we wish to identify a nearly sparse vector a ∈ R^d from its measurements y ≈ Xa, by means of suitable algorithms (ℓ₁-minimization, greedy algorithms) aware of y and X. The data
$$y_i \approx x_i \cdot a = x_i^T a, \quad i = 1, \dots, m,$$
are linear measurements of a. If we now assume the y_i to be the values of a ridge function at the points x_i,
$$y_i \approx g(a \cdot x_i), \quad i = 1, \dots, m,$$
for some unknown or roughly given nonlinear function g, the problem of identifying the ridge direction can be understood as a nonlinear compressed sensing model ...
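To make the linear baseline concrete, here is a minimal sketch (my own toy illustration, not from the talk; all parameter values are assumptions) of recovering a sparse direction a from y = Xa by ℓ₁-minimization, cast as a linear program:

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
d, m, s = 200, 40, 3                      # ambient dim, measurements, sparsity
a = np.zeros(d); a[rng.choice(d, s, replace=False)] = rng.standard_normal(s)
X = rng.choice([-1.0, 1.0], size=(m, d)) / np.sqrt(m)   # Bernoulli sensing matrix
y = X @ a

# min ||z||_1  s.t.  X z = y, with the split z = zp - zm, zp, zm >= 0
c = np.ones(2 * d)
res = linprog(c, A_eq=np.hstack([X, -X]), b_eq=y, bounds=(0, None))
a_hat = res.x[:d] - res.x[d:]
print("recovery error:", np.linalg.norm(a_hat - a))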
SLIDE 25 Ridge functions and functions of data clustered around manifolds
Figure: Functions on data clustered around a manifold can be locally approximated by k-ridge functions.
SLIDE 26 Universal random sampling for a more general ridge model
- M. Fornasier, K. Schnass, J. Vybíral, Learning functions of few arbitrary linear parameters in high dimensions, FoCM, 2012.
$$f(x) = g(Ax), \quad A \text{ a } k \times d \text{ matrix}$$
Rows of A are compressible: $\max_i \|a_i\|_{\ell_q} \le C_1$, $0 < q \le 1$;
$AA^T$ is the identity operator on $\mathbb{R}^k$;
The regularity condition: $\sup_{|\alpha| \le 2} \|D^\alpha g\|_\infty \le C_2$;
The matrix $H^f := \int_{S^{d-1}} \nabla f(x)\,\nabla f(x)^T \, d\mu_{S^{d-1}}(x)$ is a positive semi-definite rank-k matrix;
We assume that the singular values of $H^f$ satisfy $\sigma_1(H^f) \ge \dots \ge \sigma_k(H^f) \ge \alpha > 0$.
SLIDE 29
How can we learn k-ridge functions from point queries?
SLIDE 30 MD House's differential diagnosis (or simply called "sensitivity analysis")
We rely on the numerical approximation of ∂f/∂φ:
$$\nabla g(Ax)^T A\varphi = \frac{\partial f}{\partial \varphi}(x) \overset{(*)}{=} \frac{f(x + \epsilon\varphi) - f(x)}{\epsilon} - \frac{\epsilon}{2}\bigl[\varphi^T \nabla^2 f(\zeta)\varphi\bigr], \quad \epsilon \le \bar\epsilon.$$
X = {x_j ∈ Ω : j = 1,...,m_X} drawn uniformly at random in Ω ⊂ R^d;
Φ = {φ_j ∈ R^d : j = 1,...,m_Φ}, where
$$\varphi^j_\ell = \begin{cases} 1/\sqrt{m_\Phi} & \text{with prob. } 1/2, \\ -1/\sqrt{m_\Phi} & \text{with prob. } 1/2, \end{cases}$$
for every j ∈ {1,...,m_Φ} and every ℓ ∈ {1,...,d}.
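A small numerical check (a toy setup of my own, not from the talk) that the data of (*) are, up to O(ε), the random projections φ_i · ∇f(x):

import numpy as np

rng = np.random.default_rng(1)
d, m_phi, eps = 100, 30, 1e-4
a = rng.standard_normal(d); a /= np.linalg.norm(a)
f = lambda x: np.sin(a @ x)                       # ridge function g(a.x), g = sin
grad_f = lambda x: np.cos(a @ x) * a

Phi = rng.choice([-1.0, 1.0], size=(m_phi, d)) / np.sqrt(m_phi)   # rows phi_i
x = rng.standard_normal(d); x /= np.linalg.norm(x)                # x on S^{d-1}
y = (np.array([f(x + eps * p) for p in Phi]) - f(x)) / eps        # the data y_i
print(np.max(np.abs(y - Phi @ grad_f(x))))        # O(eps) discretization error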
SLIDE 32 Sensitivity analysis
Figure: We perform randomized sensitivity analysis at randomly drawn points x ∈ S^{d-1}, querying f at x and x + εφ.
SLIDE 33 Collecting together the differential analysis
Φ ... the m_Φ × d matrix whose rows are the φ_i; X ... the d × m_X matrix
$$X = \bigl(A^T\nabla g(Ax_1)\,|\,\dots\,|\,A^T\nabla g(Ax_{m_X})\bigr).$$
The m_X · m_Φ instances of (∗) read in matrix notation as
$$\Phi X = Y + E, \qquad (**)$$
where Y and E are m_Φ × m_X matrices defined by
$$y_{ij} = \frac{f(x_j + \epsilon\varphi_i) - f(x_j)}{\epsilon}, \qquad \varepsilon_{ij} = -\frac{\epsilon}{2}\bigl[(\varphi_i)^T\nabla^2 f(\zeta_{ij})\varphi_i\bigr].$$
SLIDE 34 Example of active coordinates: which factor does play a role?
We assume that $A = \begin{pmatrix} e_{i_1}^T \\ \vdots \\ e_{i_k}^T \end{pmatrix}$, i.e., $f(x) = f(x_1, \dots, x_d) = g(x_{i_1}, \dots, x_{i_k})$, where f : Ω = [0,1]^d → R and g : [0,1]^k → R.
We want to identify first the active coordinates i₁,...,i_k. Then one can apply any usual k-dimensional approximation method...
A possible algorithm chooses the sampling points at random; due to concentration of measure effects, we get the right result with high probability.
SLIDE 37 A simple algorithm based on concentration of measure
The algorithm to identify the active coordinates I is based on the identity
$$\Phi^T\Phi X = \Phi^T Y + \Phi^T E,$$
where now X has i-th row
$$X_i = \Bigl(\frac{\partial g}{\partial z_i}(Ax_1), \dots, \frac{\partial g}{\partial z_i}(Ax_{m_X})\Bigr)$$
for i ∈ I, and all other rows equal to zero.
In expectation: Φ^TΦ ≈ Id : R^d → R^d, hence Φ^TΦX ≈ X, and Φ^TE is small ⟹ Φ^TY ≈ X.
We select the k largest rows of Φ^TY and estimate the probability that their indices coincide with the indices of the non-zero rows of X.
SLIDE 40 A first recovery result
Theorem (Schnass and Vybíral 2011)
Let f : R^d → R be a function of k active coordinates that is defined and twice continuously differentiable on a small neighbourhood of [0,1]^d. For a positive real number L ≤ d, the randomized algorithm described above recovers the k unknown active coordinates of f with probability at least 1 - 6 exp(-L), using only O(k(L + log k)(L + log d)) samples of f. The constants involved in the O notation depend on smoothness properties of g, namely on
$$\frac{\max_{j=1,\dots,k}\|\partial_{i_j} g\|_\infty}{\min_{j=1,\dots,k}\|\partial_{i_j} g\|_1}.$$
SLIDE 41 Examples of active coordinate detection in dimension d = 1000
Figure: Detection results for the test functions $\max\bigl(1 - 5\sqrt{(x_3 - 1/2)^2 + (x_4 - 1/2)^2},\, 0\bigr)^3$ and $\sin\bigl(\sum_{i=21} x_i\bigr)\bigl[\sum_{i=21}\bigl(\sin(6\pi x_i) + 5(x_i - 1/2)^2\bigr)\bigr]$.
SLIDE 42 Learning ridge functions k = 1
Let f(x) = g(a · x), f : B_{R^d} → R, where a ∈ R^d, $\|a\|_2 = 1$ and $\|a\|_q \le C_1$, $0 < q \le 1$; $\max_{0 \le |\alpha| \le 2}\|D^\alpha g\|_\infty \le C_2$; and
$$\alpha = \int_{S^{d-1}} \|\nabla f(x)\|_{\ell_2^d}^2 \, d\mu_{S^{d-1}}(x) = \int_{S^{d-1}} |g'(a \cdot x)|^2 \, d\mu_{S^{d-1}}(x) > 0.$$
We consider again the Taylor expansion (*) with Ω = S^{d-1}. We choose the points X = {x_j ∈ S^{d-1} : j = 1,...,m_X} generated at random on S^{d-1} with respect to μ_{S^{d-1}}. The matrix Φ is generated as before and we obtain (**) again in the form
$$\Phi[g'(a \cdot x_j)\,a] = y_j + \varepsilon_j, \quad j = 1, \dots, m_X.$$
SLIDE 44 Algorithm 1:
◮ Given m_Φ, m_X, draw at random the sets Φ and X, and construct Y according to (*).
◮ Set $\hat x_j = \Delta(y_j) := \arg\min_{y_j = \Phi z} \|z\|_{\ell_1^d}$.
◮ Find $j_0 = \arg\max_{j=1,\dots,m_X} \|\hat x_j\|_{\ell_2^d}$.
◮ Set $\hat a = \hat x_{j_0} / \|\hat x_{j_0}\|_{\ell_2^d}$.
◮ Define $\hat g(y) := f(\hat a^T y)$ and $\hat f(x) := \hat g(\hat a \cdot x)$.
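A compact numerical sketch of Algorithm 1 under toy assumptions of mine (g = tanh so that g' > 0, a sparse direction, and the decoder Δ implemented by linear programming); all parameter values are illustrative:

import numpy as np
from scipy.optimize import linprog

def l1_decode(Phi, y):
    """Delta(y) = argmin ||z||_1 s.t. Phi z = y, as a linear program."""
    m, d = Phi.shape
    res = linprog(np.ones(2 * d), A_eq=np.hstack([Phi, -Phi]), b_eq=y,
                  bounds=(0, None))
    return res.x[:d] - res.x[d:]

rng = np.random.default_rng(3)
d, m_phi, m_X, eps = 100, 40, 10, 1e-4
a = np.zeros(d); a[:3] = [0.8, -0.5, 0.33]; a /= np.linalg.norm(a)  # compressible
f = lambda x: np.tanh(a @ x)                   # g = tanh, so g' > 0 everywhere

Phi = rng.choice([-1.0, 1.0], size=(m_phi, d)) / np.sqrt(m_phi)
Xpts = rng.standard_normal((m_X, d))
Xpts /= np.linalg.norm(Xpts, axis=1, keepdims=True)          # x_j on S^{d-1}

X_hat = [l1_decode(Phi, np.array([(f(x + eps * p) - f(x)) / eps for p in Phi]))
         for x in Xpts]
j0 = int(np.argmax([np.linalg.norm(xh) for xh in X_hat]))
a_hat = X_hat[j0] / np.linalg.norm(X_hat[j0])   # here the sign is + since g' > 0
print("direction error:", np.linalg.norm(a_hat - a))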
SLIDE 45 Recovery result
Theorem (F., Schnass, and Vybíral 2012)
Let 0 < s < 1 and log d ≤ m_Φ ≤ [log 6]²d. Then there is a constant c₁′ such that, using m_X · (m_Φ + 1) function evaluations of f, Algorithm 1 defines a function $\hat f : B_{\mathbb{R}^d}(1 + \bar\epsilon) \to \mathbb{R}$ that, with probability at least
$$1 - \Bigl(e^{-c_1' m_\Phi} + e^{-\sqrt{m_\Phi d}} + 2e^{-\frac{2 m_X s^2 \alpha^2}{C_2^4}}\Bigr),$$
will satisfy
$$\|f - \hat f\|_\infty \le 2C_2(1 + \bar\epsilon)\,\nu_1, \quad \text{where } \nu_1 = C'\Bigl[\Bigl(\frac{m_\Phi}{\log(d/m_\Phi)}\Bigr)^{1/2 - 1/q} + \frac{\epsilon}{\sqrt{m_\Phi}}\Bigr]$$
and C′ depends only on C₁ and C₂.
SLIDE 46 Ingredients of the proof
◮ compressed sensing;
◮ stability of one-dimensional subspaces;
◮ concentration inequalities (Hoeffding's inequality).
SLIDE 49 Compressed sensing
Theorem (Wojtaszczyk, 2011)
Assume that Φ is an m × d random matrix with all entries being independent Bernoulli variables scaled by $1/\sqrt m$. Let us suppose that $d > [\log 6]^2 m$. Then there are positive constants C, c₁′, c₂′ > 0 such that, with probability at least
$$1 - e^{-c_1' m} - e^{-\sqrt{md}},$$
the matrix Φ has the following property. For every x ∈ R^d, ε ∈ R^m and every natural number $K \le c_2'\, m / \log(d/m)$ we have
$$\|\Delta(\Phi x + \varepsilon) - x\|_{\ell_2^d} \le C\Bigl(K^{-1/2}\sigma_K(x)_{\ell_1^d} + \max\{\|\varepsilon\|_{\ell_2^m},\ \sqrt{\log d}\,\|\varepsilon\|_{\ell_\infty^m}\}\Bigr),$$
where
$$\sigma_K(x)_{\ell_1^d} := \inf\{\|x - z\|_{\ell_1^d} : \#\,\mathrm{supp}\, z \le K\}$$
is the error of the best K-term approximation of x.
SLIDE 53 How does compressed sensing play a role?
For the d × m_X matrix X, i.e.,
$$X = \bigl(g'(a \cdot x_1)\,a \,|\, \dots \,|\, g'(a \cdot x_{m_X})\,a\bigr),$$
we have
$$\Phi[g'(a \cdot x_j)\,a] = y_j + \varepsilon_j, \quad j = 1, \dots, m_X,$$
and $\hat x_j = \Delta(y_j) := \arg\min_{y_j = \Phi z}\|z\|_{\ell_1^d}$; the previous result gives - with the probability provided there -
$$\hat x_j = g'(a \cdot x_j)\,a + n_j,$$
with n_j properly estimated by
$$\|n_j\|_{\ell_2^d} \le C\Bigl(K^{-1/2}\sigma_K\bigl(g'(a \cdot x_j)\,a\bigr)_{\ell_1^d} + \max\{\|\varepsilon_j\|_{\ell_2^{m_\Phi}},\ \sqrt{\log d}\,\|\varepsilon_j\|_{\ell_\infty^{m_\Phi}}\}\Bigr).$$
SLIDE 56 Some computations
Let us estimate these quantities. By Stechkin's inequality, for which
$$\sigma_K(x)_{\ell_1^d} \le \|x\|_{\ell_q^d}\, K^{1 - 1/q} \quad \text{for all } x \in \mathbb{R}^d,$$
one obtains - for $x_j = g'(a \cdot x_j)\,a$ -
$$K^{-1/2}\sigma_K(x_j)_{\ell_1^d} \le |g'(a \cdot x_j)| \cdot \|a\|_{\ell_q^d} \cdot K^{1/2 - 1/q} \le C_1 C_2 \Bigl(\frac{m_\Phi}{\log(d/m_\Phi)}\Bigr)^{1/2 - 1/q}.$$
Moreover, since $\nabla^2 f(\zeta) = g''(a \cdot \zeta)\, a a^T$ and the entries of $\varphi_i$ are $\pm 1/\sqrt{m_\Phi}$,
$$\|\varepsilon_j\|_{\ell_\infty^{m_\Phi}} = \frac{\epsilon}{2}\max_{i=1,\dots,m_\Phi}\bigl|\varphi_i^T \nabla^2 f(\zeta_{ij})\varphi_i\bigr| = \frac{\epsilon}{2 m_\Phi}\max_{i=1,\dots,m_\Phi}\Bigl|\sum_{k,l=1}^d a_k a_l\, g''(a \cdot \zeta_{ij})\Bigr| \le \frac{\epsilon\, \|g''\|_\infty}{2 m_\Phi}\Bigl(\sum_{k=1}^d |a_k|\Bigr)^2 \le \frac{\epsilon\, \|g''\|_\infty}{2 m_\Phi}\Bigl(\sum_{k=1}^d |a_k|^q\Bigr)^{2/q} \le \frac{C_1^2 C_2}{2 m_\Phi}\,\epsilon,$$
$$\|\varepsilon_j\|_{\ell_2^{m_\Phi}} \le \sqrt{m_\Phi}\,\|\varepsilon_j\|_{\ell_\infty^{m_\Phi}} \le \frac{C_1^2 C_2}{2\sqrt{m_\Phi}}\,\epsilon,$$
leading to
$$\max\{\|\varepsilon_j\|_{\ell_2^{m_\Phi}},\ \sqrt{\log d}\,\|\varepsilon_j\|_{\ell_\infty^{m_\Phi}}\} \le \frac{C_1^2 C_2}{2\sqrt{m_\Phi}}\,\epsilon \cdot \max\Bigl\{1,\ \sqrt{\frac{\log d}{m_\Phi}}\Bigr\} \le \frac{C_1^2 C_2}{2\sqrt{m_\Phi}}\,\epsilon,$$
using log d ≤ m_Φ in the last step.
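Stechkin's inequality is easy to check numerically; this toy snippet (my own, not from the talk) verifies it on a randomly generated compressible vector:

import numpy as np

rng = np.random.default_rng(4)
d, q = 1000, 0.5
x = rng.standard_normal(d) * rng.random(d) ** 6   # a compressible vector
for K in (5, 20, 80):
    tail = np.sort(np.abs(x))[:-K].sum()          # best K-term l1 error sigma_K(x)
    bound = (np.abs(x) ** q).sum() ** (1 / q) * K ** (1 - 1 / q)
    print(K, tail <= bound)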
SLIDE 62 Summarizing ...
With high probability,
$$\hat x_j = g'(a \cdot x_j)\,a + n_j,$$
where
$$\|n_j\|_{\ell_2^d} \le C\Bigl(K^{-1/2}\sigma_K\bigl(g'(a \cdot x_j)\,a\bigr)_{\ell_1^d} + \max\{\|\varepsilon_j\|_{\ell_2^{m_\Phi}},\ \sqrt{\log d}\,\|\varepsilon_j\|_{\ell_\infty^{m_\Phi}}\}\Bigr) \le \nu_1 = C'\Bigl[\Bigl(\frac{m_\Phi}{\log(d/m_\Phi)}\Bigr)^{1/2 - 1/q} + \frac{\epsilon}{\sqrt{m_\Phi}}\Bigr].$$
SLIDE 64 Stability of one dimensional subspaces
Lemma
Let us fix $\hat x \in \mathbb{R}^d$, $a \in S^{d-1}$, $0 \ne \gamma \in \mathbb{R}$, and $n \in \mathbb{R}^d$ with norm $\|n\|_{\ell_2^d} \le \nu_1 < |\gamma|$. If we assume $\hat x = \gamma a + n$, then
$$\Bigl\|\frac{\hat x}{\|\hat x\|_{\ell_2^d}} - a\Bigr\|_{\ell_2^d} \le \frac{2\nu_1}{\|\hat x\|_{\ell_2^d}}.$$
We recall that $\hat x_j = g'(a \cdot x_j)\,a + n_j$, and
$$\max_j \|\hat x_j\|_{\ell_2^d} \ge \max_j |g'(a \cdot x_j)| - \max_j \|\hat x_j - x_j\|_{\ell_2^d} \ge \max_j |g'(a \cdot x_j)| - \nu_1.$$
SLIDE 66 Concentration inequalities I
Lemma (Hoeffding's inequality)
Let X₁,...,X_m be independent random variables. Assume that the X_j are almost surely bounded, i.e., there exist finite scalars a_j, b_j such that
$$\mathbb{P}\{X_j - \mathbb{E}X_j \in [a_j, b_j]\} = 1, \quad j = 1, \dots, m.$$
Then we have
$$\mathbb{P}\Bigl(\Bigl|\sum_{j=1}^m X_j - \mathbb{E}\sum_{j=1}^m X_j\Bigr| \ge t\Bigr) \le 2e^{-\frac{2t^2}{\sum_{j=1}^m (b_j - a_j)^2}}.$$
Let us now apply Hoeffding's inequality to the random variables $X_j = |g'(a \cdot x_j)|^2$.
SLIDE 68 Probabilistic estimates from below
By applying Hoeffding's inequality to the random variables $X_j = |g'(a \cdot x_j)|^2$, we have:
Lemma
Let us fix 0 < s < 1. Then with probability at least $1 - 2e^{-\frac{2 m_X s^2 \alpha^2}{C_2^4}}$ we have
$$\max_{j=1,\dots,m_X} |g'(a \cdot x_j)| \ge \sqrt{(1-s)\,\alpha},$$
where
$$\alpha := \mathbb{E}_x\bigl(|g'(a \cdot x)|^2\bigr) = \int_{S^{d-1}} |g'(a \cdot x)|^2 \, d\mu_{S^{d-1}}(x) = \int_{S^{d-1}} \|\nabla f(x)\|_{\ell_2^d}^2 \, d\mu_{S^{d-1}}(x) > 0.$$
SLIDE 71 Concentration of measure phenomenon and risk of intractability
A key role is played by
$$\alpha = \int_{S^{d-1}} |g'(a \cdot x)|^2 \, d\mu_{S^{d-1}}(x).$$
Due to symmetry it is ... independent of a. With the push-forward measure μ₁ on [-1,1],
$$\alpha = \int_{-1}^1 |g'(y)|^2 \, d\mu_1(y) = \frac{\Gamma(d/2)}{\pi^{1/2}\,\Gamma((d-1)/2)} \int_{-1}^1 |g'(y)|^2 (1 - y^2)^{\frac{d-3}{2}} \, dy;$$
μ₁ concentrates around zero exponentially fast as d → ∞.
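A quick numerical illustration (my own toy computation): for g(y) = y²/2, i.e. g'(y) = y, the formula above gives α(d) = E_{μ₁}[y²] ≈ 1/d, so α(d)·d stays near 1 while α(d) itself decays:

import numpy as np
from scipy.special import gammaln
from scipy.integrate import quad

def alpha(d, M=1):
    # alpha(d) for g'(y) = y^M, using the push-forward density of mu_1
    norm = np.exp(gammaln(d / 2) - gammaln((d - 1) / 2)) / np.sqrt(np.pi)
    val, _ = quad(lambda y: y ** (2 * M) * (1 - y * y) ** ((d - 3) / 2),
                  -1, 1, points=[0.0])
    return norm * val

for d in (10, 100, 1000):
    print(d, alpha(d) * d)          # roughly constant => alpha(d) = O(1/d)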
SLIDE 74 Dependence on the dimension d
Proposition
Let us fix M ∈ N and assume that g : [-1,1] → R is $C^{M+2}$-differentiable in an open neighbourhood U of 0 and
$$\frac{d^\ell}{dx^\ell} g(0) = 0 \quad \text{for } \ell = 1, \dots, M.$$
Then α(d) = O(d^{-M}) for d → ∞.
SLIDE 75 Tractability classes
(1) For 0 < q ≤ 1, C₁ > 1 and C₂ ≥ α₀ > 0, we define
$$\mathcal{F}_d^1 := \mathcal{F}_d^1(\alpha_0, q, C_1, C_2) := \bigl\{f : B_{\mathbb{R}^d} \to \mathbb{R} : \exists a \in \mathbb{R}^d,\ \|a\|_{\ell_2^d} = 1,\ \|a\|_{\ell_q^d} \le C_1,\ \text{and } \exists g \in C^2(B_{\mathbb{R}}),\ |g'(0)| \ge \alpha_0 > 0 : f(x) = g(a \cdot x)\bigr\}.$$
(2) For a neighborhood U of 0, 0 < q ≤ 1, C₁ > 1, C₂ ≥ α₀ > 0 and N ≥ 2, we define
$$\mathcal{F}_d^2 := \mathcal{F}_d^2(U, \alpha_0, q, C_1, C_2, N) := \bigl\{f : B_{\mathbb{R}^d} \to \mathbb{R} : \exists a \in \mathbb{R}^d,\ \|a\|_{\ell_2^d} = 1,\ \|a\|_{\ell_q^d} \le C_1,\ \text{and } \exists g \in C^2(B_{\mathbb{R}}) \cap C^N(U)\ \exists\, 0 \le M \le N-1,\ |g^{(M)}(0)| \ge \alpha_0 > 0 : f(x) = g(a \cdot x)\bigr\}.$$
(3) For a neighborhood U of 0, 0 < q ≤ 1, C₁ > 1 and C₂ ≥ α₀ > 0, we define
$$\mathcal{F}_d^3 := \mathcal{F}_d^3(U, \alpha_0, q, C_1, C_2) := \bigl\{f : B_{\mathbb{R}^d} \to \mathbb{R} : \exists a \in \mathbb{R}^d,\ \|a\|_{\ell_2^d} = 1,\ \|a\|_{\ell_q^d} \le C_1,\ \text{and } \exists g \in C^2(B_{\mathbb{R}}) \cap C^\infty(U),\ |g^{(M)}(0)| = 0 \text{ for all } M \in \mathbb{N} : f(x) = g(a \cdot x)\bigr\}.$$
SLIDE 78 Tractability result
Corollary
The problem of learning functions f in the classes $\mathcal{F}_d^1$ and $\mathcal{F}_d^2$ from point evaluations is strongly polynomially tractable (no polynomial dependence on d) and polynomially tractable (with polynomial dependence on d), respectively.
SLIDE 79 Intractability
On the one hand, let us notice that if in the class $\mathcal{F}_d^3$ we remove the condition $\|a\|_{\ell_q^d} \le C_1$, then the problem actually becomes intractable. Let $g \in C^2([-1-\bar\epsilon,\, 1+\bar\epsilon])$ be given by $g(y) = 8(y - 1/2)^3$ for $y \in [1/2,\, 1+\bar\epsilon]$ and zero otherwise. Notice that, for every $a \in \mathbb{R}^d$ with $\|a\|_{\ell_2^d} = 1$, the function f(x) = g(a · x) vanishes everywhere on $S^{d-1}$ outside of the cap
$$U(a, 1/2) := \{x \in S^{d-1} : a \cdot x \ge 1/2\}.$$
Figure: The function g and the spherical cap U(a, 1/2).
SLIDE 82 Intractability
The $\mu_{S^{d-1}}$ measure of U(a, 1/2) obviously does not depend on a and is known to be exponentially small in d. Furthermore, it is known that there are a constant c > 0 and unit vectors a₁,...,a_K such that the sets U(a₁, 1/2),...,U(a_K, 1/2) are mutually disjoint and $K \ge e^{cd}$. Finally, we observe that $\max_{x \in S^{d-1}} |f(x)| = f(a) = g(1) = 1$.
We conclude that any algorithm making use only of the structure of f(x) = g(a · x) with $\|a\|_{\ell_2^d} = 1$ needs to use exponentially many sampling points in order to distinguish between f(x) ≡ 0 and f(x) = g(a_i · x) for some of the a_i's as constructed above.
SLIDE 84 Truly k-ridge functions for k ≫ 1
$$f(x) = g(Ax), \quad A \text{ a } k \times d \text{ matrix}$$
Rows of A are compressible: $\max_i \|a_i\|_{\ell_q} \le C_1$;
$AA^T$ is the identity operator on $\mathbb{R}^k$;
The regularity condition: $\sup_{|\alpha| \le 2} \|D^\alpha g\|_\infty \le C_2$;
The matrix $H^f := \int_{S^{d-1}} \nabla f(x)\,\nabla f(x)^T \, d\mu_{S^{d-1}}(x)$ is a positive semi-definite rank-k matrix;
We assume that the singular values of $H^f$ satisfy $\sigma_1(H^f) \ge \dots \ge \sigma_k(H^f) \ge \alpha > 0$.
SLIDE 91 Algorithm 2:
◮ Given m_Φ, m_X, draw at random the sets Φ and X, and construct Y according to (*).
◮ Set $\hat x_j = \Delta(y_j) := \arg\min_{y_j = \Phi z}\|z\|_{\ell_1^d}$, for j = 1,...,m_X; $\hat X = (\hat x_1 | \dots | \hat x_{m_X})$ is again a d × m_X matrix.
◮ Compute the singular value decomposition of
$$\hat X^T = \bigl(\hat U_1\ \hat U_2\bigr)\begin{pmatrix}\hat\Sigma_1 & 0 \\ 0 & \hat\Sigma_2\end{pmatrix}\begin{pmatrix}\hat V_1^T \\ \hat V_2^T\end{pmatrix},$$
where $\hat\Sigma_1$ contains the k largest singular values.
◮ Set $\hat A = \hat V_1^T$.
◮ Define $\hat g(y) := f(\hat A^T y)$ and $\hat f(x) := \hat g(\hat A x)$.
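A hedged sketch of the subspace step of Algorithm 2, with the ℓ₁-decoding idealized away (we plug in the exact columns A^T∇g(Ax_j)) to isolate the SVD logic; the whole setup is a toy assumption of mine:

import numpy as np

rng = np.random.default_rng(5)
d, k, m_X = 60, 2, 30
A = np.linalg.qr(rng.standard_normal((d, k)))[0].T      # k x d with A A^T = I_k

def grad_f(x):              # for g(z) = sin(z_1) + cos(z_2): grad f = A^T grad g(Ax)
    z = A @ x
    return A.T @ np.array([np.cos(z[0]), -np.sin(z[1])])

# pretend the l1-decoder returned the exact columns A^T grad g(A x_j)
X_hat = np.column_stack([grad_f(x) for x in rng.standard_normal((m_X, d))])
_, _, Vt = np.linalg.svd(X_hat.T)   # SVD of the m_X x d matrix X_hat^T
A_hat = Vt[:k]                      # estimate of the row span of A
# the two row spans coincide iff the projections agree
print(np.linalg.norm(A.T @ A - A_hat.T @ A_hat))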
SLIDE 92 The control of the error
The quality of the final approximation of f by means of $\hat f$ depends on two kinds of accuracies:
1. The error between $\hat X$ and X, which can be controlled through the number of compressed sensing measurements m_Φ;
2. The stability of the span of $V_1^T$, simply characterized by how well the singular values of X, or equivalently of G, are separated from 0, which is related to the number of random samples m_X.
To be precise, we have:
SLIDE 94 Recovery result
Theorem (F., Schnass, and Vybíral)
Let log d ≤ m_Φ ≤ [log 6]²d. Then there is a constant c₁′ such that, using m_X · (m_Φ + 1) function evaluations of f, Algorithm 2 defines a function $\hat f : B_{\mathbb{R}^d}(1 + \bar\epsilon) \to \mathbb{R}$ that, with probability at least
$$1 - \Bigl(e^{-c_1' m_\Phi} + e^{-\sqrt{m_\Phi d}} + k e^{-\frac{m_X \alpha s^2}{2 k C_2^2}}\Bigr),$$
will satisfy
$$\|f - \hat f\|_\infty \le 2 C_2 \sqrt{k}\,(1 + \bar\epsilon)\,\nu_2, \quad \text{where } \nu_2 = C\Bigl[\Bigl(\frac{m_\Phi}{\log(d/m_\Phi)}\Bigr)^{1/2 - 1/q} + \frac{\epsilon k^2}{\sqrt{m_\Phi}}\Bigr]$$
and C depends only on C₁ and C₂.
SLIDE 95 Ingredients of the proof
◮ compressed sensing;
◮ stability of the SVD;
◮ concentration inequalities (Chernoff bounds for sums of positive-semidefinite matrices).
SLIDE 98 Compressed sensing
Corollary (after Wojtaszczyk, 2011)
Let log d ≤ m_Φ < [log 6]²d. Then with probability at least $1 - (e^{-c_1' m_\Phi} + e^{-\sqrt{m_\Phi d}})$ the matrix $\hat X$ as calculated in Algorithm 2 satisfies
$$\|X - \hat X\|_F \le C\sqrt{m_X}\Bigl[\Bigl(\frac{m_\Phi}{\log(d/m_\Phi)}\Bigr)^{1/2 - 1/q} + \frac{\epsilon k^2}{\sqrt{m_\Phi}}\Bigr],$$
where C depends only on C₁ and C₂.
SLIDE 99 Stability of SVD
Given two matrices B and $\hat B$ with corresponding singular value decompositions
$$B = \bigl(U_1\ U_2\bigr)\begin{pmatrix}\Sigma_1 & 0 \\ 0 & \Sigma_2\end{pmatrix}\begin{pmatrix}V_1^T \\ V_2^T\end{pmatrix}, \qquad \hat B = \bigl(\hat U_1\ \hat U_2\bigr)\begin{pmatrix}\hat\Sigma_1 & 0 \\ 0 & \hat\Sigma_2\end{pmatrix}\begin{pmatrix}\hat V_1^T \\ \hat V_2^T\end{pmatrix},$$
we have:
SLIDE 100 Wedin's bound
Theorem (Stability of subspaces)
If there is an $\bar\alpha > 0$ such that
$$\min_{\ell, \hat\ell} |\sigma_{\hat\ell}(\hat\Sigma_1) - \sigma_\ell(\Sigma_2)| \ge \bar\alpha \quad \text{and} \quad \min_{\hat\ell} |\sigma_{\hat\ell}(\hat\Sigma_1)| \ge \bar\alpha,$$
then
$$\|V_1 V_1^T - \hat V_1 \hat V_1^T\|_F \le \frac{2}{\bar\alpha}\,\|B - \hat B\|_F.$$
SLIDE 101 Wedin's bound
Applied to our situation, where X has rank k and thus Σ₂ = 0, we get
$$\|V_1 V_1^T - \hat V_1 \hat V_1^T\|_F \le \frac{2\sqrt{m_X}\,\nu_2}{\sigma_k(\hat X^T)},$$
and further, since $\sigma_k(\hat X^T) \ge \sigma_k(X^T) - \|X - \hat X\|_F$, that
$$\|V_1 V_1^T - \hat V_1 \hat V_1^T\|_F \le \frac{2\sqrt{m_X}\,\nu_2}{\sigma_k(X^T) - \sqrt{m_X}\,\nu_2}.$$
Note that
$$X^T = GA = U_G \Sigma_G [V_G^T A] \quad \text{for } G = \bigl(\nabla g(Ax_1) \,|\, \dots \,|\, \nabla g(Ax_{m_X})\bigr)^T,$$
hence $\Sigma_{X^T} = \Sigma_G$. Moreover $\sigma_i(G) = \sigma_i(G^T G)^{1/2}$ for all i = 1,...,k.
SLIDE 103 Concentration inequalities II
Theorem (Matrix Chernoff bounds)
Consider X₁,...,X_m independent random, positive-semidefinite matrices of dimension k × k. Moreover, suppose σ₁(X_j) ≤ C almost surely. Compute the extreme singular values of the sum of the expectations,
$$\mu_{\max} = \sigma_1\Bigl(\sum_{j=1}^m \mathbb{E}X_j\Bigr), \qquad \mu_{\min} = \sigma_k\Bigl(\sum_{j=1}^m \mathbb{E}X_j\Bigr).$$
Then
$$\mathbb{P}\Bigl(\sigma_1\Bigl(\sum_{j=1}^m X_j\Bigr) - \mu_{\max} \ge s\,\mu_{\max}\Bigr) \le k\Bigl(\frac{e}{1+s}\Bigr)^{\frac{\mu_{\max}(1+s)}{C}}, \quad \text{for all } s > e - 1,$$
and
$$\mathbb{P}\Bigl(\sigma_k\Bigl(\sum_{j=1}^m X_j\Bigr) - \mu_{\min} \le -s\,\mu_{\min}\Bigr) \le k\, e^{-\frac{\mu_{\min} s^2}{2C}}, \quad \text{for all } s \in (0,1).$$
SLIDE 104
Note that $G^T G = \sum_{j=1}^{m_X} \nabla g(Ax_j)\,\nabla g(Ax_j)^T$, and by applying the previous result to $X_j = \nabla g(Ax_j)\,\nabla g(Ax_j)^T$ we have:
Lemma
For any s ∈ (0,1) we have that
$$\sigma_k(X^T) \ge \sqrt{(1-s)\, m_X\, \alpha}$$
with probability at least $1 - k\, e^{-\frac{m_X \alpha s^2}{2 k C_2^2}}$.
SLIDE 105 Proof of Theorem
With probability at least
$$1 - \Bigl(e^{-c_1' m_\Phi} + e^{-\sqrt{m_\Phi d}} + k e^{-\frac{m_X \alpha s^2}{2 k C_2^2}}\Bigr)$$
we have
$$\|V_1 V_1^T - \hat V_1 \hat V_1^T\|_F \le 2\nu_2,$$
and, for $\hat A = \hat V_1^T$ and $V_G^T A = V_1^T$,
$$\|A^T A - \hat A^T \hat A\|_F = \|A^T V_G V_G^T A - \hat V_1 \hat V_1^T\|_F \le 2\nu_2.$$
SLIDE 107 Proof of Theorem ... continued
Since A is row-orthogonal we have A = AA^TA, and
$$|f(x) - \hat f(x)| = |g(Ax) - \hat g(\hat A x)| = |g(Ax) - g(A\hat A^T \hat A x)| \le C_2\sqrt{k}\,\|Ax - A\hat A^T\hat A x\|_{\ell_2^k} = C_2\sqrt{k}\,\|A(A^TA - \hat A^T\hat A)x\|_{\ell_2^k} \le C_2\sqrt{k}\,\|A^TA - \hat A^T\hat A\|_F\,\|x\|_{\ell_2^d} \le 2C_2\sqrt{k}\,(1+\bar\epsilon)\,\nu_2,$$
where we used $\|A^TA - \hat A^T\hat A\|_F = \|A^T V_G V_G^T A - \hat V_1\hat V_1^T\|_F \le 2\nu_2$.
SLIDE 108 k-ridge functions may be too simple!
Figure: Functions on data clustered around a manifold with multiple directions can be locally approximated by sums of k-ridge functions.
SLIDE 109 Sums of ridge functions
Can we still learn functions of the type
$$f(x) = \sum_{i=1}^m g_i(a_i \cdot x), \quad x \in [-1,1]^d\,?$$
Our approach (Daubechies, F., Vybíral) is essentially based on the formula
$$D_{c_1}^{\alpha_1} \cdots D_{c_k}^{\alpha_k} f(x) = \sum_{i=1}^m g_i^{(\alpha_1 + \dots + \alpha_k)}(a_i \cdot x)\,(a_i \cdot c_1)^{\alpha_1} \cdots (a_i \cdot c_k)^{\alpha_k},$$
where k ∈ N, c_i ∈ R^d, α_i ∈ N for all i = 1,...,k, and $D_{c_i}^{\alpha_i}$ is the α_i-th derivative in the direction c_i.
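The identity is easy to sanity-check numerically; here is a toy example of mine with k = 1, α₁ = 2, and two ridge summands:

import numpy as np

rng = np.random.default_rng(6)
d = 5
a1, a2 = rng.standard_normal(d), rng.standard_normal(d)
f = lambda x: np.sin(a1 @ x) + np.cos(a2 @ x)

x, c, h = rng.standard_normal(d), rng.standard_normal(d), 1e-5
second_fd = (f(x + h * c) - 2 * f(x) + f(x - h * c)) / h ** 2
# g_1'' = -sin, g_2'' = -cos, each weighted by (a_i . c)^2
exact = -np.sin(a1 @ x) * (a1 @ c) ** 2 - np.cos(a2 @ x) * (a2 @ c) ** 2
print(abs(second_fd - exact))      # small, O(h) discretization error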
SLIDE 111 The recovery strategy: nearly orthonormal systems
We assume that the vectors a₁,...,a_m ∈ R^m are nearly orthonormal, meaning that
$$S(a_1, \dots, a_m) = \inf\Bigl\{\Bigl(\sum_{i=1}^m \|a_i - w_i\|_2^2\Bigr)^{1/2} : w_1, \dots, w_m \text{ orthonormal}\Bigr\}$$
is small! Furthermore, we denote by
$$L = \operatorname{span}\{a_i \otimes a_i,\ i = 1, \dots, m\} \subset \mathbb{R}^{m \times m}$$
the subspace of symmetric matrices generated by the tensor products $a_i \otimes a_i = a_i a_i^T$.
We first recover an approximation of L, i.e., instead of L we have then a subspace L̃ of symmetric matrices at our disposal, which is (in some sense) close to L. Finally, we propose the following algorithm,
$$\arg\max \|M\|_\infty, \quad \text{s.t. } M \in \tilde L,\ \|M\|_F \le 1,$$
to recover the a_i's - or good approximations $\hat a_i$ (which is of course possible only up to the sign).
SLIDE 115 Nonlinear programming to recover the a_i ⊗ a_i's
Figure: The a_i ⊗ a_i are the extremal points of the matrix operator norm!
SLIDE 116 On the ambiguity of learning for nonorthogonal profiles
Let $a_1 = (1, 0)^T$, $a_2 = (\sqrt{2}/2, \sqrt{2}/2)^T$ and $b = (a_1 + a_2)/\|a_1 + a_2\|_2$. We assume that $L = \operatorname{span}\{a_1 a_1^T,\ a_2 a_2^T\}$ and that
$$\tilde L = \operatorname{span}\Bigl\{\begin{pmatrix}1 & \epsilon \\ \epsilon & -\epsilon\end{pmatrix},\ \begin{pmatrix}0.5 + \epsilon & 0.5 + \epsilon \\ 0.5 + \epsilon & 0.5 - \epsilon\end{pmatrix}\Bigr\}.$$
When choosing ε = 0.05, we find out that
$$\{\operatorname{dist}(a_1 a_1^T, \tilde L),\ \operatorname{dist}(a_2 a_2^T, \tilde L),\ \operatorname{dist}(b b^T, \tilde L)\} \subset [0.07,\ 0.08].$$
Hence, looking at L̃ alone, every algorithm will have difficulties deciding which two of the three rank-1 matrices above are the generators of the true L. Nevertheless, $\|b - a_1\|_2 = \|b - a_2\|_2 \approx 0.39$. We see that although the level of noise was rather mild, we have difficulties distinguishing between well separated vectors.
SLIDE 123 The approximation to L
Define $\tilde L = \operatorname{span}\{\Delta f(x_j),\ j = 1, \dots, m_X\}$, where
$$(\Delta f(x))_{j,k} = \frac{f(x + \epsilon(e_j + e_k)) - f(x + \epsilon e_j) - f(x + \epsilon e_k) + f(x)}{\epsilon^2}, \quad j, k = 1, \dots, m,$$
is an approximation to the Hessian of f at x. For x drawn at random, and by applying in a suitable way the matrix Chernoff bounds, one derives a probabilistic error estimate, in the sense that
$$\|P_L - P_{\tilde L}\|_{F \to F} \le C\, m^{3/2} \epsilon$$
with high probability.
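A sketch (assumed toy setup of my own) of the second-order difference Δf(x) and its near-rank-one structure for a single ridge summand:

import numpy as np

def second_diff(f, x, eps=1e-3):
    """(Delta f(x))_{j,k} via the second-order finite differences above."""
    m = len(x)
    E = eps * np.eye(m)
    H = np.empty((m, m))
    for j in range(m):
        for k in range(m):
            H[j, k] = (f(x + E[j] + E[k]) - f(x + E[j])
                       - f(x + E[k]) + f(x)) / eps ** 2
    return H

rng = np.random.default_rng(7)
m = 4
a = rng.standard_normal(m); a /= np.linalg.norm(a)
f = lambda x: np.sin(a @ x)              # a single ridge summand g(a.x)

H = second_diff(f, rng.standard_normal(m))
# the Hessian of sin(a.x) is -sin(a.x) a a^T: one dominant singular value
print(np.round(np.linalg.svd(H, compute_uv=False), 4))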
SLIDE 124 A nonlinear operator towards a gradient ascent
Let us first introduce, for a given parameter γ > 1, an operator acting on the singular values of a matrix $X = U\Sigma V^T$ as follows:
$$\Pi_\gamma(X) = U\,\frac{\operatorname{diag}(\gamma, 1, \dots, 1)\,\Sigma}{\|\operatorname{diag}(\gamma, 1, \dots, 1)\,\Sigma\|_F}\,V^T, \quad \text{where } \operatorname{diag}(\gamma, 1, \dots, 1)\,\Sigma = \operatorname{diag}(\gamma\sigma_1, \sigma_2, \dots, \sigma_m).$$
Notice that Πγ maps any matrix X onto a matrix of unit Frobenius norm, simply exalting the first singular value and damping the others. It is not a linear operator.
SLIDE 125 The nonlinear programming
We propose a projected gradient method for solving
$$\arg\max \|M\|_\infty, \quad \text{s.t. } M \in \tilde L,\ \|M\|_F \le 1.$$
Algorithm 3 (see the sketch below):
◮ Fix a suitable parameter γ > 1;
◮ Assume to have identified a basis for L̃ of positive-semidefinite matrices; for instance, one can use the second-order finite differences Δf(x_j), j = 1,...,m_X, to form such a basis;
◮ Generate an initial guess $X^0 = \sum_{j=1}^{m_X} \zeta_j \Delta f(x_j)$ by choosing at random ζ_j ≥ 0, so that X⁰ ∈ L̃ and $\|X^0\|_F = 1$;
◮ For ℓ ≥ 0:
$$X^{\ell+1} := P_{\tilde L}\,\Pi_\gamma(X^\ell).$$
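A minimal sketch of Algorithm 3 in the idealized case L̃ = L with orthonormal a_i (all names and parameter values are my own toy choices):

import numpy as np

rng = np.random.default_rng(8)
m, gamma = 4, 2.0
A = np.linalg.qr(rng.standard_normal((m, m)))[0]        # orthonormal a_i (columns)
basis = [np.outer(A[:, i], A[:, i]) for i in range(m)]  # the a_i x a_i span L

def proj_L(X):      # orthogonal projection onto L (basis is Frobenius-orthonormal)
    return sum(np.sum(B * X) * B for B in basis)

def pi_gamma(X, gamma):  # exalt the top singular value, renormalize in Frobenius
    U, s, Vt = np.linalg.svd(X)
    s[0] *= gamma
    return U @ np.diag(s / np.linalg.norm(s)) @ Vt

X = proj_L(np.eye(m) + 0.1 * rng.standard_normal((m, m)))
X /= np.linalg.norm(X)
for _ in range(50):
    X = proj_L(pi_gamma(X, gamma))
print(np.round(np.linalg.svd(X, compute_uv=False), 3))  # ~ (1, 0, ..., 0): an a_i x a_i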
SLIDE 126 Analysis of the algorithm for L̃ = L
Proposition (Daubechies, F., Vybíral)
Assume that L̃ = L and that a₁,...,a_m are orthonormal. Let $\gamma > \sqrt 2$ and let $\|X^0\|_\infty > 1/\sqrt{\gamma^2 - 1}$. Then there exists μ₀ < 1 such that
$$\bigl(1 - \|X^{\ell+1}\|_\infty\bigr) \le \mu_0 \bigl(1 - \|X^\ell\|_\infty\bigr)$$
for all ℓ ≥ 0. The sequence (X^ℓ)_ℓ being made of matrices with Frobenius norm bounded by 1, we conclude that any accumulation point of it has both unit Frobenius and spectral norm, and therefore it has to coincide with one of the matrices a_i ⊗ a_i.
The proof is based on the following observation:
$$\|X^{\ell+1}\|_\infty = \sigma_1(X^{\ell+1}) = \frac{\gamma\sigma_1(X^\ell)}{\sqrt{\gamma^2\sigma_1(X^\ell)^2 + \sigma_2(X^\ell)^2 + \dots + \sigma_m(X^\ell)^2}} \ge \frac{\gamma\|X^\ell\|_\infty}{\sqrt{(\gamma^2 - 1)\|X^\ell\|_\infty^2 + 1}}.$$
SLIDE 128 Analysis of the algorithm for L̃ ≈ L
Theorem (Daubechies, F., Vybíral)
Assume that $\|P_{\tilde L} - P_L\|_{F \to F} < \epsilon < 1$ and that a₁,...,a_m are orthonormal. Let $\|X^0\|_\infty > \max\{1/\sqrt{\gamma^2 - 1},\ 1/\sqrt 2 + \epsilon + \xi\}$ and $\gamma > \sqrt 2$. Then for the iterations (X^ℓ)_ℓ produced by Algorithm 3 there exists μ₀ < 1 such that
$$\limsup_\ell\, |1 - \|X^\ell\|_\infty| \le \frac{\mu_1(\gamma, \xi, \epsilon) + 2\epsilon}{1 - \mu_0} + \epsilon,$$
where μ₁(γ, ξ, ε) ≈ ε. The sequence (X^ℓ)_ℓ is bounded and its accumulation points X̄ satisfy simultaneously the following properties:
$$\|\bar X\|_F \le 1 \quad \text{and} \quad \|\bar X\|_\infty \ge 1 - \Bigl(\frac{\mu_1(\gamma, \xi, \epsilon) + 2\epsilon}{1 - \mu_0} + \epsilon\Bigr),$$
and
$$\|P_L \bar X\|_F \le 1 \quad \text{and} \quad \|P_L \bar X\|_\infty \ge 1 - \frac{\mu_1(\gamma, \xi, \epsilon) + 2\epsilon}{1 - \mu_0}.$$
SLIDE 129 A graphical explanation of the algorithm
Figure: Objective function ‖·‖_∞ to be maximized and iterations of Algorithm 3 converging to one of the extremal points a_i ⊗ a_i.
SLIDE 130 Nonlinear programming
Theorem (Daubechies, F., Vybíral)
Let M be any local maximizer of
$$\arg\max \|M\|_\infty, \quad \text{s.t. } M \in \tilde L,\ \|M\|_F \le 1.$$
Then
$$u_j^T X u_j = 0$$
for all $X \in S_{\tilde L}$ with X ⊥ M, and all j ∈ {1,...,m} with $|\lambda_j(M)| = \|M\|_\infty$. If furthermore the a_i's are nearly orthonormal, $S(a_1, \dots, a_m) \le \varepsilon$, and
$$3 \cdot m \cdot \|P_L - P_{\tilde L}\| < (1 - \varepsilon)^2,$$
then $\lambda_1 = \|M\|_\infty > \max\{|\lambda_2|, \dots, |\lambda_m|\}$ and
$$2 \sum_{k=2}^m \frac{(u_1^T X u_k)^2}{\lambda_1 - \lambda_k} \le \lambda_1.$$
SLIDE 131 Nonlinear programming
Algorithm 4:
◮ Let M be a local maximizer of the nonlinear programming;
◮ Take its singular value decomposition $M = \sum_{j=1}^m \lambda_j\, u_j \otimes u_j$;
◮ Put $\hat a := u_1$.
Theorem (Daubechies, F., Vybíral)
Let L = L̃ and $S(a_1, \dots, a_m) \le \varepsilon$. Then there is $j_0 \in \{1, \dots, m\}$ such that $\hat a$ found by Algorithm 4 satisfies $\|\hat a - a_{j_0}\|_2 \le C\sqrt\varepsilon$.
The proof is based on testing the optimality conditions for $X = X_j = a_j \otimes a_j$ and showing that $\lambda_1(M) \approx 1$.
SLIDE 133 Learning sums of ridge functions
Algorithm 5 (a toy sketch follows below):
◮ Let $\hat a_j$ be normalized approximations of $a_j$, j = 1,...,m;
◮ Let $(\hat b_j)_{j=1}^m$ be the dual basis to $(\hat a_j)_{j=1}^m$;
◮ Assume that $f(0) = g_1(0) = \dots = g_m(0)$;
◮ Put $\hat g_j(t) := f(t\, \hat b_j)$, $t \in (-1/\|\hat b_j\|_2,\ 1/\|\hat b_j\|_2)$;
◮ Put $\hat f(x) := \sum_{j=1}^m \hat g_j(\hat a_j \cdot x)$, $\|x\|_2 \le 1$.
Theorem (Daubechies, F., Vybíral)
Let
◮ $S(a_1, \dots, a_m) \le \varepsilon$ and $S(\hat a_1, \dots, \hat a_m) \le \varepsilon'$;
◮ $\|a_j - \hat a_j\|_2 \le \eta$, j = 1,...,m.
Then $\|f - \hat f\|_\infty \le c(\varepsilon, \varepsilon')\, m\, \eta$.
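A toy sketch of Algorithm 5's reconstruction step, assuming exact directions (so â_j = a_j) and profiles with g_i(0) = 0; everything below is an illustrative assumption of mine:

import numpy as np

rng = np.random.default_rng(9)
d, m = 3, 3
A = np.linalg.qr(rng.standard_normal((d, m)))[0].T      # rows a_j, orthonormal
g = [np.sin, np.tanh, lambda t: t ** 2 / 2]             # profiles with g_i(0) = 0
f = lambda x: sum(gi(A[i] @ x) for i, gi in enumerate(g))

B = np.linalg.pinv(A)               # columns b_j: dual basis, a_i . b_j = delta_ij
g_hat = [lambda t, j=j: f(t * B[:, j]) for j in range(m)]   # g_hat_j(t) = f(t b_j)
f_hat = lambda x: sum(g_hat[j](A[j] @ x) for j in range(m))

x = rng.standard_normal(d); x /= np.linalg.norm(x)
print(abs(f(x) - f_hat(x)))         # ~0 here since the a_j are exact and g_i(0) = 0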
SLIDE 134 Our literature
◮ I. Daubechies, M. Fornasier, and J. Vybíral, Approximation of sums of ridge functions, in preparation.
◮ M. Fornasier, K. Schnass, and J. Vybíral, Learning functions of few arbitrary linear parameters in high dimensions, Foundations of Computational Mathematics, Vol. 12, No. 2, 2012, pp. 229-262.
◮ K. Schnass and J. Vybíral, Compressed learning of high-dimensional sparse functions, ICASSP 2011.
◮ A. Kolleck and J. Vybíral, On some aspects of approximation of ridge functions, J. Approx. Theory 194 (2015), 35-61.
◮ S. Mayer, T. Ullrich, and J. Vybíral, Entropy and sampling numbers of classes of ridge functions, to appear in Constructive Approximation.