

SLIDE 1

Learning sums of ridge functions in high dimension: a nonlinear compressed sensing model

Massimo Fornasier

Fakultät für Mathematik, Technische Universität München
massimo.fornasier@ma.tum.de
http://www-m15.ma.tum.de/

Winter School on Compressed Sensing, Technical University of Berlin, December 3-5, 2015
Collection of joint results with Ingrid Daubechies, Karin Schnass, and Jan Vybíral

SLIDE 2

Introduction on ridge functions

◮ A ridge function - in its simplest form - is a function f : R^d → R of the type f(x) = g(aᵀx) = g(a·x), where g : R → R is a scalar univariate function and a ∈ R^d is the direction of the ridge function;

◮ ridge functions are constant along the hyperplanes a·x = λ for any given level λ ∈ R and are among the simplest forms of multivariate functions;

◮ they have been extensively studied in the past couple of decades as approximation building blocks for more complicated high-dimensional functions.

SLIDE 5

Some origins of ridge functions

◮ In multivariate Fourier series the basis functions are of the form e^{in·x} for n ∈ Z^d, and of the form e^{ia·x} for arbitrary directions a ∈ R^d in the Radon transform;

◮ the term "ridge function" was actually coined by Logan and Shepp in 1975 in their work on computerized tomography, where they show how ridge functions solve the corresponding L_2-minimum norm approximation problem.

SLIDE 7

Projection pursuit of the '80s

◮ Ridge function approximation was also extensively studied during the '80s in mathematical statistics under the name of projection pursuit (Huber, 1985; Donoho-Johnstone, 1989);

◮ projection pursuit algorithms approximate a function of d variables by functions of the form

    Σ_{i=1}^m g_i(a_i · x),  x ∈ R^d,

for some functions g_i : R → R and some non-zero vectors a_i ∈ R^d.

SLIDE 9

Some relevant applications of the '90s

◮ In the early '90s there was an explosion of interest in the field of neural networks. One very popular model is the multilayer feed-forward neural network with input, hidden (internal), and output layers;

◮ the simplest case of such a network is described mathematically by a function of the form

    Σ_{i=1}^m α_i σ( Σ_{j=1}^d w_{ij} x_j + θ_i ),

where σ : R → R is given and called the activation function, and the w_{ij} are suitable weights;

SLIDE 11

Ridge functions and approximation theory

◮ In the early '90s the question of whether one can use sums of ridge functions to approximate arbitrary functions well was at the center of the attention of the approximation theory community (overviews by Li, 2002, and Pinkus, 1997);

◮ the efficiency of such an approximation compared to, e.g., spline-type approximation for smoothness classes of functions has been extensively considered (DeVore et al., 1997; Petrushev, 1999);

◮ the identification of a ridge function has also been thoroughly considered; in particular we mention the work of Pinkus and, for what concerns multilayer neural networks, the work by Fefferman, 1994;

◮ except for the work of Candès on ridgelets, there has been less attention after 2000 on the problem of approximating functions by means of ridge functions.

SLIDE 16

Capturing ridge functions from point queries

◮ The above results on the identification of such functions are based on disposing of any possible output (function values, or even derivatives);

◮ this might be, in certain practical situations, very expensive, hazardous, or impossible;

◮ in a paper of 2012, Cohen, Daubechies, DeVore, Kerkyacharian, and Picard address the approximation of ridge functions by the minimal amount of sampling queries: for g ∈ C^s([0,1]), s > 1, ‖g‖_{C^s} ≤ M_0, and ‖a‖_{ℓ_q^d} ≤ M_1, 0 < q ≤ 1,

    ‖f − f̂‖_{C(Ω)} ≤ C M_0 ( L^{−s} + M_1 ( (1 + log(d/L)) / L )^{1/q − 1} ),

using 3L + 2 sampling points, deterministically and adaptively chosen.

SLIDE 20

Capturing ridge functions from point queries: a nonlinear compressed sensing model

Compressed sensing: given a suitable sensing matrix X ∈ R^{m×d}, with m ≪ d, we wish to identify a nearly sparse vector a ∈ R^d from its measurements y ≈ Xa, by means of suitable algorithms (ℓ_1-minimization, greedy algorithms) aware of y and X. The data

    y_i ≈ x_i · a = x_iᵀ a,  i = 1, ..., m,

are linear measurements of a. If now we assume the y_i to be the values of a ridge function at the points x_i,

    y_i ≈ g(a · x_i),  i = 1, ..., m,

for some unknown or roughly given nonlinear function g, the problem of identifying the ridge direction can be understood as a nonlinear compressed sensing model ...

SLIDE 25

Ridge functions and functions of data clustered around manifolds

Figure: Functions on data clustered around a manifold can be locally approximated by k-ridge functions.

SLIDE 26

Universal random sampling for a more general ridge model

◮ M. Fornasier, K. Schnass, J. Vybíral, Learning functions of few arbitrary linear parameters in high dimensions, FoCM, 2012

f(x) = g(Ax),  A is a k × d matrix.

Rows of A are compressible: max_i ‖a_i‖_{ℓ_q} ≤ C_1, 0 < q ≤ 1.

AAᵀ is the identity operator on R^k.

The regularity condition: sup_{|α| ≤ 2} ‖D^α g‖_∞ ≤ C_2.

The matrix H^f := ∫_{S^{d−1}} ∇f(x) ∇f(x)ᵀ dµ_{S^{d−1}}(x) is a positive semi-definite matrix of rank k.

We assume that the singular values of the matrix H^f satisfy

    σ_1(H^f) ≥ · · · ≥ σ_k(H^f) ≥ α > 0.

SLIDE 29

How can we learn k-ridge functions from point queries?

SLIDE 30

MD House's differential diagnosis (or simply called "sensitivity analysis")

We rely on the numerical approximation of ∂f/∂ϕ:

    ∇g(Ax)ᵀ A ϕ = ∂f/∂ϕ(x) = [f(x + ǫϕ) − f(x)]/ǫ − (ǫ/2)[ϕᵀ ∇²f(ζ) ϕ],  ǫ ≤ ǭ.   (∗)

X = {x_j ∈ Ω : j = 1, ..., m_X} drawn uniformly at random in Ω ⊂ R^d;

Φ = {ϕ^j ∈ R^d : j = 1, ..., m_Φ}, where

    ϕ^j_ℓ = +1/√m_Φ with prob. 1/2,  −1/√m_Φ with prob. 1/2,

for every j ∈ {1, ..., m_Φ} and every ℓ ∈ {1, ..., d}.
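To make this sampling scheme concrete, here is a minimal Python sketch of it (the function name, the choice Ω = [0,1]^d, and the default step size ǫ are our own illustrative assumptions, not part of the slides):

```python
import numpy as np

def finite_difference_data(f, d, m_X, m_Phi, eps=1e-3, rng=None):
    """Sketch of the sampling scheme (*): draw m_X points x_j in Omega,
    m_Phi Bernoulli directions phi^i with entries +-1/sqrt(m_Phi), and
    form the difference quotients y_ij ~ (df/dphi^i)(x_j)."""
    rng = np.random.default_rng(rng)
    X_pts = rng.random((m_X, d))      # x_j uniform in [0,1]^d (assumed Omega)
    Phi = rng.choice([-1.0, 1.0], size=(m_Phi, d)) / np.sqrt(m_Phi)
    f_at_X = np.array([f(x) for x in X_pts])   # m_X base evaluations
    # Y is m_Phi x m_X, one column per point: m_X * (m_Phi + 1) queries total
    Y = np.array([[(f(x + eps * phi) - fx) / eps
                   for x, fx in zip(X_pts, f_at_X)]
                  for phi in Phi])
    return X_pts, Phi, Y
```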

SLIDE 32

Sensitivity analysis

Figure: We perform randomized sensitivity analysis at randomly drawn points x ∈ S^{d−1}, queried at x and x + ǫϕ.

SLIDE 33

Collecting together the differential analysis

Φ ... the m_Φ × d matrix whose rows are the ϕ^i; X ... the d × m_X matrix

    X = ( Aᵀ∇g(Ax_1) | ... | Aᵀ∇g(Ax_{m_X}) ).

The m_X × m_Φ instances of (∗) read in matrix notation as

    ΦX = Y + E,   (∗∗)

where Y and E are the m_Φ × m_X matrices defined by

    y_{ij} = [f(x_j + ǫϕ^i) − f(x_j)]/ǫ,   ε_{ij} = −(ǫ/2)[(ϕ^i)ᵀ ∇²f(ζ_{ij}) ϕ^i].

SLIDE 34

Example of active coordinates: which factor does play a role?

We assume that A = ( e_{i_1}ᵀ ; ... ; e_{i_k}ᵀ ), i.e.,

    f(x) = f(x_1, ..., x_d) = g(x_{i_1}, ..., x_{i_k}),

where f : Ω = [0,1]^d → R and g : [0,1]^k → R.

We want to identify first the active coordinates i_1, ..., i_k. Then one can apply any usual k-dimensional approximation method...

A possible algorithm chooses the sampling points at random; due to concentration of measure effects, we get the right result with overwhelming probability.
SLIDE 37

A simple algorithm based on concentration of measure

The algorithm to identify the set of active coordinates I is based on the identity

    ΦᵀΦX = ΦᵀY + ΦᵀE,

where now X has i-th row

    X_i = ( ∂g/∂z_i(Ax_1), ..., ∂g/∂z_i(Ax_{m_X}) )

for i ∈ I, and all other rows equal to zero.

In expectation: ΦᵀΦ ≈ Id : R^d → R^d, so ΦᵀΦX ≈ X, and ΦᵀE is small ⟹ ΦᵀY ≈ X.

We select the k largest rows of ΦᵀY and estimate the probability that their indices coincide with the indices of the non-zero rows of X.

SLIDE 40

A first recovery result

Theorem (Schnass and Vybíral, 2011)

Let f : R^d → R be a function of k active coordinates that is defined and twice continuously differentiable on a small neighbourhood of [0,1]^d. For a positive real number L ≤ d, the randomized algorithm described above recovers the k unknown active coordinates of f with probability at least 1 − 6 exp(−L), using only O(k(L + log k)(L + log d)) samples of f. The constants involved in the O notation depend on smoothness properties of g, namely on the ratio

    max_{j=1,...,k} ‖∂_{i_j} g‖_∞ / min_{j=1,...,k} ‖∂_{i_j} g‖_1.

SLIDE 41

Examples of active coordinate detection in dimension d = 1000

Figure: max(1 − 5√((x_3 − 1/2)² + (x_4 − 1/2)²), 0)³ and sin(6π Σ_{i=21}^{40} x_i) + Σ_{i=21}^{40} [sin(6πx_i) + 5(x_i − 1/2)²]

SLIDE 42

Learning ridge functions, k = 1

Let f(x) = g(a·x), f : B_{R^d} → R, where a ∈ R^d with ‖a‖_2 = 1 and ‖a‖_q ≤ C_1, 0 < q ≤ 1, and max_{0 ≤ |α| ≤ 2} ‖D^α g‖_∞ ≤ C_2,

    α = ∫_{S^{d−1}} ‖∇f(x)‖²_{ℓ_2^d} dµ_{S^{d−1}}(x) = ∫_{S^{d−1}} |g′(a·x)|² dµ_{S^{d−1}}(x) > 0.

We consider again the Taylor expansion (∗), with Ω = S^{d−1}. We choose the points X = {x_j ∈ S^{d−1} : j = 1, ..., m_X} generated at random on S^{d−1} with respect to µ_{S^{d−1}}. The matrix Φ is generated as before, and we obtain (∗∗) again, in the form

    Φ[g′(a·x_j) a] = y_j + ε_j,  j = 1, ..., m_X.

SLIDE 44

Algorithm 1:

◮ Given m_Φ, m_X, draw at random the sets Φ and X, and construct Y according to (∗).

◮ Set x̂_j = ∆(y_j) := argmin_{y_j = Φz} ‖z‖_{ℓ_1^d}.

◮ Find j_0 = argmax_{j=1,...,m_X} ‖x̂_j‖_{ℓ_2^d}.

◮ Set â = x̂_{j_0} / ‖x̂_{j_0}‖_{ℓ_2^d}.

◮ Define ĝ(y) := f(â y) and f̂(x) := ĝ(â · x).
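A compact numerical sketch of Algorithm 1 follows, with the ℓ_1-decoder ∆ implemented as a standard linear program via SciPy (all names, the sphere sampler, and the LP formulation are our own assumptions; this is an illustration, not the authors' implementation):

```python
import numpy as np
from scipy.optimize import linprog

def l1_decoder(Phi, y):
    """Basis pursuit Delta(y) = argmin ||z||_1 s.t. Phi z = y,
    via the usual split z = u - v with u, v >= 0."""
    m, d = Phi.shape
    res = linprog(np.ones(2 * d), A_eq=np.hstack([Phi, -Phi]), b_eq=y,
                  bounds=[(0, None)] * (2 * d))
    return res.x[:d] - res.x[d:]

def algorithm1(f, d, m_X, m_Phi, eps=1e-3, rng=None):
    rng = np.random.default_rng(rng)
    X_pts = rng.standard_normal((m_X, d))
    X_pts /= np.linalg.norm(X_pts, axis=1, keepdims=True)  # uniform on S^{d-1}
    Phi = rng.choice([-1.0, 1.0], size=(m_Phi, d)) / np.sqrt(m_Phi)
    Y = np.array([[(f(x + eps * phi) - f(x)) / eps for phi in Phi]
                  for x in X_pts])                          # m_X x m_Phi
    X_hat = np.array([l1_decoder(Phi, y) for y in Y])       # decoded columns
    j0 = np.argmax(np.linalg.norm(X_hat, axis=1))
    a_hat = X_hat[j0] / np.linalg.norm(X_hat[j0])
    g_hat = lambda t: f(t * a_hat)
    f_hat = lambda x: g_hat(a_hat @ x)
    return a_hat, f_hat
```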

SLIDE 45

Recovery result

Theorem (F., Schnass, and Vybíral, 2012)

Let 0 < s < 1 and log d ≤ m_Φ ≤ [log 6]² d. Then there is a constant c′_1 such that, using m_X · (m_Φ + 1) function evaluations of f, Algorithm 1 defines a function f̂ : B_{R^d}(1 + ǭ) → R that, with probability at least

    1 − ( e^{−c′_1 m_Φ} + e^{−√(m_Φ d)} + 2e^{−2 m_X s² α² / C_2⁴} ),

will satisfy

    ‖f − f̂‖_∞ ≤ 2C_2(1 + ǭ) ν_1 / ( √(α(1 − s)) − ν_1 ),

where

    ν_1 = C′ ( (m_Φ / log(d/m_Φ))^{1/2 − 1/q} + ǫ/√m_Φ ),

and C′ depends only on C_1 and C_2.
SLIDE 46

Ingredients of the proof

◮ compressed sensing;
◮ stability of one-dimensional subspaces;
◮ concentration inequalities (Hoeffding's inequality).

SLIDE 49

Compressed sensing

Theorem (Wojtaszczyk, 2011)

Assume that Φ is an m × d random matrix with all entries being independent Bernoulli variables scaled by 1/√m. Let us suppose that d > [log 6]² m. Then there are positive constants C, c′_1, c′_2 > 0 such that, with probability at least

    1 − e^{−c′_1 m} − e^{−√(md)},

the matrix Φ has the following property. For every x ∈ R^d, ε ∈ R^m, and every natural number K ≤ c′_2 m / log(d/m), we have

    ‖∆(Φx + ε) − x‖_{ℓ_2^d} ≤ C ( K^{−1/2} σ_K(x)_{ℓ_1^d} + max{ ‖ε‖_{ℓ_2^m}, √(log d) ‖ε‖_{ℓ_∞^m} } ),

where

    σ_K(x)_{ℓ_1^d} := inf{ ‖x − z‖_{ℓ_1^d} : # supp z ≤ K }

is the error of the best K-term approximation of x.

SLIDE 53

How does compressed sensing play a role?

For the d × m_X matrix X, i.e., X = ( g′(a·x_1)a | ... | g′(a·x_{m_X})a ), we have

    Φ x^j = y_j + ε_j,  j = 1, ..., m_X,  where x^j := g′(a·x_j) a is the j-th column of X,

and

    x̂_j = ∆(y_j) := argmin_{y_j = Φz} ‖z‖_{ℓ_1^d}.

The previous result gives - with the probability provided there -

    x̂_j = g′(a·x_j) a + n_j,

with n_j properly estimated by

    ‖n_j‖_{ℓ_2^d} ≤ C ( K^{−1/2} σ_K(g′(a·x_j)a)_{ℓ_1^d} + max{ ‖ε_j‖_{ℓ_2^m}, √(log d) ‖ε_j‖_{ℓ_∞^m} } ).
SLIDE 56

Some computations

Let us estimate the quantities. By Stechkin's inequality, for which

    σ_K(x)_{ℓ_1^d} ≤ ‖x‖_{ℓ_q^d} K^{1 − 1/q}  for all x ∈ R^d,

one obtains, for x^j = g′(a·x_j) a,

    K^{−1/2} σ_K(x^j)_{ℓ_1^d} ≤ |g′(a·x_j)| · ‖a‖_{ℓ_q^d} · K^{1/2 − 1/q} ≤ C_1 C_2 (m_Φ / log(d/m_Φ))^{1/2 − 1/q}.

Moreover,

    ‖ε_j‖_{ℓ_∞^{m_Φ}} = (ǫ/2) · max_{i=1,...,m_Φ} |(ϕ^i)ᵀ ∇²f(ζ_{ij}) ϕ^i|
                      ≤ (ǫ/(2m_Φ)) · max_{i=1,...,m_Φ} Σ_{k,l=1}^d |a_k a_l g″(a·ζ_{ij})|
                      ≤ (ǫ‖g″‖_∞/(2m_Φ)) ( Σ_{k=1}^d |a_k| )²
                      ≤ (ǫ‖g″‖_∞/(2m_Φ)) ( Σ_{k=1}^d |a_k|^q )^{2/q} ≤ C_1² C_2 ǫ / (2m_Φ),

and

    ‖ε_j‖_{ℓ_2^{m_Φ}} ≤ √m_Φ ‖ε_j‖_{ℓ_∞^{m_Φ}} ≤ C_1² C_2 ǫ / (2√m_Φ),

leading to

    max{ ‖ε_j‖_{ℓ_2^{m_Φ}}, √(log d) ‖ε_j‖_{ℓ_∞^{m_Φ}} } ≤ (C_1² C_2 ǫ/(2√m_Φ)) · max{ 1, √(log d / m_Φ) } = C_1² C_2 ǫ / (2√m_Φ),

since m_Φ ≥ log d.

SLIDE 62

Summarizing ...

With high probability,

    x̂_j = g′(a·x_j) a + n_j,

where

    ‖n_j‖_{ℓ_2^d} ≤ C ( K^{−1/2} σ_K(g′(a·x_j)a)_{ℓ_1^d} + max{ ‖ε_j‖_{ℓ_2^m}, √(log d) ‖ε_j‖_{ℓ_∞^m} } )
                 ≤ C′ ( (m_Φ / log(d/m_Φ))^{1/2 − 1/q} + ǫ/√m_Φ ) := ν_1.
SLIDE 64

Stability of one-dimensional subspaces

Lemma

Let us fix x̂ ∈ R^d, a ∈ S^{d−1}, 0 ≠ γ ∈ R, and n ∈ R^d with norm ‖n‖_{ℓ_2^d} ≤ ν_1 < |γ|. If we assume x̂ = γa + n, then

    ‖ sign(γ) x̂/‖x̂‖_{ℓ_2^d} − a ‖_{ℓ_2^d} ≤ 2ν_1 / ‖x̂‖_{ℓ_2^d}.

We recall that x̂_j = g′(a·x_j) a + n_j, and

    max_j ‖x̂_j‖_{ℓ_2^d} ≥ max_j |g′(a·x_j)| − max_j ‖x̂_j − x^j‖_{ℓ_2^d} ≥ max_j |g′(a·x_j)| − ν_1,

where the term max_j |g′(a·x_j)| still needs to be estimated from below.

SLIDE 66

Concentration inequalities I

Lemma (Hoeffding's inequality)

Let X_1, ..., X_m be independent random variables. Assume that the X_j are almost surely bounded, i.e., there exist finite scalars a_j, b_j such that P{X_j − EX_j ∈ [a_j, b_j]} = 1 for j = 1, ..., m. Then we have

    P( | Σ_{j=1}^m X_j − E[ Σ_{j=1}^m X_j ] | ≥ t ) ≤ 2 e^{−2t² / Σ_{j=1}^m (b_j − a_j)²}.

Let us now apply Hoeffding's inequality to the random variables X_j = |g′(a·x_j)|².

SLIDE 68

Probabilistic estimates from below

By applying Hoeffding's inequality to the random variables X_j = |g′(a·x_j)|², we have

Lemma

Let us fix 0 < s < 1. Then, with probability at least 1 − 2e^{−2 m_X s² α² / C_2⁴}, we have

    max_{j=1,...,m_X} |g′(a·x_j)| ≥ √(α(1 − s)),

where

    α := E_x(|g′(a·x)|²) = ∫_{S^{d−1}} |g′(a·x)|² dµ_{S^{d−1}}(x) = ∫_{S^{d−1}} ‖∇f(x)‖²_{ℓ_2^d} dµ_{S^{d−1}}(x) > 0.

SLIDE 69

Algorithm 1:

◮ Given m_Φ, m_X, draw at random the sets Φ and X, and construct Y according to (∗).
◮ Set x̂_j = ∆(y_j) := argmin_{y_j = Φz} ‖z‖_{ℓ_1^d}.
◮ Find j_0 = argmax_{j=1,...,m_X} ‖x̂_j‖_{ℓ_2^d}.
◮ Set â = x̂_{j_0} / ‖x̂_{j_0}‖_{ℓ_2^d}.
◮ Define ĝ(y) := f(â y) and f̂(x) := ĝ(â · x).

SLIDE 70

Recovery result

Theorem (F., Schnass, and Vybíral, 2012)

Let 0 < s < 1 and log d ≤ m_Φ ≤ [log 6]² d. Then there is a constant c′_1 such that, using m_X · (m_Φ + 1) function evaluations of f, Algorithm 1 defines a function f̂ : B_{R^d}(1 + ǭ) → R that, with probability at least

    1 − ( e^{−c′_1 m_Φ} + e^{−√(m_Φ d)} + 2e^{−2 m_X s² α² / C_2⁴} ),

will satisfy

    ‖f − f̂‖_∞ ≤ 2C_2(1 + ǭ) ν_1 / ( √(α(1 − s)) − ν_1 ),

where ν_1 = C′ ( (m_Φ / log(d/m_Φ))^{1/2 − 1/q} + ǫ/√m_Φ ) and C′ depends only on C_1 and C_2.
SLIDE 71

Concentration of measure phenomenon and risk of intractability

A key role is played by

    α = ∫_{S^{d−1}} |g′(a·x)|² dµ_{S^{d−1}}(x).

Due to symmetry it is ... independent of a.

Push-forward measure µ_1 on [−1, 1]:

    α = ∫_{−1}^{1} |g′(y)|² dµ_1(y) = Γ(d/2) / (π^{1/2} Γ((d−1)/2)) ∫_{−1}^{1} |g′(y)|² (1 − y²)^{(d−3)/2} dy.

µ_1 concentrates around zero exponentially fast as d → ∞.

SLIDE 74

Dependence on the dimension d

Proposition

Let us fix M ∈ N and assume that g : [−1, 1] → R is C^{M+2}-differentiable in an open neighbourhood U of 0 and

    (d^ℓ/dx^ℓ) g(0) = 0  for ℓ = 1, ..., M.

Then α(d) = O(d^{−M}) for d → ∞.
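A quick numerical illustration of this decay (our own check, not from the slides): for g(y) = y²/2 one has g′(0) = 0, i.e. M = 1, and in fact α(d) = E_{µ_1}[y²] = 1/d exactly, so d · α(d) stays constant:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gammaln

def alpha(d, g_prime):
    """alpha(d) via the push-forward formula of slide 71."""
    # log of Gamma(d/2) / (pi^{1/2} Gamma((d-1)/2)), for numerical stability
    log_c = gammaln(d / 2) - 0.5 * np.log(np.pi) - gammaln((d - 1) / 2)
    integrand = lambda y: g_prime(y) ** 2 * (1.0 - y * y) ** ((d - 3) / 2)
    val, _ = quad(integrand, -1.0, 1.0, limit=200)
    return np.exp(log_c) * val

for d in (10, 100, 1000, 10000):
    print(d, d * alpha(d, lambda y: y))   # ~1 for all d: alpha(d) = 1/d
```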

SLIDE 75

Tractability classes

(1) For 0 < q ≤ 1, C_1 > 1, and C_2 ≥ α_0 > 0, we define

    F¹_d := F¹_d(α_0, q, C_1, C_2) := { f : B_{R^d} → R : ∃a ∈ R^d, ‖a‖_{ℓ_2^d} = 1, ‖a‖_{ℓ_q^d} ≤ C_1,
        and ∃g ∈ C²(B_R), |g′(0)| ≥ α_0 > 0 : f(x) = g(a·x) }.

(2) For a neighborhood U of 0, 0 < q ≤ 1, C_1 > 1, C_2 ≥ α_0 > 0, and N ≥ 2, we define

    F²_d := F²_d(U, α_0, q, C_1, C_2, N) := { f : B_{R^d} → R : ∃a ∈ R^d, ‖a‖_{ℓ_2^d} = 1, ‖a‖_{ℓ_q^d} ≤ C_1,
        and ∃g ∈ C²(B_R) ∩ C^N(U), ∃ 0 ≤ M ≤ N − 1, |g^{(M)}(0)| ≥ α_0 > 0 : f(x) = g(a·x) }.

(3) For a neighborhood U of 0, 0 < q ≤ 1, C_1 > 1, and C_2 ≥ α_0 > 0, we define

    F³_d := F³_d(U, α_0, q, C_1, C_2) := { f : B_{R^d} → R : ∃a ∈ R^d, ‖a‖_{ℓ_2^d} = 1, ‖a‖_{ℓ_q^d} ≤ C_1,
        and ∃g ∈ C²(B_R) ∩ C^∞(U), |g^{(M)}(0)| = 0 for all M ∈ N : f(x) = g(a·x) }.

SLIDE 78

Tractability result

Corollary

The problem of learning functions f in the classes F¹_d and F²_d from point evaluations is strongly polynomially tractable (no polynomial dependence on d) and polynomially tractable (with polynomial dependence on d), respectively.

SLIDE 79

Intractability

On the one hand, let us notice that if in the class F³_d we remove the condition ‖a‖_{ℓ_q^d} ≤ C_1, then the problem actually becomes intractable. Let g ∈ C²([−1 − ǭ, 1 + ǭ]) be given by g(y) = 8(y − 1/2)³ for y ∈ [1/2, 1 + ǭ] and zero otherwise. Notice that, for every a ∈ R^d with ‖a‖_{ℓ_2^d} = 1, the function f(x) = g(a·x) vanishes everywhere on S^{d−1} outside of the cap

    U(a, 1/2) := { x ∈ S^{d−1} : a·x ≥ 1/2 }.

Figure: The function g and the spherical cap U(a, 1/2).

SLIDE 82

Intractability

The µ_{S^{d−1}} measure of U(a, 1/2) obviously does not depend on a and is known to be exponentially small in d. Furthermore, it is known that there are a constant c > 0 and unit vectors a_1, ..., a_K such that the sets U(a_1, 1/2), ..., U(a_K, 1/2) are mutually disjoint and K ≥ e^{cd}. Finally, we observe that max_{x ∈ S^{d−1}} |f(x)| = f(a) = g(1) = 1.

We conclude that any algorithm making use only of the structure f(x) = g(a·x) and of the condition ‖a‖_{ℓ_2^d} = 1 needs to use exponentially many sampling points in order to distinguish between f(x) ≡ 0 and f(x) = g(a_i·x) for some of the a_i's as constructed above.

SLIDE 84

Truly k-ridge functions for k ≫ 1

f(x) = g(Ax),  A is a k × d matrix.

Rows of A are compressible: max_i ‖a_i‖_{ℓ_q} ≤ C_1.

AAᵀ is the identity operator on R^k.

The regularity condition: sup_{|α| ≤ 2} ‖D^α g‖_∞ ≤ C_2.

The matrix H^f := ∫_{S^{d−1}} ∇f(x) ∇f(x)ᵀ dµ_{S^{d−1}}(x) is a positive semi-definite matrix of rank k.

We assume that the singular values of the matrix H^f satisfy σ_1(H^f) ≥ · · · ≥ σ_k(H^f) ≥ α > 0.

SLIDE 87

MD House's differential diagnosis (or simply called "sensitivity analysis")

We rely again on the numerical approximation of ∂f/∂ϕ:

    ∇g(Ax)ᵀ A ϕ = ∂f/∂ϕ(x) = [f(x + ǫϕ) − f(x)]/ǫ − (ǫ/2)[ϕᵀ ∇²f(ζ) ϕ],  ǫ ≤ ǭ,   (∗)

with X = {x_j ∈ Ω : j = 1, ..., m_X} drawn uniformly at random in Ω ⊂ R^d, and Φ = {ϕ^j ∈ R^d : j = 1, ..., m_Φ}, where ϕ^j_ℓ = ±1/√m_Φ with probability 1/2 each, for every j ∈ {1, ..., m_Φ} and every ℓ ∈ {1, ..., d}.

SLIDE 89

Sensitivity analysis

Figure: We perform randomized sensitivity analysis at randomly drawn points x ∈ S^{d−1}, queried at x and x + ǫϕ.

SLIDE 90

Collecting together the differential analysis

Φ ... the m_Φ × d matrix whose rows are the ϕ^i; X ... the d × m_X matrix

    X = ( Aᵀ∇g(Ax_1) | ... | Aᵀ∇g(Ax_{m_X}) ).

The m_X × m_Φ instances of (∗) read in matrix notation as ΦX = Y + E (∗∗), with Y and E the m_Φ × m_X matrices defined by

    y_{ij} = [f(x_j + ǫϕ^i) − f(x_j)]/ǫ,   ε_{ij} = −(ǫ/2)[(ϕ^i)ᵀ ∇²f(ζ_{ij}) ϕ^i].

SLIDE 91

Algorithm 2:

◮ Given m_Φ, m_X, draw at random the sets Φ and X, and construct Y according to (∗).

◮ Set x̂_j = ∆(y_j) := argmin_{y_j = Φz} ‖z‖_{ℓ_1^d} for j = 1, ..., m_X, so that X̂ = (x̂_1 | ... | x̂_{m_X}) is again a d × m_X matrix.

◮ Compute the singular value decomposition of

    X̂ᵀ = ( Û_1  Û_2 ) diag(Σ̂_1, Σ̂_2) ( V̂_1ᵀ ; V̂_2ᵀ ),

where Σ̂_1 contains the k largest singular values.

◮ Set Â = V̂_1ᵀ.

◮ Define ĝ(y) := f(Âᵀ y) and f̂(x) := ĝ(Âx).
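A minimal sketch of Algorithm 2 follows, reusing l1_decoder from the Algorithm 1 sketch (the sphere sampler and all names remain our illustrative assumptions):

```python
import numpy as np

def algorithm2(f, d, k, m_X, m_Phi, eps=1e-3, rng=None):
    rng = np.random.default_rng(rng)
    X_pts = rng.standard_normal((m_X, d))
    X_pts /= np.linalg.norm(X_pts, axis=1, keepdims=True)   # uniform on S^{d-1}
    Phi = rng.choice([-1.0, 1.0], size=(m_Phi, d)) / np.sqrt(m_Phi)
    Y = np.array([[(f(x + eps * phi) - f(x)) / eps for phi in Phi]
                  for x in X_pts])                           # m_X x m_Phi
    X_hat = np.column_stack([l1_decoder(Phi, y) for y in Y])  # d x m_X
    # top-k right singular vectors of X_hat^T span (approximately) the row space of A
    _, _, Vt = np.linalg.svd(X_hat.T, full_matrices=False)
    A_hat = Vt[:k]                                           # k x d
    g_hat = lambda y: f(A_hat.T @ y)
    f_hat = lambda x: g_hat(A_hat @ x)
    return A_hat, f_hat
```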

SLIDE 92

The control of the error

The quality of the final approximation of f by means of f̂ depends on two kinds of accuracy:

1. the error between X̂ and X, which can be controlled through the number of compressed sensing measurements m_Φ;

2. the stability of the span of V_1ᵀ, simply characterized by how well the singular values of X (or, equivalently, of G) are separated from 0, which is related to the number of random samples m_X.

To be precise, we have

SLIDE 94

Recovery result

Theorem (F., Schnass, and Vybíral)

Let log d ≤ m_Φ ≤ [log 6]² d. Then there is a constant c′_1 such that, using m_X · (m_Φ + 1) function evaluations of f, Algorithm 2 defines a function f̂ : B_{R^d}(1 + ǭ) → R that, with probability at least

    1 − ( e^{−c′_1 m_Φ} + e^{−√(m_Φ d)} + k e^{−m_X α s² / (2kC_2²)} ),

will satisfy

    ‖f − f̂‖_∞ ≤ 2C_2 √k (1 + ǭ) ν_2 / ( √(α(1 − s)) − ν_2 ),

where

    ν_2 = C ( k^{1/q} (m_Φ / log(d/m_Φ))^{1/2 − 1/q} + ǫ k² / √m_Φ ),

and C depends only on C_1 and C_2.

SLIDE 95

Ingredients of the proof

◮ compressed sensing;
◮ stability of the SVD;
◮ concentration inequalities (Chernoff bounds for sums of positive-semidefinite matrices).

SLIDE 98

Compressed sensing

Corollary (after Wojtaszczyk, 2011)

Let log d ≤ m_Φ < [log 6]² d. Then, with probability at least 1 − (e^{−c′_1 m_Φ} + e^{−√(m_Φ d)}), the matrix X̂ as calculated in Algorithm 2 satisfies

    ‖X − X̂‖_F ≤ C √m_X ( k^{1/q} (m_Φ / log(d/m_Φ))^{1/2 − 1/q} + ǫ k² / √m_Φ ),

where C depends only on C_1 and C_2.

SLIDE 99

Stability of SVD

Given two matrices B and B̂ with corresponding singular value decompositions

    B = ( U_1  U_2 ) diag(Σ_1, Σ_2) ( V_1ᵀ ; V_2ᵀ )  and  B̂ = ( Û_1  Û_2 ) diag(Σ̂_1, Σ̂_2) ( V̂_1ᵀ ; V̂_2ᵀ ),

we have:

SLIDE 100

Wedin's bound

Theorem (Stability of subspaces)

If there is an ᾱ > 0 such that

    min_{ℓ, ℓ̂} |σ_{ℓ̂}(Σ̂_1) − σ_ℓ(Σ_2)| ≥ ᾱ  and  min_{ℓ̂} |σ_{ℓ̂}(Σ̂_1)| ≥ ᾱ,

then

    ‖V_1V_1ᵀ − V̂_1V̂_1ᵀ‖_F ≤ (2/ᾱ) ‖B − B̂‖_F.

SLIDE 101

Wedin's bound

Applied to our situation, where X has rank k and thus Σ_2 = 0, we get

    ‖V_1V_1ᵀ − V̂_1V̂_1ᵀ‖_F ≤ 2√m_X ν_2 / σ_k(X̂ᵀ),

and further, since σ_k(X̂ᵀ) ≥ σ_k(Xᵀ) − ‖X − X̂‖_F, that

    ‖V_1V_1ᵀ − V̂_1V̂_1ᵀ‖_F ≤ 2√m_X ν_2 / ( σ_k(Xᵀ) − √m_X ν_2 ).

Note that

    Xᵀ = GA = U_G Σ_G [V_Gᵀ A],  for G = ( ∇g(Ax_1) | ... | ∇g(Ax_{m_X}) )ᵀ,

hence Σ_{Xᵀ} = Σ_G. Moreover σ_i(G) = √(σ_i(GᵀG)) for all i = 1, ..., k.

SLIDE 103

Concentration inequalities II

Theorem (Matrix Chernoff bounds)

Consider X_1, ..., X_m independent random, positive-semidefinite matrices of dimension k × k, and suppose σ_1(X_j) ≤ C almost surely. Compute the singular values of the sum of the expectations,

    µ_max = σ_1( Σ_{j=1}^m E X_j )  and  µ_min = σ_k( Σ_{j=1}^m E X_j ).

Then

    P( σ_1( Σ_{j=1}^m X_j ) − µ_max ≥ s µ_max ) ≤ k ( e/(1+s) )^{µ_max (1+s)/C}  for all s > e − 1,

and

    P( σ_k( Σ_{j=1}^m X_j ) − µ_min ≤ −s µ_min ) ≤ k e^{−µ_min s² / (2C)}  for all s ∈ (0, 1).

SLIDE 104

Note that

    GᵀG = Σ_{j=1}^{m_X} ∇g(Ax_j) ∇g(Ax_j)ᵀ,

and by applying the previous result to X_j = ∇g(Ax_j) ∇g(Ax_j)ᵀ we have:

Lemma

For any s ∈ (0, 1) we have that

    σ_k(Xᵀ) ≥ √( m_X α (1 − s) )

with probability at least 1 − k e^{−m_X α s² / (2kC_2²)}.

SLIDE 105

Proof of Theorem

With probability at least

    1 − ( e^{−c′_1 m_Φ} + e^{−√(m_Φ d)} + k e^{−m_X α s² / (2kC_2²)} ),

we have

    ‖V_1V_1ᵀ − V̂_1V̂_1ᵀ‖_F ≤ 2ν_2 / ( √(α(1 − s)) − ν_2 ),

and, for Â = V̂_1ᵀ and V_Gᵀ A = V_1ᵀ,

    ‖AᵀA − ÂᵀÂ‖_F = ‖AᵀV_G V_Gᵀ A − V̂_1V̂_1ᵀ‖_F ≤ 2ν_2 / ( √(α(1 − s)) − ν_2 ).

SLIDE 107

Proof of Theorem ... continued

Since A is row-orthogonal we have A = AAᵀA, and

    |f(x) − f̂(x)| = |g(Ax) − ĝ(Âx)| = |g(Ax) − g(AÂᵀÂx)|
                  ≤ C_2 √k ‖Ax − AÂᵀÂx‖_{ℓ_2^k}
                  = C_2 √k ‖A(AᵀA − ÂᵀÂ)x‖_{ℓ_2^k}
                  ≤ C_2 √k ‖AᵀA − ÂᵀÂ‖_F ‖x‖_{ℓ_2^d}
                  ≤ 2C_2 √k (1 + ǭ) ν_2 / ( √(α(1 − s)) − ν_2 ),

where we used

    ‖AᵀA − ÂᵀÂ‖_F = ‖AᵀV_G V_Gᵀ A − V̂_1V̂_1ᵀ‖_F ≤ 2ν_2 / ( √(α(1 − s)) − ν_2 ).

SLIDE 108

k-ridge functions may be too simple!

Figure: Functions on data clustered around a manifold with multiple directions can be locally approximated by sums of k-ridge functions.

SLIDE 109

Sums of ridge functions

Can we still learn functions of the type

    f(x) = Σ_{i=1}^m g_i(a_i · x),  x ∈ [−1, 1]^d ?

Our approach (Daubechies, F., Vybíral) is essentially based on the formula

    D^{α_1}_{c_1} ... D^{α_k}_{c_k} f(x) = Σ_{i=1}^m g_i^{(α_1 + ··· + α_k)}(a_i · x) (a_i · c_1)^{α_1} ... (a_i · c_k)^{α_k},

where k ∈ N, c_i ∈ R^d, α_i ∈ N for all i = 1, ..., k, and D^{α_i}_{c_i} is the α_i-th derivative in the direction c_i.

SLIDE 111

The recovery strategy: nearly orthonormal systems

We assume that the vectors a_1, ..., a_m ∈ R^m are nearly orthonormal, meaning that

    S(a_1, ..., a_m) = inf{ ( Σ_{i=1}^m ‖a_i − w_i‖_2² )^{1/2} : w_1, ..., w_m an orthonormal basis in R^m }

is small! Furthermore, we denote by

    L = span{ a_i ⊗ a_i, i = 1, ..., m } ⊂ R^{m×m}

the subspace of symmetric matrices generated by the tensor products a_i ⊗ a_i = a_i a_iᵀ.

We first recover an approximation of L, i.e., instead of L we then have at our disposal a subspace L̃ of symmetric matrices which is (in some sense) close to L. Finally, we propose the algorithm

    argmax ‖M‖_∞  s.t.  M ∈ L̃, ‖M‖_F ≤ 1,

to recover the a_i's - or good approximations â_i of them (which is of course possible only up to the sign).
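As an aside (our own remark, not from the slides), S can be evaluated exactly: by the orthogonal Procrustes theorem, the orthonormal system closest in the Frobenius sense to the columns of A is W = UVᵀ, where A = UΣVᵀ, so that S(a_1, ..., a_m) = (Σ_i (σ_i − 1)²)^{1/2}. A minimal sketch:

```python
import numpy as np

def near_orthonormality(A):
    """S(a_1, ..., a_m) for the columns of the m x m matrix A: by the
    orthogonal Procrustes theorem, the Frobenius distance to the nearest
    orthonormal basis equals sqrt(sum_i (sigma_i - 1)^2)."""
    s = np.linalg.svd(A, compute_uv=False)
    return np.sqrt(np.sum((s - 1.0) ** 2))

# example: a mildly perturbed orthonormal basis has small S
A = np.linalg.qr(np.random.randn(5, 5))[0] + 0.01 * np.random.randn(5, 5)
print(near_orthonormality(A))
```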

SLIDE 115

Nonlinear programming to recover the a_i ⊗ a_i's

Figure: The a_i ⊗ a_i are the extremal points of the matrix operator norm!

SLIDE 116

On the ambiguity of learning for nonorthogonal profiles

Let a_1 = (1, 0)ᵀ, a_2 = (√2/2, √2/2)ᵀ and b = (a_1 + a_2)/‖a_1 + a_2‖_2. We assume that

    L = span{ a_1a_1ᵀ, a_2a_2ᵀ }

and that

    L̃ = span{ ( 1  ǫ ; ǫ  −ǫ ), ( 0.5  0.5 + ǫ ; 0.5 + ǫ  0.5 − ǫ ) }.

When choosing ǫ = 0.05, we find out that

    { dist(a_1a_1ᵀ, L̃), dist(a_2a_2ᵀ, L̃), dist(bbᵀ, L̃) } ⊂ [0.07, 0.08].

Hence, looking at L̃ alone, every algorithm will have difficulties deciding which two of the three rank-1 matrices above are the generators of the true L. Nevertheless, ‖b − a_1‖_2 = ‖b − a_2‖_2 ≈ 0.39. We see that although the level of noise was rather mild, we have difficulties distinguishing between well-separated vectors.
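This small example can be checked numerically; the following sketch (the helper dist_to_span is ours) reproduces the distances quoted above:

```python
import numpy as np

def dist_to_span(M, basis):
    """Frobenius distance of M to span(basis), by orthonormalizing the
    flattened basis matrices with a QR decomposition."""
    B = np.column_stack([X.ravel() for X in basis])
    Q, _ = np.linalg.qr(B)
    v = M.ravel()
    return np.linalg.norm(v - Q @ (Q.T @ v))

e = 0.05
L_tilde = [np.array([[1.0, e], [e, -e]]),
           np.array([[0.5, 0.5 + e], [0.5 + e, 0.5 - e]])]
a1 = np.array([1.0, 0.0])
a2 = np.array([np.sqrt(2) / 2, np.sqrt(2) / 2])
b = (a1 + a2) / np.linalg.norm(a1 + a2)
for v in (a1, a2, b):
    print(dist_to_span(np.outer(v, v), L_tilde))   # all in [0.07, 0.08]
```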

SLIDE 123

The approximation to L

Define L̃ = span{ ∆f(x_j), j = 1, ..., m_X }, where

    (∆f(x))_{j,k} = [ f(x + ǫ(e_j + e_k)) − f(x + ǫe_j) − f(x + ǫe_k) + f(x) ] / ǫ²,  j, k = 1, ..., m,

is an approximation to the Hessian of f at x. For x drawn at random, and by applying the matrix Chernoff bounds in a suitable way, one derives a probabilistic error estimate, in the sense that

    ‖P_L − P_{L̃}‖_{F→F} ≤ C m^{3/2} ǫ,

with high probability.
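In code, ∆f(x) is just a table of second-order difference quotients; a minimal sketch (the function name and default step size are our assumptions):

```python
import numpy as np

def hessian_by_differences(f, x, eps=1e-3):
    """The matrix Delta f(x) of slide 123: a symmetric approximation of
    the Hessian of f at x from function values only."""
    m = x.size
    E = np.eye(m)
    fx = f(x)
    fe = np.array([f(x + eps * E[j]) for j in range(m)])
    H = np.empty((m, m))
    for j in range(m):
        for k in range(m):
            H[j, k] = (f(x + eps * (E[j] + E[k])) - fe[j] - fe[k] + fx) / eps**2
    return H

# L_tilde is then spanned by such matrices at randomly drawn points x_1, ..., x_mX.
```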

SLIDE 124

A nonlinear operator towards a gradient ascent

Let us first introduce, for a given parameter γ > 1, an operator acting on the singular values of a matrix X = UΣVᵀ as follows:

    Π_γ(X) = U [ D_γ Σ / ‖D_γ Σ‖_F ] Vᵀ,  where D_γ = diag(γ, 1, ..., 1),

so that D_γ Σ = diag(γσ_1, σ_2, ..., σ_m). Notice that Π_γ maps any matrix X onto a matrix of unit Frobenius norm, simply exalting the first singular value and damping the others. It is not a linear operator.

SLIDE 125

The nonlinear programming

We propose a projected gradient method for solving

    argmax ‖M‖_∞  s.t.  M ∈ L̃, ‖M‖_F ≤ 1.

Algorithm 3:

◮ Fix a suitable parameter γ > 1.

◮ Assume we have identified a basis for L̃ of positive semi-definite matrices; for instance, one can use the second-order finite differences ∆f(x_j), j = 1, ..., m_X, to form such a basis.

◮ Generate an initial guess X⁰ = Σ_{j=1}^{m_X} ζ_j ∆f(x_j) by choosing the ζ_j ≥ 0 at random, so that X⁰ ∈ L̃ and ‖X⁰‖_F = 1.

◮ For ℓ ≥ 0:

    X^{ℓ+1} := P_{L̃} Π_γ(X^ℓ).
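A minimal numerical sketch of Π_γ and of the iteration of Algorithm 3 (the names, the QR-based realization of P_{L̃}, and the fixed iteration count are our assumptions):

```python
import numpy as np

def pi_gamma(X, gamma):
    """Pi_gamma: boost the leading singular value by gamma, then
    renormalize to unit Frobenius norm."""
    U, s, Vt = np.linalg.svd(X)
    s[0] *= gamma
    s /= np.linalg.norm(s)
    return (U * s) @ Vt

def algorithm3(basis, gamma=2.0, n_iter=200, rng=None):
    """Projected iteration X <- P_Ltilde(Pi_gamma(X)) over the unit
    Frobenius ball of L_tilde = span(basis)."""
    rng = np.random.default_rng(rng)
    m = basis[0].shape[0]
    B = np.column_stack([M.ravel() for M in basis])
    Q, _ = np.linalg.qr(B)                       # orthonormal basis of L_tilde
    project = lambda X: (Q @ (Q.T @ X.ravel())).reshape(m, m)
    X = sum(rng.random() * M for M in basis)     # random PSD combination
    X /= np.linalg.norm(X)
    for _ in range(n_iter):
        X = project(pi_gamma(X, gamma))
    return X   # under the hypotheses below, ~ a_i (x) a_i for some i
```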

SLIDE 126

Analysis of the algorithm for L̃ = L

Proposition (Daubechies, F., Vybíral)

Assume that L̃ = L and that a_1, ..., a_m are orthonormal. Let γ > √2 and let ‖X⁰‖_∞ > 1/√(γ² − 1). Then there exists µ_0 < 1 such that

    ( 1 − ‖X^{ℓ+1}‖_∞ ) ≤ µ_0 ( 1 − ‖X^ℓ‖_∞ )

for all ℓ ≥ 0. The sequence (X^ℓ)_ℓ being made of matrices with Frobenius norm bounded by 1, we conclude that any accumulation point of it has both unit Frobenius and unit spectral norm, and therefore it has to coincide with one maximizer.

The proof is based on the following observation:

    ‖X^{ℓ+1}‖_∞ = σ_1(X^{ℓ+1}) = γσ_1(X^ℓ) / √( γ²σ_1(X^ℓ)² + σ_2(X^ℓ)² + ··· + σ_m(X^ℓ)² )
                ≥ γ‖X^ℓ‖_∞ / √( (γ² − 1)‖X^ℓ‖²_∞ + 1 ).

SLIDE 128

Analysis of the algorithm for L̃ ≈ L

Theorem (Daubechies, F., Vybíral)

Assume that ‖P_{L̃} − P_L‖_{F→F} < ǫ < 1 and that a_1, ..., a_m are orthonormal. Let ‖X⁰‖_∞ > max{ 1/√(γ² − 1), 1/√2 + ǫ + ξ } and γ > √2. Then, for the iterations (X^ℓ)_ℓ produced by Algorithm 3, there exists µ_0 < 1 such that

    lim sup_ℓ |1 − ‖X^ℓ‖_∞| ≤ ( µ_1(γ, t_0, ǫ) + 2ǫ ) / (1 − µ_0) + ǫ,  where µ_1(γ, ξ, ǫ) ≈ ǫ.

The sequence (X^ℓ)_ℓ is bounded, and its accumulation points X̄ satisfy simultaneously the following properties:

    ‖X̄‖_F ≤ 1  and  ‖X̄‖_∞ ≥ 1 − ( µ_1(γ, t_0, ǫ) + 2ǫ ) / (1 − µ_0) − ǫ,

and

    ‖P_L X̄‖_F ≤ 1  and  ‖P_L X̄‖_∞ ≥ 1 − ( µ_1(γ, t_0, ǫ) + 2ǫ ) / (1 − µ_0).

SLIDE 129

A graphical explanation of the algorithm

Figure: Objective function ‖·‖_∞ to be maximized, and iterations of Algorithm 3 converging to one of the extremal points a_i ⊗ a_i.

SLIDE 130

Nonlinear programming

Theorem (Daubechies, F., Vybíral)

Let M be any local maximizer of

    argmax ‖M‖_∞  s.t.  M ∈ L̃, ‖M‖_F ≤ 1.

Then

    u_jᵀ X u_j = 0

for all X ∈ S_{L̃} with X ⊥ M and all j ∈ {1, ..., m} with |λ_j(0)| = ‖M‖_∞. If furthermore the a_i's are nearly orthonormal, S(a_1, ..., a_m) ≤ ε, and

    3 · m · ‖P_L − P_{L̃}‖ < (1 − ε)²,

then λ_1 = ‖M‖_∞ > max{|λ_2|, ..., |λ_m|} and

    2 Σ_{k=2}^m (u_1ᵀ X u_k)² / (λ_1 − λ_k) ≤ λ_1.

SLIDE 131

Nonlinear programming

Algorithm 4:

◮ Let M be a local maximizer of the nonlinear programming problem.
◮ Take its singular value decomposition M = Σ_{j=1}^m λ_j u_j ⊗ u_j.
◮ Put â := u_1.

Theorem (Daubechies, F., Vybíral)

Let L = L̃ and S(a_1, ..., a_m) ≤ ε. Then there is j_0 ∈ {1, ..., m} such that the â found by Algorithm 4 satisfies

    ‖â − a_{j_0}‖_2 ≤ C√ε.

The proof is based on testing the optimality conditions for X = X_j = a_j ⊗ a_j and showing that λ_1(M) ≈ 1.

SLIDE 133

Learning sums of ridge functions

Algorithm 5:

◮ Let â_j be normalized approximations of the a_j, j = 1, ..., m.
◮ Let (b̂_j)_{j=1}^m be the dual basis to (â_j)_{j=1}^m.
◮ Assume that f(0) = g_1(0) = ··· = g_m(0).
◮ Put ĝ_j(t) := f(t b̂_j), t ∈ (−1/‖b̂_j‖_2, 1/‖b̂_j‖_2).
◮ Put f̂(x) := Σ_{j=1}^m ĝ_j(â_j · x), ‖x‖_2 ≤ 1.

Theorem (Daubechies, F., Vybíral)

Let

◮ S(a_1, ..., a_m) ≤ ε and S(â_1, ..., â_m) ≤ ε′;
◮ ‖a_j − â_j‖_2 ≤ η, j = 1, ..., m.

Then ‖f − f̂‖_∞ ≤ c(ε, ε′) m η.
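A minimal sketch of the reconstruction step (the function name is ours; note that for â_i · b̂_j = δ_{ij} the dual basis consists of the columns of the inverse of the matrix whose rows are the â_j):

```python
import numpy as np

def algorithm5(f, A_hat):
    """A_hat is an m x m matrix whose rows are the normalized
    approximations a_hat_j; assumes g_1(0) = ... = g_m(0) = f(0)."""
    B_hat = np.linalg.inv(A_hat)                 # column j is b_hat_j
    g_hat = [lambda t, b=B_hat[:, j]: f(t * b)
             for j in range(A_hat.shape[0])]
    def f_hat(x):
        # f_hat(x) = sum_j g_hat_j(a_hat_j . x)
        return sum(g(A_hat[j] @ x) for j, g in enumerate(g_hat))
    return f_hat

# sanity check on an exactly orthonormal pair of profiles:
a1, a2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
f = lambda x: np.sin(a1 @ x) + (a2 @ x) ** 2
f_hat = algorithm5(f, np.vstack([a1, a2]))
x = np.array([0.3, -0.2])
print(f(x), f_hat(x))                            # the two values agree
```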

SLIDE 134

Our literature

◮ I. Daubechies, M. Fornasier, and J. Vybíral, Approximation of sums of ridge functions, in preparation.

◮ M. Fornasier, K. Schnass, and J. Vybíral, Learning functions of few arbitrary linear parameters in high dimensions, Foundations of Computational Mathematics, Vol. 12, No. 2, 2012, pp. 229-262.

◮ K. Schnass and J. Vybíral, Compressed learning of high-dimensional sparse functions, ICASSP11, 2011.

◮ A. Kolleck and J. Vybíral, On some aspects of approximation of ridge functions, J. Approx. Theory 194 (2015), 35-61.

◮ S. Mayer, T. Ullrich, and J. Vybíral, Entropy and sampling numbers of classes of ridge functions, to appear in Constructive Approximation.