
Learning sums of ridge functions in high dimension: a nonlinear compressed sensing model
Massimo Fornasier
Fakultät für Mathematik, Technische Universität München
massimo.fornasier@ma.tum.de
http://www-m15.ma.tum.de/
Winter


  1–3. Capturing ridge functions from point queries: a nonlinear compressed sensing model
  Compressed sensing: given a suitable sensing matrix X ∈ R^{m×d}, with m ≪ d, we wish to identify a nearly sparse vector a ∈ R^d from its measurements y ≈ Xa, by means of suitable algorithms (ℓ_1-minimization, greedy algorithms) aware of y and X. The data y_i ≈ x_i · a = x_i^T a, i = 1, ..., m, are linear measurements of a. If we now assume the y_i to be the values of a ridge function at the points x_i,
  y_i ≈ g(a · x_i),   i = 1, ..., m,
  for some unknown or only roughly known nonlinear function g, the problem of identifying the ridge direction a can be understood as a nonlinear compressed sensing model.

  4. Ridge functions and functions of data clustered around manifolds
  Figure: Functions on data clustered around a manifold can be locally approximated by k-ridge functions.

  5–7. Universal random sampling for a more general ridge model
  M. Fornasier, K. Schnass, J. Vybíral, Learning functions of few arbitrary linear parameters in high dimensions, FoCM, 2012.
  f(x) = g(Ax), where A is a k × d matrix.
  Rows of A are compressible: max_i ‖a_i‖_q ≤ C_1, for some 0 < q ≤ 1.
  AA^T is the identity operator on R^k.
  Regularity condition: sup_{|α| ≤ 2} ‖D^α g‖_∞ ≤ C_2.
  The matrix H_f := ∫_{S^{d−1}} ∇f(x) ∇f(x)^T dμ_{S^{d−1}}(x) is a positive semi-definite matrix of rank k. We assume that the singular values of H_f satisfy σ_1(H_f) ≥ ⋯ ≥ σ_k(H_f) ≥ α > 0.

  8. How can we learn k-ridge functions from point queries?

  9–10. MD House's differential diagnosis (or simply "sensitivity analysis")
  We rely on a numerical approximation of ∂f/∂φ:
  ∇g(Ax)^T A φ = ∂f/∂φ (x) = [f(x + εφ) − f(x)]/ε − (ε/2) [φ^T ∇²f(ζ) φ],   ε ≤ ε̄.   (*)
  X = {x_j ∈ Ω : j = 1, ..., m_X} drawn uniformly at random in Ω ⊂ R^d.
  Φ = {φ_i ∈ R^d : i = 1, ..., m_Φ}, where
  φ_{iℓ} = +1/√m_Φ with probability 1/2, and −1/√m_Φ with probability 1/2,
  for every i ∈ {1, ..., m_Φ} and every ℓ ∈ {1, ..., d}.

  11. Sensitivity analysis
  Figure: We perform randomized sensitivity analysis at randomly drawn points x ∈ S^{d−1}, querying f at x and at the perturbed point x + εφ.

  12. Collecting the differential analysis together
  Φ ... the m_Φ × d matrix whose rows are the φ_i;  X ... the d × m_X matrix
  X = ( A^T ∇g(Ax_1) | ⋯ | A^T ∇g(Ax_{m_X}) ).
  The m_X · m_Φ instances of (*) read, in matrix notation,
  Φ X = Y + E,   (**)
  where Y and E are the m_Φ × m_X matrices defined by
  y_{ij} = [f(x_j + εφ_i) − f(x_j)]/ε,   ε_{ij} = −(ε/2) [φ_i^T ∇²f(ζ_{ij}) φ_i].
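
The sampling scheme (*)–(**) is easy to mock up. The Python sketch below is our own illustration, not the authors' code: the names f, d, m_X, m_Phi, eps and the test ridge function are illustrative choices, and the points are drawn on S^{d−1}, the choice used later for the k = 1 analysis.

```python
import numpy as np

# Illustrative sketch of the sampling step (*)-(**); all concrete choices are ours.
def build_measurements(f, d, m_X, m_Phi, eps, rng=None):
    """Return Phi (m_Phi x d), X (d x m_X points on S^{d-1}) and Y (m_Phi x m_X)
    with Y[i, j] = (f(x_j + eps*phi_i) - f(x_j)) / eps, so that Phi @ Xmat = Y + E."""
    rng = np.random.default_rng(rng)
    # sampling points x_j drawn uniformly at random on the sphere S^{d-1}
    X = rng.standard_normal((d, m_X))
    X /= np.linalg.norm(X, axis=0)
    # Bernoulli directions phi_i with entries +-1/sqrt(m_Phi), probability 1/2 each
    Phi = rng.choice([-1.0, 1.0], size=(m_Phi, d)) / np.sqrt(m_Phi)
    Y = np.empty((m_Phi, m_X))
    for j in range(m_X):
        fxj = f(X[:, j])
        for i in range(m_Phi):
            Y[i, j] = (f(X[:, j] + eps * Phi[i]) - fxj) / eps
    return Phi, X, Y

# Hypothetical example: a function of three active coordinates (a 1-ridge profile).
d = 200
a = np.zeros(d)
a[[3, 17, 42]] = [0.6, -0.64, 0.48]            # compressible direction, ||a||_2 = 1
f = lambda x: np.sin(a @ x)
Phi, X, Y = build_measurements(f, d, m_X=10, m_Phi=60, eps=1e-3, rng=0)
```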

  13–15. Example of active coordinates: which factors play a role?
  We assume that
  A = ( e_{i_1}^T ; ⋯ ; e_{i_k}^T ),
  i.e. f(x) = f(x_1, ..., x_d) = g(x_{i_1}, ..., x_{i_k}), where f : Ω = [0, 1]^d → R and g : [0, 1]^k → R.
  We first want to identify the active coordinates i_1, ..., i_k. Then one can apply any usual k-dimensional approximation method.
  A possible algorithm chooses the sampling points at random; thanks to concentration of measure effects, it returns the correct result with overwhelming probability.

  16–18. A simple algorithm based on concentration of measure
  The algorithm to identify the set of active coordinates I is based on the identity
  Φ^T Φ X = Φ^T Y + Φ^T E,
  where now X has i-th row
  X_i = ( ∂g/∂z_i (Ax_1), ..., ∂g/∂z_i (Ax_{m_X}) )
  for i ∈ I, and all other rows equal to zero.
  In expectation, Φ^T Φ ≈ I_d : R^d → R^d, hence Φ^T Φ X ≈ X; moreover Φ^T E is small, so Φ^T Y ≈ X.
  We select the k largest rows of Φ^T Y (in Euclidean norm) and estimate the probability that their indices coincide with the indices of the non-zero rows of X; a sketch of this selection step follows below.
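
A hedged sketch of the selection step just described, reusing Phi and Y from the sampling sketch above; the function name and the expected output are our own illustration.

```python
import numpy as np

# Sketch of the coordinate-detection step: take the k rows of Phi^T Y with the
# largest Euclidean norm.  Phi, Y as in the sampling sketch; k assumed known.
def detect_active_coordinates(Phi, Y, k):
    G = Phi.T @ Y                                  # d x m_X, approximates X
    row_norms = np.linalg.norm(G, axis=1)
    return np.sort(np.argsort(row_norms)[-k:])     # indices of the k largest rows

# With the illustrative f above (active coordinates 3, 17, 42) one expects
# detect_active_coordinates(Phi, Y, 3) to return [3, 17, 42] with high probability.
```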

  19. A first recovery result
  Theorem (Schnass and Vybíral 2011). Let f : R^d → R be a function of k active coordinates that is defined and twice continuously differentiable on a small neighbourhood of [0, 1]^d. For a positive real number L ≤ d, the randomized algorithm described above recovers the k unknown active coordinates of f with probability at least 1 − 6 exp(−L), using only O( k (L + log k)(L + log d) ) samples of f. The constants involved in the O notation depend on smoothness properties of g, namely on the ratio
  max_{j=1,...,k} ‖∂_{i_j} g‖_∞ / min_{j=1,...,k} ‖∂_{i_j} g‖_1.

  20. Examples of active coordinate detection in dimension d = 1000
  Figure: two test functions, max( 1 − 5 √((x_3 − 1/2)² + (x_4 − 1/2)²), 0 )³ and a function of the coordinates x_21, ..., x_40 built from the terms sin(6π x_i) + 5 (x_i − 1/2)².

  21–22. Learning ridge functions, k = 1
  Let f(x) = g(a · x), f : B_{R^d} → R, where a ∈ R^d with ‖a‖_2 = 1 and ‖a‖_q ≤ C_1 for some 0 < q ≤ 1, and max_{0 ≤ α ≤ 2} ‖D^α g‖_∞ ≤ C_2. We assume
  α := ∫_{S^{d−1}} ‖∇f(x)‖²_{ℓ_2^d} dμ_{S^{d−1}}(x) = ∫_{S^{d−1}} |g′(a · x)|² dμ_{S^{d−1}}(x) > 0.
  We consider again the Taylor expansion (*), now with Ω = S^{d−1}. We choose the points X = {x_j ∈ S^{d−1} : j = 1, ..., m_X} at random on S^{d−1} with respect to μ_{S^{d−1}}. The matrix Φ is generated as before, and (**) takes the form
  Φ [ g′(a · x_j) a ] = y_j + ε_j,   j = 1, ..., m_X.

  23. Algorithm 1:
  ◮ Given m_Φ, m_X, draw at random the sets Φ and X, and construct Y according to (*).
  ◮ Set x̂_j = ∆(y_j) := argmin_{z : Φz = y_j} ‖z‖_{ℓ_1^d}.
  ◮ Find j_0 = argmax_{j=1,...,m_X} ‖x̂_j‖_{ℓ_2^d}.
  ◮ Set â = x̂_{j_0} / ‖x̂_{j_0}‖_{ℓ_2^d}.
  ◮ Define ĝ(y) := f(y â) and f̂(x) := ĝ(â · x).
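
Below is a minimal Python sketch of Algorithm 1, assuming the decoder ∆ is realized as the standard ℓ_1-minimization linear program (here via scipy.optimize.linprog). All helper names are ours; this illustrates the listed steps rather than reproducing the authors' implementation.

```python
import numpy as np
from scipy.optimize import linprog

# Decoder Delta(y) = argmin{ ||z||_1 : Phi z = y }, written as a linear program
# in split variables z = u - v with u, v >= 0.
def l1_decode(Phi, y):
    m, d = Phi.shape
    c = np.ones(2 * d)                       # minimize sum(u) + sum(v) = ||z||_1
    A_eq = np.hstack([Phi, -Phi])            # enforce Phi (u - v) = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    u, v = res.x[:d], res.x[d:]
    return u - v

def algorithm_1(f, Phi, Y):
    """Phi: m_Phi x d Bernoulli matrix, Y: m_Phi x m_X finite differences from (*)."""
    Xhat = np.column_stack([l1_decode(Phi, Y[:, j]) for j in range(Y.shape[1])])
    j0 = np.argmax(np.linalg.norm(Xhat, axis=0))       # column with largest norm
    a_hat = Xhat[:, j0] / np.linalg.norm(Xhat[:, j0])  # estimated ridge direction
    g_hat = lambda y: f(y * a_hat)                     # profile of f along a_hat
    f_hat = lambda x: g_hat(a_hat @ x)
    return a_hat, f_hat
```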

  24. Recovery result
  Theorem (F., Schnass, and Vybíral 2012). Let 0 < s < 1 and log d ≤ m_Φ ≤ [log 6]² d. Then there is a constant c′_1 such that, using m_X · (m_Φ + 1) function evaluations of f, Algorithm 1 defines a function f̂ : B_{R^d}(1 + ε̄) → R that, with probability at least
  1 − ( e^{−c′_1 m_Φ} + e^{−√(m_Φ d)} + 2 e^{−2 m_X s² α² / C_2⁴} ),
  will satisfy
  ‖f − f̂‖_∞ ≤ 2 C_2 (1 + ε̄) ν_1 / ( √(α(1 − s)) − ν_1 ),
  where
  ν_1 = C′ [ ( m_Φ / log(d/m_Φ) )^{1/2 − 1/q} + ε / √m_Φ ]
  and C′ depends only on C_1 and C_2.

  25–27. Ingredients of the proof
  ◮ compressed sensing;
  ◮ stability of one-dimensional subspaces;
  ◮ concentration inequalities (Hoeffding's inequality).

  28–31. Compressed sensing
  Theorem (Wojtaszczyk, 2011). Assume that Φ is an m × d random matrix with all entries being independent Bernoulli variables scaled by 1/√m. Let us suppose that d > [log 6]² m. Then there are positive constants C, c′_1, c′_2 > 0 such that, with probability at least
  1 − e^{−c′_1 m} − e^{−√(m d)},
  the matrix Φ has the following property. For every x ∈ R^d, ε ∈ R^m and every natural number K ≤ c′_2 m / log(d/m) we have
  ‖∆(Φx + ε) − x‖_{ℓ_2^d} ≤ C ( K^{−1/2} σ_K(x)_{ℓ_1^d} + max{ ‖ε‖_{ℓ_2^m}, √(log d) ‖ε‖_{ℓ_∞^m} } ),
  where σ_K(x)_{ℓ_1^d} := inf{ ‖x − z‖_{ℓ_1^d} : # supp z ≤ K } is the error of the best K-term approximation of x.

  32–34. How does compressed sensing play a role?
  For the d × m_X matrix X, i.e.
  X = ( g′(a · x_1) a | ⋯ | g′(a · x_{m_X}) a ),
  we have
  Φ [ g′(a · x_j) a ] = y_j + ε_j,   j = 1, ..., m_X,
  where the j-th column of X is denoted x_j := g′(a · x_j) a, and
  x̂_j = ∆(y_j) := argmin_{z : Φz = y_j} ‖z‖_{ℓ_1^d}.
  The previous result then gives, with the probability provided there,
  x̂_j = g′(a · x_j) a + n_j,
  with n_j estimated by
  ‖n_j‖_{ℓ_2^d} ≤ C ( K^{−1/2} σ_K( g′(a · x_j) a )_{ℓ_1^d} + max{ ‖ε_j‖_{ℓ_2^{m_Φ}}, √(log d) ‖ε_j‖_{ℓ_∞^{m_Φ}} } ).

  35–40. Some computations
  Let us estimate these quantities. By Stechkin's inequality,
  σ_K(x)_{ℓ_1^d} ≤ ‖x‖_{ℓ_q^d} K^{1 − 1/q}   for all x ∈ R^d,
  one obtains, for x_j = g′(a · x_j) a,
  K^{−1/2} σ_K(x_j)_{ℓ_1^d} ≤ |g′(a · x_j)| · ‖a‖_{ℓ_q^d} · K^{1/2 − 1/q} ≤ C_1 C_2 ( m_Φ / log(d/m_Φ) )^{1/2 − 1/q}.
  Moreover,
  ‖ε_j‖_{ℓ_∞^{m_Φ}} = (ε/2) max_{i=1,...,m_Φ} | φ_i^T ∇²f(ζ_{ij}) φ_i |
   ≤ ( ε / (2 m_Φ) ) max_{i=1,...,m_Φ} | Σ_{k,l=1}^d a_k a_l g″(a · ζ_{ij}) |
   ≤ ( ε ‖g″‖_∞ / (2 m_Φ) ) ( Σ_{k=1}^d |a_k| )² ≤ ( ε ‖g″‖_∞ / (2 m_Φ) ) ( Σ_{k=1}^d |a_k|^q )^{2/q} ≤ ( C_1² C_2 / (2 m_Φ) ) ε,
  ‖ε_j‖_{ℓ_2^{m_Φ}} ≤ √m_Φ ‖ε_j‖_{ℓ_∞^{m_Φ}} ≤ ( C_1² C_2 / (2 √m_Φ) ) ε,
  leading to
  max{ ‖ε_j‖_{ℓ_2^{m_Φ}}, √(log d) ‖ε_j‖_{ℓ_∞^{m_Φ}} } ≤ ( C_1² C_2 / (2 √m_Φ) ) ε · max{ 1, √(log d / m_Φ) } = ( C_1² C_2 / (2 √m_Φ) ) ε.

  41–42. Summarizing ...
  With high probability,
  x̂_j = g′(a · x_j) a + n_j,
  where
  ‖n_j‖_{ℓ_2^d} ≤ C ( K^{−1/2} σ_K( g′(a · x_j) a )_{ℓ_1^d} + max{ ‖ε_j‖_{ℓ_2^{m_Φ}}, √(log d) ‖ε_j‖_{ℓ_∞^{m_Φ}} } )
   ≤ C′ [ ( m_Φ / log(d/m_Φ) )^{1/2 − 1/q} + ε / √m_Φ ] =: ν_1.

  43–44. Stability of one-dimensional subspaces
  Lemma. Let us fix x̂ ∈ R^d, a ∈ S^{d−1}, 0 ≠ γ ∈ R, and n ∈ R^d with norm ‖n‖_{ℓ_2^d} ≤ ν_1 < |γ|. If we assume x̂ = γ a + n, then
  ‖ sign(γ) x̂ / ‖x̂‖_{ℓ_2^d} − a ‖_{ℓ_2^d} ≤ 2 ν_1 / ‖x̂‖_{ℓ_2^d}.
  We recall that x̂_j = g′(a · x_j) a + n_j, and
  max_j ‖x̂_j‖_{ℓ_2^d} ≥ max_j |g′(a · x_j)| − max_j ‖x̂_j − x_j‖_{ℓ_2^d} ≥ max_j |g′(a · x_j)| − ν_1,
  so it remains to estimate max_j |g′(a · x_j)| from below.
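
A quick numerical sanity check of the lemma; the dimension, γ and the noise level ν_1 below are arbitrary illustrative choices of ours.

```python
import numpy as np

# Check that || sign(gamma) xhat/||xhat|| - a || <= 2 nu_1 / ||xhat|| for xhat = gamma*a + n.
rng = np.random.default_rng(1)
d, gamma, nu_1 = 100, -2.5, 0.4
a = rng.standard_normal(d)
a /= np.linalg.norm(a)                      # a on the sphere S^{d-1}
n = rng.standard_normal(d)
n *= nu_1 / np.linalg.norm(n)               # ||n||_2 = nu_1 < |gamma|
xhat = gamma * a + n
lhs = np.linalg.norm(np.sign(gamma) * xhat / np.linalg.norm(xhat) - a)
rhs = 2 * nu_1 / np.linalg.norm(xhat)
print(lhs, "<=", rhs)                       # the lemma guarantees lhs <= rhs
```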

  45–46. Concentration inequalities I
  Lemma (Hoeffding's inequality). Let X_1, ..., X_m be independent random variables. Assume that the X_j are almost surely bounded, i.e., there exist finite scalars a_j, b_j such that P{ X_j − E X_j ∈ [a_j, b_j] } = 1 for j = 1, ..., m. Then we have
  P{ | Σ_{j=1}^m X_j − Σ_{j=1}^m E X_j | ≥ t } ≤ 2 exp( −2 t² / Σ_{j=1}^m (b_j − a_j)² ).
  Let us now apply Hoeffding's inequality to the random variables X_j = |g′(a · x_j)|².

  47. Probabilistic estimates from below
  By applying Hoeffding's inequality to the random variables X_j = |g′(a · x_j)|², we have:
  Lemma. Let us fix 0 < s < 1. Then with probability at least 1 − 2 e^{−2 m_X s² α² / C_2⁴} we have
  max_{j=1,...,m_X} |g′(a · x_j)| ≥ √( α(1 − s) ),
  where α := E_x( |g′(a · x)|² ) = ∫_{S^{d−1}} |g′(a · x)|² dμ_{S^{d−1}}(x) = ∫_{S^{d−1}} ‖∇f(x)‖²_{ℓ_2^d} dμ_{S^{d−1}}(x) > 0.

  48. Algorithm 1 (recalled from 23. above).

  49. Recovery result (recalled from 24. above).

  50–52. Concentration of measure phenomenon and risk of intractability
  A key role is played by
  α = ∫_{S^{d−1}} |g′(a · x)|² dμ_{S^{d−1}}(x).
  Due to symmetry, it is independent of a. Using the push-forward measure μ_1 on [−1, 1],
  α = ∫_{−1}^{1} |g′(y)|² dμ_1(y) = ( Γ(d/2) / ( π^{1/2} Γ((d−1)/2) ) ) ∫_{−1}^{1} |g′(y)|² (1 − y²)^{(d−3)/2} dy.
  μ_1 concentrates around zero exponentially fast as d → ∞.
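
The push-forward formula can be evaluated numerically. The sketch below is our own illustration, with the profile g(y) = y³ chosen so that g′(0) = g″(0) = 0 (M = 2); it shows α(d) decaying polynomially in d, in line with the Proposition on the next slide.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gammaln

# Evaluate alpha(d) = Gamma(d/2)/(sqrt(pi) Gamma((d-1)/2))
#                     * int_{-1}^{1} |g'(y)|^2 (1 - y^2)^{(d-3)/2} dy.
def alpha(d, g_prime):
    log_norm = gammaln(d / 2) - gammaln((d - 1) / 2) - 0.5 * np.log(np.pi)
    integrand = lambda y: g_prime(y) ** 2 * (1.0 - y ** 2) ** ((d - 3) / 2)
    value, _ = quad(integrand, -1.0, 1.0)
    return np.exp(log_norm) * value

g_prime = lambda y: 3.0 * y ** 2      # g(y) = y^3, so g'(0) = g''(0) = 0
for d in (10, 100, 1000):
    print(d, alpha(d, g_prime))       # decays roughly like d^{-2}
```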

  53. Dependence on the dimension d
  Proposition. Let us fix M ∈ N and assume that g : [−1, 1] → R is C^{M+2}-differentiable in an open neighbourhood U of 0 and (d^ℓ/dx^ℓ) g(0) = 0 for ℓ = 1, ..., M. Then α(d) = O(d^{−M}) for d → ∞.

  54–56. Tractability classes
  (1) For 0 < q ≤ 1, C_1 > 1 and C_2 ≥ α_0 > 0, we define
  F¹_d := F¹_d(α_0, q, C_1, C_2) := { f : B_{R^d} → R : ∃ a ∈ R^d, ‖a‖_{ℓ_2^d} = 1, ‖a‖_{ℓ_q^d} ≤ C_1, and ∃ g ∈ C²(B_R), |g′(0)| ≥ α_0 > 0 : f(x) = g(a · x) }.
  (2) For a neighborhood U of 0, 0 < q ≤ 1, C_1 > 1, C_2 ≥ α_0 > 0 and N ≥ 2, we define
  F²_d := F²_d(U, α_0, q, C_1, C_2, N) := { f : B_{R^d} → R : ∃ a ∈ R^d, ‖a‖_{ℓ_2^d} = 1, ‖a‖_{ℓ_q^d} ≤ C_1, and ∃ g ∈ C²(B_R) ∩ C^N(U), ∃ 0 ≤ M ≤ N − 1, |g^{(M)}(0)| ≥ α_0 > 0 : f(x) = g(a · x) }.
  (3) For a neighborhood U of 0, 0 < q ≤ 1, C_1 > 1 and C_2 ≥ α_0 > 0, we define
  F³_d := F³_d(U, α_0, q, C_1, C_2) := { f : B_{R^d} → R : ∃ a ∈ R^d, ‖a‖_{ℓ_2^d} = 1, ‖a‖_{ℓ_q^d} ≤ C_1, and ∃ g ∈ C²(B_R) ∩ C^∞(U), |g^{(M)}(0)| = 0 for all M ∈ N : f(x) = g(a · x) }.

  57. Tractability result
  Corollary. The problem of learning functions f in the classes F¹_d and F²_d from point evaluations is strongly polynomially tractable (no polynomial dependence on d) and polynomially tractable (with polynomial dependence on d), respectively.

  58–60. Intractability
  On the one hand, let us notice that if in the class F³_d we remove the condition ‖a‖_{ℓ_q^d} ≤ C_1, then the problem actually becomes intractable.
  Let g ∈ C²([−1 − ε̄, 1 + ε̄]) be given by g(y) = 8 (y − 1/2)³ for y ∈ [1/2, 1 + ε̄] and zero otherwise.
  Notice that, for every a ∈ R^d with ‖a‖_{ℓ_2^d} = 1, the function f(x) = g(a · x) vanishes everywhere on S^{d−1} outside of the cap
  U(a, 1/2) := { x ∈ S^{d−1} : a · x ≥ 1/2 }.
  Figure: The function g and the spherical cap U(a, 1/2).

  61–62. Intractability
  The μ_{S^{d−1}} measure of U(a, 1/2) obviously does not depend on a and is known to be exponentially small in d. Furthermore, it is known that there are a constant c > 0 and unit vectors a_1, ..., a_K such that the sets U(a_1, 1/2), ..., U(a_K, 1/2) are mutually disjoint and K ≥ e^{cd}. Finally, we observe that max_{x ∈ S^{d−1}} |f(x)| = f(a) = g(1) = 1.
  We conclude that any algorithm making use only of the structure f(x) = g(a · x) and of the conditions above needs exponentially many sampling points in order to distinguish between f(x) ≡ 0 and f(x) = g(a_i · x) for some of the a_i's constructed above.

  63–65. Truly k-ridge functions for k ≫ 1
  f(x) = g(Ax), where A is a k × d matrix.
  Rows of A are compressible: max_i ‖a_i‖_q ≤ C_1.
  AA^T is the identity operator on R^k.
  Regularity condition: sup_{|α| ≤ 2} ‖D^α g‖_∞ ≤ C_2.
  The matrix H_f := ∫_{S^{d−1}} ∇f(x) ∇f(x)^T dμ_{S^{d−1}}(x) is a positive semi-definite matrix of rank k. We assume that the singular values of H_f satisfy σ_1(H_f) ≥ ⋯ ≥ σ_k(H_f) ≥ α > 0.

  66–69. Sensitivity analysis for the k-ridge case (recalled from 9.–12. above)
  As before, the directional derivatives are approximated by the first-order differences (*) at points x_j drawn at random and along Bernoulli directions φ_i, and are collected in matrix form as Φ X = Y + E with X = ( A^T ∇g(Ax_1) | ⋯ | A^T ∇g(Ax_{m_X}) ).

  70. Algorithm 2:
  ◮ Given m_Φ, m_X, draw at random the sets Φ and X, and construct Y according to (*).
  ◮ Set x̂_j = ∆(y_j) := argmin_{z : Φz = y_j} ‖z‖_{ℓ_1^d} for j = 1, ..., m_X, so that X̂ = ( x̂_1 | ⋯ | x̂_{m_X} ) is again a d × m_X matrix.
  ◮ Compute the singular value decomposition of
  X̂^T = ( Û_1  Û_2 ) ( Σ̂_1  0 ; 0  Σ̂_2 ) ( V̂_1^T ; V̂_2^T ),
  where Σ̂_1 contains the k largest singular values.
  ◮ Set Â = V̂_1^T.
  ◮ Define ĝ(y) := f(Â^T y) and f̂(x) := ĝ(Âx).
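
A minimal Python sketch of Algorithm 2, reusing the hypothetical l1_decode helper from the Algorithm 1 sketch; again this only illustrates the listed steps under our own naming choices, with k assumed known.

```python
import numpy as np

# Sketch of Algorithm 2: l1-decode every column of Y, then take the top-k right
# singular vectors of Xhat^T as an estimate of the row span of A.
def algorithm_2(f, Phi, Y, k):
    Xhat = np.column_stack([l1_decode(Phi, Y[:, j]) for j in range(Y.shape[1])])
    # SVD of Xhat^T: the rows of Vt are its right singular vectors
    _, S, Vt = np.linalg.svd(Xhat.T, full_matrices=False)
    A_hat = Vt[:k]                          # k x d, plays the role of A_hat = V_1^T
    g_hat = lambda y: f(A_hat.T @ y)        # profile g_hat : R^k -> R
    f_hat = lambda x: g_hat(A_hat @ x)
    return A_hat, f_hat
```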

  71–72. The control of the error
  The quality of the final approximation of f by means of f̂ depends on two kinds of accuracy:
  1. the error between X̂ and X, which can be controlled through the number of compressed sensing measurements m_Φ;
  2. the stability of the span of V_1, characterized by how well the singular values of X (or, equivalently, of G) are separated from 0, which is related to the number of random samples m_X.
  To be precise, we have:

  73. Recovery result
  Theorem (F., Schnass, and Vybíral). Let 0 < s < 1 and log d ≤ m_Φ ≤ [log 6]² d. Then there is a constant c′_1 such that, using m_X · (m_Φ + 1) function evaluations of f, Algorithm 2 defines a function f̂ : B_{R^d}(1 + ε̄) → R that, with probability at least
  1 − ( e^{−c′_1 m_Φ} + e^{−√(m_Φ d)} + k e^{−m_X α s² / (2 k C_2²)} ),
  will satisfy
  ‖f − f̂‖_∞ ≤ 2 C_2 √k (1 + ε̄) ν_2 / ( √(α(1 − s)) − ν_2 ),
  where
  ν_2 = C [ k^{1/q} ( m_Φ / log(d/m_Φ) )^{1/2 − 1/q} + ε k² / √m_Φ ]
  and C depends only on C_1 and C_2.

  74–76. Ingredients of the proof
  ◮ compressed sensing;
  ◮ stability of the SVD;
  ◮ concentration inequalities (Chernoff bounds for sums of positive semi-definite matrices).

  77. Compressed sensing
  Corollary (after Wojtaszczyk, 2011). Let log d ≤ m_Φ < [log 6]² d. Then with probability
  1 − ( e^{−c′_1 m_Φ} + e^{−√(m_Φ d)} )
  the matrix X̂ as calculated in Algorithm 2 satisfies
  ‖X − X̂‖_F ≤ C √m_X [ k^{1/q} ( m_Φ / log(d/m_Φ) )^{1/2 − 1/q} + ε k² / √m_Φ ],
  where C depends only on C_1 and C_2.

  78. Stability of SVD
  Given two matrices B and B̂ with corresponding singular value decompositions
  B = ( U_1  U_2 ) ( Σ_1  0 ; 0  Σ_2 ) ( V_1^T ; V_2^T )
  and
  B̂ = ( Û_1  Û_2 ) ( Σ̂_1  0 ; 0  Σ̂_2 ) ( V̂_1^T ; V̂_2^T ),
  we have:

  79. Wedin's bound
  Theorem (Stability of subspaces). If there is an ᾱ > 0 such that
  min_{ℓ, ℓ̂} | σ_ℓ̂(Σ̂_1) − σ_ℓ(Σ_2) | ≥ ᾱ   and   min_ℓ̂ | σ_ℓ̂(Σ̂_1) | ≥ ᾱ,
  then
  ‖ V_1 V_1^T − V̂_1 V̂_1^T ‖_F ≤ (2/ᾱ) ‖ B − B̂ ‖_F.
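
A small numerical check of Wedin's bound on an exactly rank-k matrix and a perturbation of it; the sizes, the rank and the noise level below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, k = 20, 50, 3
B = rng.standard_normal((m, k)) @ rng.standard_normal((k, d))   # exactly rank k
Bhat = B + 1e-3 * rng.standard_normal((m, d))                    # perturbed copy

def right_singular_projector(M, k):
    """Orthogonal projector onto the span of the top-k right singular vectors of M."""
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    V1 = Vt[:k].T
    return V1 @ V1.T

sigma_hat = np.linalg.svd(Bhat, compute_uv=False)
alpha_bar = sigma_hat[k - 1]        # Sigma_2 of B is zero here, so the gap is sigma_k(Bhat)
lhs = np.linalg.norm(right_singular_projector(B, k) - right_singular_projector(Bhat, k), "fro")
rhs = (2.0 / alpha_bar) * np.linalg.norm(B - Bhat, "fro")
print(lhs, "<=", rhs)               # Wedin's bound guarantees lhs <= rhs
```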
