
Lectures on learning theory
Gábor Lugosi
ICREA and Pompeu Fabra University, Barcelona

What is learning theory? A mathematical theory to understand the behavior of learning algorithms and assist their design.


  1. johnson-lindenstrauss Suppose $A = \{a_1, \dots, a_n\} \subset \mathbb{R}^D$ is a finite set, $D$ is large. We would like to embed $A$ in $\mathbb{R}^d$ where $d \ll D$. Is this possible? In what sense? Given $\varepsilon > 0$, a function $f : \mathbb{R}^D \to \mathbb{R}^d$ is an $\varepsilon$-isometry if for all $a, a' \in A$,
$$(1-\varepsilon)\,\|a - a'\|^2 \le \|f(a) - f(a')\|^2 \le (1+\varepsilon)\,\|a - a'\|^2 .$$
Johnson-Lindenstrauss lemma: If $d \ge (c/\varepsilon^2) \log n$, then there exists an $\varepsilon$-isometry $f : \mathbb{R}^D \to \mathbb{R}^d$. Independent of $D$!

  2.–5. random projections We take $f$ to be linear. How? At random! Let $f = (W_{i,j})_{d \times D}$ with
$$W_{i,j} = \frac{1}{\sqrt{d}}\, X_{i,j} ,$$
where the $X_{i,j}$ are independent standard normal. For any $a = (\alpha_1, \dots, \alpha_D) \in \mathbb{R}^D$,
$$\mathbb{E}\,\|f(a)\|^2 = \frac{1}{d} \sum_{i=1}^{d} \sum_{j=1}^{D} \alpha_j^2\, \mathbb{E} X_{i,j}^2 = \|a\|^2 .$$
The expected squared distances are preserved! $\|f(a)\|^2 / \|a\|^2$ is a weighted sum of squared normals.

  6. random projections Let $b = a_i - a_j$ for some $a_i, a_j \in A$. Then
$$\mathbb{P}\left\{ \exists b : \left| \frac{\|f(b)\|^2}{\|b\|^2} - 1 \right| > \sqrt{\frac{8 \log(n/\delta)}{d}} + \frac{8 \log(n/\delta)}{d} \right\}
\le \binom{n}{2}\, \mathbb{P}\left\{ \left| \frac{\|f(b)\|^2}{\|b\|^2} - 1 \right| > \sqrt{\frac{8 \log(n/\delta)}{d}} + \frac{8 \log(n/\delta)}{d} \right\}
\le \delta \quad \text{(by a Bernstein-type inequality)} .$$
If $d \ge (c/\varepsilon^2) \log(n/\delta)$, then
$$\sqrt{\frac{8 \log(n/\delta)}{d}} + \frac{8 \log(n/\delta)}{d} \le \varepsilon$$
and $f$ is an $\varepsilon$-isometry with probability $\ge 1 - \delta$.
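
A minimal numerical sketch of this construction (not from the slides; the values of $n$, $D$, $\varepsilon$, $\delta$ and the constant used in place of $c$ are arbitrary choices): it draws the Gaussian matrix $W$, projects a random point set, and reports the worst relative distortion of the pairwise squared distances.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, eps, delta = 50, 10_000, 0.2, 0.05
d = int(np.ceil(8 * np.log(n / delta) / eps**2))   # 8 stands in for the constant c

A = rng.normal(size=(n, D))                 # the point set a_1, ..., a_n
W = rng.normal(size=(d, D)) / np.sqrt(d)    # W_{i,j} = X_{i,j} / sqrt(d)
FA = A @ W.T                                # rows are f(a_1), ..., f(a_n)

# compare all pairwise squared distances before and after projection
i, j = np.triu_indices(n, k=1)
orig = ((A[i] - A[j]) ** 2).sum(axis=1)
proj = ((FA[i] - FA[j]) ** 2).sum(axis=1)
print(f"d = {d}, worst |ratio - 1| over all pairs = {np.abs(proj / orig - 1).max():.3f}")
```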

  7.–9. martingale representation $X_1, \dots, X_n$ are independent random variables taking values in some set $\mathcal{X}$. Let $f : \mathcal{X}^n \to \mathbb{R}$ and $Z = f(X_1, \dots, X_n)$. Denote $\mathbb{E}_i[\cdot] = \mathbb{E}[\cdot \mid X_1, \dots, X_i]$. Thus, $\mathbb{E}_0 Z = \mathbb{E} Z$ and $\mathbb{E}_n Z = Z$. Writing $\Delta_i = \mathbb{E}_i Z - \mathbb{E}_{i-1} Z$, we have
$$Z - \mathbb{E} Z = \sum_{i=1}^{n} \Delta_i .$$
This is the Doob martingale representation of $Z$. (photo: Joseph Leo Doob, 1910–2004)
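
As a sanity check of the identity above, the short script below (a hypothetical toy example, not from the slides) takes $Z$ to be the longest run of heads in $n$ fair coin flips, computes the conditional expectations $\mathbb{E}_i Z$ by exhaustive enumeration, and verifies $Z - \mathbb{E}Z = \sum_i \Delta_i$ outcome by outcome.

```python
import itertools
import numpy as np

n = 8

def f(x):                       # Z = longest run of 1s in x (arbitrary example function)
    best = run = 0
    for b in x:
        run = run + 1 if b else 0
        best = max(best, run)
    return best

outcomes = list(itertools.product([0, 1], repeat=n))
Z = {x: f(x) for x in outcomes}

def cond_exp(prefix):           # E[Z | X_1..X_i = prefix] under uniform coin flips
    tails = itertools.product([0, 1], repeat=n - len(prefix))
    return np.mean([Z[prefix + t] for t in tails])

EZ = cond_exp(())
for x in outcomes:              # check the decomposition on every outcome
    deltas = [cond_exp(x[: i + 1]) - cond_exp(x[:i]) for i in range(n)]
    assert abs((Z[x] - EZ) - sum(deltas)) < 1e-12
print("Doob decomposition Z - EZ = sum_i Delta_i verified on all 2^n outcomes.")
```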

  10.–11. martingale representation: the variance
$$\mathrm{Var}(Z) = \mathbb{E}\left[ \left( \sum_{i=1}^{n} \Delta_i \right)^2 \right] = \mathbb{E} \sum_{i=1}^{n} \Delta_i^2 + 2\, \mathbb{E} \sum_{j > i} \Delta_i \Delta_j .$$
Now if $j > i$, $\mathbb{E}_i \Delta_j = 0$, so $\mathbb{E}_i[\Delta_i \Delta_j] = \Delta_i\, \mathbb{E}_i \Delta_j = 0$. We obtain
$$\mathrm{Var}(Z) = \mathbb{E}\left[ \left( \sum_{i=1}^{n} \Delta_i \right)^2 \right] = \mathbb{E} \sum_{i=1}^{n} \Delta_i^2 .$$
From this, using independence, it is easy to derive the Efron-Stein inequality.

  12.–13. efron-stein inequality (1981) Let $X_1, \dots, X_n$ be independent random variables taking values in $\mathcal{X}$. Let $f : \mathcal{X}^n \to \mathbb{R}$ and $Z = f(X_1, \dots, X_n)$. Then
$$\mathrm{Var}(Z) \le \mathbb{E} \sum_{i=1}^{n} \left( Z - \mathbb{E}^{(i)} Z \right)^2 = \mathbb{E} \sum_{i=1}^{n} \mathrm{Var}^{(i)}(Z) ,$$
where $\mathbb{E}^{(i)} Z$ is expectation with respect to the $i$-th variable $X_i$ only. We obtain more useful forms by using that
$$\mathrm{Var}(X) = \tfrac{1}{2}\, \mathbb{E}(X - X')^2 \qquad \text{and} \qquad \mathrm{Var}(X) \le \mathbb{E}(X - a)^2 \ \text{ for any constant } a .$$

  14.–15. efron-stein inequality (1981) If $X'_1, \dots, X'_n$ are independent copies of $X_1, \dots, X_n$, and $Z'_i = f(X_1, \dots, X_{i-1}, X'_i, X_{i+1}, \dots, X_n)$, then
$$\mathrm{Var}(Z) \le \frac{1}{2}\, \mathbb{E}\left[ \sum_{i=1}^{n} (Z - Z'_i)^2 \right] .$$
$Z$ is concentrated if it doesn't depend too much on any of its variables. If $Z = \sum_{i=1}^{n} X_i$ then we have an equality. Sums are the "least concentrated" of all functions!
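
A quick Monte Carlo illustration of this bound (a hypothetical example, not from the slides; the choice $Z = \max_i X_i$ with uniform $X_i$ and the sample sizes are arbitrary): both sides are estimated by simulation, and the variance should stay below the Efron-Stein estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 20, 20_000
f = lambda x: x.max(axis=-1)         # Z = max of the coordinates (arbitrary example)

X = rng.uniform(size=(reps, n))
Xp = rng.uniform(size=(reps, n))     # independent copies X'_1, ..., X'_n
Z = f(X)

es_sum = np.zeros(reps)
for i in range(n):                   # replace coordinate i by its independent copy
    Xi = X.copy()
    Xi[:, i] = Xp[:, i]
    es_sum += (Z - f(Xi)) ** 2

print(f"Var(Z)                 ~ {Z.var():.5f}")
print(f"(1/2) E sum (Z-Z'_i)^2 ~ {0.5 * es_sum.mean():.5f}   (upper bound)")
```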

  16. efron-stein inequality (1981) If for some arbitrary functions $f_i$, $Z_i = f_i(X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n)$, then
$$\mathrm{Var}(Z) \le \mathbb{E}\left[ \sum_{i=1}^{n} (Z - Z_i)^2 \right] .$$

  17. efron, stein, and steele (photos: Bradley Efron, Charles Stein, Mike Steele)

  18.–19. example: uniform deviations Let $\mathcal{A}$ be a collection of subsets of $\mathcal{X}$, and let $X_1, \dots, X_n$ be $n$ random points in $\mathcal{X}$ drawn i.i.d. Let
$$P(A) = \mathbb{P}\{X_1 \in A\} \qquad \text{and} \qquad P_n(A) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{X_i \in A} .$$
If $Z = \sup_{A \in \mathcal{A}} |P(A) - P_n(A)|$, then
$$\mathrm{Var}(Z) \le \frac{1}{2n} ,$$
regardless of the distribution and the richness of $\mathcal{A}$.
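
To see this bound in action, here is a small simulation (assumptions, not from the slides: $\mathcal{A}$ is taken to be the half-lines $(-\infty, s]$ and the $X_i$ are uniform on $[0,1]$, so $Z$ is the Kolmogorov-Smirnov statistic and can be computed from the order statistics).

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 100, 20_000

# A = half-lines (-inf, s]; with X_i uniform on [0,1], P(A) = s and
# Z = sup_s |s - P_n((-inf, s])| is the Kolmogorov-Smirnov statistic.
U = np.sort(rng.uniform(size=(reps, n)), axis=1)
grid = np.arange(1, n + 1) / n
Z = np.maximum(grid - U, U - (grid - 1 / n)).max(axis=1)

print(f"Var(Z) ~ {Z.var():.2e},   bound 1/(2n) = {1 / (2 * n):.2e}")
```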

  20.–21. example: kernel density estimation Let $X_1, \dots, X_n$ be i.i.d. real samples drawn according to some density $\phi$. The kernel density estimate is
$$\phi_n(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left( \frac{x - X_i}{h} \right) ,$$
where $h > 0$ and $K$ is a nonnegative "kernel" with $\int K = 1$. The $L_1$ error is
$$Z = f(X_1, \dots, X_n) = \int |\phi(x) - \phi_n(x)|\, dx .$$
It is easy to see that
$$|f(x_1, \dots, x_n) - f(x_1, \dots, x'_i, \dots, x_n)| \le \frac{1}{nh} \int \left| K\!\left( \frac{x - x_i}{h} \right) - K\!\left( \frac{x - x'_i}{h} \right) \right| dx \le \frac{2}{n} ,$$
so we get $\mathrm{Var}(Z) \le \frac{2}{n}$.
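
A numerical sketch of this example (assumptions, not from the slides: standard normal data, a Gaussian kernel, a fixed bandwidth, and a finite grid standing in for the integral over $\mathbb{R}$): it estimates $\mathrm{Var}(Z)$ by simulation and compares it with the $2/n$ bound.

```python
import numpy as np

rng = np.random.default_rng(3)
n, h, reps = 200, 0.3, 1_000
xs = np.linspace(-6, 6, 1201)                          # grid approximating the integral
dx = xs[1] - xs[0]
phi = np.exp(-xs**2 / 2) / np.sqrt(2 * np.pi)          # true density: standard normal
K = lambda u: np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)   # Gaussian kernel, integrates to 1

def l1_error(sample):
    phi_n = K((xs[None, :] - sample[:, None]) / h).mean(axis=0) / h
    return np.abs(phi - phi_n).sum() * dx

Z = np.array([l1_error(rng.normal(size=n)) for _ in range(reps)])
print(f"Var(Z) ~ {Z.var():.2e},   Efron-Stein bound 2/n = {2 / n:.2e}")
```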

  22.–23. bounding the expectation Let $P'_n(A) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{X'_i \in A}$ and let $\mathbb{E}'$ denote expectation only with respect to $X'_1, \dots, X'_n$. Then
$$\mathbb{E} \sup_{A \in \mathcal{A}} |P_n(A) - P(A)| = \mathbb{E} \sup_{A \in \mathcal{A}} \left| \mathbb{E}'\left[ P_n(A) - P'_n(A) \right] \right| \le \mathbb{E} \sup_{A \in \mathcal{A}} |P_n(A) - P'_n(A)| = \frac{1}{n}\, \mathbb{E} \sup_{A \in \mathcal{A}} \left| \sum_{i=1}^{n} \left( \mathbb{1}_{X_i \in A} - \mathbb{1}_{X'_i \in A} \right) \right| .$$
Second symmetrization: if $\varepsilon_1, \dots, \varepsilon_n$ are independent Rademacher variables, then the last quantity equals
$$\frac{1}{n}\, \mathbb{E} \sup_{A \in \mathcal{A}} \left| \sum_{i=1}^{n} \varepsilon_i \left( \mathbb{1}_{X_i \in A} - \mathbb{1}_{X'_i \in A} \right) \right| \le \frac{2}{n}\, \mathbb{E} \sup_{A \in \mathcal{A}} \left| \sum_{i=1}^{n} \varepsilon_i\, \mathbb{1}_{X_i \in A} \right| .$$

  24.–25. conditional rademacher average If
$$R_n = \mathbb{E}_{\varepsilon} \sup_{A \in \mathcal{A}} \left| \sum_{i=1}^{n} \varepsilon_i\, \mathbb{1}_{X_i \in A} \right| ,$$
then
$$\mathbb{E} \sup_{A \in \mathcal{A}} |P_n(A) - P(A)| \le \frac{2}{n}\, \mathbb{E}\, R_n .$$
$R_n$ is a data-dependent quantity!
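
The script below approximates $R_n$ under an assumed setting (not from the slides): $\mathcal{A}$ is the class of half-lines $(-\infty, s]$, so the sets $\{X_1,\dots,X_n\} \cap A$ are prefixes of the sorted sample, the supremum is a maximum over partial sums, and for this particular class $R_n$ happens not to depend on the observed values at all. The expectation over the signs is approximated by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(4)
n, sign_draws = 100, 5_000

# For A = half-lines (-inf, s], the traces {X_1,...,X_n} ∩ A are the prefixes of
# the sorted sample; R_n is then a max over absolute partial sums of the signs
# and, for this class, depends on the data only through n.
eps = rng.choice([-1, 1], size=(sign_draws, n))        # Rademacher signs
R_n = np.abs(np.cumsum(eps, axis=1)).max(axis=1).mean()

print(f"R_n ~ {R_n:.1f},  data-dependent bound (2/n) R_n = {2 * R_n / n:.3f}")
```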

  26.–28. concentration of conditional rademacher average Define
$$R_n^{(i)} = \mathbb{E}_{\varepsilon} \sup_{A \in \mathcal{A}} \left| \sum_{j \neq i} \varepsilon_j\, \mathbb{1}_{X_j \in A} \right| .$$
One can show easily that
$$0 \le R_n - R_n^{(i)} \le 1 \qquad \text{and} \qquad \sum_{i=1}^{n} \left( R_n - R_n^{(i)} \right) \le R_n .$$
By the Efron-Stein inequality,
$$\mathrm{Var}(R_n) \le \mathbb{E} \sum_{i=1}^{n} \left( R_n - R_n^{(i)} \right)^2 \le \mathbb{E}\, R_n .$$
Standard deviation is at most $\sqrt{\mathbb{E}\, R_n}$! Such functions are called self-bounding.

  29.–30. bounding the conditional rademacher average If $S(X_1^n, \mathcal{A})$ is the number of different sets of the form $\{X_1, \dots, X_n\} \cap A$, $A \in \mathcal{A}$, then $R_n$ is the maximum of $S(X_1^n, \mathcal{A})$ sub-Gaussian random variables. By the maximal inequality,
$$\frac{1}{n}\, R_n \le \sqrt{\frac{\log S(X_1^n, \mathcal{A})}{2n}} .$$
In particular,
$$\mathbb{E} \sup_{A \in \mathcal{A}} |P_n(A) - P(A)| \le 2\, \mathbb{E} \sqrt{\frac{\log S(X_1^n, \mathcal{A})}{2n}} .$$

  31.–32. random VC dimension Let $V = V(x_1^n, \mathcal{A})$ be the size of the largest subset of $\{x_1, \dots, x_n\}$ shattered by $\mathcal{A}$. By Sauer's lemma,
$$\log S(X_1^n, \mathcal{A}) \le V(X_1^n, \mathcal{A}) \log(n+1) .$$
$V$ is also self-bounding:
$$\sum_{i=1}^{n} \left( V - V^{(i)} \right)^2 \le V ,$$
so by Efron-Stein, $\mathrm{Var}(V) \le \mathbb{E}\, V$.
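
For intuition, the brute-force script below (an illustrative assumption, not from the slides: $\mathcal{A}$ is the class of closed intervals $[a,b]$ on the line, whose traces on a sorted sample are exactly the contiguous index ranges) computes $S(x_1^n, \mathcal{A})$ and $V(x_1^n, \mathcal{A})$ for a small sample and checks Sauer's lemma.

```python
import itertools
import numpy as np

n = 8   # a sample of n distinct points on the real line (only their order matters)

# Class A = closed intervals [a, b]; their traces on the sorted sample are the
# empty set and the contiguous index ranges {i, ..., j}.
traces = {frozenset()} | {frozenset(range(i, j + 1)) for i in range(n) for j in range(i, n)}
S = len(traces)

# V = size of the largest shattered subset: every one of its subsets must be
# realized as (trace ∩ subset) for some trace.
V = 0
for k in range(1, n + 1):
    for sub in map(set, itertools.combinations(range(n), k)):
        if all(any(t & sub == set(s) for t in traces)
               for r in range(k + 1) for s in itertools.combinations(sorted(sub), r)):
            V = max(V, k)

print(f"S = {S}, V = {V}, Sauer: log S = {np.log(S):.2f} <= V log(n+1) = {V * np.log(n + 1):.2f}")
```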

  33. vapnik and chervonenkis (photos: Alexey Chervonenkis, Vladimir Vapnik)

  34. beyond the variance $X_1, \dots, X_n$ are independent random variables taking values in some set $\mathcal{X}$. Let $f : \mathcal{X}^n \to \mathbb{R}$ and $Z = f(X_1, \dots, X_n)$. Recall the Doob martingale representation:
$$Z - \mathbb{E} Z = \sum_{i=1}^{n} \Delta_i \qquad \text{where } \Delta_i = \mathbb{E}_i Z - \mathbb{E}_{i-1} Z ,$$
with $\mathbb{E}_i[\cdot] = \mathbb{E}[\cdot \mid X_1, \dots, X_i]$. To get exponential inequalities, we bound the moment generating function $\mathbb{E}\, e^{\lambda (Z - \mathbb{E} Z)}$.

  35. azuma's inequality Suppose that the martingale differences are bounded: $|\Delta_i| \le c_i$. Then
$$\mathbb{E}\, e^{\lambda (Z - \mathbb{E} Z)} = \mathbb{E}\, e^{\lambda \sum_{i=1}^{n} \Delta_i} = \mathbb{E}\left[ e^{\lambda \sum_{i=1}^{n-1} \Delta_i}\, \mathbb{E}_{n-1}\, e^{\lambda \Delta_n} \right] \le \mathbb{E}\left[ e^{\lambda \sum_{i=1}^{n-1} \Delta_i} \right] e^{\lambda^2 c_n^2 / 2} \ \text{(by Hoeffding)} \ \le \cdots \le e^{\lambda^2 \left( \sum_{i=1}^{n} c_i^2 \right) / 2} .$$
This is the Azuma-Hoeffding inequality for sums of bounded martingale differences.

  36.–38. bounded differences inequality If $Z = f(X_1, \dots, X_n)$ and $f$ is such that
$$|f(x_1, \dots, x_n) - f(x_1, \dots, x'_i, \dots, x_n)| \le c_i ,$$
then the martingale differences are bounded. Bounded differences inequality: if $X_1, \dots, X_n$ are independent, then
$$\mathbb{P}\{ |Z - \mathbb{E} Z| > t \} \le 2 e^{-2 t^2 / \sum_{i=1}^{n} c_i^2} .$$
Also known as McDiarmid's inequality. (photo: Colin McDiarmid)
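
A small simulation of the bounded differences inequality (a hypothetical example, not from the slides): $Z$ is the number of distinct values among $n$ i.i.d. uniform draws from $\{1,\dots,k\}$, so changing one draw changes $Z$ by at most one and $c_i = 1$ for every $i$.

```python
import numpy as np

rng = np.random.default_rng(6)
n, k, reps, t = 100, 50, 50_000, 10

# Z = number of distinct values among n draws; c_i = 1 for every coordinate.
X = rng.integers(0, k, size=(reps, n))
Z = np.array([len(np.unique(row)) for row in X])

emp = np.mean(np.abs(Z - Z.mean()) > t)
bound = 2 * np.exp(-2 * t**2 / n)          # sum_i c_i^2 = n
print(f"P(|Z - EZ| > {t}) ~ {emp:.5f}   bounded differences bound: {bound:.4f}")
```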

  39.–40. hoeffding in a hilbert space Let $X_1, \dots, X_n$ be independent zero-mean random variables in a separable Hilbert space such that $\|X_i\| \le c/2$, and denote $v = n c^2 / 4$. Then, for all $t \ge \sqrt{v}$,
$$\mathbb{P}\left\{ \left\| \sum_{i=1}^{n} X_i \right\| > t \right\} \le e^{-(t - \sqrt{v})^2 / (2v)} .$$
Proof: By the triangle inequality, $\left\| \sum_{i=1}^{n} X_i \right\|$ has the bounded differences property with constants $c$, so
$$\mathbb{P}\left\{ \left\| \sum_{i=1}^{n} X_i \right\| > t \right\} = \mathbb{P}\left\{ \left\| \sum_{i=1}^{n} X_i \right\| - \mathbb{E}\left\| \sum_{i=1}^{n} X_i \right\| > t - \mathbb{E}\left\| \sum_{i=1}^{n} X_i \right\| \right\} \le \exp\left( - \frac{\left( t - \mathbb{E}\left\| \sum_{i=1}^{n} X_i \right\| \right)^2}{2v} \right) .$$
Also,
$$\mathbb{E}\left\| \sum_{i=1}^{n} X_i \right\| \le \sqrt{ \mathbb{E}\left\| \sum_{i=1}^{n} X_i \right\|^2 } = \sqrt{ \sum_{i=1}^{n} \mathbb{E}\, \|X_i\|^2 } \le \sqrt{v} .$$
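
A quick check by simulation (illustrative assumptions, not from the slides: the Hilbert space is $\mathbb{R}^5$ and each $X_i$ is uniform on the sphere of radius $c/2$, so it has zero mean and $\|X_i\| = c/2$).

```python
import numpy as np

rng = np.random.default_rng(7)
n, m, c, reps = 100, 5, 1.0, 10_000
v = n * c**2 / 4

# X_i uniform on the sphere of radius c/2 in R^m: zero mean, ||X_i|| = c/2.
G = rng.normal(size=(reps, n, m))
X = (c / 2) * G / np.linalg.norm(G, axis=2, keepdims=True)
norms = np.linalg.norm(X.sum(axis=1), axis=1)

t = 3 * np.sqrt(v)
emp = np.mean(norms > t)
bound = np.exp(-(t - np.sqrt(v)) ** 2 / (2 * v))
print(f"P(||sum X_i|| > {t:.1f}) ~ {emp:.5f}   bound: {bound:.4f}")
```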

  41. bounded differences inequality Easy to use. Distribution free. Often close to optimal (e.g., $L_1$ error of kernel density estimate). Does not exploit "variance information." Often too rigid. Other methods are necessary.

  42. shannon entropy If $X, Y$ are random variables taking values in a set of size $N$,
$$H(X) = - \sum_{x} p(x) \log p(x) ,$$
$$H(X \mid Y) = H(X, Y) - H(Y) = - \sum_{x, y} p(x, y) \log p(x \mid y) .$$
Then $H(X) \le \log N$ and $H(X \mid Y) \le H(X)$. (photo: Claude Shannon, 1916–2001)

  43. han's inequality If $X = (X_1, \dots, X_n)$ and $X^{(i)} = (X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n)$, then
$$\sum_{i=1}^{n} \left( H(X) - H(X^{(i)}) \right) \le H(X) .$$
Proof: $H(X) = H(X^{(i)}) + H(X_i \mid X^{(i)}) \le H(X^{(i)}) + H(X_i \mid X_1, \dots, X_{i-1})$. Since $\sum_{i=1}^{n} H(X_i \mid X_1, \dots, X_{i-1}) = H(X)$, summing the inequality, we get
$$(n-1)\, H(X) \le \sum_{i=1}^{n} H(X^{(i)}) .$$
(photo: Te Sun Han)
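
Han's inequality is easy to verify numerically; the snippet below (a hypothetical example, not from the slides) draws a random joint distribution of three variables on a small alphabet and checks $\sum_i (H(X) - H(X^{(i)})) \le H(X)$.

```python
import numpy as np

rng = np.random.default_rng(8)
n, k = 3, 4                                            # X = (X_1, X_2, X_3), alphabet {0,...,3}
p = rng.dirichlet(np.ones(k**n)).reshape((k,) * n)     # a random joint distribution

def H(q):                                              # Shannon entropy of a probability array
    q = q[q > 0]
    return -(q * np.log(q)).sum()

lhs = sum(H(p) - H(p.sum(axis=i)) for i in range(n))   # p.sum(axis=i) = law of X^(i)
print(f"sum_i [H(X) - H(X^(i))] = {lhs:.4f} <= H(X) = {H(p):.4f}")
```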

  44.–45. subadditivity of entropy The entropy of a random variable $Z \ge 0$ is
$$\mathrm{Ent}(Z) = \mathbb{E}\, \Phi(Z) - \Phi(\mathbb{E} Z) \qquad \text{where } \Phi(x) = x \log x .$$
By Jensen's inequality, $\mathrm{Ent}(Z) \ge 0$. Han's inequality implies the following sub-additivity property. Let $X_1, \dots, X_n$ be independent and let $Z = f(X_1, \dots, X_n)$, where $f \ge 0$. Denote
$$\mathrm{Ent}^{(i)}(Z) = \mathbb{E}^{(i)} \Phi(Z) - \Phi(\mathbb{E}^{(i)} Z) .$$
Then
$$\mathrm{Ent}(Z) \le \mathbb{E} \sum_{i=1}^{n} \mathrm{Ent}^{(i)}(Z) .$$

  46. a logarithmic sobolev inequality on the hypercube Let $X = (X_1, \dots, X_n)$ be uniformly distributed over $\{-1, 1\}^n$. If $f : \{-1, 1\}^n \to \mathbb{R}$ and $Z = f(X)$,
$$\mathrm{Ent}(Z^2) \le \frac{1}{2}\, \mathbb{E} \sum_{i=1}^{n} (Z - Z'_i)^2 .$$
The proof uses subadditivity of the entropy and calculus for the case $n = 1$. Implies Efron-Stein and the edge-isoperimetric inequality.

  47. herbst's argument: exponential concentration If $f : \{-1, 1\}^n \to \mathbb{R}$, the log-Sobolev inequality may be used with $g(x) = e^{\lambda f(x)/2}$ where $\lambda \in \mathbb{R}$. If $F(\lambda) = \mathbb{E}\, e^{\lambda Z}$ is the moment generating function of $Z = f(X)$,
$$\mathrm{Ent}\!\left( g(X)^2 \right) = \lambda\, \mathbb{E}\left[ Z e^{\lambda Z} \right] - \mathbb{E}\left[ e^{\lambda Z} \right] \log \mathbb{E}\left[ e^{\lambda Z} \right] = \lambda F'(\lambda) - F(\lambda) \log F(\lambda) .$$
Differential inequalities are obtained for $F(\lambda)$.

  48.–49. herbst's argument As an example, suppose $f$ is such that $\sum_{i=1}^{n} (Z - Z'_i)_+^2 \le v$. Then by the log-Sobolev inequality,
$$\lambda F'(\lambda) - F(\lambda) \log F(\lambda) \le \frac{v \lambda^2}{4}\, F(\lambda) .$$
If $G(\lambda) = \log F(\lambda)$, this becomes
$$\left( \frac{G(\lambda)}{\lambda} \right)' \le \frac{v}{4} .$$
This can be integrated: $G(\lambda) \le \lambda\, \mathbb{E} Z + \lambda^2 v / 4$, so
$$F(\lambda) \le e^{\lambda \mathbb{E} Z + \lambda^2 v / 4} .$$
This implies
$$\mathbb{P}\{ Z > \mathbb{E} Z + t \} \le e^{-t^2 / v} .$$
Stronger than the bounded differences inequality!

  50.–51. gaussian log-sobolev and concentration inequalities Let $X = (X_1, \dots, X_n)$ be a vector of i.i.d. standard normal random variables. If $f : \mathbb{R}^n \to \mathbb{R}$ and $Z = f(X)$,
$$\mathrm{Ent}(Z^2) \le 2\, \mathbb{E}\left[ \| \nabla f(X) \|^2 \right] .$$
This can be proved using the central limit theorem and the Bernoulli log-Sobolev inequality. It implies the Gaussian concentration inequality: suppose $f$ is Lipschitz: for all $x, y \in \mathbb{R}^n$, $|f(x) - f(y)| \le L \|x - y\|$. Then, for all $t > 0$,
$$\mathbb{P}\{ f(X) - \mathbb{E} f(X) \ge t \} \le e^{-t^2 / (2 L^2)} .$$

  52.–53. an application: supremum of a gaussian process Let $(X_t)_{t \in T}$ be an almost surely continuous centered Gaussian process. Let $Z = \sup_{t \in T} X_t$. If
$$\sigma^2 = \sup_{t \in T} \mathbb{E}\left[ X_t^2 \right] ,$$
then
$$\mathbb{P}\{ |Z - \mathbb{E} Z| \ge u \} \le 2 e^{-u^2 / (2 \sigma^2)} .$$
Proof: We may assume $T = \{1, \dots, n\}$. Let $\Gamma$ be the covariance matrix of $X = (X_1, \dots, X_n)$. Let $A = \Gamma^{1/2}$. If $Y$ is a standard normal vector, then
$$f(Y) = \max_{i = 1, \dots, n} (AY)_i \overset{\text{distr.}}{=} \max_{i = 1, \dots, n} X_i .$$
By Cauchy-Schwarz,
$$|(Au)_i - (Av)_i| = \left| \sum_{j} A_{i,j} (u_j - v_j) \right| \le \left( \sum_{j} A_{i,j}^2 \right)^{1/2} \| u - v \| \le \sigma\, \| u - v \| .$$
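
The following simulation (with an arbitrary covariance matrix, not from the slides) illustrates the inequality for a finite index set: it draws many copies of a correlated centered Gaussian vector, computes $Z = \max_i X_i$, and compares the empirical tail with $2 e^{-u^2/(2\sigma^2)}$.

```python
import numpy as np

rng = np.random.default_rng(9)
n, reps, u = 50, 100_000, 3.0

# A centered Gaussian vector with an arbitrary covariance Gamma = B B^T.
B = rng.normal(size=(n, n)) / np.sqrt(n)
Gamma = B @ B.T
sigma2 = np.diag(Gamma).max()                  # sigma^2 = max_t E[X_t^2]

X = rng.normal(size=(reps, n)) @ B.T           # rows distributed as N(0, Gamma)
Z = X.max(axis=1)

emp = np.mean(np.abs(Z - Z.mean()) >= u)
bound = 2 * np.exp(-u**2 / (2 * sigma2))
print(f"sigma^2 ~ {sigma2:.2f},  P(|Z - EZ| >= {u}) ~ {emp:.5f},  bound {bound:.4f}")
```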

  54. beyond bernoulli and gaussian: the entropy method For general distributions, logarithmic Sobolev inequalities are not available. Solution: modified logarithmic Sobolev inequalities. Suppose $X_1, \dots, X_n$ are independent. Let $Z = f(X_1, \dots, X_n)$ and $Z_i = f_i(X^{(i)}) = f_i(X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n)$. Let $\phi(x) = e^x - x - 1$. Then for all $\lambda \in \mathbb{R}$,
$$\lambda\, \mathbb{E}\left[ Z e^{\lambda Z} \right] - \mathbb{E}\left[ e^{\lambda Z} \right] \log \mathbb{E}\left[ e^{\lambda Z} \right] \le \sum_{i=1}^{n} \mathbb{E}\left[ e^{\lambda Z}\, \phi\!\left( -\lambda (Z - Z_i) \right) \right] .$$
(photo: Michel Ledoux)

  55.–56. the entropy method Define $Z_i = \inf_{x'_i} f(X_1, \dots, x'_i, \dots, X_n)$ and suppose
$$\sum_{i=1}^{n} (Z - Z_i)^2 \le v .$$
Then for all $t > 0$,
$$\mathbb{P}\{ Z - \mathbb{E} Z > t \} \le e^{-t^2 / (2v)} .$$
This implies the bounded differences inequality and much more.

  57. example: the largest eigenvalue of a symmetric matrix Let $A = (X_{i,j})_{n \times n}$ be symmetric, with the $X_{i,j}$, $i \le j$, independent and $|X_{i,j}| \le 1$. Let
$$Z = \lambda_1 = \sup_{u : \|u\| = 1} u^T A u ,$$
and suppose $v$ is such that $Z = v^T A v$. Let $A'_{i,j}$ be the matrix obtained by replacing $X_{i,j}$ by $x'_{i,j}$, and let $Z'_{i,j}$ be the corresponding largest eigenvalue. Then
$$(Z - Z'_{i,j})_+ \le \left( v^T A v - v^T A'_{i,j} v \right) \mathbb{1}_{Z > Z'_{i,j}} = \left( v^T (A - A'_{i,j}) v \right) \mathbb{1}_{Z > Z'_{i,j}} \le 2 \left| v_i v_j (X_{i,j} - X'_{i,j}) \right| \le 4 |v_i v_j| .$$
Therefore,
$$\sum_{1 \le i \le j \le n} (Z - Z'_{i,j})_+^2 \le \sum_{1 \le i \le j \le n} 16\, |v_i v_j|^2 \le 16 \left( \sum_{i=1}^{n} v_i^2 \right)^2 = 16 .$$
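
A simulation of this example (illustrative choices, not from the slides: the entries are uniform on $[-1,1]$ and $n = 50$): the empirical variance of $\lambda_1$ is far below the worst-case Efron-Stein bound of 16 implied by the computation above.

```python
import numpy as np

rng = np.random.default_rng(10)
n, reps = 50, 2_000

def lambda1():
    U = rng.uniform(-1, 1, size=(n, n))        # entries bounded by 1 in absolute value
    A = np.triu(U) + np.triu(U, 1).T           # symmetric, X_{i,j} (i <= j) independent
    return np.linalg.eigvalsh(A)[-1]           # largest eigenvalue

Z = np.array([lambda1() for _ in range(reps)])
print(f"E Z ~ {Z.mean():.2f},  Var(Z) ~ {Z.var():.3f}   (Efron-Stein bound: 16)")
```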

  58. example: convex lipschitz functions Let $f : [0, 1]^n \to \mathbb{R}$ be a convex function. Let $Z_i = \inf_{x'_i} f(X_1, \dots, x'_i, \dots, X_n)$ and let $X'_i$ be the value of $x'_i$ for which the minimum is achieved. Then, writing $X^{(i)} = (X_1, \dots, X_{i-1}, X'_i, X_{i+1}, \dots, X_n)$,
$$\sum_{i=1}^{n} (Z - Z_i)^2 = \sum_{i=1}^{n} \left( f(X) - f(X^{(i)}) \right)^2 \le \sum_{i=1}^{n} \left( \frac{\partial f}{\partial x_i}(X) \right)^2 (X_i - X'_i)^2 \ \text{(by convexity)} \ \le \sum_{i=1}^{n} \left( \frac{\partial f}{\partial x_i}(X) \right)^2 = \| \nabla f(X) \|^2 \le L^2 .$$

  59.–61. self-bounding functions Suppose $Z$ satisfies
$$0 \le Z - Z_i \le 1 \qquad \text{and} \qquad \sum_{i=1}^{n} (Z - Z_i) \le Z .$$
Recall that $\mathrm{Var}(Z) \le \mathbb{E} Z$. We have much more:
$$\mathbb{P}\{ Z > \mathbb{E} Z + t \} \le e^{-t^2 / (2 \mathbb{E} Z + 2t/3)} \qquad \text{and} \qquad \mathbb{P}\{ Z < \mathbb{E} Z - t \} \le e^{-t^2 / (2 \mathbb{E} Z)} .$$
Rademacher averages and the random VC dimension are self-bounding. Configuration functions.

  62.–64. weakly self-bounding functions $f : \mathcal{X}^n \to [0, \infty)$ is weakly $(a, b)$-self-bounding if there exist $f_i : \mathcal{X}^{n-1} \to [0, \infty)$ such that for all $x \in \mathcal{X}^n$,
$$\sum_{i=1}^{n} \left( f(x) - f_i(x^{(i)}) \right)^2 \le a f(x) + b .$$
Then
$$\mathbb{P}\{ Z \ge \mathbb{E} Z + t \} \le \exp\left( - \frac{t^2}{2 \left( a \mathbb{E} Z + b + a t / 2 \right)} \right) .$$
If, in addition, $f(x) - f_i(x^{(i)}) \le 1$, then for $0 < t \le \mathbb{E} Z$,
$$\mathbb{P}\{ Z \le \mathbb{E} Z - t \} \le \exp\left( - \frac{t^2}{2 \left( a \mathbb{E} Z + b + c_-\, t \right)} \right) ,$$
where $c = (3a - 1)/6$ and $c_- = \max(-c, 0)$.

  65.–67. the isoperimetric view Let $X = (X_1, \dots, X_n)$ have independent components, taking values in $\mathcal{X}^n$. Let $A \subset \mathcal{X}^n$. The Hamming distance of $X$ to $A$ is
$$d(X, A) = \min_{y \in A} d(X, y) = \min_{y \in A} \sum_{i=1}^{n} \mathbb{1}_{X_i \neq y_i} .$$
Then
$$\mathbb{P}\left\{ d(X, A) \ge t + \sqrt{ \frac{n}{2} \log \frac{1}{\mathbb{P}[A]} } \right\} \le e^{-2 t^2 / n} .$$
Concentration of measure! (photo: Michel Talagrand)

  68. the isoperimetric view Proof: By the bounded differences inequality,
$$\mathbb{P}\{ \mathbb{E}\, d(X, A) - d(X, A) \ge t \} \le e^{-2 t^2 / n} .$$
Taking $t = \mathbb{E}\, d(X, A)$ (and noting that $d(X, A) = 0$ when $X \in A$), we get
$$\mathbb{E}\, d(X, A) \le \sqrt{ \frac{n}{2} \log \frac{1}{\mathbb{P}\{A\}} } .$$
By the bounded differences inequality again,
$$\mathbb{P}\left\{ d(X, A) \ge t + \sqrt{ \frac{n}{2} \log \frac{1}{\mathbb{P}\{A\}} } \right\} \le e^{-2 t^2 / n} .$$
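
A concrete check of this bound (a hypothetical example, not from the slides): with fair coin flips and $A = \{y \in \{0,1\}^n : \sum_i y_i \le k\}$, the Hamming distance is simply $d(X, A) = (\sum_i X_i - k)_+$, so both $\mathbb{P}[A]$ and the tail can be computed or simulated directly.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(11)
n, k, reps, t = 100, 40, 50_000, 5

# A = {y in {0,1}^n : sum_i y_i <= k}; then d(X, A) = (sum_i X_i - k)_+ exactly.
PA = sum(comb(n, j) for j in range(k + 1)) / 2**n
X = rng.integers(0, 2, size=(reps, n))
d = np.clip(X.sum(axis=1) - k, 0, None)

thresh = t + np.sqrt(n / 2 * np.log(1 / PA))
emp = np.mean(d >= thresh)
print(f"P(A) = {PA:.3f},  P(d(X,A) >= {thresh:.1f}) ~ {emp:.4f},  bound {np.exp(-2 * t**2 / n):.4f}")
```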

  69. talagrand's convex distance The weighted Hamming distance is
$$d_\alpha(x, A) = \inf_{y \in A} d_\alpha(x, y) = \inf_{y \in A} \sum_{i : x_i \neq y_i} |\alpha_i| ,$$
where $\alpha = (\alpha_1, \dots, \alpha_n)$. The same argument as before gives
$$\mathbb{P}\left\{ d_\alpha(X, A) \ge t + \sqrt{ \frac{\|\alpha\|^2}{2} \log \frac{1}{\mathbb{P}\{A\}} } \right\} \le e^{-2 t^2 / \|\alpha\|^2} .$$
This implies
$$\sup_{\alpha : \|\alpha\| = 1} \min\left( \mathbb{P}\{A\},\, \mathbb{P}\{ d_\alpha(X, A) \ge t \} \right) \le e^{-t^2 / 2} .$$
