 
Variance
\[
\begin{aligned}
\mathrm{E}\left[\left(\|\vec{x}\|_2^2\right)^2\right]
&= \mathrm{E}\left[\left(\sum_{i=1}^{d} \vec{x}[i]^2\right)^2\right] \\
&= \mathrm{E}\left[\sum_{i=1}^{d}\sum_{j=1}^{d} \vec{x}[i]^2\,\vec{x}[j]^2\right] \\
&= \sum_{i=1}^{d}\sum_{j=1}^{d} \mathrm{E}\left(\vec{x}[i]^2\,\vec{x}[j]^2\right) \\
&= \sum_{i=1}^{d} \mathrm{E}\left(\vec{x}[i]^4\right) + 2\sum_{i=1}^{d-1}\sum_{j=i+1}^{d} \mathrm{E}\left(\vec{x}[i]^2\right)\mathrm{E}\left(\vec{x}[j]^2\right) \\
&= 3d + d(d-1) \qquad \text{(4th moment of standard Gaussian equals 3)} \\
&= d(d+2)
\end{aligned}
\]
Variance
\[
\mathrm{Var}\left(\|\vec{x}\|_2^2\right)
= \mathrm{E}\left[\left(\|\vec{x}\|_2^2\right)^2\right] - \mathrm{E}^2\left(\|\vec{x}\|_2^2\right)
= d(d+2) - d^2 = 2d
\]
Relative standard deviation around mean scales as $\sqrt{2/d}$.
Geometrically, probability density concentrates close to surface of a sphere with radius $\sqrt{d}$.
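A quick Monte Carlo check of the mean $d$ and variance $2d$ can make the concentration concrete. The sketch below is only an illustration: the helper name `squared_norms`, the seed, and the sample sizes are assumptions chosen for the example, not part of the slides.

```python
import random
import statistics


def squared_norms(d, trials, seed=0):
    """Hypothetical helper: draw `trials` iid standard Gaussian vectors of
    dimension d and return the squared l2 norm of each one."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)) for _ in range(trials)]


if __name__ == "__main__":
    d = 100
    norms = squared_norms(d, trials=5000)
    mean = statistics.fmean(norms)
    var = statistics.pvariance(norms)
    # Theory from the slides: mean d, variance 2d, relative std sqrt(2/d).
    print(f"mean {mean:.1f} (theory {d}), variance {var:.1f} (theory {2 * d})")
    print(f"relative std {var ** 0.5 / mean:.3f} (theory {(2 / d) ** 0.5:.3f})")
```

Increasing `d` should shrink the printed relative standard deviation roughly like $\sqrt{2/d}$.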
Non-asymptotic tail bound
Let $\vec{x}$ be an iid standard Gaussian random vector of dimension $d$. For any $\epsilon > 0$,
\[
\mathrm{P}\left( d(1-\epsilon) < \|\vec{x}\|_2^2 < d(1+\epsilon) \right) \geq 1 - \frac{2}{d\,\epsilon^2}
\]
Markov's inequality
Let $x$ be a nonnegative random variable. For any positive constant $a > 0$,
\[
\mathrm{P}(x \geq a) \leq \frac{\mathrm{E}(x)}{a}
\]
Proof
Define the indicator variable $1_{x \geq a}$. Since
\[
x - a\,1_{x \geq a} \geq 0,
\]
taking expectations gives
\[
\mathrm{E}(x) \geq a\,\mathrm{E}\left(1_{x \geq a}\right) = a\,\mathrm{P}(x \geq a)
\]
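Markov's inequality can be checked numerically on any nonnegative sample. The sketch below uses exponential samples purely as a convenient example; the function name `markov_check` and the choice of distribution are assumptions for illustration.

```python
import random


def markov_check(samples, a):
    """Hypothetical helper: return (empirical P(x >= a), Markov bound mean(x) / a).

    `samples` must be nonnegative and `a` positive, as the inequality requires.
    """
    if a <= 0 or min(samples) < 0:
        raise ValueError("Markov's inequality needs a > 0 and x >= 0")
    n = len(samples)
    tail = sum(1 for s in samples if s >= a) / n
    bound = sum(samples) / n / a
    return tail, bound


if __name__ == "__main__":
    rng = random.Random(0)
    # Exponential samples are just one convenient nonnegative example.
    samples = [rng.expovariate(1.0) for _ in range(10000)]
    for a in (0.5, 1.0, 2.0, 4.0):
        tail, bound = markov_check(samples, a)
        print(f"a={a}: P(x >= a) ~ {tail:.4f} <= {bound:.4f}")
```

The bound is loose for small `a` (it can exceed 1) and only becomes informative in the tail.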
Chebyshev bound
Let $y := \|\vec{x}\|_2^2$,
\[
\begin{aligned}
\mathrm{P}\left( |y - d| \geq d\epsilon \right)
&= \mathrm{P}\left( (y - \mathrm{E}(y))^2 \geq d^2\epsilon^2 \right) \\
&\leq \frac{\mathrm{E}\left[(y - \mathrm{E}(y))^2\right]}{d^2\epsilon^2} \qquad \text{by Markov's inequality} \\
&= \frac{\mathrm{Var}(y)}{d^2\epsilon^2} \\
&= \frac{2}{d\,\epsilon^2}
\end{aligned}
\]
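To see how conservative the Chebyshev bound $2/(d\epsilon^2)$ is, one can compare it with a simulated deviation probability. This is a minimal sketch; `chebyshev_bound` and `empirical_deviation_prob` are hypothetical names, and the values of `d`, `eps`, and the number of trials are arbitrary.

```python
import random


def chebyshev_bound(d, eps):
    """Upper bound 2 / (d eps^2) on P(|y - d| >= d eps), with y the squared norm."""
    return 2.0 / (d * eps ** 2)


def empirical_deviation_prob(d, eps, trials, seed=0):
    """Hypothetical helper: Monte Carlo estimate of P(|y - d| >= d eps)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        y = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d))
        if abs(y - d) >= d * eps:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    d = 50
    for eps in (0.3, 0.5):
        est = empirical_deviation_prob(d, eps, trials=2000)
        print(f"eps={eps}: simulated {est:.4f}, Chebyshev bound {chebyshev_bound(d, eps):.4f}")
```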
Non-asymptotic Chernoff tail bound
Let $\vec{x}$ be an iid standard Gaussian random vector of dimension $d$. For any $\epsilon \in (0, 1)$,
\[
\mathrm{P}\left( d(1-\epsilon) < \|\vec{x}\|_2^2 < d(1+\epsilon) \right) \geq 1 - 2\exp\left(\frac{-d\,\epsilon^2}{8}\right)
\]
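Proof
Let $y := \|\vec{x}\|_2^2$. The result is implied by
\[
\mathrm{P}\left( y > d(1+\epsilon) \right) \leq \exp\left(\frac{-d\,\epsilon^2}{8}\right)
\]
\[
\mathrm{P}\left( y < d(1-\epsilon) \right) \leq \exp\left(\frac{-d\,\epsilon^2}{8}\right)
\]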
Proof
Fix $t > 0$
\[
\begin{aligned}
\mathrm{P}(y > a)
&= \mathrm{P}\left( \exp(ty) > \exp(at) \right) \\
&\leq \exp(-at)\,\mathrm{E}\left(\exp(ty)\right) \qquad \text{by Markov's inequality} \\
&\leq \exp(-at)\,\mathrm{E}\left(\exp\left(\sum_{i=1}^{d} t\,x_i^2\right)\right) \\
&\leq \exp(-at)\prod_{i=1}^{d} \mathrm{E}\left(\exp\left(t\,x_i^2\right)\right) \qquad \text{by independence of } x_1, \ldots, x_d
\end{aligned}
\]
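Proof
Lemma (by direct integration): for a standard Gaussian $x$ and $t < 1/2$,
\[
\mathrm{E}\left(\exp\left(t\,x^2\right)\right) = \frac{1}{\sqrt{1 - 2t}}
\]
Equivalent to controlling higher-order moments since
\[
\begin{aligned}
\mathrm{E}\left(\exp\left(t\,x^2\right)\right)
&= \mathrm{E}\left(\sum_{i=0}^{\infty} \frac{\left(t\,x^2\right)^i}{i!}\right) \\
&= \sum_{i=0}^{\infty} \frac{t^i\,\mathrm{E}\left(x^{2i}\right)}{i!}.
\end{aligned}
\]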
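The direct integration behind the lemma can be sketched as follows, assuming $x$ is a standard Gaussian and $t < 1/2$ so that the integral converges. The substitution $u = \sqrt{1 - 2t}\,x$ is one possible route, not necessarily the one intended in the lecture.

```latex
\begin{aligned}
\mathrm{E}\left(\exp\left(t x^2\right)\right)
&= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\left(t x^2\right) \exp\left(-\frac{x^2}{2}\right) \,\mathrm{d}x \\
&= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\left(-\frac{(1 - 2t)\,x^2}{2}\right) \,\mathrm{d}x \\
&= \frac{1}{\sqrt{1 - 2t}} \cdot \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\left(-\frac{u^2}{2}\right) \,\mathrm{d}u
   \qquad \left(u = \sqrt{1 - 2t}\,x\right) \\
&= \frac{1}{\sqrt{1 - 2t}}
\end{aligned}
```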
Proof
Fix $t > 0$
\[
\begin{aligned}
\mathrm{P}(y > a)
&\leq \exp(-at)\prod_{i=1}^{d} \mathrm{E}\left(\exp\left(t\,x_i^2\right)\right) \\
&= \frac{\exp(-at)}{(1 - 2t)^{d/2}}
\end{aligned}
\]
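Proof
Setting $a := d(1+\epsilon)$ and $t := \frac{1}{2} - \frac{1}{2(1+\epsilon)}$, we conclude
\[
\begin{aligned}
\mathrm{P}\left( y > d(1+\epsilon) \right)
&\leq (1+\epsilon)^{d/2} \exp\left(\frac{-d\,\epsilon}{2}\right) \\
&\leq \exp\left(\frac{-d\,\epsilon^2}{8}\right)
\end{aligned}
\]
The last step uses $\log(1+\epsilon) \leq \epsilon - \epsilon^2/4$, which holds for $\epsilon \in (0, 1]$.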
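The choice of $a$ and $t$ can be checked numerically without any simulation. The sketch below evaluates $\exp(-at)/(1-2t)^{d/2}$ at the stated $t$ and compares it with the simplified bound $\exp(-d\epsilon^2/8)$; the function names are hypothetical, and the grid of $\epsilon$ values is restricted to $(0, 1]$ as an assumption.

```python
import math


def chernoff_upper(d, eps):
    """Evaluate exp(-a t) / (1 - 2t)^(d/2) at a = d(1+eps), t = 1/2 - 1/(2(1+eps))."""
    a = d * (1 + eps)
    t = 0.5 - 1.0 / (2 * (1 + eps))
    return math.exp(-a * t) / (1 - 2 * t) ** (d / 2)


def simplified_bound(d, eps):
    """The simplified bound exp(-d eps^2 / 8)."""
    return math.exp(-d * eps ** 2 / 8)


if __name__ == "__main__":
    for d in (10, 100):
        for eps in (0.1, 0.5, 1.0):
            print(f"d={d}, eps={eps}: "
                  f"{chernoff_upper(d, eps):.3e} <= {simplified_bound(d, eps):.3e}")
```

The optimized expression is always the smaller of the two on this range; the simplified form trades tightness for a cleaner exponent.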
Projection onto a fixed subspace
Probability density is isotropic and has variance $d$.
Projection onto fixed $k$-dimensional subspace should capture fraction of variance equal to $k/d$.
Variance of projection should be $k$.
Projection onto a fixed subspace
Let $\mathcal{S}$ be a $k$-dimensional subspace of $\mathbb{R}^d$ and $\vec{x}$ a $d$-dimensional standard Gaussian vector. Let the columns of $U \in \mathbb{R}^{d \times k}$ be an orthonormal basis of $\mathcal{S}$.
$\mathcal{P}_{\mathcal{S}}(\vec{x}) = UU^T\vec{x}$ is not a Gaussian vector.
Covariance:
\[
\begin{aligned}
\Sigma_{\mathcal{P}_{\mathcal{S}}(\vec{x})} &= UU^T\,\Sigma_{\vec{x}}\,UU^T \\
&= UU^T
\end{aligned}
\]
Not full rank.
Projection onto a fixed subspace
Coefficients $U^T\vec{x}$ are a Gaussian vector with covariance
\[
\Sigma_{U^T\vec{x}} = U^T\,\Sigma_{\vec{x}}\,U = U^T U = I
\]
We have
\[
\begin{aligned}
\left\|\mathcal{P}_{\mathcal{S}}(\vec{x})\right\|_2^2 &= \left(UU^T\vec{x}\right)^T UU^T\vec{x} \\
&= \left\|U^T\vec{x}\right\|_2^2
\end{aligned}
\]
For any $\epsilon \in (0, 1)$,
\[
\mathrm{P}\left( k(1-\epsilon) < \left\|\mathcal{P}_{\mathcal{S}}(\vec{x})\right\|_2^2 < k(1+\epsilon) \right) \geq 1 - 2\exp\left(\frac{-k\,\epsilon^2}{8}\right)
\]
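The claim that the projection has squared norm close to $k$ can be simulated by fixing one subspace and projecting many Gaussian vectors onto it. In this sketch the subspace is built by Gram-Schmidt on random vectors, which is an assumption made only to obtain some fixed orthonormal basis; the helper names are hypothetical.

```python
import random


def random_orthonormal_basis(d, k, rng):
    """Gram-Schmidt on k random Gaussian vectors: returns k orthonormal vectors
    in R^d, playing the role of the columns of U."""
    basis = []
    while len(basis) < k:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for u in basis:
            c = sum(vi * ui for vi, ui in zip(v, u))
            v = [vi - c * ui for vi, ui in zip(v, u)]
        norm = sum(vi * vi for vi in v) ** 0.5
        basis.append([vi / norm for vi in v])
    return basis


def projected_sq_norm(basis, x):
    """||P_S(x)||_2^2 computed as ||U^T x||_2^2."""
    return sum(sum(ui * xi for ui, xi in zip(u, x)) ** 2 for u in basis)


if __name__ == "__main__":
    rng = random.Random(0)
    d, k = 100, 10
    basis = random_orthonormal_basis(d, k, rng)  # the subspace stays fixed
    values = [projected_sq_norm(basis, [rng.gauss(0.0, 1.0) for _ in range(d)])
              for _ in range(2000)]
    print(f"average squared norm of projection: {sum(values) / len(values):.2f} (theory {k})")
```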
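Linear regression
To analyze the performance of the least-squares estimator we assume a linear model with additive iid Gaussian noise
\[
\vec{y}_{\text{train}} := X_{\text{train}}\,\vec{\beta}_{\text{true}} + \vec{z}_{\text{train}}
\]
The LS estimator equals
\[
\vec{\beta}_{\text{LS}} := \arg\min_{\vec{\beta}} \left\| \vec{y}_{\text{train}} - X_{\text{train}}\,\vec{\beta} \right\|_2
\]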
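Training error
The training error is the projection of the noise onto the orthogonal complement of the column space of $X_{\text{train}}$
\[
\vec{y}_{\text{train}} - \vec{y}_{\text{LS}} = \mathcal{P}_{\operatorname{col}(X_{\text{train}})^{\perp}}\,\vec{z}_{\text{train}}
\]
Dimension of orthogonal complement of $\operatorname{col}(X_{\text{train}})$ equals $n - p$
\[
\text{Training RMSE} := \sqrt{\frac{\left\|\vec{y}_{\text{train}} - \vec{y}_{\text{LS}}\right\|_2^2}{n}} \approx \sigma\sqrt{1 - \frac{p}{n}}
\]
where $\sigma$ is the standard deviation of the noise.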
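The approximation $\sigma\sqrt{1 - p/n}$ can be checked on synthetic data. The sketch below assumes Gaussian features and a random $\vec{\beta}_{\text{true}}$ (neither is specified in the slides) and computes the least-squares residual by removing the components of $\vec{y}_{\text{train}}$ along an orthonormal basis of the column space. All function names are hypothetical.

```python
import math
import random


def orthonormalize(columns):
    """Modified Gram-Schmidt: orthonormal basis of the span of `columns`."""
    basis = []
    for col in columns:
        v = list(col)
        for u in basis:
            c = sum(a * b for a, b in zip(v, u))
            v = [a - c * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        basis.append([a / norm for a in v])
    return basis


def training_rmse(n, p, sigma, seed=0):
    """Simulate y = X beta + z and return the training RMSE of least squares."""
    rng = random.Random(seed)
    # Assumption: Gaussian features; columns[j] is the j-th column of X_train.
    columns = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]
    beta = [rng.gauss(0.0, 1.0) for _ in range(p)]
    y = [sum(beta[j] * columns[j][i] for j in range(p)) + sigma * rng.gauss(0.0, 1.0)
         for i in range(n)]
    # Residual = y minus its projection onto col(X_train) = y_train - y_LS.
    residual = y
    for u in orthonormalize(columns):
        c = sum(r * b for r, b in zip(residual, u))
        residual = [r - c * b for r, b in zip(residual, u)]
    return math.sqrt(sum(r * r for r in residual) / n)


if __name__ == "__main__":
    n, p, sigma = 200, 50, 2.0
    print(f"simulated RMSE {training_rmse(n, p, sigma):.3f}, "
          f"approximation {sigma * math.sqrt(1 - p / n):.3f}")
```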
Temperature prediction via linear regression
[Figure: average error (deg Celsius) against number of training data (200 to 5000), showing training error and test error together with two reference curves in $p/n$.]
Motivating applications
Gaussian random variables
Randomized dimensionality reduction
Compressed sensing
Randomized linear maps
We use Gaussian matrices as randomized linear maps from $\mathbb{R}^d$ to $\mathbb{R}^k$, $k < d$.
Each entry is sampled independently from standard Gaussian.
Question: Do we preserve distances between points in set?
Equivalently, are any fixed vectors in the null space?
Fixed vector
Let $A$ be a $k \times d$ matrix with iid standard Gaussian entries.
If $\vec{v} \in \mathbb{R}^d$ is a deterministic vector with unit $\ell_2$ norm, then $A\vec{v}$ is a $k$-dimensional standard Gaussian vector.
Proof: $(A\vec{v})[i]$, $1 \leq i \leq k$, is Gaussian with mean zero and variance
\[
\begin{aligned}
\mathrm{Var}\left(A_{i,:}^T\,\vec{v}\right) &= \vec{v}^T\,\Sigma_{A_{i,:}}\,\vec{v} \\
&= \vec{v}^T I\,\vec{v} \\
&= \|\vec{v}\|_2^2 \\
&= 1
\end{aligned}
\]
$A_{i,:}$, $1 \leq i \leq k$, are all independent.
Non-asymptotic Chernoff tail bound
Let $\vec{x}$ be an iid standard Gaussian random vector of dimension $k$. For any $\epsilon \in (0, 1)$,
\[
\mathrm{P}\left( k(1-\epsilon) < \|\vec{x}\|_2^2 < k(1+\epsilon) \right) \geq 1 - 2\exp\left(\frac{-k\,\epsilon^2}{8}\right)
\]
Fixed vector
Let $A$ be a $k \times d$ matrix with iid standard Gaussian entries.
For any $\vec{v} \in \mathbb{R}^d$ with unit norm and any $\epsilon \in (0, 1)$,
\[
\sqrt{1-\epsilon} \leq \left\| \frac{1}{\sqrt{k}} A\vec{v} \right\|_2 \leq \sqrt{1+\epsilon}
\]
with probability at least $1 - 2\exp\left(-k\,\epsilon^2/8\right)$.
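A small simulation can illustrate the fixed-vector result: draw a fresh Gaussian matrix each time, apply it to the same unit vector, and count how often the scaled norm lands in $[\sqrt{1-\epsilon}, \sqrt{1+\epsilon}]$. The names `scaled_norm` and `coverage`, and all numeric settings, are assumptions for this sketch.

```python
import math
import random


def scaled_norm(k, v, rng):
    """|| (1/sqrt(k)) A v ||_2 for a fresh k x d matrix A with iid N(0, 1) entries."""
    total = 0.0
    for _ in range(k):
        # One row of A applied to v; rows are generated on the fly.
        row_dot = sum(rng.gauss(0.0, 1.0) * vi for vi in v)
        total += row_dot ** 2
    return math.sqrt(total / k)


def coverage(k, v, eps, trials, seed=0):
    """Fraction of trials with sqrt(1 - eps) <= scaled norm <= sqrt(1 + eps)."""
    rng = random.Random(seed)
    lo, hi = math.sqrt(1 - eps), math.sqrt(1 + eps)
    inside = sum(1 for _ in range(trials) if lo <= scaled_norm(k, v, rng) <= hi)
    return inside / trials


if __name__ == "__main__":
    d, k, eps = 20, 200, 0.5
    v = [1 / math.sqrt(d)] * d  # a deterministic unit-norm vector
    guarantee = 1 - 2 * math.exp(-k * eps ** 2 / 8)
    print(f"simulated coverage {coverage(k, v, eps, trials=200):.3f}, "
          f"guaranteed at least {guarantee:.3f}")
```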
Distance between two vectors
The result implies that if we fix two vectors $\vec{x}_1$ and $\vec{x}_2$ and define $\vec{y} := \vec{x}_2 - \vec{x}_1$ then
\[
\sqrt{1-\epsilon}\,\|\vec{y}\|_2 \leq \left\| \frac{1}{\sqrt{k}} A\vec{y} \right\|_2 \leq \sqrt{1+\epsilon}\,\|\vec{y}\|_2
\]
with high probability (just set $\vec{v} := \vec{y} / \|\vec{y}\|_2$).
What about distances between a set of vectors?
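The reduction to the unit-norm case can be written out in one line, assuming $\vec{x}_1 \neq \vec{x}_2$ so that $\vec{y} \neq 0$. This is a sketch of the scaling step only:

```latex
\left\| \frac{1}{\sqrt{k}} A\vec{y} \right\|_2
= \|\vec{y}\|_2 \left\| \frac{1}{\sqrt{k}} A\,\frac{\vec{y}}{\|\vec{y}\|_2} \right\|_2
= \|\vec{y}\|_2 \left\| \frac{1}{\sqrt{k}} A\vec{v} \right\|_2,
\qquad
\sqrt{1-\epsilon} \leq \left\| \frac{1}{\sqrt{k}} A\vec{v} \right\|_2 \leq \sqrt{1+\epsilon}
```

Multiplying the right-hand inequalities by $\|\vec{y}\|_2$ gives the stated bound, with the same probability $1 - 2\exp(-k\epsilon^2/8)$ as for a fixed unit vector.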
Johnson-Lindenstrauss lemma
Let $A$ be a $k \times d$ matrix with iid standard Gaussian entries.
Let $\vec{x}_1, \ldots, \vec{x}_p \in \mathbb{R}^d$ be any fixed set of $p$ deterministic vectors.
For any pair $\vec{x}_i$, $\vec{x}_j$ and any $\epsilon \in (0, 1)$,
\[
(1-\epsilon)\,\|\vec{x}_i - \vec{x}_j\|_2^2
\leq \left\| \frac{1}{\sqrt{k}} A\vec{x}_i - \frac{1}{\sqrt{k}} A\vec{x}_j \right\|_2^2
\leq (1+\epsilon)\,\|\vec{x}_i - \vec{x}_j\|_2^2
\]
with probability at least $\frac{1}{p}$ as long as
\[
k \geq \frac{16 \log(p)}{\epsilon^2}
\]
No dependence on $d$!
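The lemma can be tried out on synthetic points: pick $k$ from the bound, project with a Gaussian matrix scaled by $1/\sqrt{k}$, and measure the worst relative distortion of squared pairwise distances. The sketch below assumes random Gaussian points as the fixed set; `jl_dimension` and `max_distortion` are hypothetical names.

```python
import math
import random
from itertools import combinations


def jl_dimension(p, eps):
    """Smallest integer k with k >= 16 log(p) / eps^2."""
    return math.ceil(16 * math.log(p) / eps ** 2)


def max_distortion(points, k, rng):
    """Worst |ratio - 1| over all pairs, where ratio is the squared distance
    after projection by (1/sqrt(k)) A divided by the original squared distance."""
    d = len(points[0])
    A = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]
    scale = 1 / math.sqrt(k)
    projected = [[scale * sum(a * x for a, x in zip(row, pt)) for row in A]
                 for pt in points]
    worst = 0.0
    for i, j in combinations(range(len(points)), 2):
        original = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
        reduced = sum((a - b) ** 2 for a, b in zip(projected[i], projected[j]))
        worst = max(worst, abs(reduced / original - 1))
    return worst


if __name__ == "__main__":
    rng = random.Random(0)
    p, d, eps = 20, 500, 0.5
    # Assumption: the "fixed" points are drawn once at random for the demo.
    points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(p)]
    k = jl_dimension(p, eps)
    print(f"k = {k} (d = {d}), worst distortion = {max_distortion(points, k, rng):.3f}")
```

Changing `d` leaves `k` untouched, which is the point of the "no dependence on $d$" remark.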
Proof
Aim: Control action of $A$ on the normalized differences
\[
\vec{v}_{ij} := \frac{\vec{x}_i - \vec{x}_j}{\|\vec{x}_i - \vec{x}_j\|_2}
\]
Our event of interest is the intersection of the events
\[
\mathcal{E}_{ij} = \left\{ k(1-\epsilon) < \|A\vec{v}_{ij}\|_2^2 < k(1+\epsilon) \right\}, \qquad 1 \leq i < p, \quad i < j \leq p
\]
Is it equal to � i , j E ij ?