Linear regression without correspondence


Linear regression without correspondence
Daniel Hsu (Columbia University), October 3, 2017
Joint work with Kevin Shi (Columbia University) and Xiaorui Sun (Microsoft Research).


Alternating minimization
Pick initial ŵ ∈ R^d (e.g., randomly). Loop until convergence:
  π̂ ← argmin_{π ∈ S_n} ∑_{i=1}^n ( y_i − ŵ^⊤ x_{π(i)} )² ,
  ŵ ← argmin_{w ∈ R^d} ∑_{i=1}^n ( y_i − w^⊤ x_{π̂(i)} )² .
▶ Each loop iteration is efficiently computable.
▶ But the loop can get stuck in local minima, so try many initial ŵ ∈ R^d.
(Questions: How many restarts? How many iterations?)
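A minimal sketch of this loop, assuming NumPy and SciPy: the permutation step is an assignment problem (solved exactly by scipy.optimize.linear_sum_assignment), and the regression step is ordinary least squares. The function name, restart scheme, and convergence test are illustrative choices, not the talk's implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def alternating_minimization(X, y, n_restarts=20, max_iters=100, rng=None):
    """Heuristic for min over (w, pi) of sum_i (y_i - w^T x_{pi(i)})^2.

    X: (n, d) covariate matrix, y: (n,) responses.
    Returns the best (w, pi, objective) found over random restarts.
    """
    rng = np.random.default_rng(rng)
    n, d = X.shape
    best = (None, None, np.inf)
    for _ in range(n_restarts):
        w = rng.standard_normal(d)            # random initial w-hat
        prev_obj = np.inf
        for _ in range(max_iters):
            # Permutation step: cost[i, j] = (y_i - w^T x_j)^2,
            # solved exactly as a linear assignment problem.
            cost = (y[:, None] - X @ w) ** 2
            _, pi = linear_sum_assignment(cost)
            # Regression step: least squares against the re-ordered covariates.
            w = np.linalg.lstsq(X[pi], y, rcond=None)[0]
            obj = np.sum((y - X[pi] @ w) ** 2)
            if prev_obj - obj < 1e-12:        # converged to a local minimum
                break
            prev_obj = obj
        if obj < best[2]:
            best = (w, pi, obj)
    return best
```

Each restart runs in polynomial time, but, as noted above, there is no general guarantee on how many restarts or iterations are needed.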

Approximation result
Theorem. There is an algorithm that, given any inputs (x_i)_{i=1}^n, (y_i)_{i=1}^n, and ϵ ∈ (0, 1), returns a (1 + ϵ)-approximate solution to the least squares problem in time (n/ϵ)^{O(k)} + poly(n, d), where k = dim(span((x_i)_{i=1}^n)).

Beating brute-force search: "realizable" case
"Realizable" case: Suppose there exist w⋆ ∈ R^d and π⋆ ∈ S_n s.t. y_i = w⋆^⊤ x_{π⋆(i)}, i ∈ [n].
The solution is determined by the action of π⋆ on d points (assume the x_i span R^d).
Algorithm:
▶ Find a subset of d linearly independent points x_{i_1}, x_{i_2}, ..., x_{i_d}.
▶ "Guess" the values of π⋆^{−1}(i_j) ∈ [n], j ∈ [d].
▶ Solve the linear system y_{π⋆^{−1}(i_j)} = w^⊤ x_{i_j}, j ∈ [d], for w ∈ R^d.
▶ To check correctness of ŵ: compute ŷ_i := ŵ^⊤ x_i, i ∈ [n], and check whether min_{π ∈ S_n} ∑_{i=1}^n ( y_i − ŷ_{π(i)} )² = 0.
"Guess" means "enumerate over ≤ n^d choices"; the rest is poly(n, d).
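A sketch of this realizable-case algorithm, assuming NumPy; the "guess" step is the literal enumeration over ≤ n^d assignments of response indices to the d independent covariates, so this is only practical for tiny d. The helper names and the greedy rank check are illustrative.

```python
import itertools
import numpy as np

def solve_realizable(X, y, tol=1e-9):
    """Recover w when y_i = w*^T x_{pi*(i)} exactly (noise-free case).

    X: (n, d) covariates spanning R^d; y: (n,) responses.
    Enumerates <= n^d guesses of which response goes with each of d
    linearly independent covariates.
    """
    n, d = X.shape
    # Find d linearly independent points x_{i_1}, ..., x_{i_d} (greedy rank check).
    idx = []
    for i in range(n):
        if np.linalg.matrix_rank(X[idx + [i]]) > len(idx):
            idx.append(i)
        if len(idx) == d:
            break
    B = X[idx]                                    # (d, d), invertible by construction
    # "Guess" pi*^{-1}(i_j) in [n] for j in [d]: ordered d-tuples of distinct indices.
    for guess in itertools.permutations(range(n), d):
        w = np.linalg.solve(B, y[list(guess)])    # y_{guess[j]} = w^T x_{i_j}, j in [d]
        # Check: the best matching of y to (w^T x_i) must have zero residual;
        # for scalars, the optimal permutation matches values in sorted order.
        if np.sum((np.sort(y) - np.sort(X @ w)) ** 2) <= tol:
            return w
    return None
```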

Beating brute-force search: general case
General case: the solution may not be determined by only d points.
But, for any right-hand side b ∈ R^n, there exist x_{i_1}, x_{i_2}, ..., x_{i_d} s.t. every ŵ ∈ argmin_{w ∈ R^d} ∑_{j=1}^d ( b_{i_j} − w^⊤ x_{i_j} )² satisfies
  ∑_{i=1}^n ( b_i − ŵ^⊤ x_i )² ≤ (d + 1) · min_{w ∈ R^d} ∑_{i=1}^n ( b_i − w^⊤ x_i )².
(Follows from a result of Dereziński and Warmuth (2017) on volume sampling.)
⇒ An n^{O(d)}-time algorithm with approximation ratio d + 1, or an n^{O(d/ϵ)}-time algorithm with approximation ratio 1 + ϵ.
A better way to get 1 + ϵ: exploit first-order optimality conditions (i.e., "normal equations") and ϵ-nets.
Overall time: (n/ϵ)^{O(k)} + poly(n, d) for k = dim(span((x_i)_{i=1}^n)).
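To make the n^{O(d)}-time route concrete, here is a brute-force sketch assuming NumPy: it enumerates every choice of d covariates paired with d responses, fits w on those d pairs by least squares, and keeps the fit with the smallest full assignment cost; by the volume-sampling bound above, the best candidate is a (d + 1)-approximation. The faster (1 + ϵ) route via normal equations and ϵ-nets is not shown, and the function names are illustrative.

```python
import itertools
import numpy as np

def assignment_cost(y, yhat):
    """min over permutations pi of sum_i (y_i - yhat_{pi(i)})^2; for real
    numbers the optimal matching pairs the values in sorted order."""
    return np.sum((np.sort(y) - np.sort(yhat)) ** 2)

def d_plus_one_approx(X, y):
    """n^{O(d)}-time search: fit w on every choice of d covariates paired with
    d responses, and keep the candidate with the smallest full assignment cost."""
    n, d = X.shape
    best_w, best_cost = None, np.inf
    for rows in itertools.combinations(range(n), d):       # covariates x_{i_1..i_d}
        A = X[list(rows)]
        for resp in itertools.permutations(range(n), d):   # responses assigned to them
            w = np.linalg.lstsq(A, y[list(resp)], rcond=None)[0]
            cost = assignment_cost(y, X @ w)
            if cost < best_cost:
                best_w, best_cost = w, cost
    return best_w, best_cost
```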

Remarks
▶ The algorithm is justified in the statistical setting by results of [PWC'16] for the MLE, but the guarantees also hold when the inputs are worst-case.
▶ The algorithm is poly-time only when k = O(1).
Open problems:
1. A poly-time approximation algorithm when k = ω(1). (Perhaps in an average-case or smoothed setting.)
2. (Smoothed) analysis of alternating minimization, similar to Lloyd's algorithm for Euclidean k-means.
Next: An algorithm for the noise-free average-case setting.

2. Exact recovery in the noise-free Gaussian setting

Setting
Noise-free Gaussian linear model (with n + 1 measurements):
  y_i = w̄^⊤ x_{π̄(i)}, i ∈ {0, 1, ..., n}
▶ Covariate vectors: (x_i)_{i=0}^n iid from N(0, I_d)
▶ Unknown linear function: w̄ ∈ R^d
▶ Unknown permutation: π̄ ∈ S_{{0,1,...,n}}
"Equivalent" problem: We're promised that π̄(0) = 0, so we can just treat π̄ as an unknown permutation over {1, 2, ..., n}.
Number of measurements: If n + 1 ≥ d, then recovery of π̄ gives exact recovery of w̄ (a.s.). We'll assume n + 1 ≥ d + 1 (i.e., n ≥ d).
Claim: n ≥ d suffices to recover π̄ with high probability.

Exact recovery result
Theorem. Fix any w̄ ∈ R^d and π̄ ∈ S_n, and assume n ≥ d. Suppose (x_i)_{i=0}^n are drawn iid from N(0, I_d), and (y_i)_{i=0}^n satisfy
  y_0 = w̄^⊤ x_0;  y_i = w̄^⊤ x_{π̄(i)}, i ∈ [n].
There is an algorithm that, given inputs (x_i)_{i=0}^n and (y_i)_{i=0}^n, returns π̄ and w̄ with high probability.

Main idea: hidden subset
Measurements:
  y_0 = w̄^⊤ x_0;  y_i = w̄^⊤ x_{π̄(i)}, i ∈ [n].
For simplicity: assume n = d, and x_1, x_2, ..., x_d orthonormal. Then
  y_0 = w̄^⊤ x_0 = ∑_{j=1}^d ( w̄^⊤ x_j )( x_j^⊤ x_0 )
      = ∑_{j=1}^d y_{π̄^{−1}(j)} ( x_j^⊤ x_0 )
      = ∑_{i=1}^d ∑_{j=1}^d 1{ π̄(i) = j } · y_i ( x_j^⊤ x_0 ).
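A quick numerical check of this identity in the simplified setting, assuming NumPy: generate a random orthonormal instance and verify that y_0 equals the sum of the d hidden terms y_i (x_j^⊤ x_0) with π̄(i) = j. The variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6

# Simplified setting: n = d, with x_1, ..., x_d orthonormal (rows of Q).
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
X = Q                                   # X[j] plays the role of x_{j+1}
x0 = rng.standard_normal(d)             # the extra measurement point x_0
w_bar = rng.standard_normal(d)
pi_bar = rng.permutation(d)             # pi_bar[i] = j means y_{i+1} = w_bar^T x_{j+1}

y0 = w_bar @ x0
y = np.array([w_bar @ X[pi_bar[i]] for i in range(d)])

# The d^2 "source" numbers c_{i,j} = y_i * (x_j^T x_0); the hidden size-d subset
# {(i, pi_bar(i))} sums to the "target" y_0.
C = np.outer(y, X @ x0)
subset_sum = sum(C[i, pi_bar[i]] for i in range(d))
print(np.isclose(y0, subset_sum))       # True, up to floating-point error
```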

Reduction to subset sum
  y_0 = ∑_{i=1}^d ∑_{j=1}^d 1{ π̄(i) = j } · c_{i,j},  where c_{i,j} := y_i ( x_j^⊤ x_0 ).
▶ d² "source" numbers c_{i,j} := y_i ( x_j^⊤ x_0 ), "target" sum T := y_0. We are promised that a size-d subset of the c_{i,j} sums to T.
▶ The correct subset corresponds to the pairs (i, j) ∈ [d]² s.t. π̄(i) = j.
Next: How to solve subset sum efficiently?

Reducing subset sum to the shortest vector problem
Lagarias & Odlyzko (1983): random instances of subset sum are efficiently solvable when the N source numbers are chosen independently and u.a.r. from a sufficiently wide interval of Z.
Main idea: (w.h.p.) every incorrect subset will "miss" the target sum T by a noticeable amount.
Reduction: construct a lattice basis in R^{N+1} such that
▶ the correct subset of basis vectors gives a short lattice vector v⋆;
▶ any other lattice vector not proportional to v⋆ is more than 2^{N/2}-times longer.
  [ b_0  b_1  ⋯  b_N ] = [ 0     I_N              ]
                         [ βT    −βc_1  ⋯  −βc_N  ]
for sufficiently large β > 0.
Using the Lenstra, Lenstra, & Lovász (1982) algorithm to find an approximately-shortest vector reveals the correct subset.
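A sketch of how the Lagarias & Odlyzko basis above could be assembled (NumPy only). Actually recovering the short vector requires an LLL implementation (e.g., fpylll or Sage), which is not shown, and the choice of β in the commented example is an illustrative placeholder rather than the value dictated by the analysis.

```python
import numpy as np

def lagarias_odlyzko_basis(c, T, beta):
    """Columns b_0, b_1, ..., b_N of the lattice basis

        [ b_0  b_1 ... b_N ] = [ 0       I_N                   ]
                               [ beta*T  -beta*c_1 ... -beta*c_N ]

    for a subset-sum instance with source numbers c and target T.  Adding b_0 to
    the columns b_j indexed by a subset S gives a vector whose last coordinate is
    beta*(T - sum_{j in S} c_j), so the correct subset yields a short vector."""
    N = len(c)
    B = np.zeros((N + 1, N + 1))
    B[:N, 1:] = np.eye(N)                # top block: identity on b_1, ..., b_N
    B[N, 0] = beta * T                   # bottom row: beta*T, -beta*c_1, ..., -beta*c_N
    B[N, 1:] = -beta * np.asarray(c)
    return B

# For our instance (see the previous snippet): source numbers C.ravel() and target
# y0; a short lattice vector found by LLL would indicate which c_{i,j} participate,
# i.e., the pairs with pi_bar(i) = j.
# B = lagarias_odlyzko_basis(C.ravel(), y0, beta=1e6)
```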

Our random subset sum instance
Catch: Our source numbers c_{i,j} = y_i x_j^⊤ x_0 are not independent, and not uniformly distributed on some wide interval of Z.
▶ Instead, they have a joint density derived from N(0, 1).
▶ To show that the Lagarias & Odlyzko reduction still works, we need Gaussian anti-concentration for quadratic and quartic forms.
Key lemma: (w.h.p.) for every Z ∈ Z^{d×d} that is not an integer multiple of the permutation matrix corresponding to π̄,
  | T − ∑_{i,j} Z_{i,j} · c_{i,j} | ≥ ∥w̄∥_2 / 2^{poly(d)}.

Some details
▶ In general, x_1, x_2, ..., x_n are not (exactly) orthonormal, but a similar reduction works via the Moore-Penrose pseudoinverse.
▶ The reduction uses real coefficients in the lattice basis. For LLL to run in poly-time, we need to round the coefficients of (x_i)_{i=0}^n and w̄ to finite-precision rational numbers. This is similar to drawing (x_i)_{i=0}^n iid from a discretized N(0, I_d).
▶ The algorithm strongly exploits the assumption of noise-free measurements; it likely fails in the presence of noise.
▶ A similar algorithm was used by Andoni, H., Shi, & Sun (2017) for different problems (phase retrieval / correspondence retrieval).

Connections to prior works
▶ Unnikrishnan, Haghighatshoar, & Vetterli (2015)
  Recall: [UHV'15] show that n ≥ 2d measurements are necessary to uniquely determine every w̄ ∈ R^d.
  Our result: For a fixed w̄ ∈ R^d, d + 1 measurements suffice to recover w̄; the same covariate vectors may fail for some other w̄′ ∈ R^d. (C.f. "for all" vs. "for each" results in compressive sensing.)
▶ Pananjady, Wainwright, & Courtade (2016)
  Noise-free setting: the signal-to-noise conditions are trivially satisfied whenever w̄ ≠ 0.
  Noisy setting: recovering w̄ may be easier than recovering π̄.
Next: Limits for recovering w̄.

3. Lower bounds on SNR for approximate recovery

Setting
Linear model with Gaussian noise:
  y_i = w̄^⊤ x_{π̄(i)} + ε_i, i ∈ [n]
▶ Covariate vectors: (x_i)_{i=1}^n iid from P
▶ Measurement errors: (ε_i)_{i=1}^n iid from N(0, σ²)
▶ Unknown linear function: w̄ ∈ R^d
▶ Unknown permutation: π̄ ∈ S_n
Equivalent: ignore π̄; observe (x_i)_{i=1}^n and {y_i}_{i=1}^n (where {·} denotes an unordered multiset).
We consider P = N(0, I_d) and P = Uniform([−1, 1]^d).
Note: If the correspondence between (x_i)_{i=1}^n and {y_i}_{i=1}^n is known, then SNR ≳ d/n suffices to approximately recover w̄.

Uniform case
Theorem. If (x_i)_{i=1}^n are iid draws from Uniform([−1, 1]^d), (y_i)_{i=1}^n follow the linear model with N(0, σ²) noise, and SNR ≤ (1 − 2c)² for some c ∈ (0, 1/2), then for any estimator ŵ, there exists w̄ ∈ R^d such that
  E[ ∥ŵ − w̄∥_2 ] ≥ c ∥w̄∥_2.
Increasing the sample size n does not help, unlike in the "known correspondence" setting (where SNR ≳ d/n suffices).

Proof sketch
We show that no estimator can confidently distinguish between w̄ = e_1 and w̄ = −e_1, where e_1 = (1, 0, ..., 0)^⊤.
Let P_w̄ be the data distribution with parameter w̄ ∈ {e_1, −e_1}. Task: show P_{e_1} and P_{−e_1} are "close", then appeal to Le Cam's standard "two-point argument":
  max_{w̄ ∈ {e_1, −e_1}} E_{P_w̄}[ ∥ŵ − w̄∥_2 ] ≥ 1 − ∥P_{e_1} − P_{−e_1}∥_tv.
Key idea: the conditional means of {y_i}_{i=1}^n given (x_i)_{i=1}^n, under P_{e_1} and P_{−e_1}, are close as unordered multisets.

Proof sketch (continued)
Generative process for P_w̄:
1. Draw (x_i)_{i=1}^n iid from Uniform([−1, 1]^d) and (ε_i)_{i=1}^n iid from N(0, σ²).
2. Set u_i := w̄^⊤ x_i for i ∈ [n].
3. Set y_i := u_{(i)} + ε_i for i ∈ [n], where u_{(1)} ≤ u_{(2)} ≤ ⋯ ≤ u_{(n)}.
Conditional distribution of y = (y_1, y_2, ..., y_n) given (x_i)_{i=1}^n:
  Under P_{e_1}:   y | (x_i)_{i=1}^n ~ N(u↑, σ² I_n)
  Under P_{−e_1}:  y | (x_i)_{i=1}^n ~ N(−u↓, σ² I_n)
where u↑ = (u_{(1)}, u_{(2)}, ..., u_{(n)}) and u↓ = (u_{(n)}, u_{(n−1)}, ..., u_{(1)}).
Data processing: we lose information by going from y to {y_i}_{i=1}^n.
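A small simulation of this generative process, assuming NumPy. It also illustrates the quantity driving the bound on the next slide: with u_i = e_1^⊤ x_i uniform on [−1, 1], the reversed order statistics nearly cancel the sorted ones, so ∥u↑ + u↓∥²_2 stays bounded even as n grows. The parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, d = 0.5, 3

for n in [100, 1000, 10000]:
    X = rng.uniform(-1.0, 1.0, size=(n, d))
    u = X[:, 0]                     # u_i = e_1^T x_i, i.e., w_bar = e_1
    u_up = np.sort(u)               # u_(1) <= ... <= u_(n)
    u_down = u_up[::-1]             # reversed order statistics
    y = u_up + rng.normal(0.0, sigma, size=n)   # responses, observed as a multiset
    # Under w_bar = -e_1 the conditional mean of y would be -u_down instead of u_up;
    # the difference between the two means is u_up + u_down, whose norm stays O(1):
    print(n, round(float(np.sum((u_up + u_down) ** 2)), 3))
```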

Proof sketch (continued)
By the data processing inequality,
  KL( P_{e_1}(· | (x_i)_{i=1}^n), P_{−e_1}(· | (x_i)_{i=1}^n) )
    ≤ KL( N(u↑, σ² I_n), N(−u↓, σ² I_n) )
    = ∥u↑ − (−u↓)∥²_2 / (2σ²)
    = (SNR / 2) · ∥u↑ + u↓∥²_2.
Some computations show that med ∥u↑ + u↓∥²_2 ≤ 4.
By conditioning + Pinsker's inequality,
  ∥P_{e_1} − P_{−e_1}∥_tv ≤ 1/2 + (1/2) √( (SNR / 4) · med ∥u↑ + u↓∥²_2 ) ≤ 1/2 + (1/2) √SNR.
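For completeness, the KL step above is the standard identity for Gaussians with equal covariance, combined with the convention (assumed here, since it is not restated on the slide) that SNR := ∥w̄∥²_2 / σ², which equals 1/σ² since w̄ ∈ {e_1, −e_1}:

```latex
% KL divergence between Gaussians with the same covariance \sigma^2 I_n:
%   KL( N(\mu_1, \sigma^2 I_n) \| N(\mu_2, \sigma^2 I_n) ) = \|\mu_1 - \mu_2\|_2^2 / (2\sigma^2).
% With \mu_1 = u^{\uparrow} and \mu_2 = -u^{\downarrow}:
\mathrm{KL}\bigl(\mathrm{N}(u^{\uparrow}, \sigma^2 I_n)\,\big\|\,\mathrm{N}(-u^{\downarrow}, \sigma^2 I_n)\bigr)
  = \frac{\|u^{\uparrow} + u^{\downarrow}\|_2^2}{2\sigma^2}
  = \frac{\mathrm{SNR}}{2}\,\|u^{\uparrow} + u^{\downarrow}\|_2^2,
\qquad \mathrm{SNR} = \frac{\|\bar{w}\|_2^2}{\sigma^2} = \frac{1}{\sigma^2}.
```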

Gaussian case
Theorem. If (x_i)_{i=1}^n are iid draws from N(0, I_d), (y_i)_{i=1}^n follow the linear model with N(0, σ²) noise, and
  SNR ≤ C · d / log log(n)
for some absolute constant C > 0, then for any estimator ŵ, there exists w̄ ∈ R^d such that
  E[ ∥ŵ − w̄∥_2 ] ≥ C′ ∥w̄∥_2
for some other absolute constant C′ > 0.
C.f. the "known correspondence" setting, where SNR ≳ d/n suffices.

