Linear regression without correspondence
Daniel Hsu
Columbia University
October 3, 2017
Joint work with Kevin Shi (Columbia University) and Xiaorui Sun (Microsoft Research).
Linear regression without correspondence
▶ Covariate vectors: x_1, x_2, ..., x_n ∈ R^d
▶ Responses: y_1, y_2, ..., y_n ∈ R
▶ Model: y_i = w̄^⊤ x_{π̄(i)} + ε_i,  i ∈ [n]
▶ Unknown linear function: w̄ ∈ R^d
▶ Unknown permutation: π̄ ∈ S_n
▶ Measurement errors: ε_1, ε_2, ..., ε_n ∈ R (e.g., (ε_i)_{i=1}^n iid from N(0, σ²))
Correspondence between (x_i)_{i=1}^n and (y_i)_{i=1}^n is unknown.
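To make the setup concrete, here is a minimal Python sketch (my own illustration; the dimensions, noise level, and seed are arbitrary) of the data-generating process just described:

```python
import numpy as np

# Sketch of the data-generating process for linear regression without correspondence.
rng = np.random.default_rng(0)
n, d, sigma = 200, 5, 0.1
X = rng.standard_normal((n, d))        # covariate vectors x_1, ..., x_n (rows)
w_bar = rng.standard_normal(d)         # unknown linear function
pi_bar = rng.permutation(n)            # unknown permutation
eps = sigma * rng.standard_normal(n)   # measurement errors
y = X[pi_bar] @ w_bar + eps            # y_i = w_bar^T x_{pi_bar(i)} + eps_i
# The learner observes X and y but not pi_bar: the correspondence is unknown.
```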
Example #1: pose and correspondence estimation
▶ A 3D object is captured as a 2D image.
▶ Some known 3D points on the object map to 2D points in the image.
Goal: Find the mapping between points on the object and points in the image. The perspective projection is unknown.
Example #2: flow cytometry
▶ Cells pass one at a time through a laser (via hydrodynamic focusing), and emitted light is measured using photomultipliers.
Goal: Learn the relationship between measurements and cell properties. The order in which cells pass through the laser is unknown.
Prior works — statistical / information-theoretic issues
Unnikrishnan, Haghighatshoar, & Vetterli (2015)
Question: If (x_i)_{i=1}^n are iid from a continuous distribution on R^d, how large must n be so that noiseless measurements uniquely determine every w̄ ∈ R^d?
Answer: n ≥ 2d is necessary and sufficient.
Elhami, Scholefield, Haro, & Vetterli (2017)
Explicit construction in R^2: for n ≥ 4,
x_i := [cos(φ_i), sin(φ_i)]^⊤  where  φ_i := 2π · (2^{i−1} − 1) / (2^n − 1),  i ∈ [n].
Prior works — statistical / information-theoretic issues
Pananjady, Wainwright, & Courtade (2016)
Question: If (x_i)_{i=1}^n are iid from N(0, I_d) and (ε_i)_{i=1}^n are iid from N(0, σ²), how large must the signal-to-noise ratio SNR = ∥w̄∥₂²/σ² be so that π̄ can be recovered?
Answer: log(1 + SNR) ≳ log(n) is necessary and sufficient. Achieved by the maximum likelihood / least squares estimator:
(ŵ_mle, π̂_mle) := arg min_{w ∈ R^d, π ∈ S_n} Σ_{i=1}^n (y_i − w^⊤ x_{π(i)})².
Note: If the correspondence between (x_i)_{i=1}^n and (y_i)_{i=1}^n is known (i.e., the standard linear regression setting), then SNR ≳ d/n suffices.
Prior works — computational issues
Pananjady, Wainwright, & Courtade (2016)
Least squares problem: Given (x_i)_{i=1}^n from R^d and (y_i)_{i=1}^n from R, find
(ŵ_mle, π̂_mle) := arg min_{w ∈ R^d, π ∈ S_n} Σ_{i=1}^n (y_i − w^⊤ x_{π(i)})².
▶ d = 1: O(n log n)-time algorithm based on sorting.
▶ d = Ω(n): NP-hard.
Elhami, Scholefield, Haro, & Vetterli (2017)
▶ d = 2: O(n³)-time algorithm with ∥ŵ − w̄∥₂ ≤ O(2^n · ∥ε∥_∞) when (x_i)_{i=1}^n are exponentially spaced (in angle) on the unit circle.
Our contributions
▶ Approximation algorithm: a (1 + ϵ)-approximate solution to the least squares problem in time (n/ϵ)^{O(k)} + poly(n, d), where k = dim(span(x_1, ..., x_n)).
▶ Exact recovery*: an efficient algorithm that recovers w̄ and π̄ (with high probability) when (x_i)_{i=1}^n are iid from N(0, I_d), ε_i ≡ 0, and n ≥ d + 1. (∗After appropriate discretization.)
▶ Lower bounds: SNR conditions necessary for approximate recovery of w̄ when (ε_i)_{i=1}^n are iid from N(0, σ²), and (x_i)_{i=1}^n are iid from N(0, I_d) or Uniform([−1, 1]^d).
Least squares problem
Given (x_i)_{i=1}^n from R^d and (y_i)_{i=1}^n from R, find
(ŵ_mle, π̂_mle) := arg min_{w ∈ R^d, π ∈ S_n} Σ_{i=1}^n (y_i − w^⊤ x_{π(i)})².
▶ d = 1: O(n log n)-time algorithm based on sorting [PWC'16].
▶ d = Ω(n): NP-hard to decide if opt = 0 [PWC'16].
▶ Naïve brute-force search: Ω(|S_n|) = Ω(n!).
Least squares problem (d = 1)
Given (x_i)_{i=1}^n and (y_i)_{i=1}^n from R, find
(ŵ_mle, π̂_mle) := arg min_{w ∈ R, π ∈ S_n} Σ_{i=1}^n (y_i − w x_{π(i)})².
Fix w ∈ R, and suppose (WLOG) w ≥ 0. Then
min_{π ∈ S_n} Σ_{i=1}^n (y_i − w x_{π(i)})² = Σ_{j=1}^n (y_{(j)} − w x_{(j)})²
where x_{(1)} ≤ x_{(2)} ≤ ··· ≤ x_{(n)} and y_{(1)} ≤ y_{(2)} ≤ ··· ≤ y_{(n)}.
∴ As observed by [PWC'16], we can find π̂_mle (and ŵ_mle) by sorting.
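A minimal Python sketch of this sorting-based solver (my own code): for each sign of w, the optimal matching pairs sorted x with sorted y (ascending for w ≥ 0, descending for w ≤ 0), after which the one-dimensional least squares fit for w is closed-form.

```python
import numpy as np

def ls_without_correspondence_1d(x, y):
    """d = 1 solver via sorting; returns (w, objective)."""
    xs, ys = np.sort(x), np.sort(y)

    def fit(sign):
        yy = ys if sign > 0 else ys[::-1]      # optimal matching for this sign of w
        denom = np.dot(xs, xs)
        w = np.dot(xs, yy) / denom if denom > 0 else 0.0
        w = max(w, 0.0) if sign > 0 else min(w, 0.0)   # enforce the sign constraint
        return w, np.sum((yy - w * xs) ** 2)

    return min(fit(+1), fit(-1), key=lambda t: t[1])
```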
Alternating minimization
Pick an initial ŵ ∈ R^d (e.g., randomly). Loop until convergence:
π̂ ← arg min_{π ∈ S_n} Σ_{i=1}^n (y_i − ŵ^⊤ x_{π(i)})²
ŵ ← arg min_{w ∈ R^d} Σ_{i=1}^n (y_i − w^⊤ x_{π̂(i)})²
▶ Each loop-iteration is efficiently computable.
▶ But it can get stuck in local minima, so try many initial ŵ ∈ R^d.
(Questions: How many restarts? How many iterations?)
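A minimal Python sketch of alternating minimization with random restarts (my own code; the restart and iteration counts are arbitrary). For a fixed ŵ, the optimal π matches sorted responses with sorted predictions, by the same argument as in the d = 1 case.

```python
import numpy as np

def alt_min(X, y, n_restarts=20, n_iters=50, seed=0):
    """Alternating minimization heuristic; returns the best (w, perm, objective) found."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    best = (None, None, np.inf)
    for _ in range(n_restarts):
        w = rng.standard_normal(d)
        for _ in range(n_iters):
            # pi-step: match sorted responses with sorted predictions.
            perm = np.empty(n, dtype=int)
            perm[np.argsort(y)] = np.argsort(X @ w)   # y_i is paired with x_{perm[i]}
            # w-step: ordinary least squares against the permuted covariates.
            w, *_ = np.linalg.lstsq(X[perm], y, rcond=None)
        obj = np.sum((y - X[perm] @ w) ** 2)
        if obj < best[2]:
            best = (w, perm, obj)
    return best
```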
Approximation result
Theorem
There is an algorithm that, given any inputs (x_i)_{i=1}^n, (y_i)_{i=1}^n, and ϵ ∈ (0, 1), returns a (1 + ϵ)-approximate solution to the least squares problem in time (n/ϵ)^{O(k)} + poly(n, d), where k = dim(span(x_1, ..., x_n)).
Beating brute-force search: "realizable" case
"Realizable" case: Suppose there exist w⋆ ∈ R^d and π⋆ ∈ S_n such that
y_i = w⋆^⊤ x_{π⋆(i)},  i ∈ [n].
The solution is determined by the action of π⋆ on d points (assume dim(span(x_1, ..., x_n)) = d).
Algorithm (a code sketch follows below):
▶ Find a subset of d linearly independent points x_{i_1}, x_{i_2}, ..., x_{i_d}.
▶ "Guess" the values of π⋆^{−1}(i_j) ∈ [n], j ∈ [d].
▶ Solve the linear system y_{π⋆^{−1}(i_j)} = w^⊤ x_{i_j}, j ∈ [d], for w ∈ R^d.
▶ To check correctness of ŵ: compute ŷ_i := ŵ^⊤ x_i, i ∈ [n], and check whether min_{π ∈ S_n} Σ_{i=1}^n (y_i − ŷ_{π(i)})² = 0.
"Guess" means "enumerate over ≤ n^d choices"; the rest is poly(n, d).
Beating brute-force search: general case
General case: the solution may not be determined by only d points.
But, for any right-hand side b ∈ R^n, there exist x_{i_1}, x_{i_2}, ..., x_{i_d} such that every ŵ ∈ arg min_{w ∈ R^d} Σ_{j=1}^d (b_{i_j} − w^⊤ x_{i_j})² satisfies
Σ_{i=1}^n (b_i − ŵ^⊤ x_i)² ≤ (d + 1) · min_{w ∈ R^d} Σ_{i=1}^n (b_i − w^⊤ x_i)².
(Follows from a result of Dereziński and Warmuth (2017) on volume sampling.)
⇒ n^{O(d)}-time algorithm with approximation ratio d + 1.
Better way to get 1 + ϵ: exploit first-order optimality conditions (i.e., "normal equations") and ϵ-nets. Overall time: (n/ϵ)^{O(k)} + poly(n, d) for k = dim(span(x_1, ..., x_n)).
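For concreteness, here is a Python sketch (my own code, not the faster (1 + ϵ) algorithm based on normal equations and ϵ-nets) of the n^{O(d)} enumeration that the volume-sampling fact justifies: try every choice of d covariates and d responses to pair with them, fit by least squares, and score each candidate w with the sorting trick.

```python
import itertools
import numpy as np

def best_perm_objective(X, y, w):
    """min over permutations of sum_i (y_i - w^T x_{pi(i)})^2, via sorting."""
    return np.sum((np.sort(y) - np.sort(X @ w)) ** 2)

def approx_fit(X, y):
    """(d+1)-approximation by enumerating d covariates and d responses paired with them."""
    n, d = X.shape
    best_w, best_obj = None, np.inf
    for cols in itertools.combinations(range(n), d):
        A = X[list(cols)]
        for resp in itertools.permutations(range(n), d):
            w, *_ = np.linalg.lstsq(A, y[list(resp)], rcond=None)
            obj = best_perm_objective(X, y, w)
            if obj < best_obj:
                best_w, best_obj = w, obj
    return best_w, best_obj
```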
Remarks
▶ The algorithm is justified in the statistical setting by the results of [PWC'16] for the MLE, but its guarantees also hold when the inputs are worst-case.
▶ The algorithm is poly-time only when k = O(1).
Open problems:
▶ Faster algorithms? (Perhaps in an average-case or smoothed setting.)
▶ Analysis of alternating minimization? (Similar to Lloyd's algorithm for Euclidean k-means.)
Next: an algorithm for the noise-free average-case setting.
Setting
Noise-free Gaussian linear model (with n + 1 measurements):
y_i = w̄^⊤ x_{π̄(i)},  i ∈ {0, 1, ..., n}
▶ Covariate vectors: (x_i)_{i=0}^n iid from N(0, I_d)
▶ Unknown linear function: w̄ ∈ R^d
▶ Unknown permutation: π̄, a permutation of {0, 1, ..., n}
"Equivalent" problem: We're promised that π̄(0) = 0, so we can just consider π̄ as an unknown permutation over {1, 2, ..., n}.
Number of measurements: If n + 1 ≥ d, then recovery of π̄ gives exact recovery of w̄ (a.s.). We'll assume n + 1 ≥ d + 1 (i.e., n ≥ d).
Claim: n ≥ d suffices to recover π̄ with high probability.
Exact recovery result
Theorem
Fix any w̄ ∈ R^d and π̄ ∈ S_n, and assume n ≥ d. Suppose (x_i)_{i=0}^n are drawn iid from N(0, I_d), and (y_i)_{i=0}^n satisfy
y_0 = w̄^⊤ x_0;  y_i = w̄^⊤ x_{π̄(i)},  i ∈ [n].
There is an algorithm that, given inputs (x_i)_{i=0}^n and (y_i)_{i=0}^n, returns π̄ and w̄ with high probability.
Main idea: hidden subset
Measurements: y_0 = w̄^⊤ x_0;  y_i = w̄^⊤ x_{π̄(i)},  i ∈ [n].
For simplicity: assume n = d, and x_1, x_2, ..., x_d orthonormal.
y_0 = w̄^⊤ x_0
    = Σ_{j=1}^d (w̄^⊤ x_j)(x_j^⊤ x_0)
    = Σ_{j=1}^d y_{π̄^{−1}(j)} (x_j^⊤ x_0)
    = Σ_{i=1}^d Σ_{j=1}^d 1{π̄(i) = j} · y_i (x_j^⊤ x_0).
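A quick numerical check of this identity in Python (my own sketch; the dimension and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # columns of Q: orthonormal x_1, ..., x_d
x0, w = rng.standard_normal(d), rng.standard_normal(d)
perm = rng.permutation(d)                         # pi_bar, as a 0-indexed array
y = Q[:, perm].T @ w                              # y_i = w^T x_{pi_bar(i)}
y0 = w @ x0
C = np.outer(y, Q.T @ x0)                         # source numbers c_{i,j} = y_i (x_j^T x_0)
hidden_subset_sum = sum(C[i, perm[i]] for i in range(d))   # entries with pi_bar(i) = j
assert np.isclose(y0, hidden_subset_sum)
```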
Reduction to subset sum
y_0 = Σ_{i=1}^d Σ_{j=1}^d 1{π̄(i) = j} · y_i (x_j^⊤ x_0)
▶ d² "source" numbers c_{i,j} := y_i (x_j^⊤ x_0); "target" sum T := y_0. We are promised that a size-d subset of the c_{i,j} sums to T.
▶ The correct subset corresponds to the (i, j) ∈ [d]² such that π̄(i) = j.
Next: How to solve Subset Sum efficiently?
Reducing subset sum to the shortest vector problem
Lagarias & Odlyzko (1983): random instances of Subset Sum are efficiently solvable when the N source numbers are chosen independently and u.a.r. from a sufficiently wide interval of Z.
Main idea: (w.h.p.) every incorrect subset will "miss" the target sum T by a noticeable amount.
Reduction: construct a lattice basis in R^{N+1} such that
▶ the correct subset of basis vectors gives a short lattice vector v⋆;
▶ any other lattice vector not parallel to v⋆ is more than 2^{N/2}-times longer.
[ b_0  b_1  ···  b_N ] = [ 0     I_N
                           βT   −βc_1 ··· −βc_N ]   for sufficiently large β > 0.
Using the Lenstra, Lenstra, & Lovász (1982) algorithm to find an approximately-shortest vector reveals the correct subset.
Our random subset sum instance
Catch: Our source numbers c_{i,j} = y_i x_j^⊤ x_0 are not independent, and not uniformly distributed on some wide interval of Z.
▶ Instead, they have a joint density derived from N(0, 1).
▶ To show that the Lagarias & Odlyzko reduction still works, we need Gaussian anti-concentration for quadratic and quartic forms.
Key lemma: (w.h.p.) for every Z ∈ Z^{d×d} that is not an integer multiple of the permutation matrix corresponding to π̄,
| Σ_{i,j} Z_{i,j} · c_{i,j} | ≥ (1 / 2^{poly(d)}) · ∥w̄∥₂.
Some details
▶ In general, x_1, x_2, ..., x_n are not (exactly) orthonormal, but a similar reduction works via the Moore–Penrose pseudoinverse.
▶ The reduction uses real coefficients in the lattice basis. For LLL to run in poly-time, we need to round the (x_i)_{i=0}^n and w̄ coefficients to finite-precision rational numbers. This is similar to drawing (x_i)_{i=0}^n iid from a discretized N(0, I_d).
▶ The algorithm strongly exploits the assumption of noise-free measurements; it likely fails in the presence of noise.
▶ A similar algorithm is used by Andoni, H., Shi, & Sun (2017) for different problems (phase retrieval / correspondence retrieval).
Connections to prior works
▶ Unnikrishnan, Haghighatshoar, & Vetterli (2015)
Recall: [UHV'15] show that n ≥ 2d is necessary for measurements to uniquely determine every w̄ ∈ R^d.
Our result: For a fixed w̄ ∈ R^d, d + 1 measurements suffice to recover w̄; the same covariate vectors may fail for other w̄′ ∈ R^d. (Cf. "for all" vs. "for each" results in compressive sensing.)
▶ Pananjady, Wainwright, & Courtade (2016)
Noise-free setting: the signal-to-noise conditions are trivially satisfied whenever w̄ ≠ 0.
Noisy setting: recovering w̄ may be easier than recovering π̄.
Next: Limits for recovering w̄.
Setting
Linear model with Gaussian noise:
y_i = w̄^⊤ x_{π̄(i)} + ε_i,  i ∈ [n]
▶ Covariate vectors: (x_i)_{i=1}^n iid from P
▶ Measurement errors: (ε_i)_{i=1}^n iid from N(0, σ²)
▶ Unknown linear function: w̄ ∈ R^d
▶ Unknown permutation: π̄ ∈ S_n
Equivalent: ignore π̄; observe (x_i)_{i=1}^n and {y_i}_{i=1}^n (where {·} denotes an unordered multi-set).
We consider P = N(0, I_d) and P = Uniform([−1, 1]^d).
Note: If the correspondence between (x_i)_{i=1}^n and {y_i}_{i=1}^n is known, then SNR ≳ d/n suffices to approximately recover w̄.
Uniform case
Theorem
If (x_i)_{i=1}^n are iid draws from Uniform([−1, 1]^d), (y_i)_{i=1}^n follow the linear model with N(0, σ²) noise, and SNR ≤ (1 − 2c)² for some c ∈ (0, 1/2), then for any estimator ŵ, there exists w̄ ∈ R^d such that
E[ ∥ŵ − w̄∥₂ ] ≥ c ∥w̄∥₂.
Increasing the sample size n does not help, unlike in the "known correspondence" setting (where SNR ≳ d/n suffices).
Proof sketch
We show that no estimator can confidently distinguish between w̄ = e_1 and w̄ = −e_1, where e_1 = (1, 0, ..., 0)^⊤.
Let P_{w̄} be the data distribution with parameter w̄ ∈ {e_1, −e_1}.
Task: show that P_{e_1} and P_{−e_1} are "close", then appeal to Le Cam's standard "two-point argument":
max_{w̄ ∈ {e_1, −e_1}} E_{P_{w̄}} ∥ŵ − w̄∥₂ ≥ 1 − ∥P_{e_1} − P_{−e_1}∥_tv.
Key idea: the conditional means of {y_i}_{i=1}^n given (x_i)_{i=1}^n, under P_{e_1} and P_{−e_1}, are close as unordered multi-sets.
Proof sketch (continued)
Generative process for P_{w̄}:
▶ Draw (x_i)_{i=1}^n iid ∼ Uniform([−1, 1]^d) and (ε_i)_{i=1}^n iid ∼ N(0, σ²).
▶ Set u_i := e_1^⊤ x_i (the first coordinate of x_i) for i ∈ [n], so that w̄^⊤ x_i = ±u_i.
Conditional distribution of y = (y_1, y_2, ..., y_n) given (x_i)_{i=1}^n, with a convenient ordering of the coordinates (the ordering does not affect the observed multi-set):
Under P_{e_1}:  y | (x_i)_{i=1}^n ∼ N(u↑, σ²I_n)
Under P_{−e_1}: y | (x_i)_{i=1}^n ∼ N(−u↓, σ²I_n)
where u↑ = (u_{(1)}, u_{(2)}, ..., u_{(n)}) and u↓ = (u_{(n)}, u_{(n−1)}, ..., u_{(1)}).
Data processing: We lose information by going from y to {y_i}_{i=1}^n.
Proof sketch (continued)
By the data processing inequality,
KL( P_{e_1}(· | (x_i)_{i=1}^n), P_{−e_1}(· | (x_i)_{i=1}^n) )
  ≤ KL( N(u↑, σ²I_n), N(−u↓, σ²I_n) )
  = ∥u↑ − (−u↓)∥₂² / (2σ²)
  = (SNR / 2) · ∥u↑ + u↓∥₂².
Some computations show that med ∥u↑ + u↓∥₂² ≤ 4.
By conditioning + Pinsker's inequality,
∥P_{e_1} − P_{−e_1}∥_tv ≤ 1/2 + (1/2) · med √( (SNR / 4) · ∥u↑ + u↓∥₂² ) ≤ 1/2 + (1/2) √SNR.
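A quick Monte Carlo sketch (my own code; here u_i is the first coordinate of x_i ∼ Uniform([−1, 1]^d), hence Uniform([−1, 1])) to compare the empirical median of ∥u↑ + u↓∥₂² with the bound of 4 used above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 2000
vals = []
for _ in range(trials):
    u_up = np.sort(rng.uniform(-1.0, 1.0, size=n))    # u_(1) <= ... <= u_(n)
    vals.append(np.sum((u_up + u_up[::-1]) ** 2))     # u_down is u_up reversed
print(np.median(vals))   # compare against the bound of 4
```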
Gaussian case
Theorem
If (x_i)_{i=1}^n are iid draws from N(0, I_d), (y_i)_{i=1}^n follow the linear model with N(0, σ²) noise, and SNR ≤ C · d / log log(n) for some absolute constant C > 0, then for any estimator ŵ, there exists w̄ ∈ R^d such that
E[ ∥ŵ − w̄∥₂ ] ≥ C′ ∥w̄∥₂
for some other absolute constant C′ > 0.
Cf. the "known correspondence" setting, where SNR ≳ d/n suffices.
Closing remarks and open problems
Lack of correspondence changes both the computational and statistical difficulty of linear regression.
▶ Algorithms shed light on the computational difficulty in worst-case and average-case settings.
▶ SNR lower bounds show a striking contrast to the "known correspondence" setting.
▶ A gap remains between the SNR lower and upper bounds.
E.g., the N(0, I_d) case: SNR = O(d / log log n): fails; SNR = Ω(n^c): succeeds [PWC'16]; SNR = ω(1): succeeds (new!).
▶ Is the MLE (near) optimal for recovering w̄?
▶ N(0, I_d) vs. Uniform([−1, 1]^d)?
▶ Faster algorithms? (Smoothed) analysis of alternating minimization?
Acknowledgements
Collaborators: Kevin Shi (Columbia), Xiaorui Sun (Simons Institute).
Discussants: Ashwin Pananjady (UCB), Michał Dereziński (UCSC), Manfred Warmuth (UCSC).
Funding: NSF (DMR-1534910, IIS-1563785), Sloan Research Fellowship, Bloomberg Data Science Research Grant.
Hospitality: Simons Institute for the Theory of Computing (UCB).
See the preprint for details & references: arxiv.org/abs/1705.07048
Thank you