Linear regression without correspondence

Daniel Hsu

Columbia University

October 3, 2017

Joint work with Kevin Shi (Columbia University) and Xiaorui Sun (Microsoft Research).

Linear regression without correspondence

▶ Covariate vectors: $x_1, x_2, \ldots, x_n \in \mathbb{R}^d$
▶ Responses: $y_1, y_2, \ldots, y_n \in \mathbb{R}$
▶ Model: $y_i = \bar{w}^\top x_{\bar{\pi}(i)} + \varepsilon_i$, $i \in [n]$
▶ Unknown linear function: $\bar{w} \in \mathbb{R}^d$
▶ Unknown permutation: $\bar{\pi} \in S_n$
▶ Measurement errors: $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n \in \mathbb{R}$ (e.g., $(\varepsilon_i)_{i=1}^n$ iid from $N(0, \sigma^2)$)

Correspondence between $(x_i)_{i=1}^n$ and $(y_i)_{i=1}^n$ is unknown.
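A minimal sketch (not from the talk) of sampling synthetic data from this model, to make the setup concrete; all parameter choices are illustrative.

```python
# Sample from the shuffled linear model y_i = w̄ᵀ x_{π̄(i)} + ε_i.
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 50, 3, 0.1

X = rng.standard_normal((n, d))        # covariate vectors x_1, ..., x_n
w_bar = rng.standard_normal(d)         # unknown linear function w̄
pi_bar = rng.permutation(n)            # unknown permutation π̄
eps = sigma * rng.standard_normal(n)   # measurement errors ε_i ~ N(0, σ²)

y = X[pi_bar] @ w_bar + eps            # row i of X[pi_bar] is x_{π̄(i)}
# The learner observes X and y, but not pi_bar (the correspondence).
```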
Example #1: pose and correspondence estimation

▶ 3D object is captured as a 2D image.
▶ Some known 3D points on object map to 2D points in image.

Goal: Find mapping between points on object and points in image. Perspective projection unknown.
Example #2: flow cytometry

1. Suspend population of cells in a fluid.
2. Pass cells, one at a time, through laser (via hydrodynamic focusing), and measure emitted light using photomultipliers.

[Figure: schematic of a flow cytometer.]

Goal: Learn relationship between measurements and cell properties. Order in which cells pass through laser is unknown.
Prior works — statistical / information-theoretic issues

Unnikrishnan, Haghighatshoar, & Vetterli (2015)
Question: If $(x_i)_{i=1}^n$ are iid from a continuous distribution on $\mathbb{R}^d$, then how large must $n$ be so that noiseless measurements uniquely determine every $\bar{w} \in \mathbb{R}^d$?
Answer: $n \geq 2d$ is necessary and sufficient.

Elhami, Scholefield, Haro, & Vetterli (2017)
Explicit construction in $\mathbb{R}^2$: for $n \geq 4$,
$$x_i := \begin{bmatrix} \cos(\varphi_i) \\ \sin(\varphi_i) \end{bmatrix} \quad \text{where} \quad \varphi_i := 2\pi \cdot \frac{2^{i-1} - 1}{2^n - 1}, \quad i \in [n].$$
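A small sketch (my own rendering of the slide's formula; the helper name `elhami_points` is hypothetical) of this explicit construction:

```python
# Points on the unit circle at exponentially spaced angles (Elhami et al., 2017).
import numpy as np

def elhami_points(n):
    # phi_i = 2π · (2^{i-1} − 1) / (2^n − 1),  i = 1, ..., n
    i = np.arange(1, n + 1)
    phi = 2 * np.pi * (2.0 ** (i - 1) - 1) / (2.0 ** n - 1)
    return np.stack([np.cos(phi), np.sin(phi)], axis=1)  # x_i = (cos φ_i, sin φ_i)

print(elhami_points(4))
```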
Prior works — statistical / information-theoretic issues

Pananjady, Wainwright, & Courtade (2016)
Question: If $(x_i)_{i=1}^n$ are iid from $N(0, I_d)$ and $(\varepsilon_i)_{i=1}^n$ are iid from $N(0, \sigma^2)$, then how large must the signal-to-noise ratio $\mathrm{SNR} = \|\bar{w}\|_2^2/\sigma^2$ be so that $\bar{\pi}$ can be recovered?
Answer: $\log(1 + \mathrm{SNR}) \gtrsim \log(n)$ is necessary and sufficient. Achieved by the maximum likelihood / least squares estimator:
$$(\hat{w}_{\mathrm{mle}}, \hat{\pi}_{\mathrm{mle}}) := \arg\min_{w \in \mathbb{R}^d,\, \pi \in S_n} \sum_{i=1}^n \left( y_i - w^\top x_{\pi(i)} \right)^2.$$

Note: If the correspondence between $(x_i)_{i=1}^n$ and $(y_i)_{i=1}^n$ is known (i.e., the standard linear regression setting), then we just need $\mathrm{SNR} \gtrsim d/n$.
Prior works — computational issues

Pananjady, Wainwright, & Courtade (2016)
Least squares problem: Given $(x_i)_{i=1}^n$ from $\mathbb{R}^d$ and $(y_i)_{i=1}^n$ from $\mathbb{R}$, find
$$(\hat{w}_{\mathrm{mle}}, \hat{\pi}_{\mathrm{mle}}) := \arg\min_{w \in \mathbb{R}^d,\, \pi \in S_n} \sum_{i=1}^n \left( y_i - w^\top x_{\pi(i)} \right)^2.$$
▶ $d = 1$: $O(n \log n)$-time algorithm based on sorting.
▶ $d = \Omega(n)$: NP-hard.

Elhami, Scholefield, Haro, & Vetterli (2017)
[$d = 2$] $O(n^3)$-time algorithm with $\|\hat{w} - \bar{w}\|_2 \leq O(2^n \cdot \|\varepsilon\|_\infty)$ when $(x_i)_{i=1}^n$ are exponentially spaced (in angle) on the unit circle.
Our contributions

1. Algorithm for least squares that gives a $(1 + \epsilon)$-approximation in time $(n/\epsilon)^{O(k)} + \mathrm{poly}(n, d)$, where $k = \dim(\mathrm{span}((x_i)_{i=1}^n))$.

2. $\mathrm{poly}(n, d)$-time* algorithm that exactly recovers $\bar{w}$ and $\bar{\pi}$ (with high probability) when $(x_i)_{i=1}^n$ are iid from $N(0, I_d)$, $\varepsilon_i \equiv 0$, and $n \geq d + 1$. (*After appropriate discretization.)

3. Information-theoretic lower bounds on SNR for approximate recovery of $\bar{w}$ when $(\varepsilon_i)_{i=1}^n$ are iid from $N(0, \sigma^2)$, and $(x_i)_{i=1}^n$ are iid from $N(0, I_d)$ or $\mathrm{Uniform}([-1, 1]^d)$.
1. Approximation algorithm for the least squares problem

Least squares problem

Given $(x_i)_{i=1}^n$ from $\mathbb{R}^d$ and $(y_i)_{i=1}^n$ from $\mathbb{R}$, find
$$(\hat{w}_{\mathrm{mle}}, \hat{\pi}_{\mathrm{mle}}) := \arg\min_{w \in \mathbb{R}^d,\, \pi \in S_n} \sum_{i=1}^n \left( y_i - w^\top x_{\pi(i)} \right)^2.$$

▶ $d = 1$: $O(n \log n)$-time algorithm based on sorting [PWC'16].
▶ $d = \Omega(n)$: NP-hard to decide if $\mathrm{opt} = 0$ [PWC'16].
▶ Naïve brute-force search: $\Omega(|S_n|) = \Omega(n!)$. (A sketch of the brute force follows.)
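For reference, a hedged sketch of that naïve brute-force search ($\Omega(n!)$ time, so only usable for very small $n$; names are illustrative):

```python
# Exhaustive search over all permutations π ∈ S_n for the least squares problem.
import itertools
import numpy as np

def brute_force_lsq(X, y):
    best_obj, best_w, best_pi = np.inf, None, None
    for pi in itertools.permutations(range(len(y))):
        Xp = X[list(pi)]                              # rows x_{π(i)}
        w, *_ = np.linalg.lstsq(Xp, y, rcond=None)    # best w for this π
        obj = np.sum((y - Xp @ w) ** 2)
        if obj < best_obj:
            best_obj, best_w, best_pi = obj, w, pi
    return best_w, best_pi, best_obj
```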
Least squares problem ($d = 1$)

Given $(x_i)_{i=1}^n$ and $(y_i)_{i=1}^n$ from $\mathbb{R}$, find
$$(\hat{w}_{\mathrm{mle}}, \hat{\pi}_{\mathrm{mle}}) := \arg\min_{w \in \mathbb{R},\, \pi \in S_n} \sum_{i=1}^n \left( y_i - w x_{\pi(i)} \right)^2.$$

Fix $w \in \mathbb{R}$, and suppose (WLOG) $w \geq 0$. Then
$$\min_{\pi \in S_n} \sum_{i=1}^n \left( y_i - w x_{\pi(i)} \right)^2 = \sum_{j=1}^n \left( y_{(j)} - w x_{(j)} \right)^2$$
where $x_{(1)} \leq x_{(2)} \leq \cdots \leq x_{(n)}$ and $y_{(1)} \leq y_{(2)} \leq \cdots \leq y_{(n)}$.

∴ As observed by [PWC'16], we can find $\hat{\pi}_{\mathrm{mle}}$ (and $\hat{w}_{\mathrm{mle}}$) by sorting.
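A sketch of the resulting sorting algorithm; the bookkeeping for the constrained scalar fits over $w \geq 0$ and $w \leq 0$ is my own:

```python
# d = 1: sort once, then fit a scalar w for each of the two sign regimes.
import numpy as np

def lsq_1d(x, y):
    xs, ys = np.sort(x), np.sort(y)
    xr = xs[::-1]
    # For w ≥ 0 the optimal π pairs sorted y with sorted x; for w ≤ 0, reversed.
    w_pos = max(0.0, float(xs @ ys) / float(xs @ xs))
    w_neg = min(0.0, float(xr @ ys) / float(xr @ xr))
    obj_pos = float(np.sum((ys - w_pos * xs) ** 2))
    obj_neg = float(np.sum((ys - w_neg * xr) ** 2))
    return (w_pos, obj_pos) if obj_pos <= obj_neg else (w_neg, obj_neg)
```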
Alternating minimization

Pick initial $\hat{w} \in \mathbb{R}^d$ (e.g., randomly). Loop until convergence:
$$\hat{\pi} \leftarrow \arg\min_{\pi \in S_n} \sum_{i=1}^n \left( y_i - \hat{w}^\top x_{\pi(i)} \right)^2, \qquad \hat{w} \leftarrow \arg\min_{w \in \mathbb{R}^d} \sum_{i=1}^n \left( y_i - w^\top x_{\hat{\pi}(i)} \right)^2.$$

▶ Each loop-iteration efficiently computable.
▶ But can get stuck in local minima. So try many initial $\hat{w} \in \mathbb{R}^d$. (Questions: How many restarts? How many iterations? A sketch follows.)
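A sketch of the heuristic under these assumptions. The $\hat{\pi}$-step reduces to sorting (matching sorted $y$'s with sorted $\hat{w}^\top x$'s minimizes the squared error, by rearrangement); the iteration cap and seeding are illustrative.

```python
# Alternating minimization: π-step by sorting, w-step by ordinary least squares.
import numpy as np

def altmin(X, y, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(X.shape[1])                # random initial ŵ
    for _ in range(iters):
        z = X @ w
        # π(i) = index of the ŵᵀx value whose rank matches the rank of y_i
        pi = np.argsort(z)[np.argsort(np.argsort(y))]
        w, *_ = np.linalg.lstsq(X[pi], y, rcond=None)  # w-step
    obj = float(np.sum((y - X[pi] @ w) ** 2))
    return w, pi, obj

# Usage: run with many seeds (restarts) and keep the lowest objective.
```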
Approximation result

Theorem. There is an algorithm that, given any inputs $(x_i)_{i=1}^n$, $(y_i)_{i=1}^n$, and $\epsilon \in (0, 1)$, returns a $(1 + \epsilon)$-approximate solution to the least squares problem in time $(n/\epsilon)^{O(k)} + \mathrm{poly}(n, d)$, where $k = \dim(\mathrm{span}((x_i)_{i=1}^n))$.
Beating brute-force search: "realizable" case

"Realizable" case: Suppose there exist $w_\star \in \mathbb{R}^d$ and $\pi_\star \in S_n$ such that
$$y_i = w_\star^\top x_{\pi_\star(i)}, \quad i \in [n].$$
The solution is determined by the action of $\pi_\star$ on $d$ points (assume $\dim(\mathrm{span}((x_i)_{i=1}^d)) = d$).

Algorithm (sketched in code below):
▶ Find a subset of $d$ linearly independent points $x_{i_1}, x_{i_2}, \ldots, x_{i_d}$.
▶ "Guess" the values of $\pi_\star^{-1}(i_j) \in [n]$, $j \in [d]$.
▶ Solve the linear system $y_{\pi_\star^{-1}(i_j)} = w^\top x_{i_j}$, $j \in [d]$, for $w \in \mathbb{R}^d$.
▶ To check correctness of $\hat{w}$: compute $\hat{y}_i := \hat{w}^\top x_i$, $i \in [n]$, and check if $\min_{\pi \in S_n} \sum_{i=1}^n (y_i - \hat{y}_{\pi(i)})^2 = 0$.

"Guess" means "enumerate over $\leq n^d$ choices"; the rest is $\mathrm{poly}(n, d)$.
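A sketch of this guess-and-check procedure, assuming for simplicity that the first $d$ covariates are linearly independent; the tolerance is illustrative:

```python
# Realizable case: enumerate ≤ n^d guesses for π⋆⁻¹ on d covariates, solve, check.
import itertools
import numpy as np

def exact_recover(X, y, tol=1e-8):
    n, d = X.shape
    for guess in itertools.permutations(range(n), d):
        # guess[j] plays the role of π⋆⁻¹(i_j):  y_{guess[j]} = wᵀ x_{i_j}
        w = np.linalg.solve(X[:d], y[list(guess)])
        # correctness check: best matching of y with ŷ = Xw has zero residual,
        # and for reals the best matching is the sorted one
        if np.sum((np.sort(y) - np.sort(X @ w)) ** 2) < tol:
            return w
    return None
```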
Beating brute-force search: general case

General case: the solution may not be determined by only $d$ points. But, for any right-hand side $b \in \mathbb{R}^n$, there exist $x_{i_1}, x_{i_2}, \ldots, x_{i_d}$ such that every $\hat{w} \in \arg\min_{w \in \mathbb{R}^d} \sum_{j=1}^d (b_{i_j} - w^\top x_{i_j})^2$ satisfies
$$\sum_{i=1}^n \left( b_i - \hat{w}^\top x_i \right)^2 \leq (d + 1) \cdot \min_{w \in \mathbb{R}^d} \sum_{i=1}^n \left( b_i - w^\top x_i \right)^2.$$
(Follows from a result of Dereziński and Warmuth (2017) on volume sampling; see the numeric check below.)

⟹ $n^{O(d)}$-time algorithm with approximation ratio $d + 1$, or $n^{O(d/\epsilon)}$-time algorithm with approximation ratio $1 + \epsilon$.

Better way to get $1 + \epsilon$: exploit first-order optimality conditions (i.e., "normal equations") and $\epsilon$-nets. Overall time: $(n/\epsilon)^{O(k)} + \mathrm{poly}(n, d)$ for $k = \dim(\mathrm{span}((x_i)_{i=1}^n))$.
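A quick numeric check (my own, not from the talk) of the subset fact, by exhaustive search over size-$d$ subsets:

```python
# Verify: some d rows give a restricted solution within factor (d+1) of optimal.
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, d = 8, 2
X, b = rng.standard_normal((n, d)), rng.standard_normal(n)

w_full, *_ = np.linalg.lstsq(X, b, rcond=None)
opt = np.sum((b - X @ w_full) ** 2)          # full least squares cost

best_ratio = np.inf
for S in itertools.combinations(range(n), d):
    w_S, *_ = np.linalg.lstsq(X[list(S)], b[list(S)], rcond=None)
    best_ratio = min(best_ratio, np.sum((b - X @ w_S) ** 2) / opt)

print(best_ratio, "<=", d + 1)               # existence: some subset has ratio ≤ d+1
```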
Remarks

▶ Algorithm is justified in the statistical setting by results of [PWC'16] for the MLE, but guarantees also hold when inputs are worst-case.
▶ Algorithm is poly-time only when $k = O(1)$.

Open problems:
1. Poly-time approximation algorithm when $k = \omega(1)$. (Perhaps in an average-case or smoothed setting.)
2. (Smoothed) analysis of alternating minimization. Similar to Lloyd's algorithm for Euclidean $k$-means.

Next: Algorithm for the noise-free average-case setting.
2. Exact recovery in the noise-free Gaussian setting

Setting

Noise-free Gaussian linear model (with $n + 1$ measurements):
$$y_i = \bar{w}^\top x_{\bar{\pi}(i)}, \quad i \in \{0, 1, \ldots, n\}$$
▶ Covariate vectors: $(x_i)_{i=0}^n$ iid from $N(0, I_d)$
▶ Unknown linear function: $\bar{w} \in \mathbb{R}^d$
▶ Unknown permutation: $\bar{\pi} \in S_{\{0, 1, \ldots, n\}}$

"Equivalent" problem: We're promised that $\bar{\pi}(0) = 0$. So we can just consider $\bar{\pi}$ as an unknown permutation over $\{1, 2, \ldots, n\}$.

Number of measurements: If $n + 1 \geq d$, then recovery of $\bar{\pi}$ gives exact recovery of $\bar{w}$ (a.s.). We'll assume $n + 1 \geq d + 1$ (i.e., $n \geq d$).

Claim: $n \geq d$ suffices to recover $\bar{\pi}$ with high probability.
Exact recovery result

Theorem. Fix any $\bar{w} \in \mathbb{R}^d$ and $\bar{\pi} \in S_n$, and assume $n \geq d$. Suppose $(x_i)_{i=0}^n$ are drawn iid from $N(0, I_d)$, and $(y_i)_{i=0}^n$ satisfy
$$y_0 = \bar{w}^\top x_0; \qquad y_i = \bar{w}^\top x_{\bar{\pi}(i)}, \quad i \in [n].$$
There is an algorithm that, given inputs $(x_i)_{i=0}^n$ and $(y_i)_{i=0}^n$, returns $\bar{\pi}$ and $\bar{w}$ with high probability.
Main idea: hidden subset

Measurements:
$$y_0 = \bar{w}^\top x_0; \qquad y_i = \bar{w}^\top x_{\bar{\pi}(i)}, \quad i \in [n].$$
For simplicity: assume $n = d$, and $x_1, x_2, \ldots, x_d$ orthonormal. Then
$$y_0 = \bar{w}^\top x_0 = \sum_{j=1}^d (\bar{w}^\top x_j)(x_j^\top x_0) = \sum_{j=1}^d y_{\bar{\pi}^{-1}(j)} (x_j^\top x_0) = \sum_{i=1}^d \sum_{j=1}^d \mathbb{1}\{\bar{\pi}(i) = j\} \cdot y_i (x_j^\top x_0).$$
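A numeric check (my own) of this hidden-subset identity in the simplified orthonormal case:

```python
# With n = d and orthonormal x_1..x_d:  y_0 = Σ_i c_{i, π̄(i)},  c_{i,j} = y_i (x_jᵀ x_0).
import numpy as np

rng = np.random.default_rng(2)
d = 5
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # rows of Q: orthonormal x_1..x_d
x0 = rng.standard_normal(d)
w_bar = rng.standard_normal(d)
pi_bar = rng.permutation(d)

y0 = w_bar @ x0                                    # y_0 = w̄ᵀ x_0
y = Q[pi_bar] @ w_bar                              # y_i = w̄ᵀ x_{π̄(i)}

C = np.outer(y, Q @ x0)                            # C[i, j] = y_i (x_jᵀ x_0)
print(np.isclose(y0, sum(C[i, pi_bar[i]] for i in range(d))))   # True
```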
Reduction to subset sum

$$y_0 = \sum_{i=1}^d \sum_{j=1}^d \mathbb{1}\{\bar{\pi}(i) = j\} \cdot \underbrace{y_i (x_j^\top x_0)}_{c_{i,j}}$$

▶ $d^2$ "source" numbers $c_{i,j} := y_i (x_j^\top x_0)$, "target" sum $T := y_0$. Promised that a size-$d$ subset of the $c_{i,j}$ sums to $T$.
▶ The correct subset corresponds to the $(i, j) \in [d]^2$ such that $\bar{\pi}(i) = j$.

Next: How to solve Subset Sum efficiently?
Reducing subset sum to shortest vector problem

Lagarias & Odlyzko (1983): random instances of Subset Sum are efficiently solvable when the $N$ source numbers are chosen independently and u.a.r. from a sufficiently wide interval of $\mathbb{Z}$.

Main idea: (w.h.p.) every incorrect subset will "miss" the target sum $T$ by a noticeable amount.

Reduction: construct a lattice basis in $\mathbb{R}^{N+1}$ such that
▶ the correct subset of basis vectors gives a short lattice vector $v_\star$;
▶ any other lattice vector $\not\propto v_\star$ is more than $2^{N/2}$-times longer.
$$\begin{bmatrix} b_0 & b_1 & \cdots & b_N \end{bmatrix} = \begin{bmatrix} 0 & & I_N & \\ \beta T & -\beta c_1 & \cdots & -\beta c_N \end{bmatrix}$$
for sufficiently large $\beta > 0$. Using the Lenstra, Lenstra, & Lovász (1982) algorithm to find an approximately-shortest vector reveals the correct subset.
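A sketch of the basis construction only (my own rendering of the slide's matrix); actually reducing the basis would require an LLL implementation, e.g., from the fpylll package, which is not shown here:

```python
# Lagarias–Odlyzko lattice basis for subset sum with sources c_1..c_N, target T.
import numpy as np

def lo_basis(c, T, beta):
    N = len(c)
    B = np.zeros((N + 1, N + 1))
    B[:N, 1:] = np.eye(N)                 # top block: I_N (columns b_1..b_N)
    B[N, 0] = beta * T                    # column b_0 ends in βT
    B[N, 1:] = -beta * np.asarray(c)      # column b_j ends in −βc_j
    return B                              # columns are the basis [b_0 b_1 ... b_N]

# The correct subset S yields the short vector v⋆ = b_0 + Σ_{j∈S} b_j = (1_S, 0).
```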
Our random subset sum instance

Catch: Our source numbers $c_{i,j} = y_i x_j^\top x_0$ are not independent, and not uniformly distributed on some wide interval of $\mathbb{Z}$.

▶ Instead, they have some joint density derived from $N(0, 1)$.
▶ To show that the Lagarias & Odlyzko reduction still works, we need Gaussian anti-concentration for quadratic and quartic forms.

Key lemma: (w.h.p.) for every $Z \in \mathbb{Z}^{d \times d}$ that is not an integer multiple of the permutation matrix corresponding to $\bar{\pi}$,
$$\Big| T - \sum_{i,j} Z_{i,j} \cdot c_{i,j} \Big| \geq \frac{1}{2^{\mathrm{poly}(d)}} \cdot \|\bar{w}\|_2.$$
Some details

▶ In general, $x_1, x_2, \ldots, x_n$ are not (exactly) orthonormal, but a similar reduction works via the Moore-Penrose pseudoinverse.
▶ The reduction uses real coefficients in the lattice basis. For LLL to run in poly-time, we need to round the $(x_i)_{i=0}^n$ and $\bar{w}$ coefficients to finite-precision rational numbers. Similar to drawing $(x_i)_{i=0}^n$ iid from a discretized $N(0, I_d)$.
▶ Algorithm strongly exploits the assumption of noise-free measurements; it likely fails in the presence of noise.
▶ A similar algorithm was used by Andoni, H., Shi, & Sun (2017) for different problems (phase retrieval / correspondence retrieval).
Connections to prior works

▶ Unnikrishnan, Haghighatshoar, & Vetterli (2015)
Recall: [UHV'15] show that $n \geq 2d$ is necessary for measurements to uniquely determine every $\bar{w} \in \mathbb{R}^d$.
Our result: For fixed $\bar{w} \in \mathbb{R}^d$, $d + 1$ measurements suffice to recover $\bar{w}$; the same covariate vectors may fail for other $\bar{w}' \in \mathbb{R}^d$. (C.f. "for all" vs. "for each" results in compressive sensing.)

▶ Pananjady, Wainwright, & Courtade (2016)
Noise-free setting: signal-to-noise conditions are trivially satisfied whenever $\bar{w} \neq 0$.
Noisy setting: recovering $\bar{w}$ may be easier than recovering $\bar{\pi}$.

Next: Limits for recovering $\bar{w}$.
3. Lower bounds on SNR for approximate recovery

Setting

Linear model with Gaussian noise:
$$y_i = \bar{w}^\top x_{\bar{\pi}(i)} + \varepsilon_i, \quad i \in [n]$$
▶ Covariate vectors: $(x_i)_{i=1}^n$ iid from $P$
▶ Measurement errors: $(\varepsilon_i)$ iid from $N(0, \sigma^2)$
▶ Unknown linear function: $\bar{w} \in \mathbb{R}^d$
▶ Unknown permutation: $\bar{\pi} \in S_n$

Equivalent: ignore $\bar{\pi}$; observe $(x_i)_{i=1}^n$ and $\{\!\{y_i\}\!\}_{i=1}^n$ (where $\{\!\{\cdot\}\!\}$ denotes an unordered multi-set).

We consider $P = N(0, I_d)$ and $P = \mathrm{Uniform}([-1, 1]^d)$.

Note: If the correspondence between $(x_i)_{i=1}^n$ and $\{\!\{y_i\}\!\}_{i=1}^n$ is known, then we just need $\mathrm{SNR} \gtrsim d/n$ to approximately recover $\bar{w}$.
Uniform case

Theorem. If $(x_i)_{i=1}^n$ are iid draws from $\mathrm{Uniform}([-1, 1]^d)$, $(y_i)_{i=1}^n$ follow the linear model with $N(0, \sigma^2)$ noise, and $\mathrm{SNR} \leq (1 - 2c)^2$ for some $c \in (0, 1/2)$, then for any estimator $\hat{w}$, there exists $\bar{w} \in \mathbb{R}^d$ such that
$$\mathbb{E}\left[ \|\hat{w} - \bar{w}\|_2 \right] \geq c \|\bar{w}\|_2.$$

Increasing the sample size $n$ does not help, unlike in the "known correspondence" setting (where $\mathrm{SNR} \gtrsim d/n$ suffices).
Proof sketch

We show that no estimator can confidently distinguish between $\bar{w} = e_1$ and $\bar{w} = -e_1$, where $e_1 = (1, 0, \ldots, 0)^\top$.

Let $P_{\bar{w}}$ be the data distribution with parameter $\bar{w} \in \{e_1, -e_1\}$. Task: show $P_{e_1}$ and $P_{-e_1}$ are "close", then appeal to Le Cam's standard "two-point argument":
$$\max_{\bar{w} \in \{e_1, -e_1\}} \mathbb{E}_{P_{\bar{w}}} \|\hat{w} - \bar{w}\|_2 \geq 1 - \|P_{e_1} - P_{-e_1}\|_{\mathrm{tv}}.$$

Key idea: the conditional means of $\{\!\{y_i\}\!\}_{i=1}^n$ given $(x_i)_{i=1}^n$, under $P_{e_1}$ and $P_{-e_1}$, are close as unordered multi-sets.
Proof sketch (continued)

Generative process for $P_{\bar{w}}$:
1. Draw $(x_i)_{i=1}^n$ iid $\sim \mathrm{Uniform}([-1, 1]^d)$ and $(\varepsilon_i)_{i=1}^n$ iid $\sim N(0, \sigma^2)$.
2. Set $u_i := \bar{w}^\top x_i$ for $i \in [n]$.
3. Set $y_i := u_{(i)} + \varepsilon_i$ for $i \in [n]$, where $u_{(1)} \leq u_{(2)} \leq \cdots \leq u_{(n)}$.

Conditional distribution of $y = (y_1, y_2, \ldots, y_n)$ given $(x_i)_{i=1}^n$:
Under $P_{e_1}$: $y \mid (x_i)_{i=1}^n \sim N(u^\uparrow, \sigma^2 I_n)$
Under $P_{-e_1}$: $y \mid (x_i)_{i=1}^n \sim N(-u^\downarrow, \sigma^2 I_n)$
where $u^\uparrow = (u_{(1)}, u_{(2)}, \ldots, u_{(n)})$ and $u^\downarrow = (u_{(n)}, u_{(n-1)}, \ldots, u_{(1)})$.

Data processing: We lose information by going from $y$ to $\{\!\{y_i\}\!\}_{i=1}^n$.
Proof sketch (continued)

By the data processing inequality,
$$\mathrm{KL}\big( P_{e_1}(\cdot \mid (x_i)_{i=1}^n),\; P_{-e_1}(\cdot \mid (x_i)_{i=1}^n) \big) \leq \mathrm{KL}\big( N(u^\uparrow, \sigma^2 I_n),\; N(-u^\downarrow, \sigma^2 I_n) \big) = \frac{\|u^\uparrow - (-u^\downarrow)\|_2^2}{2\sigma^2} = \frac{\mathrm{SNR}}{2} \cdot \|u^\uparrow + u^\downarrow\|_2^2.$$

Some computations show that
$$\mathrm{med}\, \|u^\uparrow + u^\downarrow\|_2^2 \leq 4.$$

By conditioning + Pinsker's inequality,
$$\|P_{e_1} - P_{-e_1}\|_{\mathrm{tv}} \leq \frac{1}{2} + \frac{1}{2}\, \mathrm{med} \sqrt{\frac{\mathrm{SNR}}{4} \cdot \|u^\uparrow + u^\downarrow\|_2^2} \leq \frac{1}{2} + \frac{1}{2} \sqrt{\mathrm{SNR}}.$$
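A quick simulation (my own, not from the talk) of the quantity bounded here, with $\bar{w} = e_1$ so that each $u_i$ is uniform on $[-1, 1]$:

```python
# Estimate med ‖u↑ + u↓‖²₂ for u_i ~ Uniform([−1, 1]); the slides bound it by 4.
import numpy as np

rng = np.random.default_rng(3)
n, trials = 200, 1000
vals = []
for _ in range(trials):
    u_up = np.sort(rng.uniform(-1, 1, size=n))       # u↑: sorted u_i's
    vals.append(np.sum((u_up + u_up[::-1]) ** 2))    # u↓ is u↑ reversed
print(np.median(vals))
```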
Gaussian case

Theorem. If $(x_i)_{i=1}^n$ are iid draws from $N(0, I_d)$, $(y_i)_{i=1}^n$ follow the linear model with $N(0, \sigma^2)$ noise, and $\mathrm{SNR} \leq C \cdot \frac{d}{\log\log(n)}$ for some absolute constant $C > 0$, then for any estimator $\hat{w}$, there exists $\bar{w} \in \mathbb{R}^d$ such that $\mathbb{E}\left[ \|\hat{w} - \bar{w}\|_2 \right] \geq C' \|\bar{w}\|_2$ for some other absolute constant $C' > 0$.

C.f. the "known correspondence" setting, where $\mathrm{SNR} \gtrsim d/n$ suffices.
4. Closing remarks and open problems

Closing remarks and open problems

Lack of correspondence changes both the computational and statistical difficulty of linear regression.

▶ Algorithms shed light on computational difficulty in worst-case and average-case settings.
▶ SNR lower bounds show a striking contrast to "known correspondence" settings.
▶ A gap remains between SNR lower and upper bounds. E.g., the $N(0, I_d)$ case: fails when $\mathrm{SNR} = O\!\left(\frac{d}{\log\log n}\right)$; succeeds when $\mathrm{SNR} = \Omega(n^c)$ [PWC'16], improved to $\omega(1)$ (new!).
▶ Is the MLE (near) optimal for recovering $\bar{w}$?
▶ $N(0, I_d)$ vs. $\mathrm{Uniform}([-1, 1]^d)$?
▶ Faster algorithms? (Smoothed) analysis of alternating minimization?
Acknowledgements

Collaborators: Kevin Shi (Columbia), Xiaorui Sun (Simons Institute). Discussants: Ashwin Pananjady (UCB), Michał Dereziński (UCSC), Manfred Warmuth (UCSC). Funding: NSF (DMR-1534910, IIS-1563785), Sloan Research Fellowship, Bloomberg Data Science Research Grant. Hospitality: Simons Institute for the Theory of Computing (UCB). See the preprint for details & references: arxiv.org/abs/1705.07048

Thank you