
Learning without correspondence

Daniel Hsu

Computer Science Department & Data Science Institute Columbia University


Introduction

Example #1: unlinked data sources

  • Two separate data sources about the same entities, with the record linkage ("???") between them unknown:

    Source 1 (covariates)        Source 2 (response)
    Sex  Age  Height             Disease
    M    20   180        ???     1
    F    24   162.5              1
    F    22   160                ⋮
    F    23   167.5

  • First source contains covariates (sex, age, height, …).
  • Second source contains the response variable (disease status).

To learn: relationship between response and covariates. Record linkage unknown.

Example #2: flow cytometry

  • 1. Suspend cells in fluid.
  • 2. Cells pass through a laser, one at a time; measure emitted light.

[Figure: schematic of cells flowing single-file past a laser, with emitted light recorded.]

To learn: relationship between measurements and cell properties. Order in which cells pass through the laser is unknown.

Example #3: unassigned distance geometry

  • 1. Unknown arrangement of n points in Euclidean space.
(Image credit: Billinge, Duxbury, Gonçalves, Lavor, & Mucherino, 2016)
  • 2. Measure distribution of pairwise distances among the n points (using high-energy X-rays).

To learn: original arrangement of the n points. Assignment of distances to pairs of points is unknown.

Learning without correspondence

Observation: Correspondence information is missing in many natural settings.
Question: How does this affect machine learning / statistical estimation?
We give a theoretical treatment in the context of two simple problems:

  • 1. Linear regression without correspondence
(Joint work with Kevin Shi and Xiaorui Sun; NIPS 2017.)
  • 2. Correspondence retrieval (a generalization of phase retrieval)
(Joint work with Alexandr Andoni, Kevin Shi, and Xiaorui Sun; COLT 2017.)

Our contributions

  • 1. Linear regression without correspondence
    • Strong NP-hardness of the least squares problem.
    • Polynomial-time approximation scheme in constant dimensions.
    • Information-theoretic signal-to-noise lower bounds.
    • Polynomial-time algorithm in the noise-free average-case setting.
  • 2. Correspondence retrieval
    • Measurement-optimal recovery algorithm in the noise-free setting.
    • Robust recovery algorithm in the noisy setting.


Linear regression without correspondence

Feature vectors: x_1, x_2, …, x_n ∈ R^d. Labels: y_1, y_2, …, y_n ∈ R.

Classical linear regression:

    y_i = x_i⊤ β∗ + ε_i,   i = 1, …, n.

Linear regression without correspondence:

    y_i = x_{π∗(i)}⊤ β∗ + ε_i,   i = 1, …, n.

Model for linear regression without correspondence

Unnikrishnan, Haghighatshoar, & Vetterli, 2015; Pananjady, Wainwright, & Courtade, 2016; Elhami, Scholefield, Haro, & Vetterli, 2017; Abid, Poon, & Zou, 2017; …

  • Feature vectors: x_1, x_2, …, x_n ∈ R^d
  • Labels: y_1, y_2, …, y_n ∈ R
  • Model:

    y_i = x_{π∗(i)}⊤ β∗ + ε_i,   i = 1, …, n.

  • Linear function: β∗ ∈ R^d
  • Permutation: π∗ ∈ S_n
  • Errors: ε_1, ε_2, …, ε_n ∈ R
  • Goal: "learn" β∗.

Correspondence between (x_i)_{i=1}^n and (y_i)_{i=1}^n is unknown.

Questions

  • 1. Can we determine if there is a good linear fit to the data?
(Least squares approximation.)
  • 2. When is it possible to recover the "correct" β∗?
(When is the "best" linear fit actually meaningful?)


Least squares approximation

Least squares problem

Given (x_i)_{i=1}^n from R^d and (y_i)_{i=1}^n from R, minimize

    F(β, π) := ∑_{i=1}^n (x_i⊤ β − y_{π(i)})².

  • d = 1: O(n log n)-time algorithm. (Observed by Pananjady, Wainwright, & Courtade, 2016.)
  • d = Ω(n): (strongly) NP-hard to decide if min F = 0. Reduction from 3-PARTITION (H., Shi, & Sun, 2017).
  • Naïve brute-force search: Ω(|S_n|) = Ω(n!).
  • Least squares with known correspondence: O(nd²) time. (A code sketch of the objective follows.)
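
The objective and the known-correspondence baseline are easy to state in code. A minimal numpy sketch (illustrative, not from the original deck; all names are ours):

```python
import numpy as np

def F(beta, pi, X, y):
    """Least squares objective: F(beta, pi) = sum_i (x_i' beta - y_{pi(i)})^2."""
    return np.sum((X @ beta - y[pi]) ** 2)

# With known correspondence (pi = identity), this is ordinary least squares,
# solvable in O(n d^2) time.
rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)                  # noise-free labels
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(F(beta_hat, np.arange(n), X, y))          # ~0 in this noise-free example
```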

Least squares problem (d = 1)

Given (x_i)_{i=1}^n and (y_i)_{i=1}^n from R, minimize

    F(β, π) := ∑_{i=1}^n (x_i β − y_{π(i)})².

Example: x = (3, 4, 6, …) and y = (2, 1, 7, …). The cost with π(i) = i for all i = 1, …, n starts with

    (3β − 2)² + (4β − 1)² + ··· = 25β² − 20β + 5 + ···

If β > 0, then the cost improves by taking π(1) = 2 and π(2) = 1:

    25β² − 20β + 5 + ··· > 25β² − 22β + 5 + ···

(A quick numeric check follows.)
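
A quick numeric check of the swap argument above (a sketch of ours; the numbers are the ones on the slide):

```python
import numpy as np

x, y = np.array([3.0, 4.0, 6.0]), np.array([2.0, 1.0, 7.0])
cost = lambda beta, pi: np.sum((x * beta - y[list(pi)]) ** 2)
beta = 0.5                        # any beta > 0
print(cost(beta, (0, 1, 2)))      # identity matching: 25b^2 - 20b + 5 + ...
print(cost(beta, (1, 0, 2)))      # first two labels swapped: strictly smaller
```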

Algorithm for least squares problem (d = 1) [PWC'16]

  • 1. "Guess" the sign of the optimal β. (Only two possibilities.)
  • 2. Assuming WLOG that x_1 β ≤ x_2 β ≤ ··· ≤ x_n β, find the optimal π such that y_{π(1)} ≤ y_{π(2)} ≤ ··· ≤ y_{π(n)} (via sorting).
  • 3. Solve the classical least squares problem min_{β∈R} ∑_{i=1}^n (x_i β − y_{π(i)})² to get the optimal β.

Overall running time: O(n log n). What about d > 1? (A code sketch of this procedure follows.)
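
A minimal sketch of the d = 1 procedure (our code, not from the talk; ties are broken arbitrarily):

```python
import numpy as np

def lsq_1d_no_correspondence(x, y):
    """Sort-and-match algorithm for d = 1 (sketch of PWC'16)."""
    best_cost, best_beta = np.inf, None
    for sign in (+1.0, -1.0):            # step 1: guess sign of optimal beta
        order_x = np.argsort(sign * x)   # step 2: order of x_i * beta
        matched = np.empty_like(y, dtype=float)
        matched[order_x] = np.sort(y)    # pair sorted x*beta with sorted y
        beta = (x @ matched) / (x @ x)   # step 3: closed-form 1-d least squares
        cost = np.sum((x * beta - matched) ** 2)
        if cost < best_cost:
            best_cost, best_beta = cost, beta
    return best_beta, best_cost
```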

Alternating minimization

Pick an initial β̂ ∈ R^d (e.g., randomly). Loop until convergence:

    π̂ ← arg min_{π∈S_n} ∑_{i=1}^n (x_i⊤ β̂ − y_{π(i)})²,
    β̂ ← arg min_{β∈R^d} ∑_{i=1}^n (x_i⊤ β − y_{π̂(i)})².

[Figure: non-convex surface with many local minima. Image credit: Wolfram|Alpha]

  • Each loop iteration is efficiently computable.
  • But the iterates can get stuck in local minima. So try many initial β̂ ∈ R^d.
(Open: How many restarts? How many iterations?)

(A minimal code sketch follows.)
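
A minimal numpy sketch of the loop (ours; a fixed iteration count stands in for a convergence test):

```python
import numpy as np

def alt_min(X, y, restarts=20, iters=50, seed=0):
    """Alternating minimization: pi-step by sorting, beta-step by least squares."""
    rng = np.random.default_rng(seed)
    best_cost, best_beta = np.inf, None
    y_sorted = np.sort(y)
    for _ in range(restarts):                    # many random initializations
        beta = rng.standard_normal(X.shape[1])
        for _ in range(iters):
            z = X @ beta
            matched = np.empty_like(y, dtype=float)
            matched[np.argsort(z)] = y_sorted    # pi-step: sorted matching is optimal
            beta, *_ = np.linalg.lstsq(X, matched, rcond=None)  # beta-step: OLS
        cost = np.sum((X @ beta - matched) ** 2)
        if cost < best_cost:
            best_cost, best_beta = cost, beta
    return best_beta, best_cost
```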

Approximation result

Theorem (H., Shi, & Sun, 2017). There is an algorithm that, given any inputs (x_i)_{i=1}^n, (y_i)_{i=1}^n, and ϵ ∈ (0, 1), returns a (1 + ϵ)-approximate solution to the least squares problem in time

    (n/ϵ)^{O(d)} + poly(n, d).

Recall: the brute-force solution needs Ω(n!) time. (No other previous algorithm with an approximation guarantee.)


Statistical recovery of β∗: algorithms and lower-bounds

Motivation

When does the best-fit model shed light on the "truth" (π∗ & β∗)?
Approach: Study the question in the context of a statistical model for the data.

  • 1. Understand information-theoretic limits on recovering the truth.
  • 2. Natural "average-case" setting for algorithms.

Statistical model

    y_i = x_{π∗(i)}⊤ β∗ + ε_i,   i = 1, …, n.

Assume (x_i)_{i=1}^n iid from P and (ε_i)_{i=1}^n iid from N(0, σ²).

Recoverability of β∗ depends on the signal-to-noise ratio:

    SNR := ∥β∗∥_2² / σ².

Classical setting (where π∗ is known): Just need SNR ≳ d/n to approximately recover β∗.

High-level intuition

Suppose β∗ is either e_1 = (1, 0, 0, …, 0) or e_2 = (0, 1, 0, …, 0).

π∗ known: distinguishability of e_1 and e_2 can improve with n.
π∗ unknown: distinguishability is less clear:

    {y_i}_{i=1}^n = {x_{i,1}}_{i=1}^n + N(0, σ²)   if β∗ = e_1,
    {y_i}_{i=1}^n = {x_{i,2}}_{i=1}^n + N(0, σ²)   if β∗ = e_2.

({·} denotes unordered multi-set.) (A small simulation follows.)
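
A small simulation (ours) of why the unordered multiset is uninformative here when P = N(0, I_d): under either hypothesis, {y_i} consists of n iid N(0, 1 + σ²) draws, so the empirical quantiles agree:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 100_000, 0.5
X = rng.standard_normal((n, 2))
eps = sigma * rng.standard_normal(n)
y_e1 = X[:, 0] + eps                  # multiset of labels if beta* = e1
y_e2 = X[:, 1] + eps                  # multiset of labels if beta* = e2
qs = np.linspace(0.05, 0.95, 10)
print(np.quantile(y_e1, qs) - np.quantile(y_e2, qs))   # all entries ~ 0
```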

Effect of noise

Without noise (P = N(0, I_d)):

[Figure: histograms of the multisets {x_{i,1}}_{i=1}^n and {x_{i,2}}_{i=1}^n.]

With noise:

[Figure: histogram of ??? + N(0, σ²); which coordinate generated it is unclear.]

Lower bound on SNR

Theorem (H., Shi, & Sun, 2017). For P = N(0, I_d), no estimator β̂ can guarantee

    E[∥β̂ − β∗∥_2] ≤ ∥β∗∥_2 / 3   unless   SNR ≥ C · d / log log(n).

"Known correspondence" setting: SNR ≳ d/n suffices.

Another theorem: for P = Uniform([−1, 1]^d), must have SNR ≥ 1/9, even as n → ∞.

High SNR regime

Previous works (Unnikrishnan, Haghighatshoar, & Vetterli, 2015; Pananjady, Wainwright, & Courtade, 2016): If SNR ≫ poly(n), then one can recover π∗ (and β∗, approximately) using maximum likelihood estimation, i.e., least squares.

Related (d = 1): broken random sample (DeGroot and Goel, 1980). Estimate the sign of the correlation between x_i and y_i. Have an estimator for sign(β∗) that is correct w.p. 1 − Õ(SNR^{−1/4}).

Does high SNR also permit efficient algorithms? (Recall: our approximate MLE algorithm has running time n^{O(d)}.)
slide-65
SLIDE 65

Average-case recovery with very high SNR

slide-66
SLIDE 66

Noise-free setting (SNR = ∞)

= β∗ y0 y1 yn . . . x⊤

π∗(0)

x⊤

π∗(1)

x⊤

π∗(n)

. . .

Assume (xi)n

i=0 iid from N(0, Id).

Also assume 0. If 1 , then recovery of gives exact recovery of (a.s.). We’ll assume 1 1 (i.e., ). Claim: suffjces to recover with high probability.

20

slide-67
SLIDE 67

Noise-free setting (SNR = ∞)

= β∗ y0 y1 yn . . . x⊤ x⊤

π∗(1)

x⊤

π∗(n)

. . .

Assume (xi)n

i=0 iid from N(0, Id).

Also assume π∗(0) = 0. If 1 , then recovery of gives exact recovery of (a.s.). We’ll assume 1 1 (i.e., ). Claim: suffjces to recover with high probability.

20

slide-68
SLIDE 68

Noise-free setting (SNR = ∞)

= β∗ y0 y1 yn . . . x⊤ x⊤

π∗(1)

x⊤

π∗(n)

. . .

Assume (xi)n

i=0 iid from N(0, Id).

Also assume π∗(0) = 0. If n + 1 ≥ d, then recovery of π∗ gives exact recovery of β∗ (a.s.). We’ll assume 1 1 (i.e., ). Claim: suffjces to recover with high probability.

20

slide-69
SLIDE 69

Noise-free setting (SNR = ∞)

= β∗ y0 y1 yn . . . x⊤ x⊤

π∗(1)

x⊤

π∗(n)

. . .

Assume (xi)n

i=0 iid from N(0, Id).

Also assume π∗(0) = 0. If n + 1 ≥ d, then recovery of π∗ gives exact recovery of β∗ (a.s.). We’ll assume n + 1 ≥ d + 1 (i.e., n ≥ d). Claim: suffjces to recover with high probability.

20

slide-70
SLIDE 70

Noise-free setting (SNR = ∞)

= β∗ y0 y1 yn . . . x⊤ x⊤

π∗(1)

x⊤

π∗(n)

. . .

Assume (xi)n

i=0 iid from N(0, Id).

Also assume π∗(0) = 0. If n + 1 ≥ d, then recovery of π∗ gives exact recovery of β∗ (a.s.). We’ll assume n + 1 ≥ d + 1 (i.e., n ≥ d). Claim: n ≥ d suffjces to recover π∗ with high probability.

20

Result on exact recovery

Theorem (H., Shi, & Sun, 2017). In the noise-free setting, there is a poly(n, d)-time⋆ algorithm that returns π∗ and β∗ with high probability.

⋆Assuming the problem is appropriately discretized.

Main idea: hidden subset

Measurements:

    y_0 = x_0⊤ β∗;   y_i = x_{π∗(i)}⊤ β∗,   i = 1, …, n.

For simplicity: assume n = d, and x_i = e_i for i = 1, …, d, so

    {y_1, …, y_d} = {β∗_1, …, β∗_d}.

We also know:

    y_0 = x_0⊤ β∗ = ∑_{j=1}^d x_{0,j} β∗_j.

Reduction to Subset Sum

    y_0 = x_0⊤ β∗ = ∑_{j=1}^d x_{0,j} β∗_j = ∑_{i=1}^d ∑_{j=1}^d x_{0,j} y_i · 1{π∗(i) = j}.

  • d² "source" numbers c_{i,j} := x_{0,j} y_i, "target" sum y_0.

The subset {c_{i,j} : π∗(i) = j} adds up to y_0. ⇒ Subset Sum problem. (A construction sketch follows.)
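
A tiny sketch (ours) of the instance in the simplified x_i = e_i case, checking that the planted subset hits the target:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
beta_star = rng.standard_normal(d)
pi_star = rng.permutation(d)             # pi*(i) = pi_star[i]
x0 = rng.standard_normal(d)
y0 = x0 @ beta_star                      # "target" sum
y = beta_star[pi_star]                   # y_i = beta*_{pi*(i)} since x_i = e_i
C = np.outer(y, x0)                      # "source" numbers c_{i,j} = x_{0,j} y_i
planted = sum(C[i, pi_star[i]] for i in range(d))
print(np.isclose(planted, y0))           # True: {c_{i,j} : pi*(i) = j} sums to y0
```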

NP-Completeness of Subset Sum (a.k.a. "Knapsack")

(Karp, 1972)

Easiness of Subset Sum

  • But Subset Sum is only "weakly" NP-hard (an efficient algorithm exists for unary-encoded inputs).
  • Lagarias & Odlyzko (1983): solving certain random instances can be reduced to solving the Approximate Shortest Vector Problem in lattices.
  • Lenstra, Lenstra, & Lovász (1982): efficient algorithm to solve Approximate SVP.
  • Our algorithm is based on a similar reduction but requires a somewhat different analysis.

Reducing subset sum to shortest vector problem

Lagarias & Odlyzko (1983): random instances of Subset Sum are efficiently solvable when the N source numbers c_1, …, c_N are chosen independently and u.a.r. from a sufficiently wide interval of Z.

Main idea: (w.h.p.) every incorrect subset will "miss" the target sum T by a noticeable amount.

Reduction: construct a lattice basis in R^{N+1} such that

  • the correct subset of basis vectors gives a short lattice vector v⋆;
  • any other lattice vector ̸∝ v⋆ is more than 2^{N/2}-times longer.

    [b_0 b_1 ··· b_N] := ⎡  0       I_N         ⎤
                         ⎣  MT   −Mc_1 ··· −Mc_N ⎦

for sufficiently large M > 0. (A basis-construction sketch follows.)
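
A sketch (ours) of the Lagarias-Odlyzko basis; actually reducing it would call an LLL implementation (e.g., the fpylll package), which we leave out:

```python
import numpy as np

def lagarias_odlyzko_basis(c, T, M):
    """Columns b_0, b_1, ..., b_N: b_0 = (0, ..., 0, M*T), b_j = (e_j, -M*c_j)."""
    N = len(c)
    B = np.zeros((N + 1, N + 1))
    B[:N, 1:] = np.eye(N)                # top block: [0 | I_N]
    B[N, 0] = M * T                      # bottom row: [M*T, -M*c_1, ..., -M*c_N]
    B[N, 1:] = -M * np.asarray(c)
    return B

# If z in {0,1}^N selects a subset with c . z = T, then B @ [1, *z] has last
# coordinate M*(T - c . z) = 0, so that lattice vector is short (length <= sqrt(N)).
```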

Our random subset sum instance

Catch: Our source numbers c_{i,j} = y_i x_j⊤ x_0 are not independent, and not uniformly distributed on some wide interval of Z.

  • Instead, they have some joint density derived from N(0, 1).
  • To show that the Lagarias & Odlyzko reduction still works, use Gaussian anti-concentration for quadratic and quartic forms.

Key lemma: (w.h.p.) for every Z ∈ Z^{d×d} that is not an integer multiple of the permutation matrix corresponding to π∗,

    | y_0 − ∑_{i,j} Z_{i,j} · c_{i,j} | ≥ (1 / 2^{poly(d)}) · ∥β∗∥_2.

Some remarks

  • In general, x_1, …, x_n are not e_1, …, e_d, but a similar reduction works via the Moore-Penrose pseudoinverse.
  • The algorithm strongly exploits the assumption of noise-free measurements. Unlikely to tolerate much noise.

Open problem: robust efficient algorithm in the high SNR setting.


Correspondence retrieval

Correspondence retrieval problem

Goal: recover k unknown "signals" β∗_1, …, β∗_k ∈ R^d.

Measurements: (x_i, Y_i) for i = 1, …, n, where

  • (x_i) iid from N(0, I_d);
  • Y_i = {x_i⊤ β∗_1 + ε_{i,1}, …, x_i⊤ β∗_k + ε_{i,k}} as an unordered multi-set;
  • (ε_{i,j}) iid from N(0, σ²).

Correspondence across measurements is lost.

[Figure: a measurement direction x_i together with signals β∗_1, β∗_2, β∗_3.]

Special cases

  • k = 1: classical linear regression model.
  • k = 2 and β∗_1 = −β∗_2: (real variant of) phase retrieval. Note that {x_i⊤ β∗, −x_i⊤ β∗} has the same information as |x_i⊤ β∗|. Existing methods require n > 2d.

Algorithmic results (Andoni, H., Shi, & Sun, 2017)

  • Noise-free setting (i.e., σ = 0): algorithm based on a reduction to Subset Sum that requires n ≥ d + 1, which is optimal.
  • General setting: method-of-moments algorithm that requires n ≥ d · poly(k), i.e., based on forming averages over the data, like:

    (1/n) ∑_{i=1}^n ( ∑_{y_j ∈ Y_i} y_j² ) x_i x_i⊤.

Questions: SNR limits? Sub-optimality of "method-of-moments"? (A moment-computation sketch follows.)
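
A sketch (ours) of the displayed moment computation on simulated data; in expectation the matrix is a multiple of the identity plus 2 ∑_j β∗_j β∗_j⊤, so its top eigenvectors reveal span(β∗_1, …, β∗_k):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k, sigma = 5000, 4, 3, 0.1
B = rng.standard_normal((k, d))                    # rows: the signals beta*_j
X = rng.standard_normal((n, d))                    # x_i ~ N(0, I_d)
Y = X @ B.T + sigma * rng.standard_normal((n, k))  # row i holds the multiset Y_i
# The statistic from the slide: (1/n) sum_i (sum_{y in Y_i} y^2) x_i x_i^T.
w = (Y ** 2).sum(axis=1, keepdims=True)
M = (X * w).T @ X / n
# E[M] = (sum_j ||beta*_j||^2 + k sigma^2) I_d + 2 sum_j beta*_j beta*_j^T.
print(np.round(M, 2))
```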


Closing remarks and open problems

Learning without correspondence is challenging for computation and statistics.

  • Computational and information-theoretic hardness show a striking contrast to "known correspondence" settings.
  • New (and unexpected?) algorithmic techniques in worst-case and average-case settings.
  • Open problems: Close the gap between SNR lower and upper bounds? Lower bounds for correspondence retrieval? Faster/more robust algorithms? (Smoothed) analysis of alternating minimization?

Acknowledgements

Collaborators: Alexandr Andoni (Columbia), Kevin Shi (Columbia), Xiaorui Sun (Microsoft Research).
Funding: NSF (DMR-1534910, IIS-1563785), Sloan Research Fellowship, Bloomberg Data Science Research Grant.
Hospitality: Simons Institute for the Theory of Computing (UCB).

Thank you

Beating brute-force search: "realizable" case

"Realizable" case: Suppose there exist β⋆ ∈ R^d and π⋆ ∈ S_n s.t.

    y_{π⋆(i)} = x_i⊤ β⋆,   i ∈ [n].

The solution is determined by the action of π⋆ on d points (assume dim(span(x_i)_{i=1}^d) = d).

Algorithm:

  • Find a subset of d linearly independent points x_{i_1}, x_{i_2}, …, x_{i_d}.
  • "Guess" the values of π⋆(i_j), j ∈ [d].
  • Solve the linear system y_{π⋆(i_j)} = x_{i_j}⊤ β, j ∈ [d], for β ∈ R^d.
  • To check correctness of β̂: compute ŷ_i := x_i⊤ β̂, i ∈ [n], and check if min_{π∈S_n} ∑_{i=1}^n (y_{π(i)} − ŷ_i)² = 0.

"Guess" means "enumerate over the (n choose d) choices"; the rest is poly(n, d). (A code sketch follows.)
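
A sketch (ours) of this enumeration for small n and d, assuming the first d rows of X are linearly independent:

```python
import itertools
import numpy as np

def realizable_fit(X, y):
    """n^{O(d)} enumeration for the realizable case (sketch)."""
    n, d = X.shape
    Xd = X[:d]                           # assumed linearly independent
    y_sorted = np.sort(y)
    for labels in itertools.permutations(range(n), d):  # "guess" pi*(i_j)
        beta = np.linalg.solve(Xd, y[list(labels)])
        # correctness check: min over pi of sum_i (y_{pi(i)} - yhat_i)^2 is 0
        # iff the multisets {yhat_i} and {y_i} coincide, i.e. sorted arrays match
        if np.allclose(np.sort(X @ beta), y_sorted):
            return beta
    return None
```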

Beating brute-force search: general case

General case: the solution may not be determined by only d points. But, for any RHS b ∈ R^n, there exist x_{i_1}, x_{i_2}, …, x_{i_d} s.t. every β̂ ∈ arg min_{β∈R^d} ∑_{j=1}^d (x_{i_j}⊤ β − b_{i_j})² satisfies

    ∑_{i=1}^n (x_i⊤ β̂ − b_i)² ≤ (d + 1) · min_{β∈R^d} ∑_{i=1}^n (x_i⊤ β − b_i)².

  ⇒ n^{O(d)}-time algorithm with approximation ratio d + 1;
  ⇒ n^{Õ(d/ϵ)}-time algorithm with approximation ratio 1 + ϵ.

Better way to get 1 + ϵ: exploit first-order optimality conditions (i.e., "normal equations") and ϵ-nets. Overall time: (n/ϵ)^{O(k)} + poly(n, d) for k = dim(span(x_i)_{i=1}^n).

Lower bound proof sketch

We show that no estimator can confidently distinguish between β∗ = e_1 and β∗ = −e_1, where e_1 = (1, 0, …, 0)⊤. Let P_{β∗} be the data distribution with parameter β∗ ∈ {e_1, −e_1}.

Task: show P_{e_1} and P_{−e_1} are "close", then appeal to Le Cam's standard "two-point argument":

    max_{β∗∈{e_1,−e_1}} E_{P_{β∗}} ∥β̂ − β∗∥_2 ≥ 1 − ∥P_{e_1} − P_{−e_1}∥_tv.

Key idea: the conditional means of {y_i}_{i=1}^n given (x_i)_{i=1}^n, under P_{e_1} and P_{−e_1}, are close as unordered multi-sets.
slide-122
SLIDE 122

Proof sketch (continued)

Generative process for Pβ∗:

  • 1. Draw

1 iid

1 1 ,

1 iid 2 .

  • 2. Set

for .

  • 3. Set

for , where

1 2

. Conditional distribution of

1 2

given

1:

Under

1:

1 2

Under

1:

1 2

where

1 2

and

1 1 .

Data processing: Lose information by going from to

1. 38

slide-123
SLIDE 123

Proof sketch (continued)

Generative process for Pβ∗:

  • 1. Draw (xi)n

i=1 iid

∼ Uniform([−1, 1]d), (εi)n

i=1 iid

∼ N(0, σ2).

  • 2. Set

for .

  • 3. Set

for , where

1 2

. Conditional distribution of

1 2

given

1:

Under

1:

1 2

Under

1:

1 2

where

1 2

and

1 1 .

Data processing: Lose information by going from to

1. 38

slide-124
SLIDE 124

Proof sketch (continued)

Generative process for Pβ∗:

  • 1. Draw (xi)n

i=1 iid

∼ Uniform([−1, 1]d), (εi)n

i=1 iid

∼ N(0, σ2).

  • 2. Set ui := x⊤

i β∗ for i ∈ [n].

  • 3. Set

for , where

1 2

. Conditional distribution of

1 2

given

1:

Under

1:

1 2

Under

1:

1 2

where

1 2

and

1 1 .

Data processing: Lose information by going from to

1. 38

slide-125
SLIDE 125

Proof sketch (continued)

Generative process for Pβ∗:

  • 1. Draw (xi)n

i=1 iid

∼ Uniform([−1, 1]d), (εi)n

i=1 iid

∼ N(0, σ2).

  • 2. Set ui := x⊤

i β∗ for i ∈ [n].

  • 3. Set yi := u(i) + εi for i ∈ [n], where u(1) ≤ u(2) ≤ · · · ≤ u(n).

Conditional distribution of

1 2

given

1:

Under

1:

1 2

Under

1:

1 2

where

1 2

and

1 1 .

Data processing: Lose information by going from to

1. 38

slide-126
SLIDE 126

Proof sketch (continued)

Generative process for Pβ∗:

  • 1. Draw (xi)n

i=1 iid

∼ Uniform([−1, 1]d), (εi)n

i=1 iid

∼ N(0, σ2).

  • 2. Set ui := x⊤

i β∗ for i ∈ [n].

  • 3. Set yi := u(i) + εi for i ∈ [n], where u(1) ≤ u(2) ≤ · · · ≤ u(n).

Conditional distribution of y = (y1, y2, . . . , yn) given (xi)n

i=1:

Under Pe1: y | (xi)n

i=1 ∼ N(u↑, σ2In)

Under P−e1: y | (xi)n

i=1 ∼ N(−u↓, σ2In)

where u↑ = (u(1), u(2), . . . , u(n)) and u↓ = (u(n), u(n−1), . . . , u(1)). Data processing: Lose information by going from to

1. 38

slide-127
SLIDE 127

Proof sketch (continued)

Generative process for Pβ∗:

  • 1. Draw (xi)n

i=1 iid

∼ Uniform([−1, 1]d), (εi)n

i=1 iid

∼ N(0, σ2).

  • 2. Set ui := x⊤

i β∗ for i ∈ [n].

  • 3. Set yi := u(i) + εi for i ∈ [n], where u(1) ≤ u(2) ≤ · · · ≤ u(n).

Conditional distribution of y = (y1, y2, . . . , yn) given (xi)n

i=1:

Under Pe1: y | (xi)n

i=1 ∼ N(u↑, σ2In)

Under P−e1: y | (xi)n

i=1 ∼ N(−u↓, σ2In)

where u↑ = (u(1), u(2), . . . , u(n)) and u↓ = (u(n), u(n−1), . . . , u(1)). Data processing: Lose information by going from y to yin

i=1. 38

slide-128
SLIDE 128

Proof sketch (continued)

By data processing inequality, KL

(

Pe1(· | (xi)n

i=1), P−e1(· | (xi)n i=1)

)

≤ KL

(

N(u↑, σ2In), N(−u↓, σ2In)

)

2 2

2

2

2

2 2

Some computations show that

2 2

4 By conditioning + Pinsker’s inequality,

1 1

1 2 1 2 4

2 2

1 2 1 2

39

slide-129
SLIDE 129

Proof sketch (continued)

By data processing inequality, KL

(

Pe1(· | (xi)n

i=1), P−e1(· | (xi)n i=1)

)

≤ KL

(

N(u↑, σ2In), N(−u↓, σ2In)

)

= ∥u↑ − (−u↓)∥2

2

2σ2 2

2 2

Some computations show that

2 2

4 By conditioning + Pinsker’s inequality,

1 1

1 2 1 2 4

2 2

1 2 1 2

39

slide-130
SLIDE 130

Proof sketch (continued)

By data processing inequality, KL

(

Pe1(· | (xi)n

i=1), P−e1(· | (xi)n i=1)

)

≤ KL

(

N(u↑, σ2In), N(−u↓, σ2In)

)

= ∥u↑ − (−u↓)∥2

2

2σ2 = SNR 2 · ∥u↑ + u↓∥2

2 .

Some computations show that

2 2

4 By conditioning + Pinsker’s inequality,

1 1

1 2 1 2 4

2 2

1 2 1 2

39

slide-131
SLIDE 131

Proof sketch (continued)

By data processing inequality, KL

(

Pe1(· | (xi)n

i=1), P−e1(· | (xi)n i=1)

)

≤ KL

(

N(u↑, σ2In), N(−u↓, σ2In)

)

= ∥u↑ − (−u↓)∥2

2

2σ2 = SNR 2 · ∥u↑ + u↓∥2

2 .

Some computations show that med ∥u↑ + u↓∥2

2 ≤ 4 .

By conditioning + Pinsker’s inequality,

1 1

1 2 1 2 4

2 2

1 2 1 2

39

slide-132
SLIDE 132

Proof sketch (continued)

By data processing inequality, KL

(

Pe1(· | (xi)n

i=1), P−e1(· | (xi)n i=1)

)

≤ KL

(

N(u↑, σ2In), N(−u↓, σ2In)

)

= ∥u↑ − (−u↓)∥2

2

2σ2 = SNR 2 · ∥u↑ + u↓∥2

2 .

Some computations show that med ∥u↑ + u↓∥2

2 ≤ 4 .

By conditioning + Pinsker’s inequality, ∥Pe1 − P−e1∥tv ≤ 1 2 + 1 2 med

SNR 4 · ∥u↑ + u↓∥2

2 ≤ 1

2 + 1 2 √ SNR .

39
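
A quick simulation (ours) of the "some computations" step for β∗ = e_1: the median of ∥u↑ + u↓∥_2² sits just below 4, consistent with the bound:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 200
vals = []
for _ in range(trials):
    u = np.sort(rng.uniform(-1.0, 1.0, size=n))   # u-up: sorted u_i = x_{i,1}
    vals.append(np.sum((u + u[::-1]) ** 2))       # u-down is u-up reversed
print(np.median(vals))                            # empirically just below 4
```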

Result on exact recovery

Theorem (H., Shi, & Sun, 2017). Fix any β∗ ∈ R^d and π∗ ∈ S_n, and assume n ≥ d. Suppose (x_i)_{i=0}^n are drawn iid from N(0, I_d), and (y_i)_{i=0}^n satisfy

    y_0 = x_0⊤ β∗;   y_i = x_{π∗(i)}⊤ β∗,   i = 1, …, n.

There is a poly(n, d)-time‡ algorithm that, given inputs (x_i)_{i=0}^n and (y_i)_{i=0}^n, returns π∗ and β∗ with high probability.

‡Assuming the problem is appropriately discretized.
