SLIDE 1


Communication Lower Bounds for Statistical Estimation Problems via a Distributed Data Processing Inequality

DIMACS Workshop on Big Data through the Lens of Sublinear Algorithms, Aug 28, 2015

Mark Braverman Ankit Garg Tengyu Ma Huy Nguyen David Woodruff

SLIDE 2

Distributed mean estimation

Statistical estimation:

– Unknown parameter θ.
– Inputs to machines: i.i.d. data points ∼ P_θ.
– Output: an estimator θ̂.

Objectives:

– Low communication C = |Π|.
– Small loss R ≔ 𝔼‖θ̂ − θ‖².

[Figure: “Big Data! Distributed storage and processing”: machines, each holding a small dataset, communicate over a shared blackboard to estimate θ. A simulation sketch follows below.]
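As a concrete illustration of this setup, here is a minimal Python sketch (hypothetical parameter values, not from the talk) of the naive one-dimensional protocol in which every machine posts its local mean at full precision:

    import numpy as np

    rng = np.random.default_rng(0)
    theta, sigma = 0.7, 1.0   # unknown 1-D parameter and noise level
    m, n = 100, 50            # machines, samples per machine

    # Each machine sees n i.i.d. samples ~ N(theta, sigma^2) and posts
    # its local mean on the blackboard (a full-precision message).
    local_means = [rng.normal(theta, sigma, n).mean() for _ in range(m)]

    # The estimator averages the messages; its squared loss concentrates
    # around the statistical limit sigma^2 / (m * n).
    theta_hat = np.mean(local_means)
    print((theta_hat - theta) ** 2, sigma**2 / (m * n))

The question of the talk is how far such full-precision messages can be compressed before the loss degrades.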

SLIDE 3

Distributed sparse Gaussian mean estimation

  • Ambient dimension d.
  • Sparsity parameter k: ‖θ‖₀ ≤ k.
  • Number of machines m.
  • Each machine holds n samples.
  • Standard deviation σ.
  • Thus each sample is a vector X ∼ (N(θ₁, σ²), …, N(θ_d, σ²)) ∈ ℝ^d.

Goal: estimate (θ₁, …, θ_d).

SLIDE 4
  • Ambient dimension d: higher makes estimation harder.
  • Sparsity parameter k (‖θ‖₀ ≤ k): higher makes estimation harder.
  • Number of machines m: higher makes estimation easier.
  • Samples per machine n: higher makes estimation easier.*
  • Standard deviation σ: higher makes estimation harder.

Goal: estimate (θ₁, …, θ_d).

SLIDE 5

Distributed sparse Gaussian mean estimation

  • Main result: if the protocol communicates C bits (|Π| = C), then

      R ≥ Ω( max( σ²dk/(nC), σ²k/(nm) ) ).

  • Tight up to a log d factor [GMN14]; tight up to a constant factor in the dense case.
  • For statistically optimal performance, C ≳ md (not mk) is needed! (See the worked inequality below.)
  • Notation: d – dimension, k – sparsity, m – machines, n – samples each, σ – deviation, R – squared loss.
  • The second term, σ²k/(nm), is the statistical limit.
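To see where the C ≳ md threshold comes from, ask when the communication-dependent term stops dominating the statistical limit:

    \[
      \frac{\sigma^2 d k}{n C} \;\le\; \frac{\sigma^2 k}{n m}
      \quad\Longleftrightarrow\quad
      C \;\ge\; m d ,
    \]

so the loss reaches the statistical limit only once the total communication is at least about md bits, no matter how small the sparsity k is.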

SLIDE 6

Prior work (partial list)

  • [Zhang-Duchi-Jordan-Wainwright’13]: the case d = 1 with general communication, and the dense case for simultaneous-message protocols.
  • [Shamir’14]: implies the result for k = 1 in a restricted communication model.
  • [Duchi-Jordan-Wainwright-Zhang’14, Garg-Ma-Nguyen’14]: the dense case (up to logarithmic factors).
  • A lot of recent work on communication-efficient distributed learning.

SLIDE 7

Reduction from Gaussian mean detection

  • Recall: R ≥ Ω( max( σ²dk/(nC), σ²k/(nm) ) ).
  • Gaussian mean detection:
    – A one-dimensional problem.
    – Goal: distinguish between ν₀ = N(0, σ²) and ν₁ = N(ε, σ²).
    – Each player gets n samples.

SLIDE 8
  • Assume, for contradiction, that R ≪ max( σ²dk/(nC), σ²k/(nm) ).
  • Distinguish between ν₀ = N(0, σ²) and ν₁ = N(ε, σ²).
  • Theorem: if we can attain R ≤ (1/16)·kε² in the estimation problem using C communication, then we can solve the detection problem at ∼ C/d min-information cost.
  • Using ε² ≪ σ²d/(Cn), we get detection at ≪ σ²/(nε²) min-information cost (arithmetic spelled out below).
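The parameter choice in the last bullet is pure rearrangement:

    \[
      \varepsilon^2 \;\ll\; \frac{\sigma^2 d}{C n}
      \quad\Longleftrightarrow\quad
      \frac{C}{d} \;\ll\; \frac{\sigma^2}{n \varepsilon^2} ,
    \]

so a too-cheap estimation protocol would yield a detection protocol of min-information cost ∼ C/d ≪ σ²/(nε²), which the next slides rule out.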

SLIDE 9

The detection problem

  • Distinguish between ν₀ = N(0, 1) and ν₁ = N(ε, 1).
  • Each player gets n samples.
  • Want this to be impossible at ≪ 1/(nε²) min-information cost.

SLIDE 10

The detection problem

  • Each player can replace its n unit-variance samples by their average, which is a sufficient statistic. Equivalently:
  • Distinguish between ν₀ = N(0, 1/n) and ν₁ = N(ε, 1/n).
  • Each player gets one sample.
  • Want this to be impossible at ≪ 1/(nε²) min-information cost.

SLIDE 11

The detection problem

  • Rescale everything by √n (replacing ε with ε√n, then renaming ε√n back to ε); see the normalization below.
  • Distinguish between ν₀ = N(0, 1) and ν₁ = N(ε, 1).
  • Each player gets one sample.
  • Want this to be impossible at ≪ 1/ε² min-information cost.
  • This is tight (for m large enough; otherwise the task is impossible).
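Both reductions (this slide and the previous one) are standard Gaussian normalizations. Writing X^{(1)}, …, X^{(n)} for one player's samples under hypothesis V:

    \[
      \bar X = \frac{1}{n}\sum_{t=1}^{n} X^{(t)} \sim N(\varepsilon V,\, 1/n)
      \quad\text{(sufficient statistic)},
      \qquad
      \sqrt{n}\,\bar X \sim N(\varepsilon\sqrt{n}\,V,\, 1),
    \]

and with ε′ ≔ ε√n the target bound 1/(nε²) becomes exactly 1/ε′².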
SLIDE 12

Information cost

[Figure: V determines ν_V = N(εV, 1); players receive inputs X₁ ∼ ν_V, X₂ ∼ ν_V, …, X_m ∼ ν_V and communicate over a blackboard with transcript Π.]

IC(π) ≔ I(Π; X₁X₂…X_m)

SLIDE 13

Min-Information cost

[Figure: same setup; V, ν_V = N(εV, 1), inputs X₁, …, X_m ∼ ν_V, blackboard transcript Π.]

minIC(π) ≔ min_{v∈{0,1}} I(Π; X₁X₂…X_m | V = v)

SLIDE 14

Min-Information cost

π‘›π‘—π‘œπ½π· 𝜌 ≔ min

π‘€βˆˆ{0,1} 𝐽(Ξ ; π‘Œ1π‘Œ2 … π‘Œπ‘›|π‘Š = 𝑀)

  • We will want this quantity to be Ξ©

1 πœ€2 .

  • Warning: it is not the same thing as

𝐽(Ξ ; π‘Œ1π‘Œ2 … π‘Œπ‘›|π‘Š)= π”½π‘€βˆΌπ‘Š 𝐽(Ξ ; π‘Œ1π‘Œ2 … π‘Œπ‘›|π‘Š = 𝑀)

because one case can be much smaller than the

  • ther.
  • In our case, the need to use π‘›π‘—π‘œπ½π· instead of

𝐽𝐷 happens because of the sparsity.

14
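A toy illustration (hypothetical, not from the talk) of how the min can differ drastically from the average: under v = 0 the input is a constant, under v = 1 it is a fair bit, and the “protocol” simply announces the input.

    from math import log2

    def H(p):  # binary entropy in bits; H(0) = H(1) = 0
        return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

    # Chain V -> X -> Pi with Pi = X (the player announces its input):
    # under V=0 the input X is the constant 0; under V=1 it is a fair bit.
    # Since Pi = X, I(Pi; X | V = v) = H(X | V = v).
    I_given_0 = H(0.0)   # X constant  -> 0 bits
    I_given_1 = H(0.5)   # X fair coin -> 1 bit

    print("avg over v:", 0.5 * (I_given_0 + I_given_1))  # 0.5 bits
    print("min over v:", min(I_given_0, I_given_1))      # 0.0 bits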

SLIDE 15

Strong data processing inequality

[Figure: V, ν_V = N(εV, 1); inputs X₁ ∼ ν_V, X₂ ∼ ν_V, …, X_m ∼ ν_V; blackboard transcript Π.]

Fact: |Π| ≥ I(Π; X₁X₂…X_m) = Σ_i I(Π; X_i | X_{<i})   (chain rule).
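The first inequality is simple entropy counting, and the equality is the chain rule for mutual information:

    \[
      I(\Pi; X_1 X_2 \cdots X_m) \;\le\; H(\Pi) \;\le\; |\Pi| \ \text{bits}.
    \]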

SLIDE 16

Strong data processing inequality

  • ν_V = N(εV, 1); suppose V ∼ B_{1/2}.
  • For each i, V − X_i − Π is a Markov chain.
  • Intuition: “X_i contains little information about V; there is no way to learn that information except by learning a lot about X_i.”
  • Data processing: I(V; Π) ≤ I(X_i; Π).
  • Strong data processing: I(V; Π) ≤ γ · I(X_i; Π) for some γ = γ(ν₀, ν₁) < 1.

SLIDE 17

Strong data processing inequality

  • ν_V = N(εV, 1); suppose V ∼ B_{1/2}.
  • For each i, V − X_i − Π is a Markov chain.
  • Strong data processing: I(V; Π) ≤ γ · I(X_i; Π) for some γ = γ(ν₀, ν₁) < 1.
  • In this case (ν₀ = N(0, 1), ν₁ = N(ε, 1)):

      γ(ν₀, ν₁) ∼ I(V; sign(X_i)) / I(X_i; sign(X_i)) ∼ ε².

    (A numeric check follows below.)
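A quick numerical sanity check (my own, under the stated hypotheses ν₀ = N(0, 1), ν₁ = N(ε, 1)): the mutual information between V and the one-bit function sign(X) can be computed exactly, and it scales as ε²:

    from math import erf, log2, sqrt

    def Phi(x):   # standard normal CDF
        return 0.5 * (1 + erf(x / sqrt(2)))

    def H(p):     # binary entropy in bits
        return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

    # V ~ Bernoulli(1/2), X ~ N(eps * V, 1), S = sign(X) (one bit).
    # I(V; S) = H(S) - H(S | V); it should scale like eps^2 for small eps.
    for eps in (0.4, 0.2, 0.1, 0.05):
        p0, p1 = Phi(0.0), Phi(eps)   # P(X > 0 | V = 0), P(X > 0 | V = 1)
        I = H(0.5 * (p0 + p1)) - 0.5 * (H(p0) + H(p1))
        print(f"eps={eps:5.2f}  I(V; sign X)={I:.6f}  ratio I/eps^2={I / eps**2:.4f}")

The printed ratio stabilizes as ε shrinks, consistent with γ ∼ ε².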

SLIDE 18

β€œProof”

  • πœˆπ‘€ = π’ͺ πœ€π‘Š, 1 ; suppose π‘Š ∼ 𝐢1/2.
  • Strong Data Processing: 𝐽 π‘Š; Ξ  ≀ πœ€2 β‹… 𝐽 π‘Œπ‘—; Ξ 
  • We know 𝐽 π‘Š; Ξ  = Ξ©(1).

Ξ  β‰₯ 𝐽 Ξ ; π‘Œ1π‘Œ2 … π‘Œπ‘› ≳ 𝐽 Ξ ; π‘Œπ‘—

𝑗

β‰₯ 1 πœ€2 … "π½π‘œπ‘”π‘ Ξ  π‘‘π‘π‘œπ‘€π‘“π‘§π‘‘ 𝑏𝑐𝑝𝑣𝑒 π‘Š π‘’β„Žπ‘ π‘π‘£π‘•β„Ž π‘žπ‘šπ‘π‘§π‘“π‘  𝑗"

𝑗

≳

1

πœ€2 𝐽 π‘Š; Ξ  = Ξ© 1 πœ€2

18

Q.E.D!

SLIDE 19

Issues with the proof

  • The high-level idea is the right one.
  • Two main issues:
    – It is not clear how to deal with additivity over the coordinates.
    – We must deal with minIC instead of IC.

SLIDE 20

If the picture were this…

[Figure: V, ν_V = N(εV, 1); here only X₁ ∼ ν_V, while X₂, …, X_m ∼ ν₀; blackboard transcript Π.]

Then indeed I(Π; V) ≤ ε² · I(Π; X₁).

SLIDE 21

Hellinger distance

  • Solution to additivity: use the Hellinger distance (definition recalled below)

      h²(f, g) = ½ ∫_Ω ( √f(x) − √g(x) )² dx.

  • Following [Jayram’09]:

      h²(Π_{V=0}, Π_{V=1}) ∼ I(V; Π) = Ω(1).

  • h²(Π_{V=0}, Π_{V=1}) decomposes into the m single-player scenarios above, using the fact that Π is the transcript of a protocol.
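For reference (standard facts, not the talk's exact lemma), the two usual forms of the definition are

    \[
      h^2(f, g) \;=\; \tfrac{1}{2}\int_\Omega \big(\sqrt{f(x)} - \sqrt{g(x)}\big)^2\,dx
      \;=\; 1 - \int_\Omega \sqrt{f(x)\,g(x)}\,dx ,
    \]

and the decomposition across players relies on the product (“rectangle”) structure of protocol transcripts, in the style of [BJKS’04]; this is exactly the additivity that mutual information was missing here.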

SLIDE 22

π‘›π‘—π‘œπ½π·

  • Dealing with π‘›π‘—π‘œπ½π· is more technical. Recall:
  • π‘›π‘—π‘œπ½π· 𝜌 ≔ min

π‘€βˆˆ{0,1} 𝐽(Ξ ; π‘Œ1π‘Œ2 … π‘Œπ‘›|π‘Š = 𝑀)

  • Leads to our main technical statement:

β€œDistributed Strong Data Processing Inequality” Theorem: Suppose Ξ© 1 β‹… 𝜈0 ≀ 𝜈1 ≀ 𝑃 1 β‹… 𝜈0, and let 𝛾(𝜈0, 𝜈1) be the SDPI constant. Then β„Ž2 Ξ π‘Š=0, Ξ π‘Š=1 ≀ 𝑃 𝛾 𝜈0, 𝜈1 β‹… π‘›π‘—π‘œπ½π·(𝜌)

22

SLIDE 23

Putting it together

Theorem: Suppose Ω(1)·ν₀ ≤ ν₁ ≤ O(1)·ν₀, and let γ(ν₀, ν₁) be the SDPI constant. Then h²(Π_{V=0}, Π_{V=1}) ≤ O( γ(ν₀, ν₁) ) · minIC(π).

  • With ν₀ = N(0, 1), ν₁ = N(ε, 1), and γ ∼ ε², we get

      Ω(1) = h²(Π_{V=0}, Π_{V=1}) ≤ ε² · minIC(π).

  • Therefore minIC(π) = Ω(1/ε²).

SLIDE 24

Putting it together

Theorem: Suppose Ω(1)·ν₀ ≤ ν₁ ≤ O(1)·ν₀, and let γ(ν₀, ν₁) be the SDPI constant. Then h²(Π_{V=0}, Π_{V=1}) ≤ O( γ(ν₀, ν₁) ) · minIC(π).

  • With ν₀ = N(0, 1), ν₁ = N(ε, 1):
  • The condition Ω(1)·ν₀ ≤ ν₁ ≤ O(1)·ν₀ (which is essential!) fails!
  • Need an additional truncation step. Fortunately, the failure happens far out in the tails.

SLIDE 25

Summary

[Summary flowchart:]
  • Sparse Gaussian mean estimation reduces (via [ZDJW’13] plus a direct sum argument) to Gaussian mean detection, with n samples replaced by 1 and minIC in place of IC.
  • Hellinger distance + strong data processing: “only get ε² bits toward detection per bit of minIC” ⇒ an Ω(1/ε²) lower bound.
  • Distributed sparse linear regression inherits the lower bound by reduction.

SLIDE 26

Distributed sparse linear regression

  • Each machine gets n data points of the form (A_j, y_j), where

      y_j = ⟨A_j, θ⟩ + w_j,   w_j ∼ N(0, σ²).

  • Promised that θ is k-sparse: ‖θ‖₀ ≤ k.
  • Ambient dimension d.
  • Loss R = 𝔼‖θ̂ − θ‖².
  • How much communication is needed to achieve statistically optimal loss? (A data-model sketch follows below.)
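A minimal sketch of this data model in Python; the Gaussian design for the rows A_j is an illustrative assumption (the slide does not fix the design distribution):

    import numpy as np

    rng = np.random.default_rng(1)
    d, k, m, n, sigma = 200, 5, 20, 50, 1.0

    # A k-sparse ground truth theta (support and values chosen arbitrarily).
    theta = np.zeros(d)
    theta[rng.choice(d, size=k, replace=False)] = 1.0

    # Each machine holds n pairs (A_j, y_j) with y_j = <A_j, theta> + w_j.
    def machine_data():
        A = rng.normal(size=(n, d))                  # design rows A_j
        y = A @ theta + rng.normal(0.0, sigma, n)    # noise w_j ~ N(0, sigma^2)
        return A, y

    data = [machine_data() for _ in range(m)]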

SLIDE 27
Distributed sparse linear regression

  • Promised that θ is k-sparse (‖θ‖₀ ≤ k); ambient dimension d; loss R = 𝔼‖θ̂ − θ‖².
  • How much communication to achieve statistically optimal loss?
  • We get: C = Ω( m · min(n, d) ) (small k doesn’t help).
  • [Lee-Sun-Liu-Taylor’15]: under some conditions, C = O(m·d) suffices.

SLIDE 28

A new upper bound (time permitting)

  • For one-dimensional distributed Gaussian mean estimation (generalizes to d dimensions trivially).
  • For optimal statistical performance, Ω(m) communication is the lower bound.
  • We give a simple simultaneous-message upper bound of O(m).
  • Previously: multi-round O(m) [GMN’14], or simultaneous O(m log n) [folklore].

SLIDE 29

A new upper bound (time permitting)

(Stylized) main idea:

  • Each machine wants to send the empirical average y_i ∈ [0, 1] of its input.
  • Then the average ȳ = (1/m) Σ_{i=1}^{m} y_i is computed.
  • Instead of y_i, each machine sends a single bit b_i sampled from the Bernoulli distribution B_{y_i}.
  • Form the estimate ỹ = (1/m) Σ_{i=1}^{m} b_i.
  • This is “good enough” when var(y_i) ∼ 1. A code sketch follows below.
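A minimal Python sketch of this rounding trick (hypothetical parameters; the real protocol must handle ranges and bias more carefully than the clipping used here):

    import numpy as np

    rng = np.random.default_rng(2)
    theta, m, n = 0.37, 1000, 20   # true mean in [0, 1]; machines; samples each

    # Each machine computes its empirical average y_i and sends ONE bit
    # b_i ~ Bernoulli(y_i): an unbiased one-bit encoding of y_i.
    samples = rng.normal(theta, 0.1, size=(m, n))   # small noise keeps y_i in [0, 1] w.h.p.
    y = np.clip(samples.mean(axis=1), 0.0, 1.0)     # clipping: a simplification
    b = (rng.random(m) < y).astype(float)           # one bit per machine

    y_tilde = b.mean()   # E[y_tilde | y] = average of the y_i, so y_tilde ~ theta
    print(y_tilde, abs(y_tilde - theta))

One bit per machine matches the Ω(m) lower bound up to constants, which is the point of the construction.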

SLIDE 30

Open problems

  • Closing the gap for the sparse linear regression problem.
  • Other statistical questions in the distributed framework. More general theorems?
  • Can strong data processing be applied to the two-party Gap Hamming Distance problem?

SLIDE 31
  • http://csnexus.info/

Organizers

  • Mark Braverman (Princeton University)
  • Bobak Nazer (Boston University)
  • Anup Rao (University of Washington)
  • Aslan Tchamkerten, General Chair (Telecom Paristech)


SLIDE 32
  • http://csnexus.info/

Primary themes

  • Distributed Computation and Communication
  • Fundamental Inequalities and Lower Bounds
  • Inference Problems
  • Secrecy and Privacy


SLIDE 33

Institut Henri PoincarΓ©


http://csnexus.info/

SLIDE 34


Thank You!