Time-Space Tradeoffs for Two-Pass Learning - Sumegha Garg (Princeton) - PowerPoint Presentation



SLIDE 1

Time-Space Tradeoffs for Two-Pass Learning

Joint Work with Ran Raz (Princeton) and Avishay Tal (UC Berkeley)

Sumegha Garg (Princeton)

SLIDE 2

[Shamir 14], [Steinhardt-Valiant-Wager 15]

Initiated the study of memory-samples lower bounds for learning: can one prove unconditional lower bounds on the number of samples needed for learning under memory constraints, when the samples are viewed one by one (also known as online learning)?

SLIDE 3

When two-passes are allowed?

Can one prove unconditional lower bounds on the number of samples needed for learning under memory constraints, when the learner is allowed to go over the stream of samples twice (in the same order)?

SLIDE 4

Toy-Example: Parity Learning

SLIDE 5

Parity Learning

x ∈_R {0,1}^n is unknown (chosen uniformly at random). A learner tries to learn x from (a_1, b_1), (a_2, b_2), …, (a_m, b_m), where for every t, a_t ∈_R {0,1}^n and b_t = <a_t, x> (inner product mod 2). In other words, the learner gets random linear equations in x_1, x_2, …, x_n, one by one, and needs to solve them.
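As a concrete illustration (my own sketch, not from the talk), the sample stream above can be generated as follows; `parity_stream` is a hypothetical helper name:

```python
import random

def parity_stream(x, m):
    # Yield m samples (a_t, b_t) with a_t uniform in {0,1}^n
    # and b_t = <a_t, x> mod 2, as in the parity-learning setup.
    n = len(x)
    for _ in range(m):
        a = [random.randint(0, 1) for _ in range(n)]
        b = sum(ai * xi for ai, xi in zip(a, x)) % 2
        yield a, b

# Example: a stream of random linear equations in x_1, ..., x_n.
x = [1, 0, 1, 1]
samples = list(parity_stream(x, 5))
```

Each sample is one random linear equation; the learner sees them one by one.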
SLIDE 6

Parity Learners

  • Solve n independent linear equations (Gaussian elimination)
    β—‹ O(n) samples and O(n^2) memory bits
  • Try all possibilities of x
    β—‹ O(n) memory bits but an exponential number of samples

SLIDE 7

Parity Learning (Two-pass)

x ∈_R {0,1}^n is unknown. A learner tries to learn x from (a_1, b_1), (a_2, b_2), …, (a_m, b_m), (a_1, b_1), (a_2, b_2), …, (a_m, b_m), where for every t, a_t ∈_R {0,1}^n and b_t = <a_t, x> (inner product mod 2). That is, the learner sees the same stream of samples twice, in the same order.

SLIDE 8

Raz's Breakthrough '16 (One-pass)

Any algorithm for parity learning of size n requires either Ω(n^2) memory bits or an exponential number of samples. Conjectured by Steinhardt, Valiant and Wager [2015].

SLIDE 9

Subsequent Results (One-pass)

  • [Kol-Raz-Tal '17]: generalization to sparse parities
  • [Raz '17, Moshkovitz-Moshkovitz '17, Moshkovitz-Tishby '17, Moshkovitz-Moshkovitz '18, Garg-Raz-Tal '18, Beame-Gharan-Yang '18]: generalization to a larger class of problems
  • [Sharan-Sidford-Valiant '19]: generalization to real-valued learning

SLIDE 10

Related Results (Multiple-pass)

[Dagan-Shamir '18, Assadi-Chen-Khanna '19, …]: use communication complexity (a quite different technique, giving at most a polynomial bound on the number of samples)
SLIDE 11

Motivation

Learning theory, bounded-storage cryptography, complexity theory.

Combined with [Barrington '89], proving super-polynomial lower bounds on the time needed to compute a function by a branching program of width 5, with polynomially many passes over the input, would imply super-polynomial lower bounds on formula size.

Technically challenging: the previous techniques rely heavily on the fact that in the one-pass case all the samples are independent.

SLIDE 12

Our Result

SLIDE 13

Our Result for Parity Learning

Any two-pass algorithm for parity learning of size n requires either Ω(n^1.5) memory bits or 2^{Ω(√n)} samples (no matching upper bound is known).

SLIDE 14

Learning Problem as a Matrix

𝐡, π‘Œ : finite sets, 𝑁: π΅Γ—π‘Œ β†’ {βˆ’1,1} : a matrix 𝑦 ∈# π‘Œ is unknown. A learner tries to learn 𝑦 from a stream (𝑏*, 𝑐*), 𝑏-, 𝑐- , … , 𝑏/, 𝑐/ , (𝑏*, 𝑐*), 𝑏-, 𝑐- , … , (𝑏/, 𝑐/), where βˆ€π‘’ : 𝑏2 ∈# 𝐡 and 𝑐2 = 𝑁(𝑏2, 𝑦) π‘Œ : concept class = 0,1 ' 𝐡 : possible samples = 0,1 'F

SLIDE 15

Generalized Result

Assume that any submatrix of M with at least 2^{-k}|A| rows and at least 2^{-ℓ}|X| columns has bias of at most 2^{-r}. Then: any two-pass algorithm requires either Ω(k · min{k, ℓ}) memory bits or 2^{Ω(min{k, ℓ, r})} samples.

In contrast, [GRT '18] proved: any one-pass algorithm requires either Ω(k · ℓ) memory bits or 2^{Ω(r)} samples.
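To build intuition for the bias assumption, one can measure submatrix biases of the inner-product matrix directly for small n (a hedged numeric sketch of my own; `bias` is a hypothetical helper). The full matrix has bias exactly 2^{-n}, since only the all-zeros row contributes, and large random submatrices stay nearly unbiased.

```python
from itertools import product
import random

def bias(M_rows, rows, cols):
    # |average entry| of the submatrix with the given rows and columns.
    total = sum(M_rows[i][j] for i in rows for j in cols)
    return abs(total) / (len(rows) * len(cols))

n = 4
pts = list(product([0, 1], repeat=n))
# Inner-product matrix: entry (-1)^{<a, x> mod 2}.
M_rows = [[1 - 2 * (sum(u * v for u, v in zip(a, x)) % 2) for x in pts]
          for a in pts]

idx = list(range(len(pts)))
full = bias(M_rows, idx, idx)  # exactly 2^{-n}: only the a = 0 row survives
random.seed(0)
rows = random.sample(idx, len(idx) // 2)
cols = random.sample(idx, len(idx) // 2)
sub = bias(M_rows, rows, cols)  # half-size random submatrices: still small
```

This is the extractor property the lower bound assumes, checked by brute force rather than proved.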

SLIDE 16

Branching Program (length m, width d, 2-pass)

Each layer represents a time step. Each vertex represents a memory state of the learner (d = 2^{#memory bits}). Each non-leaf vertex has 2^{n+1} outgoing edges, one for each (a, b) ∈ {0,1}^n Γ— {-1, 1}.

[Figure: a branching program of length m and width d; edges labeled by samples (a_1, b_1), …, (a_m, b_m) in Pass 1 and again in Pass 2]

SLIDE 17

Branching Program (length m, width d, 2-pass)

The samples (a_1, b_1), …, (a_m, b_m), (a_1, b_1), …, (a_m, b_m) define a computation-path. Each vertex v in the last layer is labeled by some x̃_v ∈ {0,1}^n. The output is the label x̃_v of the vertex reached by the path.


SLIDE 18

Brief Overview of One-Pass Lower Bound [GRT’18]

ℙ_{x|v} = the distribution of x conditioned on the event that the computation-path reaches v

Significant vertices: v s.t. ||ℙ_{x|v}||_2 ≥ 2^ℓ · 2^{-n}

Pr[v] = the probability that the path reaches v

GRT proves: if v is significant, then Pr[v] ≤ 2^{-Ω(k·ℓ)}. Hence, there must be at least 2^{Ω(k·ℓ)} significant vertices to output the correct answer with high probability.
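To see why the ℓ2 norm of the learner's posterior over x measures progress, here is a hedged sketch of my own (not from the talk): conditioning on t linearly independent parity equations leaves x uniform over a set of size 2^{n-t}, so the norm grows from 2^{-n/2} toward 1 as the learner narrows down x.

```python
from itertools import product

def posterior_l2(equations, n):
    # l2 norm of the posterior of x: uniform over all x in {0,1}^n
    # consistent with the parity equations seen so far.
    consistent = [x for x in product([0, 1], repeat=n)
                  if all(sum(u * v for u, v in zip(a, x)) % 2 == c
                         for a, c in equations)]
    # A uniform distribution over a set S has l2 norm 1/sqrt(|S|).
    return 1.0 / len(consistent) ** 0.5

n = 4
# Before any samples: uniform over {0,1}^n, norm 2^{-n/2}.
start = posterior_l2([], n)
# Each independent equation halves the consistent set,
# multiplying the norm by sqrt(2).
after_two = posterior_l2([([1, 0, 0, 0], 1), ([0, 1, 0, 0], 0)], n)
```

A "significant" vertex is one whose posterior norm has already grown well past the uniform baseline.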

SLIDE 19

Brief Overview of One-Pass Lower Bound

ℙ_{x|v} = the distribution of x conditioned on the event that the computation-path reaches v

Pr[v] = the probability that the path reaches v under T

T = the same as the computation-path, but stops when "atypical" things happen (traversing a bad edge and …)

Bad edges: a s.t. |(M · ℙ_{x|v})(a)| ≥ 2^{-r}

Pr[T stopped] is exponentially small (this uses a ∈_R {0,1}^n!)

SLIDE 20

Difficulties for Two-Passes (1)

ℙ_{a|v} ≠ Uniform({0,1}^n) for v in Part 2

For example, the BP may remember a_1. Therefore, the probability of traversing a "bad edge" may not be small.

Bad edges: a s.t. |(M · ℙ_{x|v})(a)| ≥ 2^{-r} (such an edge gives too much information about x)

Save: the BP can't remember too many a's. New stopping rules!

SLIDE 21

Difficulties for Two-Passes (2)

Proving that if v is significant then Pr[v] ≤ 2^{-Ω(k·ℓ)} uses a ∈_R {0,1}^n along with the extractor property. Save: work on the product of the 2 parts, which is read-once. New stopping rules!

SLIDE 22

Product of 2 Parts (length m, width d^2, 1-pass)

Let v_0 be the start vertex of the 2-pass BP, and let v_1, v_2 be the vertices reached at the end of Part 1 and Part 2, respectively. Then v_0 β†’ v_1 β†’ v_2 ≑ (v_0, v_1) β†’ (v_1, v_2).

[Figure: two parallel runs on the same stream (a, b), with states v_0, v_1, v_2 and v'_0, v'_1, v'_2, paired into product states (v_0, v'_0), (v_1, v'_1), (v_1, v'_2), (v_2, v'_1), (v_2, v'_2) over layers C and C']
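The composition above can be made concrete with a toy simulation (my own, hypothetical construction, not the actual branching program): given the state v_1 reached at the end of Pass 1, both passes can be replayed by a single read-once program over pairs of states, stepping both coordinates on each sample.

```python
def run(delta, v, stream):
    # Run a width-bounded program with transition function delta from state v.
    for sample in stream:
        v = delta(v, sample)
    return v

def run_product(delta, v0, v1, stream):
    # Read-once simulation of the two passes: start at (v0, v1) and
    # advance both coordinates on each sample. If v1 is the true
    # Pass-1 endpoint, the simulation ends at (v1, v2).
    s, t = v0, v1
    for sample in stream:
        s, t = delta(s, sample), delta(t, sample)
    return s, t

# Toy (hypothetical) transition: state = running sum of labels b_t mod 8.
delta = lambda v, sample: (v + sample[1]) % 8
stream = [((0, 1), 1), ((1, 0), 0), ((1, 1), 1), ((0, 0), 1)]
v0 = 0
v1 = run(delta, v0, stream)                  # end of Pass 1
v2 = run(delta, v1, stream)                  # end of Pass 2
paired = run_product(delta, v0, v1, stream)  # equals (v1, v2)
```

The product program has width d^2 (one state per pair), which is why the one-pass machinery can be reused at the cost of squaring the width.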

SLIDE 23

Proof Outline: Stopping Rules for Product

Significant vertices: (v, v') s.t. ||ℙ_{x | (v_0, v_1) β†’ (v, v')}||_2 ≥ 2^ℓ · 2^{-n}

Bad edges: a s.t. |(M · ℙ_{x | (v_0, v_1) β†’ (v, v')})(a)| ≥ 2^{-r}

High-probability edges: a s.t. Pr[a | v_0 β†’ v β†’ v_1 β†’ v'] ≥ 2^k · 2^{-n}

…

Stop at bad edges, unless they are high-probability edges, unless those are very bad.

SLIDE 24

Proof Outline: Stopping Rules for Product

Conditioned on v_0 β†’ v β†’ v_1 β†’ v', Pr[stopped] is small (at most 1/100). Note that the event v_0 β†’ v β†’ v_1 β†’ v' is not the same as (v_0, v_1) β†’ (v, v'). Proved using the single-pass result as a subroutine.

SLIDE 25

Open Problems

  • Generalize to multiple passes
  • Better lower bounds for two passes
  • Non-trivial upper bounds for a constant or linear number of passes
SLIDE 26

Thank You!

Anyone want a second pass?