Time-Space Tradeoffs for Two-Pass Learning
Sumegha Garg (Princeton)
Joint Work with Ran Raz (Princeton) and Avishay Tal (UC Berkeley)
[Shamir '14] and [Steinhardt-Valiant-Wager '15] initiated the study of memory-samples lower bounds for learning: can one prove unconditional lower bounds on the number of samples needed for learning under memory constraints, when samples are viewed one by one (also known as online learning)?
Can one prove unconditional lower bounds on the number of samples needed for learning under memory constraints when the learner is allowed to go over the stream of samples twice, in the same order?
$x \in_R \{0,1\}^n$ is unknown. A learner tries to learn $x$ from $(a_1, b_1), (a_2, b_2), \ldots, (a_m, b_m)$, where $\forall t$, $a_t \in_R \{0,1\}^n$ and $b_t = \langle a_t, x \rangle$ (inner product mod 2). In other words, the learner gets random linear equations in $x_1, x_2, \ldots, x_n$, revealed one by one.
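To make the streaming setup concrete, here is a minimal Python sketch (not from the talk; the names are illustrative) that draws a hidden $x$ and produces the sample stream:

```python
import random

def parity_stream(n, m, seed=0):
    """Hidden x in {0,1}^n and m samples (a_t, b_t) with
    b_t = <a_t, x> mod 2, as in the parity-learning problem."""
    rng = random.Random(seed)
    x = [rng.randrange(2) for _ in range(n)]           # unknown concept
    samples = []
    for _ in range(m):
        a = [rng.randrange(2) for _ in range(n)]       # uniform equation
        b = sum(ai * xi for ai, xi in zip(a, x)) % 2   # inner product mod 2
        samples.append((a, b))
    return x, samples
```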
Two baseline algorithms (see the sketch below for the first):
- $O(n)$ samples and $O(n^2)$ memory
- $O(n)$ memory but an exponential number of samples
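A hedged sketch of the first baseline, assuming the stream from `parity_stream` above: keep up to $n$ linearly independent equations (about $n^2$ bits) and solve by Gaussian elimination over GF(2). The bitmask encoding is an implementation choice of ours, not from the slides.

```python
def solve_parity(samples, n):
    """Recover x from random linear equations over GF(2), storing a
    reduced basis of at most n equations -- Theta(n^2) bits in total.
    Each equation is an (n+1)-bit mask: coefficients plus the RHS."""
    rows, pivots = [], []
    for a, b in samples:
        row = sum(bit << i for i, bit in enumerate(a)) | (b << n)
        for r, p in zip(rows, pivots):      # eliminate existing pivots
            if (row >> p) & 1:
                row ^= r
        coeffs = row & ((1 << n) - 1)
        if coeffs:                          # independent: new pivot
            p = coeffs.bit_length() - 1
            for j in range(len(rows)):      # keep the basis reduced
                if (rows[j] >> p) & 1:
                    rows[j] ^= row
            rows.append(row)
            pivots.append(p)
        if len(rows) == n:                  # full rank: read off x
            x = [0] * n
            for r, p in zip(rows, pivots):
                x[p] = (r >> n) & 1
            return x
    return None                             # stream ended too early
```

With $m = O(n)$ random equations the system is full rank with high probability, so this succeeds after $O(n)$ samples, at the cost of $\Theta(n^2)$ bits of memory.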
Two-pass parity learning: $x \in_R \{0,1\}^n$ is unknown. A learner tries to learn $x$ from $(a_1, b_1), \ldots, (a_m, b_m), (a_1, b_1), \ldots, (a_m, b_m)$, where $\forall t$, $a_t \in_R \{0,1\}^n$ and $b_t = \langle a_t, x \rangle$ (inner product mod 2).
Any algorithm for parity learning of size $n$ requires either $\Omega(n^2)$ memory bits or an exponential number of samples. Conjectured by Steinhardt, Valiant and Wager [2015]; proved by Raz [2016].
[Kol-Raz-Tal '17]: generalization to sparse parities. [Raz '17, Moshkovitz-Moshkovitz '17, Moshkovitz-Tishby '17, Moshkovitz-Moshkovitz '18, Garg-Raz-Tal '18, Beame-Gharan-Yang '18]: generalization to a larger class of problems. [Sharan-Sidford-Valiant '19]: generalization to real-valued learning.
[Dagan-Shamir '18, Assadi-Chen-Khanna '19, ...]: use communication complexity (a quite different technique, which gives at most a polynomial bound on the number of samples).
Motivation: learning theory, bounded-storage cryptography, complexity theory. Combined with [Barrington '89], proving super-polynomial lower bounds on the time needed to compute a function by a branching program of width 5, with polynomially many passes over the input, would imply super-polynomial lower bounds on formula size. Technically challenging: previous techniques rely heavily on the fact that in the one-pass case all the samples are independent.
Main theorem: any two-pass algorithm for parity learning of size $n$ requires either $\Omega(n^{1.5})$ memory bits or $2^{\Omega(\sqrt{n})}$ samples (no matching upper bound is known).
$A$, $X$: finite sets. $M : A \times X \to \{-1, 1\}$: a matrix. $x \in_R X$ is unknown. A learner tries to learn $x$ from a stream $(a_1, b_1), \ldots, (a_m, b_m), (a_1, b_1), \ldots, (a_m, b_m)$, where $\forall t$: $a_t \in_R A$ and $b_t = M(a_t, x)$. Here $X$: concept class $= \{0,1\}^n$; $A$: possible samples $= \{0,1\}^{n'}$.
Assume that any submatrix of $M$ with at least $2^{-k}|A|$ rows and at least $2^{-\ell}|X|$ columns has bias of at most $2^{-r}$. Then: any two-pass algorithm requires either $\Omega(k \cdot \min\{k, \ell\})$ memory bits or $2^{\Omega(\min\{k, \ell, r\})}$ samples.
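To illustrate the bias condition, here is a brute-force check for tiny parameters (the function names are ours, not the paper's):

```python
from itertools import product

def inner_product_matrix(n):
    """The parity-learning matrix M(a, x) = (-1)^(<a,x> mod 2)."""
    pts = list(product([0, 1], repeat=n))
    return pts, [[(-1) ** (sum(ai * xi for ai, xi in zip(a, x)) % 2)
                  for x in pts] for a in pts]

def bias(M, rows, cols):
    """|average of M over the submatrix rows x cols| -- the quantity
    the extractor assumption bounds by 2^{-r}."""
    s = sum(M[i][j] for i in rows for j in cols)
    return abs(s) / (len(rows) * len(cols))
```

For example, `pts, M = inner_product_matrix(4)` followed by `bias(M, range(16), range(16))` returns $2^{-4}$: the full inner-product matrix has exponentially small bias, reflecting that it is a good extractor.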
In contrast, [GRT '18] proved: any one-pass algorithm requires either $\Omega(k \cdot \ell)$ memory bits or $2^{\Omega(r)}$ samples.
The learner is modeled as a branching program. Each layer represents a time step. Each vertex represents a memory state of the learner (width $= 2^{\text{memory size}}$). Each non-leaf vertex has $2^{n'+1}$ outgoing edges, one for each $(a, b) \in \{0,1\}^{n'} \times \{-1, 1\}$.
The samples $(a_1, b_1), \ldots, (a_m, b_m), (a_1, b_1), \ldots, (a_m, b_m)$ define a computation path. Each vertex $v$ in the last layer is labeled by $\tilde{x}_v \in \{0,1\}^n$. The output is the label $\tilde{x}_v$ of the vertex reached by the path.
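A minimal sketch of how such a program consumes the stream (our own framing, assuming hypothetical `step` and `output` functions supplied by the learner); a two-pass learner is just `passes=2` on the same ordered stream:

```python
def run_learner(samples, init_state, step, output, passes=1):
    """Simulate a bounded-memory learner as a branching program:
    'state' is the current vertex, 'step' follows the outgoing edge
    labelled (a, b), and 'output' reads the label of the last vertex."""
    state = init_state
    for _ in range(passes):
        for a, b in samples:        # a second pass replays the same order
            state = step(state, a, b)
    return output(state)
```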
$\mathbb{P}_{x|v}$ = distribution of $x$ conditioned on the event that the computation path reaches $v$.
Significant vertices: $v$ s.t. $\|\mathbb{P}_{x|v}\|_2 \ge 2^{\ell} \cdot 2^{-n}$.
$\Pr(v)$ = probability that the path reaches $v$.
GRT proves: if $v$ is significant, then $\Pr(v) \le 2^{-\Omega(k \cdot \ell)}$. Hence, at least $2^{\Omega(k \cdot \ell)}$ significant vertices are needed to output the correct answer with high probability.
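The counting step left implicit above can be spelled out as follows (a sketch, with constants suppressed): answering correctly forces the path to reach some significant vertex with constant probability, so by a union bound

```latex
\Pr[\text{success}] \;\le\; \sum_{v \ \text{significant}} \Pr(v)
  \;\le\; \#\{\text{significant } v\} \cdot 2^{-\Omega(k\ell)},
```

and therefore the program must have $2^{\Omega(k\ell)}$ significant vertices, i.e. it uses $\log_2(\text{width}) = \Omega(k \cdot \ell)$ memory bits.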
Recall $\mathbb{P}_{x|v}$ = distribution of $x$ conditioned on the event that the computation path reaches $v$.
$\Pr(v)$ = probability that the path reaches $v$ under the truncated path $\mathcal{T}$: the same as the computation path, but it stops when "atypical" things happen (traversing a bad edge and ...).
Bad edges: $a$ s.t. $|(M \cdot \mathbb{P}_{x|v})(a)| \ge 2^{-r}$.
$\Pr[\mathcal{T} \text{ stopped}]$ is exponentially small (this uses $a_t \in_R \{0,1\}^{n'}$!).
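To unpack the bad-edge condition, a small sketch in our notation (`M` is any $\pm 1$ matrix given as a callable, `dist` a dictionary for $\mathbb{P}_{x|v}$) that flags the samples revealing too much about $x$ at a vertex:

```python
def find_bad_edges(A, X, M, dist, r):
    """Bad edges at a vertex v: samples a with
    |(M . P_{x|v})(a)| = |sum_x M(a, x) * P_{x|v}(x)| >= 2^{-r},
    i.e. observing (a, b) would shift the posterior on x too much."""
    threshold = 2.0 ** (-r)
    return [a for a in A
            if abs(sum(M(a, x) * dist.get(x, 0.0) for x in X)) >= threshold]
```

When $\mathbb{P}_{x|v}$ is spread out and $M$ satisfies the extractor assumption, only an exponentially small fraction of rows are bad, which is why $\Pr[\mathcal{T} \text{ stopped}]$ stays small in the one-pass case.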
Why is the second pass hard? $\mathbb{P}_{a|v}$ need not be close to $\mathrm{Uniform}(\{0,1\}^{n'})$ for $v$ in Part-2 (see the toy example below).
For example, the BP may remember $a_1$; therefore, the probability of traversing a "bad edge" may not be small. Bad edges: $a$ s.t. $|(M \cdot \mathbb{P}_{x|v})(a)| \ge 2^{-r}$ (such an edge gives too much information about $x$).
Save: the BP can't remember too many $a$'s. New stopping rules!
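A toy instance of the failure (our illustration, not from the slides): if the state $v$ at the start of Part-2 stores the first sample, then

```latex
\mathbb{P}_{a_1 \mid v}(a) \;=\; \mathbb{1}[\,a = a_1\,]
  \;\neq\; \mathrm{Uniform}(\{0,1\}^{n'}),
```

so the replayed sample $a_1$ is fully determined rather than fresh and uniform, and the one-pass stopping analysis no longer applies to it.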
Proving that if $v$ is significant then $\Pr(v) \le 2^{-\Omega(k \cdot \ell)}$ uses $a_t \in_R \{0,1\}^{n'}$ along with the extractor property. Save: work on the product of the two parts, which is read-once. New stopping rules!
Let $v_0$ be the start vertex of the 2-pass BP, and let $v_1, v_2$ be the vertices reached at the end of Part-1 and Part-2, respectively. Then $v_0 \to v_1 \to v_2 \;\equiv\; (v_0, v_1) \to (v_1, v_2)$.
[Figure: Part-1 program $C$ (vertices $v_0, v_1, v_2$) and Part-2 program $C'$ (vertices $v'_0, v'_1, v'_2$) combined into the product program with vertices $(v_0, v'_0), (v_1, v'_1), (v_1, v'_2), (v_2, v'_1), (v_2, v'_2)$.]
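A minimal sketch of the product construction (our framing, reusing the hypothetical `step` functions from `run_learner` above): a single read-once pass over the stream advances both parts in parallel:

```python
def product_step(step_part1, step_part2):
    """Transition of the product program C x C': a single sample
    (a, b) is fed simultaneously to a Part-1 state and a Part-2
    state, so the pair is updated by one read-once pass."""
    def step(state, a, b):
        v, vp = state
        return (step_part1(v, a, b), step_part2(vp, a, b))
    return step
```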
Significant vertices: $(v, v')$ s.t. $\|\mathbb{P}_{x \mid (v_0, v_1) \to (v, v')}\|_2 \ge 2^{\ell} \cdot 2^{-n}$. Bad edges: $a$ s.t. $|(M \cdot \mathbb{P}_{x \mid (v_0, v_1) \to (v, v')})(a)| \ge 2^{-r}$. High-probability edges: $a$ s.t. $\Pr[a \mid v_0 \to v \to v_1 \to v'] \ge 2^{k} \cdot 2^{-n'}$. ... Stop at bad edges, unless they are high-probability edges; stop even at high-probability edges if they are very bad.
Conditioned on $v_0 \to v \to v_1 \to v'$, $\Pr[\text{stopped}]$ is small ($\le 1/100$). Note $v_0 \to v \to v_1 \to v' \;\equiv\; (v_0, v_1) \to (v, v')$. Proved using the single-pass result as a subroutine.
Anyone want a second pass?