Time-Space Tradeoffs for Two-Pass Learning
Sumegha Garg (Princeton)
Joint Work with Ran Raz (Princeton) and Avishay Tal (UC Berkeley)
[Shamir '14] and [Steinhardt-Valiant-Wager '15] initiated the study of memory-samples lower bounds for learning: can one prove unconditional lower bounds on the number of samples needed for learning under memory constraints, when samples are viewed one by one (also known as online learning)?
Can one prove unconditional lower bounds on the number of samples needed for learning under memory constraints when the learner is allowed to go over the stream of samples twice, in the same order?
$x \in_R \{0,1\}^n$ is unknown. A learner tries to learn $x$ from $(a_1, b_1), (a_2, b_2), \ldots, (a_m, b_m)$, where $\forall t$, $a_t \in_R \{0,1\}^n$ and $b_t = \langle a_t, x \rangle$ (inner product mod 2). In other words, the learner gets random linear equations in $x_1, x_2, \ldots, x_n$, revealed one by one.
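To make the streaming setup concrete, here is a minimal Python sketch (not from the talk; the names are illustrative) that draws a hidden $x$ and produces the sample stream:

```python
import random

def parity_stream(n, m, seed=0):
    """Hidden x in {0,1}^n and m samples (a_t, b_t) with
    b_t = <a_t, x> mod 2, as in the parity-learning problem."""
    rng = random.Random(seed)
    x = [rng.randrange(2) for _ in range(n)]           # unknown concept
    samples = []
    for _ in range(m):
        a = [rng.randrange(2) for _ in range(n)]       # uniform equation
        b = sum(ai * xi for ai, xi in zip(a, x)) % 2   # inner product mod 2
        samples.append((a, b))
    return x, samples
```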
Two baseline algorithms (see the sketch below for the first):
- $O(n)$ samples and $O(n^2)$ memory
- $O(n)$ memory but an exponential number of samples
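A hedged sketch of the first baseline, assuming the stream from `parity_stream` above: keep up to $n$ linearly independent equations (about $n^2$ bits) and solve by Gaussian elimination over GF(2). The bitmask encoding is an implementation choice of ours, not from the slides.

```python
def solve_parity(samples, n):
    """Recover x from random linear equations over GF(2), storing a
    reduced basis of at most n equations -- Theta(n^2) bits in total.
    Each equation is an (n+1)-bit mask: coefficients plus the RHS."""
    rows, pivots = [], []
    for a, b in samples:
        row = sum(bit << i for i, bit in enumerate(a)) | (b << n)
        for r, p in zip(rows, pivots):      # eliminate existing pivots
            if (row >> p) & 1:
                row ^= r
        coeffs = row & ((1 << n) - 1)
        if coeffs:                          # independent: new pivot
            p = coeffs.bit_length() - 1
            for j in range(len(rows)):      # keep the basis reduced
                if (rows[j] >> p) & 1:
                    rows[j] ^= row
            rows.append(row)
            pivots.append(p)
        if len(rows) == n:                  # full rank: read off x
            x = [0] * n
            for r, p in zip(rows, pivots):
                x[p] = (r >> n) & 1
            return x
    return None                             # stream ended too early
```

With $m = O(n)$ random equations the system is full rank with high probability, so this succeeds after $O(n)$ samples, at the cost of $\Theta(n^2)$ bits of memory.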
Two-pass parity learning: $x \in_R \{0,1\}^n$ is unknown. A learner tries to learn $x$ from $(a_1, b_1), \ldots, (a_m, b_m), (a_1, b_1), \ldots, (a_m, b_m)$, where $\forall t$, $a_t \in_R \{0,1\}^n$ and $b_t = \langle a_t, x \rangle$ (inner product mod 2).
Any algorithm for parity learning of size $n$ requires either $\Omega(n^2)$ memory bits or an exponential number of samples. Conjectured by Steinhardt, Valiant and Wager [2015]; proved by Raz [2016].
[Kol-Raz-Tal '17]: generalization to sparse parities. [Raz '17, Moshkovitz-Moshkovitz '17, Moshkovitz-Tishby '17, Moshkovitz-Moshkovitz '18, Garg-Raz-Tal '18, Beame-Gharan-Yang '18]: generalization to a larger class of problems. [Sharan-Sidford-Valiant '19]: generalization to real-valued learning.
[Dagan-Shamir '18, Assadi-Chen-Khanna '19, ...]: use communication complexity (a quite different technique, which gives at most a polynomial bound on the number of samples).
Motivation: learning theory, bounded-storage cryptography, complexity theory. Combined with [Barrington '89], proving super-polynomial lower bounds on the time needed to compute a function by a branching program of width 5, with polynomially many passes over the input, would imply super-polynomial lower bounds on formula size. Technically challenging: previous techniques rely heavily on the fact that in the one-pass case all the samples are independent.
Main theorem: any two-pass algorithm for parity learning of size $n$ requires either $\Omega(n^{1.5})$ memory bits or $2^{\Omega(\sqrt{n})}$ samples (no matching upper bound is known).
$A$, $X$: finite sets. $M : A \times X \to \{-1, 1\}$: a matrix. $x \in_R X$ is unknown. A learner tries to learn $x$ from a stream $(a_1, b_1), \ldots, (a_m, b_m), (a_1, b_1), \ldots, (a_m, b_m)$, where $\forall t$: $a_t \in_R A$ and $b_t = M(a_t, x)$. Here $X$: concept class $= \{0,1\}^n$; $A$: possible samples $= \{0,1\}^{n'}$.
Assume that any submatrix of $M$ with at least $2^{-k}|A|$ rows and at least $2^{-\ell}|X|$ columns has bias of at most $2^{-r}$. Then: any two-pass algorithm requires either $\Omega(k \cdot \min\{k, \ell\})$ memory bits or $2^{\Omega(\min\{k, \ell, r\})}$ samples.
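To illustrate the bias condition, here is a brute-force check for tiny parameters (the function names are ours, not the paper's):

```python
from itertools import product

def inner_product_matrix(n):
    """The parity-learning matrix M(a, x) = (-1)^(<a,x> mod 2)."""
    pts = list(product([0, 1], repeat=n))
    return pts, [[(-1) ** (sum(ai * xi for ai, xi in zip(a, x)) % 2)
                  for x in pts] for a in pts]

def bias(M, rows, cols):
    """|average of M over the submatrix rows x cols| -- the quantity
    the extractor assumption bounds by 2^{-r}."""
    s = sum(M[i][j] for i in rows for j in cols)
    return abs(s) / (len(rows) * len(cols))
```

For example, `pts, M = inner_product_matrix(4)` followed by `bias(M, range(16), range(16))` returns $2^{-4}$: the full inner-product matrix has exponentially small bias, reflecting that it is a good extractor.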
In contrast, [GRT '18] proved: any one-pass algorithm requires either $\Omega(k \cdot \ell)$ memory bits or $2^{\Omega(r)}$ samples.
The learner is modeled as a branching program. Each layer represents a time step. Each vertex represents a memory state of the learner (width $= 2^{\text{memory size}}$). Each non-leaf vertex has $2^{n'+1}$ outgoing edges, one for each $(a, b) \in \{0,1\}^{n'} \times \{-1, 1\}$.
The samples $(a_1, b_1), \ldots, (a_m, b_m), (a_1, b_1), \ldots, (a_m, b_m)$ define a computation path. Each vertex $v$ in the last layer is labeled by $\tilde{x}_v \in \{0,1\}^n$. The output is the label $\tilde{x}_v$ of the vertex reached by the path.
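A minimal sketch of how such a program consumes the stream (our own framing, assuming hypothetical `step` and `output` functions supplied by the learner); a two-pass learner is just `passes=2` on the same ordered stream:

```python
def run_learner(samples, init_state, step, output, passes=1):
    """Simulate a bounded-memory learner as a branching program:
    'state' is the current vertex, 'step' follows the outgoing edge
    labelled (a, b), and 'output' reads the label of the last vertex."""
    state = init_state
    for _ in range(passes):
        for a, b in samples:        # a second pass replays the same order
            state = step(state, a, b)
    return output(state)
```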
$\mathbb{P}_{x|v}$ = distribution of $x$ conditioned on the event that the computation path reaches $v$.
Significant vertices: $v$ s.t. $\|\mathbb{P}_{x|v}\|_2 \ge 2^{\ell} \cdot 2^{-n}$.
$\Pr(v)$ = probability that the path reaches $v$.
GRT proves: if $v$ is significant, then $\Pr(v) \le 2^{-\Omega(k \cdot \ell)}$. Hence, at least $2^{\Omega(k \cdot \ell)}$ significant vertices are needed to output the correct answer with high probability.
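The counting step left implicit above can be spelled out as follows (a sketch, with constants suppressed): answering correctly forces the path to reach some significant vertex with constant probability, so by a union bound

```latex
\Pr[\text{success}] \;\le\; \sum_{v \ \text{significant}} \Pr(v)
  \;\le\; \#\{\text{significant } v\} \cdot 2^{-\Omega(k\ell)},
```

and therefore the program must have $2^{\Omega(k\ell)}$ significant vertices, i.e. it uses $\log_2(\text{width}) = \Omega(k \cdot \ell)$ memory bits.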
Recall $\mathbb{P}_{x|v}$ = distribution of $x$ conditioned on the event that the computation path reaches $v$.
$\Pr(v)$ = probability that the path reaches $v$ under the truncated path $\mathcal{T}$: the same as the computation path, but it stops when "atypical" things happen (traversing a bad edge and ...).
Bad edges: $a$ s.t. $|(M \cdot \mathbb{P}_{x|v})(a)| \ge 2^{-r}$.
$\Pr[\mathcal{T} \text{ stopped}]$ is exponentially small (this uses $a_t \in_R \{0,1\}^{n'}$!).
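To unpack the bad-edge condition, a small sketch in our notation (`M` is any $\pm 1$ matrix given as a callable, `dist` a dictionary for $\mathbb{P}_{x|v}$) that flags the samples revealing too much about $x$ at a vertex:

```python
def find_bad_edges(A, X, M, dist, r):
    """Bad edges at a vertex v: samples a with
    |(M . P_{x|v})(a)| = |sum_x M(a, x) * P_{x|v}(x)| >= 2^{-r},
    i.e. observing (a, b) would shift the posterior on x too much."""
    threshold = 2.0 ** (-r)
    return [a for a in A
            if abs(sum(M(a, x) * dist.get(x, 0.0) for x in X)) >= threshold]
```

When $\mathbb{P}_{x|v}$ is spread out and $M$ satisfies the extractor assumption, only an exponentially small fraction of rows are bad, which is why $\Pr[\mathcal{T} \text{ stopped}]$ stays small in the one-pass case.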
Why is the second pass hard? $\mathbb{P}_{a|v}$ need not be close to $\mathrm{Uniform}(\{0,1\}^{n'})$ for $v$ in Part-2 (see the toy example below).
For example, the BP may remember $a_1$; therefore, the probability of traversing a "bad edge" may not be small. Bad edges: $a$ s.t. $|(M \cdot \mathbb{P}_{x|v})(a)| \ge 2^{-r}$ (such an edge gives too much information about $x$).
Save: the BP can't remember too many $a$'s. New stopping rules!
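A toy instance of the failure (our illustration, not from the slides): if the state $v$ at the start of Part-2 stores the first sample, then

```latex
\mathbb{P}_{a_1 \mid v}(a) \;=\; \mathbb{1}[\,a = a_1\,]
  \;\neq\; \mathrm{Uniform}(\{0,1\}^{n'}),
```

so the replayed sample $a_1$ is fully determined rather than fresh and uniform, and the one-pass stopping analysis no longer applies to it.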
Proving that if $v$ is significant then $\Pr(v) \le 2^{-\Omega(k \cdot \ell)}$ uses $a_t \in_R \{0,1\}^{n'}$ along with the extractor property. Save: work on the product of the two parts, which is read-once. New stopping rules!
Let $v_0$ be the start vertex of the 2-pass BP, and let $v_1, v_2$ be the vertices reached at the end of Part-1 and Part-2, respectively. Then $v_0 \to v_1 \to v_2 \;\equiv\; (v_0, v_1) \to (v_1, v_2)$.
[Figure: Part-1 program $C$ (vertices $v_0, v_1, v_2$) and Part-2 program $C'$ (vertices $v'_0, v'_1, v'_2$) combined into the product program with vertices $(v_0, v'_0), (v_1, v'_1), (v_1, v'_2), (v_2, v'_1), (v_2, v'_2)$.]
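A minimal sketch of the product construction (our framing, reusing the hypothetical `step` functions from `run_learner` above): a single read-once pass over the stream advances both parts in parallel:

```python
def product_step(step_part1, step_part2):
    """Transition of the product program C x C': a single sample
    (a, b) is fed simultaneously to a Part-1 state and a Part-2
    state, so the pair is updated by one read-once pass."""
    def step(state, a, b):
        v, vp = state
        return (step_part1(v, a, b), step_part2(vp, a, b))
    return step
```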
Significant vertices: $(v, v')$ s.t. $\|\mathbb{P}_{x \mid (v_0, v_1) \to (v, v')}\|_2 \ge 2^{\ell} \cdot 2^{-n}$. Bad edges: $a$ s.t. $|(M \cdot \mathbb{P}_{x \mid (v_0, v_1) \to (v, v')})(a)| \ge 2^{-r}$. High-probability edges: $a$ s.t. $\Pr[a \mid v_0 \to v \to v_1 \to v'] \ge 2^{k} \cdot 2^{-n'}$. ... Stop at bad edges, unless they are high-probability edges; stop even at high-probability edges if they are very bad.
Conditioned on $v_0 \to v \to v_1 \to v'$, $\Pr[\text{stopped}]$ is small ($\le 1/100$). Note $v_0 \to v \to v_1 \to v' \;\equiv\; (v_0, v_1) \to (v, v')$. Proved using the single-pass result as a subroutine.
Anyone want a second pass?