Computational and Statistical Learning Theory
TTIC 31120
- Prof. Nati Srebro
Lecture 7: Computational Complexity of Learning
Hardness of Improper Learning (continued); Agnostic Learning

Hardness of Learning via Crypto
Easy to generate a random pair (K, D_K)
K, x ↦ f_K(x): easy
K, y ↦ f_K^{-1}(y): very hard (no poly-time algorithm succeeds for a non-negligible fraction of K, y)
D_K, y ↦ f_K^{-1}(y): easy (e.g. poly-time)
⇒ Hard to learn ℋ = { h_K : y ↦ f_K^{-1}(y) }
⇒ Hard to learn poly-time computable functions
Assumption: no poly-time algorithm computes ∛y mod K for a non-negligible fraction of y and K, where K = pq (p, q primes with 3 ∤ (p−1)(q−1))
With f_K(x) = x³ mod K:
K, x ↦ x³ mod K: easy
h_K : y ↦ y^{D_K} = ∛y mod K: hard to learn (for every K, h_K ∈ ℋ)
Each h_K is computable by a log-depth logic circuit, and also by a log-depth neural net
⇒ Hard to learn log-depth circuits
⇒ Hard to learn log-depth neural nets
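As a concrete (and deliberately toy) illustration of this construction, the sketch below generates a key pair (K, D_K) with K = pq and 3 ∤ (p−1)(q−1); the tiny primes and the helper names f_K, h_K are my own choices, not from the lecture. Cubing mod K is easy for anyone, while extracting cube roots is easy only with the trapdoor D_K, so labeled examples (f_K(x), x) are cheap to generate even though predicting the labels appears hard.

```python
# Toy cube-root trapdoor (insecure parameters, purely illustrative).
p, q = 11, 23                  # primes with 3 not dividing (p-1)*(q-1) = 220
K = p * q                      # public key
phi = (p - 1) * (q - 1)
assert phi % 3 != 0
D_K = pow(3, -1, phi)          # trapdoor: inverse of 3 modulo phi(K) (Python 3.8+)

def f_K(x):                    # easy for everyone: x -> x^3 mod K
    return pow(x, 3, K)

def h_K(y):                    # easy only with the trapdoor: y -> cube root of y mod K
    return pow(y, D_K, K)

# Generating a labeled example (y, h_K(y)) is easy: pick x and output (f_K(x), x).
x = 42
y = f_K(x)
assert h_K(y) == x
```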
⇒ hard to learn poly-time computable functions
⇒ hard to learn log(n)-depth logic circuits
⇒ hard to learn log(n)-depth, poly-size neural networks
⇒ hard to learn poly-length logical formulas
⇒ hard to learn poly-size automata (i.e. regular expressions)
⇒ hard to learn push-down automata
⇒ for some depth d, hard to learn poly-size depth-d threshold circuits (a unit outputs one iff the number of its input units that are one is greater than its threshold; see the sketch after this list)
⇒ hard to learn intersections of n^ε halfspaces (for any ε > 0)
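To make the threshold-circuit item concrete, here is a small evaluator for Boolean threshold circuits following the parenthetical definition above; the layer/gate representation is just one convenient choice, not notation from the course.

```python
def threshold_unit(inputs, threshold):
    # Output is 1 iff the number of input units that are 1 exceeds the threshold.
    return 1 if sum(inputs) > threshold else 0

def eval_threshold_circuit(x, layers):
    """Evaluate a depth-d threshold circuit.
    `layers` is a list of layers; each layer is a list of gates, and each gate
    is (indices_into_previous_layer, threshold)."""
    values = list(x)
    for layer in layers:
        values = [threshold_unit([values[i] for i in idx], t) for idx, t in layer]
    return values

# Depth-2 example on 4 inputs: AND of (x1 OR x2) and (x3 OR x4), written with thresholds.
circuit = [
    [([0, 1], 0), ([2, 3], 0)],   # OR gates: fire iff more than 0 inputs are 1
    [([0, 1], 1)],                # AND gate: fires iff more than 1 (i.e. both) are 1
]
print(eval_threshold_circuit([1, 0, 0, 1], circuit))   # [1]
print(eval_threshold_circuit([1, 1, 0, 0], circuit))   # [0]
```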
Michael Kearns
Cryptographic hardness of learning intersections of halfspaces [Klivans & Sherstov]:
ℋ_n^ε = { x ↦ ∧_{i=1}^{n^ε} [ ⟨w_i, x⟩ > 0 ] : w_1, …, w_{n^ε} ∈ ℝ^n }
Õ(n^{1.5})-unique shortest vector problem (uSVP) is hard
⇒ the lattice-based cryptosystem is secure
⇒ for any ε > 0, hard to learn ℋ_n^ε (intersections of n^ε halfspaces)
⇒ hard to learn 2-layer neural networks with n^ε hidden units
The unique shortest lattice vector problem: find the shortest nonzero vector in a lattice in which every other (non-parallel) vector is at least n^{1.5} times longer.
Sasha Sherstov, Adam Klivans
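A minimal sketch (using NumPy, with made-up weights) of a hypothesis from this class: an intersection of m halfspaces is exactly a two-layer network with m threshold hidden units followed by an AND output unit, which is why the hardness result transfers to small two-layer networks.

```python
import numpy as np

def intersection_of_halfspaces(W):
    """W has shape (m, n): one weight vector w_i per halfspace.
    Returns h(x) = AND_i [ <w_i, x> > 0 ], i.e. a 2-layer threshold network."""
    def h(x):
        hidden = (W @ x > 0)          # m threshold hidden units
        return int(hidden.all())      # output unit: AND of the hidden units
    return h

rng = np.random.default_rng(0)
n, m = 10, 4                          # illustrative sizes; the hard regime has m = n**eps
W = rng.standard_normal((m, n))
h = intersection_of_halfspaces(W)
x = rng.standard_normal(n)
print(h(x), h(-x))   # at most one of a point and its reflection lies in all halfspaces
```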
(Variant of the assumption: no poly-time algorithm succeeds for all K and almost all y.)
…a hypothesis in ℋ, if one exists
…not in ℋ, even though f is not consistent with ℋ
Amit Daniely
Assumption (hardness of refuting random K-SAT): there is no poly-time randomized algorithm that gets as input a K-SAT formula with n^{f(K)} constraints and:
if the formula is satisfiable, then w.p. ≥ 3/4 (over the randomization in the algorithm), it outputs 1
if the formula is random, then with probability approaching 1 (as n → ∞) over the formula, w.p. ≥ 3/4 (over the randomization in the algorithm), it outputs 0
e.g. h(x) = (x_1 ∧ x_7 ∧ x_15 ∧ x_17) ∨ (x_2 ∧ x_24) ∨ ⋯
⇒ 2-layer neural networks with ω(log^{1.1} n) hidden units are not efficiently PAC learnable
Amit Daniely
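The DNF example above is itself a two-layer network with one threshold hidden unit per term, which is how the statement about small two-layer networks arises; the sketch below (with an assumed number of variables and my own encoding) checks the two representations agree on a sample point.

```python
import numpy as np

# The example DNF from above: (x1 & x7 & x15 & x17) | (x2 & x24) | ...
terms = [[1, 7, 15, 17], [2, 24]]           # 1-based variable indices per term
n = 30                                      # number of Boolean variables (assumed)

def dnf(x):                                 # x is a 0/1 vector of length n
    return int(any(all(x[i - 1] for i in t) for t in terms))

# The same function as a 2-layer threshold network: one hidden unit per term.
W = np.zeros((len(terms), n))
thresholds = np.array([len(t) - 0.5 for t in terms])
for j, t in enumerate(terms):
    W[j, [i - 1 for i in t]] = 1.0

def two_layer(x):
    hidden = (W @ np.asarray(x) > thresholds)   # unit j fires iff every variable of term j is 1
    return int(hidden.any())                    # output unit: OR of the hidden units

x = np.zeros(n, dtype=int)
x[[1, 23]] = 1                                  # set x2 = x24 = 1
assert dnf(x) == two_layer(x) == 1
```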
[Diagram: Not Efficiently Learnable / Efficiently Properly Learnable / Efficiently Learnable, but not Properly]
Efficiently PAC-learnable: there exists a learning rule A such that for all n and all ε, δ > 0 there is m(n, ε, δ) such that, for every D with L_D(h) = 0 for some h ∈ ℋ_n,
P_{S ∼ D^{m(n,ε,δ)}} ( L_D(A(S)) ≤ ε ) ≥ 1 − δ,
and A(S)(x) can be computed in time poly(n, 1/ε, log(1/δ)).
Efficiently properly PAC-learnable: in addition, A always outputs a predictor in ℋ_n.
Efficiently agnostically PAC-learnable: there exists a learning rule A such that for all n and all ε, δ > 0 there is m(n, ε, δ) such that, for every D,
P_{S ∼ D^{m(n,ε,δ)}} ( L_D(A(S)) ≤ inf_{h ∈ ℋ_n} L_D(h) + ε ) ≥ 1 − δ,
and A(S)(x) can be computed in time poly(n, 1/ε, log(1/δ)).
Efficiently properly agnostically PAC-learnable: in addition, A always outputs a predictor in ℋ_n.
If we can efficiently compute ERM_ℋ(S) = arg min_{h ∈ ℋ} L_S(h) (i.e. we have a procedure that returns any ERM), then ℋ_n is efficiently agnostically properly PAC learnable.
Decision problem AGREEMENT_ℋ(S, k) = 1 iff ∃h ∈ ℋ with L_S(h) ≤ 1 − k/m (h agrees with at least k of the m examples);
then AGREEMENT_ℋ ∈ NP.
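The first claim only needs some procedure that returns an empirical risk minimizer; for a finite class that is plain enumeration. The sketch below does exactly that; the toy class of threshold functions is an assumption for illustration, not a class discussed on this slide.

```python
def erm(hypotheses, S):
    """Return any empirical risk minimizer from a finite hypothesis class.
    `hypotheses` is a list of functions x -> {0, 1}; S is a list of (x, y) pairs."""
    def empirical_error(h):
        return sum(h(x) != y for x, y in S) / len(S)
    return min(hypotheses, key=empirical_error)

# Toy class: thresholds on the real line (illustrative only).
hypotheses = [lambda x, t=t: int(x > t) for t in (0.0, 0.25, 0.5, 0.75, 1.0)]
S = [(0.1, 0), (0.3, 0), (0.6, 1), (0.9, 1)]
h_star = erm(hypotheses, S)
print([h_star(x) for x, _ in S])   # [0, 0, 1, 1]: zero empirical error on this sample
```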
Halfspaces { x ↦ [ ⟨w, x⟩ > 0 ] : w ∈ ℝ^n }: properly agnostically learnable?
Unions of k intervals { x ↦ ∨_{i=1}^{k} [ a_i ≤ x ≤ b_i ] : a_i, b_i ∈ [0,1] }: properly agnostically learnable?
[Answers from the summary table: No! (not even in realizable case) / No! (not even in realizable case) / No! (not even in realizable case) / No! / No! / Yes!]
For a linear predictor h_w(x) = ⟨w, x⟩ ∈ ℝ:
ℓ_01( h(x); y ) = [ y·h(x) ≤ 0 ]
[Plot: ℓ_01(h(x); y = −1) and a surrogate loss, as functions of the real-valued prediction h(x)]
Convexity: f(αu + (1 − α)v) ≤ α f(u) + (1 − α) f(v) for all u, v and α ∈ [0, 1]
Minimize a convex surrogate: min_{h_w ∈ ℋ_n} Σ_i ℓ( h_w(x_i); y_i )
ℓ_01( a; y ) = [ y·a ≤ 0 ] ≤ [ 1 − y·a ]_+ = ℓ_hinge( a; y )
L_S^{01}( x ↦ [ ⟨w, x⟩ > 0 ] ) = 0
⇒ L_S^{hinge}( x ↦ (1/γ)⟨w, x⟩ ) = 0, where γ = min_i y_i h_w(x_i) > 0
⇒ L_S^{hinge}( ERM^{hinge}(S) ) = 0
⇒ L_S^{01}( ERM^{hinge}(S) ) ≤ L_S^{hinge}( ERM^{hinge}(S) ) = 0
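A short numerical check of this chain of implications on synthetic separable data (the data-generating choices are arbitrary): rescaling the separating w by the margin γ makes the hinge loss exactly zero, and since the zero-one loss is bounded by the hinge loss pointwise, any hinge-loss minimizer also has zero training error.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 5, 200
w0 = rng.standard_normal(n)
X = rng.standard_normal((5 * m, n))
X = X[np.abs(X @ w0) > 0.5][:m]               # keep points with a clear margin
y = np.sign(X @ w0)                           # labels realizable by the halfspace w0

gamma = np.min(y * (X @ w0))                  # gamma = min_i y_i <w0, x_i> > 0
margins = y * (X @ w0) / gamma                # = y_i <w, x_i> for the rescaled w = w0 / gamma
hinge = np.maximum(0.0, 1.0 - margins).mean()
zero_one = (margins <= 0).mean()
print(gamma > 0, hinge, zero_one)             # True 0.0 0.0

# Pointwise, the 0-1 loss is bounded by the hinge loss, so any hinge-loss ERM
# also has zero 0-1 training error on this sample.
assert np.all((margins <= 0) <= np.maximum(0.0, 1.0 - margins))
```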
… we'll use boosting to reduce learning intersection
computational problem
exact/heuristic approach, and that it cannot be exact (e.g. use a surrogate loss)
guarantee it (e.g. cannot learn with a large NN just because there is a small NN that completely explains the data)
gaps between the statistical limit (using any learning rule) and the computational limit (using a tractable learning rule)
Strong (PAC) learner: for any D s.t. inf_{h ∈ ℋ} L_D(h) = 0, and any ε, δ > 0, using m(ε, δ) samples,
P_{S ∼ D^{m(ε,δ)}} ( L_D(A(S)) < ε ) ≥ 1 − δ.
Weak learner: there exist ε_0 < 1/2, δ_0 < 1 and m_0 s.t. for any D with inf_{h ∈ ℋ} L_D(h) = 0,
P_{S ∼ D^{m_0}} ( L_D(A(S)) < ε_0 ) ≥ 1 − δ_0 (e.g. ε_0 = 0.49 and 1 − δ_0 = 0.01).
Can we turn such a weak learner, with error ≤ 1/2 − γ < 1/2 (for some fixed γ > 0), into an (efficient) strong learner?
Boosting the confidence: with probability ≥ 1 − δ the selected hypothesis satisfies L_D(ĥ) ≤ ε_0 + ε, using roughly k·m_0 + (log k + log(1/δ)) / ε² samples;
if A runs in time poly(n, m_0), this gives an efficient algorithm for any δ > 0 with runtime poly(n, m, log(1/δ)),
where k = log(2/δ) / log(1/δ_0).
Repeat for i = 1, …, k: collect m_0 independent samples S_i and let h_i = A(S_i).
Collect 4 log(4k/δ) / ε² additional independent samples S_val.
Output ĥ = arg min_{h ∈ {h_1, …, h_k}} L_{S_val}(h).
W.p. ≥ 1 − δ, min_i L_D(h_i) ≤ ε_0; selecting ĥ on S_val is just ERM over a class of size k.
Weak learner: whenever inf_h L_D(h) = 0, it finds A(S) with L_D(A(S)) ≤ 1/2 − γ < 1/2.
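A schematic implementation of this confidence-boosting procedure; the base learner, the data source and the zero-one loss below are toy stand-ins I made up, and only the run-k-times-then-validate structure follows the procedure above (with k = log(2/δ)/log(1/δ_0) repetitions and a validation set of size 4·log(4k/δ)/ε²).

```python
import math, random

def boost_confidence(base_learner, sample, loss, m0, eps, delta, delta0):
    """Run a low-confidence learner k times on fresh samples of size m0, then
    return the candidate with the smallest validation error (ERM over k hypotheses)."""
    k = max(1, math.ceil(math.log(2 / delta) / math.log(1 / delta0)))
    candidates = [base_learner(sample(m0)) for _ in range(k)]
    m_val = math.ceil(4 * math.log(4 * k / delta) / eps ** 2)
    S_val = sample(m_val)
    def val_error(h):
        return sum(loss(h, x, y) for x, y in S_val) / len(S_val)
    return min(candidates, key=val_error)

# Toy instantiation (all names and numbers below are made up for illustration):
# learn a threshold on [0, 1] from a base learner that usually returns junk.
def sample(m, t_star=0.6):
    return [(x, int(x > t_star)) for x in (random.random() for _ in range(m))]

def unreliable_learner(S):
    if random.random() < 0.9:                              # fails 90% of the time
        t = random.random()                                # junk threshold
    else:
        t = max((x for x, y in S if y == 0), default=0.0)  # sensible threshold
    return lambda x, t=t: int(x > t)

def zero_one(h, x, y):
    return int(h(x) != y)

h_hat = boost_confidence(unreliable_learner, sample, zero_one,
                         m0=50, eps=0.1, delta=0.05, delta0=0.9)
```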