SLIDE 1

Computational and Statistical Learning Theory

TTIC 31120

  • Prof. Nati Srebro

Lecture 7:

Computational Complexity of Learning: Hardness of Improper Learning (continued); Agnostic Learning

SLIDE 2

Hardness of Learning via Crypto

$(K, x) \mapsto f_K(x)$ easy

$(K, y) \mapsto f_K^{-1}(y)$ very hard: no poly-time algorithm succeeds for a non-negligible fraction of $(K, y)$
⇒ hard to learn $\mathcal{H} = \{\, h_K : (y, i) \mapsto [f_K^{-1}(y)]_i \,\}$

$(D_K, y) \mapsto f_K^{-1}(y)$ easy (e.g. poly-time) given the decryption key $D_K$
⇒ hard to learn poly-time functions

Easy to generate random key pairs $(K, D_K)$

SLIDE 3

Hardness of Learning via Crypto

  • Assumption (Discrete Cube Root): there is no poly-time algorithm computing $\sqrt[3]{y} \bmod K$ that works for a non-negligible fraction of $(y, K)$, where $K = pq$ for primes $p, q$ with $3 \nmid (p-1)(q-1)$

$(K, x) \mapsto x^3 \bmod K$ easy

$(K, y) \mapsto \sqrt[3]{y} \bmod K$ very hard: no poly-time algorithm succeeds for a non-negligible fraction of $(K, y)$
⇒ hard to learn $\mathcal{H} = \{\, h_K : (y, i) \mapsto [\sqrt[3]{y} \bmod K]_i \,\}$

$(D_K, y) \mapsto y^{D_K} \bmod K = \sqrt[3]{y} \bmod K$ easy (poly-time) given the secret exponent $D_K$
⇒ hard to learn poly-time functions

$\forall_K\ h_K \in \mathcal{H}$ ⇒ hard to learn $\mathcal{H}$

$y \mapsto y^{D_K} \bmod K$ is computable by a log-depth logic circuit, and hence by a log-depth neural net
⇒ hard to learn log-depth circuits; hard to learn log-depth NNs

SLIDE 4

Hardness of Learning via Crypto

  • Public-key crypto is possible
⇒ hard to learn poly-time functions

  • Hardness of Discrete Cube Root
⇒ hard to learn log(n)-depth logic circuits
⇒ hard to learn log(n)-depth poly-size neural networks

  • Hardness of breaking RSA
⇒ hard to learn poly-length logical formulas
⇒ hard to learn poly-size automata (equivalently, regexps)
⇒ hard to learn push-down automata
⇒ for some depth d, hard to learn poly-size depth-d threshold circuits

(a threshold unit outputs one iff the number of its input units that are one exceeds its threshold)

  • Hardness of lattice-shortest-vector based cryptography
⇒ hard to learn intersections of $n^{\epsilon}$ halfspaces (for any $\epsilon > 0$)

Michael Kearns
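To make the trapdoor structure concrete, here is a minimal Python sketch of the discrete cube root family with toy primes. The helper names are mine, and the formula for the secret exponent ($D_K = 3^{-1} \bmod \phi(K)$) is a standard RSA fact that the slides do not spell out: cubing mod $K$ is easy for anyone, inverting is easy given the exponent, and inverting from $(K, y)$ alone is assumed hard at cryptographic sizes.

```python
# A toy instance of the discrete cube root trapdoor function.
# Toy primes with 3 coprime to (p-1)(q-1), so cubing mod K is a bijection;
# real instances use primes hundreds of digits long.
p, q = 11, 29                 # (p-1)(q-1) = 280, and 3 does not divide 280
K = p * q
phi = (p - 1) * (q - 1)
d_K = pow(3, -1, phi)         # secret exponent: inverse of 3 mod phi(K)
                              # (modular-inverse pow needs Python 3.8+)

def cube(x, K):
    """Forward direction (K, x) -> x^3 mod K: easy given only the key K."""
    return pow(x, 3, K)

def cube_root(y, K, d_K):
    """Inverse direction: easy with the trapdoor d_K, assumed hard without."""
    return pow(y, d_K, K)

x = 42
assert cube_root(cube(x, K), K, d_K) == x
```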

SLIDE 5

Intersections of Halfspaces

β„‹π‘œ

𝑙(π‘œ) =

𝑦 β†¦βˆ§π‘—=1

𝑙 π‘œ

π‘₯𝑗, 𝑦 > 0 | π‘₯1, … , π‘₯𝑙 π‘œ ∈ β„π‘œ 𝑃 π‘œ1.5 βˆ’ π‘£π‘‡π‘Šπ‘„ βˆ‰ 𝑆𝑄 ⇓ Lattice-based cryptosystem is secure ⇓ For any 𝑠 > 0, hard to learn πΌπ‘œ

𝑙 π‘œ =π‘œπ‘ 

⇓ Hard to learn 2-layer NN with π‘œπ‘  hidden units

The unique shortest lattice vector problem:

  • SVP 𝑀1, 𝑀2, … , π‘€π‘œ ∈ β„π‘œ = arg min𝑏1,𝑏2,…,π‘π‘œβˆˆβ„€ 𝑏1𝑀1 + 𝑏2𝑀2 + β‹― + π‘π‘œπ‘€π‘œ
  • 𝑃 π‘œ1.5 βˆ’ π‘£π‘‡π‘Šπ‘„: only required to return SVP if next-shortest is

𝑃 π‘œ1.5 times longer Sasha Sherstov Adam Klivans

SLIDE 6

Hardness of Learning via Crypto

$(K, x) \mapsto f_K(x)$ easy

$(K, y) \mapsto f_K^{-1}(y)$ very hard: no poly-time algorithm succeeds for a non-negligible fraction of $(K, y)$
⇒ hard to learn $\mathcal{H} = \{\, h_K : (y, i) \mapsto [f_K^{-1}(y)]_i \,\}$

$(D_K, y) \mapsto f_K^{-1}(y)$ easy (e.g. poly-time) given the decryption key $D_K$
⇒ hard to learn poly-time functions

Easy to generate random key pairs $(K, D_K)$

SLIDE 7

Hardness of Learning via Crypto

$(K, x) \mapsto f_K(x)$ easy

$(K, y) \mapsto f_K^{-1}(y)$ very hard: no poly-time algorithm succeeds for a non-negligible fraction of $(K, y)$
⇒ hard to learn $\mathcal{H} = \{\, h_K : (y, i) \mapsto [f_K^{-1}(y)]_i \,\}$

$(D_K, y) \mapsto f_K^{-1}(y)$ easy (e.g. poly-time) given the decryption key $D_K$
⇒ hard to learn poly-time functions

Easy to generate random key pairs $(K, D_K)$

No poly-time algorithm for all $K$ and almost all $y$

SLIDE 8

Hardness of Learning: Take II

  • Recall how we proved hardness of proper learning:
  • Reduction from deciding consistency with $\mathcal{H}$
  • If we had an efficient proper learner, we could train it and find a consistent hypothesis in $\mathcal{H}$ whenever one exists

  • Problem: if learning is not proper, the learner might return a good hypothesis not in $\mathcal{H}$, even though the sample is not consistent with $\mathcal{H}$

  • Instead: reduction from deciding between two possibilities:
  • The sample is consistent with $\mathcal{H}$
  • For every consistent sample, return 1 w.p. $\ge 3/4$ (over the randomization in the algorithm)
  • The sample comes from a random "unpredictable" distribution
  • E.g. sampled such that the labels $y$ are independent of the instances $x$
  • For all but a negligible fraction of samples $S \sim D^m$, return 0 w.p. $\ge 3/4$

Amit Daniely

SLIDE 9

Hardness Relative to RSAT

  • RSAT assumption: for some $f(K) = \omega(1)$, there is no poly-time randomized algorithm that gets as input a K-SAT formula over $n$ variables with $n^{f(K)}$ constraints, and:
  • If the input is satisfiable, then w.p. $\ge 3/4$ (over the randomization in the algorithm) it outputs 1
  • If each constraint is generated independently and uniformly at random, then with probability approaching 1 (as $n \to \infty$) over the formula, w.p. $\ge 3/4$ (over the randomization in the algorithm) it outputs 0

  • Theorem: Under the RSAT assumption,
  • Poly-length DNFs are not efficiently PAC learnable
(e.g. $h(x) = (x_1 \wedge x_7 \wedge x_{15} \wedge x_{17}) \vee (x_2 \wedge x_{24}) \vee \cdots$)
  • Intersections of $\omega(\log n)$ halfspaces are not efficiently PAC learnable
⇒ 2-layer neural networks with $O(\log^{1.1} n)$ hidden units are not efficiently PAC learnable

Amit Daniely
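The two input distributions in the RSAT assumption are easy to sample; below is a minimal sketch. Planting a hidden assignment is one standard way to produce the satisfiable side; that detail, and all names here, are my own illustration rather than anything stated on the slide.

```python
import random

random.seed(0)

def random_clause(n, K):
    """K distinct variables, each negated with probability 1/2."""
    return [v if random.random() < 0.5 else -v
            for v in random.sample(range(1, n + 1), K)]

def uniform_formula(n, K, m):
    """m constraints drawn independently and uniformly at random:
    the side on which the algorithm must output 0."""
    return [random_clause(n, K) for _ in range(m)]

def planted_formula(n, K, m):
    """Satisfiable by construction: keep only clauses satisfied by a
    hidden planted assignment (the side on which it must output 1)."""
    a = {v: random.random() < 0.5 for v in range(1, n + 1)}
    formula = []
    while len(formula) < m:
        c = random_clause(n, K)
        if any((lit > 0) == a[abs(lit)] for lit in c):
            formula.append(c)
    return formula

n, K = 100, 3
m = n ** 2            # playing the role of n^{f(K)} constraints
F0 = uniform_formula(n, K, m)
F1 = planted_formula(n, K, m)
```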

SLIDE 10

Hardness of Learning

  • Axis-aligned rectangles in $n$ dimensions: efficiently properly learnable
  • Halfspaces in $n$ dimensions: efficiently properly learnable
  • Conjunctions on $n$ variables: efficiently properly learnable
  • 3-term DNFs: efficiently learnable, but not properly
  • DNF formulas of size poly(n): not efficiently learnable
  • Generic logical formulas of size poly(n): not efficiently learnable
  • Neural nets with at most poly(n) units: not efficiently learnable
  • Functions computable in poly(n) time: not efficiently learnable

SLIDE 11

Realizable vs Agnostic

  • Definition: A family $\mathcal{H}_n$ of hypothesis classes is efficiently properly PAC-learnable if there exists a learning rule $A$ such that $\forall n\ \forall \epsilon, \delta > 0$ there is a sample size $m(n, \epsilon, \delta)$ such that for every $D$ with $L_D(h) = 0$ for some $h \in \mathcal{H}_n$, w.p. $\ge 1 - \delta$ over $S \sim D^{m(n, \epsilon, \delta)}$:

$L_D(A(S)) \le \epsilon$

and $A(S)(x)$ can be computed in time $\mathrm{poly}(n, \frac{1}{\epsilon}, \log\frac{1}{\delta})$, and $A$ always outputs a predictor in $\mathcal{H}_n$

  • Definition: A family $\mathcal{H}_n$ of hypothesis classes is efficiently properly agnostically PAC-learnable if there exists a learning rule $A$ such that $\forall n\ \forall \epsilon, \delta > 0$ there is a sample size $m(n, \epsilon, \delta)$ such that for every $D$, w.p. $\ge 1 - \delta$ over $S \sim D^{m(n, \epsilon, \delta)}$:

$L_D(A(S)) \le \inf_{h \in \mathcal{H}_n} L_D(h) + \epsilon$

and $A(S)(x)$ can be computed in time $\mathrm{poly}(n, \frac{1}{\epsilon}, \log\frac{1}{\delta})$, and $A$ always outputs a predictor in $\mathcal{H}_n$

SLIDE 12

Conditions for Efficient Agnostic Learning

$\mathrm{ERM}_{\mathcal{H}}(S) = \arg\min_{h \in \mathcal{H}} L_S(h)$

  • Claim: If
  • $\mathrm{VCdim}(\mathcal{H}_n) \le \mathrm{poly}(n)$, and
  • each $h \in \mathcal{H}_n$ is computable in time poly(n), and
  • there is a poly-time (in the size of its input) algorithm for $\mathrm{ERM}_{\mathcal{H}}$ (i.e. one that returns some empirical risk minimizer),
then $\mathcal{H}_n$ is efficiently agnostically properly PAC learnable.

$\mathrm{AGREEMENT}_{\mathcal{H}}(S, k) = 1 \text{ iff } \exists_{h \in \mathcal{H}}\ L_S(h) \le 1 - \tfrac{k}{|S|}$

  • Claim: If $\mathcal{H}_n$ is efficiently properly agnostically PAC learnable, then $\mathrm{AGREEMENT}_{\mathcal{H}} \in \mathrm{RP}$

SLIDE 13

What is Properly Agnostically Learnable?

  • Poly-time functions? No! (not even in the realizable case)
  • Poly-length logical formulas? No! (not even in the realizable case)
  • Poly-size depth-2 neural networks? No! (not even in the realizable case)
  • Halfspaces (linear predictors)? No!
  • $\mathcal{X}_n = \{0,1\}^n$, $\mathcal{H}_n = \{\, x \mapsto [\langle w, x \rangle > 0] \mid w \in \mathbb{R}^n \,\}$
  • Claim: $\mathrm{AGREEMENT}_{\mathcal{H}}$ is NP-hard (optional HW problem)
  • Conclusion: if $\mathrm{NP} \neq \mathrm{RP}$, halfspaces are not efficiently properly agnostically learnable
  • Conjunctions? No! Also NP-hard!
  • Unions of segments on the line? Yes!
  • $\mathcal{X}_n = [0,1]$, $\mathcal{H}_n = \{\, x \mapsto \vee_{j=1}^{n} [a_j \le x \le b_j] \mid a_j, b_j \in [0,1] \,\}$
  • Efficiently properly agnostically PAC learnable! (see the ERM sketch below)
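Why is this last class learnable when the others are not? Because here ERM itself is tractable: after sorting the sample, a left-to-right dynamic program over "how many segments are open so far" finds the minimum-error union of at most $k$ segments. A minimal sketch, where the DP formulation and all names are mine:

```python
def erm_union_of_segments(S, k):
    """Minimum 0/1 errors over predictors x -> OR_j [a_j <= x <= b_j]
    with at most k segments. DP state: (segments opened, inside one?)."""
    S = sorted(S)                              # sweep the sample by x
    INF = float("inf")
    best = [[INF] * 2 for _ in range(k + 1)]   # best[b][inside] = min errors
    best[0][0] = 0
    for _, y in S:
        new = [[INF] * 2 for _ in range(k + 1)]
        for b in range(k + 1):
            for inside in (0, 1):
                cur = best[b][inside]
                if cur == INF:
                    continue
                # predict negative here: stay/step outside all segments
                new[b][0] = min(new[b][0], cur + (1 if y else 0))
                # predict positive: stay in the segment, or open a new one
                nb = b if inside else b + 1
                if nb <= k:
                    new[nb][1] = min(new[nb][1], cur + (0 if y else 1))
        best = new
    return min(min(row) for row in best)

S = [(0.1, True), (0.2, False), (0.3, True), (0.4, False), (0.5, True)]
print(erm_union_of_segments(S, k=2))           # -> 1 (three segments needed for 0)
```

The sweep visits $O(|S| \cdot k)$ states, so ERM is poly-time; together with $\mathrm{VCdim}(\mathcal{H}_n) = O(n)$, this gives efficient proper agnostic learnability by the claim on Slide 12.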

SLIDE 14

Source of the Hardness

$\min_{h \in \mathcal{H}} \sum_j \ell(h_w(x_j); y_j)$, where $h_w(x) = \langle w, x \rangle \in \mathbb{R}$

$\ell_{01}(h(x); y) = [y\, h(x) \le 0]$

[Figure: $\ell_{01}(h(x); y = -1)$ and $\ell_{\mathrm{sq}}(h(x); y = -1)$ plotted as functions of $h(x) \in \mathbb{R}$]

SLIDE 15

Convexity

  • Definition (convex set):

A set $C$ in a vector space is convex if $\forall u, v \in C$ and $\forall \alpha \in [0,1]$: $\alpha u + (1 - \alpha) v \in C$

SLIDE 16

Convexity

  • Definition (convex function):

A function $f : C \to \mathbb{R}$ is convex if $\forall u, v \in C$ and $\forall \alpha \in [0,1]$: $f(\alpha u + (1 - \alpha) v) \le \alpha f(u) + (1 - \alpha) f(v)$

[Figure: between $u$ and $v$, the graph of $f$ lies below the chord connecting $(u, f(u))$ and $(v, f(v))$]

SLIDE 17

Using a surrogate loss

$\min_{h \in \mathcal{H}} \sum_j \ell(h_w(x_j); y_j)$

  • Instead of $\ell_{01}(z; y)$, use a surrogate $\ell(z; y)$ s.t.:
  • $\forall y$: $\ell(z; y)$ is convex in $z$
  • $\forall z, y$: $\ell_{01}(z; y) \le \ell(z; y)$
  • E.g.
  • $\ell_{\mathrm{sq}}(z; y) = (y - z)^2$
  • $\ell_{\mathrm{logistic}}(z; y) = \log(1 + \exp(-yz))$
  • $\ell_{\mathrm{hinge}}(z; y) = [1 - yz]_+ = \max\{0, 1 - yz\}$
SLIDE 18

Minimizing a Surrogate Loss Does Not Minimize 0/1 Loss!

$\ell_{01}(z; y) = [yz \le 0] \le [1 - yz]_+ = \ell_{\mathrm{hinge}}(z; y)$

  • Realizable case:

$\exists_w\ L_S^{01}(x \mapsto \langle w, x \rangle) = 0$
$\Rightarrow L_S^{\mathrm{hinge}}\big(x \mapsto \tfrac{1}{\gamma} \langle w, x \rangle\big) = 0$, where $\gamma = \min_j y_j h_w(x_j) > 0$
$\Rightarrow L_S^{\mathrm{hinge}}(\mathrm{ERM}^{\mathrm{hinge}}(S)) = 0$
$\Rightarrow L_S^{01}(\mathrm{ERM}^{\mathrm{hinge}}(S)) \le L_S^{\mathrm{hinge}}(\mathrm{ERM}^{\mathrm{hinge}}(S)) = 0$

  • Non-realizable case:
  • What can we ensure by minimizing the surrogate loss?
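A minimal NumPy check of the realizable-case chain above, on random separable data of my own construction: rescaling a separating $w$ by $1/\gamma$ drives the hinge loss to exactly zero, and zero hinge loss implies zero 0/1 loss.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 5, 50
w_true = rng.standard_normal(n)
X = rng.standard_normal((m, n))
y = np.sign(X @ w_true)                  # realizable: w_true separates S

margins = y * (X @ w_true)
gamma = margins.min()                    # gamma = min_j y_j <w_true, x_j> > 0
w_scaled = w_true / gamma                # rescale so every margin is >= 1

hinge = np.maximum(0.0, 1.0 - y * (X @ w_scaled))
assert np.all(hinge == 0.0)              # L_S^hinge(w_scaled) = 0
assert np.all(y * (X @ w_scaled) > 0)    # hence L_S^01 = 0 as well
```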
SLIDE 19

Improper Learning?

  • Halfspaces are not efficiently properly agnostically PAC learnable
  • What about improper learning?

… we'll use boosting to reduce learning intersections of halfspaces to agnostically learning halfspaces
SLIDE 20

Why Study Hardness?

  • Understand why machine learning is essentially a computational problem
  • Understand why we must sometimes take a non-exact/heuristic approach, and that it cannot be exact (e.g. use a surrogate loss)
  • Understand what we can never guarantee, and not try to guarantee it (e.g. we cannot learn with a large NN just because there is a small NN that completely explains the data)
  • Understand, and be able to argue about, sample complexity gaps between the statistical limit (using any learning rule) and the computational limit (using a tractable learning rule)

SLIDE 21

"Weak" vs "Strong" Learning

  • Recall the definition of (realizable) PAC learning of $\mathcal{H}$ using rule $A(\cdot)$:

For any $D$ s.t. $\inf_{h \in \mathcal{H}} L_D(h) = 0$, and any $\epsilon, \delta > 0$, using an $m(\epsilon, \delta)$ sample: w.p. $\ge 1 - \delta$ over $S \sim D^{m(\epsilon, \delta)}$, $L_D(A(S)) < \epsilon$

  • $A(\cdot)$ is a weak learner for $\mathcal{H}$ if:

there exist $\epsilon < \frac{1}{2}$, $\delta < 1$, and $m$, s.t. for any $D$ with $\inf_{h \in \mathcal{H}} L_D(h) = 0$: w.p. $\ge 1 - \delta$ over $S \sim D^m$, $L_D(A(S)) < \epsilon$ (e.g. $\epsilon = 0.49$ and $1 - \delta = 0.01$)

  • If $\mathcal{H}$ is weakly learnable, is it also strongly learnable?
  • Yes: $\mathcal{H}$ is weakly learnable ⇒ $\mathrm{VCdim}(\mathcal{H}) < \infty$ ⇒ $\mathcal{H}$ is (strongly) learnable
  • If $\mathcal{H}_n$ is efficiently weakly learnable, is it also efficiently strongly learnable?
  • If we have access to an (efficient) weak learner $A(\cdot)$, can we use it to build an (efficient) strong learner?

SLIDE 22

The Boosting Problem

  • Boosting the Confidence:

If the learning algorithm works only with some very small fixed probability $1 - \delta_0$ (e.g. $1 - \delta_0 = 0.01$), can we construct a new algorithm that works with arbitrarily high probability $1 - \delta$ (for any $\delta > 0$)?

  • Boosting the Error:

If the learning algorithm only returns a predictor that is guaranteed to be slightly better than chance, i.e. has error $\epsilon_0 = \frac{1}{2} - \gamma < \frac{1}{2}$ (for some fixed $\gamma > 0$), can we construct a new algorithm that achieves arbitrarily low error $\epsilon$?

SLIDE 23

Boosting the Confidence

  • For any πœ€:
  • Claim: w.p. β‰₯ 1 βˆ’ πœ€, 𝑀

β„Ž ≀ πœ—0 + πœ—

  • Total samples used: 𝑃 𝑛0 πœ—0 β‹… log 1

πœ€ + log1

πœ€

πœ—2

  • Efficient algorithm for some πœ€0 < 1 and all πœ— > 0 with runtime and sample

complexity π‘žπ‘π‘šπ‘§(π‘œ, πœ—0)  efficient algorithm for any πœ€ > 0 with runtime π‘žπ‘π‘šπ‘§(π‘œ, πœ—, log

1 πœ€)

  • 1. For i=1..k: 𝑙 =

log 2 πœ€ log 1 πœ€0

Collect 𝑛0 independent samples 𝑇𝑗 β„Žπ‘— = 𝐡(𝑇𝑗)

  • 2. Collect 𝑛val =

4 log

4𝑙 πœ€

πœ—2

additional independent samples π‘‡π‘€π‘π‘š

  • 3. Return

β„Ž = arg min

β„Ž1,…,β„Žπ‘™ π‘€π‘‡π‘€π‘π‘š β„Žπ‘—

w.p. β‰₯ 1 βˆ’ πœ€, inf

𝑗 𝑀 β„Žπ‘— ≀ πœ—0

ERM from class of size 𝑙
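A sketch of the three-step procedure, assuming access to a base learner A that succeeds only with probability $1 - \delta_0$ and to a sampler for $D$; all names here are mine.

```python
import math

def boost_confidence(A, sample, m0, eps, delta, delta0):
    """Boosting the confidence of a base learner A.

    A      : learning rule, maps a list of (x, y) pairs to a predictor h
    sample : draws one labeled i.i.d. example (x, y) from D
    m0     : sample size the base learner A needs
    """
    # Step 1: k independent runs. Each fails w.p. at most delta0, so all
    # fail w.p. at most delta0^k = delta/2 for this choice of k.
    k = math.ceil(math.log(2 / delta) / math.log(1 / delta0))
    hs = [A([sample() for _ in range(m0)]) for _ in range(k)]

    # Step 2: fresh validation set of size m_val = 4 log(4k/delta) / eps^2.
    m_val = math.ceil(4 * math.log(4 * k / delta) / eps ** 2)
    S_val = [sample() for _ in range(m_val)]

    # Step 3: ERM over the finite class {h_1, ..., h_k} on S_val.
    def val_err(h):
        return sum(h(x) != y for x, y in S_val) / m_val
    return min(hs, key=val_err)
```

Step 3 is ERM over a class of size $k$, which is why a validation set of size $O(\log(k/\delta)/\epsilon^2)$ suffices to lose at most $\epsilon$ beyond $\epsilon_0$.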

SLIDE 24

Boosting the Error?

  • What if we can only find a predictor with relatively high excess error $\epsilon$?
  • We can always find a predictor with error $\le \frac{1}{2}$
  • What if we have an algorithm that, for any source distribution $D$ s.t. $\inf_h L_D(h) = 0$, finds $A(S)$ with $L_D(A(S)) \le \frac{1}{2} - \gamma$?
  • Can we use $A(\cdot)$ to find a predictor with arbitrarily low error?