Theoretical Implications
CS 535: Deep Learning
Machine Learning Theory: Basic setup

Generic supervised learning setup: for (x_i, y_i), i = 1, ..., n, drawn i.i.d. from the joint distribution P(X, Y), find the best function f ∈ F that minimizes the error E_{x,y}[L(f(x), y)], for example with the 0-1 loss

L(f(x), y) = 1 if f(x) ≠ y, and 0 if f(x) = y
(The function class F could be, e.g., all linear functions, all quadratic functions, all smooth functions, etc.)
A typical generalization bound: with probability 1 − δ,

E_{x,y}[f(x) ≠ y] ≤ (1/n) Σ_{i=1}^{n} L(f(x_i), y_i) + Ω(F, n)

The left-hand side is the error on the whole distribution; the sum is the error on the training set; and Ω(F, n) measures the flexibility of the function class. How to represent "flexibility" precisely is the subject of a course on ML theory.
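For intuition, here is one concrete (illustrative, not from the slides) choice of the flexibility term: for a finite function class F, Hoeffding's inequality plus a union bound gives Ω(F, n) = sqrt(ln(|F|/δ) / (2n)).

```python
import math

def flexibility_term(class_size: int, n: int, delta: float = 0.05) -> float:
    """Omega(F, n) for a finite class F via Hoeffding + union bound:
    with probability 1 - delta, test error <= train error + this term."""
    return math.sqrt(math.log(class_size / delta) / (2 * n))

# The add-on term shrinks as the training set grows
for n in (100, 10_000, 1_000_000):
    print(n, round(flexibility_term(class_size=1000, n=n), 4))
```

The term grows with the (log of the) class size and shrinks as O(1/sqrt(n)), which is exactly the training-error-vs-flexibility trade-off described above.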
An example of a flexible function class: F = { f(x) | f(x) = w_1 x^9 + ⋯ } (e.g. high-degree polynomials).
With a flexible function class it is easy to drive the training error down, but then more data is needed to control the generalization error.
E_{x,y}[f(x) ≠ y] ≤ (1/n) Σ_{i=1}^{n} L(f(x_i), y_i) + Ω(F, n)

The error on the whole distribution is bounded by the error on the training set plus an add-on term reflecting the flexibility of the function class. If the training error is already 60%, the bound is useless no matter how small the add-on term is.
Ridge regression:

min_w ‖X^⊤ w − y‖^2 + λ‖w‖^2

In order to choose a w with a big ‖w‖^2, the improvement in the data-fitting term must overcome the penalty λ‖w‖^2.
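A minimal numerical sketch of this trade-off (my code, using the standard row-wise design matrix so the objective reads min_w ‖Xw − y‖² + λ‖w‖²; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))                    # 50 samples, 5 features
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=50)

def ridge(X, y, lam):
    """Closed-form ridge solution: (X^T X + lam * I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# A larger lambda forces a smaller ||w||^2: to keep the weights large,
# the data-fitting term would have to overcome the penalty.
for lam in (0.0, 1.0, 100.0):
    w = ridge(X, y, lam)
    print(lam, round(float(w @ w), 3))
```

Running it shows ‖w‖² shrinking monotonically as λ grows, which is the "overcome this term" argument in numbers.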
More generally, regularized empirical risk minimization:

min_f Σ_i L(f(x_i), y_i) + λ‖f‖^2

In order to choose a very unsmooth function f, the reduction in training loss must overcome the penalty term.
Regularization as the MAP estimate: the penalty corresponds to a Gaussian prior N(0, σ^2) on the weights, for both objectives:

min_w ‖X^⊤ w − y‖^2 + λ‖w‖^2

min_f Σ_i L(f(x_i), y_i) + λ‖f‖^2
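The MAP connection can be made explicit with a short, standard derivation (σ² and τ² are my labels for the noise and prior variances):

```latex
% Model: y_i = w^\top x_i + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2)
% Prior: w \sim N(0, \tau^2 I)
\hat{w}_{\mathrm{MAP}}
  = \arg\max_w \Bigl[ \log p(y \mid X, w) + \log p(w) \Bigr]
  = \arg\min_w \sum_{i} \bigl( y_i - w^\top x_i \bigr)^2
    + \frac{\sigma^2}{\tau^2} \lVert w \rVert^2
```

So λ = σ²/τ²: a narrower prior (smaller τ²) means heavier regularization.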
Neural networks with one hidden layer can approximate any smooth function efficiently (meaning, using a polynomial number of hidden units) (Barron, 1994).
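A quick empirical sketch of one-hidden-layer approximation (my own construction, not the slides'): even with random hidden weights, fitting only the output weights of a tanh network by least squares approximates a smooth 1-D target closely.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)[:, None]   # 200 points on the line
y = np.tanh(2 * x).ravel()             # a smooth target function

# One hidden layer: 50 random tanh units; only output weights are trained
W = rng.normal(size=(1, 50))
b = rng.normal(size=50)
Phi = np.tanh(x @ W + b)               # hidden activations, shape (200, 50)
coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)

err = float(np.max(np.abs(Phi @ coef - y)))
print(err)                             # worst-case error on the grid
```

This is only a 1-D illustration of representational power; the theoretical statement concerns how the number of units scales with dimension and smoothness.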
No Free Lunch theorem: averaged over all possible training sets d and over all possible distributions P, all learning algorithms have the same average error.
Therefore, learning is only possible with assumptions about the data. Classical methods (e.g. kernel machines) encode the smoothness assumption as a convex optimization problem (with a global optimum), a guarantee that non-convex deep learning does not achieve!
Philosophical discussion about high-dimensional spaces
High dimensionality has some interesting, counterintuitive effects:

In 10 dimensions, to cover 10% of data uniformly distributed in a unit cube, one needs a box that covers 80% of the range of each coordinate (since 0.8^10 ≈ 0.1).
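This is a two-line check (my sketch): the edge length of a sub-cube holding a fraction p of uniform data in [0, 1]^d is p^(1/d).

```python
# Edge length of a sub-cube containing 10% of uniform data in [0, 1]^d
for d in (1, 2, 10, 100):
    print(d, round(0.1 ** (1 / d), 3))   # d = 10 gives ~0.794
```

At d = 100 the "local" box already spans more than 97% of every axis: locality loses its meaning in high dimensions.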
For n points uniformly distributed in the d-dimensional unit ball, the nearest data point to the origin has a median distance of (1 − 2^{−1/n})^{1/d}, which approaches 1 as d grows: most of the data sits near the boundary.

Two random high-dimensional vectors are nearly orthogonal: the probability that

cos(x_1, x_2) = |x_1^⊤ x_2| / (‖x_1‖ ‖x_2‖) ≥ √(2 log n / d)

is less than 1/n.
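The near-orthogonality of random high-dimensional vectors is easy to observe numerically (a quick sketch with Gaussian vectors; the variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10_000
x1, x2 = rng.normal(size=(2, d))

# Cosine of the angle between two random Gaussian vectors:
# it concentrates around 0 at rate ~ 1/sqrt(d)
cos = float(abs(x1 @ x2) / (np.linalg.norm(x1) * np.linalg.norm(x2)))
print(cos)
```

With d = 10,000 the cosine is typically around 0.01, i.e. the vectors are almost exactly orthogonal.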
The functions covered are those whose smoothness is bounded:

∫ |f′(x)| dx ≤ C in one dimension, or more generally

∫ |∂f/∂x_1| dx_1 + ∫ |∂f/∂x_2| dx_2 + ⋯ + ∫ |∂f/∂x_d| dx_d ≤ C

which is a very large class of functions!
Why would a CNN make sense?
References:
Barron, A. R. (1994). Approximation and estimation bounds for artificial neural networks. Machine Learning, 14, 115–133.
Anthony, M., & Bartlett, P. L. (1999). Neural Network Learning: Theoretical Foundations (1st ed.). Cambridge University Press.
Wolpert, D. H. (1996). The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7), 1341–1390.
Wolpert, D. H. (2001). The supervised learning no-free-lunch theorems. In: Proceedings of the 6th Online World Conference on Soft Computing in Industrial Applications.
Bottou, L., Chapelle, O., DeCoste, D., & Weston, J. (Eds.) (2007). Large-Scale Kernel Machines. MIT Press.