10-701: Introduction to Deep Neural Networks Machine Learning
http://www.cs.cmu.edu/~10701
Organizational info
All up-to-date info is on the course web page (follow links from my page).
Instructors: Nina Balcan, Ziv Bar-Joseph
Make sure you are subscribed.
Research interests: game theory, approximation algorithms, matroid theory, machine learning theory, discrete optimization, mechanism design, control theory, privacy-preserving learning, algorithmic game theory; COLT 2014 (main learning theory conference)
sschultz@cs.cmu.edu
Email: vitercik@cs.cmu.edu
Office hours: Friday 10-11 in GHC 7511
Research interests: theoretical machine learning, computational economics
forecasting, regulatory genomics
Office hours: Friday 3-4pm
Interests: application of ML in biology, software development in Java, playing the tabla (an Indian drum)
Intro and classification (A.K.A. ‘supervised learning’) Unsupervised learning Graphical models
10/25 (Wednesday): Midterm
Theoretical considerations, nonlinear and kernel methods, reasoning under uncertainty
Recitations
etc.
Easy part: Machine. Hard part: Learning.
Short answer: generalize information from the observed data so that it can be used to make better decisions in the future.
Longer answer: The term Machine Learning is used to characterize a number of different approaches for generalizing from observed data:
new feature set
Given D = {Xi} group the data into Y classes using a model (or function) F: Xi -> Yj
Given D = {environment, actions, rewards}, learn a policy and utility functions: policy F1: {e, r} -> a; utility F2: {a, e} -> R
the supervised learning function F2: {Xi , xk}-> Y
Primarily supervised learning
Semi-supervised learning
Supervised and reinforcement learning
Reinforcement learning
Supervised learning (though can also be trained in an unsupervised way)
Reasoning under uncertainty
[Slide background: many repeated short DNA sequence reads (ACGCTGAGCA, ATTCGATAGC, ...), illustrating large-scale sequence data in biology]
Supervised and unsupervised learning (can also use active learning)
A = “it will rain tomorrow”
A variety of useful facts can be derived from just three axioms:
1. 0 <= P(A) <= 1
2. P(true) = 1, P(false) = 0
3. P(A or B) = P(A) + P(B) - P(A and B)
There have been several attempts to provide an axiomatic foundation for probability; these axioms are the most widely used.
P(A) = 0.2    P(A|B) = 0.5
By conditioning on the values of one or more other random variables we can improve upon our estimates.
p(slept in movie) = 0.5
p(slept in movie | liked movie) = 1/4
p(didn't sleep in movie | liked movie) = 3/4
Slept  Liked
  1      1
  0      1
  0      1
  0      1
  1      0
  1      0
  1      0
  0      0
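These conditional probabilities are just ratios of counts, which a few lines of Python can check. A minimal sketch; the eight (slept, liked) rows below are a reconstruction consistent with the numbers on the slide, not necessarily the original table:

```python
# Conditional probability from counts. The eight (slept, liked) rows are
# a reconstruction consistent with the slide's numbers, not the original table.
rows = [
    (1, 1), (0, 1), (0, 1), (0, 1),  # the four who liked the movie
    (1, 0), (1, 0), (1, 0), (0, 0),  # the four who did not
]

p_slept = sum(s for s, _ in rows) / len(rows)

liked = [s for s, l in rows if l == 1]
p_slept_given_liked = sum(liked) / len(liked)

print(p_slept)              # 0.5
print(p_slept_given_liked)  # 0.25
```

Conditioning simply restricts the counting to the rows where the conditioning event holds.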
The probability that a set of random variables will each take a specific value is their joint distribution.
If we assume independence then P(A,B)=P(A)P(B) However, in many cases such an assumption may be too strong (more later in the class)
Evaluation of classes:

Size  Time  Eval
 30    R     2
 70    R     1
 12    S     2
  8    S     3
 56    R     1
 24    S     2
 10    S     3
 23    R     3
  9    R     2
 45    R     1

P(class size > 20) = 0.6
P(summer) = 0.4
P(class size > 20, summer) = 0.1

P(eval = 1) = 0.3
P(class size > 20, eval = 1) = 0.3
P(A,B) = P(A|B)*P(B)
One of the most powerful rules in probabilistic reasoning:
P(A,B) = P(A | B)P(B) = P(B | A)P(A)  =>  P(A | B) = P(B | A)P(A) / P(B)
Thomas Bayes was an English clergyman who set out his theory of probability in 1764.
Often it would be useful to derive the rule a bit further:
P(B) = P(B, A=1) + P(B, A=0), so
P(A=1 | B) = P(B | A=1)P(A=1) / [P(B | A=1)P(A=1) + P(B | A=0)P(A=0)]
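A small numeric sketch of Bayes rule with the denominator expanded by marginalizing over A; all of the probabilities below are made-up illustrative values:

```python
# Bayes rule with the denominator expanded by marginalizing over A.
# All numbers here are made up for illustration.
p_a1 = 0.3           # prior P(A = 1)
p_b_given_a1 = 0.8   # P(B = 1 | A = 1)
p_b_given_a0 = 0.2   # P(B = 1 | A = 0)

# P(B) = P(B | A=1) P(A=1) + P(B | A=0) P(A=0)
p_b = p_b_given_a1 * p_a1 + p_b_given_a0 * (1 - p_a1)

# P(A=1 | B) = P(B | A=1) P(A=1) / P(B)
p_a1_given_b = p_b_given_a1 * p_a1 / p_b
print(p_a1_given_b)
```

The expanded denominator is what makes the rule usable in practice: P(B) is rarely given directly, but the two conditional terms usually are.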
Input data for a variable or a set of variables -> Density Estimator -> Probability
Binary: coin flip, alarm
Discrete: dice, car model year
Continuous: height, weight, temp.
In some cases density estimation is the goal of specific algorithms we will discuss (e.g. EM), while in others it is part of a more general process (learning in BNs and HMMs).
M is our model (usually a collection of parameters)
We can define the likelihood of the data given the model as follows (for example, M is a coin with parameter θ, and x1, …, xn are independent samples):

L(x1, …, xn | θ) = p(x1 | θ) … p(xn | θ) = ∏_{k=1..n} p(xk | θ)
θ = argmax_θ L(D | θ) = #H / #samples
Let n1 be the number of heads and n2 the number of tails observed. Then:

P(D | M) = θ^n1 (1 - θ)^n2

θ = argmax_θ θ^n1 (1 - θ)^n2

Omitting terms that do not depend on θ and setting the derivative to zero:

n1 θ^(n1-1) (1 - θ)^n2 - n2 θ^n1 (1 - θ)^(n2-1) = 0
n1 (1 - θ) = n2 θ
θ = n1 / (n1 + n2)
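The closed form θ = n1 / (n1 + n2) can be sanity-checked numerically: a brute-force search over the likelihood lands on the same value. A sketch; the counts n1 = 7, n2 = 3 are arbitrary:

```python
# Sanity check: brute-force maximization of the likelihood theta^n1 * (1-theta)^n2
# lands on the closed form n1 / (n1 + n2). The counts are arbitrary.
n1, n2 = 7, 3  # 7 heads, 3 tails

def likelihood(theta):
    return theta ** n1 * (1 - theta) ** n2

# Evaluate the likelihood on a fine grid and take the argmax.
grid = [i / 10000 for i in range(10001)]
theta_grid = max(grid, key=likelihood)

theta_closed = n1 / (n1 + n2)
print(theta_grid, theta_closed)  # 0.7 0.7
```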
When working with products, probabilities of entire datasets often get too small. A possible solution is to use the log of probabilities, often termed ‘log likelihood’
log L(x1, …, xn | θ) = log ∏_{k=1..n} p(xk | θ) = Σ_{k=1..n} log p(xk | θ)

Maximizing this log-likelihood function is the same as maximizing P(dataset | M). In some cases moving to log space also makes the computation easier (for example, by removing exponents).
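A small demonstration of why log space matters: the raw product of many small probabilities underflows to zero in floating point, while the summed log-likelihood stays representable. The sample size and per-sample probability below are arbitrary:

```python
import math

# The product of many small probabilities underflows to 0.0 in floating
# point; the summed log-likelihood stays representable.
probs = [0.01] * 1000  # 1000 i.i.d. samples, each with probability 0.01

product = 1.0
for p in probs:
    product *= p  # eventually underflows to exactly 0.0

log_likelihood = sum(math.log(p) for p in probs)

print(product)         # 0.0
print(log_likelihood)  # about -4605.17
```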
[Histogram 'Sleep': hours slept by students in the class; x-axis: Hours (3 to 11), y-axis: Frequency (2 to 12)]
[Scatter plot 'Sleep / GPA': sleep hours (x-axis, 2 to 12) vs. GPA (y-axis, 2 to 5)]
population of interest
follow a normal (Gaussian) distribution with some mean µ and variance σ², x ~ N(µ, σ²), where θ = (µ, σ²) defines the parameters (mean and variance) of the model.

p(x | θ) = (1 / √(2πσ²)) e^(-(x - µ)² / (2σ²))
For our sleep data:
The likelihood of the model generating the observed samples is
L(x1, …, xn | θ) = p(x1 | θ) … p(xn | θ) = ∏_{i=1..n} p(xi | θ)
(the samples are assumed to be independent).

Maximizing this likelihood with respect to θ = (µ, σ²) leads to the sample mean and the sample variance:

µ = (1/n) Σ_{i=1..n} xi
σ² = (1/n) Σ_{i=1..n} (xi - µ)²
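The Gaussian ML estimates are easy to compute directly. A sketch on made-up "hours slept" values (not the class data), showing the divide-by-n estimates and the fitted density:

```python
import math

# Gaussian MLE on made-up "hours slept" values (not the class data):
# the ML estimates are the sample mean and the divide-by-n variance.
x = [4.0, 5.5, 6.0, 6.5, 7.0, 7.0, 7.5, 8.0, 8.5, 10.0]
n = len(x)

mu = sum(x) / n                            # sample mean
var = sum((xi - mu) ** 2 for xi in x) / n  # MLE variance (divide by n, not n-1)

def normal_pdf(value):
    """Density of the fitted N(mu, var) at `value`."""
    return math.exp(-(value - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

print(mu, var)  # 7.0 2.5
```

Note the MLE divides by n; the unbiased estimator divides by n - 1 instead.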
[Plot: a density f(x) over x = 1 … 6 with an interval [a, b]; the probability of the interval is the area under f(x) between a and b. F(x) denotes the cumulative distribution function.]
Cov(x1, x2) = (1/n) Σ_{i=1..n} (x1,i - µ1)(x2,i - µ2)

[Scatter plots:]
Anti-correlated, covariance: -9.2
Correlated, covariance: 18.33
Independent (almost), covariance: 0.6
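Covariance from data, cov(x, y) = (1/n) Σ (xi - µx)(yi - µy), is a one-line computation. The three small datasets below are invented to show positive, negative, and exactly zero covariance:

```python
# Covariance from data: cov(x, y) = (1/n) * sum((x_i - mean_x) * (y_i - mean_y)).
# The three small datasets are invented for illustration.
def cov(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n

x = [1.0, 2.0, 3.0, 4.0, 5.0]
print(cov(x, [2.0, 4.0, 6.0, 8.0, 10.0]))  # 4.0  (correlated)
print(cov(x, [10.0, 8.0, 6.0, 4.0, 2.0]))  # -4.0 (anti-correlated)
print(cov(x, [5.0, 5.0, 5.0, 5.0, 5.0]))   # 0.0  (y does not vary with x)
```

The sign of the covariance tracks the direction of the relationship, as in the scatter plots above; its magnitude depends on the units of x and y.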