SLIDE 1
ECE 4524 Artificial Intelligence and Engineering Applications - PowerPoint PPT Presentation
ECE 4524 Artificial Intelligence and Engineering Applications
Lecture 22: Introduction to Learning
Reading: AIAMA 18.1-18.3
Today's Schedule:
◮ Motivation for Learning
◮ Types of Learning
◮ Supervised Learning and Hypothesis Spaces
SLIDE 2
SLIDE 3
Learning is a very general concept.
It can be applied to all elements of an agent's design, e.g. we might
◮ learn functions mapping percepts to internal states
◮ learn functions mapping states to actions
◮ learn the agent model itself
◮ learn probabilities
◮ learn utilities of internal states or actions
Any agent component with a representation, prior knowledge of the representation, and a way to update the representation using feedback can use learning methods.
SLIDE 4
Categorization of Learning
The most basic distinction in learning is the difference between
◮ Deductive Learning
◮ Inductive Learning
Within inductive learning there is
◮ unsupervised learning
◮ reinforcement learning
◮ supervised learning
SLIDE 5
Supervised Learning
Supervised learning is conceptually very simple, but has many practical and subtle issues.
◮ Given a training set consisting of examples
D = {(x1, y1), (x2, y2), · · · , (xn, yn)}
where each example obeys yi = f(xi) for some unknown function f(·).
◮ Find a function, the hypothesis h(·), such that y = h(x) approximates the true f.
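This setup can be illustrated with a minimal sketch (all data and the choice of a linear hypothesis space are assumptions for illustration, not from the slides): the learner sees only the examples in D, never the true f, and fits an h from a restricted family.

```python
# Minimal sketch of supervised learning with a linear hypothesis space
# H = { h(x) = a*x + b }, fit by closed-form least squares.

def fit_linear(D):
    """Fit h(x) = a*x + b to training examples D = [(x1, y1), ...]."""
    n = len(D)
    sx = sum(x for x, _ in D)
    sy = sum(y for _, y in D)
    sxx = sum(x * x for x, _ in D)
    sxy = sum(x * y for x, y in D)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return lambda x: a * x + b

# The unknown f here happens to be f(x) = 2x + 1; the learner sees only D.
D = [(0, 1), (1, 3), (2, 5), (3, 7)]
h = fit_linear(D)
print(h(4))  # 9.0 — matches f(4), since f lies inside H
```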
SLIDE 6
The quality of the approximation is measured using the Test Set.
T = {(x1, y1), (x2, y2), · · · , (xm, ym)} where m < n and T ∩ D = ∅
◮ Collecting training and testing sets is often hard and expensive
◮ An h that performs well on the test set is said to generalize well.
◮ An h that performs well on the training set (said to be consistent) but poorly on the test set is said to be over-trained.
Note the test set is independent of the training set!
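A short sketch of measuring generalization (the data and the threshold rule are hypothetical): the same error measure is computed on the training set D and on a disjoint test set T.

```python
def error_rate(h, examples):
    """Fraction of examples where the hypothesis disagrees with the label."""
    return sum(1 for x, y in examples if h(x) != y) / len(examples)

# Hypothetical binary data: the true rule is y = (x >= 3).
D = [(0, 0), (1, 0), (2, 0), (3, 1), (4, 1), (5, 1)]   # training set
T = [(2.5, 0), (3.5, 1), (6, 1)]                        # disjoint test set

h = lambda x: 1 if x >= 3 else 0   # a hypothesis consistent with D
print(error_rate(h, D), error_rate(h, T))  # 0.0 0.0 — generalizes well here
```

An over-trained h would drive the first number to zero while the second stays high.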
SLIDE 7
Some Nomenclature
◮ When y is finite with a categorical interpretation, this is a classification problem
◮ If y is binary, it is a binary classification problem
◮ If y is continuous, then it is a regression problem.
SLIDE 8
Hypothesis Space
In y = h(x), h is a hypothesis in some space of functions H.
◮ The goal is to find a consistent h with the smallest testing error and the simplest representation (Ockham's Razor)
◮ If we restrict the space H, it may be that no h can be found which approximates f sufficiently well (unrealizable).
◮ The complexity/expressiveness of H and the generalization of h ∈ H are related through the bias-variance dilemma.
SLIDE 9
Bayesian analysis gives us a useful framework for supervised learning
◮ Let h ∈ H be parameterized by θ and the training data be given by D; then the posterior of the parameters is
p(θ|D, h) = p(D|θ, h) p(θ|h) / p(D|h)
◮ The posterior of the model is computed from the evidence p(D|h) for h:
p(h|D) = p(D|h) p(h) / p(D)
where the denominator p(D) integrates (or sums) over all models in H
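A worked numerical sketch of the model posterior (both candidate models, the data, and the prior are made-up numbers for illustration): two coin models compete to explain three observed heads.

```python
# Hypothetical two-model space H: h1 = fair coin (p=0.5), h2 = biased coin (p=0.9).
# Observed data D = three heads.

likelihood = {"h1": 0.5**3, "h2": 0.9**3}   # p(D|h), the evidence for each model
prior = {"h1": 0.7, "h2": 0.3}              # p(h), assumed prior over models

evidence = sum(likelihood[h] * prior[h] for h in prior)   # p(D): sums over all of H
posterior = {h: likelihood[h] * prior[h] / evidence for h in prior}
print(posterior)  # h2 dominates despite its smaller prior
```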
SLIDE 10
Bayesian analysis gives us a useful framework for supervised learning
◮ The maximum likelihood model ignores the prior over models:
argmax_h p(D|h)
and is the model with the most evidence.
◮ The maximum a-posteriori (MAP) model includes the prior over models:
argmax_h p(h|D) = argmax_h p(D|h) p(h)
where the denominator p(D) is common to all models and so irrelevant to the model selection.
We can also average models by choosing the top models rather than a single model. This is particularly useful in binary classification, where the models can simply vote on the final classifier output.
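The ML/MAP distinction can be sketched with hypothetical numbers (the two models, data, and the deliberately strong prior are assumptions): with a lopsided enough prior, ML and MAP disagree.

```python
# ML vs. MAP selection over a hypothetical two-model candidate set.
# p(D|h) for D = three heads from a coin; a strong assumed prior favoring h1.
likelihood = {"h1": 0.5**3, "h2": 0.9**3}
prior = {"h1": 0.99, "h2": 0.01}

ml_model = max(likelihood, key=lambda h: likelihood[h])            # most evidence
map_model = max(prior, key=lambda h: likelihood[h] * prior[h])     # evidence x prior
print(ml_model, map_model)  # h2 h1 — the prior overrides the evidence here
```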
SLIDE 11
Utility of models
◮ We assume the true f(x) is stationary and samples are IID.
◮ The error rate is the proportion of incorrect classifications.
◮ Note the error rate may be misleading since it makes no distinction about utility differences.
Example: A binary classifier has 4 cases: TP, FP, TN, FN
◮ The cost of a FP and the cost of a FN may not be the same.
◮ This is accounted for via a utility/loss function.
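The distinction can be sketched as follows (the labels, predictions, and the 10:1 cost ratio are hypothetical): two classifiers with the same error rate can have very different losses once FP and FN carry different costs.

```python
# Error rate vs. a utility-weighted loss on hypothetical binary predictions.

def confusion(pred, true):
    """Count the four cases TP, FP, TN, FN for 0/1 labels."""
    tp = sum(p == 1 and t == 1 for p, t in zip(pred, true))
    fp = sum(p == 1 and t == 0 for p, t in zip(pred, true))
    tn = sum(p == 0 and t == 0 for p, t in zip(pred, true))
    fn = sum(p == 0 and t == 1 for p, t in zip(pred, true))
    return tp, fp, tn, fn

true = [1, 1, 0, 0, 1, 0]
pred = [1, 0, 0, 1, 1, 0]
tp, fp, tn, fn = confusion(pred, true)

error_rate = (fp + fn) / len(true)   # treats both error types alike
loss = 1 * fp + 10 * fn             # assumed costs: a FN is 10x worse than a FP
print(error_rate, loss)
```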
SLIDE 12
Sources of Model Error
◮ The estimated h may differ from the true f because
1. the space H is overly restrictive (unrealizable)
2. the variance is large (high degrees of freedom)
3. f itself may be non-deterministic (noisy)
4. f is "too complex"
◮ Most of Machine Learning has focused on 1 and 2.
◮ A large open area in machine learning now is 4, "learning in the large" (e.g. neuroscience, bioinformatics, sociology, networks)
SLIDE 13
An example learning method: Decision Trees
Consider a simple reflex agent that reasons by testing a series of attribute = value pairs.
◮ Let x be a vector of attributes
◮ Let y be a +/− or 0/1 assignment for a Goal (a binary classifier)
◮ Given D = (xi, yi) for i = 1 · · · N, build the tree of decisions formed by testing the attributes of x individually.
SLIDE 14
SLIDE 15
Implementing the importance function
The idea is that we want to select the attribute that maximizes our "surprise"
◮ The entropy of a R.V. V with values vk measures its uncertainty, in bits:
H(V) = − Σ_k p(vk) log2 p(vk)
◮ For a Boolean R.V. with probability of true q, the entropy is
B(q) = −(q log2 q + (1 − q) log2(1 − q))
where q ≈ p/(p + n) for p positive and n negative samples.
SLIDE 16
Implementing the importance function
Now suppose we choose attribute A from x
◮ For each of the d possible values of A we divide the training set into subsets, the k-th having pk positive and nk negative examples
◮ After testing A, the remaining entropy is
remainder(A) = Σ_{k=1}^{d} (pk + nk)/(p + n) · B(pk/(pk + nk))
◮ The information gain associated with selecting A is then
gain(A) = B(p/(p + n)) − remainder(A)
We choose the attribute with the highest gain in information.
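The gain computation above can be sketched directly (the tiny training set and attribute names are made up; in it, one attribute perfectly predicts the label and the other is uninformative):

```python
from math import log2

def B(q):
    """Entropy in bits of a Boolean R.V. with P(true) = q."""
    return 0.0 if q in (0.0, 1.0) else -(q * log2(q) + (1 - q) * log2(1 - q))

def gain(examples, attr):
    """Information gain of testing `attr`; examples are (attribute-dict, 0/1 label) pairs."""
    p = sum(y for _, y in examples)        # positive examples
    n = len(examples) - p                  # negative examples
    remainder = 0.0
    for v in {x[attr] for x, _ in examples}:   # one subset per value of attr
        subset = [(x, y) for x, y in examples if x[attr] == v]
        pk = sum(y for _, y in subset)
        nk = len(subset) - pk
        remainder += (pk + nk) / (p + n) * B(pk / (pk + nk))
    return B(p / (p + n)) - remainder

# Hypothetical training set: 'raining' perfectly predicts the label, 'hungry' does not.
D = [({"raining": 1, "hungry": 1}, 1),
     ({"raining": 1, "hungry": 0}, 1),
     ({"raining": 0, "hungry": 1}, 0),
     ({"raining": 0, "hungry": 0}, 0)]

best = max(["raining", "hungry"], key=lambda a: gain(D, a))
print(best)  # 'raining' — gain 1.0 bit vs. 0.0 for 'hungry'
```

Building the tree then recurses: split on the best attribute and repeat on each subset.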
SLIDE 17