

1. ECE 4524 Artificial Intelligence and Engineering Applications
Lecture 22: Introduction to Learning
Reading: AIAMA 18.1-18.3
Today's Schedule:
◮ Motivation for Learning
◮ Types of Learning
◮ Supervised Learning and Hypothesis Spaces
◮ Example: Decision Trees

2. Why Learning?
◮ Not all information is known at design time.
◮ It might be impractical to program all possibilities directly.
◮ Some agents need to be able to adapt over time.
◮ We might not know how to solve a problem directly by design.
This area in general is referred to as Machine Learning.

3. Learning is a very general concept. It can be applied to all elements of an agent's design, e.g. we might
◮ learn functions mapping percepts to internal states
◮ learn functions mapping states to actions
◮ learn the agent model itself
◮ learn probabilities
◮ learn utilities of internal states or actions
Any agent component with a representation, prior knowledge of the representation, and a way to update the representation using feedback can use learning methods.

4. Categorization of Learning
The most basic distinction in learning is the difference between
◮ Deductive Learning
◮ Inductive Learning
Within inductive learning there is
◮ unsupervised learning
◮ reinforcement learning
◮ supervised learning

5. Supervised Learning
Supervised learning is conceptually very simple, but has many practical and subtle issues.
◮ Given a training set consisting of examples
$$D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$$
where each example obeys $y_i = f(x_i)$ for some unknown function $f(\cdot)$.
◮ Find a function, the hypothesis $h(\cdot)$,
$$y = h(x)$$
that approximates the true $f$.
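As a concrete illustration (not from the slides), here is a minimal Python sketch of this setup: the "unknown" $f$ is assumed to be a noisy line, $H$ is taken to be degree-1 polynomials, and least squares picks the $h \in H$ that best fits $D$. The data and the choice of $f$ are purely illustrative.

```python
import numpy as np

# Hypothetical training set D = {(x_i, y_i)}: samples of an unknown f
# (here f(x) = 2x + 1 plus noise, purely for illustration).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=50)

# Hypothesis space H: degree-1 polynomials. Least squares selects the
# h in H that best approximates f on the training examples.
coeffs = np.polyfit(x, y, deg=1)
h = np.poly1d(coeffs)

print(h(5.0))  # h's approximation of the true f at x = 5
```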

6. The quality of the approximation is measured using the Test Set
$$T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$$
where $m < n$ and $T \cap D = \emptyset$.
◮ Collecting training and testing sets is often hard and expensive.
◮ An $h$ that performs well on the test set is said to generalize well.
◮ An $h$ that performs well on the training set (said to be consistent) but poorly on the test set is said to be over-trained.
Note the test set is independent of the training set!
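A hedged sketch of the train/test protocol, under the same illustrative assumptions (a synthetic $f$, polynomial hypothesis spaces; the degrees are arbitrary choices): the split is disjoint, and comparing training and test error across degrees contrasts generalization with over-training.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=40)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=40)

# Disjoint split, T ∩ D = ∅: 25 points for training, 15 held out for testing.
x_tr, y_tr = x[:25], y[:25]
x_te, y_te = x[25:], y[25:]

for deg in (1, 3, 12):
    h = np.poly1d(np.polyfit(x_tr, y_tr, deg))
    train_mse = np.mean((h(x_tr) - y_tr) ** 2)
    test_mse = np.mean((h(x_te) - y_te) ** 2)
    # deg=1 is unrealizable (H too restrictive); deg=12 tends to fit the
    # training noise and do worse on the test set: it is over-trained.
    print(f"deg {deg:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```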

7. Some Nomenclature
◮ When $y$ is finite with a categorical interpretation, this is a classification problem.
◮ If $y$ is binary, it is a binary classification problem.
◮ If $y$ is continuous, then it is a regression problem.

8. Hypothesis Space
In $y = h(x)$, $h$ is a hypothesis in some space of functions $H$.
◮ The goal is to find a consistent $h$ with the smallest testing error and the simplest representation (Ockham's Razor).
◮ If we restrict the space $H$, it may be that no $h$ can be found which approximates $f$ sufficiently (the problem is unrealizable).
◮ The complexity/expressiveness of $H$ and the generalization of $h \in H$ are related through the bias-variance dilemma.

9. Bayesian analysis gives us a useful framework for supervised learning.
◮ Let $h \in H$ be parameterized by $\theta$, and let the training data be given by $D$; then the posterior of the parameters is
$$p(\theta \mid D, h) = \frac{p(D \mid \theta, h)\, p(\theta \mid h)}{p(D \mid h)}$$
◮ The normalizer $p(D \mid h)$ is the evidence for $h$, and the posterior of the model is
$$p(h \mid D) = \frac{p(D \mid h)\, p(h)}{p(D)}$$
where the denominator $p(D)$ integrates over all models in $H$.

10. Bayesian analysis gives us a useful framework for supervised learning.
◮ The maximum likelihood (ML) model ignores the prior over models,
$$h_{ML} = \operatorname*{argmax}_h \; p(D \mid h),$$
and is the model with the most evidence.
◮ The maximum a-posteriori (MAP) model includes the prior over models,
$$h_{MAP} = \operatorname*{argmax}_h \; p(h \mid D) = \operatorname*{argmax}_h \; p(D \mid h)\, p(h),$$
where the denominator $p(D)$ is common to all models and so irrelevant to the model selection.
We can also average models by choosing the top models rather than a single model. This is particularly useful in binary classification, where the models can simply vote on the final classifier output.
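To make ML versus MAP selection concrete, here is a small sketch over a hypothetical discrete hypothesis space: three candidate coin biases with an assumed prior. The data, biases, and prior are invented for illustration.

```python
# Hypothetical hypothesis space H: three candidate coin biases (P(heads)).
H = {"fair": 0.5, "biased": 0.8, "trick": 0.95}
prior = {"fair": 0.90, "biased": 0.09, "trick": 0.01}  # assumed p(h)

data = [1, 1, 1, 0, 1]  # observed flips D (1 = heads), IID by assumption
k, n = sum(data), len(data)

# Likelihood p(D | h) for each h (Bernoulli flips).
lik = {h: q**k * (1 - q) ** (n - k) for h, q in H.items()}

h_ml = max(lik, key=lik.get)                       # argmax_h p(D|h)
h_map = max(lik, key=lambda h: lik[h] * prior[h])  # argmax_h p(D|h) p(h)
print(h_ml, h_map)  # 'biased' vs 'fair': the prior changes the selection
```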

11. Utility of Models
◮ We assume the true $f(x)$ is stationary and the samples are IID.
◮ The error rate is the proportion of incorrect classifications.
◮ Note the error rate may be misleading, since it makes no distinction about utility differences.
Example: a binary classifier has 4 outcome cases: TP, FP, TN, FN.
◮ The cost of a FP and a FN may not be the same.
◮ This is accounted for via a utility/loss function.
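A small sketch of this point, with an assumed loss matrix (the 10x false-negative cost is an arbitrary illustration): the same set of predictions yields one error rate but a very different expected loss.

```python
# Hypothetical loss matrix: a false negative is assumed 10x as costly
# as a false positive; the raw error rate ignores this distinction.
loss = {"TP": 0.0, "TN": 0.0, "FP": 1.0, "FN": 10.0}

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 0, 0]

def outcome(t, p):
    """Classify one prediction as TP, TN, FP, or FN."""
    if t == 1 and p == 1: return "TP"
    if t == 0 and p == 0: return "TN"
    if t == 0 and p == 1: return "FP"
    return "FN"

outcomes = [outcome(t, p) for t, p in zip(y_true, y_pred)]
error_rate = sum(o in ("FP", "FN") for o in outcomes) / len(outcomes)
expected_loss = sum(loss[o] for o in outcomes) / len(outcomes)
print(error_rate, expected_loss)  # same mistakes, very different cost
```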

12. Sources of Model Error
◮ The estimated $h$ may differ from the true $f$ because
1. the space $H$ is overly restrictive (unrealizable)
2. the variance is large (high degrees of freedom)
3. $f$ itself may be non-deterministic (noisy)
4. $f$ is "too complex"
◮ Most of machine learning has focused on 1 and 2.
◮ A large open area in machine learning now is 4, "learning in the large" (e.g. neuroscience, bioinformatics, sociology, networks).

13. An Example Learning Method: Decision Trees
Consider a simple reflex agent that reasons by testing a series of attribute = value pairs.
◮ Let $x$ be a vector of attributes.
◮ Let $y$ be a +/− or 0/1 assignment for a goal (a binary classifier).
◮ Given $D = \{(x_i, y_i)\}$ for $i = 1, \ldots, N$, build the tree of decisions formed by testing the attributes of $x$ individually.

14. Implementing the Importance Function
The idea is that we want to select the attribute that maximizes our "surprise."
◮ The entropy of a random variable $V$ with values $v_k$ measures its uncertainty, in bits:
$$H(V) = -\sum_k p(v_k) \log_2 p(v_k)$$
◮ For a Boolean random variable with probability of true $= q$, the entropy is
$$B(q) = -\big(q \log_2 q + (1 - q) \log_2 (1 - q)\big)$$
where $q \approx p/(p + n)$ for $p$ positive and $n$ negative samples.
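These two formulas translate directly into code; a minimal sketch (the function names are our own):

```python
import math

def entropy(probs):
    """H(V) = -sum_k p(v_k) log2 p(v_k), in bits (0 log 0 taken as 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def B(q):
    """Entropy of a Boolean variable that is true with probability q."""
    return entropy([q, 1 - q])

# For p positive and n negative examples, estimate q = p / (p + n):
p, n = 9, 5
print(B(p / (p + n)))  # ≈ 0.940 bits of uncertainty in the label
```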

15. Implementing the Importance Function
Now suppose we choose attribute $A$ from $x$.
◮ For each possible value of $A$ we divide the training set into $d$ subsets, the $k$-th having $p_k$ positive and $n_k$ negative examples.
◮ After testing $A$, the remaining entropy is
$$\text{remainder}(A) = \sum_{k=1}^{d} \frac{p_k + n_k}{p + n}\, B\!\left(\frac{p_k}{p_k + n_k}\right)$$
◮ The information gain associated with selecting $A$ is then
$$\text{gain}(A) = B\!\left(\frac{p}{p + n}\right) - \text{remainder}(A)$$
We choose the attribute with the highest gain in information.
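Putting the pieces together, a sketch of attribute selection by information gain on an invented toy training set (the attribute names and examples are illustrative, not from the slides):

```python
import math

def B(q):
    """Boolean entropy, with the pure cases B(0) = B(1) = 0."""
    if q in (0.0, 1.0):
        return 0.0
    return -(q * math.log2(q) + (1 - q) * math.log2(1 - q))

def gain(examples, attr):
    """gain(A) = B(p/(p+n)) - remainder(A) over (attributes, label) pairs."""
    p = sum(1 for _, y in examples if y)
    n = len(examples) - p
    remainder = 0.0
    for v in {x[attr] for x, _ in examples}:  # one subset per value of A
        subset = [(x, y) for x, y in examples if x[attr] == v]
        pk = sum(1 for _, y in subset if y)
        remainder += len(subset) / len(examples) * B(pk / len(subset))
    return B(p / (p + n)) - remainder

# Toy data: choose the attribute with the highest information gain.
D = [({"raining": True,  "hungry": True},  True),
     ({"raining": True,  "hungry": False}, True),
     ({"raining": False, "hungry": True},  False),
     ({"raining": False, "hungry": False}, False)]
best = max(("raining", "hungry"), key=lambda a: gain(D, a))
print(best)  # "raining": it perfectly predicts the label (gain = 1 bit)
```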

16. Next Actions
◮ Reading on learning theory (AIAMA 18.4-18.5)
◮ No warmup.
Reminders:
◮ Quiz 3 will be Thursday 4/12.
◮ PS 3 is due tonight.
