 
A few methods for learning binary classifiers
600.325/425 Declarative Methods - J. Eisner

Fundamental Problem of Machine Learning: It is ill-posed
(slide thanks to Tom Dietterich, modified)

Learning Appears Impossible
[Table: spam detection examples; columns x1, x2, x3, x4, y]
- There are 2^16 = 65536 possible boolean functions over four input features.
- Why? Such a function is defined by 2^4 = 16 rows. So its output column has 16 slots for answers. There are 2^16 ways it could fill those in.
- We can't figure out which one is correct until we've seen every possible input-output pair.
- After 7 examples, we still have 9 slots to fill in, or 2^9 possibilities.
(slide thanks to Tom Dietterich, modified)

Solution: Work with a restricted hypothesis space
- We need to generalize from our few training examples!
- Either by applying prior knowledge or by guessing, we choose a space of hypotheses H that is smaller than the space of all possible Boolean functions:
  - simple conjunctive rules
  - m-of-n rules
  - linear functions
  - multivariate Gaussian joint probability distributions
  - etc.
(slide thanks to Tom Dietterich, modified)

Illustration: Simple Conjunctive Rules
- There are only 16 simple conjunctions (no negation)
- Try them all!
- But no simple rule explains our 7 training examples.
- The same is true for simple disjunctions.
(slide thanks to Tom Dietterich, modified)
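The counting arguments above can be checked with a short sketch. It assumes four boolean features and distinct training examples; the names `remaining_hypotheses` and `eval_conjunction` are illustrative, not from the slides. The count of 16 simple conjunctions is read here as one conjunction per subset of the four features, including the empty conjunction (always true).

```python
from itertools import combinations, product

N_FEATURES = 4  # assumption: the four boolean inputs x1..x4

# Every possible input row of the truth table: 2^4 = 16 rows.
rows = list(product([0, 1], repeat=N_FEATURES))

# Each row's output can be filled in independently with 0 or 1.
num_functions = 2 ** len(rows)


def remaining_hypotheses(num_seen):
    """Boolean functions still consistent after seeing num_seen distinct rows."""
    return 2 ** (len(rows) - num_seen)


# A simple conjunction (no negation) is a subset of features that must all be 1.
conjunctions = [
    subset
    for size in range(N_FEATURES + 1)
    for subset in combinations(range(N_FEATURES), size)
]


def eval_conjunction(subset, x):
    """True iff every feature named in subset is 1 in the input x."""
    return all(x[i] == 1 for i in subset)


print(num_functions)            # size of the unrestricted hypothesis space
print(remaining_hypotheses(7))  # what is left after 7 distinct examples
print(len(conjunctions))        # size of the restricted space
```

Restricting to conjunctions shrinks the space from 65536 candidates to 16, which is why trying them all becomes feasible.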
A larger hypothesis space: m-of-n rules
- At least m of n specified variables must be true
- There are 32 possible rules
- Only one rule is consistent!
(slide thanks to Tom Dietterich, modified)

Two Views of Learning
- View 1: Learning is the removal of our remaining uncertainty about the truth
  - Suppose we knew that the unknown function was an m-of-n boolean function. Then we could use the training examples to deduce which function it is.
- View 2: Learning is just an engineering guess: the truth is too messy to try to find
  - Need to pick a hypothesis class that is
    - big enough to fit the training data "well,"
    - but not so big that we overfit the data and predict test data poorly.
  - Can start with a very small class and enlarge it until it contains a hypothesis that fits the training data perfectly.
  - Or we could stop enlarging sooner, when there are still some errors on training data. (There's a "structural risk minimization" formula for knowing when to stop: a loose bound on the test data error rate.)
(slide thanks to Tom Dietterich, modified)

Balancing generalization and overfitting
- which boundary?
- go for simplicity or accuracy?
- more training data makes the choice more obvious
(figures from a paper by Mueller et al.)

We could be wrong!
1. Multiple hypotheses in the class might fit the data
2. Our guess of the hypothesis class could be wrong
- Within our class, the only answer was
  - "y=true [spam] iff at least 2 of {x1,x3,x4} say so"
- But who says the right answer is an m-of-n rule at all?
- Other hypotheses outside the class also work:
  - y=true iff … (x1 xor x3) ^ x4
  - y=true iff … x4 ^ ~x2
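As a rough check on the m-of-n count, the sketch below enumerates every rule of the form "at least m of this set of variables must be true" over four features, taking each non-empty subset and each m from 1 up to the subset size. It then evaluates the three hypotheses named on the "We could be wrong!" slide on all 16 inputs to show that they are genuinely different functions. The training examples themselves are not reproduced here, so the sketch does not test consistency with them; the function names are illustrative only.

```python
from itertools import combinations, product

N_FEATURES = 4  # assumption: x1..x4 are stored at indices 0..3


def m_of_n_rules(n_features):
    """All (m, subset) pairs: at least m of the variables in subset are true."""
    rules = []
    for size in range(1, n_features + 1):
        for subset in combinations(range(n_features), size):
            for m in range(1, size + 1):
                rules.append((m, subset))
    return rules


def eval_m_of_n(rule, x):
    """True iff at least m of the chosen variables are 1 in the input x."""
    m, subset = rule
    return sum(x[i] for i in subset) >= m


rules = m_of_n_rules(N_FEATURES)
print(len(rules))  # 4*1 + 6*2 + 4*3 + 1*4 rules in total


def at_least_2_of_x1_x3_x4(x):
    return eval_m_of_n((2, (0, 2, 3)), x)


def xor_rule(x):
    # (x1 xor x3) ^ x4, reading ^ as logical AND as on the slide
    return bool((x[0] ^ x[2]) and x[3])


def and_not_rule(x):
    # x4 ^ ~x2
    return bool(x[3] and not x[1])


inputs = list(product([0, 1], repeat=N_FEATURES))
disagreements = [
    x for x in inputs
    if len({at_least_2_of_x1_x3_x4(x), xor_rule(x), and_not_rule(x)}) > 1
]
print(disagreements)  # inputs on which the three hypotheses differ
```

Any input in `disagreements` is one where the choice of hypothesis class changes the prediction, which is the sense in which the guess of class "could be wrong".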
(example thanks to Tom Dietterich)

Two Strategies for Machine Learning
- Use a "little language" to define a hypothesis class H that's tailored to your problem's structure (likely to contain a winner)
  - Then use a learning algorithm for that little language
  - Rule grammars; stochastic models (HMMs, PCFGs …); graphical models (Bayesian nets, Markov random fields …)
  - Dominant view in 600.465 Natural Language Processing
  - Note: Algorithms for graphical models are closely related to algorithms for constraint programming! So you're on your way.
- Just pick a flexible, generic hypothesis class H
  - Use a standard learning algorithm for that hypothesis class
  - Decision trees; neural networks; nearest neighbor; SVMs
  - What we'll focus on this week
  - It's now crucial how you encode your problem as a feature vector
(parts of slide thanks to Tom Dietterich)

Memory-Based Learning
- E.g., k-Nearest Neighbor
- Also known as "case-based" or "example-based" learning
Intuition behind memory-based learning
- Similar inputs map to similar outputs
  - If not true → learning is impossible
  - If true → learning reduces to defining "similar"
- Not all similarities created equal
  - guess J. D. Salinger's weight
    - who are the similar people?
    - similar occupation, age, diet, genes, climate, …
  - guess J. D. Salinger's IQ
    - similar occupation, writing style, fame, SAT score, …
- Superficial vs. deep similarities?
  - B. F. Skinner and the behaviorism movement
  - what do brains actually do?
(parts of slide thanks to Rich Caruana)

1-Nearest Neighbor
- Define a distance d(x1,x2) between any 2 examples
  - examples are feature vectors
  - so could just use Euclidean distance …
- Training: Index the training examples for fast lookup.
- Test: Given a new x, find the closest x1 from training.
  - Classify x the same as x1 (positive or negative)
- Can learn complex decision boundaries
- As training size → ∞, error rate is at most 2x the Bayes-optimal rate (i.e., the error rate you'd get from knowing the true model that generated the data, whatever it is!)
(parts of slide thanks to Rich Caruana)

1-Nearest Neighbor: decision boundary
[Figure: decision boundary. From Hastie, Tibshirani, Friedman 2001 p418]
(slide thanks to Rich Caruana, modified)

k-Nearest Neighbor
- Instead of picking just the single nearest neighbor, pick the k nearest neighbors and have them vote
- Average of k points more reliable when:
  - noise in training vectors x
  - noise in training labels y
  - classes partially overlap
[Figure: scatter plot of o and + examples; axes attribute_1 and attribute_2]
(slide thanks to Rich Caruana, modified)

1 Nearest Neighbor: decision boundary
[Figure: From Hastie, Tibshirani, Friedman 2001 p418]
(slide thanks to Rich Caruana, modified)

15 Nearest Neighbors: it's smoother!
[Figure: From Hastie, Tibshirani, Friedman 2001 p418]
(slide thanks to Rich Caruana, modified)
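The 1-nearest-neighbor and k-nearest-neighbor procedures can be sketched in a few lines. This is a minimal illustration, not an implementation from the course: it assumes Euclidean distance, uses a plain linear scan instead of an index for fast lookup, and the toy data (including one deliberately mislabeled point) is made up for the example.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_classify(train, x, k=1, distance=euclidean):
    """Label x by majority vote among its k nearest training examples.

    train is a list of (feature_vector, label) pairs. With k=1 this is
    plain 1-nearest-neighbor classification.
    """
    neighbors = sorted(train, key=lambda ex: distance(ex[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: a "-" cluster near the origin, a "+" cluster near
# (5, 5), and one noisy "+" label sitting inside the "-" cluster.
train = [
    ((0.0, 0.0), "-"),
    ((0.0, 1.0), "-"),
    ((1.0, 0.0), "-"),
    ((0.2, 0.2), "+"),  # noisy label
    ((5.0, 5.0), "+"),
    ((5.0, 6.0), "+"),
    ((6.0, 5.0), "+"),
]

query = (0.3, 0.3)
print(knn_classify(train, query, k=1))  # follows the single noisy neighbor
print(knn_classify(train, query, k=3))  # the vote outweighs the noisy label
```

With k=1 the query copies the label of the noisy point; with k=3 the two clean neighbors outvote it, which is the "noise in training labels" case where averaging over k points is more reliable. Using an odd k with two classes avoids tied votes.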