A few methods for learning binary classifiers
600.325/425 Declarative Methods - J. Eisner

Fundamental Problem of Machine Learning: It is ill-posed

slide thanks to Tom Dietterich (modified)

Learning Appears Impossible

There are 2^16 = 65536 possible boolean functions over four input features.

Why? Such a function is defined by 2^4 = 16 rows, so its output column has 16 slots for answers, and there are 2^16 ways it could fill those in.

We can't figure out which one is correct until we've seen every possible input-output pair. After 7 examples, we still have 9 slots to fill in, or 2^9 possibilities.

[figure: truth table over x1, x2, x3, x4 with output column y ("spam detection"); 7 rows are labeled, the remaining 9 are marked "?"]

slide thanks to Tom Dietterich (modified)

Solution: Work with a restricted hypothesis space

We need to generalize from our few training examples! Either by applying prior knowledge or by guessing, we choose a space of hypotheses H that is smaller than the space of all possible Boolean functions:
  • simple conjunctive rules
  • m-of-n rules
  • linear functions
  • multivariate Gaussian joint probability distributions
  • etc.

slide thanks to Tom Dietterich (modified)

Illustration: Simple Conjunctive Rules

There are only 16 simple conjunctions (no negation). Try them all! But no simple rule explains our 7 training examples. The same is true for simple disjunctions.

slide thanks to Tom Dietterich (modified)

A larger hypothesis space: m-of-n rules

An m-of-n rule says that at least m of n specified variables must be true. There are 32 possible rules. Only one rule is consistent!

slide thanks to Tom Dietterich (modified)
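To make the counting concrete, here is a small illustrative Python sketch (not from the slides) that enumerates all 32 m-of-n rules over four boolean features and keeps the ones consistent with a training set. The seven examples below are hypothetical stand-ins, labeled by "at least 2 of {x1, x3, x4}" rather than copied from the slide's table.

    # Illustrative sketch: enumerate every m-of-n rule over four boolean features
    # and keep those consistent with the training examples.
    from itertools import combinations

    def m_of_n_rule(m, subset):
        """Classifier: true iff at least m of the features in `subset` are 1."""
        return lambda x: sum(x[i] for i in subset) >= m

    # Hypothetical training examples (x1..x4, label) -- NOT the slide's table.
    # They happen to be labeled by "at least 2 of {x1, x3, x4}".
    train = [((1, 0, 1, 0), True),  ((0, 1, 0, 1), False),
             ((1, 1, 1, 1), True),  ((0, 0, 1, 0), False),
             ((1, 0, 0, 1), True),  ((0, 0, 0, 0), False),
             ((0, 1, 1, 1), True)]

    consistent = []
    for n in range(1, 5):                          # size of the chosen variable subset
        for subset in combinations(range(4), n):   # which variables
            for m in range(1, n + 1):              # how many of them must be true
                rule = m_of_n_rule(m, subset)
                if all(rule(x) == y for x, y in train):
                    consistent.append((m, subset))

    print(consistent)   # with these labels, only (2, (0, 2, 3)) should survive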

Two Views of Learning

View 1: Learning is the removal of our remaining uncertainty about the truth
  • Suppose we knew that the unknown function was an m-of-n boolean function. Then we could use the training examples to deduce which function it is.

View 2: Learning is just an engineering guess – the truth is too messy to try to find
  • Need to pick a hypothesis class that is big enough to fit the training data "well," but not so big that we overfit the data & predict test data poorly.
  • Can start with a very small class and enlarge it until it contains a hypothesis that fits the training data perfectly.
  • Or we could stop enlarging sooner, when there are still some errors on training data. (There's a "structural risk minimization" formula for knowing when to stop – a loose bound on the test data error rate.)

slide thanks to Tom Dietterich (modified)

Balancing generalization and overfitting

figures from a paper by Mueller et al.

Which boundary should we pick – go for simplicity or for accuracy? More training data makes the choice more obvious.

We could be wrong!

1. Multiple hypotheses in the class might fit the data
2. Our guess of the hypothesis class could be wrong
  • Within our class, the only answer was "y=true [spam] iff at least 2 of {x1,x3,x4} say so"
  • But who says the right answer is an m-of-n rule at all? Other hypotheses outside the class also work:
    • y=true iff … (x1 xor x3) ^ x4
    • y=true iff … x4 ^ ~x2

example thanks to Tom Dietterich

Two Strategies for Machine Learning

Strategy 1: Use a "little language" to define a hypothesis class H that's tailored to your problem's structure (likely to contain a winner)
  • Then use a learning algorithm for that little language
  • Rule grammars; stochastic models (HMMs, PCFGs …); graphical models (Bayesian nets, Markov random fields …)
  • Dominant view in 600.465 Natural Language Processing
  • Note: Algorithms for graphical models are closely related to algorithms for constraint programming! So you're on your way.

Strategy 2: Just pick a flexible, generic hypothesis class H
  • Use a standard learning algorithm for that hypothesis class
  • Decision trees; neural networks; nearest neighbor; SVMs
  • What we'll focus on this week
  • It's now crucial how you encode your problem as a feature vector

parts of slide thanks to Tom Dietterich

Memory-Based Learning

E.g., k-Nearest Neighbor. Also known as "case-based" or "example-based" learning.

Intuition behind memory-based learning

Similar inputs map to similar outputs
  • If not true, learning is impossible
  • If true, learning reduces to defining "similar"

Not all similarities are created equal
  • guess J. D. Salinger's weight
    • who are the similar people? similar occupation, age, diet, genes, climate, …
  • guess J. D. Salinger's IQ
    • similar occupation, writing style, fame, SAT score, …
  • Superficial vs. deep similarities?
    • B. F. Skinner and the behaviorism movement
    • what do brains actually do?

parts of slide thanks to Rich Caruana

1-Nearest Neighbor

Define a distance d(x1, x2) between any 2 examples
  • examples are feature vectors, so could just use Euclidean distance …

Training: Index the training examples for fast lookup.
Test: Given a new x, find the closest x1 from training. Classify x the same as x1 (positive or negative).

Can learn complex decision boundaries
  • As training size → ∞, the error rate is at most 2x the Bayes-optimal rate (i.e., the error rate you'd get from knowing the true model that generated the data – whatever it is!)

parts of slide thanks to Rich Caruana
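A minimal 1-NN sketch in Python, assuming plain feature tuples and unweighted Euclidean distance (a real implementation would index the training set, e.g. with a k-d tree, instead of scanning it):

    import math

    def euclidean(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    def nn_classify(x, train):
        """train is a list of (feature_vector, label); return the label of the closest example."""
        nearest_x, nearest_y = min(train, key=lambda pair: euclidean(x, pair[0]))
        return nearest_y

    train = [((1.0, 2.0), '+'), ((4.0, 4.0), '-'), ((0.5, 1.5), '+')]
    print(nn_classify((1.2, 1.8), train))   # '+': its nearest neighbor is a positive point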

1-Nearest Neighbor – decision boundary

[figure from Hastie, Tibshirani & Friedman 2001, p. 418]

slide thanks to Rich Caruana (modified)

k-Nearest Neighbor

Instead of picking just the single nearest neighbor, pick the k nearest neighbors and have them vote.

Averaging over k points is more reliable when there is:
  • noise in the training vectors x
  • noise in the training labels y
  • partial overlap between the classes

[figure: overlapping + and o points plotted over attribute_1 vs. attribute_2]

slide thanks to Rich Caruana (modified)

1-Nearest Neighbor – decision boundary

[figure from Hastie, Tibshirani & Friedman 2001, p. 418]

slide thanks to Rich Caruana (modified)

15 Nearest Neighbors – it's smoother!

[figure from Hastie, Tibshirani & Friedman 2001, p. 418]

slide thanks to Rich Caruana (modified)

How to choose "k"

Odd k (often 1, 3, or 5):
  • avoids the problem of breaking ties (in a binary classifier)

Large k:
  • less sensitive to noise (particularly class noise)
  • better probability estimates for discrete classes
  • larger training sets allow larger values of k

Small k:
  • captures the fine structure of the problem space better
  • may be necessary with small training sets

Balance between large and small k – what does this remind you of?

As the training set approaches infinity and k grows large, kNN becomes Bayes optimal.

slide thanks to Rich Caruana (modified)

[figure from Hastie, Tibshirani & Friedman 2001, p. 419]  why?

slide thanks to Rich Caruana (modified)

Cross-Validation

Models usually perform better on training data than on future test cases
  • 1-NN is 100% accurate on training data!

Leave-one-out cross-validation (LOOCV):
  • "remove" each case one at a time
  • use it as a test case, with the remaining cases as the training set
  • average performance over all test cases

LOOCV is impractical with most learning methods, but extremely efficient with MBL!

slide thanks to Rich Caruana
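A small LOOCV sketch for k-NN (illustrative data; Euclidean distance). With memory-based learning there is no model to retrain, so leaving one example out just means classifying it from all the others:

    import math

    def euclidean(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    def knn_predict(x, train, k):
        neighbors = sorted(train, key=lambda pair: euclidean(x, pair[0]))[:k]
        votes = sum(1 if y == '+' else -1 for _, y in neighbors)
        return '+' if votes > 0 else '-'

    def loocv_accuracy(data, k):
        correct = 0
        for i, (x, y) in enumerate(data):
            held_out_train = data[:i] + data[i + 1:]   # leave example i out
            correct += (knn_predict(x, held_out_train, k) == y)
        return correct / len(data)

    data = [((0, 0), '-'), ((0, 1), '-'), ((1, 0), '-'),
            ((3, 3), '+'), ((3, 4), '+'), ((4, 3), '+')]
    for k in (1, 3, 5):
        print(k, loocv_accuracy(data, k))   # on this tiny set, very large k starts to hurt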

Distance-Weighted kNN

hard to pick large vs. small k
  • may not even want k to be constant

use large k, but more emphasis on nearer neighbors?

  prediction(x) = ( ∑i=1..k wi ⋅ yi ) / ( ∑i=1..k wi )

where x1, …, xk are the k-NN, y1, …, yk are their labels, and we define the relative weights wi, e.g.:
  • maybe: wi = 1 / Dist(xi, x)
  • often: wi = 1 / exp(β ⋅ Dist(xi, x))

parts of slide thanks to Rich Caruana
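A small distance-weighted kNN sketch (illustrative; it uses the exponential weighting option above, and β is just a knob to tune on development data):

    import math

    def euclidean(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    def weighted_knn_predict(x, train, k=5, beta=1.0):
        """train: list of (vector, label) with labels +1 / -1.
        Nearer neighbors get exponentially larger weights in the vote."""
        neighbors = sorted(train, key=lambda pair: euclidean(x, pair[0]))[:k]
        num = sum(math.exp(-beta * euclidean(x, xi)) * yi for xi, yi in neighbors)
        den = sum(math.exp(-beta * euclidean(x, xi)) for xi, _ in neighbors)
        return +1 if num / den > 0 else -1

    train = [((0, 0), -1), ((1, 0), -1), ((0, 1), -1), ((5, 5), +1), ((5, 6), +1)]
    print(weighted_knn_predict((1, 1), train, k=5))   # -1: the nearby negatives dominate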

Combining k-NN with other methods, #1

Instead of having the k-NN simply vote, put them into a little machine learner!

To classify x, train a "local" classifier on its k nearest neighbors (maybe weighted):
  • polynomial, neural network, …

parts of slide thanks to Rich Caruana

Now back to that distance function

Euclidean distance treats all of the input dimensions as equally important.

[figure: + and o points plotted over attribute_1 vs. attribute_2]

Problem #1: What if the input represents physical weight not in pounds but in milligrams?
  • Then small differences in the physical-weight dimension have a huge effect on distances, overwhelming other features. (In the rescaled plot, these o's are now "closer" to + than to each other – bad.)
  • Should really correct for these arbitrary "scaling" issues.
  • One simple idea: rescale each dimension so that its standard deviation = 1.

[figure: the same points plotted with weight in lb vs. with weight in mg]

parts of slide thanks to Rich Caruana

Now back to that distance function

Euclidean distance treats all of the input dimensions as equally important.

Problem #2: What if some dimensions are more correlated with the true label?
  • (more relevant, or less noisy)
  • Stretch those dimensions out so that they are more important in determining distance.
  • One common technique for deciding how much to stretch is called "information gain."

[figure: the points plotted against the most relevant attribute vs. attribute_2, before and after stretching – good]

parts of slide thanks to Rich Caruana

Weighted Euclidean Distance

  d(x, x′) = ∑i=1..N si ⋅ (xi − x′i)²

  • large weight si → attribute i is more important
  • small weight si → attribute i is less important
  • zero weight si → attribute i doesn't matter

slide thanks to Rich Caruana (modified)
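A sketch of that weighted distance together with the "standard deviation = 1" rescaling idea from Problem #1 (illustrative; choosing si = 1/variancei is just one reasonable option):

    import statistics

    def weighted_sq_distance(x, x2, s):
        """d(x, x') = sum_i s_i * (x_i - x'_i)^2, with per-dimension weights s."""
        return sum(si * (xi - x2i) ** 2 for si, xi, x2i in zip(s, x, x2))

    def inverse_variance_weights(points):
        """Weight each dimension by 1/variance, so that a dimension measured in
        milligrams doesn't drown out everything else."""
        dims = list(zip(*points))                     # transpose: one tuple per dimension
        return [1.0 / (statistics.pvariance(d) or 1.0) for d in dims]

    points = [(150.0, 3.0), (160.0, 2.5), (155.0, 3.5)]   # e.g. weight (lb), attribute_2
    s = inverse_variance_weights(points)
    print(weighted_sq_distance(points[0], points[1], s))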

Now back to that distance function

Euclidean distance treats all of the input dimensions as equally important.

Problem #3: Do we really want to decide separately and theoretically how to scale each dimension?
  • Could simply pick dimension scaling factors to maximize performance on development data.
  • Similarly, pick the number of neighbors k and how to weight them.
  • Especially useful if the performance measurement is complicated (e.g., 3 classes and differing misclassification costs).

[figure: attribute_1 vs. attribute_2 scatter plot – should replot on a log scale before measuring distance]

Now back to that distance function

Euclidean distance treats all of the input dimensions as equally important.

Problem #4: Is it the original input dimensions that we want to scale? What if the true clusters run diagonally? Or curve?

We can transform the data first by extracting a different, useful set of features from it:
  • Linear discriminant analysis
  • Hidden layer of a neural network
  • i.e., redescribe the data by how a different type of learned classifier internally sees it

[figure: attribute_1 vs. attribute_2, and the same data replotted as exp(weight) vs. attribute_2 – want to stretch along the diagonal dimension]

Now back to that distance function

Euclidean distance treats all of the input dimensions as equally important.

Problem #5: Do we really want to transform the data globally? What if different regions of the data space behave differently?
  • Could find 300 "nearest" neighbors (using the global transform), then locally transform that subset of the data to redefine "near."
  • Maybe could use decision trees to split up the data space first.

[figure: scatter plots of regions of the data space where different local transforms would help]

Why are we doing all this preprocessing?

Shouldn't the user figure out a smart way to transform the data before giving it to k-NN?

Sure, that's always good, but what will the user try?
  • Probably a lot of the same things we're discussing. She'll stare at the training data and try to figure out how to transform it so that close neighbors tend to have the same label.
  • To be nice to her, we're trying to automate the most common parts of that process – like scaling the dimensions appropriately.
  • We may still miss patterns that her visual system or expertise can find. So she may still want to transform the data.
  • On the other hand, we may find patterns that would be hard for her to see.

Tangent: Decision Trees (a different simple method)

Split on the feature that reduces our uncertainty most.

Is this Reuters article an Earnings Announcement? 2301/7681 = 0.3 of all docs are.

[decision-tree figure: the root (2301/7681 = 0.3) splits on contains "cents" ≥ 2 times (1607/1704 = 0.943) vs. < 2 times (694/5977 = 0.116); the first branch splits on contains "versus" ≥ 2 times (1398/1403 = 0.996, "yes") vs. < 2 times (209/301 = 0.694); the second branch splits on contains "net" ≥ 1 time (422/541 = 0.780) vs. < 1 time (272/5436 = 0.050, "no")]

example thanks to Manning & Schütze

Booleans, Nominals, Ordinals, and Reals

Consider attribute value differences (xi – x′i): what does subtraction do?
  • Reals: easy! full continuum of differences
  • Integers: not bad: discrete set of differences
  • Ordinals: not bad: discrete set of differences
  • Booleans: awkward: Hamming distance is 0 or 1
  • Nominals? not good! recode as Booleans?

slide thanks to Rich Caruana (modified)

"Curse of Dimensionality"

Pictures on previous slides showed 2-dimensional data. What happens with lots of dimensions? 10 training samples cover the space less & less well …

images thanks to Charles Annis

"Curse of Dimensionality"

A deeper perspective on this:
  • Random points chosen in a high-dimensional space tend to all be pretty much equidistant from one another!
  • (Proof: in 1000 dimensions, the squared distance between two random points is a sample variance of 1000 coordinate distances. Since 1000 is large, this sample variance is usually close to the true variance.)
  • So each test example is about equally close to most training examples.
  • We need a lot of training examples to expect one that is unusually close to the test example.

images thanks to Charles Annis
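A quick numeric illustration of that equidistance claim (not from the slides): as the dimension grows, the pairwise distances between random points bunch up around a common value, so "nearest" means less and less.

    import math, random

    random.seed(0)

    def pairwise_distance_spread(dim, n_points=50):
        pts = [[random.random() for _ in range(dim)] for _ in range(n_points)]
        dists = [math.dist(p, q) for i, p in enumerate(pts) for q in pts[i + 1:]]
        return min(dists) / max(dists), sum(dists) / len(dists)

    for dim in (2, 10, 100, 1000):
        ratio, mean = pairwise_distance_spread(dim)
        print(f"dim={dim:5d}  min/max distance ratio={ratio:.2f}  mean distance={mean:.2f}")
    # The min/max ratio climbs toward 1 as dim grows: everything is about equally far away.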

"Curse of Dimensionality"

Also, with lots of dimensions/attributes/features, the irrelevant ones may overwhelm the relevant ones:

  d(x, x′) = ∑i=1..relevant (xi − x′i)² + ∑j=1..irrelevant (xj − x′j)²

So the ideas from previous slides grow in importance:
  • feature weights (scaling)
  • feature selection (try to identify & discard irrelevant features)
    • but with lots of features, some irrelevant ones will probably accidentally look relevant on the training data
  • smooth by allowing more neighbors to vote (e.g., larger k)

slide thanks to Rich Caruana (modified)

Advantages of Memory-Based Methods

Lazy learning: don't do any work until you know what you want to predict (and from what variables!)
  • never need to learn a global model
  • many simple local models taken together can represent a more complex global model

Learns arbitrarily complicated decision boundaries
Very efficient cross-validation
Easy to explain to users how it works
  • … and why it made a particular decision!

Can use any distance metric: string-edit distance, …
Handles missing values, time-varying distributions, ...

slide thanks to Rich Caruana (modified)

Weaknesses of Memory-Based Methods

Curse of Dimensionality
  • often works best with 25 or fewer dimensions

Classification runtime scales with training set size
  • clever indexing may help (k-d trees?)
  • large training sets will not fit in memory

Sometimes you wish NN stood for "neural net" instead of "nearest neighbor" ☺
  • Simply averaging nearby training points isn't very subtle
  • Naive distance functions are overly respectful of the input encoding

For regression (predicting a number rather than a class), the extrapolated surface has discontinuities

slide thanks to Rich Caruana (modified)

Current Research in MBL

  • Condensed representations to reduce memory requirements and speed up neighbor finding, to scale to 10^6–10^12 cases
  • Learn better distance metrics
  • Feature selection
  • Overfitting, VC-dimension, ...
  • MBL in higher dimensions
  • MBL in non-numeric domains:
    • Case-Based or Example-Based Reasoning
    • Reasoning by Analogy

slide thanks to Rich Caruana

References

  • Locally Weighted Learning, by Atkeson, Moore, Schaal
  • Tuning Locally Weighted Learning, by Schaal, Atkeson, Moore

slide thanks to Rich Caruana

Closing Thought

In many supervised learning problems, all the information you ever have about the problem is in the training set.
  • Why do most learning methods discard the training data after doing learning?
  • Do neural nets, decision trees, and Bayes nets capture all the information in the training set when they are trained?
  • Need more methods that combine MBL with these other learning methods:
    • to improve accuracy
    • for better explanation
    • for increased flexibility

slide thanks to Rich Caruana

Linear Classifiers

Linear regression – standard statistics

As usual, the input is a vector x. The output is a number y = f(x).

Linear regression:
  y = w ⋅ x + b = ∑i wi xi + b = w1x1 + w2x2 + w3x3 + … + wmxm + b

[figure: a regression surface over x1, x2 predicting y]

Linear classification

As usual, the input is a vector x. The output is a class y ∈ {−, +}.

Linear classification:
  y = + if w⋅x + b > 0
  y = − if w⋅x + b < 0

[figure: the decision boundary – a straight line splitting the 2-D data space; in higher dimensions, a flat plane or hyperplane]

weight vector w, threshold b (the classifier asks: does w⋅x exceed −b, crossing a threshold? b is often called the "bias term," since adjusting it will bias the classifier toward picking + or −)
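A tiny sketch of that decision rule (the weights and bias below are made-up illustrative values):

    def linear_classify(x, w, b):
        """Return '+' if w.x + b > 0, else '-'."""
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        return '+' if score > 0 else '-'

    w, b = [2.0, -1.0], -0.5                    # hypothetical weight vector and bias
    print(linear_classify([1.0, 1.0], w, b))    # 2 - 1 - 0.5 = 0.5 > 0, so '+'
    print(linear_classify([0.0, 1.0], w, b))    # -1.5 < 0, so '-'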

Simplify the notation: Eliminate b
(just another reduction: the problem may look easier without b, but isn't)

Rewrite w⋅x + b > 0 as (b, w) ⋅ (1, x) > 0, and call (b, w) the new w′ and (1, x) the new x′.

In other words, replace each example x = (x1, x2, …) with x′ = (1, x1, x2, …), and then look for a classifier of the form w′⋅x′ > 0 (so the first component of w′ is the bias term).

[figure: the same 2-D data and decision boundary as before]

Training a linear classifier

  • Given some supervised training data (usually high-dimensional)
  • What is the best linear classifier (defined by weight vector w′)?
  • Surprisingly, lots of algorithms!
  • Three cases to think about:
    1. The training data are linearly separable (∃ a hyperplane that perfectly divides + from −; then there are probably many such hyperplanes; how to pick?)
    2. The training data are almost linearly separable (but a few noisy or unusual training points get in the way; we'll just allow some error on the training set)
    3. The training data are linearly inseparable (the right decision boundary doesn't look like a hyperplane at all; we'll have to do something smarter)

Linear separability

[figure: linearly separable? data]

images thanks to Tom Dietterich

Linear separability

[figures: almost linearly separable? data, and linearly inseparable data – in fact, the simplest case: y = x1 xor x2]

Can learn e.g. concepts in the "at least m-of-n" family (what are w and b in this case?)
But can't learn arbitrary boolean concepts like xor

images thanks to Tom Dietterich

Finding a separating hyperplane

If the data really are separable, we can set this up as a linear constraint problem with real values:
  • Training data: (x1, +), (x2, +), (x3, -), …
  • Constraints: w⋅x1 $> 0, w⋅x2 $> 0, w⋅x3 $< 0, …
  • Variables: the numeric components of vector w

But there are infinitely many solutions for the solver to find …
  • … luckily, the standard linear programming problem gives a declarative way to say which solution we want: minimize some cost subject to the constraints.
  • So what should the cost be?

Finding a separating hyperplane

Advice: stay in the middle of your lane; drive in the center of the space available to you
  • the hyperplane shouldn't veer too close to any of the training points.

Define the cost of a separating hyperplane = the distance to the nearest training point. Maximize this "margin."

In the 2-dimensional case, usually at most 3 points can be nearest: the hyperplane drives right between them.

The nearest training points to the hyperplane are called the "support vectors" (more in more dims), and are enough to define it.


Finding a separating hyperplane

  • http://www.site.uottawa.ca/~gcaron/LinearApplet/LinearApplet.html

Finding a separating hyperplane

How do we define this cost (the margin) in our constraint program??
  • Cost $= min([distance to point 1, distance to point 2, …])
  • Big nasty distance formulas.
  • Instead we'll use a trick that lets us use a specialized solver.

Finding a separating hyperplane: trick

  • To get a positive example x on the correct side of the red line, pick w so that w⋅x > 0. (Or for negative examples, w⋅x < 0, and as negative as possible.)
  • To give it an extra margin, try to get w⋅x to be as big as possible!
  • Good: w⋅x is bigger for points farther from the red boundary (it is the height of the gray surface).
  • Oops! It's easier to double w⋅x simply by doubling w.
    • That only changes the slope of the gray plane.
    • It doesn't change the red line where w⋅x = 0. So same margin.

[figure: 2-D data with the red decision boundary and the gray surface w⋅x plotted above it]

Finding a separating hyperplane: trick

  ||w|| = √(w1² + w2²)

To keep x far from the red line, we must make the height w⋅x big while keeping the slope ||w|| small!

One option: keep slope ≤ 1, maximize all(?) heights.
Option with a clearer meaning: keep all heights ≥ 1, minimize slope:
  minimize ||w||, subject to
    w⋅x ≥ 1 for each (x, +)
    w⋅x ≤ -1 for each (x, -)

  • Equivalently, instead of minimizing ||w||, minimize ||w||² (gets rid of the √) under the same linear constraints.
  • ||w||² is a quadratic function: we can use a quadratic programming solver!
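Here is an illustrative sketch of that optimization: minimize ||w||² subject to w⋅x ≥ 1 for positives and w⋅x ≤ -1 for negatives. A real SVM package uses a dedicated QP solver; this just leans on scipy's generic SLSQP solver for a toy problem, and assumes the bias has already been folded in via the x′ = (1, x) trick. The data points are hypothetical.

    import numpy as np
    from scipy.optimize import minimize

    # Toy training data: x' = (1, x1, x2), label y in {+1, -1}.
    X = np.array([[1.0, 2.0, 2.0], [1.0, 3.0, 3.0],    # positives
                  [1.0, 0.0, 0.0], [1.0, 0.0, 1.0]])   # negatives
    y = np.array([+1.0, +1.0, -1.0, -1.0])

    objective = lambda w: np.dot(w, w)                 # ||w||^2
    constraints = [{"type": "ineq", "fun": (lambda w, i=i: y[i] * np.dot(w, X[i]) - 1.0)}
                   for i in range(len(y))]             # y_i * (w . x_i) - 1 >= 0

    result = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
    w = result.x
    print(w, [float(np.dot(w, xi)) for xi in X])       # heights should be >= 1 or <= -1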

Support Vector Machines (SVMs)

That's what you just saw! An SVM is a linear classifier that's chosen to maximize the "margin."
  • Margin = distance from the hyperplane decision boundary to its closest training points (the "support vectors").

To choose the SVM, use a quadratic programming solver.
  • Finds the best solution in polynomial time.
  • Mixes soft and hard constraints:
    • Minimize ||w||² while satisfying some w⋅x ≥ 1 and w⋅x ≤ -1 constraints.
    • That is, find a maximum-margin separating hyperplane.

But what if the data aren't linearly separable? The constraint program will have no solution …

SVMs that tolerate noise
(if the training data aren't quite linearly separable)

  • Let's stay declarative: edit the constraint program to allow but penalize misclassification.
  • Instead of requiring w⋅x ≥ 1, only require w⋅x + fudge_factor ≥ 1
    • One fudge factor ("slack variable") for each example
    • Easy to satisfy the constraints now! Major fudging everywhere!
  • Better keep the fudge factors small: just add them to the cost ||w||²
    • It's not free to "move individual points" to improve separability

The new total cost function is a sum trying to balance two objectives:
  • Benefits – Getting an extra inch of margin: $100 (done by reducing ||w||² by some amount while keeping w⋅x + fudge ≥ 1)
  • Costs – Moving one point by one inch: $3 (done by fudging the plane height w⋅x as if the point had been moved)
  • Development data to pick these relative numbers: Priceless

Simpler than SVMs: Just minimize cost
(what everyone did until SVMs were invented recently; still good)

Don't use any hard constraints or QP solvers. Define the "best hyperplane" using only soft constraints on w.
  • In other words, just minimize one big cost function.
  • What should the cost function be?
  • For this, use your favorite function minimization algorithm:
    • gradient descent, conjugate gradient, variable metric, etc.
    • (Go take an optimization course: 550.{361,661,662}.)
    • (Or just download some software!)

[figure: a nasty non-differentiable cost function with local minima vs. a nice smooth and convex cost function – pick one of these]

Simpler than SVMs: Just minimize cost

What should the cost function be?
  • Training data: (x1, +), (x2, +), (x3, -), …
  • Try to ensure that w⋅x ≈ 1 for the positive examples, w⋅x ≈ -1 for the negative examples.
  • Impose some kind of cost for bad approximations "≈".
  • Every example contributes to the total cost. (Whereas the SVM cost only cares about the closest examples.)

We're not saying "w⋅x as big as possible" …
  • Want w⋅x to be positive, not merely "big."
  • The difference between -1 and 1 is "special" … not like the difference between 7 and 9.
  • Anyway, one could make w⋅x bigger just by doubling w (as with SVMs). So w⋅x ≈ 2 shouldn't be twice as good as w⋅x ≈ 1.

Simpler than SVMs: Just minimize cost

The cost that classifier w incurs on a particular example is called its "loss" on that example. Just define it, then look for a classifier that minimizes the total loss over all examples.

Some loss functions …

[figure: loss (if x is a positive example) plotted against w⋅x, showing four curves – squared error, 0-1 loss, perceptron, and logistic]

"Least Mean Squared Error" (LMS)

Training data: (x1, +), (x2, +), (x3, -), …
Cost = (w⋅x1 − 1)² + (w⋅x2 − 1)² + (w⋅x3 − (−1))² + …

We demand w⋅x ≈ 1 for a positive example x!
  • Complain loudly if w⋅x is quite negative ☺
  • Equally loudly if w⋅x is quite positive (?!)

The function optimizer will try to adjust w to drive w⋅x closer to 1. Simulates a marble rolling downhill.

[figure: the squared-error loss curve, plotted against w⋅x for a positive example]

"Least Mean Squared Error" (LMS)

Gradient descent rule for this loss function (repeat until convergence):
  • If w⋅x is too small (< 1), increase w by εx.
  • If w⋅x is too big (> 1), decrease w by εx.
  • (Why does this work?) ε is a tiny number, proportional to how far w⋅x is from its target.
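A minimal LMS gradient-descent sketch in Python for that rule (illustrative data; labels are +1 / -1, and the bias is folded in as the first component of x):

    def dot(w, x):
        return sum(wi * xi for wi, xi in zip(w, x))

    def lms_train(data, dims, rate=0.05, epochs=200):
        w = [0.0] * dims
        for _ in range(epochs):
            for x, y in data:                      # y is +1 or -1
                err = y - dot(w, x)                # how far w.x is from its target
                w = [wi + rate * err * xi for wi, xi in zip(w, x)]   # step proportional to err
        return w

    data = [((1, 2, 2), +1), ((1, 3, 3), +1), ((1, 0, 0), -1), ((1, 0, 1), -1)]
    w = lms_train(data, dims=3)
    print(w, [round(dot(w, x), 2) for x, _ in data])   # scores should come out near +1 / -1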

LMS versus 0-1 loss

[figure: the squared-error and 0-1 loss curves against w⋅x for a positive example]

To see why LMS loss is weird, compare it to 0-1 loss (the blue line). What is the total 0-1 loss over all training examples?
  • Why might that cost be a better thing to minimize?
  • Why isn't it a perfect thing to minimize? (compare with SVMs)
  • Why would it be a difficult function to minimize?

Perceptron algorithm (old!)

[figure: the squared-error, 0-1, and perceptron loss curves against w⋅x for a positive example]

The yellow loss function remedies that problem. Its gradient descent rule is the "perceptron algorithm":
  • If w⋅x < 0, increase w by εx. (Almost the same as before!)
  • If w⋅x > 0, w is working: leave it alone.
  • Since the yellow line is straight, ε is a constant – whereas for the purple line (LMS), it was bigger when w⋅x was more negative. (So purple maybe tried too hard on hopeless examples.)
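The same toy setup with the perceptron update instead (illustrative; labels are +1 / -1, so "increase w by εx" for a misclassified negative example means subtracting x):

    def dot(w, x):
        return sum(wi * xi for wi, xi in zip(w, x))

    def perceptron_train(data, dims, rate=1.0, epochs=100):
        w = [0.0] * dims
        for _ in range(epochs):
            for x, y in data:
                if y * dot(w, x) <= 0:             # misclassified (or on the boundary)
                    w = [wi + rate * y * xi for wi, xi in zip(w, x)]
                # else: w is working on this example, leave it alone
        return w

    data = [((1, 2, 2), +1), ((1, 3, 3), +1), ((1, 0, 0), -1), ((1, 0, 1), -1)]
    w = perceptron_train(data, dims=3)
    print(w, [1 if dot(w, x) > 0 else -1 for x, _ in data])   # should match the labels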

Logistic regression

[figure: the squared-error, 0-1, perceptron, and logistic loss curves against w⋅x for a positive example]

The light blue loss function gets back a "margin"-like idea:
  loss = −log (exp(w⋅x) / (1 + exp(w⋅x)))

  • If x is misclassified, this resembles the perceptron loss.
  • Even if x is correctly classified, we still prefer w⋅x to exceed 0 by even more … within reason: once we can't reduce the loss much more here, there is more benefit in adjusting w to help other examples.
  • The gradient descent rule is again similar. Again, the only difference is how ε is computed.
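A logistic-regression sketch, doing gradient descent on −log(sigmoid(y ⋅ w⋅x)) with labels +1 / -1 (illustrative data). The update has the same shape as the LMS and perceptron ones; only the step size ε differs:

    import math

    def dot(w, x):
        return sum(wi * xi for wi, xi in zip(w, x))

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def logistic_train(data, dims, rate=0.1, epochs=500):
        w = [0.0] * dims
        for _ in range(epochs):
            for x, y in data:                                  # y is +1 or -1
                step = rate * (1.0 - sigmoid(y * dot(w, x)))   # big if badly wrong, tiny if safely right
                w = [wi + step * y * xi for wi, xi in zip(w, x)]
        return w

    data = [((1, 2, 2), +1), ((1, 3, 3), +1), ((1, 0, 0), -1), ((1, 0, 1), -1)]
    w = logistic_train(data, dims=3)
    print([round(sigmoid(dot(w, x)), 2) for x, _ in data])     # prob(+) near 1, 1, 0, 0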

A justification of logistic regression

Logistic regression is derived from the following assumption. Suppose a true linear boundary exists, but is not a separator. It caused the + and − labels to be assigned probabilistically …

  probability that x is labeled + = 1 / (1 + exp(−w⋅x))   – the "sigmoid" or "logistic" function

[figure: data with a high probability of +'s on one side of the boundary, a high probability of −'s on the other, and very near the boundary a transition region where +'s and −'s are about equally likely]

(w determines the boundary line and the gradualness of the transition)

We want to find the boundary so that the + and − labels we actually saw would have been probable. Pick w to maximize the product of their probs. (Equivalently, minimize the sum of their −log(probs). So the loss function is −log(sigmoid(w⋅x)).)

In other words, if x was labeled as + in the training data, we want its prob(+) to be pretty high:
  • The prob definitely should be far from 0. (Going from 1% to 10% gives 10x the probability.)
  • The prob preferably should be close to 1. (Going from 89% to 99% multiplies the probability only by 1.112.)

That's "why" our blue curve is asymmetric!
  • For sigmoid(w⋅x) to be definitely far from 0 and preferably close to 1, w⋅x should be definitely far from −∞ and preferably close to +∞.

One more loss function

Neural networks tend to just use the sigmoid directly as the loss function. (Upside down: the loss is 1 minus it.)

So they try to choose w so that
  • sigmoid(w⋅x) ≈ 1 for positive examples
  • (sigmoid(w⋅x) ≈ 0 for negative examples)

Not the same as asking for w⋅x ≈ 1 directly!
  • It's asking w⋅x to be large (but again, there are diminishing returns – not much added benefit to making it very large).

  sigmoid(w⋅x) = 1 / (1 + exp(−w⋅x))   – the "sigmoid" or "logistic" function

Why? It resembles 0-1 loss ☺ but is differentiable and nice instead of piecewise constant. The total cost function still has local minima, though.

Using linear classifiers on linearly inseparable data

Isn't logistic regression enough?

"Soft" (probabilistic) decision boundary
  • The hyperplane boundary w⋅x = 0 is not so special anymore
  • It just marks where prob(+) = 0.5

So logistic regression can tolerate some overlapping of the + and − areas, especially near the boundary.

But it still assumes a single, straight boundary! How will we deal with seriously inseparable data like xor?

[figure: interleaved clusters of +'s and −'s]

The xor problem

[schematic: inputs x1 and x2 (each 0 or 1) and a constant 1 (the bias input) feed weights w1, w2, and b = w0 into a unit that computes w⋅x + b; the unit outputs + or − according to whether w⋅x + b > 0]

Want to output "+" just when x1 xor x2 = 1. Can't be done with any w = (w0, w1, w2)! Why not?
  • If w is such that turning on either x1=1 or x2=1 will push w⋅x + b above 0, then turning on both x1=1 and x2=1 will push w⋅x + b even higher.
  • Formal proof: these equations would all have to be true, but they are inconsistent:
      b < 0
      b + w1 > 0
      b + w2 > 0
      b + w1 + w2 < 0

[figure: the four xor points in the x1-x2 plane, which no straight line can separate]

The xor solution: Add features

[schematic: add a new input feature "x1 and x2", with its own weight w3, alongside x1, x2, and the bias]

Want to output "+" just when x1 xor x2 = 1.

In this new 3-D space (x1, x2, x1 and x2), it is possible to draw a plane that separates + from −.

slide thanks to Ata Kaban

One choice of weights that works: 2 on x1, 2 on x2, −5 on the new feature "x1 and x2", and bias −1. x1 and x2 each drive the output positive. But if they both fire, then so does the new feature, more than canceling out their combined effect.
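As a quick check (not from the slides), the perceptron sketch from earlier can learn xor once the extra feature is added. The weights it finds won't necessarily be the (−1, 2, 2, −5) choice above, just some separating plane in the new space:

    def dot(w, x):
        return sum(wi * xi for wi, xi in zip(w, x))

    def perceptron_train(data, dims, epochs=1000):
        w = [0.0] * dims
        for _ in range(epochs):
            for x, y in data:
                if y * dot(w, x) <= 0:
                    w = [wi + y * xi for wi, xi in zip(w, x)]
        return w

    def phi(x1, x2):
        return (1, x1, x2, x1 * x2)        # (bias, x1, x2, "x1 and x2")

    xor_data = [(phi(0, 0), -1), (phi(0, 1), +1), (phi(1, 0), +1), (phi(1, 1), -1)]
    w = perceptron_train(xor_data, dims=4)
    print(w, [1 if dot(w, x) > 0 else -1 for x, _ in xor_data])   # should match -1, +1, +1, -1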

"Blessing of dimensionality" ☺

  • In an n-dimensional space, almost any set of up to n+1 labeled points is linearly separable!
  • General approach: Encode your data in such a way that they become linearly separable.
  • Option 1: Choose additional features by hand.
  • Option 2: Automatically learn a small number of features that will suffice. (neural networks)
  • Option 3: Throw in a huge set of features like "x1 and x2", slicing and dicing the original features in many different but standard ways. (kernel methods, usually with SVMs) In fact, it's possible to use an infinite set of features (!)

Another example: The ellipse problem

Ellipse equation: x1² + 2x2² < 3. Not linear in the original variables.

But linear in the squared variables! Map (x1, x2) → (x1², √2 x1x2, x2²).

Boundaries defined by linear combinations of x², y², xy, x, and y are ellipses, parabolas, and hyperbolas in the original space.

Adding new features

Instead of a classifier w such that
  • w⋅x > 0 for positive examples
  • w⋅x < 0 for negative examples,

pick some function Φ that turns x into a longer example vector, and learn a longer weight vector w such that
  • w⋅Φ(x) > 0 for positive examples
  • w⋅Φ(x) < 0 for negative examples

So where does Φ come from?
  • Are there good standard Φ functions to try?
  • Or could we learn Φ too?

What kind of features to consider?

[schematic: the xor network from the previous slides – weights 2 and 2 on x1 and x2, −5 on the feature "x1 and x2", bias −1 – which outputs "+" just when x1 xor x2 = 1]

Some new features can themselves be computed by linear classifiers

[schematic: the "x1 and x2" feature is itself computed by a small unit with weights 2 and 2 and bias −3, which outputs 1 just when x1 and x2 = 1]

Like a biological neuron (brain cell or other nerve cell), which tends to "fire" (spike in electrical output) only if it gets enough total electrical input.

[figure: a sigmoid curve of a cell's output (usually ≈ 0 or 1) plotted against the total input to the cell]

Architecture of a neural network
(a basic "multi-layer perceptron" – there are other kinds)

[figure: an input vector x = (x1, x2, x3, x4) feeds an intermediate ("hidden") vector h = (h1, h2, h3), which feeds the output y (≈ 0 or 1). A small example – often x and h are much longer vectors. The hidden values are real numbers. Computed how?]

Each hidden node is computed from the nodes below it in an identical way (but each uses its own weights, e.g. h2 from the weight vector w2 = (w21, w22, w23, w24)); the same goes for the output node:

  h2 = x ⋅ w2                                    (only linear)
  h2 = 1 if x ⋅ w2 > 0, else 0 if x ⋅ w2 < 0     (not differentiable)
  h2 = 1 / (1 + exp(−x ⋅ w2))                    (the sigmoid)

  y = 1 / (1 + exp(−h ⋅ v))

In summary, y = f(x, W) where W is the collection of weight vectors and f is a certain hairy differentiable function. Treat f as a black box.

We'd like to pick W to minimize (e.g.) (y_true − y)² = (y_true − f(x, W))², summed over all training examples (x, y_true). That is also a differentiable function.

Basic question in differentiating f: for each weight (w23 or v3), if we increased it by ε, how much would y increase or decrease? (What would happen to h1, h2, h3? And how would those changes affect y? Can easily compute all the relevant numbers top-down: "back-propagation.")
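A tiny multi-layer perceptron sketch matching that architecture (2 inputs plus a bias input, a few sigmoid hidden units, one sigmoid output), with back-propagation written out by hand for this one shape. It is illustrative only; the xor data reuse the running example from the slides, and the learning rate, hidden size, and epoch count are arbitrary choices:

    import math, random

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def forward(x, W, V):
        h = [sigmoid(sum(wij * xj for wij, xj in zip(wi, x))) for wi in W]   # hidden layer
        y = sigmoid(sum(vi * hi for vi, hi in zip(V, h)))                    # output node
        return h, y

    def train(data, n_hidden=3, rate=0.5, epochs=5000, seed=0):
        rng = random.Random(seed)
        W = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(n_hidden)]  # x is (1, x1, x2)
        V = [rng.uniform(-1, 1) for _ in range(n_hidden)]
        for _ in range(epochs):
            for x, y_true in data:
                h, y = forward(x, W, V)
                dy = (y - y_true) * y * (1 - y)           # gradient at the output node
                for i in range(n_hidden):                 # back-propagate to each hidden node
                    dh = dy * V[i] * h[i] * (1 - h[i])
                    V[i] -= rate * dy * h[i]
                    W[i] = [wij - rate * dh * xj for wij, xj in zip(W[i], x)]
        return W, V

    xor_data = [((1, 0, 0), 0), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 0)]
    W, V = train(xor_data)
    print([round(forward(x, W, V)[1], 2) for x, _ in xor_data])   # hopefully near 0, 1, 1, 0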

How do you train a neural network?

[figure: the error of the network output plotted against the collection W of all weights]

The function optimizer will try to adjust W to drive the error closer to 0. Simulates a marble rolling downhill. Basic calculus to compute the gradient.

How do you train a neural network?

  • Minimize the loss function, just as before …
  • Use your favorite function minimization algorithm: gradient descent, conjugate gradient, variable metric, etc.

Uh-oh …
  • We made the cost function differentiable by using sigmoids
  • But it's still very bumpy (*sigh*)
  • You can use neural nets to solve SAT, so they must be hard

[figure: a nasty non-differentiable cost function with local minima vs. a nice smooth and convex cost function – pick one of these]

Multiple hidden layers

The human visual system is a feedforward hierarchy of neural modules. Roughly, each module is responsible for a certain function.

slide thanks to Eric Postma (modified)

Decision boundaries of neural nets

  • 0 hidden layers: straight lines (hyperplanes)
  • 1 hidden layer: boundary of a convex region (open or closed)
  • 2 hidden layers: combinations of convex regions

(found this on the web, haven't checked it)

[figures: an example network y(x1, x2) and its decision boundary for each case]

slide thanks to Eric Postma (modified)

Kernel methods

A neural network uses a smallish number of learned features (the hidden nodes).

An alternative is to throw in a large number of standard features: e.g., products of pairs and triples of the original features (or quadruples, or quintuples …). Recall the ellipse example, where the map (x1, x2) → (x1², √2 x1x2, x2²) made the ellipse boundary linear.

But this seems to lead to a big problem:
  • With quintuples, 256 features → about 10^10 features.
  • Won't this make the algorithm really, really slow?
  • And how the heck will we accurately learn 10^10 coefficients (the length of the weight vector w) from a small training set? Won't this lead to horrible overfitting?

Why don't we overfit when learning 10^10 coefficients from a small training set?

[figure: the 0-1 and perceptron loss curves against w⋅x for a positive example]

  • Remember the perceptron algorithm? Initially, w = the 0 vector. If w⋅Φ(x) < 0, increase w by ε⋅Φ(x). If w⋅Φ(x) > 0, w is working: leave it alone.
  • How does it change w? Only ever by adding a multiple of some training example Φ(xi).
  • So w ends up being a linear combination of training examples!
  • Thus, are we really free to choose any 10^10 numbers to describe w?
  • Not for a normal-size training set … we could represent w much more concisely by storing each xi and a coefficient αi. Then w = ∑i αi Φ(xi).
  • Small training set → less complex hypothesis space. (Just as for nearest neighbor.) Good!
  • Better yet, αi is often 0 (if we got xi right all along).
  • All of this also holds for SVMs. (If you looked at the SVM learning algorithm, you'd see w was a linear combination, with αi ≠ 0 only for the support vectors.)

How about speed with 10^10 coefficients?

What computations are needed for the perceptron, SVM, etc.?
  • Testing: Evaluate w⋅Φ(x) for a test example x.
  • Training: Evaluate w⋅Φ(xi) for the training examples xi.

We are storing w as ∑i αi Φ(xi). Can we compute with it that way too?
  Rewrite w⋅Φ(x) = (∑i αi Φ(xi))⋅Φ(x) = ∑i αi (Φ(xi)⋅Φ(x))

So all we need is a fast indirect way to get Φ(xi)⋅Φ(x) without computing huge vectors like Φ(x) …

The kernel trick

Define a "kernel function" k(a, b) that computes Φ(a)⋅Φ(b).

Polynomial kernel: k(a, b) = (a⋅b)²

  k(a, b) = (a⋅b)² = (a1b1 + a2b2)²
          = a1²b1² + 2(a1b1)(a2b2) + a2²b2²
          = (a1², √2 a1a2, a2²) ⋅ (b1², √2 b1b2, b2²)
          = Φ(a) ⋅ Φ(b)   for the Φ we used for the ellipse example

How about (a⋅b)³, (a⋅b)⁴? What Φ do these correspond to? How about (a⋅b + 1)²?
  • Whoa. Does every random function k(a,b) have a Φ?
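A quick numeric check of that identity (illustrative): for the quadratic feature map Φ, the dot product Φ(a)⋅Φ(b) equals the kernel value (a⋅b)², so Φ never has to be built when all we need are dot products.

    import math

    def phi(x):
        x1, x2 = x
        return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    def poly2_kernel(a, b):
        return dot(a, b) ** 2

    a, b = (1.0, 2.0), (3.0, -1.0)
    print(dot(phi(a), phi(b)), poly2_kernel(a, b))    # the two numbers agree (up to float rounding)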

What kernels can I use?

Given an arbitrary function k(a, b), does it correspond to Φ(a)⋅Φ(b) for some Φ?
  • Yes, if k(a, b) is symmetric and meets the so-called Mercer condition. Then k(a, b) is called a "kernel" and we can use it.
  • Sums and products of kernels with one another, and with constants, are also kernels.

Some kernels correspond to weird Φ such that Φ(a) is infinite-dimensional. That is okay – we never compute Φ(a)! We just use the kernel to get Φ(a)⋅Φ(b) directly.
  • Example: the Gaussian Radial Basis Function (RBF) kernel:
      k(a, b) = exp(−||a − b||² / 2σ²)
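To tie the pieces together, here is a kernelized perceptron sketch (illustrative, not from the slides): w is never built explicitly, only coefficients on the training examples, and the only operation ever needed is k(xi, x). It uses the RBF kernel just mentioned, on xor-like data that no straight line in the original space can separate. One bookkeeping difference from the slides' w = ∑i αi Φ(xi): here each αi is kept non-negative and the label yi is folded into the sum instead of into the sign of αi.

    import math

    def rbf_kernel(a, b, sigma=1.0):
        sq = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
        return math.exp(-sq / (2 * sigma ** 2))

    def kernel_perceptron_train(data, kernel, epochs=20):
        alpha = [0.0] * len(data)                  # one coefficient per training example
        for _ in range(epochs):
            for j, (xj, yj) in enumerate(data):
                score = sum(a * yi * kernel(xi, xj)
                            for a, (xi, yi) in zip(alpha, data))
                if yj * score <= 0:                # mistake: fold this example into "w"
                    alpha[j] += 1.0
        return alpha

    def predict(x, data, alpha, kernel):
        score = sum(a * yi * kernel(xi, x) for a, (xi, yi) in zip(alpha, data))
        return +1 if score > 0 else -1

    data = [((0, 0), -1), ((1, 1), -1), ((0, 1), +1), ((1, 0), +1)]   # xor-like layout
    alpha = kernel_perceptron_train(data, rbf_kernel)
    print([predict(x, data, alpha, rbf_kernel) for x, _ in data])     # -1, -1, +1, +1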

How does picking a kernel influence the decision boundaries that can be learned?

What do the decision boundaries look like in the original space?
  • A curve defined by the x such that w⋅Φ(x) = 0
    • for the polynomial kernel, a quadratic function of x1, x2, … = 0
  • Equivalently, the x such that ∑i αi k(xi, x) = 0, where αi > 0 for positive support vectors and αi < 0 for negative support vectors
  • So x is on the decision boundary if its "similarity" to the positive support vectors balances its "similarity" to the negative support vectors, weighted by the fixed coefficients αi.

Is αi k(xi, x) really a "similarity" between xi and x? Kind of: remember
  αi k(xi, x) = αi Φ(xi) ⋅ Φ(x) = αi ||Φ(xi)|| ||Φ(x)|| cos(angle between the vectors Φ(xi), Φ(x))

  • The cosine is a measure of similarity! It is 1 if the high-dimensional vectors point in the same direction, −1 if they point in opposite directions.
  • The other factors don't matter so much … who cares?
    • αi ||Φ(xi)||: the αi could be such that these cancel out (if not, different support vectors just have different weights)
    • ||Φ(x)||: a constant factor for a given x (dropping it doesn't affect whether ∑i αi k(xi, x) = 0)

Visualizing SVM decision boundaries …

  • http://www.site.uottawa.ca/~gcaron/SVMApplet/SVMApplet.html
  • Use the mouse to plot your own data, or try a standard dataset
  • Try a bunch of different kernels
  • The original data here are only in 2-dimensional space, so the decision boundaries are easy to draw.
  • If the original data were in 3-dimensional or 256-dimensional space, you'd get some wild and wonderful curved hypersurfaces.