
Artificial Intelligence: Representation and Problem Solving

15-381 April 10, 2007

Introduction to Learning & Decision Trees

Michael S. Lewicki, Carnegie Mellon University

Introduction to learning

  • What is learning?
  • more than just memorizing facts
  • learning the underlying structure of the problem or data
  • A fundamental aspect of learning is generalization:
  • given a few examples, can you generalize to others?
  • Learning is ubiquitous:
  • medical diagnosis: identify new disorders from observations
  • loan applications: predict risk of default
  • prediction (climate, stocks, etc.): predict the future from current and past data
  • speech/object recognition: from examples, generalize to others


aka:

  • regression
  • pattern recognition
  • machine learning
  • data mining

Representation

  • How do we model or represent the world?
  • All learning requires some form of representation.
  • Learning: adjust model parameters to match the data.

[Figure: world (or data) → model {θ1, . . . , θn}]


The complexity of learning

  • Fundamental trade-off in learning: complexity of the model vs. the amount of data required to learn its parameters
  • The more complex the model, the more it can describe, but the more data it requires to constrain the parameters.
  • Consider a hypothesis space of N models:
  • How many bits would it take to identify which of the N models is ‘correct’?
  • log2(N) in the worst case (e.g., singling out one of N = 8 models takes at most log2 8 = 3 yes/no answers)
  • Want simple models that explain the examples and generalize to others
  • Ockham’s (some say Occam) razor


Complex learning example: curve fitting

example from Bishop (2006), Pattern Recognition and Machine Learning

t = sin(2πx) + noise

[Figure: data points (xn, tn) generated from t = sin(2πx) + noise, with a fitted curve y(xn, w)]

How do we model the data?


Polynomial curve fitting

[Figure: polynomial fits of increasing order M to the noisy sin(2πx) data]

y(x, w) = w0 + w1 x + w2 x^2 + · · · + wM x^M = Σ_{j=0..M} wj x^j

E(w) = (1/2) Σ_{n=1..N} [ y(xn, w) − tn ]^2

example from Bishop (2006), Pattern Recognition and Machine Learning
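A minimal sketch of this setup (my own illustration, not code from the lecture): generate noisy samples of sin(2πx), fit a degree-M polynomial by least squares, and evaluate the sum-of-squares error E(w). numpy is an assumed dependency, and the data size and noise level are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data: t = sin(2*pi*x) + Gaussian noise
N = 10
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)

def fit_poly(x, t, M):
    """Least-squares fit of a degree-M polynomial; returns coefficients w."""
    return np.polyfit(x, t, deg=M)

def sse(w, x, t):
    """Sum-of-squares error E(w) = 1/2 * sum_n [y(x_n, w) - t_n]^2."""
    y = np.polyval(w, x)
    return 0.5 * np.sum((y - t) ** 2)

for M in (0, 1, 3, 9):
    w = fit_poly(x, t, M)
    print(f"M={M}: E(w) on training data = {sse(w, x, t):.4f}")
```

Higher M always lowers the training error, which is exactly the trap the next slide addresses.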


More data are needed to learn the correct model

[Figure: high-order polynomial fits with progressively larger data sets]

example from Bishop (2006), Pattern Recognition and Machine Learning

This is overfitting.
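A small numerical illustration of the overfitting point (again my own sketch under assumed data sizes and noise, not from the slides): score the same polynomial fits on held-out data. The high-order model drives the training error toward zero while its error on new samples grows.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n, noise=0.2):
    x = rng.uniform(0.0, 1.0, size=n)
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=noise, size=n)

x_train, t_train = make_data(10)
x_test, t_test = make_data(100)   # held-out data from the same source

for M in (1, 3, 9):
    w = np.polyfit(x_train, t_train, deg=M)
    train_err = 0.5 * np.sum((np.polyval(w, x_train) - t_train) ** 2)
    test_err = 0.5 * np.sum((np.polyval(w, x_test) - t_test) ** 2)
    print(f"M={M}: train E(w)={train_err:.3f}  held-out E(w)={test_err:.3f}")
# Typically M=9 has the smallest training error but the largest held-out error.
```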


Types of learning

[Figure: three learning settings]
  • supervised: world (or data) → model {θ1, . . . , θn}, trained to match a desired output {y1, . . . , yn}
  • unsupervised: world (or data) → model {θ1, . . . , θn}, with no desired output provided
  • reinforcement: world (or data) → model {θ1, . . . , θn}, with the model output evaluated by a reinforcement signal


Decision Trees


Decision trees: classifying from a set of attributes

Predicting credit risk:

<2 years at current job?   missed payments?   defaulted?
N                          N                  N
Y                          N                  Y
N                          N                  N
N                          N                  N
N                          Y                  Y
Y                          N                  N
N                          Y                  N
N                          Y                  Y
Y                          N                  N
Y                          N                  N

[Decision tree: root (bad: 3, good: 7) splits on "missed payments?"; the Y branch is a leaf (bad: 2, good: 1); the N branch (bad: 1, good: 6) splits on "<2 years at current job?" into N (bad: 0, good: 3) and Y (bad: 1, good: 3)]

  • each level splits the data according to a different attribute
  • goal: achieve perfect classification with a minimal number of decisions (a counting sketch follows below)
  • not always possible due to noise or inconsistencies in the data
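To make the counts in the tree concrete, here is a small sketch (not from the slides) that encodes the ten training rows and tallies defaulted vs. not-defaulted for each candidate split, reproducing the bad/good numbers shown above.

```python
from collections import Counter

# Each row: (<2 years at current job?, missed payments?, defaulted?)
rows = [
    ("N", "N", "N"), ("Y", "N", "Y"), ("N", "N", "N"), ("N", "N", "N"),
    ("N", "Y", "Y"), ("Y", "N", "N"), ("N", "Y", "N"), ("N", "Y", "Y"),
    ("Y", "N", "N"), ("Y", "N", "N"),
]

def split_counts(rows, attr_index):
    """Count defaulted (bad) / not defaulted (good) within each attribute value."""
    counts = {}
    for row in rows:
        branch = counts.setdefault(row[attr_index], Counter())
        branch["bad" if row[2] == "Y" else "good"] += 1
    return counts

print("root:", Counter("bad" if r[2] == "Y" else "good" for r in rows))
print("split on missed payments:", split_counts(rows, 1))

# The tree then splits the missed-payments = N branch on <2 years at current job
nested = [r for r in rows if r[1] == "N"]
print("then split on <2 years:", split_counts(nested, 0))
```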

Observations

  • Any boolean function can be represented by a decision tree.
  • but trees are not good for all functions, e.g.:
  • parity function: return 1 iff an even number of inputs are 1
  • majority function: return 1 if more than half of the inputs are 1
  • trees work best when a small number of attributes provides most of the information
  • Note: finding the optimal tree for arbitrary data is NP-hard.


Decision trees with continuous values

Predicting credit risk:

years at current job   # missed payments   defaulted?
7                      0                   N
0.75                   0                   Y
3                      0                   N
9                      0                   N
4                      2                   Y
0.25                   0                   N
5                      1                   N
8                      4                   Y
1.0                    0                   N
1.75                   0                   N

  • Now the tree corresponds to the order and placement of the decision boundaries
  • General case:
  • arbitrary number of attributes: binary, multi-valued, or continuous
  • output: binary, multi-valued (decision or axis-aligned classification trees), or continuous (regression trees)

[Scatter plot: # missed payments vs. years at current job, with axis-aligned decision boundaries (one near years at current job > 1.5)]
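A hedged sketch of the continuous case (not part of the lecture): fit scikit-learn's DecisionTreeClassifier to the table above and print the axis-aligned thresholds it chooses. scikit-learn is an assumed dependency; the lecture does not prescribe any particular library.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Columns: years at current job, # missed payments; target: defaulted?
X = [[7, 0], [0.75, 0], [3, 0], [9, 0], [4, 2],
     [0.25, 0], [5, 1], [8, 4], [1.0, 0], [1.75, 0]]
y = ["N", "Y", "N", "N", "Y", "N", "N", "Y", "N", "N"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X, y)

# Show the learned thresholds (axis-aligned splits on the continuous attributes)
print(export_text(tree, feature_names=["years_at_job", "missed_payments"]))
```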


Examples

  • loan applications
  • medical diagnosis
  • movie preferences (Netflix contest)
  • spam filters
  • security screening
  • many real-world systems, and AI successes
  • In each case, we want
  • accurate classification, i.e. minimize error
  • efficient decision making, i.e. fewest # of decisions/tests
  • the decision sequence could be further complicated:
  • want to minimize false negatives in medical diagnosis, or minimize the cost of the test sequence
  • don’t want to miss important email


Decision Trees

  • simple example of inductive learning
  • 1. learn a decision tree from the training examples
  • 2. predict classes for novel testing examples
  • Generalization is how well we do on the testing examples.
  • Only works if we can learn the underlying structure of the data.

[Figure: training examples → model {θ1, . . . , θn} → class predictions on testing examples]


Choosing the attributes


  • How do we find a decision tree that agrees with the training data?
  • Could just choose a tree that has one path to a leaf for each example
  • but this just memorizes the observations (assuming the data are consistent)
  • we want it to generalize to new examples
  • Ideally, the best attribute would partition the data into positive and negative examples
  • Strategy (greedy):
  • choose the attribute that gives the best partition first
  • Want correct classification with the fewest number of tests


Problems

[Credit-risk table ("Predicting credit risk") and decision tree repeated from the earlier slide]

  • How do we know which attribute or value to split on?
  • When should we stop splitting?
  • What do we do when we can’t achieve perfect classification?
  • What if the tree is too large? Can we approximate it with a smaller tree?

Basic algorithm for learning decision trees

  • 1. start with the whole training data
  • 2. select the attribute or value along a dimension that gives the “best” split
  • 3. create child nodes based on the split
  • 4. recurse on each child using its data, until a stopping criterion is reached:
  • all examples have the same class
  • amount of data is too small
  • tree is too large
  • Central problem: How do we choose the “best” attribute? (A recursive sketch of the procedure follows below.)
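The following is a minimal, illustrative implementation of this recursion (my own sketch, not code from the lecture): it greedily picks the attribute with the highest information gain, splits, and recurses until a stopping criterion fires. All function and parameter names are invented for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """H = -sum p log2 p over the class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Reduction in entropy of the labels from splitting on attribute attr."""
    n = len(labels)
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, attrs, min_size=1):
    # Stopping criteria: pure node, no attributes left, or too little data
    if len(set(labels)) == 1 or not attrs or len(rows) <= min_size:
        return Counter(labels).most_common(1)[0][0]   # leaf: majority class
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    node = {"attr": best, "children": {}}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node["children"][value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx],
            [a for a in attrs if a != best], min_size)
    return node
```

Called on the credit-risk rows with the attributes "missed payments" and "<2 years at current job", it splits on missed payments first, matching the information-gain calculation later in the lecture.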


Measuring information

  • A convenient measure to use is based on information theory.
  • How much “information” does an attribute give us about the class?
  • attributes that partition perfectly should give maximal information
  • unrelated attributes should give no information
  • Information of a symbol w:

I(w) ≡ − log2 P(w)

P(w) = 1/2 ⇒ I(w) = − log2 1/2 = 1 bit
P(w) = 1/4 ⇒ I(w) = − log2 1/4 = 2 bits


Information and Entropy

  • For a random variable X with probability P(x), the entropy is the average (or expected) amount of information obtained by observing x:
  • Note: H(X) depends only on the probability, not the value.
  • H(X) quantifies the uncertainty in the data in terms of bits
  • H(X) gives a lower bound on the cost (in bits) of coding (or describing) X

H(X) = Σ_x P(x) I(x) = − Σ_x P(x) log2 P(x)

P(heads) = 1/2 ⇒ H = − 1/2 log2 1/2 − 1/2 log2 1/2 = 1 bit
P(heads) = 1/3 ⇒ H = − 1/3 log2 1/3 − 2/3 log2 2/3 = 0.9183 bits
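A small sketch (not from the slides) that checks these numbers numerically; the function name is my own.

```python
import math

def entropy(probs):
    """H(X) = -sum_x P(x) log2 P(x), ignoring zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(entropy([1/3, 2/3]))   # biased coin: ~0.9183 bits
```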


Entropy of a binary random variable

  • Entropy is maximum at p = 0.5
  • Entropy is zero at p = 0 or p = 1.
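(For a binary variable with P(1) = p, this is the binary entropy H(p) = − p log2 p − (1 − p) log2(1 − p).)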



English character strings revisited: A-Z and space

  • H = 4.76 bits/char for equiprobable characters (log2 27)
  • H = 4.03 bits/char using single-character frequencies
  • H = 2.8 bits/char when higher-order structure is taken into account
  • The entropy increases as the data become less ordered.


Credit risk revisited

  • How many bits does it take to specify the attribute ‘defaulted?’
  • P(defaulted = Y) = 3/10
  • P(defaulted = N) = 7/10
  • How much can we reduce the entropy (or uncertainty) of ‘defaulted’ by knowing the other attributes?
  • Ideally, we could reduce it to zero, in which case we classify perfectly.

[Credit-risk table ("Predicting credit risk") repeated from the earlier slide]

H(Y) = − Σ_{i=Y,N} P(Y = yi) log2 P(Y = yi)
     = − 0.3 log2 0.3 − 0.7 log2 0.7 = 0.8813 bits


Conditional entropy

  • H(Y|X) is the remaining entropy of Y given X
  • the expected (or average) entropy of P(y|x)
  • H(Y|X = x) is the specific conditional entropy, i.e. the entropy of Y knowing the value of a specific attribute x.

H(Y|X) ≡ − Σ_x P(x) Σ_y P(y|x) log2 P(y|x)
       = − Σ_x P(x) Σ_y P(Y = y|X = x) log2 P(Y = y|X = x)
       = Σ_x P(x) H(Y|X = x)


Back to the credit risk example

[Credit-risk table ("Predicting credit risk") repeated from the earlier slide]

H(Y|X) ≡ − Σ_x P(x) Σ_y P(y|x) log2 P(y|x) = Σ_x P(x) H(Y|X = x)

H(defaulted | missed = N) = − 6/7 log2 6/7 − 1/7 log2 1/7 = 0.5917
H(defaulted | missed = Y) = − 1/3 log2 1/3 − 2/3 log2 2/3 = 0.9183
H(defaulted | missed) = 7/10 · 0.5917 + 3/10 · 0.9183 = 0.6897

H(defaulted | <2 years = N) = − 4/6 log2 4/6 − 2/6 log2 2/6 = 0.9183
H(defaulted | <2 years = Y) = − 3/4 log2 3/4 − 1/4 log2 1/4 = 0.8113
H(defaulted | <2 years) = 6/10 · 0.9183 + 4/10 · 0.8113 = 0.8755
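A quick numerical check of these values (my own sketch, not lecture code), computed directly from the ten-row table:

```python
import math
from collections import Counter

rows = [  # (<2 years?, missed payments?, defaulted?)
    ("N", "N", "N"), ("Y", "N", "Y"), ("N", "N", "N"), ("N", "N", "N"),
    ("N", "Y", "Y"), ("Y", "N", "N"), ("N", "Y", "N"), ("N", "Y", "Y"),
    ("Y", "N", "N"), ("Y", "N", "N"),
]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def cond_entropy(attr_index):
    """H(defaulted | attribute) = sum_x P(x) H(defaulted | attribute = x)."""
    n = len(rows)
    total = 0.0
    for value in ("N", "Y"):
        subset = [r[2] for r in rows if r[attr_index] == value]
        total += len(subset) / n * entropy(subset)
    return total

print(entropy([r[2] for r in rows]))   # H(defaulted)            ~ 0.8813
print(cond_entropy(1))                 # H(defaulted | missed)   ~ 0.6897
print(cond_entropy(0))                 # H(defaulted | <2 years) ~ 0.8755
```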


Mutual information

  • We now have the entropy: the minimal number of bits required to specify the target attribute.
  • The conditional entropy: the remaining entropy of Y after observing X.
  • So we can now define the reduction in the entropy of Y obtained by observing X.
  • This is known as the mutual information between Y and X.

I(Y; X) = H(Y) − H(Y|X)

H(Y) = − Σ_y P(y) log2 P(y)
H(Y|X) = − Σ_x P(x) Σ_y P(y|x) log2 P(y|x)


Properties of mutual information

  • Mutual information is symmetric: I(Y; X) = I(X; Y)
  • In terms of probability distributions, it is written as
    I(X; Y) = Σ_{x,y} P(x, y) log2 [ P(x, y) / (P(x) P(y)) ]
  • It is zero if Y provides no information about X:
    I(X; Y) = 0 ⇔ P(x) and P(y) are independent
  • If Y = X, then I(X; X) = H(X) − H(X|X) = H(X)


Information gain

[Decision tree with good/bad counts repeated from the earlier slide]

‘Missed payments’ is the most informative attribute about defaulting.

H(defaulted) − H(defaulted | <2 years) = 0.8813 − 0.8755 = 0.0058
H(defaulted) − H(defaulted | missed) = 0.8813 − 0.6897 = 0.1916
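Putting the pieces together (again my own sketch, not lecture code): the information gain of each attribute is just H(defaulted) minus the conditional entropies computed above. The helper name is invented for the example.

```python
import math

def H(*probs):
    """Entropy of a discrete distribution given as probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

H_def = H(3/10, 7/10)                                          # H(defaulted) = 0.8813
H_def_given_missed = 7/10 * H(1/7, 6/7) + 3/10 * H(2/3, 1/3)   # 0.6897
H_def_given_2yrs = 6/10 * H(2/6, 4/6) + 4/10 * H(1/4, 3/4)     # 0.8755

print("gain(missed payments) =", H_def - H_def_given_missed)   # ~0.1916
print("gain(<2 years at job) =", H_def - H_def_given_2yrs)     # ~0.0058
```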


Example (from Andrew Moore): Predicting miles per gallon

mpg   cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
good  4          low           low         low     high          75to78     asia
bad   6          medium        medium      medium  medium        70to74     america
bad   4          medium        medium      medium  low           75to78     europe
bad   8          high          high        high    low           70to74     america
bad   6          medium        medium      medium  medium        70to74     america
bad   4          low           medium      low     medium        70to74     asia
bad   4          low           medium      low     low           70to74     asia
bad   8          high          high        high    low           75to78     america
:     :          :             :           :       :             :          :
bad   8          high          high        high    low           70to74     america
good  8          high          medium      high    high          79to83     america
bad   8          high          high        high    low           75to78     america
good  4          low           low         low     low           79to83     america
bad   6          medium        medium      medium  high          75to78     america
good  4          medium        low         low     low           79to83     america
good  4          low           low         medium  high          79to83     america
bad   8          high          high        high    low           70to74     america
good  4          low           medium      low     medium        75to78     europe
bad   5          medium        medium      medium  medium        75to78     europe

http://www.autonlab.org/tutorials/dtree.html


First step: calculate information gains

  • Compute the information gain for each attribute
  • In this case, cylinders provides the most gain, because it nearly partitions the data.


First decision: partition on cylinders


Note the lopsided mpg class distribution.


Recurse on child nodes to expand tree

Recursion Step: take the original dataset and partition it according to the value of the attribute we split on.

[Partitions: records in which cylinders = 4, cylinders = 5, cylinders = 6, cylinders = 8]


Expanding the tree: data is partitioned for each child

Recursion Step: build a tree from each partition (records in which cylinders = 4, 5, 6, or 8).

Exactly the same procedure, but with smaller, conditioned datasets.


Second level of decisions

[Figure: second level of the tree]

Recursively build a tree from the seven records in which there are four cylinders and the maker was based in Asia. (Similar recursion in the other cases.)

Why don’t we expand these nodes?


The final tree

[Figure: the final decision tree for the mpg data]