SLIDE 1

Learning Objectives

At the end of the class you should be able to:
- show an example of decision-tree learning
- explain how to avoid overfitting in decision-tree learning
- explain the relationship between linear and logistic regression
- explain how overfitting can be avoided

© D. Poole and A. Mackworth 2010, Artificial Intelligence, Lecture 7.3

SLIDE 2

Basic Models for Supervised Learning

Many learning algorithms can be seen as deriving from:
- decision trees
- linear (and non-linear) classifiers
- Bayesian classifiers

SLIDE 3

Learning Decision Trees

- Representation: a decision tree.
- Bias: towards simple decision trees.
- Search: through the space of decision trees, from simple decision trees to more complex ones.

SLIDE 4

Decision trees

A (binary) decision tree (for a particular output feature) is a tree in which:
- each non-leaf node is labeled with a test (a function of the input features);
- the arcs out of a node are labeled with the values of the test;
- the leaves of the tree are labeled with a point prediction for the output feature.

SLIDE 5

Example Decision Trees

[Figure: two example decision trees. The first splits on Length (long → skips; short → split on Thread: new → reads; follow_up → split on Author: known → reads, unknown → skips). The second splits on Length only (long → skips; short → reads with probability 0.82).]

SLIDE 6

Equivalent Logic Program

skips ← long.
reads ← short ∧ new.
reads ← short ∧ follow_up ∧ known.
skips ← short ∧ follow_up ∧ unknown.

Or, with negation as failure:

reads ← short ∧ new.
reads ← short ∧ ∼new ∧ known.

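The tree and its logic-program translation can also be written as ordinary nested conditionals. A minimal sketch in Python (the function name and the value strings are illustrative, not from the slides):

```python
# The example decision tree from the slides, written as nested conditionals.
def predict_action(length, thread, author):
    """Return 'reads' or 'skips' for an article, mirroring the rules:
    skips <- long.  reads <- short & new.
    reads <- short & follow_up & known.  skips <- short & follow_up & unknown."""
    if length == "long":
        return "skips"
    # length == "short": split on Thread
    if thread == "new":
        return "reads"
    # thread == "follow_up": split on Author
    return "reads" if author == "known" else "skips"

print(predict_action("long", "new", "known"))         # skips
print(predict_action("short", "new", "unknown"))      # reads
print(predict_action("short", "follow_up", "known"))  # reads
```

Each root-to-leaf path of the tree corresponds to one rule of the logic program.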

SLIDE 7

Issues in decision-tree learning

Given some training examples, which decision tree should be generated?
- A decision tree can represent any discrete function of the input features, so you need a bias. For example, prefer the smallest tree. Least depth? Fewest nodes? Which trees are the best predictors of unseen data?
- How should you go about building a decision tree? The space of decision trees is too big for a systematic search for the smallest decision tree.

SLIDE 8

Searching for a Good Decision Tree

The input is a set of input features, a target feature, and a set of training examples. Either:

◮ stop and return a value for the target feature (or a distribution over target feature values), or

◮ choose a test (e.g., an input feature) to split on; for each value of the test, build a subtree for those examples with that value for the test.

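The build-a-subtree-per-value step above can be sketched recursively. A minimal illustration in Python, with a made-up dataset and a deliberately trivial test-selection rule (just take the next feature; a real learner would choose the split myopically):

```python
# Recursive decision-tree building, as described on this slide (a sketch).
from collections import Counter

def learn_tree(examples, features, target):
    """examples: list of dicts; features: list of feature names;
    target: name of the target feature. Returns a nested (split, branches)
    structure, or a point prediction (majority value of the target)."""
    values = [e[target] for e in examples]
    majority = Counter(values).most_common(1)[0][0]
    # Stop: no input features left, or all examples classified the same.
    if not features or len(set(values)) == 1:
        return majority
    split = features[0]  # stand-in for a real test-selection heuristic
    branches = {}
    for v in {e[split] for e in examples}:
        subset = [e for e in examples if e[split] == v]
        branches[v] = learn_tree(subset, [f for f in features if f != split], target)
    return (split, branches)

examples = [
    {"Length": "long", "Action": "skips"},
    {"Length": "short", "Action": "reads"},
]
print(learn_tree(examples, ["Length"], "Action"))
```

The returned structure mirrors the slide: a leaf is a point prediction, a non-leaf is a test plus one subtree per test value.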


SLIDE 11

Choices in implementing the algorithm

When to stop:

◮ no more input features
◮ all examples are classified the same
◮ too few examples to make an informative split

Which test to split on isn't defined. Often we use a myopic split: choose the single split that gives the smallest error. With multi-valued features, the test can be to split on all values, or to split the values into two halves. More complex tests are possible.

SLIDE 12

Example Classification Data

Training examples:

     Action  Author    Thread  Length  Where
e1   skips   known     new     long    home
e2   reads   unknown   new     short   work
e3   skips   unknown   old     long    work
e4   skips   known     old     long    home
e5   reads   known     new     short   home
e6   skips   known     old     long    work

New examples:

e7   ???     known     new     short   work
e8   ???     unknown   new     short   work

We want to classify the new examples on the feature Action, based on the examples' Author, Thread, Length, and Where.

SLIDE 13

Example: possible splits

[Figure: possible splits of the training examples (skips 9, reads 9 at the root).
Split on Length: long → skips 7, reads 0; short → skips 2, reads 9.
Split on Thread: new → skips 3, reads 7; old → skips 6, reads 2.]

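One concrete myopic criterion is the number of training errors made when each branch predicts its majority class. A small sketch (counts taken from the figure above; this error measure is one possible choice, not the only one):

```python
# Compare candidate splits by training errors when each branch
# predicts its majority class (counts from the slide's figure).
def split_errors(branches):
    """branches: list of (skips, reads) counts, one pair per branch value."""
    return sum(min(s, r) for s, r in branches)

length_split = [(7, 0), (2, 9)]   # long, short
thread_split = [(3, 7), (6, 2)]   # new, old

print(split_errors(length_split))  # 2
print(split_errors(thread_split))  # 5
```

Under this criterion, splitting on Length (2 errors) is preferred to splitting on Thread (5 errors).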


SLIDE 15

Handling Overfitting

This algorithm can overfit the data. This occurs when the tree fits noise and correlations in the training set that are not reflected in the data as a whole. To handle overfitting:

◮ restrict the splitting, and split only when the split is useful

◮ allow unrestricted splitting, and then prune the resulting tree where it makes unwarranted distinctions

◮ learn multiple trees and average them

SLIDE 16

Linear Function

A linear function of features X1, …, Xn is a function of the form:

f^w(X1, …, Xn) = w0 + w1 X1 + ⋯ + wn Xn

We invent a new feature X0 whose value is always 1, so that w0 is not a special case:

f^w(X1, …, Xn) = Σ_{i=0}^{n} wi Xi

SLIDE 17

Linear Regression

Aim: predict feature Y from features X1, …, Xn. A feature is a function of an example: Xi(e) is the value of feature Xi on example e. Linear regression: predict a linear function of the input features:

Ŷ^w(e) = w0 + w1 X1(e) + ⋯ + wn Xn(e)
       = Σ_{i=0}^{n} wi Xi(e)

where Ŷ^w(e) is the predicted value for Y on example e. It depends on the weights w.

SLIDE 18

Sum of squares error for linear regression

The sum-of-squares error on examples E for output Y is:

Error_E(w) = Σ_{e∈E} (Y(e) − Ŷ^w(e))²
           = Σ_{e∈E} (Y(e) − Σ_{i=0}^{n} wi Xi(e))²

Goal: find weights that minimize Error_E(w).



SLIDE 20

Finding weights that minimize ErrorE(w)

Find the minimum analytically. Effective when it can be done (e.g., for linear regression).

Find the minimum iteratively. Works for larger classes of problems. Gradient descent:

wi ← wi − η ∂Error_E(w)/∂wi

where η, the gradient descent step size, is the learning rate.

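A small sketch of the iterative option: gradient descent on the sum-of-squares error for linear regression. The dataset, learning rate, and step count are illustrative assumptions:

```python
# Gradient descent for linear regression (per-example updates).
def gradient_descent(examples, n_features, eta=0.01, steps=2000):
    """examples: list of (x, y) with x a list of n_features values.
    An implicit X0 = 1 supplies the intercept weight w0."""
    w = [0.0] * (n_features + 1)
    for _ in range(steps):
        for x, y in examples:
            xs = [1.0] + list(x)                       # prepend X0 = 1
            pred = sum(wi * xi for wi, xi in zip(w, xs))
            delta = y - pred
            # wi <- wi - eta * dError/dwi, where dError/dwi = -2 * delta * Xi(e)
            for i in range(len(w)):
                w[i] += eta * 2 * delta * xs[i]
    return w

# Noise-free data from y = 3 + 2*x; the weights are recovered.
data = [([0.0], 3.0), ([1.0], 5.0), ([2.0], 7.0), ([3.0], 9.0)]
w = gradient_descent(data, n_features=1)
print([round(wi, 2) for wi in w])  # close to [3.0, 2.0]
```

For linear regression the analytic solution exists, but the same loop works whenever the error is differentiable in the weights.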


SLIDE 23

Linear Classifier

Assume we are doing binary classification, with classes {0, 1} (e.g., using indicator functions). There is no point in making a prediction of less than 0 or greater than 1. A squashed linear function is of the form:

f^w(X1, …, Xn) = f(w0 + w1 X1 + ⋯ + wn Xn)

where f is an activation function. A simple activation function is the step function:

f(x) = 1 if x ≥ 0; f(x) = 0 if x < 0

SLIDE 24

Error for Squashed Linear Function

The sum-of-squares error is:

Error_E(w) = Σ_{e∈E} (Y(e) − f(Σ_i wi Xi(e)))²

If f is differentiable, we can do gradient descent.



SLIDE 26

The sigmoid or logistic activation function

[Figure: the sigmoid function plotted for x from −10 to 10, rising from 0 to 1.]

f(x) = 1 / (1 + e^(−x))
f′(x) = f(x)(1 − f(x))

A logistic function is the sigmoid of a linear function. Logistic regression: find weights to minimise the error of a logistic function.

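The derivative identity f′(x) = f(x)(1 − f(x)) is easy to check numerically. A small sketch (the test point and tolerance are arbitrary choices):

```python
import math

# The sigmoid and its derivative identity, checked by central differences.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    return sigmoid(x) * (1.0 - sigmoid(x))

x = 0.7
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)  # central difference

print(sigmoid(0))                              # 0.5
print(abs(numeric - sigmoid_grad(x)) < 1e-8)   # True
```

This identity is what makes the p(1 − p) factor appear in the gradient-descent update on the next slide.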

SLIDE 27

Gradient Descent for Logistic Regression

1: procedure LogisticRegression(X, Y, E, η)
2:   X: set of input features, X = {X1, …, Xn}
3:   Y: output feature
4:   E: set of examples
5:   η: learning rate
6:   initialize w0, …, wn randomly
7:   repeat
8:     for each example e in E do
9:       p ← f(Σi wi Xi(e))
10:      δ ← Y(e) − p
11:      for each i ∈ [0, n] do
12:        wi ← wi + η δ p (1 − p) Xi(e)
13:  until some stopping criterion is true
14:  return w0, …, wn

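The pseudocode above translates almost line-for-line into Python. A sketch with a made-up toy dataset (learning "Y is X1 or X2"); the learning rate, epoch count, and random seed are assumptions:

```python
import math
import random

# Direct translation of the LogisticRegression pseudocode.
def logistic_regression(examples, n, eta=0.5, epochs=5000):
    """examples: list of (x, y), x a list of n feature values, y in {0, 1}.
    An implicit X0 = 1 carries the bias weight w0."""
    random.seed(0)
    w = [random.uniform(-0.1, 0.1) for _ in range(n + 1)]  # initialize randomly
    for _ in range(epochs):                  # "until some stopping criterion"
        for x, y in examples:
            xs = [1.0] + list(x)
            p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, xs))))
            delta = y - p
            for i in range(len(w)):          # wi <- wi + eta*delta*p*(1-p)*Xi(e)
                w[i] += eta * delta * p * (1 - p) * xs[i]
    return w

data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
w = logistic_regression(data, n=2)
preds = [1.0 / (1.0 + math.exp(-(w[0] + w[1] * a + w[2] * b))) for (a, b), _ in data]
print([round(p) for p in preds])  # [0, 1, 1, 1]
```

A fixed epoch count stands in for the unspecified stopping criterion; in practice one would monitor the error.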


SLIDE 30

Simple Example

[Figure: a logistic function of the features new, short, and home, predicting reads, with weights w0 = 0.4, w(new) = −0.7, w(short) = −0.9, w(home) = 1.2.]

Ex   new  short  home   Predicted        Obs (reads)   δ       error
e1   0    0      0      f(0.4) = 0.6     0             −0.6    0.36
e2   1    1      0      f(−1.2) = 0.23   0             −0.23   0.053
e3   1    0      1      f(0.9) = 0.71    1             0.29    0.084

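The table's numbers can be reproduced directly from the weights. A small check (weights and feature values as read off the example above):

```python
import math

# Reproducing the worked example: p = f(w0 + w·features), delta = obs - p.
def f(x):
    return 1.0 / (1.0 + math.exp(-x))

w0, w_new, w_short, w_home = 0.4, -0.7, -0.9, 1.2

examples = [  # (new, short, home, observed reads)
    (0, 0, 0, 0),  # e1
    (1, 1, 0, 0),  # e2
    (1, 0, 1, 1),  # e3
]
for new, short, home, obs in examples:
    p = f(w0 + w_new * new + w_short * short + w_home * home)
    delta = obs - p
    print(round(p, 2), round(delta, 2))
# prints: 0.6 -0.6 / 0.23 -0.23 / 0.71 0.29
```

The squared deltas give the per-example errors in the table (0.36, 0.053, 0.084).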


SLIDE 33

Linearly Separable

A classification is linearly separable if there is a hyperplane where the classification is true on one side of the hyperplane and false on the other side. For the sigmoid function, the hyperplane is where:

w0 + w1 X1 + ⋯ + wn Xn = 0

This separates the predictions > 0.5 from those < 0.5. Linearly separable implies the error can be made arbitrarily small.

[Figure: three 2-D plots of Boolean functions: "or" and "and" are linearly separable; "xor" is not.]

Kernel Trick: use functions of the input features (e.g., their product).

SLIDE 34

Variants in Linear Separators

Which linear separator to use can result in various algorithms:
- Perceptron
- Logistic Regression
- Support Vector Machines (SVMs)
- …



SLIDE 36

Bias in linear classifiers and decision trees

It's easy for a logistic function to represent "at least two of X1, …, Xk are true":

w0 = −15,  w1 = ⋯ = wk = 10

This concept forms a large decision tree. Conversely, consider representing the conditional "if X7 then X2 else X3":

◮ simple in a decision tree
◮ complicated (possible?) for a linear separator
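The "at least two of X1, …, Xk" weights can be checked directly. A small sketch with k = 4 (the choice of k is arbitrary):

```python
import math

# With w0 = -15 and each wi = 10, the weighted sum is -15 + 10 * (# true),
# which is positive exactly when at least two inputs are true.
def at_least_two(xs, w0=-15.0, wi=10.0):
    s = w0 + wi * sum(xs)
    return 1.0 / (1.0 + math.exp(-s)) > 0.5

print(at_least_two([1, 0, 0, 0]))  # False
print(at_least_two([1, 1, 0, 0]))  # True
print(at_least_two([1, 1, 1, 1]))  # True
```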

SLIDE 37

Bayesian classifiers

Idea: if you knew the classification, you could predict the values of the features.

P(Class | X1 … Xn) ∝ P(X1, …, Xn | Class) P(Class)

Naive Bayesian classifier: the Xi are independent of each other given the class. Requires P(Class) and P(Xi | Class) for each Xi:

P(Class | X1 … Xn) ∝ Πi P(Xi | Class) P(Class)

[Figure: belief network with UserAction as the parent of Author, Thread, Length, and Where.]



SLIDE 39

Learning Probabilities

X1   X2   X3   X4   C   Count
…    …    …    …    …   …
t    f    t    t    1   40
t    f    t    t    2   10
t    f    t    t    3   50
…    …    …    …    …   …

→ [belief network with C as the parent of X1, X2, X3, X4]

P(C = vi) = Σ_{t ⊨ C=vi} Count(t) / Σ_t Count(t)

P(Xk = vj | C = vi) = Σ_{t ⊨ C=vi ∧ Xk=vj} Count(t) / Σ_{t ⊨ C=vi} Count(t)

…perhaps including pseudo-counts.

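The two count-ratio formulas are straightforward to compute. A sketch using the table's three visible rows (the pseudo-count handling shown is one common convention, an assumption here):

```python
# Estimating P(C) and P(Xk | C) from counts, following the formulas above.
rows = [  # (X1, X2, X3, X4, C, Count), mirroring the slide's table
    ("t", "f", "t", "t", 1, 40),
    ("t", "f", "t", "t", 2, 10),
    ("t", "f", "t", "t", 3, 50),
]

def p_class(c, rows, pseudo=0.0, n_classes=3):
    """P(C = c), optionally with a pseudo-count added per class."""
    num = sum(cnt for *_, ci, cnt in rows if ci == c) + pseudo
    den = sum(cnt for *_, cnt in rows) + pseudo * n_classes
    return num / den

def p_x_given_class(k, vj, c, rows):
    """P(Xk = vj | C = c) as a ratio of counts (no pseudo-counts here)."""
    num = sum(row[5] for row in rows if row[4] == c and row[k - 1] == vj)
    den = sum(row[5] for row in rows if row[4] == c)
    return num / den

print(p_class(1, rows))                  # 0.4  (= 40 / 100)
print(p_class(1, rows, pseudo=1.0))      # 41 / 103
print(p_x_given_class(2, "f", 1, rows))  # 1.0
```

Pseudo-counts keep estimated probabilities away from 0 and 1 when the data is sparse.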

SLIDE 40

Help System

[Figure: naive Bayes network with H as the parent of word nodes "able", "absent", "add", …, "zoom".]

The domain of H is the set of all help pages. The observations are the words in the query. What probabilities are needed? What pseudo-counts and counts are used? What data can be used to learn from?
