Supervised Learning

©2008 D. Poole and A. Mackworth, Artificial Intelligence, Lecture 7.2


SLIDES 1-2

Supervised Learning

Given:

  • a set of input features X1, ..., Xn
  • a set of target features Y1, ..., Yk
  • a set of training examples, where the values for the input features and the target features are given for each example
  • a new example, where only the values for the input features are given,

predict the values for the target features for the new example.

  • classification: when the Yi are discrete
  • regression: when the Yi are continuous
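A minimal sketch of this setup (the feature names and values are illustrative, not from the slides): a training set pairs input-feature values with target-feature values, while a new example supplies only the inputs.

```python
# Toy training examples: input features X1, X2 and one target feature Y.
# A discrete Y makes this classification; a continuous Y would make it
# regression.
train = [
    {"X1": 0, "X2": 1, "Y": "yes"},
    {"X1": 1, "X2": 1, "Y": "no"},
    {"X1": 1, "X2": 0, "Y": "yes"},
]
new_example = {"X1": 0, "X2": 0}  # Y is unknown and must be predicted
```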

SLIDE 3

Evaluating Predictions

Suppose Y is a feature and e is an example:

  • val(e, Y) is the value of feature Y for example e.
  • pval(e, Y) is the predicted value of feature Y for example e.

The error of the prediction is a measure of how close pval(e, Y) is to val(e, Y). There are many possible errors that could be measured.

SLIDE 4

Example Data Representations

A travel agent wants to predict the preferred length of a trip, which can be from 1 to 6 days. (No input features.) Two representations of the same data, where each Yi is an indicator variable (Yi = 1 when Y = i):

  Example  Y
  e1       1
  e2       6
  e3       6
  e4       2
  e5       1

  Example  Y1  Y2  Y3  Y4  Y5  Y6
  e1        1   0   0   0   0   0
  e2        0   0   0   0   0   1
  e3        0   0   0   0   0   1
  e4        0   1   0   0   0   0
  e5        1   0   0   0   0   0

What is a prediction?
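A minimal sketch of the conversion between the two representations above, assuming only what the tables show:

```python
# Convert the single-feature representation (Y in 1..6) into the
# indicator-variable representation (Y1..Y6).
examples = {"e1": 1, "e2": 6, "e3": 6, "e4": 2, "e5": 1}

def to_indicators(y, k=6):
    """Return [Y1, ..., Yk] with a 1 in position y and 0 elsewhere."""
    return [1 if i == y else 0 for i in range(1, k + 1)]

for name, y in examples.items():
    print(name, to_indicators(y))   # e.g. e2 -> [0, 0, 0, 0, 0, 1]
```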

SLIDES 5-7

Measures of error

E is the set of examples. O is the set of output features.

  • absolute error: $\sum_{e \in E} \sum_{Y \in O} |val(e, Y) - pval(e, Y)|$

  • sum-of-squares error: $\sum_{e \in E} \sum_{Y \in O} (val(e, Y) - pval(e, Y))^2$

  • A cost-based error takes into account the costs of various errors.
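A minimal sketch of the first two measures for a single output feature (val and pval are parallel lists; the numbers reuse the travel-agent data):

```python
# Absolute error and sum-of-squares error for one output feature.
def absolute_error(val, pval):
    return sum(abs(v - p) for v, p in zip(val, pval))

def sum_of_squares_error(val, pval):
    return sum((v - p) ** 2 for v, p in zip(val, pval))

val  = [1, 6, 6, 2, 1]   # actual values from the travel-agent example
pred = [3.2] * 5         # one constant prediction for every example
print(absolute_error(val, pred))        # 11.2 (up to rounding)
print(sum_of_squares_error(val, pred))  # 26.8 (up to rounding)
```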

SLIDES 8-9

Measures of error (cont.)

When the output features are {0, 1}:

  • likelihood of the data: $\prod_{e \in E} \prod_{Y \in O} pval(e, Y)^{val(e, Y)} (1 - pval(e, Y))^{1 - val(e, Y)}$

  • entropy: $-\sum_{e \in E} \sum_{Y \in O} [val(e, Y) \log pval(e, Y) + (1 - val(e, Y)) \log(1 - pval(e, Y))]$

The entropy is the negative logarithm of the likelihood, so maximizing one is equivalent to minimizing the other.
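A minimal sketch of both measures for a single {0, 1} output feature (the numbers are illustrative; `math.prod` needs Python 3.8+):

```python
import math

# val holds observed 0/1 values; pval holds predicted probabilities.
def likelihood(val, pval):
    return math.prod(p if v == 1 else 1 - p for v, p in zip(val, pval))

def entropy(val, pval):
    return -sum(v * math.log(p) + (1 - v) * math.log(1 - p)
                for v, p in zip(val, pval))

val  = [1, 0, 1, 1]
pred = [0.8, 0.3, 0.9, 0.6]
print(likelihood(val, pred))   # 0.3024
print(entropy(val, pred))      # ~1.196, equal to -log(likelihood)
```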

SLIDES 10-14

Point Estimates

Suppose there is a single numerical feature, Y. Let E be the training examples.

  • The prediction that minimizes the sum-of-squares error on E is the mean (average) value of Y.
  • The value that minimizes the absolute error is the median value of Y.
  • When Y has domain {0, 1}, the prediction that maximizes the likelihood is the empirical probability.
  • When Y has domain {0, 1}, the prediction that minimizes the entropy is the empirical probability.

But that doesn't mean that these predictions minimize the error for future predictions.
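A small numeric check of the first two claims on the travel-agent data: moving away from each optimum only increases the corresponding error.

```python
import statistics

ys = [1, 6, 6, 2, 1]          # travel-agent data; mean 3.2, median 2

def sse(p):                   # sum-of-squares error of constant prediction p
    return sum((y - p) ** 2 for y in ys)

def abs_err(p):               # absolute error of constant prediction p
    return sum(abs(y - p) for y in ys)

m = statistics.mean(ys)
print(sse(m), sse(m - 0.5), sse(m + 0.5))   # 26.8 < 28.05 either way
md = statistics.median(ys)
print(abs_err(md), abs_err(1), abs_err(3))  # 10 < 11 either way
```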

SLIDE 15

Training and Test Sets

To evaluate how well a learner will work on future predictions, we divide the examples into:

  • training examples, used to train the learner
  • test examples, used to evaluate the learner

These must be kept separate.
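A minimal sketch of such a split; the 80/20 ratio and the stand-in data are illustrative choices, not from the slides:

```python
import random

examples = [{"X1": i, "Y": i % 2} for i in range(100)]  # stand-in data
random.shuffle(examples)
split = int(0.8 * len(examples))
train, test = examples[:split], examples[split:]
# Fit the learner using only `train`; report its error only on `test`.
```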

SLIDES 16-18

Learning Probabilities

Empirical probabilities do not make good predictors when evaluated by likelihood or entropy. Why? A probability of zero means “impossible” and has infinite cost.

Solution: add (non-negative) pseudo-counts to the data. Suppose ni is the number of examples with X = vi, and ci is the pseudo-count:

$$P(X = v_i) = \frac{c_i + n_i}{\sum_{i'} (c_{i'} + n_{i'})}$$

Pseudo-counts convey prior knowledge. Consider: “how much more would I believe vi if I had seen one example with vi true than if I had seen no examples with vi true?”
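A minimal sketch of the pseudo-count estimate above; the data and the uniform choice ci = 1 (Laplace smoothing) are illustrative:

```python
from collections import Counter

def learn_probabilities(observations, domain, c=1):
    """P(X = v) = (c + n_v) / sum over v' of (c + n_v'), per the formula above."""
    counts = Counter(observations)
    total = sum(c + counts[v] for v in domain)
    return {v: (c + counts[v]) / total for v in domain}

obs = ["a", "a", "b"]                       # "c" is never observed
print(learn_probabilities(obs, ["a", "b", "c"]))
# {'a': 0.5, 'b': 0.333..., 'c': 0.166...}: the unseen value keeps a
# nonzero probability, so the likelihood of future data never hits zero.
```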