Supervised Learning
© D. Poole and A. Mackworth 2008, Artificial Intelligence, Lecture 7.2

1. Supervised Learning
Given:
- a set of input features X_1, ..., X_n
- a set of target features Y_1, ..., Y_k
- a set of training examples, where the values for both the input features and the target features are given for each example
- a new example, where only the values for the input features are given
predict the values for the target features for the new example.
- classification: when the Y_i are discrete
- regression: when the Y_i are continuous

2. Evaluating Predictions
Suppose F is a feature and e is an example:
- val(e, F) is the value of feature F for example e.
- pval(e, F) is the predicted value of feature F for example e.
The error of the prediction is a measure of how close pval(e, Y) is to val(e, Y).
There are many possible errors that could be measured.

3. Example Data Representations
A travel agent wants to predict the preferred length of a trip, which can be from 1 to 6 days. (No input features.)
Two representations of the same data (each Y_i is an indicator variable):

Example   Y        Example   Y_1  Y_2  Y_3  Y_4  Y_5  Y_6
e_1       1        e_1        1    0    0    0    0    0
e_2       6        e_2        0    0    0    0    0    1
e_3       6        e_3        0    0    0    0    0    1
e_4       2        e_4        0    1    0    0    0    0
e_5       1        e_5        1    0    0    0    0    0

What is a prediction?
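As a sketch of the second representation, the following Python snippet converts the scalar target Y into the six indicator variables Y_1, ..., Y_6. The names trip_lengths and to_indicator are illustrative, not from the slides.

```python
# Illustrative sketch: the trip-length data from the slide, and its
# indicator-variable representation of the same target.
trip_lengths = {"e1": 1, "e2": 6, "e3": 6, "e4": 2, "e5": 1}

def to_indicator(y, num_values=6):
    """Return [Y_1, ..., Y_6] where Y_i = 1 iff y == i."""
    return [1 if y == i else 0 for i in range(1, num_values + 1)]

for example, y in trip_lengths.items():
    print(example, y, to_indicator(y))
# e1 1 [1, 0, 0, 0, 0, 0]
# e2 6 [0, 0, 0, 0, 0, 1]
# ...
```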

4. Measures of error
E is the set of examples. O is the set of output features.
- absolute error:
  \sum_{e \in E} \sum_{Y \in O} | val(e, Y) - pval(e, Y) |
- sum-of-squares error:
  \sum_{e \in E} \sum_{Y \in O} ( val(e, Y) - pval(e, Y) )^2
- A cost-based error takes into account the costs of various errors.
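A minimal sketch of these two error measures in Python, assuming val and pval are supplied as dictionaries mapping (example, feature) pairs to values; that data-structure choice is an assumption, not part of the slides.

```python
# Sketch: absolute error and sum-of-squares error over examples E and
# output features O. The dict-based representation of val/pval is assumed.
def absolute_error(E, O, val, pval):
    return sum(abs(val[(e, Y)] - pval[(e, Y)]) for e in E for Y in O)

def sum_of_squares_error(E, O, val, pval):
    return sum((val[(e, Y)] - pval[(e, Y)]) ** 2 for e in E for Y in O)

# Tiny example: the trip lengths, with a constant prediction of 3.2 (their mean).
E = ["e1", "e2", "e3", "e4", "e5"]
O = ["Y"]
val = {("e1", "Y"): 1, ("e2", "Y"): 6, ("e3", "Y"): 6,
       ("e4", "Y"): 2, ("e5", "Y"): 1}
pval = {(e, "Y"): 3.2 for e in E}

print(absolute_error(E, O, val, pval))        # ~11.2
print(sum_of_squares_error(E, O, val, pval))  # ~26.8
```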

5. Measures of error (cont.)
When the output features are {0, 1}:
- likelihood of the data:
  \prod_{e \in E} \prod_{Y \in O} pval(e, Y)^{val(e, Y)} (1 - pval(e, Y))^{1 - val(e, Y)}
- entropy:
  -\sum_{e \in E} \sum_{Y \in O} [ val(e, Y) \log pval(e, Y) + (1 - val(e, Y)) \log(1 - pval(e, Y)) ]
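For {0, 1} targets, the likelihood and entropy measures can be sketched as follows, again assuming dict-based val/pval; math.log is the natural log here, though any fixed base gives the same ranking of predictors.

```python
import math

# Sketch: likelihood and entropy (negative log-likelihood) for {0,1} targets.
def likelihood(E, O, val, pval):
    prod = 1.0
    for e in E:
        for Y in O:
            p, v = pval[(e, Y)], val[(e, Y)]
            prod *= p ** v * (1 - p) ** (1 - v)
    return prod

def entropy_error(E, O, val, pval):
    # Note: a pval of exactly 0 or 1 makes the log blow up -- this is the
    # "zero means impossible, infinite cost" problem raised in a later slide.
    return -sum(val[(e, Y)] * math.log(pval[(e, Y)])
                + (1 - val[(e, Y)]) * math.log(1 - pval[(e, Y)])
                for e in E for Y in O)

# entropy_error is exactly -log(likelihood), so maximizing the likelihood
# and minimizing the entropy select the same prediction.
```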

6. Point Estimates
Suppose there is a single numerical feature, Y. Let E be the training examples.
- The prediction that minimizes the sum-of-squares error on E is the mean (average) value of Y.
- The value that minimizes the absolute error is the median value of Y.
- When Y has domain {0, 1}, the prediction that maximizes the likelihood is the empirical probability.
- When Y has domain {0, 1}, the prediction that minimizes the entropy is the empirical probability.
But that doesn't mean that these predictions minimize the error for future predictions.
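The following sketch checks the first two claims numerically on the trip-length data from the earlier slide, by scanning candidate point predictions. It is a brute-force check, not the analytical argument.

```python
# Sketch: brute-force check that the mean minimizes sum-of-squares error
# and the median minimizes absolute error for a single feature Y.
ys = [1, 6, 6, 2, 1]                    # trip lengths from the example data

def sse(p):
    return sum((y - p) ** 2 for y in ys)

def abs_err(p):
    return sum(abs(y - p) for y in ys)

candidates = [i / 100 for i in range(100, 601)]   # 1.00, 1.01, ..., 6.00
best_sse = min(candidates, key=sse)
best_abs = min(candidates, key=abs_err)

print(best_sse, sum(ys) / len(ys))         # both 3.2 (the mean)
print(best_abs, sorted(ys)[len(ys) // 2])  # both 2 (the median)
```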

7. Training and Test Sets
To evaluate how well a learner will work on future predictions, we divide the examples into:
- training examples, which are used to train the learner
- test examples, which are used to evaluate the learner
These must be kept separate.
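A minimal sketch of such a split using Python's standard library; the 80/20 ratio and the random seed are arbitrary choices, not from the slides.

```python
import random

# Sketch: hold out a test set so the learner is never evaluated on the
# same examples it was trained on.
examples = list(range(100))          # stand-in for real examples
random.seed(0)                       # for a repeatable split
random.shuffle(examples)

split = int(0.8 * len(examples))
training_examples = examples[:split]   # used to fit the predictor
test_examples = examples[split:]       # used only for evaluation

print(len(training_examples), len(test_examples))   # 80 20
```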

8. Learning Probabilities
Empirical probabilities do not make good predictors when evaluated by likelihood or entropy. Why?
A probability of zero means "impossible" and has infinite cost.
Solution: add (non-negative) pseudo-counts to the data. Suppose n_i is the number of examples with X = v_i, and c_i is the pseudo-count:

  P(X = v_i) = \frac{c_i + n_i}{\sum_{i'} (c_{i'} + n_{i'})}

Pseudo-counts convey prior knowledge. Consider: "how much more would I believe v_i if I had seen one example with v_i true than if I had seen no examples with v_i true?"
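A minimal sketch of this estimate, assuming the counts n_i and pseudo-counts c_i are supplied as parallel lists; setting every c_i = 1 gives the familiar Laplace smoothing.

```python
# Sketch: probability estimate with pseudo-counts,
#   P(X = v_i) = (c_i + n_i) / sum over i' of (c_i' + n_i')
def pseudo_count_estimate(counts, pseudo_counts):
    totals = [c + n for c, n in zip(pseudo_counts, counts)]
    denom = sum(totals)
    return [t / denom for t in totals]

# Observed counts for values v1..v3, with no examples of v3 seen yet.
counts = [3, 2, 0]
print(pseudo_count_estimate(counts, [0, 0, 0]))  # empirical: [0.6, 0.4, 0.0]
print(pseudo_count_estimate(counts, [1, 1, 1]))  # c_i = 1:   [0.5, 0.375, 0.125]
```

With zero pseudo-counts the estimate for v3 is exactly 0, which is what makes the empirical probabilities score so badly under likelihood or entropy; any positive pseudo-count keeps every estimate strictly between 0 and 1.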
