
Learning Objectives

At the end of the class you should be able to:
• identify a supervised learning problem
• characterize how the prediction is a function of the error measure
• avoid mixing the training and test sets

© D. Poole and A. Mackworth 2010, Artificial Intelligence, Lecture 7.2


Supervised Learning

Given:
• a set of input features X1, . . . , Xn
• a set of target features Y1, . . . , Yk
• a set of training examples, where the values for the input features and the target features are given for each example
• a new example, where only the values for the input features are given
predict the values for the target features for the new example.
This is called classification when the Yi are discrete, and regression when the Yi are continuous.


Example Data Representations

A travel agent wants to predict the preferred length of a trip, which can be from 1 to 6 days. (No input features.) Two representations of the same data:
• Y is the length of trip chosen.
• Each Yi is an indicator variable that has value 1 if the chosen length is i, and is 0 otherwise.

Example   Y
e1        1
e2        6
e3        6
e4        2
e5        1

Example   Y1   Y2   Y3   Y4   Y5   Y6
e1         1    0    0    0    0    0
e2         0    0    0    0    0    1
e3         0    0    0    0    0    1
e4         0    1    0    0    0    0
e5         1    0    0    0    0    0

What is a prediction?
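The two representations can be converted into one another mechanically. Below is a minimal Python sketch (not from the slides; the names are invented for illustration) that maps the single target Y to the indicator targets Y1, . . . , Y6:

```python
# Hypothetical illustration: the same data under both representations.
# 'lengths' holds Y (the chosen trip length) for each example.
lengths = {"e1": 1, "e2": 6, "e3": 6, "e4": 2, "e5": 1}

def to_indicators(y, num_values=6):
    """Return [Y1, ..., Y6]: 1 in position y, 0 everywhere else."""
    return [1 if i == y else 0 for i in range(1, num_values + 1)]

for example, y in lengths.items():
    print(example, y, to_indicators(y))
# e1 1 [1, 0, 0, 0, 0, 0]
# e2 6 [0, 0, 0, 0, 0, 1]
# ...
```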


Evaluating Predictions

Suppose we want to make a prediction of a value for a target feature on example e:
• oe is the observed value of the target feature on example e.
• pe is the predicted value of the target feature on example e.
The error of the prediction is a measure of how close pe is to oe. There are many possible errors that could be measured. Sometimes pe can be a real number even though oe can only have a few values.


Measures of error

E is the set of examples, with a single target feature. For e ∈ E, oe is the observed value and pe is the predicted value:

absolute error:        L1(E) = ∑_{e∈E} |oe − pe|

sum-of-squares error:  L2²(E) = ∑_{e∈E} (oe − pe)²

worst-case error:      L∞(E) = max_{e∈E} |oe − pe|

number wrong:          L0(E) = #{e : oe ≠ pe}

A cost-based error takes into account the costs of the different kinds of errors.
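As a hedged illustration (not part of the original slides), these four measures can be computed directly from a list of (observed, predicted) pairs:

```python
def errors(pairs):
    """pairs: list of (oe, pe) for the examples in E."""
    l1 = sum(abs(o - p) for o, p in pairs)        # absolute error
    l2_sq = sum((o - p) ** 2 for o, p in pairs)   # sum-of-squares error
    linf = max(abs(o - p) for o, p in pairs)      # worst-case error
    l0 = sum(1 for o, p in pairs if o != p)       # number wrong
    return l1, l2_sq, linf, l0

# Predicting 2.0 for the five trip lengths from the earlier example:
print(errors([(1, 2.0), (6, 2.0), (6, 2.0), (2, 2.0), (1, 2.0)]))
# (10.0, 34.0, 4.0, 4)
```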


Measures of error (cont.)

With a binary target feature, oe ∈ {0, 1}:

likelihood of the data:  ∏_{e∈E} pe^{oe} (1 − pe)^{1 − oe}

log-likelihood:          ∑_{e∈E} (oe log pe + (1 − oe) log(1 − pe))

The log-likelihood (with base-2 logarithms) is the negative of the number of bits needed to encode the data given a code based on pe.
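A minimal sketch (not from the slides) of both quantities for binary observations, assuming every pe is strictly between 0 and 1:

```python
import math

def likelihood(pairs):
    """pairs: list of (oe, pe) with oe in {0, 1} and 0 < pe < 1."""
    result = 1.0
    for o, p in pairs:
        result *= p ** o * (1 - p) ** (1 - o)
    return result

def log_likelihood(pairs):
    # Base-2 logs, so this is minus the number of bits needed to
    # encode the observations with a code based on pe.
    return sum(o * math.log2(p) + (1 - o) * math.log2(1 - p)
               for o, p in pairs)

pairs = [(1, 0.8), (0, 0.8), (1, 0.8)]
print(likelihood(pairs))       # 0.8 * 0.2 * 0.8 = 0.128
print(log_likelihood(pairs))   # log2(0.128), about -2.97
```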


Information theory overview

A bit is a binary digit.
1 bit can distinguish 2 items.
k bits can distinguish 2^k items.
n items can be distinguished using log2 n bits.
Can we do better?


Information and Probability

Consider a code to distinguish elements of {a, b, c, d} with
P(a) = 1/2, P(b) = 1/4, P(c) = 1/8, P(d) = 1/8.
Consider the code: a → 0, b → 10, c → 110, d → 111.
This code uses 1 to 3 bits. On average, it uses
P(a) × 1 + P(b) × 2 + P(c) × 3 + P(d) × 3 = 1/2 + 2/4 + 3/8 + 3/8 = 1 3/4 bits.
The string aacabbda has code 00110010101110.
The code 0111110010100 represents the string adcabba.
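A small Python sketch (not from the slides) of this prefix code reproduces both worked examples; because no codeword is a prefix of another, decoding can proceed greedily bit by bit:

```python
CODE = {"a": "0", "b": "10", "c": "110", "d": "111"}

def encode(s):
    return "".join(CODE[ch] for ch in s)

def decode(bits):
    inverse = {v: k for k, v in CODE.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in inverse:      # a complete codeword has been read
            out.append(inverse[buf])
            buf = ""
    return "".join(out)

print(encode("aacabbda"))       # 00110010101110
print(decode("0111110010100"))  # adcabba
```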


Information Content

To identify x, we need −log2 P(x) bits.
Given a distribution over a set, to identify a member, the expected number of bits is
∑_x −P(x) × log2 P(x).
This is the information content or entropy of the distribution.
The expected number of bits it takes to describe the distribution given evidence e is
I(e) = ∑_x −P(x|e) × log2 P(x|e).
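A minimal sketch (not from the slides) of the entropy computation; note that the distribution from the coding example above has entropy 1.75 bits, matching the average code length:

```python
import math

def entropy(probs):
    """Entropy in bits of a distribution given as a list of probabilities."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75
print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0
```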


Information Gain

Given a test that can distinguish the cases where α is true from the cases where α is false, the information gain from this test is:

I(true) − (P(α) × I(α) + P(¬α) × I(¬α))

I(true) is the expected number of bits needed before the test.
P(α) × I(α) + P(¬α) × I(¬α) is the expected number of bits after the test.
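As a hedged illustration (not from the slides), here is one common instantiation of this formula, as used in decision-tree learning: I is the entropy of the empirical distribution of a boolean target, before and after splitting the examples on a test. The data and the test below are invented for the example.

```python
import math

def entropy(probs):
    return sum(-p * math.log2(p) for p in probs if p > 0)

def distribution(labels):
    """Empirical distribution over the values appearing in labels."""
    return [labels.count(v) / len(labels) for v in set(labels)]

def information_gain(examples, test):
    """examples: list of (features, label); test: boolean function of features.
    Assumes the test splits the examples into two non-empty groups."""
    labels = [y for _, y in examples]
    pos = [y for x, y in examples if test(x)]
    neg = [y for x, y in examples if not test(x)]
    p = len(pos) / len(examples)
    before = entropy(distribution(labels))
    after = p * entropy(distribution(pos)) + (1 - p) * entropy(distribution(neg))
    return before - after

data = [({"long": True}, 1), ({"long": True}, 1),
        ({"long": False}, 0), ({"long": False}, 1)]
print(information_gain(data, lambda x: x["long"]))  # about 0.31
```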


Linear Predictions

[Figure: a small set of data points together with the linear predictions that minimize the L1, L2², and L∞ errors.]


Point Estimates

To make a single prediction for feature Y, with examples E:
• The prediction that minimizes the sum-of-squares error on E is the mean (average) value of Y.
• The prediction that minimizes the absolute error on E is the median value of Y.
• The prediction that minimizes the number wrong on E is the mode of Y.
• The prediction that minimizes the worst-case error on E is (maximum + minimum)/2.
• When Y has values {0, 1}, the prediction that maximizes the likelihood on E is the empirical probability.
• When Y has values {0, 1}, the prediction that minimizes the entropy on E is the empirical probability.
But that doesn’t mean that these predictions minimize the error for future predictions...
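These claims are easy to check numerically. The sketch below (not from the slides; it reuses the trip-length data from the earlier example) evaluates each candidate prediction under each error measure; each candidate wins exactly the column it is supposed to minimize.

```python
from statistics import mean, median

ys = [1, 6, 6, 2, 1]   # trip lengths from the earlier example

candidates = {
    "mean": mean(ys),                      # minimizes sum of squares
    "median": median(ys),                  # minimizes absolute error
    "mode": max(ys, key=ys.count),         # minimizes number wrong
    "midrange": (max(ys) + min(ys)) / 2,   # minimizes worst-case error
}

def sum_squares(p): return sum((y - p) ** 2 for y in ys)
def absolute(p):    return sum(abs(y - p) for y in ys)
def num_wrong(p):   return sum(1 for y in ys if y != p)
def worst_case(p):  return max(abs(y - p) for y in ys)

for name, p in candidates.items():
    print(name, p, sum_squares(p), absolute(p), num_wrong(p), worst_case(p))
```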


Training and Test Sets

To evaluate how well a learner will work on future predictions, we divide the examples into:
• training examples that are used to train the learner
• test examples that are used to evaluate the learner
...these must be kept separate.
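A minimal sketch (not from the slides) of one common way to keep them separate: shuffle the examples once, hold out a fraction for testing, and never show the held-out part to the learner until evaluation.

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=0):
    examples = list(examples)
    random.Random(seed).shuffle(examples)   # fixed seed for reproducibility
    n_test = int(len(examples) * test_fraction)
    return examples[n_test:], examples[:n_test]   # (training, test)

train, test = train_test_split(range(10))
print(len(train), len(test))   # 8 2
```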


Learning Probabilities

Empirical probabilities do not make good predictors of the test set when evaluated by likelihood or entropy. Why?
A probability of zero means “impossible” and has infinite cost if there is one true case in the test set.
Solution (Laplace smoothing): add (non-negative) pseudo-counts to the data.
Suppose ni is the number of examples with X = vi, and ci is the pseudo-count:

P(X = vi) = (ci + ni) / ∑_{i′} (ci′ + ni′)

Pseudo-counts convey prior knowledge. Consider: “how much more would I believe vi if I had seen one example with vi true than if I had seen no examples with vi true?”
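A minimal sketch (not from the slides) of the smoothed estimate, with invented counts: three values of X observed 0, 2 and 8 times, and a pseudo-count of 1 for each value.

```python
def smoothed_probabilities(counts, pseudo):
    """counts[i] is ni, pseudo[i] is ci; returns P(X = vi) for each i."""
    total = sum(c + n for c, n in zip(pseudo, counts))
    return [(c + n) / total for c, n in zip(pseudo, counts)]

print(smoothed_probabilities([0, 2, 8], [1, 1, 1]))
# [0.0769..., 0.2307..., 0.6923...]  (no value gets probability zero)
```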
