Chapter 18: Learning from Observations, Sec. 1-3 (20070607)



Chapter18

Learning from Observations

  • Sec. 1 - 3


Learning

  • Essential for unknown environments, i.e. when the designer lacks omniscience.
  • Learning modifies the agent's decision mechanisms to improve performance.


Learning Agents

  • Performance element: decides what actions to take.
  • Learning element: modifies the performance element so that it makes better decisions.


Inductive Learning

  • The learner is given the correct value of the unknown function for particular inputs, and must try to recover the unknown function or something close to it.
  • Pure inductive inference (or induction): given a collection of examples of f, return a function h that approximates f.

An example is a pair (x, f(x)); h is the hypothesis function, f is the target function.

  • It is not easy to tell whether any particular h is a good approximation of f.


Inductive Learning Method

  • Construct/adjust h to agree with f on the training set.

(h is consistent if it agrees with f on all examples)
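The consistency check above can be sketched in a few lines. The function names and the toy target below are my own illustration, not from the slides:

```python
# Sketch (names and toy target are mine): a hypothesis h is consistent
# if it agrees with the target f on every training example (x, f(x)).

def is_consistent(h, examples):
    return all(h(x) == fx for x, fx in examples)

# Examples generated by an unknown target, here secretly f(x) = 2x + 1.
examples = [(0, 1), (1, 3), (2, 5)]

h1 = lambda x: 2 * x + 1   # agrees with f on every example: consistent
h2 = lambda x: x * x + 1   # agrees only at x = 0: not consistent

print(is_consistent(h1, examples))  # True
print(is_consistent(h2, examples))  # False
```

Note that many different hypotheses (e.g. higher-degree polynomials through the same points) can all be consistent; the following slides address how to choose among them.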


Inductive Learning Method (cont.-1)


Inductive Learning Method (cont.-2)


Inductive Learning Method (cont.-3)


Inductive Learning Method (cont.-4)


Inductive Learning Method (cont.-5)

  • How do we choose from among multiple consistent hypotheses?
  • Ockham's razor: prefer the simplest hypothesis consistent with the data.


Attribute-based Representations

Examples are described by attribute values (Boolean, discrete, continuous, etc.), e.g. situations where I will/won't wait for a table.


Learning Decision Trees

Problem: decide whether to wait for a table at a restaurant, based on the following attributes:

1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)
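One way to encode a training example in this attribute-based representation; the dict layout and the particular values are illustrative, not a row taken from the slides' dataset:

```python
# Illustrative only (values are made up, not from the slides' table):
# one attribute-based example as a dict, plus its classification.
example = {
    "Alternate": True, "Bar": False, "FriSat": False, "Hungry": True,
    "Patrons": "Some", "Price": "$$$", "Raining": False,
    "Reservation": True, "Type": "French", "WaitEstimate": "0-10",
}
will_wait = True   # the classification: will we wait for a table?
```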


Decision Trees


Expressiveness

  • Decision trees can express any function of the input attributes.
  • Trivially, there exists a consistent decision tree for any training set.
  • Prefer to find more compact decision trees.

Hypothesis Spaces

How many distinct decision trees with n Boolean attributes?

= number of Boolean functions
= number of distinct truth tables with 2^n rows
= 2^(2^n)

e.g. with 6 Boolean attributes, there are 2^64 = 18,446,744,073,709,551,616 trees.


Hypothesis Spaces (cont.)

How many purely conjunctive hypotheses? (e.g., Hungry ∧ ¬Rain)

  • Each attribute can be in (positive), in (negative), or out
⇒ 3^n distinct conjunctive hypotheses

  • A more expressive hypothesis space
  • increases the chance that the target function can be expressed
  • increases the number of hypotheses consistent with the training set
⇒ may get worse predictions
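Both counting arguments can be checked directly; the function names below are mine:

```python
# Sketch (function names are mine) checking the hypothesis-space counts
# above for n Boolean attributes.

def num_decision_trees(n):
    # one Boolean function per truth table; a table has 2**n rows,
    # each filled with 0 or 1, giving 2**(2**n) functions
    return 2 ** (2 ** n)

def num_conjunctive_hypotheses(n):
    # each attribute appears positive, negated, or not at all
    return 3 ** n

print(num_decision_trees(6))          # 18446744073709551616
print(num_conjunctive_hypotheses(6))  # 729
```

The contrast (2^64 trees vs. 729 conjunctions for n = 6) is the expressiveness trade-off the slide describes.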


Decision Tree Learning

Aim: find a small tree consistent with the training examples.
Idea: (recursively) choose the "most significant" attribute as the root of each (sub)tree.
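The recursive scheme can be sketched as follows, using information gain (defined later in these slides) as the significance measure. All names and the toy dataset are my own, not the slides':

```python
import math
from collections import Counter

# Sketch of recursive decision-tree learning (names and data are mine):
# pick the attribute with the largest information gain, split the
# examples on it, and recurse. An example is (attribute_dict, label).

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def gain(examples, attr):
    before = entropy([y for _, y in examples])
    remainder = 0.0
    for value in {x[attr] for x, _ in examples}:
        subset = [y for x, y in examples if x[attr] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return before - remainder

def dtl(examples, attrs):
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:       # all examples classified the same
        return labels[0]
    if not attrs:                   # attributes exhausted: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(examples, a))
    branches = {}
    for value in {x[best] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[best] == value]
        branches[value] = dtl(subset, [a for a in attrs if a != best])
    return (best, branches)

# Tiny dataset where Patrons alone predicts the label:
data = [
    ({"Patrons": "None", "Hungry": True}, False),
    ({"Patrons": "Some", "Hungry": True}, True),
    ({"Patrons": "Some", "Hungry": False}, True),
    ({"Patrons": "Full", "Hungry": False}, False),
]
tree = dtl(data, ["Patrons", "Hungry"])
print(tree[0])           # Patrons is chosen as the root
print(tree[1]["Some"])   # True
```

On this toy data, splitting on Patrons yields pure subsets (gain 1 bit) while Hungry yields none (gain 0), so Patrons becomes the root.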


Choosing An Attribute

Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative".
Patrons? is a better choice.


Using Information Theory

  • To implement Choose-Attribute in the DTL algorithm
  • Information content (entropy):

I(P(v_1), ..., P(v_n)) = Σ_{i=1..n} −P(v_i) log2 P(v_i)

  • For a training set containing p positive examples and n negative examples:

I(p/(p+n), n/(p+n)) = −(p/(p+n)) log2(p/(p+n)) − (n/(p+n)) log2(n/(p+n))


Information Gain

  • A chosen attribute A divides the training set E into subsets E_1, ..., E_v according to their values for A, where A has v distinct values.

remainder(A) = Σ_{i=1..v} ((p_i + n_i)/(p + n)) I(p_i/(p_i + n_i), n_i/(p_i + n_i))

  • Information gain (IG) = the reduction in entropy from the attribute test:

IG(A) = I(p/(p+n), n/(p+n)) − remainder(A)

  • Choose the attribute with the largest IG.


Information Gain (cont.)

  • For the training set, p = n = 6, so I(6/12, 6/12) = 1 bit.
  • Consider the attributes Patrons and Type (and others too):

IG(Patrons) = 1 − [ (2/12) I(0, 1) + (4/12) I(1, 0) + (6/12) I(2/6, 4/6) ] ≈ 0.541 bits
IG(Type) = 1 − [ (2/12) I(1/2, 1/2) + (2/12) I(1/2, 1/2) + (4/12) I(2/4, 2/4) + (4/12) I(2/4, 2/4) ] = 0 bits

  • Patrons has the highest IG of all attributes and so is chosen by the DTL algorithm as the root.
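These numbers can be verified with a short script; I(...) mirrors the slides' notation, everything else is my own:

```python
import math

# Checking the worked information-gain numbers (sketch; the helper
# name I follows the slides' I(.,.) notation, the rest is mine).

def I(*probs):
    """Information content, in bits, of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Patrons: None -> (0 pos, 2 neg), Some -> (4, 0), Full -> (2, 4)
ig_patrons = I(6/12, 6/12) - (2/12 * I(0, 1) + 4/12 * I(1, 0)
                              + 6/12 * I(2/6, 4/6))
# Type: French (1, 1), Italian (1, 1), Thai (2, 2), Burger (2, 2)
ig_type = I(6/12, 6/12) - (2/12 * I(1/2, 1/2) + 2/12 * I(1/2, 1/2)
                           + 4/12 * I(2/4, 2/4) + 4/12 * I(2/4, 2/4))

print(round(ig_patrons, 3))  # 0.541
print(round(ig_type, 3))     # 0.0
```

Type gains nothing because each restaurant type splits evenly into positives and negatives, leaving every subset at 1 bit of entropy.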


Example

  • Decision tree learned from the 12 examples:
  • Substantially simpler than the "true" tree: a more complex hypothesis isn't justified by the small amount of data.


Performance Measurement

  • How do we know that h ≈ f ?

1. Use theorems of computational/statistical learning theory
2. Try h on a new test set of examples (use the same distribution over the example space as the training set)

Learning curve = % correct on the test set as a function of training set size
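Measuring a learning curve amounts to training on increasingly large prefixes of the training data and scoring each resulting hypothesis on a held-out test set. The toy threshold learner and data below are my own illustration, not from the slides:

```python
# Sketch of a learning-curve measurement (toy setup of my own): train
# on growing prefixes of the training data, score on a held-out test
# set drawn from the same distribution.

def learn(train_set):
    """Crude 1-D learner: threshold halfway between the largest
    negative and the smallest positive training input."""
    pos = [x for x, y in train_set if y]
    neg = [x for x, y in train_set if not y]
    if not pos or not neg:
        return lambda x: bool(pos)   # only one class seen: constant guess
    t = (min(pos) + max(neg)) / 2
    return lambda x: x > t

# Target concept f(x) = (x > 50); evens for training, odds for testing.
train = [(x, x > 50) for x in range(0, 100, 2)]
test = [(x, x > 50) for x in range(1, 100, 2)]

for m in (5, 15, 30, 50):
    h = learn(train[:m])
    acc = sum(h(x) == y for x, y in test) / len(test)
    print(m, acc)   # accuracy climbs from 0.5 to 0.98 as m grows
```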


Summary

  • Learning is needed for unknown environments (and lazy designers).
  • Learning agent = performance element + learning element
  • For supervised learning, the aim is to find a simple hypothesis that is approximately consistent with the training examples.
  • Decision tree learning uses information gain.
  • Learning performance = prediction accuracy measured on a test set.