What is Machine Learning?
CS886 Fall 10 - Lecture 5, Sept 30, 2010


SLIDE 1

What is Machine Learning?

  • Definition:

– A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

[T Mitchell, 1997]

SLIDE 2

Inductive learning (aka concept learning)

  • Induction:

– Given a training set of examples of the form (x,f(x))

  • x is the input, f(x) is the output

– Return a function h that approximates f

  • h is called the hypothesis

SLIDE 3

Classification

  • Training set (attribute values = x, EnjoySport = f(x)):

Sky    Humidity  Wind    Water  Forecast  EnjoySport
Sunny  Normal    Strong  Warm   Same      Yes
Sunny  High      Strong  Warm   Same      Yes
Sunny  High      Strong  Warm   Change    No
Sunny  High      Strong  Cool   Change    Yes

  • Possible hypotheses:

– h1: Sky=Sunny ⇒ EnjoySport=Yes
– h2: Water=Cool or Forecast=Same ⇒ EnjoySport=Yes

SLIDE 4

Regression

  • Find function h that fits f at instances x

SLIDE 5

Regression

  • Find function h that fits f at instances x

(figure: two candidate functions h1 and h2 fitted to the data points)

SLIDE 6

Hypothesis Space

  • Hypothesis space H

– Set of all hypotheses h that the learner may consider
– Learning is a search through the hypothesis space

  • Objective:

– Find a hypothesis that agrees with the training examples
– But what about unseen examples?

SLIDE 7

Generalization

  • A good hypothesis will generalize well (i.e., predict unseen examples correctly)
  • Usually…

– Any hypothesis h found to approximate the target function f well over a sufficiently large set of training examples will also approximate the target function well over any unobserved examples

SLIDE 8

Inductive learning

  • Construct/adjust h to agree with f on training set
  • (h is consistent if it agrees with f on all examples)
  • E.g., curve fitting:

SLIDE 9

Inductive learning

  • Construct/adjust h to agree with f on training set
  • (h is consistent if it agrees with f on all examples)
  • E.g., curve fitting:

SLIDE 10

Inductive learning

  • Construct/adjust h to agree with f on training set
  • (h is consistent if it agrees with f on all examples)
  • E.g., curve fitting:

SLIDE 11

Inductive learning

  • Construct/adjust h to agree with f on training set
  • (h is consistent if it agrees with f on all examples)
  • E.g., curve fitting:

SLIDE 12

Inductive learning

  • Construct/adjust h to agree with f on training set
  • (h is consistent if it agrees with f on all examples)
  • E.g., curve fitting:
  • Ockham’s razor: prefer the simplest hypothesis consistent with the data (a small curve-fitting sketch follows below)
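
A minimal curve-fitting sketch of this idea, assuming NumPy and a made-up, roughly linear set of sample points (none of this comes from the lecture):

# Fit polynomial hypotheses of increasing degree to the same training points.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.1, 0.9, 2.2, 2.8, 4.1, 4.9])     # roughly linear data, f(x) close to x

for degree in (1, 3, 5):
    coeffs = np.polyfit(x, y, degree)             # hypothesis h = polynomial of this degree
    max_err = np.abs(np.polyval(coeffs, x) - y).max()
    print(f"degree {degree}: max training error = {max_err:.3f}")

# The degree-5 polynomial passes through all 6 points (it is consistent), but
# Ockham's razor prefers the degree-1 hypothesis, which fits almost as well
# and is far more likely to generalize to unseen x.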

SLIDE 13

Performance of a learning algorithm

  • A learning algorithm is good if it produces a hypothesis that does a good job of predicting the classifications of unseen examples
  • Verify performance with a test set (a minimal sketch follows below):

1. Collect a large set of examples
2. Divide it into 2 disjoint sets: a training set and a test set
3. Learn hypothesis h from the training set
4. Measure the percentage of test-set examples correctly classified by h
5. Repeat steps 2-4 for different randomly selected training sets of varying sizes
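
A minimal sketch of steps 1-5, assuming scikit-learn, a decision-tree learner, and some labelled data set X, y that you supply (the learner and the training-set sizes below are illustrative choices, not the course's):

# Estimate test-set accuracy for randomly chosen training sets of a given size.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def test_set_accuracy(X, y, train_size, seed):
    # Steps 2-4: split into disjoint training/test sets, learn h, measure % correct.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=train_size, random_state=seed)
    h = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
    return accuracy_score(y_te, h.predict(X_te))

# Step 5: repeat for different randomly selected training sets of varying sizes.
# for n in (20, 50, 100, 200):
#     scores = [test_set_accuracy(X, y, n, seed) for seed in range(10)]
#     print(n, np.mean(scores))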

SLIDE 14

Learning curves

(figure: learning curves of % correct vs. size of hypothesis space, plotted for the training set and the test set; the growing gap between the two curves marks overfitting)

SLIDE 15

Overfitting

  • Definition: Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h’ ∈ H such that h has smaller error than h’ over the training examples, but h’ has smaller error than h over the entire distribution of instances
  • Overfitting has been found to decrease the accuracy of many algorithms by 10-25%

SLIDE 16

Statistical Learning

  • View: we have uncertain knowledge of the world
  • Idea: learning simply reduces this uncertainty

SLIDE 17

Candy Example

  • Favorite candy sold in two flavors:

– Lime (ugh)
– Cherry (yum)

  • Same wrapper for both flavors
  • Sold in bags with different ratios:

– 100% cherry
– 75% cherry + 25% lime
– 50% cherry + 50% lime
– 25% cherry + 75% lime
– 100% lime

SLIDE 18

Candy Example

  • You bought a bag of candy but don’t know its flavor ratio
  • After eating k candies:

– What’s the flavor ratio of the bag?
– What will be the flavor of the next candy?

SLIDE 19

Statistical Learning

  • Hypothesis H: a probabilistic theory of the world

– h1: 100% cherry
– h2: 75% cherry + 25% lime
– h3: 50% cherry + 50% lime
– h4: 25% cherry + 75% lime
– h5: 100% lime

  • Data D: evidence about the world

– d1: 1st candy is cherry
– d2: 2nd candy is lime
– d3: 3rd candy is lime
– …

SLIDE 20

Bayesian Learning

  • Prior: Pr(H)
  • Likelihood: Pr(d|H)
  • Evidence: d = <d1,d2,…,dn>
  • Bayesian learning amounts to computing the posterior using Bayes’ Theorem: Pr(H|d) = k Pr(d|H) Pr(H), where k is a normalization constant

SLIDE 21

Bayesian Prediction

  • Suppose we want to make a prediction about an unknown quantity X (i.e., the flavor of the next candy)
  • Pr(X|d) = Σi Pr(X|d,hi) P(hi|d) = Σi Pr(X|hi) P(hi|d), since each hypothesis hi fully determines the distribution of X
  • Predictions are weighted averages of the predictions of the individual hypotheses
  • Hypotheses serve as “intermediaries” between the raw data and the prediction

SLIDE 22

Candy Example

  • Assume prior P(H) = <0.1, 0.2, 0.4, 0.2, 0.1>
  • Assume candies are i.i.d. (independently and identically distributed):

– P(d|h) = Πj P(dj|h)

  • Suppose the first 10 candies all taste lime (a small sketch of the resulting posterior follows below):

– P(d|h5) = 1^10 = 1
– P(d|h3) = 0.5^10 ≈ 0.00098
– P(d|h1) = 0^10 = 0
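
A minimal sketch of the resulting posterior and the Bayesian prediction, assuming NumPy; the prior and the per-hypothesis lime probabilities are the ones given on these slides:

# Posterior P(h|d) and prediction P(next=lime|d) after observing only lime candies.
import numpy as np

prior = np.array([0.1, 0.2, 0.4, 0.2, 0.1])       # P(h1..h5)
p_lime = np.array([0.0, 0.25, 0.5, 0.75, 1.0])    # P(lime | h1..h5)

def posterior(num_limes):
    likelihood = p_lime ** num_limes               # P(d|h) = prod_j P(dj|h), i.i.d. limes
    unnorm = likelihood * prior                    # Bayes' theorem, up to the constant k
    return unnorm / unnorm.sum()

def p_next_lime(num_limes):
    return float(p_lime @ posterior(num_limes))    # sum_i P(lime|hi) P(hi|d)

print(posterior(10))      # posterior mass concentrates on h5 after 10 limes
print(p_next_lime(3))     # roughly 0.8: the weighted-average prediction after 3 limes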

SLIDE 23

Posterior

SLIDE 24

Prediction

(figure: probability that the next candy is lime)

SLIDE 25

Bayesian Learning

  • Bayesian learning properties:

– Optimal: given the prior, no other prediction is correct more often than the Bayesian one
– No overfitting: the prior can be used to penalize complex hypotheses

  • There is a price to pay:

– When the hypothesis space is large, Bayesian learning may be intractable
– i.e., the sum (or integral) over hypotheses is often intractable

  • Solution: approximate Bayesian learning

SLIDE 26

Maximum a posteriori (MAP)

  • Idea: make the prediction based on the most probable hypothesis hMAP

– hMAP = argmaxhi P(hi|d)
– P(X|d) ≈ P(X|hMAP)

  • In contrast, Bayesian learning makes the prediction based on all hypotheses, weighted by their probabilities

SLIDE 27

Candy Example (MAP)

  • Prediction after

– 1 lime: hMAP = h3, Pr(lime|hMAP) = 0.5
– 2 limes: hMAP = h4, Pr(lime|hMAP) = 0.75
– 3 limes: hMAP = h5, Pr(lime|hMAP) = 1
– 4 limes: hMAP = h5, Pr(lime|hMAP) = 1
– …

  • After only 3 limes, it correctly selects h5 (a sketch reproducing this sequence follows below)
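
A minimal sketch reproducing this sequence, assuming NumPy and the prior and hypotheses from the earlier candy slides:

# hMAP and its prediction after observing 1, 2, 3, 4 lime candies in a row.
import numpy as np

prior = np.array([0.1, 0.2, 0.4, 0.2, 0.1])       # P(h1..h5)
p_lime = np.array([0.0, 0.25, 0.5, 0.75, 1.0])    # P(lime | h1..h5)

for num_limes in range(1, 5):
    post = p_lime ** num_limes * prior             # proportional to P(h|d)
    h_map = int(np.argmax(post)) + 1               # index of the most probable hypothesis
    print(f"{num_limes} limes: hMAP = h{h_map}, Pr(lime|hMAP) = {p_lime[h_map - 1]}")

# Prints h3, h4, h5, h5 with predictions 0.5, 0.75, 1.0, 1.0, matching the slide.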

SLIDE 28

Candy Example (MAP)

  • But what if the correct hypothesis is h4?

– h4: P(lime) = 0.75 and P(cherry) = 0.25

  • After 3 limes:

– MAP incorrectly predicts h5
– MAP yields P(lime|hMAP) = 1
– Bayesian learning yields P(lime|d) ≈ 0.8

SLIDE 29

MAP properties

  • MAP prediction is less accurate than Bayesian prediction since it relies on only one hypothesis, hMAP
  • But MAP and Bayesian predictions converge as the amount of data increases
  • No overfitting (the prior can be used to penalize complex hypotheses)
  • Finding hMAP may be intractable:

– hMAP = argmaxh P(h|d)
– The optimization may be difficult

SLIDE 30

MAP computation

  • Optimization:

– hMAP = argmaxh P(h|d) = argmaxh P(h) P(d|h) = argmaxh P(h) Πi P(di|h)

  • The product makes the optimization non-linear
  • Take the log to turn the product into a sum (a small sketch follows below):

– hMAP = argmaxh [log P(h) + Σi log P(di|h)]
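
A minimal sketch of the log-space form on the candy hypotheses, assuming NumPy; the argmax is the same as for the product form, but sums of logs avoid the vanishing products that long data sequences produce:

# MAP in log space: argmax_h [ log P(h) + sum_i log P(di|h) ].
import numpy as np

prior = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
p_lime = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
num_limes = 10                                     # data: 10 lime candies

with np.errstate(divide="ignore"):                 # log(0) = -inf marks impossible hypotheses
    log_post = np.log(prior) + num_limes * np.log(p_lime)

print(f"hMAP = h{int(np.argmax(log_post)) + 1}")   # h5, same answer as the product form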

SLIDE 31

Maximum Likelihood (ML)

  • Idea: simplify MAP by assuming a uniform prior (i.e., P(hi) = P(hj) ∀ i,j)

– hMAP = argmaxh P(h) P(d|h)
– hML = argmaxh P(d|h)

  • Make the prediction based on hML only:

– P(X|d) ≈ P(X|hML)

SLIDE 32

Candy Example (ML)

  • Prediction after

– 1 lime: hML = h5, Pr(lime|hML) = 1
– 2 limes: hML = h5, Pr(lime|hML) = 1
– …

  • Frequentist view: an “objective” prediction, since it relies only on the data (i.e., no prior)
  • Bayesian view: a prediction based on the data and a uniform prior (since no prior ≡ uniform prior)

SLIDE 33

ML properties

  • ML prediction is less accurate than Bayesian and MAP predictions since it ignores prior information and relies on only one hypothesis, hML
  • But ML, MAP and Bayesian predictions converge as the amount of data increases
  • Subject to overfitting (no prior to penalize complex hypotheses that could exploit statistically insignificant data patterns)
  • Finding hML is often easier than finding hMAP:

– hML = argmaxh Σi log P(di|h)

SLIDE 34

Statistical Learning

  • Use Bayesian learning, MAP or ML
  • Complete data:

– When data has multiple attributes, all attributes are known
– Easy

  • Incomplete data:

– When data has multiple attributes, some attributes are unknown
– Harder

SLIDE 35

Simple ML example

  • Hypothesis hθ:

– P(cherry) = θ and P(lime) = 1-θ

  • Data d:

– c cherries and l limes

  • ML hypothesis:

– θ is the relative frequency in the observed data
– θ = c/(c+l)
– P(cherry) = c/(c+l) and P(lime) = l/(c+l)

SLIDE 36

ML computation

  • 1) Likelihood expression

– P(d|hθ) = θ^c (1-θ)^l

  • 2) Log likelihood

– log P(d|hθ) = c log θ + l log(1-θ)

  • 3) Log likelihood derivative

– d(log P(d|hθ))/dθ = c/θ - l/(1-θ)

  • 4) ML hypothesis (a quick numerical check follows below)

– c/θ - l/(1-θ) = 0  ⇒  θ = c/(c+l)
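
A quick numerical check of step 4, assuming NumPy; the counts c and l below are made up:

# The grid maximizer of the log likelihood matches the closed form θ = c/(c+l).
import numpy as np

c, l = 3, 7                                              # e.g. 3 cherries and 7 limes observed
thetas = np.linspace(0.001, 0.999, 999)
log_lik = c * np.log(thetas) + l * np.log(1 - thetas)    # step 2: c log θ + l log(1-θ)
print(thetas[np.argmax(log_lik)])                        # about 0.3, i.e. c/(c+l)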

SLIDE 37

More complicated ML example

  • Hypothesis hθ,θ1,θ2 (θ = P(cherry); θ1 and θ2 are the probabilities of a red wrapper given cherry and lime, respectively, as used in the likelihood on the next slide)
  • Data:

– c cherries

  • gc with green wrappers
  • rc with red wrappers

– l limes

  • gl with green wrappers
  • rl with red wrappers

SLIDE 38

ML computation

  • 1) Likelihood expression

– P(d|hθ,θ1,θ2) = θ^c (1-θ)^l · θ1^rc (1-θ1)^gc · θ2^rl (1-θ2)^gl

  • 4) ML hypothesis (set each partial derivative of the log likelihood to 0)

– c/θ - l/(1-θ) = 0  ⇒  θ = c/(c+l)
– rc/θ1 - gc/(1-θ1) = 0  ⇒  θ1 = rc/(rc+gc)
– rl/θ2 - gl/(1-θ2) = 0  ⇒  θ2 = rl/(rl+gl)

SLIDE 39

Naïve Bayes model

(figure: Bayes net with the class node C as the parent of attribute nodes A1, A2, A3, …, An)

  • Want to predict a class C based on attributes Ai
  • Parameters:

– θ = P(C=true)
– θi1 = P(Ai=true|C=true)
– θi2 = P(Ai=true|C=false)

  • Assumption: the Ai’s are independent given C

SLIDE 40

Naïve Bayes model for Restaurant Problem

  • Data: restaurant examples labelled wait / ~wait (a hypothetical sketch follows below)
  • ML sets:

– θ to the relative frequencies of wait and ~wait
– θi1, θi2 to the relative frequencies of each attribute value given wait and ~wait
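
A minimal sketch of these ML estimates and the resulting prediction, assuming NumPy; the tiny boolean data set below is made up for illustration and is not the restaurant data from the slides:

# Naive Bayes with boolean attributes: ML parameters are relative frequencies.
import numpy as np

# rows = examples, columns = boolean attributes A1..A3; y = class C (wait / ~wait)
A = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 0],
              [0, 0, 1]], dtype=float)
y = np.array([1, 1, 0, 0], dtype=bool)

theta = y.mean()                                   # θ   = P(C=true)
theta_i1 = A[y].mean(axis=0)                       # θi1 = P(Ai=true | C=true)
theta_i2 = A[~y].mean(axis=0)                      # θi2 = P(Ai=true | C=false)

def p_wait(a):
    # P(C=true|a) ∝ P(C=true) * prod_i P(ai|C=true), by the independence assumption.
    p_t = theta * np.prod(np.where(a, theta_i1, 1 - theta_i1))
    p_f = (1 - theta) * np.prod(np.where(a, theta_i2, 1 - theta_i2))
    return p_t / (p_t + p_f)

print(p_wait(np.array([1, 0, 1])))                 # probability of wait for a new example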