

SLIDE 1

Announcement

  • HW 1 out TODAY – Watch your email


SLIDE 2

What is Machine Learning? (Formally)


SLIDE 3

What is Machine Learning?


Study of algorithms that

  • improve their performance (performance)
  • at some task (task)
  • with experience (experience)

[Diagram: data (experience) → Learning algorithm → improved performance at the task]

SLIDE 4

Supervised Learning Task

Task: classify a cell image X into a label Y, e.g. “Anemic cell (0)” vs. “Healthy cell (1)”

SLIDE 5

Performance Measures

  • Measure of closeness between true label Y and prediction f(X)

Performance: 0/1 loss

loss(Y, f(X)) = 1 if f(X) ≠ Y, 0 otherwise

[Example: cell image X with prediction f(X) = “Anemic cell” and true label Y = “Healthy cell”: 0/1 loss = 1]

SLIDE 6

Performance Measures

  • Measure of closeness between true label Y and prediction f(X)

Performance: square loss

loss(Y, f(X)) = (Y − f(X))²

[Example: X = past performance, trade volume etc. as of Sept 8, 2010; share price Y = “$24.50”; two candidate predictions, “$26.00” and “$26.10”, compared by square loss]

SLIDE 7

Performance Measures

  • Measure of closeness between true label Y and prediction f(X)
  • Don’t just want the label of one test data point (cell image), but of any cell image

Performance: Given a cell image drawn randomly from the collection of all cell images, how well does the predictor perform on average?

SLIDE 8

Performance Measures

Performance:

  • Classification (“Anemic cell”): 0/1 loss, whose expected value is the Probability of Error
  • Regression (share price “$24.50”): square loss, whose expected value is the Mean Square Error
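Both losses can be computed empirically. A minimal sketch in plain Python; the example labels and prices are illustrative:

```python
def zero_one_loss(y_true, y_pred):
    """Average 0/1 loss for classification: the fraction of examples
    where the prediction f(X) differs from the true label Y."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_square_error(y_true, y_pred):
    """Average square loss for regression: mean of (Y - f(X))^2."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Classification: 0 = "Anemic cell", 1 = "Healthy cell"
print(zero_one_loss([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.25 (1 error in 4)

# Regression: share prices in dollars
print(mean_square_error([24.50, 26.00], [24.50, 26.10]))  # ~0.005
```

Averaging these losses over random draws of X gives the Probability of Error and the Mean Square Error, respectively.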

SLIDE 9

Bayes Optimal Rule

Ideal goal: the Bayes optimal rule (for 0/1 loss, predict the most probable label given X)
Best possible performance: the Bayes Risk
BUT… the optimal rule is not computable: it depends on the unknown distribution PXY!
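If PXY were known, the Bayes optimal rule could be written down directly. A toy sketch with a made-up discrete PXY (the probabilities are illustrative, not from the slides):

```python
# Toy joint distribution P(X, Y) over inputs X in {0, 1} and labels Y in {0, 1}
# (illustrative numbers only).
P_XY = {(0, 0): 0.35, (0, 1): 0.05,   # X = 0 is usually labeled 0
        (1, 0): 0.15, (1, 1): 0.45}   # X = 1 is usually labeled 1

def bayes_rule(x):
    """Bayes optimal rule for 0/1 loss: argmax_y P(Y = y | X = x).
    Since P(Y | X) is proportional to P(X, Y), we can maximize the joint."""
    return max((0, 1), key=lambda y: P_XY[(x, y)])

# Bayes risk: probability that even the optimal rule errs.
bayes_risk = sum(p for (x, y), p in P_XY.items() if bayes_rule(x) != y)
print(bayes_risk)  # ~0.2
```

The point of the slide stands: in practice PXY is unknown, so neither the rule nor its risk can be computed this way.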

SLIDE 10

Experience - Training Data

Can’t minimize the risk since PXY is unknown! Training data (experience) provides a glimpse of PXY:

(X1,Y1), …, (Xn,Yn) drawn independent, identically distributed (i.i.d.) from the unknown PXY, but observed. Labels provided by an expert, a measuring device, some experiment, …

[Training examples: (cell image, Anemic cell), (cell image, Healthy cell), (cell image, Healthy cell), (cell image, Anemic cell)]

SLIDE 11

Supervised Learning

Task: predict the label Y of an input X (cell image)
Performance: expected loss under the (unknown) distribution PXY
Experience: Training data (X1,Y1), …, (Xn,Yn)

[Training examples: (cell image, Anemic cell), (cell image, Healthy cell), (cell image, Healthy cell), (cell image, Anemic cell)]

SLIDE 12

Machine Learning Algorithm

Training data: (cell image, Anemic cell), (cell image, Healthy cell), (cell image, Healthy cell), (cell image, Anemic cell) → Learning algorithm → predictor f

Test data: new cell image X → f(X) = “Anemic cell”

Note: test data ≠ training data

SLIDE 13

Issues in ML

  • A good machine learning algorithm

    – Does not overfit training data
    – Generalizes well to test data

[Figure: two Weight vs. Height scatter plots of “Football Player” / “No Football Player” training and test data, contrasting an overfit decision boundary with one that generalizes]

More later …

SLIDE 14

Performance Revisited

Performance (of a learning algorithm): How well does the algorithm do on average

  • 1. for a test cell image X drawn at random, and
  • 2. for a set of training images and labels drawn at random

Expected Risk (aka Generalization Error)

SLIDE 15

How to sense Generalization Error?

  • Can’t compute the generalization error. How can we get a sense of how well the algorithm is performing in practice?
  • One approach:

    – Split the available data into two sets
    – Training Data: used for training the algorithm
    – Test Data (a.k.a. Validation Data, Hold-out Data): provides an estimate of the generalization error

Test Error = average loss of the learned predictor on the test data. Why not use the Training Error?
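A minimal sketch of this approach in plain Python, with a synthetic dataset and a deliberately trivial “learning algorithm” (both made up for illustration):

```python
import random

random.seed(0)  # reproducibility

# Synthetic labeled data: feature x in [0, 1), label 1 iff x > 0.5.
data = [(x, int(x > 0.5)) for x in (random.random() for _ in range(100))]

# Split the available data into two sets.
random.shuffle(data)
train, test = data[:80], data[80:]

def learn(examples):
    """Trivial learner: threshold at the mean training feature."""
    thresh = sum(x for x, _ in examples) / len(examples)
    return lambda x: int(x > thresh)

f = learn(train)  # trained ONLY on the training set

# Test error: average 0/1 loss on held-out data, an estimate of the
# generalization error. The training error would be optimistically
# biased, since f was chosen to fit the training set.
test_error = sum(f(x) != y for x, y in test) / len(test)
print(test_error)
```

In practice a library helper such as `sklearn.model_selection.train_test_split` does the splitting; the principle is the same.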

SLIDE 16

Supervised vs. Unsupervised Learning

Supervised Learning: learning with a teacher

  • Input: (Documents, topics) → Learning algorithm → mapping between documents and topics

Unsupervised Learning: learning without a teacher

  • Input: Documents → Learning algorithm → model for word distribution OR clustering of similar documents

SLIDE 17

Let’s get to some learning algorithms!


SLIDE 18

Learning Distributions (Parametric Approach)

Aarti Singh

Machine Learning 10-701/15-781 Sept 13, 2010

SLIDE 19

Your first consulting job

  • A billionaire from the suburbs of Seattle asks you a question:

    – He says: I have a coin; if I flip it, what’s the probability it will fall with the head up?
    – You say: Please flip it a few times: (he flips 3 heads and 2 tails)
    – You say: The probability is: 3/5
    – He says: Why???
    – You say: Because…

SLIDE 20

Bernoulli distribution

Data, D = sequence of coin flips (e.g. 3 heads and 2 tails)

  • P(Heads) = θ, P(Tails) = 1 − θ
  • Flips are i.i.d.:

    – Independent events
    – Identically distributed according to the Bernoulli distribution

Choose θ that maximizes the probability of observed data

SLIDE 21

Maximum Likelihood Estimation

Choose  that maximizes the probability of observed data MLE of probability of head:

21

= 3/5

“Frequency of heads”
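The MLE is just the frequency of heads; a one-line sketch:

```python
def mle_heads(n_heads, n_tails):
    """MLE of the Bernoulli parameter theta: the frequency of heads."""
    return n_heads / (n_heads + n_tails)

print(mle_heads(3, 2))    # 0.6, i.e. 3/5
print(mle_heads(30, 20))  # 0.6, the same answer
```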

SLIDE 22

How many flips do I need?


  • Billionaire says: I flipped 3 heads and 2 tails.
  • You say:  = 3/5, I can prove it!
  • He says: What if I flipped 30 heads and 20 tails?
  • You say: Same answer, I can prove it!
  • He says: What’s better?
  • You say: Hmm… The more the merrier???
  • He says: Is this why I am paying you the big bucks???
SLIDE 23

Simple bound (Hoeffding’s inequality)

  • For n = n_H + n_T coin flips, θ̂_MLE = n_H / n
  • Let θ* be the true parameter; for any ε > 0:

P(|θ̂_MLE − θ*| ≥ ε) ≤ 2e^(−2nε²)
SLIDE 24

PAC Learning

  • PAC: Probably Approximately Correct
  • Billionaire says: I want to know the coin parameter θ, within ε = 0.1, with probability at least 1 − δ = 0.95. How many flips?

Sample complexity: from Hoeffding’s inequality, 2e^(−2nε²) ≤ δ holds when

n ≥ ln(2/δ) / (2ε²)

For ε = 0.1 and δ = 0.05, n ≥ 185 flips suffice.

SLIDE 25

What about prior knowledge?

  • Billionaire says: Wait, I know that the coin is “close” to 50-50. What can you do for me now?
  • You say: I can learn it the Bayesian way…
  • Rather than estimating a single θ, we obtain a distribution over possible values of θ

[Figure: prior concentrated around 50-50 before seeing data; a sharper posterior after seeing data]

SLIDE 26

Bayesian Learning

  • Use Bayes rule:

P(θ | D) = P(D | θ) P(θ) / P(D)

  • Or equivalently:

P(θ | D) ∝ P(D | θ) P(θ)

posterior ∝ likelihood × prior

SLIDE 27

Prior distribution

  • What about the prior?

    – Represents expert knowledge (philosophical approach)
    – Simple posterior form (engineer’s approach)

  • Uninformative priors:

    – Uniform distribution

  • Conjugate priors:

    – Closed-form representation of posterior
    – P(θ) and P(θ|D) have the same form

SLIDE 28

Conjugate Prior

  • P() and P(|D) have the same form
  • Eg. 1 Coin flip problem

Likelihood is ~ Binomial If prior is Beta distribution, Then posterior is Beta distribution For Binomial, conjugate prior is Beta distribution.

28

SLIDE 29

Beta distribution

Beta(β_H, β_T): P(θ) ∝ θ^(β_H − 1) (1 − θ)^(β_T − 1)

More concentrated as values of β_H, β_T increase

SLIDE 30

Beta conjugate prior

Posterior is Beta(β_H + n_H, β_T + n_T). As n = n_H + n_T increases, the likelihood dominates: as we get more samples, the effect of the prior is “washed out”.

SLIDE 31

Conjugate Prior

  • P() and P(|D) have the same form
  • Eg. 2 Dice roll problem (6 outcomes instead of 2)

Likelihood is ~ Multinomial( = {1, 2, … , k}) If prior is Dirichlet distribution, Then posterior is Dirichlet distribution For Multinomial, conjugate prior is Dirichlet distribution.

31

SLIDE 32

Maximum A Posteriori Estimation

Choose  that maximizes a posterior probability MAP estimate of probability of head:

32

Mode of Beta distribution
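A sketch contrasting MAP and MLE under the same illustrative Beta(50, 50) prior (the mode formula assumes both posterior parameters exceed 1):

```python
def map_heads(n_heads, n_tails, beta_h=50, beta_t=50):
    """MAP estimate: mode of the Beta(beta_h + n_heads, beta_t + n_tails)
    posterior, i.e. (a - 1) / (a + b - 2) for Beta(a, b)."""
    a = beta_h + n_heads
    b = beta_t + n_tails
    return (a - 1) / (a + b - 2)

def mle_heads(n_heads, n_tails):
    """MLE for comparison: the frequency of heads."""
    return n_heads / (n_heads + n_tails)

print(map_heads(3, 2))   # 52/103, about 0.505: pulled toward the 50-50 prior
print(mle_heads(3, 2))   # 0.6: ignores the prior
```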