
CS534: Machine Learning - Thomas G. Dietterich - PowerPoint PPT Presentation

CS534: Machine Learning. Thomas G. Dietterich, 221C Dearborn Hall, tgd@cs.orst.edu, http://www.cs.orst.edu/~tgd/classes/534


1. CS534: Machine Learning
Thomas G. Dietterich, 221C Dearborn Hall
tgd@cs.orst.edu
http://www.cs.orst.edu/~tgd/classes/534

2. Course Overview
Introduction:
– Basic problems and questions in machine learning. Example applications
Linear Classifiers
Five Popular Algorithms:
– Decision trees (C4.5)
– Neural networks (backpropagation)
– Probabilistic networks (Naïve Bayes; Mixture models)
– Support Vector Machines (SVMs)
– Nearest Neighbor Method
Theories of Learning:
– PAC, Bayesian, Bias-Variance analysis
Optimizing Test Set Performance:
– Overfitting, Penalty methods, Holdout Methods, Ensembles
Sequential and Spatial Data:
– Hidden Markov models, Conditional Random Fields; Hidden Markov SVMs
Problem Formulation:
– Designing Input and Output representations

3. Supervised Learning
– Given: Training examples ⟨x, f(x)⟩ for some unknown function f.
– Find: A good approximation to f.
Example Applications
– Handwriting recognition
  x: data from pen motion
  f(x): letter of the alphabet
– Disease Diagnosis
  x: properties of patient (symptoms, lab tests)
  f(x): disease (or maybe, recommended therapy)
– Face Recognition
  x: bitmap picture of person's face
  f(x): name of person
– Spam Detection
  x: email message
  f(x): spam or not spam

4. Appropriate Applications for Supervised Learning
– Situations where there is no human expert
  x: bond graph of a new molecule
  f(x): predicted binding strength to AIDS protease molecule
– Situations where humans can perform the task but can't describe how they do it
  x: bitmap picture of hand-written character
  f(x): ascii code of the character
– Situations where the desired function is changing frequently
  x: description of stock prices and trades for last 10 days
  f(x): recommended stock transactions
– Situations where each user needs a customized function f
  x: incoming email message
  f(x): importance score for presenting to the user (or deleting without presenting)

5. Formal Setting
[Slide diagram: training points ⟨x, y⟩ drawn from P(x, y) form the training sample; the learning algorithm produces a classifier f; a test point x is given to f, which predicts ŷ; the loss function compares ŷ with the true y, giving L(ŷ, y).]
– Training examples are drawn independently at random according to unknown probability distribution P(x, y)
– The learning algorithm analyzes the examples and produces a classifier f
– Given a new data point ⟨x, y⟩ drawn from P, the classifier is given x and predicts ŷ = f(x)
– The loss L(ŷ, y) is then measured
– Goal of the learning algorithm: Find the f that minimizes the expected loss
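To make the protocol concrete, here is a small Python sketch that is not part of the original slides; the one-dimensional distribution, the threshold "learning algorithm", and the 0/1 loss are all assumptions chosen only to show the roles of the training sample, the classifier f, the test point, and the loss L(ŷ, y).

```python
import random

# Toy stand-in for the unknown distribution P(x, y) (an assumption of this sketch):
# x is uniform on [0, 1], y = 1 when x > 0.5, with 10% label noise.
def draw_example():
    x = random.random()
    y = int(x > 0.5)
    if random.random() < 0.1:
        y = 1 - y
    return x, y

# Training sample: examples drawn independently at random from P(x, y).
train = [draw_example() for _ in range(100)]

# A deliberately simple "learning algorithm": choose the threshold on x that
# minimizes training error; the resulting classifier is f.
def learn(sample):
    def training_errors(t):
        return sum(int(x > t) != y for x, y in sample)
    best_t = min((x for x, _ in sample), key=training_errors)
    return lambda x: int(x > best_t)

f = learn(train)

# 0/1 loss: L(y_hat, y) = 0 if the prediction is correct, 1 otherwise.
def loss(y_hat, y):
    return int(y_hat != y)

# A new data point (x, y) drawn from P; f sees only x and predicts y_hat.
x_new, y_new = draw_example()
y_hat = f(x_new)
print("prediction:", y_hat, "true label:", y_new, "loss:", loss(y_hat, y_new))

# Estimate the expected loss of f by averaging over many fresh draws from P.
fresh = [draw_example() for _ in range(1000)]
print("estimated expected loss:", sum(loss(f(x), y) for x, y in fresh) / len(fresh))
```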

6. Formal Version of Spam Detection
– P(x, y): distribution of email messages x and their true labels y ("spam" or "not spam")
– training sample: a set of email messages that have been labeled by the user
– learning algorithm: what we study in this course!
– f: the classifier output by the learning algorithm
– test point: A new email message x (with its true, but hidden, label y)
– loss function L(ŷ, y):

  L(ŷ, y)                  true y = spam    true y = not spam
  predicted ŷ = spam             0                 10
  predicted ŷ = not spam         1                  0
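A minimal sketch of how this loss table could be written in code, added here for illustration; the label strings and the function name are assumptions, not from the course materials.

```python
# Loss table from the slide: sending a legitimate message to the spam folder
# (false positive) costs 10, leaving a spam message in the inbox costs 1.
LOSS = {
    ("spam", "spam"): 0,
    ("spam", "not spam"): 10,
    ("not spam", "spam"): 1,
    ("not spam", "not spam"): 0,
}

def L(y_hat, y):
    """Loss of predicting y_hat when the true label is y."""
    return LOSS[(y_hat, y)]

print(L("spam", "not spam"))  # 10: a legitimate message sent to the spam folder
print(L("not spam", "spam"))  # 1: a spam message left in the inbox
```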

7. Three Main Approaches to Machine Learning
– Learn a classifier: a function f.
– Learn a conditional distribution: a conditional distribution P(y | x)
– Learn the joint probability distribution: P(x, y)
In the first two weeks, we will study one example of each method:
– Learn a classifier: The LMS algorithm
– Learn a conditional distribution: Logistic regression
– Learn the joint distribution: Linear discriminant analysis

8. Inferring a classifier f from P(y | x)
Predict the ŷ that minimizes the expected loss:

  f(x) = argmin_ŷ E_{y|x}[ L(ŷ, y) ]
       = argmin_ŷ Σ_y P(y | x) L(ŷ, y)
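A possible Python rendering of this rule, added for illustration; the dictionary-based interface is an assumption, and the example numbers are the spam loss table together with the P(y = "spam" | x) = 0.6 case worked out on the next slide.

```python
# Decision rule from the slide: choose the y_hat with the smallest expected
# loss under P(y | x).
def bayes_optimal_prediction(p_y_given_x, loss):
    """p_y_given_x: label -> P(y = label | x); loss: (y_hat, y) -> cost."""
    labels = list(p_y_given_x)
    def expected_loss(y_hat):
        return sum(p_y_given_x[y] * loss[(y_hat, y)] for y in labels)
    return min(labels, key=expected_loss)

# Spam loss table and P(y = "spam" | x) = 0.6, as in the worked example.
loss = {
    ("spam", "spam"): 0, ("spam", "not spam"): 10,
    ("not spam", "spam"): 1, ("not spam", "not spam"): 0,
}
p = {"spam": 0.6, "not spam": 0.4}
print(bayes_optimal_prediction(p, loss))  # not spam (expected loss 0.6 vs. 4.0)
```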

9. Example: Making the spam decision
Suppose our spam detector predicts that P(y = "spam" | x) = 0.6. What is the optimal classification decision ŷ?

  L(ŷ, y)                  true y = spam    true y = not spam
  predicted ŷ = spam             0                 10
  predicted ŷ = not spam         1                  0
  P(y | x)                      0.6                0.4

– Expected loss of ŷ = "spam" is 0 * 0.6 + 10 * 0.4 = 4
– Expected loss of ŷ = "not spam" is 1 * 0.6 + 0 * 0.4 = 0.6
– Therefore, the optimal prediction is "not spam"

10. Inferring a classifier from the joint distribution P(x, y)
We can compute the conditional distribution according to the definition of conditional probability:

  P(y = k | x) = P(x, y = k) / Σ_j P(x, y = j)

In words, compute P(x, y = k) for each value of k. Then normalize these numbers.
Compute ŷ using the method from the previous slide.
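A small illustrative sketch of both steps, normalization followed by the expected-loss decision from the previous slide; the numeric joint values are hypothetical, not from the course.

```python
# Step 1: normalize the joint P(x, y = k) over k to get P(y = k | x).
def conditional_from_joint(joint_at_x):
    """joint_at_x: label k -> P(x, y = k) for the observed x."""
    total = sum(joint_at_x.values())
    return {k: v / total for k, v in joint_at_x.items()}

joint = {"spam": 0.0003, "not spam": 0.0002}   # hypothetical values for one email x
p_y_given_x = conditional_from_joint(joint)
print(p_y_given_x)                              # roughly {'spam': 0.6, 'not spam': 0.4}

# Step 2: pick the prediction with the smallest expected loss.
loss = {
    ("spam", "spam"): 0, ("spam", "not spam"): 10,
    ("not spam", "spam"): 1, ("not spam", "not spam"): 0,
}
y_hat = min(p_y_given_x, key=lambda yh: sum(p_y_given_x[y] * loss[(yh, y)]
                                            for y in p_y_given_x))
print(y_hat)                                    # not spam
```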

11. Fundamental Problem of Machine Learning: It is ill-posed

  Example   x1   x2   x3   x4   y
     1       0    0    1    0   0
     2       0    1    0    0   0
     3       0    0    1    1   1
     4       1    0    0    1   1
     5       0    1    1    0   0
     6       1    1    0    0   0
     7       0    1    0    1   0
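One way to see why the problem is ill-posed: the seven examples fix f on only seven of the sixteen possible inputs, so many boolean functions fit the training data perfectly. The brute-force count below is an illustration added to this transcript, not part of the slides.

```python
from itertools import product

# Training examples from the table above: (x1, x2, x3, x4) -> y
examples = {
    (0, 0, 1, 0): 0,
    (0, 1, 0, 0): 0,
    (0, 0, 1, 1): 1,
    (1, 0, 0, 1): 1,
    (0, 1, 1, 0): 0,
    (1, 1, 0, 0): 0,
    (0, 1, 0, 1): 0,
}

all_inputs = list(product([0, 1], repeat=4))            # all 16 feature vectors
unseen = [x for x in all_inputs if x not in examples]   # 9 never observed

# Each way of labeling the unseen inputs yields a different boolean function
# that agrees with every training example, so the data alone cannot single
# out the "true" f.
print(f"{2 ** len(unseen)} consistent functions ({len(unseen)} unseen inputs)")
# -> 512 consistent functions (9 unseen inputs)
```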
