

  1. IRDS: Bonus Slides. Charles Sutton, University of Edinburgh

  2. Hello there. I will not present these slides in class. Next lecture we will discuss how to choose features for learning algorithms, which means you need to understand a bit about learning algorithms. These slides are just an outline of topics that will help you appreciate the next lecture. These slides:
  • list a few representative algorithms
  • say what you should know about each of them
  • give links to readings where you can learn more
  To be ready for the next lecture, what you really need is:
  • to know how the classifiers represent the decision boundary
  • not the algorithm for how the classifier is learnt (good to know, but not necessary for the next lecture)

  3. List of Algorithms (with readings)
  Here are the ones we will “discuss”:
  • Linear regression, and fitting nonlinear functions by adding basis functions (BRML Sec 17.1, 17.2)
  • Logistic regression (BRML Sec 17.4, just the first few pages; don’t worry about training algorithms)
  • k-nearest neighbour (BRML Sec 14.1, 14.2)
  • Decision trees (HTF Sec 9.2)
  Why these? They are practical and have different types of decision boundaries, so they are representative for the purposes of the next lecture.

  4. Key to previous slide
  • BRML: Barber. Bayesian Reasoning and Machine Learning. CUP, 2012. http://web4.cs.ucl.ac.uk/staff/D.Barber/pmwiki/pmwiki.php?n=Brml.HomePage
  • HTF: Hastie, Tibshirani, and Friedman. The Elements of Statistical Learning, 2nd ed. Springer, 2009. http://statweb.stanford.edu/~tibs/ElemStatLearn/

  5. Linear regression
  Let $\mathbf{x} \in \mathbb{R}^d$ denote the feature vector. We are trying to predict $y \in \mathbb{R}$. The simplest choice is a linear function. Define parameters $\mathbf{w} \in \mathbb{R}^d$ and
  $\hat{y} = f(\mathbf{x}, \mathbf{w}) = \mathbf{w}^\top \mathbf{x} = \sum_{j=1}^{d} w_j x_j$
  (to keep notation simple, assume that $x_d = 1$ always).
  Given a data set $\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(N)}$, $y^{(1)}, \dots, y^{(N)}$, find the best parameters
  $\min_{\mathbf{w}} \sum_{i=1}^{N} \left( y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} \right)^2$
  which can be solved easily (but I won’t say how).
  [Figure: a fitted straight line through scattered data points.]
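  A minimal NumPy sketch of this least-squares fit, solved here via the normal equations (which the slide deliberately leaves out); all function and variable names are illustrative, not from the slides:

    import numpy as np

    def fit_linear_regression(X, y):
        # Solve min_w sum_i (y_i - w^T x_i)^2. The minimiser satisfies the
        # normal equations (X^T X) w = X^T y, solved here with a linear solver.
        return np.linalg.solve(X.T @ X, X.T @ y)

    # Toy usage: recover y ~ 2x + 1 from noisy data.
    rng = np.random.default_rng(0)
    x = rng.uniform(-2, 3, size=50)
    X = np.column_stack([x, np.ones_like(x)])   # constant feature plays the role of x_d = 1
    y = 2 * x + 1 + 0.1 * rng.normal(size=50)
    w = fit_linear_regression(X, y)
    print(w)   # roughly [2.0, 1.0]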

  6. Nonlinear regression
  What if we want to learn a nonlinear function? Trick: define new features, e.g., for scalar $x$, define $\phi(x) = (1, x, x^2)^\top$ and
  $\hat{y} = f(\mathbf{x}, \mathbf{w}) = \mathbf{w}^\top \phi(\mathbf{x})$
  This is still linear in $\mathbf{w}$. To find parameters, the minimisation problem is now
  $\min_{\mathbf{w}} \sum_{i=1}^{N} \left( y^{(i)} - \mathbf{w}^\top \phi(\mathbf{x}^{(i)}) \right)^2$
  which has exactly the same form as before (because $\mathbf{x}$ is fixed), so it is still just as easy.
  [Figure: a degree-2 polynomial curve fitted to scattered data.]
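  A short sketch of the basis-function trick under the same assumptions: expand a scalar x into phi(x) = (1, x, x^2) and reuse exactly the least-squares solver above (names again illustrative):

    import numpy as np

    def poly_features(x, degree=2):
        # Columns are x^0, x^1, ..., x^degree, i.e. phi(x) = (1, x, x^2, ...).
        return np.column_stack([x ** k for k in range(degree + 1)])

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 20, size=100)
    y = 0.05 * (x - 10) ** 2 + rng.normal(scale=0.5, size=100)   # a nonlinear target

    Phi = poly_features(x, degree=2)
    w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)   # same normal equations as before
    y_hat = Phi @ w                               # predictions are still linear in w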

  7. Logistic regression (a classification method, despite the name)
  Linear regression was easy. Can we do linear classification too? Define a discriminant function
  $f(\mathbf{x}, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}$
  Then predict using $y = 1$ if $f(\mathbf{x}, \mathbf{w}) \geq 0$ and $y = 0$ otherwise, which yields a linear decision boundary.
  We can get class probabilities from this idea, using logistic regression:
  $p(y = 1 \mid \mathbf{x}) = \dfrac{1}{1 + \exp\{ -\mathbf{w}^\top \mathbf{x} \}}$
  (To show the decision boundaries are the same, compute the log odds $\log \frac{p(y = 1 \mid \mathbf{x})}{p(y = 0 \mid \mathbf{x})}$.)
  [Figure: two classes of points in the plane separated by a line, with the weight vector w normal to the boundary.]
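  A sketch of how a weight vector w is used at prediction time (training w is not covered here either; the weights below are made up purely for illustration):

    import numpy as np

    def predict_proba(X, w):
        # p(y = 1 | x) = 1 / (1 + exp(-w^T x))
        return 1.0 / (1.0 + np.exp(-(X @ w)))

    def predict_label(X, w):
        # The boundary w^T x = 0 is linear: p(y=1|x) >= 0.5 exactly when w^T x >= 0.
        return (X @ w >= 0).astype(int)

    w = np.array([1.5, -2.0, 0.3])        # illustrative weights; last feature is the constant x_d = 1
    X = np.array([[0.2, 0.1, 1.0],
                  [2.0, 1.5, 1.0]])
    print(predict_proba(X, w))
    print(predict_label(X, w))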

  8. K-Nearest Neighbour
  A simple method for classification or regression. Define a distance function $D(\mathbf{x}, \mathbf{x}')$ between feature vectors. To classify a new feature vector $\mathbf{x}$:
  1. Look through your training set and find the K closest points. Call them $N_K(\mathbf{x})$. (This is memory-based learning.)
  2. Return the majority vote.
  3. If you want a probability, take the proportion
  $p(y = c \mid \mathbf{x}) = \frac{1}{K} \sum_{(y', \mathbf{x}') \in N_K(\mathbf{x})} \mathbb{I}\{ y' = c \}$
  (The running time of this algorithm is terrible. See IAML for better indexing.)
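  A brute-force sketch of K-nearest-neighbour classification with Euclidean distance as D; the naive scan over the whole training set matches the slide's warning about running time, and the names are illustrative:

    import numpy as np

    def knn_predict(X_train, y_train, x_new, K=5):
        dists = np.linalg.norm(X_train - x_new, axis=1)      # D(x, x') for every training point
        neighbours = np.argsort(dists)[:K]                    # indices of the K closest points, N_K(x)
        votes = y_train[neighbours]
        classes, counts = np.unique(votes, return_counts=True)
        probs = {c: n / K for c, n in zip(classes, counts)}   # p(y = c | x) as vote proportions
        return classes[np.argmax(counts)], probs              # majority vote plus the proportions

    # Toy usage with two classes.
    X_train = np.array([[0.0, 0.0], [0.1, 0.2], [2.0, 2.1], [2.2, 1.9]])
    y_train = np.array([0, 0, 1, 1])
    label, probs = knn_predict(X_train, y_train, np.array([1.9, 2.0]), K=3)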

  9. K-Nearest Neighbour
  Decision boundaries can be highly nonlinear. The bigger the K, the smoother the boundary. This is nonparametric: the complexity of the boundary varies depending on the amount of training data.
  [Figures: training data with three classes (c1, c2, c3), and the predicted labels for K=1 and K=5, showing a smoother boundary for the larger K.]

  10. Decision Trees
  Can be used for classification or regression. Can handle discrete or continuous features. Interpretable, but tend not to work as well as other methods.
  [Figure: a tree of axis-aligned splits (X1 ≤ t1, X2 ≤ t2, X1 ≤ t3, X2 ≤ t4) and the corresponding partition of the (X1, X2) plane into rectangular regions R1 to R5; from Hastie, Tibshirani, and Friedman, 2009.]
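  A small sketch of fitting an axis-aligned decision tree, assuming scikit-learn is available (the slides do not prescribe any particular implementation); the learned thresholds play the role of t1, ..., t4 and the leaves correspond to rectangular regions like R1, ..., R5:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, export_text

    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(200, 2))
    y = ((X[:, 0] > 0.2) & (X[:, 1] > -0.3)).astype(int)   # toy labels with axis-aligned structure

    tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
    print(export_text(tree, feature_names=["X1", "X2"]))   # prints the splits "X1 <= ...", "X2 <= ..."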
