  1. Lecture 17: More on binary vs. multi-class classifiers (Polychotomizers: One-Hot Vectors, Softmax, and Cross-Entropy) Mark Hasegawa-Johnson, 3/9/2019. CC-BY 3.0: You are free to share and adapt these slides if you cite the original. Modified by Julia Hockenmaier

  2. More on supervised learning

  3. The supervised learning task
  Given a labeled training data set of N items x_n ∈ X with labels y_n ∈ Y:
  D_train = {(x_1, y_1), …, (x_N, y_N)}
  (y_n is determined by some unknown target function f(x))
  Return a model g: X ⟼ Y that is a good approximation of f(x)
  (g should assign correct labels y to unseen x ∉ D_train)

  4. Supervised learning terms
  Input items/data points x_n ∈ X (e.g. emails) are drawn from an instance space X.
  Output labels y_n ∈ Y (e.g. ‘spam’/‘nospam’) are drawn from a label space Y.
  Every data point x_n ∈ X has a single correct label y_n ∈ Y, defined by an (unknown) target function f(x) = y.

  5. Supervised learning
  Input: an item x drawn from an instance space X.
  Output: an item y drawn from a label space Y.
  Target function: y = f(x). Learned model: y' = g(x).
  You often see f̂(x) instead of g(x), and ŷ instead of y', but PowerPoint can’t really typeset that, so g(x) and y' will have to do.

  6. Supervised learning: Training
  Labeled training data D_train = {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)} → Learning algorithm → learned model g(x).
  Give the learner the examples in D_train; the learner returns a model g(x).

  7. Supervised learning: Testing
  Labeled test data D_test = {(x'_1, y'_1), (x'_2, y'_2), …, (x'_M, y'_M)}.
  Reserve some labeled data for testing.

  8. Supervised learning: Testing
  Split the labeled test data D_test = {(x'_1, y'_1), …, (x'_M, y'_M)} into the raw test data X_test = (x'_1, …, x'_M) and the test labels Y_test = (y'_1, …, y'_M).

  9. Supervised learning: Testing
  Apply the model to the raw test data: the learned model g(x) maps X_test = (x'_1, …, x'_M) to the predicted labels g(X_test) = (g(x'_1), …, g(x'_M)), which are compared against the test labels Y_test = (y'_1, …, y'_M).

  10. Evaluating supervised learners
  Use a test data set D_test = {(x'_1, y'_1), …, (x'_M, y'_M)} that is disjoint from D_train, so the learner has not seen the test items during learning: split your labeled data into two parts, training and test.
  Take all items x'_i in D_test and compare the predicted g(x'_i) with the correct y'_i.
  This requires an evaluation metric (e.g. accuracy).
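  To make the evaluation concrete, here is a minimal Python sketch of the accuracy metric (not from the slides; the toy spam model and test emails are made up):

      def accuracy(g, test_data):
          """Fraction of test items whose predicted label matches the true label."""
          correct = sum(1 for x, y in test_data if g(x) == y)
          return correct / len(test_data)

      # Hypothetical toy model: label an email spam (1) iff it mentions 'money'.
      g = lambda x: 1 if "money" in x else 0
      D_test = [("win money now", 1), ("meeting at noon", 0),
                ("cash prize inside", 1), ("free money", 1)]
      print(accuracy(g, D_test))  # 0.75: the model misses 'cash prize inside'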

  11. 1. The instance space

  12. 1. The instance space X
  [Diagram: an item x drawn from an instance space X → learned model y = g(x) → an item y drawn from a label space Y]
  Designing an appropriate instance space X is crucial for how well we can predict y.

  13. 1. The instance space X
  When we apply machine learning to a task, we first need to define the instance space X.
  Instances x ∈ X are defined by features:
  Boolean features: Does this email contain the word ‘money’?
  Numerical features: How often does ‘money’ occur in this email? What is the width/height of this bounding box?
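  A small sketch of these two feature types on the email example (the feature names and the email text are illustrative, not from the slides):

      def extract_features(email):
          """Map a raw email string to Boolean and numerical features."""
          words = email.lower().split()
          return {
              "contains_money": "money" in words,   # Boolean feature
              "count_money": words.count("money"),  # numerical feature
              "num_words": len(words),              # numerical feature
          }

      print(extract_features("send money now money talks"))
      # {'contains_money': True, 'count_money': 2, 'num_words': 5}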

  14. X as a vector space
  X is an N-dimensional vector space (e.g. ℝ^N). Each dimension = one feature.
  Each x is a feature vector (hence the boldface x).
  Think of x = [x_1 … x_N] as a point in X.
  [Plot: a point in a two-dimensional feature space with axes x_1 and x_2]

  15. From feature templates to vectors
  When designing features, we often think in terms of templates, not individual features:
  What is the 2nd letter?
  Naoki → [1 0 0 0 …] (2nd letter ‘a’)
  Abe → [0 1 0 0 …] (2nd letter ‘b’)
  Scrooge → [0 0 1 0 …] (2nd letter ‘c’)
  What is the i-th letter?
  Abe → [1 0 0 0 0 … 0 1 0 0 0 0 … 0 0 0 0 1 …]
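  A sketch of expanding the "What is the i-th letter?" template into one-hot vectors; one_hot_letter is a hypothetical helper, assuming names use only the letters a–z:

      import string

      def one_hot_letter(name, i):
          """One-hot encoding (over a-z) of the i-th letter of name, 1-indexed."""
          vec = [0] * 26
          if i <= len(name):
              vec[string.ascii_lowercase.index(name.lower()[i - 1])] = 1
          return vec

      print(one_hot_letter("Naoki", 2)[:4])    # [1, 0, 0, 0]: 2nd letter 'a'
      print(one_hot_letter("Abe", 2)[:4])      # [0, 1, 0, 0]: 2nd letter 'b'
      print(one_hot_letter("Scrooge", 2)[:4])  # [0, 0, 1, 0]: 2nd letter 'c'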

  16. Good features are essential
  • The choice of features is crucial for how well a task can be learned.
  • In many application areas (language, vision, etc.), a lot of work goes into designing suitable features.
  • This requires domain expertise.
  • We can’t teach you what specific features to use for your task.
  • But we will touch on some general principles.

  17. 2. The label space

  18. 2. The label space Y
  [Diagram: an item x drawn from an instance space X → learned model y = g(x) → an item y drawn from a label space Y]
  The label space Y determines what kind of supervised learning task we are dealing with.

  19. Supervised learning tasks I
  Output labels y ∈ Y are categorical: CLASSIFICATION
  Binary classification: two possible labels.
  Multiclass classification: k possible labels.
  Output labels y ∈ Y are structured objects (sequences of labels, parse trees, etc.): structure learning, etc.

  20. Supervised learning tasks II
  Output labels y ∈ Y are numerical:
  Regression (linear/polynomial): labels are continuous-valued; learn a linear/polynomial function f(x).
  Ranking: labels are ordinal; learn an ordering f(x_1) > f(x_2) over inputs.

  21. 3. Models (The hypothesis space)

  22. 3. The model g(x)
  [Diagram: an item x drawn from an instance space X → learned model y = g(x) → an item y drawn from a label space Y]
  We need to choose what kind of model we want to learn.

  23. More terminology
  For classification tasks (Y is categorical, e.g. {0, 1} or {0, 1, …, k}), the model is called a classifier.
  For binary classification tasks (Y = {0, 1} or Y = {-1, +1}), we can either think of the two values of Y as Boolean or as positive/negative.

  24. A learning problem

         x_1  x_2  x_3  x_4 |  y
     1    0    0    1    0  |  0
     2    0    1    0    0  |  0
     3    0    0    1    1  |  1
     4    1    0    0    1  |  1
     5    0    1    1    0  |  0
     6    1    1    0    0  |  0
     7    0    1    0    1  |  0

  25. A learning problem
  Each x has 4 bits: |X| = 2^4 = 16.
  Since Y = {0, 1}, each f(x) defines one subset of X.
  X has 2^16 = 65,536 subsets, so there are 2^16 possible f(x) (2^9 of them are consistent with our data).
  We would need to see all of X to learn f(x).
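  The counting argument is easy to check numerically; a small sketch using the training table above (the tuple encoding of the rows is mine):

      from itertools import product

      # The 7 training rows from the table: (x_1, x_2, x_3, x_4) -> y
      train = {(0,0,1,0): 0, (0,1,0,0): 0, (0,0,1,1): 1, (1,0,0,1): 1,
               (0,1,1,0): 0, (1,1,0,0): 0, (0,1,0,1): 0}

      X = list(product([0, 1], repeat=4))        # all 2**4 = 16 possible inputs
      unseen = [x for x in X if x not in train]  # 9 inputs the data says nothing about
      print(len(X), 2 ** len(unseen))            # 16 512 (= 2**9 consistent f)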

  26. A learning problem
  We would need to see all of X to learn f(x).
  Easy with |X| = 16; not feasible in general (for real-world problems).
  Learning = generalization, not memorization of the training data.

  27. Classifiers in vector spaces
  [Plot: a decision boundary f(x) = 0 in a two-dimensional feature space, with f(x) > 0 on one side and f(x) < 0 on the other]
  Binary classification: we assume f separates the positive and negative examples:
  Assign y = 1 to all x where f(x) > 0.
  Assign y = 0 (or -1) to all x where f(x) < 0.

  28. Learning a classifier
  The learning task: find a function f(x) that best separates the (training) data.
  What kind of function is f? How do we define best? How do we find f?

  29. Which model should we pick?

  30. Criteria for choosing models
  Accuracy: prefer models that make fewer mistakes. We only have access to the training data, but we care about accuracy on unseen (test) examples.
  Simplicity (Occam’s razor): prefer simpler models (e.g. fewer parameters). These (often) generalize better, and need less data for training.

  31. Linear classifiers

  32. Linear classifiers
  [Plot: a linear decision boundary f(x) = 0, with f(x) > 0 on one side and f(x) < 0 on the other]
  Many learning algorithms restrict the hypothesis space to linear classifiers: f(x) = w_0 + w·x

  33. Linear separability
  • Not all data sets are linearly separable.
  • Sometimes, feature transformations help, e.g. x_1 ↦ x_1^2 or (x_1, x_2) ↦ |x_2 − x_1| (see the sketch below).
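  A toy illustration of the second transformation (the data points are made up): positives lie far from the diagonal on both sides, so no single line in (x_1, x_2) separates them from the band of negatives, but a plain threshold works on the transformed feature z = |x_2 − x_1|:

      # Made-up points: label 1 iff the two coordinates differ by more than 1.
      data = [((0.0, 0.1), 0), ((1.0, 1.2), 0), ((2.0, 1.9), 0),
              ((0.0, 2.0), 1), ((2.0, 0.0), 1), ((0.5, 3.0), 1)]

      def transform(x):
          x1, x2 = x
          return abs(x2 - x1)  # new one-dimensional feature z

      # In the transformed space, the threshold z > 1 classifies everything correctly:
      print(all((transform(x) > 1.0) == bool(y) for x, y in data))  # True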

  34. Linear classifiers: f(x) = w_0 + w·x
  [Plot: a linear decision boundary f(x) = 0, with f(x) > 0 on one side and f(x) < 0 on the other]
  Linear classifiers are defined over vector spaces.
  Every hypothesis f(x) is a hyperplane: f(x) = w_0 + w·x.
  The set of points where f(x) = 0 is called the decision boundary.
  Assign ŷ = +1 to all x where f(x) > 0.
  Assign ŷ = -1 to all x where f(x) < 0.
  That is, ŷ = sgn(f(x)).
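  A minimal sketch of this decision rule in code (the weights are arbitrary, chosen only for illustration):

      def f(w0, w, x):
          """Linear hypothesis f(x) = w0 + w·x."""
          return w0 + sum(wi * xi for wi, xi in zip(w, x))

      def predict(w0, w, x):
          """y_hat = sgn(f(x)), mapped to {+1, -1}."""
          return 1 if f(w0, w, x) > 0 else -1

      w0, w = -1.0, [2.0, 1.0]
      print(predict(w0, w, [1.0, 1.0]))  # f = 2.0 > 0  -> +1
      print(predict(w0, w, [0.0, 0.0]))  # f = -1.0 < 0 -> -1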

  35. y·f(x) > 0: Correct classification
  An example (x, y) is correctly classified by f(x) if and only if y·f(x) > 0:
  Case 1 (y = +1 = ŷ): f(x) > 0 ⇒ y·f(x) > 0
  Case 2 (y = -1 = ŷ): f(x) < 0 ⇒ y·f(x) > 0
  Case 3 (y = +1 ≠ ŷ = -1): f(x) < 0 ⇒ y·f(x) < 0
  Case 4 (y = -1 ≠ ŷ = +1): f(x) > 0 ⇒ y·f(x) < 0
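  The same margin test, stated in code (self-contained, with the same arbitrary weights as in the previous sketch):

      w0, w = -1.0, [2.0, 1.0]
      f = lambda x: w0 + sum(wi * xi for wi, xi in zip(w, x))

      # (x, y) is correctly classified iff y * f(x) > 0:
      for x, y in [([1.0, 1.0], 1), ([0.0, 0.0], -1), ([1.0, 1.0], -1)]:
          print(y * f(x) > 0)  # True, True, False (the last example is misclassified)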

  36. With a separate bias term w_0: f(x) = w·x + w_0
  The instance space X is a d-dimensional vector space (each x ∈ X has d elements).
  The decision boundary f(x) = 0 is a (d−1)-dimensional hyperplane in the instance space.
  The weight vector w is orthogonal (normal) to the decision boundary: for any two points x_A and x_B on the decision boundary, f(x_A) = f(x_B) = 0, so for any vector (x_B − x_A) along the boundary, w·(x_B − x_A) = (f(x_B) − w_0) − (f(x_A) − w_0) = 0.
  The bias term w_0 determines the distance of the decision boundary from the origin: for x with f(x) = 0 we have w·x = −w_0, so the distance of the boundary to the origin is −w_0/‖w‖, where ‖w‖ = √(∑_{i=1}^{d} w_i²).
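  Both geometric facts can be verified numerically; a small sketch with arbitrary weights:

      import math

      w0, w = -2.0, [3.0, 4.0]
      norm = math.sqrt(sum(wi * wi for wi in w))  # ||w|| = 5.0

      # Two points on the decision boundary f(x) = w0 + w·x = 0:
      xA = [2.0 / 3.0, 0.0]  # 3*(2/3) + 4*0 - 2 = 0
      xB = [0.0, 0.5]        # 3*0 + 4*(1/2) - 2 = 0

      # w is orthogonal to the vector (xB - xA) along the boundary:
      diff = [b - a for a, b in zip(xA, xB)]
      print(sum(wi * di for wi, di in zip(w, diff)))  # 0.0

      # Distance of the decision boundary from the origin: -w0/||w||
      print(-w0 / norm)  # 0.4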

  37. With a separate bias term w_0: f(x) = w·x + w_0
  [Diagram: the decision boundary f(x) = 0 with the weight vector w normal to it; an arbitrary point x lies at distance f(x)/‖w‖ from the decision boundary, and the boundary lies at distance −w_0/‖w‖ from the origin]

  38. Canonical representation: getting rid of the bias term
  With w = (w_1, …, w_N)^T and x = (x_1, …, x_N)^T:
  f(x) = w_0 + w·x = w_0 + ∑_{i=1…N} w_i x_i
  w_0 is called the bias term.
  The canonical representation redefines w and x as w = (w_0, w_1, …, w_N)^T and x = (1, x_1, …, x_N)^T, so that f(x) = w·x.
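  A sketch of the canonical trick in code (arbitrary example values):

      def f_canonical(w0, w, x):
          """Absorb the bias: prepend w0 to w and a constant 1 to x."""
          w_c = [w0] + list(w)   # w = (w0, w1, ..., wN)
          x_c = [1.0] + list(x)  # x = (1, x1, ..., xN)
          return sum(wi * xi for wi, xi in zip(w_c, x_c))

      w0, w, x = -1.0, [2.0, 1.0], [1.0, 1.0]
      print(f_canonical(w0, w, x))                      # 2.0
      print(w0 + sum(wi * xi for wi, xi in zip(w, x)))  # 2.0 (same value)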
