Part 2: Generalized output representations and structure

  1. Part 2: Generalized output representations and structure. Dale Schuurmans, University of Alberta

  2. Output transformation

  3. Output transformation
     What if the targets y are special? E.g. what if y is nonnegative (y ≥ 0), a probability (y ∈ [0, 1]), or a class indicator (y ∈ {±1})?
     We would like the predictions ŷ to respect the same constraints; this cannot be done with linear predictors.
     Consider a new extension: a nonlinear output transformation f such that range(f) = Y.
     Notation and terminology: ẑ = x′w is the "pre-prediction"; ŷ = f(ẑ) is the "post-prediction".

  4. Nonlinear output transformation: Examples
     Exponential: if y ≥ 0, use ŷ = f(ẑ) = exp(ẑ)  [plot of exp(x)]
     Sigmoid: if y ∈ [0, 1], use ŷ = f(ẑ) = 1/(1 + exp(−ẑ))  [plot of 1/(1+exp(−x))]
     Sign: if y ∈ {±1}, use ŷ = f(ẑ) = sign(ẑ)  [plot of sign(x)]
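
A minimal sketch of these three transfer functions applied to a pre-prediction ẑ = x′w; the function names and toy vectors are mine, and numpy is assumed.

```python
import numpy as np

def exp_transfer(z_hat):      # for targets y >= 0
    return np.exp(z_hat)

def sigmoid_transfer(z_hat):  # for targets y in [0, 1]
    return 1.0 / (1.0 + np.exp(-z_hat))

def sign_transfer(z_hat):     # for targets y in {+1, -1}; note np.sign(0) = 0
    return np.sign(z_hat)

x = np.array([1.0, -2.0, 0.5])
w = np.array([0.3, 0.1, -0.4])
z_hat = x @ w                 # pre-prediction z_hat = x'w
print(exp_transfer(z_hat), sigmoid_transfer(z_hat), sign_transfer(z_hat))
```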

  5. Nonlinear output transformation: Risk
     Combining an arbitrary f with L can create local minima.
     E.g. L(ŷ; y) = (ŷ − y)²,  f(ẑ) = σ(ẑ) = (1 + exp(−ẑ))⁻¹
     The objective Σ_i (σ(X_{i:}w) − y_i)² is not convex in w.
     Consider one training example (Auer et al., NIPS-95). Local minima can combine.  [plot of the error E versus wx for a single example]
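
A quick numerical illustration of the non-convexity, using a toy single example of my own (x = 1, target y = 0): the composed objective violates the midpoint convexity inequality.

```python
import numpy as np

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
g = lambda w: (sigma(w * 1.0) - 0.0) ** 2   # one example: x = 1, target y = 0

w1, w2 = 0.0, 10.0
mid = 0.5 * (w1 + w2)
# For a convex g we would need g(mid) <= (g(w1) + g(w2)) / 2; here it fails.
print(g(mid), 0.5 * (g(w1) + g(w2)))        # ~0.987 vs ~0.625
```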

  6. Nonlinear output transformation
     It is possible to create exponentially many local minima: t training examples can create (t/n)^n local minima in n dimensions (locate t/n training examples along each dimension).
     [surface plot of the objective over (log w1, log w2), from (Auer et al., NIPS-95)]

  7. Important idea: matching loss
     Assume f is continuous, differentiable, and strictly increasing.
     Want to define L(ŷ; y) so that L(f(ẑ); y) is convex in ẑ.
     Define the matching loss by
       L(f(ẑ); f(z)) = ∫_z^ẑ (f(θ) − f(z)) dθ
                     = [F(θ) − f(z)θ] evaluated from θ = z to θ = ẑ
                     = F(ẑ) − F(z) − f(z)(ẑ − z)
     where F′(z) = f(z); this defines a Bregman divergence.
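
A minimal sketch of this construction, evaluating the integral numerically for the sigmoid transfer; the helper name matching_loss is mine and SciPy's quad is assumed for the integration.

```python
import numpy as np
from scipy.integrate import quad

def matching_loss(f, z_hat, z):
    """Matching loss via its integral form: integral of f(theta) - f(z) from z to z_hat."""
    value, _ = quad(lambda theta: f(theta) - f(z), z, z_hat)
    return value

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
print(matching_loss(sigmoid, 2.0, -1.0))   # positive when z_hat != z
print(matching_loss(sigmoid, -1.0, -1.0))  # exactly 0 when z_hat == z
```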

  8. Important idea: matching loss
     Properties:
     F″(z) = f′(z) > 0 since f is strictly increasing
     ⇒ F is strictly convex
     ⇒ F(ẑ) ≥ F(z) + f(z)(ẑ − z)  (a convex function lies above its tangent)
     ⇒ L(f(ẑ); f(z)) ≥ 0, and L(f(ẑ); f(z)) = 0 iff ẑ = z

  9. Matching loss: examples
     Identity transfer: f(z) = z, F(z) = z²/2, y = f(z) = z
       Get squared error: L(ŷ; y) = (ŷ − y)²/2
     Exponential transfer: f(z) = e^z, F(z) = e^z, y = f(z) = e^z
       Get unnormalized entropy error: L(ŷ; y) = y ln(y/ŷ) + ŷ − y
     Sigmoid transfer: f(z) = σ(z) = 1/(1 + e^(−z)), F(z) = ln(1 + e^z), y = f(z) = σ(z)
       Get cross entropy error: L(ŷ; y) = y ln(y/ŷ) + (1 − y) ln((1 − y)/(1 − ŷ))
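
These closed forms can be written out directly; the sketch below (helper names and eps guards are mine) also checks the exponential and sigmoid cases against the Bregman form F(ẑ) − F(z) − f(z)(ẑ − z).

```python
import numpy as np

def squared_error(y_hat, y):                    # identity transfer
    return 0.5 * (y_hat - y) ** 2

def unnormalized_entropy(y_hat, y, eps=1e-12):  # exponential transfer
    return y * np.log((y + eps) / (y_hat + eps)) + y_hat - y

def cross_entropy(y_hat, y, eps=1e-12):         # sigmoid transfer
    return (y * np.log((y + eps) / (y_hat + eps))
            + (1.0 - y) * np.log((1.0 - y + eps) / (1.0 - y_hat + eps)))

# consistency check against the Bregman form F(z_hat) - F(z) - f(z)(z_hat - z)
z_hat, z = 1.3, -0.4
sigma = lambda t: 1.0 / (1.0 + np.exp(-t))
print(cross_entropy(sigma(z_hat), sigma(z)),
      np.log1p(np.exp(z_hat)) - np.log1p(np.exp(z)) - sigma(z) * (z_hat - z))
print(unnormalized_entropy(np.exp(z_hat), np.exp(z)),
      np.exp(z_hat) - np.exp(z) - np.exp(z) * (z_hat - z))
```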

  10. Matching loss
      Given a suitable f, can derive a matching loss that ensures convexity of L(f(Xw); y).
      Retain everything from before:
      • efficient training
      • basis expansions
      • L2² regularization → kernels
      • L1 regularization → sparsity

  11. Major problem remains: Classification
      If, say, y ∈ {±1} is a class indicator, use ŷ = sign(ẑ)  [plot of sign(x)]
      sign is not continuous, differentiable, or strictly increasing, so the matching loss construction cannot be used.
      Misclassification error: L(ŷ; y) = 1(ŷ ≠ y), i.e. 0 if ŷ = y and 1 if ŷ ≠ y

  12. Classification

  13. Classification
      Consider the geometry of linear classifiers: ŷ = sign(x′w), with weight vector w (normal to the boundary) and decision boundary {x : x′w = 0}.
      Linear classifiers with offset: ŷ = sign(x′w − b), with decision boundary {x : x′w − b = 0}.
      The point u = (b/‖w‖²) w lies on the boundary, since u′w = b and hence u′w − b = 0.
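
A quick numpy check, with toy numbers of my own, that u = (b/‖w‖²) w satisfies u′w − b = 0.

```python
import numpy as np

w = np.array([2.0, -1.0, 0.5])
b = 3.0
u = (b / np.dot(w, w)) * w         # u = (b / ||w||^2) w
print(np.isclose(u @ w - b, 0.0))  # True: u lies on the decision boundary
```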

  14. Classification
      Question: Given training data X, y ∈ {±1}^t, can the minimum misclassification error w be computed efficiently?
      Answer: Depends.

  15. Classification
      Good news: Yes, if the data is linearly separable.
      Linear program:  min_{w,b,ξ} 1′ξ  subject to  Δ(y)(Xw − 1b) ≥ 1 − ξ,  ξ ≥ 0
      Returns ξ = 0 if the data is linearly separable; returns some ξ_i > 0 if the data is not linearly separable.
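
A sketch of this linear program using scipy.optimize.linprog (assuming SciPy's HiGHS backend); the helper name min_slack_lp and the toy data set are mine. Variables are stacked as [w, b, ξ] and each margin constraint is rewritten in ≤ form.

```python
import numpy as np
from scipy.optimize import linprog

def min_slack_lp(X, y):
    """Solve min_{w,b,xi} 1'xi  s.t.  y_i (x_i'w - b) >= 1 - xi_i,  xi >= 0."""
    t, n = X.shape
    c = np.concatenate([np.zeros(n + 1), np.ones(t)])          # variables [w, b, xi]
    # each constraint rewritten as  -y_i x_i'w + y_i b - xi_i <= -1
    A_ub = np.hstack([-y[:, None] * X, y[:, None], -np.eye(t)])
    b_ub = -np.ones(t)
    bounds = [(None, None)] * (n + 1) + [(0.0, None)] * t      # w, b free; xi >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:n], res.x[n], res.x[n + 1:]

# toy linearly separable data: the total slack should come back (numerically) zero
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b, xi = min_slack_lp(X, y)
print(xi.sum())
```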

  16. Classification
      Bad news: No, if the data is not linearly separable.
      It is NP-hard to solve  min_w Σ_i 1(sign(X_{i:}w − b) ≠ y_i)  in general, and NP-hard even to approximate (Höffgen et al., 1995).

  17. How to bypass intractability of learning linear classifiers?
      Two standard approaches:
      1. Use a matching loss to approximate sign (e.g. a tanh transfer)  [plot of a tanh transfer]
      2. Use a surrogate loss for training, sign for test

  18. Approximating classification with a surrogate loss
      Idea: use a different loss L̃ for training than the loss L used for testing.
      Example: train on L̃(ŷ; y) = (ŷ − y)² even though testing uses L(ŷ; y) = 1(ŷ ≠ y).
      Obvious weakness: regression losses like least squares penalize predictions that are "too correct".
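
A two-line illustration of that weakness, with numbers chosen by me: a confidently correct pre-prediction is penalized by the squared surrogate but not by the misclassification loss.

```python
import numpy as np

y, z_hat = 1.0, 5.0                     # sign(z_hat) == y: classified correctly
print((z_hat - y) ** 2)                 # 16.0 under the squared surrogate
print(0 if np.sign(z_hat) == y else 1)  # 0 under the misclassification loss
```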

  19. Tailored surrogate losses for classification: Margin losses
      For a given target y and pre-prediction ẑ:
      Definition: the prediction margin is m = ẑy.
      Note: if ẑy = m > 0 then sign(ẑ) = y, giving zero misclassification; if ẑy = m ≤ 0 then sign(ẑ) ≠ y, giving misclassification error 1.
      Definition: a margin loss is a decreasing (nonincreasing) function of the margin.

  20. Margin losses
      Exponential margin loss: L̃(ẑ; y) = e^(−ẑy)
      Binomial deviance: L̃(ẑ; y) = ln(1 + e^(−ẑy))

  21. Margin losses
      Hinge loss (support vector machines): L̃(ẑ; y) = (1 − ẑy)₊ = max(0, 1 − ẑy)
      Robust hinge loss (intractable training): L̃(ẑ; y) = 1 − tanh(ẑy)
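
The four margin losses from slides 20 and 21, written as functions of the margin m = ẑy; a minimal numpy sketch with names of my own choosing.

```python
import numpy as np

exp_loss     = lambda m: np.exp(-m)                # exponential margin loss
binom_dev    = lambda m: np.log1p(np.exp(-m))      # binomial deviance
hinge        = lambda m: np.maximum(0.0, 1.0 - m)  # hinge (SVM)
robust_hinge = lambda m: 1.0 - np.tanh(m)          # robust hinge (not convex)

m = np.array([-1.0, 0.0, 0.5, 2.0])                # margins m = z_hat * y
for name, loss in [("exp", exp_loss), ("deviance", binom_dev),
                   ("hinge", hinge), ("robust hinge", robust_hinge)]:
    print(name, loss(m))
```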

  22. Margin losses
      Note: a convex margin loss can provide efficient upper-bound minimization for misclassification error.
      Retain all previous extensions:
      • efficient training
      • basis expansion
      • L2² regularization → kernels
      • L1 regularization → sparsity

  23. Multivariate prediction

  24. Multivariate prediction
      What if the prediction targets y′ are vectors? For linear predictors, use a weight matrix W.
      Given an input x′ (1 × n), predict a vector ŷ′ = x′W (1 × k), where W is n × k.
      On the training data, get the prediction matrix Ŷ = XW (t × k), where X is t × n.
      W_{:j} is the weight vector for the j-th output column; W_{i:} is the vector of weights applied to the i-th feature.
      Try to approximate the target matrix Y.

  25. Multivariate linear prediction
      Need to define a loss function between vectors, e.g. L(ŷ; y) = Σ_ℓ (ŷ_ℓ − y_ℓ)².
      Given X, Y, compute
        min_W Σ_{i=1}^t L(X_{i:}W; Y_{i:}) = min_W L(XW; Y)
      Note: using the shorthand L(XW; Y) = Σ_{i=1}^t L(X_{i:}W; Y_{i:}).
      Feature expansion X ↦ Φ:
      • Doesn't change anything, can still solve the same way as before
      • Will just use X and Φ interchangeably from now on
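
A shape-level sketch of the setup above, using squared error summed over rows purely as a concrete choice of L; the dimensions t, n, k and the random data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
t, n, k = 6, 4, 3
X = rng.normal(size=(t, n))        # t x n inputs
W = rng.normal(size=(n, k))        # n x k weights; column W[:, j] predicts output j
Y = rng.normal(size=(t, k))        # t x k targets

Y_hat = X @ W                      # t x k prediction matrix
loss = ((Y_hat - Y) ** 2).sum()    # squared loss summed over rows (and outputs)
print(Y_hat.shape, loss)
```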

  26. Multivariate prediction
      Can recover all previous developments:
      • efficient training
      • feature expansion
      • L2² regularization → kernels
      • L1 regularization → sparsity
      • output transformations
      • matching loss
      • classification (surrogate margin loss)

  27. L2² regularization: kernels
      min_W L(XW; Y) + (β/2) tr(W′W)
      Still get a representer theorem: the solution satisfies W* = X′A* for some A*.
      Therefore still get kernels:
        min_W L(XW; Y) + (β/2) tr(W′W)
          = min_A L(XX′A; Y) + (β/2) tr(A′XX′A)
          = min_A L(KA; Y) + (β/2) tr(A′KA)
      Note: we are actually regularizing with a matrix norm, the Frobenius norm:
        ‖W‖²_F = Σ_ij W_ij² = tr(W′W),  i.e.  ‖W‖_F = √(Σ_ij W_ij²) = √(tr(W′W))
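
A minimal numerical check of the representer claim, restricted to the squared-error special case where the solution has a closed form (the general L is not handled here): it verifies that the primal solution equals X′A with A = (K + βI)⁻¹Y.

```python
import numpy as np

rng = np.random.default_rng(0)
t, n, k, beta = 8, 5, 3, 0.7
X = rng.normal(size=(t, n))
Y = rng.normal(size=(t, k))

# primal solution of min_W ||XW - Y||_F^2 + beta * tr(W'W)
W_primal = np.linalg.solve(X.T @ X + beta * np.eye(n), X.T @ Y)
# dual / kernel form: W = X'A with A = (K + beta I)^{-1} Y, K = XX'
K = X @ X.T
A = np.linalg.solve(K + beta * np.eye(t), Y)
print(np.allclose(W_primal, X.T @ A))   # True: W* lies in the row space of X
```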

  28. Brief background: Recall matrix trace
      Definition: for a square matrix A, tr(A) = Σ_i A_ii
      Properties:
      tr(A) = tr(A′)
      tr(aA) = a tr(A)
      tr(A + B) = tr(A) + tr(B)
      tr(A′B) = tr(B′A) = Σ_ij A_ij B_ij
      tr(A′A) = tr(AA′) = Σ_ij A_ij²
      tr(ABC) = tr(CAB) = tr(BCA)
      d/dW tr(C′W) = C
      d/dW tr(W′AW) = (A + A′)W
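
A quick numerical spot-check of a few of these identities with numpy, on random matrices of my choosing.

```python
import numpy as np

rng = np.random.default_rng(0)
A, B, C = (rng.normal(size=(4, 4)) for _ in range(3))

print(np.isclose(np.trace(A), np.trace(A.T)))                # tr(A) = tr(A')
print(np.isclose(np.trace(A.T @ B), (A * B).sum()))          # tr(A'B) = sum_ij A_ij B_ij
print(np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B)))  # cyclic property
```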

  29. L1 regularization: sparsity?
      We want sparsity in the rows of W, not the columns (that is, we want feature selection, not output selection).
      To achieve our goal we need to select the right regularizer. Consider the following matrix norms:
      L1 norm: ‖W‖₁ = max_j Σ_i |W_ij|
      L∞ norm: ‖W‖_∞ = max_i Σ_j |W_ij|
      L2 norm: ‖W‖₂ = σ_max(W) (maximum singular value)
      trace norm: ‖W‖_tr = Σ_j σ_j(W) (sum of singular values)
      2,1 block norm: ‖W‖_{2,1} = Σ_i ‖W_{i:}‖
      Frobenius norm: ‖W‖_F = √(Σ_ij W_ij²) = √(Σ_j σ_j(W)²)
      Which, if any, of these yield the desired sparsity structure?
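
A small numpy sketch computing each of these norms for a random W; it does not answer the closing question, it only makes the definitions concrete. Variable names are my own.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))
sv = np.linalg.svd(W, compute_uv=False)        # singular values

l1_norm    = np.abs(W).sum(axis=0).max()       # max column absolute sum
linf_norm  = np.abs(W).sum(axis=1).max()       # max row absolute sum
l2_norm    = sv.max()                          # largest singular value
trace_norm = sv.sum()                          # sum of singular values
block_21   = np.linalg.norm(W, axis=1).sum()   # sum of row 2-norms
frobenius  = np.linalg.norm(W)                 # sqrt of sum of squared entries
print(l1_norm, linf_norm, l2_norm, trace_norm, block_21, frobenius)
```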
