
Stochastic Gradient Descent, 10-701 Recitation 3, Mu Li (PowerPoint PPT Presentation)



  1. Stochastic Gradient Descent
  10-701 Recitation 3
  Mu Li
  Computer Science Department, Carnegie Mellon University
  February 5, 2013

  2. The problem
  ◮ A typical machine learning problem has a penalty/regularizer + loss form:
    $$\min_w F(w) = g(w) + \frac{1}{n} \sum_{i=1}^n f(w; y_i, x_i),$$
    where $x_i, w \in \mathbb{R}^p$, $y_i \in \mathbb{R}$, and both $g$ and $f$ are convex
  ◮ Today we only consider differentiable $f$, and let $g = 0$ for simplicity
  ◮ For example, let $f(w; y_i, x_i) = -\log p(y_i \mid x_i, w)$; then we are trying to maximize the log likelihood, which is
    $$\max_w \frac{1}{n} \sum_{i=1}^n \log p(y_i \mid x_i, w)$$
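As a concrete instance of this form (a sketch of my own, not from the slides; the function names are illustrative), logistic regression takes $f(w; y_i, x_i) = -\log p(y_i \mid x_i, w)$ with $p(y=1 \mid x, w) = \sigma(\langle w, x \rangle)$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_lik(w, X, y):
    """Empirical loss (1/n) * sum_i -log p(y_i | x_i, w) for labels y in {0, 1}."""
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Tiny example: two linearly separable points.
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 0.0])
w = np.array([2.0, 0.0])
print(neg_log_lik(w, X, y))  # small: this w classifies both points correctly
```

At $w = 0$ the loss is $\log 2$ regardless of the data, so any $w$ that separates the points does better.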

  3. Gradient Descent
  ◮ choose initial $w^{(0)}$, repeat
    $$w^{(t+1)} = w^{(t)} - \eta_t \cdot \nabla F(w^{(t)})$$
    until stop (a two-dimensional example is shown on the slide)
  ◮ $\eta_t$ is the learning rate, and $\nabla F(w^{(t)}) = \frac{1}{n} \sum_i \nabla_w f(w^{(t)}; y_i, x_i)$
  ◮ How to stop? $\|w^{(t+1)} - w^{(t)}\| \le \epsilon$ or $\|\nabla F(w^{(t)})\| \le \epsilon$
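The loop above can be sketched as follows (a minimal NumPy sketch of my own; the toy objective $F(w) = \tfrac{1}{2}\|w\|^2$, whose gradient is $w$, and the function names are illustrative):

```python
import numpy as np

def gradient_descent(grad_F, w0, eta=0.1, eps=1e-6, max_iter=10_000):
    """Iterate w <- w - eta * grad_F(w) until ||grad_F(w)|| <= eps."""
    w = np.asarray(w0, dtype=float)
    for t in range(max_iter):
        g = grad_F(w)
        if np.linalg.norm(g) <= eps:   # stopping criterion from the slide
            break
        w = w - eta * g
    return w

# Toy objective F(w) = 0.5 * ||w||^2; gradient is w, minimizer is the origin.
w_star = gradient_descent(lambda w: w, w0=[4.0, -2.0])
print(w_star)  # very close to [0, 0]
```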

  4. Learning rate matters
  [Figure: two panels after 100 iterations, one with a too-small $\eta_t$ (still not converged) and one with a too-big $\eta_t$.]

  5. Backtracking line search
  Adaptively choose the learning rate:
  ◮ choose a parameter $0 < \beta < 1$
  ◮ start with $\eta = 1$; repeat for $t = 0, 1, \ldots$
  ◮ while $L(w^{(t)} - \eta \nabla L(w^{(t)})) > L(w^{(t)}) - \frac{\eta}{2} \|\nabla L(w^{(t)})\|^2$, update $\eta = \beta\eta$
  ◮ $w^{(t+1)} = w^{(t)} - \eta \nabla L(w^{(t)})$
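One step of this procedure might look like the following (my own sketch; the toy objective $L(w) = \tfrac{1}{2}\|w\|^2$ and the function name `backtracking_step` are illustrative):

```python
import numpy as np

def backtracking_step(L, grad_L, w, beta=0.8):
    """One GD step with backtracking: shrink eta by beta until the
    sufficient-decrease test L(w - eta*g) <= L(w) - (eta/2)*||g||^2 passes."""
    g = grad_L(w)
    eta = 1.0
    while L(w - eta * g) > L(w) - 0.5 * eta * np.dot(g, g):
        eta *= beta
    return w - eta * g, eta

# Toy objective L(w) = 0.5 * ||w||^2 (gradient is w); eta = 1 already passes
# the test here, and the step lands exactly on the minimizer.
w = np.array([3.0, 4.0])
w_next, eta = backtracking_step(lambda v: 0.5 * np.dot(v, v), lambda v: v, w)
```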

  6. Backtracking line search
  A typical choice is $\beta = 0.8$; converged after 13 iterations (figure on slide).

  7. Stochastic Gradient Descent
  ◮ We call $\frac{1}{n} \sum_i f(w; y_i, x_i)$ the empirical loss; the thing we hope to minimize is the expected loss $f(w) = \mathbb{E}_{y_i, x_i} f(w; y_i, x_i)$
  ◮ Suppose we receive an infinite stream of samples $(y_t, x_t)$ from the distribution; one way to optimize the objective is
    $$w^{(t+1)} = w^{(t)} - \eta_t \nabla_w f(w^{(t)}; y_t, x_t)$$
  ◮ In practice, we simulate the stream by randomly picking $(y_t, x_t)$ from the samples we have
  ◮ Compare this with the average gradient of GD, $\frac{1}{n} \sum_i \nabla_w f(w^{(t)}; y_i, x_i)$
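The update rule above can be sketched for least squares (my own example, not from the slides): with $f(w; y_i, x_i) = \tfrac{1}{2}(\langle w, x_i \rangle - y_i)^2$, the per-sample gradient is $(\langle w, x_i \rangle - y_i)\,x_i$, and drawing a random index simulates the sample stream:

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd(grad_f, X, y, w0, eta=0.05, n_steps=2000):
    """SGD: at each step pick one random sample and follow its gradient."""
    w = np.asarray(w0, dtype=float)
    n = len(y)
    for t in range(n_steps):
        i = rng.integers(n)                       # simulate the sample stream
        w = w - eta * grad_f(w, X[i], y[i])
    return w

# Noiseless data generated with true weights [2, -1].
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0])
w_hat = sgd(lambda w, x, yi: (x @ w - yi) * x, X, y, w0=np.zeros(2))
print(w_hat)  # approximately [2, -1]
```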

  8. More about SGD
  ◮ the objective does not always decrease at each step
  ◮ compared to GD, SGD needs more steps, but each step is cheaper
  ◮ mini-batches, e.g. picking 100 samples and averaging their gradients, may accelerate convergence
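The mini-batch variant is a small change to the single-sample loop (again my own least-squares sketch; the batch size 100 follows the slide, everything else is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def minibatch_sgd(X, y, w0, eta=0.1, batch=100, n_steps=200):
    """Each step averages the per-sample gradient over a random mini-batch;
    the averaged gradient has lower variance than a single-sample one."""
    w = np.asarray(w0, dtype=float)
    n = len(y)
    for t in range(n_steps):
        idx = rng.integers(n, size=batch)
        # Mean of (<w, x_i> - y_i) * x_i over the batch (least-squares loss).
        r = X[idx] @ w - y[idx]
        w = w - eta * (X[idx].T @ r) / batch
    return w

# Noiseless data generated with true weights [2, -1].
X = rng.normal(size=(1000, 2))
y = X @ np.array([2.0, -1.0])
w_hat = minibatch_sgd(X, y, w0=np.zeros(2))
```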

  9. Relation to Perceptron
  ◮ Recall Perceptron: initialize $w$, repeat
    $$w = w + \begin{cases} y_i x_i & \text{if } y_i \langle w, x_i \rangle < 0 \\ 0 & \text{otherwise} \end{cases}$$
  ◮ Fix the learning rate $\eta = 1$ and let $f(w; y_i, x_i) = \max(0, -y_i \langle w, x_i \rangle)$; then
    $$\nabla_w f(w; y_i, x_i) = \begin{cases} -y_i x_i & \text{if } y_i \langle w, x_i \rangle < 0 \\ 0 & \text{otherwise} \end{cases}$$
    so we derive Perceptron from SGD
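The equivalence above can be written out in a few lines (my own sketch; the function name is illustrative):

```python
import numpy as np

def perceptron_sgd_step(w, x, y):
    """One SGD step with eta = 1 on f(w; y, x) = max(0, -y * <w, x>):
    the (sub)gradient is -y*x when y * <w, x> < 0 and 0 otherwise, so
    w <- w - grad is exactly the Perceptron update w <- w + y*x on mistakes."""
    if y * (w @ x) < 0:
        w = w + y * x
    return w

# This sample is misclassified (y * <w, x> = -1 < 0), so the update fires.
w = np.array([1.0, -1.0])
x = np.array([1.0, 0.0])
w = perceptron_sgd_step(w, x, y=-1.0)
print(w)  # [0., -1.]
```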

  10. Questions?
