
CS 4100: Artificial Intelligence - Optimization and Neural Nets



  1. CS 4100: Artificial Intelligence - Optimization and Neural Nets. Jan-Willem van de Meent, Northeastern University. [These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]
     Linear Classifiers
     • Inputs are feature values f_1, f_2, f_3, ...
     • Each feature has a weight w_1, w_2, w_3, ...
     • The sum is the activation: activation(x) = Σ_i w_i · f_i(x) = w · f(x)
     • If the activation is:
       • Positive, output +1
       • Negative, output -1
     [Figure: perceptron diagram with features f_1, f_2, f_3, weights w_1, w_2, w_3, a summation node Σ, and a threshold test > 0?]
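To make the slide concrete, here is a minimal sketch of a linear classifier in Python/NumPy; the weights and feature values are made up for illustration and are not from the slides.

```python
import numpy as np

def predict(w, f):
    """Linear classifier: the activation is the weighted sum w . f;
    output +1 if it is positive, otherwise -1."""
    activation = np.dot(w, f)
    return 1 if activation > 0 else -1

# Illustrative (made-up) weights and feature values
w = np.array([0.5, -1.0, 2.0])
f = np.array([1.0, 0.3, 0.2])
print(predict(w, f))  # activation = 0.5 - 0.3 + 0.4 = 0.6 > 0, so output is +1
```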

  2. How to get probabilistic decisions?
     • Activation: z = w · f(x)
     • If z = w · f(x) is very positive: we want the probability to go to 1
     • If z = w · f(x) is very negative: we want the probability to go to 0
     • Sigmoid function: φ(z) = 1 / (1 + e^(-z))
     Best w?
     • Maximum likelihood estimation: max_w ll(w) = max_w Σ_i log P(y^(i) | x^(i); w)
       with: P(y^(i) = +1 | x^(i); w) = 1 / (1 + e^(-w · f(x^(i))))
             P(y^(i) = -1 | x^(i); w) = 1 - 1 / (1 + e^(-w · f(x^(i))))
     • This is called Logistic Regression
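A minimal sketch (assumed Python/NumPy, made-up data) of the sigmoid and the binary log-likelihood objective defined on this slide; it uses the identity P(y | x; w) = φ(y · (w · f(x))) for labels y in {+1, -1}.

```python
import numpy as np

def sigmoid(z):
    """phi(z) = 1 / (1 + e^(-z))"""
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, features, labels):
    """Sum over examples of log P(y^(i) | x^(i); w) for labels in {+1, -1}:
    P(+1 | x; w) = sigmoid(w . f(x)) and P(-1 | x; w) = sigmoid(-w . f(x))."""
    z = features @ w                               # one activation per example
    return np.sum(np.log(sigmoid(labels * z)))

# Illustrative (made-up) data: 3 examples with 2 features each
features = np.array([[1.0, 2.0], [0.5, -1.0], [-2.0, 0.3]])
labels = np.array([+1, -1, -1])
print(log_likelihood(np.zeros(2), features, labels))  # 3 * log(0.5) when w = 0
```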

  3. Multiclass Logistic Regression
     • Recall the perceptron:
       • A weight vector w_y for each class
       • Score (activation) of a class y: w_y · f(x)
       • The prediction with the highest score wins
     • How to turn scores into probabilities? Apply the softmax:
       z_1, z_2, z_3  →  e^(z_1) / (e^(z_1) + e^(z_2) + e^(z_3)),  e^(z_2) / (e^(z_1) + e^(z_2) + e^(z_3)),  e^(z_3) / (e^(z_1) + e^(z_2) + e^(z_3))
       (original activations → softmax activations)
     Best w?
     • Maximum likelihood estimation: max_w ll(w) = max_w Σ_i log P(y^(i) | x^(i); w)
       with: P(y^(i) | x^(i); w) = e^(w_{y^(i)} · f(x^(i))) / Σ_y e^(w_y · f(x^(i)))
     • This is called Multi-Class Logistic Regression
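A sketch of the softmax mapping and the per-class scores w_y · f(x); the weight matrix and feature vector below are made up, and subtracting the maximum score is a standard numerical-stability detail that is not on the slide.

```python
import numpy as np

def softmax(z):
    """Turn scores z into probabilities e^(z_i) / sum_j e^(z_j)."""
    z = z - np.max(z)                  # stability trick; does not change the result
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

def class_probabilities(W, f):
    """One weight vector per class (rows of W): scores z_y = w_y . f(x), then softmax."""
    return softmax(W @ f)

# Illustrative (made-up) weights for 3 classes over 2 features
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])
f = np.array([2.0, 1.0])
probs = class_probabilities(W, f)
print(probs, probs.argmax())           # the highest-probability class is the prediction
```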

  4. This Lecture
     • Optimization
     • i.e., how do we solve: max_w ll(w) = max_w Σ_i log P(y^(i) | x^(i); w)
     Hill Climbing
     • Recall from the CSPs lecture: a simple, general idea (see the sketch after this slide)
       • Start wherever
       • Repeat: move to the best neighboring state
       • If no neighbors are better than the current state, quit
     • What's particularly tricky when hill-climbing for multiclass logistic regression?
       • Optimization over a continuous space
       • Infinitely many neighbors!
       • How do we do this efficiently?
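For contrast with the continuous case, here is a minimal sketch of hill climbing over a discrete neighbor set (the toy problem and helper names are made up); in the continuous weight space of logistic regression there is no finite neighbor list, which is what the gradient-based methods on the following slides address.

```python
def hill_climb(start, neighbors, score):
    """Generic hill climbing: repeatedly move to the best neighbor until none improves."""
    current = start
    while True:
        best = max(neighbors(current), key=score)
        if score(best) <= score(current):
            return current              # no neighbor is better than the current state
        current = best

# Illustrative toy problem: maximize -(x - 3)^2 over the integers, stepping by +/- 1
print(hill_climb(0, lambda x: [x - 1, x + 1], lambda x: -(x - 3) ** 2))  # 3
```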

  5. 1-D Optimization
     • Could evaluate g(w_0 + h) and g(w_0 - h)
       • Then step in the best direction
     • Or, evaluate the derivative:
       ∂g(w_0)/∂w = lim_{h→0} [g(w_0 + h) - g(w_0 - h)] / (2h)
       • This defines the direction to step in
     2-D Optimization
     [Figure: 2-D optimization landscape. Source: offconvex.org]
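A sketch of the central-difference estimate from this slide, plugged into a simple 1-D ascent loop; the objective g, the step size, and the iteration count are assumptions made for illustration.

```python
def central_difference(g, w0, h=1e-5):
    """Numerical derivative at w0: [g(w0 + h) - g(w0 - h)] / (2h)."""
    return (g(w0 + h) - g(w0 - h)) / (2 * h)

# Illustrative (made-up) 1-D objective: g(w) = -(w - 2)^2, maximized at w = 2
g = lambda w: -(w - 2) ** 2

w, alpha = 0.0, 0.1
for _ in range(50):
    w += alpha * central_difference(g, w)   # step in the uphill direction
print(w)                                     # approaches 2.0
```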

  6. Gradient Ascent
     • Perform an update in the uphill direction for each coordinate
     • The steeper the slope (i.e., the higher the derivative), the bigger the step for that coordinate
     • E.g., consider g(w_1, w_2). Updates:
       w_1 ← w_1 + α · ∂g/∂w_1 (w_1, w_2)
       w_2 ← w_2 + α · ∂g/∂w_2 (w_1, w_2)
     • Updates in vector notation: w ← w + α · ∇_w g(w)
       with the gradient: ∇_w g(w) = [∂g/∂w_1 (w), ∂g/∂w_2 (w)]
     Gradient Ascent
     • Idea:
       • Start somewhere
       • Repeat: take a step in the gradient direction
     [Figure source: Mathworks]
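A sketch of the vector update w ← w + α · ∇_w g(w) on a made-up two-dimensional objective; the function, its gradient, the learning rate, and the iteration count are all assumptions for illustration.

```python
import numpy as np

def gradient_ascent(grad, w, alpha=0.1, iters=100):
    """Repeatedly apply w <- w + alpha * grad(w)."""
    for _ in range(iters):
        w = w + alpha * grad(w)
    return w

# Illustrative objective: g(w_1, w_2) = -(w_1 - 1)^2 - 2 * (w_2 + 3)^2, maximum at (1, -3)
grad_g = lambda w: np.array([-2 * (w[0] - 1), -4 * (w[1] + 3)])

print(gradient_ascent(grad_g, np.zeros(2)))  # close to [1, -3]
```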

  7. What is the Steepest Direction?
     • We want: max_{Δ: Δ_1^2 + Δ_2^2 ≤ ε} g(w + Δ)
     • First-order Taylor expansion: g(w + Δ) ≈ g(w) + ∂g/∂w_1 · Δ_1 + ∂g/∂w_2 · Δ_2
     • Steepest ascent direction: max_{Δ: Δ_1^2 + Δ_2^2 ≤ ε} g(w) + ∂g/∂w_1 · Δ_1 + ∂g/∂w_2 · Δ_2
     • Recall: max_{Δ: ||Δ|| ≤ ε} Δ^T a has solution Δ = ε · a / ||a||
     • Hence the solution here is Δ = ε · ∇g / ||∇g||, with ∇g = [∂g/∂w_1, ∂g/∂w_2]
       Gradient direction = steepest direction!
     Gradient in n dimensions
       ∇g = [∂g/∂w_1, ∂g/∂w_2, ..., ∂g/∂w_n]
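A small numerical check (with a made-up gradient vector) of the claim on this slide: among all steps of length ε, the step ε · ∇g / ||∇g|| maximizes the first-order gain ∇g · Δ.

```python
import numpy as np

rng = np.random.default_rng(0)

grad = np.array([2.0, -12.0])   # made-up gradient at the current point w
eps = 1e-3

# Steepest-ascent step of length eps
step_grad = eps * grad / np.linalg.norm(grad)

# The first-order gain of a step d is grad . d (from the Taylor expansion above);
# no random step of the same length does better than the gradient direction.
dirs = rng.normal(size=(1000, 2))
steps = eps * dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
best_random_gain = max(grad @ d for d in steps)

print(grad @ step_grad >= best_random_gain)   # True: grad @ step_grad = eps * ||grad||
```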

  8. Optimization Procedure: Gradient Ascent
     • initialize w
     • for iter = 1, 2, ...
       • w ← w + α · ∇g(w)
     • α is the learning rate, a hyperparameter that needs to be chosen carefully
       • How? Try multiple choices
       • Crude rule of thumb: each update should change w by about 0.1 - 1%
     Gradient Ascent on the Log Likelihood Objective
       max_w ll(w) = max_w Σ_i log P(y^(i) | x^(i); w)   (sum over all training examples)
     • initialize w
     • for iter = 1, 2, ...
       • w ← w + α · Σ_i ∇ log P(y^(i) | x^(i); w)
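A sketch of batch gradient ascent on the multiclass log-likelihood above. The toy data, learning rate, and iteration count are made up; the gradient of log P(y | x; W) with respect to a per-class weight matrix W is the standard (one-hot(y) - softmax probabilities) outer f(x).

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def log_likelihood_grad(W, f, y):
    """Gradient of log P(y | x; W) for multiclass logistic regression,
    where W has one weight vector (row) per class and z = W @ f."""
    probs = softmax(W @ f)
    one_hot = np.zeros(W.shape[0])
    one_hot[y] = 1.0
    return np.outer(one_hot - probs, f)

def batch_gradient_ascent(W, data, alpha=0.5, iters=200):
    """Each step sums the gradient over all training examples."""
    for _ in range(iters):
        W = W + alpha * sum(log_likelihood_grad(W, f, y) for f, y in data)
    return W

# Illustrative (made-up) training set: (feature vector, class label) pairs
data = [(np.array([1.0, 0.0]), 0),
        (np.array([0.0, 1.0]), 1),
        (np.array([1.0, 1.0]), 2)]
W = batch_gradient_ascent(np.zeros((3, 2)), data)
print([softmax(W @ f).argmax() for f, _ in data])  # [0, 1, 2] on this toy set
```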

  9. Stochastic Gradient Ascent on the Log Likelihood Objective
       max_w ll(w) = max_w Σ_i log P(y^(i) | x^(i); w)
     • Observation: once the gradient on one training example has been computed, we might as well incorporate it before computing the next one
     • initialize w
     • for iter = 1, 2, ...
       • pick a random example j
       • w ← w + α · ∇ log P(y^(j) | x^(j); w)
     • Intuition: use sampling to approximate the gradient. This is called stochastic gradient ascent (more commonly known as stochastic gradient descent, when minimizing a loss).
     Mini-Batch Gradient Ascent on the Log Likelihood Objective
       max_w ll(w) = max_w Σ_i log P(y^(i) | x^(i); w)
     • Observation: the gradient over a small set of training examples (= mini-batch) can be computed in parallel, so we might as well do that instead of using a single example
     • initialize w
     • for iter = 1, 2, ...
       • pick a random subset J of training examples
       • w ← w + α · Σ_{j ∈ J} ∇ log P(y^(j) | x^(j); w)
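Continuing the previous sketch, a hedged mini-batch variant; the batch size, learning rate, and iteration count are illustrative assumptions, and `log_likelihood_grad` and `data` refer to the batch sketch above. With `batch_size=1` this reduces to plain stochastic gradient ascent.

```python
import numpy as np

def minibatch_gradient_ascent(W, data, grad_fn, alpha=0.1, iters=1000, batch_size=2, seed=0):
    """Each iteration picks a random subset J of the training examples and applies
    W <- W + alpha * sum_{j in J} grad log P(y^(j) | x^(j); W)."""
    rng = np.random.default_rng(seed)
    for _ in range(iters):
        batch = rng.choice(len(data), size=batch_size, replace=False)
        W = W + alpha * sum(grad_fn(W, *data[j]) for j in batch)
    return W

# Example usage, reusing the toy data and gradient function from the batch sketch:
# W = minibatch_gradient_ascent(np.zeros((3, 2)), data, log_likelihood_grad)
```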

  10. How about computing all the derivatives?
     • We'll talk about that once we've covered neural networks, which are a generalization of logistic regression
     Neural Networks

  11. Multi-class Logistic Regression
     • Logistic regression is a special case of a neural network
     [Figure: features f_1(x), f_2(x), f_3(x), ..., f_K(x) feed through weights W_{i,j} into activations z_1, z_2, z_3, followed by a softmax]
       P(y_1 | x; w) = e^(z_1) / (e^(z_1) + e^(z_2) + e^(z_3))
       P(y_2 | x; w) = e^(z_2) / (e^(z_1) + e^(z_2) + e^(z_3))
       P(y_3 | x; w) = e^(z_3) / (e^(z_1) + e^(z_2) + e^(z_3))
       with z_i = Σ_j W_{i,j} · f_j(x)
     Deep Neural Network = Also learn the features!
     • Can we automatically learn good features f_j(x)?

  12. Deep Neural Network = Also learn the features!
     [Figure: inputs x_1, ..., x_L feed through hidden layers z^(1), z^(2), ..., z^(n-1), z^(n) into output activations z^(OUT)_1, z^(OUT)_2, z^(OUT)_3, followed by a softmax producing P(y_1 | x; w), P(y_2 | x; w), P(y_3 | x; w)]
     • Layer recursion: z_i^(k) = g( Σ_j W_{i,j}^(k-1,k) · z_j^(k-1) ), where g is a nonlinear activation function
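A minimal forward-pass sketch of the layer recursion z^(k) = g(W^(k-1,k) z^(k-1)) followed by a softmax output. The layer sizes, random weights, and the choice of ReLU for the nonlinearity g are assumptions made for illustration; the slide itself does not fix a particular activation.

```python
import numpy as np

def relu(z):
    """One common choice for the nonlinear activation function g."""
    return np.maximum(0.0, z)

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def forward(x, hidden_weights, output_weights):
    """Apply z^(k) = g(W^(k-1,k) @ z^(k-1)) layer by layer, then a softmax output."""
    z = x
    for W in hidden_weights:
        z = relu(W @ z)
    return softmax(output_weights @ z)

# Illustrative (made-up) layer sizes: 4 inputs -> 5 hidden -> 5 hidden -> 3 classes
rng = np.random.default_rng(0)
hidden_weights = [rng.normal(size=(5, 4)), rng.normal(size=(5, 5))]
output_weights = rng.normal(size=(3, 5))
print(forward(rng.normal(size=4), hidden_weights, output_weights))  # 3 probabilities summing to 1
```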

  13. Common Activation Functions
     [Figure: common activation functions. Source: MIT 6.S191, introtodeeplearning.com]
     Deep Neural Network: Also Learn the Features!
     • Training a deep neural network is just like logistic regression:
       max_w ll(w) = max_w Σ_i log P(y^(i) | x^(i); w)
       just w tends to be a much, much larger vector :)
     → just run gradient ascent + stop when the log likelihood of held-out data starts to decrease
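A schematic sketch of the early-stopping rule mentioned on this slide; the function names (`grad_fn`, `ll_fn`), the held-out split, the learning rate, and the iteration cap are all illustrative assumptions rather than anything prescribed by the slides.

```python
def train_with_early_stopping(w, grad_fn, ll_fn, train_data, holdout_data,
                              alpha=0.01, max_iters=10000):
    """Gradient ascent that stops when the held-out log likelihood starts to decrease."""
    best_holdout = ll_fn(w, holdout_data)
    for _ in range(max_iters):
        w_new = w + alpha * grad_fn(w, train_data)     # ascent step on the training data
        holdout = ll_fn(w_new, holdout_data)
        if holdout < best_holdout:
            return w                                   # held-out likelihood dropped: stop
        w, best_holdout = w_new, holdout
    return w
```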
