Midterm review, CS 446

1. Lecture review

(Lec1.) Basic setting: supervised learning
Training data: labeled examples (x_1, y_1), (x_2, y_2), ..., (x_n, y_n). 1 / 61


(Lec3-4.) Empirical risk minimization in matrix notation
Define the n × d matrix A and the n × 1 column vector b by
A := (1/\sqrt{n}) [x_1^T; ...; x_n^T],   b := (1/\sqrt{n}) [y_1; ...; y_n].
Can write the empirical risk as
\hat{R}(w) = (1/n) \sum_{i=1}^n (y_i − x_i^T w)^2 = ‖Aw − b‖_2^2.
Necessary condition for w to be a minimizer of \hat{R}: ∇\hat{R}(w) = 0, i.e., w is a critical point of \hat{R}. This translates to
(A^T A) w = A^T b,
a system of linear equations called the normal equations. In an upcoming lecture we'll prove every critical point of \hat{R} is a minimizer of \hat{R}. 9 / 61
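
A minimal numerical sketch (mine, not from the slides; all names are illustrative) of solving the normal equations with NumPy, either directly or via least squares:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 50, 3
    X = rng.normal(size=(n, d))                      # rows are the x_i
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

    A = X / np.sqrt(n)                               # A has rows x_i^T / sqrt(n)
    b = y / np.sqrt(n)

    w_normal = np.linalg.solve(A.T @ A, A.T @ b)     # normal equations (A^T A invertible here)
    w_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)  # same minimizer, computed more stably
    print(np.allclose(w_normal, w_lstsq))            # True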

(Lec3-4.) Full (factorization) SVD (new slide)
Given M ∈ R^{n×d}, let M = U S V^T denote the singular value decomposition (SVD), where
◮ U ∈ R^{n×n} is orthonormal, thus U^T U = U U^T = I,
◮ V ∈ R^{d×d} is orthonormal, thus V^T V = V V^T = I,
◮ S ∈ R^{n×d} has singular values s_1 ≥ s_2 ≥ · · · ≥ s_{min{n,d}} along the diagonal and zeros elsewhere, where the number of positive singular values equals the rank of M.
Some facts:
◮ SVD is not unique when the singular values are not distinct; e.g., we can write I = U I U^T where U is any orthonormal matrix.
◮ The pseudoinverse S^+ ∈ R^{d×n} of S is obtained by starting with S^T and taking the reciprocal of each positive entry.
◮ The pseudoinverse of M is M^+ = V S^+ U^T.
◮ If M^{−1} exists, then M^{−1} = M^+. 10 / 61
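
A small sketch (my own illustration) of building the pseudoinverse from the full SVD and comparing it with np.linalg.pinv:

    import numpy as np

    rng = np.random.default_rng(1)
    M = rng.normal(size=(5, 3))                       # full column rank with probability 1

    U, s, Vt = np.linalg.svd(M, full_matrices=True)   # U is 5x5, Vt = V^T is 3x3
    S_plus = np.zeros((3, 5))
    S_plus[:len(s), :len(s)] = np.diag(1.0 / s)       # reciprocal of each positive singular value

    M_plus = Vt.T @ S_plus @ U.T                      # M^+ = V S^+ U^T
    print(np.allclose(M_plus, np.linalg.pinv(M)))     # True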

(Lec3-4.) Thin (decomposition) SVD (new slide)
Given M ∈ R^{n×d}, (s, u, v) are a singular value with corresponding left and right singular vectors if Mv = s u and M^T u = s v.
The thin SVD of M is M = \sum_{i=1}^r s_i u_i v_i^T, where r is the rank of M, and
◮ left singular vectors (u_1, ..., u_r) are orthonormal (but we might have r < min{n, d}!) and span the column space of M,
◮ right singular vectors (v_1, ..., v_r) are orthonormal (but we might have r < min{n, d}!) and span the row space of M,
◮ singular values s_1 ≥ · · · ≥ s_r > 0.
Some facts:
◮ Pseudoinverse M^+ = \sum_{i=1}^r (1/s_i) v_i u_i^T.
◮ (u_i)_{i=1}^r span the column space of M. 11 / 61

(Lec3-4.) SVD and least squares
Recall: we'd like to find w such that A^T A w = A^T b.
If w = A^+ b, then
A^T A w = (\sum_{i=1}^r s_i v_i u_i^T)(\sum_{i=1}^r s_i u_i v_i^T)(\sum_{i=1}^r (1/s_i) v_i u_i^T) b
        = (\sum_{i=1}^r s_i v_i u_i^T)(\sum_{i=1}^r u_i u_i^T) b = A^T b.
Henceforth, define \hat{w}_{ols} = A^+ b as the OLS solution. (OLS = "ordinary least squares".)
Note: in general, A A^+ = \sum_{i=1}^r u_i u_i^T ≠ I. 12 / 61
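
A quick numerical sanity check (my own sketch) that \hat{w}_{ols} = A^+ b satisfies the normal equations even when A^T A is singular, and that it is the minimum-norm solution:

    import numpy as np

    rng = np.random.default_rng(2)
    n, d = 5, 8                                    # n < d, so A^T A is singular
    A = rng.normal(size=(n, d)) / np.sqrt(n)
    b = rng.normal(size=n)

    w_ols = np.linalg.pinv(A) @ b                  # A^+ b
    print(np.allclose(A.T @ A @ w_ols, A.T @ b))   # normal equations hold: True

    # any other solution differs by a vector in the nullspace of A and has larger norm
    v_null = np.linalg.svd(A)[2][-1]               # right singular vector with zero singular value
    w_other = w_ols + v_null
    print(np.allclose(A.T @ A @ w_other, A.T @ b),
          np.linalg.norm(w_ols) < np.linalg.norm(w_other))   # True True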

(Lec3-4.) Normal equations imply optimality
Consider w with A^T A w = A^T y, and any w′; then
‖Aw′ − y‖^2 = ‖Aw′ − Aw + Aw − y‖^2 = ‖Aw′ − Aw‖^2 + 2(Aw′ − Aw)^T (Aw − y) + ‖Aw − y‖^2.
Since (Aw′ − Aw)^T (Aw − y) = (w′ − w)^T (A^T A w − A^T y) = 0, then
‖Aw′ − y‖^2 = ‖Aw′ − Aw‖^2 + ‖Aw − y‖^2 ≥ ‖Aw − y‖^2.
This means w is optimal.
Moreover, writing A = \sum_{i=1}^r s_i u_i v_i^T,
‖Aw′ − Aw‖^2 = (w′ − w)^T (A^T A)(w′ − w) = (w′ − w)^T (\sum_{i=1}^r s_i^2 v_i v_i^T)(w′ − w),
so w′ is also optimal iff w′ − w is in the right nullspace of A.
(We'll revisit all this with convexity later.) 13 / 61

(Lec3-4.) Regularized ERM
Combine the two concerns: For a given λ ≥ 0, find a minimizer of \hat{R}(w) + λ‖w‖_2^2 over w ∈ R^d.
Fact: If λ > 0, then the solution is always unique (even if n < d)!
◮ This is called ridge regression. (λ = 0 is ERM / Ordinary Least Squares.) Explicit solution (A^T A + λI)^{−1} A^T b.
◮ Parameter λ controls how much attention is paid to the regularizer ‖w‖_2^2 relative to the data fitting term \hat{R}(w).
◮ Choose λ using cross-validation.
Note: in deep networks, this regularization is called "weight decay". (Why?)
Note: another popular regularizer for linear regression is ℓ_1. 14 / 61
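
A short sketch (mine, not from the slides) of the ridge closed form next to the pseudoinverse OLS solution; the ridge system is invertible for any λ > 0:

    import numpy as np

    rng = np.random.default_rng(3)
    n, d = 5, 8                                       # n < d, so plain OLS has many minimizers
    A = rng.normal(size=(n, d)) / np.sqrt(n)
    b = rng.normal(size=n)

    lam = 0.1
    w_ridge = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)  # unique for lam > 0
    w_ols = np.linalg.pinv(A) @ b                                  # minimum-norm OLS

    print(np.linalg.norm(w_ridge) <= np.linalg.norm(w_ols))        # ridge shrinks the weights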

(Lec5-6.) Geometry of linear classifiers
A hyperplane in R^d is a linear subspace of dimension d − 1.
◮ A hyperplane in R^2 is a line.
◮ A hyperplane in R^3 is a plane.
◮ As a linear subspace, a hyperplane always contains the origin.
A hyperplane H can be specified by a (non-zero) normal vector w ∈ R^d. The hyperplane with normal vector w is the set of points orthogonal to w:
H = { x ∈ R^d : x^T w = 0 }.
Given w and its corresponding H: H splits R^d into the points labeled positive, { x : w^T x > 0 }, and those labeled negative, { x : w^T x < 0 }.
(Figure: a hyperplane H in the (x_1, x_2) plane with its normal vector w.) 15 / 61

(Lec5-6.) Classification with a hyperplane
Projection of x onto span{w} (a line) has coordinate ‖x‖_2 · cos(θ), where
cos θ = x^T w / (‖w‖_2 ‖x‖_2).
(Distance to the hyperplane is ‖x‖_2 · |cos(θ)|.)
Decision boundary is the hyperplane (oriented by w):
x^T w > 0 ⟺ ‖x‖_2 · cos(θ) > 0 ⟺ x on the same side of H as w.
What should we do if we want a hyperplane decision boundary that doesn't (necessarily) go through the origin?
(Figure: x, its angle θ with w, and its projection onto span{w}.) 16 / 61
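
A tiny sketch (my own; the vectors are arbitrary) of classifying with a hyperplane through the origin and computing the distance of a point to it:

    import numpy as np

    w = np.array([2.0, -1.0])             # normal vector of H = {x : x^T w = 0}
    x = np.array([1.0, 3.0])

    score = x @ w                          # sign says which side of H the point is on
    label = np.sign(score)                 # +1 / -1 prediction
    cos_theta = score / (np.linalg.norm(w) * np.linalg.norm(x))
    dist = np.linalg.norm(x) * abs(cos_theta)   # distance from x to H
    print(label, dist)

Appending a constant 1 to x (and a corresponding bias coordinate to w), as on the next slide, gives a boundary that need not pass through the origin.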

(Lec5-6.) Linear separability
Is it always possible to find w with sign(w^T x_i) = y_i? Is it always possible to find a hyperplane separating the data? (Appending 1 means it need not go through the origin.)
(Figures: one dataset that is linearly separable, one that is not linearly separable.) 17 / 61

(Lec5-6.) Cauchy-Schwarz (new slide)
Cauchy-Schwarz inequality. |a^T b| ≤ ‖a‖ · ‖b‖.
Proof. If ‖a‖ = ‖b‖, then
0 ≤ ‖a − b‖^2 = ‖a‖^2 − 2 a^T b + ‖b‖^2 = 2‖a‖ · ‖b‖ − 2 a^T b,
which rearranges to give a^T b ≤ ‖a‖ · ‖b‖.
For the case ‖a‖ < ‖b‖ (with a ≠ 0; the case a = 0 is immediate), apply the preceding to the rescaled pair (\sqrt{‖b‖/‖a‖} a, \sqrt{‖a‖/‖b‖} b), whose norms are both \sqrt{‖a‖‖b‖}.
For the absolute value, apply the preceding to (a, −b). □ 18 / 61

(Lec5-6.) Logistic loss
Let's state our classification goal with a generic margin loss ℓ:
\hat{R}_ℓ(w) = (1/n) \sum_{i=1}^n ℓ(y_i w^T x_i);
the key properties we want:
◮ ℓ is continuous;
◮ ℓ(z) ≥ c·1[z ≤ 0] = c·ℓ_{zo}(z) for some c > 0 and any z ∈ R, which implies \hat{R}_ℓ(w) ≥ c \hat{R}_{zo}(w);
◮ ℓ′(0) < 0 (pushes stuff from wrong to right).
Examples.
◮ Squared loss, written in margin form: ℓ_{ls}(z) := (1 − z)^2; note ℓ_{ls}(y ŷ) = (1 − y ŷ)^2 = y^2 (y − ŷ)^2 = (y − ŷ)^2 (using y ∈ {−1, +1}).
◮ Logistic loss: ℓ_{log}(z) = ln(1 + exp(−z)). 19 / 61
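
A small sketch (my own) of these margin losses, checking numerically that the logistic loss upper bounds ln(2) times the zero-one loss:

    import numpy as np

    def zero_one(z):  return (z <= 0).astype(float)     # ℓ_zo
    def squared(z):   return (1.0 - z) ** 2             # ℓ_ls in margin form
    def logistic(z):  return np.log1p(np.exp(-z))       # ℓ_log (log1p for stability)

    z = np.array([-2.0, 0.0, 0.5, 3.0])                 # margins y_i w^T x_i
    print(zero_one(z), squared(z), logistic(z))
    print(np.all(logistic(z) >= np.log(2) * zero_one(z)))   # True, so c = ln(2) works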

(Lec5-6.) Squared and logistic losses on linearly separable data I
(Figures: loss contours and decision boundaries on a linearly separable dataset; one panel for the logistic loss, one for the squared loss.) 20 / 61

(Lec5-6.) Squared and logistic losses on linearly separable data II
(Figures: loss contours and decision boundaries on a second linearly separable dataset; one panel for the logistic loss, one for the squared loss.) 21 / 61

(Lec5-6.) Logistic risk and separation
If there exists a perfect linear separator, empirical logistic risk minimization should find it.
Theorem. If there exists \bar{w} with y_i \bar{w}^T x_i > 0 for all i, then every w with
\hat{R}_{log}(w) < ln(2)/(2n) + inf_v \hat{R}_{log}(v)
also satisfies y_i w^T x_i > 0 for all i.
Proof. Omitted. 22 / 61

(Lec5-6.) Least squares and logistic ERM
Least squares:
◮ Take the gradient of ‖Aw − b‖^2, set it to 0; obtain the normal equations A^T A w = A^T b.
◮ One choice is the minimum norm solution A^+ b.
Logistic loss:
◮ Take the gradient of \hat{R}_{log}(w) = (1/n) \sum_{i=1}^n ln(1 + exp(−y_i w^T x_i)) and set it to 0 ???
Remark. Is A^+ b a "closed form expression"? 23 / 61

(Lec5-6.) Gradient descent
Given a function F : R^d → R, gradient descent is the iteration
w_{i+1} := w_i − η_i ∇_w F(w_i),
where w_0 is given, and η_i is a learning rate / step size.
(Figure: gradient descent iterates plotted over the contours of a quadratic.)
Does this work for least squares? Later we'll show it works for least squares and logistic regression due to convexity. 24 / 61
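
A minimal gradient descent sketch (mine) on the least squares objective ‖Aw − b‖^2, whose gradient is 2 A^T (Aw − b):

    import numpy as np

    rng = np.random.default_rng(4)
    n, d = 50, 3
    A = rng.normal(size=(n, d)) / np.sqrt(n)
    b = rng.normal(size=n)

    w = np.zeros(d)                                  # w_0
    eta = 1.0 / (2 * np.linalg.norm(A, 2) ** 2)      # safe constant step for this quadratic
    for _ in range(2000):
        w = w - eta * 2 * A.T @ (A @ w - b)          # w_{i+1} = w_i - eta * grad F(w_i)

    print(np.allclose(A.T @ A @ w, A.T @ b, atol=1e-6))   # approaches a normal-equation solution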

(Lec5-6.) Multiclass?
All our methods so far handle multiclass:
◮ k-nn and decision tree: plurality label.
◮ Least squares: arg min_{W ∈ R^{d×k}} ‖AW − B‖_F^2 with B ∈ R^{n×k}; W ∈ R^{d×k} is k separate linear regressors in R^d.
How about linear classifiers?
◮ At prediction time, x ↦ arg max_y \hat{f}(x)_y.
◮ As in the binary case: interpretation f(x)_y = Pr[Y = y | X = x].
What is a good loss function? 25 / 61

(Lec5-6.) Cross-entropy
Given two probability vectors p, q ∈ Δ_k = { p ∈ R^k_{≥0} : \sum_i p_i = 1 },
H(p, q) = −\sum_{i=1}^k p_i ln q_i   (cross-entropy).
◮ If p = q, then H(p, q) = H(p) (entropy); indeed
H(p, q) = −\sum_{i=1}^k p_i ln( p_i · q_i / p_i ) = H(p) + KL(p, q)   (entropy plus KL divergence).
Since KL ≥ 0, and moreover KL = 0 iff p = q, this is the cost/entropy of p plus a penalty for differing.
◮ Choose encoding ỹ = e_y for y ∈ {1, ..., k}, and ŷ ∝ exp(f(x)) with f : R^d → R^k; then
ℓ_{ce}(ỹ, f(x)) = H(ỹ, ŷ) = −\sum_{i=1}^k ỹ_i ln( exp(f(x)_i) / \sum_{j=1}^k exp(f(x)_j) )
                = −ln( exp(f(x)_y) / \sum_{j=1}^k exp(f(x)_j) ) = −f(x)_y + ln( \sum_{j=1}^k exp(f(x)_j) ).
(In pytorch, use torch.nn.CrossEntropyLoss()(f(x), y).) 26 / 61
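
A quick check (my own sketch) that the −f(x)_y + logsumexp form matches torch.nn.CrossEntropyLoss on one example:

    import torch

    logits = torch.tensor([[2.0, -1.0, 0.5]])          # f(x) for a single example, k = 3
    y = torch.tensor([0])                              # true class index

    loss_builtin = torch.nn.CrossEntropyLoss()(logits, y)
    loss_manual = -logits[0, y[0]] + torch.logsumexp(logits[0], dim=0)
    print(torch.allclose(loss_builtin, loss_manual))   # True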

(Lec5-6.) Cross-entropy, classification, and margins
The zero-one loss for classification is
ℓ_{zo}(y, f(x)) = 1[ y ≠ arg max_j f(x)_j ].
In the multiclass case, can define the margin as f(x)_y − max_{j≠y} f(x)_j, interpreted as "the distance by which f is correct". (Can be negative!)
Since ln \sum_j exp(z_j) ≈ max_j z_j, cross-entropy satisfies
ℓ_{ce}(ỹ, f(x)) = −f(x)_y + ln \sum_j exp(f(x)_j) ≈ −f(x)_y + max_j f(x)_j,
thus minimizing cross-entropy maximizes margins. 27 / 61

(Lec7-8.) The ERM perspective
These lectures will follow an ERM perspective on deep networks:
◮ Pick a model/predictor class (network architecture). (We will spend most of our time on this!)
◮ Pick a loss/risk. (We will almost always use cross-entropy!)
◮ Pick an optimizer. (We will mostly treat this as a black box!)
The goal is low test error, whereas the above only gives low training error; we will briefly discuss this as well. 28 / 61

(Lec7-8.) Basic deep networks
A self-contained expression is
x ↦ σ_L( W_L σ_{L−1}( · · · W_2 σ_1(W_1 x + b_1) + b_2 · · · ) + b_L ),
with equivalent "functional form" x ↦ (f_L ∘ · · · ∘ f_1)(x) where f_i(z) = σ_i(W_i z + b_i).
Some further details (many more to come!):
◮ (W_i)_{i=1}^L with W_i ∈ R^{d_i × d_{i−1}} are the weights, and (b_i)_{i=1}^L are the biases.
◮ (σ_i)_{i=1}^L with σ_i : R^{d_i} → R^{d_i} are called nonlinearities, or activations, or transfer functions, or link functions.
◮ This is only the basic setup; many things can and will change, please ask many questions! 29 / 61
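
A minimal PyTorch sketch (mine; the layer widths are arbitrary) of this basic form for L = 3, with ReLU activations and an identity last activation:

    import torch
    from torch import nn

    d, d1, d2, k = 10, 32, 32, 3          # input dim, two hidden widths, number of classes

    # x -> W3 σ(W2 σ(W1 x + b1) + b2) + b3, with σ = ReLU and identity last layer
    net = nn.Sequential(
        nn.Linear(d, d1), nn.ReLU(),
        nn.Linear(d1, d2), nn.ReLU(),
        nn.Linear(d2, k),                 # outputs logits; softmax is folded into the loss
    )

    x = torch.randn(5, d)                 # minibatch of 5 examples
    print(net(x).shape)                   # torch.Size([5, 3])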

(Lec7-8.) Choices of activation
Basic form:
x ↦ σ_L( W_L σ_{L−1}( · · · W_2 σ_1(W_1 x + b_1) + b_2 · · · ) + b_L ).
Choices of activation (univariate, coordinate-wise):
◮ Indicator/step/Heaviside/threshold z ↦ 1[z ≥ 0]. This was the original choice (1940s!).
◮ Sigmoid σ_s(z) := 1/(1 + exp(−z)). This was popular roughly 1970s-2005?
◮ Hyperbolic tangent z ↦ tanh(z). Similar to sigmoid, used during the same interval.
◮ Rectified Linear Unit (ReLU) σ_r(z) = max{0, z}. It (and slight variants, e.g., Leaky ReLU, ELU, ...) are the dominant choice now; popularized in the "Imagenet/AlexNet" paper (Krizhevsky-Sutskever-Hinton, 2012).
◮ Identity z ↦ z; we'll often use this as the last layer when we use cross-entropy loss.
◮ NON-coordinate-wise choices: we will discuss "softmax" and "pooling" a bit later. 30 / 61

(Lec7-8.) "Architectures" and "models"
Basic form:
x ↦ σ_L( W_L σ_{L−1}( · · · W_2 σ_1(W_1 x + b_1) + b_2 · · · ) + b_L ).
((W_i, b_i))_{i=1}^L, the weights and biases, are the parameters. Let's roll them into W := ((W_i, b_i))_{i=1}^L, and consider the network as a two-parameter function F_W(x) = F(x; W).
◮ The model or class of functions is { F_W : all possible W }. F (both arguments unset) is also called an architecture.
◮ When we fit/train/optimize, typically we leave the architecture fixed and vary W to minimize risk. (More on this in a moment.) 31 / 61

(Lec7-8.) ERM recipe for basic networks
Standard ERM recipe:
◮ First we pick a class of functions/predictors; for deep networks, that means an F(·, ·).
◮ Then we pick a loss function and write down an empirical risk minimization problem; in these lectures we will pick cross-entropy:
arg min_W (1/n) \sum_{i=1}^n ℓ_{ce}(y_i, F(x_i; W))
  = arg min over W_1 ∈ R^{d_1 × d}, b_1 ∈ R^{d_1}, ..., W_L ∈ R^{d_L × d_{L−1}}, b_L ∈ R^{d_L} of (1/n) \sum_{i=1}^n ℓ_{ce}(y_i, F(x_i; ((W_i, b_i))_{i=1}^L))
  = arg min over the same variables of (1/n) \sum_{i=1}^n ℓ_{ce}(y_i, σ_L(· · · σ_1(W_1 x_i + b_1) · · ·)).
◮ Then we pick an optimizer. In this class, we only use gradient descent variants. It is a miracle that this works. 32 / 61
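
A compact training-loop sketch (my own; synthetic data and arbitrary hyperparameters) of this recipe with cross-entropy and a gradient-descent variant:

    import torch
    from torch import nn

    torch.manual_seed(0)
    n, d, k = 256, 10, 3
    X = torch.randn(n, d)                       # synthetic inputs
    y = torch.randint(0, k, (n,))               # synthetic labels in {0, ..., k-1}

    net = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, k))
    loss_fn = nn.CrossEntropyLoss()
    opt = torch.optim.SGD(net.parameters(), lr=0.1)

    for step in range(200):                     # plain full-batch gradient descent
        opt.zero_grad()
        loss = loss_fn(net(X), y)               # empirical risk (cross-entropy)
        loss.backward()                         # gradients via backpropagation
        opt.step()
    print(loss.item())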

(Lec7-8.) Sometimes, linear just isn't enough
(Figures: decision boundaries on a dataset that is not linearly separable.)
Linear predictor: x ↦ w^T [x; 1]. Some blue points misclassified.
ReLU network: x ↦ W_2 σ_r(W_1 x + b_1) + b_2. 0 misclassifications! 33 / 61

(Lec7-8.) Classical example: XOR
Classical "XOR problem" (Minsky-Papert '69). (Check wikipedia for "AI Winter".)
Theorem. On this data, any linear classifier (with affine expansion) makes at least one mistake.
Picture proof. Recall: linear classifiers correspond to separating hyperplanes.
◮ If it splits the blue points, it's incorrect on one of them.
◮ If it doesn't split the blue points, then one halfspace contains the common midpoint, and is therefore wrong on at least one red point. 34 / 61

(Lec7-8.) One layer was not enough. How about two?
Theorem (Cybenko '89, Hornik-Stinchcombe-White '89, Funahashi '89, Leshno et al '92, ...).
Given any continuous function f : R^d → R and any ε > 0, there exist parameters (W_1, b_1, W_2) so that
sup_{x ∈ [0,1]^d} | f(x) − W_2 σ(W_1 x + b_1) | ≤ ε,
as long as σ is "reasonable" (e.g., ReLU or sigmoid or threshold).
Remarks.
◮ Together with the XOR example, justifies using nonlinearities.
◮ Does not justify (very) deep networks.
◮ Only says these networks exist, not that we can optimize for them! 35 / 61

(Lec7-8.) General graph-based view
Classical graph-based perspective.
◮ Network is a directed acyclic graph; sources are inputs, sinks are outputs, intermediate nodes compute z ↦ σ(w^T z + b) (with their own (σ, w, b)).
◮ Nodes at distance 1 from the inputs are the first layer, distance 2 is the second layer, and so on.
"Modern" graph-based perspective.
◮ Edges in the graph can be multivariate, meaning vectors or general tensors, and not just scalars.
◮ Edges will often "skip" layers; "layer" is therefore ambiguous.
◮ Diagram conventions differ; e.g., tensorflow graphs include nodes for parameters. 36 / 61

(Lec7-8.) 2-D convolution in deep networks (pictures)
(Animated figures taken from https://github.com/vdumoulin/conv_arithmetic by Vincent Dumoulin, Francesco Visin; not reproduced here.) 37 / 61
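
The pictures don't survive the text export; as a stand-in, a small shape check (my own sketch) of a 2-D convolution layer in torch:

    import torch
    from torch import nn

    conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, stride=1, padding=1)
    x = torch.randn(2, 3, 32, 32)         # (batch, channels, height, width)
    print(conv(x).shape)                  # torch.Size([2, 8, 32, 32]); padding=1 preserves 32x32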

(Lec7-8.) Softmax
Replace vector input z with z′ ∝ e^z, meaning
z ↦ ( e^{z_1} / \sum_j e^{z_j}, ..., e^{z_k} / \sum_j e^{z_j} ).
◮ Converts the input into a probability vector; useful for interpreting the network output as Pr[Y = y | X = x].
◮ We have baked it into our cross-entropy definition; the last lectures' networks with cross-entropy training had an implicit softmax.
◮ If some coordinate j of z dominates the others, then the softmax is close to e_j. 38 / 61
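
A numerically stable softmax sketch (mine): subtracting max_j z_j leaves the output unchanged but avoids overflow in exp.

    import numpy as np

    def softmax(z):
        z = z - np.max(z)                  # shift-invariance: softmax(z) = softmax(z + c)
        e = np.exp(z)
        return e / e.sum()

    z = np.array([1000.0, 1001.0, 999.0])  # naive exp(z) would overflow
    p = softmax(z)
    print(p, p.sum())                      # a probability vector summing to 1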

(Lec7-8.) Max pooling
(Figure: a worked max pooling example on a small grid, taken from https://github.com/vdumoulin/conv_arithmetic by Vincent Dumoulin, Francesco Visin.)
◮ Often used together with convolution layers; shrinks/downsamples the input.
◮ Another variant is average pooling.
◮ Implementation: torch.nn.MaxPool2d. 39 / 61
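
A small value/shape sketch (mine; the input grid is arbitrary) of torch.nn.MaxPool2d:

    import torch
    from torch import nn

    x = torch.tensor([[[[3., 3., 2., 1.],
                        [0., 0., 1., 3.],
                        [3., 1., 2., 2.],
                        [2., 0., 0., 2.]]]])      # shape (1, 1, 4, 4)

    pool = nn.MaxPool2d(kernel_size=2, stride=2)  # non-overlapping 2x2 windows
    print(pool(x))                                # shape (1, 1, 2, 2): max of each window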

(Lec9-10.) Multivariate network single-example gradients
Define G_j(W) = σ_j(W_j · · · σ_1(W_1 x + b_1) · · ·). The multivariate chain rule tells us
∇_W F(Wx) = J^T x^T,
where J ∈ R^{l×k} is the Jacobian matrix of F : R^k → R^l at Wx, the matrix of all coordinate-wise derivatives. Then
dG_L/dW_L = J_L^T G_{L−1}(W)^T,   dG_L/db_L = J_L^T,
...
dG_L/dW_j = (J_L W_L J_{L−1} W_{L−1} · · · J_j)^T G_{j−1}(W)^T,
dG_L/db_j = (J_L W_L J_{L−1} W_{L−1} · · · J_j)^T,
with J_j the Jacobian of σ_j at W_j G_{j−1}(W) + b_j. For example, when σ_j applies a coordinate-wise σ : R → R,
J_j is diag( σ′((W_j G_{j−1}(W) + b_j)_1), ..., σ′((W_j G_{j−1}(W) + b_j)_{d_j}) ). 40 / 61
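
A minimal autograd check (my own sketch) of the single-layer case: for a scalar loss v^T σ(Wx + b) with σ = ReLU, the chain rule above gives dloss/dW = (diag(σ′(Wx + b)) v) x^T.

    import torch

    torch.manual_seed(0)
    d_in, d_out = 4, 3
    x = torch.randn(d_in)
    v = torch.randn(d_out)                       # fixed vector making the output scalar
    W = torch.randn(d_out, d_in, requires_grad=True)
    b = torch.randn(d_out, requires_grad=True)

    loss = v @ torch.relu(W @ x + b)             # scalar function of (W, b)
    loss.backward()

    sigma_prime = (W @ x + b > 0).float()        # coordinate-wise derivative of ReLU
    manual_dW = (v * sigma_prime).unsqueeze(1) * x.unsqueeze(0)   # (diag(σ') v) x^T
    manual_db = v * sigma_prime
    print(torch.allclose(W.grad, manual_dW), torch.allclose(b.grad, manual_db))   # True True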

(Lec9-10.) Initialization
Recall
dG_L/dW_j = (J_L W_L J_{L−1} W_{L−1} · · · J_j)^T G_{j−1}(W)^T.
◮ What if we set W = 0? What if σ = σ_r is a ReLU?
◮ What if we set two rows of W_j (two nodes) identically?
◮ Resolving this issue is called symmetry breaking.
◮ Standard linear/dense layer initializations:
  N(0, 2/d_in) "He et al.",
  N(0, 2/(d_in + d_out)) Glorot/Xavier,
  U(−1/√d_in, 1/√d_in) torch default.
  (Convolution layers are adjusted to have similar distributions.)
Random initialization is emerging as a key story in deep networks! 41 / 61
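
A sketch (mine) of drawing these initializations for one dense layer with NumPy, reading N(0, v) as a Gaussian with variance v; torch.nn.init provides kaiming_/xavier_ helpers for the first two.

    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_out = 128, 64

    W_he = rng.normal(0.0, np.sqrt(2.0 / d_in), size=(d_out, d_in))                 # "He et al."
    W_glorot = rng.normal(0.0, np.sqrt(2.0 / (d_in + d_out)), size=(d_out, d_in))   # Glorot/Xavier
    W_uniform = rng.uniform(-1 / np.sqrt(d_in), 1 / np.sqrt(d_in), size=(d_out, d_in))  # torch default

    print(W_he.std(), W_glorot.std(), W_uniform.std())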

(Lec9-10; SGD slide.) Minibatches
We used the linearity of gradients to write
∇_w \hat{R}(w) = (1/n) \sum_{i=1}^n ∇_w ℓ(F(x_i; w), y_i).
What happens if we replace ((x_i, y_i))_{i=1}^n with a minibatch ((x′_i, y′_i))_{i=1}^b?
◮ Random minibatch ⟹ the two gradients are equal in expectation.
◮ Most torch layers take minibatch input:
  ◮ torch.nn.Linear has input shape (b, d), output (b, d′).
  ◮ torch.nn.Conv2d has input shape (b, c, h, w), output (b, c′, h′, w′).
◮ This is used heavily outside deep learning as well. It is an easy way to use parallel floating point operations (as on GPU and CPU).
◮ Setting the batch size is black magic and depends on many things (prediction problem, gpu characteristics, ...). 42 / 61
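
A quick numerical illustration (my own; the tolerance is loose because this is a Monte Carlo estimate) that a uniformly random minibatch gradient matches the full gradient in expectation, for the least squares risk:

    import numpy as np

    rng = np.random.default_rng(5)
    n, d, b = 200, 4, 16
    X = rng.normal(size=(n, d))
    y = rng.normal(size=n)
    w = rng.normal(size=d)

    def grad(Xb, yb, w):                       # gradient of (1/m) ||Xb w - yb||^2
        return (2.0 / len(yb)) * Xb.T @ (Xb @ w - yb)

    full = grad(X, y, w)
    draws = [grad(X[idx], y[idx], w)
             for idx in (rng.choice(n, size=b, replace=False) for _ in range(10000))]
    print(np.allclose(np.mean(draws, axis=0), full, atol=0.05))   # True (approximately equal)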

(Lec9-10.) Convex sets
A set S is convex if, for every pair of points {x, x′} in S, the line segment between x and x′ is also contained in S. ({x, x′} ⊆ S ⟹ [x, x′] ⊆ S.)
(Figure: four example sets, three convex and one not convex.)
Examples:
◮ All of R^d.
◮ The empty set.
◮ Half-spaces: { x ∈ R^d : a^T x ≤ b }.
◮ Intersections of convex sets.
◮ Polyhedra: { x ∈ R^d : Ax ≤ b } = ∩_{i=1}^m { x ∈ R^d : a_i^T x ≤ b_i }.
◮ Convex hulls: conv(S) := { \sum_{i=1}^k α_i x_i : k ∈ N, x_i ∈ S, α_i ≥ 0, \sum_{i=1}^k α_i = 1 }. (Infinite convex hulls: intersection of all convex supersets.) 43 / 61

(Lec9-10.) Convex functions from convex sets
The epigraph of a function f is the area above the curve:
epi(f) := { (x, y) ∈ R^{d+1} : y ≥ f(x) }.
A function is convex if its epigraph is convex.
(Figure: one function whose epigraph is not convex, one whose epigraph is convex.) 44 / 61

(Lec9-10.) Convex functions (standard definition)
A function f : R^d → R is convex if for any x, x′ ∈ R^d and α ∈ [0, 1],
f((1 − α) x + α x′) ≤ (1 − α) · f(x) + α · f(x′).
(Figure: a non-convex function and a convex function, each with the chord between (x, f(x)) and (x′, f(x′)).) 45 / 61
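
A quick numerical illustration (mine) of this definition for the logistic loss ℓ_log(z) = ln(1 + e^{−z}), checking the chord inequality at random points:

    import numpy as np

    rng = np.random.default_rng(6)
    f = lambda z: np.log1p(np.exp(-z))         # logistic loss, a convex function

    ok = True
    for _ in range(1000):
        x, xp = rng.normal(size=2) * 5
        a = rng.uniform()
        ok &= f((1 - a) * x + a * xp) <= (1 - a) * f(x) + a * f(xp) + 1e-12
    print(ok)                                   # True: every chord lies above the function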
