 
              Announcements Class is 170. Matlab Grader homework, 1 and 2 (of less than 9) homeworks Due 22 April tonight, Binary graded. For HW1, please get word count <100 167, 165,164 has done the homework. ( If you have not done it talk to me/TA! ) Homework 3 (released ~tomorrow) due ~ 5 May MNIST Jupiter “GPU” home work released Wednesday. Due 10 May Projects: 27 Groups formed. Look at Piazza for help. Guidelines is on Piazza May 5 proposal due. TAs and Peter can approve. Today: • Stanford CNN 9, Kernel methods (Bishop 6), • Linear models for classification, Backpropagation Monday • Stanford CNN 10, Kernel methods (Bishop 6), SVM, • Play with Tensorflow playground before class http://playground.tensorflow.org
Projects • 3-4 person groups preferred • Deliverables: Poster & Report & main code (plus proposal, midterm slide) • Topics your own or chose form suggested topics. Some physics inspired . • April 26 groups due to TA (if you don’t have a group, ask in piaza we can help). TAs will construct group after that. • May 5 proposal due. TAs and Peter can approve. • Proposal: One page: Title, A large paragraph, data, weblinks, references. • Something physical
DataSet • 80 % preparation, 20 % ML • Kaggle: https://inclass.kaggle.com/datasets https://www.kaggle.com • UCI datasets: http://archive.ics.uci.edu/ml/index.php • Past projects… • Ocean acoustics data
In 2017 Many choose the source localization • two CNN projects,
2018: Best reports 6,10,12 15; interesting 19, 47 poor 17; alone is hard 20. I
Bayes and Softmax (Bishop p. 198) • Bayes: Parametric Approach: Linear Classifier 3072x1 p ( x | y ) = p ( y | x ) p ( x ) p ( y | x ) p ( x ) f(x,W) = Wx + b 10x1 = Image T p ( y ) P y ∈ Y p ( x, y ) 10x1 10x3072 10 numbers giving f( x , W ) class scores C Array of 32x32x3 numbers t W y (3072 numbers total) • Classification of N classes: 7 parameters 0.75 p ( x |C n ) p ( C n ) or weights O p ( C n | x ) = 25 Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 2 - 54 April 6, 2017 P N k =1 p ( x |C k ) p ( C k ) o o e e exp( a n ) = y 9 P N k =1 exp( a k ) in pied target with a n = ln ( p ( x |C n ) p ( C n )) R 19 like had Ip or
Softmax to Logistic Regression (Bishop p. 198) p ( x |C 1 ) p ( C 1 ) p ( C 1 | x ) = P 2 k =1 p ( x |C k ) p ( C k ) I logistics key exp( a 1 ) 1 = = P 2 1 + exp( − a ) k =1 exp( a k ) sigmoid with xp cap a = ln p ( x |C 1 ) p ( C 1 ) expcar p ( x |C 2 ) p ( C 2 ) s for binary classification we should use logis Epcc mix • " # = ln ' ( ) # ' ) # • " = " # − " + # • ' ) # , = #-./0(2 3 42 5 )
The Kullback-Leibler Divergence P true distribution, q is approximating distribution
Cross entropy • KL divergence (p true q approximating) > ' = ln(' = ) - ∑ = > ' = ln(; = ) 7 89 ('||;) = ∑ = e = −? ' + ?(', ;) cross entropy • Cross entropy > ' = ln(; = ) ? ', ; = ? ; + 7 89 ('||;) = - ∑ = pre In Cg 1h Caffe • Implementations tf.keras.losses.CategoricalCrossentropy() tf.losses.sparse_softmax_cross_entropy e torch.nn.CrossEntropyLoss()
Cross-entropy or “ softmax ” function for multi-class classification z e i = y The output units use a non-local non-linearity: i å z j e j y y y output units 1 2 3 ¶ y i = - y ( 1 y ) i i ¶ z i z z z 1 2 3 target value å = - E t ln y The natural cost function is the negative log prob j j of the right answer j ¶ y ¶ ¶ E E å j = = - y t i i ¶ ¶ ¶ z y z i j i j
Reminder: 1x1 convolutions 1x1 CONV 56 Reminder: 1x1 convolutions with 32 filters 56 preserves spatial dimensions, reduces depth! 56 Projects depth to lower 56 dimension (combination of 64 32 feature maps) b 1x1 CONV Fei-Fei Li & Justin Johnson & Serena Yeung Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 9 - Lecture 9 - 52 May 2, 2017 May 2, 2017 56 with 32 filters 56 (each filter has size 1x1x64, and performs a 64-dimensional dot product) 56 56 64 32 Summary: CNN Architectures Fei-Fei Li & Justin Johnson & Serena Yeung Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 9 - Lecture 9 - 51 May 2, 2017 May 2, 2017 Case Studies - AlexNet - VGG - GoogLeNet - ResNet Also.... - NiN (Network in Network) - DenseNet - Wide ResNet - FractalNet - ResNeXT - SqueezeNet - Stochastic Depth 10 Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 9 - May 2, 2017 0
Softmax Case Study: ResNet FC 1000 Pool 3x3 conv, 64 [He et al., 2015] 3x3 conv, 64 3x3 conv, 64 relu 3x3 conv, 64 Very deep networks using residual F(x) + x 3x3 conv, 64 64 connections n 3x3 conv, 64 .. . conv - 152-layer model for ImageNet 3x3 conv, 128 X 3x3 conv, 128 F(x) relu - ILSVRC’15 classification winner identity 3x3 conv, 128 3x3 conv, 128 (3.57% top 5 error) conv 3x3 conv, 128 - Swept all classification and 3x3 conv, 128 / 2 3x3 conv, 64 detection competitions in 3x3 conv, 64 X ILSVRC’15 and COCO’15! 3x3 conv, 64 Residual block 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 T Pool 7x7 conv, 64 / 2 Input Fei-Fei Li & Justin Johnson & Serena Yeung Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 9 - Lecture 9 - May 2, 2017 May 2, 2017 65 x H Eftt FED
Case Study: ResNet [He et al., 2015] What happens when we continue stacking deeper layers on a “plain” convolutional neural network? 56-layer Training error 56-layer Test error 20-layer i 20-layer Iterations Iterations Case Study: ResNet 56-layer model performs worse on both training and test error [He et al., 2015] -> The deeper model performs worse, but it’s not caused by overfitting! Hypothesis: the problem is an optimization problem, deeper models are harder to Fei-Fei Li & Justin Johnson & Serena Yeung Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 9 - Lecture 9 - 68 May 2, 2017 May 2, 2017 optimize Fei-Fei Li & Justin Johnson & Serena Yeung Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 9 - Lecture 9 - 69 May 2, 2017 May 2, 2017
Case Study: ResNet [He et al., 2015] Solution: Use network layers to fit a residual mapping instead of directly trying to fit a desired underlying mapping H(x) = F(x) + x relu H(x) F(x) + x s Use layers to O fit residual conv conv F(x) = H(x) - x X F(x) relu relu identity instead of conv conv H(x) directly X X “Plain” layers Residual block Fei-Fei Li & Justin Johnson & Serena Yeung Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 9 - Lecture 9 - 72 72 May 2, 2017 May 2, 2017 HH It
Image by MIT OpenCourseWare. 4 5 |{z} |{z} |{z} |{z} Kernels We might want to consider something more complicated than a linear model: � � ⇥ x (1)2 , x (2)2 , x (1) x (2) ⇤ Example 1 : [ x (1) , x (2) ] → Φ [ x (1) , x (2) ] = Information unchanged, but now we have a linear classifier on the transformed points. With the kernel trick, we just need kernel Input Space Feature Space x B C, D = E(C) F E(D) Image by MIT OpenCourseWare. k ( x , x ′ ) = φ ( x ) T φ ( x ′ ) . (6.1) see that the kernel is a symmetric function of its arguments
Basis expansion KA lies Z tf Y 12 if 2 0 My v.IO Ggso9 EfH t.z xi2 x.E t xi O w I 41,010 1 O WTI t 72 O 2
Gaussian Process (Bishop 6.4, Murphy15) t n = y n + ϵ n f ( x ) ∼ GP ( m ( x ) , κ ( x , x ′ ))
Dual representation, Sec 6.2 Primal problem: min S(R) n R > R F ( = − T = 2 + V S = # + R 2 = WR − X + V + + M + ∑ = + R 2 roam L Solution R = W - X = (W F W + YZ [ ) 4\ W F X = W F (WW ] + YZ ^ ) 4# X = W F (_ + YZ ^ ) 4# X = W F C N M Kit ERM ERNA The kernel is ` = WW ] K 17 INT't G xxterNXN C RN Dual representation is : min S(C) C > R F ( = − T = 2 + V S = # + R 2 = _C − X + + + V + C F _C + ∑ = a is found inverting NxN matrix w is found inverting MxM matrix Only kernels , no feature vectors
Dual representation, Sec 6.2 Dual representation is : min S(C) C > R F ( = − T = 2 + V S = # + R 2 = _C − X + + + V + C F _C + ∑ = Prediction > " = ( = > " = B(( = , () k = R F ( = C F W( = ∑ = F ( = ∑ = FM • Often a is sparse (… Support vector machines ) • We don’t need to know x or a ( . cdeX Xfg _ghigj O + + Y 2 C F _C S C = _C − X + expf 8 Hjk ooo
Gaussian Kernels EE a EI E
Commonly used kernels p = + K ( x , y ) ( x . y 1 ) Polynomial: Parameters Gaussian 2 2 2 - - s that the user || x y || / = K ( x , y ) e radial basis must choose function = - d K ( x , y ) tanh ( k x.y ) Neural net: For the neural network kernel, there is one “hidden unit” per support vector, so the process of fitting the maximum margin hyperplane decides how many hidden units to use. Also, it may violate Mercer’s condition.
Recommend
More recommend