  1. 9.1 Overview. 9 Deep Learning. Alexander Smola, Introduction to Machine Learning 10-701. http://alex.smola.org/teaching/10-701-15

  2. A brief history of computers

                1970s    1980s       1990s       2000s     2010s
     Data       10^2     10^3        10^5        10^8      10^11
     RAM        ?        1MB         100MB       10GB      1TB
     CPU        ?        10MF        1GF         100GF     1PF (GPU)
     Methods             deep nets   kernel methods        deep nets

     • Data grows at a higher exponent
     • Moore's law (silicon) vs. Kryder's law (disks)
     • Early algorithms were data bound, now CPU/RAM bound

  3. Perceptron
     Inputs x_1, x_2, x_3, ..., x_n with synaptic weights w_1, ..., w_n; output
     y(x) = σ(⟨w, x⟩)
     (A small code sketch follows below.)
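Not from the slides: a minimal NumPy sketch of this forward pass, using a logistic function for σ; the weight and input values are made up.

```python
import numpy as np

def sigma(z):
    # logistic nonlinearity standing in for the sigma on the slide
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(w, x):
    # y(x) = sigma(<w, x>)
    return sigma(np.dot(w, x))

w = np.array([0.5, -1.0, 0.25])   # hypothetical synaptic weights
x = np.array([1.0, 2.0, -0.5])    # hypothetical input
print(perceptron(w, x))
```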

  4. Nonlinearities via Layers
     Kernels: y^1_i = k(x_i, x)
     Deep nets: y^1_i(x) = σ(⟨w^1_i, x⟩),  y^2(x) = σ(⟨w^2, y^1⟩); optimize all weights

  5. Nonlinearities via Layers
     y^1_i(x) = σ(⟨w^1_i, x⟩)
     y^2_i(x) = σ(⟨w^2_i, y^1⟩)
     y^3(x) = σ(⟨w^3, y^2⟩)

  6. Multilayer Perceptron
     • Layer representation: y_i = W_i x_i,  x_{i+1} = σ(y_i)
       (figure: a stack x1 → W1 → x2 → W2 → x3 → W3 → x4 → W4 → y)
     • (Typically) iterate between a linear mapping W x and a nonlinear function
     • Loss function l(y, y_i) to measure the quality of the estimate so far
     (A forward-pass code sketch follows below.)
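To make the layer recursion concrete, here is a small NumPy sketch of the forward pass; the tanh nonlinearity, layer widths, and random weights are assumptions chosen for illustration, not part of the lecture.

```python
import numpy as np

def sigma(z):
    return np.tanh(z)  # any smooth nonlinearity works here

def forward(Ws, x1):
    """Apply y_i = W_i x_i, x_{i+1} = sigma(y_i) layer by layer."""
    x, activations = x1, [x1]
    for W in Ws:
        y = W @ x          # linear mapping
        x = sigma(y)       # nonlinearity
        activations.append(x)
    return activations

rng = np.random.default_rng(0)
sizes = [8, 16, 16, 4]                      # hypothetical layer widths
Ws = [rng.normal(scale=0.1, size=(m, n))    # W_i maps dimension n -> m
      for n, m in zip(sizes[:-1], sizes[1:])]
print([a.shape for a in forward(Ws, rng.normal(size=sizes[0]))])
```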

  7. Backpropagation
     • Layer representation: y_i = W_i x_i,  x_{i+1} = σ(y_i)
     • Compute the change in the objective: g_j = ∂_{W_j} l(y, y_i)
     • Chain rule: ∂_x [f_2 ∘ f_1](x) = [∂_{f_1} f_2 ∘ f_1(x)] [∂_x f_1](x)

  8. Backpropagation
     • Layer representation: y_i = W_i x_i,  x_{i+1} = σ(y_i)
     • Gradients:
       ∂_{x_i} y_i = W_i,   ∂_{W_i} y_i = x_i,   ∂_{y_i} x_{i+1} = σ'(y_i)
       ⇒ ∂_{x_i} x_{i+1} = σ'(y_i) W_i^T
     • Backprop:
       g_n = ∂_{x_n} l(y, y_n)
       g_i = ∂_{x_i} l(y, y_n) = g_{i+1} ∂_{x_i} x_{i+1}
       ∂_{W_i} l(y, y_n) = g_{i+1} σ'(y_i) x_i^T
     (See the code sketch below.)
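The recursions above, sketched in NumPy for a small network; the tanh nonlinearity, the squared-error loss, and the shapes are assumptions chosen for illustration.

```python
import numpy as np

def sigma(z):
    return np.tanh(z)

def sigma_prime(z):
    return 1.0 - np.tanh(z) ** 2

def forward(Ws, x1):
    xs, ys = [x1], []
    for W in Ws:
        ys.append(W @ xs[-1])            # y_i = W_i x_i
        xs.append(sigma(ys[-1]))         # x_{i+1} = sigma(y_i)
    return xs, ys

def backward(Ws, xs, ys, target):
    # squared-error loss l = 0.5 * ||x_n - target||^2, so g_n = x_n - target
    g, grads = xs[-1] - target, []
    for W, x, y in zip(reversed(Ws), reversed(xs[:-1]), reversed(ys)):
        delta = g * sigma_prime(y)       # g_{i+1} * sigma'(y_i)
        grads.append(np.outer(delta, x)) # d l / d W_i = delta x_i^T
        g = W.T @ delta                  # g_i = g_{i+1} d x_{i+1} / d x_i
    return list(reversed(grads))

rng = np.random.default_rng(1)
Ws = [rng.normal(scale=0.1, size=(5, 4)), rng.normal(scale=0.1, size=(3, 5))]
xs, ys = forward(Ws, rng.normal(size=4))
print([g.shape for g in backward(Ws, xs, ys, np.zeros(3))])
```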

  9. Optimization
     • Layer representation: y_i = W_i x_i,  x_{i+1} = σ(y_i)
     • Gradient descent: W_i ← W_i − η ∂_{W_i} l(y, y_n)
     • Second order methods (use higher derivatives)
     • Stochastic gradient descent (use only one sample)
     • Minibatch (small subset)

  10. Things we could learn
      • Binary classification: log(1 + exp(−y y_n))
      • Multiclass classification (softmax): log Σ_{y'} exp(y_n[y']) − y_n[y]
      • Regression: (1/2) ‖y − y_n‖^2
      • Ranking (top-k)
      • Preferences
      • Sequences (see CRFs)
      (Code sketches of the first three losses follow below.)
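The first three objectives written out in NumPy as a rough sketch; the function names and the max-trick stabilization of the softmax term are mine, not from the slides.

```python
import numpy as np

def binary_loss(y, score):
    # log(1 + exp(-y * y_n)) for a label y in {-1, +1}; score plays the role of y_n
    return np.log1p(np.exp(-y * score))

def softmax_loss(scores, y):
    # log sum_{y'} exp(y_n[y']) - y_n[y], stabilized by subtracting the max
    m = scores.max()
    return m + np.log(np.sum(np.exp(scores - m))) - scores[y]

def squared_loss(y, y_n):
    # 0.5 * ||y - y_n||^2
    return 0.5 * np.sum((y - y_n) ** 2)

print(binary_loss(+1, 2.0),
      softmax_loss(np.array([1.0, 2.0, 0.5]), 1),
      squared_loss(np.array([1.0, 0.0]), np.array([0.8, 0.1])))
```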

  11. 9.2 Layers. 9 Deep Learning. Alexander Smola, Introduction to Machine Learning 10-701. http://alex.smola.org/teaching/10-701-15

  12. Fully Connected
      • Forward mapping: y_i = W_i x_i,  x_{i+1} = σ(y_i), with a subsequent nonlinearity
      • Backprop gradients: ∂_{x_i} x_{i+1} = σ'(y_i) W_i^T,   ∂_{W_i} x_{i+1} = σ'(y_i) x_i^T
      • General purpose layer

  13. Rectified Linear Unit (ReLU)
      • Forward mapping: y_i = W_i x_i,  x_{i+1} = σ(y_i), with a subsequent nonlinearity
      • Gradients of saturating nonlinearities vanish at the tails
      • Solution: replace σ by max(0, x)
      • Derivative is in {0, 1}
      • Sparsity of the signal
      (Nair & Hinton, machinelearning.wustl.edu/mlpapers/paper_files/icml2010_NairH10.pdf; a code sketch follows below.)
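A tiny sketch of the ReLU and its derivative (illustrative only; the value chosen at 0 is arbitrary).

```python
import numpy as np

def relu(y):
    # max(0, y), applied elementwise
    return np.maximum(0.0, y)

def relu_prime(y):
    # derivative is 0 for y < 0 and 1 for y > 0; we pick 0 at y = 0
    return (y > 0).astype(float)

y = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(y), relu_prime(y))
```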

  14. Where is Wally

  15. LeNet for OCR (1990s)

  16. Convolutional Layers
      • Images have translation invariance (to some extent)
      • The low level is mostly edge and feature detectors
      • Usually via convolution (plus nonlinearity)

  17. Convolutional Layers
      • Images have translation invariance
      • Forward (usually implemented brute force): y_i = x_i ∗ W_i,  x_{i+1} = σ(y_i)
      • Backward gradients (need to convolve appropriately)
      (A brute-force sketch follows below.)
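A brute-force sketch of the forward mapping for a single-channel image and one filter, followed by a ReLU; padding, strides, and multiple channels are omitted, and the kernel values are made up.

```python
import numpy as np

def conv2d(x, W):
    """Naive 'valid' cross-correlation of image x with kernel W (brute force)."""
    H, Wd = x.shape
    kh, kw = W.shape
    out = np.zeros((H - kh + 1, Wd - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * W)
    return out

x = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
W = np.array([[1.0, 0.0], [0.0, -1.0]])        # toy 2x2 filter
y = conv2d(x, W)                               # y_i = x_i * W_i
x_next = np.maximum(0.0, y)                    # x_{i+1} = sigma(y_i)
print(y.shape, x_next.shape)
```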

  18. Subsampling & MaxPooling
      • Multiple convolutions blow up dimensionality
      • Subsampling: average over patches (this works decently)
      • MaxPooling: pick the maximum over patches (often non-overlapping ones)
      (A pooling sketch follows below.)
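Both operations sketched for non-overlapping 2x2 patches; the window size is an assumption.

```python
import numpy as np

def pool2x2(x, mode="max"):
    """Non-overlapping 2x2 pooling of a 2D array with even height and width."""
    H, W = x.shape
    patches = x.reshape(H // 2, 2, W // 2, 2)
    if mode == "max":
        return patches.max(axis=(1, 3))    # MaxPooling
    return patches.mean(axis=(1, 3))       # subsampling (average over patches)

x = np.arange(16, dtype=float).reshape(4, 4)
print(pool2x2(x, "max"))
print(pool2x2(x, "avg"))
```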

  19. Depth vs. Width
      • Longer-range effects via
        • many narrow convolutions, or
        • few wide convolutions
      • More nonlinearities work better (for the same number of parameters)
      Simonyan and Zisserman, arxiv.org/pdf/1409.1556v6.pdf

  20. Fancy structures • Compute different filters • Compose one big vector from all of them • Layer this iteratively Szegedy et al. arxiv.org/pdf/1409.4842v1.pdf

  21. Whole system training Le Cun, Bottou, Bengio, Haffner, 2001 yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf

  22. Whole system training • Layers need not be ‘neural networks’ • Rankers • Segmenters • Finite state automata • Jointly train a full OCR system Le Cun, Bottou, Bengio, Haffner, 2001 yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf

  23. 9.3 Objectives. 9 Deep Learning. Alexander Smola, Introduction to Machine Learning 10-701. http://alex.smola.org/teaching/10-701-15

  24. Classification
      • Binary classification (binary exponential model): log(1 + exp(−y y_n))
      • Multiclass classification (softmax, multinomial exponential model):
        −log p(y | y_n) = −log [ e^{y_n[y]} / Σ_{y'} e^{y_n[y']} ] = log Σ_{y'} e^{y_n[y']} − y_n[y]
      • Pretty much anything else we did so far in 10-701
      (A code sketch of the softmax objective and its gradient follows below.)
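A sketch of the multinomial objective together with its gradient with respect to the scores (the gradient is the softmax probability vector minus the one-hot target, which is what backprop consumes); the max-trick stabilization is my addition.

```python
import numpy as np

def softmax_nll_and_grad(scores, y):
    # -log p(y | y_n) = log sum_{y'} e^{y_n[y']} - y_n[y]; scores plays the role of y_n
    m = scores.max()
    logZ = m + np.log(np.sum(np.exp(scores - m)))   # stable log-partition function
    loss = logZ - scores[y]
    grad = np.exp(scores - logZ)                    # softmax probabilities
    grad[y] -= 1.0                                  # d loss / d scores
    return loss, grad

print(softmax_nll_and_grad(np.array([2.0, 0.5, -1.0]), y=0))
```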

  25. Regression
      • Least mean squares: (1/2) ‖y − y_n‖_2^2 (this works for vectors, too)
      • Applications:
        • Stock market prediction (more on this later)
        • Image superresolution (regress from a lower-dimensional to a higher-dimensional image)
        • Recommendation and rating (Netflix)

  26. Autoencoder
      • Regress from the observation to itself (y_n = x_1)
      • The lower-dimensional layer is the bottleneck
      • Often trained iteratively
      (Figure: stacked autoencoders with encoder weights W_1, W_2 and decoder weights V_1, V_2 over layers x_1, x_2, x_3.)

  27. Autoencoder
      • Regress from the observation to itself (y_n = x_1)
      • The lower-dimensional layer is the bottleneck
      • Often trained iteratively
      • Extracts an approximate sufficient statistic of the data
      • Special case: PCA
        • linear mapping
        • only a single layer
      (A linear autoencoder sketch follows below.)
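A rough sketch of a single-bottleneck linear autoencoder trained by gradient descent on the reconstruction error; the data, dimensions, learning rate, and number of steps are placeholders. With linear maps and a single hidden layer this recovers a PCA-like subspace, matching the special case on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))            # toy data: 200 samples of dimension 10
k = 3                                     # bottleneck dimension
W = rng.normal(scale=0.1, size=(k, 10))   # encoder weights
V = rng.normal(scale=0.1, size=(10, k))   # decoder weights

eta = 0.01
for _ in range(500):
    Z = X @ W.T                           # encode into the bottleneck
    R = Z @ V.T - X                       # residual of 0.5 * ||reconstruction - X||^2
    gV = R.T @ Z / len(X)                 # gradient w.r.t. decoder
    gW = V.T @ R.T @ X / len(X)           # gradient w.r.t. encoder (chain rule)
    V -= eta * gV
    W -= eta * gW

print(np.mean((X - (X @ W.T) @ V.T) ** 2))  # reconstruction error after training
```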

  28. ‘Synesthesia’ • Different data sources • Images and captions • Natural language queries and SQL queries • Movies and actions • Generative embedding for both entities • Minimize distance between pairs • Need to prevent clumping all together

  29. ‘Synesthesia’
      • Different data sources:
        • Images and captions
        • Natural language queries and SQL queries
        • Movies and actions
      • Large margin of similarity: max(0, margin + d(a, b) − d(a, n))
      Grefenstette et al., 2014, arxiv.org/abs/1404.7296
      (A margin-loss sketch follows below.)
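The margin loss for one anchor a, its paired example b, and a non-matching example n, using squared Euclidean distance between embeddings; the distance choice, margin value, and embedding values are assumptions.

```python
import numpy as np

def margin_loss(a, b, n, margin=1.0):
    # max(0, margin + d(a, b) - d(a, n)): push the true pair closer than the non-pair
    d_ab = np.sum((a - b) ** 2)
    d_an = np.sum((a - n) ** 2)
    return max(0.0, margin + d_ab - d_an)

a = np.array([0.1, 0.2])    # e.g. an image embedding (made up)
b = np.array([0.2, 0.1])    # its caption embedding (made up)
n = np.array([1.5, -1.0])   # embedding of an unrelated caption (made up)
print(margin_loss(a, b, n))
```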

  30. Synthetic Data Generation
      • Datasets often have useful invariances:
        • Images can be shifted, scaled, RGB-transformed, blurred, sharpened, etc.
        • Speech can have echo, background noise, environmental noise
        • Text can have typos, omissions, etc.
      • Generate data and train on the extended noisy set
        • Record-breaking speech recognition (Baidu)
        • Record-breaking image recognition (Baidu, LeCun)
      • Can be very computationally expensive
      (An augmentation sketch follows below.)
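A minimal sketch of generating extra training images by random shifts plus noise; the transform set and magnitudes are arbitrary choices for illustration.

```python
import numpy as np

def augment(image, rng, max_shift=2, noise_std=0.05):
    """Return a randomly shifted, noisy copy of a 2D image array."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    shifted = np.roll(np.roll(image, dy, axis=0), dx, axis=1)        # translation
    return shifted + rng.normal(scale=noise_std, size=image.shape)   # additive noise

rng = np.random.default_rng(0)
image = rng.random((8, 8))                           # toy "image"
extended = [augment(image, rng) for _ in range(5)]   # extended noisy set
print(len(extended), extended[0].shape)
```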

  31. Synthetic Data Generation • Sample according to relevance of transform • Similar to Virtual Support Vectors (Schölkopf, 1998) • Training with input noise & regularization (Bishop, 1995)

  32. 9.4 Optimization. 9 Deep Learning. Alexander Smola, Introduction to Machine Learning 10-701. http://alex.smola.org/teaching/10-701-15

  33. Stochastic Gradient Descent
      • Update parameters according to W_ij ← W_ij − η_ij(t) g_ij
      • Rate of decay
      • Adjust each layer
      • Adjust each parameter individually
      • Minibatch size
      • Momentum terms
      • Lots of things that can (should) be adjusted (via Bayesian optimization, e.g. Spearmint, MOE)
      Senior, Heigold, Ranzato and Yang, 2013, http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/40808.pdf

  34. Minibatch
      • Update parameters according to W_ij ← W_ij − η_ij(t) g_ij
      • Aggregate gradients before applying them
      • Reduces variance in the gradients
      • Better for vectorization (GPUs): vector-vector < vector-matrix < matrix-matrix products
      • A large minibatch may need a lot of memory (and gives slow updates)
      • Magic numbers are 64 to 256 on GPUs
      Senior, Heigold, Ranzato and Yang, 2013, http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/40808.pdf
      (A minibatch SGD sketch follows below.)
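A generic minibatch SGD loop in NumPy; the least-squares toy problem, batch size, and learning rate are placeholders, and any gradient function could be plugged in.

```python
import numpy as np

def sgd_minibatch(grad_fn, W, X, Y, eta=0.1, batch_size=64, epochs=5, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(n), max(1, n // batch_size)):
            g = grad_fn(W, X[idx], Y[idx])   # gradient aggregated over the minibatch
            W = W - eta * g                  # W <- W - eta * g
    return W

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
Y = X @ w_true + 0.1 * rng.normal(size=1000)

def grad_fn(W, Xb, Yb):
    # gradient of the mean squared error 0.5 * mean((Xb W - Yb)^2)
    return Xb.T @ (Xb @ W - Yb) / len(Xb)

print(sgd_minibatch(grad_fn, np.zeros(5), X, Y))
```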

  35. Learning rate decay
      • Constant (requires a schedule for piecewise constant; tricky)
      • Polynomial decay: η(t) = α / (β + t)^γ
        Recall the exponent of 0.5 for conventional SGD and 1 under strong convexity; Bottou picks 0.75.
      • Exponential decay: η(t) = α e^{−βt}
        Risky, since the decay could be too aggressive.
      (Both schedules are sketched below.)
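The two schedules written out directly; the α, β, γ values below are placeholders.

```python
import math

def polynomial_decay(t, alpha=1.0, beta=1.0, gamma=0.75):
    # eta(t) = alpha / (beta + t)^gamma
    return alpha / (beta + t) ** gamma

def exponential_decay(t, alpha=1.0, beta=0.01):
    # eta(t) = alpha * exp(-beta * t)
    return alpha * math.exp(-beta * t)

for t in (0, 10, 100, 1000):
    print(t, polynomial_decay(t), exponential_decay(t))
```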

  36. AdaGrad
      • Adaptive learning rate (preconditioner):
        η_ij(t) = η_0 / sqrt(K + Σ_{t'≤t} g_ij^2(t'))
      • For directions with large gradients, decrease the learning rate aggressively to avoid instability
      • If gradients start vanishing, the learning-rate decrease slows down, too
      • Local variant (sum over a recent window only):
        η_ij(t) = η_t / sqrt(K + Σ_{t'=t−τ}^{t} g_ij^2(t'))
      Duchi, Hazan, Singer, 2010, http://www.magicbroom.info/Papers/DuchiHaSi10.pdf
      (A sketch follows below.)
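A per-parameter AdaGrad update sketched in NumPy, following the global variant that accumulates all past squared gradients; η_0, K, and the toy gradients are placeholders.

```python
import numpy as np

class AdaGrad:
    def __init__(self, shape, eta0=0.1, K=1e-8):
        self.eta0, self.K = eta0, K
        self.sum_sq = np.zeros(shape)        # running sum of g_ij^2 over time

    def step(self, W, g):
        self.sum_sq += g ** 2
        eta = self.eta0 / np.sqrt(self.K + self.sum_sq)   # per-parameter rate
        return W - eta * g

opt = AdaGrad(shape=(3,))
W = np.zeros(3)
for g in (np.array([1.0, 0.1, 0.0]), np.array([0.5, 0.1, 0.0])):
    W = opt.step(W, g)
print(W)
```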

  37. Momentum
      • Average over recent gradients
      • Helps with local minima
      • Flat (noisy) gradients:
        m_t = (1 − λ) m_{t−1} + λ g_t
        w_t ← w_t − η_t g_t − η̃_t m_t
      • Can lead to oscillations for large momentum
      • Nesterov's accelerated gradient:
        m_{t+1} = μ m_t + ε g(w_t − μ m_t)
        w_{t+1} = w_t − m_{t+1}
      (Both updates are sketched below.)
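Both updates sketched on a toy quadratic objective; the objective, λ, μ, and step sizes are placeholder choices, not values from the lecture.

```python
import numpy as np

def grad(w):
    # gradient of the toy objective 0.5 * (w_1^2 + 10 w_2^2)
    return np.array([1.0, 10.0]) * w

# exponentially averaged momentum: m_t = (1 - lam) m_{t-1} + lam g_t
w, m, lam, eta, eta_m = np.array([1.0, 1.0]), np.zeros(2), 0.1, 0.02, 0.05
for _ in range(100):
    g = grad(w)
    m = (1 - lam) * m + lam * g
    w = w - eta * g - eta_m * m              # w_t <- w_t - eta_t g_t - eta~_t m_t
print("momentum:", w)

# Nesterov's accelerated gradient: evaluate the gradient at the look-ahead point
w, m, mu, eps = np.array([1.0, 1.0]), np.zeros(2), 0.9, 0.02
for _ in range(100):
    m = mu * m + eps * grad(w - mu * m)      # m_{t+1} = mu m_t + eps g(w_t - mu m_t)
    w = w - m                                # w_{t+1} = w_t - m_{t+1}
print("nesterov:", w)
```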
