
recap: Linear Models, RBFs, Neural Networks - PowerPoint PPT Presentation



1. recap: Linear Models, RBFs, Neural Networks

   Linear model with nonlinear transform: $h(x) = \theta\big(w_0 + \sum_{j=1}^{\tilde d} w_j \Phi_j(x)\big)$.
   Neural network (weights learned by gradient descent): $h(x) = \theta\big(w_0 + \sum_{j=1}^{m} w_j\, \theta(v_j^t x)\big)$.
   $k$-RBF-network (centers from k-means): $h(x) = \theta\big(w_0 + \sum_{j=1}^{k} w_j\, \phi(\|x - \mu_j\|)\big)$.

   The neural network generalizes the linear model by adding layers; the support vector machine is a more 'robust' linear model.

   This is Lecture 23 of Learning From Data, "SVM's: Maximizing the Margin" (M. Magdon-Ismail, CSCI 4100/6100). Outline: A Better Hyperplane; Maximizing the Margin; Link to Regularization.

   Which Separator Do You Pick? Robustness to Noisy Data: being robust to noise (measurement error) is good (remember regularization), and a thicker cushion around the separator means more robustness.
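As a minimal sketch (not from the slides), the three hypothesis forms above can be written side by side in Python; the choices of $\theta$ (sign at the output, tanh in the hidden layer) and a Gaussian $\phi$ are assumptions, and all function names are illustrative.

```python
import numpy as np

def linear_transform_h(x, w0, w, phi):
    """Linear model with nonlinear transform: sign(w0 + sum_j w_j * Phi_j(x))."""
    return np.sign(w0 + w @ phi(x))               # phi(x) returns the transformed feature vector

def neural_net_h(x, w0, w, V):
    """Two-layer network: sign(w0 + sum_j w_j * tanh(v_j^t x)); V stacks the v_j as rows."""
    return np.sign(w0 + w @ np.tanh(V @ x))

def rbf_network_h(x, w0, w, centers, r=1.0):
    """k-RBF-network: sign(w0 + sum_j w_j * phi(||x - mu_j||)) with a Gaussian bump phi."""
    dists = np.linalg.norm(centers - x, axis=1)   # ||x - mu_j|| for each center mu_j
    return np.sign(w0 + w @ np.exp(-0.5 * (dists / r) ** 2))
```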

2. Thicker Cushion Means More Robustness

   We call such hyperplanes fat. Two crucial questions:
   1. Can we efficiently find the fattest separating hyperplane?
   2. Is a fatter hyperplane better than a thin one?

   Pulling Out the Bias. Before: $x \in \{1\} \times \mathbb{R}^d$ and $w \in \mathbb{R}^{d+1}$, with $x = (1, x_1, \dots, x_d)^t$, $w = (w_0, w_1, \dots, w_d)^t$ and signal $w^t x$ (the bias $w_0$ is folded into $w$). Now: $x \in \mathbb{R}^d$, $b \in \mathbb{R}$, $w \in \mathbb{R}^d$, with $x = (x_1, \dots, x_d)^t$, $w = (w_1, \dots, w_d)^t$ and signal $w^t x + b$ (the bias $b$ is kept separate).

   Separating the Data. The hyperplane $h = (b, w)$ separates the data means $y_n(w^t x_n + b) > 0$ for every $n$ (the signal is positive on the $+1$ side and negative on the $-1$ side). By rescaling the weights and bias we can require $\min_{n=1,\dots,N} y_n(w^t x_n + b) = 1$ (renormalize the weights so that the signal $w^t x + b$ is meaningful).
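A small sketch of that rescaling step, assuming numpy arrays X of shape (N, d), labels y in {-1, +1}, and a (b, w) that already separates the data; the function name is made up for illustration.

```python
import numpy as np

def rescale_separator(X, y, w, b):
    """Rescale a separating (b, w) so that min_n y_n (w^t x_n + b) = 1."""
    signals = y * (X @ w + b)        # y_n (w^t x_n + b); all positive for a separator
    rho = signals.min()              # the smallest signal over the data
    assert rho > 0, "(b, w) does not separate the data"
    return w / rho, b / rho          # after rescaling, min_n y_n (w^t x_n + b) = 1
```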

3. Distance to the Hyperplane

   $w$ is normal to the hyperplane: for any two points $x_1, x_2$ on the hyperplane, $w^t(x_2 - x_1) = w^t x_2 - w^t x_1 = -b + b = 0$ (because $w^t x = -b$ on the hyperplane). With unit normal $u = w/\|w\|$, the distance from a point $x$ to $h$ is
   $\mathrm{dist}(x, h) = |u^t(x - x_1)| = \frac{1}{\|w\|}\,|w^t x - w^t x_1| = \frac{1}{\|w\|}\,|w^t x + b|$.

   Fatness of a Separating Hyperplane. For a separating hyperplane, $|w^t x_n + b| = |y_n(w^t x_n + b)| = y_n(w^t x_n + b)$, so
   $\mathrm{dist}(x_n, h) = \frac{1}{\|w\|}\, y_n(w^t x_n + b)$.
   Fatness is the distance to the closest data point:
   $\text{fatness} = \min_n \mathrm{dist}(x_n, h) = \frac{1}{\|w\|}\,\min_n y_n(w^t x_n + b) = \frac{1}{\|w\|}$,
   where the last step uses the separation condition $\min_n y_n(w^t x_n + b) = 1$. This is the margin $\gamma(h)$.

   Maximizing the Margin. The margin is $\gamma(h) = 1/\|w\|$ (the bias $b$ does not appear here), so maximizing the margin means
   minimize over $b, w$: $\tfrac{1}{2} w^t w$, subject to $\min_{n=1,\dots,N} y_n(w^t x_n + b) = 1$.
   An equivalent form replaces the minimum condition by $N$ inequality constraints:
   minimize over $b, w$: $\tfrac{1}{2} w^t w$, subject to $y_n(w^t x_n + b) \ge 1$ for $n = 1, \dots, N$.
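The distance and margin formulas translate directly into code; a sketch under the same numpy conventions as above, with illustrative names.

```python
import numpy as np

def dist_to_hyperplane(x, w, b):
    """Distance from x to the hyperplane {z : w^t z + b = 0}: |w^t x + b| / ||w||."""
    return abs(w @ x + b) / np.linalg.norm(w)

def margin(X, y, w, b):
    """Fatness of a separating hyperplane: distance to the closest data point.

    If (b, w) is rescaled so that min_n y_n (w^t x_n + b) = 1, this equals 1 / ||w||.
    """
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)
```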

4. Example – Our Toy Data Set

   $X = \begin{pmatrix} 0 & 0 \\ 2 & 2 \\ 2 & 0 \\ 3 & 0 \end{pmatrix}$, $y = \begin{pmatrix} -1 \\ -1 \\ +1 \\ +1 \end{pmatrix}$. The constraints $y_n(w^t x_n + b) \ge 1$ become:
   (i) $-b \ge 1$
   (ii) $-(2w_1 + 2w_2 + b) \ge 1$
   (iii) $2w_1 + b \ge 1$
   (iv) $3w_1 + b \ge 1$
   (i) and (iii) give $w_1 \ge 1$; (ii) and (iii) give $w_2 \le -1$. So $\tfrac{1}{2}(w_1^2 + w_2^2) \ge 1$, with equality at $(b, w_1, w_2) = (-1, 1, -1)$. The optimal hyperplane is $g(x) = \mathrm{sign}(x_1 - x_2 - 1)$ with margin $1/\|w^*\| = 1/\sqrt{2} \approx 0.707$. Data points (i), (ii) and (iii) achieve $y_n(w^{*t} x_n + b^*) = 1$: they are the support vectors.

   Quadratic Programming. The general QP is: minimize over $u \in \mathbb{R}^q$: $\tfrac{1}{2} u^t Q u + p^t u$, subject to $A u \ge c$, solved by $u^* \leftarrow \mathrm{QP}(Q, p, A, c)$. ($Q = 0$ is linear programming.)

   Maximum Margin Hyperplane is a QP. Take $u = \binom{b}{w} \in \mathbb{R}^{d+1}$. Then
   $\tfrac{1}{2} w^t w = \tfrac{1}{2} u^t \begin{pmatrix} 0 & 0_d^t \\ 0_d & I_d \end{pmatrix} u \;\Rightarrow\; Q = \begin{pmatrix} 0 & 0_d^t \\ 0_d & I_d \end{pmatrix}, \; p = 0_{d+1}$,
   and $y_n(w^t x_n + b) \ge 1 \;\equiv\; y_n \begin{pmatrix} 1 & x_n^t \end{pmatrix} u \ge 1 \;\Rightarrow\; A = \begin{pmatrix} y_1 & y_1 x_1^t \\ \vdots & \vdots \\ y_N & y_N x_N^t \end{pmatrix}, \; c = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}$.

   Back to Our Example. Exercise: show that for the toy data set
   $Q = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$, $p = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}$, $A = \begin{pmatrix} -1 & 0 & 0 \\ -1 & -2 & -2 \\ 1 & 2 & 0 \\ 1 & 3 & 0 \end{pmatrix}$, $c = \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix}$,
   then use your QP-solver to obtain $(b^*, w_1^*, w_2^*) = (-1, 1, -1)$.
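To make the exercise concrete, here is a sketch (mine, not the lecture's code) that builds Q, p, A, c for the toy data set and solves the program; scipy's general-purpose SLSQP solver stands in for the QP(Q, p, A, c) routine of the slides.

```python
import numpy as np
from scipy.optimize import minimize

# Toy data set; the QP variable is u = (b, w1, w2).
X = np.array([[0., 0.], [2., 2.], [2., 0.], [3., 0.]])
y = np.array([-1., -1., +1., +1.])

# QP pieces: minimize (1/2) u^t Q u + p^t u  subject to  A u >= c.
Q = np.diag([0., 1., 1.])                    # the bias does not enter the objective
p = np.zeros(3)
A = y[:, None] * np.c_[np.ones(len(y)), X]   # row n is y_n * [1, x_n^t]
c = np.ones(len(y))

# A generic convex solver stands in for QP(Q, p, A, c); 'ineq' means A u - c >= 0.
res = minimize(lambda u: 0.5 * u @ Q @ u + p @ u, x0=np.zeros(3), method="SLSQP",
               constraints=[{"type": "ineq", "fun": lambda u: A @ u - c}])
b_star, w_star = res.x[0], res.x[1:]
print(b_star, w_star)                        # expect approximately -1 and [1, -1]
print(1 / np.linalg.norm(w_star))            # margin, approximately 0.707
```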

5. Primal QP algorithm for linear-SVM

   1: Let $p = 0_{d+1}$ be the $(d+1)$-vector of zeros and $c = 1_N$ the $N$-vector of ones. Construct $Q = \begin{pmatrix} 0 & 0_d^t \\ 0_d & I_d \end{pmatrix}$ and the signed data matrix $A = \begin{pmatrix} y_1 & y_1 x_1^t \\ \vdots & \vdots \\ y_N & y_N x_N^t \end{pmatrix}$.
   2: Return $\binom{b^*}{w^*} = u^* \leftarrow \mathrm{QP}(Q, p, A, c)$.
   3: The final hypothesis is $g(x) = \mathrm{sign}(w^{*t} x + b^*)$.

   Example: SVM vs PLA. [Figure: scatter of $E_{\mathrm{out}}$(SVM) versus $E_{\mathrm{out}}$(PLA), with $E_{\mathrm{out}}$ on both axes ranging from 0 to about 0.08.] The PLA outcome depends on the ordering of the data (e.g. random); the SVM outcome does not.

   Link to Regularization. Recall regularization: minimize $E_{\mathrm{in}}(w)$ subject to $w^t w \le C$. The optimal hyperplane swaps the roles of the objective and the constraint:
   optimal hyperplane: minimize $w^t w$ subject to $E_{\mathrm{in}} = 0$;
   regularization: minimize $E_{\mathrm{in}}$ subject to $w^t w \le C$.
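A sketch of the primal QP algorithm as a reusable function, again with SLSQP standing in for QP(Q, p, A, c); svm_primal_qp and svm_predict are illustrative names, not the lecture's code.

```python
import numpy as np
from scipy.optimize import minimize

def svm_primal_qp(X, y):
    """Hard-margin linear SVM via the primal QP: minimize (1/2) w^t w over u = (b, w)
    subject to A u >= 1_N, where row n of A is the signed data point y_n * [1, x_n^t]."""
    N, d = X.shape
    Q = np.diag([0.] + [1.] * d)                 # (d+1) x (d+1); zero entry for the bias
    A = y[:, None] * np.c_[np.ones(N), X]        # signed data matrix
    res = minimize(lambda u: 0.5 * u @ Q @ u, x0=np.zeros(d + 1), method="SLSQP",
                   constraints=[{"type": "ineq", "fun": lambda u: A @ u - 1.0}])
    return res.x[0], res.x[1:]                   # b*, w*

def svm_predict(X, b, w):
    """Final hypothesis g(x) = sign(w^t x + b), applied row-wise to X."""
    return np.sign(X @ w + b)
```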

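In the spirit of the SVM vs PLA comparison, a small synthetic experiment (my own setup, not the lecture's) that reuses svm_primal_qp from the sketch above next to a basic PLA whose answer depends on the random order of updates; E_out is estimated on a fresh sample from the same target.

```python
import numpy as np

def pla(X, y, rng, max_iters=10000):
    """Perceptron learning algorithm on separable data; the result depends on update order."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(max_iters):
        wrong = np.where(np.sign(X @ w + b) != y)[0]
        if len(wrong) == 0:
            break
        n = rng.choice(wrong)                     # pick a random misclassified point
        w, b = w + y[n] * X[n], b + y[n]
    return b, w

rng = np.random.default_rng(0)
w_f, b_f = np.array([1.0, -1.0]), 0.2             # an arbitrary linear target
X = rng.uniform(-1, 1, size=(40, 2));  y = np.sign(X @ w_f + b_f)
X_test = rng.uniform(-1, 1, size=(5000, 2));  y_test = np.sign(X_test @ w_f + b_f)

b_svm, w_svm = svm_primal_qp(X, y)                # maximum-margin separator
b_pla, w_pla = pla(X, y, rng)                     # one PLA run (ordering-dependent)
print("E_out(SVM):", np.mean(np.sign(X_test @ w_svm + b_svm) != y_test))
print("E_out(PLA):", np.mean(np.sign(X_test @ w_pla + b_pla) != y_test))
```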