Applied Machine Learning
Perceptron and Support Vector Machines
Siamak Ravanbakhsh
COMP 551 (Winter 2020)
Learning objectives
- geometry of linear classification
- Perceptron learning algorithm
- margin maximization and support vectors
- hinge loss and relation to logistic regression
Perceptron: historically a significant algorithm (the first neural network, or rather just a single neuron)
- biologically motivated model
- simple learning algorithm with a convergence proof
- beginning of connectionist AI
- its criticism in the book "Perceptrons" was a factor in the AI winter

Model: f(x) = sign(w^⊤ x + w_0)

image: https://cs.stanford.edu/people/eroberts/courses/soco/projects/neural-networks/Neuron/index.html

Note that we're using +1/-1 for labels rather than 0/1.
geometry of the decision boundary

The decision boundary is the set of points where y = w^⊤ x + w_0 = w_1 x_1 + w_2 x_2 + w_0 = 0.

This hyperplane has one dimension lower than D (the number of features).

For any two points a and b on the boundary, w^⊤(a − b) + w_0 − w_0 = 0, so w/||w|| is the unit normal vector to the boundary.

The orthogonal component (along w/||w||) of any point b on the boundary is w^⊤ b / ||w|| = −w_0 / ||w||.

Signed distance of any point c from the boundary (c_⊥ is its projection onto the boundary):

w^⊤ c / ||w|| − w^⊤ c_⊥ / ||w|| = (1/||w||)(w^⊤ c + w_0)
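To make the signed-distance formula concrete, here is a minimal numpy sketch (not from the slides; the weights and points are made-up values for illustration):

import numpy as np

# hypothetical hyperplane w^T x + w_0 = 0 and a few 2D points (illustrative values)
w = np.array([2.0, -1.0])
w0 = 0.5
C = np.array([[1.0, 1.0],
              [0.0, 0.0],
              [-1.0, 2.0]])

# signed distance of each point from the boundary: (w^T c + w_0) / ||w||
signed_dist = (C @ w + w0) / np.linalg.norm(w)
print(signed_dist)  # positive on the side w points towards, negative on the other side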
A point is misclassified when the label and prediction have different signs: y^(n) ŷ^(n) < 0; we try to make this product positive.

The distance to the boundary, −(1/||w||) y^(n)(w^⊤ x^(n) + w_0), is positive for points that are on the wrong side. Minimizing it is equivalent to minimizing −y^(n)(w^⊤ x^(n) + w_0).

So the Perceptron tries to minimize the distance of misclassified points from the decision boundary and push them to the right side.
revisiting the Perceptron

If y^(n) ŷ^(n) < 0, minimize J_n(w) = −y^(n)(w^⊤ x^(n))   (the bias is now included in w).

Use stochastic gradient descent: ∇J_n(w) = −y^(n) x^(n)

w^{t+1} ← w^{t} − α ∇J_n(w) = w^{t} + α y^(n) x^(n)

The Perceptron uses a learning rate of 1; this is okay because scaling w does not affect the prediction: sign(w^⊤ x) = sign(α w^⊤ x).

Perceptron convergence theorem: the algorithm is guaranteed to converge in a finite number of steps if the data is linearly separable.
Iris dataset (linearly separable case)

def Perceptron(X, y, max_iters):
    N, D = X.shape
    w = np.random.rand(D)
    for t in range(max_iters):
        n = np.random.randint(N)
        yh = np.sign(np.dot(X[n,:], w))
        if yh != y[n]:
            w = w + y[n]*X[n,:]
    return w

Note that the code is not checking for convergence.

Iteration 1: the initial decision boundary w^⊤ x = 0.

Iteration 10: after finding a linear separator, no further updates happen; the final boundary depends on the order of instances (different from all previous methods).
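As a usage sketch (not from the slides), here is one way the Perceptron above might be run on two Iris classes; the feature choice, the -1/+1 recoding, and the appended bias column are assumptions for illustration:

import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
mask = iris.target < 2                       # keep two classes (separable here)
X = iris.data[mask][:, :2]                   # two features, matching the 2D plots
y = np.where(iris.target[mask] == 0, -1, 1)  # labels in {-1, +1}
X = np.column_stack([X, np.ones(len(X))])    # constant column so the bias lives in w

w = Perceptron(X, y, max_iters=1000)
print("training accuracy:", np.mean(np.sign(X @ w) == y))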
Iris dataset (NOT linearly separable case)

The algorithm does not converge: there is always a wrong prediction, so the weights keep being updated.
Problems with the Perceptron:
- cyclic updates if the data is not linearly separable (we could try to make the data separable using additional features, but the data may be inherently noisy)
- even if the data is linearly separable, convergence could take many iterations
- the decision boundary may be suboptimal

Let's fix this last problem first, assuming linear separability.
The margin of a classifier (assuming correct classification) is the distance of the closest point to the decision boundary.

The signed distance of x^(n) is (1/||w||)(w^⊤ x^(n) + w_0); correcting for the sign gives the margin term (1/||w||) y^(n)(w^⊤ x^(n) + w_0), which is positive for correctly classified points.
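As a quick illustration (not from the slides), the margin of a given linear classifier on a dataset can be computed directly from this formula; the names below are hypothetical:

import numpy as np

def margin(X, y, w, w0):
    # smallest signed distance y^(n) (w^T x^(n) + w_0) / ||w|| over the dataset;
    # positive only if every point is correctly classified
    return np.min(y * (X @ w + w0)) / np.linalg.norm(w)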
Find the decision boundary with maximum margin (in the figure, the margin shown is not maximal):

max_{w, w_0} M   subject to   M ≤ (1/||w||_2) y^(n)(w^⊤ x^(n) + w_0)   ∀n

The points for which M = (1/||w||_2) y^(n)(w^⊤ x^(n) + w_0) are the ones that matter in finding the boundary; these are called support vectors.

The max-margin classifier is called the support vector machine (SVM).
If (w*, w_0*) is an optimal solution, then (c w*, c w_0*) is also optimal (same margin); fix the norm of w to avoid this: ||w||_2 = 1/M.

Fixing ||w||_2 = 1/M gives

max_{w, w_0} 1/||w||_2   subject to   1/||w||_2 ≤ (1/||w||_2) y^(n)(w^⊤ x^(n) + w_0)   ∀n

Simplifying, we get the hard-margin SVM objective:

min_{w, w_0} (1/2) ||w||_2^2   subject to   y^(n)(w^⊤ x^(n) + w_0) ≥ 1   ∀n
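For concreteness, a minimal sketch (not from the slides) of solving this hard-margin objective as a quadratic program, assuming the cvxpy package is available and the data is linearly separable:

import numpy as np
import cvxpy as cp

def hard_margin_svm(X, y):
    # X: (N, D) features, y: (N,) labels in {-1, +1}, assumed linearly separable
    N, D = X.shape
    w = cp.Variable(D)
    w0 = cp.Variable()
    objective = cp.Minimize(0.5 * cp.sum_squares(w))         # (1/2) ||w||_2^2
    constraints = [cp.multiply(y, X @ w + w0) >= 1]          # y^(n) (w^T x^(n) + w_0) >= 1
    cp.Problem(objective, constraints).solve()
    return w.value, w0.value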
Back to the problems with the Perceptron: maximizing the hard margin fixes the suboptimal decision boundary; now let's fix the non-separable case by maximizing a soft margin instead.
Allow points inside the margin and on the wrong side, but penalize them.

Instead of the hard constraint y^(n)(w^⊤ x^(n) + w_0) ≥ 1 ∀n, use

y^(n)(w^⊤ x^(n) + w_0) ≥ 1 − ξ^(n)   ∀n

with slack variables ξ^(n) ≥ 0 (one for each n):
- ξ^(n) = 0: the point satisfies the original margin constraint
- 0 < ξ^(n) < 1: correctly classified but inside the margin
- ξ^(n) > 1: incorrectly classified
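As a small illustration (not from the slides), the slack of each point under a given (w, w_0) can be computed and categorized as above, using the optimal slack value derived two slides below; all names here are hypothetical:

import numpy as np

def slacks(X, y, w, w0):
    # xi^(n) = max(0, 1 - y^(n) (w^T x^(n) + w_0))
    return np.maximum(0, 1 - y * (X @ w + w0))

# xi == 0: satisfies the margin, 0 < xi < 1: inside the margin, xi > 1: misclassified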
Soft-margin objective:

min_{w, w_0} (1/2) ||w||_2^2 + γ Σ_n ξ^(n)
subject to  y^(n)(w^⊤ x^(n) + w_0) ≥ 1 − ξ^(n)  ∀n,   ξ^(n) ≥ 0  ∀n

γ is a hyper-parameter that defines the importance of the constraints; for very large γ this becomes similar to the hard-margin SVM.
It would be nice to turn this into an unconstrained optimization.

If a point satisfies the margin, y^(n)(w^⊤ x^(n) + w_0) ≥ 1, the minimum slack is ξ^(n) = 0.
Otherwise, y^(n)(w^⊤ x^(n) + w_0) < 1, and the smallest slack is ξ^(n) = 1 − y^(n)(w^⊤ x^(n) + w_0).

So the optimal slack satisfying both cases is ξ^(n) = max(0, 1 − y^(n)(w^⊤ x^(n) + w_0)).

Replacing ξ^(n) in the objective, we get

min_{w, w_0} (1/2) ||w||_2^2 + γ Σ_n max(0, 1 − y^(n)(w^⊤ x^(n) + w_0))

which is the same as

min_{w, w_0} Σ_n max(0, 1 − y^(n)(w^⊤ x^(n) + w_0)) + (λ/2) ||w||_2^2

L_hinge(y, ŷ) = max(0, 1 − y ŷ) is called the hinge loss, so the soft-margin SVM is doing L2-regularized hinge loss minimization.
Perceptron vs. SVM

The Perceptron loss can be written as Σ_n max(0, −y^(n)(w^⊤ x^(n) + w_0)); it evaluates to zero for correctly classified points.

The SVM objective is Σ_n max(0, 1 − y^(n)(w^⊤ x^(n) + w_0)) + (λ/2) ||w||_2^2, so the 1 inside the max is the difference (plus the regularization).

Perceptron: finds some linear decision boundary, if one exists; uses stochastic gradient descent with a fixed learning rate.
SVM: for small λ, finds the max-margin decision boundary; depending on the formulation, we have many optimization choices.
Cost:

J(w) = Σ_n max(0, 1 − y^(n) w^⊤ x^(n)) + (λ/2) ||w||_2^2   (the bias is now included in w)

Check that the cost function is convex in w(?)

The hinge loss is not smooth (piecewise linear), so we use "stochastic" sub-gradient descent; whenever y^(n) ŷ^(n) < 1 we minimize 1 − y^(n)(w^⊤ x^(n)) + (λ/2) ||w||_2^2, and the update looks like the Perceptron's.

def cost(X, y, w, lamb=1e-3):
    z = np.dot(X, w)
    J = np.mean(np.maximum(0, 1 - y*z)) + lamb * np.dot(w[:-1], w[:-1])/2
    return J

def subgradient(X, y, w, lamb):
    N, D = X.shape
    z = np.dot(X, w)
    violations = np.nonzero(z*y < 1)[0]                 # points violating the margin
    grad = -np.dot(X[violations,:].T, y[violations])/N  # sub-gradient of the mean hinge loss
    grad[:-1] += lamb * w[:-1]                          # L2 term; the bias (last entry) is not regularized
    return grad
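As a quick sanity check (not from the slides), away from the hinge's kink points the sub-gradient above should agree with a finite-difference approximation of the cost; the synthetic data here is an assumption for illustration:

import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(size=(20, 2)), np.ones(20)])  # bias column last
y = np.where(rng.random(20) < 0.5, -1.0, 1.0)
w = rng.normal(size=3)
lamb = 1e-3

g = subgradient(X, y, w, lamb)
eps = 1e-6
g_fd = np.array([(cost(X, y, w + eps*e, lamb) - cost(X, y, w - eps*e, lamb)) / (2*eps)
                 for e in np.eye(3)])
print(np.allclose(g, g_fd, atol=1e-4))  # expect True unless w sits exactly on a kink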
Iris dataset (D=2, linearly separable case)

def SubGradientDescent(X, y, lr=1, eps=1e-18, max_iters=1000, lamb=1e-8):
    N, D = X.shape
    w = np.zeros(D)
    t = 0
    w_old = w + np.inf
    while np.linalg.norm(w - w_old) > eps and t < max_iters:
        g = subgradient(X, y, w, lamb=lamb)
        w_old = w
        w = w - lr*g/np.sqrt(t+1)
        t += 1
    return w

With a small λ (λ = 10^-8) this finds the max-margin boundary; compare to the Perceptron's decision boundary.
Iris dataset (D=2, NOT linearly separable case)

Running the same sub-gradient descent (a variant that stops when the gradient norm drops below eps) with a small λ (λ = 10^-8) gives soft margins; the Perceptron does not converge on this data.
Relation to logistic regression

Recall the simplified logistic regression cost for y ∈ {0, 1}:

J(w) = Σ_{n=1}^N y^(n) log(1 + e^{−z^(n)}) + (1 − y^(n)) log(1 + e^{z^(n)}),   where z^(n) = w^⊤ x^(n)  (includes the bias)

For y ∈ {−1, +1} we can write this as

J(w) = Σ_{n=1}^N log(1 + e^{−y^(n) z^(n)}) + (λ/2) ||w||_2^2   (also added L2 regularization)

Compare to the SVM cost for y ∈ {−1, +1}:

J(w) = Σ_n max(0, 1 − y^(n) z^(n)) + (λ/2) ||w||_2^2

[figure: loss as a function of y z, showing the 0-1 loss L_{0,1}, the squared loss L_2, the scaled cross-entropy loss L_CE (logistic regression), and the hinge loss L_hinge (SVM)]

Both the hinge loss and the (scaled) cross-entropy loss try to approximate the 0-1 loss (accuracy).
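A minimal sketch (not from the slides) of the comparison in the figure above, as a function of the margin y z; scaling the cross-entropy by 1/log 2 so that it passes through 1 at y z = 0 is an assumption about how the plot was scaled:

import numpy as np

yz = np.linspace(-2, 2, 401)               # y * z, the signed margin
L01 = (yz <= 0).astype(float)              # 0-1 loss
Lhinge = np.maximum(0, 1 - yz)             # hinge loss (SVM)
Lce = np.log1p(np.exp(-yz)) / np.log(2)    # scaled cross-entropy (logistic regression)

# both surrogate losses upper-bound the 0-1 loss
print(np.all(Lhinge >= L01), np.all(Lce >= L01))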
Multi-class classification: can we use multiple binary classifiers?

One idea: train C different 1-vs-(C−1) classifiers, z_c(x) = w_[c]^⊤ x.
At test time, choose the class with the highest score: c* = argmax_c z_c(x).
Problems: class imbalance, and it is not clear what it means to compare the values z_c(x).

image credit: Andrew Zisserman
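A minimal sketch (not from the slides) of this 1-vs-(C−1) scheme, reusing the Perceptron defined earlier (any binary linear classifier returning a weight vector would do); it assumes X already contains a bias column and y holds integer labels 0..C−1:

import numpy as np

def one_vs_rest_train(X, y, C, max_iters=1000):
    # one weight vector per class: class c vs. all other classes
    return np.stack([Perceptron(X, np.where(y == c, 1, -1), max_iters) for c in range(C)])

def one_vs_rest_predict(X, W):
    scores = X @ W.T            # z_c(x) = w_[c]^T x for every class c
    return np.argmax(scores, axis=1)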
Another idea: train C(C−1)/2 classifiers, one for each pair of classes.
At test time, choose the class with the highest vote.
Problems: computationally more demanding for large C, and there can be ambiguities in the final classification.
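A minimal sketch (not from the slides) of this pairwise scheme with voting, again reusing the Perceptron above under the same assumptions:

import numpy as np
from itertools import combinations

def one_vs_one_train(X, y, C, max_iters=1000):
    models = {}
    for a, b in combinations(range(C), 2):      # C(C-1)/2 class pairs
        mask = (y == a) | (y == b)
        models[(a, b)] = Perceptron(X[mask], np.where(y[mask] == a, 1, -1), max_iters)
    return models

def one_vs_one_predict(X, models, C):
    votes = np.zeros((X.shape[0], C))
    for (a, b), w in models.items():
        pred = np.sign(X @ w)
        votes[:, a] += (pred > 0)               # classifier votes for a
        votes[:, b] += (pred <= 0)              # otherwise it votes for b
    return np.argmax(votes, axis=1)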
Summary:
- geometry of linear classification
- Perceptron algorithm
- distance to the decision boundary (margin)
- max-margin classification and support vectors
- hard vs. soft SVM and the relation to the Perceptron
- hinge loss and its relation to logistic regression
- some ideas for max-margin multi-class classification