Support Vector Machine II
Machine Learning 10-601B
Seyoung Kim
Many of these slides are derived from Tom Mitchell, Ziv Bar-Joseph. Thanks!

Max margin classifiers
• Instead of fitting all points, focus on boundary points
• Learn a boundary that leads to the largest margin from both sets of points
[Figure: from all the possible boundary lines, this one leads to the largest margin on both sides; the points on the margin are the vectors supporting the boundary]

Support Vector Machines
Two optimization problems: for the separable and non-separable cases
Separable case:
  For all $x$ in class +1: $w^T x + b \ge 1$
  For all $x$ in class -1: $w^T x + b \le -1$
Non-separable case:
  For all $x_i$ in class +1: $w^T x_i + b \ge 1 - \varepsilon_i$
  For all $x_i$ in class -1: $w^T x_i + b \le -1 + \varepsilon_i$
  For all $i$: $\varepsilon_i \ge 0$

Non-linearly separable case
• Instead of minimizing the number of misclassified points we can minimize the distance between these points and their correct plane
The new optimization problem is:
  [Equation: soft-margin objective]
subject to the following inequality constraints:
  For all $x_i$ in class +1: $w^T x_i + b \ge 1 - \varepsilon_i$
  For all $x_i$ in class -1: $w^T x_i + b \le -1 + \varepsilon_i$
Wait. Are we missing something?
[Figure: the +1 plane and the -1 plane, with slacks $\varepsilon_j$ and $\varepsilon_k$ marked]
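The objective on this slide is not preserved in the text. A common way to write the soft-margin problem it describes is sketched below. The trade-off constant $C > 0$ is an assumption (it is not named in these slides), and the two class-wise constraints are merged using labels $y_i \in \{+1, -1\}$.

```latex
\min_{w,\,b,\,\varepsilon} \quad \frac{w^T w}{2} + C \sum_i \varepsilon_i
\qquad \text{s.t.} \qquad
(w^T x_i + b)\, y_i \ge 1 - \varepsilon_i, \quad \varepsilon_i \ge 0 \quad \forall i
```

One likely answer to the slide's closing question is the constraint $\varepsilon_i \ge 0$: without it, the slacks could be made arbitrarily negative and the penalty term would be unbounded below.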

Support Vector Machines
Two optimization problems: for the separable and non-separable cases
Separable case:
  $\min \; \frac{w^T w}{2}$
  For all $x$ in class +1: $w^T x + b \ge 1$
  For all $x$ in class -1: $w^T x + b \le -1$
Non-separable case:
  For all $x_i$ in class +1: $w^T x_i + b \ge 1 - \varepsilon_i$
  For all $x_i$ in class -1: $w^T x_i + b \le -1 + \varepsilon_i$
  For all $i$: $\varepsilon_i \ge 0$
• Instead of solving these QPs directly we will solve a dual formulation of the SVM optimization problem
• The main reason for switching to this type of representation is that it would allow us to use a neat trick that will make our lives easier (and the run time faster)

An alternative (dual) representation of the SVM QP
• We will start with the linearly separable case
• Instead of encoding the correct classification rule as a constraint we will use Lagrange multipliers to encode it as part of our minimization problem
$\min \; \frac{w^T w}{2}$ subject to:
  For all $x$ in class +1: $w^T x + b \ge 1$
  For all $x$ in class -1: $w^T x + b \le -1$
⇓ Why?
$\min \; \frac{w^T w}{2}$ subject to: $(w^T x_i + b)\, y_i \ge 1$
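The "Why?" step can be checked case by case, assuming the labels are encoded as $y_i \in \{+1, -1\}$ (the encoding the merged constraint implies). Multiplying by $y_i = -1$ flips the direction of the inequality, so both class-wise constraints collapse into one:

```latex
\begin{aligned}
y_i = +1 &: \quad (w^T x_i + b)\, y_i = w^T x_i + b \ge 1 \\
y_i = -1 &: \quad (w^T x_i + b)\, y_i = -(w^T x_i + b) \ge 1
  \;\Longleftrightarrow\; w^T x_i + b \le -1
\end{aligned}
```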

An alternative (dual) representation of the SVM QP
$\min \; \frac{w^T w}{2}$ subject to: $(w^T x_i + b)\, y_i \ge 1$
• We will start with the linearly separable case
• Instead of encoding the correct classification rule as a constraint we will use Lagrange multipliers to encode it as part of our minimization problem
Recall that Lagrange multipliers can be applied to turn the following problem:
  $\min_x \; x^2$  s.t. $x \ge b$
to:
  $\min_x \max_\alpha \; x^2 - \alpha (x - b)$  s.t. $\alpha \ge 0$
[Figure: the global min and the allowed min of $x^2$, with $b$ marked on the axis]
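A minimal sketch of the one-dimensional example, assuming the usual KKT reading of the min-max form: stationarity in $x$ gives $2x = \alpha$, and $\alpha (x - b) = 0$ with $\alpha \ge 0$. The function names are hypothetical, and the grid search is only a sanity check on the closed-form answer.

```python
def solve_constrained_min(b):
    """Minimize x**2 subject to x >= b using the KKT conditions.

    Stationarity of x**2 - alpha*(x - b) in x gives 2x = alpha;
    complementary slackness gives alpha * (x - b) = 0 with alpha >= 0.
    Returns (x_star, alpha_star).
    """
    if b <= 0:
        # Constraint inactive: the global min x = 0 is already allowed.
        return 0.0, 0.0
    # Constraint active: the allowed min sits on the boundary x = b.
    return float(b), 2.0 * b


def brute_force_min(b, lo=-5.0, hi=5.0, steps=10001):
    """Grid search over the feasible region x >= b (sanity check only)."""
    best_x, best_val = None, float("inf")
    for k in range(steps):
        x = lo + (hi - lo) * k / (steps - 1)
        if x >= b and x * x < best_val:
            best_x, best_val = x, x * x
    return best_x


for b in (-1.0, 2.0):
    x_star, alpha_star = solve_constrained_min(b)
    print("b =", b, "x* =", x_star, "alpha* =", alpha_star,
          "grid check =", brute_force_min(b))
```

When $b \le 0$ the multiplier is zero and the constraint plays no role; when $b > 0$ the multiplier is positive and the solution is pinned to the boundary. The same pattern reappears in the SVM dual, where only some constraints end up with non-zero multipliers.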

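Lagrange multiplier for SVMs
Original (primal) formulation:
  $\min \; \frac{w^T w}{2}$ subject to: $(w^T x_i + b)\, y_i \ge 1$
Dual formulation:
  $\min_{w,b} \max_{\alpha} \; \frac{w^T w}{2} - \sum_i \alpha_i \left[ (w^T x_i + b)\, y_i - 1 \right]$, with $\alpha_i \ge 0 \;\; \forall i$
w: primal parameters
$\alpha_i$'s: dual parameters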

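Lagrange multiplier for SVMs
Original (primal) formulation:
  $\min \; \frac{w^T w}{2}$ subject to: $(w^T x_i + b)\, y_i \ge 1$
Dual formulation:
  $\min_{w,b} \max_{\alpha} \; \frac{w^T w}{2} - \sum_i \alpha_i \left[ (w^T x_i + b)\, y_i - 1 \right]$, with $\alpha_i \ge 0 \;\; \forall i$
Using this new formulation we can derive w and b by taking the derivative w.r.t. w, leading to:
  $w = \sum_i \alpha_i x_i y_i$, where $\alpha_i \ge 0$
Taking the derivative w.r.t. b we get:
  $\sum_i \alpha_i y_i = 0$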
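The two derivative steps can be written out as follows. The symbol $L$ for the inner objective is introduced here for illustration; the form of $L$ follows the one-dimensional Lagrange multiplier example, with one multiplier $\alpha_i$ per constraint.

```latex
\begin{aligned}
L(w, b, \alpha) &= \frac{w^T w}{2} - \sum_i \alpha_i \left[ (w^T x_i + b)\, y_i - 1 \right] \\
\frac{\partial L}{\partial w} &= w - \sum_i \alpha_i y_i x_i = 0
  \;\Longrightarrow\; w = \sum_i \alpha_i x_i y_i \\
\frac{\partial L}{\partial b} &= -\sum_i \alpha_i y_i = 0
  \;\Longrightarrow\; \sum_i \alpha_i y_i = 0
\end{aligned}
```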

Lagrange multiplier for SVMs
Original (primal) formulation:
  $\min \; \frac{w^T w}{2}$ subject to: $(w^T x_i + b)\, y_i \ge 1$
Dual formulation:
  $\min_{w,b} \max_{\alpha} \; \frac{w^T w}{2} - \sum_i \alpha_i \left[ (w^T x_i + b)\, y_i - 1 \right]$, with $\alpha_i \ge 0 \;\; \forall i$
Using this new formulation we can derive w and b by taking the derivative w.r.t. w, leading to:
  $w = \sum_i \alpha_i x_i y_i$, where $\alpha_i \ge 0$
Taking the derivative w.r.t. b we get:
  $\sum_i \alpha_i y_i = 0$
Substituting w into our target function and using the additional constraint we get:
Dual formulation:
  $\max_{\alpha} \; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j$
  subject to: $\sum_i \alpha_i y_i = 0$, $\alpha_i \ge 0 \;\; \forall i$
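The substitution step, spelled out under the two conditions $w = \sum_i \alpha_i x_i y_i$ and $\sum_i \alpha_i y_i = 0$. The first line expands the bracket, the second applies both conditions, and the third rewrites $w^T w$ as a double sum:

```latex
\begin{aligned}
\frac{w^T w}{2} - \sum_i \alpha_i \left[ (w^T x_i + b)\, y_i - 1 \right]
&= \frac{w^T w}{2} - w^T \sum_i \alpha_i y_i x_i - b \sum_i \alpha_i y_i + \sum_i \alpha_i \\
&= \frac{w^T w}{2} - w^T w - 0 + \sum_i \alpha_i \\
&= \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j
\end{aligned}
```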

Dual SVM - interpretation
For $\alpha$'s that are not 0, the corresponding points $x_i$ are the support vectors
[Figure: decision boundary with the support vectors marked]
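A small sketch of this interpretation on a hypothetical three-point data set. The `alpha` values are not produced by a solver; they were derived by hand for this data and can be confirmed through the KKT conditions (every margin is at least 1, and it equals 1 wherever $\alpha_i > 0$). The helper names are illustrative.

```python
# Toy 2-D data set (hypothetical, chosen so the answer can be checked by hand).
X = [(2.0, 0.0), (4.0, 0.0), (0.0, 0.0)]
y = [+1, +1, -1]
# Hand-derived dual solution for this data: only the two closest
# opposite-class points get a non-zero alpha.
alpha = [0.5, 0.0, 0.5]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def recover_w(alpha, X, y):
    """w = sum_i alpha_i * y_i * x_i."""
    dim = len(X[0])
    return tuple(sum(a * yi * xi[d] for a, yi, xi in zip(alpha, y, X))
                 for d in range(dim))


def recover_b(alpha, X, y, w):
    """For any support vector s, y_s * (w.x_s + b) = 1, so b = y_s - w.x_s."""
    s = next(i for i, a in enumerate(alpha) if a > 0)
    return y[s] - dot(w, X[s])


w = recover_w(alpha, X, y)
b = recover_b(alpha, X, y, w)
support = [i for i, a in enumerate(alpha) if a > 0]
margins = [yi * (dot(w, xi) + b) for xi, yi in zip(X, y)]
print("w =", w, "b =", b)
print("support vector indices:", support)
print("margins y_i (w.x_i + b):", margins)
```

The point at (4, 0) lies well outside the margin, gets $\alpha = 0$, and could be removed from the training set without changing `w` or `b`.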

Computational Cost
• During training, the computational costs for solving primal vs. dual problems are
Primal problem (m parameters):
  $\min \; \frac{w^T w}{2}$ subject to: $(w^T x_i + b)\, y_i \ge 1$
Dual problem (n parameters; dot product for all training samples):
  $\max_{\alpha} \; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j$
  subject to: $\sum_i \alpha_i y_i = 0$, $\alpha_i \ge 0 \;\; \forall i$
- The cost of the QP solver depends on #variables
- Often, n < m, where n = #samples, m = #input features
  -> Solving the dual is often more efficient
- Even when n > m, working with the dual allows you to use kernels!
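A sketch of why the dual depends on n rather than m: the training data enters the dual objective only through the n-by-n table of dot products $x_i^T x_j$. The data, the `alpha` values, and the function names below are hypothetical, and no QP solver is involved; the code only evaluates the objective at a given point.

```python
def gram_matrix(X):
    """All pairwise dot products x_i^T x_j: an n-by-n table for n samples."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]


def dual_objective(alpha, y, K):
    """sum_i alpha_i - 1/2 * sum_{i,j} alpha_i alpha_j y_i y_j K[i][j]."""
    n = len(alpha)
    quad = sum(alpha[i] * alpha[j] * y[i] * y[j] * K[i][j]
               for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad


# Hypothetical data: n = 3 samples, m = 5 input features.
X = [(1.0, 0.0, 2.0, 0.0, 1.0),
     (0.0, 1.0, 0.0, 1.0, 0.0),
     (1.0, 1.0, 1.0, 1.0, 1.0)]
y = [+1, -1, +1]
K = gram_matrix(X)
alpha = [0.1, 0.3, 0.2]  # arbitrary point with sum_i alpha_i * y_i = 0 (up to rounding)

print(len(K), "x", len(K[0]), "Gram matrix;", len(alpha), "dual variables")
print("dual objective:", dual_objective(alpha, y, K))
```

Once `K` is computed, the feature dimension m no longer appears anywhere in the objective, which is also what makes it possible to swap the dot product for a kernel.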

Computational Cost
• During testing, the computational costs using primal vs. dual representations are
Using primal variables:
  $y_{new} = \mathrm{sign}(w^T x_{new} + b)$
  m operations
Using dual variables:
  $y_{new} = \mathrm{sign}\left( \sum_i \alpha_i y_i x_i^T x_{new} + b \right)$
  Dot product with all training samples? mr operations, where r is the number of support vectors ($\alpha_i > 0$)
If one uses dual parameters to make predictions, the prediction depends only on the support vectors, but this is not explicitly represented in the primal
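The two prediction rules can be compared side by side. This is a sketch on a hypothetical toy model whose `w` satisfies $w = \sum_i \alpha_i y_i x_i$; the convention that `sign(0)` returns +1 is an assumption made here so the function is defined everywhere.

```python
def sign(v):
    # Tie-breaking at exactly 0 is an assumption of this sketch.
    return 1 if v >= 0 else -1


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def predict_primal(w, b, x_new):
    """One dot product of length m."""
    return sign(dot(w, x_new) + b)


def predict_dual(alpha, X, y, b, x_new):
    """One dot product per support vector; points with alpha_i = 0 are skipped."""
    total = sum(a * yi * dot(xi, x_new)
                for a, yi, xi in zip(alpha, y, X) if a > 0)
    return sign(total + b)


# Hypothetical toy model: two support vectors out of three training points.
X = [(2.0, 0.0), (4.0, 0.0), (0.0, 0.0)]
y = [+1, +1, -1]
alpha = [0.5, 0.0, 0.5]
w, b = (1.0, 0.0), -1.0

for x_new in [(3.0, 1.0), (0.5, -2.0)]:
    print(x_new, predict_primal(w, b, x_new), predict_dual(alpha, X, y, b, x_new))
```

Both rules give the same label for every input. The dual version only loops over the r points with $\alpha_i > 0$, which is where the mr operation count comes from.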

Dual formulation for non-linearly separable case
For all $x_i$ in class +1: $w^T x_i + b \ge 1 - \varepsilon_i$
For all $x_i$ in class -1: $w^T x_i + b \le -1 + \varepsilon_i$
For all $i$: $\varepsilon_i \ge 0$
⇓
$(w^T x_i + b)\, y_i \ge 1 - \varepsilon_i$
$\varepsilon_i \ge 0$

Dual formulation for non-linearly separable case
Dual target function:
  [Equation: dual objective and constraints]
To evaluate a new sample $x_j$ we need to compute:
  $w^T x_j + b = \sum_i \alpha_i y_i x_i^T x_j + b$
The only difference is that the $\alpha_i$'s are now bounded
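The dual target function on this slide is not preserved in the text. In the standard soft-margin derivation it keeps the same objective as the separable case, and each $\alpha_i$ gains an upper bound. The bound is written here as $C$, an assumed name for the slack penalty constant, which these slides do not spell out.

```latex
\max_{\alpha} \; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j
\qquad \text{s.t.} \qquad
\sum_i \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C \quad \forall i
```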

Dual SVM - Interpretation for Non-linearly Separable Case
Support vectors: data points on the wrong side of the margin
For $\alpha$'s that are not 0
[Figure: the +1 and -1 planes with the support vectors marked]

Error Function for SVM
Let $t = (w^T x_i + b)\, y_i$
t > 0 for both positive and negative training samples if classified correctly
Ideal classifier:
  Error(t) = 0 if t > 0
  Error(t) = 1 if t < 0
SVM:
  Error(t) = $[1 - t]_+$ (Hinge Loss)
  $[\,\cdot\,]_+$ denotes positive part
[Figure: Error(t) against t for the ideal classifier and the SVM, with ticks at 0 and 1]
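Both error functions are short enough to write directly. In this sketch the function names are illustrative, and treating $t = 0$ as an error for the ideal classifier is an assumption, since the slide leaves that case undefined.

```python
def zero_one_error(t):
    """Ideal classifier error: 0 if t > 0, 1 if t < 0.

    The slide does not define t == 0; it is counted as an error here.
    """
    return 0 if t > 0 else 1


def hinge_loss(t):
    """SVM error [1 - t]_+ : the positive part of 1 - t."""
    return max(0.0, 1.0 - t)


for t in (-1.0, 0.0, 0.5, 1.0, 2.0):
    print("t =", t, "ideal =", zero_one_error(t), "hinge =", hinge_loss(t))
```

A correctly classified point inside the margin (0 < t < 1) has zero ideal error but a positive hinge loss, so the SVM still penalizes it.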

FROM LINEAR TO NON-LINEAR DECISION BOUNDARY
