Today: Perceptron. Support Vector Machine.

[Figure: + and − points inside the unit ball, a separating hyperplane with margin γ, a point at angle θ to w.]

Labelled points x_1, ..., x_n.
Hyperplane separator.
Margins.
Points inside the unit ball.
Margin γ.
Hyperplane: w · x ≥ γ for + points, w · x ≤ −γ for − points.
Put points on the unit ball: w · x = cos θ.
Will assume positive labels! (Negate the negative points.)
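A minimal preprocessing sketch of this setup (my code, not the slides': scale into the unit ball, negate the negatives, measure the margin; assumes NumPy arrays X, y with y ∈ {±1}):

```python
import numpy as np

def preprocess(X, y):
    """Scale the points into the unit ball and negate the negative
    examples, so a separator w with margin gamma satisfies
    w . x >= gamma for every (preprocessed) point."""
    X = X / np.max(np.linalg.norm(X, axis=1))   # every |x_i| <= 1 now
    return X * y[:, None]                       # flip the - points

def margin(w, Z):
    """Margin of the separator w, normalized to unit length,
    on preprocessed points Z."""
    return np.min(Z @ (w / np.linalg.norm(w)))
```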
Perceptron Algorithm

An aside: a hyperplane is a perceptron (a single-layer neural network).

Alg: Given x_1, ..., x_n. Let w_1 = x_1.
For each x_i where w_t · x_i has the wrong sign (negative):
    w_{t+1} = w_t + x_i
    t = t + 1

Theorem: the algorithm makes at most 1/γ² mistakes.

Idea: Mistake on a positive x_i: w_{t+1} · x_i = (w_t + x_i) · x_i = w_t · x_i + 1, since |x_i| = 1.
A step in the right direction!

Claim 1: w_{t+1} · w ≥ w_t · w + γ, where w is the unit γ-margin separator.
A γ in the right direction!
Mistake on a positive x_i: w_{t+1} · w = (w_t + x_i) · w = w_t · w + x_i · w ≥ w_t · w + γ.
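As code, the whole algorithm is a few lines (a sketch under the slides' assumptions: preprocessed points with |x_i| ≤ 1, negatives negated; the pass structure and names are mine):

```python
import numpy as np

def perceptron(Z, max_passes=1000):
    """Perceptron on preprocessed points Z (negatives already negated,
    so every dot product should come out positive)."""
    w = Z[0].astype(float)             # w_1 = x_1
    mistakes = 0
    for _ in range(max_passes):
        clean_pass = True
        for z in Z:
            if w @ z <= 0:             # wrong sign: a mistake
                w += z                 # w_{t+1} = w_t + x_i
                mistakes += 1
                clean_pass = False
        if clean_pass:                 # no mistakes: w separates the data
            break
    return w, mistakes                 # theorem: mistakes <= 1/gamma^2
```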
Perceptron Algorithm: Claim 2

Claim 2: |w_{t+1}|² ≤ |w_t|² + 1.

[Figure: the update triangle with sides w_t, x_i, and w_{t+1} = w_t + x_i.]

Geometrically: w_{t+1} = w_t + x_i, and the angle of the triangle where x_i is appended to w_t is less than a right angle!
→ |w_{t+1}|² ≤ |w_t|² + |x_i|² ≤ |w_t|² + 1.

Algebraically: on a mistake for a positive x_i, w_t · x_i ≤ 0, so
|w_t + x_i|² = |w_t|² + 2 w_t · x_i + |x_i|² ≤ |w_t|² + |x_i|² = |w_t|² + 1.

Claim 2 holds even if there is no separating hyperplane!
Putting it together...

Claim 1: w_{t+1} · w ≥ w_t · w + γ.
Claim 2: |w_{t+1}|² ≤ |w_t|² + 1.
M = number of mistakes the algorithm makes.
γM ≤ w_{M+1} · w ≤ |w_{M+1}| ≤ √M   (the middle step is Cauchy–Schwarz with |w| = 1).
→ M ≤ 1/γ².
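Written out as one chain:

```latex
\gamma M \;\le\; w_{M+1}\cdot w
         \;\le\; |w_{M+1}|\,|w|       % Cauchy--Schwarz
         \;=\;   |w_{M+1}|            % |w| = 1
         \;\le\; \sqrt{M}
\quad\Longrightarrow\quad
\gamma^{2} M^{2} \le M
\quad\Longrightarrow\quad
M \le \frac{1}{\gamma^{2}}.
```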
Hinge Loss.

Most of the data has a good separator.
Claim 1: w_{t+1} · w ≥ w_t · w + γ.
On the bad points we don't make progress, or we tilt the wrong way. How much bad tilting?
Rotate the points to have a γ-margin. Total rotation: TD_γ.
Analysis: subtract the bad tilting part.
Claim 1 (modified): w_{t+1} · w ≥ w_t · w + γ − (rotation for x_{i_t}).
So w_M · w ≥ γM − TD_γ.
Plus Claim 2 → γM − TD_γ ≤ √M.
Quadratic inequality: γ²M² − (2γ TD_γ + 1)M + TD_γ² ≤ 0.
Uh... One implication: M ≤ 1/γ² + (2/γ) TD_γ.
The extra term is (twice) the amount of rotation in units of γ. Hinge loss: (1/γ) TD_γ.
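The step hidden behind "Uh..." is the quadratic formula plus one loosening; a worked version (the loosening is √(4γTD_γ + 1) ≤ 2γTD_γ + 1, since squaring the right-hand side only adds nonnegative terms):

```latex
\gamma M - TD_{\gamma} \le \sqrt{M}
\;\Longrightarrow\;
\gamma^{2}M^{2} - (2\gamma\, TD_{\gamma} + 1)M + TD_{\gamma}^{2} \le 0
\;\Longrightarrow\;
M \le \frac{(2\gamma\, TD_{\gamma}+1) + \sqrt{4\gamma\, TD_{\gamma}+1}}{2\gamma^{2}}
  \le \frac{1}{\gamma^{2}} + \frac{2}{\gamma}\, TD_{\gamma}.
```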
Approximately Maximizing Margin Algorithm

There is a γ-margin separating hyperplane. Find it! (Kind of.)
Any point within γ/2 of the current hyperplane is still treated as a mistake.
Let w_1 = x_1.
For each x_2, ..., x_n: if (w_t · x_i)/|w_t| < γ/2, then w_{t+1} = w_t + x_i, t = t + 1.
Claim 1: w_{t+1} · w ≥ w_t · w + γ. Same (ish) as before: every point has x_i · w ≥ γ, updated or not.
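A sketch of this variant (the normalized test is my reading of the slides' w_t · x_i < γ/2; assumes preprocessed points as before):

```python
import numpy as np

def margin_perceptron(Z, gamma, max_passes=1000):
    """Update on any point within gamma/2 of the current hyperplane,
    not just on outright mistakes."""
    w = Z[0].astype(float)                               # w_1 = x_1
    for _ in range(max_passes):
        clean_pass = True
        for z in Z:
            if w @ z < (gamma / 2) * np.linalg.norm(w):  # margin too small
                w += z                                   # w_{t+1} = w_t + x_i
                clean_pass = False
        if clean_pass:
            break
    return w / np.linalg.norm(w)   # separator with margin >= gamma/2 on the data
```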
Margin Approximation: Claim 2

Claim 2(?): |w_{t+1}|² ≤ |w_t|² + 1??

[Figure: w_t; a point x_i whose component along w_t is < γ/2; the update w_{t+1}; the vector v.]

We now add x_i to w_t even if x_i points in the correct direction, so the update triangle can be obtuse and Claim 2 can fail.
Split the update: let v be w_t plus the part of x_i orthogonal to w_t.
|v|² ≤ |w_t|² + 1 → |v| ≤ |w_t| + 1/(2|w_t|)   (square the right-hand side).
The red bit, the component of x_i along w_t, is at most γ/2.
Together: |w_{t+1}| ≤ |w_t| + 1/(2|w_t|) + γ/2.
If |w_t| ≥ 2/γ, then |w_{t+1}| ≤ |w_t| + (3/4)γ.
After M updates: |w_M| ≤ 2/γ + (3/4)γM.
Claim 1 implies |w_M| ≥ w_M · w ≥ γM.
γM ≤ 2/γ + (3/4)γM → M ≤ 8/γ².
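The two bounds used above, written out (the first squares the right-hand side: (a + 1/(2a))² = a² + 1 + 1/(4a²) ≥ a² + 1):

```latex
|v| \le \sqrt{|w_t|^2 + 1} \le |w_t| + \frac{1}{2|w_t|},
\qquad
|w_{t+1}| \le |v| + \frac{\gamma}{2} \le |w_t| + \frac{3\gamma}{4}
\quad\text{once } |w_t| \ge \frac{2}{\gamma};
\]
\[
\gamma M \le |w_M| \le \frac{2}{\gamma} + \frac{3\gamma}{4}M
\;\Longrightarrow\;
\frac{\gamma}{4}\, M \le \frac{2}{\gamma}
\;\Longrightarrow\;
M \le \frac{8}{\gamma^{2}}.
```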
Other fat separators?

[Figure: + points ringed by − points in the plane; the same points lifted to three dimensions, with vertical axis x² + y².]

No hyperplane separator. Circle separator!
Map the points to three dimensions: map point (x,y) to point (x, y, x² + y²).
Hyperplane separator in three dimensions.
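A quick numerical check of this lifting (a sketch; the sampled data and the plane z = 1 are my choices, not the slides'):

```python
import numpy as np

def lift(P):
    """Map each 2-d point (x, y) to (x, y, x^2 + y^2)."""
    return np.column_stack([P, (P ** 2).sum(axis=1)])

rng = np.random.default_rng(0)
P = rng.uniform(-2, 2, size=(200, 2))
labels = np.where((P ** 2).sum(axis=1) < 1, 1, -1)   # + inside the unit circle

L = lift(P)
w, b = np.array([0.0, 0.0, -1.0]), 1.0               # the plane z = 1
assert np.all((L @ w + b > 0) == (labels == 1))      # linearly separated in 3-d
```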
Kernel Functions.

Map x to φ(x). Good separator for the points under φ(·).
Problem: the complexity of computing in the higher-dimensional space.
Recall the perceptron: it only computes dot products!
Test: w_t · x_i > γ, where w_t = x_{i_1} + x_{i_2} + x_{i_3} + ···
Support vectors x_{i_1}, x_{i_2}, ... → Support Vector Machine.
Kernel trick: compute the dot products in the original space.
Kernel function for the mapping φ(·): K(x,y) = φ(x) · φ(y).
K(x,y) = (1 + x · y)^d: φ(x) = [1, ..., x_i, ..., x_i x_j, ...]. Polynomial.
K(x,y) = (1 + x_1 y_1)(1 + x_2 y_2) ··· (1 + x_n y_n): φ(x) = products of all subsets of coordinates.
K(x,y) = exp(−C|x − y|²): infinite-dimensional space. Gaussian kernel.
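A sketch of the kernelized perceptron (my variable names; w_t is kept implicitly as mistake counts α, so the test only ever calls K on original points):

```python
import numpy as np

def kernel_perceptron(X, y, K, max_passes=100):
    """Kernelized perceptron: w_t is a sum of (label-signed) mistake
    points, so w_t . phi(x_i) is a sum of kernel evaluations."""
    n = X.shape[0]
    G = np.array([[K(X[i], X[j]) for j in range(n)] for i in range(n)])
    G = G * np.outer(y, y)             # fold in the label negations
    alpha = np.zeros(n)                # alpha[j] = times x_j was added
    alpha[0] = 1                       # w_1 = x_1
    for _ in range(max_passes):
        clean_pass = True
        for i in range(n):
            if alpha @ G[:, i] <= 0:   # w_t . x_i has the wrong sign
                alpha[i] += 1          # w_{t+1} = w_t + x_i
                clean_pass = False
        if clean_pass:
            break
    return alpha                       # nonzeros mark the support vectors

poly = lambda x, z, d=2: (1 + x @ z) ** d   # K(x,y) = (1 + x.y)^d
# usage: alpha = kernel_perceptron(X, y, poly)
```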
Video

http://www.youtube.com/watch?v=3liCbRZPrZA
Support Vector Machine

Pick a kernel. Run an algorithm that:
(1) uses only dot products;
(2) outputs a hyperplane that is a linear combination of the points.
Perceptron does both.
Max-margin problem as convex optimization: min |w|² where w · x_i ≥ 1 for all i. (The margin is then 1/|w|.)
Algorithms output tight hyperplanes!
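One way to pose the slide's convex program directly (a sketch using cvxpy, my choice of solver; any QP solver would do, and here the labels are kept explicit instead of pre-negating points):

```python
import cvxpy as cp
import numpy as np

def max_margin(X, y):
    """Solve min |w|^2 subject to y_i (w . x_i) >= 1.
    The returned separator has margin 1/|w|; the constraints are
    tight exactly at the support vectors."""
    n, d = X.shape
    w = cp.Variable(d)
    constraints = [cp.multiply(y, X @ w) >= 1]
    prob = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
    prob.solve()
    return w.value
```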