SLIDE 2

Today

Perceptron. Support Vector Machine.

SLIDE 17

[Figure: + and − points inside the unit ball, a separating hyperplane with normal w, margin γ, and angle θ between w and a point x.]

Labelled points with x1, ..., xn.
Hyperplane separator. Margins. Inside unit ball.
Margin γ. Hyperplane: w·x ≥ γ for + points, w·x ≤ −γ for − points.
Put points on the unit ball: w·x = cos θ.
Will assume positive labels! (Negate the negative points.)
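
A minimal sketch of this preprocessing in Python (the data here is made up; dividing by the largest norm puts every point inside the unit ball, and multiplying by the label "negates the negative" points):

```python
import numpy as np

# Hypothetical labelled points: rows of X, with +1/-1 labels in y.
X = np.array([[2.0, 1.0], [1.5, 2.5], [-1.0, -3.0], [-2.0, -0.5]])
y = np.array([+1, +1, -1, -1])

# Scale all points by the largest norm so every point lies inside the unit ball.
X_ball = X / np.linalg.norm(X, axis=1).max()

# Fold the labels in: after this, a separator only needs w . x >= gamma
# for every transformed point.
X_pos = y[:, None] * X_ball
```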

SLIDE 34

Perceptron Algorithm

An aside: a hyperplane is a perceptron (a single-layer neural network).

Alg: Given x1, ..., xn. Let w1 = x1.
For each xi: if wt·xi has the wrong sign (negative), then wt+1 = wt + xi and t = t + 1.

Theorem: The algorithm makes at most 1/γ² mistakes.

Idea: Mistake on a positive xi:
wt+1·xi = (wt + xi)·xi = wt·xi + 1. A step in the right direction!

Claim 1: wt+1·w ≥ wt·w + γ. A γ in the right direction!
Mistake on a positive xi:
wt+1·w = (wt + xi)·w = wt·w + xi·w ≥ wt·w + γ.

SLIDE 49

Alg: Given x1, ..., xn. Let w1 = x1.
For each xi: if wt·xi has the wrong sign (negative), then wt+1 = wt + xi and t = t + 1.

Claim 2: |wt+1|² ≤ |wt|² + 1.

[Figure: the triangle formed by wt, xi, and wt+1 = wt + xi.]

Geometrically: the angle between the sides wt and xi of the triangle is at most a right angle (since wt·xi ≤ 0),
so |wt+1|² ≤ |wt|² + |xi|² ≤ |wt|² + 1.

Algebraically: for a positive xi on which we make a mistake, wt·xi ≤ 0, so
(wt + xi)² = |wt|² + 2wt·xi + |xi|² ≤ |wt|² + |xi|² = |wt|² + 1.

Claim 2 holds even if there is no separating hyperplane!

SLIDE 57

Putting it together...

Claim 1: wt+1·w ≥ wt·w + γ.
Claim 2: |wt+1|² ≤ |wt|² + 1.

M = number of mistakes made by the algorithm.

γM ≤ wt+1·w ≤ |wt+1| ≤ √M (using |w| = 1)  →  M ≤ 1/γ².
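
Written out as one chain (a restatement of the bound just derived; the only ingredients are Claim 1, Claim 2, w being a unit vector, and squaring the ends):

```latex
\gamma M \;\le\; w_{t+1}\cdot w \;\le\; \|w_{t+1}\| \;\le\; \sqrt{M}
\quad\Longrightarrow\quad
\gamma^{2} M^{2} \;\le\; M
\quad\Longrightarrow\quad
M \;\le\; \frac{1}{\gamma^{2}}.
```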

SLIDE 77

Hinge Loss.

Most of the data has a good separator.
Claim 1: wt+1·w ≥ wt·w + γ.
Some points don't make progress, or tilt w the wrong way. How much bad tilting?
Rotate points to have a γ-margin. Total rotation: TDγ.
Analysis: subtract the bad tilting part.
Claim 1 (modified): wt+1·w ≥ wt·w + γ − (rotation for xit).
Summing over M mistakes: wM·w ≥ γM − TDγ. With Claim 2: γM − TDγ ≤ √M.
Quadratic inequality: γ²M² − (2γ·TDγ + 1)M + TDγ² ≤ 0.
Uh... One implication: M ≤ 1/γ² + (2/γ)·TDγ.
The extra term is (twice) the amount of rotation in units of γ. Hinge loss: (1/γ)·TDγ.
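
For reference, one standard way to make TDγ and the hinge loss precise (this definition is a gloss consistent with the slide, not stated on it): TDγ is the total amount by which examples fall short of the γ-margin under the target hyperplane w, and dividing by γ gives the hinge loss.

```latex
TD_{\gamma} \;=\; \sum_{i=1}^{n} \max\bigl(0,\; \gamma - w\cdot x_i\bigr),
\qquad
\text{hinge loss} \;=\; \frac{TD_{\gamma}}{\gamma}
\;=\; \sum_{i=1}^{n} \max\Bigl(0,\; 1 - \frac{w\cdot x_i}{\gamma}\Bigr).
```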

SLIDE 88

Approximately Maximizing Margin Algorithm

There is a separating hyperplane with margin γ. Find it! (Kind of.)
Any point within γ/2 of the hyperplane still counts as a mistake.

Let w1 = x1.
For each x2, ..., xn: if wt·xi < γ/2, then wt+1 = wt + xi and t = t + 1.

Claim 1: wt+1·w ≥ wt·w + γ/2. Same (ish) as before.
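
A minimal Python sketch of this variant (again an illustration under the slides' assumptions, not the lecture's code); the only change from the plain perceptron is the update test wt·xi < γ/2:

```python
import numpy as np

def margin_perceptron(X, gamma, max_passes=1000):
    """Perceptron that keeps updating until w . x >= gamma/2 for every point.

    X: points with labels folded in; gamma: the margin of the assumed separator.
    Returns the final weight vector and the number of updates made.
    """
    w = X[0].copy()                    # w1 = x1
    updates = 0
    for _ in range(max_passes):
        clean_pass = True
        for x in X:
            if w @ x < gamma / 2:      # within gamma/2 (or misclassified): still a mistake
                w += x                 # w_{t+1} = w_t + x_i
                updates += 1
                clean_pass = False
        if clean_pass:
            break
    return w, updates
```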

SLIDE 106

Margin Approximation: Claim 2

Claim 2(?): |wt+1|² ≤ |wt|² + 1??

[Figure: wt, a point xi with wt·xi < γ/2, the update wt+1 = wt + xi, and an auxiliary vector v (wt plus the part of xi perpendicular to wt).]

We now add xi to wt even if xi points in the correct direction. Obtuse triangle.
|v|² ≤ |wt|² + 1 → |v| ≤ |wt| + 1/(2|wt|)  (square the right-hand side to check).
The red bit (the component of xi along wt) is at most γ/2.
Together: |wt+1| ≤ |wt| + 1/(2|wt|) + γ/2.
If |wt| ≥ 2/γ, then |wt+1| ≤ |wt| + (3/4)γ.
After M updates: |wM| ≤ 2/γ + (3/4)γM.
Claim 1 implies |wM| ≥ γM/2.
γM/2 ≤ 2/γ + (3/4)γM → M ≤ 8/γ².
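
A worked version of the final step (my arithmetic, under the reading that each update adds the full γ to wt·w, since every point has margin γ under the target w, so that |wM| ≥ γM):

```latex
\gamma M \;\le\; |w_M| \;\le\; \frac{2}{\gamma} + \frac{3}{4}\gamma M
\quad\Longrightarrow\quad
\frac{\gamma}{4} M \;\le\; \frac{2}{\gamma}
\quad\Longrightarrow\quad
M \;\le\; \frac{8}{\gamma^{2}}.
```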

SLIDE 113

Other fat separators?

[Figure: − points surrounding a cluster of + points in the plane (axes x, y), and the same points replotted with a third coordinate x² + y².]

No hyperplane separator. Circle separator!
Map points to three dimensions: map point (x, y) to (x, y, x² + y²).
Hyperplane separator in three dimensions.
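
A minimal sketch of this lift in Python (made-up points: + points near the origin, − points on a surrounding ring, so in the lifted space the plane z = 1 separates them):

```python
import numpy as np

# Hypothetical data: + points near the origin, - points on a ring of radius 2.
plus = np.array([[0.2, 0.1], [-0.3, 0.2], [0.1, -0.2]])
angles = np.linspace(0.0, 2 * np.pi, 8, endpoint=False)
minus = 2.0 * np.column_stack([np.cos(angles), np.sin(angles)])

def lift(P):
    """Map each point (x, y) to (x, y, x^2 + y^2)."""
    return np.column_stack([P, (P ** 2).sum(axis=1)])

# Every lifted + point has third coordinate < 1, every lifted - point has 4,
# so the hyperplane z = 1 separates the two classes in three dimensions.
print(lift(plus)[:, 2], lift(minus)[:, 2])
```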

SLIDE 132

Kernel Functions.

Map x to φ(x). Good separator for the points under φ(·).
Problem: the complexity of computing in the higher-dimensional space.

Recall the perceptron: it only computes dot products!
Test: wt·xi > γ, where wt = xi1 + xi2 + xi3 + ···
Support vectors xi1, xi2, ... → Support Vector Machine.

Kernel trick: compute the dot products in the original space.
Kernel function for the mapping φ(·): K(x,y) = φ(x)·φ(y).

K(x,y) = (1 + x·y)^d with φ(x) = [1, ..., xi, ..., xixj, ...]: polynomial kernel.
K(x,y) = (1 + x1y1)(1 + x2y2)···(1 + xnyn): φ(x) has one coordinate for the product of each subset of coordinates.
K(x,y) = exp(−C|x − y|²): infinite-dimensional space. Gaussian kernel.
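
A minimal kernelized-perceptron sketch in Python (my illustration of the kernel trick on this slide, not the lecture's code). The weight vector is never formed explicitly: it is kept as the list of support vectors, and every dot product wt·φ(x) is evaluated through K. Labels are kept explicit here rather than folded into the points, since φ(−x) ≠ −φ(x) for a general kernel:

```python
import numpy as np

def gaussian_kernel(x, y, C=1.0):
    # K(x, y) = exp(-C |x - y|^2): the Gaussian kernel from the slide.
    return np.exp(-C * np.sum((x - y) ** 2))

def kernel_perceptron(X, y, K=gaussian_kernel, max_passes=100):
    """Kernel perceptron: w is stored implicitly as (label, support vector) pairs,
    so w . phi(x) = sum_j label_j * K(support_j, x) -- the kernel trick."""
    support = []                                             # pairs (label, point) defining w
    for _ in range(max_passes):
        clean_pass = True
        for xi, yi in zip(X, y):
            score = sum(lbl * K(s, xi) for lbl, s in support)   # w_t . phi(x_i)
            if yi * score <= 0:                              # wrong sign: a mistake
                support.append((yi, xi.copy()))              # w_{t+1} = w_t + y_i phi(x_i)
                clean_pass = False
        if clean_pass:
            break
    return support

def predict(support, x, K=gaussian_kernel):
    return np.sign(sum(lbl * K(s, x) for lbl, s in support))
```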

SLIDE 133

Video

http://www.youtube.com/watch?v=3liCbRZPrZA

SLIDE 147

Support Vector Machine

Pick a kernel. Run an algorithm that:
(1) uses only dot products, and
(2) outputs a hyperplane that is a linear combination of the points.
E.g., the perceptron.

Max-margin problem as convex optimization: min |w|² subject to w·xi ≥ 1 for all i.

Algorithms output: tight hyperplanes!
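
A minimal sketch of that convex program in Python (using the third-party cvxpy library as an illustration; the data and names here are mine, not the lecture's): minimize |w|² subject to w·xi ≥ 1 for every (label-folded) point. The geometric margin of the result is 1/|w|.

```python
import numpy as np
import cvxpy as cp

# Hypothetical separable data with labels folded in (each row already multiplied
# by its +1/-1 label), as in the earlier slides.
X = np.array([[1.0, 2.0], [2.0, 1.0], [1.5, 1.5], [0.5, 3.0]])

w = cp.Variable(X.shape[1])
# Max-margin problem from the slide: min |w|^2 subject to w . x_i >= 1 for all i.
problem = cp.Problem(cp.Minimize(cp.sum_squares(w)), [X @ w >= 1])
problem.solve()

print("w =", w.value, " margin = 1/|w| =", 1.0 / np.linalg.norm(w.value))
```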

SLIDE 148

See you on Tuesday.