CS 573: Algorithms, Fall 2013
Learning, Linear Separability and Linear Programming
Lecture 22
November 12, 2013
Sariel (UIUC) CS573 1 Fall 2013 1 / 28
1. Given examples: a database of cars.
2. We would like to determine which cars are sport cars.
3. Each car record is interpreted as a point in high dimensions.
4. Example: a sport car with 4 doors, manufactured in 1997 by Quaky (manufacturer ID 6), becomes the point (4, 1997, 6). It is labeled as a sport car.
5. A tractor made by General Mess (manufacturer ID 3) in 1998 becomes the point (0, 1998, 3). It is labeled as not a sport car.
6. Real world: hundreds of attributes, in some cases even millions.
7. Goal: automate this classification process, labeling sport/regular cars automatically.
1. Learning algorithm:
   1. given several (or many) classified examples...
   2. ...it develops its own conjecture for the rule of classification...
   3. ...and can use it for classifying new data.
2. Learning: training + classifying.
3. Learn a function: $f : \mathbb{R}^d \to \{-1, 1\}$.
4. Challenge: $f$ might have infinite complexity...
5. ...but this is a rare situation in the real world. Assume learnable functions.
6. Red and blue points that are linearly separable.
7. We are trying to learn a line $\ell$ that separates the red points from the blue points.
1. Given red and blue points, how do we compute the separating line $\ell$?
2. A line/plane/hyperplane is the zero set of a linear function.
3. Form: for all $x \in \mathbb{R}^d$, $f(x) = \langle a, x \rangle + b$, where $a = (a_1, \ldots, a_d) \in \mathbb{R}^d$ and $b \in \mathbb{R}$. Here $\langle a, x \rangle = \sum_i a_i x_i$ is the dot product of $a$ and $x$.
4. Classification is done by computing the sign of $f(x)$: $\mathrm{sign}(f(x))$.
5. If $\mathrm{sign}(f(x))$ is negative, $x$ is not in the class. If it is positive, $x$ is inside.
6. A set of training examples: $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$, for $i = 1, \ldots, n$.
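The classification rule above can be sketched in a few lines of Python. This is only an illustration: the function names `dot` and `classify`, the example line, and the choice to map $f(x) = 0$ to $-1$ are assumptions made here, not part of the slides.

```python
from typing import Sequence


def dot(a: Sequence[float], x: Sequence[float]) -> float:
    """Dot product <a, x> = sum_i a_i * x_i."""
    if len(a) != len(x):
        raise ValueError("a and x must have the same dimension")
    return sum(ai * xi for ai, xi in zip(a, x))


def classify(a: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Return +1 if f(x) = <a, x> + b is positive, and -1 otherwise.

    Assumption (not from the slides): f(x) = 0 is treated as -1.
    """
    return 1 if dot(a, x) + b > 0 else -1


# Hypothetical separating line x1 + x2 - 3 = 0 in the plane.
a, b = (1.0, 1.0), -3.0
print(classify(a, b, (2.0, 2.0)))  # f = 1, prints 1
print(classify(a, b, (1.0, 1.0)))  # f = -1, prints -1
```

Training amounts to choosing `a` and `b`; classifying a new point is then a single dot product.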
1. A linear classifier $h$ is a pair $(w, b)$, where $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$.
2. The classification of $x \in \mathbb{R}^d$ is $\mathrm{sign}(\langle w, x \rangle + b)$.
3. For a labeled example $(x, y)$, $h$ classifies $(x, y)$ correctly if $\mathrm{sign}(\langle w, x \rangle + b) = y$.
4. Assume a linear classifier exists.
5. Given $n$ labeled examples, how do we compute a linear classifier for these examples?
6. Use linear programming...
7. We are looking for $(w, b)$ such that for all $(x_i, y_i)$ we have $\mathrm{sign}(\langle w, x_i \rangle + b) = y_i$, that is, $\langle w, x_i \rangle + b \ge 0$ if $y_i = 1$, and $\langle w, x_i \rangle + b \le 0$ if $y_i = -1$.
Or equivalently, let $x_i = (x_i^1, \ldots, x_i^d) \in \mathbb{R}^d$, for $i = 1, \ldots, n$, and let $w = (w^1, \ldots, w^d)$. Then we get the linear constraint
    $\sum_{k=1}^{d} w^k x_i^k + b \ge 0$ if $y_i = 1$, and
    $\sum_{k=1}^{d} w^k x_i^k + b \le 0$ if $y_i = -1$.
Thus, we get a set of linear constraints, one for each training example, and we need to solve the resulting linear program.
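As a rough illustration of how these constraints could be handed to an LP solver, the sketch below writes each one as a row of coefficients over the unknowns $(w^1, \ldots, w^d, b)$, in the form "row · unknowns ≤ 0". The helper names `constraint_rows` and `satisfies`, the `margin` parameter, and the toy data are assumptions made for this example. No solver is called; the code only builds the rows and checks a candidate. Note that $w = 0$, $b = 0$ satisfies the constraints trivially, which is why a positive margin is commonly required in practice.

```python
def constraint_rows(examples):
    """One row per example (x, y), encoding -y * (<w, x> + b) <= 0.

    The unknowns are ordered (w^1, ..., w^d, b).
    """
    rows = []
    for x, y in examples:
        rows.append([-y * xk for xk in x] + [-y])
    return rows


def satisfies(rows, w, b, margin=0.0):
    """Check whether (w, b) obeys every row, i.e. y * (<w, x> + b) >= margin."""
    z = list(w) + [b]
    return all(
        sum(r * zk for r, zk in zip(row, z)) <= -margin for row in rows
    )


# Hypothetical training set in the plane: two positive, two negative examples.
examples = [((2, 2), 1), ((3, 1), 1), ((0, 0), -1), ((1, 0), -1)]
rows = constraint_rows(examples)

print(satisfies(rows, (1, 1), -2.5))            # True
print(satisfies(rows, (1, 1), -2.5, margin=1))  # True
print(satisfies(rows, (1, 0), 0))               # False: (1, 0) is misclassified
```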
1. Stumbling block: linear programming is very sensitive to noise.
2. If some points are misclassified $\implies$ the linear program has no solution.
3. Instead, use an iterative algorithm that converges to the optimal solution if it exists...
perceptron(S: a set of l examples)
    w_0 ← 0, k ← 0
    R ← max_{(x,y) ∈ S} ‖x‖
    repeat
        for (x, y) ∈ S do
            if sign(⟨w_k, x⟩) ≠ y then
                w_{k+1} ← w_k + y ∗ x
                k ← k + 1
    until no mistakes are made in the classification
    return w_k and k
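The pseudocode above can be sketched in Python as follows. Two details are assumptions made for this sketch and are not in the slides: a score of exactly 0 is classified as $-1$, and a hypothetical `max_passes` cap stops the loop if the data turn out not to be separable by a hyperplane through the origin.

```python
def perceptron(samples, max_passes=1000):
    """Return (w, k): the final weight vector and the number of updates.

    samples is a list of (x, y) pairs with y in {-1, +1}.
    """
    d = len(samples[0][0])
    w = [0.0] * d
    k = 0
    for _ in range(max_passes):
        mistakes = 0
        for x, y in samples:
            score = sum(wi * xi for wi, xi in zip(w, x))
            predicted = 1 if score > 0 else -1  # assumption: sign(0) = -1
            if predicted != y:
                # Fix-up step: w_{k+1} = w_k + y * x
                w = [wi + y * xi for wi, xi in zip(w, x)]
                k += 1
                mistakes += 1
        if mistakes == 0:
            return w, k
    raise RuntimeError("no separating vector found within max_passes")


# Hypothetical data, separable by the line x1 + x2 = 0.
samples = [
    ((-1.0, 3.0), 1),
    ((3.0, -1.0), 1),
    ((1.0, -3.0), -1),
    ((-3.0, 1.0), -1),
]
w, k = perceptron(samples)
print(w, k)  # [2.0, 2.0] 2
```

On this data the first two examples each trigger one update, after which every example is classified correctly.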
1. Why does the perceptron algorithm converge?
2. Assume the algorithm made a mistake on a sample $(x, y)$ with $y = 1$. Then $\langle w_k, x \rangle < 0$, and
   $\langle w_{k+1}, x \rangle = \langle w_k + y x, x \rangle = \langle w_k, x \rangle + y \langle x, x \rangle = \langle w_k, x \rangle + y \|x\|^2 > \langle w_k, x \rangle$.
3. We are "walking" in the right direction...
4. ...the new value assigned to $x$ by $w_{k+1}$ is larger ("more positive") than the old value assigned to $x$ by $w_k$.
5. After enough iterations of such fix-ups, the label would change...
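A small numeric check of the fix-up step may help. The vectors below are made up for illustration, and the helper names `inner` and `update` are hypothetical. The check confirms that one update raises the value assigned to $x$ by exactly $y \|x\|^2$.

```python
def inner(u, v):
    """Dot product <u, v>."""
    return sum(a * b for a, b in zip(u, v))


def update(w, x, y):
    """One perceptron fix-up: w + y * x."""
    return [wi + y * xi for wi, xi in zip(w, x)]


# Hypothetical misclassified positive example: <w_k, x> = -1 < 0 while y = 1.
w_k = [-1.0, 0.0]
x, y = [1.0, 2.0], 1

before = inner(w_k, x)       # -1.0
w_next = update(w_k, x, y)   # [0.0, 2.0]
after = inner(w_next, x)     # 4.0

print(before, after)         # -1.0 4.0
print(after - before == y * inner(x, x))  # True: the gain is ||x||^2 = 5
```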
Theorem. Let $S$ be a training set of examples, and let $R = \max_{(x,y) \in S} \|x\|$. Suppose that there exists a vector $w_{\mathrm{opt}}$ such that $\|w_{\mathrm{opt}}\| = 1$, and a number $\gamma > 0$, such that $y \langle w_{\mathrm{opt}}, x \rangle \ge \gamma$ for all $(x, y) \in S$. Then the number of mistakes made by the online perceptron algorithm on $S$ is at most $(R/\gamma)^2$.
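To make the quantities in the theorem concrete, the following sketch computes $R$, $\gamma$ and the bound $(R/\gamma)^2$ for a small made-up data set and a hand-picked unit vector playing the role of $w_{\mathrm{opt}}$. The function name `mistake_bound` and the data are assumptions for illustration only.

```python
import math


def inner(u, v):
    """Dot product <u, v>."""
    return sum(a * b for a, b in zip(u, v))


def mistake_bound(samples, w_opt):
    """Return (R, gamma, (R / gamma) ** 2) for a unit vector w_opt."""
    if abs(math.sqrt(inner(w_opt, w_opt)) - 1.0) > 1e-9:
        raise ValueError("w_opt must be a unit vector")
    R = max(math.sqrt(inner(x, x)) for x, _ in samples)
    gamma = min(y * inner(w_opt, x) for x, y in samples)
    if gamma <= 0:
        raise ValueError("w_opt does not separate the samples with a margin")
    return R, gamma, (R / gamma) ** 2


# Hypothetical data, separable by the line x1 + x2 = 0.
samples = [
    ((-1.0, 3.0), 1),
    ((3.0, -1.0), 1),
    ((1.0, -3.0), -1),
    ((-3.0, 1.0), -1),
]
w_opt = (1 / math.sqrt(2), 1 / math.sqrt(2))

R, gamma, bound = mistake_bound(samples, w_opt)
print(round(R * R, 6), round(gamma * gamma, 6), round(bound, 6))  # 10.0 2.0 5.0
```

Here $R^2 = 10$ and $\gamma^2 = 2$, so the theorem promises at most 5 mistakes on this data.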
[Figure: a "hard" instance and an "easy" instance, each drawn with the radius $R$ and the vector $w_{\mathrm{opt}}$. Number of errors: $(R/\gamma)^2$ for the hard instance and $(R/\gamma')^2$ for the easy one.]
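1. Idea of proof: the perceptron weight vector converges to $w_{\mathrm{opt}}$.
2. Distance between $w_{\mathrm{opt}}$ and the $k$th update vector: $\alpha_k = \left\| w_k - \frac{R^2}{\gamma} w_{\mathrm{opt}} \right\|^2$.
3. Quantify the change between $\alpha_k$ and $\alpha_{k+1}$.
4. The example being misclassified is $(x, y)$.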
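1. The example being misclassified is $(x, y)$ (both are constants).
2. $w_{k+1} \leftarrow w_k + y x$.
3. Expanding the squared distance (and using $y^2 = 1$):
   $\alpha_{k+1} = \left\| w_{k+1} - \frac{R^2}{\gamma} w_{\mathrm{opt}} \right\|^2 = \left\| w_k + y x - \frac{R^2}{\gamma} w_{\mathrm{opt}} \right\|^2 = \left\| \left( w_k - \frac{R^2}{\gamma} w_{\mathrm{opt}} \right) + y x \right\|^2$
   $= \left\langle \left( w_k - \frac{R^2}{\gamma} w_{\mathrm{opt}} \right) + y x, \left( w_k - \frac{R^2}{\gamma} w_{\mathrm{opt}} \right) + y x \right\rangle$
   $= \left\langle w_k - \frac{R^2}{\gamma} w_{\mathrm{opt}}, w_k - \frac{R^2}{\gamma} w_{\mathrm{opt}} \right\rangle + 2y \left\langle w_k - \frac{R^2}{\gamma} w_{\mathrm{opt}}, x \right\rangle + \langle x, x \rangle$
   $= \alpha_k + 2y \left\langle w_k - \frac{R^2}{\gamma} w_{\mathrm{opt}}, x \right\rangle + \|x\|^2$.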
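1. We proved: $\alpha_{k+1} = \alpha_k + 2y \left\langle w_k - \frac{R^2}{\gamma} w_{\mathrm{opt}}, x \right\rangle + \|x\|^2$.
2. $(x, y)$ is misclassified: $\mathrm{sign}(\langle w_k, x \rangle) \ne y$
3. $\implies \mathrm{sign}(y \langle w_k, x \rangle) = -1$
4. $\implies y \langle w_k, x \rangle < 0$.
5. Since $\|x\| \le R$: $\alpha_{k+1} \le \alpha_k + R^2 + 2y \langle w_k, x \rangle - 2y \left\langle \frac{R^2}{\gamma} w_{\mathrm{opt}}, x \right\rangle \le \alpha_k + R^2 - \frac{2R^2}{\gamma} y \langle w_{\mathrm{opt}}, x \rangle$...
6. ...since $2y \langle w_k, x \rangle < 0$.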
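1. Proved: $\alpha_{k+1} \le \alpha_k + R^2 - \frac{2R^2}{\gamma} y \langle w_{\mathrm{opt}}, x \rangle$.
2. $\mathrm{sign}(\langle w_{\mathrm{opt}}, x \rangle) = y$.
3. By the margin assumption: $y \langle w_{\mathrm{opt}}, x \rangle \ge \gamma$ for all $(x, y) \in S$.
4. $\alpha_{k+1} \le \alpha_k + R^2 - \frac{2R^2}{\gamma} y \langle w_{\mathrm{opt}}, x \rangle \le \alpha_k + R^2 - \frac{2R^2}{\gamma} \gamma = \alpha_k + R^2 - 2R^2 = \alpha_k - R^2$.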
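1. We have: $\alpha_{k+1} \le \alpha_k - R^2$.
2. $\alpha_0 = \left\| 0 - \frac{R^2}{\gamma} w_{\mathrm{opt}} \right\|^2 = \frac{R^4}{\gamma^2} \|w_{\mathrm{opt}}\|^2 = \frac{R^4}{\gamma^2}$.
3. For all $i$: $\alpha_i \ge 0$.
4. Q: What is the maximum number of classification errors the algorithm can make?
5. ...it is the number of updates...
6. ...and the number of updates is at most $\alpha_0 / R^2$...
7. A: at most $R^2 / \gamma^2$.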
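The potential argument can be checked numerically. The sketch below runs the perceptron on a small made-up data set, records $\alpha_k$ after every update, and compares the number of updates with $R^2/\gamma^2$. The function names, the data, the hand-picked unit vector `w_opt`, and the convention that a score of 0 is classified as $-1$ are all assumptions for this illustration.

```python
import math


def inner(u, v):
    """Dot product <u, v>."""
    return sum(a * b for a, b in zip(u, v))


def alpha(w, w_opt, R, gamma):
    """Potential alpha = || w - (R^2 / gamma) * w_opt ||^2."""
    scale = R * R / gamma
    diff = [wi - scale * oi for wi, oi in zip(w, w_opt)]
    return inner(diff, diff)


def perceptron_with_potential(samples, w_opt):
    """Run the perceptron and return (w, alphas, R, gamma)."""
    R = max(math.sqrt(inner(x, x)) for x, _ in samples)
    gamma = min(y * inner(w_opt, x) for x, y in samples)
    if gamma <= 0:
        raise ValueError("w_opt must separate the samples with a margin")
    w = [0.0] * len(w_opt)
    alphas = [alpha(w, w_opt, R, gamma)]  # alpha_0
    changed = True
    while changed:
        changed = False
        for x, y in samples:
            predicted = 1 if inner(w, x) > 0 else -1  # assumption: sign(0) = -1
            if predicted != y:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                alphas.append(alpha(w, w_opt, R, gamma))
                changed = True
    return w, alphas, R, gamma


samples = [
    ((-1.0, 3.0), 1),
    ((3.0, -1.0), 1),
    ((1.0, -3.0), -1),
    ((-3.0, 1.0), -1),
]
w_opt = (1 / math.sqrt(2), 1 / math.sqrt(2))

w, alphas, R, gamma = perceptron_with_potential(samples, w_opt)
updates = len(alphas) - 1
print([round(a, 6) for a in alphas])             # [50.0, 40.0, 18.0]
print(updates, "<=", round((R / gamma) ** 2, 6))  # 2 <= 5.0
```

Each update lowers the potential by at least $R^2 = 10$, and the potential can never go below zero, which is exactly why the number of updates is bounded.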
Any linear program can be written as the problem of separating red points from blue points. As such, the perceptron algorithm can be used to solve linear programs.
1. Given a set of red points and a set of blue points in the plane, we want to learn a circle that contains all the red points and does not contain any of the blue points.
2. Q: How do we compute the circle $\sigma$?
3. Lifting: $\ell : (x, y) \to (x, y, x^2 + y^2)$.
4. $z(P) = \{ \ell(x, y) = (x, y, x^2 + y^2) \mid (x, y) \in P \}$.
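The lifting map is a one-liner in code. A minimal sketch follows; the names `lift` and `lift_all` are chosen here for illustration.

```python
def lift(p):
    """Map a planar point (x, y) to (x, y, x^2 + y^2)."""
    x, y = p
    return (x, y, x * x + y * y)


def lift_all(points):
    """Apply the lifting map to every point of a set P."""
    return [lift(p) for p in points]


print(lift((1, 2)))                # (1, 2, 5)
print(lift_all([(0, 0), (3, 4)]))  # [(0, 0, 0), (3, 4, 25)]
```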
Theorem. Two sets of points $R$ and $B$ are separable by a circle in two dimensions if and only if $\ell(R)$ and $\ell(B)$ are separable by a plane in three dimensions.
1. $\sigma \equiv (x - a)^2 + (y - b)^2 = r^2$: a circle containing $R$, with all points of $B$ outside.
2. For all $(x, y) \in R$: $(x - a)^2 + (y - b)^2 \le r^2$. For all $(x, y) \in B$: $(x - a)^2 + (y - b)^2 > r^2$.
3. For all $(x, y) \in R$: $-2ax - 2by + (x^2 + y^2) - r^2 + a^2 + b^2 \le 0$. For all $(x, y) \in B$: $-2ax - 2by + (x^2 + y^2) - r^2 + a^2 + b^2 > 0$.
4. Setting $z = z(x, y) = x^2 + y^2$ and $h(x, y, z) = -2ax - 2by + z - r^2 + a^2 + b^2$: for all $(x, y) \in R$, $h(x, y, z(x, y)) \le 0$
5. $\iff$ for all $(x, y) \in R$: $h(\ell(x, y)) \le 0$, and for all $(x, y) \in B$: $h(\ell(x, y)) > 0$.
6. $p \in \sigma \iff h(\ell(p)) \le 0$.
7. Proved: if the point set is separable by a circle $\implies$ the lifted point sets $\ell(R)$ and $\ell(B)$ are separable by a plane.
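This direction of the argument can be checked mechanically. The sketch below builds the plane $h$ from a circle and confirms on a small grid that a point lies in the disk exactly when its lifted image satisfies $h \le 0$. The function names and the example circle (center $(1, 1)$, radius 2) are assumptions chosen for illustration.

```python
def lift(p):
    """Map (x, y) to (x, y, x^2 + y^2)."""
    x, y = p
    return (x, y, x * x + y * y)


def plane_from_circle(a, b, r):
    """Coefficients (A, B, C, D) of h(x, y, z) = A x + B y + C z + D,
    where h(x, y, z) = -2a x - 2b y + z + (a^2 + b^2 - r^2)."""
    return (-2 * a, -2 * b, 1, a * a + b * b - r * r)


def h_value(plane, q):
    """Evaluate h at a point q = (x, y, z) of three-dimensional space."""
    A, B, C, D = plane
    x, y, z = q
    return A * x + B * y + C * z + D


def in_disk(a, b, r, p):
    """True if p lies inside or on the circle centered at (a, b) with radius r."""
    x, y = p
    return (x - a) ** 2 + (y - b) ** 2 <= r * r


plane = plane_from_circle(1, 1, 2)
print(h_value(plane, lift((1, 2))))  # -3: inside the circle
print(h_value(plane, lift((4, 1))))  # 5: outside the circle

# The two tests agree on every grid point.
grid = [(x, y) for x in range(-3, 6) for y in range(-3, 6)]
print(all(in_disk(1, 1, 2, p) == (h_value(plane, lift(p)) <= 0) for p in grid))  # True
```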
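1. Assume $\ell(R)$ and $\ell(B)$ are linearly separable. Let the separating plane be $h \equiv ax + by + cz + d = 0$.
2. $\forall (x, y, x^2 + y^2) \in \ell(R)$: $ax + by + c(x^2 + y^2) + d \le 0$.
3. $\forall (x, y, x^2 + y^2) \in \ell(B)$: $ax + by + c(x^2 + y^2) + d \ge 0$.
4. $U(h) = \{ (x, y) \mid ax + by + c(x^2 + y^2) + d \le 0 \}$.
5. If $U(h)$ is a disk $\implies$ $R \subset U(h)$ and $B \cap U(h) = \emptyset$.
6. $U(h) \equiv ax + by + c(x^2 + y^2) \le -d$
7. $\iff \left( x^2 + \frac{a}{c} x \right) + \left( y^2 + \frac{b}{c} y \right) \le -\frac{d}{c}$ (dividing by $c$, assuming $c > 0$)
8. $\iff \left( x + \frac{a}{2c} \right)^2 + \left( y + \frac{b}{2c} \right)^2 \le \frac{a^2 + b^2}{4c^2} - \frac{d}{c}$.
9. This is a disk in the plane, as claimed.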
Sariel (UIUC) CS573 22 Fall 2013 22 / 28
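Completing the square gives the disk explicitly: center $\left(-\frac{a}{2c}, -\frac{b}{2c}\right)$ and squared radius $\frac{a^2+b^2}{4c^2} - \frac{d}{c}$. The sketch below is an illustration only; the function name `disk_from_plane` is hypothetical, and it assumes $c > 0$ so that dividing by $c$ keeps the direction of the inequality.

```python
import math


def disk_from_plane(a, b, c, d):
    """Given the plane ax + by + cz + d = 0 with c > 0, return
    (center, radius) of {(x, y) : ax + by + c(x^2 + y^2) + d <= 0},
    or None if that set is empty."""
    if c <= 0:
        raise ValueError("this sketch assumes c > 0")
    center = (-a / (2 * c), -b / (2 * c))
    r_squared = (a * a + b * b) / (4 * c * c) - d / c
    if r_squared < 0:
        return None  # no point satisfies the inequality
    return center, math.sqrt(r_squared)


# The circle of radius 2 centered at (1, 1) lifts to the plane
# -2x - 2y + z - 2 = 0 (from h = -2ax - 2by + z - r^2 + a^2 + b^2).
center, radius = disk_from_plane(-2, -2, 1, -2)
print(center, radius)
```

Running it on the plane derived from a known circle recovers that circle, which is a quick consistency check between the two directions of the argument.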
Linear separability is a powerful technique: it can be used to learn concepts that are considerably more complicated than hyperplane separation. The lifting technique shown above is known as the kernel technique, or linearization.
Sariel (UIUC) CS573 23 Fall 2013 23 / 28
1. Q: How complex is the function we are trying to learn?
2. VC-dimension is one way of capturing this notion (VC = Vapnik, Chervonenkis, 1971).
3. A matter of expressivity. What is harder to learn:
   1. A rectangle in the plane.
   2. A halfplane.
   3. A convex polygon with k sides.
Sariel (UIUC) CS573 24 Fall 2013 24 / 28
1. $X = \{p_1, p_2, \ldots, p_m\}$: points in the plane.
2. $H$: set of all halfplanes.
3. A halfplane $r \in H$ defines a binary vector $r(X) = (b_1, \ldots, b_m)$, where $b_i = 1$ if and only if $p_i$ is inside $r$.
4. Possible binary vectors generated by halfplanes: $U(X, H) = \{ r(X) \mid r \in H \}$.
5. A set $X$ of $m$ elements is shattered by $R$ if $|U(X, R)| = 2^m$.
6. What does this mean?
7. The VC-dimension of a set of ranges $R$ is the size of the largest set that it can shatter.
Sariel (UIUC) CS573 25 Fall 2013 25 / 28
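For small point sets, $U(X, H)$ can be enumerated by brute force. The sketch below is an illustration under stated assumptions, not the lecture's method: the function names are hypothetical, and it assumes no three points are collinear. It relies on the observation that the order of the points along a direction changes only at directions perpendicular to a line through two of the points, so one direction per gap between consecutive critical angles, with every prefix of the sorted order and its complement, should cover every halfplane.

```python
import math
from itertools import combinations


def halfplane_vectors(points):
    """Return U(X, H): the binary vectors that halfplanes induce on points."""
    m = len(points)
    critical = set()
    for (x1, y1), (x2, y2) in combinations(points, 2):
        # Direction perpendicular to p2 - p1, taken modulo pi because the
        # opposite direction only reverses the order of the points.
        ang = round((math.atan2(y2 - y1, x2 - x1) + math.pi / 2) % math.pi, 9)
        if ang >= round(math.pi, 9):
            ang = 0.0
        critical.add(ang)
    angles = sorted(critical)
    vectors = {tuple([0] * m), tuple([1] * m)}
    for i, lo in enumerate(angles):
        # The last gap wraps around to the first critical angle plus pi.
        hi = angles[i + 1] if i + 1 < len(angles) else angles[0] + math.pi
        theta = (lo + hi) / 2
        ux, uy = math.cos(theta), math.sin(theta)
        order = sorted(range(m),
                       key=lambda j: points[j][0] * ux + points[j][1] * uy)
        bits = [0] * m
        for j in order:
            bits[j] = 1
            vectors.add(tuple(bits))                 # halfplane on one side
            vectors.add(tuple(1 - b for b in bits))  # complementary halfplane
    return vectors


def is_shattered(points):
    """X is shattered if all 2^m binary vectors are generated."""
    return len(halfplane_vectors(points)) == 2 ** len(points)


triangle = [(0, 0), (1, 0), (0, 1)]
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(is_shattered(triangle), is_shattered(square))
```

On the corners of a square, the two diagonal pairs cannot be cut out by a halfplane, so that four-point set is not shattered, while the three corners of a triangle are.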
What is the VC dimension of circles in the plane? $X$ is a set of $n$ points in the plane, and $C$ is the set of all circles. Let $X = \{p, q, r, s\}$. What subsets of $X$ can we generate by a circle?
[Figure: four points p, q, r, s in the plane]
Sariel (UIUC) CS573 26 Fall 2013 26 / 28
[Figure: four points p, q, r, s in the plane]
{}, {r}, {p}, {q}, {s}, {p, s}, {p, q}, {p, r}, {r, q}, {q, s} and {r, p, q}, {p, r, s}, {p, s, q}, {s, q, r} and {r, p, q, s}. We got only 15 sets. There is one set which is not there. Which one? The VC dimension of circles in the plane is 3.
Sariel (UIUC) CS573 27 Fall 2013 27 / 28
Lemma (Sauer Lemma): If $R$ has VC dimension $d$ then $|U(X, R)| = O(m^d)$, where $m$ is the size of $X$.
Sariel (UIUC) CS573 28 Fall 2013 28 / 28
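The slide states the bound only as $O(m^d)$. The usual explicit form of the Sauer lemma, which is not shown on the slide, is $|U(X, R)| \le \sum_{i=0}^{d} \binom{m}{i}$. The short sketch below (the function name `sauer_bound` is hypothetical) evaluates this sum and prints it next to $2^m$ to show how much smaller it is once $m$ exceeds $d$.

```python
from math import comb


def sauer_bound(m, d):
    """Explicit Sauer bound: sum of C(m, i) for i = 0..d."""
    return sum(comb(m, i) for i in range(d + 1))


# Compare the polynomial bound with the 2^m subsets of an m-point set.
for m in (4, 10, 100):
    print(m, sauer_bound(m, 3), 2 ** m)
```

With $m = 4$ and $d = 3$ the sum is 15, which matches the 15 subsets counted for the four points and circles above.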