
Chapter 22 Learning, Linear Separability and Linear Programming

CS 573: Algorithms, Fall 2013 November 12, 2013

22.1 The Perceptron algorithm

22.1.0.1 Labeling...

(A) Given examples: a database of cars. (B) We would like to determine which cars are sports cars. (C) Each car record is interpreted as a point in high dimensions. (D) Example: a sports car with 4 doors, manufactured in 1997 by Quaky (with manufacturer ID 6): (4, 1997, 6). Labeled as a sports car. (E) A tractor by General Mess (manufacturer ID 3) in 1998: (0, 1998, 3). Labeled as not a sports car. (F) Real world: hundreds of attributes, in some cases even millions of attributes! (G) Goal: automate this classification process, labeling sports/regular cars automatically.

22.1.0.2 Automatic classification...

(A) A learning algorithm: (a) is given several (or many) classified examples... (b) ...develops its own conjecture for a classification rule... (c) ...and can then use it to classify new data. (B) Learning: training + classifying. (C) Learn a function f : ℝ^d → {−1, 1}. (D) Challenge: f might have infinite complexity... (E) ...a rare situation in the real world. Assume learnable functions. (F) Example: red and blue points that are linearly separable. (G) We are trying to learn a line ℓ that separates the red points from the blue points.
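As a concrete toy illustration of such a function f : ℝ^d → {−1, 1}, here is a sketch in Python. The car records are the ones from the text; the weight vector is a made-up illustration value, not a learned classifier.

```python
# Sketch of a classification function f : R^d -> {-1, 1}.
# The records are the car/tractor examples from the text; the weights
# below are hypothetical illustration values, not a trained classifier.

def f(x, a, b):
    """Evaluate sign(<a, x> + b): +1 means 'sports car', -1 means 'not'."""
    value = sum(ai * xi for ai, xi in zip(a, x)) + b
    return 1 if value >= 0 else -1

# (doors, year, manufacturer id) records from the text.
sports_car = (4, 1997, 6)    # labeled +1
tractor    = (0, 1998, 3)    # labeled -1

# A hypothetical linear rule that happens to separate these two records.
a, b = (1.0, 0.0, 0.0), -2.0
print(f(sports_car, a, b), f(tractor, a, b))
```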


22.1.0.3 Linear separability example...

22.1.0.4 Learning linear separation

(A) Given red and blue points, how do we compute the separating line ℓ? (B) A line/plane/hyperplane is the zero set of a linear function. (C) Form: for all x ∈ ℝ^d, f(x) = ⟨a, x⟩ + b, where a = (a1, ..., ad) ∈ ℝ^d and b ∈ ℝ, and ⟨a, x⟩ = ∑_i a_i x_i is the dot product of a and x. (D) Classification is done by computing the sign of f(x): sign(f(x)). (E) If sign(f(x)) is negative, x is not in the class; if positive, it is. (F) A set of training examples: S = {(x1, y1), ..., (xn, yn)}, where xi ∈ ℝ^d and yi ∈ {−1, 1}, for i = 1, ..., n.

22.1.0.5 Classification...

(A) A linear classifier h: a pair (w, b), where w ∈ ℝ^d and b ∈ ℝ. (B) The classification of x ∈ ℝ^d is sign(⟨w, x⟩ + b). (C) For a labeled example (x, y), h classifies (x, y) correctly if sign(⟨w, x⟩ + b) = y. (D) Assume a linear classifier exists. (E) Given n labeled examples, how do we compute a linear classifier for them? (F) Use linear programming... (G) We are looking for (w, b) such that for all (xi, yi) we have sign(⟨w, xi⟩ + b) = yi, which is ⟨w, xi⟩ + b ≥ 0 if yi = 1, and ⟨w, xi⟩ + b ≤ 0 if yi = −1.
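The search for (w, b) satisfying these sign constraints can be handed to an off-the-shelf LP solver. A sketch using scipy.optimize.linprog (assumed available): since LP solvers want non-strict constraints, we ask for yi(⟨w, xi⟩ + b) ≥ 1, which is equivalent up to scaling whenever the points are strictly separable. The data at the bottom is made up for illustration.

```python
import numpy as np
from scipy.optimize import linprog

def lp_classifier(X, y):
    """Find (w, b) with y_i * (<w, x_i> + b) >= 1 for all i, as an LP
    feasibility problem (the objective is 0).  X: n x d array, y: +/-1
    labels.  Returns (w, b), or None if no separating classifier exists."""
    n, d = X.shape
    # Variables are (w_1, ..., w_d, b); constraint -y_i*(<w,x_i> + b) <= -1.
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1))
    if not res.success:
        return None
    return res.x[:d], res.x[d]

# Two separable clusters in the plane (illustration data).
X = np.array([[2.0, 2.0], [3.0, 1.5], [-2.0, -1.0], [-3.0, -2.5]])
y = np.array([1, 1, -1, -1])
w, b = lp_classifier(X, y)
print(np.sign(X @ w + b))   # matches the labels y
```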


22.1.0.6 Classification...

(A) Or equivalently, let xi = (x_i^1, ..., x_i^d) ∈ ℝ^d, for i = 1, ..., n, and let w = (w_1, ..., w_d); then we get the linear constraint

∑_{k=1}^d w_k x_i^k + b ≥ 0 if yi = 1, and ∑_{k=1}^d w_k x_i^k + b ≤ 0 if yi = −1.

Thus, we get a set of linear constraints, one for each training example, and we need to solve the resulting linear program.

22.1.0.7 Linear programming for learning?

(A) Stumbling block: linear programming is very sensitive to noise. (B) If some points are misclassified ⟹ there is no solution. (C) Instead, use an iterative algorithm that converges to a separating classifier if one exists...

22.1.0.8 Perceptron algorithm...

perceptron(S: a set of labeled examples)
    w0 ← 0, k ← 0
    R ← max_{(x,y)∈S} ∥x∥
    repeat
        for (x, y) ∈ S do
            if sign(⟨wk, x⟩) ≠ y then
                wk+1 ← wk + y·x
                k ← k + 1
    until no mistakes are made in the classification
    return wk and k
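The pseudocode above translates almost line for line into Python. A minimal sketch; as in the pseudocode there is no bias term, and the example point set is made up:

```python
# A direct Python rendering of the perceptron pseudocode above.  As in
# the text there is no separate bias term, and sign(0) counts as a
# mistake for positive examples, so the zero start vector gets updated.

def sign(v):
    return 1 if v > 0 else -1

def perceptron(S, max_rounds=1000):
    """S: list of (x, y) pairs with x a tuple in R^d and y in {-1, +1}.
    Returns the final weight vector w and the number of updates k."""
    d = len(S[0][0])
    w, k = [0.0] * d, 0
    for _ in range(max_rounds):                         # "repeat"
        mistakes = False
        for x, y in S:
            if sign(sum(wi * xi for wi, xi in zip(w, x))) != y:
                w = [wi + y * xi for wi, xi in zip(w, x)]   # w_{k+1} <- w_k + y*x
                k += 1
                mistakes = True
        if not mistakes:                                # "until no mistakes"
            break
    return w, k

# Points separable by a line through the origin (illustration data).
S = [((2.0, 1.0), 1), ((1.0, 3.0), 1), ((-1.0, -2.0), -1), ((-3.0, -1.0), -1)]
w, k = perceptron(S)
print(w, k)
```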

22.1.0.9 Perceptron algorithm

(A) Why does the perceptron algorithm converge? (B) Assume we made a mistake on a sample (x, y) and y = 1. Then ⟨wk, x⟩ < 0, and

⟨wk+1, x⟩ = ⟨wk + y·x, x⟩ = ⟨wk, x⟩ + y⟨x, x⟩ = ⟨wk, x⟩ + y∥x∥² > ⟨wk, x⟩.

(C) We are "walking" in the right direction... (D) ...the new value assigned to x by wk+1 is larger ("more positive") than the old value assigned to x by wk. (E) After enough iterations of such fix-ups, the label would change...

22.1.0.10 Perceptron algorithm converges

Theorem 22.1.1. Let S be a training set of examples, and let R = max_{(x,y)∈S} ∥x∥. Suppose that there exists a vector wopt such that ∥wopt∥ = 1, and a number γ > 0, such that y⟨wopt, x⟩ ≥ γ for all (x, y) ∈ S. Then, the number of mistakes made by the online perceptron algorithm on S is at most (R/γ)².
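The bound of the theorem is easy to check numerically. A sketch on made-up data that is separable through the origin by wopt = (1, 0), so that γ and R can be computed directly:

```python
# Check the mistake bound (R/gamma)^2 on a small made-up data set that
# is separable through the origin by the unit vector w_opt = (1, 0).
import math

def sign(v):
    return 1 if v > 0 else -1

def perceptron_updates(S, cap=10000):
    """Run the (bias-free) perceptron on S; return the update count k."""
    d = len(S[0][0])
    w, k = [0.0] * d, 0
    for _ in range(cap):
        mistake = False
        for x, y in S:
            if sign(sum(wi * xi for wi, xi in zip(w, x))) != y:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                k += 1
                mistake = True
        if not mistake:
            break
    return k

S = [((2.0, 0.5), 1), ((1.5, -0.3), 1), ((-2.0, 0.4), -1), ((-1.0, 0.1), -1)]
w_opt = (1.0, 0.0)                                       # ||w_opt|| = 1
gamma = min(y * sum(wi * xi for wi, xi in zip(w_opt, x)) for x, y in S)
R = max(math.hypot(*x) for x, y in S)                    # max norm in S
k = perceptron_updates(S)
print(k, (R / gamma) ** 2)   # the theorem guarantees k <= (R/gamma)^2
```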


22.1.0.11 Claim by figure...

[Figure: two configurations of points of norm at most R with an optimal separator wopt. Hard case: small margin γ, # errors (R/γ)². Easy case: larger margin γ′, # errors (R/γ′)².]

22.1.0.12 Proof of Perceptron convergence...

(A) Idea of proof: the perceptron weight vector converges to wopt. (B) Distance between wopt and the kth update vector: αk = ∥wk − (R²/γ)wopt∥². (C) Quantify the change between αk and αk+1. (D) The example being misclassified is (x, y).


22.1.0.13 Proof of Perceptron convergence...

(A) The example being misclassified is (x, y) (both are constants). (B) wk+1 ← wk + y·x. (C)

αk+1 = ∥wk+1 − (R²/γ)wopt∥²
     = ∥wk + y·x − (R²/γ)wopt∥²
     = ∥(wk − (R²/γ)wopt) + y·x∥²
     = ⟨(wk − (R²/γ)wopt) + y·x, (wk − (R²/γ)wopt) + y·x⟩
     = ⟨wk − (R²/γ)wopt, wk − (R²/γ)wopt⟩ + 2y⟨wk − (R²/γ)wopt, x⟩ + ⟨x, x⟩
     = αk + 2y⟨wk − (R²/γ)wopt, x⟩ + ∥x∥².
22.1.0.14 Proof of Perceptron convergence...

(A) We proved: αk+1 = αk + 2y⟨wk − (R²/γ)wopt, x⟩ + ∥x∥². (B) (x, y) is misclassified: sign(⟨wk, x⟩) ≠ y (C) ⟹ sign(y⟨wk, x⟩) = −1 (D) ⟹ y⟨wk, x⟩ < 0. (E) ∥x∥ ≤ R ⟹

αk+1 ≤ αk + R² + 2y⟨wk, x⟩ − (2R²/γ)·y⟨wopt, x⟩ ≤ αk + R² − (2R²/γ)·y⟨wopt, x⟩,

(F) ...since 2y⟨wk, x⟩ < 0.

22.1.0.15 Proof of Perceptron convergence...

(A) Proved: αk+1 ≤ αk + R² − (2R²/γ)·y⟨wopt, x⟩. (B) sign(⟨wopt, x⟩) = y. (C) By the margin assumption: y⟨wopt, x⟩ ≥ γ, for all (x, y) ∈ S. (D) Hence

αk+1 ≤ αk + R² − (2R²/γ)·y⟨wopt, x⟩ ≤ αk + R² − (2R²/γ)·γ ≤ αk + R² − 2R² ≤ αk − R².

22.1.0.16 Proof of Perceptron convergence...

(A) We have: αk+1 ≤ αk − R². (B) α0 = ∥0 − (R²/γ)wopt∥² = (R⁴/γ²)·∥wopt∥² = R⁴/γ². (C) αi ≥ 0 for all i. (D) Q: what is the maximum number of classification errors the algorithm can make? (E) ...it equals the number of updates. (F) ...and the number of updates is ≤ α0/R²... (G) A: at most R²/γ².
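The potential αk from the proof can be tracked while the perceptron runs: each update should decrease it by at least R². A numerical sketch on made-up data with a known unit vector wopt:

```python
# Track alpha_k = ||w_k - (R^2/gamma) w_opt||^2 across perceptron updates
# and record the drop at each update; the proof says each drop is >= R^2.
# The data set and w_opt below are made-up illustration values.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

S = [((2.0, 0.5), 1), ((1.5, -0.3), 1), ((-2.0, 0.4), -1), ((-1.0, 0.1), -1)]
w_opt = (1.0, 0.0)                                   # ||w_opt|| = 1
gamma = min(y * dot(w_opt, x) for x, y in S)         # margin gamma
R2 = max(dot(x, x) for x, y in S)                    # R^2
target = [R2 / gamma * wi for wi in w_opt]           # (R^2/gamma) * w_opt

def alpha(w):
    """alpha_k = ||w_k - (R^2/gamma) w_opt||^2."""
    return sum((wi - ti) ** 2 for wi, ti in zip(w, target))

w, drops = [0.0, 0.0], []
for _ in range(100):                                 # perceptron main loop
    mistake = False
    for x, y in S:
        if (1 if dot(w, x) > 0 else -1) != y:
            a_old = alpha(w)
            w = [wi + y * xi for wi, xi in zip(w, x)]
            drops.append(a_old - alpha(w))           # proof: drop >= R^2
            mistake = True
    if not mistake:
        break
print(drops, R2)   # every recorded drop is at least R^2
```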


22.1.0.17 Concluding comment... Any linear program can be written as the problem of separating red points from blue points. As such, the perceptron algorithm can be used to solve linear programs.

22.2 Learning A Circle

22.2.0.18 Learning a circle...

(A) Given a set of red points and a set of blue points in the plane, we want to learn a circle σ that contains all the red points and none of the blue points. (B) Q: How do we compute the circle σ? (C) Lifting: ℓ : (x, y) → (x, y, x² + y²). (D) z(P) = { ℓ(x, y) = (x, y, x² + y²) | (x, y) ∈ P }.

22.2.0.19 Learning a circle...

Theorem 22.2.1. Two sets of points R and B are separable by a circle in two dimensions if and only if ℓ(R) and ℓ(B) are separable by a plane in three dimensions.

22.2.0.20 Proof

(A) σ ≡ (x − a)² + (y − b)² = r²: a circle containing all of R, with all points of B outside. (B) ∀(x, y) ∈ R: (x − a)² + (y − b)² ≤ r², and ∀(x, y) ∈ B: (x − a)² + (y − b)² > r². (C) Expanding: ∀(x, y) ∈ R: −2ax − 2by + (x² + y²) − r² + a² + b² ≤ 0, and ∀(x, y) ∈ B: −2ax − 2by + (x² + y²) − r² + a² + b² > 0. (D) Setting z = z(x, y) = x² + y² and h(x, y, z) = −2ax − 2by + z − r² + a² + b², we get: ∀(x, y) ∈ R: h(x, y, z(x, y)) ≤ 0. (E) ⟺ ∀(x, y) ∈ R: h(ℓ(x, y)) ≤ 0, and ∀(x, y) ∈ B: h(ℓ(x, y)) > 0. (F) p ∈ σ ⟺ h(ℓ(p)) ≤ 0. (G) Proved: if the point sets are separable by a circle ⟹ the lifted point sets ℓ(R) and ℓ(B) are separable by a plane.

22.2.0.21 Proof: Other direction

(A) Assume ℓ(R) and ℓ(B) are linearly separable. Let the separating plane be h ≡ ax + by + cz + d = 0. (B) ∀(x, y, x² + y²) ∈ ℓ(R): ax + by + c(x² + y²) + d ≤ 0.


(C) ∀(x, y, x² + y²) ∈ ℓ(B): ax + by + c(x² + y²) + d ≥ 0. (D) U(h) = { (x, y) | h((x, y, x² + y²)) ≤ 0 }. (E) If U(h) is a disk ⟹ R ⊆ U(h) and B ∩ U(h) = ∅. (F) U(h) ≡ ax + by + c(x² + y²) ≤ −d. (G) Dividing by c (for c > 0): ⟺ (x² + (a/c)x) + (y² + (b/c)y) ≤ −d/c. (H) Completing the squares: ⟺ (x + a/(2c))² + (y + b/(2c))² ≤ (a² + b²)/(4c²) − d/c. (I) This is a disk in the plane, as claimed.

22.2.0.22 A closing comment...

Linear separability is a powerful technique that can be used to learn concepts that are considerably more complicated than hyperplane separation. The lifting technique shown above is known as the kernel technique, or linearization.

22.3 A Little Bit On VC Dimension

22.3.0.23 A Little Bit On VC Dimension

(A) Q: How complex is the function we are trying to learn? (B) The VC dimension is one way of capturing this notion (VC = Vapnik and Chervonenkis, 1971). (C) A matter of expressivity: what is harder to learn? (a) A rectangle in the plane. (b) A halfplane. (c) A convex polygon with k sides.

22.3.0.24 Thinking about concepts as binary functions...

(A) X = {p1, p2, ..., pm}: points in the plane. (B) H: the set of all halfplanes. (C) A halfplane r ∈ H defines a binary vector r(X) = (b1, ..., bm), where bi = 1 if and only if pi is inside r. (D) The possible binary vectors generated by halfplanes: U(X, H) = { r(X) | r ∈ H }. (E) A set X of m elements is shattered by a set of ranges R if |U(X, R)| = 2^m. (F) What does this mean? (G) The VC dimension of a set of ranges R is the size of the largest set that it can shatter.
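The shattering definition can be explored by brute force: enumerate all dichotomies of X and test each for realizability by a halfplane via an LP feasibility check. A sketch (scipy assumed, strict separation with margin normalized to 1), showing that halfplanes shatter 3 points in general position but fail on 4 points in convex position (the XOR labeling of a square's corners is not realizable):

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def halfplane_realizes(X, labels):
    """Is there a halfplane sign(<w, p> + b) realizing the given +/-1
    labels on the points X?  Checked as an LP (strict separation)."""
    pts = np.array(X, dtype=float)
    lab = np.array(labels, dtype=float)
    n, d = pts.shape
    A_ub = -lab[:, None] * np.hstack([pts, np.ones((n, 1))])
    res = linprog(np.zeros(d + 1), A_ub=A_ub, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1))
    return res.success

def shattered(X):
    """X is shattered if every dichotomy is realizable: |U(X, H)| = 2^|X|."""
    return all(halfplane_realizes(X, lab)
               for lab in itertools.product([1, -1], repeat=len(X)))

triangle = [(0, 0), (1, 0), (0, 1)]          # 3 points in general position
square   = [(0, 0), (1, 0), (1, 1), (0, 1)]  # 4 points in convex position
print(shattered(triangle))   # all 8 dichotomies are realizable
print(shattered(square))     # the XOR labeling of opposite corners fails
```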

22.3.1 Examples

22.3.1.1 Examples

What is the VC dimension of circles in the plane? Let X be a set of n points in the plane, and let C be the set of all circles. Consider X = {p, q, r, s}.


What subsets of X can we generate by a circle?

[Figure: four points p, q, r, s in the plane.]

22.3.1.2 Subsets realized by disks

{}, {r}, {p}, {q}, {s}, {p, s}, {p, q}, {p, r}, {r, q}, {q, s}, and {r, p, q}, {p, r, s}, {p, s, q}, {s, q, r}, and {r, p, q, s}. We got only 15 sets; there is one subset which is not there. Which one? The VC dimension of circles in the plane is 3.

22.3.1.3 Sauer's Lemma

Lemma 22.3.1 (Sauer's Lemma). If R has VC dimension d, then |U(X, R)| = O(m^d), where m is the size of X.