Learning to compare using operator-valued large-margin classifiers (PowerPoint presentation)



SLIDE 1

Learning to compare using operator-valued large-margin classifiers

Andreas Maurer

SLIDES 2-5

A binary classification task for pairs

$X$ = input space, embedded in a Hilbert space $H$ by a suitable kernel, so that $X \subseteq H$ and $\mathrm{diam}(X) \le 1$.

$\mu$ = a probability measure on $X^2 \times \{-1, 1\}$, the pair oracle: $\mu(x, x', r)$ is the probability of encountering the two inputs $x, x' \in X$ being homonymous (same label) for $r = 1$ and heteronymous (different labels) for $r = -1$.

A pair classifier is a function on $X^2$ that predicts the third argument of $\mu$.

$$S = \left( (x_1, x'_1, r_1), \ldots, (x_m, x'_m, r_m) \right) \in \left( X^2 \times \{-1, 1\} \right)^m$$

is the training sample, generated in $m$ independent, identical trials of $\mu$, i.e. $S \sim \mu^m$.

Goal: Use $S$ to find a pair classifier with low error probability.
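To make the sampling model concrete, here is a minimal Python sketch that builds such a sample of labelled pairs from an ordinary labelled dataset. Drawing the two inputs independently and labelling the pair by agreement of their class labels is one natural realization of a pair oracle; the function name and interface are illustrative, not from the slides.

```python
import numpy as np

def sample_pairs(X, y, m, seed=None):
    """Draw m triples (x, x', r) with r = +1 if x and x' share a class label, else -1.

    X: array of embedded inputs (one row per example), y: class labels.
    This plays the role of the pair oracle described above.
    """
    rng = np.random.default_rng(seed)
    sample = []
    for _ in range(m):
        i, j = rng.integers(len(X), size=2)
        r = 1 if y[i] == y[j] else -1
        sample.append((X[i], X[j], r))
    return sample
```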

SLIDES 6-8

Pair classifiers induced by linear transformations

We will select our classifiers from the hypothesis space

$$\left\{ f_T : (x, x') \mapsto \mathrm{sgn}\left( 1 - \|Tx - Tx'\|^2 \right) \;:\; T \in L(H) \right\}.$$

A choice of $T \in L(H)$ then implies a choice of

- the pair classifier $f_T$,
- the pseudo-metric $d(x, x') = \|Tx - Tx'\|$,
- the Mahalanobis distance $d^2(x, x') = \langle T^*T(x - x'), x - x' \rangle$, and
- the positive semidefinite kernel $(x, x') \mapsto \langle T^*Tx, x' \rangle$.

The risk of the operator $T$ is the error probability of the classifier $f_T$:

$$R(T) = \Pr_{(x, x', r) \sim \mu}\left\{ f_T(x, x') \neq r \right\} = \Pr_{(x, x', r) \sim \mu}\left\{ r\left( 1 - \|Tx - Tx'\|^2 \right) \le 0 \right\}.$$
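As an illustration of how $f_T$ and its error rate would be computed for a finite-dimensional representation of $T$ (a $d \times n$ matrix acting on already-embedded input vectors), here is a small sketch; the names and the matrix representation are assumptions for illustration.

```python
import numpy as np

def pair_classifier(T, x, x_prime):
    """f_T(x, x') = sgn(1 - ||T x - T x'||^2): +1 predicts 'same label', -1 'different'."""
    return 1 if 1.0 - np.sum((T @ x - T @ x_prime) ** 2) > 0 else -1

def empirical_error_rate(T, sample):
    """Fraction of triples (x, x', r) in the sample that f_T misclassifies."""
    return np.mean([pair_classifier(T, x, xp) != r for x, xp, r in sample])
```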
SLIDES 9-10

Estimation and generalization

Let $f : \mathbb{R} \to \mathbb{R}$ with $f \ge 1_{(-\infty, 0]}$ and Lipschitz constant $L$. For a training sample $S = \left( (x_1, x'_1, r_1), \ldots, (x_m, x'_m, r_m) \right)$ define the empirical risk estimate

$$\hat{R}_f(T, S) = \frac{1}{m} \sum_{i=1}^{m} f\left( r_i \left( 1 - \left\| T(x_i - x'_i) \right\|^2 \right) \right).$$

Theorem: For every $\delta > 0$, with probability greater than $1 - \delta$ in a sample $S \sim \mu^m$, for all $T \in L(H)$ with $\|T^*T\|_2 \ge 1$,

$$R(T) \le \hat{R}_f(T, S) + \frac{8L \|T^*T\|_2 + \sqrt{\ln\left( 2\|T^*T\|_2 / \delta \right)}}{\sqrt{m}},$$

where $\|A\|_2 = \mathrm{Tr}(A^*A)^{1/2}$ is the Hilbert-Schmidt (or Frobenius) norm of $A$.
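Once the empirical surrogate risk and $\|T^*T\|_2$ are available, evaluating the bound is plain arithmetic. A hedged sketch for a matrix-valued $T$ (the function name and interface are mine):

```python
import numpy as np

def risk_bound(emp_risk, T, L, delta, m):
    """Bound of the theorem:
    R(T) <= emp_risk + (8 L ||T*T||_2 + sqrt(ln(2 ||T*T||_2 / delta))) / sqrt(m),
    valid with probability > 1 - delta, provided ||T*T||_2 >= 1.
    """
    hs_norm = np.linalg.norm(T.T @ T, ord="fro")  # Hilbert-Schmidt norm of T*T
    return emp_risk + (8 * L * hs_norm + np.sqrt(np.log(2 * hs_norm / delta))) / np.sqrt(m)
```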

SLIDES 11-12

Regularized objectives

The theorem suggests minimizing the regularized objective

$$\Lambda_{f, \lambda}(T) := \frac{1}{m} \sum_{i=1}^{m} f\left( r_i \left( 1 - \left\| T(x_i - x'_i) \right\|^2 \right) \right) + \frac{\lambda \|T^*T\|_2}{\sqrt{m}}.$$

Since $\|T^*T\|_2 \le \|T\|_2^2$, we can also use $\|T\|_2^2$ as a stronger regularizer (computationally more efficient, but slightly inferior in experiments).

For $f$ we take the hinge loss $f_\gamma$ with margin $\gamma$: $f_\gamma$ has Lipschitz constant $1/\gamma$ and is convex. Since $\left\| T(x - x') \right\|^2 = \langle T^*T(x - x'), x - x' \rangle$ is linear in $T^*T$, the objective $\Lambda_{f, \lambda}(T)$ is a convex function of $T^*T$.
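A short sketch of this objective with the hinge surrogate, for a matrix representation of $T$. The exact normalization of the hinge, $f_\gamma(z) = \max(0,\, 1 - z/\gamma)$ (convex, Lipschitz $1/\gamma$, and dominating $1_{(-\infty, 0]}$), is my assumption; the slides only state the required properties.

```python
import numpy as np

def hinge(z, gamma):
    """Hinge surrogate with margin gamma: convex, Lipschitz 1/gamma, >= 1_{(-inf, 0]}."""
    return max(0.0, 1.0 - z / gamma)

def regularized_objective(T, sample, lam, gamma):
    """Empirical hinge risk plus lam * ||T*T||_2 / sqrt(m)."""
    m = len(sample)
    emp = np.mean([hinge(r * (1.0 - np.sum((T @ (x - xp)) ** 2)), gamma)
                   for x, xp, r in sample])
    reg = lam * np.linalg.norm(T.T @ T, ord="fro") / np.sqrt(m)
    return emp + reg
```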

slide-13
SLIDE 13
  • ptimization problem

Find T 2 L (H) to minimize f; (T) = (T T) = 1 m

m

X

i=1

f

  • ri
  • 1
  • T
  • xi x0

i

  • 2

+ pm kT Tk2 : f; is not convex in T, but is convex in T T. .

slide-14
SLIDE 14
  • ptimization problem

Find T 2 L (H) to minimize f; (T) = (T T) = 1 m

m

X

i=1

f

  • ri
  • 1
  • T
  • xi x0

i

  • 2

+ pm kT Tk2 : f; is not convex in T, but is convex in T T. First possibility: Solve convex optimization problem for on set of positive semide…nite operators by alternating projections (as in Xing et al.) Then take square root operator to get T. .

slide-15
SLIDE 15
  • ptimization problem

Find T 2 L (H) to minimize f; (T) = (T T) = 1 m

m

X

i=1

f

  • ri
  • 1
  • T
  • xi x0

i

  • 2

+ pm kT Tk2 : f; is not convex in T, but is convex in T T. First possibility: Solve convex optimization problem for on set of positive semide…nite operators by alternating projections (as in Xing et al.) Then take square root operator to get T. Second possibility (my choice): Do gradient-descent of f; in T No problems with local minima: If T is a stable local minimizer of f;, then T T is a stable local minimizer of .
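For the first route, once a positive semidefinite $M$ (playing the role of $T^*T$) has been found, a square-root operator can be obtained from its eigendecomposition. A minimal finite-dimensional sketch, with names of my choosing:

```python
import numpy as np

def psd_square_root(M):
    """Return a symmetric T with T.T @ T = M (up to numerical error), for PSD M."""
    eigvals, eigvecs = np.linalg.eigh(M)
    eigvals = np.clip(eigvals, 0.0, None)            # guard against tiny negative eigenvalues
    return (eigvecs * np.sqrt(eigvals)) @ eigvecs.T  # symmetric square root M^(1/2)
```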

SLIDE 16

Algorithm

Given a sample $S$, regularization parameter $\lambda$, margin $\gamma$ and learning rate $\eta$:

initialize $\lambda' = \lambda / \sqrt{m}$ (where $m = |S|$)
initialize $T = (v_1, \ldots, v_d)$ (where the $v_i$ are row vectors)
repeat
    Compute $\|T^*T\|_2 = \left( \sum_{ij} \langle v_i, v_j \rangle^2 \right)^{1/2}$
    For $i = 1, \ldots, d$ compute $w_i = 2 \|T^*T\|_2^{-1} \sum_j \langle v_i, v_j \rangle v_j$
    Fetch $(x, x', r)$ from the sample $S$
    For $i = 1, \ldots, d$ compute $a_i = \langle v_i, x - x' \rangle$
    Compute $b = \sum_{i=1}^{d} a_i^2$
    If $r(1 - b) < \gamma$ then for $i := 1, \ldots, d$ do $v_i \leftarrow v_i - \eta \left( r\, a_i (x - x') + \lambda' w_i \right)$
    else for $i := 1, \ldots, d$ do $v_i \leftarrow v_i - \eta \lambda' w_i$
until convergence
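The loop above is straightforward to express in numpy, storing $T$ as a $d \times n$ matrix whose rows are the vectors $v_i$. The random initialization, the grouping of the learning rate and the fixed iteration count below are my choices for the sketch; treat it as an approximation of the slide's pseudocode, not the author's exact implementation.

```python
import numpy as np

def train_T(sample, d, lam, gamma, eta, n_iters, seed=None):
    """Stochastic gradient descent on the hinge objective regularized by ||T*T||_2."""
    rng = np.random.default_rng(seed)
    m = len(sample)
    n = len(sample[0][0])
    T = rng.normal(scale=1.0 / np.sqrt(n), size=(d, n))  # rows are v_1, ..., v_d
    lam_prime = lam / np.sqrt(m)
    for _ in range(n_iters):
        G = T @ T.T                         # Gram matrix of the rows, G[i, j] = <v_i, v_j>
        hs_norm = np.sqrt(np.sum(G ** 2))   # ||T*T||_2
        W = (2.0 / hs_norm) * (G @ T)       # rows w_i = 2 ||T*T||_2^{-1} sum_j <v_i, v_j> v_j
        x, xp, r = sample[rng.integers(m)]  # fetch a training triple
        diff = x - xp
        a = T @ diff                        # a_i = <v_i, x - x'>
        b = np.sum(a ** 2)                  # b = ||T(x - x')||^2
        if r * (1.0 - b) < gamma:           # margin violated: data term plus regularizer
            T -= eta * (r * np.outer(a, diff) + lam_prime * W)
        else:                               # only the regularizer is active
            T -= eta * lam_prime * W
    return T
```

With the pair sample from the earlier sketch, a call such as `train_T(sample, d=20, lam=1.0, gamma=0.05, eta=1e-3, n_iters=100_000)` would mirror the $\lambda$ and $\gamma$ values quoted on the following slide; the remaining arguments are illustrative.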

SLIDE 17

Experiments

Experiments with invariant character recognition, spatial rotations (COIL100) and face recognition (ATT):

1. training $T$ from one task or group of tasks,
2. training nearest-neighbour test classifiers with a single example per class on a test task, using both the input metric and the metric induced by $T$,
3. recording the error rates of the test classifiers.

The pixel vectors $x$ are embedded in the space $H$ with the Gaussian RBF kernel

$$\kappa(x_1, x_2) = \frac{1}{2} \exp\left( -4 \left\| \frac{x_1}{\|x_1\|} - \frac{x_2}{\|x_2\|} \right\|^2 \right).$$

The parameters $\lambda = 1$ and $\gamma = 0.05$ are used throughout.
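A small sketch of this embedding kernel as read off the formula above (the name `rbf_pair_kernel` is mine). The factor $1/2$ gives $\kappa(x, x) = 1/2$, so $\|x - x'\|_H^2 = 1 - 2\kappa(x, x') \le 1$, which is the condition $\mathrm{diam}(X) \le 1$ from the earlier slides.

```python
import numpy as np

def rbf_pair_kernel(x1, x2):
    """kappa(x1, x2) = 0.5 * exp(-4 * || x1/||x1|| - x2/||x2|| ||^2)."""
    u = x1 / np.linalg.norm(x1)
    v = x2 / np.linalg.norm(x2)
    return 0.5 * np.exp(-4.0 * np.sum((u - v) ** 2))
```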

SLIDES 18-20

Rotation- and scale-invariant character recognition

- Typical patterns used to train the preprocessor (4000 examples from 20 classes).
- Nine digits used to train a single-nearest-neighbour classifier.
- Some digits used to test the classifier.

SLIDE 21

Results for rotation/scale-invariant OCR

ROC area (input metric): 0.539
ROC area (metric induced by T): 0.982
1-NN error (input metric): 0.822
1-NN error (metric induced by T): 0.093

Parameters: 1, 4, 0.005
Sample size: 4000
Iterations: 1000k

SLIDE 22

Norms and singular-value spectrum of T

$\|T\|_1 = 61.5$, $\|T\|_2 = 27.7$, $\|T\|_\infty = 17.3$

(Plot of the singular-value spectrum of $T$ over the indices 1 to 20.)

SLIDE 23

Thank you!