

SLIDE 1

A Consistent Regularization Approach for Structured Prediction

Carlo Ciliberto, Alessandro Rudi, Lorenzo Rosasco
University of Genova, Istituto Italiano di Tecnologia, Massachusetts Institute of Technology
lcsl.mit.edu
Dec 9th, NIPS 2016

SLIDE 2

Structured Prediction

SLIDE 3

Outline

◮ Standard Supervised Learning
◮ Structured Prediction with SELF
◮ Algorithm
◮ Theory
◮ Experiments
◮ Conclusions

SLIDE 4

Outline

◮ Standard Supervised Learning
◮ Structured Prediction with SELF
◮ Algorithm
◮ Theory
◮ Experiments
◮ Conclusions

SLIDE 5

Scalar Learning

Goal: given (xi, yi)_{i=1}^n, find fn : X → Y. Let Y = R.

◮ Parametrize

f(x) = w⊤ϕ(x),   w ∈ RP,   ϕ : X → RP

◮ Learn

fn(x) = wn⊤ϕ(x),   wn = argmin_{w∈RP} (1/n) Σ_{i=1}^n L(w⊤ϕ(xi), yi)

SLIDE 6

Multi-variate Learning

Goal: given (xi, yi)_{i=1}^n, find fn : X → Y. Let Y = RM.

◮ Parametrize

f(x) = Wϕ(x),   W ∈ RM×P,   ϕ : X → RP

◮ Learn

fn(x) = Wn ϕ(x),   Wn = argmin_{W∈RM×P} (1/n) Σ_{i=1}^n L(Wϕ(xi), yi)
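For concreteness, here is a minimal numpy sketch of this multi-variate least-squares step with the identity feature map; the data and variable names are illustrative, not from the talk.

```python
import numpy as np

# Toy data: n inputs in R^P, outputs in R^M, identity feature map phi(x) = x.
n, P, M = 200, 5, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(n, P))                     # rows are phi(x_i)
W_true = rng.normal(size=(M, P))
Y = X @ W_true.T + 0.1 * rng.normal(size=(n, M))

# W_n = argmin_W (1/n) sum_i ||W phi(x_i) - y_i||^2  (squared loss, closed form).
W_lsq, *_ = np.linalg.lstsq(X, Y, rcond=None)   # shape (P, M); W_n is its transpose
f_n = lambda x: W_lsq.T @ x                     # f_n(x) = W_n phi(x)

print(np.allclose(W_lsq.T, W_true, atol=0.1))   # close to the generating W
```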

SLIDE 7

Learning Theory

Expected Risk

E(f) = ∫_{X×Y} L(f(x), y) dρ(x, y)

◮ Consistency

lim_{n→+∞} E(fn) = inf_f E(f)   (in probability)

◮ Excess Risk Bounds

E(fn) − inf_{f∈H} E(f) ≤ ε(n, ρ, H)   (w.h.p.)

SLIDE 8

Outline

◮ Standard Supervised Learning
◮ Structured Prediction with SELF
◮ Algorithm
◮ Theory
◮ Experiments
◮ Conclusions

SLIDE 9

(Un)Structured prediction

What if Y is not a vector space? (e.g. strings, graphs, histograms, etc.)

Q. How do we:

◮ Parametrize
◮ Learn

a function f : X → Y ?

SLIDE 10

Possible Approaches

◮ Score-Learning Methods

+ General algorithmic framework (e.g. StructSVM [Tsochantaridis et al. '05])
− Limited theory ([McAllester '06])

◮ Surrogate/Relaxation approaches:

+ Clear theory
− Only for special cases (e.g. classification, ranking, multi-labeling, etc.)

[Bartlett et al. '06, Duchi et al. '10, Mroueh et al. '12, Gao et al. '13]

SLIDE 11

Relaxation Approaches

1. Encoding: choose c : Y → RM

2. Learning: given (xi, c(yi))_{i=1}^n, find gn : X → RM

3. Decoding: choose d : RM → Y and let fn(x) = (d ◦ gn)(x)

SLIDE 12

Example I: Binary Classification

Let Y = {−1, 1}

1. Encoding: c : {−1, 1} → R the identity

2. Scalar learning: gn : X → R

3. Decoding: d = sign : R → {−1, 1},   fn(x) = sign(gn(x))

SLIDE 13

Example II: Multi-class Classification

Let Y = {1, . . . , M}

1. Encoding: c : Y → {e1, . . . , eM} ⊂ RM the canonical basis, c(j) = ej ∈ RM

2. Multi-variate learning: gn : X → RM

3. Decoding: d : RM → {1, . . . , M},

fn(x) = argmax_{j=1,...,M} ej⊤ gn(x)   (the j-th entry of gn(x); see the sketch below)
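A minimal numpy sketch of this encode/learn/decode pipeline (one-hot encoding, multi-variate least squares, argmax decoding); the toy data and the 0-indexed labels are assumptions made for the example.

```python
import numpy as np

def encode(y, M):
    """c(j) = e_j: one-hot encoding of labels in {0, ..., M-1}."""
    C = np.zeros((len(y), M))
    C[np.arange(len(y)), y] = 1.0
    return C

# Toy multi-class data with identity feature map (labels 0-indexed here).
M, n, P = 3, 300, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(n, P))
y = X[:, :M].argmax(axis=1)

# Multi-variate least squares on the encoded outputs: g_n(x) = W_n phi(x).
W_lsq, *_ = np.linalg.lstsq(X, encode(y, M), rcond=None)

def decode(x):
    """f_n(x) = argmax_j e_j^T g_n(x), i.e. the largest entry of g_n(x)."""
    return int(np.argmax(W_lsq.T @ x))

print(decode(X[0]), y[0])   # prediction vs. true label on a training point
```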
SLIDE 14

A General Relaxation Approach

SLIDE 15

A General Relaxation Approach

Main Assumption. Structure Encoding Loss Function (SELF)

Given △ : Y × Y → R, there exist

◮ an RKHS HY with feature map c : Y → HY
◮ a bounded linear operator V : HY → HY

such that

△(y, y′) = ⟨c(y), V c(y′)⟩HY   ∀ y, y′ ∈ Y

Note. If V is positive semidefinite, then △ is a kernel.

SLIDE 16

SELF: Examples

◮ Binary classification: c : {−1, 1} → R and V = 1.

◮ Multi-class classification: c(j) = ej ∈ RM and V = 1 − I ∈ RM×M (1 the all-ones matrix), so that △(j, j′) = 1 − δjj′ is the 0-1 loss (verified numerically below).

◮ Kernel Dependency Estimation (KDE) [Weston et al. '02, Cortes et al. '05]:

△(y, y′) = 1 − h(y, y′), with h : Y × Y → R a kernel on Y.
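As a quick sanity check (not from the talk), the multi-class case can be verified numerically: with c(j) = ej and V = 1 − I, the SELF inner product reproduces the 0-1 loss.

```python
import numpy as np

M = 5
V = np.ones((M, M)) - np.eye(M)     # V = 1 - I (all-ones matrix minus identity)
c = lambda j: np.eye(M)[j]          # c(j) = e_j, canonical basis vector

# SELF identity: <c(y), V c(y')> should reproduce the 0-1 loss.
for y in range(M):
    for yp in range(M):
        assert np.isclose(c(y) @ V @ c(yp), 0.0 if y == yp else 1.0)
print("0-1 loss is SELF with c(j) = e_j and V = 1 - I")
```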

SLIDE 17

SELF: Finite Y

All △ on a discrete Y are SELF.

Examples:

◮ Strings: edit distance, KL divergence, word error rate, . . .
◮ Ordered sequences: rank loss, . . .
◮ Graphs/Trees: graph/tree edit distance, subgraph matching, . . .
◮ Discrete subsets: weighted overlap loss, . . .
◮ . . .

SLIDE 18

SELF: More examples

◮ Histograms/Probabilities: e.g. χ², Hellinger, . . .
◮ Manifolds: diffusion distances
◮ . . .

SLIDE 19

Relaxation with SELF

1. Encoding. c : Y → HY, the canonical feature map of HY

2. Surrogate Learning. Multi-variate regression gn : X → HY

3. Decoding.

fn(x) = argmin_{y∈Y} ⟨c(y), V gn(x)⟩HY

SLIDE 20

Surrogate Learning

Multi-variate learning with ridge regression

◮ Parametrize

g(x) = Wϕ(x),   W ∈ RM×P,   ϕ : X → RP

◮ Learn

gn(x) = Wn ϕ(x),   Wn = argmin_{W∈RM×P} (1/n) Σ_{i=1}^n ‖Wϕ(xi) − c(yi)‖²HY   (least squares)

SLIDE 21

Learning (cont.)

Solution¹

gn(x) = Wn ϕ(x),   Wn = C (Φ⊤Φ)−1Φ⊤ = C A,   with A = (Φ⊤Φ)−1Φ⊤ ∈ Rn×P

◮ Φ = [ϕ(x1), . . . , ϕ(xn)] ∈ RP×n   (input features)
◮ C = [c(y1), . . . , c(yn)] ∈ RM×n   (output features)

¹ In practice, add a regularizer!

SLIDE 22

Decoding

Lemma (Ciliberto, Rudi, Rosasco '16)

Let gn(x) = C A ϕ(x) be the solution of the surrogate problem. Then

fn(x) = argmin_{y∈Y} ⟨c(y), V gn(x)⟩HY

can be written as

fn(x) = argmin_{y∈Y} Σ_{i=1}^n αi(x) △(y, yi),   where (α1(x), . . . , αn(x))⊤ = A ϕ(x) ∈ Rn

SLIDE 23

Decoding

Sketch of the proof:

◮ gn(x) = C A ϕ(x) = Σ_{i=1}^n αi(x) c(yi),   with (α1(x), . . . , αn(x))⊤ = A ϕ(x) ∈ Rn

◮ Plugging gn(x) in:

⟨c(y), V gn(x)⟩HY = ⟨c(y), V Σ_{i=1}^n αi(x) c(yi)⟩HY = Σ_{i=1}^n αi(x) ⟨c(y), V c(yi)⟩HY = Σ_{i=1}^n αi(x) △(y, yi)   (SELF)

SLIDE 24

SELF Learning

Two steps (sketched in code below):

1. Surrogate Learning

(α1(x), . . . , αn(x))⊤ = A ϕ(x),   A = (Φ⊤Φ + nλI)−1Φ⊤

2. Decoding

fn(x) = argmin_{y∈Y} Σ_{i=1}^n αi(x) △(y, yi)

Note:

◮ Implicit encoding: no need to know HY or V (extends the kernel trick)!
◮ Optimization over Y is problem specific and can be a challenge.
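A compact numpy sketch of the two steps for a finite Y, assuming a Gaussian kernel on X and brute-force decoding over a list of candidate outputs; the kernel, data, and loss below are illustrative choices, not prescribed by the talk.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)) for all pairs of rows."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def self_fit_predict(X_tr, y_tr, X_te, candidates, loss, lam=1e-3):
    """Step 1: alpha(x) = (K + n*lam*I)^{-1} k_x   (kernel ridge weights).
    Step 2:  f_n(x) = argmin_{y in Y} sum_i alpha_i(x) loss(y, y_i)."""
    n = len(X_tr)
    K = gaussian_kernel(X_tr, X_tr)
    alphas = np.linalg.solve(K + n * lam * np.eye(n), gaussian_kernel(X_tr, X_te)).T
    # Precompute loss(y, y_i) for every candidate y and every training label y_i.
    L = np.array([[loss(y, yi) for yi in y_tr] for y in candidates])   # |Y| x n
    return [candidates[np.argmin(L @ a)] for a in alphas]

# Toy usage: binary labels with the 0-1 loss (any SELF loss would do).
rng = np.random.default_rng(0)
X_tr = rng.normal(size=(100, 2)); y_tr = (X_tr[:, 0] > 0).astype(int)
X_te = rng.normal(size=(5, 2))
print(self_fit_predict(X_tr, y_tr, X_te, candidates=[0, 1],
                       loss=lambda y, yi: float(y != yi)))
```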

SLIDE 25

Connections with Previous Work

◮ Score-Learning approaches (e.g. StructSVM [Tsochantaridis et al. '05]):

In StructSVM it is possible to choose any feature map on the output...
... here we show that this choice must be compatible with △.

◮ Kernel dependency estimation: △ is (one minus) a kernel.

◮ Conditional mean embeddings? [Smola et al. '07]

SLIDE 26

Relaxation Analysis

SLIDE 27

Relaxation Analysis

Consider

E(f) = ∫_{X×Y} △(f(x), y) dρ(x, y)   and   R(g) = ∫_{X×Y} ‖g(x) − c(y)‖²HY dρ(x, y)

How are R(gn) and E(fn) related?

SLIDE 28

Relaxation Analysis

f∗ = argmin_{f:X→Y} E(f)   and   g∗ = argmin_{g:X→HY} R(g)

Key properties:

◮ Fisher Consistency (FC)

E(d ◦ g∗) = E(f∗)

◮ Comparison Inequality (CI)

∃ θ : R → R with θ(r) → 0 as r → 0 such that

E(d ◦ g) − E(f∗) ≤ θ(R(g) − R(g∗))   ∀ g : X → HY

SLIDE 29

SELF Relaxation Analysis

Theorem (Ciliberto, Rudi, Rosasco '16)

Let △ : Y × Y → R be a SELF loss and g∗ : X → HY the least-squares "relaxed" solution. Then:

◮ Fisher Consistency

E(d ◦ g∗) = E(f∗)

◮ Comparison Inequality: ∀ g : X → HY

E(d ◦ g) − E(f∗) ≲ √( R(g) − R(g∗) )
SLIDE 30

SELF Relaxation Analysis (cont.)

Lemma (Ciliberto, Rudi, Rosasco '16)

Let △ : Y × Y → R be a SELF loss. Then

E(f) = ∫_X ⟨c(f(x)), V g∗(x)⟩HY dρX(x)

where g∗ : X → HY minimizes

R(g) = ∫_{X×Y} ‖g(x) − c(y)‖²HY dρ(x, y)

Least squares on HY is a good surrogate loss.

SLIDE 31

Consistency and Generalization Bounds

Theorem (Ciliberto, Rudi, Rosasco '16)

If we consider a universal feature map and λ = 1/√n, then

lim_{n→∞} E(fn) = E(f∗)   almost surely.

Moreover, under mild assumptions,

E(fn) − E(f∗) ≲ n^{−1/4}   (w.h.p.)

Proof. Relaxation analysis + (kernel) ridge regression results: R(gn) − R(g∗) ≲ n^{−1/2}.

SLIDE 32

Remarks

◮ First result proving universal consistency and excess risk bounds for general structured prediction (partial results for KDE in [Giguère et al. '13]).

◮ Rates are sharp for the class of SELF loss functions △, i.e. matching classification results.

◮ Faster rates under further regularity conditions.

SLIDE 33

Experiments: Ranking

△rank(f(x), y) = Σ_{i,j=1}^M γ(y)ij (1 − sign(f(x)i − f(x)j)) / 2   (see the sketch below)

                                 Rank Loss
[Herbrich et al. '99]            0.432 ± 0.008
[Dekel et al. '04]               0.432 ± 0.012
[Duchi et al. '10]               0.430 ± 0.004
[Tsochantaridis et al. '05]      0.451 ± 0.008
[Ciliberto, Rudi, Rosasco '16]   0.396 ± 0.003

Ranking experiments on the MovieLens dataset with △rank [Dekel et al. '04, Duchi et al. '10]. ∼1600 movies for ∼900 users.
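A small Python sketch of △rank as written above, with a made-up pairwise weight matrix γ(y) for illustration.

```python
import numpy as np

def rank_loss(scores, gamma):
    """Delta_rank(f(x), y) = sum_{i,j} gamma(y)_ij (1 - sign(f(x)_i - f(x)_j)) / 2.

    scores : predicted scores f(x), shape (M,)
    gamma  : pairwise weights gamma(y), shape (M, M), encoding the true preferences
    """
    diff = np.sign(scores[:, None] - scores[None, :])    # sign(f(x)_i - f(x)_j)
    return float((gamma * (1.0 - diff) / 2.0).sum())

# Item 0 should rank above item 1, and item 1 above item 2.
gamma = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]], dtype=float)
print(rank_loss(np.array([3.0, 2.0, 1.0]), gamma))   # 0.0: ordering respected
print(rank_loss(np.array([1.0, 2.0, 3.0]), gamma))   # 2.0: both pairs violated
```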

SLIDE 34

Experiments: Digit Reconstruction

Digit reconstruction on USPS dataset

Loss    KDE △G           SELF △H
△G      0.149 ± 0.013    0.172 ± 0.011
△H      0.736 ± 0.032    0.647 ± 0.017
△R      0.294 ± 0.012    0.193 ± 0.015

◮ △G(f(x), y) = 1 − k(f(x), y), with k a Gaussian kernel on the output.

◮ △H(f(x), y) = ‖√f(x) − √y‖, the Hellinger distance.

◮ △R(f(x), y): recognition accuracy of an SVM digit classifier.

SLIDE 35

Experiments: Robust Estimation

△Cauchy(f(x), y) = (c/2) log(1 + ‖f(x) − y‖²/c),   c > 0
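A one-function Python version of this loss (as reconstructed above), for scalar outputs and an illustrative value of c.

```python
import numpy as np

def cauchy_loss(pred, y, c=1.0):
    """(c/2) * log(1 + |pred - y|^2 / c): grows only logarithmically for large
    residuals, which is what makes the estimator robust to outliers."""
    return 0.5 * c * np.log1p(np.abs(pred - y) ** 2 / c)

print(cauchy_loss(0.1, 0.0), cauchy_loss(5.0, 0.0))   # small vs. large residual
```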

[Plot: regression estimates produced by Alg. 1 (SELF), RNW, and KRLS.]

n      SELF           RNW            KRR
50     0.39 ± 0.17    0.45 ± 0.18    0.62 ± 0.13
100    0.21 ± 0.04    0.29 ± 0.04    0.47 ± 0.09
200    0.12 ± 0.02    0.24 ± 0.03    0.33 ± 0.04
500    0.08 ± 0.01    0.22 ± 0.02    0.31 ± 0.03
1000   0.07 ± 0.01    0.21 ± 0.02    0.19 ± 0.02

SLIDE 36

Outline

◮ Standard Supervised Learning
◮ Structured Prediction with SELF
◮ Algorithm
◮ Theory
◮ Experiments
◮ Conclusions

SLIDE 37

Wrapping Up

Contributions

1. A relaxation/regularization framework for structured prediction.
2. Theoretical guarantees: universal consistency + sharp bounds.
3. Promising empirical results.

Open Questions

◮ Surrogate loss functions beyond least squares.
◮ Efficient decoding, exploiting the loss structure.
◮ Tsybakov-noise-like conditions.

P.S. I have post-doc positions! Ping me if you are interested.