Fast Differentiable Sorting and Ranking

M. Blondel, O. Teboul, Q. Berthet, J. Djolonga
March 12th, 2020

Outline

  • Background
  • Proposed method
  • Experimental results

DL as Differentiable Programming

Deep learning is increasingly synonymous with differentiable programming:

“People are now building a new kind of software by assembling networks of parameterized functional blocks and by training them from examples using some form of gradient-based optimization. An increasingly large number of people are defining the networks procedurally in a data-dependent way (with loops and conditionals), allowing them to change dynamically as a function of the input data fed to them.”
Yann LeCun, 2018

Many computer programming operations remain poorly differentiable. In this work, we focus on sorting and ranking.

Sorting as a subroutine in ML

Ranking / sorting, an O(n log n) operation, appears as a subroutine throughout ML:

  • k-NN: (1) select neighbours, (2) majority vote
  • Classifiers: select top-k activations
  • Learning to rank: NDCG loss and others
  • Rank-based statistics: data viewed as ranks
  • Trimmed regression: ignore large errors
  • Descriptive statistics: empirical distribution function, quantile normalization
  • MoM estimators

Slide credit: Marco Cuturi

Sorting

Argsort (descending):  σ(θ) = (2, 4, 3, 1)
Sort (descending):     s(θ) ≜ θ_σ(θ) = (θ2, θ4, θ3, θ1)

s(θ) is piecewise linear and induces non-convexity.

Ranking

Ranks:  r(θ) ≜ σ⁻¹(θ) = (4, 1, 3, 2)

r(θ) is piecewise constant, hence discontinuous.
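
To make these definitions concrete, here is a small NumPy sketch (not part of the slides) that reproduces the running example; the particular values of θ are made up so that θ2 > θ4 > θ3 > θ1.

```python
import numpy as np

theta = np.array([1.0, 4.0, 2.0, 3.0])   # hypothetical values with theta2 > theta4 > theta3 > theta1

sigma = np.argsort(-theta)               # descending argsort (0-based indices)
print(sigma + 1)                         # -> [2 4 3 1], i.e. sigma(theta) = (2, 4, 3, 1)

s = theta[sigma]                         # sort: s(theta) = theta permuted by sigma(theta)
print(s)                                 # -> [4. 3. 2. 1.] = (theta2, theta4, theta3, theta1)

r = np.empty(len(theta), dtype=int)
r[sigma] = np.arange(1, len(theta) + 1)  # ranks: invert the permutation, r(theta) = sigma^{-1}
print(r)                                 # -> [4 1 3 2]
```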

Related work on soft ranks

Soft ranks: differentiable proxies to “hard” ranks.

  • Random perturbation technique to compute expected ranks in O(n³) time [Taylor et al., 2008]
  • Pairwise comparisons in O(n²) time [Qin et al., 2010] (sketched below):
        ri(θ) ≜ 1 + ∑_{j ≠ i} 1[θi < θj]
  • Regularized optimal transport approach and Sinkhorn in O(T n²) time [Cuturi et al., 2019]

None of these works achieves O(n log n) complexity.
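
As an illustration of the O(n²) pairwise construction above, the sketch below replaces the indicator 1[θi < θj] with a logistic sigmoid; the sigmoid and the temperature τ are illustrative choices, not necessarily the exact surrogate used by Qin et al.

```python
import numpy as np

def pairwise_soft_ranks(theta, tau=1.0):
    """O(n^2) soft ranks: r_i = 1 + sum_{j != i} sigmoid((theta_j - theta_i) / tau)."""
    diff = theta[None, :] - theta[:, None]            # diff[i, j] = theta_j - theta_i
    soft_greater = 1.0 / (1.0 + np.exp(-diff / tau))  # smooth surrogate of 1[theta_i < theta_j]
    np.fill_diagonal(soft_greater, 0.0)               # exclude the j = i terms
    return 1.0 + soft_greater.sum(axis=1)

theta = np.array([1.0, 4.0, 2.0, 3.0])
print(pairwise_soft_ranks(theta, tau=0.01))           # ~ hard ranks [4. 1. 3. 2.]
print(pairwise_soft_ranks(theta, tau=10.0))           # heavily smoothed, pulled toward 2.5
```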

Proposed method

Our proposal

  • Differentiable (soft) relaxations of s(θ) and r(θ)
  • Two formulations: L2- and entropy-regularized
  • “Convexification” effect
  • Exact computation in O(n log n) time (forward pass)
  • Exact multiplication with the Jacobian in O(n) time, without unrolling (backward pass)

Strategy outline

  • 1. Express s(θ) and r(θ) as linear programs (LPs) over convex polytopes
       → Turn an algorithmic function into an optimization problem
  • 2. Introduce regularization in the LP
       → Turn the LP into a projection onto convex polytopes
  • 3. Derive an algorithm for computing the projection
       → Ideally, the projection should be computable at the same cost as the original function…
  • 4. Derive an algorithm for differentiating the projection
       → Could be challenging (argmin differentiation problem)

Strategy outline

                        Cuturi et al. [2019]                   This work
  1. LP                 Birkhoff polytope ℬ ⊂ ℝn×n             Permutahedron 𝒬 ⊂ ℝn
  2. Regularization     Entropy                                L2 or Entropy
  3. Computation        Sinkhorn                               Pool Adjacent Violators (PAV)
  4. Differentiation    Backprop through Sinkhorn iterates     Differentiate PAV solution

(Figures: the Birkhoff polytope with vertices ϕ(σ) for σ ∈ Σ, and the permutahedron with vertices the permutations of (3, 2, 1).)

Permutahedron

𝒬(w) ≜ conv({w_σ : σ ∈ Σ}) ⊂ ℝn

ρ ≜ (n, n − 1, …, 1)

(Figure: 𝒬(ρ) ⊂ ℝn for n = 3, the hexagon with vertices (3, 2, 1), (3, 1, 2), (2, 1, 3), (1, 2, 3), (1, 3, 2), (2, 3, 1).)

Step 1: linear programming formulations

Proposition

    s(θ) = arg max_{y ∈ 𝒬(θ)} ⟨y, ρ⟩
    r(θ) = arg max_{y ∈ 𝒬(ρ)} ⟨y, −θ⟩

where ρ ≜ (n, n − 1, …, 1).
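
The two formulations can be checked by brute force on a small example: a linear program attains its maximum at a vertex, and the vertices of 𝒬(w) are the permutations of w. The snippet below (illustrative only, O(n!) enumeration) does exactly that.

```python
import itertools
import numpy as np

theta = np.array([1.0, 4.0, 2.0, 3.0])          # hypothetical input
n = len(theta)
rho = np.arange(n, 0, -1, dtype=float)          # rho = (n, n-1, ..., 1)

def lp_over_permutahedron(w, direction):
    """arg max_{y in Q(w)} <y, direction>, by enumerating the n! vertices of Q(w)."""
    vertices = (np.array(p) for p in itertools.permutations(w))
    return max(vertices, key=lambda y: float(np.dot(y, direction)))

print(lp_over_permutahedron(theta, rho))        # -> [4. 3. 2. 1.] = s(theta), the descending sort
print(lp_over_permutahedron(rho, -theta))       # -> [4. 1. 3. 2.] = r(theta), the ranks
```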

Proof of the first claim

Since ρ has strictly decreasing entries (n > n − 1 > … > 1),
    σ(θ) = arg max_{σ ∈ Σ} ⟨θ_σ, ρ⟩.

Therefore
    s(θ) ≜ θ_σ(θ)
         = arg max_{θ_σ : σ ∈ Σ} ⟨θ_σ, ρ⟩
         = arg max_{y ∈ Σ(θ)} ⟨y, ρ⟩
         = arg max_{y ∈ 𝒬(θ)} ⟨y, ρ⟩,

where Σ(θ) denotes {θ_σ : σ ∈ Σ}, the vertex set of 𝒬(θ); the last equality holds because a linear program attains its maximum at a vertex.

Step 2: introducing regularization

PQ(z, w) ≜ arg max_{y ∈ 𝒬(w)} ⟨y, z⟩ − Q(y) = arg min_{y ∈ 𝒬(w)} ∥y − z∥²

with quadratic regularization Q(y) ≜ ½ ∥y∥².

Definition

    sεQ(θ) ≜ PεQ(ρ, θ) = PQ(ρ/ε, θ)
    rεQ(θ) ≜ PεQ(−θ, ρ) = PQ(−θ/ε, ρ)

(The second equality in each line holds because scaling the objective by 1/ε does not change the arg max.)

Continuity and differentiability

Properties

sQ and rQ are 1-Lipschitz continuous and differentiable almost everywhere.

Effect of the regularization strength ε

Properties

  • Converge to the hard version when ε ≤ εmin
  • Collapse to a mean when ε → ∞
  • Order preserving (paths don’t cross)

Regularization path

Example with θ = (2.9, 0.1, 1.2): the soft ranks rQ(θ), r2Q(θ), r3Q(θ), r100Q(θ) collapse toward mean(ρ)·1 as ε → ∞.

(Figure: the regularization path inside the permutahedron 𝒬(ρ), whose vertices are the permutations of (3, 2, 1).)

Step 3: Computation

Proposition (reduction to isotonic regression; total time cost O(n log n))

    PQ(z, w) = z − vQ(z_σ(z), w)_σ⁻¹(z)                       (primal-dual relation)

where the dual solution is the isotonic regression
    vQ(s, w) ≜ arg min_{v1 ≥ … ≥ vn} ∥v − (s − w)∥²

e.g. [Negrinho & Martins, 2014; Lim & Wright, 2016]

The problem boils down to solving
    v⋆ = arg min_{v1 ≥ … ≥ vn} ∥v − u∥²,   u = s − w          [Best, 2000]

Pool Adjacent Violators (PAV): finds a partition ℬ1, …, ℬm of the coordinates by repeatedly splitting them into blocks; within each block, v⋆ is the mean of u. The worst-case cost is O(n).

Example (n = 6): ℬ1 = {1, 2}, ℬ2 = {3}, ℬ3 = {4, 5, 6}
    v⋆1 = v⋆2 = mean(u1, u2)
    v⋆3 = mean(u3) = u3
    v⋆4 = v⋆5 = v⋆6 = mean(u4, u5, u6)
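
The sketch below implements Step 3 in plain NumPy: PAV for the isotonic regression, the primal-dual relation for the projection onto the permutahedron, and the resulting soft sort and soft ranks. Function names are mine and the implementation released with the paper may differ; sorting w inside the projection is an implementation convenience, since 𝒬(w) only depends on the multiset of entries of w.

```python
import numpy as np

def pav_decreasing(u):
    """Isotonic regression arg min_{v1 >= ... >= vn} ||v - u||^2 via Pool Adjacent Violators."""
    blocks = []                                   # stack of [block mean, block size]
    for x in np.asarray(u, dtype=float):
        blocks.append([x, 1])
        # merge adjacent blocks as long as they violate the decreasing constraint
        while len(blocks) > 1 and blocks[-2][0] < blocks[-1][0]:
            m2, s2 = blocks.pop()
            m1, s1 = blocks.pop()
            blocks.append([(s1 * m1 + s2 * m2) / (s1 + s2), s1 + s2])
    return np.concatenate([np.full(size, mean) for mean, size in blocks])

def projection(z, w):
    """Euclidean projection of z onto the permutahedron Q(w): PQ(z, w)."""
    z = np.asarray(z, dtype=float)
    w_sorted = -np.sort(-np.asarray(w, dtype=float))   # Q(w) only depends on the sorted entries of w
    sigma = np.argsort(-z)                             # descending argsort of z
    v = pav_decreasing(z[sigma] - w_sorted)            # dual solution vQ(z sorted, w)
    out = np.empty_like(z)
    out[sigma] = z[sigma] - v                          # primal-dual relation, permuted back by sigma^{-1}
    return out

def soft_sort(theta, eps=1.0):
    """s_{eps Q}(theta) = PQ(rho / eps, theta)."""
    rho = np.arange(len(theta), 0, -1, dtype=float)
    return projection(rho / eps, theta)

def soft_rank(theta, eps=1.0):
    """r_{eps Q}(theta) = PQ(-theta / eps, rho)."""
    rho = np.arange(len(theta), 0, -1, dtype=float)
    return projection(-np.asarray(theta, dtype=float) / eps, rho)

theta = np.array([1.0, 4.0, 2.0, 3.0])
print(soft_sort(theta, eps=0.1))   # -> [4. 3. 2. 1.]   (hard sort recovered for small eps)
print(soft_rank(theta, eps=0.1))   # -> [4. 1. 3. 2.]   (hard ranks recovered for small eps)
print(soft_rank(theta, eps=10.0))  # values pulled toward mean(rho) = 2.5, order preserved
```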

Step 4: Differentiation

Differentiate vQ(s, w) = arg min_{v1 ≥ … ≥ vn} ∥v − (s − w)∥² w.r.t. s and w.

Proposition
    ∂vQ(s, w)/∂s = blockdiag(B1, …, Bm) ∈ ℝn×n,
    where Bj ∈ ℝ|ℬj|×|ℬj| is the constant matrix with all entries equal to 1/|ℬj|, j ∈ [m].

Differentiate PQ(z, w) w.r.t. z and w.

Proposition
    ∂PQ(z, w)/∂z = JQ(z_σ(z), w)_σ⁻¹(z),   where  JQ(s, w) ≜ I − ∂vQ(s, w)/∂s.

Multiplication with the Jacobian takes O(n) time and space (see paper).

See also [Djolonga & Krause, 2017]
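
A minimal sketch (not from the slides) of the O(n) Jacobian multiplication: given the block partition ℬ1, …, ℬm produced by PAV, multiplying JQ = I − ∂vQ/∂s by a vector only requires one mean per block. For the full ∂PQ(z, w)/∂z one additionally permutes by σ(z) and back, as in the proposition above; the partition used below is the n = 6 example from Step 3.

```python
import numpy as np

def jq_times_vector(t, block_sizes):
    """Multiply JQ = I - dvQ/ds by t in O(n): dvQ/ds replaces each coordinate of t
    by the mean of t over its PAV block, so one mean per block suffices."""
    out = np.empty_like(t, dtype=float)
    start = 0
    for size in block_sizes:
        segment = t[start:start + size]
        out[start:start + size] = segment - segment.mean()
        start += size
    return out

t = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
print(jq_times_vector(t, block_sizes=[2, 1, 3]))   # blocks B1 = {1,2}, B2 = {3}, B3 = {4,5,6}
# -> [-0.5  0.5  0.  -1.   0.   1. ]
```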

Experimental results

Robust regression

Least squares (LS):
    min_w (1/n) ∑_{i=1}^n ℓi(w),          ℓi(w) ≜ ½ (yi − gw(xi))²          (ith loss)

Soft least trimmed squares (SLTS):
    min_w 1/(n − k) ∑_{i=k+1}^n ℓεi(w),   ℓεi(w) ≜ [sεQ(ℓ(w))]i             (ith “soft sorted” loss)

ε → 0: SLTS → LTS        ε → ∞: SLTS → LS

Evaluation: 10-fold CV. Hyper-parameter selection: 5-fold CV.
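
A hedged sketch of the SLTS objective above: soft_sort_fn(losses, ε) is assumed to return the descending soft sort sεQ (for instance the helper sketched after Step 3). A hard sort stands in below (the ε → 0 limit, i.e. plain least trimmed squares) so the example runs on its own, and the data and model are made up.

```python
import numpy as np

def slts_objective(w, X, y, k, eps, soft_sort_fn):
    """Soft least trimmed squares: mean of the n - k smallest (soft sorted) losses."""
    losses = 0.5 * (y - X @ w) ** 2               # per-example losses l_i(w)
    sorted_losses = soft_sort_fn(losses, eps)     # descending (soft) sort of the losses
    return sorted_losses[k:].mean()               # entries k+1..n: the k largest losses are trimmed

hard_sort = lambda losses, eps: np.sort(losses)[::-1]   # stand-in for the soft sort (eps -> 0 limit)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)
y[:5] += 10.0                                           # 5 outliers that trimming should ignore
print(slts_objective(w_true, X, y, k=5, eps=0.0, soft_sort_fn=hard_sort))   # small: outliers trimmed
print(0.5 * np.mean((y - X @ w_true) ** 2))                                 # plain LS loss: inflated by outliers
```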

Top-k classification

Loss ℓ: [n] × ℝn → ℝ+ comparing the ground truth with the predicted soft ranks, following Cuturi et al. [2019].

Speed benchmark


Label ranking experiment

Comparison on 21 datasets, 5-fold CV.

    ℓi ≜ ½ ∥yi − f(xi)∥²,   yi ∈ Σ

Compared models:  f(x) = rQ(g(x))  (soft ranks of the scores)  vs.  f(x) = g(x)  (raw scores).
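
A similar sketch for the label ranking loss above: soft_rank_fn(scores, ε) is assumed to return the (soft) ranks rεQ, e.g. the Step 3 helper; hard ranks stand in here so the example is self-contained, and the scores and ground-truth ranking are made up.

```python
import numpy as np

def hard_ranks(scores, eps=None):
    """Stand-in for soft_rank_fn: plain (hard) ranks, i.e. the eps -> 0 limit."""
    r = np.empty(len(scores), dtype=float)
    r[np.argsort(-scores)] = np.arange(1, len(scores) + 1)
    return r

def label_ranking_loss(scores, y_perm, eps, soft_rank_fn):
    """l_i = 1/2 * ||y_i - f(x_i)||^2 with f(x) = rQ(g(x)), the (soft) ranks of the scores g(x)."""
    return 0.5 * np.sum((y_perm - soft_rank_fn(scores, eps)) ** 2)

scores = np.array([0.1, 2.3, -0.7, 1.0])   # hypothetical scores g(x) for 4 labels
y_perm = np.array([3.0, 1.0, 4.0, 2.0])    # ground-truth ranking y_i (a permutation)
print(label_ranking_loss(scores, y_perm, eps=0.0, soft_rank_fn=hard_ranks))   # -> 0.0 (perfect ordering)
```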

Summary

  • We proposed sorting and ranking relaxations with O(n log n) computation and O(n) differentiation
  • Key techniques: projections onto the permutahedron and reduction to isotonic optimization
  • Applications to least trimmed squares, top-k classification and label ranking

Preprint: Fast Differentiable Sorting and Ranking [arXiv:2002.08871]
Code: coming soon!