Fast Differentiable Sorting and Ranking
M. Blondel, O. Teboul, Q. Berthet, J. Djolonga
March 12th, 2020

Outline:
- Background
- Proposed method
- Experimental results

DL as Differentiable Programming
Deep learning is increasingly synonymous with differentiable programming:

“People are now building a new kind of software by assembling networks of parameterized functional blocks and by training them from examples using some form of gradient-based optimization. An increasingly large number of people are defining the networks procedurally in a data-dependent way (with loops and conditionals), allowing them to change dynamically as a function of the input data fed to them.”
Yann LeCun, 2018
Many computer programming operations remain poorly differentiable. In this work, we focus on sorting and ranking.
Example uses of sorting and ranking:
- nearest neighbours: (1) select neighbours, (2) majority vote
- select top-k activations
- NDCG loss and others
- data viewed as ranks
- ignore large errors
- empirical distribution function, quantile normalization
Slide credit: Marco Cuturi
Argsort (descending), Sort (descending)
Sort: piecewise linear → induces non-convexity
Ranks: piecewise constant → discontinuous
Soft ranks: differentiable proxies to “hard” ranks
- ranks in O(n³) time [Taylor et al., 2008]
- pairwise relaxations, summing over all pairs i ≠ j (hence O(n²) time)
- O(T n²) time [Cuturi et al., 2019]
None of these works achieves O(n log n) complexity without unrolling (backward pass).
The approach:
→ Turn the algorithmic function into an optimization problem (a linear program)
→ Turn the LP into a projection onto convex polytopes
→ Ideally, the projection should be computable at the same cost as the original function…
→ Could be challenging (argmin differentiation problem)
Cuturi et al. [2019] vs. this work:
- Polytope: Birkhoff polytope ℬ ⊂ ℝ^{n×n} vs. permutahedron 𝒬 ⊂ ℝ^n
- Regularization: entropy vs. L2 or entropy
- Algorithm: Sinkhorn vs. Pool Adjacent Violators (PAV)
- Differentiation: backprop through Sinkhorn iterates vs. differentiate the PAV solution

[Figure: the Birkhoff polytope ℬ ⊂ ℝ^{n×n}, with vertices ϕ(σ) for the permutations σ = (1,2,3), (1,3,2), (2,1,3), (2,3,1), (3,1,2), (3,2,1); and the permutahedron 𝒬 ⊂ ℝ^n, with vertices (3,2,1), (3,1,2), (2,1,3), (1,2,3), (1,3,2), (2,3,1).]
Sorting and ranking as linear programs over the permutahedron 𝒬(w), the convex hull of the set Σ(w) ≜ {w_σ : σ ∈ Σ} of all permutations of w. With ρ ≜ (n, n−1, …, 1):

Proposition
- sort: s(θ) = argmax_{y ∈ 𝒬(θ)} ⟨y, ρ⟩
- ranks: r(θ) = argmax_{y ∈ 𝒬(ρ)} ⟨y, −θ⟩

Indeed, s(θ) = θ_σ for the σ maximizing ⟨θ_σ, ρ⟩ over σ ∈ Σ, and maximizing the linear objective ⟨y, ρ⟩ over the finite set Σ(θ) is the same as maximizing it over its convex hull 𝒬(θ).
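As a concrete reference point, here is a minimal NumPy sketch (not from the paper) of the hard operators in the descending convention used above, with ρ = (n, n−1, …, 1); these are exactly the vertex solutions selected by the two linear programs.

    import numpy as np

    def hard_sort_and_ranks(theta):
        """Hard (non-differentiable) sort and ranks, descending convention:
        rank 1 goes to the largest entry.  These are the LP vertex solutions."""
        theta = np.asarray(theta, dtype=float)
        sigma = np.argsort(-theta)                 # argsort (descending)
        s = theta[sigma]                           # sort (descending)
        r = np.empty(len(theta), dtype=int)
        r[sigma] = np.arange(1, len(theta) + 1)    # rank = position in descending order
        return s, r

    print(hard_sort_and_ranks([2.9, 0.1, 1.2]))    # s = [2.9, 1.2, 0.1], r = [1, 3, 2]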
Definition (soft operators)
P_Q(z, w) ≜ argmax_{y ∈ 𝒬(w)} ⟨y, z⟩ − Q(y)

Quadratic regularization Q(y) ≜ ½ ∥y∥², so that
P_Q(z, w) = argmin_{y ∈ 𝒬(w)} ½ ∥y − z∥²,
a Euclidean projection onto the permutahedron.

Soft sort and soft ranks: s_εQ(θ) ≜ P_εQ(ρ, θ) and r_εQ(θ) ≜ P_εQ(−θ, ρ), with ε > 0 the regularization strength.
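To make the definition concrete, here is a brute-force sanity check (not the paper's algorithm, and only feasible for tiny n): the soft ranks are the Euclidean projection of −θ/ε onto 𝒬(ρ), computed here by projecting onto the convex hull of all permutations of ρ with a generic solver. The permutation enumeration and the SLSQP call are illustration choices of this sketch, not part of the method.

    import itertools
    import numpy as np
    from scipy.optimize import minimize

    def soft_rank_bruteforce(theta, eps=1.0):
        """Soft ranks straight from the definition: project -theta/eps onto
        Q(rho) = conv{permutations of rho}.  Exponential cost, tiny n only."""
        theta = np.asarray(theta, dtype=float)
        n = len(theta)
        rho = np.arange(n, 0, -1, dtype=float)                  # (n, n-1, ..., 1)
        vertices = np.array(list(itertools.permutations(rho)))  # (n!, n)
        z = -theta / eps

        def objective(lam):  # squared distance between the convex combination and z
            return 0.5 * np.sum((vertices.T @ lam - z) ** 2)

        lam0 = np.full(len(vertices), 1.0 / len(vertices))
        res = minimize(objective, lam0, method="SLSQP",
                       bounds=[(0.0, 1.0)] * len(vertices),
                       constraints=[{"type": "eq", "fun": lambda lam: lam.sum() - 1.0}])
        return vertices.T @ res.x

    # Should be close to the hard ranks (1, 3, 2): here eps = 1 is small enough.
    print(soft_rank_bruteforce([2.9, 0.1, 1.2], eps=1.0))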
Properties
- s_Q and r_Q are 1-Lipschitz continuous and differentiable almost everywhere.
- Converge to the hard versions when ε ≤ ε_min.
- Collapse to a mean, mean(ρ)·1, when ε → ∞.
- Order preserving (paths don’t cross).
[Figure: the permutahedron with vertices (3,2,1), (3,1,2), (2,1,3), (1,2,3), (1,3,2), (2,3,1); the soft ranks r_εQ(θ) of θ = (2.9, 0.1, 1.2) are shown for ε = 1, 2, 3, 100, moving from the hard ranks toward mean(ρ)·1 as ε grows.]
Proposition (reduction to isotonic regression)
The projection onto the permutahedron reduces to an isotonic regression with the ordering constraint v_1 ≥ … ≥ v_n (e.g. [Negrinho & Martins, 2014; Lim & Wright, 2016]). The isotonic solution is the dual solution; the primal solution is recovered through a primal-dual relation. Total time cost: O(n log n).
Boils down to solving
v⋆ = argmin_{v_1 ≥ … ≥ v_n} ½ ∥v − u∥², with u ≜ s − w   [Best, 2000]

Pool Adjacent Violators (PAV): finds the optimal partition into blocks by repeatedly merging (pooling) adjacent blocks that violate the ordering constraint, setting each block to the mean of its entries. The worst-case cost is O(n).

Example (n = 6), u = s − w:
v⋆_1 = v⋆_2 = mean(u_1, u_2)
v⋆_3 = mean(u_3) = u_3
v⋆_4 = v⋆_5 = v⋆_6 = mean(u_4, u_5, u_6)
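The following is a minimal sketch of this pipeline (my own, following the reduction above, not the authors' released code): PAV for decreasing isotonic regression, and the soft ranks and soft sort obtained from it. The function names and the exact way ε rescales the input are conventions of this sketch.

    import numpy as np

    def isotonic_regression_decreasing(u):
        """PAV for  min_v 0.5 * ||v - u||^2  s.t.  v1 >= ... >= vn.
        Scan left to right, pooling (averaging) adjacent blocks whenever
        the decreasing constraint is violated.  Linear time."""
        means, sizes = [], []
        for x in np.asarray(u, dtype=float):
            means.append(x)
            sizes.append(1)
            while len(means) > 1 and means[-2] < means[-1]:   # violation: pool
                m, s = means.pop(), sizes.pop()
                means[-1] = (means[-1] * sizes[-1] + m * s) / (sizes[-1] + s)
                sizes[-1] += s
        return np.repeat(means, sizes)

    def soft_rank(theta, eps=1.0):
        """r_{eps Q}(theta): projection of -theta/eps onto Q(rho), via
        sort + isotonic regression + primal-dual relation, O(n log n)."""
        theta = np.asarray(theta, dtype=float)
        n = len(theta)
        rho = np.arange(n, 0, -1, dtype=float)
        z = -theta / eps
        sigma = np.argsort(-z)                    # sorts z in descending order
        v = isotonic_regression_decreasing(z[sigma] - rho)
        out = np.empty(n)
        out[sigma] = z[sigma] - v                 # undo the sorting permutation
        return out

    def soft_sort(theta, eps=1.0):
        """s_{eps Q}(theta): projection of rho/eps onto Q(theta); rho/eps is
        already sorted, so no output permutation is needed."""
        theta = np.asarray(theta, dtype=float)
        n = len(theta)
        rho = np.arange(n, 0, -1, dtype=float)
        w = np.sort(theta)[::-1]                  # theta sorted descending
        z = rho / eps
        return z - isotonic_regression_decreasing(z - w)

    print(isotonic_regression_decreasing([1.0, 2.0, 0.0]))  # -> [1.5, 1.5, 0.0]
    print(soft_rank([2.9, 0.1, 1.2], eps=1.0))              # hard ranks here: [1., 3., 2.]
    print(soft_sort([2.9, 0.1, 1.2], eps=1.0))              # smoothed sorted values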
Differentiation (see also [Djolonga & Krause, 2017])
- Differentiate v_Q(s, w) = argmin_{v_1 ≥ … ≥ v_n} ½ ∥v − (s − w)∥² w.r.t. s and w: the solution averages u = s − w within each block of the PAV partition, so its Jacobian is blockwise constant.
- Differentiate P_Q(z, w) w.r.t. z and w through the primal-dual relation.

Proposition
Multiplication with the Jacobian takes O(n) time and space (see paper).
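Under the quadratic regularization, the isotonic solution averages u = s − w within each block, so a Jacobian-vector product only needs the block sizes found by PAV. Below is a hypothetical O(n) sketch of that product, not the paper's implementation; the helper name and the block_sizes argument are illustration choices.

    import numpy as np

    def isotonic_jvp(block_sizes, t):
        """Multiply the (block-diagonal, averaging) Jacobian of the isotonic
        solution w.r.t. its input by a vector t: average t within each block.
        O(n) time and space.  block_sizes come from the PAV partition."""
        t = np.asarray(t, dtype=float)
        out = np.empty_like(t)
        start = 0
        for size in block_sizes:
            out[start:start + size] = t[start:start + size].mean()
            start += size
        return out

    # Blocks {1,2}, {3}, {4,5,6} as in the n = 6 example above:
    print(isotonic_jvp([2, 1, 3], np.arange(6.0)))  # -> [0.5, 0.5, 2., 4., 4., 4.]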
Application: robust regression.

Least squares (LS):
min_w Σ_{i=1}^n ℓ_i(w), with ℓ_i(w) the i-th loss.

Soft least trimmed squares (SLTS): soft-sort the losses in descending order and drop the k largest,
min_w Σ_{i=k+1}^n ℓ̃_i(w), with ℓ̃_i(w) ≜ [s_εQ(ℓ(w))]_i the i-th “soft sorted” loss.
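A small sketch of the SLTS objective follows; it is illustrative only. The name slts_objective is made up here, and soft_sort_fn stands in for any differentiable soft sort (for instance the sketch after the PAV slide, or the authors' forthcoming code).

    import numpy as np

    def slts_objective(per_example_losses, k, soft_sort_fn, eps=1.0):
        """Soft least trimmed squares: soft-sort the losses in descending
        order and sum entries k+1..n, i.e. drop (softly) the k largest."""
        soft_sorted = soft_sort_fn(per_example_losses, eps)
        return float(np.sum(soft_sorted[k:]))

    # Degenerate sanity check with the hard sort standing in for the soft sort:
    hard_sort = lambda losses, eps: np.sort(losses)[::-1]
    print(slts_objective(np.array([4.0, 0.1, 0.2, 9.0]), k=1, soft_sort_fn=hard_sort))
    # -> 4.3 : the single largest loss (9.0) is dropped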
Evaluation: 10-fold CV. Hyper-parameter selection: 5-fold CV.
[Figure: ground truth vs. predicted soft ranks, compared with Cuturi et al. [2019].]
Comparison on 21 datasets, 5-fold CV
Summary: O(n log n) computation and O(n) differentiation, thanks to a reduction to isotonic optimization; experiments include robust regression and label ranking.
Preprint: Fast Differentiable Sorting and Ranking [arXiv:2002.08871]. Code: coming soon!