SLIDE 1

Towards Principled Methodologies and Efficient Algorithms for Minimax Machine Learning

Tuo Zhao

Georgia Tech, Jun. 26, 2019. Joint work with Haoming Jiang, Minshuo Chen (Georgia Tech), Bo Dai (Google Brain), Zhaoran Wang (Northwestern U) and others.

SLIDE 2

Background

SLIDE 3

VALSE Webinar, Jun. 26 2019

Minimax Machine Learning

Conventional Empirical Risk Minimization: Given training data z_1, ..., z_n, we minimize an empirical risk function,

min_θ (1/n) ∑_{i=1}^n f(z_i; θ).

Minimax Formulation: We solve a minimax problem,

min_θ max_φ (1/n) ∑_{i=1}^n f(z_i; θ, φ).

More flexible.

Tuo Zhao — Towards Principled Methodologies and Efficient Algorithms for Minimax Machine Learning 2/38
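The minimax formulation strictly generalizes ERM: for a suitable f, maximizing out the dual variable recovers an ordinary loss. A minimal pure-Python sketch (the quadratic toy objective, data, and step sizes are illustrative, not from the talk): with f(z_i; θ, φ) = (z_i − θ)φ − φ²/2, the inner maximum equals (mean(z) − θ)²/2, so this minimax problem is least-squares ERM in disguise, and simultaneous gradient descent-ascent recovers θ* = mean(z).

```python
# Toy minimax (illustrative): min_theta max_phi (1/n) sum_i (z_i - theta)*phi - phi**2 / 2.
# Maximizing over phi in closed form gives (1/2) * (mean(z) - theta)**2, so the
# minimax problem hides least-squares ERM, with solution theta* = mean(z).
z = [1.0, 2.0, 3.0, 6.0]
mean_z = sum(z) / len(z)

theta, phi, eta = 0.0, 0.0, 0.1
for _ in range(2000):
    # simultaneous gradient descent on theta / ascent on phi
    g_theta = -phi                   # d(objective)/d(theta)
    g_phi = (mean_z - theta) - phi   # d(objective)/d(phi)
    theta -= eta * g_theta
    phi += eta * g_phi

print(round(theta, 4))  # 3.0, i.e. mean(z)
```

Because the inner problem is strongly concave in φ, plain descent-ascent converges here; the later slides show this is far from automatic in general.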

SLIDE 5

Motivating Application: Robust Deep Learning

Neural Networks are vulnerable to adversarial examples (Goodfellow et al. 2014, Madry et al. 2017).

[Figure: clean sample + imperceptible perturbation = adversarial example]

Adversarial Perturbation: max_{δ_i∈B} ℓ(f(x_i + δ_i; θ), y_i),

Adversarial Training: min_θ (1/n) ∑_{i=1}^n max_{δ_i∈B} ℓ(f(x_i + δ_i; θ), y_i),

where δ_i ∈ B denotes the imperceptible perturbation.

SLIDE 7

Motivating Application: Image Generation

Brock et al. (2019)

All are fake!


SLIDE 8

Motivating Application: Unsupervised Learning

Generative Adversarial Network: Goodfellow et al. (2014), Arjovsky et al. (2017), Miyato et al. (2018), Brock et al. (2019)

min_θ max_W (1/n) ∑_{i=1}^n φ(A(D_W(x_i))) + E_{x∼D_{G_θ}}[φ(1 − A(D_W(x)))].

D_W: Discriminator; G_θ: Generator; φ: log(·); A: Softmax.

SLIDE 10

Motivating Application: Reinforcement Learning


SLIDE 11

Motivating Application: Reinforcement Learning

Minimax Formulation: Given M = (S, A, P, R, γ), we solve

min_{π,V} max_ν L(π, V; ν) = 2E_{s,a,s′}[ν(s, a)(R(s, a) + γV(s′) − λ log(π(a|s)) − V(s))] − E_{s,a,s′}[ν²(s, a)],

where s denotes the state, a denotes the action, and

Policy: π : S → P(A); Value: V : S → R; Reward: R : S × A → R; Auxiliary Dual: ν : S × A → R.

The policy π is parameterized as a neural network, whereas ν is parameterized as a reproducing kernel function (Dai et al. 2018).

SLIDE 12

Successes of Minimax Machine Learning

Adversarial Robust Learning, Unsupervised Learning, Learning with Constraints, Reinforcement Learning, Domain Adaptation, Generative Adversarial Imitation Learning, ...

⟹ Identify the fundamental hardness of minimax machine learning and make optimization easier.

SLIDE 13

Challenges

SLIDE 14

Minimax Optimization

General Formulation: min_{x∈X} max_{y∈Y} f(x, y),

where X ⊂ R^d, Y ⊂ R^p, and f is some continuous function.

Two-Stage Optimization: Stage 1: g(x) = max_{y∈Y} f(x, y); Stage 2: min_{x∈X} g(x), solved using gradient descent.

Limitation: A global maximum of max_{y∈Y} f(x, y) needs to be obtained for evaluating ∇g(x) (Envelope Theorem, Afriat et al. (1971)).
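The two-stage scheme can be sketched on a toy problem where the inner maximum is tractable (the function f(x, y) = xy − y², step sizes, and iteration counts below are illustrative assumptions, not the speaker's): Stage 1 approximates y*(x) by gradient ascent, and the envelope theorem supplies ∇g(x) = ∂f/∂x evaluated at y*(x).

```python
# Two-stage sketch for min_x g(x), g(x) = max_y f(x, y), on the toy
# f(x, y) = x*y - y**2 (strongly concave in y, so y*(x) = x/2 and g(x) = x**2/4).
# By the envelope theorem, grad g(x) = df/dx evaluated at y*(x), i.e. y*(x) itself.

def inner_max(x, steps=200, eta=0.1):
    """Stage 1: gradient ascent on y to approximate argmax_y f(x, y)."""
    y = 0.0
    for _ in range(steps):
        y += eta * (x - 2.0 * y)   # df/dy = x - 2y
    return y

x, eta_x = 4.0, 0.5
for _ in range(100):               # Stage 2: gradient descent on g
    y_star = inner_max(x)
    grad_g = y_star                # envelope theorem: df/dx = y at y = y*(x)
    x -= eta_x * grad_g

print(round(x, 4))  # g(x) = x**2/4 is minimized at x = 0
```

The limitation on the slide is visible in the structure: the envelope gradient is only valid because `inner_max` reaches a (near-)global maximizer.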

SLIDE 17

Existing Literature

Bilinear Saddle Point Problem: min_{x∈X} { p(x) + max_{y∈Y} ⟨Ax, y⟩ − q(y) }.

X ⊂ R^d and Y ⊂ R^p: closed convex domains; A ∈ R^{p×d}; p(·) and q(·): convex functions satisfying certain assumptions.

Nice Structure: Convex in x and concave in y; bilinear interaction (can be slightly relaxed).

Algorithms with Theoretical Guarantees: Primal-Dual Algorithm, Mirror-Prox Algorithm, ... (Nemirovski 2005, Chen et al. 2014, Dang et al. 2015).

SLIDE 20

Challenges: Nonconcavity of Inner Maximization

Recall Stage 2: min_{x∈X} { g(x) := max_{y∈Y} f(x, y) }.

Why Fail to Converge? An inexact maximizer ȳ ≈ argmax_y f(x, y) may even lead to ⟨∂g(x)/∂x, ∂f(x, ȳ)/∂x⟩ ≪ 0, i.e., a noisy gradient pointing away from the descent direction of g.

[Figure: gradient dynamics converge under pure minimization, but can fall into limit cycles under minimax]
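The failure mode can be reproduced on the classic bilinear toy problem (my choice of example, not the talk's): for f(x, y) = xy, simultaneous gradient descent-ascent spirals away from the saddle point instead of converging to it.

```python
# Why naive gradient descent-ascent can fail: on the bilinear toy
# f(x, y) = x*y, simultaneous updates spiral outward instead of
# converging to the saddle point (0, 0).
import math

x, y, eta = 1.0, 0.0, 0.1
radii = []
for _ in range(100):
    gx, gy = y, x                    # df/dx = y, df/dy = x
    x, y = x - eta * gx, y + eta * gy
    radii.append(math.hypot(x, y))   # distance to the saddle point

# each step multiplies the distance to the saddle by sqrt(1 + eta**2) > 1
print(radii[0] < radii[-1])  # True: the iterates diverge
```

The update matrix has complex eigenvalues of modulus √(1 + η²) > 1 for any step size η > 0, so no step-size tuning fixes this; the dynamics themselves must change.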

SLIDE 22

Our Proposed Solutions

State of the Art:

Convex-concave: well studied.

Nonconvex-concave: studied only in limited settings, e.g., Reinforcement Learning (Dai et al. 2018), Constrained Optimization (Chen et al. 2019), ...

Beyond: no algorithm works well.

Our Solutions: Improving the Landscape and Learning to Optimize.

SLIDE 23

Generative Adversarial Networks

SLIDE 24

Generative Adversarial Networks

Highly Nonconvex-Nonconcave Minimax Problem:

min_θ max_W (1/n) ∑_{i=1}^n φ(A(D_W(x_i))) + E_{x∼D_{G_θ}}[φ(1 − A(D_W(x)))].

D_W: Discriminator; G_θ: Generator; φ, A: properly chosen functions (e.g., log(·) and Softmax).

SLIDE 25

Generative Adversarial Networks

Instability Issue: Mode Collapse


SLIDE 26

Stabilizing GAN Training

Better Algorithm: Two Time-Scale Updates, Functional Gradient, Progressive Learning, ...

Better Landscape: Gradient Penalty, Weight Clipping, Orthogonal Regularization, Spectral Normalization, ...

An algorithm works only if the landscape is good enough.

SLIDE 28

Better Optimization Landscape

Lipschitz Continuous Discriminator: An L-layer discriminator can be formulated as

D_W(x) = W_L σ_{L−1}(W_{L−1} · · · σ_1(W_1 x) · · · ),

where the W_i's are weight matrices and the σ_i's are activations.

1-Lipschitz condition: |D_W(x) − D_W(x′)| ≤ ‖x − x′‖. Inspired by Wasserstein GAN (Arjovsky et al. 2017).

Empirically works well, but why?

SLIDE 30

Control Weight Matrix Scaling

Scaling Issue: Consider a simple 2-layer discriminator with ReLU activation (σ(·) = max(·, 0)): D_W(x) = W_2 σ(W_1 x). Since the ReLU activation is homogeneous, we can rescale the weight matrices by a factor λ > 0 as W_1 ⇒ λ·W_1, W_2 ⇒ W_2/λ. Although the neural network model remains the same, the optimization landscape becomes worse.

Orthogonal Regularization:

min_{W_1,W_2} L(W_1, W_2) + λ(‖W_1^⊤ W_1 − I‖_F² + ‖W_2^⊤ W_2 − I‖_F²).
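The scaling issue is easy to verify numerically. A scalar sketch (the 1-d "network" and unit-norm penalty are my simplified stand-ins for the matrix version on the slide): rescaling the two layers by λ and 1/λ leaves the function unchanged, while the orthogonal-style penalty explodes for the badly scaled parameterization.

```python
# Scalar two-layer ReLU "discriminator" D(x) = w2 * relu(w1 * x) (illustrative).
# Rescaling (w1, w2) -> (lam*w1, w2/lam) leaves D unchanged (ReLU is homogeneous),
# but the orthogonal-style penalty on each layer detects the bad scaling.

def relu(t):
    return max(t, 0.0)

def D(w1, w2, x):
    return w2 * relu(w1 * x)

def penalty(w1, w2):
    # scalar analogue of ||W^T W - I||_F**2 summed over layers
    return (w1 * w1 - 1.0) ** 2 + (w2 * w2 - 1.0) ** 2

w1, w2, lam, x = 1.0, 1.0, 10.0, 3.0
same_output = abs(D(w1, w2, x) - D(lam * w1, w2 / lam, x)) < 1e-9
print(same_output)                        # True: the model did not change
print(penalty(w1, w2))                    # 0.0 for the balanced weights
print(penalty(lam * w1, w2 / lam) > 1e3)  # True: bad scaling is penalized
```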

SLIDE 32

Illustrations of Landscape

min_{x,y} F(x, y) = (1 − xy)²,

min_{x,y} F_λ(x, y) = (1 − xy)² + λ(x² − y²)².

[Figure: surface plots of F(x, y) and F_λ(x, y)]
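Evaluating the toy landscape at a few points shows what the regularizer does (λ = 1 is an arbitrary illustrative choice): F has a whole curve xy = 1 of global minima, including badly scaled ones, while F_λ keeps only the balanced solutions x = ±y.

```python
# The slide's toy landscapes (lam = 1 is an arbitrary illustrative choice).

def F(x, y):
    return (1.0 - x * y) ** 2

def F_reg(x, y, lam=1.0):
    return F(x, y) + lam * (x * x - y * y) ** 2

# Every point with x*y = 1 minimizes F, including badly scaled ones ...
print(F(10.0, 0.1), F(1.0, 1.0))
# ... but the regularizer keeps only the balanced solutions x = +/- y.
print(F_reg(10.0, 0.1), F_reg(1.0, 1.0))
```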

SLIDE 33

Also Improves Generalization

Theorem (Informal, Jiang et al. 2019). Under some technical assumptions, assume ‖W_i‖_2 ≤ B_{W_i} for i ∈ [L] and ‖x_k‖_2 ≤ B_x for k ∈ [n]. Suppose the generator and discriminator are well trained, i.e., d_{F,φ}(μ̂_n, ν_n) − inf_{ν∈D_G} d_{F,φ}(μ̂_n, ν) ≤ ε, where d_{F,φ}(·, ·) is the neural distance. Then with probability at least 1 − δ, we have

d_{F,φ}(μ, ν_n) − inf_{ν∈D_G} d_{F,φ}(μ, ν) ≤ O( B_x ∏_{i=1}^L B_{W_i} · √(d²L) / √n ).

SLIDE 34

From Lipschitz Continuity to Generalization

Importance of Spectrum Control:

d_{F,φ}(μ, ν_n) − inf_{ν∈D_G} d_{F,φ}(μ, ν) ≤ O( B_x ∏_{i=1}^L B_{W_i} · √(d²L) / √n ).

1-Lipschitz ⟹ polynomial bound O( √(d²L / n) ).

Controlling the product of spectral norms avoids a bad landscape and benefits the generalization of GANs.

SLIDE 35

Better than Orthogonal Regularization

Spectral Normalization (SN, Miyato et al. 2018):

[Figure: Inception Score on STL-10 vs. training iterations for SN (Miyato et al. 2018), SN (Alternative), and Orthogonal Regularization]

SN (Miyato et al. 2018) > Orthogonal Regularization > SN (Alternative)
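Spectral normalization can be sketched with plain power iteration (a simplified dense-matrix version of the idea in Miyato et al. 2018; the matrix and iteration count below are illustrative): estimate the top singular value σ(W), then use W/σ(W) so the layer is approximately 1-Lipschitz.

```python
# Spectral normalization sketch: estimate the largest singular value of a
# weight matrix by power iteration, then divide the matrix by it so its
# spectral norm is ~1 (a simplified version of Miyato et al. 2018).
import math

def matvec(W, v):
    return [sum(W[i][j] * v[j] for j in range(len(v))) for i in range(len(W))]

def transpose(W):
    return [list(col) for col in zip(*W)]

def spectral_norm(W, iters=100):
    v = [1.0] * len(W[0])
    for _ in range(iters):
        u = matvec(W, v)                 # u ~ W v
        v = matvec(transpose(W), u)      # v ~ W^T W v (power iteration)
        nrm = math.sqrt(sum(t * t for t in v))
        v = [t / nrm for t in v]
    u = matvec(W, v)
    return math.sqrt(sum(t * t for t in u))

W = [[2.0, 0.0], [0.0, 0.5]]
sigma = spectral_norm(W)
W_sn = [[w / sigma for w in row] for row in W]
print(round(sigma, 4), round(spectral_norm(W_sn), 4))  # 2.0 1.0
```

In practice (e.g. the scheme described by Miyato et al.) a single power-iteration step per training update suffices, since the weights change slowly.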

SLIDE 36

Better than Spectral Normalization

Singular Value Decay: Decay patterns of sorted singular values of weight matrices.

[Figure: sorted singular value curves of layers 0-6 under four schemes]

Orthogonal Reg. (no decay, IS: 8.77); Miyato et al. 2018 (slow decay, IS: 8.83); SN (Alt.) (fast decay, IS: 8.69); Jiang et al. (2019) (slower decay, IS: 9.25).

Observation: Slow singular value decay is better than both no decay and fast decay.

SLIDE 37

Experiments (CIFAR10 and STL-10)

[Figure: FID and Inception Score training curves on CIFAR-10 and STL-10]

SLIDE 38

Experiments (ImageNet)

[Figure: generated ImageNet samples: Valley, Jellyfish, Pizza, Anemone, Shoji, Brain Coral, Cardoon, Altar, Jack-o'-lantern]

SLIDE 39

Adversarial Robust Learning

SLIDE 40

Adversarial Training

[Figure: clean sample + imperceptible perturbation = adversarial example]

Highly Nonconvex-Nonconcave Minimax Problem:

min_θ (1/n) ∑_{i=1}^n max_{δ_i∈B} ℓ(f(x_i + δ_i; θ), y_i).

x_i: feature; y_i: label; δ_i: perturbation; f(·; θ): neural network; ℓ: loss function; B: constraint set.

SLIDE 42

Adversarial Training

min_θ (1/n) ∑_{i=1}^n max_{δ_i∈B} ℓ(f(x_i + δ_i; θ), y_i).

Two-Stage Optimization: Inner Maximization Problem (Attack); Outer Minimization Problem (Defense).

Popular Approaches for Attack: Fast Gradient Sign Method (Goodfellow et al. 2014); Projected Gradient Method (Kurakin et al. 2016); Carlini-Wagner Attack (Carlini and Wagner 2017).
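The Fast Gradient Sign Method is simple enough to sketch exactly for a linear classifier (the weights, input, and budget below are made-up illustrative numbers): the one-step ℓ∞ attack perturbs each coordinate by ε times the sign of the input gradient.

```python
# FGSM sketch: for a linear classifier with logistic loss, the worst-case
# l_inf perturbation of budget eps is delta = eps * sign(grad_x loss),
# which has a closed form here because only the sign of the gradient matters.
import math

def loss(w, x, y):
    # logistic loss for label y in {-1, +1}
    s = sum(wi * xi for wi, xi in zip(w, x))
    return math.log(1.0 + math.exp(-y * s))

def fgsm(w, x, y, eps):
    # grad_x loss = -y * sigmoid(-y * w.x) * w, so sign(grad) = sign(-y * w)
    sign = lambda t: 1.0 if t > 0 else (-1.0 if t < 0 else 0.0)
    return [eps * sign(-y * wi) for wi in w]

w, x, y, eps = [1.0, -2.0], [0.5, 0.2], 1.0, 0.1
delta = fgsm(w, x, y, eps)
x_adv = [xi + di for xi, di in zip(x, delta)]
print(loss(w, x, y) < loss(w, x_adv, y))  # True: the attack raises the loss
```

For deep networks the gradient is computed by backpropagation instead, and the projected gradient method simply iterates this step with a projection back onto B.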

SLIDE 45

Learn to Learn/Optimize (L2L)

High-Level Idea: Cast the optimizer as a learning model; allow the model to learn to exploit structure automatically.

Implementation: Parameterize the optimizer as a neural network, and learn its parameters (Andrychowicz et al. 2016).

[Figure: hand-designed optimization algorithms (e.g., gradient descent) vs. a learned optimizer that maps an initial solution x_0 and gradients ∇f(x_t) to an output solution x_T]

SLIDE 46

Learn to Learn/Optimize (L2L)

Advantages:

The attacker network is powerful in representation ⟹ yields strong and flexible perturbations.

A shared attacker model ⟹ learns common structures across all perturbations.

Learning through overparameterization ⟹ eases the training process.

A reduced search space ⟹ computational efficiency.

SLIDE 47

Learn to Learn/Optimize (L2L)

New Formulation:

min_θ max_φ (1/n) ∑_{i=1}^n ℓ(f(x_i + g(A(x_i, y_i, θ); φ); θ), y_i),

Notations: f(·; θ): classifier; g(·; φ): attacker/optimizer; A(x_i, y_i, θ): input of the optimizer g (interacts g with f via A).

SLIDE 49

Learn to Attack:

Grad L2L: Motivated by gradient ascent with A(x_i, y_i, θ) = [x_i, ∇_x ℓ(f(x_i; θ), y_i)].

[Figure: the attacker takes the original input concatenated with its loss gradient (plus noise) and outputs perturbed inputs for the classifier; the clean and adversarial losses are combined for backpropagation]

Multi-Step Grad L2L: Recursively apply Grad L2L (RNN).
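A heavily simplified scalar sketch of the Grad L2L idea (the tanh attacker, finite-difference training, and all constants are illustrative assumptions, not the paper's architecture): the attacker g maps A = [x, ∇_x ℓ] to a bounded perturbation, and its parameters φ are trained by ascent on the classifier's loss.

```python
# Scalar "Grad L2L" sketch: a tiny parameterized attacker replaces
# hand-designed attack steps (all constants here are illustrative).
import math

def loss(w, x, y):
    # logistic loss for a scalar linear classifier, label y in {-1, +1}
    return math.log(1.0 + math.exp(-y * w * x))

def grad_x(w, x, y):
    # d(loss)/dx in closed form
    return -y * w / (1.0 + math.exp(y * w * x))

def attacker(phi, x, g, eps=0.3):
    # g(A; phi): maps A = [x, grad] to a perturbation bounded by eps
    a, b = phi
    return eps * math.tanh(a * x + b * g)

def adv_loss(phi, w, x, y):
    return loss(w, x + attacker(phi, x, grad_x(w, x, y)), y)

w, x, y = 2.0, 1.0, 1.0
phi = [0.0, 0.0]
h, lr = 1e-5, 0.5
for _ in range(200):  # train the attacker by (numerical) gradient ascent
    base = adv_loss(phi, w, x, y)
    g0 = (adv_loss([phi[0] + h, phi[1]], w, x, y) - base) / h
    g1 = (adv_loss([phi[0], phi[1] + h], w, x, y) - base) / h
    phi = [phi[0] + lr * g0, phi[1] + lr * g1]

print(adv_loss(phi, w, x, y) > loss(w, x, y))  # True: learned attack raises the loss
```

In the actual method the attacker is a neural network trained jointly with the classifier's defense objective; here the outer minimization over θ is omitted to keep the sketch small.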

SLIDE 52

Experiments

[Figure: accuracy on clean samples and against PGM adversaries; per-iteration computational cost]

SLIDE 53

Reinforcement Learning

SLIDE 54

Smoothed Bellman Error Minimization

Minimax Formulation: Given M = (S, A, P, R, γ), we solve

min_{π,V} max_ν L(π, V; ν) = 2E_{s,a,s′}[ν(s, a)(R(s, a) + γV(s′) − λ log(π(a|s)) − V(s))] − E_{s,a,s′}[ν²(s, a)],

where s denotes the state, a denotes the action, and

Policy: π : S → P(A); Value: V : S → R; Reward: R : S × A → R; Auxiliary Dual: ν : S × A → R.

The policy π and the dual ν are parameterized as a neural network and a reproducing kernel function, respectively (Dai et al. 2018).

SLIDE 57

Parameterization of V, π and ν

State Approximation: There exists a feature vector ψ(s) associated with every state s ∈ S.

Neural Networks for π and V: π(a_j|s) = f_j(ψ(s); Θ) and V(s) = h(ψ(s); Δ), where f_j is a neural network with parameter Θ and ∑_{a_j∈A} π(a_j|s) = 1.

Reproducing Kernel Functions for ν: ν(a_j|s) = g_j(ψ(s); Ω), where g_j is a reproducing kernel function with parameter Ω.

SLIDE 58

Benefit of Reproducing Kernel Parameterization

Alternative Minimax Formulation:

min_{Δ,Θ} max_{Ω∈C} L(Δ, Θ, Ω) − R(Ω),

where R(Ω) is a strongly concave regularizer.

Stochastic Alternating Gradient Algorithm:

Ω^(t+1) = Π_C(Ω^(t) + η_Ω ∇_Ω L̂(Δ^(t), Θ^(t), Ω^(t))),
Δ^(t+1) = Δ^(t) − η_Δ ∇_Δ L̂′(Δ^(t), Θ^(t), Ω^(t+1)),
Θ^(t+1) = Θ^(t) − η_Θ ∇_Θ L̂′(Δ^(t), Θ^(t), Ω^(t+1)),

where η_Θ, η_Δ and η_Ω are properly chosen step sizes, and L̂ and L̂′ are unbiased independent stochastic approximations of L.
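The alternating scheme can be sketched on a scalar toy problem that is strongly concave in the dual (the objective, domain C = [−1, 1], and step sizes are illustrative; exact gradients stand in for the stochastic approximations L̂, L̂′):

```python
# Alternating gradient sketch for min_Delta max_{Omega in C} L, with the toy
# L = 2*Delta*Omega - Omega**2 (strongly concave in Omega) and C = [-1, 1]:
# one projected ascent step on the dual, then one descent step on the primal.

def project(omega, lo=-1.0, hi=1.0):
    return min(max(omega, lo), hi)

Delta, Omega = 0.8, 0.0
eta_Omega, eta_Delta = 0.2, 0.05
for _ in range(500):
    # dL/dOmega = 2*Delta - 2*Omega; dL/dDelta = 2*Omega
    Omega = project(Omega + eta_Omega * (2.0 * Delta - 2.0 * Omega))
    Delta = Delta - eta_Delta * (2.0 * Omega)

print(round(Delta, 6), round(Omega, 6))  # both approach the saddle point (0, 0)
```

Strong concavity in the dual is what lets the single ascent step track the inner maximizer closely enough for the primal gradient to be informative; this is the structure the theorem on the next slide exploits.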

SLIDE 60

Sublinear Convergence

Theorem (Informal, Chen et al. 2019). Given a pre-specified error ε > 0, assume that L(Δ, Θ, Ω) is sufficiently smooth in Δ, Θ, Ω ∈ C, and strongly concave in Ω. Given properly chosen step sizes and a batch size of O(1/ε), we need at most T = O(1/ε) iterations such that

min_{1≤t≤T} E‖∇_Δ L(Δ^(t), Θ^(t), Ω^(t+1))‖²_2 + E‖∇_Θ L(Δ^(t), Θ^(t), Ω^(t+1))‖²_2 + E‖Ω^(t) − Π_C(Ω^(t) + ∇_Ω L(Δ^(t), Θ^(t), Ω^(t)))‖²_2 ≤ ε.

SLIDE 61

Experiments

Reproducing Kernel vs. Neural Networks for ν.

[Figure: performance (scaled) vs. number of iterations for GAIL and GMMIL on Reacher, HalfCheetah, Hopper, Walker, Ant and Humanoid]

The reproducing kernel parameterization leads to an easier optimization problem. However, it might not be advantageous on more complicated problems.

SLIDE 63

Take Home Messages

SLIDE 64

Summary

Minimax optimization is very difficult in general; heuristics leverage specific structures in machine learning problems.

Normalization techniques improve the optimization landscape and stabilize the training of GANs.

Learning-to-optimize techniques have the potential to outperform hand-designed algorithms.

The "large-batch" stochastic alternating gradient descent attains sublinear convergence to some stationary solution for nonconvex-concave stochastic minimax optimization problems.

Lots of new problems, and open to everyone!

SLIDE 65

References

[1] Jiang et al., "On Computation and Generalization of Generative Adversarial Networks under Spectrum Control". International Conference on Learning Representations (ICLR), 2019.

[2] Jiang et al., "Learning to Defense by Learning to Attack". ICLR Workshop on Deep Generative Models for Highly Structured Data, 2019.

[3] Chen et al., "On Computation and Generalization of Generative Adversarial Imitation Learning". Submitted.

[4] Chen et al., "On Landscape of Lagrangian Functions and Stochastic Search for Constrained Nonconvex Optimization". International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.

[5] Liu et al., "Deep Hyperspherical Learning". Annual Conference on Neural Information Processing Systems (NIPS), 2017.

[6] Li et al., "Symmetry, Saddle Points and Global Optimization Landscape of Nonconvex Matrix Factorization". IEEE Transactions on Information Theory (TIT), 2019.

SLIDE 66

Questions?