EigenDamage: Structured Pruning in the Kronecker-Factored Eigenbasis


SLIDE 1

EigenDamage: Structured Pruning in the Kronecker-Factored Eigenbasis

Chaoqi Wang, Roger Grosse, Sanja Fidler and Guodong Zhang

University of Toronto, Vector Institute

Jun 12, 2019


SLIDE 2

Structured Pruning

Structured pruning removes entire filters or neurons rather than individual weights, so the pruned network stays dense and GPU-friendly.

Figure 1: An illustration of structured pruning.
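To make this concrete, here is a minimal PyTorch sketch of structured pruning (PyTorch is assumed since the paper's released code is in PyTorch; the layer sizes and the kept-filter indices are illustrative): whole filters are removed from one convolution together with the matching input channels of the next, leaving smaller dense layers that need no sparse masks.

import torch
import torch.nn as nn

conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)

keep = [0, 2, 5, 7, 11]  # indices of conv1 filters to keep (illustrative)

pruned1 = nn.Conv2d(3, len(keep), kernel_size=3, padding=1)
pruned2 = nn.Conv2d(len(keep), 32, kernel_size=3, padding=1)
with torch.no_grad():
    pruned1.weight.copy_(conv1.weight[keep])     # drop whole output filters
    pruned1.bias.copy_(conv1.bias[keep])
    pruned2.weight.copy_(conv2.weight[:, keep])  # drop the matching input channels
    pruned2.bias.copy_(conv2.bias)

x = torch.randn(1, 3, 32, 32)
print(pruned2(torch.relu(pruned1(x))).shape)     # torch.Size([1, 32, 32, 32])

Because entire channels disappear, the pruned model is an ordinary smaller dense network, which is why structured pruning maps well onto GPUs.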


SLIDE 3

Background: Hessian-Based Pruning Methods

Hessian-based methods have three appealing properties:

1. The pruning criterion is calibrated across layers.
2. The network structure is determined automatically.
3. Few hyperparameters are required (only the pruning ratio).

They rely on a Taylor expansion of the loss around the minimum θ∗, which directly approximates the effect on the loss of removing a given weight, i.e. of a perturbation ∆θ:

∆L = (∂L/∂θ)⊤∆θ + ½ ∆θ⊤H∆θ + O(‖∆θ‖³)    (1)

where the first-order term is ≈ 0 at the minimum and the higher-order terms are dropped.
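As a sanity check of Eq. (1), a minimal numpy sketch (the quadratic toy loss and all names here are illustrative, not the paper's code): at a minimum the gradient term vanishes, so zeroing out one weight costs approximately ½ ∆θ⊤H∆θ.

import numpy as np

rng = np.random.default_rng(0)
d = 5
M = rng.standard_normal((d, d))
H = M @ M.T + np.eye(d)             # positive-definite Hessian at the minimum
theta_star = rng.standard_normal(d)

def loss(theta):
    # toy quadratic loss with minimum theta_star and Hessian H
    delta = theta - theta_star
    return 0.5 * delta @ H @ delta

# "prune" weight q: set it to zero, i.e. dtheta = -theta_star[q] * e_q
q = 2
dtheta = np.zeros(d)
dtheta[q] = -theta_star[q]
exact = loss(theta_star + dtheta) - loss(theta_star)
approx = 0.5 * dtheta @ H @ dtheta  # Eq. (1) with the gradient term dropped
print(exact, approx)                # identical here since the toy loss is quadratic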


SLIDE 4

Background: Hessian-Based Pruning Methods

Two representative methods are Optimal Brain Damage (OBD) and Optimal Brain Surgeon (OBS).

OBD uses a diagonal Hessian for fast computation, and scores each weight θ∗_q by:

∆L_OBD = ½ (θ∗_q)² H_qq    (2)

OBS uses the full Hessian to account for correlations between weights, and scores each weight θ∗_q by:

∆L_OBS = (θ∗_q)² / (2 [H⁻¹]_qq)    (3)
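A minimal numpy sketch of Eqs. (2) and (3), assuming a small exactly known toy Hessian (real networks need the approximations introduced on the following slides):

import numpy as np

rng = np.random.default_rng(0)
d = 6
M = rng.standard_normal((d, d))
H = M @ M.T + np.eye(d)          # positive-definite toy Hessian
theta = rng.standard_normal(d)   # trained weights theta*

obd = 0.5 * theta**2 * np.diag(H)                 # Eq. (2): diagonal of H
obs = 0.5 * theta**2 / np.diag(np.linalg.inv(H))  # Eq. (3): diagonal of H^{-1}

# Both rank weights by estimated loss increase, but the rankings can disagree.
print(np.argsort(obd))
print(np.argsort(obs))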


SLIDE 5

Is OBS always better than OBD?

In the original paper, OBS is guaranteed to be better than OBD when pruning weights one at a time (i.e., recomputing the Hessian after each pruning step). In practice, however, we prune multiple weights at a time.


SLIDE 6

Is OBS always better than OBD?

We would like to ask: is OBS always better than OBD when pruning multiple weights at a time? At first glance... yes? After all, OBS uses the full Hessian, while OBD uses only its diagonal.


SLIDES 7–8

Bayesian Interpretations

Surprisingly, no, even if we can compute the exact Hessian!

Bayesian interpretations of OBD and OBS:

Figure: a highly coupled weight posterior (c) and its factorial Gaussian approximations under forward KL (a) and reverse KL (b).

Both OBD and OBS use a factorial Gaussian to approximate the highly coupled weight posterior (c), but under different objectives:

OBD: reverse KL divergence (b), which is too pessimistic.
OBS: forward KL divergence (a), which is too optimistic.

Neither is necessarily better than the other. More details in the paper and at Poster #22!
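To make the two objectives concrete, a minimal numpy sketch (toy posterior, illustrative names): fitting a factorial Gaussian to the correlated Gaussian posterior N(0, H⁻¹) under reverse KL gives per-weight variance 1/H_qq (the quantity behind OBD), while forward KL moment-matches and gives [H⁻¹]_qq (the quantity behind OBS).

import numpy as np

rng = np.random.default_rng(0)
d = 4
M = rng.standard_normal((d, d))
H = M @ M.T + np.eye(d)            # posterior precision (the Hessian)
Sigma = np.linalg.inv(H)           # posterior covariance

var_reverse_kl = 1.0 / np.diag(H)  # reverse KL: underestimates the marginals (OBD, pessimistic)
var_forward_kl = np.diag(Sigma)    # forward KL: matches the marginals (OBS, optimistic)

print(var_reverse_kl)
print(var_forward_kl)              # elementwise >= the reverse-KL variances

Since [H⁻¹]_qq ≥ 1/H_qq for any positive-definite H, OBD always predicts at least as much damage as OBS for the same weight, yet neither prediction is exact once several correlated weights are removed together.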


SLIDES 9–11

Method

OBD and OBS rely on the diagonal of the Hessian and the diagonal of its inverse, respectively; both fail to capture correlations when pruning multiple weights at a time.

Solution: prune in a new coordinate system (i.e., a new basis) in which the posterior is closer to factorial. Ideally, the new basis would be the eigenbasis of the Hessian; a sketch follows this list. But there are issues:

1. The exact Hessian is intractable for large neural networks.
2. The new basis introduces extra parameters.
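A minimal numpy sketch of this idea (illustrative; it uses the exact eigenbasis, which, as just noted, is intractable at scale): in the eigenbasis the Hessian is exactly diagonal, so a per-coordinate OBD-style score accounts for all correlations.

import numpy as np

rng = np.random.default_rng(0)
d = 6
M = rng.standard_normal((d, d))
H = M @ M.T + np.eye(d)
theta = rng.standard_normal(d)

lam, Q = np.linalg.eigh(H)  # H = Q diag(lam) Q^T
theta_rot = Q.T @ theta     # weights expressed in the eigenbasis

# In the eigenbasis the Hessian is diag(lam), so the OBD score is exact there:
scores = 0.5 * theta_rot**2 * lam

# Prune (zero out) the k least important rotated coordinates, then rotate back.
k = 2
theta_rot[np.argsort(scores)[:k]] = 0.0
theta_pruned = Q @ theta_rot
print(theta_pruned)

Note the two issues above in miniature: computing Q required the full Hessian, and the pruned model must now store Q in addition to the surviving coordinates.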


SLIDES 12–14

Approximating Hessian with K-FAC Fisher

1. The Fisher information matrix (FIM) is commonly adopted as an approximation to the Hessian:

F = E[∇θ log p(y|x; θ) ∇θ log p(y|x; θ)⊤]    (4)

2. K-FAC decomposes the FIM of a neural network into the Kronecker product of two small matrices under an independence assumption:

F = E[DsDs⊤ ⊗ aa⊤] ≈ E[DsDs⊤] ⊗ E[aa⊤] = S ⊗ A    (5)

3. Eigendecomposing the two small factors yields the Kronecker-factored eigenbasis (KFE), Q_S ⊗ Q_A:

F ≈ (Q_S Λ_S Q_S⊤) ⊗ (Q_A Λ_A Q_A⊤) = (Q_S ⊗ Q_A)(Λ_S ⊗ Λ_A)(Q_S ⊗ Q_A)⊤    (6)
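A minimal PyTorch sketch of Eqs. (4)–(6) for a single fully connected layer, assuming per-sample layer inputs a and pre-activation gradients Ds are given (both names, and the sizes, are illustrative):

import torch

torch.manual_seed(0)
n, d_in, d_out = 1024, 20, 10
a = torch.randn(n, d_in)   # layer inputs
Ds = torch.randn(n, d_out) # gradients w.r.t. the pre-activations

# Eq. (5): Kronecker factors of this layer's Fisher block
A = a.T @ a / n            # A = E[a a^T]      (d_in  x d_in)
S = Ds.T @ Ds / n          # S = E[Ds Ds^T]    (d_out x d_out)

# Eq. (6): eigendecompose the two small factors to obtain the KFE
Lam_A, Q_A = torch.linalg.eigh(A)
Lam_S, Q_S = torch.linalg.eigh(S)

# In the KFE the Fisher is (approximately) diagonal, with per-weight scales
# Lam_S[i] * Lam_A[j]; the full (d_in*d_out)^2 Fisher is never formed.
lam = torch.outer(Lam_S, Lam_A)
print(lam.shape)           # (d_out, d_in), the same shape as the weight matrix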




SLIDES 15–17

EigenDamage: Structured Pruning in the KFE

Figure: (a) pruning in weight space vs. (b) pruning in the Kronecker-factored eigenspace, where the rotations turn the layer into a bottleneck. Original layer: 3×3 conv, 512→512 (100% params, 100% FLOPs). Bottleneck before pruning: 1×1 conv 512→512 (11%), 3×3 conv 512→512 (100%), 1×1 conv 512→512 (11%). Bottleneck after pruning: 1×1 conv 512→32 (0.7%), 3×3 conv 32→32 (0.4% params, 0.3% FLOPs), 1×1 conv 32→512 (0.7%); input 32×32×512 throughout.

Rotate the weights into the KFE by:

vec(W) = (Q_S ⊗ Q_A)⊤ vec(W′) = vec(Q_A⊤ W′ Q_S)    (7)

where W′ denotes the weights in the original coordinates and W their representation in the KFE.
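A minimal PyTorch sketch of Eq. (7) and the bottleneck it induces. The orthogonal bases and eigenvalues below are random stand-ins for the K-FAC eigenpairs of the previous slides, and the row/column selection is an illustrative simplification of the paper's pruning procedure (in the code, W plays the role of W′ above, and W_kfe the role of the rotated W):

import torch

torch.manual_seed(0)
d_in, d_out, k = 512, 512, 32
W = torch.randn(d_out, d_in)                      # original layer weights
Q_A = torch.linalg.qr(torch.randn(d_in, d_in)).Q  # stand-in orthogonal bases
Q_S = torch.linalg.qr(torch.randn(d_out, d_out)).Q
Lam_A, Lam_S = torch.rand(d_in), torch.rand(d_out)

# Eq. (7): the weights expressed in the KFE
W_kfe = Q_S.T @ W @ Q_A

# OBD-style score of each KFE weight: 1/2 * w^2 * (its diagonal Fisher scale)
scores = 0.5 * W_kfe**2 * torch.outer(Lam_S, Lam_A)

# Structured step: keep only the k most important input/output eigen-directions.
keep_in = scores.sum(dim=0).topk(k).indices
keep_out = scores.sum(dim=1).topk(k).indices
B = W_kfe[keep_out][:, keep_in]                   # k x k core

# The pruned layer is the bottleneck x -> Q_A[:, keep_in]^T x -> B -> Q_S[:, keep_out],
# i.e. two thin rotations around a small core (the 1x1 convs in the figure above).
x = torch.randn(8, d_in)
y = (x @ Q_A[:, keep_in]) @ B.T @ Q_S[:, keep_out].T
print(y.shape)                                    # torch.Size([8, 512])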


SLIDE 18

EigenDamage: Structured Pruning in the KFE

Figure 2: the Fisher matrix in the original weight coordinates and in the KFE.

Compared with the Fisher in the original coordinates, the Fisher in the KFE is approximately diagonal, so the weights are closer to being independent of each other.
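A minimal PyTorch sketch of this claim for a toy layer (illustrative sizes; the per-sample gradients are built so that the layer's exact Fisher is only approximately Kronecker-factored):

import torch

torch.manual_seed(0)
n, d_in, d_out = 8192, 6, 4
a = torch.randn(n, d_in) @ (0.5 * torch.randn(d_in, d_in) + torch.eye(d_in))
Ds = torch.randn(n, d_out) @ (0.5 * torch.randn(d_out, d_out) + torch.eye(d_out))

# Exact empirical Fisher of vec(dL/dW), where dL/dW = Ds a^T per sample.
g = (Ds.unsqueeze(2) * a.unsqueeze(1)).reshape(n, -1)
F = g.T @ g / n

# KFE built only from the two small Kronecker factors, as on the previous slides.
A, S = a.T @ a / n, Ds.T @ Ds / n
Q_A = torch.linalg.eigh(A).eigenvectors
Q_S = torch.linalg.eigh(S).eigenvectors
Q = torch.kron(Q_S, Q_A)          # row-major vec matches the kron(S, A) ordering
F_kfe = Q.T @ F @ Q

def offdiag_mass(M):
    # fraction of absolute mass lying off the diagonal
    return ((M - torch.diag(torch.diag(M))).abs().sum() / M.abs().sum()).item()

print(offdiag_mass(F))      # substantial off-diagonal mass in the original basis
print(offdiag_mass(F_kfe))  # much smaller: approximately diagonal in the KFE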


SLIDE 19

Pruning in the KFE is More Accurate

Figure: train loss vs. reduction in weights (%) for VGG19 on CIFAR-10 and CIFAR-100 (one-pass). Methods compared: C-OBD, C-OBS, Kron-OBD, Kron-OBS, EigenDamage.

The network pruned by EigenDamage achieves significantly lower training loss than the other methods, even without fine-tuning!


SLIDE 20

One-pass Pruning Results

Figure: test accuracy (%) vs. reduction in weights (%) and vs. reduction in FLOPs (%) for VGG19 on Tiny-ImageNet (one-pass). Methods compared: NN Slimming, C-OBD, C-OBS, Kron-OBD, Kron-OBS, EigenDamage, and the unpruned baseline.

EigenDamage outperforms the other methods by a significant margin, especially at high pruning ratios (e.g., ≥ 90%).


SLIDE 21

Iterative Pruning Results

Figure: test accuracy (%) vs. reduction in weights (%) and vs. reduction in FLOPs (%) for ResNet32 on CIFAR-100 (iterative). Methods compared: C-OBD, C-OBS, Kron-OBD, Kron-OBS, EigenDamage, and the unpruned baseline.

Iterative pruning yields more accurate pruning decisions, and EigenDamage again outperforms the other methods, by an even larger margin.


SLIDE 22

Poster Session

Poster Session: Today 06:30 – 09:00 PM Pacific Ballroom #22

Code available at: https://github.com/alecwangcq/EigenDamage-Pytorch
