SLIDE 1


Towards Understanding Learning Representations: To What Extent Do Different Neural Networks Learn the Same Representation

Liwei Wang, Lunjia Hu, Jiayuan Gu, Yue Wu, Zhiqiang Hu, Kun He, John Hopcroft

NeurIPS 2018 Spotlight

SLIDE 4


Motivation

▶ It’s widely believed that deep nets learn particular features/representations in their intermediate layers, and people design architectures in order to learn these representations better (e.g. CNNs).

▶ However, there is a lack of theory on what these representations really are.

▶ One fundamental question: are the representations learned by deep nets robust? In other words, are the learned representations commonly shared across multiple deep nets trained on the same task?

SLIDE 10


Motivation

▶ In particular, suppose we have two deep nets with the same architecture, trained on the same training data but from different initializations.

▶ Given a set of test examples, do the two deep nets share similarity in their output of layer i?

▶ When layer i is the input layer, the similarity is high because both deep nets take the same test examples as input.

▶ When layer i is the final output layer that predicts the classification labels, the similarity is also high, assuming both deep nets have tiny test error.

▶ How similar are intermediate layers?

▶ Do some groups of neurons in an intermediate layer learn features/representations that both deep nets share in common? How large are these groups?

SLIDE 23


Two Groups of Neurons Learning the Same Representation: Exact Matches

[Diagram: two nets, A and B. Neurons X, Y sit in the output of layer i (after ReLU) of net A, and neurons Z, W in the output of layer i of net B; both layers feed layer i + 1.]

For test examples a1, · · · , ad, there exist matrices A and B such that for all i,

(X(ai), Y(ai))ᵀ = A (Z(ai), W(ai))ᵀ and (Z(ai), W(ai))ᵀ = B (X(ai), Y(ai))ᵀ.

Equivalently,

span([X(a1), · · · , X(ad)], [Y(a1), · · · , Y(ad)]) = span([Z(a1), · · · , Z(ad)], [W(a1), · · · , W(ad)]),

where [X(a1), · · · , X(ad)] is the activation vector of X, and likewise for Y, Z, and W.

We say ({X, Y}, {Z, W}) form an exact match!
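Numerically, span equality is a rank condition: two sets of activation vectors span the same subspace exactly when stacking both sets together adds no new directions. Below is a minimal numpy sketch of this check; it is our illustration, and the names activation_matrix and is_exact_match are made up, not the authors' code.

import numpy as np

def activation_matrix(neuron_outputs):
    # Stack activation vectors (one length-d vector per neuron) as columns.
    return np.column_stack(neuron_outputs)

def is_exact_match(group_a, group_b, tol=1e-8):
    """True iff the two groups' activation vectors span the same subspace.

    group_a, group_b: lists of length-d activation vectors, e.g. the
    vectors of {X, Y} and of {Z, W} on test examples a1, · · · , ad.
    """
    Ma = activation_matrix(group_a)
    Mb = activation_matrix(group_b)
    ra = np.linalg.matrix_rank(Ma, tol=tol)
    rb = np.linalg.matrix_rank(Mb, tol=tol)
    # Equal spans <=> neither set adds directions beyond the other.
    rj = np.linalg.matrix_rank(np.hstack([Ma, Mb]), tol=tol)
    return ra == rb == rj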

SLIDE 26


Exact/Approximate Matches between Two Groups of Neurons

▶ Suppose a1, a2, · · · , ad are the test examples. The outputs of neuron X on these test examples form a vector (X(a1), X(a2), · · · , X(ad)) called the activation vector [Raghu et al., 2017].

▶ If the activation vectors of two groups of neurons span the same linear subspace, we say the two groups of neurons form an exact match.

▶ If the activation vector of every neuron in each group is ε-close to the linear subspace spanned by the other group, we say the two groups form an ε-approximate match.

▶ A vector u is ε-close to a linear subspace S if the sine of the angle between u and S is at most ε, or equivalently, min_{v ∈ S} ∥u − v∥2 ≤ ε ∥u∥2.
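The ε-closeness test is a single orthogonal projection: project u onto S and compare the residual with ε∥u∥2. Here is a minimal numpy sketch under the definition above (our illustration; is_eps_close and the basis matrix V are our names, not code from the paper):

import numpy as np

def is_eps_close(u, V, eps):
    """Is u eps-close to S = span(columns of V)?

    sin(angle(u, S)) = ||u - proj_S(u)||_2 / ||u||_2, so the condition
    min over v in S of ||u - v||_2 <= eps * ||u||_2 is a residual check.
    """
    Q, _ = np.linalg.qr(V)            # orthonormal basis for S
    residual = u - Q @ (Q.T @ u)      # component of u orthogonal to S
    return np.linalg.norm(residual) <= eps * np.linalg.norm(u)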

SLIDE 30


Maximum Matches and Simple Matches

▶ Matches are closed under union, so there is a unique maximum match.

▶ We define simple matches to be matches that are not the union of smaller matches.

▶ Any match is a union of simple matches.

▶ We designed algorithms for finding the maximum match and the simple matches, and we implemented the algorithms to conduct experiments (a rough sketch of one possible procedure follows below).
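The slides do not spell the algorithms out. As a rough illustration only, here is one natural fixed-point scheme built on the ε-closeness test above: alternately discard neurons whose activation vectors are not ε-close to the span of the other net's surviving neurons, until nothing changes. The pruning strategy and every name here (prune_to_match, _eps_close) are our assumptions, not the authors' published algorithm.

import numpy as np

def _eps_close(u, V, eps):
    # sin-of-angle test: residual of u after projection onto span(V).
    Q, _ = np.linalg.qr(V)
    return np.linalg.norm(u - Q @ (Q.T @ u)) <= eps * np.linalg.norm(u)

def prune_to_match(acts_a, acts_b, eps):
    """Iteratively prune towards a large eps-approximate match.

    acts_a, acts_b: dicts mapping neuron id -> length-d activation vector.
    Returns the ids of the surviving neurons on each side.
    """
    A, B = dict(acts_a), dict(acts_b)
    changed = True
    while changed and A and B:
        changed = False
        for side, other in ((A, B), (B, A)):
            if not other:
                break
            basis = np.column_stack(list(other.values()))
            for nid in list(side):
                if not _eps_close(side[nid], basis, eps):
                    del side[nid]
                    changed = True
    return set(A), set(B)

Because matches are closed under union, neurons surviving such a pruning are plausible members of the maximum match; decomposing it into simple matches would need a finer procedure than this sketch attempts.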

SLIDE 31


Experimental Findings: Few Matches in Intermediate Layers

[Figure: size of the maximum match as a fraction of the total number of neurons, across layers]

Low similarity in intermediate layers!

SLIDE 33


Thank you! Come to the poster for more details! 05:00 – 07:00 PM @ Room 210 & 230 AB #26