SLIDE 1


Towards Understanding Learning Representations: To What Extent Do Different Neural Networks Learn the Same Representation

Liwei Wang, Lunjia Hu, Jiayuan Gu, Yue Wu, Zhiqiang Hu, Kun He, John Hopcroft

NeurIPS 2018 Spotlight

SLIDE 4


Motivation

▶ It’s widely believed that deep nets learn particular features/representations in their intermediate layers, and people design architectures in order to learn these representations better (e.g. CNNs).

▶ However, there is a lack of theory on what these representations really are.

▶ One fundamental question: are the representations learned by deep nets robust? In other words, are the learned representations commonly shared across multiple deep nets trained on the same task?

SLIDE 10


Motivation

▶ In particular, suppose we have two deep nets with the same architecture, trained on the same training data but from different initializations.

▶ Given a set of test examples, do the two deep nets share similarity in their output of layer i?

▶ When layer i is the input layer, the similarity is high because both deep nets take the same test examples as input.

▶ When layer i is the final output layer that predicts the classification labels, the similarity is also high, assuming both deep nets have tiny test error.

▶ How similar are intermediate layers?

▶ Do some groups of neurons in an intermediate layer learn features/representations that both deep nets share in common? How large are these groups?

SLIDE 23


Two Groups of Neurons Learning the Same Representation: Exact Matches

[Diagram: two nets, A and B. Neurons X, Y sit in the output of layer i (after ReLU) of net A, and neurons Z, W in the output of layer i of net B; both layers feed layer i + 1.]

For test examples a1, · · · , ad, there exist matrices A and B such that for all i,

(X(ai), Y(ai))ᵀ = A (Z(ai), W(ai))ᵀ and (Z(ai), W(ai))ᵀ = B (X(ai), Y(ai))ᵀ.

Equivalently,

span([X(a1), · · · , X(ad)], [Y(a1), · · · , Y(ad)]) = span([Z(a1), · · · , Z(ad)], [W(a1), · · · , W(ad)]),

where [X(a1), · · · , X(ad)] is the activation vector of X, and likewise for Y, Z, and W.

We say ({X, Y}, {Z, W}) form an exact match!
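Numerically, span equality is a rank condition: two sets of activation vectors span the same subspace exactly when stacking both sets together adds no new directions. Below is a minimal numpy sketch of this check; it is our illustration, and the names activation_matrix and is_exact_match are made up, not the authors' code.

import numpy as np

def activation_matrix(neuron_outputs):
    # Stack activation vectors (one length-d vector per neuron) as columns.
    return np.column_stack(neuron_outputs)

def is_exact_match(group_a, group_b, tol=1e-8):
    """True iff the two groups' activation vectors span the same subspace.

    group_a, group_b: lists of length-d activation vectors, e.g. the
    vectors of {X, Y} and of {Z, W} on test examples a1, · · · , ad.
    """
    Ma = activation_matrix(group_a)
    Mb = activation_matrix(group_b)
    ra = np.linalg.matrix_rank(Ma, tol=tol)
    rb = np.linalg.matrix_rank(Mb, tol=tol)
    # Equal spans <=> neither set adds directions beyond the other.
    rj = np.linalg.matrix_rank(np.hstack([Ma, Mb]), tol=tol)
    return ra == rb == rj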

SLIDE 26


Exact/Approximate Matches between Two Groups of Neurons

▶ Suppose a1, a2, · · · , ad are the test examples. The outputs of neuron X on these test examples form a vector (X(a1), X(a2), · · · , X(ad)) called the activation vector [Raghu et al., 2017].

▶ If the activation vectors of two groups of neurons span the same linear subspace, we say the two groups of neurons form an exact match.

▶ If the activation vector of every neuron in each group is ε-close to the linear subspace spanned by the other group, we say the two groups form an ε-approximate match.

▶ A vector u is ε-close to a linear subspace S if the sine of the angle between u and S is at most ε, or equivalently, min_{v ∈ S} ∥u − v∥2 ≤ ε ∥u∥2.
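The ε-closeness test is a single orthogonal projection: project u onto S and compare the residual with ε∥u∥2. Here is a minimal numpy sketch under the definition above (our illustration; is_eps_close and the basis matrix V are our names, not code from the paper):

import numpy as np

def is_eps_close(u, V, eps):
    """Is u eps-close to S = span(columns of V)?

    sin(angle(u, S)) = ||u - proj_S(u)||_2 / ||u||_2, so the condition
    min over v in S of ||u - v||_2 <= eps * ||u||_2 is a residual check.
    """
    Q, _ = np.linalg.qr(V)            # orthonormal basis for S
    residual = u - Q @ (Q.T @ u)      # component of u orthogonal to S
    return np.linalg.norm(residual) <= eps * np.linalg.norm(u)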

SLIDE 30


Maximum Matches and Simple Matches

▶ Matches are closed under union, so there is a unique maximum match.

▶ We define simple matches to be matches that are not the union of smaller matches.

▶ Any match is a union of simple matches.

▶ We designed algorithms for finding the maximum match and the simple matches, and we implemented the algorithms to conduct experiments (a rough sketch of one possible procedure follows below).
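The slides do not spell the algorithms out. As a rough illustration only, here is one natural fixed-point scheme built on the ε-closeness test above: alternately discard neurons whose activation vectors are not ε-close to the span of the other net's surviving neurons, until nothing changes. The pruning strategy and every name here (prune_to_match, _eps_close) are our assumptions, not the authors' published algorithm.

import numpy as np

def _eps_close(u, V, eps):
    # sin-of-angle test: residual of u after projection onto span(V).
    Q, _ = np.linalg.qr(V)
    return np.linalg.norm(u - Q @ (Q.T @ u)) <= eps * np.linalg.norm(u)

def prune_to_match(acts_a, acts_b, eps):
    """Iteratively prune towards a large eps-approximate match.

    acts_a, acts_b: dicts mapping neuron id -> length-d activation vector.
    Returns the ids of the surviving neurons on each side.
    """
    A, B = dict(acts_a), dict(acts_b)
    changed = True
    while changed and A and B:
        changed = False
        for side, other in ((A, B), (B, A)):
            if not other:
                break
            basis = np.column_stack(list(other.values()))
            for nid in list(side):
                if not _eps_close(side[nid], basis, eps):
                    del side[nid]
                    changed = True
    return set(A), set(B)

Because matches are closed under union, neurons surviving such a pruning are plausible members of the maximum match; decomposing it into simple matches would need a finer procedure than this sketch attempts.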

SLIDE 31


Experimental Findings: Few Matches in Intermediate Layers

[Figure: size of the maximum match as a fraction of the total number of neurons, across layers]

Low similarity in intermediate layers!

SLIDE 33


Thank you! Come to the poster for more details! 05:00 – 07:00 PM @ Room 210 & 230 AB #26