  1. Interaction Effects: Helpful or Harmful? Ben Lengerich, CMU AI Seminar, Feb 18, 2020

  2. Today: 1. What is an Interaction Effect? 2. Interaction Effects in Neural Networks. Based on: • Purifying Interaction Effects with the Functional ANOVA. AISTATS 2020 (Lengerich, Tan, Chang, Hooker, Caruana). • On Dropout, Overfitting, and Interaction Effects in Deep Neural Networks. Under Review 2020 (Lengerich, Xing, Caruana).

  3. Why do we care about interaction effects? • Interpreting models • Identifiability • Understanding how big machine learning models work

  4. What is an Interaction Effect? Intuitively: “Effect of one variable changes based on the value of another variable.” But this definition is incomplete: 3 stories.

  5. Is “AND” an Interaction Effect? Suppose we have data Y = AND(X_1, X_2) with Boolean X_1, X_2 (slide shows the 2×2 truth table of AND). Let's fit an additive model (no interactions): Y = f_0 + f_1(X_1) + f_2(X_2). How well can we fit the data? Perfectly*!
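A minimal sketch (my illustration, not from the talk, and not claiming this is what the asterisk refers to): fit the most flexible additive model to Boolean AND and inspect the result. For binary inputs, the most flexible additive model f_0 + f_1(X_1) + f_2(X_2) is just linear in X_1 and X_2. On the uniform four-point distribution the regression fit is not exact, yet thresholding the additive predictions reproduces AND, and if some (X_1, X_2) combinations never occur in the data the additive fit can even be exact; either way, how much "interaction" AND contains depends on how you score the fit and on the data distribution.

```python
# Minimal sketch (my illustration, not from the talk): fit an additive model
# to Boolean AND and see how close it gets. For Boolean inputs, the most
# flexible additive model f0 + f1(X1) + f2(X2) is just linear in X1 and X2.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.logical_and(X[:, 0], X[:, 1]).astype(float)   # AND target

A = np.column_stack([np.ones(len(X)), X])             # [1, X1, X2]
coef, *_ = np.linalg.lstsq(A, y, rcond=None)           # least-squares additive fit
pred = A @ coef

print(pred)                          # [-0.25, 0.25, 0.25, 0.75] on the uniform 4 points
print((pred > 0.5).astype(float))    # [0, 0, 0, 1] -- thresholding recovers AND
```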

  6. Is Multiplication an Interaction? Common model: Y = a + b X_1 + c X_2 + d X_1 X_2. But this is equivalent to: Y = (a − d αβ) + (b + d β) X_1 + (c + d α) X_2 + d (X_1 − α)(X_2 − β). We can pick any offsets α, β without changing the function output, but picking different values of α, β drastically changes the interpretation.
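A quick numerical check of this identity (a sketch with arbitrary made-up coefficients, not from the slides):

```python
# Minimal sketch (illustrative): numerically check that the reparameterization
# on the slide leaves the function unchanged for any choice of offsets.
import numpy as np

rng = np.random.default_rng(0)
a, b, c, d = 1.0, 2.0, -0.5, 3.0
X1, X2 = rng.normal(size=1000), rng.normal(size=1000)

def f(alpha, beta):
    return ((a - d * alpha * beta)
            + (b + d * beta) * X1
            + (c + d * alpha) * X2
            + d * (X1 - alpha) * (X2 - beta))

baseline = a + b * X1 + c * X2 + d * X1 * X2
for alpha, beta in [(0.0, 0.0), (1.0, -2.0), (10.0, 3.5)]:
    assert np.allclose(f(alpha, beta), baseline)   # same function for any offsets

# ...but the share of variance attributed to the "interaction" term
# d*(X1-alpha)*(X2-beta) differs wildly across the choices of (alpha, beta).
```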

  7. Is Multiplication an Interaction? Y = (a − d αβ) + (b + d β) X_1 + (c + d α) X_2 + d (X_1 − α)(X_2 − β). Picking different values of α, β drastically changes the interpretation. (Figure: one choice attributes 100% of the variance to the interaction effect, another only 20%.)

  8. Is Multiplication an Interaction? Mean-Center? • Does mean-centering solve this problem? • No — if the correlation ρ(X_1, X_2) is not zero, then we can't simultaneously center X_1, X_2, and X_1 X_2. • Choosing which term to center changes the interpretation!
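A small numerical illustration of why (my sketch; the correlation value is made up):

```python
# Minimal sketch (illustrative): with correlated features, centering X1 and X2
# does not also center their product, so one centering choice must be sacrificed.
import numpy as np

rng = np.random.default_rng(0)
cov = [[1.0, 0.8], [0.8, 1.0]]                       # rho(X1, X2) = 0.8
X1, X2 = rng.multivariate_normal([2.0, -1.0], cov, size=100_000).T

X1c, X2c = X1 - X1.mean(), X2 - X2.mean()
print((X1c * X2c).mean())   # ~0.8, not 0: the centered product keeps the covariance
# To center the product term instead, X1 or X2 would have to be shifted off
# their means; whichever term we choose to center changes the interpretation.
```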

  9. Is Multiplication an Interaction? One more wrinkle: if we say that Y = X_1 X_2 is an interaction effect, then is log(Y) = log(X_1 X_2) = log(X_1) + log(X_2) an interaction effect?

  10. Are “AND”, “OR”, “XOR” the same or different? Suppose we have: Y = f_0 + f_1(X_1) + f_2(X_2) + f_3(X_1, X_2). Equivalent realizations can look like “AND”, “OR”, or “XOR”.
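One way to see this on Boolean inputs is a standard identity (added here for illustration): all three gates share the same product term up to sign and scale, and differ only in their main effects.

AND(X_1, X_2) = X_1 X_2
OR(X_1, X_2)  = X_1 + X_2 − X_1 X_2
XOR(X_1, X_2) = X_1 + X_2 − 2 X_1 X_2        (X_1, X_2 ∈ {0, 1})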

  11. Pure Interaction Effects. To make things identifiable, let's define a Pure Interaction Effect of k Variables as variance in the outcome which cannot be explained by any function of fewer than k variables. This gives us an optimization criterion: maximize the variance of lower-order terms.
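For example (a standard calculation, not from the slide): for independent X_1, X_2 with means μ_1, μ_2, the raw product can be rewritten as

X_1 X_2 = μ_1 μ_2 + μ_2 (X_1 − μ_1) + μ_1 (X_2 − μ_2) + (X_1 − μ_1)(X_2 − μ_2)

Under independence the last term is uncorrelated with every function of X_1 alone or of X_2 alone, so it is the pure 2-variable interaction; all remaining variance has been pushed into the intercept and the two main effects.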

  12. Functional ANOVA. Statistical framework designed to decompose a function into orthogonal functions on sets of input variables. Deep roots: [Hoeffding 1948, Huang 1998, Cuevas 2004, Hooker 2004, Hooker 2007]

  13. Functional ANOVA. Given F(X) where X = (X_1, ..., X_d), the weighted fANOVA decomposition [Hooker 2004, 2007] of F(X) is:

{ f_u(X_u) }_{u ⊆ [d]} = argmin_{ {g_u ∈ ℱ_u} } ∫ ( Σ_{u ⊆ [d]} g_u(X_u) − F(X) )^2 w(X) dX,

where u ranges over all subsets of the d features, subject to

∫ f_u(X_u) g_v(X_v) w(X) dX = 0   for all v ⊊ u and all g_v.
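As a rough illustration of what the decomposition produces, here is a minimal sketch for the simplest special case, w(X) = P(X) with independent, uniformly weighted inputs on a grid, where the components reduce to successive conditional means (the weighted, correlated case treated in the paper requires the full constrained optimization above):

```python
# Minimal sketch: classical fANOVA on a discrete grid, assuming independent,
# uniformly weighted inputs (the simplest special case of w(X) = P(X)).
import numpy as np

x1 = np.linspace(-1, 1, 201)
x2 = np.linspace(-1, 1, 201)
X1, X2 = np.meshgrid(x1, x2, indexing="ij")
F = X1 * X2 + X1                      # toy function to decompose

f0 = F.mean()                         # intercept: overall mean
f1 = F.mean(axis=1) - f0              # main effect of X1: E[F | X1] - f0
f2 = F.mean(axis=0) - f0              # main effect of X2: E[F | X2] - f0
f12 = F - f0 - f1[:, None] - f2[None, :]   # pure 2-way interaction (residual)

# Orthogonality / centering checks (Key Property 1 in the independent case):
print(np.allclose(f1.mean(), 0), np.allclose(f2.mean(), 0))                 # True True
print(np.allclose(f12.mean(axis=0), 0), np.allclose(f12.mean(axis=1), 0))   # True True
# Here f12 recovers (approximately) the centered product X1*X2,
# while the additive X1 term is absorbed by the X1 main effect.
```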

  14. Functional ANOVA. Key property 1 (Orthogonality) [Hooker 2004]: ∫ f_u(X_u) g_v(X_v) w(X) dX = 0 for all v ⊊ u. Every function f_u is orthogonal to any function f_v which operates on a proper subset of the variables in u. When w(X) = P(X), this means that the functions in the decomposition are all mean-centered and uncorrelated with functions on fewer variables.

  15. Functional ANOVA. Key property 2 (Existence and Uniqueness) [Hooker 2004]: Under reasonable assumptions on the joint distribution P(X, Y) (e.g. no duplicated variables), the functional ANOVA decomposition exists and is unique.

  16. Functional ANOVA Example. (Figure: fANOVA decomposition of Y = X_1 X_2 into f_1(X_1), f_2(X_2), and f_3(X_1, X_2), shown for low correlation ρ_{1,2} = 0.01 and high correlation ρ_{1,2} = 0.99.)

  17. Interaction Effects in Neural Networks

  18. The Challenge of Finding Interaction Effects • Define: a k-order interaction effect f_u has |u| = k. • Given d input variables, there are potentially: O(d) interaction effects of order 1, O(d^2) interaction effects of order 2, O(d^3) interaction effects of order 3, ... • How do deep nets learn? How do they generalize to test sets?
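Concretely (illustrative numbers, not from the slide): the exact number of possible order-k effects among d features is C(d, k).

```python
# Minimal sketch (illustrative): the number of possible order-k interaction
# effects among d features is C(d, k), which blows up quickly with k.
from math import comb

d = 25
for k in range(1, 6):
    print(k, comb(d, k))
# 1 25
# 2 300
# 3 2300
# 4 12650
# 5 53130
```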

  19. Dropout • “Input Dropout” if we drop input features. • “Activation Dropout” if we drop hidden activations. • Dropout rate will refer to the probability that the variable is set to 0.
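A minimal sketch of the two flavors in plain numpy (illustrative; not tied to any particular deep-learning library, and omitting the usual 1/(1 − rate) rescaling of surviving entries):

```python
# Minimal sketch (illustrative, numpy only): "Input Dropout" zeroes raw
# features, "Activation Dropout" zeroes hidden units.
import numpy as np

rng = np.random.default_rng(0)

def dropout(a, rate, rng):
    """Zero each entry independently with probability `rate` (training-time view)."""
    mask = rng.random(a.shape) >= rate
    return a * mask

x = rng.normal(size=(4, 10))            # batch of 4 examples, 10 input features
W = rng.normal(size=(10, 32))

x_drop = dropout(x, rate=0.2, rng=rng)  # Input Dropout: drop raw features
h = np.maximum(x_drop @ W, 0.0)         # hidden ReLU activations
h_drop = dropout(h, rate=0.5, rng=rng)  # Activation Dropout: drop hidden units
```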

  20. Dropout Regularizes Interaction Effects • With fANOVA, we can decompose the function estimated by each network into orthogonal functions of k variables. • As we increase the Dropout rate, the estimated function is increasingly made up of low-order effects.

  21. Dropout Preferentially Targets High-Order Effects. Intuition: consider Input Dropout. For a pure interaction effect of k variables, all k variables must be retained for the interaction effect to survive. The probability that k variables all survive Input Dropout decays exponentially with k. This balances out the exponential growth in k of the size of the hypothesis space.

  22. Dropout Preferentially Targets High-Order Effects. Let 𝔼[Y | X] = F(X) + ε with the fANOVA decomposition F(X) = Σ_u f_u(X_u) and 𝔼[Y] = 0. Let X̃ be X perturbed by Input Dropout, and define v = { j ∈ u : X̃_j = 0 }. Then 𝔼_{X_u}[ f_u(X_u) | X̃_u ] = f_u(X̃_u) if |v| = 0, and 0 otherwise. If a single variable in u has been dropped, then we have no information about f_u(X_u).

  23. Dropout Preferentially Targets High-Order Effects. 𝔼_{X_u}[ f_u(X_u) | X̃_u ] = f_u(X̃_u) if |v| = 0, and 0 otherwise. • What is the probability that |v| = 0? • (1 − p)^{|u|}. • Define r_p(k) = (1 − p)^k, the effective learning rate of a k-order effect.

  24. A Symmetry (d = 25) • Define r_p(k) = (1 − p)^k, the effective learning rate of a k-order effect. • |ℋ_k| = C(d, k), the hypothesis space size. • Effective learning rate decay and hypothesis space growth in k balance each other out!
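A small numerical illustration (my tabulation, not the slide's figure): for d = 25, the decaying effective learning rate and the growing hypothesis-space size pull in opposite directions, and the dropout rate p controls where their product peaks.

```python
# Minimal sketch (illustrative): effective learning rate r_p(k) = (1-p)^k
# shrinks with order k while the hypothesis space |H_k| = C(d, k) grows;
# the dropout rate p controls how the two trade off.
from math import comb

d = 25
for p in (0.1, 0.5, 0.9):
    print(f"p = {p}")
    for k in range(1, 6):
        r = (1 - p) ** k
        print(f"  k={k}  r_p(k)={r:.4f}  |H_k|={comb(d, k):>6}  product={r * comb(d, k):.1f}")
```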

  25. A Symmetry. (Figure, d = 25.)

  26. (Figure: panels for Activation, Input, and Activation+Input Dropout.)

  27. (Figure: panels for Activation, Input, and Activation+Input Dropout.)

  28. Early Stopping. Neural networks tend to start near simple functions, and train toward complex functions [Weigand 1994, De Palma 2019, Nakkiran 2019]. Dropout slows down the training of high-order interactions, making early stopping even more effective.

  29. Implications • When should we use higher Dropout rates? Higher in later layers; lower in ConvNets. • Explicitly modeling interaction effects. • Dropout for explanations / saliency?

  30. Conclusions • Interaction effects are tricky — not everything that looks like an interaction is purely an interaction. • Defining pure interaction effects according to the Functional ANOVA gives us an identifiable form. • The number of potential interaction effects explodes exponentially with order, so searching for high-order interaction effects from data is impossible in practice. • Dropout is an effective regularizer against interaction effects. It penalizes higher-order effects more than lower-order effects.

  31. Thank You. Collaborators: Eric Xing, Rich Caruana (MSR), Chun-Hao Chang (Toronto), Sarah Tan (Facebook), Giles Hooker (Cornell). • Purifying Interaction Effects with the Functional ANOVA. AISTATS 2020 (Lengerich, Tan, Chang, Hooker, Caruana). • On Dropout, Overfitting, and Interaction Effects in Deep Neural Networks. Under Review 2020 (Lengerich, Xing, Caruana).


  33. Dropout Preferentially Targets High-Order Effects. Let 𝔼[Y | X] = F(X) + ε with the fANOVA decomposition F(X) = Σ_u f_u(X_u) and 𝔼[Y] = 0. Let X̃ be X perturbed by Input Dropout, and define v = { j ∈ u : X̃_j = 0 }. Then

𝔼_{X_u}[ f_u(X_u) | X̃_u ] = ∫ f_u(X_u) P(X_u | X̃) dX_u
  = ∫ f_u(X_u) I(X_{u∖v} = X̃_{u∖v}) P(X_v | X̃) dX_u
  = ∫ f_u(X_v, X̃_{u∖v}) P(X_v | X̃) dX_v
  = f_u(X̃_u) if |v| = 0, and 0 otherwise.

Advantage of using fANOVA to define f_u — these are zero!
