SLIDE 1

Towards Explainable AI: Significance Tests for Neural Networks

Kay Giesecke
Advanced Financial Technologies Laboratory, Stanford University
people.stanford.edu/giesecke/ · fintech.stanford.edu

Joint work with Enguerrand Horel (Stanford)

SLIDE 2

Introduction

• Neural networks underpin many of the best-performing AI systems, including speech recognizers on smartphones and Google's latest automatic translator
• The tremendous success of these applications has spurred interest in applying neural networks in a variety of other fields, including finance, economics, operations, marketing, and medicine
• In finance, researchers have developed several promising applications in risk management, asset pricing, and investment management

SLIDE 3

Literature

First wave: single-layer nets

• Financial time series: White (1989), Kuan & White (1994)
• Nonlinearity testing: Lee, White & Granger (1993)
• Economic forecasting: Swanson & White (1997)
• Stock market prediction: Brown, Goetzmann & Kumar (1998)
• Pricing kernel modeling: Bansal & Viswanathan (1993)
• Option pricing: Hutchinson, Lo & Poggio (1994)
• Credit scoring: Desai, Crook & Overstreet (1996)

Second wave: multi-layer nets (deep learning)

• Mortgages: Sirignano, Sadhwani & Giesecke (2016)
• Order books: Sirignano (2016), Cont & Sirignano (2018)
• Portfolio selection: Heaton, Polson & Witte (2016)
• Returns: Chen, Pelger & Zhu (2018), Gu, Kelly & Xiu (2018)
• Hedging: Halperin (2018), Bühler, Gonon & Teichmann (2018)
• Optimal stopping: Becker, Cheridito & Jentzen (2018)
• Treasury markets: Filipovic, Giesecke, Pelger & Ye (2019)
• Real estate: Giesecke, Ohlrogge, Ramos & Wei (2019)
• Insurance: Wüthrich & Merz (2019)

SLIDE 4

Explainability

• The success of NNs is largely due to their amazing approximation properties, superior predictive performance, and scalability
• A major caveat, however, is model explainability: NNs are perceived as black boxes that permit little insight into how predictions are made
• Key inference questions are difficult to answer:

– Which input variables are statistically significant?
– If significant, how can a variable's impact be measured?
– What is the relative importance of the different variables?

SLIDE 5

Explainability matters in practice

• This issue is not just academic; it has slowed the implementation of NNs in financial practice, where regulators and other stakeholders often insist on model explainability
• Credit and insurance underwriting (regulated)
  – Transparency of underwriting decisions
• Investment management (unregulated)
  – Transparency of portfolio designs
  – Economic rationale of trading decisions

SLIDE 6

This paper

• We develop a pivotal test to assess the statistical significance of the input variables of a NN
  – Focus on single-layer feedforward networks
  – Focus on the regression setting
• We propose a gradient-based test statistic and study its asymptotics using nonparametric techniques
  – The asymptotic distribution is a mixture of $\chi^2$ laws
• The test enables one to address key inference issues:
  – Assess the statistical significance of variables
  – Measure the impact of variables
  – Rank-order variables according to their influence
• Simulation and empirical experiments illustrate the test

SLIDE 7

Problem formulation

• Regression model: $Y = f_0(X) + \epsilon$
  – $X \in \mathcal{X} \subset \mathbb{R}^d$ is a vector of $d$ feature variables with law $P$
  – $f_0 : \mathcal{X} \to \mathbb{R}$ is an unknown deterministic $C^1$ function
  – $\epsilon$ is an error variable independent of $X$, with $E(\epsilon) = 0$ and $E(\epsilon^2) = \sigma^2 < \infty$
• To assess the significance of a variable $X_j$, we consider sensitivity-based hypotheses:
  $$H_0 : \lambda_j := \int_{\mathcal{X}} \left( \frac{\partial f_0(x)}{\partial x_j} \right)^2 d\mu(x) = 0 \qquad \text{vs.} \qquad H_A : \lambda_j \neq 0$$
• Here, $\mu$ is a positive weight measure
• A typical choice is $\mu = P$, in which case $\lambda_j = E\left[ \left( \frac{\partial f_0(X)}{\partial x_j} \right)^2 \right]$

SLIDE 8

Intuition

• Suppose the function $f_0$ is linear (multiple linear regression):
  $$f_0(x) = \sum_{k=1}^{d} \beta_k x_k$$
• Then $\lambda_j \propto \beta_j^2$, the squared regression coefficient for $X_j$, and the null takes the form $H_0 : \beta_j = 0$ ($\to$ t-test)
• In the general nonlinear case, the derivative $\frac{\partial f_0(x)}{\partial x_j}$ depends on $x$, and $\lambda_j = \int_{\mathcal{X}} \left( \frac{\partial f_0(x)}{\partial x_j} \right)^2 d\mu(x)$ is a weighted average (checked numerically in the sketch below)
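
This weighted-average view is easy to check numerically. Below is a minimal sketch (the toy functions and coefficients are hypothetical, with $\mu = P$ and $X$ uniform on $[-1,1]^2$) contrasting the linear case, where $\lambda_1$ equals the squared coefficient, with a nonlinear case, where $\lambda_1$ averages the squared derivative over the feature distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200_000, 2))  # X ~ P = U(-1, 1)^2

# Linear case: f0(x) = 2*x1 - 0.5*x2. The derivative df0/dx1 = 2 is
# constant, so lambda_1 = E[(df0/dx1)^2] = 2^2 = 4, the squared coefficient.
lam1_linear = np.mean(np.full(len(X), 2.0) ** 2)

# Nonlinear case: f0(x) = x1^2. Now df0/dx1 = 2*x1 depends on x, and
# lambda_1 = E[(2*X1)^2] = 4 * E[X1^2] = 4/3 for X1 ~ U(-1, 1).
lam1_nonlinear = np.mean((2.0 * X[:, 0]) ** 2)

print(lam1_linear, lam1_nonlinear)  # 4.0 and roughly 1.333
```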

SLIDE 9

Neural network

• We study the case where the unknown regression function $f_0$ is modeled by a single-layer feedforward NN
• A single-layer NN $f$ is specified by a bounded activation function $\psi$ on $\mathbb{R}$ and the number of hidden units $K$ (see the code sketch below):
  $$f(x) = b_0 + \sum_{k=1}^{K} b_k \, \psi(a_{0,k} + a_k^\top x),$$
  where $b_0, b_k, a_{0,k} \in \mathbb{R}$ and $a_k \in \mathbb{R}^d$ are to be estimated
• Functions of the form $f$ are dense in $C(\mathcal{X})$ (they are universal approximators): choosing $K$ large enough, $f$ can approximate $f_0$ to any given precision
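
In code, this architecture is one sigmoid hidden layer followed by a linear output. A minimal sketch in TensorFlow (the training tool used later in the deck); the sizes $d = 4$ and $K = 3$ match the figure on the next slide:

```python
import tensorflow as tf

d, K = 4, 3  # number of features and hidden units, as in the figure below

# f(x) = b0 + sum_{k=1..K} b_k * psi(a_{0,k} + a_k^T x), with sigmoid psi.
inputs = tf.keras.Input(shape=(d,))
hidden = tf.keras.layers.Dense(K, activation="sigmoid")(inputs)  # psi(a_{0,k} + a_k^T x)
output = tf.keras.layers.Dense(1)(hidden)                        # b0 + sum_k b_k * (...)
f = tf.keras.Model(inputs, output)
```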

SLIDE 10

Neural network: d = 4 features, K = 3 hidden units

SLIDE 11

Sieve estimator of neural network

• We use $n$ i.i.d. samples $(Y_i, X_i)$ to construct a sieve M-estimator $f_n$ of $f$, for which $K = K_n$ increases with $n$ (sketched in code below)
• We assume $f_0 \in \Theta$ = the class of $C^1$ functions on the $d$-hypercube $\mathcal{X}$ with uniformly bounded Sobolev norm
• Sieve subsets $\Theta_n \subseteq \Theta$ are generated by NNs $f$ with $K_n$ hidden units, bounded $L^1$ norms of the weights, and sigmoid $\psi$
• The sieve M-estimator $f_n$ is the approximate maximizer over $\Theta_n$ of the empirical criterion function $L_n(g) = \frac{1}{n} \sum_{i=1}^{n} l(Y_i, X_i, g)$, where $l : \mathbb{R} \times \mathcal{X} \times \Theta \to \mathbb{R}$:
  $$L_n(f_n) \geq \sup_{g \in \Theta_n} L_n(g) - o_P(1)$$
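
With the squared-error loss $l(y, x, g) = -\frac{1}{2}(y - g(x))^2$ assumed in Theorem 1 below, maximizing $L_n$ amounts to minimizing mean squared error. Here is a minimal fitting sketch; the bounded-$L^1$-norm sieve constraint is approximated by an $L^1$ weight penalty, a practical surrogate rather than the deck's exact construction (the penalty size $10^{-5}$ echoes the application slide later on):

```python
import tensorflow as tf

def make_sieve_nn(d: int, K_n: int, l1: float = 1e-5) -> tf.keras.Model:
    """Single-layer sigmoid NN with K_n hidden units; an L1 penalty on the
    weights stands in for the bounded-L1-norm constraint defining Theta_n."""
    reg = tf.keras.regularizers.l1(l1)
    inputs = tf.keras.Input(shape=(d,))
    hidden = tf.keras.layers.Dense(K_n, activation="sigmoid",
                                   kernel_regularizer=reg)(inputs)
    output = tf.keras.layers.Dense(1, kernel_regularizer=reg)(hidden)
    return tf.keras.Model(inputs, output)

# Maximizing L_n(g) with l(y, x, g) = -(1/2)(y - g(x))^2 is equivalent to
# minimizing MSE, so the approximate sieve M-estimator f_n is fit as:
f_n = make_sieve_nn(d=8, K_n=25)
f_n.compile(optimizer="adam", loss="mse")
# f_n.fit(X_train, Y_train, validation_data=(X_val, Y_val), epochs=20)
```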

SLIDE 12

Neural network test statistic

• The NN test statistic is given by (see the sketch below)
  $$\lambda_j^n = \int_{\mathcal{X}} \left( \frac{\partial f_n(x)}{\partial x_j} \right)^2 d\mu(x) = \phi[f_n]$$
• We will use the asymptotic ($n \to \infty$) distribution of $\lambda_j^n$ for testing the null, since a bootstrap approach would typically be too computationally expensive:
  1. Asymptotic distribution of $f_n$
  2. Functional delta method
• In the large-$n$ regime, due to the universal approximation property, we are actually performing inference on the “ground truth” $f_0$ (model-free inference)
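
With $\mu = P$, the statistic is simply the sample mean of the squared partial derivatives of the fitted network, which automatic differentiation delivers in one pass. A minimal sketch, where `f_n` and `X` are placeholder names for a fitted Keras model and a feature matrix:

```python
import tensorflow as tf

def nn_test_statistic(f_n: tf.keras.Model, X) -> tf.Tensor:
    """Empirical statistic for mu = P:
    lambda^n_j = (1/n) * sum_i (d f_n(X_i) / d x_j)^2, for every feature j."""
    X = tf.convert_to_tensor(X, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(X)
        preds = f_n(X)                 # shape (n, 1)
    grads = tape.gradient(preds, X)    # shape (n, d): row i is grad f_n(X_i)
    return tf.reduce_mean(grads ** 2, axis=0)  # shape (d,): lambda^n_j per feature
```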

SLIDE 13

Asymptotic distribution of NN estimator

Theorem. Assume that
• $dP = \nu \, d\lambda$ for a bounded and strictly positive $\nu$,
• the dimension $K_n$ of the NN satisfies $K_n^{2+1/d} \log K_n = O(n)$,
• the loss function is $l(y, x, g) = -\frac{1}{2}(y - g(x))^2$.
Then $r_n(f_n - f_0) \Rightarrow h^\star$ in $(\Theta, L^2(P))$, where
$$r_n = \left( \frac{n}{\log n} \right)^{\frac{d+1}{2(2d+1)}}$$
and $h^\star$ is the argmax of the Gaussian process $\{G_f : f \in \Theta\}$ with mean zero and $\mathrm{Cov}(G_s, G_t) = 4\sigma^2 E(s(X)t(X))$.

SLIDE 14

Comments

• $r_n$ is the estimation rate of the NN (Chen and Shen (1998)):
  $$E_X[(f_n(X) - f_0(X))^2] = O_P(r_n^{-1}),$$
  assuming the number of hidden units $K_n$ is chosen such that $K_n^{2+1/d} \log K_n = O(n)$
• The proof uses empirical process arguments:
  – The estimation rate implies tightness of $h_n = r_n(f_n - f_0)$
  – The rescaled and shifted criterion function converges weakly to a Gaussian process
  – The Gaussian process has a unique maximum at $h^\star$
  – Argmax continuous mapping theorem

SLIDE 15

Asymptotic distribution of test statistic

Theorem. Under the conditions of Theorem 1 and the null hypothesis,
$$r_n^2 \lambda_j^n \Rightarrow \int_{\mathcal{X}} \left( \frac{\partial h^\star(x)}{\partial x_j} \right)^2 d\mu(x)$$

SLIDE 16

Empirical test statistic

Theorem. Assume $\mu = P$, so that the test statistic is $\lambda_j^n = E_X\left[ \left( \frac{\partial f_n(X)}{\partial x_j} \right)^2 \right]$. Under the conditions of Theorem 1 and the null hypothesis, the empirical test statistic satisfies
$$r_n^2 \, n^{-1} \sum_{i=1}^{n} \left( \frac{\partial f_n(X_i)}{\partial x_j} \right)^2 \Rightarrow E_X\left[ \left( \frac{\partial h^\star(X)}{\partial x_j} \right)^2 \right]$$

SLIDE 17

Identifying the asymptotic distribution

Theorem. Take $\mu = P$. Let $\{\phi_i\}$ be an orthonormal basis of $\Theta$. If that basis is $C^1$ and stable under differentiation, then
$$E_X\left[ \left( \frac{\partial h^\star(X)}{\partial x_j} \right)^2 \right] = \frac{B^2}{\sum_{i=0}^{\infty} \frac{\chi^2_i}{d_i^2}} \sum_{i=0}^{\infty} \frac{\alpha_{i,j}^2}{d_i^4} \, \chi^2_i,$$
where $\{\chi^2_i\}$ are i.i.d. samples from the chi-square distribution, $\alpha_{i,j} \in \mathbb{R}$ satisfies $\frac{\partial \phi_i}{\partial x_j} = \alpha_{i,j} \phi_{k(i)}$ for some $k : \mathbb{N} \to \mathbb{N}$, and the $d_i$'s are certain functions of the $\alpha_{i,j}$'s.

SLIDE 18

Implementing the test

• Truncate the infinite sum at some finite order $N$
• Draw samples from the $\chi^2$ distribution to construct a sample of the approximate limiting law
• Repeat $m$ times and compute the empirical quantile $Q_{N,m}$ at level $\alpha \in (0, 1)$ of the corresponding samples
• If $m = m_N \to \infty$ as $N \to \infty$, then $Q_{N,m_N}$ is a consistent estimator of the true quantile of interest
• Reject $H_0$ if $\lambda_j^n > Q_{N,m_N}(1 - \alpha)$, so that the test is asymptotically of level $\alpha$ (see the sketch below):
  $$P_{H_0}\left( \lambda_j^n > Q_{N,m_N}(1 - \alpha) \right) \leq \alpha$$
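
A minimal sketch of this Monte Carlo step, under my reading of the mixture in the preceding theorem. Assumptions here: the $\chi^2_i$ have one degree of freedom (arising as squared standard-normal coordinates), and the hypothetical arrays `num_w` $\approx \alpha_{i,j}^2 / d_i^4$ and `den_w` $\approx 1 / d_i^2$ hold the first $N$ truncated coefficients:

```python
import numpy as np

def mc_quantile(B2, num_w, den_w, alpha=0.05, m=100_000, seed=0):
    """(1 - alpha)-quantile of the truncated limiting law
    B^2 * sum_i num_w[i]*chi2_i / sum_i den_w[i]*chi2_i,
    estimated from m Monte Carlo draws of the chi-square vector."""
    rng = np.random.default_rng(seed)
    # One chi-square vector per draw; the SAME draws enter both sums.
    chi2 = rng.chisquare(df=1, size=(m, len(num_w)))
    samples = B2 * (chi2 @ np.asarray(num_w)) / (chi2 @ np.asarray(den_w))
    return np.quantile(samples, 1.0 - alpha)

# Reject H0 for feature j when the rescaled statistic exceeds the quantile:
# reject = r_n**2 * lambda_n_j > mc_quantile(B2, num_w, den_w)
```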

SLIDE 19

Simulation study

• 8 variables: $X = (X_1, \ldots, X_8) \sim U(-1, 1)^8$
• Ground truth (reproduced in the sketch below): $Y = 8 + X_1^2 + X_2 X_3 + \cos(X_4) + \exp(X_5 X_6) + 0.1 X_7 + \epsilon$, where $\epsilon \sim N(0, 0.01^2)$; $X_8$ has no influence on $Y$
• Training (via TensorFlow): 100,000 samples $(Y_i, X_i)$
• Validation, Testing: 10,000 samples each
• Out-of-sample MSE:

  Model               Mean Squared Error
  NN with K = 25      3.1 · 10⁻⁴  (≈ Var(ε))
  Linear Regression   0.35
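
The simulation design can be reproduced directly; a sketch of the data-generating process (sample sizes per the slide, seed arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate(n):
    """Ground truth: Y = 8 + X1^2 + X2*X3 + cos(X4) + exp(X5*X6) + 0.1*X7 + eps
    with X ~ U(-1, 1)^8 and eps ~ N(0, 0.01^2); X8 never enters Y."""
    X = rng.uniform(-1.0, 1.0, size=(n, 8))
    eps = rng.normal(0.0, 0.01, size=n)
    Y = (8.0 + X[:, 0]**2 + X[:, 1]*X[:, 2] + np.cos(X[:, 3])
         + np.exp(X[:, 4]*X[:, 5]) + 0.1*X[:, 6] + eps)
    return X, Y

X_train, Y_train = simulate(100_000)  # training
X_val,   Y_val   = simulate(10_000)   # validation
X_test,  Y_test  = simulate(10_000)   # testing
```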

SLIDE 20

Linear model fails to identify significant variables

Variable   coef      std err   t          P>|t|
const      10.2297   0.002     5459.250   0.000
1          -0.0031   0.003     -0.964     0.335
2           0.0051   0.003      1.561     0.118
3          -0.0026   0.003     -0.800     0.424
4           0.0003   0.003      0.085     0.932
5           0.0016   0.003      0.493     0.622
6          -0.0033   0.003     -1.035     0.300
7           0.0976   0.003     30.059     0.000
8          -0.0018   0.003     -0.563     0.573

Only the intercept and the linear term $0.1 X_7$ are identified as significant. The irrelevant $X_8$ is correctly identified as insignificant.

SLIDE 21

NN test statistic (5% level; 100 experiments; Fourier basis)

Variable   Test Statistic          Power/Size
1          1.310                   1
2          0.332                   1
3          0.331                   1
4          0.267                   1
5          0.480                   1
6          0.479                   1
7          1.010 · 10⁻² (= 0.1²)   1
8          4.200 · 10⁻⁶            0.13 (> 0.05)

• Size: the asymptotic distribution tends to underestimate the variance of the finite-sample distribution of the test statistic
• Efficiency: gradients come from TensorFlow; no re-fitting is required
• Robustness: insensitive to correlated feature data

SLIDE 22

Application: House price valuation

• Data: 120+ million housing sales from county registrar of deeds offices across the US (source: CoreLogic)
• Sample period: 1970 to 2017
• Geographical area: Merced County, CA; 76,247 samples
• Prediction of $Y$ = log sale price
• Variables $X$ ($d = 68$): Bedrooms, Full Baths, Last Sale Amount, N Originations, N Past Sales, Sale Month, SqFt, Stories, Tax Amount, Time Since Prior Sale, etc.
• Training and gradients via TensorFlow, with Adam
• Validation (70/20/10 split): $K = 150$ nodes, $L^1$ weight $10^{-5}$
• Test MSE is 0.45

SLIDE 23

Application: House price valuation

SLIDE 24

Top 10 significant (5%) variables (out of 68)

Variable Name                Test Statistic
Last Sale Amount             1.640
Tax Land Square Footage      1.615
Sale Month No                1.340
Tax Amount                   0.383
Last Mortgage Amount         0.104
Tax Assd Total Value         0.081
Tax Improvement Value Calc   0.072
Tax Land Value Calc          0.069
Year Built                   0.068
SqFt                         0.056
...                          ...

SLIDE 25

Conclusion

• We develop a computationally efficient, pivotal significance test for neural networks
  – Assess the impact of feature variables on the prediction
  – Rank variables according to their predictive importance
• This opens up a broader range of applications of NNs in financial practice
• Ongoing work:
  – Treatment of NN classifiers and deep networks
  – Cross derivatives for testing interactions between variables
  – Alternative approaches

SLIDE 26

Example

• Suppose the elements of $X$ are i.i.d. uniform on $[-1, 1]$
• Using the Fourier basis, the limiting distribution takes the form
  $$\frac{B^2}{\sum_{n \in \mathbb{N}^d} \frac{\chi^2_n}{d_n^2}} \sum_{n \in \mathbb{N}^d} \frac{n_j^2 \pi^2}{d_n^4} \, \chi^2_n, \qquad n = (n_1, n_2, \ldots, n_j, \ldots, n_d)$$
• $d_n^2 = \sum_{|\alpha| \leq \lfloor d/2 \rfloor + 2} \prod_{k=1}^{d} (n_k \pi)^{2\alpha_k}$
• $\{\chi^2_n\}_{n \in \mathbb{N}^d}$ are i.i.d. chi-square variables

SLIDE 27

Computing the asymptotic distribution

• We note that $\Theta$ is a subspace of the Hilbert space $L^2(P)$, which admits an orthonormal basis $\{\phi_i\}_{i=0}^{\infty}$
• If this basis is $C^1$ and stable under differentiation, i.e., if there are real $\alpha_{i,j}$ and a mapping $k : \mathbb{N} \to \mathbb{N}$ such that $\frac{\partial \phi_i}{\partial x_j} = \alpha_{i,j} \phi_{k(i)}$, then there exists an invertible operator $D$ such that
  $$\|f\|_{k,2}^2 = \|Df\|_{L^2(P)}^2 = \sum_{i=0}^{\infty} d_i^2 \, \langle f, \phi_i \rangle_{L^2(P)}^2,$$
  where the $d_i$'s are certain functions of the $\alpha_{i,j}$'s
