Failing Loudly: Detecting Dataset Shift



  1. Failing Loudly: Detecting Dataset Shift. Stephan Rabanser¹ (rabans@amazon.com), Prof. Dr. Stephan Günnemann² (guennemann@in.tum.de), Prof. Dr. Zachary C. Lipton³ (zlipton@cmu.edu). ¹Amazon, AWS AI Labs; ²Technical University of Munich, Department of Informatics, Data Analytics and Machine Learning; ³Carnegie Mellon University, Tepper School of Business, Machine Learning Department. January 24, 2020

  2. Table of contents
• Motivation & Overview
• Methods
• Experiments
• Conclusion

  3. Motivation & Overview

  4. Motivation
• The reliable functioning of software depends crucially on tests.
• Despite their power, ML models are sensitive to shifts in the data distribution.
• ML pipelines rarely inspect incoming data for signs of distribution shift.
• Best practices for testing equivalence of the source distribution p and the target distribution q in real-life, high-dimensional data settings have not yet been established.
• Existing solutions addressing covariate shift q(x, y) = q(x) p(y|x) or label shift q(x, y) = q(y) p(x|y) often rely on strict preconditions and produce wrong predictions when these are not met.

  5. Shift Detection Overview
Faced with distribution shift, our goals are three-fold:
• detect when distribution shift occurs from as few examples as possible;
• characterize the shift (e.g. by identifying those samples from the test set that appear over-represented in the target data); and
• provide some guidance on whether the shift is harmful or not.
[Figure: detection pipeline. Source samples x_source and target samples x_target pass through dimensionality reduction; two-sample tests then yield a combined test statistic for shift detection.]

  6. Methods

  7. Our Framework
Given labeled data (x_1, y_1), …, (x_n, y_n) ∼ p and unlabeled data x′_1, …, x′_m ∼ q, our task is to determine whether p(x) equals q(x′):
H_0: p(x) = q(x′) vs. H_A: p(x) ≠ q(x′).
We explore the following design choices (a minimal end-to-end sketch follows below):
• what representation to run the test on;
• which two-sample test to run;
• when the representation is multidimensional, whether to run a single multivariate test or multiple univariate two-sample tests; and
• how to combine their results.
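To make the framework concrete, here is a minimal, illustrative Python sketch of the detection loop; the function names and the callable-based interface are ours, not the authors' code. Pick a representation, pick a two-sample test, and compare the resulting p-value to the significance level.

```python
def detect_shift(X_source, X_target, reduce, two_sample_test, alpha=0.05):
    """reduce: maps raw inputs to a (possibly lower-dimensional) representation.
    two_sample_test: returns a p-value for H_0: p(x) = q(x')."""
    h_source = reduce(X_source)
    h_target = reduce(X_target)
    p_value = two_sample_test(h_source, h_target)
    return p_value <= alpha  # True: reject H_0, i.e. shift detected
```

The slides that follow instantiate `reduce` (NoRed, PCA, SRP, autoencoders, BBSD, domain classifier) and `two_sample_test` (MMD, KS + Bonferroni, chi-squared, binomial).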

  8. Dimensionality Reduction Techniques: NoRed & PCA
No Reduction (NoRed):
• To justify the use of any DR technique, our default baseline is to run tests on the original raw features.
Principal Components Analysis (PCA):
• Find an optimal orthogonal transformation matrix such that points are linearly uncorrelated after transformation.
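A hedged sketch of the PCA choice, assuming scikit-learn's PCA (the paper reduces to K = 32 in its experiments; the helper name is ours):

```python
from sklearn.decomposition import PCA

def pca_reduce(X_source, X_target, k=32):
    # Fit the orthogonal transformation on source data only, then apply
    # the same projection to both source and target samples.
    pca = PCA(n_components=k).fit(X_source)
    return pca.transform(X_source), pca.transform(X_target)
```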

  9. Dimensionality Reduction Techniques: SRP & AE
Sparse Random Projection (SRP):
• Project onto K dimensions with a sparse random matrix R whose entries are drawn as
R_ij = +√(v/K) with prob. 1/(2v), 0 with prob. 1 − 1/v, −√(v/K) with prob. 1/(2v), with v = √D.
Autoencoders (TAE and UAE):
• Encoder φ: X → L and decoder ψ: L → X, trained via (φ, ψ) = argmin_{φ,ψ} ‖X − (ψ ∘ φ)(X)‖².
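A sketch of drawing the sparse random projection matrix defined above; the sampling convention matches sklearn's SparseRandomProjection, and the helper name is ours:

```python
import numpy as np

def srp_matrix(D, K, seed=0):
    """R in R^{D x K}: +sqrt(v/K) w.p. 1/(2v), 0 w.p. 1 - 1/v,
    -sqrt(v/K) w.p. 1/(2v), with v = sqrt(D)."""
    rng = np.random.default_rng(seed)
    v = np.sqrt(D)
    values = np.array([np.sqrt(v / K), 0.0, -np.sqrt(v / K)])
    probs = [1 / (2 * v), 1 - 1 / v, 1 / (2 * v)]
    return rng.choice(values, size=(D, K), p=probs)

# Usage: H = X @ srp_matrix(X.shape[1], K=32)
```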

  10. Dimensionality Reduction Techniques: BBSD & Classif
Label Classifiers (BBSDs ⊳ and BBSDh ⊲):
• Use a label classifier trained on the source data, taking either its softmax outputs (BBSDs ⊳) or its hard-thresholded predictions (BBSDh ⊲) as the representation.
Domain Classifier (Classif ×):
• Explicitly train a domain classifier to discriminate between data from the source and target domains.
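A minimal sketch of both label-classifier reductions, assuming a hypothetical scikit-learn-style classifier `model` (the paper uses a ResNet-18; the interface here is an assumption):

```python
def bbsds_reduce(model, X_source, X_target):
    # BBSDs: C-dimensional softmax outputs as the representation.
    return model.predict_proba(X_source), model.predict_proba(X_target)

def bbsdh_reduce(model, X_source, X_target):
    # BBSDh: hard-thresholded (argmax) class predictions.
    return model.predict(X_source), model.predict(X_target)
```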

  11. Statistical Hypothesis Testing: Maximum Mean Discrepancy (MMD)
• Popular kernel-based technique for multivariate two-sample testing.
• Distinguishes two distributions based on their mean embeddings μ_p and μ_q in a reproducing kernel Hilbert space F:
MMD(F, p, q) = ‖μ_p − μ_q‖²_F
• Unbiased empirical estimate:
MMD² = 1/(m(m−1)) Σ_{i=1}^m Σ_{j≠i} κ(x_i, x_j) + 1/(n(n−1)) Σ_{i=1}^n Σ_{j≠i} κ(x′_i, x′_j) − 2/(mn) Σ_{i=1}^m Σ_{j=1}^n κ(x_i, x′_j)
• Kernel: κ(x₁, x₂) = e^{−‖x₁ − x₂‖²/σ}
• Used with NoRed, PCA, SRP, TAE, UAE, and BBSDs ⊳.
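A numpy sketch of the unbiased MMD² estimate above with the RBF kernel. The median-distance bandwidth heuristic is our assumption (a common default, not stated on the slide), and in practice a p-value is obtained by recomputing the statistic under permutations of the pooled samples:

```python
import numpy as np

def mmd2_unbiased(X, Xp, sigma=None):
    """X: (m, d) source samples; Xp: (n, d) target samples."""
    Z = np.vstack([X, Xp])
    sq_dists = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    if sigma is None:
        sigma = np.median(sq_dists)  # median heuristic (assumption)
    K = np.exp(-sq_dists / sigma)
    m, n = len(X), len(Xp)
    Kxx, Kyy, Kxy = K[:m, :m], K[m:, m:], K[:m, m:]
    # Off-diagonal averages implement the j != i sums in the estimate.
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_x + term_y - 2.0 * Kxy.mean()
```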

  12. Statistical Hypothesis Testing: Kolmogorov-Smirnov + Bonferroni
• Test each of the K dimensions separately (instead of jointly) using the Kolmogorov-Smirnov (KS) test.
• KS statistic: the largest difference S of the cumulative distribution functions over all values z:
S = sup_z |F_p(z) − F_q(z)|
• Multiple hypothesis testing: we must subsequently combine the p-values from the K-many tests.
• Problem: we cannot make strong assumptions about the (in)dependence among the tests.
• Solution: Bonferroni correction:
  • Does not assume (in)dependence.
  • Bounds the family-wise error rate, i.e. it is a conservative aggregation.
  • Rejects H_0 if p_min ≤ α/K.
• Used with NoRed, PCA, SRP, TAE, UAE, and BBSDs ⊳.
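A sketch of this aggregation using scipy's two-sample KS test; the wrapper name is ours:

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_bonferroni(H_source, H_target, alpha=0.05):
    """H_source, H_target: (n_samples, K) representation matrices."""
    K = H_source.shape[1]
    p_values = np.array(
        [ks_2samp(H_source[:, k], H_target[:, k]).pvalue for k in range(K)]
    )
    # Bonferroni: reject H_0 if the smallest p-value is at most alpha / K.
    return p_values.min() <= alpha / K
```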

  13. Statistical Hypothesis Testing: Chi-Squared Test
• Evaluate whether the frequency distribution of certain events observed in a sample is consistent with a particular theoretical distribution.
• Arrange the observed class counts from source (p) and target (q) in a 2 × C contingency table:
Sample   Cat 1  ⋯  Cat C  | Σ
p        n_p1   ⋯  n_pC   | n_p•
q        n_q1   ⋯  n_qC   | n_q•
Σ        n_•1   ⋯  n_•C   | N_sum
• The difference can be calculated as
X² = Σ_{i=1}^2 Σ_{j=1}^C (O_ij − E_ij)² / E_ij
with observed counts O_ij and expected counts E_ij = N_sum · p_i• · p_•j, where p_i• = n_i•/N_sum and p_•j = n_•j/N_sum.
• Under H_0, X² ∼ χ²_{C−1}.
• Used with BBSDh ⊲.
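A sketch of this test on BBSDh's hard predictions, using scipy's contingency-table test rather than the manual formula (the degrees of freedom for a 2 × C table are C − 1, matching the slide):

```python
import numpy as np
from scipy.stats import chi2_contingency

def chi2_test(preds_source, preds_target, num_classes, alpha=0.05):
    """preds_*: arrays of predicted class labels in {0, ..., C - 1}."""
    counts_p = np.bincount(preds_source, minlength=num_classes)
    counts_q = np.bincount(preds_target, minlength=num_classes)
    table = np.vstack([counts_p, counts_q])   # 2 x C contingency table
    table = table[:, table.sum(axis=0) > 0]   # drop never-predicted classes
    _, p_value, _, _ = chi2_contingency(table)
    return p_value <= alpha
```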

  14. Statistical Hypothesis Testing: Binomial Test
• Compare the domain classifier's accuracy (acc) on held-out data to random chance via a binomial test:
H_0: acc = 0.5 vs. H_A: acc > 0.5
• Under H_0, acc follows a binomial distribution, acc ∼ Bin(N_hold, 0.5), where N_hold is the number of held-out samples.
• Used with Classif ×.
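A sketch with scipy.stats.binomtest (scipy >= 1.7 is an assumption here):

```python
from scipy.stats import binomtest

def domain_classifier_test(num_correct, num_holdout, alpha=0.05):
    # H_0: acc = 0.5 (domains indistinguishable); H_A: acc > 0.5,
    # i.e. the domains are separable, which indicates a shift.
    result = binomtest(num_correct, num_holdout, p=0.5, alternative="greater")
    return result.pvalue <= alpha
```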

  15. Obtaining Most Anomalous Samples
• Recall: our detection framework does not detect outliers but rather aims at capturing top-level shift dynamics.
• We cannot decide whether any given sample is in- or out-of-distribution.
• But: we can harness domain assignments from the domain classifier.
• It is easy to identify the exemplars which the domain classifier was most confident in assigning to the target domain (see the sketch below).
• Other shift detectors compare entire distributions against each other.
• Identifying samples which, if removed, would lead to a large increase in the overall p-value was not successful.
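A sketch of that ranking step, assuming a hypothetical sklearn-style binary domain classifier (class 0 = source, class 1 = target):

```python
import numpy as np

def most_anomalous(domain_clf, X_target, top_k=10):
    # Per-sample probability of belonging to the target domain.
    target_prob = domain_clf.predict_proba(X_target)[:, 1]
    # Indices of the samples most confidently assigned to the target.
    return np.argsort(-target_prob)[:top_k]
```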

  16. Determining the Malignancy of a Shift
• Distribution shifts can cause arbitrarily severe degradation in performance.
• In practice, distributions shift constantly, and often these changes are benign.
• Goal: distinguish malignant shifts from benign shifts.
• Problem: although prediction quality can be assessed easily on source data, we cannot compute the target error directly without labels.
• Heuristic methods for approximating the target performance (sketched below):
  • Domain classifier assignments: assess the black-box model's accuracy on the labeled top anomalous samples (implicit shift characterization).
  • Domain expert: get hints on the target accuracy by evaluating the classifier on held-out source data that has been explicitly perturbed by a function determined by a domain expert.
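A hedged sketch of the domain-expert heuristic: evaluate the black-box model on labeled held-out source data before and after an expert-chosen perturbation. The Gaussian-noise perturbation and the tolerance threshold are purely illustrative assumptions, not the paper's prescription:

```python
import numpy as np

def shift_seems_malignant(model, X_holdout, y_holdout, perturb, tol=0.05):
    clean_acc = (model.predict(X_holdout) == y_holdout).mean()
    shifted_acc = (model.predict(perturb(X_holdout)) == y_holdout).mean()
    # A large accuracy drop under the expert's perturbation suggests
    # that a comparable real shift would be malignant.
    return (clean_acc - shifted_acc) > tol

# Example expert perturbation (assumption): additive Gaussian noise.
perturb = lambda X: X + np.random.default_rng(0).normal(0.0, 0.1, X.shape)
```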

  17. Experiments

  18. Experimental Setup
• Core experiments: synthetic shifts on the MNIST and CIFAR-10 image datasets.
• Autoencoders: convolutional architecture with 3 convolutional layers.
• BBSD and Classif: ResNet-18 architecture.
• Network training (TAE, BBSDs ⊳, BBSDh ⊲, Classif ×): SGD with momentum in batches of 128 examples over 200 epochs with early stopping.
• Dimensionality reduction to K = 32 (PCA, SRP, UAE, and TAE), C = 10 (BBSDs ⊳), and 1 (BBSDh ⊲ and Classif ×).
• Evaluate shift detection at a significance level of α = 0.05.
• Shift detection performance is averaged over a total of 5 random splits.
• Randomly split the data into training, validation, and test sets, then apply a particular shift to the test set only.
• Evaluate the models with varying numbers of samples from the test set, s ∈ {10, 20, 50, 100, 200, 500, 1000, 10000}.
