

slide-1
SLIDE 1

Failing Loudly: Detecting Dataset Shift

Stephan Rabanser1 rabans@amazon.com

  • Prof. Dr. Stephan Günnemann2 guennemann@in.tum.de
  • Prof. Dr. Zachary C. Lipton3 zlipton@cmu.edu

1Amazon

AWS AI Labs

2Technical University of Munich

Department of Informatics Data Analytics and Machine Learning

3Carnegie Mellon University

Tepper School of Business Machine Learning Department

January 24, 2020

slide-2
SLIDE 2

Table of contents

  • Motivation & Overview
  • Methods
  • Experiments
  • Conclusion

Failing Loudly: Detecting Dataset Shift 2

slide-3
SLIDE 3

Motivation & Overview

slide-4
SLIDE 4

Motivation

  • The reliable functioning of software depends crucially on tests.
  • Despite their power, ML models are sensitive to shifts in the data distribution.
  • ML pipelines rarely inspect incoming data for signs of distribution shift.
  • Best practices for testing equivalence of the source distribution p and the target distribution q in real-life, high-dim. data settings have not yet been established.
  • Existing solutions to addressing covariate shift q(x, y) = q(x)p(y|x) or label shift q(x, y) = q(y)p(x|y) often rely on strict preconditions, producing wrong predictions if these are not met.


slide-5
SLIDE 5

Shift Detection Overview

Faced with distribution shift, our goals are three-fold:

  • detect when distribution shift occurs from as few examples as possible;
  • characterize the shift (e.g. by identifying those samples from the test set that appear over-represented in the target data); and
  • provide some guidance on whether the shift is harmful or not.

[Figure: detection pipeline — source and target samples are passed through dimensionality reduction, then two-sample test(s) yield a combined test statistic used for shift detection.]


slide-6
SLIDE 6

Methods

slide-7
SLIDE 7

Our Framework

Given labeled data (x_1, y_1), ..., (x_n, y_n) ∼ p and unlabeled data x′_1, ..., x′_m ∼ q, our task is to determine whether p(x) equals q(x′):

H0 : p(x) = q(x′) vs HA : p(x) ≠ q(x′)

We explore the following design choices:

  • what representation to run the test on;
  • which two-sample test to run;
  • when the representation is multidimensional, whether to run a single multivariate test or multiple univariate two-sample tests; and
  • how to combine their results.


slide-8
SLIDE 8

Dimensionality Reduction Techniques: NoRed & PCA

No Reduction (NoRed):

  • To justify the use of any DR technique, our default baseline is to run tests on the original raw features.

Principal Components Analysis (PCA):

  • Find an optimal orthogonal transformation matrix such that points are linearly uncorrelated after transformation.

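As a concrete illustration of the PCA baseline, here is a minimal sketch (the helper name `pca_reduce` and the SVD-based formulation are illustrative assumptions, not the authors' code) that projects data onto the top-K principal components:

```python
import numpy as np

def pca_reduce(X, K=32):
    """Project X (N x D) onto its top-K principal components via SVD."""
    Xc = X - X.mean(axis=0)  # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt are components
    return Xc @ Vt[:K].T  # N x K reduced representation
```

The components returned by the SVD are ordered by explained variance, so the first latent dimension carries the most signal.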

slide-9
SLIDE 9

Dimensionality Reduction Techniques: SRP & AE

Sparse Random Projection (SRP):

  R_ij = { +√(v/K)  with prob. 1/(2v)
           0        with prob. 1 − 1/v
           −√(v/K)  with prob. 1/(2v)

  with v = √D.

Autoencoders (TAE and UAE):

  • Encoder φ : X → L
  • Decoder ψ : L → X

  φ, ψ = argmin_{φ,ψ} ‖X − (ψ ◦ φ)X‖²

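The SRP sampling scheme above can be sketched as follows (a minimal illustration; the function name and numpy-based sampling are assumptions, not the authors' implementation):

```python
import numpy as np

def sparse_random_projection(D, K, seed=0):
    """Sample a D x K SRP matrix: entries are +sqrt(v/K) w.p. 1/(2v),
    0 w.p. 1 - 1/v, and -sqrt(v/K) w.p. 1/(2v), with v = sqrt(D)."""
    rng = np.random.default_rng(seed)
    v = np.sqrt(D)
    values = np.array([np.sqrt(v / K), 0.0, -np.sqrt(v / K)])
    probs = np.array([1 / (2 * v), 1 - 1 / v, 1 / (2 * v)])
    idx = rng.choice(3, size=(D, K), p=probs)  # pick an entry type per cell
    return values[idx]
```

Most entries are exactly zero (with probability 1 − 1/√D), which is what makes the projection cheap for high-dimensional inputs.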

slide-10
SLIDE 10

Dimensionality Reduction Techniques: BBSD & Classif

Label Classifiers (BBSDs ⊳ and BBSDh ⊲):

  • Label classifier over C-many classes, using either softmax outputs (BBSDs ⊳) or hard-thresholded predictions (BBSDh ⊲).

Domain Classifier (Classif ×):

  • Explicitly train a domain classifier to discriminate between data from source and target domains.


slide-11
SLIDE 11

Statistical Hypothesis Testing: Maximum Mean Discrepancy (MMD)

  • Popular kernel-based technique for multivariate two-sample testing.
  • Distinguish two distributions based on their mean embeddings µp and µq in a reproducing kernel Hilbert space F:

  MMD(F, p, q) = ‖µp − µq‖²_F

  • Empirical estimate:

  MMD² = 1/(m(m−1)) Σ_{i=1}^{m} Σ_{j≠i} κ(x_i, x_j)
       + 1/(n(n−1)) Σ_{i=1}^{n} Σ_{j≠i} κ(x′_i, x′_j)
       − 2/(mn) Σ_{i=1}^{m} Σ_{j=1}^{n} κ(x_i, x′_j)

  • Kernel: κ(x_1, x_2) = e^(−‖x_1 − x_2‖²/σ)
  • Used with NoRed, PCA, SRP, TAE, UAE, and BBSDs ⊳.

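The empirical MMD² estimate can be written out directly (an illustrative brute-force sketch; `mmd2` and the helper names are assumptions, not the paper's implementation):

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    """Kernel kappa(a, b) = exp(-||a - b||^2 / sigma) for all row pairs."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma)

def mmd2(X, Xp, sigma=1.0):
    """Unbiased empirical MMD^2 between samples X ~ p and Xp ~ q."""
    m, n = len(X), len(Xp)
    Kxx, Kyy, Kxy = rbf(X, X, sigma), rbf(Xp, Xp, sigma), rbf(X, Xp, sigma)
    t1 = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))  # within-source, i != j
    t2 = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))  # within-target, i != j
    t3 = 2.0 * Kxy.sum() / (m * n)                    # cross term
    return t1 + t2 - t3
```

In practice the statistic is compared against a null distribution obtained by permuting the pooled samples; shifted data yields a clearly larger value than matched data.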

slide-12
SLIDE 12

Statistical Hypothesis Testing: Kolmogorov-Smirnov + Bonferroni

  • Test each of the K dimensions separately (instead of jointly) using the Kolmogorov-Smirnov (KS) test.
  • Largest difference S of the cumulative distribution functions over all values z:

  S = sup_z |Fp(z) − Fq(z)|

  • Multiple hypothesis testing: we must subsequently combine the p-values from the K-many tests.
  • Problem: We cannot make strong assumptions about the (in)dependence among the tests.
  • Solution: Bonferroni correction:
  • Does not assume (in)dependence.
  • Bounds the family-wise error rate, i.e. it is a conservative aggregation.
  • Rejects H0 if pmin ≤ α/K.
  • Used with NoRed, PCA, SRP, TAE, UAE, and BBSDs ⊳.

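A minimal sketch of the aggregated univariate test, assuming `scipy` is available (the helper name is illustrative, not from the slides):

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_bonferroni_test(X_src, X_tgt, alpha=0.05):
    """Per-dimension two-sample KS tests aggregated with Bonferroni:
    reject H0 (no shift) iff min_k p_k <= alpha / K."""
    K = X_src.shape[1]
    p_vals = np.array([ks_2samp(X_src[:, k], X_tgt[:, k]).pvalue
                       for k in range(K)])
    return p_vals.min() <= alpha / K, p_vals
```

A shift in even a single latent dimension is enough to trigger a rejection, since only the minimum p-value is compared against the corrected threshold.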

slide-13
SLIDE 13

Statistical Hypothesis Testing: Chi-Squared Test

  • Evaluate whether the frequency distribution of certain events observed in a sample is consistent with a particular theoretical distribution.
  • The difference can be calculated as

  X² = Σ_{i=1}^{2} Σ_{j=1}^{C} (O_ij − E_ij)² / E_ij

  with observed counts O_ij and expected counts E_ij = N_sum p_i• p_•j, where

  p_i• = n_i• / N_sum = (Σ_{j=1}^{C} n_ij) / N_sum and p_•j = n_•j / N_sum = (Σ_{i=1}^{2} n_ij) / N_sum.

  • Under H0, X² ∼ χ²_{C−1}.

  Sample | Cat 1 | ··· | Cat C | Total
  p      | n_p1  | ··· | n_pC  | n_p•
  q      | n_q1  | ··· | n_qC  | n_q•
  Total  | n_•1  | ··· | n_•C  | N_sum

  • Used with BBSDh ⊲.

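The test above can be sketched on a 2 × C table of predicted-class counts, assuming `scipy` is available (the helper name is illustrative):

```python
import numpy as np
from scipy.stats import chi2

def chi2_shift_test(counts_src, counts_tgt, alpha=0.05):
    """Pearson chi-squared test on a 2 x C table of predicted-class counts."""
    O = np.array([counts_src, counts_tgt], dtype=float)  # observed counts O_ij
    N = O.sum()
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / N       # E_ij = N_sum p_i. p_.j
    stat = ((O - E) ** 2 / E).sum()
    p_value = chi2.sf(stat, df=O.shape[1] - 1)           # X^2 ~ chi^2_{C-1} under H0
    return p_value < alpha, p_value
```

Identical class-frequency profiles give a statistic of zero (p-value 1), while a reshuffled class distribution is flagged immediately.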

slide-14
SLIDE 14

Statistical Hypothesis Testing: Binomial Test

  • Compare the domain classifier's accuracy (acc) on held-out data to random chance via a binomial test:

  H0 : acc = 0.5 vs HA : acc > 0.5

  • Under H0, the accuracy follows a binomial distribution acc ∼ Bin(Nhold, 0.5), where Nhold corresponds to the number of held-out samples.
  • Used with Classif ×.

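The one-sided binomial test can be sketched in a few lines, assuming `scipy` is available (the helper name is illustrative):

```python
from scipy.stats import binom

def domain_classifier_test(n_correct, n_holdout, alpha=0.05):
    """One-sided binomial test: under H0, acc ~ Bin(N_hold, 0.5).
    Reject if observing >= n_correct successes is unlikely by chance."""
    p_value = binom.sf(n_correct - 1, n_holdout, 0.5)  # P(X >= n_correct)
    return p_value < alpha, p_value
```

Intuitively, a domain classifier that separates source from target noticeably better than a coin flip is evidence that the two distributions differ.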

slide-15
SLIDE 15

Obtaining Most Anomalous Samples

  • Recall: our detection framework does not detect outliers but rather aims at capturing top-level shift dynamics.
  • We cannot decide whether any given sample is in- or out-of-distribution.
  • But: we can harness domain assignments from the domain classifier.
  • It is easy to identify the exemplars which the domain classifier was most confident in assigning to the target domain.
  • Other shift detectors compare entire distributions against each other.
  • Identification of samples which, if removed, would lead to a large increase in the overall p-value was not successful.

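The confidence-based ranking described above can be sketched as follows (an illustrative helper, assuming the domain classifier exposes per-sample target-domain probabilities):

```python
import numpy as np

def top_anomalous(target_probs, X_target, top=10):
    """Return the target-set samples the domain classifier most confidently
    assigns to the target domain (highest predicted target probability)."""
    order = np.argsort(-np.asarray(target_probs))  # descending confidence
    return np.asarray(X_target)[order[:top]]
```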

slide-16
SLIDE 16

Determining the Malignancy of a Shift

  • Distribution shifts can cause arbitrarily severe degradation in performance.
  • In practice, distributions shift constantly and often these changes are benign.
  • Goal: distinguish malignant shifts from benign shifts.
  • Problem: although prediction quality can be assessed easily on source data, we are not able to compute the target error directly without labels.
  • Heuristic methods for approximating the target performance:
  • Domain classifier assignments: assess the black-box model's accuracy on the labeled top anomalous samples (implicit shift characterization).
  • Domain expert: get hints on the target accuracy by evaluating the classifier on held-out source data that has been explicitly perturbed by a function determined by a domain expert.


slide-17
SLIDE 17

Experiments

slide-18
SLIDE 18

Experimental Setup

  • Core experiments: synthetic shifts on MNIST and CIFAR-10 image datasets.
  • Autoencoders: convolutional architecture with 3 convolutional layers.
  • BBSD and Classif: ResNet-18 architecture.
  • Network training (TAE, BBSDs ⊳, BBSDh ⊲, Classif ×): SGD with momentum in batches of 128 examples over 200 epochs with early stopping.
  • Dimensionality reduction to K = 32 (PCA, SRP, UAE, and TAE), C = 10 (BBSDs ⊳), and 1 (BBSDh ⊲ and Classif ×).
  • Evaluate shift detection at a significance level of α = 0.05.
  • Shift detection performance is averaged over a total of 5 random splits.
  • Randomly split the data into training, validation, and test sets and then apply a particular shift to the test set only.
  • Evaluate the models with various numbers of samples from the test set s ∈ {10, 20, 50, 100, 200, 500, 1000, 10000}.


slide-19
SLIDE 19

Shift Simulation

For each shift type (as appropriate) we explored three levels of shift intensity and various percentages of affected data δ ∈ {0.1, 0.5, 1.0}.

  • Adversarial (adv): We turn a fraction δ of samples into adversarial samples via FGSM;
  • Knock-out (ko): We remove a fraction δ of samples from class 0, creating class imbalance;
  • Gaussian noise (gn): We corrupt covariates of a fraction δ of test set samples by Gaussian noise with standard deviation σ ∈ {1, 10, 100} (denoted s gn, m gn, and l gn);
  • Image (img): We also explore more natural shifts to images, modifying a fraction δ of images with combinations of random rotations {10, 40, 90}, (x, y)-axis-translation percentages {0.05, 0.2, 0.4}, as well as zoom-in percentages {0.1, 0.2, 0.4} (denoted s img, m img, and l img);
  • Image + knock-out (m img+ko): We apply a fixed medium image shift with δ1 = 0.5 and a variable knock-out shift δ;

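As an example, the Gaussian noise (gn) shift applied to a fraction δ of samples can be simulated with a short helper (illustrative only; not the authors' code):

```python
import numpy as np

def gaussian_noise_shift(X, delta=0.5, sigma=10.0, seed=0):
    """Corrupt a random fraction delta of samples with N(0, sigma^2) noise."""
    rng = np.random.default_rng(seed)
    X_shifted = X.copy()
    idx = rng.choice(len(X), size=int(delta * len(X)), replace=False)
    X_shifted[idx] += rng.normal(0.0, sigma, size=X_shifted[idx].shape)
    return X_shifted
```

Varying `sigma` over {1, 10, 100} yields the small, medium, and large intensities used in the experiments.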

slide-20
SLIDE 20

Shift Simulation (contd.)

  • Only-zero + image (oz+m img): Here, we only include images from class 0 in combination with a variable medium image shift affecting only a fraction δ of the data;
  • Original splits: We evaluate our detectors on the original source/target splits provided by the creators of the MNIST, CIFAR-10, Fashion MNIST, and SVHN datasets (assumed to be i.i.d.);
  • Real shift datasets:
  • Domain adaptation from MNIST (source) to USPS (target).
  • COIL-100 dataset, where images between 0° and 175° are sampled by the source and images between 180° and 355° are sampled by the target distribution.


slide-21
SLIDE 21

Dimensionality Reduction Methods Comparison

Table: Detection accuracy of different dimensionality reduction techniques across all simulated shifts on MNIST and CIFAR-10. Green bold entries indicate the best DR method at a given sample size, red italic the worst. Underlined entries indicate accuracy values > 0.5.

Number of samples from test:

Univariate tests:

  DR       10    20    50    100   200   500   1,000 10,000
  NoRed    0.03  0.15  0.26  0.36  0.41  0.47  0.54  0.72
  PCA      0.11  0.15  0.30  0.36  0.41  0.46  0.54  0.63
  SRP      0.15  0.15  0.23  0.27  0.34  0.42  0.55  0.68
  UAE      0.12  0.16  0.27  0.33  0.41  0.49  0.56  0.77
  TAE      0.18  0.23  0.31  0.38  0.43  0.47  0.55  0.69
  BBSDs    0.19  0.28  0.47  0.47  0.51  0.65  0.70  0.79

χ² test:

  BBSDh    0.03  0.07  0.12  0.22  0.22  0.40  0.46  0.57

Binomial test:

  Classif  0.01  0.03  0.11  0.21  0.28  0.42  0.51  0.67

Multivariate tests:

  NoRed    0.14  0.15  0.22  0.28  0.32  0.44  0.55  –
  PCA      0.15  0.18  0.33  0.38  0.40  0.46  0.55  –
  SRP      0.12  0.18  0.23  0.31  0.31  0.44  0.54  –
  UAE      0.20  0.27  0.40  0.43  0.45  0.53  0.61  –
  TAE      0.18  0.26  0.37  0.38  0.45  0.52  0.59  –
  BBSDs    0.16  0.20  0.25  0.35  0.35  0.47  0.50  –


slide-22
SLIDE 22

Shift Type Comparison

Table: Detection accuracy of different shifts on MNIST and CIFAR-10 using the best-performing DR technique (BBSDs). Green bold shifts are identified as harmless, red italic shifts as harmful.

Univariate BBSDs, number of samples from test:

  Shift      10    20    50    100   200   500   1,000 10,000
  s gn       0.00  0.00  0.03  0.03  0.07  0.10  0.10  0.10
  m gn       0.00  0.00  0.10  0.13  0.13  0.13  0.23  0.37
  l gn       0.17  0.27  0.53  0.63  0.67  0.83  0.87  1.00
  s img      0.00  0.00  0.23  0.30  0.40  0.63  0.70  0.93
  m img      0.30  0.37  0.60  0.67  0.70  0.80  0.90  1.00
  l img      0.30  0.50  0.70  0.70  0.77  0.87  0.97  1.00
  adv        0.13  0.27  0.40  0.43  0.53  0.77  0.83  0.90
  ko         0.00  0.00  0.07  0.07  0.07  0.33  0.40  0.70
  m img+ko   0.13  0.40  0.87  0.93  0.90  1.00  1.00  1.00
  oz+m img   0.67  1.00  1.00  1.00  1.00  1.00  1.00  1.00


slide-23
SLIDE 23

Individual Result: Medium Image Shift on MNIST

[Figure: medium image shift on MNIST — p-values vs. number of test samples for (a) 10%, (b) 50%, and (c) 100% of samples perturbed (detectors: NoRed, PCA, SRP, UAE, TAE, BBSDs, BBSDh, Classif), (d) top different samples; accuracy vs. number of test samples for (e) 10%, (f) 50%, and (g) 100% perturbation, (h) top similar samples.]


slide-24
SLIDE 24

Individual Result: Angle-Partitioning on COIL-100

[Figure: COIL-100 angle partitioning — p-values vs. number of test samples for (a) a random split and (b) the angle-partitioned split (detectors: NoRed, PCA, SRP, UAE, TAE, BBSDs, BBSDh, Classif), (c) top different samples; accuracy for (d) the random split and (e) the partitioned split, (f) top similar samples.]


slide-25
SLIDE 25

Individual Result: Original Split on MNIST

[Figure: original MNIST split — p-values vs. number of test samples for (a) a random split and (b) the original split (detectors: NoRed, PCA, SRP, UAE, TAE, BBSDs, BBSDh, Classif), (c) top different samples; accuracy for (d) the random split and (e) the original split, (f) top similar samples.]


slide-26
SLIDE 26

Conclusion

slide-27
SLIDE 27

Summary & Next Steps

Summary

  • Black-box shift detection with soft predictions works well across many scenarios.
  • Aggregated univariate tests performed separately on each latent dimension provide similar performance to multivariate two-sample tests, despite heavy correction.
  • Harnessing predictions made by a domain-discriminating classifier enables characterization of the shift's nature and malignancy.

Potential future extensions

  • Shift detection for online data by accounting for and exploiting the high degree of correlation between adjacent time steps.
  • Apply our framework to other machine learning domains such as NLP or graphs.


slide-28
SLIDE 28

Conference Submissions

DebugML @ ICLR 2019 (presented)

https://debug-ml-iclr2019.github.io/cameraready/DebugML-19_paper_20.pdf

NeurIPS 2019 (to be presented)

https://arxiv.org/abs/1810.11953


slide-29
SLIDE 29

Questions? Thanks! :)

slide-30
SLIDE 30

Backup

slide-31
SLIDE 31

Multiple Hypothesis Testing Correction

Family-Wise Error Rate (FWER)

The most stringent control is given by procedures controlling the FWER, which limits the probability of making at least one false discovery, formally

  FWER = P(V ≥ 1) < α

where V is the total number of false discoveries.

False Discovery Rate (FDR)

A less stringent but more powerful alternative to the FWER is the FDR, which limits the expected proportion of false discoveries, formally

  FDR = E[V/M] < α

where M is the total number of discoveries.

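The two corrections can be sketched side by side (illustrative helpers; Benjamini-Hochberg is the standard step-up procedure for FDR control, not a method from the slides):

```python
import numpy as np

def bonferroni_reject(p_vals, alpha=0.05):
    """FWER control: reject every test with p <= alpha / K."""
    p = np.asarray(p_vals)
    return p <= alpha / len(p)

def benjamini_hochberg_reject(p_vals, alpha=0.05):
    """FDR control (step-up): find the largest k with p_(k) <= k/K * alpha
    and reject the k smallest p-values."""
    p = np.asarray(p_vals)
    order = np.argsort(p)
    thresh = alpha * np.arange(1, len(p) + 1) / len(p)
    passed = np.nonzero(p[order] <= thresh)[0]
    k = passed.max() + 1 if passed.size else 0
    reject = np.zeros(len(p), dtype=bool)
    reject[order[:k]] = True
    return reject
```

On the same p-values, Bonferroni typically rejects fewer hypotheses than Benjamini-Hochberg, reflecting its stricter FWER guarantee.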

slide-32
SLIDE 32

Shifts

Covariate Shift

[p(x) ≠ q(x) ∧ p(y|x) = q(y|x)] ⇒ p(y|x)p(x) ≠ q(y|x)q(x) ⇒ p(x, y) ≠ q(x, y)

Label Shift

[p(y) ≠ q(y) ∧ p(x|y) = q(x|y)] ⇒ p(x|y)p(y) ≠ q(x|y)q(y) ⇒ p(x, y) ≠ q(x, y)

Concept Drift

[p(y|x) ≠ q(y|x) ∧ p(x) = q(x)] ⇒ p(y|x)p(x) ≠ q(y|x)q(x) ⇒ p(x, y) ≠ q(x, y)
[p(x|y) ≠ q(x|y) ∧ p(y) = q(y)] ⇒ p(x|y)p(y) ≠ q(x|y)q(y) ⇒ p(x, y) ≠ q(x, y)


slide-33
SLIDE 33

Covariate Shift

[Figure: (a) covariate shift causal graphical model (X → Y); (b) covariate shift example.]


slide-34
SLIDE 34

Label Shift

[Figure: (a) label shift causal graphical model (Y → X); (b) regression example; (c) classification example.]


slide-35
SLIDE 35

Shift Intensity Comparison

Table: Detection accuracy for small, medium, and large simulated shifts and low (10%), medium (50%), and high (100%) percentages of perturbed target samples on MNIST and CIFAR-10. Reported accuracy values are results of the best DR technique (univariate: BBSDs, multivariate: average of UAE and TAE). Underlined entries indicate accuracy values > 0.5.

Number of samples from test:

Univariate (BBSDs):

  Intensity  10    20    50    100   200   500   1,000 10,000
  Small      0.00  0.00  0.14  0.14  0.18  0.36  0.40  0.54
  Medium     0.14  0.21  0.39  0.38  0.42  0.57  0.66  0.76
  Large      0.32  0.54  0.78  0.82  0.83  0.92  0.96  1.00
  10%        0.11  0.15  0.24  0.25  0.28  0.44  0.54  0.66
  50%        0.14  0.28  0.52  0.53  0.60  0.68  0.72  0.85
  100%       0.26  0.41  0.61  0.64  0.70  0.82  0.84  0.86

Multivariate (avg. of UAE and TAE):

  Small      0.11  0.11  0.12  0.14  0.20  0.23  0.33  –
  Medium     0.11  0.19  0.23  0.27  0.32  0.42  0.44  –
  Large      0.34  0.45  0.57  0.68  0.72  0.82  0.93  –
  10%        0.12  0.13  0.21  0.26  0.27  0.31  0.44  –
  50%        0.19  0.27  0.41  0.41  0.47  0.57  0.60  –
  100%       0.29  0.41  0.44  0.53  0.60  0.70  0.78  –


slide-36
SLIDE 36

MNIST Difference Plot

  • The original splits from the MNIST dataset appear to exhibit a dataset shift.
  • We observed that the top anomalous samples depicted the digit 6.
  • This particular shift does not look significant to the human eye and is also declared harmless by our malignancy detector.

[Figure: difference plot for digit 6 (p-value 2.701e-10), alongside the test set average and training set average images for 6.]
