Learning to Pivot with Adversarial Networks
arXiv:1611.01046
Gilles Louppe, Michael Kagan, Kyle Cranmer

Testing for new physics

$p(\text{data} \mid \text{background} + \text{signal})$ vs. $p(\text{data} \mid \text{background})$

Credits: Jorge Cham

Supervised learning

Classifying background vs. signal: $p(\text{data} \mid \text{background} + \text{signal})$, $p(\text{data} \mid \text{background})$.

Boosted decision trees, convolutional nets, recursive nets.

Credits: 1209.1489, 1612.01551, 1702.00748

Independence from physics variates

Analyses often rely on the assumption that the classifier is independent of some physics variates (e.g., mass). Correlation with these variates leads to systematic uncertainties that cannot easily be characterized and controlled.

Credits: 1703.03507, ATL-PHYS-PUB-2017-004

Independence from known unknowns

• The data generation process is often not uniquely specified or known exactly, hence the presence of systematic uncertainties.
• Data generation processes are formulated as a family of data generation processes parametrized by nuisance parameters.
• Ideally, we would like a classifier that is robust to nuisance parameters.

Credits: Kyle Cranmer

Problem statement

• Let us assume a family of data generation processes $p(X, Y, Z)$, where $X$ are the data (taking values $x \in \mathcal{X}$), $Y$ are the target labels (taking values $y \in \mathcal{Y}$), and $Z$ is an auxiliary random variable (taking values $z \in \mathcal{Z}$).
• $Z$ corresponds to physics variates or nuisance parameters.
• We want to learn a regression function $f(\cdot; \theta_f) : \mathcal{X} \mapsto \mathcal{Y}$.
• We want inference based on $f(X; \theta_f)$ to be robust to the value $z \in \mathcal{Z}$. E.g., we want a classifier that does not change with systematic variations, even though the data might.

Pivot

• We define robustness as requiring the distribution of $f(X; \theta_f)$ conditional on $Z$ to be invariant with $Z$. That is, such that $p(f(X; \theta_f) = s \mid z) = p(f(X; \theta_f) = s \mid z')$ for all $z, z' \in \mathcal{Z}$ and all values $s \in \mathcal{S}$ of $f(X; \theta_f)$. If $f$ satisfies this criterion, then $f$ is known as a pivotal quantity.
• Equivalently, this amounts to finding $f$ such that $f(X; \theta_f)$ and $Z$ are independent random variables.
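The pivot criterion can be probed empirically by comparing the distribution of the classifier output across slices of the nuisance parameter. The sketch below is only an illustrative diagnostic, not something taken from the slides: the function name `max_ks_across_z_bins`, the choice of quantile bins, and the use of a Kolmogorov-Smirnov distance are all assumptions, and it presumes a continuous $z$ so that every quantile bin is non-empty.

```python
import numpy as np

def max_ks_across_z_bins(s, z, n_bins=4):
    """Largest two-sample KS distance between the empirical distributions
    of s in any two quantile bins of z (0 means identical empirical CDFs)."""
    s, z = np.asarray(s, float), np.asarray(z, float)
    edges = np.quantile(z, np.linspace(0.0, 1.0, n_bins + 1))
    # Assign each sample to a z bin using the interior edges only.
    bin_ids = np.digitize(z, edges[1:-1])
    grid = np.sort(s)  # evaluate every empirical CDF on the same grid
    cdfs = []
    for b in range(n_bins):
        s_b = np.sort(s[bin_ids == b])
        cdfs.append(np.searchsorted(s_b, grid, side="right") / len(s_b))
    # Compare every pair of bins and keep the worst-case distance.
    return float(max(np.max(np.abs(ci - cj))
                     for i, ci in enumerate(cdfs) for cj in cdfs[i + 1:]))

rng = np.random.default_rng(0)
z = rng.normal(size=20000)
s_independent = rng.uniform(size=20000)        # output unrelated to z
s_dependent = 1.0 / (1.0 + np.exp(-2.0 * z))   # output is a function of z
print(max_ks_across_z_bins(s_independent, z))
print(max_ks_across_z_bins(s_dependent, z))
```

A value near zero is consistent with (but does not prove) pivotality; a value near one means the output distribution shifts strongly with $z$.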

Adversarial Networks

[Figure: the classifier $f$ maps the data $X$ to $f(X; \theta_f)$, an estimate of $p(\text{signal} \mid \text{data})$, and is trained with loss $L_f(\theta_f)$. The adversary $r$ takes $f(X; \theta_f)$ as input and outputs the parameters $\gamma_1(f(X; \theta_f); \theta_r), \gamma_2(f(X; \theta_f); \theta_r), \ldots$ of the posterior $p_{\theta_r}(Z \mid f(X; \theta_f))$; it is trained with loss $L_r(\theta_f, \theta_r)$.]

• Consider a classifier $f$ built as usual, minimizing the cross-entropy $L_f(\theta_f) = \mathbb{E}_{x \sim X} \mathbb{E}_{y \sim Y|x} [-\log p_{\theta_f}(y \mid x)]$.
• We pit $f$ against an adversary network $r$ that regresses $Z$ from $f$'s output, producing as output the posterior $p_{\theta_r}(z \mid f(X; \theta_f) = s)$. We set $r$ to minimize the cross-entropy $L_r(\theta_f, \theta_r) = \mathbb{E}_{s \sim f(X; \theta_f)} \mathbb{E}_{z \sim Z|s} [-\log p_{\theta_r}(z \mid s)]$.
• The goal is to solve $\hat{\theta}_f, \hat{\theta}_r = \arg \min_{\theta_f} \max_{\theta_r} L_f(\theta_f) - L_r(\theta_f, \theta_r)$.
• Intuitively, $r$ penalizes $f$ for outputs that can be used to infer $Z$.
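As a rough illustration of the two losses and the value function, here is a minimal NumPy sketch for the special case of a binary label and a binary nuisance parameter. The function names and the binary-$Z$ restriction are assumptions made for the example; the expectations are replaced by sample means.

```python
import numpy as np

def classifier_loss(s, y, eps=1e-12):
    """L_f: mean binary cross-entropy of classifier outputs s = f(x) against labels y."""
    s = np.clip(s, eps, 1.0 - eps)
    return float(np.mean(-(y * np.log(s) + (1 - y) * np.log(1 - s))))

def adversary_loss(q, z, eps=1e-12):
    """L_r for a binary nuisance z: mean of -log p_r(z | s), with q = p_r(z = 1 | s)."""
    q = np.clip(q, eps, 1.0 - eps)
    return float(np.mean(-(z * np.log(q) + (1 - z) * np.log(1 - q))))

def minimax_value(s, y, q, z):
    """E = L_f - L_r, minimized over theta_f and maximized over theta_r."""
    return classifier_loss(s, y) - adversary_loss(q, z)

y = np.array([0.0, 1.0, 0.0, 1.0])
z = np.array([0.0, 0.0, 1.0, 1.0])
s = np.array([0.1, 0.9, 0.1, 0.9])   # confident, correct classifier outputs
q = np.full(4, 0.5)                  # adversary that cannot do better than the prior
print(classifier_loss(s, y), adversary_loss(q, z), minimax_value(s, y, q, z))
```

When the adversary can only output the prior $p(z) = 0.5$, its loss equals $H(Z) = \log 2$, which is the best case for the classifier in this game.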

Theoretical motivation

Proposition. If there exists a minimax solution $(\hat{\theta}_f, \hat{\theta}_r)$ such that $L_f(\hat{\theta}_f) - L_r(\hat{\theta}_f, \hat{\theta}_r) = H(Y \mid X) - H(Z)$, then $f(\cdot; \hat{\theta}_f)$ is both an optimal classifier and a pivotal quantity.

Proof (sketch):

$$
\begin{aligned}
\min_{\theta_f} \max_{\theta_r} \; L_f(\theta_f) - L_r(\theta_f, \theta_r)
  &= \min_{\theta_f} \; L_f(\theta_f) - \mathbb{E}_{s \sim f(X; \theta_f)} [H(Z \mid f(X; \theta_f) = s)] \\
  &= \min_{\theta_f} \; L_f(\theta_f) - H(Z \mid f(X; \theta_f)) \\
  &\geq H(Y \mid X) - H(Z)
\end{aligned}
$$

where the equality holds when
• $f$ is an optimal classifier (in which case $L_f(\theta_f) = H(Y \mid X)$);
• $f(X; \theta_f)$ and $Z$ are independent random variables (in which case $H(Z \mid f(X; \theta_f)) = H(Z)$).
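The first step of the proof sketch, which replaces the inner maximization by a conditional entropy, can be spelled out with the usual decomposition of cross-entropy into entropy plus a Kullback-Leibler term. In the sketch below, $p(z \mid s)$ denotes the true posterior of $Z$ given the classifier output (notation introduced here), and the adversary is assumed to be flexible enough to represent it.

```latex
\begin{aligned}
L_r(\theta_f, \theta_r)
  &= \mathbb{E}_{s \sim f(X;\theta_f)}\, \mathbb{E}_{z \sim Z|s}
     \left[ -\log p_{\theta_r}(z \mid s) \right] \\
  &= \mathbb{E}_{s \sim f(X;\theta_f)}
     \left[ H(Z \mid f(X;\theta_f) = s)
          + \mathrm{KL}\!\left( p(z \mid s) \,\|\, p_{\theta_r}(z \mid s) \right) \right] \\
  &\geq \mathbb{E}_{s \sim f(X;\theta_f)}
     \left[ H(Z \mid f(X;\theta_f) = s) \right]
   = H(Z \mid f(X;\theta_f)).
\end{aligned}
```

The bound is attained when $p_{\theta_r}(z \mid s) = p(z \mid s)$, so under that capacity assumption $\max_{\theta_r} -L_r(\theta_f, \theta_r) = -H(Z \mid f(X; \theta_f))$.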

Alternating stochastic gradient descent

for t = 1 to T do
    for k = 1 to K do    (update r)
        Sample a minibatch $\{x_m, z_m, s_m = f(x_m; \theta_f)\}_{m=1}^{M}$ of size $M$;
        With $\theta_f$ fixed, update $r$ by ascending its stochastic gradient
            $\nabla_{\theta_r} E(\theta_f, \theta_r) := \nabla_{\theta_r} \sum_{m=1}^{M} \log p_{\theta_r}(z_m \mid s_m)$;
    end for
    Sample a minibatch $\{x_m, y_m, z_m, s_m = f(x_m; \theta_f)\}_{m=1}^{M}$ of size $M$;    (update f)
    With $\theta_r$ fixed, update $f$ by descending its stochastic gradient
        $\nabla_{\theta_f} E(\theta_f, \theta_r) := \nabla_{\theta_f} \sum_{m=1}^{M} \left[ -\log p_{\theta_f}(y_m \mid x_m) + \log p_{\theta_r}(z_m \mid s_m) \right]$,
    where $p_{\theta_f}(y_m \mid x_m)$ denotes $\mathbb{1}(y_m = 0)(1 - s_m) + \mathbb{1}(y_m = 1) s_m$;
end for
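The alternating updates can be sketched in plain NumPy with hand-written gradients. This is a deliberately simplified stand-in, not the authors' implementation: the classifier is a logistic regression, the adversary is a unit-variance Gaussian with mean $a s + c$ (rather than a mixture density network), the data are a two-Gaussian problem with a nuisance shift on the second coordinate of the signal class, and a weight `lam` on the adversary term is exposed (`lam=1` gives the objective above). All function and variable names are hypothetical.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sample_toy(n, rng):
    """Y=0 ~ N((0,0), [[1,-.5],[-.5,1]]); Y=1 ~ N((1, 1+z), I); z ~ N(0,1)."""
    y = rng.integers(0, 2, size=n)
    z = rng.normal(size=n)
    x0 = rng.multivariate_normal([0.0, 0.0], [[1.0, -0.5], [-0.5, 1.0]], size=n)
    x1 = rng.normal(size=(n, 2)) + np.stack([np.ones(n), 1.0 + z], axis=1)
    x = np.where(y[:, None] == 1, x1, x0)
    return x, y.astype(float), z

def train(lam=1.0, T=300, K=5, M=128, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    x, y, z = sample_toy(5000, rng)
    w, b = np.zeros(2), 0.0      # theta_f: logistic-regression classifier
    a, c = 0.0, 0.0              # theta_r: adversary mean mu(s) = a * s + c
    for t in range(T):
        for k in range(K):       # update r with theta_f fixed
            i = rng.integers(0, len(x), size=M)
            s = sigmoid(x[i] @ w + b)
            err = (a * s + c) - z[i]       # d(-log p_r)/d(mu) for unit variance
            a -= lr * np.mean(err * s)     # descending L_r = ascending log p_r
            c -= lr * np.mean(err)
        i = rng.integers(0, len(x), size=M)    # update f with theta_r fixed
        s = sigmoid(x[i] @ w + b)
        err = (a * s + c) - z[i]
        # dE/d(logit) with E = L_f - lam * L_r and L_r = 0.5 * (z - mu)^2
        g = (s - y[i]) - lam * err * a * s * (1.0 - s)
        w -= lr * (x[i].T @ g) / M
        b -= lr * np.mean(g)
    return w, b, a, c, (x, y, z)

w, b, a, c, (x, y, z) = train(lam=1.0)
accuracy = np.mean((sigmoid(x @ w + b) > 0.5) == (y == 1))
print("weights:", w, "adversary slope:", a, "accuracy:", accuracy)
```

With such a small linear model the game may oscillate rather than converge cleanly, so the printed numbers should be read as a smoke test of the update rules rather than as a reproduction of any result.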

Accuracy versus robustness trade-off

• The assumption of existence of a classifier that is both optimal and pivotal may not hold.
• However, the minimax objective can be rewritten as $E_\lambda(\theta_f, \theta_r) = L_f(\theta_f) - \lambda L_r(\theta_f, \theta_r)$, where $\lambda$ is a hyper-parameter controlling the trade-off between the performance of $f$ and its independence with respect to the nuisance parameter. Setting $\lambda$ to a large value forces $f$ to be pivotal. Setting $\lambda$ close to 0 constrains $f$ to be optimal.
• Tuning $\lambda$ is guided by a higher-level objective (e.g., statistical significance).

Architecture for the adversary

• If $Z$ is categorical, then the posterior can be modeled with a standard classifier (e.g., a NN with a softmax output layer).
• If $Z$ is continuous, then the posterior can be modeled with a mixture density network.
• No constraint on the prior $p(Z)$.

[Figure: mixture density network]
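For the continuous case, the adversary's loss is the negative log-likelihood of $z$ under the mixture whose parameters the network outputs. The sketch below assumes a one-dimensional Gaussian mixture parametrized by unnormalized weights, means, and log standard deviations; this parametrization and the function name `mdn_nll` are choices made for the example, not details given in the slides.

```python
import numpy as np

def mdn_nll(z, logits, mu, log_sigma):
    """Mean of -log p(z | s) under a 1D Gaussian mixture.

    z: (n,) targets; logits, mu, log_sigma: (n, k) mixture parameters,
    i.e. the quantities an adversary network would output for each input s.
    """
    z = np.asarray(z, float)[:, None]
    # Log of the softmax mixture weights, computed stably.
    log_pi = logits - np.max(logits, axis=1, keepdims=True)
    log_pi = log_pi - np.log(np.sum(np.exp(log_pi), axis=1, keepdims=True))
    # Log density of each Gaussian component at z.
    log_n = (-0.5 * np.log(2.0 * np.pi) - log_sigma
             - 0.5 * ((z - mu) / np.exp(log_sigma)) ** 2)
    # Log-sum-exp over components gives log p(z | s).
    joint = log_pi + log_n
    m = np.max(joint, axis=1, keepdims=True)
    log_p = m[:, 0] + np.log(np.sum(np.exp(joint - m), axis=1))
    return float(-np.mean(log_p))

# Two samples, three components each (arbitrary illustrative parameters).
z = np.array([0.2, -1.0])
logits = np.zeros((2, 3))
mu = np.array([[-1.0, 0.0, 1.0], [-1.0, 0.0, 1.0]])
log_sigma = np.zeros((2, 3))
print(mdn_nll(z, logits, mu, log_sigma))
```

Working in log space with a log-sum-exp keeps the loss finite even when a sample lies far from every component.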

Toy example

• Binary classification of 2D data drawn from multivariate Gaussians with equal priors, such that

$$x \sim \mathcal{N}\left( (0, 0), \begin{bmatrix} 1 & -0.5 \\ -0.5 & 1 \end{bmatrix} \right) \quad \text{when } Y = 0,$$

$$x \sim \mathcal{N}\left( (1, 1 + Z), \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \right) \quad \text{when } Y = 1.$$

• The continuous nuisance parameter $Z$ represents in this case our uncertainty about the exact location of the mean of the second Gaussian. We assume a Gaussian prior $z \sim \mathcal{N}(0, 1)$.
• We assume training data $\{x_i, y_i, z_i\}_{i=1}^{N} \sim p(X, Y, Z)$.

Standard training without the adversary r

(Left) The conditional probability distributions of $f(X; \theta_f) \mid Z = z$ change with $z$.
(Right) The decision surface strongly depends on $X_2$.

Reshaping f with adversarial training

(Left) The conditional probability distributions of $f(X; \theta_f) \mid Z = z$ are now (almost) invariant with $z$!
(Right) The decision surface is now independent of $X_2$.

Dynamics of adversarial training

Physics example: pileup independence

• Discriminate between QCD jets ($Y = 0$) and W-jets ($Y = 1$) from high-level features (data from Baldi et al., arXiv:1603.09349).
• Taking some liberty, we consider an extreme categorical nuisance parameter where $Z = 0$ corresponds to events without pileup and $Z = 1$ corresponds to events with pileup, for which there are an average of 50 independent pileup interactions overlaid.

Maximizing significance by tuning λ

• We optimize the accuracy-independence trade-off by tuning $\lambda$ with respect to a higher-level objective.
• Cut and count analysis: hypothesis test of a null with no signal events against an alternate hypothesis that is a mixture of signal and background events. Background = 1000 weighted QCD jets, Signal = 100 weighted boosted W's.
• Without systematics, optimizing $L_f$ indirectly optimizes the power of a classical hypothesis test.
• With systematics, we need to balance classification performance against robustness to the nuisance parameter. To this end, we use the Approximate Median Significance (AMS) as the higher-level objective.
• Note that since we are performing a hypothesis test of the null, we only wish to impose the pivotal property on background events.
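For orientation, the commonly used closed form of the Approximate Median Significance for a counting experiment is $\mathrm{AMS} = \sqrt{2\left((s + b)\ln(1 + s/b) - s\right)}$, with $s$ and $b$ the expected signal and background counts passing the cut. The slides do not state which variant was used, so the plain form below, without any regularization or background-uncertainty term, is an assumption.

```python
import math

def ams(s, b):
    """Approximate Median Significance for s expected signal and
    b expected background events passing a selection."""
    if b <= 0 or s < 0:
        raise ValueError("need b > 0 and s >= 0")
    return math.sqrt(2.0 * ((s + b) * math.log(1.0 + s / b) - s))

# Illustrative counts only: all 100 signal and 1000 background events kept.
print(ams(100.0, 1000.0))             # a little below s / sqrt(b)
print(100.0 / math.sqrt(1000.0))
```

In a scan over $\lambda$, one would evaluate such a figure of merit for each trained classifier at its best threshold and keep the $\lambda$ that maximizes it.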
