

  1. Weak Supervision, noisy labels, and error propagation
     Marat Freytsis, hep-ai journal club — December 11, 2018
     Based on Yu et al. [arXiv:1402.5902] and Cohen, MF, Ostdiek [arXiv:1706.09451], plus bits of others.

  2. Why weak supervision?
     Fully supervised learning on real data is often difficult or impossible:
     • Individual labels are prohibitively expensive to assign
     • Personalized information is legally protected (e.g., medical, demographic data)
     • For quantum systems, unique labels may be unphysical
     Several classes of learning tasks on partially labeled data are well developed:
     • semi-supervised: augmenting labeled data with unlabeled data
     • multiple instance: the presence of signal in a bag is marked, but the signal events are not identified
     One which nicely maps onto many scientific data measurements is Learning from Label Proportions.

  3. Plan
     • Learning from Label Proportions
     • Viability and generalization error
     • Proportion uncertainties, stability, and error propagation

  4. Learning from Label Proportions: general setting
     The domain of instance features is denoted by X and the (discrete) labels by Y. Data consists of bags of events with features x̃ = (x_1, ..., x_r) and labels ỹ = (y_1, ..., y_r), drawn iid from a distribution over (X × Y)^r.
     The learner has no access to the labels, but instead receives label proportions (x̃, f_i(ỹ)), with f_i(ỹ) = Σ_{n=1}^r I[y_n = i] / r. From a set of m bags, the task is to find a classifier from individual events to labels, even though the classes are not perfectly separated by their features.
     For experimental measurements, f_i(ỹ) can be naturally interpreted as, e.g., a rate/cross-section measurement/calculation, even if individual events cannot be labeled.
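
A minimal sketch (my own toy construction, not from the talk) of what LLP training data looks like: per-event labels exist but are hidden, and only the per-bag proportion f_1(ỹ) is handed to the learner. The 1-d Gaussian feature and all names are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    r, m = 100, 5                       # bag size and number of bags

    def make_bag(signal_fraction):
        y = (rng.random(r) < signal_fraction).astype(int)   # hidden per-event labels
        x = rng.normal(loc=y, scale=1.0, size=r)            # toy 1-d feature
        f = np.mean(y == 1)                                  # f_1(y) = sum_n I[y_n = 1] / r
        return x, f                                          # the labels y are discarded

    bags = [make_bag(frac) for frac in rng.uniform(0.2, 0.8, size=m)]
    for x, f in bags:
        print(f"bag of {len(x)} events, observed signal proportion f_1 = {f:.2f}")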

  5. Is this even possible? A heuristic argument
     Consider binary classification (y_i ∈ {0, 1}). Discretize the data into bins b_{m,j}. If 2 bags A and B are present, then in each bin
         b_{A,j} = f_{A,1} b_{1,j} + (1 − f_{A,1}) b_{0,j}
         b_{B,j} = f_{B,1} b_{1,j} + (1 − f_{B,1}) b_{0,j}
     ⇒
         b_{0,j} = [f_{A,1} b_{B,j} − f_{B,1} b_{A,j}] / (f_{A,1} − f_{B,1})
         b_{1,j} = [(1 − f_{B,1}) b_{A,j} − (1 − f_{A,1}) b_{B,j}] / (f_{A,1} − f_{B,1})
     and the distributions can be inverted algebraically.
     Requirements:
     • Number of bags ≥ number of classes to be distinguished, with label proportions unique for each bag.
     • The bags need to be drawn from the same underlying distribution over (X × Y)^r, i.e., however the label proportions were made different should be uncorrelated with the distribution for each class.
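
As a sanity check on the heuristic argument, here is a small numerical sketch (the binning and densities are made up by me) that applies the inversion formulas above to two binned bag distributions and recovers the per-class histograms exactly.

    import numpy as np

    rng = np.random.default_rng(1)
    bins = np.linspace(-4, 5, 20)
    b0_true, _ = np.histogram(rng.normal(0, 1, 100_000), bins=bins, density=True)  # background
    b1_true, _ = np.histogram(rng.normal(1, 1, 100_000), bins=bins, density=True)  # signal

    fA, fB = 0.7, 0.3                                  # signal fractions of the two bags
    bA = fA * b1_true + (1 - fA) * b0_true             # b_A,j = f_A b_1,j + (1 - f_A) b_0,j
    bB = fB * b1_true + (1 - fB) * b0_true

    b0 = (fA * bB - fB * bA) / (fA - fB)               # inverted formulas from the slide
    b1 = ((1 - fB) * bA - (1 - fA) * bB) / (fA - fB)

    print(np.allclose(b0, b0_true), np.allclose(b1, b1_true))   # True True: exact inversion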

  6. Classification in practice
     We don't want to discretize: there is no guarantee that events sample feature space densely enough for it even to make sense. How to classify events? Modify the loss function!
     1. direct attack (learning from label proportions): h_LLP = argmin_{h∈H} ℓ(⟨h(x_i)⟩_batch, ⟨f(ỹ)⟩_batch); typically needs re-optimization of hyperparameters
     2. clever trick (classification without labels), Metodiev et al. [arXiv:1708.02949]: h_CWoLa = argmin_{h∈H} ℓ(h(x_i), f(ỹ)), with your fully supervised loss function of choice
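
The two loss choices can be written down in a few lines. Below is a hedged PyTorch sketch with names of my own; it assumes a binary problem, a sigmoid output h(x) in [0, 1], one bag per minibatch, and implements the CWoLa option in its usual form of labelling every event by the mixed sample it came from.

    import torch
    import torch.nn.functional as F

    def llp_loss(h_x, bag_fraction):
        # compare the batch-averaged prediction to the bag's label proportion
        batch_mean = h_x.mean()
        target = torch.as_tensor(bag_fraction, dtype=batch_mean.dtype)
        return F.binary_cross_entropy(batch_mean.unsqueeze(0), target.unsqueeze(0))

    def cwola_loss(h_x, bag_label):
        # CWoLa-style: label every event by which mixed sample it came from and
        # use an ordinary fully supervised loss
        target = torch.full_like(h_x, float(bag_label))
        return F.binary_cross_entropy(h_x, target)

    h_x = torch.rand(128)             # stand-in for network outputs on one bag
    print(llp_loss(h_x, 0.4), cwola_loss(h_x, 1))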

  7. Classification without labels
     Why does the second version work at all?
     Theorem: Given mixed samples M_1 and M_2, defined in terms of pure samples S and B with signal fractions f_1 > f_2, an optimal classifier trained to distinguish M_1 from M_2 is also optimal for distinguishing S from B.
     Proof: The optimal classifier to distinguish examples drawn from p_{M_1} and p_{M_2} is the likelihood ratio L_{M_1/M_2}(x) = p_{M_1}(x)/p_{M_2}(x). Similarly, the optimal classifier to distinguish examples drawn from p_S and p_B is the likelihood ratio L_{S/B}(x) = p_S(x)/p_B(x). Where p_B has support, we can relate these two likelihood ratios algebraically:
         L_{M_1/M_2} = p_{M_1}/p_{M_2} = [f_1 p_S + (1 − f_1) p_B] / [f_2 p_S + (1 − f_2) p_B] = [f_1 L_{S/B} + (1 − f_1)] / [f_2 L_{S/B} + (1 − f_2)],
     which is a monotonically increasing rescaling of the likelihood L_{S/B} as long as f_1 > f_2, since ∂_{L_{S/B}} L_{M_1/M_2} = (f_1 − f_2)/(f_2 L_{S/B} − f_2 + 1)^2 > 0. If f_1 < f_2, one obtains the reversed classifier. Therefore, L_{S/B} and L_{M_1/M_2} define the same classifier. ∎
     Caveats: we still need to know the label proportions to calibrate the classifier, and this only makes sense for binary classification!
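
A quick numerical illustration (mine, not from the slide) of the monotonicity step in the proof: for f_1 > f_2, the mixed-sample likelihood ratio is an increasing function of L_{S/B}, so thresholding either quantity selects the same events.

    import numpy as np

    f1, f2 = 0.8, 0.3
    L = np.linspace(0.0, 50.0, 1000)                  # possible values of L_S/B
    L_mix = (f1 * L + 1 - f1) / (f2 * L + 1 - f2)     # L_M1/M2 as a function of L_S/B

    print(np.all(np.diff(L_mix) > 0))                 # True: same ordering, same classifier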

  8. Plan
     • Learning from Label Proportions
     • Viability and generalization error
     • Proportion uncertainties, stability, and error propagation

  9. When is all of this viable?
     All of this should clearly work in at least some cases, but can we know when it will fail?
     It turns out the classification-without-labels results are more general than they seem. Under mild assumptions (more later), a classifier which can accurately predict bag proportions can be guaranteed to achieve low error on event labels.
     More precisely, for φ_r(h): X^r → R with φ_r(h)(x̃) = Σ_{n=1}^r h(x_n)/r, the classifier selected by
         argmin_{h∈H} Σ_{bags} ℓ(φ_r(h)(x̃), f(ỹ))
     will also solve the original task with high accuracy.

  10. Generalization errors for label proportions
     For a given empirical bag label proportion error err_ℓ(h) under loss function ℓ, it is possible to prove a bound on the expected error over the full distribution X × Y,
         err^G_ℓ(h) = E_{(x̃,ỹ)} ℓ(φ_r(h)(x̃), f(ỹ)).
     As a function of the VC dimension of the hypothesis class, with probability 1 − δ, err^G_ℓ(h) ≤ err_ℓ(h) + ε if the number of bags m is
         m ≥ (64/ε²) [2 VC(H) log(12r/ε) + log(4/δ)].
     The mild dependence on bag size r means that destabilizing the method by adding more data is not a large concern. (For this proof and the following, see arXiv:1402.5902.)
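
For intuition about the log(12r/ε) factor, here is a small helper (my own, not from arXiv:1402.5902) that evaluates the bound on the number of bags m for two very different bag sizes; the required m grows only logarithmically with r.

    import math

    def bags_needed(vc_dim, r, eps, delta):
        # m >= (64 / eps^2) * (2 * VC(H) * log(12 r / eps) + log(4 / delta))
        return math.ceil(64 / eps**2 * (2 * vc_dim * math.log(12 * r / eps) + math.log(4 / delta)))

    # e.g. VC(H) = 50, bag size r = 100 vs r = 10_000
    print(bags_needed(50, 100, eps=0.1, delta=0.05),
          bags_needed(50, 10_000, eps=0.1, delta=0.05))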

  11. Event errors from proportion errors
     With some mild assumptions, the above bounds can be extended to individual events.
     If err^G_ℓ(h) ≤ ε with probability 1 − δ, and each bag is at least (1 − η)-pure a fraction 1 − ρ of the time, then h(x) correctly classifies a fraction (1 − τ)(1 − δ − ρ)(1 − 2η − ε) of N events with probability 1 − e^{−Nτ²(1 − δ − ρ)(1 − 2η − ε)/2}.
     Unfortunately, these bounds are somewhat weak: guaranteed high performance generically requires extremely pure samples.
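
To see how quickly the guarantee degrades with bag impurity, a short sketch (all parameter values are my own choices) that plugs numbers into the bound above:

    import math

    def event_guarantee(N, eps, delta, eta, rho, tau):
        frac = (1 - tau) * (1 - delta - rho) * (1 - 2 * eta - eps)   # guaranteed correct fraction
        prob = 1 - math.exp(-N * tau**2 * (1 - delta - rho) * (1 - 2 * eta - eps) / 2)
        return frac, prob

    # nearly pure bags vs mildly impure bags
    print(event_guarantee(N=10_000, eps=0.05, delta=0.05, eta=0.01, rho=0.05, tau=0.1))
    print(event_guarantee(N=10_000, eps=0.05, delta=0.05, eta=0.20, rho=0.05, tau=0.1))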

  12. Class distribution independence
     The preceding bound was so weak because no conditional independence of the underlying distributions from the bags was assumed, i.e., the assumption that allowed us to invert the class distributions earlier.
     For binary classification, if all bags are drawn from mixtures of the underlying class distributions with different fractions, the problem can be written as a generative model. The general answer becomes quite involved in this case, and I won't attempt to reproduce it. The probability of getting a classifier with error ≤ ε is then bounded from below by u(ε, r).
     [Plot: the lower bound u(ε, r) as a function of ε, for bag sizes r = 10, 15, 20, 25, 30, 35, 40, 45, 50, 100.]

  13. Plan
     • Learning from Label Proportions
     • Viability and generalization error
     • Proportion uncertainties, stability, and error propagation

  14. Label uncertainties
     The supervised aspect comes from the provided label proportions. What if these are wrong? Return to the heuristic argument:
         b_{A,j} = f_{A,1} b_{1,j} + (1 − f_{A,1}) b_{0,j}
         b_{B,j} = f_{B,1} b_{1,j} + (1 − f_{B,1}) b_{0,j}
     ⇒
         b_{0,j} = [f_{A,1} b_{B,j} − f_{B,1} b_{A,j}] / (f_{A,1} − f_{B,1})
         b_{1,j} = [(1 − f_{B,1}) b_{A,j} − (1 − f_{A,1}) b_{B,j}] / (f_{A,1} − f_{B,1})
     A Neyman–Pearson-optimal classifier is z = b_0/(b_0 + b_1). The dependence of this classifier on a shift/uncertainty in any label proportion can be worked out analytically.
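
A toy numerical check (the densities and parameter values are my own) of what a proportion shift does in practice: rebuilding the inversion with f_A → f_A + δ distorts z, but for these inputs the distortion is monotonic, so the event ranking, and hence any cut-based selection, is unchanged. This is the point of the next slide.

    import numpy as np

    x = np.linspace(-3, 4, 500)
    b0 = np.exp(-0.5 * x**2)                 # toy background density
    b1 = np.exp(-0.5 * (x - 1)**2)           # toy signal density
    fA, fB, delta = 0.7, 0.3, 0.05

    def classifier(fA_assumed):
        bA = fA * b1 + (1 - fA) * b0         # bags are always built with the true fractions
        bB = fB * b1 + (1 - fB) * b0
        n0 = (fA_assumed * bB - fB * bA) / (fA_assumed - fB)        # inversion with assumed f_A
        n1 = ((1 - fB) * bA - (1 - fA_assumed) * bB) / (fA_assumed - fB)
        return n0 / (n0 + n1)                # z = b_0 / (b_0 + b_1)

    z, z_shifted = classifier(fA), classifier(fA + delta)
    order = np.argsort(z)
    print(np.all(np.diff(z_shifted[order]) >= 0))    # True: ranking of events is unchanged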

  15. Label insensitivity: cartoon version
     As long as the resulting distortion is monotonic, the classifiers are equivalent: every cut z̄ on the original classifier corresponds to a cut z̄′ on the distorted one that selects the same events.
     [Cartoon: cuts z̄ and z̄′ arranged from more signal-like to more background-like, contrasting a "good" (monotonic, equivalent) distortion with a "bad" one where only some cuts are recoverable.]

  16. Label insensitivity: concrete example
     For a binary classifier and 2 bags, with proportion error f_{A,1} → f_{A,1} + δ, the distorted classifier z̄′_i can be written as the original z̄_i plus a shift proportional to δ, with an explicit rational dependence on f_A, f_B, δ, z̄_i, and r(x) = b_A(x)/b_B(x), the ratio of inferred bag distributions. The classifier remains equivalent to the optimal one if
         δ ≲ (f_A − f_B) / [3 − 2 min(f_B, 1 − f_B)].
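
In the simpler parameterization in terms of r(x), the shifted classifier can be worked out directly from the slide-14 inversion formulas. The sympy sketch below (my own check, not the talk's expression) does this symbolically and confirms that the shift is proportional to δ and vanishes as δ → 0.

    import sympy as sp

    fA, fB, delta, r = sp.symbols('f_A f_B delta r', positive=True)   # r = b_A(x)/b_B(x)

    def z_of(fA_assumed):
        # invert b_A = fA*b_1 + (1 - fA)*b_0 and b_B = fB*b_1 + (1 - fB)*b_0,
        # working per unit of b_B so that b_A is replaced by the ratio r
        b0 = (fA_assumed - fB * r) / (fA_assumed - fB)
        b1 = ((1 - fB) * r - (1 - fA_assumed)) / (fA_assumed - fB)
        return sp.simplify(b0 / (b0 + b1))

    z = z_of(fA)              # classifier built with the true proportion f_A
    zp = z_of(fA + delta)     # classifier built with the mis-measured proportion f_A + delta

    D = (1 - 2 * fB) * r - (1 - 2 * fA)
    expected_shift = delta * (r - 1) / (D * (D + 2 * delta))
    print(sp.cancel(zp - z - expected_shift))   # -> 0, i.e. z' = z + delta*(r-1)/(D*(D+2*delta))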

  17. A numerical study
     Using random multi-Gaussian mixture models, test the impact of mismodelling on fully and weakly supervised classifiers:
     • (left) randomly swap 15% of each class
     • (right) swap the 10% (15%) most signal-like (background-like) events
     [ROC curves: true positive rate vs false positive rate for fully supervised (original), weakly supervised (original), fully supervised (mis-modeled), and weakly supervised (mis-modeled) classifiers, for both mismodelling scenarios.]
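
A rough sketch of the two label-swapping procedures as I read them from the slide (the exact fractions and the direction of the targeted swap are assumptions on my part); these produce the "mis-modeled" labels whose proportions are then fed to the weakly supervised training.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 10_000
    y = rng.integers(0, 2, size=n)                   # true class of each event
    score = rng.normal(loc=y, scale=1.0)             # stand-in for "signal-likeness"

    # (a) randomly swap 15% of each class
    swap = np.zeros(n, dtype=bool)
    for cls in (0, 1):
        idx = np.flatnonzero(y == cls)
        swap[rng.choice(idx, size=int(0.15 * idx.size), replace=False)] = True
    y_random = np.where(swap, 1 - y, y)

    # (b) swap the 10% most signal-like background events and the 10% most
    #     background-like signal events
    y_targeted = y.copy()
    bkg, sig = np.flatnonzero(y == 0), np.flatnonzero(y == 1)
    y_targeted[bkg[np.argsort(score[bkg])[-int(0.10 * bkg.size):]]] = 1
    y_targeted[sig[np.argsort(score[sig])[:int(0.10 * sig.size)]]] = 0

    print(np.mean(y_random != y), np.mean(y_targeted != y))   # fraction of mislabelled events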

  18. Concluding thoughts
     • Can bounds on generalization errors be made stronger without assuming distribution independence (or assuming something weaker)?
     • Understand how the optimality arguments change with finite statistics/correlations.
     • Can we propagate input uncertainties through the network?
       ◮ Where would this be useful?
     • Thank you!
