SLIDE 1

Active Learning with Disagreement Graphs

Corinna Cortes1, Giulia DeSalvo1, Claudio Gentile1, Mehryar Mohri1,2, Ningshan Zhang3

1 Google Research 2 Courant Institute, NYU 3 NYU

ICML, June 12, 2019

SLIDE 2

On-line Active Learning Setup

◮ At each round t ∈ [T], receive an unlabeled point xt ∼ DX, drawn i.i.d.
◮ Decide whether to request the label:
◮ If the label is requested, receive yt.
◮ After T rounds, return a hypothesis hT ∈ H.

Objective:

◮ Generalization error:
◮ Accurate predictor hT: small expected loss R(hT) = E_{x,y}[ℓ(hT(x), y)].
◮ Close to the best-in-class h∗ = argmin_{h ∈ H} R(h).
◮ Label complexity: few label requests.
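The protocol above can be sketched as a generic loop; the `stream` and `learner` interfaces here are illustrative, not from the talk:

```python
import random

def online_active_learning(stream, learner, T):
    """Generic online active-learning loop: at each round the learner sees an
    unlabeled point, decides whether to pay for its label, and updates."""
    num_requests = 0
    for t in range(T):
        x = stream.draw_unlabeled()          # x_t ~ D_X, i.i.d.
        p = learner.request_probability(x)   # algorithm-specific bias in [0, 1]
        if random.random() < p:              # flip the coin Q_t ~ Ber(p_t)
            y = stream.query_label(x)        # y_t is revealed only on request
            learner.update(x, y, p)
            num_requests += 1
    return learner.final_hypothesis(), num_requests
```

Any concrete algorithm (IWAL, IWAL-D, IZOOM) plugs in its own `request_probability` and `update`.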

SLIDE 3

Disagreement-based Active Learning

Key idea: Request label when there is some disagreement among hypotheses. Examples:

◮ Separable case: CAL (Cohn et al., 1994).
◮ Non-separable case: A2 (Balcan et al., 2006), DHM (Dasgupta et al., 2008).
◮ IWAL (Beygelzimer et al., 2009).

Can we improve upon existing disagreement-based algorithms, such as IWAL?

◮ Better guarantees?
◮ Leverage average disagreements?

SLIDE 4

This talk

◮ IWAL-D algorithm: enhanced IWAL with the disagreement graph.
◮ IZOOM algorithm: enhanced IWAL-D with zooming-in.
◮ Better generalization and label complexity guarantees.
◮ Experimental results.

SLIDE 5

Disagreement Graph (D-Graph)

◮ Vertices: hypotheses in H (a finite hypothesis set).
◮ Edges: fully connected. The edge between h, h′ ∈ H is weighted by their expected disagreement:

L(h, h′) = E_{x∼DX}[ max_{y ∈ Y} |ℓ(h(x), y) − ℓ(h′(x), y)| ].

L is symmetric, and ℓ ≤ 1 implies L ≤ 1.

◮ D-Graph can be accurately estimated using unlabeled data.
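Since L(h, h′) involves no labels, each edge weight can be estimated by a plain Monte Carlo average over an unlabeled sample. A minimal sketch, assuming a generic bounded loss and a finite label set (names are illustrative):

```python
def estimate_disagreement(h1, h2, unlabeled_xs, loss, labels=(0, 1)):
    """Monte Carlo estimate of the expected disagreement
    L(h1, h2) = E_x[ max_y |loss(h1(x), y) - loss(h2(x), y)| ],
    averaged over an unlabeled sample."""
    total = 0.0
    for x in unlabeled_xs:
        total += max(abs(loss(h1(x), y) - loss(h2(x), y)) for y in labels)
    return total / len(unlabeled_xs)
```

With the 0-1 loss this reduces to the fraction of the sample on which h1 and h2 predict differently.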

SLIDE 6

Disagreement Graph (D-Graph)

One favorable scenario:

◮ The best-in-class h∗ lies within an isolated cluster;
◮ L(h, h∗) is small within that cluster.

SLIDE 7

IWAL-D Algorithm: IWAL with D-Graph

◮ At round t ∈ [T], receive xt.

1. Flip a coin Qt ∼ Ber(pt), with disagreement-based bias

   pt = max_{h, h′ ∈ Ht} max_{y ∈ Y} |ℓ(h(xt), y) − ℓ(h′(xt), y)|.

2. If Qt = 1, request the label yt.
3. Trim the version space:

   Ht+1 = { h ∈ Ht : L̂t(h) ≤ L̂t(ĥt) + [1 + L(h, ĥt)] ∆t },

   which uses the importance-weighted empirical risk
   L̂t(h) = (1/t) Σ_{s=1}^{t} (Qs / ps) ℓ(h(xs), ys),
   with ĥt = argmin_{h ∈ Ht} L̂t(h) and ∆t = O(√(log(T|H|) / t)).

◮ After T rounds, return ĥT.
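The two ingredients of a round, the query bias pt and the trimming step, can be sketched as follows for a finite hypothesis list. This is an illustrative simplification, not the authors' code; `emp_risk` stands for the importance-weighted empirical risk L̂t and `pair_disagreement` for the graph weight L:

```python
def request_probability(x_t, hypotheses, loss, labels):
    """p_t: largest loss gap over hypothesis pairs in H_t and labels y."""
    return max(
        abs(loss(h(x_t), y) - loss(hp(x_t), y))
        for h in hypotheses for hp in hypotheses for y in labels
    )

def trim(hypotheses, emp_risk, pair_disagreement, delta_t):
    """Keep h with emp_risk(h) <= emp_risk(h_best) + (1 + L(h, h_best)) * delta_t,
    i.e. the disagreement-weighted version-space trimming of IWAL-D."""
    h_best = min(hypotheses, key=emp_risk)
    threshold_base = emp_risk(h_best)
    return [h for h in hypotheses
            if emp_risk(h) <= threshold_base
            + (1.0 + pair_disagreement(h, h_best)) * delta_t]
```

Note how a small L(h, ĥt) tightens the threshold, so hypotheses close to ĥt in the graph are trimmed more aggressively than under plain IWAL.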

SLIDE 8

IWAL-D vs. IWAL: Quantify the Improvement

Theorem (IWAL-D). With high probability,

R(ĥT) ≤ R∗ + [1 + L(ĥT, h∗)] ∆T,

E_{x∼DX}[pt | Ft−1] ≤ 2θ [ 2R∗ + max_{h ∈ Ht} (2 + L(h, ĥt−1) + L(h, h∗)) ∆t−1 ].

◮ θ: disagreement coefficient (Hanneke, 2007).
◮ More aggressive trimming of the version space.
◮ Slightly better generalization guarantee and label complexity.

SLIDE 9

IWAL and IWAL-D

Problem:

◮ Theoretical guarantees only hold for finite hypothesis sets.
◮ An ε-cover is needed to extend them to infinite hypothesis sets.
◮ Constructing an ε-cover is expensive in practice.

Can we adaptively enrich the hypothesis set, with theoretical guarantees?

SLIDE 10

IZOOM: IWAL-D with Zooming-in

At round t,

◮ Request the label based on the disagreements within Ht.

SLIDE 11

IZOOM: IWAL-D with Zooming-in

At round t,

◮ Request the label based on the disagreements within Ht.
◮ H′t+1 ← Trim(Ht)

SLIDE 12

IZOOM: IWAL-D with Zooming-in

At round t,

◮ Request the label based on the disagreements within Ht.
◮ H′t+1 ← Trim(Ht)
◮ H′′t+1 ← Resample(H′t+1)

Resample(H′t+1): sample new h ∈ ConvexHull(H′t+1).
◮ E.g., a random convex combination of ĥt and some h ∈ H′t+1.
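For hypotheses represented as weight vectors, the resampling step can be sketched as drawing random convex combinations of the current best ĥt with surviving hypotheses. This is an illustrative sketch assuming a vector parameterization; the function and argument names are not from the talk:

```python
import random

def resample(h_best, survivors, num_new, seed=0):
    """Sample num_new hypotheses of the form alpha * h_best + (1 - alpha) * h,
    with h drawn from the trimmed set and alpha uniform in [0, 1); every new
    hypothesis lies in ConvexHull({h_best} U survivors)."""
    rng = random.Random(seed)
    new_hs = []
    for _ in range(num_new):
        h = rng.choice(survivors)
        alpha = rng.random()
        new_hs.append(tuple(alpha * wb + (1 - alpha) * w
                            for wb, w in zip(h_best, h)))
    return new_hs
```

Because the losses of interest are convex in the parameters for models such as logistic regression, combinations of good hypotheses are themselves reasonable candidates.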

SLIDE 13

IZOOM: IWAL-D with Zooming-in

At round t,

◮ Request the label based on the disagreements within Ht.
◮ H′t+1 ← Trim(Ht)
◮ H′′t+1 ← Resample(H′t+1)
◮ Ht+1 ← H′t+1 ∪ H′′t+1

SLIDE 14

IZOOM vs. IWAL-D

Let H̄t = ∪_{s=1}^{t} Hs, i.e., all the hypotheses ever considered up to time t, and let h∗t = argmin_{h ∈ H̄t} R(h).

Theorem (IZOOM). With high probability,

R(ĥT) ≤ R∗T + [1 + L(ĥT, h∗T)] ∆T + O(1/T),

E_{x∼DX}[pt+1 | Ft] ≤ 2θt [ 2R∗t + max_{h ∈ Ht+1} (2 + L(h, ĥt) + L(h, h∗t)) ∆t ] + O(1/T).

◮ R∗t = min_{h ∈ H̄t} R(h) is smaller than R∗ = min_{h ∈ H0} R(h).
◮ More accurate ĥT, with fewer label requests.

SLIDE 15

Experiments

Tasks: 8 binary classification datasets from the UCI repository.

◮ ℓ: logistic loss rescaled to [0, 1].
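The analysis assumes losses bounded in [0, 1]; one simple way to rescale the logistic loss accordingly (the exact rescaling is not specified on the slide, so the cap here is an assumption) is to clip it and divide by the cap:

```python
import math

def rescaled_logistic_loss(score, y, cap=10.0):
    """Logistic loss log(1 + exp(-y * score)) for y in {-1, +1},
    clipped at `cap` and divided by `cap` so the result lies in [0, 1].
    The cap value is an illustrative choice, not from the slides."""
    raw = math.log1p(math.exp(-y * score))
    return min(raw, cap) / cap
```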

Baselines:

◮ IWAL with 3,000 hypotheses.
◮ IWAL with 12,000 hypotheses.
◮ IZOOM with 3,000 hypotheses.

Performance measure:

◮ 0-1 loss on test data vs. number of label requests.

SLIDE 16

Experiments

[Figure: Misclassification loss on test data vs. log2(number of label requests), comparing IWAL 3000, IWAL 12000, and IZOOM 3000 on the nomao, codrna, skin, covtype, magic04, and a9a datasets.]

SLIDE 17

Conclusion

◮ Introduced the disagreement graph and its role in active learning.
◮ More favorable generalization and label complexity guarantees.
◮ Substantial empirical performance improvements.
◮ Effective solutions for active learning.

Poster: Pacific Ballroom #265.
KDD workshop (Alaska, August 2019): Active Learning: Data Collection, Curation, and Labeling for Mining and Learning.