  1. Noise-adaptive Margin-based Active Learning and Lower Bounds. Yining Wang and Aarti Singh, Carnegie Mellon University

  2. Machine Learning: the setup
  ❖ The machine learning problem
  ❖ Each data point consists of data and label $(x_i, y_i)$
  ❖ Access to training data $(x_1, y_1), \cdots, (x_n, y_n)$
  ❖ Goal: train a classifier $\hat{f}$ to predict $y$ based on $x$
  ❖ Example: classification, with $x_i \in \mathbb{R}^d$ and $y_i \in \{+1, -1\}$

  3. Machine learning: passive vs. active
  ❖ Classical framework: passive learning
  ❖ I.i.d. training data $(x_i, y_i) \sim \mathcal{D}$
  ❖ Evaluation: generalization error $\Pr_{\mathcal{D}}[y \neq \hat{f}(x)]$
  ❖ An active learning framework
  ❖ Data are cheap, but labels are expensive!
  ❖ Example: medical data (labels require domain knowledge)
  ❖ Active learning: minimize label requests

  4. Active Learning
  ❖ Pool-based active learning
  ❖ The learner $\mathcal{A}$ has access to an unlabeled data stream $x_1, x_2, \cdots \overset{\text{i.i.d.}}{\sim} \mathcal{D}$
  ❖ For each $x_i$, the learner decides whether to query; if a label is requested, $\mathcal{A}$ obtains $y_i$
  ❖ Minimize the number of label requests, while scanning through only a polynomial number of unlabeled data points (a protocol sketch follows below)
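
  A minimal sketch of this pool-based interaction protocol, assuming a generic query-decision rule. The names pool_based_run, query_decision, and the toy data stream below are illustrative placeholders, not objects from the paper.

      import numpy as np

      rng = np.random.default_rng(0)

      def pool_based_run(stream, oracle, query_decision, budget):
          """Scan an unlabeled stream; request a label only when query_decision fires."""
          labeled = []
          for x in stream:
              if len(labeled) >= budget:
                  break
              if query_decision(x, labeled):          # learner-chosen rule, e.g. a margin test
                  labeled.append((x, oracle(x)))      # each label request consumes budget
          return labeled

      # Toy instance: noisy labels from a hidden linear classifier on the unit sphere.
      d = 5
      w_star = np.ones(d) / np.sqrt(d)
      stream = (x / np.linalg.norm(x) for x in rng.standard_normal((10000, d)))
      oracle = lambda x: np.sign(x @ w_star) * (1 if rng.random() > 0.1 else -1)
      labels = pool_based_run(stream, oracle, lambda x, L: True, budget=50)  # query everything = passive baseline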

  5. Active Learning
  ❖ Example: learning a homogeneous linear classifier, $y_i = \mathrm{sgn}(w^\top x_i) + \text{noise}$
  ❖ Basic (passive) approach: empirical risk minimization (ERM)
     $\hat{w} \in \operatorname{argmin}_{\|w\|_2 = 1} \sum_{i=1}^{n} \mathbb{I}[y_i \neq \mathrm{sgn}(w^\top x_i)]$
  ❖ How about active learning?
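
  Exact 0/1-loss ERM over halfspaces is computationally hard in general; as a hedged stand-in (not the paper's method), the sketch below approximates it by scoring random unit-norm candidate directions.

      import numpy as np

      def erm_01_random_search(X, y, n_candidates=2000, seed=0):
          """Approximate argmin over unit-norm w of the empirical 0/1 loss
          sum_i 1[y_i != sgn(w^T x_i)], via random search over candidate directions."""
          rng = np.random.default_rng(seed)
          W = rng.standard_normal((n_candidates, X.shape[1]))
          W /= np.linalg.norm(W, axis=1, keepdims=True)          # unit-norm candidates
          losses = (np.sign(X @ W.T) != y[:, None]).sum(axis=0)  # 0/1 loss of each candidate
          return W[np.argmin(losses)]

      # Toy usage: recover a direction close to w_star from noisy labels.
      rng = np.random.default_rng(1)
      d, n = 10, 500
      w_star = np.eye(d)[0]
      X = rng.standard_normal((n, d))
      y = np.sign(X @ w_star) * np.where(rng.random(n) < 0.05, -1, 1)
      w_hat = erm_01_random_search(X, y)
      print("angle to w*:", np.arccos(np.clip(w_hat @ w_star, -1, 1)))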

  6. Margin-based Active Learning (Balcan, Broder and Zhang, COLT'07)
  ❖ Data dimension $d$, query budget $T$, number of iterations $E$, per-iteration label budget $n = T/E$
  ❖ At each iteration $k \in \{1, \cdots, E\}$:
  ❖ Determine parameters $b_{k-1}, \beta_{k-1}$
  ❖ Query $n = T/E$ samples falling in the margin band $\{x \in \mathbb{R}^d : |\hat{w}_{k-1} \cdot x| \leq b_{k-1}\}$
  ❖ Constrained ERM: $\hat{w}_k = \operatorname{argmin}_{\theta(w, \hat{w}_{k-1}) \leq \beta_{k-1}} L(\{x_i, y_i\}_{i=1}^{n}; w)$
  ❖ Final output: $\hat{w}_E$
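
  A hedged sketch of the iterative structure above. The constrained ERM is approximated by scoring random candidate directions (as in the previous sketch), and the schedules for $b_k$ and $\beta_k$ are left as inputs because BBZ07 sets them using the noise parameter $\alpha$.

      import numpy as np

      def angle(u, v):
          return np.arccos(np.clip(u @ v, -1.0, 1.0))

      def margin_based_al(sample_x, oracle, d, T, E, b_sched, beta_sched, seed=0):
          """Skeleton of margin-based active learning: at iteration k, query labels only
          for points inside the margin band |w_{k-1} . x| <= b_{k-1}, then run an
          (approximate) ERM constrained to angle(w, w_{k-1}) <= beta_{k-1}."""
          rng = np.random.default_rng(seed)
          n = max(1, T // E)                           # per-iteration label budget n = T/E
          w = np.eye(d)[0]                             # arbitrary initial direction
          for k in range(1, E + 1):
              b, beta = b_sched(k - 1), beta_sched(k - 1)
              X, y = [], []
              for _ in range(100000):                  # rejection-sample the margin band (capped)
                  if len(X) >= n:
                      break
                  x = sample_x(rng)
                  if abs(w @ x) <= b:
                      X.append(x); y.append(oracle(x))
              if not X:
                  break
              X, y = np.array(X), np.array(y)
              C = rng.standard_normal((2000, d))
              C /= np.linalg.norm(C, axis=1, keepdims=True)           # candidate directions
              C = C[np.array([angle(c, w) <= beta for c in C])]       # enforce the angle constraint
              if len(C) == 0:
                  continue
              losses = (np.sign(X @ C.T) != y[:, None]).sum(axis=0)   # 0/1 loss on queried points
              w = C[np.argmin(losses)]
          return w

  In BBZ07 the schedules $b_k$, $\beta_k$ shrink geometrically at $\alpha$-dependent rates; the noise-adaptive variant on slide 12 replaces them with $\alpha$-free settings.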

  7. Tsybakov Noise Condition (TNC)
  ❖ There exist constants $\mu > 0$, $\alpha \in (0, 1)$ such that $\mu \cdot \theta(w, w^*)^{1/(1-\alpha)} \leq \mathrm{err}(w) - \mathrm{err}(w^*)$
  ❖ $\alpha$: the key noise magnitude parameter in TNC
  ❖ Which one is harder: small $\alpha$ or large $\alpha$?
  [Figure: excess error $\mathrm{err}(w) - \mathrm{err}(w^*)$ plotted against the angle $\theta(w, w^*)$ for small and large $\alpha$]
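
  A small numeric illustration of the TNC inequality: for small $\alpha$ the lower bound $\mu\,\theta^{1/(1-\alpha)}$ rises quickly with the angle, while for $\alpha$ near 1 it stays nearly flat close to $w^*$, so a small excess error no longer pins down $\theta$. The values of $\mu$ and $\theta$ below are arbitrary.

      import numpy as np

      mu = 1.0
      thetas = np.array([0.05, 0.1, 0.2, 0.4])
      for alpha in (0.2, 0.8):
          # TNC: err(w) - err(w*) >= mu * theta(w, w*)^(1/(1-alpha))
          lower_bound = mu * thetas ** (1.0 / (1.0 - alpha))
          print(f"alpha={alpha}: lower bound at theta={thetas} -> {np.round(lower_bound, 5)}")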

  8. Margin-based Active Learning
  ❖ Main theorem [BBZ07]: when $\mathcal{D}$ is the uniform distribution, the margin-based algorithm achieves
     $\mathrm{err}(\hat{w}) - \mathrm{err}(w^*) = \tilde{O}_P\!\left(\left(\frac{d}{T}\right)^{1/(2\alpha)}\right)$
  ❖ Passive learning: $O\!\left((d/T)^{\frac{1-\alpha}{2\alpha}}\right)$

  9. Proof outline (Balcan, Broder and Zhang, COLT'07)
  ❖ At each iteration $k$, perform restricted ERM over the within-margin data:
     $\hat{w}_k = \operatorname{argmin}_{\theta(w, \hat{w}_{k-1}) \leq \beta_{k-1}} \widehat{\mathrm{err}}(w \mid S_1)$, where $S_1 = \{x : |x^\top \hat{w}_{k-1}| \leq b_{k-1}\}$

  10. Proof outline
  ❖ Key fact: if $\theta(\hat{w}_{k-1}, w^*) \leq \beta_{k-1}$ and $b_k = \tilde{\Theta}(\beta_k / \sqrt{d})$, then
     $\mathrm{err}(\hat{w}_k) - \mathrm{err}(w^*) = \tilde{O}\!\left(\sqrt{d/T}\,\beta_{k-1}\right)$
  ❖ Proof idea: decompose the excess error into two terms
     $[\mathrm{err}(\hat{w}_k \mid S_1) - \mathrm{err}(w^* \mid S_1)] \cdot \Pr[x \in S_1]$, with conditional excess error $\tilde{O}(\sqrt{d/T})$ and $\Pr[x \in S_1] = \tilde{O}(b_{k-1}\sqrt{d})$
     $[\mathrm{err}(\hat{w}_k \mid S_1^c) - \mathrm{err}(w^* \mid S_1^c)] \cdot \Pr[x \in S_1^c] = \tilde{O}(\tan \beta_{k-1})$
  ❖ Must ensure $w^*$ is always within reach! Set $\beta_k = 2^{\alpha-1}\beta_{k-1}$
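
  A quick numeric check of the shrinkage schedule $\beta_k = 2^{\alpha-1}\beta_{k-1}$: it contracts by a factor $2^{\alpha-1} < 1$ per iteration, more slowly when $\alpha$ is close to 1. The starting radius $\beta_0$ below is arbitrary.

      import numpy as np

      beta0, E = np.pi / 2, 10
      for alpha in (0.2, 0.5, 0.8):
          # beta_k = 2^(alpha - 1) * beta_{k-1}
          betas = beta0 * (2.0 ** (alpha - 1.0)) ** np.arange(E + 1)
          print(f"alpha={alpha}: beta_E = {betas[-1]:.4f} "
                f"(shrink factor {2.0 ** (alpha - 1.0):.3f} per iteration)")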

  11. Problem
  ❖ What if $\alpha$ is not known? How should the key parameters $b_k, \beta_k$ be set?
  ❖ If the true parameter is $\alpha$ but the algorithm is run with $\alpha' > \alpha$, the convergence rate is governed by $\alpha'$ instead of $\alpha$, i.e., the slower rate $\tilde{O}_P((d/T)^{1/(2\alpha')})$!

  12. Noise-adaptive Algorithm
  ❖ Agnostic parameter settings: $E = \frac{1}{2}\log T$, $\beta_k = \frac{2^{-k}\pi}{\sqrt{2}}$, $b_k = \frac{2\beta_k}{\sqrt{d}}$
  ❖ Main analysis: two-phase behavior
  ❖ "Tipping point" $k^* \in \{1, \cdots, E\}$, depending on $\alpha$
  ❖ Phase I ($k \leq k^*$): we have $\theta(\hat{w}_k, w^*) \leq \beta_k$
  ❖ Phase II ($k > k^*$): we have $\mathrm{err}(\hat{w}_{k+1}) - \mathrm{err}(\hat{w}_k) \leq \beta_k \cdot \tilde{O}(\sqrt{d/T})$
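
  A sketch of the agnostic parameter schedule as reconstructed above; the point is that none of $E$, $\beta_k$, $b_k$ involves $\alpha$. The base of the logarithm and the exact constants are assumptions of this sketch and may differ from the paper's.

      import numpy as np

      def agnostic_schedule(T, d):
          """Noise-adaptive settings: E ~ (1/2) log T, beta_k = 2^-k * pi / sqrt(2),
          b_k = 2 * beta_k / sqrt(d). None of these depends on the unknown alpha."""
          E = max(1, int(0.5 * np.log2(T)))                       # base-2 log assumed
          betas = [2.0 ** (-k) * np.pi / np.sqrt(2.0) for k in range(E + 1)]
          bs = [2.0 * beta / np.sqrt(d) for beta in betas]
          return E, betas, bs

      E, betas, bs = agnostic_schedule(T=4096, d=50)
      print(E, np.round(betas[:3], 4), np.round(bs[:3], 4))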

  13. Noise-Adaptive Analysis
  ❖ Main theorem: for all $\alpha \in (0, 1/2)$,
     $\mathrm{err}(\hat{w}) - \mathrm{err}(w^*) = \tilde{O}_P\!\left(\left(\frac{d}{T}\right)^{1/(2\alpha)}\right)$
  ❖ Matching the upper bound in [BBZ07]
  ❖ … and also a lower bound (this paper)

  14. Lower Bound
  ❖ Is there any active learning algorithm that can do better than the $\tilde{O}_P((d/T)^{1/(2\alpha)})$ sample complexity?
  ❖ In general, no [Hanneke, 2015]. But the data distribution $\mathcal{D}$ in that negative example is quite contrived.
  ❖ We show that $\tilde{O}_P((d/T)^{1/(2\alpha)})$ is tight even when $\mathcal{D}$ is as simple as the uniform distribution over the unit sphere.

  15. Lower Bound
  ❖ The "Membership Query Synthesis" (QS) setting:
  ❖ The algorithm $\mathcal{A}$ picks an arbitrary data point $x_i$
  ❖ The algorithm receives its label $y_i$
  ❖ Repeat the procedure $T$ times, with $T$ the query budget
  ❖ QS is more powerful than the pool-based setting when $\mathcal{D}$ has density bounded away from zero.
  ❖ We prove lower bounds for the QS setting, which imply lower bounds for the pool-based setting.

  16. Tsybakov's Main Theorem (Tsybakov and Zaiats, Introduction to Nonparametric Estimation)
  ❖ Let $\mathcal{F}_0 = \{f_0, \cdots, f_M\}$ be a set of models. Suppose
  ❖ Separation: $D(f_j, f_k) \geq 2\rho$ for all $j, k \in \{1, \cdots, M\}$, $j \neq k$
  ❖ Closeness: $\frac{1}{M}\sum_{j=1}^{M} \mathrm{KL}(P_{f_j} \| P_{f_0}) \leq \gamma \log M$
  ❖ Regularity: $P_{f_j} \ll P_{f_0}$ for all $j \in \{1, \cdots, M\}$
  ❖ Then the following bound holds:
     $\inf_{\hat{f}} \sup_{f \in \mathcal{F}_0} \Pr\!\left[D(\hat{f}, f) \geq \rho\right] \geq \frac{\sqrt{M}}{1 + \sqrt{M}}\left(1 - 2\gamma - \sqrt{\frac{2\gamma}{\log M}}\right)$

  17. Negative Example Construction
  ❖ Separation: $D(f_j, f_k) \geq 2\rho$ for all $j, k \in \{1, \cdots, M\}$, $j \neq k$
  ❖ Find a hypothesis class $\mathcal{W} = \{w_1, \cdots, w_M\}$ such that $t \leq \theta(w_i, w_j) \leq 6.5\,t$ for all $i \neq j$
  ❖ … this can be done for all $t \in (0, 1/4)$, using constant-weight coding
  ❖ … and we can guarantee that $\log |\mathcal{W}| = \Omega(d)$
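
  An empirical stand-in for the packing step: instead of the constant-weight coding argument from the slide, this sketch places each $w_i$ at angle $t$ from a fixed axis using a random sign vector in the orthogonal complement, and checks that the pairwise angles land in $[t, 6.5t]$. It is a random construction used only for illustration, not the paper's construction.

      import numpy as np

      rng = np.random.default_rng(0)
      d, t, M = 200, 0.1, 100

      e1 = np.eye(d)[0]
      Z = rng.choice([-1.0, 1.0], size=(M, d - 1)) / np.sqrt(d - 1)       # random unit sign vectors
      W = np.cos(t) * e1 + np.sin(t) * np.hstack([np.zeros((M, 1)), Z])   # each w_i at angle t from e1

      G = np.clip(W @ W.T, -1.0, 1.0)
      angles = np.arccos(G[np.triu_indices(M, k=1)])                      # pairwise angles theta(w_i, w_j)
      print(f"min={angles.min():.3f}, max={angles.max():.3f}, target range=[{t}, {6.5 * t}]")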

  18. Negative Example Construction

  19. Negative Example Construction
  ❖ Closeness: $\frac{1}{M}\sum_{j=1}^{M} \mathrm{KL}(P_{f_j} \| P_{f_0}) \leq \gamma \log M$
  ❖ Bounding the KL divergence between the observation distributions induced by hypotheses $i$ and $j$ (the query distribution is chosen by the algorithm and is identical under both hypotheses, so it cancels):
     $\mathrm{KL}(P_{i,T} \| P_{j,T}) = \mathbb{E}_i\!\left[\log \frac{P^{(i)}_{X_1,Y_1,\cdots,X_T,Y_T}(x_1, y_1, \cdots, x_T, y_T)}{P^{(j)}_{X_1,Y_1,\cdots,X_T,Y_T}(x_1, y_1, \cdots, x_T, y_T)}\right]$
     $= \mathbb{E}_i\!\left[\log \frac{\prod_{t=1}^{T} P^{(i)}_{Y_t|X_t}(y_t|x_t)\, P_{X_t|X_1,Y_1,\cdots,X_{t-1},Y_{t-1}}(x_t|x_1, y_1, \cdots, x_{t-1}, y_{t-1})}{\prod_{t=1}^{T} P^{(j)}_{Y_t|X_t}(y_t|x_t)\, P_{X_t|X_1,Y_1,\cdots,X_{t-1},Y_{t-1}}(x_t|x_1, y_1, \cdots, x_{t-1}, y_{t-1})}\right]$
     $= \mathbb{E}_i\!\left[\log \frac{\prod_{t=1}^{T} P^{(i)}_{Y_t|X_t}(y_t|x_t)}{\prod_{t=1}^{T} P^{(j)}_{Y_t|X_t}(y_t|x_t)}\right]$
     $= \mathbb{E}_i\!\left[\sum_{t=1}^{T} \mathbb{E}_i\!\left[\log \frac{P^{(i)}_{Y|X}(y_t|x_t)}{P^{(j)}_{Y|X}(y_t|x_t)} \,\Big|\, X_1 = x_1, \cdots, X_T = x_T\right]\right]$
     $\leq T \cdot \sup_{x \in \mathcal{X}} \mathrm{KL}\!\left(P^{(i)}_{Y|X}(\cdot|x) \,\big\|\, P^{(j)}_{Y|X}(\cdot|x)\right)$

  20. Lower Bound (Tsybakov and Zaiats, Introduction to Nonparametric Estimation)
  ❖ Let $\mathcal{F}_0 = \{f_0, \cdots, f_M\}$ be a set of models. Suppose
  ❖ Separation: $D(f_j, f_k) \geq 2\rho$ for all $j, k \in \{1, \cdots, M\}$, $j \neq k$
  ❖ Closeness: $\frac{1}{M}\sum_{j=1}^{M} \mathrm{KL}(P_{f_j} \| P_{f_0}) \leq \gamma \log M$
  ❖ Regularity: $P_{f_j} \ll P_{f_0}$ for all $j \in \{1, \cdots, M\}$
  ❖ Take $\rho = \Theta(t) = \Theta\!\left((d/T)^{(1-\alpha)/(2\alpha)}\right)$ and $\log M = \Theta(d)$
  ❖ We then have $\inf_{\hat{w}} \sup_{w^*} \Pr\!\left[\theta(\hat{w}, w^*) \geq \frac{t}{2}\right] = \Omega(1)$
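
  Connecting the angle lower bound above to the excess-error lower bound on the next slide: under TNC, a lower bound on the angle translates directly into a lower bound on excess error, since $\theta \mapsto \mu\theta^{1/(1-\alpha)}$ is increasing.

      \theta(\hat{w}, w^*) \;\gtrsim\; t \;=\; \Theta\!\big((d/T)^{(1-\alpha)/(2\alpha)}\big)
      \quad\Longrightarrow\quad
      \mathrm{err}(\hat{w}) - \mathrm{err}(w^*) \;\geq\; \mu\,\theta(\hat{w}, w^*)^{1/(1-\alpha)}
      \;\gtrsim\; \mu\, t^{1/(1-\alpha)} \;=\; \Theta\!\big((d/T)^{1/(2\alpha)}\big).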

  21. Lower Bound
  ❖ Suppose $\mathcal{D}$ has density bounded away from zero and fix $\mu > 0$, $\alpha \in (0, 1)$. Let $\mathcal{P}_{Y|X}(\mu, \alpha)$ be the class of conditional label distributions satisfying $(\mu, \alpha)$-TNC. Then we have
     $\inf_{\mathcal{A}} \sup_{P \in \mathcal{P}_{Y|X}} \mathbb{E}_P\!\left[\mathrm{err}(\hat{w}) - \mathrm{err}(w^*)\right] \geq \Omega\!\left[\left(\frac{d}{T}\right)^{1/(2\alpha)}\right]$

  22. Extension: "Proactive" Learning
  ❖ Suppose there are $m$ different users (labelers) who share the same Bayes classifier $w^*$ but have different TNC parameters $\alpha_1, \cdots, \alpha_m$
  ❖ The TNC parameters are not known.
  ❖ At each iteration, the algorithm picks a data point $x$ and also a user $j$, and observes the label $f(x; j)$
  ❖ The goal is to estimate the Bayes classifier $w^*$

  23. Extension: "Proactive" Learning
  ❖ Algorithm framework:
  ❖ Operate in $E = O(\log T)$ iterations.
  ❖ At each iteration, use conventional bandit algorithms to address the exploration-exploitation tradeoff among labelers (a sketch of this selection step follows below)
  ❖ Key property: the search space $\{\beta_k\}$ and margin $\{b_k\}$ do not depend on the unknown TNC parameters.
  ❖ Many interesting extensions: what if multiple labelers can be involved each time?
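
  A hedged sketch of the bandit step mentioned above: a standard UCB1 rule for choosing which labeler to query at each round. The reward signal label_quality_estimate is a placeholder for whatever proxy the proactive-learning algorithm actually uses; it is not specified on the slide.

      import numpy as np

      def ucb1_pick(counts, rewards, t):
          """UCB1: pick the labeler maximizing empirical mean reward + exploration bonus."""
          counts, rewards = np.asarray(counts, float), np.asarray(rewards, float)
          if (counts == 0).any():
              return int(np.argmin(counts))                    # try each labeler at least once
          means = rewards / counts
          bonus = np.sqrt(2.0 * np.log(t) / counts)
          return int(np.argmax(means + bonus))

      def proactive_round(x, labelers, counts, rewards, t, label_quality_estimate):
          """One round: choose labeler j by UCB1, query f(x; j), update bandit statistics."""
          j = ucb1_pick(counts, rewards, t)
          y = labelers[j](x)                                   # observe f(x; j)
          counts[j] += 1
          rewards[j] += label_quality_estimate(x, y, j)        # placeholder reward proxy
          return j, y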

  24. Thanks! Questions?
