  1. Selective Sampling (Realizable). Ji Xu, October 2nd, 2017.

  2. Basic Settings. Model:
  ◮ D: a distribution over X × Y, where X is the input space and Y = {±1} is the set of possible labels.
  ◮ (X, Y) ∈ X × Y: a pair of random variables with joint distribution D.
  ◮ H: a set of hypotheses mapping X to Y. The error of a hypothesis h : X → Y is err(h) := Pr(h(X) ≠ Y).
  ◮ h* := argmin{err(h) : h ∈ H}: a hypothesis with minimum error in H.

  3. Basic Settings. Goal: with high probability, return ĥ ∈ H such that err(ĥ) ≤ err(h*) + ε. In the realizable case err(h*) = 0, so we want err(ĥ) ≤ ε.

  4. Basic Settings. Passive vs. active:
  ◮ Passive setting:
    ◮ At time t, observe X_t and choose h_t ∈ H.
    ◮ Make the prediction h_t(X_t), then observe the feedback Y_t.
    ◮ Minimize the total number of mistakes h_t(X_t) ≠ Y_t.

  5. Basic Settings. Passive vs. active:
  ◮ Active setting:
    ◮ At time t, observe X_t.
    ◮ Choose whether to query the feedback Y_t.
    ◮ Minimize both the number of mistakes of ĥ and the total number of queries for the correct label Y_t.
  Intuitively, (X_t, Y_t) provides no information if h(X_t) is the same for all hypotheses still in play at time t, so we should not query such X_t.

  6. Concepts.
  Definition. For a set of hypotheses V, the region of disagreement R(V) is
    R(V) := {x ∈ X : ∃ h, h′ ∈ V such that h(x) ≠ h′(x)}.
  Definition. For a hypothesis set H and a sample set Z_T = {(X_t, Y_t) : t = 1, …, T}, the uncertainty region U(H, Z_T) is
    U(H, Z_T) := {x ∈ X : ∃ h, h′ ∈ H such that h(x) ≠ h′(x) and h(X_t) = h′(X_t) = Y_t ∀ t ∈ [T]}.
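
  Both definitions are directly computable for a finite hypothesis class and a finite pool of points. The following minimal sketch does exactly that; the 1-D threshold class, the grid of candidate thresholds, and the labeled sample Z are illustrative assumptions, not part of the slides.

```python
def region_of_disagreement(V, pool):
    """R(V): points in the pool on which some pair of hypotheses in V disagree."""
    return {x for x in pool if len({h(x) for h in V}) > 1}

def uncertainty_region(H, Z, pool):
    """U(H, Z): disagreement region of the consistent set C = {h : h(X_t) = Y_t for all t}."""
    C = [h for h in H if all(h(x) == y for x, y in Z)]
    return region_of_disagreement(C, pool)

# Illustrative example: 1-D threshold classifiers h_r(x) = +1 iff x >= r.
thresholds = [i / 10 for i in range(11)]
H = [lambda x, r=r: 1 if x >= r else -1 for r in thresholds]
pool = [i / 20 for i in range(21)]
Z = [(0.15, -1), (0.85, +1)]                  # labeled sample so far
print(sorted(uncertainty_region(H, Z, pool)))  # points whose label is still uncertain
```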

  7. Remarks.
  ◮ Let C = {h ∈ H : h(X_t) = Y_t ∀ t ∈ [T]}. Then U(H, Z_T) = R(C).
  ◮ Ideally, the area of the uncertainty region is monotonically non-increasing as more training samples arrive.
  ◮ If we can control the sampling procedure over X_t, it is better to sample only on U(H, Z_t) (selective sampling, or approximate selective sampling).
  ◮ The labels Y_t inferred for points X_t outside the query region are always correct; we need to query X_{t+1} only if X_{t+1} ∈ U(H, Z_t).
  ◮ The complexity of finding a good set Ĥ with h* ∈ Ĥ ⊆ H can intuitively be measured by the ratio between Pr(R(Ĥ)) and sup_{h ∈ Ĥ} err(h).

  8. Concepts.
  Definition. Redefine the region of disagreement R(h, r) of radius r around a hypothesis h ∈ H in the disagreement metric space (H, ρ) as
    R(h, r) := {x ∈ X : ∃ h′ ∈ B(h, r) such that h(x) ≠ h′(x)},
  where the disagreement (pseudo)metric ρ on H is defined by ρ(h, h′) := Pr(h(X) ≠ h′(X)). Hence err(h) = ρ(h, h*).
  Remark: We have R(h*, r) ⊆ R(B(h*, r)), but the reverse inclusion may not hold.

  9. Concepts.
  Definition. The disagreement coefficient θ(h, H, D) with respect to a hypothesis h ∈ H in the disagreement metric space (H, ρ) is
    θ(h, H, D) := sup_{r > 0} Pr(X ∈ R(h, r)) / r.
  Examples:
  ◮ X uniform on [0, 1] and H = {I_{X ≥ r} : r > 0}. Then θ(h, H, D) = 2 for all h ∈ H.
  ◮ Replace H by H = {I_{X ∈ [a, b]} : 0 < a < b < 1}. Then θ(h, H, D) = max(4, 1/Pr(h(X) = 1)) for all h ∈ H.
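
  The first example is easy to check numerically: for thresholds h_c = I{x ≥ c} under uniform X, we have ρ(h_c, h_{c′}) = |c − c′|, so x lies in R(h_c, r) exactly when |x − c| < r and the disagreement mass is at most 2r. A small Monte Carlo sketch (the sample size and grid of radii are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def disagreement_mass(c, r, n=200_000):
    # For h_c = I{x >= c} and X ~ U[0,1], x is in R(h_c, r) iff some
    # threshold c' with |c' - c| <= r labels x differently, i.e. |x - c| < r.
    x = rng.random(n)
    return np.mean(np.abs(x - c) < r)

c = 0.5
radii = np.logspace(-3, -1, 15)
theta_hat = max(disagreement_mass(c, r) / r for r in radii)
print(f"theta estimate: {theta_hat:.2f}")   # should come out close to 2
```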

  10. Examples.
  Proposition. Let P_X be the uniform distribution on the unit sphere S^{d−1} := {x ∈ R^d : ‖x‖₂ = 1} ⊂ R^d, and let H be the class of homogeneous linear threshold functions on R^d, i.e., H = {h_w : h_w(x) = sign(⟨w, x⟩), w ∈ S^{d−1}}. Then there is an absolute constant C > 0 such that θ(h, H, P_X) ≤ C · √d.
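
  The √d scaling can also be checked by simulation. For this class, ρ(h_w, h_{w′}) equals the angle between w and w′ divided by π, so x ∈ R(h_w, r) exactly when its angular distance to the decision boundary is at most πr, i.e. |⟨w, x⟩| ≤ sin(πr). The sketch below relies on that fact; the dimensions, radii grid, and sample size are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def theta_estimate(d, n=100_000):
    """Monte Carlo estimate of theta(h_w, H, P_X) for homogeneous halfspaces
    under the uniform distribution on the sphere, using the band
    characterization R(h_w, r) = {x : |<w, x>| <= sin(pi * r)}."""
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)
    x = rng.standard_normal((n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)   # uniform on the sphere
    margins = np.abs(x @ w)
    radii = np.logspace(-3, -1, 15)
    return max(np.mean(margins <= np.sin(np.pi * r)) / r for r in radii)

for d in (4, 16, 64):
    print(d, theta_estimate(d) / np.sqrt(d))  # ratio stays roughly constant in d
```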

  11. Algorithm (CAL)
  ◮ Initialize: Z_0 := ∅, V_0 := H.
  ◮ For t = 1, 2, …, n:
    ◮ Obtain an unlabeled data point X_t.
    ◮ If X_t ∈ R(V_{t−1}):
      (a) Then: query Y_t and set Z_t := Z_{t−1} ∪ {(X_t, Y_t)}.
      (b) Else: set Ỹ_t := h(X_t) for any h ∈ V_{t−1} and set Z_t := Z_{t−1} ∪ {(X_t, Ỹ_t)}; or simply set Z_t := Z_{t−1}.
    ◮ Set V_t := {h ∈ H : h(X_i) = Y_i ∀ (X_i, Y_i) ∈ Z_t}.
  ◮ Return: any h ∈ V_n.
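
  A minimal runnable sketch of CAL with an explicitly maintained version space, instantiated for a finite 1-D threshold class; the grid size, horizon, and true threshold are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
thresholds = np.linspace(0, 1, 1001)   # finite hypothesis class H
h_star = 0.5                            # realizable target threshold

def label(c, x):
    return 1 if x >= c else -1

V = thresholds                          # version space V_0 = H
queries = 0
for t in range(1000):
    x = rng.random()                    # unlabeled point X_t
    preds = {label(c, x) for c in V}
    if len(preds) > 1:                  # X_t in R(V_{t-1}): query the label
        y = label(h_star, x)
        queries += 1
        V = np.array([c for c in V if label(c, x) == y])
    # else: every h in V agrees on x, so the label is forced and uninformative

print(f"queries: {queries}, version space width: {V.max() - V.min():.4f}")
```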

  12. Algorithm (Reduction-based CAL)
  ◮ Initialize: Z_0 := ∅.
  ◮ For t = 1, 2, …, n:
    ◮ Obtain an unlabeled data point X_t.
    ◮ If there exist both
      • h₊ ∈ H consistent with Z_{t−1} ∪ {(X_t, +1)}, and
      • h₋ ∈ H consistent with Z_{t−1} ∪ {(X_t, −1)}:
      (a) Then: query Y_t and set Z_t := Z_{t−1} ∪ {(X_t, Y_t)}.
      (b) Else (only h_y exists for some y ∈ {±1}): set Ỹ_t := y and Z_t := Z_{t−1} ∪ {(X_t, Ỹ_t)}.
  ◮ Return: any h ∈ H consistent with Z_n.
  Remark: Reduction-based CAL is equivalent to CAL; it replaces the explicit version space with two calls per round to a consistency oracle, as in the sketch below.
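
  A matching sketch of Reduction-based CAL. The version space is never materialized; each round makes two consistency-oracle calls, which for 1-D thresholds reduce to an interval-feasibility check. The setup mirrors the CAL sketch above and is equally illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
h_star = 0.5                            # realizable target threshold

def consistent(Z):
    """Consistency oracle: does some threshold c fit all of Z?
    Need every negative point below c and every positive point at/above c."""
    lo = max((x for x, y in Z if y == -1), default=-np.inf)
    hi = min((x for x, y in Z if y == +1), default=np.inf)
    return lo < hi

Z, queries = [], 0
for t in range(1000):
    x = rng.random()
    if consistent(Z + [(x, +1)]) and consistent(Z + [(x, -1)]):
        y = 1 if x >= h_star else -1    # both labels feasible: query the truth
        queries += 1
    else:                               # only one label feasible: infer it
        y = 1 if consistent(Z + [(x, +1)]) else -1
    Z.append((x, y))

print(f"queries: {queries} out of 1000 points")
```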

  13. Label Complexity Analysis
  Theorem. The expected number of labels queried by Reduction-based CAL after n iterations is at most
    O(θ(h*, H, D) · d · log² n),
  where d is the VC dimension of the class H. Moreover, for any ε > 0 and δ > 0, if
    n = O((1/ε)(d log(1/ε) + log(1/δ))),
  then with probability 1 − δ the output ĥ of Reduction-based CAL satisfies err(ĥ) ≤ ε.

  14. Proof
  Note that with probability 1 − δ_t, any h ∈ H consistent with Z_t has error at most
    r_t := O((1/t)(d log t + log(1/δ_t))),
  where δ_t > 0 will be chosen later. (This is the standard realizable-case generalization bound, i.e., the case when P_n f_n = 0, P f = 0.) It also yields the sample size n = O((1/ε)(d log(1/ε) + log(1/δ))) claimed in the theorem.
  Let G_t be the event that the above bound holds. Conditioned on G_t, we have
    {h ∈ H : h is consistent with Z_t} ⊆ B(h*, r_t).
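
  As a sanity check on the sample-size claim (a step the slides leave implicit): setting r_n ≤ ε and solving for n gives, up to the usual absorption of log-log terms into the constant,

```latex
r_n \;=\; \frac{C}{n}\Bigl(d\log n + \log\tfrac{1}{\delta}\Bigr) \;\le\; \epsilon
\qquad\text{once}\qquad
n \;\ge\; \frac{C'}{\epsilon}\Bigl(d\log\tfrac{1}{\epsilon} + \log\tfrac{1}{\delta}\Bigr).
```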

  15. Proof
  Note that we query Y_{t+1} if and only if some h ∈ H is consistent with Z_t ∪ {(X_{t+1}, −h*(X_{t+1}))}, i.e., some consistent h disagrees with h* on X_{t+1}. Hence, conditioned on G_t, if we query Y_{t+1} then X_{t+1} ∈ R(h*, r_t). Therefore,
    Pr(Y_{t+1} is queried | G_t) ≤ Pr(X_{t+1} ∈ R(h*, r_t) | G_t).

  16. Proof
  Let Q_t := I{Y_t is queried}. The expected total number of queries is

\begin{align*}
\mathbb{E}\Bigl[\sum_{t=1}^{n} Q_t\Bigr]
  &\le 1 + \sum_{t=1}^{n-1} \Pr(Q_{t+1} = 1) \\
  &= 1 + \sum_{t=1}^{n-1} \Bigl[\Pr(Q_{t+1} = 1 \mid G_t)\Pr(G_t)
      + \Pr(Q_{t+1} = 1 \mid G_t^{c})\bigl(1 - \Pr(G_t)\bigr)\Bigr] \\
  &\le 1 + \sum_{t=1}^{n-1} \Bigl[\Pr(Q_{t+1} = 1 \mid G_t)\Pr(G_t) + \delta_t\Bigr] \\
  &\le 1 + \sum_{t=1}^{n-1} \Bigl[\Pr(X_{t+1} \in R(h^*, r_t) \mid G_t)\Pr(G_t) + \delta_t\Bigr].
\end{align*}

  17. Proof
  By the definition of the disagreement coefficient,

\[
\Pr(X_{t+1} \in R(h^*, r_t) \mid G_t)\Pr(G_t)
  \;\le\; \Pr(X_{t+1} \in R(h^*, r_t))
  \;\le\; r_t \cdot \theta(h^*, H, D).
\]

  Hence,

\begin{align*}
\mathbb{E}\Bigl[\sum_{t=1}^{n} Q_t\Bigr]
  &\le 1 + \sum_{t=1}^{n-1} \bigl[r_t\,\theta(h^*, H, D) + \delta_t\bigr] \\
  &= O\Bigl(\sum_{t=1}^{n-1} \Bigl[\frac{\theta(h^*, H, D)}{t}\Bigl(d\log t + \log\tfrac{1}{\delta_t}\Bigr) + \delta_t\Bigr]\Bigr).
\end{align*}

  Choosing δ_t = 1/t, we obtain

\[
\mathbb{E}\Bigl[\sum_{t=1}^{n} Q_t\Bigr] \;\le\; O\bigl(\theta(h^*, H, D)\, d \log^2 n\bigr).
\]
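
  The final step uses an elementary estimate the slides leave implicit: with δ_t = 1/t, each summand becomes ((d + 1) log t + 1)/t, and

```latex
\sum_{t=1}^{n-1}\frac{(d+1)\log t + 1}{t}
\;\le\; (d+1)\sum_{t=1}^{n-1}\frac{\log t + 1}{t}
\;=\; O\bigl(d\log^2 n\bigr),
\qquad\text{since}\quad
\sum_{t=2}^{n}\frac{\log t}{t} \;\le\; \frac{\log^2 n}{2} + O(1).
```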
