  1. Crowdsourced Classification with XOR Queries: An Algorithm with Optimal Sample Complexity Daesung Kim and Hye Won Chung School of Electrical Engineering, KAIST ISIT, 2020

  2. Introduction
     Crowdsourcing: a method to label data with a very small budget; collect answers from easy-to-access but unreliable people (the crowd).
     Goal: recover the true labels from noisy answers with the minimum number of queries.
     What is the most efficient querying method? What is the best recovery algorithm?

  3. Introduction
     Querying methods:
     Repetition (REP) query: ask about one item several times to different people.
     Homogeneous (HOMO) query: ask whether a group of items all share the same label.
     Let m be the total number of items, and d the number of items in each group of a HOMO query (d is called the query degree).
     When queries are made at random and the answers are noiseless, perfectly recovering the binary labels requires Ω(m log m) queries with REP and Ω((2^(d−1)/d) m log m) queries with HOMO.
     Is there a querying method whose required number of queries decreases as d increases?

  4. Introduction
     XOR query: we ask whether the number of items with a certain label in a group is even or odd.
     A recent work by Ahn et al. [1] proved that Ω( m log m / ( d ( √ǫ − √(1 − ǫ) )² ) ) queries are required to perfectly recover the binary labels with XOR queries.
     Contributions:
     What is the required number of queries when the query degree and the error probability are not uniform over the queries?
     Can the bound be achieved by a polynomial-time algorithm?
     [1] K. Ahn, K. Lee, and C. Suh, "Community recovery in hypergraphs," IEEE Transactions on Information Theory, vol. 65, no. 10, pp. 6561–6579, 2019.
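In the ±1 label convention used later in the talk, a noiseless XOR answer is simply the product of the selected labels: it is +1 exactly when the group contains an even number of −1 labels. A minimal sketch (the helper name `xor_query` is illustrative, not from the paper):

```python
from math import prod

def xor_query(labels, group):
    """Noiseless XOR answer over the items indexed by `group`:
    the product of their {-1, +1} labels."""
    return prod(labels[i] for i in group)

x = [1, -1, -1, 1]                  # true labels of m = 4 items
assert xor_query(x, [0, 1]) == -1   # one -1 in the group: odd count
assert xor_query(x, [1, 2]) == 1    # two -1s in the group: even count
```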

  5. Model
     Goal: recover m binary labels x ∈ {−1, 1}^m.
     Random Query Design:
     1. Draw a query degree d from a distribution Φ = {Φ_1, · · · , Φ_D}.
     2. Select d items uniformly at random among the (m choose d) possibilities and form an XOR query.
     3. Choose a worker at random and assign the query to that worker.
     4. Repeat steps 1–3 n times.
     [Figure: tripartite graph of items, queries, and workers; the illustrated query has degree 2.]
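The four-step random query design above can be sketched as a small simulation (a sketch under stated assumptions; `generate_queries` and its signature are illustrative, not the authors' code):

```python
import random

def generate_queries(m, n, w, phi):
    """Random query design. phi maps a degree d to its probability Phi_d
    (values sum to 1); w is the number of workers.
    Returns a list of (group, worker) pairs, one per query."""
    degrees, weights = zip(*phi.items())
    queries = []
    for _ in range(n):                                 # 4. repeat n times
        d = random.choices(degrees, weights)[0]        # 1. draw a degree
        group = tuple(random.sample(range(m), d))      # 2. d distinct items
        worker = random.randrange(w)                   # 3. assign a worker
        queries.append((group, worker))
    return queries

qs = generate_queries(m=50, n=20, w=5, phi={1: 0.3, 2: 0.7})
assert len(qs) == 20
```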

  6. Model
     Noise Model:
     Dawid–Skene model: the error probability depends on the assigned worker.
     Here, the error probability ǫ_{k,d} < 0.5 depends on both the assigned worker (k) and the query degree (d).
     The error probability usually increases with respect to d.
     Tripartite Graph Representation: [Figure: items, queries, and workers as the three node sets of a tripartite graph.]

  7. Information-theoretic Bound
     Theorem 1: With the maximum likelihood (ML) estimator x̂ ∈ {1, −1}^m, strong recovery is possible, i.e., P[x̂ ≠ x] → 0 as m → ∞, if and only if the number of queries satisfies

     n ≥ (1 + η) m log m / ( Σ_{d=1}^{D} Σ_{k=1}^{w} (d Φ_d / w) ( √(1 − ǫ_{k,d}) − √ǫ_{k,d} )² ),   (1)

     for a small constant η > 0.
     The bound is inversely proportional to the average query degree d̄ = Σ_{d=1}^{D} d Φ_d.
     We assume that the worker reliabilities are known and that the ML estimator makes use of them.
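The right-hand side of (1) can be evaluated numerically; a minimal sketch, assuming `phi` encodes Φ_d and `eps` encodes ǫ_{k,d} (both names and the dictionary layout are illustrative):

```python
from math import log, sqrt

def sample_complexity(m, phi, eps, eta=0.1):
    """Right-hand side of (1).
    phi: dict d -> Phi_d; eps: dict (k, d) -> eps_{k,d} for workers k = 1..w."""
    w = len({k for (k, d) in eps})
    denom = sum(d * phi_d / w * (sqrt(1 - eps[(k, d)]) - sqrt(eps[(k, d)])) ** 2
                for d, phi_d in phi.items() for k in range(1, w + 1))
    return (1 + eta) * m * log(m) / denom
```

With one perfectly reliable worker and only degree-1 queries the denominator is 1, so the bound reduces to (1 + η) m log m, matching the classical order for repetition queries.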

  8. Algorithm
     We are given a number of queries equal to the bound in Theorem 1.
     Algorithm Sketch:
     1. Detection of labels: P[x̂_i^(1) ≠ x_i] < 1/2.
     2. Weak recovery of labels: P[x̂_i^(2) ≠ x_i] → 0.
     3. Estimating worker reliabilities: for any δ > 0, P[ |ǫ̂_{k,d} − ǫ_{k,d}| > δ ] → 0.
     4. Strong recovery of labels: P[x̂^(4) ≠ x] → 0.
     Notes:
     For independence, we split the queries into 4 groups and use each group only in one step of the algorithm.
     We assume that there are at least m degree-1 queries.

  9. Algorithm
     Step 1: Detection of labels. m degree-1 queries are used in this step.
     [Figure: degree-1 queries connecting each item to a worker.]
     Estimate of this step: x̂_i^(1) = sign( Σ_{j ∈ ∂x_i} y_j ).
     P[x̂_i^(1) ≠ x_i] < 1/2 holds because the workers' average error probability is less than 1/2, i.e., ǭ = (1/w) Σ_{k ∈ [w]} ǫ_{k,1} < 1/2.
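Step 1 is a plain majority vote over the degree-1 answers; a minimal sketch (the input layout is an assumption for illustration):

```python
def detect_labels(m, answers):
    """Step 1 sketch: majority vote over degree-1 answers.
    answers: list of (item_index, y_j) pairs with y_j in {-1, +1}."""
    score = [0] * m
    for i, y in answers:
        score[i] += y
    # ties (score 0) are broken toward +1 so every item gets a label
    return [1 if s >= 0 else -1 for s in score]

est = detect_labels(3, [(0, 1), (0, 1), (1, -1), (2, 1), (2, -1)])
assert est == [1, -1, 1]
```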

  10. Algorithm
     Step 2: Weak recovery of labels. m log log m queries are used in this step.
     [Figure: messages passed from queries to items on the tripartite graph.]
     Estimate of the j-th query for the i-th item: m_{j→i}^(2) = y_j Π_{i′ ∈ ∂y_j \ {i}} x̂_{i′}^(1).

  11. Algorithm
     Step 2: Weak recovery of labels (cont.).
     Estimate of the j-th query for the i-th item: m_{j→i}^(2) = y_j Π_{i′ ∈ ∂y_j \ {i}} x̂_{i′}^(1).
     Estimate of this step: x̂_i^(2) = sign( Σ_{j ∈ ∂x_i} m_{j→i}^(2) ).
     If |∂x_i| = Θ(log log m), then P[x̂_i^(2) ≠ x_i] ≤ 1/(log m)^K for a positive constant K.
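The Step-2 message rule above can be sketched as follows, assuming queries arrive as (group, answer) pairs and `x1` is the Step-1 estimate (the function name and layout are illustrative):

```python
from math import prod

def weak_recovery(m, queries, x1):
    """Step 2 sketch. queries: list of (group, y_j); x1: Step-1 estimate.
    Each answer y_j, combined with the Step-1 estimates of the other items
    in its group, casts a vote m^(2)_{j->i} on item i."""
    score = [0] * m
    for group, y in queries:
        for i in group:
            others = prod(x1[ip] for ip in group if ip != i)
            score[i] += y * others            # m^(2)_{j->i}
    return [1 if s >= 0 else -1 for s in score]
```

With noiseless answers and a correct Step-1 estimate, every vote agrees with the true label, so the sign of the sum recovers x exactly.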

  12. Algorithm
     Step 3: Estimating workers' reliabilities. w log m queries are used in this step.
     [Figure: queries grouped by the worker they were assigned to.]
     Error indicator of the j-th query, assigned to the k-th worker: E_j^(3) = 1{ y_j ≠ Π_{i ∈ ∂y_j} x̂_i^(2) } ∼ Ber(ǫ_{k,d}).

  13. Algorithm
     Step 3: Estimating workers' reliabilities (cont.).
     Error indicator of the j-th query, assigned to the k-th worker: E_j^(3) = 1{ y_j ≠ Π_{i ∈ ∂y_j} x̂_i^(2) } ∼ Ber(ǫ_{k,d}).
     Reliability estimate: ǫ̂_{k,d} = ( Σ_{j ∈ ∂w_{k,d}} E_j^(3) ) / |∂w_{k,d}|.
     ǫ̂_{k,d} is arbitrarily close to ǫ_{k,d} for all k, d when |∂w_{k,d}| = Θ(log m).
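Step 3 is an empirical error frequency per (worker, degree) pair; a minimal sketch, assuming queries are pre-grouped by (k, d) and `x2` is the Step-2 estimate (names are illustrative):

```python
from math import prod

def estimate_reliabilities(queries_by_worker, x2):
    """Step 3 sketch. queries_by_worker: dict (k, d) -> list of (group, y_j)
    assigned to worker k with degree d. E_j^(3) = 1 whenever the answer
    disagrees with the XOR of the Step-2 estimated labels."""
    eps_hat = {}
    for (k, d), qs in queries_by_worker.items():
        errors = sum(y != prod(x2[i] for i in group) for group, y in qs)
        eps_hat[(k, d)] = errors / len(qs)   # empirical error frequency
    return eps_hat
```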

  14. Algorithm
     Step 4: Strong recovery of labels. All remaining queries are used in this step.
     [Figure: weighted messages passed from queries to items.]
     Weighted estimate of the j-th query for the i-th item: M_{j→i}^(4) = log( (1 − ǫ̂_{k,d}) / ǫ̂_{k,d} ) · m_{j→i}^(4).

  15. Algorithm
     Step 4: Strong recovery of labels (cont.).
     Weighted estimate of the j-th query for the i-th item: M_{j→i}^(4) = log( (1 − ǫ̂_{k,d}) / ǫ̂_{k,d} ) · m_{j→i}^(4).
     The final estimate: x̂_i^(4) = sign( Σ_{j ∈ ∂x_i} M_{j→i}^(4) ).
     We prove that P[x̂_i^(4) ≠ x_i] ≤ exp( −(1 + η/3) log m ).
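Step 4 repeats the Step-2 vote but weights each message by the log-likelihood ratio of its worker/degree pair; a minimal sketch, assuming each query carries its (worker, degree) tag and `x2`, `eps_hat` come from the earlier steps (layout is illustrative):

```python
from math import log, prod

def strong_recovery(m, queries, x2, eps_hat):
    """Step 4 sketch. queries: list of ((k, d), group, y_j) triples;
    each vote is scaled by log((1 - eps_hat)/eps_hat), so answers from
    reliable worker/degree pairs count more."""
    score = [0.0] * m
    for (k, d), group, y in queries:
        llr = log((1 - eps_hat[(k, d)]) / eps_hat[(k, d)])
        for i in group:
            m4 = y * prod(x2[ip] for ip in group if ip != i)
            score[i] += llr * m4              # M^(4)_{j->i}
    return [1 if s >= 0 else -1 for s in score]
```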

  16. Simulation Results
     Comparison between queries: compare XOR, REP, and HOMO queries.
     d-coin-flip model: a worker makes d independent judgements, each wrong with probability ǫ_k, and the XOR (resp. HOMO answer) of them becomes the output:

     ǫ_{k,d} = Σ_{l ∈ [1:d], l odd} (d choose l) ǫ_k^l (1 − ǫ_k)^{d−l} = ( 1 − (1 − 2ǫ_k)^d ) / 2.

     d = 3 ∼ 6 (chosen at random), ǫ_k ∈ {0.005, 0.010, · · · , 0.100}.
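The d-coin-flip error probability above can be checked directly: the XOR of d flips is wrong exactly when an odd number of them err, and summing the odd binomial terms reproduces the closed form (a sketch; `eps_kd` is an illustrative name):

```python
from math import comb

def eps_kd(eps_k, d):
    """Probability that the XOR of d independent judgements is wrong:
    an odd number of the d coin flips err."""
    return sum(comb(d, l) * eps_k**l * (1 - eps_k)**(d - l)
               for l in range(1, d + 1, 2))

# matches the closed form (1 - (1 - 2*eps_k)^d) / 2 over the slide's range
for eps in (0.005, 0.05, 0.1):
    for d in range(3, 7):
        assert abs(eps_kd(eps, d) - (1 - (1 - 2 * eps)**d) / 2) < 1e-12
```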

  17. Simulation Results
     Comparison between queries.
     [Figure: frame error rate vs. number of queries, comparing XOR with REP (BP, EoR, SEM) and HOMO; vertical lines mark the information-theoretic limits for XOR and for REP.]

  18. Simulation Results
     Importance of Estimating Worker Reliabilities.
     Compare the performance of the proposed algorithm with and without Steps 3–4.
     For half of the workers, ǫ_1 = 0.05; for the other half, ǫ_2 = 0.15, 0.25, 0.35, 0.45.
     [Figure: frame error rate vs. number of queries; solid lines: with Steps 3–4, dashed lines: without Steps 3–4.]

  19. Simulation Results
     Real Experiment: compare XOR and REP queries. Data collected from Amazon Mechanical Turk.
     (a) Degree-1 (REP) query: "Check TRUE if the image contains cat, and check FALSE if the image contains dog."
     (b) Degree-4 XOR query: "Check TRUE if odd (1, 3) number of images contain cat, and check FALSE if even (0, 2, 4) number of images contain cat."

  20. Simulation Results
     Real Experiment (cont.).
     [Figure: frame error rate vs. number of queries on the Mechanical Turk data, comparing XOR with REP-SEM, REP-BP, and REP-EoR.]

  21. Summary
     Studied crowdsourced classification with multi-degree XOR queries under a general error model.
     Characterized the information-theoretic limit on the number of queries for strong recovery of the labels.
     Proposed an efficient algorithm achieving this limit without knowledge of the error parameters.
     Demonstrated the effectiveness of XOR queries through synthetic and real experiments.
     Full paper available at arXiv:2001.11775.
