  1. Crowdsourced Classification with XOR Queries: An Algorithm with Optimal Sample Complexity Daesung Kim and Hye Won Chung School of Electrical Engineering, KAIST ISIT, 2020

  2. Introduction
     Crowdsourcing: a method to label data with a very small budget; collect answers from easy-to-access but unreliable people (the crowd).
     Goal: recover the true labels from noisy answers with the minimum number of queries.
     What is the most efficient querying method? What is the best recovery algorithm?

  3. Introduction
     Querying methods:
     Repetition (REP) query: ask about one item several times to different people.
     Homogeneous (HOMO) query: ask whether a group of items all share the same label.
     Let m be the total number of items, and d the number of items in each group of a HOMO query (d is called the query degree).
     When queries are made at random and the answers are noiseless, perfectly recovering the binary labels requires Ω(m log m) queries with REP and Ω((2^(d−1)/d) m log m) queries with HOMO.
     Is there a querying method whose required number of queries decreases as d increases?

  4. Introduction
     XOR query: we ask whether the number of items with a certain label in a group is even or odd.
     A recent work by Ahn et al. [1] proved that Ω( m log m / ( d ( √ǫ − √(1 − ǫ) )² ) ) queries are required to perfectly recover the binary labels with XOR queries.
     Contributions:
     What is the required number of queries when the query degree and the error probability are not uniform over the queries?
     Can the bound be achieved by a polynomial-time algorithm?
     [1] K. Ahn, K. Lee, and C. Suh, "Community recovery in hypergraphs," IEEE Transactions on Information Theory, vol. 65, no. 10, pp. 6561–6579, 2019.
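In the ±1 label convention used later in the talk, a noiseless XOR answer is simply the product of the selected labels: it is +1 exactly when the group contains an even number of −1 labels. A minimal sketch (the helper name `xor_query` is illustrative, not from the paper):

```python
from math import prod

def xor_query(labels, group):
    """Noiseless XOR answer over the items indexed by `group`:
    the product of their {-1, +1} labels."""
    return prod(labels[i] for i in group)

x = [1, -1, -1, 1]                  # true labels of m = 4 items
assert xor_query(x, [0, 1]) == -1   # one -1 in the group: odd count
assert xor_query(x, [1, 2]) == 1    # two -1s in the group: even count
```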

  5. Model
     Goal: recover m binary labels x ∈ {−1, 1}^m.
     Random Query Design:
     1. Draw a query degree d from a distribution Φ = {Φ_1, · · · , Φ_D}.
     2. Select d items uniformly at random among the (m choose d) possibilities and form an XOR query.
     3. Choose a worker at random and assign the query to that worker.
     4. Repeat steps 1–3 n times.
     [Figure: tripartite graph of items, queries, and workers; the illustrated query has degree 2.]
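The four-step random query design above can be sketched as a small simulation (a sketch under stated assumptions; `generate_queries` and its signature are illustrative, not the authors' code):

```python
import random

def generate_queries(m, n, w, phi):
    """Random query design. phi maps a degree d to its probability Phi_d
    (values sum to 1); w is the number of workers.
    Returns a list of (group, worker) pairs, one per query."""
    degrees, weights = zip(*phi.items())
    queries = []
    for _ in range(n):                                 # 4. repeat n times
        d = random.choices(degrees, weights)[0]        # 1. draw a degree
        group = tuple(random.sample(range(m), d))      # 2. d distinct items
        worker = random.randrange(w)                   # 3. assign a worker
        queries.append((group, worker))
    return queries

qs = generate_queries(m=50, n=20, w=5, phi={1: 0.3, 2: 0.7})
assert len(qs) == 20
```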

  6. Model
     Noise Model:
     Dawid–Skene model: the error probability depends on the assigned worker.
     Here, the error probability ǫ_{k,d} < 0.5 depends on both the assigned worker (k) and the query degree (d).
     The error probability usually increases with respect to d.
     Tripartite Graph Representation: [Figure: items, queries, and workers as the three node sets of a tripartite graph.]

  7. Information-theoretic Bound
     Theorem 1: With the maximum likelihood (ML) estimator x̂ ∈ {1, −1}^m, strong recovery is possible, i.e., P[x̂ ≠ x] → 0 as m → ∞, if and only if the number of queries satisfies

     n ≥ (1 + η) m log m / ( Σ_{d=1}^{D} Σ_{k=1}^{w} (d Φ_d / w) ( √(1 − ǫ_{k,d}) − √ǫ_{k,d} )² ),   (1)

     for a small constant η > 0.
     The bound is inversely proportional to the average query degree d̄ = Σ_{d=1}^{D} d Φ_d.
     We assume that the worker reliabilities are known and that the ML estimator makes use of them.
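The right-hand side of (1) can be evaluated numerically; a minimal sketch, assuming `phi` encodes Φ_d and `eps` encodes ǫ_{k,d} (both names and the dictionary layout are illustrative):

```python
from math import log, sqrt

def sample_complexity(m, phi, eps, eta=0.1):
    """Right-hand side of (1).
    phi: dict d -> Phi_d; eps: dict (k, d) -> eps_{k,d} for workers k = 1..w."""
    w = len({k for (k, d) in eps})
    denom = sum(d * phi_d / w * (sqrt(1 - eps[(k, d)]) - sqrt(eps[(k, d)])) ** 2
                for d, phi_d in phi.items() for k in range(1, w + 1))
    return (1 + eta) * m * log(m) / denom
```

With one perfectly reliable worker and only degree-1 queries the denominator is 1, so the bound reduces to (1 + η) m log m, matching the classical order for repetition queries.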

  8. Algorithm
     We are given a number of queries equal to the bound in Theorem 1.
     Algorithm Sketch:
     1. Detection of labels: P[x̂_i^(1) ≠ x_i] < 1/2.
     2. Weak recovery of labels: P[x̂_i^(2) ≠ x_i] → 0.
     3. Estimating worker reliabilities: for any δ > 0, P[ |ǫ̂_{k,d} − ǫ_{k,d}| > δ ] → 0.
     4. Strong recovery of labels: P[x̂^(4) ≠ x] → 0.
     Notes:
     For independence, we split the queries into 4 groups and use each group only in one step of the algorithm.
     We assume that there are at least m degree-1 queries.

  9. Algorithm
     Step 1: Detection of labels. m degree-1 queries are used in this step.
     [Figure: degree-1 queries connecting each item to a worker.]
     Estimate of this step: x̂_i^(1) = sign( Σ_{j ∈ ∂x_i} y_j ).
     P[x̂_i^(1) ≠ x_i] < 1/2 holds because the workers' average error probability is less than 1/2, i.e., ǭ = (1/w) Σ_{k ∈ [w]} ǫ_{k,1} < 1/2.
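Step 1 is a plain majority vote over the degree-1 answers; a minimal sketch (the input layout is an assumption for illustration):

```python
def detect_labels(m, answers):
    """Step 1 sketch: majority vote over degree-1 answers.
    answers: list of (item_index, y_j) pairs with y_j in {-1, +1}."""
    score = [0] * m
    for i, y in answers:
        score[i] += y
    # ties (score 0) are broken toward +1 so every item gets a label
    return [1 if s >= 0 else -1 for s in score]

est = detect_labels(3, [(0, 1), (0, 1), (1, -1), (2, 1), (2, -1)])
assert est == [1, -1, 1]
```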

  10. Algorithm
     Step 2: Weak recovery of labels. m log log m queries are used in this step.
     [Figure: messages passed from queries to items on the tripartite graph.]
     Estimate of the j-th query for the i-th item: m_{j→i}^(2) = y_j Π_{i′ ∈ ∂y_j \ {i}} x̂_{i′}^(1).

  11. Algorithm
     Step 2: Weak recovery of labels (cont.).
     Estimate of the j-th query for the i-th item: m_{j→i}^(2) = y_j Π_{i′ ∈ ∂y_j \ {i}} x̂_{i′}^(1).
     Estimate of this step: x̂_i^(2) = sign( Σ_{j ∈ ∂x_i} m_{j→i}^(2) ).
     If |∂x_i| = Θ(log log m), then P[x̂_i^(2) ≠ x_i] ≤ 1/(log m)^K for a positive constant K.
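The Step-2 message rule above can be sketched as follows, assuming queries arrive as (group, answer) pairs and `x1` is the Step-1 estimate (the function name and layout are illustrative):

```python
from math import prod

def weak_recovery(m, queries, x1):
    """Step 2 sketch. queries: list of (group, y_j); x1: Step-1 estimate.
    Each answer y_j, combined with the Step-1 estimates of the other items
    in its group, casts a vote m^(2)_{j->i} on item i."""
    score = [0] * m
    for group, y in queries:
        for i in group:
            others = prod(x1[ip] for ip in group if ip != i)
            score[i] += y * others            # m^(2)_{j->i}
    return [1 if s >= 0 else -1 for s in score]
```

With noiseless answers and a correct Step-1 estimate, every vote agrees with the true label, so the sign of the sum recovers x exactly.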

  12. Algorithm
     Step 3: Estimating workers' reliabilities. w log m queries are used in this step.
     [Figure: queries grouped by the worker they were assigned to.]
     Error indicator of the j-th query, assigned to the k-th worker: E_j^(3) = 1{ y_j ≠ Π_{i ∈ ∂y_j} x̂_i^(2) } ∼ Ber(ǫ_{k,d}).

  13. Algorithm
     Step 3: Estimating workers' reliabilities (cont.).
     Error indicator of the j-th query, assigned to the k-th worker: E_j^(3) = 1{ y_j ≠ Π_{i ∈ ∂y_j} x̂_i^(2) } ∼ Ber(ǫ_{k,d}).
     Reliability estimate: ǫ̂_{k,d} = ( Σ_{j ∈ ∂w_{k,d}} E_j^(3) ) / |∂w_{k,d}|.
     ǫ̂_{k,d} is arbitrarily close to ǫ_{k,d} for all k, d when |∂w_{k,d}| = Θ(log m).
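Step 3 is an empirical error frequency per (worker, degree) pair; a minimal sketch, assuming queries are pre-grouped by (k, d) and `x2` is the Step-2 estimate (names are illustrative):

```python
from math import prod

def estimate_reliabilities(queries_by_worker, x2):
    """Step 3 sketch. queries_by_worker: dict (k, d) -> list of (group, y_j)
    assigned to worker k with degree d. E_j^(3) = 1 whenever the answer
    disagrees with the XOR of the Step-2 estimated labels."""
    eps_hat = {}
    for (k, d), qs in queries_by_worker.items():
        errors = sum(y != prod(x2[i] for i in group) for group, y in qs)
        eps_hat[(k, d)] = errors / len(qs)   # empirical error frequency
    return eps_hat
```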

  14. Algorithm
     Step 4: Strong recovery of labels. All remaining queries are used in this step.
     [Figure: weighted messages passed from queries to items.]
     Weighted estimate of the j-th query for the i-th item: M_{j→i}^(4) = log( (1 − ǫ̂_{k,d}) / ǫ̂_{k,d} ) · m_{j→i}^(4).

  15. Algorithm
     Step 4: Strong recovery of labels (cont.).
     Weighted estimate of the j-th query for the i-th item: M_{j→i}^(4) = log( (1 − ǫ̂_{k,d}) / ǫ̂_{k,d} ) · m_{j→i}^(4).
     The final estimate: x̂_i^(4) = sign( Σ_{j ∈ ∂x_i} M_{j→i}^(4) ).
     We prove that P[x̂_i^(4) ≠ x_i] ≤ exp( −(1 + η/3) log m ).
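Step 4 repeats the Step-2 vote but weights each message by the log-likelihood ratio of its worker/degree pair; a minimal sketch, assuming each query carries its (worker, degree) tag and `x2`, `eps_hat` come from the earlier steps (layout is illustrative):

```python
from math import log, prod

def strong_recovery(m, queries, x2, eps_hat):
    """Step 4 sketch. queries: list of ((k, d), group, y_j) triples;
    each vote is scaled by log((1 - eps_hat)/eps_hat), so answers from
    reliable worker/degree pairs count more."""
    score = [0.0] * m
    for (k, d), group, y in queries:
        llr = log((1 - eps_hat[(k, d)]) / eps_hat[(k, d)])
        for i in group:
            m4 = y * prod(x2[ip] for ip in group if ip != i)
            score[i] += llr * m4              # M^(4)_{j->i}
    return [1 if s >= 0 else -1 for s in score]
```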

  16. Simulation Results
     Comparison between queries: compare XOR, REP, and HOMO queries.
     d-coin-flip model: a worker makes d independent judgements, each wrong with probability ǫ_k, and the XOR (resp. HOMO answer) of them becomes the output:

     ǫ_{k,d} = Σ_{l ∈ [1:d], l odd} (d choose l) ǫ_k^l (1 − ǫ_k)^{d−l} = ( 1 − (1 − 2ǫ_k)^d ) / 2.

     d = 3 ∼ 6 (chosen at random), ǫ_k ∈ {0.005, 0.010, · · · , 0.100}.
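The d-coin-flip error probability above can be checked directly: the XOR of d flips is wrong exactly when an odd number of them err, and summing the odd binomial terms reproduces the closed form (a sketch; `eps_kd` is an illustrative name):

```python
from math import comb

def eps_kd(eps_k, d):
    """Probability that the XOR of d independent judgements is wrong:
    an odd number of the d coin flips err."""
    return sum(comb(d, l) * eps_k**l * (1 - eps_k)**(d - l)
               for l in range(1, d + 1, 2))

# matches the closed form (1 - (1 - 2*eps_k)^d) / 2 over the slide's range
for eps in (0.005, 0.05, 0.1):
    for d in range(3, 7):
        assert abs(eps_kd(eps, d) - (1 - (1 - 2 * eps)**d) / 2) < 1e-12
```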

  17. Simulation Results
     Comparison between queries.
     [Figure: frame error rate vs. number of queries, comparing XOR with REP (BP, EoR, SEM) and HOMO; vertical lines mark the information-theoretic limits for XOR and for REP.]

  18. Simulation Results
     Importance of Estimating Worker Reliabilities.
     Compare the performance of the proposed algorithm with and without Steps 3–4.
     For half of the workers, ǫ_1 = 0.05; for the other half, ǫ_2 = 0.15, 0.25, 0.35, 0.45.
     [Figure: frame error rate vs. number of queries; solid lines: with Steps 3–4, dashed lines: without Steps 3–4.]

  19. Simulation Results
     Real Experiment: compare XOR and REP queries. Data collected from Amazon Mechanical Turk.
     (a) Degree-1 (REP) query: "Check TRUE if the image contains cat, and check FALSE if the image contains dog."
     (b) Degree-4 XOR query: "Check TRUE if odd (1, 3) number of images contain cat, and check FALSE if even (0, 2, 4) number of images contain cat."

  20. Simulation Results
     Real Experiment (cont.).
     [Figure: frame error rate vs. number of queries on the Mechanical Turk data, comparing XOR with REP-SEM, REP-BP, and REP-EoR.]

  21. Summary
     Studied crowdsourced classification with multi-degree XOR queries under a general error model.
     Characterized the information-theoretic limit on the number of queries for strong recovery of the labels.
     Proposed an efficient algorithm achieving this limit without knowledge of the error parameters.
     Demonstrated the effectiveness of XOR queries through synthetic and real experiments.
     Full paper available at arXiv:2001.11775.
