

  1. MergeDTS for Large Scale Condorcet Dueling Bandits Chang Li , Ilya Markov, Maarten de Rijke and Masrour Zoghi

  2. What are dueling bandits?
  • The K-armed dueling bandits (Yue et al., COLT 2009): K arms (aka actions).
  • Each time step:
    ➡ the algorithm chooses two arms, l and r (for “left” and “right”);
    ➡ a duel takes place between l and r, and one of them is returned as the winner.
  • Goal: converge to the optimal play for both l and r.
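The interaction protocol above can be sketched with a tiny simulator. All names and preference values here are illustrative, not from the talk; the policy is naive uniform exploration, just to show the duel loop:

```python
import random

random.seed(0)

# Hypothetical 3-arm preference matrix; P[i][j] = Pr(arm i beats arm j).
P = [[0.5, 0.7, 0.8],
     [0.3, 0.5, 0.6],
     [0.2, 0.4, 0.5]]

def duel(l, r):
    """Simulate one duel: return l if it wins, else r."""
    return l if random.random() < P[l][r] else r

# A naive uniform-exploration policy: pick two distinct arms each step.
wins = [0, 0, 0]
for t in range(1000):
    l, r = random.sample(range(3), 2)
    wins[duel(l, r)] += 1
print(wins)  # arm 0, which beats both others, collects the most wins
```

A real dueling-bandit algorithm replaces the uniform choice of (l, r) with a policy that concentrates play on the best arm.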

  3. What is the optimal play?
  • Notation: P := [P_{ij}] is the preference matrix, with P_{ij} = Pr(arm i beats arm j).
  • Assumption: there exists one arm that on average beats all the other arms, called the Condorcet winner: P_{1j} > 0.5 for all j ≠ 1.
  • Regret: the loss incurred by comparing non-Condorcet winners: r_t = 0.5 · (P_{1l} − 0.5) + 0.5 · (P_{1r} − 0.5).
  • Optimal play: only play the Condorcet winner, i.e. choose the Condorcet winner as both l and r.
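These definitions are easy to make concrete. The sketch below uses a small hypothetical preference matrix (the values are illustrative) to find the Condorcet winner and compute the per-step regret formula from the slide:

```python
import numpy as np

# Illustrative 4-arm preference matrix; P[i, j] = Pr(arm i beats arm j),
# so P[i, j] + P[j, i] = 1 and the diagonal is 0.5.
P = np.array([
    [0.5, 0.6, 0.7, 0.8],
    [0.4, 0.5, 0.6, 0.7],
    [0.3, 0.4, 0.5, 0.6],
    [0.2, 0.3, 0.4, 0.5],
])

def condorcet_winner(P):
    """Return the index of the arm that beats all others, or None."""
    K = P.shape[0]
    for i in range(K):
        if all(P[i, j] > 0.5 for j in range(K) if j != i):
            return i
    return None

def regret(P, winner, l, r):
    """Per-step regret of dueling arms l and r against the Condorcet winner."""
    return 0.5 * (P[winner, l] - 0.5) + 0.5 * (P[winner, r] - 0.5)

w = condorcet_winner(P)       # arm 0 beats every other arm
print(regret(P, w, 1, 2))     # ≈ 0.15
print(regret(P, w, 0, 0))     # playing the winner twice gives zero regret
```

Note that regret is zero exactly when both l and r are the Condorcet winner, which is why the optimal play is to duel the winner against itself.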

  4. Related works
  • DTS (Wu et al., NIPS 2016), etc.
    ➡ Limited to small-scale setups, i.e. K is small.
  • Self-Sparring (Sui et al., UAI 2017), etc.
    ➡ Designed under strict assumptions, e.g. no cyclic preference relationships.
  • MergeRUCB (Zoghi et al., WSDM 2015)
    ➡ Designed for large-scale dueling bandits, yet with high cumulative regret.

  5. Merge Double Thompson Sampling
  • Randomly partition the arms into small groups.
  • Each time step:
    1. Sample a tournament inside a small group;
    2. Choose the winner and loser of the tournament as l and r, respectively;
    3. Compare l and r online, and update the statistics;
    4. Eliminate an arm if it is dominated by any other arm with high confidence;
    5. If half of the arms have been eliminated, re-partition the remaining arms.
  • Stop when only one arm is left.
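The partition → duel → eliminate → re-partition skeleton above can be sketched as a runnable toy. This is a much-simplified stand-in, not the actual MergeDTS algorithm: duels are simulated as Bernoulli draws from a hypothetical preference matrix, the Thompson-sampling tournament is replaced by a uniform pick inside the group, and the high-confidence elimination test is replaced by a crude head-to-head win margin:

```python
import random

random.seed(1)

# Hypothetical preference matrix for 8 arms; lower index = stronger arm,
# so arm 0 is the Condorcet winner. P[i][j] = Pr(arm i beats arm j).
K = 8
P = [[0.5 + 0.05 * (j - i) for j in range(K)] for i in range(K)]

wins = [[0] * K for _ in range(K)]  # wins[i][j]: times arm i beat arm j

def duel(l, r):
    """Simulate one duel and record the outcome."""
    winner, loser = (l, r) if random.random() < P[l][r] else (r, l)
    wins[winner][loser] += 1

def dominated(i, group, margin=10):
    """Crude stand-in for MergeDTS's confidence test: arm i counts as
    dominated if some group member leads it by `margin` head-to-head wins."""
    return any(wins[j][i] - wins[i][j] >= margin for j in group if j != i)

arms = list(range(K))
while len(arms) > 1:
    random.shuffle(arms)                                     # random partition
    groups = [arms[i:i + 4] for i in range(0, len(arms), 4)]  # small groups
    target = max(1, len(arms) // 2)                          # re-partition point
    while len(arms) > target:                # duel until half are eliminated
        for g in groups:
            if len(g) < 2:
                continue
            l, r = random.sample(g, 2)       # stand-in for the TS tournament
            duel(l, r)
            for a in list(g):                # eliminate dominated arms
                if dominated(a, g):
                    g.remove(a)
                    arms.remove(a)
print("surviving arm:", arms[0])
```

With a sound confidence test the surviving arm is the Condorcet winner with high probability; the crude margin rule here only illustrates the control flow.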

  6. Experiment: online ranker evaluation
  [Figure: cumulative regret (up to ~25,000) vs. iteration (10^4 to 10^8) on MSLR-Navigational, comparing MergeRUCB (α = 0.86), DTS (α = 0.86), Self-Sparring, and MergeDTS (α = 0.86).]
