  1. CS246: Mining Massive Datasets. Caroline Lo, Stanford University. http://cs246.stanford.edu

  2. - Web advertising
       - We've learned how to match advertisers to queries in real-time
       - But how do we estimate the CTR (Click-Through Rate)?
     - Recommendation engines
       - We've learned how to build recommender systems
       - But how do we solve the cold-start problem?

  3. - What do CTR and cold start have in common?
     - Getting the answer requires experimentation
       - With every ad we show / product we recommend, we gather more data about the ad/product
     - Theme: Learning through experimentation

  4. - Google's goal: Maximize revenue
     - The old way: Pay by impression (CPM)

  5. - Google's goal: Maximize revenue
     - The old way: Pay by impression (CPM)
       - Best strategy: Go with the highest bidder
       - But this ignores the "effectiveness" of an ad
     - The new way: Pay per click (CPC)
       - Best strategy: Go with the highest expected revenue
       - What's the expected revenue of ad a for query q?
         E[revenue_a,q] = P(click_a | q) * amount_a,q
         - amount_a,q ... bid amount for ad a on query q (known)
         - P(click_a | q) ... probability that the user will click on ad a given that she issues query q (unknown! we need to gather information)
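
A minimal sketch of why the CPC ranking differs from the CPM ranking: rank ads by expected revenue = estimated CTR * bid rather than by bid alone. The ad names, bids, and CTR estimates below are hypothetical.

```python
# Hypothetical ads for one query: ad_A bids high but is rarely clicked,
# ad_B bids low but is clicked often.
ads = {
    "ad_A": {"bid": 2.00, "ctr": 0.01},
    "ad_B": {"bid": 0.50, "ctr": 0.08},
}

def expected_revenue(ad):
    # E[revenue] = P(click | query) * bid amount
    return ad["ctr"] * ad["bid"]

best = max(ads, key=lambda name: expected_revenue(ads[name]))
print(best)  # ad_B (0.04 expected revenue per impression vs. 0.02 for ad_A)
```

The catch, as the following slides develop, is that the CTR is unknown and has to be estimated through experimentation.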

  6. - Clinical trials: investigate the effects of different treatments while minimizing patient losses
     - Adaptive routing: minimize delay in the network by investigating different routes
     - Asset pricing: figure out product prices while trying to make the most money

  9. - Each arm a:
       - Wins (reward = 1) with fixed (unknown) probability μ_a
       - Loses (reward = 0) with fixed (unknown) probability 1 − μ_a
     - All draws are independent given μ_1 … μ_k
     - How do we pull arms to maximize total reward?
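
A minimal simulation sketch of this model: each arm pays 1 with an unknown probability μ_a and 0 otherwise, independently on every pull. The μ values below are illustrative, not from the slides.

```python
import random

class BernoulliArm:
    """One arm of the bandit: pays 1 with probability mu, else 0."""

    def __init__(self, mu):
        self.mu = mu  # true win probability, unknown to the player

    def pull(self):
        # Each draw is independent of all previous draws.
        return 1 if random.random() < self.mu else 0

arms = [BernoulliArm(mu) for mu in (0.3, 0.5, 0.7)]  # hypothetical CTRs
print([arm.pull() for arm in arms])                  # one experiment per arm
```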

  10. - How does this map to our advertising example?
        - Each query is a bandit
        - Each ad is an arm
      - We want to estimate each arm's probability of winning μ_a (i.e., the ad's CTR)
      - Every time we pull an arm we do an "experiment"

  11. The setting:
      - Set of k choices (arms)
      - Each choice a is tied to a probability distribution P_a with average reward/payoff μ_a (in [0, 1])
      - We play the game for T rounds
      - In each round t:
        - (1) We pick some arm j
        - (2) We win reward X_t drawn from P_j (the reward is independent of previous draws)
      - Our goal is to maximize the total reward Σ_{t=1}^T X_t
      - We don't know μ_a! But every time we pull some arm a we get to learn a bit about μ_a
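
A sketch of this protocol in code, under the Bernoulli-reward model from the previous slide. `policy` stands for any rule that maps the observed history to an arm index; the uniformly random placeholder policy and the μ values are illustrative only.

```python
import random

def play(mus, T, policy):
    """Play T rounds of a Bernoulli bandit with true means `mus`."""
    history = []                     # observed (arm, reward) pairs
    total = 0.0
    for t in range(1, T + 1):
        j = policy(history, len(mus))                   # (1) pick some arm j
        x = 1.0 if random.random() < mus[j] else 0.0    # (2) reward X_t drawn from P_j
        history.append((j, x))
        total += x                   # goal: maximize the sum of the X_t
    return total

random_policy = lambda history, k: random.randrange(k)  # placeholder policy
print(play([0.3, 0.5, 0.7], T=1000, policy=random_policy))
```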

  12. - Online optimization with limited feedback
      - (Figure: a table with arms a_1 … a_k as rows and time steps X_1, X_2, … as columns; at each time step only the payoff of the arm that was actually chosen is filled in.)
      - Like in online algorithms: we have to make a choice each time
      - But we only receive information about the chosen action

  13. - Policy: a strategy/rule that in each iteration tells us which arm to pull
      - Ideally, the policy depends on the history of rewards observed so far
      - How do we quantify the performance of the algorithm? Regret!

  14. - μ_a is the mean of P_a
      - Payoff/reward of the best arm: μ* = max_a μ_a
      - Let a_1, a_2, …, a_T be the sequence of arms pulled
      - Instantaneous regret at time t: r_t = μ* − μ_{a_t}
      - Total regret: R_T = Σ_{t=1}^T r_t
      - Typical goal: we want a policy (arm-allocation strategy) that guarantees R_T / T → 0 as T → ∞
      - Note: ensuring R_T / T → 0 is stronger than just maximizing payoffs (minimizing regret), as it means that in the limit we discover the true best arm.
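
A small sketch of this bookkeeping, usable only in simulation where the true means are known; the μ values and the pulled-arm sequence are made up for illustration.

```python
def total_regret(mus, pulled_arms):
    """R_T = sum_t (mu* - mu_{a_t}) for a known mean vector `mus`."""
    mu_star = max(mus)                                   # payoff of the best arm
    instant = [mu_star - mus[a] for a in pulled_arms]    # instantaneous regrets r_t
    return sum(instant)

# Pulling the worst arm for 10 rounds when mus = [0.3, 0.5, 0.7]:
print(total_regret([0.3, 0.5, 0.7], [0] * 10))   # 10 * (0.7 - 0.3) = 4.0
```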

  15. - If we knew the payoffs, which arm would we pull? Pick arg max_a μ_a
      - We'd always pull the arm with the highest average reward
      - But we don't know which arm that is without exploring/experimenting with the arms first
      - Notation: X_{a,j} … payoff received when pulling arm a for the j-th time

  16. - Minimizing regret illustrates a classic problem in decision making:
      - We need to trade off exploration (gathering data about arm payoffs) and exploitation (making decisions based on the data already gathered)
        - Exploration: pull an arm we have never pulled before
        - Exploitation: pull the arm a for which we currently have the highest estimate of μ_a

  17. Algorithm: Epsilon-Greedy
      - For t = 1 … T:
        - Set ε_t = O(1/t)
        - With probability ε_t: explore by picking an arm chosen uniformly at random
        - With probability 1 − ε_t: exploit by picking the arm with the highest empirical mean payoff
      - Theorem [Auer et al. '02]: for a suitable choice of ε_t it holds that R_T = O(k log T), and hence R_T / T = O(k log T / T) → 0   (k … number of arms)
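
A minimal epsilon-greedy sketch following this slide. The schedule ε_t = min(1, c/t) is one common way to realize the O(1/t) decay; the constant c and the Bernoulli reward simulation are assumptions, not specified on the slide.

```python
import random

def epsilon_greedy(mus, T, c=5.0):
    """Run epsilon-greedy for T rounds on a Bernoulli bandit with true means `mus`."""
    k = len(mus)
    counts = [0] * k      # number of pulls per arm
    means = [0.0] * k     # empirical mean payoff per arm
    for t in range(1, T + 1):
        eps_t = min(1.0, c / t)
        if random.random() < eps_t:
            a = random.randrange(k)                    # explore: uniformly random arm
        else:
            a = max(range(k), key=lambda i: means[i])  # exploit: best empirical mean
        x = 1.0 if random.random() < mus[a] else 0.0   # observed reward
        counts[a] += 1
        means[a] += (x - means[a]) / counts[a]         # running-mean update
    return means, counts

print(epsilon_greedy([0.3, 0.5, 0.7], T=10000))   # the best arm should dominate the pulls
```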

  18. - What are some issues with Epsilon-Greedy?
      - "Not elegant": the algorithm explicitly distinguishes between exploration and exploitation
      - More importantly: exploration makes suboptimal choices (since it picks any arm with equal likelihood)
      - Idea: when exploring/exploiting we need to compare arms

  19. - Suppose we have done some experiments:
        - Arm 1: 1 0 0 1 1 0 0 1 0 1
        - Arm 2: 1
        - Arm 3: 1 1 0 1 1 1 0 1 1 1
      - Mean arm values: Arm 1: 5/10, Arm 2: 1/1, Arm 3: 8/10
      - Which arm would you pick next?
      - Idea: don't just look at the mean (the estimated expected payoff) but also at the confidence!
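
A tiny sketch of the point being made: the three empirical means from the slide, shown together with how many pulls back them up. Arm 2's perfect mean of 1.0 rests on a single observation, so it deserves far less confidence than Arm 3's 0.8.

```python
# Observed payoff sequences copied from the slide.
pulls = {
    "Arm 1": [1, 0, 0, 1, 1, 0, 0, 1, 0, 1],
    "Arm 2": [1],
    "Arm 3": [1, 1, 0, 1, 1, 1, 0, 1, 1, 1],
}
for name, xs in pulls.items():
    mean = sum(xs) / len(xs)
    print(f"{name}: mean {mean:.2f} from {len(xs)} pull(s)")
# Arm 1: mean 0.50 from 10 pulls; Arm 2: mean 1.00 from 1 pull; Arm 3: mean 0.80 from 10 pulls
```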

  20. - A confidence interval is a range of values within which we are sure the mean lies with a certain probability
        - For example, we could believe μ_a lies within [0.2, 0.5] with probability 0.95
      - If we have tried an action less often, our estimated reward is less accurate, so the confidence interval is larger
      - The interval shrinks as we get more information (i.e., try the action more often)

  21. - Assuming we know the confidence intervals
      - Then, instead of trying the action with the highest mean, we can try the action with the highest upper bound on its confidence interval
      - This is called an optimistic policy
        - We believe an action is as good as possible given the available evidence

  22. (Figure: the 99.99% confidence interval around μ_a for arm a, before and after more exploration; the interval shrinks once the arm has been explored more.)

  23. Suppose we fix arm a:
      - Let X_{a,1} … X_{a,m} be the payoffs of arm a in its first m trials
        - X_{a,1} … X_{a,m} are i.i.d. random variables taking values in [0, 1]
      - Expected mean payoff of arm a: μ_a = E[X_{a,ℓ}]
      - Our estimate: μ̂_{a,m} = (1/m) Σ_{ℓ=1}^m X_{a,ℓ}
      - We want to find a confidence bound b such that with high probability |μ_a − μ̂_{a,m}| ≤ b
        - We also want b to be as small as possible (why?)
      - Goal: bound the probability of a large deviation, P(|μ_a − μ̂_{a,m}| ≥ b)

  24. - Hoeffding's inequality bounds P(|μ − μ̂_m| ≥ b):
        - Let X_1 … X_m be i.i.d. random variables taking values in [0, 1]
        - Let μ = E[X] and μ̂_m = (1/m) Σ_{ℓ=1}^m X_ℓ
        - Then: P(|μ − μ̂_m| ≥ b) ≤ exp(−2 b² m) = δ
      - To find the confidence interval b for a given confidence level δ we solve:
        - exp(−2 b² m) ≤ δ, hence −2 b² m ≤ ln δ
        - So: b ≥ sqrt( ln(1/δ) / (2m) )
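
A quick numeric sketch of the bound just derived, b = sqrt(ln(1/δ) / (2m)): the confidence interval is wide after a few pulls and shrinks as the number of pulls m grows. The values of m and δ are arbitrary examples.

```python
import math

def hoeffding_bound(m, delta):
    """Width b such that |mu - mu_hat_m| <= b with probability >= 1 - delta."""
    return math.sqrt(math.log(1.0 / delta) / (2.0 * m))

print(hoeffding_bound(m=10, delta=0.05))    # ~0.39 after 10 pulls
print(hoeffding_bound(m=1000, delta=0.05))  # ~0.04 after 1000 pulls
```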

  25. UCB1 (Upper confidence sampling) algorithm [Auer et al. '02]:
      - Set μ̂_1 = … = μ̂_k = 0 and m_1 = … = m_k = 0
        - μ̂_a is our estimate of the payoff of arm a
        - m_a is the number of pulls of arm a so far
      - For t = 1 … T:
        - For each arm a calculate: UCB_a = μ̂_a + sqrt( 2 ln t / m_a )
          (the second term is the upper confidence interval width, from Hoeffding's inequality)
        - Pick arm j = arg max_a UCB_a
        - Pull arm j and observe payoff y_t
        - Update: m_j ← m_j + 1 and μ̂_j ← (1/m_j) (y_t + (m_j − 1) μ̂_j)
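
A minimal UCB1 sketch following this slide. One practical detail the pseudocode leaves implicit: with m_a = 0 the UCB formula divides by zero, so this sketch (as is common) treats never-pulled arms as having an infinite upper bound, i.e. it pulls each arm once first. The Bernoulli reward simulation and the μ values are assumptions for illustration.

```python
import math
import random

def ucb1(mus, T):
    """Run UCB1 for T rounds on a Bernoulli bandit with true means `mus`."""
    k = len(mus)
    means = [0.0] * k   # mu_hat_a: empirical mean payoff of arm a
    counts = [0] * k    # m_a: number of pulls of arm a so far

    def reward(a):
        return 1.0 if random.random() < mus[a] else 0.0

    for t in range(1, T + 1):
        if min(counts) == 0:
            j = counts.index(0)     # never-pulled arm: its UCB is treated as infinite
        else:
            ucb = [means[a] + math.sqrt(2.0 * math.log(t) / counts[a]) for a in range(k)]
            j = max(range(k), key=lambda a: ucb[a])   # arg max_a UCB_a
        y = reward(j)
        counts[j] += 1
        means[j] += (y - means[j]) / counts[j]        # same update as the slide, in running-mean form
    return means, counts

print(ucb1([0.3, 0.5, 0.7], T=10000))   # most pulls should go to the 0.7 arm
```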

  26. - Compare b ≥ sqrt( ln(1/δ) / (2m) ) with UCB_a = μ̂_a + sqrt( 2 ln t / m_a )
        - t impacts the value of δ: t = f(1/δ)
      - The confidence interval grows with the total number of actions t we have taken
      - But it shrinks with the number of times m_a we have tried arm a
      - This ensures each arm is tried infinitely often, but still balances exploration and exploitation
      - "Optimism in the face of uncertainty": the algorithm believes that it can obtain extra rewards by reaching the unexplored parts of the state space

  27. Theorem [Auer et al. 2002]:
      - Suppose the optimal mean payoff is μ* = max_a μ_a, and for each arm let Δ_a = μ* − μ_a
      - Then it holds that
        E[R_T] ≤ 8 Σ_{a: μ_a < μ*} (ln T) / Δ_a + (1 + π²/3) Σ_{a=1}^k Δ_a
        (the first term is O(k ln T), the second is O(k))
      - So: R_T / T ≤ O( k ln T / T ) → 0
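
A small sketch that just evaluates this bound for illustrative gaps Δ_a, to make the O(k ln T) growth and the vanishing R_T / T concrete; the μ values and the horizon T are made up.

```python
import math

def ucb1_regret_bound(mus, T):
    """E[R_T] <= 8 * sum_{a: mu_a < mu*} ln(T)/Delta_a + (1 + pi^2/3) * sum_a Delta_a."""
    mu_star = max(mus)
    deltas = [mu_star - mu for mu in mus]
    log_term = 8.0 * sum(math.log(T) / d for d in deltas if d > 0)
    const_term = (1.0 + math.pi ** 2 / 3.0) * sum(deltas)
    return log_term + const_term

T = 10_000
bound = ucb1_regret_bound([0.3, 0.5, 0.7], T)
print(bound)       # grows only logarithmically in T
print(bound / T)   # average regret per round goes to 0 as T grows
```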

  28. - The k-armed bandit problem is a formalization of the exploration-exploitation tradeoff
      - Simple algorithms are able to achieve no regret (in the limit as T → ∞):
        - Epsilon-greedy
        - UCB (upper confidence sampling)
