Ranking article comments using reinforcement learning - Lester Solbakken - PowerPoint PPT Presentation



SLIDE 1

Ranking article comments using reinforcement learning

Lester Solbakken | October 28th 2019 vespa.ai

SLIDE 2

SLIDE 3

Encourages meaningful discussion?

SLIDE 4

SLIDE 5

Vespa at

Hundreds of Vespa applications (Flickr, Tumblr, TechCrunch, Huffington Post, Aol, Gemini, Engadget, Yahoo News Sports Finance Mail etc.):

  • serving over a billion users,
  • hundreds of thousands of queries per second,
  • billions of content items.

  • Personalized article recommendations
  • Personalized real-time native ads selection
  • Searching 20+ billion images
  • Select comments using neural nets and reinforcement learning

SLIDE 6

Vespa team

Around 30 developers in Trondheim, Norway

1998: Fast Search & Transfer (alltheweb.com)
2004: Overture / Yahoo
2017: Oath, Vespa Open Source
2019: Verizon Media Group

SLIDE 7

Baseline - existing solution

Comments found on many Yahoo properties such as Yahoo Finance, Yahoo News, and Yahoo Sports

  • ~1 billion comments stored
  • ~12,000 queries per second
  • 2x that for updates

Some articles have > 100,000 comments!

https://blog.vespa.ai/post/182759620076/serving-article-comments-using-reinforcement

SLIDE 8

Potential features

Wilson score*: probability of comment being overwhelmingly liked by all users

(*) Zhang et al. 2011. How to Count Thumb-Ups and Thumb-Downs: User-Rating Based Ranking of Items from an Axiomatic Perspective.

Community: how users interacted with the comment
Comment: relevance to topic, moderation (Conversation AI, https://conversationai.github.io)
Author: reputation
User: preferences
Other: time
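The Wilson score above can be sketched as the lower confidence bound on a comment's thumbs-up fraction; a minimal Python version using the standard Wilson interval (an illustration, not the paper's exact axiomatic formulation):

```python
import math

def wilson_lower_bound(ups: int, downs: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for the thumbs-up fraction.

    More votes tighten the interval, so 90/100 likes outranks 9/10 likes.
    """
    n = ups + downs
    if n == 0:
        return 0.0
    p = ups / n
    z2 = z * z
    centre = p + z2 / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n))
    return (centre - margin) / (1 + z2 / n)
```

For example, wilson_lower_bound(90, 10) ≈ 0.83 while wilson_lower_bound(9, 1) ≈ 0.60, even though both have a 90% like rate.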

SLIDE 9

Previous ranking algorithm

Community, comment, author, user, and other features → hardcoded weighting → final score

SLIDES 10-13 (progressive build of the question-and-answer table completed on SLIDE 14)

SLIDE 14

Scoring: How should features be combined intelligently? → Neural network over comment features
Ranking: How can we overcome position bias? → Exploration with sampling
Learning: How do we learn directly from user behavior? → Reinforcement learning with dwell time rewards

SLIDE 15

Reinforcement learning in general

RL is a general-purpose framework for artificial intelligence

  • RL is for an agent with the capacity to act
  • Each action influences the agent’s future state
  • Success is measured by a scalar reward signal
  • Select actions to maximise future reward

SLIDE 16

Contextual bandits

Multi-armed bandits with context. The reward r is conditioned on the chosen action, so feedback is partial. Canonical example: ad serving.

Policy: features x → score v = f(x) → action a

(Source: Microsoft Research)
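As a concrete illustration of such a policy (my sketch, not the production system), an ε-greedy rule over the scores v = f(x): exploit the top-scoring action most of the time, explore uniformly otherwise:

```python
import random

def epsilon_greedy(scores, epsilon=0.1, rng=random):
    """Pick an action index: explore uniformly with probability epsilon, else argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(scores))
    return max(range(len(scores)), key=lambda i: scores[i])
```

With epsilon=0 this is pure exploitation; raising epsilon trades immediate reward for the feedback needed to learn about other actions.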

SLIDE 17

Contextual bandits in ranking

Sometimes called contextual semibandits*. The policy chooses a ranking, not a single action. Importance weighted sampling is used to construct unbiased estimates for rewards.

Policy: features x → score v = f(x) → ranking

(*) Krishnamurthy, Agarwal, Dudík 2016. Contextual Semibandits via Supervised Learning Oracles.

SLIDES 18-20 (progressive build of the scoring pipeline completed on SLIDE 21)

SLIDE 21

Scoring

Comment → features (community, comment, author, user, other) → model → positive score

SLIDES 22-23 (progressive build of the ranking pipeline completed on SLIDE 24)

SLIDE 24

Ranking

Comments → scores → sampling → ranking
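The sampling step can be sketched as drawing a full ranking from a Plackett-Luce distribution over the scores via the Gumbel trick, together with the log-probability that would be logged as p (my illustration, not the production code):

```python
import math
import random

def sample_ranking(scores, temperature=1.0, rng=random):
    """Sample a ranking: argsort of score/T plus Gumbel noise.

    Low temperature approaches the greedy sort; high temperature explores more.
    """
    keyed = []
    for i, s in enumerate(scores):
        gumbel = -math.log(-math.log(rng.random()))
        keyed.append((s / temperature + gumbel, i))
    return [i for _, i in sorted(keyed, reverse=True)]

def ranking_log_prob(scores, ranking, temperature=1.0):
    """Plackett-Luce log-probability of a ranking (what gets logged as p)."""
    logp = 0.0
    remaining = list(ranking)
    while remaining:
        logits = [scores[i] / temperature for i in remaining]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        logp += logits[0] - log_z
        remaining.pop(0)
    return logp
```

Sampling rather than sorting deterministically is what lets lower-scored comments occasionally appear on top, which is how position bias is overcome.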

SLIDES 25-27 (repeat of SLIDE 24)

SLIDES 28-31 (progressive build of the learning loop completed on SLIDE 32)

SLIDE 32

Learning

Model → rankings → reward → gradient ascent in the direction of expected reward (can use any reward)
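The gradient-ascent step can be illustrated with a minimal REINFORCE update (a toy sketch, not the production trainer): a softmax policy over linear scores, nudged along the reward-weighted gradient of the log-probability:

```python
import math
import random

def softmax(zs):
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(w, items, reward_fn, lr=0.5, rng=random):
    """One REINFORCE update: sample an item via softmax(w.x), ascend r * grad log pi."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for x in items]
    probs = softmax(scores)
    u, action, acc = rng.random(), len(items) - 1, 0.0
    for i, p in enumerate(probs):
        acc += p
        if u <= acc:
            action = i
            break
    r = reward_fn(action)
    # Gradient of log softmax at the chosen action: x_a - E_p[x]
    grad = [items[action][k] - sum(probs[i] * items[i][k] for i in range(len(items)))
            for k in range(len(w))]
    return [wk + lr * r * gk for wk, gk in zip(w, grad)]
```

With reward 1 for one item and 0 for the rest, repeated steps drive the policy toward always ranking that item first; any scalar reward (e.g. dwell time) slots in the same way.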

SLIDE 33

Cold start: pre-train neural network to emulate previous ranking

  • Gradient ascent with Kendall’s tau coefficient as reward

Off-policy evaluation: interactions are logged as (x, a, r, p), where p is the policy’s probability of choosing a given x.

  • Inverse-Propensity Scoring* estimates the average reward of one policy from data collected by another policy

Bootstrapping and testing

(*) Peter C. Austin. 2011. An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies.
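The IPS estimator over the (x, a, r, p) logs can be sketched as follows (a minimal illustration; `target_prob` is a hypothetical callback returning the evaluated policy's probability of the logged action):

```python
def ips_estimate(logs, target_prob):
    """Inverse-Propensity Scoring: unbiased off-policy estimate of average reward.

    logs: iterable of (x, a, r, p) tuples, where p is the logging policy's
    probability of choosing action a given context x.
    """
    logs = list(logs)
    total = 0.0
    for x, a, r, p in logs:
        # Reweight each logged reward by how much more (or less) likely
        # the target policy is to take the logged action than the logger was.
        total += r * target_prob(x, a) / p
    return total / len(logs)
```

Because the logging policy sampled with known probability p, dividing by p corrects for the mismatch between the two policies, so a new model can be evaluated before it ever serves live traffic.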

SLIDES 34-36 (progressive build of the solution diagram completed on SLIDE 37)

SLIDE 37

Elements of a solution

Comments → scoring model → ranking. Reward instrumentation logs (r) and the ranker logs (x, a, p) into a distributed DB; machine learning consumes (x, a, r, p) to update the scoring model.

SLIDE 38

Implementation

Presentation → comment processing → Vespa (create, update; feed rankings). Votes (r) and logged (x, a, p) are collected in Hadoop, where TensorFlow trains the scoring model.

SLIDE 39

Vespa

A platform for low latency computations over large, evolving data sets:

  • Search and filter over structured and unstructured data
  • Query time organization and aggregation of matching data
  • Real-time writes
  • Advanced relevance scoring with tensors as first class citizens*
  • Scalable and fast
  • Elastic and fault tolerant
  • Pluggable
  • Easy to operate

Typical use cases: text search, personalization, recommendation, targeting, real-time data display

(*) https://github.com/jobergum/dense-vector-ranking-performance

SLIDE 40

Vespa as comment serving system

Scalable and fast:

  • About 1 billion comments / ~12,000 queries per second
  • Read latency 7 ms for 10k comments, including model evaluation
  • Write latency ~1 ms

  • Direct deployment of ML scoring models
  • Advanced computation framework for complex features
  • Custom logic for implementing sampling and logging
  • Hosted for simpler architecture*

(*) https://vespa.ai/cloud

SLIDE 41

Scalable low latency execution

An application package (configuration, components, ML models) is deployed via the admin & config cluster. A container node receives the query and scatter-gathers over content nodes, which shard the data (core sharding) and evaluate the models locally.

How to bound latency:
1) Parallelization
2) Prepared data structures (indexes etc.)
3) Move execution to data nodes

SLIDE 42

Deploying ML models to Vespa

placeholder → matmul(weights) → add(bias) → relu, expressed as a Vespa tensor function:

map(
    join(
        reduce(
            join(placeholder, weights, f(x,y)(x * y)),
            sum, d1
        ),
        bias, f(x,y)(x + y)
    ),
    f(x)(max(0,x))
)

1. Model in application package
2. Download model from external source during (re-)deployment
3. Feed model weights as tensors
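The tensor expression on this slide computes relu(x·W + b); a plain-Python rendering of the same dense layer (my sketch of the shape of the computation, not Vespa's actual evaluator):

```python
def dense_relu(x, W, b):
    """relu(x.W + b): join(x, W, multiply) -> reduce(sum, d1) -> add bias -> relu.

    x: length-d1 input vector; W: d1 x d2 weight matrix; b: length-d2 bias.
    """
    d2 = len(b)
    out = []
    for j in range(d2):
        # join + reduce over the shared dimension d1, then add bias
        s = sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
        # map with f(x)(max(0,x)) is the relu
        out.append(max(0.0, s))
    return out
```

The join/reduce pair is the matmul, the second join adds the bias, and the outer map applies the activation.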

SLIDE 43

Deployment strategy

A traffic splitter routes users between production and an experimental bucket for A/B testing; the scoring model is frozen for the duration of the test.

SLIDE 44

Results and ongoing work

~25% increase in time spent. Experimenting with:

  • more features and larger neural networks
  • personalized comment ranking
  • more sophisticated rewards

SLIDE 45

Generalizing the implementation

Presentation → content processing → Vespa (feed rankings), with external content flowing in; (r) and (x, a, p) are logged to a distributed DB for machine learning.

Applications: search, news recommendation, product recommendation, ad selection, Q&A, +++

SLIDE 46

Thanks to

Verizon Media Engineering: Sreekanth Ramakrishnan, Aaron Nagao, Zhi Qu, Xue Wu
Verizon Media Science: Akshay Soni, Kapil Thadani

SLIDE 47

Thank you!

https://vespa.ai/cloud vespa.ai