SLIDE 1

Statistical Ranking Problem

Tong Zhang

Statistics Department, Rutgers University

SLIDE 2

Ranking Problems

  • Rank a set of items and display to users in corresponding order.
  • Two issues: performance on top and dealing with large search space.

  – web-page ranking
    ∗ rank pages for a query
    ∗ theoretical analysis with an error criterion focusing on the top
  – machine translation
    ∗ rank possible (English) translations for a given input (Chinese) sentence
    ∗ an algorithm handling a large search space

SLIDE 3

Web-Search Problem

  • User types a query, search engine returns a result page:

  – selects from billions of pages.
  – assigns a score to each page, and returns pages ranked by the scores.

  • Quality of search engine:

  – relevance (whether returned pages are on topic and authoritative)
  – other issues
    ∗ presentation (diversity, perceived relevance, etc.)
    ∗ personalization (predicting user-specific intent)
    ∗ coverage (size and quality of the index)
    ∗ freshness (whether contents are timely)
    ∗ responsiveness (how quickly the search engine responds to the query)

SLIDE 4

Relevance Ranking: Statistical Learning Formulation

  • Training:

  – randomly select queries q, and web pages p for each query.
  – use editorial judgment to assign a relevance grade y(p, q).
  – construct a feature vector x(p, q) for each query/page pair.
  – learn a scoring function f̂(x(p, q)) that preserves the order of y(p, q) for each q.

  • Deployment:

  – a query q comes in.
  – return pages p1, . . . , pm in descending order of f̂(x(p, q)).

SLIDE 5

Measuring Ranking Quality

  • Given a scoring function f̂, return an ordered page list p1, . . . , pm for a query q.

  – only the order information is important.
  – should focus on the relevance of returned pages near the top.

  • DCG (discounted cumulative gain) with decreasing weights ci:

    DCG(f̂, q) = Σ_{i=1}^m ci r(pi, q)

  • ci: reflects the effort (or likelihood) of a user clicking on the i-th position.
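As a concrete illustration, a minimal DCG computation in Python. The slide leaves the decreasing weights ci abstract; this sketch assumes the common logarithmic choice ci = 1/log2(i + 1) for the i-th position:

```python
import math

def dcg(relevances, k=None):
    """DCG of a ranked page list: sum_i c_i * r(p_i, q).

    Uses the common position weight c_i = 1 / log2(i + 1);
    any decreasing sequence fits the slide's definition.
    """
    if k is not None:
        relevances = relevances[:k]
    # enumerate is 0-based, so position i+1 gets weight 1/log2(i+2)
    return sum(r / math.log2(i + 2) for i, r in enumerate(relevances))

# Placing the highest grades first yields a higher DCG,
# reflecting the emphasis on quality at the top.
good_order = dcg([3, 2, 1, 0])
bad_order = dcg([0, 1, 2, 3])
```

Only the order of the returned pages matters: permuting grades toward the top changes the score, while pages far down the list contribute little.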

SLIDE 6

Subset Ranking Model

  • x ∈ X: feature (x(p, q) ∈ X)
  • S ∈ S: a subset of X (S = {x1, . . . , xm} = {x(p, q) : p})

  – each subset corresponds to a fixed query q.
  – assume each subset has size m for convenience: m is large.

  • y: quality grade of each x ∈ X (y(p, q)).
  • scoring function f : X × S → R.

  – ranking function rf(S) = [j1, . . . , jm]: ordering of S ∈ S based on the scoring function f.

  • quality: DCG(f, S) = Σ_{i=1}^m ci E_{y_{ji}|(x_{ji}, S)} y_{ji}

SLIDE 7

Some Theoretical Questions

  • Learnability:

  – subset size m is huge: do we need many samples (rows) to learn?
  – focusing quality on top.

  • Learning method:

  – regression.
  – pair-wise learning? other methods?

  • Limited goal to address here:

  – can we learn ranking by using regression when m is large?
    ∗ massive data size (more than 20 billion)
    ∗ want to derive: error bounds independent of m.
  – what are some feasible algorithms and their statistical implications?

SLIDE 8

Bayes Optimal Scoring

  • Given a set S ∈ S, for each xj ∈ S, we define the Bayes-scoring function as

fB(xj, S) = E_{yj|(xj, S)} yj

  • The optimal Bayes ranking function rfB that maximizes DCG:

  – induced by fB.
  – returns a rank list J = [j1, . . . , jm] in descending order of fB(xji, S).
  – not necessarily unique (depending on the ci).

  • The function is subset dependent: it requires appropriate result-set features.

SLIDE 9

Simple Regression

  • Given subsets Si = {xi,1, . . . , xi,m} and corresponding relevance scores {yi,1, . . . , yi,m}.
  • We can estimate fB(xj, S) using regression in a family F:

    f̂ = arg min_{f∈F} Σ_{i=1}^n Σ_{j=1}^m (f(xi,j, Si) − yi,j)²

  • Problem: m is massive (> 20 billion)

  – computationally inefficient.
  – statistically slow convergence:
    ∗ ranking error bounded by O(√m) × root-mean-squared error.

  • Solution: should emphasize estimation quality on top.
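The simple-regression baseline can be sketched on synthetic data. The linear family, feature dimension, and noise level below are illustrative assumptions, not part of the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 50, 20, 5                            # n queries, m pages each, d features

X = rng.normal(size=(n, m, d))                 # feature vectors x(p, q)
beta = rng.normal(size=d)                      # "true" linear scoring parameter
Y = X @ beta + 0.1 * rng.normal(size=(n, m))   # noisy relevance grades y(p, q)

# Plain least squares over all n*m query/page pairs, ignoring which
# subset S_i each pair came from -- exactly the formulation whose
# ranking error carries the O(sqrt(m)) factor noted on the slide.
A = X.reshape(n * m, d)
b = Y.reshape(n * m)
beta_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
```

Every page, relevant or not, contributes equally to the objective here; the importance-weighted formulation on the next slide is what concentrates the error budget on the top of the list.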

SLIDE 10

Importance Weighted Regression

  • Some samples are more important than others (focus on top).
  • A revised formulation:

    f̂ = arg min_{f∈F} (1/n) Σ_{i=1}^n L(f, Si, {yi,j}j), with

    L(f, S, {yj}j) = Σ_{j=1}^m w(xj, S)(f(xj, S) − yj)² + u sup_j w′(xj, S)(f(xj, S) − δ(xj, S))_+²

  • weight w: importance weighting, focusing the regression error on top

  – zero for irrelevant pages.

  • weight w′: large for irrelevant pages

  – for which f(xj, S) should stay below the threshold δ.

  • importance weighting can be implemented through importance sampling.
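A direct transcription of the per-subset loss L(f, S, {yj}) above; the weight vectors w and w′, the threshold δ, and the constant u are inputs, and (t)_+ = max(t, 0) means irrelevant pages are penalized only when scored above the threshold:

```python
import numpy as np

def weighted_loss(f, y, w, w_prime, delta, u):
    """L(f, S, {y_j}) = sum_j w_j (f_j - y_j)^2
                        + u * sup_j w'_j (f_j - delta_j)_+^2."""
    regression = np.sum(w * (f - y) ** 2)          # weighted fit, top-heavy via w
    excess = np.maximum(f - delta, 0.0)            # positive part (.)_+
    top = u * np.max(w_prime * excess ** 2)        # sup_j over thresholded pages
    return regression + top

# Relevant page scored exactly on target; irrelevant page 1 above threshold:
loss = weighted_loss(f=np.array([1.0, 2.0]),
                     y=np.array([1.0, 0.0]),
                     w=np.array([1.0, 0.0]),        # regression weight on top page only
                     w_prime=np.array([0.0, 1.0]),  # threshold weight on irrelevant page
                     delta=np.array([0.0, 1.0]),
                     u=2.0)
```

With these numbers the regression term vanishes and only the thresholded term fires, giving a loss of u · 1² = 2.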

SLIDE 11

Relationship of Regression and Ranking

Let Q(f) = E_S L(f, S), where

  L(f, S) = E_{{yj}j|S} L(f, S, {yj}j)
          = Σ_{j=1}^m w(xj, S) E_{yj|(xj,S)} (f(xj, S) − yj)² + u sup_j w′(xj, S)(f(xj, S) − δ(xj, S))_+².

Theorem 1. Assume that ci = 0 for all i > k. Under appropriate parameter choices with some constants u and γ, for all f:

  DCG(rB) − DCG(rf) ≤ C(γ, u) (Q(f) − inf_{f′} Q(f′))^{1/2}.

Key point: focus on relevant documents on top; Σ_j w(xj, S) is much smaller than m.

SLIDE 12

Generalization Performance with Square Regularization

Consider the scoring function f̂β(x, S) = β̂ᵀψ(x, S), with feature vector ψ(x, S):

  β̂ = arg min_{β∈H} [ (1/n) Σ_{i=1}^n L(β, Si, {yi,j}j) + λβᵀβ ],   (1)

  L(β, S, {yj}j) = Σ_{j=1}^m w(xj, S)(fβ(xj, S) − yj)² + u sup_j w′(xj, S)(fβ(xj, S) − δ(xj, S))_+².

Theorem 2. Let M = sup_{x,S} ‖ψ(x, S)‖₂ and W = sup_S [Σ_{xj∈S} w(xj, S) + u sup_{xj∈S} w′(xj, S)]. Let f̂β be the estimator defined in (1). Then we have

  DCG(rB) − E_{{Si,{yi,j}j}, i=1,...,n} DCG(r_{f̂β}) ≤ C(γ, u) [ (1 + WM/√(2λn))² inf_{β∈H} (Q(fβ) + λβᵀβ) − inf_f Q(f) ]^{1/2}.

SLIDE 13

Interpretation of Results

  • The result does not depend on m, but on the much smaller quantity W = sup_S [Σ_{xj∈S} w(xj, S) + u sup_{xj∈S} w′(xj, S)]

  – emphasizes relevant samples on top: w is small for irrelevant documents.
  – a refined analysis can replace the sup over S by some p-norm over S.

  • Can control generalization for the top portion of the rank list even with large m.

  – learning complexity does not depend on the majority of items near the bottom of the rank list.
  – the bottom items are usually easy to estimate.

SLIDE 14

Key Points

  • Ranking quality near the top is most important

– statistical analysis to deal with the scenario

  • Regression based algorithm to handle large search space

  – importance weighting of regression terms.
  – error bounds independent of the massive web size.

SLIDE 15

Statistical Translation and Algorithm Challenge

  • Problem:

  – conditioned on a source sentence in one language.
  – generate a target sentence in another language.

  • General approach:

  – scoring function: measures the quality of the translated sentence based on the source sentence (similar to web search).
  – search strategy: effectively generate target-sentence candidates.
    ∗ search for the optimal score.
    ∗ structure used in the scoring function (through the block model).

  • Main challenge: exponential growth of search space

SLIDE 16

Graphical illustration

[Figure: alignment between a romanized Arabic source sentence and the English target “Israeli warplanes violate Lebanese airspace”, segmented into translation blocks b1–b4.]

SLIDE 17

Block Sequence Decoder

  • Database: a set of possible translation blocks

– e.g. block “a b” translates into potential block “z x y”.

  • Scoring function:

  – candidate translation: a block sequence (b1, . . . , bn).
  – map each block sequence to a non-negative score: sw(b_1^n) = Σ_{i=1}^n wᵀF(bi−1, bi; o).

  • Input: source sentence.
  • Translation:

  – generate block sequences consistent with the source sentence.
  – find the sequence with the largest score.
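The block-sequence score sw(b_1^n) = Σ_i wᵀF(bi−1, bi; o) can be sketched as follows. The feature map F and the orientation argument o are abstract in the slides, so `features` below is a hypothetical stand-in:

```python
import numpy as np

def block_score(w, blocks, features):
    """s_w(b_1^n) = sum_i w^T F(b_{i-1}, b_i; o).

    `features(prev, cur)` is a hypothetical stand-in for F over adjacent
    block pairs (orientation folded in); the first block is paired with
    None for the missing b_0.
    """
    prev = None
    total = 0.0
    for cur in blocks:
        total += float(w @ features(prev, cur))
        prev = cur
    return total

# Toy feature map: indicator of whether the adjacent pair appears in a
# tiny block database (illustrative only).
known_pairs = {(None, "b1"), ("b1", "b2")}
feats = lambda p, c: np.array([1.0 if (p, c) in known_pairs else 0.0])
score = block_score(np.array([2.0]), ["b1", "b2", "b3"], feats)
```

Decoding then amounts to searching over block sequences consistent with the source sentence for the one maximizing this score.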

SLIDE 18

Decoder Training

  • Given source/target sentence pairs {(Si, Ti)}i=1,...,n.
  • Given a decoding scheme implementing ẑ(w, S) = arg max_{z∈V(S)} sw(z).

  • Find a parameter w such that, on the training data:

  – ẑ(w, Si) achieves a high BLEU score on average.

  • Key difficulty: large search space (similar issues in MLR)
  • Traditional tuning.
  • Our methods:

  – local training.
  – global training.

SLIDE 19

Traditional scoring function

  • A bunch of local statistics gathered from the training data, and beyond.

  – different language models of the target, P(Ti|Ti−1, Ti−2, . . .).
    ∗ can depend on statistics beyond the training data.
    ∗ Google benefited significantly from huge language models.
  – orientation (swap) frequency.
  – block frequency.
  – block quality score:
    ∗ e.g. the normalized in-block unigram translation probability: for (S, T) = ({s1, . . . , sp}, {t1, . . . , tq}), P(S|T) = Π_j (1/nj) Σ_i p(sj|ti).

  • Linear combination of log-frequencies (five or six features):

  – sw(b_1^n) = Σ_i Σ_j wj log fj(bi; bi−1, · · · )

  • Tuning: hand-tuning; gradient-descent adjustment to optimize the BLEU score.

SLIDE 20

Large scale training

  • Motivation:

  – want: the ability to incorporate a large number of features.
  – require: an automated training method to optimize millions of parameters.
  – similar to the transition from Inktomi rank-function tuning to MLR.

  • Challenges: search space V (S) is too large. Need to break it down.

  – direct global training using the relevant-set model: treats the decoder as a black box.

SLIDE 21

Global training of decoder parameter

  • Treat the decoder as a black box, implementing ẑ(w, S) = arg max_{z∈V(S)} sw(z).

  • No need to know V (S).
  • Generate the truth set as the block sequences with the K largest BLEU scores: VK(S).
  • Generate alternatives as a subset of V (S) that are “most relevant”.

  – if w does well on the relevant alternatives, then it does well on the whole set V (S).
  – related to the active sampling procedure in MLR.

  • Some related works in parsing exist.

SLIDE 22

Learning method

  • Try to minimize the following regularized empirical risk:

    ŵ = arg min_w (1/N) Σ_{i=1}^N Φ(w, VK(Si), V (Si)) + λ‖w‖²

    Φ(w, VK, V ) = (1/K) Σ_{z∈VK} max_{z′∈V−VK} ψ(w, z, z′),  ψ(w, z, z′) = φ(sw(z), Bl(z); sw(z′), Bl(z′))

  • Relevant set: let ξi(w, z) = max_{z′∈V(Si)−VK(Si)} ψ(w, z, z′) and

    V (r)(ŵ, Si) = {z′ ∈ V (Si) : ∃z ∈ VK(Si), ξi(ŵ, z) ≠ 0 and ψ(ŵ, z, z′) = ξi(ŵ, z)}.

  • Key observation: V can be replaced by V (r) without changing the solution.

SLIDE 23

Example loss functions

  • Truth z ∈ VK, alternative z′ ∈ V − VK (or in V (r)).
  • Bl(z) > Bl(z′): want to penalize if score sw(z) ≤ sw(z′).
  • Estimate sw(z) using least squares (consistent):

    φ(sw(z), Bl(z); sw(z′), Bl(z′)) = α(sw(z) − Bl(z))² + α′(sw(z′) − Bl(z′))².

  – does not do well in our experiments. Possible reasons: we did not implement the re-weighting correctly; sw(z) cannot approximate Bl(z) very well.

  • Estimate sw(z) using a pair-wise loss (inconsistent):

    φ(sw(z), Bl(z); sw(z′), Bl(z′)) = (Bl(z) − Bl(z′)) max(1 − (sw(z) − sw(z′)), 0)².
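A minimal transcription of the pair-wise loss, assuming scalar scores and BLEU values:

```python
def pairwise_loss(s_z, bl_z, s_zp, bl_zp):
    """phi(s_w(z), Bl(z); s_w(z'), Bl(z')) =
       (Bl(z) - Bl(z')) * max(1 - (s_w(z) - s_w(z')), 0)^2

    A truth z with higher BLEU than alternative z' contributes zero loss
    once its score beats s_w(z') by a margin of at least 1; otherwise the
    squared hinge is scaled by the BLEU gap."""
    return (bl_z - bl_zp) * max(1.0 - (s_z - s_zp), 0.0) ** 2

# Score margin of 2 already satisfies the hinge -> zero loss:
satisfied = pairwise_loss(s_z=2.0, bl_z=0.8, s_zp=0.0, bl_zp=0.3)
# Equal scores -> loss proportional to the BLEU gap:
violated = pairwise_loss(s_z=0.0, bl_z=0.8, s_zp=0.0, bl_zp=0.3)
```

The BLEU-gap factor makes confusing a much-better truth with a much-worse alternative cost more than confusing two near-equivalent translations.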

SLIDE 24

Approximate Relevant Set Method

  • Observation:

– relevant set depends on w; – w can be calculated based on approximate relevant set.

  • Iterate to find both (similar to the active learning procedure in MLR).

– fix V (r), update w. – fix w, update V (r).

23

slide-25
SLIDE 25

Table 1: Generic Relevant Set Algorithm

  divide the training points into m blocks J1, . . . , Jm
  initialize the weight vector w ← w0
  for each data point S: initialize the truth set VK(S) and alternatives V (r)(S)
  for ℓ = 1, · · · , L
    for j = 1, · · · , m
      for each S ∈ Jj
        select relevant points {z̃k} ∈ S (*)
        update V (r)(S) ← V (r)(S) ∪ {z̃k}
      update w by approximately solving (**):
        min_w (1/N) Σ_{i=1}^N Φ(w, VK(Si), V (r)(Si)) + λ‖w‖²
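The loop of Table 1 can be sketched as a skeleton. The callables `select_relevant` and `solve_w` are hypothetical stand-ins for steps (*) and (**), and the block structure J1, . . . , Jm is collapsed into a single pass over the data for brevity:

```python
def relevant_set_train(data, truth_sets, select_relevant, solve_w, w0, L=5):
    """Generic relevant-set algorithm, schematically.

    data: training points S_i; truth_sets[i] plays the role of V_K(S_i).
    select_relevant(w, S) -> new relevant alternatives        (step (*))
    solve_w(w, data, truth_sets, relevant) -> updated w       (step (**))
    """
    w = w0
    relevant = [set() for _ in data]            # V^(r)(S_i), grown each pass
    for _ in range(L):
        for i, S in enumerate(data):
            relevant[i] |= set(select_relevant(w, S))
        w = solve_w(w, data, truth_sets, relevant)
    return w

# Toy run with trivial stand-ins: each pass adds one alternative per
# point and nudges w by 1.
w_final = relevant_set_train(
    data=["S1", "S2"],
    truth_sets=[{"z1"}, {"z2"}],
    select_relevant=lambda w, S: {"alt-" + S},
    solve_w=lambda w, d, t, r: w + 1.0,
    w0=0.0,
    L=3,
)
```

The point of the structure is that the inner optimization only ever sees the accumulated relevant sets, never the full (exponentially large) V(S).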

SLIDE 26

Convergence analysis

  • Approximate relevant-set size:

  – relevant-set update rule (*): for each zk ∈ VK(S) (k = 1, . . . , K), pick z̃k ∈ V (S) − VK(S) such that

      ψ(w, zk, z̃k) = max_{z′∈V(S)−VK(S)} ψ(w, zk, z′)

  – convergence bound: independent of the size of V (S).
  – generalization bound: independent of the size of V (S).
  – real implementation uses the decoder output: z̃ = arg max_{z′∈V(S)} sw(z′)

  • Weight update rule (**):

  – can be implemented using stochastic gradient descent.

SLIDE 27

Experiments (MT03 Arabic-English DARPA evaluation)


Figure 1: BLEU score on test and training data as a function of the number of training iterations.

SLIDE 28

Translation results (25 training iterations)

  Model     training   test
  MON-PHR   0.3661     0.261
  MON       0.4773     0.359
  SWAP      0.4741     0.362

  • Using only simple binary features, without language models, etc. (never done before in SMT)

  – MON-PHR: phrase-id based features;
  – MON/SWAP: plus internal word features, with and without swapping.

  • Context:

  – many published grammar-based methods score in the .20s.
  – state of the art (with much additional engineering tuning): Google with a huge language model (around .50), IBM (around .45).

SLIDE 29

Remarks

  • A method to handle a large search space for an arbitrary decoding procedure

  – key: restrict the search-space dependency
    ∗ through a loss function of the form max_{z′∈V(S)−A} φ(z′, · · · ).
  – z′ is achieved on a small relevant set (the non-relevant points can be ignored).
  – use the decoder output to approximate the relevant set.
    ∗ automatic construction of the most relevant alternatives.
