Statistical Ranking Problem
Tong Zhang, Statistics Department, Rutgers University
Ranking Problems
- Rank a set of items and display them to users in the corresponding order.
- Two key issues: performance at the top of the list and handling a large search space.
– web-page ranking
  ∗ rank pages for a query
  ∗ theoretical analysis with an error criterion focusing on the top
– machine translation
  ∗ rank possible (English) translations for a given input (Chinese) sentence
  ∗ algorithm handling a large search space
Web-Search Problem
- User types a query, search engine returns a result page:
– selects from billions of pages.
– assigns a score to each page, and returns pages ranked by the scores.
- Quality of search engine:
– relevance (whether returned pages are on topic and authoritative)
– other issues
  ∗ presentation (diversity, perceived relevance, etc.)
  ∗ personalization (predict user-specific intention)
  ∗ coverage (size and quality of the index)
  ∗ freshness (whether contents are timely)
  ∗ responsiveness (how quickly the search engine responds to the query)
Relevance Ranking: Statistical Learning Formulation
- Training:
– randomly select queries q, and web pages p for each query.
– use editorial judgment to assign a relevance grade y(p, q).
– construct a feature vector x(p, q) for each query/page pair.
– learn a scoring function f̂(x(p, q)) that preserves the order of y(p, q) for each q.
- Deployment:
– a query q comes in.
– return pages p_1, . . . , p_m in descending order of f̂(x(p, q)).
Measuring Ranking Quality
- Given a scoring function f̂, return the ordered page list p_1, . . . , p_m for a query q.
  – only the order information is important.
  – should focus on the relevance of the returned pages near the top.
- DCG (discounted cumulative gain) with decreasing weights c_i:

  DCG(f̂, q) = \sum_{i=1}^m c_i r(p_i, q)

- c_i: reflects the effort (or likelihood) of a user clicking on the i-th position.
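A minimal sketch of the DCG criterion above. The helper `dcg` is hypothetical (not from the slides), and the discount c_i = 1/log2(i + 2) is one common choice; the slides leave c_i abstract:

```python
# Compute DCG(f_hat, q) = sum_i c_i * r(p_i, q): sort pages by score,
# then sum position-discounted relevance grades over the top k positions.
import math

def dcg(scores, relevance, k=10):
    """scores[i] and relevance[i] refer to the same page; rank by score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return sum(relevance[i] / math.log2(pos + 2)   # c_i = 1/log2(i+2), assumed
               for pos, i in enumerate(order[:k]))
```

Because the c_i decrease, only the order near the top matters: swapping two low-ranked pages changes DCG far less than swapping the top two.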
Subset Ranking Model
- x ∈ X: feature vector (x(p, q) ∈ X)
- S ∈ S: a subset of X ({x_1, . . . , x_m} = {x(p, q) : p} ∈ S)
  – each subset corresponds to a fixed query q.
  – assume each subset has size m for convenience; m is large.
- y: quality grade of each x ∈ X (y(p, q)).
- scoring function f : X × S → R.
  – ranking function r_f(S) = {j_i}: ordering of S ∈ S based on the scoring function f.
- quality: DCG(f, S) = \sum_{i=1}^m c_i E_{y_{j_i}|(x_{j_i}, S)} y_{j_i}.
Some Theoretical Questions
- Learnability:
  – subset size m is huge: do we need many samples (rows) to learn?
  – focusing quality on the top.
- Learning method:
  – regression.
  – pair-wise learning? other methods?
- Limited goal addressed here:
  – can we learn ranking by regression when m is large?
    ∗ massive data size (more than 20 billion pages)
    ∗ want to derive error bounds independent of m.
  – what are some feasible algorithms and their statistical implications?
Bayes Optimal Scoring
- Given a set S ∈ S, for each x_j ∈ S, define the Bayes scoring function

  f_B(x_j, S) = E_{y_j|(x_j, S)} y_j

- The optimal Bayes ranking function r_{f_B} that maximizes DCG:
  – induced by f_B
  – returns a ranked list J = [j_1, . . . , j_m] in descending order of f_B(x_{j_i}, S).
  – not necessarily unique (depending on the c_j)
- The function is subset dependent: requires appropriate result-set features.
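A toy brute-force check (assumed setup, not from the slides) that ranking in descending order of the Bayes score f_B maximizes DCG when the position weights c_i decrease:

```python
# Enumerate all orderings of 3 items and confirm that descending f_B is
# DCG-optimal. f_B and c below are illustrative numbers, not from the talk.
from itertools import permutations

f_B = [0.2, 0.9, 0.5]   # conditional expected grades E[y_j | x_j, S]
c   = [1.0, 0.5, 0.25]  # decreasing position weights c_i

def dcg(order):
    return sum(ci * f_B[j] for ci, j in zip(c, order))

best  = max(permutations(range(3)), key=dcg)
bayes = tuple(sorted(range(3), key=lambda j: -f_B[j]))
assert dcg(bayes) == dcg(best)   # sorting by f_B achieves the maximum
```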
Simple Regression
- Given subsets S_i = {x_{i,1}, . . . , x_{i,m}} and the corresponding relevance scores {y_{i,1}, . . . , y_{i,m}}.
- We can estimate f_B(x_j, S) using regression in a family F:

  f̂ = arg min_{f∈F} \sum_{i=1}^n \sum_{j=1}^m (f(x_{i,j}, S_i) - y_{i,j})^2

- Problem: m is massive (> 20 billion)
  – computationally inefficient
  – statistically slow convergence
    ∗ ranking error bounded by O(√m) × root-mean-squared error.
- Solution: should emphasize estimation quality at the top.
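A dependency-free sketch of the plain regression approach: fit f to the grades by least squares, then rank by the fitted scores. The single feature, the closed-form 1-d fit, and the toy data are illustrative assumptions:

```python
# Fit f(x) = a*x + b by least squares on one "subset" S (one query),
# then rank the items of S in descending order of the fitted score.

def fit_least_squares(xs, ys):
    """Closed-form 1-d least squares: minimizes sum_i (a*x_i + b - y_i)^2."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

xs = [0.1, 0.9, 0.4, 0.7]   # features of the pages for one query
ys = [0.0, 2.0, 1.0, 2.0]   # editorial relevance grades
a, b = fit_least_squares(xs, ys)
ranking = sorted(range(len(xs)), key=lambda j: -(a * xs[j] + b))
```

The squared loss treats every position equally, which is exactly the weakness the slide points out: with huge m, most of the error budget is spent on items that never reach the top.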
Importance Weighted Regression
- Some samples are more important than others (focus on the top).
- A revised formulation:

  f̂ = arg min_{f∈F} (1/n) \sum_{i=1}^n L(f, S_i, {y_{i,j}}_j), with

  L(f, S, {y_j}_j) = \sum_{j=1}^m w(x_j, S) (f(x_j, S) - y_j)^2 + u sup_j w'(x_j, S) (f(x_j, S) - δ(x_j, S))_+^2

- weight w: importance weighting that focuses the regression error on the top
  – zero for irrelevant pages
- weight w': large for irrelevant pages
  – for which f(x_j, S) should stay below the threshold δ.
- importance weighting can be implemented through importance sampling.
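The importance-weighted loss above can be sketched directly. All the numbers below (weights, grades, threshold, u) are illustrative assumptions, not from the talk:

```python
# L(f, S, {y_j}) = sum_j w_j (f_j - y_j)^2 + u * sup_j w'_j [(f_j - delta_j)_+]^2
# w concentrates the squared error on (potentially) relevant items;
# w' applies a one-sided positive-part penalty pushing irrelevant scores
# below the threshold delta.

def weighted_loss(f, y, w, w_prime, delta, u):
    reg = sum(wj * (fj - yj) ** 2 for wj, fj, yj in zip(w, f, y))
    top = max(wpj * max(fj - dj, 0.0) ** 2
              for wpj, fj, dj in zip(w_prime, f, delta))
    return reg + u * top

# item 0 is relevant (w = 1), item 1 is irrelevant (w' = 1):
loss = weighted_loss(f=[2.0, 0.5], y=[2.0, 0.0],
                     w=[1.0, 0.0], w_prime=[0.0, 1.0],
                     delta=[0.0, 0.2], u=1.0)
```

Here the relevant item is fit exactly (zero regression term), so the entire loss comes from the irrelevant item's score 0.5 exceeding its threshold 0.2.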
Relationship of Regression and Ranking
Let Q(f) = E_S L(f, S), where

  L(f, S) = E_{{y_j}_j|S} L(f, S, {y_j}_j)
          = \sum_{j=1}^m w(x_j, S) E_{y_j|(x_j, S)} (f(x_j, S) - y_j)^2 + u sup_j w'(x_j, S) (f(x_j, S) - δ(x_j, S))_+^2.

Theorem 1. Assume that c_i = 0 for all i > k. Under appropriate parameter choices with some constants u and γ, for all f:

  DCG(r_B) - DCG(r_f) ≤ C(γ, u) (Q(f) - inf_{f'} Q(f'))^{1/2}.

Key point: focus on relevant documents at the top; \sum_j w(x_j, S) is much smaller than m.
Generalization Performance with Square Regularization
Consider the scoring function f̂_β(x, S) = β̂^T ψ(x, S), with feature vector ψ(x, S):

  β̂ = arg min_{β∈H} [ (1/n) \sum_{i=1}^n L(β, S_i, {y_{i,j}}_j) + λ β^T β ],   (1)

  L(β, S, {y_j}_j) = \sum_{j=1}^m w(x_j, S) (f_β(x_j, S) - y_j)^2 + u sup_j w'(x_j, S) (f_β(x_j, S) - δ(x_j, S))_+^2.

Theorem 2. Let M = sup_{x,S} ‖ψ(x, S)‖_2 and W = sup_S [ \sum_{x_j∈S} w(x_j, S) + u sup_{x_j∈S} w'(x_j, S) ]. Let f̂_β be the estimator defined in (1). Then we have

  DCG(r_B) - E_{{(S_i, {y_{i,j}}_j)}_{i=1}^n} DCG(r_{f̂_β}) ≤ C(γ, u) [ (1 + WM/√(2λn))^2 inf_{β∈H} (Q(f_β) + λ β^T β) - inf_f Q(f) ]^{1/2}.
Interpretation of Results
- The result does not depend on m, but on the much smaller quantity W = sup_S [ \sum_{x_j∈S} w(x_j, S) + u sup_{x_j∈S} w'(x_j, S) ]
  – emphasizes relevant samples at the top: w is small for irrelevant documents.
  – a refined analysis can replace the sup over S by some p-norm over S.
- Can control generalization for the top portion of the rank list even with large m.
  – learning complexity does not depend on the majority of items near the bottom of the rank list.
  – the bottom items are usually easy to estimate.
Key Points
- Ranking quality near the top is most important
– statistical analysis to deal with the scenario
- Regression based algorithm to handle large search space
– importance weighting of the regression terms
– error bounds independent of the massive web size.
Statistical Translation and Algorithm Challenge
- Problem:
– conditioned on a source sentence in one language,
– generate a target sentence in another language.
- General approach:
– scoring function: measures the quality of the translated sentence given the source sentence (similar to web search)
– search strategy: effectively generate target-sentence candidates.
  ∗ search for the optimal score.
  ∗ structure used in the scoring function (through the block model).
- Main challenge: exponential growth of the search space
Graphical illustration
[Figure: block alignment b_1, . . . , b_4 between a romanized Arabic source sentence and its English translation, "Israeli warplanes violate Lebanese airspace".]
Block Sequence Decoder
- Database: a set of possible translation blocks
– e.g. block “a b” translates into potential block “z x y”.
- Scoring function:
  – candidate translation: a block sequence (b_1, . . . , b_n)
  – map each block sequence to a non-negative score s_w(b_1^n) = \sum_{i=1}^n w^T F(b_{i-1}, b_i; o).
- Input: source sentence.
- Translation:
– generate block sequences consistent with the source sentence.
– find the sequence with the largest score.
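A sketch of the block-sequence score s_w(b_1^n) = \sum_i w^T F(b_{i-1}, b_i; o) with sparse binary features. The feature templates ("block identity" and "block bigram") and the weights are hypothetical illustrations:

```python
# Score a candidate block sequence as the sum of learned weights over the
# binary features that fire on each adjacent block pair, then pick the
# highest-scoring candidate (a stand-in for the decoder's search).

def features(prev_block, block):
    """Hypothetical binary features of an adjacent block pair."""
    return {f"block={block}", f"bigram={prev_block}|{block}"}

def score(w, blocks):
    total, prev = 0.0, "<s>"
    for b in blocks:
        total += sum(w.get(f, 0.0) for f in features(prev, b))
        prev = b
    return total

w = {"block=b1": 1.0, "block=b2": 0.5, "bigram=b1|b2": 0.25}
candidates = [["b1", "b2"], ["b2", "b1"]]
best = max(candidates, key=lambda z: score(w, z))
```

A real decoder cannot enumerate `candidates` explicitly; the point of the later slides is how to train w when V(S) is exponentially large.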
Decoder Training
- Given source/target sentence pairs {(Si, Ti)}i=1,...,n.
- Given a decoding scheme implementing ẑ(w, S) = arg max_{z∈V(S)} s_w(z).
- Find parameter w such that on the training data:
– ẑ(w, S_i) achieves a high BLEU score on average.
- Key difficulty: large search space (similar issues as in MLR, machine-learned ranking)
- Traditional tuning.
- Our methods:
– local training.
– global training.
Traditional scoring function
- A collection of local statistics gathered from the training data, and beyond.
  – different language models of the target P(T_i|T_{i-1}, T_{i-2}, . . .).
    ∗ can depend on statistics beyond the training data.
    ∗ Google benefited significantly from huge language models.
  – orientation (swap) frequency.
  – block frequency.
  – block quality score:
    ∗ e.g. normalized in-block unigram translation probability: for (S, T) = ({s_1, . . . , s_p}, {t_1, . . . , t_q}), P(S|T) = \prod_j (1/n_j) \sum_i p(s_j|t_i).
- Linear combination of log-frequencies (five or six features):
  – s_w({b_1^n}) = \sum_i \sum_j w_j log f_j(b_i; b_{i-1}, · · · )
- Tuning: hand-tuning; gradient-descent adjustment to optimize the BLEU score
Large scale training
- Motivation:
– want: the ability to incorporate a large number of features.
– require: an automated training method to optimize millions of parameters.
– similar to the transition from Inktomi rank-function tuning to MLR.
- Challenges: search space V (S) is too large. Need to break it down.
– direct global training using the relevant set model: treats the decoder as a black box.
Global training of decoder parameter
- Treat the decoder as a black box implementing ẑ(w, S) = arg max_{z∈V(S)} s_w(z).
- No need to know V(S).
- Generate the truth set V_K(S) as the block sequences with the K largest BLEU scores.
- Generate alternatives as a subset of V (S) that are “most relevant”.
– if w does well on the relevant alternatives, then it does well on the whole set V(S).
– related to the active sampling procedure in MLR.
- Some related work exists in parsing.
Learning method
- Minimize the following regularized empirical risk:

  ŵ = arg min_w (1/m) \sum_{i=1}^N Φ(w, V_K(S_i), V(S_i)) + λ‖w‖^2

  Φ(w, V_K, V) = (1/K) \sum_{z∈V_K} max_{z'∈V−V_K} ψ(w, z, z'),
  ψ(w, z, z') = φ(s_w(z), Bl(z); s_w(z'), Bl(z')).

- Relevant set: let ξ_i(w, z) = max_{z'∈V(S_i)−V_K(S_i)} ψ(w, z, z')

  V^{(r)}(ŵ, S_i) = {z' ∈ V(S_i) : ∃z ∈ V_K(S_i), ξ_i(ŵ, z) ≠ 0 & ψ(ŵ, z, z') = ξ_i(ŵ, z)}.

- Key observation: we can replace V by V^{(r)} without changing the solution.
Example loss functions
- Truth z ∈ V_K, alternative z' ∈ V − V_K (or in V^{(r)}).
- Bl(z) > Bl(z'): want to penalize if the score s_w(z) ≤ s_w(z').
- Estimate s_w(z) using least squares (consistent):

  φ(s_w(z), Bl(z); s_w(z'), Bl(z')) = α(s_w(z) − Bl(z))^2 + α'(s_w(z') − Bl(z'))^2.

  – does not do well in our experiments. Possible reasons: we did not implement it correctly (re-weighting); s_w(z) cannot approximate Bl(z) very well.
- Estimate s_w(z) using a pair-wise loss (inconsistent):

  φ(s_w(z), Bl(z); s_w(z'), Bl(z')) = (Bl(z) − Bl(z')) max(1 − (s_w(z) − s_w(z')), 0)^2.
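The pair-wise loss above can be written as a one-line function; the scores and BLEU values below are illustrative:

```python
# phi = (Bl(z) - Bl(z')) * max(1 - (s_w(z) - s_w(z')), 0)^2
# Penalize whenever the higher-BLEU sequence z fails to out-score the
# alternative z' by a margin of 1; no loss once the margin is satisfied.

def pairwise_loss(s_z, bl_z, s_zp, bl_zp):
    return (bl_z - bl_zp) * max(1.0 - (s_z - s_zp), 0.0) ** 2

# margin satisfied -> zero loss
assert pairwise_loss(s_z=2.0, bl_z=0.4, s_zp=0.5, bl_zp=0.1) == 0.0
# margin violated -> loss scaled by the BLEU gap
l = pairwise_loss(s_z=0.5, bl_z=0.4, s_zp=0.6, bl_zp=0.1)
```

Weighting the squared hinge by the BLEU gap Bl(z) − Bl(z') makes violations on badly mismatched pairs cost more than violations on near-ties.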
Approximate Relevant Set Method
- Observation:
– the relevant set depends on w;
– w can be calculated based on an approximate relevant set.
- Iterate to find both (similar to the active learning procedure in MLR).
– fix V^{(r)}, update w.
– fix w, update V^{(r)}.
Table 1: Generic Relevant Set Algorithm

  divide training points into m blocks J_1, . . . , J_m
  initialize weight vector w ← w_0
  for each data point S: initialize truth V_K(S) and alternative V^{(r)}(S)
  for ℓ = 1, · · · , L
    for j = 1, · · · , m
      for each S ∈ J_j
        select relevant points {z̃_k} ∈ S (*)
        update V^{(r)}(S) ← V^{(r)}(S) ∪ {z̃_k}
      update w by approximately solving (**)
        min_w (1/m) \sum_{i=1}^N Φ(w, V_K(S_i), V^{(r)}(S_i)) + λ‖w‖^2
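A toy Python skeleton of the algorithm in Table 1, with stand-ins for the decoder, features, and BLEU scores (all illustrative assumptions). The weight update is a subgradient step in the direction suggested by the pair-wise loss, not the exact minimization of Φ:

```python
# Generic relevant-set training sketch: alternately (*) add the decoder's
# best-scoring non-truth candidate to the approximate relevant set V^(r),
# and (**) take subgradient-style steps on w over truth/alternative pairs.

def dot(w, f):
    return sum(w.get(k, 0.0) * v for k, v in f.items())

def relevant_set_train(candidates, truth, feats, bleu,
                       epochs=3, lr=0.1, lam=0.01):
    w, relevant = {}, set()                  # weights, approximate V^(r)(S)
    for _ in range(epochs):
        # (*) decoder stand-in: best-scoring candidate outside the truth set
        alts = [z for z in candidates if z not in truth]
        relevant.add(max(alts, key=lambda z: dot(w, feats[z])))
        # (**) approximately minimize Phi + lam*||w||^2 on violated pairs
        for z in truth:
            for zp in relevant:
                if dot(w, feats[z]) - dot(w, feats[zp]) < 1.0:
                    g = lr * (bleu[z] - bleu[zp])   # BLEU-gap step size
                    for k in set(feats[z]) | set(feats[zp]):
                        delta = feats[z].get(k, 0.0) - feats[zp].get(k, 0.0)
                        w[k] = (1 - lr * lam) * w.get(k, 0.0) + g * delta
    return w

feats = {"z1": {"a": 1.0}, "z2": {"b": 1.0}, "z3": {"a": 1.0, "b": 1.0}}
bleu = {"z1": 0.5, "z2": 0.1, "z3": 0.2}
w = relevant_set_train(["z1", "z2", "z3"], {"z1"}, feats, bleu)
```

After training, the high-BLEU truth sequence out-scores both alternatives even though only the currently most-violating alternatives were ever examined.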
Convergence analysis
- Approximate relevant set size:
  – relevant set update rule (*): pick z̃_k ∈ V(S) − V_K(S) for each z_k ∈ V_K(S) (k = 1, . . . , K) such that

    ψ(w, z_k, z̃_k) = max_{z'∈V(S)−V_K(S)} ψ(w, z_k, z')

  – convergence bound: independent of the size of V(S).
  – generalization bound: independent of the size of V(S).
  – real implementation uses the decoder output: z̃ = arg max_{z'∈V(S)} s_w(z')
- Weight update rule (**):
  – can be implemented using stochastic gradient descent.
Experiments (MT03 Arabic-English DARPA evaluation)
[Figure 1: BLEU score on test and training data as a function of the number of training iterations.]
Translation results (25 training iterations)
Model      training   test
MON-PHR    0.3661     0.261
MON        0.4773     0.359
SWAP       0.4741     0.362
- Uses only simple binary features, without language models, etc. (never done before in SMT)
  – MON-PHR: phrase-id based features;
  – MON/SWAP: plus internal word features, without and with swapping.
- Context:
– many published grammar-based methods are in the .20s.
– state of the art (with much additional engineering tuning): Google with a huge language model (around .50), IBM (around .45).
Remarks
- Method to handle large search space for arbitrary decoding procedure
– key: restrict the search-space dependency
  ∗ through a loss function of the form max_{z'∈V(S)−A} φ(z', · · · ).
– z' is achieved at a small relevant set (can ignore the non-relevant points)
– use the decoder output to approximate the relevant set.
  ∗ automatic construction of the most relevant alternatives.