Query Understanding in Web Search
- by Large Scale Log Data Mining and
Statistical Learning
Hang Li Microsoft Research Asia
COLING 2010 NLPIX Workshop August 28, 2010
Joint Work with Colleagues, Interns, Collaborators
– Classification, structure prediction, topic modeling, similarity learning
– Classification, structure prediction, topic modeling, learning on graph
– Language model, similarity learning
– Learning to rank
– Classification, topic modeling
Joint work with Daxin Jiang, Xiaohui Sun
Users Queries Srch/Ads clicks Follow-up clicks Sessions
result pages
Log Mining Applications
Search Applications
– Query Understanding: Query Suggestion, Query Expansion, Query Substitution, Query Classification
– Document Understanding: Document Annotation, Document Classification, Document Summarization
– User Understanding: Personalized Search, Search UI Design, User Satisfaction Prediction
– Query-Doc Matching: Document & Ad (re-)Ranking, Search Results Clustering, Search Results Diversification, Web Site Recommendation
Ads Applications
– Keyword Generation, Behavior Targeting, Ad Click-Through Prediction, Contextual Advertising
Search Log Apps Log Data Query Understanding
Toolbar Data
Web Site Log Document Understanding User Understanding Query-Doc Matching
Each researcher or developer
data directly
from scratch Very difficult to build large-scale log mining applications
Log Objects Gallery (LOGAL)
– App Level: Query Understanding, Document Understanding, Query-Doc Matching, User Understanding
– Middle Level: Data Platform
– Raw Data Level: Raw Logs (Toolbar Data, Web Site Log, Search Log, Ads Log)
Example applications:
Query             Count
facebook          3,157 K
google            1,796 K
youtube           1,162 K
myspace             702 K
facebook com        665 K
yahoo               658 K
yahoo mail          486 K
yahoo com           486 K
ebay                486 K
facebook login      445 K
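A query histogram like the one above can be produced by normalizing and counting raw query strings. The following is a minimal sketch, assuming one query string per log record; the records and the `normalize` helper are hypothetical and are not part of LOGAL.

```python
from collections import Counter

# Hypothetical log records: one raw query string per search event.
log_queries = ["facebook", "google", "Facebook ", "youtube", "facebook", "google"]


def normalize(query: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(query.lower().split())


# Histogram of normalized queries.
query_counts = Counter(normalize(q) for q in log_queries)

# Print the most frequent queries, most frequent first.
for query, count in query_counts.most_common(3):
    print(f"{query}\t{count}")
```

At real log scale the same counting would be run as a distributed job; the in-memory `Counter` is only meant to show the logic.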
click-through bipartite
Query: Doc 1, Doc 2, …, Doc N
Patterns: Pattern 1 (count), Pattern 2 (count), …, Pattern n (count)
– Estimate relevance of document to query
– Predict users’ satisfaction
– Query classification (informational vs navigational)

– Doc (re-)ranking
– Query suggestion
– Site recommendation
– User satisfaction prediction
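The click-through bipartite graph mentioned above links queries to the documents clicked for them. The sketch below builds both directions of the graph from (query, URL) pairs and uses the share of clicks as a crude relevance signal. The click records and the `click_share` function are illustrative assumptions, not the estimator used in the talk.

```python
from collections import defaultdict

# Hypothetical click records: (query, clicked URL). Not real log data.
clicks = [
    ("ford", "www.ford.com"),
    ("ford", "www.ford.com"),
    ("ford cars", "www.ford.com"),
    ("ford", "www.autohome.com"),
    ("toyota", "www.toyota.com"),
]

# Two views of the same bipartite graph, with click counts as edge weights.
query_to_docs = defaultdict(lambda: defaultdict(int))
doc_to_queries = defaultdict(lambda: defaultdict(int))
for query, url in clicks:
    query_to_docs[query][url] += 1
    doc_to_queries[url][query] += 1


def click_share(query: str, url: str) -> float:
    """Fraction of the query's clicks that landed on this URL (0.0 if unseen)."""
    docs = query_to_docs.get(query, {})
    total = sum(docs.values())
    return docs.get(url, 0) / total if total else 0.0


print(click_share("ford", "www.ford.com"))
print(sorted(doc_to_queries["www.ford.com"]))
```

The document-to-queries view is what makes applications such as query suggestion possible: queries that share clicked documents are candidates for being related.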
Srch click: search click
Ads click: advertisement click
User activities in a session
Srch click Ads click …
Joint work with Gu Xu, Jun Xu, Jingfang Xu
Level of Semantics: Term, Word Sense, Topic, Structure
Match exactly same terms
– NY / New York
– disk / disc
Match terms with same meanings
– NY / New York
– motherboard / mainboard
– utube / youtube
Match topics of query and documents
– Microsoft Office … (Topic: PC Software)
– … working for Microsoft … my office is in … (Topic: Personal Homepage)
Match intent with answers (structures of query and document)
– Microsoft Office home: find homepage of Microsoft Office
– 21 movie: find movie named 21
– buy laptop less than 1000: find online dealers to buy laptop with less than 1000 dollars
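The difference between the first two levels can be made concrete with a small sketch. It assumes a hand-written table (`CANONICAL`, hypothetical) that maps surface forms to one canonical form; a real system would mine such knowledge from logs rather than hard-code it.

```python
# Hypothetical sense table: each surface form maps to one canonical form.
CANONICAL = {
    "ny": "new york",
    "mainboard": "motherboard",
    "utube": "youtube",
}


def term_match(query: str, doc: str) -> bool:
    """Term level: every query term must literally appear in the document."""
    doc_terms = set(doc.lower().split())
    return all(term in doc_terms for term in query.lower().split())


def canonicalize(text: str) -> str:
    """Rewrite each token to its canonical form when the table knows it."""
    return " ".join(CANONICAL.get(tok, tok) for tok in text.lower().split())


def sense_match(query: str, doc: str) -> bool:
    """Sense level: compare query and document after canonicalization."""
    return term_match(canonicalize(query), canonicalize(doc))


print(term_match("ny hotels", "hotels in new york"))   # exact terms only
print(sense_match("ny hotels", "hotels in new york"))  # after mapping ny to new york
```

Topic-level and structure-level matching need learned models (classification, named entity recognition) and are not covered by this sketch.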
[Figure: semantic matching architecture]
– Offline: Search Log Data, Offline Query Processing, Query Index (Query Knowledge); Web Data, Offline Document Processing, Document Index (Document Representations)
– Online: Query (e.g. Microsoft), Online Query Processing, Query Representation, Semantic Matching, Ranked Documents
Query understanding steps (example input: michael jordan berkele)
– Query Refinement: michael jordan berkeley
– Similar Query Finding (Sense): michael jordan berkeley; michael I. jordan berkeley
– Query Topic Identification (Topic): michael jordan berkeley: academic; michael I. jordan berkeley: academic
– Named Entity Recognition in Query (Structure): [michael jordan: PersonName] [berkeley: Location]: academic; [michael I. jordan: PersonName] [berkeley: Location]: academic
Document understanding steps
Input: Michael Jordan is Professor in the Department of Electrical Engineering
– Tokenization: [Michael Jordan] is [Professor] in the [Department] of [Electrical Engineering]
– Key Concept Identification (Sense): [Michael Jordan/M. Jordan] is [Professor] in the [Department/Dept.] of [Electrical Engineering/EE]
– Document Topic Identification (Topic): [Michael Jordan/M. Jordan] is [Professor] in the [Department/Dept.] of [Electrical Engineering/EE]: academic
– Named Entity Recognition in Doc. (Structure): [Michael Jordan/M. Jordan: PersonName] is [Professor] in the [Department/Dept.] of [Electrical Engineering/EE]: academic
Query Representation
– [michael jordan: PersonName] [berkeley: Location]: academic
– [michael I. jordan: PersonName] [berkeley: Location]: academic
Document Representation
– [Michael Jordan/M. Jordan: PersonName] is [Professor] in the [Department/Dept.] of [Electrical Engineering/EE]: academic
Semantic Matching
Matching can be conducted at different levels
Ranking Results
window onecar
Query Refiner Search System
Observed “noisy” word sequence: window
“Ideal” word sequence: windows
y_{i-1}  y_i  y_{i+1}
x_{i-1}  x_i  x_{i+1}
Operations
Spelling: insertion, deletion, substitution, transposition, …
Word Stemming: +s/-s, +es/-es, +ed/-ed, +ing/-ing, …
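The operations listed above can be illustrated as candidate generators. The sketch below enumerates every string one spelling operation or one suffix operation away from a word, then keeps candidates found in a vocabulary. It only shows the operation space: the tiny `VOCAB` set and the `refine` function are hypothetical, and the sketch does not implement the statistical model that chooses among operations in context.

```python
import string

SUFFIXES = ("s", "es", "ed", "ing")
VOCAB = {"windows", "onecare"}  # hypothetical vocabulary of known-good words


def spelling_edits(word: str) -> set:
    """All strings one insertion, deletion, substitution or transposition away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletions = {a + b[1:] for a, b in splits if b}
    transpositions = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    substitutions = {a + c + b[1:] for a, b in splits if b for c in letters}
    insertions = {a + c + b for a, b in splits for c in letters}
    return (deletions | transpositions | substitutions | insertions) - {word}


def stemming_edits(word: str) -> set:
    """Add or strip one of the suffixes s, es, ed, ing."""
    added = {word + s for s in SUFFIXES}
    removed = {word[: -len(s)] for s in SUFFIXES
               if word.endswith(s) and len(word) > len(s)}
    return added | removed


def refine(query: str) -> str:
    """Replace each out-of-vocabulary word by an in-vocabulary candidate, if any."""
    refined = []
    for word in query.lower().split():
        if word in VOCAB:
            refined.append(word)
            continue
        candidates = (spelling_edits(word) | stemming_edits(word)) & VOCAB
        # min() only makes the choice deterministic; it is not a ranking model.
        refined.append(min(candidates) if candidates else word)
    return " ".join(refined)


print(refine("window onecar"))
```

Picking candidates by vocabulary lookup ignores the neighbouring words; handling that dependency is the reason to use a sequence model instead.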
Query: harry potter
– harry potter – Movie (0.5)
– harry potter – Book (0.4)
– harry potter – Game (0.1)
Query: harry potter film
– harry potter – Movie (0.95)
Query: harry potter author
– harry potter – Book (0.95)
final fantasy: Movie, Game
gone with the wind: Movie, Book
harry potter: Movie, Book, Game

Query                          Count
final fantasy                    300
final fantasy movie              120
final fantasy wallpaper           50
gone with the wind movie         120
gone with the wind review         10
gone with the wind photos         10
harry potter                    1000
harry potter book                650
gone with the wind book           80
gone with the wind summary        20
harry potter cheats              300
harry potter pics                200
harry potter summary             100
final fantasy xbox                10
final fantasy soundtrack          10
gone with the wind               250
harry potter movie               800
……
harry potter (Movie, Book, Game): # 1000, # movie 800, # book 650, # cheats 300, # pics 200, # summary 100
gone with the wind (Movie, Book): # 250, # movie 120, # book 80, # summary 20, # review 10, # photos 10
final fantasy (Movie, Game): # 300, # movie 120, # wallpaper 50, # xbox 10, # soundtrack 10
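The step from logged queries to entity contexts can be sketched as follows: for each query that contains a known named entity, replace the entity with the placeholder `#` and accumulate the query count under the resulting context. The counts are a subset of the figures shown on the slides; the function name `extract_contexts` is an assumption for illustration.

```python
from collections import defaultdict

# Named entities with known classes, and a few logged queries with counts.
seeds = {
    "final fantasy": ["Movie", "Game"],
    "gone with the wind": ["Movie", "Book"],
    "harry potter": ["Movie", "Book", "Game"],
}
query_log = {
    "harry potter": 1000,
    "harry potter movie": 800,
    "harry potter book": 650,
    "gone with the wind book": 80,
    "final fantasy": 300,
    "final fantasy xbox": 10,
}


def extract_contexts(query_log: dict, seeds: dict) -> dict:
    """Map each seed entity to {context: count}, where '#' marks the entity."""
    contexts = defaultdict(lambda: defaultdict(int))
    for query, count in query_log.items():
        padded = f" {query} "  # padding enforces whole-word matching
        for entity in seeds:
            if f" {entity} " in padded:
                context = padded.replace(f" {entity} ", " # ", 1).strip()
                contexts[entity][context] += count
    return contexts


contexts = extract_contexts(query_log, seeds)
for entity, ctx in contexts.items():
    print(entity, dict(ctx))
```

The padding with spaces is a simple way to avoid matching an entity inside a longer word; it assumes queries are already whitespace-normalized.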
z: Movie, Book, Game
w: #, # movie, # book, …
– distribution of classes for named entity
– distribution of contexts for class
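One way the two distributions can be combined at query time is with Bayes' rule: score each class by P(class | entity) times P(context | class), then normalize. This is a sketch under that assumption. The class prior for harry potter uses the 0.5 / 0.4 / 0.1 figures from the slides, while the context probabilities and the smoothing constant are invented for illustration.

```python
# P(class | named entity): figures for "harry potter" as shown on the slides.
class_given_entity = {
    "harry potter": {"Movie": 0.5, "Book": 0.4, "Game": 0.1},
}

# P(context | class): hypothetical values, for illustration only.
context_given_class = {
    "Movie": {"#": 0.5, "# movie": 0.3, "# film": 0.2},
    "Book": {"#": 0.5, "# book": 0.3, "# author": 0.2},
    "Game": {"#": 0.5, "# cheats": 0.4, "# xbox": 0.1},
}


def classify(entity: str, context: str, smoothing: float = 1e-3) -> dict:
    """Return P(class | entity, context), proportional to prior times likelihood."""
    scores = {
        cls: prior * context_given_class[cls].get(context, smoothing)
        for cls, prior in class_given_entity[entity].items()
    }
    total = sum(scores.values())
    return {cls: score / total for cls, score in scores.items()}


print(classify("harry potter", "#"))       # bare entity: the prior is unchanged
print(classify("harry potter", "# film"))  # context shifts mass toward Movie
```

With the bare context `#` equally likely under every class, the posterior equals the prior; a discriminative context such as `# film` concentrates the probability on one class.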
[Figure: class distributions over Movie, Book, Game, Music for harry potter, final fantasy, gone with the wind; axis from 0.2 to 0.8]
constraints
Joint work with Daxin Jiang, Jian Pei, and others
User query query click click click Current search Context
SID   Search sessions
S1    Ford, Toyota, GMC, Allstate, www.autohome.com
S2    Ford cars, Toyota cars, GMC cars, Allstate, www.autohome.com
S3    Ford cars, Toyota cars, Allstate, www.allstate.com
S4    GMC, GMC dealers, www.gmc.com
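To show why the preceding queries matter, the sketch below counts which query follows each context in the four sessions above and suggests the most frequent continuation, backing off to shorter contexts. It is a plain counting baseline, not the model described in the talk. Treating items that start with `www.` as clicks, and the way each session is split into items, are assumptions.

```python
from collections import Counter, defaultdict

# Sessions from the table above; the item boundaries are an assumption.
sessions = {
    "S1": ["Ford", "Toyota", "GMC", "Allstate", "www.autohome.com"],
    "S2": ["Ford cars", "Toyota cars", "GMC cars", "Allstate", "www.autohome.com"],
    "S3": ["Ford cars", "Toyota cars", "Allstate", "www.allstate.com"],
    "S4": ["GMC", "GMC dealers", "www.gmc.com"],
}


def is_click(item: str) -> bool:
    """Assumption: items that look like URLs are clicks, the rest are queries."""
    return item.startswith("www.")


# next_query[context] counts the queries observed right after that context,
# where a context is any run of consecutive queries ending just before it.
next_query = defaultdict(Counter)
for items in sessions.values():
    queries = [item for item in items if not is_click(item)]
    for i in range(1, len(queries)):
        for k in range(i):
            next_query[tuple(queries[k:i])][queries[i]] += 1


def suggest(context):
    """Most frequent follow-up for the longest context suffix seen in the log."""
    context = tuple(context)
    for k in range(len(context)):
        candidates = next_query.get(context[k:])
        if candidates:
            return candidates.most_common(1)[0][0]
    return None


print(suggest(["Toyota", "GMC"]))
print(suggest(["Ford cars"]))
```

With only the current query GMC, the log supports two continuations (Allstate and GMC dealers) equally; the longer context Toyota, GMC has a single one, which is the point of using context.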
q_1, u_1, c_1, …, q_i, u_i, c_i, …, q_t, u_t, c_t
– Conduct clustering on click-bipartite graph and view clusters as hidden states.
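As a minimal sketch of this idea, the code below groups queries that share at least one clicked URL (connected components of the click-through bipartite graph, computed with union-find) and assigns each cluster an integer state id. Connected components are a simple stand-in chosen for illustration; the talk does not specify the clustering algorithm, and the click data is hypothetical.

```python
from collections import defaultdict

# Hypothetical click-through data: query -> set of clicked URLs.
clicks = {
    "ford": {"www.ford.com"},
    "ford cars": {"www.ford.com", "www.autohome.com"},
    "toyota": {"www.toyota.com"},
    "toyota cars": {"www.toyota.com"},
    "allstate": {"www.allstate.com"},
}


def cluster_queries(clicks: dict) -> list:
    """Group queries that are connected through shared clicked URLs."""
    parent = {query: query for query in clicks}

    def find(query):
        # Follow parent links to the root, compressing the path on the way.
        while parent[query] != query:
            parent[query] = parent[parent[query]]
            query = parent[query]
        return query

    first_query_for_url = {}
    for query, urls in clicks.items():
        for url in urls:
            if url in first_query_for_url:
                parent[find(query)] = find(first_query_for_url[url])
            else:
                first_query_for_url[url] = query

    groups = defaultdict(set)
    for query in clicks:
        groups[find(query)].add(query)
    return sorted(sorted(group) for group in groups.values())


clusters = cluster_queries(clicks)
# Each cluster index plays the role of one hidden state.
state_of = {query: state for state, group in enumerate(clusters) for query in group}
print(clusters)
print(state_of)
```

On real logs, connected components tend to merge too much through popular URLs, so a similarity threshold on click vectors would likely be needed.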
– Deploy the learning task on a distributed system under the map-reduce model
– Employ a special initialization strategy based on the clusters mined from the click-through bipartite graph
– Large-scale mining platform
– Advanced NLP and IR technologies
– Advanced statistical learning technologies

– LOGAL: search and browse log mining platform
– Semantic Matching: improving tail query relevance
– Context-aware Search: better search using context information