Query Understanding in Web Search - by Large Scale Log Data Mining




slide-1
SLIDE 1

Query Understanding in Web Search

by Large Scale Log Data Mining and Statistical Learning

Hang Li Microsoft Research Asia

COLING 2010 NLPIX Workshop August 28, 2010

Joint Work with Colleagues, Interns, Collaborators

1

slide-2
SLIDE 2

Web Search is Part of Our Life

2

slide-3
SLIDE 3

Search System = “Black Boxes”

3

slide-4
SLIDE 4

Advanced Web Search Technologies Are Used…

  • Natural Language Processing
  • Information Retrieval
  • Data Mining
  • Large Scale Distributed Computing
  • Statistical Learning

4

slide-5
SLIDE 5

Web Search Relies on NLP and IR

  • Query Understanding

– Classification, structure prediction, topic modeling, similarity learning

  • Document Understanding

– Classification, structure prediction, topic modeling, learning on graph

  • Query Document Matching

– Language model, similarity learning

  • Ranking

– Learning to rank

  • User Understanding

– Classification, topic modeling

5

slide-6
SLIDE 6

Query Understanding

  • Input: query
  • Output: query representation

– Refined query (e.g., spelling error correction)
– Similar queries
– Categories
– Topics
– Key phrases
– Named entities

6

slide-7
SLIDE 7

This Talk = Query Understanding

7

slide-8
SLIDE 8

Talk Outline: Three Projects

  • LOGAL: Search and Browse Log Mining Platform
  • Semantic Matching: Improving Tail Query Relevance
  • Context aware Search: Better Search Using Context Information

2010/8/30 8

slide-9
SLIDE 9

PROJECT: LOGAL (LOG OBJECT GALLERY)

9

Joint work with Daxin Jiang, Xiaohui Sun

slide-10
SLIDE 10

LOGAL

Search and Browse Log Mining Platform

10

slide-11
SLIDE 11

Data Structure of Search/Browse Logs

  • Various types of data
  • Complex relationship among data objects

– Hierarchical relationship
– Sequential relationship

Data objects: Users, Queries, Srch/Ads clicks, Follow-up clicks, Sessions, Search result pages

slide-12
SLIDE 12

Rich Log Mining Applications

Log Mining Applications: Search Applications, Ads Applications

Categories: Query Understanding, Document Understanding, User Understanding, Query-Doc Matching

Example applications:

  • Query Suggestion
  • Query Expansion
  • Query Substitution
  • Query Classification
  • Document Annotation
  • Document Classification
  • Document Summarization
  • Personalized Search
  • Search UI Design
  • User Satisfaction Prediction
  • Document & Ad (re-)Ranking
  • Search Results Clustering
  • Search Results Diversification
  • Web Site Recommendation
  • Keyword Generation
  • Behavior Targeting
  • Ad Click-Through Prediction
  • Contextual Advertising

12

slide-13
SLIDE 13

The Problem

Log Data: Search Log, Toolbar Data, Ads. Log, Web Site Log

Apps: Query Understanding, Document Understanding, User Understanding, Query-Doc Matching

A huge gap between the data and the applications

13

slide-14
SLIDE 14

The Problem

Log Data: Search Log, Toolbar Data, Ads. Log, Web Site Log

Apps: Query Understanding, Document Understanding, User Understanding, Query-Doc Matching

Each researcher or developer

  1. Has to access the raw log data directly
  2. Has to build the application from scratch

Very difficult to build large-scale log mining applications

14

slide-15
SLIDE 15

Log Data Mining Platform

  • App Level: Query Understanding, Document Understanding, Query Doc Matching, User Understanding
  • Middle Level: Log Objects Gallery (LOGAL), Data Platform
  • Raw Data Level: Raw Logs (Search Log, Toolbar Data, Ads. Log, Web Site Log)

15

slide-16
SLIDE 16

Query Histogram

Example applications:

  • Query auto completion
  • Query suggestion
  • Query analysis: temporal changes of query frequency

Query             Count
facebook          3,157 K
google            1,796 K
youtube           1,162 K
myspace             702 K
facebook com        665 K
yahoo               658 K
yahoo mail          486 K
yahoo com           486 K
ebay                486 K
facebook login      445 K
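As an illustration of how a query histogram can drive query auto completion, here is a minimal sketch. It assumes the log is available as a plain list of query strings; the function names `build_query_histogram` and `autocomplete` are hypothetical and not part of LOGAL. The counts (in thousands) are the ones listed on this slide.

```python
from collections import Counter

def build_query_histogram(logged_queries):
    """Count how often each normalized (trimmed, lower-cased) query occurs."""
    return Counter(q.strip().lower() for q in logged_queries if q.strip())

def autocomplete(histogram, prefix, k=3):
    """Return the k most frequent queries that start with the given prefix."""
    matches = [(q, c) for q, c in histogram.items() if q.startswith(prefix)]
    # Sort by descending count, then alphabetically to break ties.
    matches.sort(key=lambda qc: (-qc[1], qc[0]))
    return [q for q, _ in matches[:k]]

# Counts in thousands, copied from the slide's query/count list.
histogram = Counter({
    "facebook": 3157, "google": 1796, "youtube": 1162, "myspace": 702,
    "facebook com": 665, "yahoo": 658, "yahoo mail": 486,
    "yahoo com": 486, "ebay": 486, "facebook login": 445,
})

print(autocomplete(histogram, "fa"))
print(autocomplete(histogram, "yahoo"))
```

A production system would more likely use a trie or a sorted index than a linear scan; the scan is kept here only to make the idea visible.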

slide-17
SLIDE 17

Click-through Bipartite

  • Example applications

– Document (re-)ranking
– Search results clustering
– Web page summarization
– Query suggestion

[Figure: click-through bipartite]
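A click-through bipartite can be sketched as two adjacency maps, one from queries to clicked URLs and one from URLs back to queries. The example below is only an illustration: the record format `(query, clicked_url)`, the function names, and the example URLs are assumptions, not details taken from the talk.

```python
from collections import defaultdict

def build_click_bipartite(click_records):
    """Aggregate (query, clicked_url) records into a weighted bipartite graph."""
    query_to_urls = defaultdict(lambda: defaultdict(int))
    url_to_queries = defaultdict(lambda: defaultdict(int))
    for query, url in click_records:
        query_to_urls[query][url] += 1
        url_to_queries[url][query] += 1
    return query_to_urls, url_to_queries

def co_clicked_queries(query, query_to_urls, url_to_queries):
    """Score other queries by the clicks they share on this query's URLs."""
    scores = defaultdict(int)
    for url in query_to_urls.get(query, {}):
        for other, clicks in url_to_queries[url].items():
            if other != query:
                scores[other] += clicks
    return dict(scores)

# Hypothetical click records; the URLs are placeholders.
click_records = [
    ("utube", "video.example.com"),
    ("youtube", "video.example.com"),
    ("youtube", "video.example.com"),
    ("youtube", "wiki.example.org/youtube"),
    ("facebook login", "social.example.com"),
]
q2u, u2q = build_click_bipartite(click_records)
print(co_clicked_queries("utube", q2u, u2q))
```

Queries that share clicked URLs are natural candidates for query suggestion, which is one of the applications listed on this slide.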

slide-18
SLIDE 18

Click Pattern

[Figure: click patterns for a query. Each pattern marks which of the returned documents Doc 1 … Doc N were clicked and carries a count: Pattern 1 (count), Pattern 2 (count), …, Pattern n (count)]

  • Example applications

– Estimate relevance of document to query
– Predict users’ satisfaction
– Query classification (informational vs navigational)
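The click pattern object can be sketched as a per-query counter over sets of clicked result positions. The following is a rough illustration under assumptions: the input format `(query, clicked_positions)`, the function names, and the 0.8 threshold in the navigational heuristic are all hypothetical and are not described in the talk.

```python
from collections import Counter, defaultdict

def count_click_patterns(impressions):
    """Count, per query, how often each set of clicked positions occurs."""
    patterns = defaultdict(Counter)
    for query, clicked_positions in impressions:
        pattern = tuple(sorted(set(clicked_positions)))
        patterns[query][pattern] += 1
    return patterns

def looks_navigational(pattern_counts, threshold=0.8):
    """Heuristic (assumption): most impressions click only the top result."""
    total = sum(pattern_counts.values())
    top1_only = pattern_counts.get((1,), 0)
    return total > 0 and top1_only / total >= threshold

# Hypothetical impressions: (query, positions of clicked results).
impressions = (
    [("facebook", [1])] * 9
    + [("facebook", [1, 3])]
    + [("harry potter", [1, 2]), ("harry potter", [3]), ("harry potter", [])]
)
patterns = count_click_patterns(impressions)
print(looks_navigational(patterns["facebook"]))
print(looks_navigational(patterns["harry potter"]))
```

The same counts could feed relevance estimation or satisfaction prediction; the navigational check is just the simplest of the listed applications to show in a few lines.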

slide-19
SLIDE 19

Session Pattern

  • Example applications

– Doc (re-)ranking
– Query suggestion
– Site recommendation
– User satisfaction prediction

User activities in a session: Query, Click (Srch click, Ads click, …), Browse

Srch click: search click
Ads click: advertisement click

slide-20
SLIDE 20

PROJECT: SEMANTIC MATCHING

20

Joint work with Gu Xu, Jun Xu, Jingfang Xu

slide-21
SLIDE 21

Semantic Matching

Improving Tail Query Relevance

21

slide-22
SLIDE 22

Different Queries Can Represent Same Intent “Distance between Sun and Earth” - Luke DeLorme

  • distance from earth to the sun
  • distance from sun to earth
  • distance from sun to the earth
  • distance from the earth to the sun
  • distance from the sun to earth
  • distance from the sun to the earth
  • distance of earth from sun
  • distance of earth from the sun
  • distance of earth to sun
  • distance of earth to the sun
  • distance of sun from earth
  • distance of sun from the earth
  • distance of sun to earth
  • distance of the earth from the sun
  • distance of the earth to the sun
  • distance of the sun from earth
  • distance of the sun from the earth
  • distance of the sun to earth
  • distance of the sun to the earth
  • distance sun
  • distance sun and earth
  • distance sun earth
  • distance sun from earth
  • distance sun to earth
  • distance to sun from earth
  • distance to the sun from earth
  • earth and sun distance
  • "how far" earth sun
  • "how far" sun
  • "how far" sun earth
  • average distance earth sun
  • average distance from earth to sun
  • average distance from the earth to the sun
  • distance between earth & sun
  • distance between earth and sun
  • distance between earth and the sun
  • distance between earth sun
  • distance between sun and earth
  • distance between the earth and sun
  • distance between the earth and the sun
  • distance between the sun and earth
  • distance between the sun and the earth
  • distance earth and sun
  • distance earth from sun
  • distance earth is from the sun
  • distance earth sun
  • distance earth to sun
  • distance earth to the sun
  • distance from earth to sun
  • distance from earth to the sun
  • distance from sun to earth
  • distance from sun to the earth
  • distance from the earth to the sun
  • distance from the sun to earth
  • how far away is the sun from earth
  • how far away is the sun from the earth

  • how far earth from sun
  • how far earth is from the sun
  • how far earth sun
  • how far from earth is the sun
  • how far from earth to sun
  • how far from the earth to the sun
  • how far from the sun is earth
  • how far from the sun is the earth
  • how far is earth away from the sun
  • how far is earth from sun
  • how far is earth from the sun
  • how far is earth to the sun
  • how far is it from earth to the sun
  • how far is it from the earth to the sun
  • how far is sun from earth
  • how far is the earth away from the sun

  • how far is the earth from sun
  • how far is the earth from the sun
  • how far is the earth to the sun
  • how far is the sun
  • how far is the sun away from earth
  • how far is the sun away from the earth


slide-23
SLIDE 23

Different Levels of Semantic Matching

Level of Semantics: Term, Word Sense, Topic, Structure

  • Term: match exactly same terms

– NY, New York
– disk, disc

  • Word Sense: match terms with same meanings

– NY, New York
– motherboard, mainboard
– utube, youtube

  • Topic: match topics of query and documents

– Microsoft Office → Topic: PC Software
– … working for Microsoft … my office is in … → Topic: Personal Homepage

  • Structure: match intent with answers (structures of query and document)

– Microsoft Office home → find homepage of Microsoft Office
– 21 movie → find movie named 21
– buy laptop less than 1000 → find online dealers to buy laptop with less than 1000 dollars

23

slide-24
SLIDE 24

Semantic Matching Is Useful for

  • General Search Relevance
  • Vertical Search
  • Entity Search
  • Task Completion

24

slide-25
SLIDE 25

SYSTEM VIEW OF SEMANTIC MATCHING

25

slide-26
SLIDE 26

Overall System

[Figure: system diagram with an online part and an offline part. Components: Search Log Data, Web Data, Offline Query Processing, Offline Document Processing, Query Knowledge, Query Index, Document Index, Document Representations, Query, Online Query Processing, Query Representation, Semantic Matching, Ranked Documents]

26

slide-27
SLIDE 27

Online Query Processing

Input query: michael jordan berkele

  • Query Refinement

– michael jordan berkeley

  • Similar Query Finding (Sense)

– michael jordan berkeley
– michael I. jordan berkeley

  • Query Topic Identification (Topic)

– michael jordan berkeley: academic
– michael I. jordan berkeley: academic

  • Named Entity Recognition in Query (Structure)

– [michael jordan: PersonName] [berkeley: Location]: academic
– [michael I. jordan: PersonName] [berkeley: Location]: academic

27

slide-28
SLIDE 28

Offline Document Processing

Input document: Michael Jordan is Professor in the Department of Electrical Engineering

  • Tokenization

– [Michael Jordan] is [Professor] in the [Department] of [Electrical Engineering]

  • Key Concept Identification (Sense)

– [Michael Jordan/M. Jordan] is [Professor] in the [Department/Dept.] of [Electrical Engineering/EE]

  • Document Topic Identification (Topic)

– [Michael Jordan/M. Jordan] is [Professor] in the [Department/Dept.] of [Electrical Engineering/EE]: academic

  • Named Entity Recognition in Doc. (Structure)

– [Michael Jordan/M. Jordan: PersonName] is [Professor] in the [Department/Dept.] of [Electrical Engineering/EE]: academic

28

slide-29
SLIDE 29

Online Semantic Matching

  • Query Representation

– [michael jordan: PersonName] [berkeley: Location]: academic
– [michael I. jordan: PersonName] [berkeley: Location]: academic

  • Document Representation

– [Michael Jordan/M. Jordan: PersonName] is [Professor] in the [Department/Dept.] of [Electrical Engineering/EE]: academic

  • Semantic Matching

– Matching can be conducted at different levels

  • Ranking Results

29

slide-30
SLIDE 30

QUERY REFINEMENT USING CRF MODEL

30

slide-31
SLIDE 31

Correcting Errors in Query

window onecar → Query Refiner → windows onecare → Search System (search “windows onecare”)

31

slide-32
SLIDE 32

Structured Prediction Problem

  • Observed “noisy” word sequence (original query word sequence): window onecar
  • “Ideal” word sequence (“ideal” query word sequence): windows onecare

32

slide-33
SLIDE 33

Conditional Random Fields for Query Refinement

Introducing Refinement Operations

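[Figure: CRF over the observed query words x_{i-1}, x_i, x_{i+1}, the refined words y_{i-1}, y_i, y_{i+1}, and the refinement operations o_{i-1}, o_i, o_{i+1}]

Operations

  • Spelling: insertion, deletion, substitution, transposition, …
  • Word Stemming: +s/-s, +es/-es, +ed/-ed, +ing/-ing, …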
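To make the refinement operations concrete, the sketch below enumerates the candidate words that each operation produces for one observed word and keeps those found in a vocabulary. It covers only candidate generation: the CRF that scores operations jointly over the whole query is not implemented here, and the vocabulary filter and function names are assumptions made for illustration.

```python
import string

def spelling_candidates(word, alphabet=string.ascii_lowercase):
    """Strings one insertion, deletion, substitution or transposition away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    return {
        "deletion": {a + b[1:] for a, b in splits if b},
        "transposition": {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1},
        "substitution": {a + c + b[1:] for a, b in splits if b for c in alphabet},
        "insertion": {a + c + b for a, b in splits for c in alphabet},
    }

def stemming_candidates(word):
    """Add or strip the suffixes listed on the slide (+s/-s, +es/-es, ...)."""
    ops = {}
    for suffix in ("s", "es", "ed", "ing"):
        ops["+" + suffix] = {word + suffix}
        if word.endswith(suffix) and len(word) > len(suffix):
            ops["-" + suffix] = {word[: -len(suffix)]}
    return ops

def refine_word(word, vocabulary):
    """Map each in-vocabulary candidate to the first operation that yields it."""
    found = {}
    # Stemming operations are checked first, so "+s" wins over inserting "s".
    operations = {**stemming_candidates(word), **spelling_candidates(word)}
    for op, candidates in operations.items():
        for candidate in candidates:
            if candidate in vocabulary and candidate != word:
                found.setdefault(candidate, op)
    return found

vocabulary = {"window", "windows", "onecare"}  # hypothetical vocabulary
print(refine_word("window", vocabulary))
print(refine_word("onecar", vocabulary))
```

In the setting of the slide, the operation labels play the role of the o_i variables and the candidates the role of y_i; choosing among them is the job of the model, not of this enumeration.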

33

slide-34
SLIDE 34

Query Refinement Using Conditional Random Fields

34

slide-35
SLIDE 35

NAMED ENTITY MINING FROM QUERY LOG USING TOPIC MODEL

35

slide-36
SLIDE 36

Named Entity Recognition in Query

  • harry potter

– harry potter – Movie (0.5)
– harry potter – Book (0.4)
– harry potter – Game (0.1)

  • harry potter film

– harry potter – Movie (0.95)

  • harry potter author

– harry potter – Book (0.95)

36

slide-37
SLIDE 37

Our Approach

  • Using Query Log Data (or Click-through Data)
  • Using Topic Model
  • Weakly Supervised Latent Dirichlet Allocation
  • vs Pasca’s work (named entity mining from log data, deterministic approach)

37

slide-38
SLIDE 38

Seed and Query Log

Seed

Named entity          Classes
final fantasy         Movie, Game
gone with the wind    Movie, Book
harry potter          Movie, Book, Game

Query Log

Query                          Count
final fantasy                    300
final fantasy movie              120
final fantasy wallpaper           50
gone with the wind movie         120
gone with the wind review         10
gone with the wind photos         10
harry potter                    1000
harry potter book                650
gone with the wind book           80
gone with the wind summary        20
harry potter cheats              300
harry potter pics                200
harry potter summary             100
final fantasy xbox                10
final fantasy soundtrack          10
gone with the wind               250
harry potter movie               800
……

38

slide-39
SLIDE 39

Pseudo Documents of Named Entities

Named entity          Classes              Pseudo document (context and count)
harry potter          Movie, Book, Game    # 1000, # movie 800, # book 650, # cheats 300, # pics 200, # summary 100
gone with the wind    Movie, Book          # 250, # movie 120, # book 80, # summary 20, # review 10, # photos 10
final fantasy         Movie, Game          # 300, # movie 120, # wallpaper 50, # xbox 10, # soundtrack 10
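The pseudo documents can be derived mechanically from the seed entities and the query log: every query containing a seed entity contributes its context, with the entity replaced by `#`, weighted by the query count. The sketch below assumes whole-word matching on space-separated queries; the function name `build_pseudo_documents` is hypothetical.

```python
from collections import defaultdict

def build_pseudo_documents(seeds, query_log):
    """Collect, per seed entity, the contexts it appears in and their counts."""
    docs = {entity: defaultdict(int) for entity in seeds}
    for query, count in query_log:
        padded = f" {query} "
        for entity in seeds:
            needle = f" {entity} "
            if needle in padded:
                # Replace the entity by "#" to obtain the context pattern.
                context = padded.replace(needle, " # ", 1).strip()
                docs[entity][context] += count
    return {entity: dict(contexts) for entity, contexts in docs.items()}

# A subset of the seed and query log shown on the previous slide.
seeds = ["final fantasy", "gone with the wind", "harry potter"]
query_log = [
    ("harry potter", 1000), ("harry potter movie", 800),
    ("harry potter book", 650), ("harry potter cheats", 300),
    ("final fantasy", 300), ("final fantasy movie", 120),
    ("gone with the wind", 250), ("gone with the wind book", 80),
]
pseudo_docs = build_pseudo_documents(seeds, query_log)
print(pseudo_docs["harry potter"])
```

Each pseudo document then plays the role of a document in the topic model, with contexts as its words.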

39

slide-40
SLIDE 40

Latent Dirichlet Allocation Model

  • z (class): Movie, Book, Game
  • w (context): #, # movie, # book, …
  • Parameters: distribution of classes for named entity; distribution of contexts for class

40

slide-41
SLIDE 41

Weakly Supervised Latent Dirichlet Allocation

$\log p(D \mid \Theta) + \lambda\, C(y, \Theta)$

[Figure: bar chart (axis ticks 0.2 to 0.8) over the classes Movie, Book, Game, Music for harry potter, final fantasy, and gone with the wind; labeled “constraints”]

41

slide-42
SLIDE 42

PROJECT: CONTEXT AWARE SEARCH

42

Joint work with Daxin Jiang, Jian Pei, and others

slide-43
SLIDE 43

Context aware Search:

Better Search Using Context Information

43

slide-44
SLIDE 44

Search in Office

  • Job related

44

slide-45
SLIDE 45

Search at Home

  • Household related
  • Hobby and leisure

45

slide-46
SLIDE 46

Search in Mobile Context

  • Location
  • Time
  • Activity

46

slide-47
SLIDE 47

Search in Social Context

  • Community

47

slide-48
SLIDE 48

Conventional Web Search

48

slide-49
SLIDE 49

Search Intent and Context

  • Suppose that a user raises the query “jaguar”
  • If we know that the user raised the query “BMW” before “jaguar”
  • Then we know that the user is likely looking for the car

49

slide-50
SLIDE 50

Context of Search

  • A user usually conducts multiple related searches in a session

[Figure: a user's session as a sequence of queries and clicks; the earlier queries and clicks are the context of the current search]

50

slide-51
SLIDE 51

Context Information is Useful

  • Example of search sessions

SID   Search sessions
S1    Ford → Toyota → GMC → Allstate                   www.autohome.com
S2    Ford cars → Toyota cars → GMC cars → Allstate    www.autohome.com
S3    Ford cars → Toyota cars → Allstate               www.allstate.com
S4    GMC → GMC dealers                                www.gmc.com

51

slide-52
SLIDE 52

Context Information is Useful

  • 50% of users clicked car review site www.autohome.com after searching several car names.

52

slide-53
SLIDE 53

Challenges in Context aware Search

  • How to model context?
  • How to learn context model from data?
  • How to apply context model in search?

53

slide-54
SLIDE 54

Our Approach

  • Three Models

– Sequential Model (Cao et al. KDD 2008)
– Hidden Markov Model (Cao et al. WWW 2009)
– Conditional Random Fields Model (Cao et al. SIGIR 2009)

  • Large Scale Data Mining to Construct Models
  • Using Learned Models to Make Predictions (Context aware Search)

54

slide-55
SLIDE 55

Modeling Context by Sequential Model

[Figure: sequential model over the session variables q1, u1, c1, …, qi, ui, ci, …, qt, ut, ct]
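The idea of using preceding queries as context can be illustrated with a simple count-based model: for every short run of consecutive queries in a session, count which query follows it, then suggest the most frequent followers of the user's recent queries. This is a deliberately simplified stand-in and not the model from the cited paper; the function names and the `max_context` parameter are assumptions. The query sequences are adapted from the example sessions S1 to S4.

```python
from collections import Counter, defaultdict

def train_next_query_counts(sessions, max_context=2):
    """Count which query follows each context of up to max_context queries."""
    counts = defaultdict(Counter)
    for queries in sessions:
        for i in range(1, len(queries)):
            for k in range(1, max_context + 1):
                if i - k < 0:
                    break
                context = tuple(queries[i - k:i])
                counts[context][queries[i]] += 1
    return counts

def suggest(counts, recent_queries, max_context=2):
    """Back off from the longest known context to shorter ones."""
    for k in range(min(max_context, len(recent_queries)), 0, -1):
        context = tuple(recent_queries[-k:])
        if context in counts:
            return counts[context].most_common()
    return []

sessions = [
    ["Ford", "Toyota", "GMC", "Allstate"],
    ["Ford cars", "Toyota cars", "GMC cars", "Allstate"],
    ["Ford cars", "Toyota cars", "Allstate"],
    ["GMC", "GMC dealers"],
]
counts = train_next_query_counts(sessions)
print(suggest(counts, ["GMC"]))            # ambiguous without context
print(suggest(counts, ["Toyota", "GMC"]))  # context narrows the suggestion
```

With only “GMC” as input both “Allstate” and “GMC dealers” are candidates; once the preceding query “Toyota” is known, only “Allstate” remains, which is the effect the context is meant to have.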

55

slide-56
SLIDE 56

Modeling Context by CRF

[Figure: CRF over the session variables q1, u1, c1, …, qi, ui, ci, …, qt, ut, ct]

56

slide-57
SLIDE 57

Modeling Context by HMM

[Figure: HMM over the session variables q1, u1, c1, …, qi, ui, ci, …, qt, ut, ct]

57

slide-58
SLIDE 58

TRAINING HIDDEN MARKOV MODEL

58

slide-59
SLIDE 59

Training Very Large HMM

  • Challenge 1:

– The EM algorithm needs a predetermined number of hidden states.
– However, in our problem, hidden states correspond to search intents, whose number is unknown.

  • Our Solution:

– Conduct clustering on the click-through bipartite graph and view the clusters as hidden states.
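One simple way to obtain clusters from a click-through bipartite is to group queries that are connected through shared clicked URLs. The sketch below uses connected components via union-find as a stand-in; the talk does not specify the clustering algorithm, and on real logs a similarity threshold would be needed because components merge quickly. The query-URL pairs are hypothetical.

```python
from collections import defaultdict

def cluster_queries_by_clicks(click_records):
    """Group queries linked through shared clicked URLs (connected components)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_query_for_url = {}
    for query, url in click_records:
        find(query)  # register the query even if it shares no URL
        if url in first_query_for_url:
            union(first_query_for_url[url], query)
        else:
            first_query_for_url[url] = query

    clusters = defaultdict(set)
    for query in list(parent):
        clusters[find(query)].add(query)
    return sorted(sorted(members) for members in clusters.values())

# Hypothetical (query, clicked URL) pairs.
click_records = [
    ("Ford", "www.autohome.com"),
    ("Toyota", "www.autohome.com"),
    ("GMC", "www.gmc.com"),
    ("GMC dealers", "www.gmc.com"),
    ("Allstate", "www.allstate.com"),
]
hidden_states = cluster_queries_by_clicks(click_records)
print(hidden_states)
```

Each resulting cluster would then be treated as one hidden state, which fixes the number of states before EM starts.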

59

slide-60
SLIDE 60

Training Very Large HMM

  • Challenge 2:

– Search log data contains hundreds of millions of sessions.
– It is impractical to train an HMM from such huge training data on a single machine.

  • Our Solution:

– Deploy the learning task on a distributed system under the map-reduce model
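The map-reduce idea can be sketched on a single machine: each shard of sessions is mapped to partial counts, and the partial counts are summed in the reduce step before normalization. The example below is a simplification under stated assumptions: it uses a fixed, hypothetical query-to-state assignment and hard transition counts instead of the expected counts of a real EM iteration, and it runs in one process rather than on a cluster.

```python
from collections import Counter
from functools import reduce

def map_shard(sessions, query_to_state):
    """Map step: count state-to-state transitions within one shard."""
    counts = Counter()
    for queries in sessions:
        states = [query_to_state[q] for q in queries if q in query_to_state]
        counts.update(zip(states, states[1:]))
    return counts

def reduce_counts(left, right):
    """Reduce step: sum partial counts from two shards."""
    return left + right

def normalize(counts):
    """Turn summed counts into transition probabilities per source state."""
    totals = Counter()
    for (src, _), c in counts.items():
        totals[src] += c
    return {(src, dst): c / totals[src] for (src, dst), c in counts.items()}

# Hypothetical state assignment and two shards of sessions.
query_to_state = {"Ford": "cars", "Toyota": "cars", "GMC": "cars",
                  "Allstate": "insurance"}
shards = [
    [["Ford", "Toyota", "GMC", "Allstate"]],
    [["Ford", "Toyota", "Allstate"]],
]
partials = [map_shard(shard, query_to_state) for shard in shards]
total = reduce(reduce_counts, partials, Counter())
transitions = normalize(total)
print(transitions)
```

Because the partial counts are additive, shards can be processed independently, which is the property that lets the training be spread over many machines.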

60

slide-61
SLIDE 61

Training Very Large HMM

  • Challenge 3:

– Each machine needs to hold the values of all parameters.
– Since search log data contains millions of unique queries and URLs, the space of parameters is extremely large.

  • Our Solution:

– Employ a special initialization strategy based on the clusters mined from the click-through bipartite

61

slide-62
SLIDE 62

SUMMARY

62

slide-63
SLIDE 63

Summary

  • Web search relies on NLP and IR
  • Query understanding = identify user search intent
  • Query understanding needs

– Large scale mining platform
– Advanced NLP and IR technologies
– Advanced statistical learning technologies

  • Our Projects

– LOGAL: search and browse log mining platform
– Semantic Matching: improving tail query relevance
– Context aware Search: better search using context information

  • NLP Challenges and Technologies in Information Explosion Era

63

slide-64
SLIDE 64

THANKS!

64