
SLIDE 1

Of Search and Semantics

Patrick Pantel

NSF Symposium on Semantic Knowledge Discovery, Organization and Use November 15, 2008


SLIDE 2

Vannevar Bush proposes to build a body of knowledge for all mankind: Memex

Bush: "technical difficulties of all sorts have been ignored."

Memex, in the form of a desk, would instantly bring files and material on any subject to the operator’s fingertips.

Semantics captured by an associative trail, personal comments and side trails

Gerard Salton, father of modern search: introduces concepts such as vector space model, Inverse Document Frequency (IDF), Term Frequency (TF), and relevancy feedback

1990: the first search engine, Archie, from McGill, matches keyword queries against a database of Web filenames

SLIDE 3

Yahoo! is founded around a taxonomical organization of the Web… manually

Key innovator: Advanced search features + first SE to allow natural language queries

Key innovator: Semantic models for ranking paid search results

Semantics and Query Logs: Technologies developed/applied during the DARPA Tipster program and TREC make it commercial: spelling correction, query reformulations, also try, …

SLIDE 4

Future of search lies in a deep understanding and matching of information request behind user queries

Natural language questions answered by editors

Semantic repositories and user-annotated content grow rapidly

Semantic search engines emerge

SLIDE 5


SLIDE 6


Search Assist Technology

SLIDE 7


Smart Snippets

Tapas Kanungo et al.


SLIDE 8

Aggregate star rating
Example review as summary
Current prices at merchant sites
Images, maps, specs, …

Opportunities:
– Marriage of information extraction, content analysis, and query intent modeling
– User experience design

Key Technologies
– IE: Entity detection and salience; attribute detection
– CA: Text classification, aboutness, information fusion
– QIM: Entity detection, intent/task understanding

SLIDE 9

Task Modeling

Ana-Maria Popescu


Task Modeling

What is the user trying to do?

– Find product reviews of the Nikon D300?
– Buy a Nikon D300?
– Find support for her camera?

UED: enriching the search experience

– How can we make use of this knowledge?

Key Technologies

– Entity detection
– Document classification
– Intent modeling / detection

SLIDE 10

Is battery life a synonym of image quality?

Research Problem: Intent Synonymy

How can we automatically discover intent synonyms?

Review Intent

Transactional Intent

Intent Synonymy

Key Assumption: one intent per session
– A searcher's intent remains the same within a search session

Intent Discovery and Chaining via Clustering
– Methodology very similar to constructing dictionary of distributionally similar words

Intent Synonyms

Review Price Support
best thin where can I get cheap common problems
rate black Friday sale fault codes
compare kodak vs. canon christmas sales installing new
small portable cheap easy use high zoom buy now pay later operating instructions
rate best dell discount won't heat
battery life great deals best schematics
user comments overstock.com keeps shutting
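The one-intent-per-session assumption can be illustrated with a small sketch. It treats each query phrase as a "word" whose context is the set of other phrases seen in the same session, then ranks phrases by cosine similarity of those contexts. This is not the method from the talk, only a generic distributional-similarity sketch. The function names (`cooccurrence_vectors`, `intent_synonyms`) and the toy sessions are hypothetical; the sessions are invented for illustration and are not real query logs.

```python
from collections import Counter, defaultdict
from itertools import permutations
from math import sqrt


def cooccurrence_vectors(sessions):
    """Map each phrase to counts of the other phrases seen in the same session."""
    vectors = defaultdict(Counter)
    for session in sessions:
        for a, b in permutations(set(session), 2):
            vectors[a][b] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def intent_synonyms(sessions, phrase, top_k=3):
    """Rank other phrases by how similar their session contexts are to `phrase`."""
    vectors = cooccurrence_vectors(sessions)
    target = vectors.get(phrase, Counter())
    scores = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != phrase]
    return sorted(scores, key=lambda s: (-s[1], s[0]))[:top_k]


# Invented toy sessions (assumption): three review-like, two price-like.
sessions = [
    ["battery life", "user comments", "rate best"],
    ["battery life", "rate best", "compare"],
    ["user comments", "compare", "rate best"],
    ["buy now pay later", "great deals", "dell discount"],
    ["great deals", "dell discount", "black friday sale"],
]

for other, score in intent_synonyms(sessions, "battery life"):
    print(f"{other}: {score:.2f}")
```

In this toy data the review-like phrases rank highest for "battery life", while the price-like phrases share no session context with it and score zero.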

SLIDE 11

Road Ahead Key Applications Enabling Technologies

Entity Detection
Attribute Mining
Intent Synonymy
Distributional Similarity
Entailment
Anaphora Resolution
Content Analysis
Text Classification

SLIDE 12

Dynamic Similarity Modeling

Eric Crestan and Vishnu Vyas


SeeLEx: Seed List Expansion

SeeLEx, developed at Yahoo!, is a weakly supervised system for mining entity lists based on distributional similarity

Jaguar, Honda, Peugeot, Mercedes, Ford
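A minimal sketch of seed list expansion by distributional similarity follows. It does not reproduce SeeLEx itself, whose internals are not described here. It assumes a common generic approach: sum the context vectors of the seeds into a centroid and rank every other term by cosine similarity to it. The function `expand` and the hand-made context counts are hypothetical and purely illustrative.

```python
from collections import Counter
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def expand(seeds, context_vectors, top_k=5):
    """Rank non-seed terms by similarity to the centroid of the seed vectors."""
    seed_set = set(seeds)
    centroid = Counter()
    for seed in seed_set:
        centroid.update(context_vectors.get(seed, {}))  # adds counts
    candidates = [(term, cosine(centroid, vec))
                  for term, vec in context_vectors.items() if term not in seed_set]
    return sorted(candidates, key=lambda c: (-c[1], c[0]))[:top_k]


# Hand-made context counts (assumption); a real system would derive them from a corpus.
context_vectors = {
    "Jaguar": {"drive": 3, "dealer": 2, "hunt": 2},
    "Honda": {"drive": 4, "dealer": 3},
    "Ford": {"drive": 3, "dealer": 2, "president": 2},
    "Toyota": {"drive": 5, "dealer": 3},
    "puma": {"hunt": 4, "prey": 3},
    "Carter": {"president": 5, "elected": 2},
    "banana": {"eat": 4, "peel": 2},
}

for term, score in expand(["Jaguar", "Honda", "Ford"], context_vectors):
    print(f"{term}: {score:.3f}")
```

In this toy data the ambiguous seeds "Jaguar" and "Ford" pull in "puma" and "Carter" with small but non-zero scores, which is one way off-topic terms can enter an expansion.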

SLIDE 13

SeeLEx: Seed List Expansion

Nissan, Opel, cheetah, lion, Jaguar, Peugeot, Honda, Volkswagen, Porsche, Mazda, Fiat, Hyundai, Lexus, Toyota, Clinton, Mercedes, Ford, Renault, Saab, Subaru, puma, caracal, Carter, Eisenhower

SeeLEx: Seed List Expansion

Nissan, Opel, cheetah, lion, Jaguar, Peugeot, Honda, Volkswagen, Porsche, Mazda, Fiat, Hyundai, Lexus, Toyota, Mercedes, Ford, Renault, Saab, Subaru, puma, caracal

What is a caracal? endangered, carnivore, fast, hunted, …

SLIDE 14

Questions Asked and Conclusions

What is the effect of corpus size on expansion accuracy?
– Significant boost in performance

What is the effect of corpus quality on expansion accuracy?
– Significant boost in performance

Does seed quality matter?
– Great variance in performance depending on seed set composition

How many seeds are on average optimal for expansion accuracy, and are more seeds better than fewer?
– Somewhat surprising: 5-20 seeds in general are sufficient; more seeds gain little but don't hurt

Gold Standard Sets

50 sets extracted from Wikipedia (2007/12)

Average of 208 instances; minimum of 11; maximum of 1116; total of 10,377 instances

Archbishops of Canterbury; Astronomers; Australian Airlines; Australian A-League Football Teams; Australian Cities; Australian Prime Ministers; Best Actress Academy Award Winners; Biology Disciplines; Bottled Water Brands; Boxing Weight Classes; California Counties; Canadian Stadiums; Canadian Universities; Charitable Foundations; Classical Pianists; Cocktails; Cognitive Scientists; Composers; Countries; Electronic Companies; Elements; English Cities; English Poets; English Premier Football Clubs; First Ladies; Formula One Drivers; French Artists; Greek Gods; Greek Islands; Irish Theatres; Italian Regions; Japanese Martial Arts; Japanese Prefectures; Male Tennis Players; Maryland Counties; New Zealand Songwriters; NHL Hockey Teams; North American Mountain Ranges; Presidents of Argentina; Rivers in England; Roman Emperors; Russian Authors; Spanish Provinces; Stars; Superheroes; Texas Counties; U.S. Army Generals; U.S. Federal Departments; U.S. Internet Companies

SLIDE 15

Corpora

Table: Corpora used to build our expansion models.

CORPORA     UNIQUE SENTENCES (MILLIONS)   TOKENS (MILLIONS)   UNIQUE WORDS (MILLIONS)
k100        5,201                         217,940             542
k020†       1,040                         43,588              108
k004†       208                           8,717               22
Wikipedia   30                            721                 34

†Estimated from k100 statistics.

The Effect of: Corpus Size and Corpus Quality


SLIDE 16

System and Corpora Analysis (Precision vs. Recall)

[Figure: precision-recall curves for full.k100, full.k020, full.k004 and full.wikipedia; x-axis: Precision, y-axis: Recall]

Takeaway: Corpus Size Matters
– k100 yields 13% higher R-precision than a corpus 1/5th its size
– k100 yields 53% higher R-precision than a corpus 1/25th its size

Takeaway: Corpus Quality Matters
– Wikipedia yields similar performance to a web corpus 60 times its size

Opportunity: Model a more natural SeeLEx usage
– What is the performance of the system on typical sets searched by editors?
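For readers unfamiliar with the R-precision figures quoted on this slide, here is a minimal sketch of the standard definition: precision over the top R ranked items, where R is the number of gold-standard instances. The function name `r_precision` and the toy ranking are assumptions for illustration, and the sketch assumes the ranked list contains no duplicates.

```python
def r_precision(ranked, gold):
    """Precision over the top R ranked items, where R = number of gold instances."""
    gold = set(gold)
    r = len(gold)
    if r == 0:
        return 0.0
    hits = sum(1 for item in ranked[:r] if item in gold)
    return hits / r


# Toy example (assumption): 4 gold instances, so only the top 4 results are scored.
gold = {"Honda", "Toyota", "Ford", "Nissan"}
ranked = ["Honda", "puma", "Toyota", "Ford", "Nissan"]
print(r_precision(ranked, gold))  # 3 of the top 4 are correct: 0.75
```

Because the cutoff adapts to the size of each gold set, the measure can be averaged across sets of very different sizes.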

k100: List Effect (Precision vs. Recall)

[Figure: precision vs. recall on k100; x-axis: Precision, y-axis: Recall]

Opportunity: Bucket sets to find predictable behaviors
– Our gold standard sets vary significantly in their expansion performance
– Look into predictability of open vs. closed class sets, large vs. small class sets, and types of sets such as locations, people, …

SLIDE 17

The Effect of: Seed Selection

k100: Seed Selection Effect (Precision vs. Recall)

[Figure: precision-recall curves for full.k100, s010, and s010.t01 through s010.t30; x-axis: Precision, y-axis: Recall]

Takeaway: Seed set composition greatly affects performance
– Best performing seed set had 42% higher precision and 39% higher recall than the worst performing seed set

Opportunity: Reject seeds and/or ask for more
– We are investigating ways to automatically detect which seed elements are better than others in order to reduce the impact of seed selection effect

SLIDE 18

The Effect of: Seed Size

Rate of New Correct Expansions vs. Seed Size

[Figure: rate of new correct expansions (y-axis) plotted against seed size (x-axis, 20 to 200)]

Takeaway: Small numbers of seeds are sufficient to fully saturate the distributional similarity model
– After 10-20 seeds, more seeds yield few new correct instances in the expansion
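The slide does not define "rate of new correct expansions". The sketch below assumes one plausible reading: the number of correct instances not found at any smaller seed size, divided by the number of seeds added since the previous measurement. The function `new_correct_rate` and the input numbers are hypothetical and are not results from the talk.

```python
def new_correct_rate(correct_by_seed_size):
    """For each seed size, return (size, new correct instances per added seed)."""
    seen = set()
    previous_size = 0
    rates = []
    for size in sorted(correct_by_seed_size):
        found = set(correct_by_seed_size[size])
        new = found - seen  # correct instances not found at any smaller seed size
        rates.append((size, len(new) / (size - previous_size)))
        seen |= found
        previous_size = size
    return rates


# Invented data (assumption): integer ids stand in for correct expanded instances.
correct_by_seed_size = {
    5: set(range(10)),
    10: set(range(15)),
    20: set(range(17)),
    40: set(range(18)),
}

for size, rate in new_correct_rate(correct_by_seed_size):
    print(size, rate)
```

Under this reading, a rate that falls toward zero as the seed set grows corresponds to the saturation described in the takeaway.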

SLIDE 19

Seed Size vs. % of Errors

[Figure: % of errors (y-axis) plotted against seed size (x-axis, 20 to 200)]

Takeaway: Although bad seeds may be less desirable in a seed set than others, adding them does not seem to degrade performance
– Percentage of errors does not increase with increased seed set size

Questions Asked and Conclusions

What is the effect of corpus size on expansion accuracy?
– Significant boost in performance

What is the effect of corpus quality on expansion accuracy?
– Significant boost in performance

Does seed quality matter?
– Great variance in performance depending on seed set composition

How many seeds are on average optimal for expansion accuracy, and are more seeds better than fewer?
– Somewhat surprising: 5-20 seeds in general are sufficient; more seeds gain little but don't hurt

SLIDE 20
