SLIDE 1

Of Search and Semantics

Patrick Pantel

NSF Symposium on Semantic Knowledge Discovery, Organization and Use, November 15, 2008

SLIDE 2

Vannevar Bush proposes to build a body of knowledge for all mankind: Memex. Bush: "technical difficulties of all sorts have been ignored."

Memex, in the form of a desk, would instantly bring files and material on any subject to the operator's fingertips.

Semantics captured by an associative trail, personal comments and side trails.

Gerard Salton, father of modern search: introduces concepts such as the vector space model, Inverse Document Frequency (IDF), Term Frequency (TF), and relevance feedback.

1990: the first search engine, Archie, from McGill, matches keyword queries against a database of filenames.

SLIDE 3

Yahoo! is founded around a taxonomical organization of the Web… manually.


Key innovator: Advanced search features + first SE to allow natural language queries.

Key innovator: Semantic models for ranking paid search results.


Semantics and Query Logs: Technologies developed/applied during the DARPA Tipster program and TREC go commercial: spelling correction, query reformulations, "also try", …

SLIDE 4

Future of search lies in a deep understanding and matching of the information request behind user queries.

Natural language questions answered by editors.


Semantic repositories and user-annotated content grow rapidly.

Semantic search engines emerge.

SLIDE 6

Search Assist Technology

SLIDE 7

Smart Snippets

Tapas Kanungo et al.

SLIDE 8

  • Aggregate star rating
  • Example review as summary
  • Current prices at merchant sites
  • Images, maps, specs, …

Opportunities:

  • Marriage of information extraction, content analysis, and query intent modeling
  • User experience design

  • Key Technologies

– IE: Entity detection and salience; attribute detection
– CA: Text classification, aboutness, information fusion
– QIM: Entity detection, intent/task understanding

SLIDE 9

Task Modeling

Ana-Maria Popescu


Task Modeling

  • What is the user trying to do?

– Find product reviews of the Nikon D300?
– Buy a Nikon D300?
– Find support for her camera?

  • UED: enriching the search experience

– How can we make use of this knowledge?

  • Key Technologies

– Entity detection
– Document classification
– Intent modeling / detection
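
As a rough illustration of what intent/task detection means here (a toy sketch, not the system described in the talk; the cue lists and example queries below are invented), a query about a product can be mapped to the Review / Purchase / Support tasks above:

    # Toy illustration of task/intent detection for product queries.
    # The cue lists and queries are invented; a real system would use trained
    # classifiers over query logs, not hand-written keyword rules.
    INTENT_CUES = {
        "Review":   ["review", "reviews", "best", "rate", "compare", "vs"],
        "Purchase": ["buy", "price", "deal", "cheap", "sale"],
        "Support":  ["won't", "problem", "error", "manual", "repair", "support"],
    }

    def detect_intent(query):
        """Return the intent whose cue words best match the query (or 'Unknown')."""
        words = query.lower().split()
        scores = {intent: sum(w in words for w in cues)
                  for intent, cues in INTENT_CUES.items()}
        best_intent, best_score = max(scores.items(), key=lambda kv: kv[1])
        return best_intent if best_score > 0 else "Unknown"

    for q in ["nikon d300 reviews", "buy nikon d300 cheap", "nikon d300 won't focus"]:
        print(q, "->", detect_intent(q))
    # nikon d300 reviews -> Review
    # buy nikon d300 cheap -> Purchase
    # nikon d300 won't focus -> Support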

SLIDE 10

Is "battery life" a synonym of "image quality"?

Research Problem: Intent Synonymy

How can we automatically discover intent synonyms?

[Figure: example queries labeled Review Intent and Transactional Intent]

Intent Synonymy

Key Assumption: one intent per session. A searcher's intent remains the same within a search session.

Intent Synonyms (examples by intent):
  • Review: best, thin, rate, compare kodak vs. canon, small portable, cheap easy use, high zoom, best dell, battery life, user comments
  • Price: where can I get cheap, black Friday sale, christmas sales, buy now pay later, discount, great deals, overstock.com
  • Support: common problems, fault codes, installing new, operating instructions, won't heat, keeps shutting, schematics

Methodology very similar to constructing a dictionary of distributionally similar words.

Intent Discovery and Chaining via Clustering
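
To make that methodology concrete, here is a minimal sketch (not the production system; the session data and function names are invented) of ranking distributionally similar query modifiers: under the one-intent-per-session assumption, each modifier is represented by the entities it co-occurs with in sessions, and modifiers with similar context vectors become candidate intent synonyms.

    from collections import defaultdict
    from math import sqrt

    # Hypothetical (entity, modifier) pairs mined from query-log sessions.
    sessions = [
        ("nikon d300", "best"), ("nikon d300", "rate"), ("canon sd1100", "best"),
        ("canon sd1100", "rate"), ("nikon d300", "where can i get cheap"),
        ("canon sd1100", "discount"), ("kodak z712", "discount"),
        ("nikon d300", "fault codes"), ("canon sd1100", "common problems"),
    ]

    # Represent each modifier by the entities it modifies (its "contexts").
    vectors = defaultdict(lambda: defaultdict(int))
    for entity, modifier in sessions:
        vectors[modifier][entity] += 1

    def cosine(u, v):
        dot = sum(w * v.get(f, 0) for f, w in u.items())
        norm = sqrt(sum(w * w for w in u.values())) * sqrt(sum(w * w for w in v.values()))
        return dot / norm if norm else 0.0

    def intent_synonym_candidates(target, k=3):
        """Rank other modifiers by distributional similarity to `target`."""
        sims = [(cosine(vectors[target], vectors[m]), m) for m in vectors if m != target]
        return sorted(sims, reverse=True)[:k]

    print(intent_synonym_candidates("best"))
    # "rate" ranks first on this toy data; real query logs give cleaner separation.

The "chaining via clustering" step on the slide would then group mutually similar modifiers into intent clusters; the toy data above is far too small to show that.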

SLIDE 11

Road Ahead: Key Applications and Enabling Technologies

Entity Detection, Attribute Mining, Intent Synonymy, Distributional Similarity, Entailment, Anaphora Resolution, Content Analysis, Text Classification
SLIDE 12

Dynamic Similarity Modeling

Eric Crestan and Vishnu Vyas


SeeLEx: Seed List Expansion

  • SeeLEx, developed at Yahoo!, is a weakly supervised system for mining entity lists based on distributional similarity

Example seeds: Jaguar, Honda, Peugeot, Mercedes, Ford
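
The sketch below illustrates the general mechanics of distributional seed-list expansion (in the spirit of the description above, not SeeLEx's actual implementation): candidate terms are ranked by the cosine similarity of their context vectors to the centroid of the seed vectors. The tiny context_vectors dictionary is invented for the example; a real system derives these vectors from a large corpus.

    from math import sqrt

    # Invented toy context vectors keyed by co-occurring context features.
    context_vectors = {
        "jaguar":  {"drive": 3, "engine": 5, "dealer": 2, "habitat": 1},
        "honda":   {"drive": 4, "engine": 6, "dealer": 3},
        "peugeot": {"drive": 2, "engine": 4, "dealer": 2},
        "nissan":  {"drive": 3, "engine": 5, "dealer": 2},
        "cheetah": {"habitat": 4, "prey": 3, "speed": 2},
        "clinton": {"president": 5, "election": 3},
    }

    def cosine(u, v):
        dot = sum(w * v.get(f, 0) for f, w in u.items())
        norm = sqrt(sum(w * w for w in u.values())) * sqrt(sum(w * w for w in v.values()))
        return dot / norm if norm else 0.0

    def expand(seeds, vectors, top_k=5):
        """Rank non-seed terms by similarity to the centroid of the seed vectors."""
        centroid = {}
        for s in seeds:
            for feat, w in vectors[s].items():
                centroid[feat] = centroid.get(feat, 0) + w / len(seeds)
        scored = [(cosine(vectors[t], centroid), t) for t in vectors if t not in seeds]
        return sorted(scored, reverse=True)[:top_k]

    print(expand({"jaguar", "honda", "peugeot"}, context_vectors))
    # nissan scores near 1.0; cheetah and clinton score near 0

With real, ambiguous seeds (Jaguar the cat, Ford the president) the other senses also contribute context features, which is exactly the failure mode the expansion example on the next slide illustrates.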

SLIDE 13

SeeLEx: Seed List Expansion

Example expansion from the seeds {Jaguar, Honda, Peugeot, Mercedes, Ford}: Nissan, Opel, Volkswagen, Porsche, Mazda, Fiat, Hyundai, Lexus, Toyota, Renault, Saab, Suburu, … but also cheetah, lion, puma, caracal (Jaguar the animal) and Clinton, Carter, Eisenhower (Ford the president).

SeeLEx: Seed List Expansion

[Same expansion figure, with "caracal" highlighted]

What is a caracal? endangered, carnivore, fast, hunted, …

SLIDE 14

Questions Asked and Conclusions

What is the effect of corpus size on expansion accuracy?
– Significant boost in performance

What is the effect of corpus quality on expansion accuracy?
– Significant boost in performance

Does seed quality matter?
– Great variance in performance depending on seed set composition

How many seeds are optimal for expansion accuracy on average, and are more seeds better than fewer?
– Somewhat surprising: 5-20 seeds is in general sufficient; more seeds gain little but don't hurt

Gold Standard Sets: Archbishops of Canterbury, Astronomers, Australian Airlines, Australian A-League Football Teams, Australian Cities, Australian Prime Ministers, Best Actress Academy Award Winners, Biology Disciplines, Bottled Water Brands, Boxing Weight Classes, California Counties, Canadian Stadiums, Canadian Universities, Charitable Foundations, Classical Pianists, Cocktails, Cognitive Scientists, Composers, Countries, Electronic Companies, Elements, English Cities, English Poets, English Premier Football Clubs, First Ladies, Formula One Drivers, French Artists, Greek Gods, Greek Islands, Irish Theatres, Italian Regions, Japanese Martial Arts, Japanese Prefectures, Male Tennis Players, Maryland Counties, New Zealand Songwriters, NHL Hockey Teams, North American Mountain Ranges, Presidents of Argentina, Rivers in England, Roman Emperors, Russian Authors, Spanish Provinces, Stars, Superheroes, Texas Counties, U.S. Army Generals, U.S. Federal Departments, U.S. Internet Companies

50 sets extracted from Wikipedia (2007/12)
Average of 208 instances per set; minimum of 11, maximum of 1,116; total of 10,377 instances

SLIDE 15

Corpora

Table: Corpora used to build our expansion models.

Corpus       Unique sentences (millions)   Tokens (millions)   Unique words (millions)
k100         5,201                         217,940             542
k020†        1,040                         43,588              108
k004         208                           8,717               22
Wikipedia    30                            721                 34

† Estimated from k100 statistics.

The Effect of: Corpus Size and Corpus Quality

SLIDE 16

[Chart: System and Corpora Analysis, Precision vs. Recall; curves for full.k100, full.k020, full.k004, full.wikipedia]

Takeaway: Corpus Size Matters
  • k100 yields 13% higher R-precision than a corpus 1/5th its size
  • k100 yields 53% higher R-precision than a corpus 1/25th its size

Takeaway: Corpus Quality Matters
  • Wikipedia yields similar performance to a web corpus 60 times its size
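
For reference, R-precision here compares a ranked expansion against a gold standard set: it is the precision among the top R results, where R is the size of the gold set. A minimal sketch of the metric (with made-up toy lists, not the reported data):

    # R-precision: precision among the top R results, where R = size of the gold set.
    def r_precision(ranked_expansion, gold_set):
        r = len(gold_set)
        hits = sum(1 for item in ranked_expansion[:r] if item in gold_set)
        return hits / r

    gold = {"nissan", "opel", "volkswagen", "porsche"}           # toy gold set (R = 4)
    ranked = ["nissan", "opel", "cheetah", "porsche", "toyota"]  # toy system output
    print(r_precision(ranked, gold))  # 3 of the top 4 are in the gold set -> 0.75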

Opportunity: Model a more natural SeeLEx usage
  • What is the performance of the system on typical sets searched by editors?

[Chart: k100, List Effect, Precision vs. Recall]

Opportunity: Bucket sets to find predictable behaviors
  • Our gold standard sets vary significantly in their expansion performance
  • Look into predictability of open vs. closed class sets, large vs. small class sets, and types of sets such as locations, people, …

SLIDE 17

The Effect of: Seed Selection


[Chart: k100, Seed Selection Effect, Precision vs. Recall; curves for full.k100 and seed trials s010.t01 … s010.t30]

Takeaway: Seed set composition greatly affects performance
  • Best performing seed set had 42% higher precision and 39% higher recall than the worst performing seed set

Opportunity: Reject seeds and/or ask for more
  • We are investigating ways to automatically detect which seed elements are better than others in order to reduce the impact of the seed selection effect
SLIDE 18

The Effect of: Seed Size


[Chart: Rate of New Correct Expansions vs. Seed Size]

Takeaway: Small numbers of seeds are sufficient to fully saturate the distributional similarity model
  • After 10-20 seeds, more seeds yield few new correct instances in the expansion

SLIDE 19

[Chart: Seed Size vs. % of Errors]

Takeaway: Although bad seeds may be less desirable in a seed set than others, adding them does not seem to degrade performance
  • Percentage of errors does not increase with increased seed set size

Questions Asked and Conclusions

What is the effect of corpus size on expansion accuracy?
– Significant boost in performance

What is the effect of corpus quality on expansion accuracy?
– Significant boost in performance

Does seed quality matter?
– Great variance in performance depending on seed set composition

How many seeds are optimal for expansion accuracy on average, and are more seeds better than fewer?
– Somewhat surprising: 5-20 seeds is in general sufficient; more seeds gain little but don't hurt
