Of Search and Semantics
Patrick Pantel
NSF Symposium on Semantic Knowledge Discovery, Organization and Use November 15, 2008
Vannevar Bush proposes to build a body of knowledge for all mankind: Memex
Bush: "technical difficulties ignored."
Memex, in the form of a desk, would instantly bring files and material on any subject to the operator’s fingertips.
Semantics captured by an associative trail, personal comments and side trails
Gerard Salton, father of modern search: introduces concepts such as vector space model, Inverse Document Frequency (IDF), Term Frequency (TF), and relevance feedback
1990: the first search engine, Archie, from McGill, matches keyword queries against a database of Web filenames
Yahoo! is founded around a taxonomical organization of the Web… manually
Key innovator: Advanced search features + first SE to allow natural language queries
Key innovator: Semantic models for ranking paid search results
Semantics and Query Logs: Technologies developed/applied during the DARPA Tipster program and TREC make it commercial: spelling correction, query reformulations, also try, …
Future of search lies in a deep understanding and matching of information request behind user queries
Natural language questions answered by editors
Semantic repositories and user-annotated content grow rapidly
Semantic search engines emerge
Search Assist Technology
– Aggregate star rating
– Example review as summary
– Current prices at merchant sites
– Images, maps, specs, …
Opportunities:
– extraction, content analysis, and query intent modeling
– salience; attribute detection
– aboutness, information fusion
– intent/task understanding
– Find product reviews of the Nikon D300?
– Buy a Nikon D300?
– Find support for her camera?
– How can we make use of this knowledge?
Key Technologies
– Entity detection
– Document classification
– Intent modeling / detection
Is battery life a synonym of image quality?
How can we automatically discover intent synonyms?
Review Intent
Transactional Intent
Key Assumption: one intent per session
A searcher’s intent remains the same within a search session
Intent Synonyms
– Review: best thin; rate; compare kodak vs. canon
– Price: where can I get cheap; black Friday sale; christmas sales
– Support: common problems; fault codes; installing new
Methodology very similar to constructing a dictionary of distributionally similar words
Intent Discovery and Chaining via Clustering
Figure: clustered query contexts, including: small portable; cheap easy use; high zoom; buy now pay later; rate best; dell discount; won’t heat; battery life; great deals; best schematics; user comments; keeps shutting
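The intent synonym idea can be sketched in a few lines. This is a minimal illustration, assuming the "one intent per session" assumption stated here: each query context is represented by the sessions it occurs in, and contexts with similar session distributions are treated as candidate intent synonyms. The toy sessions, the helper names (`context_vectors`, `cosine`, `intent_synonyms`), and the use of raw counts with cosine similarity are all assumptions made for illustration, not the method from the talk.

```python
from collections import Counter, defaultdict
from math import sqrt

# Hypothetical toy data: each session lists the query contexts one searcher issued.
sessions = [
    ["rate", "compare", "user comments"],
    ["rate", "user comments", "battery life"],
    ["black friday sale", "great deals"],
    ["great deals", "black friday sale", "buy now pay later"],
    ["fault codes", "keeps shutting"],
]

def context_vectors(sessions):
    """Map each query context to a Counter over the session ids it appears in."""
    vectors = defaultdict(Counter)
    for session_id, contexts in enumerate(sessions):
        for context in contexts:
            vectors[context][session_id] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def intent_synonyms(target, vectors, top_n=3):
    """Rank other contexts by how similar their session distribution is to the target's."""
    scored = [(cosine(vectors[target], vec), ctx)
              for ctx, vec in vectors.items() if ctx != target]
    return [ctx for score, ctx in sorted(scored, reverse=True) if score > 0][:top_n]

vectors = context_vectors(sessions)
print(intent_synonyms("rate", vectors))
print(intent_synonyms("black friday sale", vectors))
```

In this toy data, review-like contexts rank close to "rate" and price-like contexts rank close to "black friday sale", because they share sessions. A real system would need far more sessions and some weighting of the counts.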
system for mining entity lists based on distributional similarity
Figure: example entity lists
– Car makes: Jaguar, Honda, Peugeot, Mercedes, Ford, Nissan, Opel, Volkswagen, Porsche, Mazda, Fiat, Hyundai, Lexus, Toyota, Renault, Saab, Subaru
– Cats: cheetah, lion, puma, caracal
– Presidents: Clinton, Carter, Eisenhower
What is a caracal? endangered, carnivore, fast, hunted, …
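Expanding an entity list from a few seeds by distributional similarity can be sketched as follows. This is a minimal sketch, not the system described in the talk: the context counts are invented toy data, and summing the seed vectors into a centroid and ranking candidates by cosine similarity is one common choice assumed here for illustration. The names `centroid` and `expand` are hypothetical.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy counts: term -> contexts it was observed with.
contexts = {
    "Honda":   Counter({"drive a": 4, "dealer": 3, "engine": 2}),
    "Peugeot": Counter({"drive a": 3, "dealer": 2, "engine": 1}),
    "Toyota":  Counter({"drive a": 5, "dealer": 2, "engine": 3}),
    "Jaguar":  Counter({"drive a": 2, "dealer": 1, "hunted": 2, "fast": 2}),
    "caracal": Counter({"hunted": 3, "fast": 2, "endangered": 2, "carnivore": 1}),
    "Carter":  Counter({"president": 4, "elected": 3}),
}

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def centroid(seeds, contexts):
    """Sum the seed vectors; cosine ignores scale, so no averaging is needed."""
    total = Counter()
    for seed in seeds:
        total.update(contexts[seed])
    return total

def expand(seeds, contexts, top_n=3):
    """Rank non-seed terms by similarity to the seed centroid."""
    center = centroid(seeds, contexts)
    candidates = [term for term in contexts if term not in set(seeds)]
    ranked = sorted(candidates, key=lambda t: cosine(center, contexts[t]), reverse=True)
    return ranked[:top_n]

print(expand(["Honda", "Peugeot"], contexts))  # car-like terms rank first
print(expand(["Jaguar"], contexts))            # an ambiguous seed drifts toward cats
```

With the two unambiguous seeds the car makes rank first, while the single ambiguous seed "Jaguar" pulls "caracal" to the top in this toy data. That gives one intuition for why seed set composition matters.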
Significant boost in performance
Great variance in performance depending on seed set composition
Somewhat surprising: 5-20 seeds are generally sufficient; more seeds gain little but don’t hurt
Sets: Archbishops of Canterbury; Astronomers; Australian Airlines; Australian A-League Football Teams; Australian Cities; Australian Prime Ministers; Best Actress Academy Award Winners; Biology Disciplines; Bottled Water Brands; Boxing Weight Classes; California Counties; Canadian Stadiums; Canadian Universities; Charitable Foundations; Classical Pianists; Cocktails; Cognitive Scientists; Composers; Countries; Electronic Companies; Elements; English Cities; English Poets; English Premier Football Clubs; First Ladies; Formula One Drivers; French Artists; Greek Gods; Greek Islands; Irish Theatres; Italian Regions; Japanese Martial Arts; Japanese Prefectures; Male Tennis Players; Maryland Counties; New Zealand Songwriters; NHL Hockey Teams; North American Mountain Ranges; Presidents of Argentina; Rivers in England; Roman Emperors; Russian Authors; Spanish Provinces; Stars; Superheroes; Texas Counties; U.S. Army Generals; U.S. Federal Departments; U.S. Internet Companies
50 sets extracted from Wikipedia (2007/12)
Average of 208 instances per set; minimum of 11; maximum of 1,116; total of 10,377 instances
Table: corpora statistics. Columns: Corpora; Unique Sentences (millions); Tokens (millions); Unique Words (millions)
†Estimated from k100 statistics.
System and Corpora Analysis (Precision vs. Recall)
Figure: Recall plotted against Precision for full.k100, full.k020, full.k004, and full.wikipedia
Takeaway: Corpus Size Matters
– … than 1/5th its size
– … than 1/25th its size
Takeaway: Corpus Quality Matters
– … to a web corpus 60 times its size
Opportunity: Model a more natural SeeLEx usage
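The precision vs. recall comparisons here can be made concrete with a short sketch of how such a curve might be computed for one ranked expansion against a gold standard set. The gold set, the ranked list, and the function name `precision_recall_curve` are hypothetical. The talk does not say whether seeds are excluded from the gold set before scoring, so this sketch makes no adjustment for them.

```python
def precision_recall_curve(ranked, gold):
    """Return a (precision, recall) point after each rank of a ranked expansion."""
    points = []
    correct = 0
    for rank, item in enumerate(ranked, start=1):
        if item in gold:
            correct += 1
        # Precision: fraction of the items so far that are correct.
        # Recall: fraction of the gold set found so far.
        points.append((correct / rank, correct / len(gold)))
    return points

# Hypothetical gold standard set and system output, best candidate first.
gold = {"Honda", "Toyota", "Nissan", "Mazda"}
ranked = ["Toyota", "caracal", "Nissan", "Mazda", "Carter"]

points = precision_recall_curve(ranked, gold)
for precision, recall in points:
    print(f"P={precision:.2f} R={recall:.2f}")
```

Plotting these points for each system and corpus, averaged over the gold standard sets, would give curves of the kind compared in this analysis.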
k100: List Effect (Precision vs. Recall)
Figure: Recall plotted against Precision
Opportunity: Bucket sets to find predictable behaviors
– … expansion performance
– large vs. small class sets, and types of sets such as locations, people, …
k100: Seed Selection Effect (Precision vs. Recall)
Figure: Recall plotted against Precision for full.k100, s010, and trials s010.t01 through s010.t30
Takeaway: Seed set composition greatly affects performance
– … precision and 39% higher recall than the worst performing seed set
Opportunity: Reject seeds and/or ask for more
– … which seed elements are better than others in …
Rate of New Correct Expansions
Figure: Rate of New Correct Expansions plotted against Seed Size (20 to 200)
Takeaway: Small numbers of seeds are sufficient to fully saturate the distributional similarity model
– … correct instances in the expansion
Seed Size vs. % of Errors
Figure: % of Errors plotted against Seed Size (20 to 200)
Takeaway: Although bad seeds may be less desirable in a seed set than others, adding them does not seem to degrade performance
– Percentage of errors does not increase with increased seed set size