INF3800/INF4800 Søketeknologi (Search Technology) 2017.01.16

Lecturer: Aleksander Øhrn, Professor II, aleksaoh@ifi.uio.no

Teaching assistants: Camilla Emina Stenberg (camilest@student.matnat.uio.no), Eirik Isene (eirikise@ifi.uio.no)

Syllabus: http://nlp.stanford.edu/IR-book/information-retrieval-book.html +

Introduction

The Sweetspot
The intersection of Distributed Systems, Information Retrieval, and Language Technology

Web Search

alltheweb.com 1999-2003

Enterprise Search Much more than intranets

Data Centers alltheweb.com 2000

Data Centers Microsoft 2010 http://www.youtube.com/watch?v=K3b5Ca6lzqE http://www.youtube.com/watch?v=PPnoKb9fTkA

Search Platform Anatomy: The 50,000 Foot View
Components: Crawler, Document Processing, Indexer, Index, Search, Query Processing, Result Processing, Front End, Data Mining

Scaling
• Content Volume
  – How many documents are there?
  – How large are the documents?
• Content Complexity
  – How many fields does each document have?
  – How complex are the field structures?
• Query Traffic
  – How many queries per second are there?
  – What is the latency per query?
• Update Frequency
  – How often does the content change?
• Indexing Latency
  – How quickly must new data become searchable?
• Query Complexity
  – How many query terms are there?
  – What is the type and structure of the query terms?

Scaling
• Content Volume: scale through partitioning the data
• Query Traffic: scale through replicating the partitions
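The two scaling axes can be sketched in a few lines of Python. This is only an illustration: the partition and replica counts, the CRC32 hash, and the round-robin replica choice are assumptions, not something the slides prescribe.

```python
import zlib

NUM_PARTITIONS = 4  # hypothetical: add partitions as content volume grows
NUM_REPLICAS = 3    # hypothetical: add replicas as query traffic grows

def partition_for(doc_id: str) -> int:
    """Assign a document to one partition using a stable hash of its id."""
    return zlib.crc32(doc_id.encode("utf-8")) % NUM_PARTITIONS

def nodes_for_query(query_number: int) -> list[tuple[int, int]]:
    """Pick one replica (round robin) of every partition to serve a query."""
    replica = query_number % NUM_REPLICAS
    return [(partition, replica) for partition in range(NUM_PARTITIONS)]
```

Every query must visit all partitions, since any of them may hold matching documents, but only one replica of each.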

Crawling The Web

Processing The Content
• Format detection: HTML, PDF, Word, Excel, PowerPoint, XML, Zip, …
• Encoding detection: UTF-8, ISCII, KOI8-R, Shift-JIS, ISO-8859-1, …
• Language detection: English, Polish, Danish, Japanese, Norwegian, …
• Parsing: Title, headings, body, navigation, ads, footnotes, …
• Tokenization: “30,000”, “L’Hôpital’s rule”, “台湾研究”, …
• Character normalization: Øhrn, Ohrn, Oehrn, Öhrn, …
• Lemmatization: Go, went, gone; Car, cars; Silly, sillier, silliest
• Decompounding: “buljongterning”, “Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz”, …
• Entity extraction: Persons, companies, events, locations, dates, quotations, …
• Relationship extraction: Who said what, who works where, what happened when, …
• Sentiment analysis: Positive or negative, liberal or conservative, …
• Classification: Sports, Health, World, Politics, Entertainment, Spam, Offensive Content, …
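As one small illustration of the character normalization step, the sketch below folds several spellings of the same name to one key. The folding table `FOLD` and the function name `normalize` are hypothetical. The sketch does not unify the transliteration "Oehrn", which would need language-specific rules.

```python
import unicodedata

# Hypothetical folding table for letters that have no Unicode decomposition.
FOLD = str.maketrans({"Ø": "O", "ø": "o", "Æ": "AE", "æ": "ae"})

def normalize(token: str) -> str:
    """Fold special letters, strip combining accents, and lowercase."""
    folded = token.translate(FOLD)
    decomposed = unicodedata.normalize("NFKD", folded)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()
```

With this sketch, "Øhrn", "Öhrn" and "Ohrn" all map to the same key, so a query in one spelling can match documents in another.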

Creating The Index

Word      Document   Position
tea       4          22
tea       4          32
tea       4          76
tea       8          3
teacart   8          7
teach     2          102
teach     2          233
teach     8          77
teacher   2          57
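A positional index like the one above can be sketched as a mapping from each word to its (document, position) pairs. The two sample documents are made up, and positions are counted as token offsets, which is an assumption about what the slide's position column means.

```python
from collections import defaultdict

def build_index(documents: dict[int, str]) -> dict[str, list[tuple[int, int]]]:
    """Map each word to a sorted list of (document id, token position) pairs."""
    index = defaultdict(list)
    for doc_id in sorted(documents):
        tokens = documents[doc_id].lower().split()
        for position, token in enumerate(tokens):
            index[token].append((doc_id, position))
    # Sort the dictionary by word, as in the table on the slide.
    return dict(sorted(index.items()))

# Hypothetical sample documents.
docs = {2: "the teacher will teach", 8: "tea on a teacart"}
index = build_index(docs)
```

Keeping the postings sorted by document and position is what makes later steps, such as phrase and proximity matching, cheap to evaluate.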

Deploying The Index

Processing The Query
“I am looking for fish restaurants near Majorstua”
“LED TVs between $1000 and $2000”
“hphotos-snc3 fbcdn”
“brintney speers pics”
“23445 + 43213”

Searching The Content http://www.stanford.edu/class/cs276/handouts/lecture2-dictionary.pdf Assess relevancy as we go along

Searching The Content: Federation
Query processing, result processing
Dispatching, merging
Searching, caption generation
“Divide and conquer”
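The dispatch-and-merge idea can be sketched as follows. The scoring (a count of query term occurrences) and the sample nodes are placeholders chosen for illustration. A real system would dispatch in parallel and use a proper relevancy score.

```python
import heapq

def search_node(documents: dict[str, str], query: str) -> list[tuple[int, str]]:
    """Score each local document by counting query term occurrences."""
    terms = query.lower().split()
    hits = []
    for doc_id, text in documents.items():
        tokens = text.lower().split()
        score = sum(tokens.count(term) for term in terms)
        if score > 0:
            hits.append((score, doc_id))
    return hits

def federated_search(nodes: list[dict[str, str]], query: str, k: int = 3):
    """Dispatch the query to every node, then merge the partial results."""
    partial_results = [search_node(node, query) for node in nodes]
    merged = [hit for hits in partial_results for hit in hits]
    return heapq.nlargest(k, merged)

# Hypothetical content spread over two search nodes.
nodes = [
    {"a": "fish restaurants in oslo", "b": "led tvs"},
    {"c": "fish fish fish", "d": "restaurants near majorstua"},
]
```

Each node only needs to return its own best hits, so the merge step stays small no matter how much content the nodes hold in total.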

Searching The Content: Tiering
• Organize the search nodes in a row into multiple tiers
• Top tier nodes may have fewer documents and run on better hardware
• Keep the good stuff in the top tiers
• Only fall through to the lower tiers if not enough good hits are found in the top tiers
• Analyze query logs to decide which documents belong in which tiers
Tier 1 → Fall through? → Tier 2 → Fall through? → Tier 3
“All search nodes are equal, but some are more equal than others”
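A minimal sketch of the fall-through logic, assuming that any document containing the term counts as a "good hit" and that the threshold `min_hits` is chosen by the operator. Both are simplifications made for the example.

```python
def tiered_search(tiers: list[dict[str, str]], term: str, min_hits: int):
    """Search tier by tier; return the hits and the number of tiers visited."""
    hits: list[str] = []
    visited = 0
    for tier in tiers:
        visited += 1
        hits.extend(
            doc_id for doc_id, text in tier.items()
            if term in text.lower().split()
        )
        if len(hits) >= min_hits:
            break  # enough good hits, so no need to fall through
    return hits, visited

# Hypothetical tiers, best documents first.
tiers = [
    {"t1-a": "britney spears videos"},
    {"t2-a": "britney spears pics", "t2-b": "fish restaurants"},
    {"t3-a": "old britney fan page"},
]
```

Queries that are satisfied by the top tier never touch the larger, slower tiers, which is where the savings come from.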

Searching The Content: Context Drilling
Body, headings, title, click-through queries, anchor texts
Headings, title, click-through queries, anchor texts
Title, click-through queries, anchor texts
Click-through queries, anchor texts
“If the result set is too large, only consider the superior contexts”
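Context drilling can be sketched as retrying the match against progressively narrower sets of contexts until the result set is small enough. The field names and the `max_results` threshold below are illustrative assumptions.

```python
# Context sets from widest to narrowest, mirroring the four levels above.
CONTEXT_LEVELS = [
    ["body", "headings", "title", "clickthrough", "anchor"],
    ["headings", "title", "clickthrough", "anchor"],
    ["title", "clickthrough", "anchor"],
    ["clickthrough", "anchor"],
]

def matches(doc: dict[str, str], term: str, contexts: list[str]) -> bool:
    """True if the term occurs in at least one of the given contexts."""
    return any(term in doc.get(ctx, "").lower().split() for ctx in contexts)

def drill(docs: dict[str, dict[str, str]], term: str, max_results: int) -> list[str]:
    """Narrow the contexts until the result set is no longer too large."""
    result: list[str] = []
    for contexts in CONTEXT_LEVELS:
        result = [d for d, doc in docs.items() if matches(doc, term, contexts)]
        if len(result) <= max_results:
            break
    return result

# Hypothetical documents with per-context text.
docs = {
    "d1": {"body": "tea time", "title": "tea"},
    "d2": {"body": "tea and cake"},
    "d3": {"body": "green tea", "anchor": "tea shop"},
}
```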

Relevancy
• Crowdsourced annotations: Anchor texts, click-through queries, tags, …
• Document quality: Page rank, link cardinality, item profit margin, popularity, …
• Match context: Title, anchor texts, headings, body, …
• Basic statistics: Term frequency, inverse document frequency, completeness in superior contexts, proximity, …
• Timeliness: Freshness, date of publication, buzz factor, …
These signals combine into a relevancy score.
“Maximize the normalized discounted cumulative gain (NDCG)”
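For reference, NDCG can be computed as sketched below. This uses the linear-gain form of DCG with a log2 rank discount. Other formulations exist (for example with gains of 2^rel - 1), and the slides do not say which one is intended.

```python
import math

def dcg(gains: list[float]) -> float:
    """Discounted cumulative gain: gains at lower ranks count for less."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains))

def ndcg(gains: list[float]) -> float:
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0
```

Here `gains` holds the judged relevance of each returned result in ranked order, so a ranking that places the most relevant results first reaches the maximum value of 1.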

Processing The Results
• Faceted browsing
  – What are the distributions of data across the various document fields?
  – “Local” versus “global” meta data
• Result arbitration
  – Which results from which sources should be displayed in a federation setting?
  – How should the SERP layout be rendered?
• Unsupervised clustering
  – Can we automatically organize the results set by grouping similar items together?
• Last-minute security trimming
  – Does the user still have access to each result?
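Faceted browsing boils down to counting field values over the result set. A minimal sketch, with made-up result records and field names:

```python
from collections import Counter

def facet_counts(results: list[dict[str, str]], field: str) -> Counter:
    """Count how the results distribute over the values of one field."""
    return Counter(doc[field] for doc in results if field in doc)

# Hypothetical result set with two facet fields.
results = [
    {"brand": "A", "type": "tv"},
    {"brand": "B", "type": "tv"},
    {"brand": "A"},
]
```

Counts computed this way are "local" in the sense that they describe only the current result set, not the whole collection.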

Data Mining

Applications

Spellchecking
http://www.google.com/jobs/britney.html

Spellchecking
Query: britnay spears vidios
Generate candidates:
  britnay: britney, bridney, birtney
  spears: shears, speaks
  vidios: videos, vidoes, vidies
Find the best path
1. Generate a set of candidates per query term using approximate matching techniques. Score each candidate according to, e.g., “distance” from the query term and usage frequency.
2. Find the best path in the lattice using the Viterbi algorithm. Use, e.g., candidate scores and bigram statistics to guide the search.
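The two steps can be sketched as follows. The vocabulary, the usage and bigram counts, the edit-distance cutoff, and the scoring formula (log frequency minus distance, plus a log bigram bonus) are all hypothetical choices made for the example.

```python
import math

# Hypothetical usage counts and bigram counts.
VOCABULARY = {"britney": 50, "spears": 40, "shears": 5, "speaks": 20, "videos": 60}
BIGRAMS = {("britney", "spears"): 30, ("spears", "videos"): 10}

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]

def candidates(term: str, max_distance: int = 2) -> dict[str, float]:
    """Step 1: score nearby vocabulary words by frequency and distance."""
    scored = {}
    for word, count in VOCABULARY.items():
        distance = edit_distance(term, word)
        if distance <= max_distance:
            scored[word] = math.log(count) - distance
    return scored or {term: 0.0}  # keep unknown terms as they are

def best_path(query: str) -> list[str]:
    """Step 2: Viterbi search for the best path through the lattice."""
    lattice = [candidates(term) for term in query.split()]
    # best[word] = (score of the best path ending in word, that path)
    best = {word: (score, [word]) for word, score in lattice[0].items()}
    for column in lattice[1:]:
        best = {
            word: max(
                (prev_score + math.log(1 + BIGRAMS.get((prev, word), 0)) + score,
                 path + [word])
                for prev, (prev_score, path) in best.items()
            )
            for word, score in column.items()
        }
    return max(best.values())[1]
```

Because each column only keeps the best path ending in each candidate, the search cost grows linearly with the number of query terms rather than with the number of possible paths.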

Entity Extraction
Levels of abstraction:
…          …            …      …     …
MAN                                  FOOD
N/proper   V/past/eat   DET    ADJ   N/singular
Richard    ate          some   bad   curry
1. Logically annotate the text with zero or more computed layers of meta data. The original surface form of the text can be viewed as trivial meta data.
2. Apply a pattern matcher or grammar over selected layers. Use, e.g., handcrafted rules or machine-trained models. Extract the surface forms that correspond to the matching patterns.
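A minimal sketch of step 2, using one handcrafted rule over the part-of-speech layer: zero or more ADJ tags followed by N/singular. The rule and the function name are illustrative, and the tags are taken from the example above.

```python
def extract_noun_phrases(tokens: list[str], tags: list[str]) -> list[str]:
    """Return the surface forms matching the pattern ADJ* N/singular."""
    phrases = []
    start = None  # index where the current run of adjectives began
    for i, tag in enumerate(tags):
        if tag == "ADJ":
            if start is None:
                start = i
        elif tag == "N/singular":
            begin = i if start is None else start
            phrases.append(" ".join(tokens[begin:i + 1]))
            start = None
        else:
            start = None  # any other tag breaks the pattern
    return phrases

tokens = ["Richard", "ate", "some", "bad", "curry"]
tags = ["N/proper", "V/past/eat", "DET", "ADJ", "N/singular"]
```

The pattern is matched on the tag layer, but what gets extracted is the aligned surface text.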

Sentiment Analysis
“What is the current perception of my brand?”
“I want to stay at a hotel whose user reviews have a definite positive tone.”
“What are the most emotionally charged issues in American politics right now?”
http://research.microsoft.com/en-us/projects/blews/
1. To construct a sentiment vocabulary, start by defining a small seed set of known polar opposites.
2. Expand the vocabulary by, e.g., looking at the context around the seeds in a training corpus.
3. Use the expanded vocabulary to build a classifier. Apply special heuristics to take care of, e.g., negations and irony.
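The three steps can be sketched as below. The seed words, the tiny corpus, the "X and Y share polarity" expansion heuristic, and the simple negation flip are assumptions chosen for illustration. Irony is not handled at all.

```python
SEEDS = {"good": 1, "bad": -1}  # hypothetical seed set of polar opposites

def expand(seeds: dict[str, int], corpus: list[str]) -> dict[str, int]:
    """Give words joined to a seed by 'and' the same polarity as the seed."""
    vocabulary = dict(seeds)
    for sentence in corpus:
        words = sentence.lower().split()
        for left, middle, right in zip(words, words[1:], words[2:]):
            if middle != "and":
                continue
            if left in seeds and right not in vocabulary:
                vocabulary[right] = seeds[left]
            elif right in seeds and left not in vocabulary:
                vocabulary[left] = seeds[right]
    return vocabulary

def classify(text: str, vocabulary: dict[str, int]) -> str:
    """Sum word polarities, flipping the next polar word after a negation."""
    score, negate = 0, False
    for word in text.lower().split():
        if word in {"not", "never"}:
            negate = True
            continue
        polarity = vocabulary.get(word, 0)
        if polarity:
            score += -polarity if negate else polarity
            negate = False
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

# Hypothetical training corpus.
corpus = ["the room was good and spacious", "the food was bad and stale"]
vocabulary = expand(SEEDS, corpus)
```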

Contextual Search
“Sentences where someone says something positive about Adidas.”
xml:sentence:(“adidas” and sentiment:@degree:>0)
“Dates and locations related to D-Day.”
xml:sentence:(”d-day” and (scope(date) or scope(location)))
“Paragraphs that discuss a company merger or acquisition.”
xml:paragraph:(string(“merger”, linguistics=“on”) and scope(company) and scope(price))
“Sentences where the acronym MIT is defined.”
xml:sentence:acronym:(@base:”mit” and scope(@definition))
“Paragraphs that contain quotations by Alan Greenspan, where he mentions a monetary amount.”
xml:paragraph:quotation:(@speaker:”greenspan” and scope(price))
Example from Wikipedia: persons that appear in documents that contain the word {soccer}, versus persons that appear in paragraphs that contain the word {soccer}
