
INF3800/INF4800 Search Technology (Søketeknologi) 2016.01.19

Lecturer: Aleksander Øhrn, Professor II, aleksaoh@ifi.uio.no

Teaching assistants: Camilla Emina Stenberg (camilest@student.matnat.uio.no), Jan Kristian Furulund (jankfu@student.matnat.uio.no)

Syllabus: http://nlp.stanford.edu/IR-book/information-retrieval-book.html +

Introduction

The Sweetspot
Distributed Systems, Information Retrieval, Language Technology

Web Search

alltheweb.com 1999-2003

Enterprise Search
Much more than intranets

Data Centers
alltheweb.com 2000

Data Centers
Microsoft 2010
http://www.youtube.com/watch?v=K3b5Ca6lzqE
http://www.youtube.com/watch?v=PPnoKb9fTkA

Search Platform Anatomy
The 50,000 Foot View
[Diagram: Crawler, Document Processing, Indexer, Index, Search, Query Processing, Result Processing, Front End, Data Mining]

Scaling
• Content Volume
  – How many documents are there?
  – How large are the documents?
• Content Complexity
  – How many fields does each document have?
  – How complex are the field structures?
• Query Traffic
  – How many queries per second are there?
  – What is the latency per query?
• Update Frequency
  – How often does the content change?
• Indexing Latency
  – How quickly must new data become searchable?
• Query Complexity
  – How many query terms are there?
  – What is the type and structure of the query terms?

Scaling
• Content Volume: scale through partitioning the data
• Query Traffic: scale through replicating the partitions
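The two scaling axes can be illustrated with a toy in-memory matrix of search nodes. This is only a sketch under stated assumptions: the class `SearchMatrix`, CRC32 hash partitioning, and round-robin replica selection are hypothetical choices made for illustration and are not taken from the slides.

```python
import zlib
from itertools import cycle


class SearchMatrix:
    """Columns are partitions of the data; rows are replicas of those partitions."""

    def __init__(self, num_partitions, num_replicas):
        self.num_partitions = num_partitions
        # nodes[r][p] holds the documents of partition p on replica row r.
        self.nodes = [[{} for _ in range(num_partitions)] for _ in range(num_replicas)]
        self._rows = cycle(range(num_replicas))

    def partition_of(self, doc_id):
        # A stable hash, so the same document always lands in the same partition.
        return zlib.crc32(doc_id.encode("utf-8")) % self.num_partitions

    def add(self, doc_id, text):
        p = self.partition_of(doc_id)
        for row in self.nodes:  # every replica row gets a copy
            row[p][doc_id] = text

    def search(self, term):
        row = self.nodes[next(self._rows)]  # round-robin over replica rows
        hits = []
        for partition in row:  # every partition in the row must be consulted
            hits.extend(d for d, text in partition.items() if term in text.split())
        return sorted(hits)


matrix = SearchMatrix(num_partitions=3, num_replicas=2)
matrix.add("d1", "tea and teacart")
matrix.add("d2", "teach the teacher")
matrix.add("d3", "more tea")
print(matrix.search("tea"))  # served by replica row 0
print(matrix.search("tea"))  # served by replica row 1, same answer
```

Adding partitions shrinks the amount of content each node holds; adding replica rows lets more queries be served at the same time.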

Crawling The Web

Processing The Content
• Format detection: HTML, PDF, Word, Excel, PowerPoint, XML, Zip, …
• Encoding detection: UTF-8, ISCII, KOI8-R, Shift-JIS, ISO-8859-1, …
• Language detection: English, Polish, Danish, Japanese, Norwegian, …
• Parsing: title, headings, body, navigation, ads, footnotes, …
• Tokenization: “30,000”, “L’Hôpital’s rule”, “台湾研究”, …
• Character normalization: Øhrn, Ohrn, Oehrn, Öhrn, …
• Lemmatization: go, went, gone; car, cars; silly, sillier, silliest
• Decompounding: “buljongterning”, “Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz”, …
• Entity extraction: persons, companies, events, locations, dates, quotations, …
• Relationship extraction: who said what, who works where, what happened when, …
• Sentiment analysis: positive or negative, liberal or conservative, …
• Classification: Sports, Health, World, Politics, Entertainment, Spam, Offensive Content, …
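As one concrete instance of the steps listed above, character normalization can be sketched as folding spelling variants to a common key. The function `fold` and the tables `SPECIAL` and `DIGRAPHS` are hypothetical and deliberately lossy (folding "oe" to "o" would also turn "poet" into "pot"); a real system would likely apply language-specific rules.

```python
import unicodedata

# Hypothetical folding tables, chosen only to cover the example on the slide.
SPECIAL = {"ø": "o", "æ": "ae", "ß": "ss"}   # letters that NFKD does not decompose
DIGRAPHS = {"oe": "o", "ae": "a", "aa": "a"}  # ASCII transliterations folded back


def fold(term):
    term = term.lower()
    term = "".join(SPECIAL.get(ch, ch) for ch in term)
    # Decompose accented letters and drop the combining marks (ö becomes o).
    decomposed = unicodedata.normalize("NFKD", term)
    term = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    for digraph, replacement in DIGRAPHS.items():
        term = term.replace(digraph, replacement)
    return term


for variant in ["Øhrn", "Ohrn", "Oehrn", "Öhrn"]:
    print(variant, "->", fold(variant))  # each variant folds to "ohrn"
```

Applying the same folding to both indexed terms and query terms lets any of the four spellings retrieve the others.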

Creating The Index
Word     Document  Position
tea      4         22
         4         32
         4         76
         8         3
teacart  8         7
teach    2         102
         2         233
         8         77
teacher  2         57
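A table like the one above can be produced by a small positional inverted index builder. This is a minimal sketch: the function `build_index` and the two sample documents are made up for illustration, and positions are counted here as token offsets, which may differ from the unit used on the slide.

```python
from collections import defaultdict


def build_index(documents):
    """Map each word to a list of (document id, token position) postings."""
    index = defaultdict(list)
    for doc_id in sorted(documents):  # keeps postings ordered by document id
        for position, token in enumerate(documents[doc_id].lower().split()):
            index[token].append((doc_id, position))
    return dict(index)


docs = {2: "teach the teacher", 8: "tea teacart tea"}
index = build_index(docs)
for word in sorted(index):  # the dictionary is listed in sorted order
    print(word, index[word])
```

Keeping the postings sorted by document id is what makes it cheap to intersect the postings of several query terms.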

Deploying The Index

Processing The Query
• “I am looking for fish restaurants near Majorstua”
• “LED TVs between $1000 and $2000”
• “hphotos-snc3 fbcdn”
• “brintney speers pics”
• “23445 + 43213”

Searching The Content
http://www.stanford.edu/class/cs276/handouts/lecture2-dictionary.pdf
Assess relevancy as we go along

Searching The Content
Federation
[Diagram: Query processing, Result processing, Dispatching, Merging, Searching, Caption generation]
“Divide and conquer”
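The dispatch-and-merge idea behind federation can be sketched as a scatter-gather over partitions. The functions `search_partition` and `federated_search` and the toy score (raw term counts) are assumptions made for illustration; in a real deployment the partitions would be searched in parallel on separate nodes.

```python
import heapq


def search_partition(partition, query, k):
    """Score the documents in one partition; return the k best (score, doc_id) pairs."""
    terms = query.lower().split()
    scored = []
    for doc_id, text in partition.items():
        tokens = text.lower().split()
        score = sum(tokens.count(t) for t in terms)  # toy score: raw term counts
        if score > 0:
            scored.append((score, doc_id))
    return heapq.nlargest(k, scored)


def federated_search(partitions, query, k=3):
    # Dispatch: every partition answers with its own local top k.
    partial = [search_partition(p, query, k) for p in partitions]
    # Merge: the global top k is guaranteed to be among the local top k lists.
    merged = heapq.nlargest(k, (hit for hits in partial for hit in hits))
    return [doc_id for _, doc_id in merged]


partitions = [
    {"a": "tea tea cup", "b": "coffee"},
    {"c": "tea", "d": "tea tea tea"},
]
print(federated_search(partitions, "tea", k=2))  # ['d', 'a']
```

Each partition only needs to return k hits, so the merging node handles a small amount of data regardless of how large the partitions are.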

Searching The Content
Tiering
• Organize the search nodes in a row into multiple tiers
• Top tier nodes may have fewer documents and run on better hardware
• Keep the good stuff in the top tiers
• Only fall through to the lower tiers if not enough good hits are found in the top tiers
• Analyze query logs to decide which documents belong in which tiers
[Diagram: Tier 1, fall through?, Tier 2, fall through?, Tier 3]
“All search nodes are equal, but some are more equal than others”
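The fall-through logic described above can be sketched in a few lines. The function `tiered_search`, the `min_hits` threshold, and the sample tiers are hypothetical; the sketch treats every match as a "good hit", whereas a real system would presumably also look at hit quality.

```python
def tiered_search(tiers, query_term, min_hits):
    """Search tier by tier; return the hits and the deepest tier that was consulted."""
    hits = []
    for tier_number, tier in enumerate(tiers, start=1):
        hits.extend(doc_id for doc_id, text in tier.items() if query_term in text.split())
        if len(hits) >= min_hits:
            return hits, tier_number  # enough hits, no need to fall through
    return hits, len(tiers)


tiers = [
    {"t1a": "tea time"},                    # tier 1: small, fast
    {"t2a": "tea cup", "t2b": "coffee"},    # tier 2
    {"t3a": "tea leaf"},                    # tier 3: the long tail
]
print(tiered_search(tiers, "tea", min_hits=1))  # (['t1a'], 1)
print(tiered_search(tiers, "tea", min_hits=2))  # (['t1a', 't2a'], 2)
```

Most queries stop in tier 1, so the lower tiers see only a fraction of the traffic and can run on cheaper hardware.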

Searching The Content
Context Drilling
[Diagram: nested contexts, from widest to narrowest]
• Body, headings, title, click-through queries, anchor texts
• Headings, title, click-through queries, anchor texts
• Title, click-through queries, anchor texts
• Click-through queries, anchor texts
“If the result set is too large, only consider the superior contexts”
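Context drilling can be sketched as repeatedly dropping the weakest context until the result set is small enough. The field names in `CONTEXTS`, the function `drill`, and the `max_results` threshold are assumptions for illustration; the sketch stops at the narrowest level shown on the slide.

```python
# Contexts ordered from inferior to superior, following the slide.
CONTEXTS = ["body", "headings", "title", "click_through_queries", "anchor_texts"]


def matches(doc, term, contexts):
    return any(term in doc.get(c, "").lower().split() for c in contexts)


def drill(documents, term, max_results):
    """Narrow the set of contexts until at most max_results documents match."""
    for start in range(len(CONTEXTS) - 1):  # narrowest level keeps two contexts
        contexts = CONTEXTS[start:]
        hits = [doc_id for doc_id, doc in documents.items() if matches(doc, term, contexts)]
        if len(hits) <= max_results:
            break
    return hits, contexts


documents = {
    "d1": {"body": "tea tea", "title": "drinks"},
    "d2": {"title": "tea"},
    "d3": {"anchor_texts": "tea shop"},
}
print(drill(documents, "tea", max_results=2))
# (['d2', 'd3'], ['headings', 'title', 'click_through_queries', 'anchor_texts'])
```

A document that mentions the term only in its body is the first to be dropped, which matches the intuition that title and anchor-text matches are stronger evidence.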

Relevancy
• Crowdsourced annotations: anchor texts, click-through queries, tags, …
• Document quality: page rank, link cardinality, item profit margin, popularity, …
• Match context: title, anchor texts, headings, body, …
• Basic statistics: term frequency, inverse document frequency, completeness in superior contexts, proximity, …
• Timeliness: freshness, date of publication, buzz factor, …
→ Relevancy score
“Maximize the normalized discounted cumulative gain (NDCG)”
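NDCG itself is straightforward to compute once each returned result has a graded relevance judgment. The sketch below uses one common formulation (gain divided by log2 of rank + 1); other variants exist, for example with exponential gains. The ideal ordering is taken over the judged results that are passed in, which is a simplifying assumption.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: results further down the list count for less."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(gains, k=None):
    """DCG of the ranking as returned, divided by the DCG of the ideal ordering."""
    k = len(gains) if k is None else k
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0


print(ndcg([3, 2, 1]))  # 1.0, already in the ideal order
print(ndcg([1, 2, 3]))  # below 1.0, the best result is ranked last
```

Because the score is normalized by the ideal ordering, it can be averaged across queries with different numbers of relevant documents.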

Processing The Results
• Faceted browsing
  – What are the distributions of data across the various document fields?
  – “Local” versus “global” meta data
• Result arbitration
  – Which results from which sources should be displayed in a federation setting?
  – How should the SERP layout be rendered?
• Unsupervised clustering
  – Can we automatically organize the result set by grouping similar items together?
• Last-minute security trimming
  – Does the user still have access to each result?
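For faceted browsing, the distribution of values in a document field can be computed by counting over the result set. The function `facet_counts` and the sample fields `lang` and `type` are hypothetical. Counting over the hits of a single node gives "local" counts; summing the counters from all nodes gives "global" ones.

```python
from collections import Counter


def facet_counts(results, field):
    """Count how often each value of a field occurs in a result set."""
    return Counter(doc[field] for doc in results if field in doc)


node_a = [{"lang": "en", "type": "pdf"}, {"lang": "no", "type": "pdf"}]
node_b = [{"lang": "en"}]

local_a = facet_counts(node_a, "lang")
local_b = facet_counts(node_b, "lang")
print(local_a + local_b)  # global counts: Counter({'en': 2, 'no': 1})
```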

Data Mining

Applications

Spellchecking
http://www.google.com/jobs/britney.html

Spellchecking
Query: britnay spears vidios
Generate candidates:
• britnay: britney, bridney, birtney
• spears: shears, speaks
• vidios: videos, vidoes, vidies
Find the best path
1. Generate a set of candidates per query term using approximate matching techniques. Score each candidate according to, e.g., “distance” from the query term and usage frequency.
2. Find the best path in the lattice using the Viterbi algorithm. Use, e.g., candidate scores and bigram statistics to guide the search.
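Step 2 can be sketched as a small Viterbi search over the candidate lattice. All numbers below are invented for illustration: the candidate scores stand in for "distance" and usage frequency, and `BIGRAMS` stands in for real bigram statistics. The query term "spears" is also added as its own candidate, which is an assumption not shown on the slide.

```python
def best_path(lattice, bigram_score):
    """lattice: one dict {candidate: score} per query term; higher scores are better."""
    # best[c] = (score of the best path ending in candidate c, that path)
    best = {c: (s, [c]) for c, s in lattice[0].items()}
    for column in lattice[1:]:
        new_best = {}
        for cand, cand_score in column.items():
            # Pick the best predecessor for this candidate.
            prev_score, prev_path = max(
                ((score + bigram_score(path[-1], cand), path) for score, path in best.values()),
                key=lambda item: item[0],
            )
            new_best[cand] = (prev_score + cand_score, prev_path + [cand])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Hypothetical scores (think of them as log-probabilities).
lattice = [
    {"britney": -1.0, "bridney": -2.0, "birtney": -2.5},
    {"spears": 0.0, "shears": -1.5, "speaks": -1.5},
    {"videos": -1.0, "vidoes": -2.0, "vidies": -2.0},
]
BIGRAMS = {("britney", "spears"): 3.0, ("spears", "videos"): 2.0, ("speaks", "videos"): 0.5}


def bigram_score(left, right):
    return BIGRAMS.get((left, right), 0.0)


print(best_path(lattice, bigram_score))  # ['britney', 'spears', 'videos']
```

Only the best path into each candidate is kept per column, so the work grows with the number of candidate pairs between adjacent columns rather than with the number of complete paths.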

Entity Extraction
Levels of abstraction (… further layers above):
  MAN FOOD
  N/proper V/past/eat DET ADJ N/singular
  Richard ate some bad curry
1. Logically annotate the text with zero or more computed layers of meta data. The original surface form of the text can be viewed as trivial meta data.
2. Apply a pattern matcher or grammar over selected layers. Use, e.g., handcrafted rules or machine-trained models. Extract the surface forms that correspond to the matching patterns.
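The two steps can be sketched with parallel annotation layers and a handcrafted pattern. The layer names, the function `extract`, and the alignment of MAN and FOOD to single tokens are assumptions for illustration. A pattern here is simply a list of regular expressions, one per consecutive token.

```python
import re

# Step 1: annotation layers, aligned token by token with the surface form.
layers = {
    "surface": ["Richard", "ate", "some", "bad", "curry"],
    "pos": ["N/proper", "V/past/eat", "DET", "ADJ", "N/singular"],
    "semantic": ["MAN", None, None, None, "FOOD"],
}


def extract(layers, layer, pattern):
    """Step 2: slide the pattern over one layer; return the matching surface forms."""
    tags = layers[layer]
    surface = layers["surface"]
    width = len(pattern)
    found = []
    for start in range(len(tags) - width + 1):
        window = tags[start:start + width]
        if all(tag is not None and re.fullmatch(p, tag) for p, tag in zip(pattern, window)):
            found.append(" ".join(surface[start:start + width]))
    return found


print(extract(layers, "pos", ["DET", "ADJ", "N/.*"]))  # ['some bad curry']
print(extract(layers, "semantic", ["FOOD"]))           # ['curry']
```

Matching on an abstract layer and reading the answer off the surface layer is what lets one rule cover many different wordings.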

Sentiment Analysis
• “What is the current perception of my brand?”
• “I want to stay at a hotel whose user reviews have a definite positive tone.”
• “What are the most emotionally charged issues in American politics right now?”
http://research.microsoft.com/en-us/projects/blews/
1. To construct a sentiment vocabulary, start by defining a small seed set of known polar opposites.
2. Expand the vocabulary by, e.g., looking at the context around the seeds in a training corpus.
3. Use the expanded vocabulary to build a classifier. Apply special heuristics to take care of, e.g., negations and irony.
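The three steps can be sketched end to end on a toy corpus. Everything here is an assumption for illustration: the seeds, the corpus, the rule that a word joined to a seed by "and" shares its polarity, and the heuristic that a negation flips the next sentiment word. Irony is not handled at all.

```python
from collections import Counter

SEEDS = {"good": 1, "bad": -1}        # step 1: known polar opposites
NEGATIONS = {"not", "never", "no"}


def expand(seeds, corpus, min_count=2):
    """Step 2: a word conjoined with a seed ("X and Y") inherits the seed's polarity."""
    votes = Counter()
    for sentence in corpus:
        words = sentence.lower().split()
        for left, mid, right in zip(words, words[1:], words[2:]):
            if mid != "and":
                continue
            if left in seeds and right not in seeds:
                votes[right] += seeds[left]
            elif right in seeds and left not in seeds:
                votes[left] += seeds[right]
    lexicon = dict(seeds)
    for word, vote in votes.items():
        if abs(vote) >= min_count:
            lexicon[word] = 1 if vote > 0 else -1
    return lexicon


def classify(text, lexicon):
    """Step 3: sum word polarities; a negation flips the next sentiment word."""
    score, negate = 0, False
    for word in text.lower().split():
        if word in NEGATIONS:
            negate = True
        elif word in lexicon:
            score += -lexicon[word] if negate else lexicon[word]
            negate = False
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"


corpus = [
    "good and clean rooms",
    "the staff was good and clean",
    "bad and noisy street",
    "noisy and bad neighbours",
]
lexicon = expand(SEEDS, corpus)
print(lexicon)                                   # seeds plus 'clean' (+1) and 'noisy' (-1)
print(classify("not clean and noisy", lexicon))  # negative
```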
