
Information Retrieval & Data Mining, Winter Semester 2015/16



  1. Information Retrieval & Data Mining, Winter Semester 2015/16, Saarland University, Saarbrücken. Lecturers: Jilles Vreeken (jilles@mpi-inf.mpg.de) and Gerhard Weikum (weikum@mpi-inf.mpg.de). Teaching Assistants: Abdalghani Abujabal, Joanna Biega, Robin Burghartz, Sreyasi Chowdury, Mohamed Gad-Elrab, Adam Grycner, Dhruv Gupta, Yusra Ibrahim, Saskia Metzler, Panagiotis Mandros, Natalia Prytkova. Coordinators: Erdal Kuzey, Amy Siu. Course page: https://www.mpi-inf.mpg.de/de/departments/databases-and-information-systems/teaching/winter-semester-201516/information-retrieval-and-data-mining/

  2. Information Retrieval & Data Mining: What Is It All About?
Information Retrieval is … finding relevant contents, figuring out what users want, and ranking results to satisfy users … at Web scale.
Data Mining is … discovering insight from data, mining interesting patterns, and finding clusters and outliers … from complex data.
Mutual benefits: mining needs to retrieve, filter, and rank contents from the Internet; search benefits from analyzing user-behavior data.
http://tinyurl.com/irdm2015

  3. Organization
• Lectures: Tuesday 14-16 in E1.3 Room 001 and Thursday 14-16 in E1.3 Room 002. Office hours of the lecturers: appointment by e-mail.
• Assignments / Tutoring Groups: Tuesday 16-18, Thursday 16-18, Friday 14-16. Register for the course and a tutor group at http://tinyurl.com/irdm2015! Assignments are given out in the Thursday lecture, to be solved by the next Thursday; the first assignment is given out on Oct 22, with solutions turned in on Oct 29. First meetings of the tutoring groups: Tue, Nov 3; Thu, Nov 5; Fri, Nov 6.
• Requirements for obtaining 9 credit points: pass 2 out of 3 written tests (ca. 60 min each; tentative dates: Thu, Nov 19; Thu, Dec 10; Thu, Feb 4); pass an oral exam (15-20 min; tentative dates: Mon-Tue, Feb 15-16); present solutions to 3 exercises (randomly chosen). Up to 3 bonus points are possible in the tests.

  4. Outline of the IRDM Course
Part I: Introduction & Foundations
1. Motivation and Overview
2. Data Quality and Data Reduction
3. Basics from Probability Theory and Statistics
Part II: Data Mining
4. Patterns: Itemset and Rule Mining
5. Patterns by Clustering
6. Labeling by Supervised Classification
7. Sequences, Time-Series, Streams
8. Graphs: Social Networks, Recommender Systems
9. Anomaly Detection
Part III: Information Retrieval
10. Text Indexing and Similarity Search
11. Probabilistic/Statistical Ranking
12. Topic Models and Graph Models for Ranking
13. Information Extraction
14. Entity Linking and Semantic Search
15. Question Answering

  5. IRDM: Primary Textbooks
Data Mining: Charu Aggarwal: "Data Mining: The Textbook", Springer, 2015.
Information Retrieval: Stefan Büttcher, Charles Clarke, Gordon Cormack: "Information Retrieval: Implementing and Evaluating Search Engines", MIT Press, 2010.
Foundations from Probability and Statistics: Larry Wasserman: "All of Statistics", Springer, 2004.
More books are listed on the IRDM web page and available in the library: http://www.infomath-bib.de/tmp/vorlesungen/info-core_information-retrieval.html
Each unit of the IRDM lecture states the relevant parts of the book(s) and gives additional references. Within each unit, core material and advanced material are flagged.

  6. IRDM Research Literature
Important conferences on IR and DM (see the DBLP bibliography for full detail, http://www.informatik.uni-trier.de/~ley/db/): SIGIR, WSDM, WWW, CIKM, KDD, ICDM, SDM, ICML, ECML.
Important journals on IR and DM: TOIS, TWeb, InfRetr, JASIST, DAMI, TKDD, TODS, VLDBJ.
Performance evaluation initiatives / benchmarks:
• Text Retrieval Conference (TREC), http://trec.nist.gov
• Conference and Labs of the Evaluation Forum (CLEF), http://www.clef-initiative.eu
• KDD Cup, http://www.kdnuggets.com/datasets/kddcup.html and http://www.sigkdd.org/kddcup/index.php

  7. Chapter 1: Motivation and Overview
1.1 Information Retrieval
1.2 Data Mining
"We are drowning in information, and starved for knowledge." -- John Naisbitt

  8. 1.1 Information Retrieval (IR) → Search Engine Technology
Core functionality:
• Match keywords and multi-word phrases in documents: web pages, news articles, scholarly publications, books, patents, service requests, enterprise dossiers, social media posts, …
• Rank results by (estimated) relevance based on: contents, authority, timeliness, localization, personalization, user context, …
• Support interactive exploration of document collections
• Generate recommendations (implicit search)
• …
Challenges:
• Information Deluge
• Needles in a Haystack
• Understanding the User
• Efficiency and Scale

  9. Search Engine Architecture
Pipeline: crawl → extract & clean → index → search → rank → present.
• Crawl: strategies for the crawl schedule and a priority queue for the crawl frontier; handle dynamic pages, detect duplicates, detect spam.
• Extract & clean: build and analyze the Web graph.
• Index: index all tokens or word stems.
• Search: fast top-k queries, query logging, auto-completion.
• Rank: scoring function over many data and context criteria.
• Present: GUI, user guidance, personalization.
Infrastructure: server farm with 100,000s of computers, distributed/replicated data in a high-performance file system, massive parallelism for query processing.
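To make the stage flow concrete, here is a minimal sketch of such a pipeline in Python. Everything in it is an assumption for illustration (the supplied fetch function, the naive tf-based scoring); it is not the architecture of any real engine.

```python
import re
from collections import defaultdict, deque

def crawl(seed_urls, fetch, max_pages=100):
    """BFS crawl: a queue holds the crawl frontier; fetch() is assumed to be
    supplied and to return (text, outlinks) for a URL."""
    frontier, seen, pages = deque(seed_urls), set(seed_urls), {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        text, links = fetch(url)
        pages[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages

def build_index(pages):
    """Index all tokens: term -> postings list of (doc, term frequency)."""
    index = defaultdict(list)
    for url, text in pages.items():
        tokens = re.findall(r"[a-z]+", text.lower())
        for term in set(tokens):
            index[term].append((url, tokens.count(term)))
    return index

def search(index, query, k=10):
    """Search and rank: naive additive tf scoring, then return the top-k."""
    scores = defaultdict(float)
    for term in query.lower().split():
        for doc, tf in index.get(term, []):
            scores[doc] += tf
    return sorted(scores.items(), key=lambda hit: -hit[1])[:k]
```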

  10. Content Gathering and Indexing
Documents from the Internet are reduced to bag-of-words representations and indexed:
Crawling → Extraction of relevant words → Linguistic methods (e.g., stemming) → Statistically weighted features (terms) → Indexing.
A thesaurus (ontology) of synonyms and sub-/super-concepts can normalize the terms; the index maps terms (e.g., "crisis", "love") to URLs and is typically organized as a B+-tree.
(Figure: example Web documents on an "Internet crisis" and on users' love of and trust in search engines, with their extracted term vectors.)
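As a concrete illustration of the extraction steps, here is a minimal bag-of-words sketch. The stopword list and the crude suffix-stripping "stemmer" are toy assumptions; real systems use proper linguistic tools such as the Porter stemmer.

```python
import re
from collections import Counter

STOPWORDS = {"the", "of", "and", "in", "a", "is", "to", "have", "still"}

def toy_stem(word):
    """Crude suffix stripping, purely for illustration."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def bag_of_words(text):
    """Tokenize, drop stopwords, stem, and count the remaining terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(toy_stem(t) for t in tokens if t not in STOPWORDS)

doc = "Internet crisis: users still love and trust search engines"
print(bag_of_words(doc))  # e.g. Counter({'internet': 1, 'user': 1, ...})
```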

  11. Vector Space Model for Content Relevance Ranking
Documents are feature vectors (bags of words), $d_i \in [0,1]^{|F|}$; a query is a set of weighted features, $q \in [0,1]^{|F|}$. The search engine ranks documents by descending relevance, using the cosine similarity metric:
$$sim(d_i, q) := \frac{\sum_{j=1}^{|F|} d_{ij}\, q_j}{\sqrt{\sum_{j=1}^{|F|} d_{ij}^2}\;\sqrt{\sum_{j=1}^{|F|} q_j^2}}$$
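A minimal sketch of this similarity metric, assuming documents and the query are already given as weighted vectors over the same feature space F (the example weights are made up):

```python
from math import sqrt

def cosine_sim(d, q):
    """sim(d, q) = sum_j d_j*q_j / (sqrt(sum_j d_j^2) * sqrt(sum_j q_j^2))."""
    dot = sum(dj * qj for dj, qj in zip(d, q))
    norm_d = sqrt(sum(dj * dj for dj in d))
    norm_q = sqrt(sum(qj * qj for qj in q))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

# Toy example with |F| = 4 features.
d1 = [0.5, 0.0, 0.3, 0.8]
q  = [1.0, 0.0, 0.0, 1.0]
print(round(cosine_sim(d1, q), 3))
```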

  12. Vector Space Model for Content Relevance Ranking (continued)
The document weights $d_{ij}$ can be computed, e.g., using the tf*idf formula:
$$d_{ij} := w_{ij} \Big/ \sqrt{\textstyle\sum_k w_{ik}^2}$$
with
$$w_{ij} := \log\!\left(1 + \frac{freq(f_j, d_i)}{\max_k freq(f_k, d_i)}\right) \cdot \log\frac{\#docs}{\#docs\ with\ f_j}$$
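A sketch of this tf*idf weighting on a made-up toy corpus; the three documents and their terms are assumptions for illustration only:

```python
from math import log, sqrt

docs = {1: ["internet", "crisis", "trust"],
        2: ["internet", "search", "engine"],
        3: ["crisis", "trust", "trust"]}
n_docs = len(docs)

df = {}  # document frequency: in how many docs each term occurs
for terms in docs.values():
    for t in set(terms):
        df[t] = df.get(t, 0) + 1

def weights(doc_terms):
    """w_ij = log(1 + tf/max tf) * log(#docs / df); then L2-normalize."""
    max_freq = max(doc_terms.count(t) for t in set(doc_terms))
    w = {t: log(1 + doc_terms.count(t) / max_freq) * log(n_docs / df[t])
         for t in set(doc_terms)}
    norm = sqrt(sum(v * v for v in w.values()))
    return {t: v / norm for t, v in w.items()} if norm else w

print(weights(docs[1]))
```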

  13. Link Analysis for Authority Ranking
Beyond content relevance, the search engine ranks by descending relevance & authority. Consider the in-degree and out-degree of Web nodes:
Authority Rank($d_i$) := stationary visit probability of $d_i$ in a random walk on the Web.
Relevance and authority (and further signals) are reconciled by a weighted sum.
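A minimal sketch of that reconciliation; the mixing parameter alpha and the candidate scores are assumptions, not values from the slides:

```python
def combined_score(relevance, authority, alpha=0.7):
    """Weighted sum: alpha * relevance + (1 - alpha) * authority."""
    return alpha * relevance + (1 - alpha) * authority

candidates = {"d1": (0.9, 0.2), "d2": (0.6, 0.8)}  # (relevance, authority)
ranking = sorted(candidates,
                 key=lambda d: combined_score(*candidates[d]), reverse=True)
print(ranking)  # with alpha=0.7: d1 scores 0.69 vs. d2 at 0.66
```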

  14. Google's PageRank [Brin & Page 1998]
Idea ("social ranking"): links are endorsements and increase page authority; authority is higher if the links come from high-authority pages:
$$PR(q) = \epsilon \cdot j(q) + (1-\epsilon) \sum_{p \in IN(q)} t(p,q) \cdot PR(p)$$
with $t(p,q) = 1/outdegree(p)$ and $j(q) = 1/N$.
Authority of page $q$ = stationary probability of visiting $q$ in a random walk that follows links uniformly at random, plus random jumps.
Extensions work with weighted links and jumps, trust/spam scores, personalized preferences, and graphs derived from queries & clicks.
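A minimal sketch of this formula computed by power iteration; the three-node graph is a made-up example, and dangling pages (no out-links) are not handled:

```python
def pagerank(out_links, eps=0.15, iterations=50):
    """PR(q) = eps/N + (1-eps) * sum over in-neighbors p of PR(p)/outdeg(p)."""
    nodes = list(out_links)
    n = len(nodes)
    pr = {q: 1.0 / n for q in nodes}
    for _ in range(iterations):
        new = {}
        for q in nodes:
            incoming = sum(pr[p] / len(out_links[p])
                           for p in nodes if q in out_links[p])
            new[q] = eps / n + (1 - eps) * incoming
        pr = new
    return pr

graph = {"a": {"b", "c"}, "b": {"c"}, "c": {"a"}}
print(pagerank(graph))  # c, with two in-links, accumulates the most authority
```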

  15. Indexing with Inverted Lists
The vector space model suggests a term-document matrix, but the data is sparse and queries are even sparser, so it is better to use inverted index lists with the terms as keys, organized in a B+-tree. Each index list holds postings (DocId, score) sorted by DocId; for the query q: Internet crisis trust, the lists for "Internet", "crisis", and "trust" are fetched and merged. At Google scale: > 10 million terms, > 100 billion docs, > 50 TB index.
Terms can be full words, word stems, word pairs, substrings, N-grams, etc. (whatever "dictionary terms" we prefer for the application).
• Index-list entries are kept in DocId order for fast Boolean operations.
• Many techniques exist for excellent compression of index lists.
• An additional position index is needed for phrases, proximity, etc. (or other precomputed data structures).
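A minimal sketch of the DocId-ordered merge for the example query; the postings loosely echo the figure's toy numbers and are illustrative, not real index data:

```python
index = {
    "internet": [(11, 0.6), (14, 0.4), (17, 0.1), (28, 0.7), (44, 0.2)],
    "crisis":   [(17, 0.3), (44, 0.4), (52, 0.1), (53, 0.8), (55, 0.6)],
    "trust":    [(12, 0.5), (17, 0.1), (28, 0.1), (44, 0.2), (51, 0.6)],
}

def and_merge(list_a, list_b):
    """Intersect two postings lists sorted by DocId, summing the scores."""
    i, j, result = 0, 0, []
    while i < len(list_a) and j < len(list_b):
        (da, sa), (db, sb) = list_a[i], list_b[j]
        if da == db:
            result.append((da, sa + sb))
            i, j = i + 1, j + 1
        elif da < db:
            i += 1
        else:
            j += 1
    return result

# Boolean AND over "internet crisis trust", then rank by accumulated score.
hits = and_merge(and_merge(index["internet"], index["crisis"]), index["trust"])
print(sorted(hits, key=lambda p: -p[1]))  # docs 44 and 17 match all terms
```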
