Advanced Topics in Information Retrieval
Natural Language Processing for IR & IR Evaluation
Vinay Setty (vsetty@mpi-inf.mpg.de)
Jannik Strötgen (jannik.stroetgen@mpi-inf.mpg.de)
ATIR – April 28, 2016
Organizational Things
please register – if you haven't done so
– mail to atir16 (at) mpi-inf.mpg.de: (i) name, (ii) matriculation number, (iii) preferred email address
– even if you do not want to get the ECTS points
– important for announcements about assignments, rooms, etc.
assignments
– first assignment today
– remember: we can only open PDFs
– 50% of points (not of exercises) with serious, presentable …
Outline
1. Simple Linguistic Preprocessing
2. Linguistics
3. Further Linguistic (Pre-)Processing
4. NLP Pipeline Architectures
5. Evaluation Measures
Why NLP Foundations for IR?
different types of data
– structured data vs. unstructured data (vs. semi-structured data)
– structured data typically refers to information in tables

  Employee  Manager  Salary
  Johnny    Frank    50000
  Jack      Johnny   60000
  Jim       Johnny   50000

– numerical range and exact match (for text) queries, e.g.,
  Salary < 60000 AND Manager = Johnny
Why NLP Foundations for IR?
unstructured data
– typically refers to "free text"
– not just string matching queries
typical distinction
– structured data → "databases"
– unstructured data → "information retrieval"
– NLP foundations important for IR
actually: semi-structured data
– almost always some structure: title, bullets
– facilitates semi-structured search: title contains NLP and bullet contains data
– (not to mention the linguistic structure of text ...)
Why NLP Foundations for IR?
standard procedure in IR
– starting point: documents and queries
– pre-processing of documents and queries typically includes
  – tokenization (e.g., splitting at white spaces and hyphens)
  – stemming or lemmatization (group variants of the same word)
  – stopword removal (get rid of words with little information)
– this results in a bag (or sequence) of indexable terms
– many NLP concepts were mentioned in the previous lecture
– today: linguistic / NLP foundations for IR
Why NLP Foundations for IR?
goal of this lecture
– NLP concepts are not just buzzwords; NLP concepts shall be understood
example
– what's the difference between lemmatization and stemming?
Contents
1. Simple Linguistic Preprocessing
   – Tokenization
   – Lemmatization & Stemming
2. Linguistics
3. Further Linguistic (Pre-)Processing
4. NLP Pipeline Architectures
5. Evaluation Measures
Tokenization
the task
– given a character sequence, split it into pieces called tokens
– tokens are often loosely referred to as terms/words
– last lecture: "splitting at white spaces and hyphens" – seems to be trivial
type vs. token (vs. term)
– token: instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit
– type: class of all tokens containing the same character sequence
– term: (normalized) type included in the IR system's dictionary
Tokenization – Example
type vs. token – example
– "a rose is a rose is a rose"
– how many tokens? 8
– how many types? 3 ({a, is, rose})
type vs. token – example
– "A rose is a rose is a rose"
– knowing about normalization is important
set-theoretical view (a quick sketch follows below)
– tokens → multiset (bag of words)
– types → set
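A minimal sketch of the multiset vs. set view, assuming plain whitespace tokenization:

    text = "a rose is a rose is a rose"
    tokens = text.split()           # multiset view: 8 tokens
    types = set(tokens)             # set view: 3 types {'a', 'is', 'rose'}
    print(len(tokens), len(types))  # 8 3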
Tokenization – Example
tokenization – example
simple strategies
– split at white spaces and hyphens
– split on all non-alphanumeric characters:
  mr | o | neill | thinks | rumors | about | chile | s | capital | aren | t | amusing
– is that good? there are many alternatives
  → o | neill – oneill – neill – o'neill – o' | neill
  → aren | t – arent – are | n't – aren't
even simple (NLP) tasks are not trivial! (see the sketch below)
most important
– queries and documents have to be preprocessed identically!
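A minimal sketch of the two simple strategies on the O'Neill example (illustrative only; real tokenizers handle many more cases):

    import re

    text = "Mr. O'Neill thinks rumors about Chile's capital aren't amusing."

    # strategy 1: split at white spaces and hyphens
    print(re.split(r"[\s\-]+", text))

    # strategy 2: split on all non-alphanumeric characters
    print([t for t in re.split(r"[^0-9A-Za-z]+", text.lower()) if t])
    # ['mr', 'o', 'neill', 'thinks', 'rumors', 'about', 'chile', 's',
    #  'capital', 'aren', 't', 'amusing']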
Tokenization
queries and documents have to be preprocessed identically
– tokenization choices determine which (Boolean) queries match
– this guarantees that a sequence of characters in the query matches the same sequence in the text
further issues
– what about hyphens? co-education vs. drag-and-drop
– what about names? San Francisco, Los Angeles
– tokenization is language-specific
  – English: "this is a sequence of several words"
  – noun compounds are not separated in German: "Lebensversicherungsgesellschaftsangestellter"
  – a compound splitter may improve IR
Lemmatization & Stemming
tokenization is just one step during preprocessing
– lemmatization
– stemming
– stopword removal
lemmatization and stemming
– two tasks, same goal → to group variants of the same word
what's the difference?
– stemming vs. lemmatization
– stem vs. lemma
Lemma & Lemmatization
idea
– reduce inflectional forms (all variants of a "word") to the base form
examples
– am, are, be, is → be
– car, cars, car's, cars' → car
lemmatization
– proper reduction to dictionary headword form
lemma
– dictionary form of a set of words
Stem & Stemming
idea
– reduce terms to their "roots"
examples
– are → ar
– automate, automates, automatic, automation → automat
stemming
– suggests crude affix chopping
stem
– root form of a set of words (not necessarily a word itself)
Stemming and Lemmatization – Examples
the boy's cars are different colors
– lemmatized: the | boy | car | be | different | color
– stemmed: the | boy | car | ar | differ | color
Stemming and Lemmatization – Examples
for example compressed and compression are both accepted as equivalent to compress.
– lemmatized: for | example | compress | and | compression | be | both | accept | as | equivalent | to | compress
– stemmed: for | exampl | compress | and | compress | ar | both | accept | as | equival | to | compress
Stemming
popular stemmers
– Porter's algorithm (http://tartarus.org/martin/PorterStemmer/)
– Snowball (http://snowballstem.org/demo.html)
what's better for IR? stemming or lemmatization?
– try it yourself! (a quick NLTK sketch follows below)
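A minimal sketch comparing the two with NLTK (assumptions: nltk is installed and the WordNet data has been fetched via nltk.download('wordnet')):

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    print(stemmer.stem("compression"))           # 'compress' – crude affix chopping
    print(stemmer.stem("example"))               # 'exampl' – a stem need not be a word
    print(lemmatizer.lemmatize("are", pos="v"))  # 'be' – dictionary headword form
    print(lemmatizer.lemmatize("cars"))          # 'car'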
Stop Words
stop words
– have little semantic content
– are extremely frequent: the top 30 words account for about 30% of postings
  → high document frequency
example of a stop word list
– a, an, and, are, as, at, be, by, for, from, has, he, in,
  is, it, its, of, on, that, the, to, was, were, will, with
what types of words are these?
Stop Word Removal
idea
– based on a stop list, remove all stop words, i.e., stop words are not part of the IR system's dictionary
– saves a lot of memory
– makes query processing much faster
trend (in particular in web search): no stop word removal
– there are good compression techniques
– there are good query optimization techniques
– stop words are needed – examples (see the sketch below):
  – King of Norway
  – let it be
  – to be or not to be
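A minimal sketch of stop word filtering against the list above, showing how a famous query all but vanishes:

    STOP_WORDS = {"a", "an", "and", "are", "as", "at", "be", "by", "for", "from",
                  "has", "he", "in", "is", "it", "its", "of", "on", "that", "the",
                  "to", "was", "were", "will", "with"}

    tokens = "to be or not to be".split()
    print([t for t in tokens if t not in STOP_WORDS])  # ['or', 'not']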
Contents
1. Simple Linguistic Preprocessing
2. Linguistics
   – Parts-of-Speech
   – Ambiguities
   – Semantic Relations
   – Named Entities
3. Further Linguistic (Pre-)Processing
4. NLP Pipeline Architectures
5. Evaluation Measures
Parts-of-Speech
alternative distinction between stop words and others
– function words: used to make sentences grammatically correct
– content words: carry the meaning of a sentence
function words
– auxiliary verbs, prepositions, conjunctions, determiners, pronouns
content words
– nouns, verbs, adjectives, adverbs
how many parts-of-speech are there?
– between 8 and hundreds of different parts-of-speech
– what's useful depends on the application and language
Ambiguities
can we can fish in a can?
– can: auxiliary, verb, noun
Levels of Ambiguities
speech recognition
– it's hard to recognize speech
– it's hard to wreck a nice beach
prepositional attachment
– the boy saw the man with the telescope
syntax / morphology
– time flies (noun / verb) like (verb / preposition) an arrow
word-level ambiguities
– "can": auxiliary, verb, noun
disambiguation
– resolution of ambiguities
– word-level ambiguities are most crucial for IR
Semantic Relations between Words
synonyms → query for one, find documents with either one
– different words, same meaning: car vs. automobile
homographs → disambiguate or diversify results
– same spelling, different meaning: bank vs. bank
homophones → a problem with spoken queries
– same pronunciation, different meaning: there vs. their vs. they're
homonyms
– same spelling, same pronunciation, different meaning
Named Entities
entity
– anything you can refer to with a name
– location, person, organization
– facilities, vehicles, songs, movies, products (and domain-dependent ones: genes & proteins, ...)
– sometimes: numbers, dates
relevant in IR
– entities are popular and extremely frequent in queries
– names are highly ambiguous
  – Washington → place(s), person(s), (government)
  – Springfield
Contents
1. Simple Linguistic Preprocessing
2. Linguistics
3. Further Linguistic (Pre-)Processing
   – Normalizations
   – Part-of-Speech Tagging
   – Chunking
   – Parsing – Syntactic Analysis
4. NLP Pipeline Architectures
5. Evaluation Measures
Normalizations
indexed terms have to be normalized
– lemmatization
– stemming
– some things need to be done before that:
  – U.S.A. vs. USA
  – anti-discriminatory vs. antidiscriminatory
  – usa vs. USA
terms
– normalization results in terms
– a term is a normalized word type, an entry in an IR system's dictionary
Part-of-Speech Tagging
idea
– number of words in a language: unlimited
  – few frequent words, many infrequent words (Zipf's law: $P_n \propto 1/n^a$)
– number of parts-of-speech: limited
  – Dionysios Thrax of Alexandria (100 BC): 8 parts-of-speech
  – in NLP: up to hundreds of part-of-speech tags (application- and language-dependent)
– many words are ambiguous
example
– The/DET newspaper/NN published/VD ten/CD articles/NNS ./.
– Can/AUX we/PRP can/VB fish/NN in/IN a/DET can/NN ./.
Part-of-Speech Tagging
part-of-speech tags allow for a higher degree of abstraction to estimate likelihoods
– what's the likelihood that
  – "an amazing" is followed by "goalkeeper"?
  – "an amazing" is followed by "scored"?
  – "determiner adjective" is followed by "noun"?
  – "determiner adjective" is followed by "verb"?
automatic assignment of part-of-speech tags
– e.g., Penn Treebank tagset: 36 tags (+ 9 punctuation tags)
– ambiguities can be resolved via contexts
Part-of-Speech Tagging
way to go
– input: sequence of (tokenized) words
– goal: most likely part-of-speech tags for the sequence → ambiguities shall be resolved
– a typical classification problem
is it tough?
– most word types in English are not ambiguous
– but most word occurrences (tokens) in English are ambiguous
→ disambiguation is required
today's taggers: about 97% accuracy (but highly domain-dependent)
Part-of-Speech Tagging
approaches
– rule-based taggers
– probabilistic taggers
– transformation-based taggers
probabilistic taggers
– given: manually annotated training data ("gold standard")
– learn probabilities based on the training data
– estimate probabilities of POS tags given a word in a context
→ Hidden Markov Models
Part-of-Speech Tagging
Hidden Markov Models
– based on Bayesian inference
– goal: given a sequence of tokens, assign a sequence of POS tags
– given all possible tag sequences, which one is most likely?

  $\hat{t}_1^n = \operatorname*{argmax}_{t_1^n} P(t_1^n \mid w_1^n)$

– using Bayes' rule, we get

  $\hat{t}_1^n = \operatorname*{argmax}_{t_1^n} \frac{P(w_1^n \mid t_1^n)\,P(t_1^n)}{P(w_1^n)} = \operatorname*{argmax}_{t_1^n} P(w_1^n \mid t_1^n)\,P(t_1^n)$

– assumptions
  – the probability of a word depends only on its own tag: $P(w_1^n \mid t_1^n) \approx \prod_{i=1}^{n} P(w_i \mid t_i)$
  – the probability of a tag depends only on the previous tag: $P(t_1^n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})$
Part-of-Speech Tagging
Hidden Markov Models
– goal: given a sequence of tokens, assign the most likely sequence of POS tags

  $\hat{t}_1^n = \operatorname*{argmax}_{t_1^n} P(w_1^n \mid t_1^n)\,P(t_1^n) \approx \operatorname*{argmax}_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\,P(t_i \mid t_{i-1})$

– maximum likelihood estimation based on a corpus (a counting sketch follows below):

  $P(t_i \mid t_{i-1}) = \frac{C(t_{i-1}, t_i)}{C(t_{i-1})}$   $P(w_i \mid t_i) = \frac{C(t_i, w_i)}{C(t_i)}$
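A minimal sketch of these maximum likelihood estimates on a tiny, hypothetical tagged corpus (the two sentences and the <s> start symbol are illustrative; a real tagger is trained on a treebank):

    from collections import Counter

    corpus = [
        [("the", "DET"), ("can", "NN")],
        [("we", "PRP"), ("can", "VB"), ("fish", "NN")],
    ]

    emit = Counter()     # C(t_i, w_i)
    trans = Counter()    # C(t_{i-1}, t_i)
    context = Counter()  # C(t_{i-1}), including the sentence-start symbol <s>

    for sentence in corpus:
        prev = "<s>"
        for word, tag in sentence:
            emit[(tag, word)] += 1
            trans[(prev, tag)] += 1
            context[prev] += 1
            prev = tag

    def p_trans(prev, tag):  # P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})
        return trans[(prev, tag)] / context[prev]

    def p_emit(tag, word):   # P(w_i | t_i) = C(t_i, w_i) / C(t_i)
        total = sum(count for (t, _), count in emit.items() if t == tag)
        return emit[(tag, word)] / total

    print(p_trans("<s>", "DET"))  # 0.5 – one of the two sentences starts with DET
    print(p_emit("NN", "can"))    # 0.5 – NN emits 'can' once and 'fish' once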
Part-of-Speech Tagging
in information retrieval
– determine content words in a query based on POS tags
– helpful for named entity recognition → semantic search
Chunking
(simple) grouping of tokens that belong together
– most popular: noun phrase (NP) chunking
– but also: verb phrases
example
– [ Paris ]NP [ has been ]VP [ a wonderful stop ]NP during [ my travel ]NP – just as [ New York City ]NP.
why chunking for IR?
– simpler than full syntactic analysis
– already provides some structure
Parsing
goal: syntactic structure of a sentence
two views of linguistic structure
– constituency (phrase) structure
example (the man has the telescope)
– The boy saw the man with the telescope
– [ [ The boy ]NP [ [ saw ]VP [ [ the man ]NP [ with [ the telescope ]NP ]PP ]NP ]VP ]S
Parsing
goal: syntactic structure of a sentence
two views of linguistic structure
– constituency (phrase) structure
– dependency structure
example (the man has the telescope)
– The boy saw the man with the telescope
– (the slide shows the dependency tree with its ROOT, subj, det, ... edges)
helpful for IR?
– relation extraction for knowledge harvesting
Named Entity Recognition
tasks
– extraction → determine the boundaries
– classification → assign a class (PER, LOC, ORG, ...)
systems
– rule-based → with gazetteers, context-based rules (Mr.), ...
– machine learning → features: mixed case (eBay), ends in digit (A9), all caps (BMW), ...
– several tools available (e.g., Stanford NER)
extraction is good, but normalization is better
Named Entity Normalization
same task, many names
– normalization / linking / resolution / grounding
example: Washington
– /wiki/Washington,_D.C.
– /wiki/Washington_%28state%29
– /wiki/Washington_Irving
– /wiki/Washington_Redskins
– /wiki/George_Washington
tools
– several tools available (AIDA, ...)
Contents
1. Simple Linguistic Preprocessing
2. Linguistics
3. Further Linguistic (Pre-)Processing
4. NLP Pipeline Architectures
5. Evaluation Measures
NLP Pipeline Architectures
NLP tasks can often be split into multiple sub-tasks
– e.g., dependency parsing:
  – sentence splitting
  – tokenization
  – part-of-speech tagging
  – parsing
– several pre-processing components in Elasticsearch
– pre-processing of corpora, e.g., for semantic search
pipeline frameworks (a small example follows below)
– UIMA https://uima.apache.org/
– GATE https://gate.ac.uk/
– NLTK http://www.nltk.org/
– Stanford CoreNLP http://stanfordnlp.github.io/CoreNLP/
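A minimal pipeline sketch with NLTK, one of the frameworks listed above (assumptions: nltk is installed and the 'punkt' and 'averaged_perceptron_tagger' resources have been downloaded):

    import nltk

    def pipeline(text):
        for sentence in nltk.sent_tokenize(text):  # 1. sentence splitting
            tokens = nltk.word_tokenize(sentence)  # 2. tokenization
            yield nltk.pos_tag(tokens)             # 3. part-of-speech tagging
                                                   # (4. parsing would follow here)

    for tagged_sentence in pipeline("Can we can fish in a can? The boy saw the man."):
        print(tagged_sentence)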
The Pipeline Principle – Why a (UIMA) Pipeline
... postponed to the information extraction lecture
Contents
1. Simple Linguistic Preprocessing
2. Linguistics
3. Further Linguistic (Pre-)Processing
4. NLP Pipeline Architectures
5. Evaluation Measures
   – Evaluating NLP Systems
   – Evaluating IR Systems
Evaluation Measures
what is “good” / “correct” in information retrieval?
Evaluation Measures in NLP
let's start with a simple NLP task
– example: given a sequence of tokens, mark the nouns
  "can a red rose be a tree, a fly, just a rose"
– gold annotations: the nouns in the sentence above
– example system output: the tokens the system marked as nouns
  (both markings are shown as highlighting on the original slide)
how good is the system's output?
Evaluation Measures in NLP
frequently used measures
– precision, recall, f-score
– based on evaluating all of the system's decisions
(same sentence as before, with gold annotations and example system output highlighted on the slide)
correct decisions: 3 + 8 = 11?
Evaluation Measures in NLP
frequently used measures
– precision, recall, f-score
– based on evaluating all of the system's decisions
(same sentence as before, with gold annotations and example system output highlighted on the slide)
we should count them separately
– true positives: 3
– true negatives: 8
– false positives: 2
– false negatives: 1
Evaluation Measures in NLP
confusion matrix

                         ground truth
                         pos   neg
  system  pos            TP    FP
          neg            FN    TN

$\text{precision} = \frac{TP}{TP + FP}$
$\text{recall} = \frac{TP}{TP + FN}$
$\text{f1-score} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$

– precision: ratio of instances correctly marked as positive by the system to all instances marked as positive by the system
– recall: ratio of instances correctly marked as positive by the system to all instances marked as positive in the gold standard
– f1-score: balanced harmonic mean of precision and recall
Evaluation Measures in NLP
(same sentence as before, with gold annotations and example system output highlighted on the slide)
– true positives: 3
– true negatives: 8
– false positives: 2
– false negatives: 1
precision = 3 / (3 + 2) = 0.6
recall = 3 / (3 + 1) = 0.75
f1-score = (2 × 0.6 × 0.75) / (0.6 + 0.75) = 2/3
(a quick computation of these values follows below)
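A minimal sketch computing the three measures from the counts above (the helper name prf is just illustrative):

    def prf(tp, fp, fn):
        """Precision, recall, and balanced f1 from raw decision counts."""
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        return precision, recall, f1

    # counts from the noun-marking example
    print(prf(tp=3, fp=2, fn=1))  # (0.6, 0.75, 0.666...)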
Evaluation Measures in NLP
is precision then the accuracy?

$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$

in our example
– precision = 0.6
– accuracy = (3 + 8) / (3 + 8 + 2 + 1) ≈ 0.79
difference
– precision is only about the system's decisions on instances it marked as positive
– accuracy is about the correctness of all decisions
– what makes sense depends on the task
Evaluation Measures in IR
which of the measures makes sense to evaluate IR: precision, recall, f1-score, accuracy?
what's the goal of IR systems?
– is the information need satisfied? is the user happy?
– happiness is elusive to measure
– what's an alternative? relevance of search results
now: how to measure relevance?
Evaluation Measures in IR
measuring relevance with a benchmark
– a set of queries
– a document collection
– relevance judgments
– TREC data sets are popular benchmarks
– there are several issues, which we ignore (for now)
confusion matrix for IR

                            manual judgments
                            relevant   not relevant
  system  relevant          TP         FP
          not relevant      FN         TN
Evaluation Measures in IR
we can calculate
– precision, recall, f1-score, accuracy
but are we done? short-comings:
– how do we get manual judgments for all documents?
– we need measures for ranked retrieval
Measures for Ranked Retrieval
precision at k
– set a rank threshold k (e.g., 1, 3, 5, 10, 20, 50)
– compute the percentage of relevant documents in the top k:
  $\text{precision@}k = \frac{\text{TP in top } k}{k}$
– ignores all documents ranked lower than k
example (n = not relevant, r = relevant; a quick sketch follows below)

  rank:  1  2  3  4  5  6  7  8  9  10  11
         n  r  r  r  n  n  n  n  r  n   r

  precision@1 = 0   precision@3 = 0.667   precision@5 = 0.6   precision@10 = 0.4
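A minimal sketch of precision at k on the example ranking above (True marks a relevant document):

    def precision_at_k(ranking, k):
        """Fraction of relevant documents among the top k."""
        return sum(ranking[:k]) / k

    # n r r r n n n n r n r
    ranking = [False, True, True, True, False, False,
               False, False, True, False, True]
    for k in (1, 3, 5, 10):
        print(k, precision_at_k(ranking, k))  # 0.0, 0.667, 0.6, 0.4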
Measures for Ranked Retrieval
recall at k
– defined analogously to precision at k
precision-recall curve
– see http://nlp.stanford.edu/IR-book/html/htmledition/img532.png
Measures for Ranked Retrieval
average precision
– precision at all ranks r that hold a relevant document
– compute precision at k for each such r (typically with a cut-off, i.e., lower ranks are not judged / considered)
example (same ranking as before; a quick sketch follows below)

  rank:  1  2  3  4  5  6  7  8  9  10  11
         n  r  r  r  n  n  n  n  r  n   r

  number of relevant documents: 5 → compute p@2, p@3, p@4, p@9, p@11

  $AP = \frac{1/2 + 2/3 + 3/4 + 4/9 + 5/11}{5} \approx 0.56$
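A minimal sketch of average precision for the same ranking (precision is taken at each rank holding a relevant document):

    def average_precision(ranking):
        hits, total = 0, 0.0
        for rank, relevant in enumerate(ranking, start=1):
            if relevant:
                hits += 1
                total += hits / rank  # precision at this relevant rank
        return total / hits if hits else 0.0

    ranking = [False, True, True, True, False, False,
               False, False, True, False, True]
    print(average_precision(ranking))  # ~0.563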
Measures for Ranked Retrieval
so far: measures for single queries only
mean average precision
– sum of average precision divided by the number of queries:

  $MAP = \frac{\sum_{i=1}^{u} AP_i}{u}$

example
– for query-1, AP1 = 0.62
– for query-2, AP2 = 0.44
– MAP = (AP1 + AP2) / 2 = 0.53
MAP is frequently reported in research papers
attention: each query is worth the same!
assumption: the more relevant documents, the better
Beyond Binary Relevance
not realistic
– documents are either relevant or not relevant (0 / 1)
much better
– highly relevant documents are more useful
– lower ranks are less useful (likely to be ignored)
Beyond Binary Relevance
discounted cumulative gain
– graded relevance as a measure of usefulness (gain)
– gain is accumulated, starting at the top, and reduced (discounted) at lower ranks
– typically used discount rate: 1/log(rank) (with base 2)
– relevance judgments on a scale of [0, r], with r > 2
Beyond Binary Relevance
cumulative gain
– ratings of the top n ranked documents: r1, r2, ..., rn

  $CG = r_1 + r_2 + \ldots + r_n$

discounted cumulative gain at rank n

  $DCG = r_1 + \frac{r_2}{\log_2(2)} + \frac{r_3}{\log_2(3)} + \ldots + \frac{r_n}{\log_2(n)}$

– scores highly depend on the judgments for the queries
normalized discounted cumulative gain (a quick sketch follows below)
– normalize DCG at rank n by the DCG at rank n of the ideal ranking
– ideal ranking of relevance scores: 3, 3, 3, 2, 2, 1, 1, 1, 0, 0, ...
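A minimal sketch of DCG and nDCG over graded relevance scores (the example scores are made up):

    import math

    def dcg(scores):
        """r1 + r2/log2(2) + r3/log2(3) + ... for graded relevance scores."""
        return scores[0] + sum(s / math.log2(rank)
                               for rank, s in enumerate(scores[1:], start=2))

    def ndcg(scores):
        return dcg(scores) / dcg(sorted(scores, reverse=True))  # normalize by ideal ranking

    print(ndcg([2, 3, 0, 1, 3]))  # < 1.0, since this ranking is not ideal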
Beyond Binary Relevance
popular to evaluate web search
– nDCG
– reciprocal rank: $rr = \frac{1}{K}$, with K the rank of the first relevant document
– mean reciprocal rank: mean rr over multiple queries
– exploiting click data (you need the data to do that ...)
Summary
NLP 4 IR
– as text is not fully structured, plain keyword search is not enough
– pre-processing documents and queries is important
– tokenization, stemming, lemmatization, and stop word removal are frequently used
Ambiguities
– language is often ambiguous
– there are several levels of ambiguities
NLP tasks
– part-of-speech tagging helps to generalize
– named entities are important in IR
Summary
Evaluation Measures
– precision, recall, f1-score (in NLP)
– IR evaluation is different from NLP evaluation
Assignment 1
– the slides will help you a lot!
Thank you for your attention!
Thanks
some slides / examples are taken from / similar to those of:
– Klaus Berberich, Saarland University, previous ATIR lecture
– Manning, Raghavan, Schütze: Introduction to Information Retrieval (including the slides accompanying the book)
– Yannick Versley, Heidelberg University, Introduction to Computational Linguistics