
Lectures and Exercises - PDF document


  1. Slide 1: TDT4215 Web Intelligence
     Main topics:
     • Information Retrieval
     • Large textual document collections
     • Text mining
     • NLP for document analysis
     • Ontologies for document management
     How to extract knowledge from large document collections?
     TDT4215 - Introduction

     Slide 2: Lectures and Exercises
     Lectures
     • Researcher Stein L. Tomassen
     • Additional lecturers:
       - PhD student Geir Solskinnsbakk
       - PhD student Wei Wei
       - PhD student Nattiya Kanhabua
     • Guest lectures:
       - PhD George Tsatsaronis from Athens University of Economics and Business
       - PhD student Simon Jonassen
     • Thursdays 08.15-11.00 in S6 (that’s right, three hours!)
     Exercises
     • Researcher Stein L. Tomassen
     • Fridays 14.15-16.00 in F3
     All relevant information is continuously published at http://www.idi.ntnu.no/emner/tdt4215/

  2. Slide 3: Text Materials
     • Baeza-Yates & Ribeiro-Neto: Modern Information Retrieval. Addison-Wesley, 1999. (selected chapters)
     • Manning, Raghavan and Schütze: Introduction to Information Retrieval. Cambridge University Press, 2008. (selected chapters, available for download)
     • Compendium from IDI (selected book chapters and papers)
     • Details are published at the homepage of the course

     Slide 4: Assessment
     • Group project: 25% of grade
       – Groups of 3-5 people
       – Discuss a particular theoretical topic
       – Develop an information retrieval / text mining application
       – Evaluate application
       – To be carried out the first half of the term (25th Feb to 7th Apr)
       – Stein L. Tomassen is responsible for the group project
     • Individual written examination: 75% of grade
       – 20th of May
       – 4 hours written examination (discussions, calculations, no programming)
       – Based on everything we will learn in the course

  3. Slide 5: Course Characteristics
     • Experimental science:
       – No clear answers or theories
       – Lots of formulas (that are hard to justify)
     • Relevance:
       – Concerns real-world problems
       – A basis for knowledge management applications: search engines, document management systems, publication systems, digital libraries, enterprise business applications, business/web intelligence systems, semantic interoperation/integration software, etc.
     • Multi-disciplinary:
       – Combines techniques from several other sciences: statistics, linguistics, conceptual modeling, artificial intelligence, databases, etc.

     Slide 6: Projects and Exercises Important
     • One mandatory project:
       – Practice in setting up an application
       – How to evaluate the quality of IR/TM applications?
       – How to extract knowledge from specific types of text? Which techniques for which types of text?
     • Exercises:
       – Examples from lectures
       – Understand how formulas are used in practice
       – Be comfortable with “unproven theories”
       – Representative of examination questions
     • Exercises are important!

  4. Slide 7: Lecture Plan (1)

     Slide 8: Lecture Plan (2)

  5. Slide 9: Lecture Plan (3)

     Slide 10: From Documents to Knowledge
     • Document collections
     • Knowledge and documents
     • Document retrieval
     • Text Mining
     • Ontologies
     80% of organizational data is textual with no proper structure!

  6. Slide 11: Overall approach
     [Figure labels: Retrieve document, Discover knowledge; Information Retrieval, Text Mining; Knowledge elicitation, Knowledge representation; Morpho-syntax, Semantics; Text, Ontology; Existing, New]

     Slide 12: Document Collections
     • Domain-dependent or domain-independent
     • Structured or non-structured text
     • Formatted or non-formatted documents
     • Textual or multimedia documents
     • Monolingual and multilingual document collections
     • Centralized or non-centralized document management
     • Confidential or non-confidential
     • Controlled or free addition of documents
     • Stable or non-stable collections
     [Figure labels: User, Information system, Document collection]

  7. Slide 13: Case 1: SAP at STATOIL
     • SAP used for major internal business processes
     • Named user accounts: 29,000; concurrent users: 3,200
     • System complexities: 894,000 customers, 18,000 vendors, 382,000 materials
     • Work orders created each month: 11,000
     • Sales orders created each month: 245,000 (11,600 per day)
     • Documents produced each month: 2.25 million
     • Growth of database: 35 GB per month (Aug 2001)
     • Document characteristics: highly structured, textual and tabular, formatted, controlled addition, high growth, non-centralized, possibly multilingual

     Slide 14: Case 2: Reengineering project at Hydro Agri
     • Objective: Reengineer organization and implement SAP R3 to support business processes
     • Project duration: July 1995 to March 1999
     • Costs: USD 126 million
     • Staffing: 500+ (140 external consultants)
     • Document management: Specialized Lotus Notes databases
     • Document production:
         Database              Docs      Size
         SHARE Training        1061    868 MB
         SHARE Test            1632    218 MB
         SHARE Development    12859    218 MB
         HAE User document.    1312    133 MB
         TOTAL                16864   1437 MB   (359 per month, 12 per day)

  8. Slide 15: Text is Difficult
     • Most organizational knowledge encoded in textual documents
     • Unstructured or semi-structured text difficult to retrieve, interpret or analyze
     • Particular problems:
       – Inconsistent documents
       – Incomplete descriptions
       – Duplicates
       – Different terminologies/languages/abbreviations/perspectives

     Slide 16: Knowledge and Documents
     • One particular document is needed
       E.g.: What textbook is used in TDT4215?
     • Several documents provide partial answers
       E.g.: What is the definition of “text mining”?
     • All documents contribute to answer
       E.g.: Who writes about Rosenborg?
     • Words versus concepts
     • Manual inspection versus automatic reasoning

  9. Slide 17: Document Retrieval
     • Information retrieval = information access
     • Retrieve documents that satisfy a user’s information need from a document collection
       – Document indexing
       – Query interpretation
       – Ranking of retrieved documents
       – Linguistics and statistics
     [Figure labels: documents, document representations, query formulation, identify relevant information, display documents to user]

     Slide 18: Document Retrieval Example
     • AllTheWeb from Fast Search & Transfer (2002)
     • Index: 2.1 GB documents
     • Languages supported: 52
     • Linguistics used: lemmatization, language identification, phrasing, anti-phrasing, text categorization, clustering, offensive content reduction, finite-state automata
     • 30 million queries a day
     • www.alltheweb.com is today part of Yahoo and uses the Inktomi search engine
     • The old AllTheWeb search engine used Yahoo’s verticals
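
The indexing, query interpretation, and ranking steps listed under Document Retrieval can be sketched in a few lines of Python. This is only a minimal illustration, not the method used by AllTheWeb or prescribed by the course: the tokenizer, the tf-idf weighting, the function names (`tokenize`, `build_index`, `search`), and the three sample documents are all assumptions made for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on non-word characters (no lemmatization)."""
    return [t for t in re.split(r"\W+", text.lower()) if t]


def build_index(docs):
    """Document indexing: map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(query, index, n_docs):
    """Query interpretation and ranking: sum tf * idf over the query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in the collection
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties are broken by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical sample collection, invented for this sketch.
docs = {
    "d1": "Text mining extracts knowledge from text documents",
    "d2": "Information retrieval finds relevant documents",
    "d3": "Ontologies represent knowledge",
}
index = build_index(docs)
results = search("text mining", index, len(docs))
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

With this weighting, a term that occurs in every document gets an idf of zero and contributes nothing to the ranking, which is one simple way the "linguistics and statistics" bullet shows up in practice. A real system would add lemmatization, phrasing, and length normalization on top of this.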
