
TDT4215 Web intelligence

Main topics:

  • Information Retrieval
  • Large textual document collections
  • Text mining
  • NLP for document analysis
  • Ontologies for document management

How to extract knowledge from large document collections?

TDT4215 - Introduction

Lectures and Exercises

Lectures

  • Researcher Stein L. Tomassen
  • Additional lecturers:
  • PhD student Geir Solskinnsbakk
  • PhD student Wei Wei
  • PhD student Nattiya Kanhabua
  • Guest lectures:
  • PhD George Tsatsaronis from Athens University of Economics and Business

  • PhD student Simon Jonassen
  • Thursdays 08.15-11.00 in S6 (that’s right, three hours!)

Exercises

  • Researcher Stein L. Tomassen
  • Fridays 14.15-16.00 in F3

All relevant information is published continuously at http://www.idi.ntnu.no/emner/tdt4215/

Text Materials

  • Baeza-Yates & Ribeiro-Neto: Modern Information Retrieval. Addison-Wesley, 1999. (selected chapters)
  • Manning, Raghavan and Schütze: Introduction to Information Retrieval. Cambridge University Press, 2008. (selected chapters, available for download)
  • Compendium from IDI (selected book chapters and papers)
  • Details are published at the homepage of the course

Assessment

  • Group project: 25% of grade
    – Groups of 3-5 people
    – Discuss a particular theoretical topic
    – Develop an information retrieval / text mining application
    – Evaluate application
    – To be carried out the first half of the term (25th Feb – 7th Apr)
    – Stein L. Tomassen is responsible for the group project
  • Individual written examination: 75% of grade
    – 20th of May
    – 4 hours written examination (discussions, calculations, no programming)
    – Based on everything we will learn in the course

Course Characteristics

  • Experimental science:
    – No clear answers or theories
    – Lots of formulas (that are hard to justify)
  • Relevance:
    – Concerns real-world problems
    – A basis for knowledge management applications: Search engines, document management systems, publication systems, digital libraries, enterprise business applications, business/web intelligence systems, semantic interoperation/integration software, etc.
  • Multi-disciplinary:
    – Combines techniques from several other sciences: Statistics, linguistics, conceptual modeling, artificial intelligence, databases, etc.

Projects and Exercises Important

  • One mandatory project:
    – Practice in setting up an application
    – How to evaluate the quality of IR/TM applications?
    – How to extract knowledge from specific types of text? Which techniques for which types of text?
  • Exercises:
    – Examples from lectures
    – Understand how formulas are used in practice
    – Be comfortable with “unproven theories”
    – Representative for examination questions
  • Exercises are important!

Lecture Plan (1)

Lecture Plan (2)

Lecture Plan (3)

From Documents to Knowledge

  • Document collections
  • Knowledge and documents
  • Document retrieval
  • Text Mining
  • Ontologies

80% of organizational data is textual with no proper structure!

Overall approach

[Figure: diagram of the overall approach, with the labels: Retrieve document, Discover knowledge, Information Retrieval, Text Mining, Text, Knowledge elicitation, Knowledge representation, Morpho-syntax, Semantics, Ontology, Existing, New]

Document Collections

  • Domain-dependent or domain-independent
  • Structured or non-structured text
  • Formatted or non-formatted documents
  • Textual or multimedia documents
  • Monolingual and multilingual document collections
  • Centralized or non-centralized document management
  • Confidential or non-confidential
  • Controlled or free addition of documents
  • Stable or non-stable collections

[Figure: User, Information system, Document collection]

Case 1: SAP at STATOIL

  • SAP used for major internal business processes
  • Named user accounts: 29,000
  • Concurrent users: 3,200
  • System complexities: 894,000 customers, 18,000 vendors, 382,000 materials
  • Work orders created each month: 11,000
  • Sales orders created each month: 245,000 (11,600 per day)
  • Documents produced each month: 2.25 million
  • Growth of database: 35 GB per month (Aug 2001)
  • Document characteristics: highly structured, textual and tabular, formatted, controlled addition, high growth, non-centralized, possibly multilingual

Case 2: Reengineering project at Hydro Agri

  • Objective: Reengineer organization and implement SAP R3 to support business processes
  • Project duration: July 1995 – March 1999
  • Costs: USD 126 million
  • Staffing: 500+ (140 external consultants)
  • Document management: Specialized Lotus Notes databases
  • Document production:
    – SHARE Training: 1061 docs, 868 MB
    – SHARE Test: 1632 docs, 218 MB
    – SHARE Development: 12859 docs, 218 MB
    – HAE User document.: 1312 docs, 133 MB
    – TOTAL: 16864 docs, 1437 MB (359 per month, 12 per day)

Text is Difficult

  • Most organizational knowledge encoded in textual documents
  • Unstructured or semi-structured text difficult to retrieve, interpret or analyze
  • Particular problems:
    – Inconsistent documents
    – Incomplete descriptions
    – Duplicates
    – Different terminologies/languages/abbreviations/perspectives

Knowledge and Documents

  • One particular document is needed

E.g.: What textbook is used in TDT4215?

  • Several documents provide partial answers

E.g.: What is the definition of “text mining”?

  • All documents contribute to answer

E.g.: Who writes about Rosenborg?

  • Words versus concepts
  • Manual inspection versus automatic reasoning

Document Retrieval

  • Information retrieval = information access
  • Retrieve documents that satisfy a user’s information need from a document collection
    – Document indexing
    – Query interpretation
    – Ranking of retrieved documents
    – Linguistics and statistics

[Figure: documents, document representations, query formulation, identify relevant information, display documents to user]
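The three steps named on this slide (document indexing, query interpretation, ranking) can be illustrated with a small sketch. This is not the course's method. It assumes a made-up three-document collection, a naive tokenizer, and a plain tf-idf score. The names `DOCS`, `tokenize`, `build_index` and `search` are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical toy collection; the texts are invented for illustration.
DOCS = {
    "d1": "information retrieval finds relevant documents",
    "d2": "text mining discovers knowledge in documents",
    "d3": "ontologies describe concepts and relations",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs):
    """Document indexing: map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(query, index, n_docs):
    """Query interpretation and ranking with a simple tf-idf score."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in the collection
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


index = build_index(DOCS)
results = search("knowledge in documents", index, len(DOCS))
```

Here `results` ranks `d2` ahead of `d1`, because `d2` matches all three query terms and two of them are rare in the collection. `d3` is not returned at all.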

Document Retrieval Example

  • AllTheWeb from Fast Search & Transfer (2002)
  • Index: 2.1 GB documents
  • Languages supported: 52
  • Linguistics used: Lemmatization, language identification, phrasing, anti-phrasing, text categorization, clustering, offensive content reduction, finite-state automata
  • 30 million queries a day
  • www.alltheweb.com is today part of Yahoo and uses the Inktomi search engine
  • The old AllTheWeb search engine used Yahoo’s verticals

Why is Web Search so Difficult?

  • Volume of data:
    – Document explosion
    – Document dynamics
    – Distributed over many computers and platforms
    – Google (2008): estimated about 40 billion pages (over 1 trillion unique urls)

[Figure: web pages, 1999 to 2001. Ref: http://news.netcraft.com]

  • Multitude of languages:
    – Multi-lingual web
    – 40-50 languages used on the web
    – Many text encoding standards

[Figure: chart by language (%): English, German, Japanese, Chinese, French, Spanish, Russian, Italian, Korean, Portuguese, Dutch, Swedish, Polish]

Why is Web Search so Difficult?

  • Document Quality:
    – Misspellings
    – Spam and offensive content
    – Little text
    – All topics

Query          No. of documents
évènements     76,000
événements     420,000
evenements     35,000
evénements     95,000
evènements     22,000
évenements     9,000
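One common way to cope with accent variants like those listed above is to fold accents before indexing and querying. The sketch below is an illustration, not something the slides prescribe. It uses Python's standard `unicodedata` module. The names `VARIANT_COUNTS` and `fold_accents` are hypothetical, and the counts are copied from the list above.

```python
import unicodedata

# Document counts per spelling variant, copied from the list above.
VARIANT_COUNTS = {
    "évènements": 76_000,
    "événements": 420_000,
    "evenements": 35_000,
    "evénements": 95_000,
    "evènements": 22_000,
    "évenements": 9_000,
}


def fold_accents(term):
    """Strip combining marks so that accented variants share one key."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Group the variants under their accent-folded form.
folded_totals = {}
for variant, count in VARIANT_COUNTS.items():
    key = fold_accents(variant)
    folded_totals[key] = folded_totals.get(key, 0) + count
```

All six variants fold to the single key `evenements`. The summed count is only an upper bound on the number of matching documents, since one document may contain several variants.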

  • User Behavior:
    – Misspellings
    – Query length: avg 2.4 terms
    – Query session: 8 queries
    – Half of the documents viewed are among top three documents on result page

Top 10 queries according to Zeitgeist 2010:

1. chatroulette
2. ipad
3. justin bieber
4. nicki minaj
5. friv
6. myxer
7. katy perry
8. twitter
9. gamezer
10. facebook

Text Mining Part I

  • Text mining = Linguistic analysis?
  • Task: Analyze linguistic or statistical content of single documents
    – Transform document or add information to document
    – Tagging, lemmatization, NP recognition, etc.
  • Example: Lemmatization for document retrieval

Original document:

<html> <body> The professor’s assistant reads two papers... </body> </html>

Document with lemma annotations (next step: Index document):

<html> <body> The professor’s <lem>professor</lem> assistant reads <lem>read</lem> two papers <lem>paper</lem>... </body> </html>
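The annotation step in this example can be sketched in a few lines. The sketch assumes a tiny hand-written lexicon instead of a real morphological analyzer. `LEMMAS`, `annotate` and `index_terms` are hypothetical names, and the `<lem>` tag simply mirrors the markup shown on the slide.

```python
# Hypothetical mini-lexicon mapping inflected forms to lemmas. A real
# system would use a morphological analyzer or a full lexicon instead.
LEMMAS = {"professor’s": "professor", "reads": "read", "papers": "paper"}

SENTENCE = "The professor’s assistant reads two papers"


def annotate(text):
    """Insert a <lem> tag after every token that has a known lemma."""
    out = []
    for token in text.split():
        out.append(token)
        lemma = LEMMAS.get(token.lower())
        if lemma is not None:
            out.append(f"<lem>{lemma}</lem>")
    return " ".join(out)


def index_terms(text):
    """Index terms: the lemma where known, else the lowercased token."""
    return {LEMMAS.get(tok.lower(), tok.lower()) for tok in text.split()}


annotated = annotate(SENTENCE)
terms = index_terms(SENTENCE)
```

With lemmas as index terms, a query for "paper" or "read" matches this sentence even though only the inflected forms occur in it.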

Text Mining Example 1

  • Marmot (from UMass)
    – Sentences are separated and segmented into noun phrases, verb phrases, and prepositional phrases
    – Recognizes dates and duration phrases
    – Scopes conjunctions and disjunctions

Input text:

David Brown, University for Industry visits the OU
John Dominque Wed, 15 Oct 1997
David Brown, the Chairman of the University for Industry Design and Implementation Advisory Group and Chairman of Motorola, visited the OU as part of a fact finding exercise, prior to drafting his initial 100 Days Report to HM Government. David was accompanied by Jeanette Pugh, Josh Hillman and Nick Pearce.

Output:

SUBJ (1) : DAVID BROWN %COMMA% UNIVERSITY
PP (2) : FOR INDUSTRY
VB (3) : VISITS
OBJ1 (4) : THE OU
PUNC(5) : %PERIOD%

Vargas-Vera et al.: Knowledge Extraction by using an Ontology-based Annotation tool

Text Mining Part II

  • Text mining = knowledge discovery (in text)?
  • Task: Discover or derive new information from large document collections
    – find patterns across datasets/documents
    – separate signal from noise
    – statistical (and linguistic) approach
  • Techniques:
    – Concept extraction
    – Ontology construction
    – TOC construction
    – Clustering
    – Text categorization
    – Subtechniques: information extraction, text analysis

[Figure: several copies of the news item “David Brown, University for Industry visits the OU”, labeled “Knowledge”]

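Text Mining Example 2

  • Document collection from X
  • What is the content?
  • Prominent terms:

Helsestasjon, helseorganisasjon, journalsystemet, kvalitetsrådgiverprogrammet, miljørettet, Journalopplysninger, sped, helsekortet, skolehelsetenesta, journalforskriften, passord, kravspesifikasjon

(English: health clinic, health organization, the patient record system, the quality adviser program, environment-oriented, record information, infant, the health card, the school health service, the patient record regulation, password, requirements specification)

  • Terms used together in text

– Journalforskriften (the patient record regulation): Datatilsyn, riksarkivar, oppbevaring, pasientjournaler, Retting, journalopplysninger, sletting, Personregisterloven, journal
  (English: Data Inspectorate, national archivist, storage, patient records, correction, record information, deletion, the Personal Data Registers Act, record)
– Mental retardasjon (mental retardation): Syndrom, cerebral, alkoholforbruk, mor, hørsel, ben, Misdannelse, leveår, forekomst
  (English: syndrome, cerebral, alcohol consumption, mother, hearing, leg or bone, malformation, year of life, occurrence)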
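Lists of "terms used together in text" like the ones above can be produced by counting co-occurrences. The slides do not say which method was used, so the sketch below is only one simple possibility: it counts term pairs that share a sentence. The sentences in `SENTENCES` are invented, and `cooccurrence_counts` and `related_terms` are hypothetical names.

```python
from collections import Counter
from itertools import combinations

# Hypothetical tokenized sentences; a real run would tokenize the collection.
SENTENCES = [
    ["journalforskriften", "datatilsyn", "oppbevaring"],
    ["journalforskriften", "datatilsyn", "sletting"],
    ["helsestasjon", "helsekortet"],
]


def cooccurrence_counts(sentences):
    """Count how often each unordered term pair shares a sentence."""
    counts = Counter()
    for tokens in sentences:
        for a, b in combinations(sorted(set(tokens)), 2):
            counts[(a, b)] += 1
    return counts


def related_terms(term, counts, top_n=3):
    """Return the terms that co-occur most often with `term`."""
    related = Counter()
    for (a, b), n in counts.items():
        if term == a:
            related[b] += n
        elif term == b:
            related[a] += n
    return [t for t, _ in related.most_common(top_n)]


counts = cooccurrence_counts(SENTENCES)
neighbours = related_terms("journalforskriften", counts)
```

In this toy data, `datatilsyn` comes first in `neighbours` because it shares two sentences with `journalforskriften`. Raw counts favor frequent terms, so real systems often replace them with an association measure.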

Text Mining Example 3

  • X = Kompetansesenteret for IT i Helsevesenet (KITH), the competence center for IT in the health care system
  • Objective: “KITH skal være helsevesenets sentrale rådgiver og kompetanse-organ for bred, samordnet og kostnadseffektiv realisering og anvendelse av informasjons- og kommunikasjonsteknologi.”
    (English: “KITH shall be the health care system's central adviser and competence body for broad, coordinated and cost-effective realization and use of information and communication technology.”)
  • Terms used together in text

– KITH: Helsevesen, hefte, informasjonssikkerhet, Håndbok, standard, pasientjournaler, evt, Minimum, utarbeiding
  (English: health care system, booklet, information security, handbook, standard, patient records, possibly, minimum, preparation)

  • What does this say about KITH?

Keyphrase Extraction

Ontologies

  • Definition of ontology:
    – Description of entities or concepts and how they are related
    – Conceptualization of some domain
  • Purpose:
    – Semantic description of document collection
    – Semantic interoperability
    – Controlled vocabulary for document retrieval
  • Approaches:
    – Conceptual modeling
    – Document analysis and text mining
    – Standardization work
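To make the "controlled vocabulary for document retrieval" purpose concrete, here is a minimal sketch of ontology-based query expansion. The slides do not describe an implementation, so everything in it is assumed: the toy `ONTOLOGY`, its concept names and the helper `expand_query` are invented for illustration.

```python
# Hypothetical toy ontology: each concept lists its narrower concepts and
# the surface terms (synonyms) that refer to it. All entries are invented.
ONTOLOGY = {
    "patient record": {
        "narrower": ["health card"],
        "terms": ["patient record", "journal"],
    },
    "health card": {"narrower": [], "terms": ["health card"]},
}

# Controlled vocabulary: every known surface term points to one concept.
TERM_TO_CONCEPT = {
    term: concept
    for concept, entry in ONTOLOGY.items()
    for term in entry["terms"]
}


def expand_query(term):
    """Map a query term to its concept, then collect the terms of that
    concept and of all narrower concepts."""
    concept = TERM_TO_CONCEPT.get(term.lower())
    if concept is None:
        return [term]  # unknown term: leave the query unchanged
    expanded, stack, seen = [], [concept], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        expanded.extend(ONTOLOGY[current]["terms"])
        stack.extend(ONTOLOGY[current]["narrower"])
    return expanded


expanded = expand_query("journal")
```

A query for "journal" is expanded to the terms of the concept "patient record" and its narrower concept, so documents that use a different word for the same concept can still be retrieved.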


Ontology Example 1

  • Construct ontological model from STATOIL intranet text collection (T. Brasethvik, NTNU)

[Figure: Intranet]

Ontology Example 2

  • ISO 15926 Integration of life-cycle data for oil and gas production facilities

Current status:

  • Production plants: 50,000 terms
  • Geometry and topology: 400 terms
  • Drilling and logging: 2,700 terms
  • Production: 2,000 terms
  • Safety and automation: 150 terms
  • Subsea equipment: 1,000 terms

Ontology Example 3

  • Ontology-driven information retrieval

Conclusions

  • Characteristics of document collections
  • Technologies for document and knowledge management:
    – Document retrieval
    – Text mining
    – Ontologies
  • Details of technologies