SLIDE 1

1

TDT4215 Web Intelligence

Main topics:

Information Retrieval
Large textual document collections
Text mining
NLP for document analysis
Ontologies for document management

How to extract knowledge from large document collections?

TDT4215 - Introduction

2

Lectures and Exercises

Lectures

Researcher Stein L. Tomassen
Additional lecturers:
PhD student Geir Solskinnsbakk
PhD student Wei Wei
PhD student Nattiya Kanhabua
Guest lectures:
PhD George Tsatsaronis from Athens University of Economics and Business
PhD student Simon Jonassen
Thursdays 08.15-11.00 in S6 (that’s right, three hours!)

Exercises

Researcher Stein L. Tomassen
Fridays 14.15-16.00 in F3

All relevant information is published continuously at http://www.idi.ntnu.no/emner/tdt4215/

SLIDE 2

3

Text Materials

Baeza-Yates & Ribeiro-Neto: Modern Information Retrieval. Addison-Wesley, 1999. (selected chapters)

Manning, Raghavan and Schütze: Introduction to Information Retrieval. Cambridge University Press, 2008. (selected chapters, available for download)

Compendium from IDI (selected book chapters and papers)

Details are published at the homepage of the course

4

Assessment

Group project: 25% of grade

– Groups of 3-5 people
– Discuss a particular theoretical topic
– Develop an information retrieval / text mining application
– Evaluate application
– To be carried out in the first half of the term (25th Feb – 7th Apr)
– Stein L. Tomassen is responsible for the group project

Individual written examination: 75% of grade

– 20th of May
– 4 hours written examination (discussions, calculations, no programming)
– Based on everything we will learn in the course

SLIDE 3

5

Course Characteristics

Experimental science:

– No clear answers or theories
– Lots of formulas (that are hard to justify)

Relevance:

– Concerns real-world problems
– A basis for knowledge management applications: search engines, document management systems, publication systems, digital libraries, enterprise business applications, business/web intelligence systems, semantic interoperation/integration software, etc.

Multi-disciplinary:

– Combines techniques from several other sciences: statistics, linguistics, conceptual modeling, artificial intelligence, databases, etc.

6

Projects and Exercises Important

One mandatory project:

– Practice in setting up an application
– How to evaluate the quality of IR/TM applications?
– How to extract knowledge from specific types of text? Which techniques for which types of text?

Exercises:

– Examples from lectures
– Understand how formulas are used in practice
– Be comfortable with “unproven theories”
– Representative of examination questions

Exercises are important!

SLIDE 4

7

Lecture Plan (1)

8

Lecture Plan (2)

SLIDE 5

9

Lecture Plan (3)

10

From Documents to Knowledge

Document collections
Knowledge and documents
Document retrieval
Text Mining
Ontologies

80% of organizational data is textual with no proper structure!

SLIDE 6

11

Overall approach

[Figure: overall approach. Diagram labels: Retrieve document, Discover knowledge, Information Retrieval, Text Mining, Text, Knowledge elicitation, Knowledge representation, Morpho-syntax, Semantics, Ontology, Existing, New]

12

Document Collections

Domain-dependent or domain-independent
Structured or non-structured text
Formatted or non-formatted documents
Textual or multimedia documents
Monolingual and multilingual document collections
Centralized or non-centralized document management
Confidential or non-confidential
Controlled or free addition of documents
Stable or non-stable collections

[Figure: User, Information system, Document collection]

SLIDE 7

13

Case 1: SAP at STATOIL

SAP used for major internal business processes
Named user accounts: 29,000
Concurrent users: 3,200
System complexities:

– 894,000 customers
– 18,000 vendors
– 382,000 materials

Work orders created each month: 11,000
Sales orders created each month: 245,000 (11,600 per day)
Documents produced each month: 2.25 million
Growth of database: 35 GB per month (Aug 2001)
Document characteristics: highly structured, textual and tabular, formatted, controlled addition, high growth, non-centralized, possibly multilingual

14

Case 2: Reengineering project at Hydro Agri

Objective: Reengineer organization and implement SAP R/3 to support business processes
Project duration: July 1995 – March 1999
Costs: USD 126 million
Staffing: 500+ (140 external consultants)
Document management: Specialized Lotus Notes databases

Document production:

SHARE Training:        1061 docs    868 MB
SHARE Test:            1632 docs    218 MB
SHARE Development:    12859 docs    218 MB
HAE User document.:    1312 docs    133 MB
TOTAL:                16864 docs   1437 MB   (359 per month, 12 per day)
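The totals in the document production table can be checked with a few lines of arithmetic. The sketch below copies the figures from the slide; the function name `summarize` and the 1 MB = 1024 KB conversion are assumptions made for illustration, and the slide does not say how the per-month rate was computed.

```python
# Figures copied from the slide: (number of documents, size in MB).
DOCUMENT_PRODUCTION = {
    "SHARE Training": (1061, 868),
    "SHARE Test": (1632, 218),
    "SHARE Development": (12859, 218),
    "HAE User document.": (1312, 133),
}
DOCS_PER_MONTH = 359  # rate stated on the slide


def summarize(production, docs_per_month):
    """Recompute totals and derive the rates they imply (hypothetical helper)."""
    total_docs = sum(docs for docs, _ in production.values())
    total_mb = sum(mb for _, mb in production.values())
    return {
        "total_docs": total_docs,
        "total_mb": total_mb,
        # Number of months over which the stated monthly rate yields the total.
        "implied_months": total_docs / docs_per_month,
        # Assumes 1 MB = 1024 KB; the slide does not define the unit.
        "avg_kb_per_doc": total_mb * 1024 / total_docs,
    }


if __name__ == "__main__":
    for key, value in summarize(DOCUMENT_PRODUCTION, DOCS_PER_MONTH).items():
        print(f"{key}: {value:.1f}")
```

The recomputed sums match the TOTAL row, and the stated rate of 359 documents per month corresponds to roughly 47 months of production.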

SLIDE 8

15

Text is Difficult

Most organizational knowledge encoded in textual documents
Unstructured or semi-structured text difficult to retrieve, interpret or analyze

Particular problems:

– Inconsistent documents
– Incomplete descriptions
– Duplicates
– Different terminologies/languages/abbreviations/perspectives

16

Knowledge and Documents

One particular document is needed

E.g.: What textbook is used in TDT4215?

Several documents provide partial answers

E.g.: What is the definition of “text mining”?

All documents contribute to answer

E.g.: Who writes about Rosenborg?

Words versus concepts
Manual inspection versus automatic reasoning

SLIDE 9

17

Document Retrieval

Information retrieval = information access
Retrieve documents that satisfy a user’s information need from a document collection

– Document indexing
– Query interpretation
– Ranking of retrieved documents
– Linguistics and statistics

[Figure: documents, document representations, query formulation, identify relevant information, display documents to user]
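The three steps listed above (document indexing, query interpretation, ranking) can be illustrated with a very small sketch. It assumes whitespace tokenization and a plain TF-IDF score, neither of which is prescribed by the slide; the function names `tokenize`, `build_index` and `rank` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Query/document interpretation reduced to lowercasing and splitting."""
    return text.lower().split()


def build_index(documents):
    """Document indexing: map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(documents):
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def rank(query, index, n_docs):
    """Ranking: sum tf * idf over query terms, best-scoring document first."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in the collection
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Ties are broken by document id so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    docs = ["text mining text", "text retrieval", "ontology"]
    index = build_index(docs)
    print(rank("text mining", index, len(docs)))
```

The index here is the "document representation" of the figure: queries are matched against it rather than against the raw documents.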

18

Document Retrieval Example

AllTheWeb from Fast Search & Transfer (2002)

Index: 2.1 billion documents
Languages supported: 52
Linguistics used: Lemmatization, language identification, phrasing, anti-phrasing, text categorization, clustering, offensive content reduction, finite-state automata
30 million queries a day
www.alltheweb.com is today part of Yahoo and uses the Inktomi search engine
The old AllTheWeb search engine is used in Yahoo’s verticals

SLIDE 10

19

Why is Web Search so Difficult?

Volume of data:

– Document explosion
– Document dynamics
– Distributed over many computers and platforms
– Google (2008): estimated about 40 billion pages (over 1 trillion unique URLs)

Ref: http://news.netcraft.com

Multitude of languages:

– Multi-lingual web
– 40-50 languages used on the web
– Many text encoding standards

[Figure: share of web pages (%) by language, 1999 and 2001. Languages shown: English, German, Japanese, Chinese, French, Spanish, Russian, Italian, Korean, Portuguese, Dutch, Swedish, Polish]

20

Why is Web Search so Difficult?

Document Quality:

– Misspellings
– Spam and offensive content
– Little text
– All topics

Query          No. of documents
évènements      76,000
événements     420,000
evenements      35,000
evénements      95,000
evènements      22,000
évenements       9,000

User Behavior:

– Misspellings
– Query length: avg 2.4 terms
– Query session: 8 queries
– Half of the documents viewed are among top three documents on result page

Top 10 queries according to Zeitgeist 2010:

1. chatroulette
2. ipad
3. justin bieber
4. nicki minaj
5. friv
6. myxer
7. katy perry
8. twitter
9. gamezer
10. facebook
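The six spellings of "événements" in the table differ only in their accents. One common way to treat such variants as a single index term is accent folding, sketched below with Python's standard `unicodedata` module. This is an illustration of one possible normalization, not something the slide prescribes; the names `fold_accents` and `merge_variants` are hypothetical, and the counts are copied from the table.

```python
import unicodedata
from collections import defaultdict

# Document counts per spelling variant, copied from the slide.
VARIANT_COUNTS = {
    "évènements": 76_000,
    "événements": 420_000,
    "evenements": 35_000,
    "evénements": 95_000,
    "evènements": 22_000,
    "évenements": 9_000,
}


def fold_accents(term):
    """Strip combining marks after canonical decomposition (é -> e)."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def merge_variants(counts):
    """Sum document counts over all variants that fold to the same term."""
    merged = defaultdict(int)
    for term, n_docs in counts.items():
        merged[fold_accents(term).lower()] += n_docs
    return dict(merged)


if __name__ == "__main__":
    print(merge_variants(VARIANT_COUNTS))
```

Folding trades precision for recall: a query for any one variant would reach all 657,000 documents, but languages in which accents distinguish different words lose that distinction.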

SLIDE 11

21

Text Mining Part I

Text mining = Linguistic analysis?
Task:

Analyze linguistic or statistical content of single documents

– Transform document or add information to document
– Tagging, lemmatization, NP recognition, etc.

Example: Lemmatization for document retrieval

Original document:

<html> <body> The professor’s assistant reads two papers... </body> </html>

Document with lemmas added:

<html> <body> The professor’s <lem>professor</lem> assistant reads <lem>read</lem> two papers <lem>paper</lem>... </body> </html>

Index document
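The lemmatization example above can be mimicked with a toy annotator. The sketch assumes a hand-written lemma lexicon (`TOY_LEMMAS`, hypothetical) in place of a real morphological analyzer, uses a straight apostrophe for simplicity, and only tags words whose lemma differs from the surface form, as in the slide.

```python
import re

# Hypothetical toy lexicon; a real system would use a morphological analyzer.
TOY_LEMMAS = {
    "professor's": "professor",
    "reads": "read",
    "papers": "paper",
}

# A word, optionally followed by an apostrophe suffix such as 's.
WORD = re.compile(r"[A-Za-z]+(?:'[a-z]+)?")


def annotate_lemmas(text, lemmas=TOY_LEMMAS):
    """Append <lem>...</lem> after every word found in the lexicon."""
    def tag(match):
        word = match.group(0)
        lemma = lemmas.get(word.lower())
        return f"{word} <lem>{lemma}</lem>" if lemma else word

    return WORD.sub(tag, text)


if __name__ == "__main__":
    print(annotate_lemmas("The professor's assistant reads two papers..."))
```

An indexer that stores the `<lem>` values alongside the surface words can then match the query "paper" against a document that only contains "papers".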

22

Text Mining Example 1

Marmot (from UMass)

– Sentences are separated and segmented into noun phrases, verb phrases, and prepositional phrases
– Recognizes dates and duration phrases
– Scopes conjunctions and disjunctions

Input text:

David Brown, University for Industry visits the OU
John Dominque Wed, 15 Oct 1997
David Brown, the Chairman of the University for Industry Design and Implementation Advisory Group and Chairman of Motorola, visited the OU as part of a fact finding exercise, prior to drafting his initial 100 Days Report to HM Government. David was accompanied by Jeanette Pugh, Josh Hillman and Nick Pearce.

Output for the headline:

SUBJ (1) : DAVID BROWN %COMMA% UNIVERSITY
PP (2) : FOR INDUSTRY
VB (3) : VISITS
OBJ1 (4) : THE OU
PUNC(5) : %PERIOD%

Vargas-Vera et al.: Knowledge Extraction by using an Ontology-based Annotation tool

SLIDE 12

23

Text Mining Part II

Text mining = knowledge discovery (in text)?
Task:

Discover or derive new information from large document collections

– find patterns across datasets/documents
– separate signal from noise
– statistical (and linguistic) approach

Techniques:

– Concept extraction
– Ontology construction
– TOC construction
– Clustering
– Text categorization
– Subtechniques: information extraction, text analysis

[Figure: a stack of copies of the "David Brown, University for Industry visits the OU" news story, leading to Knowledge]

24

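Text Mining Example 2

Document collection from X
What is the content?

Prominent terms:

Helsestasjon (health clinic), helseorganisasjon (health organization), journalsystemet (the record system), kvalitetsrådgiverprogrammet (the quality adviser programme), miljørettet (environment-oriented), Journalopplysninger (record information), sped (infant), helsekortet (the health card), skolehelsetenesta (the school health service), journalforskriften (the patient records regulation), passord (password), kravspesifikasjon (requirements specification)

Terms used together in text

– Journalforskriften (the patient records regulation): Datatilsyn (Data Inspectorate), riksarkivar (National Archivist), oppbevaring (storage), pasientjournaler (patient records), Retting (correction), journalopplysninger (record information), sletting (deletion), Personregisterloven (the Personal Data Registers Act), journal (record)
– Mental retardasjon (mental retardation): Syndrom (syndrome), cerebral, alkoholforbruk (alcohol consumption), mor (mother), hørsel (hearing), ben (leg/bone), Misdannelse (malformation), leveår (year of life), forekomst (incidence)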
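"Terms used together in text" can be approximated by counting how often two terms occur in the same document. The sketch below is one simple way to do that, assuming whitespace tokenization and document-level co-occurrence; the slides do not say which method produced the lists above, and the function names and toy documents are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count, for each unordered term pair, the documents containing both."""
    counts = Counter()
    for doc in documents:
        terms = sorted(set(doc.lower().split()))
        counts.update(combinations(terms, 2))  # pairs are in sorted order
    return counts


def terms_used_with(term, counts, top_n=3):
    """Return the terms that most often share a document with `term`."""
    related = Counter()
    for (first, second), n in counts.items():
        if first == term:
            related[second] += n
        elif second == term:
            related[first] += n
    return [other for other, _ in related.most_common(top_n)]


if __name__ == "__main__":
    # Toy documents made up for illustration, reusing terms from the slide.
    docs = [
        "journalforskriften oppbevaring sletting",
        "journalforskriften sletting retting",
        "passord helsekortet",
    ]
    counts = cooccurrence_counts(docs)
    print(terms_used_with("journalforskriften", counts, top_n=1))
```

Raw counts favour frequent terms; practical systems usually normalize them with an association measure, but the counting step stays the same.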

SLIDE 13

25

Text Mining Example 3

X = Kompetansesenteret for IT i Helsevesenet (KITH)
Objective: “KITH shall be the health service’s central adviser and competence body for broad, coordinated and cost-effective realization and application of information and communication technology.”
Terms used together in text

– KITH: Helsevesen (health service), hefte (booklet), informasjonssikkerhet (information security), Håndbok (handbook), standard, pasientjournaler (patient records), evt (possibly), Minimum, utarbeiding (preparation)

What does this say about KITH?

26

Keyphrase Extraction

SLIDE 14

27

Ontologies

Definition of ontology:

– Description of entities or concepts and how they are related
– Conceptualization of some domain

Purpose:

– Semantic description of document collection
– Semantic interoperability
– Controlled vocabulary for document retrieval

Approaches:

– Conceptual modeling
– Document analysis and text mining
– Standardization work

28

Ontology Example 1

Construct ontological model from STATOIL intranet text collection (T. Brasethvik, NTNU)

[Figure labels: Intranet]

SLIDE 15

29

Ontology Example 2

ISO 15926 Integration of life-cycle data for oil and gas production facilities

Current status:

Production plants: 50,000 terms
Geometry and topology: 400 terms
Drilling and logging: 2,700 terms
Production: 2,000 terms
Safety and automation: 150 terms
Subsea equipment: 1,000 terms

30

Ontology Example 3

Ontology-driven information retrieval

SLIDE 16

31

Conclusions

Characteristics of document collections
Technologies for document and knowledge management:

– Document retrieval
– Text mining
– Ontologies

Details of technologies