Franz Kurfess: Knowledge Retrieval
Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A.
Franz J. Kurfess
Knowledge Retrieval
1 Tuesday, May 5, 2009
Some of the material in these slides was developed for a lecture series sponsored by the European Community under the BPD program with Vilnius University as host institution
These slides are primarily intended for the students in classes I teach. In some cases, I
fkurfess@calpoly.edu. I hereby grant permission to use them in educational settings. If you do so, it would be nice to send me an email about it. If you’re considering using them in a commercial environment, please contact me first.
❖Finding Out About
❖Keywords and Queries; Documents; Indexing
❖Data Retrieval
❖Access via Address, Field, Name
❖Information Retrieval
❖Access via Content (Values); Parsing; Matching Against
❖Knowledge Retrieval
❖Access via Structure; Meaning; Context; Usage
❖Knowledge Discovery
❖Data Mining; Rule Extraction
❖Keywords ❖Queries ❖Documents ❖Indexing
[Belew 2000]
❖linguistic atoms used to characterize the subject
❖words
❖pieces of words (stems)
❖phrases
❖provide the basis for a match between
❖the user’s characterization of information need
❖the contents of the document
❖problems
❖ambiguity
[Belew 2000]
❖formulated in a query language
❖natural language
❖interaction with human information providers
❖artificial language
❖interaction with computers
❖especially search engines
❖vocabulary
❖controlled
❖limited set of keywords may be used
❖uncontrolled
❖any keywords may be used
❖syntax
[Belew 2000]
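A controlled vocabulary can be illustrated with a simple membership check. This is only a sketch: the vocabulary contents and the function name `split_query` are assumptions made for illustration and do not come from the slides.

```python
# Hypothetical controlled vocabulary; a real system would load it
# from a thesaurus or a list of subject headings.
CONTROLLED_VOCABULARY = {"retrieval", "index", "query", "document"}

def split_query(query, vocabulary=CONTROLLED_VOCABULARY):
    """Separate the keywords of a query into accepted and rejected terms."""
    terms = query.lower().split()
    accepted = [t for t in terms if t in vocabulary]
    rejected = [t for t in terms if t not in vocabulary]
    return accepted, rejected

accepted, rejected = split_query("Document retrieval speed")
```

With an uncontrolled vocabulary, the check would be skipped and every term would be passed on to the matching step.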
❖general interpretation
❖any document that can be represented digitally
❖text, image, music, video, program, etc.
❖practical interpretation
❖passage of text
❖strings of characters in an alphabet
❖written natural language
❖length may vary
❖longer documents may be composed of shorter ones
❖describes the suitability of a document as
❖assumptions
❖all documents have equal aboutness
❖the probability of any document in a corpus to be considered relevant is equal for all documents
❖simplistic; not valid in reality
❖a paragraph is the smallest unit of text with appreciable
[Belew 2000]
❖documents may be composed of documents
❖paragraphs, subsections, sections, chapters, parts
❖footnotes, references
❖documents may contain meta-data
❖information about the document
❖not part of the content of the document itself
❖may be used for organization and retrieval purposes
❖can be abused by creators
❖usually to increase the perceived relevance
❖surrogates for the real document
❖abridged representations
❖catalog, abstract
❖pointers
❖bibliographical citation, URL
❖different media
❖microfiches
❖digital representations
❖a vocabulary of keywords is assigned to all
❖an index maps each document doci to the set of
❖indexing of a document / corpus
❖manual: humans select appropriate keywords
❖automatic: a computer program selects the keywords
❖building the index relation between documents
[Belew 2000]
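The index relation between documents and keywords can be sketched with plain dictionaries. The document identifiers, keywords, and the function name `invert` are hypothetical; the slides do not prescribe a data structure.

```python
from collections import defaultdict

# Hypothetical index: each document id maps to its set of keywords.
index = {
    "doc1": {"retrieval", "index"},
    "doc2": {"retrieval", "query"},
    "doc3": {"ontology"},
}

def invert(index):
    """Build the inverse relation: keyword -> set of document ids."""
    inverted = defaultdict(set)
    for doc_id, keywords in index.items():
        for keyword in keywords:
            inverted[keyword].add(doc_id)
    return dict(inverted)

inverted_index = invert(index)
```

The inverted form is the one usually consulted at query time, since it answers "which documents contain this keyword" with a single lookup.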
❖access to specific data items
❖access via address, field, name
❖typically used in data bases
❖user asks for items with specific features
❖absence or presence of features
❖values
❖system returns data items
❖no irrelevant items
❖deterministic retrieval method
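Deterministic access via field values can be sketched as an exact-match filter. The records and the helper `select` are made up for illustration; a real database would use its own query language.

```python
# Hypothetical records, as they might be stored in a database table.
records = [
    {"id": 1, "name": "alpha", "year": 2008},
    {"id": 2, "name": "beta", "year": 2009},
    {"id": 3, "name": "gamma", "year": 2009},
]

def select(records, **conditions):
    """Return every record whose fields equal all the requested values."""
    return [
        r for r in records
        if all(r.get(field) == value for field, value in conditions.items())
    ]

hits = select(records, year=2009)
no_hits = select(records, name="alpha", year=2009)
```

Every returned record satisfies the conditions exactly, which is why no irrelevant items appear in the result.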
❖access to documents
❖also referred to as document retrieval
❖access via keywords
❖IR aspects
❖parsing
❖matching against indices
❖retrieval assessment
❖extraction of lexical features from documents
❖mostly words
❖may require some manipulation of the extracted
❖e.g. stemming of words
❖used as the basis for automatic compilation of
[Belew 2000]
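A minimal sketch of lexical feature extraction with stemming follows. The suffix list is an assumption chosen for illustration; it is a naive suffix stripper, not the Porter stemmer or any tagger named in these slides.

```python
import re

# Illustrative suffix list only; deliberately much simpler than a real stemmer.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip one common suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def extract_features(text):
    """Lowercase the text, split it into words, and stem each word."""
    words = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(w) for w in words]

features = extract_features("Indexing indexed documents")
```

After stemming, different surface forms of a word collapse into one feature, which is what makes them usable as index entries.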
❖Montytagger http://web.media.mit.edu/~hugo/
❖Python and Java
❖fnTBL (C++) http://nlp.cs.jhu.edu/~rflorian/fntbl/
❖fast
❖Brill Tagger (C) http://www.cs.jhu.edu/~brill/
❖the original; influenced several later ones
❖Natural Language Toolkit: http://
❖good starting point for basics of NLP algorithms
❖identification of documents that are relevant for a
❖keywords of the query are compared against the
❖either in the data or meta-data of the document
❖in addition to queries, other features of
❖descriptive features provided by the author or cataloger
❖usually meta-data
❖derived features computed from the contents of the
[Belew 2000]
❖interpretation of the index matrix
❖relates documents and keywords
❖can grow extremely large
❖binary matrix of 100,000 words × 1,000,000 documents
❖sparsely populated: most entries will be 0
❖can be used to determine similarity of documents
❖overlap in keywords
❖proximity in the (virtual) vector space
❖associative memories can be used as hardware
[Belew 2000]
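Because the index matrix is sparse, it can be stored as one keyword set per document, and similarity can be computed from keyword overlap. The sketch below uses cosine similarity for binary vectors; the documents and the choice of cosine as the proximity measure are assumptions.

```python
import math

# Sparse storage of a binary index matrix: only the non-zero entries
# (the keywords present in each document) are kept.
doc_keywords = {
    "doc1": {"retrieval", "index", "query"},
    "doc2": {"retrieval", "index", "corpus"},
    "doc3": {"ontology", "thesaurus"},
}

def cosine_similarity(a, b):
    """Cosine of the angle between two binary keyword vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

sim_12 = cosine_similarity(doc_keywords["doc1"], doc_keywords["doc2"])
sim_13 = cosine_similarity(doc_keywords["doc1"], doc_keywords["doc3"])
```

For binary vectors the dot product is the number of shared keywords, so overlap and vector-space proximity are closely related here.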
❖ideally, all relevant documents should be
❖relative to the query posed by the user
❖relative to the set of documents available (corpus)
❖relevance can be subjective
❖precision and recall
❖relevant documents vs. retrieved documents
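Precision and recall follow directly from the sets of relevant and retrieved documents. The document ids below are hypothetical, and the relevance judgments are assumed to be given.

```python
def precision_recall(retrieved, relevant):
    """Precision: share of retrieved documents that are relevant.
    Recall: share of relevant documents that were retrieved."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {"doc1", "doc2", "doc3", "doc4"}
relevant = {"doc2", "doc4", "doc5"}
precision, recall = precision_recall(retrieved, relevant)
```

In this example two of the four retrieved documents are relevant, and two of the three relevant documents were found.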
❖subjective assessment
❖how well do the retrieved documents satisfy the request
❖objective assessment
❖idealized omniscient expert determines the quality of
[Belew 2000]
❖subjective assessment of retrieval results
❖often used to iteratively improve retrieval results
❖may be collected by the retrieval system for
❖can be viewed as a variant of object recognition
❖the object to be recognized is the prototypical document
❖this document may or may not exist
❖the difference between the retrieved document(s) and
[Belew 2000]
❖relevance feedback is used to move the query
❖moving away from bad documents does not necessarily
❖it can also be used as a filter for a constant
❖as in news channels or similar situations
[Belew 2000]
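One common way to move a query with relevance feedback is a Rocchio-style update. The slides do not name a specific method, so this is a substituted standard technique; the weights `alpha`, `beta`, `gamma` and the example vectors are assumptions.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return an updated query vector (term -> weight).

    The query is pulled toward the mean of the relevant documents and
    pushed away from the mean of the non-relevant ones.
    """
    terms = set(query)
    for doc in relevant + non_relevant:
        terms |= set(doc)
    updated = {}
    for term in terms:
        pos = sum(d.get(term, 0.0) for d in relevant) / len(relevant) if relevant else 0.0
        neg = sum(d.get(term, 0.0) for d in non_relevant) / len(non_relevant) if non_relevant else 0.0
        weight = alpha * query.get(term, 0.0) + beta * pos - gamma * neg
        if weight > 0:  # terms with non-positive weight are dropped
            updated[term] = weight
    return updated

query = {"jaguar": 1.0}
liked = [{"jaguar": 1.0, "cat": 1.0}]
disliked = [{"jaguar": 1.0, "car": 1.0}]
new_query = rocchio(query, liked, disliked)
```

The small `gamma` reflects the usual caution that negative feedback says little about where the good documents are.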
❖relevance feedback from multiple users
❖identifies documents that many users found useful or
❖used by some Web sites
❖related to collaborative filtering
❖can also be used as an evaluation method for search
❖performance criteria must be carefully considered
❖precision and recall, plus many others
[Belew 2000]
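Aggregating relevance feedback from multiple users can be sketched as a vote count. The feedback log, the user names, and the `min_votes` threshold are hypothetical.

```python
from collections import Counter

# Hypothetical feedback log: (user, document, judged_relevant).
feedback = [
    ("ann", "doc1", True),
    ("bob", "doc1", True),
    ("eve", "doc1", False),
    ("ann", "doc2", False),
    ("bob", "doc3", True),
]

def consensual_relevance(feedback, min_votes=2):
    """Return documents that at least `min_votes` users judged relevant."""
    votes = Counter(doc for _, doc, ok in feedback if ok)
    return {doc: n for doc, n in votes.items() if n >= min_votes}

consensus = consensual_relevance(feedback)
```

A production system would also have to weigh the number of users who saw a document at all, which this sketch ignores.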
[Figure labels: Query; Keywords (Term 1, Term 2, Term 3, Term 4); Index; Documents; Corpus]
❖Context
❖Usage
❖exploratory search
❖faceted search
❖in addition to
❖explicit links
❖hypertext
❖related concepts
❖thesaurus, ontology
❖proximity
❖spatial: place, directory
❖temporal: creation date/time
❖intermediate relations
❖author/creator
❖organization
❖project
❖determines relationships between documents
❖citations are explicit references to relevant
❖bibliographic references
❖legal citations
❖hypertext
❖examples
❖NEC CiteSeer http://citeseer.nj.nec.com
❖Google Scholar http://scholar.google.com
[Belew 2000, after Kochen 1975]
❖inter-document links provide explicit relationships
❖can be used to determine the relevance of a document
❖example:
❖intra-document links may offer additional context
❖footnotes, glossaries, related terms
❖fine-tuning the matching between queries and
❖learning of relationships between terms
❖training with term pairs (thesaurus)
❖pattern detection in past queries
❖automatic grouping of documents according to common features
❖clustering of similar documents
❖pre-defined categories
❖metadata
❖overlap in keywords
❖consensual relevance
❖source
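Clustering by overlap in keywords can be sketched with Jaccard similarity and a greedy single pass. The threshold, the documents, and the greedy strategy are assumptions; the slides do not specify a clustering algorithm.

```python
def jaccard(a, b):
    """Keyword overlap: size of the intersection over size of the union."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster(docs, threshold=0.5):
    """Greedy single-pass clustering: a document joins the first cluster
    whose seed document is similar enough, otherwise it starts a new one."""
    clusters = []  # list of (seed_keywords, [doc_ids])
    for doc_id, keywords in docs.items():
        for seed, members in clusters:
            if jaccard(seed, keywords) >= threshold:
                members.append(doc_id)
                break
        else:
            clusters.append((keywords, [doc_id]))
    return [members for _, members in clusters]

docs = {
    "doc1": {"retrieval", "index", "query"},
    "doc2": {"retrieval", "index"},
    "doc3": {"ontology", "thesaurus"},
}
groups = cluster(docs)
```

The result depends on the order in which documents are seen, which is the price of the single pass.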
❖query types (templates)
❖frequently used types of queries
❖e.g. problem/solution, symptoms/diagnosis, problem/further checks, ...
❖category types
❖abstractions of query types
❖used to determine categories or topics for the grouping
❖context information
❖current working document/directory
❖previous queries
[Pratt, Hearst, Fagan 2000]
❖individual terms are connected to related terms
❖thesaurus/ontology
❖synonyms, super-/sub-classes, related terms
❖identifies labels for the category types
[Pratt, Hearst, Fagan 2000]
❖categorizer
❖determines the categories to be selected for the
❖assigns retrieved documents to the categories
❖organizer
❖arranges categories into a hierarchy
❖should be balanced and easy to browse by the user
❖depends on the distribution of the search results
[Pratt, Hearst, Fagan 2000]
❖retrieved documents are grouped into
❖the categories are related to the query
❖the categories are related to each other
❖all categories have similar size
❖not always achievable due to the distribution of documents
❖reduced search times
❖higher user satisfaction
[Pratt, Hearst, Fagan 2000]
❖knowledge-based approach to the organization
❖categorizes results into meaningful groups that
❖uses knowledge of query types and of the domain
❖applied to the domain of medicine
❖MEDLINE is an on-line repository of medical abstracts
❖9.2 million bibliographic entries from 3800 journals
❖PubMed is a web-based search tool
❖returns titles as a relevance-ranked list
[DynaCat, 2000]
[Figure labels: Query; Keywords (Term 1 to Term 4); Index; Documents; Corpus; Ontology (Term A to Term M); keyword input; relation expansion; synonym expansion]
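The expansion steps named in the figure labels (keyword input, synonym expansion, relation expansion) might be sketched as follows. The miniature ontology, the function name `expand_query`, and the order in which the two expansions are applied are all assumptions.

```python
# Hypothetical miniature ontology; a real one would come from a thesaurus.
SYNONYMS = {"car": {"automobile"}}
RELATED = {"car": {"engine"}, "automobile": {"vehicle"}}

def expand_query(keywords):
    """Add synonyms of the input keywords, then terms directly related
    to any keyword or synonym."""
    expanded = set(keywords)
    for term in keywords:
        expanded |= SYNONYMS.get(term, set())
    for term in list(expanded):
        expanded |= RELATED.get(term, set())
    return expanded

expanded = expand_query({"car"})
```

The expanded set is then matched against the index in place of the original keywords, trading some precision for higher recall.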
❖finding knowledge through association
❖hypothesis: Human-made associations between
❖especially if the associations are made by experts or
❖What are current concepts, methods and tools
❖better knowledge management for scientific
❖build, maintain, and share paths through the document
❖see Vannevar Bush, “As We May Think”, Atlantic
❖exploration of a domain via attributes
❖select a relevant attribute, and display the elements of
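Exploring a collection via attributes can be sketched as grouping by one attribute and then drilling down. The items, attribute names, and the helper `facet` are hypothetical.

```python
from collections import defaultdict

# Hypothetical collection with two attributes (facets) per item.
items = [
    {"id": "doc1", "year": 2008, "format": "pdf"},
    {"id": "doc2", "year": 2009, "format": "pdf"},
    {"id": "doc3", "year": 2009, "format": "html"},
]

def facet(items, attribute):
    """Group item ids by the value they have for one attribute."""
    groups = defaultdict(list)
    for item in items:
        groups[item[attribute]].append(item["id"])
    return dict(groups)

by_year = facet(items, "year")
# Drill down: keep the 2009 items, then facet on another attribute.
by_format = facet([i for i in items if i["year"] == 2009], "format")
```

Each selection narrows the set of items, and the remaining attributes can be offered to the user as the next choices.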
❖What are tools or applications that employ
❖displaying lists of items ordered according to an
❖attributes often lend themselves to alternative
❖visual
❖static
❖color, size, shape
❖dynamic
❖movement, changes over time
❖auditory
❖often for supplementary information
❖combination of
❖Data Mining
❖Knowledge Extraction
❖Knowledge Fusion
❖identification of interesting “nuggets” in huge
❖often relations between subsets
❖automatic or semi-automatic
❖techniques
❖classification, correlation (e.g. temporal, spatial)
❖conversion of internal representations of
❖extraction of rules from neural networks is one example
❖multiple pieces of information are combined into
❖redundancy
❖do several pieces contain the same type of information
❖compatibility
❖do the individual pieces have similar formats and interpretations
❖are there mappings to convert values into the same format
❖consistency
❖are the values of the individual pieces close
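The compatibility and consistency checks can be illustrated with redundant temperature readings in different units. The scenario, the tolerance, and the function names are assumptions chosen only to make the three questions concrete.

```python
def to_celsius(value, unit):
    """Compatibility: map values into one common format."""
    if unit == "C":
        return value
    if unit == "F":
        return (value - 32) * 5 / 9
    raise ValueError(f"unknown unit: {unit}")

def fuse(readings, tolerance=1.0):
    """Average redundant readings if they are consistent, else return None."""
    values = [to_celsius(v, u) for v, u in readings]
    if max(values) - min(values) > tolerance:
        return None  # inconsistent pieces are not fused
    return sum(values) / len(values)

fused = fuse([(20.0, "C"), (68.0, "F")])
conflict = fuse([(20.0, "C"), (86.0, "F")])
```

Returning `None` for inconsistent input is one possible policy; another would be to keep both values and flag the conflict.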
❖identification, selection, and presentation of
❖utilization of structural information, context,
❖organized presentation of results
❖categories, visual arrangement
❖internal representations may be converted to