SLIDE 1

Gheorghe Muresan SCILS, Rutgers University

Document Clustering for Mediated Information Access – The WebCluster Project –

Gheorghe Muresan School of Communication, Information and Library Sciences Rutgers University

The original WebCluster project was conducted at the Robert Gordon University, Aberdeen, UK. It was supervised by Prof. David J. Harper and sponsored by Ubilab, Zurich. Current work is being conducted in collaboration with Ph.D. student Hyuk-Jin Lee and Prof. Nicholas J. Belkin.

Exploratory Search Interfaces: Categorization, Clustering and Beyond Workshop at HCIL 2005, University of Maryland, June 2, 2004

SLIDE 2

WebCluster - Motivation

[Diagram labels: Information Need; Query; Search engine (within some subject domain); WWW_SearchEngine; Domain]

  • Gulfs
    – information need ↔ query
    – structured subject domain ↔ unstructured target collection (WWW)

SLIDE 3

Interaction in the library

  • 1. Select library
  • 2. Consult catalog
  • 3. Browse shelves
  • 4. Use inter-library scheme

[Diagram labels: Information need; Information Need Formulation]

SLIDE 4

Can we simulate the library interaction?

Structured source collections

  • 1. Select source collection
  • 2. Explore source collection with ClusterBook
  • 3. Search WWW

[Diagram labels: Information need; Information Need Formulation; Results]

SLIDE 5

The mediated access interaction

[Diagram labels: Information need; Query; WebCluster; Specialised source; Web search engine; Target collection (WWW); Topical documents]

SLIDE 6

Interaction model vs. prototype

Structuring the source collection

  • Document clustering
  • Supervised classification
  • Manual (intellectual) classification

Exploring the structured source collection

  • Metaphor – Library, book, encyclopaedia
  • Visualization tool – Folder metaphor, hyperbolic tree, themescape, cone trees, thematic maps
  • Search strategies supported – Best match or cluster-based searching, browsing

SLIDE 7

Model vs. prototype

Interaction model

  • Explicit (the user marks relevant documents) vs. implicit (cues on relevance are derived based on user behavior/actions)
  • Transparent (the user is aware) vs. opaque (the user is happy to see the effect of ‘magic’)
  • Automatic vs. manual/intellectual generation of the mediated query

Query model

  • Language models (generative, Kullback-Leibler)
  • Probabilistic models
  • Rocchio or other RF-specific formulae
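
The slide names Rocchio as one candidate query model. As a rough illustration only, here is a minimal sketch of the classic Rocchio relevance-feedback update over sparse term vectors. The function names and the `alpha`, `beta`, `gamma` defaults are assumptions chosen for the example; the talk does not say which variant or parameters WebCluster used.

```python
from collections import defaultdict


def centroid(docs):
    """Average a list of {term: weight} vectors; an empty list gives {}."""
    total = defaultdict(float)
    for doc in docs:
        for term, weight in doc.items():
            total[term] += weight
    return {t: w / len(docs) for t, w in total.items()} if docs else {}


def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Classic update: q' = alpha*q + beta*mean(relevant) - gamma*mean(non_relevant).

    The default coefficients are illustrative assumptions, not values from the talk.
    """
    rel_c = centroid(relevant)
    non_c = centroid(non_relevant)
    updated = {}
    for term in set(query) | set(rel_c) | set(non_c):
        weight = (alpha * query.get(term, 0.0)
                  + beta * rel_c.get(term, 0.0)
                  - gamma * non_c.get(term, 0.0))
        if weight > 0:  # terms with non-positive weight are conventionally dropped
            updated[term] = weight
    return updated
```

In the explicit scenario the `relevant` list would hold the documents the user marked; in a cluster-based mediation it could hold the documents of the selected cluster.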

SLIDE 8

ClusterBook - Source collection

SLIDE 9

ClusterBook - Target collection

SLIDE 10

Informal experiments

  • Objectives -
    – Test the users’ reaction to the mediated access concept
    – Test the user satisfaction regarding the functionality of the system, and the relevance of the documents retrieved
    – Formative usability testing - some volunteers were not only experienced searchers, but also had experience in evaluating IR systems
    – Comparison of user generated queries vs. system generated queries
  • Note. These experiments were run at different stages of the development

SLIDE 11

Informal experiments

  • Experimental procedure -
  • Subjects received an introduction to the system
  • Task assigned: “You are a trainee in a newspaper. You support the journalists by providing information for the topic of their articles.”
  • Sample topics:
    – The history of the Brazilian debt crisis
    – How are the quotas for growing coffee set and controlled on a world-wide basis?
  • Source collection: a sub-collection of Reuters (newspaper articles)
  • Steps followed by users (explicit scenario):
    – Formulate a query and record it
    – Browse source collection, select ‘best’ cluster, edit query generated by system, submit it to the search engine
    – Submit to the same search engine the initial, self-generated query
    – Compare results of the two searches

SLIDE 12

Informal experiments

  • Results -
  • Users found the mediation useful for unfamiliar topics
  • The system nearly always proposed new, good query terms
  • Users were not always good at recognizing ‘good’ query terms
  • The system also proposed bad query terms (not specific to the topic)
    ⇒ the opaque scenario is not viable unless the query formulation is improved
  • The two-step process was questioned when:
    – the query formulation was considered easy, for a familiar topic
    – the documents of the source collection were considered sufficient to cover the information need
  • Clustering methods: complete link and group average were OK; single link was bad
  • Overall, the system is usable
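
The results above compare complete link, group average and single link. For readers unfamiliar with the three criteria, the following is a naive sketch of agglomerative clustering over a precomputed similarity matrix, where only the rule for scoring a pair of clusters changes. It is an illustration under simplifying assumptions (cubic-time search, hypothetical function name `agglomerate`), not the implementation used in WebCluster.

```python
def agglomerate(sim, n_clusters, linkage="complete"):
    """Merge clusters bottom-up until n_clusters remain.

    sim is a symmetric matrix of pairwise document similarities.
    Cluster-to-cluster similarity is the max (single link), min (complete
    link) or mean (group average) of the member-pair similarities.
    """
    combine = {
        "single": max,
        "complete": min,
        "average": lambda values: sum(values) / len(values),
    }[linkage]
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > max(n_clusters, 1):
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pair_sims = [sim[i][j] for i in clusters[a] for j in clusters[b]]
                score = combine(pair_sims)
                if best is None or score > best[0]:
                    best = (score, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the most similar pair
        del clusters[b]
    return clusters


# Toy matrix: documents 0-1-2 form a chain, document 3 is close only to 2.
sim = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.8, 0.1],
    [0.1, 0.8, 1.0, 0.5],
    [0.0, 0.1, 0.5, 1.0],
]
print(agglomerate(sim, 2, "single"))    # chains 0, 1 and 2 together
print(agglomerate(sim, 2, "complete"))  # keeps two tighter pairs
```

On this toy matrix single link absorbs document 2 into the first cluster because of one strong pairwise link, a chaining behavior often cited as a weakness of single link; whether that explains the informal finding here is not stated in the talk.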

SLIDE 13

Consequences of informal experiments

Formal experiments are needed to verify the main assumptions:

  • The Cluster Hypothesis holds for a specialized collection
  • Good clusters can be found with the search strategies provided
  • Mediated queries can improve retrieval effectiveness

The effect on retrieval performance of various parameters should be compared:

  • Weighting schemes
  • Clustering methods
  • Search strategies

SLIDE 14

Critical issue: The label generation

[Figure: example cluster hierarchy; node labels: Wind Energy; Power Generation; Propulsion; Fixed Plants; Portable Generators; Coastal Wind Farms; Inland Wind Farms; Pacific Rim Wind Farms; Design of Coastal Wind Farms; Design of ….; Desert Wind Farms; Wind generators for yachts]

  • Document representatives
    – searching
  • Cluster representatives
    – browsing
    – searching
    – mediation
  • Collection representatives
    – collection selection

SLIDE 15

Mediation experiment - simulations

Objectives:

  • Test the potential of mediation to increase retrieval effectiveness
  • Test the effect on performance of a variety of parameters

[Diagram labels: Source collection; Target collection; Search engine (shown twice); Simple query generator (baseline); Topic-based mediator (upper bound); Cluster-based mediation (realistic mediation)]

SLIDE 16

Experimental setup

Interactive track of TREC-8

  • Offers relevance judgments for complex topics, with a multitude of aspects
  • Offers the experimental design for the user experiment
  • Six topics with 12 to 56 aspects each
  • Target collection: FT 1991-4, with 210,158 articles
  • Source collection built based on relevance judgments: half of the relevant documents, their nearest neighbors, plus the documents judged non-relevant

SLIDE 17

Results – the cluster hypothesis

Aspectual cluster hypothesis confirmed by an extended version of the van Rijsbergen – Sparck Jones separation test

  • Similarity between pairs of docs covering the same aspect is higher than between pairs of docs covering the same topic, which is higher than between pairs of docs in the collection

Consequence confirmed: clustering groups documents in pockets of relevance
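
The ordering claimed above (same aspect > same topic > whole collection) can be checked in a much simplified form by comparing mean pairwise similarities. The sketch below is only that: it assumes one aspect label and one topic label per document and compares means, whereas the extended separation test used in the study is not specified on the slide. All names are hypothetical.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_pair_similarity(docs, labels=None):
    """Mean cosine over document pairs; with labels, only pairs sharing a label."""
    sims = [
        cosine(docs[i], docs[j])
        for i, j in combinations(range(len(docs)), 2)
        if labels is None or labels[i] == labels[j]
    ]
    return sum(sims) / len(sims) if sims else 0.0


# Toy data: three documents on a wind-energy topic (two share an aspect), one unrelated.
docs = [
    {"wind": 1.0, "farm": 1.0},
    {"wind": 1.0, "farm": 1.0},
    {"wind": 1.0, "yacht": 1.0},
    {"debt": 1.0},
]
aspects = ["farms", "farms", "yachts", "crisis"]
topics = ["wind", "wind", "wind", "debt"]

same_aspect = mean_pair_similarity(docs, aspects)
same_topic = mean_pair_similarity(docs, topics)
collection = mean_pair_similarity(docs)
print(same_aspect > same_topic > collection)
```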

SLIDE 18

Results – retrieval effectiveness

  • Tf-Idf > KL > RelFreq as weighting schemes for document representation
  • Adding disambiguation terms to the query increases recall, but decreases precision
  • Nearest-neighbor mediation (“more like this”) highly significantly improves both recall and precision, even if just one exemplary document is offered for each topic aspect
  • Cosine and Dice perform similarly
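
To make the last two results concrete, here is a small sketch of the cosine and (weighted) Dice coefficients and of a "more like this" ranking built on either one. The vector representation, the function names and the toy data are assumptions for illustration; the slide does not describe the actual implementation.

```python
import math


def dot(a, b):
    """Inner product of two sparse {term: weight} vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def cosine(a, b):
    denom = math.sqrt(dot(a, a) * dot(b, b))
    return dot(a, b) / denom if denom else 0.0


def dice(a, b):
    """Dice coefficient generalized to weighted vectors."""
    denom = dot(a, a) + dot(b, b)
    return 2.0 * dot(a, b) / denom if denom else 0.0


def more_like_this(example, collection, sim=cosine, k=3):
    """Return the ids of the k documents most similar to the exemplary document."""
    ranked = sorted(collection, key=lambda doc_id: sim(example, collection[doc_id]),
                    reverse=True)
    return ranked[:k]


collection = {
    "d1": {"wind": 2.0, "farm": 1.0},
    "d2": {"wind": 1.0, "yacht": 1.0},
    "d3": {"debt": 1.0},
}
example = {"wind": 1.0, "farm": 1.0}
print(more_like_this(example, collection, sim=cosine, k=2))
print(more_like_this(example, collection, sim=dice, k=2))
```

On this toy collection both measures give the same ranking, which matches the slide's observation only in spirit; the numbers here carry no experimental weight.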

SLIDE 19

Mediation results

Upper bound experiment (all relevant docs known in source)

  • Both recall and precision increase with query length
  • Query term weights strongly affect performance
  • No evidence that uniformity of term frequency affects performance

Clustered source mediation

  • Best cluster mediation increases P, decreases R
  • “Fuse and search” – strong increase in R and P
  • “Search and fuse” – good R, terrible P!

SLIDE 20

User experiment – effectiveness of mediated information retrieval for Web searches

[Table: experimental design. Query formulation (between subjects): Unaided; Mediated. Result presentation (within subjects): Linear (list); Structured (cluster). Conditions: Baseline; On the fly clustering; Source-based mediation; Source & target-based mediation]

SLIDE 21

User experiment – no mediation

SLIDE 22

User experiment – mediated access

SLIDE 23

User experiment – mediated access

SLIDE 24

Contributions of WebCluster

  • Proposes and explores system-based mediated access to very large heterogeneous document collections
  • Explores the use of clustering for capturing the topical, semantic structure of a problem domain (as represented by a specialized collection)
  • Explores the use of language models for building cluster and document representatives
  • Offers a framework for building structured portals on the WWW
  • Offers a framework for building collaborative environments

SLIDE 25

WebCluster - Other applications

CD-ROM based collections

  • structured access to the collection itself
  • mediated access to WWW (via CD-ROM)

Mediated access (portals) via hierarchically structured information sources

  • Examples are: via large structured report (e.g. government reports), via structured collection of information (e.g. encyclopaedia), via intranet

Multimedia information access

  • cluster multimedia source, e.g. annotated photographs
  • mediated access to other photographs (not annotated)

SLIDE 26

Other directions for WebCluster

  • Clustered vs. categorized source collection
  • Language model-based labels vs. specialized terminology based on a domain ontology / thesaurus
  • Different interaction and visualization metaphors
    – Spring-embedded algorithms for 2D representation of clusters
  • Various inter-document similarities (faceted?)
  • User profiles / personalization
    – Change of interest over time

SLIDE 27

Topic representation

What (weighted) terms best describe a topic?

  • Applications:
    – Clustering – generating cluster representatives
    – Mediation – generating mediated queries
  • Machine generation
    – Simulation based on test topics and relevance judgments
    – Use various weighting formulae and cut-off points
    – Which representations are more effective? What do they have in common / specific?
  • Human (manual / intellectual generation)
    – Compare queries generated by searchers in TREC tasks
      – Effectiveness
      – Keyword vs. natural language queries
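
The "weighting formulae and cut-off points" step can be pictured as: score every candidate term, keep the best ones above a threshold, and emit them as a query. A minimal sketch follows; the function names, the `max_terms` and `min_weight` parameters and the example weights are all hypothetical and stand in for whatever weighting scheme is being simulated.

```python
def mediated_query(term_weights, max_terms=10, min_weight=0.0):
    """Keep the highest-weighted terms above a cut-off, best first."""
    kept = [(term, w) for term, w in term_weights.items() if w > min_weight]
    kept.sort(key=lambda tw: (-tw[1], tw[0]))  # weight descending, ties alphabetical
    return kept[:max_terms]


def as_keyword_query(weighted_terms):
    """Render the selected terms as a plain keyword query string."""
    return " ".join(term for term, _ in weighted_terms)


# Made-up weights for a topic about coffee quotas.
weights = {"coffee": 0.4, "quota": 0.3, "export": 0.1, "the": 0.0}
top = mediated_query(weights, max_terms=2)
print(as_keyword_query(top))
```

Varying `max_terms` and `min_weight` is one simple way to study how query length and cut-off choice affect retrieval in a simulation.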

SLIDE 28

Questions ?

SLIDE 29

Query formulation problems

  • Vague information need
  • Vocabulary mismatch
  • Difficulty of query language syntax
  • Lack of context, ambiguity of terms
  • Lack of a search strategy
  • No understanding of the underlying indexing/searching model
  • Note. TREC experiments have shown that the quality of the query has a higher impact on retrieval effectiveness than weighting schemes or search algorithms.

SLIDE 30

Role of structure

[Figure: example concept hierarchy; node labels: Science; Physics; Computing, Mathematics; Computing; Mathematics; Computer; Programming language; Screen; Keyboard; C++; Pascal; Algebra; ...]

  • Reveals the semantic structure of the domain & its concepts
  • Groups (semantically ?) similar documents
  • Supports exploration and concept formation
  • Supports term disambiguation (context)
  • (Has potential for efficient retrieval)
  • (Has potential for effective retrieval)

SLIDE 31

Browsing label (relative cluster representative)

[Figure: example cluster hierarchy; node labels: Wind Energy; Power Generation; Propulsion; Fixed Plants; Portable Generators; Coastal Wind Farms; Inland Wind Farms; Pacific Rim Wind Farms; Design of Coastal Wind Farms; Design of ….; Desert Wind Farms; Wind generators for yachts]

$$R_i = KL_i(\text{cluster}, \text{parent}) = p_{i,\text{cluster}} \log \frac{p_{i,\text{cluster}}}{p_{i,\text{parent}}}$$
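
Reading the formula on this slide as R_i = p_{i,cluster} · log(p_{i,cluster} / p_{i,parent}), each term is scored by its contribution to the Kullback-Leibler divergence between the cluster's language model and its parent's. The sketch below assumes maximum-likelihood unigram estimates and assumes the reference set contains the cluster's documents (so no smoothing is needed); neither detail is stated in the talk, and the function names are hypothetical.

```python
import math
from collections import Counter


def term_probs(docs):
    """Maximum-likelihood unigram model over a list of tokenized documents."""
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def kl_term_weights(cluster_docs, reference_docs):
    """Per-term weight p_cluster * log(p_cluster / p_reference).

    reference_docs is assumed to include cluster_docs, so every cluster
    term has a non-zero reference probability.
    """
    p_cluster = term_probs(cluster_docs)
    p_reference = term_probs(reference_docs)
    return {t: p * math.log(p / p_reference[t]) for t, p in p_cluster.items()}


cluster = [["wind", "farm", "coastal"]]
parent = cluster + [["wind", "farm", "inland"]]
weights = kl_term_weights(cluster, parent)
print(max(weights, key=weights.get))  # the term that best separates cluster from parent
```

Passing the parent cluster as the reference gives a relative (browsing) label; passing the whole collection instead would give the absolute (searching) label defined on the next slide.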

SLIDE 32

Searching label (absolute cluster representative)

[Figure: example cluster hierarchy; node labels: Wind Energy; Power Generation; Propulsion; Fixed Plants; Portable Generators; Coastal Wind Farms; Inland Wind Farms; Pacific Rim Wind Farms; Design of Coastal Wind Farms; Design of ….; Desert Wind Farms; Wind generators for yachts]

$$A_i = KL_i(\text{cluster}, \text{collection}) = p_{i,\text{cluster}} \log \frac{p_{i,\text{cluster}}}{p_{i,\text{collection}}}$$

SLIDE 33

Mediation label (Expanded cluster representative)

[Figure: example cluster hierarchy; node labels: Wind Energy; Power Generation; Propulsion; Fixed Plants; Portable Generators; Coastal Wind Farms; Inland Wind Farms; Pacific Rim Wind Farms; Design of Coastal Wind Farms; Design of ….; Desert Wind Farms; Wind generators for yachts]

$$E_i = (1-\omega) \cdot A_{i,0} + (1-\omega) \cdot \omega \cdot A_{i,1} + (1-\omega) \cdot \omega^{2} \cdot A_{i,2} + \dots + (1-\omega) \cdot \omega^{r-1} \cdot A_{i,r-1} + \omega^{r} \cdot A_{i,r}$$
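
Reading the formula above as E_i = (1−ω)·A_{i,0} + (1−ω)·ω·A_{i,1} + ... + (1−ω)·ω^{r−1}·A_{i,r−1} + ω^r·A_{i,r}, the expanded representative is a geometrically weighted mix of the absolute weights A at r+1 levels, with coefficients that sum to 1. The sketch below assumes level 0 is the cluster itself and later levels lie further along the hierarchy path; that interpretation, the function name and the default ω are assumptions, since the slide's first index did not survive extraction.

```python
def expanded_weights(level_weights, omega=0.5):
    """Combine per-level term weights A_0 .. A_r into expanded weights E.

    level_weights: list of {term: weight} dicts; index 0 is assumed to be
    the cluster itself and index r the last level on the path.
    Level k gets coefficient (1 - omega) * omega**k, except the last
    level, which gets omega**r.
    """
    r = len(level_weights) - 1
    expanded = {}
    for k, weights in enumerate(level_weights):
        coeff = omega ** k if k == r else (1.0 - omega) * omega ** k
        for term, a in weights.items():
            expanded[term] = expanded.get(term, 0.0) + coeff * a
    return expanded


# Toy path of three levels with made-up weights.
levels = [{"yacht": 1.0}, {"wind": 1.0}, {"energy": 1.0}]
print(expanded_weights(levels, omega=0.5))
```

A small ω keeps the mediated label close to the cluster's own terms, while a larger ω pulls in more context from the other levels.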

SLIDE 34

Topic model representations

[Diagram labels: Exemplary representation; Statistical representation; Keyword representation; Statistical analysis; Language model; Context analysis; Typical terms, weighted; Thresholding; Mediated query]

SLIDE 35

The cluster hypothesis

Reminder: the original cluster hypothesis

  • “Closely associated documents tend to be relevant to the same requests” (van Rijsbergen)

Aspectual cluster hypothesis: Highly similar documents tend to be relevant to the same topic. However, documents relevant to the same topic may be quite dissimilar if they cover distinct aspects of the topic.

  • Consequence: Clustering algorithms tend to group together documents that cover highly focused topics, or aspects of complex topics. Documents covering distinct aspects of complex topics tend to be spread over the cluster structure.

SLIDE 36

Aspects of relevance in the mediated access process

SLIDE 37

Distribution of relevant documents in clusters

[Chart: Clustering vs. Random. Axis labels: Clusters; Recall (0% to 25%). Series: Clustering; Random]

SLIDE 38

WebCluster scenario#1

[Diagram labels: WebCluster, with items c0, c4, c5, c2, c1, c3; Web Search Engine; WWW, with items c’0, c’3, c’2, c’5. Legend: Document from the source collection; Document from the target collection (WWW)]

  • Name
    – Transparent mediated access
  • Targeted users
    – Experienced searchers
  • Specific
    – The users are aware of the mediation process, of the separation between the source and target collections
    – The users have the option to edit the query generated (proposed) by the system. They understand the indexing / searching model.

SLIDE 39

WebCluster scenario#2

[Diagram labels: WebCluster, with items c0, c4, c5, c2, c1, c3; Web Search Engine; WWW, with items c’0, c’3, c’2, c’5. Legend: Document from the source collection; Document from the target collection (WWW)]

  • Name
    – Opaque mediated access
  • Targeted users
    – Naive / beginner searchers
  • Specific
    – The users explore the structure of the domain, which contains sample documents, and have the option of asking for similar documents
    – The users are unaware of the mediation process - the query generation and target search are not visible

SLIDE 40

Initial user interface (Java AWT)