Document Clustering for Mediated Information Access: The WebCluster Project

Document Clustering for Mediated Information Access – The WebCluster Project –
Gheorghe Muresan
School of Communication, Information and Library Sciences, Rutgers University

The original WebCluster project was conducted at the Robert Gordon University, Aberdeen, UK. It was supervised by Prof. David J. Harper and sponsored by Ubilab, Zurich. Current work is being conducted in collaboration with Ph.D. student Hyuk-Jin Lee and Prof. Nicholas J. Belkin.

Exploratory Search Interfaces: Categorization, Clustering and Beyond
Workshop at HCIL 2005, University of Maryland, June 2, 2004
Gheorghe Muresan, SCILS, Rutgers University

WebCluster - Motivation
[Diagram labels: information need (within some subject domain), domain, query, search engine, WWW]
Gulfs:
  – information need ↔ query
  – structured subject domain ↔ unstructured target collection (WWW)

Interaction in the library
[Diagram labels: information need, information need formulation]
1. Select library
2. Consult catalog
3. Browse shelves
4. Use inter-library scheme

Can we simulate the library interaction?
[Diagram labels: information need, information need formulation, structured source collections, results]
1. Select source collection
2. Explore source collection with ClusterBook
3. Search WWW

The mediated access interaction
[Diagram labels: information need, specialised source, WebCluster, topical documents, query, Web search engine, target collection (WWW)]

Interaction model vs. prototype
• Structuring the source collection
  - Document clustering
  - Supervised classification
  - Manual (intellectual) classification
• Exploring the structured source collection
  - Metaphor – library, book, encyclopaedia
  - Visualization tool – folder metaphor, hyperbolic tree, themescape, cone trees, thematic maps
  - Search strategies supported – best match or cluster-based searching, browsing

Model vs. prototype
• Interaction model
  - Explicit (the user marks relevant documents) vs. implicit (cues on relevance are derived from user behavior/actions)
  - Transparent (the user is aware) vs. opaque (the user is happy to see the effect of ‘magic’)
  - Automatic vs. manual/intellectual generation of the mediated query
• Query model
  - Language models (generative, Kullback-Leibler)
  - Probabilistic models
  - Rocchio or other RF-specific formulae
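The slide lists Rocchio among the candidate formulae for generating the mediated query. As a minimal sketch of that option only, assuming sparse {term: weight} vectors and textbook default coefficients (the function name and the alpha, beta, gamma values are illustrative, not taken from WebCluster), the classic update could look like this:

```python
from collections import defaultdict


def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Classic Rocchio update over sparse {term: weight} vectors.

    The coefficient defaults are common textbook values, not values
    reported for WebCluster.
    """
    new_query = defaultdict(float)
    for term, weight in query.items():
        new_query[term] += alpha * weight
    # Add the centroid of relevant documents, subtract that of non-relevant ones.
    for docs, signed_coeff in ((relevant, beta), (nonrelevant, -gamma)):
        if not docs:
            continue
        for doc in docs:
            for term, weight in doc.items():
                new_query[term] += signed_coeff * weight / len(docs)
    # Terms that end up with a non-positive weight are dropped from the query.
    return {t: w for t, w in new_query.items() if w > 0}
```

In the explicit scenario the relevant set would be the documents (or the cluster) the user marked; in the implicit scenario it would have to be inferred from user behavior.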

ClusterBook - Source collection

ClusterBook - Target collection

Informal experiments - Objectives
• Test the users’ reaction to the mediated access concept
• Test user satisfaction with the functionality of the system and the relevance of the documents retrieved
• Formative usability testing: some volunteers were not only experienced searchers, but also had experience in evaluating IR systems
• Comparison of user-generated queries vs. system-generated queries
• Note: these experiments were run at different stages of the development

Informal experiments - Experimental procedure
• Subjects received an introduction to the system
• Task assigned: “You are a trainee in a newspaper. You support the journalists by providing information for the topic of their articles.”
• Sample topics:
  - The history of the Brazilian debt crisis
  - How are the quotas for growing coffee set and controlled on a world-wide basis?
• Source collection: a sub-collection of Reuters (newspaper articles)
• Steps followed by users (explicit scenario):
  - Formulate a query and record it
  - Browse the source collection, select the ‘best’ cluster, edit the query generated by the system, submit it to the search engine
  - Submit the initial, self-generated query to the same search engine
  - Compare the results of the two searches

Informal experiments - Results
• Users found the mediation useful for unfamiliar topics
• The system nearly always proposed new, good query terms
• Users were not always good at recognizing ‘good’ query terms
• The system proposed bad query terms (not specific to the topic)
  ⇒ the opaque scenario is not viable unless the query formulation is improved
• The two-step process was questioned when:
  - the query formulation was considered easy, for a familiar topic
  - the documents of the source collection were considered sufficient to cover the information need
• Complete link, group average – OK; single link – bad
• Overall, the system is usable
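The three clustering methods compared on this slide differ only in how the similarity between two clusters is derived from document-to-document similarities. The sketch below illustrates that difference; the function name is hypothetical, `sim` is any pairwise similarity supplied by the caller, and "average" here means the mean over cross-cluster pairs, which is one common reading of group average.

```python
from itertools import product


def cluster_similarity(cluster_a, cluster_b, sim, method="average"):
    """Similarity between two clusters under a given linkage criterion."""
    scores = [sim(x, y) for x, y in product(cluster_a, cluster_b)]
    if method == "single":
        # Single link: the most similar cross-cluster pair decides.
        return max(scores)
    if method == "complete":
        # Complete link: the least similar cross-cluster pair decides.
        return min(scores)
    if method == "average":
        # Group average (cross-cluster variant): mean over all pairs.
        return sum(scores) / len(scores)
    raise ValueError(f"unknown linkage method: {method}")
```

Because single link needs only one close pair to merge two clusters, it is known to produce long, loosely connected chains, which may be related to the poor impression reported above.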

Consequences of informal experiments
• Formal experiments are needed to verify the main assumptions:
  - The Cluster Hypothesis holds for a specialized collection
  - Good clusters can be found with the search strategies provided
  - Mediated queries can improve retrieval effectiveness
• The effect on retrieval performance of various parameters should be compared
  - Weighting schemes
  - Clustering methods
  - Search strategies

Critical issue: The label generation
• Document representatives
  - searching
• Cluster representatives
  - browsing
  - searching
  - mediation
• Collection representatives
  - collection selection
[Figure: example cluster hierarchy for a wind energy collection; recoverable node labels include Wind Energy, Power Generation, Propulsion, Fixed Plants, Portable Generators, Coastal Wind Farms, Inland Wind Farms, ...]
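The slides mention Kullback-Leibler language models for building representatives but do not spell out the labelling formula. One plausible sketch, offered purely as an assumption, ranks each term by its contribution to the KL divergence between the cluster's term distribution and the collection's. The function name and the tokenized-list input format are hypothetical, and the cluster is assumed to be a subset of the collection so that every cluster term has a non-zero collection probability.

```python
import math
from collections import Counter


def kl_label_terms(cluster_docs, collection_docs, k=3):
    """Pick k label terms by their contribution to KL(cluster || collection).

    Each document is a list of tokens. Assumes cluster_docs is a subset of
    collection_docs, so collection probabilities are never zero.
    """
    cluster = Counter(t for doc in cluster_docs for t in doc)
    collection = Counter(t for doc in collection_docs for t in doc)
    n_cluster = sum(cluster.values())
    n_collection = sum(collection.values())
    scores = {}
    for term, freq in cluster.items():
        p = freq / n_cluster                  # P(term | cluster)
        q = collection[term] / n_collection   # P(term | collection)
        scores[term] = p * math.log(p / q)
    # Highest contribution first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

Terms that are frequent everywhere get a low or negative score, which addresses the "not specific to the topic" complaint from the informal experiments.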

Mediation experiment - simulations
• Objectives:
  - Test the potential of mediation to increase retrieval effectiveness
  - Test the effect on performance of a variety of parameters
[Diagram labels: source collection, search engine, target collection; three query sources compared: cluster-based mediation (realistic mediation), topic-based mediator (upperbound), simple query generator (baseline)]

Experimental setup
• Interactive track of TREC-8
  - Offers relevance judgments for complex topics, with a multitude of aspects
  - Offers the experimental design for the user experiment
  - Six topics with 12 to 56 aspects each
  - Target collection: FT 1991-4, with 210,158 articles
  - Source collection built based on relevance judgments: half of the relevant documents, their nearest neighbors, plus the documents judged non-relevant

Results – the cluster hypothesis
• Aspectual cluster hypothesis confirmed by an extended version of the van Rijsbergen – Sparck Jones separation test
  - Similarity between pairs of docs covering the same aspect is higher than between pairs of docs covering the same topic, which is higher than between arbitrary pairs of docs in the collection
• Consequence confirmed: clustering groups documents in pockets of relevance
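The ordering stated above can be checked on a judged collection by comparing average pairwise similarities at the three levels. The sketch below is a simplification that compares means only, not the full separation test; the document layout (dicts with `vec`, `topic`, and `aspects` keys) and all function names are assumptions made for illustration.

```python
from itertools import combinations


def mean_similarity(docs, sim, same=None):
    """Mean pairwise similarity over document pairs accepted by `same`.

    docs: list of dicts with keys 'vec', 'topic' and 'aspects' (a set).
    same: optional predicate on a pair; None means every pair counts.
    """
    pairs = [(a, b) for a, b in combinations(docs, 2)
             if same is None or same(a, b)]
    if not pairs:
        return 0.0
    return sum(sim(a["vec"], b["vec"]) for a, b in pairs) / len(pairs)


def same_topic(a, b):
    return a["topic"] == b["topic"]


def same_aspect(a, b):
    # Two documents share an aspect if they share a topic and an aspect label.
    return same_topic(a, b) and bool(a["aspects"] & b["aspects"])
```

The aspectual hypothesis predicts `mean_similarity(docs, sim, same_aspect) > mean_similarity(docs, sim, same_topic) > mean_similarity(docs, sim)`.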

Results – retrieval effectiveness
• Tf-Idf > KL > RelFreq as weighting schemes for document representation
• Adding disambiguation terms to the query increases recall, but decreases precision
• Nearest-neighbor mediation (“more like this”) highly significantly improves both recall and precision, even if just one exemplary document is offered for each topic aspect
• Cosine and Dice perform similarly
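For reference, the two similarity measures compared on this slide can be sketched over sparse term-weight vectors as follows, together with a simple "more like this" ranking. This uses the standard vector-space definitions; the function names and the {doc_id: vector} collection layout are assumptions, and the weights could come from any of the weighting schemes listed above.

```python
import math


def dot(a, b):
    return sum(w * b[t] for t, w in a.items() if t in b)


def cosine(a, b):
    """Cosine coefficient: dot product normalized by the vector lengths."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


def dice(a, b):
    """Dice coefficient for weighted vectors: 2 a.b / (|a|^2 + |b|^2)."""
    denom = dot(a, a) + dot(b, b)
    return 2.0 * dot(a, b) / denom if denom else 0.0


def more_like_this(example, collection, sim=cosine, k=10):
    """Rank the document ids in `collection` by similarity to one example."""
    ranked = sorted(collection, key=lambda d: (-sim(example, collection[d]), d))
    return ranked[:k]
```

Both coefficients equal 1 for identical vectors and 0 for vectors with no shared terms, so similar rankings are not surprising.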

Mediation results
• Upperbound experiment (all relevant docs known in source)
  - Both recall and precision increase with query length
  - Query term weights strongly affect performance
  - No evidence that uniformity of term frequency affects performance
• Clustered source mediation
  - Best cluster mediation increases P, decreases R
  - “Fuse and search” – strong increase in R and P
  - “Search and fuse” – good R, terrible P!
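The slide does not define the two fusion strategies. One plausible reading, stated here as an assumption, is that "fuse and search" merges the queries derived from several clusters into one query and searches once, while "search and fuse" runs one search per cluster query and merges the result lists. The sketch below follows that reading; `search` is a hypothetical callable returning (doc_id, score) pairs, and summing scores is just one possible merging rule.

```python
from collections import defaultdict


def fuse_and_search(cluster_queries, search):
    """Merge the per-cluster queries into one query, then search once."""
    fused = defaultdict(float)
    for query in cluster_queries:
        for term, weight in query.items():
            fused[term] += weight
    return search(dict(fused))


def search_and_fuse(cluster_queries, search):
    """Search once per cluster query, then merge the result lists."""
    scores = defaultdict(float)
    for query in cluster_queries:
        for doc_id, score in search(query):
            scores[doc_id] += score
    # Best total score first; ties broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

With a purely linear scorer and no result cutoff the two strategies return the same ranking; with a real search engine (score normalization, top-k truncation) they can diverge, and merging many independent result lists is one way low-precision documents could enter the final list.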

User experiment – effectiveness of mediated information retrieval for Web searches
Experimental design: query formulation (between subjects) × result presentation (within subjects)

Result presentation     Unaided query            Mediated query
Linear (list)           Baseline                 Source-based mediation
Structured (cluster)    On-the-fly clustering    Source & target-based mediation

User experiment – no mediation

User experiment – mediated access

Contributions of WebCluster
• Proposes and explores system-based mediated access to very large heterogeneous document collections
• Explores the use of clustering for capturing the topical, semantic structure of a problem domain (as represented by a specialized collection)
• Explores the use of language models for building cluster and document representatives
• Offers a framework for building structured portals on the WWW
• Offers a framework for building collaborative environments
