Guiding People to Context Information: Providing an Interface to a Digital Library Using Reference as a Basis for Indexing


  1. Guiding People to Context Information: Providing an Interface to a Digital Library Using Reference as a Basis for Indexing
     S. Bradshaw, A. Scheinkman & K. Hammond
     • Information Retrieval Systems and Machine Learning
       – ML techniques/algorithms used in IR
       – IR applied to ML, esp. CBR
     • User feedback in learning systems

     Plan for Discussion
     • Background on IR
     • ML in document classification
     • Citation indexing
       – CiteSeer, Rosetta
     • Indexing from the perspective of rhetorical theory
     • Practical & theoretical aspects of the underlying cognitive problem of textual indexing & classification

     Larger Issues Raised
     • Indexing and textual classification:
       – As specific to text-based knowledge systems
       – As relevant to automated knowledge acquisition in general
       – Implications as a cognitive model
     • CBR & texts; CBR “textuality”

  2. Automated Text Categorization
     • Task: assign a value (usually {0, 1}) to each entry in a decision matrix:

              d0    ...   dj    ...   dn
        c0    a00   ...   a0j   ...   a0n
        ...   ...   ...   ...   ...   ...
        ci    ai0   ...   aij   ...   ain
        ...   ...   ...   ...   ...   ...
        cm    am0   ...   amj   ...   amn

     • Categories are labels (no access to meaning)
     • Attribution is content-based (no metadata)
     • Constraints can differ wrt cardinalities of the assignment

     Automated Text Categorization
     • CPC vs DPC:
       – Category-pivoted categorization: one row at a time; used when categories are added dynamically
       – Document-pivoted categorization: one column at a time; used when documents are added over a long period of time
     • Assignment vs “relevance”
       – The latter is subjective
       – It is largely the same as the notion of relevance to an information need

     Document Categorization
     • Applications
       – Automatic indexing for IR using a controlled dictionary; usually performed by experts; indices = categories
       – Classified ads
       – Document filtering (e.g., Reuters/AP)
       – WSD (word sense disambiguation)
         • Word occurrence contexts = docs, senses = categories
         • Itself important as an indexing technique
       – Web-page categorization (e.g., Yahoo)

     Categorization & ML
     • Earliest efforts (’80s): knowledge engineering (manually building an expert system) using rules
       – Example: CONSTRUE (for Reuters)
       – Typical problem: the “knowledge acquisition bottleneck” (updating)
     • More recently (’90s): ML approach
       – Construct not the classifier, but the builder of classifiers
       – Variety of approaches (both inductive & lazy)
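The decision matrix above can be sketched directly; the category names, documents, and 0/1 assignments below are illustrative, not from the paper. The point is only that CPC fills a row at a time while DPC fills a column at a time:

```python
# Hypothetical decision matrix: rows are categories c_0..c_m, columns are
# documents d_0..d_n, and a[i][j] == 1 means d_j is filed under c_i.
categories = ["wheat", "corn", "trade"]          # c_0 .. c_m (made up)
documents = ["doc0", "doc1", "doc2", "doc3"]     # d_0 .. d_n (made up)

a = [
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [0, 1, 1, 0],
]

# Category-pivoted categorization (CPC): one ROW at a time —
# natural when categories are added dynamically.
docs_in_wheat = [documents[j] for j, v in enumerate(a[0]) if v == 1]

# Document-pivoted categorization (DPC): one COLUMN at a time —
# natural when documents arrive over a long period.
cats_of_doc1 = [categories[i] for i, row in enumerate(a) if row[1] == 1]

print(docs_in_wheat)  # ['doc0', 'doc3']
print(cats_of_doc1)   # ['corn', 'trade']
```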

  3. VSM (Vector-Space Model)
     • Vector of n weighted index terms: “bag of words”
       – More sophisticated approaches based on noun phrases
         • Linguistic vs. statistical notion of phrase
         • Results not conclusive
     • Standard model: tfidf (term frequency × inverse document frequency)
       – weight(tk, dj) = #(tk, dj) · log(|Tr| / #Tr(tk)), where #(tk, dj) is the number of occurrences of tk in dj and #Tr(tk) is the number of training documents containing tk
     • Assumptions:
       – The more often a term occurs in a document, the more representative it is
       – The more documents a term occurs in, the less discriminating it is
       – Is this always an appropriate model? Are “representative” and “significant” the same?

     VSM (Vector-Space Model)
     • Pre-processing:
       – Removal of function words
       – Stemming
     • “Distance”: based on the dot product of vectors
     • Dimensionality problem
       – IR cosine-matching scales well, but other learning algorithms used for classifier induction do not
       – DR: dimensionality reduction (also reduces overfitting)
     • Dimensionality reduction
       – Local: each category separately
         • Each dj has a different representation for each ci
         • In practice: subsets of dj’s original representation
       – Global: all categories are reduced in the same way
       – Bases: linear algebra, information theory
       – Feature extraction vs selection
         • New features are not a subset of the originals; not homogeneous with the originals: combine or transform them
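The tfidf weighting above can be sketched in a few lines. This is a minimal version of the slide's formula only: tokenization is a naive whitespace split, and the corpus is made up; a real system would also remove function words and stem, as the slide notes.

```python
# Minimal tfidf in the slide's form:
#   weight(t_k, d_j) = #(t_k, d_j) * log(|Tr| / #Tr(t_k))
# where #(t_k, d_j) is the raw count of t_k in d_j and #Tr(t_k) is the
# number of training documents containing t_k.
import math
from collections import Counter

def tfidf_weights(docs):
    n = len(docs)                               # |Tr|
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))                    # document frequency #Tr(t_k)
    weights = []
    for toks in tokenized:
        tf = Counter(toks)                      # term frequency #(t_k, d_j)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = ["wheat prices rise today",
        "wheat exports fall today",
        "stock prices fall today"]
w = tfidf_weights(docs)
# "wheat" occurs in 2 of 3 documents, so its idf factor is log(3/2);
# "today" occurs in every document and gets weight 0 — it discriminates
# nothing, matching the second assumption above.
```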

  4. VSM (Vector-Space Model)
     • Feature selection: TSR (term-space reduction); proven to diminish effectiveness the least
       – Document frequency (those terms which occur most frequently in the collection are most valuable)
         • Apparently contradicts the premise of tfidf that low-df terms are more informative
         • But the majority of words that occur in a corpus have extremely low df, so reduction by a factor of 10 will only prune these (which are probably insignificant within the documents they occur in as well)
       – Other techniques: information gain, chi-square, correlation coefficient, etc.
         • Some improvement
         • Complexity of these techniques obviates easy interpretation of why results are better

     VSM (Vector-Space Model)
     • Feature extraction: reparameterization
       – “Synthesized” rather than naturally occurring terms
       – A way of dealing with polysemy, homonymy and synonymy
     • Term clustering
       – Group words with pair-wise semantic relatedness, use their “centroid” as a term
       – One way: co-occurrence/co-absence
     • Latent semantic indexing
       – Combines original dimensions on the basis of co-occurrence
       – Capable of educing an underlying semantics not available from the original terms

     VSM (Vector-Space Model)
     • Example
       – A great number of terms each contribute a small amount to the whole
       – Category: “Demographic shifts in the U.S. with economic impact”; text: “The nation grew to 249.6 million in the 1980s as more Americans left the industrial and agricultural heartlands for the South and West”
     • Problems
       – Sometimes the new terms are not readily interpretable
       – Could eliminate an original term which was significant
     • Synthetic production of indices: relation to the type of indexing done with citations

     Building the Classifier
     • Two phases
       – Definition of a mapping function
       – Definition of a threshold on the values returned by that function
     • Methods for building the mapping function
       – Parametric (training data used to estimate the parameters of a probability distribution)
       – Non-parametric
         • Profile-based (linear classifier): extract a vector from the training (pre-categorized) documents; use this profile to categorize documents in D according to RSV (retrieval status value)
         • Example-based: use the documents in the training set with the highest category status values as candidates for classifying documents in D
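The document-frequency selection described above can be sketched as follows. This is a toy illustration under made-up data: rank terms by df and keep only the most frequent slice, pruning the long tail of very rare terms.

```python
# TSR by document frequency: keep roughly the top `keep_fraction` of
# terms ranked by df, drop the rest from every document.
from collections import Counter

def reduce_term_space(docs, keep_fraction=0.1):
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))     # document frequency
    ranked = [t for t, _ in df.most_common()]   # most frequent first
    keep = max(1, int(len(ranked) * keep_fraction))
    vocab = set(ranked[:keep])
    return [[t for t in doc.lower().split() if t in vocab] for doc in docs]

docs = ["wheat prices rise", "wheat exports fall", "wheat prices fall"]
reduced = reduce_term_space(docs, keep_fraction=0.2)
# Only "wheat" (df = 3) survives; the low-df tail is pruned — the point
# of the slide's argument about reduction by a factor of 10.
```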

  5. Building the Classifier
     • Parametric
       – “Naïve Bayes”
       – Cannot use feature selection (needs the full term space)
       – As in most Bayesian learning, assumes that the features are independent
       – It has been shown to work well
     • Profile
       – Embodies an explicit/declarative representation of the category
       – Incremental or batch (on the training docs)
       – Most common batch method: Rocchio (wkj: weight of term tk in document dj; wki: weight of tk for classifying dj as ci):

         wki = β · Σ_{dj : ca_ij = 1} wkj / |{dj | ca_ij = 1}|
             + γ · Σ_{dj : ca_ij = 0} wkj / |{dj | ca_ij = 0}|

         with β + γ = 1, β ≥ 0, γ ≤ 0

     • Profile-based, cont’d
       – In general, rewards closeness to the positive centroid and distance from the negative centroid
       – Produces understandable classifiers (amenable to human tuning)
       – However, since it is linearly averaged, there are only two subspaces (n-spheres), so this risks excluding most of the positive training examples

     • Example-based (EBL) classifiers
       – Not explicit or declarative
       – Use k-NN: look at the k training documents most similar to dj to see if they have been classified under ci; a threshold value determines the decision:

         CSVi(dj) = Σ_{dz ∈ TRk(dj)} RSV(dj, dz) · ca_iz

       – TRk(dj): the set of k documents dz for which RSV(dj, dz) is maximum; the ca values come from the correct decision matrix
       – RSV is some measure of semantic relatedness: could be probabilistic or vector-based (cosine)
       – Does not subdivide the space into only two subspaces
       – Efficient: O(|Tr|)
       – One variant uses 1, −1 instead of 1, 0 for ca

     • Lam & Ho: attempts to combine the profile- & example-based approaches
       – The k-NN algorithm is given generalized instances (GIs) instead of training documents:
         • Cluster the positive instances of category ci into {cli1 … cliki}
         • Extract a profile from each cluster with a linear classifier
         • Apply k-NN to these
       – Avoids the sensitivity of k-NN to noise, but exploits its superiority over linear classifiers
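The Rocchio formula above can be sketched as follows. This is a minimal illustration, not the paper's implementation: documents are dicts of term weights (e.g. tfidf), the β/γ values and the tiny training set are made up, and RSV is a plain dot product.

```python
# Rocchio profile for category c_i, following the slide's formula:
#   w_ki = beta * (mean weight of t_k over positive examples)
#        + gamma * (mean weight of t_k over negative examples)
# with beta + gamma = 1, beta >= 0, gamma <= 0.
def rocchio_profile(pos, neg, beta=1.5, gamma=-0.5):
    terms = set().union(*pos, *neg)
    profile = {}
    for t in terms:
        p = sum(d.get(t, 0.0) for d in pos) / len(pos)   # positive centroid
        n = sum(d.get(t, 0.0) for d in neg) / len(neg)   # negative centroid
        profile[t] = beta * p + gamma * n
    return profile

def rsv(profile, doc):
    # Retrieval status value as a dot product of profile and document;
    # a threshold on this value would make the final yes/no decision.
    return sum(w * doc.get(t, 0.0) for t, w in profile.items())

# Made-up pre-categorized training documents (term -> weight):
pos = [{"wheat": 2.0, "farm": 1.0}, {"wheat": 1.0}]
neg = [{"stock": 2.0, "wheat": 0.5}]
profile = rocchio_profile(pos, neg)
# profile["wheat"] = 1.5 * 1.5 + (-0.5) * 0.5 = 2.0
# profile["stock"] = 1.5 * 0.0 + (-0.5) * 2.0 = -1.0
```

As the slide notes, the resulting profile is a readable linear classifier: positive-centroid terms get positive weights, negative-centroid terms get negative ones, so a human can inspect and tune it.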
