 
2. Text Mining

D-BSSE, Karsten Borgwardt, Data Mining II Course, Basel, Spring Semester 2016
Text Mining

Goals
- To learn key problems and techniques in mining one of the most common types of data
- To learn how to represent text numerically
- To learn how to make use of enormous amounts of unlabeled data
- To learn how to find co-occurring keywords in documents
2.1 Basics of Text Representation and Analysis

based on: Charu Aggarwal, Data Mining - The Textbook, Springer 2015, Chapter 13
What is text mining?

Definition: Text mining is the use of automated methods for exploiting the enormous amount of knowledge available in the (biomedical) literature.

Motivation: Most knowledge is stored as text, both in industry and in academia. This alone makes text mining an integral part of knowledge discovery! Furthermore, to make text machine-readable, one has to solve several recognition (mining) tasks on text.
Why text mining?

Text data is growing at an unprecedented rate:
- Digital libraries
- Web and Web-enabled applications (e.g. social networks)
- Newswire services
Text mining terminology

Important definitions
- A set of features of text is also referred to as a lexicon.
- A document can be viewed either as a sequence or as a multidimensional record.
- A collection of documents is referred to as a corpus.
Text mining terminology

Special characteristics of text data
- Very sparse
- Diverse length
- Nonnegative statistics
- Side information is often available, e.g. hyperlinks, meta-data
- Lots of unlabeled data
What is text mining?

Common tasks
- Information retrieval: Find documents that are relevant to a user, or to a query, in a collection of documents
  - Document ranking: rank all documents in the collection
  - Document selection: classify documents into relevant and irrelevant
- Information filtering: Search newly created documents for information that is relevant to a user
- Document classification: Assign a document to a category that describes its content
- Keyword co-occurrence: Find groups of keywords that co-occur in many documents
Evaluation of text mining

Precision and Recall

Let the set of documents that are relevant to a query be denoted as {Relevant} and the set of retrieved documents as {Retrieved}.

The precision is the fraction of retrieved documents that are relevant to the query:

$$\text{precision} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Retrieved}\}|} \tag{1}$$

The recall is the fraction of relevant documents that were retrieved by the query:

$$\text{recall} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Relevant}\}|} \tag{2}$$
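As an illustration of Equations (1) and (2), the following minimal sketch computes both measures from two sets of document identifiers. The function name `precision_recall` and the document IDs are hypothetical, and returning 0.0 for an empty set is a convention chosen here, not something the slides prescribe.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document IDs."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)  # |{Relevant} ∩ {Retrieved}|
    # Convention for this sketch: an empty denominator yields 0.0.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 relevant documents, 3 retrieved, 2 in common.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d7"}
p, r = precision_recall(relevant, retrieved)
print(f"precision = {p:.3f}, recall = {r:.3f}")  # precision = 0.667, recall = 0.500
```

Retrieving more documents can only keep recall equal or raise it, while precision may drop, which is why the two measures are usually reported together.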
Text representation

Tokenization
- Tokenization is the process of identifying keywords in a document.
- Not all words in a text are relevant.
- Text mining ignores stop words.
- Stop words form the stop list.
- Stop lists are context-dependent.
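A minimal sketch of tokenization with a stop list might look as follows. The stop list `STOP_WORDS` is a deliberately tiny, hypothetical example, and splitting on runs of ASCII letters is a simplifying assumption; real tokenizers and stop lists depend on the language and the application.

```python
import re

# Hypothetical stop list for illustration only; real lists are context-dependent.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def tokenize(text, stop_words=STOP_WORDS):
    """Lowercase the text, extract alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in stop_words]


print(tokenize("The mining of text is a key task in data mining."))
# ['mining', 'text', 'key', 'task', 'data', 'mining']
```

Passing a different `stop_words` set is how a context-dependent stop list would be plugged in under this design.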
Text representation

Vector space model
- Given #d documents and #t terms.
- Model each document as a vector v in a #t-dimensional space.

Weighted term-frequency matrix
- Matrix TF of size #d × #t
- Entries measure the association of term and document
- If a term t does not occur in a document d, then TF(d, t) = 0.
- If a term t does occur in a document d, then TF(d, t) > 0.
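The vector space model can be sketched by building a #d × #t count matrix from tokenized documents. The helper `build_count_matrix` and the toy corpus are hypothetical; the lexicon is sorted alphabetically only to make the column order reproducible.

```python
def build_count_matrix(docs):
    """Return (lexicon, matrix) with matrix[i][j] = count of lexicon[j] in docs[i]."""
    lexicon = sorted({term for doc in docs for term in doc})
    column = {term: j for j, term in enumerate(lexicon)}
    matrix = [[0] * len(lexicon) for _ in docs]
    for i, doc in enumerate(docs):
        for term in doc:
            matrix[i][column[term]] += 1
    return lexicon, matrix


# Hypothetical corpus of three already-tokenized documents.
docs = [
    ["gene", "protein", "gene"],
    ["protein", "cell"],
    ["cell", "cell", "membrane"],
]
lexicon, counts = build_count_matrix(docs)
print(lexicon)  # ['cell', 'gene', 'membrane', 'protein']
for row in counts:
    print(row)  # one row per document, zero where the term is absent
```

Each row is one document vector; most entries are zero, which reflects the sparsity of text data noted earlier.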
Text representation

Definitions of term frequency
- If term t occurs in document d, then TF(d, t) = 1.
- TF(d, t) = frequency of t in d, written freq(d, t).
- Frequency normalized over all terms T:

$$\mathrm{TF}(d,t) = \frac{\mathrm{freq}(d,t)}{\sum_{t' \in T} \mathrm{freq}(d,t')}$$

- Logarithmically damped frequency:

$$\mathrm{TF}(d,t) = \begin{cases} 1 + \log(\mathrm{freq}(d,t)) & \text{if } \mathrm{freq}(d,t) > 0, \\ 0 & \text{if } \mathrm{freq}(d,t) = 0. \end{cases}$$
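The four definitions can be compared side by side in a short sketch. The dictionary keys (`binary`, `raw`, `relative`, `log`) are labels invented here for readability, and the natural logarithm is an assumption, since the slides do not fix the base.

```python
import math
from collections import Counter


def term_frequencies(doc_tokens, term):
    """Compute the four TF variants for one term in one tokenized document."""
    freq = Counter(doc_tokens)
    f = freq[term]              # freq(d, t); Counter returns 0 for absent terms
    total = sum(freq.values())  # sum of freq(d, t') over all terms t'
    return {
        "binary": 1 if f > 0 else 0,
        "raw": f,
        "relative": f / total if total else 0.0,
        # Natural log is an assumption; the base is not specified in the slides.
        "log": 1 + math.log(f) if f > 0 else 0.0,
    }


doc = ["gene", "protein", "gene", "cell"]
print(term_frequencies(doc, "gene"))      # raw count 2, relative frequency 0.5
print(term_frequencies(doc, "membrane"))  # every variant is 0 for an absent term
```

The logarithmic variant grows slowly with the raw count, so a term that appears 100 times is not weighted 100 times more than a term that appears once.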
Text representation

Inverse document frequency

The inverse document frequency (IDF) represents the scaling factor, or importance, of a term. A term that appears in many documents is scaled down:

$$\mathrm{IDF}(t) = \log \frac{1 + |d|}{|d_t|}, \tag{3}$$

where |d| is the number of all documents, and |d_t| is the number of documents containing term t.
Text representation

TF-IDF measure

The TF-IDF measure is the product of term frequency and inverse document frequency:

$$\text{TF-IDF}(d,t) = \mathrm{TF}(d,t) \cdot \mathrm{IDF}(t). \tag{4}$$
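Equations (3) and (4) can be sketched in a few lines. This example assumes the raw count as TF(d, t) and the natural logarithm, neither of which the slides fix; the function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def idf(term, docs):
    """IDF(t) = log((1 + |d|) / |d_t|) as in Eq. (3); natural log assumed."""
    n_containing = sum(1 for doc in docs if term in doc)  # |d_t|
    if n_containing == 0:
        raise ValueError(f"term {term!r} occurs in no document")
    return math.log((1 + len(docs)) / n_containing)


def tf_idf(term, doc, docs):
    """TF-IDF(d, t) = TF(d, t) * IDF(t) as in Eq. (4), with the raw count as TF."""
    tf = Counter(doc)[term]
    if tf == 0:
        return 0.0  # TF(d, t) = 0 whenever t does not occur in d
    return tf * idf(term, docs)


# Hypothetical corpus of three tokenized documents.
docs = [
    ["gene", "protein", "gene"],
    ["protein", "cell"],
    ["cell", "cell", "membrane"],
]
print(round(idf("gene", docs), 3))  # log(4 / 1): "gene" is in one document
print(round(idf("cell", docs), 3))  # log(4 / 2): "cell" is in two documents
print(round(tf_idf("gene", docs[0], docs), 3))
```

The term "gene", which occurs in a single document, receives a larger IDF than "cell", which occurs in two, matching the statement that frequent terms are scaled down.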
Measuring similarity

Cosine measure

Let v_1 and v_2 be two document vectors. The cosine similarity is defined as

$$\mathrm{sim}(v_1, v_2) = \frac{v_1^\top v_2}{|v_1| \, |v_2|}. \tag{5}$$

Kernels

Depending on how we represent a document, there are many kernels available for measuring similarity of these representations:
- vectorial representation: vector kernels such as the linear, polynomial, and Gaussian RBF kernels,
- one long string: string kernels that count common k-mers in two strings.
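Both notions of similarity can be sketched briefly. `cosine_similarity` follows Equation (5); `kmer_kernel` is one simple way to count common k-mers (the inner product of k-mer count vectors, often called a spectrum kernel), chosen here as an assumption since the slides do not name a specific string kernel. Returning 0.0 for a zero vector is also a convention of this sketch.

```python
import math
from collections import Counter


def cosine_similarity(v1, v2):
    """Cosine of the angle between two equal-length document vectors, Eq. (5)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention for empty documents in this sketch
    return dot / (norm1 * norm2)


def kmer_kernel(s1, s2, k):
    """Inner product of the k-mer count vectors of two strings."""
    counts1 = Counter(s1[i:i + k] for i in range(len(s1) - k + 1))
    counts2 = Counter(s2[i:i + k] for i in range(len(s2) - k + 1))
    return sum(c * counts2[kmer] for kmer, c in counts1.items())


# Two hypothetical count vectors over a four-term lexicon.
print(round(cosine_similarity([0, 2, 0, 1], [1, 0, 0, 1]), 3))  # 0.316
print(kmer_kernel("abab", "abba", k=2))  # 3
```

Because the cosine measure divides by both vector lengths, a long and a short document with the same term proportions have similarity 1.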
2.2 Topic Modeling

based on: Charu Aggarwal, Data Mining - The Textbook, Springer 2015, Chapter 2.4.4.3 and 13.4
Topic Modeling

Definition: Topic modeling can be viewed as a probabilistic version of latent semantic analysis (LSA). Its most basic version is referred to as Probabilistic Latent Semantic Analysis (PLSA). It provides an alternative method for performing dimensionality reduction and has several advantages over LSA.
Topic Modeling: SVD on text

Latent Semantic Analysis
- Latent Semantic Analysis (LSA) is an application of SVD to the text domain.
- The goal is to retrieve a vectorial representation of terms and documents.
- The data matrix D is an n × d document-term matrix containing word frequencies in the n documents, where d is the size of the lexicon.
Topic Modeling: SVD on text

Latent Semantic Analysis

[Figure: truncated SVD of the n × d document-term matrix, $D \approx L_k \Delta_k R_k^\top$. $L_k$ is n × k (documents × topics), $\Delta_k$ is k × k (topics × topics, the importance of each topic), and $R_k^\top$ is k × d (topics × words, the k basis vectors of documents).]
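The factorization $D \approx L_k \Delta_k R_k^\top$ can be sketched with NumPy's `numpy.linalg.svd`. The helper name `lsa`, the toy matrix, and the choice k = 2 are assumptions made for illustration; no mean centering is applied.

```python
import numpy as np


def lsa(D, k):
    """Rank-k truncated SVD of an n x d document-term matrix (no mean centering).

    Returns L_k (n x k), delta_k (length k) and R_k (d x k) such that
    D is approximated by L_k @ diag(delta_k) @ R_k.T.
    """
    L, delta, Rt = np.linalg.svd(D, full_matrices=False)
    return L[:, :k], delta[:k], Rt[:k, :].T


# Hypothetical corpus: documents 0-1 use words 0-1, documents 2-3 use words 2-3.
D = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 1],
], dtype=float)

L_k, delta_k, R_k = lsa(D, k=2)
doc_coords = L_k * delta_k                 # k-dimensional document representation
D_approx = L_k @ np.diag(delta_k) @ R_k.T  # best rank-2 approximation of D

print(np.round(delta_k, 3))   # the two largest singular values
print(np.round(D_approx, 3))  # each 2 x 2 block is smoothed to 1.5
```

The rows of `doc_coords` equal `D @ R_k`, so a document is reduced by projecting its term vector onto the k basis vectors in `R_k`.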
Topic Modeling: Centering and sparsity

Latent Semantic Analysis
- No mean centering is used.
- The results are approximately the same as for PCA because of the sparsity of D:
  - The sparsity implies that most of the entries are zero, and that the mean is much smaller than the non-zero entries.
  - In such scenarios, it can be shown that the covariance matrix is approximately proportional to $D^\top D$.
- The sparsity of the data also results in a low intrinsic dimensionality.
- The dimensionality reduction effect of LSA is rather drastic: often, a corpus represented on a lexicon of 100,000 dimensions can be summarized in fewer than 300 dimensions.
- LSA is also a classic example of how the "loss" of information from discarding some dimensions can actually result in an improvement in the quality of the data representation.
Topic Modeling: Synonymy and polysemy

Latent Semantic Analysis
- Synonymy refers to the fact that two words have the same meaning, e.g. comical and hilarious.
- Polysemy refers to the fact that the same word can have several different meanings, e.g. jaguar.
- Typically the meaning of a word can be understood from its context, but term frequencies do not capture the context sufficiently, e.g. two documents, one containing the word comical and the other the word hilarious, may not be deemed sufficiently similar.
- The truncated representation after LSA typically removes the noise effects of synonymy and polysemy because the singular vectors represent the directions of correlation in the data.
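The synonymy effect can be illustrated on a tiny, hypothetical corpus in which comical and hilarious never co-occur but share the context word film. The lexicon, the documents, and the choice k = 2 are assumptions made for this sketch, and new documents are mapped into the reduced space by projecting onto the top right singular vectors. Polysemy is not covered by this example.

```python
import numpy as np

# Hypothetical lexicon: comical, hilarious, film, engine, wheel
D = np.array([
    [1, 0, 1, 0, 0],  # "comical film"
    [0, 1, 1, 0, 0],  # "hilarious film"
    [0, 0, 0, 1, 1],  # "engine wheel"
    [0, 0, 0, 1, 1],  # "engine wheel"
], dtype=float)


def cosine(u, v):
    """Cosine similarity of two non-zero NumPy vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Keep the k = 2 leading right singular vectors (no mean centering).
_, _, Rt = np.linalg.svd(D, full_matrices=False)
R_k = Rt[:2, :].T  # shape: lexicon size x k

q1 = np.array([1.0, 0.0, 0.0, 0.0, 0.0])  # a document containing only "comical"
q2 = np.array([0.0, 1.0, 0.0, 0.0, 0.0])  # a document containing only "hilarious"

raw_sim = cosine(q1, q2)               # no shared term in the original space
lsa_sim = cosine(q1 @ R_k, q2 @ R_k)   # similarity in the reduced space

print(f"raw cosine: {raw_sim:.3f}")
print(f"LSA cosine: {lsa_sim:.3f}")
```

In the original term space the two documents share no term, so their cosine similarity is zero. In the reduced space both load on the same direction, which combines comical, hilarious, and film, so they become nearly identical.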
Topic Modeling

Probabilistic Latent Semantic Analysis
- Probabilistic Latent Semantic Analysis (PLSA) is a probabilistic variant of LSA and SVD.
- It is an expectation-maximization based modeling algorithm.
- Its goal is to discover the correlation structure of the words, not the documents (or data objects).