
SLIDE 1
2. Text Mining

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 118 / 179

SLIDE 2

Text Mining

Goals

To learn key problems and techniques in mining one of the most common types of data
To learn how to represent text numerically
To learn how to make use of enormous amounts of unlabeled data
To learn how to find co-occurring keywords in documents

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 119 / 179

SLIDE 3

2.1 Basics of Text Representation and Analysis

based on: Charu Aggarwal, Data Mining - The Textbook, Springer 2015, Chapter 13

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 120 / 179

SLIDE 4

What is text mining?

Definition

Text mining is the use of automated methods for exploiting the enormous amount of knowledge available in the (biomedical) literature.

Motivation

Most knowledge is stored in the form of text, both in industry and in academia. This alone makes text mining an integral part of knowledge discovery! Furthermore, to make text machine-readable, one has to solve several recognition (mining) tasks on text.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 121 / 179

SLIDE 5

Why text mining?

Text data is growing in an unprecedented manner

Digital libraries
Web and Web-enabled applications (e.g. social networks)
Newswire services

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 122 / 179

SLIDE 6

Text mining terminology

Important definitions

A set of features of text is also referred to as a lexicon. A document can be viewed either as a sequence or as a multidimensional record. A collection of documents is referred to as a corpus.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 123 / 179

SLIDE 7

Text mining terminology

A number of special characteristics of text data

Very sparse
Diverse length
Nonnegative statistics
Side information is often available, e.g. hyperlinks, meta-data
Lots of unlabeled data

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 124 / 179

SLIDE 8

What is text mining?

Common tasks

Information retrieval: Find documents that are relevant to a user, or to a query in a collection of documents
  Document ranking: rank all documents in the collection
  Document selection: classify documents into relevant and irrelevant
Information filtering: Search newly created documents for information that is relevant to a user
Document classification: Assign a document to a category that describes its content
Keyword co-occurrence: Find groups of keywords that co-occur in many documents

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 125 / 179

SLIDE 9

Evaluating text mining

Precision and Recall

Let the set of documents that are relevant to a query be denoted as {Relevant} and the set of retrieved documents as {Retrieved}.

The precision is the percentage of retrieved documents that are relevant to the query:

precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|    (1)

The recall is the percentage of relevant documents that were retrieved by the query:

recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|    (2)

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 126 / 179
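For concreteness, the following minimal Python sketch (not part of the original slides) computes precision and recall as in Equations (1) and (2); the document identifiers are made up for illustration.

```python
def precision_recall(relevant, retrieved):
    """Precision and recall of a query result, following Eqs. (1) and (2)."""
    hits = len(relevant & retrieved)                      # |{Relevant} ∩ {Retrieved}|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example with arbitrary document ids
relevant = {"d1", "d3", "d4", "d7"}
retrieved = {"d1", "d2", "d3", "d9"}
print(precision_recall(relevant, retrieved))              # (0.5, 0.5)
```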

SLIDE 10

Text representation

Tokenization

Tokenization is the process of identifying keywords in a document. Not all words in a text are relevant; text mining ignores stop words. The set of stop words forms the stop list. Stop lists are context-dependent.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 127 / 179
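A minimal, illustrative sketch of tokenization with a stop list; the tiny stop list and the word-splitting regular expression are assumptions for this example, not part of the slides, and a real stop list would be chosen per context.

```python
import re

STOP_LIST = {"the", "a", "of", "and", "is", "in", "to"}   # toy, context-dependent stop list

def tokenize(text, stop_list=STOP_LIST):
    """Lower-case the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in stop_list]

print(tokenize("Text mining is the analysis of large collections of documents"))
# ['text', 'mining', 'analysis', 'large', 'collections', 'documents']
```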

SLIDE 11

Text representation

Vector space model

Given #d documents and #t terms, model each document as a vector v in a t-dimensional space.

Weighted term-frequency matrix

Matrix TF of size #d × #t
Entries measure the association of term and document
If a term t does not occur in a document d, then TF(d,t) = 0.
If a term t does occur in a document d, then TF(d,t) > 0.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 128 / 179

SLIDE 12

Text representation

Definitions of term frequency

If term t occurs in document d, then

TF(d,t) = 1
TF(d,t) = freq(d,t), the frequency of t in d
TF(d,t) = freq(d,t) / ∑_{t′∈T} freq(d,t′)
TF(d,t) = 1 + log(freq(d,t)) if freq(d,t) > 0, and TF(d,t) = 0 if freq(d,t) = 0

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 129 / 179
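The following sketch (illustrative only; the toy corpus is made up) builds the #d × #t frequency matrix freq(d,t) and derives the four TF variants listed above.

```python
import numpy as np

# Toy, already-tokenized corpus and its lexicon
docs = [["text", "mining", "mining"],
        ["graph", "mining"],
        ["text", "classification"]]
lexicon = sorted({t for doc in docs for t in doc})   # ['classification', 'graph', 'mining', 'text']

# Raw frequency matrix freq(d,t) of size #d x #t
freq = np.zeros((len(docs), len(lexicon)))
for i, doc in enumerate(docs):
    for t in doc:
        freq[i, lexicon.index(t)] += 1

tf_binary = (freq > 0).astype(float)                        # TF(d,t) = 1 if t occurs in d
tf_raw    = freq.copy()                                     # TF(d,t) = freq(d,t)
tf_norm   = freq / freq.sum(axis=1, keepdims=True)          # freq(d,t) / sum_t' freq(d,t')
tf_log    = np.where(freq > 0, 1.0 + np.log(np.maximum(freq, 1)), 0.0)   # 1 + log freq(d,t), else 0
```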

SLIDE 13

Text representation

Inverse document frequency

The inverse document frequency (IDF) represents the scaling factor, or importance, of a term. A term that appears in many documents is scaled down:

IDF(t) = log( (1 + |d|) / |d_t| ),    (3)

where |d| is the number of all documents, and |d_t| is the number of documents containing term t.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 130 / 179

SLIDE 14

Text representation

TF-IDF measure

The TF-IDF measure is the product of term frequency and inverse document frequency: TF-IDF(d,t) = TF(d,t) · IDF(t). (4)

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 131 / 179
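A short illustrative continuation of the previous sketch: IDF as in Equation (3) and the TF-IDF weights as in Equation (4), assuming the raw frequency matrix `freq` from the TF example.

```python
import numpy as np

def tf_idf(freq):
    """TF-IDF(d,t) = TF(d,t) * IDF(t), with TF the raw frequency and IDF as in Eq. (3)."""
    n_docs = freq.shape[0]                      # |d|: number of documents
    n_docs_with_t = (freq > 0).sum(axis=0)      # |d_t|: number of documents containing term t
    idf = np.log((1.0 + n_docs) / n_docs_with_t)
    return freq * idf                           # broadcasts IDF(t) over all documents

# weights = tf_idf(freq)   # using the toy matrix from the previous sketch
```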

SLIDE 15

Measuring similarity

Cosine measure

Let v1 and v2 be two document vectors. The cosine similarity is defined as

sim(v1,v2) = v1⊺v2 / (|v1| |v2|).    (5)

Kernels

Depending on how we represent a document, there are many kernels available for measuring similarity of these representations:

vectorial representation: vector kernels like the linear, polynomial, or Gaussian RBF kernel
one long string: string kernels that count common k-mers in two strings

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 132 / 179
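A minimal sketch of the cosine measure from Equation (5); the two document vectors are made up for illustration.

```python
import numpy as np

def cosine_similarity(v1, v2):
    """sim(v1, v2) = v1^T v2 / (|v1| |v2|), cf. Eq. (5)."""
    norm = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(v1 @ v2) / norm if norm > 0 else 0.0

v1 = np.array([0.0, 1.2, 0.7, 0.0])   # e.g. TF-IDF vectors of two documents
v2 = np.array([0.3, 0.9, 0.0, 0.0])
print(cosine_similarity(v1, v2))       # ~0.82
```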

SLIDE 16

2.2 Topic Modeling

based on: Charu Aggarwal, Data Mining - The Textbook, Springer 2015, Chapter 2.4.4.3 and 13.4

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 133 / 179

SLIDE 17

Topic Modeling

Definition

Topic modeling can be viewed as a probabilistic version of latent semantic analysis (LSA). Its most basic version is referred to as Probabilistic Latent Semantic Analysis (PLSA). It provides an alternative method for performing dimensionality reduction and has several advantages over LSA.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 134 / 179

SLIDE 18

Topic Modeling: SVD on text

Latent Semantic Analysis

Latent Semantic Analysis (LSA) is an application of SVD to the text domain. The goal is to retrieve a vectorial representation of terms and documents. The data matrix D is an n × d document-term matrix containing word frequencies in the n documents, where d is the size of the lexicon.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 135 / 179
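A minimal sketch of LSA as a truncated SVD of the document-term matrix, using plain numpy; the matrix D and the number of topics k are placeholders to be supplied by the caller.

```python
import numpy as np

def lsa(D, k):
    """Truncated SVD of the n x d document-term matrix D, keeping k topics."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    L_k  = U[:, :k]             # n x k: documents expressed over the topics
    S_k  = np.diag(s[:k])       # k x k: importance of each topic (singular values)
    R_kt = Vt[:k, :]            # k x d: topics expressed over the lexicon
    doc_embedding = L_k @ S_k   # reduced k-dimensional document representation
    return doc_embedding, R_kt

# doc_embedding, topics = lsa(D, k=2)   # D: n x d matrix of word frequencies
```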

SLIDE 19

Topic Modeling: SVD on text

Latent Semantic Analysis

[Figure: LSA as a rank-k factorization of the n × d document-term matrix D (documents × words) into Lk (n × k, documents × topics, the k document basis vectors), Δk (k × k diagonal matrix of topic importances), and Rk⊺ (k × d, topics × words).]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 136 / 179

SLIDE 20

Topic Modeling: Centering and sparsity

Latent Semantic Analysis

No mean centering is used. The results are approximately the same as for PCA because of the sparsity of D: The sparsity implies that most of the entries are zero, and that the mean is much smaller than the non-zero entries. In such scenarios, it can be shown that the covariance matrix is approximately proportional to D⊺D. The sparsity of the data also results in a low intrinsic dimensionality. The dimensionality reduction effect of LSA is rather drastic: Often, a corpus represented over a lexicon of 100,000 dimensions can be summarized in fewer than 300 dimensions.

LSA is also a classic example of how the "loss" of information from discarding some dimensions can actually result in an improvement in the quality of the data representation.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 137 / 179

SLIDE 21

Topic Modeling: Synonymy and polysemy

Latent Semantic Analysis

Synonymy refers to the fact that two words have the same meaning, e.g. comical and hilarious. Polysemy refers to the fact that the same word has two different meanings, e.g. jaguar. Typically the meaning of a word can be understood from its context, but term frequencies do not capture the context sufficiently: e.g. two documents containing the words comical and hilarious, respectively, may not be deemed sufficiently similar. The truncated representation after LSA typically removes the noise effects of synonymy and polysemy, because the singular vectors represent the directions of correlation in the data.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 138 / 179

SLIDE 22

Topic Modeling

Probabilistic Latent Semantic Analysis

Probabilistic Latent Semantic Analysis (PLSA) is a probabilistic variant of LSA and SVD. It is an expectation-maximization based modeling algorithm. Its goal is to discover the correlation structure of the words, not of the documents (or data objects).

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 139 / 179

SLIDE 23

Topic Modeling

Probabilistic Latent Semantic Analysis

[Figure: the same factorization with a probabilistic interpretation: the document-term matrix D = [P(doci, wordj)] is factorized into Lk = [P(doci | topicm)] (n × k), the k × k diagonal matrix Δk of prior probabilities P(topicm), and Rk⊺ = [P(wordj | topicm)] (k × d).]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 140 / 179

SLIDE 24

Topic Modeling

Probabilistic Latent Semantic Analysis

In PLSA, the generative process is inherently designed for dimensionality reduction rather than clustering, and different parts of the same document can be generated by different mixture components. It is assumed that there are k aspects (or latent topics) denoted by G1,...,Gk. The generative process builds the document-term matrix as follows:

1. Select a latent component (aspect) Gm with probability P(Gm).
2. Generate the indices (i,j) of a document-word pair (Di, wj) with probabilities P(Di|Gm) and P(wj|Gm), respectively, and increment the frequency of entry (i,j) in the document-term matrix by 1. The document and word indices are generated in an independent way.

All the parameters of this generative process, such as P(Gm),P(Di∣Gm) and P(wj∣Gm), need to be estimated from the observed frequencies in the n × d document-term matrix.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 141 / 179
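To make the generative process concrete, here is an illustrative sampler (not from the slides) that fills a document-term count matrix given the parameters P(Gm), P(Di|Gm) and P(wj|Gm); in practice these parameters are unknown and must be estimated, here they are assumed inputs.

```python
import numpy as np

def sample_document_term_matrix(p_g, p_d_given_g, p_w_given_g, n_pairs, seed=None):
    """Sample n_pairs document-word pairs according to the PLSA generative process.

    p_g:         (k,)   prior P(G_m)
    p_d_given_g: (n, k) column m holds P(D_i | G_m)
    p_w_given_g: (d, k) column m holds P(w_j | G_m)
    """
    rng = np.random.default_rng(seed)
    k = len(p_g)
    n, d = p_d_given_g.shape[0], p_w_given_g.shape[0]
    counts = np.zeros((n, d), dtype=int)
    for _ in range(n_pairs):
        m = rng.choice(k, p=p_g)                   # 1. select aspect G_m
        i = rng.choice(n, p=p_d_given_g[:, m])     # 2. generate document index i ...
        j = rng.choice(d, p=p_w_given_g[:, m])     #    ... and word index j independently
        counts[i, j] += 1                          # increment entry (i, j)
    return counts
```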

SLIDE 25

Topic Modeling

Probabilistic Latent Semantic Analysis

An important assumption in PLSA is that the selected documents and words are conditionally independent once the latent topical component Gm has been fixed:

P(Di,wj | Gm) = P(Di | Gm) P(wj | Gm)    (6)

This implies that the joint probability P(Di,wj) of selecting a document-word pair can be expressed in the following way:

P(Di,wj) = ∑_{m=1}^{k} P(Gm) P(Di,wj | Gm) = ∑_{m=1}^{k} P(Gm) P(Di | Gm) P(wj | Gm)    (7)

Local independence between documents and words does not imply global independence.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 142 / 179

SLIDE 26

Topic Modeling

Probabilistic Latent Semantic Analysis

In PLSA, the posterior probability P(Gm|Di,wj) of the latent component associated with a particular document-word pair is estimated. The EM algorithm starts by initializing P(Gm), P(Di|Gm) and P(wj|Gm) to 1/k, 1/n, and 1/d, respectively, where k is the number of aspects, n the number of documents, and d the number of words.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 143 / 179

SLIDE 27

Topic Modeling

Probabilistic Latent Semantic Analysis

The algorithm iteratively executes the following E- and M-steps to convergence:

1. (E-step) Estimate the posterior probability P(Gm|Di,wj) in terms of P(Gm), P(Di|Gm) and P(wj|Gm).

2. (M-step) Estimate P(Gm), P(Di|Gm) and P(wj|Gm) in terms of the posterior probability P(Gm|Di,wj) and the observed data about word-document co-occurrence, using log-likelihood maximization.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 144 / 179

SLIDE 28

Topic Modeling

Probabilistic Latent Semantic Analysis - E-step

The posterior probability estimated in the E-step can be expanded using Bayes' rule:

P(Gm | Di, wj) = P(Gm) P(Di,wj | Gm) / P(Di,wj)    (8)

Expanding the numerator via (6) and the denominator via (7), we obtain

P(Gm | Di, wj) = P(Gm) P(Di | Gm) P(wj | Gm) / ∑_{r=1}^{k} P(Gr) P(Di | Gr) P(wj | Gr)    (9)

This shows that the E-step can be implemented in terms of P(Gm), P(Di | Gm), and P(wj | Gm).

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 145 / 179

SLIDE 29

Topic Modeling

Probabilistic Latent Semantic Analysis - M-step

P(Gm | Di, wj) may be viewed as a weight attached to each word-document co-occurrence pair for aspect Gm. These weights can be used to estimate P(Gm), P(Di | Gm) and P(wj | Gm) via the following update rules (shown without proof):

P(Di | Gm) ∝ ∑_{wj} f(Di,wj) P(Gm | Di,wj)    ∀ i ∈ 1,...,n, m ∈ 1,...,k    (10)

P(wj | Gm) ∝ ∑_{Di} f(Di,wj) P(Gm | Di,wj)    ∀ j ∈ 1,...,d, m ∈ 1,...,k    (11)

P(Gm) ∝ ∑_{Di} ∑_{wj} f(Di,wj) P(Gm | Di,wj)    ∀ m ∈ 1,...,k    (12)

Here f(Di,wj) is the observed frequency of word wj in document Di.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 146 / 179
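The E-step (Eq. 9) and M-step (Eqs. 10-12) can be written compactly with numpy. The sketch below is one possible illustrative implementation; note that it perturbs the initialization randomly, since an exactly uniform initialization as described above would leave all aspects identical.

```python
import numpy as np

def plsa_em(F, k, n_iter=100, seed=0, eps=1e-12):
    """PLSA via EM on an n x d frequency matrix F; returns P(G), P(D|G), P(w|G)."""
    n, d = F.shape
    rng = np.random.default_rng(seed)
    p_g = np.full(k, 1.0 / k)                                  # P(G_m)
    p_d_g = rng.random((n, k)); p_d_g /= p_d_g.sum(axis=0)     # P(D_i | G_m), columns sum to 1
    p_w_g = rng.random((d, k)); p_w_g /= p_w_g.sum(axis=0)     # P(w_j | G_m), columns sum to 1

    for _ in range(n_iter):
        # E-step, Eq. (9): posterior P(G_m | D_i, w_j), shape (n, d, k)
        joint = p_g[None, None, :] * p_d_g[:, None, :] * p_w_g[None, :, :]
        post = joint / (joint.sum(axis=2, keepdims=True) + eps)

        # M-step, Eqs. (10)-(12): weight co-occurrences by the posterior, then normalize
        weighted = F[:, :, None] * post                        # f(D_i, w_j) * P(G_m | D_i, w_j)
        p_d_g = weighted.sum(axis=1)                           # Eq. (10): sum over words
        p_d_g /= p_d_g.sum(axis=0, keepdims=True) + eps
        p_w_g = weighted.sum(axis=0)                           # Eq. (11): sum over documents
        p_w_g /= p_w_g.sum(axis=0, keepdims=True) + eps
        p_g = weighted.sum(axis=(0, 1))                        # Eq. (12): sum over both
        p_g /= p_g.sum() + eps
    return p_g, p_d_g, p_w_g
```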

SLIDE 30

Topic Modeling

Probabilistic Latent Semantic Analysis - Dimensionality Reduction

The three key sets of parameters estimated by the M-step are P(Gm), P(Di|Gm) and P(wj|Gm). These sets of parameters provide an SVD-like matrix factorization of the n × d document-term matrix D. Assume that D is scaled so that its entries sum to an aggregate probability of 1. Then the (i,j)th entry of D can be viewed as an observed instantiation of the probabilistic quantity P(Di,wj). Let Lk be the n × k matrix whose (i,m)th entry is P(Di|Gm). Let Δk be the k × k diagonal matrix whose mth diagonal entry is P(Gm). Let Rk be the d × k matrix whose (j,m)th entry is P(wj|Gm).

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 147 / 179

SLIDE 31

Topic Modeling

Probabilistic Latent Semantic Analysis - Dimensionality Reduction

Then the (i,j)th entry P(Di,wj) of the matrix D can be expressed in terms of the entries of the aforementioned matrices according to (7), which is replicated here:

P(Di,wj) = ∑_{m=1}^{k} P(Gm) P(Di,wj | Gm) = ∑_{m=1}^{k} P(Gm) P(Di | Gm) P(wj | Gm)    (13)

The left hand side is equal to entry (i,j) of D. The right hand side is equal to entry (i,j) of Lk Δk Rk⊺.

For a limited number of components k, the right hand side can only approximate the matrix D; this approximation is denoted by Dk.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 148 / 179

SLIDE 32

Topic Modeling

Probabilistic Latent Semantic Analysis - Dimensionality Reduction

In matrix notation, we then have: Dk = Lk Δk Rk⊺.

The transformed representation in k-dimensional space is LkΔk. The transformed representations will differ between PLSA and LSA: LSA optimizes the mean-squared error, while PLSA maximizes the log-likelihood fit to a probabilistic generative model. Both representations capture synonymy and polysemy. In PLSA, unlike LSA, the columns of Rk are non-negative and have a clear probabilistic meaning: they allow one to infer the topical words of the corresponding aspects. In LSA, unlike PLSA, the transformation can be interpreted as a rotation of an orthonormal axis system, which can also be applied to out-of-sample documents.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 149 / 179
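Continuing the EM sketch above (illustrative only; variable names follow the earlier sketches), the estimated parameters can be assembled into the factorization Dk = LkΔkRk⊺, the k-dimensional document representation LkΔk, and the topical words per aspect.

```python
import numpy as np

# p_g, p_d_g, p_w_g as returned by plsa_em(); `lexicon` as in the TF sketch
L_k     = p_d_g                    # (i, m) entry: P(D_i | G_m)
Delta_k = np.diag(p_g)             # m-th diagonal entry: P(G_m)
R_k     = p_w_g                    # (j, m) entry: P(w_j | G_m)

D_k = L_k @ Delta_k @ R_k.T        # approximation of the (probability-scaled) matrix D
doc_embedding = L_k @ Delta_k      # transformed k-dimensional document representation

# Topical words of each aspect: largest entries in the corresponding column of R_k
top_words = [[lexicon[j] for j in np.argsort(R_k[:, m])[::-1][:5]]
             for m in range(R_k.shape[1])]
```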

SLIDE 33

Topic Modeling

Probabilistic Latent Semantic Analysis - Limitations

Although the PLSA method is an intuitively sound model for probabilistic modeling, it does have a number of practical drawbacks. The number of parameters grows linearly with the number of documents. Therefore, such an approach can be slow and may overfit the training data because of the large number of estimated parameters. Furthermore, while PLSA provides a generative model of document-word pairs in the training data, it cannot easily assign probabilities to previously unseen documents. In contrast, other EM models, such as Latent Dirichlet Allocation, transfer to unseen documents as well.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 150 / 179

SLIDE 34

2.3 Transduction

based on: Thorsten Joachims, Transductive Inference for Text Classification using Support Vector Machines. ICML 1999: 200-209; source of the figures in this section.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 151 / 179

SLIDE 35

Transduction

Known test set

Classification on text databases often means that we know all the data we will work with before training. Hence the test set is known a priori. This setting is called 'transductive'. Can we define classifiers that exploit the known test set? Yes!

Transductive SVM (Joachims, ICML 1999)

Trains the SVM on both training and test set
Uses the test data to maximise the margin

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 152 / 179

SLIDE 36

Transduction

Inductive vs. Transductive Classification

Task: predict label y from features x

Classic inductive setting

Strategy: Learn classifier on (labelled) training data
Goal: Classifier shall generalise to unseen data from same distribution

Transductive setting

Strategy: Learn classifier on (labelled) training data AND a given (unlabelled) test dataset
Goal: Predict class labels for this particular dataset

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 153 / 179

SLIDE 37

Transduction

Why transduction?

Classic approach works: train on the training dataset, test on the test dataset. That is what we usually do in practice, for instance in cross-validation. We usually ignore or neglect the fact that such settings are transductive.

The benefits of transductive classification

Inductive setting: infinitely many potential classifiers
Transductive setting: finite number of equivalence classes of classifiers
f and f′ are in the same equivalence class ⇔ f and f′ classify the points from the training and test dataset identically

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 154 / 179

SLIDE 38

Transductive SVM

Learning-theoretic argument

Risk on test data ≤ risk on training data + confidence interval (depends on the number of equivalence classes)
Theorem by Vapnik (1998): The larger the margin, the lower the number of equivalence classes that contain a classifier with this margin
Find the hyperplane that separates the classes in the training data AND in the test data with maximum margin.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 155 / 179

SLIDE 39

Transductive SVM

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 156 / 179

SLIDE 40

Transductive SVM

[Figure: example binary document-term matrix for documents D1-D6 over the terms 'salt', 'basil', 'parsley', 'atom', 'physics', 'nuclear'.]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 157 / 179

SLIDE 41

Transductive SVM

Linearly separable case

min_{w, b, y∗}   (1/2) ∥w∥²

s.t.   ∀ i = 1,...,n:   yi [w⊺xi + b] ≥ 1
       ∀ j = 1,...,k:   y∗j [w⊺x∗j + b] ≥ 1

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 158 / 179

SLIDE 42

Transductive SVM

Non-linearly separable case

min_{w, b, y∗, ξ, ξ∗}   (1/2) ∥w∥² + C ∑_{i=1}^{n} ξi + C∗ ∑_{j=1}^{k} ξ∗j

s.t.   ∀ i = 1,...,n:   yi [w⊺xi + b] ≥ 1 − ξi
       ∀ j = 1,...,k:   y∗j [w⊺x∗j + b] ≥ 1 − ξ∗j
       ∀ i = 1,...,n:   ξi ≥ 0
       ∀ j = 1,...,k:   ξ∗j ≥ 0

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 159 / 179

SLIDE 43

Transductive SVM

Optimisation

How to solve this OP? Not so 'nice': it is a combination of an integer and a convex optimisation problem. Joachims' approach: find an approximate solution by iterative application of the inductive SVM.

1. Train an inductive SVM on the training data, predict on the test data, and assign these labels to the test data.
2. Retrain on all data, with special slack weights (C∗−, C∗+) for the test data.
3. Outer loop: Repeat and slowly increase (C∗−, C∗+).
4. Inner loop: Within each repetition, switch pairs of 'misclassified' data points repeatedly.

Local search with an approximate solution to the OP

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 160 / 179
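The following is a much-simplified, illustrative sketch of the self-labelling idea (label the test set with an inductive SVM, then re-train on everything while slowly increasing the weight of the test examples). It is not Joachims' full algorithm: the pairwise label-switching inner loop is omitted, and scikit-learn's SVC with per-sample weights is used as a stand-in for the separate C, C∗−, C∗+ penalties.

```python
import numpy as np
from sklearn.svm import SVC

def simple_transductive_svm(X_train, y_train, X_test, n_rounds=5, C=1.0, c_star_max=1.0):
    """Toy self-labelling loop inspired by the transductive SVM (illustration only)."""
    clf = SVC(kernel="linear", C=C)
    clf.fit(X_train, y_train)
    y_star = clf.predict(X_test)                   # step 1: label the test data

    X_all = np.vstack([X_train, X_test])
    for r in range(1, n_rounds + 1):
        c_star = c_star_max * r / n_rounds         # outer loop: slowly increase test weight
        weights = np.concatenate([np.ones(len(X_train)),
                                  np.full(len(X_test), c_star)])
        clf = SVC(kernel="linear", C=C)
        clf.fit(X_all, np.concatenate([y_train, y_star]), sample_weight=weights)
        y_star = clf.predict(X_test)               # re-label the test data
    return clf, y_star
```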

SLIDE 44

Transductive SVM: Optimization

Variant of inductive SVM

min_{w, b, y∗, ξ, ξ∗}   (1/2) ∥w∥² + C ∑_{i=1}^{n} ξi + C∗− ∑_{j: y∗j = −1} ξ∗j + C∗+ ∑_{j: y∗j = +1} ξ∗j

s.t.   ∀ i = 1,...,n:   yi [w⊺xi + b] ≥ 1 − ξi
       ∀ j = 1,...,k:   y∗j [w⊺x∗j + b] ≥ 1 − ξ∗j

Three different penalty costs

C for points from the training dataset
C∗− for points from the test dataset currently in class −1
C∗+ for points from the test dataset currently in class +1

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 161 / 179

SLIDE 45

Transductive SVM: Optimisation

[Figure: positive (+) and negative (−) training and test points with slack variables ξi (training) and ξ∗j (test), illustrating the sequence: train, predict → re-train → re-predict.]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 162 / 179

SLIDE 46

Transductive SVM: Experiments

Average P/R-breakeven point on the Reuters dataset for different training set sizes and a test size of 3,299

[Figure: average P/R-breakeven point (20-100) as a function of the number of examples in the training set (17 to 9,603) for the Transductive SVM, SVM, and Naive Bayes.]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 163 / 179

SLIDE 47

Transductive SVM: Experiments

Average P/R-breakeven point on the Reuters dataset for 17 training documents and varying test set size for the TSVM

[Figure: average P/R-breakeven point (10-100) as a function of the number of examples in the test set (206 to 3,299) for the Transductive SVM, SVM, and Naive Bayes.]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 164 / 179

SLIDE 48

Transductive SVM: Experiments

Average P/R-breakeven point on the WebKB category ’course’ for different training set sizes

[Figure: P/R-breakeven point for class 'course' (20-100) as a function of the number of examples in the training set (9 to 226) for the Transductive SVM, SVM, and Naive Bayes.]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 165 / 179

SLIDE 49

Transductive SVM: Experiments

Average P/R-breakeven point on the WebKB category ’project’ for different training set sizes

[Figure: P/R-breakeven point for class 'project' (20-100) as a function of the number of examples in the training set (9 to 226) for the Transductive SVM, SVM, and Naive Bayes.]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 166 / 179

SLIDE 50

Transductive SVM: Summary

Results

Transductive version of SVM
Maximizes margin on training and test data
Implementation uses variant of classic inductive SVM
Solution is approximate and fast
Works well on text, in particular on small training samples and large test sets

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 167 / 179

SLIDE 51

2.4 Cotraining

based on: Avrim Blum, Tom M. Mitchell, Combining Labeled and Unlabeled Data with Co-Training. COLT 1998: 92-100

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 168 / 179

SLIDE 52

Cotraining

Goals

To understand that hyperlinks define a second view of documents.
To understand that this view can be used to infer class labels for an augmented training dataset and to improve prediction accuracy.
To understand how this concept of cotraining generalizes to other domains.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 169 / 179

SLIDE 53

Cotraining

Motivation

In text mining: Besides their content in the form of words, texts nowadays carry hyperlinks that point to related pages. Can this second type of information on a website be used to improve classification? In general: How can classification be improved if there is plenty of unlabeled data in the form of a second view of the data? Yes: the second view can be used to infer class labels of unlabeled data points, which augment the training dataset.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 170 / 179

SLIDE 54

Cotraining

Classic cotraining algorithm

Blum and Mitchell's cotraining uses two classifiers, trained on separate views of the data, to create pseudo-labels for those unlabeled data points for which the predictors are most confident about their predictions. The pseudo-labels are then used to retrain the classifiers, before repeating the pseudo-label generation. The entire process is repeated for k iterations.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 171 / 179
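An illustrative sketch of the cotraining loop, simplified from Blum and Mitchell's algorithm: two Naive Bayes classifiers, one per view, repeatedly label the unlabelled examples they are most confident about, and those examples are added to the labelled pool. The choice of classifier, the pool handling, and the confidence criterion are assumptions for this example.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def cotrain(X1, X2, y, labelled, n_rounds=10, per_round=2):
    """Simplified cotraining on two aligned views X1, X2 (e.g. word counts and hyperlink words).

    `labelled` is a boolean mask of the initially labelled rows; entries of `y`
    for unlabelled rows are ignored and get filled with pseudo-labels."""
    y = np.asarray(y).copy()
    labelled = np.asarray(labelled).copy()
    h1, h2 = MultinomialNB(), MultinomialNB()
    for _ in range(n_rounds):
        if labelled.all():
            break
        h1.fit(X1[labelled], y[labelled])
        h2.fit(X2[labelled], y[labelled])
        for h, X in ((h1, X1), (h2, X2)):
            unlab = np.flatnonzero(~labelled)
            if len(unlab) == 0:
                break
            proba = h.predict_proba(X[unlab])
            order = np.argsort(proba.max(axis=1))[::-1][:per_round]   # most confident predictions
            pick = unlab[order]
            y[pick] = h.classes_[proba[order].argmax(axis=1)]         # pseudo-labels
            labelled[pick] = True                                     # add to the labelled pool
    return h1, h2, y, labelled
```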

SLIDE 55

Cotraining

[Figure: schematic of one cotraining iteration. Classifiers h1 and h2 are trained on the two views x1 and x2 of the labelled set L, classify unlabelled examples from a pool U′ sampled from U, and the most confident predictions are added to the labelled set: train → classify → sample → add.]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 172 / 179

SLIDE 56

Cotraining: Pseudocode

Source: Blum and Mitchell, 1998

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 173 / 179

SLIDE 57

Cotraining

Why can unlabeled data help at all?

Assume an instance space X = X1 × X2, where X1 and X2 are different views of the data. Each view is assumed to be sufficient for correct classification. Let D be a distribution over X and let C1 and C2 be concept classes defined over X1 and X2, respectively. We assume that all labels on examples with non-zero probability under D are consistent with some target function f1 ∈ C1 and f2 ∈ C2. If f denotes the combined target concept over the entire example, then for any example x = (x1,x2) observed with label l, we have f (x) = f1(x1) = f2(x2) = l. This means that D assigns probability zero to any example (x1,x2) such that f1(x1) ≠ f2(x2).

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 174 / 179

SLIDE 58

Cotraining

Why can unlabeled data help at all?

For a given D over X, we define f = (f1,f2) ∈ C1 × C2 as being compatible with D if it satisfies the condition that D assigns zero probability to the set of examples (x1,x2) such that f1(x1) ≠ f2(x2). The set of compatible target functions is typically much simpler and smaller than the entire concept class they are from. As in the transductive SVM, a reduction in the equivalence classes of the target functions leads to an improved bound on the test error!

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 175 / 179

SLIDE 59

Cotraining: Graph Representation of Key Idea

Source: Blum and Mitchell, 1998

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 176 / 179