

  1. Data Mining in Bioinformatics Day 4: Text Mining
     Karsten Borgwardt, Bioinformatics Group, MPIs Tübingen
     February 25 to March 10

  2. What is text mining?
     Definition: Text mining is the use of automated methods for exploiting the enormous amount of knowledge available in the (biomedical) literature.
     Motivation: Most knowledge is stored in terms of texts, both in industry and in academia. This alone makes text mining an integral part of knowledge discovery! Furthermore, to make text machine-readable, one has to solve several recognition (mining) tasks on text.

  3. What is text mining? Common tasks:
     Information retrieval: find documents that are relevant to a user, or to a query, in a collection of documents.
       Document ranking: rank all documents in the collection.
       Document selection: classify documents into relevant and irrelevant.
     Information filtering: search newly created documents for information that is relevant to a user.
     Document classification: assign a document to a category that describes its content.
     Keyword co-occurrence: find groups of keywords that co-occur in many documents.

  4. Evaluating text mining: precision and recall
     Let the set of documents that are relevant to a query be denoted as {Relevant} and the set of retrieved documents as {Retrieved}.
     The precision is the percentage of retrieved documents that are relevant to the query:
       precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|   (1)
     The recall is the percentage of relevant documents that were retrieved by the query:
       recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|   (2)
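
A minimal sketch of both measures in Python; the document IDs and sets below are hypothetical toy data:

```python
def precision_recall(relevant: set, retrieved: set) -> tuple[float, float]:
    """Compute precision and recall for one query, following Eqs. (1) and (2)."""
    hits = len(relevant & retrieved)  # |{Relevant} ∩ {Retrieved}|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical toy example:
relevant = {"doc1", "doc3", "doc4"}
retrieved = {"doc1", "doc2", "doc3"}
print(precision_recall(relevant, retrieved))  # (0.666..., 0.666...)
```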

  5. Text representation: tokenization
     Tokenization is the process of identifying keywords in a document.
     Not all words in a text are relevant: text mining ignores stop words.
     The stop words form the stop list; stop lists are context-dependent.
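
To make tokenization concrete, here is a small Python sketch; the stop list is a hypothetical toy list, since real stop lists are larger and tailored to the corpus:

```python
import re

# Hypothetical toy stop list; real stop lists are context-dependent.
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "to", "and", "for"}

def tokenize(text: str) -> list[str]:
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(tokenize("The binding of the protein to DNA is essential."))
# ['binding', 'protein', 'dna', 'essential']
```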

  6. Text representation: vector space model
     Given #d documents and #t terms, model each document as a vector v in a t-dimensional space.
     Weighted term-frequency matrix: a matrix TF of size #d × #t whose entries measure the association between a term and a document.
       If a term t does not occur in a document d, then TF(d, t) = 0.
       If a term t does occur in a document d, then TF(d, t) > 0.

  7. Text representation: common choices for TF(d, t) when term t occurs in document d
     Binary: TF(d, t) = 1
     Raw frequency: TF(d, t) = freq(d, t), the frequency of t in d
     Relative frequency: TF(d, t) = freq(d, t) / Σ_{t′ ∈ T} freq(d, t′)
     Log-scaled: TF(d, t) = 1 + log(1 + log(freq(d, t)))

  8. Text representation: inverse document frequency
     The inverse document frequency (IDF) represents the scaling factor, or importance, of a term. A term that appears in many documents is scaled down:
       IDF(t) = log( (1 + |d|) / |d_t| )   (3)
     where |d| is the number of all documents and |d_t| is the number of documents containing term t.
     TF-IDF measure: the product of term frequency and inverse document frequency:
       TF-IDF(d, t) = TF(d, t) · IDF(t)   (4)
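
A compact Python sketch of the TF-IDF computation, combining the relative-frequency TF variant from slide 7 with Eqs. (3) and (4); the toy corpus is hypothetical:

```python
import math
from collections import Counter

def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """One TF-IDF weight vector per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter()          # |d_t|: number of documents containing t
    for doc in docs:
        doc_freq.update(set(doc))
    idf = {t: math.log((1 + n_docs) / df) for t, df in doc_freq.items()}

    rows = []
    for doc in docs:
        counts = Counter(doc)
        total = sum(counts.values())   # Σ_{t′} freq(d, t′)
        rows.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return rows

docs = [["gene", "protein", "gene"], ["protein", "pathway"], ["gene", "pathway"]]
for row in tf_idf(docs):
    print(row)
```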

  9. Measuring similarity
     Cosine measure: let v1 and v2 be two document vectors. The cosine similarity is defined as
       sim(v1, v2) = v1⊤v2 / (|v1| |v2|)   (5)
     Kernels: depending on how we represent a document, there are many kernels available for measuring the similarity of these representations.
       Vectorial representation: vector kernels such as the linear, polynomial, or Gaussian RBF kernel.
       One long string: string kernels that count common k-mers in two strings (more on that later in the course).
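
Eq. (5) for sparse document vectors stored as term-to-weight dicts (a minimal sketch; the example vectors are hypothetical):

```python
import math

def cosine_similarity(v1: dict[str, float], v2: dict[str, float]) -> float:
    """Cosine of the angle between two sparse document vectors, Eq. (5)."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

# Hypothetical TF-IDF weight vectors:
v1 = {"gene": 0.46, "protein": 0.23}
v2 = {"protein": 0.35, "pathway": 0.35}
print(cosine_similarity(v1, v2))  # ≈ 0.32
```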

  10. Keyword co-occurrence
      Problem: find sets of keywords that often co-occur.
      A common problem in the biomedical literature is to find associations between genes, proteins or other entities using co-occurrence search.
      Keyword co-occurrence search is an instance of a more general problem in data mining, called association rule mining.

  11. Association rules: definitions
      Let I = {I_1, I_2, …, I_m} be a set of items (keywords).
      Let D be the database of transactions T (the collection of documents).
      A transaction T ∈ D is a set of items: T ⊆ I (a document is a set of keywords).
      Let A be a set of items with A ⊆ T. An association rule is an implication of the form
        A ⊆ T ⇒ B ⊆ T,   (6)
      where A, B ⊆ I and A ∩ B = ∅.

  12. Association rules: support and confidence
      The rule A ⇒ B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪ B:
        support(A ⇒ B) = |{T ∈ D | A ⊆ T ∧ B ⊆ T}| / |{T ∈ D}|   (7)
      The rule A ⇒ B has confidence c in the transaction set D, where c is the percentage of transactions in D containing A that also contain B:
        confidence(A ⇒ B) = |{T ∈ D | A ⊆ T ∧ B ⊆ T}| / |{T ∈ D | A ⊆ T}|   (8)
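
Eqs. (7) and (8) as Python functions over documents represented as keyword sets; the toy corpus is hypothetical:

```python
def support(A: set, B: set, D: list[set]) -> float:
    """Fraction of transactions in D that contain A ∪ B, Eq. (7)."""
    return sum(1 for T in D if A <= T and B <= T) / len(D)

def confidence(A: set, B: set, D: list[set]) -> float:
    """Fraction of transactions containing A that also contain B, Eq. (8)."""
    has_A = [T for T in D if A <= T]
    return sum(1 for T in has_A if B <= T) / len(has_A) if has_A else 0.0

# Hypothetical documents as keyword sets:
D = [{"BRCA1", "cancer"}, {"BRCA1", "cancer", "repair"}, {"repair"}, {"BRCA1"}]
print(support({"BRCA1"}, {"cancer"}, D))     # 2/4 = 0.5
print(confidence({"BRCA1"}, {"cancer"}, D))  # 2/3 ≈ 0.67
```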

  13. Association rules: strong rules
      Rules that satisfy both a minimum support threshold (minsup) and a minimum confidence threshold (minconf) are called strong association rules, and these are the ones we are after!
      Finding strong rules:
      1. Search for all frequent itemsets (sets of items that occur in at least minsup % of all transactions).
      2. Generate strong association rules from the frequent itemsets.

  14. Association rules: Apriori algorithm
      The algorithm makes use of the Apriori property: if an itemset A is frequent, then any subset B of A (B ⊆ A) is frequent as well; if B is infrequent, then any superset A of B (A ⊇ B) is infrequent as well.
      Steps:
      1. Determine the frequent items, i.e. the frequent k-itemsets with k = 1.
      2. Join all pairs of frequent k-itemsets that differ in at most 1 item; these are the candidates C_{k+1} for being frequent (k+1)-itemsets.
      3. Check the frequency of these candidates C_{k+1}: the frequent ones form the frequent (k+1)-itemsets (trick: immediately discard any candidate that contains an infrequent k-itemset).
      4. Repeat from Step 2 until no more candidates are frequent.
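
A minimal Apriori sketch in Python following the four steps above; support is recomputed naively per itemset, which a real implementation would avoid, and the toy transactions are hypothetical:

```python
from itertools import combinations

def apriori(transactions: list[set], minsup: float) -> dict[frozenset, float]:
    """Return all frequent itemsets together with their support."""
    n = len(transactions)

    def sup(itemset: frozenset) -> float:
        return sum(1 for T in transactions if itemset <= T) / n

    # Step 1: frequent 1-itemsets.
    items = {i for T in transactions for i in T}
    frequent = {frozenset([i]) for i in items if sup(frozenset([i])) >= minsup}
    result = {s: sup(s) for s in frequent}

    k = 1
    while frequent:
        # Step 2: join frequent k-itemsets into (k+1)-candidates.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        # Apriori trick: discard candidates containing an infrequent k-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        # Step 3: keep the candidates that meet the minimum support.
        frequent = {c for c in candidates if sup(c) >= minsup}
        result.update({s: sup(s) for s in frequent})
        k += 1  # Step 4: repeat until no candidate is frequent.
    return result

D = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
for itemset, s in apriori(D, minsup=0.6).items():
    print(set(itemset), round(s, 2))
```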

  15. Transduction: a known test set
      Classification on text databases often means that we know all the data we will work with before training. Hence the test set is known a priori. This setting is called 'transductive'.
      Can we define classifiers that exploit the known test set? Yes!
      Transductive SVM (Joachims, ICML 1999):
        Trains the SVM on both the training and the test set.
        Uses the test data to maximise the margin.

  16. Inductive vs. transductive classification
      Task: predict label y from features x.
      Classic inductive setting
        Strategy: learn a classifier on (labelled) training data.
        Goal: the classifier shall generalise to unseen data from the same distribution.
      Transductive setting
        Strategy: learn a classifier on (labelled) training data AND a given (unlabelled) test dataset.
        Goal: predict class labels for this particular dataset.

  17. Why transduction? Is it really necessary?
      The classic approach works: train on the training dataset, test on the test dataset. That is what we usually do in practice, for instance in cross-validation. We usually ignore or neglect the fact that settings are transductive.
      The benefits of transductive classification:
        Inductive setting: infinitely many potential classifiers.
        Transductive setting: a finite number of equivalence classes of classifiers.
        f and f′ are in the same equivalence class ⇔ f and f′ classify the points from the training and test dataset identically.

  18. Why transduction? The idea of transductive SVMs
      Risk on test data ≤ risk on training data + a confidence interval (which depends on the number of equivalence classes).
      Theorem by Vapnik (1998): the larger the margin, the lower the number of equivalence classes that contain a classifier with this margin.
      Hence: find the hyperplane that separates the classes in the training data AND in the test data with maximum margin.

  19. Why transduction? [figure-only slide]

  20. Transduction on text [figure-only slide]

  21. Transductive SVM: linearly separable case
      min_{w, b, y*}  (1/2) ‖w‖²
      s.t.  y_i [w⊤x_i + b] ≥ 1      for all i = 1, …, n   (training points)
            y*_j [w⊤x*_j + b] ≥ 1    for all j = 1, …, k   (test points)

  22. Transductive SVM: non-linearly separable case
      min_{w, b, y*, ξ, ξ*}  (1/2) ‖w‖² + C Σ_{i=1}^{n} ξ_i + C* Σ_{j=1}^{k} ξ*_j
      s.t.  y_i [w⊤x_i + b] ≥ 1 − ξ_i       for all i = 1, …, n
            y*_j [w⊤x*_j + b] ≥ 1 − ξ*_j    for all j = 1, …, k
            ξ_i ≥ 0,  ξ*_j ≥ 0
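
For intuition, a brute-force illustration of this objective in Python: enumerate every labeling y* of a tiny test set, fit a near-hard-margin linear SVM on the combined data, and keep the labeling with the largest margin 1/‖w‖. This is exponential in the test-set size and purely illustrative; practical transductive SVMs such as Joachims' use heuristic local search instead. The function name and toy data are hypothetical; scikit-learn is assumed available:

```python
from itertools import product
import numpy as np
from sklearn.svm import SVC

def tsvm_brute_force(X_train, y_train, X_test):
    """Pick the test labeling y* that maximises the joint margin."""
    X_all = np.vstack([X_train, X_test])
    best_margin, best_labels = -np.inf, None
    for y_star in product([-1, 1], repeat=len(X_test)):
        y_all = np.concatenate([y_train, y_star])
        # A very large C approximates the hard-margin (separable) formulation.
        clf = SVC(kernel="linear", C=1e6).fit(X_all, y_all)
        margin = 1.0 / np.linalg.norm(clf.coef_[0])
        if margin > best_margin:
            best_margin, best_labels = margin, np.array(y_star)
    return best_labels, best_margin

# Hypothetical toy data: two labelled training points, two unlabelled test points.
X_train = np.array([[0.0, 0.0], [1.0, 1.0]])
y_train = np.array([-1, 1])
X_test = np.array([[0.1, 0.2], [0.9, 0.8]])
print(tsvm_brute_force(X_train, y_train, X_test))  # expects labels [-1, 1]
```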
