Data Mining in Bioinformatics Day 4: Text Mining


SLIDE 1

Data Mining in Bioinformatics Day 4: Text Mining

Karsten Borgwardt
February 21 to March 4, 2011
Machine Learning & Computational Biology Research Group, MPIs Tübingen

SLIDE 2

What is text mining?

Definition

Text mining is the use of automated methods for exploiting the enormous amount of knowledge available in the (biomedical) literature.

Motivation

- Most knowledge is stored in terms of texts, both in industry and in academia
- This alone makes text mining an integral part of knowledge discovery!
- Furthermore, to make text machine-readable, one has to solve several recognition (mining) tasks on text

SLIDE 3

What is text mining?

Common tasks

- Information retrieval: find documents that are relevant to a user, or to a query, in a collection of documents
  - Document ranking: rank all documents in the collection
  - Document selection: classify documents into relevant and irrelevant
- Information filtering: search newly created documents for information that is relevant to a user
- Document classification: assign a document to a category that describes its content
- Keyword co-occurrence: find groups of keywords that co-occur in many documents

SLIDE 4

Evaluating text mining

Precision and Recall

Let the set of documents that are relevant to a query be denoted as {Relevant} and the set of retrieved documents as {Retrieved}.

Precision is the percentage of retrieved documents that are relevant to the query:

    \text{precision} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Retrieved}\}|}    (1)

Recall is the percentage of relevant documents that were retrieved by the query:

    \text{recall} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Relevant}\}|}    (2)
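
A minimal sketch of the two measures in Python, assuming documents are identified by IDs and both sets are given explicitly:

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) as defined in Eqs. (1) and (2)."""
    hits = len(relevant & retrieved)  # |{Relevant} ∩ {Retrieved}|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 3 of 4 retrieved documents are relevant, out of 5 relevant overall.
p, r = precision_recall({1, 2, 3, 4, 5}, {2, 3, 5, 9})
print(p, r)  # 0.75 0.6
```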

SLIDE 5

Text representation

Tokenization

- Process of identifying keywords in a document
- Not all words in a text are relevant
- Text mining ignores stop words
- Stop words form the stop list
- Stop lists are context-dependent
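
A minimal tokenization sketch in Python; the stop list below is purely illustrative, not a standard resource:

```python
import re

# Illustrative, context-dependent stop list (not a standard resource)
STOP_LIST = {"the", "a", "of", "in", "and", "is", "to"}

def tokenize(text, stop_list=STOP_LIST):
    """Lowercase, split into alphanumeric tokens, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in stop_list]

print(tokenize("The role of p53 in the regulation of apoptosis"))
# -> ['role', 'p53', 'regulation', 'apoptosis']
```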

SLIDE 6

Text representation

Vector space model

- Given #d documents and #t terms
- Model each document as a vector v in a t-dimensional space

Weighted term-frequency matrix

- Matrix TF of size #d × #t
- Entries measure the association of a term and a document
- If a term t does not occur in a document d, then TF(d, t) = 0
- If a term t does occur in a document d, then TF(d, t) > 0
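
A small sketch of building this matrix from tokenized documents, using raw counts as entries (the toy documents are made up):

```python
from collections import Counter

docs = [["gene", "protein", "gene"], ["protein", "interaction"]]  # tokenized docs
terms = sorted({t for doc in docs for t in doc})                  # the #t terms

# TF has one row per document and one column per term
TF = [[Counter(doc)[t] for t in terms] for doc in docs]
print(terms)  # ['gene', 'interaction', 'protein']
print(TF)     # [[2, 0, 1], [0, 1, 1]]
```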

SLIDE 7

Text representation

If term t occurs in document d, common choices for TF(d, t) are:

- TF(d, t) = 1
- TF(d, t) = freq(d, t), the frequency of t in d
- TF(d, t) = \frac{\text{freq}(d, t)}{\sum_{t' \in T} \text{freq}(d, t')}
- TF(d, t) = 1 + \log(1 + \log(\text{freq}(d, t)))
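
The four weighting variants written out as small Python functions of the raw count freq(d, t); the function names are mine, chosen for illustration:

```python
import math

def tf_binary(freq):
    # TF(d, t) = 1 if t occurs in d, else 0
    return 1.0 if freq > 0 else 0.0

def tf_raw(freq):
    # TF(d, t) = freq(d, t)
    return float(freq)

def tf_normalised(freq, total):
    # TF(d, t) = freq(d, t) / sum over all terms t' of freq(d, t')
    return freq / total

def tf_loglog(freq):
    # TF(d, t) = 1 + log(1 + log(freq(d, t))), defined for freq > 0
    return 1.0 + math.log(1.0 + math.log(freq)) if freq > 0 else 0.0
```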

SLIDE 8

Text representation

Inverse document frequency

- Represents the scaling factor, or importance, of a term
- A term that appears in many documents is scaled down:

    \text{IDF}(t) = \log \frac{1 + |d|}{|d_t|}    (3)

where |d| is the number of all documents and |d_t| is the number of documents containing term t.

TF-IDF measure

Product of term frequency and inverse document frequency:

    \text{TF-IDF}(d, t) = \text{TF}(d, t) \cdot \text{IDF}(t)    (4)
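
A sketch of Eqs. (3) and (4) applied to a raw-count matrix like the TF matrix built earlier; it assumes every vocabulary term occurs in at least one document, so |d_t| ≥ 1:

```python
import math

def tf_idf(TF):
    """Turn a raw-count matrix into a TF-IDF matrix."""
    n_docs, n_terms = len(TF), len(TF[0])
    # |d_t|: number of documents containing term t (>= 1 for vocabulary terms)
    df = [sum(1 for d in range(n_docs) if TF[d][t] > 0) for t in range(n_terms)]
    idf = [math.log((1 + n_docs) / df[t]) for t in range(n_terms)]  # Eq. (3)
    return [[TF[d][t] * idf[t] for t in range(n_terms)]             # Eq. (4)
            for d in range(n_docs)]
```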

SLIDE 9

Measuring similarity

Cosine measure

Let v_1 and v_2 be two document vectors. The cosine similarity is defined as

    \text{sim}(v_1, v_2) = \frac{v_1^\top v_2}{|v_1| \, |v_2|}    (5)

Kernels

Depending on how we represent a document, there are many kernels available for measuring the similarity of these representations:

- vectorial representation: vector kernels such as the linear, polynomial, and Gaussian RBF kernels
- one long string: string kernels that count common k-mers in two strings (more on that later in the course)
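
A direct sketch of the cosine measure from Eq. (5) for plain Python lists:

```python
import math

def cosine_sim(v1, v2):
    """Cosine similarity of two document vectors, Eq. (5)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 > 0 and n2 > 0 else 0.0

print(cosine_sim([2, 0, 1], [0, 1, 1]))  # ~0.316
```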

SLIDE 10

Keyword co-occurrence

Problem

- Find sets of keywords that often co-occur
- Common problem in biomedical literature: find associations between genes, proteins, or other entities using co-occurrence search
- Keyword co-occurrence search is an instance of a more general problem in data mining, called association rule mining

SLIDE 11

Association rules

Definitions

- Let I = {I_1, I_2, ..., I_m} be a set of items (keywords)
- Let D be a database of transactions T (a collection of documents)
- A transaction T ∈ D is a set of items, T ⊆ I (a document is a set of keywords)
- Let A be a set of items with A ⊆ T. An association rule is an implication of the form

    A \subseteq T \Rightarrow B \subseteq T,    (6)

where A, B ⊆ I and A ∩ B = ∅.

SLIDE 12

Association rules

Support and Confidence

The rule A ⇒ B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪ B:

    \text{support}(A \Rightarrow B) = \frac{|\{T \in D \mid A \subseteq T \wedge B \subseteq T\}|}{|\{T \in D\}|}    (7)

The rule A ⇒ B has confidence c in the transaction set D, where c is the percentage of transactions in D containing A that also contain B:

    \text{confidence}(A \Rightarrow B) = \frac{|\{T \in D \mid A \subseteq T \wedge B \subseteq T\}|}{|\{T \in D \mid A \subseteq T\}|}    (8)
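
A sketch of Eqs. (7) and (8), with documents represented as keyword sets (the toy transactions are made up):

```python
def support(D, A, B):
    """Fraction of transactions in D containing A ∪ B, Eq. (7)."""
    return sum(1 for T in D if A <= T and B <= T) / len(D)

def confidence(D, A, B):
    """Fraction of transactions containing A that also contain B, Eq. (8)."""
    with_A = [T for T in D if A <= T]
    return sum(1 for T in with_A if B <= T) / len(with_A)

D = [{"p53", "apoptosis"}, {"p53", "cancer"}, {"BRCA1", "cancer"}]
print(support(D, {"p53"}, {"apoptosis"}))     # 1/3
print(confidence(D, {"p53"}, {"apoptosis"}))  # 1/2
```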

SLIDE 13

Association rules

Strong rules

Rules that satisfy both a minimum support threshold (minsup) and a minimum confidence threshold (minconf) are called strong association rules, and these are the ones we are after!

Finding strong rules

1. Search for all frequent itemsets (sets of items that occur in at least minsup % of all transactions)
2. Generate strong association rules from the frequent itemsets

SLIDE 14

Association rules

Apriori algorithm

Makes use of the Apriori property: if an itemset A is frequent, then any subset B of A (B ⊆ A) is frequent as well; if B is infrequent, then any superset A of B (A ⊇ B) is infrequent as well.

Steps (a compact sketch follows below):

1. Determine the frequent items (k-itemsets with k = 1)
2. Join all pairs of frequent k-itemsets that differ in at most 1 item; these form the candidates C_{k+1} for being frequent (k+1)-itemsets
3. Check the frequency of these candidates C_{k+1}: the frequent ones form the frequent (k+1)-itemsets (trick: immediately discard any candidate that contains an infrequent k-itemset)
4. Repeat from Step 2 until no more candidate is frequent
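
A compact Python sketch of these steps; itemsets are represented as frozensets, and minsup is taken here as an absolute transaction count rather than a percentage:

```python
from itertools import combinations

def apriori(D, minsup):
    def frequent(candidates):
        return {c for c in candidates
                if sum(1 for T in D if c <= T) >= minsup}

    # Step 1: frequent 1-itemsets
    items = {frozenset([i]) for T in D for i in T}
    Lk, result = frequent(items), set()
    while Lk:
        result |= Lk
        k = len(next(iter(Lk)))
        # Step 2: join pairs of frequent k-itemsets differing in one item
        Ck = {a | b for a, b in combinations(Lk, 2) if len(a | b) == k + 1}
        # Step 3 (pruning trick): drop candidates with an infrequent k-subset
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k))}
        Lk = frequent(Ck)  # Step 4: repeat until no candidate is frequent
    return result

D = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
print(apriori(D, minsup=2))  # all singletons plus {a,b}, {a,c}, {b,c}
```
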
SLIDE 15

Transduction

Known test set

- Classification on text databases often means that we know all the data we will work with before training
- Hence the test set is known a priori
- This setting is called 'transductive'
- Can we define classifiers that exploit the known test set? Yes!

Transductive SVM (Joachims, ICML 1999)

- Trains the SVM on both the training and the test set
- Uses the test data to maximise the margin

SLIDE 16

Inductive vs. transductive

Classification task: predict label y from features x

Classic inductive setting

- Strategy: learn a classifier on (labelled) training data
- Goal: the classifier shall generalise to unseen data from the same distribution

Transductive setting

- Strategy: learn a classifier on (labelled) training data AND a given (unlabelled) test dataset
- Goal: predict class labels for this particular dataset

SLIDE 17

Why transduction?

Really necessary?

- The classic approach works: train on the training dataset, test on the test dataset
- That is what we usually do in practice, for instance in cross-validation; we usually ignore or neglect the fact that such settings are transductive

The benefits of transductive classification

- Inductive setting: infinitely many potential classifiers
- Transductive setting: finite number of equivalence classes of classifiers
- f and f' are in the same equivalence class ⇔ f and f' classify points from the training and test dataset identically

SLIDE 18

Why transduction?

Idea of Transductive SVMs

- Risk on test data ≤ risk on training data + confidence interval (which depends on the number of equivalence classes)
- Theorem by Vapnik (1998): the larger the margin, the lower the number of equivalence classes that contain a classifier with this margin
- Hence: find a hyperplane that separates the classes in the training data AND in the test data with maximum margin

SLIDE 19

Why transduction?

[Figure slide; no text content]

SLIDE 20

Transduction on text

[Figure slide; no text content]

SLIDE 21

Transductive SVM

Linearly separable case

    \min_{w, b, y^*} \frac{1}{2} \|w\|^2

    \text{s.t.} \quad y_i [w^\top x_i + b] \geq 1 \quad \forall i = 1, \dots, n
                y^*_j [w^\top x^*_j + b] \geq 1 \quad \forall j = 1, \dots, k

SLIDE 22

Transductive SVM

Non-linearly separable case

    \min_{w, b, y^*, \xi, \xi^*} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i + C^* \sum_{j=1}^{k} \xi^*_j

    \text{s.t.} \quad y_i [w^\top x_i + b] \geq 1 - \xi_i \quad \forall i = 1, \dots, n
                y^*_j [w^\top x^*_j + b] \geq 1 - \xi^*_j \quad \forall j = 1, \dots, k
                \xi_i \geq 0 \quad \forall i = 1, \dots, n
                \xi^*_j \geq 0 \quad \forall j = 1, \dots, k

SLIDE 23

Transductive SVM

Optimisation

- How to solve this optimisation problem? Not so nice: a combination of an integer and a convex optimisation problem
- Joachims' approach: find an approximate solution by iterative application of the inductive SVM
  - Train an inductive SVM on the training data, predict on the test data, and assign the predicted labels to the test data
  - Retrain on all data, with special slack weights (C^*_-, C^*_+) for the test data
- Outer loop: repeat and slowly increase (C^*_-, C^*_+)
- Inner loop: within each repetition, repeatedly switch pairs of 'misclassified' test points
- The result is a local search with an approximate solution to the optimisation problem (a rough sketch follows below)
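
A rough Python sketch of this iterative scheme, assuming scikit-learn is available; it uses an inductive SVC as the inner solver and per-sample weights in place of the (C^*_-, C^*_+) machinery, and it omits Joachims' inner pair-switching step, so it illustrates the outer loop only:

```python
import numpy as np
from sklearn.svm import SVC

def tsvm_sketch(X_train, y_train, X_test, C=1.0, C_star=1.0, steps=5):
    # Step 1: inductive SVM on the training data labels the test data
    clf = SVC(kernel="linear", C=C).fit(X_train, y_train)
    y_star = clf.predict(X_test)
    X_all = np.vstack([X_train, X_test])
    # Outer loop: retrain on all data while slowly raising the test penalty
    for c in np.linspace(C_star / steps, C_star, steps):
        # sample_weight multiplies C per point: weight 1 for training
        # points, a growing weight c / C for the test points
        w = np.concatenate([np.ones(len(X_train)),
                            np.full(len(X_test), c / C)])
        clf = SVC(kernel="linear", C=C).fit(
            X_all, np.concatenate([y_train, y_star]), sample_weight=w)
        y_star = clf.predict(X_test)  # reassign the test labels
    return clf, y_star
```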

SLIDE 24

Inductive SVM for TSVM

Variant of the inductive SVM

    \min_{w, b, y^*, \xi, \xi^*} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i + C^*_{-} \sum_{j: y^*_j = -1} \xi^*_j + C^*_{+} \sum_{j: y^*_j = 1} \xi^*_j

    \text{s.t.} \quad y_i [w^\top x_i + b] \geq 1 - \xi_i \quad \forall i = 1, \dots, n
                y^*_j [w^\top x^*_j + b] \geq 1 - \xi^*_j \quad \forall j = 1, \dots, k

Three different penalty costs (see the sketch below):

- C for points from the training dataset
- C^*_- for points from the test dataset currently in class -1
- C^*_+ for points from the test dataset currently in class +1
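
The two class-dependent costs can be emulated in the same scikit-learn setting, since sample_weight scales the penalty C per point; a hedged sketch (the function name and parameters are mine), assuming the current test labels y_star are available:

```python
import numpy as np
from sklearn.svm import SVC

def inner_svm(X_train, y_train, X_test, y_star, C, c_minus, c_plus):
    """One inner step with penalties C, C*_- and C*_+ via sample weights."""
    # Test points weighted by current label: C*_-/C for class -1, C*_+/C for +1
    w_test = np.where(np.asarray(y_star) == -1, c_minus / C, c_plus / C)
    weights = np.concatenate([np.ones(len(X_train)), w_test])
    X = np.vstack([X_train, X_test])
    y = np.concatenate([y_train, y_star])
    return SVC(kernel="linear", C=C).fit(X, y, sample_weight=weights)
```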

SLIDE 25

Experiments

Average P/R-breakeven point on the Reuters dataset for different training set sizes and a test size of 3,299

SLIDE 26

Experiments

Average P/R-breakeven point on the Reuters dataset for 17 training documents and varying test set size for the TSVM

SLIDE 27

Experiments

Average P/R-breakeven point on the WebKB category ’course’ for different training set sizes

SLIDE 28

Experiments

Average P/R-breakeven point on the WebKB category ’project’ for different training set sizes

SLIDE 29

Summary

Results

- Transductive version of the SVM
- Maximises the margin on both training and test data
- The implementation uses a variant of the classic inductive SVM
- The solution is approximate and fast
- Works well on text, in particular on small training samples and large test sets

SLIDE 30

References and further reading

References

[1] T. Joachims. Transductive Inference for Text Classification using Support Vector Machines. ICML, 1999: 200-209.
[2] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Elsevier/Morgan Kaufmann, 2006.

SLIDE 31

The end

See you tomorrow! Next topic: Graph Mining