
SLIDE 1
2. Text Mining

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 118 / 179

SLIDE 2

Text Mining

Goals

To learn key problems and techniques in mining one of the most common types of data
To learn how to represent text numerically
To learn how to make use of enormous amounts of unlabeled data
To learn how to find co-occurring keywords in documents

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 119 / 179

SLIDE 3

2.1 Basics of Text Representation and Analysis

based on: Charu Aggarwal, Data Mining - The Textbook, Springer 2015, Chapter 13

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 120 / 179

SLIDE 4

What is text mining?

Definition

Text mining is the use of automated methods for exploiting the enormous amount of knowledge available in the (biomedical) literature.

Motivation

Most knowledge is stored in the form of text, both in industry and in academia. This alone makes text mining an integral part of knowledge discovery! Furthermore, to make text machine-readable, one has to solve several recognition (mining) tasks on text.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 121 / 179

SLIDE 5

Why text mining?

Text data is growing in an unprecedented manner

Digital libraries
Web and Web-enabled applications (e.g. social networks)
Newswire services

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 122 / 179

SLIDE 6

Text mining terminology

Important definitions

A set of features of text is also referred to as a lexicon. A document can be viewed either as a sequence or as a multidimensional record. A collection of documents is referred to as a corpus.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 123 / 179

SLIDE 7

Text mining terminology

A number of special characteristics of text data

Very sparse
Diverse length
Nonnegative statistics
Side information is often available, e.g. hyperlinks, meta-data
Lots of unlabeled data

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 124 / 179

SLIDE 8

What is text mining?

Common tasks

Information retrieval: Find documents that are relevant to a user, or to a query in a collection of documents
  Document ranking: rank all documents in the collection
  Document selection: classify documents into relevant and irrelevant
Information filtering: Search newly created documents for information that is relevant to a user
Document classification: Assign a document to a category that describes its content
Keyword co-occurrence: Find groups of keywords that co-occur in many documents

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 125 / 179

SLIDE 9

Evaluating text mining

Precision and Recall

Let the set of documents that are relevant to a query be denoted as {Relevant} and the set of retrieved documents as {Retrieved}.

The precision is the percentage of retrieved documents that are relevant to the query:

precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|    (1)

The recall is the percentage of relevant documents that were retrieved by the query:

recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|    (2)

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 126 / 179
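For concreteness, the following minimal Python sketch (not part of the original slides) computes precision and recall as in Equations (1) and (2); the document identifiers are made up for illustration.

```python
def precision_recall(relevant, retrieved):
    """Precision and recall of a query result, following Eqs. (1) and (2)."""
    hits = len(relevant & retrieved)                      # |{Relevant} ∩ {Retrieved}|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example with arbitrary document ids
relevant = {"d1", "d3", "d4", "d7"}
retrieved = {"d1", "d2", "d3", "d9"}
print(precision_recall(relevant, retrieved))              # (0.5, 0.5)
```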

SLIDE 10

Text representation

Tokenization

Tokenization is the process of identifying keywords in a document. Not all words in a text are relevant; text mining ignores stop words. The set of stop words forms the stop list. Stop lists are context-dependent.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 127 / 179
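A minimal, illustrative sketch of tokenization with a stop list; the tiny stop list and the word-splitting regular expression are assumptions for this example, not part of the slides, and a real stop list would be chosen per context.

```python
import re

STOP_LIST = {"the", "a", "of", "and", "is", "in", "to"}   # toy, context-dependent stop list

def tokenize(text, stop_list=STOP_LIST):
    """Lower-case the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in stop_list]

print(tokenize("Text mining is the analysis of large collections of documents"))
# ['text', 'mining', 'analysis', 'large', 'collections', 'documents']
```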

SLIDE 11

Text representation

Vector space model

Given #d documents and #t terms, model each document as a vector v in a t-dimensional space.

Weighted term-frequency matrix

Matrix TF of size #d × #t
Entries measure the association of term and document
If a term t does not occur in a document d, then TF(d,t) = 0.
If a term t does occur in a document d, then TF(d,t) > 0.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 128 / 179

SLIDE 12

Text representation

Definitions of term frequency

If term t occurs in document d, then

TF(d,t) = 1
TF(d,t) = freq(d,t), the frequency of t in d
TF(d,t) = freq(d,t) / ∑_{t′∈T} freq(d,t′)
TF(d,t) = 1 + log(freq(d,t)) if freq(d,t) > 0, and TF(d,t) = 0 if freq(d,t) = 0

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 129 / 179
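The following sketch (illustrative only; the toy corpus is made up) builds the #d × #t frequency matrix freq(d,t) and derives the four TF variants listed above.

```python
import numpy as np

# Toy, already-tokenized corpus and its lexicon
docs = [["text", "mining", "mining"],
        ["graph", "mining"],
        ["text", "classification"]]
lexicon = sorted({t for doc in docs for t in doc})   # ['classification', 'graph', 'mining', 'text']

# Raw frequency matrix freq(d,t) of size #d x #t
freq = np.zeros((len(docs), len(lexicon)))
for i, doc in enumerate(docs):
    for t in doc:
        freq[i, lexicon.index(t)] += 1

tf_binary = (freq > 0).astype(float)                        # TF(d,t) = 1 if t occurs in d
tf_raw    = freq.copy()                                     # TF(d,t) = freq(d,t)
tf_norm   = freq / freq.sum(axis=1, keepdims=True)          # freq(d,t) / sum_t' freq(d,t')
tf_log    = np.where(freq > 0, 1.0 + np.log(np.maximum(freq, 1)), 0.0)   # 1 + log freq(d,t), else 0
```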

SLIDE 13

Text representation

Inverse document frequency

The inverse document frequency (IDF) represents the scaling factor, or importance, of a term. A term that appears in many documents is scaled down:

IDF(t) = log( (1 + |d|) / |d_t| ),    (3)

where |d| is the number of all documents, and |d_t| is the number of documents containing term t.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 130 / 179

SLIDE 14

Text representation

TF-IDF measure

The TF-IDF measure is the product of term frequency and inverse document frequency: TF-IDF(d,t) = TF(d,t) · IDF(t). (4)

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 131 / 179
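A short illustrative continuation of the previous sketch: IDF as in Equation (3) and the TF-IDF weights as in Equation (4), assuming the raw frequency matrix `freq` from the TF example.

```python
import numpy as np

def tf_idf(freq):
    """TF-IDF(d,t) = TF(d,t) * IDF(t), with TF the raw frequency and IDF as in Eq. (3)."""
    n_docs = freq.shape[0]                      # |d|: number of documents
    n_docs_with_t = (freq > 0).sum(axis=0)      # |d_t|: number of documents containing term t
    idf = np.log((1.0 + n_docs) / n_docs_with_t)
    return freq * idf                           # broadcasts IDF(t) over all documents

# weights = tf_idf(freq)   # using the toy matrix from the previous sketch
```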

SLIDE 15

Measuring similarity

Cosine measure

Let v1 and v2 be two document vectors. The cosine similarity is defined as

sim(v1,v2) = v1⊺v2 / (|v1| |v2|).    (5)

Kernels

Depending on how we represent a document, there are many kernels available for measuring similarity of these representations:

vectorial representation: vector kernels like the linear, polynomial, or Gaussian RBF kernel
one long string: string kernels that count common k-mers in two strings

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 132 / 179
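A minimal sketch of the cosine measure from Equation (5); the two document vectors are made up for illustration.

```python
import numpy as np

def cosine_similarity(v1, v2):
    """sim(v1, v2) = v1^T v2 / (|v1| |v2|), cf. Eq. (5)."""
    norm = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(v1 @ v2) / norm if norm > 0 else 0.0

v1 = np.array([0.0, 1.2, 0.7, 0.0])   # e.g. TF-IDF vectors of two documents
v2 = np.array([0.3, 0.9, 0.0, 0.0])
print(cosine_similarity(v1, v2))       # ~0.82
```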

SLIDE 16

2.2 Topic Modeling

based on: Charu Aggarwal, Data Mining - The Textbook, Springer 2015, Chapter 2.4.4.3 and 13.4

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 133 / 179

SLIDE 17

Topic Modeling

Definition

Topic modeling can be viewed as a probabilistic version of latent semantic analysis (LSA). Its most basic version is referred to as Probabilistic Latent Semantic Analysis (PLSA). It provides an alternative method for performing dimensionality reduction and has several advantages over LSA.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 134 / 179

SLIDE 18

Topic Modeling: SVD on text

Latent Semantic Analysis

Latent Semantic Analysis (LSA) is an application of SVD to the text domain. The goal is to retrieve a vectorial representation of terms and documents. The data matrix D is an n × d document-term matrix containing word frequencies in the n documents, where d is the size of the lexicon.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 135 / 179
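A minimal sketch of LSA as a truncated SVD of the document-term matrix, using plain numpy; the matrix D and the number of topics k are placeholders to be supplied by the caller.

```python
import numpy as np

def lsa(D, k):
    """Truncated SVD of the n x d document-term matrix D, keeping k topics."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    L_k  = U[:, :k]             # n x k: documents expressed over the topics
    S_k  = np.diag(s[:k])       # k x k: importance of each topic (singular values)
    R_kt = Vt[:k, :]            # k x d: topics expressed over the lexicon
    doc_embedding = L_k @ S_k   # reduced k-dimensional document representation
    return doc_embedding, R_kt

# doc_embedding, topics = lsa(D, k=2)   # D: n x d matrix of word frequencies
```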

SLIDE 19

Topic Modeling: SVD on text

Latent Semantic Analysis

[Figure: LSA as a rank-k factorization of the n × d document-term matrix D (documents × words) into Lk (n × k, documents × topics, the k document basis vectors), Δk (k × k diagonal matrix of topic importances), and Rk⊺ (k × d, topics × words).]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 136 / 179

SLIDE 20

Topic Modeling: Centering and sparsity

Latent Semantic Analysis

No mean centering is used. The results are approximately the same as for PCA because of the sparsity of D: The sparsity implies that most of the entries are zero, and that the mean is much smaller than the non-zero entries. In such scenarios, it can be shown that the covariance matrix is approximately proportional to D⊺D. The sparsity of the data also results in a low intrinsic dimensionality. The dimensionality reduction effect of LSA is rather drastic: Often, a corpus represented over a lexicon of 100,000 dimensions can be summarized in fewer than 300 dimensions.

LSA is also a classic example of how the "loss" of information from discarding some dimensions can actually result in an improvement in the quality of the data representation.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 137 / 179

SLIDE 21

Topic Modeling: Synonymy and polysemy

Latent Semantic Analysis

Synonymy refers to the fact that two words have the same meaning, e.g. comical and hilarious. Polysemy refers to the fact that the same word has two different meanings, e.g. jaguar. Typically the meaning of a word can be understood from its context, but term frequencies do not capture the context sufficiently: e.g. two documents containing the words comical and hilarious, respectively, may not be deemed sufficiently similar. The truncated representation after LSA typically removes the noise effects of synonymy and polysemy, because the singular vectors represent the directions of correlation in the data.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 138 / 179

SLIDE 22

Topic Modeling

Probabilistic Latent Semantic Analysis

Probabilistic Latent Semantic Analysis (PLSA) is a probabilistic variant of LSA and SVD. It is an expectation-maximization based modeling algorithm. Its goal is to discover the correlation structure of the words, not of the documents (or data objects).

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 139 / 179

SLIDE 23

Topic Modeling

Probabilistic Latent Semantic Analysis

[Figure: the same factorization with a probabilistic interpretation: the document-term matrix D = [P(doci, wordj)] is factorized into Lk = [P(doci | topicm)] (n × k), the k × k diagonal matrix Δk of prior probabilities P(topicm), and Rk⊺ = [P(wordj | topicm)] (k × d).]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 140 / 179

SLIDE 24

Topic Modeling

Probabilistic Latent Semantic Analysis

In PLSA, the generative process is inherently designed for dimensionality reduction rather than clustering, and different parts of the same document can be generated by different mixture components. It is assumed that there are k aspects (or latent topics) denoted by G1,...,Gk. The generative process builds the document-term matrix as follows:

1. Select a latent component (aspect) Gm with probability P(Gm).
2. Generate the indices (i,j) of a document-word pair (Di, wj) with probabilities P(Di|Gm) and P(wj|Gm), respectively, and increment the frequency of entry (i,j) in the document-term matrix by 1. The document and word indices are generated in an independent way.

All the parameters of this generative process, such as P(Gm),P(Di∣Gm) and P(wj∣Gm), need to be estimated from the observed frequencies in the n × d document-term matrix.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 141 / 179
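To make the generative process concrete, here is an illustrative sampler (not from the slides) that fills a document-term count matrix given the parameters P(Gm), P(Di|Gm) and P(wj|Gm); in practice these parameters are unknown and must be estimated, here they are assumed inputs.

```python
import numpy as np

def sample_document_term_matrix(p_g, p_d_given_g, p_w_given_g, n_pairs, seed=None):
    """Sample n_pairs document-word pairs according to the PLSA generative process.

    p_g:         (k,)   prior P(G_m)
    p_d_given_g: (n, k) column m holds P(D_i | G_m)
    p_w_given_g: (d, k) column m holds P(w_j | G_m)
    """
    rng = np.random.default_rng(seed)
    k = len(p_g)
    n, d = p_d_given_g.shape[0], p_w_given_g.shape[0]
    counts = np.zeros((n, d), dtype=int)
    for _ in range(n_pairs):
        m = rng.choice(k, p=p_g)                   # 1. select aspect G_m
        i = rng.choice(n, p=p_d_given_g[:, m])     # 2. generate document index i ...
        j = rng.choice(d, p=p_w_given_g[:, m])     #    ... and word index j independently
        counts[i, j] += 1                          # increment entry (i, j)
    return counts
```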

SLIDE 25

Topic Modeling

Probabilistic Latent Semantic Analysis

An important assumption in PLSA is that the selected documents and words are conditionally independent once the latent topical component Gm has been fixed:

P(Di,wj | Gm) = P(Di | Gm) P(wj | Gm)    (6)

This implies that the joint probability P(Di,wj) of selecting a document-word pair can be expressed in the following way:

P(Di,wj) = ∑_{m=1}^{k} P(Gm) P(Di,wj | Gm) = ∑_{m=1}^{k} P(Gm) P(Di | Gm) P(wj | Gm)    (7)

Local independence between documents and words does not imply global independence.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 142 / 179

SLIDE 26

Topic Modeling

Probabilistic Latent Semantic Analysis

In PLSA, the posterior probability P(Gm|Di,wj) of the latent component associated with a particular document-word pair is estimated. The EM algorithm starts by initializing P(Gm), P(Di|Gm) and P(wj|Gm) to 1/k, 1/n, and 1/d, respectively, where k is the number of aspects, n the number of documents, and d the number of words.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 143 / 179

SLIDE 27

Topic Modeling

Probabilistic Latent Semantic Analysis

The algorithm iteratively executes the following E- and M-steps to convergence:

1. (E-step) Estimate the posterior probability P(Gm|Di,wj) in terms of P(Gm), P(Di|Gm) and P(wj|Gm).

2. (M-step) Estimate P(Gm), P(Di|Gm) and P(wj|Gm) in terms of the posterior probability P(Gm|Di,wj) and the observed data about word-document co-occurrence, using log-likelihood maximization.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 144 / 179

SLIDE 28

Topic Modeling

Probabilistic Latent Semantic Analysis - E-step

The posterior probability estimated in the E-step can be expanded using Bayes' rule:

P(Gm | Di, wj) = P(Gm) P(Di,wj | Gm) / P(Di,wj)    (8)

Expanding the numerator via (6) and the denominator via (7), we obtain

P(Gm | Di, wj) = P(Gm) P(Di | Gm) P(wj | Gm) / ∑_{r=1}^{k} P(Gr) P(Di | Gr) P(wj | Gr)    (9)

This shows that the E-step can be implemented in terms of P(Gm), P(Di | Gm), and P(wj | Gm).

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 145 / 179

SLIDE 29

Topic Modeling

Probabilistic Latent Semantic Analysis - M-step

P(Gm | Di, wj) may be viewed as a weight attached to each word-document co-occurrence pair for aspect Gm. These weights can be used to estimate P(Gm), P(Di | Gm) and P(wj | Gm) via the following update rules (shown without proof):

P(Di | Gm) ∝ ∑_{wj} f(Di,wj) P(Gm | Di,wj)    ∀ i ∈ 1,...,n, m ∈ 1,...,k    (10)

P(wj | Gm) ∝ ∑_{Di} f(Di,wj) P(Gm | Di,wj)    ∀ j ∈ 1,...,d, m ∈ 1,...,k    (11)

P(Gm) ∝ ∑_{Di} ∑_{wj} f(Di,wj) P(Gm | Di,wj)    ∀ m ∈ 1,...,k    (12)

Here f(Di,wj) is the observed frequency of word wj in document Di.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 146 / 179
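The E-step (Eq. 9) and M-step (Eqs. 10-12) can be written compactly with numpy. The sketch below is one possible illustrative implementation; note that it perturbs the initialization randomly, since an exactly uniform initialization as described above would leave all aspects identical.

```python
import numpy as np

def plsa_em(F, k, n_iter=100, seed=0, eps=1e-12):
    """PLSA via EM on an n x d frequency matrix F; returns P(G), P(D|G), P(w|G)."""
    n, d = F.shape
    rng = np.random.default_rng(seed)
    p_g = np.full(k, 1.0 / k)                                  # P(G_m)
    p_d_g = rng.random((n, k)); p_d_g /= p_d_g.sum(axis=0)     # P(D_i | G_m), columns sum to 1
    p_w_g = rng.random((d, k)); p_w_g /= p_w_g.sum(axis=0)     # P(w_j | G_m), columns sum to 1

    for _ in range(n_iter):
        # E-step, Eq. (9): posterior P(G_m | D_i, w_j), shape (n, d, k)
        joint = p_g[None, None, :] * p_d_g[:, None, :] * p_w_g[None, :, :]
        post = joint / (joint.sum(axis=2, keepdims=True) + eps)

        # M-step, Eqs. (10)-(12): weight co-occurrences by the posterior, then normalize
        weighted = F[:, :, None] * post                        # f(D_i, w_j) * P(G_m | D_i, w_j)
        p_d_g = weighted.sum(axis=1)                           # Eq. (10): sum over words
        p_d_g /= p_d_g.sum(axis=0, keepdims=True) + eps
        p_w_g = weighted.sum(axis=0)                           # Eq. (11): sum over documents
        p_w_g /= p_w_g.sum(axis=0, keepdims=True) + eps
        p_g = weighted.sum(axis=(0, 1))                        # Eq. (12): sum over both
        p_g /= p_g.sum() + eps
    return p_g, p_d_g, p_w_g
```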

SLIDE 30

Topic Modeling

Probabilistic Latent Semantic Analysis - Dimensionality Reduction

The three key sets of parameters estimated by the M-step are P(Gm), P(Di|Gm) and P(wj|Gm). These sets of parameters provide an SVD-like matrix factorization of the n × d document-term matrix D. Assume that D is scaled so that its entries sum to an aggregate probability of 1. Then the (i,j)th entry of D can be viewed as an observed instantiation of the probabilistic quantity P(Di,wj). Let Lk be the n × k matrix whose (i,m)th entry is P(Di|Gm). Let Δk be the k × k diagonal matrix whose mth diagonal entry is P(Gm). Let Rk be the d × k matrix whose (j,m)th entry is P(wj|Gm).

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 147 / 179

SLIDE 31

Topic Modeling

Probabilistic Latent Semantic Analysis - Dimensionality Reduction

Then the (i,j)th entry P(Di,wj) of the matrix D can be expressed in terms of the entries of the aforementioned matrices according to (7), which is replicated here:

P(Di,wj) = ∑_{m=1}^{k} P(Gm) P(Di,wj | Gm) = ∑_{m=1}^{k} P(Gm) P(Di | Gm) P(wj | Gm)    (13)

The left hand side is equal to entry (i,j) of D. The right hand side is equal to entry (i,j) of Lk Δk Rk⊺.

For a limited number of components k, the right hand side can only approximate the matrix D; this approximation is denoted by Dk.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 148 / 179

SLIDE 32

Topic Modeling

Probabilistic Latent Semantic Analysis - Dimensionality Reduction

In matrix notation, we then have: Dk = Lk Δk Rk⊺.

The transformed representation in k-dimensional space is LkΔk. The transformed representations will differ between PLSA and LSA: LSA optimizes the mean-squared error, while PLSA maximizes the log-likelihood fit to a probabilistic generative model. Both representations capture synonymy and polysemy. In PLSA, unlike LSA, the columns of Rk are non-negative and have a clear probabilistic meaning: they allow one to infer the topical words of the corresponding aspects. In LSA, unlike PLSA, the transformation can be interpreted as a rotation of an orthonormal axis system, which can also be applied to out-of-sample documents.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 149 / 179
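Continuing the EM sketch above (illustrative only; variable names follow the earlier sketches), the estimated parameters can be assembled into the factorization Dk = LkΔkRk⊺, the k-dimensional document representation LkΔk, and the topical words per aspect.

```python
import numpy as np

# p_g, p_d_g, p_w_g as returned by plsa_em(); `lexicon` as in the TF sketch
L_k     = p_d_g                    # (i, m) entry: P(D_i | G_m)
Delta_k = np.diag(p_g)             # m-th diagonal entry: P(G_m)
R_k     = p_w_g                    # (j, m) entry: P(w_j | G_m)

D_k = L_k @ Delta_k @ R_k.T        # approximation of the (probability-scaled) matrix D
doc_embedding = L_k @ Delta_k      # transformed k-dimensional document representation

# Topical words of each aspect: largest entries in the corresponding column of R_k
top_words = [[lexicon[j] for j in np.argsort(R_k[:, m])[::-1][:5]]
             for m in range(R_k.shape[1])]
```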

SLIDE 33

Topic Modeling

Probabilistic Latent Semantic Analysis - Limitations

Although the PLSA method is an intuitively sound model for probabilistic modeling, it does have a number of practical drawbacks. The number of parameters grows linearly with the number of documents. Therefore, such an approach can be slow and may overfit the training data because of the large number of estimated parameters. Furthermore, while PLSA provides a generative model of document-word pairs in the training data, it cannot easily assign probabilities to previously unseen documents. In contrast, other EM models, such as Latent Dirichlet Allocation, transfer to unseen documents as well.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 150 / 179

SLIDE 34

2.3 Transduction

based on: Thorsten Joachims, Transductive Inference for Text Classification using Support Vector Machines. ICML 1999: 200-209; source of the figures in this section.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 151 / 179

SLIDE 35

Transduction

Known test set

Classification on text databases often means that we know all the data we will work with before training. Hence the test set is known a priori. This setting is called 'transductive'. Can we define classifiers that exploit the known test set? Yes!

Transductive SVM (Joachims, ICML 1999)

Trains the SVM on both training and test set
Uses the test data to maximise the margin

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 152 / 179

SLIDE 36

Transduction

Inductive vs. Transductive Classification

Task: predict label y from features x

Classic inductive setting

Strategy: Learn classifier on (labelled) training data
Goal: Classifier shall generalise to unseen data from same distribution

Transductive setting

Strategy: Learn classifier on (labelled) training data AND a given (unlabelled) test dataset
Goal: Predict class labels for this particular dataset

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 153 / 179

SLIDE 37

Transduction

Why transduction?

Classic approach works: train on the training dataset, test on the test dataset. That is what we usually do in practice, for instance in cross-validation. We usually ignore or neglect the fact that such settings are transductive.

The benefits of transductive classification

Inductive setting: infinitely many potential classifiers
Transductive setting: finite number of equivalence classes of classifiers
f and f′ are in the same equivalence class ⇔ f and f′ classify the points from the training and test dataset identically

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 154 / 179

SLIDE 38

Transductive SVM

Learning-theoretic argument

Risk on test data ≤ risk on training data + confidence interval (depends on the number of equivalence classes)
Theorem by Vapnik (1998): The larger the margin, the lower the number of equivalence classes that contain a classifier with this margin
Find the hyperplane that separates the classes in the training data AND in the test data with maximum margin.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 155 / 179

SLIDE 39

Transductive SVM

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 156 / 179

SLIDE 40

Transductive SVM

[Figure: example binary document-term matrix for documents D1-D6 over the terms 'salt', 'basil', 'parsley', 'atom', 'physics', 'nuclear'.]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 157 / 179

SLIDE 41

Transductive SVM

Linearly separable case

min_{w, b, y∗}   (1/2) ∥w∥²

s.t.   ∀ i = 1,...,n:   yi [w⊺xi + b] ≥ 1
       ∀ j = 1,...,k:   y∗j [w⊺x∗j + b] ≥ 1

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 158 / 179

SLIDE 42

Transductive SVM

Non-linearly separable case

min_{w, b, y∗, ξ, ξ∗}   (1/2) ∥w∥² + C ∑_{i=1}^{n} ξi + C∗ ∑_{j=1}^{k} ξ∗j

s.t.   ∀ i = 1,...,n:   yi [w⊺xi + b] ≥ 1 − ξi
       ∀ j = 1,...,k:   y∗j [w⊺x∗j + b] ≥ 1 − ξ∗j
       ∀ i = 1,...,n:   ξi ≥ 0
       ∀ j = 1,...,k:   ξ∗j ≥ 0

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 159 / 179

SLIDE 43

Transductive SVM

Optimisation

How to solve this OP? Not so 'nice': it is a combination of an integer and a convex optimisation problem. Joachims' approach: find an approximate solution by iterative application of the inductive SVM.

1. Train an inductive SVM on the training data, predict on the test data, and assign these labels to the test data.
2. Retrain on all data, with special slack weights (C∗−, C∗+) for the test data.
3. Outer loop: Repeat and slowly increase (C∗−, C∗+).
4. Inner loop: Within each repetition, switch pairs of 'misclassified' data points repeatedly.

Local search with an approximate solution to the OP

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 160 / 179
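The following is a much-simplified, illustrative sketch of the self-labelling idea (label the test set with an inductive SVM, then re-train on everything while slowly increasing the weight of the test examples). It is not Joachims' full algorithm: the pairwise label-switching inner loop is omitted, and scikit-learn's SVC with per-sample weights is used as a stand-in for the separate C, C∗−, C∗+ penalties.

```python
import numpy as np
from sklearn.svm import SVC

def simple_transductive_svm(X_train, y_train, X_test, n_rounds=5, C=1.0, c_star_max=1.0):
    """Toy self-labelling loop inspired by the transductive SVM (illustration only)."""
    clf = SVC(kernel="linear", C=C)
    clf.fit(X_train, y_train)
    y_star = clf.predict(X_test)                   # step 1: label the test data

    X_all = np.vstack([X_train, X_test])
    for r in range(1, n_rounds + 1):
        c_star = c_star_max * r / n_rounds         # outer loop: slowly increase test weight
        weights = np.concatenate([np.ones(len(X_train)),
                                  np.full(len(X_test), c_star)])
        clf = SVC(kernel="linear", C=C)
        clf.fit(X_all, np.concatenate([y_train, y_star]), sample_weight=weights)
        y_star = clf.predict(X_test)               # re-label the test data
    return clf, y_star
```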

SLIDE 44

Transductive SVM: Optimization

Variant of inductive SVM

min_{w, b, y∗, ξ, ξ∗}   (1/2) ∥w∥² + C ∑_{i=1}^{n} ξi + C∗− ∑_{j: y∗j = −1} ξ∗j + C∗+ ∑_{j: y∗j = +1} ξ∗j

s.t.   ∀ i = 1,...,n:   yi [w⊺xi + b] ≥ 1 − ξi
       ∀ j = 1,...,k:   y∗j [w⊺x∗j + b] ≥ 1 − ξ∗j

Three different penalty costs

C for points from the training dataset
C∗− for points from the test dataset currently in class −1
C∗+ for points from the test dataset currently in class +1

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 161 / 179

SLIDE 45

Transductive SVM: Optimisation

[Figure: positive (+) and negative (−) training and test points with slack variables ξi (training) and ξ∗j (test), illustrating the sequence: train, predict → re-train → re-predict.]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 162 / 179

SLIDE 46

Transductive SVM: Experiments

Average P/R-breakeven point on the Reuters dataset for different training set sizes and a test size of 3,299

[Figure: average P/R-breakeven point (20-100) as a function of the number of examples in the training set (17 to 9,603) for the Transductive SVM, SVM, and Naive Bayes.]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 163 / 179

SLIDE 47

Transductive SVM: Experiments

Average P/R-breakeven point on the Reuters dataset for 17 training documents and varying test set size for the TSVM

[Figure: average P/R-breakeven point (10-100) as a function of the number of examples in the test set (206 to 3,299) for the Transductive SVM, SVM, and Naive Bayes.]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 164 / 179

SLIDE 48

Transductive SVM: Experiments

Average P/R-breakeven point on the WebKB category ’course’ for different training set sizes

[Figure: P/R-breakeven point for class 'course' (20-100) as a function of the number of examples in the training set (9 to 226) for the Transductive SVM, SVM, and Naive Bayes.]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 165 / 179

SLIDE 49

Transductive SVM: Experiments

Average P/R-breakeven point on the WebKB category ’project’ for different training set sizes

[Figure: P/R-breakeven point for class 'project' (20-100) as a function of the number of examples in the training set (9 to 226) for the Transductive SVM, SVM, and Naive Bayes.]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 166 / 179

SLIDE 50

Transductive SVM: Summary

Results

Transductive version of SVM
Maximizes margin on training and test data
Implementation uses variant of classic inductive SVM
Solution is approximate and fast
Works well on text, in particular on small training samples and large test sets

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 167 / 179

SLIDE 51

2.4 Cotraining

based on: Avrim Blum, Tom M. Mitchell, Combining Labeled and Unlabeled Data with Co-Training. COLT 1998: 92-100

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 168 / 179

SLIDE 52

Cotraining

Goals

To understand that hyperlinks define a second view of documents.
To understand that this view can be used to infer class labels for an augmented training dataset and to improve prediction accuracy.
To understand how this concept of cotraining generalizes to other domains.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 169 / 179

SLIDE 53

Cotraining

Motivation

In text mining: Besides their content in the form of words, texts nowadays carry hyperlinks that point to related pages. Can this second type of information on a website be used to improve classification? In general: How can classification be improved if there is plenty of unlabeled data in the form of a second view of the data? Yes: the second view can be used to infer class labels of unlabeled data points, which augment the training dataset.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 170 / 179

SLIDE 54

Cotraining

Classic cotraining algorithm

Blum and Mitchell's cotraining uses two classifiers, trained on separate views of the data, to create pseudo-labels for those unlabeled data points for which the predictors are most confident about their predictions. The pseudo-labels are then used to retrain the classifiers, before repeating the pseudo-label generation. The entire process is repeated for k iterations.

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 171 / 179
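An illustrative sketch of the cotraining loop, simplified from Blum and Mitchell's algorithm: two Naive Bayes classifiers, one per view, repeatedly label the unlabelled examples they are most confident about, and those examples are added to the labelled pool. The choice of classifier, the pool handling, and the confidence criterion are assumptions for this example.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def cotrain(X1, X2, y, labelled, n_rounds=10, per_round=2):
    """Simplified cotraining on two aligned views X1, X2 (e.g. word counts and hyperlink words).

    `labelled` is a boolean mask of the initially labelled rows; entries of `y`
    for unlabelled rows are ignored and get filled with pseudo-labels."""
    y = np.asarray(y).copy()
    labelled = np.asarray(labelled).copy()
    h1, h2 = MultinomialNB(), MultinomialNB()
    for _ in range(n_rounds):
        if labelled.all():
            break
        h1.fit(X1[labelled], y[labelled])
        h2.fit(X2[labelled], y[labelled])
        for h, X in ((h1, X1), (h2, X2)):
            unlab = np.flatnonzero(~labelled)
            if len(unlab) == 0:
                break
            proba = h.predict_proba(X[unlab])
            order = np.argsort(proba.max(axis=1))[::-1][:per_round]   # most confident predictions
            pick = unlab[order]
            y[pick] = h.classes_[proba[order].argmax(axis=1)]         # pseudo-labels
            labelled[pick] = True                                     # add to the labelled pool
    return h1, h2, y, labelled
```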

SLIDE 55

Cotraining

[Figure: schematic of one cotraining iteration. Classifiers h1 and h2 are trained on the two views x1 and x2 of the labelled set L, classify unlabelled examples from a pool U′ sampled from U, and the most confident predictions are added to the labelled set: train → classify → sample → add.]

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 172 / 179

SLIDE 56

Cotraining: Pseudocode

Source: Blum and Mitchell, 1998

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 173 / 179

SLIDE 57

Cotraining

Why can unlabeled data help at all?

Assume an instance space X = X1 × X2, where X1 and X2 are different views of the data. Each view is assumed to be sufficient for correct classification. Let D be a distribution over X and let C1 and C2 be concept classes defined over X1 and X2, respectively. We assume that all labels on examples with non-zero probability under D are consistent with some target function f1 ∈ C1 and f2 ∈ C2. If f denotes the combined target concept over the entire example, then for any example x = (x1,x2) observed with label l, we have f (x) = f1(x1) = f2(x2) = l. This means that D assigns probability zero to any example (x1,x2) such that f1(x1) ≠ f2(x2).

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 174 / 179

SLIDE 58

Cotraining

Why can unlabeled data help at all?

For a given D over X, we define f = (f1,f2) ∈ C1 × C2 as being compatible with D if it satisfies the condition that D assigns zero probability to the set of examples (x1,x2) such that f1(x1) ≠ f2(x2). The set of compatible target functions is typically much simpler and smaller than the entire concept class they are from. As in the transductive SVM, a reduction in the equivalence classes of the target functions leads to an improved bound on the test error!

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 175 / 179

SLIDE 59

Cotraining: Graph Representation of Key Idea

Source: Blum and Mitchell, 1998

D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 176 / 179