- 2. Text Mining
D-BSSE Karsten Borgwardt Data Mining II Course, Basel Spring Semester 2016 118 / 179
Text Mining: Goals
To learn key problems and techniques in mining one of the most common types of data
To learn how to represent text
based on: Charu Aggarwal, Data Mining - The Textbook, Springer 2015, Chapter 13
$\sum_{t' \in T} \mathrm{freq}(d, t')$
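The surviving sum reads like the normalizer of a relative term frequency. As a minimal sketch, assuming the intended quantity is freq(d,t) divided by the sum of freq(d,t′) over all terms t′ in the vocabulary T (the function name and the whitespace tokenization are illustrative, not from the slides):

```python
from collections import Counter

def relative_term_frequency(tokens):
    """Return freq(d, t) / sum over t' of freq(d, t') for every term t in one document."""
    counts = Counter(tokens)            # freq(d, t) for each term t
    total = sum(counts.values())        # the normalizing sum over all terms t'
    if total == 0:
        return {}                       # empty document: no terms to weight
    return {term: count / total for term, count in counts.items()}

# Assumed toy document, tokenized by whitespace.
tf = relative_term_frequency("text mining mines text".split())
```

Under this normalization the weights of one document sum to 1, so documents of different lengths become comparable.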
1v2
based on: Charu Aggarwal, Data Mining - The Textbook, Springer 2015, Chapter 2.4.4.3 and 13.4
[Figure: factorization of the d × n document-word matrix into a d × k document-topic matrix, a k × k topic-importance matrix, and a k × n topic-word matrix.]
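The factorization in the figure can be sketched with a truncated singular value decomposition. This assumes the three factors are the usual SVD factors used in latent semantic analysis; the toy matrix and the helper name `truncated_svd` are illustrative only.

```python
import numpy as np

def truncated_svd(D, k):
    """Split a d x n matrix into (d x k), (k x k) and (k x n) factors via SVD."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    # Keep only the k largest singular values and their singular vectors.
    return U[:, :k], np.diag(s[:k]), Vt[:k, :]

# Assumed toy data: 4 documents (rows) over 5 words (columns), two word groups.
D = np.array([[2, 1, 0, 0, 0],
              [1, 2, 0, 0, 0],
              [0, 0, 1, 2, 0],
              [0, 0, 2, 1, 0]], dtype=float)

L, S, Rt = truncated_svd(D, k=2)   # documents x topics, topic importance, topics x words
D_k = L @ S @ Rt                   # rank-k approximation of D
```

The diagonal of `S` holds the topic importances in decreasing order, and `D_k` is the best rank-k approximation of `D` in the Frobenius norm.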
[Figure: probabilistic factorization of the d × n matrix D = [P(doc_i, word_j)] into the d × k matrix L_k = [P(doc_i | topic_m)], the k × k matrix of prior probabilities P(topic_m), and the k × n matrix R_k^T = [P(word_j | topic_m)].]
1 Select a latent component (aspect) Gm with probability P(Gm).
2 Generate the indices (i,j) of a document-word pair (Di,wj) with probabilities P(Di∣Gm) and P(wj∣Gm).
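The two generative steps can be simulated directly. The sketch below assumes that the document index and the word index are drawn independently once the aspect is fixed; the parameter values and the function name `sample_pairs` are made up for illustration.

```python
import random

def sample_pairs(p_topic, p_doc_given_topic, p_word_given_topic, num_pairs, seed=0):
    """Draw (i, j) document-word index pairs from the two-step generative process."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_pairs):
        # Step 1: select a latent aspect G_m with probability P(G_m).
        m = rng.choices(range(len(p_topic)), weights=p_topic)[0]
        # Step 2: draw the document index i from P(D_i | G_m) and the word index j from P(w_j | G_m).
        i = rng.choices(range(len(p_doc_given_topic[m])), weights=p_doc_given_topic[m])[0]
        j = rng.choices(range(len(p_word_given_topic[m])), weights=p_word_given_topic[m])[0]
        pairs.append((i, j))
    return pairs

# Assumed toy parameters: aspect 0 covers documents/words 0-1, aspect 1 covers document/word 2.
p_topic = [0.6, 0.4]
p_doc_given_topic = [[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]
p_word_given_topic = [[0.7, 0.3, 0.0], [0.0, 0.0, 1.0]]
pairs = sample_pairs(p_topic, p_doc_given_topic, p_word_given_topic, num_pairs=100)
```

Counting how often each pair (i, j) occurs yields a document-word frequency matrix of the kind the model is fitted to.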
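$P(D_i, w_j) = \sum_{m=1}^{k} P(G_m)\, P(D_i, w_j \mid G_m) = \sum_{m=1}^{k} P(G_m)\, P(D_i \mid G_m)\, P(w_j \mid G_m)$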
1 (E-step) Estimate posterior probability P(Gm∣Di,wj) in terms of P(Gm), P(Di∣Gm) and P(wj∣Gm).
2 (M-step) Estimate P(Gm), P(Di∣Gm) and P(wj∣Gm) in terms of the posterior probability P(Gm∣Di,wj).
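$P(G_m \mid D_i, w_j) = \dfrac{P(G_m)\, P(D_i \mid G_m)\, P(w_j \mid G_m)}{\sum_{r=1}^{k} P(G_r)\, P(D_i \mid G_r)\, P(w_j \mid G_r)}$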
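$P(D_i \mid G_m) \propto \sum_{w_j} f(D_i, w_j)\, P(G_m \mid D_i, w_j)$
$P(w_j \mid G_m) \propto \sum_{D_i} f(D_i, w_j)\, P(G_m \mid D_i, w_j)$
$P(G_m) \propto \sum_{D_i} \sum_{w_j} f(D_i, w_j)\, P(G_m \mid D_i, w_j)$
where $f(D_i, w_j)$ is the observed frequency of word $w_j$ in document $D_i$.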
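The E-step and M-step can be put together in a short loop. This is a minimal sketch rather than the course's implementation: it assumes the M-step weights each posterior by the observed count of word j in document i (called `F[i, j]` here), and the random initialization, iteration count, and function name `plsa` are illustrative choices.

```python
import numpy as np

def plsa(F, k, n_iter=100, seed=0):
    """EM for the aspect model on a d x n count matrix F; returns P(G), P(D|G), P(w|G)."""
    rng = np.random.default_rng(seed)
    d, n = F.shape
    p_g = np.full(k, 1.0 / k)                               # P(G_m)
    p_d_g = rng.random((d, k)); p_d_g /= p_d_g.sum(axis=0)  # P(D_i | G_m), columns sum to 1
    p_w_g = rng.random((n, k)); p_w_g /= p_w_g.sum(axis=0)  # P(w_j | G_m), columns sum to 1
    for _ in range(n_iter):
        # E-step: posterior[i, j, m] = P(G_m | D_i, w_j) by Bayes' rule.
        joint = p_g[None, None, :] * p_d_g[:, None, :] * p_w_g[None, :, :]
        posterior = joint / np.maximum(joint.sum(axis=2, keepdims=True), 1e-300)
        # M-step: re-estimate all parameters from count-weighted posteriors.
        weighted = F[:, :, None] * posterior
        p_d_g = weighted.sum(axis=1)
        p_w_g = weighted.sum(axis=0)
        p_g = weighted.sum(axis=(0, 1))
        p_d_g /= np.maximum(p_d_g.sum(axis=0), 1e-300)
        p_w_g /= np.maximum(p_w_g.sum(axis=0), 1e-300)
        p_g /= p_g.sum()
    return p_g, p_d_g, p_w_g

# Assumed toy counts: 3 documents over 4 words with two clear word groups.
F = np.array([[2, 2, 0, 0],
              [2, 2, 0, 0],
              [0, 0, 3, 3]], dtype=float)
p_g, p_d_g, p_w_g = plsa(F, k=2)
```

EM only reaches a local optimum, so different seeds can give different aspects; the normalization of the three outputs holds regardless.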
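$P(D_i, w_j) = \sum_{m=1}^{k} P(G_m)\, P(D_i \mid G_m)\, P(w_j \mid G_m) = \sum_{m=1}^{k} [L_k]_{im}\, P(G_m)\, [R_k]_{jm}$, that is, $D \approx L_k\, \mathrm{diag}\big(P(G_1), \dots, P(G_k)\big)\, R_k^{\top}$.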
The factors $L_k$ and $R_k$ are non-negative and have a clear probabilistic interpretation.
based on: Thorsten Joachims, Transductive Inference for Text Classification using Support Vector Machines. ICML 1999: 200-209, Source of the figures in this section
[Table: word-occurrence matrix of documents D1 to D6 over the words salt, and, basil, parsley, atom, physics, nuclear.]
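$\min_{w, b, y^*} \ \frac{1}{2} \lVert w \rVert^2$
subject to
$\forall_{i=1}^{n}: \ y_i [w^{\top} x_i + b] \ge 1$
$\forall_{j=1}^{k}: \ y_j^* [w^{\top} x_j^* + b] \ge 1$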
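$\min_{w, b, y^*, \xi, \xi^*} \ \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i + C^* \sum_{j=1}^{k} \xi_j^*$
subject to
$\forall_{i=1}^{n}: \ y_i [w^{\top} x_i + b] \ge 1 - \xi_i$
$\forall_{j=1}^{k}: \ y_j^* [w^{\top} x_j^* + b] \ge 1 - \xi_j^*$
$\forall_{i=1}^{n}: \ \xi_i \ge 0$
$\forall_{j=1}^{k}: \ \xi_j^* \ge 0$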
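For a fixed hyperplane (w, b) and fixed test labels y∗, the smallest slacks that satisfy the constraints are the hinge values max(0, 1 − y[w⊺x + b]), so the soft-margin objective can be evaluated directly. A small sketch with made-up numbers; the function names are illustrative and the objective is assumed to be ½‖w‖² plus the two weighted slack sums:

```python
def hinge_slack(w, b, x, y):
    """Smallest slack satisfying y * (w.x + b) >= 1 - slack and slack >= 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)

def tsvm_objective(w, b, X, y, X_star, y_star, C, C_star):
    """0.5 * ||w||^2 + C * (training slacks) + C_star * (test slacks)."""
    train_slack = sum(hinge_slack(w, b, x, yi) for x, yi in zip(X, y))
    test_slack = sum(hinge_slack(w, b, x, yj) for x, yj in zip(X_star, y_star))
    return 0.5 * sum(wi * wi for wi in w) + C * train_slack + C_star * test_slack

# Assumed toy values: two labeled points and one test point with a guessed label.
w, b = [1.0, 0.0], 0.0
X, y = [[2.0, 0.0], [-0.5, 0.0]], [1, -1]
X_star = [[0.5, 0.0]]
value_pos = tsvm_objective(w, b, X, y, X_star, [1], C=1.0, C_star=0.5)
value_neg = tsvm_objective(w, b, X, y, X_star, [-1], C=1.0, C_star=0.5)
```

Comparing `value_pos` and `value_neg` shows how the choice of test label changes the objective for the same hyperplane, which is why y∗ appears among the optimization variables.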
1 Train inductive SVM on training data, predict on test data, assign labels to test data.
2 Retrain on all data, with special slack weights for test data ($C^*_-$, $C^*_+$).
3 Outer loop: Repeat and slowly increase ($C^*_-$, $C^*_+$).
4 Inner loop: Within each repetition, switch pairs of ‘misclassified’ data points repeatedly.
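The four steps can be sketched as a small training loop. Several simplifications are assumptions of this sketch and not taken from the slides: a plain subgradient-descent linear SVM stands in for a real solver, one slack weight is used for both test classes instead of separate weights per class, initial test labels come from the sign of the decision function, and a pair is switched when both slacks are positive and sum to more than 2, the criterion given in Joachims (1999).

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_linear_svm(X, y, costs, epochs=300, lr=0.01):
    """Stand-in solver: subgradient descent on 0.5*||w||^2 + sum_i costs[i]*hinge_i."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = list(w), 0.0            # gradient of 0.5*||w||^2 is w
        for x, yi, c in zip(X, y, costs):
            if yi * (dot(w, x) + b) < 1:         # margin violated: hinge term is active
                grad_w = [g - c * yi * xt for g, xt in zip(grad_w, x)]
                grad_b -= c * yi
        w = [wt - lr * g for wt, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def find_switch(slack, y_star):
    """Return (m, l): a positive and a negative test point whose labels should be swapped."""
    for m, (sm, ym) in enumerate(zip(slack, y_star)):
        for l, (sl, yl) in enumerate(zip(slack, y_star)):
            if ym == 1 and yl == -1 and sm > 0 and sl > 0 and sm + sl > 2:
                return m, l
    return None

def tsvm(X, y, X_star, C=1.0, C_star=1.0, max_switches=100):
    # Step 1: inductive SVM on the training data, then label the test data.
    w, b = train_linear_svm(X, y, [C] * len(X))
    y_star = [1 if dot(w, x) + b > 0 else -1 for x in X_star]
    c, switches = min(1e-3, C_star), 0
    while True:                                   # Step 3: outer loop over the test slack weight
        while True:                               # Step 4: inner loop of label switches
            # Step 2: retrain on all data with the current test slack weight c.
            costs = [C] * len(X) + [c] * len(X_star)
            w, b = train_linear_svm(X + X_star, y + y_star, costs)
            slack = [max(0.0, 1 - ys * (dot(w, x) + b)) for x, ys in zip(X_star, y_star)]
            pair = find_switch(slack, y_star)
            if pair is None or switches >= max_switches:
                break
            m, l = pair
            y_star[m], y_star[l] = -1, 1
            switches += 1
        if c >= C_star:
            return w, b, y_star
        c = min(2 * c, C_star)                    # slowly increase the test slack weight

# Assumed toy data: two labeled points and four unlabeled test points.
X, y = [[-2.0, 0.0], [2.0, 0.0]], [-1, 1]
X_star = [[-3.0, 0.5], [-2.5, -1.0], [2.5, 1.0], [3.0, -0.5]]
w, b, y_star = tsvm(X, y, X_star)
```

Starting with a small test slack weight lets the labeled data dominate early on, so the test labels only gradually gain influence on the hyperplane.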
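$\min_{w, b, y^*, \xi, \xi^*} \ \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i + C^*_- \sum_{j:\, y_j^* = -1} \xi_j^* + C^*_+ \sum_{j:\, y_j^* = 1} \xi_j^*$
subject to
$\forall_{i=1}^{n}: \ y_i [w^{\top} x_i + b] \ge 1 - \xi_i$
$\forall_{j=1}^{k}: \ y_j^* [w^{\top} x_j^* + b] \ge 1 - \xi_j^*$
$C^*_-$: slack weight for points in the test dataset currently in class −1
$C^*_+$: slack weight for points in the test dataset currently in class +1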
[Figure: slack variables ξi of training points and ξ∗j of test points.]
[Figure: average P/R-breakeven point versus number of examples in the training set (17 to 9603) for Transductive SVM, SVM, and Naive Bayes.]
[Figure: average P/R-breakeven point versus number of examples in the test set (206 to 3299) for Transductive SVM, SVM, and Naive Bayes.]
[Figure: P/R-breakeven point for class "course" versus number of examples in the training set (9 to 226) for Transductive SVM, SVM, and Naive Bayes.]
[Figure: P/R-breakeven point for class "project" versus number of examples in the training set (9 to 226) for Transductive SVM, SVM, and Naive Bayes.]
based on: Avrim Blum, Tom M. Mitchell, Combining Labeled and Unlabeled Data with Co-Training. COLT 1998: 92-100
[Figure: co-training loop over labeled (+) and unlabeled (?) examples with classifier h2; steps: train, classify, sample, add.]
Source: Blum and Mitchell, 1998
Source: Blum and Mitchell, 1998