Text Classification
- Dr. Ahmed Rafea
Supervised learning
– Learning to assign objects to classes given examples
– Learner (classifier)
– A typical supervised text learning scenario
– Difference between text classification and general M.L. classification techniques
Evaluating text classifiers
– In the subset scenario, each document d is associated with a subset of the classes; this reduces to a binary decision per class (e.g., "Sports" vs. "Not sports", the latter covering all remaining documents).
– For each (document, class) pair (d, c), maintain a 2x2 contingency matrix M_{d,c} recording whether d truly belongs to c and whether the classifier assigned d to c.
Micro-averaging (equal importance for each document):
– Aggregate all per-(document, class) matrices into one: M_μ = Σ_{(d,c)} M_{d,c}
– precision(M_μ) = M_μ[0,0] / (M_μ[0,0] + M_μ[1,0])
– recall(M_μ) = M_μ[0,0] / (M_μ[0,0] + M_μ[0,1])
Macro-averaging (equal importance for each class):
– Aggregate per class: M_c = Σ_d M_{d,c}
– precision(M_c) = M_c[0,0] / (M_c[0,0] + M_c[1,0])
– recall(M_c) = M_c[0,0] / (M_c[0,0] + M_c[0,1])
– Average the per-class measures, e.g. macro precision = (1/|C|) Σ_c precision(M_c)
Here M[0,0] counts true positives, M[0,1] false negatives (d in c but not assigned), and M[1,0] false positives.
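A minimal sketch of both averages, assuming the index convention above (M[0][0] = true positives, M[0][1] = false negatives, M[1][0] = false positives); the per-class counts are hypothetical toy data:

```python
# Micro- vs. macro-averaged precision/recall from 2x2 contingency
# matrices; the counts below are hypothetical toy data.

def precision(M):
    tp, fp = M[0][0], M[1][0]
    return tp / (tp + fp) if tp + fp else 0.0

def recall(M):
    tp, fn = M[0][0], M[0][1]
    return tp / (tp + fn) if tp + fn else 0.0

# Per-class matrices M_c, each already summed over documents d.
per_class = [
    [[40, 10], [5, 445]],    # "Sports"
    [[430, 20], [15, 35]],   # "Not sports"
]

# Micro-averaging: sum the matrices first, then take the measure.
M_mu = [[sum(M[i][j] for M in per_class) for j in range(2)]
        for i in range(2)]
print("micro:", precision(M_mu), recall(M_mu))

# Macro-averaging: take the measure per class, then average over classes.
k = len(per_class)
print("macro:", sum(precision(M) for M in per_class) / k,
      sum(recall(M) for M in per_class) / k)
```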
– Plot of precision vs. recall: the better classifier's curve stays higher across the range.
– Harmonic mean F1 = 2 · precision · recall / (precision + recall): discards classifiers that sacrifice one measure for the other.
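As a worked instance of why the harmonic mean is strict: a classifier with precision 0.9 but recall 0.1 gets F1 = 2 · 0.9 · 0.1 / (0.9 + 0.1) = 0.18, far below the arithmetic mean of 0.5, so trading one measure away is penalized heavily.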
Nearest neighbor learners
– Training: index each document and remember its class label.
– Testing: fetch the k most similar documents to the given document dq.
– Majority class among them wins.
– Alternative: weighted counts, i.e., counts of classes weighted by the corresponding similarity measure:
  s(dq, c) = Σ_{dc ∈ kNN(dq)} s(dq, dc)
– Alternative: a per-class offset bc, tuned by testing the classifier on a portion of the training data held out for this purpose:
  s(dq, c) = bc + Σ_{dc ∈ kNN(dq)} s(dq, dc)
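A minimal sketch of the weighted-count rule, assuming scikit-learn for TF-IDF vectors and cosine similarity; the toy corpus, labels, and offset dictionary are hypothetical:

```python
# k-NN text classification with similarity-weighted votes:
# s(dq, c) = b_c + sum of s(dq, dc) over dc in kNN(dq).
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train_docs = ["goal scored in the match", "election results announced",
              "team wins the cup", "parliament passed the bill"]
train_labels = ["sports", "politics", "sports", "politics"]

vec = TfidfVectorizer()
X = vec.fit_transform(train_docs)

def classify(dq, k=3, b=None):
    sims = cosine_similarity(vec.transform([dq]), X).ravel()
    nearest = sims.argsort()[::-1][:k]       # indices of the k nearest docs
    score = defaultdict(float, b or {})      # optional per-class offset b_c
    for i in nearest:
        score[train_labels[i]] += sims[i]    # vote weighted by similarity
    return max(score, key=score.get)

print(classify("the team scored a late goal"))        # weighted majority
print(classify("a new bill", b={"politics": 0.1}))    # with tuned offset b_c
```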
Nearest neighbor classification
– Classifying a test document dq involves as many inverted index lookups as there are distinct terms in dq,
– scoring the (possibly large number of) candidate documents that overlap with dq in at least one term,
– sorting by overall similarity, and picking the best k documents.
– Data is stored at the level of individual documents; there is no distillation of the training data into a simpler class model.
Workaround: cluster the training data.
– Find clusters in the data; store only a few statistical parameters per cluster; compare the test document with documents in only the most promising clusters.
– Drawbacks: ad-hoc choices for the number and size of clusters, and the best k is corpus-sensitive.
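A minimal sketch of this workaround, assuming scikit-learn's KMeans over stand-in document vectors; the cluster count and the number of probed clusters are exactly the kind of ad-hoc choices noted above:

```python
# Cluster the corpus once; at query time, score only documents in the
# clusters whose centroids are closest to the query.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((1000, 50))          # stand-in for tf-idf document vectors

km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)

def candidates(q, n_probe=2):
    """Indices of documents in the n_probe most promising clusters."""
    dists = np.linalg.norm(km.cluster_centers_ - q, axis=1)
    best = np.argsort(dists)[:n_probe]
    return np.flatnonzero(np.isin(km.labels_, best))

q = rng.random(50)
print(len(candidates(q)), "candidates instead of", X.shape[0])
```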
Feature selection
– With a vocabulary T of roughly 30,000 terms, the number of possible term subsets is 2^|T| = 2^30,000 ≈ 10^9,000, so exhaustive search over feature subsets is hopeless.
Purposes:
– Improve accuracy by avoiding overfitting.
– Maintain accuracy while discarding as many features as possible, saving a great deal of space for storing statistics.
Selection can be heuristic, guided by linguistic and domain knowledge.
– Eliminate "too frequent" or "too rare" terms, which carry little class information.
– Conflate word forms by stemming (e.g., Porter's algorithm).
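A minimal sketch of these heuristics, assuming NLTK is available for Porter stemming; the toy documents and the document-frequency cut-off fractions are illustrative choices, not values from the source:

```python
# Drop too-rare / too-frequent terms after stemming.
from collections import Counter
from nltk.stem import PorterStemmer

docs = [
    "the teams played great matches".split(),
    "the team won the final match".split(),
    "parliament debated the new bill".split(),
    "the bill passed after a long debate".split(),
]

stem = PorterStemmer().stem
# Document frequency of each stemmed term.
df = Counter(t for d in docs for t in {stem(w) for w in d})

n = len(docs)
lo, hi = 0.3 * n, 0.75 * n               # illustrative cut-off fractions
vocab = sorted(t for t, c in df.items() if lo <= c <= hi)
print(vocab)                              # ['bill', 'debat', 'match', 'team']
```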
Greedy inclusion algorithm (most commonly used in the text domain):
1. Compute, for each term, a measure of discrimination between the classes.
2. Arrange the terms in decreasing order of this measure.
3. Retain a number of the best terms (features) for the classifier.
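A minimal sketch of greedy inclusion, using scikit-learn's chi2 as the discrimination measure over stand-in data; the matrix, labels, and the cut-off of 300 terms are illustrative:

```python
# Greedy inclusion: score every term, sort, keep a prefix of the best.
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(100, 1000))   # stand-in term-count matrix
y = rng.integers(0, 2, size=100)           # stand-in class labels

scores, _ = chi2(X, y)                     # 1. discrimination measure per term
order = np.argsort(scores)[::-1]           # 2. decreasing order of the measure
X_reduced = X[:, order[:300]]              # 3. retain the best terms
print(X_reduced.shape)                     # (100, 300)
```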
The χ² test:
– Aggregates the deviations of observed values from the values expected under the hypothesis that term occurrence and class are independent.
– Let k_{i,1} and k_{i,0} be the numbers of class-i documents that do and do not contain the candidate term.
– The larger the value of χ², the lower is our belief that the independence hypothesis holds.
For a term t, with k_{l,m} denoting the number of documents whose class indicator is l and term indicator is m (l, m ∈ {0,1}), and n the total number of documents:

χ²(t) = n (k_{11} k_{00} - k_{10} k_{01})² / ((k_{11} + k_{10})(k_{01} + k_{00})(k_{11} + k_{01})(k_{10} + k_{00}))
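A minimal sketch checking this closed form against the aggregate-of-deviations definition; the 2x2 counts are hypothetical:

```python
# Verify the 2x2 closed form equals sum of (observed - expected)^2 / expected.
k11, k10, k01, k00 = 30, 10, 20, 140       # hypothetical term/class counts
n = k11 + k10 + k01 + k00

chi2_closed = (n * (k11 * k00 - k10 * k01) ** 2 /
               ((k11 + k10) * (k01 + k00) * (k11 + k01) * (k10 + k00)))

obs = {(1, 1): k11, (1, 0): k10, (0, 1): k01, (0, 0): k00}
row = {1: k11 + k10, 0: k01 + k00}         # class-indicator marginals
col = {1: k11 + k01, 0: k10 + k00}         # term-indicator marginals
chi2_sum = sum((obs[l, m] - row[l] * col[m] / n) ** 2 / (row[l] * col[m] / n)
               for l in (0, 1) for m in (0, 1))
print(chi2_closed, chi2_sum)               # both print the same value
```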
Markov blankets
– The Kullback-Leibler (KL) divergence measures the similarity or distance between two distributions.
– Let X be a feature in T and let M ⊆ T with X ∉ M. If the presence of M renders the presence of X unnecessary as a feature, then M is a Markov blanket for X.
– Technically, M is a Markov blanket for X if X is conditionally independent of (T ∪ {C}) \ (M ∪ {X}) given M.
– Eliminating a feature whose information is contained in other existing features does not increase the KL distance between Pr(C|T) and Pr(C|F), where F is the reduced feature set.
– Purpose: to cut down computational complexity, restrict the search for Markov blankets M to those with at most k features.
– For a given feature X, restrict the candidate members of M to the features most strongly correlated with X.
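A minimal sketch of the candidate restriction, using plain feature-feature correlation as a proxy for "most strongly correlated"; the stand-in matrix and the value of k are illustrative:

```python
# For feature X, candidate Markov-blanket members = the k features
# most strongly correlated with X.
import numpy as np

rng = np.random.default_rng(0)
F = rng.random((200, 50))                   # stand-in feature matrix

def blanket_candidates(x_idx, k=5):
    corr = np.abs(np.corrcoef(F, rowvar=False)[x_idx])
    corr[x_idx] = -1.0                      # exclude the feature itself
    return np.argsort(corr)[::-1][:k]

print(blanket_candidates(7))
```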
Greedy inclusion vs. Markov blankets:
– Greedy inclusion may include features with high individual correlations to the class even though one subsumes the other.
– Features that are individually uncorrelated with the class could be jointly more correlated with it.