Foundations: Statistical Classification in Natural Language Processing
Dietrich Klakow
What is Classification?
Classification: telling things apart
Introduction
Spam/junk/bulk e-mails: the messages you spend your time with just to ...
Speech Recognition, Information Retrieval, Computational Linguistics, everything else
e.g. text classification
Coarse class   Fine class   Question
LOCATION       city         What city did Duke Ellington live in?
ENTITY         technique    What do sailors use to measure time?
DESCRIPTION    human        Who is Desmond Tutu?
LOCATION       mountain     Where is the highest point in Japan?
HUMAN          group        Who has won the most Super Bowls?
HUMAN          individual   Who killed Gandhi?
Most frequent question types:
Human:individual 18%, Location:other 9%, Description:definition 8%
50 different question types
band 532732 strip n band/2/1
band 532733 stripe n band/2/1.2
band 532734 range n band/2/2
band 532735 group n band/1/2
band 532736 mus n band/1/1
band 532744 brass n brass_band
band 532745 radio n band/2/2.1
band 532746 vb v band/1/3
band 532747 silver n silver_band
band 532756 steel n steel_band
band 532765 big n big_band
band 532782 dance n dance_band
band 532790 elastic n elastic_band
band 532806 march n marching_band
band 532814 man n one-man_band
band 532838 rubber n rubber_band
band 532903 ed n band/2/3
band 532949 saw n band_saw
band 532963 course n band_course
band 532979 pl n band/2/4
band 533487 vb2 a band/2/5
band 533495 portion n band/2/1.3
band 533508 waist n waistband
band 533520 ring n band/2/1.4
band 533522 sweat n sweat_band
band 533580 wrist n wristband//1
band 533705 vb3 v band/2/6
band 533706 vb4 v band/2/7
????
band 532736 mus n band/1/1
????
band 532734 range n band/2/2
????
band 532744 brass n brass_band
????
band 532838 rubber n rubber_band
Tag   Description                Example
CC    Coordinating conjunction   and, but, or
CD    Cardinal number
DT    Determiner
JJ    Adjective                  yellow
NN    Noun, sing. or mass        province
NNP   Proper noun, singular      IBM
RB    Adverb                     quickly, never
…     …                          …
VB    Verb, base form            eat
VBD   Verb, past tense           ate
Training: Training Data → Feature Extraction → Training Algorithm → Model
Classification: Test Data x_i → Feature Extraction → Classifier (uses the Model) → one of the classes ω1, ω2, …, ωn
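The data flow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the whitespace tokenizer, the count-based model, the overlap score used by the classifier, and the toy training data are all hypothetical choices that do not come from the slides.

```python
from collections import Counter, defaultdict


def extract_features(text):
    """Feature extraction: lower-cased whitespace tokens (an assumed, minimal choice)."""
    return text.lower().split()


def train(labelled_texts):
    """Training algorithm: count how often each feature occurs with each class."""
    model = defaultdict(Counter)
    for text, label in labelled_texts:
        model[label].update(extract_features(text))
    return model


def classify(model, text):
    """Classifier: pick the class whose training counts best cover the test features.

    The overlap score is a placeholder for a real decision rule.
    """
    features = extract_features(text)
    scores = {label: sum(counts[f] for f in features) for label, counts in model.items()}
    return max(scores, key=scores.get)


# Hypothetical toy training data, not from the slides.
TRAINING_DATA = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting agenda for monday", "not-spam"),
    ("lecture notes for monday", "not-spam"),
]

model = train(TRAINING_DATA)
print(classify(model, "money offer now"))
print(classify(model, "notes for the meeting"))
```

Training and test data pass through the same `extract_features` function, which mirrors the two "Feature Extraction" boxes in the diagram.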
ω1 = “spam”, ω2 = “not-spam”
How would you set up a decision rule?
Decision based on the priors P(ω1) (spam) and P(ω2) (not-spam) alone:
Classify every e-mail as not-spam: all spam e-mails are incorrectly classified.
Classify every e-mail as spam: all not-spam e-mails are incorrectly classified. Smaller number of e-mails with wrong label.
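A small sketch of this prior-only rule: if nothing but the priors is known, assigning every e-mail to the most probable class gives the fewest wrong labels. The prior values below are assumed for illustration only, since the slides give no numbers.

```python
def prior_only_decision(priors):
    """Return (class to assign to every item, resulting error rate)."""
    best = max(priors, key=priors.get)
    return best, 1.0 - priors[best]


# Assumed priors for illustration only; the slides give no numbers.
priors = {"spam": 0.7, "not-spam": 0.3}

for label in priors:
    # Labelling everything as `label` mislabels all items of the other class.
    error = 1.0 - priors[label]
    print(f"classify everything as {label}: error rate {error:.2f}")

decision, error = prior_only_decision(priors)
print(decision, round(error, 2))
```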
Ugly: the posterior P(ωk | x) is rarely available directly, because usually x is measured for a given class ωk, which gives P(x | ωk).
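$$\hat{\omega} = \arg\max_{\omega_k} P(\omega_k \mid x)$$
Use definition of cond. probability:
$$P(\omega_k \mid x) = \frac{P(x, \omega_k)}{P(x)} = \frac{P(x \mid \omega_k)\, P(\omega_k)}{P(x)}$$
$$\hat{\omega} = \arg\max_{\omega_k} \frac{P(x \mid \omega_k)\, P(\omega_k)}{P(x)}$$
P(x) does not affect decision:
$$\hat{\omega} = \arg\max_{\omega_k} P(x \mid \omega_k)\, P(\omega_k)$$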
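The decision rule can be illustrated with a short Python sketch. The priors and likelihoods are made-up numbers, and the function names are hypothetical. The second function normalizes by P(x) only to show that doing so leaves the chosen class unchanged.

```python
def bayes_decide(x, priors, likelihoods):
    """Return the class k that maximizes P(x | k) * P(k)."""
    return max(priors, key=lambda k: likelihoods[k].get(x, 0.0) * priors[k])


def posteriors(x, priors, likelihoods):
    """Return P(k | x) for every class k, i.e. the joint divided by P(x)."""
    joint = {k: likelihoods[k].get(x, 0.0) * priors[k] for k in priors}
    p_x = sum(joint.values())
    if p_x == 0.0:
        raise ValueError("feature has zero probability under every class")
    return {k: joint[k] / p_x for k in joint}


# Assumed numbers for illustration only.
priors = {"spam": 0.7, "not-spam": 0.3}
likelihoods = {
    "spam": {"money": 0.20, "meeting": 0.01},
    "not-spam": {"money": 0.02, "meeting": 0.15},
}

for word in ("money", "meeting"):
    post = posteriors(word, priors, likelihoods)
    print(word, bayes_decide(word, priors, likelihoods), post)
```

For "meeting" the likelihood outweighs the prior, so the rule picks not-spam even though spam is the more frequent class overall.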
Prior: $P(\omega_k)$
Posterior: $P(\omega_k \mid x)$
Type          Feature   MI(x,ω)   N(x,ω)   P(x|ω)/P(x)
NUM:count     many      0.015     322      13.7
HUM:ind       Who       0.013     498      4.46
NUM:count     How       0.011     336      6.23
LOC:other     Where     0.011     253      11.22
DESC:manner   How       0.010     274      7.52
LOC:country   country   0.007     120      32.01
NUM:date      When      0.007     124      26.23
DESC:def      is        0.006     284      3.48
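The formula behind the MI(x,ω) column did not survive in the slide text, so the following is only a sketch under an assumption: it treats MI(x,ω) as the contribution of one feature/class pair to the mutual information, P(x,ω) · log(P(x,ω) / (P(x) P(ω))), with a natural logarithm. The ratio inside the logarithm equals the P(x|ω)/P(x) column. The counts are hypothetical toy values, not the ones behind the table.

```python
import math


def association_scores(n_xw, n_x, n_w, n_total):
    """Return (assumed MI contribution, P(x|w)/P(x)) from raw counts.

    n_xw: count of feature x together with class w
    n_x: count of feature x, n_w: count of class w, n_total: all observations
    """
    p_xw = n_xw / n_total
    p_x = n_x / n_total
    p_w = n_w / n_total
    ratio = p_xw / (p_x * p_w)  # identical to P(x|w) / P(x)
    return p_xw * math.log(ratio), ratio


# Hypothetical toy counts.
mi, ratio = association_scores(n_xw=30, n_x=40, n_w=100, n_total=1000)
print(round(mi, 4), round(ratio, 2))

# A feature that is independent of the class has ratio 1 and contributes nothing.
mi_indep, ratio_indep = association_scores(n_xw=4, n_x=40, n_w=100, n_total=1000)
print(round(mi_indep, 4), round(ratio_indep, 2))
```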
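Absolute discounting:
$$P(x_i \mid \omega_k) = \begin{cases} \dfrac{N_{\omega_k}(x_i) - d}{N_{\omega_k}} + \alpha \dfrac{1}{V} & \text{if } N_{\omega_k}(x_i) > 0 \\[1ex] \alpha \dfrac{1}{V} & \text{else} \end{cases}$$
V: size of "feature vocabulary"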
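A minimal Python sketch of absolute discounting for a single class. The slides do not define α, so the code assumes the usual choice α = d · (number of seen features) / N, which hands back exactly the probability mass removed by the discount. The counts, the vocabulary, and d = 0.5 are hypothetical.

```python
from collections import Counter


def absolute_discounting(counts, vocabulary, d=0.5):
    """Smoothed P(x | class) for every x in the vocabulary.

    counts: feature counts observed for one class (features must be in vocabulary)
    d: discount subtracted from every non-zero count, with 0 < d < 1
    """
    n_total = sum(counts.values())
    v = len(vocabulary)
    seen = sum(1 for x in vocabulary if counts.get(x, 0) > 0)
    # Assumed definition of alpha: the total discounted mass, so the result sums to 1.
    alpha = d * seen / n_total
    probs = {}
    for x in vocabulary:
        n = counts.get(x, 0)
        if n > 0:
            probs[x] = (n - d) / n_total + alpha / v
        else:
            probs[x] = alpha / v
    return probs


# Hypothetical counts for one class.
counts = Counter({"who": 3, "is": 1})
vocabulary = ["who", "is", "where", "when"]
probs = absolute_discounting(counts, vocabulary, d=0.5)
print(probs)
print(sum(probs.values()))
```

Unseen features such as "where" receive a small non-zero probability, so a single unseen word no longer forces the product of probabilities for a class to zero.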
Proper smoothing important