Introduction to Classification and Sequence Labeling
Grzegorz Chrupała
Spoken Language Systems, Saarland University
Annual IRTG Meeting 2009
Notation for the dot (inner) product of vectors x and z:
◮ x · z
◮ ⟨x, z⟩
◮ xᵀz
◮ Density p(x|Yi) (continuous feature x)
◮ Probability P(x|Yi) (discrete feature x)
Equivalent discriminant functions gi(x) for the Bayes classifier:
◮ gi(x) = P(Yi|x) = p(x|Yi)P(Yi) / Σj p(x|Yj)P(Yj)
◮ gi(x) = p(x|Yi)P(Yi)
◮ gi(x) = ln p(x|Yi) + ln P(Yi)
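To make the log form concrete, here is a minimal sketch (not from the slides) that assumes one-dimensional Gaussian class-conditional densities, with priors, means, and variances estimated from labeled data:

import numpy as np

def fit_class_stats(x, y):
    # Estimate prior P(Yi), mean, and variance of p(x|Yi) for each class.
    stats = {}
    for c in np.unique(y):
        xc = x[y == c]
        stats[c] = (len(xc) / len(x), xc.mean(), xc.var())
    return stats

def discriminant(x, prior, mean, var):
    # gi(x) = ln p(x|Yi) + ln P(Yi), with Gaussian p(x|Yi).
    log_density = -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)
    return log_density + np.log(prior)

x = np.array([-2.0, -1.5, -1.0, 1.0, 1.5, 2.0])
y = np.array([0, 0, 0, 1, 1, 1])
stats = fit_class_stats(x, y)
print(max(stats, key=lambda c: discriminant(0.8, *stats[c])))  # predicts class 1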
For the two-class case a single discriminant suffices:
◮ g(x) = P(Y1|x) − P(Y2|x)
◮ g(x) = ln [p(x|Y1) / p(x|Y2)] + ln [P(Y1) / P(Y2)]
[Figure: class-conditional densities p(x|Yi) for classes Y1 and Y2 as functions of x, with the decision regions R1, R2, R1 marked along the x-axis.]
◮ Priors are easy to estimate for typical classification problems
◮ However, for class-conditional densities, training data is typically sparse!
◮ Density estimation: Parzen windows
◮ Use training examples to derive decision functions directly: k-nearest neighbors
◮ Assume a known form for the discriminant functions, and estimate their parameters from training data
◮ Memory-based learning
◮ Instance- or exemplar-based learning
◮ Similarity-based methods
◮ Case-based reasoning
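A minimal k-nearest-neighbor sketch (an illustration with Euclidean distance and majority voting, not the specific variant used in memory-based language processing):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Compare the test instance to every stored training example.
    dists = np.linalg.norm(X_train - x, axis=1)
    # Majority vote among the labels of the k nearest neighbors.
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([0.8, 0.9])))  # "B"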
For a table of counts nij with k rows and m columns:
◮ n·j = Σ_{i=1}^{k} nij (column totals)
◮ ni· = Σ_{j=1}^{m} nij (row totals)
◮ n·· = Σ_{i=1}^{k} Σ_{j=1}^{m} nij (grand total)
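In code these marginals are plain axis sums over the count table (a sketch assuming the nij are stored in a k × m array):

import numpy as np

n = np.array([[3, 1],
              [2, 4]])        # n[i, j] = nij
col_totals = n.sum(axis=0)    # n·j for each j
row_totals = n.sum(axis=1)    # ni· for each i
grand_total = n.sum()         # n··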
◮ During learning little “work” is done by the algorithm: the training examples are simply stored in memory
◮ During prediction the test instance is compared to the stored training examples, and the most similar ones determine the predicted label
[Figure: three candidate separating lines for the same data in the (x, y) plane: y = −1x − 0.5, y = −3x + 1, and y = 69x + 1.]
◮ Start with a zero weight vector and process each training example in turn.
◮ If the current weight vector classifies the current example incorrectly, move the weight vector in the direction of the example (add the example to the weights, scaled by its label).
◮ If the weights stop changing, stop.
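A minimal sketch of these steps for labels yi ∈ {−1, +1} (an illustration of the update rule, not the exact pseudocode from the slides):

import numpy as np

def perceptron(X, y, max_epochs=100):
    w = np.zeros(X.shape[1])              # start with a zero weight vector
    for _ in range(max_epochs):
        changed = False
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:   # current example misclassified
                w += yi * xi              # move w in the example's direction
                changed = True
        if not changed:                   # weights stopped changing: stop
            return w
    return w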
◮ This improves the chance that if the position of the data points is slightly perturbed, they will still fall on the correct side of the decision boundary
◮ A kernel function can be thought of as a dot product in some (typically higher-dimensional) feature space
◮ It can also be thought of as a similarity function in the input object space
φ(x) = (x1², √2 x1x2, x2²), so that φ(x) · φ(z) = (x · z)²
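A quick numerical check of this correspondence (a sketch; the explicit map φ above is the standard reconstruction for the quadratic kernel K(x, z) = (x · z)²):

import numpy as np

def phi(v):
    # Explicit quadratic feature map for two-dimensional input.
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])
print(np.dot(phi(x), phi(z)))  # 16.0: dot product in feature space
print(np.dot(x, z) ** 2)       # 16.0: same value via the kernel, no mapping needed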
The decision function depends only on the support vectors:
f(x) = sign( Σ_{i∈SV} α∗i yi (xi · x) + b∗ )
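For illustration, a library SVM exposes exactly these quantities after training; a sketch with scikit-learn (the slides do not prescribe any particular implementation):

import numpy as np
from sklearn.svm import SVC

X = np.array([[-2.0, -1.0], [-1.5, -0.5], [1.0, 1.5], [2.0, 1.0]])
y = np.array([-1, -1, 1, 1])
clf = SVC(kernel="linear").fit(X, y)
print(clf.support_vectors_)  # the xi with nonzero alpha
print(clf.dual_coef_)        # alpha_i * y_i for each support vector
print(clf.intercept_)        # b*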
◮ One-vs-rest (also known as one-vs-all): train |Y | binary classifiers, one per class against all others, and predict the class whose classifier scores highest
◮ One-vs-one: train |Y |(|Y | − 1)/2 pairwise binary classifiers, and choose the class that wins the most pairwise contests
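A one-vs-rest sketch on top of any binary learner (illustrative; train_binary and score are hypothetical helpers standing in for whichever binary classifier is used):

def ovr_train(X, y, labels, train_binary):
    # One binary problem per label: this label vs. all the rest.
    return {c: train_binary(X, [1 if yi == c else -1 for yi in y])
            for c in labels}

def ovr_predict(x, models, score):
    # Predict the label whose binary classifier is most confident.
    return max(models, key=lambda c: score(models[c], x))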
Simple linear regression: y = ax + b
◮ a is the slope
◮ b is the intercept
◮ This model has two parameters (or weights)
◮ One feature: x
◮ Example:
⋆ x = number of vague adjectives in property descriptions
⋆ y = amount the house sold for over the asking price
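A least-squares fit of the two parameters, with made-up numbers standing in for the house-price scenario (a sketch):

import numpy as np

x = np.array([0, 1, 2, 3, 4])             # e.g. counts of vague adjectives
y = np.array([0.4, 1.3, 2.2, 2.9, 4.1])   # e.g. amount over asking price
a, b = np.polyfit(x, y, 1)                # slope a and intercept b
print(a, b)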
The model: y = w0 + Σ_{i=1}^{N} wi fi
◮ y = outcome
◮ w0 = intercept
◮ f1 . . . fN = feature vector and w1 . . . wN = weight vector
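In vectorized form the prediction is one dot product (a tiny sketch):

import numpy as np

w0, w = 0.5, np.array([1.0, -2.0, 0.3])
f = np.array([2.0, 1.0, 4.0])
y = w0 + np.dot(w, f)   # y = w0 + sum_i wi * fi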
◮ L-BFGS (limited-memory Broyden–Fletcher–Goldfarb–Shanno method)
◮ gradient ascent
◮ conjugate gradient
◮ iterative scaling algorithms
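The methods above all use the same log-likelihood gradient; a minimal gradient-ascent sketch for binary logistic regression (illustrative, with y ∈ {0, 1}):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=200):
    # Maximize the conditional log-likelihood by gradient ascent.
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        w += lr * X.T @ (y - sigmoid(X @ w))   # gradient: X^T (y - p)
    return w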
◮ p(x, 0) + p(y, 0) = 0.6
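Under only these constraints (the four probabilities sum to one, and p(x, 0) + p(y, 0) = 0.6), the maximum-entropy solution spreads mass as evenly as the constraints allow: p(x, 0) = p(y, 0) = 0.3 and p(x, 1) = p(y, 1) = 0.2. A small grid search illustrates this (a sketch, assuming the toy event space {x, y} × {0, 1}):

import numpy as np

def entropy(p):
    p = np.array([q for q in p if q > 0])
    return -(p * np.log(p)).sum()

best = (-1.0, None)
for a in np.linspace(0.0, 0.6, 61):        # p(x,0) = a, p(y,0) = 0.6 - a
    for b in np.linspace(0.0, 0.4, 41):    # p(x,1) = b, p(y,1) = 0.4 - b
        p = (a, 0.6 - a, b, 0.4 - b)
        if entropy(p) > best[0]:
            best = (entropy(p), p)
print(best[1])   # (0.3, 0.3, 0.2, 0.2)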
◮ POS tagging
◮ chunking (shallow parsing)
◮ named-entity recognition
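For example, chunking is usually cast as per-token labeling with BIO tags (a hypothetical example; B- starts a chunk, I- continues it, O marks a token outside any chunk):

tagged = [("He", "B-NP"), ("reckons", "B-VP"), ("the", "B-NP"),
          ("current", "I-NP"), ("account", "I-NP"), ("deficit", "I-NP"),
          ("will", "B-VP"), ("narrow", "I-VP"), (".", "O")]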
◮ while an MEMM uses per-state exponential models for the conditional probabilities of the next state given the current state and the observation,
◮ a CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence
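Decoding with either model finds the highest-scoring label sequence via the Viterbi algorithm; a minimal sketch over generic scores (illustrative: emit and trans are assumed log-score matrices, not quantities defined on the slides):

import numpy as np

def viterbi(emit, trans):
    # emit[t, s]: score of label s at position t; trans[r, s]: score of r -> s.
    T, S = emit.shape
    score = np.zeros((T, S))
    back = np.zeros((T, S), dtype=int)
    score[0] = emit[0]
    for t in range(1, T):
        for s in range(S):
            cand = score[t - 1] + trans[:, s] + emit[t, s]
            back[t, s] = int(np.argmax(cand))
            score[t, s] = cand[back[t, s]]
    # Trace back-pointers from the best final label.
    path = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]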