Topics in Machine Learning
(with less Magic)
Grzegorz Chrupała
Spoken Dialog Systems, Saarland University
CNGL, March 2009
Equivalent notations for the inner product of vectors x and z:
◮ x · z
◮ ⟨x, z⟩
◮ xᵀz
◮ Density p(x|Yi) (continuous feature x)
◮ Probability P(x|Yi) (discrete feature x)
◮ gi(x) = p(Yi|x) = p(x|Yi)P(Yi) / ∑j p(x|Yj)P(Yj)
◮ gi(x) = p(x|Yi)P(Yi)
◮ gi(x) = ln p(x|Yi) + ln P(Yi)
◮ g(x) = P(Y1|x) − P(Y2|x)
◮ g(x) = ln p(x|Y1)/p(x|Y2) + ln P(Y1)/P(Y2)
[Figure: class-conditional densities p(x|Y1) and p(x|Y2) plotted against x, with the corresponding decision regions R1, R2, R1 marked along the x axis]
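A minimal sketch of these discriminant functions for a single continuous feature, assuming Gaussian class-conditional densities; the priors, means, and standard deviations below are invented for illustration:

```python
import numpy as np
from scipy.stats import norm

# Assumed toy setup: two classes with Gaussian class-conditional densities
# p(x|Yi) and prior probabilities P(Yi).
priors = np.array([0.6, 0.4])   # P(Y1), P(Y2)
means = np.array([-1.0, 2.0])   # means of p(x|Yi)
stds = np.array([1.0, 1.5])     # standard deviations of p(x|Yi)

def discriminants(x):
    """gi(x) = ln p(x|Yi) + ln P(Yi); monotone in the posterior P(Yi|x)."""
    return norm.logpdf(x, means, stds) + np.log(priors)

def classify(x):
    """Bayes decision rule: assign x to the class with the largest discriminant."""
    return np.argmax(discriminants(x)) + 1   # class index Y1 or Y2

print(classify(0.0))   # falls in region R1 or R2 depending on the toy parameters
```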
◮ Priors are easy to estimate for typical classification problems
◮ However, for class-conditional densities, training data is typically sparse!
◮ Density estimation – Parzen windows
◮ Use training examples to derive decision functions directly: K-nearest neighbors
◮ Assume a known form for discriminant functions, and estimate their parameters from the training data
◮ Memory-based learning
◮ Instance or exemplar based learning
◮ Similarity-based methods
◮ Case-based reasoning
◮ n·j = ∑i=1..k nij
◮ ni· = ∑j=1..m nij
◮ n·· = ∑i=1..k ∑j=1..m nij
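A small sketch of these marginal counts, assuming nij is a matrix of co-occurrence counts (say, feature value i with class j; the slide only defines the sums):

```python
import numpy as np

# Assumed toy count matrix of shape (k, m): nij co-occurrence counts.
n = np.array([[3, 1],
              [0, 4],
              [2, 2]])

n_dot_j = n.sum(axis=0)    # n·j = sum over i of nij, one total per column j
n_i_dot = n.sum(axis=1)    # ni· = sum over j of nij, one total per row i
n_dot_dot = n.sum()        # n·· = grand total over all i and j

print(n_dot_j, n_i_dot, n_dot_dot)   # [5 7] [4 4 4] 12
```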
◮ During learning little “work” is done by the algorithm: the training examples are simply stored
◮ During prediction the test instance is compared to the stored training examples, and the most similar ones determine the prediction
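A minimal k-nearest-neighbor sketch of this lazy strategy; Euclidean distance and majority voting are assumptions, since the slides do not fix a particular distance metric:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Lazy learner: all the work happens at prediction time.
    Compare x to every stored training example and vote among the k nearest."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]               # indices of the k closest examples
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data: "training" consists of nothing more than storing the arrays.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [3.0, 3.1], [2.9, 3.0]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([2.8, 2.9])))   # -> "B"
```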
[Figure: candidate separating lines in the (x, y) plane: y = −1x − 0.5, y = −3x + 1, y = 69x + 1]
◮ Start with a zero weight vector and process each training example in turn.
◮ If the current weight vector classifies the current example incorrectly, move the weights toward a correct classification by adding the example, multiplied by its label (±1), to the weight vector.
◮ If the weights stop changing, stop.
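A compact sketch of this procedure, assuming labels in {−1, +1} and a bias handled by a constant feature appended to each example (choices the slide leaves open):

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Train a linear classifier with the perceptron rule.
    X: (n_examples, n_features), y: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])                 # start with a zero weight vector
    for _ in range(max_epochs):
        changed = False
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(w, x_i) <= 0:    # current example misclassified
                w += y_i * x_i               # move weights toward the example
                changed = True
        if not changed:                      # weights stopped changing: stop
            break
    return w

# Toy usage: a constant 1 feature plays the role of the bias term.
X = np.array([[1.0, 2.0, 1], [2.0, 1.0, 1], [-1.0, -2.0, 1], [-2.0, -1.5, 1]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
print(np.sign(X @ w))                        # -> [ 1.  1. -1. -1.]
```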
◮ A method of avoiding overfitting
◮ As final weight vector, use the mean of all the weight vector values from every step of training
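A sketch of the averaging idea on top of the loop above: keep a running sum of the weight vector after every example and return its mean (run here for a fixed number of epochs; the exact bookkeeping is an assumption):

```python
import numpy as np

def averaged_perceptron(X, y, max_epochs=100):
    """Perceptron that returns the mean of the weight vectors seen after
    processing each example, which typically generalizes better."""
    w = np.zeros(X.shape[1])
    w_sum = np.zeros(X.shape[1])
    n_steps = 0
    for _ in range(max_epochs):
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(w, x_i) <= 0:
                w += y_i * x_i
            w_sum += w                  # accumulate the weights after every step
            n_steps += 1
    return w_sum / n_steps              # averaged weight vector
```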
◮ this improves the chance that if the position of the data points is slightly perturbed, they will still be classified correctly
◮ A kernel function can be thought of as a dot product in some feature space
◮ It can also be thought of as a similarity function in the input object space
Φ(x) = (x1², √2·x1x2, x2²)
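A quick numerical check of this correspondence, assuming the feature map above is paired with the quadratic kernel K(x, z) = (x · z)²; the vectors are invented:

```python
import numpy as np

def quadratic_kernel(x, z):
    """K(x, z) = (x . z)^2, computed directly in the 2-d input space."""
    return np.dot(x, z) ** 2

def phi(x):
    """Explicit feature map Phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(quadratic_kernel(x, z))     # (1*3 + 2*(-1))^2 = 1.0
print(np.dot(phi(x), phi(z)))     # same value via the explicit feature map
```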
f(x) = sgn( ∑i∈SV αi* yi (xi · x) + b* )
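A sketch of evaluating this decision function, assuming the support vectors, their labels, the multipliers αi* and the offset b* have already been found by some SVM trainer; the numbers are invented:

```python
import numpy as np

# Assumed outputs of an already-trained linear SVM: support vectors, their
# labels yi in {-1, +1}, multipliers alpha_i*, and offset b*.
support_vectors = np.array([[1.0, 1.0], [-1.0, -0.5]])
sv_labels = np.array([1.0, -1.0])
alphas = np.array([0.8, 0.8])
b_star = -0.1

def svm_decision(x):
    """f(x) = sgn( sum over i in SV of alpha_i* yi (xi . x) + b* )"""
    score = np.sum(alphas * sv_labels * (support_vectors @ x)) + b_star
    return np.sign(score)

print(svm_decision(np.array([2.0, 1.5])))   # +1 for this toy setup
```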
◮ One-vs-rest (also known as one-vs-all): train |Y| binary classifiers and predict the class whose classifier gives the highest score
◮ One-vs-one: train |Y|(|Y| − 1)/2 pairwise binary classifiers, and choose the class that wins the most pairwise contests
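A sketch of one-vs-rest prediction, assuming each class already has a trained binary scorer (here stand-in linear weight vectors with invented values):

```python
import numpy as np

# Assumed: one weight vector per class, each trained as a binary
# "this class vs. the rest" linear classifier.
weights = {
    "NOUN": np.array([1.0, -0.5, 0.2]),
    "VERB": np.array([-0.3, 0.9, 0.1]),
    "ADJ":  np.array([0.1, 0.2, 0.7]),
}

def one_vs_rest_predict(x):
    """Score the example with every binary classifier and take the best class."""
    return max(weights, key=lambda label: np.dot(weights[label], x))

print(one_vs_rest_predict(np.array([0.2, 1.0, 0.1])))   # -> "VERB"
```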
◮ a is the slope
◮ b is the intercept
◮ This model has two parameters (or weights)
◮ One feature = x
◮ Example:
  ⋆ x = number of vague adjectives in property descriptions
  ⋆ y = amount house sold over asking price
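A least-squares sketch of fitting y = ax + b to data of this kind; the numbers are invented and np.polyfit is just one convenient way to obtain the two parameters:

```python
import numpy as np

# Invented data: x = number of vague adjectives, y = amount over asking price.
x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = np.array([1000, 3500, 5200, 8100, 9800, 12500], dtype=float)

# Fit the two parameters of y = a*x + b by least squares.
a, b = np.polyfit(x, y, deg=1)
print(a, b)           # slope and intercept
print(a * 2.5 + b)    # prediction for a new x
```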
◮ y = outcome
◮ w0 = intercept
◮ f1..fN = feature vector and w1..wN = weight vector
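The same least-squares idea with several features, i.e. y = w0 + w1·f1 + … + wN·fN; a constant column supplies the intercept w0 (data invented):

```python
import numpy as np

# Invented data: each row is a feature vector (f1, f2) for one example.
F = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [4.0, 3.0]])
y = np.array([5.0, 4.5, 7.0, 10.0])

# Prepend a constant-1 column so w[0] plays the role of the intercept w0.
design = np.hstack([np.ones((F.shape[0], 1)), F])
w, *_ = np.linalg.lstsq(design, y, rcond=None)

print(w)              # [w0, w1, w2]
print(design @ w)     # fitted values for the training examples
```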
◮ L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno method)
◮ gradient ascent
◮ conjugate gradient
◮ iterative scaling algorithms
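As a sketch of the simplest option in this list, plain gradient ascent on the conditional log-likelihood of a maximum-entropy (multinomial logistic regression) model; the data, learning rate, and iteration count are invented:

```python
import numpy as np

def train_maxent(X, y, n_classes, lr=0.1, n_iter=500):
    """Gradient ascent on the conditional log-likelihood of P(y|x).
    X: (n_examples, n_features), y: integer class labels."""
    W = np.zeros((n_classes, X.shape[1]))            # one weight vector per class
    for _ in range(n_iter):
        scores = X @ W.T                              # (n_examples, n_classes)
        scores -= scores.max(axis=1, keepdims=True)   # for numerical stability
        probs = np.exp(scores)
        probs /= probs.sum(axis=1, keepdims=True)     # P(y|x) for every class
        observed = np.zeros_like(probs)
        observed[np.arange(len(y)), y] = 1.0          # empirical label indicators
        grad = (observed - probs).T @ X               # observed minus expected counts
        W += lr * grad / len(y)                       # ascend the log-likelihood
    return W

# Tiny invented dataset with 2 features and 2 classes.
X = np.array([[1.0, 0.0], [0.9, 0.2], [0.1, 1.0], [0.0, 0.8]])
y = np.array([0, 0, 1, 1])
W = train_maxent(X, y, n_classes=2)
print((X @ W.T).argmax(axis=1))                       # -> [0 0 1 1]
```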
◮ p(x, 0) + p(y, 0) = 0.6
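A worked version of this kind of constraint, assuming the usual toy event space {x, y} × {0, 1} (an assumption; the slide shows only the constraint itself): maximum entropy spreads the probability mass as uniformly as the constraints allow.

```latex
\begin{align*}
\text{Constraints:} \quad & p(x,0) + p(y,0) = 0.6,\\
                          & p(x,0) + p(x,1) + p(y,0) + p(y,1) = 1.\\
\text{Maximum-entropy solution:} \quad & p(x,0) = p(y,0) = 0.6/2 = 0.3,\\
                                       & p(x,1) = p(y,1) = (1 - 0.6)/2 = 0.2.
\end{align*}
```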
◮ POS tagging
◮ chunking (shallow parsing)
◮ named-entity recognition
◮ Classification: Φ(x, y) = (φ(x) [y = Yi])i=1..|Y|, i.e. φ(x) copied into the block corresponding to class y
◮ Sequence labeling: Φ(x, y) = ∑i=1..n φ(xi, yi), where n is the length of the sequence
◮ Classification: argmax y∈Y w · Φ(x, y)
◮ Sequence labeling: ViterbiPath(x; w) or BeamSearch(x; w), where the search finds the label sequence maximizing w · Φ(x, y)
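A sketch of Viterbi decoding for a first-order sequence model, assuming the score w · Φ(x, y) decomposes into per-position emission scores plus label-to-label transition scores (a common, but here assumed, factorization):

```python
import numpy as np

def viterbi_path(emission, transition):
    """Find the highest-scoring label sequence.
    emission: (n_positions, n_labels) per-position scores,
    transition: (n_labels, n_labels) score of label j following label i."""
    n, k = emission.shape
    score = np.full((n, k), -np.inf)
    backptr = np.zeros((n, k), dtype=int)
    score[0] = emission[0]
    for t in range(1, n):
        for j in range(k):
            cand = score[t - 1] + transition[:, j] + emission[t, j]
            backptr[t, j] = np.argmax(cand)
            score[t, j] = cand[backptr[t, j]]
    # Follow the back-pointers from the best final label.
    path = [int(np.argmax(score[-1]))]
    for t in range(n - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

# Toy scores for a 3-word sentence and 2 labels.
emission = np.array([[2.0, 0.0], [0.5, 1.0], [0.0, 2.0]])
transition = np.array([[0.5, -1.0], [-1.0, 0.5]])
print(viterbi_path(emission, transition))   # -> [0, 1, 1]
```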
◮ while an MEMM uses per-state exponential models for the conditional probabilities of next states given the current state,
◮ a CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence