
Machine Learning for Computational Linguistics: Regression (Çağrı Çöltekin)



1. Machine Learning for Computational Linguistics: Regression
   Çağrı Çöltekin
   University of Tübingen, Seminar für Sprachwissenschaft
   April 26/28, 2016

2. Practical matters
   ▶ Course credits: 6 ECTS without term paper, 9 ECTS with term paper
   ▶ Homeworks & evaluation: each homework gets 0 if not satisfactory or not submitted, and a score in [6, 10] if satisfactory and on time
   ▶ Late homeworks are not accepted
   ▶ Please follow the instructions precisely!

3. Entropy of your random numbers
   [table: tallies of the numbers 1-20 submitted by the class; the exact counts are not cleanly recoverable]
   Entropy of the observed distribution:
       H(X) = −∑_x P(x) log₂ P(x) = 2.61
   If the data was really uniformly distributed: H(X) = log₂ 20 = 4.32.
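The computation on this slide is easy to reproduce. A minimal Python sketch; the counts below are illustrative placeholders, since the exact class tallies could not be recovered from the slide:

```python
import numpy as np

# Illustrative tallies of the "random" numbers 1-20 collected in class;
# the exact counts could not be recovered from the slide.
counts = np.array([1, 1, 1, 0, 0, 1, 1, 6, 4, 2,
                   2, 2, 3, 2, 0, 1, 3, 0, 3, 2])

p = counts / counts.sum()
p = p[p > 0]                          # treat 0 * log2(0) as 0
H = -(p * np.log2(p)).sum()           # entropy of the observed distribution
print(f"H(observed) = {H:.2f} bits")

# If the numbers were truly uniform over 1..20:
print(f"H(uniform)  = {np.log2(20):.2f} bits")   # = 4.32
```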

4. Coding a four-letter alphabet

   letter   prob   code 1   code 2
   a        1/2    00       0
   b        1/4    01       10
   c        1/8    10       110
   d        1/8    11       111

   Average code length of a string under code 1: (1/2)·2 + (1/4)·2 + (1/8)·2 + (1/8)·2 = 2.0 bits
   Average code length of a string under code 2: (1/2)·1 + (1/4)·2 + (1/8)·3 + (1/8)·3 = 1.75 bits = H
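A quick way to check the arithmetic above, using the table's probabilities and codeword lengths:

```python
import numpy as np

# Probabilities and the two codes from the table above.
probs = {"a": 1/2, "b": 1/4, "c": 1/8, "d": 1/8}
code1 = {"a": "00", "b": "01", "c": "10", "d": "11"}    # fixed-length code
code2 = {"a": "0", "b": "10", "c": "110", "d": "111"}   # variable-length code

def avg_length(code):
    """Expected codeword length: sum of P(letter) * len(codeword)."""
    return sum(probs[ch] * len(cw) for ch, cw in code.items())

entropy = -sum(p * np.log2(p) for p in probs.values())

print(f"code 1: {avg_length(code1):.2f} bits/letter")   # 2.00
print(f"code 2: {avg_length(code2):.2f} bits/letter")   # 1.75
print(f"H:      {entropy:.2f} bits")                    # 1.75, matching code 2
```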

5. Statistical inference and estimation
   ▶ Statistical inference is about making generalizations that go beyond the data at hand (training set, or experimental sample)
   ▶ In a typical scenario, we (implicitly) assume that a particular class of models describes the real-world process, and try to find the best model within that class
   ▶ In most cases, our models are parametrized: the model is defined by a set of parameters
   ▶ The task, then, becomes estimating the parameters from the training set such that the resulting model is useful for unseen instances

6. Estimation of model parameters
   A typical statistical model can be formulated as
       y = f(x; w) + ϵ
   where
   ▶ x is the input to the model
   ▶ y is the quantity or label assigned to a given input
   ▶ w is the parameter(s) of the model
   ▶ f(x; w) is the model's estimate of the output y given the input x, sometimes denoted ŷ
   ▶ ϵ represents the uncertainty or noise that we cannot explain or account for
   ▶ In machine learning, the focus is on correct prediction of y
   ▶ In statistics, the focus is on inference (testing hypotheses or explaining the observed phenomena)

7. Estimating parameters: Bayesian approach
   Given the training data X, we find the posterior distribution of the parameters:
       p(w | X) = p(X | w) p(w) / p(X)
   ▶ The result, the posterior, is a probability distribution of the parameter(s)
   ▶ One can get a point estimate of w, for example, by calculating the expected value from the distribution
   ▶ The posterior distribution also contains the information on the uncertainty of the estimate
   ▶ Prior information can be specified by the prior distribution

8. Estimating parameters: frequentist approach
   Given the training data X, we find the value of w that maximizes the likelihood:
       ŵ = arg max_w p(X | w)
   ▶ The likelihood function p(X | w), often denoted L(w | X), is the probability of the data given w for discrete variables, and the value of the probability density function for continuous variables
   ▶ The problem becomes searching for the maximum value of a function
   ▶ Note that we cannot make probabilistic statements about w
   ▶ Uncertainty of the estimate is less straightforward
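As an illustration of "searching for the maximum value of a function" (not from the slides), a brute-force grid search over candidate means recovers the sample mean as the MLE. The data are the 5-tweet sample from the next slide, with the sample sd treated as fixed for simplicity:

```python
import numpy as np
from scipy import stats

# The 5-tweet sample introduced on the next slide.
data = np.array([87, 101, 88, 45, 138])

def log_likelihood(mu, sd=33.34):
    """Gaussian log-likelihood of the data for a candidate mean mu."""
    return stats.norm.logpdf(data, loc=mu, scale=sd).sum()

# Brute-force search for the maximizing value of mu.
grid = np.linspace(0, 200, 2001)                 # step size 0.1
mle = grid[np.argmax([log_likelihood(mu) for mu in grid])]
print(f"MLE of mu:   {mle:.1f}")                 # 91.8
print(f"sample mean: {data.mean():.1f}")         # 91.8 -- they coincide
```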

9. A simple example: estimation of the population mean
   We assume that the observed data comes from the model:
       y = µ + ϵ,  where ϵ ∼ N(0, σ²)
   ▶ Let's assume that we are estimating the average number of characters in Twitter messages. We will use two data sets:
     ▶ 87, 101, 88, 45, 138
       ▶ the mean of the sample (x̄) is 91.8
       ▶ the variance of the sample (sd²) is 1111.7 (sd = 33.34)
     ▶ 87, 101, 88, 45, 138, 66, 79, 78, 140, 102
       ▶ x̄ = 92.4
       ▶ sd² = 876.71 (sd = 29.61)

10. Estimating mean: Bayesian way
    We simply use Bayes' formula:
        p(µ | D) = p(D | µ) p(µ) / p(D)
    ▶ With a vague prior (high variance/entropy), the posterior mean is (almost) the same as the mean of the data
    ▶ With a prior with lower variance, the posterior is between the prior and the data mean
    ▶ Posterior variance indicates the uncertainty of our estimate. With more data, we get a more certain estimate

11. Estimating mean: Bayesian way (vague prior, small sample)
    [plot: prior, likelihood, and posterior densities]
    Prior: N(70, 1000)    Likelihood: N(91.8, 33.34)    Posterior: N(91.78, 14.91)

12. Estimating mean: Bayesian way (vague prior, larger sample)
    [plot: prior, likelihood, and posterior densities]
    Prior: N(70, 1000)    Likelihood: N(92.40, 29.61)    Posterior: N(92.39, 9.36)

13. Estimating mean: Bayesian way (lower-variance prior, larger sample)
    [plot: prior, likelihood, and posterior densities]
    Prior: N(70, 50)    Likelihood: N(92.40, 29.61)    Posterior: N(91.64, 9.20)
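The slides do not show how these posteriors are computed. Assuming a conjugate normal-normal model in which the sample standard deviation is plugged in as a known σ (an assumption on my part, but it reproduces the numbers above), the update looks like this:

```python
import numpy as np

def posterior(prior_mean, prior_sd, data, sigma):
    """Conjugate normal-normal update for the mean, with sigma known.
    Precisions (1/variance) add, and the posterior mean is a
    precision-weighted average of the prior mean and the sample mean."""
    prior_prec = 1 / prior_sd**2
    data_prec = len(data) / sigma**2
    post_var = 1 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * np.mean(data))
    return post_mean, np.sqrt(post_var)

tweets5 = [87, 101, 88, 45, 138]
tweets10 = [87, 101, 88, 45, 138, 66, 79, 78, 140, 102]

print(posterior(70, 1000, tweets5, 33.34))   # ~ (91.78, 14.91)
print(posterior(70, 1000, tweets10, 29.61))  # ~ (92.39, 9.36)
print(posterior(70, 50, tweets10, 29.61))    # ~ (91.64, 9.20)
```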

14. Estimating mean: frequentist way
    ▶ The MLE of the mean of the population is the mean of the sample
      ▶ for the 5-tweet sample: µ̂ = x̄ = 91.8
      ▶ for the 10-tweet sample: µ̂ = x̄ = 92.4
    ▶ We express the uncertainty in terms of the standard error of the mean (SE), the standard deviation of the means of (hypothetical) samples of the same size drawn from the same population:
        SE(x̄) = sd / √n
      ▶ for the 5-tweet sample: SE(x̄) = 33.34 / √5 = 14.91
      ▶ for the 10-tweet sample: SE(x̄) = 29.61 / √10 = 9.36
    ▶ A rough estimate for a 95% confidence interval is x̄ ± 2·SE(x̄)
      ▶ for the 5-tweet sample: 91.8 ± 2 × 14.91 = [61.98, 121.62]
      ▶ for the 10-tweet sample: 92.4 ± 2 × 9.36 = [73.68, 111.12]
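The frequentist computations above are easy to check with a minimal sketch:

```python
import numpy as np

def mean_se_ci(data):
    """Sample mean, standard error, and the rough 95% CI (mean +/- 2 SE)."""
    x = np.asarray(data, dtype=float)
    mean = x.mean()
    se = x.std(ddof=1) / np.sqrt(len(x))   # sample sd (n-1 denominator)
    return mean, se, (mean - 2 * se, mean + 2 * se)

print(mean_se_ci([87, 101, 88, 45, 138]))
# ~ (91.8, 14.91, (61.98, 121.62))
print(mean_se_ci([87, 101, 88, 45, 138, 66, 79, 78, 140, 102]))
# ~ (92.4, 9.36, (73.67, 111.13))
```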

15. Regression
    ▶ Regression is a supervised method for predicting the value of a continuous response variable based on a number of predictors
    ▶ We estimate the conditional expectation of the outcome variable given the predictor(s)
    ▶ If the outcome is a label, the problem is called classification. But the border between the two is often not that clear

16. The linear equation: a reminder
        y = a + bx
    ▶ a (intercept) is where the line crosses the y axis
    ▶ b (slope) is the change in y as x is increased one unit
    [plot: several example lines, e.g. y = 1 − x/2 and y = 2 + x/2; the full set is not cleanly recoverable]
    ▶ What is the correlation between x and y for each line?
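The closing question can be checked numerically: points that lie exactly on a line are perfectly correlated, with the sign given by the slope. A small sketch using two illustrative lines (the exact lines plotted on the slide are not fully recoverable):

```python
import numpy as np

x = np.linspace(-2, 2, 50)

# For points exactly on y = a + b*x, the correlation is +1 if b > 0,
# -1 if b < 0, and undefined (y has zero variance) if b == 0.
for a, b in [(2, 0.5), (1, -0.5)]:
    y = a + b * x
    r = np.corrcoef(x, y)[0, 1]
    print(f"y = {a} + ({b})x: r = {r:+.0f}")
```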
