
Machine Learning for Computational Linguistics: Some probability distributions
A refresher on probability and information theory

Çağrı Çöltekin
Seminar für Sprachwissenschaft, University of Tübingen
April 19/21, 2016



Why probability theory?

▶ Probability theory studies uncertainty.
▶ In machine learning we deal with problems involving uncertainty, because of
  ▶ inherent stochasticity of some physical systems
  ▶ incomplete/inaccurate measurements
  ▶ incomplete modeling
▶ Probability is a measure of the (un)certainty of an event.
▶ We quantify the probability of an event with a number between 0 and 1:
  0    the event is impossible
  0.5  the event is as likely to happen (or to have happened) as not
  1    the event is certain

What is probability?

▶ The set of all possible outcomes of a trial (an experiment or observation) is called the sample space (Ω).
▶ The axioms of probability state that
  1. P(E) ∈ ℝ, P(E) ⩾ 0
  2. P(Ω) = 1
  3. for disjoint events E_1 and E_2, P(E_1 ∪ E_2) = P(E_1) + P(E_2)

Where do probabilities come from?

▶ The axioms of probability do not specify how to assign probabilities to events.
▶ Two major (rival) ways of assigning probabilities to events are:
  ▶ Frequentist (objective) probabilities: the probability of an event is its relative frequency (in the limit), as the simulation sketch below illustrates.
  ▶ Bayesian (subjective) probabilities: probabilities are degrees of belief.
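The frequentist definition lends itself to a quick simulation. The following is a minimal sketch, added here for illustration (the event and its "true" probability of 0.3 are made up): the relative frequency of the event stabilizes around that value as the number of trials grows.

```python
import random

# Frequentist probability as a limit of relative frequencies: simulate a
# biased coin whose (assumed) true probability of heads is 0.3 and watch
# the running relative frequency approach that value.
rng = random.Random(42)
p_true = 0.3
heads = 0
for n in range(1, 100_001):
    heads += rng.random() < p_true  # True counts as 1, False as 0
    if n in (10, 100, 1_000, 10_000, 100_000):
        print(f"after {n:>7,} trials: relative frequency = {heads / n:.4f}")
```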
Random variables: mapping outcomes to real numbers

▶ Example outcomes of uncertain experiments:
  ▶ height or weight of a person
  ▶ length of a word randomly chosen from a corpus
  ▶ whether an email is spam or not
  ▶ the first word of a book, or the first word uttered by a baby
▶ Note: not all of these are numbers.
▶ A random variable is a variable whose value is subject to chance.
▶ A random variable is always a number (∈ ℝ for our purposes).
▶ Think of a random variable as a mapping from the outcomes of a trial to (a vector of) real numbers: a real-valued function on the sample space.
▶ Discrete examples:
  ▶ the number of words in a sentence: 2, 5, 10, …
  ▶ whether a review is negative or positive: Negative → 0, Positive → 1
  ▶ the POS tag of a word, as a one-hot vector:
    Noun → (1, 0, 0, 0), Verb → (0, 1, 0, 0), Adj → (0, 0, 1, 0), Adv → (0, 0, 0, 1)
▶ Continuous example: the frequency of a sound signal: 100.5, 220.3, 4321.3, …

Probability mass function

▶ The probability mass function of a discrete random variable X maps every possible value x to its probability, P(X = x).
▶ Example: probabilities for sentence length in words (plotted as a bar chart on the slide):

  Length (x)    1     2     3     4     5     6     7     8     9    10    11
  P(X = x)    0.16  0.18  0.21  0.19  0.10  0.07  0.04  0.02  0.01  0.01  0.00

Cumulative distribution function

▶ The cumulative distribution function is F_X(x) = P(X ⩽ x).
▶ Example, continuing with sentence lengths (the cumulative values come from the unrounded probabilities, so they can differ slightly from running sums of the rounded column above):

  Length (x)    1     2     3     4     5     6     7     8     9    10    11
  P(X ⩽ x)    0.16  0.34  0.55  0.74  0.85  0.91  0.95  0.97  0.99  0.99  1.00
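The PMF and CDF above are easy to compute with. Here is a minimal sketch, an added illustration using the rounded two-decimal probabilities from the table, that derives the CDF from the PMF as a running sum and checks that the probabilities sum to (approximately) one:

```python
from itertools import accumulate

# Sentence-length PMF from the table above (rounded to two decimals,
# so the total is 0.99 rather than exactly 1).
lengths = range(1, 12)
pmf = [0.16, 0.18, 0.21, 0.19, 0.10, 0.07, 0.04, 0.02, 0.01, 0.01, 0.00]

# The CDF is the running sum of the PMF: F(x) = P(X <= x).
cdf = list(accumulate(pmf))

for x, p, f in zip(lengths, pmf, cdf):
    print(f"P(X = {x:2d}) = {p:.2f}    P(X <= {x:2d}) = {f:.2f}")

# A PMF must sum to 1; allow a little slack for the rounded inputs.
assert abs(cdf[-1] - 1.0) < 0.02
```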

Expected value

▶ The expected value (mean) of a random variable X is
  E[X] = μ = ∑_i P(x_i) x_i = P(x_1)x_1 + P(x_2)x_2 + … + P(x_n)x_n
▶ More generally, the expected value of a function of X is
  E[f(X)] = ∑_i P(x_i) f(x_i)
▶ The expected value is an important measure of central tendency.
▶ Note: it is not the 'most likely' value.
▶ The expected value is linear: E[aX + bY] = aE[X] + bE[Y].

Variance and standard deviation

▶ The variance of a random variable X is
  Var(X) = σ² = ∑_i P(x_i)(x_i − μ)² = E[X²] − (E[X])²
▶ It is a measure of spread: of divergence from the central tendency.
▶ The square root of the variance is called the standard deviation:
  σ = √( ∑_i P(x_i) x_i² − μ² )
▶ The standard deviation is in the same units as the values of the random variable.
▶ Variance is not linear: Var(X + Y) ≠ Var(X) + Var(Y) (and neither is σ).

A short digression: Chebyshev's inequality

For any probability distribution and any k > 1,
  P(|x − μ| > kσ) ⩽ 1/k²

  Distance from μ    2σ     3σ     5σ     10σ     100σ
  Upper bound       0.25   0.11   0.04   0.01    0.0001

This also shows why standardizing the values of random variables,
  z = (x − μ) / σ,
makes sense (the normalized quantity is often called the z-score).

Median and mode of a random variable

▶ The median is the mid-point of a distribution. The median of a random variable is defined as the number m that satisfies
  P(X ⩽ m) ⩾ 1/2 and P(X ⩾ m) ⩾ 1/2
  ▶ The median of 1, 4, 5, 8, 10 is 5.
  ▶ The median of 1, 4, 5, 7, 8, 10 is 6.
▶ The mode is the value that occurs most often in the data.
  ▶ The mode of 1, 4, 4, 8, 10 is 4.
  ▶ The modes of 1, 4, 4, 8, 9, 9 are 4 and 9.
▶ Modes appear as peaks in probability mass (or density) functions.

Visualization on the sentence-length example

For the sentence-length distribution above: mode = median = 3.0 and μ = 3.56 (the slide shows the PMF bar chart with these statistics marked). The sketch below computes these statistics from the table.
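All of these summary statistics can be computed directly from the sentence-length PMF. A minimal sketch, added for illustration; because the table probabilities are rounded, the computed mean lands near, but not exactly on, the slide's μ = 3.56:

```python
import math

lengths = list(range(1, 12))
pmf = [0.16, 0.18, 0.21, 0.19, 0.10, 0.07, 0.04, 0.02, 0.01, 0.01, 0.00]

# Expected value: E[X] = sum_i P(x_i) x_i
mu = sum(p * x for x, p in zip(lengths, pmf))

# Variance: Var(X) = E[X^2] - (E[X])^2; standard deviation is its root.
var = sum(p * x**2 for x, p in zip(lengths, pmf)) - mu**2
sigma = math.sqrt(var)

# Mode: the value with the largest probability mass (the peak of the PMF).
mode = max(zip(lengths, pmf), key=lambda pair: pair[1])[0]

# Median: the smallest m with P(X <= m) >= 1/2.
cumulative = 0.0
for x, p in zip(lengths, pmf):
    cumulative += p
    if cumulative >= 0.5:
        median = x
        break

print(f"mean = {mu:.2f}, sd = {sigma:.2f}, mode = {mode}, median = {median}")
# mode = median = 3, as on the slide; the mean comes out around 3.46
# rather than 3.56 because the input probabilities are rounded.
```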
Probability distribution of letters

▶ Suppose we have a hypothetical language with 8 letters, with the following probabilities:

  Letter   a     b     c     d     e     f     g     h
  Prob.   0.23  0.04  0.05  0.08  0.29  0.02  0.07  0.22

Joint and marginal probability

Two random variables form a joint probability distribution. An example: consider the bigrams of letters in the example above. The joint distribution can be defined with a table like the one below, where the cell in row x and column y gives P(X = x, Y = y), and the final row and column give the marginal probabilities:

        a     b     c     d     e     f     g     h
  a    0.04  0.02  0.02  0.03  0.05  0.01  0.02  0.06   0.23
  b    0.01  0.00  0.00  0.00  0.01  0.00  0.00  0.01   0.04
  c    0.02  0.00  0.00  0.00  0.01  0.00  0.00  0.01   0.05
  d    0.02  0.00  0.00  0.01  0.02  0.00  0.01  0.02   0.08
  e    0.06  0.02  0.01  0.03  0.08  0.01  0.01  0.07   0.29
  f    0.00  0.00  0.00  0.00  0.01  0.00  0.00  0.01   0.02
  g    0.01  0.00  0.00  0.01  0.02  0.00  0.01  0.02   0.07
  h    0.08  0.00  0.00  0.01  0.10  0.00  0.01  0.02   0.22
       0.23  0.04  0.05  0.08  0.29  0.02  0.07  0.22

Expected values of joint distributions

▶ The expected value of a function of two random variables is
  E[f(X, Y)] = ∑_x ∑_y P(x, y) f(x, y)
▶ In particular,
  μ_X = E[X] = ∑_x ∑_y P(x, y) x
  μ_Y = E[Y] = ∑_x ∑_y P(x, y) y
▶ We can simplify the notation using vectors: for μ = (μ_X, μ_Y),
  μ = ∑_{x ∈ XY} x P(x)
  where the vector x ranges over all possible combinations of the values of the random variables X and Y.
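The marginals can be recovered mechanically from the joint table. A minimal sketch, added for illustration, that stores the joint bigram distribution as a nested dict and sums rows and columns; up to rounding, the results match the single-letter probabilities given above:

```python
letters = "abcdefgh"
rows = [
    [0.04, 0.02, 0.02, 0.03, 0.05, 0.01, 0.02, 0.06],  # a
    [0.01, 0.00, 0.00, 0.00, 0.01, 0.00, 0.00, 0.01],  # b
    [0.02, 0.00, 0.00, 0.00, 0.01, 0.00, 0.00, 0.01],  # c
    [0.02, 0.00, 0.00, 0.01, 0.02, 0.00, 0.01, 0.02],  # d
    [0.06, 0.02, 0.01, 0.03, 0.08, 0.01, 0.01, 0.07],  # e
    [0.00, 0.00, 0.00, 0.00, 0.01, 0.00, 0.00, 0.01],  # f
    [0.01, 0.00, 0.00, 0.01, 0.02, 0.00, 0.01, 0.02],  # g
    [0.08, 0.00, 0.00, 0.01, 0.10, 0.00, 0.01, 0.02],  # h
]

# joint[x][y] = P(X = x, Y = y): probability that a bigram is (x, y).
joint = {x: dict(zip(letters, row)) for x, row in zip(letters, rows)}

# Marginals: P(X = x) = sum_y P(x, y) and P(Y = y) = sum_x P(x, y).
p_x = {x: sum(joint[x].values()) for x in letters}
p_y = {y: sum(joint[x][y] for x in letters) for y in letters}

for c in letters:
    print(f"P(X = {c}) = {p_x[c]:.2f}    P(Y = {c}) = {p_y[c]:.2f}")
# Up to rounding of the two-decimal cells, both marginals match the
# letter probabilities 0.23, 0.04, 0.05, 0.08, 0.29, 0.02, 0.07, 0.22.
```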
