probability for linguists

Probability for linguists - PowerPoint PPT Presentation

Probability for linguists. John A Goldsmith. July 6, 2015. Sections: probability and distributions; unigram probabilities; logarithms and plogs; from single symbols to strings of symbols; conditional probability: first steps in taking sequence into account.


  1. Probability for linguists. John A Goldsmith. July 6, 2015

  2. Overall strategy
     (1) probabilities and distributions
     (2) unigram probability
     (3) a word about parametric distributions
     (4) $-1 \times \log_2$ probability (or plog: positive log probability)
     (5) bigram probability: conditional probability
     (6) mutual information: the log of the ratio of the observed to the "expected"
     (7) average plog → entropy
     (8) encoding events: compression, optimal compression, and cross-entropy
     (9) encoding grammars optimally

  3. A distribution
     Big point 1: A distribution is a list of numbers that are not negative and that sum to 1.
     $\sum_i p_i = 1$
     $p_i \ge 0$
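
The two conditions above can be checked mechanically. The following is a minimal sketch; the function name `is_distribution` and the tolerance `tol` are illustrative assumptions, not part of the slides.

```python
import math

def is_distribution(values, tol=1e-9):
    """Return True if every value is non-negative and the values sum to 1.

    `tol` is an assumed tolerance for floating-point rounding in the sum.
    """
    values = list(values)
    all_non_negative = all(v >= 0 for v in values)
    sums_to_one = math.isclose(sum(values), 1.0, abs_tol=tol)
    return all_non_negative and sums_to_one

print(is_distribution([0.5, 0.25, 0.25]))  # non-negative and sums to 1
print(is_distribution([0.7, 0.4, -0.1]))   # contains a negative number
print(is_distribution([0.5, 0.2]))         # sums to 0.7, not 1
```

Only the first list satisfies both conditions; the second fails on non-negativity and the third on the sum.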

  4. A probabilistic grammar
     • A probabilistic model, or grammar, is a universe of possibilities ("sample space") + a distribution.
     • A probabilistic grammar is a distribution over all strings of the IPA alphabet.
     • It is not a formalism stating which strings are in and which are out.

  5. The purpose of a probabilistic model
     Big point 2: The purpose of a probabilistic model is to test the model against the data.
     • Suppose we have some well-chosen data D. Then the best grammar is the one that assigns the highest probability to D, all other things being equal.
     • The goal is not to test the data!
     • Therefore: all grammars must be probabilistic, so they can be tested and evaluated.
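
As a rough illustration of choosing the grammar that assigns the highest probability to D, the sketch below compares two toy models on a toy data set. Everything here is an assumption made for illustration: the data, the two models, and the simplification that each symbol is generated independently (a unigram model).

```python
import math

def log2_probability(data, model):
    """Log (base 2) of the probability a unigram model assigns to the data.

    Assumes each symbol is generated independently, so the probability of
    the data is the product of the symbol probabilities.
    """
    return sum(math.log2(model[symbol]) for symbol in data)

# Hypothetical data D and two hypothetical grammars over the alphabet {a, b}.
data = list("abaab")
model_1 = {"a": 0.6, "b": 0.4}
model_2 = {"a": 0.5, "b": 0.5}

scores = {
    "model_1": log2_probability(data, model_1),
    "model_2": log2_probability(data, model_2),
}
best = max(scores, key=scores.get)
print(scores)
print("best:", best)
```

Here `model_1` assigns D a probability of 0.6³ × 0.4² = 0.03456, against 0.5⁵ = 0.03125 for `model_2`, so `model_1` is preferred, all other things being equal.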

  6. Probability
     • The quantitative theory of evidence.
     • If we have variable data, then probability is the best model to use.
     • If we have categorical (not variable) data, probability is still the best model to use.

  7. Probabilities and frequencies
     Probabilities and frequencies are not the same thing.
     • Frequencies are observed.
     • Probabilities are values in a system that a human being creates and assigns.
     • We can choose to assign probabilities as the observed frequencies, but that is not always a good idea.
     • This is a good idea only so long as we don't need to handle yet-unseen (never before seen) data.
     • In many cases, this choice maximizes the probability of the data.
     • They both deal with distributions (i.e., the observed frequencies and the probability distributions of a model).
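
The problem with unseen data can be made concrete with a small sketch. The data ("banana") and the function names are illustrative assumptions; the model simply sets each probability equal to the observed relative frequency and treats symbols as independent.

```python
from collections import Counter

def relative_frequencies(symbols):
    """Assign each observed symbol its relative frequency as its probability."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return {symbol: count / total for symbol, count in counts.items()}

def string_probability(string, model):
    """Probability of a string under a unigram model (symbols independent).

    A symbol that was never observed gets probability 0.
    """
    probability = 1.0
    for symbol in string:
        probability *= model.get(symbol, 0.0)
    return probability

model = relative_frequencies("banana")    # a: 3/6, n: 2/6, b: 1/6
print(string_probability("nab", model))   # all symbols were observed
print(string_probability("band", model))  # 'd' was never observed
```

Because "d" never occurs in the observed data, any string containing it receives probability zero, however plausible the string may be.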

  8. Probabilities and frequencies
     Probabilities and frequencies are not the same thing.
     • Counts are counts: the number of things or events that fall in some category.
     • Frequency is ambiguous: it either means count (less often) or it means relative frequency: a ratio between a count of something and the total number of things that fall within the larger category.
     • There are 63,147 occurrences of "the" in the Brown Corpus, out of 1,017,904; 6.2% of the words in the Brown Corpus are "the".
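
The step from count to relative frequency in the Brown Corpus example is a single division. The variable names below are illustrative; the two counts are the ones quoted on the slide.

```python
count_the = 63_147        # occurrences of "the" (figure quoted on the slide)
total_words = 1_017_904   # total word count (figure quoted on the slide)

relative_frequency = count_the / total_words
print(f"{relative_frequency:.1%}")  # 6.2%
```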

  9. English, French, Spanish
     Let's take a look at some languages. And for starters, let's just look at unigram frequencies: the frequencies at which items appear, not conditioned by the environment.
     people.cs.uchicago.edu/jagoldsm/course/class1

  10. Plogs
     • We will assign probabilities to every outcome we consider.
     • Each of these is typically quite small.
     • We therefore use a slightly different way of talking about small numbers: plogs.

  11. Inverse log probabilities, or plogs
     A way to describe small numbers... upside down.

     A probability          its plog
     0.5                    1
     0.25                   2
     0.125                  3
     1/16                   4
     1/32                   5
     1/1024                 10
     ...                    ...
     almost 1/1,000,000     20

     • The bigger the plog, the smaller the probability.
     • It's a bit like a measure of markedness, if you think of more marked things as being less frequent.
     • $\mathrm{plog}(x) = -\log_2(x) = \log_2(1/x)$
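
The definition plog(x) = -log2(x) translates directly into code. This is a minimal sketch; the function name `plog` follows the slide's term, while the input check is an added assumption.

```python
import math

def plog(p):
    """Positive log probability: -log2(p), defined here for 0 < p <= 1."""
    if not 0 < p <= 1:
        raise ValueError("p must be a probability in (0, 1]")
    return -math.log2(p)

# Powers of 1/2 give whole-number plogs; one in a million is just under 20.
for p in [0.5, 0.25, 0.125, 1 / 16, 1 / 32, 1 / 1024, 1 / 1_000_000]:
    print(f"{p:.7f} -> {plog(p):.2f}")
```

Halving a probability adds exactly 1 to its plog, which is why small probabilities turn into manageable numbers.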

  12. Plogs
     [Figure: plog plotted against probability. Horizontal axis: probability, from 0 to 1. Vertical axis: plog, marked from 1 to 5.]

  13. [Figure: plogs of the symbols of the word "stations" (labels shown: #, s, t, ej, S, @, n, z, #), on a vertical axis marked from 1 to 6. The average is 4.64.]
     This diagram is from a visually interactive program displaying phonological complexity at: http://hum.uchicago.edu/~jagoldsm/PhonologicalComplexi

  14. Most and least frequent phonemes in English

     rank   phoneme   frequency   plog
     1      #         0.20        2.30
     2      @         0.066       3.92
     3      n         0.058       4.10
     4      t         0.056       4.17
     5      s         0.041       4.61
     6      r         0.040       4.76
     7      d         0.037       4.85
     8      l         0.035       4.94
     9      k         0.026       5.27
     10     æ´        0.025       5.31
     ...
     45     Oy´       0.00078     10.32
     46     æ˘        0.00069     10.50
     47     ž         0.00054     10.84
     48     ay˘       0.00038     11.36
     49     a˘        0.00036     11.42
     50     O˘        0.00028     11.79

  15. Average plogs

     rank   orthography   phonemes   av. plog
     1      a             @          3.11
     2      an            @ n        3.44
     3      to            t @        3.47
     4      and           @ n d      3.80
     5      eh            E´         3.88
     6      the           @          3.88
     7      can           k @ n      3.90
     8      an            æ´ n       3.91
     9      Ann           æ´ n       3.91
     10     in            I´ n       3.91
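
An average plog is the sum of the plogs of a word's symbols divided by the number of symbols. The sketch below uses the rounded phoneme plogs from the phoneme table. It also assumes that each word is counted together with one word-boundary symbol "#"; the slides do not state this, but under that assumption the result comes out within about 0.01 of the values in the table above for the words tried here.

```python
# Rounded plogs taken from the phoneme table (only the symbols needed here).
PLOG = {"#": 2.30, "@": 3.92, "n": 4.10, "t": 4.17, "d": 4.85, "k": 5.27}

def average_plog(phonemes, boundary="#"):
    """Mean plog of a word's phonemes plus one boundary symbol.

    Counting exactly one boundary per word is an assumption, not something
    the slides state.
    """
    symbols = list(phonemes) + [boundary]
    return sum(PLOG[s] for s in symbols) / len(symbols)

words = [
    ("a", ["@"]),
    ("an", ["@", "n"]),
    ("to", ["t", "@"]),
    ("and", ["@", "n", "d"]),
    ("can", ["k", "@", "n"]),
]
for orthography, phonemes in words:
    print(orthography, round(average_plog(phonemes), 2))
```

Small differences from the table are expected, since the inputs are plogs already rounded to two decimals.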
