
slide-1
SLIDE 1

Statistical Natural Language Processing

A refresher on information theory

Çağrı Çöltekin

University of Tübingen, Seminar für Sprachwissenschaft

Summer Semester 2017

slide-2
SLIDE 2

Information theory

Information theory

  • Information theory is concerned with the measurement, storage, and transmission of information
  • It has its roots in communication theory, but is applied to many different fields, including NLP
  • We will revisit some of the major concepts

slide-3
SLIDE 3

Information theory

Noisy channel model

[Diagram: 'a' → encoder → 10000010 → noisy channel → 10010010 → decoder → 'a']

  • We want codes that are efficient: we do not want to waste the channel bandwidth
  • We want codes that are resilient to errors: we want to be able to detect and correct errors
  • This simple model has many applications in NLP, including speech recognition and machine translation
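The noisy channel idea can be made concrete in a few lines of code. The sketch below is not from the slides; it is a minimal, hypothetical Python example in which we recover the intended word from a corrupted observation by maximizing P(observed | word) × P(word), the decision rule used, for instance, in noisy-channel spelling correction. All probabilities are invented for illustration.

```python
# Minimal noisy-channel decoding sketch (hypothetical numbers, not from the slides).
# We pick the source word w that maximizes P(observed | w) * P(w).

# Language model: prior probability of each candidate source word (made up).
prior = {"the": 0.6, "they": 0.3, "then": 0.1}

# Channel model: probability of observing the corrupted string "thw"
# given each candidate source word (made up).
likelihood = {"the": 0.05, "they": 0.001, "then": 0.002}

observed = "thw"
best = max(prior, key=lambda w: likelihood[w] * prior[w])
print(f"decoded source for {observed!r}: {best}")   # -> 'the'
```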

slide-4
SLIDE 4

Information theory

Coding example

binary coding of an eight-letter alphabet

  • We can encode an 8-letter alphabet with 8 bits using a one-hot representation
  • Can we do better than one-hot coding?

    letter  code
    a       00000001
    b       00000010
    c       00000100
    d       00001000
    e       00010000
    f       00100000
    g       01000000
    h       10000000

slide-6
SLIDE 6

Information theory

Coding example

binary coding of an eight-letter alphabet

  • We can encode an 8-letter alphabet with 8 bits using a one-hot representation
  • Can we do better than one-hot coding?
  • Can we do even better?

    letter  code
    a       00000000
    b       00000001
    c       00000010
    d       00000011
    e       00000100
    f       00000101
    g       00000110
    h       00000111
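As a quick illustration of the two codes on this slide (my own sketch, not part of the original deck), the Python snippet below builds both the one-hot code and the plain binary code for an 8-letter alphabet; the binary code needs only ceil(log2 8) = 3 distinct bit positions.

```python
import math

letters = "abcdefgh"
n = len(letters)

# One-hot: one bit per letter, exactly one bit set.
one_hot = {c: format(1 << i, f"0{n}b") for i, c in enumerate(letters)}

# Plain binary: encode the letter's index; ceil(log2(n)) bits suffice.
width = math.ceil(math.log2(n))
binary = {c: format(i, f"0{width}b") for i, c in enumerate(letters)}

for c in letters:
    print(c, one_hot[c], binary[c])
print("bits per letter:", n, "(one-hot) vs", width, "(binary)")
```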

slide-7
SLIDE 7

Information theory

Self information / surprisal

Self information (or surprisal) associated with an event x is

    I(x) = log 1/P(x) = − log P(x)

  • If the event is certain, the information (or surprise) associated with it is 0
  • Low-probability (surprising) events have higher information content
  • The base of the log determines the unit of information:
    base 2: bits, base e: nats, base 10: dits (also called bans or hartleys)
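A small helper function (my own sketch, not from the slides) makes the definition concrete; note that the base of the logarithm only changes the unit.

```python
import math

def surprisal(p, base=2):
    """Self information -log_base(p) of an event with probability p."""
    return -math.log(p, base)

print(surprisal(1.0))          # certain event: 0.0 bits
print(surprisal(0.5))          # fair coin flip: 1.0 bit
print(surprisal(1 / 8))        # 3.0 bits
print(surprisal(0.5, math.e))  # same event in nats: about 0.693
```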

slide-8
SLIDE 8

Information theory

Why log?

  • Reminder: logarithms transform exponential relations into linear relations
  • In most systems, a linear increase in capacity increases the number of possible outcomes exponentially
    – The number of possible strings you can fit into two pages is exponentially larger than for one page
    – But we expect the information to double, not to increase exponentially
  • Working with logarithms is mathematically and computationally more convenient

slide-9
SLIDE 9

Information theory

Entropy

Entropy is a measure of the uncertainty of a random variable:

    H(X) = − ∑_x P(x) log P(x)

  • Entropy is the lower bound on the best average code length, given the distribution P that generates the data
  • Entropy is the average surprisal: H(X) = E[− log P(X)]
  • It generalizes to continuous distributions as well (replace the sum with an integral)

Note: entropy is about a distribution, while self information is about individual events.
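The definition translates directly into code. This is a minimal sketch of my own (not from the slides): entropy as the expected surprisal of a discrete distribution, in bits.

```python
import math

def entropy(probs, base=2):
    """H = -sum p * log(p) over the outcomes with non-zero probability."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))                # 1.0 bit
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits
print(entropy([1.0]))                     # 0.0: a certain outcome carries no information
```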

slide-10
SLIDE 10

Information theory

Example: entropy of a Bernoulli distribution

[Plot: the entropy H(X) in bits of a Bernoulli variable as a function of P(X = 1); the curve rises from 0, peaks at 1 bit when P(X = 1) = 0.5, and falls back to 0]
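The shape of this curve can be reproduced numerically with the binary entropy function below (my own sketch, not from the slides): it is 0 at p = 0 and p = 1 and peaks at 1 bit when P(X = 1) = 0.5.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) random variable."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"P(X=1) = {p:.1f}  H(X) = {binary_entropy(p):.3f} bits")
```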

slide-11
SLIDE 11

Information theory

Entropy: demonstration

increasing number of outcomes increases entropy

H = − log 1 = 0
H = − (1/2) log2 (1/2) − (1/2) log2 (1/2) = 1
H = − (1/4) log2 (1/4) − (1/4) log2 (1/4) − (1/4) log2 (1/4) − (1/4) log2 (1/4) = 2

slide-14
SLIDE 14

Information theory

Entropy: demonstration

the distribution matters

H = − (1/4) log2 (1/4) − (1/4) log2 (1/4) − (1/4) log2 (1/4) − (1/4) log2 (1/4) = 2
H = − (1/2) log2 (1/2) − (1/8) log2 (1/8) − (1/8) log2 (1/8) − (1/8) log2 (1/8) = 1.47
H = − (3/4) log2 (3/4) − (1/16) log2 (1/16) − (1/16) log2 (1/16) − (1/16) log2 (1/16) = 0.97

slide-19
SLIDE 19

Information theory

Back to coding letters

  • Can we do better?
  • No. H = 3 bits, we need 3 bits on average
  • If the probabilities were different, could we do better?

Uniform distribution has the maximum uncertainty, hence the maximum entropy.

    letter  prob  code
    a       1/8   000
    b       1/8   001
    c       1/8   010
    d       1/8   011
    e       1/8   100
    f       1/8   101
    g       1/8   110
    h       1/8   111

slide-21
SLIDE 21

Information theory

Back to coding letters

  • Can we do better?
  • No. H = 3 bits, we need 3 bits on average
  • If the probabilities were different, could we do better?
  • Yes. Now H = 2 bits, we need 2 bits on average

Uniform distribution has the maximum uncertainty, hence the maximum entropy.

    letter  prob  code
    a       1/2   0
    b       1/4   10
    c       1/8   110
    d       1/16  1110
    e       1/64  111100
    f       1/64  111101
    g       1/64  111110
    h       1/64  111111
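To check the claim on this slide, the sketch below (my own, not from the deck) computes both the entropy of the skewed distribution and the average length of the variable-length code in the table; both come out to exactly 2 bits, whereas the uniform 8-letter alphabet needs 3 bits.

```python
import math

# Skewed distribution and the prefix code from the slide.
code = {"a": "0", "b": "10", "c": "110", "d": "1110",
        "e": "111100", "f": "111101", "g": "111110", "h": "111111"}
prob = {"a": 1/2, "b": 1/4, "c": 1/8, "d": 1/16,
        "e": 1/64, "f": 1/64, "g": 1/64, "h": 1/64}

entropy = -sum(p * math.log2(p) for p in prob.values())
avg_len = sum(prob[c] * len(code[c]) for c in code)
print(entropy, avg_len)                         # 2.0 2.0

uniform = [1/8] * 8
print(-sum(p * math.log2(p) for p in uniform))  # 3.0
```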

slide-23
SLIDE 23

Information theory

Entropy of your random numbers

[Bar chart: relative frequencies of the 'random' numbers 1–20 chosen by the class: 0.04, 0.11, 0.04, 0.04, 0.04, 0.07, 0.04, 0.07, 0.04, 0.04, 0.07, 0.21, 0.07, 0.04, 0.11]

  • Entropy of the distribution: H = −(0.04 × log2 0.04 + 0.11 × log2 0.11 + . . . + 0.11 × log2 0.11) = 3.63
  • If it was uniformly distributed, the entropy would be H = −20 × (1/20 × log2 1/20) = 4.32

slide-25
SLIDE 25

Information theory

Pointwise mutual information

Pointwise mutual information (PMI) between two events is defined as

    PMI(x, y) = log2 P(x, y) / (P(x)P(y))

  • Reminder: P(x, y) = P(x)P(y) if the two events are independent
  • PMI is
      0 if the events are independent
      positive (+) if the events co-occur more often than by chance
      negative (−) if the events co-occur less often than by chance
  • Pointwise mutual information is symmetric: PMI(x, y) = PMI(y, x)
  • PMI is often used as a measure of association (e.g., between words) in computational/corpus linguistics
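PMI is usually estimated from corpus counts. The snippet below is a small illustrative sketch of my own (all counts are invented): it estimates P(x), P(y), and P(x, y) from word and bigram frequencies and reports the PMI of a word pair.

```python
import math

# Hypothetical corpus statistics (invented numbers for illustration).
n_tokens = 1_000_000          # total number of word tokens
n_bigrams = n_tokens - 1      # total number of adjacent word pairs
count_x = 1200                # count("strong")
count_y = 800                 # count("tea")
count_xy = 40                 # count("strong tea")

p_x = count_x / n_tokens
p_y = count_y / n_tokens
p_xy = count_xy / n_bigrams

pmi = math.log2(p_xy / (p_x * p_y))
print(f"PMI(strong, tea) = {pmi:.2f}")   # positive: they co-occur more than by chance
```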

slide-26
SLIDE 26

Information theory

Mutual information

Mutual information measures the mutual dependence between two random variables:

    MI(X, Y) = ∑_x ∑_y P(x, y) log2 P(x, y) / (P(x)P(y))

  • MI is the average (expected value of) PMI
  • PMI is defined on events, MI is defined on distributions
  • Note the similarity with covariance (or correlation)
  • Unlike correlation, mutual information is also defined for discrete variables
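Mutual information is just PMI averaged under the joint distribution. The minimal sketch below (mine, with a made-up joint table) computes it by the double sum in the definition.

```python
import math

# A made-up joint distribution P(x, y) over two binary variables.
joint = {(0, 0): 0.4, (0, 1): 0.1,
         (1, 0): 0.1, (1, 1): 0.4}

# Marginals P(x) and P(y).
p_x = {x: sum(p for (a, _), p in joint.items() if a == x) for x in (0, 1)}
p_y = {y: sum(p for (_, b), p in joint.items() if b == y) for y in (0, 1)}

mi = sum(p * math.log2(p / (p_x[x] * p_y[y]))
         for (x, y), p in joint.items() if p > 0)
print(f"MI = {mi:.3f} bits")   # about 0.278 bits for this table
```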

slide-27
SLIDE 27

Information theory

Conditional entropy

Conditional entropy is the entropy of a random variable conditioned on another random variable:

    H(X | Y) = ∑_{y∈Y} P(y) H(X | Y = y) = − ∑_{x∈X, y∈Y} P(x, y) log P(x | y)

  • H(X | Y) = H(X) if the random variables are independent
  • Conditional entropy is lower if the random variables are dependent

slide-28
SLIDE 28

Information theory

Entropy, mutual information and conditional entropy

[Diagram: the relationships among H(X), H(Y), H(X | Y), H(Y | X), MI(X, Y), and the joint entropy H(X, Y)]
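The relations in the diagram are easy to verify numerically. The sketch below (my own, reusing the same kind of toy joint table as above) checks that MI(X, Y) = H(X) − H(X | Y) = H(X) + H(Y) − H(X, Y).

```python
import math

def h(probs):
    """Entropy in bits of a discrete distribution given as probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_x = {x: sum(p for (a, _), p in joint.items() if a == x) for x in (0, 1)}
p_y = {y: sum(p for (_, b), p in joint.items() if b == y) for y in (0, 1)}

h_x = h(p_x.values())
h_y = h(p_y.values())
h_xy = h(joint.values())                  # joint entropy H(X, Y)
h_x_given_y = h_xy - h_y                  # chain rule: H(X | Y) = H(X, Y) - H(Y)
mi = h_x + h_y - h_xy

print(round(mi, 3), round(h_x - h_x_given_y, 3))   # both equal MI(X, Y)
```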

slide-29
SLIDE 29

Information theory

Cross entropy

Cross entropy measures the entropy of a distribution (P) under another distribution (Q):

    H(P, Q) = − ∑_x P(x) log Q(x)

  • It often arises in the context of approximation:
    – if we intend to approximate the true distribution (P) with an approximation of it (Q)
  • It is never smaller than H(P): it is the (non-optimal) average code length for data from P coded using Q
  • It is a common error function in ML for categorical distributions

Note: the notation H(X, Y) is also used for joint entropy.
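In machine learning the same quantity shows up as the loss for a categorical prediction. The sketch below (mine, with invented numbers) treats P as a one-hot 'true' distribution and Q as a model's predicted probabilities; the cross entropy then reduces to −log Q(true class).

```python
import math

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log2 Q(x) over the same set of outcomes."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p_true = [0.0, 1.0, 0.0]        # one-hot: the correct class is the second one
q_pred = [0.1, 0.7, 0.2]        # model's predicted distribution (made up)

print(cross_entropy(p_true, q_pred))   # = -log2(0.7), about 0.515 bits
print(-math.log2(q_pred[1]))           # same value
```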

slide-30
SLIDE 30

Information theory

KL-divergence / relative entropy

For two distributions P and Q with the same support, the Kullback–Leibler divergence of Q from P (or relative entropy of P given Q) is defined as

    DKL(P∥Q) = ∑_x P(x) log2 P(x) / Q(x)

  • DKL measures the number of extra bits needed when Q is used instead of P
  • DKL(P∥Q) = H(P, Q) − H(P)
  • It is used for measuring the difference between two distributions
  • Note: it is not symmetric (not a distance measure)
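A short numerical check of the identity on this slide, in the same sketch style as above (my own code, with made-up distributions): D_KL(P∥Q) computed directly matches H(P, Q) − H(P), and swapping the arguments gives a different value.

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.1]   # "true" distribution
q = [0.5, 0.5]   # approximation

print(kl(p, q))                          # about 0.53 bits of extra code length
print(cross_entropy(p, q) - entropy(p))  # same value
print(kl(q, p))                          # about 0.74: KL is not symmetric
```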

slide-31
SLIDE 31

Information theory

Short divergence: distance measure

A distance function, or a metric, satisfies:

  • d(x, y) ⩾ 0
  • d(x, y) = d(y, x)
  • d(x, y) = 0 ⇔ x = y
  • d(x, y) ⩽ d(x, z) + d(z, y)

We will use distance measures/metrics often in this course.

slide-32
SLIDE 32

Information theory

Summary

  • Information theory has many applications in NLP and ML
  • We reviewed a number of important concepts from information theory:
    – Self information
    – Entropy
    – Pointwise MI
    – Mutual information
    – Cross entropy
    – KL-divergence

Next:
  Fri: Exercises
  Mon: Statistical inference
  Wed: N-gram language models

slide-34
SLIDE 34

Further reading

  • The original article by Shannon (1948), which started the field, is also quite easy to read.
  • MacKay (2003) covers most of the topics discussed, in a way quite relevant to machine learning. The complete book is freely available online (see the link below).

Grinstead, Charles Miller and James Laurie Snell (2012). Introduction to Probability. American Mathematical Society. ISBN: 9780821894149. URL: http://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/book.html.

Jaynes, Edwin T. (2007). Probability Theory: The Logic of Science. Ed. by G. Larry Bretthorst. Cambridge University Press. ISBN: 978-05-2159-271-0.

MacKay, David J. C. (2003). Information Theory, Inference and Learning Algorithms. Cambridge University Press. ISBN: 978-05-2164-298-9. URL: http://www.inference.phy.cam.ac.uk/itprnn/book.html.

Shannon, Claude E. (1948). "A mathematical theory of communication". In: Bell Systems Technical Journal 27, pp. 379–423, 623–656.