Maxent Models (II)
CMSC 473/673 UMBC September 20th, 2017
Announcements: Assignment 1
Due 11:59 PM, Saturday 9/23 (~3.5 days away)
Use the submit utility with:
- class id: cs473_ferraro
- assignment id: a1
We must be able to run it on GL!
Common pitfall #1: forgetting files
Common pitfall #2: incorrect paths to files
Common pitfall #3: 3rd-party libraries
Announcements: Question 6
$$p^{(n)}(x_i \mid x_{i-n+1:i-1}) = \lambda_n f^{(n)}(x_{i-n+1:i}) + (1 - \lambda_n)\, p^{(n-1)}(x_i \mid x_{i-n+2:i-1})$$

$$p^{(n)}(x_i \mid x_{i-n+1:i-1}) = \lambda_{n,n} f^{(n)}(x_{i-n+1:i}) + \lambda_{n,n-1} f^{(n-1)}(x_{i-n+2:i}) + \cdots + \lambda_{n,0} f^{(0)}(\cdot),$$

$$\text{where } \lambda_{n,0} = 1 - \sum_{m=0}^{n-1} \lambda_{n,n-m}.$$
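As a concrete illustration, here is a minimal Python sketch of the first (two-weight) interpolation above; the table names `freq` and `lam` are assumptions, standing in for the relative-frequency estimates f^(n) and the weights λ_n:

```python
# A minimal sketch of the interpolated estimate above (assumption: `freq`
# maps each order n to a dict from n-gram tuples to relative-frequency
# estimates f^(n), and `lam` maps each order n to its weight lambda_n).
def interp_prob(word, history, freq, lam, n):
    if n == 0:
        return freq[0][()]              # order-0 base case, e.g. uniform
    context = tuple(history[-(n - 1):]) if n > 1 else ()
    # weight the order-n estimate against the recursively smoothed backoff
    return (lam[n] * freq[n].get(context + (word,), 0.0)
            + (1 - lam[n]) * interp_prob(word, history, freq, lam, n - 1))
```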
Announcements: Course Project
Official handout will be out Friday 9/22
Until then, focus on assignment 1
Teams of 1-3
Mixed undergrad/grad is encouraged but not required
Some novel aspect is needed
Ex 1: reimplement an existing technique and apply it to a new domain
Ex 2: reimplement an existing technique and apply it to a new (human) language
Ex 3: explore a novel technique on an existing problem
Recap from last time…
Classify or Decode with Bayes Rule
p(Y | X): how well does text Y represent label X?
p(X): how likely is label X overall?
For “simple” or “flat” labels (a sketch follows below):
- iterate through labels
- evaluate the score for each label, keeping only the best (n best)
- return the best (or n best) label and score
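A minimal Python sketch of that recipe, where `likelihood` and `prior` are assumed stand-ins for p(Y | X) and p(X):

```python
import math

# Iterate through labels, score each with Bayes' rule in log space,
# and keep the best label and its score.
def classify(y, labels, likelihood, prior):
    best_label, best_score = None, -math.inf
    for x in labels:
        score = math.log(likelihood(y, x)) + math.log(prior(x))
        if score > best_score:
            best_label, best_score = x, score
    return best_label, best_score
```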
Classification Evaluation: the 2-by-2 contingency table
                          Actually Correct      Actually Incorrect
Selected/guessed          True Positive (TP)    False Positive (FP)
Not selected/not guessed  False Negative (FN)   True Negative (TN)
[Figure: for each class/choice, the overlap between the correct items and the guessed items]
Classification Evaluation: Accuracy, Precision, and Recall
Accuracy: % of items correct
Precision: % of selected items that are correct
Recall: % of correct items that are selected
(see the sketch below)
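These definitions translate directly into code; a small sketch using the 2-by-2 table's four counts:

```python
# Computing the three metrics directly from the contingency-table counts.
def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)   # % of items correct

def precision(tp, fp):
    return tp / (tp + fp)                    # % of selected that are correct

def recall(tp, fn):
    return tp / (tp + fn)                    # % of correct that are selected
```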
Language Modeling as Naïve Bayes Classifier
Adopt a naïve bag-of-words representation of Y_i:
- assume position doesn't matter
- assume the feature probabilities are independent given the class X
(sketched below)
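Under those assumptions the class score factors into a prior plus per-word terms; a sketch, where the lookup tables `log_prior` and `log_word_prob` are assumptions:

```python
# Bag-of-words Naive Bayes score for one class x: position is ignored
# and word probabilities multiply (a sum in log space).
def nb_log_score(words, x, log_prior, log_word_prob):
    return log_prior[x] + sum(log_word_prob[x][w] for w in words)
```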
Naïve Bayes Summary

Potential Advantages
- Very fast, low storage requirements
- Robust to irrelevant features
- Very good in domains with many equally important features
- Optimal if the independence assumptions hold
- Dependable baseline for text classification (but often not the best)

Potential Issues
- Model the posterior in one go?
- Are the features really uncorrelated?
- Are plain counts always appropriate?
- Are there “better” (automated, more principled) ways of handling missing/noisy data?
Relevant for classification… and language modeling
Maximum Entropy Models

a more general language model:
$$\arg\max_X p(Y \mid X)\, p(X)$$

classify in one go:
$$\arg\max_X p(X \mid Y)$$
ATTACK
Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.
We need to score the different combinations.
Document Classification
Score and Combine Our Possibilities

score1(fatally shot, ATTACK)
score2(seriously wounded, ATTACK)
score3(Shining Path, ATTACK)
…
scorek(department, ATTACK)

COMBINE → posterior probability of ATTACK
(are all of these uncorrelated?)
Q: What are the score and combine functions for Naïve Bayes?
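One natural answer: for Naïve Bayes each score is plausibly a (log) probability and combine is a product, i.e., a sum in log space:

$$\mathrm{score}_i(w_i, x) = \log p(w_i \mid x), \qquad \mathrm{combine} = \log p(x) + \sum_i \mathrm{score}_i(w_i, x)$$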
Scoring Our Possibilities

score1(fatally shot, ATTACK)
score2(seriously wounded, ATTACK)
score3(Shining Path, ATTACK)
…

Learn these scores… but how? What do we learn, and what do we declare?
Maxent Modeling

f(x) = exp(x)
exp gives a positive, unnormalized probability:

score1(fatally shot, ATTACK)
score2(seriously wounded, ATTACK)
score3(Shining Path, ATTACK)
…
Maxent Modeling

Learn the scores (but we'll declare what combinations should be looked at):

weight1 * applies1(fatally shot, ATTACK)
weight2 * applies2(seriously wounded, ATTACK)
weight3 * applies3(Shining Path, ATTACK)
…
Q: What if none of our features apply?
Guiding Principle for Log-Linear Models
“[The log-linear estimate] is the least biased estimate possible on the given information; i.e., it is maximally noncommittal with regard to missing information.”
Edwin T. Jaynes, 1957
exp(θ · f): if no features apply, then f = 0 and exp(θ · 0) = 1
Easier-to-write form

θ1 * f1(fatally shot, ATTACK)
θ2 * f2(seriously wounded, ATTACK)
θ3 * f3(Shining Path, ATTACK)
…

K weights, K features: a K-dimensional weight vector and a K-dimensional feature vector, combined with a dot product (see the sketch below):

$$\theta \cdot f(x, y) = \sum_{k=1}^{K} \theta_k f_k(x, y)$$
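A NumPy sketch of that dot product; the particular weights and feature values are made up for illustration:

```python
import numpy as np

theta = np.array([1.5, -0.3, 0.8])   # K-dimensional weight vector (K = 3)
f_xy = np.array([1.0, 1.0, 0.0])     # K-dimensional feature vector f(x, y)
score = theta @ f_xy                 # theta . f = sum_k theta_k * f_k
print(score)                         # 1.2
```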
Log-Linear Models

The feature function(s) f are also called sufficient statistics or “strength” function(s).
The feature weights θ are also called natural parameters or distribution parameters.
How do we normalize?

weight1 * applies1(fatally shot, X)
weight2 * applies2(seriously wounded, X)
weight3 * applies3(Shining Path, X)
…

(scored for each label X = x)
Normalization for Classification

$$p(x \mid y) \propto \exp(\theta \cdot f(x, y)), \qquad p(x \mid y) = \frac{\exp(\theta \cdot f(x, y))}{\sum_{x'} \exp(\theta \cdot f(x', y))}$$

classify doc y with label x in one go
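A sketch of that normalization over labels; the max-subtraction is a standard numerical-stability trick, added here:

```python
import numpy as np

# Given an array of scores theta . f(x, y), one entry per label x,
# exponentiate and divide by the sum over labels to get p(x | y).
def maxent_probs(scores):
    unnorm = np.exp(scores - np.max(scores))  # stability: shift before exp
    return unnorm / unnorm.sum()

print(maxent_probs(np.array([1.2, 0.3, -0.5])))  # probabilities sum to 1
```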
Normalization for Language Model

general class-based (X) language model of doc y
Can be significantly harder in the general case
Simplifying assumption: maxent n-grams!
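Under the n-gram assumption, each conditional is normalized over the vocabulary alone; for a maxent bigram, for instance (a sketch of the assumption, with V the vocabulary):

$$p(y_j \mid y_{j-1}) = \frac{\exp(\theta \cdot f(y_j, y_{j-1}))}{\sum_{y' \in V} \exp(\theta \cdot f(y', y_{j-1}))}$$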
https://www.csee.umbc.edu/courses/undergraduate/473/f17/loglin-tutorial/
https://goo.gl/B23Rxo (Lesson 1)
Connections to Other Techniques

Log-Linear Models are known…
- as statistical regression: (multinomial) logistic regression, softmax regression
- as based in information theory: Maximum Entropy models (MaxEnt)
- as a form of Generalized Linear Models
- viewed as discriminative Naïve Bayes
- as very shallow (sigmoidal) neural nets (to be cool today :)
Learning θ

Objective F(θ): how well the probabilistic model explains the given observations
How will we optimize F(θ)?
Calculus
[Figure: a curve F(θ) plotted against θ, with its maximum at θ*; F'(θ) is the derivative of F with respect to θ]
Example

F(x) = -(x - 2)^2
Differentiate: F'(x) = -2x + 4
Solve F'(x) = 0: x = 2
Common Derivative Rules
[Figure: following the derivative of F(θ) uphill from a start θ0 through iterates θ1, θ2, θ3, with values y0…y3 and gradients g0…g2, toward the maximum θ*]

What if you can't find the roots? Follow the derivative:

Set t = 0
Pick a starting value θt
Until converged:
- get the value yt = F(θt)
- get the gradient gt = F'(θt)
- take a step from θt along gt to get θt+1; set t = t + 1

(a runnable sketch follows below)
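A runnable Python sketch of this loop on the earlier example F(θ) = -(θ - 2)^2, so F'(θ) = -2θ + 4; the step size `rho` and the convergence tolerance are assumptions:

```python
# Follow the derivative uphill until the steps become negligible.
def gradient_ascent(grad, theta, rho=0.1, tol=1e-8):
    while True:
        g = grad(theta)
        if abs(rho * g) < tol:      # converged: the step is tiny
            return theta
        theta += rho * g            # move uphill along the derivative

print(gradient_ascent(lambda th: -2 * th + 4, theta=0.0))  # -> ~2.0
```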
Gradient = multi-variable derivative

For F(θ) with a K-dimensional input θ, the gradient ∇F(θ) takes a K-dimensional input to a K-dimensional output.
Gradient Ascent
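In its general multi-variable form, the update is the standard gradient ascent step (with ρt an assumed per-iteration step size):

$$\theta^{(t+1)} = \theta^{(t)} + \rho_t \, \nabla F(\theta^{(t)})$$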