
Statistical NLP

Spring 2010

Lecture 4: Text Categorization

Dan Klein – UC Berkeley

Overview

So far: language models give P(s)

  • Help model fluency for various noisy-channel processes (MT, ASR, etc.)
  • N-gram models don't represent any deep variables involved in language structure or meaning
  • Usually we want to know something about the input other than how likely it is (syntax, semantics, topic, etc.)

Next: Naïve Bayes models

  • We introduce a single new global variable
  • Still a very simplistic model family
  • Lets us model hidden properties of text, but only very non-local ones…
  • In particular, we can only model properties which are largely invariant to word order (like topic)

Text Categorization

  • Want to classify documents into broad semantic topics (e.g. politics, sports, etc.)
  • Which one is the politics document? (And how much deep processing did that decision take?)
  • One approach: bag-of-words and Naïve-Bayes models
  • Another approach later…
  • Usually begin with a labeled corpus containing examples of each class

Obama is hoping to rally support for his $825 billion stimulus package on the eve of a crucial House vote. Republicans have expressed reservations about the proposal, calling for more tax cuts and less spending. GOP representatives seemed doubtful that any deals would be made.

California will open the 2009 season at home against Maryland Sept. 5 and will play a total of six games in Memorial Stadium in the final football schedule announced by the Pacific-10 Conference Friday. The original schedule called for 12 games over 12 weekends.

Naïve-Bayes Models

  • Idea: pick a topic, then generate a document using a language model for that topic.
  • Naïve-Bayes assumption: all words are independent given the topic:

      P(c, w1 … wn) = P(c) ∏_i P(wi | c)

  • Compare to a unigram language model:

      P(w1 … wn) = ∏_i P(wi)

    with a final STOP token ending each document (wn = STOP).
Using NB for Classification

  • We have a joint model of topics and documents
  • Gives posterior likelihood of topic given a document:

      P(c | w1 … wn) = P(c) ∏_i P(wi | c) / Σ_c′ P(c′) ∏_i P(wi | c′)

  • What about totally unknown words?
  • Can work shockingly well for text cat (especially in the wild)
  • How can unigram models be so terrible for language modeling, but class-conditional unigram models work for text cat?
  • Numerical/speed issues
  • How about NB for spam detection?

Two NB Formulations

Two NB event models for text categorization:

The class-conditional unigram model, a.k.a. multinomial model
  • One node per word in the document
  • Driven by words which are present
  • Multiple occurrences, multiple evidence
  • Better overall – plus, know how to smooth

The binomial (binary) model
  • One node for each word in the vocabulary
  • Incorporates explicit negative correlations
  • Know how to do feature selection (e.g. keep words with high mutual information with the class variable)

[Plot: text-cat accuracy as a function of vocabulary size]
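The mutual-information feature selection mentioned for the binary model can be made concrete. A sketch under my own assumptions (toy corpus, base-2 logs so the score is in bits):

```python
import math

def mutual_information(docs, labels, word):
    """I(X; Y) in bits, where X = 1 if `word` occurs in a document
    and Y is the document's class."""
    n = len(docs)
    def joint(x, y):
        return sum(1 for d, c in zip(docs, labels)
                   if (word in d.split()) == x and c == y) / n
    mi = 0.0
    for x in (True, False):
        px = sum(joint(x, y) for y in set(labels))
        for y in set(labels):
            py = sum(joint(x2, y) for x2 in (True, False))
            pxy = joint(x, y)
            if pxy > 0:  # 0 log 0 = 0 by convention
                mi += pxy * math.log2(pxy / (px * py))
    return mi
```

A word perfectly correlated with the class carries one full bit; a word spread evenly across classes carries none, so ranking the vocabulary by this score keeps the discriminative features.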


Example: Barometers

NB FACTORS:
  P(s) = 1/2   P(− | s) = 1/4   P(− | r) = 3/4

PREDICTIONS:
  • P(r, −, −) = (½)(¾)(¾)
  • P(s, −, −) = (½)(¼)(¼)
  • P(r | −, −) = 9/10
  • P(s | −, −) = 1/10
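The posterior here is three lines of arithmetic. A quick check (my own sketch; "−" stands for the observed barometer reading):

```python
# Priors and factors from the slide: two conditionally independent "-" readings.
p_s, p_r = 1 / 2, 1 / 2            # P(s), P(r)
p_obs_s, p_obs_r = 1 / 4, 3 / 4    # P(- | s), P(- | r)

joint_r = p_r * p_obs_r * p_obs_r  # P(r, -, -) = (1/2)(3/4)(3/4)
joint_s = p_s * p_obs_s * p_obs_s  # P(s, -, -) = (1/2)(1/4)(1/4)

post_r = joint_r / (joint_r + joint_s)  # P(r | -, -) = 9/10
```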
Example: Stoplights

(w = working light, b = broken light; observations r = red, g = green)

NB FACTORS:
  P(w) = 6/7   P(r | w) = 1/2   P(g | w) = 1/2
  P(b) = 1/7   P(r | b) = 1     P(g | b) = 0

P(b | r, r) = 4/10 (what happened?)
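The same arithmetic explains the "(what happened?)": two red observations move the broken-light posterior from the 1/7 prior up to 4/10 (a quick check, my own sketch):

```python
p_w, p_b = 6 / 7, 1 / 7        # P(working), P(broken)
p_r_w, p_r_b = 1 / 2, 1.0      # P(red | working), P(red | broken)

joint_b = p_b * p_r_b * p_r_b  # P(b, r, r) = (1/7)(1)(1)
joint_w = p_w * p_r_w * p_r_w  # P(w, r, r) = (6/7)(1/2)(1/2)

post_b = joint_b / (joint_b + joint_w)  # P(b | r, r) = 4/10
```

Each extra red reading is treated as fresh evidence, so correlated observations inflate the posterior exactly as the (non-)independence discussion that follows warns.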

(Non-)Independence Issues

Mild Non-Independence
  • Evidence all points in the right direction
  • Observations just not entirely independent
  • Results: inflated confidence, deflated priors
  • What to do? Boost priors or attenuate evidence

Severe Non-Independence
  • Words viewed independently are misleading
  • Interactions have to be modeled
  • What to do? Change your model!

Language Identification

  • How can we tell what language a document is in?
  • How to tell the French from the English?
  • Treat it as word-level text cat?
      Overkill, and requires a lot of training data
      You don't actually need to know about words!
  • Option: build a character-level language model

The 38th Parliament will meet on Monday, October 4, 2004, at 11:00 a.m. The first item of business will be the election of the Speaker of the House of Commons. Her Excellency the Governor General will open the First Session of the 38th Parliament on October 5, 2004, with a Speech from the Throne.

La 38e législature se réunira à 11 heures le lundi 4 octobre 2004, et la première affaire à l'ordre du jour sera l'élection du président de la Chambre des communes. Son Excellence la Gouverneure générale ouvrira la première session de la 38e législature avec un discours du Trône le mardi 5 octobre 2004.

Σύμφωνο σταθερότητας και ανάπτυξης
Patto di stabilità e di crescita
(the Stability and Growth Pact, in Greek and Italian)
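A character-unigram version of this idea fits in a few lines. The two training strings are short fragments of the passages above (lowercased, accents stripped), and the add-one smoothing and `identify` interface are my own assumptions:

```python
import math
from collections import Counter

# Tiny class-conditional character models, one per language.
EN = ("the first item of business will be the election of "
      "the speaker of the house of commons")
FR = ("la premiere affaire a l'ordre du jour sera "
      "l'election du president de la chambre des communes")

VOCAB = set(EN) | set(FR)

def char_model(text):
    """Add-one-smoothed character unigram distribution P(ch | language)."""
    counts = Counter(text)
    return {ch: (counts[ch] + 1) / (len(text) + len(VOCAB)) for ch in VOCAB}

MODELS = {"en": char_model(EN), "fr": char_model(FR)}

def identify(s):
    """argmax over languages of the character-level log-likelihood
    (uniform prior; characters outside the training vocabulary are skipped)."""
    return max(MODELS, key=lambda lang: sum(
        math.log(MODELS[lang][ch]) for ch in s if ch in VOCAB))
```

Even these two short training strings separate the languages: characters like w, k, d, and the apostrophe are strong cues, which is why character-level models need so little data compared to word-level text cat.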

Class-Conditional LMs

  • Can add a topic variable to richer language models:

      P(c, w1 … wn) = P(c) ∏_i P(wi | wi−1, c)

  • Could be characters instead of words, used for language ID (HW2)
  • Could sum out the topic variable and use as a language model
  • How might a class-conditional n-gram language model behave differently from a standard n-gram model?

Clustering / Pattern Detection

Problem 1: There are many patterns in the data, most of which you don't care about.

  Soccer team wins match
  Stocks close up 3%
  Investing in the stock market has…
  The first game of the world series…


Clustering vs. Classification

  • Classification: we specify which pattern we want; features uncorrelated with that pattern are idle
  • Clustering: the clustering procedure locks onto whichever pattern is most salient, statistically
      P(content words | class) will learn topics
      P(length, function words | class) will learn style
      P(characters | class) will learn "language"

         P(w|sports)   P(w|politics)
  the      0.1           0.1
  game     0.02          0.005
  win      0.02          0.01

         P(w|headline)  P(w|story)
  the      0.05          0.1
  game     0.01          0.01
  win      0.01          0.01

Model-Based Clustering

  • Clustering with probabilistic models: build a model of the domain with observed data X and unobserved labels Y
  • Often: find θ to maximize P(X | θ) = Σ_Y P(X, Y | θ)
  • Problem 2: The relationship between the structure of your model and the kinds of patterns it will detect is complex.

  Observed (X):                           Unobserved (Y):
  LONDON – Soccer team wins match         ??
  NEW YORK – Stocks close up 3%           ??
  Investing in the stock market has…      ??
  The first game of the world series…     ??

Learning Models with EM

  • Hard EM: alternate between
      E-step: find best "completions" Y for fixed θ
      M-step: find best parameters θ for fixed Y
  • Example: K-Means
  • Problem 3: Data likelihood (usually) isn't the objective you really care about
  • Problem 4: You can't find global maxima anyway

Hard EM for Naïve-Bayes

Procedure:
  (1) Calculate best completions:  y*_d = argmax_c P(c | d, θ)
  (2) Compute relevant counts from the completed data
  (3) Compute new parameters from these counts (divide)
  (4) Repeat until convergence

Can also do this when some docs are labeled.
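The four steps above can be sketched directly. The tiny corpus, the two labeled seed documents (the slide's "some docs are labeled" case), and the helper names are my own assumptions for illustration:

```python
import math
from collections import Counter

def fit(docs, labels, n_classes, vocab):
    """Steps (2)-(3): counts from completed data, then new parameters
    (add-one smoothing; returns a log P(c, d) scorer)."""
    prior = [(labels.count(c) + 1) / (len(labels) + n_classes)
             for c in range(n_classes)]
    counts = [Counter() for _ in range(n_classes)]
    for d, c in zip(docs, labels):
        counts[c].update(d.split())
    def logprob(d, c):
        total = sum(counts[c].values())
        return math.log(prior[c]) + sum(
            math.log((counts[c][w] + 1) / (total + len(vocab)))
            for w in d.split())
    return logprob

def hard_em(docs, seed_labels, n_classes=2, iters=10):
    """Steps (1) and (4): best completions, refit, repeat,
    initialized from the few labeled seed documents."""
    vocab = {w for d in docs for w in d.split()}
    seeds = [(d, c) for d, c in zip(docs, seed_labels) if c is not None]
    logprob = fit([d for d, _ in seeds], [c for _, c in seeds], n_classes, vocab)
    for _ in range(iters):
        labels = [max(range(n_classes), key=lambda c: logprob(d, c))
                  for d in docs]                                    # (1) E-step
        logprob = fit(docs, labels, n_classes, vocab)               # (2)-(3) M-step
    return labels
```

With the seeds removed and a random initialization, the same loop can lock onto a different or degenerate pattern, which is exactly Problems 3 and 4 above.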

EM: More Formally

Hard EM:
  • Improve completions
  • Improve parameters
  • Each step either does nothing or increases the objective

Soft EM for Naïve-Bayes

Procedure:
  (1) Calculate posteriors (soft completions): P(c | d, θ)
  (2) Compute expected counts under those posteriors
  (3) Compute new parameters from these counts (divide)
  (4) Repeat until convergence


EM in General

  • We'll use EM over and over again to fill in missing data
  • Convenience scenario: we want P(x), including y just makes the model simpler (e.g. mixing weights for language models)
  • Induction scenario: we actually want to know y (e.g. clustering)
  • NLP differs from much of statistics / machine learning in that we often want to interpret or use the induced variables (which is tricky at best)
  • General approach: alternately update y and θ
  • E-step: compute posteriors P(y | x, θ)
      This means scoring all completions with the current parameters
      Usually, we do this implicitly with dynamic programming
  • M-step: fit θ to these completions
      This is usually the easy part – treat the completions as (fractional) complete data
  • Initialization: start with some noisy labelings and the noise adjusts into patterns based on the data and the model
  • We'll see lots of examples in this course
  • EM is only locally optimal (why?)

Heuristic Clustering?

Many methods of clustering have been developed:
  • Most start with a pairwise distance function
  • Most can be interpreted probabilistically (with some effort)
  • Axes: flat / hierarchical, agglomerative / divisive, incremental / iterative, probabilistic / graph-theoretic / linear-algebraic

Examples:
  • Single-link agglomerative clustering
  • Complete-link agglomerative clustering
  • Ward's method
  • Hybrid divisive/agglomerative schemes
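Single-link agglomerative clustering, the first example above, is a short greedy loop. The 1-D points and the choice of absolute difference as the pairwise distance are my own illustrative assumptions:

```python
def single_link(points, k):
    """Agglomerative clustering: start with singletons and repeatedly
    merge the two clusters whose *closest* members are nearest
    (single link), until only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None  # (distance, i, j) of the closest cluster pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters
```

Swapping min for max in the distance line gives complete-link clustering; the rest of the loop is unchanged.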

Document Clustering

  • Typically want to cluster documents by topic
  • Bag-of-words models usually do detect topic
  • It's detecting deeper structure, syntax, etc. where it gets really tricky!
  • All kinds of games to focus the clustering:
      Stop word lists
      Term weighting schemes (from IR, more later)
      Dimensionality reduction (more later)

Word Senses

Words have multiple distinct meanings, or senses:
  • Plant: living plant, manufacturing plant, …
  • Title: name of a work, ownership document, form of address, material at the start of a film, …

Many levels of sense distinctions:
  • Homonymy: totally unrelated meanings (river bank, money bank)
  • Polysemy: related meanings (star in sky, star on TV)
  • Systematic polysemy: productive meaning extensions (metonymy such as organizations to their buildings) or metaphor
  • Sense distinctions can be extremely subtle (or not)

Granularity of senses needed depends a lot on the task.
Why is it important to model word senses? Translation, parsing, information retrieval?

Word Sense Disambiguation

  • Example: living plant vs. manufacturing plant
  • How do we tell these senses apart?
      "context"
      Maybe it's just text categorization: each word sense represents a topic
      Run a Naïve-Bayes classifier?
  • Bag-of-words classification works OK for noun senses:
      90% on classic, shockingly easy examples (line, interest, star)
      80% on SensEval-1 nouns
      70% on SensEval-1 verbs

  The manufacturing plant which had previously sustained the town's economy shut down after an extended labor strike.

Various Approaches to WSD

Unsupervised learning:
  • Bootstrapping (Yarowsky 95)
  • Clustering

Indirect supervision:
  • From thesauri
  • From WordNet
  • From parallel corpora

Supervised learning:
  • Most systems do some kind of supervised learning
  • Many competing classification technologies perform about the same (it's all about the knowledge sources you tap)
  • Problem: training data available for only a few words


Resources

WordNet
  • Hand-built (but large) hierarchy of word senses
  • Basically a hierarchical thesaurus

SensEval → SemEval
  • A WSD competition, of which there have been 3+3 iterations
  • Training/test sets for a wide range of words, difficulties, and parts-of-speech
  • Bake-off where lots of labs tried lots of competing approaches

SemCor
  • A big chunk of the Brown corpus annotated with WordNet senses

Other Resources
  • The Open Mind Word Expert
  • Parallel texts
  • Flat thesauri

Verb WSD

Why are verbs harder?
  • Verbal senses less topical
  • More sensitive to structure, argument choice

Verb Example: "Serve"
  • [function] The tree stump serves as a table
  • [enable] The scandal served to increase his popularity
  • [dish] We serve meals for the homeless
  • [enlist] She served her country
  • [jail] He served six years for embezzlement
  • [tennis] It was Agassi's turn to serve
  • [legal] He was served by the sheriff

Knowledge Sources

So what do we need to model to handle "serve"?

There are distant topical cues:

  …. point … court ………………… serve ……… game …

  P(c, w1 … wn) = P(c) ∏_i P(wi | c)

Weighted Windows with NB

Distance conditioning
  • Some words are important only when they are nearby
      …. as …. point … court ………………… serve ……… game …
      ….………………………………………… serve as ……………..
  • Condition each context word on its relative position i:

      P(c, w−k … w+k) = P(c) ∏_{i = −k … +k} P(wi | c, i)

Distance weighting
  • Nearby words should get a larger vote
      … court …… serve as ……… game …… point
  • Weight each word's contribution by its distance:

      P(c, w−k … w+k) = P(c) ∏_{i = −k … +k} P(wi | c)^{weight(i)}
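The distance-weighting idea can be sketched as a scoring function. The 1/(1+|i|) weight and the toy log-probability in the test are my own assumptions; the slide does not fix a weighting scheme:

```python
import math

def weighted_score(context, log_likelihood,
                   weight=lambda i: 1.0 / (1.0 + abs(i))):
    """Score one sense for a target word: sum over (relative position i, word)
    context pairs of weight(i) * log P(word | sense), so nearby words get a
    larger vote.  The log-prior term is omitted for brevity."""
    return sum(weight(i) * log_likelihood(w) for i, w in context)
```

With weight(i) = 1 this reduces to the plain NB score; the conditioning variant instead indexes the likelihood table by relative position, P(w | c, i).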

Better Features

There are smarter features:
  • Argument selectional preference:
      serve NP[meals] vs. serve NP[papers] vs. serve NP[country]
  • Subcategorization:
      [function] serve PP[as]
      [enable] serve VP[to]
      [tennis] serve <intransitive>
      [food] serve NP {PP[to]}
  • Can capture poorly (but robustly) with local windows
  • … but we can also use a parser and get these features explicitly

Other constraints (Yarowsky 95):
  • One-sense-per-discourse (only true for broad topical distinctions)
  • One-sense-per-collocation (pretty reliable when it kicks in: manufacturing plant, flowering plant)

Complex Features with NB?

Example:

  Washington County jail served 11,166 meals last month, a figure that translates to feeding some 120 people three times daily for 31 days.

So we have a decision to make based on a set of cues:
  • context:jail, context:county, context:feeding, …
  • local-context:jail, local-context:meals
  • subcat:NP, direct-object-head:meals

Not clear how to build a generative derivation for these:
  • Choose topic, then decide on having a transitive usage, then pick "meals" to be the object's head, then generate other words?
  • How about the words that appear in multiple features?
  • Hard to make this work (though maybe possible)
  • No real reason to try