1 Dialog Systems ELIZA A psychotherapist agent (Weizenbaum, - PDF document

CSE 473: Artificial Intelligence
Advanced Applications: Natural Language Processing
Steve Tanimoto --- University of Washington
[Some of these slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]

What is NLP?
• Fundamental goal: analyze and process human language, broadly, robustly, accurately…
• End systems that we want to build:
  • Ambitious: speech recognition, machine translation, information extraction, dialog interfaces, question answering…
  • Modest: spelling correction, text categorization…

Problem: Ambiguities
• Headlines:
  • Enraged Cow Injures Farmer With Ax
  • Hospitals Are Sued by 7 Foot Doctors
  • Ban on Nude Dancing on Governor’s Desk
  • Iraqi Head Seeks Arms
  • Local HS Dropouts Cut in Half
  • Juvenile Court to Try Shooting Defendant
  • Stolen Painting Found by Tree
  • Kids Make Nutritious Snacks
• Why are these funny?

Parsing as Search

Grammar: PCFGs
• Natural language grammars are very ambiguous!
• PCFGs are a formal probabilistic model of trees
  • Each “rule” has a conditional probability (like an HMM)
  • Tree’s probability is the product of all rules used
• Parsing: Given a sentence, find the best tree – search!

ROOT → S          375/420
S → NP VP .       320/392
NP → PRP          127/539
VP → VBD ADJP     32/401
…..

Syntactic Analysis
Hurricane Emily howled toward Mexico's Caribbean coast on Sunday packing 135 mph winds and torrential rain and causing panic in Cancun, where frightened tourists squeezed into musty shelters.
[Demo: Berkeley NLP Group Parser http://tomato.banatao.berkeley.edu:8080/parser/parser.html]
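The PCFG slide says that a tree's probability is the product of the probabilities of all rules used. A minimal sketch of that product follows. The four rule probabilities are the ones shown on the slide; the dictionary layout, the function name `tree_probability`, and the choice of a partial derivation (the slide does not show the full tree or its lexical rules) are assumptions made for illustration.

```python
from fractions import Fraction
from math import prod

# Rule probabilities copied from the slide; the dict layout is an assumption.
RULE_PROB = {
    ("ROOT", ("S",)): Fraction(375, 420),
    ("S", ("NP", "VP", ".")): Fraction(320, 392),
    ("NP", ("PRP",)): Fraction(127, 539),
    ("VP", ("VBD", "ADJP")): Fraction(32, 401),
}

def tree_probability(rules_used):
    """Multiply the probabilities of every rule application in a derivation."""
    return prod(RULE_PROB[rule] for rule in rules_used)

# A partial derivation using only the four rules visible on the slide.
partial_derivation = [
    ("ROOT", ("S",)),
    ("S", ("NP", "VP", ".")),
    ("NP", ("PRP",)),
    ("VP", ("VBD", "ADJP")),
]

p = tree_probability(partial_derivation)
print(p, float(p))
```

A real parser would search over many candidate trees and keep the one with the highest product; this sketch only scores a single given derivation.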

Dialog Systems

ELIZA
• A “psychotherapist” agent (Weizenbaum, ~1964)
• Led to a long line of chatterbots
• How does it work:
  • Trivial NLP: string match and substitution
  • Trivial knowledge: tiny script / response database
  • Example: matching “I remember __” results in “Do you often think of __”?
• Can fool some people some of the time?
[Demo: http://nlp-addiction.com/eliza]

Watson
• A question-answering system (IBM, 2011)
• Designed for the game of Jeopardy
• How does it work:
  • Sophisticated NLP: deep analysis of questions, noisy matching of questions to potential answers
  • Lots of data: onboard storage contains a huge collection of documents (e.g. Wikipedia, etc.), exploits redundancy
  • Lots of computation: 90+ servers
• Can beat all of the people all of the time?

What’s in Watson?

Machine Translation
• Translate text from one language to another
• Recombines fragments of example translations
• Challenges:
  • What fragments? [learning to translate]
  • How to make efficient? [fast translation search]
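The ELIZA slide describes "string match and substitution" driven by a tiny script. The sketch below shows roughly what that could look like. Only the “I remember __” / “Do you often think of __” pair comes from the slide; the regular expression, the `SCRIPT` list, the `respond` function, and the fallback reply are hypothetical, and the original program's pronoun swapping is not reproduced.

```python
import re

# Hypothetical one-rule script; only this pattern/response pair is from the slide.
SCRIPT = [
    (re.compile(r"\bI remember (.+?)[.!?]*$", re.IGNORECASE),
     "Do you often think of {0}?"),
]
DEFAULT_REPLY = "Please go on."  # assumed fallback, not from the slide

def respond(utterance):
    """Return the response of the first script rule whose pattern matches."""
    text = utterance.strip()
    for pattern, template in SCRIPT:
        match = pattern.search(text)
        if match:
            # Substitute the captured fragment into the canned response.
            return template.format(match.group(1))
    return DEFAULT_REPLY

print(respond("I remember the beach."))
print(respond("Hello"))
```

Adding more (pattern, template) pairs to `SCRIPT` is all it takes to extend the agent, which is the sense in which the knowledge is "trivial".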

The Problem with Dictionary Lookups

MT: 60 Years in 60 Seconds

Data-Driven Machine Translation

Learning to Translate

An HMM Translation Model

Levels of Transfer

Example: Syntactic MT Output
[ISI MT system output]

Document Analysis with LSA: Outline
• Motivation
• Bag-of-words representation
• Stopword elimination, stemming, reference vocabulary
• Vector-space representation
• Document comparison with the cosine similarity measure
• Latent Semantic Analysis

Motivation
• Document analysis is a highly active area, very relevant to information science, the World Wide Web, and search engines.
• Algorithms for document analysis span a wide range of techniques, from string processing to large matrix computations.
• One application: automatic essay grading.

Representations for Documents
• Text string
• Image (e.g., .jpg, .gif, and .png files)
• Linguistically structured files: PostScript, Portable Document Format (PDF), XML.
• Vector: e.g., bag-of-words
• Hypertext, hypermedia

Fundamental Problems
• Representation*
• Lexical Analysis (tokenizing)*
• Information Extraction*
• Comparison (similarity, distance)*
• Classification (e.g., for net-nanny service)*
• Indexing (to permit fast retrieval)
• Retrieval (querying and query processing)
*important for AI

Bag-of-Words Representation
A multiset is a collection like a set, but which allows duplicates (any number of copies) of elements.
{ a, b, c } is a set. (It is also a multiset.)
{ a, a, b, c, c, c } is not a set, but it is a multiset.
{ c, a, b, a, c, c } is the same multiset. (Order doesn’t matter.)
A multiset is also called a bag.
[Figure: a bag containing the words: a, bag, in, may, of, repeat, words, words]
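The set versus multiset distinction above can be tried out directly. This is a small sketch that uses Python's `collections.Counter` as a stand-in for a multiset; that choice of data structure is an assumption of the example, not something the slides prescribe.

```python
from collections import Counter

as_set = {"a", "b", "c"}                        # a set: no duplicates
bag1 = Counter(["a", "a", "b", "c", "c", "c"])  # a multiset (bag)
bag2 = Counter(["c", "a", "b", "a", "c", "c"])  # same elements, different order

print(bag1 == bag2)         # True: order does not matter
print(bag1["c"])            # 3: number of copies of "c"
print(set(bag1) == as_set)  # True: dropping the counts gives back the set
```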

Bag-of-Words (continued)
Let document D = “The big fox jumped over the big fence.”
The bag representation is:
{ big, big, fence, fox, jumped, over, the, the }
For notational consistency, we use alphabetical order. Also, we omit punctuation and normalize the case.
The ordering information in the document is lost. But this is OK for some applications.

Eliminating Stopwords
In information retrieval and some other types of document analysis, we often begin by deleting words that don’t carry much meaning or that are so common that they do little to distinguish one document from another. Such words are called stopwords.
Examples:
(articles) a, an, the;
(quantifiers) any, some, only, many, all, no;
(pronouns) I, you, it, he, she, they, me, him, her, them, his, hers, their, theirs, my, mine, your, our, yours, ours, this, that, these, those, who, whom, which;
(prepositions) above, at, behind, below, beside, for, in, into, of, on, onto, over, under;
(verbs) am, are, be, been, is, were, go, gone, went, had, have, do, did, can, could, will, would, might, may, must;
(conjunctions) and, but, if, then, not, neither, nor, either, or;
(other) yes, perhaps, first, last, there, where, when.

Stemming
In order to detect similarities among words, it often helps to perform stemming. We typically stem a word by removing its suffixes, leaving the basic word, or “uninflecting” the word.
• apples → apple
• cacti → cactus
• swimming → swim
• swam → swim

Reference Vocabulary
A counterpart to stopwords is the reference vocabulary. These are the words that ARE allowed in document representations.
These are all stemmed, and are not stopwords.
There might be several hundred or even thousands of terms in a reference vocabulary for real document processing.

Indexing
Create links from terms to documents or document parts
(a) concordance
(b) table of contents
(c) book index
(d) index for a search engine
(e) database index for a relation (table)

Vector representation
Assume we have a reference vocabulary of words that might appear in our documents.
{apple, big, cat, dog, fence, fox, jumped, over, the, zoo}
We represent our bag
{ big, big, fence, fox, jumped, over, the, the }
by giving a vector (list) of occurrence counts of each reference term in the document:
[0, 2, 0, 0, 1, 1, 1, 1, 2, 0]
If there are n terms in the reference vocabulary, then each document is represented by a point in an n-dimensional space.
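The step from document to bag to count vector can be sketched as follows, using the example sentence and reference vocabulary from the slides. The tokenizing regular expression and the function names `bag_of_words` and `count_vector` are assumptions of this sketch; stopword removal and stemming are left out because the slide's own example keeps "the" and "jumped".

```python
import re
from collections import Counter

# Reference vocabulary from the slide, in alphabetical order.
REFERENCE_VOCABULARY = ["apple", "big", "cat", "dog", "fence",
                        "fox", "jumped", "over", "the", "zoo"]

def bag_of_words(text):
    """Lowercase the text, drop punctuation, and count each word."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def count_vector(bag, vocabulary):
    """Occurrence count of each reference term; absent terms count as 0."""
    return [bag[term] for term in vocabulary]

document = "The big fox jumped over the big fence."
bag = bag_of_words(document)
vector = count_vector(bag, REFERENCE_VOCABULARY)

print(sorted(bag.elements()))
print(vector)
```

Words outside the reference vocabulary are simply ignored by `count_vector`, so every document maps to a vector of the same length n.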

Concordance
A concordance for a document is a sort of dictionary that lists, for each word that occurs in the document, the sentences or lines in which it occurs.
“document”:
  A concordance for a document is a sort of dictionary
  that lists, for each word that occurs in the document the
“occurs”:
  that lists, for each word that occurs in the document the
  sentences or lines in which it occurs.

Search Engine Index
Query terms are organized into a large table or tree that can be quickly searched (e.g., a large hash table in memory, or a B-Tree with its top levels in memory).
Associated with each term is a list of occurrences, typically consisting of Document IDs or URLs.

Document Comparison
Typical problems:
• Determine whether two documents are slightly different versions of the same document. (applications: search engine hit filtering, plagiarism detection)
• Find the longest common subsequence for a pair of documents. (can be useful in genetic sequencing)
• Determine whether a new document should be placed into the same category as a model document. (essay grading, automatic response generation, etc.)

Cosine Similarity Function
Document 1: “All Blues. First the key to last night's notes.”
Document 2: “How to get your message across. Restate your key points first and last.”
Reference vocabulary:
{ across, blue, first, key, last, message, night, note, point, restate, zebra }

Cosine Similarity (cont)
Document 1 reduced: blue first key last night note
Document 2 reduced: message across restate key point first last
Document 1 vector representation: [0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0]
Document 2 vector representation: [1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0]

Cosine Similarity (cont)
Dot product (same as “inner product”):
[0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0] · [1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0]
= 0·1 + 1·0 + 1·1 + 1·1 + 1·1 + 0·1 + 1·0 + 1·0 + 0·1 + 0·1 + 0·0 = 3
Normalized: cos θ = (v1 · v2) / ( ||v1|| ||v2|| ), where ||v|| = √(v · v)
cos θ = 3 / (√6 √7) ≈ 0.4629.
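The cosine similarity arithmetic above can be checked with a few lines of code. This is a minimal sketch using the two document vectors from the slides; the helper names `dot` and `cosine_similarity` are assumptions, and the sketch does not guard against an all-zero vector, which would cause a division by zero.

```python
import math

# Document vectors over the 11-term reference vocabulary from the slide.
v1 = [0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0]
v2 = [1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0]

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (||u|| ||v||), with ||v|| = sqrt(v . v)."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

print(dot(v1, v2))                           # 3
print(round(cosine_similarity(v1, v2), 4))   # 0.4629
```

Because the vectors hold non-negative counts, the result always lies between 0 (no shared terms) and 1 (same direction).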


