SLIDE 1

Language

13

SLIDE 2

13 Language
13.1 Linguistics
13.2 Grammar
13.3 Syntactic analysis
13.4 Processing
13.5 Practical systems∗

SLIDE 3

Linguistics

Natural language understanding (NLU) or natural language processing (NLP) (computational linguistics, psycholinguistics) concerns the interactions between computers and human natural languages
– extracting meaningful information from natural language input
– producing natural language output

SLIDE 4

A brief history of NLU

1940–60s  Foundational Insights
          automaton, McCulloch-Pitts neuron
          probabilistic or information-theoretic models
          formal language theory (Chomsky, 1956)
1957–70   The Two Camps
          symbolic and stochastic (parsing algorithms)
          Bayesian method (text recognition)
          the first on-line corpora (Brown corpus of English)
1970–83   Four Paradigms
          stochastic paradigm: Hidden Markov Model
          logic-based paradigm: Prolog (Definite Clause Grammars)
          natural language understanding: SHRDLU (Winograd, 1972)
          discourse modeling paradigm: speech acts, BDI
1983–93   Empiricism and Finite State Models Redux

SLIDE 5

A brief history of NLU

1994–99   The Field Comes Together
          probabilistic and data-driven models
2000–07   The Rise of Machine Learning
          big data (spoken and written)
          statistical learning
          resurgence of probabilistic and decision-theoretic methods
2008–     Deep Learning
          high-performance computing
          NLP as recognition
Ref Grosz et al. (1986), Readings in Natural Language Processing

SLIDE 6

Communication

“Classical” view (pre-1953): language consists of sentences that are true/false (cf. logic)
“Modern” view (post-1953): language is a form of action
  Wittgenstein (1953), Philosophical Investigations
  Austin (1962), How to Do Things with Words
  Searle (1969), Speech Acts

SLIDE 7

Speech acts

[Diagram: Speaker —utterance→ Hearer, within a shared Situation]

Speech acts achieve the speaker’s goals:
  Inform       “There’s a pit in front of you”
  Query        “Can you see the gold?”
  Command      “Pick it up”
  Promise      “I’ll share the gold with you”
  Acknowledge  “OK”
Speech act planning requires knowledge of
– Situation
– Semantic and syntactic conventions
– Hearer’s goals, knowledge base, and rationality

SLIDE 8

Stages in communication (informing)

Intention        S wants to inform H that P
Generation       S selects words W to express P in context C
Synthesis        S utters words W
Perception       H perceives W′ in context C′
Analysis         H infers possible meanings P1, . . . , Pn
Disambiguation   H infers intended meaning Pi
Incorporation    H incorporates Pi into KB
How could this go wrong?

SLIDE 9

Stages in communication (informing)

Intention        S wants to inform H that P
Generation       S selects words W to express P in context C
Synthesis        S utters words W
Perception       H perceives W′ in context C′
Analysis         H infers possible meanings P1, . . . , Pn
Disambiguation   H infers intended meaning Pi
Incorporation    H incorporates Pi into KB
How could this go wrong?
– Insincerity (S doesn’t believe P)
– Speech wreck ignition failure
– Ambiguous utterance
– Differing understanding of current context (C ≠ C′)

SLIDE 10

Knowledge representation in language

Engaging in complex language behavior requires various kinds of knowledge of language

  • Phonetics and phonology: the linguistic sounds
  • Morphology: the meaningful components of words
  • Syntax: the structural relationships between words
  • Semantics: meaning
  • Pragmatics: the relationship of meaning to the goals and intentions of the speaker
  • Discourse: the linguistic units larger than a single utterance

and

  • World knowledge: common knowledge, commonsense knowledge

– language cannot be understood without the everyday knowledge that all speakers share about the world

SLIDE 11

Grammar

Vervet monkeys, antelopes etc. use isolated symbols for sentences
⇒ restricted set of communicable propositions, no generative capacity
Chomsky (1957): Syntactic Structures
Grammar specifies the compositional structure of complex messages
  e.g., speech (linear), text (linear), music (two-dimensional)
A formal language is a set of strings of terminal symbols
Each string in the language can be analyzed/generated by the grammar

SLIDE 12

Grammar

The grammar is a set of rewrite rules, e.g.,
  S → NP VP
  Article → the | a | an | . . .
Here S is the sentence symbol, NP and VP are nonterminals

SLIDE 13

Grammar types

Regular: nonterminal → terminal [nonterminal]
  S → aS
  S → Λ
Context-free: nonterminal → anything
  S → aSb
Context-sensitive: more nonterminals on right-hand side
  ASB → AAaBB
Recursively enumerable: no constraints
Related to Post systems and Kleene systems of rewrite rules
Natural languages probably context-free, parsable in real time

SLIDE 14

Wumpus lexicon

Noun → stench | breeze | glitter | nothing | wumpus | pit | pits | gold | east | . . .
Verb → is | see | smell | shoot | feel | stinks | go | grab | carry | kill | turn | . . .
Adjective → right | left | east | south | back | smelly | . . .
Adverb → here | there | nearby | ahead | right | left | east | south | back | . . .
Pronoun → me | you | I | it | . . .
Name → John | Mary | Beijing | UCB | PKU | . . .
Article → the | a | an | . . .
Preposition → to | in | on | near | . . .
Conjunction → and | or | but | . . .
Digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

SLIDE 15

Wumpus lexicon

Noun → stench | breeze | glitter | nothing | wumpus | pit | pits | gold | east | . . .
Verb → is | see | smell | shoot | feel | stinks | go | grab | carry | kill | turn | . . .
Adjective → right | left | east | south | back | smelly | . . .
Adverb → here | there | nearby | ahead | right | left | east | south | back | . . .
Pronoun → me | you | I | it | S/HE | Y’ALL | . . .
Name → John | Mary | Boston | UCB | PAJC | . . .
Article → the | a | an | . . .
Preposition → to | in | on | near | . . .
Conjunction → and | or | but | . . .
Digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

SLIDE 16

Wumpus grammar

S → NP VP                  I + feel a breeze
  | S Conjunction S        I feel a breeze + and + I smell a wumpus
NP → Pronoun               I
   | Noun                  pits
   | Article Noun          the + wumpus
   | Digit Digit           3 4
   | NP PP                 the wumpus + to the east
   | NP RelClause          the wumpus + that is smelly
VP → Verb                  stinks
   | VP NP                 feel + a breeze
   | VP Adjective          is + smelly
   | VP PP                 turn + to the east
   | VP Adverb             go + ahead
PP → Preposition NP        to + the east
RelClause → that VP        that + is smelly
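
As an aside (not on the original slide), a fragment of this grammar can be written down and parsed with NLTK, assuming the nltk package is available; the parser recovers the tree that the next slides build up step by step:

import nltk

# A small fragment of the Wumpus grammar above (simplified for illustration)
wumpus_grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Pronoun | Article Noun
VP -> Verb | VP NP
Pronoun -> 'I'
Verb -> 'shoot'
Article -> 'the'
Noun -> 'wumpus'
""")

parser = nltk.ChartParser(wumpus_grammar)
for tree in parser.parse("I shoot the wumpus".split()):
    tree.pretty_print()   # draws the parse tree for "I shoot the wumpus"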

SLIDE 17

Grammaticality judgements

Formal language L1 may differ from natural language L2

[Venn diagram: L1 and L2 overlap — strings in L1 but not in L2 are false positives, strings in L2 but not in L1 are false negatives]

Adjusting L1 to agree with L2 is a learning problem
  * the gold grab the wumpus
  * I smell the wumpus the gold
  I give the wumpus the gold
Intersubjective agreement reliable, independent of semantics
Real grammars 10–500 pages, insufficient even for “proper” English

SLIDE 18

Syntactic analysis

Exhibit the grammatical structure of a sentence

I shoot the wumpus

SLIDE 19

Parse trees

Exhibit the grammatical structure of a sentence

I         shoot   the       wumpus
Pronoun   Verb    Article   Noun

SLIDE 20

Parse trees

Exhibit the grammatical structure of a sentence

[NP [Pronoun I]]  [VP [Verb shoot]]  [NP [Article the] [Noun wumpus]]

SLIDE 21

Parse trees

Exhibit the grammatical structure of a sentence

[NP [Pronoun I]]  [VP [VP [Verb shoot]] [NP [Article the] [Noun wumpus]]]

SLIDE 22

Parse trees

Exhibit the grammatical structure of a sentence

[S [NP [Pronoun I]] [VP [VP [Verb shoot]] [NP [Article the] [Noun wumpus]]]]

SLIDE 23

Parsing

Bottom-up: replacing any substring that matches RHS of a rule with the rule’s LHS

function BottomUpParse(words, grammar) returns a parse tree
  forest ← words
  loop do
    if Length(forest) = 1 and Category(forest[1]) = Start(grammar) then
       return forest[1]
    else
       i ← choose from {1 . . . Length(forest)}
       rule ← choose from Rules(grammar)
       n ← Length(Rule-RHS(rule))
       subsequence ← Subsequence(forest, i, i+n-1)
       if Match(subsequence, Rule-RHS(rule)) then
          forest[i . . . i+n-1] ← [Make-Node(Rule-LHS(rule), subsequence)]
       else fail
  end

SLIDE 24

Context-free parsing

Efficient algorithms (e.g., chart parsing): O(n³) for context-free grammars, run at several thousand words/sec for real grammars
Context-free parsing ≡ Boolean matrix multiplication
⇒ unlikely to find faster practical algorithms
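
To make the chart/dynamic-programming idea concrete, here is a minimal CYK-style recognizer (my own sketch, not from the slides; the toy grammar and the unit-rule handling are assumptions):

from collections import defaultdict

binary_rules = {                # A -> B C, keyed by (B, C)
    ("NP", "VP"): "S",
    ("Article", "Noun"): "NP",
    ("Verb", "NP"): "VP",
}
lexical_rules = {               # A -> word
    "I": "Pronoun", "shoot": "Verb", "the": "Article", "wumpus": "Noun",
}
unary_rules = {"Pronoun": "NP", "Verb": "VP"}   # unit rules, applied once

def cyk_recognize(words):
    n = len(words)
    chart = defaultdict(set)    # chart[(i, j)] = categories spanning words[i:j]
    for i, w in enumerate(words):
        cat = lexical_rules[w]
        chart[(i, i + 1)].add(cat)
        if cat in unary_rules:
            chart[(i, i + 1)].add(unary_rules[cat])
    for length in range(2, n + 1):          # widen spans bottom-up: O(n^3) overall
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):       # split point
                for b in chart[(i, k)]:
                    for c in chart[(k, j)]:
                        if (b, c) in binary_rules:
                            a = binary_rules[(b, c)]
                            chart[(i, j)].add(a)
                            if a in unary_rules:
                                chart[(i, j)].add(unary_rules[a])
    return "S" in chart[(0, n)]

print(cyk_recognize("I shoot the wumpus".split()))   # True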

SLIDE 25

Logical grammars

BNF notation for grammars too restrictive:
– difficult to add “side conditions” (number agreement, etc.)
– difficult to connect syntax to semantics
Idea: express grammar rules as logic
  X → Y Z      becomes   Y(s1) ∧ Z(s2) ⇒ X(Append(s1, s2))
  X → word     becomes   X([“word”])
  X → Y | Z    becomes   Y(s) ⇒ X(s)    Z(s) ⇒ X(s)
Here, X(s) means that string s can be interpreted as an X

SLIDE 26

Logical grammars

It’s easy to augment the rules
  NP(s1) ∧ EatsBreakfast(Ref(s1)) ∧ VP(s2) ⇒ NP(Append(s1, [“who”], s2))
  NP(s1) ∧ Number(s1, n) ∧ VP(s2) ∧ Number(s2, n) ⇒ S(Append(s1, s2))
Parsing is reduced to logical inference:
  Ask(KB, S([“I” “am” “a” “wumpus”]))
Can add extra arguments to return the parse structure, semantics
– semantic interpretations

SLIDE 27

Logical grammars

Generation simply requires a query with uninstantiated variables:
  Ask(KB, S(x))
If we add arguments to nonterminals to construct sentence semantics, natural language generation can be done from a given logical sentence:
  Ask(KB, S(x, At(Robot, [1, 1])))
Montague grammar

  • R. Montague, English as a Formal Language, 1970

(Formal Philosophy, 1974)

  • I. Heim, A. Kratzer, Semantics in Generative Grammar, 1998
  • C. Potts, Logic of Conventional Implicatures, 2005

– Chomsky: Minimalist Program
– Discourse Representation Theory
– Situation Semantics/Situation Theory
– Game-theoretic Semantics

SLIDE 28

Probabilistic grammar

Probabilistic context-free grammar (PCFG): the grammar assigns a probability to every string
  VP → Verb [0.70] | VP NP [0.30]
With probability 0.70 a verb phrase consists solely of a verb, and with probability 0.30 it is a VP followed by an NP
Also assign a probability to every word (lexicon)
Chart parsers: to avoid the inefficiency of repeated parsing, every time we analyze a substring, store the result so we won’t have to reanalyze it later
  such a bottom-up (PCFG) version is called the chart parsing algorithm
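
As an illustration (my own sketch; every probability except the VP rules above is invented), the probability of one derivation is the product of the probabilities of the rules it applies:

# Toy PCFG fragment (hypothetical numbers apart from the VP rules above)
rule_prob = {
    ("S", ("NP", "VP")): 1.00,
    ("NP", ("Pronoun",)): 0.40,
    ("NP", ("Article", "Noun")): 0.60,
    ("VP", ("Verb",)): 0.70,
    ("VP", ("VP", "NP")): 0.30,
    ("Pronoun", ("I",)): 0.20,
    ("Verb", ("shoot",)): 0.05,
    ("Article", ("the",)): 0.50,
    ("Noun", ("wumpus",)): 0.10,
}

# Derivation of "I shoot the wumpus" as a list of rule applications
derivation = [
    ("S", ("NP", "VP")),
    ("NP", ("Pronoun",)), ("Pronoun", ("I",)),
    ("VP", ("VP", "NP")),
    ("VP", ("Verb",)), ("Verb", ("shoot",)),
    ("NP", ("Article", "Noun")), ("Article", ("the",)), ("Noun", ("wumpus",)),
]

p = 1.0
for rule in derivation:
    p *= rule_prob[rule]
print(p)   # probability of this parse tree under the toy PCFG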

SLIDE 29

Syntax in NLP

Most view syntactic structure as an essential step towards meaning:
  “Mary hit John” ≠ “John hit Mary”
“And since I was not informed—as a matter of fact, since I did not know that there were excess funds until we, ourselves, in that checkup after the whole thing blew up, and that was, if you’ll remember, that was the incident in which the attorney general came to me and told me that he had seen a memo that indicated that there were no more funds.”

SLIDE 30

Syntax in NLP

Most view syntactic structure as an essential step towards meaning:
  “Mary hit John” ≠ “John hit Mary”
“And since I was not informed—as a matter of fact, since I did not know that there were excess funds until we, ourselves, in that checkup after the whole thing blew up, and that was, if you’ll remember, that was the incident in which the attorney general came to me and told me that he had seen a memo that indicated that there were no more funds.”
“Wouldn’t the sentence ’I want to put a hyphen between the words Fish and And and And and Chips in my Fish-And-Chips sign’ have been clearer if quotation marks had been placed before Fish, and between Fish and and, and and and And, and And and and, and and and And, and And and and, and and and Chips, as well as after Chips?”

SLIDE 31

Problems

Real human languages provide many problems for NLP

  • ambiguity
  • anaphora
  • indexicality
  • vagueness
  • discourse structure
  • metonymy
  • metaphor
  • noncompositionality etc.

SLIDE 32

Ambiguity

Ambiguity at all levels

  • Lexical

“You held your breath and the door for me”

  • Syntactic

“Put the book in the box on the table”
  [the book] in the box   vs.   [the book in the box]

  • Semantic: sentence can have more than one meaning

“Alice wants a dog like Bob’s”

  • Pragmatic

“Alice: Do you know who’s going to the party? Bob: Who?”

SLIDE 33

Understanding

Levels of understanding

  • 1. Keyword processing: limited knowledge of particular words or phrases
       e.g., chatbots, information retrieval, Web searching
  • 2. Limited linguistic ability: appropriate response to simple, highly constrained sentences
       e.g., database queries in NL, simple NL interfaces
  • 3. Full text comprehension: multi-sentence text and its relation to the real world
       e.g., conversational dialogue, automatic knowledge acquisition
  • 4. Emotional understanding/generation
       e.g., responding to literature, poetry, story narration

SLIDE 34

Understanding

Why is understanding hard?
– Ambiguity: mapping is one-to-many
– Richer structures than strings: often hierarchical or scope-bearing
– Strong expressiveness: mapping from surface form to meaning is many-to-one
Debate: empiricism vs. rationalism
  Gold showed that it is not possible to reliably learn a correct context-free grammar
  Chomsky argued that there must be an innate universal grammar that all children have from birth
  Horning showed that it is possible to learn a probabilistic context-free grammar (by PAC algorithms)

SLIDE 35

Understanding

Goal: a scientific theory of communication by language

  • To understand the structure of language and its use as a complex computational system
  • To develop the data structures and algorithms that can implement that system

Long way to go

SLIDE 36

Processing

  • Probabilistic models of language
  • Text classification
  • Information retrieval
  • Information extraction

Ref Bird, Steven, Edward Loper and Ewan Klein (2009), Natural Language Processing with Python, O’Reilly Media Inc. (Python has good string-handling functionality, besides LISP)

SLIDE 37

Probabilistic models of language

Define a natural language (approximate) model as a probability distribution over sentences and possible meanings
A corpus is a body of text
N-gram (letter or word units) model P(c1:N): probability distribution over n-letter (or n-word) sequences, defined as a Markov chain of order n − 1
  Say, a trigram (3-gram) model: P(ci | c1:i−1) = P(ci | ci−2:i−1)
In a language with 100 characters, the distribution has a million entries, and can be accurately estimated by counting character sequences in a corpus with 10 million characters
With a vocabulary of 10^5 words, there are 10^15 trigram probabilities to estimate
  e.g., books.google.com/ngram
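
A minimal sketch of estimating such a model by counting (my own illustration on a toy corpus; a real model needs millions of characters and smoothing):

from collections import Counter

def train_trigram_model(text):
    """Character trigram model: P(c_i | c_{i-2:i-1}) by maximum-likelihood counting."""
    trigrams = Counter(text[i:i+3] for i in range(len(text) - 2))
    bigrams = Counter(text[i:i+2] for i in range(len(text) - 1))
    def prob(c, context):       # context = the two preceding characters
        if bigrams[context] == 0:
            return 0.0
        return trigrams[context + c] / bigrams[context]
    return prob

prob = train_trigram_model("the wumpus is in the pit the wumpus is smelly")
print(prob("e", "th"))          # estimate of P('e' | 'th')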

SLIDE 38

Text classification

Text classification (categorization): given a text of some kind, decide which of a predefined set of classes it belongs to
  e.g., language identification, spam detection etc.
Language identification: determine which natural language a text is written in
  (similarly: spelling correction, genre classification, named-entity recognition etc.)
– N-gram models are well suited to the task of language identification, with small n ≥ 3
– the estimates need smoothing/validation so that sequences unseen in the training corpora do not get zero probability
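
A sketch of language identification along these lines (invented miniature “corpora”; a real system trains character n-gram models on large corpora per language and smooths them properly):

import math
from collections import Counter

def trigram_counts(text):
    return Counter(text[i:i+3] for i in range(len(text) - 2))

def log_score(text, counts):
    """Add-one-smoothed log-probability of text's character trigrams."""
    total = sum(counts.values())
    vocab = len(counts) + 1
    return sum(math.log((counts[text[i:i+3]] + 1) / (total + vocab))
               for i in range(len(text) - 2))

models = {"en": trigram_counts("the wumpus is in the pit and the gold is near"),
          "de": trigram_counts("der wumpus ist in der grube und das gold ist nah")}

test = "the gold is in the pit"
print(max(models, key=lambda lang: log_score(test, models[lang])))   # -> en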

SLIDE 39

Information retrieval

Information retrieval (IR): find documents that are relevant to a query
  E.g., a search engine
An IR system consists of:

  • 1. A corpus of documents
  • 2. Queries posed in a query language
  • 3. A result set
  • 4. A presentation of the result set

IR techniques have moved from the Boolean keyword model to statistical models

SLIDE 40

IR evaluation

Precision: the proportion of documents in the result set that are actually relevant
Recall: the proportion of all the relevant documents in the collection that are in the result set
In a large document collection, such as the Web, recall is difficult to compute
There are only some refinement techniques to improve the performance of a search engine
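
Worked example (numbers invented for illustration): if a query returns 40 documents of which 30 are relevant, and the collection contains 60 relevant documents in total, then precision = 30/40 = 0.75 and recall = 30/60 = 0.50.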

SLIDE 41

PageRank algorithm

PageRank is a link analysis algorithm based on the Web graph (pages as nodes and hyperlinks as edges), taking the rank value to indicate the importance of a particular page; it is defined recursively and depends on the number of all pages that link to it (in-links)
A page p that is linked to by many pages with high PageRank receives a high rank itself

  PR(pi) = (1 − d)/N + d · ( Σ_{pj links to pi} PR(pj)/C(pj) + Σ_{pj has no out-link} PR(pj)/N )

PR(pi) — the PageRank value of page pi
N — the total number of pages in the corpus
pj — the pages that link in to pi
C(pj) — the count of the total number of out-links on page pj
d — a damping factor (random surfer)

SLIDE 42

PageRank algorithm

function PageRank(G, k) returns PageRank values pg
  inputs: G, an in-link file; k, the iteration number
  persistent: N, the number of pages from G
              ho, hi, out-link/in-link count hashes from G, respectively
              d, a damping factor, initially 0.85

  d ← 0.85;  ho, hi, N ← G
  for all p in the graph do pg[p] ← 1/N
  while k > 0 do
      opg ← pg;  dp ← 0
      for all p that have no out-links do dp ← dp + d × opg[p]/N
      for all p in the graph do
          npg[p] ← dp + (1 − d)/N
          for all pi in hi(p) do npg[p] ← npg[p] + d × opg[pi]/ho[pi]
      pg ← npg
      k ← k − 1
  return pg
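
A compact Python sketch of the same iteration (my own rendering; it assumes the graph is given as an adjacency dict rather than an in-link file):

def pagerank(links, k=20, d=0.85):
    """links: dict page -> list of pages it links to; returns dict page -> rank."""
    pages = list(links)
    n = len(pages)
    pg = {p: 1.0 / n for p in pages}
    for _ in range(k):
        # Rank mass from dangling pages (no out-links) is spread evenly
        dangling = d * sum(pg[p] for p in pages if not links[p]) / n
        npg = {p: (1 - d) / n + dangling for p in pages}
        for p in pages:
            for q in links[p]:
                npg[q] += d * pg[p] / len(links[p])
        pg = npg
    return pg

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}   # hypothetical 4-page web
print(pagerank(graph))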

SLIDE 43

Question answering

Question answering (QA): answering a question, not with a ranked list of documents but rather with a short response (a sentence or phrase)
A QA program may use either a pre-structured database or a collection of natural language documents (a text corpus such as the Web)
Question types: fact, list, definition, how, why, hypothetical, semantically constrained, and cross-lingual questions
AskMSR: Web-based QA system (2002)
Watson: IBM’s DeepQA
– In 2011, competed on the quiz show Jeopardy
– had access to 200 million pages of structured and unstructured content consuming four terabytes of disk storage, including the full text of Wikipedia, but was not connected to the Internet during the game

SLIDE 44

Information extraction

Information extraction: acquiring knowledge by skimming a text and looking for occurrences of a particular class of object and for relationships among objects
Also known as information filtering: removing redundant information to manage information overload
  E.g., extracting addresses from the Web
Some methods: finite-state automata (regular expressions), probabilistic models, machine learning (deep learning), machine reading, ontology

SLIDE 45

Ontology extraction

Ontology extraction from large corpora, say, for a KG (knowledge graph)
– all types of domains, not just one specific domain
– dominated by precision, not recall
– the results gathered from multiple sources
E.g., general templates for categories
  NP such as NP (, NP)∗ (,)? ((and | or) NP)?
To learn templates from a few examples, then use the templates to learn more examples, from which more templates can be learned
Wordnet: dictionary of about 100,000 words and phrases
  – parts of speech, semantic relations (synonym, antonym)
Penn Treebank: parse trees for a 3-million-word corpus (English)
The British National Corpus: 100 million words
Web: trillions of words
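
A rough illustration of such a template as a regular expression (my own sketch; a capitalized word stands in for NP here, where a real system would use a parser’s noun-phrase chunks):

import re

# "NP such as NP, NP, ... and/or NP" — crude approximation of the template above
pattern = re.compile(
    r"([A-Z][a-z]+) such as ((?:[A-Z][a-z]+, )*[A-Z][a-z]+(?: (?:and|or) [A-Z][a-z]+)?)")

text = "Diseases such as Measles, Mumps and Rubella are covered by the vaccine."
for category, members in pattern.findall(text):
    print(category, "->", re.split(r", | and | or ", members))
# Diseases -> ['Measles', 'Mumps', 'Rubella']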

SLIDE 46

Machine reading

Machine reading: extracting by reading on its own, building up its own database with no human input
E.g., TextRunner
– using syntactic templates extracted from the Penn Treebank
Watson, say, reading one million medical documents within one hour

SLIDE 47

Recommender system

Recommender system: predict the “rating” or “preference” that a user would give to an item
  E.g., movies, music, news, books, articles, search queries, social tags, products and online dating etc.
Collaborative filtering
– building a model from users’ past behavior (say, ratings)
– using the model to predict items (say, ratings for items) that the user may have an interest in
Machine learning models: matrix computation, probabilistic models etc.
Netflix (2006–2009) and other prizes for RS competitions

SLIDE 48

Practical systems

  • Machine translation
  • Speech recognition
  • Conversational agent

SLIDE 49

Machine translation

MT: automatic translation of text from one natural language (the source) to another (the target)
Try to translate a passage of a page in a browser with Google Translate from the source (Chinese) into the target (English), and then translate back from English to Chinese
  What can you find?
A translator (human or machine) requires in-depth understanding of the bilingual text
A representation language that makes all the distinctions necessary for a set of languages is called an interlingua
– creating a complete knowledge representation of everything
– parsing into that representation
– generating sentences from that representation

SLIDE 50

Machine translation

NMT (Neural MT): end-to-end (deep) learning approach to MT
– regards MT as a sequence-to-sequence prediction task, without using any information from standard MT systems
– designs two deep neural networks ⇒ viewing MT as recognition
– – an encoder: to learn continuous representations of source-language sentences
– – a decoder: to generate the target-language sentence from the source sentence representation
Currently the best MT systems, better than conventional or statistical phrase-based systems

SLIDE 51

Speech recognition

Speech recognition: identify a sequence of words uttered by a speaker, given the acoustic signal
  “It’s not easy to wreck a nice beach” (recognize speech)
Speech signals are noisy, variable, ambiguous
Since the mid-1970s, speech recognition has been formulated as probabilistic inference
What is the most likely word sequence, given the speech signal?
  I.e., choose Words to maximize P(Words | signal)

SLIDE 52

Speech recognition

Use Bayes’ rule:
  P(Words | signal) = α P(signal | Words) P(Words)
I.e., decomposes into acoustic model + language model
Words are the hidden state sequence, signal is the observation sequence

SLIDE 53

Phones

All human speech is composed from 40–50 phones, determined by the configuration of articulators (lips, teeth, tongue, vocal cords, air flow)
Form an intermediate level of hidden states between words and signal
  ⇒ acoustic model = pronunciation model + phone model
ARPAbet designed for American English
  [iy] beat     [b] bet      [p] pet
  [ih] bit      [ch] Chet    [r] rat
  [ey] bet      [d] debt     [s] set
  [ao] bought   [hh] hat     [th] thick
  [ow] boat     [hv] high    [dh] that
  [er] Bert     [l] let      [w] wet
  [ix] roses    [ng] sing    [en] button
  . . .         . . .        . . .

SLIDE 54

Speech sounds

Raw signal is the microphone displacement as a function of time; processed into overlapping 30ms frames, each described by features

[Figure: analog acoustic signal → sampled, quantized digital signal → frames with feature vectors, e.g., 10 15 38 | 52 47 82 | 22 63 24 | 89 94 11 | 10 12 73]

Frame features are typically formants—peaks in the power spectrum

SLIDE 55

Phone models

Frame features in P(features | phone) summarized by
– an integer in [0 . . . 255] (using vector quantization); or
– the parameters of a mixture of Gaussians
Three-state phones: each phone has three phases (Onset, Mid, End)
  E.g., [t] has silent Onset, explosive Mid, hissing End
  ⇒ P(features | phone, phase)
Triphone context: each phone becomes n² distinct phones, depending on the phones to its left and right
  E.g., [t] in “star” is written [t(s,aa)] (different from “tar”!)
Triphones useful for handling coarticulation effects: the articulators have inertia and cannot switch instantaneously between positions
  E.g., [t] in “eighth” has tongue against front teeth

SLIDE 56

Phone model example

[Figure: phone HMM for [m] with states Onset → Mid → End → FINAL; self-loops 0.3 (Onset), 0.9 (Mid), 0.4 (End); transitions Onset→Mid 0.7, Mid→End 0.1, End→FINAL 0.6
 Output probabilities — Onset: C1 0.5, C2 0.2, C3 0.3; Mid: C3 0.2, C4 0.7, C5 0.1; End: C4 0.1, C6 0.5, C7 0.4]

SLIDE 57

Word pronunciation models

Each word is described as a distribution over phone sequences
Distribution represented as an HMM transition model

[Figure: pronunciation HMM for “tomato” — [t], then [ow] (0.2) or [ah] (0.8), then [m], then [ey] (0.5) or [aa] (0.5), then [t] [ow]; remaining transitions 1.0]

P([towmeytow] | “tomato”) = P([towmaatow] | “tomato”) = 0.1
P([tahmeytow] | “tomato”) = P([tahmaatow] | “tomato”) = 0.4
Structure is created manually, transition probabilities learned from data

SLIDE 58

Isolated words

Phone models + word models fix likelihood P(e1:t | word) for isolated word
  P(word | e1:t) = α P(e1:t | word) P(word)
Prior probability P(word) obtained simply by counting word frequencies
P(e1:t | word) can be computed recursively:
  define ℓ1:t = P(Xt, e1:t)
  use the recursive update ℓ1:t+1 = Forward(ℓ1:t, et+1)
  and then P(e1:t | word) = Σxt ℓ1:t(xt)
Isolated-word dictation systems with training reach 95–99% accuracy
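
A small numeric sketch of that forward recursion (my own illustration; the 3-state model and all numbers are made up):

import numpy as np

T = np.array([[0.3, 0.7, 0.0],      # T[i, j] = P(X_{t+1} = j | X_t = i), states = Onset/Mid/End
              [0.0, 0.9, 0.1],
              [0.0, 0.0, 1.0]])
obs_prob = {"C1": np.array([0.5, 0.2, 0.1]),   # P(e_t | X_t) for two observation symbols
            "C4": np.array([0.1, 0.7, 0.5])}
prior = np.array([1.0, 0.0, 0.0])              # start in Onset

def likelihood(evidence):
    """P(e_{1:t} | word): forward algorithm, summing the final forward message."""
    l = prior * obs_prob[evidence[0]]          # l_{1:1}(x) = P(x_1) P(e_1 | x_1)
    for e in evidence[1:]:
        l = obs_prob[e] * (l @ T)              # l_{1:t+1} = Forward(l_{1:t}, e_{t+1})
    return l.sum()

print(likelihood(["C1", "C1", "C4", "C4"]))    # P(e_{1:4} | this toy word model)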

SLIDE 59

Continuous speech

Not just a sequence of isolated-word recognition problems
– Adjacent words highly correlated
– Sequence of most likely words ≠ most likely sequence of words
– Segmentation: there are few gaps in speech
– Cross-word coarticulation — e.g., “next thing”
Continuous speech systems manage 60–80% accuracy on a good day

SLIDE 60

Language model

Prior probability of a word sequence is given by the chain rule:
  P(w1 · · · wn) = ∏_{i=1}^{n} P(wi | w1 · · · wi−1)
Bigram model: P(wi | w1 · · · wi−1) ≈ P(wi | wi−1)
Train by counting all word pairs in a large text corpus
More sophisticated models (trigrams, grammars, etc.) help a little bit

SLIDE 61

Combined HMM

States of the combined language+word+phone model are labelled by the word we’re in + the phone in that word + the phone state in that phone
Viterbi algorithm finds the most likely phone state sequence
Does segmentation by considering all possible word sequences and boundaries
Doesn’t always give the most likely word sequence, because each word sequence is the sum over many state sequences
Jelinek invented A∗ search in 1969 as a way to find the most likely word sequence, where the “step cost” is − log P(wi | wi−1)
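
For contrast with the forward likelihood above, a generic Viterbi sketch (my own illustration, reusing the same made-up toy HMM) that recovers the most likely state sequence:

import numpy as np

def viterbi(prior, T, obs_seq, obs_prob):
    """Most likely hidden-state sequence of an HMM (toy model, assumed numbers)."""
    msg = prior * obs_prob[obs_seq[0]]          # best path probability ending in each state
    back = []                                   # backpointers
    for e in obs_seq[1:]:
        scores = msg[:, None] * T               # scores[i, j]: best path at i extended by i -> j
        back.append(scores.argmax(axis=0))
        msg = scores.max(axis=0) * obs_prob[e]
    state = int(msg.argmax())                   # backtrace from the best final state
    path = [state]
    for bp in reversed(back):
        state = int(bp[state])
        path.append(state)
    return list(reversed(path))

T = np.array([[0.3, 0.7, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]])
obs_prob = {"C1": np.array([0.5, 0.2, 0.1]), "C4": np.array([0.1, 0.7, 0.5])}
prior = np.array([1.0, 0.0, 0.0])
print(viterbi(prior, T, ["C1", "C1", "C4", "C4"], obs_prob))   # [0, 1, 1, 1] for this toy model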

SLIDE 62

DBNs for speech recognition

[Figure: DBN for speech recognition — each time slice has a phoneme index, phoneme and transition variables, articulator variables (tongue, lips), and an acoustic observation; some dependencies are deterministic and fixed, others stochastic and learned; an end-of-word observation has P(OBS | index = 2) = 1 and P(OBS | not 2) = 0; the example unrolls indices 1 1 1 2 2 . . . n n n over the phone sequence a a b b u u r r a u]

SLIDE 63

DNNs for speech recognition

DNN architecture for speech recognition:
  Speech signal → Feature extraction → Decoder (acoustic models + pronunciation dictionary + language models) → Recognized words
Deep belief networks: combined DNNs + DBNs
Currently the best practice, better than DBNs alone
– various speech assistants (Cortana/Siri/GAssistant/Alexa/FMessenger/· · ·)

SLIDE 64

Conversational agent

Conversational systems (conversational agents, dialogue systems, (chat)bots): converse with humans via natural language ⇐ Turing test
A dialogue system contains the following phases

  • 1. Input recognizer: converting the user’s input to plain text
       e.g., speech, gesture, handwriting etc.
  • 2. NLP units
  • 3. Dialog manager: analyzing the semantic information, keeping the history and state of the dialog and managing the general flow of the conversation
  • 4. Task managers: the dialog manager contacts one or more task managers that have knowledge of the specific task domain
  • 5. Output generator: producing output
       e.g., natural language generator, layout engine etc.
  • 6. Output renderer
       e.g., text-to-speech engine (TTS), talking head, robot, avatar etc.

SLIDE 65

Conversational software

Pipeline architecture for a spoken dialogue system

SLIDE 66

Context

  • Situational context

– specific setting, time and location

  • Discourse context

– local: the preceding sentence
– global: world knowledge
  e.g., representing mental states: beliefs, desires, intentions, and plans

SLIDE 67

ELIZA

The ELIZA program (Weizenbaum, 1966): imitates the responses of a Rogerian psychotherapist

User: You are like my father in some ways.
ELIZA: WHAT RESEMBLANCE DO YOU SEE?
User: You are not very aggressive but I think you don’t want me to notice that.
ELIZA: WHAT MAKES YOU THINK I AM NOT AGGRESSIVE?
User: You don’t argue with me.
ELIZA: WHY DO YOU THINK I DON’T ARGUE WITH YOU?
User: You are afraid of me.
ELIZA: DOES IT PLEASE YOU TO BELIEVE I AM AFRAID OF YOU?

SLIDE 68

ELIZA

Use of pattern-matching to process the input and translate it into suitable outputs

  Word    Rank   Pattern   Outputs
  alike   10     ?X        In what way?

ELIZA doesn’t actually need to know anything to mimic a Rogerian psychotherapist
So-called Eliza-style fooling
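
A toy sketch of this kind of pattern matching (my own illustration; the rules are made up, in the spirit of the table):

import re
import random

rules = [
    (r".*\byou are (.*)", ["WHAT MAKES YOU THINK I AM {0}?",
                           "DOES IT PLEASE YOU TO BELIEVE I AM {0}?"]),
    (r".*\blike\b.*",     ["IN WHAT WAY?"]),
    (r".*",               ["PLEASE GO ON."]),   # default fallback
]

def respond(utterance):
    for pattern, outputs in rules:
        m = re.match(pattern, utterance, re.IGNORECASE)
        if m:
            groups = [g.upper() for g in m.groups()]
            return random.choice(outputs).format(*groups)

print(respond("You are afraid of me"))   # e.g. WHAT MAKES YOU THINK I AM AFRAID OF ME?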

SLIDE 69

Dialogue

Try practical dialogue systems, say, Microsoft Xiao Bing (in Chinese)
⇒ say something?
How long can you keep the dialogue going?

SLIDE 70

The dream

Trend: deep learning is better than statistical learning both in machine translation and speech recognition, but not yet in conversation
Combines: language processing + machine learning (deep learning)
  Linguistics + Psycholinguistics + Knowledge Representation and Reasoning + Machine Learning + Information Science + Signal Processing
Learning: models of how children learn their language just from what they hear and observe
– apply machine learning to show how children can learn
– to map words in a sentence to real-world objects
– the relation between verbs and their arguments ⇐ Understanding??
The dream: “The linguistic computer”
  Human-like competence in language ⇐ strong AI
