Natural Language from 20,000 feet
AI Class 28 (no reading)

Slides from Paula Matuszek and Mary-Angela Papalaskari, Villanova University, with thanks

Bookkeeping
• Review Session: Friday, 12/16, 6pm-8pm
• If you can’t make that time, see posted slides
• End of class: SEEQ

Logical Equivalence
• Some quick notes:
  • Two rules are equivalent iff they cover exactly the same cases in all universes.
  • A redundant rule can still be equivalent, although it is not ideal
    • “All red cards and all hearts are legal”
  • A rule that admits extra cases is NOT equivalent
    • “All red cards, all hearts, and all queens are legal”
• You will have to make some thresholded (or otherwise probabilistically inspired) guesses
  • In 200 plays you will NOT see everything

NL and NLP
• “Natural” languages = human languages
  • English, Russian, Wolof, …
• Natural Language Processing: any form of dealing with NL computationally
• Many, many sub-areas; from an AI perspective, two are most crucial:
  • Natural Language Understanding
  • Natural Language Generation
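The Logical Equivalence notes above can be checked mechanically. The following is a minimal sketch, assuming a standard 52-card deck as the only universe (the slide asks for agreement in all universes, so this is a weaker test) and assuming each rule is written as a Python predicate. The function names are illustrative, not part of the course material.

```python
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "diamonds", "clubs", "spades"]
DECK = list(product(RANKS, SUITS))  # all 52 (rank, suit) pairs


def is_red(card):
    return card[1] in ("hearts", "diamonds")


def rule_red(card):
    # "All red cards are legal"
    return is_red(card)


def rule_red_or_hearts(card):
    # "All red cards and all hearts are legal" (redundant: hearts are red)
    return is_red(card) or card[1] == "hearts"


def rule_red_hearts_or_queens(card):
    # "All red cards, all hearts, and all queens are legal" (adds black queens)
    return is_red(card) or card[1] == "hearts" or card[0] == "Q"


def equivalent(rule_a, rule_b, universe=DECK):
    # Equivalent on this universe iff the rules agree on every case in it.
    return all(rule_a(c) == rule_b(c) for c in universe)


def differing_cases(rule_a, rule_b, universe=DECK):
    # The cases on which the two rules disagree.
    return [c for c in universe if rule_a(c) != rule_b(c)]


print(equivalent(rule_red, rule_red_or_hearts))
print(differing_cases(rule_red, rule_red_hearts_or_queens))
```

The redundant rule agrees with the plain "red cards" rule on every card, while the rule with queens disagrees on the two black queens, which is exactly the "extra cases" situation.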

Natural Language Processing
• NLP is involved in many topics:
  • speech recognition
  • natural language understanding
  • computational linguistics
  • psycholinguistics
  • information extraction
  • information retrieval
  • inference
  • natural language generation
  • speech synthesis
  • language evolution

Applied NLP
• Machine translation
• Spelling/grammar correction
• Information Retrieval
• Data mining
• Document classification
• Question answering, conversational agents

You See It Daily
• Question answering: Siri, OK Google, Cortana
• Spelling/grammar correction
• Automated response systems
• To get input for
  • Information Retrieval
  • Data mining
  • Document classification
• Machine translation

Natural Language Understanding
[Figure: sound waves → acoustic/phonetic → morphological/syntactic → semantic/pragmatic → internal representation]

Natural Language Understanding
[Figure: the same pipeline from sound waves to internal representation, with the three stages labeled Sounds (acoustic/phonetic), Symbols (morphological/syntactic), and Sense (semantic/pragmatic)]

Which ones are words?
• “How to recognize speech”
  • not “to wreck a nice beach”
• “The cat scares all the birds away”
  • not “The cat’s cares are few”
- pauses in speech bear little relation to word breaks
+ intonation offers additional clues to meaning

Dissecting words/sentences
• “I saw the birds outside.” (Who’s outside?)
• “I saw the Golden Gate bridge flying into San Francisco.” (Who’s flying?)

What does it mean?
• “I saw Pathfinder on Mars with a telescope”
• “Pathfinder photographed Mars”
• “The Pathfinder photograph from Ford has arrived”
• “When a Pathfinder fords a river it sometimes mars its paint.”

What does it mean?
• “Jack went to the store. He found the milk in aisle 3. He paid for it and left.”
• “Q: Did you read the report? A: I read Bob’s email.”

Classic Steps in NLP
• Morphology: the way words are built up from sounds, phonemes, phones
• Syntax: how words are put together to form correct sentences and what structural role each word has
• Semantics: what words mean and how meanings combine in sentences to form sentence meanings
• Discourse and Pragmatics: how preceding text affects the interpretation of current text and how sentences are used in different situations

Human Languages
• You know ~50,000 words of your primary language, each with several meanings
• A six-year-old knows ~13,000 words
• In our first 16 years we learn about 1 word every 90 minutes of waking time
• Mental grammar generates sentences
  • virtually every sentence is novel!
• 3-year-olds already have 90% of grammar
• ~6,000 human languages, none of them simple!
Adapted from Martin Nowak (2000), Evolutionary biology of language, Phil. Trans. Royal Society London

Human Spoken Language
• Most complicated mechanical motion of the body
  • Movements must be accurate to within half a millimeter
  • and synchronized to within hundredths of a second
• We can understand up to 50 phonemes/sec (normal speech is 10-15 phonemes/sec)
  • but if a sound is repeated 20 times/sec we hear a continuous buzz!
• All aspects of language processing are involved and manage to keep pace
Adapted from Martin Nowak (2000), Evolutionary biology of language, Phil. Trans. Royal Society London

Let’s talk!
[Figure: homunculus model] This model shows what a man's body would look like if each part grew in proportion to the area of the cortex of the brain concerned with its movement.
The Natural History Museum (UK) picture library: http://piclib.nhm.ac.uk/piclib/www/comp.php?img=87493&frm=med&search=homunculus

Why Language is Hard
• NLP is AI-complete
• Abstract concepts are difficult to represent
• LOTS of possible relationships among concepts
• Many ways to represent similar concepts
• Tens or hundreds of thousands of features/dimensions

Why Language is Easy
• Highly redundant
• Relatively crude methods provide fairly good results
• Lots of subject matter experts!

Some of the Tools
• A mixed bag, at various levels...
  • Tokenizers
  • Regular Expressions and Finite State Automata
  • Part of Speech taggers
  • Grammars
  • Parsers
  • N-Grams
  • Semantic Analysis

What will it take?
• Models of computation (state machines)
• Formal grammars
• Knowledge representation
• Search algorithms
• Dynamic programming
• Logic
• Machine learning
• Probability theory
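Three of the tools listed above (tokenizers, regular expressions, and N-grams) fit in a few lines. The following is a minimal sketch, assuming English text and a deliberately simple token pattern; `TOKEN_PATTERN`, `tokenize`, and `ngrams` are illustrative names, and the sample text reuses the speech-segmentation sentences from earlier in the deck (with a straight apostrophe).

```python
import re
from collections import Counter

# A token is a word (optionally with an internal apostrophe, as in "cat's"),
# a run of digits, or a single punctuation character.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?|\d+|[^\sA-Za-z\d]")


def tokenize(text):
    # Lowercase first so "The" and "the" count as the same token.
    return TOKEN_PATTERN.findall(text.lower())


def ngrams(tokens, n):
    # Every window of n consecutive tokens, as tuples.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


text = "The cat scares all the birds away. The cat's cares are few."
tokens = tokenize(text)
bigram_counts = Counter(ngrams(tokens, 2))

print(tokens)
print(bigram_counts.most_common(3))
```

Counts like these are the raw material for an N-gram language model, which estimates how likely one token is to follow another.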

A Few Key Problems and Tools

Parts of Speech Tagging
• Part-of-Speech (POS) taggers identify nouns, verbs, adjectives, noun phrases, etc.
• More recent work uses machine learning to create taggers from labeled examples
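To make "taggers from labeled examples" concrete, here is a minimal sketch of the simplest learned tagger, a most-frequent-tag baseline: it assigns each word the tag it carried most often in training. This is a stand-in chosen for brevity, not the method the slides have in mind. The tiny `TRAINING` set, the tag names, and the `NOUN` fallback for unseen words are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical hand-labeled sentences: lists of (word, tag) pairs.
TRAINING = [
    [("the", "DET"), ("cat", "NOUN"), ("scares", "VERB"), ("the", "DET"), ("birds", "NOUN")],
    [("i", "PRON"), ("saw", "VERB"), ("the", "DET"), ("birds", "NOUN")],
    [("the", "DET"), ("saw", "NOUN"), ("cuts", "VERB"), ("wood", "NOUN")],
    [("i", "PRON"), ("saw", "VERB"), ("the", "DET"), ("cat", "NOUN")],
]


def train(sentences):
    # Count how often each word appears with each tag, then keep the winner.
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(words, model, default="NOUN"):
    # Unseen words fall back to the default tag (an assumption of this sketch).
    return [(w, model.get(w.lower(), default)) for w in words]


model = train(TRAINING)
print(tag(["I", "saw", "the", "cat"], model))
```

Because the baseline ignores context, it tags "saw" as a verb even in "the saw cuts wood"; taggers that look at neighboring words exist to fix exactly that kind of error.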

Named Entities (NE) Tagging
• Persons, places, companies
  • “Proper nouns”
• One of the most common information extraction tasks
• Combination of rules and dictionary
• Example rules:
  • Capitalized word not at beginning of sentence
  • Two capitalized words in a row
  • One or more capitalized words followed by Inc
• Dictionaries of common names, places, major corporations
  • Sometimes called a “gazetteer”

Reference Resolution
• Discourse Knowledge: what have we just said?
  • Paula is here. She is ready.
• Domain/World Knowledge
  • U: I would like to register in a CMSC course.
  • S: Which number?
  • U: 647.
  • S: Which section?
  • U: Which section is in the evening?
  • S: Section 1.
  • U: Then that one.
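The example rules on the Named Entities (NE) Tagging slide above can be sketched directly. This is a minimal illustration, assuming one sentence at a time, ASCII words, and a three-entry gazetteer invented for the example; `GAZETTEER`, `find_entities`, the "company"/"name" labels, and "Acme Widgets Inc" are all hypothetical.

```python
import re

# Hypothetical gazetteer: known names that count even at the start of a sentence.
GAZETTEER = {"Paula", "Villanova", "Mars"}


def is_capitalized(token):
    # One uppercase letter followed by lowercase letters ("I" alone does not match).
    return re.fullmatch(r"[A-Z][a-z]+", token) is not None


def find_entities(sentence):
    tokens = re.findall(r"[A-Za-z]+", sentence)
    entities = []
    i = 0
    while i < len(tokens):
        if not is_capitalized(tokens[i]):
            i += 1
            continue
        # Collect the maximal run of consecutive capitalized words.
        j = i
        while j < len(tokens) and is_capitalized(tokens[j]):
            j += 1
        run = tokens[i:j]
        if len(run) >= 2 and run[-1] == "Inc":
            # Rule: one or more capitalized words followed by Inc.
            entities.append((" ".join(run), "company"))
        elif len(run) >= 2 or i > 0 or run[0] in GAZETTEER:
            # Rules: two capitalized words in a row; a capitalized word not at
            # the beginning of the sentence; or a dictionary hit.
            entities.append((" ".join(run), "name"))
        i = j
    return entities


print(find_entities("She said Paula visited Acme Widgets Inc in Philadelphia"))
print(find_entities("Paula is here."))
```

Rules this crude misfire easily (a sentence that opens with "Yesterday Paula" would be tagged as one two-word name), which is why practical systems combine them with larger dictionaries or learned models.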

Word Sense Resolution
• Many words have several meanings or senses
• We need to resolve which of the senses of an ambiguous word is invoked in a particular use of the word
  • I made her duck. (meanings?)
• Again, discourse and world knowledge

Semantics
• What kinds of things can we not do well with the tools we have already looked at?
  • Retrieve information in response to unconstrained questions: e.g., travel planning
  • Accurate translations?
  • Play the “chooser” side of 20 Questions
  • Read a newspaper article and answer questions about it
• These tasks require that we also consider semantics: the meaning of our tokens and their sequences

Evaluation
• You should have gotten mail with a link from StudentCourseEvaluations@umbc.edu.
• Or, access via Blackboard and myUMBC.

The Student Evaluation of Educational Quality (SEEQ) is a standardized course evaluation instrument used to provide measures of an instructor’s teaching effectiveness. The results of this questionnaire will be used by promotion and tenure committees as part of the instructor’s evaluation.

The Direct Instructor Feedback Forms (DIFFs) were designed to provide feedback to instructors and they are not intended for use by promotion and tenure committees.

The responses to the SEEQ and the DIFFs will be kept confidential and will not be distributed until final grades are in.
