Language and Computers: Classifying Documents

Language and Computers
Classifying Documents
Based on Dickinson, Brew, & Meurers (2013)
◮ Introduction
◮ Language Identification
◮ Machine Learning
◮ Supervised Learning
◮ Unsupervised Learning
◮ Features & Evidence
◮ Measuring success
◮ Document classifiers
◮ Authorship Attribution
◮ Author Identification
◮ Stylometry
◮ Lexical Markers
◮ Lexical Markers: Function Words
◮ Plagiarism Detection
◮ What is plagiarism?
◮ Plagiarism Detection
◮ References

Document classification
Document classification = sort documents into user-defined classes
◮ e.g., email sent to the New York Times could be classified into letters to the editor, new subscription requests, complaints about undelivered papers, job inquiries, proposals to buy ad pages, and others
Consider the case of sentiment analysis:
◮ automate the detection of positive and negative statements in documents
◮ would allow one to track opinions about policies, products, & positions

Sentiment Analysis: Example #1
For the movie Pearl Harbor:
Ridiculous movie. Worst movie I’ve seen in my entire life [Koen D. on metacritic]

Sentiment Analysis: Example #2
One of my favorite movies. It’s a bit on the lengthy side, sure. But its made up of a really great cast which, for me, just brings it all together. [Erica H., again on metacritic]

Sentiment Analysis: Example #3
The Japanese sneak attack on Pearl Harbor that brought the United States into World War II has inspired a splendid movie, full of vivid performances and unforgettable scenes, a movie that uses the coming of war as a backdrop for individual stories of love, ambition, heroism and betrayal. The name of that movie is “From Here to Eternity.” (First lines of A. O. Scott’s review of “Pearl Harbor”, New York Times, May 25, 2001)

Sentiment Analysis: Example #4
The film is not as painful as a blow to the head, but it will cost you up to $10, and it takes three hours. The first hour and forty-five minutes establishes one of the most banal love triangles ever put to film. Childhood friends Rafe McCawley and Danny Walker (Ben Affleck and Josh Hartnett) both find themselves in love with the same woman, Evelyn Johnson (Kate Beckinsale). [Heather Feher, from www.filmstew.com]

Some document classification tasks
◮ Sentiment analysis: what is the attitude of the text?
◮ Authorship attribution: who wrote a text?
  ◮ Author Identification (who penned The Federalist Papers?)
  ◮ Forensic Evidence (who wrote the note?)
  ◮ Plagiarism Detection (who did the work?)
◮ Spam filtering: is this email junk or not?
◮ Language identification: which language is this document in?

Language identification
Let’s consider this relatively simple task first . . .
◮ Can sometimes tell the language by
  ◮ which characters are used,
    ◮ e.g., Liebe Grüße uses ü and ß → German
  ◮ which character encoding is being used
    ◮ e.g., ISO 8859-8 is used to encode Hebrew characters → text is written in Hebrew
◮ But how can you tell if you are reading English vs. Japanese transliterated into the Roman alphabet? Or Swedish vs. Norwegian?
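The character cue from this slide can be sketched in a few lines of Python. This is only an illustration: the `CHARACTER_HINTS` table and the `languages_hinted` function are hypothetical names, and the table holds just the two characters the slide mentions. A character is a hint, not proof, since other languages may use the same letters.

```python
# Hypothetical hint table (an assumption for illustration, not from the slides):
# each character is a cue for a language, not proof of it.
CHARACTER_HINTS = {"ü": "German", "ß": "German"}


def languages_hinted(text):
    """Return the set of languages suggested by characters found in text."""
    return {CHARACTER_HINTS[ch] for ch in text if ch in CHARACTER_HINTS}


print(languages_hinted("Liebe Grüße"))  # {'German'}
print(languages_hinted("Best wishes"))  # set()
```

The second call returns an empty set, which is the situation the last bullet raises: when no distinctive character appears, another kind of evidence is needed.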

Language identification: N-grams
◮ One simple technique for identifying languages is to use n-grams = stretch of n tokens (i.e., letters or words):
  ◮ Go through texts for which we know which language they are written in and store the n-grams of letters found, for a certain n.
  ◮ e.g., extracting the trigrams (3-grams) for the last sentence we’d get: "Go ", "o t", " th", "thr", "hro", "rou", . . .
◮ This provides us with an indication of what sequences of letters are possible in a given language (and how frequently they occur).
  ◮ e.g., thr is not a likely Japanese string.
◮ How do we make this more concrete?
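A minimal Python sketch of the trigram extraction described on this slide. The function name `char_ngrams` is an assumption made for illustration. Spaces are treated as ordinary characters, which is how the slide's example includes trigrams that span a word boundary.

```python
def char_ngrams(text, n=3):
    """Return every run of n consecutive characters in text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# First six trigrams of the opening words of the slide's example sentence.
print(char_ngrams("Go through texts")[:6])
# ['Go ', 'o t', ' th', 'thr', 'hro', 'rou']
```

A text shorter than n characters yields an empty list, so no special case is needed for very short inputs.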

Language identification: Frequency distributions
◮ Store a frequency distribution of trigrams, i.e., how many times each n-gram appears for a given language.

  n-gram   English   Japanese
  aba         12        54
  ace         95        10
  act         45         1
  arc          8         0
  . . .     . . .     . . .

◮ Now, apply the frequency distribution to a new text and use it to help calculate the probability of the text being a particular language.
  ◮ Compare each n-gram to see if it is more likely to be English or Japanese.
  ◮ See which language won the most comparisons.
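The compare-and-vote procedure on this slide could look roughly like the Python sketch below. Everything named here (`train_profile`, `identify`, the toy training sentences) is hypothetical. One assumption goes beyond the slide: counts are divided by the total number of trigrams seen for each language, so that a language with more training text does not win comparisons on size alone.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return every run of n consecutive characters in text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train_profile(texts, n=3):
    """Count how often each character n-gram occurs in the known-language texts."""
    counts = Counter()
    for text in texts:
        counts.update(char_ngrams(text.lower(), n))
    return counts


def identify(text, profiles, n=3):
    """Let each n-gram of text vote for the language where it is relatively most frequent."""
    totals = {lang: sum(profile.values()) for lang, profile in profiles.items()}
    wins = Counter()
    for ngram in char_ngrams(text.lower(), n):
        # Assumption: compare relative frequencies rather than raw counts.
        scores = {
            lang: (profile[ngram] / totals[lang] if totals[lang] else 0.0)
            for lang, profile in profiles.items()
        }
        best = max(scores.values())
        winners = [lang for lang, score in scores.items() if score == best]
        if best > 0 and len(winners) == 1:  # unseen n-grams and ties cast no vote
            wins[winners[0]] += 1
    return wins.most_common(1)[0][0] if wins else None


# Toy training data, invented for illustration only.
profiles = {
    "English": train_profile(["the three brothers went through the forest"]),
    "Japanese": train_profile(["watashi wa nihongo o hanashimasu", "arigatou gozaimasu"]),
}
print(identify("through the throat", profiles))  # English
print(identify("arigatou", profiles))            # Japanese
```

With profiles this small, most trigrams in a new text will be unseen and cast no vote; realistic profiles would be built from far more text.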

Machine Learning
Document classification is an example of a computer science activity called machine learning, which is itself part of the subfield of artificial intelligence
◮ We have access to a training set of examples, from which we will learn
  ◮ e.g., articles from the on-line version of last month’s New York Times
◮ Long-term goal: use what we have learned in order to build a robust system that can process future examples of the same kind as in the training set
  ◮ e.g., articles that are going to appear in next month’s New York Times
◮ As an approximation, we use a separate test set of examples to stand in for the unavailable future ones
  ◮ e.g., this month’s New York Times articles
◮ Since the test set is separate from the training set, the system will not have seen its examples.
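The separation of training and test sets can be illustrated with a deliberately simple Python sketch. The labelled examples are invented, and the "learner" is a majority-class baseline chosen only because it is short; it is not a method from the slides. Accuracy (the fraction of test examples labelled correctly) is used here as one plausible way to score the result.

```python
from collections import Counter


def train_majority_baseline(training_set):
    """'Learn' the most common label in the training set and return a classifier."""
    label_counts = Counter(label for _, label in training_set)
    most_common_label = label_counts.most_common(1)[0][0]
    return lambda text: most_common_label


def accuracy(classify, test_set):
    """Fraction of (text, label) pairs in test_set that classify gets right."""
    correct = sum(1 for text, label in test_set if classify(text) == label)
    return correct / len(test_set)


# Invented examples: last month's mail for training, this month's for testing.
training_set = [
    ("reply to your op-ed", "letter"),
    ("please start my subscription", "subscription"),
    ("response to Sunday's column", "letter"),
]
test_set = [
    ("my paper never arrived", "complaint"),
    ("a comment on your editorial", "letter"),
]

classifier = train_majority_baseline(training_set)  # sees only the training set
print(accuracy(classifier, test_set))               # 0.5
```

The classifier is built without looking at `test_set`, so the score estimates how it would do on mail it has not seen.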
