L18 July 17, 2017 1 Lecture 18: Natural Language Processing II - PDF document

1 Lecture 18: Natural Language Processing II

CSCI 1360E: Foundations for Informatics and Analytics

1.1 Overview and Objectives

Last week, we introduced the concept of natural language processing, and in particular the "bag of words" model for representing and quantifying text for later analysis. In this lecture, we’ll expand on those topics, including some additional preprocessing and text representation methods. By the end of this lecture, you should be able to

• Implement several preprocessing techniques like stemming, stopwords, and minimum counts
• Understand the concept of feature vectors in natural language processing
• Compute inverse document frequencies to up- or down-weight term frequencies

1.2 Part 1: Feature Vectors

The "bag of words" model: why do we do it? What does it give us?

It’s a way of representing documents in a format that is convenient and amenable to sophisticated analysis.

You’re interested in blogs. Specifically, you’re interested in how blogs link to each other. Do politically-liberal blogs tend to link to politically-conservative blogs, and vice versa? Or do they mostly link to themselves?

Imagine you have a list of a few hundred blogs. To get their political leanings, you’d need to analyze the blogs and see how similar they are. To do that, you need some notion of similarity...

We need to be able to represent the blogs as feature vectors. If you can come up with a quantitative representation of your "thing" of interest, then you can compare it to other instances of that thing.

The bag-of-words model is just one way of turning a document into a feature vector that can be used in analysis. By considering each blog to be a single document, you can therefore convert each blog to its own bag-of-words and compare them directly.
(In fact, this has actually been done: http://waxy.org/2008/10/memeorandum_colors/)

If you have some data point x that is an n-dimensional vector (pictured: a three-dimensional vector), each dimension is a single feature. (Hint: what does this correspond to with NumPy arrays?)
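As one way to make the feature-vector idea concrete, here is a minimal sketch using two made-up toy documents (not the blogs discussed above). Each document becomes a list of word counts over a shared vocabulary, so each position in the list is one feature. The helper names `to_vector` and `cosine_similarity`, and the choice of cosine similarity as the "notion of similarity", are illustrative assumptions rather than the method used in this lecture.

```python
import math

# Two toy "documents" (hypothetical examples, not from the lecture data).
doc_a = "the cat sat on the mat"
doc_b = "the dog sat on the log"

# Shared vocabulary: every distinct word across both documents, sorted so
# that each word always maps to the same dimension.
vocabulary = sorted(set(doc_a.split()) | set(doc_b.split()))

def to_vector(document, vocabulary):
    """Return a list whose i-th entry counts vocabulary[i] in the document."""
    words = document.split()
    return [words.count(term) for term in vocabulary]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vec_a = to_vector(doc_a, vocabulary)
vec_b = to_vector(doc_b, vocabulary)

print(vocabulary)  # the word behind each dimension
print(vec_a)       # one count per dimension for the first document
print(vec_b)       # one count per dimension for the second document
print(cosine_similarity(vec_a, vec_b))
```

Passing either list to `numpy.array` would give a one-dimensional array with one entry per feature, which is the correspondence the hint is pointing at.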

[Figure: blogs]

[Figure: featurevector]

[Figure: wordvector]

Therefore, a bag-of-words model is just a way of representing a document as a vector, where the dimensions are word counts!

Pictured above are three separate documents, and the number of times each of the words appears is given by the height of the histogram bar. Stare at this until you get some understanding of what’s happening: these are three documents that share the same words (as you can see, they have the same x-axes), but what differs are the relative heights of the bars, meaning they have different counts for the words along the x-axes.

Of course there are other ways of representing documents as vectors, but bag-of-words is the easiest.

1.3 Part 2: Text Preprocessing

What is "preprocessing"? Name some preprocessing techniques with text we’ve covered!

• Lower case (or upper case) everything
• Split into single words
• Remove trailing whitespace (spaces, tabs, newlines)

There are a few more that can be very powerful.

To start, let’s go back to the Alice in Wonderland example from the previous lecture, but this time, we’ll add a few more books for comparison:

• Pride and Prejudice, by Jane Austen
• Frankenstein, by Mary Shelley

• Beowulf, by Lesslie Hall
• The Adventures of Sherlock Holmes, by Sir Arthur Conan Doyle
• The Adventures of Tom Sawyer, by Mark Twain
• The Adventures of Huckleberry Finn, by Mark Twain

Hopefully this variety should give us a good idea what we’re dealing with!

First, we’ll read all the books’ raw contents into a dictionary.

In [1]:
    books = {}  # We'll use a dictionary to store all the text from the books.
    files = ['Lecture18/alice.txt',
             'Lecture18/pride.txt',
             'Lecture18/frank.txt',
             'Lecture18/bwulf.txt',
             'Lecture18/holmes.txt',
             'Lecture18/tom.txt',
             'Lecture18/finn.txt']
    for f in files:
        # This weird line just takes the part of the filename between the "/" and "." as the dictionary key.
        prefix = f.split("/")[-1].split(".")[0]
        try:
            with open(f, "r", encoding = "ISO-8859-1") as descriptor:
                books[prefix] = descriptor.read()
        except OSError:
            # The file is missing or unreadable; record that so it is easy to spot.
            print("File '{}' had an error!".format(f))
            books[prefix] = None

In [2]:
    # Here you can see the dict keys (i.e. the results of the weird line of code in the last cell).
    print(books.keys())

dict_keys(['frank', 'alice', 'pride', 'finn', 'holmes', 'bwulf', 'tom'])

Just like before, let’s go ahead and lower case everything, strip out whitespace, then count all the words.

In [3]:
    def preprocess(book):
        # First, lowercase everything.
        lower = book.lower()

        # Second, split into lines.
        lines = lower.split("\n")

        # Third, split each line into words.
        words = []
        for line in lines:
            words.extend(line.strip().split(" "))

        # That's it! (count() is defined in the next cell.)
        return count(words)

In [4]:
    from collections import defaultdict  # Our good friend from the last lecture, defaultdict!

    def count(words):
        counts = defaultdict(int)
        for w in words:
            counts[w] += 1
        return counts

In [5]:
    counts = {}
    for k, v in books.items():
        counts[k] = preprocess(v)

Let’s see how our basic preprocessing techniques from the last lecture worked out.

In [6]:
    from collections import Counter

    def print_results(counts):
        for key, bag_of_words in counts.items():
            word_counts = Counter(bag_of_words)
            mc_word, mc_count = word_counts.most_common(1)[0]
            print("'{}' has {} unique words, and the most common is '{}', occurring {} times."
                  .format(key, len(bag_of_words.keys()), mc_word, mc_count))

    print_results(counts)

'frank' has 11702 unique words, and the most common is 'the', occurring 4327 times.
'alice' has 5582 unique words, and the most common is 'the', occurring 1777 times.
'pride' has 13128 unique words, and the most common is 'the', occurring 4479 times.
'finn' has 13839 unique words, and the most common is 'and', occurring 6109 times.
'holmes' has 14544 unique words, and the most common is 'the', occurring 5704 times.
'bwulf' has 11024 unique words, and the most common is '', occurring 3497 times.
'tom' has 13445 unique words, and the most common is 'the', occurring 3907 times.

Yeesh. Not only are the most common words among the most boring ("the"? "and"?), but there are occasions where the most common word isn’t even a word, but rather a blank space. (How do you think that could happen?)

1.3.1 Stop words

A great first step is to implement stop words. (I used this list of 319 stop words)

In [7]:
    with open("Lecture18/stopwords.txt", "r") as f:
        lines = f.read().split("\n")
        stopwords = [w.strip() for w in lines]
    print(stopwords[:5])

['a', 'about', 'above', 'across', 'after']
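As a small illustration of what stop word removal does to a list of tokens, here is a minimal sketch on a toy sentence. The tiny `toy_stopwords` set and the helper name `remove_stopwords` are assumptions made for this example; they stand in for the 319-word list loaded above. A set is used here because membership tests on a set do not need to scan every entry the way they do on a list.

```python
# A tiny, hand-picked stop word set (hypothetical; the lecture uses a
# 319-word list loaded from a file).
toy_stopwords = {"the", "a", "and", "of"}

def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not stop words, in their original order."""
    return [t for t in tokens if t not in stopwords]

tokens = "the cat and the hat of a king".split(" ")
filtered = remove_stopwords(tokens, toy_stopwords)
print(filtered)  # ['cat', 'hat', 'king']
```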

Now we’ll augment our preprocess function to include stop word processing.

In [8]:
    def preprocess_v2(book, stopwords):  # Note the "_v2"--this is a new function!
        # First, lowercase everything.
        lower = book.lower()

        # Second, split into lines.
        lines = lower.split("\n")

        # Third, split each line into words.
        words = []
        for line in lines:
            tokens = line.strip().split(" ")

            # Check for stopwords.
            for t in tokens:
                if t in stopwords:
                    continue  # This "continue" SKIPS the stopword entirely!
                words.append(t)

        # That's it!
        return count(words)

Now let’s see what we have!

In [9]:
    counts = {}
    for k, v in books.items():
        counts[k] = preprocess_v2(v, stopwords)

    print_results(counts)

'frank' has 11440 unique words, and the most common is '', occurring 3413 times.
'alice' has 5348 unique words, and the most common is '', occurring 1178 times.
'pride' has 12858 unique words, and the most common is '', occurring 2474 times.
'finn' has 13589 unique words, and the most common is '', occurring 2739 times.
'holmes' has 14278 unique words, and the most common is '', occurring 2750 times.
'bwulf' has 10781 unique words, and the most common is '', occurring 3497 times.
'tom' has 13176 unique words, and the most common is '', occurring 2284 times.

Well, this seems even worse! What could we try next?

1.3.2 Minimum length

Pretty straightforward: cut out all the words under a certain length; say, 2. After all, how many words do you know that are semantically super-important to a book, and yet are fewer than 2 letters long? (I’m sure you can think of a few; my point being, they’re so few they’re unlikely to matter much.)
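The minimum-length idea can be sketched as one more filtering step. This is only an illustration on a toy sentence: the function name `filter_min_length` and its `min_length` parameter are assumptions, not the lecture's own implementation.

```python
def filter_min_length(tokens, min_length=2):
    """Keep only tokens with at least min_length characters."""
    return [t for t in tokens if len(t) >= min_length]

# Splitting on a single space leaves empty strings wherever there are
# consecutive spaces; a length filter removes them along with 1-letter words.
tokens = "i  saw a  whale".split(" ")
print(tokens)                     # ['i', '', 'saw', 'a', '', 'whale']
print(filter_min_length(tokens))  # ['saw', 'whale']
```

Because the empty string has length 0, any minimum length of 1 or more also drops the blank "word" that topped the counts above.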
