

  1. Introduction to Text Mining
     Alliance Summer School 2019
     Elliott Ash

  2. Social Science meets Data Science
     ◮ We are seeing a revolution in social science:
       ◮ new datasets: administrative data, digitization of text archives, social media
       ◮ new methods: natural language processing, machine learning
     ◮ In particular:
       ◮ many important human behaviors consist of text, millions and millions of lines of it.
       ◮ we cannot read these texts ourselves; somehow we must teach machines to read them for us.

  3. Readings
     ◮ Google Developers Guide to Text Classification:
       https://developers.google.com/machine-learning/guides/text-classification/
     ◮ "Analyzing polarization in social media: Method and application to tweets on 21 mass shootings" (2019)
       ◮ Demszky, Garg, Voigt, Zou, Gentzkow, Shapiro, and Jurafsky
     ◮ Natural Language Processing in Python
     ◮ Hands-on Machine Learning with Scikit-learn & TensorFlow 2.0

  4. Programming
     ◮ Python is ideal for text data and machine learning.
       ◮ I recommend Anaconda 3.6: continuum.io/downloads
     ◮ For relatively small corpora, R is also fine:
       ◮ see the quanteda package.

  5. Text as Data
     ◮ Text data consists of sequences of characters called documents.
     ◮ The set of documents is the corpus.
     ◮ Text data is unstructured:
       ◮ the information we want is mixed together with (lots of) information we don't want.
       ◮ How do we separate the two?

  6. Dictionary Methods
     ◮ Dictionary methods use a pre-selected list of words or phrases to analyze a corpus.
     ◮ Corpus-specific dictionaries:
       ◮ count words related to your analysis
     ◮ General dictionaries:
       ◮ e.g. LIWC (liwc.wpengine.com) has lists of words across categories.
     ◮ Sentiment analysis: count sets of positive and negative words (doesn't work very well).
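     A minimal sketch of a corpus-specific dictionary count in Python; the word lists and documents below are made-up examples, not taken from LIWC or any published dictionary:

         # count how often dictionary terms appear in each document
         positive = {"good", "gain", "improve"}
         negative = {"bad", "loss", "worsen"}

         docs = ["gains improve the outlook", "a bad quarter with heavy losses"]

         for doc in docs:
             tokens = doc.lower().split()
             pos = sum(t in positive for t in tokens)
             neg = sum(t in negative for t in tokens)
             print(doc, "->", {"positive": pos, "negative": neg})

     In practice you would stem or lemmatize the tokens first, so that "gains" and "losses" match the dictionary entries "gain" and "loss".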

  7. Measuring uncertainty in the macroeconomy (Baker, Bloom, and Davis)
     ◮ Baker, Bloom, and Davis measure economic policy uncertainty using a Boolean search of newspaper articles (see http://www.policyuncertainty.com/).
     ◮ For each newspaper on each day since 1985, submit the following query:
       1. Article contains "uncertain" OR "uncertainty", AND
       2. Article contains "economic" OR "economy", AND
       3. Article contains "congress" OR "deficit" OR "federal reserve" OR "legislation" OR "regulation" OR "white house"
     ◮ Normalize the resulting article counts by the total number of newspaper articles that month.
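     A minimal sketch of this Boolean filter in Python; the term lists follow the slide, but the article text is invented, and the published index is built through newspaper archive search tools rather than raw text matching:

         # does an article satisfy all three conditions of the query?
         UNCERTAINTY = ["uncertain", "uncertainty"]
         ECONOMY = ["economic", "economy"]
         POLICY = ["congress", "deficit", "federal reserve",
                   "legislation", "regulation", "white house"]

         def is_epu_article(text):
             t = text.lower()
             return (any(w in t for w in UNCERTAINTY)
                     and any(w in t for w in ECONOMY)
                     and any(w in t for w in POLICY))

         article = "Economic uncertainty rose as Congress debated new regulation."
         print(is_epu_article(article))   # True

     Counts of matching articles per newspaper-month are then normalized by the total number of articles that month.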

  8. Measuring uncertainty in the macroeconomy (Baker, Bloom, and Davis) [figure]

  9. Goals of Featurization
     ◮ The goal: produce features that are
       ◮ predictive in the learning task
       ◮ interpretable by human investigators
       ◮ tractable enough to be easy to work with

  10. Pre-processing
     ◮ Standard pre-processing steps:
       ◮ drop capitalization, punctuation, numbers, and stopwords (e.g. "the", "such")
       ◮ stem words to strip their endings (e.g., "taxes" and "taxed" both become "tax")
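     A rough sketch of these steps using NLTK; it assumes the nltk package and its "stopwords" corpus are installed, and any tokenizer or stemmer would do:

         import re
         from nltk.corpus import stopwords               # requires nltk.download("stopwords")
         from nltk.stem.snowball import SnowballStemmer

         stop = set(stopwords.words("english"))
         stemmer = SnowballStemmer("english")

         def preprocess(doc):
             # lowercase, keep only letters, drop stopwords, stem the rest
             tokens = re.findall(r"[a-z]+", doc.lower())
             return [stemmer.stem(t) for t in tokens if t not in stop]

         print(preprocess("Such taxes were taxed at 35% by the state."))
         # -> ['tax', 'tax', 'state']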

  11. Parts of speech
     ◮ Part-of-speech (POS) tags provide useful word categories corresponding to words' functions in sentences:
       ◮ Content: noun (NN), verb (VB), adjective (JJ), adverb (RB)
       ◮ Function: determiner (DT), preposition (IN), conjunction (CC), pronoun (PR)
     ◮ Parts of speech vary in their informativeness for different tasks:
       ◮ For categorizing topics, nouns are usually most important.
       ◮ For sentiment, adjectives are usually most important.
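     For example, NLTK's Penn Treebank tagger assigns tags like these; it needs the "punkt" and "averaged_perceptron_tagger" resources, and spaCy would work just as well:

         import nltk  # assumes nltk.download("punkt") and nltk.download("averaged_perceptron_tagger")

         tokens = nltk.word_tokenize("The committee raised interest rates sharply.")
         tagged = nltk.pos_tag(tokens)
         print(tagged)
         # roughly: [('The', 'DT'), ('committee', 'NN'), ('raised', 'VBD'),
         #           ('interest', 'NN'), ('rates', 'NNS'), ('sharply', 'RB'), ('.', '.')]

         # keep only the nouns, e.g. for topic-oriented features
         nouns = [w for w, tag in tagged if tag.startswith("NN")]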

  12. N-grams
     ◮ N-grams are phrases: sequences of words up to length N.
       ◮ bigrams, trigrams, quadgrams, etc.
       ◮ capture information and familiarity from local word order
       ◮ e.g. "estate tax" vs "death tax"
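     A small example of extracting unigrams and bigrams with scikit-learn's CountVectorizer; the documents are illustrative:

         from sklearn.feature_extraction.text import CountVectorizer

         docs = ["repeal the estate tax", "they call it the death tax"]

         # ngram_range=(1, 2) keeps single words and two-word phrases
         vec = CountVectorizer(ngram_range=(1, 2))
         X = vec.fit_transform(docs)
         print(vec.get_feature_names_out())
         # the vocabulary now includes 'estate tax' and 'death tax' as separate features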

  13. Filtering the Vocabulary
     ◮ N-grams will blow up your feature space: filtering out uninformative n-grams is necessary.
     ◮ Google Developers recommend a vocabulary size of m = 20,000; I have gotten good performance from m = 2,000.
     1. Drop phrases that appear in few documents, or in almost all documents, using tf-idf weights:

          tf-idf(w) = (1 + log(c_w)) × log(N / d_w)

        ◮ c_w = count of phrase w in the corpus, N = number of documents, d_w = number of documents in which w appears.
     2. Filter on parts of speech (keep nouns, adjectives, and verbs).
     3. Filter on pointwise mutual information to get collocations (Ash, JITE 2017, p. 2).
     4. Supervised feature selection: select phrases that are predictive of the outcome.
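     A direct implementation of the filtering score above on a toy corpus; in practice you would compute it over your n-gram counts and keep the top-m phrases:

         import math
         from collections import Counter

         docs = [["estate", "tax", "repeal"], ["death", "tax", "repeal"], ["estate", "tax"]]
         N = len(docs)

         corpus_counts = Counter(w for doc in docs for w in doc)      # c_w
         doc_counts = Counter(w for doc in docs for w in set(doc))    # d_w

         def tf_idf(w):
             return (1 + math.log(corpus_counts[w])) * math.log(N / doc_counts[w])

         # "tax" appears in every document, so it scores 0; rarer phrases score higher
         print({w: round(tf_idf(w), 3) for w in corpus_counts})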

  14. A decent baseline for featurization
     ◮ Tag parts of speech; keep nouns, verbs, and adjectives.
     ◮ Drop stopwords, capitalization, and punctuation.
     ◮ Run the Snowball stemmer to drop word endings.
     ◮ Make bigrams from the tokens.
     ◮ Take the top 10,000 bigrams based on tf-idf weight.
     ◮ Represent documents as tf-idf frequencies over these bigrams.
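     One way this recipe might look with scikit-learn and NLTK, as a sketch rather than a faithful implementation: the POS-tagging step is omitted for brevity, and scikit-learn's max_features ranks features by corpus frequency rather than tf-idf weight:

         import re
         from nltk.stem.snowball import SnowballStemmer
         from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

         stemmer = SnowballStemmer("english")

         def clean(doc):
             # lowercase, strip punctuation and numbers, drop stopwords, stem each token
             tokens = re.findall(r"[a-z]+", doc.lower())
             return " ".join(stemmer.stem(t) for t in tokens if t not in ENGLISH_STOP_WORDS)

         docs = ["The estate tax was repealed.", "They taxed the estates heavily."]  # toy corpus

         vectorizer = TfidfVectorizer(
             preprocessor=clean,        # apply the cleaning function above
             ngram_range=(2, 2),        # bigrams only
             max_features=10_000,       # cap the vocabulary size
             sublinear_tf=True,         # 1 + log(tf) term weighting
         )
         X = vectorizer.fit_transform(docs)   # sparse document-by-bigram tf-idf matrix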

  15. Cosine Similarity

       cos_sim(v_1, v_2) = (v_1 · v_2) / (||v_1|| ||v_2||)

     where v_1 and v_2 are vectors representing documents (e.g., IDF-weighted frequencies).
     ◮ Each document is a non-negative vector in an m-dimensional space (m = size of the dictionary):
       ◮ closer vectors form smaller angles: cos(0) = +1 means identical documents.
       ◮ the furthest vectors are orthogonal: cos(π/2) = 0 means no words in common.
     ◮ For n documents, this gives n × (n − 1) pairwise similarities.
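     For instance, using scikit-learn on a small tf-idf matrix (any document vectors would work):

         from sklearn.feature_extraction.text import TfidfVectorizer
         from sklearn.metrics.pairwise import cosine_similarity

         docs = ["the estate tax was repealed",
                 "repeal of the estate tax",
                 "the committee raised interest rates"]

         X = TfidfVectorizer().fit_transform(docs)   # documents as tf-idf vectors
         sims = cosine_similarity(X)                 # n x n matrix of pairwise cosine similarities
         print(sims.round(2))
         # documents 0 and 1 share vocabulary, so sims[0, 1] is relatively high;
         # document 2 shares little with the others, so its similarities are low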

  16. Text analysis of patent innovation
     Kelly, Papanikolaou, Seru, and Taddy (2018), "Measuring technological innovation over the very long run"
     ◮ Data:
       ◮ 9 million patents since 1840, from the U.S. Patent Office and Google Scholar Patents.
       ◮ date, inventor, backward citations
       ◮ text (abstract, claims, and description)
     ◮ Text pre-processing:
       ◮ drop HTML markup, punctuation, numbers, capitalization, and stopwords.
       ◮ remove terms that appear in fewer than 20 patents.
       ◮ 1.6 million words in the vocabulary.

  17. Measuring Innovation (Kelly, Papanikolaou, Seru, and Taddy 2018)
     ◮ Backward-IDF weighting of word w in patent i:

          BIDF(w, i) = log( (# of patents prior to i) / (1 + # of patents prior to i that include w) )

       ◮ down-weights words that appeared frequently before the patent.
     ◮ For each patent i:
       ◮ compute the cosine similarity ρ_ij to all future patents j, using the BIDF weights of i.
       ◮ the 9m × 9m similarity matrix = 30 TB of data.
       ◮ enforce sparsity by setting similarities < .05 to zero (93.4% of pairs).
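     A minimal sketch of how such backward weights could be computed, assuming patents are already sorted by date and each is reduced to a set of vocabulary terms (toy data, not the authors' code):

         import math

         # patents in chronological order, each represented by its set of terms
         patents = [
             {"engine", "steam", "valve"},
             {"engine", "electric", "motor"},
             {"motor", "battery", "electric"},
             {"battery", "lithium", "anode"},
         ]

         def bidf(w, i):
             # log( # prior patents / (1 + # prior patents containing w) )
             prior = patents[:i]
             containing = sum(w in p for p in prior)
             return math.log(len(prior) / (1 + containing))

         # "electric" is common in the recent past, so it gets a low weight for patent 3;
         # "lithium" has never appeared before, so it gets a high weight
         print(bidf("electric", 3), bidf("lithium", 3))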

  18. Novelty, Impact, and Quality (Kelly, Papanikolaou, Seru, and Taddy 2018)
     ◮ "Novelty" is defined by dissimilarity (negative similarity) to previous patents:

          Novelty_j = − Σ_{i ∈ B(j)} ρ_ij

       where B(j) is the set of previous patents (e.g., in the last 20 years).
     ◮ "Impact" is defined as similarity to subsequent patents:

          Impact_i = Σ_{j ∈ F(i)} ρ_ij

       where F(i) is the set of future patents (e.g., in the next 100 years).
     ◮ A patent has high quality if it is both novel and impactful:

          log Quality_k = log Impact_k + log Novelty_k

  19. Validation (Kelly, Papanikolaou, Seru, and Taddy 2018)
     ◮ For pairs with higher ρ_ij, patent j is more likely to cite patent i.
     ◮ Within a technology class (assigned by the patent office), similarity is higher than across classes.
     ◮ Higher-quality patents get more cites: [figure]

  20. Most Innovative Firms (Kelly, Papanikolaou, Seru, and Taddy 2018) [figure]

  21. Breakthrough patents: citations vs quality (Kelly, Papanikolaou, Seru, and Taddy 2018) [figure]

  22. Breakthrough patents and firm profits (Kelly, Papanikolaou, Seru, and Taddy 2018) [figure]

  23. Topic Models in Social Science
     ◮ Topic models were developed in computer science and statistics:
       ◮ summarize unstructured text using the words within each document
       ◮ useful for dimension reduction
     ◮ Social scientists use topics as a form of measurement:
       ◮ how observed covariates drive trends in language
       ◮ tell a story not just about what, but how and why
       ◮ topic models are more interpretable than other methods, e.g. principal components analysis.

  24. Latent Dirichlet Allocation (LDA)
     ◮ Idea: documents exhibit each topic in some proportion.
       ◮ Each document is a distribution over topics.
       ◮ Each topic is a distribution over words.
     ◮ Latent Dirichlet Allocation (e.g. Blei 2012) is the most popular topic model in this vein because it is easy to use and (usually) gives good results.
     ◮ Maintained assumptions: bag of words/phrases; the number of topics is fixed ex ante.
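     A small illustration of fitting LDA with scikit-learn; the corpus and number of topics below are placeholders, and gensim or another implementation would work equally well:

         from sklearn.feature_extraction.text import CountVectorizer
         from sklearn.decomposition import LatentDirichletAllocation

         docs = ["inflation and interest rates rose",
                 "labor markets and employment improved",
                 "interest rate policy and inflation expectations",
                 "employment growth in labor markets"]

         counts = CountVectorizer(stop_words="english")
         X = counts.fit_transform(docs)            # bag-of-words document-term matrix

         K = 2                                     # number of topics, fixed ex ante
         lda = LatentDirichletAllocation(n_components=K, random_state=0)
         doc_topics = lda.fit_transform(X)         # each row: a document's distribution over topics

         # each topic is a distribution over words; show the top words per topic
         vocab = counts.get_feature_names_out()
         for k, weights in enumerate(lda.components_):
             top = [vocab[i] for i in weights.argsort()[::-1][:4]]
             print(f"topic {k}: {top}")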

  25. A statistical highlighter [figure]

  26. Topic modeling Federal Reserve Bank transcripts (Hansen, McMahon, and Prat, QJE 2017)
     ◮ Use LDA to analyze speech at the FOMC (Federal Open Market Committee):
       ◮ private discussions among committee members at the Federal Reserve (the U.S. central bank)
       ◮ transcripts: 150 meetings, 20 years, 26,000 speeches, 24,000 unique words.
     ◮ Pre-processing:
       ◮ drop stopwords, apply stemming, etc.
       ◮ drop words with low tf-idf weight.

  27. LDA Training (Hansen, McMahon, and Prat, QJE 2017)
     ◮ K = 40 topics, selected for interpretability / topic coherence.
       ◮ the "statistically optimal" choice was K = 70, but those topics were less interpretable.
     ◮ Hyperparameters α = 50/K and η = 0.025 to promote sparse word distributions (and more interpretable topics).
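     A hedged sketch of how these settings might be passed to gensim's LdaModel, where the alpha and eta arguments correspond to the α and η above; the tokenized documents are placeholders, not the FOMC transcripts:

         from gensim import corpora, models

         texts = [["inflation", "rate", "policy"], ["employment", "labor", "market"],
                  ["inflation", "expectations", "policy"], ["labor", "market", "growth"]]

         dictionary = corpora.Dictionary(texts)
         corpus = [dictionary.doc2bow(t) for t in texts]   # bag-of-words representation

         K = 40
         lda = models.LdaModel(
             corpus=corpus,
             id2word=dictionary,
             num_topics=K,       # K = 40 as on the slide (far too many for this toy corpus)
             alpha=50 / K,       # document-topic prior
             eta=0.025,          # topic-word prior, promotes sparse word distributions
             random_state=0,
         )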

  28. [figure]

  29. Pro-Cyclical Topics (Hansen, McMahon, and Prat, QJE 2017) [figure]

  30. Counter-Cyclical Topics (Hansen, McMahon, and Prat, QJE 2017) [figure]

  31. FOMC Topics and Policy Uncertainty (Hansen, McMahon, and Prat, QJE 2017) [figure]
