J.Bonilla | PhD Defense |12/08/2016 Slide #22
Process & Methodology (Quantitative)
News (NYT, WSJ, FT)
    ProQuest Newsstand
    Search filter on: NYT, WSJ, FT; "analytics"; 2004-2015 → 8102 articles

Sampled Corpus
    Lexicon richness & complexity: descriptive statistics on # of words, types of words, and # of sentences
    Document similarities: cosine distance
    Corpus evaluation of readability, complexity, and lexical diversity
    Random sample with 33% stratification → 2352 articles

Text pre-processing
    Stop words: syntax vs semantic words
    Stemming: words in their root form
    DTM: document-term matrix representation of the corpus
    Sparsity: handling of zeros in the DTM

Text analytics & Natural Language Processing
    Frequency analysis
    Thematic seasonality analysis
    Hierarchical clustering
    Probabilistic topic models
    Statistical associations
    Words in context
    Named Entity Recognition

Features and Entities
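
As a rough illustration of the pre-processing and similarity steps named on this slide, the sketch below removes stop words, builds a small document-term matrix (DTM), measures its sparsity, and computes cosine distance between documents. It is a minimal sketch in plain Python, not the pipeline used in the study: the stop-word list, the three toy documents, and the helper names (tokenize, build_dtm, sparsity, cosine_distance) are all assumptions made for the example, and stemming is left out.

```python
import math
from collections import Counter

# Illustrative stop-word list (assumption); real lists are much longer.
STOP_WORDS = {"the", "a", "of", "in", "and", "is", "for", "to"}

def tokenize(text):
    """Lowercase, strip surrounding punctuation, drop stop words."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    return [w for w in words if w and w not in STOP_WORDS]

def build_dtm(docs):
    """Return (vocabulary, matrix) with one row of term counts per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[t] for t in vocab] for c in counts]

def sparsity(dtm):
    """Fraction of DTM cells that are zero."""
    cells = sum(len(row) for row in dtm)
    zeros = sum(v == 0 for row in dtm for v in row)
    return zeros / cells

def cosine_distance(u, v):
    """1 - cosine similarity; 1.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 if norm == 0 else 1.0 - dot / norm

# Toy documents (hypothetical, not drawn from the news corpus).
docs = [
    "Analytics is the future of business.",
    "Business analytics and big data.",
    "The weather in Paris is mild.",
]
vocab, dtm = build_dtm(docs)
print(vocab)
print(round(sparsity(dtm), 3))
print(round(cosine_distance(dtm[0], dtm[1]), 3))
print(round(cosine_distance(dtm[0], dtm[2]), 3))
```

The first two toy documents share terms and so sit closer together than either does to the third, which shares none. On a corpus of thousands of articles most DTM cells are zero, which is why sparsity handling appears as its own pre-processing step.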