

  1. Natural Language Processing Noah A. Smith School of Computer Science Carnegie Mellon University nasmith@cs.cmu.edu

  2. This Talk: 1. A light discussion of micro-analysis (“micro”: words, sentences, documents); 2. Some recent developments in macro-analysis (“macro”), specifically linking text and non-text

  3. Text Mining/Information Retrieval vs. Natural Language Processing • Typical representation of a document’s text content: bag of words • Great for ranking documents given a query, and for classifying documents, and a number of other applications. • Engineering: Will (some) applications work better if we “square” the data and opt for more abstract representations of text? • “Turning unstructured data into structured data” • Cf. Claire’s talk yesterday!
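
  To make the bag-of-words representation concrete, here is a minimal sketch using scikit-learn's CountVectorizer (assumed available); the two toy documents are invented for illustration.

      from sklearn.feature_extraction.text import CountVectorizer

      # two invented toy documents
      docs = [
          "the market fell sharply today",
          "the senator criticized the administration",
      ]

      vectorizer = CountVectorizer()
      X = vectorizer.fit_transform(docs)         # document-term count matrix
      print(vectorizer.get_feature_names_out())  # the vocabulary; word order is discarded
      print(X.toarray())                         # one row of word counts per document

  Each document becomes a vector of word counts, which is exactly the "unstructured" representation that ranking and classification models consume.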

  4. Beyond Engineering • This isn’t just about how to build a better piece of software. • Old AI philosophers: Is a bag of words really “understanding”? • Computational linguistics: computational models for theorizing about the phenomenon of human language • Very hard problem: representing what a piece of text “means” • Long-running debate: linguistic theories and NLP • Less controversial: linguistic representations in NLP • Is there a parallel for computational social science?

  5. Deeper, More Abstract, Structured Representations (“Jan’s Continuum”) • Stemming, lemmatization, and (for other languages) morphological disambiguation • Syntactic disambiguation (parts of speech, phrase-structure parsing, dependency parsing) • Word sense disambiguation, semantic roles, predicate-argument structures • Named entity recognition, coreference resolution • Opinion, sentiment, and subjectivity analysis; discourse relations and parsing
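
  A sketch of what several of these layers look like in practice, using spaCy (assumed installed, with the en_core_web_sm model); exact labels vary by model version.

      import spacy

      nlp = spacy.load("en_core_web_sm")
      doc = nlp("David Smith heads trading in all non-U.S. stocks.")

      for token in doc:
          # lemma (cf. lemmatization), part-of-speech tag, and dependency relation to its head
          print(token.text, token.lemma_, token.pos_, token.dep_, token.head.text)

      for ent in doc.ents:
          # named entity recognition
          print(ent.text, ent.label_)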

  6. Phrase-Structure Syntax [slide shows a full Penn Treebank-style phrase-structure parse tree for:] "This is not the sort of market to have a big position in," said David Smith, who heads trading in all non-U.S. stocks.
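
  An abbreviated bracketing for the beginning of the example sentence, rendered with NLTK (assumed installed); this is a simplified sketch, not the full parse shown on the slide.

      from nltk import Tree

      # simplified constituency structure for "This is not the sort of market"
      t = Tree.fromstring(
          "(S (NP (DT This)) (VP (VBZ is) (RB not) "
          "(NP (NP (DT the) (NN sort)) (PP (IN of) (NP (NN market))))))"
      )
      t.pretty_print()  # draws the tree as ASCII art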

  7. Dependency Syntax (http://www.ark.cs.cmu.edu/TurboParser) [slide shows a dependency parse of the same sentence:] "This is not the sort of market to have a big position in," said David Smith, who heads trading in all non-U.S. stocks.
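
  One common way to store a dependency analysis is as head-pointer arcs with relation labels. The arcs below are an illustrative sketch for a fragment of the sentence, not TurboParser's actual output format.

      sentence = ["This", "is", "not", "the", "sort", "of", "market"]
      # (token_index, head_index, relation); head_index 0 denotes the root
      arcs = [
          (1, 2, "nsubj"),   # This   <- is
          (2, 0, "root"),    # is     <- ROOT
          (3, 2, "neg"),     # not    <- is
          (4, 5, "det"),     # the    <- sort
          (5, 2, "attr"),    # sort   <- is
          (6, 5, "prep"),    # of     <- sort
          (7, 6, "pobj"),    # market <- of
      ]
      for idx, head, rel in arcs:
          head_word = "ROOT" if head == 0 else sentence[head - 1]
          print(f"{sentence[idx - 1]:<8} --{rel}--> {head_word}")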

  8. Frame-Semantic Structure (http://www.ark.cs.cmu.edu/SEMAFOR) [slide shows a frame-semantic analysis of "The professor chuckled with unabashed glee", with frames such as Make_noise, Emotion_directed, and People_by_vocation, and roles such as Experiencer, State, Sound_source, and Internal_cause]
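
  One way such an analysis might be represented as data; the frame and role names follow FrameNet conventions, but the assignments here are approximate and this is not SEMAFOR's output format.

      analysis = {
          "sentence": "The professor chuckled with unabashed glee",
          "frames": [
              {
                  "frame": "Make_noise",
                  "target": "chuckled",
                  "roles": {"Sound_source": "The professor",
                            "Internal_cause": "with unabashed glee"},
              },
              {
                  "frame": "Emotion_directed",
                  "target": "glee",
                  "roles": {"Experiencer": "The professor"},
              },
          ],
      }
      for f in analysis["frames"]:
          print(f["frame"], "evoked by", repr(f["target"]), "->", f["roles"])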

  9. A Bit of History • Many of these are “old” problems. • Twenty years ago we started using text data to solve them: 1. Pay experts to annotate examples. 2. Use a combination of human ingenuity and statistics to build systems and/or do science. • Sound familiar? + Reusable data, clearly identified evaluation tasks, benchmarks → progress - Domain specificity, overcommitting to representations, nearsightedness

  10. Comments on “Domain” • Huge challenge: models built for one domain don’t work well in others. • What’s a “domain”? • Huge effort expended on news and, later, biomedical articles. • Lesson: NLP people will follow the annotated datasets.

  11. The Next Paradigm? • If we could do with less or no annotated data, maybe we could be more “agile” about text domains and representations. 1. Pay experts to annotate examples. 2. Use a combination of human ingenuity and statistics to build systems and/or do science. • Small, noisy datasets; active, semisupervised, and unsupervised learning • NB: this is where we do Bayesian statistics! • Can we get information about language from other data? (See part 2.)
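
  As a minimal illustration of learning structure from text with no annotation, here is a topic-model sketch using scikit-learn's LatentDirichletAllocation (assumed available); the toy documents are invented and this is not a specific model from the talk.

      from sklearn.decomposition import LatentDirichletAllocation
      from sklearn.feature_extraction.text import CountVectorizer

      docs = [
          "the senate passed the budget bill",
          "the president vetoed the spending bill",
          "stocks fell as markets reacted to the report",
          "traders sold shares after the earnings report",
      ]
      X = CountVectorizer(stop_words="english").fit_transform(docs)
      lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
      print(lda.transform(X))  # per-document topic proportions, learned without labels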

  12. Current Work (Yano, Resnik, and Smith, 2010) • Pilot study: can untrained annotators consistently annotate political bias? • What clues give away political bias in sentences from political blogs? (See Monroe, Colaresi, and Quinn, 2008) • We could probably define “political bias” under Jan’s subjectivity umbrella! • Sample of sentences from six political blogs (2008; not a uniform sample) • Amazon Mechanical Turk judgments of bias (buzzword: crowdsourcing) • Survey of basic political views of annotators

  13. Example 1 Feminism has much to answer for denigrating men and encouraging women to seek independence whatever the cost to their families.

  14. Example 2 They died because of the Bush administration’s hubris.

  15. Example 3 In any event, for those of us who have followed this White House carefully during the entirety of the war, the charge is frankly absurd.

  16. Other Observations • Bias is hard to label without context, but not always impossible. • Pairwise kappa of 0.5 to 0.55 (decent for “moderately difficult” tasks) • Annotators: more social liberals, more fiscal conservatives • Liberal blogs are liberal, conservative blogs are conservative • Liberals are quick to see conservative bias; conservatives see both!
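
  For reference, pairwise agreement of this kind can be computed with Cohen's kappa; a sketch using scikit-learn (assumed available), with invented labels.

      from sklearn.metrics import cohen_kappa_score

      # hypothetical sentence-level judgments from two annotators
      annotator_a = ["bias", "none", "bias", "bias", "none", "none"]
      annotator_b = ["bias", "none", "none", "bias", "none", "bias"]
      print(cohen_kappa_score(annotator_a, annotator_b))  # chance-corrected agreement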

  17. Lexical Indicators of Bias
      Overall                  Liberal                  Conservative            Not Sure Which
      bad             0.60     Administration  0.28     illegal        0.40     pass      0.32
      personally      0.56     Americans       0.24     Obama’s        0.38     bad       0.32
      illegal         0.53     woman           0.24     corruption     0.32     sure      0.28
      woman           0.52     single          0.24     rich           0.28     blame     0.28
      single          0.52     personally      0.24     stop           0.26     they’re   0.24
      rich            0.52     lobbyists       0.23     tax            0.25     happen    0.24
      corruption      0.52     Republican      0.22     claimed        0.25     doubt     0.24
      Administration  0.52     union           0.20     human          0.24     doing     0.24
      Americans       0.51     torture         0.20     doesn’t        0.24     death     0.24
      conservative    0.50     rich            0.20     difficult      0.24     actually  0.24
      doubt           0.48     interests       0.20     Democrats      0.24     exactly   0.22
      torture         0.47     doing           0.20     less           0.23     wrong     0.22
      Most strongly biased words, ranked by relative frequency of receiving a bias mark, normalized by total frequency.
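
  A sketch of the ranking described in the caption: each word's score is the number of times it received a bias mark divided by its total frequency in the sample. The counts below are invented.

      from collections import Counter

      total_counts = Counter()  # word -> total occurrences in the sample
      bias_counts = Counter()   # word -> occurrences marked as biased by annotators

      # toy data standing in for the annotated sentences
      annotated = [
          ("the administration claimed victory", {"administration", "claimed"}),
          ("the administration passed the bill", set()),
      ]
      for sentence, marked in annotated:
          for word in sentence.split():
              total_counts[word] += 1
              if word in marked:
                  bias_counts[word] += 1

      scores = {w: bias_counts[w] / total_counts[w] for w in total_counts}
      for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
          print(f"{word:<16} {score:.2f}")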


  20. Current Work (Eisenstein, Yano, Smith, Cohen, and Xing, in progress) • Goal: find mentions of specific “persons of interest” in noisy multi-author text • Starting point: a list of names • Cf. noun phrase coreference resolution • Deal with misspellings, variations, titles, even sobriquets like “Obamessiah” • Exploit local and document context • Predominantly unsupervised approach (supervision is the list of names)
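
  As a baseline for the kind of surface variation this task involves, here is a simple edit-similarity sketch using Python's standard library; it is not the model described above, and sobriquets like "Obamessiah" show why string similarity alone is not enough and document context matters.

      from difflib import SequenceMatcher

      persons_of_interest = ["Barack Obama", "David Smith"]   # the supervision: a list of names
      mentions = ["Barak Obama", "Mr. D. Smith", "Obamessiah"]

      def similarity(a: str, b: str) -> float:
          return SequenceMatcher(None, a.lower(), b.lower()).ratio()

      for mention in mentions:
          best = max(persons_of_interest, key=lambda name: similarity(mention, name))
          print(f"{mention!r} -> {best!r} (score {similarity(mention, best):.2f})")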

  21. NLP Success Stories ? Search engines ✓ Translation (Google) ✓ Information extraction (text → databases) ✓ Question answering (recently: IBM’s Watson) ✓ Opinion mining

  22. Opinion: NLP, the Refrigerator of the Information Age • It enables all kinds of more exciting activities. • It does something you can’t do very well (cf. dishwashers). • When it’s working for you, it’s always quietly humming in the background. • You don’t think about it too often, and instructions never mention it. • When it stops working, you will notice. • Though expertise is required, there is very little glamor in manufacturing, maintaining, or improving it.

  23. Note to John: These may be superficial. 2. Adventures in Macro-Analysis: Linking Text to Other Data [slide links text to other data: social behavior; “data describing government activity in all parts of the policymaking process”; economic, financial, political]
