SLIDE 1

Natural ___ Processing

Noah A. Smith School of Computer Science Carnegie Mellon University nasmith@cs.cmu.edu

SLIDE 2

This Talk

  • 1. A light discussion of micro-analysis
  • 2. Some recent developments in macro-analysis, specifically linking text and non-text

(Diagram: “micro” = words, sentences; “macro” = documents)

SLIDE 3

Text Mining/Information Retrieval vs. Natural Language Processing

  • Typical representation of a document’s text content: bag of words
  • Great for ranking documents given a query, for classifying documents, and for a number of other applications.
  • Engineering: Will (some) applications work better if we “square” the data and opt for more abstract representations of text?
  • “Turning unstructured data into structured data”
  • Cf. Claire’s talk yesterday!
SLIDE 4

Beyond Engineering

  • This isn’t just about how to build a better piece of software.
  • Old AI philosophers: Is a bag of words really “understanding”?
  • Computational linguistics: computational models for theorizing about the phenomenon of human language
  • Very hard problem: representing what a piece of text “means”
  • Long-running debate: linguistic theories and NLP
  • Less controversial: linguistic representations in NLP
  • Is there a parallel for computational social science?
SLIDE 5

Deeper, More Abstract, Structured Representations

  • Stemming, lemmatization, and (for other languages) morphological disambiguation
  • Syntactic disambiguation (parts of speech, phrase-structure parsing, dependency parsing)
  • Word sense disambiguation, semantic roles, predicate-argument structures
  • Named entity recognition, coreference resolution
  • Opinion, sentiment, and subjectivity analysis, discourse relations and parsing

Jan’s Continuum

SLIDE 6

Phrase-Structure Syntax

" This is not the sort of market to have a big position in , " said David Smith , who heads trading in all non-U.S. stocks . SINV " S-TPC NP-SBJ VP DT VBZ RB , NP-PRD NP PP SBAR DT NN NN NN NN IN TO VB DT JJ IN " VBD NNP NNP , WP VBZ IN DT JJ NNS . WHNP S * NP-SBJ * VP VP NP NP * PP VP * NP-SBJ NP * SBAR WHNP S NP-SBJ VP NP NP PP NP 28 NP S

SLIDE 7

Dependency Syntax

" This is not the sort of market to have a big position in , " said David Smith , who heads trading in all non-U.S. stocks . 28

" This is not the sort of market to have a big position in , " said David Smith , who heads trading in all non-U.S. stocks .

http://www.ark.cs.cmu.edu/TurboParser

SLIDE 8

Frame-Semantic Structure

“The professor chuckled with unabashed glee”

(Figure: SEMAFOR frame-semantic parse, evoking frames such as Make_noise, Emotion_directed, and People_by_vocation, with roles including Experiencer, State, Sound_source, Internal_cause, and Sound.)

http://www.ark.cs.cmu.edu/SEMAFOR

SLIDE 9

A Bit of History

  • Many of these are “old” problems.
  • Twenty years ago we started using text data to solve them:
    1. Pay experts to annotate examples.
    2. Use a combination of human ingenuity and statistics to build systems and/or do science.
  • Sound familiar?
  + Reusable data, clearly identified evaluation tasks, benchmarks → progress
  − Domain specificity, overcommitting to representations, nearsightedness

SLIDE 10

Comments on “Domain”

  • Huge challenge: models built for one domain don’t work well in others.
  • What’s a “domain”?
  • Huge effort expended on news and, later, biomedical articles.
  • Lesson: NLP people will follow the annotated datasets.
SLIDE 11

The Next Paradigm?

  • If we could do with less or no annotated data, maybe we could be more “agile” about text domains and representations.

  • Small, noisy datasets; active, semisupervised, and unsupervised learning
  • NB: this is where we do Bayesian statistics!
  • Can we get information about language from other data? (See part 2.)

1. Pay experts to annotate examples. 2. Use a combination of human ingenuity and statistics to build systems and/or do science.

SLIDE 12

Current Work (Yano, Resnik, and Smith, 2010)

  • Pilot study: can untrained annotators consistently annotate political bias?
  • What clues give away political bias in sentences from political blogs? (See Monroe, Colaresi, and Quinn, 2008)
  • We could probably define “political bias” under Jan’s subjectivity umbrella!
  • Sample of sentences from six political blogs (2008); not uniform
  • Amazon Mechanical Turk judgments of bias (buzzword: crowdsourcing)
  • Survey of basic political views of annotators
SLIDE 13

Example 1

Feminism has much to answer for denigrating men and encouraging women to seek independence whatever the cost to their families.

SLIDE 14

Example 2

They died because of the Bush administration’s hubris.

SLIDE 15

Example 3

In any event, for those of us who have followed this White House carefully during the entirety of the war, the charge is frankly absurd.

SLIDE 16

Other Observations

  • Bias is hard to label without context, but not always impossible.
  • Pairwise kappa of 0.5 to 0.55 (decent for “moderately difficult” tasks)
  • Annotators: more social liberals, more fiscal conservatives
  • Liberal blogs are liberal, conservative blogs are conservative
  • Liberals are quick to see conservative bias; conservatives see both!
SLIDE 17

Lexical Indicators of Bias

Overall                Liberal                Conservative        Not Sure Which
bad            0.60    Administration 0.28    illegal     0.40    pass      0.32
personally     0.56    Americans      0.24    Obama’s     0.38    bad       0.32
illegal        0.53    woman          0.24    corruption  0.32    sure      0.28
woman          0.52    single         0.24    rich        0.28    blame     0.28
single         0.52    personally     0.24    stop        0.26    they’re   0.24
rich           0.52    lobbyists      0.23    tax         0.25    happen    0.24
corruption     0.52    Republican     0.22    claimed     0.25    doubt     0.24
Administration 0.52    union          0.20    human       0.24    doing     0.24
Americans      0.51    torture        0.20    doesn’t     0.24    death     0.24
conservative   0.50    rich           0.20    difficult   0.24    actually  0.24
doubt          0.48    interests      0.20    Democrats   0.24    exactly   0.22
torture        0.47    doing          0.20    less        0.23    wrong     0.22

Most strongly biased words, ranked by relative frequency of receiving a bias mark, normalized by total frequency.

SLIDE 18

Lexical Indicators of Bias

(Same table as on the previous slide.)

SLIDE 19

Lexical Indicators of Bias

(Same table as on the previous slide.)

SLIDE 20

Current Work (Eisenstein, Yano, Smith, Cohen, and Xing, in progress)

  • Goal: find mentions of specific “persons of interest” in noisy multi-author text
  • Starting point: a list of names
  • Cf. noun phrase coreference resolution
  • Deal with misspellings, variations, titles, even sobriquets like “Obamessiah”
  • Exploit local and document context
  • Predominantly unsupervised approach (supervision is the list of names)
SLIDE 21

NLP Success Stories

? Search engines
✓ Translation (Google)
✓ Information extraction (text → databases)
✓ Question answering (recently: IBM’s Watson)
✓ Opinion mining

SLIDE 22

NLP: The Refrigerator of the Information Age

  • It enables all kinds of more exciting activities.
  • It does something you can’t do very well (cf. dishwashers).
  • When it’s working for you, it’s always quietly humming in the background.
  • You don’t think about it too often, and instructions never mention it.
  • When it stops working, you will notice.
  • Though expertise is required, there is very little glamor in manufacturing, maintaining, or improving it.

(Opinion)

SLIDE 23
  • 2. Adventures in Macro-Analysis: Linking Text to Other Data

Note to John: These may be superficial.

(Diagram: text linked to “data describing government activity in all parts of the policymaking process,” social behavior, and economic, financial, and political data.)

SLIDE 24
Recent Development: Text Regression

  • Linear regression:

    min_{w ∈ ℝ^d} (1/n) Σ_{i=1}^{n} error(y_i, w⊤x_i)

  • Each x_i is some representation of a document as a d-dimensional vector.
  • Each y_i is some output value that we seek to learn to predict.
  • w is a vector of weights.
  • Learning the model = picking w.
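
To make the objective concrete, here is a minimal sketch of text regression in Python, assuming scikit-learn is available; the ridge penalty, n-gram features, documents, and target values are illustrative stand-ins, not the exact setup of the papers cited below.

    # A minimal text-regression sketch: bag-of-words features, squared error
    # with an L2 penalty (ridge). All data below are made up.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import Ridge

    docs = ["an unusually fun summer blockbuster", "a slow , joyless documentary"]
    y = [50.0, 1.2]  # e.g., opening-weekend revenue in $M (made-up values)

    X = CountVectorizer(ngram_range=(1, 2)).fit_transform(docs)  # words + bigrams
    model = Ridge(alpha=1.0).fit(X, y)  # learning the model = picking w
    # model.coef_ is w; a prediction is w^T x (plus an intercept)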
SLIDE 25

Text Regression Examples

x                                          y
reviews of a film from newspaper critics   opening weekend revenue (Joshi, Das, Gimpel, and Smith, 2010)
a company’s annual 10-K report             volatility, a measure of financial risk (Kogan, Levin, Routledge, Sagi, and Smith, 2009)
political blog post                        total comment volume (Yano and Smith, 2010)
microblog                                  microblogger’s geographical coordinates (Eisenstein, O’Connor, and Smith, in progress)

SLIDE 26

Text Regression Examples

(Roadmap table repeated; first up: film reviews → opening weekend revenue.)

SLIDE 27

Data

  • Text: pre-release movie reviews from seven American newspapers (1,147 + 317 + 254 movies), during 2005-2009
  • Metadata: name, production house, genre(s), scriptwriter(s), director(s), country of origin, primary actors, release date, MPAA rating, running time, production budget (metacritic.com, the-numbers.com)
  • (State-of-the-art forecasters are based on these observables; see, e.g., Simonoff and Sparrow, 2000; Sharda and Delen, 2006)
  • Target: opening weekend gross revenue, number of opening weekend screens (the-numbers.com)

SLIDE 28

Experimental Results

                                      Total MAE ($M)   Per Screen MAE ($K)
Predict median                            10.521             6.642
Non-text                                   5.983             6.540
Words, bigrams, trigrams                   7.627             6.060
Non-text + words, bigrams, trigrams        5.750             6.052

SLIDE 29

Features ($M)

rating:  pg +0.085; adult -0.236; rate r -0.364
sequels: this series +13.925; the franchise +5.112; the sequel +4.224
people:  will smith +2.560; brittany +1.128; producer brian +0.486
genre:   testosterone +1.945; comedy for +1.143; a horror +0.595; documentary -0.037; independent -0.127
sent.:   best parts of +1.462; smart enough +1.449; a good thing +1.117; shame $ -0.098; bogeyman -0.689
plot:    torso +9.054; vehicle in 5.827; superhero $ 2.020

Also: “of the art,” “and cgi,” “shrek movies,” “voldemort,” “blockbuster,” “anticipation,” “summer movie”; “canne” is bad.

SLIDE 30

Text Regression Examples

(Roadmap table repeated; up next: 10-K reports → volatility.)

SLIDE 31

Why Finance?

  • I hate reading financial reports, but they contain crucial information about my investments, hence my future.

  • Natural language processing for boring texts seems like a good bet.
  • Finance researchers want to know:
  • Are financial reports worth the cost?
  • Are they informative?
  • Does this tell us anything about financial policy?
SLIDE 32
Volatility

  • Return on day t:

    r_t = (closingprice_t + dividends_t) / closingprice_{t−1} − 1

  • Sample standard deviation from day t − τ to day t:

    v_{[t−τ,t]} = √( Σ_{i=0}^{τ} (r_{t−i} − r̄)² / τ )

  • This is called measured volatility.
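
A minimal sketch of these two formulas in Python with NumPy; the prices and dividends are made-up sample values.

    import numpy as np

    def daily_returns(closing_prices, dividends):
        # r_t = (closingprice_t + dividends_t) / closingprice_{t-1} - 1
        p = np.asarray(closing_prices, dtype=float)
        d = np.asarray(dividends, dtype=float)
        return (p[1:] + d[1:]) / p[:-1] - 1.0

    def measured_volatility(returns):
        # Square root of the mean squared deviation of returns in the window.
        # (The slide normalizes by τ; np.mean divides by the number of terms,
        # a negligible difference for long windows.)
        r = np.asarray(returns, dtype=float)
        return np.sqrt(np.mean((r - r.mean()) ** 2))

    prices = [100.0, 101.5, 99.8, 102.3, 103.0]
    divs = [0.0, 0.0, 0.5, 0.0, 0.0]
    print(measured_volatility(daily_returns(prices, divs)))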

SLIDE 33

Important Property of Volatility

  • Autoregressive conditional heteroscedasticity: volatility tends to be stable (over horizons like one year).
  • v[t − τ, t] is a strong predictor of v[t, t + τ].
  • This is our strong baseline.
SLIDE 34

Data

  • Text: “Form 10-K” section 7 from publicly-traded American firms’ government-mandated disclosures (26,806 reports, ~250M words), during 1996-2006

  • sec.gov
  • www.ark.cs.cmu.edu/10K
  • Metadata: volatility in the year prior to the report (“history”), v[t - 1y, t]
  • Target: volatility in the year following the report, v[t, t + 1y]

Source: Center for Research in Security Prices U.S. Stocks Databases

SLIDE 35

Experimental Setup

  • Test on year Y.
  • Train on (Y - 5, Y - 4, Y - 3, Y - 2, Y - 1).
  • Six such splits.
  • Compare
  • history-only baseline;
  • text-only SVR, unigrams and bigrams, log(1 + frequency);
  • combined SVR.
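
A minimal sketch of the text-only setup in Python, assuming scikit-learn; the SVR variant, hyperparameters, and the two toy reports below are stand-ins for the actual experimental configuration.

    # Support vector regression on unigram/bigram counts with
    # log(1 + frequency) weighting, as described above. Data are made up.
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVR

    reports = ["net loss and going concern doubt", "steady dividends and insurance income"]
    log_volatility = [-1.2, -2.5]  # made-up targets

    X = CountVectorizer(ngram_range=(1, 2)).fit_transform(reports).astype(float)
    X.data = np.log1p(X.data)  # log(1 + frequency) term weighting
    model = LinearSVR(C=1.0).fit(X, log_volatility)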
SLIDE 36

Dominant Weights (2000-4)

high volatility terms       low volatility terms
loss            0.025       net income           -0.021
net loss        0.017       rate                 -0.017
year #          0.016       properties           -0.014
expenses        0.015       dividends            -0.013
going concern   0.014       lower interest       -0.012
a going         0.013       critical accounting  -0.012
administrative  0.013       insurance            -0.011
personnel       0.013       distributions        -0.011

SLIDE 37

MSE of Log-Volatility

(Figure: MSE of log-volatility for History, Text, and Text + History models, by test year 2001-2006 and micro-average; lower is better.)

SLIDE 38

2002

  • Enron and other accounting scandals
  • Sarbanes-Oxley Act of 2002 (and other reforms)
  • Longer reports
  • Are the reports more informative after 2002? Because of Sarbanes-Oxley?
SLIDE 39

Text Regression Examples

(Roadmap table repeated; up next: blog posts → comment volume.)

SLIDE 40

Political Blogs

  • Arguably an influential medium in American politics.
  • Adamic and Glance (2005), Leskovec et al. (2007), and many others have considered the link and community structure.

  • We’re focusing on predicting how readers will respond.
  • Cf. ideological discourse models: Lin, Xing, and Hauptmann (2008)
SLIDE 41
SLIDE 42

Comments

(Figure: a blog post’s comment section.)

SLIDE 43

Data

  • Main Text: blog posts from five American political blogs, 2007-2008; 1000-2200 posts per blog, 110K-320K words per blog, 68-185 words per post on average (by blog).
  • Comments: text and author for each comment on the above posts; 30-200 comments per post, 20-40 words per comment.

  • www.ark.cs.cmu.edu/blog-data
SLIDE 44

Technical Digression

  • Blei et al.’s latent Dirichlet allocation model (Blei et al., 2003) is not so different from dimensionality reduction, but it has a more extensible probabilistic interpretation.
  • Quite useful for exploratory data analysis.
  • Easy to extend to model other observed and hidden variables.
  • I will skip the derivations and give the high-level idea:

    p(response | words) ∝ Σ_{topics, mixture} p(words, topics, mixture, response)
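
For flavor, a minimal sketch of plain LDA for exploratory analysis in Python, assuming scikit-learn; the extended models in this work (e.g., CommentLDA below) are not implemented here, and the posts are made up.

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    posts = [
        "obama clinton campaign hillary barack president",
        "iraq war military troops forces security",
        "health care tax spending economy insurance",
    ]
    X = CountVectorizer().fit_transform(posts)
    lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(X)
    # lda.components_[k] holds the word weights for topic k
    # (compare the topics on the next slide)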
SLIDE 45

1873  women black white men people liberal civil working woman rights
1730  obama clinton campaign hillary barack president presidential really senator democratic
1643  think people policy really way just good political kind going
1561  conservative party political democrats democratic republican republicans immigration gop right
1521  people city school college photo creative states license good time
1484  romney huckabee giuliani mitt mike rudy muslim church really republican
1478  iran world nuclear israel united states foreign war international iranian
1452  carbon oil trade emissions change climate energy human global world
1425  obama clinton win campaign mccain hillary primary voters vote race
1352  health economic plan care tax spending economy money people insurance
1263  iraq war military government american iraq troops forces security years
1246  administration bush congress torture law intelligence legal president cia government
1215  mccain john bush president campaign policy know george press man
1025  team game season defense good trade play player better best
1007  book times news read article post blog know media good

SLIDE 46

Predicting Volume

  • Which blog posts will attract more attention than the average?

               Precision   Recall
Naive Bayes      72.5       41.7
Poisson SLDA*    70.1       63.2
CommentLDA       70.2       68.8

*Similar to Blei and McAuliffe (2008), but mixture of Poissons.
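
As a point of reference, a minimal sketch of the Naive Bayes baseline for this binary task in Python, assuming scikit-learn; the posts and labels are invented.

    # Predict whether a post will draw more comments than average.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    posts = ["obama clinton debate tonight", "recipe for apple pie", "war funding vote"]
    above_average = [1, 0, 1]  # made-up labels

    X = CountVectorizer().fit_transform(posts)
    clf = MultinomialNB().fit(X, above_average)
    # clf.predict(...) flags posts expected to attract above-average attention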

SLIDE 47

Text Regression Examples

(Roadmap table repeated; up next: microblogs → geographical coordinates.)

SLIDE 48

“Dialect” from Twitter

(Figure: map over the continental United States, longitude roughly 130-60°W, latitude 25-50°N.)

SLIDE 49

Another Macro-Analysis Example

  • Linking microblog sentiment to other measures of public opinion
  • Warning: this is descriptive stuff!
SLIDE 50

Data

  • Text: A billion “tweets,” from 2008-2009. Average length about 11 words.
  • Time Series: (all data are freely available online)

Economic confidence:

  • Gallup Economic Confidence poll (daily, three-day averages)
  • ICS from Reuters/U. Michigan (monthly)

Politics:

  • Gallup’s poll of Obama approval rating
SLIDE 51

Technical Approach

  • Retrieve messages that match a keyword (e.g., jobs or obama).
  • Estimate daily opinion.
  • “Positive” and “negative” words come from the OpinionFinder lexicon (Wilson, Wiebe, and Hoffmann, 2005; 1600 types and 1200 types respectively).
  • Two parameters: number of days for moving average, and lead/lag.

x_t = count_t(positive word ∧ keyword) / count_t(negative word ∧ keyword)
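
A minimal sketch of this statistic in Python; the tiny word lists and messages stand in for the OpinionFinder lexicon and the Twitter data.

    # Daily sentiment ratio for a keyword: messages containing a positive
    # word over messages containing a negative word.
    positive = {"good", "hope", "win"}   # stand-in for ~1600 positive types
    negative = {"bad", "fear", "lose"}   # stand-in for ~1200 negative types

    def sentiment_ratio(messages, keyword):
        pos = neg = 0
        for msg in messages:
            words = set(msg.lower().split())
            if keyword in words:
                pos += bool(words & positive)
                neg += bool(words & negative)
        return pos / neg if neg else float("inf")

    day = ["jobs report looks good", "fear of losing jobs", "good news on jobs"]
    print(sentiment_ratio(day, "jobs"))  # 2.0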

SLIDE 52

Moving Averages: 1, 7, 30 past days

(Figure: sentiment ratio over January 2008 to November 2009, smoothed with 1-, 7-, and 30-day moving averages; y-axis from 1 to 5.)

SLIDE 53

Sentiment ratios (x); two different smoothing window and lead parameters

(Figure: sentiment ratio under k=15, lead=0 and k=30, lead=50, plotted over 2008-2009 against the Gallup Economic Confidence poll and the Michigan ICS (y).)

SLIDE 54

jobs and Gallup’s Economic Confidence Poll

(Figure: the two series overlaid; r = 0.794.)

SLIDE 55

Economic Confidence and Twitter

  • job and economy did not work well at all.
  • jobs ≠ job
SLIDE 56

(Figure: sentiment ratio for “obama” with 15-day smoothing (r = 0.725), fraction of messages mentioning “obama,” % support for Obama in 2008 election polls, and % presidential job approval (Gallup), over 2008-2009.)

SLIDE 57

Questions for You

  • Fill in the blank: computational linguistics is to NLP as computational social science is to ___ (and what is that relationship?)

  • What linguistic representations make sense for computational social science?
  • What are some applications that are not superficial?
SLIDE 58

Acknowledgments

  • Collaborators: Ramnath Balasubramanyan, Desai Chen, William Cohen, Dipanjan Das, Kevin Gimpel, Jacob Eisenstein, Mahesh Joshi, Dimitry Levin, André Martins, Brendan O’Connor, Bryan Routledge, Nathan Schneider, Eric Xing, Tae Yano (all CMU); Shimon Kogan (Texas); Jacob Sagi (Vanderbilt)
  • NSF, DARPA, IBM, Google, HP Labs, Yahoo

SLIDE 59

Lead/Lag

  • One thing to consider is the correlation between the text statistic (x) and the poll (y).
  • But we want forecasting!
  • We can track correlation for different “text lead/poll lag” values; a positive value means we are forecasting.
  • Does x_t predict y_{t+k}?
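
A minimal sketch of the lead/lag sweep in Python with NumPy; the series here are synthetic, with x standing in for the text statistic and y for the poll.

    import numpy as np

    def lead_lag_corr(x, y, lead):
        # Correlation of x_t with y_{t+lead}; positive lead = text leads poll.
        if lead > 0:
            x, y = x[:-lead], y[lead:]
        elif lead < 0:
            x, y = x[-lead:], y[:lead]
        return np.corrcoef(x, y)[0, 1]

    rng = np.random.default_rng(0)
    x = rng.random(200)                          # synthetic text statistic
    y = np.roll(x, 7) + 0.1 * rng.random(200)    # a "poll" that lags x by 7 days
    print(lead_lag_corr(x, y, lead=7))           # correlation peaks near lead 7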
SLIDE 60

(Figure: correlation against Gallup (left) and against the Michigan ICS (right) as a function of text lead / poll lag; Gallup curves for k=7, 15, 30, ICS curves for k=30, 60. Positive values mean text leads the poll; negative values mean the poll leads the text.)

SLIDE 61

(Figure: the smoothed sentiment-ratio series and the lead/lag correlation curves from the previous slides, shown together.)

SLIDE 62

Open Questions

  • Smoothing tends to lead to higher correlation. We do not know why.
  • Our sentiment statistic is noisy - is that a problem?
  • Data are not IID. What are appropriate training/tuning/testing strategies?
  • Intuitively, “check it tomorrow.”
  • But how to compare systems fairly, test for significant differences, etc.?
  • How do we know whether this is noise?
SLIDE 63

Conclusion

SLIDE 64

Text-Driven Forecasting

  • Input: documents
  • Output: a prediction about a future, measurable state of the world.
  • Attractions:
  • theory-neutral,
  • easy and natural to evaluate,
  • inexpensive, and
  • plenty of data (and no annotation needed).
  • Challenges: linguistic representations, noisy NLP, experimental design.