Natural ___ Processing
Noah A. Smith School of Computer Science Carnegie Mellon University nasmith@cs.cmu.edu
This Talk
1. A light discussion of micro-analysis (words, sentences, documents)
2. Some recent developments in
specifically linking text and non-text
words
sentences
and a number of other applications.
phenomenon of human language
morphological disambiguation
structure parsing, dependency parsing)
argument structures
discourse relations and parsing
" This is not the sort of market to have a big position in , " said David Smith , who heads trading in all non-U.S. stocks .
[Figure: phrase-structure parse tree with part-of-speech tags for the sentence above]
http://www.ark.cs.cmu.edu/TurboParser
Experiencer State Sound source Internal cause Sound
Experiencer http://www.ark.cs.cmu.edu/SEMAFOR
+ Reusable data, clearly identified evaluation tasks, benchmarks → progress
1. Pay experts to annotate examples. 2. Use a combination of human ingenuity and statistics to build systems and/or do science.
“agile” about text domains and representations.
Monroe, Colaresi, and Quinn, 2008)
Overall               Liberal               Conservative        Not Sure Which
bad             0.60  Administration  0.28  illegal     0.40    pass      0.32
personally      0.56  Americans       0.24  Obama’s     0.38    bad       0.32
illegal         0.53  woman           0.24  corruption  0.32    sure      0.28
woman           0.52  single          0.24  rich        0.28    blame     0.28
single          0.52  personally      0.24  stop        0.26    they’re   0.24
rich            0.52  lobbyists       0.23  tax         0.25    happen    0.24
corruption      0.52  Republican      0.22  claimed     0.25    doubt     0.24
Administration  0.52  union           0.20  human       0.24    doing     0.24
Americans       0.51  torture         0.20  doesn’t     0.24    death     0.24
conservative    0.50  rich            0.20  difficult   0.24    actually  0.24
doubt           0.48  interests       0.20  Democrats   0.24    exactly   0.22
torture         0.47  doing           0.20  less        0.23    wrong     0.22

Most strongly biased words, ranked by relative frequency of receiving a bias mark, normalized by total frequency.
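The ranking described in the caption can be sketched as follows. This is only an illustration under assumptions: the per-token boolean "bias mark" input, the function name `bias_scores`, and the toy data are all hypothetical, and the slides do not say how the annotations were actually stored or thresholded.

```python
from collections import Counter

def bias_scores(tokens, marked, min_count=1):
    """Rank words by (times marked as biased) / (total occurrences).

    tokens: list of word tokens (hypothetical input format)
    marked: parallel list of booleans, True if an annotator marked the token
    """
    total = Counter(tokens)
    hits = Counter(t for t, m in zip(tokens, marked) if m)
    scored = [(w, hits[w] / c) for w, c in total.items() if c >= min_count]
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

# Toy data, invented for illustration only.
tokens = ["bad", "tax", "bad", "rich", "tax", "bad", "tax", "tax"]
marked = [True, False, True, True, False, False, True, False]
ranking = bias_scores(tokens, marked)
```

A `min_count` cutoff is included because a word seen once and marked once would otherwise top the list; whether the original analysis used such a cutoff is not stated.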
? Search engines
✓ Translation (Google)
✓ Information extraction (text → databases)
✓ Question answering (recently: IBM’s Watson)
✓ Opinion mining
maintaining, or improving it.
economic, financial, political
$$\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \mathrm{error}(w; x_i, y_i)$$
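One way to make the objective concrete is the minimal sketch below. It assumes squared error as the error function and plain gradient descent as the optimizer; the slides specify neither, and the function names, learning rate, and toy data are hypothetical.

```python
def predict(w, x):
    """Linear prediction w . x for one feature vector."""
    return sum(wj * xj for wj, xj in zip(w, x))

def average_error(w, X, y):
    """(1/n) * sum of squared errors; squared error is an assumption here."""
    return sum((predict(w, xi) - yi) ** 2 for xi, yi in zip(X, y)) / len(X)

def fit_linear(X, y, lr=0.1, steps=2000):
    """Minimize the average error over w in R^d by gradient descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            residual = predict(w, xi) - yi
            for j in range(d):
                grad[j] += 2.0 * residual * xi[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

# Toy data: first feature is a constant bias term, and y = 1 + 2 * x.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
w = fit_linear(X, y)
```

With text features, d is typically far larger than n, so a practical version would add a regularization term to the objective; that term is omitted here to keep the sketch aligned with what the slide shows.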
x = reviews of a film from newspaper critics; y = revenue (Joshi, Das, Gimpel, and Smith, 2010)
x = a company’s annual 10-K report; y = volatility, a measure of financial risk (Kogan, Levin, Routledge, Sagi, and Smith, 2009)
x = political blog post; y = total comment volume (Yano and Smith, 2010)
x = microblog; y = microblogger’s geographical coordinates (Eisenstein, O’Connor, and Smith, in progress)
(1,147+317+254 movies), during 2005-2009
country of origin, primary actors, release date, MPAA rating, running time, production budget (metacritic.com, the-numbers.com)
Simonoff and Sparrow, 2000; Sharda and Delen, 2006)
screens (the-numbers.com)
                                       Total MAE ($M)   Per Screen MAE ($K)
Predict median                         10.521           6.642
Non-text                                5.983           6.540
Words, bigrams, trigrams                7.627           6.060
Non-text + words, bigrams, trigrams     5.750           6.052
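For readers unfamiliar with the metric, the sketch below shows how mean absolute error (MAE) and a "predict median" baseline like the one in the first row can be computed. The function names and the revenue figures are invented for illustration and are not taken from the dataset.

```python
from statistics import median

def mean_absolute_error(y_true, y_pred):
    """Average of |true - predicted| over all examples."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def median_baseline(train_y, n_test):
    """Predict the training-set median for every test example."""
    m = median(train_y)
    return [m] * n_test

# Hypothetical revenues in $M, for illustration only.
train_revenue = [2.0, 5.0, 9.0, 40.0, 120.0]
test_revenue = [4.0, 30.0, 11.0]

baseline_preds = median_baseline(train_revenue, len(test_revenue))
baseline_mae = mean_absolute_error(test_revenue, baseline_preds)
```

The median is a natural constant predictor under absolute error, since it minimizes MAE on the data it is computed from.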
rating: pg +0.085; adult; rate r
sequels: this series +13.925; the franchise +5.112; the sequel +4.224
people: will smith +2.560; brittany +1.128; ^ producer brian +0.486
genre: testosterone +1.945; comedy for +1.143; a horror +0.595; documentary; independent
sent.: best parts of +1.462; smart enough +1.449; a good thing +1.117; shame $; bogeyman
plot: torso +09.054; vehicle in 5.827; superhero $ 2.020
Also ... “of the art,” “and cgi,” “shrek movies,” “voldemort,” “blockbuster,” “anticipation,” “summer movie”; “canne” is bad.
investments, hence my future.
horizons like one year).
government-mandated disclosures (26,806 reports, ~250M words), during 1996-2006
Source: Center for Research in Security Prices U.S. Stocks Databases
[Figure: results by year, 2001-2006, and micro-average; lower is better]
considered the link and community structure.
1000-2200 posts per blog, 110K-320K words per blog, 68-185 words per post on average (by blog).
comments per post, 20-40 words per comment.
different from dimensionality reduction, but it has a more extensible probabilistic interpretation.
1873  women black white men people liberal civil working woman rights
1730  obama clinton campaign hillary barack president presidential really senator democratic
1643  think people policy really way just good political kind going
1561  conservative party political democrats democratic republican republicans immigration gop right
1521  people city school college photo creative states license good time
1484  romney huckabee giuliani mitt mike rudy muslim church really republican
1478  iran world nuclear israel united states foreign war international iranian
1452  carbon oil trade emissions change climate energy human global world
1425  obama clinton win campaign mccain hillary primary voters vote race
1352  health economic plan care tax spending economy money people insurance
1263  iraq war military government american iraq troops forces security years
1246  administration bush congress torture law intelligence legal president cia government
1215  mccain john bush president campaign policy know george press man
1025  team game season defense good trade play player better best
1007  book times news read article post blog know media good
*Similar to Blei and McAuliffe (2008), but mixture of Poissons.
Economic confidence:
Politics:
(Wilson, Wiebe, and Hoffmann, 2005; 1600 types and 1200 types respectively).
[Figure: Sentiment Ratio by month, 2008-01 to 2009-11]
[Figure: Sentiment Ratio (k=15, lead=0; k=30, lead=50) alongside Gallup Economic Confidence and Michigan ICS, 2008-01 to 2009-11]
[Figure: Sentiment Ratio for "obama" alongside % Support Obama (Election) and % Pres. Job Approval, 2008-01 to 2009-12]
computational linguistics is to NLP as computational social science is to ___ (and what is that relationship?)
Dipanjan Das, Kevin Gimpel, Jacob Eisenstein, Mahesh Joshi, Dimitry Levin, André Martins, Brendan O’Connor, Bryan Routledge, Nathan Schneider, Eric Xing, Tae Yano (all CMU); Shimon Kogan (Texas); Jacob Sagi (Vanderbilt)
, DARPA, IBM, Google, HP Labs, Yahoo
poll (y).
value means we are forecasting.
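The comparison between a smoothed text signal and a poll series at different leads can be sketched as below. This is a hedged illustration, not the authors' procedure: the trailing moving average over k days, the Pearson correlation, the sign convention (positive `lead` means the text series precedes the poll), and all names and data are assumptions made for the example.

```python
from statistics import mean

def moving_average(xs, k):
    """Trailing average over up to k values; early positions use what exists."""
    out = []
    for t in range(len(xs)):
        window = xs[max(0, t - k + 1): t + 1]
        out.append(sum(window) / len(window))
    return out

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def lagged_correlation(text, poll, lead):
    """Correlate text[t] with poll[t + lead].

    In this sketch, lead > 0 pairs text with later poll values.
    """
    if lead >= 0:
        a, b = text[:len(text) - lead], poll[lead:]
    else:
        a, b = text[-lead:], poll[:len(poll) + lead]
    return pearson(a, b)

# Toy series: the poll repeats the text signal two days later.
text = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 8.0, 7.0, 9.0, 10.0]
poll = [0.0, 0.0] + text[:-2]
r_at_true_lead = lagged_correlation(text, poll, 2)
smoothed = moving_average([1.0, 2.0, 3.0, 4.0], 2)
```

Sweeping `lead` over a range and plotting the correlation for several values of k would give curves of the kind the "Text lead / poll lag" axis suggests.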
[Figure: two panels plotted against text lead / poll lag, one for k=15 and k=7, one for k=30 and k=60, with regions labeled "Text leads poll" and "Poll leads text"]
, experimental design.