An experiment on the Reddit dataset


slide-1
SLIDE 1

An experiment on the Reddit dataset

Shahbaz Ahmed (115594)
Viorel Morari (115629)

slide-2
SLIDE 2

The need for Summarization

  • Goal - to capture the important information contained in large volumes of text, and present it in a brief, representative, and consistent summary
  • TL;DR - the acronym stands for "Too Long, Didn't Read"

slide-3
SLIDE 3

Types of Summarization

  • Automatic summarization - reducing a text document or a larger corpus of multiple documents into a short set of words or a paragraph that conveys the main meaning of the text
slide-4
SLIDE 4

Extractive vs. Abstractive

  • Extractive methods work by selecting a subset of existing words, phrases, or sentences in the original text to form the summary
  • Abstractive methods build an internal semantic representation and then use natural language generation techniques to create a summary that is closer to what a human might generate

slide-5
SLIDE 5

Example

The Army Corps of Engineers, rushing to meet President Bush’s promise to protect New Orleans by the start of the 2006 hurricane season, installed defective flood-control pumps last year despite warnings from its own expert that the equipment would fail during a storm, according to documents obtained by The Associated Press.

  • Extractive summary: “Army Corps of Engineers,” “President Bush,” “New Orleans,” and “defective flood-control pumps”
  • Abstractive summary: “political negligence” or “inadequate protection from floods.”
slide-6
SLIDE 6

Existing work

  • Gupta and Lehal, 2010 - single document summarization
  • Goldstein et al., 2000 - summarization of multiple documents on the same topic (~200 documents)
  • Cselle, Albrecht, and Wattenhofer, 2007 - summarizing discussions such as email conversations (~200 comments)
  • Hu, Sun, and Lim, 2007 - blog summarization (~1500 blog posts)
  • Chakrabarti and Punera, 2011 - tweet summarization (440K tweets; over 150 games)
  • Brody and Elhadad, 2010 - review summarization
slide-7
SLIDE 7

Our Dataset: The Reddit Universe!

  • Comments: 149.6 GB, 1,659,361,605 (~1.66 billion) entries
  • Submissions: 39.7 GB, 196,531,736 (~196.5 million) entries

slide-8
SLIDE 8
Comment

  • Comment - a statement of fact or opinion, especially a remark that expresses a personal reaction or attitude.

{"archived":true,"author":"jaquehamr","body":"Thanks for proving the point of the quote.\n\nTL;DR: WOOSH","controversiality":0,"created_utc":"1239192802","downs":0,"edited":"false","gilded":0,"id":"c08q8en","link_id":"t3_8auok","name":"t1_c08q8en","parent_id":"t1_c08q4sz","retrieved_on":1425950159,"score":3,"score_hidden":false,"subreddit":"atheism","subreddit_id":"t5_2qh2p","ups":3}
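A minimal sketch of reading one such record in Python. It assumes the dump stores one JSON object per line, which the slides do not state; `parse_record` is a hypothetical helper, and the sample is trimmed to a few of the fields shown on the slide.

```python
import json

# Sample record trimmed from the comment shown on the slide.
# A raw string keeps the JSON escapes (\n) intact for the parser.
raw = r'{"author":"jaquehamr","body":"Thanks for proving the point of the quote.\n\nTL;DR: WOOSH","id":"c08q8en","parent_id":"t1_c08q4sz","score":3,"subreddit":"atheism"}'


def parse_record(line):
    """Parse one JSON record and keep only the fields a summarizer needs."""
    record = json.loads(line)
    return {
        "id": record["id"],
        "subreddit": record["subreddit"],
        # Comments carry their text in "body"; submissions use "selftext".
        "text": record.get("body", record.get("selftext", "")),
    }


comment = parse_record(raw)
print(comment["text"])
```

Reading "body" with a fallback to "selftext" lets the same helper handle both record types shown in this deck.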
slide-9
SLIDE 9

Submission

  • Submission - a statement of fact or opinion posted by a registered user with the intention to be elaborated by other users.

{"archived":true,"author":"[deleted]","created":1297290547,"created_utc":"1297290547","domain":"self.WeAreTheFilmMakers","downs":0,"edited":"false","gilded":0,"hide_score":false,"id":"fibse","is_self":true,"media_embed":{},"name":"t3_fibse","num_comments":2,"over_18":false,"permalink":"/r/WeAreTheFilmMakers/comments/fibse/question_about_resumes/","quarantine":false,"retrieved_on":1442846972,"saved":false,"score":2,"secure_media_embed":{},"selftext":"I'm currently a film student at the University of Cincinnati and I'm going to start applying for internships soon so I was wondering what I should put on my resume when applying.\n\ntl;dr I'm going to be sending out my resume soon and I'm looking for help on what I should include on it","stickied":false,"subreddit":"WeAreTheFilmMakers","subreddit_id":"t5_2qngr","thumbnail":"default","title":"Question about resumes","ups":2,"url":"http://www.reddit.com/r/WeAreTheFilmMakers/comments/fibse/question_about_resumes/"}

slide-10
SLIDE 10

Process of Summarization

Extracting a clean dataset

“The most important tasks with regard to understanding the information available in comments are filtering, ranking and summarizing the comments.” - (Potthast et al. 2012)

  • Extract only the items which contain tl;dr
  • Challenge: "body":"It's pretty sad that someone can sum up ten years of your life with a tl;dr"
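The extraction step could look roughly like the sketch below. The regular expression, the handling of the "tldr" spelling, and the choice to split at the last marker are assumptions for illustration; the slides only say that items containing tl;dr are kept. `split_tldr` is a hypothetical name.

```python
import re

# Matches "tl;dr" or "tldr" in any letter case, plus an optional ":" or "-" after it.
TLDR_PATTERN = re.compile(r"\btl;?dr\b\s*[:\-]?\s*", re.IGNORECASE)


def split_tldr(text):
    """Return (content, tldr) split at the last tl;dr marker, or None if absent."""
    matches = list(TLDR_PATTERN.finditer(text))
    if not matches:
        return None
    last = matches[-1]
    content = text[:last.start()].strip()
    tldr = text[last.end():].strip()
    return content, tldr


print(split_tldr("Thanks for proving the point of the quote.\n\nTL;DR: WOOSH"))
print(split_tldr("It's pretty sad that someone can sum up ten years of your life with a tl;dr"))
```

The second call illustrates the challenge on the slide: the marker is present, but the text merely mentions "tl;dr", so the extracted summary is empty.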
slide-11
SLIDE 11

Process of Summarization

Extracting a clean dataset
Filtering the targets

  • Filter out comments/submissions with content length < 50 chars (our approach)
  • E.g. "body":"Thanks for proving the point of the quote.\n\nTL;DR: WOOSH" - invalid
  • Filter out tl;dr’s with content length < 5 chars (our approach)
  • E.g. "body":"It's pretty sad that someone can sum up ten years of your life with a tl;dr" - invalid
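The two length filters can be sketched as follows. The thresholds come from the slide; measuring the content length without the tl;dr part is an assumption, chosen because it reproduces both "invalid" verdicts above. `is_valid_pair` is a hypothetical helper.

```python
MIN_CONTENT_CHARS = 50  # threshold from the slide
MIN_TLDR_CHARS = 5      # threshold from the slide


def is_valid_pair(content, tldr):
    """Keep a (content, tl;dr) pair only if both parts are long enough."""
    # Assumption: lengths are measured on whitespace-stripped text,
    # with the content counted separately from its tl;dr.
    return (len(content.strip()) >= MIN_CONTENT_CHARS
            and len(tldr.strip()) >= MIN_TLDR_CHARS)


# First slide example: the content before the marker is under 50 characters.
print(is_valid_pair("Thanks for proving the point of the quote.", "WOOSH"))
# Second slide example: nothing follows the marker, so the tl;dr is empty.
print(is_valid_pair("It's pretty sad that someone can sum up ten years of your life with a", ""))
```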

slide-12
SLIDE 12

Process of Summarization

Extracting a clean dataset
Filtering the targets
Processing & Ranking the target content

  • TF-IDF - ranking models
  • Reduces the influence of more common words
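To make the TF-IDF idea concrete, here is a small pure-Python sketch. It uses the common tf × log(N/df) weighting with whitespace tokenization; the slides do not say which TF-IDF variant or tokenizer was used, so both are assumptions, and `tf_idf` is a hypothetical name.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document, scoring tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


docs = ["the cat sat", "the dog barked", "the cat purred"]
weights = tf_idf(docs)
print(weights[0])
```

A term such as "the" that appears in every document gets log(3/3) = 0, which is how the weighting reduces the influence of common words.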
slide-13
SLIDE 13

Process of Summarization

Extracting a clean dataset
Filtering the targets
Processing & Ranking the target content
Extracting relevant information by ranks

  • Highest rank terms form the summarization (tl;dr)
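One way to turn term ranks into an extractive summary is to score each sentence by the average weight of its terms and keep the best few. The slides do not describe the exact selection rule, so this is only a sketch under that assumption. `top_sentences` and the example weights are hypothetical.

```python
import re


def top_sentences(text, term_weights, n=3):
    """Pick the n sentences whose terms carry the highest average weight."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def score(sentence):
        tokens = re.findall(r"[a-z0-9']+", sentence.lower())
        if not tokens:
            return 0.0
        return sum(term_weights.get(t, 0.0) for t in tokens) / len(tokens)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:n])  # restore the original sentence order
    return [sentences[i] for i in chosen]


text = "Cable had a tech channel. It was merged. You can watch online."
example_weights = {"tech": 1.0, "channel": 1.0, "online": 0.5}
summary = top_sentences(text, example_weights, n=2)
print(summary)
```

Averaging rather than summing the weights keeps long sentences from winning on length alone.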
slide-14
SLIDE 14

Process of Summarization

Extracting a clean dataset
Filtering the targets
Processing & Ranking the target content
Extracting relevant information by ranks
Presentation of the retrieved content

  • 1. there used to be several channels related to technology and geek culture.
  • 2. then it merged with g4tv, a shitty comcast channel of little note that wanted techtv's audience and cancelled all the decent reasons to ever tune into techtv.
  • 3. there is nothing decent that comes on cable television that you can't watch for free (and legally) on either hulu or the comedy channel's website.

(12 sentences)

Original tl;dr: there was a decent one. comcast more or less bought it out and axed all it's programming to get viewers for its gaming channel but only succeeded in destroying the market and causing kevin rose to run off and create digg. (2 sentences)

slide-15
SLIDE 15

Statistics about the data set

Submissions distribution (pie chart; legend: valid submissions, invalid submissions): 99.6%, 0.4%
Comments distribution (pie chart; legend: valid comments, invalid comments): 99.9%, 0.1%
Counts shown: 1,850,031 (~1.85 million) and 749,376 (~0.75 million)

slide-16
SLIDE 16

Statistics about the data set

Average word length, in number of words (bar chart): comments 228.2, submissions 405.1

slide-17
SLIDE 17

Statistics about the data set

Distribution of comments by length (chart; axis: number of comments)

slide-18
SLIDE 18

Statistics about the data set

Distribution of submissions by length (chart)

slide-19
SLIDE 19

Further ideas

  • Developing the automatic extractive summarizer on the valid comments & submissions (in progress)
  • Dealing with tl;dr at the semantic level
  • "body":"It's pretty sad that someone can sum up ten years of your life with a tl;dr"
  • Keyphrase extraction for summarization
  • Form a proper representation of valid and invalid comments/submissions/tl;dr’s
  • Dealing with encountered anomalies and faults in the detection process
slide-20
SLIDE 20

Examples

  • Valid Submission

slide-21
SLIDE 21
Examples

  • Valid Comment: omgwtfthatistotallyhitleri'llneverbeabletobuythatbrandoflotioneveragainthankyouforpointingthisouttomei'llsendanemailoutoeverylawyericanfindsothatthiscompanycanbebroughttojustice!)
  • "wordcount":2

slide-22
SLIDE 22

Conclusion

  • Difficult to set up a good universal summarization tool (abstractive level)
  • Our approach tends to generalize the idea of a comment/submission/tl;dr
  • Yet the number of valid comments/submissions suggests a good calibration
  • The existing approach can be further improved
slide-23
SLIDE 23

References

1. http://blog.mashape.com/list-of-30-summarizer-apis-libraries-and-software/
2. CS838-1 Advanced NLP: Automatic Summarization - Andrew Goldberg
3. Summarizing Newspaper Comments - Clare Llewellyn, Claire Grover and Jon Oberlander
4. Text Summarization using Singular Value Decomposition - Sharayu Rane
5. https://github.com/reddit/reddit/wiki/JSON
6. Automatic Summarization - Ani Nenkova and Kathleen McKeown
7. Automatic Summarization - Andrew Goldberg, 2007

slide-24
SLIDE 24

Thank You!