Opinion Mining - Feiyu XU & Xiwen CHENG - PowerPoint PPT Presentation



SLIDE 1

Opinion Mining

Feiyu XU & Xiwen CHENG

Xiwen.cheng@dfki.de
DFKI, Saarbruecken, Germany
Jan 19th, 2011

2011-1-19 Language Technology I 1

SLIDE 2

Discussion on Opinion Mining Application

SLIDE 3

Textmap: topic monitoring system

SLIDE 4

Tweetmotif: Topic summarization on Twitter - e.g. wikileak, parenting

SLIDE 5

What the trend: Trend monitoring - e.g. wikileak

SLIDE 6

Opinion gathering speed on Internet

  • WSJ published the article “Why Chinese Mothers Are Superior”, written by Amy Chua, on Jan 8th, 2011. Until Jan 18th:
  • 6,800 comments on WSJ

Keyword: Amy Chua

  • 3,490,000 on Google
  • 5,600 on twitter.com
  • 5,289 on wordpress.com

Keyword: parenting

  • 83,200,000 search results on Google
  • 1,620,000 from twitter.com
  • 502,000 from wordpress.com

SLIDE 7

A question from Quora

SLIDE 8

Proposals of Opinion Mining Application and Solution?

SLIDE 9

Discussion on Resource for Movie Review Summarization

SLIDE 10

Reviews on “Das Leben der Anderen” @ imdb

SLIDE 11

Reviews on “Das Leben der Anderen” @ imdb

SLIDE 12

Top 250 movies voted by imdb users

SLIDE 13

What resource and which features would you like to choose for OM tasks?

SLIDE 14

Experiment on KomParse

Making NPCs express their opinions emotionally

SLIDE 15

Gossip Galore in Rascalli

SLIDE 16

Hank in KomParse
SLIDE 17

Paul's solution

  • Unsupervised machine learning
  • Data: comments ranked by reviewers (1 ~ 10 stars)
  • Features
    – N-Gram Token Patterns
    – Dependency Patterns
  • Extra knowledge
    – WordNet
    – Negation expressions
  • Learning algorithm
    – Scoring system

SLIDE 18

Data Processing

  • Resource
    – IMDb (http://www.imdb.com/), an online movie storehouse
  • Interested in IMDb pages:
    – with name (actors, authors, directors etc.)
    – with title (movie title, movie recommendations from IMDb)
  • Containing the information:
    – Movie title
    – Review
    – Review title
    – Review date
    – Author name
    – Author origin (optional)
    – Recommendation of other users to this review (optional)
    – The score the author gave the reviewed movie, x/10 (optional)

SLIDE 19

Data Processing

<Record name="Paycheck (2003)" isA="Movie" type="IMDb user reviews">
  <Feature name="Recommend">0 out of 3</Feature>
  <Feature name="Time">25 December 2003</Feature>
  <Feature name="Author">ak2k</Feature>
  <Feature name="Review">A poor remake of Minority Report, with less talented actors. Promising plot line that wilted away in the first thirty minutes of the film. Interesting inductive journey and neat car chases, but nowhere close to my money's worth. I'd recommend to go and see LOR again.</Feature>
  <Feature name="Score">1/10</Feature>
  <Feature name="From">Illinois</Feature>
  <Feature name="Title">A perfect Christmas movie has about as much connection with reality as Santa Clause does.</Feature>
</Record>
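A record in this format can be read with a standard XML parser. The following is a minimal sketch, not the authors' tooling: it assumes the Record/Feature layout shown on the slide, uses an abridged copy of the review text, and the function name `parse_record` and the output keys are hypothetical.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the slide's record (the review text is shortened here).
RECORD_XML = """<Record name="Paycheck (2003)" isA="Movie" type="IMDb user reviews">
  <Feature name="Recommend">0 out of 3</Feature>
  <Feature name="Time">25 December 2003</Feature>
  <Feature name="Author">ak2k</Feature>
  <Feature name="Review">A poor remake of Minority Report, with less talented actors.</Feature>
  <Feature name="Score">1/10</Feature>
  <Feature name="From">Illinois</Feature>
</Record>"""


def parse_record(xml_text):
    """Turn one <Record> element into a plain dictionary (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    # Map each Feature's name attribute to its text content.
    fields = {
        feature.get("name").strip(): (feature.text or "").strip()
        for feature in root.iter("Feature")
    }
    record = {"movie": root.get("name"), "fields": fields}
    # The score is optional; when present it has the form "x/10".
    score = fields.get("Score")
    if score:
        given, _, scale = score.partition("/")
        record["rating"] = int(given)
        record["scale"] = int(scale)
    return record


record = parse_record(RECORD_XML)
```

The numeric `rating` is split out because the later slides use the score as the sentiment label of the review.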

SLIDE 20

Data Processing

Presumptions and observations:

  • Score indicates the sentiment of the review
  • Short reviews are preferred over long reviews
    – long reviews have a lot of objective parts about storyline, anecdotes etc.
    – short reviews contain only the opinion about the movie and are often expressed sentimentally
  • The sentiment classification of extreme reviews (very high or very low rating) is mostly unambiguous and clear, while mid-rated reviews have a lot of unclear sentences, such as “on the one hand … on the other”

SLIDE 21

Data Processing

  • Filtering out reviews
    – with more than 900 tokens
    – with a rating of 4, 5, 6, 7 or 8 out of 10
  • SCORE assignment to each sentence in the selected reviews
    – SCORE = Rank (1 ~ 10 stars)
    – SCORE + 1, if the sentence:
      • Is the first, second or last sentence
      • And contains keywords such as I, me, movie, film and this movie.
    – SCORE – 1, if the sentence:
      • Has a length > 100
      • And contains keywords such as imdb, you, your, spoiler and review etc.
  • The sentence with the highest SCORE from a review is selected.
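The filtering and sentence-scoring rules above can be sketched as follows. This is an illustration under stated assumptions, not the original implementation: the slide does not say how sentences are split or in which unit "length > 100" is measured, so the sketch uses a naive regex splitter and counts characters; the filter is read as dropping long and mid-rated reviews; all function names are hypothetical.

```python
import re

BOOST_KEYWORDS = {"i", "me", "movie", "film"}  # "this movie" is covered by "movie"
PENALTY_KEYWORDS = {"imdb", "you", "your", "spoiler", "review"}
MAX_TOKENS = 900
MID_RATINGS = {4, 5, 6, 7, 8}


def tokenize(text):
    """Lowercase word tokens; a stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9']+", text.lower())


def keep_review(text, rating):
    """Keep only short reviews with an extreme rating (assumed reading of the filter)."""
    return len(tokenize(text)) <= MAX_TOKENS and rating not in MID_RATINGS


def split_sentences(text):
    """Naive splitter on sentence-final punctuation (assumption)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def score_sentences(text, rating):
    """Return (SCORE, sentence) pairs following the +1 / -1 rules."""
    sentences = split_sentences(text)
    last = len(sentences) - 1
    scored = []
    for idx, sentence in enumerate(sentences):
        tokens = set(tokenize(sentence))
        score = rating  # SCORE starts at the review's star rating
        if idx in (0, 1, last) and tokens & BOOST_KEYWORDS:
            score += 1
        # Length is measured in characters here (assumption).
        if len(sentence) > 100 and tokens & PENALTY_KEYWORDS:
            score -= 1
        scored.append((score, sentence))
    return scored


def select_sentence(text, rating):
    """Pick the highest-scoring sentence; ties go to the earliest one."""
    return max(score_sentences(text, rating), key=lambda pair: pair[0])[1]


REVIEW = "The plot is thin. I loved this movie. The cast is fine."
best = select_sentence(REVIEW, 9)
```

In this toy review only the second sentence contains a boost keyword in a boosted position, so it is the one selected.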

SLIDE 22

Features – N-gram token pattern

Extracting uni-, bi- and trigrams out of every sentence from the sentimental corpus

  • For example: I absolutely loved this movie.
  • Unigrams:
    – i (NP), absolutely (RB), loved (VVD)
  • Bigrams:
    – i absolutely (NP RB), absolutely loved (RB VVD)
  • Trigrams:
    – i absolutely loved (NP RB VVD), absolutely loved this (RB VVD DT)
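The n-gram extraction can be sketched over an already tagged sentence. This is a minimal illustration, assuming the part-of-speech tags come from an external tagger; the tags for the first four tokens are taken from the slide, while the tag NN for "movie" and the function name `ngrams` are assumptions.

```python
def ngrams(tagged_tokens, n):
    """Return (words, tags) pairs for every window of n consecutive tokens."""
    result = []
    for start in range(len(tagged_tokens) - n + 1):
        window = tagged_tokens[start:start + n]
        words = " ".join(word for word, _ in window)
        tags = " ".join(tag for _, tag in window)
        result.append((words, tags))
    return result


# "I absolutely loved this movie." with tags as on the slide
# ("movie" -> NN is an assumed tag, it is not shown on the slide).
TAGGED = [
    ("i", "NP"),
    ("absolutely", "RB"),
    ("loved", "VVD"),
    ("this", "DT"),
    ("movie", "NN"),
]

unigrams = ngrams(TAGGED, 1)
bigrams = ngrams(TAGGED, 2)
trigrams = ngrams(TAGGED, 3)
```

The first two bigrams and trigrams produced this way correspond to the patterns listed on the slide.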

SLIDE 23

Features – Dependency Pattern

Some important information is missed in N-gram token patterns.

This is a funny super interesting and exciting movie.

  • funny and movie are not caught by an n-gram (n<6)

So, we include dependency patterns:

  • amod(movie-9, funny-4)

Tool: Stanford Dependency Parser
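Why an n-gram with n<6 misses the pair can be checked directly from the token positions in the dependency pattern. The sketch below only parses the textual pattern shown on the slide; it does not call the Stanford parser, and the helper names are hypothetical.

```python
import re

# Matches patterns of the form relation(head-index, dependent-index).
DEP_PATTERN = re.compile(r"^(\w+)\((\S+)-(\d+), (\S+)-(\d+)\)$")


def parse_dependency(text):
    """Split e.g. 'amod(movie-9, funny-4)' into its parts."""
    match = DEP_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a dependency pattern: {text!r}")
    relation, head, head_idx, dependent, dep_idx = match.groups()
    return {
        "relation": relation,
        "head": head,
        "head_index": int(head_idx),
        "dependent": dependent,
        "dependent_index": int(dep_idx),
    }


def ngram_length_needed(dep):
    """Smallest n for which a single n-gram covers both words."""
    return abs(dep["head_index"] - dep["dependent_index"]) + 1


SENTENCE = "This is a funny super interesting and exciting movie."
tokens = SENTENCE.rstrip(".").split()
dep = parse_dependency("amod(movie-9, funny-4)")
needed = ngram_length_needed(dep)
```

Here `needed` is 6, so no n-gram with n<6 spans both "funny" and "movie", while the single dependency pattern does.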

SLIDE 24

Extra Knowledge - Unigram patterns extended with WordNet

  • All 1-gram adjective and adverb patterns will be extended with WordNet. Both the synonyms and the antonyms are used.
  • For instance, 1-gram pattern “dry” can be extended with
    – Parched / arid / anhydrous / sere / dried-up
    – Wet / watery / damp / moist / humid / soggy
  • In our experiment, the antonyms/synonyms are the words which connect the original word with a maximum distance of two.

SLIDE 25

Extra Knowledge – Negations

  • Some elements in a sentence can change the sentiment of a word or phrase, such as
    – Subjunctive: I thought this movie is good.
    – Tense: This movie was good.
    – Negation: This film is not funny.
    – Quotation: My friend told me “this is the best movie ever, you have to watch it” but I didn’t like it.
  • In our work, the content in the quotation is removed
  • We consider only negations such as not, no, never and n’t, including
    – no wonder, not just, not to mention etc.
    – Restricted comparative sentences “not better as” “no more” etc.

SLIDE 26

Algorithm – Score of patterns

  • Each pattern has an iSCORE, including two sub-values
    – iSCORE_pos: the value of being positive
    – iSCORE_neg: the value of being negative
  • The iSCORE is initialized with the frequency of this pattern from the corpus

SLIDE 27

Algorithm – Data bias

  • Although “more” negative scored sentences are used, i.e. (1/10, 2/10, 3/10) vs. (9/10, 10/10), positive reviews are still twice the negative ones.
  • Assuming 1) there are X negative sentences and Y positive ones or the other way round, and 2) Y > X

    equalizer = Y / X
    BIAS = equalizer / (X + Y + Y - X)
    iSCORE_Y = iSCORE_Y / 2Y - BIAS

SLIDE 28

Algorithm – iSCORE

  • iSCORE = iSCORE_pos - iSCORE_neg
    – If the value of the iSCORE is positive, the computed polarity of the pattern is positive; if the value is negative, the polarity is negative
  • iSCORE = iSCORE * 2, if the pattern is binary (a bigram)
  • iSCORE = iSCORE * 3, if the pattern is triple (a trigram)
  • iSCORE = iSCORE * 2.5, if the pattern is a dependency pattern
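The weighting above can be sketched as a small function. This is an illustration only: "binary" and "triple" are read here as bigram and trigram patterns, unigrams are assumed to keep a weight of 1, and the "neutral" label for a score of exactly zero is an assumption, since the slide does not cover that case.

```python
# Multipliers from the slide; the unigram weight of 1.0 is an assumption.
PATTERN_WEIGHTS = {
    "unigram": 1.0,
    "bigram": 2.0,      # "binary" pattern
    "trigram": 3.0,     # "triple" pattern
    "dependency": 2.5,
}


def iscore(pos, neg, kind):
    """iSCORE = (iSCORE_pos - iSCORE_neg) scaled by the pattern-type weight."""
    return (pos - neg) * PATTERN_WEIGHTS[kind]


def polarity(score):
    """Sign of the iSCORE decides the polarity of the pattern."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"  # not defined on the slide (assumption)


# A bigram seen 5 times in positive and 2 times in negative sentences.
example = iscore(5, 2, "bigram")
```

Longer patterns get larger weights, so a trigram or dependency match moves the total further than a single token with the same frequency counts.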

SLIDE 29

Algorithm – iSCORE extended by WordNet

  • The synonyms have the same polarity as the word, while the antonyms have a reversed polarity.

SLIDE 30

Algorithm – iSCORE extended by WordNet

  • For instance, if Polarity(fast JJ) = positive, for words with the WordNet depth = 1
    – [iSCORE(swift), iSCORE(prompt), …] += 0.3
    – [iSCORE(slow)] += -0.3
  • for the words with a WordNet depth = 2
    – [iSCORE(swift), iSCORE(prompt), …] += 0.3 * 2 = 0.6
    – [iSCORE(slow)] += -0.3 * 2 = -0.6
    – [iSCORE(sluggish)] += -0.3
    – iSCORE(synonyms at the x-th depth) += 0.3 * ((max. depth + 1) – x)
    – iSCORE(antonyms at the x-th depth) += -0.3 * ((max. depth + 1) – x)
  • # 0.3 is an arbitrarily chosen value
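The depth-dependent increment can be sketched with a hand-written neighbour table instead of a live WordNet lookup. The table below is taken from the slide's "fast" example; treating "sluggish" as an antonym at depth 2, the `seed_sign` parameter for negative seed words, and all function names are assumptions of this sketch.

```python
STEP = 0.3  # arbitrarily chosen value, as stated on the slide


def wordnet_increment(depth, relation, max_depth, seed_sign=1, step=STEP):
    """Increment for a related word found at the given WordNet depth.

    Synonyms follow the seed word's polarity, antonyms reverse it.
    seed_sign=-1 for negative seed words is an assumed extension.
    """
    if not 1 <= depth <= max_depth:
        raise ValueError("depth must lie between 1 and max_depth")
    if relation not in ("synonym", "antonym"):
        raise ValueError("relation must be 'synonym' or 'antonym'")
    direction = 1 if relation == "synonym" else -1
    return seed_sign * direction * step * ((max_depth + 1) - depth)


def extend_scores(scores, related, max_depth, seed_sign=1):
    """Add the increments for all related words to a score table."""
    for word, (depth, relation) in related.items():
        scores[word] = scores.get(word, 0.0) + wordnet_increment(
            depth, relation, max_depth, seed_sign
        )
    return scores


# Words related to "fast" with their depth and relation (toy data).
RELATED_TO_FAST = {
    "swift": (1, "synonym"),
    "prompt": (1, "synonym"),
    "slow": (1, "antonym"),
    "sluggish": (2, "antonym"),
}

scores = extend_scores({}, RELATED_TO_FAST, max_depth=2)
```

With a maximum depth of two, direct neighbours receive twice the increment of words two steps away, which matches the 0.6 / 0.3 values in the example.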

SLIDE 31

Algorithm – iSCORE with negations

The scope is computed by extending every “adjacent” dependency starting with the negating word and skipping all long-distance dependencies

  • This movie isn't touching everyone, but it has some great scenes. 10/10
  • One constraint: negation only works over the one-step “neg” connection.

SLIDE 32

Algorithm – iSCORE with negations

  • This film is not that great.
    – iSCORE(great) += 1.0
    – iSCORE(not that great) += -3.0
    – iSCORE(cop(great-6, is-3)) += 2.5
    – iSCORE(neg(great-6, not-4)) += -2.5

SLIDE 33

Experiment 1

  • Data: 8,038 positive and 3,016 negative sentences
  • Features
    – N-grams (0<n<4) containing adjectives or adverbs
    – WordNet depth of two, initial value = 0.3
    – Negation
  • Incomplete Result
    – 75.60% ~ 95.12%

Random samples Right positives Right negatives False positives False negatives Unknowns

41 16 15 2 8

SLIDE 34

Experiment 1

  • Unknowns:
    – Analysis: statistical value between -5 and 5
    – Sentences with unknown polarity:
      • on the one hand … on the other
      • they have done it again.
      • i read somewhere that it is 'the next final destination'.
  • Counting unknowns as correct: accuracy of 95.12%
  • Counting unknowns as errors: accuracy of 75.60%
  • Without unknowns: accuracy of 93.94%

SLIDE 35

Experiment 2

  • Data: 4,395 positive and 1,733 negative sentences
  • Features
    – N-grams (0<n<4) containing adjectives or adverbs
    – WordNet depth of two, initial value = 0.3
    – Negation
    – Dep-patterns
  • Incomplete Result
    – 78.23% ~ 94.55%

Random samples True positives True negatives False positives False negatives Unknowns

147 80 35 12 24

SLIDE 36

Experiment 2

  • False negatives:
    – 90 minutes long and only one unnecessary scene.
    – i never thought pacino would top "godfather" but boy was i wrong.
    – it was very educational and informative.
    – one of the better ww2 escape movies.
    – it's an all-time fave and i'm happy to hear that it's out on video!
      • cop(happy-9, 'm-8) negative
    – i don't think there is one bad thing i could say about it.
      • too long negation scope
  • Observation: most objective sentences are rated negative
  • Counting unknowns as correct: accuracy of 94.55%
  • Counting unknowns as errors: accuracy of 78.23%
  • Without unknowns: accuracy of 93.49%

SLIDE 37

Experiment 3

  • Data: 4,890 positive and 1,864 negative sentences
    – Different data set
  • Features
    – Same as Experiment 2
  • Incomplete Result
    – 78.07% ~ 95.61%

Random samples True positives True negatives False positives False negatives Unknowns

114 51 38 1 4 20

SLIDE 38

Experiment 3

  • False positives:
    – john singleton has not, and probably never will make a good film.
    – the scenario is pretty interesting. (interesting has a negative iScore)
  • Unknowns:
    – This is a good movie but something is missing.
    – i love kate/evangeline, but i hope mister eko will be eaten by lostzilla!
    – don't think the movie is over just because the credits are rolling!
    – i have no objection to long movies, just short movies that seem long.
    – he uses violence against people who are mean to him. (objective)
  • Counting unknowns as correct: accuracy of 95.61%
  • Counting unknowns as errors: accuracy of 78.07%
  • Without unknowns: accuracy of 94.68%
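The three accuracy figures follow directly from the sample counts. As a worked check, the sketch below recomputes them from the Experiment 3 table (114 random samples: 51 true positives, 38 true negatives, 1 false positive, 4 false negatives, 20 unknowns); the function name and dictionary keys are hypothetical.

```python
def accuracies(tp, tn, fp, fn, unknown):
    """Accuracy under the three ways of treating unknown sentences."""
    total = tp + tn + fp + fn + unknown
    correct = tp + tn
    return {
        # Unknowns counted as correctly classified.
        "unknowns_correct": (correct + unknown) / total,
        # Unknowns counted as errors.
        "unknowns_error": correct / total,
        # Unknowns left out of the sample entirely.
        "without_unknowns": correct / (total - unknown),
    }


exp3 = accuracies(tp=51, tn=38, fp=1, fn=4, unknown=20)
percentages = {name: round(value * 100, 2) for name, value in exp3.items()}
```

The rounded values are 95.61, 78.07 and 94.68, which agree with the percentages reported for Experiment 3; the first two are the bounds given as the "incomplete result" range.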

SLIDE 39

Comments, questions, suggestions & Opinions?

SLIDE 40