The Game Theoretic Web



slide-1
SLIDE 1

Web (2.0) Mining: Analyzing Social Media

Anupam Joshi

  • Ebiquity Group, UMBC

joshi@cs.umbc.edu http://ebiquity.umbc.edu/

The Game Theoretic Web

slide-2
SLIDE 2

Social Media

Social media describes the online tools and platforms that people use to share opinions, insights, experiences, and perspectives - Wikipedia

  • Level of user participation and thought sharing across varied topics

Twitterment beta

slide-3
SLIDE 3

State of the Blogosphere

“Blogosphere is the collective term encompassing all blogs as a community or social network” - Wikipedia, Nov 06

slide-4
SLIDE 4

Knowing & Influencing your Audience

  • Your goal is to campaign for a presidential candidate
  • How can you track the buzz about him/her?
  • What are the relevant communities and blogs?
  • Which communities are supporters, which are skeptical, which are put off by the hype?
  • Is your campaign having an effect? The desired effect?
  • Which bloggers are influential with the political audience? Of these, which are already on board and which are lost causes?
  • To whom should you send details or talk to?
slide-5
SLIDE 5

Knowing & Influencing your Market

  • Your goal is to market Apple’s iPhone
  • How can you track the buzz about it?
  • What are the relevant communities and blogs?
  • Which communities are fans, which are suspicious, which are put off by the hype?
  • Is your advertising having an effect? The desired effect?
  • Which bloggers are influential in this market? Of these, which are already on board and which are lost causes?
  • To whom should you send details or evaluation samples?

slide-6
SLIDE 6

Opinions in Social Media

“Last night in Boston at a mid-dollar fundraiser John Edwards gave a fantastic speech. It was one of the loosest, most charismatic speeches I have seen him give. Many of the points and lines were from his standard stump speech but there was a definite confidence and sense of humor in his delivery. He also dwelled on the environment more than I have seen him do in other speeches. The environmental section kicked off with a good and true line that got a big ovation: “On global warming: Al Gore was right.” ...” [1]

[1] http://www.dailykos.com/storyonly/2007/10/4/71218/3740

Annotations: Expressed Opinions, Narrative, Reader’s Perspective

  • Opinions can influence the votes of others
slide-7
SLIDE 7

What is Influence?

  • Measurable Influence

The ability of a blogger to persuade another blogger to

  • Take action by means of creating a new post about the topic and commenting on the original.
  • Quote the blogger’s views in her post.
  • Link to the original post via trackbacks, comments.
  • Link to the blogger through other means like del.icio.us, digg, CiteULike, Connotea, etc.
  • Subscribe to the blog feed.
slide-8
SLIDE 8

Epidemic-based Influence Models

  • Linear Threshold Model

$\sum_{w} b_{w,v} \ge \theta_v$

where w ranges over the active neighbors of v, and $\theta_v$ is the intrinsic threshold for node v

  • Greedy Heuristic
  • Assign random $\theta_v$
  • Compute approximate influenced set
  • At each step, add the node that increases the marginal gain in the size of the influenced set

[Figure: example network of five nodes (1 to 5) with edge weights 2/5, 1/5, 2/5, 1/3, 1/3, 1/3, 1, 1/2, 1/2, 1; nodes are labeled Active or Inactive relative to their thresholds $\theta_v$.]

Other approaches: Latané, iRank

  • Kempe et al.
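The linear threshold rule and the greedy heuristic on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the Monte Carlo averaging over random thresholds, and the default of 200 trials are all assumptions made for the example.

```python
import random


def simulate_linear_threshold(weights, seeds, thresholds):
    """Run one linear threshold cascade.

    weights[v] maps each in-neighbor w of v to the edge weight b_{w,v}.
    A node v becomes active once the summed weight of its active
    in-neighbors reaches thresholds[v].
    """
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for v, incoming in weights.items():
            if v in active:
                continue
            total = sum(b for w, b in incoming.items() if w in active)
            if total >= thresholds[v]:
                active.add(v)
                changed = True
    return active


def expected_spread(weights, nodes, seeds, trials, rng):
    """Average influenced-set size over randomly drawn thresholds (assumed estimator)."""
    total = 0
    for _ in range(trials):
        thresholds = {v: rng.random() for v in nodes}
        total += len(simulate_linear_threshold(weights, seeds, thresholds))
    return total / trials


def greedy_seed_selection(weights, nodes, k, trials=200, seed=0):
    """Greedy hill climbing: repeatedly add the node with the largest estimated spread."""
    rng = random.Random(seed)
    chosen = []
    for _ in range(k):
        best_node, best_spread = None, -1.0
        for candidate in nodes:
            if candidate in chosen:
                continue
            spread = expected_spread(weights, nodes, chosen + [candidate], trials, rng)
            if spread > best_spread:
                best_node, best_spread = candidate, spread
        chosen.append(best_node)
    return chosen
```

With a toy graph where blog "a" links strongly into "b" and both link into "c", the heuristic would be expected to pick "a" first, since activating it tends to cascade furthest.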
slide-9
SLIDE 9

Limitations of Existing Approaches

  • Selected nodes may belong to different topics
  • Opinions or bias not considered
  • Information is spread throughout the network without considering social structure
  • Intrinsic threshold θv is based on a pseudorandom function
  • Static view of the network, no temporal evidence

First 10 nodes selected using Greedy Hill-Climbing Heuristic:

  • http://www.engadget.com
  • http://www.boingboing.net
  • http://www.dailykos.com
  • http://postsecret.blogspot.com
  • http://slashdot.org
  • http://www.albinoblacksheep.com
  • http://www.opinionjournal.com
  • http://profiles.blogdrive.com
  • http://godlessmom.blogspot.com
  • http://thinkprogress.org

Topics: TECH, POLITICS, DAILY/NEWS

slide-10
SLIDE 10

Finding Communities (and Feeds) That Matter

Top Advertising Feeds

1. Adrants » Marketing and Advertising News With Attitude
2. Adverblog: advertising and new media marketing
3. http://ad-rag.com
4. adfreak
5. AdJab
6. MIT Advertising Lab: future of advertising and advertising technology
7. AdPulp: Daily Juice from the Ad Biz
8. Advertising/Design Goodness

Related Tags: advertising marketing media news design

slide-11
SLIDE 11

Feeds That Matter

Top Feeds for “Politics”

Merged folders: “political”, “political blogs”

  • Talking Points Memo: by Joshua Micah Marshall
  • Daily Kos: State of the Nation
  • Eschaton
  • The Washington Monthly
  • Wonkette, Politics for People with Dirty Minds
  • http://instapundit.com/
  • Informed Comment
  • Power Line
  • AMERICAblog: Because a great nation deserves the truth
  • Crooks and Liars

Top Feeds for “Knitting”

Merged folders: “knitting blogs”

  • Yarn Harlot knitting
  • Wendy Knits!
  • See Eunny Knit!
  • the blue blog
  • Grumperina goes to local yarn shops and Home Depot
  • You Knit What??
  • Mason-Dixon Knitting
  • knit and tonic
  • Crazy Aunt Purl
  • http://www.lollygirl.com/blog/
slide-12
SLIDE 12

Influence in Communities

  • http://instapundit.com
  • http://michellemalkin.com/
  • http://dailykos.com
  • http://crooksandliars.com
  • http://volokh.com
  • http://rightwingnews.com

Communities detected using “Fast algorithm for detecting community structure in networks”, M. E. J. Newman

slide-13
SLIDE 13

Authority and Popularity

Authority

  • Contributes to influence
  • Influence may be subjective.
  • A source that is authoritative in one community could influence another community negatively. Within a community, an authoritative source would be influential.

Popularity

  • Authority and popularity are often treated equally
  • On blog search engines, authority is measured using inlinks, which is at best popularity
  • Popularity doesn’t mean influence
  • Dilbert is extremely popular but not influential

slide-14
SLIDE 14

Link Polarity/Bias

  • Linking alone is not an indicator of influence
  • Polarity can indicate the type of influence
  • Consistent negative/positive opinion over a period of time can indicate bias
  • Link polarity/citation signal can also be helpful in determining trust

[Figure: links between a Democrat Blog and a Republican Blog, labeled Strong Negative Opinion, Mildly Negative Opinion, and Strongly Positive Opinion.]
slide-15
SLIDE 15

Our Approach to Link Polarity

  • Shallow Sentiment Analysis
  • Calculate the number of positively oriented words (Np) and negatively oriented words (Nn) in the text window around the link
  • Apply stemming, basic canonicalization
  • Corpus includes simple bi-grams of the form “…”
  • Polarity = (Np – Nn) / (Np + Nn)
  • Denominator acts as a normalization mechanism
  • Natural Language Processing is [...], yet large-scale effects help!
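A minimal sketch of the polarity score, assuming a toy sentiment lexicon and a character window centered on the link position. The word lists, the function names, and the way the window is split around the link are all assumptions for illustration; the slides do not give the lexicon, and stemming is omitted here for brevity. The default window of 750 characters is the value reported on the window-size backup slide.

```python
import re

# Hypothetical mini-lexicons; the real corpus is not given in the slides.
POSITIVE_WORDS = {"applause", "thrilled", "gifted", "effective"}
NEGATIVE_WORDS = {"angry", "arrogant", "feckless", "corruption"}


def polarity_score(n_pos, n_neg):
    """Polarity = (Np - Nn) / (Np + Nn); defined as 0.0 when no opinion words occur."""
    if n_pos + n_neg == 0:
        return 0.0
    return (n_pos - n_neg) / (n_pos + n_neg)


def link_polarity(text, link_pos, window=750):
    """Count opinion words in a character window centered on the link (assumed layout)."""
    half = window // 2
    start = max(0, link_pos - half)
    snippet = text[start:link_pos + half].lower()
    tokens = re.findall(r"[a-z']+", snippet)
    n_pos = sum(1 for token in tokens if token in POSITIVE_WORDS)
    n_neg = sum(1 for token in tokens if token in NEGATIVE_WORDS)
    return polarity_score(n_pos, n_neg)
```

Because the denominator normalizes by the number of opinion words found, the score stays in the range -1 to 1 regardless of how long the window is.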

slide-16
SLIDE 16

Link Polarity Example

  • “Stephen Colbert's performance at the White House Correspondents' Association dinner has garnered him huge applause in the blogosphere and also on C-Span where it was shown more than once. Those of us who have been angry with Bush for quite some time because of his arrogant and feckless corruption of our country were even more thrilled to see and know that he had no recourse but to sit there and watch his aspirations for greatness be destroyed by a master of irony. This will be his legacy: I stand by this man. I stand by this man because he stands for things. Not only for things, he stands on things. Things like aircraft carriers and rubble and recently flooded city squares. And that sends a strong message, that no matter what happens to America, she will always rebound -- with the most powerfully staged photo ops in the world. We who have been watching Stephen Colbert eviscerate politicians that have come on his show knew he was a gifted comedian. But it took Saturday's dinner to demonstrate how incredibly effective the art form Colbert has chosen is for exposing the Potemkin Regime Bush and his henchmen have created. Rove and the right wing machine have no answer to the performance but to say "it bombed", "it wasn't funny", and to hope that by ignoring it, the caustic cleansing agent it has lobbed into their camp can be contained. Yet, the Republican spinmeisters are the masters of spin.” [2]

http://dailykos.com/storyonly/2006/4/30/1441/59811

Np = 8, Nn = 4; Polarity = (Np – Nn) / (Np + Nn) = 0.33

[2] http://www.pacificviews.org/weblog/archives/001989.html

slide-17
SLIDE 17

Propagating Influence

  • Based on work of Guha et al. [1] for modeling propagation of trust and distrust
  • Framework
  • $M_{ij}$ represents influence/bias from user i to j ($0 \le M_{ij} \le 1$)
  • $M_{ij}$ is initialized to the polarity from i to j.
  • Belief Matrix $M$ represents the initial set of known beliefs, and is sparse
  • Goal is to compute all unknown values in $M$
  • Belief Matrix after ith atomic propagation
  • $M_{i+1} = M_i \cdot C_i$
  • Combined Operator
  • $C_i = a_1 M + a_2 M^T M + a_3 M^T + a_4 M M^T$
  • $a = \{0.4, 0.4, 0.1, 0.1\}$ represents weighing factor

[1] Guha R, Kumar R, Raghavan P, Tomkins A. Propagation of trust and distrust. In: Proceedings of the 13th International Conference on World Wide Web (WWW), 2004.
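The combined operator and a single propagation step can be sketched in plain Python as below. This is an illustration under stated assumptions: matrices are nested lists, the helper names are invented for the example, and the operator is built once from the initial belief matrix (the slide does not say whether it is rebuilt at every step).

```python
def matmul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transpose(m):
    return [list(row) for row in zip(*m)]


def combined_operator(m, a=(0.4, 0.4, 0.1, 0.1)):
    """C = a1*M + a2*M^T*M + a3*M^T + a4*M*M^T, with the weights from the slide."""
    mt = transpose(m)
    terms = [m, matmul(mt, m), mt, matmul(m, mt)]
    n = len(m)
    return [[sum(w * t[i][j] for w, t in zip(a, terms)) for j in range(n)]
            for i in range(n)]


def propagate_once(belief, operator):
    """One atomic propagation step: M_{i+1} = M_i * C."""
    return matmul(belief, operator)


# Toy example: blog 0 links positively to blog 1, and blog 1 to blog 2.
initial = [[0, 1, 0],
           [0, 0, 1],
           [0, 0, 0]]
operator = combined_operator(initial)
after_one_step = propagate_once(initial, operator)
```

After one step the entry for the pair (0, 2), which had no direct link, becomes non-zero: the direct propagation term carries belief along the two-hop path.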

slide-18
SLIDE 18

Experiments

  • Domain
  • Political Blogosphere
  • Dataset from Buzzmetrics [2] provides post-post link structure over 14 million posts
  • Few off-the-topic posts help aggregation
  • Potential business value
  • Reference Dataset
  • Hand-labeled dataset from Lada Adamic et al. [3] classifying political blogs into right and left leaning bloggers
  • Timeframe: 2004 presidential elections, over 1500 blogs analyzed
  • Overlap of 300 blogs between Buzzmetrics and reference dataset
  • Goal
  • Classify the blogs in Buzzmetrics dataset as democrat and republican and compare with reference dataset

[2] Buzzmetrics – www.buzzmetrics.com
[3] Lada A. Adamic and Natalie Glance, "The political blogosphere and the 2004 US Election", in Proceedings of the WWW-2005 Workshop

slide-19
SLIDE 19

Evaluation Metrics

Confusion Matrix

  • Accuracy = 73%
  • True Positive Rate (Recall) = 78%
  • False Positive Rate (FP) = 31%
  • True Negative Rate (Recall) = 69%
  • False Negative Rate (FN) = 21%
  • Precision (R) = 75%
  • Precision (D) = 72%

Polarity Improves Classification by almost 26%
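For reference, the listed metrics follow from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; the function name and the choice of which party counts as the "positive" class are assumptions, and the counts in the usage note are made up (the slide's own confusion matrix is not recoverable from the text).

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard rates derived from a 2x2 confusion matrix."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "true_positive_rate": tp / (tp + fn),    # recall of the positive class
        "false_positive_rate": fp / (fp + tn),
        "true_negative_rate": tn / (tn + fp),    # recall of the negative class
        "false_negative_rate": fn / (fn + tp),
        "precision_positive": tp / (tp + fp),
        "precision_negative": tn / (tn + fn),
    }
```

For example, `binary_metrics(tp=8, fp=2, tn=6, fn=4)` uses hypothetical counts and yields an accuracy of 0.7 with a positive-class precision of 0.8.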

slide-20
SLIDE 20

Sample Data

  • Trust propagation compensates for initial incorrect polarity (DK – AT)
  • Trust propagation does not change correct polarity (AT – DK)
  • Trust propagation assigns correct polarity for non-existent direct links (AT – IP)
  • Numbers in [...] problematic (MM – AT)
  • Improve sentiment detection?
slide-21
SLIDE 21

MSM Classification Results

slide-22
SLIDE 22

Interesting Observations

  • 24 out of 27 sources classified “correctly”
  • guardian, foxnews, humaneventsonline, mediamatters
  • Main Outliers -- “thenation” and “bostonglobe”
  • Both left and right leaning blogs talk negatively about “nytimes” and “abcnews” and positively about “rawstory” and “examiner”

slide-23
SLIDE 23

Identifying Bias using KL Divergence

slide-24
SLIDE 24

Splogs spoil the party

[Figure: number of splogs (y-axis, 20 to 120) in the top 100 search results (x-axis, 5 to 100). Each line represents a query.]

Preliminary Work: Opinion Extraction

slide-25
SLIDE 25

SPLOGS!

slide-26
SLIDE 26
slide-27
SLIDE 27
  • SPLOGS BY NUMBERS
  • 75% of update pings (eBiquity 2006)
  • 20% of indexed Blogosphere (Umbria 2006)
  • 56% of update pings (eBiquity 2007)
slide-28
SLIDE 28
DATASETS

  • SPLOG-2005
  • Sampled Summer 2005 at Technorati
  • A search engine, so many splogs already removed
  • Labeled samples of 700 blogs and 700 splogs
  • Only blog homepages
  • SPLOG-2006
  • Sampled Oct 2006 at Weblogs.com
  • Labeled samples of 750 blogs and 750 splogs
  • Blog homepages + feeds

slide-29
SLIDE 29

EXPERIMENTAL SETUP

  • Binary feature encoding
  • Top 50K selected using frequency count
  • SVMs
  – Default parameters
  – Linear Kernel
  • No stemming or stop word elimination
  • Naïve Bayes
  • Tenfold cross-validation
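The first two bullets (frequency-based feature selection followed by binary encoding) can be sketched as below. The function names are hypothetical, and "frequency count" is interpreted here as total occurrences across the collection, which is an assumption; document frequency would be an equally plausible reading.

```python
from collections import Counter


def select_top_features(token_lists, k):
    """Keep the k most frequent tokens; ties are broken alphabetically for determinism."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [token for token, _ in ranked[:k]]


def encode_binary(tokens, vocabulary):
    """Binary encoding: 1 if the feature occurs at all in the document, else 0."""
    present = set(tokens)
    return [1 if feature in present else 0 for feature in vocabulary]
```

The resulting 0/1 vectors are the kind of input a linear-kernel SVM or a Naïve Bayes classifier would consume; the slide uses k = 50,000, while a toy example might use k = 2.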
slide-30
SLIDE 30

URL

2005 2006

slide-31
SLIDE 31

URL

  • 3, 4, 5 character grams from URL
  • Captures profitable contexts
  • Highly effective at ping streams
  • Supports an extremely low cost classifier

2005 2006
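Extracting 3-, 4-, and 5-character grams from a URL needs nothing beyond string slicing, which is part of why this feature space is cheap. The sketch below is illustrative; the function name and the example URL are hypothetical.

```python
def char_ngrams(text, sizes=(3, 4, 5)):
    """Return the set of character n-grams of the given sizes found in text."""
    grams = set()
    for n in sizes:
        for start in range(len(text) - n + 1):
            grams.add(text[start:start + n])
    return grams


# Hypothetical splog-like URL used only for illustration.
url_grams = char_ngrams("http://cheap-pills.example.com")
```

Since only the URL is needed, such features are available directly from a ping stream without fetching the page.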

slide-32
SLIDE 32

WORDS

2005 2006

slide-33
SLIDE 33

WORDS

2005 2006

  • Words (Text) on a Blog
  • Previously effective in topic classification
  • Captures profitable advertising contexts
  • Interesting Authentic Genre Observed
slide-34
SLIDE 34

OUTLINKS

2005 2006

slide-35
SLIDE 35

OUTLINKS

2005 2006

  • Out-links tokenized by non-alphabets
  • Similar to URL n-grams, likely more robust
  • Novel feature space
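Tokenizing an out-link by non-alphabetic characters amounts to a single regular-expression split. A minimal sketch, with a hypothetical function name and example URL, and with lowercasing added as an assumption:

```python
import re


def tokenize_outlink(url):
    """Split a URL on every run of non-alphabetic characters and lowercase the pieces."""
    return [token.lower() for token in re.split(r"[^A-Za-z]+", url) if token]
```

For instance, `tokenize_outlink("http://www.example.com/cheap-pills?id=42")` yields the tokens http, www, example, com, cheap, pills, and id.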
slide-36
SLIDE 36

ANCHORS

2005 2006

slide-37
SLIDE 37

ANCHORS

2005 2006

  • Anchor text tokenized into words
  • Subsumed by words, but obfuscation difficult
  • Capture personalization of publishing template
  • Novel feature space
slide-38
SLIDE 38
Splog software ?!

slide-39
SLIDE 39

slide-40
SLIDE 40

HTML TAGS

2005 2006

slide-41
SLIDE 41

HTML TAGS

2005 2006

  • Use HTML Tags – stylistic information
  • Capture signatures of splog software
  • Fully language independent
  • Novel feature space
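One way to obtain tag-based features is to collect the tag names that occur on a page while ignoring its text, which is what makes the feature space language independent. The sketch below uses Python's standard `html.parser`; the class and function names are hypothetical, and treating each distinct tag name as one binary feature is an assumption.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records the name of every opening tag encountered."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def html_tag_features(html):
    """Return the set of tag names used on a page, ignoring all text content."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return set(collector.tags)
```

Pages generated by the same publishing template or splog tool would tend to share a tag profile, which is the signature the slide refers to.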
slide-42
SLIDE 42

META-PING SYSTEM

[Figure: a Ping Stream processed by the components BLOG IDENTIFIER, LANGUAGE IDENTIFIER, REGULAR EXPRESSIONS, BLACKLISTS/WHITELISTS, URL FILTERS, HOMEPAGE FILTERS, and FEED FILTERS. Arrow label: Increasing Cost.]

slide-43
SLIDE 43

THE GAME THEORETIC WEB

slide-44
SLIDE 44

Quoth Peter Norvig

  • “The other thing that I hadn't really thought about when we started this all is how kind of game theoretic the whole thing is. At first we thought of ourselves as this observer of the Web. That the Web was out there and we made a copy of it and indexed it and if people wanted they could come and access that index. But it was just a reflection of the Web out there. And now we understand that we're co-evolving with the Web and that when we make a move it changes the Web and when the Web changes we change and going back and forth. And so all the search engine optimizers and so on are watching and what we do and we watch what they do and the Web is the interaction between us. And that is something I hadn't even considered before we saw it happening.”

slide-45
SLIDE 45

This is true of Social Media as well

  • If I know that you are out there, trying to infer my opinions (or prevent me from spamming) then I will actively work to defeat that. Since the content is user generated, I can do that fairly quickly.
  • Spam adaptation is a classic example.
slide-46
SLIDE 46
ADAPTIVE CONTEXT

  • Change in distribution in feature space
  • Concept Drift – Seasonal, seen in both splogs and blogs
  • Adversarial Scenario – seen in splogs
  • Concept Description needs to be updated

f1,f2,f3 ..fm

P(O(x)/splog(x)) P(splog(x)/O(x))

slide-47
SLIDE 47

ENSEMBLE INTUITION

  • Stream of unlabeled instances (drifting)
  • Base classifiers with potentially independent feature spaces
  • Is an ensemble (probabilistic committee) of the catalogue more robust to drift?
  • Are instances classified by the ensemble effective to retrain base classifiers (semi-supervised learning)?
  • Motivated by co-training
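The committee idea can be sketched as follows: each base classifier returns a splog probability, the committee averages them, and only confidently labeled instances are passed back as training data. Everything specific here is an assumption for illustration: the averaging rule, the 0.9 confidence cut-off, the function names, and the toy classifiers in the usage note. The slides do not state how the committee combines votes.

```python
def committee_probability(probabilities):
    """Probabilistic committee modeled as a plain average (assumed combination rule)."""
    return sum(probabilities) / len(probabilities)


def select_for_retraining(instances, base_classifiers, confidence=0.9):
    """Label unlabeled instances with the committee and keep only confident ones.

    Each base classifier is a callable returning P(splog) for an instance.
    Returns (instance, label) pairs, with label 1 for splog and 0 for blog.
    """
    labeled = []
    for instance in instances:
        p = committee_probability([clf(instance) for clf in base_classifiers])
        if p >= confidence:
            labeled.append((instance, 1))
        elif p <= 1 - confidence:
            labeled.append((instance, 0))
    return labeled
```

As a usage example, two toy classifiers that look at different cues (say, one keyed on the word "pills" and one on "cheap") would jointly label "cheap pills" as a splog, while an instance they disagree on is left out of the retraining set.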
slide-48
SLIDE 48

ENSEMBLE INTUITION

[Figure: unlabeled instances are classified by the base classifiers and by the ensemble committee (probabilistic); the classified instances are used to retrain the base classifiers, giving updated classifiers.]

slide-49
SLIDE 49

POTENTIAL TO ADAPT

[Figure: one panel per classifier: URL, Anchor, Chargram, Outlink, Tag, Wordgrams, Words.]

slide-50
SLIDE 50

EXPERIMENTAL SETUP

  • A catalog of seven classifiers
  • SPLOG-2005 as base labeled dataset
  • SPLOG-2006 as evaluation stream
  • 10K Top Features
  • SVM based learning
  • SPLOG-2006 separated out into unlabeled stream and test set (3-fold)
  • F-1 performance metric evaluation
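For reference, the F-1 measure is the harmonic mean of precision and recall. A small sketch using the standard definition; the function name and the convention of returning 0.0 for undefined cases are assumptions.

```python
def f1_score(tp, fp, fn):
    """F1 = 2 * precision * recall / (precision + recall)."""
    if tp == 0:
        # Precision and recall are both zero (or undefined) without true positives.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

With hypothetical counts of 8 true positives, 2 false positives, and 4 false negatives, precision is 0.8, recall is about 0.67, and F-1 is about 0.73.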
slide-51
SLIDE 51

RESULTS – WORD DRIFT

slide-52
SLIDE 52

RESULTS – ALL CLASSIFIERS

slide-53
SLIDE 53

Conclusion

  • Using [...] we can develop an accurate model for influence, bias and trust on the Blogosphere.
  • We apply this framework on real-world data and describe techniques for identifying influence on the Blogosphere.
  • Splogs are a big issue – we have developed efficient techniques to detect them in near real time.
  • Does the Game Theoretic Nature of this system raise fundamental new challenges for Data Mining?

slide-54
SLIDE 54

Backup Slides

slide-55
SLIDE 55

Generative Models for Blogosphere

Graphs are everywhere... and so are Power laws!!

[Figures: Internet Mapping Project [lumeta.com]; Friendship Network [Moody ‘01]]

In simple words, power law can be explained by the “rich get richer” phenomenon OR “20% of the population holds 80% of the wealth”.

Considering web as a graph:

  • Scale-free network: structure and properties independent of network size
  • Few high connectivity nodes (hubs)

Properties of interest (graph theory): average degree of node, degree distribution, degree correlation, distribution of strongly/weakly connected components, clustering coefficient and reciprocity

slide-56
SLIDE 56
Generative Models for Blogosphere

  • Reduce time to generate data
  • crawling the blogosphere over a few weeks
  • sampling the right blogs to get a representative sample
  • Reduce time in preprocessing and data cleaning
  • removing links pointing outside the dataset, outside the time frame
  • splog removal [1]
  • Generate graphs of different properties/sizes
  • average degree of node, degree distributions
  • Testing of new algorithms for blog graphs
  • e.g. spread of influence in blogosphere [2], community detection [3]
  • Extrapolation
  • how will fast growth affect the blogosphere properties?
  • how does this affect the connected components?

[1] Kolari et al., “SVMs for the blogosphere: Blog identification and splog detection,” in AAAI Spring Symposium on Computational Approaches to Analyzing Weblogs, 2006.
[2] Java et al., “Modeling the spread of influence on the blogosphere,” tech. rep., University of Maryland, Baltimore County, March 2006.
[3] Lin et al., “Discovery of Blog Communities based on Mutual Awareness”

slide-57
SLIDE 57

Existing Approaches

Preferential Attachment: The likelihood of linking to a popular website is higher

  • Two level network: blog and post level
  • Inlinks and outlinks to and from posts
  • NEED to model blogger interactions
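Preferential attachment can be illustrated with a short simulation in the style of the Barabási–Albert model, where each new node adds one link and picks its target with probability proportional to the target's current degree. This is a generic textbook sketch, not the blog-specific model proposed in the talk; the function name, the single link per new node, and the fixed random seed are assumptions.

```python
import random


def preferential_attachment_graph(num_nodes, seed=0):
    """Grow a graph in which each new node links to one existing node.

    Targets are drawn from a list that holds every node once per link end,
    so a node's chance of being chosen is proportional to its degree.
    """
    rng = random.Random(seed)
    edges = [(1, 0)]        # start from a single link between nodes 1 and 0
    endpoints = [0, 1]      # one entry per link end
    for new_node in range(2, num_nodes):
        target = rng.choice(endpoints)
        edges.append((new_node, target))
        endpoints.extend([new_node, target])
    return edges
```

Counting how often each node appears as a target in the returned edge list typically shows a few heavily linked hubs and many nodes with a single link, which is the "rich get richer" effect.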

slide-58
SLIDE 58

Model Parameters

1. Probability of random reads (rR)
2. Probability of randomly selecting writer (rW)
3. Probability that new node does not link to the existing network (pD)
4. Growth exponent (g) – how many links should be added every step?

slide-59
SLIDE 59

Proposed Model

1. Add new blog node
2. Select writer
3. Writers read blog posts, write posts

  • Reciprocal links
  • Strongly connected components: subset of nodes having directed path from every node to every other node
  • Weakly connected components
  • Information flow

slide-60
SLIDE 60

Properties of Simulated Blog Graphs

slide-61
SLIDE 61
slide-62
SLIDE 62

Effect of text window size

  • Optimal window size is 750 characters for our experiments
  • Small window size – Non-opinionated phrases
  • Large window size – Analysis of non-related text
  • Specific to our experiments, numbers may not be generalized
slide-63
SLIDE 63

Effect of atomic propagation parameters

  • X-axis bitset = {direct trust, co-citation, transpose trust and trust coupling} = {0001 - 1111}
  • Each parameter set to either 0 or its optimal value
  • Collective influence of all parameters helps!
slide-64
SLIDE 64

Atomic Propagation

  • Direct Propagation
  • Given: A trusts B and B trusts C
  • Implies: A trusts C
  • Operator: $M$
  • Co-citation
  • Given: A trusts B and C, D trusts C
  • Implies: D trusts B
  • Operator: $M^T M$
slide-65
SLIDE 65

Atomic Propagation Contd…

  • Transpose Trust
  • Given: A trusts B and C trusts B
  • Implies: C trusts A, A trusts C
  • Operator: $M^T$
  • Trust Coupling
  • Given: D trusts A, A trusts C and B trusts C
  • Implies: D trusts B
  • Operator: $M M^T$

slide-66
SLIDE 66

Atomic Propagation contd…

  • Combined Operator
  • $C_i = a_1 M + a_2 M^T M + a_3 M^T + a_4 M M^T$
  • $a_i = \{0.4, 0.4, 0.1, 0.1\}$ represents weighing factor
  • Belief Matrix after ith atomic propagation
  • $M_{i+1} = M_i \cdot C_i$
  • We perform propagations till “convergence” (till the new iteration does not change values in M above “threshold”)

slide-67
SLIDE 67

Separating Blog Wheat from Blog Chaff

Data cleaning for

  • Splog removal
  • Post content identification

Processing stages shown:

  • Non English Blog removal
  • Collection Parsing
  • Splog Detection
  • Title and Content Extraction

“BlogVox: Separating Blog Wheat from Blog Chaff”, IJCAI 2007 Analytics of Noisy and Unstructured Text

slide-68
SLIDE 68

Data Cleaning: Splogs

  • Host Ads
  • Index affiliates, Promote PageRank
  • Plagiarized content

Preliminary Work: Opinion Extraction

slide-69
SLIDE 69

Effect of Splogs

[Figure: number of splogs (y-axis, 20 to 100) against result rank (x-axis, 5 to 100). Distribution of splogs for ‘spam terms’ in TREC corpus; queries shown: pregnancy, insurance, discount.]

"Blog Track Open Task: Spam Blog Classification", TREC 2006

slide-70
SLIDE 70

Data Cleaning: Content Identification

  • Baseline Heuristic
  • SVM Method
slide-71
SLIDE 71


slide-72
SLIDE 72

BlogVox Opinion Extraction System

  • TREC 06: Finding opinionated posts, either positive or negative, about a query
  • 2006 TREC Blog corpus:
  • 80K blogs
  • 300K posts
  • 50 test queries
  • BlogVox opinion extraction system
  • Document and sentence level scorers
  • Combined scores using an SVM meta-learner
  • Data cleaning: splogs and post identification

slide-73
SLIDE 73

Brand Monitoring / Business Analytics

Blog Analytics / Market Intelligence: Buzz, Opinions, Influence, Reputation, Competition, Financial Analyst

Limitations

  • Proprietary
  • Some companies conduct extensive manual research

slide-74
SLIDE 74

Top Cited Media Sources

Top MSM Sources on the Blogosphere

slide-75
SLIDE 75

Propagating Influence

  • Trust-only
  • Ignore distrust (negative polarities) completely
  • Final Belief Matrix = $M_k$, $M_0 = T$
  • (K: Number of atomic propagations till convergence)
  • One-step Distrust
  • Distrust propagates single step while trust propagates repeatedly
  • Final Belief Matrix = $M_k \cdot (T - D)$, $M_0 = T$
  • (K: Number of atomic propagations till convergence)
  • Propagated Distrust
  • Treat distrust and trust equivalent
  • Final Belief Matrix = $M_k$, $M_0 = T - D$
  • (K: Number of atomic propagations till convergence)
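The three schemes differ only in which matrix seeds the propagation and whether distrust is applied once at the end. A minimal sketch, assuming trust matrix T and distrust matrix D as nested lists, a fixed number of steps k in place of a convergence test, and the recurrence $M_{i+1} = M_i \cdot C$ with C built from the starting matrix. All function names are hypothetical.

```python
def matmul(a, b):
    n = len(a)
    return [[sum(a[i][x] * b[x][j] for x in range(n)) for j in range(n)]
            for i in range(n)]


def transpose(m):
    return [list(row) for row in zip(*m)]


def subtract(a, b):
    return [[x - y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def combined_operator(m, a=(0.4, 0.4, 0.1, 0.1)):
    """C = a1*M + a2*M^T*M + a3*M^T + a4*M*M^T."""
    mt = transpose(m)
    terms = [m, matmul(mt, m), mt, matmul(m, mt)]
    n = len(m)
    return [[sum(w * t[i][j] for w, t in zip(a, terms)) for j in range(n)]
            for i in range(n)]


def propagate(m0, k):
    """Return M_k, applying M_{i+1} = M_i * C for k steps (C built from M_0)."""
    operator = combined_operator(m0)
    belief = m0
    for _ in range(k):
        belief = matmul(belief, operator)
    return belief


def trust_only(trust, distrust, k):
    return propagate(trust, k)                      # distrust ignored entirely


def one_step_distrust(trust, distrust, k):
    return matmul(propagate(trust, k), subtract(trust, distrust))


def propagated_distrust(trust, distrust, k):
    return propagate(subtract(trust, distrust), k)  # distrust treated like trust
```

When D is all zeros the first and third schemes coincide, which makes a convenient sanity check for an implementation.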
slide-76
SLIDE 76

SPAMDEXING

[Figure: three panels labeled (i), (ii), (iii).]

slide-77
SLIDE 77

WORDGRAMS

2005 2006

slide-78
SLIDE 78

WORDGRAMS

2005 2006

  • Word-2-grams, 2 adjacent words
  • Shallow NLP technique to tackle word salad
  • Word salad less common in web spam (TFIDF)
  • Word-x-gram features, exponential with x
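Word 2-grams are simply pairs of adjacent words. A short sketch, assuming whitespace tokenization and lowercasing (neither is specified on the slide); the function name is hypothetical.

```python
def word_ngrams(text, n=2):
    """Return the list of n adjacent-word grams in text, joined by single spaces."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
```

For example, `word_ngrams("Cheap pills online now")` gives "cheap pills", "pills online", and "online now". The number of possible grams grows exponentially with n, which is the cost the last bullet points out.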
slide-79
SLIDE 79

CHARACTERGRAMS

2005 2006

slide-80
SLIDE 80

CHARACTERGRAMS

2005 2006

  • 3, 4, 5 character grams from blog content
  • Can capture character salad (e.g. p1lls)
  • Feature selection important