Web (2.0) Mining: Analyzing Social Media
Anupam Joshi
- Ebiquity Group, UMBC
joshi@cs.umbc.edu http://ebiquity.umbc.edu/
Twitterment beta
“Blogosphere is the collective term encompassing all blogs as a community or social network” (Wikipedia, Nov 06)
blogs?
suspicious, which are put off by the hype?
desired effect?
market? Of these, which are already
evaluation samples?
John Edwards gave a fantastic speech. It was
seen him give. Many of the points and line were from his standard stump speech but there was a definite confidence and sense of humor in his delivery. He also dwelled on the environment more than I have seen him do in other speeches. The environmental section kicked off with a good and true line that got a big ovation: “On global warming: Al Gore was right.” ...” [1]
http://www.dailykos.com/storyonly/2007/10/4/71218/3740
Expressed Opinions, Narrative, Reader’s Perspective
w is the active neighbor of v,
θv: intrinsic threshold for a node
increases the marginal gain in the size
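[Figure: example graph with nodes 1, 2, 3, 5 and edge weights 2/5, 1/5, 2/5, 1/3, 1/3, 1/3, 1, 1/2, 1/2, 1; nodes are marked Active or Inactive against the threshold θv]
Other approaches: Latané, iRank, pseudorandom function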
First 10 nodes selected using Greedy Hill-Climbing Heuristic
http://www.engadget.com
http://www.boingboing.net
http://www.dailykos.com
http://postsecret.blogspot.com
http://slashdot.org
http://www.albinoblacksheep.com
http://www.opinionjournal.com
http://profiles.blogdrive.com
http://godlessmom.blogspot.com
http://thinkprogress.org
TECH, POLITICS, DAILY/NEWS
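The greedy hill-climbing selection can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the standard linear threshold rule (a node v activates once the summed weight of its active neighbors w reaches θv), and it uses fixed thresholds so the result is reproducible, whereas thresholds are usually drawn at random. The names `spread` and `greedy_seeds` and the toy graph are hypothetical.

```python
from typing import Dict, Hashable, List, Set

# graph[v][w] = weight of the influence that neighbor w exerts on v (assumed layout)
Graph = Dict[Hashable, Dict[Hashable, float]]


def spread(graph: Graph, thresholds: Dict[Hashable, float], seeds: Set[Hashable]) -> Set[Hashable]:
    """Return every node that ends up active when `seeds` start active."""
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for v, neighbors in graph.items():
            if v in active:
                continue
            # Sum the weights of v's currently active neighbors w.
            total = sum(wt for w, wt in neighbors.items() if w in active)
            if total >= thresholds[v]:
                active.add(v)
                changed = True
    return active


def greedy_seeds(graph: Graph, thresholds: Dict[Hashable, float], k: int) -> List[Hashable]:
    """Pick k seeds, each time adding the node with the largest marginal gain."""
    seeds: List[Hashable] = []
    for _ in range(k):
        base = len(spread(graph, thresholds, set(seeds)))
        best, best_gain = None, -1
        for v in graph:
            if v in seeds:
                continue
            gain = len(spread(graph, thresholds, set(seeds) | {v})) - base
            if gain > best_gain:
                best, best_gain = v, gain
        if best is None:
            break
        seeds.append(best)
    return seeds


toy_graph = {"a": {}, "b": {"a": 1.0}, "c": {"a": 0.5, "b": 0.5}, "d": {}}
toy_thresholds = {"a": 0.5, "b": 0.5, "c": 0.8, "d": 0.5}
print(greedy_seeds(toy_graph, toy_thresholds, 2))
```

Each round re-simulates the spread for every candidate, which is simple but slow; it is only meant to show what "marginal gain in the size" of the active set means.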
Top Advertising Feeds
1. Adrants » Marketing and Advertising News With Attitude
2. Adverblog: advertising and new media marketing
3. http://ad-rag.com
4. adfreak
5. AdJab
6. MIT Advertising Lab: future of advertising and advertising technology
7. AdPulp: Daily Juice from the Ad Biz
8. Advertising/Design Goodness
Related Tags: advertising, marketing, media, news, design
Merged folders: “political”, “political blogs”
Marshall
deserves the truth
Merged folders: “knitting blogs”
Home Depot
http://instapundit.com http://michellemalkin.com/ http://dailykos.com http://crooksandliars.com http://volokh.com http://rightwingnews.com
Communities detected using “Fast algorithm for detecting community structure in networks”, M. E. J. Newman
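As a rough illustration of the idea behind that algorithm, the sketch below greedily merges the pair of communities that most increases modularity Q, using ΔQ = 2(e_ij - a_i a_j). It is a simplification under stated assumptions: the graph is undirected and unweighted, and merging stops as soon as no merge improves Q, whereas Newman's paper builds the full dendrogram and cuts it at the maximum of Q. The function name `greedy_modularity` and the example edges are hypothetical.

```python
from typing import Dict, Hashable, List, Set, Tuple


def greedy_modularity(edges: List[Tuple[Hashable, Hashable]]) -> List[Set[Hashable]]:
    """Agglomerative community detection on an undirected edge list."""
    m = len(edges)
    degree: Dict[Hashable, int] = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    comm = {n: n for n in degree}            # node -> community id
    members = {n: {n} for n in degree}       # community id -> nodes
    while True:
        # a[c]: fraction of edge ends attached to community c
        a = {c: sum(degree[n] for n in ns) / (2 * m) for c, ns in members.items()}
        # Count edges running between each pair of distinct communities.
        between: Dict[frozenset, int] = {}
        for u, v in edges:
            cu, cv = comm[u], comm[v]
            if cu != cv:
                key = frozenset((cu, cv))
                between[key] = between.get(key, 0) + 1
        best_pair, best_dq = None, 0.0
        for key, count in between.items():
            ci, cj = tuple(key)
            dq = 2 * (count / (2 * m) - a[ci] * a[cj])
            if dq > best_dq:
                best_pair, best_dq = (ci, cj), dq
        if best_pair is None:                # no merge increases Q
            break
        ci, cj = best_pair
        for n in members[cj]:
            comm[n] = ci
        members[ci] |= members.pop(cj)
    return list(members.values())


# Two triangles joined by a single edge.
example_edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
print(greedy_modularity(example_edges))
```

Recomputing the pair counts on every pass keeps the code short; the published algorithm updates them incrementally, which is what makes it fast.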
community could influence another community negatively. Within a community, an authoritative source would be influential.
treated equally
is measured using inlinks, which is at best popularity
not influential
Democrat Blog, Republican Blog
Strong Negative Opinion, Mildly Negative, Strongly Positive
dinner has garnered him huge applause in the blogosphere and also on C-Span where it was shown more than once. Those of us who have been angry with Bush for quite some time because of his arrogant and feckless corruption of our country were even more thrilled to see and know that he had no recourse but to sit there and watch his aspirations for greatness be destroyed by a master of irony. This will be his legacy: I stand by this man. I stand by this man because he stands for things. Not only for things, he stands on things. Things like aircraft carriers and rubble and recently flooded city squares. And that sends a strong message, that no matter what happens to America, she will always rebound -- with the most powerfully staged photo ops in the world. We who have been watching Stephen Colbert eviscerate politicians that have come on his show knew he was a gifted comedian. But it took Saturday's dinner to demonstrate how incredibly effective the art form Colbert has chosen is for exposing the Potemkin Regime Bush and his henchmen have created. Rove and the right wing machine have no answer to the performance but to say "it bombed", "it wasn't funny", and to hope that by ignoring it, the caustic cleansing agent it has lobbed into their camp can be contained. Yet, the Republican spinmeisters are the masters of spin.” [2]
http://dailykos.com/storyonly/2006/4/30/1441/59811
Np = 8, Nn = 4; Polarity = (Np - Nn) / (Np + Nn) = 0.33
[2] http://www.pacificviews.org/weblog/archives/001989.html
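The polarity score for the passage above follows directly from the counts of positive (Np) and negative (Nn) words. A minimal sketch, assuming a simple word-list lookup: the word sets passed to `count_sentiment` are hypothetical stand-ins for whatever sentiment lexicon was actually used, and the function names are illustrative.

```python
from typing import Iterable, Set, Tuple


def count_sentiment(text: str, positive: Set[str], negative: Set[str]) -> Tuple[int, int]:
    """Count positive and negative words in the text around a link."""
    words = [w.strip('.,;:!?"\'()').lower() for w in text.split()]
    n_pos = sum(1 for w in words if w in positive)
    n_neg = sum(1 for w in words if w in negative)
    return n_pos, n_neg


def link_polarity(n_pos: int, n_neg: int) -> float:
    """Polarity = (Np - Nn) / (Np + Nn); 0.0 when no sentiment words occur."""
    total = n_pos + n_neg
    if total == 0:
        return 0.0
    return (n_pos - n_neg) / total


# The slide's own counts: Np = 8, Nn = 4.
print(round(link_polarity(8, 4), 2))
```

The score ranges from -1 (only negative words) to +1 (only positive words); returning 0.0 for an empty window is a design choice of this sketch, not something the slides specify.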
sparse
[1] Guha R, Kumar R, Raghavan P, Tomkins A. Propagation of trust and distrust.
posts
right and left leaning bloggers
compare with reference dataset
[2] Lada A. Adamic and Natalie Glance, "The political blogosphere and the 2004 US Election", in Proceedings of the WWW-2005 Workshop
Buzzmetrics – www.buzzmetrics.com
78%
31%
=69%
21%
Polarity Improves Classification by almost 26%
“correctly”
humaneventsonline, mediamatters
“bostonglobe”
talk negatively about “nytimes” and “abcnews” and positively about “rawstory” and “examiner”
[Figure: Number of splogs in the top 100 search results. Each line represents a query.]
Preliminary Work: Opinion Extraction
– Default parameters
– Linear kernel
[Figure: filtering pipeline over the Ping Stream, arranged by Increasing Cost. Components: REGULAR EXPRESSIONS, BLACKLISTS / WHITELISTS, URL FILTERS, HOMEPAGE FILTERS, FEED FILTERS, BLOG IDENTIFIER, LANGUAGE IDENTIFIER]
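One way to read the pipeline figure is as a cascade: cheap checks run first, and a URL only reaches the more expensive filters if the cheap ones cannot decide. The sketch below illustrates that pattern only. The filter order, the regular expression, the verdict labels, and the function names are all assumptions for illustration, not the system's actual rules.

```python
import re
from typing import Callable, List, Optional, Set, Tuple
from urllib.parse import urlparse

# A filter returns a verdict string, or None when it cannot decide.
Filter = Tuple[str, Callable[[str], Optional[str]]]


def host(url: str) -> str:
    return urlparse(url).netloc.lower()


def build_filters(whitelist: Set[str], blacklist: Set[str]) -> List[Filter]:
    """Return filters ordered from cheapest to most expensive (assumed order)."""
    # Hypothetical pattern for spam-looking URLs.
    spam_re = re.compile(r"(free|cheap|discount)[-\w]*\d{3,}", re.IGNORECASE)
    return [
        ("whitelist", lambda u: "blog" if host(u) in whitelist else None),
        ("blacklist", lambda u: "splog" if host(u) in blacklist else None),
        ("regex", lambda u: "splog" if spam_re.search(u) else None),
    ]


def classify(url: str, filters: List[Filter],
             fallback: Callable[[str], str] = lambda u: "unknown") -> Tuple[str, str]:
    """Run filters in order and stop at the first one that decides."""
    for name, check in filters:
        verdict = check(url)
        if verdict is not None:
            return verdict, name
    # An expensive step (e.g. fetching the homepage or feed) would go here.
    return fallback(url), "fallback"


filters = build_filters({"dailykos.com"}, {"spam.example"})
print(classify("http://foo.example/cheap-pills-12345", filters))
```

Ordering by cost means most pings are settled by a set lookup or a regex match, so the costly homepage and feed checks run on a small remainder.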
[Figure: base classifiers f1, f2, f3 .. fm classify unlabeled instances; an ensemble committee (probabilistic) is used to retrain them into updated classifiers]
URL Anchor Chargram Outlink Tag Wordgrams Words
Graphs are everywhere.. and so are power laws!!
Internet Mapping Project [lumeta.com]
Friendship Network [Moody ‘01]
In simple words, a power law can be explained by the “rich get richer” phenomenon, or “20% of the population holds 80% of the wealth”.
Considering the web as a graph:
Scale-free network: structure and properties independent of network size
Few high-connectivity nodes (hubs)
Properties of interest (graph theory)
Average degree of node, degree distribution, degree correlation, distribution of strongly/weakly connected components, clustering coefficient and reciprocity
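A few of these properties are easy to compute directly from a directed edge list. The sketch below covers average degree, in-degree distribution, and reciprocity (the fraction of links whose reverse link also exists). The function name and the toy edge list are illustrative; a real blog graph would also need self-links and duplicate links handled.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Tuple


def graph_properties(edges: Iterable[Tuple[Hashable, Hashable]]) -> Dict[str, object]:
    """Compute simple statistics for a directed graph given as (src, dst) pairs."""
    edge_set = set(edges)                      # drop duplicate links
    nodes = {n for edge in edge_set for n in edge}
    in_degree = Counter(dst for _, dst in edge_set)
    # In a directed graph, average in-degree equals average out-degree.
    average_degree = len(edge_set) / len(nodes)
    # Maps an in-degree value to the number of nodes that have it.
    in_degree_distribution = Counter(in_degree.get(n, 0) for n in nodes)
    reciprocal = sum(1 for (u, v) in edge_set if (v, u) in edge_set)
    return {
        "average_degree": average_degree,
        "in_degree_distribution": dict(in_degree_distribution),
        "reciprocity": reciprocal / len(edge_set),
    }


props = graph_properties([("a", "b"), ("b", "a"), ("b", "c"), ("a", "c")])
print(props)
```

For a power-law graph, plotting `in_degree_distribution` on log-log axes would show a roughly straight line.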
[1] Kolari et al., “SVMs for the blogosphere: Blog identification and splog detection,” in AAAI Spring Symposium on Computational Approaches to Analyzing Weblogs, 2006.
[2] Java et al., “Modeling the spread of influence on the blogosphere,” tech. rep., University of Maryland, Baltimore County, March 2006.
[3] Lin et al., “Discovery of Blog Communities based on Mutual Awareness
Preferential Attachment: The likelihood of linking to a popular website is higher
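A small simulation makes the "rich get richer" effect concrete. In this sketch, each new node adds one link to an existing node chosen with probability proportional to (in-degree + 1); the "+1" is an assumption so that nodes with no inlinks can still be chosen. The function name and parameters are illustrative only.

```python
import random
from collections import Counter
from typing import List, Tuple


def preferential_attachment(n_nodes: int, seed: int = 0) -> List[Tuple[int, int]]:
    """Grow a directed graph one node at a time; return (new_node, target) links."""
    rng = random.Random(seed)
    edges: List[Tuple[int, int]] = []
    # `pool` holds every node once, plus once more per inlink it has received,
    # so a uniform draw from it is proportional to (in-degree + 1).
    pool = [0]
    for new in range(1, n_nodes):
        target = rng.choice(pool)
        edges.append((new, target))
        pool.append(target)   # target gained an inlink
        pool.append(new)      # new node becomes eligible as a target
    return edges


links = preferential_attachment(200, seed=42)
in_degree = Counter(dst for _, dst in links)
print(in_degree.most_common(5))
```

Early nodes tend to collect many more inlinks than late ones, which is the skew that produces hubs.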
1. Probability of random reads (rR)
2. Probability of randomly selecting writer (rW)
3. Probability that new node does not link to the existing network (pD)
4. Growth exponent (g)
– how many links should be added every step?
1. Add new blog node
2. Select writer
3. Writers read blog posts, write posts
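The three steps and the four parameters can be combined into a toy simulation. The slides do not give the exact rules, so everything below is one possible reading, labeled as assumptions: a new blog links preferentially by in-degree unless it stays disconnected (probability pD); a writer is picked uniformly with probability rW, otherwise preferentially by out-degree; a writer reads a random blog with probability rR, otherwise a blog it already links to; and the number of writer actions at step t is taken to be about t**g. All names and default values are hypothetical.

```python
import random
from typing import Dict, Set


def simulate_blog_graph(steps: int, rR: float = 0.1, rW: float = 0.3,
                        pD: float = 0.2, g: float = 0.5, seed: int = 0) -> Dict[int, Set[int]]:
    """Return out_links: blog id -> set of blog ids it links to."""
    rng = random.Random(seed)
    out_links: Dict[int, Set[int]] = {0: set()}
    in_degree: Dict[int, int] = {0: 0}

    def add_link(src: int, dst: int) -> None:
        if src != dst and dst not in out_links[src]:
            out_links[src].add(dst)
            in_degree[dst] += 1

    def preferential(weights: Dict[int, int]) -> int:
        nodes = list(weights)
        return rng.choices(nodes, weights=[weights[n] + 1 for n in nodes])[0]

    for t in range(1, steps + 1):
        # Step 1: add a new blog node; with probability pD it stays disconnected.
        existing = dict(in_degree)
        out_links[t] = set()
        in_degree[t] = 0
        if rng.random() >= pD:
            add_link(t, preferential(existing))
        # Assumed use of the growth exponent: about t**g writer actions per step.
        for _ in range(max(1, round(t ** g))):
            # Step 2: select a writer.
            if rng.random() < rW:
                writer = rng.choice(list(out_links))
            else:
                writer = preferential({n: len(out_links[n]) for n in out_links})
            # Step 3: the writer reads a post, then writes a post linking to it.
            if rng.random() < rR or not out_links[writer]:
                read = rng.choice(list(out_links))
            else:
                read = rng.choice(sorted(out_links[writer]))
            add_link(writer, read)
    return out_links


blog_graph = simulate_blog_graph(30, seed=1)
print(len(blog_graph), sum(len(v) for v in blog_graph.values()))
```

The output of such a simulation could then be compared against the graph properties of a real blog crawl, which is presumably the point of exposing rR, rW, pD, and g as tunable parameters.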
Reciprocal links
Strongly connected components: subset of nodes having a directed path from every node to every other node
Weakly connected components
Information flow
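The definition of a strongly connected component can be turned into code with Kosaraju's two-pass algorithm, a standard technique (the slides do not say which method was used). The sketch below is iterative to avoid recursion limits on large graphs; the function name and the example graph are illustrative.

```python
from typing import Dict, Hashable, Iterable, List, Set


def strongly_connected_components(graph: Dict[Hashable, Iterable[Hashable]]) -> List[Set[Hashable]]:
    """graph maps each node to the nodes it links to."""
    nodes = set(graph) | {w for targets in graph.values() for w in targets}

    # Pass 1: depth-first search, recording nodes in order of completion.
    order, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(graph.get(start, ())))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    break
            else:
                order.append(node)
                stack.pop()

    # Pass 2: search the reversed graph in reverse completion order.
    reverse = {n: [] for n in nodes}
    for src, targets in graph.items():
        for dst in targets:
            reverse[dst].append(src)
    components, assigned = [], set()
    for start in reversed(order):
        if start in assigned:
            continue
        component, todo = set(), [start]
        assigned.add(start)
        while todo:
            node = todo.pop()
            component.add(node)
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    todo.append(prev)
        components.append(component)
    return components


example = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []}
print(strongly_connected_components(example))
```

Weakly connected components can be found the same way by ignoring edge direction, that is, by running a single search over the union of `graph` and `reverse`.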
[Figure: processing pipeline]
Non English Blog removal
Collection Parsing
Splog Detection
Title and Content Extraction
Noisy and Unstructured Text
Host ads; index affiliates; promote PageRank; plagiarized content
[Figure: Distribution of splogs for ‘spam terms’ (pregnancy, insurance, discount) in the TREC corpus; axis label: Number of Splogs]
"Blog Track Open Task: Spam Blog Classification"
posts, either positive or negative, about a query
system
scorers
SVM meta-learner
post identification
BlogVox
Blog Analytics / Market Intelligence: Buzz, Opinions, Influence, Reputation, Competition, Financial Analyst
Top MSM Sources on the Blogosphere