http://cs224w.stanford.edu Teams of 2 3 students (1 is also ok) - - PowerPoint PPT Presentation



slide-1
SLIDE 1

CS224W: Social and Information Network Analysis
Jure Leskovec, Stanford University

http://cs224w.stanford.edu

slide-2
SLIDE 2

• Teams of 2-3 students (1 is also ok)
• Project:

  • Experimental evaluation of algorithms and models on an interesting dataset
  • A theoretical project that considers a model, an algorithm or a network property and derives a rigorous result about it
  • An in-depth critical survey of one of the course topics relating models, experimental results and underlying social theories and offering a novel perspective on the area

10/11/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 2

slide-3
SLIDE 3

• Answer the following questions:

  • What is the problem you are solving?
  • What data will you use (how will you get it)?
  • How will you do the project?
  • Which algorithms/techniques/models do you plan to use/develop?
  • Be as specific as you can!
  • How will you evaluate and measure success?
  • What do you expect to submit/accomplish by the end of the quarter?


slide-4
SLIDE 4

• The project should contain at least some amount of mathematical analysis, and some experimentation on real or synthetic data
• The result of the project will typically be a 10 page paper, describing the approach, the results, and the related work.
• Due at midnight on OCT 18 2010
• Upload PDF to http://coursework.stanford.edu
• TAs will assign group numbers – we will send a link to a GoogleDoc
• Name your file: <group#>_proposal.pdf


slide-5
SLIDE 5

• Wikipedia
• IM buddy graph
• Yahoo Altavista web graph
• Stanford WebBase
• Twitter Data
• Blogs and news data
• Yahoo Music Ratings


slide-6
SLIDE 6

• Richly labeled network containing extracted data from Wikipedia (based on infoboxes):

  • Richly labeled network
  • multiple types of nodes and edges
  • About 2.6 million concepts described by 247 million triples, including abstracts in 14 different languages
  • http://dbpedia.org

• Other OpenLinkedData datasets available at http://esw.w3.org/DataSetRDFDumps


slide-7
SLIDE 7

• Networks of positive and negative edges

  • Data includes:
    • Trust/distrust edges
    • Also Epinions product reviews and review ratings
  • SNAP: http://snap.stanford.edu/data/#signnets
  • Trustlet: http://www.trustlet.org/wiki


slide-8
SLIDE 8

• Prosper marketplace – Peer-to-peer lending:

  • Borrowers ask for loans
  • People then bid (price, interest rate) on loans to fund them
  • Rich social structure around the website

Data at http://www.prosper.com/tools/DataExport.aspx


slide-9
SLIDE 9

• Turiya is a start-up that collects game data from game publishers and processes it to produce business intelligence of value to its clients
• Data collected includes:

  • Players and their attributes
  • Logs of game events
  • Information about virtual items
  • Information about transactions in real money or credits

• Analyses include:

  • Player segmentation
  • Virtual goods recommendations
  • Lifetime value estimation of players

If you are interested – send us an email!


slide-10
SLIDE 10

• What to Wear is a Social Game played on Facebook

  • Contestants create outfits and submit these to a daily competition, which has a theme, e.g. “an outfit for attending your ex’s wedding”
  • Contestants can also vote and comment on other people’s submissions
  • You get credit for both participating and judging
  • Items for outfits are either bought from the store or reused from the contestant’s closet
  • ~30,000 players/month

• Data about this game includes:

  • Player data
  • Data about previous competitions
  • Fashion items data
  • Data about outfits
  • Many other data (~400 relations in all)

If you are interested – send us an email!


slide-11
SLIDE 11

• Amazon product review data:

For each product:

  • Product info: name, salesrank
  • Product categorization
  • All reviews
    • user, rating, how helpful was the review
  • People who bought X also bought Y – network!


If you are interested – send us an email!

slide-12
SLIDE 12

• Collaboration network of computer scientists

  • Each CS publication is included:
    • Author names
    • Title
    • Year
    • Conference, journal name

• Get the data at:

  • http://dblp.uni-trier.de/xml/
  • http://kdl.cs.umass.edu/data/dblp/dblp-info.html
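A collaboration network can be derived from the per-publication author lists. The sketch below is only an illustration: it assumes the author lists have already been extracted from the DBLP dump into Python lists, and the author names and the `coauthor_edges` helper are hypothetical.

```python
from collections import Counter
from itertools import combinations


def coauthor_edges(publications):
    """Count how often each pair of authors appears on the same publication."""
    edges = Counter()
    for authors in publications:
        # sorted(set(...)) gives every unordered pair one canonical orientation
        for a, b in combinations(sorted(set(authors)), 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical author lists, one per publication (not real DBLP records).
publications = [
    ["Author A", "Author B", "Author C"],
    ["Author B", "Author A"],
    ["Author D"],
]
edges = coauthor_edges(publications)
for (a, b), weight in sorted(edges.items()):
    print(a, b, weight)
```

The edge weight (number of joint papers) is one possible proxy for tie strength; year and venue could be attached to each edge in the same way.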


slide-13
SLIDE 13

• Patents (http://www.nber.org/patents/)

  • Citations between patents
  • For each patent we also know:
    • Time
    • Patent categorization
    • Patent inventor data, …

• Arxiv High-energy Physics:

  • Citation network between papers
  • For each paper we also know:
    • Author names
    • Title and abstract of the paper
    • Year of publication
    • Journal

• Data at: http://snap.stanford.edu/data/#citnets
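A first look at a citation network is often just counting how many times each paper or patent is cited. The sketch below assumes a plain-text edge list with one `source target` pair per line and `#` comment lines; this layout is an assumption and should be checked against the downloaded file. The node ids are made up.

```python
from collections import Counter


def citation_counts(lines):
    """Return in-degree (times cited) per node from 'source target' lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        src, dst = line.split()[:2]
        counts[dst] += 1  # src cites dst
    return counts


# Made-up edge list for illustration only.
sample = ["# FromNodeId\tToNodeId", "1\t2", "3\t2", "2\t4"]
counts = citation_counts(sample)
print(counts.most_common(1))
```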


slide-14
SLIDE 14

• ~50 million tweets per month starting in June 2009 (6 months)
• Format:

T 2009-06-07 02:07:42
U http://twitter.com/redsoxtweets
W #redsox Extra Bases: Sox win, 8-1: The Rangers spoiled Jon Lester's perfecto and his shutout.. http://tinyurl.com/pyhgwy

• Two important things:

  • URLs
  • Hash-tags

• Twitter social graph and some profiles: http://an.kaist.ac.kr/traces/WWW2010.html

If you are interested – send us an email!
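The T/U/W format above can be read with a few lines of code. This is a rough sketch that assumes each field sits on its own line, prefixed by its letter and whitespace, and that a W line closes a record. The regular expressions for URLs and hash-tags are simple approximations, and `parse_tweets` is a hypothetical helper name.

```python
import re

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")


def parse_tweets(lines):
    """Yield one dict per tweet from T/U/W-prefixed lines."""
    record = {}
    for line in lines:
        parts = line.strip().split(None, 1)
        if len(parts) != 2 or parts[0] not in ("T", "U", "W"):
            continue  # ignore blank or unrecognized lines
        key, value = parts
        record[key] = value
        if key == "W":  # assumed to be the last field of a record
            yield {
                "time": record.get("T"),
                "user": record.get("U"),
                "text": value,
                "urls": URL_RE.findall(value),
                "hashtags": HASHTAG_RE.findall(value),
            }
            record = {}


sample = [
    "T 2009-06-07 02:07:42",
    "U http://twitter.com/redsoxtweets",
    "W #redsox Extra Bases: Sox win, 8-1: The Rangers spoiled Jon Lester's "
    "perfecto and his shutout.. http://tinyurl.com/pyhgwy",
]
tweets = list(parse_tweets(sample))
print(tweets[0]["hashtags"], tweets[0]["urls"])
```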


slide-15
SLIDE 15

• Inferring links of the who-follows-whom network
• What is the lifecycle of URLs and hash-tags?

  • How do hash-tags get adopted?
  • Multiple competing hash-tags, which one wins?

• Finding early/influential users?
• Community discovery
• Where/how will the information propagate?


slide-16
SLIDE 16

• More than 1 million newsmedia and blog articles per day since August 2008
• Extracted phrases (quotes) and links
• http://memetracker.org
• Format:

P http://cnnpoliticalticker.wordpress.com/2008/08/31/mccain-defends-palins-experience-level
T 2008-09-01 00:00:13
Q dangerously unprepared to be president
Q even more dangerously unprepared
Q understands the challenges that we face
Q worked and succeeded
L http://www.cnn.com
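The P/T/Q/L records can be grouped into one entry per document. The sketch below assumes one field per line, that every record starts with a P line, and that Q and L lines may repeat; `parse_memetracker` is a hypothetical helper name, not part of any released tool.

```python
def parse_memetracker(lines):
    """Yield one dict per document from P/T/Q/L-prefixed lines."""
    doc = None
    for line in lines:
        parts = line.strip().split(None, 1)
        if len(parts) != 2:
            continue  # blank or malformed line
        key, value = parts
        if key == "P":  # a new document starts; emit the previous one
            if doc is not None:
                yield doc
            doc = {"url": value, "time": None, "quotes": [], "links": []}
        elif doc is None:
            continue  # field seen before any P line
        elif key == "T":
            doc["time"] = value
        elif key == "Q":
            doc["quotes"].append(value)
        elif key == "L":
            doc["links"].append(value)
    if doc is not None:
        yield doc


sample = [
    "P http://cnnpoliticalticker.wordpress.com/2008/08/31/mccain-defends-palins-experience-level",
    "T 2008-09-01 00:00:13",
    "Q dangerously unprepared to be president",
    "Q even more dangerously unprepared",
    "Q understands the challenges that we face",
    "Q worked and succeeded",
    "L http://www.cnn.com",
]
docs = list(parse_memetracker(sample))
print(docs[0]["time"], len(docs[0]["quotes"]), docs[0]["links"])
```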


slide-17
SLIDE 17

• How does information mutate/change over time?
• Which media sites are the most influential? Build a predictive model of site influence
• Role discovery: Which nodes are early adopters, late comers, summarizers?
• Create a model of political bias (liberal vs. conservative)
• What is genuine news, what are genuine phrases and what is spam?


slide-18
SLIDE 18

• About the Dataset:

  • 6.5 million legal opinions from the United States Judiciary from 1900 to the present;
  • documents are linked (later cases refer to earlier ones)
  • the documents are both stored in raw form on Amazon S3 and also have been pre-processed for analysis by Hadoop

• Project ideas:

  • label cases as pro-plaintiff or pro-defendant
  • run PageRank, Hub-Authorities, or other graph algorithms on the documents (they are hyperlinked)
  • identify legally important concepts


If you are interested – send us an email!
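As an illustration of the PageRank project idea, here is a small power-iteration sketch over a citation edge list. It is a textbook-style version and not the course's reference implementation; the case ids are made up, and the damping factor 0.85 and the fixed iteration count are conventional choices assumed here.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over (source, target) citation pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src in nodes:
            targets = out_links[src]
            for dst in targets:
                new_rank[dst] += damping * rank[src] / len(targets)
        rank = new_rank
    return rank


# Made-up citations: later cases cite earlier ones.
edges = [("case_c", "case_a"), ("case_b", "case_a"), ("case_c", "case_b")]
ranks = pagerank(edges)
for case, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(case, round(score, 3))
```

For millions of documents the same iteration would need a sparse-matrix or MapReduce formulation; the dictionary version only shows the update rule.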

slide-19
SLIDE 19

• Complete edit history of Wikipedia until January 2008
• For every single edit the complete snapshot of the article is saved
• Each page has a talk page:


slide-20
SLIDE 20

• Talk page:
• Editors discuss things like:


slide-21
SLIDE 21

• Every registered user has a personal page:


slide-22
SLIDE 22

• Every user’s page has a talk page:
• Users discuss things:


slide-23
SLIDE 23

• We have nicely parsed Wikipedia data

  • Each edit:
    • REVISION 4781981 72390319 Steven_Strogatz 2006-08-28T14:11:16Z SmackBot 433328
    • CATEGORY American_mathematicians
    • MAIN Boston_University MIT Harvard_University Cornell_University
    • OTHER De:Steven_Strogatz Es:Steven_Strogatz
    • EXTERNAL http://www.edge.org/3rd_culture/bios/strogatz.html
    • TEMPLATE Cite_book Cite_book Cite_journal
    • COMMENT ISBN formatting &/or general fixes using [[WP:AWB|AWB]]
    • MINOR 1
    • TEXTDATA 229

• Can identify networks:

  • Who talks to whom
  • Who edits what

• Also: Wikipedia has elections for admins, articles get reverted, disputes resolved, …

If you are interested – send us an email!
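A "who edits what" network can be sketched from the REVISION lines alone. The field order assumed below (article id, revision id, article title, timestamp, user name, user id) is inferred from the example record and should be verified against the data; `who_edits_what` is a hypothetical helper name.

```python
from collections import defaultdict


def who_edits_what(lines):
    """Map each user name to the set of article titles they edited."""
    edits = defaultdict(set)
    for line in lines:
        fields = line.split()
        # Assumed layout: REVISION article_id rev_id title timestamp user user_id
        if len(fields) == 7 and fields[0] == "REVISION":
            _, article_id, rev_id, title, timestamp, user, user_id = fields
            edits[user].add(title)
    return edits


sample = [
    "REVISION 4781981 72390319 Steven_Strogatz 2006-08-28T14:11:16Z SmackBot 433328",
    "CATEGORY American_mathematicians",
    "MINOR 1",
]
edits = who_edits_what(sample)
print(dict(edits))
```

The result is a bipartite user-article graph; projecting it onto users (two users linked if they edited the same article) gives one possible editor network.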


slide-24
SLIDE 24

• We also have the Wikipedia webserver logs, i.e., page visit statistics

  • http://dammit.lt/wikistats/
  • http://lmonson.com/blog/
  • http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2596

• How do Wiki page visit statistics correlate with external events, natural disasters?

  • Use Twitter or MemeTracker data to detect those
  • Compare occurrence of phrases and visits to Wikipedia pages
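One simple way to compare phrase occurrences with page visits is a Pearson correlation between two aligned daily time series. The sketch below uses invented numbers purely for illustration and assumes both series cover the same days; a real analysis would also look at time lags between the two signals.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (non-constant inputs)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented daily counts for one topic (illustration only, not real data).
phrase_mentions = [3, 5, 40, 120, 60, 20, 8]
page_visits = [900, 1100, 5200, 15000, 9100, 3000, 1500]
r = pearson(phrase_mentions, page_visits)
print(round(r, 3))
```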


slide-25
SLIDE 25

• Altavista web graph from 2002:

  • Nodes are webpages
  • Directed edges are hyperlinks
  • 1.4 billion public webpages
  • Several billion edges
  • For each node we also know the page URL


slide-26
SLIDE 26

• SPAM:

  • Use the web-graph structure to more efficiently extract spam webpages
  • Link farms
  • Spider traps

• Personalized and topic-sensitive PageRank


slide-27
SLIDE 27

• Website structure identification:

  • From the webgraph extract “websites”
  • What are common navigational structures of websites?
  • Cluster website graphs
  • Identify common subgraphs and patterns
  • What roles do pages/links play in the graph:
    • Content pages
    • Navigational pages
    • Index pages
  • Build a summary/map of the website


slide-28
SLIDE 28

• A collection of focused snapshots of the Web
• Data starts in 2004 and continues till today

  • General crawls
    • start from ~1000 seed webpages
    • Crawl up to ~150,000 pages per site
  • Specialized crawls:
    • Universities
    • US Government
    • Hurricane Katrina (2005) – daily crawls
    • Monthly newspaper crawls


slide-29
SLIDE 29

• Smaller than Altavista but you also have the page content
• Study the evolution of the webgraph

  • How does website structure change and evolve over time
  • How do webpages (webpage structure) change over time


slide-30
SLIDE 30

• A large IM buddy graph from March 2005
• 230 million nodes
• 7,340 million undirected edges
• Limitations:

  • Only have the buddy graph with random node ids
  • No communication or edge strength


slide-31
SLIDE 31

• Find communities, clusters in such a big graph
• Count frequent subgraphs
• Design algorithms to characterize the structure of the network as a whole
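The triangle is the simplest subgraph to count and a common starting point for subgraph statistics. The sketch below counts triangles in a small undirected edge list held in memory. It only illustrates the idea on made-up node ids; a graph with billions of edges would need a streaming, sampling, or distributed approach instead.

```python
from collections import defaultdict


def count_triangles(edges):
    """Count triangles in an undirected graph given as (u, v) pairs."""
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            neighbors[u].add(v)
            neighbors[v].add(u)
    total = 0
    for u in neighbors:
        for v in neighbors[u]:
            if u < v:
                # Count each triangle once, at its ordering u < v < w.
                total += sum(1 for w in neighbors[u] & neighbors[v] if w > v)
    return total


# Made-up buddy graph: one triangle (1, 2, 3) plus a pendant edge.
edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
print(count_triangles(edges))
```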


slide-32
SLIDE 32

• Stanford Search Queries
• New York Times articles since 1987

  • Articles are manually annotated by subject categories and keywords
  • Entity or relation extraction
  • Extract keywords, predict article category

• Don’t feel limited by these
• You can collect the dataset yourself
• And define the project/question yourself
