CS224W: Social and Information Network Analysis
Jure Leskovec, Stanford University
http://cs224w.stanford.edu
Teams of 2-3 students (1 is also ok)
Project:
- Experimental evaluation of algorithms and models on an interesting dataset
- A theoretical project that considers a model, an algorithm or a network property and derives a rigorous result about it
- An in-depth critical survey of one of the course topics relating models, experimental results and underlying social theories and offering a novel perspective on the area
10/11/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 2
Answer the following questions:
- What is the problem you are solving?
- What data will you use (how will you get it)?
- How will you do the project?
- Which algorithms/techniques/models do you plan to use/develop?
- Be as specific as you can!
- How will you evaluate, measure success?
- What do you expect to submit/accomplish by the end of the quarter?
The project should contain at least some amount of mathematical analysis, and some experimentation on real or synthetic data.
The result of the project will typically be a 10 page paper, describing the approach, the results, and the related work.
Due at midnight, Oct 18, 2010
Upload PDF to http://coursework.stanford.edu
TAs will assign group numbers; we will send a link to a GoogleDoc
Name your file: <group#>_proposal.pdf
- Wikipedia
- IM buddy graph
- Yahoo Altavista web graph
- Stanford WebBase
- Twitter Data
- Blogs and news data
- Yahoo Music Ratings
Richly labeled network containing extracted data from Wikipedia (based on infoboxes):
- Richly labeled network: multiple types of nodes and edges
- About 2.6 million concepts described by 247 million triples, including abstracts in 14 different languages
- http://dbpedia.org
Other OpenLinkedData datasets available at http://esw.w3.org/DataSetRDFDumps
Networks of positive and negative edges
- Data includes:
- Trust/distrust edges
- Also Epinions product reviews and review ratings
- SNAP: http://snap.stanford.edu/data/#signnets
- Trustlet: http://www.trustlet.org/wiki
Prosper marketplace (peer-to-peer lending):
- Borrowers ask for loans
- People then bid (price, interest rate) on loans to fund them
- Rich social structure around the website
Data at http://www.prosper.com/tools/DataExport.aspx
Turiya is a start-up that collects game data from game publishers and processes it to produce business intelligence of value to its clients
Data collected includes:
- Players and their attributes
- Logs of game events
- Information about virtual items
- Information about transactions in real money or credits
Analyses include:
- Player segmentation
- Virtual goods recommendations
- Lifetime value estimation of players
If you are interested, send us an email!
What to Wear is a Social Game played on Facebook
- Contestants create outfits and submit these to a daily competition, which has a theme such as “an outfit for attending your ex’s wedding”
- Contestants can also vote and comment on other people’s submissions
- You get credit for both participating and judging
- Items for outfits are either bought from the store or reused from the contestant’s closet
- ~30,000 players/month
Data about this game includes:
- Player data
- Data about previous competitions
- Fashion items data
- Data about outfits
- Many other data (~400 relations in all)
If you are interested, send us an email!
Amazon product review data:
For each product:
- Product info: name, salesrank
- Product categorization
- All reviews: user, rating, how helpful the review was
- People who bought X also bought Y: this is a network!
If you are interested, send us an email!
Collaboration network of computer scientists
- Each CS publication is included:
- Author names
- Title
- Year
- Conference, journal name
Get the data at:
- http://dblp.uni-trier.de/xml/
- http://kdl.cs.umass.edu/data/dblp/dblp-info.html
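One way to turn publication records into a collaboration network is to connect every pair of authors who share a paper. The sketch below assumes each publication has already been read into a dict with an `authors` list; the record layout, the sample papers and the function name `coauthor_edges` are illustrative assumptions, not the format of the DBLP XML dump.

```python
from collections import Counter
from itertools import combinations


def coauthor_edges(publications):
    """Count how many papers each unordered pair of authors wrote together."""
    weights = Counter()
    for pub in publications:
        # Deduplicate and sort so every pair has one canonical (a, b) order.
        authors = sorted(set(pub["authors"]))
        for a, b in combinations(authors, 2):
            weights[(a, b)] += 1
    return weights


# Hypothetical records; real data would be parsed from the DBLP dump.
papers = [
    {"title": "Paper 1", "year": 2008, "authors": ["Ann", "Bob", "Cy"]},
    {"title": "Paper 2", "year": 2009, "authors": ["Bob", "Ann"]},
    {"title": "Paper 3", "year": 2010, "authors": ["Dee"]},
]
edges = coauthor_edges(papers)
```

The edge weight (number of joint papers) is kept because it is a natural measure of tie strength; single-author papers contribute no edges.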
Patents (http://www.nber.org/patents/)
- Citations between patents
- For each patent we also know:
- Time
- Patent categorization
- Patent inventor data, …
Arxiv High-energy Physics:
- Citation network between papers
- For each paper we also know:
- Author names
- Title and abstract of the paper
- Year of publication
- Journal
Data at: http://snap.stanford.edu/data/#citnets
~50 million tweets per month starting in June 2009 (6 months)
Format:
T 2009-06-07 02:07:42
U http://twitter.com/redsoxtweets
W #redsox Extra Bases: Sox win, 8-1: The Rangers spoiled Jon Lester's perfecto and his shutout.. http://tinyurl.com/pyhgwy
Two important things:
- URLs
- Hash-tags
Twitter social graph and some profiles:
http://an.kaist.ac.kr/traces/WWW2010.html
If you are interested, send us an email!
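The T/U/W record format can be read with a small line-oriented parser that also pulls out the two things of interest, URLs and hash-tags. This is a minimal sketch: it assumes each tweet is a T line, a U line and a W line in that order, separated from the value by whitespace, and the regular expressions are simple approximations chosen for illustration.

```python
import re

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")


def parse_tweets(lines):
    """Yield one dict per tweet from T/U/W-prefixed lines."""
    tweet = {}
    for line in lines:
        parts = line.strip().split(None, 1)
        if not parts:
            continue  # skip blank separator lines
        tag = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        if tag == "T":
            tweet = {"time": value}  # a T line starts a new record
        elif tag == "U":
            tweet["user"] = value
        elif tag == "W":
            tweet["text"] = value
            tweet["urls"] = URL_RE.findall(value)
            tweet["hashtags"] = HASHTAG_RE.findall(value)
            yield tweet


sample = """T 2009-06-07 02:07:42
U http://twitter.com/redsoxtweets
W #redsox Extra Bases: Sox win, 8-1: The Rangers spoiled Jon Lester's perfecto and his shutout.. http://tinyurl.com/pyhgwy
"""
tweets = list(parse_tweets(sample.splitlines()))
```

Because the parser is a generator, it can stream over a file handle without loading a whole month of tweets into memory.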
Inferring links of the who-follows-whom network
What is the lifecycle of URLs and hash-tags?
- How do hash-tags get adopted?
- Multiple competing hash-tags, which one wins?
Finding early/influential users?
Community discovery
Where/how will the information propagate?
More than 1 million newsmedia and blog articles per day since August 2008
Extracted phrases (quotes) and links
http://memetracker.org
Format:
P http://cnnpoliticalticker.wordpress.com/2008/08/31/mccain-defends-palins-experience-level
T 2008-09-01 00:00:13
Q dangerously unprepared to be president
Q even more dangerously unprepared
Q understands the challenges that we face
Q worked and succeeded
L http://www.cnn.com
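The P/T/Q/L format groups naturally into one record per document: a P line opens the record, and the following T, Q and L lines fill in its timestamp, quoted phrases and outgoing links. The sketch below assumes exactly that grouping and that tag and value are separated by whitespace; the field names in the returned dict are illustrative choices.

```python
def parse_memetracker(lines):
    """Group P/T/Q/L-prefixed lines into one record per document."""
    record = None
    for line in lines:
        parts = line.strip().split(None, 1)
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        tag, value = parts
        if tag == "P":
            if record is not None:
                yield record  # the previous document is complete
            record = {"url": value, "time": None, "quotes": [], "links": []}
        elif record is None:
            continue  # ignore anything before the first P line
        elif tag == "T":
            record["time"] = value
        elif tag == "Q":
            record["quotes"].append(value)
        elif tag == "L":
            record["links"].append(value)
    if record is not None:
        yield record  # flush the final document


sample = """P http://cnnpoliticalticker.wordpress.com/2008/08/31/mccain-defends-palins-experience-level
T 2008-09-01 00:00:13
Q dangerously unprepared to be president
Q even more dangerously unprepared
Q understands the challenges that we face
Q worked and succeeded
L http://www.cnn.com
"""
docs = list(parse_memetracker(sample.splitlines()))
```

The `links` lists give a document-to-document hyperlink graph, while the `quotes` lists connect documents that repeat the same phrase.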
How does information mutate/change over time?
Which media sites are the most influential? Build a predictive model of site influence
Role discovery: Which nodes are early adopters, late comers, summarizers?
Create a model of political bias (liberal vs. conservative)
What is genuine news, what are genuine phrases and what is spam?
About the Dataset:
- 6.5 million legal opinions from the United States Judiciary from 1900 to the present
- documents are linked (later cases refer to earlier ones)
- the documents are both stored in raw form on Amazon S3 and also have been pre-processed for analysis by Hadoop
Project ideas:
- label cases as pro-plaintiff or pro-defendant
- run PageRank, Hub-Authorities, or other graph algorithms on the documents (they are hyperlinked)
- identify legally important concepts
If you are interested, send us an email!
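As an illustration of running PageRank on a citation graph, here is a minimal power-iteration sketch. The damping factor of 0.85, the fixed iteration count, the even redistribution of rank from nodes without outgoing citations, and the three-case example graph are all assumptions made for the example; a graph with millions of opinions would need a distributed or sparse-matrix implementation instead.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict {node: [nodes it cites]}."""
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every node receives the teleport share first.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        dangling = 0.0
        for v in nodes:
            targets = out_links.get(v, [])
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Nodes that cite nothing spread their rank evenly.
                dangling += damping * rank[v]
        for v in nodes:
            new_rank[v] += dangling / n
        rank = new_rank
    return rank


# Hypothetical citation graph: later cases cite earlier ones.
citations = {
    "case_c": ["case_a", "case_b"],
    "case_b": ["case_a"],
    "case_a": [],
}
scores = pagerank(citations)
```

In this toy graph the earliest case is cited by both later ones, so it ends up with the highest score.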
Complete edit history of Wikipedia until January 2008
For every single edit the complete snapshot of the article is saved
Each page has a talk page
Talk page: editors discuss things
Every registered user has a personal page
Every user’s page has a talk page, where users discuss things
We have nicely parsed Wikipedia data
- Each edit:
- REVISION 4781981 72390319 Steven_Strogatz 2006-08-28T14:11:16Z SmackBot 433328
- CATEGORY American_mathematicians
- MAIN Boston_University MIT Harvard_University Cornell_University
- OTHER De:Steven_Strogatz Es:Steven_Strogatz
- EXTERNAL http://www.edge.org/3rd_culture/bios/strogatz.html
- TEMPLATE Cite_book Cite_book Cite_journal
- COMMENT ISBN formatting &/or general fixes using [[WP:AWB|AWB]]
- MINOR 1
- TEXTDATA 229
Can identify networks:
- Who talks to whom
- Who edits what
Also: Wikipedia has elections for admins, articles get reverted, disputes resolved, …
If you are interested, send us an email!
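The "who edits what" network can be sketched directly from the REVISION lines. The field order used below (article id, revision id, article title, timestamp, username, user id) is an assumption read off the single example record above, and the function names are illustrative.

```python
def parse_revisions(lines):
    """Yield (username, article_title, timestamp) for every REVISION line."""
    for line in lines:
        fields = line.split()
        if len(fields) == 7 and fields[0] == "REVISION":
            # Assumed order: article_id rev_id title timestamp user user_id
            _, _article_id, _rev_id, title, timestamp, user, _user_id = fields
            yield user, title, timestamp


def who_edits_what(lines):
    """Build a bipartite map {editor: set of article titles edited}."""
    edits = {}
    for user, title, _timestamp in parse_revisions(lines):
        edits.setdefault(user, set()).add(title)
    return edits


sample = [
    "REVISION 4781981 72390319 Steven_Strogatz 2006-08-28T14:11:16Z SmackBot 433328",
    "CATEGORY American_mathematicians",
    "MINOR 1",
]
edits = who_edits_what(sample)
```

Projecting this bipartite map onto editors (two editors are linked if they edited the same article) gives one possible editor-to-editor network.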
We also have the Wikipedia webserver logs, i.e., page visit statistics
- http://dammit.lt/wikistats/
- http://lmonson.com/blog/
- http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2596
How do Wiki page visit statistics correlate with external events, natural disasters?
- Use Twitter or MemeTracker data to detect those
- Compare occurrence of phrases and visits to Wikipedia pages
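One simple way to compare phrase occurrences with page visits is the Pearson correlation between two aligned daily time series. The sketch below is only an illustration: the daily counts are made-up numbers, and a real analysis would also need to consider time lags and overall traffic trends.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length, at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical daily counts around an event (not real data).
phrase_mentions = [2, 3, 50, 40, 5]
page_visits = [100, 120, 900, 700, 150]
r = pearson(phrase_mentions, page_visits)
```

A value of `r` close to 1 would indicate that visits rise and fall together with mentions of the phrase.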
Altavista web graph from 2002:
- Nodes are webpages
- Directed edges are hyperlinks
- 1.4 billion public webpages
- Several billion edges
- For each node we also know the page URL
SPAM:
- Use the web-graph structure to more efficiently extract spam webpages
- Link farms
- Spider traps
Personalized and topic-sensitive PageRank
Website structure identification:
- From the webgraph extract “websites”
- What are common navigational structures of websites?
- Cluster website graphs
- Identify common subgraphs and patterns
- What roles do pages/links play in the graph:
- Content pages
- Navigational pages
- Index pages
- Build a summary/map of the website
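A first step toward extracting "websites" from a webgraph is to group hyperlinks by the host part of the page URL. The sketch below treats each distinct host as one website, which is a simplifying assumption (a real site may span several hosts or share one); the example URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def split_by_site(edges):
    """Group hyperlinks into per-site subgraphs, keyed by URL host."""
    internal = defaultdict(list)  # host -> links that stay inside the site
    cross_site = []               # links between different hosts
    for src, dst in edges:
        src_host = urlsplit(src).hostname
        dst_host = urlsplit(dst).hostname
        if src_host is not None and src_host == dst_host:
            internal[src_host].append((src, dst))
        else:
            cross_site.append((src, dst))
    return dict(internal), cross_site


# Hypothetical hyperlinks (source page, target page).
links = [
    ("http://a.example/index.html", "http://a.example/about.html"),
    ("http://a.example/about.html", "http://b.example/"),
    ("http://b.example/", "http://b.example/news"),
]
sites, cross = split_by_site(links)
```

Each per-site edge list can then be analyzed on its own, for example to look for index or navigational pages by their in- and out-degree.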
A collection of focused snapshots of the Web
Data starts in 2004 and continues till today
- General crawls
- start from ~1000 seed webpages
- Crawl up to ~150,000 pages per site
- Specialized crawls:
- Universities
- US Government
- Hurricane Katrina (2005): daily crawls
- Monthly newspaper crawls
Smaller than Altavista but you also have the page content
Study the evolution of the webgraph
- How does website structure change and evolve over time
- How do webpages (webpage structure) change over time
A large IM buddy graph from March 2005
230 million nodes
7,340 million undirected edges
Limitations:
- Only have the buddy graph with random node ids
- No communication or edge strength
Find communities, clusters in such a big graph
Count frequent subgraphs
Design algorithms to characterize the structure of the network as a whole
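The simplest subgraph to count in an undirected graph is the triangle. The sketch below counts triangles with adjacency sets and integer node ids; it is an in-memory illustration on a tiny hypothetical edge list, and a graph with billions of edges would need an external-memory or distributed approach instead.

```python
def count_triangles(edges):
    """Count triangles in an undirected graph given as (u, v) pairs."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    triangles = 0
    for u in adj:
        for v in adj[u]:
            if u < v:
                # Count common neighbors w with u < v < w so that each
                # triangle is counted exactly once.
                triangles += sum(1 for w in adj[u] & adj[v] if w > v)
    return triangles


# Hypothetical buddy graph with anonymized integer node ids.
buddy_edges = [(1, 2), (2, 3), (1, 3), (3, 4), (2, 4)]
num_triangles = count_triangles(buddy_edges)
```

Triangle counts per node also give the clustering coefficient, one common summary of the structure of the network as a whole.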
Stanford Search Queries
New York Times articles since 1987
- Articles are manually annotated by subject categories and keywords
- Entity or relation extraction
- Extract keywords, predict article category
Don’t feel limited by these
You can collect the dataset yourself
And define the project/question yourself