Leman Akoglu Stony Brook University - PowerPoint PPT Presentation

Leman Akoglu, Stony Brook University http://www.cs.stonybrook.edu/~cse590

- Users write product reviews on many online sites: Yelp, Amazon, TripAdvisor
- Data of the form: <userID, productID, review-text, timestamp, rating>
Tasks:
- How to find fake reviews and reviewers?
- What strange behaviors do fake reviewers have?
- Can you use the network to find anomalies?
Data:
- Amazon: http://liu.cs.uic.edu/download/data/
- Yelp: http://www.yelp.com/academic_dataset
Fall 2014 CSE 590 - Data Mining meets Graph Mining
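One possible behavioral signal for the fake-reviewer task is how far a user's ratings sit from each product's average. The sketch below assumes the review tuples have been reduced to (userID, productID, rating); the function name `rating_deviation` and the toy data are hypothetical, and a high score is only a hint, not proof of fraud.

```python
from collections import defaultdict

def rating_deviation(reviews):
    """reviews: list of (user_id, product_id, rating).
    Returns {user_id: mean absolute deviation from each product's mean rating}."""
    by_product = defaultdict(list)
    for _user, product, rating in reviews:
        by_product[product].append(rating)
    product_mean = {p: sum(r) / len(r) for p, r in by_product.items()}

    deviations = defaultdict(list)
    for user, product, rating in reviews:
        deviations[user].append(abs(rating - product_mean[product]))
    return {u: sum(d) / len(d) for u, d in deviations.items()}

# Toy data (made up): u3 consistently disagrees with everyone else.
reviews = [
    ("u1", "p1", 5), ("u2", "p1", 5), ("u3", "p1", 1),
    ("u1", "p2", 4), ("u2", "p2", 4), ("u3", "p2", 1),
]
scores = rating_deviation(reviews)
most_deviant = max(scores, key=scores.get)
```

Ranking users by this score gives a first list of candidates to inspect by hand; timestamps and review text would be natural next features.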

- IPs communicating with other IPs
- <IP1, IP2, #bytes, protocol, time>
- Simulated data, over ~10 days
Tasks:
- How to find events?
- How to pinpoint culprits?
- How can you explain the anomalies?
- How to model the time series?
Data:
- “Challenge” network (with subtle anomalies)
- Found at: http://www.cs.stonybrook.edu/~leman/courses/14CSE590/data/challenge_network.zip
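A simple starting point for finding events in the traffic records is to bin total bytes per time window and flag windows that are extreme relative to the median. This is a minimal sketch, assuming `time` is a number of seconds; the bin width, the 3.5 threshold, and the name `flag_events` are illustrative choices, not part of the dataset description.

```python
from collections import defaultdict
from statistics import median

def flag_events(flows, bin_seconds=3600, threshold=3.5):
    """flows: iterable of (ip1, ip2, nbytes, protocol, time_in_seconds).
    Returns the sorted bin indices whose total bytes are robust outliers."""
    totals = defaultdict(int)
    for _ip1, _ip2, nbytes, _protocol, t in flows:
        totals[int(t // bin_seconds)] += nbytes

    values = list(totals.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)  # median absolute deviation
    if mad == 0:
        return []  # no spread, so nothing can be scored
    # 0.6745 rescales the MAD so the score is comparable to a z-score.
    return sorted(b for b, v in totals.items()
                  if 0.6745 * abs(v - med) / mad > threshold)

# Toy data (made up): one flow per hour, with a burst in hour 5.
flows = [("10.0.0.1", "10.0.0.2", b, "tcp", h * 3600 + 10)
         for h, b in enumerate([100, 110, 90, 105, 95, 5000])]
event_bins = flag_events(flows)
```

Once a window is flagged, grouping its flows by source IP is one way to start pinpointing culprits.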

- Consider a large news corpus over time, like all USA Today articles over many years, or an opinion platform like Twitter/blogs/forums
Tasks:
- How can we find sentiment (+/-) associated with locations (e.g. Pittsburgh), people (e.g. Obama), and organizations (e.g. IBM)?
  - Exploit a senti-graph (bipartite): nodes-1: words, nodes-2: entities
  - Exploit sentiment associated with words (e.g. bankrupt, success, etc.)
- How does sentiment change over time?
Data:
- Collect your own data. Newspapers sell their database for several hundred dollars.
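The bipartite word-entity idea can be sketched as a weighted average: each entity inherits the polarity of the words it co-occurs with. The code below is only an illustration under assumptions: the co-occurrence triples, the tiny polarity lexicon, and the entity names are made up, and a real project would likely propagate scores iteratively rather than in one pass.

```python
from collections import defaultdict

def entity_sentiment(cooccurrences, word_polarity):
    """cooccurrences: iterable of (word, entity, count) edges of the bipartite graph.
    word_polarity: {word: score in [-1, 1]}; words without a score are ignored.
    Returns {entity: count-weighted mean polarity of its neighboring words}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for word, entity, count in cooccurrences:
        if word in word_polarity:
            totals[entity] += word_polarity[word] * count
            counts[entity] += count
    return {e: totals[e] / counts[e] for e in totals}

# Hypothetical edges and lexicon.
edges = [
    ("bankrupt", "AcmeCorp", 3), ("success", "AcmeCorp", 1),
    ("success", "Pittsburgh", 2), ("the", "Pittsburgh", 10),
]
lexicon = {"bankrupt": -1.0, "success": 1.0}
sentiment = entity_sentiment(edges, lexicon)
```

Computing the same scores per month or per year gives a first view of how sentiment changes over time.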

- Consider Q&A sites where people ask and/or answer questions. Some sites are focused: e.g. MathOverflow, StackOverflow
Tasks:
- How to automatically identify experts?
- How to detect whether a user is about to leave?
- How to estimate quality of answers/questions?
- How to estimate the response times to questions?
Data:
- StackOverflow data available online: http://blog.stackoverflow.com/category/cc-wiki-dump/

- Consider the time series of Internet trends, like memes, stock prices, or #online searches.
Tasks:
- What do these time series look like? How to spot change-points?
- Can we characterize the time series (e.g. shape, distribution) so as to differentiate/classify ‘rumor’-based trends from ‘serious’ trends/topics?
  - e.g., searches on celebrities vs home sale prices?
Data:
- Google trends data available to download: http://www.google.com/trends/explore#cmpt=q
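As a baseline for spotting change-points, one can look for the single split that best separates a series into two segments with different means. The sketch below is a brute-force version of that idea (quadratic time, one change-point only); the function name `best_change_point` and the toy series are illustrative assumptions.

```python
def best_change_point(series):
    """Return the index k (1 <= k < len(series)) such that splitting into
    series[:k] and series[k:] minimizes the total within-segment squared error."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two points to split")

    def sse(segment):
        mean = sum(segment) / len(segment)
        return sum((x - mean) ** 2 for x in segment)

    return min(range(1, n), key=lambda k: sse(series[:k]) + sse(series[k:]))

# Toy series (made up): the level jumps from 1 to 5 at index 4.
trend = [1, 1, 1, 1, 5, 5, 5, 5]
k = best_change_point(trend)
```

Applying the same search recursively to each segment is one common way to extend this to multiple change-points.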

- Consider online platforms where people share ‘stuff’, e.g. pictures, links, videos, such as Reddit
- Several questions one can ask:
Tasks:
- How do upvoted posts differ from downvoted ones? (control for the same shared link)
- What makes a user more engaged with such sites? (regular vs sporadic users)
- How to characterize the life-span of a post?
Data:
- Collect your own data: http://www.reddit.com/

- Consider very special-topic online sites for specific groups of people, e.g. people interested in gardening, wine, etc., chess players, moms, etc.
Tasks:
- What topics are being talked about?
- How often do they ask questions? What are they about? What types of questions are discussed most? (type: recommendation, opinion, etc.)
- What are the most mentioned feelings/reasons/words for specific concepts like ‘opening’, ‘divorce’, or ‘wine storage’?
Data:
- Collect your own data: http://www.youbemom.com/forum/all

- Brain networks of 114 human subjects
  - Nodes: brain regions
  - Edges: connection strengths (weighted)
- Small graphs 70x70 nodes (regions)
- Big graphs ~2M nodes
Tasks:
- Classify human:
  1. high math vs normal
  2. creative vs normal
  3. male vs female, etc.
  - Using/finding discriminative patterns
Data:
- http://www.cs.stonybrook.edu/~leman/courses/14CSE590/data/brainnetworks.rar

- Consider a political forum where users discuss several issues:
  - e.g.: abortion, creation, gay rights, guns, healthcare, re-election of Obama
- Opinions: “in-favor” or “opposed” (signed edges)
Tasks:
- Given a user u and a new issue i, predict u’s opinion on i (use network)
- Anomalies? Spam? Conflicts?
Data:
- Can crawl http://www.politicalforum.com/forum.php
- http://www.politicsforum.org/forum/
- Wikipedia?

- Consider a who-trusts-whom network
  - Users decide whether to “trust” or “not-trust” each other (signed edges)
Task:
- Given a user i and a user j, predict whether i trusts j and vice versa (use network)
Data:
- Epinions.com at http://snap.stanford.edu/data/soc-sign-epinions.html
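One network-based baseline for predicting a trust sign is structural balance: "a friend of a friend is a friend, an enemy of an enemy is a friend". The sketch below lets every common neighbor of i and j vote with the product of its two edge signs. It treats edges as undirected for simplicity, which is an assumption (the Epinions data is directed), and the name `predict_sign` and the toy edges are hypothetical.

```python
from collections import defaultdict

def predict_sign(signed_edges, i, j):
    """signed_edges: iterable of (u, v, sign) with sign in {+1, -1}, read as undirected.
    Returns +1 (trust), -1 (distrust), or 0 when the triads give no majority."""
    sign = {}
    neighbors = defaultdict(set)
    for u, v, s in signed_edges:
        sign[(u, v)] = sign[(v, u)] = s
        neighbors[u].add(v)
        neighbors[v].add(u)

    # Each common neighbor k votes with sign(i, k) * sign(k, j).
    vote = sum(sign[(i, k)] * sign[(k, j)] for k in neighbors[i] & neighbors[j])
    return (vote > 0) - (vote < 0)

# Toy network (made up): three paths of length two between a and b.
edges = [
    ("a", "c", 1), ("c", "b", 1),    # friend of a friend   -> +1
    ("a", "d", -1), ("d", "b", -1),  # enemy of an enemy    -> +1
    ("a", "e", 1), ("e", "b", -1),   # enemy of a friend    -> -1
]
prediction = predict_sign(edges, "a", "b")
```

Holding out known edges and checking how often this vote recovers their sign gives a quick evaluation before trying learned models.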

- Consider the who-follows-whom Twitter network
Tasks:
- Given a user i and a user j, predict whether i and j follow each other (use network)
- Find community structure
  - Measure quality of communities (conductance, modularity)
  - How dense are they? Are they well separated?
  - What size are they? Communities-within-communities?
Data:
- http://an.kaist.ac.kr/traces/WWW2010.html
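Conductance, one of the community quality measures mentioned above, is the number of edges leaving a node set divided by the smaller of the two volumes (sums of degrees) on either side of the cut; lower values indicate a better separated community. The sketch assumes an undirected, unweighted edge list, which simplifies the directed follower graph; the function name and toy graph are illustrative.

```python
def conductance(edges, community):
    """edges: iterable of undirected (u, v) pairs; community: set of nodes.
    Returns cut(S, rest) / min(vol(S), vol(rest))."""
    community = set(community)
    cut = vol_in = vol_out = 0
    for u, v in edges:
        u_in, v_in = u in community, v in community
        if u_in != v_in:
            cut += 1  # edge crosses the community boundary
        vol_in += u_in + v_in                  # endpoints inside the community
        vol_out += (not u_in) + (not v_in)     # endpoints outside the community
    denom = min(vol_in, vol_out)
    if denom == 0:
        raise ValueError("conductance is undefined when one side has no edges")
    return cut / denom

# Toy graph (made up): two triangles joined by the single edge (3, 4).
graph_edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
phi = conductance(graph_edges, {1, 2, 3})
```

Here one edge crosses the cut and each side has volume 7, so the triangle {1, 2, 3} scores 1/7.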

- Available from: https://nycopendata.socrata.com/browse
- Types of data include
  - Electric consumption by zipcode
  - Emergency (911) or community-concern (311) calls by zipcode
  - Restaurant inspections
  - Noise complaints by zipcode
  - …
- Tasks:
  - Find anomalies/fraud/events
  - Summarize the data and visualize

- Given photos and tags of what people are wearing (temporal data)
  - Find trends (association rules: what is being worn with what)
  - How do these trends change over time, if at all?
  - What determines the #likes of a photo? (e.g., content, popularity, #friends)
- Data:
  - http://www.cs.stonybrook.edu/~leman/courses/14CSE590/data/chictopia.tar
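The "what is being worn with what" question maps onto pairwise association rules scored by support and confidence. The following sketch counts tag pairs per photo; the tag lists, the 0.5 support threshold, and the name `pair_rules` are made up for illustration and say nothing about the actual format of the dataset.

```python
from collections import Counter
from itertools import combinations

def pair_rules(outfits, min_support=0.5):
    """outfits: list of tag lists, one per photo.
    Returns rules (antecedent, consequent, support, confidence) for tag pairs
    whose support (fraction of photos containing both tags) is >= min_support."""
    n = len(outfits)
    item_count = Counter()
    pair_count = Counter()
    for tags in outfits:
        tags = sorted(set(tags))  # deduplicate and fix the pair order
        item_count.update(tags)
        pair_count.update(combinations(tags, 2))

    rules = []
    for (a, b), c in pair_count.items():
        if c / n >= min_support:
            rules.append((a, b, c / n, c / item_count[a]))  # a -> b
            rules.append((b, a, c / n, c / item_count[b]))  # b -> a
    return rules

# Toy data (made up).
outfits = [
    ["jeans", "boots", "scarf"],
    ["jeans", "boots"],
    ["dress", "boots"],
    ["jeans", "sneakers"],
]
rules = pair_rules(outfits)
```

Running the same counts over separate time windows is one way to see whether the rules drift over time.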

- KDD is the premier data mining conference
- Every year there is a competition: http://www.sigkdd.org/kddcup/index.php
  - KDD-Cup 2010 - Student performance evaluation
  - KDD-Cup 2009 - Customer relationship prediction
  - KDD-Cup 2008 - Breast cancer
  - KDD-Cup 2007 - Consumer recommendations
  - KDD-Cup 2006 - Pulmonary embolisms detection from image data
  - ...
- Similarly check out Kaggle: http://www.kaggle.com/

- MapReduce (MR): distributed compute environment
- Hadoop: open-source version of MR
  - Can install on local machine: http://snap.stanford.edu/class/cs246-2011/hw_files/hadoop_install.pdf
Tasks:
- How to partition a very large graph? Goal:
  - Many within-partition edges, few cross-edges
- How to find single-source shortest paths?
  - Given node i, find all shortest paths to other nodes
  - For weighted, directed graphs
  - Modification: with upper-bound on shortest path distance
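The shortest-path task is often expressed in MapReduce as repeated rounds: the map step sends each node's current distance plus edge weight to its neighbors, and the reduce step keeps the minimum per node. The sketch below only emulates those rounds in memory in plain Python (it does not use Hadoop), assumes non-negative weights, tracks distances rather than the paths themselves, and uses hypothetical names throughout.

```python
import math

def map_phase(dist, graph, max_dist):
    """Emit (node, candidate_distance) pairs for one round."""
    for node, d in dist.items():
        yield node, d  # keep the distance already known
        for neighbor, weight in graph.get(node, []):
            if d + weight <= max_dist:  # upper bound: drop paths that are too long
                yield neighbor, d + weight

def reduce_phase(pairs):
    """Keep the minimum candidate distance per node."""
    best = {}
    for node, d in pairs:
        if d < best.get(node, math.inf):
            best[node] = d
    return best

def sssp(graph, source, max_dist=math.inf):
    """graph: {node: [(neighbor, weight), ...]} for a weighted, directed graph.
    Returns {node: shortest distance from source}, limited to max_dist."""
    dist = {source: 0}
    while True:
        new_dist = reduce_phase(map_phase(dist, graph, max_dist))
        if new_dist == dist:  # no distance improved, so the rounds have converged
            return dist
        dist = new_dist

# Toy graph (made up).
graph = {
    "a": [("b", 1), ("c", 5)],
    "b": [("c", 1), ("d", 4)],
    "c": [("d", 1)],
    "d": [],
}
all_distances = sssp(graph, "a")
bounded = sssp(graph, "a", max_dist=2)
```

In a real Hadoop job each round would be one MapReduce pass over the graph, so the number of rounds (bounded by the longest shortest path in hops) drives the cost; the distance bound helps by pruning candidates early.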

- http://snap.stanford.edu/class/cs224w-2010/datasetsInfo.html
- http://www.stanford.edu/class/cs341/data.html
- http://snap.stanford.edu/data/

Don’t feel limited by these ideas/datasets. You can come up with your own ideas and collect interesting datasets :)
