CS345a: Data Mining Jure Leskovec Stanford University

• Instructors:
  • Jure Leskovec
  • Anand Rajaraman
• TAs:
  • Abhishek Gupta
  • Roshan Sumbaly
• Reach us at cs345a-win0910-staff@lists.stanford.edu
• More info on www.stanford.edu/class/cs345a
1/5/2010 Jure Leskovec, Stanford CS345a: Data Mining

• Homework: 20%
  • Gradiance and other
  • 3 late days for the quarter
  • All homeworks must be handed in
• Project: 40%
  • Start early
  • Takes lots of time
• Final Exam: 40%

• Basic databases: CS145
• Algorithms:
  • Dynamic programming, basic data structures
• Basic statistics:
  • Moments, typical distributions, regression, …
• Programming:
  • Your choice, but C++/Java will be very useful
• We provide some background, but the class will be fast paced

• Software implementation related to course subject matter
• Should involve an original component or experiment
• More later about available data and computing resources
• It’s going to be fun and hard work

• Many past projects have dealt with collaborative filtering (advice based on what similar people do)
  • E.g., Netflix Challenge
• Others have dealt with engineering solutions to machine-learning problems
• Lots of interesting project ideas
  • If you can’t think of one, please come talk to us
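To make "advice based on what similar people do" concrete, here is a minimal sketch of user-based collaborative filtering. The ratings, the user and item names, and the choice of cosine similarity with a similarity-weighted average are all assumptions made for illustration; they are not taken from the course materials or from any past project.

```python
from math import sqrt

# Hypothetical ratings (user -> {item: rating on a 1-5 scale}), made up for illustration.
ratings = {
    "ann": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 4, "B": 5, "D": 4},
    "cat": {"A": 1, "C": 5, "D": 2},
}

def cosine(u, v):
    """Cosine similarity of two sparse rating dicts (missing items count as 0)."""
    dot = sum(u[i] * v[i] for i in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def predict(user, item):
    """Similarity-weighted average of other users' ratings for `item`.

    Returns None when no other user has rated the item.
    """
    pairs = [
        (cosine(ratings[user], r), r[item])
        for other, r in ratings.items()
        if other != user and item in r
    ]
    total = sum(s for s, _ in pairs)
    return sum(s * x for s, x in pairs) / total if total else None

# ann has not rated D; her taste is closer to bob's than to cat's,
# so the prediction leans toward bob's rating of 4 rather than cat's 2.
print(predict("ann", "D"))
```

Real systems usually also subtract each user's mean rating and restrict the average to the nearest neighbors; this sketch omits both to stay short.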

• Data:
  • Netflix
  • WebBase
  • Wikipedia
  • TREC
  • ShareThis
  • Google
• Infrastructure:
  • Aster Data cluster on Amazon EC2
  • Supports both MapReduce and SQL

• ML generally requires a large “training set” of correctly classified data:
  • Example: classify Web pages by topic
• Hard to find well-classified data:
  • Open Directory works for page topics, because work is collaborative and shared by many.
  • Other good exceptions?

• Many problems require thought:
  1. Tell important pages from unimportant (PageRank)
  2. Tell real news from publicity (how?)
  3. Distinguish positive from negative product reviews (how?)
  4. Feature generation in ML
  5. Etc., etc.

• Working in pairs OK, but …
  1. No more than two per project.
  2. We will expect more from a pair than from an individual.
  3. The effort should be roughly evenly distributed.

• Map-Reduce and Hadoop
• Recommendation systems
  • Collaborative filtering
• Dimensionality reduction
• Finding nearest neighbors
  • Finding similar sets
  • Minhashing, Locality-Sensitive hashing
• Clustering
• PageRank and measures of importance in graphs (link analysis)
  • Spam detection
  • Topic-specific search

• Large scale machine learning
• Association rules, frequent itemsets
• Extracting structured data (relations) from the Web
• Clustering data
• Graph partitioning
• Spam detection
• Managing Web advertisements
• Mining data streams

• Lots of data is being collected and warehoused
  • Web data, e-commerce
  • Purchases at department/grocery stores
  • Bank/credit card transactions
• Computers are cheap and powerful
• Competitive pressure is strong
  • Provide better, customized services for an edge (e.g. in Customer Relationship Management)

• Data collected and stored at enormous speeds (GB/hour)
  • Remote sensors on a satellite
  • Telescopes scanning the skies
  • Microarrays generating gene expression data
  • Scientific simulations generating terabytes of data
• Traditional techniques infeasible for raw data
• Data mining helps scientists
  • in classifying and segmenting data
  • in hypothesis formation

• There is often information “hidden” in the data that is not readily evident
• Human analysts take weeks to discover useful information
• Much of the data is never analyzed at all
[Figure: "The Data Gap". Total new disk (TB) since 1995 compared with the number of analysts, 1995 to 1999]

• Many definitions
  • Non-trivial extraction of implicit, previously unknown and useful information from data
  • Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

• Process of semi-automatically analyzing large databases to find patterns that are:
  • valid: hold on new data with some certainty
  • novel: non-obvious to the system
  • useful: should be possible to act on the item
  • understandable: humans should be able to interpret the pattern

• A big data-mining risk is that you will “discover” patterns that are meaningless.
• Bonferroni’s principle: (roughly) if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap.

• A parapsychologist in the 1950s hypothesized that some people had Extra-Sensory Perception
• He devised an experiment where subjects were asked to guess 10 hidden cards – red or blue
• He discovered that almost 1 in 1000 had ESP – they were able to get all 10 right

• He told these people they had ESP and called them in for another test of the same type
• Alas, he discovered that almost all of them had lost their ESP
• What did he conclude?
• He concluded that you shouldn’t tell people they have ESP; it causes them to lose it.
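The arithmetic behind this story can be checked with a short sketch. Guessing red or blue on 10 cards succeeds with probability 2^-10 = 1/1024, which matches the "almost 1 in 1000" figure, so a retest of the chance winners should fail at the same rate. The number of subjects and the random seed below are assumptions chosen for illustration, not details of the original experiment.

```python
from fractions import Fraction
import random

CARDS = 10  # cards per test, each red or blue

# Probability that a pure guesser gets all 10 cards right.
p_perfect = Fraction(1, 2) ** CARDS  # 1/1024

def perfect_guessers(n_subjects, rng):
    """Count subjects who guess all CARDS cards correctly by chance alone."""
    return sum(
        all(rng.random() < 0.5 for _ in range(CARDS))
        for _ in range(n_subjects)
    )

rng = random.Random(0)   # fixed seed, illustrative only
n_subjects = 100_000     # hypothetical number of subjects tested

first_round = perfect_guessers(n_subjects, rng)    # apparent "psychics"
second_round = perfect_guessers(first_round, rng)  # retest only those people

print(f"P(all {CARDS} right by guessing) = {p_perfect}")
print(f"Expected chance winners among {n_subjects}: {float(n_subjects * p_perfect):.1f}")
print(f"Simulated: {first_round} passed round one, {second_round} passed the retest")
```

With enough subjects, some perfect scores are expected even if nobody has ESP, and each of those winners has only a 1/1024 chance of repeating the feat. That is Bonferroni's principle in miniature: the number of people tested, not any real effect, produced the "discovery".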

• Banking: loan/credit card approval
  • Predict good customers based on old customers
• Customer relationship management:
  • Identify those who are likely to leave for a competitor
• Targeted marketing:
  • Identify likely responders to promotions
• Fraud detection: telecommunications, finance
  • From an online stream of events, identify fraudulent events
• Manufacturing and production:
  • Automatically adjust knobs when process parameters change

• Medicine: disease outcome, effectiveness of treatments
  • Analyze patient disease history: find relationships between diseases
• Molecular/Pharmaceutical:
  • Identify new drugs
• Scientific data analysis:
  • Identify new galaxies by searching for sub-clusters
• Web site/store design and promotion:
  • Find affinity of visitor to pages and modify layout
