CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu

TAs:
- Bahman Bahmani
- Juthika Dabholkar
- Pierre Kreitmann
- Lu Li
- Aditya Ramesh

Office hours:
- Jure: Tuesdays 9-10am, Gates 418
- See course website for TA office hours

1/8/2012 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

Course website: http://cs246.stanford.edu
- Lecture slides (at least 6h before the lecture)
- Announcements, homeworks, solutions
- Readings!

Readings: the book Mining of Massive Datasets by Anand Rajaraman and Jeffrey D. Ullman
- Free online: http://i.stanford.edu/~ullman/mmds.html

4 longer homeworks: 40%
- Theoretical and programming questions
- All homeworks (even if empty) must be handed in
- Assignments take time. Start early!

How to submit?
- Paper: box outside the class and in the Gates east wing
  - We will grade on paper!
- You should also submit an electronic copy:
  - 1 PDF/ZIP file (writeups, experimental results, code)
  - Submission website: http://cs246.stanford.edu/submit/
- SCPD: only submit an electronic copy and send us email

7 late days for the quarter:
- Max 5 late days per assignment

Short weekly quizzes: 20%
- Short e-quizzes on Gradiance (see course website!)
- First quiz is already online
- You have 7 days to complete it. No late days!

Final exam: 40%
- March 19 at 8:30am

It’s going to be fun and hard work.

Homework schedule:

Date   Out   In
1/11   HW1
1/25   HW2   HW1
2/8    HW3   HW2
2/22   HW4   HW3
3/7          HW4

No class:
- 1/16: Martin Luther King Jr. Day
- 2/20: Presidents' Day

Recitation sessions:
- Review of probability and statistics
- Installing and working with Hadoop
  - We prepared a virtual machine with Hadoop preinstalled
  - HW0 helps you write your first Hadoop program
  - See course website!
- We will announce the dates later
- Sessions will be recorded

- Algorithms (CS161)
  - Dynamic programming, basic data structures
- Basic probability (CS109 or Stat116)
  - Moments, typical distributions, MLE, …
- Programming (CS107 or CS145)
  - Your choice, but C++/Java will be very useful
- We provide some background, but the class will be fast paced

CS345a: Data mining got split into 2 courses:
- CS246: Mining massive datasets
  - Methods/algorithms-oriented course
  - Homeworks (theory & programming)
  - No class project
- CS341: Project in mining massive datasets
  - Project-oriented class
  - Lectures/readings related to the project
  - Unlimited access to Amazon EC2 cluster
  - We intend to keep the class small
  - Taking CS246 is basically a prerequisite

- For questions/clarifications use Piazza!
  - If you don’t have an @stanford.edu email address, email us and we will register you
- To communicate with the course staff use cs246-win1112-staff@lists.stanford.edu
- We will post announcements to cs246-win1112-all@lists.stanford.edu
  - If you are not registered or are auditing, send us email and we will subscribe you!
- You are welcome to sit in and audit the class
  - Send us email saying that you will be auditing

Much of the course will be devoted to data mining on the Web:
- Mining to discover things about the Web
  - E.g., PageRank, finding spam sites
- Mining data from the Web itself
  - E.g., analysis of click streams, similar products at Amazon, making recommendations

Much of the course will be devoted to large-scale computing for data mining
- Challenges:
  - How to distribute computation?
  - Distributed/parallel programming is hard
- Map-reduce addresses all of the above
  - Google’s computational/data manipulation model
  - Elegant way to work with big data
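As a rough illustration of the map-reduce model, here is a single-machine sketch in plain Python. Word counting is an example chosen here, not one taken from the slide, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical. A real framework would run the map and reduce calls on many machines and handle the grouping step itself.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key (done by the framework between phases)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values seen for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce sequentially on a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


docs = ["big data is big", "data mining"]
counts = word_count(docs)
print(counts)
```

The programmer writes only the map and reduce functions. Because each map call touches one document and each reduce call touches one key, the work can be spread across machines without changing that code.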

- High-dimensional data:
  - Locality Sensitive Hashing
  - Dimensionality reduction
  - Clustering
- The data is a graph:
  - Link Analysis: PageRank, Hubs & Authorities
- Machine Learning:
  - k-NN, Perceptron, SVM, Decision Trees
- Data is infinite:
  - Mining data streams

- Applications:
  - Association Rules
  - Recommender systems
  - Advertising on the Web
  - Web spam detection

Discovery of patterns and models that are:
- Valid: hold on new data with some certainty
- Useful: should be possible to act on the item
- Unexpected: non-obvious to the system
- Understandable: humans should be able to interpret the pattern

Subsidiary issues:
- Data cleansing: detection of bogus data
- Visualization: something better than MBs of output
- Warehousing of data (for retrieval)

- Predictive Methods
  - Use some variables to predict unknown or future values of other variables
- Descriptive Methods
  - Find human-interpretable patterns that describe the data

- Scalability
- Dimensionality
- Complex and Heterogeneous Data
- Data Quality
- Data Ownership and Distribution
- Privacy Preservation
- Streaming Data

Overlaps with:
- Databases: Large-scale (non-main-memory) data
- Machine learning: Complex methods, small data
- Statistics: Models

Different cultures:
- To a DB person, data mining is an extreme form of analytic processing: queries that examine large amounts of data
  - Result is the query answer
- To a statistician, data mining is the inference of models
  - Result is the parameters of the model

[Figure: Data Mining shown at the overlap of Statistics/AI, Machine Learning/Pattern Recognition, and Database systems]

- A big data-mining risk is that you will “discover” patterns that are meaningless.
- Bonferroni’s principle: (roughly) if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap

- Joseph Rhine was a parapsychologist in the 1950s who hypothesized that some people had Extra-Sensory Perception
- He devised an experiment where subjects were asked to guess 10 hidden cards, each red or blue
- He discovered that almost 1 in 1000 had ESP: they were able to get all 10 right!

- He told these people they had ESP and called them in for another test of the same type
- Alas, he discovered that almost all of them had lost their ESP
- What did he conclude?
  - He concluded that you shouldn’t tell people they have ESP; it causes them to lose it
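The "1 in 1000" figure is what pure guessing predicts, which a short calculation makes concrete. The sketch below assumes each card is an independent fair guess between red and blue; the number of subjects (`num_subjects = 1000`) is a hypothetical value chosen for illustration and is not given in the slides.

```python
from fractions import Fraction

NUM_CARDS = 10  # cards per test, from the slide

# Probability that one subject guesses every card correctly by chance,
# assuming each guess is an independent 50/50 choice.
p_all_right = Fraction(1, 2) ** NUM_CARDS


def expected_perfect_guessers(num_subjects):
    """Expected number of subjects who get all cards right by luck alone."""
    return num_subjects * p_all_right


num_subjects = 1000  # hypothetical number of people tested
expected = expected_perfect_guessers(num_subjects)

print(f"P(all {NUM_CARDS} right by chance) = {p_all_right}")
print(f"Expected lucky subjects among {num_subjects}: {float(expected):.2f}")
print(f"P(a lucky subject repeats the feat) = {p_all_right}")
```

Under these assumptions the chance of a perfect score is 1/1024, so about one lucky subject per thousand is expected with no ESP at all. The same 1/1024 chance applies on the retest, which is why almost all of them "lost" the ability.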

[Figure: single-node architecture with CPU, Memory, and Disk. Machine Learning and Statistics work on data in memory; “Classical” Data Mining works on data on disk]

- 20+ billion web pages x 20KB = 400+ TB
- 1 computer reads 30-35 MB/sec from disk
  - ~4 months to read the web
- ~1,000 hard drives to store the web
- Takes even more to do something useful with the data!
- Standard architecture is emerging:
  - Cluster of commodity Linux nodes
  - Gigabit ethernet interconnect
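The back-of-the-envelope numbers above can be checked with a few lines of arithmetic. This sketch assumes decimal units (1 KB = 1,000 bytes, 1 TB = 10^12 bytes) and a single machine reading sequentially at a constant rate; the helper name `days_to_read` is hypothetical.

```python
PAGES = 20 * 10**9           # 20 billion web pages (slide's lower bound)
PAGE_SIZE_BYTES = 20 * 10**3 # 20 KB per page, decimal units assumed
SECONDS_PER_DAY = 86_400

total_bytes = PAGES * PAGE_SIZE_BYTES
total_tb = total_bytes / 10**12


def days_to_read(num_bytes, mb_per_sec):
    """Days one machine needs to read num_bytes at a constant disk rate."""
    seconds = num_bytes / (mb_per_sec * 10**6)
    return seconds / SECONDS_PER_DAY


# Drive capacity implied by the slide's "~1,000 hard drives" figure.
implied_drive_tb = total_tb / 1000

print(f"Total size: {total_tb:.0f} TB")
for rate in (30, 35):
    print(f"At {rate} MB/sec: {days_to_read(total_bytes, rate):.0f} days")
print(f"Implied capacity per drive: {implied_drive_tb * 1000:.0f} GB")
```

At 30 to 35 MB/sec the read takes on the order of 130 to 155 days, consistent with the slide's rough "~4 months", and 1,000 drives for 400 TB implies about 400 GB per drive.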

[Figure: cluster architecture. Nodes (each with CPU, Mem, Disk) are grouped into racks; each rack has a switch, and rack switches connect to a backbone switch]
- 2-10 Gbps backbone between racks
- 1 Gbps between any pair of nodes in a rack
- Each rack contains 16-64 nodes
- In Aug 2006 Google had ~450,000 machines

Large-scale computing for data mining problems on commodity hardware
- Challenges:
  - How do you distribute computation?
  - How can we make it easy to write distributed programs?
  - Machines fail:
    - One server may stay up 3 years (1,000 days)
    - If you have 1,000 servers, expect to lose 1/day
    - In Aug 2006 Google had ~450,000 machines
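The failure estimate follows from simple expected-value arithmetic. The sketch below assumes servers fail independently at a constant rate of one failure per 1,000 days of uptime, which is a simplification; the function name `expected_failures_per_day` is hypothetical.

```python
def expected_failures_per_day(num_servers, mean_uptime_days=1000):
    """Expected daily failures if each server fails once per mean_uptime_days.

    Assumes independent failures at a constant rate (a simplification).
    """
    return num_servers / mean_uptime_days


small_cluster = expected_failures_per_day(1_000)
google_2006 = expected_failures_per_day(450_000)

print(f"1,000 servers:   {small_cluster:.0f} failure(s) per day")
print(f"450,000 servers: {google_2006:.0f} failures per day")
```

Under this model a 1,000-server cluster loses about one machine a day, and a fleet of ~450,000 machines would see roughly 450 failures a day, so failure handling has to be routine rather than exceptional.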
