
CS246: Mining Massive Datasets



  1. CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu

  2. TAs: Bahman Bahmani, Juthika Dabholkar, Pierre Kreitmann, Lu Li, Aditya Ramesh
     Office hours:
     - Jure: Tuesdays 9-10am, Gates 418
     - See course website for TA office hours
     1/8/2012 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

  3. Course website: http://cs246.stanford.edu
     - Lecture slides (at least 6h before the lecture)
     - Announcements, homeworks, solutions
     - Readings!
     Readings: Book Mining of Massive Datasets by Anand Rajaraman and Jeffrey D. Ullman
     - Free online: http://i.stanford.edu/~ullman/mmds.html

  4. 4 longer homeworks: 40%
     - Theoretical and programming questions
     - All homeworks (even if empty) must be handed in
     - Assignments take time. Start early!
     How to submit?
     - Paper: box outside the class and in the Gates east wing
       - We will grade on paper!
     - You should also submit an electronic copy:
       - 1 PDF/ZIP file (writeups, experimental results, code)
       - Submission website: http://cs246.stanford.edu/submit/
     - SCPD: only submit an electronic copy and send us email
     7 late days for the quarter:
     - Max 5 late days per assignment

  5. Short weekly quizzes: 20%
     - Short e-quizzes on Gradiance (see course website!)
     - First quiz is already online
     - You have 7 days to complete it. No late days!
     Final exam: 40%
     - March 19 at 8:30am
     It’s going to be fun and hard work

  6. Homework schedule:
     Date   Out   In
     1/11   HW1
     1/25   HW2   HW1
     2/8    HW3   HW2
     2/22   HW4   HW3
     3/7          HW4
     No class:
     - 1/16: Martin Luther King Jr.
     - 2/20: President’s day

  7. Recitation sessions:
     - Review of probability and statistics
     - Installing and working with Hadoop
     - We prepared a virtual machine with Hadoop preinstalled
     - HW0 helps you write your first Hadoop program
     - See course website!
     - We will announce the dates later
     - Sessions will be recorded

  8. Algorithms (CS161)
     - Dynamic programming, basic data structures
     Basic probability (CS109 or Stat116)
     - Moments, typical distributions, MLE, …
     Programming (CS107 or CS145)
     - Your choice, but C++/Java will be very useful
     We provide some background, but the class will be fast paced

  9. CS345a: Data mining got split into 2 courses
     CS246: Mining massive datasets:
     - Methods/algorithms oriented course
     - Homeworks (theory & programming)
     - No class project
     CS341: Project in mining massive datasets:
     - Project oriented class
     - Lectures/readings related to the project
     - Unlimited access to Amazon EC2 cluster
     - We intend to keep the class small
     - Taking CS246 is basically a prerequisite

  10. For questions/clarifications use Piazza!
      - If you don’t have an @stanford.edu email address, email us and we will register you
      To communicate with the course staff use
      - cs246-win1112-staff@lists.stanford.edu
      We will post announcements to
      - cs246-win1112-all@lists.stanford.edu
      - If you are not registered or are auditing, send us email and we will subscribe you!
      You are welcome to sit in and audit the class
      - Send us email saying that you will be auditing

  11. Much of the course will be devoted to data mining on the Web:
      Mining to discover things about the Web
      - E.g., PageRank, finding spam sites
      Mining data from the Web itself
      - E.g., analysis of click streams, similar products at Amazon, making recommendations

  12. Much of the course will be devoted to large-scale computing for data mining
      Challenges:
      - How to distribute computation?
      - Distributed/parallel programming is hard
      Map-reduce addresses all of the above
      - Google’s computational/data manipulation model
      - Elegant way to work with big data
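As a rough illustration of the map-reduce model named on this slide, here is a minimal single-machine sketch in plain Python. It is not Hadoop or Google's implementation; word counting is an assumed example task, and the names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical.

```python
from collections import defaultdict


def map_fn(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce step: combine all values emitted for one key."""
    return word, sum(counts)


def run_mapreduce(documents):
    """Simulate map -> group by key -> reduce on a single machine."""
    groups = defaultdict(list)
    for doc in documents:  # a real system would run map tasks in parallel
        for key, value in map_fn(doc):
            groups[key].append(value)  # stand-in for the shuffle/group step
    return dict(reduce_fn(key, values) for key, values in groups.items())


word_counts = run_mapreduce(["big data is big", "data mining"])
print(word_counts)
```

The programmer writes only the two small functions; in a real framework the distribution of work across machines is handled by the system, which is the point the slide makes.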

  13. High-dimensional data:
      - Locality Sensitive Hashing
      - Dimensionality reduction
      - Clustering
      The data is a graph:
      - Link Analysis: PageRank, Hubs & Authorities
      Machine Learning:
      - k-NN, Perceptron, SVM, Decision Trees
      Data is infinite:
      - Mining data streams

  14. Applications:
      - Association Rules
      - Recommender systems
      - Advertising on the Web
      - Web spam detection

  15. Discovery of patterns and models that are:
      - Valid: hold on new data with some certainty
      - Useful: should be possible to act on the item
      - Unexpected: non-obvious to the system
      - Understandable: humans should be able to interpret the pattern
      Subsidiary issues:
      - Data cleansing: detection of bogus data
      - Visualization: something better than MBs of output
      - Warehousing of data (for retrieval)

  16. Predictive Methods
      - Use some variables to predict unknown or future values of other variables
      Descriptive Methods
      - Find human-interpretable patterns that describe the data

  17. Scalability
      Dimensionality
      Complex and Heterogeneous Data
      Data Quality
      Data Ownership and Distribution
      Privacy Preservation
      Streaming Data

  18. Overlaps with:
      - Databases: Large-scale (non-main-memory) data
      - Machine learning: Complex methods, small data
      - Statistics: Models
      Different cultures:
      - To a DB person, data mining is an extreme form of analytic processing: queries that examine large amounts of data
        - Result is the query answer
      - To a statistician, data-mining is the inference of models
        - Result is the parameters of the model
      [Figure: Data Mining shown overlapping with Statistics/AI, Machine Learning/Pattern Recognition, and Database systems]

  19. A big data-mining risk is that you will “discover” patterns that are meaningless.
      Bonferroni’s principle: (roughly) if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap

  20. Joseph Rhine was a parapsychologist in the 1950s who hypothesized that some people had Extra-Sensory Perception
      He devised an experiment where subjects were asked to guess 10 hidden cards, each red or blue
      He discovered that almost 1 in 1000 had ESP: they were able to get all 10 right!

  21. He told these people they had ESP and called them in for another test of the same type
      Alas, he discovered that almost all of them had lost their ESP
      What did he conclude?
      - He concluded that you shouldn’t tell people they have ESP; it causes them to lose it
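The arithmetic behind the Rhine example can be checked with a short sketch. It assumes each guess is an independent fair coin flip between red and blue; the subject count of 1,024 is a hypothetical number chosen for illustration, not a figure from the slides.

```python
from fractions import Fraction

CARDS = 10  # each hidden card is red or blue, so a blind guess is right with probability 1/2

# Probability that pure guessing gets all 10 cards right.
p_all_correct = Fraction(1, 2) ** CARDS


def expected_lucky(subjects):
    """Expected number of subjects who get every card right by chance alone."""
    return subjects * p_all_correct


print(p_all_correct)         # 1/1024, i.e. "almost 1 in 1000"
print(expected_lucky(1024))  # 1

# A "lucky" subject passes the retest with the same 1/1024 chance,
# so almost all of them appear to lose their ESP the second time.
p_repeat = p_all_correct
print(float(p_repeat))
```

Under these assumptions, chance alone accounts for both the "discovery" and its disappearance on retest, which is the pattern Bonferroni's principle warns about.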

  22. [Figure: diagram of CPU, Memory, and Disk, annotated with "Machine Learning, Statistics" and “Classical” Data Mining]

  23. 20+ billion web pages x 20KB = 400+ TB
      1 computer reads 30-35 MB/sec from disk
      - ~4 months to read the web
      ~1,000 hard drives to store the web
      Takes even more to do something useful with the data!
      Standard architecture is emerging:
      - Cluster of commodity Linux nodes
      - Gigabit ethernet interconnect
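The back-of-the-envelope numbers on this slide can be reproduced with a few lines of Python. This is a sketch that assumes decimal units (1 KB = 1,000 bytes, 1 MB = 10^6 bytes) and a single machine reading sequentially at a constant rate; the function name `days_to_read` is hypothetical.

```python
PAGES = 20e9            # 20 billion web pages
PAGE_SIZE_BYTES = 20e3  # 20 KB per page (decimal units assumed)

total_bytes = PAGES * PAGE_SIZE_BYTES  # 4e14 bytes = 400 TB


def days_to_read(num_bytes, mb_per_sec):
    """Days one machine needs to read num_bytes at a constant disk rate."""
    seconds = num_bytes / (mb_per_sec * 1e6)
    return seconds / 86_400  # seconds per day


for rate in (30, 35):
    print(f"{rate} MB/sec: {days_to_read(total_bytes, rate):.0f} days")

# Storage implied by spreading 400 TB over ~1,000 drives.
implied_drive_gb = total_bytes / 1_000 / 1e9
print(f"{implied_drive_gb:.0f} GB per drive")
```

At these rates the read takes roughly 130 to 155 days, the same order as the slide's "~4 months", and the 1,000-drive figure corresponds to about 400 GB per drive.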

  24. [Figure: cluster architecture. Nodes (each with CPU, Mem, Disk) are grouped into racks, each rack behind a switch, with a further switch connecting the racks]
      - Each rack contains 16-64 nodes
      - 1 Gbps between any pair of nodes in a rack
      - 2-10 Gbps backbone between racks
      In Aug 2006 Google had ~450,000 machines

  25. Large-scale computing for data mining problems on commodity hardware
      Challenges:
      - How do you distribute computation?
      - How can we make it easy to write distributed programs?
      - Machines fail:
        - One server may stay up 3 years (1,000 days)
        - If you have 1,000 servers, expect to lose 1/day
        - In Aug 2006 Google had ~450,000 machines
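The failure estimate on this slide follows from simple expected-value arithmetic. The sketch below assumes each server fails independently with probability 1/1,000 on any given day, which is a simplification of "stays up about 1,000 days"; the function names are hypothetical.

```python
MEAN_UPTIME_DAYS = 1000  # slide's figure: one server stays up about 1,000 days


def expected_failures_per_day(num_servers, mean_uptime_days=MEAN_UPTIME_DAYS):
    """Expected number of servers failing on a given day."""
    return num_servers / mean_uptime_days


def prob_at_least_one_failure(num_servers, mean_uptime_days=MEAN_UPTIME_DAYS):
    """Chance that at least one server fails today, assuming independent failures."""
    return 1 - (1 - 1 / mean_uptime_days) ** num_servers


print(expected_failures_per_day(1_000))    # 1.0
print(expected_failures_per_day(450_000))  # 450.0
print(prob_at_least_one_failure(1_000))
```

Under this model a cluster the size of the one quoted for Google in 2006 would see hundreds of machine failures per day, so failure handling has to be built into the programming model rather than treated as an exception.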
