course information



CS 6220: Data Mining Techniques
Mirek Riedewald

Course Information
• Homepage: http://www.ccs.neu.edu/home/mirek/classes/2011-S-CS6220/
  – Announcements
  – Homework assignments
  – Lecture handouts
  – Office hours
• Prerequisites: CS 5800 or CS 7800, or consent of instructor
  – No exception for first-year Master’s students (based on past experience)

Grading
• Homework: 40%
• Midterm exam: 30%
• Final exam: 30%
• No copying or sharing of homework solutions allowed!
  – But you can discuss general challenges and ideas with others
• Material allowed for exams
  – Any handwritten notes (originals, no photocopies)
  – Printouts of lecture summaries distributed by instructor
  – Nothing else

Instructor Information
• Instructor: Mirek Riedewald (332 WVH)
  – Office hours: Tue 4:30-5:30pm, Thu 11am-noon
  – Can email me your questions (include TA)
  – Email for appointment if you cannot make it during office hours (or stop by for 1-minute questions)
• TA: Peter Golbus (472 WVH)
  – Office hours: TBD

Course Materials
• No single textbook covers everything at the right level of depth and breadth…
• Main textbook: Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques, 2nd edition, Morgan Kaufmann, 2006
  – Read it as we cover the material in class
• Other resources mentioned in syllabus
  – Consult them whenever the textbook is not sufficient

Course Content and Objectives
• Become familiar with landmark general-purpose data mining methods and understand the main ideas behind each of them
  – Classification and prediction: decision tree, regression tree, Naïve Bayes, Bayesian Belief Network, rule-based classification, artificial neural network, SVM, nearest neighbor
  – Ensemble methods: bagging, boosting
  – Frequent pattern mining: frequent itemsets, frequent sequences
  – Clustering: K-means, hierarchical, density-based, high-dimensional data, outliers

Course Content and Objectives
• Learn about major concepts and challenges in data mining and how to deal with them
  – Overfitting
  – Bias-variance tradeoff
  – Evaluating the quality of a classifier or predictor
  – Exponential search space and pruning for pattern mining
  – Pattern interestingness measures
  – Intuitive clustering versus clusters found by an algorithm; evaluating a clustering
• Gain practical experience in using some of these techniques on real data
  – Choose the right method for the task and tune it for the given data

What We Cannot Cover
• Specialized techniques
  – Text mining, genome analysis, recommender systems
• All possible variations of the presented landmark techniques
• Volumes of theoretical and technical results for most techniques
• Implementation details

How to Succeed
• Attend the lectures
  – Advanced material, not readily found in any single textbook
• Take notes during the lecture
  – Helps remembering (compared to just listening)
  – Capture lecture content more individually than our handouts
  – Free preparation for exams
• Go over notes, handouts, book chapter soon after lecture, e.g., Wed or Thu
  – Try to explain material to yourself or friend
    • Reveals if you really understood it
    • Helps identify questions early; ask us ASAP to resolve them
• Look at content from previous lecture right before the next lecture to “page in the context”

How to Succeed
• Ask questions during the lecture
  – There is no “stupid” question
  – Even seemingly simple questions show that you are thinking about the material and are genuinely interested in understanding it
    • Helps you stay alert and makes instructor happy…
• Work on the HW assignment as soon as it comes out
  – Time to ask questions and deal with unforeseen problems
  – We might not be able to answer all last-minute questions if there are too many right before the deadline

What Else to Expect?
• Need good programming skills
  – E.g., be able to write algorithm that recursively partitions a set of multi-dimensional data points based on their values in different dimensions
• Tree structures and how to traverse them
  – Recursion!
• Cost of algorithms in big-O notation
• Fairly basic probability concepts
  – Random variable, joint distribution, expectation, variance, confidence interval, conditional probability, independence
• Basic logic and set operators
  – AND, OR, implication, union, intersection

Introduction
• Motivation: Why data mining?
• What is data mining?
• Real-world example
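
To give a feel for the programming prerequisite named under "What Else to Expect?" (recursively partitioning multi-dimensional points and traversing the resulting tree), here is a minimal sketch. It is not taken from the course: the names `Node`, `partition`, `leaves`, and `max_leaf_size`, and the choice to split at the median while cycling through the dimensions, are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence

Point = Sequence[float]


@dataclass
class Node:
    """A node of the partition tree (hypothetical structure for this sketch)."""
    points: List[Point] = field(default_factory=list)  # filled only at leaves
    dim: Optional[int] = None          # dimension used for the split
    threshold: Optional[float] = None  # points with value < threshold go left
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def partition(points: Sequence[Point], depth: int = 0, max_leaf_size: int = 2) -> Node:
    """Recursively split points at the median of one dimension per level."""
    if max_leaf_size < 1:
        raise ValueError("max_leaf_size must be at least 1")
    if len(points) <= max_leaf_size:
        return Node(points=list(points))
    dim = depth % len(points[0])               # assumed rule: cycle through dimensions
    ordered = sorted(points, key=lambda p: p[dim])
    mid = len(ordered) // 2                    # both halves are non-empty here
    return Node(
        dim=dim,
        threshold=ordered[mid][dim],
        left=partition(ordered[:mid], depth + 1, max_leaf_size),
        right=partition(ordered[mid:], depth + 1, max_leaf_size),
    )


def leaves(node: Node) -> List[List[Point]]:
    """Depth-first traversal that collects the point groups stored at the leaves."""
    if node.left is None and node.right is None:
        return [node.points]
    return leaves(node.left) + leaves(node.right)


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = partition(points)
groups = leaves(tree)
```

Sorting at every level makes this sketch cost O(n log n) per level, which connects to the big-O item on the same slide; a real implementation might choose splits differently.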

How Much Information?
• Source: http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/execsum.htm
• 5 exabytes (1 exabyte = 10^18 bytes) of new information from print, film, optical storage in 2002
  – 37,000 times Library of Congress book collections (17M books)
• New information on paper, film, magnetic and optical media doubled between 2000 and 2003
• Information that flows through electronic channels (telephone, radio, TV, Internet) contained 18 exabytes of new information in 2002

Web 2.0
• Billions of Web pages, social networks with millions of users, millions of blogs
  – How do friends affect my reviews, purchases, choice of friends?
  – How does information spread?
  – What are “friendship patterns”?
    • Small-world phenomenon: any two individuals likely to be connected through short sequence of acquaintances

eScience
• …science and engineering data are constantly being collected, created, deposited, accessed, analyzed and expanded in the pursuit of new knowledge. In the future, U.S. international leadership in science and engineering will increasingly depend upon our ability to leverage this reservoir of scientific data captured in digital form, and to transform these data into information and knowledge aided by sophisticated data mining, integration, analysis and visualization tools. (National Science Foundation Cyberinfrastructure Council, 2007)
• Computing has started to change how science is done, enabling new scientific advances through enabling new kinds of experiments. These experiments are also generating new kinds of data – of increasingly exponential complexity and volume. Achieving the goal of being able to use, exploit and share these data most effectively is a huge challenge. (Towards 2020 Science, Report by Microsoft Research, 2006)

eScience Examples
• Genome data
• Large Hadron Collider
  – Petabytes of raw data
  – Find particles
  – Analyze scientific process itself
    • How do experienced researchers attack a problem?
• SkyServer
  – 818 GB, 3.4 billion rows
• Cornell Lab of Ornithology
  – 80M observations, thousands of attributes
Source: Nature

Other Examples
• Fraudulent/criminal transactions in bank accounts, credit cards, phone calls
  – Billions of transactions, real-time detection
• Retail stores
  – What products are people buying together?
  – What promotions will be most effective?
• Marketing
  – Which ads should be placed for which keyword query?
  – What are the key groups of customers and what defines each group?
• How much to charge an individual for car insurance
• Spam filtering

Traditional Analysis: Regression
• Simple linear regression: Y = a + b*X
• Estimate parameters a and b from the data
  – Need to deal with noise, errors
  – Given a set of data points (X_i, Y_i), find a and b that minimize squared error, i.e., the sum over all i of (Y_i - (a + b*X_i))^2
• Solution exists
• Can also compute confidence intervals
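
The "Solution exists" point on the regression slide can be made concrete with the standard closed-form least-squares estimates, b = S_xy / S_xx and a = mean(Y) - b*mean(X). The sketch below is not from the slides; the function names `fit_line` and `squared_error` and the sample data are assumptions chosen for illustration, and confidence intervals are left out.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Return (a, b) minimizing the sum of (y_i - (a + b*x_i))^2."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = s_xy / s_xx            # slope
    a = mean_y - b * mean_x    # intercept
    return a, b


def squared_error(a: float, b: float, xs: Sequence[float], ys: Sequence[float]) -> float:
    """Sum of squared residuals for the line Y = a + b*X."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Noise-free example: the points lie exactly on Y = 1 + 2*X.
exact_a, exact_b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])

# Noisy example (made-up data): the fitted line should beat nearby alternatives.
noisy_x = [1, 2, 3, 4]
noisy_y = [2.1, 3.9, 6.2, 7.8]
fit_a, fit_b = fit_line(noisy_x, noisy_y)
```

Because the squared error is a convex function of a and b, any other choice of parameters gives an error at least as large as that of the fitted pair.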
