SLIDE 1

Foundations of Machine Learning and Data Science

Maria-Florina (Nina) Balcan Lecture 1, September 9, 2015

SLIDE 2

Course Staff

Instructors:

  • Nina Balcan http://www.cs.cmu.edu/~ninamf
  • Avrim Blum http://www.cs.cmu.edu/~avrim

TAs:

  • Nika Haghtalab http://www.cs.cmu.edu/~nhaghtal
  • Sarah Allen http://www.cs.cmu.edu/~srallen

SLIDE 3

Lectures in general

On the board; occasionally, slides will be used.

SLIDE 4

Machine Learning

  • Image Classification
  • Document Categorization
  • Speech Recognition
  • Branch Prediction
  • Protein Classification
  • Spam Detection
  • Fraud Detection
  • Playing Games
  • Computational Advertising

SLIDE 5

Machine Learning is Changing the World

“A breakthrough in machine learning would be worth ten Microsofts” (Bill Gates, Microsoft)

“Machine learning is the hot new thing” (John Hennessy, President, Stanford)

“Web rankings today are mostly a matter of machine learning” (Prabhakar Raghavan, VP Engineering at Google)

SLIDE 6

The COOLEST TOPIC IN SCIENCE

  • “A breakthrough in machine learning would be worth ten Microsofts” (Bill Gates, Chairman, Microsoft)
  • “Machine learning is the next Internet” (Tony Tether, Director, DARPA)
  • “Machine learning is the hot new thing” (John Hennessy, President, Stanford)
  • “Web rankings today are mostly a matter of machine learning” (Prabhakar Raghavan, Dir. Research, Yahoo)
  • “Machine learning is going to result in a real revolution” (Greg Papadopoulos, CTO, Sun)
  • “Machine learning is today’s discontinuity” (Jerry Yang, CEO, Yahoo)

SLIDE 7

This course: foundations of Machine Learning and Data Science

SLIDE 8
Goals of Machine Learning Theory

Develop and analyze models to understand:

  • what kinds of tasks we can hope to learn, and from what kind of data
  • what types of guarantees might we hope to achieve
  • prove guarantees for practically successful algs (when will they succeed, how long will they take?)
  • develop new algs that provably meet desired criteria (potentially within new learning paradigms)

Interesting connections to other areas including:

  • Algorithms
  • Optimization
  • Probability & Statistics
  • Game Theory
  • Information Theory
  • Complexity Theory

SLIDE 9

Example: Supervised Classification

Supervised classification: decide which emails are spam and which are important.

Goal: use emails seen so far to produce a good prediction rule for future data.

Labels: spam / not spam.

SLIDE 10

Example: Supervised Classification

Represent each message by features (e.g., keywords, spelling, etc.); each example comes with a label.

Reasonable RULES:

  • Predict SPAM if unknown AND (money OR pills)
  • Predict SPAM if 2money + 3pills – 5known > 0

[Figure: labeled “+”/“–” examples that are linearly separable]
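The second rule above is a linear classifier over keyword features. A minimal sketch (the feature names come from the slide; the message counts below are illustrative, not from the deck):

```python
# Linear spam rule from the slide: predict SPAM if 2*money + 3*pills - 5*known > 0.

def predict_spam(features):
    """Return True (spam) if the weighted feature sum crosses the threshold 0."""
    weights = {"money": 2.0, "pills": 3.0, "known": -5.0}
    score = sum(w * features.get(name, 0) for name, w in weights.items())
    return score > 0

# A message mentioning "money" twice from an unknown sender:
print(predict_spam({"money": 2, "pills": 0, "known": 0}))  # True
# The same message from a known sender:
print(predict_spam({"money": 2, "pills": 0, "known": 1}))  # False
```

Learning then amounts to choosing the weights from the labeled emails rather than fixing them by hand.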
SLIDE 11

Two Main Aspects of Supervised Learning

Algorithm Design. How to optimize? Automatically generate rules that do well on observed data.

Confidence Bounds, Generalization Guarantees, Sample Complexity. Confidence for rule effectiveness on future data.

Well understood for passive supervised learning.
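A classical guarantee of this kind (not stated on the slide; the standard realizable-case PAC bound for a finite hypothesis class H) quantifies how many examples suffice:

```latex
% With probability at least 1-\delta, any hypothesis in H that is consistent
% with m i.i.d. labeled examples has true error at most \epsilon, provided
m \;\ge\; \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right).
```

The course's VC-dimension and Rademacher-complexity results extend this idea to infinite hypothesis classes.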

SLIDE 12

Using Unlabeled Data and Interaction for Learning

Application Areas: Computer Vision, Search/Information Retrieval, Computational Biology, Spam Detection, Medical Diagnosis, Robotics.

SLIDE 13

Massive Amounts of Raw Data

Billions of webpages, images, protein sequences. Only a tiny fraction can be annotated by human experts.

SLIDE 14

Semi-Supervised Learning

[Figure: raw data → expert labeler → labeled data (face / not face) → classifier]
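One common semi-supervised scheme (self-training; a sketch, not necessarily the algorithm the course covers) fits a classifier on the few labeled examples, then absorbs its own confident predictions on unlabeled data as pseudo-labels. The 1-D threshold "classifier" and the data below are purely illustrative:

```python
# Self-training sketch: fit on labeled data, pseudo-label confident
# unlabeled points, then refit on the enlarged labeled set.

def fit_threshold(points, labels):
    """Toy classifier: midpoint between the two class means."""
    pos = [x for x, y in zip(points, labels) if y == 1]
    neg = [x for x, y in zip(points, labels) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def self_train(labeled, unlabeled, margin=1.0):
    points, labels = list(labeled[0]), list(labeled[1])
    threshold = fit_threshold(points, labels)
    for x in unlabeled:
        if abs(x - threshold) >= margin:          # confident prediction only
            points.append(x)
            labels.append(1 if x > threshold else 0)
    return fit_threshold(points, labels)          # refit with pseudo-labels

threshold = self_train(([0.0, 4.0], [0, 1]), [0.5, 3.5, 2.1])
```

The point near the boundary (2.1) is left out; the two confident points are absorbed, sharpening the estimate of each class's region.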

SLIDE 15

Active Learning

[Figure: raw data → classifier selects informative examples → expert labeler provides labels (face / not face)]
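The most common selection rule is uncertainty sampling: query the expert on the unlabeled point the current classifier is least sure about. A sketch with the same toy 1-D threshold classifier (all names and values illustrative):

```python
# Pool-based active learning via uncertainty sampling: the most informative
# query is the unlabeled point closest to the current decision boundary.

def most_uncertain(pool, threshold):
    """Index of the unlabeled point nearest the decision boundary."""
    return min(range(len(pool)), key=lambda i: abs(pool[i] - threshold))

pool = [0.2, 1.9, 3.8, 2.4]          # unlabeled raw data
threshold = 2.0                       # current classifier's boundary
query = most_uncertain(pool, threshold)
print(pool[query])                    # 1.9 -- sent to the expert labeler
```

Points far from the boundary (0.2, 3.8) would be labeled confidently anyway; querying near the boundary is where a label changes the learned rule most.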

SLIDE 16

Other Protocols for Supervised Learning

  • Semi-Supervised Learning: using cheap unlabeled data in addition to labeled data.
  • Active Learning: the algorithm interactively asks for labels of informative examples.

Theoretical understanding was entirely lacking 10 years ago; lots of progress recently. We will cover some of these.

SLIDE 17

Distributed Learning

Many ML problems today involve massive amounts of data distributed across multiple locations. Often we would like a low-error hypothesis with respect to the overall distribution.

SLIDE 18

Distributed Learning

Data distributed across multiple locations (e.g., medical data).

SLIDE 19

Distributed Learning

Data distributed across multiple locations (e.g., scientific data).

SLIDE 20

Distributed Learning

  • Data distributed across multiple locations.
  • Each has a piece of the overall data pie.
  • To learn over the combined distribution D, must communicate.

Important question: how much communication? Plus, privacy & incentives.
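A simple low-communication baseline (one-shot parameter averaging; a sketch, not a method the slides specify) has each site fit a local model and send only its parameters, one round of communication instead of shipping raw data. Here the "model" is just a mean estimator:

```python
# One-shot averaging: each site communicates only its local parameters
# (and sample count), never its raw data.

def local_model(data):
    """Local 'model': the sample mean at one site."""
    return sum(data) / len(data)

def averaged_model(sites):
    """Average per-site parameters, weighted by local sample counts."""
    total = sum(len(s) for s in sites)
    return sum(local_model(s) * len(s) for s in sites) / total

sites = [[1.0, 3.0], [5.0, 7.0, 9.0]]     # data split across two locations
print(averaged_model(sites))               # 5.0, same as pooling all the data
```

For a mean estimator the weighted average exactly matches the centralized answer; for general learners it is only an approximation, which is where the communication-complexity questions on this slide come in.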
SLIDE 21

The World is Changing Machine Learning

Many competing resources & constraints. E.g.,

  • Computational efficiency (noise-tolerant algos)
  • Statistical efficiency
  • Communication
  • Human labeling effort
  • Privacy/Incentives

New approaches. E.g.,

  • Semi-supervised learning
  • Distributed learning
  • Interactive learning
  • Multi-task/transfer learning
  • Never ending learning
  • Deep Learning
SLIDE 22

Structure of the Class

Basic Learning Paradigm: Passive Supervised Learning

  • Simple algos and hardness results for supervised learning.
  • Basic models: PAC, SLT.
  • Standard Sample Complexity Results (VC dimension)
  • Weak-learning vs. Strong-learning
  • Classic, state-of-the-art algorithms: AdaBoost and SVM (kernel-based methods).
  • Modern Sample Complexity Results
  • Rademacher Complexity; localization
  • Margin analysis of Boosting and SVM
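AdaBoost, one of the algorithms listed above, turns a weak learner into a strong one by reweighting the data each round. A self-contained sketch on 1-D data with threshold "decision stumps" as the weak learners (the data and round count are illustrative):

```python
import math

# AdaBoost sketch: each round, fit the best weighted stump, upweight its
# mistakes, and add it to a weighted vote.

def best_stump(xs, ys, w):
    """Weak learner: (error, threshold, sign) stump minimizing weighted error."""
    best = None
    for t in sorted(set(xs)):
        for s in (1, -1):
            err = sum(wi for xi, yi, wi in zip(xs, ys, w)
                      if (s if xi > t else -s) != yi)
            if best is None or err < best[0]:
                best = (err, t, s)
    return best

def adaboost(xs, ys, rounds=3):
    n = len(xs)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, t, s = best_stump(xs, ys, w)
        err = max(err, 1e-10)                    # guard against zero error
        alpha = 0.5 * math.log((1 - err) / err)  # stump's vote weight
        ensemble.append((alpha, t, s))
        # Reweight: mistakes go up, correct examples go down; renormalize.
        w = [wi * math.exp(-alpha * yi * (s if xi > t else -s))
             for xi, yi, wi in zip(xs, ys, w)]
        z = sum(w)
        w = [wi / z for wi in w]

    def predict(x):
        vote = sum(a * (s if x > t else -s) for a, t, s in ensemble)
        return 1 if vote > 0 else -1
    return predict

predict = adaboost([0.0, 1.0, 2.0, 3.0], [-1, -1, 1, 1])
```

The margin analysis mentioned on the slide explains why the weighted vote keeps improving on test data even after training error hits zero.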
SLIDE 23

Structure of the Class

  • Incorporating Unlabeled Data in the Learning Process.
  • Incorporating Interaction in the Learning Process:
  • Active Learning
  • More general types of Interaction

Other Learning Paradigms

  • Distributed Learning.
  • Transfer learning/Multi-task learning/Life-long learning.
  • Deep Learning.
  • Foundations and algorithms for constraints/externalities, e.g., privacy, limited memory, and communication.

SLIDE 24

Structure of the Class

  • Online Learning, Optimization, and Game Theory; connections to Boosting.

Other Topics:

  • Methods for summarizing and making sense of massive datasets, including:
  • unsupervised learning.
  • spectral, combinatorial techniques.
  • streaming algorithms.
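A canonical online-learning algorithm with game-theoretic flavor is weighted majority (multiplicative weights): predict with a weighted vote of experts and halve the weight of each expert that errs, so the learner's mistakes stay within a logarithmic factor of the best expert's. A sketch (the experts and outcomes below are illustrative):

```python
# Weighted-majority sketch: online binary prediction with expert advice.
# Every expert that is wrong in a round has its weight multiplied by `penalty`.

def weighted_majority(expert_advice, outcomes, penalty=0.5):
    n = len(expert_advice[0])
    weights = [1.0] * n
    mistakes = 0
    for advice, outcome in zip(expert_advice, outcomes):
        yes = sum(w for w, a in zip(weights, advice) if a == 1)
        no = sum(w for w, a in zip(weights, advice) if a == 0)
        prediction = 1 if yes >= no else 0
        if prediction != outcome:
            mistakes += 1
        weights = [w * (penalty if a != outcome else 1.0)
                   for w, a in zip(weights, advice)]
    return mistakes, weights

# Three experts; the third is always right, so its weight stays at 1.0.
advice = [(1, 0, 1), (0, 1, 0), (1, 1, 0)]
outcomes = [1, 0, 0]
mistakes, weights = weighted_majority(advice, outcomes)
```

After a few rounds the reliable expert dominates the vote; the connection to Boosting mentioned above is that AdaBoost's example reweighting is the same multiplicative-update idea run in the other direction.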
SLIDE 25

Admin

  • Course web page:

http://www.cs.cmu.edu/~ninamf/courses/806/10-806-index.html

Two grading schemes: 1) Project Oriented.

  • Project [60%]
  • Take-home final [10%]
  • Hwks + grading [30%]

2) Homework Oriented.

  • Hwks + grading [60%]
  • Take-home final [10%]
  • Project [30%]
SLIDE 26

Admin

1) Project Oriented.

  • Project [60%]: explore a theoretical or empirical question; write-up, ideally aiming for a conference submission! Small groups OK.
  • Take-home final [10%]
  • Hwks + grading [30%]
  • Course web page:

http://www.cs.cmu.edu/~ninamf/courses/806/10-806-index.html

SLIDE 27

Admin

2) Homework Oriented.

  • Hwks + grading [60%]
  • Take-home final [10%]
  • Project [30%]: read a couple of papers and explain the ideas.
  • Course web page:

http://www.cs.cmu.edu/~ninamf/courses/806/10-806-index.html

SLIDE 28

Lectures in general

On the board; occasionally, slides will be used.