Good Machine Learning = Generalization

Good Machine Learning = Generalization
• Goal of machine learning: build models that generalize well to predicting new data
  - “Overfitting”: fitting the training data too well, so we lose generality of the model
    o Example: linear regression vs. Newton’s interpolating polynomial
    o The interpolating polynomial fits the training data perfectly!
    o Which would you rather use to predict a new data point?
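The comparison between linear regression and Newton's interpolating polynomial can be sketched in Python. This is only an illustration under stated assumptions: the data are made up (the line y = 2x + 1 plus small alternating noise), the helper names `newton_coefficients` and `newton_eval` are hypothetical, and NumPy is assumed to be installed.

```python
import numpy as np

def newton_coefficients(x, y):
    """Divided-difference coefficients of Newton's interpolating polynomial."""
    x = np.asarray(x, dtype=float)
    coef = np.array(y, dtype=float)
    for j in range(1, len(x)):
        # j-th order divided differences overwrite entries j..n-1
        coef[j:] = (coef[j:] - coef[j - 1:-1]) / (x[j:] - x[:-j])
    return coef

def newton_eval(coef, x_nodes, t):
    """Evaluate the Newton-form polynomial at t with Horner's scheme."""
    result = coef[-1]
    for k in range(len(coef) - 2, -1, -1):
        result = result * (t - x_nodes[k]) + coef[k]
    return result

# Made-up training data: a straight line plus small alternating noise.
x_train = np.arange(8, dtype=float)
noise = 0.1 * (-1.0) ** np.arange(8)
y_train = 2.0 * x_train + 1.0 + noise

# Model 1: least-squares line (linear regression).
slope, intercept = np.polyfit(x_train, y_train, 1)

# Model 2: degree-7 polynomial through all 8 training points.
coef = newton_coefficients(x_train, y_train)

# Largest training error of the interpolant (it should be essentially zero).
train_err_interp = max(
    abs(newton_eval(coef, x_train, t) - y) for t, y in zip(x_train, y_train)
)

# Predict a new data point just outside the training range.
x_new = 8.0
y_true = 2.0 * x_new + 1.0
linear_pred = slope * x_new + intercept
interp_pred = newton_eval(coef, x_train, x_new)

print("interpolant max training error:", train_err_interp)
print("linear regression error at x_new:", abs(linear_pred - y_true))
print("interpolating polynomial error at x_new:", abs(interp_pred - y_true))
```

On this toy data the interpolant reproduces every training point, yet it misses the new point by a wide margin, while the simple line stays close. That gap is the overfitting the slide describes.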

Email Classification
• Want to predict if an email is spam or not
  - Start with the input data
    o Consider a lexicon of m words (Note: in English m ≈ 100,000)
    o Define m indicator variables X = <X_1, X_2, …, X_m>
    o Each variable X_i denotes whether word i appeared in a document
  - Define output classes Y to be: {spam, non-spam}
  - Given a training set of N previous emails
    o For each email message, we have a training instance: X = <X_1, X_2, …, X_m>, noting for each word whether it appeared in the email
    o Each email message is also marked as spam or not (value of Y)
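The indicator-variable representation above might look like the following in Python. This is a minimal sketch: the five-word lexicon, the two example emails, and the function name `indicator_vector` are all hypothetical, and splitting on whitespace stands in for whatever tokenization a real system would use.

```python
def indicator_vector(email_text, lexicon):
    """Return [X_1, ..., X_m]: X_i is 1 if lexicon word i occurs in the email, else 0."""
    words = set(email_text.lower().split())
    return [1 if word in words else 0 for word in lexicon]

# Hypothetical tiny lexicon (m = 5); the slide assumes m is about 100,000.
lexicon = ["free", "meeting", "money", "report", "winner"]

# Hypothetical training set of N = 2 labeled emails.
training_emails = [
    ("You are a WINNER claim your free money", "spam"),
    ("Quarterly report attached for the meeting", "non-spam"),
]

X = [indicator_vector(text, lexicon) for text, _ in training_emails]
Y = [label for _, label in training_emails]

for features, label in zip(X, Y):
    print(features, label)
```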

What is Bayes Doing in My Mail Server?
• This is spam: Let’s get Bayesian on your spam:

    Content analysis details: (49.5 hits, 7.0 required)
    0.9 RCVD_IN_PBL     RBL: Received via a relay in Spamhaus PBL [93.40.189.29 listed in zen.spamhaus.org]
    1.5 URIBL_WS_SURBL  Contains an URL listed in the WS SURBL blocklist [URIs: recragas.cn]
    5.0 URIBL_JP_SURBL  Contains an URL listed in the JP SURBL blocklist [URIs: recragas.cn]
    5.0 URIBL_OB_SURBL  Contains an URL listed in the OB SURBL blocklist [URIs: recragas.cn]
    5.0 URIBL_SC_SURBL  Contains an URL listed in the SC SURBL blocklist [URIs: recragas.cn]
    2.0 URIBL_BLACK     Contains an URL listed in the URIBL blacklist [URIs: recragas.cn]
    8.0 BAYES_99        BODY: Bayesian spam probability is 99 to 100% [score: 1.0000]

Who was crazy enough to think of that?

Spam, Spam… Go Away!
• The constant battle with spam
  “And machine-learning algorithms developed to merge and rank large sets of Google search results allow us to combine hundreds of factors to classify spam.”
  Source: http://www.google.com/mail/help/fightspam/spamexplained.html

How Does This Do?
• After training, can test with another set of data
  - “Testing” set also has known values for Y, so we can see how often we were right/wrong in predictions for Y
  - Spam data
    o Email data set: 1789 emails (1578 spam, 211 non-spam)
    o First 1538 email messages (by time) used for training
    o Next 251 messages used to test the learned classifier
  - Criteria:
    o Precision = # correctly predicted class Y / # predicted class Y
    o Recall = # correctly predicted class Y / # real class Y messages

                              Spam                  Non-spam
                              Precision   Recall    Precision   Recall
    Words only                97.1%       94.3%     87.7%       93.4%
    Words + add’l features    100%        98.3%     96.2%       100%
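The two criteria can be computed per class as in the sketch below. The label lists are made up for illustration and are not the 251-message test set from the slide; the function name `precision_recall` is likewise hypothetical.

```python
def precision_recall(y_true, y_pred, target):
    """Precision and recall for one class, following the slide's definitions."""
    predicted = sum(1 for p in y_pred if p == target)   # predicted class Y
    actual = sum(1 for t in y_true if t == target)      # real class Y messages
    correct = sum(
        1 for t, p in zip(y_true, y_pred) if t == target and p == target
    )                                                   # correctly predicted class Y
    precision = correct / predicted if predicted else 0.0
    recall = correct / actual if actual else 0.0
    return precision, recall

# Made-up true labels and classifier predictions for five test messages.
y_true = ["spam", "spam", "spam", "non-spam", "non-spam"]
y_pred = ["spam", "spam", "non-spam", "spam", "non-spam"]

spam_precision, spam_recall = precision_recall(y_true, y_pred, "spam")
ham_precision, ham_recall = precision_recall(y_true, y_pred, "non-spam")

print("spam:     precision", spam_precision, "recall", spam_recall)
print("non-spam: precision", ham_precision, "recall", ham_recall)
```

Reporting both classes matters here because the data set is unbalanced (far more spam than non-spam), so a single accuracy number would hide how often legitimate mail is misfiled.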

A Little Text Analysis of the Governator
• Arnold Schwarzenegger’s actual veto letter:

Coincidence, You Ask?
• San Francisco Chronicle, Oct. 28, 2009: “Schwarzenegger’s press secretary, Aaron McLear, insisted Tuesday it was simply a ‘weird coincidence.’”
• Steve Piantadosi (grad student at MIT) blog post, Oct. 28, 2009:
  - “…assume that each word starting a line is chosen independently…”
  - “…[compute] the (token) frequency with which each letter appears at the start of a word…”
  - Multiply the probabilities for the letter starting each line to get the final answer: “one in 1 trillion”
• 50,000 times less likely than winning the CA lottery
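The calculation described in the blog post (estimate first-letter frequencies, assume independence, multiply) can be sketched as follows. Everything specific here is an assumption for illustration: the sample sentence stands in for a real corpus, the three-letter acrostic "tab" is arbitrary, and the function names are hypothetical. The sketch does not reproduce the post's "one in 1 trillion" figure.

```python
from collections import Counter
from math import prod

def first_letter_frequencies(text):
    """Token frequency with which each letter appears at the start of a word."""
    letters = [w[0].lower() for w in text.split() if w[0].isalpha()]
    counts = Counter(letters)
    total = len(letters)
    return {letter: n / total for letter, n in counts.items()}

def acrostic_probability(acrostic, freq):
    """Probability that independently chosen line-starting words spell `acrostic`."""
    return prod(freq.get(ch, 0.0) for ch in acrostic.lower())

# Made-up stand-in for a corpus of English text.
sample_text = (
    "the bill is one of many that the assembly sent to the governor this year"
)

freq = first_letter_frequencies(sample_text)
p = acrostic_probability("tab", freq)   # arbitrary three-letter example
print("probability of the acrostic by chance:", p)
```

Each extra line multiplies in another factor well below one, which is why the chance of a longer acrostic shrinks so quickly under the independence assumption.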

Now You Too Can Build Terminators!
• Be careful!
  - “My CPU is a neural net processor, a learning computer. The more contact I have with humans, the more I learn.”

Additional Thoughts on Machine Learning
• Data analytics has enormous potential for decision-making
  - Analyzing credit risk (e.g., credit cards, mortgages)
  - Widely used for advertising and targeted marketing
    o E.g., Target models pregnancy
  - Large volumes of data are being collected on you
    o E.g., Google, Facebook, Amazon, Apple, etc.
    o Data vendors: Harte Hanks, Acxiom
• Creates new challenges
  - Validity
  - Privacy
  - Security

It’s Not Just Private Companies… From http://data-informed.com … While specifics about the text mining tools and algorithms are still to emerge, two things are clear: the government has been working for years with advanced analytics developers and machine learning experts to drill into reservoirs of unstructured data. … And the NSA is building a mammoth data center in Utah to support its efforts. It’s estimated that the facility can store 20 terabytes—the equivalent of all the data in the Library of Congress—a minute.

Discussion
• To what extent are you willing to give up privacy for functionality or security?
• Who is responsible when:
  - An important email is deleted as junk?
  - A self-driving car gets in an accident?
