SLIDE 1

Introduction to Machine Learning

Yifeng Tao
School of Computer Science, Carnegie Mellon University

Slides adapted from Tom Mitchell, Eric Xing, Barnabas Poczos

SLIDE 2

Logistics

  • Course website: http://www.cs.cmu.edu/~yifengt/courses/machine-learning (slides uploaded after lecture)
  • Time: Mon-Fri, 9:50-11:30am lecture; 11:30am-12:00pm discussion
  • Contact: yifengt@cs.cmu.edu

SLIDE 3

What is machine learning?

[Diagram: probability, statistics, calculus, and linear algebra feed into machine learning, which powers computer vision, natural language processing, and computational biology]

SLIDE 4

Computer vision

  • Object detection

[Figure from https://www.cvdeveloper.com/projects and Alex Krizhevsky et al.]

SLIDE 5

Natural language processing

  • Named entity recognition (NER), translation, document classification…

[Figure from Jacob Devlin et al.]

SLIDE 6

Computational Biology

  • DNA-protein binding

[Figure from Haoyang Zeng et al.]

SLIDE 7

What is machine learning?

  • What are we talking about when we talk about AI and ML?

[Diagram: nested circles showing deep learning within machine learning within artificial intelligence]

SLIDE 8

What comes after this introduction?

[Diagram: topics that build on machine learning: conditional probability, probabilistic graphical models, deep learning, optimization, learning theory]

SLIDE 9

What is machine learning?

  • Methods that can help generalize information from the observed data, so that it can be used to make better decisions in the future.
  • Supervised learning: given a set of features and values, learn a model that will predict a label for a new feature set.
  • Regression: predict continuous values
  • Classification: predict discrete labels
  • Unsupervised learning: discover patterns in data
  • And more! E.g., transfer learning, semi-supervised learning, reinforcement learning, etc.

[Slide from Eric Xing et al.]

SLIDE 10

Supervised learning

  • Goal of supervised learning: construct a predictor that minimizes a risk (performance measure) on unseen data
  • This is not the same as minimizing empirical errors on the observed data, which is why we separate training and test sets (a minimal sketch follows below)

[Slide from Barnabas Poczos et al.]
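
To make the training/test distinction concrete, here is a minimal sketch using scikit-learn (the package mentioned on slide 46); the synthetic data and the linear model are illustrative assumptions, not from the slides:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    # Illustrative synthetic data: 200 samples, 5 features.
    rng = np.random.RandomState(0)
    X = rng.randn(200, 5)
    y = X @ rng.randn(5) + 0.1 * rng.randn(200)

    # Hold out a test set: risk is estimated on data the model never saw.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    model = LinearRegression().fit(X_train, y_train)
    print("train MSE:", mean_squared_error(y_train, model.predict(X_train)))
    print("test  MSE:", mean_squared_error(y_test, model.predict(X_test)))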

SLIDE 11

Topics

  • Supervised learning: linear models
  • Kernel machines: SVMs and duality
  • Unsupervised learning: latent space analysis and clustering
  • Supervised learning: decision tree, kNN and model selection
  • Learning theory
  • Neural network (basics)
  • Deep learning in CV and NLP
  • Probabilistic graphical models
  • Reinforcement learning and its application in clinical text mining
  • Attention mechanism and transfer learning in precision medicine
SLIDE 12

Supervised learning: linear models

Yifeng Tao, Lecture 1, May 13, 2019

SLIDE 13

Example of regression

  • Predicting the review score of a restaurant from factors such as price, distance, and cuisine

i   Price   Distance   Cuisine   Review
1   30      21         7         4
2   15      12         8         2
3   27      53         9         5

[Slide from Barnabas Poczos et al.]
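
As a concrete illustration, a minimal scikit-learn fit on the three-row toy table above (far too little data for a real model; purely illustrative):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Toy table from the slide: columns are Price, Distance, Cuisine.
    X = np.array([[30, 21, 7],
                  [15, 12, 8],
                  [27, 53, 9]])
    y = np.array([4, 2, 5])  # review scores

    model = LinearRegression().fit(X, y)
    print(model.predict([[25, 30, 8]]))  # predicted review for a new restaurant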

SLIDE 14

Empirical Risk Minimization (ERM)

  • More in the learning theory part…

[Slide from Barnabas Poczos et al.]
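
The ERM objective itself did not survive extraction; the standard formulation, for a loss L and hypothesis class F, is:

    \hat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} L(f(x_i), y_i)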

SLIDE 15

Linear Regression

[Slide from Barnabas Poczos et al.]
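
The equations on this slide were lost in extraction; the standard linear model, with design matrix A in R^{n x p} (one row per sample) and noise ε, is:

    y = A\beta + \varepsilon, \qquad y \in \mathbb{R}^{n}, \; \beta \in \mathbb{R}^{p}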

SLIDE 16

Linear Regression

[Slide from Barnabas Poczos et al.]

SLIDE 17

Least Squares Estimator

[Slide from Barnabas Poczos et al.]
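
Again the formula is missing from the text; the least squares estimator minimizes the residual sum of squares:

    \hat{\beta} = \arg\min_{\beta} \| y - A\beta \|_2^2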

SLIDE 18

Least Squares Estimator

[Slide from Barnabas Poczos et al.]

SLIDE 19

Normal Equations

[Slide from Barnabas Poczos et al.]
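
Setting the gradient of the least squares objective to zero yields the normal equations; when A^T A is invertible, they give the closed-form solution:

    A^{\top} A \hat{\beta} = A^{\top} y \quad \Longrightarrow \quad \hat{\beta} = (A^{\top} A)^{-1} A^{\top} y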

SLIDE 20

Cases where A^T A is not invertible

  • Gene expression data: roughly 20,000 gene features but only 50-4,000 samples, so A^T A is singular
  • Remedy: regularization, e.g., the Lasso

[Slide from Barnabas Poczos et al.]
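
A minimal sketch of the Lasso in the p >> n regime with scikit-learn; the synthetic data and the regularization strength alpha=0.1 are illustrative choices:

    import numpy as np
    from sklearn.linear_model import Lasso

    # p >> n: ordinary least squares is ill-posed because A^T A is singular.
    rng = np.random.RandomState(0)
    n, p = 100, 2000
    A = rng.randn(n, p)
    beta_true = np.zeros(p)
    beta_true[:10] = 3.0                 # only 10 truly relevant features
    y = A @ beta_true + 0.5 * rng.randn(n)

    lasso = Lasso(alpha=0.1, max_iter=10000).fit(A, y)
    print("nonzero coefficients:", np.sum(lasso.coef_ != 0))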

SLIDE 21

Geometric Interpretation

[Slide from Barnabas Poczos et al.]
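
The picture on the slide is not in the text; algebraically, the least squares fit is the orthogonal projection of y onto the column space of A (assuming A^T A invertible), with a residual orthogonal to that space:

    \hat{y} = A\hat{\beta} = A (A^{\top} A)^{-1} A^{\top} y, \qquad A^{\top} (y - \hat{y}) = 0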

SLIDE 22

Pseudo Inverse (skip)

[Slide from Barnabas Poczos et al.]

SLIDE 23

Pseudo Inverse (skip)

[Slide from Barnabas Poczos et al.]

SLIDE 24

Polynomial Regression

[Slide from Barnabas Poczos et al.]
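
Polynomial regression is just linear regression on basis-expanded features; a minimal scikit-learn sketch (degree 3 is an arbitrary illustrative choice):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.RandomState(0)
    x = np.linspace(-3, 3, 50).reshape(-1, 1)
    y = np.sin(x).ravel() + 0.1 * rng.randn(50)

    # Expand x into (1, x, x^2, x^3), then fit an ordinary linear model.
    model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
    model.fit(x, y)
    print(model.predict([[1.5]]))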

SLIDE 25

Maximum Likelihood Estimation (MLE)

  • Goal: estimate distribution parameters θ from a dataset of n independent, identically distributed (i.i.d.), fully observed training cases
  • Maximum Likelihood Estimation (MLE) is one of the most common estimators
  • Under the i.i.d. and full-observability assumptions, pick the setting of the parameters most likely to have generated the data we saw (see the formulas below)
  • Maximum conditional likelihood: maximize the conditional probability of the labels given the features, rather than the joint likelihood

[Slide from Eric Xing et al.]
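
The estimator formulas were lost in extraction; the standard statements, for data x_1, …, x_n and labels y_i where applicable, are:

    \hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \prod_{i=1}^{n} p(x_i; \theta) = \arg\max_{\theta} \sum_{i=1}^{n} \log p(x_i; \theta)

    \hat{\theta}_{\mathrm{MCLE}} = \arg\max_{\theta} \sum_{i=1}^{n} \log p(y_i \mid x_i; \theta)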

SLIDE 26

Least Squares and MLE

[Slide from Barnabas Poczos et al.]

SLIDE 27

Least Squares and MLE

  • By the independence assumption, the conditional likelihood factorizes over the samples
  • Therefore, maximizing the likelihood is equivalent to minimizing the sum of squared errors (a derivation sketch follows below)

[Slide from Eric Xing et al.]
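
A sketch of the derivation, assuming the Gaussian noise model y_i = a_i^T β + ε_i with ε_i ~ N(0, σ²) i.i.d.:

    \ell(\beta) = \sum_{i=1}^{n} \log p(y_i \mid a_i; \beta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - a_i^{\top}\beta)^2

    \arg\max_{\beta} \ell(\beta) = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - a_i^{\top}\beta)^2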

SLIDE 28

Regularized Least Squares

  • Recall the polynomial regression example
  • Intuition for overfitting: the fitted weights become very large
  • How can we solve or alleviate it? (see the regularized objective below)

[Figure from Christopher M. Bishop]
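
One standard answer is to penalize large weights. The ridge (L2-regularized) least squares objective and its closed-form solution, in the notation used above:

    \hat{\beta}_{\mathrm{ridge}} = \arg\min_{\beta} \| y - A\beta \|_2^2 + \lambda \|\beta\|_2^2 = (A^{\top} A + \lambda I)^{-1} A^{\top} y

Note that A^T A + λI is invertible for any λ > 0, which also resolves the singular case from the earlier slide.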

SLIDE 29

Maximum a Posteriori Estimation (MAP)

  • Bayes' theorem: the posterior equals the likelihood times the prior, up to a normalizing constant
  • Maximum a posteriori (MAP) estimator: maximize the posterior (see the formulas below)
  • This allows us to capture uncertainty about the model in a principled way

Yifeng Tao Carnegie Mellon University 29

SLIDE 30

Regularized Least Squares and MAP

[Slide from Barnabas Poczos et al.]
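
The content of these two slides is the prior-penalty correspondence; a sketch under the Gaussian noise model above: a Gaussian prior β ~ N(0, τ²I) turns MAP into ridge regression, since

    \hat{\beta}_{\mathrm{MAP}} = \arg\min_{\beta} \frac{1}{2\sigma^2} \| y - A\beta \|_2^2 + \frac{1}{2\tau^2} \|\beta\|_2^2 = \arg\min_{\beta} \| y - A\beta \|_2^2 + \lambda \|\beta\|_2^2, \qquad \lambda = \sigma^2 / \tau^2

A Laplacian prior on β similarly yields the L1 (Lasso) penalty.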

SLIDE 31

Regularized Least Squares and MAP

[Slide from Barnabas Poczos et al.]

SLIDE 32

Example of classification

  • Predicting the review (Good or Bad) of a restaurant from features

i   Price   Distance   Cuisine   Review
1   30      21         7         Good
2   15      12         8         Bad
3   27      53         9         Good

[Slide from Barnabas Poczos et al.]

SLIDE 33

Logistic regression

  • Logistic/Sigmoid function:
  • p: the probability that y is 1

[Figure from Wikipedia]
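
The function referenced above, in standard notation:

    \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad p = P(y = 1 \mid x) = \sigma(\beta^{\top} x)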

SLIDE 34

Logistic regression: MLE

  • The likelihood function, and hence the conditional log-likelihood l(β), are shown below
  • The negative log-likelihood −l(β) is also referred to as the “cross-entropy loss”
  • Good news: l(β) is a concave function of β
  • Bad news: there is no closed-form solution that maximizes l(β)
  • Solution: optimization algorithms (to be discussed)

[Slide from Tom Mitchell et al.]
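
A sketch of the formulas, writing p_i = σ(β^T x_i) for labels y_i in {0, 1}:

    L(\beta) = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}

    l(\beta) = \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]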

SLIDE 35

Logistic regression: MAP

  • Gaussian prior on β: corresponds to an L2 (ridge) penalty on the log-likelihood
  • Laplacian prior on β: corresponds to an L1 (lasso) penalty

SLIDE 36

Maximize conditional likelihood: gradient ascent

  • Gradient ascent algorithm: iterate until the change is smaller than ε (a sketch follows below)
  • This applies to linear regression as well, although there a closed-form solution exists

[Figure: gradient ascent trajectories from two different initial points]

[Slide from Tom Mitchell et al.]
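
A minimal sketch of gradient ascent for the logistic regression log-likelihood above; the learning rate, stopping tolerance, and synthetic data are illustrative assumptions:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit_logistic(X, y, lr=0.1, eps=1e-6, max_iter=10000):
        """Maximize l(beta) by gradient ascent; the gradient is X^T (y - p)."""
        beta = np.zeros(X.shape[1])
        for _ in range(max_iter):
            step = lr * X.T @ (y - sigmoid(X @ beta)) / len(y)
            beta += step
            if np.linalg.norm(step) < eps:   # stop once the update is tiny
                break
        return beta

    rng = np.random.RandomState(0)
    X = rng.randn(200, 3)
    y = (rng.rand(200) < sigmoid(X @ np.array([1.0, -2.0, 0.5]))).astype(float)
    print(fit_logistic(X, y))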

SLIDE 37

Bayesian classifier

i   Price   Distance   Cuisine   Review
1   30      21         7         Good
2   15      12         8         Bad
3   27      53         9         Good

  • Generative model vs discriminative model

[Slide from Tom Mitchell et al.]

SLIDE 38

Bayesian classifier

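
The slide's equation survives only as a stray character in the text; the standard Bayes classification rule it presumably showed is:

    \hat{y} = \arg\max_{y_k} P(Y = y_k \mid X) = \arg\max_{y_k} P(X \mid Y = y_k)\, P(Y = y_k)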

SLIDE 39

Naïve Bayes

  • The Bayesian classifier requires a large number of samples to train
  • Naïve Bayes assumes the X_i are conditionally independent, given Y
  • Therefore the classification rule for X^new = (X_1, …, X_n) is given below

[Slide from Tom Mitchell et al.]
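
The classification rule, in standard notation:

    y^{\mathrm{new}} = \arg\max_{y_k} P(Y = y_k) \prod_{i=1}^{n} P(X_i^{\mathrm{new}} \mid Y = y_k)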

SLIDE 40

Naïve Bayes Algorithm

  • Very fast to train/estimate!

[Slide from Tom Mitchell et al.]
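
Training amounts to counting. For discrete X_i, the estimates (shown here with add-one Laplace smoothing, an assumption on my part since the slide's formulas are lost; |X_i| is the number of values X_i can take) are:

    \hat{P}(Y = y_k) = \frac{\#\{j : y^{(j)} = y_k\}}{n}, \qquad \hat{P}(X_i = x \mid Y = y_k) = \frac{\#\{j : x_i^{(j)} = x, \; y^{(j)} = y_k\} + 1}{\#\{j : y^{(j)} = y_k\} + |X_i|}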

SLIDE 41

Bag of words: model the documents

  • 8th floor of Gates building, CMU

[Figure from https://twitter.com/smithamilli/status/837153616116985856]

SLIDE 42

Document classification

  • The (independent) probability that the i-th word of a given document occurs in a document from class C
  • The probability that a given document D contains all of its words, given a class C
  • What is the probability that a given document D belongs to a given class C? (the formulas follow below)

[Slide from Wikipedia]
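
The formulas, in the standard naïve Bayes bag-of-words form, with w_i the i-th word of D:

    P(D \mid C) = \prod_{i} P(w_i \mid C), \qquad P(C \mid D) = \frac{P(C)\, P(D \mid C)}{P(D)} \propto P(C) \prod_{i} P(w_i \mid C)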

SLIDE 43

Continuous Xi in Naïve Bayes

[Slide from Tom Mitchell et al.]
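
For continuous features, a common choice (Gaussian Naïve Bayes, GNB) models each feature as class-conditionally Gaussian:

    P(X_i = x \mid Y = y_k) = \frac{1}{\sqrt{2\pi}\, \sigma_{ik}} \exp\left( -\frac{(x - \mu_{ik})^2}{2\sigma_{ik}^2} \right)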

SLIDE 44

Estimating parameters of GNB

  • Y discrete, Xi continuous

[Slide from Tom Mitchell et al.]
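
Estimation again reduces to per-class sample statistics; writing δ(·) for the indicator function over training samples j:

    \hat{\mu}_{ik} = \frac{\sum_{j} X_i^{(j)}\, \delta(Y^{(j)} = y_k)}{\sum_{j} \delta(Y^{(j)} = y_k)}, \qquad \hat{\sigma}_{ik}^2 = \frac{\sum_{j} (X_i^{(j)} - \hat{\mu}_{ik})^2\, \delta(Y^{(j)} = y_k)}{\sum_{j} \delta(Y^{(j)} = y_k)}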

SLIDE 45

Inference of Gaussian Naïve Bayes

[Slide from Tom Mitchell et al.]
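
The punchline, restated in the take-home message: under GNB with class-independent variances σ_ik = σ_i, the posterior takes exactly the logistic regression form, with weights that are functions of the Gaussian parameters and the class prior:

    P(Y = 1 \mid X) = \frac{1}{1 + \exp\left( -\left( \beta_0 + \sum_i \beta_i X_i \right) \right)}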

SLIDE 46

Linear models in application

  • R: glmnet package
    • Comprehensive regression formats: linear / logistic / Cox regression…
    • Flexible penalty forms: ridge, Lasso, and elastic net regression
    • Optimization algorithms with many heuristics, e.g., coordinate descent and warm starts…
    • Easy to analyze results in a few lines
  • Python: scikit-learn package (a minimal sketch follows below)

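
A minimal scikit-learn sketch of the same workflow, using ridge and lasso variants of linear regression on assumed synthetic data:

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.RandomState(0)
    X = rng.randn(100, 20)
    y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.randn(100)

    # L2 (ridge) and L1 (lasso) penalties; alpha sets the strength.
    print(Ridge(alpha=1.0).fit(X, y).coef_[:3])
    print(Lasso(alpha=0.1).fit(X, y).coef_[:3])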

SLIDE 47

Take home message

  • Machine learning learns a model from observed data that can be generalized to new data
  • Two paradigms in supervised learning:
    • Regression
    • Classification
  • Objective functions and their probabilistic explanation (MLE):
    • For linear regression, minimizing MSE is equivalent to MLE (under Gaussian noise)
    • For logistic regression, minimizing cross-entropy is equivalent to MLE
  • Regularization and its probabilistic explanation (MAP):
    • Ridge is equivalent to a Gaussian prior
    • Lasso is equivalent to a Laplacian prior
  • Generative models vs discriminative models for classification
  • The most important application of Naïve Bayes: document classification
  • The inference of Naïve Bayes can be reduced to the same form as logistic regression

SLIDE 48

References

  • Eric Xing, Tom Mitchell. 10701 Introduction to Machine Learning. http://www.cs.cmu.edu/~epxing/Class/10701-06f/
  • Barnabás Póczos. 10715 Advanced Introduction to Machine Learning. https://sites.google.com/site/10715advancedmlintro2017f/lectures
  • Eric Xing, Ziv Bar-Joseph. 10701 Introduction to Machine Learning. http://www.cs.cmu.edu/~epxing/Class/10701/
  • Christopher M. Bishop. Pattern Recognition and Machine Learning.
  • Wikipedia
