Introduction to Machine Learning (NPFL129, Lecture 1), Milan Straka, October 07, 2019


  1. NPFL129, Lecture 1: Introduction to Machine Learning. Milan Straka, October 07, 2019. Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics.

  2. Machine Learning. A possible definition of learning from Mitchell (1997): A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
  Task T: classification (assigning one of k categories to a given input x ∈ R^d), regression (producing a number for a given input), structured prediction, denoising, density estimation, …
  Experience E: supervised (usually a dataset with desired outcomes, called labels or targets), unsupervised (usually data without any annotation: raw text, raw images, …), reinforcement learning, semi-supervised learning, …
  Measure P: accuracy, error rate, F-score, …

  3. Deep Learning Highlights: image recognition, object detection, image segmentation, human pose estimation, image labeling, visual question answering, speech recognition and generation, lip reading, machine translation, machine translation without parallel data, chess, Go and Shogi, multiplayer Capture the Flag.

  4. Introduction to Machine Learning History. https://www.slideshare.net/deview/251-implementing-deep-learning-using-cu-dnn/4

  5. Machine and Representation Learning. Figure 1.5, page 10 of Deep Learning Book, http://deeplearningbook.org.

  6. Basic Machine Learning Settings. Assume we have an input x ∈ R^d. Then the two basic ML tasks are:
  1. regression: the goal is to predict a real-valued target variable t ∈ R for the given input;
  2. classification: assuming a fixed set of K labels, the goal is to choose a corresponding label/class for a given input. We can predict the class only, or we can predict the whole distribution of all class probabilities.
  We usually have a training set, which is assumed to consist of examples (x, t) generated independently from a data generating distribution. The goal of optimization is to match the training set as well as possible. However, the goal of machine learning is to perform well on previously unseen data, i.e., to achieve the lowest generalization error or test error. We typically estimate it using a test set of examples independent of the training set, but generated by the same data generating distribution.
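A minimal sketch of the training/test distinction, using made-up synthetic data and a deliberately trivial constant predictor (all names and values below are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data from an assumed data generating distribution: t = 2x + Gaussian noise.
x = rng.uniform(0, 1, size=100)
t = 2 * x + rng.normal(scale=0.1, size=100)

# Training set and an independent test set from the same distribution.
x_train, t_train = x[:80], t[:80]
x_test, t_test = x[80:], t[80:]

# Trivial "model": always predict the mean target seen during training.
prediction = t_train.mean()

train_error = np.mean((prediction - t_train) ** 2)
test_error = np.mean((prediction - t_test) ** 2)  # estimate of the generalization error
print(train_error, test_error)
```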

  7. Notation.
  a, a, A, A: scalar (integer or real), vector, matrix, tensor
  a, a, A: scalar, vector, matrix random variable
  df/dx: derivative of f with respect to x
  ∂f/∂x: partial derivative of f with respect to x
  ∇_x f(x): gradient of f with respect to x, i.e., (∂f(x)/∂x_1, ∂f(x)/∂x_2, …, ∂f(x)/∂x_n)
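To make the gradient notation concrete, here is a small self-contained check (my own example, not from the slides): for f(x) = x^T x the gradient is 2x, which a central finite-difference approximation should reproduce.

```python
import numpy as np

def f(x):
    return x @ x  # f(x) = x^T x, whose gradient is 2x

def numerical_gradient(f, x, eps=1e-6):
    grad = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)  # central difference for ∂f/∂x_i
    return grad

x = np.array([1.0, -2.0, 3.0])
print(numerical_gradient(f, x))  # close to 2 * x = [2, -4, 6]
```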

  8. Example Dataset. Assume we have the following data, generated from an underlying curve by adding a small amount of Gaussian noise. Figure 1.2 of Pattern Recognition and Machine Learning.
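A sketch of how such a dataset could be generated; the underlying curve sin(2πx), the number of points, and the noise scale are assumptions in the spirit of the PRML figure, not values stated here.

```python
import numpy as np

rng = np.random.default_rng(1)

N = 10
x = np.linspace(0, 1, N)
# Assumed underlying curve plus a small amount of Gaussian noise.
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)
```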

  9. Linear Regression. Given an input value x ∈ R^D, one of the simplest models to predict a target real value is linear regression:
  f(x; w, b) = x_1 w_1 + x_2 w_2 + … + x_D w_D + b = ∑_{i=1}^D x_i w_i + b = x^T w + b.
  The w are usually called weights and b is called bias. Sometimes it is convenient not to deal with the bias separately. Instead, we might enlarge the input vector x by padding it with a value 1 and consider only x^T w, where the role of the bias is accomplished by the last weight. Therefore, when we say "weights", we usually mean both weights and biases.
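A minimal NumPy sketch of this prediction; the concrete values of x, w, and b are placeholders.

```python
import numpy as np

x = np.array([1.5, -0.7, 2.0])   # input x ∈ R^D
w = np.array([0.3, 0.8, -0.5])   # weights
b = 0.1                          # bias

# f(x; w, b) = x_1 w_1 + ... + x_D w_D + b = x^T w + b
prediction = x @ w + b
```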

  10. Separate Bias vs. Padding with Ones.
  Using an explicit bias term in the form of f(x) = x^T w + b:
  ⎡ x_11  x_12 ⎤   ⎡ w_1 ⎤       ⎡ x_11 w_1 + x_12 w_2 + b ⎤
  ⎢ x_21  x_22 ⎥ · ⎣ w_2 ⎦ + b = ⎢ x_21 w_1 + x_22 w_2 + b ⎥
  ⎢  ⋮     ⋮   ⎥                 ⎢            ⋮            ⎥
  ⎣ x_n1  x_n2 ⎦                 ⎣ x_n1 w_1 + x_n2 w_2 + b ⎦
  With extra padding in X and an additional weight representing the bias:
  ⎡ x_11  x_12  1 ⎤   ⎡ w_1 ⎤   ⎡ x_11 w_1 + x_12 w_2 + b ⎤
  ⎢ x_21  x_22  1 ⎥ · ⎢ w_2 ⎥ = ⎢ x_21 w_1 + x_22 w_2 + b ⎥
  ⎢  ⋮     ⋮    ⋮ ⎥   ⎣  b  ⎦   ⎢            ⋮            ⎥
  ⎣ x_n1  x_n2  1 ⎦             ⎣ x_n1 w_1 + x_n2 w_2 + b ⎦
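A quick sketch (with made-up values) verifying that padding X with a column of ones and appending the bias to the weight vector yields the same predictions as the explicit bias term:

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
w = np.array([0.5, -1.0])
b = 0.25

# Explicit bias term: Xw + b.
y_explicit = X @ w + b

# Pad X with ones and fold the bias into the weight vector.
X_padded = np.concatenate([X, np.ones((len(X), 1))], axis=1)
w_padded = np.concatenate([w, [b]])
y_padded = X_padded @ w_padded

assert np.allclose(y_explicit, y_padded)
```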

  11. Linear Regression. Assume we have a dataset of N input values x_1, …, x_N and targets t_1, …, t_N. To find the values of the weights, we usually minimize an error function between the real target values and their predictions. A popular and simple error function is the mean squared error:
  MSE(w) = (1/N) ∑_{i=1}^N (f(x_i; w) − t_i)^2.
  Often, the sum of squares
  (1/2) ∑_{i=1}^N (f(x_i; w) − t_i)^2
  is used instead, because the math comes out nicer. Figure 1.3 of Pattern Recognition and Machine Learning.
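Both error functions in NumPy, as a sketch with placeholder predictions and targets; they differ only by the constant factor N/2, so they share the same minimizer.

```python
import numpy as np

predictions = np.array([1.1, 1.9, 3.2])
targets = np.array([1.0, 2.0, 3.0])

mse = np.mean((predictions - targets) ** 2)                  # MSE(w)
sum_of_squares = 0.5 * np.sum((predictions - targets) ** 2)  # (1/2) Σ (f(x_i; w) − t_i)^2
```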

  12. Linear Regression. There are several ways to minimize the error function, but in the case of linear regression and the sum of squares error, there exists an explicit solution. Our goal is to minimize the following quantity:
  (1/2) ∑_{i=1}^N (x_i^T w − t_i)^2.
  Note that if we denote by X ∈ R^{N×D} the matrix of input values with x_i on a row and by t ∈ R^N the vector of target values, we can rewrite the minimized quantity as
  (1/2) ||Xw − t||^2.
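A small check (placeholder values) that the matrix form equals the per-example sum:

```python
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
w = np.array([0.5, -1.0])
t = np.array([1.0, 2.0, 3.0])

per_example = 0.5 * np.sum((X @ w - t) ** 2)
matrix_form = 0.5 * np.linalg.norm(X @ w - t) ** 2
assert np.isclose(per_example, matrix_form)
```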

  13. Linear Regression. In order to find a minimum of (1/2) ∑_{i=1}^N (x_i^T w − t_i)^2, we can inspect the values where the derivative of the error function with respect to each weight w_j is zero:
  ∂/∂w_j (1/2) ∑_{i=1}^N (x_i^T w − t_i)^2 = (1/2) ∑_{i=1}^N 2 (x_i^T w − t_i) x_ij = ∑_{i=1}^N x_ij (x_i^T w − t_i).
  Therefore, we want for all j that ∑_{i=1}^N x_ij (x_i^T w − t_i) = 0. We can write all these equations together using matrix notation as X^T (Xw − t) = 0 and rewrite it to
  X^T X w = X^T t.
  The matrix X^T X is of size D×D. If it is regular, we can compute its inverse, and therefore
  w = (X^T X)^{-1} X^T t.
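A sketch of the closed-form solution on synthetic data; solving the normal equations with np.linalg.solve instead of forming the inverse explicitly is my choice here (numerically preferable, equivalent when X^T X is regular).

```python
import numpy as np

def fit_linear_regression(X, t):
    # Solve X^T X w = X^T t; assumes X^T X is regular.
    return np.linalg.solve(X.T @ X, X.T @ t)

# Placeholder synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
t = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

w = fit_linear_regression(X, t)
print(X.T @ (X @ w - t))  # the gradient X^T(Xw − t) is numerically zero at the minimum
```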

  14. Linear Regression.
  Input: dataset (X ∈ R^{N×D}, t ∈ R^N).
  Output: weights w ∈ R^D minimizing the MSE of linear regression.
  w ← (X^T X)^{-1} X^T t.
  The algorithm has complexity O(N D^2), assuming N ≥ D. When the matrix X^T X is singular, we can solve X^T X w = X^T t using SVD, which will be demonstrated in the next lecture.
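The SVD-based treatment of the singular case is deferred to the next lecture; as a placeholder, np.linalg.lstsq (which internally uses an SVD-based least-squares solver) handles both the regular and the singular case.

```python
import numpy as np

def fit_linear_regression_lstsq(X, t):
    # Least-squares solution of Xw ≈ t via SVD; works even when X^T X is singular.
    w, residuals, rank, singular_values = np.linalg.lstsq(X, t, rcond=None)
    return w
```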

  15. Linear Regression Example. Assume our input vectors comprise x = (x^0, x^1, …, x^M), for M ≥ 0. Figure 1.4 of Pattern Recognition and Machine Learning.
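A sketch of this polynomial-features example for one value of M; the data generation reuses the sin(2πx) assumption from the earlier sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=10)  # assumed underlying curve

M = 3
# Design matrix with rows (x^0, x^1, ..., x^M); the x^0 column plays the role of the bias.
X = np.stack([x ** m for m in range(M + 1)], axis=1)

w, *_ = np.linalg.lstsq(X, t, rcond=None)
predictions = X @ w
```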

  16. Linear Regression Example. To plot the error, the root mean squared error RMSE = √MSE is frequently used. The displayed error nicely illustrates two main challenges in machine learning: underfitting and overfitting. Figure 1.5 of Pattern Recognition and Machine Learning.
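RMSE in code, as a one-line helper (placeholder naming):

```python
import numpy as np

def rmse(predictions, targets):
    return np.sqrt(np.mean((predictions - targets) ** 2))  # RMSE = sqrt(MSE)
```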

  17. Model Capacity. We can control whether a model underfits or overfits by modifying its capacity: representational capacity, effective capacity. Figure 5.3, page 115 of Deep Learning Book, http://deeplearningbook.org.

  18. Linear Regression Overfitting. Note that employing more data also usually alleviates overfitting (the relative capacity of the model is decreased). Figure 1.6 of Pattern Recognition and Machine Learning.
