

  1. Statistical Filtering and Control for AI and Robotics
     Part II. Linear methods for regression & Kalman filtering
     Riccardo Muradore

  2. Outline
     ◮ Linear Methods for Regression
     ◮ Gaussian filter
     ◮ Stochastic model
     ◮ Kalman filtering
     ◮ Kalman smoother

  3. References
     These lectures are based on the following books:
     ◮ Sebastian Thrun, Wolfram Burgard and Dieter Fox, “Probabilistic Robotics”, MIT Press, 2005
     ◮ Trevor Hastie, Robert Tibshirani and Jerome Friedman, “The Elements of Statistical Learning: Data Mining, Inference, and Prediction”, Springer, 2009
     Several pictures from those books have been copied and pasted here.

  4. Linear Methods for Regression

  5. Supervised learning: use the inputs (i.e. predictors, independent variables, features) to predict the values of the outputs (i.e. responses, dependent variables).
     This distinction in output type has led to a naming convention for the prediction tasks: regression when we predict quantitative outputs, and classification when we predict qualitative outputs.
     Notation:
     ◮ $x \in \mathbb{R}^m$ a random variable ($x_i \in \mathbb{R}$ is its $i$-th component)
     ◮ $x \in \mathbb{R}^m$ an observation of the random variable ($x_i \in \mathbb{R}$ is its $i$-th component)
     ◮ $X \in \mathbb{R}^{N \times m}$ a collection of $N$ observations ($X_i^T \in \mathbb{R}^m$ is its $i$-th row)
     We will focus on the regression problem: this means that input and output vectors consist of quantitative measurements.

  6. Linear Models
     Input: $x \in \mathbb{R}^m$ (random variable), $x \in \mathbb{R}^m$ (observation), $X \in \mathbb{R}^{N \times m}$
     Output: $y \in \mathbb{R}^p$ (random variable), $y \in \mathbb{R}^p$ (observation), $Y \in \mathbb{R}^{N \times p}$
     Prediction: $\hat{y} \in \mathbb{R}^p$, $\hat{Y} \in \mathbb{R}^{N \times p}$
     Linear Model (from now on $p = 1$):
     $$ y = f(x) = x^T \beta, \qquad \beta \in \mathbb{R}^m $$
     Prediction:
     $$ \hat{y} = x^T \hat{\beta} $$
     where $\hat{\beta} \in \mathbb{R}^m$ is the vector of coefficients that we have to determine.
     Remark. If $p = 1$, the gradient $f'(x) = \nabla_x f(x) = \beta$ is a vector pointing in the steepest uphill direction.

  7. Least Squares
     Let $X \in \mathbb{R}^{N \times m}$ and $Y \in \mathbb{R}^N$ be a training set of data (a collection of $N$ pairs $(x, y)$).
     How do we choose $\beta$? First of all we have to introduce an index as a function of $\beta$. Let $RSS(\beta)$ be the residual sum of squares
     $$ RSS(\beta) := \sum_{i=1}^{N} (Y_i - X_i \beta)^T (Y_i - X_i \beta) = (Y - X\beta)^T (Y - X\beta) $$
     We search for
     $$ \hat{\beta} := \arg\min_{\beta} RSS(\beta) $$
     Computing the first and second derivatives we get
     $$ \nabla_{\beta} RSS(\beta) = -2 X^T (Y - X\beta), \qquad \nabla^2_{\beta\beta} RSS(\beta) = 2 X^T X $$
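A minimal NumPy sketch (not part of the slides; data and shapes are invented for illustration) that checks the gradient formula $\nabla_\beta RSS(\beta) = -2X^T(Y - X\beta)$ against a finite-difference approximation:

```python
import numpy as np

rng = np.random.default_rng(0)
N, m = 50, 3
X = rng.normal(size=(N, m))
Y = rng.normal(size=N)
beta = rng.normal(size=m)

def rss(b):
    r = Y - X @ b
    return r @ r

# Analytic gradient from the slide
grad_analytic = -2.0 * X.T @ (Y - X @ beta)

# Central finite differences, one coordinate at a time
eps = 1e-6
grad_numeric = np.zeros(m)
for k in range(m):
    e = np.zeros(m)
    e[k] = eps
    grad_numeric[k] = (rss(beta + e) - rss(beta - e)) / (2 * eps)

print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))  # True
```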

  8. Least Squares
     If $X^T X$ is nonsingular (i.e. $X$ has full column rank), the unique solution is given by the normal equations
     $$ \nabla_{\beta} RSS(\beta) = 0 \;\Leftrightarrow\; X^T (Y - X\beta) = 0 $$
     i.e.
     $$ \hat{\beta} = (X^T X)^{-1} X^T Y $$
     and the prediction of $y$ given a new value $x$ is
     $$ \hat{y} = x^T \hat{\beta} $$
     Observations:
     ◮ We assume that the underlying model is linear
     ◮ Statistics of $x$ and $y$ do not play any role (it seems ...)
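A minimal sketch of the normal-equation solution $\hat{\beta} = (X^T X)^{-1} X^T Y$; the synthetic data and the "true" coefficient vector are assumptions made only for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
N, m = 100, 4
X = rng.normal(size=(N, m))
beta_true = np.array([1.0, -2.0, 0.5, 3.0])
Y = X @ beta_true + 0.1 * rng.normal(size=N)

# Solve the normal equations X^T X beta = X^T Y (assumes full column rank)
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Numerically safer alternative based on an orthogonal factorization
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Prediction for a new input x
x_new = rng.normal(size=m)
y_hat = x_new @ beta_hat
```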

  9. Least Squares, $p > 1$
     Linear model
     $$ Y = XB + E $$
     where $X \in \mathbb{R}^{N \times m}$, $Y \in \mathbb{R}^{N \times p}$, $E \in \mathbb{R}^{N \times p}$ and $B \in \mathbb{R}^{m \times p}$.
     The RSS takes the form
     $$ RSS(B) := \mathrm{trace}\{ (Y - XB)^T (Y - XB) \} $$
     and the least squares estimate of $B$ is written in the same way
     $$ \hat{B} = (X^T X)^{-1} X^T Y $$
     Multiple outputs do not affect one another's least squares estimates.
     If the components of the vector r.v. $e$ are correlated, i.e. $e \sim \mathcal{N}(0, \Sigma)$, then we can define a weighted RSS
     $$ RSS(B, \Sigma) := \sum_{i=1}^{N} (Y_i - X_i B)^T \, \Sigma^{-1} \, (Y_i - X_i B) $$
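A small sketch (illustrative shapes and data, not from the slides) showing that with multiple outputs the joint estimate $\hat{B} = (X^T X)^{-1} X^T Y$ coincides with fitting each output column separately, i.e. the outputs do not interact:

```python
import numpy as np

rng = np.random.default_rng(2)
N, m, p = 200, 3, 2
X = rng.normal(size=(N, m))
B_true = rng.normal(size=(m, p))
Y = X @ B_true + 0.05 * rng.normal(size=(N, p))

# Joint solution for all p outputs at once
B_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Column-by-column solution: identical result
B_cols = np.column_stack(
    [np.linalg.solve(X.T @ X, X.T @ Y[:, j]) for j in range(p)]
)
print(np.allclose(B_hat, B_cols))  # True
```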

  10. Geometric interpretation
      The normal equations $X^T (Y - X\beta) = 0$ mean that the estimate
      $$ \hat{Y} = X \hat{\beta} = X (X^T X)^{-1} X^T Y $$
      is the orthogonal projection of $Y$ onto the subspace spanned by the columns of $X$.
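A short numerical sketch of this geometric picture (random data, chosen only for illustration): the hat matrix $X(X^TX)^{-1}X^T$ is an orthogonal projector, so the residual $Y - \hat{Y}$ is orthogonal to every column of $X$:

```python
import numpy as np

rng = np.random.default_rng(3)
N, m = 50, 3
X = rng.normal(size=(N, m))
Y = rng.normal(size=N)

H = X @ np.linalg.solve(X.T @ X, X.T)   # projection ("hat") matrix
Y_hat = H @ Y
residual = Y - Y_hat

print(np.allclose(X.T @ residual, 0.0, atol=1e-10))  # residual orthogonal to col(X)
print(np.allclose(H @ H, H, atol=1e-10))             # H is idempotent (a projector)
```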

  11. Statistical interpretation
      We now consider the r.v. $x$ and $y$ as input and output, respectively, and we seek a function $f(x)$ for predicting $y$. The criterion should now deal with stochastic quantities: we introduce the expected squared prediction error EPE (strictly related to the mean squared error MSE)
      $$ EPE(f) := E\left[ (y - f(x))^T (y - f(x)) \right] = \int_{S_x, S_y} (y - f(x))^T (y - f(x)) \, p(x, y) \, dx \, dy $$
      where we implicitly assumed that $x$ and $y$ have a joint PDF. $EPE(f)$ is an $L_2$ loss function.
      Conditioning on $x$ we can re-write $EPE(f)$ as
      $$ EPE(f) = E_x \left[ E_{y|x} \left[ (y - f(x))^T (y - f(x)) \,\middle|\, x \right] \right] $$

  12. Statistical interpretation
      We can determine $f(\cdot)$ pointwise:
      $$ f(x) = \arg\min_{c} E_{y|x} \left[ (y - c)^T (y - c) \,\middle|\, x = x \right] $$
      which means that
      $$ f(x) = E[\, y \mid x = x \,] $$
      i.e. the best $f(x)$ is the conditional mean (according to the EPE criterion).
      Beautiful, but given the data $X, Y$ how can we compute the conditional expectation?!?

  13. Statistical interpretation
      Let us assume again $f(x) = x^T \beta$; then
      $$ EPE(f) := E\left[ (y - x^T \beta)^T (y - x^T \beta) \right] $$
      Differentiating w.r.t. $\beta$ we end up with
      $$ \beta = \left( E[x x^T] \right)^{-1} E[x\, y] $$
      Computing the sample auto- and cross-correlations (i.e. using real numbers!)
      $$ S_{xx} := \frac{1}{N} \sum_{i=1}^{N} X_i^T X_i = \frac{1}{N} X^T X \;\xrightarrow{\;N \to \infty\;}\; E[x x^T] $$
      $$ S_{xy} := \frac{1}{N} \sum_{i=1}^{N} X_i^T Y_i = \frac{1}{N} X^T Y \;\xrightarrow{\;N \to \infty\;}\; E[x\, y] $$
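An illustrative check (the joint distribution of $(x, y)$ below is an assumption made for the demo) that the sample moments $S_{xx}$ and $S_{xy}$ approach $E[xx^T]$ and $E[xy]$ as $N$ grows:

```python
import numpy as np

rng = np.random.default_rng(4)
m = 2
beta = np.array([1.5, -0.7])
Sigma_x = np.array([[1.0, 0.3], [0.3, 2.0]])  # zero-mean x, so E[x x^T] = Sigma_x

for N in (100, 10_000, 1_000_000):
    X = rng.multivariate_normal(np.zeros(m), Sigma_x, size=N)
    Y = X @ beta + rng.normal(scale=0.5, size=N)   # e independent of x => E[x y] = Sigma_x @ beta
    S_xx = X.T @ X / N
    S_xy = X.T @ Y / N
    print(N, np.max(np.abs(S_xx - Sigma_x)), np.max(np.abs(S_xy - Sigma_x @ beta)))
```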

  14. Statistical interpretation
      Then we get
      $$ \hat{\beta} = \left( \tfrac{1}{N} X^T X \right)^{-1} \tfrac{1}{N} X^T Y = \left( X^T X \right)^{-1} X^T Y $$
      Again the normal equations!!! But now we can provide a statistical interpretation of $\hat{\beta}$. Let
      $$ y = x^T \beta + e, \qquad e \sim \mathcal{N}(0, \sigma^2) $$
      be our model ($p = 1$); then $\hat{\beta}$ is a Gaussian random variable,
      $$ \hat{\beta} \sim \mathcal{N}\left( \beta, (X^T X)^{-1} \sigma^2 \right) $$
      In fact
      $$ \hat{\beta} = (X^T X)^{-1} X^T Y = \beta + (X^T X)^{-1} X^T E $$
      and the prediction of a new output is $\hat{y} = x^T \hat{\beta}$.
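A Monte Carlo sketch of the sampling distribution $\hat{\beta} \sim \mathcal{N}(\beta, (X^TX)^{-1}\sigma^2)$ for a fixed design matrix and repeated noise draws; all numbers below are chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
N, m, sigma = 60, 2, 0.3
X = rng.normal(size=(N, m))
beta = np.array([2.0, -1.0])
cov_theory = sigma**2 * np.linalg.inv(X.T @ X)

estimates = []
for _ in range(20_000):
    Y = X @ beta + sigma * rng.normal(size=N)
    estimates.append(np.linalg.solve(X.T @ X, X.T @ Y))
estimates = np.array(estimates)

print(estimates.mean(axis=0))   # close to beta (unbiasedness)
print(np.cov(estimates.T))      # close to the theoretical covariance
print(cov_theory)
```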

  15. Gauss-Markov theorem
      Given the linear model $y = x^T \beta$, $Y = X\beta$, the least squares estimator $\hat{\phi}(x_0) = x_0^T \hat{\beta}$ of $\phi(x_0) = x_0^T \beta$ is unbiased because
      $$ E[\, x_0^T \hat{\beta} \,] = x_0^T \beta $$
      Theorem. If $\bar{\phi}(x_0)$ is any other linear unbiased estimator ($E[\bar{\phi}(x_0)] = x_0^T \beta$), then
      $$ \mathrm{Var}(\hat{\phi}(x_0)) \leq \mathrm{Var}(\bar{\phi}(x_0)) $$
      Remark. The mean squared error of a generic estimator $\bar{\phi}$ ($p = 1$) is
      $$ MSE(\bar{\phi}) = E[(\bar{\phi} - \phi)^2] \overset{(*)}{=} \underbrace{\mathrm{Var}(\bar{\phi})}_{\text{variance}} + \underbrace{(E[\bar{\phi}] - \phi)^2}_{\text{squared bias}} $$
      (*) = add and subtract $E[\bar{\phi}]$.

  16. Gauss-Markov theorem
      Given the stochastic linear model
      $$ y = x^T \beta + e, \qquad e \sim \mathcal{N}(0, \sigma^2) $$
      let $\bar{\phi}(x_0)$ be an estimator of $y_0 = \phi(x_0) + e_0$, with $\phi(x_0) = x_0^T \beta$. The expected prediction error (EPE) of $\bar{\phi}(x_0)$ is
      $$ EPE(\bar{\phi}(x_0)) = E[(y_0 - \bar{\phi}(x_0))^2] = \sigma^2 + E[(x_0^T \beta - \bar{\phi}(x_0))^2] = \sigma^2 + \underbrace{\mathrm{Var}(\bar{\phi}) + (E[\bar{\phi}] - \phi)^2}_{MSE} $$
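A Monte Carlo sketch of the decomposition $EPE = \sigma^2 + \mathrm{Var}(\bar{\phi}) + \mathrm{Bias}(\bar{\phi})^2$ at a fixed test point $x_0$. The deliberately biased estimator (shrinking $x_0^T\hat{\beta}$ toward zero) and all numerical values are assumptions made for the demo:

```python
import numpy as np

rng = np.random.default_rng(6)
N, m, sigma = 40, 3, 0.5
X = rng.normal(size=(N, m))
beta = np.array([1.0, -1.0, 0.5])
x0 = np.array([0.7, 0.2, -0.4])
phi0 = x0 @ beta                        # true value phi(x0) = x0^T beta

preds = []
for _ in range(50_000):
    Y = X @ beta + sigma * rng.normal(size=N)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    preds.append(0.8 * (x0 @ beta_hat))  # shrunken, hence biased, estimator
preds = np.array(preds)

var = preds.var()
bias2 = (preds.mean() - phi0) ** 2
epe = sigma**2 + var + bias2             # sigma^2 is the irreducible noise in y0
print(var, bias2, epe)

# Direct Monte Carlo estimate of E[(y0 - phi_bar(x0))^2] for comparison
y0 = phi0 + sigma * rng.normal(size=preds.size)
print(np.mean((y0 - preds) ** 2))        # close to epe
```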

  17. Bias-variance trade-off
      [Figure: underfitting vs. overfitting]

  18. Statistical models
      Statistical model:
      $$ y = f(x) + e $$
      where $e$ is a random error with zero mean ($E[e] = 0$) and is independent of $x$. This means that the relationship between $y$ and $x$ is not deterministic ($f(\cdot)$).
      The additive r.v. $e$ takes care of measurement noise, model uncertainty and non-measured variables correlated with $y$ as well.
      We often assume that the random variables $e$ are independent and identically distributed (i.i.d.).

  19. Statistical models
      Assuming a linear basis expansion for $f_\theta(x)$, parametrized by the unknowns collected in the vector $\theta$,
      $$ f_\theta(x) = \sum_{k=1}^{K} h_k(x)\, \theta_k $$
      where examples of $h_k(x)$ can be
      $$ h_k(x) = x_k, \qquad h_k(x) = (x_k)^2, \qquad h_k(x) = \sin(x_k), \qquad h_k(x) = \frac{1}{1 + e^{-x^T \beta_k}} $$
      the optimization problem to solve is
      $$ \hat{\theta} = \arg\min_{\theta \in \Theta} RSS(\theta) = \arg\min_{\theta \in \Theta} \sum_{i=1}^{N} (y_i - f_\theta(x_i))^2 $$
      where RSS stands for Residual Sum of Squares.
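A minimal sketch of such a basis-expansion fit: the particular basis $\{x, x^2, \sin(x)\}$ and the data-generating function below are assumptions chosen only for illustration; since $f_\theta$ is linear in $\theta$, the fit is an ordinary linear least squares problem on the feature matrix:

```python
import numpy as np

rng = np.random.default_rng(7)
N = 200
x = rng.uniform(-3, 3, size=N)
y = 0.5 * x - 0.2 * x**2 + 1.5 * np.sin(x) + 0.1 * rng.normal(size=N)

H = np.column_stack([x, x**2, np.sin(x)])       # N x K basis matrix h_k(x_i)
theta_hat, *_ = np.linalg.lstsq(H, y, rcond=None)
print(theta_hat)                                # close to (0.5, -0.2, 1.5)

def f_theta(x_new, theta=theta_hat):
    """Evaluate the fitted basis expansion at new inputs."""
    return np.column_stack([x_new, x_new**2, np.sin(x_new)]) @ theta
```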

  20. Statistical models
      Are there other criteria besides RSS and EPE? Yes: a more general principle for estimation is maximum likelihood estimation.
      Let $p_\theta(y)$ be the PDF of the samples $y_1, \ldots, y_N$. The log-probability (or log-likelihood) of the observed samples is
      $$ L(\theta) = \sum_{i=1}^{N} \log p_\theta(y_i) $$
      Principle of maximum likelihood: the most reasonable values for $\theta$ are those for which the probability of the observed samples is largest.

  21. Statistical models
      If the error $e$ in the statistical model
      $$ y = f_\theta(x) + e $$
      is Gaussian, $e \sim \mathcal{N}(0, \sigma^2)$, then the conditional probability is
      $$ p(y \mid x, \theta) \sim \mathcal{N}(f_\theta(x), \sigma^2) $$
      The log-likelihood of the data is then
      $$ L(\theta) = \sum_{i=1}^{N} \log p(y_i \mid x_i, \theta) = -\frac{N}{2} \log(2\pi) - N \log \sigma - \frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - f_\theta(x_i))^2 $$
      Least squares for the additive error model is therefore equivalent to maximum likelihood using this conditional probability (the last sum is exactly $RSS(\theta)$).
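A small sketch of this equivalence (the one-parameter model and known noise level are assumptions for the demo): with Gaussian errors the log-likelihood differs from $-RSS(\theta)/(2\sigma^2)$ only by terms independent of $\theta$, so maximizing one is the same as minimizing the other:

```python
import numpy as np

rng = np.random.default_rng(8)
N, sigma = 100, 0.4
x = rng.normal(size=N)
y = 2.0 * x + sigma * rng.normal(size=N)

def rss(theta):
    return np.sum((y - theta * x) ** 2)

def log_likelihood(theta):
    # Gaussian log-likelihood with known sigma, as on the slide
    return (-N / 2 * np.log(2 * np.pi) - N * np.log(sigma)
            - rss(theta) / (2 * sigma**2))

thetas = np.linspace(1.0, 3.0, 2001)
theta_ml = thetas[np.argmax([log_likelihood(t) for t in thetas])]
theta_ls = thetas[np.argmin([rss(t) for t in thetas])]
print(theta_ml, theta_ls)   # same grid point
```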

  22. Penalty function and regularization methods
      Penalty function, or regularization, methods introduce our knowledge about the type of function $f(x)$ we are looking for:
      $$ PRSS(f, \lambda) := RSS(f) + \lambda\, g(f) $$
      where the functional $g(f)$ enforces our knowledge (or desiderata) on $f$.
      Example. The one-dimensional cubic smoothing spline is the solution of
      $$ PRSS(f, \lambda) := \sum_{i=1}^{N} (y_i - f(x_i))^2 + \lambda \int [f''(s)]^2 \, ds $$
      Remark. Penalty function methods have a Bayesian interpretation:
      ◮ $g(f)$ is the log-prior distribution
      ◮ $PRSS(f, \lambda)$ is the log-posterior distribution
      ◮ the solution of $\arg\min_f PRSS(f, \lambda)$ is the posterior mode
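A sketch of a penalized least squares fit in the spirit of $PRSS(f, \lambda)$: here $f$ is represented by its values on a grid and the curvature penalty $\int [f''(s)]^2\,ds$ is replaced by a discrete second-difference penalty. This is only a simple discrete analogue, not the exact cubic smoothing spline; the data and $\lambda$ are chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(9)
N, lam = 100, 50.0
x = np.linspace(0, 1, N)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=N)

# Second-difference operator D (shape (N-2, N)) approximating f''
D = np.zeros((N - 2, N))
for i in range(N - 2):
    D[i, i:i + 3] = [1.0, -2.0, 1.0]

# Minimize ||y - f||^2 + lam * ||D f||^2  =>  (I + lam D^T D) f = y
f_hat = np.linalg.solve(np.eye(N) + lam * D.T @ D, y)
```

Larger values of `lam` push the fitted values toward a straight line (zero curvature), while `lam = 0` reproduces the data exactly, mirroring the bias-variance trade-off of slide 17.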
