SLIDE 1

Advanced Machine Learning

Emilie Chouzenoux(1), L. Omar Chehab(2) and Frédéric Pascal(3)

(1) Center for Computer Vision (CVN), CentraleSupélec / Opis Team, Inria
(2) Parietal Team, Inria
(3) Laboratory of Signals and Systems (L2S), CentraleSupélec, University Paris-Saclay

{emilie.chouzenoux, frederic.pascal}@centralesupelec.fr, l-emir-omar.chehab@inria.fr
http://www-syscom.univ-mlv.fr/~chouzeno/
http://fredericpascal.blogspot.fr

MDS

  • Sept. - Dec., 2020
SLIDE 2

Contents

1 Introduction - Reminders of probability theory and mathematical statistics (Bayes, estimation, tests) - FP
2 Robust regression approaches - EC / OC
3 Hierarchical clustering - FP / OC
4 Stochastic approximation algorithms - EC / OC
5 Nonnegative matrix factorization (NMF) - EC / OC
6 Mixture models fitting / Model Order Selection - FP / OC
7 Inference on graphical models - EC / VR
8 Exam

SLIDE 3

Key references for this course

Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2006.
Hastie, T., Tibshirani, R. and Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second edition. Springer, 2009.
James, G., Witten, D., Hastie, T. and Tibshirani, R. An Introduction to Statistical Learning, with Applications in R. Springer, 2013.
+ many, many references...

SLIDE 4

Course 1

Introduction - Reminders of probability theory and mathematical statistics

SLIDE 5
  • I. Introduction in stat. signal processing
  • II. Random Variables / Vectors / CV
  • III. Essential theorems
  • IV. Statistical modelling
  • V. Theory of Point Estimation
  • VI. Hypothesis testing - Decision theory
SLIDE 6

What is Machine Learning?

Statistical machine learning is concerned with the development of algorithms and techniques that learn from observed data by constructing stochastic models that can be used for making predictions and decisions. Topics covered include Bayesian inference and maximum likelihood modeling; regression, classification, density estimation, clustering, principal component analysis; parametric, semi-parametric, and non-parametric models; basis functions, neural networks, kernel methods, and graphical models; deterministic and stochastic optimization; overfitting, regularization, and validation.

SLIDE 7

From data to processing - robustness, dimension...

[Diagram: big picture, from data-driven to model-driven processing. Three regimes are contrasted: n > p (classical processing), n < p (regularization), and R < n, p (structured / a priori low-rank processing).]

SLIDE 8

General context

Statistical Signal Processing

Signals z: multivariate random complex observations (vectors), e.g. z ∈ C^p. Signal corrupted by an additive noise:

z = β d(θ) + n

with n ∼ CN(0, Σ), θ and β unknown.

Several processes:

  • PCA and dimension reduction
  • Parameter estimation
  • Detection / Filtering
  • Clustering / Classification
  • ...

SLIDE 9

Covariance & Subspace

Two quantities common to all these processes

"Optimal" processes rely on the second order statistics of z, notably on:

The covariance matrix (assuming circularity):

Σ = E[z z^H],

which carries information on the variance and the correlations between elements of z.

The principal subspace (of rank R):

Π_R = P_R(E[z z^H]),

the rank-R orthogonal subspace where most of the information lies.

SLIDE 10

Examples

Estimation (MLE, GMM...)

Parameter θ of the signal d(θ) to be estimated from observations. Example: Maximum Likelihood Estimator (MLE)

min_θ (d(θ) − z)^H Σ^{-1} (d(θ) − z)

Low rank version (e.g. MUSIC): replace Σ^{-1} by Π^⊥. Applications: DoA, inverse problems, source separation...

Detection (ACE, GLRT, ANMF, MSD...)

Binary hypothesis test: is d(θ_0) present? Example: Adaptive Cosine Estimator (ACE, or ANMF):

Λ_ACE = |d(θ_0)^H Σ^{-1} z|² / [(d(θ_0)^H Σ^{-1} d(θ_0)) (z^H Σ^{-1} z)] ≷_{H_0}^{H_1} η

Low rank version: replace Σ^{-1} by Π^⊥. Applications: RADAR, imaging, audio...

SLIDE 11

Filtering (MF, AMF, Projection...)

Maximizing the output signal to noise ratio (SNR). Example: Adaptive Matched Filter

y = |d(θ_0)^H Σ^{-1} z|² / (d(θ_0)^H Σ^{-1} d(θ_0))

Low rank version: replace Σ^{-1} by Π^⊥. Applications: de-noising, interference cancellation (telecom)...

Classification (SVM, K-means, KL divergence...)

Select a class for the observations: covariance and subspace are descriptors. Example: KL divergence between two distributions (or other divergences: Wasserstein, Riemannian, ...):

KL(Z_1, Z_2) = (1/2) [Tr(Σ_2^{-1} Σ_1) + Tr(Σ_1^{-1} Σ_2) − 2k]

W_2²(Z_1, Z_2) = Tr(Σ_1) + Tr(Σ_2) − 2 Tr((Σ_1^{1/2} Σ_2 Σ_1^{1/2})^{1/2})

Applications: machine learning, segmentation, profile determination...

SLIDE 12

Example of non Gaussianity (1/3): High Resolution SAR images

[Figure: HR SAR images, SMDS data.]

SLIDE 13

Example of non Gaussianity (2/3): Hyperspectral data

[Figure: NASA Hyperion sensor data.]

SLIDE 14

Example of non Gaussianity (3/3): Financial data

[Figure: Nasdaq-100 and S&P 500 series. Courtesy of E. Ollila [Ollila18].]

SLIDE 15
  • I. Introduction in stat. signal processing
  • II. Random Variables / Vectors / CV
  • III. Essential theorems
  • IV. Statistical modelling
  • V. Theory of Point Estimation
  • VI. Hypothesis testing - Decision theory
SLIDE 16

Menu - Probabilities and statistics basics

Example: Fair Six-Sided Die:
Sample space: Ω = {1,2,3,4,5,6}
Events: Even = {2,4,6}, Odd = {1,3,5} ⊆ Ω
Probability: P(6) = 1/6, P(Even) = P(Odd) = 1/2
Outcome: 6 ∈ Even.
Conditional probability: P(6|Even) = P(6 ∩ Even)/P(Even) = (1/6)/(1/2) = 1/3

General Axioms:
P(∅) = 0 ≤ P(A) ≤ 1 = P(Ω),
P(A ∪ B) + P(A ∩ B) = P(A) + P(B),
P(A ∩ B) = P(A|B) P(B).

SLIDE 17

Menu - Probabilities and statistics basics

Example: (Un)fair coin: Ω = {Tail, Head} ≃ {0,1} with P(1) = θ ∈ [0,1]:
Likelihood: P(1101|θ) = θ × θ × (1−θ) × θ = θ³(1−θ)
Maximum Likelihood (ML) estimate: θ̂ = argmax_θ P(1101|θ) = 3/4
Prior: if we are indifferent, then P(θ) = const.
Evidence: P(1101) = ∫ P(1101|θ) P(θ) dθ = 1/20
Posterior: P(θ|1101) = P(1101|θ) P(θ) / P(1101) ∝ θ³(1−θ) (Bayes rule).
Maximum a Posteriori (MAP) estimate: θ̂ = argmax_θ P(θ|1101) = 3/4
Predictive distribution: P(1|1101) = P(11011)/P(1101) = 2/3
Expectation: E[f|...] = ∫ f(θ) P(θ|...) dθ, e.g. E[θ|1101] = 2/3
Variance: Var(θ|1101) = E[(θ − E[θ])²|1101] = 2/63
Probability density: P(θ) = (1/ε) P([θ, θ+ε]) for ε → 0
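These numbers are easy to verify numerically. Below is a minimal sketch (assuming Python with SciPy, which the course does not prescribe): with a uniform prior, the posterior after observing 1101 is Beta(4,2), whose mode, mean and variance reproduce the values above.

```python
# Sketch (illustrative): Bernoulli inference for the sequence 1101, uniform prior.
from scipy.stats import beta

data = [1, 1, 0, 1]                      # the observed tosses
k, n = sum(data), len(data)              # 3 heads out of 4

theta_ml = k / n                         # ML estimate: 3/4
post = beta(k + 1, n - k + 1)            # uniform prior => posterior Beta(4, 2)
theta_map = (k + 1 - 1) / (n + 2 - 2)    # mode of Beta(4,2): 3/4, same as ML here
print(theta_ml, theta_map)
print(post.mean())                       # E[theta|1101] = 2/3 (also P(1|1101))
print(post.var())                        # Var(theta|1101) = 2/63
```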

SLIDE 18

Random Variables (r.v.) / Vectors (r.V.)

Notations

Let X (resp. x) denote a random variable (resp. random vector). Denote by P or P_θ its probability:

P(X = x) or P_θ(X = x) for the discrete case; f(x) or f_θ(x) for the continuous case (with PDF).

Some other notations:

E[.] or E_θ[.] (resp. V[.] / V_θ[.]) stands for the statistical expectation (resp. the variance).
i.i.d. → Independent (denoted ⊥) and Identically Distributed, i.e. same distribution, and X ⊥ Y ⇐⇒ for any measurable functions h and g, E[g(X) h(Y)] = E[g(X)] E[h(Y)].
n-sample (X_1,...,X_n) ⇐⇒ X_1,...,X_n are i.i.d.
PDF, CDF and iff resp. mean Probability Density Function, Cumulative Distribution Function and "if and only if".

SLIDE 19

Convergences

Multivariate case

Let (x_n)_{n∈N} a sequence of r.V. in R^d and x ∈ R^d, defined on the same probability space (Ω, A, P). Then:

Almost Sure CV: x_n →^{a.s.} x ⇐⇒ ∃N ∈ A such that P(N) = 0 and ∀ω ∈ N^c, lim_{n→∞} x_n(ω) = x(ω).

CV in probability: x_n →^P x ⇐⇒ ∀ε > 0, lim_{n→∞} P(‖x_n − x‖ ≥ ε) = 0, where ‖x‖ = (Σ_{i=1}^d x_i²)^{1/2} for x ∈ R^d. Moreover, x_n →^P x ⇐⇒ each component converges in probability.

CV in L^p: let p ∈ N*; x_n →^{L^p} x ⇐⇒ (x_n)_{n∈N}, x ∈ L^p and E[‖x_n − x‖^p] → 0 as n → ∞.

SLIDE 20

Convergence in distribution

CV in distribution: x_n →^{dist.} x if for any continuous and bounded function g, one has lim_{n→∞} E[g(x_n)] = E[g(x)].

The CV in distribution of a sequence of r.V. is stronger than the CV in distribution of each component!

How to characterise the CV in distribution?

Theorem (Lévy continuity theorem)

Let φ_n(u) = E[exp(i u^t x_n)] and φ(u) = E[exp(i u^t x)] be the characteristic functions of x_n and x. Then,

x_n →^{dist.} x ⇐⇒ ∀u ∈ R^d, φ_n(u) → φ(u) as n → ∞.

Proposition (a.s., P, dist. convergences)

x_n → x (in any of these modes) ⇒ h(x_n) → h(x) in the same mode, if h is a continuous function.

Discussion on the convergence hierarchy...

SLIDE 21

  • I. Introduction in stat. signal processing
  • II. Random Variables / Vectors / CV
  • III. Essential theorems

SLLN and CLT; Slutsky theorem and the Delta-method; Gaussian-related distributions

  • IV. Statistical modelling
  • V. Theory of Point Estimation
  • VI. Hypothesis testing - Decision theory
SLIDE 22

SLLN and CLT

Theorem (Strong (Weak) Law of Large Numbers)

Let (x_n)_{n∈N*} a sequence of i.i.d. r.V. in R^d s.t. E[‖x_1‖] < +∞. Let µ = E[x_1] the expectation of x_1. Then,

x̄_n = (1/n) Σ_{i=1}^n x_i →^{a.s., P} µ as n → ∞.

Theorem (Central Limit Theorem)

Let (x_n)_{n∈N*} a sequence of i.i.d. r.V. in R^d s.t. E[‖x_1‖²] < +∞. Let µ = E[x_1] and Σ = E[x_1 x_1^t] − E[x_1] E[x_1]^t the covariance matrix of x_1. Let x̄_n = (1/n) Σ_{i=1}^n x_i the empirical mean. Then,

√n (x̄_n − µ) →^{dist.} N(0, Σ) as n → ∞.
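Both theorems are easy to see in a quick Monte Carlo experiment; a minimal sketch (assuming Python with NumPy; the Exp(1) distribution, with µ = 1 and σ² = 1, and the sample sizes are arbitrary choices):

```python
# Sketch: Monte Carlo illustration of the SLLN and the CLT for i.i.d. Exp(1) draws.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 1_000, 10_000
x = rng.exponential(scale=1.0, size=(reps, n))   # mu = 1, sigma^2 = 1

xbar = x.mean(axis=1)                 # SLLN: each empirical mean is close to mu
print(xbar.mean())                    # ~ 1.0
z = np.sqrt(n) * (xbar - 1.0)         # CLT: should be approximately N(0, 1)
print(z.mean(), z.std())              # ~ 0 and ~ 1
```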

SLIDE 23

Slutsky theorem

Theorem (Slutsky theorem)

Let (x_n)_{n∈N*} a sequence of r.V. in R^d that cv in dist. to x. Let (y_n)_{n∈N*} a sequence of r.V. in R^m (defined on the same proba. space as (x_n)_{n∈N*}) that cv a.s. (or in P, or in dist.) towards a constant a. Then, the sequence (x_n, y_n)_{n∈N*} cv in dist. towards (x, a):

(x_n, y_n) →^{dist.} (x, a) as n → ∞.

Remark (Important Applications of Slutsky (IAS))

Under the previous assumptions, one has:

1. x_n + y_n →^{dist.} x + a if m = d;
2. x_n y_n →^{dist.} a x if m = 1;
3. x_n / y_n →^{dist.} x / a if m = 1 and a ≠ 0.

SLIDE 24

Delta-method

Theorem (Delta-method)

Let (x_n)_{n∈N*} a sequence of r.V. in R^d and θ a (deterministic) vector of R^d. Let h: R^d → R^m a function that is differentiable (at least) at point θ. Let us denote ∂h/∂θ^t(θ) the m×d matrix with entries (∂h_i/∂θ_j(θ)), 1 ≤ i ≤ m, 1 ≤ j ≤ d, and ∂h^t/∂θ(θ) = (∂h/∂θ^t(θ))^t its transpose. Assume that √n (x_n − θ) →^{dist.} x. Then

√n (h(x_n) − h(θ)) →^{dist.} ∂h/∂θ^t(θ) x as n → ∞.

Particular case: if x ∼ N(0, Σ), then

√n (h(x_n) − h(θ)) →^{dist.} N(0, ∂h/∂θ^t(θ) Σ ∂h^t/∂θ(θ)).
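The particular case can be checked by simulation; a sketch (assuming NumPy; h(x) = x² on Exp(1) data is an arbitrary example, for which the predicted asymptotic variance is h′(µ)² σ² = (2µ)² · 1 = 4):

```python
# Sketch: checking the delta method for h(x) = x^2 on the empirical mean of Exp(1) draws.
import numpy as np

rng = np.random.default_rng(1)
n, reps, mu = 2_000, 20_000, 1.0
xbar = rng.exponential(1.0, size=(reps, n)).mean(axis=1)

z = np.sqrt(n) * (xbar**2 - mu**2)    # sqrt(n) (h(xbar_n) - h(mu))
print(z.var())                        # ~ 4, as the delta method predicts
```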

SLIDE 25

Gamma and Beta distributions

Definition (Gamma distribution)

Let p > 0 and λ > 0. A real-valued r.v. X ∼ Γ(p, λ) if its PDF is defined as

f(x) = (λ^p / Γ(p)) x^{p−1} exp(−λx) 1_{R+}(x),

where Γ(x) = ∫_0^{+∞} t^{x−1} exp(−t) dt for x ∈ C s.t. Re(x) > 0. Also Γ(x+1) = x Γ(x) (for n ∈ N*, Γ(n) = (n−1)!).

If X ∼ Γ(p, λ) and a > 0, then aX ∼ Γ(p, λ/a).

Proposition (Beta distributions)

Let X ∼ Γ(p, λ) and Y ∼ Γ(q, λ) be independent r.v. Then:

1. X + Y ∼ Γ(p+q, λ);
2. X + Y and X/(X+Y) (resp. X + Y and X/Y) are independent;
3. the distributions of X/(X+Y) and X/Y do NOT depend on λ. They resp. correspond to Beta distributions of the 1st and 2nd kind, denoted β_1(p,q) and β_2(p,q). PDF...

SLIDE 26

Gamma and Beta distributions

Definition (Beta PDFs)

β_1(p,q): f(x) = x^{p−1} (1−x)^{q−1} / β(p,q) · 1_{[0,1]}(x),
β_2(p,q): f(x) = x^{p−1} / ((1+x)^{p+q} β(p,q)) · 1_{R+}(x),

with β(p,q) = Γ(p) Γ(q) / Γ(p+q).

Proposition

  • If U ∼ β_1(p,q), then U/(1−U) ∼ β_2(p,q);
  • If V ∼ β_2(p,q), then V/(1+V) ∼ β_1(p,q);
  • If V ∼ β_2(p,q), then 1/V ∼ β_2(q,p).

SLIDE 27

χ², Student-t and Fisher (or F) distributions

Definition (χ² dist.)

Let (X_n)_{n∈N*} a sequence of i.i.d. real-valued r.v. ∼ N(0,1). Then Σ_{i=1}^k X_i² follows a χ²-distribution with k d.o.f. (denoted χ²(k)).

X_1² ∼ Γ(1/2, 1/2) and Σ_{i=1}^k X_i² ∼ Γ(k/2, 1/2).

Definition (Student-t and F-distributions)

1. If X ∼ N(0,1), Y ∼ χ²(k), and X, Y independent, then T = X / √(Y/k) follows a Student-t dist. with k d.o.f. (denoted t(k)).
2. If p and q are integers, X ∼ χ²(p), Y ∼ χ²(q), and X, Y are independent, then F = (X/p)/(Y/q) follows an F-dist. with p and q d.o.f. (denoted F(p,q)).
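These constructions are straightforward to verify by simulation; a minimal sketch (assuming Python with NumPy; the degrees of freedom are arbitrary, and the printed values use the known moments of these laws):

```python
# Sketch: checking the chi2 / Student-t / Fisher constructions by simulation.
import numpy as np

rng = np.random.default_rng(2)
k, p, q, reps = 5, 3, 7, 100_000

chi2_k = (rng.standard_normal((reps, k)) ** 2).sum(axis=1)   # sum of k squared N(0,1)
print(chi2_k.mean(), chi2_k.var())    # chi2(k): mean k, variance 2k

t = rng.standard_normal(reps) / np.sqrt(chi2_k / k)          # Student-t with k d.o.f.
print(t.var())                        # ~ k/(k-2) = 5/3

X = (rng.standard_normal((reps, p)) ** 2).sum(axis=1)
Y = (rng.standard_normal((reps, q)) ** 2).sum(axis=1)
F = (X / p) / (Y / q)                 # F(p, q)
print(F.mean())                       # ~ q/(q-2) = 7/5
```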

SLIDE 28

Student Theorem

Theorem (Student theorem)

Let (X_n)_{n∈N*} a sequence of real-valued i.i.d. r.v. ∼ N(µ, σ²). Then, one has:

1. X̄_n = (1/n) Σ_{i=1}^n X_i ∼ N(µ, σ²/n).
2. R_n = Σ_{i=1}^n (X_i − X̄_n)² ∼ σ² χ²(n−1).
3. X̄_n and R_n are independent.
4. If S_n² = R_n / (n−1), then T_n = √n (X̄_n − µ) / S_n ∼ t(n−1).

Proof

Some elements...

SLIDE 29

Some applications

Estimate unknown parameters??

A1 Mean estimation: (X_1,...,X_n) i.i.d. ∼ N(µ, σ²); cases: σ² known / σ² unknown.
A2 Variance estimation: (X_1,...,X_n) i.i.d. ∼ N(µ, σ²); cases: µ known / µ unknown.
A3 Variance comparison (test) between two independent samples: (X_1,...,X_n) i.i.d. ∼ N(µ_X, σ²_X) and (Y_1,...,Y_n) i.i.d. ∼ N(µ_Y, σ²_Y); cases: µ_X and µ_Y known / µ_X and µ_Y unknown.

SLIDE 30

Possible answers with confidence intervals

A1 Based on µ̂ = X̄_n...

I_n = [X̄_n ± 1.96 σ/√n] is an exact 95%-confidence interval (σ² known);
Ĩ_n = [X̄_n ± 1.96 σ̂_n/√n] is an asymptotic 95%-confidence interval;

OR use

T_n = √n (X̄_n − µ) / S_n ∼ t(n−1) ⇒ Î_n = [X̄_n ± a_{n−1} S_n/√n] is an exact 95%-confidence interval.

[Figure: t(n−1) density; mass 1−α between −a_{n−1} and a_{n−1}, tails of α/2 on each side.]
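A sketch of the A1 intervals (assuming Python with NumPy/SciPy; the sample is simulated purely for illustration, and a_{n−1} is obtained as a Student quantile):

```python
# Sketch: the exact 95% confidence intervals for the mean of a Gaussian sample.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, mu, sigma = 30, 2.0, 1.5
x = rng.normal(mu, sigma, size=n)
xbar, s = x.mean(), x.std(ddof=1)

half_known = 1.96 * sigma / np.sqrt(n)          # sigma known: exact Gaussian interval
a = stats.t.ppf(0.975, df=n - 1)                # a_{n-1}: Student quantile of order 0.975
half_exact = a * s / np.sqrt(n)                 # sigma unknown: exact (Student) interval
print((xbar - half_known, xbar + half_known))
print((xbar - half_exact, xbar + half_exact))   # slightly wider: sigma is unknown
```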

SLIDE 31

Possible answers with confidence intervals

A2 Based on...

R*_n = Σ_{i=1}^n (X_i − µ)² ∼ σ² χ²(n) ⇒ I_n = [n σ̂²_n / b_n, n σ̂²_n / a_n] is an exact 95%-confidence interval, with σ̂²_n = R*_n / n (µ known).

R_n = Σ_{i=1}^n (X_i − X̄_n)² ∼ σ² χ²(n−1) ⇒ Î_n = [(n−1) σ̂²_n / b_{n−1}, (n−1) σ̂²_n / a_{n−1}] is an exact 95%-confidence interval, with σ̂²_n = R_n / (n−1).

Loss when unknowns are present..., i.e. the length of the CI increases...

SLIDE 32

Possible answers with confidence intervals

A3 Based on...

R*_{n,X} = Σ_{i=1}^n (X_i − µ_X)² ∼ σ²_X χ²(n), R*_{m,Y} = Σ_{i=1}^m (Y_i − µ_Y)² ∼ σ²_Y χ²(m),

(σ̂²_{n,X} / σ²_X) / (σ̂²_{m,Y} / σ²_Y) ∼ F(n,m) ⇒ σ²_X / σ²_Y ∈ [ (1/b_{n,m}) σ̂²_{n,X}/σ̂²_{m,Y}, (1/a_{n,m}) σ̂²_{n,X}/σ̂²_{m,Y} ],

with σ̂²_{n,X} = R*_{n,X}/n and σ̂²_{m,Y} = R*_{m,Y}/m.

[Figure: F(n,m) density; a_{n,m} and b_{n,m} are its 2.5% and 97.5% quantiles.]

Same thing for µ_X and µ_Y unknown...

SLIDE 33

  • I. Introduction in stat. signal processing
  • II. Random Variables / Vectors / CV
  • III. Essential theorems
  • IV. Statistical modelling

Generalities; Sufficiency; Exponential family; Fisher information; Optimality; Cramér-Rao bound

  • V. Theory of Point Estimation
SLIDE 34

Statistical modelling

Generalities

n-sample x = (x_1,...,x_n)
Dominated models; Likelihood Function (LF), denoted L(x,θ)
Parametric models, i.e. θ ∈ Θ ⊂ R^d

Definition (Identifiability conditions)

A model (X, A, {P_θ, θ ∈ Θ}) is said to be identifiable if the mapping θ ↦ P_θ, from Θ onto the probabilities on the space (X, A), is injective.

Definition (Statistic)

In a statistical model {X, A, {P_θ, θ ∈ Θ}}, a statistic is any measurable mapping S from (X, A) onto an arbitrary space. Let's say a statistic is a function of the observations, S(x_1,...,x_n), e.g. X̄_n, R_{n,X}, σ̂²_n, or even X, ...

SLIDE 35

Statistical modelling

Sufficient statistics

Very important concept! For high-dimensional data: dimension reduction without reducing the information brought by the data.
Main idea: where is the information of interest (i.e. related to the unknowns) contained in the data?
Example: coin toss -> Heads and Tails. One wants to know the probability of Heads, or whether the coin is biased... No need to keep the whole dataset...

Definition (Sufficient statistic)

A statistic S is said to be sufficient iff the conditional distribution L_θ(X | S(X)) does not depend on θ.

Remark (Pros and cons)

Difficult to use the definition.
The dimension of S has to be minimal!
(x_1,...,x_n) is always a sufficient statistic.

SLIDE 36

Statistical modelling

Sufficient statistics characterization

Theorem (Factorisation Criterion (FC))

A statistic S is sufficient iff the likelihood function can be written as:

L(x;θ) = ψ(S(x);θ) λ(x).

This is a sort of separability theorem...

Example: let (X_1,...,X_n) i.i.d. following a non-centred exponential dist., i.e. with PDF

f(x_i; θ) = (1/θ_2) exp(−(x_i − θ_1)/θ_2) 1_{x_i ≥ θ_1},

with θ = (θ_1, θ_2)^t. ⇒ S(X) = (min_{i=1,...,n}(X_i), Σ_{i=1}^n X_i) is sufficient!

SLIDE 37

Exponential family

Definition (Complete statistics)

A statistic S is said to be complete if, for any measurable real-valued function φ, one has

(∀θ ∈ Θ, E_θ[φ∘S(X)] = 0) ⇒ (∀θ ∈ Θ, φ∘S(X) = 0 a.s. [P_θ]).

Purely theoretical... for optimal unbiased estimation...

Definition (Exponential family)

A model is said to be exponential iff its LF can be written as:

L(x;θ) = h(x) φ(θ) exp(Σ_{i=1}^r Q_i(θ) S_i(x)),   (1)

where S(.) = (S_1(.),...,S_r(.)) is the canonical statistic. Discussion: r, large family (discrete and continuous models),...

SLIDE 38

Exponential family

Some very useful properties in this class of models...

Proposition

The canonical statistic is sufficient. Trivial with the FC...

Proposition

For an exponential family, if the S_i(.) are linearly independent (in the affine sense), i.e.,

(∀x ∈ X, Σ_{i=1}^r a_i S_i(x) = a_0) ⇒ a_0 = a_j = 0 ∀j,

then P_{θ_1} = P_{θ_2} ⇐⇒ Q_j(θ_1) = Q_j(θ_2) ∀j.

Corollary

For an exponential family, if the S_i(.) are linearly independent,

θ is identifiable ⇐⇒ θ ↦ Q(θ) is injective.

SLIDE 39

Exponential family

Some very useful properties in this class of models...

Theorem

If Q(Θ) contains a non-empty open set of R^r, the canonical statistic is complete.

Proposition

Of course, the canonical statistic follows an exponential model.

Model examples: exponential dist.! Gaussian, Poisson, Binomial dist. ... Exhaustive list on Wikipedia.

SLIDE 40

Fisher Information (FI) Matrix (FIM)

Definition (Score)

The score function is the r.V. s_θ(x) defined by:

s_θ(x) = ∂l(x;θ)/∂θ,

where l(x;θ) = log(L(x;θ)) is the log-likelihood function.

Proposition

The score is zero-mean, i.e. E[s_θ(x)] = 0.

Definition (FIM)

If one has (A5): the score is square-integrable, then the FIM is the variance (covariance matrix in the multidimensional case) of the score:

I(θ) = var_θ(s_θ(x)) = E_θ[s_θ(x) s_θ(x)^t].

SLIDE 41

FIM

Remark

In case of an n-sample (x_1,...,x_n), the score can be written as:

s_{n,θ}(x) = ∂l_n(x_1,...,x_n;θ)/∂θ = Σ_{i=1}^n ∂l(x_i;θ)/∂θ,

where l_n(x_1,...,x_n;θ) is the log-likelihood function of the n-sample. In such a case, the FIM I_n(θ) can be written (by independence) as

I_n(θ) = n I(θ).

Proposition

Let's assume a regular model, plus (A5); then, for a real θ, one has:

I(θ) = −E_θ[∂² l(x;θ) / ∂θ∂θ^t].

SLIDE 42

FIM

Some examples... Let us consider an n-sample of r.v. Prove the following results:

1. If P_θ ∼ B(θ,1), θ ∈ ]0,1[, then I_n(θ) = n / (θ(1−θ)).
2. If P_θ ∼ Poisson(θ), θ > 0, then I_n(θ) = n/θ.
3. If P_θ ∼ N(µ, σ²), (µ, σ²) ∈ R × R*_+, then I_n(θ) = n diag(1/σ², 1/(2σ⁴)).
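Result 1 can be checked numerically, since the FIM is the variance of the score: for a Bernoulli n-sample this variance should match n/(θ(1−θ)). A sketch (assuming NumPy; the values of θ and n are illustrative):

```python
# Sketch: I_n(theta) = n/(theta(1-theta)) for Bernoulli, via the score variance.
import numpy as np

rng = np.random.default_rng(4)
theta, n, reps = 0.3, 50, 200_000
x = rng.binomial(1, theta, size=(reps, n))

# Score of the n-sample: d/dtheta sum_i [x_i log(theta) + (1-x_i) log(1-theta)]
score = x.sum(axis=1) / theta - (n - x.sum(axis=1)) / (1 - theta)
print(score.mean())                   # ~ 0: the score is zero-mean
print(score.var())                    # ~ n/(theta(1-theta)) = 50/0.21 ≈ 238.1
```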

SLIDE 43

Unbiased estimation - Decision theory

Main idea: give an answer d regarding the data... Define a loss function ρ(d,θ) between d and the (true) value of the unknowns θ or g(θ). Generally:

Definition (quadratic loss)

ρ(d,θ) = (d − g(θ))^t A(θ) (d − g(θ)),

where A(.) is positive-definite. Using A(θ) = I leads to ρ(d,θ) = ‖d − g(θ)‖²...

Definition (Estimator)

An estimator of g(θ) is a statistic δ(x) mapping X into D = g(Θ).

Definition (Mean Square Error (MSE))

R_δ(θ) = E_θ[ρ(θ, δ(x))] = E_θ[(g(θ) − δ(x))²].

SLIDE 44

Cramér-Rao lower bound

Theorem (Cramér-Rao lower Bound (CRB) - FDCR inequality)

Let δ an unbiased, regular estimator of g(θ) ∈ R^k, where θ ∈ Θ ⊂ R^p and the function g is of class C¹. Let's also assume that I(θ) is positive-definite. Then, for an n-sample, and for all θ ∈ Θ, one has:

R_δ(θ) = var_θ(δ) ⪰ (1/n) ∂g/∂θ^t(θ) I(θ)^{-1} ∂g^t/∂θ(θ),

with ∂g/∂θ^t(θ) the k×p matrix with entries (∂g_i/∂θ_j(θ)), 1 ≤ i ≤ k, 1 ≤ j ≤ p, and ∂g^t/∂θ(θ) = (∂g/∂θ^t(θ))^t its transpose.

SLIDE 45

Cramér-Rao lower bound

Definition (Efficiency)

An unbiased estimator is said to be efficient iff its variance attains the CRB.

Proposition

If T is an efficient estimator of g(θ), then the affine transform AT + b is an efficient estimator of A g(θ) + b (for A and b with appropriate dimensions).

Proposition

An efficient estimator is optimal. The converse is (obviously) wrong. Think about the students' grades in a given course.

SLIDE 46

Link with exponential family

Consider an exponential model (1), L(x;θ) = h(x) φ(θ) exp(Σ_{i=1}^r Q_i(θ) S_i(x)), and make the change of variables λ_j = Q_j(θ). Then, one obtains:

Definition (Exponential model under a natural form...)

... when the LF is

L(x;λ) = K(λ) h(x) exp(Σ_{j=1}^r λ_j S_j(x)).   (2)

The new parameters are (λ_1,...,λ_r) ∈ Λ = Q(Θ) ⊂ R^r.

Theorem (Regularity)

Let an exponential model (2). If Λ is a non-empty open set of R^r, then the model is regular and (A5) is verified ⇒ I(λ) exists. Furthermore,

I(λ) = −E_λ[∂² ln L(x;λ) / ∂λ∂λ^t].

SLIDE 47

Link with exponential family

Theorem (Identifiability)

Let us consider the exponential model (2), where Λ is a (non-empty) open set of R^r. Then the model is identifiable, i.e. (P_{λ_1} = P_{λ_2} ⇒ λ_1 = λ_2), iff the FIM I(λ) is invertible ∀λ ∈ Λ.

Theorem (Necessary condition)

Let us consider the exponential model (1). Let us assume that the model is regular and let δ an unbiased regular estimator of g(θ). Moreover, let us assume that g is of class C¹ and that I(θ) is invertible ∀θ ∈ Θ. Then, if δ is efficient, it is necessarily an affine function of S(x) = (S_1(x),...,S_r(x))^t.

Remark

The previous theorem is useful for proving the NON-efficiency of an estimator...

SLIDE 48

Theorem (Converse of the CRB - Equality)

Given a regular model where Θ ⊂ R^d is a non-empty open set, let g: Θ → R^p of class C¹ s.t. ∂g/∂θ^t(θ) is a square invertible matrix ∀θ ∈ Θ, so that p = d. Assume that I(θ) exists and is invertible ∀θ ∈ Θ. Then δ(x) is a regular and EFFICIENT (unbiased) estimator of g(θ) iff L(x;θ) can be written as:

L(x;θ) = C(θ) h(x) exp(Σ_{j=1}^d Q_j(θ) S_j(x)),

where the functions Q and C are s.t.:

Q and C are differentiable ∀θ ∈ Θ;
∂Q/∂θ^t(θ) is invertible ∀θ ∈ Θ;
g(θ) = −(∂Q/∂θ^t(θ))^{-1} ∂lnC/∂θ(θ).

SLIDE 49

  • I. Introduction in stat. signal processing
  • II. Random Variables / Vectors / CV
  • III. Essential theorems
  • IV. Statistical modelling
  • V. Theory of Point Estimation

Basics; Method of Moment; Method of Maximum Likelihood; Bayesian estimation - MAP and MMSE

  • VI. Hypothesis testing - Decision theory
SLIDE 50

Basics

Let us denote T_n(x_1,...,x_n) or θ̂_n an estimator of θ (or of the true value θ_0 if needed).

Definition (Consistency)

An estimator θ̂_n of g(θ) is strongly (resp. weakly) consistent if it converges P_{θ_0}-almost surely (resp. in proba.) towards g(θ_0), with g: Θ → R^p.

Definition (Asymptotically unbiased)

An estimator θ̂_n of g(θ) is asymptotically unbiased if its limiting distribution is zero-mean, i.e.,

∃c_n → ∞ s.t. c_n (θ̂_n − g(θ_0)) →^{dist.} z with E_{θ_0}[z] = 0.

Remark: different from "unbiased at the limit": E_{θ_0}[θ̂_n] → g(θ_0) as n → ∞.

SLIDE 51

Basics

Definition (Asymptotically normal)

θ̂_n is asymptotically normal if

√n (θ̂_n − g(θ_0)) →^{dist.} N(0, Σ(θ_0)),

where Σ(θ_0) (symmetric positive definite) is the asymptotic covariance matrix of θ̂_n.

Remark: this implies that θ̂_n is asymptotically unbiased.

Definition (Asymptotically efficient)

An estimator is asymptotically efficient if it is asymptotically normal and if:

Σ(θ_0) = ∂g/∂θ^t(θ_0) I(θ_0)^{-1} ∂g^t/∂θ(θ_0).

SLIDE 52

Method of Moment

Let an n-sample (x_1,...,x_n) i.i.d. with x_1 ∼ P_θ, where θ ∈ Θ ⊂ R^d, s.t. E[|x_1|^d] < ∞. Let us assume that:

m = (m_1, ..., m_d)^t = (φ_1(θ_1,...,θ_d), ..., φ_d(θ_1,...,θ_d))^t = φ(θ),

where m_k = E_θ[x_1^k]. If the function φ is invertible (with inverse ψ), one has:

θ = (θ_1, ..., θ_d)^t = (ψ_1(m_1,...,m_d), ..., ψ_d(m_1,...,m_d))^t = ψ(m).

Theorem

U_p →^{a.s.} m_p as n → ∞, where ∀p, U_p = (1/n) Σ_{i=1}^n x_i^p;

√n (U − m) →^{dist.} N(0, Z), where U = (U_1,...,U_p)^t and m = (m_1,...,m_p)^t.

SLIDE 53

Method of Moment

The estimator of the Method of Moments (MME) is defined as

θ̂_n = (θ̂_{n,1}, ..., θ̂_{n,d})^t = (ψ_1(U_1,...,U_d), ..., ψ_d(U_1,...,U_d))^t = ψ(U_1,...,U_d),

where ∀p, U_p = (1/n) Σ_{i=1}^n x_i^p, with the x_i i.i.d.

Theorem (Asymptotics of the MM estimator)

If the function ψ is differentiable, then

θ̂_n →^{a.s.} θ as n → ∞;

√n (θ̂_n − θ) →^{dist.} N(0, A(θ)), where A(θ) = ∂ψ/∂m^t(m) Σ(θ) ∂ψ^t/∂m(m), with m = φ(θ).

The MME is strongly consistent and asymptotically normal, BUT generally NOT asymptotically efficient!
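As an illustration (not part of the slides), here is a sketch of the MME for the Γ(p, λ) model using the first two moments, assuming Python with NumPy: m_1 = p/λ and m_2 = p(p+1)/λ² invert to p = m_1²/(m_2 − m_1²) and λ = m_1/(m_2 − m_1²).

```python
# Sketch: method-of-moments estimator for Gamma(p, lambda) from the first two moments.
import numpy as np

rng = np.random.default_rng(5)
p_true, lam_true, n = 3.0, 2.0, 100_000
x = rng.gamma(shape=p_true, scale=1.0 / lam_true, size=n)   # NumPy uses scale = 1/rate

m1, m2 = x.mean(), (x**2).mean()       # empirical moments U_1, U_2
var_hat = m2 - m1**2                   # m_2 - m_1^2 = p / lambda^2
print(m1**2 / var_hat, m1 / var_hat)   # ~ (3.0, 2.0): psi applied to (U_1, U_2)
```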

SLIDE 54

Method of Maximum Likelihood

Assume a regular model + (A5) +

(A6) ∀x ∈ ∆, for θ close to θ_0, log(f(x;θ)) is three times differentiable w.r.t. θ and

|∂³ log f(x;θ) / ∂θ_j∂θ_k∂θ_l| ≤ M(x), with E_{θ_0}[M(x)] < +∞.

Proposition

Assume the model is identifiable; then ∀θ ≠ θ_0, one has

P_{θ_0}(L(x_1,...,x_n;θ_0) > L(x_1,...,x_n;θ)) → 1 as n → ∞,

where L(x_1,...,x_n;θ) is the LF. The LF is (asymptotically) maximal at the point θ_0...

SLIDE 55

Method of Maximum Likelihood

Definition (Maximum Likelihood Estimator (MLE))

The MLE is defined by

T: (x_1,...,x_n) ↦ θ̂_n ∈ argmax_{θ∈Θ} L(x_1,...,x_n;θ).

The MLE has to verify the following likelihood equations:

∂l(x_1,...,x_n;θ)/∂θ = 0 and ∂²l(x_1,...,x_n;θ)/∂θ∂θ^t ⪯ 0,

where l(x_1,...,x_n;θ) = log(L(x_1,...,x_n;θ)).

Definition

Let g: Θ → R^p. If θ̂_n is a MLE of θ, then g(θ̂_n) is also a MLE of g(θ). The MLE is not necessarily unique...
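When the likelihood equations have no closed-form solution, the MLE can be computed numerically. A sketch (assuming Python with SciPy; the Gamma model and starting point are arbitrary choices) minimizing the negative log-likelihood:

```python
# Sketch: numerical MLE by minimizing the negative log-likelihood (Gamma model).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import gamma

rng = np.random.default_rng(6)
x = rng.gamma(shape=3.0, scale=0.5, size=5_000)   # true (p, lambda) = (3, 2)

def nll(params):
    p, lam = params                    # shape p > 0, rate lambda > 0
    if p <= 0 or lam <= 0:
        return np.inf                  # keep the search inside the parameter space
    return -np.sum(gamma.logpdf(x, a=p, scale=1.0 / lam))

res = minimize(nll, x0=[1.0, 1.0], method="Nelder-Mead")
print(res.x)                           # ~ (3.0, 2.0)
```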

SLIDE 56

MLE asymptotics

Theorem

Assume: identifiable model, (A1), (A2), θ_0 ∈ Θ ≠ ∅, Θ compact, and

x_1 ↦ L(x_1;θ) is bounded ∀θ ∈ Θ; θ ↦ L(x_1;θ) is continuous ∀x_1 ∈ ∆.

Then, θ̂^{ML}_n →^{a.s.} θ_0 as n → ∞ (existence from a given n_0).

Theorem (Classical asymptotics)

Assume: identifiable model, Θ open set of R^d and (A1)−(A6). Then there exists θ̂^{ML}_n (from a given n_0), solution to the likelihood equations, s.t.

θ̂^{ML}_n →^{a.s.} θ_0 and √n (θ̂^{ML}_n − θ_0) →^{dist.} N(0, I_1(θ_0)^{-1}).

SLIDE 57

MLE asymptotics

Theorem (Classical asymptotics)

Assume: identifiable model, Θ open set of R^d, (A1)−(A6), AND g: R^d → R^p differentiable. Then there exists θ̂^{ML}_n (from a given n_0), solution to the likelihood equations, s.t.

g(θ̂^{ML}_n) →^{a.s.} g(θ_0);

√n (g(θ̂^{ML}_n) − g(θ_0)) →^{dist.} N(0, ∂g/∂θ^t(θ_0) I_1(θ_0)^{-1} ∂g^t/∂θ(θ_0)).

Conclusions

The MLE is strongly consistent, asymptotically normal and asymptotically efficient.

SLIDE 58

Come back on exponential models

Theorem

Let an exponential model (2) (under natural form)

L(x;λ) = K(λ) h(x) exp(Σ_{j=1}^r λ_j S_j(x)),

where λ ∈ Λ and Λ is a non-empty open set of R^r. Moreover, let us assume that I(λ) is invertible ∀λ ∈ Λ (identifiable model). Then the MLE exists (from a given n_0), is unique, strongly consistent and asymptotically efficient (which includes asymptotically normal).

Proof

Up to you ...

SLIDE 59

Bayesian estimation

Principles: the philosophy is different from the previous MM/ML estimation approaches (frequentist methods). The purpose is the same: estimating an unknown parameter θ ∈ R or R^p thanks to the sample (x_1,...,x_n) likelihood (parameterized by θ) and an a priori distribution p(θ). So, θ is assumed to be random...

Ideas: to that end, one has to minimize a cost function c(θ, θ̂) that represents the error between θ and its estimator θ̂.

Reminders: a posteriori distribution / posterior distribution

p(θ|x_1,...,x_n) = L(x_1,...,x_n;θ) p(θ) / f(x_1,...,x_n) = L(x_1,...,x_n;θ) p(θ) / ∫_{R^p} L(x_1,...,x_n;θ) p(θ) dθ ∝ L(x_1,...,x_n;θ) p(θ).

SLIDE 60

MMSE estimator

The MMSE estimator (the mean of the posterior PDF) is the estimator that minimizes the MSE as the cost function: c(θ, θ̂) = E[‖θ − θ̂‖²].

θ ∈ R:

E[(θ − θ̂_MMSE(x))²] = min_π E[(θ − π(x))²],

with x = (x_1,...,x_n); hence the MMSE estimator is

θ̂_MMSE(x) = E[θ|x].

θ ∈ R^p: the MMSE estimator θ̂_MMSE(x) = E[θ|x] minimizes the quadratic cost E[(θ − π(x))^t Q (θ − π(x))] for any symmetric positive definite matrix Q (and in particular for Q = I_p, the identity matrix).

SLIDE 61

MAP estimator

θ ∈ R: the MAP estimator θ̂_MAP(x) minimizes the average of a "uniform" cost function

c(θ − π(x)) = 0 if |θ − π(x)| ≤ Λ/2; 1 if |θ − π(x)| > Λ/2,

and is defined by

E[c(θ − θ̂_MAP(x))] = min_π E[c(θ − π(x))].

If Λ is arbitrarily small, θ̂_MAP(x) is the value of π(x) which maximizes the posterior p(θ|x), hence its name, MAP estimator. θ̂_MAP(x) is computed by setting to zero the derivative of p(θ|x) (or of its log) with respect to θ.

θ ∈ R^p: determine the values of θ_i which make the partial derivatives of p(θ|x) (or of its logarithm) with respect to θ_i equal to zero.
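A sketch contrasting the two Bayesian estimators on a Beta-Bernoulli model (assuming Python with SciPy; the prior and data are illustrative): the posterior is Beta(a_0+k, b_0+n−k), the MMSE estimate is its mean, and the MAP estimate is its mode.

```python
# Sketch: MMSE (posterior mean) vs MAP (posterior mode) for a Beta-Bernoulli model.
from scipy.stats import beta

a0, b0 = 2.0, 5.0                      # Beta(2,5) prior on theta (illustrative)
k, n = 13, 20                          # 13 successes out of 20 trials

a, b = a0 + k, b0 + (n - k)            # posterior: Beta(15, 12)
theta_mmse = a / (a + b)               # posterior mean ≈ 0.556
theta_map = (a - 1) / (a + b - 2)      # posterior mode (valid for a, b > 1) = 0.56
print(theta_mmse, theta_map)
print(beta(a, b).interval(0.95))       # 95% credible interval for theta
```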

SLIDE 62

  • I. Introduction in stat. signal processing
  • II. Random Variables / Vectors / CV
  • III. Essential theorems
  • IV. Statistical modelling
  • V. Theory of Point Estimation
  • VI. Hypothesis testing - Decision theory

Generalities; UMP tests; Student-t test; Asymptotic Tests

SLIDE 63

Generalities

Let an n-sample (x_1,...,x_n) i.i.d. ∼ P_θ, θ ∈ Θ. Let H_0 and H_1 be two non-empty disjoint subsets of Θ s.t. H_0 ∪ H_1 = Θ.

H_0 is the null hypothesis while H_1 is called the alternative hypothesis.

Remember: no symmetry! Goal: to find a procedure that allows to decide whether θ belongs to H_0 or not, regarding the dataset x = (x_1,...,x_n) ∈ X^n.

Definition

A hypothesis is said to be simple if it is reduced to a single element. Else, it is called composite.

Definition

A (pure) test is a mapping δ from X^n onto {0,1} s.t.: if δ(x) = 0, one decides H_0, while if δ(x) = 1, one rejects H_0. The region W = {x ∈ X^n | δ(x) = 1} is called the rejection region or the critical region. Its complement is called the acceptance region.

SLIDE 64

Generalities

Remark

A test is characterized (and will be identified) by its rejection region W.

Definition (Different errors)

For a test, there are two possible errors:
rejecting H_0 when it is true: type-I error, or error of the 1st kind;
accepting H_0 when it is false: type-II error, or error of the 2nd kind.

Definition (Type-I and Type-II errors)

For a test δ with critical region W, one has:

  • Type-I error: α_W: H_0 → [0,1], θ ↦ P_θ(W);
  • Type-II error: β_W: H_1 → [0,1], θ ↦ P_θ(W^c) = 1 − P_θ(W).

SLIDE 65

Generalities

Definition (Power of the test)

The power of a test W is defined as:

ρ_W: H_1 → [0,1], θ ↦ P_θ(W) = 1 − β_W(θ).

Definition (Randomized test (more general))

A randomized test is a mapping ϕ from X^n into [0,1], where ϕ(x) is the probability of rejecting H_0 for the dataset x = (x_1,...,x_n) ∈ X^n.

Remark

For ϕ = 1_W, one retrieves the pure test!

SLIDE 66

Generalities

Definition (Type-I and Type-II errors, power for a test ϕ)

  • Type-I error: α_ϕ: H_0 → [0,1], θ ↦ E_θ[ϕ(x)];
  • Type-II error: β_ϕ: H_1 → [0,1], θ ↦ 1 − E_θ[ϕ(x)];
  • Power of the test: ρ_ϕ = 1 − β_ϕ = E_{H_1}[ϕ(x)].

Definition (Level of significance (ls))

The level of significance α (typically 0.01 or 0.05, as for the CI) for a test ϕ is:

α = sup_{θ∈H_0} α_ϕ(θ) = sup_{θ∈H_0} E_θ[ϕ(x)].

SLIDE 67

Neyman Principle

Goal: one wants to control (or fix) the type-I error, i.e. the probability of rejecting H_0 when it is true. The Neyman principle consists in considering all tests with a ls ≤ a fixed α, and then in finding (among these tests) the one with the smallest type-II error. Since ρ_ϕ = 1 − β_ϕ, such a test will be said to be UMP.

Definition (Uniformly Most Powerful (UMP))

ϕ is UMP at the threshold α if its ls is ≤ α and if, ∀ϕ′ with a ls ≤ α, one has:

∀θ ∈ H_1, E_θ[ϕ(x)] ≥ E_θ[ϕ′(x)].

SLIDE 68

Simple hypothesis testing

In this part, for the n-sample (x_1,...,x_n), one considers

H_0: {θ = θ_0} versus H_1: {θ = θ_1},

which means that Θ = {θ_0, θ_1}. So there are two probabilities, P_{θ_0} (or P_0) and P_{θ_1} (or P_1), which implies two LFs:

L_0(x) = L(x;θ_0) and L_1(x) = L(x;θ_1), for x = (x_1,...,x_n) ∈ X^n.

Definition (Neyman test or Likelihood Ratio Test (LRT))

A Neyman test is a test ϕ s.t. ∃k ∈ R*_+, and

ϕ(x) = 1 if L(x;θ_1) > k L(x;θ_0); ϕ(x) = 0 if L(x;θ_1) < k L(x;θ_0).

The value of ϕ is not specified for {x ∈ X^n | L_1(x) = k L_0(x)}.

SLIDE 69

Neyman-Pearson Lemma

Remark

L_1(x)/L_0(x) is called the Likelihood Ratio (LR). The Neyman test consists in accepting the most likely hypothesis for a given observation x.

Proposition (Neyman-Pearson Lemma)

1. Existence: ∀α ∈ (0,1), there exists a Neyman test s.t. E_{θ_0}(ϕ) = α. Moreover, k is the quantile of order (1−α) of the LR distribution L_1(x)/L_0(x) under P_0, and one can impose that ϕ is constant for x ∈ X^n s.t. L_1(x) = k L_0(x). If the LR CDF under P_0 evaluated at k is (1−α) (continuous CDF), then one can choose this constant = 0 (pure test).
2. Sufficient cond.: ∀α ∈ (0,1), a Neyman test s.t. E_{θ_0}(ϕ) = α is UMP at level α.
3. Necessary cond.: ∀α ∈ (0,1), a UMP test at level α is necessarily a Neyman test.

Proof

Essential to build the Neyman test...

SLIDE 70

Neyman-Pearson Lemma

Remark

1. Conclusion: the only UMP tests at level α are the Neyman tests of level of significance α.
2. If the LR CDF under H_0 is continuous, one obtains the test of critical region W = {x ∈ X^n | L_1(x) > k L_0(x)}, where k is defined by P_0(L_1(X) > k L_0(X)) = α.
3. The power E_1(ϕ) of a UMP test at level α is necessarily ≥ α. Indeed, ϕ is preferable to the constant test ψ = α (which is of ls α), thus E_1(ϕ) ≥ E_1(ψ) = α.

SLIDE 71

Neyman-Pearson Lemma

Example 1: Let us consider the exponential model (1),

L(x;θ) = C(θ) h(x) exp(Σ_{j=1}^d Q_j(θ) S_j(x)),

where θ ∈ {θ_0, θ_1}, with θ_1 > θ_0. Assume an identifiable model: Q(θ_0) ≠ Q(θ_1) (e.g., Q(θ_1) > Q(θ_0)). Goal: test H_0: {θ = θ_0} versus H_1: {θ = θ_1}.

Example 2: Let us consider (X_1,...,X_n) i.i.d. ∼ N(µ, σ²) with σ² known. Goal: test H_0: {µ = µ_0} versus H_1: {µ = µ_1}, with µ_0 < µ_1.

Example 3: Let us consider (X_1,...,X_n) i.i.d. ∼ Poisson(θ). Goal: test H_0: {θ = θ_0} versus H_1: {θ = θ_1}, with θ_0 < θ_1.
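For Example 2 the LR is increasing in X̄_n, so the Neyman test rejects H_0 when X̄_n exceeds a Gaussian quantile. A sketch checking its level and power by simulation (assuming Python with NumPy/SciPy; the parameter values are illustrative):

```python
# Sketch: Neyman-Pearson test for a Gaussian mean with sigma known (Example 2).
# Reject H0 when xbar > mu0 + z_{1-alpha} * sigma / sqrt(n).
import numpy as np
from scipy import stats

mu0, mu1, sigma, n, alpha = 0.0, 0.5, 1.0, 25, 0.05
c = mu0 + stats.norm.ppf(1 - alpha) * sigma / np.sqrt(n)   # threshold at level alpha

rng = np.random.default_rng(7)
xbar_h0 = rng.normal(mu0, sigma / np.sqrt(n), size=100_000)
xbar_h1 = rng.normal(mu1, sigma / np.sqrt(n), size=100_000)
print((xbar_h0 > c).mean())            # empirical level ~ 0.05
print((xbar_h1 > c).mean())            # empirical power of the UMP test (~ 0.80 here)
```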

SLIDE 72

Composite tests - One-sided hypotheses

Now, let us consider a model with only one parameter and where Θ is an interval of R. One assumes L(x;θ) > 0, ∀x ∈ X^n, ∀θ ∈ Θ. Goal: test H_0: {θ ≤ θ_0} versus H_1: {θ > θ_0}. A more general problem! Let us consider the family having monotone likelihood ratio:

Definition (Monotone LR)

The family {P_θ^⊗n, θ ∈ Θ} is said to have monotone likelihood ratio if there exists a real-valued statistic U(x) s.t. ∀θ′ < θ″, L(x;θ″)/L(x;θ′) is a strictly increasing (or decreasing) function of U(x).

Remark

By changing U into −U, one can always assume strictly increasing in the previous definition.

SLIDE 73

Lehmann Theorem

Theorem (Lehmann theorem)

Let α ∈ (0,1). If the family (P_θ, θ ∈ Θ) has monotone (increasing) likelihood ratio, there exists a UMP test at level α for testing H_0: {θ ≤ θ_0} versus H_1: {θ > θ_0}. This test is defined by:

ϕ(x) = 1 if U(x) > c; ϕ(x) = γ if U(x) = c; ϕ(x) = 0 if U(x) < c,

where c and γ are obtained with E_{θ_0}[ϕ] = α. The same test is UMP at level α for testing:

1. H_0: {θ = θ_0} versus H_1: {θ > θ_0};
2. H_0: {θ = θ_0} versus H_1: {θ = θ_1}, where θ_1 > θ_0.

SLIDE 74

Lehmann Theorem

Remark

If the inequalities are reversed in the hypotheses, i.e. H_0: {θ ≥ θ_0} and H_1: {θ < θ_0}, then the UMP test is obtained by reversing the inequalities (in the test).

Example: the exponential model with LF L(x;θ) = C(θ) h(x) exp(Q(θ) S(x)), where Q(θ) is strictly increasing, has increasing LR with U(X) = S(X).

Remark (Important)

In general, there does NOT exist a UMP test for testing H_0: {θ = θ_0} versus H_1: {θ ≠ θ_0} (even for monotone LR).

For instance, let's consider the Gaussian model, σ² known. The UMP test for H_0: {µ = µ_0} versus H_1: {µ > µ_0} is

ρ(x) = 1 if Σ_i x_i > c; ρ(x) = 0 if Σ_i x_i ≤ c,

while the UMP test for H_0: {µ = µ_0} versus H_1: {µ < µ_0} is

ρ(x) = 1 if Σ_i x_i < c; ρ(x) = 0 if Σ_i x_i ≥ c.

⇒ no UMP test for testing µ = µ_0 versus µ ≠ µ_0.

SLIDE 75

Student test

Let (X_1,...,X_n) i.i.d. ∼ N(µ, σ²), with µ and σ² unknown. Goal: test H_0: {µ = µ_0} versus H_1: {µ ≠ µ_0} at level α ∈ (0,1).

General methodology

1. From the Student theorem, one has

T_n = √n (X̄_n − µ) / S_n ∼ t(n−1),

where X̄_n = (1/n) Σ_{i=1}^n X_i and S_n² = (1/(n−1)) Σ_{i=1}^n (X_i − X̄_n)².

2. Under H_0:

ξ_n = √n (X̄_n − µ_0) / S_n ∼ t(n−1).

3. Under H_1: from the SLLN, X̄_n − µ_0 →^{a.s.} µ − µ_0 and S_n →^{a.s.} σ. Thus ξ_n →^{a.s.} +∞ if µ > µ_0 and ξ_n →^{a.s.} −∞ if µ < µ_0.

4. Critical region:

W_n = {|ξ_n| > a}.

SLIDE 76

Student test

Let t_{n−1,r} be the quantile of order r of the t-distribution t(n−1).

[Figure: t(n−1) density; mass 1−α between −t_{n−1,1−α/2} and t_{n−1,1−α/2}, tails of α/2 on each side.]

Thus, under H_0, P(|ξ_n| > t_{n−1,1−α/2}) = α.

Previously, one has seen that I_n = [X̄_n − t_{n−1,1−α/2} S_n/√n, X̄_n + t_{n−1,1−α/2} S_n/√n] is a (1−α)-CI for µ_0. Here is the link between the CI and the Student (bilateral) test:

µ_0 ∈ I_n iff |ξ_n| ≤ t_{n−1,1−α/2}.

Finally, the associated p-value is

p = P(|T| > |ξ_n^{obs}|),

where T ∼ t(n−1) and ξ_n^{obs} is the observed value of ξ_n.
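A sketch of the bilateral Student test (assuming Python with SciPy; the simulated sample is illustrative), both via scipy.stats.ttest_1samp and by recomputing ξ_n directly:

```python
# Sketch: bilateral Student test of H0: mu = mu0.
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
x = rng.normal(loc=2.3, scale=1.0, size=40)
mu0 = 2.0

t_stat, p_value = stats.ttest_1samp(x, popmean=mu0)   # xi_n and its p-value
print(t_stat, p_value)                 # reject H0 at level alpha if p_value < alpha

# Equivalent by hand: xi_n = sqrt(n)(xbar - mu0)/S_n, compared with t_{n-1,1-alpha/2}
xi = np.sqrt(len(x)) * (x.mean() - mu0) / x.std(ddof=1)
print(xi, stats.t.ppf(0.975, df=len(x) - 1))
```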

SLIDE 77

Generalities

As for estimators, in many situations one CANNOT find the distribution of the LR (or of the statistic of the monotone LR). As a consequence, one cannot set the parameters k and γ for the test. A solution (as in point estimation theory) is to rely on asymptotic properties! Now, instead of considering a test W, we will consider a sequence of tests (W_n)_{n∈N*}.

Definition (Asymptotic level)

An asymptotic test W_n is at asymptotic level α if

lim_{n→∞} sup_{θ∈H_0} P_θ(W_n) = α.

SLIDE 78

Generalities

Definition (Uniform asymptotic level)

An asymptotic test W_n is at uniform asymptotic level α if

sup_{θ∈H_0} lim_{n→∞} P_θ(W_n) = α.

Definition (Consistent (or convergent) test)

An asymptotic test W_n is said to be consistent (or convergent) if its power tends towards 1, i.e.,

∀θ ∈ H_1, lim_{n→∞} P_θ(W_n) = 1.

This means that the type-II error tends to 0! Example: the t-test is consistent...

SLIDE 79

Asymptotic tests

Implicit constraint: H_0: {θ | g(θ) = 0}, where g is a mapping from R^d into R^r, of class C¹, s.t. the r×d matrix

∂g/∂θ^t = (∂g_i/∂θ_j), 1 ≤ i ≤ r, 1 ≤ j ≤ d,

is of rank r (so r ≤ d). Goal: test H_0: {θ ∈ Θ, g(θ) = 0} versus the alternative hypothesis H_1: {θ ∈ Θ, g(θ) ≠ 0}.

More general than H_0: {θ = θ_0} versus H_1: {θ ≠ θ_0}. To answer such problems, there exist (at least) three asymptotic tests: the Wald test, the Rao (score) test, and the Likelihood Ratio Test (LRT).

SLIDE 80

Wald test

Proposition (Wald test)

Let θ̂^{ML}_n the MLE of θ. Under H_0, one has:

√n g(θ̂^{ML}_n) →^{dist.} N(0, Σ(θ_0)),

where θ_0 ∈ H_0 is the true value of the parameter θ and where Σ(θ_0) = ∂g/∂θ^t(θ_0) I_1(θ_0)^{-1} ∂g^t/∂θ(θ_0).

Furthermore, the test statistic

ξ^W_n = n g(θ̂^{ML}_n)^t Σ(θ̂^{ML}_n)^{-1} g(θ̂^{ML}_n)

converges in distribution under H_0 towards a χ²-distribution with r d.o.f.:

ξ^W_n →^{dist.} χ²(r).

The Wald tests are defined by the following critical region:

W_n = {ξ^W_n > q_r(1−α)},

where q_r(1−α) is the quantile of order (1−α) of the χ²-distribution with r d.o.f. This test is strongly convergent at asymptotic level α = P(χ²(r) > q_r(1−α)).

SLIDE 81

Wald test

Definition (p-value)

The asymptotic p-value of the Wald test is defined by

p = P(χ²(r) > ξ^W_n(x_1,...,x_n)),

where χ²(r) is a r.v. following a χ²-dist. with r d.o.f. and ξ^W_n(x_1,...,x_n) is the observed test statistic. One rejects H_0 if p < α...

Remark

If one cannot compute I_1(θ), one can estimate I_1(θ) by the MM and replace it in the Wald test WITHOUT changing the results:

Î_1(·) = (1/n) Σ_{i=1}^n [∂lnL(x_i,·)/∂θ] [∂lnL(x_i,·)/∂θ]^t or Î_1(·) = −(1/n) Σ_{i=1}^n ∂²lnL(x_i,·)/∂θ∂θ^t.

Proof (Wald test)

Allows to understand the methodology...

SLIDE 82

Wald test

Example: Let a Gaussian n-sample

(X_i, Y_i)_{i∈{1,...,n}} ∼ N((µ_1, µ_2)^t, diag(σ²_1, σ²_2)),

with σ_1 and σ_2 known. Let θ = (µ_1, µ_2)^t.

Goal: test µ_1 = µ_2, i.e., H_0: {µ_1 − µ_2 = 0} versus H_1: {µ_1 − µ_2 ≠ 0}. Let us set g(θ) = µ_2 − µ_1 and show that the Wald test statistic is

ξ^W_n = n (µ̂_1 − µ̂_2)² / (σ²_1 + σ²_2),

where µ̂_1 = (1/n) Σ_{i=1}^n X_i and µ̂_2 = (1/n) Σ_{i=1}^n Y_i. One has

ξ^W_n →^{dist.} χ²(1).
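A sketch of this Wald test on simulated data (assuming Python with NumPy/SciPy; the means and known variances are illustrative):

```python
# Sketch: Wald test of H0: mu1 = mu2 with known variances, as in the example above.
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
n, sigma1, sigma2 = 200, 1.0, 1.5
X = rng.normal(0.0, sigma1, size=n)
Y = rng.normal(0.2, sigma2, size=n)    # here H1 holds: mu2 - mu1 = 0.2

xi_w = n * (X.mean() - Y.mean())**2 / (sigma1**2 + sigma2**2)
p_value = stats.chi2.sf(xi_w, df=1)    # r = 1 constraint => chi2(1) under H0
print(xi_w, p_value)                   # reject H0 at level alpha if p_value < alpha
```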

SLIDE 83

Rao-score test and Likelihood Ratio test (LRT)

Let θ̂^c_n the MLE of θ under the constraint g(θ) = 0, i.e. under H_0.

Theorem (Rao test and LRT)

The test statistics are defined by:

ξ^R_n = (1/n) [∂lnL(x_1,...,x_n; θ̂^c_n)/∂θ]^t I_1(θ̂^c_n)^{-1} [∂lnL(x_1,...,x_n; θ̂^c_n)/∂θ],

ξ^{LR}_n = 2 (lnL(x_1,...,x_n; θ̂_n) − lnL(x_1,...,x_n; θ̂^c_n)).

The Rao test and the LRT are defined by the following critical region:

W_n = {ξ^i_n > q_r(1−α)}, i ∈ {R, LR},

where q_r(1−α) is the quantile of order (1−α) of the χ²-distribution with r d.o.f. These tests are strongly convergent at asymptotic level α = P(χ²(r) > q_r(1−α)). Furthermore, under H_0, one has:

ξ^W_n − ξ^R_n →^P 0 and ξ^W_n − ξ^{LR}_n →^P 0 as n → ∞.

SLIDE 84

Rao-score test and Likelihood Ratio test (LRT)

Example: testing H_0: {λ = λ_0} versus H_1: {λ ≠ λ_0} in the case of a Poisson distribution with parameter λ...

[Figure: log-likelihood curve L(λ) with λ_0 and λ̂ marked, illustrating the Likelihood Ratio, Rao and Wald statistics.]

SLIDE 85

χ² test: Goodness-of-Fit to a given distribution

Goal: test the goodness of fit of r.V. to a discrete and finite distribution (e.g., binomial, ...). Quite restrictive, but it CAN be extended to all distributions!

Let the n-sample (X_1,...,X_n) i.i.d. with values in {a_1,...,a_m} and distribution P, where P is characterized by its weights p = (p_1,...,p_m) (it is a PMF) with Σ_{i=1}^m p_i = 1 and ∀j = 1,...,n, ∀i = 1,...,m, p_i = P(X_j = a_i).

One wants to test H_0: {p = p^0}, where p^0 = (p^0_1,...,p^0_m) is given (no unknown parameter), with Σ_{i=1}^m p^0_i = 1 and p^0_i > 0, ∀i = 1,...,m.

Let N_i be the counting statistic and p̂_i the empirical frequency of {X_k = a_i}:

N_i = Σ_{k=1}^n 1_{X_k = a_i} and p̂_i = N_i / n.

SLIDE 86

χ² test: Goodness-of-Fit to a given distribution

Theorem (χ²-test)

Under H_0,

ξ_n = Σ_{i=1}^m (N_i − n p^0_i)² / (n p^0_i) = n Σ_{i=1}^m (p̂_i − p^0_i)² / p^0_i,

and ξ_n converges in distribution towards a χ²-distribution with (m−1) d.o.f. when n → +∞. The test is defined by the critical region:

W_n = {ξ_n > q_{m−1}(1−α)},

where q_{m−1}(1−α) is the quantile of order (1−α) of the χ²-distribution with (m−1) d.o.f. This test is strongly convergent at asymptotic level α = P(χ²(m−1) > q_{m−1}(1−α)).

Example: toss a coin...
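A sketch of this goodness-of-fit test for a fair six-sided die (assuming Python with NumPy/SciPy; scipy.stats.chisquare computes exactly the ξ_n statistic above with m−1 d.o.f.):

```python
# Sketch: chi2 goodness-of-fit test for a fair six-sided die.
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
n, m = 600, 6
rolls = rng.integers(1, m + 1, size=n)               # simulated fair die
N = np.bincount(rolls, minlength=m + 1)[1:]          # counts N_i, i = 1..6

xi, p_value = stats.chisquare(N, f_exp=np.full(m, n / m))  # H0: p_i^0 = 1/6
print(N, xi, p_value)                  # reject the fair-die hypothesis if p_value < alpha
```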

SLIDE 87

χ² test: Goodness-of-Fit to a given distribution

Now, let us test H_0: {p = p(θ)} versus H_1: {p ≠ p(θ)}, where θ ∈ Θ ⊂ R^d, Θ open set, and θ is unknown!

Theorem (General χ²-test)

Under H_0,

ξ_n = Σ_{i=1}^m (N_i − n p_i(θ̂_n))² / (n p_i(θ̂_n)) = n Σ_{i=1}^m (p̂_i − p_i(θ̂_n))² / p_i(θ̂_n),

where θ̂_n is the MLE of θ, and ξ_n converges in distribution towards a χ²-distribution with (m−1−d) d.o.f. when n → +∞. The test is defined by the critical region:

W_n = {ξ_n > q_{m−1−d}(1−α)},

where q_{m−1−d}(1−α) is the quantile of order (1−α) of the χ²-distribution with (m−1−d) d.o.f. This test is strongly convergent at asymptotic level α = P(χ²(m−1−d) > q_{m−1−d}(1−α)).

SLIDE 88

χ² test: Goodness-of-Fit to a given distribution

How to generalize those χ² tests to a continuous distribution or an infinite discrete distribution?

Remark (On the use of χ² tests!)

It is an asymptotic test. In practice, it works if n p_i(θ̂_n) > 5, ∀i, and if N_i ≥ 5, ∀i. Else, one regroups classes (cf. exercise in the problems).

In the case of a continuous r.v. with unknown distribution, one wants to test if it belongs to the family {P_θ, θ ∈ Θ}. The idea is to partition R into m intervals (A_i)_{i=1,...,m}. The choice of m is a tradeoff:

m should be sufficiently large so that the discrete distributions {π_i = π(A_i)} and {p_{θ,i} = P_θ(A_i)} are sufficiently close to π and P_θ (if m is small, the test will be less powerful).

On the other hand, m should not be too large, so that the p_{θ,i} are sufficiently large to satisfy n p_i(θ̂_n) > 5.

SLIDE 89

χ² test for independence

Let (X_k, Y_k), k = 1,...,n, i.i.d. with values in {a_1,...,a_l} × {b_1,...,b_r}. Let us denote p_{i,j} = P(X_1 = a_i, Y_1 = b_j) and

p_{i,·} = P(X_1 = a_i) = Σ_{j=1}^r p_{i,j} and p_{·,j} = P(Y_1 = b_j) = Σ_{i=1}^l p_{i,j}.

One wants to know if X_1 and Y_1 are independent, i.e. if

H_0: {p_{i,j} = p_{i,·} p_{·,j}, ∀i,j}.

Let N_{i,j} = Σ_{k=1}^n 1_{X_k = a_i, Y_k = b_j} the counting statistic, and N_{i,·} = Σ_{k=1}^n 1_{X_k = a_i} and N_{·,j} = Σ_{k=1}^n 1_{Y_k = b_j}.

SLIDE 90

χ² test for independence

Theorem (χ²-test for independence)

Under H_0,

ξ_n = Σ_{i=1}^l Σ_{j=1}^r (N_{i,j} − N_{i,·} N_{·,j}/n)² / (N_{i,·} N_{·,j}/n),

and ξ_n converges in distribution towards a χ²-distribution with (r−1)(l−1) d.o.f. The test is defined by the critical region:

W_n = {ξ_n > q_{(r−1)(l−1)}(1−α)},

where q_{(r−1)(l−1)}(1−α) is the quantile of order (1−α) of the χ²-distribution with (r−1)(l−1) d.o.f. This test is strongly convergent at asymptotic level α = P(χ²((r−1)(l−1)) > q_{(r−1)(l−1)}(1−α)).

SLIDE 91

χ² test for independence

Example: a study on 592 women: is there a correlation between eye colour and hair colour?

Eyes \ Hair | Dark | Light-brown | Red | Blond
Black       |  68  |     119     |  26 |   7
Brown       |  15  |      54     |  14 |  10
Green       |   5  |      29     |  14 |  16
Blue        |  20  |      84     |  17 |  94

One obtains ξ_n = 138.29, d.o.f. = 9, and P(χ²(9) ≤ 16.91) = 0.95. Since 138.29 ≫ 16.91, one rejects H_0.
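This computation can be reproduced with scipy.stats.chi2_contingency, which builds exactly the ξ_n statistic above (a sketch, assuming Python with NumPy/SciPy; correction=False disables Yates' continuity correction so the raw statistic is returned):

```python
# Sketch: chi2 independence test on the eye/hair table above.
import numpy as np
from scipy import stats

table = np.array([[68, 119, 26,  7],
                  [15,  54, 14, 10],
                  [ 5,  29, 14, 16],
                  [20,  84, 17, 94]])

xi, p_value, dof, expected = stats.chi2_contingency(table, correction=False)
print(xi, dof, p_value)               # ~ 138.29 with 9 d.o.f., p-value ≈ 0 => reject H0
```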