A Brief Introduction to Machine Learning (With Applications to Communications)
Osvaldo Simeone, King's College London, 11 June 2018

When the True Distribution p(x, t) is Known... we do not need data D, and we have a standard inference problem, i.e., estimation (regression) or detection (classification). The solution can be computed directly from the posterior distribution $p(t \mid x) = p(x, t)/p(x)$ as $\hat{t}^*(x) = \arg\min_{\hat{t}} \mathbb{E}_{t \sim p(t \mid x)}[\ell(t, \hat{t}) \mid x]$.

When the Model p(x, t) is Known... With the quadratic loss, the optimal predictor is the conditional mean: $\hat{t}^*(x) = \mathbb{E}_{t \sim p(t \mid x)}[t \mid x]$. With the probability-of-error loss, it is the maximum a posteriori (MAP) predictor: $\hat{t}^*(x) = \arg\max_t p(t \mid x)$.

Example: with the joint distribution

  x \ t    t = 0    t = 1
  x = 0     0.05     0.45
  x = 1     0.4      0.1

we have $p(t = 1 \mid x = 0) = 0.9$, and hence $\hat{t}^*(x = 0) = 0.9 \times 1 + 0.1 \times 0 = 0.9$ for the quadratic loss and $\hat{t}^*(x = 0) = 1$ for the probability of error (MAP).
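As a sanity check of this example, here is a minimal NumPy sketch (not part of the slides) that recovers the posterior, the conditional-mean predictor, and the MAP predictor from the joint table:

    import numpy as np

    # Joint distribution p(x, t) from the table above:
    # rows index x in {0, 1}, columns index t in {0, 1}.
    p_xt = np.array([[0.05, 0.45],
                     [0.40, 0.10]])

    x = 0
    p_t_given_x = p_xt[x] / p_xt[x].sum()                   # posterior p(t | x = 0) = [0.1, 0.9]

    t_hat_quadratic = p_t_given_x @ np.array([0.0, 1.0])    # conditional mean -> 0.9
    t_hat_map = int(np.argmax(p_t_given_x))                 # MAP decision     -> 1

    print(t_hat_quadratic, t_hat_map)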

When the True Distribution p(x, t) is Not Known... we need data D, and we have a learning problem:
1. Model selection (inductive bias): define a parametric model, either a generative model $p(x, t \mid \theta)$ or a discriminative model $p(t \mid x, \theta)$.
2. Learning: given data D, optimize a learning criterion to obtain the parameter vector $\theta$.
3. Inference: use the model to obtain the predictor $\hat{t}(x)$ (to be tested on new data).

Logistic Regression. Example: binary classification ($t \in \{0, 1\}$). 1. Model selection (inductive bias): logistic regression (a discriminative model); $\phi(x) = [\phi_1(x) \cdots \phi_{D'}(x)]^T$ is a vector of features (e.g., a bag-of-words model for a text).

Logistic Regression. Parametric probabilistic model: $p(t = 1 \mid x, w) = \sigma(w^T \phi(x))$, where $\sigma(a) = (1 + \exp(-a))^{-1}$ is the sigmoid function.

Logistic Regression. 2. Learning: to be discussed. 3. Inference: with the probability-of-error loss, MAP classification thresholds the logit (or log-likelihood ratio, LLR) at zero: $w^T \phi(x) \underset{\hat{t} = 0}{\overset{\hat{t} = 1}{\gtrless}} 0$.
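A minimal NumPy sketch of this classifier; the feature map and the weight values below are illustrative assumptions, not taken from the slides:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def phi(x):
        # Illustrative feature map: a bias term plus the raw scalar input.
        return np.array([1.0, x])

    def predict_proba(x, w):
        """Model probability p(t = 1 | x, w) = sigma(w^T phi(x))."""
        return sigmoid(w @ phi(x))

    def map_classify(x, w):
        """MAP decision: threshold the logit (LLR) w^T phi(x) at zero."""
        return int(w @ phi(x) > 0)

    w = np.array([-1.0, 2.0])       # example weight vector
    print(predict_proba(0.8, w))    # approx. 0.65
    print(map_classify(0.8, w))     # 1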

Multi-Layer Neural Networks. 1. Model selection (inductive bias): multi-layer neural network (a discriminative model). Multiple layers of learnable weights enable feature learning.
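A minimal sketch of such a discriminative model with a single hidden layer; the layer sizes, random weights, and ReLU non-linearity are illustrative assumptions, not taken from the slides:

    import numpy as np

    def relu(a):
        return np.maximum(a, 0.0)

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def mlp_predict_proba(x, W1, b1, w2, b2):
        """p(t = 1 | x, theta) for a two-layer network."""
        h = relu(W1 @ x + b1)          # hidden layer acts as a learned feature map
        return sigmoid(w2 @ h + b2)    # logistic output layer

    rng = np.random.default_rng(0)
    x = rng.normal(size=3)
    W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
    w2, b2 = rng.normal(size=4), 0.0
    print(mlp_predict_proba(x, W1, b1, w2, b2))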

Supervised Learning:
1. Model selection (inductive bias): define a parametric model, either a generative model $p(x, t \mid \theta)$ or a discriminative model $p(t \mid x, \theta)$.
2. Learning: given data D, optimize a learning criterion to obtain the parameter vector $\theta$.
3. Inference: use the model to obtain the predictor $\hat{t}(x)$ (to be tested on new data).

Learning: Maximum Likelihood. ML selects a value of $\theta$ that is the most likely to have generated the observed training set D:
maximize $p(D \mid \theta)$, or equivalently maximize $\ln p(D \mid \theta)$ (the log-likelihood, LL), or equivalently minimize $-\ln p(D \mid \theta)$ (the negative log-likelihood, NLL).
For discriminative models: minimize $-\ln p(t_D \mid x_D, \theta) = -\sum_{n=1}^{N} \ln p(t_n \mid x_n, \theta)$.
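For the logistic-regression model introduced above, the NLL takes the familiar cross-entropy form; a minimal sketch with made-up data:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def nll(w, X, t):
        """NLL -sum_n ln p(t_n | x_n, w) for logistic regression.

        X: (N, D) matrix whose rows are the feature vectors phi(x_n); t: (N,) labels in {0, 1}.
        """
        p = sigmoid(X @ w)
        eps = 1e-12                    # guard against log(0)
        return -np.sum(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps))

    X = np.array([[1.0, 0.2], [1.0, 0.9], [1.0, -0.4]])   # made-up features (first column = bias)
    t = np.array([0, 1, 0])
    print(nll(np.array([-0.5, 1.0]), X, t))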

Learning: Maximum Likelihood. The problem rarely has an analytical solution and is typically addressed by Stochastic Gradient Descent (SGD). For discriminative models, the update is
$\theta^{\text{new}} \leftarrow \theta^{\text{old}} + \gamma \nabla_\theta \ln p(t_n \mid x_n, \theta) \big|_{\theta = \theta^{\text{old}}}$,
where $\gamma$ is the learning rate. With multi-layer neural networks, this approach yields the backpropagation algorithm.
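A sketch of one such update for logistic regression, where the gradient of $\ln p(t_n \mid x_n, w)$ has the standard form $(t_n - \sigma(w^T \phi(x_n)))\,\phi(x_n)$; the data point below is made up:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def sgd_step(w, x_feat, t_n, lr=0.1):
        """One update theta <- theta + gamma * grad ln p(t_n | x_n, theta) for logistic regression,
        with x_feat = phi(x_n) and log-likelihood gradient (t_n - sigma(w^T phi)) * phi."""
        grad_ll = (t_n - sigmoid(w @ x_feat)) * x_feat
        return w + lr * grad_ll

    w = np.zeros(2)
    w = sgd_step(w, np.array([1.0, 0.9]), 1)   # one step on a single example
    print(w)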

Model Selection. How to select a model (inductive bias)? Model selection typically requires choosing the model order, i.e., the capacity of the model. Ex.: for logistic regression, the model order M is the number of features.

Model Selection. Example: regression using a discriminative model $p(t \mid x)$ given by $\sum_{m=0}^{M} w_m x^m + \mathcal{N}(0, 1)$, where the first term is the predictor $\hat{t}(x)$, a polynomial of order M. [Figure: training points over $x \in [0, 1]$.]
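A minimal sketch of ML learning for this model: with a fixed noise variance, maximizing the likelihood of the Gaussian polynomial model reduces to least squares on polynomial features (the synthetic data below are illustrative, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    N, M = 10, 3
    x = rng.uniform(0, 1, N)
    t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, N)   # made-up training data

    # Design matrix of polynomial features [1, x, x^2, ..., x^M].
    X = np.vander(x, M + 1, increasing=True)

    # ML estimate of the coefficients: for Gaussian noise with fixed variance,
    # maximizing the likelihood is equivalent to minimizing the squared error.
    w_ml, *_ = np.linalg.lstsq(X, t, rcond=None)

    def t_hat(x_new):
        """ML predictor: polynomial of order M evaluated at new inputs."""
        return np.vander(np.atleast_1d(x_new), M + 1, increasing=True) @ w_ml

    print(t_hat(0.5))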

Model Selection. With M = 1, using ML learning of the coefficients. [Figure: the M = 1 fit plotted against the training points over $x \in [0, 1]$.]

Model Selection: Underfitting... With M = 1, the ML predictor $\hat{t}(x)$ underfits the data:
- the model is not rich enough to capture the variations present in the data;
- the training loss $L_D(\theta) = \frac{1}{N} \sum_{n=1}^{N} (t_n - \hat{t}(x_n))^2$ is large.
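Computing this training loss is a one-liner; a trivial sketch with made-up numbers:

    import numpy as np

    def training_loss(t, t_pred):
        """Empirical quadratic loss L_D = (1/N) * sum_n (t_n - t_hat(x_n))^2."""
        t, t_pred = np.asarray(t, dtype=float), np.asarray(t_pred, dtype=float)
        return np.mean((t - t_pred) ** 2)

    print(training_loss([0.1, 0.9, -0.2], [0.0, 1.0, 0.0]))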

Model Selection. With M = 9, using ML learning of the coefficients. [Figure: the M = 9 fit plotted together with the M = 1 fit and the training points.]

Model Selection: ... vs Overfitting. With M = 9, the ML predictor overfits the data:
- the model is too rich and, in order to account for the observations in the training set, it appears to yield inaccurate predictions outside it;
- presumably we have a large generalization loss $L_p(\hat{t}) = \mathbb{E}_{(x, t) \sim p(x, t)}[(t - \hat{t}(x))^2]$.

Model Selection. M = 3 seems to be a reasonable choice... but how do we know, given that we have no data outside of the training set? [Figure: the M = 1, M = 3, and M = 9 fits plotted against the training points.]

Model Selection: Validation. Keep some data aside (a validation set) to estimate the generalization error for different values of M. (See cross-validation for a more efficient way to use the data.)
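A minimal sketch of this procedure for the polynomial example; the synthetic data and the 80/20 split are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 1, 30)
    t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 30)   # made-up data

    x_tr, t_tr = x[:24], t[:24]        # training set
    x_val, t_val = x[24:], t[24:]      # held-out validation set

    def fit(x, t, M):
        """ML (least-squares) fit of a polynomial of order M."""
        X = np.vander(x, M + 1, increasing=True)
        w, *_ = np.linalg.lstsq(X, t, rcond=None)
        return w

    def loss(x, t, w, M):
        """Average quadratic loss of the order-M polynomial with coefficients w."""
        X = np.vander(x, M + 1, increasing=True)
        return np.mean((t - X @ w) ** 2)

    # Estimate the generalization loss on the validation set for each model order
    # and keep the order with the smallest estimate.
    val_losses = {M: loss(x_val, t_val, fit(x_tr, t_tr, M), M) for M in range(1, 10)}
    best_M = min(val_losses, key=val_losses.get)
    print(best_M, val_losses[best_M])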

Model Selection: Validation. Validation allows model order selection. [Figure: root average squared loss versus model order M = 1, ..., 9; the training loss keeps decreasing with M, while the generalization loss estimated via validation is large at small M (underfitting) and at large M (overfitting).] Validation can also be used more generally to select other hyperparameters (e.g., the learning rate).

Model Selection: Validation. Model order selection should depend on the amount of data... It is a problem of bias (asymptotic error) versus generalization gap. [Figure: root average quadratic loss versus the number of training examples for M = 1 and M = 7, showing the training loss and the generalization loss estimated via validation.]

Application to Communication Networks. Fog network architecture [5GPPP]. [Figure: fog architecture spanning the core network and core cloud, the edge network and edge cloud, and wireless access at the edge.]

At the Edge: Overview.
- PHY: detection and decoding, precoding and power allocation, modulation recognition, localization, interference cancellation, joint source-channel coding, equalization in the presence of non-linearities
- MAC/Link: radio resource allocation, scheduling, multi-RAT handover, dynamic spectrum access, admission control
- Network: proactive caching
- Application: computing resource allocation, content request prediction

At the Edge: PHY. Channel detection and decoding – classification [Cammerer et al. '17]

At the Edge: PHY. Channel detection and decoding – classification [Farsad and Goldsmith '18]

At the Edge: PHY. Channel equalization in the presence of non-linearities, e.g., for optical links – regression [Wang et al. '16]

At the Edge: PHY. Channel equalization in the presence of non-linearities, e.g., for satellite links with non-linear amplifiers – regression [Bouchired et al. '98]

At the Edge: PHY. Channel decoding for modulation schemes with complex optimal decoders, e.g., continuous phase modulation – classification [De Veciana and Zakhor '92]

At the Edge: PHY. Channel decoding – classification; leverage domain knowledge to set up the parametrized model to be learned [Nachmani et al. '16]

At the Edge: PHY. Modulation recognition – classification [Agirman-Tosun et al. '11]

At the Edge: PHY. Localization – regression (coordinates) [Fang and Lin '08]

At the Edge: PHY. Precoding and power allocation – regression [Sun et al. '17]

At the Edge: PHY. Interference cancellation – regression [Balatsoukas-Stimming '17]

At the Edge: MAC/Link. Spectrum sensing – classification [Tumuluru et al. '10]

At the Edge: MAC/Link. mmWave channel quality prediction using depth images – regression [Okamoto et al. '18]

At the Edge: Network and Application. Content prediction for proactive caching – classification [Chen et al. '17]

At the Cloud: Overview.
- Network: routing (classification vs. look-up tables), SDN flow table updating, proactive caching, congestion control
- Application: cloud/fog computing, Internet traffic classification

At the Cloud: Network. Link prediction for wireless routing – classification/regression [Wang et al. '06]

At the Cloud: Network. Link prediction for optical routing – classification/regression [Musumeci et al. '18]

At the Cloud: Network. Congestion prediction for smart routing – classification [Tang et al. '17]

At the Cloud: Network and Application. Traffic classification – classification [Nguyen et al. '08]

Overview: Supervised Learning, Unsupervised Learning, Reinforcement Learning.

Unsupervised Learning. Unsupervised learning tasks operate over unlabelled data sets. General goal: discover properties of the data, e.g., for compressed representation. "Some of us see unsupervised learning as the key towards machines with common sense." (Y. LeCun)

"Defining" Unsupervised Learning. Training set D: $x_n \sim_{\text{i.i.d.}} p(x)$, $n = 1, \ldots, N$. Goal: learn some useful properties of the distribution $p(x)$. Alternative viewpoints to the frequentist framework: Bayesian and MDL.

Unsupervised Learning Tasks:
- Density estimation: estimate $p(x)$, e.g., for use in plug-in estimators or compression algorithms, or to detect outliers.
- Clustering: partition all points in D into groups of similar objects (e.g., document clustering).
- Dimensionality reduction, representation, and feature extraction: represent each data point $x_n$ in a space of lower dimensionality, e.g., to highlight independent explanatory factors and/or to ease visualization, interpretation, or successive tasks.
- Generation of new samples: learn a machine that produces samples approximately distributed according to $p(x)$, e.g., to produce artificial scenes for games or films.

Unsupervised Learning:
1. Model selection (inductive bias): define a parametric model $p(x \mid \theta)$.
2. Learning: given data D, optimize a learning criterion to obtain the parameter vector $\theta$.
3. Clustering, feature extraction, sample generation, ...

Models. Unsupervised learning models typically involve hidden, or latent, variables: $z_n$ denotes the hidden (latent) variables associated with each data point $x_n$. Ex.: $z_n$ = cluster index of $x_n$.

(a) Directed Generative Models. Model the data x as being caused by z: $p(x \mid \theta) = \sum_z p(z \mid \theta)\, p(x \mid z, \theta)$.

(a) Directed Generative Models. Ex.: document clustering:
- x is a document, and z is (interpreted as) its topic;
- $p(z \mid \theta)$ = distribution of topics;
- $p(x \mid z, \theta)$ = distribution of words in the document given the topic.
Basic representatives: mixture of Gaussians; likelihood-free models.
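As a concrete instance, a minimal sketch of a one-dimensional mixture of Gaussians with made-up parameters $\theta$, showing both the marginal $p(x \mid \theta) = \sum_z p(z \mid \theta) p(x \mid z, \theta)$ and ancestral sampling (draw z, then x given z):

    import numpy as np

    # Made-up parameters theta of a two-component 1-D mixture of Gaussians.
    p_z = np.array([0.3, 0.7])      # p(z | theta): mixing weights
    mu = np.array([-2.0, 1.5])      # component means
    sigma = np.array([0.5, 1.0])    # component standard deviations

    def gaussian_pdf(x, m, s):
        return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

    def marginal(x):
        """p(x | theta) = sum_z p(z | theta) p(x | z, theta)."""
        return float(np.sum(p_z * gaussian_pdf(x, mu, sigma)))

    def sample(rng):
        """Ancestral sampling: draw the latent z, then x given z."""
        z = rng.choice(2, p=p_z)
        return z, rng.normal(mu[z], sigma[z])

    rng = np.random.default_rng(0)
    print(marginal(0.0), sample(rng))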

(d) Autoencoders. Model the encoding from the data to the hidden variables, as well as the decoding from the hidden variables back to the data: $p(z \mid x, \theta)$ and $p(x \mid z, \theta)$.

(d) Autoencoders. Ex.: compression:
- x is an image and z is (interpreted as) a compressed (e.g., sparse) representation;
- $p(z \mid x, \theta)$ = compression of the image into the representation;
- $p(x \mid z, \theta)$ = decompression of the representation into an image.
Basic representatives: Principal Component Analysis (PCA), dictionary learning, neural network-based autoencoders.
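A minimal sketch of the PCA case, in which encoding and decoding are deterministic linear maps z = W^T (x - mean) and x_hat = W z + mean; the synthetic data are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # synthetic correlated data (rows = samples)

    mean = X.mean(axis=0)
    k = 2                                     # dimension of the latent representation z

    # Principal directions: top-k right singular vectors of the centered data.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    W = Vt[:k].T                              # (5, k) encoding/decoding matrix

    def encode(x):
        """'Compression' of a data point into its k latent coordinates."""
        return W.T @ (x - mean)

    def decode(z):
        """'Decompression' of the latent coordinates back into data space."""
        return W @ z + mean

    x = X[0]
    print(np.linalg.norm(x - decode(encode(x))))   # reconstruction error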
