A Brief Introduction to Machine Learning (With Applications to Communications)
Osvaldo Simeone, King's College London, 11 June 2018

When the True Distribution p(x, t) is Known... we do not need data D, and we have a standard inference problem, i.e., estimation (regression) or detection (classification). The solution can be computed directly from the posterior distribution $p(t \mid x) = p(x, t)/p(x)$ as $\hat{t}^*(x) = \arg\min_{\hat{t}} \mathbb{E}_{t \sim p(t \mid x)}[\ell(t, \hat{t}) \mid x]$.

When the Model p(x, t) is Known... With the quadratic loss, the optimal predictor is the conditional mean: $\hat{t}^*(x) = \mathbb{E}_{t \sim p(t \mid x)}[t \mid x]$. With the probability-of-error loss, it is the maximum a posteriori (MAP) predictor: $\hat{t}^*(x) = \arg\max_t p(t \mid x)$.

Example: with the joint distribution

  x \ t    t = 0    t = 1
  x = 0     0.05     0.45
  x = 1     0.4      0.1

we have $p(t = 1 \mid x = 0) = 0.9$, and hence $\hat{t}^*(x = 0) = 0.9 \times 1 + 0.1 \times 0 = 0.9$ for the quadratic loss and $\hat{t}^*(x = 0) = 1$ for the probability of error (MAP).
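As a sanity check of this example, here is a minimal NumPy sketch (not part of the slides) that recovers the posterior, the conditional-mean predictor, and the MAP predictor from the joint table:

    import numpy as np

    # Joint distribution p(x, t) from the table above:
    # rows index x in {0, 1}, columns index t in {0, 1}.
    p_xt = np.array([[0.05, 0.45],
                     [0.40, 0.10]])

    x = 0
    p_t_given_x = p_xt[x] / p_xt[x].sum()                   # posterior p(t | x = 0) = [0.1, 0.9]

    t_hat_quadratic = p_t_given_x @ np.array([0.0, 1.0])    # conditional mean -> 0.9
    t_hat_map = int(np.argmax(p_t_given_x))                 # MAP decision     -> 1

    print(t_hat_quadratic, t_hat_map)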

When the True Distribution p(x, t) is Not Known... we need data D, and we have a learning problem:
1. Model selection (inductive bias): define a parametric model, either a generative model $p(x, t \mid \theta)$ or a discriminative model $p(t \mid x, \theta)$.
2. Learning: given data D, optimize a learning criterion to obtain the parameter vector $\theta$.
3. Inference: use the model to obtain the predictor $\hat{t}(x)$ (to be tested on new data).

Logistic Regression. Example: binary classification ($t \in \{0, 1\}$). 1. Model selection (inductive bias): logistic regression (a discriminative model); $\phi(x) = [\phi_1(x) \cdots \phi_{D'}(x)]^T$ is a vector of features (e.g., a bag-of-words model for a text).

Logistic Regression. Parametric probabilistic model: $p(t = 1 \mid x, w) = \sigma(w^T \phi(x))$, where $\sigma(a) = (1 + \exp(-a))^{-1}$ is the sigmoid function.

Logistic Regression. 2. Learning: to be discussed. 3. Inference: with the probability-of-error loss, MAP classification thresholds the logit (or log-likelihood ratio, LLR) at zero: $w^T \phi(x) \underset{\hat{t} = 0}{\overset{\hat{t} = 1}{\gtrless}} 0$.
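A minimal NumPy sketch of this classifier; the feature map and the weight values below are illustrative assumptions, not taken from the slides:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def phi(x):
        # Illustrative feature map: a bias term plus the raw scalar input.
        return np.array([1.0, x])

    def predict_proba(x, w):
        """Model probability p(t = 1 | x, w) = sigma(w^T phi(x))."""
        return sigmoid(w @ phi(x))

    def map_classify(x, w):
        """MAP decision: threshold the logit (LLR) w^T phi(x) at zero."""
        return int(w @ phi(x) > 0)

    w = np.array([-1.0, 2.0])       # example weight vector
    print(predict_proba(0.8, w))    # approx. 0.65
    print(map_classify(0.8, w))     # 1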

Multi-Layer Neural Networks. 1. Model selection (inductive bias): multi-layer neural network (a discriminative model). Multiple layers of learnable weights enable feature learning.
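A minimal sketch of such a discriminative model with a single hidden layer; the layer sizes, random weights, and ReLU non-linearity are illustrative assumptions, not taken from the slides:

    import numpy as np

    def relu(a):
        return np.maximum(a, 0.0)

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def mlp_predict_proba(x, W1, b1, w2, b2):
        """p(t = 1 | x, theta) for a two-layer network."""
        h = relu(W1 @ x + b1)          # hidden layer acts as a learned feature map
        return sigmoid(w2 @ h + b2)    # logistic output layer

    rng = np.random.default_rng(0)
    x = rng.normal(size=3)
    W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
    w2, b2 = rng.normal(size=4), 0.0
    print(mlp_predict_proba(x, W1, b1, w2, b2))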

Supervised Learning:
1. Model selection (inductive bias): define a parametric model, either a generative model $p(x, t \mid \theta)$ or a discriminative model $p(t \mid x, \theta)$.
2. Learning: given data D, optimize a learning criterion to obtain the parameter vector $\theta$.
3. Inference: use the model to obtain the predictor $\hat{t}(x)$ (to be tested on new data).

Learning: Maximum Likelihood. ML selects a value of $\theta$ that is the most likely to have generated the observed training set D:
maximize $p(D \mid \theta)$, or equivalently maximize $\ln p(D \mid \theta)$ (the log-likelihood, LL), or equivalently minimize $-\ln p(D \mid \theta)$ (the negative log-likelihood, NLL).
For discriminative models: minimize $-\ln p(t_D \mid x_D, \theta) = -\sum_{n=1}^{N} \ln p(t_n \mid x_n, \theta)$.
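For the logistic-regression model introduced above, the NLL takes the familiar cross-entropy form; a minimal sketch with made-up data:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def nll(w, X, t):
        """NLL -sum_n ln p(t_n | x_n, w) for logistic regression.

        X: (N, D) matrix whose rows are the feature vectors phi(x_n); t: (N,) labels in {0, 1}.
        """
        p = sigmoid(X @ w)
        eps = 1e-12                    # guard against log(0)
        return -np.sum(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps))

    X = np.array([[1.0, 0.2], [1.0, 0.9], [1.0, -0.4]])   # made-up features (first column = bias)
    t = np.array([0, 1, 0])
    print(nll(np.array([-0.5, 1.0]), X, t))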

Learning: Maximum Likelihood. The problem rarely has an analytical solution and is typically addressed by Stochastic Gradient Descent (SGD). For discriminative models, the update is
$\theta^{\text{new}} \leftarrow \theta^{\text{old}} + \gamma \nabla_\theta \ln p(t_n \mid x_n, \theta) \big|_{\theta = \theta^{\text{old}}}$,
where $\gamma$ is the learning rate. With multi-layer neural networks, this approach yields the backpropagation algorithm.
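A sketch of one such update for logistic regression, where the gradient of $\ln p(t_n \mid x_n, w)$ has the standard form $(t_n - \sigma(w^T \phi(x_n)))\,\phi(x_n)$; the data point below is made up:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def sgd_step(w, x_feat, t_n, lr=0.1):
        """One update theta <- theta + gamma * grad ln p(t_n | x_n, theta) for logistic regression,
        with x_feat = phi(x_n) and log-likelihood gradient (t_n - sigma(w^T phi)) * phi."""
        grad_ll = (t_n - sigmoid(w @ x_feat)) * x_feat
        return w + lr * grad_ll

    w = np.zeros(2)
    w = sgd_step(w, np.array([1.0, 0.9]), 1)   # one step on a single example
    print(w)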

Model Selection. How to select a model (inductive bias)? Model selection typically requires choosing the model order, i.e., the capacity of the model. Ex.: for logistic regression, the model order M is the number of features.

Model Selection. Example: regression using a discriminative model $p(t \mid x)$ given by $\sum_{m=0}^{M} w_m x^m + \mathcal{N}(0, 1)$, where the first term is the predictor $\hat{t}(x)$, a polynomial of order M. [Figure: training points over $x \in [0, 1]$.]
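A minimal sketch of ML learning for this model: with a fixed noise variance, maximizing the likelihood of the Gaussian polynomial model reduces to least squares on polynomial features (the synthetic data below are illustrative, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    N, M = 10, 3
    x = rng.uniform(0, 1, N)
    t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, N)   # made-up training data

    # Design matrix of polynomial features [1, x, x^2, ..., x^M].
    X = np.vander(x, M + 1, increasing=True)

    # ML estimate of the coefficients: for Gaussian noise with fixed variance,
    # maximizing the likelihood is equivalent to minimizing the squared error.
    w_ml, *_ = np.linalg.lstsq(X, t, rcond=None)

    def t_hat(x_new):
        """ML predictor: polynomial of order M evaluated at new inputs."""
        return np.vander(np.atleast_1d(x_new), M + 1, increasing=True) @ w_ml

    print(t_hat(0.5))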

Model Selection. With M = 1, using ML learning of the coefficients. [Figure: the M = 1 fit plotted against the training points over $x \in [0, 1]$.]

Model Selection: Underfitting... With M = 1, the ML predictor $\hat{t}(x)$ underfits the data:
- the model is not rich enough to capture the variations present in the data;
- the training loss $L_D(\theta) = \frac{1}{N} \sum_{n=1}^{N} (t_n - \hat{t}(x_n))^2$ is large.
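Computing this training loss is a one-liner; a trivial sketch with made-up numbers:

    import numpy as np

    def training_loss(t, t_pred):
        """Empirical quadratic loss L_D = (1/N) * sum_n (t_n - t_hat(x_n))^2."""
        t, t_pred = np.asarray(t, dtype=float), np.asarray(t_pred, dtype=float)
        return np.mean((t - t_pred) ** 2)

    print(training_loss([0.1, 0.9, -0.2], [0.0, 1.0, 0.0]))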

Model Selection. With M = 9, using ML learning of the coefficients. [Figure: the M = 9 fit plotted together with the M = 1 fit and the training points.]

Model Selection: ... vs Overfitting. With M = 9, the ML predictor overfits the data:
- the model is too rich and, in order to account for the observations in the training set, it appears to yield inaccurate predictions outside it;
- presumably we have a large generalization loss $L_p(\hat{t}) = \mathbb{E}_{(x, t) \sim p(x, t)}[(t - \hat{t}(x))^2]$.

Model Selection. M = 3 seems to be a reasonable choice... but how do we know, given that we have no data outside of the training set? [Figure: the M = 1, M = 3, and M = 9 fits plotted against the training points.]

Model Selection: Validation. Keep some data aside (a validation set) to estimate the generalization error for different values of M. (See cross-validation for a more efficient way to use the data.)
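A minimal sketch of this procedure for the polynomial example; the synthetic data and the 80/20 split are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 1, 30)
    t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 30)   # made-up data

    x_tr, t_tr = x[:24], t[:24]        # training set
    x_val, t_val = x[24:], t[24:]      # held-out validation set

    def fit(x, t, M):
        """ML (least-squares) fit of a polynomial of order M."""
        X = np.vander(x, M + 1, increasing=True)
        w, *_ = np.linalg.lstsq(X, t, rcond=None)
        return w

    def loss(x, t, w, M):
        """Average quadratic loss of the order-M polynomial with coefficients w."""
        X = np.vander(x, M + 1, increasing=True)
        return np.mean((t - X @ w) ** 2)

    # Estimate the generalization loss on the validation set for each model order
    # and keep the order with the smallest estimate.
    val_losses = {M: loss(x_val, t_val, fit(x_tr, t_tr, M), M) for M in range(1, 10)}
    best_M = min(val_losses, key=val_losses.get)
    print(best_M, val_losses[best_M])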

Model Selection: Validation. Validation allows model order selection. [Figure: root average squared loss versus model order M = 1, ..., 9; the training loss keeps decreasing with M, while the generalization loss estimated via validation is large at small M (underfitting) and at large M (overfitting).] Validation can also be used more generally to select other hyperparameters (e.g., the learning rate).

Model Selection: Validation. Model order selection should depend on the amount of data... It is a problem of bias (asymptotic error) versus generalization gap. [Figure: root average quadratic loss versus the number of training examples for M = 1 and M = 7, showing the training loss and the generalization loss estimated via validation.]

Application to Communication Networks. Fog network architecture [5GPPP]. [Figure: fog architecture spanning the core network and core cloud, the edge network and edge cloud, and wireless access at the edge.]

At the Edge: Overview.
- PHY: detection and decoding, precoding and power allocation, modulation recognition, localization, interference cancellation, joint source-channel coding, equalization in the presence of non-linearities
- MAC/Link: radio resource allocation, scheduling, multi-RAT handover, dynamic spectrum access, admission control
- Network: proactive caching
- Application: computing resource allocation, content request prediction

At the Edge: PHY. Channel detection and decoding – classification [Cammerer et al. '17]

At the Edge: PHY. Channel detection and decoding – classification [Farsad and Goldsmith '18]

At the Edge: PHY. Channel equalization in the presence of non-linearities, e.g., for optical links – regression [Wang et al. '16]

At the Edge: PHY. Channel equalization in the presence of non-linearities, e.g., for satellite links with non-linear amplifiers – regression [Bouchired et al. '98]

At the Edge: PHY. Channel decoding for modulation schemes with complex optimal decoders, e.g., continuous phase modulation – classification [De Veciana and Zakhor '92]

At the Edge: PHY. Channel decoding – classification; leverage domain knowledge to set up the parametrized model to be learned [Nachmani et al. '16]

At the Edge: PHY. Modulation recognition – classification [Agirman-Tosun et al. '11]

At the Edge: PHY. Localization – regression (coordinates) [Fang and Lin '08]

At the Edge: PHY. Precoding and power allocation – regression [Sun et al. '17]

At the Edge: PHY. Interference cancellation – regression [Balatsoukas-Stimming '17]

At the Edge: MAC/Link. Spectrum sensing – classification [Tumuluru et al. '10]

At the Edge: MAC/Link. mmWave channel quality prediction using depth images – regression [Okamoto et al. '18]

At the Edge: Network and Application. Content prediction for proactive caching – classification [Chen et al. '17]

At the Cloud: Overview.
- Network: routing (classification vs. look-up tables), SDN flow table updating, proactive caching, congestion control
- Application: cloud/fog computing, Internet traffic classification

At the Cloud: Network. Link prediction for wireless routing – classification/regression [Wang et al. '06]

At the Cloud: Network. Link prediction for optical routing – classification/regression [Musumeci et al. '18]

At the Cloud: Network. Congestion prediction for smart routing – classification [Tang et al. '17]

At the Cloud: Network and Application. Traffic classification – classification [Nguyen et al. '08]

Overview: Supervised Learning, Unsupervised Learning, Reinforcement Learning.

Unsupervised Learning. Unsupervised learning tasks operate over unlabelled data sets. General goal: discover properties of the data, e.g., for compressed representation. "Some of us see unsupervised learning as the key towards machines with common sense." (Y. LeCun)

"Defining" Unsupervised Learning. Training set D: $x_n \sim_{\text{i.i.d.}} p(x)$, $n = 1, \ldots, N$. Goal: learn some useful properties of the distribution $p(x)$. Alternative viewpoints to the frequentist framework: Bayesian and MDL.

Unsupervised Learning Tasks:
- Density estimation: estimate $p(x)$, e.g., for use in plug-in estimators or compression algorithms, or to detect outliers.
- Clustering: partition all points in D into groups of similar objects (e.g., document clustering).
- Dimensionality reduction, representation, and feature extraction: represent each data point $x_n$ in a space of lower dimensionality, e.g., to highlight independent explanatory factors and/or to ease visualization, interpretation, or successive tasks.
- Generation of new samples: learn a machine that produces samples approximately distributed according to $p(x)$, e.g., to produce artificial scenes for games or films.

Unsupervised Learning:
1. Model selection (inductive bias): define a parametric model $p(x \mid \theta)$.
2. Learning: given data D, optimize a learning criterion to obtain the parameter vector $\theta$.
3. Clustering, feature extraction, sample generation, ...

Models. Unsupervised learning models typically involve hidden, or latent, variables: $z_n$ denotes the hidden (latent) variables associated with each data point $x_n$. Ex.: $z_n$ = cluster index of $x_n$.

(a) Directed Generative Models. Model the data x as being caused by z: $p(x \mid \theta) = \sum_z p(z \mid \theta)\, p(x \mid z, \theta)$.

(a) Directed Generative Models. Ex.: document clustering:
- x is a document, and z is (interpreted as) its topic;
- $p(z \mid \theta)$ = distribution of topics;
- $p(x \mid z, \theta)$ = distribution of words in the document given the topic.
Basic representatives: mixture of Gaussians; likelihood-free models.
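As a concrete instance, a minimal sketch of a one-dimensional mixture of Gaussians with made-up parameters $\theta$, showing both the marginal $p(x \mid \theta) = \sum_z p(z \mid \theta) p(x \mid z, \theta)$ and ancestral sampling (draw z, then x given z):

    import numpy as np

    # Made-up parameters theta of a two-component 1-D mixture of Gaussians.
    p_z = np.array([0.3, 0.7])      # p(z | theta): mixing weights
    mu = np.array([-2.0, 1.5])      # component means
    sigma = np.array([0.5, 1.0])    # component standard deviations

    def gaussian_pdf(x, m, s):
        return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

    def marginal(x):
        """p(x | theta) = sum_z p(z | theta) p(x | z, theta)."""
        return float(np.sum(p_z * gaussian_pdf(x, mu, sigma)))

    def sample(rng):
        """Ancestral sampling: draw the latent z, then x given z."""
        z = rng.choice(2, p=p_z)
        return z, rng.normal(mu[z], sigma[z])

    rng = np.random.default_rng(0)
    print(marginal(0.0), sample(rng))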

(d) Autoencoders. Model the encoding from the data to the hidden variables, as well as the decoding from the hidden variables back to the data: $p(z \mid x, \theta)$ and $p(x \mid z, \theta)$.

(d) Autoencoders. Ex.: compression:
- x is an image and z is (interpreted as) a compressed (e.g., sparse) representation;
- $p(z \mid x, \theta)$ = compression of the image into the representation;
- $p(x \mid z, \theta)$ = decompression of the representation into an image.
Basic representatives: Principal Component Analysis (PCA), dictionary learning, neural network-based autoencoders.
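A minimal sketch of the PCA case, in which encoding and decoding are deterministic linear maps z = W^T (x - mean) and x_hat = W z + mean; the synthetic data are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # synthetic correlated data (rows = samples)

    mean = X.mean(axis=0)
    k = 2                                     # dimension of the latent representation z

    # Principal directions: top-k right singular vectors of the centered data.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    W = Vt[:k].T                              # (5, k) encoding/decoding matrix

    def encode(x):
        """'Compression' of a data point into its k latent coordinates."""
        return W.T @ (x - mean)

    def decode(z):
        """'Decompression' of the latent coordinates back into data space."""
        return W @ z + mean

    x = X[0]
    print(np.linalg.norm(x - decode(encode(x))))   # reconstruction error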
