SLIDE 1

Machine Learning and Visualisation

Ian T. Nabney

Aston University, Birmingham, UK

March 2015

SLIDE 2

Outline

• The challenge of hidden knowledge
• Data visualisation: latent variable models
• Data visualisation: topographic mappings
• Non-linear modelling and feature selection

SLIDE 3

Acknowledgements

Collaborators: Chris Bishop, Mike Tipping, David Lowe, Markus Svensén, Chris Williams, Peter Tiňo, Yi Sun, Dharmesh Maniyar, John Owen, Phil Laflin, Bruce Williams, Paola Gaolini, Jens Lösel, Martin Schroeder, Ain Abdul Karim, Dan Cornford, Cliff Bailey, Naomi Hubber, Shahzad Mumtaz, Michel Randrianandrasana, Richard Barnes, Colin Smith, Dan Wells.

SLIDE 4

Hidden Knowledge

Understanding the vast quantities of data that surround us is a real challenge, particularly in situations with a lot of variables; we can understand more of it with help. Machine learning is the computer-based generation of models from data. A model is a parameterised function from input attributes to an output prediction. The parameters of the model express the hidden connection between inputs and predictions; they are learned from data.

SLIDE 5

Data Visualisation

What is Visualisation?

The goal of visualisation is to present data in a human-readable way. Visualisation is an important tool for developing a better understanding of large, complex datasets. It is particularly helpful for users, such as research scientists or clinicians, who are not specialists in data modelling. Typical uses:

• Detection of outliers.
• Clustering and segmentation.
• Aid to feature selection.
• Feedback on results of analysis.

There are two aspects: data projection and information visualisation.

SLIDE 6

Data Visualisation

Data Projection

The goal is to project data to a lower-dimensional space (usually 2D) while preserving as much information or structure as possible. Once the projection is done, standard information visualisation approaches can be used to support user interaction. The quantity and complexity of many datasets mean that simple visualisation methods, such as Principal Component Analysis, are not very effective.

SLIDE 7

Data Visualisation

Information Visualisation

Shneiderman's mantra: overview first; zoom and filter; details on demand.

• Overview: provided by the projection.
• Zoom: possible in Matlab plots.
• Filter: by user interaction, e.g. specify a pattern of values that is of interest.
• Details: provided as local information.

We will see more of this later in practical examples.

SLIDE 8

Data Visualisation

Information Visualisation Examples

Word Cloud (www.wordle.net)

SLIDE 9

Data Visualisation

Uncertainty

"Doubt is not a pleasant condition, but certainty is absurd." (Voltaire)

Real data is noisy, so we are forced to deal with uncertainty, yet we need to be quantitative. The optimal formalism for inference in the presence of uncertainty is probability theory. We assume the presence of an underlying regularity to make predictions. Bayesian inference allows us to reason probabilistically about the model as well as the data.

SLIDE 10

Data Visualisation

Data Projection

[Diagram: a parameterised mapping f(y; W) projects points y1, y2, y3 from the data space D to the visualisation space V.]

Define f to optimise some criterion: for PCA the criterion is minimal variance loss; for the Sammon mapping it is minimal stress.

SLIDE 11

Data Visualisation

What can we learn from this?

[Scatter plot of a 2D projection, axes running from −20 to 10, with points labelled Sinus, VEL and VER.]

SLIDE 12

Data Visualisation

Projection

What is the simplest way to project data? A linear map. What is the best way to linearly project data? We want to preserve as much information as possible. If we assume that information is measured by variance, this implies choosing new coordinate axes along directions of maximal variance; these can be found by analysing the covariance matrix of the data. This gives Principal Component Analysis (PCA). For large datasets, the end result is usually a circular blob in the middle of the screen.

SLIDE 13

Data Visualisation

PCA

Let $S$ be the covariance matrix of the data, so that

$$S_{ij} = \frac{1}{N} \sum_n (x_i^n - \bar{x}_i)(x_j^n - \bar{x}_j).$$

The first $q$ principal components are the first $q$ eigenvectors $w_j$ of $S$, ordered by the size of the eigenvalues $\lambda_j$. The percentage of the variance explained by the first $q$ PCs is

$$\frac{\sum_{j=1}^{q} \lambda_j}{\sum_{j=1}^{d} \lambda_j},$$

where the data dimension is $d$. These vectors are orthonormal (perpendicular and of unit length), and the variance when the data is projected onto them is maximal. Plot the sorted principal values: plot(-sort(-eig(cov(data))));
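Putting the pieces together, here is a minimal Matlab sketch of PCA projection to 2D; the variable names (data, q) and the final plot are illustrative assumptions, not code from the talk.

```matlab
% Minimal PCA sketch: project an N x d data matrix onto its first q
% principal components. Variable names are illustrative.
q = 2;                                  % target dimension for visualisation
mu = mean(data);                        % d-dimensional mean
X = data - repmat(mu, size(data,1), 1); % centre the data
[V, L] = eig(cov(X));                   % eigenvectors/values of covariance
[lambda, idx] = sort(diag(L), 'descend');
W = V(:, idx(1:q));                     % first q principal directions
proj = X * W;                           % N x q projected coordinates
explained = sum(lambda(1:q)) / sum(lambda);  % fraction of variance kept
plot(proj(:,1), proj(:,2), '.');
```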

SLIDE 14

Data Visualisation: Topographic Mappings

Topographic Mappings

The basic aim is that distances in the visualisation space are as close as possible to those in the original data space. Given a dissimilarity matrix $d_{ij}$, we want to map data points $x_i$ to points $y_i$ in a feature space such that their dissimilarities in feature space, $\tilde{d}_{ij}$, are as close as possible to the $d_{ij}$. We say that the map preserves similarities. The stress measure is used as the objective function:

$$E = \frac{1}{\sum_{i<j} d_{ij}} \sum_{i<j} \frac{(d_{ij} - \tilde{d}_{ij})^2}{d_{ij}}.$$
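To make the objective concrete, here is a small Matlab sketch of the stress computation; it assumes D and Dt are precomputed dissimilarity matrices and is a sketch, not the talk's code.

```matlab
% Sammon stress between original dissimilarities D and mapped
% dissimilarities Dt (both n x n symmetric matrices, assumed inputs).
function E = sammon_stress(D, Dt)
    mask = triu(true(size(D)), 1);      % use each pair i < j once
    d = D(mask); dt = Dt(mask);
    E = sum((d - dt).^2 ./ d) / sum(d); % normalised squared mismatch
end
```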

SLIDE 15

Data Visualisation: Topographic Mappings

Multi-Dimensional Scaling

Given distances or dissimilarities $d_{rs}$ between every pair of observations, try to preserve these as far as possible in a lower-dimensional space. In classical scaling, the distance between the objects is assumed to be Euclidean; a linear projection then corresponds to PCA. The Sammon mapping is a non-linear multidimensional scaling technique that is more general (and more widely used) than classical scaling. Neuroscale is a neural-network-based scaling technique that has the advantage of actually giving a map that generalises!
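For reference, classical scaling can be written in a few lines of Matlab. This is a sketch assuming a squared-distance matrix D2 is given; it is not code from the talk.

```matlab
% Classical (Torgerson) scaling: embed n points in q dimensions from an
% n x n matrix of squared Euclidean distances D2 (assumed input).
n = size(D2, 1); q = 2;
J = eye(n) - ones(n) / n;              % centring matrix
B = -0.5 * J * D2 * J;                 % double-centred Gram matrix
[V, L] = eig((B + B') / 2);            % symmetrise for numerical safety
[lambda, idx] = sort(diag(L), 'descend');
Y = V(:, idx(1:q)) * diag(sqrt(max(lambda(1:q), 0)));  % n x q embedding
```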

SLIDE 16

Data Visualisation: Topographic Mappings

Neuroscale

SLIDE 17

Data Visualisation: Topographic Mappings

Biological Application: Streptomyces Gene Expression

Data supplied by Colin Smith (Surrey University). Streptomyces coelicolor is a bacterium which undergoes developmental changes correlated with sporulation and the production of antibiotics. Its 7825 genes include more than 20 clusters coding for secondary metabolites, as well as a large proportion of regulatory genes. The dataset consists of ten time points from 16 to 67 hours after inoculation of the growth medium; the analysis is based on the 3067 genes that were significantly expressed. SCO6283, SCO6284, SCO6277 and SCO6278 are co-regulated genes involved in the synthesis of a type I polyketide; SCO3245 is involved in the synthesis of lipid.

SLIDE 18

Data Visualisation: Topographic Mappings

Streptomycin

[Figure: life of streptomycin.] Bioinformatics: measuring the expression levels of thousands of genes over multiple timepoints.

SLIDE 19

Data Visualisation: Topographic Mappings

SCO6283, SCO6284, SCO6277, SCO6278 in cluster 11, SCO3245 in cluster 12.

SLIDE 20

Data Visualisation: Topographic Mappings

Genes involved in the synthesis of two distinct secondary metabolites may be co-regulated by a common network.

SLIDE 21

Data Visualisation: Latent Variable Models

Latent Variable Models

The projection approach is one way of reducing the data complexity. An alternative view is to hypothesise how the data might have been generated.

Hidden Connections: "A hidden connection is stronger than an obvious one." (Heraclitus)

SLIDE 22

Data Visualisation: Latent Variable Models

Latent Variable Models

How is the idea of hidden connections applied to statistical pattern recognition? Separate the observed variables and the latent variables: latent variables generate observations, and we use (probabilistic) inference to deduce what is happening in latent variable space, often via Bayes' theorem:

$$P(L \mid O) = \frac{P(O \mid L)\, P(L)}{P(O)}.$$

Static case: GTM, with two latent variables and a non-linear transformation to observation space. Dynamic cases:

• Hidden Markov Models: discrete state space (speech recognition).
• State space models: continuous state space (tracking).

SLIDE 23

Data Visualisation: Latent Variable Models

Visualisation with Density Models

Construct a generative model for the data, mapping from a low-dimensional latent space $H$ to the data space $D$. The model maps latent variables $r$ to observed variables $x$, giving a probability density $p(x \mid r)$. To visualise the data we want to map from observed variables to latent variables: use Bayes' theorem to compute

$$p(r \mid x) = \frac{p(x \mid r)\, p(r)}{p(x)}.$$

Plot a summary statistic of $p(r_i \mid x_i)$ for each data point $x_i$: usually the mean. If the mapping is linear and there is a single Gaussian noise model, we recover PCA. A sketch of the posterior-mean projection follows.
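Here is a minimal Matlab sketch of the posterior-mean projection for a grid-based latent variable model with an isotropic Gaussian noise model. The inputs R, Ygrid, X and beta, and the uniform prior over grid points, are assumptions for illustration.

```matlab
% Posterior-mean projection. Assumed inputs: latent grid points R (K x 2),
% their images Ygrid (K x d) under the learned mapping, data X (N x d),
% noise precision beta; uniform prior over the grid.
K = size(R, 1); N = size(X, 1);
dist2 = zeros(K, N);                    % squared distances ||y_k - x_n||^2
for k = 1:K
    diff = X - repmat(Ygrid(k,:), N, 1);
    dist2(k,:) = sum(diff.^2, 2)';
end
logp = -0.5 * beta * dist2;             % log p(x_n | r_k) up to a constant
logp = logp - repmat(max(logp), K, 1);  % stabilise before exponentiating
post = exp(logp);
post = post ./ repmat(sum(post), K, 1); % p(r_k | x_n); columns sum to 1
means = post' * R;                      % N x 2 posterior means to plot
plot(means(:,1), means(:,2), '.');
```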

SLIDE 24

Data Visualisation: Latent Variable Models

[Diagram: the non-linear mapping y(z; W) takes the latent space with coordinates z1, z2 to a curved two-dimensional manifold in the data space with coordinates x1, x2, x3.]

SLIDE 25

Data Visualisation: Latent Variable Models

The Generative Topographic Mapping

GTM (Bishop, Svensén and Williams) is a latent variable model with a non-linear RBF mapping $f_M$ from a (usually two-dimensional) latent space $H$ to the data space $D$. The data doesn't live exactly on the manifold, so we smear it with Gaussian noise. Introduce a latent space density $p(x)$, approximated by a data sample; this makes GTM a generative probabilistic model. The model assumes that the data lies close to a two-dimensional manifold; however, this is likely to be too simple a model for interesting data. We can measure the non-linearity of the sheet and use this to understand the visualisation plot. The model is trained in a maximum likelihood framework using an iterative algorithm (EM); a sketch of one iteration follows.
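A minimal Matlab sketch of one EM iteration, following the standard GTM update equations. Phi, W, beta, lambda and the data X are assumed to be already initialised; this is an illustration, not the Netlab implementation.

```matlab
% One EM iteration for GTM. Assumed inputs: X (N x d data), Phi (K x M
% RBF basis activations at the K latent grid points), W (M x d weights),
% beta (noise precision), lambda (weight regulariser).
[N, d] = size(X); K = size(Phi, 1);
Y = Phi * W;                            % images of grid points in data space
dist2 = zeros(K, N);                    % squared distances ||y_k - x_n||^2
for k = 1:K
    diff = X - repmat(Y(k,:), N, 1);
    dist2(k,:) = sum(diff.^2, 2)';
end
% E-step: responsibilities R(k,n) = p(k | x_n)
logR = -0.5 * beta * dist2;
logR = logR - repmat(max(logR), K, 1);  % stabilise
R = exp(logR);
R = R ./ repmat(sum(R), K, 1);
% M-step: solve (Phi' G Phi + (lambda/beta) I) W = Phi' R X for W
G = diag(sum(R, 2));
W = (Phi' * G * Phi + (lambda / beta) * eye(size(Phi, 2))) \ (Phi' * R * X);
% Noise precision update (using the pre-update distances for simplicity)
beta = N * d / sum(sum(R .* dist2));
```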

SLIDE 26

Data Visualisation: Latent Variable Models

Enhancements to GTM

• Curvatures give more information about the shape of the manifold.
• Hierarchy allows the user to drill down into the data; either user-defined or automated (MML) selection of sub-model positions.
• Temporal dependencies in data are handled by GTM Through Time.
• Discrete data is handled by the Latent Trait Model (LTM): all the other goodies work for it as well.
• Can cope with missing data in training and visualisation.
• MML methods for feature selection.
• Structured covariance.
• Mixed data types.

SLIDE 30

Data Visualisation: Latent Variable Models

Local Parallel Coordinates

Parallel coordinates maps the d-dimensional data space onto two display dimensions by using d equidistant axes parallel to the y-axis. Each data point is displayed as a piecewise-linear graph intersecting each axis at the position corresponding to the data value for that dimension. It is impractical to display this for all the data points, so we allow the user to select a region of interest. The user can also interact with the local parallel coordinates plot to obtain detailed information; a sketch of the basic plot is below.
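A minimal Matlab sketch of the basic parallel coordinates drawing. The min-max normalisation and the variable names are illustrative assumptions.

```matlab
% Parallel coordinates for an N x d data matrix X (assumed input):
% normalise each dimension to [0, 1] and draw one polyline per point.
% Assumes no constant columns (hi - lo would be zero).
[N, d] = size(X);
lo = min(X); hi = max(X);
Xn = (X - repmat(lo, N, 1)) ./ repmat(hi - lo, N, 1);
plot(1:d, Xn', '-');                    % each column of Xn' is one data point
set(gca, 'XTick', 1:d); xlabel('dimension'); ylabel('normalised value');
```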

SLIDE 31

Data Visualisation: Latent Variable Models

Hierarchical GTM: Drilling Down

Bishop and Tipping introduced the idea of hierarchical visualisation for probabilistic PCA. We have developed a general framework for arbitrary latent variable models. Because GTM is a generative latent variable model, it is ‘straightforward’ to train hierarchical mixtures of GTMs. We model the whole data set with a GTM at the top level, which is broken down into clusters at deeper levels of the hierarchy. Because the data can be visualised at each level of the hierarchy, the selection of clusters, which are used to train GTMs at the next level down, can be carried out interactively by the user.

SLIDE 32

Data Visualisation: Latent Variable Models

Chemometric Application: HTS Data Exploration

Scientists at Pfizer searching for active compounds can now screen millions of compounds in a fortnight. Goals:

• Gain a better understanding of the results of multiple screens through the use of novel data visualisation and modelling techniques.
• Find clusters of similar compounds (measured in terms of biological activity) and use a representative subset to reduce the number of compounds in a screen.
• Build local prediction models.

SLIDE 33

Data Visualisation: Latent Variable Models

We have taken data from Jens Lösel (Pfizer) consisting of 6912 14-dimensional vectors representing chemical compounds using topological indices developed at Pfizer. The task is to predict logP. The plots segment the data (by responsibility), which can be used to build local predictive models that are often more accurate than global models. Only 14 inputs are needed, compared with c. 1000 for other methods of predicting logP, and the results are comparable with other algorithms for logP.

SLIDE 34

Data Visualisation: Latent Variable Models

SLIDE 35

Data Visualisation: Latent Variable Models

SLIDE 36

Data Visualisation: Latent Variable Models

Gaussian Process Latent Variable Model

SLIDE 37

Non-linear Modelling and Feature Selection

Non-linear Modelling and Feature Selection

Many chemometric problems can best be addressed using non-linear predictive models (e.g. QSAR). Models must be multivariate (there is no single 'silver bullet'), but there are hundreds (thousands, tens of thousands) of possible features (e.g. for small molecules, proteins, ...). Linear models have a constant sensitivity to input variables; non-linear models have a variable sensitivity, with niches of good performance and variable importance.

SLIDE 38

Non-linear Modelling and Feature Selection

GTM-FS

d1 and d2 have high saliency, d3 has low saliency

SLIDE 39

Non-linear Modelling and Feature Selection

Chemometric Data

[Figures: GTM visualisation and GTM-FS visualisation, with magnification factors shown on a log scale.]

SLIDE 40

Non-linear Modelling and Feature Selection

Feature Saliencies

• Both GTM models outperform the Kohonen SOM.
• GTM-FS performs better than GTM on magnification factors (71 to 126) and (subjectively) has more coherent clusters.
• GTM-FS performs worse than GTM on nearest-neighbour error (41% to 38%).

SLIDE 41

Block-structured Covariance

Block GTM

Include prior information about the correlations of variables in a GTM by using a full covariance matrix in the noise model and enforcing a block structure. This results in a reasonably sparse covariance matrix and keeps the number of unknown parameters low, while the additional flexibility allows the model to fit the data more closely. The extension of the learning algorithm is straightforward: the only changes occur in the computation of responsibilities in the E-step and of $\Sigma$ in the M-step.

$$\Sigma = \begin{pmatrix} \Sigma_1 & & & \\ & \Sigma_2 & & \\ & & \ddots & \\ & & & \Sigma_p \end{pmatrix}$$
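A small Matlab sketch of working with a block-diagonal covariance; blkdiag is a standard Matlab function, and the block sizes here are illustrative assumptions.

```matlab
% Build a block-diagonal covariance from per-block covariances and
% evaluate a Gaussian log-density block by block. Block sizes assumed.
S1 = eye(3); S2 = [1 0.8; 0.8 1];       % example blocks Sigma_1, Sigma_2
Sigma = blkdiag(S1, S2);                % 5 x 5 block-diagonal covariance
x = randn(5, 1); mu = zeros(5, 1);
% Exploit the structure: the log-density decomposes over the blocks.
blocks = {1:3, 4:5};
logp = 0;
for b = 1:numel(blocks)
    i = blocks{b}; Sb = Sigma(i, i); v = x(i) - mu(i);
    logp = logp - 0.5 * (length(i) * log(2*pi) + log(det(Sb)) + v' * (Sb \ v));
end
```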

SLIDE 42

Block-structured Covariance

Finding the Blocks: I

Find the block structure by visualising the correlation coefficients as a heat map. For this method to be successful, however, one needs to order the heat map so that highly correlated variables are close to each other (i.e. forming blocks). Generate a dendrogram using hierarchical clustering combined with heuristics to reorder the leaves to reflect their proximity: the tree is ordered in such a way that the distance between neighbouring leaves is minimised. We use a recursive algorithm, Optimal Leaf Ordering (OLO), available in the Matlab Bioinformatics Toolbox, which swaps sub-trees if this reduces the distances to neighbours.
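A sketch of this reordering pipeline in Matlab; linkage and optimalleaforder are toolbox functions, and using 1 - |correlation| as the dissimilarity is an assumption for illustration.

```matlab
% Reorder a correlation heat map so correlated variables form blocks.
% X is an N x d data matrix (assumed input).
C = corr(X);                            % d x d correlation matrix
D = 1 - abs(C);                        % dissimilarity between variables
tree = linkage(squareform(D), 'average');
order = optimalleaforder(tree, squareform(D));  % OLO reordering
imagesc(C(order, order)); colorbar;     % blocks now appear on the diagonal
```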

SLIDE 43

Block-structured Covariance

Finding the Blocks: II

Bayesian Correlation Estimation, based on the paper of Liechty et al. (2004). For the grouping, one is only interested in the off-diagonal elements of the empirical correlation matrix $C$. Assume that $C_{ij} \sim N(\mu, \sigma^2)$ with priors $\mu \sim N(0, \tau^2)$ and $\sigma^2 \sim IG(\alpha, \beta)$, with the hyperparameters known. Extend this to groups with means $\mu_{\theta_i, \theta_j}$, where the posterior $p(\theta_i \mid \cdots)$ defines the groups. The full posterior distribution of $\theta_i$, $\mu$ and $\sigma$ can be sampled using the Metropolis-Hastings algorithm, but this is very slow, so we created a simpler Quick BCE which just estimates $p(\theta_i = k)$.
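For intuition, a minimal Metropolis-Hastings sketch for the single-group model, sampling mu with a random-walk proposal. Here sigma2, tau2 and the proposal width step are fixed assumptions for brevity; the full sampler also updates sigma2 and the group labels.

```matlab
% Random-walk Metropolis-Hastings for mu in C_ij ~ N(mu, sigma2),
% prior mu ~ N(0, tau2). c is the vector of off-diagonal correlations.
% c, sigma2, tau2 are assumed inputs.
nsamp = 5000; step = 0.05;
mu = 0; samples = zeros(nsamp, 1);
logpost = @(m) -0.5 * sum((c - m).^2) / sigma2 - 0.5 * m^2 / tau2;
for t = 1:nsamp
    prop = mu + step * randn;           % symmetric random-walk proposal
    if log(rand) < logpost(prop) - logpost(mu)
        mu = prop;                      % accept
    end
    samples(t) = mu;
end
hist(samples, 50);                      % approximate posterior of mu
```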

SLIDE 44

Block-structured Covariance

Results on Toy Data

[Plot: nearest-neighbour label error with high (ST=20) and low (ST=2) structure for the GTM model with different covariance structures. PCA: blue dotted line with large dots; S-GTM: green constant line with crosses; B-GTM: red dashed line with diamonds; F-GTM: black dash-dotted line.]

SLIDE 45

Block-structured Covariance

Conclusions

• Visualisation is an important tool for all types of user; the domain expert must be involved in the process.
• Interaction with the plots allows the user to query the data more effectively.
• Presenting the data in the right way is key.
• Feature selection is a very important tool.
• Accounting for known structure (e.g. block covariance) improves results.

SLIDE 46

Block-structured Covariance

AgustaWestland

AW has pioneered CVM, the continuous recording of airframe vibration (0-200 Hz), to improve the investigation of unusual occurrences and monitor airframe integrity. The goals are to:

• Develop a probabilistic framework for inferring flight mode and key parameters from multiple streams of vibration data.
• Improve indicators of airframe condition, using the wavelet transform and kernel entropy to assess the dynamics (i.e. non-stationary characteristics) of the vibration signal.
• Provide integrated diagnosis based on probabilistic models of normality, using a belief network to model prior knowledge about the domain and interactions between key variables.

SLIDE 47

Block-structured Covariance

Understanding the Data

8 sensors measuring vibration; 108 frequency bands per sensor.

SLIDE 48

Block-structured Covariance