Machine Learning – Anders Holst, SICS – PowerPoint Presentation


SLIDE 1

Machine Learning

Anders Holst SICS

SLIDE 2

Big Data Analytics

Big Data Big Value

Analysis

SLIDE 3

Big Data Analytics

Big Data Big Value

Analysis

[Diagram of the analysis loop: Real world – Question – Data – Model – Conclusion]

SLIDE 4

Machine Learning

Use real data to train a model, which can then be used to solve various tasks.

SLIDE 5

Machine Learning

Use real data to train a model, which can then be used to solve various tasks. Tasks:

  • Classification
  • Clustering
  • Prediction
  • Anomaly detection
SLIDE 6

Machine Learning

Use real data to train a model, which can then be used to solve various tasks. Tasks:

  • Classification
  • Clustering
  • Prediction
  • Anomaly detection

Applications:

  • Medical diagnosis
  • Computer vision
  • Speech recognition
  • Fraud detection
  • Recommender systems
  • Sales prediction
SLIDE 7

Machine Learning

[Figure: input features X1, X2 feeding an unknown model (“?”) that produces an output value]

SLIDE 8

Machine Learning

Data types:

  • Binary or discrete
  • Continuous values
  • Time series
  • Natural language text
  • Images
  • Sound


SLIDE 9

Machine Learning Methods

  • Case-based methods
    Table lookup, Nearest neighbour, k-Nearest neighbour
  • Logical Inference
    Inductive logic, Decision trees, Rule-based systems
  • Artificial Neural Networks
    Multilayer perceptrons, Self-Organizing Maps, Boltzmann machines, Deep neural networks
  • Statistical methods
    Naive Bayes, Mixture models, Hidden Markov models, Bayesian networks, MCMC, Kernel density estimators, Particle filters
  • Heuristic search
    Genetic algorithms, Reinforcement learning, Simulated annealing, Minimum Description Length

SLIDE 10

Case-based methods

  • ”Similar patterns belong to the same class”
  • Easy to train (just save every pattern), but recall takes longer, since the similar patterns must be found
  • Model size increases with the number of seen examples
  • Requires specification of a distance measure
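The case-based recipe above fits in a few lines. Below is a minimal k-nearest-neighbour sketch; the toy training patterns and the choice k=3 are illustrative assumptions, not from the slides:

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among the k nearest
    training patterns, using Euclidean distance.
    `train` is a list of (feature_vector, class_label) pairs."""
    # "Easy to train": training is just storing the patterns.
    # The work happens at recall time, in this distance scan.
    dists = sorted((math.dist(x, query), label) for x, label in train)
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((4.0, 4.2), "B"), ((3.8, 4.0), "B"), ((4.1, 3.9), "B")]
print(knn_classify(train, (1.1, 0.9), k=3))  # A
print(knn_classify(train, (4.0, 4.0), k=3))  # B
```

Note how the stored training set *is* the model, so its size grows with every example seen, exactly as the slide warns.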

SLIDE 11

Logical Inference

  • Construct logical expressions that characterize the classes
  • Typically considers one feature at a time – axis-parallel decision regions
  • A decision tree can be constructed using e.g. information theory

[Figure: decision tree splitting on X1 > 3.5 and X2 > 1.8]
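The information-theoretic split selection can be sketched as follows. The toy data set is an illustrative assumption; the threshold 3.5 echoes the slide's X1 > 3.5 split:

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels, the impurity
    measure a decision-tree learner tries to reduce."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def split_gain(data, feature, threshold):
    """Information gain of the axis-parallel split feature > threshold.
    `data` is a list of (feature_vector, label) pairs."""
    left = [lab for x, lab in data if x[feature] > threshold]
    right = [lab for x, lab in data if x[feature] <= threshold]
    labels = left + right
    n = len(labels)
    return entropy(labels) - (len(left) / n) * entropy(left) \
                           - (len(right) / n) * entropy(right)

data = [((4.0, 1.0), "A"), ((4.2, 2.5), "A"),
        ((1.0, 2.0), "B"), ((2.0, 0.5), "B")]
print(split_gain(data, feature=0, threshold=3.5))  # 1.0: a perfect split
```

A tree learner simply picks the feature/threshold pair with the highest gain at each node, then recurses on the two halves.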

SLIDE 12

Artificial Neural Networks

  • Inspired by the neural structure of the brain
  • Neural units connected by weights; the weights are adjusted to produce the best mapping
  • ”Deep” architectures have gained popularity – they require much data to train

[Figure: layered network with weight matrices Wij and Wjk]
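A minimal forward pass through such a network, with W1 and W2 playing the roles of the slide's Wij and Wjk. The layer sizes and random weights are illustrative assumptions, and the training step (adjusting the weights) is omitted:

```python
import math
import random

def forward(x, W1, W2):
    """One forward pass through a tiny multilayer perceptron:
    hidden = sigmoid(W1 x), output = sigmoid(W2 hidden)."""
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in W2]

random.seed(0)
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(3)]  # 2 inputs -> 3 hidden units
W2 = [[random.uniform(-1, 1) for _ in range(3)]]                    # 3 hidden -> 1 output
y = forward([0.5, -1.0], W1, W2)
print(y)  # a single value in (0, 1); training would adjust W1 and W2
```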

SLIDE 13

Statistical methods

  • Large number of methods, from simple to complex
  • The common idea is to calculate the probability of each class given a feature vector, P(c|x)
  • Parametric versus nonparametric methods – depending on whether the forms of the class distributions are known or not
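As a concrete instance of computing P(c|x), here is a sketch of a discrete Naive Bayes classifier, which scores each class by P(c) times the product of P(x_i|c). The weather-style toy data and the add-one smoothing are illustrative assumptions:

```python
from collections import Counter, defaultdict

def train_nb(data):
    """Count-based estimates of P(c) and P(feature_i = v | c)
    from (feature_tuple, label) pairs with discrete features."""
    class_counts = Counter(label for _, label in data)
    feat_counts = defaultdict(Counter)  # (feature index, label) -> value counts
    for x, label in data:
        for i, v in enumerate(x):
            feat_counts[(i, label)][v] += 1
    return class_counts, feat_counts

def classify_nb(x, class_counts, feat_counts, alpha=1.0):
    """Return argmax_c P(c) * prod_i P(x_i | c), the quantity that
    P(c|x) is proportional to, with add-one (Laplace) smoothing."""
    n = sum(class_counts.values())
    best, best_score = None, -1.0
    for c, cc in class_counts.items():
        score = cc / n  # prior P(c)
        for i, v in enumerate(x):
            counts = feat_counts[(i, c)]
            score *= (counts[v] + alpha) / (cc + alpha * len(counts))
        if score > best_score:
            best, best_score = c, score
    return best

data = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
        (("rainy", "mild"), "yes"), (("rainy", "cool"), "yes")]
model = train_nb(data)
print(classify_nb(("rainy", "mild"), *model))  # yes
```

Naive Bayes is about the simplest of the methods listed on the previous slide, yet often a strong baseline when the features are roughly independent given the class.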

SLIDE 14

[Diagram: the four method families – Neural Networks, Logical Inference, Case-based methods, Statistical Methods]

SLIDE 15

Representation

[Diagram: the four method families surrounded by the step above]

SLIDE 16

Representation

  • The exact choice of method is often not critical, but the choice of representation of features is:
    – With the wrong representation no method will succeed
    – Once you have found a good representation, almost any method will do
  • Once preprocessing has turned data into something reasonable, a simple model may be sufficient
    – With a limited amount of independent data, the number of parameters must be kept low, so keep it as simple as possible
  • Finding a suitable representation requires much domain knowledge and problem understanding
    – No black box solution in general
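A small illustration of why representation matters: a cyclic feature such as hour of day (a hypothetical example, not from the slides) can be encoded as a point on the unit circle, so that 23:00 and 01:00 end up close together. No downstream method can recover this closeness from the raw 0–23 integer on its own:

```python
import math

def encode_hour(hour):
    """Represent hour-of-day as a point on the unit circle,
    making the feature's cyclic structure explicit."""
    angle = 2 * math.pi * hour / 24
    return (math.sin(angle), math.cos(angle))

# Encoded: 23:00 and 01:00 are near neighbours, as they should be.
print(math.dist(encode_hour(23), encode_hour(1)))  # about 0.52
# Raw integers: the same two hours look maximally far apart.
print(abs(23 - 1))  # 22
```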

SLIDE 17

Neural Network book, 1969

SLIDE 18

Data cleaning, Representation

[Diagram: the four method families surrounded by the steps above]

SLIDE 19

Data cleaning

Real data is not clean:

  • Missing data
  • Out of sync fields
  • Misspellings
  • Special values (temperature -9999)
  • Spikes (10e+14)
  • Dirty or drifting sensors (0.3 – 100.3 %)
  • Data from different sources (old / new), with slightly different meaning

  • Inconsistent data
  • Irrelevant data
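A minimal sketch of one cleaning step, mapping special codes (such as the slide's -9999 temperature) and spikes (such as 10e+14) to an explicit missing marker so later stages cannot mistake them for real measurements. The exact set of codes and the spike threshold are assumptions for illustration:

```python
def clean(rows, missing_codes=(999.0, -9999.0), spike=1e6):
    """Replace known special values and absurd spikes with None.
    `rows` is a list of lists of numeric sensor readings."""
    cleaned = []
    for row in rows:
        cleaned.append([None if (v in missing_codes or abs(v) > spike)
                        else v for v in row])
    return cleaned

raw = [[10.47, 5.2], [999.0, 22.8], [1.0e14, 0.6]]
print(clean(raw))  # [[10.47, 5.2], [None, 22.8], [None, 0.6]]
```

In practice each of the issues listed above (missing fields, drifting sensors, inconsistent sources) needs its own rule; the point is to make the dirt explicit rather than let it flow silently into the model.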
SLIDE 20

Data cleaning

Example of real, dirty data:

Attr 1     Attr 2              Attr 3   Attr 4   Attr 5
12.2827    2002080612220500    10.47    5.2      Cool. on
12.2826    2002080612220622    15.39    4.7      Switch
12.2825    2002080612220743    12.66    5.9      hasp temp 680
12.2824    2002080612220886    999.0    22.8     Hasp-temp
1.22823    2002080612221012    999.0             Overflow cool
12.2819    2002080612221136    999.0             Overflow Cooling
12.2815    1858111700000000    13.49             Error cooling on
122821     1858111700000000    25.85             Error sw.
12.2823    2002080612221631    22.98    0.6      not in phase
...        ...                 ...      ...      ...

SLIDE 21

Data cleaning, Representation, Validation

[Diagram: the four method families surrounded by the steps above]

SLIDE 22

Validation

  • “Validation” is used to estimate the performance on new data, i.e. how the model would perform when actually used
  • To get good generalization you must avoid overtraining the machine learning model
  • There are unimaginably many ways that make the result look better in the laboratory than in real life
  • However hard you try to avoid it, you will always get too optimistic validation results!
SLIDE 23

Validation

Some ways to guarantee overtraining:

  • Too few data samples
  • Too complicated model
  • Too similar training, test and validation samples
  • Fine-tuning your parameters
  • Evaluating several models with the same validation set
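One basic safeguard against several of these pitfalls is a strict three-way split, where the test set is evaluated against exactly once at the very end: evaluating several models against it turns it into a second validation set and inflates the estimate. The 60/20/20 proportions below are an illustrative assumption:

```python
import random

def split(data, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle once, then cut into train / validation / test.
    Models are tuned on the validation set; the test set is
    reserved for a single final performance estimate."""
    rng = random.Random(seed)
    data = data[:]  # do not mutate the caller's list
    rng.shuffle(data)
    n = len(data)
    a = round(n * train_frac)
    b = round(n * (train_frac + val_frac))
    return data[:a], data[a:b], data[b:]

train, val, test = split(list(range(100)))
print(len(train), len(val), len(test))  # 60 20 20
```

Note that "too similar training, test and validation samples" is not fixed by shuffling alone: correlated data (e.g. consecutive time-series readings) needs to be split along its correlation structure, not randomly.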

SLIDE 24

Data cleaning, Representation, Validation, Deployment

[Diagram: the four method families surrounded by the steps above]

SLIDE 25

Deployment

  • The method is on its own
  • Keep it simple and robust
  • Must the network be regularly retrained? Can the “ground truth” be trusted? Can stability and performance be guaranteed?
  • Did your pre-study test the right thing? Note the distinction between prediction and control, and between prediction and causation
  • Be prepared to go over the whole process again
SLIDE 26

Data cleaning, Representation, Validation, Deployment

[Diagram: the four method families surrounded by the full process]

SLIDE 27

Conclusions

  • Thoroughly understand the problem you are working on, and try to understand the process that generated the data
  • Select a suitable representation of the relevant features
  • Take extreme care with validation, and test the application on as much real-world data as you can
  • Keep it as simple as possible (but still powerful enough to solve the problem at hand)