SLIDE 1

Supervised Learning

Part 1 — Theory

Sven Krippendorf
 Workshop on Big Data in String Theory Boston, 01.12.2017

SLIDE 2

Content

  • Theory
  • Applications: Mathematica
  • Discussion
SLIDE 3

Def: Supervised Learning

Supervised learning is the machine learning task of inferring a function from labelled training data. Workflow (a code sketch follows the list):

  1. Determine training examples
  2. Prepare the training set
  3. Decide how to represent the input objects
  4. Decide how to represent the output objects
  5. Choose your algorithm
  6. Run the algorithm; adjust/determine its parameters
  7. Evaluate the accuracy
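
A minimal sketch of this workflow in Mathematica; the data, labels, and test points here are invented for illustration:

```mathematica
(* Steps 1-4: hypothetical labelled examples; inputs are points, outputs are class labels *)
trainingData = {{1., 2.} -> "above", {2., 1.} -> "below",
   {3., 7.} -> "above", {6., 2.} -> "below"};

(* Steps 5-6: Classify selects and tunes an algorithm automatically *)
classifier = Classify[trainingData];

(* Step 7: evaluate accuracy on held-out test examples *)
testData = {{2., 5.} -> "above", {5., 1.} -> "below"};
ClassifierMeasurements[classifier, testData, "Accuracy"]
```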
SLIDE 4

Learning algorithms

  • Support Vector Machines
  • Naive Bayes
  • Linear discriminant analysis
  • Decision trees
  • k-nearest neighbour algorithm
  • Neural networks
SLIDE 5

Known issues

  • Bias-variance tradeoff
  • Function complexity and amount of training data
  • Dimensionality of input space
  • Noise in output values
  • Heterogeneous data
SLIDE 6

Examples

  • Geometric classification
  • Handwritten number recognition (the "harmonic oscillator" of ML)
  • Voice recognition (spectral features)
SLIDE 7

A first problem

[Plot: data points in the (x, y) plane, axes 0 to 10, split by a straight line]

Classify the data into two classes. Class 1: above the line; Class 2: below the line. Input: data points.

SLIDE 8

SVM

[Plot: the same data with several candidate separating lines]

Which line?

SLIDE 9

SVM

  • An SVM (support vector machine) identifies the line that maximally separates the two data sets.
  • This makes the classification stable against "perturbations" of the data.

[Plot: the two classes with the maximum-margin line and its parallel margin boundaries]

The margin boundaries satisfy w·x_i − b ≥ 1 for one class and w·x_i − b ≤ −1 for the other; the distance between them is 2/|w|.

SLIDE 10

SVM

  • How is this line determined? By constrained minimisation: passing to the dual problem via Lagrange multipliers turns it into a problem for quadratic-programming algorithms.
  • These are readily implemented in standard environments (Mathematica, Matlab, Python, etc.); see the sketch below.
  • In higher dimensions the separating line becomes a plane or hyperplane.
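
For instance, an SVM can be requested explicitly through Classify; a minimal sketch with invented data:

```mathematica
(* Hypothetical data: points labelled by whether they lie above the line y = x *)
pts = RandomReal[{0, 10}, {200, 2}];
labels = If[#[[2]] > #[[1]], "above", "below"] & /@ pts;

(* Method -> "SupportVectorMachine" selects the SVM algorithm *)
svm = Classify[Thread[pts -> labels], Method -> "SupportVectorMachine"];
svm[{3., 8.}]  (* expected: "above" *)
```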
SLIDE 11

SVM: hard margin vs soft margin

  • A soft margin permits outliers at a penalty; this can be a better fit to noisy data.

[Plot: soft-margin SVM with a few outliers lying inside the margin]

SLIDE 12

A second problem

[Plots: two classes of points in the (x, y) plane that no straight line can separate]

SLIDE 13

SVM: Kernel trick

Different representation of data via kernel map:

{x, y} → {x², y²}   or   {x, y} → {x², y}

[Plots: the data before and after the kernel map; after the map the two classes are linearly separable]
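
A sketch of the same idea with hypothetical radial data: applying the map by hand before training turns the circular boundary into a straight line.

```mathematica
(* Hypothetical data: the class depends only on the distance from the origin *)
pts = RandomReal[{-3, 3}, {300, 2}];
labels = If[Norm[#] < 2, "inside", "outside"] & /@ pts;

(* Kernel map {x, y} -> {x^2, y^2}: the circle x^2 + y^2 = 4 becomes a line *)
mapped = {#[[1]]^2, #[[2]]^2} & /@ pts;
svm = Classify[Thread[mapped -> labels], Method -> "SupportVectorMachine"];

(* New points must pass through the same map before classification *)
svm[{1.^2, 0.5^2}]  (* expected: "inside" *)
```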
SLIDE 14

Linear discriminant analysis

  • Use information about the mean and covariance of the data set to distinguish classes.
  • Set a threshold on the discriminant to identify the class:

[Plots: two Gaussian point clouds, and a histogram of discriminant values with the threshold T marked]

(x − μ₀)ᵀ Σ₀⁻¹ (x − μ₀) + log|Σ₀| − (x − μ₁)ᵀ Σ₁⁻¹ (x − μ₁) − log|Σ₁| < threshold
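
The discriminant can be coded directly; a minimal sketch with two invented Gaussian classes and threshold 0:

```mathematica
(* Two hypothetical Gaussian classes *)
class0 = RandomVariate[MultinormalDistribution[{2, 2}, IdentityMatrix[2]], 200];
class1 = RandomVariate[MultinormalDistribution[{5, 5}, IdentityMatrix[2]], 200];

(* Estimate the mean and covariance of each class *)
{m0, m1} = Mean /@ {class0, class1};
{s0, s1} = Covariance /@ {class0, class1};

(* The discriminant above; a value below the threshold means class 0 *)
disc[x_] := (x - m0).Inverse[s0].(x - m0) + Log[Det[s0]] -
    (x - m1).Inverse[s1].(x - m1) - Log[Det[s1]];

disc[{2.5, 2.5}] < 0  (* expected: True, i.e. class 0 *)
```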

SLIDE 15

k-nearest neighbour

  • Classify a data point according to the classes of its k nearest neighbours.
  • Choosing k is a trade-off: small k gives clear boundaries but more noise; large k gives less noise but less clear boundaries.

[Plots: k-nearest-neighbour decision regions for small and large k]
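
In Classify the neighbour count can be set by hand; a sketch with invented noisy data ("NeighborsNumber" is, as far as I know, the relevant suboption name):

```mathematica
(* Hypothetical noisy two-class data *)
pts = RandomReal[{0, 10}, {200, 2}];
labels = If[#[[2]] - #[[1]] + RandomReal[{-1, 1}] > 0, "A", "B"] & /@ pts;

(* Small k: sharp but noisy boundary; large k: smooth boundary *)
knn3 = Classify[Thread[pts -> labels],
   Method -> {"NearestNeighbors", "NeighborsNumber" -> 3}];
knn15 = Classify[Thread[pts -> labels],
   Method -> {"NearestNeighbors", "NeighborsNumber" -> 15}];

{knn3[{5., 5.}], knn15[{5., 5.}]}
```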

SLIDE 16

Neural Network

  • d (data), w (weights), b (biases). Linear layer: w_ij d_j + b_i
  • Softmax layer: softmax(d_i) = e^{d_i} / Σ_j e^{d_j}
  • Loss function: captures how far the network's output is from the desired (true) output.

[Diagram: input layer, hidden layers, and output layer, each connection carrying d, w, b]
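
These building blocks exist directly in Mathematica's neural-network framework (11.1+); a minimal sketch with invented data:

```mathematica
(* Linear layer (w.d + b) followed by a softmax over two classes *)
net = NetChain[{LinearLayer[2], SoftmaxLayer[]}, "Input" -> 2];

(* Hypothetical training data: 2d inputs mapped to class indices 1 or 2 *)
data = {{1., 2.} -> 1, {2., 1.} -> 2, {0.5, 3.} -> 1, {4., 0.5} -> 2};

(* Cross-entropy loss captures how far the output is from the desired one *)
trained = NetTrain[net, data, LossFunction -> CrossEntropyLossLayer["Index"]];
trained[{1., 3.}]  (* probability vector over the two classes *)
```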

SLIDE 17

Hand-written number recognition

  • Input: a 28×28 matrix with entries {0, 1}
  • A simple network taking every entry as an input and using one layer (w.x + b) achieves a success rate of 89%.
  • More sophisticated networks achieve far higher accuracy.

[Figure: the 28×28 binary pixel matrix of a handwritten digit]
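
A sketch of this single-layer model, assuming the MNIST set available through ExampleData:

```mathematica
(* MNIST: lists of rules, image -> digit label *)
trainingData = ExampleData[{"MachineLearning", "MNIST"}, "TrainingData"];
testData = ExampleData[{"MachineLearning", "MNIST"}, "TestData"];

(* One linear layer (w.x + b) plus a softmax over the 10 digits *)
net = NetChain[{FlattenLayer[], LinearLayer[10], SoftmaxLayer[]},
   "Input" -> NetEncoder[{"Image", {28, 28}, ColorSpace -> "Grayscale"}],
   "Output" -> NetDecoder[{"Class", Range[0, 9]}]];

trained = NetTrain[net, trainingData];

(* Fraction of correct predictions on the test set; roughly 90% for this model *)
predictions = trained[Keys[testData]];
N[Mean[Boole[MapThread[Equal, {predictions, Values[testData]}]]]]
```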

SLIDE 18

Voice recognition

  • A voice signal:

[Plot: waveform of the voice signal, amplitude versus sample number]

  • Representing it via a wavelet transform (spectrogram):

[Plot: spectrogram of the signal]
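
In Mathematica this preprocessing is essentially a one-liner (the file name is hypothetical):

```mathematica
(* Hypothetical recording; Import returns an Audio object *)
audio = Import["voice.wav"];

(* Time-frequency representation; the image can then be fed to a classifier *)
Spectrogram[audio]
```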

SLIDE 19

String theory example: Dimers

[Figure: dimer diagram for dP1 with labelled nodes]

W_dP1 = X₂₃ Y₃₁ Z₁₂ − X₁₂ Y₃₁ Z₂₃ + X₃₆ Y₆₂ Z₂₃ − X₂₃ Y₆₂ Z₃₆ − X₃₆ Y₂₃ Z₁₂ Φ₆₁ + X₁₂ Y₂₃ Z₃₆ Φ₆₁

  • Bounds on the number of families (arXiv:1002.1790)
SLIDE 20

End of Part 1

SLIDE 21

Supervised Learning

Part 2 — Applications

Sven Krippendorf
 Workshop on Big Data in String Theory Boston, 01.12.2017

SLIDE 22

Disclaimer

  • There are many tools you can use…
  • I will just talk about one at a very basic level: Mathematica

… simply because it's quick for me and I assume people are familiar with it.

SLIDE 23

Mathematica

SLIDE 24

Mathematica

  • You need version 11.1.1 or later…
SLIDE 25

Example 1: basic SVM

  • Let’s switch to notebook01.nb
SLIDE 26

Example 2: kernel trick

  • Let’s look at notebook02.nb
SLIDE 27

Example 3: kernel trick

  • Let’s look at notebook03.nb
SLIDE 28

Thank you.

Let’s discuss applications…

SLIDE 29

Supervised Learning

Part 3 — Discussion

Sven Krippendorf
 Workshop on Big Data in String Theory Boston, 01.12.2017