SLIDE 1

Software development in AppStat

AppStat: Applied Statistics and Machine Learning
AppStat: Apprentissage Automatique et Statistique Appliquée

Balázs Kégl

Linear Accelerator Laboratory, CNRS / University of Paris Sud
Service Informatique, Nov 30, 2010


SLIDE 2

Overview

  • Introduction
  • me
  • the team
  • collaborations
  • Scientific projects → software
  • discriminative learning → boosting → multiboost.org
  • inference, Monte-Carlo integration → adaptive MCMC → integration into ROOT (save it for next time)

SLIDE 3

Scientific path

Hungary: 1989–94 M.Eng. Computer Science, BUTE; 1994–95 research assistant, BUTE
Canada: 1995–99 Ph.D. Computer Science, Concordia U; 2000 postdoc, Queen’s U; 2001–06 assistant professor, U of Montreal
France: 2006– research scientist (CR1), CNRS / U Paris Sud

  • Research interests: machine learning, pattern recognition, signal processing, applied statistics

  • Applications: image and music processing, bioinformatics, software engineering, grid control, experimental physics

SLIDE 4

The team

  • B. Kégl (team leader; 2006–): boosting, MCMC, Auger
  • R. Busa-Fekete (postdoc; 2008–): boosting, optimization, SysBio
  • R. Bardenet (Ph.D. student; 2009–): MCMC, optimization, Auger
  • D. Benbouzid (Ph.D. student; 2010–): boosting, JEM EUSO
  • F-D. Collin (software engineer; 01/12/2010): multiboost.org, MCMC in ROOT, system integration
  • D. Garcia (postdoc; 01/01/2011): generative models, Auger / JEM EUSO, tutoring
SLIDE 5

Collaborations

[Diagram: collaboration map. Experimental science at LAL: Auger, JEM EUSO, ILC, LSST, etc. Computer science: LTCI (Telecom ParisTech), TAO, LRI, ESBG, Hungarian Academy. Existing and future links labeled: boosting, optimization, MCMC, drug cocktail optimization, trigger, reconstruction, hypothesis test.]

SLIDE 6

Funding

  • ANR “jeune chercheur” MetaModel: 2007–2010, 150 k€
  • ANR COSINUS Siminole: 2010–2014, 1043 k€ (658 k€ at LAL)
  • MRM Grille Paris Sud: 2010–2012, 60 k€ (31 k€ at LAL)
SLIDE 7

Siminole within ANR COSINUS

  • COSINUS = Conception and Simulation
  • Theme 1: simulation and supercomputing
  • Theme 2: conception and optimization
  • Theme 3: large-scale data storage and processing
  • Siminole
  • principal theme: Theme 2
  • secondary theme: Theme 1
SLIDE 8

Siminole within ANR COSINUS

  • Simulation: third pillar of scientific discovery
  • Improving simulation
  • algorithmic development inside the simulator
  • implementation on high-end computing devices
  • our approach: control the number of calls to the simulator
SLIDE 9

Siminole within ANR COSINUS

  • Optimization: simulate from f(x), find max_x f(x)
  • Inference: simulate from p(x|θ), find p(θ|x)
  • Discriminative learning: simulate from p(x,θ), find θ = f(x)
SLIDE 10

Discriminative learning → boosting → multiboost.org

  • Discriminative learning (classification)
  • infer f : R^d → {1,...,K} from a database D = {(x1,y1),...,(xn,yn)}
  • boosting, AdaBoost: one of the state-of-the-art classification algorithms
  • multiboost.org: our implementation
SLIDE 11

Machine learning at the crossroads

[Diagram: machine learning at the crossroads of neuroscience, signal processing, statistics, artificial intelligence, optimization, probability theory, information theory, and cognitive science]

SLIDE 12

Machine Learning

  • From a statistical point of view
  • non-parametric fitting, capacity/complexity control
  • large dimensionality
  • large data sets, computational issues
  • mostly classification (categorization, discrimination)
SLIDE 13

Discriminative learning

  • observation vector: x ∈ R^d
  • class label: y ∈ {−1,1} – binary classification
  • class label: y ∈ {1,...,K} – multi-class classification
  • classifier: g : R^d → {−1,1}
  • discriminant function: f : R^d → [−1,1]

    g(x) = +1 if f(x) ≥ 0, −1 if f(x) < 0

SLIDE 14

Discriminative learning

  • Inductive learning
  • training sample: Dn = {(x1,y1),...,(xn,yn)}
  • function set: F
  • learning algorithm: ALGO : (R^d × {−1,1})^n → F, ALGO(Dn) → f
  • goal: small generalization error P{f(X) ≠ Y}
SLIDE 15

[Figure: data for a two-class classification problem, plotted in the (x1, x2) plane]

SLIDE 16

[Figure: 2D Gaussian fit for class 1]

SLIDE 17

[Figure: 2D Gaussian fit for class 2]

SLIDE 18

Classification

  • Terminology
  • Conditional densities: p(x|Y = 1), p(x|Y = −1)
  • Prior probabilities: p(Y = 1), p(Y = −1)
  • Posterior probabilities: p(Y = 1|x), p(Y = −1|x)
  • Bayes theorem:

    p(Y = 1|x) = p(x|Y = 1)·p(Y = 1) / p(x) ∝ p(x|Y = 1)·p(Y = 1)

  • Decision:

    g(x) = 1 if p(x|Y = 1)·p(Y = 1) / (p(x|Y = −1)·p(Y = −1)) > 1, −1 otherwise
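
To make the plug-in rule concrete, here is a minimal numpy sketch of the classifier behind the Gaussian-fit figures: fit one Gaussian per class, estimate the priors from class counts, and threshold the log posterior ratio. It is my illustration under those assumptions, not the slides' actual code; the arrays X_pos and X_neg holding the two classes' training points are hypothetical.

```python
import numpy as np

def fit_gaussian(X):
    """Maximum-likelihood mean and covariance for one class."""
    return X.mean(axis=0), np.cov(X, rowvar=False)

def log_density(x, mu, cov):
    """Log of the multivariate Gaussian density at x."""
    diff = x - mu
    return -0.5 * (diff @ np.linalg.solve(cov, diff)
                   + np.log(np.linalg.det(cov))
                   + len(mu) * np.log(2 * np.pi))

def bayes_classify(x, X_pos, X_neg):
    """Plug-in Bayes rule: g(x) = 1 iff p(x|Y=1)p(Y=1) > p(x|Y=-1)p(Y=-1)."""
    mu1, cov1 = fit_gaussian(X_pos)
    mu0, cov0 = fit_gaussian(X_neg)
    n1, n0 = len(X_pos), len(X_neg)
    log_ratio = (log_density(x, mu1, cov1) + np.log(n1 / (n1 + n0))
                 - log_density(x, mu0, cov0) - np.log(n0 / (n1 + n0)))
    return 1 if log_ratio > 0 else -1
```
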
SLIDE 19

[Figure: discriminant function with Gaussian fits]

SLIDE 20

[Figure: “two moons” data for a two-class classification problem]

SLIDE 21

[Figure: 2D Gaussian fit for class 1]

SLIDE 22

[Figure: 2D Gaussian fit for class 2]

SLIDE 23

[Figure: discriminant function with Gaussian fits]

SLIDE 24

[Figure: 2D Parzen fit for class 1, h = 0.12]

SLIDE 25

[Figure: 2D Parzen fit for class 2, h = 0.12]

SLIDE 26

[Figure: discriminant function with Parzen fits, h = 0.12]
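
The Parzen fits in these figures are kernel density estimates, one per class; here is a hedged numpy sketch with a Gaussian kernel of bandwidth h (my illustration; X_pos and X_neg are hypothetical training arrays):

```python
import numpy as np

def parzen_density(x, X, h):
    """Parzen window estimate at x: average of Gaussian kernels
    of bandwidth h centered on the training points in X."""
    d = X.shape[1]
    sq_dists = np.sum((X - x) ** 2, axis=1)
    kernels = np.exp(-sq_dists / (2 * h ** 2)) / (2 * np.pi * h ** 2) ** (d / 2)
    return kernels.mean()

def parzen_discriminant(x, X_pos, X_neg, h):
    """Positive where the class-1 density (weighted by its prior) wins."""
    n1, n0 = len(X_pos), len(X_neg)
    return (parzen_density(x, X_pos, h) * n1
            - parzen_density(x, X_neg, h) * n0) / (n1 + n0)
```

The slides that follow vary only h: a small bandwidth overfits the training points, a large one washes the classes together, which is exactly the trade-off plotted on slide 33.
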

SLIDE 27

[Figure: 2D Parzen fit for class 1, h = 0.02]

SLIDE 28

[Figure: 2D Parzen fit for class 2, h = 0.02]

SLIDE 29

[Figure: discriminant function with Parzen fits, h = 0.02]

SLIDE 30

[Figure: 2D Parzen fit for class 1, h = 3]

SLIDE 31

[Figure: 2D Parzen fit for class 2, h = 3]

SLIDE 32

[Figure: discriminant function with Parzen fits, h = 3]

SLIDE 33

[Figure: training and test error rates for Parzen fits with different bandwidths h]

SLIDE 34

Non-parametric fitting

  • Capacity control, regularization
  • trade-off between approximation error and estimation error
  • complexity grows with data size
  • no need to correctly guess the function class
SLIDE 35

Curse of dimensionality

  • Capacity/complexity control becomes a real issue in high-dimensional spaces

  • in a 10000-dimensional space a linear function has 10000 parameters!
  • Examples
  • images
  • music
  • language, text
  • bioinfo (genetics, proteomics)
SLIDE 36

Machine learning problems

  • Common goal: predict the future
  • make inferences on unknown future observations
  • Unsupervised learning
  • density estimation p(obs)
  • clustering, dimensionality reduction, one-class learning
  • Supervised learning
  • Classification : f(obs) → category
  • Regression : f(obs) → response
SLIDE 37

The supervised learning model

  • observation vector: x ∈ R^d
  • class label: y ∈ {−1,1} (or y ∈ {1,...,K})
  • classifier: g : R^d → {−1,1}
  • discriminant function: f : R^d → [−1,1]

    → classifier g(x) = +1 if f(x) ≥ 0, −1 if f(x) < 0

  • decision boundary: {x : f(x) = 0}
SLIDE 38

The supervised learning model

  • Learning by experience, with a supervisor
  • training set: Dn = {(x1,y1),...,(xn,yn)}
  • function class: F
  • learning algorithm: ALGO : (R^d × {−1,1})^n → F, ALGO(Dn) → f
  • goal: small generalization error R(g) = P{g(X) ≠ Y} = P{f(X)·Y ≤ 0}
  • learning principle: minimize the training error

    R̂(g) = (1/n) Σ_{i=1}^{n} I{g(xi) ≠ yi}

SLIDE 39

The supervised learning model

[Figure: discriminant function f ranging over [−1, 1], thresholded into the classifier g]

  • Margin: γ = y·f(x)
  • classification error ≡ negative margin
  • the magnitude of a positive margin quantifies the confidence
  • learning principle: minimize a smooth loss function of the margin:

    Rγ(f) = (1/n) Σ_{i=1}^{n} L(f(xi)·yi)
SLIDE 40

The supervised learning model

  • Margin loss functions

[Figure: margin loss functions plotted against the margin γ, labeled NN2, SVM2, NN1, SVM1, AdaBoost]
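
For reference, here is a sketch of standard margin losses and the empirical margin risk Rγ(f) of the previous slide. Matching the figure's labels to formulas (hinge for SVM1, squared losses for the “2” variants, exponential for AdaBoost) is my reading of the plot, not something stated on the slide.

```python
import numpy as np

# Margin losses L(gamma) with gamma = y * f(x); an error is gamma <= 0.
def squared_loss(gamma):       # (1 - gamma)^2, least-squares neural nets
    return (1 - gamma) ** 2

def hinge_loss(gamma):         # max(0, 1 - gamma), the SVM loss
    return np.maximum(0, 1 - gamma)

def exponential_loss(gamma):   # exp(-gamma), the loss AdaBoost minimizes
    return np.exp(-gamma)

def margin_risk(f, X, y, loss):
    """Empirical margin risk R_gamma(f) = (1/n) sum_i L(f(x_i) * y_i)."""
    margins = y * np.array([f(x) for x in X])
    return loss(margins).mean()
```
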

SLIDE 41

History

  • Algorithms
  • 1958: Perceptron [Rosenblatt, ’58] – [Minsky–Papert ’69]
  • 1986: Multilayer perceptrons (neural networks) and the back-propagation algorithm [Rumelhart–Hinton–Williams, ’86]
  • 1995: Support vector machines [Boser–Guyon–Vapnik, ’92], [Cortes–Vapnik, ’95]
  • 1997: boosting, AdaBoost [Freund, ’95], [Freund–Schapire, ’97]
SLIDE 42

The perceptron

  • Linear discriminant functions:

    f(x) = Σ_{i=0}^{d} w(i)·x(i) = ⟨w, x⟩

[Figure: the perceptron as a network: inputs x(0) = 1, x(1), ..., x(d); weights w(0), ..., w(d); a summation unit Σ producing f(x), thresholded into g(x)]

SLIDE 43

The perceptron

  • Linear discriminant functions:

    f(x) = Σ_{i=0}^{d} w(i)·x(i) = ⟨w, x⟩

  • Algorithm
  • simple iterative error correction
  • convergence if the data is linearly separable
  • oscillation for linearly non-separable data
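
A minimal sketch of this error-correcting rule (my illustration, using the slide's convention of a constant input x(0) = 1 carried by the first weight):

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Rosenblatt's rule: sweep the data, and on every mis-classified
    point (y * <w, x> <= 0) correct the weights by y * x."""
    Xb = np.hstack([np.ones((len(X), 1)), X])   # prepend x(0) = 1
    w = np.zeros(Xb.shape[1])
    for _ in range(max_epochs):
        corrections = 0
        for xi, yi in zip(Xb, y):
            if yi * (w @ xi) <= 0:
                w += yi * xi
                corrections += 1
        if corrections == 0:   # converged: the data was linearly separable
            return w
    return w                   # still oscillating: not linearly separable
```
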
SLIDE 44

Generalized linear discriminant functions

  • Model:

    f(x) = Σ_{j=1}^{N} α(j)·h(j)(x)

  • h(j) : R^d → [−1,1] – simple classifiers/discriminant functions, features, experts
  • α(j) ∈ R+ – weight of the expert h(j) in the final vote
SLIDE 45

Multilayer perceptron (neural net)

  • Model:

    f(x) = Σ_{j=1}^{N} α(j)·σ(⟨w_j, x⟩)

[Figure: a two-layer network: inputs x(0) = 1, x(1), ..., x(d); hidden units h(1)(x), ..., h(T)(x) with weight vectors w_1, ..., w_T; output f(x), the α-weighted sum of the hidden units]

SLIDE 46

Multilayer perceptron (neural net)

  • Model:

    f(x) = Σ_{j=1}^{N} α(j)·σ(⟨w_j, x⟩)

  • Algorithm:
  • gradient descent optimization
  • differentiable error functions → margin loss
  • differentiable activation function σ: the sigmoid
  • local minima, “engineering”, parameters to tune
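
As a sketch of what gradient descent looks like for this model, here is stochastic gradient descent on the squared margin loss (1 − y·f(x))² with a sigmoid σ. The architecture is the slide's f(x) = Σ_j α(j)·σ(⟨w_j, x⟩); the loss choice, learning rate, and initialization are my assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_mlp(X, y, n_hidden=10, epochs=200, lr=0.01, seed=0):
    """SGD on L = (1 - y * f(x))^2 for the one-hidden-layer model
    f(x) = sum_j alpha_j * sigmoid(<w_j, x>)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(n_hidden, X.shape[1]))  # hidden w_j
    alpha = rng.normal(scale=0.1, size=n_hidden)            # output weights
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            h = sigmoid(W @ xi)             # hidden outputs
            f = alpha @ h                   # discriminant value
            dL_df = -2 * yi * (1 - yi * f)  # derivative of the loss
            grad_alpha = dL_df * h
            grad_W = dL_df * np.outer(alpha * h * (1 - h), xi)  # back-prop
            alpha -= lr * grad_alpha
            W -= lr * grad_W
    return W, alpha
```

The slide's caveats apply verbatim: the loss surface has local minima, and lr, n_hidden, and the initialization scale all need tuning.
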
SLIDE 47

Support vector machine

  • Model:

    f(x) = Σ_{j∈Isv} α(j)·y_j·K(x_j, x)

  • Isv ⊂ {1,...,n} is the set of support vectors
  • K(·,·) is a similarity function (kernel)
  • goal: classification boundary equidistant from classes
  • “sophisticated nearest neighbor”
  • slow and complex quadratic programming optimization
  • turn-key algorithm, very limited parameter tuning
SLIDE 48

Support vector machine

  • Model:

    f(x) = Σ_{j∈Isv} α(j)·y_j·K(x_j, x)

  • Kernel:
  • K(x, x′) = ⟨x, x′⟩ → f(x) is linear
  • K(x, x′) = (1 + ⟨x, x′⟩)^d → f(x) is a polynomial of degree d
  • K(x, x′) = exp(−(1/h)·‖x − x′‖²) → f(x) is a Gaussian mixture (→ Parzen)
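
The three kernels transcribe directly into code; svm_discriminant below is a hypothetical helper that evaluates f(x) given already-trained support vectors and coefficients (the quadratic program that finds them is the hard part, and is not shown):

```python
import numpy as np

def linear_kernel(x, xp):
    return x @ xp                              # f(x) is linear

def polynomial_kernel(x, xp, degree):
    return (1 + x @ xp) ** degree              # degree-d polynomial

def gaussian_kernel(x, xp, h):
    return np.exp(-np.sum((x - xp) ** 2) / h)  # Gaussian mixture (Parzen)

def svm_discriminant(x, support_X, support_y, alphas, kernel):
    """f(x) = sum over support vectors of alpha_j * y_j * K(x_j, x)."""
    return sum(a * yj * kernel(xj, x)
               for a, yj, xj in zip(alphas, support_y, support_X))
```
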

SLIDE 49

AdaBoost

  • Model:

    f(x) = Σ_{j=1}^{N} α(j)·h(j)(x)

  • no restriction on the form of h(j)(x)
  • often “decision stumps”:

    h_{ℓ,θ}(x) = +1 if x(ℓ) ≥ θ, −1 otherwise,   where x = (x(1), ..., x(d))
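
A weighted decision-stump learner of this form fits in a few lines; this exhaustive-search sketch (my illustration, not the multiboost implementation) returns the stump, or its negation, with the largest weighted edge Σ_i w_i·h(x_i)·y_i, the quantity the AdaBoost pseudocode on the coming slides calls γ.

```python
import numpy as np

def best_stump(X, y, w):
    """Exhaustive search over features l and thresholds theta for the
    stump h(x) = sign * (+1 if x[l] >= theta else -1) that maximizes
    the weighted edge sum_i w_i * h(x_i) * y_i."""
    best_edge, best_params = -1.0, None
    for l in range(X.shape[1]):
        values = np.unique(X[:, l])
        # candidates: below all points, and midpoints between neighbors
        thresholds = np.concatenate([[values[0] - 1.0],
                                     (values[:-1] + values[1:]) / 2.0])
        for theta in thresholds:
            h = np.where(X[:, l] >= theta, 1, -1)
            edge = np.sum(w * h * y)
            if abs(edge) > best_edge:   # a negative edge: flip the stump
                best_edge = abs(edge)
                best_params = (l, theta, 1 if edge > 0 else -1)
    l, theta, sign = best_params
    return lambda x: sign * (1 if x[l] >= theta else -1)
```
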
SLIDE 50

AdaBoost

  • Intuitive elementary algorithm
  • add one expert at a time
  • add the best expert on training points mis-classified by previous experts

  • weight of the expert chosen proportionally to its correctness
SLIDE 51

AdaBoost

  • Weighting over the training points: w1, ..., wn
  • normalized: Σ_{i=1}^{n} wi = 1
  • initialized uniformly: w = (1/n, ..., 1/n)
  • if xi is mis-classified by h(j), increase wi; otherwise, decrease wi
  • “difficult” training points gradually get larger weights
SLIDE 52

AdaBoost [Freund – Schapire ’97]

ADABOOST(Dn = {(xi, yi)}_{i=1}^{n}, BASE(·,·), T)
 1  w(1) ← (1/n, ..., 1/n)                          ⊲ initial weights
 2  for t ← 1 to T
 3      h(t) ← BASE(Dn, w(t))                       ⊲ calling the base learner
 4      γ(t) ← Σ_{i=1}^{n} w(t)_i · h(t)(xi) · yi   ⊲ edge = 1 − 2 × error
 5      α(t) ← (1/2) · ln((1 + γ(t)) / (1 − γ(t)))  ⊲ coefficient of h(t)
 6      for i ← 1 to n                              ⊲ re-weighting the points
 7          if h(t)(xi) ≠ yi then
 8              w(t+1)_i ← w(t)_i / (1 − γ(t))
 9          else
10              w(t+1)_i ← w(t)_i / (1 + γ(t))
11  return f(T)(·) = Σ_{t=1}^{T} α(t) · h(t)(·)
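
A direct numpy transcription of the pseudocode, reusing the best_stump learner sketched on slide 49; the comments point back to the numbered lines. This is an illustration, not the multiboost.org code, and it assumes the base learner is imperfect (|γ(t)| < 1) so that the logarithm and the divisions are defined.

```python
import numpy as np

def adaboost(X, y, base, T):
    n = len(X)
    w = np.full(n, 1.0 / n)                 # line 1: uniform weights
    alphas, hs = [], []
    for _ in range(T):                      # line 2
        h = base(X, y, w)                   # line 3: call the base learner
        preds = np.array([h(x) for x in X])
        gamma = np.sum(w * preds * y)       # line 4: edge
        alpha = 0.5 * np.log((1 + gamma) / (1 - gamma))   # line 5
        # lines 6-10: mis-classified points are divided by (1 - gamma),
        # which increases their weight; the weights stay normalized
        w = np.where(preds != y, w / (1 - gamma), w / (1 + gamma))
        alphas.append(alpha)
        hs.append(h)
    return lambda x: sum(a * h(x) for a, h in zip(alphas, hs))  # line 11
```

Here f = adaboost(X, y, best_stump, T=40) returns the discriminant f(T); the classifier is its sign. The next slide shows how the induced boundary evolves as T grows.
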

SLIDE 53

AdaBoost

[Figure: the evolving AdaBoost discriminant after t = 1, 3, 10, and 40 iterations]

SLIDE 54

AdaBoost

  • Algorithm
  • extremely simple learning, limited parameter tuning
  • fast
  • intuitive interpretation: weighted vote of experts
  • the choice of the pool of experts captures the a priori knowledge
  • no restriction on the form of the experts
  • label noise can be a problem
SLIDE 55

multiboost.org

  • Multi-class multi-label boosting software
  • based on ADABOOST.MH [Schapire-Singer ’99]
  • started by Norman Casagrande (M.Sc. student in Montreal, now with last.fm)

  • multi-platform C++
  • command-line UI, easy-to-use for a non-expert
  • adapting to a new data type is easy for an advanced user
  • tons of features
  • scales nicely
SLIDE 56

multiboost.org

  • Plan
  • going beyond classification: regression, ranking, collaborative filtering, reinforcement learning
  • technical improvements: multicore, GPU, grid, memory handling, etc.
  • redesign: orthogonal features → templates
  • implement a software development cycle (tests, etc.); can be tricky to balance between research and production

SLIDE 57

multiboost.org

  • Plan for F-D.
  • get acquainted with Machine Learning through understanding the code (of course, we’ll be there)
  • first concrete task: port it to multi-core (we have a good understanding of how), then to GPU (we have only a vague understanding; more challenging for F-D.)

  • gradually implement a software development cycle
  • redesign