Practical problems in multi-way analysis



slide-1
SLIDE 1

Rasmus Bro

Practical problems in multi-way analysis

  • Constraints
  • Missing data
  • Jackknifing and split-half analysis

Some examples:

  • SLICING – recovering exponentials
  • Fluorescence EEM data
  • Chromatographic data
slide-2
SLIDE 2

rb@life.ku.dk

Constraints in PARAFAC

slide-3
SLIDE 3

Why use constraints

  • Sometimes the parameters of the model can have a very direct physical meaning (spectra, concentrations, elution profiles)
  • It is natural to require these parameters to “make sense”
  • This may be accomplished by fitting a model under additional constraints

[Figure: data and PCA loadings plotted against wavelength, 800–1600 nm]

slide-4
SLIDE 4

Using constraints

  • Example
  • Instead of ‘PCA’: minimize $\|X - AB'\|$
  • fit the model: minimize $\|X - AB'\|$, subject to A and B being nonnegative
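A minimal sketch of a nonnegativity-constrained bilinear fit. Note the swapped-in technique: it uses multiplicative updates (in the style of nonnegative matrix factorization) rather than a nonnegative least squares solver, simply to stay within plain NumPy. The function name `nonneg_bilinear` and its parameters are hypothetical, and X is assumed to be nonnegative.

```python
import numpy as np

def nonneg_bilinear(X, n_components, n_iter=200, seed=0, eps=1e-12):
    """Approximate X by A @ B.T with A >= 0 and B >= 0 (X assumed nonnegative)."""
    rng = np.random.default_rng(seed)
    I, J = X.shape
    # Strictly positive start values, so the updates can never produce negatives
    A = rng.random((I, n_components)) + 0.1
    B = rng.random((J, n_components)) + 0.1
    for _ in range(n_iter):
        # Multiplicative updates: each factor is rescaled element-wise
        A *= (X @ B) / (A @ (B.T @ B) + eps)
        B *= (X.T @ A) / (B @ (A.T @ A) + eps)
    return A, B
```

The element-wise rescaling keeps every parameter nonnegative by construction; an active-set NNLS solver inside an alternating scheme would be the more direct counterpart of the constrained model on the slide.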

slide-5
SLIDE 5

Why constraints?

  • Obtain sensible parameters
  • Ex.: Require chromatographic profiles to have but one peak
  • Obtain unique solution
  • Ex.: Use selective channels in data to obtain uniqueness
  • Avoiding degeneracy and numerical problems
  • Ex.: Enabling a PARAFAC model of data otherwise inappropriate for the model
  • Speed up algorithms
  • Ex.: Use truncated bases to reexpress problem by a smaller problem

slide-6
SLIDE 6

Typical constraints

Spectroscopy, Chromatography, FIA, Kinetics, Auto- & cross correlation, Uncertainty, Nonnegativity, Unimodality, Selectivity, Smoothness, Known spectra, …

slide-7
SLIDE 7

PARAFAC - algorithm

  • Algorithm - Alternating least squares (ALS)

Ex.: Bilinear model: $\|X - AB^T\|$

  • 1. $B^T = (A^TA)^{-1}A^TX = A^+X$
  • 2. $A^T = (B^TB)^{-1}B^TX^T = B^+X^T$
  • 3. Go to 1 until convergence (small change in fit $\|X - AB^T\|$)
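The three ALS steps for the bilinear model can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the function name `bilinear_als`, the random start for A and the stopping tolerance are not from the slides.

```python
import numpy as np

def bilinear_als(X, n_components, n_iter=100, tol=1e-12, seed=0):
    """Alternating least squares for min ||X - A B'||."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((X.shape[0], n_components))  # assumed random start
    old_fit = np.inf
    for _ in range(n_iter):
        Bt = np.linalg.pinv(A) @ X            # step 1: B' = A+ X
        A = (np.linalg.pinv(Bt.T) @ X.T).T    # step 2: A' = B+ X'
        fit = np.linalg.norm(X - A @ Bt)      # current value of ||X - A B'||
        if abs(old_fit - fit) <= tol:         # step 3: stop on small change in fit
            break
        old_fit = fit
    return A, Bt.T, fit
```

Each step is an ordinary least squares problem, which is why constraints can be added by swapping the unconstrained regression in one step for a constrained one.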
slide-8
SLIDE 8

PARAFAC - algorithm

  • 1. Initialize B and C
  • 2. $A = \left(\sum_{k=1}^{K} X_k B D_k\right)\left(B'B * C'C\right)^{-1}$
  • 3. $B = \left(\sum_{k=1}^{K} X_k' A D_k\right)\left(A'A * C'C\right)^{-1}$
  • 4. $\mathrm{diag}(D_k) = \left(B'B * A'A\right)^{-1}\mathrm{diag}\left(A'X_kB\right),\ k = 1,..,K$
  • 5. Go to step 2 until relative change in fit is small

($*$ is the element-wise product; $D_k$ is diagonal and holds the kth row of C.)

Why ALS?

  • Simple
  • Extends to N-way
  • Handles missing
  • Handles ML fitting

Constraints:

  • Nonnegativity
  • Unimodality
  • Orthogonality
  • Linear constraints
  • Fixed parameters
  • Smoothness
  • Functional
  • etc
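A compact sketch of unconstrained PARAFAC-ALS written slab by slab, with $X_k = A D_k B'$ and row k of C holding the diagonal of $D_k$. The function name `parafac_als`, the array layout (K, I, J), the random initialization and the stopping rule are assumptions made for illustration only.

```python
import numpy as np

def parafac_als(X, F, n_iter=200, tol=1e-10, seed=0):
    """X has shape (K, I, J); slab X[k] is modelled as A @ diag(C[k]) @ B.T."""
    rng = np.random.default_rng(seed)
    K, I, J = X.shape
    B = rng.standard_normal((J, F))   # step 1: initialize B and C
    C = rng.standard_normal((K, F))
    losses = []
    for _ in range(n_iter):
        # A = (sum_k X_k B D_k) (B'B * C'C)^-1
        A = np.einsum('kij,jf,kf->if', X, B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        # B = (sum_k X_k' A D_k) (A'A * C'C)^-1
        B = np.einsum('kij,if,kf->jf', X, A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        # diag(D_k) = (B'B * A'A)^-1 diag(A' X_k B), for all k at once
        C = np.einsum('kij,if,jf->kf', X, A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
        model = np.einsum('if,jf,kf->kij', A, B, C)
        losses.append(np.sum((X - model) ** 2))
        # stop when the relative change in fit is small
        if len(losses) > 1 and losses[-2] - losses[-1] <= tol * losses[-2]:
            break
    return A, B, C, losses
```

Because every update is an exact least squares solution for one mode, the loss cannot increase from one iteration to the next, and any single update can be replaced by a constrained regression.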
slide-9
SLIDE 9

FIA example

Model of a single sample with one analyte

[Figure: absorbance landscape over time and wavelength (250–450 nm), with the corresponding spectra (absorbance vs. wavelength/nm) and time profiles (absorbance vs. time)]

  • Samples of 2-, 3-, 4-HBA (hydroxybenzaldehyde)
  • UV-VIS FIA system with pH-gradient imposed
  • Spectrum is the sum of acidic and basic spectrum. Same for time profiles. Only sums are measured
  • Model not important here.

PARALIND: $X_k = A H D_k B'$ (Morteza Bahram will talk about this)

slide-10
SLIDE 10

Effect of constraints

Concentrations       2HBA     3HBA     4HBA
Eq                   0.9769   0.9837   0.9979
NNLS                 0.9988   0.9787   0.9996
NNLS/Eq              0.9992   0.9987   0.9996
NNLS/ULSR/Eq         0.9992   0.9987   0.9996
NNLS/ULSR/Fix/Eq     0.9990   0.9987   0.9996

  • Eq: Equality of summed profiles
  • NNLS: Non-negativity of all parameters
  • ULSR: Unimodality of FIAgrams/time profiles
  • Fix: Fixing purely acidic/basic times to only reflect acidic/basic analytes

[Figure: 4HBA time profile (a.u.), non-negativity & equality constrained]

slide-11
SLIDE 11
Effect of constraints

SPECTRA              2HBA acidic   2HBA basic   3HBA acidic   3HBA basic   4HBA acidic   4HBA basic
Eq                   0.9893*       0.9871*      0.9689*       0.7647*      0.9106*       0.9211*
NNLS                 0.9944        0.9117*      0.9952        0.9241       0.9974        0.9977
NNLS/Eq              0.9946        0.9312*      0.9953        0.9988       0.9965        0.9971
NNLS/ULSR/Eq         0.9946        0.9590*      0.9953        0.9989       0.9966        0.9943
NNLS/ULSR/Fix/Eq     0.9946        0.9989       0.9954        0.9986       0.9961        0.9977

  • Eq: Equality of summed profiles
  • NNLS: Non-negativity of all parameters
  • ULSR: Unimodality of FIAgrams/time profiles
  • Fix: Fixing purely acidic/basic times to only reflect acidic/basic analytes

slide-12
SLIDE 12

Data analysis requires good data – g.i.g.o. (garbage in, garbage out)

Example: A fluorescence excitation-emission matrix contains chemical information that PARAFAC can handle and physical scattering signals that do not fit PARAFAC

slide-13
SLIDE 13

Knowing your data

Chemical part

slide-14
SLIDE 14

Knowing your data

Chemical part

slide-15
SLIDE 15

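Knowing your data

MILES – maximum likelihood

Downweight areas of less importance. Done by extending the least squares fit to weighted and off-diagonal-weighted least squares: the loss becomes $e^TW^TWe$.

[Figure: weight matrix with diagonal elements $w_1^2, w_2^2, w_3^2, w_4^2, \ldots, w_J^2$ and off-diagonal elements $w_{14}^2$ and $w_{4J}^2$]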

slide-16
SLIDE 16

MILES – maximum likelihood

  • Algorithm MILES (Maximum likelihood via Iterative Least squares EStimation) based on majorization
  • Enables weighted least squares and maximum likelihood fitting of any model which has a least squares algorithm

Given vectorized data x and weights W:

  • 1. Initialize model, $m^0$, with LS, set c := 0
  • 2. Calculate q: $q = m^c + \frac{1}{\beta} W^TW(x - m^c)$
  • 3. Fit LS model to q instead of to data: $m^{c+1} = \arg\min_m \|m - q\|_F^2$
  • 4. c := c+1; go to step 2 until convergence

Here $\beta$ is the largest eigenvalue of $W^TW$ (following Bro, Sidiropoulos and Smilde, 2002).
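A hedged sketch of the MILES idea for the simplest case: element-wise (diagonal) weights and a low-rank bilinear model fitted by truncated SVD. The function name `miles_lowrank` is hypothetical. The scaling constant `beta` is taken as the largest squared weight, which is the largest eigenvalue of a diagonal W'W; that choice follows the 2002 paper cited in this deck and is an assumption as far as the slide text itself goes.

```python
import numpy as np

def miles_lowrank(X, W, rank, n_iter=100):
    """Minimise sum((W * (X - M))**2) over matrices M of the given rank."""
    W2 = W ** 2
    beta = W2.max()   # assumed scaling: largest eigenvalue of the diagonal W'W

    def ls_fit(Q):
        # Ordinary least squares model: best rank-`rank` approximation of Q
        U, s, Vt = np.linalg.svd(Q, full_matrices=False)
        return (U[:, :rank] * s[:rank]) @ Vt[:rank]

    M = ls_fit(X)                            # step 1: initialise with plain LS
    losses = [np.sum(W2 * (X - M) ** 2)]
    for _ in range(n_iter):
        Q = M + (W2 / beta) * (X - M)        # step 2: calculate q
        M = ls_fit(Q)                        # step 3: fit LS model to q, not to the data
        losses.append(np.sum(W2 * (X - M) ** 2))
    return M, losses
```

Only `ls_fit` depends on the model, so the same loop could wrap any model that has an ordinary least squares algorithm. With weights of zero and one the loop reduces to iterative imputation of missing elements.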

slide-17
SLIDE 17

21 samples containing L-phenylalanine, L-3,4-dihydroxy-phenyl-alanine (DOPA), 1,4-dihydroxy-benzene & L-tryptophan

  • Three types of unwanted variation
  • Measurement error (~ iid Gaussian)
  • Rayleigh and Raman scatter
  • Non-chemical area

Baunsgaard D, Factors affecting 3-way modelling (PARAFAC) of fluorescence landscapes, The Royal Veterinary & Agricultural University, 1999

slide-18
SLIDE 18

PARAFAC results

[Figure: raw data with an artifact marked, and the least squares PARAFAC result]

slide-19
SLIDE 19

PARAFAC results

[Figure: raw data with an artifact marked, the MILES interpretation of the data, and the MILES PARAFAC and least squares PARAFAC results]

slide-20
SLIDE 20

Bootstrapping a bit

Emission spectra from 100 resamplings

[Figure: two panels of loadings against emission wavelength (260–420 nm)]

  • R. Bro, N. D. Sidiropoulos, and A. K. Smilde. Maximum likelihood fitting using simple least squares algorithms. J.Chemometrics, 2002
slide-21
SLIDE 21


Missing data

slide-22
SLIDE 22

Missing data

No missing

Ex.: standard PCA loss function

$\|X - TP'\|^2 = \sum_{i=1}^{I}\sum_{j=1}^{J}\left(x_{ij} - \sum_{f=1}^{F} t_{if}p_{jf}\right)^2$

I.e., a summation of errors over all elements of X

If missing

Only fit the model to the data that exist. I.e., fit to the loss function

$\sum_{i=1}^{I}\sum_{j=1}^{J}\left(w_{ij}\left(x_{ij} - \sum_{f=1}^{F} t_{if}p_{jf}\right)\right)^2$

where $w_{ij}$ is zero if $x_{ij}$ is missing and one otherwise

slide-23
SLIDE 23

Missing data

How can that loss function be optimized?

Method 1: use weighted least squares regression

Method 2: use imputation

  • 1. Put numbers in missing elements
  • 2. Fit model to these ‘wrong’ data (Ex: M = TP’ in PCA)
  • 3. Replace missing elements with model guess (Ex: $x_{ij} = M_{ij}$ in PCA)
  • 4. Go to step 2 until convergence

Both methods give the same result. Method 2 is easy to implement; Method 1 is sometimes faster, but more memory-demanding
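Method 2 can be sketched for PCA as below, with missing elements coded as NaN. The function name `pca_impute`, the choice of the overall mean as start value and the stopping rule are assumptions for illustration.

```python
import numpy as np

def pca_impute(X, rank, n_iter=500, tol=1e-12):
    """PCA with missing values (NaN) handled by iterative imputation."""
    miss = np.isnan(X)
    # step 1: put numbers in the missing elements (here: mean of observed data)
    Xf = np.where(miss, np.nanmean(X), X)
    for _ in range(n_iter):
        # step 2: fit the model M = T P' to the filled-in ('wrong') data
        U, s, Vt = np.linalg.svd(Xf, full_matrices=False)
        M = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # step 3: replace missing elements with the model guess
        change = np.sum((Xf[miss] - M[miss]) ** 2)
        Xf[miss] = M[miss]
        # step 4: repeat until the imputed values stop changing
        if change < tol:
            break
    return M, Xf
```

Observed elements are never altered; only the imputed ones move, so at convergence the imputed elements have zero residual and the fit is driven by the observed data alone.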


slide-25
SLIDE 25

Ex.: Fluorescence data

  • 15% missing data
  • Three PCA components should be sufficient
  • Ad hoc approaches such as NIPALS do not work (too many significant components)

Missing data

slide-26
SLIDE 26


Jackknifing and split-half analysis

slide-27
SLIDE 27

Jackknifing and split-half analysis

What is the problem?

  • Detect outliers
  • Uncertainty measures are difficult to define
  • And assumptions are hardly ever met anyway

Jack-knifing* is a solution

  • Based on resampling (cross-validation)
  • Works regardless of model structure
  • No distributional assumptions except that resampled objects are independent

*J. W. Tukey, Annals of Mathematical Statistics 1958, 29, 614.
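A small leave-one-out jackknife sketch. As a stand-in for a PARAFAC loading it uses the first right singular vector of a two-way matrix, which is a swapped-in model chosen only to keep the example short. The function names, the sign alignment and the standard jackknife variance factor (n-1)/n are assumptions, not taken from the slides.

```python
import numpy as np

def first_loading(X):
    """Stand-in model: first loading vector from an SVD of X (samples x variables)."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[0]

def jackknife_loading(X):
    n = X.shape[0]
    overall = first_loading(X)
    reps = np.empty((n, X.shape[1]))
    for m in range(n):
        b = first_loading(np.delete(X, m, axis=0))    # refit with sample m left out
        reps[m] = b if b @ overall >= 0 else -b       # resolve the sign ambiguity
    # Jackknife standard error of each loading element
    se = np.sqrt((n - 1) / n * np.sum((reps - reps.mean(axis=0)) ** 2, axis=0))
    # Squared distance of each leave-one-out loading from the overall loading
    influence = np.sum((reps - overall) ** 2, axis=1)
    return overall, se, influence
```

A sample whose removal changes the loading a lot gets a large `influence` value, which is one way to flag candidate outliers; `se` gives an uncertainty estimate without distributional assumptions.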

slide-28
SLIDE 28


Jack-knifed PARAFAC

slide-29
SLIDE 29

Preliminary model

[Figure: emission spectral profiles against wavelength (250–500 nm), four panels; deviating curves are annotated 1st, 2nd, 3rd and 5th sample]

slide-30
SLIDE 30

Preliminary model

Detection of outliers: loadings RIP plot (resampled influence plot)

[Figure: RIP plots of b-loadings difference against sum of squared residuals, points labelled by sample number 1–27]

b-loadings difference for left-out sample m:

$\sum_{f=1}^{F}\sum_{j=1}^{J}\left(b_{jf}^{\mathrm{overall}} - b_{jf}^{(-m)}\right)^2$

slide-31
SLIDE 31

Preliminary model

Detection of outliers: scores IMP - Identity Match Plot

[Figure: scores of the overall model against the mth predicted sample, points labelled by sample number 1–27]

slide-32
SLIDE 32

Final model

Removing low excitation and sample # 2, 3, 5, 10

[Figure: emission spectral profiles against wavelength (250–500 nm), four panels]

No more outliers and now uncertainty and standard

slide-33
SLIDE 33

So …

  • Jack-knifing is a natural extension of cross-validation
  • Provides an exploratory and highly needed approach to assess model stability/robustness
  • Enables calculation of the standard errors even in cases with little statistical background information

  • Allows the detection of outliers
slide-34
SLIDE 34

Number of components?

Simpler than PCA (but takes more time):

  • Cross-val, Scree etc. as in PCA
  • Plus split-half
  • Plus core consistency
  • Plus chemical validation
  • Plus algorithmic indications (degeneracy, many iterations, local minima etc.)

These are the main ones. Always look at the model to validate it. Use core consistency, but carefully. Use split-half for definitive validation
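For orientation, the core consistency diagnostic can be sketched as follows: compute the least squares Tucker3 core for the fitted PARAFAC loadings and compare it with a superdiagonal core of ones. The function name `core_consistency` and the (I, J, K) array layout are assumptions; the formula is the usual 100 × (1 − Σ(g − t)² / F), stated here from general knowledge rather than from the slides.

```python
import numpy as np

def core_consistency(X, A, B, C):
    """X: (I, J, K) array; A (I, F), B (J, F), C (K, F) are PARAFAC loadings."""
    F = A.shape[1]
    # Least squares Tucker3 core given the loadings: G = A+ x B+ x C+ applied to X
    G = np.einsum('fi,gj,hk,ijk->fgh',
                  np.linalg.pinv(A), np.linalg.pinv(B), np.linalg.pinv(C), X)
    # Ideal PARAFAC core: ones on the superdiagonal, zeros elsewhere
    T = np.zeros((F, F, F))
    idx = np.arange(F)
    T[idx, idx, idx] = 1.0
    return 100.0 * (1.0 - np.sum((G - T) ** 2) / F)
```

Values close to 100 indicate that the loadings describe trilinear variation; markedly lower values for a model with one more component suggest that too many components were extracted. As the slide says, the diagnostic should be used carefully and alongside the other checks.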

slide-35
SLIDE 35

Number of components?

  • PARAFAC
  • Simpler than PCA (but takes more time):
  • Cross-val, Scree etc. as in PCA
  • Plus split-half
  • Plus core consistency
  • Plus algorithmic indications (degeneracy, many iterations, local minima etc.)

This is how it is normally presented, but in reality it is only that simple for ‘trivial’ problems.

  • In reality you have to look at the details of the model,
  • You have to know the data,
  • You have to be critical towards diagnostics
slide-36
SLIDE 36

Cross-validation

[Figure: three-way array X of 10 breads × 11 attributes × 8 judges, shown as one 10 × 11 slab per judge]

PARAFAC

Number of components   Fit    Cross-val
1                      35.3   14.5
2                      49.2   26.2
3                      57.4   32.9
4                      62.7   34.4
5                      67.2   33.0

Two-three components reasonable

  • NB. Cross-validation hardly ever used/useful in practice

slide-37
SLIDE 37

Split-half analysis for PARAFAC

  • Fit model to several independent samples
  • If loadings the same, reasonable number chosen

Original data Split 2 Split 1

slide-38
SLIDE 38

Split-half analysis for PARAFAC

  • Fit model to several independent samples
  • If loadings the same, reasonable number chosen

Multi-way analysis in fluorescence data, Report 2002, Giorgia Servente & Jordi Riu

Original data Split 1 Split 2

Pick the best one

slide-39
SLIDE 39


SLICING – recovering exp

slide-40
SLIDE 40

Resolving two-way exponentials

$I(t) = M_0 \cdot \exp(-t/T_2)$

Two-way data but with very special structure in loadings

$X = AB' = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\begin{bmatrix} 27 & 9 & 3 & 1 \\ 16 & 8 & 4 & 2 \end{bmatrix}$

[Figure: decay curves for Sample 1 (27, 9, 3, 1) and Sample 2 (16, 8, 4, 2) against time (ms), 100–1000]

slide-41
SLIDE 41

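SLICING - Pseudo Three-way Data

SLAB 1 ($X_1$)

$X_1 = AD_1B' = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\begin{bmatrix} 3 & 0 \\ 0 & 2 \end{bmatrix}\begin{bmatrix} 9 & 3 & 1 \\ 8 & 4 & 2 \end{bmatrix}$

[Figure: Sample 1 (27, 9, 3) and Sample 2 (16, 8, 4)]

SLAB 2 ($X_2$)

$X_2 = AD_2B' = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\begin{bmatrix} 9 & 3 & 1 \\ 8 & 4 & 2 \end{bmatrix}$

[Figure: Sample 1 (9, 3, 1) and Sample 2 (8, 4, 2)]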
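The slicing step can be illustrated with the two decay curves from the example (27, 9, 3, 1 and 16, 8, 4, 2). The sketch below builds the two lagged slabs and checks that they differ only by a diagonal scaling per sample, which is the trilinear structure PARAFAC needs. The function name `slice_decays` is hypothetical.

```python
import numpy as np

def slice_decays(X, n_slabs=2):
    """Stack lagged copies of the decay curves (rows of X) into a 3-way array.

    Slab k holds columns k .. k+L-1, so the result has shape (n_slabs, samples, L).
    """
    L = X.shape[1] - n_slabs + 1
    return np.stack([X[:, k:k + L] for k in range(n_slabs)])

X = np.array([[27.0, 9.0, 3.0, 1.0],    # sample 1: decays by a factor 3 per step
              [16.0, 8.0, 4.0, 2.0]])   # sample 2: decays by a factor 2 per step
slabs = slice_decays(X)

# For pure exponentials, shifting by one lag only rescales each curve:
# slab 1 = diag(3, 2) @ slab 2
scaling = np.diag([3.0, 2.0])
```

Because a lagged exponential is the same exponential multiplied by a constant, every slab shares the same profiles and only the diagonal scaling changes from slab to slab.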

slide-42
SLIDE 42


Slicing principle

slide-43
SLIDE 43

Example: Sensory quality of potatoes

Predicting cooked potato quality from NMR on raw data

  • Potatoes cooked and served hot. The assessors (10) evaluate texture profile, e.g. mealiness
  • Raw (!) potato measured by NMR (CPMG pulse sequence)
  • A. K. Thybo et al., Food Science and Technology, 33 (2):103-111, 2000.

= Lagging PARAFAC

[Figure: raw data, intensity curves]

slide-44
SLIDE 44

Example: Sensory quality

  • Decomposing NMR into meaningful latent variables
  • PARAFAC on lagged data enables meaningful decomposition
  • Four components adequate for describing NMR data
  • Predictions of mealiness from NMR
  • Predicting sensory quality of cooked potatoes from amount of latent variables (i.e. directly from NMR of raw potatoes)

[Figure: predictions of mealiness, reference vs. predicted; r = 0.858, RMSEP = 0.79]

[Figure: profiles from the PARAFAC decomposition (a.u.)]

slide-45
SLIDE 45


Fluorescence EEM data

slide-46
SLIDE 46

Fluorescence spectroscopy

[Figure: instrument schematic with lamp (UV-vis), excitation monochromator, sample, emission monochromator and detector; intensity recorded against excitation and emission]

Excitation-emission matrix – a chemical fingerprint

slide-47
SLIDE 47

Practical example

  • Each sample yields 100 × 300 matrix
  • Model with PCA (truncated SVD)
slide-48
SLIDE 48

Practical example

  • Each sample yields 100 × 300 matrix
  • Model with PCA (truncated SVD)

X = TP’

slide-49
SLIDE 49
slide-50
SLIDE 50
slide-51
SLIDE 51

Sugar processing

Sugar made from beets

  • End product sampled every 8th hour during the three months of operation
  • Measured spectrofluorometrically
  • 260 samples

[Figure: process diagram with the stages thin juice, thick juice, standard liquor, sugar boiling, massecuite, centrifuge and sugar, the streams syrup, wash syrup, wash water, wash juice and water, and variables X1–X8]

slide-52
SLIDE 52

Correlation to process- & quality

268 sugar fluorescence landscapes, one sample per 8h shift (14 weeks total)

PARAFAC model, 4 components

[Figure: 4 estimated emission spectra against wavelength (300–500 nm); the labels "Tyrosine" and "Tryptophan" appear next to comp. 1 and comp. 4]

[Figure: scores for components 1–4 over the 8h shifts (week 1–14)]

Correlation to quality parameters ash and color

[Table: correlations (r) to Color and Ash: 0.60, 0.69, -0.42, 0.43, 0.64, 0.81]

[Figure: Color and CaO against days (10–90), a.u.; blue: reference, red: from fluorescence]

slide-53
SLIDE 53


PARAFAC + fluorescence

  • Several advantages
  • Chromatographic analysis of the whole process
  • Process monitoring (MSPC) on a chemical basis
  • Chemical understanding of why coloring occurs
  • On-line prediction of quality
  • On-line prediction of process indicators
  • Nowadays = PAT – process analytical technology
slide-54
SLIDE 54

Handling scatter

[Figure: excitation-emission landscape with the scatter lines marked]

  • 1st order Rayleigh
  • 2nd order Rayleigh
  • Raman

slide-55
SLIDE 55

Handling scatter

  • Down-weighting of the scatter region (MILES)
  • Specific modeling of scatter
  • Subtraction of a standard
  • Inserting missing values
  • Constraints in the PARAFAC decomposition
  • Inserting zeros outside the data area
  • Avoid the part of the matrix that includes scatter.
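Two of the options above, inserting missing values in the scatter region and zeros outside the data area, can be sketched as a simple mask on the excitation-emission grid. The function name `mask_scatter`, the band half-width `width` and the rule "emission well below excitation is outside the data area" are assumptions for illustration; Raman scatter is not handled here because its position depends on the solvent.

```python
import numpy as np

def mask_scatter(eem, ex, em, width=10.0):
    """eem: (n_ex, n_em) landscape; ex, em: wavelength axes in nm.

    Returns a copy with Rayleigh bands set to NaN (missing) and the
    region with emission well below excitation set to zero.
    """
    EX, EM = np.meshgrid(ex, em, indexing='ij')
    out = np.array(eem, dtype=float)
    out[EM < EX - width] = 0.0                      # zeros outside the data area
    out[np.abs(EM - EX) <= width] = np.nan          # 1st order Rayleigh: em = ex
    out[np.abs(EM - 2 * EX) <= width] = np.nan      # 2nd order Rayleigh: em = 2 * ex
    return out
```

The NaN elements can then be treated as missing data when fitting, or filled by interpolation along the emission axis before modeling.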
slide-56
SLIDE 56


Handling scatter

slide-57
SLIDE 57

Handling scatter

[Figure: fluorescence against emission, raw data and interpolated values]

slide-58
SLIDE 58

Handling scatter

[Figure: fluorescence against emission; data, shape-preserving polynomial and cubic polynomial]

slide-59
SLIDE 59

Handling scatter

[Figure: three EEM landscapes (intensity against Exc and Emi): raw EEM, scatter removed, scatter interpolated]

Makes modeling faster, simpler, easier!

slide-60
SLIDE 60

Handling scatter

Function available at www.models.life.ku.dk

slide-61
SLIDE 61


Chromatographic data

slide-62
SLIDE 62

PARAFAC2

  • PARAFAC cannot handle shifts and shape changes

PARAFAC: $X_k = AD_kB^T$

slide-63
SLIDE 63

PARAFAC2

  • PARAFAC2 for handling shifts*

PARAFAC: $X_k = AD_kB^T$

PARAFAC2: $X_k = AD_kB_k^T$ subject to $B_k^TB_k$ constant

*Actually it is more general than shifts but it’s a feasible approximation to what PARAFAC2 can handle

  • R. A. Harshman. UCLA working papers in phonetics 22: 30-47, 1972.
  • H. A. L. Kiers, J. M. F. ten Berge, R. Bro. J. Chemom. 13:275-294, 1999.
  • R. Bro, C. A. Andersson, H. A. L. Kiers. J. Chemom. 13:295-309, 1999.
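One common way to impose the PARAFAC2 constraint is to write each sample-specific profile matrix as $B_k = P_kH$ with $P_k$ column-orthonormal and H shared by all samples, so that $B_k^TB_k = H^TH$ for every k. The sketch below only builds data with that structure and does not fit a model; the function names are hypothetical.

```python
import numpy as np

def random_orthonormal(J, F, rng):
    """Return a J x F matrix with orthonormal columns (P'P = I)."""
    Q, _ = np.linalg.qr(rng.standard_normal((J, F)))
    return Q

def parafac2_slabs(A, C, H, Ps):
    """Build X_k = A D_k B_k' with B_k = P_k H and diag(D_k) = C[k]."""
    Bs = [P @ H for P in Ps]
    Xs = [A @ np.diag(C[k]) @ Bk.T for k, Bk in enumerate(Bs)]
    return Xs, Bs
```

The profiles $B_k$ may differ from sample to sample, for instance through shifted peaks, while their cross-product matrix stays the same. That is what keeps the model identifiable while allowing more freedom than ordinary PARAFAC.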

slide-64
SLIDE 64

PARAFAC2 for shifted data

  • Two-way shifts
  • Chromatography
  • Retention times constant ⇒ bilinear data
  • Retention times vary ⇒ breakdown

[Figure: elution profiles and loadings, without shifts and with shifts]

slide-65
SLIDE 65

PARAFAC2 for shifted data

  • 15 samples of thick juice
  • Sephadex G25 low pressure chromatographic system
  • 28 discrete fractions measured spectrofluorometrically

[Figure: data for Sample 1, Sample 2, ..., Sample K (emission by sample-elution mode) decomposed into emission spectra, excitation spectra and sample-specific profiles for sample 1, sample 2, ..., sample K]

slide-66
SLIDE 66

PARAFAC2 for shifted data

Compare PARAFAC(1) and PARAFAC2

Widely different spectra

If PARAFAC1 and PARAFAC2 agree, then PARAFAC1 is the choice (simpler); else PARAFAC1 may be too simple

[Figure: emission (Em./nm) and excitation (Exc./nm) spectra from four-way PARAFAC1 and four-way PARAFAC2]

slide-67
SLIDE 67

PARAFAC2 for shifted data

Compare profiles with three-way PARAFAC1 profiles in which shifts do not affect

[Figure: elution profiles against elution time for samples 1-5, 6-10 and 11-15; reference and four-way PARAFAC2]

slide-68
SLIDE 68

PARAFAC2 for shifted data

slide-69
SLIDE 69

PARAFAC2 for shifted data

slide-70
SLIDE 70

PARAFAC2 for shifted data

PARAFAC2

slide-71
SLIDE 71

PARAFAC2 for shifted data

PARAFAC2

slide-72
SLIDE 72

www.models.life.ku.dk

Data, m-files, theses, movies and more