Rasmus Bro
Practical problems in multi-way analysis
- Constraints
- Missing data
- Jackknifing and split-half analysis
Some examples:
- SLICING – recovering exponentials
- Fluorescence EEM data
- Chromatographic data
rb@life.ku.dk
very direct physical meaning (spectra, concentrations, elution profiles)
parameters to “make sense”
by fitting a model under additional constraints
[Figure: data and PCA loadings versus wavelength, 800 to 1600 nm]
||X - AB'||
||X - AB'||, subject to A and B being nonnegative
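The constrained bilinear fit above can be sketched with alternating non-negative least squares. This is only an illustration, not the speaker's implementation: the function name `nn_bilinear_als`, the random start and the fixed iteration count are assumptions, and `scipy.optimize.nnls` is used as the non-negative solver.

```python
import numpy as np
from scipy.optimize import nnls

def nn_bilinear_als(X, n_components, n_iter=200, seed=0):
    """Fit X ~ A B' with A >= 0 and B >= 0 by alternating NNLS (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    I, J = X.shape
    B = rng.random((J, n_components))
    A = rng.random((I, n_components))
    for _ in range(n_iter):
        # Row i of A solves min ||x_i - B a_i|| subject to a_i >= 0
        A = np.array([nnls(B, X[i, :])[0] for i in range(I)])
        # Row j of B solves min ||x_j - A b_j|| subject to b_j >= 0
        B = np.array([nnls(A, X[:, j])[0] for j in range(J)])
    return A, B

# Synthetic non-negative rank-2 data (made up for the example)
rng = np.random.default_rng(1)
X = rng.random((20, 2)) @ rng.random((15, 2)).T
A, B = nn_bilinear_als(X, 2)
rel_err = np.linalg.norm(X - A @ B.T) / np.linalg.norm(X)
```

Each sub-problem is an ordinary least squares problem with a constraint added, which is why constraints are easy to add in an alternating scheme.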
inappropriate for the model
problem
Spectroscopy, Chromatography, FIA, Kinetics, Auto- & cross correlation, Uncertainty, Nonnegativity, Unimodality, Selectivity, Smoothness, Known spectra, …
Ex.: Bilinear model: ||X – AB^T||
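1. Initialize $B$ and $C$
2. $A = \left(\sum_{k=1}^{K} X_k B D_k\right)\left((B'B) * (C'C)\right)^{-1}$
3. $B = \left(\sum_{k=1}^{K} X_k' A D_k\right)\left((A'A) * (C'C)\right)^{-1}$
4. $D_k = \mathrm{diag}\left(\left((B'B) * (A'A)\right)^{-1}\,\mathrm{diag}(A' X_k B)\right),\quad k = 1, \dots, K$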
Why ALS?
- Simple
- Extends to N-way
- Handles missing
- Handles ML fitting
- Constraints:
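A minimal NumPy sketch of PARAFAC-ALS on slabs X_k = A D_k B' follows. It is an illustration under assumptions, not the speaker's code: the name `parafac_als`, the random initialization, the fixed number of iterations and the use of a pseudo-inverse are choices made here, and row k of `C` stores the diagonal of D_k.

```python
import numpy as np

def parafac_als(X, F, n_iter=500, seed=0):
    """X has shape (K, I, J): K slabs X_k of size I x J (hypothetical helper)."""
    K, I, J = X.shape
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((J, F))
    C = rng.standard_normal((K, F))  # row k is the diagonal of D_k
    for _ in range(n_iter):
        # A = (sum_k X_k B D_k) ((B'B) * (C'C))^-1, with * the elementwise product
        num = sum((X[k] @ B) * C[k] for k in range(K))
        A = num @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        # B = (sum_k X_k' A D_k) ((A'A) * (C'C))^-1
        num = sum((X[k].T @ A) * C[k] for k in range(K))
        B = num @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        # diag(D_k) = ((A'A) * (B'B))^-1 diag(A' X_k B)
        G = np.linalg.pinv((A.T @ A) * (B.T @ B))
        for k in range(K):
            C[k] = G @ np.diag(A.T @ X[k] @ B)
    return A, B, C

# Noise-free synthetic two-component data (made up for the example)
rng = np.random.default_rng(3)
A0, B0, C0 = (rng.standard_normal((n, 2)) for n in (10, 8, 6))
X = np.stack([A0 @ np.diag(C0[k]) @ B0.T for k in range(6)])
A, B, C = parafac_als(X, 2)
Xhat = np.stack([A @ np.diag(C[k]) @ B.T for k in range(6)])
rel_err = np.linalg.norm(X - Xhat) / np.linalg.norm(X)
```

A practical implementation would add a convergence check on the loss and normalize the loadings between iterations.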
Model of a single sample with one analyte
[Figure: absorbance landscape versus time and wavelength, with the spectral profile (absorbance versus wavelength/nm) and the time profile (absorbance versus time)]
benzaldehyde)
gradient imposed
Only sums are measured
PARALIND: X_k = A H D_k B’ (Morteza Bahram will talk about this)
Concentrations | 2HBA | 3HBA | 4HBA
Eq | 0.9769 | 0.9837 | 0.9979
NNLS | 0.9988 | 0.9787 | 0.9996
NNLS/Eq | 0.9992 | 0.9987 | 0.9996
NNLS/ULSR/Eq | 0.9992 | 0.9987 | 0.9996
NNLS/ULSR/Fix/Eq | 0.9990 | 0.9987 | 0.9996
Eq: Equality of summed profiles
NNLS: Non-negativity of all parameters
ULSR: Unimodality of FIAgrams/time profiles
Fix: Fixing purely acidic/basic times to only reflect acidic/basic analytes
[Figure: 4HBA time profile (a.u.) versus time, non-negativity & equality constrained]
Eq: Equality of summed profiles
Fix: Fixing purely acidic/basic times to only reflect acidic/basic analytes
SPECTRA | 2HBA acidic | 2HBA basic | 3HBA acidic | 3HBA basic | 4HBA acidic | 4HBA basic
Eq | 0.9893* | 0.9871* | 0.9689* | 0.7647* | 0.9106* | 0.9211*
NNLS | 0.9944 | 0.9117* | 0.9952 | 0.9241 | 0.9974 | 0.9977
NNLS/Eq | 0.9946 | 0.9312* | 0.9953 | 0.9988 | 0.9965 | 0.9971
NNLS/ULSR/Eq | 0.9946 | 0.9590* | 0.9953 | 0.9989 | 0.9966 | 0.9943
NNLS/ULSR/Fix/Eq | 0.9946 | 0.9989 | 0.9954 | 0.9986 | 0.9961 | 0.9977
Example: Fluorescence excitation-emission matrix contains chemical information that PARAFAC can handle and physical scattering signals that do not fit PARAFAC
Chemical part
MILES – maximum likelihood
Loss: $e^T W^T W e$
[Figure: weight matrix with elements $w_1^2, w_2^2, w_3^2, w_4^2, \dots, w_J^2$ and $w_{14}^2$, $w_{4J}^2$]
Downweight areas of less importance
Done by extending least squares fit to weighted and
squares
MILES – maximum likelihood
squares EStimation) based on Majorization
fitting of any model which has a least squares algorithm
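Given vectorized data $x$ and weights $W$:
1. Calculate $q = m_c + \beta^{-1} W^T W (x - m_c)$, with $m_c$ the current model estimate and $\beta$ the largest eigenvalue of $W^T W$
2. Fit LS model to $q$ instead of to data: $\underset{m}{\mathrm{argmin}}\ \|m - q\|_F^2$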
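The two steps can be sketched for a simple case. The code below assumes element-wise (diagonal) weights and uses a truncated SVD as the ordinary least squares model, whereas the slides apply the idea to PARAFAC. The names `miles_lowrank`, `ls_lowrank` and `weighted_loss`, and the choice of `beta` as the largest squared weight, are assumptions made for this illustration.

```python
import numpy as np

def weighted_loss(X, M, W):
    return np.sum((W * (X - M)) ** 2)

def ls_lowrank(Q, F):
    """Ordinary least squares rank-F fit (truncated SVD)."""
    U, s, Vt = np.linalg.svd(Q, full_matrices=False)
    return (U[:, :F] * s[:F]) @ Vt[:F]

def miles_lowrank(X, W, F, n_iter=200):
    W2 = W ** 2            # diagonal of W'W, arranged like X
    beta = W2.max()        # largest eigenvalue of a diagonal W'W
    M = ls_lowrank(X, F)   # start from the unweighted LS fit
    losses = [weighted_loss(X, M, W)]
    for _ in range(n_iter):
        Q = M + (W2 / beta) * (X - M)   # step 1: calculate q
        M = ls_lowrank(Q, F)            # step 2: fit the LS model to q
        losses.append(weighted_loss(X, M, W))
    return M, losses

# Synthetic rank-2 data with an artifact in one corner (made up for the example)
rng = np.random.default_rng(0)
X_true = rng.standard_normal((30, 2)) @ rng.standard_normal((20, 2)).T
X = X_true + 0.01 * rng.standard_normal(X_true.shape)
X[:5, :5] += 10.0                      # artifact region
W = np.ones_like(X)
W[:5, :5] = 0.01                       # downweight the artifact
M_ls = ls_lowrank(X, 2)
M_miles, losses = miles_lowrank(X, W, 2)
err_ls = np.linalg.norm(M_ls - X_true)
err_miles = np.linalg.norm(M_miles - X_true)
```

Only the least squares routine is called inside the loop, so any model with a least squares algorithm could be substituted for `ls_lowrank`.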
21 samples containing L-phenylalanine, L-3,4-dihydroxy-phenyl-alanine (DOPA), 1,4-dihydroxy-benzene & L-tryptophan
Baunsgaard D, Factors affecting 3-way modelling (PARAFAC) of fluorescence landscapes, The Royal Veterinary & Agricultural University, 1999
[Figure: raw data with artifact, and least squares PARAFAC]
[Figure: raw data with artifact, MILES interpretation of data, MILES PARAFAC, and least squares PARAFAC]
Emission spectra from 100 resamplings
[Figure: loadings versus emission wavelength, 260 to 420 nm, two panels]
No missing
Ex.: standard PCA loss function $\|X - TP'\|^2 = \sum_{i=1}^{I}\sum_{j=1}^{J}\left(x_{ij} - \sum_{f=1}^{F} t_{if}p_{jf}\right)^2$
I.e., a summation of errors over all elements of X
If missing
Only fit the model to the data that exist. I.e., fit to the loss function $\sum_{i=1}^{I}\sum_{j=1}^{J} w_{ij}\left(x_{ij} - \sum_{f=1}^{F} t_{if}p_{jf}\right)^2$, where $w_{ij}$ is zero if $x_{ij}$ is missing and one otherwise
How can that loss function be optimized?
- Method 1: use weighted least squares regression
- Method 2: use imputation
PCA)
Both methods give same result. Method 2 is easy to implement, Method 1 sometimes faster, but more memory-demanding
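Method 2 can be sketched as an iterative imputation loop: fill the missing elements, fit the model, replace the missing elements with the model values, and repeat. The code is a sketch under assumptions: missing values are coded as NaN, the model is an uncentered PCA (X = TP'), and the name `pca_impute`, the column-mean start and the stopping rule are choices made here.

```python
import numpy as np

def pca_impute(X, F, n_iter=1000, tol=1e-12):
    """PCA with missing values (NaN) by iterative imputation (hypothetical helper)."""
    miss = np.isnan(X)
    # Start by filling each missing element with its column mean
    Xf = np.where(miss, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(Xf, full_matrices=False)
        M = (U[:, :F] * s[:F]) @ Vt[:F]          # current model T P'
        change = np.sum((Xf[miss] - M[miss]) ** 2)
        Xf[miss] = M[miss]                        # impute from the model
        if change < tol * np.sum(Xf ** 2):
            break
    return U[:, :F] * s[:F], Vt[:F].T, Xf

# Synthetic rank-3 data with about 15% missing elements (made up for the example)
rng = np.random.default_rng(0)
X_full = rng.standard_normal((40, 3)) @ rng.standard_normal((30, 3)).T
X = X_full.copy()
mask = rng.random(X.shape) < 0.15
X[mask] = np.nan
T, P, X_filled = pca_impute(X, 3)
rel_err = np.linalg.norm(X_filled[mask] - X_full[mask]) / np.linalg.norm(X_full[mask])
```

Because the imputed elements equal the model values at convergence, they contribute zero residual, so the fit is determined by the existing data only.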
Ex.: Fluorescence data
- 15% missing data
- Three PCA components should be sufficient
- Ad hoc approaches such as NIPALS do not work (too many significant components)
*J.W.Tukey Annals of Mathematical Statistics 1958, 29, 614.
What is the problem?
Jack-knifing* is a solution
resampled objects are independent
[Figure: profiles versus wavelength (250 to 500) with segments labeled 1st, 2nd, 3rd and 5th sample]
Preliminary model
Detection of outliers: loadings
RIP plot (resampled influence plot)
[Figure: RIP plots of $\sum_{f=1}^{F}\sum_{j=1}^{J}\left(b_{jf,m} - b_{jf}\right)^2$ against the sum of squared residuals, points labeled by sample number 1 to 27]
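The quantities in a RIP plot can be sketched with leave-one-out jackknifing. To keep the example short, a PCA loading matrix stands in for the PARAFAC loadings B, which is a simplification and not the method on the slides. The names `loadings`, `align_signs` and `jackknife_rip` are hypothetical, and only sign ambiguity is handled (component order is assumed stable).

```python
import numpy as np

def loadings(X, F):
    """First F right singular vectors, used here as a stand-in for loadings B."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:F].T

def align_signs(B_m, B):
    """Flip the sign of each column of B_m to match B."""
    signs = np.sign(np.sum(B_m * B, axis=0))
    signs[signs == 0] = 1.0
    return B_m * signs

def jackknife_rip(X, F):
    B = loadings(X, F)
    n = X.shape[0]
    load_diff = np.empty(n)
    resid = np.empty(n)
    for m in range(n):
        B_m = align_signs(loadings(np.delete(X, m, axis=0), F), B)
        # Sum over f and j of (b_jf,m - b_jf)^2
        load_diff[m] = np.sum((B_m - B) ** 2)
        # Squared residual of the left-out sample under the model built without it
        r = X[m] - X[m] @ B_m @ B_m.T
        resid[m] = np.sum(r ** 2)
    return load_diff, resid

# 27 synthetic samples with one deviating sample (made up for the example)
rng = np.random.default_rng(0)
w = np.linspace(250, 500, 60)
p = np.exp(-0.5 * ((w - 350) / 30) ** 2)
X = np.outer(rng.uniform(1, 2, 27), p) + 0.01 * rng.standard_normal((27, 60))
X[5] = 3 * np.exp(-0.5 * ((w - 430) / 20) ** 2)   # outlier with a different spectrum
load_diff, resid = jackknife_rip(X, 1)
```

Plotting `load_diff` against `resid` gives one point per left-out sample; samples far from the rest in either direction are candidates for removal.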
Preliminary model
Detection of outliers: scores
IMP - Identity Match Plot
[Figure: identity match plot for the mth predicted sample, points labeled by sample number 1 to 27]
Final model
Removing low excitation and samples # 2, 3, 5, 10
[Figure: emission spectral profiles versus wavelength (250 to 500), four panels]
No more outliers and now uncertainty and standard
So …
validation
approach to assess model stability/ robustness
cases with little statistical background information
Simpler than PCA (but takes more time):
many iterations, local minima etc.)
These are the main ones. Always look at the model to validate it. Use core consistency but carefully. Use split-half for definitive validation
many iterations, local minima etc.)
This is how it is normally presented, but in reality it is only that simple for ‘trivial’ problems.
the model,
[Figure: three-way array X of 10 breads × 11 attributes × 8 judges, built from one 10 × 11 slab per judge (Judge 1, Judge 2, ..., Judge 8)]
Two-three components reasonable
practice
PARAFAC
Number of components | Fit | Cross-val
1 | 35.3 | 14.5
2 | 49.2 | 26.2
3 | 57.4 | 32.9
4 | 62.7 | 34.4
5 | 67.2 | 33.0
Split-half analysis for PARAFAC
[Figure: original data, split 1 and split 2]
Multi-way analysis in fluorescence data, Report 2002, Giorgia Servente & Jordi Riu
[Figure: original data, split 1 and split 2]
Pick the best one
Resolving two-way exponentials
[Figure: decay curves versus time (ms), 100 to 1000]
I(t) = M0 · exp(-t/T2)
Two-way data but with very special structure in loadings
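$X = AB' = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} 27 & 9 & 3 & 1 \\ 16 & 8 & 4 & 2 \end{pmatrix}$ (row 1: Sample 1, row 2: Sample 2)
SLAB 1 (X1): $X_1 = AD_1B' = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} 3 & 0 \\ 0 & 2 \end{pmatrix}\begin{pmatrix} 9 & 3 & 1 \\ 8 & 4 & 2 \end{pmatrix} = \begin{pmatrix} 27 & 9 & 3 \\ 16 & 8 & 4 \end{pmatrix}$
SLAB 2 (X2): $X_2 = AD_2B' = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} 9 & 3 & 1 \\ 8 & 4 & 2 \end{pmatrix} = \begin{pmatrix} 9 & 3 & 1 \\ 8 & 4 & 2 \end{pmatrix}$
SLICING - Pseudo three-way data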
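The numbers in this small example can be checked directly. The sketch below builds the two lagged slabs from the two-way matrix and confirms that both share the same A and B and differ only in D_k. The helper name `slice_rows` is an assumption made for the illustration.

```python
import numpy as np

def slice_rows(X, n_slabs):
    """Build lagged slabs: slab k holds columns k .. k+L-1 of X (hypothetical helper)."""
    L = X.shape[1] - n_slabs + 1
    return np.stack([X[:, k:k + L] for k in range(n_slabs)])

# Two samples, each a single exponential (decay ratios 3 and 2)
X = np.array([[27.0, 9.0, 3.0, 1.0],
              [16.0, 8.0, 4.0, 2.0]])
slabs = slice_rows(X, 2)

A = np.eye(2)
Bt = np.array([[9.0, 3.0, 1.0],
               [8.0, 4.0, 2.0]])
D1 = np.diag([3.0, 2.0])
D2 = np.diag([1.0, 1.0])

slab1_ok = np.allclose(slabs[0], A @ D1 @ Bt)
slab2_ok = np.allclose(slabs[1], A @ D2 @ Bt)
```

Shifting an exponential by one lag only multiplies it by a constant, so the lag mode behaves like a third mode with trilinear structure.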
Slicing principle
Example: Sensory quality
Predicting cooked potato quality from NMR on raw data
assessors (10) evaluate texture profile, e.g. mealiness
pulse sequence)
Technology, 33 (2):103-111, 2000. = Lagging PARAFAC
[Figure: raw data, intensity]
[Figure: predicted versus reference values for mealy; r = 0.858, RMSEP = 0.79]
directly from NMR of raw potatoes)
[Figure: profiles from the PARAFAC decomposition (a.u.)]
[Figure: lamp (UV-vis), excitation monochromator, sample, emission monochromator, detector; intensity as a function of excitation and emission]
Fluorescence spectroscopy
Excitation-emission matrix – a chemical fingerprint
Practical example
X = TP’
Sugar processing
Sugar made from beets
8’th hour during the three months of operation
spectrofluorometrically
[Figure: sugar process flow with the labels thin juice, thick juice, standard liquor, sugar boiling, massecuite, centrifuge, sugar, syrup, wash syrup, wash water, wash juice, water and "A little"; data blocks X1 to X8]
[Figure: color and CaO (a.u.) versus days]
Correlation to process- & quality
8h sample (14 weeks total)
268 sugar fluorescence landscapes, PARAFAC model
4 components
[Figure: four estimated emission spectra versus wavelength (300 to 500 nm), with the labels "Tyrosine" and "Tryptophan"; scores for components 1 to 4 versus 8h shift (week 1-14); correlation values (r) shown on the slide, including for ash: 0.60, 0.69, 0.43, 0.64, 0.81]
Correlation to quality parameters ash and color
Blue – reference, Red – from fluorescence
PARAFAC + fluorescence
Handling scatter
[Figure: excitation versus emission landscape with Raman scatter marked]
[Figure: fluorescence versus emission, raw data and interpolated values]
[Figure: fluorescence versus emission: data, shape-preserving polynomial, cubic polynomial]
[Figure: intensity versus excitation and emission for the raw EEM, with scatter removed, and with scatter interpolated]
Makes modeling faster, simpler, easier!
Function available at www.models.life.ku.dk
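The idea of removing scatter and interpolating across the gap can be sketched as follows. This is not the function referred to above. It is a hypothetical `interpolate_rayleigh` helper that handles only first-order Rayleigh scatter (emission close to excitation), with an assumed band width of 15 nm, and uses SciPy's shape-preserving `PchipInterpolator` along the emission axis.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

def interpolate_rayleigh(eem, em, ex, width=15.0):
    """eem has shape (n_em, n_ex). Replace points with |em - ex| <= width by
    shape-preserving interpolation along emission (hypothetical helper)."""
    out = eem.copy()
    for j, ex_wl in enumerate(ex):
        bad = np.abs(em - ex_wl) <= width
        good = ~bad
        if bad.any() and good.sum() >= 2:
            pchip = PchipInterpolator(em[good], eem[good, j])
            out[bad, j] = pchip(em[bad])
    return out

# Synthetic one-fluorophore EEM plus a scatter ridge (made up for the example)
em = np.arange(230.0, 452.0, 2.0)
ex = np.arange(280.0, 322.0, 4.0)
clean = np.outer(np.exp(-0.5 * ((em - 350) / 30) ** 2),
                 np.exp(-0.5 * ((ex - 300) / 15) ** 2))
scatter = 5.0 * np.exp(-((em[:, None] - ex[None, :]) / 4.0) ** 2)
raw = clean + scatter
fixed = interpolate_rayleigh(raw, em, ex)
err_raw = np.abs(raw - clean).max()
err_fixed = np.abs(fixed - clean).max()
```

Interpolated values keep the array complete, so the subsequent PARAFAC fit does not need to treat the scatter region as missing.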
PARAFAC: $X_k = A D_k B^T$
PARAFAC2: $X_k = A D_k B_k^T$, subject to $B_k^T B_k$ constant
*Actually it is more general than shifts but it’s a feasible approximation to what PARAFAC2 can handle
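A small numerical check of the constraint follows, as a sketch on made-up Gaussian elution profiles. When all profiles in a sample shift together, the cross-product B_k'B_k stays the same from sample to sample. When one profile moves relative to the other, it changes. `np.roll` is used here as a stand-in for a retention time shift, which is reasonable only because the peaks are far from the edges of the window.

```python
import numpy as np

t = np.arange(60)

def peak(center, width):
    return np.exp(-0.5 * ((t - center) / width) ** 2)

# Two overlapping elution profiles (columns of B_0)
B0 = np.column_stack([peak(25, 3.0), peak(32, 4.0)])

# Each sample k has its own shifted copy B_k
shifts = [0, 3, -2, 5]
Bk = [np.roll(B0, s, axis=0) for s in shifts]
cross = [B.T @ B for B in Bk]

# Slabs following X_k = A D_k B_k'
rng = np.random.default_rng(0)
A = rng.random((7, 2))
D = rng.uniform(0.5, 2.0, (len(shifts), 2))
Xk = [A @ np.diag(D[k]) @ Bk[k].T for k in range(len(shifts))]

# Counter-example: only the second peak moves, so the overlap changes
B_rel = np.column_stack([peak(25, 3.0), peak(40, 4.0)])
cross_rel = B_rel.T @ B_rel

same_cross = all(np.allclose(c, cross[0]) for c in cross)
rel_changes = not np.allclose(cross_rel, cross[0])
```

This matches the footnote: a common shift is one case the constraint allows, while the constraint itself is stated on B_k'B_k and is therefore more general.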
PARAFAC2 for shifted data
[Figure: elution profiles and loadings, without shifts and with shifts]
chromatographic system
spectrofluorometrically
[Figure: emission by sample-elution data for samples 1, 2, ..., K, decomposed into emission spectra, excitation spectra, and elution profiles for sample 1, sample 2, ..., sample K]
Compare PARAFAC(1) and PARAFAC2
Widely different spectra
If PARAFAC1 and PARAFAC2 agree, then PARAFAC1 is the choice (simpler); else PARAFAC1 may be too simple
[Figure: emission (Em./nm) and excitation (Exc./nm) loadings, four-way PARAFAC1 versus four-way PARAFAC2]
Compare profiles with three-way PARAFAC1 profiles in which shifts do not affect
[Figure: elution profiles versus elution time for samples 1-5, 6-10 and 11-15: reference versus four-way PARAFAC2]