Large chromatographic data sets analysis on the example of metabolomic data



SLIDE 1

Parameters of experimental design Preprocessing Statistical analysis Correlation network analysis

Large chromatographic data sets analysis on the example of metabolomic data

Aneta Sawikowska (1, 2), Paweł Krajewski (2)

(1) Poznan University of Life Sciences, Poznan, Poland
(2) Institute of Plant Genetics, Polish Academy of Sciences, Poznan, Poland

02.12.2016

  • A. Sawikowska, P. Krajewski

Large chromatographic data analysis

SLIDE 2

Plan

1 Introduction
2 Parameters of experimental design
3 Preprocessing
4 Statistical analysis
5 Correlation network analysis
6 Conclusions

SLIDE 3

SLIDE 4

Peaks can be interpreted as intervals in which a metabolite or a group of metabolites with similar properties occurs.

SLIDE 5

v varieties, d drought treatments, p time points, r replications.
Total: about 12 million observations.

SLIDE 6

Parameters of experimental design: 9 varieties, 3 drought treatments (I, II, I+II) and a control, 8 time points, 4 biological replications.
Preprocessing: own scripts in the R system.
Statistical analysis: Genstat.
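To make the size of the design concrete, the factor levels listed above can be crossed explicitly. This is only a sketch: it assumes a full factorial layout with one chromatogram per combination, and the level labels (`variety_1`, `control`, and so on) are illustrative names, not taken from the slides.

```python
from itertools import product

# Factor levels as counted on the slide; the labels are hypothetical.
varieties = [f"variety_{k}" for k in range(1, 10)]   # 9 varieties
treatments = ["control", "I", "II", "I+II"]          # 3 drought treatments + control
time_points = list(range(1, 9))                      # 8 time points
replications = list(range(1, 5))                     # 4 biological replications

# Full crossing of all factors: one tuple per (assumed) chromatogram.
design = list(product(varieties, treatments, time_points, replications))
n_chromatograms = len(design)
print(n_chromatograms)  # 1152
```

Under that assumption the experiment comprises 9 x 4 x 8 x 4 = 1152 chromatograms.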

SLIDE 7
SLIDE 8

Preprocessing: Baseline removal (differentiation), Retention time alignment (COW), Peak detection

Different baseline for the same data

SLIDE 9

Different baseline for the same data

SLIDE 10

The baseline-estimation problem

A given vector $x = \{x_1, x_2, \dots, x_i\}$ of $i$ observed intensities can be modeled as the sum of an ideal spectrum $s$ and a background $b$, convolved with a blurring function $p$, with noise $n$ added to the result:

$x = (s + b) * p + n$,

with $*$ denoting convolution. The noise is often taken to be Gaussian or Poissonian. The problem is to recover $s$, hence

$s = (x - n) * p^{-1} - b$,

with $p^{-1}$ being the inverse of the blurring function. The difficulty is that knowledge of $p^{-1}$, $b$, and $n$ is often incomplete or totally lacking.
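The forward model can be illustrated with a small simulation. This is a minimal sketch, not the authors' code: the helper names `convolve_same` and `observe`, the synthetic peak, the linear background, the three-point blurring kernel and the Gaussian noise level are all assumptions chosen for illustration.

```python
import random

def convolve_same(signal, kernel):
    """Discrete convolution trimmed to len(signal); the kernel is centred."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i - (k - half)          # index into signal for kernel tap k
            if 0 <= j < len(signal):    # treat values outside the range as zero
                acc += signal[j] * w
        out.append(acc)
    return out

def observe(s, b, p, noise_sd=0.0, seed=0):
    """Simulate x = (s + b) * p + n with Gaussian noise n."""
    rng = random.Random(seed)
    mixed = [si + bi for si, bi in zip(s, b)]
    blurred = convolve_same(mixed, p)
    return [v + rng.gauss(0.0, noise_sd) for v in blurred]

s = [0, 0, 1, 5, 1, 0, 0, 0]                  # ideal signal: one narrow peak
b = [2.0 + 0.1 * i for i in range(len(s))]    # slowly rising background
p = [0.25, 0.5, 0.25]                         # simple symmetric blur
x = observe(s, b, p, noise_sd=0.05)
```

Recovering `s` from `x` alone requires estimates of `b`, `p` and the noise, which is exactly the information that is usually missing in practice.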

SLIDE 11

Baseline removal by differentiation

Baseline: a line that serves as the base for measurements. It causes problems when measuring peak area and can negatively affect all subsequent steps, so it needs to be removed.

Differentiation: a common approach to removing the baseline by calculating the vector $y_j = x_{j+1} - x_j$ for $j = 1, \dots, J-1$, where $x_j$ is the observation at the $j$-th retention time and $J$ is the number of retention time points.
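The first difference defined above can be sketched in a few lines. The function name `first_difference` and the toy intensities are assumptions for illustration; the slides state that the actual preprocessing was done with the authors' own R scripts.

```python
def first_difference(x):
    """Return y_j = x_{j+1} - x_j for j = 1, ..., J-1 (a list of length J-1)."""
    return [x[j + 1] - x[j] for j in range(len(x) - 1)]

peak = [0, 1, 4, 9, 4, 1, 0]          # toy peak without baseline
shifted = [v + 100 for v in peak]     # same peak on a constant baseline of 100

print(first_difference(peak))     # [1, 3, 5, -5, -3, -1]
print(first_difference(shifted))  # [1, 3, 5, -5, -3, -1]
```

A constant baseline cancels exactly. A linear drift with slope c does not vanish but is reduced to a constant offset c in the differenced signal.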

SLIDE 12

Baseline removal by differentiation

SLIDE 13
SLIDE 14

Retention time alignment (COW)

Why correlation optimised warping (COW)? Its advantages: it aligns profiles by matching shapes, and the aligned profiles are as similar as possible while peak shape and area are preserved (automated COW).

SLIDE 15

Reference chromatogram selection

Similarity index

For a given chromatogram $y_T$,

$\text{similarity index} = \prod_{i=1}^{I} |\rho(y_T, y_i)|$,

where $\rho$ is Pearson's correlation coefficient between $y_T$ and $y_i$, and $0 \le \text{similarity index} \le 1$.
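Reference selection with this index can be sketched as follows. The sketch assumes the index is the product of the absolute Pearson correlations over all chromatograms, which is consistent with the stated bounds 0 ≤ similarity index ≤ 1, and that the chromatogram with the largest index is taken as the reference. The function names and the toy data are illustrative.

```python
from math import prod, sqrt

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / sqrt(va * vb)

def similarity_index(target, chromatograms):
    """Product of |rho(target, y_i)| over all chromatograms y_i."""
    return prod(abs(pearson(target, y)) for y in chromatograms)

def pick_reference(chromatograms):
    """Index of the chromatogram with the highest similarity index."""
    scores = [similarity_index(y, chromatograms) for y in chromatograms]
    return max(range(len(chromatograms)), key=scores.__getitem__)

chromatograms = [
    [0, 1, 4, 9, 4, 1, 0],
    [0, 1, 5, 9, 5, 1, 0],
    [0, 2, 4, 8, 4, 2, 0],
    [9, 4, 1, 0, 0, 0, 1],   # deliberately dissimilar profile
]
reference = pick_reference(chromatograms)
```

Including the target itself in the product is harmless, since its correlation with itself is 1. In this toy set the dissimilar fourth profile is not chosen as the reference.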

SLIDE 16
SLIDE 17

Theory - COW on two chromatograms

$m$: segment length; $t$: slack size
$L_P + 1$, $L_T + 1$: the number of data points in profiles $P$ and $T$
$N = L_P / m$: the number of segments
$\Delta = L_T / N - m$: the difference in segment length between $P$ and $T$
$(\Delta - t;\ \Delta + t)$: the interval in which warpings are allowed
$x_i$: the position of the beginning of segment $i$ after warping
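The quantities above can be computed for a concrete pair of profiles. This sketch covers only the parameter bookkeeping, not the dynamic-programming search that COW performs. The function name is hypothetical, and it assumes the segment length divides the profile length exactly.

```python
def cow_segment_parameters(n_points_P, n_points_T, m, t):
    """Return (N, delta, allowed interval) for segment length m and slack t."""
    L_P = n_points_P - 1          # profiles have L + 1 data points
    L_T = n_points_T - 1
    if L_P % m != 0:
        # Simplifying assumption of this sketch: no leftover partial segment.
        raise ValueError("segment length m must divide L_P")
    N = L_P // m                  # number of segments
    delta = L_T / N - m           # difference in segment length between P and T
    return N, delta, (delta - t, delta + t)

# Profile P with 101 points, target T with 111 points, m = 20, t = 3.
N, delta, interval = cow_segment_parameters(101, 111, m=20, t=3)
print(N, delta, interval)  # 5 2.0 (-1.0, 5.0)
```

In this example each of the 5 segments of P may change its length by an amount between -1 and 5 points when it is warped onto T.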

SLIDE 18

COW on chromatograms for individual varieties

SLIDE 19

COW on chromatograms for individual varieties

SLIDE 20
SLIDE 21

Individual peaks - for individual chromatograms

SLIDE 22

Second difference and its smoothing

SLIDE 23

Common peaks - sum of all individual peaks

SLIDE 24

Common peaks - deconvolution problem

SLIDE 25

Common peaks - deconvolution problem

SLIDE 26

Common peaks - deconvolution problem

SLIDE 27
SLIDE 28

Statistical analysis - Mixed linear model

y: observation of a peak, i.e. the content of a metabolite in a sample

y = µ + Variety + Drought treatment + Variety × Drought treatment + e

1 Log transformation.
2 Analysis of variance by REML (all effects fixed).
3 Selection of significant peaks by tests based on the F approximation with Bonferroni correction.
4 The hierarchical group-average method (UPGMA), boxplots and correlation analysis based on significant peaks.
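The Bonferroni step in item 3 can be sketched as a simple threshold on per-peak p-values. The slides do not state the significance level, so `alpha = 0.05` is an assumption, and the p-values below are made-up illustrative numbers rather than results from the study. The model fitting itself was done in Genstat and is not reproduced here.

```python
def bonferroni_select(p_values, alpha=0.05):
    """Return indices of tests that remain significant after Bonferroni correction."""
    m = len(p_values)
    threshold = alpha / m        # each test is judged at level alpha / m
    return [i for i, p in enumerate(p_values) if p < threshold]

# Hypothetical p-values, one per detected peak.
p_values = [0.0001, 0.03, 0.004, 0.2, 0.0124]
significant = bonferroni_select(p_values)
print(significant)  # [0, 2]
```

With five tests the per-test threshold is 0.05 / 5 = 0.01, so only the first and third peaks are retained for the clustering and correlation analysis.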

SLIDE 29
SLIDE 30
SLIDE 31

Conclusions

A large set of chromatographic data was analysed within a reasonably short time on computing clusters at the Poznan Supercomputing and Networking Centre.

Statistical analysis was performed on a 3-factorial experiment with a large number of observations: 100 lines (the population of recombinant inbred lines derived from the cross between European and Syrian barley), about 100 metabolites, treatment and control, 2 time points, 3 biological replications; in total about 120 000 observations.

Correlation analysis was done.
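The quoted total follows directly from the listed factor sizes. A quick check, treating "about 100 metabolites" as exactly 100 for the purpose of the arithmetic:

```python
lines = 100          # recombinant inbred lines
metabolites = 100    # "about 100" on the slide, taken as 100 here
conditions = 2       # treatment and control
time_points = 2
replications = 3     # biological replications

total = lines * metabolites * conditions * time_points * replications
print(total)  # 120000
```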

SLIDE 32

Acknowledgements

Piasecka, A.; Sawikowska, A.; Kuczyńska, A.; Krystkowiak, K.; Mikołajczak, K.; Ogrodowicz, P.; Gudyś, K.; Guzy-Wróbelska, J.; Krajewski, P.; Kachlicki, P., Drought related secondary metabolites of barley (Hordeum vulgare L.) leaves and their mQTLs, The Plant Journal, doi: 10.1111/tpj.13430, accepted.

dr Anna Piasecka (1, 2)
prof. Piotr Kachlicki (1)

(1) IPG PAS, (2) IBC PAS