
Searching for Periodic Gene Expression Patterns Using Lomb-Scargle Periodograms

http://research.stowers-institute.org/efg/2004/CAMDA

Critical Assessment of Microarray Data Analysis Conference

November 11, 2004

Earl F. Glynn
Stowers Institute

Arcady R. Mushegian
Stowers Institute & Univ. of Kansas Medical Center

Jie Chen
Stowers Institute & Univ. of Missouri Kansas City


Searching for Periodic Gene Expression Patterns Using Lomb-Scargle Periodograms

  • Periodic Patterns in Biology
  • Introduction to Lomb-Scargle Periodogram
  • Data Pipeline
  • Application to Bozdech’s Plasmodium dataset
  • Conclusions

Periodic Patterns in Biology

Photograph taken at Reptile Gardens, Rapid City, SD www.reptile-gardens.com

A vertebrate’s body plan: a segmented pattern.

Segmentation is established during somitogenesis.


Periodic Patterns in Biology

From Bozdech, et al, Fig. 1A, PLoS Biology, Vol 1, No 1, Oct 2003, p 3.

Intraerythrocytic Developmental Cycle of Plasmodium falciparum

Expression Ratio = (RNA from parasitized red blood cells) / (RNA from all development cycles) = Cy5 / Cy3

Values for Log2(Expression Ratio) are approximately normally distributed. Assume gene expression reflects observed biological periodicity.


Simple Periodic Gene Expression Model

[Figure: Expression vs. Time, alternating between “On” and “Off” phases, with the period (T) marked between successive “On” phases.]

frequency = 1 / period: f = 1/T
ω = angular frequency = 2πf
Gene Expression = Constant × Cosine(2πf t)

“Periodic” if only observed over a single cycle?
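The cosine model above can be written out as a minimal sketch. The function name, the hourly sampling grid, and the 48-hour period are assumptions chosen for illustration, not values taken from this slide.

```python
import math


def cosine_expression(t_hours, period_hours, amplitude=1.0):
    """Gene Expression = Constant x Cosine(2*pi*f*t), with f = 1/T."""
    f = 1.0 / period_hours        # frequency [1/hour]
    omega = 2.0 * math.pi * f     # angular frequency [radians/hour]
    return amplitude * math.cos(omega * t_hours)


# Hypothetical sampling grid: one sample per hour for 48 hours.
times = list(range(1, 49))
profile = [cosine_expression(t, period_hours=48.0) for t in times]
```

With a 48-hour period and a 48-hour observation window, `profile` covers exactly one cycle, which is the borderline case the slide's question refers to.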


Introduction to Lomb-Scargle Periodogram

  • What is a Periodogram?
  • Why Lomb-Scargle Instead of Fourier?
  • Example Using Cosine Expression Model
  • Mathematical Details
  • Mathematical Experiments
  • Single Dominant Frequency
  • Multiple Frequencies
  • Mixtures: Signal and Noise

What is a Periodogram?

  • A graph showing frequency “power” for a spectrum of frequencies
  • “Peak” in periodogram indicates a frequency with significant periodicity

[Diagram: Periodic Signal (Log2(Expression) vs. Time) → Computation → Periodogram (Spectral “Power” vs. Frequency)]


Why Lomb-Scargle Instead of Fourier?

  • Missing data handled naturally
  • No data imputation needed
  • Any number of points can be used
  • No need for 2^N data points like with FFT
  • Lomb-Scargle periodogram has known statistical properties

Note: The Lomb-Scargle algorithm is NOT equivalent to the conventional periodogram analysis based on Fourier analysis.
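To illustrate the first two points, here is a minimal pure-Python sketch of the normalized Lomb-Scargle periodogram applied to a profile with missing time points. The test profile (a 48-hour cosine with every sixth hourly sample dropped) and the trial frequency grid are assumptions made for this example; this is not the authors' code.

```python
import math


def lomb_scargle(times, values, freqs):
    """Normalized Lomb-Scargle power at each trial frequency (cycles/hour)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    powers = []
    for f in freqs:
        w = 2.0 * math.pi * f
        # The offset tau makes the power independent of the time origin.
        s2 = sum(math.sin(2.0 * w * t) for t in times)
        c2 = sum(math.cos(2.0 * w * t) for t in times)
        tau = math.atan2(s2, c2) / (2.0 * w)
        cc = ss = yc = ys = 0.0
        for t, v in zip(times, values):
            c = math.cos(w * (t - tau))
            s = math.sin(w * (t - tau))
            cc += c * c
            ss += s * s
            yc += (v - mean) * c
            ys += (v - mean) * s
        powers.append((yc * yc / cc + ys * ys / ss) / (2.0 * var))
    return powers


# Hypothetical profile: 48-hour cosine sampled hourly, every 6th point missing.
times = [t for t in range(1, 49) if t % 6 != 0]
values = [math.cos(2.0 * math.pi * t / 48.0) for t in times]

# Trial frequencies from 1/192 to 38/192 per hour (well below the Nyquist limit).
freqs = [k / 192.0 for k in range(1, 39)]
powers = lomb_scargle(times, values, freqs)
peak_freq = freqs[powers.index(max(powers))]
peak_period = 1.0 / peak_freq
```

No imputation or padding step appears anywhere: the sums simply run over whichever time points were observed. In this sketch the peak lands at the 48-hour period even though 8 of the 48 samples are absent.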


Lomb-Scargle Periodogram Example Using Cosine Expression Model

[Figure: three panels, evenly-spaced time points. Cosine Curve (N=48): Expression vs. Time [hours], with period T = 1/f. Lomb-Scargle Periodogram: Normalized Power Spectral Density vs. Frequency [1/hour], with significance levels p = 0.05, 0.01, 0.001, 1e-04, 1e-05, 1e-06; Period at Peak = 48 hours. Peak Significance: Probability vs. Frequency [1/hour]; p = 3.3e-009 at Peak.]

A small value for the false-alarm probability indicates a highly significant periodic signal.
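The false-alarm probability can be sketched as follows. The formula used here, P(> z) = 1 - (1 - e^(-z))^M for M independent frequencies, is the standard one for a normalized periodogram whose power is exponentially distributed with unit mean under the null hypothesis. The function name and the values of `z` and `m` in the usage lines are assumptions for illustration.

```python
import math


def false_alarm_probability(z, m):
    """P(> z) = 1 - (1 - exp(-z))**m for m independent frequencies."""
    tail = math.exp(-z)  # chance that a single frequency exceeds power z
    if tail >= 1.0:
        return 1.0
    # log1p/expm1 keep precision when the resulting probability is tiny.
    return -math.expm1(m * math.log1p(-tail))


# Hypothetical peak heights, with an assumed 100 independent frequencies.
p_weak = false_alarm_probability(5.0, 100)
p_strong = false_alarm_probability(20.0, 100)
```

A higher peak gives a smaller probability, so `p_strong` is far below `p_weak`. The result depends on the estimate of `m`, which has to be supplied by the analyst.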


Lomb-Scargle Periodogram Example Using Noisy Cosine Expression Model

[Figure: four panels, unevenly-spaced time points. Cosine Curve + Noise (N=48): Expression vs. Time [hours]. Time Interval Variability: Frequency vs. log10(delta T). Lomb-Scargle Periodogram: Normalized Power Spectral Density vs. Frequency [1/hour], with significance levels p = 0.05, 0.01, 0.001, 1e-04, 1e-05, 1e-06; Period at Peak = 45.7 hours. Peak Significance: Probability vs. Frequency [1/hour]; p = 2.54e-007 at Peak.]


Lomb-Scargle Periodogram Example Using Noise

[Figure: four panels. Noise (N=48): Expression vs. Time [hours]. Time Interval Variability: Frequency vs. log10(delta T). Lomb-Scargle Periodogram: Normalized Power Spectral Density vs. Frequency [1/hour], with significance levels p = 0.05, 0.01, 0.001, 1e-04, 1e-05, 1e-06; Period at Peak = 7.4 hours. Peak Significance: Probability vs. Frequency [1/hour]; p = 0.973 at Peak.]


Lomb-Scargle Periodogram Mathematical Details

Source: Numerical Recipes in C (2nd Ed), p. 577

P_N(ω) has an exponential probability distribution with unit mean.
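For reference, the standard form of the normalized periodogram, as given in the cited Numerical Recipes section, is sketched below. The symbols h_j (measured values), t_j (sample times), N (number of points), and M (number of independent frequencies) are the notation assumed here.

```latex
\bar{h} = \frac{1}{N}\sum_{j=1}^{N} h_j ,
\qquad
\sigma^2 = \frac{1}{N-1}\sum_{j=1}^{N} \left(h_j - \bar{h}\right)^2

P_N(\omega) = \frac{1}{2\sigma^2}
\left\{
\frac{\left[\sum_j (h_j - \bar{h}) \cos \omega (t_j - \tau)\right]^2}
     {\sum_j \cos^2 \omega (t_j - \tau)}
+
\frac{\left[\sum_j (h_j - \bar{h}) \sin \omega (t_j - \tau)\right]^2}
     {\sum_j \sin^2 \omega (t_j - \tau)}
\right\}

\tan(2\omega\tau) = \frac{\sum_j \sin 2\omega t_j}{\sum_j \cos 2\omega t_j}

P(> z) = 1 - \left(1 - e^{-z}\right)^{M}
```

The offset τ makes P_N(ω) independent of a shift in the time origin, and the last line is the false-alarm probability that follows from the exponential distribution with unit mean.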


Mathematical Experiment: Single Dominant Frequency

[Figure: three panels. Cosine Curve (N=48): Expression vs. Time [hours]. Lomb-Scargle Periodogram: Normalized Power Spectral Density vs. Frequency [1/hour], with significance levels p = 0.05, 0.01, 0.001, 1e-04, 1e-05, 1e-06; Period at Peak = 24 hours. Peak Significance: Probability vs. Frequency [1/hour]; p = 3.3e-009 at Peak.]

Expression = Cosine(2πt/24)

Single “peak” in periodogram. Single “valley” in significance curve.


Mathematical Experiment: Multiple Frequencies

[Figure: three panels. Sum of 3 Cosines (N=48): Expression vs. Time [hours]. Lomb-Scargle Periodogram: Normalized Power Spectral Density vs. Frequency [1/hour], with significance levels p = 0.05, 0.01, 0.001, 1e-04, 1e-05, 1e-06; Period at Peak = 21.8 hours. Peak Significance: Probability vs. Frequency [1/hour]; p = 0.00246 at Peak.]

Expression = Cosine(2πt/48) + Cosine(2πt/24) + Cosine(2πt/8)

Multiple peaks in periodogram. Corresponding valleys in significance curve.


Mathematical Experiment: Multiple Frequencies

[Figure: three panels. Sum of 3 Cosines (N=48): Expression vs. Time [hours]. Lomb-Scargle Periodogram: Normalized Power Spectral Density vs. Frequency [1/hour], with significance levels p = 0.05, 0.01, 0.001, 1e-04, 1e-05, 1e-06; Period at Peak = 48 hours. Peak Significance: Probability vs. Frequency [1/hour]; p = 2.37e-007 at Peak.]

Expression = 3*Cosine(2πt/48) + Cosine(2πt/24) + Cosine(2πt/8)

“Weaker” periodicities cannot always be resolved statistically.
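One rough way to see why the two weaker components are hard to resolve is to compare how much of the signal's variance each cosine contributes. The sketch below assumes hourly sampling at t = 1..48; it is an intuition aid only and does not reproduce the periodogram on the slide.

```python
import math

# Amplitude of each cosine component, keyed by period in hours (from the slide).
amplitudes = {48: 3.0, 24: 1.0, 8: 1.0}

times = range(1, 49)  # assumed hourly sampling, N = 48
signal = [
    sum(a * math.cos(2.0 * math.pi * t / period) for period, a in amplitudes.items())
    for t in times
]

mean = sum(signal) / len(signal)
total_variance = sum((x - mean) ** 2 for x in signal) / len(signal)

# A cosine of amplitude a contributes a**2 / 2 to the variance.
variance_share = {
    period: (a * a / 2.0) / total_variance for period, a in amplitudes.items()
}
```

On this grid the 48-hour component accounts for 9/11 of the variance, and the 24-hour and 8-hour components for 1/11 each.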


Mathematical Experiment: Multiple Frequencies: “Duty Cycle”

[Figure: two rows of three panels each (Expression vs. Time [hours], N = 48; Lomb-Scargle Periodogram, Normalized Power Spectral Density vs. Frequency [1/hour], with significance levels p = 0.05 to p = 1e-06; Peak Significance, Probability vs. Frequency [1/hour]).
Duty cycle 1/2 (50%): Period at Peak = 24 hours; p = 2.54e-007 at Peak.
Duty cycle 2/3 (66.6%, e.g., human sleep cycle): Period at Peak = 24 hours; p = 5.06e-006 at Peak.]

One peak with symmetric “duty cycle”. Multiple peaks with asymmetric cycle.
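The difference between the two duty cycles can be illustrated with the Fourier series of an idealized on/off pulse train. The sketch assumes expression switches between 0 and 1, which is a simplification of the plotted profiles; the function name is hypothetical.

```python
import math


def harmonic_amplitude(k, duty):
    """Magnitude of the k-th Fourier coefficient of a 0/1 pulse train.

    `duty` is the fraction of each period spent "On".
    """
    return abs(math.sin(math.pi * k * duty)) / (math.pi * k)


for duty in (1.0 / 2.0, 2.0 / 3.0):
    relative = [
        harmonic_amplitude(k, duty) / harmonic_amplitude(1, duty)
        for k in (1, 2, 3)
    ]
    print(f"duty cycle {duty:.3f}: harmonics 1-3 relative to fundamental:",
          [round(r, 3) for r in relative])
```

In this idealized model the second harmonic vanishes for a duty cycle of 1/2 but is half as large as the fundamental for a duty cycle of 2/3, which is one way an asymmetric cycle can produce an additional peak. The 1/2 case still carries a smaller third harmonic, at one third of the fundamental amplitude.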


Mathematical Experiment: Mixtures: Periodic Signal Vs. Noise

[Figure: three 'p' histograms for 5000 simulated expression profiles (N = 48). Each shows Frequency vs. log10(p), where p corresponds to the maximum periodogram power spectral density. Panels: 100% simulated periodic genes; 50% periodic and 50% noise; 100% noise (0% simulated periodic genes).]

Mathematical Experiment: Mixtures: Periodic Signal Vs. Noise

50% periodic, 50% noise

[Figure: Multiple Testing Correction Methods. Log10(p) vs. Rank Order of Sorted p Values for bonferroni, holm, hochberg, fdr, and none; 50% simulated periodic genes.]

Multiple-Hypothesis Testing methods, ordered from more false negatives to more false positives: Bonferroni, Holm, Hochberg, Benjamini & Hochberg FDR, None.
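The four correction methods can be sketched in pure Python as below. The adjustment rules follow the standard definitions used by R's p.adjust (the tool these slides cite for their plots); the function name `p_adjust`, the helper `count_significant`, and the example p-values are assumptions for illustration.

```python
def p_adjust(pvals, method):
    """Adjusted p-values for 'bonferroni', 'holm', 'hochberg', 'fdr' or 'none'."""
    n = len(pvals)
    if method == "none":
        return list(pvals)
    if method == "bonferroni":
        return [min(1.0, n * p) for p in pvals]
    order = sorted(range(n), key=lambda i: pvals[i])  # indices, smallest p first
    adjusted = [0.0] * n
    if method == "holm":
        running = 0.0
        for rank, idx in enumerate(order):
            # Step-down: factor n, n-1, ..., 1 with a running maximum.
            running = max(running, (n - rank) * pvals[idx])
            adjusted[idx] = min(1.0, running)
        return adjusted
    if method in ("hochberg", "fdr"):
        running = float("inf")
        for rank in range(n - 1, -1, -1):
            idx = order[rank]
            # Step-up: start at the largest p and keep a running minimum.
            factor = (n - rank) if method == "hochberg" else n / (rank + 1)
            running = min(running, factor * pvals[idx])
            adjusted[idx] = min(1.0, running)
        return adjusted
    raise ValueError(f"unknown method: {method}")


def count_significant(pvals, method, alpha):
    """Number of profiles whose adjusted p-value is at or below alpha."""
    return sum(1 for p in p_adjust(pvals, method) if p <= alpha)


# Hypothetical p-values for five expression profiles.
example = [0.01, 0.02, 0.03, 0.04, 0.05]
counts = {m: count_significant(example, m, 0.05)
          for m in ("bonferroni", "holm", "hochberg", "fdr", "none")}
```

For any set of p-values, the number called significant does not decrease when moving from Bonferroni through Holm, Hochberg, and FDR to no correction, which is the ordering from more false negatives to more false positives described above.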


Data Pipeline to Apply to Bozdech’s Data

  1. Apply quality control checks to data
  2. Apply Lomb-Scargle algorithm to all expression profiles
  3. Apply multiple hypothesis testing to define “significant” genes
  4. Analyze biological significance of significant genes

Bozdech’s Plasmodium dataset: 1. Apply Quality Control Checks

Global views of experiment. Remove certain outliers.

Bozdech’s Plasmodium dataset: 1. Apply Quality Control Checks

Many missing data points require imputation for Fourier analysis.

Bozdech’s Plasmodium dataset: 2. Apply Lomb-Scargle Algorithm

[Figure: four panels. Mean Expression Profile (N = 46): Expression vs. Time [hours]. Time Interval Variability: Frequency vs. log10(delta T). Lomb-Scargle Periodogram: Normalized Power Spectral Density vs. Frequency [1/hour], with significance levels p = 0.05, 0.01, 0.001, 1e-04, 1e-05, 1e-06; Period at Peak = 27.4 hours. Peak Significance: Probability vs. Frequency [1/hour]; p = 0.0581 at Peak.]

A weak diurnal period is visible in “mean” data profile.

Bozdech’s Plasmodium dataset: 2. Apply Lomb-Scargle Algorithm

Periodic Expression Patterns

[Figure: expression profile, Time Interval Variability, Lomb-Scargle Periodogram, and Peak Significance panels for two probes.
i3518_1 (N = 46): Period at Peak = 45.7 hours; p = 1.48e-008 at Peak.
opfi17638 (N = 46): Period at Peak = 45.7 hours; p = 1.19e-008 at Peak.]

Examples of highly-significant periodic expression profiles.

Bozdech’s Plasmodium dataset: 2. Apply Lomb-Scargle Algorithm

Aperiodic/Noise Expression Patterns

[Figure: expression profile, Time Interval Variability, Lomb-Scargle Periodogram, and Peak Significance panels for two probes.
j167_5 (N = 35): Period at Peak = 17.8 hours; p = 0.998 at Peak.
f35105_2 (N = 45): Period at Peak = 32 hours; p = 0.516 at Peak.]

Bozdech’s Plasmodium dataset: 2. Apply Lomb-Scargle Algorithm

Small “N”

[Figure: expression profile, Time Interval Variability, Lomb-Scargle Periodogram, and Peak Significance panels for two probes.
f58149_1 (N = 39): Period at Peak = 48 hours; p = 8.54e-006 at Peak.
n170_1 (N = 32): Period at Peak = 64 hours; p = 2.74e-005 at Peak.]

Bozdech’s Plasmodium dataset: 2. Apply Lomb-Scargle Algorithm

Signal and Noise Mixture

[Figure: 'p' histogram for the complete Bozdech set of 6875 probes. Number of Probes vs. log10(p), with regions marked “Periodic Probes” and “Aperiodic Probes or Noise”.]

Bozdech’s Plasmodium dataset: 3. Apply Multiple-Hypothesis Testing

[Figure: Multiple Testing Correction Methods (using R's p.adjust methods). Log10(p) vs. Rank Order of Sorted p Values for bonferroni, holm, hochberg, fdr, and none, with the significance level α = 1E-4 marked.]

Methods, ordered from more false negatives to more false positives: Bonferroni, Holm, Hochberg, Benjamini & Hochberg FDR, None.

Bozdech’s Plasmodium dataset: 3. Apply Multiple-Hypothesis Testing

Number of significant probes by p adjustment method, at α significance levels 0.00001, 0.0001, 0.001, 0.01, 0.05:

None:                       3823, 4456, 4961, 5351, 5648
Benjamini & Hochberg FDR:   3584, 4358, 4906, 5315, 5618
Hochberg:                   15, 1723, 3359, 4009
Holm:                       13, 1705, 3351, 3995
Bonferroni:                 13, 1461, 3050, 3707

A priori plan: Use Benjamini & Hochberg FDR level of 0.0001.

Observed number of periodic probes consistent with biological observation of ~60% of Plasmodium genome being transcriptionally active during the intraerythrocytic developmental cycle.

Bozdech’s Plasmodium dataset: 4. Analyze Biological Significance

Lomb-Scargle: 4358 Probes, α = 1E-4 significance
Comparison with Bozdech’s Results

Dataset: Bozdech Complete

N (time series points)                        Probes   Lomb-Scargle Periodic   %
32 .. 42                                      1795     243                     13.5%
43 .. 46 (Bozdech Quality Control Dataset)    5080     4115                    81.0%
Total                                         6875     4358                    63.4%

While Lomb-Scargle identified 243 new low “N” periodic probes, the low percentage in that group may indicate some other problem.
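The percentages in this comparison can be rechecked with a few lines of arithmetic. The grouping of the numbers into rows (probe totals and periodic counts per N range) is this example's reading of the slide's table; the variable names are hypothetical.

```python
# (N range, total probes, Lomb-Scargle periodic probes), as read from the slide.
rows = [
    ("N = 32..42", 1795, 243),
    ("N = 43..46 (Quality Control)", 5080, 4115),
]

total_probes = sum(total for _, total, _ in rows)
total_periodic = sum(periodic for _, _, periodic in rows)


def percent(part, whole):
    """Percentage rounded to one decimal place, as on the slide."""
    return round(100.0 * part / whole, 1)


summary = {label: percent(periodic, total) for label, total, periodic in rows}
summary["Total"] = percent(total_periodic, total_probes)
```

The row sums reproduce the slide's totals of 6875 probes and 4358 periodic probes, and the low-N group is the one with the markedly lower periodic fraction.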

Bozdech’s Plasmodium dataset: 4. Analyze Biological Significance

Lomb-Scargle: 4358 Probes, α = 1E-4 significance
Comparison with Bozdech’s Results

Dataset: Bozdech Quality Control

Tag      Probes   Lomb-Scargle Periodic   %
Tag=0    914      30                      3.3%
Tag=1    4157     4078                    98.1%
Total    5071     4108                    81.0%

Assume Tag=0 set should be ignored since “flagged”. Assume Tag=1 are “array features that were unflagged” and met certain intensity requirements. The 79 “lost” periodic probes would be retained with a slightly different cutoff.

Bozdech’s Plasmodium dataset: 4. Analyze Biological Significance

Lomb-Scargle: 4358 Probes, α = 1E-4 significance
Comparison with Bozdech’s Results

Unclear how to apply Bozdech’s ad hoc “Overview” criteria for use with Lomb-Scargle method: “70% power in max frequency with top 75% of max frequency magnitude.”

The best 3711 Lomb-Scargle “p” values contained 3449 (92.9%) of the Overview probes.

Dataset             Probes   Lomb-Scargle Periodic
Bozdech Overview    3711     3611

Bozdech’s Plasmodium dataset: 4. Analyze Biological Significance

[Figure: two “Phaseograms”, each showing Probes Ordered by Phase vs. Time. One for the Lomb-Scargle Results (4358 probes) and one for the Bozdech “Overview” Dataset (2714 genes, 3395 probes).]
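Ordering probes by phase can be sketched as follows. The phase estimate here is a simple projection onto a cosine and sine at a known period, which is exact only for evenly spaced samples covering whole cycles; it is an assumption of this example and not necessarily how the phaseograms were produced. The probe names and phase shifts are hypothetical.

```python
import math


def estimate_phase(times, values, period_hours):
    """Phase phi (radians) of values ~ cos(omega*t - phi), by projection."""
    omega = 2.0 * math.pi / period_hours
    c = sum(v * math.cos(omega * t) for t, v in zip(times, values))
    s = sum(v * math.sin(omega * t) for t, v in zip(times, values))
    return math.atan2(s, c)


# Hypothetical probes: 48-hour cosines with known phase shifts, sampled hourly.
times = list(range(48))
true_phase = {"probe_a": 2.0, "probe_b": 0.5, "probe_c": -1.0}
profiles = {
    name: [math.cos(2.0 * math.pi * t / 48.0 - phi) for t in times]
    for name, phi in true_phase.items()
}

phases = {name: estimate_phase(times, y, 48.0) for name, y in profiles.items()}
ordered = sorted(phases, key=phases.get)  # row order for a phaseogram
```

Plotting the profiles as rows in the order given by `ordered` places probes that peak earlier above probes that peak later. With missing or unevenly spaced time points, a least-squares fit would be a more appropriate phase estimate than this projection.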

Bozdech’s Plasmodium dataset: 4. Analyze Biological Significance

Lomb-Scargle: 4358 Probes, α = 1E-4 significance

[Figure: Periodogram Map. Probes Ordered by Peak Frequency, plotted against Frequency (Period).]

  • Shows periodograms, not expression profiles
  • Shows frequency space, not time
  • Dominant frequency band corresponds to 48-hr period
  • Are “weak” bands indicative of complex expression, perhaps a diurnal component, or an asymmetric “duty cycle”?


Summary

Fourier Method                                              Lomb-Scargle Method
Weights frequency intervals                                 Weights data points
Requires uniform spacing                                    No special requirement
Missing data imputed                                        No special processing
2^N points for FFT; 0 padding                               No special requirement
Permutation tests needed to assess statistical properties   Known statistical properties
Ad hoc scoring rules                                        Use “p” values
Usually only look at “independent” Fourier frequencies      Need estimate of number of “independent frequencies” but explore using continuum


Conclusions

  • Lomb-Scargle periodogram is an effective tool to identify periodic gene expression profiles
  • Results comparable with Fourier analysis
  • Lomb-Scargle can help when data are missing or not evenly spaced

We wanted to validate the Lomb-Scargle method before applying it to our somitogenesis problem, since the Fourier technique would be difficult to use.

Scargle (1982): “surprising result is that the … spectrum of a process can be estimated … [with] only the order of the samples ...”


Conclusions

  • Conclusions should not be drawn using the individual p-value calculated for each profile. A multiple comparison procedure, such as False Discovery Rate (FDR), must be used to control the error rate.
  • Expression profiles may be more complex than simple cosine curves
  • Power spectra of non-sinusoid rhythms are more difficult to interpret


Supplementary Information

http://research.stowers-institute.org/efg/2004/CAMDA


Acknowledgements

Stowers Institute for Medical Research
Pourquie Lab
Olivier Pourquie
Mary-Lee Dequeant