Quality of data and results in proteomic analysis (La qualité des données et des résultats en analyse protéomique)

Pierre-Alain Binz, Swiss Institute of Bioinformatics, Geneva, Switzerland. EMBNet course, 5 March 2004.


SLIDE 1

Quality of data and results in proteomic analysis

Pierre-Alain Binz

Swiss Institute of Bioinformatics, Geneva, Switzerland

EMBNet course, 5 March 2004

Here are my results: can I believe in them? Are they meaningful?

That is not the question. The question is: can others believe in them?

SLIDE 2

Why talk about quality in proteomics?

  • Proteomics was mainly technology development; it is now moving towards biological interpretation
  • Publications are difficult to reproduce
  • Reduce propagation of errors
  • Allow integration of information

Tasks and needs for bioinformatics in proteomics

Process handling:

  • Sample and information tracking, workflow integration tools (LIMS)
  • Signal detection (MS peaks, spots, …)

Interpretation of experimental data:

  • Image analysis tools (qualitative and quantitative sample comparison)
  • Protein identification and characterization tools (matching, data mining, scoring, prediction, analysis, validation)
  • Prediction and association of protein forms as members of pathways

Information source:

  • Databases (sequences, families, structure, function, pathways, 2-DE maps, MS data, DNA arrays, LIMS DB…)

SLIDE 3

Complexity in proteomics

Heterogeneous physicochemical properties:

  • Multiple protein forms: splicing variants, processing events, PTMs
  • Wide range of pI, Mw, solubility, concentration

Complex interactions:

  • Protein/protein, protein/DNA, protein/chemicals

Variable, dynamic systems:

  • Proteomes differ from individual to individual
  • Proteomes vary as function of environment (time, drugs, stress, …)

Proteome complexity

[Figure: a protein drawn as segments a, b, c, d and its protein forms: splicing variants (a b c; a c d), truncations and fragments (a’ b c d), and discrete and heterogeneous PTMs (a b’ c’ d among several unmodified a b c d copies).]

"I have identified the protein ABC." "The protein ABC? OK, which one?"

SLIDE 4

What is identification, what is characterization?

Identification: matching experimental results with a proteomics database entry (P01009, α1-antitrypsin, metallothionein, neurexin).

[Figure: 22 spots in plasma 2-DE]

Characterization:

  • describe structural details (maturation, mutation, PTM)
  • quantify the expression level (relative, absolute) as a function of external factors (time, drug, disease, …)
  • describe functional details (in complex, localization, partners)

SLIDE 5

Proteomics today: a couple of types of biological questions, but also:

  • many proteomes
  • many different proteins
  • many different protein forms per protein
  • many workflows
  • many different instrumentations
  • many bioinformatics tools

Proteomics workflows using mass spectrometry: complementarity

1) Classical 1-DE/2-DE, spot excision, protein identification
   +: >1000 protein forms detected, PTMs, quantitation
   –: limits for incompatible protein forms
2) Molecular scanner from 1-DE/2-DE
   +: same as 2-DE, contextual info
   –: same as 1-DE/2-DE, running time
3) MudPIT and similar
   +: no gels (virtually no incompatible proteins)
   –: identifies peptides, not protein forms; reproducibility problems due to complexity
4) ICAT and similar
   +: same as MudPIT, quantitation possible
   –: only Cys-containing proteins, no differentiation of protein forms
5) SELDI
   +: good for diagnostics, rapid, selectivity
   –: Mw range limited, complexity limited
6) Protein interactions, protein arrays...

Identification and characterisation: what approaches, what tools?

SLIDE 6

Proteomics workflows using mass spectrometry: complementarity

Method, with its identification and characterisation capabilities:

1) Classical 1-DE/2-DE (spot excision, protein identification/characterization)
   Identification: PMF, MS/MS
   Characterisation: PTM, sequence alterations; quantitation on separation step
2) Molecular scanner from 1-DE/2-DE
   Identification: PMF, MS/MS
   Characterisation: PTM, sequence alterations; quantitation with isotope labels
3) MudPIT and similar
   Identification: MS/MS
   Characterisation: no distinction of protein forms; no quantitation (15N)
4) ICAT and similar
   Identification: MS/MS
   Characterisation: no distinction of protein forms; quantitation with isotope labels
5) SELDI
   Identification: ~ no
   Characterisation: selection is part of the process; relative quantitation of signals
6) Protein interactions, protein arrays...
   Identification: ~
   Characterisation: detection of binding partners

[Figure: classical protein identification workflow. Steps shown: sample; sample treatment (reduction/alkylation); sample complexity reduction and protein separation (1-DE, 2-DE); proteolytic cleavage; mass spectrometry (MALDI-MS PMF, ESI MS-MS); protein/peptide identification; protein/peptide quantitation; validation, interpretation.]

Protein identification / characterization variables

  • various sample preparation
  • various MS technologies (MALDI-MS, ESI-MS/MS, ...)
  • various tools
  • various parameters
  • various databases

Result: different results with variable confidence

SLIDE 7

H. ducreyi proteins identified by 2D LC (requiring at least 1 significant peptide)

[Figure: Venn diagram comparing two instruments, the ESI QSTAR™ Pulsar System and the MALDI 4700 Proteomics Analyzer. 578 total unique proteins identified; per-instrument totals of 498 and 372, with 292 proteins shared, leaving 206 and 80 found on one instrument only.]

  • Successful MS/MS spectra = 2498/7414 (34%)
  • Successful MS/MS spectra = 1709/6222 (27%)

Source: T. Nadler, ABI

[Figure: sequence recovery for Q9Y2X3: PeptIdent sequence recovery from PMF on MALDI-TOF, and Mascot sequence recovery from LC-MS/MS on ESI-QTOF.]
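The numbers on this slide can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check on the printed counts; it assumes that 498 and 372 are the per-instrument totals and 292 the overlap, and it deliberately does not assume which total belongs to which instrument. The function names are illustrative.

```python
# Counts as printed on the slide. Which instrument owns which total is
# deliberately not assumed here.
TOTAL_UNIQUE = 578
INSTRUMENT_TOTALS = (498, 372)
SHARED = 292


def venn_regions(total_a, total_b, shared):
    """Split two overlapping sets into (only A, shared, only B)."""
    return total_a - shared, shared, total_b - shared


def success_rate(successful, acquired):
    """Fraction of acquired MS/MS spectra that gave a successful match."""
    return successful / acquired


only_a, both, only_b = venn_regions(*INSTRUMENT_TOTALS, SHARED)
union = only_a + both + only_b
rates = [success_rate(2498, 7414), success_rate(1709, 6222)]

print(only_a, both, only_b, union)
print([f"{r:.0%}" for r in rates])
```

Under these assumptions the three Venn regions add up to the stated total of unique proteins, which illustrates the point of the slide: each instrument alone misses a sizeable part of what the other finds.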

SLIDE 8

What is correct?

  • Only those validated by identification with two methods?
  • Every identified protein entry / peptide?
  • What validation criteria?
  • How to represent your confidence?

Quality in proteomics: quid?

  • Appropriate choice of sample and technologies
  • QC procedures (+/- controls, replicates)
  • Reduce human errors
  • Manage data
  • Detect and consider levels of accuracy in databases
  • Detect bioinformatics tools weaknesses
  • Interpret correctly / believe in results
  • Compare with others (compatibility issues)

SLIDE 9

Quality in proteomics: searches on the web

In general, it is difficult to find homogeneous protocols, validity limits of technologies, or quality criteria for interpretation.

  • Medline abstracts: only hints; papers SHOULD describe these in the Materials and Methods section
  • ABRF web forum: the query "quality and proteomics" returns 176 hits; only a few are about ways to validate and qualify a result or a method
  • Google search: many hits, few real descriptions

Quality in proteomics: searches on the web (2/2)

Google search (2/2):

  • Some proteomics core labs say that they deliver protein identification results after applying quality criteria …
  • Foundation of the German Society for Proteomics Research: aims to establish technology standards (quality criteria)
  • ESF workshop on data integration
  • Some grant proposal guidelines
  • Proteomics Standards Initiative

SLIDE 10

How to improve confidence and quality?

  • Use appropriate samples / controls
  • Adjust threshold values
  • Perform more than once
  • Use different approaches
  • Check consistency
  • Get more information
  • Improve the tools
  • Have a critical eye

Focus: Use appropriate samples / controls

SLIDE 11

How to improve confidence and quality?

Focus: Adjust threshold values

SLIDE 12

Some quality criteria

The following criteria were set for considering an identification as positive in MS-Fit database searching:

  (a) at least four matching peptide masses;
  (b) at least 50% of the measured masses must match the theoretical masses;
  (c) 40 p.p.m. or better mass accuracy.

  • S. Fulda et al. European Journal of Biochemistry Volume 267 Issue 19 Page 5900
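The three criteria quoted from Fulda et al. translate directly into a small acceptance test. The following is a minimal sketch, not the MS-Fit implementation: the function names are hypothetical, the peak lists are made-up illustrative values, and a measured mass is counted as matching when it lies within the ppm tolerance of any theoretical mass.

```python
def ppm_error(measured, theoretical):
    """Signed mass error in parts per million."""
    return (measured - theoretical) / theoretical * 1e6


def count_matches(measured_masses, theoretical_masses, tol_ppm=40.0):
    """Number of measured masses within tol_ppm of some theoretical mass."""
    return sum(
        1
        for m in measured_masses
        if any(abs(ppm_error(m, t)) <= tol_ppm for t in theoretical_masses)
    )


def is_positive_identification(measured_masses, theoretical_masses,
                               tol_ppm=40.0, min_matches=4, min_fraction=0.5):
    """Apply criteria (a), (b) and (c) from the slide together."""
    if not measured_masses:
        return False
    matched = count_matches(measured_masses, theoretical_masses, tol_ppm)
    return (matched >= min_matches
            and matched / len(measured_masses) >= min_fraction)


# Illustrative values only, not taken from any real spectrum.
theoretical = [1000.0, 1200.0, 1500.0, 1800.0, 2000.0]
good_spectrum = [1000.02, 1200.03, 1500.05, 1800.06, 2500.0, 3000.0]
weak_spectrum = [1000.02, 1200.03, 1500.05, 2500.0]

print(is_positive_identification(good_spectrum, theoretical))
print(is_positive_identification(weak_spectrum, theoretical))
```

Making the thresholds explicit parameters is the useful part: a reader of a paper can only judge an identification if the values used for `tol_ppm`, `min_matches` and `min_fraction` are reported.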

FROM THE ABRF DISCUSSION FORUM: Briefly, we search all data on PeptideSearch and ProFound using the following search parameters:

  1. Taxonomy: all kingdoms
  2. Modifications: none
  3. Missed cleavage sites: 1
  4. Mass tolerance: 0.3 Da or 0.015%, monoisotopic
  5. MW range: from ½ to 2x the SDS PAGE estimated MW.

The primary criteria we use for an identification are a ProFound score of 1.0 for the top ranked protein and a minimum sequence coverage of 20%, with both criteria having to be met. The median sequence coverage for the 90 proteins identified was 34%. (Kenneth Williams (Kenneth.Williams@yale.edu), 1998)
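The acceptance rule in this forum post combines two conditions that must both hold. A minimal sketch follows, assuming the MW search window runs from half to twice the gel-estimated MW; the function names are hypothetical and the code does not call PeptideSearch or ProFound.

```python
def mw_search_window(sds_page_mw_kda):
    """MW range to search: half to twice the SDS-PAGE estimate (assumed reading)."""
    return 0.5 * sds_page_mw_kda, 2.0 * sds_page_mw_kda


def accept_identification(profound_score, sequence_coverage_percent,
                          min_score=1.0, min_coverage=20.0):
    """Both criteria from the post must be met for the top-ranked protein."""
    return (profound_score >= min_score
            and sequence_coverage_percent >= min_coverage)


print(mw_search_window(50.0))
print(accept_identification(1.0, 34.0))
print(accept_identification(1.0, 15.0))
```

Requiring both conditions is stricter than either alone: a top score with poor coverage, or good coverage with a lower score, is rejected.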

SLIDE 13

From the ABRF discussion list <ABRF@list.abrf.org>
Subject: Re: Manual Validation of MS/MS spectra

In general, I agree with Steve's criteria for manual validation of MS/MS. However, I use a different set of criteria for the initial thresholds, when using data searched with Sequest alone.

To decide on the cut-off threshold, we determined the range of XCorrs that we got with a random sequence (at the suggestion of Jimmy Eng, we simply inverted the protein sequences in our database, so that they read from C to N-terminus; this nicely randomizes the database, without changing the composition or protein sizes). This will tell you what threshold you need to eliminate random chance hits (for us, this threshold for +1 ions is 2.1, for +2 ions it's 2.5, for +3 it's 3.1, which allow about 1% of bad data through; when I want more stringent, I use 2.3/2.7/3.3).

However, there is good data below these thresholds: I've seen good data as far down as XCorr 1 for singly or doubly charged or 1.3 for triply charged, when working with weak or noisy spectra. I've seen bad data above these thresholds (particularly for what we call decoys, where I'm using LCQ "tree" data, and the decoy is the "fake" charge form set up by the computer; of course, at the beginning, you don't know which is the decoy and which is the correct, "main" ms/ms data file).

One thing I do is search against mascot as well.

Katheryn Resing
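The reversed-database idea described in this message can be sketched in a few lines. This is an illustration only and does not call Sequest: the `REV_` prefix, the function names and the decoy scores are assumptions made for the example, while the per-charge thresholds are the ones quoted in the message.

```python
# XCorr cut-offs quoted in the message, keyed by precursor charge.
XCORR_THRESHOLDS = {1: 2.1, 2: 2.5, 3: 3.1}
XCORR_STRINGENT = {1: 2.3, 2: 2.7, 3: 3.3}


def reverse_database(proteins):
    """Build decoy entries by reading each sequence from C to N terminus."""
    return {"REV_" + name: seq[::-1] for name, seq in proteins.items()}


def passes_threshold(xcorr, charge, thresholds=XCORR_THRESHOLDS):
    """True if the score reaches the cut-off for this charge state."""
    if charge not in thresholds:
        raise ValueError(f"no threshold defined for charge {charge}")
    return xcorr >= thresholds[charge]


def cutoff_from_decoys(decoy_xcorrs, allowed_fraction=0.01):
    """Cut-off such that at most allowed_fraction of decoy scores exceed it."""
    if not decoy_xcorrs:
        raise ValueError("need at least one decoy score")
    ranked = sorted(decoy_xcorrs, reverse=True)
    allowed = min(int(allowed_fraction * len(ranked)), len(ranked) - 1)
    return ranked[allowed]


# Hypothetical protein and hypothetical decoy scores, for illustration.
decoys = reverse_database({"P1": "MKTAYIAK"})
decoy_scores = [1.0, 1.2, 1.5, 1.7, 1.9, 2.0, 2.05, 2.2, 2.4, 3.0]
cutoff = cutoff_from_decoys(decoy_scores, allowed_fraction=0.2)

print(decoys)
print(passes_threshold(2.6, 2), passes_threshold(2.6, 2, XCORR_STRINGENT))
print(cutoff)
```

Reversal keeps amino acid composition and protein length unchanged, which is why the decoy score distribution is a reasonable stand-in for random matches. As the message itself notes, a fixed cut-off still lets some bad data through and rejects some good data.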

Date: Mon, 13 Jan 2003 14:46:26 -0500
From: "Christian, Rob" <Rob_Christian@mspeople.com>
To: 'ABRF Discussion List' <ABRF@list.abrf.org>
Subject: RE: Manual Validation of MS/MS spectra

Benjamin - some other information you can use is the mass assignment of the product ions and the presence of appropriate immonium ions. The latter will only be available if you are using a triple quadrupole or a QTof type instrument. If you have an accurate mass instrument, such as a QTof, mass assignment of the product and precursor ions can be extremely powerful information. You should look for continuity in the mass errors as you progress through a series of b or y ions. For example, a spectrum might contain 10 ions that match the masses of y ions for a peptide. If 7 of these ions had mass errors of 5 ppm and 3 had 50 ppm errors then, unless the instrument is incorrectly calibrated, this data should be further scrutinized.

Hope this helps

Best Regards,
Rob Christian, Ph.D.
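The continuity check suggested in this message can be expressed as a comparison of each fragment's mass error against the typical error of the series. The sketch below uses the median as the "typical" value and a 20 ppm allowed deviation; both choices, and the function name, are assumptions for illustration. The input reproduces the example from the message (7 ions at 5 ppm, 3 at 50 ppm).

```python
from statistics import median


def flag_inconsistent_errors(errors_ppm, max_deviation_ppm=20.0):
    """Indices of fragment ions whose mass error departs from the series median."""
    centre = median(errors_ppm)
    return [
        index
        for index, error in enumerate(errors_ppm)
        if abs(error - centre) > max_deviation_ppm
    ]


# Example from the message: ten matched y ions, seven at 5 ppm, three at 50 ppm.
y_ion_errors = [5.0] * 7 + [50.0] * 3
suspicious = flag_inconsistent_errors(y_ion_errors)
print(suspicious)
```

A systematic offset shared by all ions points to calibration; a few ions that disagree with the rest point to chance matches, which is the case flagged here.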

SLIDE 14

How to improve confidence and quality?

Focus: Perform more than once

Analyzing populations of gels with Melanie

[Figure: population-based comparison of gel groups: Disease, Disease 2 or Control.]

SLIDE 15

[Figure: bar charts of spot % volume, T test, n=5, P<0.01.
Disease-associated "diagnostic" markers (islets), lean control vs obese (ob/ob) control: POM1, POM3, POM4, POM5, POM6, POM7, POM8, POM9, POM10, POM11, POM12, POM13.
Rosiglitazone target markers (islets), lean control vs obese control vs obese treated: POMT1, POMT2, POMT3, POMT4, POMT5, POMT11, POMT12, POMT13.]

Faculté de Médecine, Université de Genève; Hôpitaux Universitaires de Genève

Results summary

DM = ob/ob diagnostic markers; TM = Rosi target markers; SM = Rosi side effect markers

107 differentially expressed proteins

Tissues: islets, liver, muscle, WAT, BAT, liver nuclei

[Figure: marker counts per tissue as labelled on the slide: 0 SM, 12 DM, 2 TM; 1 SM, 14 DM, 7 TM; 2 SM, 13 DM, 8 TM; 1 SM, 28 DM, 6 TM; 0 SM, 6 DM, 2 TM; 30 DM, 8 TM.]

And now what? Believe? Understand? Validate? => Identify and quantify these targets

SLIDE 16

How to improve confidence and quality?

Focus: Use different approaches; Check consistency; Get more information

Separation of nucleolar proteins

  • one-dimensional
  • two-dimensional

[Figure: one- and two-dimensional gels of nucleolar proteins, with axes labelled pI = 4 to pI = 7 and 97 kDa to 14 kDa.]

Courtesy of Alexander Scherl

SLIDE 17

Annotated 2-DE gel

  • 46 annotated spots
  • 35 different proteins

Courtesy of Alexander Scherl

Annotated SDS-PAGE gel

  • 108 bands cut and analyzed
  • 190 different proteins

SwissProt entries:

  • Hypothetical protein (Y052_Human)
  • Proliferating-cell nucleolar antigen P120 (Nol1_Human)
  • Antigen NGP1 (NGP1_Human)
  • Hypothetical protein (Y682_Human)
  • Probable ATP dependent RNA helicase DDX10 (DD10_Human)
  • DNA directed RNA polymerase I 135 kDa polypeptide (RPA2_Human)
  • Periodic tryptophan protein 2 homolog (PWP2_Human)
  • U3 small nucleolar ribonucleoprotein MPP10 (MP10_Human)
  • Exosome complex exonuclease RRP44 (RR44_Human)
  • 116 kDa U5 small nuclear ribonucleoprotein component (U5S1_Human)

TrEMBL entries:

  • DHM1-like protein (Q9ul53)
  • Hypothetical 115.7 kDa protein (Q9h0a0)

NCBInr entries:

  • DEAD/H (Asp-Glu-Ala-Asp/His) box polypeptide 24 (GI:9966805)
  • Similar to KIAA0266 gene product (GI:12654625)
  • Hypothetical protein FLJ20419 (GI:8923388)

[Figure: SDS-PAGE gel with molecular weight markers at 94, 67, 43, 30, 20 and 14 (kDa).]

Courtesy of Alexander Scherl

SLIDE 18

Functional classification

213 identified proteins:

  • known function: 108
  • unknown function: 105; sequence analysis by BLAST splits these into:
      • unknown function: 61
      • hypothetical function: 44

Courtesy of Alexander Scherl

Functional classification

  • Others, 15
  • Chromatin structure, 6
  • Ribosomal proteins, 33
  • Ribosomal biogenesis, 44
  • mRNA metabolism, 21
  • Translation factors, 4
  • Chaperones, 7
  • Fibrous proteins, 11
  • DNA-PK complex, 5
  • Proliferation, 6
  • Unpredictable, 61

97% of known proteins have nucleolar localization

Courtesy of Alexander Scherl

SLIDE 19

  • GlycoMod: prediction of glycosylations
  • FindMod: prediction of PTMs
  • FindPept: prediction of non-specific cleavages, contaminants, etc.
  • PeptideMass: calculation of theoretical peptide masses
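To make concrete what "calculation of theoretical peptide masses" involves, here is a minimal sketch of an in-silico tryptic digest with monoisotopic masses. It is not the PeptideMass tool or its API: the function names are hypothetical, cysteines are left unmodified, no PTMs are considered, and trypsin is modelled with the simple rule "cleave after K or R unless followed by P". The test sequence joins two peptides that appear elsewhere in these slides.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def tryptic_peptides(sequence, missed_cleavages=0):
    """Cleave after K or R (not before P), allowing missed cleavages."""
    sites = [0]
    for i in range(len(sequence) - 1):
        if sequence[i] in "KR" and sequence[i + 1] != "P":
            sites.append(i + 1)
    sites.append(len(sequence))
    peptides = []
    for start in range(len(sites) - 1):
        for missed in range(missed_cleavages + 1):
            end = start + 1 + missed
            if end < len(sites):
                peptides.append(sequence[sites[start]:sites[end]])
    return peptides


def monoisotopic_mass(peptide):
    """Neutral monoisotopic mass: residues plus one water."""
    return sum(MONO[aa] for aa in peptide) + WATER


def mh_plus(peptide):
    """Singly protonated mass [M+H]+, as observed in MALDI PMF."""
    return monoisotopic_mass(peptide) + PROTON


for pep in tryptic_peptides("LVNELTEFAKTDCDHYTTNK"):
    print(pep, round(mh_plus(pep), 4))
```

Every choice hidden in such a calculation (monoisotopic or average masses, cysteine treatment, number of missed cleavages) changes the theoretical mass list, which is why these parameters belong in the methods section of a paper.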

How to improve confidence and quality?

Focus: Improve the tools

SLIDE 20

Use of the predictors

  • PMF: Aldente: Hough transform to correct for bad calibration, score from machine learning, PTMs from SWISS-PROT
  • MS/MS: Popitam: swarm intelligence to look for sequence tags

SLIDE 21

Popitam

Protein Or Peptide Identification using TAndem Mass spectrometry

SLIDE 22

MS-MS: Popitam

  1) INTERPRETING: source MS/MS peak list to interpreted peak list. Ionic hypotheses (b+, b+-NH3, a+-H2O, y++) are applied to each peak; ionic m/z are expressed as singly charged b-ions.
  2) STRUCTURING: directed acyclic graph (spectrum graph), as in de novo sequencing. The graph holds all amino acid tags and complete sequences (labels on the slide: LVNELTEFAK, and the tags QNTHSP, LVNE, FAK).
  3) COMPARING: the graph is compared with the peptide sequence database to reach an identification, by finding sections in the graph which best explain theoretical peptides.
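The structuring step can be illustrated with a toy spectrum graph. This is a simplified sketch, not Popitam's algorithm: nodes are peaks already expressed as singly charged b-ion m/z values, an edge joins two peaks whose mass difference equals one residue mass within a tolerance, and tags are read along the longest paths. The function names, the 0.01 Da tolerance and the two noise peaks are assumptions for the example; "L" stands for both L and I, which have the same mass.

```python
# Monoisotopic residue masses (Da); "L" represents L/I (identical mass).
RESIDUE = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276


def b_ions(peptide):
    """Singly charged b-ion m/z values for a peptide (helper for the demo)."""
    masses, total = [], PROTON
    for aa in peptide:
        total += RESIDUE[aa]
        masses.append(total)
    return masses


def build_spectrum_graph(peaks, tol=0.01):
    """Directed acyclic graph: edge i -> j if peaks differ by one residue mass."""
    peaks = sorted(peaks)
    edges = {i: [] for i in range(len(peaks))}
    for i in range(len(peaks)):
        for j in range(i + 1, len(peaks)):
            diff = peaks[j] - peaks[i]
            for aa, mass in RESIDUE.items():
                if abs(diff - mass) <= tol:
                    edges[i].append((j, aa))
    return peaks, edges


def longest_tags(edges):
    """Longest residue strings that can be read along paths of the graph."""
    memo = {}

    def walk(node):
        if node in memo:
            return memo[node]
        best = [""]
        for nxt, aa in edges[node]:
            for tail in walk(nxt):
                cand = aa + tail
                if len(cand) > len(best[0]):
                    best = [cand]
                elif len(cand) == len(best[0]) and cand not in best:
                    best.append(cand)
        memo[node] = best
        return best

    tags = {t for node in edges for t in walk(node) if t}
    longest = max((len(t) for t in tags), default=0)
    return sorted(t for t in tags if len(t) == longest)


# b-ions of LVNE plus two arbitrary noise peaks.
peaks, edges = build_spectrum_graph(b_ions("LVNE") + [250.0, 400.0])
print(longest_tags(edges))
```

With only the four b-ion peaks present, three mass differences can be read, so the tag recovered here is VNE; the first residue would need a starting node at the N-terminus. Noise peaks stay isolated because they do not sit one residue mass away from any other peak.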

SLIDE 23

[Figure: the peptide TDCDHYTTNK matched against the database (db) in two ways: full path algorithm (complete sequence TDCDHYTTNK) and tag algorithm (tags CDH, YTT, YTTN, TDCD, …, NK).]

Significant advantages:

  • Structuring the source data: uses information coming from the peak succession
  • Understands the instrument-specific ion series
  • No calibration necessary
  • Tag approach: no precursor m/z necessary to start; merging tags and looking for PTMs
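The tag approach can be pictured as looking up short sequence tags in candidate peptides and ranking the candidates by how much of their sequence the tags explain. The sketch below is a deliberately simplified stand-in for that idea, not Popitam's scoring: it uses plain substring search with no mass constraints, and the function names and the three-peptide "database" are assumptions. The tags and the target peptide are the ones shown on the slide.

```python
def covered_positions(peptide, tags):
    """Set of residue positions in the peptide explained by at least one tag."""
    covered = set()
    for tag in tags:
        start = peptide.find(tag)
        while start != -1:
            covered.update(range(start, start + len(tag)))
            start = peptide.find(tag, start + 1)
    return covered


def rank_candidates(database, tags):
    """Candidates sorted by the fraction of their sequence covered by tags."""
    scored = [
        (len(covered_positions(pep, tags)) / len(pep), pep)
        for pep in database
        if pep
    ]
    return sorted(scored, key=lambda item: (-item[0], item[1]))


tags = ["CDH", "YTT", "YTTN", "TDCD", "NK"]
database = ["LVNELTEFAK", "QNTHSP", "TDCDHYTTNK"]
ranking = rank_candidates(database, tags)
print(ranking)
```

Because overlapping tags are merged into one set of covered positions, several short and partly redundant tags can together explain a full peptide, which is the intuition behind "merging tags" on the slide.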

SLIDE 24

How to improve confidence and quality?

Focus: Have a critical eye

SLIDE 25

How to improve confidence and quality?

  • How to deal with dataflows?
  • What about international efforts?

SLIDE 26

And now, you can publish…

  • What introduction? Everything that helps the reader understand the relevance.
  • What information / materials and methods? Everything that is useful to understand the experiments.
  • What data / results? Everything that is useful for review.
  • What discussion? Everything that argues for an appropriate interpretation.
  • What about data that do not fit in a paper form?

SLIDE 27

[Figure: The Big Picture (PAB, Jan 2003). Diagram with elements labelled LIMS, exp, paper, MS, 2DE, PPI, SWISS-PROT and fct, connected by query and export links; the part that PSI MS focuses on is marked.]

SLIDE 28

Proteomics data integration

  • Proteomics data today: “Publish and vanish”
  • Need to develop infrastructure to exchange, analyse and archive proteomics data across different, fast-evolving technologies
  • Long term: develop modular standard for functional genomics in collaboration with MGED
  • Initial focus on limited, feasible domains:
      • Mass spectrometry
      • Protein-protein interactions

Mass spectrometry

Topics:

  • Minimal requirements for publication (MIAME-like)
  • XML for proteomics (MAGE, HUP-ML, others)
  • Database schema (PeDRo and others)
  • Tools (export, import, converters, queries)
  • Database(s), repository(ies)
  • Quality and user requirements
  • Coordinations

Strategy:

  • Controlled vocabulary
  • Open source repository
  • Access to data (test sites)
  • Integrate into wider proteomics context

SLIDE 29

Status

  • Input formats from:
      • Bruker-Daltonics, Inc.
      • Ciphergen Biosystems, Inc.
      • Micromass-Waters
      • Protagen, Inc.
      • Thermo LabSystems
      • Institute for Systems Biology
      • U Manchester (Pedro*)
  • Draft format developed during EBI workshop, July 21-25
  • Presented at HUPO congress, Montreal, October 2003
  • Format improved and available on http://psidev.sourceforge.net

*Taylor, C.F., et al. A systematic approach to modeling, capturing, and disseminating proteomics experimental data. Nat Biotechnol. 21:247-254 (2003).

SLIDE 30

Conclusion:

  • Biological question
  • Choice of sample(s)
  • Workflow(s) capabilities and choice
  • Validation
  • Data management

Thank you for your attention!