Mapping the sub-cellular proteome, Laurent Gatto (lg390@cam.ac.uk)




slide-1
SLIDE 1

Mapping the sub-cellular proteome

Laurent Gatto lg390@cam.ac.uk – @lgatt0 http://cpu.sysbiol.cam.ac.uk/ Slides: https://zenodo.org/record/1180393 22 Feb 2018, De Duve Institute

slide-2
SLIDE 2

Take home messages

  1. Protein sub-cellular localisation: available technologies and opportunities.
  2. Reliance on computational biology to acquire reliable biological knowledge.

slide-3
SLIDE 3

Regulations

slide-4
SLIDE 4

Cell organisation

Spatial proteomics is the systematic study of protein localisations.

Image from Wikipedia http://en.wikipedia.org/wiki/Cell_(biology).

slide-5
SLIDE 5

Spatial proteomics - Why?

Localisation is function

◮ The cellular sub-division allows cells to establish a range of distinct micro-environments, each favouring different biochemical reactions and interactions and, therefore, allowing each compartment to fulfil a particular functional role.

◮ Localisation and sequestration of proteins within sub-cellular niches is a fundamental mechanism for the post-translational regulation of protein function.

Re-localisation in

◮ Differentiation of stem cells.
◮ Activation of biological processes.

Examples later.

slide-6
SLIDE 6

Spatial proteomics - Why?

Mis-localisation

Disruption of the targeting/trafficking process alters proper sub-cellular localisation, which in turn perturbs the cellular functions of the proteins.

◮ Abnormal protein localisation leading to the loss of functional effects in diseases (Laurila and Vihinen, 2009).

◮ Disruption of the nuclear/cytoplasmic transport (nuclear pores) has been detected in many types of carcinoma cells (Kau et al., 2004).

◮ Sub-cellular localisation of MC4R with ADCY3 at neuronal primary cilia underlies a common pathway for genetic predisposition to obesity (Siljee et al., 2018).

slide-7
SLIDE 7

Spatial proteomics - How, experimentally

Single cell, direct observation (tagging): GFP, epitope, protein-specific antibody.

Population level, subcellular fractionation followed by quantitative mass spectrometry (cataloguing or relative abundance), by number of fractions:

◮ 1 fraction: pure fraction catalogue.
◮ 2 fractions (enriched and crude): subtractive proteomics (enrichment).
◮ n discrete fractions: invariant rich fraction (clustering).
◮ n continuous fractions (gradient approaches): PCP (χ²), LOPIT (PCA, PLS-DA).

Figure : Organelle proteomics approaches (Gatto et al., 2010).

slide-8
SLIDE 8

Fusion proteins and immunofluorescence

Figure : Targeted protein localisation. Example of discrepancies between IF and FPs as well as between FP tagging at the N and C termini (Stadler et al., 2013).

slide-9
SLIDE 9

Spatial proteomics - How, experimentally

Figure : Organelle proteomics approaches (Gatto et al., 2010).

Gradient approaches: Dunkley et al. (2006), Foster et al. (2006), based on works by de Duve, Claude and Palade. Explorative/discovery approaches, steady-state global localisation maps.

slide-10
SLIDE 10

Workflow: cell lysis, fractionation/centrifugation (e.g. mitochondrion), then quantitation/identification by mass spectrometry.

slide-11
SLIDE 11

Quantitation data and organelle markers

       Fraction 1   Fraction 2   ...   Fraction m   markers
p1     q1,1         q1,2         ...   q1,m         unknown
p2     q2,1         q2,2         ...   q2,m         loc1
p3     q3,1         q3,2         ...   q3,m         unknown
p4     q4,1         q4,2         ...   q4,m         loci
...    ...          ...          ...   ...          ...
pj     qj,1         qj,2         ...   qj,m         unknown
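The table layout can be held in a small data structure. The sketch below is illustrative only: the `ProteinProfile` class, the helper names and the toy intensities are assumptions made for this example, not part of pRoloc or MSnbase. It stores one quantitation vector per protein, rescales each profile to sum to 1 (a common choice, assumed here), and separates organelle markers from unlabelled proteins.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ProteinProfile:
    """One row of the quantitation table (hypothetical container)."""
    protein_id: str
    intensities: List[float]   # q_{j,1} ... q_{j,m}, one value per fraction
    marker: str = "unknown"    # organelle label for markers, else "unknown"


def normalise(intensities: List[float]) -> List[float]:
    """Rescale a profile so that its fractions sum to 1."""
    total = sum(intensities)
    if total == 0:
        raise ValueError("profile has no signal")
    return [q / total for q in intensities]


def split_markers(
    profiles: List[ProteinProfile],
) -> Tuple[List[ProteinProfile], List[ProteinProfile]]:
    """Separate labelled markers from proteins of unknown localisation."""
    markers = [p for p in profiles if p.marker != "unknown"]
    unknowns = [p for p in profiles if p.marker == "unknown"]
    return markers, unknowns


# Toy data with made-up values, three fractions per protein.
profiles = [
    ProteinProfile("p1", [1.0, 1.0, 2.0]),
    ProteinProfile("p2", [3.0, 1.0, 0.0], marker="loc1"),
    ProteinProfile("p3", [0.5, 0.5, 1.0]),
]
markers, unknowns = split_markers(profiles)
```

The markers form the training set for the supervised methods that follow; the unknowns are the proteins to be classified.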

slide-12
SLIDE 12

Data analysis

◮ Visualisation (cluster, unsupervised learning)
◮ Classification (supervised learning)
◮ Novelty detection (semi-supervised learning)
◮ Data integration (transfer learning)
◮ ...

To uncover and understand biology

slide-13
SLIDE 13

Visualisation

Panels: correlation profiles across fractions 1, 2, 4, 5, 7, 8, 11 and 12 for ER, Golgi, mit/plastid, PM and vacuole; principal component analysis (PC1 vs PC2) with classes ER, Golgi, mit/plastid, PM and vacuole, distinguishing markers, PLS-DA assignments and unknown proteins.

Figure : From Gatto et al. (2010), Arabidopsis thaliana data from Dunkley et al. (2006)

slide-14
SLIDE 14

Supervised Machine Learning

Panels: Organelle markers; Classification (SVM). Both show PC1 (48.41%) vs PC2 (23.85%). Classes: 40S Ribosome, 60S Ribosome, Actin cytoskeleton, Cytosol, Endoplasmic reticulum/Golgi apparatus, Endosome, Extracellular matrix, Lysosome, Mitochondrion, Nucleus - Chromatin, Nucleus - Non-chromatin, Peroxisome, Plasma membrane, Proteasome, unknown.

Figure : Support vector machines classifier (after 5% FDR classification cutoff) on the embryonic stem cell data from Christoforou et al. (2016).

slide-15
SLIDE 15

Importance of annotation

PCA plot, PC1 (58.53%) vs PC2 (29.96%). Classes: ER/Golgi, mitochondrion, PM, unknown.

Incomplete annotation, and therefore lack of training data, for many/most organelles. Drosophila data from Tan et al. (2009).

slide-16
SLIDE 16

Semi-supervised learning: novelty detection

Panels, both PC1 (58.53%) vs PC2 (29.96%). First panel classes: ER/Golgi, mitochondrion, PM, unknown. Second panel classes: Cytoskeleton, ER, Golgi, Lysosome, mitochondrion, Nucleus, Peroxisome, PM, Proteasome, Ribosome 40S, Ribosome 60S.

Figure : Left: Original Drosophila data from Tan et al. (2009). Right: After semi-supervised learning and classification, Breckels et al. (2013).

slide-17
SLIDE 17

Improving on LOPIT

Improving means obtaining better sub-cellular resolution to increase the number of proteins that can be confidently assigned to a sub-cellular niche ⇒ biological discoveries.

Left panel: PCA plot, PC1 (40.28%) vs PC2 (25.7%). Classes: 40S Ribosome, 60S Ribosome, Cytosol, Endoplasmic reticulum, Lysosome, Mitochondrion, Nucleus - Chromatin, Nucleus - Nucleolus, Plasma membrane, Proteasome, unknown.

Right panel: PCA plot, PC1 (48.41%) vs PC2 (23.85%). Classes: 40S Ribosome, 60S Ribosome, Actin cytoskeleton, Cytosol, Endoplasmic reticulum/Golgi apparatus, Endosome, Extracellular matrix, Lysosome, Mitochondrion, Nucleus - Chromatin, Nucleus - Non-chromatin, Peroxisome, Plasma membrane, Proteasome, unknown.

Figure : E14TG2a embryonic stem cells: old (left, published in Breckels et al. (2013)) vs. new, better resolved (right) experiments (Christoforou et al. (2016)).

slide-18
SLIDE 18

Improving on LOPIT

LOPIT: Dunkley et al. (2006), Gatto et al. (2014a)

Computational: transfer learning, Breckels et al. (2016a)

Experimental: hyperLOPIT, Christoforou et al. (2016), Mulvey et al. (2017)

Breckels et al. (2016b)

Biological discoveries

slide-19
SLIDE 19

Experimental advances: hyperLOPIT Christoforou et al. (2016)

Figure : From Mulvey et al. (2017) Using hyperLOPIT to perform high-resolution mapping of the spatial proteome: (1) organelle separation and enrichment by density gradient ultracentrifugation, (2) chromatin and cytosol enrichment fractions, and (3) accurate quantification using synchronous precursor selection (SPS)-MS3 for TMT 11-plex quantification.

slide-20
SLIDE 20

Panels: PCA plots, PC1 (40.28%) vs PC2 (25.7%) with 10 organelle classes plus unknown, and PC1 (50.56%) vs PC2 (24.34%) with 14 organelle classes plus unknown.

Figure : E14TG2a LOPIT on 8 fractions (using iTRAQ 8-plex) and 1109 proteins vs. hyperLOPIT on 10 fractions (using TMT 11-plex) and SPS-MS3 for 5032 proteins.

slide-21
SLIDE 21

Computational advances: Transfer learning

What about using additional data, such as annotations from the Gene Ontology (GO), sequence features (pseudo amino acid composition), signal peptides, trans-membrane domains (length, number, ...), images (IF, FP), interaction data, prediction software, ...?

◮ From a user perspective: "free/cheap" vs. expensive and time-consuming experiments.
◮ Abundant (all proteins, 100s of features) vs. (experimentally) limited/targeted (1000s of proteins, 6 to 20 features).
◮ For localisation in the system at hand: low vs. high quality.
◮ Static vs. dynamic.

slide-22
SLIDE 22

Transfer learning

Support/complement the primary target domain (experimental data) with auxiliary data (annotation, imaging, PPI, ...) features without compromising the integrity of our primary data.

slide-23
SLIDE 23

PRIMARY EXPERIMENTAL DATA: cell lysis (e.g. mitochondrion), fractionation/centrifugation, quantitation/identification by mass spectrometry, visualisation. Rows x1 ... xn are proteins.

          X113     X114    X115     X116    X117    X118     X119     X121
O00767    0.1361   0.150   0.1062   0.147   0.277   0.1429   0.0380   0.00338
P51648    0.1914   0.205   0.0566   0.165   0.237   0.0996   0.0180   0.02727
Q2TAA5    0.1297   0.201   0.0546   0.146   0.292   0.1463   0.0206   0.00902
Q9UKV5    0.0939   0.207   0.0419   0.204   0.344   0.1098   0.0000   0.00000
...       ...      ...     ...      ...     ...     ...      ...      ...

AUXILIARY DRY DATA: database query, extract GO CC terms, convert terms to binary, visualisation. Rows x1 ... xn are proteins, columns GO1 ... GOA are terms.

          GO:0016021   GO:0005789   GO:0005783   ...
O00767    1            1            1            ...
P51648    1            1            0            ...
Q2TAA5    1            1            0            ...
Q9UKV5    0            0            0            ...
...       ...          ...          ...          ...
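The "convert terms to binary" step can be sketched as below. This is a minimal illustration, not the pRoloc implementation: the function name `go_terms_to_binary` is hypothetical, and the annotation sets are toy values chosen to mirror the four proteins and three GO terms shown on the slide.

```python
from typing import Dict, List, Set, Tuple


def go_terms_to_binary(
    annotations: Dict[str, Set[str]],
) -> Tuple[List[str], Dict[str, List[int]]]:
    """Turn per-protein GO term sets into a protein x term 0/1 matrix.

    Returns the sorted list of terms (the columns) and, for each protein,
    a row with 1 where the protein is annotated with the term.
    """
    terms = sorted(set().union(*annotations.values()))
    matrix = {
        protein: [1 if term in protein_terms else 0 for term in terms]
        for protein, protein_terms in annotations.items()
    }
    return terms, matrix


# Toy annotations (assumed for illustration, following the slide's table).
annotations = {
    "O00767": {"GO:0016021", "GO:0005789", "GO:0005783"},
    "P51648": {"GO:0016021", "GO:0005789"},
    "Q2TAA5": {"GO:0016021", "GO:0005789"},
    "Q9UKV5": set(),
}
terms, binary = go_terms_to_binary(annotations)
for protein, row in binary.items():
    print(protein, row)
```

Note that the columns come out in sorted term order, which differs from the column order on the slide; only the membership pattern matters for the auxiliary data.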

slide-24
SLIDE 24

Breckels et al. (2016a) Learning from Heterogeneous Data Sources: An Application in Spatial Proteomics.

Panels: PCA plots, PC1 (40.28%) vs PC2 (25.7%) with 10 organelle classes plus unknown (left), and PC1 (48.41%) vs PC2 (23.85%) with 14 organelle classes plus unknown (right).

Application of transfer learning on the old E14TG2a embryonic stem cells (left, Breckels et al. (2013)) and GO cellular compartment, and validated using the new, better resolved, hyperLOPIT data (right, Christoforou et al. (2016)).

slide-25
SLIDE 25

Transfer learning results

Plot: classification scores (0.25 to 1.00) for knn, knn-TL, svm and svm-TL, split by outcome (correct, incorrect).

Figure : From Breckels et al. (2016a) Learning from heterogeneous data sources: an application in spatial proteomics.

slide-26
SLIDE 26

Biological discoveries

◮ Multi-localisation
◮ Trans-localisation

Dependent on good sub-cellular resolution and adequate computational tools.

slide-27
SLIDE 27

Embracing uncertainty

A Bayesian Mixture Modelling Approach For Spatial Proteomics

We propose a Bayesian generative classifier based on Gaussian mixture models to assign proteins probabilistically to sub-cellular niches, thus proteins have a probability distribution over sub-cellular locations. This methodology allows proteome-wide uncertainty quantification, thus adding a further layer to the analysis of spatial proteomics.
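To make "a probability distribution over sub-cellular locations" concrete, the sketch below computes membership probabilities for one protein from a Gaussian mixture with fixed parameters. It is a simplified plug-in illustration only: the full Bayesian approach also places priors on the parameters and infers them, which is not shown here. The function names, the diagonal covariances and the toy component parameters are all assumptions.

```python
import math
from typing import Dict, List, Tuple

# A component is (mixing weight, mean vector, per-dimension variances).
Component = Tuple[float, List[float], List[float]]


def log_gaussian_diag(x: List[float], mean: List[float], var: List[float]) -> float:
    """Log density of a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def membership_probabilities(
    x: List[float], components: Dict[str, Component]
) -> Dict[str, float]:
    """Posterior probability of each niche given profile x (Bayes' rule)."""
    log_post = {
        name: math.log(weight) + log_gaussian_diag(x, mean, var)
        for name, (weight, mean, var) in components.items()
    }
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_post.values())
    unnormalised = {name: math.exp(lp - top) for name, lp in log_post.items()}
    total = sum(unnormalised.values())
    return {name: u / total for name, u in unnormalised.items()}


# Toy two-dimensional example with made-up parameters.
components = {
    "Endosome": (0.5, [0.0, 0.0], [1.0, 1.0]),
    "Lysosome": (0.5, [1.0, 1.0], [1.0, 1.0]),
}
probs = membership_probabilities([0.5, 0.5], components)
print(probs)
```

A protein lying midway between two equally weighted components receives equal membership in both, which is the kind of uncertainty a hard classifier would hide.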

slide-28
SLIDE 28

Embracing uncertainty

Panels: PCA plot with protein P51863 indicated, PC1 (48.41%) vs PC2 (23.85%); distribution of subcellular membership probability (0 to 1) for protein P51863 across the classes 40S Ribosome, 60S Ribosome, Actin cytoskeleton, Cytosol, Endoplasmic reticulum/Golgi apparatus, Endosome, Extracellular matrix, Lysosome, Mitochondrion, Nucleus - Chromatin, Nucleus - Non-chromatin, Peroxisome, Plasma membrane and Proteasome.

Figure : V-ATPase subunit d1 (P51863) with uncertain localisation between the endosome and lysosome.

slide-29
SLIDE 29

Dual-localisation

Proteins may be present simultaneously in several organelles (e.g. trafficking). Simulation on A. thaliana data from Dunkley et al. (2006) (Gatto et al., 2014b) (left). Example from embryonic stem cells (Christoforou et al., 2016) (right).
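One plausible way to simulate a dual-localised protein is to mix the profiles of two organelles in a chosen proportion. The sketch below assumes this convex-combination scheme for illustration; it is not necessarily the exact procedure of Gatto et al. (2014b), and the function name and toy profiles are hypothetical.

```python
from typing import List


def mix_profiles(
    profile_a: List[float], profile_b: List[float], proportion_a: float
) -> List[float]:
    """Profile of a protein spending `proportion_a` of its abundance in
    organelle A and the remainder in organelle B."""
    if not 0.0 <= proportion_a <= 1.0:
        raise ValueError("proportion_a must lie in [0, 1]")
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same fractions")
    return [
        proportion_a * a + (1.0 - proportion_a) * b
        for a, b in zip(profile_a, profile_b)
    ]


# Toy organelle profiles over two fractions (made-up values).
organelle_a = [1.0, 0.0]
organelle_b = [0.0, 1.0]
mixed = mix_profiles(organelle_a, organelle_b, 0.25)
print(mixed)
```

Sweeping the proportion from 0 to 1 traces a path between the two organelle clusters, which is the pattern to look for when interpreting proteins that sit between clusters on a PCA plot.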

Left panel: PCA plot, PC1 (64.36%) vs PC2 (22.34%). Classes: ER lumen, ER membrane, Golgi, Mitochondrion, Plastid, PM, Ribosome, TGN, vacuole, unknown.
slide-30
SLIDE 30

Dual-localisation

Proteins may be present simultaneously in several organelles (e.g. trafficking). Simulation on A. thaliana data from Dunkley et al. (2006) (Gatto et al., 2014b) (left). Example from embryonic stem cells (Christoforou et al., 2016) (right).

Left panel: PCA plot, PC1 (64.36%) vs PC2 (22.34%). Classes: ER lumen, ER membrane, Golgi, Mitochondrion, Plastid, PM, Ribosome, TGN, vacuole, unknown.

From Betschinger et al. (2013)

Right panel: Mouse ESC (E14TG2a) in serum LIF, PC1 (50.05%) vs PC2 (24.61%), with Tfe3 highlighted. Classes: Actin cytoskeleton, Cytosol, Endosome, ER/GA, Extracellular matrix, Lysosome, Mitochondria, Nucleus - Chromatin, Nucleus - Nucleolus, Peroxisome, Plasma Membrane, Proteasome, Ribosome 40S, Ribosome 60S, unknown.
slide-31
SLIDE 31

Spatial dynamics

Trans-localisation event during monocyte to macrophage differentiation

Investigate the effect of lipopolysaccharides (LPS)-mediated inflammatory response in human monocytic cells (THP-1)

Data

◮ Triplicate temporal profiling (0, 2, 4, 6, 12, 24 hours).
◮ Triplicate spatial profiling (0 vs 12 hours): early trafficking, before actual morphological differentiation at 24h.

Work led by Dr Claire Mulvey at the Cambridge Centre for Proteomics.

slide-32
SLIDE 32

Panels: Unstimulated, PC1 (36.64%) vs PC2 (20.7%); LPS 12hrs, PC1 (37.4%) vs PC2 (19.23%). Classes: Cytosol, Endoplasmic Reticulum, Golgi Apparatus, Lysosome, Mitochondria, Nucleus, Peroxisome, Plasma membrane, unknown.

Figure : Spatial maps of unstimulated and LPS-treated cells (combined triplicates).

slide-33
SLIDE 33

Panels: Unstimulated, PC1 (36.64%) vs PC2 (20.7%); LPS 12hrs, PC1 (37.4%) vs PC2 (19.23%). PKCA and PKCB are highlighted in both panels. Classes: Cytosol, Endoplasmic Reticulum, Golgi Apparatus, Lysosome, Mitochondria, Nucleus, Peroxisome, Plasma membrane, unknown.

Figure : Relocation of Protein Kinase C α and β from the cytosol to the plasma membrane, driving maturation into a differentiated macrophage phenotype.

slide-34
SLIDE 34

Panels: Unstimulated, PC1 (36.64%) vs PC2 (20.7%); LPS 12hrs, PC1 (37.4%) vs PC2 (19.23%). STAT2, STAT3 and STAT6 are highlighted in both panels. Classes: Cytosol, Endoplasmic Reticulum, Golgi Apparatus, Lysosome, Mitochondria, Nucleus, Peroxisome, Plasma membrane, unknown.

Figure : Relocation of Signal transducer and activator of transcription 6 (STAT6) from the cytosol to the nucleus, activating an anti-bacterial and anti-viral-like response. Validated by microscopy; see also Chen et al. (2011).

slide-35
SLIDE 35

Computational infrastructure

Reliance on computational biology to acquire reliable biological knowledge.

slide-36
SLIDE 36

Beyond the figures (1)

◮ Software: infrastructure (MSnbase, Gatto and Lilley (2012)), dedicated machine learning (pRoloc, Gatto et al. (2014b)), interactive visualisation (2) (pRolocGUI, Breckels et al. (2017)) and data (pRolocdata, Gatto et al. (2014b)) for spatial proteomics.

(1) ... which are all reproducible, by the way.
(2) https://lgatto.shinyapps.io/christoforou2015/
(3) between and within domains/software

slide-37
SLIDE 37

Beyond the figures (1)

◮ Software: infrastructure (MSnbase, Gatto and Lilley (2012)), dedicated machine learning (pRoloc, Gatto et al. (2014b)), interactive visualisation (2) (pRolocGUI, Breckels et al. (2017)) and data (pRolocdata, Gatto et al. (2014b)) for spatial proteomics.

◮ The Bioconductor (Huber et al., 2015) ecosystem for high throughput biology data analysis and comprehension: open source, with coordinated and collaborative (3) open development. It enables reproducible research and understanding of the data (not a black box), and drives scientific innovation.

(1) ... which are all reproducible, by the way.
(2) https://lgatto.shinyapps.io/christoforou2015/
(3) between and within domains/software

slide-38
SLIDE 38

Conclusions

  1. Protein sub-cellular localisation: technologies (hyperLOPIT) and opportunities (sub-cellular maps, multi- and trans-localisation).

Panels: Unstimulated, PC1 (36.64%) vs PC2 (20.7%); LPS 12hrs, PC1 (37.4%) vs PC2 (19.23%). STAT2, STAT3 and STAT6 are highlighted in both panels.

  2. Reliance on computational biology and dedicated software to interpret data and acquire biological knowledge.

> library("pRoloc")

slide-39
SLIDE 39

References I

J Betschinger, J Nichols, S Dietmann, P D Corrin, P J Paddison, and A Smith. Exit from pluripotency is gated by intracellular redistribution of the bhlh transcription factor tfe3. Cell, 153(2):335–47, Apr 2013. doi: 10.1016/j.cell.2013.03.012.

L M Breckels, S B Holden, D Wojnar, C M Mulvey, A Christoforou, A Groen, M W Trotter, O Kohlbacher, K S Lilley, and L Gatto. Learning from heterogeneous data sources: An application in spatial proteomics. PLoS Comput Biol, 12(5):e1004920, May 2016a. doi: 10.1371/journal.pcbi.1004920.

Lisa Breckels, Thomas Naake, and Laurent Gatto. pRolocGUI: Interactive visualisation of spatial proteomics data, 2017. URL http://ComputationalProteomicsUnit.github.io/pRolocGUI/. R package version 1.11.2.

LM Breckels, L Gatto, A Christoforou, AJ Groen, KS Lilley, and MW Trotter. The effect of organelle discovery upon sub-cellular protein localisation. J Proteomics, 88:129–40, Aug 2013.

LM Breckels, CM Mulvey, KS Lilley, and L Gatto. A bioconductor workflow for processing and analysing spatial proteomics data [version 1; referees: awaiting peer review]. F1000Research, 5(2926), 2016b. doi: 10.12688/f1000research.10411.1.

H Chen, H Sun, F You, W Sun, X Zhou, L Chen, J Yang, Y Wang, H Tang, Y Guan, W Xia, J Gu, H Ishikawa, D Gutman, G Barber, Z Qin, and Z Jiang. Activation of stat6 by sting is critical for antiviral innate immunity. Cell, 147(2):436–46, Oct 2011. doi: 10.1016/j.cell.2011.09.022.

A Christoforou, C M Mulvey, L M Breckels, A Geladaki, T Hurrell, P C Hayward, T Naake, L Gatto, R Viner, A Martinez Arias, and K S Lilley. A draft map of the mouse pluripotent stem cell spatial proteome. Nat Commun, 7:8992, Jan 2016. doi: 10.1038/ncomms9992.

TPJ Dunkley, S Hester, IP Shadforth, J Runions, T Weimar, SL Hanton, JL Griffin, C Bessant, F Brandizzi, C Hawes, RB Watson, P Dupree, and KS Lilley. Mapping the Arabidopsis organelle proteome. PNAS, 103(17):6518–6523, Apr 2006.

LJ Foster, CL de Hoog, Y Zhang, Y Zhang, X Xie, VK Mootha, and M Mann. A mammalian organelle map by protein correlation profiling. Cell, 125(1):187–199, Apr 2006.

L Gatto and KS Lilley. MSnbase - an R/Bioconductor package for isobaric tagged mass spectrometry data visualization, processing and quantitation. Bioinformatics, 28(2):288–9, Jan 2012.

L Gatto, JA Vizcaino, H Hermjakob, W Huber, and KS Lilley. Organelle proteomics experimental designs and analysis. Proteomics, 2010.
slide-40
SLIDE 40

References II

L Gatto, L M Breckels, S Wieczorek, T Burger, and K S Lilley. Mass-spectrometry based spatial proteomics data analysis using pRoloc and pRolocdata. Bioinformatics, Jan 2014a.

L Gatto, LM Breckels, T Burger, DJ Nightingale, AJ Groen, C Campbell, N Nikolovski, CM Mulvey, A Christoforou, M Ferro, and KS Lilley. A foundation for reliable spatial proteomics data analysis. MCP, 13(8):1937–52, Aug 2014b.

W Huber, V J Carey, R Gentleman, S Anders, M Carlson, B S Carvalho, H C Bravo, S Davis, L Gatto, T Girke, R Gottardo, F Hahne, K D Hansen, R A Irizarry, M Lawrence, M I Love, J MacDonald, V Obenchain, A K Oleś, H Pagès, A Reyes, P Shannon, G K Smyth, D Tenenbaum, L Waldron, and M Morgan. Orchestrating high-throughput genomic analysis with Bioconductor. Nat Methods, 12(2):115–21, Jan 2015. doi: 10.1038/nmeth.3252.

TR Kau, JC Way, and PA Silver. Nuclear transport and cancer: from mechanism to intervention. Nat Rev Cancer, 4(2):106–17, Feb 2004.

K Laurila and M Vihinen. Prediction of disease-related mutations affecting protein localization. BMC Genomics, 10:122, 2009.

C M Mulvey, L M Breckels, A Geladaki, N K Britovšek, DJH Nightingale, A Christoforou, M Elzek, M J Deery, L Gatto, and K S Lilley. Using hyperlopit to perform high-resolution mapping of the spatial proteome. Nat Protoc, 12(6):1110–1135, Jun 2017. doi: 10.1038/nprot.2017.026.

J E Siljee, Y Wang, A A Bernard, B A Ersoy, S Zhang, A Marley, M Von Zastrow, J F Reiter, and C Vaisse. Subcellular localization of mc4r with adcy3 at neuronal primary cilia underlies a common pathway for genetic predisposition to obesity. Nat Genet, Jan 2018. doi: 10.1038/s41588-017-0020-9.

C Stadler, E Rexhepaj, V R Singan, R F Murphy, R Pepperkok, M Uhlén, J C Simpson, and E Lundberg. Immunofluorescence and fluorescent-protein tagging show high correlation for protein localization in mammalian cells. Nat Methods, 10(4):315–23, Apr 2013.

DJL Tan, H Dvinge, A Christoforou, P Bertone, A Arias Martinez, and KS Lilley. Mapping organelle proteins and protein complexes in Drosophila melanogaster. J Proteome Res, 8(6):2667–2678, Jun 2009.

P Wu and TG Dietterich. Improving svm accuracy by training on auxiliary data sources. In Proceedings of the Twenty-first International Conference on Machine Learning, ICML '04, New York, NY, USA, 2004. ACM.

slide-41
SLIDE 41

Acknowledgements

◮ Mr Oliver Crook and Dr Lisa Breckels, Computational Proteomics Unit, Cambridge (machine learning, algorithms, software).
◮ Dr Sebastian Gibb and Dr Johannes Rainer (software).
◮ Prof Kathryn Lilley et al., Cambridge Centre for Proteomics, and Dr Claire Mulvey, Cancer Research UK Cambridge Institute (spatial proteomics).
◮ Funding: BBSRC, Wellcome Trust.

Slides: https://zenodo.org/record/1180393

Thank you for your attention

slide-42
SLIDE 42

Supplementary slides: Computational infrastructure

slide-43
SLIDE 43

Figure : Collaboration between packages: Dependency graph containing 41 MS and proteomics-tagged packages (out of 100+) and their dependencies.

slide-44
SLIDE 44

MSnbase example

Figure : Collaboration within packages: Contributions to the MSnbase package (1220 downloads from unique IP addresses in January 2018) since its creation, the last one leading to common proteomics/metabolomics infrastructure. More details: https://lgatto.github.io/msnbase-contribs/

slide-45
SLIDE 45

Supplementary slides: transfer learning

slide-46
SLIDE 46

Application to PPI/Protein complexes

PCA plot (markers), PC1 (47.02%) vs PC2 (22.25%). Classes: 14−3, 19S, 20S, 40S, 60S, CCT, eIF3, Ku70/Ku80, PA28, Rab, unknown.

Figure : Data on proteasome complexes from Fabre et al. Mol Syst Biol (2015), DOI: 10.15252/msb.20145497

slide-47
SLIDE 47

Transfer learning, based on Wu and Dietterich (2004):

Class-weighted kNN

$$V(c_i)_j = \theta^{*} n^{P}_{ij} + (1 - \theta^{*})\, n^{A}_{ij}$$
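The class-weighted kNN vote can be sketched in a few lines. The sketch reads n^P_ij and n^A_ij as the number of nearest neighbours of protein i carrying class j in the primary and auxiliary data respectively, and takes one weight per class, as the "class-weighted" name suggests; both readings are assumptions. The function names are hypothetical and this is not the pRoloc implementation.

```python
from collections import Counter
from typing import Dict, List


def class_weighted_votes(
    primary_neighbours: List[str],
    auxiliary_neighbours: List[str],
    theta: Dict[str, float],
) -> Dict[str, float]:
    """Compute V_j = theta_j * nP_j + (1 - theta_j) * nA_j for each class j.

    primary_neighbours / auxiliary_neighbours hold the class labels of the
    k nearest neighbours found in each data source; theta maps every class
    to a weight in [0, 1] (1 trusts the primary data only).
    """
    n_primary = Counter(primary_neighbours)
    n_auxiliary = Counter(auxiliary_neighbours)
    classes = set(n_primary) | set(n_auxiliary)
    return {
        c: theta[c] * n_primary[c] + (1.0 - theta[c]) * n_auxiliary[c]
        for c in classes
    }


def predict(votes: Dict[str, float]) -> str:
    """Return the class with the largest vote (ties broken alphabetically)."""
    return max(sorted(votes), key=votes.get)


# Toy example with k = 3 neighbours in each source (made-up labels).
primary = ["ER", "ER", "Golgi"]
auxiliary = ["Golgi", "Golgi", "Golgi"]
votes = class_weighted_votes(primary, auxiliary, {"ER": 0.5, "Golgi": 0.5})
print(votes, predict(votes))
```

Setting every weight to 1 recovers ordinary kNN on the primary data, so the experimental data can never be made worse than ignoring the auxiliary source, provided the weights are chosen by cross-validation.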

PCA plot, PC1 (40.28%) vs PC2 (25.7%). Classes: 40S Ribosome, 60S Ribosome, Cytosol, Endoplasmic reticulum, Lysosome, Mitochondrion, Nucleus - Chromatin, Nucleus - Nucleolus, Plasma membrane, Proteasome, unknown.

Linear programming SVM

$$f(x, v; \alpha^{P}, \alpha^{A}, b) = \sum_{l=1}^{m} y_l \left[ \alpha^{P}_{l} K^{P}(x_l, x) + \alpha^{A}_{l} K^{A}(v_l, v) \right] + b$$
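The decision function of the linear programming SVM can be evaluated as in the sketch below, which only computes f for given coefficients and does not solve the linear program that finds them. The RBF kernels, the `gamma` parameters and the toy training tuples are assumptions for illustration; the slide does not state which kernels K^P and K^A are used.

```python
import math
from typing import List, Sequence, Tuple

# One training example: (y_l, alpha_P_l, alpha_A_l, x_l, v_l)
TrainingPoint = Tuple[int, float, float, Sequence[float], Sequence[float]]


def rbf_kernel(a: Sequence[float], b: Sequence[float], gamma: float) -> float:
    """Radial basis function kernel exp(-gamma * ||a - b||^2) (assumed)."""
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def decision_value(
    x: Sequence[float],
    v: Sequence[float],
    training: List[TrainingPoint],
    b: float,
    gamma_p: float = 1.0,
    gamma_a: float = 1.0,
) -> float:
    """f(x, v) = sum_l y_l [alpha_P_l K_P(x_l, x) + alpha_A_l K_A(v_l, v)] + b."""
    total = 0.0
    for y_l, alpha_p, alpha_a, x_l, v_l in training:
        primary_term = alpha_p * rbf_kernel(x_l, x, gamma_p)
        auxiliary_term = alpha_a * rbf_kernel(v_l, v, gamma_a)
        total += y_l * (primary_term + auxiliary_term)
    return total + b


# Toy model: one positive and one negative example (made-up coefficients).
training = [
    (+1, 1.0, 0.5, [0.0], [0.0]),
    (-1, 1.0, 0.5, [2.0], [2.0]),
]
print(decision_value([0.0], [0.0], training, b=0.0))
```

The sign of the returned value gives the predicted class for a binary problem; a multi-class setting would combine several such functions, for example one per class.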
slide-48
SLIDE 48

Figure panels A to E: PCA plots, PC1 (3.43%) vs PC2 (2.08%) and PC1 (40.28%) vs PC2 (25.7%); F1 scores for the combined, primary and auxiliary classifiers, overall and per class; classifier weight (1/3, 2/3, 1) per class. Classes: 40S Ribosome, 60S Ribosome, Cytosol, Endoplasmic reticulum, Lysosome, Mitochondrion, Nucleus - Chromatin, Nucleus - Nucleolus, Plasma membrane, Proteasome, unknown.

Data from mouse stem cells (E14TG2a).

slide-49
SLIDE 49

Panels: PCA plots, PC1 (40.28%) vs PC2 (25.7%) with 10 organelle classes plus unknown, and PC1 (48.41%) vs PC2 (23.85%) with 14 organelle classes plus unknown; classification scores (0.25 to 1.00) for knn, knn-TL, svm and svm-TL, split by outcome (correct, incorrect); ROC curves, FPR (1 − specificity) vs TPR (sensitivity), for k-NN, k-NN TL (Breckels), k-NN TL (Wu), SVM and SVM TL (Breckels).

Figure : From Breckels et al. (2016a) Learning from heterogeneous data sources: an application in spatial proteomics.