A simple tool from a complex system - PowerPoint PPT Presentation



slide-1
SLIDE 1

Duccio Medini
Cellular Microbiology and Bioinformatics Unit, Novartis Vaccines, Siena (I)

A simple tool from a complex system: high-throughput, unsupervised generation of Protein Families from the Protein Homology Network.

slide-2
SLIDE 2

What is the Protein Homology Network (PHN)?

slide-3
SLIDE 3

PHN: definitions

251 complete genomes → 761,260 predicted proteins

Nodes → Proteins

Links → BLAST alignments with E-score < ε (cut-off), i.e. homology relations between proteins

Connected Component → group of proteins connected by a path.

[Figure: two connected components, labeled Component A and Component B]
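The definitions above can be sketched in a few lines of Python. This is only an illustration, not the authors' pipeline: the `(protein_a, protein_b, e_value)` tuple format, the function name `connected_components`, and the toy hits are all assumptions made for the example. Links are kept when the E-score is below ε, and components are found with a union-find structure.

```python
from collections import defaultdict

def connected_components(hits, epsilon):
    """Group proteins into connected components of the homology network.

    hits: iterable of (protein_a, protein_b, e_value) tuples (hypothetical format).
    epsilon: E-score cut-off; only alignments with e_value < epsilon become links.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b, e_value in hits:
        # Register both proteins as nodes even if the hit fails the cut-off.
        root_a, root_b = find(a), find(b)
        if e_value < epsilon and root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for node in parent:
        groups[find(node)].add(node)
    # Largest component first.
    return sorted(groups.values(), key=len, reverse=True)

# Toy data, invented for illustration only.
hits = [
    ("p1", "p2", 1e-50),
    ("p2", "p3", 1e-12),
    ("p4", "p5", 1e-8),
    ("p3", "p4", 1e-3),  # too weak to be a link at epsilon = 1e-5
]
components = connected_components(hits, epsilon=1e-5)
print(components)
```

With the toy hits, the weak p3/p4 alignment is discarded at ε = 1e-5, so the five proteins split into two components.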

slide-4
SLIDE 4

PHN: snapshot of a small portion (1/20)

Full: 760,000 proteins and 7 × 10^7 links (at ε = 10^-5)

slide-5
SLIDE 5

The structure of the PHN depends on the homology cut-off ε

ε = 10^-200 ÷ 10^-100

Several relationships missed

slide-6
SLIDE 6

The structure of the PHN depends on the homology cut-off ε

ε = 10^-80 ÷ 10^-40

Several relationships missed + "strange" connections!

slide-7
SLIDE 7

The structure of the PHN depends on the homology cut-off ε

ε = 10^-30 ÷ 10^-10

Some relationships still missed + several inter-family

slide-8
SLIDE 8

The structure of the PHN depends on the homology cut-off ε

ε = 10^-5

The "giant component" dominates the network

slide-9
SLIDE 9

PHN: the giant connected component

At ε = 10^-5, 63% of the proteins are in the giant component.

[Plot: fraction of nodes included in the largest connected component]
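The quantity plotted on this slide can be computed as sketched below. The function name `giant_component_fraction`, the input format, and the toy hits are assumptions for illustration; the 63% figure comes from the real network and is not reproduced here.

```python
from collections import defaultdict, deque

def giant_component_fraction(proteins, hits, epsilon):
    """Fraction of proteins in the largest connected component at cut-off epsilon."""
    neighbors = defaultdict(set)
    for a, b, e_value in hits:
        if e_value < epsilon:
            neighbors[a].add(b)
            neighbors[b].add(a)

    seen, largest = set(), 0
    for start in proteins:
        if start in seen:
            continue
        # Breadth-first search over the component containing `start`.
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            node = queue.popleft()
            size += 1
            for other in neighbors[node] - seen:
                seen.add(other)
                queue.append(other)
        largest = max(largest, size)
    return largest / len(proteins)

# Toy data, invented for illustration only.
proteins = ["p1", "p2", "p3", "p4", "p5"]
hits = [("p1", "p2", 1e-120), ("p2", "p3", 1e-35), ("p3", "p4", 1e-8)]
fractions = {eps: giant_component_fraction(proteins, hits, eps)
             for eps in (1e-100, 1e-30, 1e-5)}
print(fractions)
```

Relaxing ε admits weaker alignments as links, so the largest component can only grow. This is the trend the slides describe from ε = 10^-200 down to ε = 10^-5.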

slide-10
SLIDE 10

PHN topology

Proximity of a node: clustering index C

$$C_i = \frac{2E_i}{k_i(k_i - 1)}; \quad C = \langle C_i \rangle$$

Connected components: compactness index η

$$\eta_i = \frac{k_i}{M - 1}; \quad \eta = \langle \eta_i \rangle$$

Albert R, Barabasi AL (2002) Reviews of Modern Physics 74: 47-97
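A minimal sketch of the two indices follows. The slide does not define its symbols, so the code assumes that k_i is the degree of node i, that E_i is the number of links among the neighbors of i (the usual clustering-coefficient convention), and that M is the number of nodes in the connected component. The function names and the toy graph are hypothetical.

```python
def clustering_index(neighbors, node):
    """C_i = 2 E_i / (k_i (k_i - 1)); E_i counts links among the neighbors of node."""
    nbrs = neighbors[node]
    k = len(nbrs)
    if k < 2:
        return 0.0  # convention chosen here for nodes with fewer than two neighbors
    # Each neighbor-neighbor link is seen from both of its ends, so halve the count.
    links = sum(len(neighbors[n] & nbrs) for n in nbrs) // 2
    return 2 * links / (k * (k - 1))

def compactness_index(neighbors, component):
    """eta = average of k_i / (M - 1) over the M nodes of one component (assumed reading)."""
    m = len(component)
    if m < 2:
        return 0.0
    return sum(len(neighbors[n]) / (m - 1) for n in component) / m

# Toy graph: triangle a-b-c with a pendant node d attached to a.
neighbors = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
c_a = clustering_index(neighbors, "a")
eta = compactness_index(neighbors, set(neighbors))
print(c_a, eta)
```

Under these assumptions a fully connected component has η = 1, and η decreases as the component becomes sparser.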

slide-11
SLIDE 11

How do we identify Protein Families?

[Figure: Family "A", Family "B"]

slide-12
SLIDE 12

Overlap measure: neighborhood similarity

$$\theta_{ij} = \frac{n_{ij}}{\max(k_i, k_j)}$$

We define the overlap θij of two nodes i, j as the normalized fraction of nearest neighbors that they have in common.

[Figure examples: ki = 10, kj = 8, nij = 3 gives θij = 0.3; other node pairs illustrate θij = 0 and θjk ≈ 1.]

θ is designed to identify pairs of nodes sharing a large fraction of their nearest neighbors.
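The overlap formula translates directly into code. The sketch below rebuilds the slide's numeric example (ki = 10, kj = 8, nij = 3); the adjacency-set representation and the node names are assumptions made for illustration.

```python
def overlap(neighbors, i, j):
    """theta_ij = n_ij / max(k_i, k_j), with n_ij the number of shared neighbors."""
    k_i, k_j = len(neighbors[i]), len(neighbors[j])
    if max(k_i, k_j) == 0:
        return 0.0  # two isolated nodes share nothing
    n_ij = len(neighbors[i] & neighbors[j])
    return n_ij / max(k_i, k_j)

# Hypothetical neighborhoods: 3 shared neighbors, 7 private to i, 5 private to j.
shared = {"s1", "s2", "s3"}
neighbors = {
    "i": shared | {f"a{n}" for n in range(7)},  # k_i = 10
    "j": shared | {f"b{n}" for n in range(5)},  # k_j = 8
}
theta = overlap(neighbors, "i", "j")
print(theta)
```

Normalizing by the larger of the two degrees means that a hub sharing a few neighbors with a small node still gets a low overlap.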

slide-13
SLIDE 13

The modularity measure Q

$$Q = \sum_i \left( b_i - a_i^2 \right)$$

Q: correspondence of a network partitioning to the network modular structure (Newman MEJ, Girvan M (2004) Physical Review E 69: 26113-26127)

a_i = fraction of edges with at least one end in the i-th component, b_i = fraction of edges with both ends in the i-th component.
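A small sketch of a modularity computation is given below. Note the assumption: the code follows the edge-end convention of Newman and Girvan, where a_i is the fraction of edge ends attached to group i, so an edge crossing between two groups contributes half to each. This is one reading of the slide's wording, not necessarily the exact quantity the authors computed. The function name and the toy graph are hypothetical.

```python
def modularity(edges, membership):
    """Q = sum_i (b_i - a_i**2) for an undirected edge list and a node -> group map."""
    total = len(edges)
    if total == 0:
        return 0.0
    inside = {}  # group -> number of edges with both ends in the group
    ends = {}    # group -> number of edge ends attached to the group
    for u, v in edges:
        gu, gv = membership[u], membership[v]
        ends[gu] = ends.get(gu, 0) + 1
        ends[gv] = ends.get(gv, 0) + 1
        if gu == gv:
            inside[gu] = inside.get(gu, 0) + 1
    return sum(
        inside.get(g, 0) / total - (ends[g] / (2 * total)) ** 2
        for g in ends
    )

# Toy graph: two triangles joined by a single bridge (c, d).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in two_groups}
q_split = modularity(edges, two_groups)
q_whole = modularity(edges, one_group)
print(q_split, q_whole)
```

Splitting the toy graph at the bridge scores higher than leaving it as one group, which is the kind of comparison Q is used for when choosing a partition.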

PHN-Families: connected components for θ = 0.5.

slide-14
SLIDE 14

Comparison to PFAM (~75% testable)

Removed links (76.4% confirmed)

〈ε_ij〉    Fraction   Protein classification
10^-10    23.6%      single domain, shared
10^-87    68.3%      one or two multi-domains
10^-10    8.1%       do not share a domain

Added links (98.5% confirmed)

〈θij〉     Fraction   Protein classification
0.58      1.5%       do not share a domain
0.68      98.5%      share a domain

slide-15
SLIDE 15

Summary: the PHN-Families Algorithm
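The slide itself is a diagram, so the following is only a simplified sketch pieced together from the preceding slides: compute the overlap θ for every linked pair, drop links with θ below 0.5, and take the connected components of what remains as families. The deck also mentions added links, which this sketch leaves out. The function name `phn_families`, the symmetric adjacency-set input, and the toy graph are assumptions.

```python
def phn_families(neighbors, theta_min=0.5):
    """Drop links whose overlap is below theta_min, then return connected components.

    neighbors: dict mapping each node to the set of its neighbors (symmetric).
    """
    def overlap(i, j):
        # i and j are linked here, so both degrees are at least 1.
        return len(neighbors[i] & neighbors[j]) / max(len(neighbors[i]), len(neighbors[j]))

    kept = {
        node: {other for other in nbrs if overlap(node, other) >= theta_min}
        for node, nbrs in neighbors.items()
    }

    families, seen = [], set()
    for start in kept:
        if start in seen:
            continue
        seen.add(start)
        stack, family = [start], set()
        while stack:  # depth-first search over the filtered links
            node = stack.pop()
            family.add(node)
            for other in kept[node] - seen:
                seen.add(other)
                stack.append(other)
        families.append(family)
    return families

# Toy graph: two 4-cliques joined by one bridge (d, e).
neighbors = {
    "a": {"b", "c", "d"}, "b": {"a", "c", "d"}, "c": {"a", "b", "d"},
    "d": {"a", "b", "c", "e"},
    "e": {"d", "f", "g", "h"},
    "f": {"e", "g", "h"}, "g": {"e", "f", "h"}, "h": {"e", "f", "g"},
}
families = phn_families(neighbors)
print(families)
```

In the toy graph the bridge endpoints share no neighbors, so their overlap is 0 and the link is removed, leaving one family per clique.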

slide-16
SLIDE 16

Result: PHN-Families

28,226 PHN-Families (giant component disconnected into 14,443 PHN-Families + 26,000 isolated proteins)

[Figure: the network before partitioning and after partitioning]

slide-17
SLIDE 17

How can we use Protein Families?

1. Enhanced annotation of new genomic sequences
2. Whole genome profiling and comparison
3. Identification and study of bacterial organelles

slide-18
SLIDE 18

How can we use Protein Families?

1. Enhanced annotation of new genomic sequences
2. Whole genome profiling and comparison
3. Identification and study of bacterial organelles

slide-19
SLIDE 19

Protein Families as discrete characters: the genomic matrix

[Figure: genomic matrix with axes Microorganisms and Protein Families (functions)]
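One way to read "families as discrete characters" is a presence/absence matrix with one row per genome and one column per family. The sketch below assumes that binary encoding. The function name, the two input dictionaries, and all identifiers in the toy data are hypothetical.

```python
def genomic_matrix(family_of, genome_of):
    """Return (genomes, families, matrix) with matrix[g][f] = 1 if genome g has family f.

    family_of: protein -> family identifier
    genome_of: protein -> genome identifier
    """
    genomes = sorted(set(genome_of.values()))
    families = sorted(set(family_of.values()))
    row = {g: r for r, g in enumerate(genomes)}
    col = {f: c for c, f in enumerate(families)}
    matrix = [[0] * len(families) for _ in genomes]
    for protein, family in family_of.items():
        matrix[row[genome_of[protein]]][col[family]] = 1
    return genomes, families, matrix

# Toy assignment, invented for illustration only.
family_of = {"p1": "F1", "p2": "F2", "p3": "F1", "p4": "F3"}
genome_of = {"p1": "genome_A", "p2": "genome_A", "p3": "genome_B", "p4": "genome_B"}
genomes, families, matrix = genomic_matrix(family_of, genome_of)
for name, profile in zip(genomes, matrix):
    print(name, profile)
```

Each row is then a profile of the genome that can be compared across organisms, which is how the next slide uses the term "genomic signatures".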

slide-20
SLIDE 20

PHN-Family profiles: genomic signatures

[Figure labels: Bacillales, Archaea]

slide-21
SLIDE 21

How can we use Protein Families?

1. Enhanced annotation of new genomic sequences
2. Whole genome profiling and comparison
3. Identification and study of bacterial organelles

slide-22
SLIDE 22

From PHN-Families to bacterial organelles

A classification of proteins into families makes it possible to recognize similarities between complex structures, even if some individual components are missing, different, or placed in an unexpected position.

slide-23
SLIDE 23

Can we group all the building blocks of Type IV Secretion Systems?

Selected 12 major structural components from 6 reference T4SS belonging to A. tumefaciens, IncN R46, B. suis, B. pertussis, and H. pylori, which provide a good sampling of the diversity of known TTSSs.

Covacci et al., Science (1999) 284, 1328-33.

Functional Class   PHN-Families   Proteins per family
VirB1              2              42, 4
VirB2              4              18, 9, 7, 1
VirB3              3              19, 13, 1
VirB4              1              228
VirB5              2              46, 7
VirB6              2              117, 3
VirB7              6              7, 7, 5, 3, 1, 1
VirB8              2              69, 2
VirB9              2              127, 2
VirB10             1              119
VirB11             1              724
VirD4              1              174

slide-24
SLIDE 24

Evolutionary diversification of Type IV SS

[Figure: conserved core and variable set; groups of probably co-evolved Type IV SS labeled A, B, C, D]

slide-25
SLIDE 25

PHN-Families are coherent with molecular phylogeny

[Figure labels: 230 point accepted mutations; 180 point accepted mutations]

slide-26
SLIDE 26

Conclusions

  • The complex system: the Protein Homology Network is formed by interconnected clusters (families of homologous proteins).
  • The simple tool: we have developed a computational method to identify these groups of proteins, the PHN-Families, an unsupervised classification of quality comparable to collections curated by human experts.
  • The huge amount of genomic data produced can be classified before expert curation, to study: whole genomes / organelles / specific families.
  • Integration with Pfam and other databases will connect PHN-Fams to experimental data.

slide-27
SLIDE 27

Acknowledgements

Claudio Donati, Antonello Covacci

The BioInformatic Unit (NV&D): Alessandro Muzzi, Nicola Pacchiani, Roberto Palmas, Riccardo Beltrami

The Pfam group (The WT Sanger Institute): Robert Finn

Medini D, Covacci A, Donati C, Protein Homology Network Families Reveal Step-Wise Diversification of Type III and Type IV Secretion Systems, PLoS Computational Biology Vol. 2, No. 12, e173