SLIDE 1 Duccio Medini, Cellular Microbiology and Bioinformatics Unit, Novartis Vaccines, Siena (I)
A simple tool from a complex system: high-throughput, unsupervised generation of Protein Families from the Protein Homology Network.
SLIDE 2
What is the Protein Homology Network (PHN)?
SLIDE 3
PHN: definitions
251 complete genomes → 761,260 predicted proteins
Nodes → Proteins
Links → Homology relations: BLAST alignments with E-score < ε (cut-off)
Connected Component → group of proteins connected by a path
[Figure labels: Component A, Component B]
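The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used for the talk: the function names `build_phn` and `connected_components`, the protein identifiers, and the toy E-scores are all hypothetical, and the input is assumed to be a list of pre-computed pairwise BLAST hits.

```python
def build_phn(proteins, hits, cutoff):
    """Build an undirected graph: nodes are proteins, links are hits with E-score < cutoff.

    `hits` is assumed to be an iterable of (protein_a, protein_b, e_score) tuples.
    """
    graph = {p: set() for p in proteins}
    for a, b, e_score in hits:
        if a != b and e_score < cutoff:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Return the groups of nodes that are connected by a path (depth-first search)."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return components


# Toy data (hypothetical identifiers and E-scores, for illustration only).
toy_proteins = ["p1", "p2", "p3", "p4", "p5"]
toy_hits = [("p1", "p2", 1e-50), ("p2", "p3", 1e-8), ("p4", "p5", 1e-3)]

loose = connected_components(build_phn(toy_proteins, toy_hits, cutoff=1e-5))
strict = connected_components(build_phn(toy_proteins, toy_hits, cutoff=1e-10))
print(len(loose), len(strict))
```

With the looser cut-off the first three toy proteins fall into one component; tightening the cut-off removes the weaker link and splits them apart.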
SLIDE 4
PHN: snapshot of a small portion (1/20)
Full: 760,000 proteins and 7×10^7 links (at ε = 10^-5)
SLIDE 5 The structure of the PHN depends on the homology cut-off ε
ε = 10^-200, 10^-100
Several relationships missed
SLIDE 6 The structure of the PHN depends on the homology cut-off ε
ε = 10^-80, 10^-40
Several relationships missed + "strange" connections!
SLIDE 7 The structure of the PHN depends on the homology cut-off ε
ε = 10^-30, 10^-10
Still missed + several inter-family connections
SLIDE 8 The structure of the PHN depends on the homology cut-off ε
ε = 10^-5
The "giant component" dominates the network
SLIDE 9
PHN: the giant connected component
At ε = 10^-5, 63% of the proteins are in the giant component.
[Figure: fraction of nodes included in the largest connected component]
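The quantity plotted here is the size of the largest connected component divided by the total number of nodes. A minimal sketch of that computation follows, using a union-find structure; the function name `giant_component_fraction` and the toy network are hypothetical and are not taken from the talk.

```python
from collections import Counter


def giant_component_fraction(nodes, links):
    """Fraction of nodes that belong to the largest connected component."""
    nodes = list(nodes)
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent pointers to the root, shortening the path on the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    sizes = Counter(find(n) for n in nodes)
    return max(sizes.values()) / len(nodes)


# Toy network (hypothetical): 8 nodes, a chain of 5, a pair, and one isolated node.
toy_links = [(0, 1), (1, 2), (2, 3), (3, 4), (5, 6)]
print(giant_component_fraction(range(8), toy_links))  # 5 of 8 nodes -> 0.625
```

Repeating the call on the links that survive each cut-off ε would give one point of the curve per cut-off.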
SLIDE 10 PHN topology
Proximity of a node: clustering index C
$C_i = \frac{2 E_i}{k_i (k_i - 1)}; \quad C = \langle C_i \rangle$
Connected components: compactness index η
$\eta_i = \frac{k_i}{M - 1}; \quad \eta = \langle \eta_i \rangle$
Albert R, Barabasi AL (2002) Reviews of Modern Physics 74: 47-97
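Both indices can be computed directly from an adjacency structure. The sketch below makes some assumptions that the slide does not spell out: k_i is the degree of node i, E_i is the number of links among the neighbors of i, M is the number of nodes in the connected component, and η is the average of k_i / (M - 1) over the component. The function names and the toy graph are hypothetical.

```python
def clustering_index(graph, node):
    """C_i = 2 E_i / (k_i (k_i - 1)), with E_i the number of links among the neighbors of i."""
    neighbors = graph[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # convention chosen here for nodes with fewer than two neighbors
    # Each link among neighbors is seen from both ends, hence the division by 2.
    links_among_neighbors = sum(
        1 for u in neighbors for v in graph[u] if v in neighbors
    ) // 2
    return 2 * links_among_neighbors / (k * (k - 1))


def compactness_index(graph, component):
    """Average of k_i / (M - 1) over a component of M nodes (assumed definition)."""
    m = len(component)
    if m < 2:
        return 0.0
    return sum(len(graph[n]) / (m - 1) for n in component) / m


# Toy graph (hypothetical): a triangle a-b-c with one extra node d attached to c.
toy_graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(clustering_index(toy_graph, "a"))       # both neighbors of a are linked
print(clustering_index(toy_graph, "c"))       # 1 of 3 possible links among neighbors
print(compactness_index(toy_graph, toy_graph.keys()))
```

Under this definition a fully connected component has η = 1, and η falls as the component becomes sparser.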
SLIDE 11
How do we identify Protein Families?
[Figure labels: Family "A", Family "B"]
SLIDE 12 Overlap measure: neighborhood similarity
$\theta_{ij} = \frac{n_{ij}}{\max(k_i, k_j)}$
We define the overlap θij of two nodes i, j as the normalized fraction of nearest neighbors that they have in common.
θ is designed to identify pairs sharing a large fraction of their nearest neighbors.
[Figure examples: ki = 10, kj = 8, nij = 3 gives θij = 0.3; for three nodes i, j, k: θij = 0, θjk ≈ 1]
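The overlap is simple to compute once the neighbor sets are known. The sketch below reproduces the slide's worked example (ki = 10, kj = 8, nij = 3); the function name `overlap` and the neighbor labels are hypothetical, and the example assumes that i and j are not counted among each other's neighbors.

```python
def overlap(neighbors_i, neighbors_j):
    """theta_ij = n_ij / max(k_i, k_j), with n_ij the number of shared nearest neighbors."""
    k_max = max(len(neighbors_i), len(neighbors_j))
    if k_max == 0:
        return 0.0  # convention chosen here for two isolated nodes
    return len(neighbors_i & neighbors_j) / k_max


# Worked example from the slide, with hypothetical neighbor labels.
shared = {"s1", "s2", "s3"}                          # n_ij = 3
neighbors_i = shared | {f"x{n}" for n in range(7)}   # k_i = 10
neighbors_j = shared | {f"y{n}" for n in range(5)}   # k_j = 8
print(overlap(neighbors_i, neighbors_j))             # 3 / 10 = 0.3
```

Normalizing by the larger degree keeps θ low when a small node shares all its neighbors with a much larger hub.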
SLIDE 13 The modularity measure Q
$Q = \sum_i \left( b_i - a_i^2 \right)$
Q: correspondence of a network partitioning to the network modular structure (Newman MEJ, Girvan M (2004) Physical Review E 69: 26113-26127)
ai = fraction of edges with at least one end in the i-th component, bi = fraction of edges with both ends in the i-th component.
PHN-Families: connected components for θ = 0.5.
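As an illustration, Q can be evaluated for any candidate partition by counting edges. The sketch below follows the wording of the slide literally (ai counts every edge that touches component i); the cited paper may normalize ai differently, so the numbers should be read as an example of the formula as stated here. The function name `modularity` and the toy network are hypothetical.

```python
def modularity(links, membership):
    """Q = sum_i (b_i - a_i^2) for a partition given as a node -> component mapping."""
    total = len(links)
    both_ends = {}   # component -> number of edges with both ends inside it
    any_end = {}     # component -> number of edges with at least one end inside it
    for a, b in links:
        comp_a, comp_b = membership[a], membership[b]
        for comp in {comp_a, comp_b}:
            any_end[comp] = any_end.get(comp, 0) + 1
        if comp_a == comp_b:
            both_ends[comp_a] = both_ends.get(comp_a, 0) + 1
    return sum(
        both_ends.get(comp, 0) / total - (count / total) ** 2
        for comp, count in any_end.items()
    )


# Toy network (hypothetical): two triangles joined by a single bridge edge.
toy_links = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
two_groups = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
one_group = {n: "A" for n in range(1, 7)}
print(modularity(toy_links, two_groups))  # 2 * (3/7 - (4/7)**2) = 10/49
print(modularity(toy_links, one_group))   # 1 - 1**2 = 0
```

Splitting the two triangles scores higher than leaving everything in one group, which is the behavior a modularity measure is expected to show.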
SLIDE 14 Comparison to PFAM (~75% testable)
Removed links (76.4% confirmed)
⟨ε⟩ | Fraction | Protein classification
10^-10 | 23.6% | single domain, shared
10^-87 | 68.3% | multi-domains
10^-10 | 8.1% | do not share a domain
Added links (98.5% confirmed)
⟨θij⟩ | Fraction | Protein classification
0.68 | 98.5% | share a domain
0.58 | 1.5% | do not share a domain
SLIDE 15
Summary: the PHN-Families Algorithm
SLIDE 16 Result: PHN-Families
28,226 PHN-Families (giant component disconnected into 14,443 PHN-Families + 26,000 isolated proteins)
[Figure panels: before partitioning, after partitioning]
SLIDE 17 How can we use Protein Families?
1. Enhanced annotation of new genomic sequences
2. Whole genome profiling and comparison
3. Identification and study of bacterial organelles
SLIDE 18 How can we use Protein Families?
1. Enhanced annotation of new genomic sequences
2. Whole genome profiling and comparison
3. Identification and study of bacterial organelles
SLIDE 19
Protein Families as discrete characters: the genomic matrix
[Figure axes: Microorganisms × Protein Families (functions)]
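One way to picture the genomic matrix is as a table with one row per microorganism and one column per protein family. The sketch below assumes a binary presence/absence encoding, which the slide does not specify; the function name `genomic_matrix`, the genome names, and the family labels are hypothetical.

```python
def genomic_matrix(assignments):
    """Build a presence/absence matrix from (genome, family) pairs.

    Returns (genomes, families, matrix), where matrix[r][c] is 1 if genome r
    has at least one protein in family c, else 0.
    """
    present = set(assignments)
    genomes = sorted({genome for genome, _ in present})
    families = sorted({family for _, family in present})
    matrix = [
        [1 if (genome, family) in present else 0 for family in families]
        for genome in genomes
    ]
    return genomes, families, matrix


# Toy assignments (hypothetical genomes and family labels).
toy_assignments = [
    ("genome_A", "fam_1"), ("genome_A", "fam_2"),
    ("genome_B", "fam_2"), ("genome_B", "fam_3"),
    ("genome_C", "fam_1"),
]
genomes, families, matrix = genomic_matrix(toy_assignments)
for genome, row in zip(genomes, matrix):
    print(genome, row)
```

Each row is then a discrete character vector for one genome, which is the form needed to compare whole-genome profiles.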
SLIDE 20
PHN-Family profiles: genomic signatures
[Figure labels: Bacillales, Archaea]
SLIDE 21 How can we use Protein Families?
1. Enhanced annotation of new genomic sequences
2. Whole genome profiling and comparison
3. Identification and study of bacterial organelles
SLIDE 22
From PHN-Families to bacterial organelles
A classification of proteins into families makes it possible to recognize the similarities between complex structures, even if some individual components are missing, different, or placed in an unexpected position.
SLIDE 23 Can we group all the building blocks of Type IV Secretion Systems?
We selected 12 major structural components from 6 reference T4SS belonging to A. tumefaciens, IncN R46, B. suis, B. pertussis, and H. pylori, which provide a good sampling of the diversity of known TTSSs.
Covacci et al., Science (1999) 284, 1328-33.
Functional Class | PHN-Families | Proteins (per family)
VirB1 | 2 | 4, 42
VirB2 | 4 | 1, 7, 9, 18
VirB3 | 3 | 1, 13, 19
VirB4 | 1 | 228
VirB5 | 2 | 7, 46
VirB6 | 2 | 3, 117
VirB7 | 6 | 1, 1, 3, 5, 7, 7
VirB8 | 2 | 2, 69
VirB9 | 2 | 2, 127
VirB10 | 1 | 119
VirB11 | 1 | 724
VirD4 | 1 | 174
SLIDE 24 Evolutionary diversification of Type IV SS
Conserved core / Variable set
Groups of probably co-evolved Type IV SS: A, B, C, D
SLIDE 25
PHN-Families are coherent with molecular phylogenesis
[Figure labels: 230 point accepted mutations, 180 point accepted mutations]
SLIDE 26 Conclusions
The complex system: The Protein Homology Network is formed by interconnected clusters (families of homologous proteins).
The simple tool: We have developed a computational method to identify these groups of proteins, the PHN-Families, an unsupervised classification of quality comparable to collections curated by human experts.
The huge amount of genomic data produced can be classified before expert curation, to study: whole genomes / organelles / specific families.
Integration with Pfam and other databases will connect PHN-Fams to experimental data.
SLIDE 27 Acknowledgements
Claudio Donati, Antonello Covacci
The BioInformatic Unit (NV&D)
Alessandro Muzzi, Nicola Pacchiani, Roberto Palmas, Riccardo Beltrami
Medini D, Covacci A, Donati C, Protein Homology Network Families Reveal Step-Wise Diversification of Type III and Type IV Secretion Systems, PLoS Computational Biology Vol. 2, No. 12, e173
The Pfam group (The WT Sanger Institute)
Robert Finn