

slide-1
SLIDE 1

Promoter-based prediction of gene clusters in eukaryotic genomes

09.03.2018 Göttingen

Ekaterina Shelest

slide-2
SLIDE 2

Part 1. Gene clusters and their discovery

slide-3
SLIDE 3

From promoter models to secondary metabolites

TF

TF Binding Sites

slide-4
SLIDE 4

TF

Co-regulated genes

TF Binding Sites

From promoter models to gene clusters

slide-5
SLIDE 5

TF

TF Binding Site

Co-regulated and co-localized genes: Gene cluster

From promoter models to gene clusters

slide-6
SLIDE 6

Secondary metabolite gene clusters

Aflatoxin, one of the most potent carcinogens

Penicillin

Actinomycin D

Non-ribosomal peptide Polyketide

slide-7
SLIDE 7

Secondary metabolite gene clusters

Aflatoxin (one of the most potent carcinogens), penicillin, actinomycin D

Fungi:

  • Hundreds of substances described but molecular (genetic) basis is unknown
  • Filamentous fungi: on average ~ 40 clusters per genome, most with unknown products
slide-8
SLIDE 8

There are many classes of compounds that are classified as SMs:

  • Polyketides
  • Non-ribosomal peptides
  • Ribosomally synthesized and post-translationally modified peptides
  • Terpenoids
  • Alkaloids
  • Etc.

Secondary metabolites

slide-9
SLIDE 9

There are many classes of compounds that are classified as SMs:

  • Polyketides
  • Non-ribosomal peptides
  • Ribosomally synthesized and post-translationally modified peptides
  • Terpenoids
  • Alkaloids
  • Etc.

OF INTEREST

Actinomycin D Non-ribosomal peptide Polyketide

Secondary metabolites

slide-10
SLIDE 10

Polyketide synthase (PKS)

KS AT ACP KR ER DH ME TE

KS, ketosynthase domain; AT, acyltransferase domain; ACP (PP), acyl carrier protein; KR, ketoacyl reductase domain; ER, enoyl reductase domain; DH, dehydratase domain; ME, methyltransferase domain; TE, thioesterase domain.

Non-ribosomal peptide synthetase (NRPS)

Multi-domain megasynthases (figure: modules 1 to 3, built from A, C, E and PP domains)

A, adenylation domain; T (PP), thiolation or peptidyl carrier domain (with a swinging phosphopantetheine group); C, condensation domain; E, epimerization domain; TE, thioesterase domain.

Large size and typical set of domains => easy detection in genomes!

Domain structure of PKSs and NRPSs

slide-11
SLIDE 11

Problems with detection and prediction of (SM) clusters

  • 1. No unambiguous definition.
  • 2. Pathways (and products) are mostly unknown, so it is hard to predict the set of genes involved in a cluster.
  • 3. Most clusters are silent under laboratory conditions.
  • 4. Clusters are not necessarily conserved.
  • 5. There are no marker genes except for synthases (PKSs, NRPSs, etc.). Some genes (P450, transporters, transcription factors) are often but not always found in clusters.

What to rely on?

  • either genes/proteins or regulation

TF

TF Binding Site

SM gene clusters: problems of detection

slide-12
SLIDE 12

Methods developed so far are based on:

  • Gene / protein annotation
  • Protein similarity (antiSMASH, SMURF, etc.)
  • Expression data (Andersen et al., PNAS 2013)

SM gene clusters: Methods

slide-13
SLIDE 13

Protein similarity-based methods (antiSMASH, SMURF, etc.)

Known clusters -> protein domains of these genes’ products -> library (database) -> comparison of genes in the candidate region to this set -> SM cluster prediction

BUT:

  • there are no marker genes except the anchors;
  • many products and pathways (hence genes) are unknown
slide-14
SLIDE 14

Issues with protein-based tools:

  • Over-estimation of cluster lengths
  • Prediction of “alien” genes as cluster genes
  • No way to differentiate closely located clusters

Protein domain-based prediction (SMURF):

Violaceol cluster Orsellinic acid cluster dbaI (PKS) OrsA (PKS)

No methods based on regulation information!

SM gene clusters: Methods

slide-15
SLIDE 15

Cluster definition: co-regulated and co-localized genes

TF

Cluster

Approach to modeling and prediction: The role of the regulator

slide-16
SLIDE 16

Basic idea: To detect co-localized shared motifs (TFBSs) in the vicinity of the main biosynthetic enzymes (PKSs and NRPSs)

TF

Cluster

Approach to modeling and prediction: The role of the regulator

slide-17
SLIDE 17

Promoter-based method for gene cluster prediction: CASSIS – Cluster ASSociation by Islands of Sites

Approach to modeling and prediction: Role of regulator

slide-18
SLIDE 18

Step 1: Motif search.

Anchor gene -> promoter frames around the anchor, from -15/0 to 0/+15 -> interim sets of promoters -> MEME motif finder* -> over-represented motifs (the best-scoring motif for each frame)

CASSIS method
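The frame notation can be made concrete with a minimal Python sketch. It assumes that a frame such as -15/0 means "15 promoters upstream and 0 downstream of the anchor promoter" and that promoters are indexed in genomic order. The function name `promoter_frames` and the `min_size` cut-off are illustrative assumptions, not taken from the slides or from the CASSIS code.

```python
def promoter_frames(n_promoters, anchor, max_flank=15, min_size=4):
    """Enumerate candidate promoter windows (frames) around an anchor promoter.

    Each frame is returned as (start, end), inclusive indices into the
    genome-ordered promoter list. `min_size` is an assumed lower bound on
    the number of promoters per frame (a motif finder needs several inputs).
    """
    if not 0 <= anchor < n_promoters:
        raise ValueError("anchor index outside the promoter list")
    frames = []
    for up in range(max_flank + 1):          # promoters taken upstream
        for down in range(max_flank + 1):    # promoters taken downstream
            start, end = anchor - up, anchor + down
            if start < 0 or end >= n_promoters:
                continue                     # frame would leave the contig
            if end - start + 1 < min_size:
                continue                     # too few promoters for a motif search
            frames.append((start, end))
    return frames


if __name__ == "__main__":
    frames = promoter_frames(n_promoters=100, anchor=50)
    print(len(frames), "frames; widest:", max(frames, key=lambda f: f[1] - f[0]))
```

Each frame would then be handed to a motif finder such as MEME, and the best-scoring motif per frame kept, as the slide describes.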

slide-19
SLIDE 19

Step 1: Motif search.

Anchor gene -> promoter frames around the anchor, from -15/0 to 0/+15 -> interim sets of promoters -> MEME motif finder*

Step 2: Genome-wide motif search

CASSIS method

slide-20
SLIDE 20

Step 1: Motif search.

Anchor gene -> promoter frames around the anchor, from -15/0 to 0/+15 -> interim sets of promoters -> MEME motif finder*

Step 2: Genome-wide motif search

Step 3: Transforming genomic sequence into a number string

Registering found motifs per promoter (Pr1 Pr2 Pr3 Pr4 ...):

1 0 0 1 0 0 0 0 1 0 0 0 … 0 0 1 1 1 1 0 1 1 1 0 0 …

Step 4: Searching for “islands” of sites

“Island” of numbers = Cluster

CASSIS method
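Step 3 (turning the genome into a number string) can be sketched in a few lines of Python. The sketch assumes the genome-wide motif search yields a flat list of promoter identifiers, one entry per motif occurrence. That input format and the name `number_string` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def number_string(promoter_order, motif_hits):
    """Return the number of motif sites per promoter, in genomic order.

    promoter_order: promoter IDs sorted by genomic position.
    motif_hits:     promoter IDs, one entry per motif site found
                    (assumed output of a genome-wide motif scan).
    """
    site_counts = Counter(motif_hits)
    return [site_counts.get(promoter, 0) for promoter in promoter_order]


if __name__ == "__main__":
    order = ["Pr1", "Pr2", "Pr3", "Pr4"]
    hits = ["Pr1", "Pr4", "Pr4"]        # Pr4 carries two sites
    print(number_string(order, hits))
```

Runs of non-zero values in the resulting list are the "islands" that Step 4 looks for.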

slide-21
SLIDE 21

Step 4: Defining the cluster borders: set of rules

0 0 0 0 3 1 2 1 1 1 0 0 1 0 1 1 2 0 0 1 1 1 0 0 0 0

  • 1. “Gap rule”

CASSIS scans the number string immediately upstream and downstream of the anchor promoter until it hits the first “zero” value (a promoter without a binding site). Gap rule: a gap of up to 2 zero-promoters is allowed. The rule is based on observations of real-life clusters (>30 known eukaryotic SM clusters).

CASSIS method
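The border search with the gap rule can be illustrated with a short Python sketch. It uses the number string shown on this slide and makes two assumptions: the anchor promoter sits at index 6 (the slide does not say where it is), and "gap of 2" means that a run of at most two consecutive zero-promoters is bridged when a non-zero promoter follows. The function names are illustrative, not the CASSIS implementation.

```python
def extend_border(counts, anchor, step, max_gap):
    """Walk from the anchor in one direction (step = -1 or +1).

    Returns the index of the last non-zero promoter reached before a run
    of more than `max_gap` consecutive zero-promoters (or the contig end).
    """
    border, gap = anchor, 0
    i = anchor + step
    while 0 <= i < len(counts):
        if counts[i] > 0:
            border, gap = i, 0      # site found: extend the cluster, reset the gap
        else:
            gap += 1
            if gap > max_gap:       # gap too long: the island ends here
                break
        i += step
    return border


def cluster_borders(counts, anchor, max_gap=2):
    """Return (upstream_border, downstream_border) as inclusive indices."""
    if counts[anchor] == 0:
        raise ValueError("anchor promoter carries no motif site")
    return (extend_border(counts, anchor, -1, max_gap),
            extend_border(counts, anchor, +1, max_gap))


# Number string from the slide; the anchor position is an assumption.
SLIDE_STRING = [int(x) for x in
                "0 0 0 0 3 1 2 1 1 1 0 0 1 0 1 1 2 0 0 1 1 1 0 0 0 0".split()]

if __name__ == "__main__":
    print(cluster_borders(SLIDE_STRING, anchor=6, max_gap=2))  # (4, 21)
    print(cluster_borders(SLIDE_STRING, anchor=6, max_gap=0))  # (4, 9)
```

With no gaps allowed, the island stops at the first zero on each side; allowing gaps of two zero-promoters merges the neighbouring runs into one predicted cluster.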

slide-22
SLIDE 22

Adjustable parameters and their estimation

What can influence the search:

  • 1. MEME and FIMO searches. Refining the latter by adjusting the e-value and p-value cut-offs can be crucial for the whole cluster prediction.
  • 2. Intrinsic CASSIS parameters:

(i) the proportion of promoters with the motif in the genome (reflecting the genome-wide motif frequency);
(ii) the maximal allowed number of “zero” promoters (“gaps”) within the cluster (Gap rule).

All these parameters are estimated using a training set of experimentally verified SM clusters. For the Ascomycete training set, the parameter values were:

  • frequency 14%;
  • gap of 2 zero-promoters.
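
As an illustration of the frequency parameter, the following Python sketch computes the proportion of promoters carrying a motif and compares it with the 14% value. It assumes that the value acts as an upper cut-off, so that motifs found in a larger share of promoters are treated as too unspecific. The slide only gives the number, so this reading and the function names are assumptions.

```python
def motif_frequency(number_string):
    """Fraction of promoters in the genome with at least one motif site."""
    if not number_string:
        raise ValueError("empty number string")
    with_site = sum(1 for n in number_string if n > 0)
    return with_site / len(number_string)


def passes_frequency_cutoff(number_string, max_frequency=0.14):
    """True if the motif is rare enough to be used for cluster prediction.

    Assumption: 14% is an upper bound on the genome-wide motif frequency.
    """
    return motif_frequency(number_string) <= max_frequency


if __name__ == "__main__":
    rare = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]    # motif in 1 of 10 promoters
    common = [1, 1, 0, 0, 0]                 # motif in 2 of 5 promoters
    print(passes_frequency_cutoff(rare), passes_frequency_cutoff(common))
```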

CASSIS method

slide-23
SLIDE 23

CASSIS is applicable to detection of any clusters as long as their genes are co-regulated and co-localized. The type of a cluster is defined by its anchor gene.

CASSIS method

slide-24
SLIDE 24

How to find a Gene Cluster in a genome?

  • 1. Find an anchor gene
  • 2. Find other genes
slide-25
SLIDE 25

How to find a Secondary Metabolite Gene Cluster in a genome?

  • 1. Find an anchor gene -> SMIPS
  • 2. Find other genes (define the borders) -> CASSIS
slide-26
SLIDE 26

SMIPS tool

Based on the prediction of the protein domains (InterProScan)

slide-27
SLIDE 27

KS AT ACP KR ER DH ME

Genome-wide protein domain predictions (InterProScan) + list of typical anchor gene domains -> SMIPS -> predictions of anchor genes

SMIPS tool

Based on the prediction of the protein domains (InterProScan)

slide-28
SLIDE 28

SMIPS

Input: Protein sequences or InterProScan tables

Output: Genome-wide predictions of anchor genes (PKSs, NRPSs, DMATs (dimethylallyl tryptophan synthases))

CASSIS

Input: Genome sequence; feature tables (.gff and alike); anchor gene(s)

Output: Cluster border predictions. Additional information: Shared motifs for each cluster.

SMIPS and CASSIS overview
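
The domain-based anchor prediction can be sketched as a simple rule lookup in Python. The rule sets below reuse the domain abbreviations from the PKS/NRPS slide and are purely illustrative. The real tool works on InterProScan domain predictions and its own list of typical anchor gene domains, neither of which is reproduced here. All names in the sketch are hypothetical.

```python
# Illustrative rules only: minimal domain sets assumed to mark an anchor gene.
ANCHOR_RULES = {
    "PKS": {"KS", "AT", "ACP"},
    "NRPS": {"A", "C", "PP"},
}


def predict_anchor_genes(domains_by_protein, rules=ANCHOR_RULES):
    """Map protein ID -> anchor type for proteins with all required domains.

    domains_by_protein: dict of protein ID -> iterable of domain labels
                        (assumed to be parsed from a domain prediction table).
    """
    predictions = {}
    for protein, domains in domains_by_protein.items():
        found = set(domains)
        for anchor_type, required in rules.items():
            if required <= found:            # all required domains present
                predictions[protein] = anchor_type
                break
    return predictions


if __name__ == "__main__":
    annotations = {
        "geneA": ["KS", "AT", "DH", "KR", "ACP"],
        "geneB": ["A", "PP", "C", "E"],
        "geneC": ["P450"],
    }
    print(predict_anchor_genes(annotations))
```

The predicted anchors would then serve as the anchor gene(s) in the CASSIS input.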

slide-29
SLIDE 29
slide-30
SLIDE 30

Results

  • Cross-validation
  • Comparison with other tools

Assessment of performance

slide-31
SLIDE 31

Results: Cross-validation (LOO)

Assessment of performance

slide-32
SLIDE 32

Results: Cross-validation (LOO); Comparison with other tools

Comparison with other tools

slide-33
SLIDE 33

[Figure: bar chart with an axis from 0.1 to 1, comparing CASSIS, antiSMASH and SMURF]

Comparison of CASSIS with the similarity-based antiSMASH and SMURF tools: re-identification of the 12 test clusters not used for the tools’ training.

  • CASSIS integration into antiSMASH (made in 2017)
  • Users can have 2 types of prediction (protein-based and promoter-based)

Comparison with other tools

slide-34
SLIDE 34
  • Examples. Stories of application
slide-35
SLIDE 35

AN7884 was not characterized until recently.

We analysed the genomic region with CASSIS + synteny prediction:

AN7884 AN7884 AN7875 AN7873 AN7872

Aspercryptin, the story of AN7884

slide-36
SLIDE 36

Aspercryptin, the story of AN7884

slide-37
SLIDE 37

2016: We analysed the genomic region with CASSIS + synteny prediction:

AN7884 AN7884 AN7875 AN7873 AN7872

Aspercryptin, the story of AN7884

slide-38
SLIDE 38

CASSIS + synteny prediction:

AN7884 AN7884 AN7875 AN7873 AN7872

Synteny is a powerful tool!

Aspercryptin, the story of AN7884

slide-39
SLIDE 39


Systems Biology / Bioinformatics group, Hans Knöll Institute, Jena:

  • Vladimir Shelest
  • Thomas Wolf
  • Alina Burmistrova

Experimental work: Applied Molecular Microbiology lab, Hans Knöll Institute, Jena

slide-40
SLIDE 40


Thank you for your attention!

slide-41
SLIDE 41

Inter-cluster cross-regulation

slide-42
SLIDE 42

induced expression

scpR

NRPS TF induction of asperfuranone

HKI Jena, 2010: Activation of the silent NRPSs AN3495/3496

NRPS

inpA inpB

Activation of silent clusters

  • S. Bergmann et al., 2010
  • Chr. II

AN3496 AN3495 AN3492

slide-43
SLIDE 43

induced expression

scpR

NRPS TF

induction of asperfuranone

NRPS

inpA inpB

But asperfuranone is a polyketide!

HKI Jena, 2010: Activation of the silent NRPSs AN3495/3496

Activation of silent clusters

  • S. Bergmann et al., submitted
  • Chr. II

AN3496 AN3495 AN3492

slide-44
SLIDE 44

induced expression

scpR

NRPS TF

  • Chr. II

induction of asperfuranone

NRPS

inpA inpB

But asperfuranone is a polyketide! Induction of a PKS cluster!

HKI Jena, 2010: Activation of the silent NRPSs AN3495/3496

Activation of silent clusters

  • S. Bergmann et al., 2010

AN3496 AN3495 AN3492

slide-45
SLIDE 45

induced expression PKS PKS

AN1034 AN1036

  • Chr. V

afoE AN1029 afoG afoA afoF

NRPS NRPS

  • Chr. II

inpA inpB scpR

asperfuranone biosynthetic cluster:

HOW??

Regulatory cross-talk between the clusters

slide-46
SLIDE 46

induced expression PKS PKS

AN1034 AN1036

  • Chr. V

afoE AN1029 afoG afoA afoF

NRPS NRPS

  • Chr. II

inpA inpB scpR

PRODUCT ??

Regulatory cross-talk between the clusters

slide-47
SLIDE 47

induced expression PKS PKS

AN1034 AN1036

  • Chr. V

afoE AN1029 afoG afoA afoF

NRPS NRPS

  • Chr. II

inpA inpB scpR

PRODUCT ??

Regulatory cross-talk between the clusters

slide-48
SLIDE 48

PKS PKS

AN1034 AN1036

  • Chr. V

afoE AN1029 afoG afoA afoF

NRPS NRPS

  • Chr. II

inpA inpB scpR

C2H2 TF Zn(2)-Cys(6) TF

Regulatory cross-talk between wet-lab and bioinformatics

slide-49
SLIDE 49

PKS PKS

AN1034 AN1036

  • Chr. V

afoE AN1029 afoG afoA afoF

NRPS NRPS

  • Chr. II

inpA inpB scpR

C2H2 TF Zn(2)-Cys(6) TF

Suggestion from the bioinformatics side: deletion of the afoA TF (AN1029)

Regulatory cross-talk between wet-lab and bioinformatics

slide-50
SLIDE 50

PKS PKS

AN1034 AN1036

  • Chr. V

afoE AN1029 afoG afoA afoF

NRPS NRPS

  • Chr. II

inpA inpB scpR

C2H2 TF Zn(2)-Cys(6) TF

Suggestion from the bioinformatics side: deletion of the afoA TF (AN1029)

induced expression

asperfuranone

Regulatory cross-talk between wet-lab and bioinformatics

slide-51
SLIDE 51

PKS PKS

AN1034 AN1036

  • Chr. V

afoE AN1029 afoG afoA afoF

NRPS NRPS

  • Chr. II

inpA inpB scpR

C2H2 TF Zn(2)-Cys(6) TF

induced expression

asperfuranone

does not work

Regulatory cross-talk between wet-lab and bioinformatics

slide-52
SLIDE 52

PKS PKS

AN1034 AN1036

  • Chr. V

afoE AN1029 afoG afoA afoF

NRPS NRPS

  • Chr. II

inpA inpB scpR

C2H2 TF

Common binding sites for a C2H2 TF??

Regulatory cross-talk between wet-lab and bioinformatics

slide-53
SLIDE 53

PKS PKS

AN1034 AN1036

  • Chr. V

afoE AN1029 afoG afoA afoF

NRPS NRPS

  • Chr. II

inpA inpB scpR

C2H2 TF

Common binding sites for a C2H2 TF??

induced expression

Regulatory cross-talk between wet-lab and bioinformatics

slide-54
SLIDE 54

The MEME search in the upstream sequences:

Sequence name          Strand  Start  P-value   Site
---------------------  ------  -----  --------  ----------------------------------
AN1029-30_interg1371   +       915    1.12e-07  AGAACGTGGT CTAAAGGATTGA GCTGACGATG
AN3496.4/AN3495.4      -       630    3.39e-07  TAACGATTAG CAAAAGGATTGA CTAAATCAAG
AN1029-30_interg1371   -       1002   5.65e-07  AGCCACTAGC CTAAAGGAATCA GACCTTTAAT
AN3491.4/AN3490.4      +       476    1.03e-06  CATCACCCGT CCAAAGGATGCA CCAAGGAACA

  • This motif is very similar to that of RME1, a yeast zinc-finger transcription factor with a Cys2His2 domain:
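
To see what the four reported sites have in common, a position count matrix and a simple majority consensus can be derived from them. The Python sketch below uses only the core site sequences from the MEME output above, without the flanking 10-mers. The tie-breaking rule (alphabetical order of bases) is an arbitrary choice made for this illustration, and the functions are not part of MEME.

```python
# Core sites copied from the MEME output on this slide.
SITES = [
    "CTAAAGGATTGA",
    "CAAAAGGATTGA",
    "CTAAAGGAATCA",
    "CCAAAGGATGCA",
]


def count_matrix(sites):
    """Return one dict per motif position with the counts of A, C, G and T."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    return [
        {base: sum(site[i] == base for site in sites) for base in "ACGT"}
        for i in range(length)
    ]


def consensus(sites):
    """Majority base per position; ties go to the alphabetically first base."""
    return "".join(max("ACGT", key=column.get) for column in count_matrix(sites))


if __name__ == "__main__":
    for position, column in enumerate(count_matrix(SITES), start=1):
        print(position, column)
    print(consensus(SITES))
```

Positions where all four sites agree form the conserved core of the motif. Position 11 is split evenly between C and G, so the consensus letter there depends only on the tie-break.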

Regulatory cross-talk between wet-lab and bioinformatics

slide-55
SLIDE 55