Efficient processing of Hi-C data and application to cancer



SLIDE 1

Efficient processing of Hi-C data and application to cancer

Nicolas Servant, PhD

Institut Curie, INSERM U900, Mines ParisTech, PSL-Research University
4th of December 2019
Hi-C days, Toulouse

SLIDE 2

Spatial organization of the genome

How are 2 meters of DNA packed into a 10 µm diameter nucleus?

Adapted from Annunziato et al. 2008

Figure: Nucleosome (histone/DNA), H1 histone, chromatin fibre

SLIDE 3

Different levels of spatial organization

Figure: Nucleus

SLIDE 4

Hi-C captures the chromatin conformation within the nucleus

Lieberman-Aiden et al. 2009, Rao et al. 2014

Figure: whole-genome map (chr1 to chr6) and intrachromosomal contact maps of chr6 at different resolutions (bin sizes), obtained from a population of cells; color scale: contact frequencies, from low to high

SLIDE 5

Genome organization and Hi-C

Lieberman-Aiden et al. 2009, Dixon et al. 2012, Rao et al. 2014

Figure: contact maps (chr1 to chr4, and chr2 alone)

SLIDE 6

Genome organization and Hi-C

Lieberman-Aiden et al. 2009, Dixon et al. 2012, Rao et al. 2014

Figure: contact maps at the ~1 Mb and ~100 Kb scales

SLIDE 7

Topologically Associating Domains (TADs)

The topological domains (TADs) have been described as the functional units of genome organization, able to promote enhancer/promoter interactions.

Figure: contact probability

SLIDE 8

'Hi-C'-based experiments

Method | Main features | References
Hi-C | For mapping whole-genome chromatin interactions in a cell population; proximity ligation is carried out in a large volume | Lieberman-Aiden et al. (2009)
TCC | Similar to Hi-C, except that proximity ligation is carried out on solid phase-immobilized proteins | Kalhor et al. (2011)
Single-cell Hi-C | For mapping chromatin interactions at the single-cell level | Nagano et al. (2013)
In situ Hi-C | Proximity ligation is carried out in the intact nucleus | Rao et al. (2014)
Capture-C | Combines 3C with a DNA capture technology; equivalent to high-throughput 4C | Hughes et al. (2014)
DNase Hi-C | Chromatin is fragmented with DNase I; proximity ligation is carried out on a solid gel | Ma et al. (2015)
Targeted DNase Hi-C | Combines DNase Hi-C or in situ DNase Hi-C with a capture technology | Ma et al. (2015)
Micro-C | Chromatin is fragmented with micrococcal nuclease | Hsieh et al. (2015)
In situ DNase Hi-C | Chromatin is fragmented with DNase I; proximity ligation is carried out in the intact nucleus | Deng et al. (2015)
Capture-Hi-C | Combines 3C with a DNA capture technology; equivalent to high-throughput 5C | Mifsud et al. (2015)
HiChIP | Detects genome-wide chromatin interactions mediated by a particular protein; equivalent to ChIA-PET | Mumbach et al. (2016)

SLIDE 9

Ready-to-use Hi-C Kits

https://www.phasegenomics.com
https://arimagenomics.com/kit
https://www.qiagen.com

SLIDE 10

Which approach for which purpose?

SLIDE 11

Which approach for which purpose?

Capture Hi-C protocol (Franke et al. 2016), i.e. a Hi-C library combined with the capture of a dedicated genomic region

SLIDE 12

Questions?

  • 1. How to efficiently process Hi-C data?
  • 2. Are there any specific computational challenges in analyzing Hi-C data from cancer samples?

SLIDE 13

What does Hi-C data look like?

Illumina paired-end sequencing

Figure: Hi-C fragments, paired-end (PE) sequencing

SLIDE 14

Challenges in Hi-C data processing

Figure: Hi-C fragments, PE sequencing; IMR90 chr6 contact map, bin size = 500 Kb

How to process Hi-C data in an easy and efficient way, taking into account:

  • The huge amount of data
  • The evolution of protocols
  • The computational resources

SLIDE 15

Reads mapping strategy

  • 1. Hi-C fragments: PE sequencing
  • 2. End-to-end genome alignment: aligned reads / unmapped reads
  • 3. Unmapped reads: trim 3' RS site
  • 4. End-to-end genome alignment of the trimmed reads: aligned reads / unmapped reads
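The trimming step of this mapping strategy can be illustrated with a minimal Python sketch. It assumes that a chimeric read is detected by searching for the ligation junction sequence, and that only the 5' part of the read is kept for the second alignment. The function name `trim_at_ligation_site`, the `min_length` threshold and the example read are hypothetical; the junction `AAGCTAGCTT` is the one expected after HindIII digestion, fill-in and ligation.

```python
def trim_at_ligation_site(read, ligation_site, min_length=20):
    """Return the 5' part of a read located before the ligation site.

    Returns None when the site is absent, or when the remaining 5' part
    is shorter than min_length (hypothetical threshold).
    """
    position = read.find(ligation_site)
    if position == -1:
        return None
    trimmed = read[:position]
    return trimmed if len(trimmed) >= min_length else None


# Hypothetical read spanning a HindIII ligation junction.
read = "ACGTTGCAAGGCTTAACCGGTTAAGCTAGCTTGGCATCGATCGA"
print(trim_at_ligation_site(read, "AAGCTAGCTT"))
```

In a real pipeline the trimmed reads would then be realigned end-to-end, as in the second alignment step of the slide.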

SLIDE 16

Detection of valid interaction products

Figure: read pairs aligned on Hi-C fragments, by orientation (FR, RF, FF, RR; R = Reverse, F = Forward)

  • Valid pairs
  • Invalid pairs: Singleton, Dangling End, Self Circle, Dumped Pairs

+ Filtering on:

  • Insert size
  • Restriction fragment size
  • MAPQ
  • etc.
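The classification above can be sketched in a few lines of Python. This is a simplified illustration, not the HiC-Pro implementation: it assumes that each mate has already been assigned to a restriction fragment with a genome-wide unique id, that two mates on different fragments form a valid pair whatever their orientation, and that same-fragment pairs on the same strand go to a catch-all "dumped" class. The function name and its arguments are hypothetical, and the additional filters (insert size, fragment size, MAPQ) are not shown.

```python
def classify_pair(frag1, strand1, pos1, frag2, strand2, pos2):
    """Classify a read pair from fragment ids, strands and positions.

    frag1/frag2 are restriction fragment ids (None for an unaligned mate),
    strands are '+' (forward) or '-' (reverse).
    """
    if frag1 is None or frag2 is None:
        return "singleton"
    # Order the two mates by genomic position to read the orientation.
    if pos1 <= pos2:
        orientation = strand1 + strand2
    else:
        orientation = strand2 + strand1
    if frag1 != frag2:
        return "valid"
    if orientation == "+-":
        return "dangling_end"  # mates face each other on one fragment
    if orientation == "-+":
        return "self_circle"   # mates point outward on one fragment
    return "dumped"            # same strand on one fragment


print(classify_pair(1, "+", 100, 2, "-", 5000))
print(classify_pair(1, "+", 100, 1, "-", 300))
```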

SLIDE 17

Building contact maps

Bin size | Dense (MB) | Sparse Complete (MB) | Sparse Symmetric (MB)
1Mb | 25 | 98 | 49
500Kb | 77 | 363 | 182
150Kb | 818 | 1 900 | 934
40Kb | 12 000 | 3 800 | 1 900
20Kb | 45 000 | 5 300 | 2 700
5Kb | >100 000 ?? | 8 600 | 4 300

There is currently no consensus about how to (efficiently) store the contact maps.

A Hi-C contact map is:

  • Usually very sparse
  • Symmetric

We therefore propose to use a standard triplet sparse format to store only half of the non-zero contact values.
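A small Python sketch shows what such a symmetric triplet format stores. It assumes a toy dense matrix held in memory and 0-based bin indices; the function names and the example values are hypothetical, and a real tool would write the triplets to a file rather than keep them in a list.

```python
def to_symmetric_triplets(dense):
    """Return (i, j, count) triplets for the non-zero upper triangle."""
    triplets = []
    for i, row in enumerate(dense):
        for j in range(i, len(row)):
            if row[j] != 0:
                triplets.append((i, j, row[j]))
    return triplets


def from_symmetric_triplets(triplets, n_bins):
    """Rebuild the full symmetric dense matrix from the triplets."""
    dense = [[0] * n_bins for _ in range(n_bins)]
    for i, j, count in triplets:
        dense[i][j] = count
        dense[j][i] = count
    return dense


# Toy 4-bin contact map: 10 non-zero cells, but only 7 triplets are stored.
contact_map = [
    [10, 2, 0, 0],
    [2, 8, 3, 0],
    [0, 3, 12, 1],
    [0, 0, 1, 9],
]
triplets = to_symmetric_triplets(contact_map)
print(triplets)
```

Because the map is symmetric, the lower triangle is recovered for free when the matrix is rebuilt, which is why the "Sparse Symmetric" column is about half of the "Sparse Complete" one.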

SLIDE 18

Hi-C formats

.hic files (Juicer, Juicebox)

  • Contact matrices at multiple resolutions and summary statistics stored in one file
  • Java and C bindings
  • Command line tools
  • Extant suite of analysis tools
  • Extant visualization tool

.cool files (cooler, HiGlass)

  • Flexibility to store one or multiple matrices with varying bin sizes
  • Python library
  • Command line tools
  • HDF5, which has native bindings in practically all languages
  • Out-of-memory iterative matrix balancing, which can work on very large matrices

SLIDE 19

Hi-C data normalization

All high-throughput techniques are subject to technical and experimental biases.

The iterative correction (ICE) method is a widely used approach for Hi-C data normalization. This method is based on the assumption that each locus should have the same probability of interaction genome-wide, and is in theory able to correct for any bias in the contact maps.
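The equal-visibility assumption can be made concrete with a minimal matrix-balancing sketch in the spirit of ICE. It assumes a small, symmetric, dense matrix with no empty row; the function name, the fixed number of iterations and the toy matrix are hypothetical choices, and production tools (for example the iced module) add filtering of low-coverage bins and a convergence test.

```python
def iterative_correction(matrix, n_iter=200):
    """Balance a symmetric matrix so that all row sums become equal.

    Returns the balanced matrix and the accumulated bias vector.
    """
    n = len(matrix)
    balanced = [[float(value) for value in row] for row in matrix]
    bias = [1.0] * n
    for _ in range(n_iter):
        sums = [sum(row) for row in balanced]
        nonzero = [s for s in sums if s > 0]
        mean = sum(nonzero) / len(nonzero)
        # Rows without any contact are left untouched (factor 1).
        factors = [s / mean if s > 0 else 1.0 for s in sums]
        for i in range(n):
            for j in range(n):
                balanced[i][j] /= factors[i] * factors[j]
        bias = [b * f for b, f in zip(bias, factors)]
    return balanced, bias


contact_map = [
    [4, 2, 1, 1],
    [2, 6, 3, 1],
    [1, 3, 8, 2],
    [1, 1, 2, 5],
]
balanced, bias = iterative_correction(contact_map)
print([round(sum(row), 6) for row in balanced])
```

After balancing, every locus has the same total number of contacts, which is exactly the property that becomes a problem when copy number differs between regions.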

SLIDE 20

HiC-Pro: processing of Hi-C/HiChIP data

  • Easy-to-use
  • Optimized and scalable
  • Flexible
  • Supports most protocols
  • Open to contribution
  • Compatible with many downstream analysis software

Widely used in recent years

Available at https://github.com/nservant/HiC-Pro
Forum and discussion at https://groups.google.com/forum/#!forum/hic-pro

  • Dedicated reads mapping strategy
  • Detection of valid interaction products
  • Generates raw and normalized contact maps
  • Quality controls

SLIDE 21

Building Efficient and Reproducible Workflows

For facilities: Highly optimised pipelines with excellent reporting. Validated releases ensure reproducibility.

For users: Portable, documented and easy to use workflows. Pipelines that you can trust.

For developers: Companion templates and tools help to validate your code and simplify common tasks.

SLIDE 22

Analysis pipelines:

  • Nextflow-based pipelines
  • High level of reproducibility
  • Strict guidelines
  • 17 released pipelines
  • 19 under development

Community:

  • 29 organisations over the world
  • More than 90 contributors

SLIDE 23

First version of the nf-core Hi-C pipeline released! V1.1.0 = Nextflow HiC-Pro version

  • Automatic installation
  • Natively supports most schedulers
  • Natively compatible with conda, docker, singularity
  • Efficient task management
  • Reads can be automatically split into chunks to speed up the processing

SLIDE 24

Plans for the next versions:

  • TADs calling (which methods?)
  • Compartment calling
  • Detection of significant contacts
  • Specific pipelines for Hi-C based assembly? Cancer Hi-C?

Contributions are welcome!

SLIDE 25

Questions?

  • 1. How to efficiently process Hi-C data?
  • 2. Are there any specific computational challenges in analyzing Hi-C data from cancer samples?

SLIDE 26

Hi-C on cancer data

So far, most studies have been dedicated to normal cells, and only a few have started to investigate the chromatin structure of breast and prostate cancer using Hi-C.

SLIDE 27

Alterations in cancer (epi)genomics

SLIDE 28

TADs are biologically relevant

Lupiáñez et al. 2015, Franke et al. 2016

  • TAD disruption leads to new enhancer/promoter contacts
  • Abnormal enhancer/promoter contacts can have strong phenotypic impacts
  • Structural variants can disrupt TAD structure

SLIDE 29

Organization of cancer genomes?

Valton & Dekker, 2016

Figure: Non-coding DNA mutation, Structural variants

SLIDE 30

Hi-C, a good tool to study CNVs?

SLIDE 31

Challenges in Hi-C cancer data?

Figure: Hi-C and Copy Number Variants: Simulation? Normalization? Impacts?

SLIDE 32

Hi-C: What do we count?

Figure: whole-genome map, intrachromosomal contact maps, topological domains (chr6, HOX genes cluster)

In the context of a diploid genome, for two loci i and j:

  • If i and j belong to the same chromosome: Cij = 2 cis + 2 transH
  • If i and j belong to different chromosomes: Cij = 4 trans

SLIDE 33

Generalization to polyploid genomes

  • Ni = Nj = 1: if chri = chrj, Cij = 1 cis; if chri ≠ chrj, Cij = 1 trans
  • Ni = Nj = 2: if chri = chrj, Cij = 2 cis + 2 transH; if chri ≠ chrj, Cij = 4 trans
  • Ni = Nj = 3: if chri = chrj, Cij = 3 cis + 6 transH; if chri ≠ chrj, Cij = 9 trans
  • General case, Ni = Nj: if chri = chrj, Cij = Ni cis + Ni (Nj - 1) transH; if chri ≠ chrj, Cij = Ni Nj trans
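The general polyploid case can be written as a short Python function, which makes it easy to check the haploid, diploid and triploid cases listed on the slide. The function name, the argument names and the numeric values of cis, transH and trans are hypothetical and only serve the illustration.

```python
def expected_contacts(n_i, n_j, same_chromosome, cis=0.0, trans_h=0.0, trans=0.0):
    """Expected contact count Cij for two loci with copy numbers n_i and n_j."""
    if same_chromosome:
        # Each of the n_i copies contacts itself in cis and each of the
        # (n_j - 1) other homologous copies in transH.
        return n_i * cis + n_i * (n_j - 1) * trans_h
    # Different chromosomes: every pair of copies contributes one trans term.
    return n_i * n_j * trans


# Illustrative values: cis = 10, transH = 1, trans = 1.
for n in (1, 2, 3):
    same = expected_contacts(n, n, True, cis=10, trans_h=1)
    different = expected_contacts(n, n, False, trans=1)
    print(n, same, different)
```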

SLIDE 34

Extension to Cancer genome

Figure: chromosomal segments with copy numbers N = 4 and N = 2

If i and j belong to the same chromosomal segment: Cij = Ni cis + Ni (Nj - 1) transH

SLIDE 35

Extension to Cancer genome

If i and j belong to different chromosomal segments: Cij = p cis + (Ni * Nj - p) * transH, where p is the number of complete chromosomes

Figure: chromosomal segments with copy numbers N = 5, N = 2 and N = 3

Example: Cij = 2 cis + (2x4 + 5) transH
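As a worked check of this example, and assuming that it corresponds to Ni = 5, Nj = 3 and p = 2 (an interpretation of the figure labels, which the slide does not state explicitly), the general formula gives the same coefficient for transH:

```latex
\begin{aligned}
C_{ij} &= p\,\mathrm{cis} + (N_i N_j - p)\,\mathrm{trans}_H \\
       &= 2\,\mathrm{cis} + (5 \times 3 - 2)\,\mathrm{trans}_H \\
       &= 2\,\mathrm{cis} + 13\,\mathrm{trans}_H
        = 2\,\mathrm{cis} + (2 \times 4 + 5)\,\mathrm{trans}_H
\end{aligned}
```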

SLIDE 36

Simulation of cancer Hi-C data

  • 1. Estimate the cisij and transH terms from a real diploid Hi-C dataset.

Estimate transH under the assumption that the contact probability between homologous chromosomes can be estimated using the observed trans contacts between different chromosomes.

For each interaction Cij between the loci i and j, estimate the cis value using Cij = 2 cisij + 2 transH.

  • 2. Simulate the effect of CNVs on the contact matrix.

Given the cis and transH values for two loci i and j, calculate Eij, the expected counts in the presence of CNVs.

Calculate the expected factor of enrichment/depletion of interactions for the loci i and j: pij = Eij / Cij.

Generate the simulated counts CTij using a binomial downsampling of parameter pij: CTij ∼ B(Cij, pij).
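The downsampling step can be sketched in Python with the standard library only. The sketch assumes integer count matrices stored as lists of lists, and it divides the ratios pij by their maximum so that every value is a valid probability; this rescaling is an assumption of the example, since the slide does not say how ratios above 1 are handled. The function name, the seed and the toy matrices are hypothetical.

```python
import random


def simulate_cnv_counts(counts, expected, seed=0):
    """Downsample observed counts Cij according to expected counts Eij."""
    rng = random.Random(seed)
    n = len(counts)
    ratios = [[expected[i][j] / counts[i][j] if counts[i][j] > 0 else 0.0
               for j in range(n)] for i in range(n)]
    # Assumption of this sketch: rescale so that the largest ratio equals 1.
    max_ratio = max(max(row) for row in ratios) or 1.0
    simulated = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            p = ratios[i][j] / max_ratio
            # One Bernoulli draw per observed contact: CTij ~ B(Cij, pij).
            kept = sum(1 for _ in range(counts[i][j]) if rng.random() < p)
            simulated[i][j] = simulated[j][i] = kept
    return simulated


# Toy example: the first locus is expected to double its cis contacts.
counts = [[100, 20], [20, 80]]
expected = [[200, 20], [20, 80]]
simulated = simulate_cnv_counts(counts, expected)
print(simulated)
```

With this rescaling, the most enriched cell keeps all of its contacts and every other cell is depleted relative to it, so the simulated map has fewer total contacts than the input.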

SLIDE 37

Simulation - Results

Figure: Dixon et al. IMR90 data and simulated data (chr1 to chr4)

SLIDE 38

How to validate the simulation model?

In order to validate our simulation model, we used Hi-C data from the normal-like MCF10A cell line, on which we simulated the MCF7 CNV profile.

Figure: MCF7 Affymetrix data + MCF10A Hi-C data (chr1 to chr4)

SLIDE 39

Simulation - Validation

SLIDE 40

Effect of ICE normalization

The iterative correction (ICE) does not correct for CNV bias.

Figure: RAW vs ICE

SLIDE 41

Effect of ICE normalization

The iterative correction (ICE) does not correct for CNV bias. More importantly, it leads to an inversion of the signal in cis.

Figure: MCF7 Hi-C data, RAW vs ICE, with copy number (1, 5, 8)

SLIDE 42

How to normalize cancer Hi-C data?

How to take the CNV signal into account in the normalization?

  • 1. Correct for systematic biases but not for the CNV signal, which can be useful for the biological interpretation of cancer, for 3D modeling, genome reconstruction, the contribution of CNVs to disease, etc.
  • 2. Correct for all biases including the CNVs, because they might introduce a bias in the downstream analysis (differential contacts, detection of chromosome compartments, etc.)

SLIDE 43

Estimation of DNA breakpoints from Hi-C data

The segmentation of the 1D Hi-C profile is performed as follows:

  • 1. Generate the 1D Hi-C profile as the sum of contacts per locus genome-wide
  • 2. Remove systematic biases using a Poisson regression model
  • 3. Segment the profile

Figure: segmented profile (Deletion, Loss, Normal, Gain, Amplification)

Validation on 100 simulated data sets: 91% recall / 62.4% precision
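The first step, the 1D profile, is simple enough to sketch in Python. The example assumes that the contact map is stored as upper-triangle (i, j, count) triplets with 0-based bin indices; the function name and the toy values are hypothetical, and the Poisson regression and segmentation steps are not covered here.

```python
def one_d_profile(triplets, n_bins):
    """Sum of contacts per locus from upper-triangle (i, j, count) triplets."""
    profile = [0] * n_bins
    for i, j, count in triplets:
        profile[i] += count
        if i != j:
            # An off-diagonal contact contributes to both loci.
            profile[j] += count
    return profile


triplets = [(0, 0, 10), (0, 1, 2), (1, 1, 8), (1, 2, 3), (2, 2, 12)]
print(one_d_profile(triplets, 3))  # [12, 13, 15]
```

Loci located in gained or amplified segments are expected to show a higher value in this profile, which is what the segmentation step then exploits.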

SLIDE 44

CNV-based normalization of Hi-C cancer data

The Local Iterative Correction (LOIC) normalization method extends the ICE model, making the assumption of local equal visibility per genomic segment.

Figure: ICE vs LOIC

SLIDE 45

CNV-based normalization of Hi-C cancer data

The Local Iterative Correction (LOIC) normalization method extends the ICE model, making the assumption of local equal visibility per genomic segment.

Figure: RAW, ICE and LOIC data against GC content, fragment length and mappability

SLIDE 46

Removing CNVs from cancer Hi-C data

We assume that the copy number bias is constant per block and that the contact counts at a given genomic distance should be the same regardless of the copy number status.

  • 1. Run the ICE normalization
  • 2. Estimate the average counts ~ distance signal on the genome-wide matrix
  • 3. Based on the segmentation profile, rescale the counts ~ distance fit for each segmentation block

From Barutcu et al. 2015
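The counts ~ distance signal of step 2 can be illustrated with a minimal Python sketch. It assumes a dense symmetric matrix for a single chromosome and takes a plain average per diagonal instead of a fitted curve; the function name and the toy matrix are hypothetical, and the per-block rescaling of step 3 is not shown.

```python
def mean_counts_by_distance(matrix):
    """Average contact count for each genomic distance |i - j|, in bins."""
    n = len(matrix)
    totals = [0.0] * n
    for i in range(n):
        for j in range(i, n):
            totals[j - i] += matrix[i][j]
    # The upper triangle holds n - d bin pairs at distance d.
    return [totals[d] / (n - d) for d in range(n)]


contact_map = [
    [10, 4, 1],
    [4, 10, 2],
    [1, 2, 10],
]
print(mean_counts_by_distance(contact_map))  # [10.0, 3.0, 1.0]
```

The same average computed inside one segmentation block can then be compared with the genome-wide one to derive a rescaling factor for that block.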

SLIDE 47

Removing CNVs from cancer Hi-C data

Figure: Simulation 1, raw data vs CAIC data (chr1 to chr4)

SLIDE 48

Cancer Hi-C data normalization

  • CNVs estimation from Hi-C data
  • Cancer Hi-C data simulation
  • Normalization of Hi-C cancer data

Available at https://github.com/nservant/cancer-hic-norm/

Normalization methods are included in the iced Python module, available at https://github.com/hiclib/iced

SLIDE 49

How useful is the LOIC method?

(…) However, in general duplicated regions yield more signal compared to non-duplicated regions when mapped to the wildtype reference genome. The signal is flattened disproportionately in duplicated regions by a normalization procedure that balances the whole interaction matrix, such as KR normalization. Therefore, we used only raw count maps to calculate the differences between samples. (…)

SLIDE 50

How useful is the LOIC method?

Application of LW-IC on Franke et al. data

SLIDE 51

Going further with downstream analysis

The detection of A/B chromosome compartments is usually based on a PCA of the correlation of the intra-chromosomal maps.

The method is surprisingly robust to CNVs, but for some chromosomes the PC1 signal is biased toward the CNV profile.

SLIDE 52

Removing CNVs from cancer Hi-C data

Detection of A/B chromosome compartments

Figure: MCF10A - IC, MCF7 - LOIC, MCF7 - CAIC

SLIDE 53

Take Home Messages

In a copy number context, we demonstrate that the ICE normalization does not correct for these effects and that it results in a shift in contact probabilities between altered regions in cis.

We proposed a first simulation model to investigate the impact of CNVs on Hi-C maps.

We then proposed two new methods for cancer Hi-C data and applied them to different case studies:

  • LOIC to keep the CNVs information
  • CAIC to remove the CNVs

HiC-Pro is available at https://github.com/nservant/HiC-Pro
nf-core-hic is available at https://github.com/nf-core/hic

Both are collaborative projects, so do not hesitate to propose improvements or to report errors.

SLIDE 54

Many Thanks

Nelle Varoquaux, PhD
Agathe Neviere
Jean-Philippe Vert, PhD
Emmanuel Barillot, PhD
Edith Heard, PhD
Joke van Bemmel, PhD
Rafael Galupa, PhD
Agnese Loda, PhD
Elphege Nora, PhD