Detection of copy number alterations and loss of heterozygosity


slide-1
SLIDE 1

Complete Genome analysis: Detection of copy number alterations and loss of heterozygosity Control-FREEC tutorial

Valentina BOEVA – Institut Curie, INSERM, Mines ParisTech

slide-2
SLIDE 2

Workshop outline

  • Motivation for copy number detection in cancer samples
  • Control-FREEC tool presentation
  • Methodology & functionalities
  • Control-FREEC tutorial on Galaxy
  • Hands-on workshop
slide-3
SLIDE 3

Cancer genomes are often significantly rearranged

Figure: a 24-color karyotype of a neuroblastoma cell line

slide-4
SLIDE 4

In cancer genomes, it is important to detect CNAs and LOH

CNAs – copy number alterations:

  • Large-scale genomic deletions
  • Large-scale genomic duplications
  • Amplicons (duplications >10 times)

LOH – loss of heterozygosity regions

slide-5
SLIDE 5

Amplification of an important gene can favor cancer development

  • MYCN amplification, which occurs in approximately 22% of primary neuroblastomas, is one of the most powerful prognostic factors identified to date. It is significantly associated with advanced-stage disease, rapid tumor progression, and poor prognosis.

Figure: part of chr2 with MYCN and DDX1 labeled; annotation: more than 100 copies

slide-6
SLIDE 6

Amplification of an important gene can favor cancer development

  • MYCN amplification, which occurs in approximately 22% of primary neuroblastomas, is one of the most powerful prognostic factors identified to date. It is significantly associated with advanced-stage disease, rapid tumor progression, and poor prognosis.

From Kawa K et al. JCO 1999

Overall survival curve for MYCN-amplified neuroblastoma patients relative to treatment after induction chemotherapy. A, patients who underwent autologous bone marrow transplantation (ABMT)/peripheral-blood stem-cell transplantation (PBSCT); B, patients who did not undergo ABMT/PBSCT.

From Schneiderman, J. et al. 2008

Kaplan-Meier event-free survival curves for 600 stage A, B, and Ds patients by MYCN status. Y-axis: probability of event-free survival (%).

slide-7
SLIDE 7

Deletion in an important gene can favor cancer development

  • The patient was treated for breast and ovarian cancer
  • She developed therapy-related acute myeloid leukemia (t-AML)
  • Whole-genome sequencing revealed a novel, heterozygous 3-kilobase deletion removing exons 7-9 of TP53 in the patient’s normal skin DNA, which was homozygous in the leukemia DNA as a result of acquired uniparental disomy.

Adapted from C. Link et al., 2011

slide-8
SLIDE 8

Copy neutral loss of heterozygosity (LOH) or acquired uniparental disomy (UPD) often happens in cancer

In UPD, a person receives two copies of a chromosome, or part of a chromosome, from one parent and no copies from the other parent.

This acquired homozygosity could lead to development of cancer if the individual inherited a non-functional allele of a tumor suppressor gene.

From Wikipedia

slide-9
SLIDE 9

Identification of regions of gain and loss helps to predict the aggressiveness of cancer

Figure: copy number profile (chr 11) of a metastatic neuroblastoma sample

slide-10
SLIDE 10

Identification of regions of gain and loss helps to predict the aggressiveness of cancer

From Carén H et al. PNAS 2010;107:4323-4328

Kaplan-Meier overall survival for patients with tumors with different genomic profiles.

slide-11
SLIDE 11

Detection of SNVs, indels, structural variants, copy number changes and LOH has become possible with Next Generation Sequencing (NGS)

  • Next Generation sequencing = Fast, Accurate Reading of DNA
  • Whole genome
  • Exome sequencing
  • Targeted sequencing

slide-12
SLIDE 12

Detection of SNVs, indels, structural variants, copy number changes and LOH has become possible with Next Generation Sequencing (NGS)

  • Next Generation sequencing = Fast, Accurate Reading of DNA
  • Whole genome
    • Sequencing of the whole cancer genome including intergenic regions and introns
    • Complete information about the genome
  • Exome sequencing
  • Targeted sequencing

slide-13
SLIDE 13

Detection of SNVs, indels, structural variants, copy number changes and LOH has become possible with Next Generation Sequencing (NGS)

  • Next Generation sequencing = Fast, Accurate Reading of DNA
  • Whole genome
  • Exome sequencing
    • Sequencing of exons of ~20000 well characterized genes
    • Complete information about SNVs, indels and copy number changes of the coding part of the genome
  • Targeted sequencing

slide-14
SLIDE 14

Detection of SNVs, indels, structural variants, copy number changes and LOH has become possible with Next Generation Sequencing (NGS)

  • Next Generation sequencing = Fast, Accurate Reading of DNA
  • Whole genome
  • Exome sequencing
  • Targeted sequencing
    • Complete information about SNVs, indels, copy numbers of a small panel of genes (10-500) actionable in cancer

slide-15
SLIDE 15

Today we will speak only about detection of CNAs and LOH, and only in WGS and WES data

slide-16
SLIDE 16

Read count (RC) is calculated in sliding windows

Figure: read count in each window against chromosome position; regions are labeled Gain, Loss, and Normal
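A minimal Python sketch of the windowed read count described on this slide. The function name `window_read_counts`, the 0-based read start positions, and the toy data are illustrative assumptions, not Control-FREEC code. Passing a `step` smaller than `window` makes consecutive windows overlap.

```python
def window_read_counts(read_starts, chrom_length, window, step=None):
    """Count read start positions falling in each window [start, start + window)."""
    step = step or window  # assumption: windows do not overlap unless a step is given
    counts = []
    for start in range(0, chrom_length, step):
        end = start + window
        counts.append(sum(1 for pos in read_starts if start <= pos < end))
    return counts


# Toy data: 0-based start positions of reads on a 300-bp "chromosome".
reads = [10, 120, 130, 260, 270, 280, 299]
print(window_read_counts(reads, chrom_length=300, window=100))           # [1, 2, 4]
print(window_read_counts(reads, chrom_length=300, window=100, step=50))  # [1, 2, 2, 0, 4, 4]
```

This scan is quadratic and only meant to show the idea; a real tool would bin positions in a single pass over a sorted alignment file.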

slide-17
SLIDE 17

We need to normalize read count per window to get meaningful profiles

Figure: read count per 50kb-window against position on chr5, for the control and for the sample; a possible loss is marked in the sample

slide-18
SLIDE 18

If control is available, the problem is easily solved

Figure: normalized read count per 50kb-window against position on chr5; the loss is marked
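One way to picture normalization with a control is a per-window ratio of tumor to control read counts, centered so that the typical window has ratio 1. The sketch below is an assumption-laden illustration: the function name `normalize_by_control`, the median centering, and the toy counts are not taken from the slides, and Control-FREEC's own normalization may differ.

```python
from statistics import median


def normalize_by_control(tumor_rc, control_rc):
    """Per-window tumor/control ratio, centered on the median ratio.

    Windows with no control reads are returned as None (nothing to divide by).
    """
    raw = [t / c if c > 0 else None for t, c in zip(tumor_rc, control_rc)]
    center = median(r for r in raw if r is not None)
    return [None if r is None else r / center for r in raw]


# Toy read counts per window; the third window has lost half of its reads.
tumor = [100, 104, 50, 96, 100]
control = [200, 208, 200, 192, 0]
print(normalize_by_control(tumor, control))  # [1.0, 1.0, 0.5, 1.0, None]
```

Median centering assumes that most windows are at the normal copy number, which may not hold for heavily rearranged genomes.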

slide-19
SLIDE 19

If there is no control dataset, normalization can be done using the GC-content

Figure: control read count and GC-content against position on chr5

slide-20
SLIDE 20

RC can be modeled as a polynomial on GC-content

A scatter plot shows the dependency RC ~ GC-content

Figure: read count per 100kb-window against GC-content

slide-21
SLIDE 21

RC can be modeled as a polynomial on GC-content

Here RC was modeled as a polynomial of order three on GC-content

Figure: read count per 50kb-window against GC-content in three panels (Control, COLO-829BL mate pairs; NCI-H2171 paired ends; COLO-829 mate pairs). Legend: main component; components corresponding to losses and gains
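The order-three fit mentioned on this slide can be sketched with NumPy. Everything below is illustrative: the data are simulated, the function name `fit_gc_polynomial` is hypothetical, and the sketch fits all windows at once, whereas the slide indicates that the main component is fitted separately from the components corresponding to losses and gains.

```python
import numpy as np


def fit_gc_polynomial(gc, rc, degree=3):
    """Least-squares polynomial fit of read count on GC-content."""
    coeffs = np.polyfit(gc, rc, degree)
    return np.poly1d(coeffs)


# Simulated windows: read count depends on GC-content through a smooth curve.
rng = np.random.default_rng(0)
gc = rng.uniform(0.3, 0.6, size=500)
expected = 1000 * (1 - 8 * (gc - 0.45) ** 2)
rc = rng.poisson(expected)

f = fit_gc_polynomial(gc, rc)
nrc = rc / f(gc)  # read count divided by the value predicted from GC-content
print(float(np.median(nrc)))
```

After dividing by the fitted value, windows at the dominant copy number should cluster around 1, and gains and losses show up as departures from it.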

slide-22
SLIDE 22

The resulting profiles are segmented to detect gains and losses

g_i = GC-content in window i, RC_i = read count in window i, NRC_i = resulting normalized read count, f = polynomial fitted to RC as a function of GC-content

Transformation:

$$NRC_i = \frac{RC_i}{f(g_i)} \times \text{ploidy}$$

Figure: normalized copy number against genomic position (3-kb window), chr5. Legend: normal copy number; loss; gain

slide-23
SLIDE 23

In summary

  • Control-FREEC detects Copy Number Alterations (CNAs) in whole genome sequencing data
  • Control-FREEC uses a sliding window approach
  • It also allows visualizing CNAs and LOH at the genome scale

slide-24
SLIDE 24

Visualization of copy number profiles calculated by the FREEC software

Legend: normal copy number; loss; gain

slide-25
SLIDE 25

There are 3 problems of genomic profiling

  • 1. Reference point for copy number variation (diploid, triploid, tetraploid genomes)

Figure: normalized ratio against window along the genome in two panels: one copy gain in a diploid genome; one copy gain in a tetraploid genome
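The reference-point problem comes down to simple arithmetic: the same one-copy gain produces a different normalized ratio depending on the ploidy it is measured against. A tiny Python illustration (the function name `expected_ratio` is hypothetical):

```python
def expected_ratio(copy_number, ploidy):
    """Normalized ratio of a region relative to the genome's main ploidy."""
    return copy_number / ploidy


print(expected_ratio(3, 2))  # one copy gain in a diploid genome: 1.5
print(expected_ratio(5, 4))  # one copy gain in a tetraploid genome: 1.25
```

A ratio of 1.25 therefore cannot be interpreted as a copy number until the ploidy of the genome is known or estimated.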

slide-26
SLIDE 26

There are 3 problems of genomic profiling

  • 2. Contamination of tumor samples by normal stromal cells

slide-27
SLIDE 27

We can evaluate contamination of a tumor sample by normal cells

Figure: two panels of normalized copy number against genomic position (3-kb window)

slide-28
SLIDE 28

We can evaluate contamination of a tumor sample by normal cells

Figure: two panels of normalized copy number against genomic position (3-kb window)
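To see why contamination can be evaluated from a profile, consider a simple mixture model: a fraction p of the cells are normal (assumed diploid, 2 copies everywhere) and the rest are tumor cells. The sketch below is written under that assumption. The function names are hypothetical and the formula is the generic mixture calculation, not necessarily the exact procedure used by Control-FREEC.

```python
def observed_ratio(copy_number, ploidy, contamination):
    """Expected normalized ratio of a region in a tumor/normal cell mixture."""
    tumor = 1.0 - contamination
    region = tumor * copy_number + contamination * 2   # average copies in the region
    genome = tumor * ploidy + contamination * 2        # average copies genome-wide
    return region / genome


def estimate_contamination(ratio, copy_number, ploidy):
    """Invert observed_ratio for one region whose true tumor copy number is known."""
    denominator = (copy_number - 2) - ratio * (ploidy - 2)
    if denominator == 0:
        raise ValueError("this region carries no information about contamination")
    return (copy_number - ratio * ploidy) / denominator


# A one-copy loss in a diploid tumor: a ratio of 0.5 is expected in a pure sample.
r = observed_ratio(copy_number=1, ploidy=2, contamination=0.4)
print(round(r, 3))                                    # 0.7
print(round(estimate_contamination(r, 1, 2), 3))      # 0.4
```

The shift of the loss from 0.5 toward 1 is the signature that lets contamination be read off the profile.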

slide-29
SLIDE 29

There are 3 problems of genomic profiling

  • 3. Intra-tumoral heterogeneity

From Kost-Alimova et al., BMC Cancer 2007

slide-30
SLIDE 30

There are 3 problems of genomic profiling

  • 3. Intra-tumoral heterogeneity

One solution: Tumor Heterogeneity Analysis (THetA)

http://compbio.cs.brown.edu/projects/theta/

  • L. Oesper, A. Mahmoody, and B.J. Raphael. (2013) THetA: Inferring intra-tumor heterogeneity from high-throughput DNA sequencing data. Genome Biology. 14:R80.

slide-31
SLIDE 31

Now we want to detect genotype status (including LOH)

  • or Loss Of Heterozygosity (LOH)
slide-32
SLIDE 32

We characterize the allelic content via the B allele frequency (BAF)

  • B allele = alternative variant in dbSNP

Reference genome (A allele): acGatgacgtcaAatgctagcgagGcacacaaTac
dbSNP (B allele):            acCatgacgtcaTatgctagcgagCcacacaaAac

B allele frequency (BAF), from the observed nucleotide frequencies at the four SNP positions: 0.44, 0.5, 0.57, 0.45
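BAF at a SNP position is the share of reads carrying the B allele. A minimal Python sketch follows; the function name and the read counts are hypothetical, with the counts chosen only so that they reproduce the four BAF values shown on the slide.

```python
def b_allele_frequency(a_count, b_count):
    """Fraction of reads supporting the B (alternative) allele at one SNP."""
    total = a_count + b_count
    if total == 0:
        return None  # no coverage, BAF is undefined
    return b_count / total


# Hypothetical (A reads, B reads) at four SNP positions.
counts = [(14, 11), (10, 10), (12, 16), (11, 9)]
print([round(b_allele_frequency(a, b), 2) for a, b in counts])  # [0.44, 0.5, 0.57, 0.45]
```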

slide-33
SLIDE 33

There is a correspondence between copy number and possible BAF


slide-34
SLIDE 34

We infer the genotype status of a region from B allele frequency profiles

Figure: BAF profiles with regions labeled AA or BB, AB, and one of unknown status (?)

slide-35
SLIDE 35

To infer the genotype status of a region from B allele frequency profiles we use a Gaussian mixture model (GMM) fit

  • We try different fits and choose the fit with the best likelihood

Fit with 3 modes:

  • AA
  • AB
  • BB

The fit indicates that the genotype = AB

Fit with 4 modes:

  • AA
  • BB
  • AA*0.6+AB*0.4
  • BB*0.6+AB*0.4

The fit indicates that the genotype = AA/BB with 40% contamination by normal (“AB”) cells
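The "try different fits and keep the most likely one" idea can be sketched as follows. This is a simplified illustration, not the tool's implementation: the mode positions are fixed rather than estimated, all modes get equal weight, and the standard deviation `sigma` is a made-up constant. A full GMM fit would estimate these quantities, for example with the EM algorithm.

```python
import math


def gaussian_pdf(x, mean, sigma):
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def mixture_log_likelihood(bafs, means, sigma=0.05):
    """Log-likelihood of BAF values under an equal-weight Gaussian mixture."""
    weight = 1.0 / len(means)
    total = 0.0
    for x in bafs:
        density = sum(weight * gaussian_pdf(x, m, sigma) for m in means)
        total += math.log(max(density, 1e-300))  # guard against log(0)
    return total


def best_fit(bafs, candidates, sigma=0.05):
    """Return the name of the candidate with the highest log-likelihood, plus all scores."""
    scores = {name: mixture_log_likelihood(bafs, means, sigma)
              for name, means in candidates.items()}
    return max(scores, key=scores.get), scores


p = 0.4  # contamination assumed for the 4-mode fit, as in the slide's example
candidates = {
    "AB": [0.0, 0.5, 1.0],                                    # modes AA, AB, BB
    "AA/BB with contamination": [0.0, 1.0, p / 2, 1 - p / 2],  # 0, 1, 0.2, 0.8
}
observed = [0.0, 0.18, 0.22, 0.79, 0.83, 1.0, 0.21, 0.8]  # toy BAF values
name, scores = best_fit(observed, candidates)
print(name)  # AA/BB with contamination
```

The modes at 0.2 and 0.8 correspond to AA*0.6+AB*0.4 and BB*0.6+AB*0.4 when AB contributes a BAF of 0.5.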
slide-36
SLIDE 36

Visualization of BAF


slide-37
SLIDE 37

Extending Control-FREEC to the exome sequencing data

  • Exome data:
    • Capture bias
    • GC-content and mappability correction is not enough
    • Mandatory use of a control sample to normalize read counts

Figure: uneven coverage of exons

slide-38
SLIDE 38

Exome sequencing data may be much noisier than whole genome sequencing data

Additional bias (capture) => additional noise

slide-39
SLIDE 39

Exome sequencing data may be much noisier than whole genome sequencing data

Additional bias (capture) => additional noise

slide-40
SLIDE 40

Idea: use BAF profiles to infer correct copy numbers

GAP (Popova et al., Genome Biology 2009)

Allelic status   Copy number   BAF*
-                0             NA
A/B              1             0, 1
AB               2             0, 1, 0.5
AA/BB            2             0, 1
AAB/ABB          3             0, 1, 0.33, 0.66
AAA/BBB          3             0, 1
AABB             4             0, 1, 0.5
AAAB/ABBB        4             0, 1, 0.25, 0.75
AAAA/BBBB        4             0, 1
…                …             …

*No contamination by normal cells

slide-41
SLIDE 41

Idea: use BAF profiles to infer correct copy numbers

GAP (Popova et al., Genome Biology 2009)

Allelic status   Copy number   BAF**
-                0             NA
A/B              1             0, 1, p/(1+p), 1/(1+p)
AB               2             0, 1, 0.5
AA/BB            2             0, 1, p/2, 1-p/2
AAB/ABB          3             0, 1, 1/(3-p), 1-1/(3-p)
AAA/BBB          3             0, 1, p/(3-p), 1-p/(3-p)
AABB             4             0, 1, 0.5
AAAB/ABBB        4             0, 1, 1/(4-2p), 1-1/(4-2p)
AAAA/BBBB        4             0, 1, p/(4-2p), 1-p/(4-2p)
…                …             …

**Contamination p by normal cells
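The contaminated BAF values follow from counting B alleles in a mixture of tumor and normal cells. The sketch below assumes that the SNP is heterozygous (AB, two copies) in the normal cells, which make up a fraction p of the sample; the function name `expected_baf` is hypothetical.

```python
import math


def expected_baf(b_copies, total_copies, contamination=0.0):
    """Expected BAF at a germline-heterozygous SNP in a tumor/normal mixture.

    b_copies / total_copies describe the tumor genotype (e.g. AAB -> 1, 3);
    normal cells are assumed to carry one B allele out of two copies.
    """
    tumor = 1.0 - contamination
    b_alleles = tumor * b_copies + contamination * 1
    all_alleles = tumor * total_copies + contamination * 2
    return b_alleles / all_alleles


p = 0.4
print(math.isclose(expected_baf(0, 1, p), p / (1 + p)))      # A, one copy
print(math.isclose(expected_baf(0, 2, p), p / 2))            # AA
print(math.isclose(expected_baf(1, 3, p), 1 / (3 - p)))      # AAB
print(math.isclose(expected_baf(0, 4, p), p / (4 - 2 * p)))  # AAAA
```

With p = 0 the same function returns the uncontaminated values, for example 1/3 for AAB and 0.25 for AAAB.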

slide-42
SLIDE 42

Realization: in case of doubt, get max(logLikelihood) for observed BAF

1 or 2 copies?

Allelic ratio about ½:

  • A/B: logLikelihood -2340
  • AA/BB: logLikelihood -3270
  • AB: logLikelihood 120

2 copies!

slide-43
SLIDE 43

Selection of window size

  • If a window size is not provided by the user, it can be calculated using the following information:
    • total read count (T)
    • desired value of coefficient of variation (CV) for read count distribution per window. Coefficient of variation = (standard deviation) / mean
  • Let all reads be uniformly distributed along the genome, with L = genome length and W = window size
  • Then, there are L/W windows and each of them contains on average T/(L/W) = T·W/L reads.
  • Read count per window follows a Poisson distribution with parameter λ = T·W/L.
  • We suggest a CV from 0.05 to 0.1.
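The window size follows from the Poisson assumption above: a Poisson variable with parameter λ has standard deviation √λ, so CV = 1/√λ = √(L/(T·W)), which gives W = L/(T·CV²). A small Python sketch with hypothetical numbers (the function name and the example read count and genome length are illustrative):

```python
def window_size(total_reads, genome_length, cv=0.05):
    """Window size W such that Poisson read counts per window have the desired CV."""
    return round(genome_length / (total_reads * cv ** 2))


# Example: 100 million reads on a 3-Gb genome with a target CV of 0.05.
print(window_size(total_reads=100_000_000, genome_length=3_000_000_000, cv=0.05))  # 12000
```

Relaxing the target to CV = 0.1 divides the window size by four, since W scales with 1/CV².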

slide-44
SLIDE 44

Segmentation

  • Segmentation is done by a LASSO-based algorithm suggested by Harchaoui and Lévy-Leduc (2008).
  • Each chromosome is segmented independently.
  • The maximum number of breakpoints for each chromosome cannot exceed 10% of its length.
  • The maximum of the slope coefficient (residual sum of squares, RSS, vs number of breakpoints) tells where to stop.

Figure: RSS against number of breakpoints
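The RSS-versus-breakpoints curve can be illustrated with a much simpler stand-in for the LASSO-based algorithm. The sketch below uses greedy breakpoint selection, which is a different technique chosen only for brevity, and a hypothetical stopping rule (`min_gain`) in place of the slope criterion. It is meant to show how RSS flattens once the true breakpoints are found, not to reproduce Control-FREEC.

```python
def rss(values):
    """Residual sum of squares of a segment around its own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def total_rss(profile, breakpoints):
    """RSS of a piecewise-constant fit with the given breakpoints."""
    bounds = [0] + sorted(breakpoints) + [len(profile)]
    return sum(rss(profile[a:b]) for a, b in zip(bounds, bounds[1:]))


def greedy_breakpoints(profile, max_breakpoints):
    """Add breakpoints one at a time, each time taking the largest RSS reduction."""
    chosen, curve = [], [total_rss(profile, [])]
    for _ in range(max_breakpoints):
        candidates = [i for i in range(1, len(profile)) if i not in chosen]
        if not candidates:
            break
        best = min(candidates, key=lambda i: total_rss(profile, chosen + [i]))
        chosen.append(best)
        curve.append(total_rss(profile, chosen))
    return chosen, curve


def choose_breakpoint_count(curve, min_gain=0.05):
    """Hypothetical rule: stop once a new breakpoint removes < min_gain of the initial RSS."""
    for k in range(len(curve) - 1):
        if curve[k] - curve[k + 1] < min_gain * curve[0]:
            return k
    return len(curve) - 1


# Toy profile: normal (2 copies), a gain (3 copies), then a loss (1 copy).
profile = [2.0] * 10 + [3.0] * 10 + [1.0] * 5
chosen, curve = greedy_breakpoints(profile, max_breakpoints=4)
k = choose_breakpoint_count(curve)
print(sorted(chosen[:k]))  # [10, 20]
```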

slide-45
SLIDE 45

Distribution of read count (RC) in tumor cells vs normal cells. Each point represents the number of sequence reads aligned to a 50-kb window (also called RC) for a control cell line vs tumor cell line. A,B - COLO-829 cell line; C,D - HCC1143 cell line. Red and yellow dots represent a fit for tumor genome ploidy. Red – near-triploid genome (A and C), yellow – near-tetraploid genome (B and D). Red dots provide a better fit to RC density than yellow dots, suggesting near-triploidy of COLO-829 and HCC1143.

slide-46
SLIDE 46

Control-FREEC

  • Control-FREEC detects CNAs and LOH in whole genome or exome sequencing data
  • It also allows visualizing CNAs and LOH at the genome scale
  • Adjusts for contamination of tumor samples by normal cells
  • Works with tumor genomes of different ploidy
  • Works for non-human genomes as well!
  • Control is optional (if data is whole genome)
slide-47
SLIDE 47

Testing data

  • 1. WGS dataset
    • Illumina GAII mate-paired data
    • whole genome re-sequencing
    • Neuroblastoma cell line
    • The non-tumoral cell line = control dataset
  • 2. WES dataset
    • Illumina HiSeq2500 paired-end data
    • whole exome re-sequencing
    • primary tumor sample
    • The non-tumoral DNA = control
slide-48
SLIDE 48

http://genome-euro.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=valentina.boeva&hgS_otherUserSessionName=hg19_roscoff

slide-49
SLIDE 49

Hands-on Control-FREEC tutorial!

slide-50
SLIDE 50

FREEC webpage and online manual

  • http://bioinfo-out.curie.fr/projects/freec/
  • http://bioinfo-out.curie.fr/projects/freec/tutorial.html
slide-51
SLIDE 51

tumor.genome.bam

  • In Galaxy version use only « Any »
  • Mappability for 35-100bp kmers is very similar; read length does not really matter
  • Better to know in advance
  • Better to run « without contamination » at first
  • « yes » only for high depth of coverage data

slide-52
SLIDE 52
slide-53
SLIDE 53
slide-54
SLIDE 54
slide-55
SLIDE 55
slide-56
SLIDE 56
slide-57
SLIDE 57
slide-58
SLIDE 58
slide-59
SLIDE 59
slide-60
SLIDE 60
slide-61
SLIDE 61
slide-62
SLIDE 62
  • There is an option to normalize tumor sample read counts using a control sample instead of GC-content
  • In this case the control sample should be prepared by the same person who prepared the tumor sample, ideally at the same time
  • This should make the GC-content bias the same in both samples

Figure: read count against GC-content for the normal and tumor samples, and tumor read count against normal read count

slide-63
SLIDE 63
slide-64
SLIDE 64
slide-65
SLIDE 65

  • ~mappability threshold
  • Run the first time without contamination adjustment
  • Set to “Yes” to get allelic profiles and LOH
  • Run the first time without this option

slide-66
SLIDE 66

Set to >=2 if you are not interested in one-exon outliers

slide-67
SLIDE 67
slide-68
SLIDE 68
slide-69
SLIDE 69
slide-70
SLIDE 70
slide-71
SLIDE 71
slide-72
SLIDE 72
slide-73
SLIDE 73
slide-74
SLIDE 74

Signature of normal contamination: shift toward “normal”

slide-75
SLIDE 75

  • Run with contamination adjustment
  • Will use BAF profiles for copy number prediction in ambiguous cases
  • In this case, the contamination seems to be about 25%
  • Leave the field blank if you don’t have the expertise to evaluate contamination by eye

slide-76
SLIDE 76

Increase the threshold if you want fewer breakpoints

slide-77
SLIDE 77
slide-78
SLIDE 78
slide-79
SLIDE 79
slide-80
SLIDE 80
slide-81
SLIDE 81
slide-82
SLIDE 82