Cancer gene discovery via network analysis of somatic mutation data



slide-1
SLIDE 1

Cancer gene discovery via network analysis of somatic mutation data

Insuk Lee

slide-2
SLIDE 2

Cancer is a progressive genetic disorder.

  • Accumulation of somatic mutations causes cancer.
  • For example, in colorectal cancer, the first gatekeeping mutation (often occurring in APC) is followed by a series of oncogene activations and loss-of-function mutations in tumor suppressor genes, which eventually generates a malignant tumor.

slide-3
SLIDE 3

Sequencing approach to the comprehensive catalog of cancer genes

  • Tumor samples and adjacent healthy tissue (or blood) samples (i.e., matched normal samples) are sequenced (WES) and aligned to identify cancer-associated somatic mutations (and cancer genes).
  • Nat. Rev. Genet. 15:556 (2014)

slide-4
SLIDE 4

Driver vs. Passenger mutations

  • Driver mutation: a mutation that directly or indirectly confers a selective growth advantage to the cell in which it occurs (the opposite of a passenger mutation).
  • Not all mutations are driver mutations. Therefore, not all genes that contain somatic mutations are cancer driver genes.

Nature 458:719 (2009)

slide-5
SLIDE 5

Distinguishing Drivers from Passengers

  • Based on recurrent mutations
  • Use deleteriousness of the mutations

slide-6
SLIDE 6

Using additional information to reduce false positives

  • In MutSigCV, mutation frequency is normalized by the gene-specific background mutation rate (BMR), expression level, and replication timing.

Nat. Rev. Genet. 15:556 (2014)
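
The normalization idea can be illustrated with a simple Poisson test: given a gene-specific BMR, how surprising is the observed mutation count? This is a minimal sketch, not MutSigCV's actual model; the function names, the per-base BMR values, and the gene lengths below are all hypothetical.

```python
from math import exp


def poisson_tail(k: int, lam: float) -> float:
    """Return P(X >= k) for X ~ Poisson(lam)."""
    term = exp(-lam)  # P(X = 0)
    cdf = 0.0         # accumulates P(X <= k - 1)
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)  # P(X = i + 1) derived from P(X = i)
    return max(0.0, 1.0 - cdf)


def background_pvalue(observed: int, gene_length: int,
                      n_samples: int, bmr: float) -> float:
    """Chance of seeing `observed` or more mutations from background alone.

    `bmr` is an assumed per-base, per-sample background mutation rate.
    """
    expected = gene_length * n_samples * bmr
    return poisson_tail(observed, expected)


# Hypothetical genes with the same mutation count but different BMRs.
p_low_bmr = background_pvalue(12, gene_length=1500, n_samples=500, bmr=2e-6)
p_high_bmr = background_pvalue(12, gene_length=1500, n_samples=500, bmr=1.2e-5)
print(p_low_bmr < p_high_bmr)
```

The same count of 12 mutations is far more surprising in the gene with the low background rate, which is why a ranking by raw frequency alone can produce false positives.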

slide-7
SLIDE 7

What about cancer genes with low mutation rate?

Many hills but only a few mountains

From the genomic landscapes of human colorectal cancers (Wood et al., Science 2007)

  • Map of mutations in 11 breast and 11 colorectal cancers.
  • In the landscape, the heights of the peaks reflect the mutation frequency of each gene. A few gene “mountains” are mutated in a large proportion of tumors; most genes are mutated in <5% of tumors and are represented as “hills” in the figure.
  • We observed a similar distribution of mutation frequency in TCGA data.

slide-8
SLIDE 8

Long-tail distribution of mutation frequency

[Figure: mutation count per gene (axis from 200 to 2000); labeled genes: TP53, PIK3CA, PTEN, BRAF, KMT2C, KMT2D, APC, ATRX, IDH1, ARID1A]

Mutation distribution across 422 CGC (Cancer Gene Census) genes in 6,764 Pan-cancer samples (April 2014 TCGA); 410 of the genes are mutated.

  • The majority of cancer genes are infrequently mutated and carry somatic mutations in only a few patients, which results in a long-tail distribution of mutation frequency.
  • Therefore, methods based on recurrent mutations have an intrinsic limitation in cancer gene identification.

Among the 422 known cancer genes in CGC:

  • 7 genes: mutated in >5% of tumors
  • 128 genes: mutated in >1% of tumors
  • 12 genes: not mutated in any tumor
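
Threshold counts like these can be derived from a gene-to-samples mapping. The sketch below uses a hypothetical toy cohort of 100 samples and made-up gene names; it shows only the bookkeeping, not the TCGA analysis itself.

```python
def mutated_fraction(gene_to_samples: dict, n_samples: int) -> dict:
    """Fraction of samples carrying at least one mutation in each gene."""
    return {gene: len(samples) / n_samples
            for gene, samples in gene_to_samples.items()}


def count_above(fractions: dict, threshold: float) -> int:
    """Number of genes mutated in more than `threshold` of the samples."""
    return sum(1 for f in fractions.values() if f > threshold)


# Hypothetical toy cohort: sample IDs 0..99, five made-up genes.
toy = {
    "GENE_A": set(range(30)),  # a "mountain": 30% of samples
    "GENE_B": set(range(4)),   # a "hill": 4%
    "GENE_C": {7, 42},         # 2%
    "GENE_D": {3},             # 1%
    "GENE_E": set(),           # never mutated
}
fractions = mutated_fraction(toy, n_samples=100)
above_5_percent = count_above(fractions, 0.05)
above_1_percent = count_above(fractions, 0.01)
never_mutated = sum(1 for f in fractions.values() if f == 0)
print(above_5_percent, above_1_percent, never_mutated)
```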

slide-9
SLIDE 9

Cancer is a disease of pathway disorders

  • However, mutations are concentrated in known cancer-related pathways, which suggests that a pathway-centric approach will be useful in the analysis of cancer genomics data.
  • Nat. Rev. Cancer Poster (2002)
slide-10
SLIDE 10

MUFFINN: mutations for functional impact on network neighbors

Genome Biology (2016)

  • Predict driver genes based on pathway-level mutational information

slide-11
SLIDE 11

3 ways to take into account neighbors’ mutational burden

  • On the following two functional gene networks: Genome Res. (2011); Nucleic Acids Res. (2015)

slide-12
SLIDE 12

Cancer gene sets for benchmarking prediction

  • No comprehensive gold-standard cancer gene set exists.
  • We compiled multiple cancer gene sets from various sources of annotations.
  • Each cancer gene set has a different trade-off between accuracy, coverage, and bias.

  • CGC: 422 genes from the CGC (Cancer Gene Census); Futreal et al. 2004
  • CGC PointMut: 118 CGC genes that contribute to cancer via point mutations
  • 20/20 Rules: 124 genes, based on mutational patterns; Vogelstein et al. 2013
  • HCD: 288 high-confidence driver genes identified by a rule-based approach; Tamborero et al. 2013
  • MouseMut: 797 ortholog-mapped genes identified by mutagenesis experiments in mice; March et al. 2011, Mann et al. 2012

slide-13
SLIDE 13

Result 1: MUFFINN performs better than gene-based methods

18 cancer types, ~6,700 TCGA samples

slide-14
SLIDE 14

Result 1: MUFFINN performs better than gene-based methods

  • Evaluation based on all candidates
  • Evaluation based on the top candidates, which go into follow-up studies

slide-15
SLIDE 15

Testing significance of using mutational information among indirect network neighbors for MUFFINN

  • Use mutation information of direct neighbors only
  • Use mutation information of all genes
slide-16
SLIDE 16

Result 2: MUFFINN predicts cancer drivers better when it uses only direct neighbors’ mutational information.

GS: Gaussian smoothing; IR: Iterative Rank; RWR: Random walk with restart
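
Of the three diffusion schemes named above, random walk with restart is the easiest to sketch. The version below is the generic textbook iteration, not code from MUFFINN; the four-gene chain network, the seed count, and the restart probability of 0.5 are arbitrary assumptions for illustration.

```python
def random_walk_with_restart(adj, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Propagate seed scores over an undirected network.

    `adj` maps each node to a list of neighbors. This sketch assumes every
    node has at least one neighbor, so probability mass is conserved.
    """
    nodes = list(adj)
    total = sum(seeds.get(n, 0.0) for n in nodes)
    p0 = {n: seeds.get(n, 0.0) / total for n in nodes}  # restart distribution
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}        # jump back to seeds
        for u in nodes:
            share = (1.0 - restart) * p[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share                           # step to a neighbor
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Hypothetical chain network A - B - C - D with all mutations observed in A.
chain = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
rwr_scores = random_walk_with_restart(chain, seeds={"A": 10})
print(sorted(rwr_scores, key=rwr_scores.get, reverse=True))
```

Because the walk reaches every connected gene, indirect neighbors such as C and D also receive some score; restricting the calculation to direct neighbors removes that contribution.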

slide-17
SLIDE 17

Result 3: The larger Pan-cancer dataset yields only marginal improvement in predictions.
slide-18
SLIDE 18

Result 4: MUFFINN effectively predicts cancer genes with only 10% of tumor samples.

slide-19
SLIDE 19

Manual examination of the novel candidate drivers

  • Selected 199 novel candidate drivers that pass all of the following criteria:

1. Predicted in the top 1000 by MUFFINN (Prob > 0.5)
2. Predicted in the top 1000 by neither MutSig nor MutationAssessor
3. Annotated by neither the CGC nor the 20/20 cancer gene set (to exclude all known genes)

  • Among the 199 candidate cancer genes, 128 (64%) have direct or indirect supporting evidence in the literature.
  • Class 1 (11 genes): already reported as cancer genes but not yet annotated by the CGC or 20/20 database.
  • Class 2 (14 genes): known to increase cancer susceptibility through germline variants.
  • Class 3 (14 genes): known to be involved in cancer through copy number variation (CNV) or structural variation (SV).
  • Class 4 (89 genes): associated with cancer via expression dysregulation caused by non-genetic alterations (e.g., epigenetic regulation, miRNA targeting).
  • Class 5 (71 genes): no evidence (novel candidates to be investigated in the future).
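
The three selection criteria amount to a rank cutoff followed by set exclusions. The following is a minimal sketch under assumed inputs: the function name, the data layout (ranked gene lists plus a probability lookup), and the toy gene IDs are hypothetical, and `top_n` is shrunk to 3 to keep the example readable.

```python
def select_novel_candidates(ranked, prob, other_rankings, known_sets,
                            top_n=1000, min_prob=0.5):
    """Keep top-ranked genes that other methods and known sets do not list."""
    excluded = set()
    for other in other_rankings:   # criterion 2: other methods' top lists
        excluded.update(other[:top_n])
    for known in known_sets:       # criterion 3: already annotated genes
        excluded.update(known)
    return [gene for gene in ranked[:top_n]  # criterion 1: rank and probability
            if prob.get(gene, 0.0) > min_prob and gene not in excluded]


# Hypothetical toy inputs.
network_ranked = ["G1", "G2", "G3", "G4"]
network_prob = {"G1": 0.9, "G2": 0.8, "G3": 0.4, "G4": 0.95}
frequency_ranked = ["G1", "G5", "G6", "G2"]   # stands in for a frequency-based method
impact_ranked = ["G7", "G8", "G9"]            # stands in for an impact-based method
known_a, known_b = {"G9"}, set()

novel = select_novel_candidates(
    network_ranked, network_prob,
    other_rankings=[frequency_ranked, impact_ranked],
    known_sets=[known_a, known_b],
    top_n=3,
)
print(novel)
```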

slide-20
SLIDE 20

Novel candidate drivers with low mutation occurrence have neighboring genes known to be involved in cancer pathways

slide-21
SLIDE 21

Performing prediction using a companion web server www.inetbio.org/muffinn

slide-22
SLIDE 22

Summary

  • Cancer genome sequencing can facilitate the discovery of cancer driver genes.
  • We can distinguish drivers from passengers based on recurrent mutations.
  • Conventional methods based on recurrent mutations are intrinsically limited for cancer genes with low mutation occurrence.
  • Since cancer is a pathway disease, incorporating pathway information will enhance cancer genomics data analysis.
  • We developed a network-based method, MUFFINN, and a companion web server, and demonstrated its superiority in cancer gene prediction.
  • Network-based analysis of cancer genomics data will provide a promising route to the comprehensive catalog of cancer genes.

slide-23
SLIDE 23

Acknowledgements

MUFFINN: cancer gene discovery via network analysis of somatic mutation data

Genome Biology 17:129 (June 2016)

  • Ben Lehner, Fran Supek: EMBL-CRG Systems Biology Unit, Centre for Genomic Regulation (Spain)
  • Ara Cho, Jung Eun Shim, Eiru Kim: Yonsei University, Department of Biotechnology (Korea)

slide-24
SLIDE 24

Network Biology Lab (www.netbiolab.org)

Current members

  • PhD. Jung Eun Shim
  • PhD. Eiru Kim

Tak Lee, Sunmo Yang, Kyungsoo Kim, Heonjong Han, Dasom Bae, Sangyoung Lee, Chan Yeong Kim, Muyoung Lee, Jaewon Cho, Eunbeen Kim

Former members

  • PhD. Sohyun Hwang
  • PhD. Taeyun Oh
  • PhD. Jawon Song
  • PhD. Samuel Beck
  • PhD. Jonghoon Lee
  • PhD. Yoonhee Ko
  • PhD. Junha Shin
  • PhD. Hanhae Kim
  • PhD. Ara Cho
  • PhD. Sungou Ji

Hongseok Shim, Hyojin Kim

slide-25
SLIDE 25
slide-26
SLIDE 26

Result: Accounting for mutational heterogeneity is not important for MUFFINN.

slide-27
SLIDE 27

HotNet2 vs. MUFFINN

HotNet2 (Nat. Genet. 2015)

1. Assign heat (mutation) to each gene
2. Diffuse heat from hot (highly mutated) to cold genes in the network
3. Extract significantly hot subnetworks (cancer pathways)

MUFFINN (this study)

1. Assign heat (mutation) to each gene
2. For each gene, measure the mutational burden over its network neighbors
3. Rank genes (cancer genes) by the mutational burden
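
The three MUFFINN steps can be sketched as a direct-neighbor scoring pass. How the neighbors' burden is combined with a gene's own count (here: own count plus the maximum or the sum over direct neighbors, unweighted) is an assumption made for illustration, and the published method may weight or normalize differently; the gene names and counts are hypothetical.

```python
def neighbor_burden_scores(adj, mutation_counts, combine=max):
    """Score each gene by its own mutations plus its direct neighbors' burden.

    `combine` reduces the list of neighbor counts (e.g. `max` or `sum`).
    """
    scores = {}
    for gene, neighbors in adj.items():
        own = mutation_counts.get(gene, 0)                       # step 1: heat
        around = [mutation_counts.get(n, 0) for n in neighbors]  # step 2: neighbors
        scores[gene] = own + (combine(around) if around else 0)
    return scores


def rank_genes(scores):
    """Step 3: rank genes by score, breaking ties alphabetically."""
    return sorted(scores, key=lambda g: (-scores[g], g))


# Hypothetical network: HUB is heavily mutated, X is never mutated itself.
toy_network = {
    "HUB": ["X", "Y"],
    "X": ["HUB"],
    "Y": ["HUB", "Z"],
    "Z": ["Y"],
}
toy_counts = {"HUB": 50, "Y": 3}

burden = neighbor_burden_scores(toy_network, toy_counts, combine=max)
ranking = rank_genes(burden)
print(ranking)
```

In this toy case the unmutated gene X ranks above Z purely because it sits next to a heavily mutated neighbor, which is the behavior that lets a network-based score surface genes from the long tail.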

slide-28
SLIDE 28

Result: HotNet2 and MUFFINN are complementary

  • Retrieval rate for known cancer genes among the 144 candidates by HotNet2 and the top 144 candidates by MUFFINN
  • Venn diagram of the 422 CGC genes, the 144 candidates by HotNet2, and the top 144 candidates by MUFFINN
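
Both panels reduce to set arithmetic on candidate lists. The sketch below shows one way to compute a retrieval rate and the known-gene overlap between two methods; the function names and the toy gene IDs are hypothetical and do not reproduce the actual HotNet2 or MUFFINN candidate lists.

```python
def retrieval_rate(candidates, known):
    """Fraction of candidates that are known cancer genes."""
    candidates = set(candidates)
    return len(candidates & set(known)) / len(candidates)


def known_overlap(list_a, list_b, known):
    """Known genes found by only method A, only method B, or both."""
    a, b, k = set(list_a), set(list_b), set(known)
    return {
        "only_a": len((a - b) & k),
        "only_b": len((b - a) & k),
        "both": len(a & b & k),
    }


# Hypothetical toy lists: K* are known cancer genes, N* are not.
known_genes = {"K1", "K2", "K3", "K4"}
method_a = ["K1", "K2", "N1", "N2"]
method_b = ["K2", "K3", "K4", "N3"]

rate_a = retrieval_rate(method_a, known_genes)
rate_b = retrieval_rate(method_b, known_genes)
overlap = known_overlap(method_a, method_b, known_genes)
print(rate_a, rate_b, overlap)
```

Two methods are complementary in this sense when the "only_a" and "only_b" counts are both non-zero, so that their union recovers more known genes than either list alone.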