Cancer gene discovery via network analysis of somatic mutation data
Insuk Lee
Cancer is a progressive genetic disorder.
- Accumulation of somatic mutations causes cancer.
- For example, in colorectal cancer, the first gatekeeping mutation (often occurring in APC) is followed by a series of oncogene activations and loss-of-function events in tumor suppressor genes, which eventually generates a malignant tumor.
Sequencing approach to the comprehensive catalog of cancer genes
- Tumor samples and matched normal samples (adjacent healthy tissue or blood) are sequenced by whole-exome sequencing (WES) and aligned to identify cancer-associated somatic mutations (and cancer genes).
- Nat. Rev. Genet 15:556 (2014)
Driver vs. Passenger mutations
- Driver mutation: a mutation that directly or indirectly confers a selective growth advantage to the cell in which it occurs (as opposed to a passenger mutation).
- Not all mutations are driver mutations. Therefore, not all genes that contain somatic mutations are cancer driver genes.
Nature 458:719 (2009)
Distinguishing Drivers from Passengers
§ Based on recurrent mutations
§ Based on the deleteriousness of the mutations
Using additional information to reduce false positives
- In MutSigCV, mutation frequency is normalized using the gene-specific background mutation rate (BMR), expression level, and replication timing.
Nat. Rev. Genet. 15:556 (2014)
What about cancer genes with low mutation rate?
Many hills but only a few mountains in the genomic landscapes of human colorectal cancers (Wood et al. Science 2007)
- Map of mutations in 11 breast and 11 colorectal cancers.
- In the landscape, the heights of the peaks reflect the mutation frequency of each gene. A few gene “mountains” are mutated in a large proportion of tumors; most genes are mutated in <5% of tumors and are represented as “hills” in the figure.
- We observed a similar distribution of mutation frequency in TCGA data.
Long-tail distribution of mutation frequency
[Figure: mutation count per gene (axis from 200 to 2000); labeled genes: TP53, PIK3CA, PTEN, BRAF, KMT2C, KMT2D, APC, ATRX, IDH1, ARID1A]
Mutation distribution across 422 CGC (Cancer Gene Census) genes in 6764 Pan-cancer samples (April 2014 TCGA); 410 of these genes are mutated.
- The majority of cancer genes are infrequently mutated and carry somatic mutations in only a few patients, which results in a long-tail distribution of mutation frequency.
- Therefore, methods based on recurrent mutations have an intrinsic limitation in cancer gene identification.
Among the 422 known cancer genes in CGC:
- 7 genes: mutated in >5% of tumors
- 128 genes: mutated in >1% of tumors
- 12 genes: not mutated in any tumor
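The counts above come from tallying, for each gene, the fraction of tumor samples that carry at least one somatic mutation in it. A minimal sketch of that tally follows; the input format (pairs of sample ID and gene), the function names, and the toy data are assumptions made for illustration and are not taken from the TCGA analysis itself.

```python
from collections import defaultdict

def mutation_frequency(mutations, n_samples):
    """Fraction of samples with at least one mutation in each gene.

    mutations: iterable of (sample_id, gene) pairs (hypothetical format).
    Repeated mutations of one gene in one sample are counted once.
    """
    samples_by_gene = defaultdict(set)
    for sample_id, gene in mutations:
        samples_by_gene[gene].add(sample_id)
    return {g: len(s) / n_samples for g, s in samples_by_gene.items()}

def summarize(freq, gene_set):
    """Count genes above the 5% and 1% thresholds, and unmutated genes."""
    values = [freq.get(g, 0.0) for g in gene_set]
    return {
        "over_5pct": sum(1 for v in values if v > 0.05),
        "over_1pct": sum(1 for v in values if v > 0.01),
        "unmutated": sum(1 for v in values if v == 0.0),
    }

# Toy data: 100 samples and 4 made-up genes.
toy_mutations = (
    [(s, "GENE_A") for s in range(10)]   # mutated in 10% of samples
    + [(0, "GENE_A")]                    # duplicate hit, counted once
    + [(s, "GENE_B") for s in range(2)]  # 2%
    + [(0, "GENE_C")]                    # exactly 1%, not above the threshold
)
toy_freq = mutation_frequency(toy_mutations, n_samples=100)
toy_summary = summarize(toy_freq, ["GENE_A", "GENE_B", "GENE_C", "GENE_D"])
print(toy_summary)
```

Counting each gene at most once per sample keeps a single hypermutated tumor from inflating a gene's frequency.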
Cancer is a disease of pathway disorders
- However, mutations are concentrated in known cancer-related pathways, which suggests that a pathway-centric approach will be useful in the analysis of cancer genomics data.
- Nat. Rev. Cancer Poster (2002)
MUFFINN: mutations for functional impact on network neighbors
Genome Biology (2016)
- Predict driver genes based on
pathway-level mutational information
3 ways to take into account neighbors’ mutational burden
- On the following two functional gene networks
Genome Res. (2011) Nucleic Acids Res. (2015)
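The idea of scoring a gene by the mutational burden of its network neighbors can be sketched as follows. The three schemes on the slide are not listed in this text, so the two variants below (adding the maximum or the sum of the direct neighbors' mutation counts to the gene's own count) are illustrative assumptions rather than the published MUFFINN formulas. The function name, the dictionary-based network format, and the toy genes are hypothetical.

```python
def neighbor_scores(network, mutation_count):
    """Score each gene by its own and its direct neighbors' mutation counts.

    network: dict mapping gene -> set of direct neighbors (assumed format).
    mutation_count: dict mapping gene -> number of mutated samples.
    """
    scores = {}
    for gene, neighbors in network.items():
        own = mutation_count.get(gene, 0)
        counts = [mutation_count.get(n, 0) for n in neighbors]
        scores[gene] = {
            "own": own,
            # Assumed variant 1: the single most mutated neighbor.
            "own_plus_max": own + max(counts, default=0),
            # Assumed variant 2: all direct neighbors together.
            "own_plus_sum": own + sum(counts),
        }
    return scores

# Toy network: a frequently mutated hub with two rarely mutated neighbors.
toy_network = {
    "HUB": {"RARE_1", "RARE_2"},
    "RARE_1": {"HUB"},
    "RARE_2": {"HUB"},
}
toy_counts = {"HUB": 50, "RARE_1": 3}
toy_scores = neighbor_scores(toy_network, toy_counts)
print(toy_scores["RARE_2"])
```

In the toy network, RARE_2 has no mutations of its own but still receives a high score because it is connected to the heavily mutated hub, which is the behavior a pathway-level method relies on for infrequently mutated genes.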
Cancer gene sets for benchmarking prediction
- No comprehensive gold standard cancer gene set
- We compiled multiple cancer gene sets from various sources of
annotations.
- Each cancer gene set has a different trade-off between accuracy,
coverage, and bias.
- CGC: 422 genes from the Cancer Gene Census (Futreal et al. 2004)
- CGC PointMut: 118 genes; CGC genes that act in cancer via point mutations
- 20/20 Rules: 124 genes; based on mutational patterns (Vogelstein et al. 2013)
- HCD: 288 genes; high-confidence driver genes identified by a rule-based approach (Tamborero et al. 2013)
- MouseMut: 797 genes; ortholog-mapped genes identified by mutagenesis experiments in mice (March et al. 2011; Mann et al. 2012)
Result 1: MUFFINN performs better than gene-based methods
18 cancer types, ~6700 TCGA samples
- Evaluation based on all candidates
- Evaluation based on the top candidates, which go into follow-up studies
Testing the significance of using mutational information among indirect network neighbors for MUFFINN
- Use mutation information of direct neighbors only
- Use mutation information of all genes
Result 2: MUFFINN predicts cancer drivers better when it uses only direct neighbors’ mutational information.
GS: Gaussian smoothing; IR: Iterative Rank; RWR: Random walk with restart
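For comparison with the direct-neighbor approach, the following is a minimal sketch of random walk with restart (RWR), one of the network diffusion methods named above, which spreads mutation information to all genes including indirect neighbors. It is a generic textbook formulation in plain Python, not the implementation used in the study; the function name, the restart probability of 0.5, and the toy path network are assumptions.

```python
def random_walk_with_restart(adjacency, seed_scores, restart=0.5,
                             tol=1e-10, max_iter=1000):
    """Propagate seed scores over an undirected network by RWR.

    adjacency: dict gene -> set of neighbors; every neighbor must also be
    a key (assumed symmetric). seed_scores: dict gene -> non-negative weight.
    """
    genes = sorted(adjacency)
    total = sum(seed_scores.get(g, 0.0) for g in genes)
    if total <= 0:
        raise ValueError("seed_scores must contain a positive weight")
    p0 = {g: seed_scores.get(g, 0.0) / total for g in genes}
    p = dict(p0)
    for _ in range(max_iter):
        # Restart step: return to the seed distribution with prob. `restart`.
        nxt = {g: restart * p0[g] for g in genes}
        for g in genes:
            nbrs = adjacency[g]
            if not nbrs:
                # Isolated gene: its walking mass stays in place.
                nxt[g] += (1 - restart) * p[g]
                continue
            # Walk step: split the remaining mass evenly among neighbors.
            share = (1 - restart) * p[g] / len(nbrs)
            for n in nbrs:
                nxt[n] += share
        delta = sum(abs(nxt[g] - p[g]) for g in genes)
        p = nxt
        if delta < tol:
            break
    return p

# Toy path network A - B - C with all seed weight on A.
toy_adjacency = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
toy_rwr = random_walk_with_restart(toy_adjacency, {"A": 1.0})
print(toy_rwr)
```

In the toy network, C is only an indirect neighbor of the seed A yet still receives a nonzero score, which illustrates how diffusion methods differ from using direct neighbors only.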
Result 3: The larger size of Pan-cancer data makes only a marginal improvement in predictions.
Result 4: MUFFINN effectively predicts cancer genes with only 10% of tumor samples.
Manual examination of the novel candidate drivers
- Selected 199 novel candidate drivers that pass all of the following criteria:
1. Predicted in the top 1000 by MUFFINN (Prob > 0.5)
2. Predicted in the top 1000 by neither MutSig nor MutationAssessor
3. Annotated by neither the CGC nor the 20/20 cancer gene set (to exclude all known genes)
- Among the 199 candidate cancer genes, 128 (64%) have direct or indirect supporting evidence in the literature.
- Class 1 (11 genes): already reported as cancer genes but not annotated yet by
CGC or 20/20 database.
- Class 2 (14 genes): known to increase cancer susceptibility through germline
variants.
- Class 3 (14 genes): known to be involved in cancer by copy number variation
(CNV) or structural variation (SV).
- Class 4 (89 genes): associated with cancer via expression dysregulation with
non-genetic alterations (e.g., epigenetic regulation, miRNA target).
- Class 5 (71 genes): with no evidence (novel candidates to be investigated in the
future)
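The three selection criteria listed at the top of this slide amount to a set difference, which can be sketched as below. The function name, input formats, and toy genes are hypothetical, and reading "top 1000 (Prob > 0.5)" as requiring both conditions at once is an assumption.

```python
def select_novel_candidates(muffinn_prob, other_rankings, known_genes,
                            top_n=1000, min_prob=0.5):
    """Return genes ranked highly by MUFFINN only and not already known.

    muffinn_prob: dict gene -> predicted probability (assumed format).
    other_rankings: list of gene lists, best first, one per other method.
    known_genes: iterable of already annotated cancer genes.
    """
    ranked = sorted(muffinn_prob, key=muffinn_prob.get, reverse=True)
    # Criterion 1: within the top N and above the probability cutoff.
    top_muffinn = {g for g in ranked[:top_n] if muffinn_prob[g] > min_prob}
    # Criterion 2: not in the top N of any gene-based method.
    top_other = set()
    for ranking in other_rankings:
        top_other.update(ranking[:top_n])
    # Criterion 3: not annotated in the known cancer gene sets.
    return sorted(top_muffinn - top_other - set(known_genes))

# Toy example with top_n=2 and made-up gene names.
toy_prob = {"G1": 0.9, "G2": 0.8, "G3": 0.7, "G4": 0.4}
toy_other = [["G1", "G9"], ["G8", "G7"]]
toy_candidates = select_novel_candidates(toy_prob, toy_other, {"G5"}, top_n=2)
print(toy_candidates)
```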
Novel candidate drivers with low mutation occurrence have neighboring genes known to be involved in cancer pathways
Performing prediction using a companion web server www.inetbio.org/muffinn
Summary
§ Cancer genome sequencing can facilitate the discovery of cancer driver genes.
§ We can distinguish drivers from passengers based on recurrent mutations.
§ Conventional methods based on recurrent mutations are intrinsically limited for cancer genes with low mutation occurrence.
§ Since cancer is a pathway disease, incorporating pathway information will enhance cancer genomics data analysis.
§ We developed a network-based method, MUFFINN, and a companion web server, and demonstrated its superiority in cancer gene prediction.
§ Network-based analysis of cancer genomics data will provide a promising route to the comprehensive catalog of cancer genes.
Acknowledgements
Ben Lehner, Fran Supek: EMBL-CRG Systems Biology Unit, Centre for Genomic Regulation (Spain)
Ara Cho, Jung Eun Shim, Eiru Kim: Yonsei University, Department of Biotechnology (Korea)
MUFFINN: cancer gene discovery via network analysis of somatic mutation data. Genome Biology 17:129 (June 2016)
Network Biology Lab (www.netbiolab.org)
Current members
- PhD. Jung Eun Shim
- PhD. Eiru Kim
Tak Lee, Sunmo Yang, Kyungsoo Kim, Heonjong Han, Dasom Bae, Sangyoung Lee, Chan Yeong Kim, Muyoung Lee, Jaewon Cho, Eunbeen Kim
Former members
- PhD. Sohyun Hwang
- PhD. Taeyun Oh
- PhD. Jawon Song
- PhD. Samuel Beck
- PhD. Jonghoon Lee
- PhD. Yoonhee Ko
- PhD. Junha Shin
- PhD. Hanhae Kim
- PhD. Ara Cho
- PhD. Sungou Ji
Hongseok Shim, Hyojin Kim