

slide-1
SLIDE 1

Ontology, Network, and Pathway Analysis of Large Datasets

Willard Freeman wfreeman@psu.edu

slide-2
SLIDE 2

Flow of genetic information

A buffet of ‘omes

slide-3
SLIDE 3

Discovery approaches

slide-4
SLIDE 4

Discovery approaches

Knowledge versus Data

slide-5
SLIDE 5

But we can’t see everything

  • “Measure what is measurable, and make measurable what is not so.” – Galileo
slide-6
SLIDE 6

Technologies Available for Gene Expression Studies

The choice of appropriate technology is a balance between the number of genes to be analyzed and the number of samples to be analyzed (and costs).

slide-7
SLIDE 7

Analytical Flow of Discovery Studies

  • I can generate lots of data.
  • How do I make use of it and what are the steps?

slide-8
SLIDE 8

GeneSpring, R, etc. – data analysis (stats, ontology, classification statistics, PCA, heatmaps)
Ingenuity – pathways, networks, effect on function, localization
Genome Studio – Illumina data analysis, QC
DeCyder w/ EDA – 2DIGE analysis, stats, some clustering
Progenesis – 2DIGE analysis, stats, some clustering
Protein Pilot – iTRAQ analysis
SDS – qPCR analysis

[Figure labels: Application Specific Software; Proteomics; Transcriptomics; Gene Ontology; Biological processes; Other visualization, process, and pathway programs; Statistical Analyses; Biological Interpretation]

slide-9
SLIDE 9

Gene Ontologies

Are certain categories of genes/proteins over-represented in your population of changes as compared to the entire genome/proteome?

The Gene Ontology (GO) project (http://www.geneontology.org/) provides structured, controlled vocabularies and classifications that cover several domains of molecular and cellular biology and are freely available for community use in the annotation of genes, gene products and sequences.

slide-10
SLIDE 10

Gene Ontologies

  • Prime categories
    – Molecular function – what specific biochemical action(s)
      • Kinase, iron binding, etc.
    – Biological process – what process is this part of
      • Proteolytic degradation, neurotransmitter release, etc.
    – Cellular component – where is it
      • Nucleus, ER, etc.
  • Can assess over-representation of categories in your gene/protein list by Fisher’s exact test
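The over-representation question above can be sketched as a one-sided Fisher’s exact test, which for this kind of 2×2 table is the upper tail of the hypergeometric distribution. This is a minimal illustration using only the Python standard library, not the implementation used by any particular GO tool; the function name and all counts are hypothetical.

```python
from math import comb

def enrichment_p_value(genome_size, category_size, list_size, overlap):
    """One-sided Fisher's exact test p-value for over-representation.

    Probability of seeing at least `overlap` category members in a list of
    `list_size` genes drawn without replacement from `genome_size` genes,
    of which `category_size` belong to the category.
    """
    max_overlap = min(category_size, list_size)
    total = comb(genome_size, list_size)
    # Sum the hypergeometric probabilities for overlap, overlap + 1, ...
    tail = sum(
        comb(category_size, k) * comb(genome_size - category_size, list_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total

# Hypothetical counts: 500 of 20000 genes are annotated as kinases,
# and 15 of the 200 genes in the list of changes are kinases.
p = enrichment_p_value(genome_size=20000, category_size=500, list_size=200, overlap=15)
print(f"p = {p:.3g}")
```

About 5 kinases would be expected by chance in a list of 200 genes here, so 15 gives a small p-value. The test sees only list membership, not the direction or size of each change.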

slide-11
SLIDE 11

Example output from GO analysis of the example dataset

[Figure labels: GO category; multiple testing corrected p-value; number of genes in your list that are in a category; number of genes in the genome that are in a category]
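Because many categories are tested at once, the output reports a multiple testing corrected p-value. One common correction is Benjamini-Hochberg; the sketch below assumes that method, and the category names and raw p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that each adjusted value is the
    # minimum of p * m / rank over its own rank and all higher ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values for four GO categories.
categories = ["kinase activity", "iron binding", "nucleus", "proteolysis"]
raw = [0.001, 0.02, 0.03, 0.5]
for name, p_raw, p_adj in zip(categories, raw, benjamini_hochberg(raw)):
    print(f"{name}: raw p = {p_raw:.3g}, corrected p = {p_adj:.3g}")
```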

slide-12
SLIDE 12

Gene Ontologies

  • Important facts
    – Genes can belong to many categories
    – Classifications are artificial, human generated
    – Unknown functions not included
    – Just a list function: does not take direction of change into account
  • Strengths
    – Fairly comprehensive
    – Easy analysis
  • Weaknesses
    – Sometimes only vaguely informative
    – Redundant classifications

slide-13
SLIDE 13

Other Ontology Tools

  • Kyoto Encyclopedia of Genes and Genomes (KEGG)
    – http://www.genome.jp/kegg/pathway.html
  • Database for Annotation, Visualization and Integrated Discovery (DAVID)
    – http://david.abcc.ncifcrf.gov/
  • PANTHER (Protein ANalysis THrough Evolutionary Relationships)
    – http://www.pantherdb.org/

slide-14
SLIDE 14

Panther

  • Are specific processes over/under represented?
  • Very similar to GO
  • Uncertainty as to continued curation

slide-15
SLIDE 15

Panther

slide-16
SLIDE 16

Pathways and Networks

  • Pathway
    – A well characterized chain of molecular events leading to some functional outcome
    – Human created
  • Network
    – A set of inter-related genes/proteins
      • Relationships can be human generated or computer predicted (e.g. protein-protein binding)
    – May or may not have a known ‘functional’ outcome
slide-17
SLIDE 17

Pathways and Network Programs

  • Ingenuity
    – www.ingenuity.com
    – Web accessed
    – Institutional license (will be re-activated soon)
    – May be small charge
  • Ariadne
    – http://www.ariadnegenomics.com/
    – Individual license at UP
  • KEGG, PathCase, Gephi, GenMAPP
slide-18
SLIDE 18

Ingenuity Example

  • Database is a combination of hand-entered literature, natural language processing, and retrieval of public databases ranging from gene expression atlases to ClinicalTrials.gov.
  • Very good for a wide variety of input types and identifiers
    – SNPs (dbSNP)
    – mRNA (Unigene, RefSeq, specific array IDs, Entrez Gene)
    – miRNA (miRBase)
    – Proteins (UniProt, GI, HUGO gene symbol)
    – Metabolites (HMDB)
    – Small molecule chemicals (PubChem, CAS registry)

slide-19
SLIDE 19

Single Gene Analysis

  • I have some gene and I don’t know what it is and PubMed isn’t helping
slide-20
SLIDE 20

Importing a dataset

  • I have a set (10s to 1000s) of genes/proteins/etc. and want to see how changes may work additively or synergistically
  • Importing data
    – Format
      • A recognizable identifier
      • A ratio or fold change value
      • Can also include p values and multiple different comparisons

slide-21
SLIDE 21

Most initial challenges with network/pathway analysis come from formatting the primary data so that the appropriate software can parse it. Remember that software is not smart: you have to be explicit in how the data are organized.
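One way to catch formatting problems before upload is to check each row for a usable identifier and a numeric fold change. The sketch below assumes a tab-delimited file with a header row; the column names `identifier`, `fold_change`, and `p_value` and the example rows are hypothetical, not a format required by any specific program.

```python
import csv
import io

def check_rows(text):
    """Split tab-delimited rows into parsable rows and (line number, reason) problems."""
    good, problems = [], []
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        ident = (row.get("identifier") or "").strip()
        if not ident:
            problems.append((line_no, "missing identifier"))
            continue
        try:
            fold = float(row.get("fold_change") or "")
        except ValueError:
            problems.append((line_no, "fold change is not a number"))
            continue
        good.append((ident, fold))
    return good, problems

# Hypothetical export: one clean row, one without an identifier,
# and one with text where a number is expected.
example = (
    "identifier\tfold_change\tp_value\n"
    "Lilrb3\t2.1\t0.003\n"
    "\t1.5\t0.04\n"
    "GENE_B\tup\t0.01\n"
)
good, problems = check_rows(example)
print("parsable rows:", good)
print("problems:", problems)
```

Reporting the line number with each problem makes it easier to fix the spreadsheet at the source rather than silently dropping rows.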

slide-22
SLIDE 22

Importing data

  • Identifiers
  • Quantitative information
  • Did the parsing work?
  • Any additional filtering?
  • Starting an analysis
slide-23
SLIDE 23

Summary of analysis

slide-24
SLIDE 24

Functions – analogous to ontologies
Threshold – Fisher’s exact test p value, shown as –log(p value). A Benjamini-Hochberg multiple testing correction (BHMTC) can be applied to the statistics.

slide-25
SLIDE 25

Threshold – Fisher’s exact test p value, shown as –log(p value). A Benjamini-Hochberg multiple testing correction (BHMTC) can be applied to the statistics.
Ratio – fraction (‘%’) of genes in the pathway that are altered in the dataset
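The two quantities on this slide are simple to compute by hand. The sketch below assumes a base-10 logarithm for the threshold scale and treats the ratio as altered pathway genes divided by all pathway genes; the gene names are hypothetical placeholders.

```python
from math import log10

def neg_log10(p_value):
    """Convert a p-value to the -log10 scale, where larger means more significant."""
    return -log10(p_value)

def pathway_ratio(dataset_genes, pathway_genes):
    """Fraction of the pathway's genes that appear in the altered dataset."""
    pathway = set(pathway_genes)
    return len(pathway & set(dataset_genes)) / len(pathway)

# A p = 0.05 cutoff sits at about 1.3 on the -log10 scale.
print(f"threshold at p = 0.05: {neg_log10(0.05):.2f}")

# Hypothetical pathway of four genes, two of which are altered in the dataset.
ratio = pathway_ratio({"GENE_A", "GENE_B", "GENE_X"}, ["GENE_A", "GENE_B", "GENE_C", "GENE_D"])
print(f"ratio: {ratio:.2f}")
```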

slide-26
SLIDE 26

Quantitative data – by shading, can mouse over for numeric data
Genes – each gene (or group of genes) can be clicked for additional data
Relationships – each line can be clicked for specific information
Modifying – the pathway can be modified in Ingenuity, to add/subtract information and create presentation figures

slide-27
SLIDE 27

Pathways

  • Are a conceptual artifice.
  • Significance is dependent on the size (# of molecules in pathway).
  • Can create your own pathways which are focused on the topic of interest.
  • Pathways do not always pop out as blatantly obvious, particularly with heterogeneous or subtle treatments.
  • If you are an expert in the XYZ pathway you will know more than a database.
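The dependence of significance on pathway size can be illustrated with a one-sided Fisher’s exact test: the same fraction of altered genes gives very different p-values in a small and a large pathway. This is a sketch under assumed numbers (a 20000-gene genome with 1000 altered genes), not a reproduction of how any pathway program scores its results.

```python
from math import comb

def overrep_p(genome, altered, pathway_size, hits):
    """P(at least `hits` altered genes in a pathway) if pathway genes were random."""
    total = comb(genome, pathway_size)
    tail = sum(
        comb(altered, k) * comb(genome - altered, pathway_size - k)
        for k in range(hits, min(altered, pathway_size) + 1)
    )
    return tail / total

GENOME, ALTERED = 20000, 1000  # hypothetical: 5% of all genes are altered

# Both pathways have 20% of their genes altered.
small = overrep_p(GENOME, ALTERED, pathway_size=10, hits=2)
large = overrep_p(GENOME, ALTERED, pathway_size=100, hits=20)
print(f"10-gene pathway, 2 altered:   p = {small:.3g}")
print(f"100-gene pathway, 20 altered: p = {large:.3g}")
```

The small pathway does not reach a conventional 0.05 cutoff while the large one is far below it, which is one reason a focused, custom pathway can look unimpressive statistically even when it is biologically relevant.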

slide-28
SLIDE 28

Networks

Are there sets of inter-related genes that are regulated in your condition? Relationships are associative and not necessarily deterministic.

Calvano SE, et al Nature 2005 PMID: 16136080.

slide-29
SLIDE 29

Can add function and process callouts
Toggle between pair-wise comparisons to examine commonalities and differences

slide-30
SLIDE 30
slide-31
SLIDE 31

Other Ingenuity Functions

  • Toxicology, figure generation, data set combination, list comparisons, biomarker development.
  • Data sharing
    – Analyses can be shared across users
    – Excellent for sharing in collaborative projects, or just between lab members, in a manner more intuitive than an Excel sheet.

slide-32
SLIDE 32

Ingenuity

  • Strengths

    – Multiple sources of data combined in one place
    – Extensive use of synonyms and multiple identifiers (e.g. the same gene is known as Lilrb3 and PirB; one field uses one name and another field uses the other)
    – Finding unexpected results
    – Providing statistical and visual representations of findings to prioritize 2nd generation experiments
    – Sharing across investigators
    – Combining multiple types of data

slide-33
SLIDE 33

Ingenuity

  • Weaknesses

    – Not omniscient
    – For niche pathways, poor representation
      • Can create your own
    – Will not write your paper for you
    – Networks are sometimes not informative of biological outcomes due to limitations in the existing knowledge

slide-34
SLIDE 34

Combining Proteomics and Functional Genomics

  • How well do message and transcript match up?

[Figure labels: Gene; mRNA (×3); Protein; Protein-P; Protein-Gly; Protein-P-Gly]

slide-35
SLIDE 35

Appropriate use of nomenclature is critical for comparing caterpillars and butterflies

slide-36
SLIDE 36

Datasets from multiple sources can be merged for subsequent analyses with accurate translation to a common identifier.
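A minimal sketch of such a merge, assuming each dataset is a mapping from its own identifiers to a value and that a synonym table translating every identifier to one common name is available. The function name, the table, and the fold change values are hypothetical; the Lilrb3/PirB pair is the synonym example from the Ingenuity strengths slide.

```python
def merge_by_common_id(datasets, synonyms):
    """Merge {label: {identifier: value}} into {common_id: {label: value}}.

    Identifiers missing from `synonyms` are returned separately rather than guessed.
    """
    merged, unmapped = {}, []
    for label, values in datasets.items():
        for identifier, value in values.items():
            common = synonyms.get(identifier)
            if common is None:
                unmapped.append((label, identifier))
                continue
            merged.setdefault(common, {})[label] = value
    return merged, unmapped

# Hypothetical fold changes: the transcript data uses one gene name and
# the protein data uses the other name for the same gene.
datasets = {
    "mRNA": {"Lilrb3": 2.0, "GENE_C": 0.5},
    "protein": {"PirB": 1.4, "UNKNOWN_1": 3.0},
}
synonyms = {"Lilrb3": "Lilrb3", "PirB": "Lilrb3", "GENE_C": "GENE_C"}

merged, unmapped = merge_by_common_id(datasets, synonyms)
print(merged)
print("could not translate:", unmapped)
```

Keeping the untranslated identifiers in their own list makes translation failures visible instead of silently shrinking the merged dataset.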

slide-37
SLIDE 37

Good hunting