PCA and clustering STEPHANIE J. SPIELMAN, PHD BIO5312, FALL 2017 - - PowerPoint PPT Presentation

pca and clustering
SMART_READER_LITE
LIVE PREVIEW

PCA and clustering STEPHANIE J. SPIELMAN, PHD BIO5312, FALL 2017 - - PowerPoint PPT Presentation

PCA and clustering STEPHANIE J. SPIELMAN, PHD BIO5312, FALL 2017 Exploratory methods for high- dimensional data Principal components analysis (PCA) Note there are many similar methods, e.g. linear discriminant analysis Clustering


slide-1
SLIDE 1

PCA and clustering

STEPHANIE J. SPIELMAN, PHD BIO5312, FALL 2017

slide-2
SLIDE 2

Exploratory methods for high- dimensional data

Principal components analysis (PCA)

  • Note there are many similar methods, e.g. linear discriminant analysis

Clustering

  • K-means
  • Hierarchical
  • Again, many more
slide-3
SLIDE 3

Principal components analysis

Linear algebra technique to emphasize axes of variation in the data

PCA offers new coordinate system to emphasize variation in the data

slide-4
SLIDE 4

Do it yourself!

There are as many PCs are there are variables

  • NUMERIC ONLY

http://setosa.io/ev/principal-component-analysis/

slide-5
SLIDE 5

Example: Iris

2 4 6 5 6 7 8

Sepal.Length Petal.Length

0.0 0.5 1.0 1.5 2.0 2.5 5 6 7 8

Sepal.Length Petal.Width

2 4 6 2.0 2.5 3.0 3.5 4.0 4.5

Sepal.Width Petal.Length

0.0 0.5 1.0 1.5 2.0 2.5 2.0 2.5 3.0 3.5 4.0 4.5

Sepal.Width Petal.Width

Species

setosa versicolor virginica

How well we can tell species apart depends on plotting strategy

slide-6
SLIDE 6

PCA on iris

> iris %>% select(-Species) %>% ### Remove any non-numeric columns scale() %>% ### Scale the data (columns in same units) prcomp() -> iris.pca ### Run the PCA with prcomp()

slide-7
SLIDE 7

PCA output

## Rotation matrix: Loadings are the percent of variance explained by the variable > iris.pca$rotation PC1 PC2 PC3 PC4 Sepal.Length 0.5210659 -0.37741762 0.7195664 0.2612863 Sepal.Width

  • 0.2693474 -0.92329566 -0.2443818 -0.1235096

Petal.Length 0.5804131 -0.02449161 -0.1421264 -0.8014492 Petal.Width 0.5648565 -0.06694199 -0.6342727 0.5235971 Sepal.Length, Petal.Length, and Petal.Width load positively on PC1. Sepal.Width shows a weaker negative loading on PC1. PC2 is dominated by Sepal.Width, which loads strongly and negatively.

slide-8
SLIDE 8

PCA output

#### The actual principal components > head(iris.pca$x) PC1 PC2 PC3 PC4 [1,] -2.257141 -0.4784238 0.12727962 0.024087508 [2,] -2.074013 0.6718827 0.23382552 0.102662845 [3,] -2.356335 0.3407664 -0.04405390 0.028282305 [4,] -2.291707 0.5953999 -0.09098530 -0.065735340 [5,] -2.381863 -0.6446757 -0.01568565 -0.035802870 [6,] -2.068701 -1.4842053 -0.02687825 0.006586116

slide-9
SLIDE 9

PCA output

#### Standard deviation of components is represents the percent of variation each component explains, ish > iris.pca$sdev [1] 1.7083611 0.9560494 0.3830886 0.1439265 ### Compute variance explained: > (iris.pca$sdev)^2 / (sum(iris.pca$sdev^2)) [1] 0.729624454 0.228507618 0.036689219 0.005178709 PC1 explains ~73% of variance in the data. By definition, PC1 explains the most variation (and so on) PC2 explains ~ 23% of variance in the data etc.

slide-10
SLIDE 10

Visualizing the PCA: PC vs PC

#### Bring back the original data for plotting > plot.pca <- cbind(iris, iris.pca$x) > ggplot(plot.pca, aes(x = PC1, y = PC2, color = Species)) + geom_point()

−2 −1 1 2 −2 2

PC1 PC2 Species

setosa versicolor virginica

slide-11
SLIDE 11

Visualizing the PCA: PC vs PC

#### Bring back the original data for plotting > plot.pca <- cbind(iris, iris.pca$x) > ggplot(plot.pca, aes(x = PC1, y = PC2, color = Species)) + geom_point() + stat_ellipse()

−2 −1 1 2 −2 2

PC1 PC2 Species

setosa versicolor virginica

Species separate along PC1 PC1 discriminates species. Species spread evenly across PC2.

slide-12
SLIDE 12

PC1 vs PC3?

−1.0 −0.5 0.0 0.5 1.0 −2 2

PC1 PC3 Species

setosa versicolor virginica

Species separate along PC1 PC1 discriminates species. Setosa is more compact along PC3, whereas there is more spread for versicolor/virginica along PC3.

slide-13
SLIDE 13

Visualizing the PCA: Loadings

> as.data.frame(iris.pca$rotation) %>% rownames_to_column() -> loadings rowname PC1 PC2 PC3 PC4 1 Sepal.Length 0.5210659 -0.37741762 0.7195664 0.2612863 2 Sepal.Width -0.2693474 -0.92329566 -0.2443818 -0.1235096 3 Petal.Length 0.5804131 -0.02449161 -0.1421264 -0.8014492 4 Petal.Width 0.5648565 -0.06694199 -0.6342727 0.5235971

Sepal.Length Sepal.Width Petal.Length Petal.Width −1.0 −0.5 0.0 0.5 1.0 −1.0 −0.5 0.0 0.5 1.0

PC1 PC2

> ggplot(loadings) + geom_segment(x=0, y=0, aes(xend=PC1, yend=PC2)) + geom_text(aes(x=PC1, y=PC2, label=rowname), size=3, color='red') + xlim(-1.,1) + ylim(-1.,1.) + coord_fixed()

slide-14
SLIDE 14

Loadings with arrows

> arrow_style <- arrow(length = unit(0.05, "inches"), type = "closed") > ggplot(loadings) + geom_segment(x=0, y=0, aes(xend=PC1, yend=PC2), arrow = arrow = arrow_style arrow_style) + geom_text(aes(x=PC1, y=PC2, label=rowname), size=3, color='red') + xlim(-1.,1) + ylim(-1.,1.) + coord_fixed()

Sepal.Length Sepal.Width Petal.Length Petal.Width −1.0 −0.5 0.0 0.5 1.0 −1.0 −0.5 0.0 0.5 1.0

PC1 PC2

Petal.Length and Petal.Width load positively on PC1, but not at all on PC2. Sepal.Width is orthogonal to petals, meaning it captures uncorrelated variation

slide-15
SLIDE 15

Variation explained

> as.tibble((iris.pca$sdev)^2 / (sum(iris.pca$sdev^2))) -> variance # A tibble: 4 x 1 value <dbl> 1 0.729624454 2 0.228507618 3 0.036689219 4 0.005178709 > variance %>% mutate(PC = colnames(iris.pca$x)) %>% ggplot(aes(x = PC, y = value)) + geom_bar(stat = "identity")

0.0 0.2 0.4 0.6 PC1 PC2 PC3 PC4

PC value

slide-16
SLIDE 16

Variation explained

> variance %>% mutate(PC = colnames(iris.pca$x)) %>% ggplot(aes(x = PC, y = value)) + geom_bar(stat = "identity") + geom_text geom_text(aes aes(x = PC, y = value+0.01, label=100*round(value,3 (x = PC, y = value+0.01, label=100*round(value,3))) ))

73 22.9 3.7 0.5

0.0 0.2 0.4 0.6 PC1 PC2 PC3 PC4

PC value

slide-17
SLIDE 17

Breathe break

slide-18
SLIDE 18

Clustering

A family of approaches to identify previously unknown or undetected groupings in data Requires:

  • A measure of distance and/or similarity among data points
  • A clustering algorithm to create the groupings
slide-19
SLIDE 19

There are too many algorithms and no real answers

slide-20
SLIDE 20

K-means clustering

Clusters data into k groups of equal variance by minimizing the within-cluster sum of squares Divide n data points into k disjoint clusters, each described by its mean (ish) K is specified in advance

slide-21
SLIDE 21

K-means algorithm

  • 1. Place k "centroids" in the data
  • 2. Assign point to cluster k based on Euclidian distance
  • 3. Re-compute each k centroid based on means of associated

points

  • 4. Re-assign centroids
  • 5. Repeat until convergence

https://en.wikipedia.org/wiki/K-means_clustering#/media/File:K-means_convergence.gif

slide-22
SLIDE 22

Do it yourself here:

https://www.naftaliharris.com/blog/visualizing-k-means- clustering/

slide-23
SLIDE 23

K-means caveats

Clustering depends on initial conditions Algorithm guaranteed to converge, but possibly on local optima No real way to know if clusters have meaning beyond the math

  • This is true for all clustering!
slide-24
SLIDE 24

Example: iris with K=5

> iris %>% select(-Species) %>% ### We can only cluster numbers! kmeans(5)

K-means clustering with 5 clusters of sizes 50, 12, 25, 24, 39 Cluster means: Sepal.Length Sepal.Width Petal.Length Petal.Width 1 5.006000 3.428000 1.462000 0.246000 2 7.475000 3.125000 6.300000 2.050000 3 5.508000 2.600000 3.908000 1.204000 4 6.529167 3.058333 5.508333 2.162500 5 6.207692 2.853846 4.746154 1.564103 Clustering vector: [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 5 5 5 3 5 5 5 3 5 3 3 5 3 5 3 5 5 3 5 3 5 3 5 5 [75] 5 5 5 5 5 3 3 3 3 5 3 5 5 5 3 3 3 5 3 3 3 3 3 5 3 3 4 5 2 4 4 2 3 2 4 2 4 [112] 4 4 5 4 4 4 2 2 5 4 5 2 5 4 2 5 5 4 2 2 2 4 5 5 2 4 4 5 4 4 4 5 4 4 4 5 4 [149] 4 5 Within cluster sum of squares by cluster: [1] 15.15100 4.65500 8.36640 5.46250 12.81128 (between_SS / total_SS = 93.2 %) Available components: [1] "cluster" "centers" "totss" "withinss" "tot.withinss" [6] "betweenss" "size" "iter" "ifault"

slide-25
SLIDE 25

Example: iris with K=5… and broom!

> iris %>% select(-Species) %>% ### We can only cluster numbers! kmeans(5) %>% augment(iris) %>% augment(iris) %>% ### Add clusters back into to original data frame head() Sepal.Length Sepal.Width Petal.Length Petal.Width Species .cluster 1 5.1 3.5 1.4 0.2 setosa 4 2 4.9 3.0 1.4 0.2 setosa 4 3 4.7 3.2 1.3 0.2 setosa 4 4 4.6 3.1 1.5 0.2 setosa 4 5 5.0 3.6 1.4 0.2 setosa 4 6 5.4 3.9 1.7 0.4 setosa 4

slide-26
SLIDE 26

tidy() shows per-cluster information

> iris %>% select(-Species) %>% ### We can only cluster numbers! kmeans(5) %>% tidy() tidy() x1 x2 x3 x4 size withinss cluster 1 5.532143 2.635714 3.960714 1.2285714 28 9.749286 1 2 6.264444 2.884444 4.886667 1.6666667 45 17.014222 2 3 4.704545 3.122727 1.413636 0.2000000 22 3.114091 3 4 7.014815 3.096296 5.918519 2.1555556 27 15.351111 4 5 5.242857 3.667857 1.500000 0.2821429 28 4.630714 5

slide-27
SLIDE 27

Visualize the clustering

2.0 2.5 3.0 3.5 4.0 4.5 2 4 6

Petal.Length Sepal.Width .cluster

1 2 3 4 5

> iris %>% select(-Species) %>% kmeans(5) %>% augment(iris) %>% ggplot(aes(x = Petal.Length, y=Sepal.Width)) + geom_point(aes(color = .cluster))

No clear way to know "best" X and Y axes besides exhaustive plotting

slide-28
SLIDE 28

Was K=5 reasonable?

One (of many) approaches to choosing the best K is the "elbow method"

  • Plot within-sum-of-squares across different K choices
  • "Best" k is where you see an elbow/kink in the plot
  • Highly subjective
slide-29
SLIDE 29

Choosing K with broom

>iris %>% select(-Species) %>% kmeans(5) %>% glance() totss tot.withinss tot.withinss betweenss iter 1 681.3706 51.08942 51.08942 630.2812 2 > numeric.iris <- iris %>% select(-Species) > tibble(k = 2:15) %>% group_by(k) %>% do(kclust=kmeans(numeric.iris, .$k)) %>% glance(kclust)

k totss tot.withinss betweenss iter 1 2 681.3706 152.34795 529.0226 1 2 3 681.3706 78.85144 602.5192 2 3 4 681.3706 57.26562 624.1050 2 4 5 681.3706 49.82228 631.5483 2 5 6 681.3706 42.42155 638.9491 4 6 7 681.3706 36.83714 644.5335 3 7 8 681.3706 40.84578 640.5248 3 ...

slide-30
SLIDE 30

Choosing K with broom

> numeric.iris <- iris %>% select(-Species) > tibble(k = 2:15) %>% group_by(k) %>% do(kclust=kmeans(numeric.iris, .$k)) %>% glance(kclust) %>% mutate(g = 1) %>% ### mutate(g = 1) %>% ### ggplot ggplot gets gets angsty angsty with with geom_line geom_line without this specification without this specification ggplot ggplot(aes aes(x = factor(k), y = (x = factor(k), y = tot.withinss tot.withinss, group=g)) + , group=g)) + geom_point geom_point() + () + geom_line geom_line() ()

40 80 120 2 3 4 5 6 7 8 9 10 11 12 13 14 15

factor(k) tot.withinss

"Elbow" = shift in slope happens around K=4

slide-31
SLIDE 31

Hierarchical clustering

Extremely common in gene expression and/or systems biology studies Useful when data have a hierarchical structure:

slide-32
SLIDE 32

Approach

Divisive (top down) Agglomerative (bottom up)

slide-33
SLIDE 33

Example output

https://cran.r-project.org/web/packages/dendextend/vignettes/Cluster_Analysis.html

2 3 4 6

slide-34
SLIDE 34

You will see this figure in every –omics paper you read

Alizadeh, A. et al., Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, Nature 403, p. 503-511, 2000 Alizadeh et al. Nature 2000
slide-35
SLIDE 35

What does the real world have to say?

slide-36
SLIDE 36 BMC Bioinformatics 9 (page number not for citation purposes) PCA plot for Alizadeh-V2 Figure 6 PCA plot for Alizadeh-V2. We display a scatter plot with the two first largest components of a PCA for Alizadeh-V2. Colors indicate the three classes in the data: diffuse large B-cell lymphoma in red (DLBCL), follicular lymphoma in green (FL) and chronic lymphocytic leukemia in blue(CLL).

−40 −20 20 40 −20 −15 −10 −5 5 10 15 20

1st component 2nd component PCA Analysis

DLCL FL CLL

BioMed Central

(page number not for citation purposes)

BMC Bioinformatics

Open Access

Research article

Clustering cancer gene expression data: a comparative study

Marcilio CP de Souto*1,2, Ivan G Costa1,3, Daniel SA de Araujo1,2, Teresa B Ludermir3 and Alexander Schliep1 BMC Bioinformatics

9

(page number not for citation purposes) CLL CLL CLL CLL CLL CLL CLL CLL CLL CLL CLL FL FL DLBCL FL FL FL FL FL FL FL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL

cluster 1 cluster 2 cluster 3

DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL CLL CLL CLL CLL CLL CLL CLL CLL CLL CLL CLL FL FL FL FL FL FL FL DLBCL FL FL DLBCL

hierarchical clustering k-means

a) b)

slide-37
SLIDE 37

ENCODE battles

Open Peer Review Discuss this article

(3) Comments

RESEARCH ARTICLE

A reanalysis of mouse ENCODE comparative gene expression data [version 1; referees: 3 approved, 1 approved with reservations]

Yoav Gilad, Orna Mizrahi-Man

Department of Human Genetics, University of Chicago, Chicago, IL, 60637, USA

Abstract Recently, the Mouse ENCODE Consortium reported that comparative gene expression data from human and mouse tend to cluster more by species rather than by tissue. This observation was surprising, as it contradicted much of the comparative gene regulatory data collected previously, as well as the common notion that major developmental pathways are highly conserved across a wide range of species, in particular across mammals. Here we show that the Mouse ENCODE gene expression data were collected using a flawed study design, which confounded sequencing batch (namely, the assignment of samples to sequencing flowcells and lanes) with species. When we account for the batch effect, the corrected comparative gene expression data from human and mouse tend to cluster by tissue, not by species.

This article is included in the Preclinical gateway. Reproducibility and Robustness

Yoav Gilad ( ) Corresponding author: gilad@uchicago.edu The authors have no conflicts of interest or competing interests to disclose. Competing interests: Gilad Y and Mizrahi-Man O. How to cite this article: A reanalysis of mouse ENCODE comparative gene expression data [version 1; 2015, :121 (doi: ) referees: 3 approved, 1 approved with reservations] F1000Research 4 10.12688/f1000research.6536.1 © 2015 Gilad Y and Mizrahi-Man O. This is an open access article distributed under the terms of the Copyright: Creative Commons Attribution , which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Data associated Licence with the article are available under the terms of the (CC0 1.0 Public domain dedication). Creative Commons Zero "No rights reserved" data waiver This work was supported by NIH grant MH084703. Grant information: The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. 19 May 2015, :121 (doi: ) First published: 4 10.12688/f1000research.6536.1

Referee Status:

Invited Referees published 19 May 2015

1 2 3 4

report report report report , Harvard School of Public Rafael Irizarry Health, USA 1 , University of California, Michael Eisen Berkeley, USA 2 , University of Edinburgh, UK Mick Watson 3 , University of California, Lior Pachter Berkeley, USA 4 19 May 2015, :121 (doi: ) First published: 4 10.12688/f1000research.6536.1 19 May 2015, :121 (doi: ) Latest published: 4 10.12688/f1000research.6536.1

v1

Page 1 of 37 F1000Research 2015, 4:121 Last updated: 08 NOV 2017

Comparison of the transcriptional landscapes between human and mouse tissues

Shin Lina,b,1, Yiing Linc,1, Joseph R. Neryd, Mark A. Urichd, Alessandra Breschie,f, Carrie A. Davisg, Alexander Dobing, Christopher Zaleskig, Michael A. Beerh, William C. Chapmanc, Thomas R. Gingerasg,i, Joseph R. Eckerd,j,2, and Michael P. Snydera,2

aDepartment of Genetics, Stanford University, Stanford, CA 94305; bDivision of Cardiovascular Medicine, Stanford University, Stanford, CA 94305; cDepartment of Surgery, Washington University School of Medicine, St. Louis, MO 63110; dGenomic Analysis Laboratory, The Salk Institute for Biological Studies, La Jolla, CA 92037; eCentre for Genomic Regulation and UPF, Catalonia, 08003 Barcelona, Spain; fDepartament de Ciències Experimentals i de la Salut, Universitat Pompeu Fabra, 08003 Barcelona, Spain; gFunctional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11742; hMcKusick-Nathans Institute of Genetic Medicine and the Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21205; iAffymetrix, Inc., Santa Clara, CA 95051; and jHoward Hughes Medical Institute, The Salk Institute for Biological Studies, La Jolla, CA 92037 Contributed by Joseph R. Ecker, July 23, 2014 (sent for review May 23, 2014)
slide-38
SLIDE 38

Original findings: Clusters by species

Page 4 of 37 F1000Research 2015, 4:121 Last updated: 08 NOV 2017

slide-39
SLIDE 39

Confounding study design means results dominated by batch effects

Page 3 of 37 F1000Research 2015, 4:121 Last updated: 08 NOV 2017
slide-40
SLIDE 40

Accounting for batch effects changes the story

Page 5 of 37 F1000Research 2015, 4:121 Last updated: 08 NOV 2017

slide-41
SLIDE 41

Today's very believable GWAS

Genome-wide analyses for personality traits identify six genomic loci and show correlations with psychiatric disorders

Min-Tzu Lo1, David A Hinds2, Joyce Y Tung2, Carol Franz3, Chun-Chieh Fan1,4, Yunpeng Wang5–7, Olav B Smeland6,7, Andrew Schork1,4, Dominic Holland5, Karolina Kauppi1,8, Nilotpal Sanyal1, Valentina Escott-Price9, Daniel J Smith10, Michael O’Donovan9, Hreinn Stefansson11, Gyda Bjornsdottir11, Thorgeir E Thorgeirsson11, Kari Stefansson11, Linda K McEvoy1, Anders M Dale1,3,5, Ole A Andreassen6,7 & Chi-Hua Chen1

L E T T E R S

Conscientiousness 9 8 7 6 5 3 4 2 Openness –log10(P) Chromosome 1 2 3 4 5 6 7 8 9 10 11 12 13 21 19 17 16 15 14
  • 8
7 6 5 4 3 2 Chromosome Extraversion –log10(P) Agreeableness rs57590327 rs2164273 rs6481128 rs1426371 rs7498702 12 10 8 6 4 2 8 –log10(P) 1 2 3 4 5 6 7 8 9 10 11 12 13 21 19 17 16 15 14 Chromosome 1 2 3 4 5 6 7 8 9 10 11 12 13 21 19 17 16 15 14
  • Conscientiousness
rs3814424 9 8 7 6 5 4 3 2 –log10(P) Neuroticism rs6981523 rs9611519 12 10 8 6 4 2 –log10(P) Chromosome 1 2 3 4 5 6 7 8 9 10 11 12 13 21 19 17 16 15 14 Chromosome 1 2 3 4 5 6 7 8 9 10 11 12 13 21 19 17 16 15 14
slide-42
SLIDE 42

PCA loadings from GWAS

D m e n d t d c e s y l l e

10 11 9 8 7 6 5 4 3 2 1

0.8 0.6 0.4 0.2 –0.2 –0.4 –0.6 –0.8 Principal component 2 (19% of total genetic variance) –0.9 –0.7 –0.5 –0.3 –0.1 0.1 0.3 0.5 0.7 0.9 Principal component 1 (25% of total genetic variance)

ADHD Bipolar disorder Schizophrenia Anorexia nervosa Autism Major depression Extraversion Agreeableness Openness Conscientiousness Neuroticism

II I III IV

b

  • d

e f h h

  • n

e d m e c r D m

Agreeableness Conscientiousness Extraversion Neuroticism Openness to experience Schizophrenia Bipolar disorder Major depression ADHD Autism spectrum disorder Anorexia nervosa

Genetic correlation 1.0 0.5 –0.5 –1.0

1.00 0.23** 0.23** 0.22** –0.40** 0.11 –0.03 0.07 –0.29* 0.08 –0.24* –0.03 1.00 0.15* –0.18* –0.19* –0.13* –0.18* –0.28* –0.10 –0.21* 0.01 –0.05 –0.04 0.30* 0.02 0.18* –0.01 0.34** –0.35** 1.00 0.15* 0.22** –0.40** –0.18* –0.35** 1.00 –0.15* 0.14 –0.01 0.56** 0.06 0.10 0.15** 0.09 0.12 0.11 0.21** 0.19 –0.03 0.28* 0.47** 0.52** 1.00 –0.12 0.12 0.15 0.07 0.16* 0.17 0.12 –0.12 1.00 –0.10 –0.10 –0.06 0.06 1.00 0.06 1.00

10 11

0.34** 0.36** 1.00 –0.15* 0.34** –0.19* 0.11 –0.03 –0.13* –0.01 0.14 0.36** 1.00 0.65** 0.07 –0.18* 0.18* –0.01 0.34** 0.65** 1.00 0.52** 0.47** 0.28* 0.56** 0.02 –0.28* –0.29* 0.08 –0.10 0.30* 0.06 0.19 –0.03 0.15 –0.24* –0.21* –0.04 0.10 0.12 0.11 0.07 –0.06 0.17 0.16*

0.21**

0.09 0.15** –0.05 0.01 –0.03

9 8 7 6 5 4 3 2 1

a

slide-43
SLIDE 43

One of the coolest papers

LETTERS

Genes mirror geography within Europe

John Novembre1,2, Toby Johnson4,5,6, Katarzyna Bryc7, Zolta ´n Kutalik4,6, Adam R. Boyko7, Adam Auton7, Amit Indap7, Karen S. King8, Sven Bergmann4,6, Matthew R. Nelson8, Matthew Stephens2,3 & Carlos D. Bustamante7 Vol 456 |6 November 2008 |doi:10.1038/nature07331

PC1

a

PC2

a

400 km

b

1.0

PC1 accounts of 0.3% of the variation

slide-44
SLIDE 44

Ancient DNA

LETTER

doi:10.1038/nature19844

Genomic insights into the peopling of the Southwest Pacific

Pontus Skoglund1,2,3, Cosimo Posth4,5, Kendra Sirak6,7, Matthew Spriggs8,9, Frederique Valentin10, Stuart Bedford9,11, Geoffrey R. Clark11, Christian Reepmeyer12, Fiona Petchey13, Daniel Fernandes6,14, Qiaomei Fu1,15,16, Eadaoin Harney1,2, Mark Lipson1, Swapan Mallick1,2, Mario Novak6,17, Nadin Rohland1, Kristin Stewardson1,2,18, Syafiq Abdullah19, Murray P . Cox20, Françoise R. Friedlaender21, Jonathan S. Friedlaender22, Toomas Kivisild23,24, George Koki25, Pradiptajati Kusuma26, D. Andrew Merriwether27, Francois-X. Ricaut28, Joseph T. S. Wee29, Nick Patterson2, Johannes Krause5, Ron Pinhasi6§ & David Reich1,2,18§ was 2b), We nia na-
  • te
the uals rac- ing ria- .86, and for ian s in ple ults
  • te
ain biq- me ing ents tely ing ata the ed a , all that d of was
  • ds
lder ient a b c d e Early Lapita 96 60 277 214 66% 74 Yoruba 46 16 Altai 34% 98% 2% 138 Denisovan 64 3% 77 41 4 6 Dai 5 55 Onge 93 1 Atayal 26 97% 29 Kankanaey 22 6 7 75% 59 1 5 25% Tongan 11 Tonga Vanuatu Taiwan Phillippines Bismarck archipelago New Guinea Australia First people in Remote Oceania ~3,000 BP Mixture near New Guinea Later gene fmow brings Papuan ancestry to Remote Oceania First people in Near Oceania >40,000 BP (Taiwan) (Philippines) New Guinea Highlanders Lapita Vanuatu Lapita Tonga Figure 3 | A model of population history. a, A model of population relationships that fits allele frequency patterns (all empirical f-statistics

Lapita Remote Oceania N e a r O c e a n i a East Asia Vanuatu Remote Oceania Near Oceania Tonga Ancient Lapita 100 120 140 160 180 –20 –10 10 20 30 40 Longitude (º) Latitude (º)

a

–0.04 –0.02 0.00 0.02 0.04 −0.05 0.00 0.05 PC1 PC2

b

slide-45
SLIDE 45

Honeybee gene expression

Honeybees show division of labor (common in social insects)

  • Young worker bees care for broods ("nursing") and transition to foraging at 2-

3 weeks

  • Change is hormonally determined

Studied 72 bees with 108 microarrays = high dimensional data

Genomic dissection of behavioral maturation in the honey bee

Charles W. Whitfield*†‡, Yehuda Ben-Shahar§¶, Charles Brillet, Isabelle Leoncini, Didier Crauser, Yves LeConte, Sandra Rodriguez-Zas*†‡**, and Gene E. Robinson*†‡††

Departments of *Entomology and **Animal Science, †Neuroscience Program, and ‡Institute for Genomic Biology, University of Illinois at Urbana–Champaign, Urbana, IL 61801; §Howard Hughes Medical Institute, ¶University of Iowa College of Medicine, Iowa City, IA 52242; and Laboratoire Biologie et Protection de l’Abeille, Ecologie des Inverte ´bre ´s, Unite ´ Mixte de Recherche, Institut National de la Recherche AgronomiqueUniversite ´ d’Avignon et des Pays de Vaucluse, Site Agroparc, Domaine Saint-Paul, 84914 Avignon Cedex 9, France This contribution is part of the special series of Inaugural Articles by members of the National Academy of Sciences elected on May 3, 2005.
slide-46
SLIDE 46

PCA on gene expression

early forager late forager

slide-47
SLIDE 47

People love social insects

Wasp Gene Expression Supports an Evolutionary Link Between Maternal Behavior and Eusociality

Amy L. Toth,1* Kranthi Varala,2 Thomas C. Newman,1 Fernando E. Miguez,2 Stephen K. Hutchison,3 David A. Willoughby,3 Jan Fredrik Simons,3 Michael Egholm,3 James H. Hunt,4 Matthew E. Hudson,2 Gene E. Robinson1,5 The presence of workers that forgo reproduction and care for their siblings is a defining feature of

a- ney putative similarity melanogas-

  • f

anal- ess s er