CS-5630 / CS-6630 Visualization for Data Science Set Visualization - - PowerPoint PPT Presentation



slide-1
SLIDE 1

CS-5630 / CS-6630 Visualization for Data Science Set Visualization

Alexander Lex alex@sci.utah.edu

[xkcd]

slide-2
SLIDE 2

Design Workshop

slide-3
SLIDE 3

item1 : A
item2 : A
item3 : A, B
item4 : A, C
item5 : A, B, C
item6 : B
item7 : B, C
item8 : C
…

Venn diagram of sets A, B, C

slide-4
SLIDE 4

LETTER

doi:10.1038/nature11241

The banana (Musa acuminata) genome and the evolution of monocotyledonous plants

Angélique D’Hont1*, France Denoeud2,3,4*, Jean-Marc Aury2, Franc-Christophe Baurens1, Françoise Carreel1,5, Olivier Garsmeur1, Benjamin Noel2, Stéphanie Bocs1, Gaëtan Droc1, Mathieu Rouard6, Corinne Da Silva2, Kamel Jabbari2,3,4, Céline Cardi1, Julie Poulain2, Marlène Souquet1, Karine Labadie2, Cyril Jourda1, Juliette Lengellé1, Marguerite Rodier-Goud1, Adriana Alberti2, Maria Bernard2, Margot Correa2, Saravanaraj Ayyampalayam7, Michael R. Mckain7, Jim Leebens-Mack7, Diane Burgess8, Mike Freeling8, Didier Mbéguié-A-Mbéguié9, Matthieu Chabannes5, Thomas Wicker10, Olivier Panaud11, Jose Barbosa11, Eva Hribova12, Pat Heslop-Harrison13, Rémy Habas5, Ronan Rivallan1, Philippe Francois1, Claire Poiron1, Andrzej Kilian14, Dheema Burthia1, Christophe Jenny1, Frédéric Bakry1, Spencer Brown15, Valentin Guignon1,6, Gert Kema16, Miguel Dita19, Cees Waalwijk16, Steeve Joseph1, Anne Dievart1, Olivier Jaillon2,3,4, Julie Leclercq1, Xavier Argout1, Eric Lyons17, Ana Almeida8, Mouna Jeridi1, Jaroslav Dolezel12, Nicolas Roux6, Ange-Marie Risterucci1, Jean Weissenbach2,3,4, Manuel Ruiz1, Jean-Christophe Glaszmann1, Francis Quétier18, Nabila Yahiaoui1 & Patrick Wincker2,3,4

Bananas (Musa spp.), including dessert and cooking types, are giant perennial monocotyledonous herbs of the order Zingiberales, a sister group to the well-studied Poales, which include cereals. Bananas are vital for food security in many tropical and subtropical countries and the most popular fruit in industrialized countries1. The Musa domestication process started some 7,000 years ago in Southeast Asia. It involved hybridizations between diverse species and subspecies, fostered by human migrations2, and selection of diploid and triploid seedless, parthenocarpic hybrids thereafter widely dispersed by vegetative propagation. Half of the current production relies on somaclones derived from a single triploid genotype (Cavendish)1. Pests and diseases have gradually become adapted, representing an imminent danger for global banana production3,4. Here we describe the draft sequence of the 523-megabase genome of a Musa acuminata doubled-haploid genotype, providing a crucial stepping-stone for genetic improvement of banana. We detected three rounds of whole-genome duplications in the Musa lineage, independently of those previously described in the Poales lineage and the one we detected in the Arecales lineage. This first monocotyledon high-continuity whole-genome sequence reported outside Poales represents an essential bridge for comparative genome analysis in plants. As such, it clarifies commelinid- [...]

[...] sequence errors. The assembly consisted of 24,425 contigs and 7,513 scaffolds with a total length of 472.2 Mb, which represented 90% of the estimated DH-Pahang genome size. Ninety per cent of the assembly was in 647 scaffolds, and the N50 (the scaffold size above which 50% of the total length of the sequence assembly can be found) was 1.3 Mb (Supplementary Text and Supplementary Tables 1–3). We anchored 70% of the assembly (332 Mb) along the 11 Musa linkage groups of the Pahang genetic map. This corresponded to 258 scaffolds and included 98.0% of the scaffolds larger than 1 Mb and 92% of the annotated genes (Supplementary Text, Supplementary Table 4 and Supplementary Fig. 1). We identified 36,542 protein-coding gene models in the Musa genome (Supplementary Tables 1 and 5). A total of 235 microRNAs from 37 families were identified, including only one of the eight microRNA gene (MIR) families found so far solely in Poaceae8 (Supplementary Tables 6 and 7). Viral sequences related to the banana streak virus (BSV) dsDNA plant pararetrovirus were found to be integrated in the Pahang genome, with 24 loci spanning 10 chromosomes (Supplementary Text and Supplementary Fig. 2). They belonged to a badnavirus phylogenetic group that differed from the endogenous BSV species (eBSV) found in M. balbisiana9 and most of them formed a new

Nature 2012

slide-5
SLIDE 5
slide-6
SLIDE 6
slide-7
SLIDE 7

[D’Hont et al., Nature, 2012] [Wiles et al., BMC Systems Biology] [Neale et al., BMC Genome Biology, 2014] [Gibbs et al., Nature, 2004]

slide-8
SLIDE 8

What are some questions we’d like to ask?

slide-9
SLIDE 9

Design Workshop

  • Work in groups
  • Get to know the data (5 mins)
  • Create two (rapid!) prototypes (2x5 mins)
  • Write up your two favorites (5 mins)
  • Upload to “Bonus” Canvas Dropbox by EOD

slide-10
SLIDE 10
  • 1. What is the biggest intersection?
  • 2. Which sets make up an intersection?
  • 3. How big is an intersection?
  • 4. Does it work for more than four sets?
  • 5. Does attribute value correlate with intersection?

Tip: Don’t always try to show all individuals

slide-11
SLIDE 11
slide-12
SLIDE 12

Venn and Euler Diagrams

slide-13
SLIDE 13

Venn vs Euler

Venn Diagram: shows all possible logical relations between sets (even if empty)

Euler Diagram: shows logical relations; may omit empty intersections

slide-14
SLIDE 14

Venn Diagrams

Venn diagrams for many sets are hard: the number of intersections is 2^n for n sets.

https://en.wikipedia.org/wiki/Venn_diagram
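The 2^n growth can be checked with a short sketch. It is an illustration rather than part of the slides: the function name `membership_combinations` is made up here, and each combination is one in/out pattern over the sets, including the pattern that lies outside every set.

```python
from itertools import product

def membership_combinations(set_names):
    """Return every in/out combination of the given sets as a tuple of booleans."""
    return list(product([False, True], repeat=len(set_names)))

# Three sets A, B, C give 2^3 regions in a Venn diagram,
# counting the all-False region outside every set.
combos = membership_combinations(["A", "B", "C"])

# Number of regions for 1 to 6 sets: the count doubles with every added set.
region_counts = {n: len(membership_combinations(range(n))) for n in range(1, 7)}
```

With six sets a Venn diagram already has to draw 64 distinct regions, which is why the technique stops being readable so quickly.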

slide-15
SLIDE 15

Area-Proportional Euler Diagrams

Problem with Venn: size doesn’t correspond to the data. Creating area-proportional Euler diagrams is hard. Layout criteria:

  • area proportional
  • simple curves (circles are best)
  • makes it easy to identify which sets are participating in an intersection
  • Gestalt principle: good continuation

[Alsallakh 2015]
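For two sets the area-proportional layout is still tractable, and a minimal sketch shows what has to be solved. This is not the method used by any tool cited in the slides; the function names and the example sizes (100, 100, 30) are assumptions for illustration. Each circle's area equals its set size, and bisection finds the center distance at which the lens-shaped overlap equals the intersection size.

```python
import math

def lens_area(r1, r2, d):
    """Area of the overlap of two circles with radii r1, r2 and center distance d."""
    if d >= r1 + r2:
        return 0.0                              # circles are disjoint
    if d <= abs(r1 - r2):
        return math.pi * min(r1, r2) ** 2       # one circle lies inside the other
    # Clamp arguments to guard against floating-point drift at the boundaries.
    c1 = max(-1.0, min(1.0, (d * d + r1 * r1 - r2 * r2) / (2 * d * r1)))
    c2 = max(-1.0, min(1.0, (d * d + r2 * r2 - r1 * r1) / (2 * d * r2)))
    kite = max(0.0, (-d + r1 + r2) * (d + r1 - r2) * (d - r1 + r2) * (d + r1 + r2))
    return r1 * r1 * math.acos(c1) + r2 * r2 * math.acos(c2) - 0.5 * math.sqrt(kite)

def two_circle_layout(size_a, size_b, size_ab, tol=1e-9):
    """Return (r_a, r_b, distance) so circle and overlap areas match the sizes.

    size_a and size_b are total set sizes (intersection included).
    """
    if size_ab > min(size_a, size_b):
        raise ValueError("intersection cannot exceed the smaller set")
    r_a = math.sqrt(size_a / math.pi)
    r_b = math.sqrt(size_b / math.pi)
    lo, hi = abs(r_a - r_b), r_a + r_b
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if lens_area(r_a, r_b, mid) > size_ab:
            lo = mid        # overlap too large: move the circles apart
        else:
            hi = mid        # overlap too small: move the circles together
    return r_a, r_b, (lo + hi) / 2

r_a, r_b, dist = two_circle_layout(100, 100, 30)
```

With three or more sets, circles generally cannot satisfy all region sizes at once, which is one reason the general problem is hard.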

slide-16
SLIDE 16

Compare Simple vs Complex Shape

Complex Simple

slide-17
SLIDE 17

[created with EulerAPE]

slide-18
SLIDE 18

[Figure: Euler diagram with region values 22, 19, 44, 43, 41, 19, 9, 22, 5, asking which areas are larger (> or <); created with EulerAPE]

slide-19
SLIDE 19

Venn-Euler Pros/Cons

Pros
  • Familiar
  • Intuitive
  • Work well for 2-4 sets

Cons
  • Don’t work well for more than 4 sets
  • Area proportionality is hard to do
  • Not well suited to show attributes

slide-20
SLIDE 20

Relationships for specific Items

[Riche 2010]

  • No duplicate nodes: complex shapes (notice the nesting)
  • Duplicate nodes: simple shapes

slide-21
SLIDE 21
slide-22
SLIDE 22

Sets on top of a fixed layout

https://www.youtube.com/watch?v=Ju2hSThmPWA

slide-23
SLIDE 23

Sets on top of a fixed layout

  • LineSets [Alper 2011]
  • Kelp Diagrams [Dinkla 2012]

slide-24
SLIDE 24

Node-Link Techniques

  • Treat sets as nodes
  • Connect to elements that are in set

http://mariandoerk.de/pivotpaths/demo/#/1:0_497686
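As a rough sketch of the underlying data structure, the node-link view is a bipartite graph with one node per set, one node per element, and an edge for every membership. The data below reuses the item1 to item8 example from slide 3; the function name `bipartite_edges` is an assumption for illustration and is not taken from PivotPaths.

```python
# Set memberships from the slide 3 example.
memberships = {
    "item1": {"A"}, "item2": {"A"}, "item3": {"A", "B"}, "item4": {"A", "C"},
    "item5": {"A", "B", "C"}, "item6": {"B"}, "item7": {"B", "C"}, "item8": {"C"},
}

def bipartite_edges(memberships):
    """Return (set nodes, element nodes, edges) of the set/element bipartite graph."""
    set_nodes = sorted({s for sets in memberships.values() for s in sets})
    element_nodes = sorted(memberships)
    # One edge per (element, set) membership.
    edges = [(e, s) for e in element_nodes for s in sorted(memberships[e])]
    return set_nodes, element_nodes, edges

set_nodes, element_nodes, edges = bipartite_edges(memberships)
```

The number of edges equals the total number of memberships, so elements in many sets add clutter quickly.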

slide-25
SLIDE 25
slide-26
SLIDE 26

Showing Pairwise Overlap

  • Doesn’t show higher-order overlaps
  • Very scalable
  • Can’t show attributes

Example: co-mutations of genes
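A small sketch of the pairwise overlap matrix makes the first limitation concrete. The sets are the slide 3 example rewritten per set; the function name `pairwise_overlap` is an assumption for illustration.

```python
from itertools import combinations

# The slide 3 example, listed per set instead of per item.
sets = {
    "A": {"item1", "item2", "item3", "item4", "item5"},
    "B": {"item3", "item5", "item6", "item7"},
    "C": {"item4", "item5", "item7", "item8"},
}

def pairwise_overlap(sets):
    """Return a symmetric matrix (dict of dicts) of pairwise intersection sizes."""
    names = sorted(sets)
    matrix = {a: {b: 0 for b in names} for a in names}
    for a in names:
        matrix[a][a] = len(sets[a])            # diagonal holds the set size
    for a, b in combinations(names, 2):
        overlap = len(sets[a] & sets[b])
        matrix[a][b] = matrix[b][a] = overlap
    return matrix

overlap = pairwise_overlap(sets)
```

item5 belongs to all three sets and is counted in every off-diagonal cell, but nothing in the matrix reveals that a three-way overlap exists.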

slide-27
SLIDE 27

Set Matrices: OnSet

  • Set membership for each item shown in matrix
  • Comparisons can be made using AND or OR operations
  • Good for many sets and few items

https://vimeo.com/213029678#at=0

[Sadana 14]
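The AND/OR comparison can be sketched with one boolean row per set over a fixed item order. This only illustrates the idea and is not OnSet's implementation; the names `membership_row` and `combine` are assumptions, and the data is the slide 3 example.

```python
items = ["item1", "item2", "item3", "item4", "item5", "item6", "item7", "item8"]
set_a = {"item1", "item2", "item3", "item4", "item5"}
set_b = {"item3", "item5", "item6", "item7"}

def membership_row(members, items):
    """One boolean cell per item: True if the item belongs to the set."""
    return [item in members for item in items]

def combine(row_x, row_y, op):
    """Combine two membership rows cell by cell with AND or OR."""
    if op == "AND":
        return [x and y for x, y in zip(row_x, row_y)]
    if op == "OR":
        return [x or y for x, y in zip(row_x, row_y)]
    raise ValueError(f"unsupported operation: {op}")

row_a = membership_row(set_a, items)
row_b = membership_row(set_b, items)

# Items in both sets, and items in at least one of them.
in_both = [i for i, hit in zip(items, combine(row_a, row_b, "AND")) if hit]
in_either = [i for i, hit in zip(items, combine(row_a, row_b, "OR")) if hit]
```

Because every item occupies a fixed cell, the rows grow with the number of items, which fits the slide's note that the approach suits many sets and few items.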

slide-28
SLIDE 28

Linear Diagrams

[Rodgers 2015]

slide-29
SLIDE 29
slide-30
SLIDE 30

Radial Sets

  • Sets are segments on a “circle”
  • Relationships are encoded as ribbons
  • Size of segments encodes size of sets
  • Histograms in segments show degrees

https://www.youtube.com/watch?v=UcYRrPqC5A8

[Alsallakh 2013]
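The per-segment histograms can be sketched as follows, taking an element's degree to be the number of sets it belongs to. The data is the slide 3 example and the function name `degree_histograms` is an assumption for illustration.

```python
from collections import Counter

memberships = {
    "item1": {"A"}, "item2": {"A"}, "item3": {"A", "B"}, "item4": {"A", "C"},
    "item5": {"A", "B", "C"}, "item6": {"B"}, "item7": {"B", "C"}, "item8": {"C"},
}

def degree_histograms(memberships):
    """For each set, count its elements by degree (number of sets they belong to)."""
    histograms = {}
    for sets in memberships.values():
        degree = len(sets)
        for s in sets:
            histograms.setdefault(s, Counter())[degree] += 1
    return histograms

histograms = degree_histograms(memberships)
```

For set A this gives two elements of degree 1, two of degree 2, and one of degree 3; each segment of the circle would draw one such histogram.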

slide-31
SLIDE 31
slide-32
SLIDE 32

UpSet: Visualizing Intersecting Sets

[InfoVis’14]

slide-33
SLIDE 33

Set Vis Goals

  • 1. Efficient visual encoding
  • 2. Creating complex slices of a dataset
  • 3. Visualize attributes

slide-34
SLIDE 34

  • Visualizing Intersections
  • Visualizing Properties
  • Attribute Details
  • Element List & Queries

[Movie Lens Dataset]

slide-35
SLIDE 35

Visualizing Intersections

slide-36
SLIDE 36

[Figure: sets A, B, C inside the universal set, next to matrix columns A, B, C]

slide-37
SLIDE 37

[Figure: sets A, B, C inside the universal set, next to matrix columns A, B, C whose cells mark sets as “Must” or “Must Not”]

slide-38
SLIDE 38

A B C

Cardinality

5 17 7 10 14 20 7 5
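A minimal sketch of how such rows and their cardinalities can be derived, assuming each row is an exclusive intersection: elements that must be in some sets and must not be in the others. It uses the slide 3 example rather than the numbers above (whose mapping to rows is not recoverable here), and the function names are assumptions for illustration.

```python
from collections import Counter

memberships = {
    "item1": {"A"}, "item2": {"A"}, "item3": {"A", "B"}, "item4": {"A", "C"},
    "item5": {"A", "B", "C"}, "item6": {"B"}, "item7": {"B", "C"}, "item8": {"C"},
}

def matches(item_sets, must, must_not):
    """True if the item is in every 'must' set and in no 'must not' set."""
    return set(must) <= item_sets and not (set(must_not) & item_sets)

def exclusive_cardinalities(memberships):
    """Count items per exact combination of sets (one matrix row each)."""
    return Counter(frozenset(sets) for sets in memberships.values())

# Row "A only": must be in A, must not be in B or C.
a_only = [i for i, s in memberships.items() if matches(s, ["A"], ["B", "C"])]

cardinalities = exclusive_cardinalities(memberships)
```

Every item falls into exactly one row, so the cardinalities sum to the number of items.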

slide-39
SLIDE 39

Plotting Attributes

slide-40
SLIDE 40

Additional Plots

  • Deviation: How surprising is the size of an intersection?
  • Attributes: What’s the distribution of an attribute in an intersection?
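One way to quantify "surprising" is to compare an intersection's observed share of all items with the share expected if set memberships were independent. The sketch below assumes that definition; it is not confirmed by the slides, and the function name `deviation` is made up. The data is the slide 3 example.

```python
memberships = {
    "item1": {"A"}, "item2": {"A"}, "item3": {"A", "B"}, "item4": {"A", "C"},
    "item5": {"A", "B", "C"}, "item6": {"B"}, "item7": {"B", "C"}, "item8": {"C"},
}

def deviation(memberships, included):
    """Observed minus expected fraction for the exclusive intersection 'included'.

    Assumption: the expected fraction treats memberships as independent, i.e.
    the product of p(set) for included sets and 1 - p(set) for excluded sets.
    """
    included = set(included)
    n = len(memberships)
    all_sets = sorted({s for sets in memberships.values() for s in sets})
    observed = sum(1 for sets in memberships.values() if sets == included) / n
    expected = 1.0
    for s in all_sets:
        p = sum(1 for sets in memberships.values() if s in sets) / n
        expected *= p if s in included else (1 - p)
    return observed - expected

dev_a_only = deviation(memberships, {"A"})
```

A positive value means the intersection holds more items than independence would predict, a negative value fewer.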

slide-41
SLIDE 41

Action-Comedy, Drama-Comedy

slide-42
SLIDE 42

Sorting

slide-43
SLIDE 43

A B C Which is the biggest intersection? Sort By: Cardinality
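Sorting the rows answers this question directly. The sketch below sorts by cardinality and, as an assumed alternative, by degree (the number of sets in the intersection); the cardinalities are those of the slide 3 example and the variable names are illustrative.

```python
# Exclusive intersections of the slide 3 example and their cardinalities.
intersections = {
    ("A",): 2, ("B",): 1, ("C",): 1,
    ("A", "B"): 1, ("A", "C"): 1, ("B", "C"): 1,
    ("A", "B", "C"): 1,
}

# Largest intersection first; ties are broken by set names for a stable order.
by_cardinality = sorted(intersections.items(), key=lambda kv: (-kv[1], kv[0]))

# Fewest participating sets first, then largest first within each degree.
by_degree = sorted(intersections.items(), key=lambda kv: (len(kv[0]), -kv[1], kv[0]))

biggest_sets, biggest_size = by_cardinality[0]
```

After sorting by cardinality the biggest intersection is simply the first row, with no need to compare areas by eye.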

slide-44
SLIDE 44
slide-45
SLIDE 45

Elements & Attributes

slide-46
SLIDE 46

How do documentaries compare to adventure movies?

slide-47
SLIDE 47

How do documentaries compare to adventure movies?

slide-48
SLIDE 48

Applications

slide-49
SLIDE 49

R-Version: UpSetR

  • Developed at HMS
  • Some design adaptations

slide-50
SLIDE 50

The Banana Chart Redesigned

slide-51
SLIDE 51
slide-52
SLIDE 52

Other Options

http://setviz.net