Data visualization strategies and tools for microbial genomic epidemiology Anamaria Crisan Vanier Canada Scholar & UBC Public Scholar PhD Candidate, Computer Science University of British Columbia @amcrisan acrisan@cs.ubc.ca


slide-1
SLIDE 1

Data visualization strategies and tools for microbial genomic epidemiology

@amcrisan http://cs.ubc.ca/~acrisan acrisan@cs.ubc.ca

Anamaria Crisan

Vanier Canada Scholar & UBC Public Scholar

PhD Candidate, Computer Science University of British Columbia

slide-2
SLIDE 2

What we’ll talk about

slide-3
SLIDE 3

Part I: Data Visualization Strategies & Tools
Part II: A brief (5 min) activity
Part III: Data Visualization Research in Practice

slide-4
SLIDE 4

Part I: Data Visualization Strategies & Tools
Part II: A brief (5 min) activity
Part III: Data Visualization Research in Practice

slide-5
SLIDE 5

Part I

slide-6
SLIDE 6

Master of Science (Bioinformatics)
PhD (Computer Science)
GenomeDX Biosciences
British Columbia Centre for Disease Control

2008 2010 2013 2015

PhD Candidate, Computer Science, University of British Columbia

slide-7
SLIDE 7

What we’ll talk about

slide-8
SLIDE 8

Why should we visualize data?
How should we visualize data?
What datavis tools are available?

slide-9
SLIDE 9

Why should we visualize data?

slide-10
SLIDE 10

Translating Numbers to Words

http://bit.ly/1FxtT2z

It is not always easy to reason consistently with numbers

slide-11
SLIDE 11

Data Visualization is a Powerful Medium

Probability (60%) < Frequency (6 in 10) < Visualization

Least Understandable → Most Understandable

Whiting (2015) “How well do health professionals interpret diagnostic information? A systematic review”

slide-12
SLIDE 12

Role of data visualization in the current paradigm of scientific research = Communication

slide-13
SLIDE 13

Do you have a research problem?
Yes. / No.
But eventually you’ll have a problem, right?
Duh.
Do all the Science!
Inform the public!

https://www.ratbotcomics.com/comics/pgrc_2014/1/1.html

slide-14
SLIDE 14

Do you have a research problem?
Yes. / No.
But eventually you’ll have a problem, right?
Duh.
Do all the Science!
Inform the public!
Maybe data visualization? Infographics are pretty

slide-15
SLIDE 15

Do you have a research problem?
Yes. / No.
But eventually you’ll have a problem, right?
Duh.
Do all the Science!
Inform the public!
Maybe data visualization? Infographics are pretty
Did it work?

slide-16
SLIDE 16

Do you have a research problem?
Yes. / No.
But eventually you’ll have a problem, right?
Duh.
Do all the Science!
Inform the public!
Maybe data visualization?
Did it work?
No : (
Different infographics?

slide-17
SLIDE 17

Do you have a research problem?
Yes. / No.
But eventually you’ll have a problem, right?
Duh.
Do all the Science!
Inform the public!
Maybe data visualization?
Did it work?
No : (
Different infographics?
Yes! (maybe?)
Declare Victory

slide-18
SLIDE 18

Limitation #1 : Missed Opportunity in Exploration

Do all the Science! → Data Visualization! → Inform the public!

Missed Opportunity for Exploration

§ Exploration means looking at your data, trying different analysis methods, and checking for outliers, missing data, etc.

slide-19
SLIDE 19

Autodesk Research (2017). Same Stats, Different Graphs: https://www.autodeskresearch.com/publications/samestats

Same stats, different graphs

Limitation #1 : Missed Opportunity in Exploration

slide-20
SLIDE 20

Autodesk Research (2017). Same Stats, Different Graphs: https://www.autodeskresearch.com/publications/samestats

Same stats, different graphs (Datasaurus)

Limitation #1 : Missed Opportunity in Exploration
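The "same stats, different graphs" point can be reproduced in a few lines. The sketch below does not use the Datasaurus data from the slide; it swaps in Anscombe's quartet, a classic dataset with the same property, because it ships with base R as `anscombe`. The summary names (`mean_x`, `cor_xy`, and so on) are labels chosen for this illustration.

```r
# anscombe (base R, datasets package) holds four x/y pairs: x1..x4 and y1..y4.
summaries <- t(sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y),
    sd_x = sd(x), sd_y = sd(y), cor_xy = cor(x, y))
}))
rownames(summaries) <- paste("set", 1:4)

# The four rows of summary statistics are nearly identical ...
print(round(summaries, 2))

# ... but plotting each set shows four very different shapes.
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i),
       main = paste("Set", i))
}
par(op)
```

Printing the table alone would suggest the four sets are interchangeable; only the plots reveal the curve, the outlier, and the single high-leverage point.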

slide-21
SLIDE 21

Opening up the machine learning black box

Limitation #1 : Missed Opportunity in Exploration

slide-22
SLIDE 22

Limitation #1 : Missed Opportunity in Exploration

Chihuahua or muffin?

Mop or sheep dog?

slide-23
SLIDE 23

Limitation #1 : Missed Opportunity in Exploration

Goodfellow (2014). “Explaining and Harnessing Adversarial Examples”

slide-24
SLIDE 24

Olah (2018). “Building blocks of interpretability” (https://distill.pub/2018/building-blocks/)

Made with : JavaScript

Example : Trying to understand the black box

slide-25
SLIDE 25

Health data are complex to analyze and visualize

slide-26
SLIDE 26

Limitation #2 : Identifying the Appropriate Vis

Selecting the appropriate data visualization is challenging

Data Visualization!

§ True for exploration & communication applications

slide-27
SLIDE 27

Visualization Design ALSO matters

slide-28
SLIDE 28

Baseline Visualization Alternative 1 Alternative 2

Zikmund-Fisher (2013). A demonstration of ''less can be more'' in risk graphics.

Example: Communicating Survival Benefit of Cancer Therapy

slide-29
SLIDE 29

Example: Visualizing Arteries of the Heart for Surgery Planning

Borkin (2011). “Evaluation of Artery Visualizations for Heart Disease Diagnosis”

Made with : Processing

slide-30
SLIDE 30

EXISTING STANDARD: Accuracy 39%
REVISED VISUALIZATION: Accuracy 91%

Borkin (2011). “Evaluation of Artery Visualizations for Heart Disease Diagnosis”

Made with : Processing

Example: Visualizing Arteries of the Heart for Surgery Planning

slide-31
SLIDE 31

There are two aspects of visualizations to think about:

How do you make a visualization? What datavis tools are available?
Is it the appropriate visualization?

How should we visualize data?

slide-32
SLIDE 32

How should we visualize data ?

slide-33
SLIDE 33

Human Perception & Cognition Computer Graphics Data Analysis

Cross Cutting Disciplines in Information Visualization

Visualization Design & Analysis

slide-34
SLIDE 34
  • R. Kosara (EagerEyes) – https://eagereyes.org/basics/encoding-vs-decoding

Encoding and Decoding Information

slide-35
SLIDE 35

Putting it all Together for Visualization Design & Analysis

§ Non-trivial to condense knowledge across all these areas
§ Still an ongoing area of research
§ I will try to convey a simpler intuition about design & analysis

slide-36
SLIDE 36

Guiding Principles for Visualizing your Data

Image Source: Valentin Antonucci via Pexels

slide-37
SLIDE 37

Why? (Motivation)

Why do you need to visualize data? How will you, or others, use the visualization?

Breaking Down a Visualization in Three Questions


slide-38
SLIDE 38

Breaking Down a Visualization in Three Questions

Why? (Motivation)

Why do you need to visualize data? How will you, or others, use the visualization?

What? (Data & Tasks)

What kind of data is being visualized? What tasks are performed with the data?


slide-39
SLIDE 39

People tend to jump to this level and ignore why and what

What? (Data & Tasks)

What kind of data is being visualized? What tasks are performed with the data?

How? (Visual & Interactive Design)

How do you make the visualization? Is it the right visualization?

Why? (Motivation)

Why do you need to visualize data? How will you, or others, use the visualization?

Breaking Down a Visualization in Three Questions


slide-40
SLIDE 40

Design & Evaluation with Three Questions

Why? What? How?

Design Evaluation

Does the visualization address the intended need?
Are you using the right data, or deriving the right data?
Are the visual & interactive choices appropriate for the data and tasks?
Does the visualization support the tasks using that data?
If interactive / computer based, is the visualization easy to use and reliable (i.e., it doesn’t crash all the time)?

slide-41
SLIDE 41

Ideas from the research literature : the nested-model

Why? What? How?

Design Evaluation

  • T. Munzner (2014) – Visualization Design and Analysis
slide-42
SLIDE 42

Steps to Systematic Thinking in Data Visualization

Image Source: Valentin Antonucci via Pexels

slide-43
SLIDE 43

Domain Problem* | Data + Task | Visual + Interaction Design Choices | Algorithm

Infovis (Information Visualization) research advocates an iterative process

  • T. Munzner (2014) – Visualization Design and Analysis

Design Evaluation

Thinking Systematically about Data Visualization

*Domain Problem = Motivation

slide-44
SLIDE 44

An iterative approach to development allows us to get feedback before committing to ineffective design choices

An Iterative Process

slide-45
SLIDE 45
  • 1. Identify a relevant problem that affects you or a group of stakeholders

Domain Problem | Data + Task | Visual + Interaction Design Choices | Algorithm

  • T. Munzner (2014) – Visualization Design and Analysis

Thinking Systematically about Data Visualization

slide-46
SLIDE 46

Public Health Stakeholders

Nurses, Clinicians, Medical Health Officers, Researchers, Community Leaders, Policy Makers, Patients

§ Multidisciplinary decision making teams
§ More data & diverse data types = more informed decision making
§ BUT – different stakeholder abilities to interpret data & different needs

slide-47
SLIDE 47
  • 2. Ask what data stakeholders use (is it available)?
  • 3. Ask what stakeholders do with the data [tasks]

Domain Problem | Data + Task | Visual + Interaction Design Choices | Algorithm

  • T. Munzner (2014) – Visualization Design and Analysis

Thinking Systematically about Data Visualization

slide-48
SLIDE 48

Data - Many Different Types of Data!

  • T. Munzner (2014) – Visualization Design and Analysis
slide-49
SLIDE 49

Data - Don’t Just Visualize the Raw Data!

Original (Raw) Data Derived Data Example Example when this advice is ignored

  • T. Munzner (2014) – Visualization Design and Analysis

XKCD

slide-50
SLIDE 50

Tasks - How People Use the Data

Source : Atlanta CDC

Geographic Overview of Prostate Cancer

§ Useful for epidemiologists and policy makers § Supports surveillance tasks

Individual Prostate Cancer Risk

§ Good for patients and doctors § Supports treatment decision making tasks

Source : http://riskcalc.org/PCPTRC/ (UT San Antonio)

slide-51
SLIDE 51

Tasks - How People Use the Data

  • Tasks can also change how the same data should be visualized
  • Example: representing US electoral college results

Standard Map Cartogram

slide-52
SLIDE 52

Tasks - How People Use the Data

  • Tasks can also change how the same data should be visualized
  • Example: representing US electoral college results

Standard Map Sankey Diagram

slide-53
SLIDE 53

Tasks - How People Use the Data

  • Tasks can also change how the same data should be visualized
  • Example: representing US electoral college results
slide-54
SLIDE 54

Examples from my own research

How can we identify tasks and data?

slide-55
SLIDE 55

My research : making a clinical report for tuberculosis

  • Mixed methods approach to gathering data and tasks

Phases: Discovery | Design | Implement
Stages: Information Gathering | Design & Evaluation | Finalize Design
Activities: Expert Consults, Task & Data Questionnaire, Design Sprint, Design Choice Questionnaire, TB Workflow Map
Data Gathered: Qualitative, Quantitative
Study Design: Exploratory Sequential Model, Embedded Model

MYCOBACTERIUM TUBERCULOSIS GENOME SEQUENCING REPORT
NOT FOR DIAGNOSTIC USE

Patient Name: JOHN DOE
Barcode:
Birth Date: 2000-01-01
Patient ID: 12345678910
Location: SOMEPLACE
Sample Type: SPUTUM
Sample Source: PULMONARY
Sample Date: 2016-12-25
Sample ID: A12345678
Sequenced From: MGIT CULTURED ISOLATE
Reporting Lab: LAB NAME
Report Date/Time: 2017-01-01, 15:36
Requested By: REQUESTER NAME
Requester Contact: REQUESTER@EMAIL.COM

Summary
The specimen was positive for Mycobacterium tuberculosis. It is resistant to isoniazid and rifampin. It belongs to a cluster, suggesting recent transmission.

Organism
The specimen was positive for Mycobacterium tuberculosis, lineage 2.2.1 (East-Asian Beijing).

Drug Susceptibility
Resistance is reported when a high-confidence resistance-conferring mutation is detected. “No mutation detected” does not exclude the possibility of resistance.

    No drug resistance predicted
    Mono-resistance predicted
  • Multi-drug resistance predicted
    Extensive drug resistance predicted

Drug class | Interpretation | Drug | Resistance Gene (Amino Acid Mutation)
First Line | Susceptible | Ethambutol | No mutation detected
First Line | Susceptible | Pyrazinamide | No mutation detected
First Line | Resistant | Isoniazid | katG (S315T)
First Line | Resistant | Rifampin | rpoB (S531L)
Second Line | Susceptible | Streptomycin | No mutation detected
Second Line | Susceptible | Ciprofloxacin | No mutation detected
Second Line | Susceptible | Ofloxacin | No mutation detected
Second Line | Susceptible | Moxifloxacin | No mutation detected
Second Line | Susceptible | Amikacin | No mutation detected
Second Line | Susceptible | Kanamycin | No mutation detected
Second Line | Susceptible | Capreomycin | No mutation detected

Page 1 of 2 | Patient ID: 12345678910 | Date: 2017-01-01 | Location: Someplace
slide-56
SLIDE 56

My research : making a clinical report for tuberculosis

Task groups: DIAGNOSIS TASKS, TREATMENT TASKS, SURVEILLANCE TASKS
Task columns (in order): Diagnose Latent TB, Diagnose Active TB, Reactive vs New Infection, Characterize Transmission Risk, Choose Meds, Choose Tx Duration, Assess Response to Tx, Guide Contact Tracing, Report to Public Health, Define a Cluster, Connect Case to Existing Cluster, Guide Public Health Response

Data | WGS equivalent | Scores per task (blank cells not shown) | TOTAL SCORE
Patient Identifier | Same | 3 3 3 3 3 3 3 2 1 1 1 1 | 26
Sample Collection Date | Same | 3 3 2 3 3 3 3 1 1 1 1 1 | 24
Patient Prior TB Results | Same | 3 2 3 3 3 3 3 1 1 1 1 | 23
Speciation | Speciation | 1 3 2 3 3 3 3 2 1 1 1 1 | 23
Sample Type (sputum, fine needle aspirate etc.) | Same | 2 3 2 3 3 3 3 1 1 1 1 | 22
Culture results | NA | 1 3 2 3 3 3 3 2 1 1 1 | 22
Sample Collection Site (lymph node, lung etc.) | Same | 2 3 2 3 3 3 3 1 1 1 | 21
Acid Fast Bacilli Smear | Speciation | 2 3 2 3 2 3 3 1 1 1 1 | 21
Resistotype | Predicted DST | 2 3 1 3 3 2 2 1 1 1 1 | 19
Phenotypic DST | Predicted DST | 2 3 2 3 3 2 1 1 1 1 | 18
Chest x-ray | NA | 3 3 2 3 2 3 1 | 17
Report Release Date | Same | 2 2 1 2 2 2 2 1 1 1 | 15
Requester IDs | Same | 2 2 2 2 2 2 2 1 | 15
Interpretation or comments from reviewer | Same | 2 2 1 2 2 2 3 1 | 15
Predicted DST | Predicted DST | 2 2 1 3 3 2 1 1 | 15
MIRU-VNTR | SNPs | 2 3 1 1 1 1 1 1 1 1 1 | 13
Cluster Assignment | Same | 2 2 1 1 1 1 1 1 1 1 | 11
SNP/variant distance | SNPs | 1 2 1 1 1 1 1 1 1 1 | 10
Phylogenetic Tree | Same | 2 1 1 1 1 1 1 1 1 | 9
Reviewer ID | Same | 1 1 1 1 1 1 1 1 | 8
TST results | Speciation* | 3 1 1 1 1 | 7
IGRA results | Speciation* | 3 1 1 1 1 | 7
Lab QC | WGS Specific | 1 2 1 1 1 1 | 7
Spoligotype | SNPs | 1 1 1 | 3
RFLP | SNPs | 1 1 1 | 3

Data

Consensus among participants (% agree → category): 3 (>75%), 2 (50% - 75%), 1 (25% - 50%), 0 (<25%)
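The consensus legend maps a percentage of agreeing participants to a 0 to 3 category. A minimal sketch of that binning follows, assuming the bands are <25%, 25% to <50%, 50% to 75%, and >75%; how the exact boundaries 25, 50 and 75 are treated is an assumption, since the slide does not say. The function name `consensus_score` and the example percentages are hypothetical and are not taken from the study.

```r
# Map percent agreement (0-100) to the 0-3 consensus category.
# Boundary handling (>= 25, >= 50, > 75) is an assumption for illustration.
consensus_score <- function(pct_agree) {
  stopifnot(is.numeric(pct_agree), all(pct_agree >= 0 & pct_agree <= 100))
  ifelse(pct_agree > 75, 3L,
         ifelse(pct_agree >= 50, 2L,
                ifelse(pct_agree >= 25, 1L, 0L)))
}

# Made-up agreement percentages for one data item across four tasks.
agreement <- c(diagnose_active_tb = 92, choose_meds = 60,
               define_cluster = 30, guide_response = 10)

scores <- consensus_score(agreement)
print(scores)
```

Scoring each data item this way for every task gives a task-by-data table that can be sorted to show which data items matter for the most tasks.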

slide-57
SLIDE 57

My research : making a clinical report for tuberculosis

MYCOBACTERIUM TUBERCULOSIS GENOME SEQUENCING REPORT

slide-58
SLIDE 58
  • 4. Explore if other visualizations have addressed this problem and set of tasks & data
  • 5. Implement your own solution (remember this includes interaction!)

  • T. Munzner (2014) – Visualization Design and Analysis

Domain Problem | Data + Task | Visual + Interaction Design Choices | Algorithm

Thinking Systematically about Data Visualization

slide-59
SLIDE 59

Mark: Basic Graphical Element (basic building block)

Channel: Controls the appearance of marks

Marks & Channels : Basic Building Blocks

  • T. Munzner (2014) – Visualization Design and Analysis


slide-60
SLIDE 60

Example

Marks Vary in their Effectiveness

Bar Chart: Position (Common Scale)

Pie Chart: Angle & Area

  • J. Heer (2010) – Crowdsourcing Graphical Perception: Using Mechanical Turk ……


slide-61
SLIDE 61

Perception and Cognition Matter Too!

Colour Blind Simulator: http://www.color-blindness.com/coblis-color-blindness-simulator/

Original Visualization Visualization as seen by color blind person

(color blindness (deuteranopia) affects men more often)

slide-62
SLIDE 62

Perception and Cognition Here too!

Colour scales also impact interpretation!

Perceptual research from Liu et al (2018)

Liu et al. (2018) - Somewhere Over the Rainbow: An Empirical Assessment of Quantitative Colormaps
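As a practical follow-up to the two perception slides, the sketch below draws three colour scales side by side using only base R (`grDevices` and `graphics`). It assumes R 4.0 or later for `palette.colors()`. The choice of the Okabe-Ito and viridis palettes is this example's own, not a recommendation taken from the cited study.

```r
# Okabe-Ito: a qualitative palette designed to stay distinguishable under
# common forms of colour vision deficiency.
categorical <- palette.colors(n = 4, palette = "Okabe-Ito")

# viridis: a sequential colormap whose lightness increases steadily.
sequential <- hcl.colors(n = 7, palette = "viridis")

# A rainbow colormap, for comparison.
rainbow_map <- rainbow(7)

# Draw each palette as a strip of swatches.
show_swatches <- function(cols, title) {
  image(matrix(seq_along(cols)), col = cols, axes = FALSE, main = title)
}

op <- par(mfrow = c(3, 1), mar = c(1, 1, 2, 1))
show_swatches(categorical, "Okabe-Ito (categorical)")
show_swatches(sequential, "viridis (sequential)")
show_swatches(rainbow_map, "rainbow (for comparison)")
par(op)
```

Running the saved image through a colour blindness simulator, such as the one linked on the previous slide, is a quick way to check whether the categories remain distinguishable.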

slide-63
SLIDE 63

ggplot(data = mpg, aes(x = displ, y = cty, colour = class)) + geom_point()

Channel: Position
Channel: Colour
Mark: Point

Marks & Channels : ggplot2 example

Note: Generally in ggplot2 aesthetics refer to channels and geoms refer to marks, but there are complex geoms that aren’t simple marks but chart types (e.g. geom_density) and there are aesthetics that have little to do with the visual channels directly (e.g. group)

https://rpubs.com/hadley/ggplot-intro
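To make the mark/channel split concrete, here is a small sketch that keeps the data fixed and changes one design choice at a time. It assumes the ggplot2 package is installed and uses its built-in `mpg` table. The variable names `by_colour`, `by_shape` and `by_box` are made up for this illustration.

```r
library(ggplot2)

# Channel choice 1: drive type (drv) is encoded with the colour channel.
by_colour <- ggplot(mpg, aes(x = displ, y = cty, colour = drv)) +
  geom_point()

# Channel choice 2: same data and same point mark, but drv uses the shape channel.
by_shape <- ggplot(mpg, aes(x = displ, y = cty, shape = drv)) +
  geom_point()

# Geom choice: drv against cty drawn with a box-plot geom,
# which is a chart type rather than a simple mark.
by_box <- ggplot(mpg, aes(x = drv, y = cty)) +
  geom_boxplot()

print(by_colour)
print(by_shape)
print(by_box)
```

Comparing the three plots is a lightweight way to test alternatives before committing to one encoding.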

slide-64
SLIDE 64

Marks & Channels : Tableau example


Marks Channels

slide-65
SLIDE 65

Linking Data to Mark and Channels to Make Visualizations

Data → Marks & Channels → Visualization

slide-66
SLIDE 66

Linking Data to Mark and Channels to Make Visualizations

Chart Chooser

https://bit.ly/2P9zLEW

Data to viz

https://www.data-to-viz.com/

slide-67
SLIDE 67

Examples from my own research

How do people visualize data?

slide-68
SLIDE 68

My research: surveying visualizations in genomic epidemiology

http://gevit.net

Crisan et al. (2018) “A systematic method for surveying data visualizations and a resulting genomic epidemiology visualization typology: GEViT”

OXFORD BIOINFORMATICS

slide-69
SLIDE 69

Examples from my own research

How can we help people visualize data?

slide-70
SLIDE 70

My research: simplifying the creation of data visualizations

# specify individual charts
phyloTree_chart <- specify_base(chart_type = "phylogenetic tree", data = "tree_dat")
epicurve <- specify_base(chart_type = "histogram", data = "tab_dat", x = "month")
map_chart <- specify_base("geographic map", data = "tab_dat", lat = "latitude", long = "longitude")

# specify a combination
color_combo <- specify_combination(combo_type = "color_linked", base_charts = c("phyloTree_chart", "map_chart", "epicurve"), link_by = "country")

# plot the result
plot(color_combo)

slide-71
SLIDE 71

My research: automatic data visualization

# Analyze different data types automatically
harmon_obj <- data_harmonization(tab_dat, tree_dat, genomic_dat, all_spatial)

# Create specifications that compile to minCombinr
component_specs <- get_spec_list(harmon_obj)

# plot the result one view at a time
plot_view(component_specs, view_num = 1)

Preliminary Result

[Figure, panel A: map (longitude, latitude) showing case_count by country (GIN, LBR, SLE). Panel B: bar chart of count by country (GIN, LBR, SLE).]

slide-72
SLIDE 72
  • 4. Explore if other visualizations have addressed this problem and set of tasks
  • 5. Implement your own solution (part or all of that solution could be a new algorithm)

Domain Problem | Data + Task | Visual + Interaction Design Choices | Algorithm

Thinking Systematically about Data Visualization

slide-73
SLIDE 73
  • 6. Test multiple alternatives (including new ones you develop) with stakeholders
  • 7. Gather qualitative & quantitative evaluation data

Domain Problem* | Data + Task | Visual + Interaction Design Choices | Algorithm

Thinking Systematically about Data Visualization

slide-74
SLIDE 74
  • 1. Identify a relevant problem that affects you or a group of stakeholders
  • 2. Ask what data stakeholders use (is it available?)
  • 3. Ask what stakeholders do with the data [tasks]
  • 4. Explore if other visualizations have addressed this problem and set of tasks & data
  • 5. Implement your own solution (vis and/or algorithm)
  • 6. Test multiple alternatives (including new ones you develop) with stakeholders
  • 7. Gather qualitative & quantitative evaluation data

Design Evaluation

Thinking Systematically about Data Visualization

slide-75
SLIDE 75

What datavis tools are available?

slide-76
SLIDE 76

Data Visualization Tools to Get You Started

slide-77
SLIDE 77

Tools & Libraries for data visualization

Lisa Charlotte Rost has an excellent blog post about this: http://bit.ly/2gRGx1J
I am presenting her figures here.

slide-78
SLIDE 78

Tools & Libraries for data visualization

Lisa Charlotte Rost has an excellent blog post about this: http://bit.ly/2gRGx1J

Analysis vs Presentation

slide-79
SLIDE 79

Tools & Libraries for data visualization

Lisa Charlotte Rost has an excellent blog post about this: http://bit.ly/2gRGx1J

Extent of Flexibility

How easy/hard it is to make data visualizations (including custom/novel visualizations)

slide-80
SLIDE 80

Tools & Libraries for data visualization

Lisa Charlotte Rost has an excellent blog post about this: http://bit.ly/2gRGx1J

Static vs Interactive

slide-81
SLIDE 81

Tools & Libraries for data visualization

Lisa Charlotte Rost has an excellent blog post about this: http://bit.ly/2gRGx1J

“There are no perfect tools, just good tools for people with certain goals”

See a detailed table here: http://bit.ly/2DeWPwV

slide-82
SLIDE 82

Tools & Libraries for data visualization

Another take with commonly used tools : https://bit.ly/2SgrOzS

slide-83
SLIDE 83

Don’t forget that pen and paper is an option too!

Dear Data Project (Lupi & Posavec)

slide-84
SLIDE 84

Datavis tools for (Microbial) Genomics

slide-85
SLIDE 85

IGV Browser for all your genomic needs

https://software.broadinstitute.org/software/igv/

slide-86
SLIDE 86

The classic UCSC genome browser

https://genome.ucsc.edu

slide-87
SLIDE 87

GenVisR: Human Genomes in R

https://academic.oup.com/bioinformatics/article/32/19/3012/2196360

slide-88
SLIDE 88

Variant Viewer: Human Genomes

http://www.cs.ubc.ca/labs/imager/tr/2013/VariantView/

slide-89
SLIDE 89

Island Viewer: Microbial Genomics

https://www.pathogenomics.sfu.ca/islandviewer/accession/NZ_CP012358.1/

slide-90
SLIDE 90

Microreact: Microbial Genomics

https://microreact.org/

slide-91
SLIDE 91

GenGIS: Microbial Genomics (Made in Canada!)

http://kiwi.cs.dal.ca/GenGIS/Main_Page

slide-92
SLIDE 92

Nextstrain: Microbial Genomics

https://nextstrain.org/ebola

slide-93
SLIDE 93

Wrapping up

slide-94
SLIDE 94

DATA VISUALIZATION IS NOT JUST AN ART PROJECT

slide-95
SLIDE 95

Key take-aways from this talk

§ Visualizations of data are useful
  § Helpful in instances of low numeracy
  § Can be used in communication and exploration

§ But... visualization design also matters
  § Many different alternatives, important to test

§ It’s possible to think systematically about visualizations
  § Many disciplines cross-cut information visualization research
  § At the minimum think “Why”, “What”, “How”

§ Encode data well so that others can decode it later

§ Data visualization is a research process with open and interesting problems

slide-96
SLIDE 96

Additional Resources

§ Books to consider:
  § Interpretable Machine Learning: https://christophm.github.io/interpretable-ml-book/
  § Making Data Visual: A Practical Guide to Using Visualization for Insight by Danyel Fisher and Miriah Meyer
  § Visualization Design and Analysis by Tamara Munzner (more technical)

§ Online resources:
  § Distill Publication : https://distill.pub/
  § UBC Infovis Resource Page : http://www.cs.ubc.ca/group/infovis/resources.shtml
  § UW Interactive Data Lab : https://medium.com/@uwdata
  § Data stories podcast : http://datastori.es/

§ Inspiration :
  § Information is Beautiful : https://informationisbeautiful.net/
  § Visualization WTF (examples of what not to do) : http://viz.wtf/

slide-97
SLIDE 97

Data visualization strategies and tools for microbial genomic epidemiology

@amcrisan http://cs.ubc.ca/~acrisan acrisan@cs.ubc.ca

Anamaria Crisan

Vanier Canada Scholar & UBC Public Scholar

PhD Candidate, Computer Science University of British Columbia

slide-98
SLIDE 98

Part I: Data Visualization Strategies & Tools
Part II: A brief (5 min) activity
Part III: Data Visualization Research in Practice

slide-99
SLIDE 99

How many ways can we visualize these numbers?

  • In your head, on paper, or on a computer, sketch out as many examples as you can to visualize the following two numbers:

75 37

slide-100
SLIDE 100

How many ways can we visualize these numbers?

  • In your head, on paper, or on a computer, sketch out as many examples as you can to visualize the following two numbers:

75 37

example:

slide-101
SLIDE 101

How many ways can we visualize these numbers?

  • In your head, on paper, or on a computer, sketch out as many examples as you can to visualize the following two numbers:

75 37

some solutions:

http://www.scribblelive.com/blog/2012/07/27/45-ways-to-communicate-two-quantities/

slide-102
SLIDE 102

Part I: Data Visualization Strategies & Tools
Part II: A brief (5 min) activity
Part III: Data Visualization Research in Practice

slide-103
SLIDE 103

Part III

slide-104
SLIDE 104

The characteristics of data

Volume

Amount of data

slide-105
SLIDE 105

The characteristics of data

Volume Variety

Amount of data Kinds of data

slide-106
SLIDE 106

The characteristics of data

Volume Variety Veracity

Amount of data Kinds of data Reliability of data

slide-107
SLIDE 107

The characteristics of data

Volume Variety Veracity Velocity

Amount of data Kinds of data Reliability of data Speed of acquisition

slide-108
SLIDE 108

How do we bridge the gap from data to insights and actions ?

Data Insights Action

slide-109
SLIDE 109

Data Insights Action

How do we bridge the gap from data to insights and actions ?

Transform Explore Visualize Model

Data Science & Data Visualization

slide-110
SLIDE 110

Data Insights Action

Doctoral research: visualizing complex & heterogeneous data

Transform Explore Visualize Model

Data Science & Data Visualization

slide-111
SLIDE 111
  • Understand : what data do stakeholders need to perform their tasks?
  • The challenge : limited stakeholder time, complex & restricted data access
  • My strategy: partner with stakeholders on a high-value project, gather necessary evidence for data & tasks
  • Collaboration context: evidence-based redesign of a clinical report with Public Health England

Understand : stakeholder needs, data, and tasks

MYCOBACTERIUM TUBERCULOSIS GENOME SEQUENCING REPORT


Crisan A, McKee G, Munzner T, Gardy JL “Evidence-based design and evaluation of a whole genome sequencing clinical report for the reference microbiology laboratory” PEERJ (2018)

Crisan A, Gardy JL, Munzner T “On regulatory and organizational constraints in visualization design and evaluation” BELIV ‘16 – an IEEE VIS affiliated methods workshop

slide-112
SLIDE 112

Understand : gathering evidence through mixed-methods

  • Integrating mixed-methods (MM) research into design study methodologies

Phases: Discover | Design | Deploy
Stages: Information Gathering | Design & Evaluation | Finalize Design
Activities: Expert Consults, Task & Data Questionnaire, Design Sprint, Design Choice Questionnaire, TB Workflow Map
Data Gathered: Qualitative, Quantitative
MM Study Design: Exploratory Sequential Model, Embedded Model

MYCOBACTERIUM TUBERCULOSIS GENOME SEQUENCING REPORT
slide-113
SLIDE 113

Understand : gathering evidence through mixed-methods

  • Integrating mixed-methods (MM) research into design study methodologies

Discover | Design | Implement

Information Gathering | Design & Evaluation | Finalize Design

Expert Consults | Task & Data Questionnaire | Design Sprint | Design Choice Questionnaire | TB Workflow Map

Data Gathered: Qualitative, Quantitative

MM Study Design: Exploratory Sequential Model, Embedded Model

[Figure: example MYCOBACTERIUM TUBERCULOSIS GENOME SEQUENCING REPORT, as on slide 112]

slide-114
SLIDE 114

Discover : what data is used for different tasks

18

Online Task & Data Survey Participants (multiple choice questionnaire, quantitative data)

Public Health Role | Total
Clinician | 7
Nurse | 3
Laboratorian | 3
Researcher | 1
Surveillance | 3
Other |
Total | 17

Expert Consults Participants (semi-structured interviews, qualitative data)

Public Health Role | Total
Clinician | 2
Nurse | 1
Laboratorian | 2
Researcher |
Surveillance | 1
Other | 1
Total | 7

slide-115
SLIDE 115

Discover : quantified consensus for data used for some tasks

Task groups: DIAGNOSIS TASKS, TREATMENT TASKS, SURVEILLANCE TASKS

Task columns, in order: Diagnose Latent TB, Diagnose Active TB, Reactive vs New Infection, Characterize Transmission Risk, Choose Meds, Choose Tx Duration, Assess Response to Tx, Guide Contact Tracing, Report to Public Health, Define a Cluster, Connect Case to Existing Cluster, Guide Public Health Response

[Scores are listed left to right as extracted. Blank cells were lost, so rows with fewer than 12 scores cannot be aligned to specific task columns.]

Data | WGS equivalent | Consensus scores per task | TOTAL SCORE
Patient Identifier | Same | 3 3 3 3 3 3 3 2 1 1 1 1 | 26
Sample Collection Date | Same | 3 3 2 3 3 3 3 1 1 1 1 1 | 24
Patient Prior TB Results | Same | 3 2 3 3 3 3 3 1 1 1 1 | 23
Speciation | Speciation | 1 3 2 3 3 3 3 2 1 1 1 1 | 23
Sample Type (sputum, fine needle aspirate etc.) | Same | 2 3 2 3 3 3 3 1 1 1 1 | 22
Culture results | NA | 1 3 2 3 3 3 3 2 1 1 1 | 22
Sample Collection Site (lymph node, lung etc.) | Same | 2 3 2 3 3 3 3 1 1 1 | 21
Acid Fast Bacilli Smear | Speciation | 2 3 2 3 2 3 3 1 1 1 1 | 21
Resistotype | Predicted DST | 2 3 1 3 3 2 2 1 1 1 1 | 19
Phenotypic DST | Predicted DST | 2 3 2 3 3 2 1 1 1 1 | 18
Chest x-ray | NA | 3 3 2 3 2 3 1 | 17
Report Release Date | Same | 2 2 1 2 2 2 2 1 1 1 | 15
Requester IDs | Same | 2 2 2 2 2 2 2 1 | 15
Interpretation or comments from reviewer | Same | 2 2 1 2 2 2 3 1 | 15
Predicted DST | Predicted DST | 2 2 1 3 3 2 1 1 | 15
MIRU-VNTR | SNPs | 2 3 1 1 1 1 1 1 1 1 1 | 13
Cluster Assignment | Same | 2 2 1 1 1 1 1 1 1 1 | 11
SNP/variant distance | SNPs | 1 2 1 1 1 1 1 1 1 1 | 10
Phylogenetic Tree | Same | 2 1 1 1 1 1 1 1 1 | 9
Reviewer ID | Same | 1 1 1 1 1 1 1 1 | 8
TST results | Speciation* | 3 1 1 1 1 | 7
IGRA results | Speciation* | 3 1 1 1 1 | 7
Lab QC | WGS Specific | 1 2 1 1 1 1 | 7
Spoligotype | SNPs | 1 1 1 | 3
RFLP | SNPs | 1 1 1 | 3

Consensus among participants (% agree → category): 3 (>75%), 2 (50% - 75%), 1 (25% - 50%), 0 (<25%)
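The consensus categories in the legend can be sketched as a simple binning step. This is a minimal illustration, not the study's code: the agreement percentages are hypothetical, the handling of values that fall exactly on 25%, 50% or 75% is an assumption, and treating the total score as a plain sum of categories is also an assumption that the slide does not state.

```r
# Map percent agreement to the 0-3 consensus categories from the legend.
# Assumption: boundary values (exactly 25, 50, 75) fall in the lower category.
consensus_category <- function(pct_agree) {
  cut(pct_agree,
      breaks = c(-Inf, 25, 50, 75, Inf),
      labels = FALSE) - 1L
}

# Hypothetical agreement percentages for one data item across four tasks
pct <- c(diagnose_active_tb = 90, choose_meds = 60,
         define_cluster = 30, report_to_public_health = 10)

scores <- consensus_category(pct)
names(scores) <- names(pct)   # cut() drops names, so restore them

# Assumed aggregation: total score as the sum of category scores
total_score <- sum(scores)
print(scores)
print(total_score)
```

Binning with `cut()` keeps the thresholds in one place, so a different boundary convention only requires changing the `breaks` vector or the `right` argument.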

slide-116
SLIDE 116

Discover : for some tasks, stakeholders don’t know what data to use

  • A surprising finding : limited consensus for data used in surveillance tasks
[Table: same data-by-task consensus table and legend as on slide 115]

slide-117
SLIDE 117

Understand : gathering evidence through mixed-methods

  • Integrating mixed-methods (MM) research into design study methodologies

Discover | Design | Implement

Information Gathering | Design & Evaluation | Finalize Design

Expert Consults | Task & Data Questionnaire | Design Sprint | Design Choice Questionnaire | TB Workflow Map

Data Gathered: Qualitative, Quantitative

MM Study Design: Exploratory Sequential Model, Embedded Model

[Figure: example MYCOBACTERIUM TUBERCULOSIS GENOME SEQUENCING REPORT, as on slide 112]

slide-118
SLIDE 118

Design : creating and testing alternative report designs

Design Sprint: creative session, making prototypes

Design Choice Questionnaire: multiple choice questionnaire, quantitative & qualitative data

Public Health Role | Total
Clinician | 13
Nurse | 5
Laboratorian | 3
Researcher | 8
Surveillance | 8
Other* | 12
Total | 54

slide-119
SLIDE 119

Mycobacterium Whole Genome Sequencing Report from MGIT Positive Samples

Not for diagnostic use 01/02/1915

Sample Details

Sequencing Location Oxford Date received in Lab Local Lims Specimen ID 123456789 Run date 01/01/19150115 Guuid 123456-79aab-910abr-15243hg

Organism Identification

Predicted/closest match TBCOMP/microti 100% TBCOMP 100% TBCOMP/TB 96.77% TBCOMP/tuberculosis-canettii 35.71% MACCOMP 21.21%

Sample/Sequencing Quality

Total reads (~millions): 4.73 | Mapped %: 99.47 | No reads mapped (~millions): 4.7 | Coverage %: 91.99

Resistance Summary

INH | RIF | EMB | PZA | QUI | SM | AG
U | S | S | S | S | S | S

Resistotype

Drug Mutation Nucleotides Support (ACGT) Source – (R/Total) Prediction INH katG_A727T GCC->ACC (160/0/1/0) (0/164/0/0) (0/167/0/0) Unclassified UNK

Original

slide-120
SLIDE 120

[Figure: the original Mycobacterium Whole Genome Sequencing Report from slide 119, shown together with the redesigned MYCOBACTERIUM TUBERCULOSIS GENOME SEQUENCING REPORT from slide 112]
slide-121
SLIDE 121

Design results : using evidence to inform the design

[Figure: redesigned MYCOBACTERIUM TUBERCULOSIS GENOME SEQUENCING REPORT, as on slide 112]

  • Visual hierarchy that follows a clinical narrative
  • Grouping of common data elements (gestalt)
  • Judicious use of emphasis for “at-a-glance” read
  • Prioritize reading flow for clinical tasks
  • LaTeX report that is programmatically generated
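Programmatic generation of a LaTeX report can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline behind the actual report: the data frame, the `make_row` helper and the output file name are hypothetical, and only the drug table is built.

```r
# Hypothetical drug-susceptibility predictions for one sample
predictions <- data.frame(
  drug = c("Isoniazid", "Rifampin", "Ethambutol"),
  interpretation = c("Resistant", "Resistant", "Susceptible"),
  mutation = c("katG (S315T)", "rpoB (S531L)", "No mutation detected"),
  stringsAsFactors = FALSE
)

# Build one LaTeX table row per drug, bolding resistant results
# so they stand out in an at-a-glance read
make_row <- function(drug, interpretation, mutation) {
  if (interpretation == "Resistant") {
    interpretation <- sprintf("\\textbf{%s}", interpretation)
  }
  sprintf("%s & %s & %s \\\\", drug, interpretation, mutation)
}

rows <- mapply(make_row, predictions$drug, predictions$interpretation,
               predictions$mutation, USE.NAMES = FALSE)

latex_table <- c(
  "\\begin{tabular}{lll}",
  "Drug & Interpretation & Resistance gene (amino acid mutation) \\\\",
  "\\hline",
  rows,
  "\\end{tabular}"
)

# writeLines(latex_table, "report_table.tex")  # hypothetical file, compiled later
cat(latex_table, sep = "\n")
```

Generating the markup from data keeps the layout fixed across samples, so the design decisions listed above are applied the same way to every report.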

slide-122
SLIDE 122

Design results : using evidence to inform the design

  • We tested alternative designs to come up with the final report

Alternatives Generated in ‘Design Sprint’ Step

Control Design (Original report): Drug Susceptibility table only

Drug | Prediction
Isoniazid | Resistant
Rifampin | Resistant
Ethambutol | Resistant
Pyrazinamide | Resistant

Alternative 1: the same Drug Susceptibility table plus the statement “Based on predicted antibiotic mutations, the individual has multidrug resistant TB”

Alternative 2: the same Drug Susceptibility table plus tick boxes for Mono-resistant, Multidrug-resistant (MDR) and Extremely Drug Resistant (XDR), with one box marked “x”

slide-123
SLIDE 123

Design results : example of a typical result from our study

Comments from respondents:

“the check boxes provide an at-a-glance result”

“tick boxes may cause confusion when clinicians read XDR without realizing that option is not selected.”

[Figure: rescaled rank score (0.00 = least preferred, 1.00 = most preferred, with a random reference marked) for options A (Control), B (Alternative) and C (Alternative), plotted separately for clinicians and non-clinicians. A is the Drug Susceptibility table alone, B adds the sentence “Based on predicted antibiotic mutations, the individual has multidrug resistant TB”, and C adds Mono-resistant / Multidrug-resistant (MDR) / Extremely Drug Resistant (XDR) tick boxes.]
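One way a rescaled rank score of this kind could be computed is sketched below. The slide does not give the formula, so this is an assumption: each respondent ranks the k options from 1 (most preferred) to k (least preferred), and an option's score is the mean of (k - rank) / (k - 1), which runs from 0 (always ranked last) to 1 (always ranked first). The rankings are made up for illustration.

```r
# Hypothetical rankings: rows are respondents, columns are design options;
# 1 = most preferred, 3 = least preferred
ranks <- matrix(
  c(3, 1, 2,
    3, 2, 1,
    2, 1, 3,
    3, 1, 2),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("A", "B", "C"))
)

# Assumed rescaling: 1 = always ranked first, 0 = always ranked last
rescaled_rank_score <- function(ranks) {
  k <- ncol(ranks)
  colMeans((k - ranks) / (k - 1))
}

scores <- rescaled_rank_score(ranks)
print(scores)
```

Under this rescaling an option ranked at random has an expected score of 0.5, which is one plausible reading of the "Random" mark between "Least Preferred" and "Most Preferred".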

slide-124
SLIDE 124

Design results : whole reports were actually confusing

  • Asked participants to evaluate isolated components (previous) and whole reports

“None are especially good (see previous comments on individual parts)”

Participant Comment

slide-125
SLIDE 125

Design results : we did this for many design choices

§ Generally, alternative designs preferred

  • in 12 out of 14 comparisons to control

§ Designs should promote patient safety & precise interpretability

  • Abbreviations should be avoided
  • Debate about prioritizing susceptible vs. resistant drugs

§ Clinically actionable data to be given priority

  • Surveillance tasks aren’t clinically actionable

§ Sometimes we didn’t provide good alternatives
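The figure for this slide compares each option against a "Random permutation reference" with ±1σ to ±3σ bands. How that reference was built is not stated, so the following is only a sketch under assumptions: respondents rank k options uniformly at random, the score is the assumed rescaling (k - rank) / (k - 1), and the number of respondents is set to the 54 participants listed for the design stage.

```r
set.seed(1)

k <- 3               # number of design options in one question (assumed)
n_respondents <- 54  # design-stage participants, from the role table
n_sim <- 2000        # number of simulated questionnaires (arbitrary)

# Under random ranking, the rank of any one option is uniform on 1..k
simulate_score <- function() {
  r <- sample.int(k, n_respondents, replace = TRUE)
  mean((k - r) / (k - 1))
}

null_scores <- replicate(n_sim, simulate_score())
ref_mean <- mean(null_scores)
ref_sd <- sd(null_scores)

# Distance of an observed score from the random reference, in sigma units
z_score <- function(observed) (observed - ref_mean) / ref_sd
print(c(mean = ref_mean, sd = ref_sd))
```

Expressing an observed score in sigma units is what lets a preference be read as clearly different from chance rather than just above or below 0.5.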

[Figure: results for every design question. Rank questions are plotted as a rescaled (normalized) rank score from 0.00 (least preferred) to 1.00 (most preferred), multiple choice questions as the proportion in favour. Legend: control vs. alternative design options (A, B, ...), clinician vs. non-clinician public health role, and a random permutation reference with ±1σ, ±2σ, ±3σ and beyond-3σ bands.]

A) Isolated Wording Choices
[Q6] Wording - Speciation. Preferred: B (Organism)
[Q8] Wording - Resistance. Preferred: C (Drug Susceptibility)
[Q14] Wording - Relatedness. Preferred: C (Cluster Detection)
[Q7] Wording - Speciation Results. Preferred: A (Full Sentence)
[Q9] Abbreviation - Drug Names. Preferred: B (Full Name)
[Q10] Abbreviation - Resistance. Preferred: B (Full Name)

B) Isolated Design Choices
[Q12] Emphasis – Drug Resistance. Preferred: C (Shading)
[Q16] Layout – Drug Resistance. Preferred: B (Prediction by drug), A (Drug listed by category)
[Q17] Visualization - Clusters. Preferred: D (Phylogenetic tree + table)
[Q5] Emphasis - Bolding. Preferred: A (With bolding, for relevant content)
[Q11] Data – Mutation Data. Preferred: C (Include, but on second report page)
[Q18] Design – Summary Statement. Preferred: B (Include Summary)
[Q13] Emphasis – Resistance Overview. Preferred: C (Tick Boxes)
[Q15] Design - Speciation. Preferred: A (Organism name only)
[Q19] Layout – Columns. Preferred: B (Two Columns)
*no control

C) Full Reports (rank question)
slide-126
SLIDE 126

Deploy : report delivers insights that lead to actions

  • Our report is used by Public Health England and several other global public health organizations, and public health software systems

  • Community consensus: we’re tackling a difficult and underappreciated part of biomedical research

slide-127
SLIDE 127

Do YOU have to go through all this effort for every report?

slide-128
SLIDE 128

It depends on what you want to achieve

§ Broad data collection can be used for other projects

  • We were also collecting data for future software projects
  • Stay tuned for more details!

§ At the very least test alternative designs

  • If you can’t do a Discovery stage (time, people, budget), at least do the Design stage
  • Check in with stakeholders to avoid ad hoc design issues

§ Bioinformaticians : you should use human-centered design for your tools!

  • Not command line ≠ user friendly
  • If you didn’t test it with even one user it’s not “user friendly” or “intuitive”
  • Report design is a very simple example of how to use these methods
slide-129
SLIDE 129
Understand : community strategies for data visualization

  • Understand : the good, the bad, and the common datavis solutions already in use

  • Strategy: systematically mine and describe visualization solutions

  • Output: a domain prevalence visualization design space

[Figure: GEViT chart types. The labels below are listed in extracted order because the figure's hierarchy was lost.]

Chart categories: Common Statistical Charts, Relational Charts, Temporal Charts, Spatial Charts, Tree Charts, Genomic Charts, Other Charts

Chart types: Bar Chart, Heatmap, Density Plot*, Node-link, Flow Diagram, Streamgraph*, Timeline, Geographic Map, Choropleth Map, Interior Map, Absolute, Relative, Chord Diagram, Sankey Diagram, Phylogenetic Tree, Rooted (Radial & Linear), Unrooted (Radial & Linear), Dendrogram, Clonal Tree*, Genomic Map, Alignment, Sequence Logo Plot, Linear, Radial, Scatter Plot, Pie Chart, Venn Diagram, Line Chart, Standard, Stacked, Divergent, Distribution Plot, Boxplot, PDF, Swarm Plot, Histogram, Table, Image, Gel Image, General Image, Category Stripe, Miscellany, Composition Plot, Colour Charts

Special Cases
  • Epidemic Curve, Diversity Chart, LefSe Plot
  • eBurst, Social network, Molecular network, Minimum Spanning Tree
  • Root-to-tip, Ordination Plot, Q-Q plot
  • Bootscan, Kaplan-Meier, Skyline Plot

Crisan A, Gardy JL, Munzner T “A systematic method for surveying data visualizations and a resulting genomic epidemiology visualization typology: GEViT” OXFORD BIOINFORMATICS (2018)

slide-130
SLIDE 130

Understand : community strategies for data visualization

  • Different visualizations for hospital outbreaks – which choice is right?

Gorrie (2017) Willman (2015) Davis (2015)

slide-131
SLIDE 131

Understand : community strategies for data visualization

  • Is it a good idea to visualize data like this?

Gorrie (2017)

slide-132
SLIDE 132

Can I come up with a method to systematically review data visualizations?

What will I find by trying this method out on microbial genomic epidemiology research papers?

slide-133
SLIDE 133

Understand : an overview of our systematic method

  • Our method has two analysis phases:

Analysis Phase

Literature Analysis

Visualization Analysis: Qualitative Analysis, Quantitative Analysis

slide-134
SLIDE 134

Understand : an overview of our systematic method

  • Some analyses are automated and others are manual (marked with icons on the slide)

Analysis Phase

Literature Analysis

Visualization Analysis: Qualitative Analysis, Quantitative Analysis

slide-135
SLIDE 135

Understand : an overview of our systematic method

  • Analysis phases answer different research questions

Analysis Phase | Research Question
Literature Analysis | WHY are researchers visualizing data?
Visualization Analysis (Qualitative Analysis) | HOW are researchers visualizing data?
Visualization Analysis (Quantitative Analysis) | HOW MANY examples of specific visualizations?

slide-136
SLIDE 136

Explore : apply our method to genomic epidemiology

WHY are researchers visualizing data?

Analysis Step | # articles | Results
Article Acquisition & Unsupervised Clustering | 17,974 | Article topic clusters
Limit to clusters of human pathogens | 6,350 |
Random Stratified Sampling (clusters as strata) | 221 | 801 figures

slide-137
SLIDE 137

Explore : apply our method to genomic epidemiology

WHY are researchers visualizing data?

Analysis Step | # articles | Results
Article Acquisition & Unsupervised Clustering | 17,974 | Article topic clusters
Limit to clusters of human pathogens | 6,350 |
Random Stratified Sampling (clusters as strata) | 221 | 801 figures

HOW are researchers visualizing data?

Analysis Step | # articles | Results
Iterative & axial coding | 221 | A genomic epidemiology visualization typology (GEViT)

slide-138
SLIDE 138

Explore : apply our method to genomic epidemiology

WHY are researchers visualizing data?

Analysis Step | # articles | Results
Article Acquisition & Unsupervised Clustering | 17,974 | Article topic clusters
Limit to clusters of human pathogens | 6,350 |
Random Stratified Sampling (clusters as strata) | 221 | 801 figures

HOW are researchers visualizing data?

Analysis Step | # articles | Results
Iterative & axial coding | 221 | A genomic epidemiology visualization typology (GEViT)
Descriptive Statistics | 221 | Current common visualization practices

slide-139
SLIDE 139

Understand : an overview of our systematic method

  • Analysis phases answer different research questions

Analysis Phase | Research Question
Literature Analysis | WHY are researchers visualizing data?
Visualization Analysis (Qualitative Analysis) | HOW are researchers visualizing data?
Visualization Analysis (Quantitative Analysis) | HOW MANY examples of specific visualizations?

slide-140
SLIDE 140

18K articles on genomic epidemiology

Used for unsupervised topic clustering

slide-141
SLIDE 141

Articles clustered primarily by pathogen

  • Strategic subset: sample fixed # of articles from each cluster
  • Random stratified sampling
  • Could capture variability in visualizations across pathogens – if it existed
  • Sampling resulted in 221 articles that yielded 801 figures

Final Topics Clustering Results (t-SNE & hdbscan)
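The sampling step described above (a fixed number of articles drawn from each topic cluster) can be sketched in base R. The article table, cluster names and sample size are hypothetical, and `sample_per_cluster` is an illustrative helper rather than a function from the study's code.

```r
set.seed(42)

# Hypothetical article table: one row per article with its topic cluster
articles <- data.frame(
  pmid = sprintf("PMID%04d", 1:60),
  cluster = rep(c("tuberculosis", "influenza", "ebola"), times = c(30, 20, 10)),
  stringsAsFactors = FALSE
)

# Draw the same number of articles from every cluster (the strata);
# clusters smaller than n_per_stratum contribute all of their articles
sample_per_cluster <- function(df, strata_col, n_per_stratum) {
  groups <- split(df, df[[strata_col]])
  sampled <- lapply(groups, function(g) {
    g[sample.int(nrow(g), min(n_per_stratum, nrow(g))), , drop = FALSE]
  })
  do.call(rbind, sampled)
}

subset_articles <- sample_per_cluster(articles, "cluster", 5)
print(table(subset_articles$cluster))
```

Sampling a fixed number per cluster, rather than proportionally, gives small pathogen clusters the same weight as large ones, which is what allows differences in visualization practice across pathogens to show up if they exist.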

slide-142
SLIDE 142

Adjutant : unsupervised topic discovery for literature reviews

  • People liked the literature review method, so I made it into an R package!

Crisan A, Munzner T, Gardy JL “Adjutant: an R-based tool to support topic discovery for systematic and literature reviews” OXFORD BIOINFORMATICS (APPLICATION NOTE) - 2018

slide-143
SLIDE 143

Understand : an overview of our systematic method

  • Analysis phases answer different research questions

Analysis Phase | Research Question
Literature Analysis | WHY are researchers visualizing data?
Visualization Analysis (Qualitative Analysis) | HOW are researchers visualizing data?
Visualization Analysis (Quantitative Analysis) | HOW MANY examples of specific visualizations?

slide-144
SLIDE 144

Qualitative analysis : how are figures constructed

  • Extract figures from sample articles
  • Figures are not tied to any one specific tool
  • Includes post-processing made on figures
  • Used iterative and axial coding from qualitative methods
  • All figures analyzed separately
  • Multipart figures (i.e. Fig 3A, 3B) analyzed together
  • Result: GEViT, a genomic epidemiology visualization typology
  • Taxonomic hierarchy describing chart types, chart combinations, and chart enhancements

slide-145
SLIDE 145

GEViT chart types: the basic building blocks

slide-146
SLIDE 146

GEViT chart types: current common practices

  • A tree is the most common chart type
  • Not a surprise
  • A lot of data is in text and not visualized
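The descriptive statistics behind a statement like "a tree is the most common chart type" amount to tabulating the coded chart types. The sketch below uses a small made-up coding table, so the counts are illustrative only and are not the study's results.

```r
# Hypothetical coding output: one row per chart found in a sampled figure
coded <- data.frame(
  figure_id = c(1, 1, 2, 3, 3, 4, 5),
  chart_type = c("phylogenetic tree", "heatmap", "phylogenetic tree",
                 "bar chart", "table", "phylogenetic tree", "geographic map"),
  stringsAsFactors = FALSE
)

# Count each chart type and express it as a share of all coded charts
counts <- sort(table(coded$chart_type), decreasing = TRUE)
share <- round(100 * prop.table(counts), 1)
most_common <- names(counts)[1]

print(counts)
print(share)
print(most_common)
```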
slide-147
SLIDE 147

GEViT chart combinations : showing data together

  • Baseline is just a single chart

Current Practice (% of all figures): 40.1%, 20.3%, 17.3%, 13.5%, 8.8%, 11.9%

Example of a simple chart

slide-148
SLIDE 148

GEViT chart combinations : showing data together

  • Users also combined individual charts to show more aspects of the data

Current Practice (% of all figures): 40.1%, 20.3%, 17.3%, 13.5%, 8.8%
slide-149
SLIDE 149

GEViT chart combinations : showing data together

  • Users also combined individual charts to show more aspects of the data

Example of a spatially aligned combination

Combination Type: Spatially Aligned | # of charts: 1 | # of chart types: Many | Linkage: Horizontal and vertical alignment

slide-150
SLIDE 150

GEViT chart combinations : showing data together

  • Users also combined individual charts to show more aspects of the data

Example of a small multiples combination

Combination Type: Small Multiples | # of charts: Many | # of chart types: 1 | Linkage: Chart type & data

slide-151
SLIDE 151

GEViT chart combinations : showing data together

  • Users also combined individual charts to show more aspects of the data

Example of a color aligned combination

Combination Type: Color Aligned | # of charts: Many | # of chart types: Many | Linkage: Visual linkage (color)

slide-152
SLIDE 152

GEViT chart combinations : showing data together

  • Users also combined individual charts to show more aspects of the data

Current Practice (% of all figures): 40.1%, 20.3%, 17.3%, 13.5%, 8.8%
slide-153
SLIDE 153

GEViT chart enhancements : adding metadata to chart types

>80% of all figures have some enhancement

Current Practice

slide-154
SLIDE 154

GEViT chart enhancements : adding metadata to chart types

slide-155
SLIDE 155

GEViT chart enhancements : adding metadata to chart types

Chart Enhancement Examples

slide-156
SLIDE 156

Understand : now we could systematically describe visualizations

Analysis Phase | Research Question
Literature Analysis | WHY are researchers visualizing data?
Visualization Analysis (Qualitative Analysis) | HOW are researchers visualizing data?
Visualization Analysis (Quantitative Analysis) | HOW MANY examples of specific visualizations?

slide-157
SLIDE 157

GEViT in action : knowledge translation via the GEViT Gallery

http://gevit.net

Crisan et al. (2018) “A systematic method for surveying data visualizations and a resulting genomic epidemiology visualization typology: GEViT” OXFORD BIOINFORMATICS

slide-158
SLIDE 158

What does GEViT do and not do?

GEViT provides a base

  • A Visualization Typology for visual design
  • Chart Types
  • Chart Combinations
  • Chart Enhancements
  • An Interactive Gallery

GEViT does not evaluate

  • Massive undertaking that would take many years

  • GEViT provides platform for future evaluation research

slide-159
SLIDE 159

minCombinR: simplified visual generation in R

library(minCombinR)

# specify individual charts
phyloTree_chart <- specify_base(chart_type = "phylogenetic tree", data = "tree_dat")
epicurve <- specify_base(chart_type = "histogram", data = "tab_dat", x = "month")
map_chart <- specify_base("geographic map", data = "tab_dat", lat = "latitude", long = "longitude")

# specify a combination that links the three charts by colour
color_combo <- specify_combination(combo_type = "color_linked",
                                   base_charts = c("phyloTree_chart", "map_chart", "epicurve"),
                                   link_by = "country")

# plot the result
plot(color_combo)

slide-160
SLIDE 160

GEViTRec: automated visualization recommendation

# Analyze different data types automatically
harmon_obj <- data_harmonization(tab_dat, tree_dat, genomic_dat, all_spatial)

# Create specifications that compile to minCombinR
component_specs <- get_spec_list(harmon_obj)

# plot the result one view at a time
plot_view(component_specs, view_num = 1)

Preliminary Result

[Figure: panel A, map (longitude, latitude) of case_count by country (GIN, LBR, SLE); panel B, bar chart of count by country (GIN, LBR, SLE)]

slide-161
SLIDE 161

Data visualization strategies and tools for microbial genomic epidemiology

@amcrisan http://cs.ubc.ca/~acrisan acrisan@cs.ubc.ca

Anamaria Crisan

Vanier Canada Scholar & UBC Public Scholar

PhD Candidate, Computer Science University of British Columbia

Part III