UW eScience Institute Initiatives Cecilia Aragon University of - - PowerPoint PPT Presentation

slide-1
SLIDE 1

UW eScience Institute Initiatives

Cecilia Aragon University of Washington Seattle, WA, USA aragon@uw.edu

(slides courtesy Bill Howe, Anissa Tanweer, Carole Goble)

Dagstuhl EAS plenary talk, Jun 23, 2016

slide-2
SLIDE 2

“All across our campus, the process of discovery will increasingly rely on researchers’ ability to extract knowledge from vast amounts of data… In order to remain at the forefront, UW must be a leader in advancing these techniques and technologies, and in making [them] accessible to researchers in the broadest imaginable range of fields.”

2005-2008

In other words:

  • Data-intensive science will be ubiquitous
  • It’s about people and software, not only hardware
slide-3
SLIDE 3

[Figure labels: PDB, GenBank, UniProt, Pfam; CATH, SCOP (Protein Structure Classification); ChemSpider; high-throughput experimental methods; industrial scale; commons-based production; public data sets; preserved; spreadsheets, notebooks; local, lost; cherry-picked results]

Long Tail of Research Data

[src: Carole Goble]

slide-4
SLIDE 4

How much data do you work with?

Wright 2013

slide-5
SLIDE 5

Data Science Kickoff Session: 137 posters from 30+ departments and units

slide-6
SLIDE 6

PIs on Moore/Sloan effort + eScience Institute Steering Committee + UW participants in February 7 Data Science poster session

Broad collaborations

slide-7
SLIDE 7

Moore/Sloan Data Science Environment: $37.8M Initiative at UCB, NYU, UW

Impact graphic by Ray Hong and eScience Institute, UW

slide-8
SLIDE 8

Career paths and alternative metrics

Recruited / recruiting data scientists

  • Typically Ph.D.-educated; fully supported by DSE; research position with emphasis on taking responsibility for core activities (e.g., incubator projects)

Recruited / recruiting research scientists

  • Typically Ph.D.-educated; partially supported by DSE; research position with emphasis on specific science goals

Designated 33 faculty and staff as Data Science Fellows

  • We cribbed Berkeley’s excellent idea

Recruited 6 “Provost’s Initiative” faculty members

  • Provost provided 6 faculty “half-positions”
  • Individuals with strength and commitment both to advancing data science methodology and to applying it at the forefront of a specific field
  • Astronomy, Biology, Mechanical Engineering, Sociology, Applied Mathematics, Statistics + Computer Science & Engineering

Recruited 2 cohorts of 6 Data Science Postdoctoral Fellows

  • Each is co-mentored by “methodology” and “applications” faculty

UW flagship activity: Establish two new roles on campus: “Data Science Fellows” and “Data Scientists”

slide-9
SLIDE 9

Education and training

  • IGERT Ph.D. program in Big Data / Data Science
  • Seven departments have put in place Big Data Tracks
  • Data science classes count toward Ph.D. (no extra work)
  • Departments: Astronomy, Biology, Chemical Engineering, Computer Science & Engineering, Genome Sciences, Oceanography, and Statistics

UW flagship activity: Establish new graduate program tracks in data science

  • Undergraduate “transcriptable option”
  • Workshops and Bootcamps
      – Multiple Software Carpentry Bootcamps (Python, R, etc.)
      – AstroData Hack Week
      – Many others

  • Seminar series
slide-10
SLIDE 10

New MS in data science at UW

Interdisciplinary

  • Participation from six departments (HCDE, CSE, Stats, Biostats, iSchool, Applied Math)

Innovative

  • Rigorous technical program in statistics and computer science
  • Human-centered data science curriculum – ethics, data science and society, ‘big data’ user experience, visualization

Designed for working professionals

  • Evening courses, full-time or part time attendance
slide-11
SLIDE 11

Software tools, environments, and support

“Incubator” program

  • Our experiment at achieving scalability
  • A lightweight 2-page proposal process several times each year
  • I have an interesting science problem
  • I’m stumped by the data science aspects
  • If you cracked it, others would benefit
  • I’m going to send you the following person half-time for 3 months to provide the labor; you provide the guidance
  • Preceded by an information session to clarify expectations and commitments
  • Activities take place in the Data Science Studio, staffed by our Data Scientists

  • We coach software hygiene as well as methodology
  • Running two cohorts annually

UW flagship activity: Establish an “incubator” seed grant program

slide-12
SLIDE 12

Drop-in “Office Hours”

  • eScience Institute Data Scientists
  • UW-IT Academic & Collaborative Applications Team, Research Computing Team, Network Design & Architecture Team
  • AWS Scientific Computing Team
  • Center for Statistics and the Social Sciences Statistical Consulting Service

  • UW Libraries Research Data Management Team
  • Google Cloud Platform Team
slide-13
SLIDE 13

UW campus-wide monthly meetings

May 2014 national workshop

  • More than 80 participants
  • Attendees from NYU, Berkeley, Fred Hutchinson Cancer Research Center, Allen Institute for Brain Science, Sage Bionetworks, Google, …

Draft guidelines for reproducible research

Weekly tutorials on “research hygiene” topics

  • E.g., GitHub, knitr, IPython Notebook

Reproducibility and open science

UW flagship activity: Establish a campus-wide community around reproducible research

slide-14
SLIDE 14

Working spaces and culture

Washington Research Foundation Data Science Studio

UW flagship activity: Establish a “Data Science Studio”

slide-15
SLIDE 15

Ethnography and evaluation: data science studies

Ethnography and evaluation integrated into a wide range of Data Science Environment activities

  • Project overall (beginning with in-depth baseline interviews with participants from grad students through faculty)
  • IGERT
  • AstroData Hack Week
  • Incubator projects

Developed ethnography research questions

  • E.g., who does data science, how are they networked, forms of social interaction and organization, intellectual groupings, career reward structures, collaborative tool use in scientific workflows, data science values and ethics, etc.

Established baseline for evaluation, and determined evaluation questions

UW flagship activity: Establish a research program in “the data science of data science”

slide-16
SLIDE 16

Data Science Ethnography

Qualitative field-based technique originally from anthropology

  • Enables the study of underlying patterns and themes in a sociotechnical system
  • Trends can be analyzed in context without compromising ecological validity

Ethnographers immerse themselves in a community to discern

  • subtle patterns that may not be self-evident to members of the community
  • how members make sense of the world
  • what motivates them
  • how they work together
slide-17
SLIDE 17

Data Science Ethnography

Ethnography involves

  • Hundreds of pages of field notes, interview transcripts, and artifacts from the field
  • Collected and recorded over a long period of time

Ethnographic insights emerge as patterns and themes are detected

Ethnographers work with members of the community to interpret observations

Analysis

  • Co-occurs with data collection
  • Interactively shapes research strategy

“Applied ethnography”

  • Goal is to provide ongoing feedback on what works and what doesn’t

slide-18
SLIDE 18

[Chart labels: Duration of Engagement; # of engagements; go to them / come to us]

Per FTE:

Office Hours: 2015-present; 50+ annually

Door-to-Door; Lab Visits: 2011-present; 25-30 annually

Incubator; DSSG: 2014-present; 1-2 annually

Embedded: 2010-present; 0-2 annually

Joint Research: 2010-present; 0-2 annually

slide-19
SLIDE 19

Data Science for Social Good @ UW

slide-20
SLIDE 20

Precursor: Data Science Incubator

Goal: Identify high-impact data-intensive science projects that will benefit from quarter-long sprints of expertise

Protocol: ~1-2-page proposals, in-studio collaboration two days per week

Best projects: “I have the questions, I have the data, I need help getting the answers”

  • 4-6 concurrent teams: Network effects among cohort beyond 1:1 interactions
  • Each team is ~50% project lead + ~50% eScience FTE
  • Structured, time-bounded engagement ensures progress (and an exit strategy)
  • Feels like a course: “I have incubator today, so I can’t go do XXXX”

Spring 2014, Fall 2014, Winter 2016

http://data.uw.edu/incubator/

slide-21
SLIDE 21
slide-22
SLIDE 22

“I talked with Alicia a bit yesterday, and she showed me that her earthquake-repeater-searching implementation is more general, and more powerful than I had thought, and closer to trial by others (and I have a particular use in mind in the ongoing iMUSH experiment on Mount St Helens) <snip>

“So I'm encouraging her to continue to work on it a day per week or so for the foreseeable future, assuming you have the facilities to continue the incubation.”

The project outlives the incubator…

Publications in the works on both the software and the science – from three months of half-time work

slide-23
SLIDE 23
slide-24
SLIDE 24

Assessing Community Well-Being

Third-Place Technologies

Optimization of King County Metro Paratransit

Computer Science & Engineering

Predictors of Permanent Housing for Homeless Families

Bill and Melinda Gates Foundation

Open Sidewalk Graph for Accessible Trip Planning

Computer Science & Engineering

Inaugural 2015 program: 16 spots, 140 applicants … from 20+ departments

slide-25
SLIDE 25

Predictors of Permanent Housing for Homeless Families

Project Leads: Neil Roche & Anjana Sundaram, Gates Foundation

DSSG Fellows: Joan Wang, Jason Portenoy, Fabliha Ibnat, Chris Suberlak

ALVA High School Students: Cameron Holt, Xilalit Sanchez

eScience Data Scientist Mentors: Ariel Rokem, Bryna Hazelton

When homeless families engage in services and programs, what factors are most likely to lead to a successful exit?

The DSSG team

  • developed algorithms to identify ‘families’ and to identify ‘episodes’ of homelessness including back-to-back or overlapping enrollments in individual programs
  • devised innovative ways to visualize and analyze the ways families transition between programs
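
As an illustration of the episode-identification idea above, here is a minimal Python sketch that merges one family's program enrollments into episodes. It is not the DSSG team's actual algorithm, which the slides do not describe. The function name `merge_episodes`, the `(start, end)` tuple format, the sample dates, and the `max_gap_days` threshold that defines "back-to-back" are all assumptions made for this example.

```python
from datetime import date, timedelta


def merge_episodes(enrollments, max_gap_days=1):
    """Merge (start, end) enrollment intervals into episodes.

    Two enrollments fall in the same episode when they overlap or when the
    later one starts within max_gap_days of the earlier one ending.
    The threshold is a hypothetical parameter, not taken from the slides.
    """
    episodes = []
    for start, end in sorted(enrollments):
        if episodes and start <= episodes[-1][1] + timedelta(days=max_gap_days):
            # Overlapping or back-to-back: extend the current episode.
            last_start, last_end = episodes[-1]
            episodes[-1] = (last_start, max(last_end, end))
        else:
            # Gap is too large: start a new episode.
            episodes.append((start, end))
    return episodes


# Hypothetical enrollments for a single family (illustrative dates only).
enrollments = [
    (date(2015, 1, 1), date(2015, 2, 1)),   # program A
    (date(2015, 1, 20), date(2015, 3, 1)),  # program B, overlaps A
    (date(2015, 3, 2), date(2015, 4, 1)),   # program C, back-to-back with B
    (date(2015, 6, 1), date(2015, 7, 1)),   # program D, after a long gap
]
episodes = merge_episodes(enrollments)
print(len(episodes))  # 2
```

Sorting by start date first means each enrollment only needs to be compared with the most recent episode, so the merge is a single pass after the sort.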

The Gates Foundation, together with Building Changes, has partnered with King, Pierce, and Snohomish counties to make homelessness in these counties rare, brief, and one-time.

slide-26
SLIDE 26

Develop visualizations to show how homeless families move through programs

slide-27
SLIDE 27

Conduct analysis to identify predictors of permanent housing

Correlation with successful outcome, by family characteristics

Correlation with successful outcome, by homelessness program

Emergency Shelter use tends to be associated with unsuccessful outcomes (unsurprising!)

Homelessness Prevention programs more strongly associated with positive outcomes than transitional housing

Substance abuse strongly associated with unsuccessful outcomes

Parent employment strongest predictor of successful outcomes

slide-28
SLIDE 28

Open Sidewalks – Sidewalk maps for low-mobility citizens

Project Leads: Nick Bolten, Anat Caspi – Taskar Center, CSE

DSSG Fellows: Amir Amini, Yun Hao, Vaishnavi Ravichandran, Andre Stephens

ALVA High School Students: Nick Krasnoselsky, Doris Layman

eScience Data Scientist Mentors: Anthony Arendt, Jake Vanderplas

“30 million Americans over 15 years old experience limited mobility, including difficulty walking, climbing stairs, using wheelchairs, crutches, walkers” while “24 million more persons experience difficulty walking a quarter mile”

Picture: US Federal Highway Administration, http://www.fhwa.dot.gov/environment/bicycle_pedestrian/publications/sidewalk2/sidewalks204.cfm

slide-29
SLIDE 29

Automated cleaning of sidewalk data through computational geometry

Powered by data from: SDOT/Socrata, Google API

Step                    Runtime   Solved (All)       Percent
Connecting T-Gaps       ~3.9s     3,837 (4,352)      88.2
Intersection Cleaning   ~23.6s    38,844 (44,700)    86.9
Polygon Cleaning        ~10min    7,283 (8,035)      90.6
Subgraphs               ~23.2s    39,913 (45,265)    88.1

slide-30
SLIDE 30

"Impediment to insight to innovation: understanding data assemblages through the breakdown-repair process." Anissa Tanweer, Brittany Fiore-Gartland, Cecilia Aragon. Information, Communication & Society, March 2016

slide-31
SLIDE 31
slide-32
SLIDE 32

Lessons Learned

Staff

  • Science background is important
  • Autonomy is important

Program

  • Physical shared studio space is important
  • Continuous program evaluation is important
  • Diversity in the “menu” is important
  • Pipelines are important
  • People learn from each other
  • The focus must be the research

slide-33
SLIDE 33

Thanks!

Cecilia Aragon

  • Dept. of Human Centered Design & Engineering
  • eScience Institute

University of Washington, Seattle, WA, USA

aragon@uw.edu

escience.uw.edu

data.uw.edu

faculty.uw.edu/aragon

depts.washington.edu/hdsl/