How Do We Solve the Worlds Spreadsheet Problem? Alex Rasmussen - - PowerPoint PPT Presentation

how do we solve the world s spreadsheet problem
SMART_READER_LITE
LIVE PREVIEW

How Do We Solve the Worlds Spreadsheet Problem? Alex Rasmussen - - PowerPoint PPT Presentation

How Do We Solve the Worlds Spreadsheet Problem? Alex Rasmussen @alexras Hi, Im Alex! @alexras alexras.info freenome.com bitsondisk.com My Background 2009-2013 : really fast sorting 2013-2016 : data wrangling 2017-2018 :


slide-1
SLIDE 1

How Do We Solve 
 the World’s Spreadsheet Problem?

Alex Rasmussen
 @alexras

slide-2
SLIDE 2

Hi, I’m Alex!

@alexras

freenome.com bitsondisk.com alexras.info

slide-3
SLIDE 3

My Background

2009-2013: really fast sorting 2013-2016: data wrangling 2017-2018: cancer-fighting robots

slide-4
SLIDE 4

I think/worry a lot 
 about spreadsheets.

slide-5
SLIDE 5

Today’s focus: spreadsheet data

(for compute, felienne.com)

slide-6
SLIDE 6
  • 1. Spreadsheets are great
  • 2. Spreadsheets are a problem
  • 3. How we can fix it

This talk:

slide-7
SLIDE 7
  • 1. Spreadsheets are great
  • 2. Spreadsheets are a problem
  • 3. How we can fix it

This talk:

slide-8
SLIDE 8

What’s so great about spreadsheets?

slide-9
SLIDE 9

Spreadsheets are 
 Ubiquitous

slide-10
SLIDE 10

1.2 billion Office users (~16% of humans)


slide-11
SLIDE 11

1.2 billion Office users (~16% of humans)
 60 million Office 365 customers


slide-12
SLIDE 12

1.2 billion Office users (~16% of humans)
 60 million Office 365 customers
 >5 million businesses use Google Apps

slide-13
SLIDE 13

Spreadsheets are 
 Approachable

slide-14
SLIDE 14
slide-15
SLIDE 15

Spreadsheets are 
 Flexible

slide-16
SLIDE 16

Data grids

slide-17
SLIDE 17

Data grids Graphs

slide-18
SLIDE 18

Data grids Graphs Anything tabular

slide-19
SLIDE 19

Data grids Graphs Anything tabular Full-scale “apps”

slide-20
SLIDE 20

Tatsuo Horiuchi (b. 1940) Kegon Falls, 2007 AutoShape on canvas

slide-21
SLIDE 21

https://pasokonga.com/

slide-22
SLIDE 22

So if spreadsheets are ubiquitous, approachable, and flexible, 
 what’s the problem?

slide-23
SLIDE 23
  • 1. Spreadsheets are great
  • 2. Spreadsheets are a problem
  • 3. How we can fix it

This talk:

slide-24
SLIDE 24

Problem #1: Data Types

slide-25
SLIDE 25
slide-26
SLIDE 26

Automatic type conversion can cause serious problems.

slide-27
SLIDE 27

DEC1

slide-28
SLIDE 28

DEC1 12/1

slide-29
SLIDE 29

RIKEN Identifier 2310009E13

slide-30
SLIDE 30

RIKEN Identifier 2310009E13 2.31E+13

slide-31
SLIDE 31

–Johnny Appleseed

“We confirmed gene name errors in 987 supplementary files from 704 published articles (19.6% of all articles).” 
 https://genomebiology.biomedcentral.com/ articles/10.1186/s13059-016-1044-7

slide-32
SLIDE 32

False Equivalence

000123
 = 00123 = 123

True if they’re integers, 
 but what if they’re strings?

slide-33
SLIDE 33

Enumerated Types

slide-34
SLIDE 34

Enumerated Types

“Prostate Cancer”

slide-35
SLIDE 35

Enumerated Types

“Prostate Cancer” “prostate cancer”

slide-36
SLIDE 36

Enumerated Types

“Prostate Cancer” “prostate cancer” “prostatecancer”

slide-37
SLIDE 37

Enumerated Types

“Prostate Cancer” “prostate cancer” “prostatecancer” “PC”

slide-38
SLIDE 38

Enumerated Types

“Prostate Cancer” “prostate cancer” “prostatecancer” “PC” “prostate”

slide-39
SLIDE 39

Enumerated Types

“Prostate Cancer” “prostate cancer” “prostatecancer” “PC” “prostate” “prostrate”

slide-40
SLIDE 40

List Validations? Sheet Protection?

Easy to add Easy to remove by accident Hard to enforce

slide-41
SLIDE 41

Data loss! False equivalence! Ontological chaos! Mass hysteria!

slide-42
SLIDE 42

Problem #2: Queryability

slide-43
SLIDE 43

Inside a spreadsheet, things are pretty good!

slide-44
SLIDE 44

Formulas! Pivot Tables! Filters!

slide-45
SLIDE 45

What about querying across spreadsheets?

slide-46
SLIDE 46

Get and Transform

slide-47
SLIDE 47

No Mac support.

slide-48
SLIDE 48

Structure changes? Type changes? Column Renames? Have fun re-loading.

slide-49
SLIDE 49

And what about joins?

slide-50
SLIDE 50

There’s VLOOKUP

=VLOOKUP(“Product 1”, Prices!$A$2:$B$9,2,FALSE)

… but, like, eww.

slide-51
SLIDE 51

Data inside a spreadsheet is hard to connect to data outside that spreadsheet.

slide-52
SLIDE 52

Summary:
 Spreadsheets are 
 bad at types and
 hard to query

slide-53
SLIDE 53
  • 1. Spreadsheets are great
  • 2. Spreadsheets are a problem
  • 3. How we can fix it

This talk:

slide-54
SLIDE 54

What about databases?

slide-55
SLIDE 55

Databases are great in ways that spreadsheets aren’t.

slide-56
SLIDE 56

Databases are great at data type definition and enforcement.

slide-57
SLIDE 57

Numeric Monetary Character Binary Date/Time Boolean Enumerated Geometric Network Address Bit String Text Search UUID XML JSON Arrays Composite Range Pseudo-Types

So Many Types of Types!

slide-58
SLIDE 58

Databases are purpose-built for queries and joins.

slide-59
SLIDE 59

BUT

slide-60
SLIDE 60

Databases 
 aren’t as approachable 
 as spreadsheets.

slide-61
SLIDE 61

$ psql -d postgres psql (10.4, server 9.6.9) Type "help" for help. postgres=#

slide-62
SLIDE 62

Databases 
 aren’t as flexible 
 as spreadsheets.

slide-63
SLIDE 63

Databases are good at storing and querying data. 
 
 But that’s it.

slide-64
SLIDE 64

Spreadsheets and databases have complementary skillsets.

slide-65
SLIDE 65

So, what do we do about it?

slide-66
SLIDE 66

How to Solve Your Spreadsheet Problem

slide-67
SLIDE 67
  • 1. Identify the use case.
  • 2. Stop the spread.
  • 3. Backfill.
slide-68
SLIDE 68
  • 1. Identify the use case.
  • 2. Stop the spread.
  • 3. Backfill.
slide-69
SLIDE 69

Every spreadsheet solves a problem. What is that problem?

slide-70
SLIDE 70

What’s the business need? How much data is there? How fast does it change? How frequent are additions?

slide-71
SLIDE 71
  • 1. Identify the use case.
  • 2. Stop the spread.
  • 3. Backfill.
slide-72
SLIDE 72

Give new data a structured home.

slide-73
SLIDE 73
slide-74
SLIDE 74

No custom apps. At least at first.

slide-75
SLIDE 75

Optimize for Speed

slide-76
SLIDE 76
  • 1. Identify the use case.
  • 2. Stop the spread.
  • 3. Backfill.
slide-77
SLIDE 77

It’s time for some Data Wrangling.

(yee-haw )

slide-78
SLIDE 78

Writing one-off scripts is sometimes the best option.

slide-79
SLIDE 79
slide-80
SLIDE 80

https://www.trifacta.com/start-wrangling/

slide-81
SLIDE 81

https://www.trifacta.com/start-wrangling/

Infer wrangle “recipe” from high-level actions.

slide-82
SLIDE 82

Another Option: Programming 
 By Example

slide-83
SLIDE 83

FlashRelate

A B C D E . . . R 1 value year value year Comments 2 Albania 1,000 1950 930 1981 FRA 1 3 Austria 3,139 1951 3,177 1955 FRA 3 4 Belgium 541 1947 601 1950 5 Bulgaria 2,964 1947 3,259 1958 FRA 1 6 Czech . . . 2,416 1950 2,503 1960 NC . . .

(a)

A B C D 1 Albania 1,000 1950 FRA 1 2 Albania 930 1981 FRA 1 . . . 5 Austria 3,139 1951 FRA 3 6 Austria 3,177 1955 FRA 3 . . . 9 Belgium 541 1947 10 Belgium 601 1950 . . .

(b)

Provide example rows, 
 synthesize layout transformations.

https://github.com/microsoft/prose

slide-84
SLIDE 84

Foofah

Provide input/output sample, synthesize 
 layout and syntactic transformations.

https://github.com/umich-dbgroup/foofah

slide-85
SLIDE 85
  • 1. Identify the use case.
  • 2. Stop the spread.
  • 3. Backfill.

How to Solve Your Spreadsheet Problem

slide-86
SLIDE 86

What about 
 the future?

slide-87
SLIDE 87

Spreadsheets aren’t going anywhere, 
 for good reason.

slide-88
SLIDE 88

Learn from the spreadsheet.

slide-89
SLIDE 89

Meet the users where they are.

slide-90
SLIDE 90

Thank you.

@alexras Consulting Inquiries: contact@bitsondisk.com