Data Management Research @ UW Seattle (http://uwdb.io/)



SLIDE 1

Data Management Research @ UW Seattle

uwdb.io

SLIDE 2

Magdalena Balazinska, Dan Suciu, Alvin Cheung

http://uwdb.io/

Research in database systems, theory, and programming languages

~15 students + postdocs

SLIDE 3

Research Areas

Big data processing in the cloud

  • Theory: optimal query processing
  • Systems: Myria, efficient & complex processing at scale, image analytics, DBMS+NN, data summarization

  • Usability: Cloud SLAs, performance tuning, viz analytics

New Types of DBMSs

  • Open World DBMS
  • Image & video DBMS
  • LightDB: VR/AR/MR DBMS

Scientific data management

  • Collaborations with scientists & deep involvement with eScience Institute

Databases and programming languages

  • DBMS & app co-optimization

Probabilistic Databases

Causality

Walter Cai, Jenny Ortiz, Brandon Haynes, Laurel Orr, Leilani Battle

SLIDE 4

Towards Application-Specific Databases

uwdb.io uwplse.org

SLIDE 5
optimizer

executor

DB

SLIDE 6

Specialization

  • Main Memory DB: OLTP
  • SciDB: Scientific Workloads
  • Column Stores: OLAP
  • Storm: Streams
  • SparkSQL: Analytics

Can we generate customized data stores from application code?

SLIDE 7

Application Inefficiencies

  • Code translated to inefficient queries
  • Misplaced computation
  • Redundant data loads
  • Issuing queries with known results
  • Loading unused data
  • Missing indexes

78% of fixes took fewer than 5 lines

Max app speedup: 39x

Application                   # issues   # stars
Discourse (forum)             85         22k
Lobster (forum)               45         1k
Gitlab (collaboration)        23         49k
Redmine (collaboration)       59         13k
Spree (E-commerce)            20         17k
ROR Ecommerce                 11         1.7k
Fulcrum (task mgmt)           2          697
Tracks (task mgmt)            30         3.5k
Diaspora (social network)     57         18k
Onebody (social network)      76         1.2k
Openstreetmap (map)           4          8k
Fallingfruit (map)            16         1.1k
Total                         428

Cong Yan

SLIDE 8

Image Rotate Blur Join Hash Partitioning

SLIDE 9

SEARCH

Proof of translation

Target code

SLIDE 10

Proof of translation

PROGRAM SYNTHESIS

Target code

SLIDE 11

Verified Lifting: Casper

// sequential implementation
int regress(Point[] points) {
  int SumXY = 0;
  for (Point p : points) {
    SumXY += p.x * p.y;
  }
  return SumXY;
}

// lifted MapReduce implementation
void map(Object key, Point[] value) {
  // emit one partial product per point
  for (Point p : value)
    emit("sumxy", p.x * p.y);
}

void reduce(Text key, int[] vs) {
  // sum the partial products emitted under the same key
  int SumXY = 0;
  for (int val : vs)
    SumXY = SumXY + val;
  emit(key, SumXY);
}

Lifted code can be optimized by Hadoop

max 32x speedup

codegen

  1. Define semantics of map and reduce
  2. Synthesizer infers spec from source
  3. Retarget spec to Hadoop

SumXY = reduce(map(points, fm), fr)
fm(x,y) = x * y
fr(v1,v2) = v1 + v2
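The inferred spec can be read as an ordinary map/reduce expression. The following is a minimal Python sketch, not Casper's output: it writes the sequential loop and the lifted form side by side and checks that they agree on one sample input. The `Point` class, the function names, and the sample data are assumptions made for illustration, and `fm` takes a whole point rather than separate `x` and `y` arguments. Agreement on one input is only a sanity check, whereas verified lifting proves equivalence for all inputs.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass
class Point:
    # Hypothetical stand-in for the Point type on the slide.
    x: int
    y: int


def regress_sequential(points):
    """Sequential loop, mirroring the source program."""
    sum_xy = 0
    for p in points:
        sum_xy += p.x * p.y
    return sum_xy


def fm(p):
    """Map function from the spec: fm(x, y) = x * y."""
    return p.x * p.y


def fr(v1, v2):
    """Reduce function from the spec: fr(v1, v2) = v1 + v2."""
    return v1 + v2


def regress_lifted(points):
    """SumXY = reduce(map(points, fm), fr), with 0 as the initial value."""
    return reduce(fr, map(fm, points), 0)


sample = [Point(1, 2), Point(3, 4), Point(5, 6)]
# 1*2 + 3*4 + 5*6 = 44 for both versions.
assert regress_sequential(sample) == regress_lifted(sample) == 44
```

Because `fr` is associative and commutative, the reduce step can be split across workers, which is what makes the lifted form a good fit for a framework such as Hadoop.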

Maaz Ahmad

SLIDE 12

Q1: SELECT ... FROM ... WHERE ...

Q2: SELECT ... FROM ... WHERE ...

Applications: query optimizers, autograders, application caches

∀ D . Q1(D) = Q2(D), or ∃ D . Q1(D) ≠ Q2(D) ?

SLIDE 13

Full decision procedure exists for conjunctive queries

Deciding the equivalence of two arbitrary relational queries is undecidable.

Boris Trakhtenbrot

Simple heuristics can already prove many common cases

SLIDE 14

Constraint Solver (Rosette): finding counterexamples

Proof Assistant (Coq): checking validity of proofs

Cosette

Q1 =?= Q2

Q1 == Q2 Q1 ≠ Q2
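The counterexample side of this check can be illustrated with a brute-force search over small databases. The sketch below is not Cosette's method (Cosette uses a constraint solver and a proof assistant); it simply enumerates every bag of at most `max_rows` rows over a tiny domain and runs both queries in SQLite. The table `r(a, b)` and the three example queries are hypothetical. A result of `None` only means that no counterexample exists within the bound; it is not a proof of equivalence.

```python
import itertools
import sqlite3

# Hypothetical example queries over a single table r(a, b).
Q1 = "SELECT a FROM r WHERE a > 0 AND b > 0"
Q2 = "SELECT a FROM r WHERE b > 0 AND a > 0"           # same as Q1
Q3 = "SELECT DISTINCT a FROM r WHERE a > 0 AND b > 0"  # differs on duplicates


def run(query, rows):
    """Evaluate a query on a fresh in-memory table holding `rows`."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE r (a INTEGER, b INTEGER)")
        conn.executemany("INSERT INTO r VALUES (?, ?)", rows)
        # Sort so the comparison uses bag semantics, ignoring row order.
        return sorted(conn.execute(query).fetchall())
    finally:
        conn.close()


def find_counterexample(q1, q2, domain=(0, 1), max_rows=2):
    """Return a small database D with q1(D) != q2(D), or None if none is found."""
    tuples = list(itertools.product(domain, repeat=2))
    for n in range(max_rows + 1):
        # combinations_with_replacement also yields bags with duplicate rows.
        for rows in itertools.combinations_with_replacement(tuples, n):
            if run(q1, rows) != run(q2, rows):
                return list(rows)
    return None


print(find_counterexample(Q1, Q2))  # None: no counterexample within the bound
print(find_counterexample(Q1, Q3))  # [(1, 1), (1, 1)]: duplicates expose DISTINCT
```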

Shumo Chu, Daniel Li, Nick Anderson

SLIDE 15

[Figure: pipeline diagram. Labels: CNN; HTML Data; Images; Regex; Join; Filter; Generate Training Labels; Train a caption-generating model (Conv, ..., Conv, RNN; Repeat); Output Model]

Many regex and join algorithms to choose from! Likewise for convolution.

SLIDE 16

Cuttlefish: A Lightweight Primitive for Online Tuning

Tomer Kaftan

def loopConvolve(image, filters): ...
def fftConvolve(image, filters): ...
def mmConvolve(image, filters): ...

for image, filters in convolutions:
    start = now()
    result = convolve(image, filters)
    elapsedTime = now() - start
    output result, elapsedTime

SLIDE 17

Cuttlefish: A Lightweight Primitive for Online Tuning

Tomer Kaftan

def loopConvolve(image, filters): ...
def fftConvolve(image, filters): ...
def mmConvolve(image, filters): ...

tuner = Tuner([loopConvolve, fftConvolve, mmConvolve])

for image, filters in convolutions:
    start = now()
    result = convolve(image, filters)
    elapsedTime = now() - start
    output result, elapsedTime

SLIDE 18

Cuttlefish: A Lightweight Primitive for Online Tuning

Tomer Kaftan

def loopConvolve(image, filters): ...
def fftConvolve(image, filters): ...
def mmConvolve(image, filters): ...

tuner = Tuner([loopConvolve, fftConvolve, mmConvolve])

for image, filters in convolutions:
    convolve, token = tuner.choose()
    start = now()
    result = convolve(image, filters)
    elapsedTime = now() - start
    output result, elapsedTime

SLIDE 19

Cuttlefish: A Lightweight Primitive for Online Tuning

Tomer Kaftan

def loopConvolve(image, filters): ...
def fftConvolve(image, filters): ...
def mmConvolve(image, filters): ...

tuner = Tuner([loopConvolve, fftConvolve, mmConvolve])

for image, filters in convolutions:
    convolve, token = tuner.choose()
    start = now()
    result = convolve(image, filters)
    elapsedTime = now() - start
    tuner.observe(token, elapsedTime)
    output result, elapsedTime
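To make the choose/observe loop concrete, here is a minimal runnable sketch of a tuner with the same two-call interface. It is an assumption-laden stand-in, not Cuttlefish's implementation: it uses a simple epsilon-greedy rule over mean observed runtimes, the `epsilon` and `seed` parameters are invented for illustration, and the three convolve functions are placeholders that do not perform real convolution.

```python
import random
import time


class Tuner:
    """Epsilon-greedy stand-in for the tuner interface shown on the slides."""

    def __init__(self, options, epsilon=0.1, seed=None):
        self.options = list(options)
        self.epsilon = epsilon              # hypothetical exploration rate
        self.rng = random.Random(seed)
        self.total_time = [0.0] * len(self.options)
        self.count = [0] * len(self.options)

    def choose(self):
        """Return (option, token); the token identifies the chosen option."""
        untried = [i for i, c in enumerate(self.count) if c == 0]
        if untried:
            index = untried[0]              # try every option at least once
        elif self.rng.random() < self.epsilon:
            index = self.rng.randrange(len(self.options))   # explore
        else:
            # Exploit: pick the option with the lowest mean runtime so far.
            index = min(range(len(self.options)),
                        key=lambda i: self.total_time[i] / self.count[i])
        return self.options[index], index

    def observe(self, token, elapsed_time):
        """Record the runtime measured for the option behind `token`."""
        self.total_time[token] += elapsed_time
        self.count[token] += 1


# Placeholder implementations; real ones would convolve the image.
def loop_convolve(image, filters):
    return sum(p * f for p in image for f in filters)


def fft_convolve(image, filters):
    return sum(image) * sum(filters)


def mm_convolve(image, filters):
    return sum(image) * sum(filters)


tuner = Tuner([loop_convolve, fft_convolve, mm_convolve], seed=0)
convolutions = [([1, 2, 3], [1, 0, -1])] * 20
for image, filters in convolutions:
    convolve, token = tuner.choose()
    start = time.perf_counter()
    result = convolve(image, filters)
    elapsed_time = time.perf_counter() - start
    tuner.observe(token, elapsed_time)
```

Returning a token from `choose` lets `observe` credit the right option even when several choices are in flight at once, which is presumably why the slides thread a token through the loop.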

SLIDE 20

Note: Y-axis is Log-scale

SLIDE 21
SLIDE 22

Input tables

id   date
1    12/25
2    11/21
4    12/24
…    …

Output tables

id   date    max
1    12/25   30
2    11/21   10
4    12/24   20
…    …       …

Search for abstract queries

Prune query skeletons

Instantiate abstract queries

Rank results based on simplicity

Scythe

Chenglong Wang

Stored using specialized data structures
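The instantiate-and-check step can be sketched with a toy enumerator. This is not Scythe's algorithm: it fixes a single query skeleton with one hole (the aggregate function), fills the hole with each candidate, and keeps the queries whose result matches the expected output. The first input table and the expected output come from the slide; the second input table `t2`, the column name `val`, and the candidate list are hypothetical, added only so the example is self-contained.

```python
import sqlite3

# Input table and expected output as shown on the slide.
T1 = [(1, "12/25"), (2, "11/21"), (4, "12/24")]
EXPECTED = [(1, "12/25", 30), (2, "11/21", 10), (4, "12/24", 20)]

# Hypothetical second input table; the slide does not show its contents.
T2 = [(1, 30), (1, 5), (2, 10), (4, 20), (4, 15)]

# One abstract query: the aggregate function is left as a hole.
SKELETON = ("SELECT t1.id, t1.date, {agg}(t2.val) FROM t1 "
            "JOIN t2 ON t1.id = t2.id GROUP BY t1.id, t1.date")
AGGREGATES = ["COUNT", "MIN", "MAX", "SUM"]


def synthesize():
    """Return every instantiation of SKELETON that reproduces EXPECTED."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE t1 (id INTEGER, date TEXT)")
        conn.execute("CREATE TABLE t2 (id INTEGER, val INTEGER)")
        conn.executemany("INSERT INTO t1 VALUES (?, ?)", T1)
        conn.executemany("INSERT INTO t2 VALUES (?, ?)", T2)
        matches = []
        for agg in AGGREGATES:
            query = SKELETON.format(agg=agg)
            # Compare as sorted lists so row order does not matter.
            if sorted(conn.execute(query).fetchall()) == sorted(EXPECTED):
                matches.append(query)
        return matches
    finally:
        conn.close()


for query in synthesize():
    print(query)
```

With this sample data only the `MAX` instantiation matches. A real synthesizer would search over many skeletons and rank the surviving queries, as the slide describes.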

SLIDE 23

Supported features

  • SPJ
  • Grouping
  • Aggregation
  • Subqueries
  • Outer join
  • Exists
  • Union

Scythe

Chenglong Wang

SLIDE 24

Titles summarize the post 80% of the time

Stack Overflow dataset

  • Posts tagged with #sql, #oracle, #database (430k)
  • Posts containing an accepted answer in SQL
  • Results: 41k (title, query) pairs

Filtered away titles

  • My query doesn't work!
  • Why is my query slow?
  • I hate SQL!
SLIDE 25

Srini Iyer

Model               Naturalness   Informativeness
Code-NN (Ours)      2.6           1.55
Nearest neighbor    1.9           1.55
MOSES               1.76          1.36
ATTEN               2.82          0.93

SLIDE 26

UWDB Collaborators

UW

  • Bill Howe (iSchool)
  • Andrew Connolly (Astronomy)
  • Aaron Lee (Ophthalmology)
  • Ariel Rokem (eScience)
  • Emilio Zagheni (Sociology)
  • Prog Lang & SW Eng group

Industry

  • Adobe
  • Huawei
  • Intel
  • Microsoft
  • Teradata