Getting Rid of Data Tova Milo Tel Aviv University The Big Data Era - - PowerPoint PPT Presentation

SLIDE 1

Getting Rid of Data

Tova Milo Tel Aviv University

SLIDE 2

The Big Data Era

From sports, to health care, to the way we drive our cars, or choose how to invest our money,…

Big Data is changing every aspect of our lives.

Tova Milo GETTING RID OF DATA - VLDB’19

SLIDE 3

The Big Data Era

The data-centered revolution is fueled by the masses of data, but at the same time is at a great risk due to the very same information flood.

SLIDE 4

The Big Data Era

Time to stop and rethink the “More Data!” philosophy. The 3 P’s to worry about:

  • Production
  • Privacy
  • Performance

SLIDE 5

Production of Data & Storage

The size of our digital universe grows exponentially.

Forecast [IDC’17]: “By 2025 the global datasphere will grow to 163 zettabytes (trillion giga), ten times the 16.1 ZB of data generated in 2016.”

Updated forecast [IDC’18]: “By 2025 the global datasphere will grow to 175 zettabytes, from the 33 ZB in 2018”

Storage demand is estimated to outstrip production by more than double!
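
As a back-of-the-envelope check on the quoted figures, the sketch below computes the yearly growth rate implied by the updated IDC forecast (33 ZB in 2018, 175 ZB in 2025). The function name is illustrative, and constant compound growth is an assumption made only for this calculation.

```python
def implied_annual_growth(start_zb: float, end_zb: float, years: int) -> float:
    """Constant yearly growth rate that turns start_zb into end_zb over `years` years."""
    return (end_zb / start_zb) ** (1.0 / years) - 1.0


# Figures taken from the IDC'18 quote above.
rate = implied_annual_growth(33.0, 175.0, 2025 - 2018)
print(f"Implied growth: {rate:.1%} per year")
```

Under that assumption the datasphere grows by a bit more than a quarter every year.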

SLIDE 6

Data Size

SLIDE 7

How Much is 175 ZB?

“If one were able to store 175ZB onto BluRay discs, then you’d have a stack of discs that can get you to the moon 23 times…”

“Even if you could download 175ZB on today’s largest hard drive it would take 12.5 billion drives (and as an industry, we ship a fraction of that today.)”
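
The hard-drive quote can be sanity-checked with simple arithmetic. The sketch below derives the drive capacity the quote implies, assuming decimal units (1 ZB = 10^21 bytes, 1 TB = 10^12 bytes); the variable names are illustrative.

```python
ZB = 10**21  # bytes per zettabyte (decimal definition, an assumption)
TB = 10**12  # bytes per terabyte

total_bytes = 175 * ZB
drives = 12.5e9  # "12.5 billion drives" from the quote

# Capacity each drive must have for 12.5 billion of them to hold 175 ZB.
capacity_tb = total_bytes / drives / TB
print(f"Implied capacity per drive: {capacity_tb:.0f} TB")
```

The quote is therefore consistent with drives of about 14 TB each.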

SLIDE 8

Storage Production

SLIDE 9

Data vs. Storage


5 ZB

SLIDE 10

Performance

Handling exponentially growing data incurs a substantial maintenance and processing overhead

  • data cleaning,
  • validation,
  • enhancement,
  • analysis,…

Selective data management is key to performance!

SLIDE 11

Let’s Think Energy…

SLIDE 13

Energy Optimization?

Over the last few years:

  • Development of better ways to cool data centers
  • Recycling the waste heat
  • Streamlining computing processes
  • Switching to renewable energy

Still, even in the best-case predictions, if we don’t learn how to dispose of data we’ll stay at the same consumption level (which is already high)

SLIDE 14

Privacy and Security

Even if we disregard storage and performance constraints, uncontrolled data retention endangers privacy & security

  • EU Data Protection Regulation (GDPR).
  • Sarbanes-Oxley, Gramm-Leach-Bliley, the Fair and Accurate Credit Transactions Act, HIPAA,…

Data disposal/retention policies must be systematically developed and enforced to benefit and protect organizations and individuals.

SLIDE 15

Before we continue, 4 important notes

1) Not all data is important!
2) People fear losing potentially important data
3) Already now, sometimes there is really no choice
4) Like most good ideas, we are not the first to think about this …

SLIDE 16

Before we continue, 4 important notes

1) Not all data is important!
2) People fear losing potentially important data
3) Already now, sometimes there is really no choice
4) Like most good ideas, we are not the first to think about this …

Martin Kersten, "The Wildest Idea" Award, CIDR’15 Gong Show, for "Big Data Space Fungus"

SLIDE 17

Big Data Space Fungus


[CIDR’15]

SLIDE 20

The Data Disposal Challenge

Retaining the knowledge hidden in the data while respecting storage, processing and regulatory constraints

  • Determine an optimal disposal policy (which data to retain, summarize, dispose of) and execute it efficiently
  • Support full-cycle information processing over the partial data
  • Incrementally maintain the partial data as new info comes in

SLIDE 21

The 7 Criteria for Disposing Data

  • What makes a piece of data important?
  • How does importance change over time?
  • Which of the data is important?
  • Which data can (or must) be retained/disposed of? When?
  • What is the cost of retaining / disposing of the data?
  • How can data be summarized / disposed of?
  • How to process the partial data?

SLIDE 22

The Rest of This Talk

  • 1. Existing tools (and why they are not enough)
  • 2. Understanding the past (provenance)
  • 3. Predicting the future (Deep Reinforcement Learning)

SLIDE 23

(Very) Incomplete List

Deduplication

  • Entity resolution

(Semantic) compression & summarization

  • Relations
  • Semi-structured (XML, RDF, graph)
  • Unstructured (text)

Sampling

  • Approximate Query Processing

Sketching

  • Streams

Machine Learning

  • Dimensionality reduction
  • Clustering
  • Feature selection

SLIDE 24

Example 1: Relations

[Jagadish, Ng, Ooi, Tung, ICDE'04]

Back to the late 90’s…

SLIDE 25

Example 2: Graphs

[Song, Wu, Lin, Dong, Sun, TKDE’18]

SLIDE 26

Example 3: Sampling for AQP

Approximate query answers, at a fraction of full execution cost

  • In query-time sampling, the query is evaluated over samples taken from the database at run time.
  • For a sharper reduction in response time, draw samples from the data in a pre-processing step.

Question 1: Sample also from the data summaries?

Question 2: Use the precomputed samples as data summaries, thereby allowing some (or all) of the remaining items to be discarded?

[Chaudhuri, Ding, Kandula, SIGMOD’17]
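
To make the pre-processing variant concrete, here is a minimal sketch of sample-based approximate query answering. It assumes a uniform random sample and a simple scale-up estimator; the function names, the toy table and the 1% sampling fraction are all illustrative and not taken from the cited work.

```python
import random


def build_sample(rows, fraction, seed=0):
    """Pre-processing step: draw a uniform random sample once, ahead of query time."""
    rng = random.Random(seed)
    k = max(1, int(len(rows) * fraction))
    # Return the sample together with the factor needed to scale estimates back up.
    return rng.sample(rows, k), len(rows) / k


def approx_sum(sample, scale, predicate, value):
    """Estimate SUM(value) over rows matching `predicate`, using the sample only."""
    return scale * sum(value(r) for r in sample if predicate(r))


# Toy table: 100,000 rows, half of them in region "EU".
rows = [{"region": "EU" if i % 2 else "US", "amount": i % 100} for i in range(100_000)]
sample, scale = build_sample(rows, fraction=0.01)

exact = sum(r["amount"] for r in rows if r["region"] == "EU")
estimate = approx_sum(sample, scale, lambda r: r["region"] == "EU", lambda r: r["amount"])
print(f"exact={exact}, estimate={estimate:.0f}, sample size={len(sample)}")
```

Question 2 on the slide corresponds to keeping only `sample` and `scale` and discarding `rows`; the price is that every later answer carries sampling error.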

SLIDE 27

Common Objectives

Summary properties

  • Conciseness
  • Diversification
  • Coverage

Accuracy w.r.t. query results

  • Concrete queries
  • Query class/workload
  • Information loss

[Orr, Suciu, Balazinska, VLDB’17]

SLIDE 28

But in Practice…

Workloads are far more complex (cleaning, transformation, integration, ML,…)

SLIDE 29

But in Practice…

Workloads are far more complex (cleaning, transformation, integration, ML,…)

Need to understand how data is manipulated, summarized, and disposed of throughout the entire workload!

SLIDE 31

Data Provenance

  • Tracks computation and reveals the “origin” of results
  • Many different models with different granularities
  • Can be a key for performing & understanding data reduction
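
As a toy illustration of the first and last bullets, the sketch below attaches lineage (the set of input tuple ids that contributed) to each result of a small join-and-project query. The relations, ids and function name are hypothetical, and lineage is only one of the many provenance models mentioned above.

```python
from collections import defaultdict


def cities_with_orders(customers, orders):
    """Return {city: lineage}, where lineage is the set of input tuple ids behind that result."""
    lineage = defaultdict(set)
    for cid, cust in customers.items():
        for oid, order in orders.items():
            if order["customer"] == cust["name"]:
                # The output tuple `city` was derived from these two input tuples.
                lineage[cust["city"]].update({cid, oid})
    return dict(lineage)


customers = {
    "c1": {"name": "Ann", "city": "Paris"},
    "c2": {"name": "Bob", "city": "Paris"},
    "c3": {"name": "Eve", "city": "Rome"},
}
orders = {
    "o1": {"customer": "Ann", "item": "book"},
    "o2": {"customer": "Bob", "item": "pen"},
}

result = cities_with_orders(customers, orders)
used = set().union(*result.values())
unused = (set(customers) | set(orders)) - used
print(result, unused)
```

Here `unused` contains only `c3`: with respect to this one query, that tuple never influences the output, which is the kind of signal a data reduction step could exploit.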

SLIDE 32

Provenance by Example

SLIDE 33

Lineage

SLIDE 34

Provenance Polynomials

SLIDE 36

Workflow Provenance

SLIDE 37

Many Applications

  • Results Explanation
  • Hypothetical reasoning
  • Trust level assessment
  • Computation in presence of incomplete/probabilistic info.
  • Data reduction [Gershtein, M, Novgorodov, CIKM’19]

SLIDE 38

But…

Provenance is HUGE

SLIDE 39

Provenance Reduction

Lossless

  • Size reduction via expression simplification/factorization (e.g. using Boolean circuits)

Lossy

  • Selective provenance
  • Compression via abstraction

SLIDE 40

Example: Compression by Abstraction

[Deutch, Moskovitch, Rinetzky, SIGMOD’19]

SLIDE 44

Optimization Problem

  • Choose a cut in the ontology that maximizes expressiveness for a target compression ratio

  • NP-hard in general
  • Polynomial time complexity for a single ontology
  • Practically appealing heuristics for the general case

SLIDE 46

Learn what may be interesting in a new dataset

Exploratory data analysis (EDA): The process of examining & investigating a given dataset

SLIDE 47

Exploratory Data Analysis

EDA is an iterative process:

  • A user u loads a dataset D into an analysis interface.
  • Performs a sequence S_u(D) = q_1, q_2, …, q_n of actions (e.g. queries)
  • After executing q_i, the user examines the results and decides if and which action to perform next.

The goal:

  • Understand the nature of the dataset
  • Discover its properties
  • Estimate its quality
  • Figure out what may be interesting in it

Modern analysis platforms (e.g. Splunk, Kibana-ELK, Tableau, …)

SLIDE 48

EDA agent

Can we teach a machine to generate a coherent, meaningful sequence of exploratory queries?

SLIDE 49

Deep Reinforcement Learning

DRL works surprisingly well for very difficult tasks:

  • Play Go
  • Drive a car
  • Conduct natural language dialogs
  • ……

SLIDE 50

Can/Should we use DRL?

PROS:

  • It requires NO training data OR traces of user activity
  • Once trained - results can be obtained rather FAST.

CONS:

  • It is a heavy-weight tool, requires lots of computing power.
  • Currently works mostly on game-like environments
  • Even when working, it may just overfit to some odd patterns in the data

SLIDE 51

The Rest of This Talk

  • 1. Quick recap of standard RL settings
  • 2. Requirements for RL-EDA environment
  • 3. Our framework (ongoing work)

SLIDE 52

RL Standard Settings

In the (not so simple) Atari environment:

  • 1. Agent observes a “state” from an “environment”
  • 2. Agent selects an “action”
  • 3. Agent receives a “reward”
  • 4. Agent learns (unsupervised) a “policy” that maximizes the mean reward
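
The four steps can be sketched as a small observe/act/reward/learn loop. This is a minimal illustration under strong assumptions: a hypothetical two-state environment stands in for Atari, and a table of running reward averages with epsilon-greedy action selection stands in for a deep network. All names are made up for the example.

```python
import random


class ToyEnvironment:
    """Hypothetical two-state environment: the rewarded action is the one that differs from the state."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = 0

    def observe(self):
        return self.state

    def step(self, action):
        reward = 1.0 if action != self.state else 0.0
        self.state = self.rng.randint(0, 1)  # move to a random next state
        return reward


def train(env, steps=2000, epsilon=0.1, lr=0.1, seed=1):
    rng = random.Random(seed)
    actions = (0, 1)
    q = {(s, a): 0.0 for s in (0, 1) for a in actions}  # estimated mean reward per (state, action)
    for _ in range(steps):
        s = env.observe()                                   # 1. observe a state
        if rng.random() < epsilon:                          # 2. select an action:
            a = rng.choice(actions)                         #    occasionally explore,
        else:
            a = max(actions, key=lambda act: q[(s, act)])   #    otherwise exploit
        r = env.step(a)                                     # 3. receive a reward
        q[(s, a)] += lr * (r - q[(s, a)])                   # 4. move the estimate toward the reward
    # The learned policy picks, per state, the action with the highest estimated reward.
    return {s: max(actions, key=lambda act: q[(s, act)]) for s in (0, 1)}


policy = train(ToyEnvironment())
print(policy)
```

In a deep RL setting the table `q` is replaced by a neural network, but the loop keeps the same shape.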

SLIDE 53

RL-EDA Settings


Utilizing the RL paradigm for EDA:

  • 1. Agent observes a dataset/results set
  • 2. Agent formulates a query
  • 3. Agent receives reward
  • 4. Agent learns to maximize the reward

SLIDE 54

Outline for an RL-EDA Framework

  • 1. RL-EDA environment
  • 2. State and action representation
  • 3. Reward Signal
  • 4. Agent NN-Architecture

SLIDE 56

RL-EDA Environment

RL-EDA environment comprises:

  • (1) A collection of datasets
  • (2) Query interface

RL-EDA Episode:

  • The agent is “given” an arbitrary dataset
  • The agent performs a “session” (sequence) of N queries.

SLIDE 58

State Representation

Result displays are often large and complex… → Summarize the results display into a numeric vector

  • Structural features of the data: value entropy, # of distinct values, # of null values
  • Grouping/aggregation features: # of groups, group size variance, aggregated values, entropy,…
  • Context: N previous displays
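
A minimal sketch of the structural part of such a vector is shown below. It computes, per column, the value entropy, the number of distinct values and the number of nulls, then concatenates them. The function names and the toy display are assumptions; the grouping/aggregation features and the context of previous displays are left out for brevity.

```python
import math
from collections import Counter


def column_features(values):
    """Return [value entropy in bits, # distinct values, # nulls] for one column."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    total = len(non_null)
    entropy = 0.0
    if total:
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return [entropy, float(len(counts)), float(len(values) - total)]


def display_vector(rows, columns):
    """Concatenate the per-column features into one numeric vector describing the display."""
    vec = []
    for col in columns:
        vec.extend(column_features([row.get(col) for row in rows]))
    return vec


# Hypothetical results display with two columns.
rows = [
    {"proto": "tcp", "port": 80},
    {"proto": "tcp", "port": 443},
    {"proto": "udp", "port": None},
    {"proto": "udp", "port": 53},
]
vec = display_vector(rows, ["proto", "port"])
print(vec)
```

The resulting vector has a fixed length for a fixed schema, which is what lets it serve as the observation fed to the agent.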

SLIDE 60

Action Representation

Parameterized Actions (action type + parameters)

  • FILTER(attr, op, term) - used to select data tuples that match a criterion
  • GROUP(attr, agg func, agg attr) - groups and aggregates the data
  • BACK() - allows the agent to backtrack to a previous display

Our Representation

  • [action_type, attr, op, term, agg_func, agg_attr]
  • Handle filter terms using the frequency of appearances in the display

Issue: large action domain
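
One way to picture the six-slot representation is the sketch below, which encodes each action as a fixed-length list of indices. The vocabularies, the `-1` marker for unused slots, and the frequency binning of filter terms are all assumptions made for illustration; the slides only state that filter terms are handled through their frequency of appearance in the display.

```python
ACTION_TYPES = ["BACK", "FILTER", "GROUP"]
ATTRS = ["proto", "port", "length"]   # hypothetical dataset schema
OPS = ["eq", "neq", "contains"]       # hypothetical filter operators
AGG_FUNCS = ["count", "sum", "avg"]   # hypothetical aggregation functions
UNUSED = -1                           # marks slots that do not apply to an action type


def term_frequency_bin(display_values, term, n_bins=10):
    """Replace a concrete filter term by a bin of its relative frequency in the current display."""
    freq = display_values.count(term) / len(display_values)
    return min(int(freq * n_bins), n_bins - 1)


def encode(action_type, attr=None, op=None, term_bin=None, agg_func=None, agg_attr=None):
    """Encode an action as [action_type, attr, op, term, agg_func, agg_attr]."""
    def idx(vocab, item):
        return UNUSED if item is None else vocab.index(item)

    return [
        ACTION_TYPES.index(action_type),
        idx(ATTRS, attr),
        idx(OPS, op),
        UNUSED if term_bin is None else term_bin,
        idx(AGG_FUNCS, agg_func),
        idx(ATTRS, agg_attr),
    ]


display = ["tcp", "tcp", "udp", "tcp"]  # values of `proto` in the current display
filter_vec = encode("FILTER", attr="proto", op="eq", term_bin=term_frequency_bin(display, "udp"))
group_vec = encode("GROUP", attr="proto", agg_func="count", agg_attr="length")
back_vec = encode("BACK")
print(filter_vec, group_vec, back_vec)
```

Binning terms by frequency keeps the term slot small regardless of how many distinct values a column holds, which is one way to tame the large action domain.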

SLIDE 62

Reward Signal

Given a sequence S_D = q_1, q_2, …, q_n of queries performed by the agent on dataset D, how do we determine the reward R(S_D)? We suggest three major components.

  • 1. Interestingness: Actions inducing interesting result sets should be encouraged
  • 2. Diversity: Actions in the same session should yield diverse results describing different aspects of the dataset
  • 3. Coherency: The session is understandable to human analysts
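
The slides do not say how the three components are combined into R(S_D). Purely as an assumption, the sketch below uses a weighted sum of per-query scores; the function name, the weights and the example scores are hypothetical.

```python
def session_reward(interestingness, diversity, coherency, weights=(1.0, 1.0, 1.0)):
    """Combine per-query component scores into one session reward (weighted sum, an assumption)."""
    w_i, w_d, w_c = weights
    return sum(
        w_i * i + w_d * d + w_c * c
        for i, d, c in zip(interestingness, diversity, coherency)
    )


# Hypothetical scores for a two-query session.
reward = session_reward(
    interestingness=[0.5, 0.2],
    diversity=[1.0, 0.0],
    coherency=[1.0, 1.0],
)
print(reward)
```

The weights give a single place to trade the components off against each other.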

SLIDE 63

Interestingness

A multitude of interestingness measures has been suggested in previous work. Each captures a different aspect of interestingness:

Diversity

Measures how much the elements of a data pattern differ from one another

Peculiarity

Measures how anomalous a pattern is compared to the rest of the data patterns

Conciseness

Measures the size of the pattern compared to its coverage

Novelty

Measures how unexpected a data pattern is w.r.t. prior knowledge

SLIDE 64

Diversity

Goal: encourage the agent to choose actions inducing new observations of different parts of the data than those examined so far

Solution: calculate the Euclidean distances between the observation vector of the current results display and the vectors of all previous displays
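
A minimal sketch of this computation follows. The slide specifies the Euclidean distances but not how they are aggregated into one score, so taking the minimum distance (the distance to the most similar earlier display) is an assumption, as is the function name.

```python
import math


def diversity_reward(current, previous):
    """Distance from the current observation vector to its nearest previous one.

    Aggregating with `min` is an assumption; the slides only mention Euclidean distances.
    """
    if not previous:
        return 0.0  # the first display has nothing to be compared with
    return min(math.dist(current, p) for p in previous)


previous_displays = [[0.0, 0.0], [3.0, 4.0]]
repeat_score = diversity_reward([0.0, 0.0], previous_displays)  # same as an earlier display
novel_score = diversity_reward([6.0, 8.0], previous_displays)   # far from every earlier display
print(repeat_score, novel_score)
```

With this aggregation, revisiting an earlier display scores zero, while a display unlike all earlier ones scores high.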

SLIDE 65

Coherency

Performed using an external classifier:

  • 1. Given the dataset schema & application domain, we use a set of heuristic classification rules composed by domain experts (e.g. “a group-by that is employed on more than 4 attributes is non-coherent”)
  • 2. Then employ Snorkel to build a weak-supervision based classifier
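
The sketch below shows what such heuristic rules might look like as plain Python functions. Only the first rule comes from the slide, and reading it as "more than 4 distinct grouped attributes in a session" is an interpretation. The other two rules are hypothetical, and the majority vote is a simple stand-in for the Snorkel step, not Snorkel's API.

```python
COHERENT, NON_COHERENT, ABSTAIN = 1, 0, -1


def rule_too_many_group_attrs(session):
    """Slide example: a group-by employed on more than 4 attributes is non-coherent."""
    grouped = {a["attr"] for a in session if a["type"] == "GROUP"}
    return NON_COHERENT if len(grouped) > 4 else ABSTAIN


def rule_repeated_action(session):
    """Hypothetical rule: repeating the exact same action back to back is non-coherent."""
    repeated = any(a == b for a, b in zip(session, session[1:]))
    return NON_COHERENT if repeated else ABSTAIN


def rule_filter_then_group(session):
    """Hypothetical rule: narrowing the data and then aggregating it is coherent."""
    pairs = zip(session, session[1:])
    found = any(a["type"] == "FILTER" and b["type"] == "GROUP" for a, b in pairs)
    return COHERENT if found else ABSTAIN


def vote(session, rules):
    """Majority vote over the rules that do not abstain (ties count as non-coherent)."""
    votes = [v for v in (rule(session) for rule in rules) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return COHERENT if sum(votes) * 2 > len(votes) else NON_COHERENT


RULES = [rule_too_many_group_attrs, rule_repeated_action, rule_filter_then_group]

wide_grouping = [{"type": "GROUP", "attr": name} for name in "abcde"]
filter_group = [{"type": "FILTER", "attr": "proto"}, {"type": "GROUP", "attr": "port"}]
print(vote(wide_grouping, RULES), vote(filter_group, RULES))
```

Rules that may abstain are the usual input shape for weak supervision: each rule covers only the sessions it has an opinion on, and a learned model then reconciles their noisy votes.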

SLIDE 67

Challenges

  • Large # of actions (in particular due to the Filter parameter)
  • Exploration challenges: imbalanced action types (BACK, GROUP, FILTER)

Our solution: parameterized softmax with pre-output layer

SLIDE 68

A few words about experimental evaluation

  • 1. Learning curves and reward
  • 2. Competitors: Greedy, Recommender systems, Human…
  • 3. Measures: BLEU, session similarity

“Turing test”

SLIDE 69

Time to Conclude…

SLIDE 70

Time to Conclude…

The Data Disposal Challenge

  • Determine an optimal disposal policy (which data to retain, summarize, dispose of) and execute it efficiently

  • Support full-cycle information processing over the partial data
  • Incrementally maintain the partial data as new info comes in

Define formally what makes a disposal policy good…

SLIDE 71

Time to Conclude…

  • 1. Plenty of relevant tools
  • 2. But still very far from a comprehensive solution
  • 3. ML agents: Still a lot to do here!
  • Support more data analysis actions
  • Adaptive disposal policies based on user interaction
  • Consider potential data exploration goals

SLIDE 72

Thank You


Ori Bar-El, Naama Boer, Daniel Deutch, Shay Gershtein, Amir Gilad, Gefen Keinan, Nave Frost, Yuval Moskovitch, Slava Novgorodov, Kathy Razmadze, Noam Rinetzky, Amit Somech, Brit Youngmann, …