Getting Rid of Data
Tova Milo Tel Aviv University
The Big Data Era
From sports, to health care, to the way we drive our cars, or choose how to invest our money, Big Data is changing every aspect of our lives.
Tova Milo GETTING RID OF DATA - VLDB’19
The data-centered revolution is fueled by the masses of data, but at the same time is at a great risk due to the very same information flood.
Time to stop and rethink the “More Data!” philosophy. The 3 P’s to worry about:
Production Privacy Performance
The size of our digital universe grows exponentially
Forecast [IDC’17]: “By 2025 the global datasphere will grow to 163 zettabytes (trillion giga), ten times the 16.1 ZB of data generated in 2016.”
Updated forecast [IDC’18]: “By 2025 the global datasphere will grow to 175 zettabytes, from the 33 ZB in 2018”
Storage demand is estimated to outstrip production by more than double!
“If one were able to store 175ZB onto BluRay discs, then you’d have a stack of discs that can get you to the moon 23 times…”
“Even if you could download 175ZB on today’s largest hard drive it would take 12.5 billion drives (and as an industry, we ship a fraction of that today.)”
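The quoted figures can be sanity-checked with a few lines of arithmetic. The sketch below assumes decimal units (1 ZB = 10^21 bytes, 1 TB = 10^12 bytes) and derives the drive capacity implied by the quote, plus the annual growth rate implied by the IDC’18 forecast. All variable names are illustrative and not from the talk.

```python
ZB = 10**21  # bytes per zettabyte (decimal units, an assumption)
TB = 10**12  # bytes per terabyte (decimal units, an assumption)

datasphere_2025 = 175 * ZB  # IDC'18 forecast for 2025
drives = 12.5e9             # number of drives quoted on the slide

# Capacity each drive would need in order to hold the whole datasphere.
tb_per_drive = datasphere_2025 / drives / TB

# Compound annual growth rate implied by 33 ZB (2018) -> 175 ZB (2025).
years = 2025 - 2018
annual_growth = (175 / 33) ** (1 / years) - 1

print(f"{tb_per_drive:.1f} TB per drive, {annual_growth:.1%} growth per year")
```

Under these assumptions the quote corresponds to drives of about 14 TB each, and the forecast to roughly 27% growth per year.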
5 ZB
Handling exponentially growing data incurs a substantial maintenance and processing overhead
Selective data management is key to performance!
Over the last few years:
Still, even in the best-case predictions, if we don’t learn how to dispose of data we’ll stay at the same consumption level (which is already high)
Even if we disregard storage and performance constraints, uncontrolled data retention endangers privacy & security
Credit Transactions Act, HIPAA,…
Data disposal/retention policies must be systematically developed and enforced to benefit and protect organizations and individuals.
1) Not all data is important!
2) People fear losing potentially important data
3) Already now, sometimes there is really no choice
4) Like most good ideas, we are not the first to think about this …
Martin Kersten, "The Wildest Idea" Award, CIDR’15 Gong Show, for "Big Data Space Fungus"
[CIDR’15]
Retaining the knowledge hidden in the data while respecting storage, processing and regulatory constraints
summarize, dispose of) and execute it efficiently
(and why they are not enough)
(provenance)
(Deep Reinforcement Learning)
Deduplication
(Semantic) compression & summarization
Sampling
Sketching
Machine Learning
[Jagadish, Ng, Ooi, Tung, ICDE'04]
Back to the late 90’s…
[Song, Wu, Lin, Dong, Sun, TKDE’18]
Approximate query answers, at a fraction of full execution cost
taken from the database at run time.
data in a pre-processing step
Question 1: Sample also from the data summaries?
Question 2: Use the precomputed samples as data summaries, thereby allowing us to discard some (or all) of the remaining items?
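To make the idea of precomputed samples concrete, here is a minimal sketch of sampling-based approximate query answering. It assumes a uniform Bernoulli sample drawn once in a pre-processing step and a SUM query answered by scaling the sample sum. The function names `build_sample` and `approx_sum` and the synthetic data are illustrative assumptions, not taken from the cited work.

```python
import random

def build_sample(rows, fraction, seed=0):
    """Pre-processing step: keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fraction]

def approx_sum(sample, fraction, value, predicate=lambda r: True):
    """Estimate SUM(value) over matching rows by scaling the sample sum by 1/fraction."""
    return sum(value(r) for r in sample if predicate(r)) / fraction

# Synthetic table (hypothetical data, for illustration only).
rows = [{"id": i, "amount": i % 100} for i in range(100_000)]

fraction = 0.01
sample = build_sample(rows, fraction)

exact = sum(r["amount"] for r in rows)
estimate = approx_sum(sample, fraction, lambda r: r["amount"])
relative_error = abs(estimate - exact) / exact
```

Once the sample exists, the query touches about 1% of the rows. Whether the remaining rows can then be discarded depends on which queries must still be answered and how accurately, which is the open issue raised in Question 2.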
[Chaudhuri, Ding, Kandula, SIGMOD’17]
Summary properties
Accuracy w.r.t query results
[Orr, Suciu, Balazinska, VLDB’17]
Workloads are far more complex (cleaning, transformation, integration, ML,…)
Need to understand how data is manipulated, summarized, and disposed of throughout the entire workload!
Provenance is HUGE
Lossless
(e.g. using Boolean circuits)
Lossy
[Deutch, Moskovitch, Rinetzky SIGMOD’19]
compression ratio
Expressiveness Size
Exploratory data analysis (EDA): The process of examining & investigating a given dataset
EDA is an iterative process:
which action to perform next. The goal:
Modern analysis platforms (e.g. Splunk, Kibana-ELK, Tableau, …)
Can we teach a machine to generate a coherent, meaningful sequence of exploratory queries?
DRL works surprisingly well for very difficult tasks:
PROS:
CONS:
patterns in the data
In the (not so simple) Atari environment:
from an “environment”
a “policy” that maximizes the mean reward
Utilizing the RL paradigm for EDA:
1. RL-EDA environment
RL-EDA environment comprises:
(1) A collection of datasets
(2) Query interface
RL-EDA Episode:
The agent is “given” an arbitrary dataset
The agent performs a “session” (sequence) of N queries.
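The environment and episode described above can be sketched as a small class. This is a toy illustration under stated assumptions, not the system from the talk: the class name `EDAEnv`, the tuple encoding of actions, and the sample data are hypothetical. The action types FILTER, GROUP and BACK are the ones named later in the talk. The reward is deliberately left out here.

```python
import random
from collections import Counter

class EDAEnv:
    """Toy environment: a collection of datasets plus a tiny query interface."""

    def __init__(self, datasets, session_length, seed=0):
        self.datasets = datasets              # each dataset is a list of dict rows
        self.session_length = session_length  # N queries per episode
        self.rng = random.Random(seed)
        self.history = []
        self.steps = 0

    def reset(self):
        # The agent is "given" an arbitrary dataset from the collection.
        self.history = [list(self.rng.choice(self.datasets))]
        self.steps = 0
        return self.history[-1]

    def step(self, action):
        rows = self.history[-1]
        kind = action[0]
        if kind == "FILTER":    # ("FILTER", column, value)
            _, column, value = action
            rows = [r for r in rows if r.get(column) == value]
            self.history.append(rows)
            display = rows
        elif kind == "GROUP":   # ("GROUP", column): count rows per group
            _, column = action
            display = dict(Counter(r.get(column) for r in rows))
        elif kind == "BACK":    # ("BACK",): undo the most recent filter, if any
            if len(self.history) > 1:
                self.history.pop()
            display = self.history[-1]
        else:
            raise ValueError(f"unknown action type: {kind}")
        self.steps += 1
        done = self.steps >= self.session_length
        return display, done

# A three-query session on a hypothetical dataset.
packets = [
    {"protocol": "tcp", "port": 80},
    {"protocol": "tcp", "port": 443},
    {"protocol": "udp", "port": 53},
]
env = EDAEnv([packets], session_length=3)
env.reset()
filtered, _ = env.step(("FILTER", "protocol", "tcp"))
groups, _ = env.step(("GROUP", "port"))
restored, done = env.step(("BACK",))
```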
Result displays are often large and complex… → Summarize the results display into a numeric vector
Value entropy, # of distinct values, # of Null values
# of groups, groups size variance, aggr. values, entropy,…
N previous displays
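A minimal sketch of such a display summary follows. It computes, per column, the value entropy and the numbers of distinct and null values, and, for a grouped display, the number of groups and the variance of group sizes. The function names and the exact feature order are assumptions made for illustration. Aggregated values and the vectors of the N previous displays are omitted.

```python
import math
from collections import Counter
from statistics import pvariance

def column_features(values):
    """[value entropy in bits, number of distinct values, number of null values]."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    total = len(non_null)
    entropy = 0.0
    if total:
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return [entropy, float(len(counts)), float(len(values) - total)]

def grouping_features(group_sizes):
    """[number of groups, variance of group sizes]; zeros when nothing is grouped."""
    if not group_sizes:
        return [0.0, 0.0]
    return [float(len(group_sizes)), float(pvariance(group_sizes))]

def display_vector(rows, columns, group_sizes=()):
    """Concatenate per-column features and grouping features into one numeric vector."""
    vector = []
    for column in columns:
        vector.extend(column_features([r.get(column) for r in rows]))
    vector.extend(grouping_features(list(group_sizes)))
    return vector

# Hypothetical display: four rows, grouped into two groups of size 2.
rows = [
    {"protocol": "tcp", "port": 80},
    {"protocol": "tcp", "port": None},
    {"protocol": "udp", "port": 53},
    {"protocol": "udp", "port": 53},
]
vector = display_vector(rows, ["protocol", "port"], group_sizes=[2, 2])
```

The resulting vector has a fixed length for a fixed schema, whatever the size of the display, which is what makes it usable as an observation for a learning agent.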
Parameterized Actions (action type + parameters)
Our Representation
Issue: large action domain
Given a sequence S_D = q_1, q_2, …, q_n of queries performed by the agent on dataset D, how should the reward R(S_D) be determined? We suggest three major components.
should be encouraged
results describing different aspects of the dataset
A multitude of interestingness measures have been suggested in previous work. Each captures a different aspect of interestingness:
Diversity
Measures how much the elements of a data pattern differ from one another
Peculiarity
Measures how anomalous a pattern is compared to the rest of the data patterns
Conciseness
Measures the size of the pattern compared to its coverage
Novelty
Measures how unexpected a data pattern is w.r.t. prior knowledge
Goal: encourage the agent to choose actions inducing new
so far Solution: calculate the Euclidean distances between the
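One way to read this distance-based solution is sketched below. The slide text is truncated, so the details are assumptions: the distances are taken between the numeric vector of the current display and the vectors of all earlier displays in the session, and the reward is the minimum of those distances, so that a display close to any earlier one earns little. The function name is hypothetical.

```python
import math

def diversity_reward(current, previous):
    """Minimum Euclidean distance from the current display vector to any earlier one."""
    if not previous:
        return 0.0  # nothing to compare against on the first query
    return min(math.dist(current, p) for p in previous)

seen = [[3.0, 4.0], [6.0, 8.0]]
novel = diversity_reward([0.0, 0.0], seen)     # far from everything seen so far
repeated = diversity_reward([3.0, 4.0], seen)  # identical to an earlier display
```

Using the minimum rather than the mean is a design choice of this sketch: it penalizes returning to any single earlier display, even when the session as a whole is varied.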
Performed using an external classifier:
(e.g. “a group-by that is employed on more than 4 attributes is non-coherent”)
classifier
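The example rule quoted above can be written as a small predicate over a session. The sketch below is a hedged illustration only: it encodes that single rule and combines rules by a plain "any rule fires" check, whereas how the talk's classifier is actually built from such rules is not stated in the surviving text. The query encoding and function names are hypothetical.

```python
def too_many_group_attributes(session, limit=4):
    """Rule from the slide: a group-by on more than 4 attributes is non-coherent."""
    return any(
        q["type"] == "GROUP" and len(q["attributes"]) > limit
        for q in session
    )

def is_coherent(session, rules=(too_many_group_attributes,)):
    """A session counts as coherent here only if no rule flags it."""
    return not any(rule(session) for rule in rules)

good_session = [
    {"type": "FILTER", "column": "protocol", "value": "tcp"},
    {"type": "GROUP", "attributes": ["src_ip", "port"]},
]
bad_session = [
    {"type": "GROUP", "attributes": ["a", "b", "c", "d", "e"]},
]
```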
Large # of actions (in particular due to the Filter parameter)
Exploration challenges: imbalanced action types (BACK, GROUP, FILTER)
Our solution: parameterized softmax with pre-output layer
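The following sketch shows one plausible reading of a parameterized softmax, offered as an assumption rather than the talk's actual architecture: the pre-output layer holds one score per action type and one score per parameter value, a separate softmax is applied to each segment, and the action is assembled from one sample per segment. The segment names and values are hypothetical.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical segments of the pre-output layer: action type first, then parameters.
SEGMENTS = {
    "type": ["BACK", "GROUP", "FILTER"],
    "column": ["src_ip", "dst_ip", "protocol"],
    "filter_term": ["tcp", "udp", "10.0.0.1"],
}

def sample_action(pre_output, rng):
    """Apply one softmax per segment, sample from each, and assemble the action."""
    choice = {}
    offset = 0
    for name, values in SEGMENTS.items():
        probs = softmax(pre_output[offset:offset + len(values)])
        choice[name] = rng.choices(values, weights=probs, k=1)[0]
        offset += len(values)
    if choice["type"] == "BACK":
        return ("BACK",)
    if choice["type"] == "GROUP":
        return ("GROUP", choice["column"])
    return ("FILTER", choice["column"], choice["filter_term"])

# Scores that strongly favor FILTER on column "protocol" with term "tcp".
scores = [-50.0, -50.0, 50.0, -50.0, -50.0, 50.0, 50.0, -50.0, -50.0]
action = sample_action(scores, random.Random(0))
```

With this factorization the number of scored nodes grows with the sum of the segment sizes rather than with their product, which is the point of avoiding one output node per fully instantiated action.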
“Turing test”
The Data Disposal Challenge
summarize, dispose of) and execute it efficiently
Define formally what makes a disposal policy good…
Ori Bar-El, Naama Boer, Daniel Deutch, Shay Gershtein, Amir Gilad, Gefen Keinan, Nave Frost, Yuval Moskovitch, Slava Novgorodov, Kathy Razmadze, Noam Rinetzky, Amit Somech, Brit Youngmann, …