Data Collection and Aggregation

Challenges: data

  • Data type: numerical sensor readings.
  • Rich and massive data, spatially distributed and correlated.
  • Data dynamics: data streaming and aging.
  • Uncertainty, noise, erroneous data, outliers.
  • Semantics: from raw data to knowledge.
Challenges: query variability

  • Data-centric query: search for “car detection”, instead of a sensor node ID.
  • Geographical query: report values near the lake.
  • Real-time detection & control: intruder detection.
  • Multi-dimensional query: spatial, temporal and attribute range.
  • Query interface: fixed base station or mobile hand-held devices.


Data processing

  • In-network aggregation
  • In-network storage
  • Distributed data management
  • Statistical modeling
  • Intelligent reasoning
In-network data aggregation

  • Communication is expensive, bandwidth is precious.
    – “In-network processing”: process raw data before transmitting.
  • A single sensor reading may not hold much value.
    – Inherently unreliable, outlier readings.
    – Users are often interested in the hidden patterns or the global picture.
  • Data compression and knowledge discovery.
    – Save storage; generate semantic reports.

Distributed In-network Storage

  • Flash drives, etc. enable distributed in-network storage.
  • Challenges:
    – Distributed indexing for fast query dissemination.
    – Exploit storage locality to benefit data retrieval.
    – Resilience to node or link failures.
    – Graceful adaptation to data skews.
    – Alleviate the hot-spot problem created by popular data.

Sound statistical models

  • Raw data may misrepresent the physical world.
    – Sensors sample at discrete times. Sensors may be faulty. Packets may be lost.
    – Most sensor data may not improve the answer quality to the query. Data can be compressed.
    – Correlation between nearby sensors, or between different attributes of the same sensor.

Model-based query

  • Build statistical models on the sensor readings.
    – Generates an observation plan to improve model accuracy.
    – Answers queries.
  • Pros:
    – Improve data robustness.
    – Exploit correlation.
    – Decrease communication cost.
    – Provide prediction of the future.
    – Easier to extract data abstractions.

Reasoning and control

  • Reason from raw sensor readings to high-level semantic events.
    – Fire detection.
  • Event-triggered reaction, sensor tasking and control.
    – Turn on the fire alarm. Direct people to the closest exits.

Data privacy, fault tolerance and security

  • Under what format should data be stored?
  • What if a sensor dies? Can we recover its data?
  • What information is revealed if a sensor is compromised?
  • An adversary injects false reports and false alarms.
Approximation and randomization

  • Connection to the streaming data model:
    – No way to store the raw data.
    – Scan the data sequentially.
    – Maintain sketches of massive amounts of data.
    – One more challenge in sensor networks: the streaming data is spatially distributed, and communication is expensive.
  • Approximations, sampling, randomization.
Papers

  • [Madden02] Samuel R. Madden, Michael J. Franklin, Joseph M. Hellerstein, and Wei Hong. TAG: a Tiny AGgregation Service for Ad-Hoc Sensor Networks. OSDI, December 2002. (Aggregation with a tree.)
  • [Shrivastava04] Nisheeth Shrivastava, Chiranjeeb Buragohain, Divy Agrawal, and Subhash Suri. Medians and Beyond: New Aggregation Techniques for Sensor Networks. ACM SenSys '04, Nov. 3-5, Baltimore, MD. (Approximate answers to medians; reduces storage and message size.)
  • [Nath04] Suman Nath, Phillip B. Gibbons, Zachary Anderson, and Srinivasan Seshan. Synopsis Diffusion for Robust Aggregation in Sensor Networks. ACM SenSys '04. (Uses multipath routing to improve routing robustness; an order- and duplicate-insensitive synopsis must be used to prevent one data value from being aggregated multiple times.)

TinyDB

  • Philosophy:
    – Sensor network = distributed database.
    – Data are stored locally.
    – Networking structure: tree-based routing.
    – Top-down SQL query.
    – Results aggregated back to the query node.
    – Most intelligence sits outside the network.

TinyDB Architecture

(Figure: TinyDB architecture with an eight-node routing tree, nodes 1–8.)

The next few slides are from Sam Madden and Wei Hong.

Query Language (TinySQL)

SELECT <aggregates>, <attributes>
  [FROM {sensors | <buffer>}]
  [WHERE <predicates>]
  [GROUP BY <exprs>]
  [SAMPLE PERIOD <const> | ONCE]
  [INTO <buffer>]
  [TRIGGER ACTION <command>]

TinySQL Examples

SELECT nodeid, nestNo, light
FROM sensors
WHERE light > 400
EPOCH DURATION 1s

Epoch  Nodeid  nestNo  Light
0      1       17      455
0      2       25      389
1      1       17      422
1      2       25      405

TinySQL Examples (cont.)

Epoch  region  CNT(…)  AVG(…)
0      North   3       360
0      South   3       520
1      North   3       370
1      South   3       520

Data Model

  • The entire sensor network is one single, infinitely-long logical table: sensors.
  • Columns consist of all the attributes defined in the network.
  • Typical attributes:
    – Sensor readings.
    – Meta-data: node id, location, etc.
    – Internal states: routing-tree parent, timestamp, queue length, etc.
  • Nodes return NULL for unknown attributes.
Query over Stored Data

  • Named buffers in Flash memory.
  • Store query results in buffers.
  • Query over named buffers.
  • Analogous to materialized views.
  • Example:
    – CREATE BUFFER name SIZE x (field1 type1, field2 type2, …)
    – SELECT a1, a2 FROM sensors SAMPLE PERIOD d INTO name
    – SELECT field1, field2, … FROM name SAMPLE PERIOD d

Event-based Queries

  • ON event SELECT …
  • Run the query only when an interesting event happens.
  • Event examples:
    – Button pushed.
    – Message arrival.
    – Bird enters nest.
  • Analogous to triggers, but events are user-defined.

TAG: Tiny Aggregation

  • Query distribution: aggregate queries are pushed down the network to construct a spanning tree.
    – The root broadcasts the query; each node hearing the query re-broadcasts it.
    – Each node selects a parent. The routing structure is a spanning tree rooted at the query node.
  • Data collection: aggregate values are routed up the tree.
    – Each internal node aggregates the partial data received from its subtree.

TAG example

(Figure: a six-node network, shown once during query distribution and once during data collection.)

TAG example

(Figure: a six-node tree computing MAX and AVERAGE.) Node 4 combines the partial states of its children 5 and 6: MAX: m4 = max{m6, m5}; AVERAGE: count c4 = c6 + c5 and sum s4 = s6 + s5.
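The merge rules on this slide can be sketched in a few lines of Python (a minimal illustration; merge_avg and merge_max are hypothetical helper names, not from the TAG implementation):

```python
def merge_avg(p1, p2):
    """Combine two AVERAGE partial states, each a (count, sum) pair."""
    c1, s1 = p1
    c2, s2 = p2
    return (c1 + c2, s1 + s2)

def merge_max(m1, m2):
    """Combine two MAX partial states."""
    return max(m1, m2)

# Node 4 combines the partial states from children 5 and 6:
p5, p6 = (1, 20.0), (1, 30.0)   # each leaf reports (count=1, sum=reading)
c4, s4 = merge_avg(p5, p6)      # c4 = c6 + c5, s4 = s6 + s5
print(c4, s4)                   # 2 50.0
print(s4 / c4)                  # the root evaluates sum/count = 25.0
```

Only the root divides sum by count; intermediate nodes forward the small (count, sum) state.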

Considerations about aggregations

  • Packet loss?
    – Acknowledgement and re-transmit?
    – Robust routing?
  • Packets arriving out of order or in duplicates?
    – Double counting?
  • Size of the aggregates?
    – Message size growth?

Classes of aggregations

  • Exemplary aggregates return one or more representative values from the set of all values; summary aggregates compute some property over all values.
    – MAX, MIN: exemplary; SUM, AVERAGE: summary.
    – Exemplary aggregates are prone to packet loss and not amenable to sampling.
    – Summary aggregates of random samples can be treated as robust estimates.

Classes of aggregations

  • Duplicate-insensitive aggregates are unaffected by duplicate readings.
    – Examples: MAX, MIN.
    – Independent of routing topology.
    – Combine with robust routing (multi-path).

Classes of aggregations

  • Monotonic aggregates: when two partial records s1 and s2 are combined into s, either e(s) ≥ max{e(s1), e(s2)} or e(s) ≤ min{e(s1), e(s2)}.
    – Examples: MAX, MIN.
    – Certain predicates (such as HAVING) can be applied early in the network to reduce communication cost.

Classes of aggregations

  • Partial state of the aggregates (from good to worst):
    – Distributive: the partial state is simply the aggregate of the partial data; its size equals the size of the final aggregate. Examples: MAX, MIN, SUM.
    – Algebraic: partial records are of constant size. Example: AVERAGE.
    – Holistic: partial state records are proportional in size to the partial data. Example: MEDIAN.
    – Unique: partial state is proportional to the number of distinct values. Example: COUNT DISTINCT.
    – Content-sensitive: partial state is proportional to some (statistical) property of the data. Examples: fixed-size bucket histogram, wavelet, etc.

Classes of aggregates

Aggregate       Duplicate sensitive  Exemplary/Summary  Monotonic  Partial state
MAX, MIN        No                   E                  Yes        Distributive
COUNT, SUM      Yes                  S                  Yes        Distributive
AVERAGE         Yes                  S                  No         Algebraic
MEDIAN          Yes                  E                  No         Holistic
COUNT DISTINCT  No                   S                  Yes        Unique
HISTOGRAM       Yes                  S                  No         Content-sensitive

Communication cost

(Figure: sending all raw data to the sink; the partial states are too large!)

Problem with median

  • Computing the average is simple on an aggregation tree.
    – Each node x stores the average a(x) and the number of nodes in its subtree, n(x).
    – The average at a node x can be computed from its children u, v: n(x) = n(u) + n(v), and a(x) = (a(u)n(u) + a(v)n(v)) / n(x).
  • Computing the median with a fixed amount of message is hard.
    – We do not know the rank of u’s median in v’s dataset.
    – We resort to approximations.

Dealing with the median

  • Resort to approximation:
    – A random sampling approach.
    – A deterministic approach.

Approach I: Random sampling

  • Problem: compute the median of n unsorted elements {ai}.
  • Solution: take a random sample K of k elements. Compute the median x of K.
  • Claim: x has rank between (½-ε)n and (½+ε)n with probability at least 1 - 2exp(-2kε²). (Proof left as an exercise.)
  • Choose k = ln(2/δ)/(2ε²); then x is an approximate median with probability at least 1-δ.
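The sampling approach above can be sketched directly (a minimal illustration; approx_median is a hypothetical name, and sampling is done with replacement for simplicity):

```python
import math
import random

def approx_median(data, eps=0.05, delta=0.01):
    """Sample k = ln(2/delta)/(2*eps^2) elements with replacement and return
    the sample median; with probability >= 1 - delta its rank lies within
    (1/2 +- eps) * n, per the claim on the slide."""
    k = math.ceil(math.log(2 / delta) / (2 * eps ** 2))
    sample = sorted(random.choice(data) for _ in range(k))
    return sample[len(sample) // 2]

data = list(range(10001))        # true median is 5000
m = approx_median(data)
print(abs(m / 10000 - 0.5))      # rank error; typically below eps = 0.05
```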

Approach II: Quantile digest (q-digest)

  • A data structure that answers:
    – Approximate quantile queries: the median, the kth largest reading.
    – Range queries: the kth to lth largest readings.
    – Most frequent items.
    – Histograms.
  • Properties:
    – Deterministic algorithm.
    – Error-memory trade-off.
    – Confidence factor.
    – Supports multiple queries.

Q-digest

  • Input data: frequencies of the data values, {f1, f2, …, fσ}.
  • Compress the data:
    – Detailed information about frequent values is preserved;
    – Less frequently occurring values are lumped into larger buckets, resulting in information loss.
  • Buckets: the nodes in a binary partition of the range [1, σ]. Each bucket v has a range [v.min, v.max].
  • Only non-zero buckets are stored.
Example

(Figure: the input data, the bucketed q-digest, and the resulting information loss.)

Q-digest properties

  • Store values in buckets.
  • Property 1: Count(v) ≤ n/k (except for leaves).
    – Controls information loss.
  • Property 2: Count(v) + Count(parent) + Count(sibling) > n/k (except for the root).
    – Ensures sufficient compression.
    – k: the compression parameter.

Construct a q-digest

  • Each sensor constructs a q-digest based on its value.
  • Check the digest property bottom-up: two “small” children’s counts are added up and moved to the parent.
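The bottom-up compression can be sketched as follows (a simplified illustration of the merge rule, not the paper's full algorithm; build_qdigest is a hypothetical helper, and σ is assumed to be a power of two):

```python
from collections import defaultdict

def build_qdigest(freqs, sigma, k):
    """Compress value frequencies into a q-digest. Buckets are nodes of a
    binary tree over [1, sigma], heap-indexed so node v has children 2v
    and 2v+1; leaves sit at indices sigma .. 2*sigma - 1."""
    n = sum(freqs.values())
    counts = defaultdict(int)
    for value, f in freqs.items():
        counts[sigma + value - 1] += f
    # Bottom-up: whenever a sibling pair plus their parent are "small"
    # (violating property 2), fold the children into the parent.
    for v in range(sigma - 1, 0, -1):
        l, r = 2 * v, 2 * v + 1
        if counts[l] + counts[r] + counts[v] < n / k:
            counts[v] += counts[l] + counts[r]
            counts[l] = counts[r] = 0
    return {v: c for v, c in counts.items() if c > 0}

# Values 1 and 2 are rare and get lumped upward; heavy leaves survive.
digest = build_qdigest({1: 1, 2: 1, 3: 2, 4: 6}, sigma=4, k=2)
print(digest)
```

With n = 10 and k = 2 the threshold is n/k = 5, so the two singleton values merge all the way to the root while the leaves for values 3 and 4 are kept intact.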

Merging two q-digests

  • Merge the q-digests from the two children.
  • Add up the values in matching buckets.
  • Re-evaluate the digest property bottom-up. Information loss: a bucket undercounts, since some of its values appear on ancestors.

Space complexity

Claim: a q-digest with compression parameter k has at most 3k buckets.

  • By property 2, summing over all buckets v in Q:
    – Σv∈Q [Count(v) + Count(parent) + Count(sibling)] > |Q|·n/k.
    – Σv∈Q [Count(v) + Count(parent) + Count(sibling)] ≤ 3·Σv∈Q Count(v) = 3n.
    – Hence |Q| < 3k.

Error bound

Claim: any value that should be counted in v may instead be counted in one of v’s ancestors.

  1. Count(v) has maximum error (logσ)·n/k.
     – Error(v) ≤ Σ(ancestors p) Count(p) ≤ Σ(ancestors p) n/k ≤ (logσ)·n/k.
  2. MERGE maintains the same relative error.
     – Error(v) ≤ Σi Error(vi) ≤ Σi (logσ)·ni/k ≤ (logσ)·n/k.

Median and quantile query

  • Given q ∈ (0, 1), find the value whose rank is qn.
  • Relative error ε = |r - qn|/n, where r is the true rank.
  • Post-order traversal on Q: sum the counts of all nodes visited before a node v; this sum is ≤ the number of values less than v.max. Report v.max the first time the sum exceeds qn.
  • Error bound: ε = logσ/k.
Other queries

  • Inverse quantile: given a value x, determine its rank.
    – Traverse the tree in post-order; report the sum of the counts of buckets v for which x > v.max. The answer is within [rank(x), rank(x)+εn].
  • Range query: find the number of values in a range [l, h].
    – Perform two inverse quantile queries and take the difference. The error bound is 2εn.
  • Frequent items: given s ∈ (0, 1), find all values reported by more than sn sensors.
    – Report the leaf buckets whose counts are more than (s-ε)n.
    – Small false positive: values with count between (s-ε)n and sn may also be reported as frequent.

Simulation setup

  • A typical aggregation tree (a BFS tree) on 40 nodes in a 200-by-200 area. The simulations themselves use 4000–8000 nodes.

Simulation setup

  • Random data;
  • Correlated data: 3D elevation values from Death Valley.

Histogram vs. q-digest

  • Comparison of histogram and q-digest.
Tradeoff between error and msg size

Saving on message size

(Figure: q-digest message size compared with the naïve solution.)

Review of data aggregation

(Figure: query distribution and collection over a six-node tree; for AVERAGE, node 4 merges its children’s partials: count c4 = c6 + c5, sum s4 = s6 + s5.)

1st problem: how to compute the median?

  • In the naïve way, the size of the message is of the same order as the number of nodes in the subtree.
  • Last lecture: the approximate median.
2nd problem: Aggregation tree in practice

  • A tree is a fragile structure.
    – If a link fails, the data from the entire subtree is lost.
  • Fix #1: use multipath, a DAG instead of a tree.
    – Send 1/k of the data to each of the k upstream nodes (parents).
    – A link failure loses only 1/k of the data.

Aggregation tree in practice

(Figure: aggregates computed over a tree vs. a DAG, compared against the true value.)

Fundamental problem

  • Aggregation and routing are coupled.
  • Improve routing robustness by multi-path routing?
    – The same data might be delivered multiple times.
    – Problem: double counting!
  • Decouple routing & aggregation:
    – Work on the robustness of each separately.

Order and duplicate insensitive (ODI) synopsis

  • The aggregated value is insensitive to the sequence or duplication of input data.
  • Small-size digests in which any particular sensor reading is accounted for only once.
    – Examples: MIN, MAX.
    – Challenge: what about COUNT, SUM?

Aggregation framework

  • Solution for robust aggregation:
    – Robust routing (e.g., multi-path) + ODI synopsis.
  • Leaf nodes run synopsis generation, SG(⋅).
  • Internal nodes run synopsis fusion: SF(⋅) takes two synopses and generates a new synopsis of the union of their input data.
  • The root node runs synopsis evaluation: SE(⋅) translates the synopsis into the final answer.

An easy example: ODI synopsis for MAX/MIN

  • Synopsis generation SG(⋅): output the value itself.
  • Synopsis fusion SF(⋅): take the MAX/MIN of the two input values.
  • Synopsis evaluation SE(⋅): output the synopsis.

Three questions

  • What do we mean by ODI, rigorously?
  • Robust routing + ODI.
  • How to design ODI synopses?
    – COUNT
    – SUM
    – Sampling
    – Most popular k items
    – Set membership: Bloom filter

Definition of ODI correctness

  • A synopsis diffusion algorithm is ODI-correct if SF() and SG() are order- and duplicate-insensitive functions.
  • Equivalently: for any aggregation DAG, the resulting synopsis is identical to the synopsis produced by the canonical left-deep tree.
  • The final result is independent of the underlying routing topology:
    – Any evaluation order.
    – Any data duplication.

Definition of ODI correctness

Connection to the streaming model: data items arrive one by one.

Test for ODI correctness

1. SG() preserves duplicates: if two readings are duplicates (e.g., two nodes with the same temperature reading), then the same synopsis is generated.
2. SF() is commutative.
3. SF() is associative.
4. SF() is same-synopsis idempotent: SF(s, s) = s.

Theorem: the properties above are necessary and sufficient for ODI-correctness. Proof idea: transform an aggregation DAG into a left-deep tree with the same output by using these properties.
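As a quick illustration, bitwise OR over bit-vector synopses passes the SF() side of this test (properties 2–4; property 1 is about SG() and holds when SG() is a deterministic hash of the reading):

```python
def sf(s1, s2):
    """Synopsis fusion: bitwise OR of two bit-vector synopses."""
    return s1 | s2

a, b, c = 0b0010, 0b0100, 0b0011
assert sf(a, b) == sf(b, a)                 # 2. commutative
assert sf(sf(a, b), c) == sf(a, sf(b, c))   # 3. associative
assert sf(a, a) == a                        # 4. same-synopsis idempotent
print("OR passes the SF() checks")
```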

Proof of ODI correctness

1. Start from the DAG. Duplicate a node with out-degree k into k nodes, each with out-degree 1. ⇒ duplicate preserving.

Proof of ODI correctness

2. Re-order the leaf nodes by increasing value of the synopsis. ⇒ commutative.

Proof of ODI correctness

3. Re-organize the tree so that adjacent leaves with the same value are inputs to the same SF function. ⇒ associative.

Proof of ODI correctness

4. Replace SF(s, s) by s. ⇒ same-synopsis idempotent.

Proof of ODI correctness

5. Re-order the leaf nodes into the increasing canonical order. ⇒ commutative.

6. QED.

Design ODI synopsis

  • Recall that MAX/MIN are ODI.
  • Translate all the other aggregates (COUNT, SUM, etc.) into uses of MAX.
  • Let’s first do COUNT.
  • Idea: use probabilistic counting: counting distinct elements in a multi-set (Flajolet and Martin, 1985).

Counting distinct elements

  • Each sensor generates a sensor reading. Count the total number of different readings.
  • Counting distinct elements in a multi-set (Flajolet and Martin, 1985).
  • Each element chooses a random level i ∈ [1, k]:
    – Pr{CT(x)=i} = 2^-i for 1 ≤ i ≤ k-1, and Pr{CT(x)=k} = 2^-(k-1).
    – The levels are hit with probabilities ½, ¼, ⅛, 1/16, ….
  • Use a pseudo-random generator so that CT(x) is a hash function (deterministic).

Counting distinct elements

  • Synopsis: a bit vector of length k > logn.
  • SG(): output a bit vector of length k with the CT(x)-th bit set.
  • SF(): bit-wise boolean OR of the inputs s and s’.
  • SE(): if i is the lowest index that is still 0, output 2^(i-1)/0.77351.
  • Intuition: the i-th position will be 1 if there are about 2^i nodes, each trying to set it with probability 1/2^i.
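The FM synopsis above can be sketched end to end (a minimal illustration; sg/sf/se and the SHA-256-based coin toss are choices for this sketch, not the paper's):

```python
import hashlib

K = 32  # synopsis length in bits (> log2 n)

def ct(x):
    """Deterministic coin-toss level for reading x: Pr{ct(x)=i} ~ 2^-i.
    Duplicates of x always land on the same level."""
    h = int(hashlib.sha256(str(x).encode()).hexdigest(), 16)
    i = 1
    while h & 1 and i < K:
        h >>= 1
        i += 1
    return i

def sg(x):
    return 1 << (ct(x) - 1)       # bit vector with the ct(x)-th bit set

def sf(s1, s2):
    return s1 | s2                # fusion: bitwise OR (ODI)

def se(s):
    i = 0                         # lowest 0-based index still 0
    while s & (1 << i):
        i += 1
    return 2 ** i / 0.77351       # matches the slide's 2^(i-1)/0.77351

readings = list(range(1000)) + list(range(500))   # duplicates do not matter
s = 0
for x in readings:
    s = sf(s, sg(x))
print(se(s))                      # rough estimate of the 1000 distinct values
```

A single FM sketch is noisy; real deployments average several independent sketches.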

Distinct value counter analysis

  • Lemma: for i < logn - 2loglogn, FM[i] = 1 with high probability (asymptotically close to 1); for i ≥ (3/2)logn + δ with δ ≥ 0, FM[i] = 0 with high probability.
  • The expected position of the first zero is log(0.77351n) + P(logn) + o(1), where P(u) is a periodic function of u with period 1 and amplitude bounded by 10^-5.
  • The error bound (which depends on the variance) can be improved by using multiple independent trials.

Counting distinct elements

  • Check the ODI-correctness:
    – Duplication: handled by the hash function; the same reading x generates the same value CT(x).
    – Boolean OR is commutative, associative, and same-synopsis idempotent.
  • Total storage: O(logn) bits.

Robust routing + ODI

  • Use a directed acyclic graph (DAG) in place of the tree.
  • Rings overlay:
    – Query distribution: nodes in ring Rj are j hops from the query node q.
    – Query aggregation: a node in ring Rj wakes up in its allocated time slot and receives messages from nodes in Rj+1.

Rings and adaptive rings

  • Adaptive rings cope with network dynamics, node deletions and insertions, etc.
  • Each node on ring j monitors the success rate of its parents on ring j-1.
  • If the success rate is low, the node may switch to other parents with a higher success rate.
  • Nodes in ring 1 may transmit multiple times to ensure robustness.

Implicit acknowledgement

  • Explicit acknowledgement:
    – 3-way handshake.
    – Used in wired networks.
  • Implicit acknowledgement:
    – Used in ad hoc wireless networks.
    – A node u sending to v snoops on the subsequent broadcast from v to see whether v indeed forwards the message for u.
    – Exploits the broadcast property; saves energy.
  • With aggregation this is problematic:
    – Say u sends value x to v, and subsequently hears value z.
    – u does not know whether or not x is incorporated into z.

Implicit acknowledgement

  • An ODI synopsis enables efficient implicit acknowledgement:
    – u sends synopsis x to v.
    – Afterwards, u hears v transmitting synopsis z.
    – u verifies whether SF(x, z) = z.
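The SF(x, z) = z check is one line for an OR-based bit-vector synopsis (illustrative values only):

```python
def sf(s1, s2):
    """ODI fusion for a bit-vector synopsis: bitwise OR."""
    return s1 | s2

x = 0b0100                       # synopsis u sent to v
z_good = 0b0110                  # v's broadcast with x incorporated
z_bad = 0b0010                   # v's broadcast without x
print(sf(x, z_good) == z_good)   # True  -> implicit acknowledgement
print(sf(x, z_bad) == z_bad)     # False -> u should retransmit
```

The check works because fusing x into a synopsis that already accounts for x changes nothing.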

Error of approximate answers

  • Two sources of error:
    – Algorithmic error: due to randomization and approximation.
    – Communication error: the fraction of sensor readings not accounted for in the final answer.
  • Algorithmic error depends on the choice of algorithm and is under our control.
  • Communication error depends on the network dynamics and the robustness of the routing algorithm.

Simulation results

(Figure: fraction of unaccounted nodes.)

Simulation results

(Figure: relative root mean square error.)

More ODI synopsis

  • Distinct values
  • SUM
  • Second moment
  • Uniform sample
  • Most popular items
  • Set membership: Bloom filter
Sum

  • Naïve approach: for an item x with value c, make c distinct copies (x, j), j = 1, …, c, then run the distinct-count algorithm.
  • When c is large, set the bits directly, as if we had performed c successive insertions into the FM sketch:
    – First set the lowest δ = logc - loglogc bits to 1.
    – The insertions that reach bit δ follow a binomial distribution: each reaches δ with probability 2^-δ.
    – Explicitly insert those that reached bit δ by coin flipping.
  • A powerful building block.
Second moment

  • kth moment: µk = Σi xi^k, where xi is the number of sensor readings (the frequency) of value i.
    – µ0 is the number of distinct elements.
    – µ1 is the sum.
    – µ2 is the square of the L2 norm (variance, skewness of the data).
  • The sketch algorithm for frequency moments can be turned into an ODI synopsis easily by using ODI-SUM.

“The space complexity of approximating the frequency moments”, N. Alon, Y. Matias, and M. Szegedy, STOC 1996.
Second moment

  • Random hash h(⋅): {0, 1, …, N-1} → {-1, +1}.
  • Define zi = h(i).
  • Maintain X = Σi xi·zi.
  • E(X²) = E(Σi xi·zi)² = Σi xi²·E(zi²) + Σi≠j xi·xj·E(zi·zj).
  • Choose the hash function to be pairwise independent: Pr{h(i)=a, h(j)=b} = ¼.
  • Then E(zi²) = 1 and E(zi·zj) = E(zi)·E(zj) = 0.
  • So E(X²) = Σi xi².
  • ODI: each sensor with value i generates zi, then use ODI-SUM.
  • The final answer is X².
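One estimator of this kind can be sketched as follows (a simplification: signs come from a seeded PRNG rather than a pairwise-independent hash, and ams_estimate is a hypothetical name):

```python
import random
from collections import Counter

def ams_estimate(readings, seed=0):
    """One AMS-style estimator of mu_2 = sum_i x_i^2. A deployment would use
    a pairwise-independent hash for the signs and average many independent
    estimators to shrink the variance."""
    rng = random.Random(seed)
    signs = {}
    def z(i):                            # z_i = h(i) in {-1, +1}
        if i not in signs:
            signs[i] = rng.choice([-1, 1])
        return signs[i]
    X = sum(z(v) for v in readings)      # X = sum_i x_i * z_i
    return X * X                         # E[X^2] = sum_i x_i^2

readings = [1] * 30 + [2] * 40 + [3] * 50
true_mu2 = sum(c * c for c in Counter(readings).values())   # 5000
est = sum(ams_estimate(readings, seed=s) for s in range(50)) / 50
print(true_mu2, est)                     # averaging pulls est toward 5000
```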
More ODI synopsis

  • Distinct values
  • SUM
  • Second moment
  • Uniform sample
  • Most popular items
  • Set membership: Bloom filter
Uniform sample

  • Each sensor has a reading. Compute a uniform sample of a given size k.
  • Synopsis: a sample of k tuples.
  • SG(): output (value, r, id), where r is a uniform random number in the range [0, 1].
  • SF(): output the k tuples with the k largest r values. If there are fewer than k tuples in total, output them all.
  • SE(): output the values in s.
  • ODI-correctness is implied by the “MAX” and union operations in SF().
  • Correctness: the elements carrying the k largest random numbers form a uniform k-sample.
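This sampling synopsis can be sketched directly (sg/sf/se are illustrative names; tuples are stored as (r, id, value) so sorting by the random tag is automatic):

```python
import random

K = 3   # desired sample size

def sg(node_id, value, rng):
    """Leaf synopsis: one tagged tuple, r uniform in [0, 1)."""
    return [(rng.random(), node_id, value)]

def sf(s1, s2):
    """Fusion: union (identical tuples collapse), keep the K largest r."""
    return sorted(set(s1) | set(s2), reverse=True)[:K]

def se(s):
    return [value for _, _, value in s]

rng = random.Random(1)
parts = [sg(i, 10 * i, rng) for i in range(8)]
s = []
for p in parts + parts:              # every synopsis delivered twice: no effect
    s = sf(s, p)
print(se(s))                         # a uniform sample of K = 3 readings
```

Duplicate deliveries vanish in the set union, which is exactly the ODI property.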

Most popular items

  • Return the k values that occur most frequently among all the sensor readings.
  • Synopsis: the set of the k most popular items.
  • SG(): output a (value, weight) pair, with weight = CT(value) computed over more than logn levels (the FM coin-toss level).
  • SF(): for each distinct value v, discard all but the pair with the maximum weight; then output the k pairs with the maximum weights.
  • SE(): output the set of values.
  • Note: the weight attached to a value estimates its frequency.
  • Many aggregates that can be approximated using random samples now have ODI synopses, e.g., the median.

Set membership: Bloom Filter

  • A compact data structure to encode set containment.
  • Widely used in networking applications.
  • Given: n elements S = {x1, x2, …, xn}.
  • Answer the query: is x in S?
  • Allows a small false-positive rate (an element not in S might be reported as “yes”).

Bloom filter

  • An array of m bits.
  • Insert: for x ∈ S, use k random hash functions and set the bits hj(x) to “1”.
  • Query: to check whether y is in S, examine all buckets hj(y); if all are “1”, answer “yes”.
  • No false negatives; a small false-positive rate.
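A minimal Bloom filter sketch (the SHA-256-derived hash family and the m = 64, k = 3 parameters are choices for this illustration):

```python
import hashlib

M, K = 64, 3   # filter size in bits, number of hash functions

def hashes(x):
    """K deterministic bucket indices for item x."""
    for j in range(K):
        h = hashlib.sha256(f"{j}:{x}".encode()).hexdigest()
        yield int(h, 16) % M

def insert(bf, x):
    for b in hashes(x):
        bf |= 1 << b               # set every hashed bit
    return bf

def query(bf, x):
    # all hashed bits set -> "yes" (no false negatives)
    return all(bf & (1 << b) for b in hashes(x))

bf = 0
for x in ["car", "bike", "bus"]:
    bf = insert(bf, x)
print(query(bf, "car"))            # True
print(query(bf, "plane"))          # False, with high probability
```

Note that the filter of a union is the OR of the filters, which is what makes it an ODI synopsis on the next slide.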

Bloom filter tricks

  • Union of S1 and S2:
    – Take the “OR” of their Bloom filters.
    – This is an ODI aggregation.
  • Shrink the size to half:
    – OR the first and second halves.

Counting bloom filter

  • Handles element insertion and deletion.
  • Each bucket is a counter.
  • Insert: increase the counters at the hashed locations by “1”.
  • Delete: decrease them by “1”.
  • Be careful about counter overflow.
Spectral bloom filter

  • Records a multi-set {x1, x2, …, xn}, where each item xi has a frequency fi.
  • Insert: add fi to each hashed bucket.
  • Retrieve: return the smallest bucket value among the hashed locations.
  • Idea: the smallest bucket is the least likely to be polluted by collisions.
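The insert/retrieve pair can be sketched in a few lines (hash family and parameters are again illustrative choices):

```python
import hashlib

M, K = 64, 3          # number of buckets, number of hash functions
counters = [0] * M    # each bucket is a counter

def buckets(x):
    """K deterministic bucket indices for item x."""
    return [int(hashlib.sha256(f"{j}:{x}".encode()).hexdigest(), 16) % M
            for j in range(K)]

def insert(x, f=1):
    for b in buckets(x):
        counters[b] += f           # add the item's frequency to every bucket

def retrieve(x):
    # the minimum bucket is the least likely to be polluted by collisions
    return min(counters[b] for b in buckets(x))

insert("temp=20", 5)
insert("temp=21", 2)
print(retrieve("temp=20"))         # 5, unless all three buckets collide
```

Retrieve never underestimates: collisions can only inflate a bucket.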

Bloom filter applications

  • Traditional applications:
    – Dictionaries, the UNIX spell checker.
  • Network applications:
    – Cache summaries in content delivery networks.
    – Resource routing, etc.
    – Read the survey for more.
  • Good for the sensor network setting:
    – ODI, compact, many algebraic properties.

Conclusion

  • Due to the high dynamics of sensor networks, robust aggregates that are insensitive to order and duplication are very attractive: they provide the flexibility to use any multi-path routing algorithm and re-transmission.
  • ODI synopses can be used as black-box operators to replace naïve operators in more complex data structures.

Is the problem solved? NO

  • Best-effort multi-path routing does not guarantee that all data have been incorporated.
    – Black-box setting.
  • The ODI synopses translate everything to MAX, which is not robust to outliers!
    – Sensor malfunction.
    – Malicious attacks.
  • For exemplary aggregations (MAX, MIN) the final result is a single sensor value, yet all nodes are examined.
    – Can we improve?

CountTorrent

  • To improve routing robustness, deliver each value multiple times to make sure at least one copy arrives.
    – Synopsis diffusion: aggregating the same value multiple times does not result in double counting.
    – CountTorrent: remember which values have been included in the aggregate, in an implicit manner.

How to record the members in the aggregate?

  • The naive way: keep the members explicitly.
    – Storage and communication costs are too high.
    – It defeats the point of aggregation.
  • The implicit way: label the aggregates.

CountTorrent

  • Each node has a label: a 0/1 string.
  • Two nodes can have their data aggregated if their labels are the same except for the last bit.
  • After aggregation, remove the last bit and assign the shortened label to the aggregated data.
  • Gossip-style communication: each node exchanges its values with its neighbors.
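The label-merge rule can be sketched as follows (mergeable and merge are illustrative helper names for this sketch):

```python
def mergeable(l1, l2):
    """Two aggregates merge iff their labels agree except in the last bit."""
    return len(l1) == len(l2) and l1[:-1] == l2[:-1] and l1 != l2

def merge(item1, item2):
    (l1, c1), (l2, c2) = item1, item2
    assert mergeable(l1, l2)
    return (l1[:-1], c1 + c2)    # drop the last bit, sum the counts

# Four nodes with leaf labels; pairwise merges yield the root label "".
buf = [("00", 1), ("01", 1), ("10", 1), ("11", 1)]
a = merge(buf[0], buf[1])        # ("0", 2)
b = merge(buf[2], buf[3])        # ("1", 2)
root = merge(a, b)               # ("", 4): empty label = complete count
print(root)
```

The empty label tells every node, implicitly, that all members have been included exactly once.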

CountTorrent example

  • For any two nodes, their labels are neither the same, nor is either one a prefix of the other.
  • All N labels can be merged pairwise and recursively to yield ε, the empty string.

Aggregation

  • Each node keeps a buffer of received value/label pairs.
  • Consolidate: try to merge the data in the buffer.
How to assign labels?

  • Each node is given the label of a leaf node.

Conclusion

  • Aggregation sometimes requires careful design to trade off accuracy against storage and message size.
  • Aggregation incurs information loss, making robust estimation more difficult; e.g., a single outlier reading can corrupt a MAX/MIN aggregate.