1
Data Collection and Aggregation
2
Challenges: data
- Data type: numerical sensor readings.
- Rich and massive data, spatially distributed and correlated.
- Data dynamics: data streaming and aging.
- Uncertainty, noise, erroneous data, outliers.
- Semantics: turning raw data into knowledge.
3
Challenges: query variability
- Data-centric query: search for “car detection”, instead of sensor node ID.
- Geographical query: report values near the lake.
- Real-time detection & control: intruder detection.
- Multi-dimensional query: spatial, temporal and attribute range.
- Query interface: fixed base station or mobile handheld devices.
4
Data processing
- In-network aggregation
- In-network storage
- Distributed data management
- Statistical modeling
- Intelligent reasoning
5
In-network data aggregation
- Communication is expensive; bandwidth is precious.
– “In-network processing”: process raw data before transmitting.
- A single sensor reading may not hold much value.
– Inherently unreliable, outlier readings.
– Users are often interested in the hidden patterns or the global picture.
- Data compression and knowledge discovery.
– Save storage; generate semantic reports.
6
Distributed In-network Storage
- Flash drives, etc. enable distributed in-network storage.
- Challenges
– Distributed indexing for fast query dissemination.
– Explore storage locality to benefit data retrieval.
– Resilience to node or link failures.
– Graceful adaptation to data skews.
– Alleviate the hot-spot problem created by popular data.
7
Sound statistical models
- Raw data may misrepresent the physical world.
– Sensors sample at discrete times. Sensors may be faulty. Packets may be lost.
– Most sensor data may not improve the answer quality of the query. Data can be compressed.
– Correlation between nearby sensors or different attributes of the same sensor.
8
Model-based query
- Build statistical models on the sensor readings.
– Generate observation plans to improve model accuracy.
– Answer queries from the model.
- Pros:
– Improve data robustness.
– Explore correlation.
– Decrease communication cost.
– Provide prediction of the future.
– Easier to extract data abstraction.
9
Reasoning and control
- Reason from raw sensor readings to high-level semantic events.
– Fire detection.
- Event-triggered reaction, sensor tasking and control.
– Turn on the fire alarm. Direct people to the closest exits.
10
Data privacy, fault tolerance and security
- Under what format should data be stored?
- What if a sensor dies? Can we recover its data?
- What information is revealed if a sensor is
compromised?
- Adversary injects false reports and false alarms.
11
Approximation and randomization
- Connection to streaming data model:
– No way to store the raw data.
– Scan the data sequentially.
– Maintain sketches of a massive amount of data.
– One more challenge in sensor networks: the streaming data is spatially distributed and communication is expensive.
- Approximations, sampling, randomization.
12
Papers
- [Madden02] Samuel R. Madden, Michael J. Franklin, Joseph M. Hellerstein, and Wei Hong. TAG: a Tiny AGgregation Service for Ad-Hoc Sensor Networks. OSDI, December 2002. Aggregation with a tree.
- [Shrivastava04] Nisheeth Shrivastava, Chiranjeeb Buragohain, Divy Agrawal, and Subhash Suri. Medians and Beyond: New Aggregation Techniques for Sensor Networks. ACM SenSys '04, Nov. 3-5, Baltimore, MD. Approximate answers to medians; reduces storage and message size.
- [Nath04] Suman Nath, Phillip B. Gibbons, Zachary Anderson, and Srinivasan Seshan. Synopsis Diffusion for Robust Aggregation in Sensor Networks. ACM SenSys '04. Uses multipath routing to improve routing robustness; an order- and duplicate-insensitive synopsis is needed to prevent one data value from being aggregated multiple times.
13
TinyDB
- Philosophy:
– Sensor network = distributed database.
– Data are stored locally.
– Networking structure: tree-based routing.
– Top-down SQL query.
– Results aggregated back to the query node.
– Most intelligence outside the network.
14
TinyDB Architecture
(Figure: TinyDB architecture; a multihop routing tree over sensor nodes 1–8.)
The next few slides from Sam Madden, Wei Hong
15
Query Language (TinySQL)
SELECT <aggregates>, <attributes>
[FROM {sensors | <buffer>}]
[WHERE <predicates>]
[GROUP BY <exprs>]
[SAMPLE PERIOD <const> | ONCE]
[INTO <buffer>]
[TRIGGER ACTION <command>]
16
TinySQL Examples
SELECT nodeid, nestNo, light
FROM sensors
WHERE light > 400
EPOCH DURATION 1s

Epoch  nodeid  nestNo  light
0      1       17      455
0      2       25      389
1      1       17      422
1      2       25      405
17
TinySQL Examples (cont.)
Epoch  region  CNT(…)  AVG(…)
0      North   3       360
0      South   3       520
1      North   3       370
1      South   3       520

SELECT region, CNT(occupied), AVG(sound)
FROM sensors
GROUP BY region
HAVING AVG(sound) > 200
EPOCH DURATION 10s
18
Data Model
- Entire sensor network as one single, infinitely-long logical table: sensors
- Columns consist of all the attributes defined in the network
- Typical attributes:
– Sensor readings
– Meta-data: node id, location, etc.
– Internal states: routing tree parent, timestamp, queue length, etc.
- Nodes return NULL for unknown attributes
19
Query over Stored Data
- Named buffers in Flash memory
- Store query results in buffers
- Query over named buffers
- Analogous to materialized views
- Example:
– CREATE BUFFER name SIZE x (field1 type1, field2 type2, …)
– SELECT a1, a2 FROM sensors SAMPLE PERIOD d INTO name
– SELECT field1, field2, … FROM name SAMPLE PERIOD d
20
Event-based Queries
- ON event SELECT …
- Run the query only when interesting events happen
- Event examples
– Button pushed
– Message arrival
– Bird enters nest
- Analogous to triggers, but events are user-defined
21
TAG: Tiny Aggregation
- Query distribution: aggregate queries are pushed down the network to construct a spanning tree.
– The root broadcasts the query; each node hearing the query re-broadcasts it.
– Each node selects a parent. The routing structure is a spanning tree rooted at the query node.
- Data collection: aggregate values are routed up the tree.
– Each internal node aggregates the partial data received from its subtree.
22
TAG example
(Figure: query distribution down the tree and data collection back up, over nodes 1–6.)
23
TAG example
MAX: m4 = max{m6, m5}.
AVERAGE: count c4 = c6 + c5; sum s4 = s6 + s5.
(Figure: partial aggregates flowing up the tree over nodes 1–6.)
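Below is a minimal Python sketch (not TinyDB's actual code) of these two merge rules: MAX is its own partial state, while AVERAGE must carry a (count, sum) pair and only the root divides.

# Partial-state merging in the style of TAG (illustrative sketch).

def merge_max(m1, m2):
    # m4 = max{m6, m5}: MAX is its own partial state.
    return max(m1, m2)

def merge_avg(s1, s2):
    # AVERAGE carries (count, sum): c4 = c6 + c5, s4 = s6 + s5.
    return (s1[0] + s2[0], s1[1] + s2[1])

def evaluate_avg(state):
    # Only the root turns the partial state into the final answer.
    count, total = state
    return total / count

# Node 4 merges its children 5 and 6, then folds in its own reading.
child5, child6, own4 = (1, 20.0), (2, 50.0), (1, 25.0)
partial = merge_avg(merge_avg(child5, child6), own4)
print(evaluate_avg(partial))  # (4 readings, sum 95.0) -> 23.75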
24
Considerations about aggregations
- Packet loss?
– Acknowledgement and re-transmit?
– Robust routing?
- Packets arriving out of order or in duplicates?
– Double counting?
- Size of the aggregates?
– Message size growth?
25
Classes of aggregations
- Exemplary aggregates return one or more representative values from the set of all values; summary aggregates compute some property over all values.
– MAX, MIN: exemplary; SUM, AVERAGE: summary.
– Exemplary aggregates are prone to packet loss and not amenable to sampling.
– Summary aggregates of random samples can be treated as robust estimates.
26
Classes of aggregations
- Duplicate-insensitive aggregates are unaffected by duplicate readings.
– Examples: MAX, MIN.
– Independent of routing topology.
– Combine with robust routing (multi-path).
27
Classes of aggregations
- Monotonic aggregates: when two partial records s1 and s2 are combined into s, either e(s) ≥ max{e(s1), e(s2)} or e(s) ≤ min{e(s1), e(s2)}.
– Examples: MAX, MIN.
– Certain predicates (such as HAVING) can be applied early in the network to reduce the communication cost.
28
Classes of aggregations
- Partial state of the aggregates:
– Distributive: the partial state is simply the aggregate of the partial data; its size equals that of the final aggregate. Examples: MAX, MIN, SUM.
– Algebraic: partial records are of constant size. Example: AVERAGE.
– Holistic: the partial state records are proportional in size to the partial data. Example: MEDIAN.
– Unique: partial state is proportional to the number of distinct values. Example: COUNT DISTINCT.
– Content-sensitive: partial state is proportional to some (statistical) property of the data. Examples: fixed-size bucket histograms, wavelets, etc.
(Partial-state size ranges from good (distributive, algebraic) through bad (unique, content-sensitive) to worst (holistic).)
29
Classes of aggregates
Aggregate        Duplicate sensitive  Exemplary/Summary  Monotonic  Partial state
MAX, MIN         No                   E                  Yes        Distributive
COUNT, SUM       Yes                  S                  Yes        Distributive
AVERAGE          Yes                  S                  No         Algebraic
MEDIAN           Yes                  E                  No         Holistic
COUNT DISTINCT   No                   S                  Yes        Unique
HISTOGRAM        Yes                  S                  No         Content-sensitive
30
Communication cost
Sending all raw data to the sink makes the partial states too large!
31
Problem with median
- Computing the average is simple on an aggregation tree.
– Each node x stores the average a(x) and the number of nodes in its subtree n(x).
– The average of a node x can be computed from its children u, v: n(x) = n(u) + n(v), a(x) = (a(u)n(u) + a(v)n(v)) / n(x).
- Computing the median with a fixed amount of message is hard.
– We do not know the rank of u’s median in v’s dataset.
– We resort to approximations.
32
Deal with computing median
- Resort to approximation.
– A random sampling approach.
– A deterministic approach.
33
Approach I: Random sampling
- Problem: compute the median of n unsorted elements {ai}.
- Solution: take a random sample K of k elements. Compute the median x of K.
- Claim: x has rank between (½−ε)n and (½+ε)n with probability at least 1 − 2·exp(−2kε²). (Proof left as an exercise; hint: apply Hoeffding’s bound to each tail and take a union bound.)
- Choose k = ln(2/δ)/(2ε²); then x is an ε-approximate median with probability 1 − δ.
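A quick Monte Carlo check of the claim (a sketch; the data distribution, trial count, and parameter values are arbitrary illustrative choices):

import math
import random

def sampled_median_rank(data, k):
    # Median of a k-element random sample, and its rank in the full data.
    x = sorted(random.sample(data, k))[k // 2]
    return sum(1 for a in data if a <= x)

n, eps, delta = 20_000, 0.05, 0.01
k = math.ceil(math.log(2 / delta) / (2 * eps ** 2))  # k = ln(2/delta)/(2 eps^2)
data = [random.random() for _ in range(n)]

failures = sum(1 for _ in range(100)
               if abs(sampled_median_rank(data, k) - n / 2) > eps * n)
print(f"k = {k}, empirical failure rate = {failures / 100} (bound: {delta})")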
34
Approach II: Quantile digest (q-digest)
- A data structure that answers
– Approximate quantile queries: the median, the kth largest reading.
– Range queries: the kth to lth largest readings.
– Most frequent items.
– Histograms.
- Properties:
– Deterministic algorithm.
– Error-memory trade-off.
– Confidence factor.
– Supports multiple queries.
35
Q-digest
- Input data: frequencies of the data values {f1, f2, …, fσ}.
- Compress the data:
– detailed information about frequent values is preserved;
– less frequently occurring values are lumped into larger buckets, at the price of information loss.
- Buckets: the nodes in a binary partition of the range [1, σ]. Each bucket v has a range [v.min, v.max].
- Only non-zero buckets are stored.
36
Example
(Figure: input data, the bucketed q-digest, and the resulting information loss.)
37
Q-digest properties
- Store values in buckets.
- 1. Count(v) ≤ n/k (except leaves).
– Controls information loss.
- 2. Count(v) + Count(p) + Count(s) > n/k (except the root), where p is v’s parent and s is v’s sibling.
– Ensures sufficient compression.
– k: the compression parameter.
38
Construct a q-digest
- Each sensor constructs a q-digest based on its own value.
- Check the digest property bottom-up: the counts of two “small” children are added up and moved to the parent.
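A compact sketch of this bottom-up compression, under illustrative assumptions: sigma is a power of two, bucket ids use heap numbering (root 1, children 2v and 2v+1, leaf for value x is sigma + x), and build_qdigest is a made-up helper name.

def build_qdigest(freqs, sigma, n, k):
    # freqs: dict value -> frequency. Start with one leaf bucket per value.
    buckets = {sigma + x: f for x, f in freqs.items() if f > 0}
    threshold = n // k
    level = sigma                        # leftmost id of the current level
    while level > 1:
        for v in range(level, 2 * level, 2):
            sib, parent = v + 1, v // 2
            total = buckets.get(v, 0) + buckets.get(sib, 0) + buckets.get(parent, 0)
            # Family too light (property 2 would fail): push the children's
            # counts up into the parent bucket.
            if total <= threshold and (v in buckets or sib in buckets):
                buckets[parent] = total
                buckets.pop(v, None)
                buckets.pop(sib, None)
        level //= 2
    return buckets

# 15 readings over the value range [0, 8); k = 5 keeps at most 3k buckets.
digest = build_qdigest({0: 1, 1: 1, 2: 1, 3: 6, 4: 2, 5: 2, 6: 1, 7: 1},
                       sigma=8, n=15, k=5)
print(digest)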
39
Merging two q-digests
- Merge the q-digests from two children.
- Add up the values in matching buckets.
- Re-evaluate the digest property bottom-up.
- Information loss: a bucket undercounts, since some of its values appear on its ancestors.
40
Space complexity
Claim: A q-digest with compression parameter k has at most 3k buckets.
- Summing property 2 over all buckets v in Q:
– Σv∈Q [Count(v) + Count(p) + Count(s)] > |Q|·n/k.
– Each bucket’s count appears at most three times in this sum (as itself, as a parent, and as a sibling), so Σv∈Q [Count(v) + Count(p) + Count(s)] ≤ 3·Σv∈Q Count(v) = 3n.
– Hence |Q| < 3k.
41
Error bound
Claim: any value that should be counted in v may instead be counted at one of v’s ancestors.
1. Count(v) has maximum error (log σ)·n/k.
– Error(v) ≤ Σancestor p Count(p) ≤ Σancestor p n/k ≤ (log σ)·n/k.
2. MERGE maintains the same relative error.
– Error(v) ≤ Σi Error(vi) ≤ Σi (log σ)·ni/k ≤ (log σ)·n/k, where n = Σi ni.
42
Median and quantile query
- Given q ∈ (0, 1), find the value whose rank is qn.
- Relative error ε = |r − qn|/n, where r is the true rank of the reported value.
- Post-order traversal of Q: sum the counts of all buckets visited before a bucket v; this sum is ≤ the number of values less than v.max. Report v.max the first time the sum exceeds qn.
- Error bound: ε = (log σ)/k.
43
Other queries
- Inverse quantile: given a value x, determine its rank.
– Traverse the tree in post-order; report the sum of counts of buckets v for which x > v.max. This is within [rank(x), rank(x) + εn].
- Range query: find the number of values in a range [l, h].
– Perform two inverse quantile queries and take the difference. The error bound is 2εn.
- Frequent items: given s ∈ (0, 1), find all values reported by more than sn sensors.
– Report the leaf buckets whose counts are more than (s − ε)n.
– Small false positive rate: values with count between (s − ε)n and sn may also be reported as frequent.
44
Simulation setup
- A typical aggregation tree (BFS tree) on 40 nodes in a 200-by-200 area. The simulations in the paper use 4000–8000 nodes.
45
Simulation setup
- Random data;
- Correlated data: 3D elevation values from Death Valley.
46
Histogram vs. q-digest
- Comparison of histogram and q-digest.
47
Tradeoff between error and message size
48
Saving on message size
(Figure: message size of the q-digest compared with the naïve solution.)
Review of data aggregation
49
(Figure: review of tree aggregation over nodes 1–6: the query is distributed down the tree; AVERAGE flows up with count c4 = c6 + c5 and sum s4 = s6 + s5.)
50
1st problem: how to compute the median?
- In the naïve way, the size of the message is of the same order as the number of nodes in the subtree.
- Last lecture: approximate medians.
51
2nd problem: Aggregation tree in practice
- A tree is a fragile structure.
– If a link fails, the data from the entire subtree is lost.
- Fix #1: use multipath, a DAG instead of a tree.
– Send 1/k of the data to each of the k upstream nodes (parents).
– A link failure then loses only 1/k of the data.
(Figure: a tree vs. a DAG over nodes 1–6.)
52
Aggregation tree in practice
(Figure: aggregate computed over the tree and over the DAG, compared with the true value.)
53
Fundamental problem
- Aggregation and routing are coupled.
- Improve routing robustness by multi-path routing?
– The same data might be delivered multiple times.
– Problem: double counting!
- Decouple routing & aggregation.
– Work on the robustness of each separately.
54
Order and duplicate insensitive (ODI) synopsis
- The aggregated value is insensitive to the order or duplication of input data.
- Small-size digests in which any particular sensor reading is accounted for only once.
– Examples: MIN, MAX.
– Challenge: what about COUNT and SUM?
55
Aggregation framework
- Solution for robust aggregation:
– Robust routing (e.g., multi-path) + ODI synopsis.
- Leaf nodes: synopsis generation, SG(⋅).
- Internal nodes: synopsis fusion, SF(⋅), takes two synopses and generates a new synopsis over the union of their input data.
- Root node: synopsis evaluation, SE(⋅), translates the synopsis into the final answer.
56
An easy example: ODI synopsis for MAX/MIN
- Synopsis generation, SG(⋅):
– output the value itself.
- Synopsis fusion, SF(⋅):
– take the MAX/MIN of the two input values.
- Synopsis evaluation, SE(⋅):
– output the synopsis.
57
Three questions
- What do we mean by ODI, rigorously?
- Robust routing + ODI.
- How to design ODI synopses?
– COUNT
– SUM
– Sampling
– Most popular k items
– Set membership (Bloom filter)
58
Definition of ODI correctness
- A synopsis diffusion algorithm is ODI-correct if
SF() and SG() are order and duplicate insensitive functions.
- Or, if for any aggregation DAG, the resulting
synopsis is identical to the synopsis produced by the canonical left-deep tree.
- The final result is independent of the underlying
routing topology.
– Any evaluation order. – Any data duplication.
59
Definition of ODI correctness
Connection to the streaming model: data items arrive one by one.
60
Test for ODI correctness
1. SG() preserves duplicates: if two readings are duplicates (e.g., two nodes with the same temperature reading), then the same synopsis is generated.
2. SF() is commutative.
3. SF() is associative.
4. SF() is same-synopsis idempotent: SF(s, s) = s.
Theorem: the above properties are sufficient and necessary for ODI-correctness.
Proof idea: transform an aggregation DAG into a left-deep tree with the same output by using these properties.
61
Proof of ODI correctness
1. Start from the DAG. Duplicate each node with out-degree k into k nodes, each with out-degree 1. (Duplicate preservation.)
62
Proof of ODI correctness
2. Re-order the leaf nodes by increasing value of the synopsis. (Commutativity.)
63
Proof of ODI correctness
3. Re-organize the tree so that adjacent leaves with the same value are inputs to the same SF function. (Associativity.)
64
Proof of ODI correctness
4. Replace SF(s, s) by s. (Same-synopsis idempotence.)
65
Proof of ODI correctness
5. Re-order the leaf nodes into increasing canonical order. (Commutativity.)
6. QED.
66
Design ODI synopsis
- Recall that MAX/MIN are ODI.
- Translate all the other aggregates (COUNT, SUM, etc.) into uses of MAX.
- Let’s first do COUNT.
- Idea: use probabilistic counting.
- Counting distinct elements in a multi-set (Flajolet and Martin, 1985).
67
Counting distinct elements
- Each sensor generates a sensor reading. Count the total number of distinct readings.
- Counting distinct elements in a multi-set (Flajolet and Martin, 1985).
- Each element chooses a random level i ∈ [1, k]:
– Pr{CT(x) = i} = 2^−i for 1 ≤ i ≤ k−1; Pr{CT(x) = k} = 2^−(k−1).
- Use a pseudo-random generator so that CT(x) is a hash function (deterministic).
(Level probabilities: ½, ¼, 1/8, 1/16, …)
68
Counting distinct elements
- Synopsis: a bit vector of length k > log n.
- SG(): output a bit vector of length k with bit CT(x) set.
- SF(): bit-wise boolean OR of the inputs s and s’.
- SE(): if i is the lowest index that is still 0, output 2^(i−1)/0.77351.
- Intuition: the i-th position will be 1 if there are about 2^i nodes, each trying to set it with probability 2^−i.
69
Distinct value counter analysis
- Lemma: for i < log n − 2 log log n, FM[i] = 1 with high probability (asymptotically close to 1). For i ≥ (3/2) log n + δ, with δ ≥ 0, FM[i] = 0 with high probability.
- The expected position of the first zero is log(0.77351·n) + P(log n) + o(1), where P(u) is a periodic function of u with period 1 and amplitude bounded by 10^−5.
- The error bound (which depends on the variance) can be improved by using multiple independent trials.
70
Counting distinct elements
- Check ODI-correctness:
– Duplication: handled by the hash function; the same reading x generates the same value CT(x).
– Boolean OR is commutative, associative, and same-synopsis idempotent.
- Total storage: O(log n) bits.
71
Robust routing + ODI
- Use a Directed Acyclic Graph (DAG) to replace the tree.
- Rings overlay:
– Query distribution: nodes in ring Rj are j hops from the query node q.
– Query aggregation: a node in ring Rj wakes up in its allocated time slot and receives messages from nodes in ring Rj+1.
72
Rings and adaptive rings
- Adaptive rings: cope with network dynamics, node deletions and insertions, etc.
- Each node on ring j monitors the success rate of its parents on ring j−1.
- If the success rate is low, the node may switch to other parents with a higher success rate.
- Nodes on ring 1 may transmit multiple times to ensure robustness.
73
Implicit acknowledgement
- Explicit acknowledgement:
– 3-way handshake.
– Used in wired networks.
- Implicit acknowledgement:
– Used in ad hoc wireless networks.
– A node u sending to v snoops on the subsequent broadcast from v to see whether v indeed forwards the message for u.
– Exploits the broadcast property; saves energy.
- With aggregation this is problematic.
– Say u sends value x to v, and subsequently hears value z.
– u does not know whether or not x is incorporated into z.
74
Implicit acknowledgement
- ODI synopses enable efficient implicit acknowledgement.
– u sends synopsis x to v.
– Afterwards, u hears v transmitting synopsis z.
– u verifies whether SF(x, z) = z.
75
Error of approximate answers
- Two sources of error:
– Algorithmic error: due to randomization and approximation.
– Communication error: the fraction of sensor readings not accounted for in the final answer.
- Algorithmic error depends on the choice of
algorithm and is under control.
- Communication error depends on the network
dynamics and robustness of routing algorithms.
76
Simulation results
(Figure: percentage of unaccounted nodes.)
77
Simulation results
(Figure: relative root mean square error.)
78
More ODI synopsis
- Distinct values
- SUM
- Second moment
- Uniform sample
- Most popular items
- Set membership --- Bloom Filter
79
Sum
- Naïve approach: for an item x with value c, make c distinct copies (x, j), j = 1, …, c, and use the distinct-count algorithm.
- When c is large, we instead set the bits as if we had performed c successive insertions into the FM sketch:
– first set the lowest δ = log c − log log c bits to 1;
– the insertions that reach bit δ follow a binomial distribution: each reaches δ with probability 2^−δ;
– explicitly insert those that reach bit δ by coin flipping.
- Powerful building block.
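A sketch of the naive reduction, reusing SG/SF/SE from the distinct-counter sketch above; the log c − log log c acceleration is omitted for clarity:

def SG_sum(sensor_id, value):
    # A sensor with value c inserts c distinct items (id, 1..c).
    s = 0
    for j in range(value):
        s = SF(s, SG((sensor_id, j)))
    return s

total = 0
for sensor_id, value in [("a", 40), ("b", 25), ("c", 35)]:
    total = SF(total, SG_sum(sensor_id, value))
print(SE(total))   # estimates the sum (~100); duplicates cannot double-count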
80
Second moment
- kth moment: µk = Σi xi^k, where xi is the number of sensor readings (frequency) of value i.
– µ0 is the number of distinct elements.
– µ1 is the sum.
– µ2 is the square of the L2 norm (variance, skewness of the data).
- The sketch algorithm for frequency moments can be turned into an ODI synopsis easily by using ODI-SUM.
N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. STOC 1996.
81
Second moment
- Random hash h(⋅): {0, 1, …, N−1} → {−1, +1}.
- Define zi = h(i).
- Maintain X = Σi xi zi.
- E(X²) = E(Σi xi zi)² = Σi xi² E(zi²) + Σi≠j xi xj E(zi zj).
- Choose the hash function to be pairwise independent: Pr{h(i) = a, h(j) = b} = ¼.
- E(zi²) = 1; E(zi zj) = E(zi) E(zj) = 0.
- Now E(X²) = Σi xi² = µ2.
- ODI: each sensor with value i generates zi, then uses ODI-SUM to compute X.
- The final answer is X².
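A self-contained sketch of the estimator; a salted SHA-256 digest stands in for the pairwise-independent hash family, and several independent copies of X² are averaged, as one would do in practice to control the variance:

import hashlib

def sign_hash(i, seed):
    # z_i in {-1, +1}, deterministic given (seed, i).
    return 1 if hashlib.sha256(f"{seed}:{i}".encode()).digest()[0] & 1 else -1

def ams_second_moment(freqs, copies=33):
    # freqs: dict value -> frequency x_i. Average several copies of X^2.
    est = []
    for seed in range(copies):
        X = sum(x * sign_hash(i, seed) for i, x in freqs.items())
        est.append(X * X)
    return sum(est) / copies

print(ams_second_moment({1: 3, 2: 5, 7: 2}))   # true mu_2 = 9 + 25 + 4 = 38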
82
More ODI synopsis
- Distinct values
- SUM
- Second moment
- Uniform sample
- Most popular items
- Set membership --- Bloom Filter
83
Uniform sample
- Each sensor has a reading. Compute a uniform sample of a given size k.
- Synopsis: a sample of k tuples.
- SG(): output (value, r, id), where r is a uniform random number in the range [0, 1].
- SF(): output the k tuples with the largest r values. If there are fewer than k tuples in total, output them all.
- SE(): output the values in s.
- ODI-correctness is implied by the “MAX” and union operations in SF().
- Correctness: the items with the k largest random tags form a uniform sample of size k.
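A sketch of this synopsis (SG_sample, SF_sample, SE_sample are illustrative names; duplicates collapse because a duplicated tuple carries the same tag r):

import random

K_SAMPLE = 3

def SG_sample(sensor_id, value):
    return [(value, random.random(), sensor_id)]     # (value, r, id)

def SF_sample(s1, s2):
    merged = set(s1) | set(s2)                       # union removes duplicates
    return sorted(merged, key=lambda t: t[1], reverse=True)[:K_SAMPLE]

def SE_sample(s):
    return [value for value, r, sensor_id in s]

synopsis = []
for sid in range(10):
    synopsis = SF_sample(synopsis, SG_sample(sid, sid * 10))
print(SE_sample(synopsis))    # a uniform sample of 3 of the 10 readings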
84
Most popular items
- Return the k values that occur most frequently among all the sensor readings.
- Synopsis: a set of the k most popular items.
- SG(): output a (value, weight) pair, where the weight is drawn like CT(x) in the distinct counter (with k > log n levels).
- SF(): for each distinct value v, discard all but the pair with the maximum weight; then output the k pairs with the maximum weights.
- SE(): output the set of values.
- Note: we attach a weight to estimate the frequency.
- Many aggregates that can be approximated using random samples now have ODI synopses, e.g., the median.
85
Set membership: Bloom Filter
- A compact data structure to encode set containment.
- Widely used in networking applications.
- Given: n elements S = {x1, x2, …, xn}.
- Answer the query: is x in S?
- Allows a small false positive rate (an element not in S might be reported as “yes”).
86
Bloom filter
- An array of m bits.
- Insert: for x ∈ S, use k random hash functions and set the bits hj(x) to “1”.
- Query: to check whether y is in S, examine all buckets hj(y); if all are “1”, answer “yes”.
- No false negatives. Small false positive rate.
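A minimal Bloom filter sketch; m, k, and the trick of carving the k positions out of a single SHA-256 digest are illustrative choices:

import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=4):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, x):
        # Derive k bucket indices from one digest (stand-in for k hashes).
        h = hashlib.sha256(str(x).encode()).digest()
        return [int.from_bytes(h[4*j:4*j+4], "big") % self.m
                for j in range(self.k)]

    def insert(self, x):
        for p in self._positions(x):
            self.bits |= 1 << p

    def query(self, x):
        # "yes" may be a false positive; "no" is always correct.
        return all(self.bits & (1 << p) for p in self._positions(x))

bf = BloomFilter()
for x in ["car", "bird", "fire"]:
    bf.insert(x)
print(bf.query("bird"), bf.query("intruder"))   # True False (w.h.p.)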
87
Bloom filter tricks
- Union of S1 and S2:
– take the “OR” of their Bloom filters.
– ODI aggregation.
- Shrink the size to half:
– OR the first and second halves.
88
Counting bloom filter
- Handles element insertion and deletion.
- Each bucket is a counter.
- Insert: increase the counters at the hashed locations by “1”.
- Delete: decrease them by “1”.
- Be careful about counter overflow.
89
Spectral bloom filter
- Records a multi-set {x1, x2, …, xn}, where each item xi has a frequency fi.
- Insert: add fi to each hashed bucket.
- Retrieve: return the smallest bucket value among the hashed locations.
- Idea: the smallest bucket is the least likely to be polluted.
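A combined sketch of the counting and spectral variants; buckets here are unbounded Python ints, so overflow only matters with fixed-width counters:

import hashlib

class CountingBloom:
    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.counts = [0] * m

    def _positions(self, x):
        h = hashlib.sha256(str(x).encode()).digest()
        return [int.from_bytes(h[4*j:4*j+4], "big") % self.m
                for j in range(self.k)]

    def insert(self, x, f=1):      # spectral variant: insert frequency f
        for p in self._positions(x):
            self.counts[p] += f

    def delete(self, x, f=1):
        for p in self._positions(x):
            self.counts[p] -= f

    def frequency(self, x):
        # Spectral idea: the smallest bucket is the least polluted.
        return min(self.counts[p] for p in self._positions(x))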
90
Bloom filter applications
- Traditional applications:
– Dictionary, UNIX-spell checker.
- Network applications:
– Cache summaries in content delivery networks.
– Resource routing, etc.
– Read the survey for more.
- Good for sensor network setting:
– ODI, compact, many algebraic properties.
91
Conclusion
- Due to the high dynamics in sensor networks,
robust aggregates that are insensitive to order and duplication are very attractive – they provide the flexibility of using any multi-path routing algorithms and re-transmission.
- Use ODI-synopsis as black box operators to
replace naïve operators in more complex data structures.
92
Is the problem solved? NO
- Best-effort multi-path routing does not guarantee that all data have been incorporated.
– Black-box setting.
- ODI synopses translate everything to MAX, which is not robust to outliers!
– Sensor malfunction.
– Malicious attacks.
- For exemplary aggregates (MAX, MIN), the final result is a single sensor value, yet all nodes are examined.
– Can we improve?
93
CountTorrent
- To improve routing robustness, deliver each value multiple times to make sure at least one copy arrives.
– Synopsis diffusion: aggregating the same value multiple times does not result in double counting.
– CountTorrent: remember which values have been included in the aggregate, in an implicit manner.
94
How to record the members in the aggregate?
- The naive way: keep the members explicitly.
– Storage cost / communication cost too high.
– It defeats the point of aggregation.
- The implicit way:
– label the aggregates.
95
CountTorrent
- Each node has a label: a 0/1 string.
- Two nodes can have their data aggregated if their labels are the same except for the last bit.
- After aggregation, remove the last bit and assign the shortened label to the aggregated data.
- Gossip-style communication: each node exchanges its value with its neighbors.
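A sketch of this merge rule; here a pair already present under the parent label is treated as a duplicate of the same subtree aggregate, so a single copy is kept:

def consolidate(buffer):
    # buffer: dict label -> partial count, labels are 0/1 strings.
    changed = True
    while changed:
        changed = False
        for label in list(buffer):
            if not label:
                continue
            sibling = label[:-1] + ("1" if label[-1] == "0" else "0")
            if sibling in buffer:
                # Labels equal except the last bit: merge, then drop the bit.
                merged = buffer.pop(label) + buffer.pop(sibling)
                buffer[label[:-1]] = merged
                changed = True
                break
    return buffer

# Four nodes with leaf labels of a depth-2 binary tree, each counting itself:
print(consolidate({"00": 1, "01": 1, "10": 1, "11": 1}))   # {'': 4}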
96
CountTorrent example
- For any two nodes, their labels are neither the same nor is either one a prefix of the other.
- All N labels can be merged pairwise and recursively to yield ε, the empty string.
97
Aggregation
- Each node keeps a buffer of received value/label pairs.
- Consolidate: try to merge the data in the buffer.
98
How to assign labels?
- Each node is given the label of a leaf node.
99
Conclusion
- Aggregation sometimes requires careful design to trade off accuracy against storage/message size.
- Aggregation incurs information loss, making robust estimation more difficult. E.g., a single outlier reading can screw up MAX/MIN.