SLIDE 1

Distributed Data Mining: Current Pleasures and Emerging Applications

Hillol Kargupta
University of Maryland, Baltimore County and AGNIK
www.cs.umbc.edu/~hillol

Acknowledgements: Wes Griffin, Souptik Datta, Kanishka Bhaduri, Kamalika Das, Ran Wolff, Chris Giannella

Roadmap

  • Distributed Data Mining: Why Bother?
  • Some Emerging Applications
  • Local Algorithms
  • Exact Local Algorithms
  • Approximate Local Algorithms
  • Resources

SLIDE 2

Data Mining and Distributed Data Mining

  • Data Mining: Scalable analysis of data by paying careful attention to the resources:
      • computing,
      • communication,
      • storage, and
      • human-computer interaction.
  • Distributed data mining (DDM): Mining data using distributed resources.

Data Mining for Distributed and Ubiquitous Environments: Applications

  • Mining Large Databases from distributed sites
  • Grid data mining in Earth Science, Astronomy, Counter-terrorism, Bioinformatics
  • Monitoring Multiple time critical data streams
  • Monitoring vehicle data streams in real-time
  • Monitoring physiological data streams
  • Analyzing data in Lightweight Sensor Networks and Mobile devices
  • Limited network bandwidth
  • Limited power supply
  • Preserving privacy
  • Security/Safety related applications
  • Peer-to-peer data mining
  • Large decentralized asynchronous environments

SLIDE 3

Vehicles: Source of High Volume Data Streams

  • Vehicles generate tons of data
  • Hundreds of different parameters from different subsystems
  • High throughput data streams
  • So what?
      • Breakdowns cost thousands of dollars
      • Bad driving costs money: fuel, brake shoe, insurance, lawsuits

Why Mine Vehicle Data?

  • Fuel consumption analysis
  • Fleet analytics
  • Vehicle benchmarking
  • Predictive health-monitoring
  • Driver behavior analytics

High gas prices

SLIDE 4

From Concept to Commercial Product

  • First prototype -- PDA-based platform
  • Other choices:
      • Cell phones and
      • Low-cost, less powerful embedded devices
  • Market Entry Point
      • Location management companies
      • M2M companies
  • Low Cost Embedded GPS Devices
      • Resource constrained
      • 3-4K run time memory
      • 250K footprint
      • Resource sharing with GPS program

[Figure labels: Circa 2001, Circa 2005, Circa 2007]

Private & Secure Data Mining from Multi-Party Distributed Data

  • Compute global patterns without direct access to the multi-party raw distributed data
  • Minimize communication cost
  • Must come with provably correct guarantees with respect to a given privacy model
  • Must be scalable with respect to
      • number of data sites
      • size of the data
  • Privacy-preserving data mining
      • Blends in "pattern-preserving" transformations with data analysis

SLIDE 5

How PURSUIT Works for the User

  • Need to have your own sensor such as SNORT, MINDS
  • Download PURSUIT plug-in for the sensor and install
  • PURSUIT plug-in offers
      • A stand-alone interface for processing your alerts from the sensor and cross-domain analysis
      • Web account for detailed cross-domain statistics
      • Optional distributed collaboration management module for managing the threats and archiving forensics

PURSUIT Web Site

SLIDE 6

Peer-to-peer (P2P) Networks

  • Relies primarily on the computing resources of the participants in the network rather than a relatively low number of servers.
  • P2P networks are typically used for connecting nodes via largely ad hoc connections.
  • No central administrator/coordinator
  • Peers simultaneously function as both "clients" and "servers"
  • Privacy is an important issue in most P2P applications

Where do we find P2P Networks?

  • Applications:
      • File-sharing networks: KaZaA, Napster, Gnutella
      • P2P network storage, web caching,
      • P2P bio-informatics,
      • P2P astronomy,
      • P2P Information retrieval
      • P2P Sensor Networks?
      • P2P Mobile Ad-hoc NETwork (MANET)?
  • Next Generation:
      • P2P Search Engines, Social Networking, Digital libraries, P2P "YouTube"?

SLIDE 7

P2P Web Mining

  • Web mining in a server-less environment

Useful Browser Data

  • Web-browser history
  • Browser cache
  • Click-stream data stored at browser (browsing pattern)
  • Search queries typed in the search engine
  • User profile
  • Bookmarks
  • Challenges
      • Indexing, clustering, data analysis in a decentralized asynchronous manner
      • Scalability
      • Privacy

SLIDE 8

P2P NASA Astronomy Data Mining

  • Virtual Observatories
  • Client-server architecture
  • Consider Sloan Digital Sky Survey:
      • 2M hits per month
      • traffic is doubling every 15 months
  • Need better scalability
  • MyDB: Download and locally manage your data
  • Network of such databases
  • Searching, clustering, and outlier detection in P2P virtual observatory data network.
  • NASA AIST Project at UMBC

DDM Applications: Typical Characteristics

  • Distributed computing environment
  • Heterogeneous communication links with bandwidth constraints
  • Distributed data
  • Continuous data streams
  • Multi-party data, sometimes privacy sensitive (difficult to centralize)
  • Server-free networks
  • Resource constraints (e.g. energy consumption)

SLIDE 9

Data Communication

  • Case I: Participating nodes are connected by high speed networks and efficient redistribution of data is possible.
  • Case II: Nodes are connected by low speed networks and data redistribution is difficult to support.
  • Case III: Combination of I and II.

Global Function Computation

  • Each vertex $v_i$ in a graph holds an input $X_{v_i}$
  • Compute some global function $f(X_{v_1}, X_{v_2}, \dots, X_{v_n})$

SLIDE 10

Locality Sensitive Computation

  • Global vs. Local
  • Main problems of the global algorithms:
      • Every node needs to maintain information about the entire network
      • Maintaining this information is resource intensive for large networks

Data Mining as Function Computation

  • Most data mining problems can be viewed as function computations
  • Examples
      • Classification
      • Predictive modeling
      • Clustering
      • Outlier detection

SLIDE 11

DDM: Defining the Problem

  • Let $G = (V, E)$ be a graph
  • Let $\Omega_k$ be the set of all neighboring nodes of the k-th node $v_k \in V$
  • Need a decomposable representation where $f(V)$ can be computed from "locally" computed functions $\Phi_k(\Omega_k)$
  • Example: $f(V) = \sum_k w_k \Phi_k(\Omega_k)$
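
A minimal sketch of a decomposable representation, assuming the global function is the mean of all values in the network. Each site computes a local statistic (here its local mean, playing the role of $\Phi_k$) and the global value is the weighted sum with $w_k = n_k / N$. The function names and toy data are hypothetical, and each $\Phi_k$ here depends only on the site's own data, which is the simplest special case of a neighborhood.

```python
def local_statistic(values):
    """Phi_k: a locally computed function (the local mean) plus the local count."""
    return sum(values) / len(values), len(values)


def global_mean(sites):
    """f(V) = sum_k w_k * Phi_k, with weights w_k = n_k / N."""
    stats = [local_statistic(v) for v in sites]
    total = sum(n for _, n in stats)
    return sum((n / total) * phi for phi, n in stats)


# Toy data: three sites holding different numbers of values.
sites = [[1.0, 2.0, 3.0], [10.0], [4.0, 6.0]]
centralized = sum(x for s in sites for x in s) / sum(len(s) for s in sites)
print(global_mean(sites), centralized)
```

Only one (mean, count) pair per site has to travel, instead of the raw data.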

Homogeneous Data Sites

Site 1:

Account Number   Amount   Location   History   Earning
11992346         99.84    Seattle    Good      High
12999333         29.33    Seattle    Good      High
45633341         34.89    Portland   Okay      Low
55567999         980      Spokane    Good      Low

Site 2:

Account Number   Amount   Location   History   Earning
87992364         20       Chicago    Good      Low
67845921         447      Urbana     Good      Low
85621341         19.78    Chicago    Okay      High
95345998         800      Peoria     Bad       High

Different sites observe same features for different events

SLIDE 12

Heterogeneous Sites

State   Movie                   Rating   Revenue
WA      Hyper Space             A+       6M
ID      Once Upon a Time        B-       2M
BC      The King and the Liar   B+       8M
CA      The Shepard             A-       10M

City        State   Size     Avg. earning   Teen pop.
Lewiston    ID      Small    Low            5K
Spokane     WA      Medium   Medium         30K
Seattle     WA      Large    High           250K
Portland    OR      Large    High           200K
Vancouver   BC      Medium   Medium         199K

Different sites observing different feature sets

Distributed Randomized Inner Product Computation

  • Node 1 computes $Z_{1,k}$
      • $Z_{1,k} = A_1 J_1 + \dots + A_n J_n$
      • $J_i \in \{+1, -1\}$ with uniform probability
  • Node 2 calculates $Z_{2,k}$
      • $Z_{2,k} = B_1 J_1 + \dots + B_n J_n$
  • Compute $Z_{1,k} \cdot Z_{2,k}$ for a few times and take the average

[Figure: Node 1 (A1, A2, ..., An) and Node 2 (B1, B2, ..., Bn) share a random seed generator and produce Z1,k and Z2,k]
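
The scheme above can be sketched as follows. Because $E[J_i J_j]$ is 1 when $i = j$ and 0 otherwise, the expected value of $Z_{1,k} \cdot Z_{2,k}$ is the inner product of A and B. The function names, the seed value, and the number of repeats are illustrative assumptions, not taken from the slides.

```python
import random


def random_projections(vector, seed, repeats):
    """Return Z_k = sum_i vector[i] * J_i for `repeats` independent sign vectors J.

    Both nodes must call this with the same seed so that they draw identical J.
    """
    rng = random.Random(seed)
    projections = []
    for _ in range(repeats):
        signs = [rng.choice((-1, 1)) for _ in vector]
        projections.append(sum(a * j for a, j in zip(vector, signs)))
    return projections


def estimate_inner_product(z1, z2):
    """Average of Z_{1,k} * Z_{2,k} over all repeats."""
    return sum(p * q for p, q in zip(z1, z2)) / len(z1)


a = [1.0, 2.0, 3.0]   # held by node 1
b = [4.0, 5.0, 6.0]   # held by node 2
shared_seed = 12345   # hypothetical seed agreed on by both nodes
z1 = random_projections(a, shared_seed, repeats=20000)
z2 = random_projections(b, shared_seed, repeats=20000)
estimate = estimate_inner_product(z1, z2)
exact = sum(x * y for x, y in zip(a, b))
print(estimate, exact)
```

The nodes exchange only the projections, so the saving matters when the number of repeats is much smaller than the vector length n; the toy vectors here are short only to keep the example readable.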

SLIDE 13

Locality Sensitive Distributed Algorithms

  • Global algorithms: Know everything about the entire network
      • Every node needs to maintain information about the entire network
      • Maintaining this information is resource intensive for large networks
  • Local algorithms: Communicate only with the local neighborhood.
  • Does locality imply efficiency?

Bounded Communication Local Algorithms

  • Every node communicates with its local neighborhood bounded by path-length of $\alpha$
  • In addition, the total amount of communication with its neighbors is also bounded by some $\gamma$
  • $(\alpha, \gamma)$-Local algorithms

SLIDE 14

Approaches

  • Function computation through decomposable representations
  • Approximations
  • Randomized techniques
  • Sampling-based approximations
  • Variational approximations
  • Exact decompositions
  • Deterministic techniques

Approximation

  • Estimate $\Phi_k(\Omega_k)$
  • Cardinal sampling
  • Ordinal relaxation
      • Interested in constructing an ordering
      • Find the ones that rank high

SLIDE 15

Sampling in Distributed Environments

[Figure: network of peers p1 to p7]

  • A uniform data sample is often a good representative of the data
  • Collecting a uniform sample in asynchronous networks is challenging
      • Varying degrees of nodes
      • Skewed data distribution

Challenges

  • How to collect a random-uniform sample of data from an asynchronous network?
  • How to make sampling communication-efficient and fast?
  • Asynchronous algorithms

SLIDE 16

Varying Degrees of Connectivity in Large Communication Networks

Source: Stefan Saroiu, Krishna P. Gummadi, and Steven D. Gribble. Measuring and analyzing the characteristics of Napster and Gnutella hosts. Multimedia Syst., 2003

Problem Definition: Uniform Data Sampling

  • Data is homogeneously distributed among peers: $X = X_1 \cup X_2 \cup \dots \cup X_n$
  • Data distribution is non-uniform: $|X_i| \neq |X_j|$ for $i \neq j$
  • Uniform sampling of peers results in biased data sampling
  • Problem: How to collect a uniform-random sample x of total data X from the network?

SLIDE 17

Random Walk and Markov Chain

  • Random walk on Graph: visits nodes in a sequence where at each step, the next destination node is selected using transition probability of current node (Markov process): $\pi(t+1)^T = \pi(t)^T P$
      • P = Transition Probability Matrix
      • $\pi(t)$ = Probability Distribution of State at t
      • i-th element of stationary distribution: $\pi_i = d_i / 2m$
  • Mixing-time of Markov Chain: Length of walk to converge to stationary distribution [Sinclair, 1992]: $\tau = O(\log(n) / (1 - |\lambda_2|))$
      • $|\lambda_2|$ = SLEM (Second Largest Eigenvalue Modulus) of P
      • Aperiodic graphs
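
A small sketch of the update $\pi(t+1)^T = \pi(t)^T P$ for a simple random walk, assuming $d_i$ is the degree of node i and m is the number of edges. The four-node graph is a made-up example (a triangle with one pendant node, so the walk is aperiodic); iterating the update from any start should approach $d_i / 2m$.

```python
def random_walk_matrix(adjacency):
    """Simple random walk: from node i, move to each neighbor with probability 1/d_i."""
    n = len(adjacency)
    P = [[0.0] * n for _ in range(n)]
    for i, nbrs in adjacency.items():
        for j in nbrs:
            P[i][j] = 1.0 / len(nbrs)
    return P


def step(pi, P):
    """One application of pi(t+1)^T = pi(t)^T P."""
    n = len(pi)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]


# Hypothetical toy graph: triangle 0-1-2 plus a pendant node 3 attached to node 2.
adjacency = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
P = random_walk_matrix(adjacency)

pi = [1.0, 0.0, 0.0, 0.0]          # walk starts at node 0
for _ in range(200):
    pi = step(pi, P)

m = sum(len(nbrs) for nbrs in adjacency.values()) // 2
expected = [len(adjacency[i]) / (2 * m) for i in range(len(adjacency))]
print(pi, expected)
```

The stationary distribution favors high-degree nodes, which is why the plain walk does not sample peers uniformly.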

Uniform Sampling of Peers in P2P

  • Random walk with degree correction helps uniform sampling of peers
      • Maximum Degree
      • Metropolis-Hastings
      • Random-weight Distribution
  • Metropolis-Hastings:
      • Source node applies modified r.w. of length $L_{walk}$ to pick up one peer uniformly
      • Walk-length $L_{walk} = O(\log(\text{Total Network Size}))$
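
A minimal sketch of a degree-corrected walk, assuming the common Metropolis-Hastings rule for a uniform target: move to neighbor j with probability $1/\max(d_i, d_j)$ and stay put with the leftover probability. The slides do not show the exact rule, so this choice, the toy graph, and the function names are assumptions. The resulting matrix is symmetric and doubly stochastic, so its stationary distribution is uniform over peers.

```python
from fractions import Fraction
import random


def metropolis_hastings_matrix(adjacency):
    """P[i][j] = 1/max(d_i, d_j) for neighbors; leftover mass is a self-loop at i."""
    n = len(adjacency)
    P = [[Fraction(0)] * n for _ in range(n)]
    for i, nbrs in adjacency.items():
        for j in nbrs:
            P[i][j] = Fraction(1, max(len(nbrs), len(adjacency[j])))
        P[i][i] = 1 - sum(P[i])
    return P


def sample_peer(P, source, walk_length, rng):
    """Run one walk of `walk_length` steps from `source` and return the final peer."""
    node = source
    peers = range(len(P))
    for _ in range(walk_length):
        node = rng.choices(peers, weights=[float(p) for p in P[node]])[0]
    return node


# Hypothetical toy graph: triangle 0-1-2 plus a pendant node 3 attached to node 2.
adjacency = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
P = metropolis_hastings_matrix(adjacency)
rng = random.Random(0)
picked = sample_peer(P, source=0, walk_length=25, rng=rng)
print(picked)
```

Exact fractions are used only so that symmetry and the row and column sums can be checked without rounding error.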

SLIDE 18

Sampling of Peers by Random Walk

Data Sampling Concept: Virtual Network

  • Transform to virtual network with single data-tuple per virtual node
  • Data-tuples held by same real node are 'fully-connected'.
  • Apply Metropolis-Hastings on virtual network.
  • Communication saved on 'virtual-walk' on internal links.

[Figure: virtual nodes (V) inside real nodes N1 and N2]

SLIDE 19

Metropolis-Hastings on Virtual Graph

  • To achieve uniformity, P should meet the following conditions on virtual graph: Symmetric, Non-negative, Doubly-stochastic
  • Transition probability between data-tuple K and L on virtual graph: [equation not recoverable from the source]

Algorithm

  • Initialization: Each node $N_i$ knows
      • Immediate neighborhood: $\Gamma(i)$
      • Total data-size of neighbors: $\sum_{j \in \Gamma(i)} n_j$
  • Transition Probability on real graph at $N_i$: Case I, Case II, Case III [equations not recoverable from the source]
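
The three transition cases did not survive extraction, so the following is only one plausible reading, not the authors' formulas. It assumes each tuple is linked to every other tuple on its own node and to every tuple on neighboring nodes, giving virtual degree $D_i = n_i - 1 + \sum_{j \in \Gamma(i)} n_j$, and applies the $1/\max(D_i, D_j)$ Metropolis-Hastings rule. The three outcomes computed below (stay on the same tuple, move to another tuple on the same node, move to a tuple on a neighbor) are an assumed mapping onto the cases; all names and data are hypothetical.

```python
from fractions import Fraction


def virtual_degree(i, sizes, adjacency):
    """Degree of any tuple held by node i in the assumed virtual graph."""
    return sizes[i] - 1 + sum(sizes[j] for j in adjacency[i])


def transition_from(i, sizes, adjacency):
    """Probabilities for a walk currently sitting on some tuple held by node i.

    Returns (stay_on_tuple, move_within_node, {neighbor j: move to some tuple at j}).
    """
    d_i = virtual_degree(i, sizes, adjacency)
    within = Fraction(sizes[i] - 1, d_i)          # internal links: no network traffic
    to_neighbor = {}
    for j in adjacency[i]:
        d_j = virtual_degree(j, sizes, adjacency)
        to_neighbor[j] = Fraction(sizes[j], max(d_i, d_j))
    stay = 1 - within - sum(to_neighbor.values())  # self-loop absorbs the remainder
    return stay, within, to_neighbor


# Hypothetical star network: node 0 holds 5 tuples, node 1 holds 1, node 2 holds 2.
sizes = {0: 5, 1: 1, 2: 2}
adjacency = {0: [1, 2], 1: [0], 2: [0]}
for i in sizes:
    print(i, transition_from(i, sizes, adjacency))
```

Each node needs only its own data size and the sizes and virtual degrees of its neighbors, which matches the initialization knowledge listed above plus one extra exchange of degrees (an assumption of this sketch).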

SLIDE 20

Performance Analysis

  • Arbitrarily selected 'source-node' ($N_S$) launches s random walks
  • Random-walk terminates after $L_{walk}$ steps: $L_{walk} = O(\log(\text{Datasize}) / (1 - |\lambda_2|))$
  • The data-tuple $t_i$ on which walk terminates marked as a uniform random sample
  • $t_i$ sent back to $N_S$

Estimating Random-walk length

  • $L_{walk} = O(\log(\text{Datasize}) / \text{spectral gap})$
  • Spectral gap = $1 - |\lambda_2|$
  • Source-node can over-estimate datasize
      • Logarithmic effect on the walk-length
  • Computing 'spectral-gap' exactly is communication and computation intensive.
      • A lower-bound of spectral-gap gives upper-bound of walk-length

SLIDE 21

Bounding the Spectral-gap

  • Neighbor data ratio is important: $\rho_i = \sum_{j \in \Gamma(i)} n_j / n_i$
  • For a network of size n, lower bound of spectral gap $1 - |\lambda_2|$ in terms of the $\rho_i$: [inequality not recoverable from the source]
  • For each node, if $\rho_i \geq \rho_T$, a universal threshold value: [inequality not recoverable from the source]
  • If $\rho_i = O(n)$ for all nodes, then $1 / (1 - |\lambda_2|) = O(1)$
  • Hence, $L_{walk} = O(\log(\text{Datasize}))$

Effect on Communication Topology

  • For all nodes $N_i$ in the network, $\rho_i = O(n)$ implies: Total Data Contained by Neighbors ($n_j$ in $\Gamma_i$) ≥ O(n) times local data
  • Real world network data distribution often follows power-law ("Measuring and Analyzing the Characteristics of Napster and Gnutella hosts" by Stefan Saroiu et al., 2003)
  • The majority of the data is held by a few peers forming a 'data-hub'
  • Peers with a small amount of local data that connect to the 'data-hub' achieve an O(n) neighbor data ratio
  • Communication topology: A central hub consisting of a few peers sharing most of the data, with the rest of the peers, which share little data, directly connected to this hub.

SLIDE 22

Communication Topology

Communication Complexity

  • Communication cost
      1. Discover a uniform sample
      2. Transport sampled data to $N_S$
  • Assumption: Network protocol takes care of the peer-to-peer communication between two nodes
  • P2P-Sampling Initialization Cost: 2×|E| integer bytes
  • Communication to discover one sample: $\bar{\alpha} \times L_{walk} \times (\bar{d} + 2)$ integer bytes
      • $\bar{d}$ = Average degree of connectivity = Constant
      • $\bar{\alpha}$ = Average probability of going to a different node in one step of random walk ($1 \geq \bar{\alpha} \geq 0$)
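
The two cost expressions above can be evaluated directly. The sketch below only restates them as functions; the function names and the example numbers are made up for illustration and are not measurements from the experiments.

```python
def initialization_cost(num_edges):
    """P2P-Sampling initialization cost: 2 * |E| integers."""
    return 2 * num_edges


def discovery_cost(alpha_bar, walk_length, avg_degree):
    """Cost to discover one sample: alpha_bar * L_walk * (d_bar + 2) integers."""
    if not 0.0 <= alpha_bar <= 1.0:
        raise ValueError("alpha_bar is a probability and must lie in [0, 1]")
    return alpha_bar * walk_length * (avg_degree + 2)


# Hypothetical numbers: 10 edges, walk of 40 steps, average degree 4,
# and half of the steps actually crossing to a different node.
print(initialization_cost(10))
print(discovery_cost(alpha_bar=0.5, walk_length=40, avg_degree=4))
```

Steps that stay on the same node cost nothing, which is why the walk length is scaled by $\bar{\alpha}$.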

SLIDE 23

Experimental Setup

  • Network topology generated by
      • BRITE (Boston University Representative Internet Topology Generator)
      • Router level Barabasi-Albert model for power-law topology
  • P2P network with 1,000 or more nodes
  • Total data = 40×network size
  • Arbitrarily selected node conducts P2P-Sampling
  • Data distribution: Non-uniformly distributed

Uniformity of Sampling

[Figure: probability of selection (axis scaled by 10^-5) versus datapoint index (axis scaled by 10^4) for walk length = 5 log(10^5), compared with the theoretical selection probability]

SLIDE 24

Ordinal Relaxation

  • Let X be a continuous random variable
  • Let $\xi_p$ be the population percentile of order p, i.e. $\Pr\{x \leq \xi_p\} = p$
  • Let $x_1 < x_2 < \dots < x_N$ be N independent samples from X
  • We have $\Pr\{x_N > \xi_p\} > q \Rightarrow N \geq \lceil \log(1 - q) / \log p \rceil$
  • Example:
      • q = 95% and p = 80% give N = 14
      • If we took 14 independent samples from any distribution, we can be 95% confident that 80% of the population would be below $x_{14}$.
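
A short check of the sample-size bound, assuming the usual argument that the largest of N independent samples falls below $\xi_p$ with probability $p^N$, so $\Pr\{x_N > \xi_p\} = 1 - p^N$. The function names are illustrative.

```python
import math


def ordinal_sample_size(q, p):
    """Sample size from the bound N >= ceil(log(1 - q) / log(p))."""
    return math.ceil(math.log(1.0 - q) / math.log(p))


def confidence(n_samples, p):
    """Pr{largest of n_samples exceeds the p-th percentile} = 1 - p**N."""
    return 1.0 - p ** n_samples


n = ordinal_sample_size(q=0.95, p=0.80)
print(n, confidence(n, 0.80), confidence(n - 1, 0.80))
```

The required sample size depends only on q and p, not on the size of the population, which is what makes the ordinal approach attractive in a large network.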

Ordinal Inner Product Computation

  • Each node has a vector $X_i$
  • Compute the Inner Product Matrix
  • Every node needs $X_i$ from every node.
  • How about finding just the top-k entries of the inner product matrix?

SLIDE 25

Ordinal Identification of Significant Entries from the Inner Product Matrix

[Figure: value of An versus sample size (10 to 70), comparing An from the distributed algorithm, An from the centralized algorithm, and the actual top p percentile]

Bhaduri, Das, Kargupta. (2006). An Ordinal Approach for Detecting Feature-Interaction in a Peer-to-Peer Network

Majority Vote Computation Algorithm

  • Each node has a number
  • Check if the summation of the numbers at all nodes is greater than or equal to 0.
  • Another variant: Check if the sum is greater than a certain threshold.

SLIDE 26

Notations

  • $P_1, \dots, P_n$: set of peers
  • $P_i$'s local vectors:
      • $S_i$: data at time t
      • $X_{ij}$: sent by $P_i$ to $P_j$
      • $K_i$: knowledge ($S_i + \sum X_{ji}$)
      • $A_{ij}$: agreement ($X_{ij} + X_{ji}$)
      • $W_{ij}$: withheld ($K_i - A_{ij}$)
  • G: average of all peers

[Figure: W_ij, A_ij, and K_i; all vector computations are local to a peer]

One-dimensional Example: Majority Vote

  • Input to $P_i$: a real number ($S_i$)
  • Goal: Find if $\sum S_i > 0$
  • Output: 1 if $K_i > 0$, 0 otherwise
  • Simple stopping rule:
      • If ($A_{ij} > 0$ and $A_{ij} > K_i$) Communicate
      • If ($A_{ij} < 0$ and $A_{ij} < K_i$) Communicate
  • If communicate
      • Set $X_{ij} = K_i - X_{ji}$
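
A toy, single-process simulation of the rule above on a tree-shaped network. Two choices here are assumptions rather than the slide's exact statement: the first test uses $A_{ij} \geq 0$ (with all messages initially zero, a strict test would never trigger the first message), and the output is 1 when $K_i \geq 0$, matching the "greater than or equal to 0" variant of the vote. Peers are visited round-robin until nobody needs to send; the function name and inputs are hypothetical.

```python
def local_majority(values, edges, max_sweeps=1000):
    """Simulate the local majority vote on a tree; returns (outputs, messages sent)."""
    n = len(values)
    nbrs = {i: [] for i in range(n)}
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    # X[(i, j)] is the last message sent by peer i to peer j (initially nothing sent).
    X = {(i, j): 0.0 for i in nbrs for j in nbrs[i]}

    def knowledge(i):
        """K_i = S_i + sum of messages received from neighbors."""
        return values[i] + sum(X[(j, i)] for j in nbrs[i])

    messages = 0
    for _ in range(max_sweeps):
        sent = False
        for i in range(n):
            for j in nbrs[i]:
                k_i = knowledge(i)
                a_ij = X[(i, j)] + X[(j, i)]          # agreement with neighbor j
                if (a_ij >= 0 and a_ij > k_i) or (a_ij < 0 and a_ij < k_i):
                    X[(i, j)] = k_i - X[(j, i)]       # after this, A_ij equals K_i
                    messages += 1
                    sent = True
        if not sent:                                  # every peer is silent: done
            break
    return [1 if knowledge(i) >= 0 else 0 for i in range(n)], messages


path = [(0, 1), (1, 2)]
print(local_majority([1, 1, 1], path))     # all peers already agree
print(local_majority([-3, 1, 1], path))    # sum is negative
```

When every peer's input already points the same way, no message is sent at all, which is the sense in which the algorithm is local.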

SLIDE 27

Applications: L2 Norm Monitoring

  • Initial setup: each peer has
      • A data vector
  • Monitoring Problem:
      • is $\|G\| < \varepsilon$ ?

Local L2 Norm Monitoring Algorithm

  • Initial setup: each peer has
      • A data vector
      • Some global pattern vector
  • Monitoring Problem:
      • Is the L2 norm of the difference between the average data vector and the pattern vector greater than a given constant $\varepsilon$?
  • Applications:
      • Centroid monitoring
      • Eigenvector monitoring

SLIDE 28

Specifications: L2 Norm Monitoring

  • Region of F true ($R_{in}$): inside circle, convex
  • Region of F false: outside circle, non-convex
  • Use tangent planes ($R_i$) to cover domain
  • $C = \{R_{in}, R_1, R_2, \dots\}$

[Figure: circle $R_{in}$ and tangent-plane regions $R_i$]

Local Vectors

  • For peer $P_i$
      • Own estimate of global average (X)
      • Agreement with neighbor $P_j$ (Y)
      • Withheld knowledge w.r.t. neighbor $P_j$ (Z = X - Y)

SLIDE 29

Theorem

  • If for every peer and each of its neighbours both the agreement and the withheld knowledge are in a convex shape (here a circle), then so is the global average
  • Wolff, Bhaduri, Kargupta, 2005
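
A minimal sketch of checking the theorem's premise at a single peer, assuming the convex region is the circle of radius $\varepsilon$ around the origin and the withheld knowledge is $W_{ij} = K_i - A_{ij}$ as in the notation slide. It covers only the local test as stated here, not the full monitoring algorithm (which may weight the vectors); the names and data are hypothetical.

```python
import math


def inside_circle(vector, epsilon):
    """True if the L2 norm of `vector` is below epsilon."""
    return math.sqrt(sum(x * x for x in vector)) < epsilon


def local_condition_holds(knowledge, agreements, epsilon):
    """Check the premise at one peer.

    knowledge:  K_i, the peer's own estimate
    agreements: {neighbor j: A_ij}
    """
    for a_ij in agreements.values():
        w_ij = [k - a for k, a in zip(knowledge, a_ij)]   # withheld knowledge
        if not (inside_circle(a_ij, epsilon) and inside_circle(w_ij, epsilon)):
            return False
    return True


k_i = [0.2, 0.1]
agreements = {1: [0.1, 0.1], 2: [0.3, -0.1]}
print(local_condition_holds(k_i, agreements, epsilon=1.0))
print(local_condition_holds(k_i, agreements, epsilon=0.15))
```

If the test passes at every peer, the theorem says the global average lies in the same region, so no peer needs to send anything.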

L2 Experimental Setup

  • Simulator
      • Distributed Data Mining Toolkit (DDMT)
  • Topology
      • BRITE Internet Topology generator
      • Realistic edge delays
  • Input data
      • Mixture of correlated Gaussians in $R^d$ with 10% noise
      • Epoch change: Change of the means of Gaussians at fixed time intervals

SLIDE 30

L2 Experimental Setup

  • Quality: Percentage of peers correctly computing alert:
      • $\|K\| < \varepsilon$ when $\|G\| < \varepsilon$
      • $\|K\| > \varepsilon$ when $\|G\| > \varepsilon$
  • Cost: Messages per peer per unit of leaky bucket

L2 Experiments: Results

[Figure: quality and cost plots with epoch changes marked; for broadcast-based algorithms normalized messages = 2]

SLIDE 31

L2 Experiments: Scalability

[Figure: quality and cost plots]

Resources

  • DDMWiki (http://www.umbc.edu/ddm/wiki/)
  • DDMBib (http://www.cs.umbc.edu/~hillol/DDMBIB/)
  • Full DDM Course Web Site (http://www.cs.umbc.edu/~hillol/CLASSES/DDM)