

Distributed Data Mining: Current Pleasures and Emerging Applications. Hillol Kargupta, University of Maryland, Baltimore County and AGNIK.


  1. Distributed Data Mining: Current Pleasures and Emerging Applications
     - Hillol Kargupta, University of Maryland, Baltimore County and AGNIK
     - www.cs.umbc.edu/~hillol
     - Acknowledgements: Wes Griffin, Souptik Datta, Kanishka Bhaduri, Kamalika Das, Ran Wolff, Chris Giannella

     Roadmap
     - Distributed Data Mining: Why Bother?
     - Some Emerging Applications
     - Local Algorithms
     - Exact Local Algorithms
     - Approximate Local Algorithms
     - Resources

  2. Data Mining and Distributed Data Mining
     - Data Mining: Scalable analysis of data by paying careful attention to the resources:
       - computing,
       - communication,
       - storage, and
       - human-computer interaction.
     - Distributed data mining (DDM): Mining data using distributed resources.

     Data Mining for Distributed and Ubiquitous Environments: Applications
     - Mining large databases from distributed sites
       - Grid data mining in Earth Science, Astronomy, Counter-terrorism, Bioinformatics
     - Monitoring multiple time-critical data streams
       - Monitoring vehicle data streams in real time
       - Monitoring physiological data streams
     - Analyzing data in lightweight sensor networks and mobile devices
       - Limited network bandwidth
       - Limited power supply
     - Preserving privacy
       - Security/Safety related applications
     - Peer-to-peer data mining
       - Large decentralized asynchronous environments

  3. Vehicles: Source of High Volume Data Streams
     - Vehicles generate tons of data
     - Hundreds of different parameters from different subsystems
     - High throughput data streams
     - So what?

     Why Mine Vehicle Data?
     - Fuel consumption analysis
     - Fleet analytics
     - Vehicle benchmarking
     - Predictive health-monitoring
     - Driver behavior analytics

     Slide callouts:
     - High gas prices
     - Breakdowns cost thousands of dollars
     - Bad driving costs money: fuel, brake shoe, insurance, law-suits

  4. From Concept to Commercial Product
     - First prototype (circa 2001): PDA-based platform
       - Other choices:
         - Cell phones and
         - Low-cost, less powerful embedded devices
     - Market Entry Point (circa 2005)
       - Location management companies
       - M2M companies
     - Low Cost Embedded GPS Devices (circa 2007)
       - Resource constrained
       - 3-4K run time memory
       - 250K footprint
       - Resource sharing with GPS program

     Private & Secure Data Mining from Multi-Party Distributed Data
     - Compute global patterns without direct access to the multi-party raw distributed data
     - Minimize communication cost
     - Must come with provably correct guarantees with respect to a given privacy model
     - Must be scalable with respect to
       - number of data sites
       - size of the data
     - Privacy-preserving data mining
       - Blends in "pattern-preserving" transformations with data analysis
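The multi-party goal above (a global pattern computed without direct access to each party's raw data) can be illustrated with a ring-based secure sum. This is only a sketch of one well-known textbook protocol, not necessarily the method the slides have in mind. It assumes semi-honest, non-colluding sites, non-negative integer inputs, and a true total smaller than the modulus. The function name `secure_sum` and the `modulus` parameter are hypothetical, and the loop simulates in one process what would really be messages passed from site to site.

```python
import secrets


def secure_sum(local_values, modulus=2**32):
    """Simulate a ring-based secure sum over one private integer per site.

    Assumptions (not from the slides): sites are semi-honest and do not
    collude, every value lies in [0, modulus), and the true total is
    smaller than modulus.
    """
    if not local_values:
        raise ValueError("need at least one site")
    if any(v < 0 or v >= modulus for v in local_values):
        raise ValueError("each value must lie in [0, modulus)")

    # Site 0 hides its value behind a uniformly random mask before
    # passing the running total to the next site in the ring.
    mask = secrets.randbelow(modulus)
    running = (mask + local_values[0]) % modulus

    # Each later site sees only a masked running total, adds its own
    # value, and forwards the result.
    for value in local_values[1:]:
        running = (running + value) % modulus

    # The total returns to site 0, which removes the mask.
    return (running - mask) % modulus
```

Each site sends exactly one message, so communication grows linearly with the number of data sites, which matches the slide's emphasis on low communication cost and scalability. The privacy guarantee holds only under the stated non-collusion assumption: two neighbours of a site can recover its value by comparing what they sent and received.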

  5. How PURSUIT Works for the User
     - Need to have your own sensor such as SNORT, MINDS
     - Download PURSUIT plug-in for the sensor and install
     - PURSUIT plug-in offers
       - A stand-alone interface for processing your alerts from the sensor and cross-domain analysis
       - Web account for detailed cross-domain statistics
       - Optional distributed collaboration management module for managing the threats and archiving forensics

     PURSUIT Web Site

  6. Peer-to-peer (P2P) Networks
     - Relies primarily on the computing resources of the participants in the network rather than a relatively low number of servers.
     - P2P networks are typically used for connecting nodes via largely ad hoc connections.
     - No central administrator/coordinator
     - Peers simultaneously function as both "clients" and "servers"
     - Privacy is an important issue in most P2P applications

     Where do we find P2P Networks?
     - Applications:
       - File-sharing networks: KaZaA, Napster, Gnutella
       - P2P network storage, web caching,
       - P2P bio-informatics,
       - P2P astronomy,
       - P2P information retrieval
       - P2P sensor networks?
       - P2P Mobile Ad-hoc NETwork (MANET)?
     - Next Generation:
       - P2P search engines, social networking, digital libraries, P2P "YouTube"?

  7. P2P Web Mining
     - Web mining in a server-less environment
     - Useful browser data
       - Web-browser history
       - Browser cache
       - Click-stream data stored at browser (browsing pattern)
       - Search queries typed in the search engine
       - User profile
       - Bookmarks
     - Challenges
       - Indexing, clustering, data analysis in a decentralized asynchronous manner
       - Scalability
       - Privacy
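The challenge of data analysis "in a decentralized asynchronous manner" can be made concrete with pairwise gossip averaging, where peers estimate a global mean with no central coordinator. This is a minimal sketch of a standard technique, not an algorithm taken from the slides. It assumes a fully connected overlay and reliable links, and it simulates the asynchronous exchanges one at a time in a single process. The function name `gossip_average` and its parameters are hypothetical.

```python
import random


def gossip_average(values, rounds=2000, seed=0):
    """Simulate pairwise gossip averaging among peers.

    Each peer starts with its own local value. In every round two
    distinct peers are picked at random and both replace their estimates
    with the pair's mean. The sum of all estimates is preserved, so the
    estimates drift toward the global average.
    """
    rng = random.Random(seed)  # fixed seed keeps the simulation repeatable
    estimates = [float(v) for v in values]
    n = len(estimates)
    if n < 2:
        return estimates  # a single peer already knows its own mean

    for _ in range(rounds):
        i, j = rng.sample(range(n), 2)  # one random pairwise exchange
        mean = (estimates[i] + estimates[j]) / 2.0
        estimates[i] = mean
        estimates[j] = mean
    return estimates
```

For example, `gossip_average([1, 2, 3, 4, 10])` returns five estimates that are all close to the global mean of 4. Each exchange involves only two peers and two numbers, which is why this style of computation suits large P2P networks. Note that it trades exactness for simplicity and, as written, reveals each peer's current estimate to its partner, so it does not address the privacy challenge by itself.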
