Modeling Highly Heterogeneous (Large) Data Sets: Towards a Billion - PowerPoint PPT Presentation

Modeling Highly Heterogeneous (Large) Data Sets: Towards a Billion Models Robert Grossman University of Illinois at Chicago & Open Data Group

Traditional Statistics
• One small data set
• A few attributes
• Vector-valued data

Data Mining
• Few large data sets
• Many attributes
• Complex data

But Large Data Is Not Homogeneous

              Statistics   Large Data (Today)   Large, Highly Heterogeneous Data (Tomorrow)
Data          Small        Large                Large
Attributes    Few          Many                 Many
Structure     Vector       Complex              Complex
Populations   One          Several              Many

Features vs. Model Parameters
• Today, we can manage one billion feature vectors.
• Our interest: one billion models.
[Diagram: feature vectors vs. model parameters]

Progress to Date
[Figure: number of models on a scale of 1, 10, 100, 1000, 10E4, 10E5, 10E6, 10E9. Labels: single model, ensembles of models, (homogeneous) models, manually segmented models, machine segmented models, highly heterogeneous models (?)]

Example 1 - 42,000 Models
• Is the traffic speed and volume today (Tuesday, May 15, 4:30 pm, no rain) different than the baseline model?
• Separate model for 7 days x 24 hours x 250 locations = 42,000 models
• 833 road sensors
• weather data (images, XML)
• text data about special events
[Diagram label: Anomalies]

GLR Change Detection Algorithms (Single Model)
[Diagram: baseline model vs. observed model, β]
• Sequence of events x[1], x[2], x[3], …
• Question: is the observed distribution different than the baseline distribution?
• Use simple CUSUM & Generalized Likelihood Ratio (GLR) tests
• ... but use thousands of them
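A minimal sketch of the kind of test named on this slide may help: a one-sided CUSUM that accumulates evidence that the observed mean has shifted upward from the baseline. The function name, the `drift` and `threshold` parameters, and the sample events are illustrative assumptions, not values from the talk, and the GLR variant is not shown.

```python
from typing import Iterable, Optional


def cusum_first_alarm(xs: Iterable[float], baseline_mean: float,
                      drift: float, threshold: float) -> Optional[int]:
    """Return the index of the first alarm of a one-sided CUSUM, or None.

    drift and threshold are tuning parameters (hypothetical values below).
    """
    s = 0.0
    for i, x in enumerate(xs):
        # Accumulate deviations above baseline_mean + drift; never go below 0.
        s = max(0.0, s + (x - baseline_mean - drift))
        if s > threshold:
            return i
    return None


# Hypothetical event sequence: the level shifts upward at index 4.
events = [0.1, -0.2, 0.0, 0.3, 2.1, 1.9, 2.2, 2.0]
alarm = cusum_first_alarm(events, baseline_mean=0.0, drift=0.5, threshold=2.5)
print(alarm)
```

In the setting of the talk, one such detector would run per baseline, which is how "thousands of them" arise.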

Build 10^4+ Models
1. Build segmented models using multidimensional data cubes.
2. For each distinct cube, estimate parameters for a separate statistical model.
3. Detect changes from baselines and send alerts in real time.
[Diagram: data cube with dimensions geospatial region, day x hour, and types of weather. Modeling using Cubes of Models (MCM) - separate baselines for each cell]
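The three steps can be sketched roughly as follows, assuming each record is tagged with a cell key such as (day, hour, location). The per-cell model here is just a mean and standard deviation with a z-score check, a deliberately simple stand-in for the statistical models and change-detection tests used in the talk; all names and numbers are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def fit_cube(records):
    """Steps 1-2: group (cell_key, value) records by cell and estimate
    a separate (mean, std) baseline for each cell with enough data."""
    by_cell = defaultdict(list)
    for key, value in records:
        by_cell[key].append(value)
    return {key: (mean(vals), stdev(vals))
            for key, vals in by_cell.items() if len(vals) >= 2}


def is_alert(cube, key, value, z_threshold=3.0):
    """Step 3: compare a new observation against its own cell's baseline."""
    mu, sigma = cube[key]
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold


# Hypothetical traffic speeds for two cells at the same location.
tue_4pm = ("Tue", 16, "loc-17")
tue_3am = ("Tue", 3, "loc-17")
history = [(tue_4pm, v) for v in [60, 62, 58, 61, 59]]
history += [(tue_3am, v) for v in [70, 71, 69, 70, 70]]

cube = fit_cube(history)
print(is_alert(cube, tue_4pm, 61), is_alert(cube, tue_3am, 61))
```

The same reading (61) is unremarkable in one cell and anomalous in the other, which is the point of keeping a separate baseline per cell.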

Greedy Meaningful/Manageable Balancing (GMMB) Algorithm
• One model for each cell in data cube
• More alerts: alerts more meaningful. To increase alerts, add breakpoint to split cubes, order by number of new alerts, & select one or more new breakpoints.
• Fewer alerts: alerts more manageable. To decrease alerts, remove breakpoint, order by number of decreased alerts, & select one or more breakpoints to remove.
[Diagram label: Breakpoint]
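The ordering step might look like the sketch below. It assumes each candidate breakpoint comes with an estimated change in alert count, and it stops once a target change is reached; the slide only says to order candidates and select one or more, so the stopping rule, the function name, and the candidate values are all assumptions.

```python
def select_breakpoints(alert_delta, target):
    """Greedily pick breakpoints, largest estimated alert change first,
    until the cumulative change reaches `target` or candidates run out.

    alert_delta maps a candidate breakpoint to its estimated change in
    the number of alerts (new alerts when adding, decreased alerts when
    removing). The stopping rule is an assumption, not from the slide.
    """
    chosen, total = [], 0
    ranked = sorted(alert_delta.items(), key=lambda kv: kv[1], reverse=True)
    for breakpoint_name, delta in ranked:
        if total >= target:
            break
        chosen.append(breakpoint_name)
        total += delta
    return chosen, total


# Hypothetical candidates for splitting cells to increase alerts.
candidates = {"hour=6": 12, "hour=9": 30, "region=north": 5, "weather=rain": 18}
chosen, total = select_breakpoints(candidates, target=40)
print(chosen, total)
```

The same routine covers the removal direction if `alert_delta` holds the number of alerts each removal would eliminate.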

Example 2: Data Quality for Payment Systems
[Diagram: Account, Issuing Bank, Merchant, Merchant Bank]
• 6000+ peak transactions per second.

Payments Data is Highly Heterogeneous
• Variation merchant to merchant
• Variation bank to bank
• Daily variation
• Variation season to season

Data Cubes of Models - Payments Systems
• Build separate model for each bank (c. 1000)
• Build separate model for each geographical region (6 regions)
• Build separate model for each different type of merchant (c. 800 types of merchants)
• For each distinct cube, establish separate baselines for each metric of interest (declines, etc.)
• Detect changes from baselines
• 20,000+ separate baselines
[Diagram: data cube with dimensions geospatial region, type of transaction, and entity (bank, etc.). Modeling using Cubes of Models (MCM)]

Example 3 - Emergent Behavior Network Packet Data
• Data collected in real time from several different distributed sensors (Angle)
• Still investigating best dimensions for cube
• Build separate cluster model for each cell in cube

Angle Scoring Functions for Each Cube in Data Cube of Models
• Update features using new packets and evolve features
• Divide clusters into good (B or Blue), neutral, and bad (R or Red)
• Blue - score using good clusters
• Red - score using bad clusters
• Purple - score using both good and bad clusters
• Hard scoring - use max / min: $s(x) = \max_{k \in B} s_k(x)$
• Soft scoring - use sum: $s(x) = \sum_{k \in B \cup R} s_k(x)$
• Scoring function for single cluster: $s_k(x) \propto \exp\left(-\frac{\|x - \mu_k\|^2}{2\sigma_k^2}\right)$ [per-cluster coefficients and a closing normalization relation are not recoverable from the transcript]
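Hard and soft scoring can be illustrated with a short sketch. Hard scoring takes the maximum single-cluster score over a chosen set of clusters, and soft scoring sums the scores. The single-cluster score below is an assumed Gaussian-style kernel on the distance to the cluster center; the exact coefficients on the slide could not be read from the transcript, so treat `cluster_score`, the `sigma` values, and the example clusters as hypothetical.

```python
import math


def cluster_score(x, center, sigma):
    """ASSUMED single-cluster score: exp(-||x - center||^2 / (2 sigma^2)).
    The talk's exact per-cluster coefficients are not reproduced here."""
    d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def hard_score(x, clusters):
    """Hard scoring: maximum single-cluster score over the given clusters."""
    return max(cluster_score(x, c, s) for c, s in clusters)


def soft_score(x, clusters):
    """Soft scoring: sum of single-cluster scores over the given clusters."""
    return sum(cluster_score(x, c, s) for c, s in clusters)


# Hypothetical clusters as (center, sigma) pairs.
blue = [((0.0, 0.0), 1.0), ((5.0, 5.0), 1.0)]   # good clusters (B)
red = [((10.0, 0.0), 1.0)]                      # bad clusters (R)

x = (0.0, 0.0)
print(hard_score(x, blue))        # Blue: score using good clusters
print(hard_score(x, red))         # Red: score using bad clusters
print(soft_score(x, blue + red))  # Purple: score using both
```

How the good-cluster and bad-cluster scores are combined into a final decision is not specified on the slide, so the sketch only computes the scores themselves.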

The Challenge
• This methodology can work quite well in practice.
• Develop some of the theory to guide this methodology and improve the methodology.

Other Applications
• George Church's challenge: individual predictive models for each human genome (6.5 billion humans x 6 billion base pairs)
• Consumer marketing - large advertisers will see 1-3 billion different consumers
• Network defense / cyberdefense - 4 billion IPv4 addresses; billions of users; billions+ of IPv6 addresses

What About the Data?
• Highway change detection data is available: highway.ncdm.uic.edu
• Angle network anomalies will be available

What About the Software?
• Augustus - will be available from SourceForge

References
• Robert L. Grossman, Michal Sabala, Javid Alimohideen, Anushka Aanand, John Chaves, John Dillenburg, Steve Eick, Jason Leigh, Peter Nelson, Mike Papka, Doug Rorem, Rick Stevens, Steve Vejcik, Leland Wilkinson, and Pei Zhang, Real Time Change Detection and Alerts from Highway Traffic Data, ACM/IEEE International Conference for High Performance Computing and Communications (SC '05).
• Joseph Bugajski, Robert L. Grossman, Eric Sumner, and Steve Vejcik, Monitoring Data Quality for Very High Volume Transaction Systems, Proceedings of the 11th International Conference on Information Quality, 2006.
• Joseph Bugajski, Chris Curry, Robert L. Grossman, David Locke, and Steve Vejcik, Detecting Changes in Large Data Sets of Payment Card Data: A Case Study, Proceedings of the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2007.
