Modeling Highly Heterogeneous (Large) Data Sets: Towards a Billion Models



SLIDE 1

Modeling Highly Heterogeneous (Large) Data Sets: Towards a Billion Models

Robert Grossman University of Illinois at Chicago & Open Data Group

SLIDE 2

Traditional Statistics

  • One small data set
  • A few attributes
  • Vector-valued data

Data Mining

  • Few large data sets
  • Many attributes
  • Complex data

SLIDE 3

But Large Data Is Not Homogeneous

               Large, Highly Heterogeneous Data (Tomorrow)   Large Data Today   Statistics
  Populations  Many                                          Several            One
  Structure    Complex                                       Complex            Vector
  Attributes   Many                                          Many               Few
  Data         Large                                         Large              Small

SLIDE 4

Features vs. Model Parameters

[Figure: a single set of model parameters for all feature vectors vs. many sets of model parameters for the feature vectors]

Today, we can manage one billion feature vectors.

Our interest: one billion models.

SLIDE 5

Progress to Date

[Figure: number of models on a log scale with ticks 1, 10, 100, 1000, 10E4, 10E5, 10E6, 10E9. Labels, in increasing order: single models; manually segmented models, ensembles of models; machine segmented models (homogeneous); highly heterogeneous models (?)]

SLIDE 6

Example 1 - 42,000 Models

  • Is the traffic speed and volume today (Tuesday, May 15, 4:30 pm, no rain) different from the baseline model?
  • Separate model for 7 days x 24 hours x 250 locations = 42,000 models
  • 833 road sensors
  • Weather data (images, xml)
  • Text data about special events

[Figure: Anomalies]

SLIDE 7

GLR Change Detection Algorithms (Single Model)

  • Sequence of events x[1], x[2], x[3], …
  • Question: is the observed distribution different from the baseline distribution?
  • Use simple CUSUM & Generalized Likelihood Ratio (GLR) tests
  • ... but use thousands of them

[Figure: Observed Model vs. Baseline Model, β]
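As a rough illustration of the kind of test named above, here is a minimal one-sided CUSUM sketch in Python. It assumes Gaussian observations with a known baseline mean, a hypothesized shifted mean, and a shared standard deviation; the function name `cusum_alarm`, its parameters, and the threshold are illustrative assumptions, not taken from the slides. A GLR test would additionally estimate the post-change parameter instead of fixing it.

```python
def cusum_alarm(xs, mu0, mu1, sigma, threshold):
    """Return the index of the first CUSUM alarm, or None if none fires.

    Assumption (not from the slides): observations are Gaussian with
    baseline mean mu0, post-change mean mu1, and common std dev sigma.
    """
    g = 0.0  # cumulative sum statistic, reset at zero
    for i, x in enumerate(xs):
        # Log-likelihood ratio of N(mu1, sigma^2) vs. N(mu0, sigma^2) for x
        llr = (mu1 - mu0) / sigma ** 2 * (x - (mu0 + mu1) / 2.0)
        g = max(0.0, g + llr)
        if g > threshold:
            return i
    return None


# Hypothetical stream: 20 baseline readings followed by 20 shifted readings
stream = [0.0] * 20 + [2.0] * 20
first_alarm = cusum_alarm(stream, mu0=0.0, mu1=2.0, sigma=1.0, threshold=5.0)
```

Running "thousands of them" then amounts to keeping one such statistic per model, which is cheap because each test stores only a single running value.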

SLIDE 8

Build 10^4+ Models

1. Build segmented models using multidimensional data cubes
2. For each distinct cube, estimate parameters for a separate statistical model
3. Detect changes from baselines and send alerts in real time

[Figure: data cube with dimensions Day x Hour, Types of weather, Geospatial region]

Modeling using Cubes of Models (MCM): separate baselines for each cell
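The three steps above can be sketched in Python as a dictionary of per-cell baselines. This is only a sketch under stated assumptions: the cell key `(day, hour, location)`, the `Baseline` class, and the helper names `train` and `alert` are hypothetical, and a plain z-score threshold stands in for the CUSUM/GLR tests the slides mention.

```python
import math
from collections import defaultdict


class Baseline:
    """Running mean and variance for one cell (Welford's online update)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

    def is_change(self, x, z=3.0):
        # Stand-in test: flag values more than z standard deviations away
        s = self.std()
        return s > 0.0 and abs(x - self.mean) > z * s


# Step 1: the cube is a mapping from cell key to its own model
cube = defaultdict(Baseline)


def train(cell, x):
    # Step 2: estimate parameters separately for each cell
    cube[cell].update(x)


def alert(cell, x):
    # Step 3: compare a new observation against that cell's baseline only
    return cell in cube and cube[cell].is_change(x)


# Hypothetical cell: Tuesday, 4 pm hour, one sensor location
cell = ("Tue", 16, "sensor_17")
for speed in (60.0, 62.0, 58.0, 61.0, 59.0):
    train(cell, speed)
```

Keying the models by cell means a new observation is only ever compared with the baseline for its own day, hour, and location, which is what keeps the individual models simple.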

SLIDE 9

Greedy Meaningful/Manageable Balancing (GMMB) Algorithm

  • Fewer alerts
  • Alerts more manageable
  • To decrease alerts, remove breakpoints: order by number of decreased alerts & select one or more breakpoints to remove

  • More alerts
  • Alerts more meaningful
  • To increase alerts, add breakpoints to split cubes: order by number of new alerts & select one or more new breakpoints

[Figure: one model for each cell in data cube; breakpoint]
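The two greedy moves described on this slide might look roughly like the following Python sketch. The slides do not give the algorithm's details, so everything here is an assumption: breakpoints are plain values along one cube dimension, `count_alerts` is a hypothetical caller-supplied function that reports how many alerts a given set of breakpoints produces, and the toy alert counts exist only to exercise the code.

```python
def greedy_remove(breakpoints, count_alerts, n_remove=1):
    """Drop the breakpoints whose individual removal decreases alerts most."""
    base = count_alerts(breakpoints)
    gains = []
    for b in breakpoints:
        remaining = [c for c in breakpoints if c != b]
        gains.append((base - count_alerts(remaining), b))
    gains.sort(key=lambda g: g[0], reverse=True)  # largest decrease first
    to_drop = {b for _, b in gains[:n_remove]}
    return [b for b in breakpoints if b not in to_drop]


def greedy_add(breakpoints, candidates, count_alerts, n_add=1):
    """Add the candidate breakpoints that individually create most new alerts."""
    base = count_alerts(breakpoints)
    gains = []
    for c in candidates:
        if c not in breakpoints:
            gains.append((count_alerts(sorted(breakpoints + [c])) - base, c))
    gains.sort(key=lambda g: g[0], reverse=True)  # largest increase first
    return sorted(breakpoints + [c for _, c in gains[:n_add]])


# Toy alert model (hypothetical): each breakpoint contributes a fixed
# number of extra alerts on top of two alerts from the unsplit cube.
extra_alerts = {6: 1, 12: 5, 18: 3}


def toy_count_alerts(bps):
    return 2 + sum(extra_alerts[b] for b in bps)


fewer = greedy_remove([6, 12, 18], toy_count_alerts)
more = greedy_add([6], [12, 18], toy_count_alerts)
```

Each move evaluates candidates one at a time against the current segmentation, so the sketch is greedy in the sense that it ignores interactions between breakpoints.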

SLIDE 10

Example 2: Data Quality for Payment Systems

[Figure: Account, Merchant, Issuing Bank, Merchant Bank]

  • 6000+ peak transactions per second.

SLIDE 11

Payments Data is Highly Heterogeneous

  • Variation merchant to merchant
  • Variation bank to bank
  • Daily variation
  • Variation season to season

SLIDE 12

Data Cubes of Models - Payments Systems

  • Build separate model for each bank (c. 1000)
  • Build separate model for each geographical region (6 regions)
  • Build separate model for each different type of merchant (c. 800 types of merchants)
  • For each distinct cube, establish separate baselines for each metric of interest (declines, etc.)
  • Detect changes from baselines

[Figure: data cube with dimensions Entity (bank, etc.), Geospatial region, Type of Transaction; 20,000+ separate baselines]

Modeling using Cubes of Models (MCM)

SLIDE 13

Example 3 - Emergent Behavior Network Packet Data

  • Data collected in real time from several different distributed sensors (Angle)
  • Still investigating best dimensions for cube
  • Build separate cluster model for each cell in cube

SLIDE 14

Angle Scoring Functions for Each Cube in Data Cube of Models

  • Update features using new packets and evolve features
  • Divide clusters into good (B or Blue), neutral, and bad (R or Red)
  • Blue - score using good clusters
  • Red - score using bad clusters
  • Purple - score using both good and bad clusters

  • Scoring function for single cluster:

      s_k(x) = \theta_k \frac{1}{\sigma_k} \exp\left( -\frac{\| x - \mu_k \|^2}{2 \sigma_k^2} \right), \qquad \sum_k \theta_k = 1

  • Hard scoring - use max / min:

      s(x) = \max_{k \in B} s_k(x)

  • Soft scoring - use sum:

      s(x) = \sum_{k \in B \cup R} s_k(x)
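The single-cluster, hard, and soft scoring functions listed on this slide can be sketched in Python as follows. The Gaussian-shaped cluster score, the parameter names `mu`, `sigma`, and `theta` (cluster center, width, and weight), and the example clusters are all assumptions made for illustration; the slide's exact formula is not fully recoverable.

```python
import math


def cluster_score(x, mu, sigma, theta):
    """Score of point x against one cluster (assumed Gaussian-shaped)."""
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return theta / sigma * math.exp(-sq_dist / (2.0 * sigma ** 2))


def hard_score(x, clusters):
    """Hard scoring: take the maximum score over the given clusters."""
    return max(cluster_score(x, mu, sigma, theta) for mu, sigma, theta in clusters)


def soft_score(x, clusters):
    """Soft scoring: sum the scores over the given clusters."""
    return sum(cluster_score(x, mu, sigma, theta) for mu, sigma, theta in clusters)


# Hypothetical clusters as (mu, sigma, theta) tuples
blue = [((0.0, 0.0), 1.0, 0.5)]  # good clusters (B)
red = [((3.0, 4.0), 1.0, 0.5)]   # bad clusters (R)

x = (0.0, 0.0)
blue_score = hard_score(x, blue)          # Blue: good clusters only
purple_score = soft_score(x, blue + red)  # Purple: good and bad clusters
```

Passing the good clusters, the bad clusters, or their union as `clusters` gives the Blue, Red, and Purple variants without changing the scoring code.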

SLIDE 15

The Challenge

  • This methodology can work quite well in practice.
  • Develop some of the theory to guide this methodology and improve the methodology.

SLIDE 16

Other Applications

  • George Church’s challenge: individual predictive models for each human genome (6.5 billion humans x 6 billion base pairs)
  • Consumer marketing - large advertisers will see 1-3 billion different consumers
  • Network defense / cyberdefense - 4 billion IPv4 addresses; billions of users; billions+ of IPv6 addresses

SLIDE 17

What About the Data?

  • Highway change detection data is available at highway.ncdm.uic.edu
  • Angle network anomalies will be available

What About the Software?

  • Augustus - will be available from SourceForge

SLIDE 18

References

Robert L. Grossman, Michal Sabala, Javid Alimohideen, Anushka Aanand, John Chaves, John Dillenburg, Steve Eick, Jason Leigh, Peter Nelson, Mike Papka, Doug Rorem, Rick Stevens, Steve Vejcik, Leland Wilkinson, and Pei Zhang, Real Time Change Detection and Alerts from Highway Traffic Data, ACM/IEEE International Conference for High Performance Computing and Communications (SC '05).

Joseph Bugajski, Robert L. Grossman, Eric Sumner and Steve Vejcik, Monitoring Data Quality for Very High Volume Transaction Systems, Proceedings of the 11th International Conference on Information Quality, 2006.

Joseph Bugajski, Chris Curry, Robert L. Grossman, David Locke, Steve Vejcik, Detecting Changes in Large Data Sets of Payment Card Data: A Case Study, Proceedings of The Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2007.