
Missing Data in Large Transaction Databases

Allan R. Wilks AT&T Labs - Research


Workshop on Data Quality 30 November 2000, Slide 1

Setting

  • call detail on AT&T long distance network
  • 300 million transactions (50 GB) per day
  • collected from 400 sources
  • reporting frequency ranging from continuous to every few weeks
  • complicated variable-length record format


Use

  • fraud detection
  • streaming access
  • database access

Problem

  • are we seeing all the data?
  • needle absence in haystack
  • niches for fraudsters
  • perception: database confidence

Sources

  • are all sources reporting?
  • depends on having exhaustive source list
  • each source reporting everything?
      • volume monitoring
      • frequency monitoring
      • serial number monitoring
      • stratified -- all exchanges?
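
Serial number monitoring can be illustrated with a minimal sketch. It assumes each source stamps its records with consecutive integer serial numbers, which the slides do not state; the function name `missing_serial_ranges` is hypothetical.

```python
from typing import Iterable, List, Tuple


def missing_serial_ranges(serials: Iterable[int]) -> List[Tuple[int, int]]:
    """Return inclusive (first, last) ranges of serial numbers never seen.

    Assumption: one source numbers its records consecutively, so any jump
    between neighbouring observed serials marks records that did not arrive.
    """
    seen = sorted(set(serials))  # tolerate duplicates and out-of-order arrival
    gaps = []
    for prev, cur in zip(seen, seen[1:]):
        if cur - prev > 1:
            gaps.append((prev + 1, cur - 1))
    return gaps


if __name__ == "__main__":
    # Records 4-6 and 9 from this source never reached the database.
    print(missing_serial_ranges([1, 2, 3, 7, 8, 10]))
```

A check like this only finds holes between the smallest and largest serial seen; records missing at the start or end of a batch would need volume or frequency monitoring instead.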


Holes in database

  • users can detect quite small holes -- surprising
      • do users alert? -- depends on their expectations
      • do users think about the data as they see it?
  • auto queries
      • transverse to reporting sources
  • traceback
      • can the source of a hole be traced?
      • keep raw data
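
One possible reading of an automatic query that runs transverse to the reporting sources is a count over keys that cut across sources, compared against a baseline. The sketch below assumes records can be reduced to an (exchange, hour) key and that an expected count per key is available; the key choice, the `find_holes` name, and the `min_ratio` threshold are all hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, List, Tuple


def find_holes(
    record_keys: Iterable[Hashable],
    expected: Dict[Hashable, float],
    min_ratio: float = 0.5,
) -> List[Tuple[Hashable, int, float]]:
    """Flag keys whose observed count falls below min_ratio * expected count.

    record_keys: one key per transaction, e.g. (exchange, hour), ignoring
                 which source reported it (assumed record reduction).
    expected:    baseline count per key, e.g. from earlier weeks (assumed).
    """
    observed = Counter(record_keys)
    holes = []
    for key, baseline in expected.items():
        count = observed.get(key, 0)
        if count < min_ratio * baseline:
            holes.append((key, count, baseline))
    return sorted(holes)


if __name__ == "__main__":
    keys = [("973", 0)] * 10 + [("212", 0)] * 2
    baseline = {("973", 0): 10, ("212", 0): 10, ("908", 0): 5}
    for key, count, expect in find_holes(keys, baseline):
        print(key, count, expect)
```

Tracing a flagged key back to a source still depends on the raw data being kept, since the aggregated counts no longer say where the records came from.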


Tools

  • streaming tools
      • sh, awk, C, ...
      • everything small
  • database tools
      • Daytona
      • integrates well with UNIX
      • 8 TB and growing
  • alerting
      • via pager
      • software failures
      • system failures
      • heartbeat
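
A heartbeat check might look like the following sketch. Because reporting frequency ranges from continuous to every few weeks, it assumes a separate silence limit per source; the function name, the dictionaries, and the use of epoch seconds are illustrative assumptions, and the paging step itself is left out.

```python
from typing import Dict, List


def silent_sources(
    last_seen: Dict[str, float],
    max_silence: Dict[str, float],
    now: float,
) -> List[str]:
    """Return sources that have been quiet longer than their own limit.

    last_seen:   source -> time of its most recent report, in seconds.
    max_silence: source -> longest acceptable gap, in seconds. This doubles
                 as the list of sources that are expected to report.
    A source with no report on record is always flagged.
    """
    never = float("-inf")
    return sorted(
        source
        for source, limit in max_silence.items()
        if now - last_seen.get(source, never) > limit
    )


if __name__ == "__main__":
    last = {"switch-a": 100.0, "switch-b": 950.0}
    limits = {"switch-a": 60.0, "switch-b": 3600.0, "switch-c": 60.0}
    # switch-a is overdue and switch-c has never reported.
    print(silent_sources(last, limits, now=1000.0))
```

Driving the check from `max_silence` rather than from `last_seen` reflects the earlier point that everything depends on having an exhaustive source list.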


Lessons

  • develop subject matter expertise
  • log everything
  • explain all anomalies
  • keep raw data
  • automate as much as possible