Master Data Management Nick Pizzi, PhD pizzi@imt.ca IMT IMT - - PowerPoint PPT Presentation

master data management
SMART_READER_LITE
LIVE PREVIEW

Master Data Management Nick Pizzi, PhD pizzi@imt.ca IMT IMT - - PowerPoint PPT Presentation

Master Data Management Nick Pizzi, PhD pizzi@imt.ca IMT IMT provides solutions in: e-Health strategy and architecture; integration; master data management; and big data analytics IMT specializes in large-scale Electronic Health Record


slide-1
SLIDE 1

Master Data Management

Nick Pizzi, PhD pizzi@imt.ca

slide-2
SLIDE 2

Page: Confidential

  • IMT provides solutions in: e-Health strategy and architecture;

integration; master data management; and big data analytics

  • IMT specializes in large-scale Electronic Health Record (EHR)

Registries that utilize a probabilistic matching algorithm

  • IMT has had a role in the architecture and deployment of

nearly all core Client and Provider Registries in Canada that now serve as foundational components for the Canada Health Infoway EHR Blueprint

  • IMT has been involved in >60 Health Information Exchange

EHR projects in Canada and USA

IMT

2

slide-3
SLIDE 3

Page: Confidential

  • Master Data
  • Entities & Attributes
  • Algorithm &

Linkages

  • Thresholds & Tasks
  • Algorithm Details
  • Generating Weights
  • Threshold Analysis
  • Implementation

Process

  • Matching

Challenges

Synopsis

3

slide-4
SLIDE 4

Master Data

slide-5
SLIDE 5

Page: Confidential

  • High-value, core information used to support

critical business processes across the enterprise

  • Demographic/descriptive data about customers,

patients, suppliers, partners, products, employees, accounts …

  • Master Data is at the heart of every business

transaction, application and decision

What is Master Data?

5

Master Data

slide-6
SLIDE 6

Page: Confidential

  • Master Data Management (MDM) delivers a

single, trusted and complete version of critical data assets to downstream applications, end users and processes

  • MDM provides a mastered list of records, often

called “Golden Records”

  • MDM creates a central location to access key

information

What is Master Data Management?

6

Master Data

slide-7
SLIDE 7

Page: Confidential

  • Registries collect data from multiple sources to

create an “index”

  • Records are linked into entities, one index may

point to multiple records

  • Ownership of data (updates and merges) remains

at the source level

  • Data is queried at runtime to create composite

views (virtual Golden Record)

Registry Style

7

Master Data

slide-8
SLIDE 8

Page: Confidential

  • Centralized Style: Gather data from sources, but MDM assumes

central ownership of data

  • Golden Records become reference data for other enterprise apps
  • Can synchronize downstream systems by publishing updates
  • Hybrid Style: Sources send data to be included in the Registry Index
  • Data across sources are consolidated to create a Golden Record
  • Golden Records are now more easily referenced across

enterprise

Centralized and Hybrid Styles

8

Master Data

slide-9
SLIDE 9

Page: Confidential

  • Gather data from multiple source systems
  • Locate records that have a high probability of match
  • Link matched records to form a unified view of information
  • If there is a question about the match, records are added

to a task

  • MDM hub supports key business goals like searching,

viewing relationships, or populating a data warehouse for business intelligence

Primary Function: Matching & Linking

9

Master Data

slide-10
SLIDE 10

Page: Confidential

  • Definitional sources: systems that contribute

records to the MDM via the Broker, API, or periodic batch updates

  • Eg, Hospital, Clinic, Labs sources
  • Informational sources: the “issuers” of

demographic identifiers (MEMIDENT)

  • Eg, Driver’s Licence (AB) or Passport (Canada)

Sources: Definitional vs Informational

10

Master Data

slide-11
SLIDE 11

Entities & Attributes

slide-12
SLIDE 12

Page: Confidential

  • A record that comes from a source that

contains current and/or historical information

Member

12

ID# Name Work Address Home Phone Email 319883 Debbie Smith 4150-50 Tecnology Way
 Carson City, NV 89722 775.789.1020 dbecker@mdmgroup.org 319884 Ronald Bucher 931 West Canyon Blvd.
 Gardnerville, NV 775.212.1891 rbucher@gmail.com 319885 Jennifer Long 882 N. Weldon Way
 Reno, NV 89502 775.279.5629 Jen1871@aol.com 319886 Mike Smith 775.302.2582 Msmith2@yahoo.com

Data Source

Entities & Attributes

slide-13
SLIDE 13

Page: Confidential

  • A distinct person, organization, or location

represented by assigning the same Enterprise Identifier (EID) to ≥1 member records

Entity

13

Deb Becker-Smith EID: 456 Debbie Beckersmith Deborah Becker Deb Becker-Smith

Source A Source B Source C

Entities & Attributes

slide-14
SLIDE 14

Page: Confidential

  • How do you define a person?
  • Multiple records may exist for the same person
  • Key demographics, like Name, DoB, Address, and

Phone help determine the individual people in data

  • Strong identifiers (eg, SIN, PHIN) help make better

matches

  • Linked records are aggregated to create views of the

Entity

What makes an Entity?

14

Entities & Attributes

slide-15
SLIDE 15

Page: Confidential

  • With multiple entity types, relationships can

be created

  • Entity Types:
  • Patients, Facilities, Organizations, Incidents,

Customers, Citizens, Locations, Vehicles, Providers, Households, Suspects, Products

Entity Types & Relationships

15

Entities & Attributes

slide-16
SLIDE 16

Page: Confidential

  • Attributes are stored in database tables,

Segments, based on the Type of data they are (eg, names, dates, phones, or addresses)

Attribute Types

16

MDM HUB

MEMPHONE

Home Phone Mobile Phone Work Phone Fax Number

MEMNAME

Patient Name Previous Name Guarantor Next of Kin

mpi_memphone

phCc phArea phNumber phExtension phComment

mpi_memphone

  • nmLast
  • nmFirst
  • nmMiddle
  • nmSuffix
  • nmPrefix
  • nmTitle
  • nmDegree

Entities & Attributes

slide-17
SLIDE 17

Page: Confidential

  • A demographic data element made up of ≥1

fields that defines traits of a member record

Attribute

17

Deb Becker-Smith EID: 456 Debbie Beckersmith Deborah Becker Deb Becker-Smith

Source A Source B Source C

Birth Date Name Home Address Home Phone

Entities & Attributes

slide-18
SLIDE 18

Algorithm & Linkages

slide-19
SLIDE 19

Page: Confidential

  • Derive ⇒ Organize ⇒ Search ⇒ Compare ⇒ Score

Algorithm Overview

19

Standardize Convert data to simplest form for easier use during matching process Bucket Organize records that share common values for faster search retrieval Compare Compare pairs of records using a probabilistic method to calculate a score (512) 634-5144 6345144 1344456 6345144 vs. 6345414

Data

3.9 Algorithm & Linkages

slide-20
SLIDE 20

Page: Confidential

  • When member records have a high comparison score

they are linked together as part of the same Entity

  • We do this by giving them the same Enterprise ID (EID)

Linkages

20

MDM HUB

vs.

EID EID

Score ≥ AutoLink?

When new records are added to the hub, buckets are used to find candidates to compare against If the comparison score is greater than the auto-link threshold the records are given the same EID, linking them to the same entity

Algorithm & Linkages

slide-21
SLIDE 21

Page: Confidential

  • Composite Views define which values get displayed
  • Not all records contain a complete or up-to-date view of data
  • Normally follows “Most Recent Update” rule to show freshest

version

  • Can also take into account Source preference or custom logic
  • Composite Views are not stored in database, only the raw data
  • Views are rendered at runtime, picking the best value for each

attribute

Composite Views

21

Algorithm & Linkages

slide-22
SLIDE 22

Page: Confidential

Composite Views

22

Jones, Ron 2772 W. North Ave. Oak Park, IL 60302 SSN: 972-41-2318 Jones, R. Ph: (773) 826-2825 DOB: 07/26/1964 SSN: 972-41-2318 Jones, Ronald R. 5282 W. Chicago Ave.
 Chicago, IL 60610 Ph: (773) 826-2825 Jones, Ronald R. 5282 W. Chicago Ave.
 Chicago, IL 60610 Ph: (773) 826-2825 DOB: 07/26/1964 SSN: 972-41-2318

Algorithm & Linkages

slide-23
SLIDE 23

Page: Confidential

  • Coordinate record linking and task creation

Entity Management

23

Source + MemIdNum Source + MemIdNum MemRecNo Algorithm Entity Manager Create Task ? MemRecNo

MemIdNum used to store original primary key for each source record MemRecNo assigned as internal MDM key field for each unique record Assess probable matches to determine comparison score Measure score to decide whether to link, do nothing, or create a task Assign same EID to both records when they are linked If score is unclear then a task is created for Data Stewards to review

Algorithm & Linkages

slide-24
SLIDE 24

Thresholds & Tasks

slide-25
SLIDE 25

Page: Confidential

Ignore Link

  • Two thresholds: link, ignore, or create task

Thresholds

25

TASKS Clerical Review Threshold (CR) Auto-Link Threshold (AL)

Thresholds & Tasks

slide-26
SLIDE 26

Page: Confidential

  • FP (FN) occur when match probability is high (low) but

assessment is wrong

  • FN: caused by sparse/inaccurate data
  • FP: twins, siblings, or parent/child

False Positives & False Negatives

26

CR AL

FP FN

Thresholds & Tasks

slide-27
SLIDE 27

Page: Confidential

  • Goal: move FP & FN into task range for Data Steward
  • FN: reduce via data governance/quality initiatives
  • FP: reduce via false positive filter, a comparison function

that penalizes scores for small inaccuracies or missing data

Correcting FP & FN

27

CR AL

FP FN

Thresholds & Tasks

slide-28
SLIDE 28

Page: Confidential

  • Manual decision about joining two records

that fall between the CR & AL thresholds.

Tasks

28 Potential Overlay

Source A

record w′

Source A

record w Potential Duplicate

Source A

record w

Source A

record w Potential Linkage

Source A

record w

Source B

record w Review Identifier

Source A

record x

Source A

record w

Thresholds & Tasks

slide-29
SLIDE 29

Page: Confidential

  • Link between entities used to visualize the

interconnectedness between records or hierarchical structures within a dataset

Relationships

29

Smith Household 1122 Main Street Reno NV, 89522 Mike Smith EID: 4569872 Deb Becker-Smith EID: 1456984 Mikey Smith EID: 9734546

Thresholds & Tasks

slide-30
SLIDE 30

Page: Confidential

  • Anonymous Values are usually fake or over used values
  • Removing them enhances match quality and search speed
  • “BABYBOY” or (999) 999-9999 are considered NULL
  • Equivalencies (or nicknames) create lists of interchangeable

words

  • Greater flexibility in searching and comparison
  • Names (Bob = Robert) and abbreviations (HWY = Highway)

Validation Lists

30

Thresholds & Tasks

slide-31
SLIDE 31

Page: Confidential

  • Robert · Noonan · Jillian · Delacroix · Duk · Yussef

Rajeshi · Katsumoto · Jorge · Twin · Hernandez Katai · John · M’Tembe · Mikel · Hiroshi · Larssen Marija · Ozols · Unknown · Chi · D’Esopo · Fergus O’Shea · Maria · Obidos · Wei · James · Vladamir Jian · Baby · Pham · Oliver · Kensington · Kimball Paolo · Morgan · Boy · Silvia · McDonald · Woo Hiram · Fiona · Marta · Alexi · Zhang · Krishna NoLastName · Isaac · Gabrielle · Test · Vargas
 Tomas · Hubert · Paul · Nguyen · Hussein · Liam

Anonymous Values

31

Thresholds & Tasks

slide-32
SLIDE 32

Page: Confidential

MDM Data Flow

32

Adds/Updates are Consumed

Adds and Updates are called Puts. Puts can come via the API, Web Services, Message Brokers, or as a bulk process.

Parse Data to Segments

Each record is assigned an identifier in the MEMHEAD table. Attributes are sent to the appropriate table or UNL file, Names go to MEMNAME, etc.

Derive Data

Derived data takes raw information and creates standardized versions of the data for easier comparison and

  • rganizes records

into buckets for faster searching.

Compare Records

Pairs of records compared using a probabilistic method. Values are weighted by frequency, similarity, or their relationship to

  • ther data, then

the score is tallied.

Link or Create Tasks

The results of the comparisons are measured against a set of thresholds that determine whether to link, ignore, or create a tasks

Thresholds & Tasks

slide-33
SLIDE 33

Page: Confidential

  • Pairs of records are compared to each other
  • Key attributes are assessed and an overall

match score is issued

Probabilistic Matching Framework

33

Record X Name: Ron R. Jones Phone: (773) 826-2825 ID #: 972-41-2318 Record Y Name: Ronald Jones Phone: (773) 826-2852 ID #: 972-41-2318

13.8 Thresholds & Tasks

slide-34
SLIDE 34

Algorithm Details

slide-35
SLIDE 35

Page: Confidential

Standardization

35

Standardize Standardization takes the raw data from a record and cleans it to make analysis easier. The resulting string is stored in the mpi_memcmpd table. Bucket Bucketing organizes records with similar data to optimize the search process. Bucket hash indexes are stored in the mpi_membktd table. Compare Comparison measures the probability that candidate records (from buckets) match. Scores are based on pre-defined weights in lookup tables.

  • Derive ⇒ Organize ⇒ Search ⇒ Compare ⇒ Score

Algorithm Details

slide-36
SLIDE 36

Page: Confidential

  • Standardization provides a wide variety of functions:
  • Case Conversion: Karen Jones ⟹ JONES:KAREN
  • Truncation of Values: (312) 832-1212 ⟹ 8321212
  • Anonymous Values: Remove (000) 000-0000
  • Validating Data Format: Rejects e-mail addresses without an @
  • Equivalency Translation: Apartment ⟹ APT
  • Standardization functions are designed to work with specific data

types, such as Names, Addresses, Dates, Phones, etc.

Standardization Functions

36

Algorithm Details

slide-37
SLIDE 37

Page: Confidential

Standardized Data: MEMCMPD Table

37

Original Data Standardized Data Maria R. Fontana FONTANA:MARIA:R::. 391-20-1923 391201923 (832) 812-1193 8121193 (832) 811-2915 8112915 mfont91@us.ibm.com MFONT91USIBM 1973-09-21 19730921 928 West Kingston Court Chicago, IL 60617 N-928:S-W:S-KINGSTON:S-CT:S- CHICAGO:S-IL:N-60617

FONTANA:MARIA:R::.^391201923^8121193~8112915^MFONT91USIBM^
 19730921^N-928:S-W:S-KINGSTON:S-CT:S-CHICAGO:S-IL:N-60617

OR condition Attribute delimiter

Algorithm Details

slide-38
SLIDE 38

Page: Confidential

Bucketing

38

Standardize Standardization takes the raw data from a record and cleans it to make analysis easier. The resulting string is stored in the mpi_memcmpd table. Bucket Bucketing organizes records with similar data to optimize the search process. Bucket hash indexes are stored in the mpi_membktd table. Compare Comparison measures the probability that candidate records (from buckets) match. Scores are based on pre-defined weights in lookup tables.

  • Derive ⇒ Organize ⇒ Search ⇒ Compare ⇒ Score

Algorithm Details

slide-39
SLIDE 39

Page: Confidential

  • Buckets act like a table index, except that they

reference data in multiple tables

  • Stored as 64-bit “hash” integers
  • Used to select candidates for comparison
  • Speeds searching by enabling sub-set search
  • Cast a wide net by taking abstract view of data

Bucketing: mpi_membktd

39

Algorithm Details

slide-40
SLIDE 40

Page: Confidential

  • Token is single piece of data (eg, field or sub-string)
  • Normally an “or” condition exists, so more criteria broadens a search
  • To create “and” searching, modify min and max tokens
  • Tokens can be controlled at two levels: Bucket Function & Bucket Group

Bucket Tokens

40

Bucket Function Patient Name

(MAX: 2, MIN: 1)

Bucket Function Birth Date

(MAX: 1, MIN: 0)

Bucket Group Name + DOB

(MAX: 2, MIN: 2)

JOHN + M SMITH + M SMITH + DOB JOHN + SMITH JOHN + DOB M + DOB

JOHN M SMITH = 3 tokens DOB ⇾ 1 token

  • ptional

2 Name tokens 0 DOB tokens 1 Name token 1 DOB token

Algorithm Details

slide-41
SLIDE 41

Page: Confidential

Examples of Bucket Generation Types

41

Generation Type What Happens YYYMMDD Bucket records according to Full Date, so there is no transformation MMDD Use MMDD portion of date token to group dates by anniversary (19280912 ⟹ 0912) Phonetic Metaphone: Convert tokens to key sound markers. Industry standard phonetic methodology Normphone: Use Initiate’s proprietary phonetic method, works best with Western languages Arabic Name: Apply phonetics to English translation of Arabic names Equivalence Apply a String Equivalency (Nickname or Abbreviation) from a lookup table (Jim ⟹ James) Equivalence & Phonetics Blend phonetic and equivalence conversions against tokens. (ED ⟹ EDWARD ⟹ ETWRT) Sorted Sort the contents of numeric tokens (3014201324 ⟹ 0011223344, 52431 ⟹ 12345) nGRAM Sequences Bucket chunks of data with length N (N = 4 against “6345111” ⟹ 6345 3451 4511 5111)

Algorithm Details

slide-42
SLIDE 42

Page: Confidential

  • Generation types control how buckets

are conceived

  • Format data for optimized searching
  • Make buckets more accepting of

discrepancies

  • Leverage standardized version of

data

  • Why use bucket generation types?
  • Helps work around typos
  • Makes searches less reliant on exact

matches

  • Casts a wide net, includes nicknames

& key sounds

Bucket Generation Types

42

(723) 445-2983 PHONE Standardization 4452983 Attribute Bucket (Sorted) 2344589

Algorithm Details

slide-43
SLIDE 43

Page: Confidential

  • Remove all vowels (unless vowel is first letter)
  • S, Z, SH, CE → S (drop if final sound of word)
  • J, G → J if first letter; J, G → K otherwise
  • T, D, DT, TH, DTH → T; M → N; V → F
  • C → S if surrounded by vowels (eg, “ICE”); C → K otherwise
  • Truncate at 6 chars: KRISHNAMOORTHY → KRSNNR
  • X → S if at the beginning:to “S”, X → K otherwise
  • Drop H (unless it’s the first letter)

Phonetics Normphone

43

Algorithm Details

slide-44
SLIDE 44

Page: Confidential

Example

44

Name FONTANA:MARIA:R::. SSN 391201923 Phone 8121193 DOB 19830921

FNTN + MR FNTN + R MR + R 011223399 1112389 FNTN + 198309 MR + 198309 R + 198309

Algorithm Details

slide-45
SLIDE 45

Page: Confidential

Comparison

45

Standardize Standardization takes the raw data from a record and cleans it to make analysis easier. The resulting string is stored in the mpi_memcmpd table. Bucket Bucketing organizes records with similar data to optimize the search process. Bucket hash indexes are stored in the mpi_membktd table. Compare Comparison measures the probability that candidate records (from buckets) match. Scores are based on pre-defined weights in lookup tables.

  • Derive ⇒ Organize ⇒ Search ⇒ Compare ⇒ Score

Algorithm Details

slide-46
SLIDE 46

Page: Confidential

  • Exact Match: Do the two values

match, Yes or No?

  • Starts With: Do the two values

start with same initial (digit)?

  • Edit Distance: How many edits

does it take to make the values match?

  • Phonetics: Do the values have

the same key sound markers?

  • Equivalency: Could one of the

values be a nickname or alias?

Matching with Comparison Functions

46

Data Type Comparison Method(s) to Use Name All Birth Date Exact Match Edit Distance Passport # Exact Match Gender Exact Match Phone # Exact Match Edit Distance

Algorithm Details

slide-47
SLIDE 47

Page: Confidential

  • When two values are exactly the same, distance=0
  • A degree of distance is added for three reasons:
  • Transposition: Stein versus Stien
  • Add/Remove: Margret versus Margaret
  • Replace: 281990 versus 281980

Edit Distance: Measuring Similarity

47

3922019 versus 3292089 Phillip versus Philippe 1982-03-21 versus 1982-08-24 Hashida versus Hashida

Algorithm Details

slide-48
SLIDE 48

Page: Confidential

  • Probabilistic comparisons are optimistic when scoring (reward any

correlation)

  • May lead to False Positives: twins, Jr. & Sr., or siblings that scored highly
  • FPF increases accuracy of matching and linking by issuing penalties when

pre-defined conditions are met

  • Penalty forces False Positive to fall below AL threshold and into Task
  • FPFs are deployed as comparison functions within Algorithm
  • Run in parallel to normal probabilistic matching process, but unlike other

comparison functions, FPFs issue penalties instead of awarding points

  • Examine four key elements: Name; Gender; Birth Date; SSN

False Positive Filter (FPF)

48

Algorithm Details

slide-49
SLIDE 49

Page: Confidential

Record X Record Y Score Name John Michael Smith Johnny M. Smith 5.2 Gender Male Male 0.3 Address 828 W. High St. Camden, ME 04843 828 W. High St. Camden, ME 04843 5.8 Phone (207) 236-9132 (207) 236-9132 5.2 SSN 771-29-1821 0.0 Total: 16.5 Total (FPF Adjusted): 14.3

FPF Example: AL=15

49

FPF Penalty

Name: partial match Gender: exact DOB: missing SSN: missing

Algorithm Details

slide-50
SLIDE 50

Generating Weights

slide-51
SLIDE 51

Page: Confidential

  • Original source data is abstracted during the derivation process:
  • Data parsed to attribute-based segments (eg, Names to mpi_memname)
  • Core data is standardized and stored in the mpi_memcmpd table
  • Bucket hashes are generated and stored in the mpi_membktd table
  • Matching and linking data is converted to binary form in the bxm files
  • Derivation employs the use of several techniques when abstracting data
  • Standardized data will nullify any anonymous values that are defined
  • Buckets will apply equivalencies, anonymous values, and phonetics
  • Attributes are tokenized, focusing on the individual words and numbers

What is Derived Data?

51

Generating Weights

slide-52
SLIDE 52

Page: Confidential

  • Comparison strings are carat-delimited values stored in mpi_memcmpd:
  • Carat order is determined by Comparison Role algorithm components
  • Format of comparison strings is determined by the Standardization functions
  • Historical and alternate values are sub-delimited by tildes
  • Comparison strings act like de-normalized data (all critical information is kept)
  • Bucket hashes are 64-bit integers that are stored in mpi_membktd:
  • Bucket hashes are assigned according to bucketing groups in the algorithm
  • The bucket ‘Role’ number is prepended to the value before being hashed
  • Each member record could have several bucket hashes

Comparison Strings & Bucket Hashes

52

Generating Weights

slide-53
SLIDE 53

Page: Confidential

Derivation Process

53 Original Data Initiate receives data in a wide variety of formats and this data must be normalized Parse Data to Segments Attributes are sent to the appropriate table

  • r UNL file, Names go

to MEMNAME, etc. Create Comparison Strings Standardized data is carat-delimited and sent to MEMCMPD table or UNL file Assign Bucket Hashes Hashes are assigned to each record and sent to MEMBKTD table or UNL file Compile Binary Files MEMCMPD and MEMBKTD are converted to binary & stored in bxm files

Generating Weights

slide-54
SLIDE 54

Page: Confidential

  • Weights are scores that represent the confidence that a single

value can uniquely identify a record

  • Calculated for each client, based on their data
  • Stored in lookup tables: when data is compared, scores are

pulled

  • Weight generation is resource intensive
  • Weight generation should be run when server traffic is low
  • Weight generation for large datasets (>10 million) may take a

few days to process

Weight Generation

54

Generating Weights

slide-55
SLIDE 55

Page: Confidential

  • Frequency-based Weights: provide a score based on how often a

value appears within overall population. Common values (like John) have a low score, rare values (like Chitsumungo) have a high score.

  • Edit Distance Weights: measure the similarity between two values.

Eg, “Gordon” vs. “Gorton” has a distance of 1 edit. Exact match has highest score, but each edit lowers score by a certain degree.

  • Parameterized Weights: These weights control maximum caps on

scores, extra credit points, and penalties for variance. Eg, there is a maximum weight for Full Name that ensures that the name does not generate a disproportionate score.

Three Kinds of Weights

55

Generating Weights

slide-56
SLIDE 56

Page: Confidential

Weight Tables

56

Table Description mpi_wgthead

Holds core definitions of weights, including comparison specification string and weight type

mpi_wgt1dim

Holds weight values for comparison functions that have a single comparison attribute (eg, SSN, DOB)

mpi_wgt2dim

Holds weight values for appropriate comparison functions that use two attributes (eg, Eye + Hair Color)

mpi_wgt3dim

Holds weight values for appropriate comparison functions that use three attributes (eg, Zip Code + Address + phone)

mpi_wgt4dim

Holds weight values for the False Positive Filter, which uses four separate attributes to control situations like Twins or Jr’s & Sr’s who are mistakenly linked

mpi_wgtsval

Holds common string weight values based purely on frequency (weights for people's names and attributes that use a simple “match or do not match” like gender)

mpi_wgtnval

Holds common numeric weight values based purely on frequency (date information like birth year)

mpi_stranon

Holds anonymous values established by the Anonymous Value Utility (not a weight table per se, but commonly associated with weights because of role that anonymous values play in measuring frequency)

Generating Weights

slide-57
SLIDE 57

Page: Confidential

One-Dimensional Weights: wgt1dim

57

Comparison Type Index Weight CMPID-SSN-DIST CMPID-SSN-DIST 1 556 CMPID-SSN-DIST 2 403 CMPID-SSN-DIST 3 315 CMPID-SSN-DIST 4 177 CMPID-SSN-DIST 5 31 CMPID-SSN-DIST 6

  • 91

CMPID-SSN-DIST 7

  • 205

CMPID-SSN-DIST 8

  • 291

CMPID-SSN-DIST 9

  • 299

Comparison Type Index Weight CMPID-DOB-DIST CMPID-DOB-DIST 1 CMPID-DOB-DIST 2 44 CMPID-DOB-DIST 3

  • 176

CMPID-DOB-DIST 4

  • 285

Comparison Type Index Weight CMPID-AXP-1DIM CMPID-AXP-1DIM 1 419 CMPID-AXP-1DIM 2 504 CMPID-AXP-1DIM 3 596 CMPID-AXP-1DIM 4 723 CMPID-AXP-1DIM 5 860 CMPID-AXP-1DIM 6 997 CMPID-AXP-1DIM 7 1126 CMPID-AXP-1DIM 8 1273 CMPID-SSN-DIST 9

  • 299

Generating Weights

slide-58
SLIDE 58

Page: Confidential

wgt1dim: SSN

58

Comparison Type Index Weight CMPID-SSN-DIST CMPID-SSN-DIST 1 556 CMPID-SSN-DIST 2 403 CMPID-SSN-DIST 3 315 CMPID-SSN-DIST 4 177 CMPID-SSN-DIST 5 31 CMPID-SSN-DIST 6

  • 91

CMPID-SSN-DIST 7

  • 205

CMPID-SSN-DIST 8

  • 291

CMPID-SSN-DIST 9

  • 299

Comparison Type Index Weight CMPID-DOB-DIST CMPID-DOB-DIST 1 CMPID-DOB-DIST 2 44 CMPID-DOB-DIST 3

  • 176

CMPID-DOB-DIST 4

  • 285

Comparison Type Index Weight CMPID-AXP-1DIM CMPID-AXP-1DIM 1 419 CMPID-AXP-1DIM 2 504 CMPID-AXP-1DIM 3 596 CMPID-AXP-1DIM 4 723 CMPID-AXP-1DIM 5 860 CMPID-AXP-1DIM 6 997 CMPID-AXP-1DIM 7 1126 CMPID-AXP-1DIM 8 1273 CMPID-SSN-DIST 9

  • 299

Edit distance weights, as with SSN, use the index as a placeholder for the edits. 0=missing 1=exact match 2=one edit 3=two edits …

Generating Weights

slide-59
SLIDE 59

Page: Confidential

wgt1dim: Date of Birth

59

Comparison Type Index Weight CMPID-SSN-DIST CMPID-SSN-DIST 1 556 CMPID-SSN-DIST 2 403 CMPID-SSN-DIST 3 315 CMPID-SSN-DIST 4 177 CMPID-SSN-DIST 5 31 CMPID-SSN-DIST 6

  • 91

CMPID-SSN-DIST 7

  • 205

CMPID-SSN-DIST 8

  • 291

CMPID-SSN-DIST 9

  • 299

Comparison Type Index Weight CMPID-DOB-DIST CMPID-DOB-DIST 1 CMPID-DOB-DIST 2 44 CMPID-DOB-DIST 3

  • 176

CMPID-DOB-DIST 4

  • 285

Comparison Type Index Weight CMPID-AXP-1DIM CMPID-AXP-1DIM 1 419 CMPID-AXP-1DIM 2 504 CMPID-AXP-1DIM 3 596 CMPID-AXP-1DIM 4 723 CMPID-AXP-1DIM 5 860 CMPID-AXP-1DIM 6 997 CMPID-AXP-1DIM 7 1126 CMPID-AXP-1DIM 8 1273 CMPID-SSN-DIST 9

  • 299

Date comparison function stores weights across two tables. 0=missing 1=exact match 2=one edit 3=two edits … In the case of an exact match, the weight values are stored in wgtnval

Generating Weights

slide-60
SLIDE 60

Page: Confidential

wgtnval: Exact Match Weights for DOB

60

Comparison Type Index Weight CMPID-DOB-YEAR 1900 342 CMPID-DOB-YEAR 1899 386 CMPID-DOB-YEAR 1898 398 CMPID-DOB-YEAR 1897 419 CMPID-DOB-YEAR 1901 423 CMPID-DOB-YEAR 1896 425 CMPID-DOB-YEAR 1895 445 CMPID-DOB-YEAR 1894 463 CMPID-DOB-YEAR 2001 463 CMPID-DOB-YEAR 1954 473 CMPID-DOB-YEAR 1955 473 CMPID-DOB-YEAR 1977 475 CMPID-DOB-YEAR 2003 475 CMPID-DOB-YEAR

  • 1

477

Exact match scores for dates are based

  • n the frequency of dates from each year.

Here you clearly see fake data because the ‘Baby Boomers’ should be the most common (lowest score) years. If there is an exact match between dates, but the year is not listed, the default score (-1) is awarded. This is the highest score because the value was very rare.

Generating Weights

slide-61
SLIDE 61

Page: Confidential

One-Dimensional Weights: wgt1dim

61

Comparison Type Index Weight CMPID-SSN-DIST CMPID-SSN-DIST 1 556 CMPID-SSN-DIST 2 403 CMPID-SSN-DIST 3 315 CMPID-SSN-DIST 4 177 CMPID-SSN-DIST 5 31 CMPID-SSN-DIST 6

  • 91

CMPID-SSN-DIST 7

  • 205

CMPID-SSN-DIST 8

  • 291

CMPID-SSN-DIST 9

  • 299

Comparison Type Index Weight CMPID-DOB-DIST CMPID-DOB-DIST 1 CMPID-DOB-DIST 2 44 CMPID-DOB-DIST 3

  • 176

CMPID-DOB-DIST 4

  • 285

Comparison Type Index Weight CMPID-AXP-1DIM CMPID-AXP-1DIM 1 419 CMPID-AXP-1DIM 2 504 CMPID-AXP-1DIM 3 596 CMPID-AXP-1DIM 4 723 CMPID-AXP-1DIM 5 860 CMPID-AXP-1DIM 6 997 CMPID-AXP-1DIM 7 1126 CMPID-AXP-1DIM 8 1273 CMPID-SSN-DIST 9

  • 299

For AXP , the index is used to indicate the # of digits in numerical address elements. 0=missing 1=one digit 2=two digits 3=three digits … Scores are applied when two addresses contain the exact same

  • number. These scores increase

because one-digit address numbers are common, but eight- digit numbers are rare.

Generating Weights

slide-62
SLIDE 62

Page: Confidential

Two-Dimensional Weights: wgt2dim

62

Phone Comparison Type Index Missing Exact ED 1 ED2 ED 3 ED 4 ED 5 ED 6 CMPID-AXP-2DIM 299 217 60

  • 89 -127 -174 -285

CMPID-AXP-2DIM 1 300 615 560 420 300 250 220 200 CMPID-AXP-2DIM 2 263 584 530 395 283 205 159 132 CMPID-AXP-2DIM 3 226 553 500 370 266 160 98 64 CMPID-AXP-2DIM 4 189 522 470 345 249 115 37

  • 4

CMPID-AXP-2DIM 5 152 491 440 320 232 70

  • 24
  • 72

CMPID-AXP-2DIM 6 115 460 410 295 215 25

  • 85 -140

CMPID-AXP-2DIM 7 78 429 380 270 198

  • 20 -146 -208

CMPID-AXP-2DIM 8 41 398 350 245 181

  • 65 -207 -276

CMPID-AXP-2DIM 9 4 367 320 220 164 -110 -268 -344 CMPID-AXP-2DIM 10

  • 33

336 290 195 147 -155 -329 -412 CMPID-AXP-2DIM 11

  • 70

305 260 170 130 -200 -390 -480 CMPID-AXP-2DIM 12

  • 107

274 230 145 113 -245 -451 -548 CMPID-AXP-2DIM 13

  • 144

243 200 120 96 -290 -512 -616 CMPID-AXP-2DIM 14

  • 181

212 170 95 79 -335 -573 -684 CMPID-AXP-2DIM 15

  • 218

181 140 70 62 -380 -634 -752

2D weights for AXP use the columns to represent phone edit distance and the rows to represent address edit distance. Index column for address works like wgt1dim (0=missing, 1=exact, 2=one edit …). Score when both phone and address are totally different. Score when both phone and address are exact match.

Generating Weights

slide-63
SLIDE 63

Page: Confidential

String-Based Weights: wgtsval

63

Comparison Type Index Weight CMPID-NAME-XACT R 115 CMPID-NAME-XACT L 133 CMPID-NAME-XACT N 133 CMPID-NAME-XACT X 267 CMPID-NAME-XACT CHRIS 316 CMPID-NAME-XACT JOHN 318 CMPID-NAME-XACT JENNIFER 333 CMPID-NAME-XACT BRITTANY 339 CMPID-NAME-XACT LUPE 352 CMPID-NAME-XACT DARIUS 354 CMPID-NAME-XACT a 396 CMPID-NAME-XACT d

  • 100

CMPID-NAME-PARM __FULLNAME_MAXWGT 594 CMPID-NAME-PARM __NORM_MCCIDX_EQUAL 20 Comparison Type Index Weight CMPID-AXP-XACT AZ 145 CMPID-AXP-XACT ST 181 CMPID-AXP-XACT AVE 360 CMPID-AXP-XACT RD 465 CMPID-AXP-XACT JUNCTION 471 CMPID-AXP-XACT CREEK 557 CMPID-AXP-XACT BLVD 566 CMPID-AXP-XACT DR 566 CMPID-AXP-XACT HILLS 597 CMPID-AXP-XACT WEST 598 CMPID-AXP-XACT AFB 617 CMPID-AXP-XACT PARK 633 CMPID-AXP-XACT a 883 CMPID-AXP-PARM __ADDR_STREET_MAXWGT 3000

XACT weights are frequency-based, so the more common a value is, the lower the score. The rarer a value is, the higher the score.

Generating Weights

slide-64
SLIDE 64

Page: Confidential

String-Based Weights: wgtsval

64

Comparison Type Index Weight CMPID-NAME-XACT R 115 CMPID-NAME-XACT L 133 CMPID-NAME-XACT N 133 CMPID-NAME-XACT X 267 CMPID-NAME-XACT CHRIS 316 CMPID-NAME-XACT JOHN 318 CMPID-NAME-XACT JENNIFER 333 CMPID-NAME-XACT BRITTANY 339 CMPID-NAME-XACT LUPE 352 CMPID-NAME-XACT DARIUS 354 CMPID-NAME-XACT a 396 CMPID-NAME-XACT d

  • 100

CMPID-NAME-PARM __FULLNAME_MAXWGT 594 CMPID-NAME-PARM __NORM_MCCIDX_EQUAL 20 Comparison Type Index Weight CMPID-AXP-XACT AZ 145 CMPID-AXP-XACT ST 181 CMPID-AXP-XACT AVE 360 CMPID-AXP-XACT RD 465 CMPID-AXP-XACT JUNCTION 471 CMPID-AXP-XACT CREEK 557 CMPID-AXP-XACT BLVD 566 CMPID-AXP-XACT DR 566 CMPID-AXP-XACT HILLS 597 CMPID-AXP-XACT WEST 598 CMPID-AXP-XACT AFB 617 CMPID-AXP-XACT PARK 633 CMPID-AXP-XACT a 883 CMPID-AXP-PARM __ADDR_STREET_MAXWGT 3000

Default agree weights are used when a value is encountered that does not have a preset weight score. The ‘a’ weight is the highest score, because the value is very rare.

Generating Weights

slide-65
SLIDE 65

Page: Confidential

String-Based Weights: wgtsval

65

Comparison Type Index Weight CMPID-NAME-XACT R 115 CMPID-NAME-XACT L 133 CMPID-NAME-XACT N 133 CMPID-NAME-XACT X 267 CMPID-NAME-XACT CHRIS 316 CMPID-NAME-XACT JOHN 318 CMPID-NAME-XACT JENNIFER 333 CMPID-NAME-XACT BRITTANY 339 CMPID-NAME-XACT LUPE 352 CMPID-NAME-XACT DARIUS 354 CMPID-NAME-XACT a 396 CMPID-NAME-XACT d

  • 100

CMPID-NAME-PARM __FULLNAME_MAXWGT 594 CMPID-NAME-PARM __NORM_MCCIDX_EQUAL 20 Comparison Type Index Weight CMPID-AXP-XACT AZ 145 CMPID-AXP-XACT ST 181 CMPID-AXP-XACT AVE 360 CMPID-AXP-XACT RD 465 CMPID-AXP-XACT JUNCTION 471 CMPID-AXP-XACT CREEK 557 CMPID-AXP-XACT BLVD 566 CMPID-AXP-XACT DR 566 CMPID-AXP-XACT HILLS 597 CMPID-AXP-XACT WEST 598 CMPID-AXP-XACT AFB 617 CMPID-AXP-XACT PARK 633 CMPID-AXP-XACT a 883 CMPID-AXP-PARM __ADDR_STREET_MAXWGT 3000

Parameter weights issue bonus points, put caps on maximum scores, and indicate the penalties to subtract for edit distance, nicknames, and phonetics.

Generating Weights

slide-66
SLIDE 66

Page: Confidential 66

FPF – wgt4dim

Name & 
 Gender DOB 
 Edit Dist. DOB
 Year Diff. SSN 
 Missing SSN 
 Exact SSN 
 Edit Dist.=1

  • 0.5
  • 0.65

1

  • 0.5
  • 0.65

2

  • 0.5
  • 0.65

3

  • 0.5
  • 0.65

4

  • 0.5
  • 0.65

5

  • 0.5
  • 0.65

1 1 1 1 2 1 3 1 4 1 5 2

  • 0.65
  • 0.2
  • 0.25

2 1

  • 0.3

2 2

  • 0.7
  • 0.35
  • 0.4

2 3

  • 0.75
  • 0.4
  • 0.45

2 4

  • 0.8
  • 0.45
  • 0.5

2 5

  • 0.85
  • 0.5
  • 0.55

1

  • 0.2

1 1

  • 0.2

1 2

  • 0.2

1 3

  • 0.2

1 4

  • 0.2

1 5

  • 0.2

1 1 1 1 1 1 1 2 1 1 3 1 1 4 1 1 5 1 2

  • 0.35
  • 0.1
  • 0.15

1 2 1

  • 0.3

1 2 2

  • 0.4
  • 0.25
  • 0.3

1 2 3

  • 0.45
  • 0.3
  • 0.35

1 2 4

  • 0.7
  • 0.35
  • 0.4

1 2 5

  • 0.75
  • 0.4
  • 0.45

Name & Gender

Index Name Result Gender Result

Missing Missing 1 Exact Missing 2 Partial Missing 3 Disagree Missing 4 Missing Agree 5 Exact Agree 6 Partial Agree 7 Disagree Agree 8 Missing Disagree 9 Exact Disagree 10 Partial Disagree 11 Disagree Disagree

Birth Month/Day

Index Meaning

One or both Dates are Missing mm/dd 1 The mm/dd Dates are an Exact match 2 The mm/dd have an Edit Distance of 1

Birth Year

Index Meaning

One or both years are Missing 1 Dates are 0-4 Years different 2 Dates are 5-9 Years different 3 Dates are 10-14 Years different 4 Dates are 15-19 Years different

SSN

Index Meaning

One or both SSNs are Missing 1 SSNs are an Exact match 2 SSNs have an Edit Distance of 1 … Edit distance can continue to 9

slide-67
SLIDE 67

Page: Confidential 67

FPF – wgt4dim

Name & 
 Gender DOB 
 Edit Dist. DOB
 Year Diff. SSN 
 Missing SSN 
 Exact SSN 
 Edit Dist.=1

  • 0.5
  • 0.65

1

  • 0.5
  • 0.65

2

  • 0.5
  • 0.65

3

  • 0.5
  • 0.65

4

  • 0.5
  • 0.65

5

  • 0.5
  • 0.65

1 1 1 1 2 1 3 1 4 1 5 2

  • 0.65
  • 0.2
  • 0.25

2 1

  • 0.3

2 2

  • 0.7
  • 0.35
  • 0.4

2 3

  • 0.75
  • 0.4
  • 0.45

2 4

  • 0.8
  • 0.45
  • 0.5

2 5

  • 0.85
  • 0.5
  • 0.55

1

  • 0.2

1 1

  • 0.2

1 2

  • 0.2

1 3

  • 0.2

1 4

  • 0.2

1 5

  • 0.2

1 1 1 1 1 1 1 2 1 1 3 1 1 4 1 1 5 1 2

  • 0.35
  • 0.1
  • 0.15

1 2 1

  • 0.3

1 2 2

  • 0.4
  • 0.25
  • 0.3

1 2 3

  • 0.45
  • 0.3
  • 0.35

1 2 4

  • 0.7
  • 0.35
  • 0.4

1 2 5

  • 0.75
  • 0.4
  • 0.45

Name & Gender

Index Name Result Gender Result

Missing Missing 1 Exact Missing 2 Partial Missing 3 Disagree Missing 4 Missing Agree 5 Exact Agree 6 Partial Agree 7 Disagree Agree 8 Missing Disagree 9 Exact Disagree 10 Partial Disagree 11 Disagree Disagree

Birth Month/Day

Index Meaning

One or both Dates are Missing mm/dd 1 The mm/dd Dates are an Exact match 2 The mm/dd have an Edit Distance of 1

Birth Year

Index Meaning

One or both years are Missing 1 Dates are 0-4 Years different 2 Dates are 5-9 Years different 3 Dates are 10-14 Years different 4 Dates are 15-19 Years different

SSN

Index Meaning

One or both SSNs are Missing 1 SSNs are an Exact match 2 SSNs have an Edit Distance of 1 … Edit distance can continue to 9

This position holds a penalty for when name & gender are both missing, month\day is

  • ff, the year is 15-19 different,

but SSN is exactly the same.

slide-68
SLIDE 68

Page: Confidential 68

FPF – wgt4dim

Name & 
 Gender DOB 
 Edit Dist. DOB
 Year Diff. SSN 
 Missing SSN 
 Exact SSN 
 Edit Dist.=1

  • 0.5
  • 0.65

1

  • 0.5
  • 0.65

2

  • 0.5
  • 0.65

3

  • 0.5
  • 0.65

4

  • 0.5
  • 0.65

5

  • 0.5
  • 0.65

1 1 1 1 2 1 3 1 4 1 5 2

  • 0.65
  • 0.2
  • 0.25

2 1

  • 0.3

2 2

  • 0.7
  • 0.35
  • 0.4

2 3

  • 0.75
  • 0.4
  • 0.45

2 4

  • 0.8
  • 0.45
  • 0.5

2 5

  • 0.85
  • 0.5
  • 0.55

1

  • 0.2

1 1

  • 0.2

1 2

  • 0.2

1 3

  • 0.2

1 4

  • 0.2

1 5

  • 0.2

1 1 1 1 1 1 1 2 1 1 3 1 1 4 1 1 5 1 2

  • 0.35
  • 0.1
  • 0.15

1 2 1

  • 0.3

1 2 2

  • 0.4
  • 0.25
  • 0.3

1 2 3

  • 0.45
  • 0.3
  • 0.35

1 2 4

  • 0.7
  • 0.35
  • 0.4

1 2 5

  • 0.75
  • 0.4
  • 0.45

Name & Gender

Index Name Result Gender Result

Missing Missing 1 Exact Missing 2 Partial Missing 3 Disagree Missing 4 Missing Agree 5 Exact Agree 6 Partial Agree 7 Disagree Agree 8 Missing Disagree 9 Exact Disagree 10 Partial Disagree 11 Disagree Disagree

Birth Month/Day

Index Meaning

One or both Dates are Missing mm/dd 1 The mm/dd Dates are an Exact match 2 The mm/dd have an Edit Distance of 1

Birth Year

Index Meaning

One or both years are Missing 1 Dates are 0-4 Years different 2 Dates are 5-9 Years different 3 Dates are 10-14 Years different 4 Dates are 15-19 Years different

SSN

Index Meaning

One or both SSNs are Missing 1 SSNs are an Exact match 2 SSNs have an Edit Distance of 1 … Edit distance can continue to 9

This position shows no penalty for when all 4 dimensions are in perfect agreement.

slide-69
SLIDE 69

Page: Confidential

  • Added or changed comparison function or comparison

code in Algorithm

  • When population has grown by >20% of original population
  • It has been two years since weights last calculated
  • When new source has been added, especially one that

comes from a different geographical area (East Coast has different name, phone and address distributions than West Coast)

  • When upgrading to a new version of IBM MDM

When to Recalculate Weights?

69

Generating Weights

slide-70
SLIDE 70

Page: Confidential

  • New algorithm comparison functions require new

weights

  • Adjustments to weight scores require new

thresholds

  • New thresholds may require adjustments to FPF

weights

  • Sometimes what look like threshold issues, can be

fixed by tweaking weights

Weights and Threshold Connections

70

Generating Weights

slide-71
SLIDE 71

Threshold Analysis

slide-72
SLIDE 72

Page: Confidential

Ignore Link

  • Two thresholds: link, ignore, or create task

Thresholds

72

TASKS Clerical Review Threshold (CR) Auto-Link Threshold (AL)

Threshold Analysis

slide-73
SLIDE 73

Page: Confidential

Ignore Link

  • Spreading thresholds further apart (usually more on AL) ensures that

matches will be highly accurate, but it increases the number of tasks

Gaining Higher Accuracy

73

TASKS Clerical Review Threshold (CR) Auto-Link Threshold (AL)

Threshold Analysis

slide-74
SLIDE 74

Page: Confidential

Ignore Link

  • If CR=AL, there are no manual review tasks

Total Automation

74

CR = AL

Threshold Analysis

slide-75
SLIDE 75

Page: Confidential

  • Determine optimal score (where we believe MDM found same

person)

  • Scan a data sample for False Positives and False Negatives
  • Determine score at which records should be manually reviewed
  • Validate that the MDM tool is producing accurate matches
  • Confirm that the weights generated by MDM make sense
  • Identify general data quality issues
  • Show hub how willing company trusts matching/linking process

Why do we Analyze Thresholds?

75

Threshold Analysis

slide-76
SLIDE 76

Page: Confidential

  • Threshold Analysis uses real data from across sources:
  • Uses the “Threshold Analysis Pair Generation” job
  • Answer “Yes” “No” or “Maybe” to the assigned pairs
  • Results compiled and analyzed by Threshold Calculator
  • Thresholds will be adjusted to provide desired accuracy
  • Deploy updated Thresholds and repeat for further

revisions

Threshold Analysis Process

76

Threshold Analysis

slide-77
SLIDE 77

Page: Confidential

  • Sample records are grouped in pairs
  • Pairs will appear in random score order
  • Do the records belong to the same person?
  • Make a gut reaction, don’t think too hard
  • Expect misspellings and typos, take into consideration
  • Some of the pairs are supposed to be bad
  • Watch for twins and Jr./Sr. pairs (data very similar, but

DOB or Name might reveal difference)

Sample Pairs Exercise

77

Threshold Analysis

slide-78
SLIDE 78

Page: Confidential

Sample Pair 1: Yes, No, Maybe?

78

Record X Record Y Name Birth Date Address SSN Home Phone Cell Phone Gender M M Marital Status

Threshold Analysis

slide-79
SLIDE 79

Page: Confidential

Sample Pair 2: Yes, No, Maybe?

79

Record X Record Y Name Birth Date Address SSN 993-20-1661 993-20-1661 Home Phone Cell Phone Gender Marital Status

Threshold Analysis

slide-80
SLIDE 80

Page: Confidential

Sample Pair 3: Yes, No, Maybe?

80

Record X Record Y Name Sue Chaudray-Patel Susan C. Patel Birth Date Address SSN Home Phone Cell Phone Gender Marital Status

Threshold Analysis

slide-81
SLIDE 81

Page: Confidential

Sample Pair 4: Yes, No, Maybe?

81

Record X Record Y Name Rick H. Morrison, Jr. Richard Henry Morrison Birth Date 1938-06-05 Address SSN Home Phone Cell Phone Gender Marital Status

Threshold Analysis

slide-82
SLIDE 82

Page: Confidential

Sample Pair 5: Yes, No, Maybe?

82

Record X Record Y Name Rick H. Morrison, Jr. Richard Henry Morrison Birth Date 1952-04-17 1952-04-17 Address 8821 W. Grosse Point Way Ann Arbor, MI 48104 SSN 993-20-1661 993-02-1661 Home Phone (313) 623-1863 (517) 881-1437 Cell Phone (517) 881-1437 Gender M M Marital Status Married

Threshold Analysis

slide-83
SLIDE 83

Page: Confidential

Sample Pair 6: Yes, No, Maybe?

83

Record X Record Y Name LaDonna M. Jeffries Jeff M. L’Donne Birth Date 1979-12-14 1963-08-30 Address 118 N. Gartner Road, Apt. 3B
 Kalamazoo, MI 49003 19 West Big Timber Dryden, MI 48428 SSN 999-99-9999 171-12-1646 Home Phone (269) 234-3782 Cell Phone (269) 383-1129 (810) 623-1672 Gender F M Marital Status Single Divorced

Threshold Analysis

slide-84
SLIDE 84

Page: Confidential

Sample Pair 7: Yes, No, Maybe?

84

Record X Record Y Name Vallie G. Musial Val Y. Musia Birth Date 1982-02-26 1982-02-26 Address 3792 W. Kingston St. Ann Arbor, MI 48106 SSN 271-19-1209 Home Phone (269) 392-1810 Cell Phone (269) 932-1180 Gender F F Marital Status Single

Threshold Analysis

slide-85
SLIDE 85

Page: Confidential

Sample Pair 8: Yes, No, Maybe?

85

Record X Record Y Name Ernest L. Johns Ernest L. Johns Birth Date 1932-03-28 1932-03-28 Address SSN 810-78-1206 810-78-1206 Home Phone (312) 445-2343 (708) 293-7093 Cell Phone Gender M M Marital Status

Threshold Analysis

slide-86
SLIDE 86

Page: Confidential

  • Company representatives review Sample

Pairs

  • Use Pair Manager or Excel
  • ~2K matched pairs (10 pairs/0.1 change)
  • Must indicate Yes, No, or Maybe to validate

each pair in the file

Sample Pair Review

86

Threshold Analysis

slide-87
SLIDE 87

Page: Confidential

Using Pair Manager

87

Revisit answers based on status Buttons for “Yes”, “No”, “Maybe” Exact matches turn green Rearrange and hide fields

Threshold Analysis

slide-88
SLIDE 88

Page: Confidential

  • Sample Pairs → Threshold

Calculator

  • False Positive Rate
  • Goal: <1 in 1 million
  • False Negative Rate
  • Goal: <1 in 1 thousand
  • Estimated # of Tasks

Dependent on Resources

Threshold Calculation

88

Threshold Analysis

slide-89
SLIDE 89

Implementation Process

slide-90
SLIDE 90

Page: Confidential

Bulk Cross Match Clean Data Extract

General Implementation Process

90

Review Customer Requirements Configure Data Model Configure Algorithm Deploy Instance Derive Data Generate Weights Analyze And Review Test Config Reiterate

Implementation Process

slide-91
SLIDE 91

Page: Confidential

  • Review requirements defined for the Data

Extract by customer

  • Dependencies: Clear understanding of data

sources involved, data fields needed, and how to gather that data in a way that MDM software can consume it

Review Customer Requirements

91

Bulk Cross Match Clean Data Extract Review Customer Requirements Configure Data Model Configure Algorithm Deploy Instance Derive Data Generate Weights Analyze And Review Test Config

Implementation Process

slide-92
SLIDE 92

Page: Confidential

  • Configure a Member Model to fit project needs: data

loaded includes, metadata (like sources and attributes), validation lists, and lookup tables in Workbench

  • Initiate member model defines the way that the MDM

software stores, manages, and validates data

  • Dependencies: Data Extract Guide outlines specific

attributes and fields and Implementation Approach defines additional data dictionary requirements

Configure Data Model

92

Bulk Cross Match Clean Data Extract Review Customer Requirements Configure Data Model Configure Algorithm Deploy Instance Derive Data Generate Weights Analyze And Review Test Config

Implementation Process

slide-93
SLIDE 93

Page: Confidential

  • Design/build an algorithm to address

attributes, comparisons, search requirements, and bucketing design needs

  • Dependencies: Proper data elements must

be in place before algorithm can be fully developed

Configure Algorithm

93

Bulk Cross Match Clean Data Extract Review Customer Requirements Configure Data Model Configure Algorithm Deploy Instance Derive Data Generate Weights Analyze And Review Test Config

Implementation Process

slide-94
SLIDE 94

Page: Confidential

  • Analyze data to ensure it conforms to

specifications and fix any problems found

  • Data Extract is a sampling of data
  • Dependencies: Data Extract Guide outlines

data requirements: Workbench and CloverETL perform data cleansing

Clean Data Extract

94

Bulk Cross Match Clean Data Extract Review Customer Requirements Configure Data Model Configure Algorithm Deploy Instance Derive Data Generate Weights Analyze And Review Test Config

Implementation Process

slide-95
SLIDE 95

Page: Confidential

  • Create a new database for Master Data

Service to reference

  • Dependencies: Need to have a supported

database platform

Deploy Instance: Create Empty DB

95

Bulk Cross Match Clean Data Extract Review Customer Requirements Configure Data Model Configure Algorithm Deploy Instance Derive Data Generate Weights Analyze And Review Test Config

Implementation Process

slide-96
SLIDE 96

Page: Confidential

  • Install Master Data Service engine, configure

ODBC connection, and establish Windows service

  • Dependencies: Need proper software

installation files for OS and an empty database in order to create your instance

Deploy Instance: Create Hub Instance

96

Bulk Cross Match Clean Data Extract Review Customer Requirements Configure Data Model Configure Algorithm Deploy Instance Derive Data Generate Weights Analyze And Review Test Config

Implementation Process

slide-97
SLIDE 97

Page: Confidential

  • Create core database tables, define field

properties, index tables, populate data dictionary tables with default settings

  • Dependencies: Need to have Master Data

Service software installed and access to empty database and hub instance

Deploy Instance: Bootstrap DB

97

Bulk Cross Match Clean Data Extract Review Customer Requirements Configure Data Model Configure Algorithm Deploy Instance Derive Data Generate Weights Analyze And Review Test Config

Implementation Process

slide-98
SLIDE 98

Page: Confidential

  • Data Dictionary tables control validation rules,

application properties, attributes, sources, nicknames, and core algorithm settings

  • Dependencies: Need Master Data Service

engine with a bootstrapped database running before data dictionary can be imported

Deploy Instance: Deploy Configuration

98

Bulk Cross Match Clean Data Extract Review Customer Requirements Configure Data Model Configure Algorithm Deploy Instance Derive Data Generate Weights Analyze And Review Test Config

Implementation Process

slide-99
SLIDE 99

Page: Confidential

  • Derived data is data processed by algorithm: four events
  • Raw data is parsed into segment specific unload files
  • Comparison strings are built from standardized data
  • Members are assigned bucket hashes
  • Binary files are created for faster computation
  • Dependencies: Need most components installed and

configured (eg, hub engine, member model, algorithm): if changes made to algorithm, then data must be re-derived

Derive Data

99

Bulk Cross Match Clean Data Extract Review Customer Requirements Configure Data Model Configure Algorithm Deploy Instance Derive Data Generate Weights Analyze And Review Test Config

Implementation Process

slide-100
SLIDE 100

Page: Confidential

  • Measure value frequency and assign weights
  • Weight generation process goes through

multiple steps to measure frequency of individual values in database

  • Dependencies: Must have engine installed and

algorithm configured (if data already derived then weight generation will take less time)

Generate Weights

100

Bulk Cross Match Clean Data Extract Review Customer Requirements Configure Data Model Configure Algorithm Deploy Instance Derive Data Generate Weights Analyze And Review Test Config

Implementation Process

slide-101
SLIDE 101

Page: Confidential

  • Compare records, calculate comparison scores based on weights,

and then link records that match

  • BXM, a process for comparing and linking thousands of records per

second, is most commonly performed in the initial stage of implementation and again right before system goes live

  • Dependencies: Must have derived data and weights before

performing BXM (engine, algorithm, and dictionary must be in place)

  • Uses weights to generate/aggregate comparison score (compare

to thresholds to determine auto-linking and task generation)

Bulk Cross Match (BXM)

101

Bulk Cross Match Clean Data Extract Review Customer Requirements Configure Data Model Configure Algorithm Deploy Instance Derive Data Generate Weights Analyze And Review Test Config

Implementation Process

slide-102
SLIDE 102

Page: Confidential

  • Once data is loaded into Master Data Service,

establish how well system & data are performing

  • Assess attribute completeness, score

distribution, entity and bucket size, and thresholds

  • Dependencies: Core engine and data must be

fully loaded to run data analytics

Analyze and Review

102

Bulk Cross Match Clean Data Extract Review Customer Requirements Configure Data Model Configure Algorithm Deploy Instance Derive Data Generate Weights Analyze And Review Test Config

Implementation Process

slide-103
SLIDE 103

Page: Confidential

  • Take results of analysis and make tweaks to algorithm and

data dictionary, if necessary

  • After edits re-derive data, run another BXM, and analyze

results again

  • Dependencies: Bucket design changes usually require re-

deriving, but not another BXM

  • Comparison changes require new weights, re-derivation,

and a new BXM

Reiterate

103

Bulk Cross Match Clean Data Extract Review Customer Requirements Configure Data Model Configure Algorithm Deploy Instance Derive Data Generate Weights Analyze And Review Test Config

Implementation Process

slide-104
SLIDE 104

Page: Confidential

  • Test hub configuration using CloverETL graph

designed to perform a MEMPUT operation

  • Dependencies: Configuration and data must

be fully loaded into Master Data Service

Test the Hub Configuration

104

Bulk Cross Match Clean Data Extract Review Customer Requirements Configure Data Model Configure Algorithm Deploy Instance Derive Data Generate Weights Analyze And Review Test Config

Implementation Process

slide-105
SLIDE 105

Matching Challenges

slide-106
SLIDE 106

Page: Confidential

  • Decision Theoretic
  • Uses likelihood ratio test to determine if two records refer to same
  • bject
  • Easily accommodates adding attributes to the matching system
  • Empirical Component
  • Comparison functions reflect knowledge of types of errors typically

associated with particular attributes

  • That is, how do you measure (not weight) closeness?
  • Data-Adaptive
  • Underlying probability densities for test statistic are estimated from data

To Match or Not To Match

106

Matching Challenges

slide-107
SLIDE 107

Page: Confidential

Truth versus Matching Decision

107

Correct Decision Correct Decision False Positive False Negative Match Don’t Match Matching Decision Different Member Same Member Truth

Balance between the two determined by setting thresholds specific to customer’s requirements Likelihood Theory (adaptive weights) aimed at reducing FP

Matching Challenges

slide-108
SLIDE 108

Page: Confidential

  • Frequently caused by ad-hoc weighting schemes
  • How to correctly combine comparison results across multiple

attributes while allowing for partial matches?

  • If two birth dates agree but SSNs are two digits off, would it count the

same as if SSNs agreed but birth dates were two digits off?

  • What if SSNs disagreed but last name was the same?
  • How to account for information differences between two match

results?

  • Eg, a match on JOHN SMITH provides less information than a match
  • n ALBERT EINSTEIN in making the match/don’t match decision

False Positives

108

Matching Challenges

slide-109
SLIDE 109

Page: Confidential

  • Determine the threshold at which you are confident

that two records represent the same person

  • Determine the threshold at which you want to manually

review potential duplicates

  • Validate that the configured algorithm is producing

meaningful matches

  • Confirm that the generated weights make sense
  • Identify general data quality issues

Objectives of Threshold Analysis

109

Matching Challenges

slide-110
SLIDE 110

Page: Confidential

  • Generate ~20 sample pairs per tenth point and look for:
  • False Positives in data set?
  • Add more anonymous values or nicknames?
  • Overall, do the scores make sense?
  • Do higher scoring pairs have more attributes in

common than lower scoring pairs?

  • Is algorithm comparing/scoring members as expected?

Pre-Threshold Analysis: Review

110

Matching Challenges

slide-111
SLIDE 111

Page: Confidential

  • Looking for False Positives usually means “eyeballing”

sample pairs and looking for certain telltale signs

  • In health care situations, look for high scoring pairs with

attribute differences between name, DOB, and/or SSN

  • Lower scores will have a lot of false positives, so goal is

to locate highest scoring false positives, usually ≥10

  • If we can eliminate higher scoring false positives, then
  • rganization will likely be able to implement a lower AL

Looking for False Positives

111

Matching Challenges

slide-112
SLIDE 112

Page: Confidential

  • Usually can identify trends in false positives such as twins, father/son, or husband/wife

linkages that can be penalized and eliminated with the FPF that uses the following rules:

  • If there is a complete mismatch on the name, FPF is set to true
  • If there is a phonetic or partial name match AND either DOB or GENDER disagrees, FPF

is set to true. (FPF looks at DOB or SEX only if there is a partial match.)

  • If a name suffix disagrees, FPF is set to true
  • ANON values are treated as a 0 in comparisons and are treated neutrally by FPF
  • Augment the algorithm as necessary by adding or updating FPF settings
  • Rerun comparisons and review new linkages while confirming the false positives you

targeted from the last run are no longer being linked

  • Important to remember that it is possible to introduce false negatives while trying to

eliminate false positives when applying FPF

Resolve FP with False Positive Filter

112

Matching Challenges

slide-113
SLIDE 113

Page: Confidential 113

slide-114
SLIDE 114

Page: Confidential

Questions

114