Master Data Management
Nick Pizzi, PhD pizzi@imt.ca
IMT provides solutions in: e-Health strategy and architecture; integration; master data management; and big data analytics. IMT specializes in large-scale Electronic Health Record
Page: Confidential
integration; master data management; and big data analytics
Registries that utilize a probabilistic matching algorithm
nearly all core Client and Provider Registries in Canada that now serve as foundational components for the Canada Health Infoway EHR Blueprint
EHR projects in Canada and USA
critical business processes across the enterprise
patients, suppliers, partners, products, employees, accounts …
transaction, application and decision
Master Data
single, trusted and complete version of critical data assets to downstream applications, end users and processes
called “Golden Records”
information
create an “index”
point to multiple records
at the source level
views (virtual Golden Record)
central ownership of data
enterprise
to a task
viewing relationships, or populating a data warehouse for business intelligence
records to the MDM via the Broker, API, or periodic batch updates
demographic identifiers (MEMIDENT)
ID# | Name | Work Address | Home Phone | Email
319883 | Debbie Smith | 4150-50 Tecnology Way, Carson City, NV 89722 | 775.789.1020 | dbecker@mdmgroup.org
319884 | Ronald Bucher | 931 West Canyon Blvd., Gardnerville, NV | 775.212.1891 | rbucher@gmail.com
319885 | Jennifer Long | 882 N. Weldon Way, Reno, NV 89502 | 775.279.5629 | Jen1871@aol.com
319886 | Mike Smith | | 775.302.2582 | Msmith2@yahoo.com
Data Source
Entities & Attributes
Entity: Deb Becker-Smith (EID: 456)
Source A: Debbie Beckersmith
Source B: Deborah Becker
Source C: Deb Becker-Smith
Phone help determine the individual people in data
matches
Entity
MDM HUB
MEMPHONE: Home Phone, Mobile Phone, Work Phone, Fax Number
MEMNAME: Patient Name, Previous Name, Guarantor, Next of Kin
mpi_memphone: phCc, phArea, phNumber, phExtension, phComment
Birth Date Name Home Address Home Phone
Standardize: Convert data to its simplest form for easier use during the matching process. Example: (512) 634-5144 ⟹ 6345144
Bucket: Organize records that share common values for faster search retrieval. Example: 6345144 ⟹ 1344456
Compare: Compare pairs of records using a probabilistic method to calculate a score. Example: 6345144 vs. 6345414
Data
3.9 Algorithm & Linkages
they are linked together as part of the same Entity
MDM HUB
vs.
EID EID
Score ≥ AutoLink?
When new records are added to the hub, buckets are used to find candidates to compare against. If the comparison score meets or exceeds the auto-link threshold, the records are given the same EID, linking them to the same entity.
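The add-and-link behaviour described above can be sketched as follows. This is a minimal illustration, not the product's implementation: the `Hub` class, the `toy_score` function (6 points per equal field), and the threshold value of 10.0 are all assumptions made for the example.

```python
from collections import defaultdict

AUTO_LINK = 10.0  # hypothetical auto-link threshold


def toy_score(a, b):
    """Toy stand-in for the probabilistic comparison: 6 points per equal field."""
    return sum(6.0 for field, value in a.items() if b.get(field) == value)


class Hub:
    def __init__(self, score_fn=toy_score, auto_link=AUTO_LINK):
        self.score_fn = score_fn
        self.auto_link = auto_link
        self.records = {}                # record id -> attribute dict
        self.eids = {}                   # record id -> entity id (EID)
        self.buckets = defaultdict(set)  # bucket key -> record ids
        self._next_eid = 1

    def add(self, rec_id, record, bucket_keys):
        # Only records that share at least one bucket are compared.
        candidates = set()
        for key in bucket_keys:
            candidates |= self.buckets.get(key, set())
        best_id, best_score = None, float("-inf")
        for cand in candidates:
            score = self.score_fn(record, self.records[cand])
            if score > best_score:
                best_id, best_score = cand, score
        if best_id is not None and best_score >= self.auto_link:
            eid = self.eids[best_id]   # link: reuse the candidate's EID
        else:
            eid = self._next_eid       # no link: start a new entity
            self._next_eid += 1
        self.records[rec_id] = record
        self.eids[rec_id] = eid
        for key in bucket_keys:
            self.buckets[key].add(rec_id)
        return eid
```

Two records that share a bucket and agree on both fields score 12 and receive the same EID; a record that shares the bucket but agrees on only one field scores 6 and starts a new entity.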
version
attribute
Jones, Ron; 2772 W. North Ave., Oak Park, IL 60302; SSN: 972-41-2318
Jones, R.; Ph: (773) 826-2825; DOB: 07/26/1964; SSN: 972-41-2318
Jones, Ronald R.; 5282 W. Chicago Ave., Chicago, IL 60610; Ph: (773) 826-2825
Jones, Ronald R.; 5282 W. Chicago Ave., Chicago, IL 60610; Ph: (773) 826-2825; DOB: 07/26/1964; SSN: 972-41-2318
Diagram labels: Source + MemIdNum, MemRecNo, Algorithm, Entity Manager, Create Task?
MemIdNum is used to store the original primary key for each source record.
MemRecNo is assigned as the internal MDM key field for each unique record.
Probable matches are assessed to determine a comparison score.
The score is measured to decide whether to link, do nothing, or create a task.
The same EID is assigned to both records when they are linked.
If the score is unclear, a task is created for Data Stewards to review.
Ignore | TASKS | Link
The Clerical Review Threshold (CR) separates Ignore from TASKS; the Auto-Link Threshold (AL) separates TASKS from Link.
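The three score regions can be expressed as a small decision function. This is a sketch under stated assumptions: the function name `decide` is hypothetical, and treating a score exactly equal to a threshold as belonging to the higher region is an assumption.

```python
def decide(score, clerical_review, auto_link):
    """Map a comparison score to 'ignore', 'task', or 'link'.

    clerical_review (CR) and auto_link (AL) are the two thresholds;
    CR must not exceed AL.
    """
    if clerical_review > auto_link:
        raise ValueError("CR must be less than or equal to AL")
    if score >= auto_link:
        return "link"      # confident match: assign the same EID
    if score >= clerical_review:
        return "task"      # unclear: send to a Data Steward for review
    return "ignore"        # confident non-match: do nothing
```

When CR and AL are set to the same value, the task region disappears and every pair is either linked or ignored.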
Thresholds & Tasks
assessment is wrong
CR AL
FP FN
that penalizes scores for small inaccuracies or missing data
Potential Overlay: Source A record w′ and Source A record w
Potential Duplicate: Source A record w and Source A record w
Potential Linkage: Source A record w and Source B record w
Review Identifier: Source A record x and Source A record w
Smith Household: 1122 Main Street, Reno NV, 89522
Mike Smith, EID: 4569872
Deb Becker-Smith, EID: 1456984
Mikey Smith, EID: 9734546
words
Rajeshi · Katsumoto · Jorge · Twin · Hernandez Katai · John · M’Tembe · Mikel · Hiroshi · Larssen Marija · Ozols · Unknown · Chi · D’Esopo · Fergus O’Shea · Maria · Obidos · Wei · James · Vladamir Jian · Baby · Pham · Oliver · Kensington · Kimball Paolo · Morgan · Boy · Silvia · McDonald · Woo Hiram · Fiona · Marta · Alexi · Zhang · Krishna NoLastName · Isaac · Gabrielle · Test · Vargas Tomas · Hubert · Paul · Nguyen · Hussein · Liam
Adds/Updates are Consumed
Adds and Updates are called Puts. Puts can come via the API, Web Services, Message Brokers, or as a bulk process.
Parse Data to Segments
Each record is assigned an identifier in the MEMHEAD table. Attributes are sent to the appropriate table or UNL file: names go to MEMNAME, etc.
Derive Data
Derived data takes raw information and creates standardized versions of the data for easier comparison and organizes records into buckets for faster searching.
Compare Records
Pairs of records are compared using a probabilistic method. Values are weighted by frequency, similarity, or their relationship to
the score is tallied.
Link or Create Tasks
The results of the comparisons are measured against a set of thresholds that determine whether to link, ignore, or create a task.
Record X: Name: Ron R. Jones; Phone: (773) 826-2825; ID #: 972-41-2318
Record Y: Name: Ronald Jones; Phone: (773) 826-2852; ID #: 972-41-2318
13.8 Thresholds & Tasks
Standardize: Standardization takes the raw data from a record and cleans it to make analysis easier. The resulting string is stored in the mpi_memcmpd table.
Bucket: Bucketing organizes records with similar data to optimize the search process. Bucket hash indexes are stored in the mpi_membktd table.
Compare: Comparison measures the probability that candidate records (from buckets) match. Scores are based on pre-defined weights in lookup tables.
Algorithm Details
types, such as Names, Addresses, Dates, Phones, etc.
Original Data ⟹ Standardized Data
Maria R. Fontana ⟹ FONTANA:MARIA:R::.
391-20-1923 ⟹ 391201923
(832) 812-1193 ⟹ 8121193
(832) 811-2915 ⟹ 8112915
mfont91@us.ibm.com ⟹ MFONT91USIBM
1973-09-21 ⟹ 19730921
928 West Kingston Court Chicago, IL 60617 ⟹ N-928:S-W:S-KINGSTON:S-CT:S-CHICAGO:S-IL:N-60617
Comparison string: FONTANA:MARIA:R::.^391201923^8121193~8112915^MFONT91USIBM^19730921^N-928:S-W:S-KINGSTON:S-CT:S-CHICAGO:S-IL:N-60617
~ = OR condition
^ = Attribute delimiter
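A minimal sketch of how such a comparison string could be assembled. The helper names are hypothetical, and the rule of keeping the last seven digits of a phone number is inferred from the examples on the slide rather than stated by it. Name, email, and address standardization are not attempted here.

```python
import re


def std_digits(value):
    """Strip everything except digits (used here for SSN and date values)."""
    return re.sub(r"\D", "", value)


def std_phone(value):
    """Keep the last seven digits, dropping the area code (assumed rule)."""
    return std_digits(value)[-7:]


def comparison_string(attributes):
    """Join attributes with '^'; join alternative values of one attribute with '~'."""
    return "^".join("~".join(values) for values in attributes)
```

For example, the name, SSN, and both phone numbers from the slide combine as `comparison_string([["FONTANA:MARIA:R::."], ["391201923"], ["8121193", "8112915"]])`.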
Bucket Function: Patient Name (MAX: 2, MIN: 1)
Bucket Function: Birth Date (MAX: 1, MIN: 0)
Bucket Group: Name + DOB (MAX: 2, MIN: 2)
JOHN M SMITH = 3 tokens; DOB ⇾ 1 token
2 Name tokens, 0 DOB tokens: JOHN + M, SMITH + M, JOHN + SMITH
1 Name token, 1 DOB token: SMITH + DOB, JOHN + DOB, M + DOB
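The MIN/MAX rules can be read as constraints on how many tokens each function contributes to a bucket key. The sketch below enumerates the keys for the JOHN M SMITH example; the function name and parameter names are hypothetical, and real bucket keys would be hashed rather than kept as text.

```python
from itertools import combinations


def bucket_keys(name_tokens, dob_tokens,
                name_min=1, name_max=2,
                dob_min=0, dob_max=1,
                group_min=2, group_max=2):
    """Enumerate token combinations allowed by the function and group limits."""
    keys = []
    for n in range(name_min, name_max + 1):
        for d in range(dob_min, dob_max + 1):
            if not group_min <= n + d <= group_max:
                continue  # the group limit applies to the total token count
            for names in combinations(name_tokens, n):
                for dobs in combinations(dob_tokens, d):
                    keys.append(" + ".join(names + dobs))
    return keys
```

Calling `bucket_keys(["JOHN", "M", "SMITH"], ["DOB"])` yields six keys, matching the six on the slide (token order within a key follows the input order, so the slide's SMITH + M appears as M + SMITH).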
Generation Type: What Happens
YYYYMMDD: Bucket records according to the full date, so there is no transformation
MMDD: Use the MMDD portion of the date token to group dates by anniversary (19280912 ⟹ 0912)
Phonetic:
Metaphone: Convert tokens to key sound markers. Industry standard phonetic methodology
Normphone: Use Initiate’s proprietary phonetic method, works best with Western languages
Arabic Name: Apply phonetics to the English translation of Arabic names
Equivalence: Apply a String Equivalency (Nickname or Abbreviation) from a lookup table (Jim ⟹ James)
Equivalence & Phonetics: Blend phonetic and equivalence conversions against tokens (ED ⟹ EDWARD ⟹ ETWRT)
Sorted: Sort the contents of numeric tokens (3014201324 ⟹ 0011223344, 52431 ⟹ 12345)
nGRAM Sequences: Bucket chunks of data with length N (N = 4 against “6345111” ⟹ 6345 3451 4511 5111)
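Three of the generation types are simple string transformations and can be sketched directly. The function names are hypothetical; the phonetic and equivalence types depend on lookup tables and proprietary methods and are not sketched here.

```python
def mmdd(date_token):
    """MMDD: keep the month/day portion of a YYYYMMDD token."""
    return date_token[4:]


def sorted_digits(token):
    """Sorted: order the characters of a numeric token."""
    return "".join(sorted(token))


def ngrams(token, n):
    """nGRAM: every contiguous chunk of length n."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]
```

Sorting makes a bucket tolerant of transposed digits: 6345144 and 6345414 both sort to 1344456 and therefore land in the same bucket.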
are conceived
discrepancies
data
matches
& key sounds
PHONE: (723) 445-2983
Standardization: 4452983
Attribute Bucket (Sorted): 2344589
Name: FONTANA:MARIA:R::. ⟹ FNTN + MR, FNTN + R, MR + R
SSN: 391201923 ⟹ 011223399
Phone: 8121193 ⟹ 1112389
DOB: 19830921 ⟹ FNTN + 198309, MR + 198309, R + 198309
match, Yes or No?
start with same initial (digit)?
does it take to make the values match?
the same key sound markers?
values be a nickname or alias?
Data Type: Comparison Method(s) to Use
Name: All
Birth Date: Exact Match, Edit Distance
Passport #: Exact Match
Gender: Exact Match
Phone #: Exact Match, Edit Distance
3922019 versus 3292089
Phillip versus Philippe
1982-03-21 versus 1982-08-24
Hashida versus Hashida
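Edit distance for pairs like these can be computed with the classic Levenshtein recurrence. This is an illustrative sketch: the slides do not say which edit-distance definition the product uses (for example, whether a transposition counts as one edit or two), so plain Levenshtein is an assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]
```

Under this definition an exact match has distance 0, and a pair of swapped adjacent digits counts as two edits.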
correlation)
pre-defined conditions are met
comparison functions, FPFs issue penalties instead of awarding points
Attribute | Record X | Record Y | Score
Name | John Michael Smith | Johnny M. Smith | 5.2
Gender | Male | Male | 0.3
Address | 828 W. High St. Camden, ME 04843 | 828 W. High St. Camden, ME 04843 | 5.8
Phone | (207) 236-9132 | (207) 236-9132 | 5.2
SSN | 771-29-1821 | | 0.0
Total: 16.5
Total (FPF Adjusted): 14.3
FPF Penalty
Name: partial match; Gender: exact; DOB: missing; SSN: missing
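The arithmetic on this slide can be checked with a short sketch. The function name is hypothetical, and the penalty of 2.2 is not stated on the slide; it is the difference implied by the two totals (16.5 and 14.3).

```python
def total_score(attribute_scores, fpf_penalty=0.0):
    """Sum per-attribute scores and subtract any False Positive Filter penalty."""
    return round(sum(attribute_scores.values()) - fpf_penalty, 1)


scores = {"Name": 5.2, "Gender": 0.3, "Address": 5.8, "Phone": 5.2, "SSN": 0.0}
unadjusted = total_score(scores)                 # 16.5
adjusted = total_score(scores, fpf_penalty=2.2)  # 14.3
```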
Generating Weights
Original Data: Initiate receives data in a wide variety of formats, and this data must be normalized.
Parse Data to Segments: Attributes are sent to the appropriate table or UNL file; names go to MEMNAME, etc.
Create Comparison Strings: Standardized data is caret-delimited and sent to the MEMCMPD table or UNL file.
Assign Bucket Hashes: Hashes are assigned to each record and sent to the MEMBKTD table or UNL file.
Compile Binary Files: MEMCMPD and MEMBKTD are converted to binary and stored in bxm files.
value can uniquely identify a record
pulled
few days to process
value appears within overall population. Common values (like John) have a low score, rare values (like Chitsumungo) have a high score.
E.g., “Gordon” vs. “Gorton” has a distance of 1 edit. An exact match has the highest score, but each edit lowers the score by a certain degree.
scores, extra credit points, and penalties for variance. E.g., there is a maximum weight for Full Name that ensures that the name does not generate a disproportionate score.
Table: Description
mpi_wgthead: Holds core definitions of weights, including comparison specification string and weight type
mpi_wgt1dim: Holds weight values for comparison functions that have a single comparison attribute (e.g., SSN, DOB)
mpi_wgt2dim: Holds weight values for appropriate comparison functions that use two attributes (e.g., Eye + Hair Color)
mpi_wgt3dim: Holds weight values for appropriate comparison functions that use three attributes (e.g., Zip Code + Address + Phone)
mpi_wgt4dim: Holds weight values for the False Positive Filter, which uses four separate attributes to control situations like Twins or Jr’s & Sr’s who are mistakenly linked
mpi_wgtsval: Holds common string weight values based purely on frequency (weights for people's names and attributes that use a simple “match or do not match” like gender)
mpi_wgtnval: Holds common numeric weight values based purely on frequency (date information like birth year)
mpi_stranon: Holds anonymous values established by the Anonymous Value Utility (not a weight table per se, but commonly associated with weights because of the role that anonymous values play in measuring frequency)
Comparison Type | Index | Weight
CMPID-SSN-DIST | |
CMPID-SSN-DIST | 1 | 556
CMPID-SSN-DIST | 2 | 403
CMPID-SSN-DIST | 3 | 315
CMPID-SSN-DIST | 4 | 177
CMPID-SSN-DIST | 5 | 31
CMPID-SSN-DIST | 6 |
CMPID-SSN-DIST | 7 |
CMPID-SSN-DIST | 8 |
CMPID-SSN-DIST | 9 |

Comparison Type | Index | Weight
CMPID-DOB-DIST | |
CMPID-DOB-DIST | 1 |
CMPID-DOB-DIST | 2 | 44
CMPID-DOB-DIST | 3 |
CMPID-DOB-DIST | 4 |

Comparison Type | Index | Weight
CMPID-AXP-1DIM | |
CMPID-AXP-1DIM | 1 | 419
CMPID-AXP-1DIM | 2 | 504
CMPID-AXP-1DIM | 3 | 596
CMPID-AXP-1DIM | 4 | 723
CMPID-AXP-1DIM | 5 | 860
CMPID-AXP-1DIM | 6 | 997
CMPID-AXP-1DIM | 7 | 1126
CMPID-AXP-1DIM | 8 | 1273
CMPID-SSN-DIST | 9 |
Edit distance weights, as with SSN, use the index as a placeholder for the number of edits: 0 = missing, 1 = exact match, 2 = one edit, 3 = two edits, …
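The index convention above can be sketched as a lookup. The weights in the dictionary are the CMPID-SSN-DIST values that are legible on the slide; the remaining indexes are left out rather than guessed, and the function names are hypothetical.

```python
# Weights read from the CMPID-SSN-DIST rows on the slide; indexes whose
# weights are not legible in the source are omitted.
SSN_DIST_WEIGHTS = {1: 556, 2: 403, 3: 315, 4: 177, 5: 31}


def dist_index(value_a, value_b, edits):
    """0 = missing, 1 = exact match, 2 = one edit, 3 = two edits, ..."""
    if not value_a or not value_b:
        return 0
    return edits + 1


def dist_weight(weights, value_a, value_b, edits):
    """Look up the weight for a pair; None when the index has no known weight."""
    return weights.get(dist_index(value_a, value_b, edits))
```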
The date comparison function stores weights across two tables. The index works the same way: 0 = missing, 1 = exact match, 2 = one edit, 3 = two edits, … In the case of an exact match, the weight values are stored in wgtnval.
Comparison Type | Index | Weight
CMPID-DOB-YEAR | 1900 | 342
CMPID-DOB-YEAR | 1899 | 386
CMPID-DOB-YEAR | 1898 | 398
CMPID-DOB-YEAR | 1897 | 419
CMPID-DOB-YEAR | 1901 | 423
CMPID-DOB-YEAR | 1896 | 425
CMPID-DOB-YEAR | 1895 | 445
CMPID-DOB-YEAR | 1894 | 463
CMPID-DOB-YEAR | 2001 | 463
CMPID-DOB-YEAR | 1954 | 473
CMPID-DOB-YEAR | 1955 | 473
CMPID-DOB-YEAR | 1977 | 475
CMPID-DOB-YEAR | 2003 | 475
CMPID-DOB-YEAR | | 477
Exact match scores for dates are based
Here you clearly see fake data because the ‘Baby Boomers’ should be the most common (lowest score) years. If there is an exact match between dates, but the year is not listed, the default score (-1) is awarded. This is the highest score because the value was very rare.
For AXP, the index is used to indicate the number of digits in numerical address elements: 0 = missing, 1 = one digit, 2 = two digits, 3 = three digits, … Scores are applied when two addresses contain the exact same
because one-digit address numbers are common, but eight-digit numbers are rare.
Phone edit distance in columns; address index in rows.
Comparison Type | Index | Missing | Exact | ED 1 | ED 2 | ED 3 | ED 4 | ED 5 | ED 6
CMPID-AXP-2DIM 299 217 60
CMPID-AXP-2DIM | 1 | 300 | 615 | 560 | 420 | 300 | 250 | 220 | 200
CMPID-AXP-2DIM | 2 | 263 | 584 | 530 | 395 | 283 | 205 | 159 | 132
CMPID-AXP-2DIM | 3 | 226 | 553 | 500 | 370 | 266 | 160 | 98 | 64
CMPID-AXP-2DIM | 4 | 189 | 522 | 470 | 345 | 249 | 115 | 37 |
CMPID-AXP-2DIM | 5 | 152 | 491 | 440 | 320 | 232 | 70 | |
CMPID-AXP-2DIM | 6 | 115 | 460 | 410 | 295 | 215 | 25 | |
CMPID-AXP-2DIM | 7 | 78 | 429 | 380 | 270 | 198 | | |
CMPID-AXP-2DIM | 8 | 41 | 398 | 350 | 245 | 181 | | |
CMPID-AXP-2DIM | 9 | 4 | 367 | 320 | 220 | 164 | -110 | -268 | -344
CMPID-AXP-2DIM | 10 | | 336 | 290 | 195 | 147 | -155 | -329 | -412
CMPID-AXP-2DIM | 11 | | 305 | 260 | 170 | 130 | -200 | -390 | -480
CMPID-AXP-2DIM | 12 | | 274 | 230 | 145 | 113 | -245 | -451 | -548
CMPID-AXP-2DIM | 13 | | 243 | 200 | 120 | 96 | -290 | -512 | -616
CMPID-AXP-2DIM | 14 | | 212 | 170 | 95 | 79 | -335 | -573 | -684
CMPID-AXP-2DIM | 15 | | 181 | 140 | 70 | 62 | -380 | -634 | -752
2D weights for AXP use the columns to represent phone edit distance and the rows to represent address edit distance. The index column for address works like wgt1dim (0 = missing, 1 = exact, 2 = one edit, …). Callouts on the table mark the score when both phone and address are totally different and the score when both phone and address are an exact match.
Comparison Type | Index | Weight
CMPID-NAME-XACT | R | 115
CMPID-NAME-XACT | L | 133
CMPID-NAME-XACT | N | 133
CMPID-NAME-XACT | X | 267
CMPID-NAME-XACT | CHRIS | 316
CMPID-NAME-XACT | JOHN | 318
CMPID-NAME-XACT | JENNIFER | 333
CMPID-NAME-XACT | BRITTANY | 339
CMPID-NAME-XACT | LUPE | 352
CMPID-NAME-XACT | DARIUS | 354
CMPID-NAME-XACT | a | 396
CMPID-NAME-XACT | d |
CMPID-NAME-PARM | __FULLNAME_MAXWGT | 594
CMPID-NAME-PARM | __NORM_MCCIDX_EQUAL | 20

Comparison Type | Index | Weight
CMPID-AXP-XACT | AZ | 145
CMPID-AXP-XACT | ST | 181
CMPID-AXP-XACT | AVE | 360
CMPID-AXP-XACT | RD | 465
CMPID-AXP-XACT | JUNCTION | 471
CMPID-AXP-XACT | CREEK | 557
CMPID-AXP-XACT | BLVD | 566
CMPID-AXP-XACT | DR | 566
CMPID-AXP-XACT | HILLS | 597
CMPID-AXP-XACT | WEST | 598
CMPID-AXP-XACT | AFB | 617
CMPID-AXP-XACT | PARK | 633
CMPID-AXP-XACT | a | 883
CMPID-AXP-PARM | __ADDR_STREET_MAXWGT | 3000
XACT weights are frequency-based, so the more common a value is, the lower the score. The rarer a value is, the higher the score.
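One common way to obtain weights with this property in probabilistic linkage is the logarithm of the inverse relative frequency. The slides do not give the formula the product uses, so the function below, its name, and the scale factor of 100 are assumptions meant only to show the shape of the idea.

```python
import math
from collections import Counter


def frequency_weights(values, scale=100):
    """Assumed formula: weight = scale * log10(total / count).

    Common values get low weights; rare values get high weights.
    """
    counts = Counter(values)
    total = sum(counts.values())
    return {value: round(scale * math.log10(total / count))
            for value, count in counts.items()}


names = ["JOHN"] * 50 + ["CHRIS"] * 40 + ["LUPE"] * 10
weights = frequency_weights(names)
```

In this toy population JOHN is the most common name and receives the lowest weight, while LUPE is the rarest and receives the highest.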
Default agree weights are used when a value is encountered that does not have a preset weight score. The ‘a’ weight is the highest score, because the value is very rare.
Parameter weights issue bonus points, put caps on maximum scores, and indicate the penalties to subtract for edit distance, nicknames, and phonetics.
FPF – wgt4dim
Columns: Name & Gender | DOB Edit Dist. | DOB Year Diff. | SSN Missing | SSN Exact | SSN Edit Dist.=1
Name & Gender
Index | Name Result | Gender Result
0 | Missing | Missing
1 | Exact | Missing
2 | Partial | Missing
3 | Disagree | Missing
4 | Missing | Agree
5 | Exact | Agree
6 | Partial | Agree
7 | Disagree | Agree
8 | Missing | Disagree
9 | Exact | Disagree
10 | Partial | Disagree
11 | Disagree | Disagree
Birth Month/Day
Index | Meaning
0 | One or both Dates are Missing mm/dd
1 | The mm/dd Dates are an Exact match
2 | The mm/dd have an Edit Distance of 1
Birth Year
Index | Meaning
0 | One or both years are Missing
1 | Dates are 0-4 Years different
2 | Dates are 5-9 Years different
3 | Dates are 10-14 Years different
4 | Dates are 15-19 Years different
SSN
Index | Meaning
0 | One or both SSNs are Missing
1 | SSNs are an Exact match
2 | SSNs have an Edit Distance of 1
… Edit distance can continue to 9
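The Name & Gender and Birth Year indexes follow a regular pattern that can be computed rather than looked up. The sketch assumes the unnumbered first row of each table is index 0; the function names are hypothetical, and year differences of 20 or more are not listed on the slide, so the function returns None for them.

```python
NAME_RESULTS = ("missing", "exact", "partial", "disagree")
GENDER_RESULTS = ("missing", "agree", "disagree")


def name_gender_index(name_result, gender_result):
    """Gender selects a block of four; the name result selects the row in it."""
    return (GENDER_RESULTS.index(gender_result) * len(NAME_RESULTS)
            + NAME_RESULTS.index(name_result))


def birth_year_index(year_a, year_b):
    """0 = missing, 1 = 0-4 years apart, 2 = 5-9, 3 = 10-14, 4 = 15-19."""
    if year_a is None or year_b is None:
        return 0
    diff = abs(year_a - year_b)
    if diff >= 20:
        return None  # not listed on the slide
    return diff // 5 + 1
```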
This position holds a penalty for when name & gender are both missing, month/day is
but SSN is exactly the same.
This position shows no penalty for when all 4 dimensions are in perfect agreement.
code in Algorithm
comes from a different geographical area (East Coast has different name, phone and address distributions than West Coast)
weights
thresholds
weights
fixed by tweaking weights
Threshold Analysis
matches will be highly accurate, but it increases the number of tasks
Ignore Link
CR = AL
person)
revisions
DOB or Name might reveal difference)
Record X Record Y Name Birth Date Address SSN Home Phone Cell Phone Gender M M Marital Status
Record X Record Y Name Birth Date Address SSN 993-20-1661 993-20-1661 Home Phone Cell Phone Gender Marital Status
Record X Record Y Name Sue Chaudray-Patel Susan C. Patel Birth Date Address SSN Home Phone Cell Phone Gender Marital Status
Record X Record Y Name Rick H. Morrison, Jr. Richard Henry Morrison Birth Date 1938-06-05 Address SSN Home Phone Cell Phone Gender Marital Status
Record X Record Y Name Rick H. Morrison, Jr. Richard Henry Morrison Birth Date 1952-04-17 1952-04-17 Address 8821 W. Grosse Point Way Ann Arbor, MI 48104 SSN 993-20-1661 993-02-1661 Home Phone (313) 623-1863 (517) 881-1437 Cell Phone (517) 881-1437 Gender M M Marital Status Married
Record X Record Y Name LaDonna M. Jeffries Jeff M. L’Donne Birth Date 1979-12-14 1963-08-30 Address 118 N. Gartner Road, Apt. 3B Kalamazoo, MI 49003 19 West Big Timber Dryden, MI 48428 SSN 999-99-9999 171-12-1646 Home Phone (269) 234-3782 Cell Phone (269) 383-1129 (810) 623-1672 Gender F M Marital Status Single Divorced
Record X Record Y Name Vallie G. Musial Val Y. Musia Birth Date 1982-02-26 1982-02-26 Address 3792 W. Kingston St. Ann Arbor, MI 48106 SSN 271-19-1209 Home Phone (269) 392-1810 Cell Phone (269) 932-1180 Gender F F Marital Status Single
Record X Record Y Name Ernest L. Johns Ernest L. Johns Birth Date 1932-03-28 1932-03-28 Address SSN 810-78-1206 810-78-1206 Home Phone (312) 445-2343 (708) 293-7093 Cell Phone Gender M M Marital Status
Revisit answers based on status
Buttons for “Yes”, “No”, “Maybe”
Exact matches turn green
Rearrange and hide fields
Calculator
Dependent on Resources
Bulk Cross Match · Clean Data · Extract
Review Customer Requirements · Configure Data Model · Configure Algorithm · Deploy Instance · Derive Data · Generate Weights · Analyze And Review · Test Config · Reiterate
Implementation Process
loaded includes, metadata (like sources and attributes), validation lists, and lookup tables in Workbench
software stores, manages, and validates data
attributes and fields and Implementation Approach defines additional data dictionary requirements
configured (eg, hub engine, member model, algorithm): if changes made to algorithm, then data must be re-derived
and then link records that match
second, is most commonly performed in the initial stage of implementation and again right before system goes live
performing BXM (engine, algorithm, and dictionary must be in place)
to thresholds to determine auto-linking and task generation)
establish how well system & data are performing
distribution, entity and bucket size, and thresholds
fully loaded to run data analytics
data dictionary, if necessary
results again
deriving, but not another BXM
and a new BXM
associated with particular attributes
Matching Challenges
Matching Decision vs. Truth
Match, Same Member: Correct Decision
Match, Different Member: False Positive
Don’t Match, Same Member: False Negative
Don’t Match, Different Member: Correct Decision
The balance between the two is determined by setting thresholds specific to the customer’s requirements.
Likelihood Theory (adaptive weights) is aimed at reducing FP.
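The trade-off between false positives and false negatives can be illustrated by counting outcomes for a reviewed sample at different thresholds. The function name and the sample scores below are hypothetical; a single match threshold is used for simplicity.

```python
def confusion_counts(scored_pairs, threshold):
    """Count decision outcomes for (score, same_member) pairs at one threshold."""
    counts = {"correct": 0, "false_positive": 0, "false_negative": 0}
    for score, same_member in scored_pairs:
        decided_match = score >= threshold
        if decided_match == same_member:
            counts["correct"] += 1
        elif decided_match:
            counts["false_positive"] += 1   # matched, but different members
        else:
            counts["false_negative"] += 1   # not matched, but same member
    return counts


# Hypothetical reviewed sample: (comparison score, truly the same member?)
sample = [(16.5, True), (14.3, False), (6.0, True), (2.0, False)]
```

With this sample, a threshold of 5.0 gives one false positive and no false negatives, while a threshold of 15.0 gives no false positives and one false negative.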
attributes while allowing for partial matches?
same as if SSNs agreed but birth dates were two digits off?
results?
that two records represent the same person
review potential duplicates
meaningful matches
common than lower scoring pairs?
sample pairs and looking for certain telltale signs
attribute differences between name, DOB, and/or SSN
to locate highest scoring false positives, usually ≥10
linkages that can be penalized and eliminated with the FPF that uses the following rules:
is set to true. (FPF looks at DOB or SEX only if there is a partial match.)
targeted from the last run are no longer being linked
eliminate false positives when applying FPF