Data Vault An in-depth look Juraj Pivovarov Background Owner - - PowerPoint PPT Presentation

data vault
SMART_READER_LITE
LIVE PREVIEW

Data Vault An in-depth look Juraj Pivovarov Background Owner - - PowerPoint PPT Presentation

Data Vault An in-depth look Juraj Pivovarov Background Owner and Consultant at Metadata Innovations Inc. Clients: AEP, OAS, Divestco, CWD, Nexen, AER Data Architect, Data Analyst, Data Modeler Software Developer,


slide-1
SLIDE 1
slide-2
SLIDE 2

Data Vault

An in-depth look

slide-3
SLIDE 3

Juraj Pivovarov

Background

  • Owner and Consultant at Metadata Innovations Inc.
  • Clients: AEP, OAS, Divestco, CWD, Nexen, AER
  • Data Architect, Data Analyst, Data Modeler
  • Software Developer, Combinatorial Optimization, Image Processing
  • M. Sc. Computer Science
  • B. Sc. Pure Mathematics

Hobbies

  • Chess, Scrabble, and Speedcubing
slide-4
SLIDE 4

Background and History

slide-5
SLIDE 5
  • Q. What is Data Vault?
  • A. Data Vault is a modeling methodology for the enterprise data warehouse.
  • It is not something you buy, but something you implement

Data Vault encompasses

  • Data Warehouse Architecture
  • Data Vault Modeling
  • Data Vault Methodology

○ Project planning ○ Project execution ○ Review and improvement

slide-6
SLIDE 6

Why learn Data Vault?

  • You may want to USE it!
  • You may come across it
  • Adopt some of the ideas
slide-7
SLIDE 7

History

  • Data Vault was invented by Dan Linstedt

while at US Department of Defense

  • First published 2001
  • Very popular in Netherlands in insurance and banking.
  • Catching on in North America

Pedestrian Google Search comparison on 2016-09-11

  • Data Vault Data Modeling: 296,000 hits
  • Star Schema Data Modeling: 462,000 hits
  • Data Modeling: 9,140,000 hits
slide-8
SLIDE 8
  • Q. What’s wrong with 3NF data models?

Traditional Data Modeling and Warehousing has difficulties with

  • Changing key structure
  • Changing relationships from 1:M to M:M
  • Performance at scale
  • Complex loading dependencies
  • Inconsistent History tracking
slide-9
SLIDE 9

Data Vault Promises

  • “All the data, all the time”
  • Agile and flexible
  • Scalable
slide-10
SLIDE 10

Data Vault Architecture

slide-11
SLIDE 11

Kimball Warehouse

SRC1 SRC2 SRC3 STG OLAP Cubes Star Schema DWH Transformation and Cleanup “Conformed Dimensions”

slide-12
SLIDE 12

Inmon Warehouse

SRC1 SRC2 SRC3 STG OLAP Cubes 3NF DWH Transformation and Cleanup Data Marts

slide-13
SLIDE 13

Data Vault Warehouse

SRC1 SRC2 SRC3 STG OLAP Cubes Business DV Transformation and Cleanup Data Marts Raw DV Auditable

slide-14
SLIDE 14

Data Vault Core Constructs

slide-15
SLIDE 15

Data Vault Types

  • Hubs

○ Unique list of business keys

  • Links

○ Unique associations of business keys

  • Satellites

○ Descriptive data, time variant

  • Reference Tables (optional)

○ For capturing meanings of codes used

slide-16
SLIDE 16

Hub - definition

Represents Business Keys

  • Hub Surrogate Key (PK)
  • Business Key (simple or composite)
  • Source
  • Load Date
  • Last Seen Date (optional)

Hub

slide-17
SLIDE 17

Hub

  • Single point of definition for business key

○ Not duplicated across other tables

  • Represents the first time the DWH sees the business key
  • Never deleted
  • Business keys should be able to stand on their own.
  • Business keys are what allow you to integrate data across business functions.
slide-18
SLIDE 18

Link - definition

Represents relationships between keys

  • Link Surrogate Key (PK)
  • Hub Keys
  • Dependent child key (optional)
  • Source
  • Load Date
  • Last Seen Date (optional)

Hub2 Hub1 Hub3 Link

slide-19
SLIDE 19

Link

  • Links are always M:M
  • Links may be between multiple hubs
  • “A link must have more than one parent table.” [DV1]
  • Links provide primary flexibility touted by Data Vault
slide-20
SLIDE 20

Satellite - definition

Represents Context, over time.

  • Hub or Link Surrogate Key (PK)
  • Load Date (PK)
  • Attributes {1,...,n}
  • Source (optional)
  • Load End Date

Sat2 Sat3 Hub2 Hub1 Hub3 Link Sat1 Sat4

slide-21
SLIDE 21

Satellites

  • Represent context over a fixed time interval
  • Split by

○ Source - avoid flip-flop effect ○ Rate of Change ○ Data Types

  • Design Decisions

○ Continuum of 1 Satellite per attribute vs 1 Giant Satellite

slide-22
SLIDE 22

Reference Tables - definition

They describe meanings of codes used in Satellites, if applicable. Many options on how to design them

  • Directly, with or without history
  • r as full blown Hubs and Satellites

Foreign Keys

  • Logical foreign keys from Satellites to Reference Tables
  • Never physically implemented

Hub Sat Ref

slide-23
SLIDE 23

Lifetimes

slide-24
SLIDE 24

Links and Satellites - in depth

slide-25
SLIDE 25

Peg-Leg Links

  • Degenerate Links, these have only one Hub reference
  • “A link must have more than one parent table.” [DV1]
  • “They connect two or more hubs, (or same hub twice)” [DV2]
  • Can be produced as byproduct of some DWH automation tools.

Hub Link

slide-26
SLIDE 26

Hierarchical Links

  • Parent and child references to same Hub

Hub Link

slide-27
SLIDE 27

Same-as Links

  • Same-as links are used to record hub synonyms
  • Here, we have FOUR links
  • They mean A=B, B=C, C=D and D=E
  • Is A the same-as E?
  • Problem: it is not obvious

B A C E D

slide-28
SLIDE 28

Same-as Links

  • We have an equivalence class

○ X=X ○ X=Y means Y=X ○ X=Y and Y=Z means X=Z

  • Need Transitive Closure

○ Determine what vertices are reachable

  • Four explicit links, Six implicit ones
  • In general O(n^2) total logical links

B A C E D

slide-29
SLIDE 29

Same-as Links

  • Representing Equivalence Classes
  • Elect ‘leader’ in each class
  • One edge from each to leader
  • Don’t forget reflexive edge, A=A

B A C E D

slide-30
SLIDE 30

Avoid Links to Links

Hub Country Hub Province Link Province Hub City Link City Hub Country Hub Province Link Province Hub City Link City

slide-31
SLIDE 31

Multi Active Satellites

Ex: Modeling phone numbers

  • Different MOBILE, HOME, and WORK numbers

○ Can have two MOBILE phones etc.

  • MOBILE1, MOBILE2, …, MOBILEn is a limited solution.
  • Multiple rows are active in the satellite, for given Hub/Link parent.

Implementation:

  • Add a SEQ number to the primary key of the Satellite
slide-32
SLIDE 32

Effectivity Satellites

Ex: Employee leaves, then comes back for short contract.

  • Effectivity Satellite models discrete time intervals

for which the Hub or Link is valid. Implementation

  • Add Begin Date,

End Date to rows of Satellite

  • (Do not overload meanings of

Load Date, Load End Date)

slide-33
SLIDE 33

Query-Assist Tables

slide-34
SLIDE 34

Point-in-Time (PIT) Tables

  • Query assist tables (optional)
  • Tie together Hub/Link + exact Sat

Rows for point-in-time Example

  • Hub Key
  • Sat 1 Load Date
  • Sat 2 Load Date

H S1 S2

slide-35
SLIDE 35

Bridge Tables

  • Query assist tables (optional)
  • Tie together Links and Hubs

Example

  • H1 Key + H1 business key
  • H2 Key
  • H3 Key
  • L1 Key
  • L2 Key

H2 L1 L2 H3 H1

slide-36
SLIDE 36

Data Vault Tradeoffs

slide-37
SLIDE 37

+ Simple Structures

  • Consistency: Easy to understand and extend.
  • Template-based SQL.

HUB LINK SATELLITE REF

slide-38
SLIDE 38

+ Few Dependencies

  • Simplicity: Easy to determine “Load of the Rings.”
  • Scalability: Very easy to parallelize

HUB LINK SATELLITE REF

slide-39
SLIDE 39

+ Other Benefits

  • Auditability “All the data, all the time”

○ Nothing is changed on the way into the Raw Vault ○ Business rules and data cleanup happens downstream

  • Flexibility: M:M Links

○ If the cardinality of a relationship changes for some reason, no change required in DV

  • Extendable

○ Add new hubs, links and satellites without reengineering any existing structures

slide-40
SLIDE 40
  • Downsides
  • Labelled “Hypernormalized”

○ The data model is much more abstract

  • Not easily queryable

○ M:M links mean every query needs to handle this case ○ Satellites have multiple rows, must find appropriate one ○ Must respect effectivity satellites on Hubs and Links.

  • To automate well, requires some metadata mgmt capabilities

○ Benefits of consistency only come with rigorous adherence to standards ○ Model driven development goes a long way

  • Not all constraints can be enforced

○ see Country/Province/State example ○ No R/I with Satellites and Reference Tables.

slide-41
SLIDE 41

When to use Data Vault?

  • Many source systems, with sometimes contradictory facts
  • Auditability is required
  • Some team members have familiarity with Data Vault
  • Big Data requirements, at least volume and variability
  • Anticipated changes deal with cardinality of relationships
  • To simplify data warehousing efforts, make use of repeatable patterns
slide-42
SLIDE 42

For more information

slide-43
SLIDE 43
  • Q. What’s new in Data Vault 2.0 (2016)?
  • Hashes vs Integer Keys

○ Hashes are more parallelizable ○ Avoid lookups of surrogate keys ○ Compute them instead!

  • Expanding on Data Vault Methodology

○ Many examples with SSIS and SqlServer ○ Ties in to Master Data Services ○ Producing Star Schema Data Marts, etc.

slide-44
SLIDE 44

Data Vault Automation Tools

Data Vaults can be auto generated, to some extent, by examining source schemas.

  • WhereScape
  • Quippu (open source)
  • BI Ready
  • AnalytiX DS
  • Rapid Ace (Dan’s original toolset)

Disclaimer: YMMV.

slide-45
SLIDE 45

Bibliography

[DV2] Building a Scalable Data Warehouse with Data Vault 2.0 ~Dan Linstedt, Michael Olschimke. 2015 [DV1] Super Charge Your Data Warehouse ~Dan Linstedt. 2011 [HH] Modeling the Agile Data Warehouse with Data Vault ~Hans Hultgren. 2012 [PENT] Pentaho Kettle Solutions ~Kasper de Graaf (p465-495). 2010 [LinkedIn] LinkedIn Data Vault Group discussions