Drupal as a Data Warehouse Everybody Into the Data Lake! Gail - - PowerPoint PPT Presentation

drupal as a data warehouse
SMART_READER_LITE
LIVE PREVIEW

Drupal as a Data Warehouse Everybody Into the Data Lake! Gail - - PowerPoint PPT Presentation

Drupal as a Data Warehouse Everybody Into the Data Lake! Gail Radecki, CHCP, American Academy of Allergy Asthma & Immunology Ezra Wolfe, EthosCE Devin Zuczek, EthosCE Above all else, show the data Edward Tufte My name is Gail


slide-1
SLIDE 1

Everybody Into the Data Lake!

Gail Radecki, CHCP, American Academy of Allergy Asthma & Immunology Ezra Wolfe, EthosCE Devin Zuczek, EthosCE

Drupal as a Data Warehouse

slide-2
SLIDE 2

Above all else, show the data

Edward Tufte

“ “

slide-3
SLIDE 3

My name is Gail…

  • Pushy
  • A know-just-enough
  • Annoying

...and I’m one of “those” customers.

slide-4
SLIDE 4

The World of Continuing Medical Education

  • Accreditation/Compliance
  • Needs Assessment
  • Grants Reconciliation
slide-5
SLIDE 5

Data reporting...stayed the same

slide-6
SLIDE 6

This is where Community comes in handy

  • EthosCE User Group
  • Support tickets
  • Online Community
slide-7
SLIDE 7

The Importance of Outcomes

  • Medical Specialty Society
  • Needs assessment
  • Accreditation
slide-8
SLIDE 8

We need a solution!

slide-9
SLIDE 9

My name is Ezra. I am the product manager. This is a me with a bad haircut.

slide-10
SLIDE 10

The Product:

  • 65+ hospitals, health systems, associations
  • Almost 1 million learner accounts
  • 7.9 million course enrollments
slide-11
SLIDE 11

All happy families customers are alike; each unhappy family customer is unhappy about their reports.

Leo Tolstoy

“ “

slide-12
SLIDE 12

Existing process

  • 1. Customer requests a new report
  • 2. Requirements
  • 3. Development feasibility
  • 4. Back and forth with customer
  • 5. Make the view, put it in code
  • 6. Change management, code review, documentation, tests
  • 7. Custom report is deployed! We are done!
  • 8. Repeat steps #1-7 again because "that" customer forgot something
  • 9. Repeat steps #1-7 again later because of a product schema change
slide-13
SLIDE 13

My name is Devin and I am a systems architect. Throwback to PHP 4? This is a duck.

slide-14
SLIDE 14

Big data!!!

Phase 1 Big data Phase 2

?

Phase 3 Profit!

slide-15
SLIDE 15

Making a plan

  • Previous solutions involved a suite of modules:

Views, Homebox, Charts + HighCharts, Views Data Export.

  • Building our own tool would have been a distraction

from our core business — we are not data scientists.

  • We needed a tool that we could give to customers

to report on their own data instead of us doing it.

slide-16
SLIDE 16

Vendor selection

  • Vendors don’t know about...

○ Drupal ○ php serialization ○ webform

  • It’s your job to do your due diligence and

ensure you select the correct system

slide-17
SLIDE 17

BYODW?

(Bring your own data warehouse?)

slide-18
SLIDE 18

Did I mention where I work??

  • Non-profit
  • Tech maintenance vs. adding features
  • We have to justify EVERYTHING
slide-19
SLIDE 19
slide-20
SLIDE 20

First steps

  • Could we point a tool at Drupal and have it report
  • ut of the box?
  • Do we need a data warehouse?

To answer the question of why reporting on Drupal data in its native form is not optimal, we have to look at how the data is stored.

slide-21
SLIDE 21

DBA 101: Tables

  • U. Corp. changed their name. Making this change requires...
people user_id name employer_name 1 Barry Cuda
  • U. Corp.
2 Abby Normal Gekko & Co. 3 Rita Book
  • U. Corp.
4 Ray O’Sun
  • U. Corp.
slide-22
SLIDE 22

DBA 101: Tables

  • U. Corp. changed their name. Making this change requires...

3 row updates. Not good at scale, as this will lock those rows for editing!

people user_id name employer_name 1 Barry Cuda Initech 2 Abby Normal Gekko & Co 3 Rita Book Initech 4 Ray O’Sun Initech
slide-23
SLIDE 23

Normalizing

Eliminate columns with duplicate data by creating separate tables, and identify that data with a key. Move data that is not relevant to the primary key.

People uid name employer_name 1 Barry Cuda Initech 2 Abby Normal Gekko & Co 3 Rita Book Initech People uid name eid 1 Barry Cuda X 2 Abby Normal X 3 Rita Book X Employer data eid name 5 Initech 6 Gekko & Co. Employer relation uid eid 1 5 2 6 3 5
slide-24
SLIDE 24

DBA 201: Normalizing

How do we efficiently store a user, full name, location, and employer?

User user_id user_name 1 john 2 jane Profile profile_id user_id 2 1 3 2 Location loc_id province country 3 PA US 4 NJ US Profile location profile_id loc_id 2 3 3 4 Profile name profile_id fullname 2 John Smith 3 Jane Doe Profile employer profile_id emp_id 2 6 3 7 Employer employer_id name 6 A, LLC. 7 D Corp No data is unnecessarily dependent. An update only requires 1 write.

...that includes the database!

slide-25
SLIDE 25

The problem

Operational database

  • Fast writes
  • Holds data
  • Relational model not

informative

  • Built for integrity

Reporting database

  • Fast, easy reads
  • Holds information
  • Relational model more

informative to users

— E.F. Codd, "Further Normalization of the Data Base Relational Model
slide-26
SLIDE 26

Another module

slide-27
SLIDE 27

Denormalizer module

This module pulls from Drupal defined schema to programmatically build denormalized tables or views for use in data warehousing flows. Example: A node of type “chinchilla” with a rate field. We could also request the raw vote table. 1. Denormalized table name 2. Entity/table to denormalize 3. The primary key to use 4. The changed key to use

slide-28
SLIDE 28

SELECT * FROM...

developlocal.drupal_node dn_course LEFT JOIN developlocal.drupal_field_data_field_course_rating_access field_course_rating_access ON field_course_rating_access.entity_type = 'node' AND field_course_rating_access.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_field_course_catalog field_course_catalog ON field_course_catalog.entity_type = 'node' AND field_course_catalog.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_field_course_date field_course_date ON field_course_date.entity_type = 'node' AND field_course_date.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_field_course_disclosure field_course_disclosure ON field_course_disclosure.entity_type = 'node' AND field_course_disclosure.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_field_course_event_date field_course_event_date ON field_course_event_date.entity_type = 'node' AND field_course_event_date.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_field_course_external_url field_course_external_url ON field_course_external_url.entity_type = 'node' AND field_course_external_url.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_field_course_format field_course_format ON field_course_format.entity_type = 'node' AND field_course_format.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_taxonomy_term_data field_course_format_tax ON field_course_format_tid = field_course_format_tax.tid LEFT JOIN developlocal.drupal_field_data_field_course_image field_course_image ON field_course_image.entity_type = 'node' AND field_course_image.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_field_course_live field_course_live ON field_course_live.entity_type = 'node' AND field_course_live.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_field_course_location field_course_location ON field_course_location.entity_type = 'node' AND field_course_location.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_field_course_password field_course_password ON field_course_password.entity_type = 'node' AND field_course_password.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_field_course_rating field_course_rating ON field_course_rating.entity_type = 'node' AND field_course_rating.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_field_course_transcript field_course_transcript ON field_course_transcript.entity_type = 'node' AND field_course_transcript.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_field_enrollment_requirement_min field_enrollment_requirement_min ON field_enrollment_requirement_min.entity_type = 'node' AND field_enrollment_requirement_min.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_field_requirements_max field_requirements_max ON field_requirements_max.entity_type = 'node' AND field_requirements_max.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_field_requirements_min field_requirements_min ON field_requirements_min.entity_type = 'node' AND field_requirements_min.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_field_show_child_catalog field_show_child_catalog ON field_show_child_catalog.entity_type = 'node' AND field_show_child_catalog.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_field_show_child_transcript field_show_child_transcript ON field_show_child_transcript.entity_type = 'node' AND field_show_child_transcript.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_og_group_ref og_group_ref ON
  • g_group_ref.entity_type = 'node' AND og_group_ref.entity_id = dn_course.nid
LEFT JOIN developlocal.drupal_field_data_field_accme_data field_accme_data ON field_accme_data.entity_type = 'node' AND field_accme_data.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_field_activity_format field_activity_format ON field_activity_format.entity_type = 'node' AND field_activity_format.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_field_course_level field_course_level ON field_course_level.entity_type = 'node' AND field_course_level.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_field_course_topic field_course_topic ON field_course_topic.entity_type = 'node' AND field_course_topic.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_field_custom_target_audience field_custom_target_audience ON field_custom_target_audience.entity_type = 'node' AND field_custom_target_audience.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_field_presenter field_presenter ON field_presenter.entity_type = 'node' AND field_presenter.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_field_faculty_list field_faculty_list ON field_faculty_list.entity_type = 'node' AND field_faculty_list.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_field_hotel_link field_hotel_link ON field_hotel_link.entity_type = 'node' AND field_hotel_link.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_field_hotel_photo field_hotel_photo ON field_hotel_photo.entity_type = 'node' AND field_hotel_photo.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_field_venue_phone field_venue_phone ON field_venue_phone.entity_type = 'node' AND field_venue_phone.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_field_course_category field_course_category ON field_course_category.entity_type = 'node' AND field_course_category.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_taxonomy_term_data field_course_category_tax ON field_course_category_tid = field_course_category_tax.tid LEFT JOIN developlocal.drupal_field_data_field_related_courses_view field_related_courses_view ON field_related_courses_view.entity_type = 'node' AND field_related_courses_view.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_field_show_on_calendar field_show_on_calendar ON field_show_on_calendar.entity_type = 'node' AND field_show_on_calendar.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_course_node cn ON cn.nid = dn_course.nid WHERE (dn_course.type IN ('course', 'group_event_series_event', 'course_imported')) GROUP BY `dn_course`.nid;

Querying field data

slide-29
SLIDE 29 dw_mysiteprod.dw_user user_id fullname province country employer changed 1 Barry Cuda PA US A, LLC 2018-01-01 2 Abbey Normal NJ US B, Inc. 2018-05-25 Denormalizer will build a denormalized query and dump that data into a separate database. On subsequent runs, it will only insert or update new and changed records based
  • n the ID and
"changed key".
slide-30
SLIDE 30 d_chinchilla d_user d_external_table drupal_user drupal_node drupal_profile drupal_field_data drupal_field_data drupal_field_data secret_table external_table Prod DB Data DB

Only the data you want is exposed in the database containing the denormalized data. Third party tools can be configured to connect as a user that only has read access to the data DB. A read-only replica can be used for increased performance.

slide-31
SLIDE 31 d_user d_chinchilla external_table sendgrid_opens g_ad Data...lake? (denormalizer)

????

“The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”

  • James Dixon, Pentaho
slide-32
SLIDE 32

Tricks

uid colors 1 [blue, green] 2 [blue, green, red] uid color 1 blue 1 green 2 blue 2 green 2 red We can do some things that should not be done in an

  • perational database, like

combining multiple values into an array column, to be "unnested" later in a reporting database. item_id p_id 1 [6,7,8] 6 [7,8]

slide-33
SLIDE 33

What’s the Difference?

Data lake

  • You don't know what you're

using this data for.

  • Data can be structured or

unstructured.

  • Retain all data. Avoids creating

data silos.

Data warehouse

  • You know what you're using

the data for.

  • Data typically structured or

processed.

  • Data is modeled to make

logical sense to the user, or to create “information”.

slide-34
SLIDE 34

BLT ETL, ELT?

E = Extract T = Transform L = Load B = Bacon

slide-35
SLIDE 35

Singer

The open-source standard for writing scripts that move data. https://www.singer.io

slide-36
SLIDE 36

Singer

Taps extract data from any source and write it to a standard stream in a JSON-based format. Targets consume data from taps and do something with it, like load it into a file, API or database.

slide-37
SLIDE 37

Singer cycle

Tap Target State

  • 1. The tap extracts information from the source
  • 2. The target

takes in data from the tap and transforms it to what the target needs. It outputs the state that the tap also sent

  • ver.
  • 3. State is

then passed back to the tap, and the process continues at the last replication point. # tap-mysql -c catalog.json -s state.json | target-postgres > state.json

The catalog defines how to replicate the data. We use Drupal to generate these "streams".

# drush singer-json > catalog.json

slide-38
SLIDE 38

Singer

d_user d_chinchilla external_table sendgrid_opens g_ad Data lake

Athena BigQuery PostgreSQL Redshift

??? Singer

(denormalized)
slide-39
SLIDE 39

Our solution

  • Drupal and Singer being used to populate the

data lake

  • Production database on AWS Aurora provides a

read-only replica of the lake

  • ~15 minutes, Singer extracts and loads new

data from lake of Drupal + other data in MySQL to PostgreSQL

  • Looker transforms and reports on data
  • Customer modeling generated from Drupal
  • 100M+ rows being processed

Drupal Singer PostgreSQL

slide-40
SLIDE 40

Rollout & Results

  • Customer interviews, mockups, beta

tests for feature design

  • Pre-built embedded dashboards in all

customer sites

  • Self-service reporting as an add-on
  • No price increase
  • Data warehouse became a

competitive advantage

slide-41
SLIDE 41

Data Can Be Beautiful

slide-42
SLIDE 42

Some Section header Second Line Support Desk Heros

Custom reports are now built by non-developers. Often for free! Sometimes in minutes!

slide-43
SLIDE 43

Have we arrived?

  • Customers appreciate all the work
  • Still some things to work out on both ends
slide-44
SLIDE 44

Join us! Join us!

slide-45
SLIDE 45

Resources

  • Singer ETL: https://www.singer.io/
  • Denormalizer: https://www.drupal.org/project/denormalizer
  • Drupal Singer: https://www.drupal.org/sandbox/devin/3006817
  • MySQL tap: https://github.com/singer-io/tap-mysql
  • PostgreSQL tap:

https://github.com/singer-io/tap-postgres/tree/master/tap_postgres

  • PostgreSQL target: https://github.com/statsbotco/target-postgres

Some project are reaching maturity. Stay tuned! https://looker.com/blog/introducing-singer

slide-46
SLIDE 46

Join us for contribution opportunities

Friday, April 12, 2019

9:00-18:00 Room: 602

Mentored Contribution First Time Contributor Workshop General Contribution

#DrupalContributions

9:00-12:00 Room: 606 9:00-18:00 Room: 6A
slide-47
SLIDE 47

What did you think?

Locate this session at the DrupalCon Seattle website: http://seattle2019.drupal.org/schedule Take the Survey! https://www.surveymonkey.com/r/DrupalConSeattle