Everybody Into the Data Lake!
Gail Radecki, CHCP, American Academy of Allergy Asthma & Immunology Ezra Wolfe, EthosCE Devin Zuczek, EthosCE
Drupal as a Data Warehouse
Drupal as a Data Warehouse Everybody Into the Data Lake! Gail - - PowerPoint PPT Presentation
Drupal as a Data Warehouse Everybody Into the Data Lake! Gail Radecki, CHCP, American Academy of Allergy Asthma & Immunology Ezra Wolfe, EthosCE Devin Zuczek, EthosCE Above all else, show the data Edward Tufte My name is Gail
Everybody Into the Data Lake!
Gail Radecki, CHCP, American Academy of Allergy Asthma & Immunology Ezra Wolfe, EthosCE Devin Zuczek, EthosCE
Drupal as a Data Warehouse
Above all else, show the data
Edward Tufte
My name is Gail…
...and I’m one of “those” customers.
The World of Continuing Medical Education
Data reporting...stayed the same
This is where Community comes in handy
The Importance of Outcomes
We need a solution!
My name is Ezra. I am the product manager. This is a me with a bad haircut.
The Product:
All happy families customers are alike; each unhappy family customer is unhappy about their reports.
Leo Tolstoy
Existing process
My name is Devin and I am a systems architect. Throwback to PHP 4? This is a duck.
Big data!!!
Phase 1 Big data Phase 2
?
Phase 3 Profit!
Making a plan
Views, Homebox, Charts + HighCharts, Views Data Export.
from our core business — we are not data scientists.
to report on their own data instead of us doing it.
Vendor selection
○ Drupal ○ php serialization ○ webform
ensure you select the correct system
(Bring your own data warehouse?)
Did I mention where I work??
First steps
To answer the question of why reporting on Drupal data in its native form is not optimal, we have to look at how the data is stored.
DBA 101: Tables
DBA 101: Tables
3 row updates. Not good at scale, as this will lock those rows for editing!
people user_id name employer_name 1 Barry Cuda Initech 2 Abby Normal Gekko & Co 3 Rita Book Initech 4 Ray O’Sun InitechNormalizing
Eliminate columns with duplicate data by creating separate tables, and identify that data with a key. Move data that is not relevant to the primary key.
People uid name employer_name 1 Barry Cuda Initech 2 Abby Normal Gekko & Co 3 Rita Book Initech People uid name eid 1 Barry Cuda X 2 Abby Normal X 3 Rita Book X Employer data eid name 5 Initech 6 Gekko & Co. Employer relation uid eid 1 5 2 6 3 5DBA 201: Normalizing
How do we efficiently store a user, full name, location, and employer?
User user_id user_name 1 john 2 jane Profile profile_id user_id 2 1 3 2 Location loc_id province country 3 PA US 4 NJ US Profile location profile_id loc_id 2 3 3 4 Profile name profile_id fullname 2 John Smith 3 Jane Doe Profile employer profile_id emp_id 2 6 3 7 Employer employer_id name 6 A, LLC. 7 D Corp No data is unnecessarily dependent. An update only requires 1 write....that includes the database!
The problem
Operational database
informative
Reporting database
informative to users
— E.F. Codd, "Further Normalization of the Data Base Relational ModelAnother module
Denormalizer module
This module pulls from Drupal defined schema to programmatically build denormalized tables or views for use in data warehousing flows. Example: A node of type “chinchilla” with a rate field. We could also request the raw vote table. 1. Denormalized table name 2. Entity/table to denormalize 3. The primary key to use 4. The changed key to use
SELECT * FROM...
developlocal.drupal_node dn_course LEFT JOIN developlocal.drupal_field_data_field_course_rating_access field_course_rating_access ON field_course_rating_access.entity_type = 'node' AND field_course_rating_access.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_field_course_catalog field_course_catalog ON field_course_catalog.entity_type = 'node' AND field_course_catalog.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_field_course_date field_course_date ON field_course_date.entity_type = 'node' AND field_course_date.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_field_course_disclosure field_course_disclosure ON field_course_disclosure.entity_type = 'node' AND field_course_disclosure.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_field_course_event_date field_course_event_date ON field_course_event_date.entity_type = 'node' AND field_course_event_date.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_field_course_external_url field_course_external_url ON field_course_external_url.entity_type = 'node' AND field_course_external_url.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_field_course_format field_course_format ON field_course_format.entity_type = 'node' AND field_course_format.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_taxonomy_term_data field_course_format_tax ON field_course_format_tid = field_course_format_tax.tid LEFT JOIN developlocal.drupal_field_data_field_course_image field_course_image ON field_course_image.entity_type = 'node' AND field_course_image.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_field_course_live field_course_live ON field_course_live.entity_type = 'node' AND field_course_live.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_field_course_location field_course_location ON field_course_location.entity_type = 'node' AND field_course_location.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_field_course_password field_course_password ON field_course_password.entity_type = 'node' AND field_course_password.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_field_course_rating field_course_rating ON field_course_rating.entity_type = 'node' AND field_course_rating.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_field_course_transcript field_course_transcript ON field_course_transcript.entity_type = 'node' AND field_course_transcript.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_field_enrollment_requirement_min field_enrollment_requirement_min ON field_enrollment_requirement_min.entity_type = 'node' AND field_enrollment_requirement_min.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_field_requirements_max field_requirements_max ON field_requirements_max.entity_type = 'node' AND field_requirements_max.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_field_requirements_min field_requirements_min ON field_requirements_min.entity_type = 'node' AND field_requirements_min.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_field_show_child_catalog field_show_child_catalog ON field_show_child_catalog.entity_type = 'node' AND field_show_child_catalog.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_field_show_child_transcript field_show_child_transcript ON field_show_child_transcript.entity_type = 'node' AND field_show_child_transcript.entity_id = dn_course.nid LEFT JOIN developlocal.drupal_field_data_og_group_ref og_group_ref ONQuerying field data
Only the data you want is exposed in the database containing the denormalized data. Third party tools can be configured to connect as a user that only has read access to the data DB. A read-only replica can be used for increased performance.
????
“The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
Tricks
uid colors 1 [blue, green] 2 [blue, green, red] uid color 1 blue 1 green 2 blue 2 green 2 red We can do some things that should not be done in an
combining multiple values into an array column, to be "unnested" later in a reporting database. item_id p_id 1 [6,7,8] 6 [7,8]
What’s the Difference?
Data lake
using this data for.
unstructured.
data silos.
Data warehouse
the data for.
processed.
logical sense to the user, or to create “information”.
BLT ETL, ELT?
E = Extract T = Transform L = Load B = Bacon
Singer
The open-source standard for writing scripts that move data. https://www.singer.io
Singer
Taps extract data from any source and write it to a standard stream in a JSON-based format. Targets consume data from taps and do something with it, like load it into a file, API or database.
Singer cycle
Tap Target State
takes in data from the tap and transforms it to what the target needs. It outputs the state that the tap also sent
then passed back to the tap, and the process continues at the last replication point. # tap-mysql -c catalog.json -s state.json | target-postgres > state.json
The catalog defines how to replicate the data. We use Drupal to generate these "streams".
# drush singer-json > catalog.json
Singer
d_user d_chinchilla external_table sendgrid_opens g_ad Data lakeAthena BigQuery PostgreSQL Redshift
??? Singer
(denormalized)Our solution
data lake
read-only replica of the lake
data from lake of Drupal + other data in MySQL to PostgreSQL
Drupal Singer PostgreSQL
Rollout & Results
tests for feature design
customer sites
competitive advantage
Data Can Be Beautiful
Some Section header Second Line Support Desk Heros
Custom reports are now built by non-developers. Often for free! Sometimes in minutes!
Have we arrived?
Join us! Join us!
Resources
https://github.com/singer-io/tap-postgres/tree/master/tap_postgres
Some project are reaching maturity. Stay tuned! https://looker.com/blog/introducing-singer
Join us for contribution opportunities
Friday, April 12, 2019
9:00-18:00 Room: 602Mentored Contribution First Time Contributor Workshop General Contribution
#DrupalContributions
9:00-12:00 Room: 606 9:00-18:00 Room: 6AWhat did you think?
Locate this session at the DrupalCon Seattle website: http://seattle2019.drupal.org/schedule Take the Survey! https://www.surveymonkey.com/r/DrupalConSeattle