Data Engineering Hierarchy of Needs - Angel D’az - PowerPoint PPT Presentation


slide-1
SLIDE 1

Data Engineering Hierarchy of Needs

Angel D’az

slide-2
SLIDE 2

Self-Intro

Data Engineering Consultant

Tools

  • Python, AWS, Airflow, Ansible

Business Problems

  • Batch Processing Workflows (ELT / ETL)
  • Ground-Up Data Infrastructures

slide-3
SLIDE 3
slide-4
SLIDE 4
slide-5
SLIDE 5

Maslow at the Blackfoot Reservation in 1938

slide-6
SLIDE 6
slide-7
SLIDE 7
slide-8
SLIDE 8
slide-9
SLIDE 9
slide-10
SLIDE 10
slide-11
SLIDE 11

But why?

Another mental model for data

slide-12
SLIDE 12

Focus is on fundamentals

Reasoning > Principles > Tools

slide-13
SLIDE 13

01. Automation

slide-14
SLIDE 14

Without Automation

slide-15
SLIDE 15

Why Automation first?

slide-16
SLIDE 16

What is a good baseline for Automation?

slide-17
SLIDE 17

What is a good baseline for Automation?

Scripts

  • Source control and schedule the scripts below

○ Script existing manual and predictable data wrangling
○ Move legacy click-and-drag workflows over to scripts
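As a minimal sketch of scripting a manual, predictable wrangling step, the function below replaces a hand-edited spreadsheet cleanup with something repeatable. The column names (`customer`, `amount`), the function name `clean_rows`, and the script path in the cron comment are hypothetical and not taken from the slides.

```python
import csv
import io


def clean_rows(raw_csv: str) -> list[dict]:
    """Trim whitespace, drop rows with no customer, and parse amounts as floats."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        customer = (row.get("customer") or "").strip()
        if not customer:
            continue  # the kind of row a person would have deleted by hand
        # Remove thousands separators and padding before converting to a number.
        amount_text = (row.get("amount") or "").replace(",", "").strip()
        amount = float(amount_text) if amount_text else 0.0
        cleaned.append({"customer": customer.title(), "amount": amount})
    return cleaned


# One common way to schedule a script like this on a Unix host is cron, e.g.
# (hypothetical path):  0 6 * * *  python3 /path/to/clean_export.py
```

Once the logic lives in a file like this, it can be committed to source control and scheduled, which covers both halves of the bullet.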

slide-18
SLIDE 18

What does robust Automation look like?

More layers of complexity

slide-19
SLIDE 19

What does robust Automation look like?

More layers of complexity

  • Infrastructure as Code (IaC)
slide-20
SLIDE 20

What does robust Automation look like?

More layers of complexity

  • Infrastructure as Code (IaC)
  • Data Workflow Orchestration
slide-21
SLIDE 21

Why Airflow? It’s Extensible

Engineering Talent

  • Leverages Python language as the analytics standard

Technical

  • Connections to any data source
  • Lightweight backend works on any Linux/Unix Server
  • Code as Abstraction Layer
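To make "code as abstraction layer" concrete, here is a small sketch of what a workflow orchestrator does at its core: run tasks in dependency order. It deliberately uses only Python's standard library (`graphlib`) and is not Airflow's API; the task names and the `run_pipeline` helper are hypothetical.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, dependencies):
    """Run zero-argument callables in dependency order and return that order.

    tasks:        mapping of task name -> callable
    dependencies: mapping of task name -> set of task names it must wait for
    """
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        tasks[name]()
    return order


# Hypothetical three-step workflow: each task just records that it ran.
log = []
tasks = {
    "extract": lambda: log.append("extract"),
    "load": lambda: log.append("load"),
    "transform": lambda: log.append("transform"),
}
dependencies = {"extract": set(), "load": {"extract"}, "transform": {"load"}}
```

Calling `run_pipeline(tasks, dependencies)` runs extract, then load, then transform. A real orchestrator adds scheduling, retries, and monitoring on top of this idea.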
slide-22
SLIDE 22

02. Extract

slide-23
SLIDE 23

Extract (v.)

slide-24
SLIDE 24

Extract (v.)

Without extraction, there are no ingredients for our analysts to work with. Without ingredients, any optimization is premature.

slide-25
SLIDE 25

Extract (v.)

Either use a no-code data integration SaaS solution, or fully automate your data source connections in code.
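For the "in code" option, a minimal sketch might look like the following. It assumes the source is a SQL database reachable through a DB-API connection (SQLite stands in here) and writes each row as one JSON object per line; the function name `extract_table` and the output format are illustrative choices, not something the slides specify.

```python
import json
import sqlite3


def extract_table(conn, table, out_path):
    """Dump every row of `table` to a JSON Lines file and return the row count."""
    # Table names cannot be bound as query parameters, so validate before formatting.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [col[0] for col in cursor.description]
    count = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for row in cursor:
            f.write(json.dumps(dict(zip(columns, row))) + "\n")
            count += 1
    return count
```

The data is written out raw, with no cleanup, which keeps extraction separate from later transformation.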

slide-26
SLIDE 26

03. Load

slide-27
SLIDE 27

Load

Cheaper storage killed ETL. And ELT took its place.

slide-28
SLIDE 28

Load

Data Lakes

  • Raw data will be in a rough state.
  • Cloud Storage allows Analysts to query

○ Queries may be complex

slide-29
SLIDE 29

Load

Data Lakes

  • Raw data will be in a rough state.
  • Cloud Storage allows Analysts to query

○ Queries may be complex

  • Daily Snapshots (more info)
slide-30
SLIDE 30

Load

Data Lakes

  • Raw data will be in a rough state.
  • Cloud Storage allows Analysts to query

○ Queries may be complex

  • Daily Snapshots (more info)
  • Optimize with Parquet files
slide-31
SLIDE 31

04. Transform

slide-32
SLIDE 32

Transform

slide-33
SLIDE 33

Transform

Data Work that can be kept in SQL only. Why?

slide-34
SLIDE 34

Why SQL only?

1. Maintainable Workflows

slide-35
SLIDE 35

Why SQL only?

1. Maintainable Workflows
2. More Complexity
   a. Remove Data Silos

slide-36
SLIDE 36

Why SQL only?

1. Maintainable Workflows
2. More Complexity
   a. Remove Data Silos
   b. Parameterize your SQL

slide-37
SLIDE 37

Parameterize your SQL

SELECT {{ cols }} FROM tbl {{ where }}
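The template above uses Jinja-style `{{ }}` placeholders. As a rough, standard-library-only sketch of how such a template could be filled in, the code below uses a tiny regex renderer as a stand-in for a real templating engine. The `render` and `build_query` helpers, the `ALLOWED_COLS` allow-list, and the `region` filter are hypothetical. Column names are checked against the allow-list and the filter value is passed as a bound parameter, since pasting untrusted text into SQL is unsafe.

```python
import re

TEMPLATE = "SELECT {{ cols }} FROM tbl {{ where }}"

# Hypothetical allow-list: only these column names may be templated in.
ALLOWED_COLS = {"id", "amount", "region"}


def render(template, **params):
    """Replace each {{ name }} placeholder with params[name]."""
    pattern = r"\{\{\s*(\w+)\s*\}\}"
    return re.sub(pattern, lambda m: params[m.group(1)], template).strip()


def build_query(cols, region=None):
    """Return (sql, args); the region value stays a bound parameter."""
    unknown = set(cols) - ALLOWED_COLS
    if not cols or unknown:
        raise ValueError(f"invalid columns: {sorted(unknown) or cols}")
    where = "WHERE region = ?" if region is not None else ""
    sql = render(TEMPLATE, cols=", ".join(cols), where=where)
    args = (region,) if region is not None else ()
    return sql, args
```

For example, `build_query(["id", "amount"], region="west")` yields the SQL text plus a one-element argument tuple, ready to hand to a database driver's `execute`.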

slide-38
SLIDE 38

Why SQL only?

1. Maintainable Workflows
2. More Complexity
   a. Remove Data Silos
   b. Parameterize your SQL
   c. Data Quality Testing
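Data quality testing can also stay in SQL. In the sketch below, each check is a query that counts offending rows, and a count of zero means the check passes. The `orders` table, its columns, and the three checks are hypothetical examples; SQLite is used only so the sketch runs without extra dependencies.

```python
import sqlite3

# Each check counts offending rows; zero means the check passes.
CHECKS = {
    "order_id is never null":
        "SELECT COUNT(*) FROM orders WHERE order_id IS NULL",
    "order_id is unique":
        "SELECT COUNT(*) FROM ("
        "SELECT order_id FROM orders GROUP BY order_id HAVING COUNT(*) > 1)",
    "amount is not negative":
        "SELECT COUNT(*) FROM orders WHERE amount < 0",
}


def run_checks(conn):
    """Return {check name: offending count} for the checks that fail."""
    failures = {}
    for name, sql in CHECKS.items():
        bad = conn.execute(sql).fetchone()[0]
        if bad:
            failures[name] = bad
    return failures
```

An empty result from `run_checks` means the table passed every check; a scheduled workflow could stop before publishing data when the result is not empty.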

slide-39
SLIDE 39

05. Optimize Analysis

slide-40
SLIDE 40

Optimize Analysis

slide-41
SLIDE 41

Optimize Analysis

Time Sensitive Reporting

  • Spark
slide-42
SLIDE 42

Optimize Analysis

Time Sensitive Reporting

  • Spark

Custom Data Transformations

  • Jupyter Notebooks
slide-43
SLIDE 43

Optimize Analysis

Time Sensitive Reporting

  • Spark

Custom Data Transformations

  • Jupyter Notebooks

Large Scale Processes

  • Reduce Computational Cost with Systems Engineering
slide-44
SLIDE 44

06. Machine Learning

slide-45
SLIDE 45

Machine Learning

slide-46
SLIDE 46

Machine Learning

slide-47
SLIDE 47

07. Streaming

slide-48
SLIDE 48

Streaming

Streaming for Data Analysis, alone, is rare.

slide-49
SLIDE 49

Conclusion

Big “Why?”s

slide-50
SLIDE 50

Transparency And Reproducibility

slide-51
SLIDE 51

Enabling Ethics

slide-52
SLIDE 52

Thank you!

Say hi! Ask questions!

Writing: angelddaz.substack.com
Contact: angel@ocelotdata.com