Guest Lecture Daniel Dao & Nick Buroojy OVERVIEW What is - - PowerPoint PPT Presentation

guest lecture
SMART_READER_LITE
LIVE PREVIEW

Guest Lecture Daniel Dao & Nick Buroojy OVERVIEW What is - - PowerPoint PPT Presentation

Guest Lecture Daniel Dao & Nick Buroojy OVERVIEW What is Civitas Learning What We Do Mission Statement Demo What I Do How I Use Databases Nick Buroojy CIVITAS LEARNING, INC. CONFIDENTIAL INFORMATION WHAT IS CIVITAS LEARNING


slide-1
SLIDE 1

Daniel Dao & Nick Buroojy

Guest Lecture

slide-2
SLIDE 2

CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

OVERVIEW

What is Civitas Learning What We Do Mission Statement Demo What I Do How I Use Databases Nick Buroojy

slide-3
SLIDE 3

CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

WHAT IS CIVITAS LEARNING

Civitas Learning Mid-sized startup Data driven company Education

slide-4
SLIDE 4

CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

“We partner with forward-thinking colleges and universities, harnessing the power of insight and action analytics to help a million more students learn well and finish strong.” – The Million More Mission

slide-5
SLIDE 5

CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

WHAT WE DO

  • Work with institutions to provide insights through

various applications

  • Inspire
slide-6
SLIDE 6

CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

Inspire for Faculty Demo

slide-7
SLIDE 7

CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

WHAT I DO

  • My role in the company
  • How my work is broken down
  • Product
  • Dev managers, PSMs, engineers
  • Frontend
  • Work with HTML/CSS/ReactJS
  • Backend
  • Writing APIs
  • Working with models
  • Writing SQL
  • Optimizing performance
  • Writing tests
slide-8
SLIDE 8

CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

HOW I USE DATABASES

slide-9
SLIDE 9

CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

Nick Buroojy

  • Graduated from Carnegie Mellon
  • Bachelors in Computer Science
  • Software Engineering
  • I've been working in Software for about 6 years
  • I've been at Civitas for three years
  • I’ve worked at Apple, Google, Civitas
slide-10
SLIDE 10

CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

Goals

At the end of this lecture, you will be able to:

  • Describe the process Civitas uses to manipulate

data.

  • Describe the differences between column and row
  • riented data stores
  • Explain how Redshift uses distributed compute for

query performance

  • Describe the use of the data layout options

DIST_KEY and SORT_KEY

slide-11
SLIDE 11

CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

Civitas Data Flow

Load Extract Red Shift Web Server Secure File Transfer Protocol Transform Extract Load Extract Transform Load Load

slide-12
SLIDE 12

CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

Extract

  • As long as the data is in the tables, there are export

commands that can simply dump the data to a file.

APP APP LOAD RAW RAW RAW PostGres

slide-13
SLIDE 13

CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

Transform

SRC_ SRC_ SRC_ RAW RAW SELECT SPBPERS.SPBPERS_PIDM AS raw_person_id , SPBPERS.SPBPERS_BIRTH_DATE AS raw_birth_dt , SPBPERS.SPBPERS_DEAD_DATE AS raw_death_dt , SPBPERS.SPBPERS_SEX AS raw_gender , null AS raw_primary_language , null AS raw_country_of_origin FROM src_banner_saturn.spbpers

slide-14
SLIDE 14

CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

Load

Red Shift Table File SFTP

Flat file: Plain Text file that is non-hierarchical, usually in the form of CSV, or TSV. Each row represents one row in the database.

Table Table

slide-15
SLIDE 15

CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

Data Flow

Red Shift Web Server Secure File Transfer Protocol

slide-16
SLIDE 16

CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

Redshift Performance

  • Columnar data storage
  • Distributed data storage
  • DIST_KEY
  • SORT_KEY
  • Parallel query execution
  • COPY / UNLOAD
slide-17
SLIDE 17

CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

Columnar data storage

Row-oriented data store example:

Source: docs.aws.amazon.com

slide-18
SLIDE 18

CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

Columnar data storage

Column-oriented data store example:

Source: docs.aws.amazon.com

slide-19
SLIDE 19

CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

Distributed data storage

  • Why?
  • DB constraints
  • Disk
  • CPU
  • Network
slide-20
SLIDE 20

CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

Partial aggregations

  • SUM

1 3 6 6 1 6 8 3 5 6 7 10+21+15+6 6 15 21 10

slide-21
SLIDE 21

CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

Partial aggregations

  • COUNT

1 3 6 6 1 6 8 3 5 6 7 3+4+3+1 1 3 4 3

slide-22
SLIDE 22

CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

Partial aggregations

  • AVG = SUM / COUNT

1 3 6 6 1 6 8 3 5 6 7 SUM / COUNT

slide-23
SLIDE 23

CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

Partial aggregations

  • Redshift can distribute
  • AVG
  • SUM
  • COUNT
  • MAX
  • MIN
  • STDDEV
  • More challenging (slower)
  • COUNT DISTINCT
  • ORDER BY x LIMIT n
slide-24
SLIDE 24

CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

DIST KEY

  • Allows Redshift user to specify which records are
  • n the same node
  • Used to keep balanced
  • Used for join locality
  • Can perform a join without “shuffling”. That is, sending

data between nodes.

slide-25
SLIDE 25

CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

SORT KEY

  • Orders of storage for records
  • Allows queries to skip ranges
  • Allows for faster joins (merge vs. hash)
  • Faster ORDER BY queries
slide-26
SLIDE 26

CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

PRIMARY KEY

  • Redshift doesn’t enforce primary keys or foreign

keys

  • Primary key must be non-null and unique
  • Used by query optimizer
  • Civitas checks our keys after building each table
  • COUNT(pk) == COUNT(*) == COUNT(DISTINCT pk)
slide-27
SLIDE 27

CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

COPY

  • Loads flat file data from bulk storage (S3) into

Redshift

  • Each node loads some parts of the data
  • Master doesn’t touch the data, and is not a

bottleneck

  • Unload: opposite direction. Redshift -> S3
slide-28
SLIDE 28

CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

Summary

  • Process Civitas uses to manipulate data.
  • Columnar data layout
  • Distributed query aggregations
  • Data layout options
  • Careers at Civitas Learning
slide-29
SLIDE 29

CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

Questions?