guest lecture
play

Guest Lecture Daniel Dao & Nick Buroojy OVERVIEW What is - PowerPoint PPT Presentation

Guest Lecture Daniel Dao & Nick Buroojy OVERVIEW What is Civitas Learning What We Do Mission Statement Demo What I Do How I Use Databases Nick Buroojy CIVITAS LEARNING, INC. CONFIDENTIAL INFORMATION WHAT IS CIVITAS LEARNING


  1. Guest Lecture Daniel Dao & Nick Buroojy

  2. OVERVIEW What is Civitas Learning What We Do Mission Statement Demo What I Do How I Use Databases Nick Buroojy CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

  3. WHAT IS CIVITAS LEARNING Civitas Learning Mid-sized Data driven Education startup company CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

  4. “ We partner with forward-thinking colleges and universities, harnessing the power of insight and action analytics to help a million more students learn well and finish strong. ” – The Million More Mission CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

  5. WHAT WE DO • Work with institutions to provide insights through various applications • Inspire CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

  6. Inspire for Faculty Demo CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

  7. WHAT I DO • My role in the company • How my work is broken down • Product • Dev managers, PSMs, engineers • Frontend • Work with HTML/CSS/ReactJS • Backend • Writing APIs • Working with models • Writing SQL • Optimizing performance • Writing tests CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

  8. HOW I USE DATABASES CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

  9. Nick Buroojy • Graduated from Carnegie Mellon • Bachelors in Computer Science • Software Engineering • I've been working in Software for about 6 years • I've been at Civitas for three years • I’ve worked at Apple, Google, Civitas CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

  10. Goals At the end of this lecture, you will be able to: • Describe the process Civitas uses to manipulate data. • Describe the differences between column and row oriented data stores • Explain how Redshift uses distributed compute for query performance • Describe the use of the data layout options DIST_KEY and SORT_KEY CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

  11. Civitas Data Flow Secure File Transfer Protocol Red Shift Transform Extract Load Extract Extract Transform Web Load Load Server Load CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

  12. Extract • As long as the data is in the tables, there are export commands that can simply dump the data to a file. PostGres APP RAW LOAD RAW APP RAW CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

  13. Transform SRC_ RAW SRC_ SRC_ RAW SELECT SPBPERS.SPBPERS_PIDM AS raw_person_id , SPBPERS.SPBPERS_BIRTH_DATE AS raw_birth_dt , SPBPERS.SPBPERS_DEAD_DATE AS raw_death_dt , SPBPERS.SPBPERS_SEX AS raw_gender , null AS raw_primary_language , null AS raw_country_of_origin FROM src_banner_saturn.spbpers CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

  14. Load SFTP Red Shift Table File Table Table Flat file: Plain Text file that is non-hierarchical, usually in the form of CSV, or TSV. Each row represents one row in the database. CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

  15. Data Flow Secure File Transfer Protocol Red Shift Web Server CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

  16. Redshift Performance • Columnar data storage • Distributed data storage • DIST_KEY • SORT_KEY • Parallel query execution • COPY / UNLOAD CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

  17. Columnar data storage Row-oriented data store example: Source : docs.aws.amazon.com CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

  18. Columnar data storage Column-oriented data store example: Source : docs.aws.amazon.com CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

  19. Distributed data storage • Why? • DB constraints • Disk • CPU • Network CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

  20. Partial aggregations • SUM 10+21+15+6 21 15 10 6 1 3 6 3 5 6 7 1 6 8 6 CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

  21. Partial aggregations • COUNT 3+4+3+1 4 3 3 1 1 3 6 3 5 6 7 1 6 8 6 CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

  22. Partial aggregations • AVG = SUM / COUNT SUM / COUNT 1 3 6 3 5 6 7 1 6 8 6 CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

  23. Partial aggregations • Redshift can distribute • AVG • SUM • COUNT • MAX • MIN • STDDEV • … • More challenging (slower) • COUNT DISTINCT • ORDER BY x LIMIT n CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

  24. DIST KEY • Allows Redshift user to specify which records are on the same node • Used to keep balanced • Used for join locality • Can perform a join without “shuffling”. That is, sending data between nodes. CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

  25. SORT KEY • Orders of storage for records • Allows queries to skip ranges • Allows for faster joins (merge vs. hash) • Faster ORDER BY queries CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

  26. PRIMARY KEY • Redshift doesn’t enforce primary keys or foreign keys • Primary key must be non-null and unique • Used by query optimizer • Civitas checks our keys after building each table • COUNT(pk) == COUNT(*) == COUNT(DISTINCT pk) CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

  27. COPY • Loads flat file data from bulk storage (S3) into Redshift • Each node loads some parts of the data • Master doesn’t touch the data, and is not a bottleneck • Unload: opposite direction. Redshift -> S3 CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

  28. Summary • Process Civitas uses to manipulate data. • Columnar data layout • Distributed query aggregations • Data layout options • Careers at Civitas Learning CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

  29. Questions? CIVITAS LEARNING, INC. – CONFIDENTIAL INFORMATION

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend