Databases - INTRODUCTION TO DATA ENGINEERING - PowerPoint PPT Presentation

  1. Databases - INTRODUCTION TO DATA ENGINEERING - Vincent Vankrunkelsven, Data Engineer @ DataCamp

  2. What are databases?
     - Holds data
     - Organizes data
     - Retrieve/search data through a DBMS
     A usually large collection of data organized especially for rapid search and retrieval.

  3. Databases and file storage
     - Databases: very organized; functionality like search, replication, ...
     - File systems: less organized; simple, less added functionality

  4. Structured and unstructured data
     - Structured: database schema; relational databases
     - Semi-structured: JSON, e.g. {"key": "value"}
     - Unstructured: schemaless, more like files; videos, photos
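Semi-structured data like the {"key": "value"} example can be parsed with Python's standard json module; the extra nested fields below are illustrative, not from the slides:

```python
import json

# A semi-structured document: fields can nest and vary per record,
# unlike the fixed columns of a relational schema
raw = '{"key": "value", "tags": ["a", "b"], "nested": {"n": 1}}'
doc = json.loads(raw)

print(doc["key"])          # "value"
print(doc["nested"]["n"])  # 1
```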

  5. SQL and NoSQL
     - SQL: tables, database schema, relational databases
     - NoSQL: non-relational databases, structured or unstructured; key-value stores (e.g. caching), document DBs (e.g. JSON objects)
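As a rough illustration of the two NoSQL store types named above, plain Python dicts can stand in for real stores (such as Redis for key-value caching or MongoDB for documents, neither of which the slide names):

```python
# Key-value store sketch: opaque values looked up by a single key (e.g. a cache)
cache = {}
cache["session:42"] = "token-abc"

# Document store sketch: values are JSON-like documents with nested structure
documents = {
    "customer:1": {
        "first_name": "Vincent",
        "orders": [{"product": "shirt", "price": 10}],
    },
}

print(cache["session:42"])                            # token-abc
print(documents["customer:1"]["orders"][0]["price"])  # 10
```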

  6. SQL: The database schema

      -- Create Customer table
      CREATE TABLE "Customer" (
          "id" SERIAL NOT NULL,
          "first_name" varchar,
          "last_name" varchar,
          PRIMARY KEY ("id")
      );

      -- Create Order table
      CREATE TABLE "Order" (
          "id" SERIAL NOT NULL,
          "customer_id" integer REFERENCES "Customer",
          "product_name" varchar,
          "product_price" integer,
          PRIMARY KEY ("id")
      );

      -- Join both tables on the foreign key
      SELECT * FROM "Customer"
      INNER JOIN "Order"
      ON "customer_id" = "Customer"."id";

      id | first_name | ... | product_price
      1  | Vincent    | ... | 10

  7. SQL: Star schema
     The star schema consists of one or more fact tables referencing any number of dimension tables.
     - Facts: things that happened (e.g. product orders)
     - Dimensions: information on the world (e.g. customer information)
     1 Wikipedia: https://en.wikipedia.org/wiki/Star_schema
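A minimal, runnable sketch of the fact/dimension split using Python's built-in sqlite3 (the dim_/fact_ table names are illustrative, adapted from the Customer/Order example above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension table: information on the world
cur.execute("CREATE TABLE dim_customer (id INTEGER PRIMARY KEY, first_name TEXT)")
# Fact table: things that happened, referencing the dimension
cur.execute("""CREATE TABLE fact_order (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES dim_customer(id),
    product_price INTEGER)""")

cur.execute("INSERT INTO dim_customer VALUES (1, 'Vincent')")
cur.execute("INSERT INTO fact_order VALUES (1, 1, 10)")

# A typical star-schema query: aggregate the facts, described by a dimension
rows = cur.execute("""
    SELECT c.first_name, SUM(o.product_price)
    FROM fact_order o JOIN dim_customer c ON o.customer_id = c.id
    GROUP BY c.first_name""").fetchall()
print(rows)  # [('Vincent', 10)]
```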

  8. Let's practice!

  9. What is parallel computing? Vincent Vankrunkelsven, Data Engineer @ DataCamp

  10. Idea behind parallel computing
      Basis of modern data processing tools: memory and processing power
      Idea:
      - Split the task into subtasks
      - Distribute the subtasks over several computers
      - Work together to finish the task

  11. The tailor shop
      Running a tailor shop; goal: 100 shirts
      - The best tailor finishes a shirt every 20 minutes
      - Other tailors finish a shirt every hour
      - Multiple tailors working together beat the best tailor working alone
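The arithmetic behind the example, assuming a hypothetical team of four slower tailors helping the best one (the slide does not state a team size):

```python
best_rate = 60 / 20   # best tailor: a shirt every 20 minutes -> 3 shirts/hour
helper_rate = 1.0     # other tailors: a shirt every hour
n_helpers = 4         # assumed team size, not stated on the slide
shirts = 100

alone_hours = shirts / best_rate                             # best tailor alone
team_hours = shirts / (best_rate + n_helpers * helper_rate)  # everyone in parallel

print(round(alone_hours, 1), round(team_hours, 1))  # 33.3 14.3
```

Even though each helper is three times slower, the team's combined rate (7 shirts/hour) easily beats the best tailor's 3 shirts/hour.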

  12. Benefits of parallel computing
      - Processing power
      - Memory: partition the dataset
      (image: a RAM memory chip)
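Partitioning can be sketched as splitting a dataset into roughly equal chunks so each worker's share fits in its RAM (a toy helper of my own, not from the slides):

```python
def partition(data, n):
    """Split data into n nearly equal chunks, one per worker."""
    size, extra = divmod(len(data), n)
    chunks, start = [], 0
    for i in range(n):
        step = size + (1 if i < extra else 0)  # spread the remainder over the first chunks
        chunks.append(data[start:start + step])
        start += step
    return chunks

print(partition(list(range(10)), 4))  # [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
```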

  13. Risks of parallel computing
      - Overhead due to communication
      - Parallel slowdown:
        - Task needs to be large
        - Need several processing units
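A toy cost model (my own illustration, not from the slides) shows why communication overhead causes parallel slowdown: the ideal split shrinks as 1/n, but the communication cost grows with the number of workers, so past some point adding workers makes the job slower:

```python
def parallel_time(task_time, n_workers, comm_overhead):
    """Crude model: a perfect split of the work plus a per-worker communication cost."""
    return task_time / n_workers + comm_overhead * n_workers

# For a modest task, the total time falls, bottoms out, then rises again
for n in (1, 4, 16, 64):
    print(n, parallel_time(100, n, 1.0))
```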

  14. An example

  15. multiprocessing.Pool

      from multiprocessing import Pool
      import pandas as pd

      def take_mean_age(year_and_group):
          year, group = year_and_group
          return pd.DataFrame({"Age": group["Age"].mean()}, index=[year])

      with Pool(4) as p:
          results = p.map(take_mean_age, athlete_events.groupby("Year"))
      result_df = pd.concat(results)

  16. dask

      import dask.dataframe as dd

      # Partition the dataframe into 4
      athlete_events_dask = dd.from_pandas(athlete_events, npartitions=4)

      # Run parallel computations on each partition
      result_df = athlete_events_dask.groupby('Year').Age.mean().compute()

  17. Let's practice!

  18. Parallel computation frameworks - Vincent Vankrunkelsven, Data Engineer @ DataCamp

  19. (image-only slide)

  20. HDFS

  21. MapReduce

  22. Hive
      - Runs on Hadoop
      - Structured Query Language: Hive SQL
      - Initially MapReduce, now other tools

  23. Hive: an example

      SELECT year, AVG(age)
      FROM views.athlete_events
      GROUP BY year

  24. Apache Spark
      - Avoids disk writes
      - Maintained by the Apache Software Foundation

  25. Resilient distributed datasets (RDDs)
      - Spark relies on them
      - Similar to a list of tuples
      - Transformations: .map() or .filter()
      - Actions: .count() or .first()
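Since an RDD behaves like a list of tuples, its transformations and actions can be mimicked with plain Python built-ins (a rough analogy of my own, not real Spark code; real transformations are lazy and distributed):

```python
rdd = [("2008", 24), ("2008", 30), ("2012", 26)]   # stand-in for an RDD of tuples

# Transformations build a new dataset from an existing one
ages = list(map(lambda t: t[1], rdd))              # like rdd.map(...)
adults = list(filter(lambda a: a >= 25, ages))     # like .filter(...)

# Actions return a concrete value to the driver
print(len(adults))   # like .count() -> 2
print(adults[0])     # like .first() -> 30
```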

  26. PySpark
      - Python interface to Spark
      - DataFrame abstraction
      - Looks similar to pandas

  27. PySpark: an example

      # Load the dataset into athlete_events_spark first
      (athlete_events_spark
          .groupBy('Year')
          .mean('Age')
          .show())

      -- The equivalent SQL query:
      SELECT year, AVG(age)
      FROM views.athlete_events
      GROUP BY year

  28. Let's practice!

  29. Workflow scheduling frameworks - Vincent Vankrunkelsven, Data Engineer @ DataCamp

  30. An example pipeline
      How to schedule?
      - Manually
      - cron scheduling tool
      What about dependencies?

  31. DAGs
      Directed Acyclic Graph:
      - Set of nodes
      - Directed edges
      - No cycles
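Because a DAG has no cycles, a valid execution order always exists; Python's standard graphlib module (3.9+) can compute one. The task names below are borrowed from the Airflow code example later in the deck:

```python
from graphlib import TopologicalSorter

# Map each node to the set of nodes it depends on
dag = {
    "ingest_customer_data": {"start_cluster"},
    "ingest_product_data": {"start_cluster"},
    "enrich_customer_data": {"ingest_customer_data", "ingest_product_data"},
}

# static_order() raises CycleError if the graph has a cycle
order = list(TopologicalSorter(dag).static_order())
print(order)  # "start_cluster" comes first, "enrich_customer_data" last
```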

  32. The tools for the job
      - Linux's cron
      - Spotify's Luigi
      - Apache Airflow

  33. Apache Airflow
      - Created at Airbnb
      - DAGs
      - Python

  34. Airflow: an example DAG

  35. Airflow: an example in code

      # Create the DAG object
      dag = DAG(dag_id="example_dag", ..., schedule_interval="0 * * * *")

      # Define operations
      start_cluster = StartClusterOperator(task_id="start_cluster", dag=dag)
      ingest_customer_data = SparkJobOperator(task_id="ingest_customer_data", dag=dag)
      ingest_product_data = SparkJobOperator(task_id="ingest_product_data", dag=dag)
      enrich_customer_data = PythonOperator(task_id="enrich_customer_data", ..., dag=dag)

      # Set up dependency flow
      start_cluster.set_downstream(ingest_customer_data)
      ingest_customer_data.set_downstream(enrich_customer_data)
      ingest_product_data.set_downstream(enrich_customer_data)

  36. Let's practice!
