CSC 369: Distributed Computing Alex Dekhtyar Day 1: Welcome


slide-1
SLIDE 1

CSC 369: Distributed Computing

Alex Dekhtyar

Day 1: Welcome

slide-2
SLIDE 2
slide-3
SLIDE 3

Syllabus

  • Teaching and Communication
  • Textbook(s)
  • Grading
  • Exams
  • Labs
  • Late Policies

Course

  • What is “distributed computing”
  • Why study it?
  • Examples of problems
slide-5
SLIDE 5

Syllabus: Teaching and Communication

  • Lectures are synchronous but recorded
  • Lab periods may be used for guided activities
  • But often are just for work on lab assignments
  • Office hour between lecture and lab (M,F)

slide-6
SLIDE 6

Syllabus: Teaching and Communication

  • Zoom
  • Slack
  • Static Website
  • Canvas
  • Mailing list

slide-7
SLIDE 7

Waitlist

Drop/Add deadline: April 15

  • All waitlisted students get full access to class for two weeks
  • First five (5) days - all adds handled automatically
  • Everyone else - I will look at the state of affairs next Monday.

slide-8
SLIDE 8

Syllabus: Textbooks

NONE

  • Lecture Notes
  • Documentation
  • Original MapReduce and Spark papers

slide-9
SLIDE 9

Syllabus: Books

  • Donald Miner, Adam Shook, MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems, O’Reilly Media, 1st Edition, 2012, ISBN: 978-1449327170.
  • Mahmoud Parsian, Data Algorithms: Recipes for Scaling Up With Hadoop and Spark, O’Reilly Media, 2015, ISBN: 978-1491906187.
  • Kristina Chodorow, MongoDB: The Definitive Guide, O’Reilly Media, 2013, ISBN: 978-144924468
  • Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia, Learning Spark: Lightning-Fast Big Data Analysis, O’Reilly Media, 2015, ISBN: 978-1449358624.
  • Tomasz Drabas, Denny Lee, Learning PySpark, Packt, 2017, ISBN-13: 978-1786463708.

slide-10
SLIDE 10

Syllabus: Grading

  • Labs: 50-60%
  • Exams/Written Assessments: 35-50%
  • Homework/Study guides: 0-5%

slide-12
SLIDE 12

Syllabus: Labs

~ 8 Labs (roughly weekly)

  • 1 Intro (Lab 1 starts today)
  • 2-3 MongoDB
  • 2-3 Hadoop
  • 2-3 Spark

Mostly individual. Some pair programming experiments mid-quarter.

slide-14
SLIDE 14

Syllabus: Exams

?

slide-16
SLIDE 16

Syllabus: Exams

Combination of programming and short timed tests.

  • MongoDB programming test + quiz
  • Hadoop programming test + quiz
  • Spark programming test + quiz (Final exam time)

Open “most things” on programming tests. Still thinking about how to make quizzes work.

slide-18
SLIDE 18

Syllabus: Late Policies

Step 1. Talk to Me!!!!!

  • Deadlines are already lenient
  • There is a grace period
  • Deadlines are to prevent you from being bogged down with one problem
  • Partial credit
slide-19
SLIDE 19

Syllabus

  • Teaching and Communication
  • Textbook(s)
  • Grading
  • Exams
  • Labs
  • Late Policies

Course

  • What is “distributed computing”
  • Why study it?
  • Examples of problems
slide-20
SLIDE 20

One small thing: I forgot to ask a couple of questions

https://forms.gle/2vuNJr1nR6FWpioG8

slide-21
SLIDE 21

Distributed Computing

slide-22
SLIDE 22

Distributed Computing

Multiple independent computers work on the same problem at the same time

slide-27
SLIDE 27

Distributed Computing

Multiple independent computers work on the same problem at the same time

Facilitated by distributed computing systems and frameworks

slide-28
SLIDE 28

Distributed Computing

Multiple independent computers work on the same problem at the same time

  • CSC 369: writing software for solving problems using existing distributed computing frameworks
  • CSC 469: studying how to build distributed computing frameworks

slide-29
SLIDE 29

Distributed Computing

CSC 369: writing software for solving problems using existing distributed computing frameworks

BIG GREY BOX

slide-30
SLIDE 30
slide-32
SLIDE 32

Elephant in the room

BIG DATA

slide-34
SLIDE 34

BIG DATA Problems

Big Data = any data collection that is larger than the storage capacity of a single computer system used to process it.

Problems that are easy to solve as small data problems turn out to be difficult as big data problems

slide-35
SLIDE 35

This is why we cannot have nice things

Problems that are easy to solve as small data problems turn out to be difficult as big data problems

teach CSC 369

slide-36
SLIDE 36

When you have a hammer, everything is a nail

  • I am a “database guy”, so for me “distributed computing problems” = “data management and analysis problems”
  • Distributed relational DBMSs are no different from regular relational DBMSs and thus are covered in CSC 365
  • So, we’ll study other distributed frameworks

slide-37
SLIDE 37
slide-40
SLIDE 40

MongoDB: distributed non-relational document store

  • Replicates and shards data
  • Works with JSON objects

Hadoop: open-source implementation of the MapReduce framework

  • MapReduce: distributed computing framework for data processing
  • Map: transform data
  • Reduce: combine information

Spark: lazy evaluation data processing over Hadoop

  • Resilient Distributed Datasets (RDDs): optimize data processing
  • Implemented in Scala
  • PySpark: Python interface to Spark
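To make "Map: transform data, Reduce: combine information" concrete, here is a minimal single-machine sketch of the MapReduce idea in plain Python. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the word-count task are assumptions chosen for illustration, and a real framework would run the map and reduce steps on many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: transform one input record into (key, value) pairs."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group values by key; a framework does this between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each document could be mapped on a different machine; here it is a loop.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data is big", "data is distributed"])
print(counts)
```

The point of the split is that every `map_phase` call is independent of the others, and every `reduce_phase` call needs only the values for its own key, so both steps can be spread across computers.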

slide-41
SLIDE 41

What Types of Problems?

  • Handout #2
  • The “Facebook” Example
  • The “Google” Example
  • The “Twitter” Example
  • The “Census” Example
  • The “Bioinformatics” Example

I’ll record a 10-15 minute companion video.

slide-42
SLIDE 42

In Lab Today

1. Confirm that everyone has access to ambari-head and MongoDB, change passwords
2. Lab 1: JSON processing
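As a warm-up for JSON processing, a minimal sketch using Python's standard `json` module is shown below. The records and field names (`user`, `tags`) are made up for illustration and are not taken from the Lab 1 handout.

```python
import json

# Hypothetical input: one JSON object per line, as text.
raw_lines = [
    '{"user": "ann", "tags": ["spark", "hadoop"]}',
    '{"user": "bob", "tags": ["mongodb"]}',
]

# Parse each line into a Python dict.
records = [json.loads(line) for line in raw_lines]

# Compute a small summary: how many tags each user has.
tag_counts = {record["user"]: len(record["tags"]) for record in records}

# Serialize the summary back to JSON text.
summary = json.dumps(tag_counts, sort_keys=True)
print(summary)
```

`json.loads` turns JSON text into Python dicts and lists, and `json.dumps` goes the other way, which is the same object shape MongoDB documents use.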