CSC 369: Distributed Computing Alex Dekhtyar April 22 Day 8: - - PowerPoint PPT Presentation

csc 369 distributed computing
SMART_READER_LITE
LIVE PREVIEW

CSC 369: Distributed Computing Alex Dekhtyar April 22 Day 8: - - PowerPoint PPT Presentation

CSC 369: Distributed Computing Alex Dekhtyar April 22 Day 8: Problem-solving with db.collection.aggregate() April 22: Vladimir Lenins Birthday Housekeeping Lab 3: now with a deadline (Friday midnight + grace period) Lab 4: Friday --


slide-1
SLIDE 1

CSC 369: Distributed Computing

Alex Dekhtyar

Day 8: Problem-solving with db.collection.aggregate()

April 22

slide-2
SLIDE 2

April 22: Vladimir Lenin’s Birthday

slide-3
SLIDE 3

Housekeeping

Lab 3: now with a deadline (Friday midnight + grace period) Lab 4: Friday -- Monday, May 4 (gives you time) Lab 5: Hadoop Friday: quiz. Be ON TIME Monday: 12:10pm - Lab Test. Read email/slack for details

Grading: Lab 2 -> Quiz -> Lab 2 -> Lab Test -> Lab 1

slide-4
SLIDE 4

Back into the fray

slide-5
SLIDE 5

Very Tersely

Filtering Projection Transformation Join Grouping Sort

Given a condition - keep only objects that satisfy it Modify the contents of its object based solely on what’s in the object itself

Aggregation

Break collection into groups, each representing

  • bjects with same values of some keys

Combine objects from two different collections based on matches in values of some keys Compute an aggregate value over a set of objects Return objects in a specific order

slide-6
SLIDE 6

… and a few more

Ungrouping Unwinding Limit Skip

Opposite of grouping - build an object for each element of an array Return a specific number of documents

Sample

Return documents after skipping a specified number Return a random sample of documents

Facets

Run multiple operations concurrently, combine results in a single document

slide-7
SLIDE 7

$operation

Filtering Projection Join Grouping Sort Aggregation $match $project $group $lookup $sort Unwinding Limit Skip Sample $unwind $limit $skip $sample

slide-8
SLIDE 8

$operation

Filtering Projection Join Grouping Sort Aggregation $match $project $set $unset $addFields $redact $replaceRoot $group $bucket $bucketAuto $lookup $graphLookup $sortByCount $sort

slide-9
SLIDE 9

This is a lot to take in

slide-10
SLIDE 10

How do we actually solve problems with db.collection.aggregate()

???

slide-11
SLIDE 11

Key things to remember

Filtering Selection Projection Transformation Join Grouping Aggregation Unwind

Other operations - as needed to assist the main flow

Faceting

slide-12
SLIDE 12

Key things to remember

Selections/Filters are EASY to recognize

For all days in March, find the number of hospitalized people in the state of California. Report each day when the number of new cases exceeded 10% of the number of cumulative cases.

slide-13
SLIDE 13

Key things to remember

Selections/Filters are EASY to recognize

For all days in March, find the number of hospitalized people in the state of California. Report each day when the number of new cases exceeded 10% of the number of cumulative cases.

What are the tell-tales?

slide-14
SLIDE 14

Key things to remember

Selections/Filters are EASY to recognize

For all days in March, find the number of hospitalized people in the state of California. Report each day when the number of new cases exceeded 10% of the number of cumulative cases.

What are the tell-tales?

constants

slide-15
SLIDE 15

Key things to remember

Selections/Filters are EASY to recognize

For all days in March, find the number of hospitalized people in the state of California. Report each day when the number of new cases exceeded 10% of the number of cumulative cases.

What are the tell-tales?

constants

slide-16
SLIDE 16

Key things to remember

Selections/Filters are EASY to recognize

For all days in March, find the number of hospitalized people in the state of California. Report each day when the number of new cases exceeded 10% of the number of cumulative cases.

What are the tell-tales?

constants comparisons

slide-17
SLIDE 17

Key things to remember

Projections are everywhere

Use Case #1: Show only the things we are interested in Use Case #2: Transform the output

Shows up in support of other operations (selection, join, grouping) Central activity in an information request

slide-18
SLIDE 18

Key things to remember

Projections are everywhere

For all days in March, find the number of hospitalized people in the state of California. Report each day when the number of new cases exceeded 10% of the number of cumulative cases.

Use Case #1: Support

slide-19
SLIDE 19

Key things to remember

Projections are everywhere

For all days in March, find the number of hospitalized people in the state of California. Report each day when the number of new cases exceeded 10% of the number of cumulative cases.

Explicit restrictions Use Case #1: Support

slide-20
SLIDE 20

Key things to remember

Projections are everywhere

Compute the ratio of people on ICU to all hospitalized people Create a “status” attribute. Set “status” to “in trouble” if the number of new deaths exceeds 10% of the number of new cases. Otherwise, set status to “coping”.

Use Case #2: Main Target

slide-21
SLIDE 21

Key things to remember

Projections are everywhere

Compute the ratio of people on ICU to all hospitalized people Create a “status” attribute. Set “status” to “in trouble” if the number of new deaths exceeds 10% of the number of new cases. Otherwise, set status to “coping”.

Use Case #2: Main Target

Computation (using single

  • bject data)
slide-22
SLIDE 22

Key things to remember

Projections are everywhere

Compute the ratio of people on ICU to all hospitalized people Create a “status” attribute. Set “status” to “in trouble” if the number of new deaths exceeds 10% of the number of new cases. Otherwise, set status to “coping”.

Use Case #2: Main Target

Computation (using single

  • bject data)

Explicit Transformation

slide-23
SLIDE 23

Key things to remember

Projections are everywhere

Use Case #3: Implicit Cleanup after Joins/Unwinds/Grouping

slide-24
SLIDE 24

Key things to remember

Projections are everywhere

Use Case #3: Implicit Cleanup after Joins/Unwinds/Grouping For each state report the total number of days with more than 10 ICU

  • patients. Report results in the form:

{state: <state>, badICUDays: <nDays>}

slide-25
SLIDE 25

Key things to remember

Projections are everywhere

Use Case #3: Implicit For each state report the total number of days with more than 10 ICU

  • patients. Report results in the form:

{state: <state>, badICUDays: <nDays>} {$match: {...}}, {$group: {_id:”$state”, badICUDays: {$sum:1}}}

slide-26
SLIDE 26

Key things to remember

Projections are everywhere

Use Case #3: Implicit For each state report the total number of days with more than 10 ICU

  • patients. Report results in the form:

{state: <state>, badICUDays: <nDays>} {$match: {...}}, {$group: {_id:”$state”, badICUDays: {$sum:1}}}

{_id: “CA”, badICUDays: 21 }

slide-27
SLIDE 27

Key things to remember

Projections are everywhere

Use Case #3: Implicit For each state report the total number of days with more than 10 ICU

  • patients. Report results in the form:

{state: <state>, badICUDays: <nDays>} {$match: {...}}, {$group: {_id:”$state”, badICUDays: {$sum:1}}}

{_id: “CA”, badICUDays: 21 }

slide-28
SLIDE 28

Key things to remember

Projections are everywhere

Use Case #3: Implicit For each state report the total number of days with more than 10 ICU

  • patients. Report results in the form:

{state: <state>, badICUDays: <nDays>} {$match: {...}}, {$group: {_id:”$state”, badICUDays: {$sum:1}}}, {$project: {_id:0, state:”$_id”}}

{state: “CA”, badICUDays: 21 }

slide-29
SLIDE 29

Key things to remember

Grouping combines data from multiple documents into one

For each state report the total number of days with more than 10 ICU patients. Report results in the form: {state: <state>, badICUDays: <nDays>}

Is this a grouping and aggregation query?

slide-30
SLIDE 30

Key things to remember

Grouping combines data from multiple documents into one

For each state report the total number of days with more than 10 ICU patients. Report results in the form: {state: <state>, badICUDays: <nDays>}

Is this a grouping and aggregation query? Yes, with daily.json data

slide-31
SLIDE 31

Key things to remember

Grouping combines data from multiple documents into one

For each state report the total number of days with more than 10 ICU patients. Report results in the form: {state: <state>, badICUDays: <nDays>}

Is this a grouping and aggregation query? Yes, with daily.json data

{ "_id" : ObjectId("5e941e9cf9e720b73b7d96ff"), "date" : 20200405, "state" : "AK", "positive" : 185, "negative" : 6099, "pending" : null, "hospitalizedCurrently" : null, "hospitalizedCumulative" : 20, "inIcuCurrently" : 12, "inIcuCumulative" : null, "onVentilatorCurrently" : null, "onVentilatorCumulative" : null, "recovered" : null, "hash" : "661d7b0f627847a2dceb5d70d4e9260965031cc2", "dateChecked" : "2020-04-05T20:00:00Z", "death" : 6, "hospitalized" : 20, "total" : 6284, "totalTestResults" : 6284, "posNeg" : 6284, "fips" : "02", "deathIncrease" : 1, "hospitalizedIncrease" : 4, "negativeIncrease" : 230, "positiveIncrease" : 14, "totalTestResultsIncrease" : 244}

slide-32
SLIDE 32

Key things to remember

Grouping combines data from multiple documents into one

For each state report the total number of days with more than 10 ICU patients. Report results in the form: {state: <state>, badICUDays: <nDays>}

Is this a grouping and aggregation query? No, with other input data

slide-33
SLIDE 33

Key things to remember

Grouping combines data from multiple documents into one

For each state report the total number of days with more than 10 ICU patients. Report results in the form: {state: <state>, badICUDays: <nDays>}

Is this a grouping and aggregation query? No, with other input data

{ "_id" : 8888, state: “CA”, month: “March”, badICUDays: 9, goodICUDays: 4, noInfo: 17, cumulativeICUPatients: 88 }

slide-34
SLIDE 34

Key things to remember

Grouping combines data from multiple documents into one

For each state report the total number of days with more than 10 ICU patients. Report results in the form: {state: <state>, badICUDays: <nDays>}

KNOW YOUR DATA!!!

slide-35
SLIDE 35

Key things to remember

Grouping does NOT always mean aggregation

For each state create a list of dates when there were more than 10 ICU patients

slide-36
SLIDE 36

Key things to remember

Grouping does NOT always mean aggregation

For each state create a list of dates when there were more than 10 ICU patients

$push $addToSet Are your biggest friends!

slide-37
SLIDE 37

Key things to remember

We can “hide” information while grouping

$push $addToSet Are your biggest friends!

{$group: {_id:”$state”, avgPatients:{$avg:”$hospitalized”}, $push: {$hospitalized} } }

slide-38
SLIDE 38

Key things to remember

We can “hide” information while grouping

$push $addToSet Create array attributes

We can “unhide” information AFTER grouping

slide-39
SLIDE 39

Key things to remember

We can “hide” information while grouping

$push $addToSet Create array attributes

We can “unhide” information AFTER grouping

{$group: {_id:”$state”, avgPatients:{$avg:”$hospitalized”}, data: $push: {$hospitalized} } },

{$unwind: “$data”} $unwind after $group

slide-40
SLIDE 40

Key things to remember

We can “hide” information while grouping We can “unhide” information AFTER grouping Grouping combines data from multiple documents into one Grouping does NOT always mean aggregation Compound Keys $first, $last constant key values

slide-41
SLIDE 41

Key Things To Remember

Joins involve comparisons of documents to documents

slide-42
SLIDE 42

Key Things To Remember

Joins involve comparisons of documents to documents

Use Case #1: Join to different collections Use Case #2: Self Join Often can be avoided by embedding documents

slide-43
SLIDE 43

Students

take

Courses

(StudentID, Name) (CourseID, Name,Section, Instructor) (StudentID, Course, Section)

SELECT * FROM Students s, take t, Courses c WHERE s.StudentID = t.StudentId and t.Course = C.CourseID and t.Section = C.Section

slide-44
SLIDE 44

Students

take

Courses

(StudentID, Name) (CourseID, Name,Section, Instructor) (StudentID, Course, Section)

{course: “CSC 369”, roster: [{student:”Bob Smith”}, {student:”Alice Lee”}, ... ] }

slide-45
SLIDE 45

Key Things To Remember

Joins involve comparisons of documents to documents

Use Case #1: Join to different collections Use Case #2: Self Join $lookup is expensive. Self Joins can be “tricked”

slide-46
SLIDE 46

Key Things To Remember

Joins involve comparisons of documents to documents

Use Case #1: Join to different collections Use Case #2: Self Join $lookup is expensive. Self Joins can be “tricked” with $group

slide-47
SLIDE 47

Tricks

and

dealing with MongoDB idiosyncrasies

slide-48
SLIDE 48

Trick 1: $project as a filter

Problem: $match cannot compare two attributes to each other Report each day when the number of new cases exceeded 10% of the number of cumulative cases.

{ … positive: 566, positiveIncrease: 65 … } {$match: {positiveIncrease: {$gt: {$multiply: [0.1, $positive”]}}}}

slide-49
SLIDE 49

Trick 1: $project as a filter

Problem: $match cannot compare two attributes to each other Report each day when the number of new cases exceeded 10% of the number of cumulative cases.

{ … positive: 566, positiveIncrease: 65 … }

{$project: {flag: {$cond: [{$gt: [{“$positiveIncrease”, {$multiply: [“$positive”, 0.1]}]}, True, False }]}}

slide-50
SLIDE 50

Trick 1: $project as a filter

Problem: $match cannot compare two attributes to each other Report each day when the number of new cases exceeded 10% of the number of cumulative cases.

{ … positive: 566, positiveIncrease: 65 … }

{$project: {flag: {$cond: [{$gt: [{“$positiveIncrease”, {$multiply: [“$positive”, 0.1]}]}, True, False }]}}, {$match: {flag: True}}

slide-51
SLIDE 51

Trick 1: $project as a filter

Problem: $match cannot compare two attributes to each other Report each day when the number of new cases exceeded 10% of the number of cumulative cases.

{ … positive: 566, positiveIncrease: 65 … }

{$project: {flag: {$cond: [{$gt: [{“$positiveIncrease”, {$multiply: [“$positive”, 0.1]}]}, True, False }]}}, {$match: {flag: True}}

slide-52
SLIDE 52

Trick 1: $project as a filter

Problem: $match cannot compare two attributes to each other

All computations can and should be done in $project

slide-53
SLIDE 53

Trick 2: Who has the optimal value?

Problem: {$sort: {foo:-1}},{limit:1} fails when there are ties Report the state and the date of the largest single increase in the number

  • f positive cases.
slide-54
SLIDE 54

Trick 2: Who has the optimal value?

Problem: {$sort: {foo:-1}},{limit:1} fails when there are ties Report the state and the date of the largest single increase in the number

  • f positive cases.

Step 1: Use $group $push to “hide” data Use constant for grouping value Step 1.5: Unwind

slide-55
SLIDE 55

Trick 2: Who has the optimal value?

Report the state and the date of the largest single increase in the number

  • f positive cases.

{ _id: “1”, largestIncrease: 10841 data: [{...},{...},...,{...}] }

{$unwind:”$data”}

{ _id: “1”, largestIncrease: 10841 data: {...} }

slide-56
SLIDE 56

Trick 2: Who has the optimal value?

Report the state and the date of the largest single increase in the number

  • f positive cases.

{ _id: “1”, largestIncrease: 10841 data: [{...},{...},...,{...}] }

{$unwind:”$data”}

{ _id: “1”, largestIncrease: 10841 data: {...} }

{$project ...}

Get rid of embedding if needed

Bulky, but straightforward and repeatable

slide-57
SLIDE 57

Trick 2: Who has the optimal value?

Problem: {$sort: {foo:-1}},{limit:1} fails when there are ties Report the state and the date of the largest single increase in the number

  • f positive cases.

Step 1: Use $group $push to “hide” data Use constant for grouping value Step 1.5: Unwind And $project if desired Step 2. See Trick 1 to finish

slide-58
SLIDE 58

Trick 3: Join Avoidance

Self-joins can be done outside of $lookup But with some painful manipulations

Leverage $group $push / $addToSet $unwind $project

slide-59
SLIDE 59

Trick 4: Generalizing Joins

Problem: $lookup is a left outer equijoin

Joins can be more complex:

slide-60
SLIDE 60

Trick 4: Generalizing Joins

Problem: $lookup is a left outer equijoin

Joins can be more complex: Report governors of all states with less than 400 positive cases per million on April 4, 2020

daily.json

{state: “CA” governor:”Gavin Newsom” population: 39510000}

slide-61
SLIDE 61

Trick 4: Generalizing Joins

Problem: $lookup is a left outer equijoin

Joins can be more complex: Report governors of all states with less than 400 positive cases per million on April 4, 2020

{... state:”CA” positive: 12026 ... }

{state: “CA” governor:”Gavin Newsom” population: 39510000}

slide-62
SLIDE 62

Trick 4: Generalizing Joins

Problem: $lookup is a left outer equijoin

Joins can be more complex: Report governors of all states with less than 400 positive cases per million on April 4, 2020

Join = Cartesian Product followed by Selection

slide-63
SLIDE 63

Trick 4: Generalizing Joins

Problem: $lookup is a left outer equijoin

Joins can be more complex: Report governors of all states with less than 400 positive cases per million on April 4, 2020

Join = Cartesian Product followed by Selection

Use for comparisons (Trick 1)