CS 744: GANDIVA Shivaram Venkataraman Fall 2019 ADMINISTRIVIA - - - PowerPoint PPT Presentation



slide-1
SLIDE 1

CS 744: GANDIVA

Shivaram Venkataraman Fall 2019

slide-2
SLIDE 2

ADMINISTRIVIA

  • Course project proposal
  • Midterm
slide-3
SLIDE 3

Machine Learning

  • Bismarck: Supervised learning, Unified Interface; Shared memory, Model fits in memory
  • Parameter Server: Large datasets, large models (PB scale); Consistency model, Fault tolerance
  • TensorFlow: Need for flexible programming model; Dataflow graph, Heterogeneous accelerators
  • Ray: Reinforcement learning applications; Actors and tasks, Local and global scheduler

slide-4
SLIDE 4

  • Scalable Storage Systems
  • Datacenter Architecture
  • Resource Management
  • Computational Engines
  • Machine Learning, SQL, Streaming, Graph
  • Applications

slide-5
SLIDE 5

MACHINE LEARNING WORKFLOW?

slide-6
SLIDE 6

SHARED ML CLUSTERS

Rack

slide-7
SLIDE 7

WORKLOAD

Feedback-driven exploration

slide-8
SLIDE 8

AFFINITY

slide-9
SLIDE 9

INTRA JOB PREDICTABILITY

slide-10
SLIDE 10

MECHANISMS (1)

Rack

  • 1. Suspend-Resume
  • 2. Migration
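
A minimal sketch of how suspend-resume could be used to time-slice one GPU among several jobs. This is an illustration only: the `Job` class, the `time_slice` function, and the idea of measuring a slice in mini-batches are assumptions made for the example, not details taken from the slides.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    remaining: int  # mini-batches left to run (hypothetical unit of work)


def time_slice(jobs, quantum):
    """Round-robin one GPU among jobs.

    Returns the list of (job name, mini-batches run) slices in execution order.
    """
    queue = deque(jobs)
    schedule = []
    while queue:
        job = queue.popleft()              # "resume" the job on the GPU
        ran = min(quantum, job.remaining)
        job.remaining -= ran
        schedule.append((job.name, ran))
        if job.remaining > 0:
            queue.append(job)              # "suspend" it and requeue
    return schedule
```

With a quantum of 2, jobs needing 3 and 1 mini-batches run as a(2), b(1), a(1), so the short job gets early feedback instead of waiting behind the long one.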
slide-11
SLIDE 11

MECHANISMS (2)

Rack

  • 3. Grow-shrink
  • 4. Profiling
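
One way to picture grow-shrink, as a sketch under stated assumptions: a single elastic job opportunistically takes whatever GPUs are idle, and gives them back when another job needs them. The function name `grow_shrink` and the dictionary-based allocation are hypothetical and chosen only for this example.

```python
def grow_shrink(total_gpus, base_alloc, elastic_job):
    """Hand all idle GPUs to one elastic job.

    base_alloc maps job name -> GPUs the job must keep (its minimum).
    Returns a new allocation; the input dictionary is not modified.
    """
    used = sum(base_alloc.values())
    if used > total_gpus:
        raise ValueError("base allocation exceeds cluster capacity")
    alloc = dict(base_alloc)
    alloc[elastic_job] += total_gpus - used   # grow into idle GPUs
    return alloc
```

Calling the function again after a new job arrives shrinks the elastic job automatically, because fewer GPUs are idle: with 8 GPUs, `{"a": 2, "b": 2}` grows `a` to 6, while adding a 4-GPU job `c` brings `a` back to 2.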
slide-12
SLIDE 12

SCHEDULING POLICY

Goals

  • early feedback
  • cluster efficiency
  • cluster-level fairness?

Two modes

  • Reactive
  • Introspective

slide-13
SLIDE 13

REACTIVE MODE

React to events

  • Job arrivals, departures, failures

Hierarchical Preference

  • Nodes with same “affinity”
  • Nodes with “different affinity”
  • Nodes with “no affinity”
  • Suspend-resume
  • …
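
The hierarchical preference above can be sketched as a tiered search. This is a rough illustration that follows the order listed on the slide and stops at suspend-resume (the slide's list continues past that point). The `Node` class, the `place` function, and the reading of "affinity" as the GPUs-per-job of jobs already on a node are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    free_gpus: int
    affinity: Optional[int]  # GPUs-per-job of jobs already here; None if unused


def place(job_gpus, nodes):
    """Return (node name, tier) for a job, trying tiers in the slide's order."""
    tiers = [
        ("same affinity", lambda n: n.affinity == job_gpus),
        ("different affinity",
         lambda n: n.affinity is not None and n.affinity != job_gpus),
        ("no affinity", lambda n: n.affinity is None),
    ]
    for label, matches in tiers:
        for node in nodes:
            if matches(node) and node.free_gpus >= job_gpus:
                return node.name, label
    # No node has enough free GPUs: fall back to time-slicing an occupied one.
    return None, "suspend-resume"
```

For example, a 2-GPU job is placed on a node already running 2-GPU jobs before any other node is considered, and a job that fits nowhere falls through to suspend-resume.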

slide-14
SLIDE 14

INTROSPECTIVE MODE

Monitor and optimize placement of jobs periodically

Actions

  • Packing
  • Migration
  • Grow-shrink
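
As an illustration of the packing action, here is a minimal greedy sketch that a periodic introspection pass might run. It assumes that per-job GPU utilization is available as a fraction and that utilizations add up when two jobs share a GPU; both are simplifications, and `propose_packing` and its `limit` parameter are hypothetical names.

```python
def propose_packing(gpu_util, limit=1.0):
    """Greedily pair jobs whose combined GPU utilization fits under `limit`.

    gpu_util maps job name -> utilization in [0, 1].
    Returns a list of (light job, heavy job) pairs to co-locate on one GPU.
    """
    jobs = sorted(gpu_util, key=gpu_util.get)   # least-utilized first
    pairs = []
    lo, hi = 0, len(jobs) - 1
    while lo < hi:
        if gpu_util[jobs[lo]] + gpu_util[jobs[hi]] <= limit:
            pairs.append((jobs[lo], jobs[hi]))
            lo += 1
            hi -= 1
        else:
            hi -= 1   # the heaviest remaining job stays unpacked
    return pairs
```

Any proposal like this would still need to be checked against measured job progress after the change, so that a packing that slows jobs down can be undone.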

slide-15
SLIDE 15

DISCUSSION

https://forms.gle/aHYbNcTFdGJtXefj9

slide-16
SLIDE 16

What are some guarantees provided by Mesos that are not provided by Gandiva? Explain with an example.

slide-17
SLIDE 17

Are the mechanisms in Gandiva also useful in a cluster running Apache Spark jobs? Provide one example either for or against.

slide-18
SLIDE 18
slide-19
SLIDE 19

NEXT STEPS

  • New module on SQL!
  • Course project introductions
  • Midterm