SLIDE 1

Carnegie Mellon

Parallel Splash Belief Propagation

Joseph E. Gonzalez Yucheng Low Carlos Guestrin David O’Hallaron

Computers which worked on this project: BigBro1, BigBro2, BigBro3, BigBro4, BigBro5, BigBro6, BiggerBro, BigBroFS Tashish01, Tashi02, Tashi03, Tashi04, Tashi05, Tashi06, …, Tashi30, parallel, gs6167, koobcam (helped with writing)

SLIDE 2

Why talk about parallelism now?

[Figure: "Change in the Foundation of ML" - log clock speed (GHz) vs. release date (1988-2010), contrasting projected future sequential performance with future parallel performance.]

SLIDE 3

Why is this a Problem?

[Figure: ML methods plotted by sophistication vs. parallelism - Nearest Neighbor [Google et al.], Basic Regression [Cheng et al.], Graphical Models [Mendiburu et al.], Support Vector Machines [Graf et al.] - with the goal region marked "Want to be here".]

SLIDE 4

Why is it hard?

  • Algorithmic Efficiency: eliminate wasted computation
  • Parallel Efficiency: expose independent computation
  • Implementation Efficiency: map computation to real hardware

SLIDE 5

The Key Insight

Statistical Structure
  • Graphical Model Structure
  • Graphical Model Parameters

Computational Structure
  • Chains of Computational Dependences
  • Decay of Influence

Parallel Structure
  • Parallel Dynamic Scheduling
  • State Partitioning for Distributed Computation
SLIDE 6


The Result

[Figure: the same sophistication vs. parallelism chart - Splash Belief Propagation (Graphical Models [Gonzalez et al.]) reaches the goal region, beyond Nearest Neighbor [Google et al.], Basic Regression [Cheng et al.], Support Vector Machines [Graf et al.], and Graphical Models [Mendiburu et al.].]

SLIDE 7

Outline

  • Overview
  • Graphical Models: Statistical Structure
  • Inference: Computational Structure
  • τε-Approximate Messages: Statistical Structure
  • Parallel Splash
      • Dynamic Scheduling
      • Partitioning
  • Experimental Results
  • Conclusions

SLIDE 8

Graphical Models and Parallelism

Graphical models provide a common language for general purpose parallel algorithms in machine learning

A parallel inference algorithm would improve:

  • Protein Structure Prediction
  • Computer Vision
  • Movie Recommendation

Inference is a key step in learning graphical models.

SLIDE 9

Overview of Graphical Models

Graphical representation of local statistical dependencies

[Figure: image denoising model. A noisy picture provides observed random variables, which are connected to latent pixel variables (the "true" pixel values); local dependencies between neighboring latent pixels encode continuity assumptions.]

Inference: What is the probability that this pixel is black?

SLIDE 10

Synthetic Noisy Image Problem

  • Overlapping Gaussian noise
  • Assess convergence and accuracy

[Figure: noisy image and predicted image.]

SLIDE 11

Protein Side-Chain Prediction

Model side-chain interactions as a graphical model

Inference: What is the most likely orientation?

SLIDE 12

Protein Side-Chain Prediction

276 protein networks, each with approximately:

  • 700 variables
  • 1600 factors
  • 70 discrete orientations
  • Strong factors

[Figure: example degree distribution over vertex degree.]

SLIDE 13

Markov Logic Networks

Represent logic as a graphical model:

Smokes(A) ⇒ Cancer(A)
Smokes(B) ⇒ Cancer(B)
Friends(A,B) And Smokes(A) ⇒ Smokes(B)

[Figure: grounded network over Cancer(A), Cancer(B), Smokes(A), Smokes(B), Friends(A,B); A: Alice, B: Bob; each variable is True/False.]

Inference: Pr(Cancer(B) = True | Smokes(A) = True & Friends(A,B) = True) = ?

SLIDE 14

Markov Logic Networks

[Figure: the grounded Alice/Bob network from the previous slide.]

UW-Systems Model:

  • 8K binary variables
  • 406K factors
  • Irregular degree distribution: some vertices with high degree

SLIDE 15

Outline

  • Overview
  • Graphical Models: Statistical Structure
  • Inference: Computational Structure
  • τε-Approximate Messages: Statistical Structure
  • Parallel Splash
      • Dynamic Scheduling
      • Partitioning
  • Experimental Results
  • Conclusions

SLIDE 16

The Inference Problem

  • What is the probability that Bob smokes given Alice smokes?
  • What is the best configuration of the protein side-chains?
  • What is the probability that each pixel is black?

NP-Hard in general. Approximate Inference: Belief Propagation.

SLIDE 17

Belief Propagation (BP)

  • Iterative message passing algorithm
  • Naturally parallel algorithm

SLIDE 18

Parallel Synchronous BP

Given the old messages, all new messages can be computed in parallel:

[Figure: CPU 1 … CPU n each read the old messages and compute a block of the new messages.]

Map-Reduce Ready!
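To make the "naturally parallel" structure concrete, here is a minimal sketch of one synchronous update round for a pairwise model; the talk works with general factor graphs, and the names (phi, psi, messages) are illustrative only:

    import numpy as np

    def synchronous_bp_round(edges, phi, psi, messages):
        """One round of synchronous BP on a pairwise model (sketch).

        edges    : list of directed edges (u, v), both directions present
        phi[u]   : unary potential at u, shape (k,)
        psi[(u,v)] : pairwise potential, shape (k, k)
        messages : dict mapping (u, v) -> length-k array (the old messages)
        Every new message depends only on old messages, so each edge could
        be handled by a different CPU (or a map-reduce job).
        """
        new_messages = {}
        for (u, v) in edges:
            # Combine the unary potential at u with all old messages into u,
            # except the one coming back from v.
            belief_u = phi[u].copy()
            for (w, x) in edges:
                if x == u and w != v:
                    belief_u *= messages[(w, u)]
            # Pass through the pairwise potential and renormalize.
            m = psi[(u, v)].T @ belief_u
            new_messages[(u, v)] = m / m.sum()
        return new_messages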

SLIDE 19

Sequential Computational Structure


SLIDE 20

Hidden Sequential Structure


SLIDE 21

Hidden Sequential Structure

[Figure: chain graphical model with evidence at both ends.]

Running Time = (time for a single parallel iteration) × (number of iterations)

SLIDE 22

Optimal Sequential Algorithm

Schedule             Running Time    Processors
Naturally Parallel   2n²/p           p ≤ 2n
Forward-Backward     2n              p = 1
Optimal Parallel     n               p = 2

Gap: the naturally parallel (synchronous) schedule is far from the optimal running time.
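The forward-backward sweep referenced here, sketched for a chain with unary potentials phi and pairwise potentials psi (illustrative names): 2(n-1) message computations in total, but each sweep is inherently sequential.

    import numpy as np

    def forward_backward_chain(phi, psi):
        """Optimal sequential schedule on a chain of n variables (sketch).

        phi : list of n unary potentials, each shape (k,)
        psi : list of n-1 pairwise potentials; psi[i] couples variable i and i+1
        """
        n = len(phi)
        fwd = [None] * n   # fwd[i]: message into i from i-1
        bwd = [None] * n   # bwd[i]: message into i from i+1
        # Forward sweep: left to right.
        msg = np.ones_like(phi[0])
        for i in range(1, n):
            m = psi[i - 1].T @ (phi[i - 1] * msg)
            fwd[i] = m / m.sum()
            msg = fwd[i]
        # Backward sweep: right to left.
        msg = np.ones_like(phi[-1])
        for i in range(n - 2, -1, -1):
            m = psi[i] @ (phi[i + 1] * msg)
            bwd[i] = m / m.sum()
            msg = bwd[i]
        # Beliefs combine both sweeps.
        beliefs = []
        for i in range(n):
            b = phi[i].copy()
            if fwd[i] is not None: b = b * fwd[i]
            if bwd[i] is not None: b = b * bwd[i]
            beliefs.append(b / b.sum())
        return beliefs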

SLIDE 23

Key Computational Structure

Schedule             Running Time    Processors
Naturally Parallel   2n²/p           p ≤ 2n
Optimal Parallel     n               p = 2

The gap: inherent sequential structure requires efficient scheduling.

SLIDE 24

Outline

  • Overview
  • Graphical Models: Statistical Structure
  • Inference: Computational Structure
  • τε-Approximate Messages: Statistical Structure
  • Parallel Splash
      • Dynamic Scheduling
      • Partitioning
  • Experimental Results
  • Conclusions

SLIDE 25

Parallelism by Approximation

τε represents the minimal sequential structure

[Figure: messages along a chain of vertices 1-10, comparing the true messages with the τε-approximation, which only depends on vertices within distance τε.]
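The slide leaves the definition to the figure; a plausible formalization, consistent with how τε and "left-awareness" are used in the proofs on the following slides (a reconstruction, not the paper's verbatim definition), is

    \tau_\epsilon \;=\; \min\Bigl\{\, k \;:\; \max_{v} \bigl\| b_v^{(k)} - b_v \bigr\|_1 \le \epsilon \Bigr\},

where b_v is the true belief at vertex v and b_v^{(k)} is the belief computed using only messages originating within distance k of v.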

SLIDE 26

Tau-Epsilon Structure

Often τε decreases quickly:

[Figure: τε versus message approximation error (log scale) for the Markov Logic Networks and the Protein Networks.]

SLIDE 27

Running Time Lower Bound

Theorem: Using p processors it is not possible to obtain a τε-approximation in time less than the sum of a parallel component and a sequential component.

SLIDE 28

Proof: Running Time Lower Bound

Consider one direction along the chain, using p/2 processors (p ≥ 2):

  • We must make n - τε vertices τε left-aware.
  • A single processor can only make k - τε + 1 vertices left-aware in k iterations.

[Figure: a chain of vertices 1 … n divided into blocks of width τε.]
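Combining the two counting facts gives the shape of the bound; the derivation below is a reconstruction from this proof sketch (the exact constants in the original slide's formula may differ):

    \frac{p}{2}\,(k - \tau_\epsilon + 1) \;\ge\; n - \tau_\epsilon
    \quad\Longrightarrow\quad
    k \;\ge\; \underbrace{\frac{2(n - \tau_\epsilon)}{p}}_{\text{parallel component}}
      \;+\; \underbrace{\tau_\epsilon - 1}_{\text{sequential component}}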

SLIDE 29

Optimal Parallel Scheduling

[Figure: the chain is cut into p contiguous blocks, one per processor (Processor 1, Processor 2, Processor 3, ...); each processor runs the forward-backward sweep on its own block.]

Theorem: Using p processors this algorithm achieves a τε-approximation in the running time derived in the proof on the next two slides.

SLIDE 30

Proof: Optimal Parallel Scheduling

  • After the first iteration, all vertices are left-aware of the left-most vertex on their processor.
  • After exchanging messages and running the next iteration, left-awareness extends across one more block.
  • After k parallel iterations, each vertex is (k - 1)(n/p) left-aware.

SLIDE 31

Proof: Optimal Parallel Scheduling

  • After k parallel iterations each vertex is (k - 1)(n/p) left-aware.
  • All vertices must be made τε left-aware.
  • Each iteration takes O(n/p) time.
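The formulas on this slide are images; filling in the algebra as a reconstruction from the three statements above:

    (k - 1)\,\frac{n}{p} \;\ge\; \tau_\epsilon
    \;\Longrightarrow\;
    k \;\ge\; \frac{\tau_\epsilon\, p}{n} + 1,
    \qquad
    \text{total time} \;=\; k \cdot O\!\left(\frac{n}{p}\right)
    \;=\; O\!\left(\frac{n}{p} + \tau_\epsilon\right),

which matches the lower bound from the previous slides up to constant factors.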

SLIDE 32

Comparing with Synchronous BP

[Figure: the synchronous schedule vs. the optimal schedule across Processors 1-3, and the gap between them.]

SLIDE 33

Outline

  • Overview
  • Graphical Models: Statistical Structure
  • Inference: Computational Structure
  • τε-Approximate Messages: Statistical Structure
  • Parallel Splash
      • Dynamic Scheduling
      • Partitioning
  • Experimental Results
  • Conclusions

SLIDE 34

The Splash Operation

Generalize the optimal chain algorithm to arbitrary cyclic graphs:

1) Grow a BFS spanning tree with fixed size
2) Forward pass, computing all messages at each vertex
3) Backward pass, computing all messages at each vertex
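A minimal sketch of those three steps, assuming an adjacency structure graph and a helper compute_message(u, v) that sends the BP message from u to v (both names are illustrative); the real implementation works on a factor graph and bounds the tree by accumulated work rather than vertex count:

    from collections import deque

    def splash(graph, root, max_size, compute_message):
        """One Splash: bounded BFS tree growth, then two sweeps over the tree."""
        # 1) Grow a breadth-first spanning tree of bounded size around the root.
        tree_order, visited = [root], {root}
        frontier = deque([root])
        while frontier and len(tree_order) < max_size:
            u = frontier.popleft()
            for v in graph[u]:
                if v not in visited and len(tree_order) < max_size:
                    visited.add(v)
                    tree_order.append(v)   # BFS order: root first, leaves last
                    frontier.append(v)
        # 2) Forward pass (here taken leaves-to-root): send all messages
        #    at each vertex, in reverse BFS order.
        for u in reversed(tree_order):
            for v in graph[u]:
                compute_message(u, v)
        # 3) Backward pass (root back out to the leaves).
        for u in tree_order:
            for v in graph[u]:
                compute_message(u, v)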

SLIDE 35

Running Parallel Splashes

[Figure: three CPUs, each with its own local state, running a Splash on its part of the graph.]

  • Partition the graph
  • Schedule Splashes locally
  • Transmit the messages along the boundary of the partition

Key Challenges:
1) How do we schedule Splashes?
2) How do we partition the graph?

SLIDE 36

Where do we Splash?

Assign priorities and use a scheduling queue to select roots.

[Figure: CPU 1 pops roots off its local scheduling queue and runs Splashes in its local state.]

How do we assign priorities?

SLIDE 37

Message Scheduling

Residual Belief Propagation [Elidan et al., UAI 06]:

Assign priorities based on change in inbound messages

[Figure: two example vertices. For the first, all inbound messages changed only slightly; for the second, the inbound messages changed substantially.]

  • Small change: expensive no-op
  • Large change: informative update

SLIDE 38

Problem with Message Scheduling

Small changes in messages do not imply small changes in belief:

[Figure: a small change in every incoming message can still produce a large change in the belief.]

SLIDE 39

Problem with Message Scheduling

Large changes in a single message do not imply large changes in belief:

[Figure: a large change in a single incoming message can still produce only a small change in the belief.]

SLIDE 40

Belief Residual Scheduling

Assign priorities based on the cumulative change in belief:

    r_v  =  Σ (change in the belief of v induced by each new inbound message since v was last updated)

A vertex whose belief has changed substantially since last being updated will likely produce informative new messages.
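A sketch of the bookkeeping this implies, with illustrative helper names (residual dict, a priority queue with push(v, priority=...)): every time a new message arrives at v, the resulting change in v's belief is added to r_v, and the residual is cleared once v is updated as a Splash root.

    def on_message_update(v, old_belief, new_belief, residual, scheduling_queue):
        """Accumulate the belief change at v into its residual r_v."""
        change = sum(abs(a - b) for a, b in zip(new_belief, old_belief))  # L1 change
        residual[v] = residual.get(v, 0.0) + change
        scheduling_queue.push(v, priority=residual[v])   # promote v in the queue

    def on_splash_root(v, residual):
        """Once v has been updated as a Splash root, its accumulated change is spent."""
        residual[v] = 0.0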

SLIDE 41

Message vs. Belief Scheduling

Belief Scheduling improves accuracy and convergence

[Figures: (left) percentage of runs converged within 4 hours for belief residuals vs. message residuals; (right) L1 error in beliefs over time (seconds) for message scheduling vs. belief scheduling.]

SLIDE 42

Splash Pruning

Belief residuals can be used to dynamically reshape and resize Splashes: vertices with low belief residual are excluded from the Splash.
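In code, pruning is just a filter applied while the BFS tree is grown; a sketch, where min_residual is an illustrative threshold:

    def should_include(v, residual, min_residual=1e-3):
        """Skip vertices whose beliefs are already (nearly) converged."""
        return residual.get(v, 0.0) > min_residual

Inside the splash sketch above, the BFS loop would only expand neighbors for which should_include(v, residual) is true, so the tree reshapes itself around the region that is still changing.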

SLIDE 43

Splash Size

Using Splash Pruning, our algorithm is able to dynamically select the optimal splash size.

[Figure: running time (seconds) vs. splash size (messages), with and without pruning.]

SLIDE 44

Example

[Figure: synthetic noisy image, its factor graph, and the per-vertex update counts (many updates vs. few updates).]

The algorithm identifies and focuses on the hidden sequential structure.

SLIDE 45

Parallel Splash Algorithm

  • Partition the factor graph over processors
  • Schedule Splashes locally using belief residuals
  • Transmit messages on the boundary

[Figure: three CPUs connected by a fast reliable network, each holding local state and a scheduling queue and running its own Splashes.]

Theorem: Given a uniform partitioning of the chain graphical model, Parallel Splash runs in time matching the optimal schedule, retaining optimality.

SLIDE 46

Partitioning Objective

The partitioning of the factor graph determines:

  • Storage, Computation, and Communication

Goal: Balance computation and minimize communication.

[Figure: a factor graph cut between CPU 1 and CPU 2; the objective minimizes communication cost subject to an ensure-balance constraint.]

SLIDE 47

The Partitioning Problem

  • Objective: minimize communication while ensuring balance
  • Depends on: the work assigned to each partition and the communication across the cut
  • NP-Hard → use the METIS fast partitioning heuristic

But the update counts, which determine the work, are not known!
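The two quantities being traded off can be written down directly; a sketch of how they would be measured for a candidate partition (the per-vertex work values are exactly the numbers that are unknown ahead of time; all names are illustrative):

    def partition_cost(vertices, edges, part, work, comm):
        """Evaluate a partitioning (sketch).

        part[v]      : processor assigned to vertex v
        work[v]      : update count times per-update cost at v (unknown in advance)
        comm[(u, v)] : cost of sending the (u, v) messages across machines
        """
        per_proc = {}
        for v in vertices:
            per_proc[part[v]] = per_proc.get(part[v], 0.0) + work[v]
        # Imbalance: heaviest processor relative to the average load.
        balance = max(per_proc.values()) / (sum(per_proc.values()) / len(per_proc))
        # Communication: total cost of edges cut by the partition.
        cut_cost = sum(comm[(u, v)] for (u, v) in edges if part[u] != part[v])
        return balance, cut_cost   # want balance near 1 and cut_cost small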

SLIDE 48

Unknown Update Counts

  • Determined by belief scheduling
  • Depends on: graph structure, factors, …
  • Little correlation between past & future update counts

[Figure: a noisy image and its per-vertex update counts.]

Simple Solution: Uninformed Cut

SLIDE 49

Uninformed Cuts

Uninformed cuts give greater work imbalance but lower communication cost.

[Figures: work imbalance and communication cost of the uninformed cut vs. the optimal cut on the Denoise and UW-Systems problems; an illustration shows the uninformed cut giving one partition too much work and the other too little.]

SLIDE 50

Over-Partitioning

Over-cut the graph into k·p partitions and randomly assign the pieces to CPUs:

  • Increases balance
  • Increases communication cost (more boundary)

[Figure: the partition without over-partitioning (one piece per CPU) vs. with over-partitioning at k = 6, where each CPU owns several scattered pieces.]
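Over-partitioning itself is a small amount of code once a (k·p)-way cut is available; a sketch using a hypothetical cut_graph partitioner in place of METIS:

    import random

    def over_partition(graph, p, k, cut_graph):
        """Cut into k*p pieces, then deal the pieces to the p processors at random."""
        piece_of = cut_graph(graph, nparts=k * p)        # e.g. METIS; piece_of[v] in [0, k*p)
        owner_of_piece = [random.randrange(p) for _ in range(k * p)]
        return {v: owner_of_piece[piece_of[v]] for v in graph}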

SLIDE 51

Over-Partitioning Results

Provides a simple method to trade between work balance and communication cost

[Figures: communication cost and work imbalance as a function of the partition factor k; communication cost grows with k while work imbalance shrinks.]

SLIDE 52

CPU Utilization

Over-partitioning improves CPU utilization:

[Figures: number of active CPUs over time (seconds) on the UW-Systems MLN and on Denoise, with no over-partitioning vs. 10x over-partitioning.]

SLIDE 53

Parallel Splash Algorithm

  • Over-partition the factor graph
  • Randomly assign pieces to processors
  • Schedule Splashes locally using belief residuals
  • Transmit messages on the boundary

[Figure: three CPUs connected by a fast reliable network, each holding local state and a scheduling queue and running Splashes on its assigned pieces.]
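Putting the pieces together, each processor's main loop looks roughly like the sketch below, reusing the splash sketch from the earlier slide; the net object, queue API, and convergence test are stand-ins for the MPI machinery in the real C++/MPICH2 implementation:

    def worker_loop(local_graph, messages, queue, residual, net,
                    compute_message, converged, splash_size=50):
        """Main loop run independently by every processor (sketch).

        net.recv_boundary() yields (u, v, msg) triples from other machines;
        net.send_boundary() flushes locally recomputed cross-partition messages.
        """
        while not converged():
            # 1) Fold in messages that arrived from neighboring partitions.
            for (u, v, msg) in net.recv_boundary():
                messages[(u, v)] = msg
                queue.push(v, priority=residual.get(v, 0.0))
            # 2) Pop the root with the highest belief residual and Splash around it.
            root = queue.pop()
            splash(local_graph, root, splash_size, compute_message)
            residual[root] = 0.0
            # 3) Push the messages that now cross the partition boundary.
            net.send_boundary()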

SLIDE 54

Outline

  • Overview
  • Graphical Models: Statistical Structure
  • Inference: Computational Structure
  • τε-Approximate Messages: Statistical Structure
  • Parallel Splash
      • Dynamic Scheduling
      • Partitioning
  • Experimental Results
  • Conclusions

SLIDE 55

Experiments

  • Implemented in C++ using MPICH2 as the message-passing API
  • Ran on an Intel OpenCirrus cluster: 120 processors
      • 15 nodes with 2 x quad-core Intel Xeon processors
      • Gigabit Ethernet switch
  • Tested on Markov Logic Networks obtained from Alchemy [Domingos et al., SSPR 08]
  • Present results on the largest (UW-Systems) and smallest (UW-Languages) MLNs

SLIDE 56

Parallel Performance (Large Graph)

[Figure: speedup vs. number of CPUs (up to 120) on UW-Systems, with no over-partitioning and with 5x over-partitioning, compared against the linear-speedup line.]

UW-Systems:
  • 8K variables, 406K factors
  • Single-processor running time: 1 hour
  • Linear to super-linear speedup up to 120 CPUs (cache efficiency)

SLIDE 57

Parallel Performance (Small Graph)

UW-Languages:
  • 1K variables, 27K factors
  • Single-processor running time: 1.5 minutes
  • Linear to super-linear speedup up to 30 CPUs
  • Network costs quickly dominate the short running time

[Figure: speedup vs. number of CPUs on UW-Languages, with no over-partitioning and with 5x over-partitioning, compared against the linear-speedup line.]

SLIDE 58

Outline

  • Overview
  • Graphical Models: Statistical Structure
  • Inference: Computational Structure
  • τε-Approximate Messages: Statistical Structure
  • Parallel Splash
      • Dynamic Scheduling
      • Partitioning
  • Experimental Results
  • Conclusions

SLIDE 59

Summary

  • Algorithmic Efficiency: Splash structure + belief residual scheduling
  • Parallel Efficiency: independent parallel Splashes
  • Implementation Efficiency: distributed queues, asynchronous communication, over-partitioning

Experimental results on large factor graphs:

  • Linear to super-linear speed-up using up to 120 processors

SLIDE 60

Conclusion

[Figure: the sophistication vs. parallelism chart again; Parallel Splash Belief Propagation sits in the goal region ("We are here").]

SLIDE 61

Questions


SLIDE 62

Protein Results


SLIDE 63

3D Video Task


SLIDE 64

Distributed Parallel Setting

Opportunities:

  • Access to larger systems: 8 CPUs → 1000 CPUs
  • Linear increase in: RAM, cache capacity, and memory bandwidth

Challenges:

  • Distributed state, communication, and load balancing

[Figure: cluster architecture; each node has a CPU, cache, bus, and memory, and the nodes are connected by a fast reliable network.]