SLIDE 1

10.05.2017 | TU Darmstadt | GCC / GSC CE | Daniel Thuerck, Maxim Naumov | 1

Daniel Thuerck (1,2) (advisors: Michael Goesele (1,2) and Marc Pfetsch (1)), Maxim Naumov (3)

FINDING PARALLELISM IN GENERAL-PURPOSE LINEAR PROGRAMMING

(1) Graduate School of Computational Engineering, TU Darmstadt
(2) Graphics, Capture and Massively Parallel Computing, TU Darmstadt
(3) NVIDIA Research

SLIDE 2

INTRODUCTION TO LINEAR PROGRAMMING

SLIDE 3

Linear Programs

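$$\min_x \; c^\top x \quad \text{s.t.} \quad Ax \le b, \quad x \ge 0$$

where

$$A = \begin{pmatrix} a_1^\top \\ \vdots \\ a_m^\top \end{pmatrix} \quad \text{and} \quad b = \begin{pmatrix} b_1 \\ \vdots \\ b_m \end{pmatrix}.$$

Linear objective function: $c^\top x$
Linear constraints: $Ax \le b$, $x \ge 0$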

SLIDE 4

Linear Programs: Applications

[3P Logistics]

SLIDE 5

INTERNALS OF AN LP SOLVER

Lower-Level Parallelism in LP

SLIDE 6

Solving LPs

“Standard” LP format:

$$\min_x \; c^\top x \quad \text{s.t.} \quad Ax = b, \quad x \ge 0$$

A is an $m \times n$ matrix, with $m \ll n$
A is sparse and has full row rank
Variables may be bounded: $l \le x \le u$
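An inequality-form LP (min c^T x s.t. Ax <= b, x >= 0) can be brought into this equality form by adding one nonnegative slack variable per constraint row. The following is a minimal sketch in plain Python on dense row lists; the function name `to_standard_form` is hypothetical and not part of culip-lp.

```python
def to_standard_form(A, b, c):
    """Rewrite  min c^T x  s.t.  A x <= b, x >= 0  as  [A I] (x, s) = b, (x, s) >= 0.

    A is a list of dense rows. One slack column is appended per row,
    and the slacks get zero cost in the objective.
    """
    m = len(A)
    A_std = [
        list(row) + [1.0 if i == j else 0.0 for j in range(m)]
        for i, row in enumerate(A)
    ]
    c_std = list(c) + [0.0] * m
    return A_std, list(b), c_std


# Two constraints in two variables become two equalities in four variables.
A_std, b_std, c_std = to_standard_form([[1, 2], [3, 1]], [4, 5], [-1, -1])
```

A production solver would append the identity block in a sparse format rather than densely; the dense form is used here only to keep the sketch short.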

SLIDE 7

Solving LPs

[Figure: two panels, Simplex and Interior Point, each labeled with the objective vector $c$]

SLIDE 8

Solving LPs

Simplex: “Basis” (active set) $A_B$

Interior Point (IPM), “Augmented (Newton) System”:

$$\begin{pmatrix} D & A^\top \\ A & 0 \end{pmatrix}$$

Interior Point (IPM), “Normal Equations”:

$$A D^{-1} A^\top$$

SLIDE 9

Solving LPs

IPM / Normal Equations: $A D^{-1} A^\top$

$m \times m$, SPD, might be dense
Squared condition number
Solution: Cholesky factorization or CG method

IPM / Aug. System: $\begin{pmatrix} D & A^\top \\ A & 0 \end{pmatrix}$

$(m + n) \times (m + n)$, sparse
Symmetric, indefinite
Solution: Indefinite $LDL^\top$ or MINRES method
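The two formulations are related by block elimination. The following is a sketch, writing the slide's blocks as $D$ and $A$ and assuming unknowns $\Delta x$, $\Delta y$ and right-hand sides $r_1$, $r_2$ (these names, and the sign convention for $D$, are assumptions; many texts use $-D$ in the top-left block):

```latex
\begin{aligned}
\begin{pmatrix} D & A^\top \\ A & 0 \end{pmatrix}
\begin{pmatrix} \Delta x \\ \Delta y \end{pmatrix}
  &= \begin{pmatrix} r_1 \\ r_2 \end{pmatrix} \\
\Delta x &= D^{-1}\left(r_1 - A^\top \Delta y\right) \\
A D^{-1} A^\top \, \Delta y &= A D^{-1} r_1 - r_2
\end{aligned}
```

Eliminating $\Delta x$ shrinks the system from $(m + n) \times (m + n)$ to $m \times m$, but the product $A D^{-1} A^\top$ can fill in heavily, which matches the "might be dense" remark.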

SLIDE 11

Introducing culip-lp…

An ongoing implementation of Mehrotra’s Primal-Dual interior point algorithm [1], featuring...

• (Iterative) Linear Algebra based on the “Augmented Matrix” approach,
• Full-rank guarantees,
• Comprehensive preprocessing & prescaling.

Towards solving large-scale LPs on the GPU as open source for everybody

SLIDE 12

IMPLEMENTING CULIP-LP

Progress report

SLIDE 13

Solver architecture

Preprocess → Scale → Standardize → IPM loop

SLIDE 14

Solver architecture

Input data:

Constraints $A_{eq} x = b_{eq}$
Constraints $A_{le} x \le b_{le}$
Objective vector $c$
Bounds (on some variables) $l, u$

SLIDE 15

Solver architecture

Storage format: CSR

Compressed sparse row format
Provides efficient row-based access by 3 arrays:

Example (matrix with nonzeros a, b, c, d, e):
col_ind: 0 1 1 2 0
val: a b c d e
row_ptr: 2 3 4 5
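To make the three CSR arrays concrete, here is a minimal sketch in plain Python that builds them from a dense matrix and uses them for a sparse matrix-vector product. The 4 x 3 matrix is an assumption chosen to reproduce the slide's `col_ind` and `val` arrays with a..e replaced by 1..5; `row_ptr` is written with the leading 0 that the usual CSR convention includes. The function names are hypothetical.

```python
def csr_from_dense(dense):
    """Build (val, col_ind, row_ptr) from a list of dense rows."""
    val, col_ind, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                val.append(v)
                col_ind.append(j)
        # row_ptr[i + 1] marks where row i ends in val / col_ind
        row_ptr.append(len(val))
    return val, col_ind, row_ptr


def csr_matvec(val, col_ind, row_ptr, x):
    """Compute y = A x by walking each row's slice of val / col_ind."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += val[k] * x[col_ind[k]]
        y.append(acc)
    return y


# Nonzeros a..e = 1..5 placed to match the slide's column indices.
dense = [
    [1, 2, 0],
    [0, 3, 0],
    [0, 0, 4],
    [5, 0, 0],
]
val, col_ind, row_ptr = csr_from_dense(dense)
y = csr_matvec(val, col_ind, row_ptr, [1.0, 1.0, 1.0])
```

Each row occupies a contiguous slice `val[row_ptr[i]:row_ptr[i + 1]]`, which is what makes row-wise access cheap and lets a GPU kernel assign one thread or warp per row.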

SLIDE 16

Solver architecture

Example: LP “pb-simp-nonunif” (see [2])

Input matrix: 1.4 million × 23k with 4.36 million nonzeros
Removed 1 singleton inequality
Removed 3629 low-forcing constraints
Removed 1 fixed variable
Removed 1.1 million (!) singleton inequalities
Result: approx. 3.6 million nonzeros removed

Execute in rounds

SLIDE 17

Solver architecture

Goal: Reduce element variance in matrices

Scaling [3] makes a difference
1. Geometric scaling (1x – 4x): $A_{i,\cdot} \leftarrow A_{i,\cdot} \,/\, \sqrt{\max|A_{i,\cdot}| \cdot \min|A_{i,\cdot}|}$
2. Equilibration (1x): $A_{i,\cdot} \leftarrow A_{i,\cdot} \,/\, \|A_{i,\cdot}\|_2$
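The two row scalings can be sketched in a few lines of plain Python. This is an illustration on dense rows, not the culip-lp implementation; it assumes that the minimum in the geometric scaling is taken over the nonzero entries of a row only, and the function names are hypothetical.

```python
import math


def geometric_row_scale(A):
    """Divide each row by sqrt(max|a_ij| * min|a_ij|) over its nonzeros."""
    scaled = []
    for row in A:
        mags = [abs(v) for v in row if v != 0]
        if not mags:
            scaled.append(list(row))  # empty row: nothing to scale
            continue
        factor = math.sqrt(max(mags) * min(mags))
        scaled.append([v / factor for v in row])
    return scaled


def equilibrate_rows(A):
    """Divide each row by its Euclidean norm so every row has unit length."""
    scaled = []
    for row in A:
        norm = math.sqrt(sum(v * v for v in row))
        scaled.append([v / norm for v in row] if norm > 0 else list(row))
    return scaled


# A row with entries 100 and 1 is rescaled to entries 10 and 0.1.
A = [[100.0, 1.0, 0.0], [0.0, 4.0, 0.0]]
A_scaled = equilibrate_rows(geometric_row_scale(A))
```

Geometric scaling centers the magnitudes of a row around 1 on a log scale, which is why it may be repeated several times before a single equilibration pass.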

SLIDE 18

Solver architecture

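Goal: Format LP in standard form

Shift variables: $l \le x \le u \;\rightarrow\; 0 \le x' \le u - l$ (with $x' = x - l$)
Split (free) variables: $x \;\rightarrow\; x = x^+ - x^-$, $x^+, x^- \ge 0$
Build std’ matrix (with slack variables $s \ge 0$):

$$\begin{pmatrix} A_{le} & I \\ A_{eq} & 0 \end{pmatrix} \begin{pmatrix} x \\ s \end{pmatrix} = \begin{pmatrix} b_{le} \\ b_{eq} \end{pmatrix}$$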
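The shift and split steps amount to simple column operations on the constraint data. The following is a minimal sketch on dense rows, assuming finite lower bounds for the shift; the function names are hypothetical and not taken from culip-lp.

```python
def shift_lower_bounds(A, b, l, u):
    """Substitute x = x' + l so that every variable has lower bound 0.

    The right-hand side becomes b - A l and the upper bounds become u - l.
    """
    b_shift = [
        bi - sum(aij * lj for aij, lj in zip(row, l))
        for row, bi in zip(A, b)
    ]
    u_shift = [uj - lj for uj, lj in zip(u, l)]
    return b_shift, u_shift


def split_free_variable(A, c, j):
    """Replace free x_j by x_j^+ - x_j^-.

    Column j now stands for x_j^+; a negated copy is appended for x_j^-.
    """
    A_split = [list(row) + [-row[j]] for row in A]
    c_split = list(c) + [-c[j]]
    return A_split, c_split


A = [[1, 2], [0, 1]]
b_shift, u_shift = shift_lower_bounds(A, b=[10, 3], l=[1, 2], u=[4, 5])
A_split, c_split = split_free_variable(A, c=[3, -1], j=1)
```

The shift also adds the constant $c^\top l$ to the objective value, which has to be restored when the solution is mapped back to the original variables.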

SLIDE 19

Solver architecture

Ensure A has full rank (symbolically)

$PAQ =$ [Figure: permuted matrix with blocks labeled $m_u$ and $m_c$]

Data: $Ax = b$, $c$, $u$

SLIDE 20

Solver architecture

Goal: Solve KKT conditions by Newton steps
Steps:
Augmented matrix assembly
Solving the (indefinite) augmented matrix
Solve twice: predictor and corrector
Stepsize along $v = v_p + v_c$
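Once the predictor and corrector directions are combined, the step size has to keep the iterate strictly positive. The following is a minimal sketch of such a ratio test in plain Python; the damping factor 0.99 and the function names are assumptions, not values taken from culip-lp.

```python
def combine(v_p, v_c):
    """Combined search direction v = v_p + v_c."""
    return [p + c for p, c in zip(v_p, v_c)]


def max_step(x, v, damping=0.99):
    """Largest alpha in (0, 1] such that x + alpha * v stays positive.

    Only components that decrease (v_i < 0) limit the step. The damping
    factor keeps the new iterate strictly inside the positive orthant.
    """
    alpha = 1.0
    for xi, vi in zip(x, v):
        if vi < 0:
            alpha = min(alpha, damping * xi / -vi)
    return alpha


x = [1.0, 2.0]
v = combine([-1.0, 0.5], [-1.0, 0.5])
alpha = max_step(x, v)
x_new = [xi + alpha * vi for xi, vi in zip(x, v)]
```

In a primal-dual method the same test is typically applied separately to the primal and dual variables, so the two step sizes may differ.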

SLIDE 21

Solving the augmented system

$$\begin{pmatrix} D & A^\top \\ A & 0 \end{pmatrix}$$

Iterative strategy:

Symmetric, indefinite: use MINRES [4] (in parts)
Equilibrate system implicitly
Preconditioner: Experiments ongoing

Direct strategy:

Symmetric, indefinite: use SPRAL SSIDS [5]
Reordering by METIS [6]
Scaling for large pivots

SLIDE 22

PERFORMANCE EVALUATION

Intermediate findings

SLIDE 23

Benchmark problems

Problem name [7]    M            N            NNZ
ex9                 40,962       10,404       517,112
ex10                696,608      17,680       1,162,000
neos-631710         169,576      167,056      834,166
bley_xl1            175,620      5,831        869,391
map06               328,818      164,547      549,920
map10               328,818      164,547      549,920
nb10tb              150,495      73,340       1,172,289
neos-142912         58,726       416,040      1,855,220
in                  1,526,202    1,449,074    6,811,639

SLIDE 24

Performance

Problem name [7]    NNZ          CLP barr [sec]    culip-lp [sec]
ex9                 517,112      X (NC)            81
ex10                1,162,000    X (NS)            141
neos-631710         834,166      172               478
bley_xl1            869,391      X (NS)            1,492
map06               549,920      X (NC)            466
map10               549,920      X (NC)            615
nb10tb              1,172,289    X (NC)            2,461
neos-142912         1,855,220    356               447
in                  6,811,639    X (NS)            NC

X – failed, NS – did not start 1st iteration, NC – did not converge within 1 hour

SLIDE 25

Runtime breakdown

[Plot: time [sec] per IPM step. Example: map10 [7]. Series: Corrector, Predictor. Panels: MINRES, SPRAL.]

SLIDE 26

Iterative vs. direct methods

[Plot: MINRES iterations per IPM step. Example: map10 [7]. Series: Corrector, Predictor.]

[Plot: MINRES relative residual per IPM step, on a logarithmic axis from 1.E-07 to 1.E+00.]

SLIDE 27

Numerical difficulty

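Condition of the matrix

$$\begin{pmatrix} D & A^\top \\ A & 0 \end{pmatrix}$$

depends mainly on $D = \mathrm{diag}(x) \cdot \mathrm{diag}(s)$, with strong duality towards the end often yielding

$$\frac{\max_i (x_i s_i)}{\min_i (x_i s_i)} \approx 10^{10},$$

where $x^\top = [x_1, \dots, x_n]$ is the solution and $s^\top = [s_1, \dots, s_n]$ are the slack variables.

Remedies

2x2 pivoting in factorizations (e.g. $LDL^\top$ in SPRAL)
Preconditioning for MINRES or GMRES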
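The spread of the products x_i s_i is cheap to monitor during the IPM loop. The following is a minimal sketch in plain Python; the function name is hypothetical and the sample values are made up for illustration only.

```python
def complementarity_spread(x, s):
    """Ratio of the largest to the smallest product x_i * s_i.

    Assumes all x_i and s_i are strictly positive, as they are for
    interior iterates.
    """
    products = [xi * si for xi, si in zip(x, s)]
    return max(products) / min(products)


# Made-up interior iterate: products are 6, 1 and 2, so the spread is 6.
spread = complementarity_spread([2.0, 1.0, 4.0], [3.0, 1.0, 0.5])
```

A rapidly growing spread is one signal that the augmented system is becoming hard for an unpreconditioned iterative solver.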

SLIDE 28

Findings on the solver’s performance

We know that individual components attain speedup on the GPU

Linear algebra components, such as preconditioned MINRES, GMRES, etc.
Graph reorderings and matchings

LP problems:

Medium-sized problems: faster than open-source IPM code CLP
We can solve many large problems where CLP fails
However, more time is needed to integrate all components together on the GPU

SLIDE 29

What’s keeping you from optimizing your runtime?

LP Solver (a.k.a. “the black box”) =

Equilibration, Matrix factorization, Bipartite matching, SpMV, Matrix reordering, LP scaling, Krylov solver, Max-flow, Component BFS, Graph partitioning, Preconditioner, MPS I/O, Preprocessing, Rank estimation

SLIDE 30

Future Work

Numerics

Preconditioning techniques
Investigate influence of 2x2 pivots in $LDL^\top$ factorization

Tuning

Benchmark & optimize warp-based kernels in preprocessing
Never explicitly form the standardized matrix

SLIDE 31

FEASIBILITY STUDY: LP DECOMPOSITIONS

Higher-Level Parallelism in LP

SLIDE 32

Solving an LP: The usual setup

Large LP → LP Solver (a.k.a. “the black box”): Preprocess → Standardize → Solve → Solution

SLIDE 33

Higher-level parallelism by LP decomposition

Large LP → Apply LP decomposition → Master LP, Slave LP 1, …, Slave LP k → Assemble master/slave solutions → Solution

SLIDE 34

LP-decompositions: feasibility

Decomposition works on structure of the constraint matrix $A$:
Benders [9]
Dantzig-Wolfe [10]
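Both decompositions rely on a block structure in the constraint matrix: Benders treats variables (columns) that touch several blocks as linking variables, while Dantzig-Wolfe treats constraints (rows) that couple blocks as linking constraints. The following is a minimal sketch of the column case, assuming a row-to-block assignment is already given (for example by a partitioner); the function name and the input layout are hypothetical.

```python
def linking_columns(A, row_block):
    """Return the columns whose nonzeros appear in more than one row block.

    A is a list of dense rows and row_block[i] is the block id of row i.
    In a Benders-style split these columns would go to the master problem,
    while all other columns belong to exactly one subproblem.
    """
    blocks_of_col = {}
    for row, blk in zip(A, row_block):
        for j, v in enumerate(row):
            if v != 0:
                blocks_of_col.setdefault(j, set()).add(blk)
    return sorted(j for j, blks in blocks_of_col.items() if len(blks) > 1)


# Column 2 appears in block 0 (row 0) and block 1 (row 1), so it links them.
A = [
    [1, 0, 1],
    [0, 2, 1],
    [0, 3, 0],
]
links = linking_columns(A, row_block=[0, 1, 1])
```

The fewer linking columns a partition produces, the smaller the master problem and the more work can run in parallel in the subproblems.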

SLIDE 35

LP-decompositions: prototype

Implemented a Benders’ decomposition using hypergraph partitioning:

SLIDE 36

Acknowledgements

The work of Daniel Thuerck is supported by the 'Excellence Initiative' of the German Federal and State Governments and the Graduate School of Computational Engineering at Technische Universität Darmstadt.

SLIDE 37

References

[1] Mehrotra, Sanjay. "On the implementation of a primal-dual interior point method." SIAM Journal on Optimization 2.4 (1992): 575-601.
[2] Gondzio, Jacek. "Presolve analysis of linear programs prior to applying an interior point method." INFORMS Journal on Computing 9.1 (1997): 73-91.
[3] Gondzio, Jacek. "Presolve analysis of linear programs prior to applying an interior point method." INFORMS Journal on Computing 9.1 (1997): 73-91.
[4] Paige, Christopher C., and Michael A. Saunders. "Solution of sparse indefinite systems of linear equations." SIAM Journal on Numerical Analysis 12.4 (1975): 617-629.
[5] Paige, Christopher C., and Michael A. Saunders. "Solution of sparse indefinite systems of linear equations." SIAM Journal on Numerical Analysis 12.4 (1975): 617-629.

SLIDE 38

References

[6] Karypis, George, and Vipin Kumar. "A fast and high quality multilevel scheme for partitioning irregular graphs." SIAM Journal on Scientific Computing 20.1 (1998): 359-392.
[7] Koch, Thorsten, et al. "MIPLIB 2010." Mathematical Programming Computation 3.2 (2011): 103-163.
[8] Forrest, John, David de la Nuez, and Robin Lougee-Heimer. "CLP user guide." IBM Research (2004).
[9] Benders, Jacques F. "Partitioning procedures for solving mixed-variables programming problems." Numerische Mathematik 4.1 (1962): 238-252.
[10] Dantzig, George B., and Philip Wolfe. "Decomposition principle for linear programs." Operations Research 8.1 (1960): 101-111.