FINDING PARALLELISM IN GENERAL-PURPOSE LINEAR PROGRAMMING Daniel - - PowerPoint PPT Presentation

β–Ά
finding parallelism in general purpose linear programming
SMART_READER_LITE
LIVE PREVIEW

FINDING PARALLELISM IN GENERAL-PURPOSE LINEAR PROGRAMMING Daniel - - PowerPoint PPT Presentation

FINDING PARALLELISM IN GENERAL-PURPOSE LINEAR PROGRAMMING Daniel Thuerck 1,2 (advisors Michael Goesele 1,2 and Marc Pfetsch 1 ) Maxim Naumov 3 1 Graduate School of Computational Engineering, TU Darmstadt 2 Graphics, Capture and Massively Parallel


slide-1
SLIDE 1

10.05.2017 | TU Darmstadt | GCC / GSC CE | Daniel Thuerck, Maxim Naumov | 1

Daniel Thuerck1,2 (advisors Michael Goesele1,2 and Marc Pfetsch1) Maxim Naumov3

FINDING PARALLELISM IN GENERAL-PURPOSE LINEAR PROGRAMMING

1 Graduate School of Computational Engineering, TU Darmstadt 2 Graphics, Capture and Massively Parallel Computing, TU Darmstadt 3 NVIDIA Research

slide-2
SLIDE 2

10.05.2017 | TU Darmstadt | GCC / GSC CE | Daniel Thuerck, Maxim Naumov | 2

INTRODUCTION TO LINEAR PROGRAMMING

slide-3
SLIDE 3

10.05.2017 | TU Darmstadt | GCC / GSC CE | Daniel Thuerck, Maxim Naumov | 3

Linear Programs

min π‘‘βŠ€π‘¦ s.t. 𝐡𝑦 ≀ 𝑐 𝑦 β‰₯ 0

𝑑

𝐡 = 𝑏1

π‘ˆ

𝑏𝑛

π‘ˆ

𝑐 = 𝑐1 𝑐𝑛 where and Linear objective function Linear constraints

slide-4
SLIDE 4

10.05.2017 | TU Darmstadt | GCC / GSC CE | Daniel Thuerck, Maxim Naumov | 4

Linear Programs: Applications

[3P Logistics]

slide-5
SLIDE 5

10.05.2017 | TU Darmstadt | GCC / GSC CE | Daniel Thuerck, Maxim Naumov | 5

INTERNALS OF AN LP SOLVER

Lower-Level Parallelism in LP

slide-6
SLIDE 6

10.05.2017 | TU Darmstadt | GCC / GSC CE | Daniel Thuerck, Maxim Naumov | 6

Solving LPs

  • A is 𝑛 Γ— π‘œ matrix, with 𝑛 β‰ͺ π‘œ
  • A is sparse and has full row-rank
  • Variables may be bounded: π‘š ≀

𝑦 ≀ 𝑣

min π‘‘βŠ€π‘¦ s.t. 𝐡𝑦 = 𝑐 . 𝑦 β‰₯ 0

β€œStandard” LP format

slide-7
SLIDE 7

10.05.2017 | TU Darmstadt | GCC / GSC CE | Daniel Thuerck, Maxim Naumov | 7

Solving LPs

𝑑 𝑑

Simplex Interior Point

slide-8
SLIDE 8

10.05.2017 | TU Darmstadt | GCC / GSC CE | Daniel Thuerck, Maxim Naumov | 8

Solving LPs Simplex Interior Point (IPM)

β€œBasis” (active set)

𝐡𝐢 =

𝐸 𝐡⊀ 𝐡

π΅πΈβˆ’1𝐡⊀

β€œAugmented (Newton) System” β€œNormal Equations”

slide-9
SLIDE 9

10.05.2017 | TU Darmstadt | GCC / GSC CE | Daniel Thuerck, Maxim Naumov | 9

Solving LPs IPM / Normal Equations

π΅πΈβˆ’1𝐡⊀

  • 𝑛 Γ— 𝑛, SPD, mi

migh ght be dense se

  • Squared condition number
  • Solution: Cholesky-factorization
  • r CG method

IPM / Aug. System

  • (𝑛 + π‘œ) Γ— (𝑛 + π‘œ), sparse
  • Symmetric, indefinite
  • Solution: Indefinite LDLT or

MINRES method

𝐸 𝐡⊀ 𝐡

slide-10
SLIDE 10

10.05.2017 | TU Darmstadt | GCC / GSC CE | Daniel Thuerck, Maxim Naumov | 10

Solving LPs IPM / Normal Equations

π΅πΈβˆ’1𝐡⊀

  • 𝑛 Γ— 𝑛, SPD, mi

migh ght be dense se

  • Squared condition number
  • Solution: Cholesky-factorization
  • r CG method

IPM / Aug. System

  • (𝑛 + π‘œ) Γ— (𝑛 + π‘œ), sparse
  • Symmetric, indefinite
  • Solution: Indefinite LDLT or

MINRES method

𝐸 𝐡⊀ 𝐡

slide-11
SLIDE 11

10.05.2017 | TU Darmstadt | GCC / GSC CE | Daniel Thuerck, Maxim Naumov | 11

Introducing culip-lp…

An ongoing implementation of Mehrotra’s Primal-Dual interior point algorithm [1], featuring...

οƒΌ (Iterati rative ve) Linear Algebra based on the β€œAug ugment ented ed Matrix rix” approach, οƒΌ Ful ull-ran rank guarantees, οƒΌ Comprehensive pre repro proce cessi ssing & pre resc scaling aling.

Towards solving large-scale LPs on the GPU as open source ce for everybody

slide-12
SLIDE 12

10.05.2017 | TU Darmstadt | GCC / GSC CE | Daniel Thuerck, Maxim Naumov | 12

IMPLEMENTING CULIP-LP

Progress report

slide-13
SLIDE 13

10.05.2017 | TU Darmstadt | GCC / GSC CE | Daniel Thuerck, Maxim Naumov | 13

Solver architecture

Preprocess Scale Standardize IPM loop

slide-14
SLIDE 14

10.05.2017 | TU Darmstadt | GCC / GSC CE | Daniel Thuerck, Maxim Naumov | 14

Solver architecture

In Input t data:

  • Constraints

π΅π‘“π‘Ÿπ‘¦ = π‘π‘“π‘Ÿ

  • Constraints

π΅π‘šπ‘“π‘¦ ≀ π‘π‘šπ‘“

  • Objective vector

𝑑

  • Bounds (on some variables)

π‘š, 𝑣 Preprocess Scale Standardize IPM loop

slide-15
SLIDE 15

10.05.2017 | TU Darmstadt | GCC / GSC CE | Daniel Thuerck, Maxim Naumov | 15

Solver architecture

Storage ge forma mat: t: CSR

  • Compressed sparse row format
  • Provides efficient row-based access by 3 arr

rrays ays:

𝑏 𝑐 𝑑 𝑒 𝑓 0 1 1 2 0 col_Ind a b c d e val 2 3 4 5 row_ptr

π΅π‘“π‘Ÿπ‘¦ = π‘π‘“π‘Ÿ π΅π‘šπ‘“π‘¦ ≀ π‘π‘šπ‘“ 𝑑 π‘š, 𝑣 Preprocess Scale Standardize IPM loop

slide-16
SLIDE 16

10.05.2017 | TU Darmstadt | GCC / GSC CE | Daniel Thuerck, Maxim Naumov | 16

Solver architecture

  • Ex

Examp mple: e: LP β€œpb-simp-nonunif” (see [2])

  • Input matrix: 1,4 Mio x 23k with 4,36 Mio nonzeros
  • Removed 1 singleton inequality
  • Removed 3629 low-forcing constraints
  • Removed 1 fixed variable
  • Removed 1,1 Mio (!) singleton inequalities
  • Result: approx. 3,6

6 Mio nonzeros removed

π΅π‘“π‘Ÿπ‘¦ = π‘π‘“π‘Ÿ π΅π‘šπ‘“π‘¦ ≀ π‘π‘šπ‘“ 𝑑 π‘š, 𝑣 Execute in rounds Preprocess Scale Standardize IPM loop

slide-17
SLIDE 17

10.05.2017 | TU Darmstadt | GCC / GSC CE | Daniel Thuerck, Maxim Naumov | 17

Solver architecture

π΅π‘“π‘Ÿπ‘¦ = π‘π‘“π‘Ÿ π΅π‘šπ‘“π‘¦ ≀ π‘π‘šπ‘“ 𝑑 π‘š, 𝑣

Goal: : Reduce element variance in matrices

  • Scaling [3] makes a difference
  • 1. Geometric scaling (1x – 4x)
  • 2. Equilibration (1x)

𝐡𝑗,β‹… = 𝐡𝑗,β‹… max |𝐡𝑗,β‹…| min(|𝐡𝑗,β‹…|) 𝐡𝑗,β‹… = 𝐡𝑗,β‹… 𝐡𝑗,β‹… 2

Preprocess Scale Standardize IPM loop

slide-18
SLIDE 18

10.05.2017 | TU Darmstadt | GCC / GSC CE | Daniel Thuerck, Maxim Naumov | 18

Solver architecture

π΅π‘“π‘Ÿπ‘¦ = π‘π‘“π‘Ÿ π΅π‘šπ‘“π‘¦ ≀ π‘π‘šπ‘“ 𝑑 π‘š, 𝑣

Goal: : Forma mat LP in in standar ard form

  • Shift variables:
  • Split (free) variables
  • Build std’ matrix:

π΅π‘šπ‘“ 𝐽 π΅π‘“π‘Ÿ = 𝑐𝑀𝑓 π‘π‘“π‘Ÿ 𝑦 β†’ 𝑦 = 𝑦+ βˆ’ π‘¦βˆ’ 𝑦+, π‘¦βˆ’ β‰₯ 0 l ≀ 𝑦 ≀ 𝑣 β†’ 0 ≀ 𝑦′ ≀ 𝑣 + π‘š Preprocess Scale Standardize IPM loop

slide-19
SLIDE 19

10.05.2017 | TU Darmstadt | GCC / GSC CE | Daniel Thuerck, Maxim Naumov | 19

Solver architecture

En Ensure re A has f full rank (sym ymbolica ically ly)

𝑄𝐡𝑅 =

𝑛𝑣 𝑛𝑑 𝐡𝑦 = 𝑐 𝑑 𝑣 Preprocess Scale Standardize IPM loop

slide-20
SLIDE 20

10.05.2017 | TU Darmstadt | GCC / GSC CE | Daniel Thuerck, Maxim Naumov | 20

Solver architecture

𝐡𝑦 = 𝑐 𝑑 𝑣

  • Goal: Solve KKT conditions by Newton steps
  • Steps:
  • Augmented matrix assembly
  • Solv

lving ing the e (ind ndef efinit inite) ) augmented mented matrix ix

  • Solv

lve twice ce: predictor and corrector

  • Stepsize along 𝑀 = π‘€π‘ž + 𝑀𝑑

π‘€π‘ž 𝑀𝑑 Preprocess Scale Standardize IPM loop

slide-21
SLIDE 21

10.05.2017 | TU Darmstadt | GCC / GSC CE | Daniel Thuerck, Maxim Naumov | 21

Solving the augmented system

Iterat rative ive stra rategy: egy:

  • Symmetric, indefinite: use MINRES [4] (in parts)
  • Equilibrate system implicitly
  • Preconditioner: Experiments ongoing

𝐸 𝐡⊀ 𝐡

Dire rect ct strate rategy: gy:

  • Symmetric, indefinite: use SPRAL SSIDS [5]
  • Reordering by METIS [6]
  • Scaling for large pivots
slide-22
SLIDE 22

10.05.2017 | TU Darmstadt | GCC / GSC CE | Daniel Thuerck, Maxim Naumov | 22

PERFORMANCE EVALUATION

Intermediate findings

slide-23
SLIDE 23

10.05.2017 | TU Darmstadt | GCC / GSC CE | Daniel Thuerck, Maxim Naumov | 23

Benchmark problems

Problem name [7] M N NNZ ex9 40,962 10,404 517,112 ex10 696,608 17,680 1,162,000 neos-631710 169,576 167,056 834,166 bley_xl1 175,620 5831 869,391 map06 328,818 164,547 549,920 map10 328,818 164,547 549,920 nb10tb 150,495 73340 1,172,289 neos-142912 58,726 416,040 1,855,220 in 1,526,202 1,449,074 6,811,639

slide-24
SLIDE 24

10.05.2017 | TU Darmstadt | GCC / GSC CE | Daniel Thuerck, Maxim Naumov | 24

Performance

Problem name [7] NNZ CLP barr [sec] culip-lp [sec] ex9 517,112 X (NC) 81 ex10 1,162,000 X (NS) 141 neos-631710 834,166 172 478 bley_xl1 869,391 X (NS) 1,492 map06 549,920 X (NC) 466 map10 549,920 X (NC) 615 nb10tb 1,172,289 X (NC) 2,461 neos-142912 1,855,220 356 447 in 6,811,639 X (NS) NC

X – failed, NS – did not start 1st iteration, NC – did not converged within 1 hour

slide-25
SLIDE 25

10.05.2017 | TU Darmstadt | GCC / GSC CE | Daniel Thuerck, Maxim Naumov | 25

Runtime breakdown

1 2 3 4 5 6 7 8 9 10 1 11 21 31 41 51 61 71 81 91

time [sec] IPM step

Problem: map10 [7] Corrector Predictor Example: map10 [7] MINRES SPRAL

slide-26
SLIDE 26

10.05.2017 | TU Darmstadt | GCC / GSC CE | Daniel Thuerck, Maxim Naumov | 26

Iterative vs. direct methods

1000 2000 3000 4000 5000 6000 1 2 3 4 5 6 7 8 9 10 11 12 13

Iterations IPM step

MINRES Iterations

Example: map10 [7] Corrector Predictor

1.E-07 1.E-06 1.E-05 1.E-04 1.E-03 1.E-02 1.E-01 1.E+00

1 2 3 4 5 6 7 8 9 10 11 12 13

Relative Residual

IPM step

MINRES relative residual

slide-27
SLIDE 27

10.05.2017 | TU Darmstadt | GCC / GSC CE | Daniel Thuerck, Maxim Naumov | 27

Numerical difficulty

Condition of matrix depends mainly on

𝐸 𝐡⊀ 𝐡

max(𝑦𝑗𝑑𝑗) min 𝑦𝑗𝑑𝑗 β‰ˆ 1010

where xT=[x1,…,xn] are solution and sT=[s1,…,sn] are slack variables

Remedies

  • 2x2 pivoting in factorizations (e.g. π‘€πΈπ‘€βŠ€

in SPRAL)

  • Preconditioning for MINRES or GMRES

𝐸 = 𝑒𝑗𝑏𝑕(𝑦) β‹… 𝑒𝑗𝑏𝑕(𝑑) with strong duality towards the end often yielding

slide-28
SLIDE 28

10.05.2017 | TU Darmstadt | GCC / GSC CE | Daniel Thuerck, Maxim Naumov | 28

Findings on the solver’s performance

We know that individual components attain speedup on the GPU

  • Linear algebra components, such as preconditioned MINRES, GMRES, etc.
  • Graph reorderings and matchings

LP problems:

  • Medium-sized problems: faster than open-source IPM code CLP
  • We can solve many large problems where CLP fails
  • However, more time is needed to integrate all components together on the GPU
slide-29
SLIDE 29

10.05.2017 | TU Darmstadt | GCC / GSC CE | Daniel Thuerck, Maxim Naumov | 29

What’s keeping you from optimizing your runtime?

LP Solver (a.k.a β€œthe black box”)

=

Equilibration Matrix factorization Bipartite matching SpMV Matrix reordering LP scaling Krylov-Solver Max-flow Component BFS Graph partitioning Preconditioner MPS I/O Preprocessing Rank estimation

slide-30
SLIDE 30

10.05.2017 | TU Darmstadt | GCC / GSC CE | Daniel Thuerck, Maxim Naumov | 30

Future Work

Nu Nume merics ics

  • Pr

Precond

  • ndition

tioning ing techni niqu ques es

  • Investigate influence of 2x2 pivots
  • ts in LDLT factorization

Tuning

  • Benchmark & optimize warp-based kernels in preprocessing
  • Never explicitly form the standardized matrix
slide-31
SLIDE 31

10.05.2017 | TU Darmstadt | GCC / GSC CE | Daniel Thuerck, Maxim Naumov | 31

FEASIBILITY STUDY: LP DECOMPOSITIONS

Higher-Level Parallelism in LP

slide-32
SLIDE 32

10.05.2017 | TU Darmstadt | GCC / GSC CE | Daniel Thuerck, Maxim Naumov | 32

Solving an LP: The usual setup

Large LP

Preprocess Standardize Solve

LP Solver (a.k.a β€œthe black box”)

Solution

slide-33
SLIDE 33

10.05.2017 | TU Darmstadt | GCC / GSC CE | Daniel Thuerck, Maxim Naumov | 33

Higher-level parallelism by LP decomposition

Large LP Solution Apply LP decomposition Assemble master/slave solutions Master LP Slave LP 1 Slave LP k

slide-34
SLIDE 34

10.05.2017 | TU Darmstadt | GCC / GSC CE | Daniel Thuerck, Maxim Naumov | 34

LP-decompositions: feasibility

Decomposition works on structure of the constraint matrix 𝐡: Benders [9] Dantzig-Wolfe [10]

slide-35
SLIDE 35

10.05.2017 | TU Darmstadt | GCC / GSC CE | Daniel Thuerck, Maxim Naumov | 35

LP-decompositions: prototype

Implemented a Benders’ decomposition using hypergraph partitioning:

slide-36
SLIDE 36

10.05.2017 | TU Darmstadt | GCC / GSC CE | Daniel Thuerck, Maxim Naumov | 36

Acknowledgements

The work of Daniel Thuerck is supported by the 'Excell llen ence e In Init itiat iative ive' of the Germa man n Federa ral and State te Governments nments and the Graduate ate Sch chool of Comp mputati tationa nal l En Engi ginee eering ing at Technische UniversitΓ€t Darmstadt.

slide-37
SLIDE 37

10.05.2017 | TU Darmstadt | GCC / GSC CE | Daniel Thuerck, Maxim Naumov | 37

References

[1] Mehrotra, Sanjay. "On the implementation of a primal-dual interior point method." SIAM Journal on

  • ptimization 2.4 (1992): 575-601.

[2] Gondzio, Jacek. "Presolve analysis of linear programs prior to applying an interior point method." INFORMS Journal on Computing 9.1 (1997): 73-91. [3] Gondzio, Jacek. "Presolve analysis of linear programs prior to applying an interior point method." INFORMS Journal on Computing 9.1 (1997): 73-91. [4] Paige, Christopher C., and Michael A. Saunders. "Solution of sparse indefinite systems of linear equations." SIAM journal on numerical analysis 12.4 (1975): 617-629. [5] Paige, Christopher C., and Michael A. Saunders. "Solution of sparse indefinite systems of linear equations." SIAM journal on numerical analysis 12.4 (1975): 617-629.

slide-38
SLIDE 38

10.05.2017 | TU Darmstadt | GCC / GSC CE | Daniel Thuerck, Maxim Naumov | 38

References

[6] Karypis, George, and Vipin Kumar. "A fast and high quality multilevel scheme for partitioning irregular graphs." SIAM Journal on scientific Computing 20.1 (1998): 359-392. [7] Koch, Thorsten, et al. "MIPLIB 2010." Mathematical Programming Computation 3.2 (2011): 103-163. [8] Forrest, John, David de la Nuez, and Robin Lougee-Heimer. "CLP user guide." IBM Research (2004). [9] Benders, Jacques F. "Partitioning procedures for solving mixed-variables programming problems." Numerische mathematik 4.1 (1962): 238-252. [10] Dantzig, George B., and Philip Wolfe. "Decomposition principle for linear programs." Operations research 8.1 (1960): 101-111.