Calc: The challenges of scalable arithmetic - How threading can be challenging

SLIDE 1

1 / 25 FOSDEM 2018 | Michael Meeks

Michael Meeks

General Manager at Collabora Productivity michael.meeks@collabora.com

Skype - mmeeks, G+ - mejmeeks@gmail.com

Calc: The challenges of scalable arithmetic
How threading can be challenging

“Stand at the crossroads and look; ask for the ancient paths, ask where the good way is, and walk in it, and you will find rest for your souls...” - Jeremiah 6:16

www.collaboraoffice.com

SLIDE 2

Calc threading - Overview

  • LibreOffice 6.0 Calc
  • Existing structure & parallelism
  • Why thread ?
  • The initial solution & problems
  • mis-factored code
  • dependency issues
  • The group calculation piece
  • Profiling & optimizing
  • Future work & expansion …

Disclaimer & Thanks: Almost all of this work was done by Tor Lillqvist & Dennis Francis – who can’t be here today. Some great code reading & improvement.

SLIDE 3

LibreOffice 6.0 Calc ...

  • A 30+ year old code-base
  • Primary Data structures hugely improved recently
  • Still some scope for improvement: FormulaGroup vs. FormulaCell, per-cell dependency records etc.
  • Calculation Engine in need of love
  • Some insights into how it works
  • Some problems wrt. threading.

SLIDE 4

Core structures since 4.3 (mdds::multi_type_vector)

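[Diagram: ScDocument → ScTable → ScColumn. Each ScColumn holds Broadcasters, Text widths, Script types, Cell values and Cell notes. Cell values are stored in typed blocks: svl::SharedString block, double block, EditTextObject block, ScFormulaCell block. A "This bit:" callout highlights part of the diagram.]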
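
To make the block layout concrete, here is a minimal sketch of the idea that adjacent cells of the same type share one contiguous block. It makes no claim about the real mdds::multi_type_vector API; the names Column, Block, append and cellCount are hypothetical, and only two cell types are modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <variant>
#include <vector>

// Hypothetical sketch: a column stored as runs ("blocks") of same-typed cells.
// This is NOT the mdds::multi_type_vector API, only an illustration of the layout.
using DoubleBlock = std::vector<double>;
using StringBlock = std::vector<std::string>;
using Block = std::variant<DoubleBlock, StringBlock>;

struct Column
{
    std::vector<Block> blocks;

    // Append a value: extend the last block when its type matches,
    // otherwise start a new block. T must be double or std::string.
    template <typename T>
    void append(const T& value)
    {
        using BlockT = std::vector<T>;
        if (!blocks.empty())
        {
            if (auto* last = std::get_if<BlockT>(&blocks.back()))
            {
                last->push_back(value);
                return;
            }
        }
        blocks.emplace_back(BlockT{value});
    }

    // Total number of cells across all blocks.
    std::size_t cellCount() const
    {
        std::size_t n = 0;
        for (const Block& b : blocks)
            n += std::visit([](const auto& v) { return v.size(); }, b);
        return n;
    }
};
```

Appending 1.0, 2.0, a string and then 3.0 yields three blocks (double, string, double) holding four cells, which is why a long run of numbers can be read as one flat array.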

SLIDE 5

FormulaCellGroups

[Diagram: several ScFormulaCell instances share one ScFormulaCellGroup, which holds a ScTokenArray (… Tokens …, RPN).]

Sample Token types (StackVar)

  • svSingleRef → A1
  • svDoubleRef → A1:C3
  • svExternalSingleRef etc.
  • svDouble → 42.0
  • svString → “hello world”
  • svByte → ocDiv, ocMacro ...

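
The token kinds listed above can be pictured as a tagged union. The following is a simplified sketch only: the sv* and oc* names come from the slide, while the Token, SingleRef and DoubleRef layouts are hypothetical and much smaller than the real classes.

```cpp
#include <cassert>
#include <string>
#include <variant>

// Hypothetical, simplified token model. The svXxx / ocXxx names appear on the
// slide; the struct layouts here are illustrative assumptions.
enum class StackVar { svByte, svDouble, svString, svSingleRef, svDoubleRef };
enum class OpCode { ocDiv, ocMacro };

struct SingleRef { int col; int row; };                  // e.g. A1
struct DoubleRef { SingleRef first; SingleRef last; };   // e.g. A1:C3

struct Token
{
    std::variant<OpCode, double, std::string, SingleRef, DoubleRef> payload;

    // Map the stored alternative onto the matching StackVar tag.
    StackVar type() const
    {
        static constexpr StackVar tags[] = {
            StackVar::svByte, StackVar::svDouble, StackVar::svString,
            StackVar::svSingleRef, StackVar::svDoubleRef
        };
        return tags[payload.index()];
    }
};
```

For example, `Token{42.0}.type()` is `StackVar::svDouble`, and a token built from `OpCode::ocDiv` reports `StackVar::svByte`.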
SLIDE 6

Normal Formula interpreting

    double ScFormulaCell::GetValue()
    {
        MaybeInterpret();
        return GetRawValue();
    }

    void ScFormulaCell::Interpret()
    {
        … amazing recursion flattening …
        InterpretTail() // ie. ...
        {
            … new ScInterpreter( this, pDocument, rContext, aPos,
                                 *pCode /* those tokens */ );
            ->Interpret()

    StackVar ScInterpreter::Interpret()
    {
        … execute reverse-polish stack …
        … execute functions …
        … get cell values from references …

Recursion++
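
The flow above, where reading a value may trigger interpretation, which in turn reads other cells, can be sketched in miniature. This is an illustration under simplifying assumptions, not the LibreOffice code: Sheet, Cell and Tok are hypothetical, only + and * are supported, and circular references are not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical miniature of the scheme on the slide: getValue() lazily
// interprets a dirty cell, and interpreting may recurse into other cells.
struct Tok
{
    enum Kind { Number, CellRef, Add, Mul };
    Kind kind;
    double number;      // used by Number
    std::size_t cell;   // used by CellRef: index into Sheet::cells
};

struct Cell
{
    std::vector<Tok> rpn;   // formula in reverse-polish order
    double value = 0.0;
    bool dirty = true;
};

struct Sheet
{
    std::vector<Cell> cells;
    int interpretCalls = 0;

    double getValue(std::size_t i)
    {
        if (cells[i].dirty)         // the "MaybeInterpret" step
            interpret(i);
        return cells[i].value;
    }

    void interpret(std::size_t i)
    {
        ++interpretCalls;
        std::vector<double> stack;
        for (const Tok& t : cells[i].rpn)
        {
            switch (t.kind)
            {
                case Tok::Number:
                    stack.push_back(t.number);
                    break;
                case Tok::CellRef:
                    stack.push_back(getValue(t.cell));   // Recursion++
                    break;
                case Tok::Add:
                case Tok::Mul:
                {
                    assert(stack.size() >= 2);
                    double rhs = stack.back(); stack.pop_back();
                    double lhs = stack.back(); stack.pop_back();
                    stack.push_back(t.kind == Tok::Add ? lhs + rhs : lhs * rhs);
                    break;
                }
            }
        }
        assert(stack.size() == 1);
        cells[i].value = stack.back();
        cells[i].dirty = false;
    }
};
```

With cell 0 holding `2 3 +` and cell 1 holding `<cell 0> 4 *`, asking for cell 1 interprets both cells once; a second read returns the cached value without interpreting again.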

SLIDE 7

InterpretFormulaGroup

[Diagram: ScFormulaCellGroup with its ScTokenArray (… Tokens …, RPN); sample cell values (1 2 2 1 7 6 9 6 5 2 3 4) → getValues → Collected to Matrix → Interpret: OpenCL or Software, each after "Examine for safe cases".]

  • Even non-threaded software case: faster
  • Shares function input collection work.
  • Aggregated / linearized doubles / strings in the matrix
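
A rough sketch of why collecting inputs once helps, assuming a group of identical formulas of the form `=A<row> * B<row> + k` running down a column. The function interpretGroup and its parameters are hypothetical; the point is only that the per-row work becomes a tight loop over flat arrays.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of group calculation for a column of formulas
// "=A<row> * B<row> + k". The inputs have already been collected into flat
// arrays, so one loop computes every result in the group.
std::vector<double> interpretGroup(const std::vector<double>& colA,
                                   const std::vector<double>& colB,
                                   double k)
{
    assert(colA.size() == colB.size());
    std::vector<double> results(colA.size());
    for (std::size_t row = 0; row < colA.size(); ++row)
        results[row] = colA[row] * colB[row] + k;   // no per-cell interpreter set-up
    return results;
}
```

A loop of this shape is also the kind a compiler can vectorize, which fits the later remark about the software group interpreter having "some SSE2 goodness".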

SLIDE 8

Why Thread ?

SLIDE 9

CPUs get wider not faster

  • Sometimes CPUs get slower …
  • Processor clocks stymied at 3-4 GHz
  • IPC improvements ~stalled
  • Real IPC wins:
    – Laptops → minimum 4 threads – Mid-range → 8 threads.
    – PC / Workstation – 8 → 16 threads: the new normal.
  • Affordable too ...
  • Many thanks to AMD for sponsoring this work.

SLIDE 10

2017 Crash reporting stats

  • Frustratingly ‘cores’ not threads.

[Chart: Crash report % by CPU core count over time. Y-axis 0% to 100%; x-axis dates from 2017 into 2018; one series per core count: 1, 2, 4, 6, 8, 10, 12, 16, 24, 32, 36, 48.]

[Chart: Reports from large core count machines. Axis 200 to 2000; one series per core count: 10, 12, 16, 24, 32, 36, 40, 48.]

SLIDE 11

Initial Solution ...

SLIDE 12

Thread InterpretFormulaGroup

  • Attempt re-use of existing formula core
  • Try to avoid special / sub-setting code-paths for existing formula-group conversion: a more generic solution.
  • Concept:
    – Pre-calculate dependent cells to control recursion outside of threads.
    – Protect invariants with assertions
    – Black-list problematic functions ...
    – Parallelise using existing interpreter.
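
The concept above can be sketched as follows, assuming the dependent inputs have already been calculated on the main thread. The function calcGroupThreaded, its parameters and the even chunking are hypothetical choices for illustration, not the scheme LibreOffice uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical sketch: inputs are fully calculated before any thread starts,
// so the workers only read shared data and each writes its own slice of results.
std::vector<double> calcGroupThreaded(const std::vector<double>& inputs,
                                      const std::function<double(double)>& formula,
                                      unsigned threadCount)
{
    assert(threadCount > 0);
    std::vector<double> results(inputs.size());
    std::vector<std::thread> workers;
    const std::size_t chunk = (inputs.size() + threadCount - 1) / threadCount;

    for (unsigned t = 0; t < threadCount; ++t)
    {
        const std::size_t begin = std::min(inputs.size(), t * chunk);
        const std::size_t end = std::min(inputs.size(), begin + chunk);
        workers.emplace_back([&inputs, &results, &formula, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                results[i] = formula(inputs[i]);   // read-only inputs, private output slot
        });
    }
    for (std::thread& w : workers)
        w.join();
    return results;
}
```

Because no worker ever has to calculate a dependency on demand, there is no recursion inside the threads and no locking is needed in this sketch.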
SLIDE 13

Parallelize existing interpreter

    double ScFormulaCell::GetValue()
    {
        MaybeInterpret();
        return GetRawValue();
    }

    void ScFormulaCell::Interpret()
    {
        … amazing recursion flattening …
        InterpretTail() // ie. ...
        {
            … new ScInterpreter( this, pDocument, rContext, aPos,
                                 *pCode /* those tokens */ );
            ->Interpret()

    StackVar ScInterpreter::Interpret()
    {
        … execute reverse-polish stack …
        … execute functions …
        … get cell values from references …

Pre-fetch all dependent values – and lock that down:

    void ScFormulaCell::MaybeInterpret()
        ...
        assert(!pDocument->mbThreadedGroupCalcInProgress);

Pre-calculated → No recursion
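
A small sketch of how such an invariant might be enforced. Only the flag name echoes the slide; Document, ThreadedCalcScope and FormulaCell are hypothetical stand-ins, and placing the assertion inside the dirty branch is an assumption of this sketch.

```cpp
#include <cassert>

// Hypothetical sketch of the invariant on the slide: while a threaded group
// calculation is running, nothing may trigger a lazy (recursive) interpret.
struct Document
{
    bool threadedGroupCalcInProgress = false;
};

// RAII helper: sets the flag for the duration of the threaded section.
class ThreadedCalcScope
{
public:
    explicit ThreadedCalcScope(Document& doc) : mDoc(doc)
    {
        assert(!mDoc.threadedGroupCalcInProgress);
        mDoc.threadedGroupCalcInProgress = true;
    }
    ~ThreadedCalcScope() { mDoc.threadedGroupCalcInProgress = false; }
    ThreadedCalcScope(const ThreadedCalcScope&) = delete;
    ThreadedCalcScope& operator=(const ThreadedCalcScope&) = delete;

private:
    Document& mDoc;
};

struct FormulaCell
{
    Document* doc = nullptr;
    double value = 0.0;
    bool dirty = true;

    void maybeInterpret()
    {
        if (dirty)
        {
            // Every dependency should already be calculated by now, so
            // reaching this point during the threaded phase is a bug.
            assert(!doc->threadedGroupCalcInProgress);
            value = 42.0;   // placeholder for the real interpretation
            dirty = false;
        }
    }
};
```

Pre-calculating a cell before entering the scope means a later read inside the scope finds it clean and never trips the assertion.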

SLIDE 14

ScInterpreter: calcs formulae

[Diagram: ScInterpreter in relation to ScDocument → ScTable → ScColumn (ScFormulaCell block, Broadcasters), ScBroadcastAreaSlotMachine (Dependencies), and ScFormulaCellGroup with its ScTokenArray (… Tokens …, RPN). Annotations: "Mutates: INDEX, OFFSET etc."; Cloud, Web fn’s, Macros, Ext’ns: "Mutates!"; Vlookup Cache; Number format, Link mgmt etc.]

SLIDE 15

ScInterpreter: some fixes

  • Basic iteration - broken:
    – class FormulaTokenArray

          sal_uInt16 nIndex; // Current step index
          FormulaToken* FirstRPN() { nIndex = 0; return NextRPN(); }

    – Now has an external iterator
    – a man-week+ to un-wind this, and debug the last pieces that relied on this.
  • Added mutation guards:
    – ScMutationGuard aGuard(this, ScMutationGuardFlags::CORE);
    – In all likely-looking places: where core state is changed.
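
The iteration fix can be illustrated with a sketch in which the position lives in the reader rather than in the array. TokenArray and TokenArrayIterator below are hypothetical stand-ins (plain ints replace FormulaToken pointers); this is not the actual LibreOffice class.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: the array stays untouched while being read, and each
// reader owns its position, so several threads can walk the same tokens at
// once. (The old design kept the position, nIndex, inside the array.)
struct TokenArray
{
    std::vector<int> rpn;   // stand-in for FormulaToken pointers
};

class TokenArrayIterator
{
public:
    explicit TokenArrayIterator(const TokenArray& arr) : mArr(arr) {}

    // Returns the next token, or nullptr at the end.
    const int* nextRPN()
    {
        return mIndex < mArr.rpn.size() ? &mArr.rpn[mIndex++] : nullptr;
    }

    void reset() { mIndex = 0; }

private:
    const TokenArray& mArr;
    std::size_t mIndex = 0;   // per-iterator state, not shared
};
```

Two iterators over the same array advance independently, which is exactly what a shared nIndex member cannot offer.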

SLIDE 16

Disabling nasties:

  • Dependency graph manipulation
    – During calculation: Indirect, Offset, Match, Cell, ocTableOp
  • Other stuff
    – Macros – disabled for now.
      Could detect ‘pure’ ie. non-mutating functions
      Also parallelize the basic/ interpreter (?)
    – Info → grab-bag of bits.
    – ocExternal → UNO extensions: currently in: but can do ~un-controlled mutation (?)

SLIDE 17

More nasties ...

  • Several global variables
    – No-where obvious to hang them
    – Now some thread_local variables: Calculation stack, Current-document being calculated, Matrix positions – nC,nR
    – Somewhat horrific: fix obsolete Mac toolchain.
  • ScInterpreterContext
    – Added – passed through all functions.
    – Impacts eg. ‘GetValue’ though ...
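
The two approaches above, thread_local storage for former globals and an explicit context object passed down the call chain, can be sketched side by side. All names here (tlRecursionDepth, InterpreterContext, getValue, depthSeenByWorker) are hypothetical and only hint at the real ScInterpreterContext.

```cpp
#include <cassert>
#include <thread>

// Hypothetical sketch. A former global becomes thread_local so each worker
// has its own copy, while other per-calculation state travels explicitly
// in a context object handed to every function that needs it.
thread_local int tlRecursionDepth = 0;   // stand-in for a "calculation stack" global

struct InterpreterContext
{
    int formatLookups = 0;   // illustrative per-thread scratch state
};

double getValue(InterpreterContext& ctx, double raw)
{
    ++tlRecursionDepth;
    ++ctx.formatLookups;     // state reached through the context, not a global
    double result = raw;
    --tlRecursionDepth;
    return result;
}

// Runs getValue on another thread and reports the depth that thread observed.
int depthSeenByWorker()
{
    int seen = -1;
    std::thread worker([&seen] {
        InterpreterContext ctx;
        getValue(ctx, 1.0);
        seen = tlRecursionDepth;   // the worker's own copy, back at 0
    });
    worker.join();
    return seen;
}
```

The cost hinted at on the slide shows up in the signature: once a context is required, even a simple getter such as getValue needs the extra parameter.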

SLIDE 18

How did that look: initially ...

[Chart: re-calculating 100k formulae on 1m doubles. Seconds to calculate (0.00 to 9.00) against thread count (single, 1, 2, 4, 8, 16); series: Meeks/Linux, Ryzen/Win10.]

  • Faster
  • Getting some nice speedups – ignoring the hyper-threaded-ness:
    – 8.5s → 2.5 with 4 threads → 3.4x
    – 4.7 → 0.86 with 8 threads - ~5.5x
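
The quoted speedups follow directly from the timings. As a worked check, with S_n the speedup on n threads and E_n = S_n / n the parallel efficiency (both symbols introduced here, not on the slide):

```latex
\[
S_4 = \frac{8.5\,\mathrm{s}}{2.5\,\mathrm{s}} = 3.4,
\qquad
E_4 = \frac{S_4}{4} = 0.85
\]
\[
S_8 = \frac{4.7\,\mathrm{s}}{0.86\,\mathrm{s}} \approx 5.5,
\qquad
E_8 = \frac{S_8}{8} \approx 0.68
\]
```

So roughly 85% of the ideal is reached on 4 threads and about 68% on 8, which is the gap the later profiling work tries to close.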

SLIDE 19

Up to this point:

  • Plain Old calculation – single threaded (POC)
  • Group calculation
    A) Single Threaded Software Group calc (STSG)
    B) OpenCL: GPU parallelism after conversion
    C) New threaded calculation (NTC)
  • Then: C) slower than A) in some cases …
    – Collecting data from sheets, branching, type handling, etc. again and again for each formulacell …
    – Expensive – threading doesn’t help.
    – A) collects once – and has some SSE2 goodness …
  • So → add a ‘threaded A)’ - simple & better …
  • Weighting decision: POC vs. ... based on complexity.

SLIDE 20

Improving performance ...

  • Why don’t we get an 8x for 8 threads ?
  • Terrible profiling tools on Windows.
  • Linux – used ‘perf’ looking for threading issues:

        sudo perf record --call-graph dwarf \
            --switch-events -c 1 # etc.

  • Looking for false-sharing
    – And other horrors.
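
False sharing arises when threads write to different variables that happen to sit on the same cache line. A common remedy, sketched below, is to pad per-thread data to a full line. The 64-byte line size is an assumption, the names PaddedCounter and countInParallel are hypothetical, and C++17 is assumed so that std::vector honours the over-alignment.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical sketch of avoiding false sharing: each thread's counter is
// padded out to its own cache line. 64 bytes is an assumed line size.
struct alignas(64) PaddedCounter
{
    unsigned long value = 0;
};

static_assert(sizeof(PaddedCounter) == 64, "one counter per assumed cache line");

unsigned long countInParallel(unsigned threadCount, unsigned long perThread)
{
    std::vector<PaddedCounter> counters(threadCount);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threadCount; ++t)
        workers.emplace_back([&counters, t, perThread] {
            for (unsigned long i = 0; i < perThread; ++i)
                ++counters[t].value;   // touches only this thread's line
        });
    for (std::thread& w : workers)
        w.join();

    unsigned long total = 0;
    for (const PaddedCounter& c : counters)
        total += c.value;
    return total;
}
```

Removing the alignas leaves the result unchanged but packs the counters together, so the threads can end up contending for the same cache line; that kind of slowdown is what a profiler such as perf helps to locate.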

SLIDE 21

Horror: rampant heap thrash

  • RPN calculation – stack based:
    – Tons of stack operations: pushing values etc.
    – Do memory allocation & frees.
    – Using the ancient / internal allocator – never intended for heavy parallel use.
      → drop the custom allocator → hugely faster
      → Re-use tokens where possible too.
  • std::stack → deque → lists …
    – Horrible: std::vector instead → far better.
  • Re-using ScInterpreterContext ...
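
One way to picture the std::vector-based stack and the re-use theme is a stack whose storage is reserved once and cleared, not freed, between formulas. ValueStack is a hypothetical class for illustration, not the interpreter's actual stack.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: a calculation stack backed by one std::vector that is
// cleared, not freed, between formulas, so steady-state pushes do not allocate.
class ValueStack
{
public:
    explicit ValueStack(std::size_t reserve) { mData.reserve(reserve); }

    void push(double v) { mData.push_back(v); }

    double pop()
    {
        assert(!mData.empty());
        double v = mData.back();
        mData.pop_back();
        return v;
    }

    void reset() { mData.clear(); }   // keeps the capacity for the next formula
    std::size_t capacity() const { return mData.capacity(); }
    bool empty() const { return mData.empty(); }

private:
    std::vector<double> mData;
};
```

Keeping one such stack per thread and calling reset() between formulas means the heap is touched only when a formula needs more depth than any earlier one.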
SLIDE 22

Other issues ...

  • Where ‘GetDouble’ meets SfxItemSet ...
  • fixed SvNumberFormatter thread safety.
SLIDE 23

Threading & optimizing story:

[Chart: Benchmarking some of our sample sheets ... Value axis 200 to 1800; one entry per step:]

  • Baseline from recent master
  • GroupInterpreter work by Tor
  • Thread SoftwareInterpreter
  • Avoid TokenArray thrash
  • halve the number of threads if HT is active
  • disable custom allocator
  • use a cache for FormulaDoubleToken allocation
  • make token cache thread local
  • up the token cache size to 16
  • halve thread count if HT active for group interpreter too
  • Use C++ threads

Note: this perf. regression from threading for some workloads came from avoiding the SoftwareGroupInterpreter.

SLIDE 24

Future work

  • Stop the Crash-testing from asserting ...
  • Implicit intersection: killing us (again)
    – Move RPN to have precise ranges
  • Extend threaded unit tests further …
  • Move more global variables to ScInterpreterContext
  • Make FormulaCell a 1x item group
  • Make POC calculation a forced-single-threaded calc
    – Always thread SoftwareGroup Interpreter
  • De-bong the format-type use
    – =J20 – should not change format type if J20 changes format.
    – A sheet-creation-time optimization …
    – Intersects with ‘units’ work too.

SLIDE 25

Conclusions

  • Calculation can be threaded
  • Significant speedups are possible
  • Profiling & optimizing works
  • “it is slow” == “not enough invested yet”

– All problems are just economics

  • Many thanks to AMD for their support.

Oh, that my words were recorded, that they were written on a scroll, that they were inscribed with an iron tool on lead, or engraved in rock for ever! I know that my Redeemer lives, and that in the end he will stand upon the earth. And though this body has been destroyed yet in my flesh I will see God, I myself will see him, with my own eyes - I and not another. How my heart yearns within me. - Job 19: 23-27