  1. Calc: The challenges of scalable arithmetic
     How threading can be challenging
     Michael Meeks, General Manager at Collabora Productivity
     michael.meeks@collabora.com, Skype - mmeeks, G+ - mejmeeks@gmail.com
     “Stand at the crossroads and look; ask for the ancient paths, ask where the good way is, and walk in it, and you will find rest for your souls...” - Jeremiah 6:16
     www.collaboraoffice.com
     FOSDEM 2018 | Michael Meeks 1 / 25

  2. Calc threading - Overview
     ● LibreOffice 6.0 Calc
     ● Existing structure & parallelism
     ● Why thread?
     ● The initial solution & problems
       – mis-factored code
       – dependency issues
     ● The group calculation piece
     ● Profiling & optimizing
     ● Future work & expansion …
     Disclaimer & Thanks: Almost all of this work was done by Tor Lillqvist & Dennis Francis – who can’t be here today. Some great code reading & improvement.

  3. LibreOffice 6.0 Calc ...
     ● A 30+ year old code-base
     ● Primary Data structures hugely improved recently
     ● Still some scope for improvement: FormulaGroup vs. FormulaCell, per-cell dependency records etc.
     ● Calculation Engine in need of love
     ● Some insights into how it works
     ● Some problems wrt. threading.

  4. Core structures since 4.3 (mdds::multi_type_vector)
     [Diagram: ScDocument, ScTable, ScColumn. Per-column stores: Broadcasters, Cell notes, Text widths, Script types, Cell values. “This bit:” the cell values, held as typed blocks: svl::SharedString block, double block, EditTextObject block, ScFormulaCell block.]

  5. FormulaCellGroups
     [Diagram: several ScFormulaCell entries attached to one ScFormulaCellGroup, which holds a ScTokenArray (Tokens …, RPN …).]
     Sample Token types (StackVar)
     ● svSingleRef → A1
     ● svDoubleRef → A1:C3
     ● svExternalSingleRef etc.
     ● svDouble → 42.0
     ● svString → “hello world”
     ● svByte → ocDiv, ocMacro ...

  6. Normal Formula interpreting
     Recursion++
         double ScFormulaCell::GetValue()
         {
             MaybeInterpret();
             return GetRawValue();
         }
         void ScFormulaCell::Interpret()
         {
             … amazing recursion flattening …
             InterpretTail() // ie. ...
             {
                 …
                 new ScInterpreter( this, pDocument, rContext, aPos, *pCode /* those tokens */);
                 ->Interpret()
         StackVar ScInterpreter::Interpret()
         {
             … execute reverse-polish stack …
             … execute functions …
             … get cell values from references …
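
The “execute reverse-polish stack” step can be illustrated with a minimal sketch. This is not LibreOffice's ScInterpreter: `TokKind`, `Token` and `evalRpn` are hypothetical names invented here, and only numbers and four binary operators are handled.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Hypothetical, simplified token: a number or a binary operator.
enum class TokKind { Double, Add, Sub, Mul, Div };

struct Token {
    TokKind kind;
    double value;  // only meaningful when kind == TokKind::Double
};

// Evaluate tokens that are already in reverse-polish order,
// e.g. "1 2 + 3 *" for (1 + 2) * 3.
double evalRpn(const std::vector<Token>& rpn) {
    std::vector<double> stack;
    for (const Token& t : rpn) {
        if (t.kind == TokKind::Double) {
            stack.push_back(t.value);  // operands are pushed
            continue;
        }
        if (stack.size() < 2)
            throw std::runtime_error("malformed RPN: operator needs two operands");
        // Operators pop their two operands and push the result.
        double rhs = stack.back(); stack.pop_back();
        double lhs = stack.back(); stack.pop_back();
        switch (t.kind) {
            case TokKind::Add: stack.push_back(lhs + rhs); break;
            case TokKind::Sub: stack.push_back(lhs - rhs); break;
            case TokKind::Mul: stack.push_back(lhs * rhs); break;
            case TokKind::Div: stack.push_back(lhs / rhs); break;
            case TokKind::Double: break;  // handled above
        }
    }
    if (stack.size() != 1)
        throw std::runtime_error("malformed RPN: leftover operands");
    return stack.back();
}
```

In the real interpreter, references such as svSingleRef would be resolved to cell values at the point where this sketch pushes a literal, which is where the recursion into other cells comes from.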

  7. InterpretFormulaGroup
     [Diagram: ScFormulaCellGroup with its ScTokenArray (Tokens …, RPN …) → examine for safe cases → getValues: collected to Matrix → Interpret: OpenCL or Software.]
     ● Even non-threaded software case: faster
     ● Shares function input collection work.
     ● Aggregated / linearized doubles / strings in the matrix

  8. Why Thread ?

  9. CPUs get wider not faster
     ● Sometimes CPUs get slower …
     ● Processor clocks stymied at 3-4 GHz
     ● IPC improvements ~stalled
     ● Real IPC wins:
     ● Laptops → minimum 4 threads
       – Mid-range → 8 threads.
     ● PC / Workstation – 8 → 16 threads: the new normal.
     ● Affordable too ...
     ● Many thanks to AMD for sponsoring this work.

  10. 2017 Crash reporting stats
      ● Frustratingly ‘cores’ not threads.
      ● Reports from large core count machines.
      [Chart: Crash report % by CPU core count over time. Monthly from 2017-01-01 to 2018-01-01; axes 0 to 2000 and 0.00% to 100.00%; series for 1, 2, 4, 6, 8, 10, 12, 16, 24, 32, 36, 40 and 48 cores.]

  11. Initial Solution ...

  12. Thread InterpretFormulaGroup
      ● Attempt re-use of existing formula core
      ● Try to avoid special / sub-setting code-paths for existing formula-group conversion: a more generic solution.
      ● Concept:
        – Pre-calculate dependent cells to control recursion outside of threads.
        – Protect invariants with assertions
        – Black-list problematic functions ...
        – Parallelise using existing interpreter.

  13. Parallelize existing interpreter
      Pre-fetch all dependent values – and lock-that down:
          double ScFormulaCell::GetValue()
          {
              MaybeInterpret();
              return GetRawValue();
          }
          void ScFormulaCell::MaybeInterpret()
              ...
              assert(!pDocument->mbThreadedGroupCalcInProgress);
          void ScFormulaCell::Interpret()
          {
              … amazing recursion flattening …
              InterpretTail() // ie. ...
              {
                  …
                  new ScInterpreter( this, pDocument, rContext, aPos, *pCode /* those tokens */);
                  ->Interpret()
      Pre-calculated → No recursion:
          StackVar ScInterpreter::Interpret()
          {
              … execute reverse-polish stack …
              … execute functions …
              … get cell values from references …
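
A rough sketch of the two-phase idea (bring dependents up to date first, then calculate in threads while an assertion forbids further recursion). All names here (`Doc`, `maybeInterpret`, `calcCell`, `calcGroup`) are hypothetical stand-ins, and the per-cell formula is a placeholder; the real code works on ScFormulaCell and ScDocument.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the document state used during calculation.
struct Doc {
    std::vector<double> inputs;  // values the formula group depends on
    std::atomic<bool> threadedCalcInProgress{false};
};

// Stand-in for on-demand (recursive) interpretation of a dependency.
// It must never be reached while worker threads are running.
void maybeInterpret(const Doc& doc) {
    assert(!doc.threadedCalcInProgress.load());
}

// Placeholder per-cell formula: reads pre-calculated input only.
double calcCell(const Doc& doc, std::size_t row) {
    return doc.inputs[row] * 2.0;
}

std::vector<double> calcGroup(Doc& doc, unsigned nThreads) {
    if (nThreads == 0) nThreads = 1;

    // Phase 1 (single-threaded): dependents are brought up to date here.
    maybeInterpret(doc);

    // Phase 2: lock that down, then split the rows into contiguous chunks.
    doc.threadedCalcInProgress = true;
    const std::size_t n = doc.inputs.size();
    const std::size_t chunk = (n + nThreads - 1) / nThreads;
    std::vector<double> results(n);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nThreads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = t * chunk;
            const std::size_t end = std::min(n, begin + chunk);
            for (std::size_t row = begin; row < end; ++row)
                results[row] = calcCell(doc, row);
        });
    }
    for (auto& w : workers) w.join();
    doc.threadedCalcInProgress = false;
    return results;
}
```

Each worker writes only its own slice of `results` and reads shared input without modifying it, so no locking is needed in this simplified form.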

  14. ScInterpreter: calcs formulae
      [Diagram: ScInterpreter in the middle, connected to: ScDocument, ScTable, ScColumn, ScFormulaCell block; ScFormulaCellGroup with ScTokenArray (Tokens …, RPN …); Number format, Link mgmt etc.; Broadcasters; ScBroadcastAreaSlotMachine (Dependencies); Vlookup Cache; Macros, Ext’ns, Web fn’s, Cloud. Annotations: “Mutates!” and “Mutates: INDEX, OFFSET etc.”]

  15. ScInterpreter: some fixes
      ● Basic iteration - broken:
        – class FormulaTokenArray
          sal_uInt16 nIndex; // Current step index
          FormulaToken* FirstRPN() { nIndex = 0; return NextRPN(); }
        – Now has an external iterator – a man-week+ to un-wind this, and debug the last pieces that relied on this.
      ● Added mutation guards:
        – ScMutationGuard aGuard(this, ScMutationGuardFlags::CORE);
        – In all likely-looking places: where core state is changed.
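
Why a cursor stored inside the shared token array breaks under threading, and what an external iterator changes, can be sketched as follows. `TokenArray`, `TokenArrayIterator` and `sumTokens` are hypothetical simplifications (tokens are plain ints), not the actual FormulaTokenArray API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical shared token array: it no longer stores a "current index",
// so reading it does not mutate it.
struct TokenArray {
    std::vector<int> rpn;
};

// External iterator: each caller (or thread) owns its own position.
class TokenArrayIterator {
public:
    explicit TokenArrayIterator(const TokenArray& arr) : mArr(arr), mIndex(0) {}

    // Returns the next token, or nullptr once the end is reached.
    const int* nextRPN() {
        if (mIndex >= mArr.rpn.size()) return nullptr;
        return &mArr.rpn[mIndex++];
    }

private:
    const TokenArray& mArr;
    std::size_t mIndex;
};

// Walks the array with a private iterator; safe to call concurrently
// on the same TokenArray because nothing shared is written.
int sumTokens(const TokenArray& arr) {
    int sum = 0;
    TokenArrayIterator it(arr);
    while (const int* tok = it.nextRPN()) sum += *tok;
    return sum;
}
```

With the old design, two walkers over the same array would both advance the single `nIndex` and skip each other's tokens; here two iterators over one array stay independent.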

  16. Disabling nasties:
      ● Dependency graph manipulation
        – During calculation: Indirect, Offset, Match, Cell, ocTableOp
      ● Other stuff
        – Macros – disabled for now.
          Could detect ‘pure’ ie. non-mutating functions
          Also parallelize the basic/ interpreter (?)
        – Info → grab-bag of bits.
        – ocExternal → UNO extensions: currently in: but can do ~un-controlled mutation (?)

  17. More nasties ...
      ● Several global variables
        – No-where obvious to hang them
        – Now some thread_local variables: Calculation stack, Current-document being calculated, Matrix positions – nC,nR
        – Somewhat horrific: fix obsolete Mac toolchain.
      ● ScInterpreterContext
        – Added – passed through all functions.
        – Impacts eg. ‘GetValue’ though ...
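
The effect of turning a global into a thread_local can be shown with a small sketch. The names `tlCalcStack`, `sumOnThisThread` and `sumInParallel` are made up for illustration; the point is only that each thread gets its own instance of the former global.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Formerly this would have been a plain global shared by every caller.
// thread_local gives each thread its own independent instance.
thread_local std::vector<double> tlCalcStack;

// Uses the calling thread's private stack: push two values, pop and add.
double sumOnThisThread(double a, double b) {
    tlCalcStack.clear();
    tlCalcStack.push_back(a);
    tlCalcStack.push_back(b);
    double rhs = tlCalcStack.back(); tlCalcStack.pop_back();
    double lhs = tlCalcStack.back(); tlCalcStack.pop_back();
    return lhs + rhs;
}

// One worker thread per input; none of them needs a lock, because no
// two threads ever touch the same tlCalcStack instance.
std::vector<double> sumInParallel(const std::vector<double>& xs) {
    std::vector<double> out(xs.size());
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < xs.size(); ++i)
        workers.emplace_back([&, i] { out[i] = sumOnThisThread(xs[i], 1.0); });
    for (auto& w : workers) w.join();
    return out;
}
```

The calling thread's own `tlCalcStack` is untouched by the workers, which is the property the calculation stack, current document and matrix positions need.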

  18. How did that look: initially ...
      ● Faster re-calculating 100k formulae on 1m doubles
      ● Getting some nice speedups – ignoring the hyper-threaded-ness:
        – 8.5s → 2.5s with 4 threads → 3.4x
        – 4.7s → 0.86s - ~5.5x with 8 threads
      [Chart: Seconds to calculate (0.00 to 9.00) against Thread count (single, 1, 2, 4, 8, 16); series Meeks/Linux and Ryzen/Win10.]
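
The quoted speedups follow directly from the timings, assuming the first number in each pair is the single-threaded time. The parallel efficiency figures are derived here and do not appear on the slide.

```latex
S_n = \frac{T_1}{T_n}, \qquad E_n = \frac{S_n}{n}

S_4 = \frac{8.5}{2.5} = 3.4, \qquad E_4 = \frac{3.4}{4} = 0.85

S_8 = \frac{4.7}{0.86} \approx 5.47, \qquad E_8 \approx \frac{5.47}{8} \approx 0.68
```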

  19. Up to this point:
      ● Plain Old calculation – single threaded (POC)
      ● Group calculation
        A) Single Threaded Software Group calc (STSG)
        B) OpenCL: GPU parallelism after conversion
        C) New threaded calculation (NTC)
      ● Then: C) slower than A) in some cases …
        – Collecting data from sheets, branching, type handling, etc. again and again for each formulacell … Expensive – threading doesn’t help.
        – A) collects once – and has some SSE2 goodness …
      ● So add a ‘threaded A)’ → simple & better …
      ● Weighting decision: POC vs. ... based on complexity.
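
The cost difference between collecting inputs per formula cell and collecting them once for the group can be sketched like this. `Cell`, `Sheet`, `calcPerCell` and `calcGrouped` are hypothetical, and the formula (a sliding sum over `width` rows) is chosen only so that neighbouring cells share inputs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sheet cell: reading it needs a type check every time.
struct Cell {
    bool isNumber;
    double value;
};

struct Sheet {
    std::vector<Cell> column;
    mutable std::size_t fetches = 0;  // counts the "expensive" reads

    double fetch(std::size_t row) const {
        ++fetches;
        const Cell& c = column[row];
        return c.isNumber ? c.value : 0.0;
    }
};

// Per-cell style: every formula cell fetches its own inputs again.
std::vector<double> calcPerCell(const Sheet& s, std::size_t width) {
    std::vector<double> out;
    for (std::size_t i = 0; i + width <= s.column.size(); ++i) {
        double sum = 0.0;
        for (std::size_t k = 0; k < width; ++k) sum += s.fetch(i + k);
        out.push_back(sum);
    }
    return out;
}

// Group style: linearise the inputs once, then compute on plain doubles.
std::vector<double> calcGrouped(const Sheet& s, std::size_t width) {
    std::vector<double> flat(s.column.size());
    for (std::size_t i = 0; i < flat.size(); ++i) flat[i] = s.fetch(i);

    std::vector<double> out;
    for (std::size_t i = 0; i + width <= flat.size(); ++i) {
        double sum = 0.0;
        for (std::size_t k = 0; k < width; ++k) sum += flat[i + k];
        out.push_back(sum);
    }
    return out;
}
```

Once the inputs sit in a flat array of doubles, the second loop is also the part that splits cleanly across threads, which is roughly what a ‘threaded A)’ amounts to.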

  20. Improving performance ...
      ● Why don’t we get an 8x for 8 threads?
      ● Terrible profiling tools on Windows.
      ● Linux – used ‘perf’ looking for threading issues:
        – sudo perf record --call-graph dwarf \
              --switch-events -c 1 # etc.
      ● Looking for false-sharing
        – And other horrors.

  21. Horror: rampant heap thrash
      ● RPN calculation – stack based:
        – Tons of stack operations: pushing values etc.
        – Do memory allocation & frees.
        – Using the ancient / internal allocator – never intended for heavy parallel use.
          → drop the custom allocator → hugely faster
          → Re-use tokens where possible too.
      ● std::stack → deque → lists …
        – Horrible: std::vector instead → far better.
      ● Re-using ScInterpreterContext ...
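
A minimal sketch of the “std::vector instead, and re-use it” idea: a value stack backed by one contiguous vector that is reset, not reallocated, between calculations. `CalcStack` and `addMany` are hypothetical names; the actual interpreter stack holds tokens, not doubles.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reusable value stack, backed by a single std::vector.
class CalcStack {
public:
    explicit CalcStack(std::size_t capacity) { mValues.reserve(capacity); }

    void push(double v) { mValues.push_back(v); }
    double pop() {
        double v = mValues.back();
        mValues.pop_back();
        return v;
    }
    bool empty() const { return mValues.empty(); }

    // clear() drops the elements but keeps the allocated storage,
    // so the next calculation does not go back to the heap.
    void reset() { mValues.clear(); }
    std::size_t capacity() const { return mValues.capacity(); }

private:
    std::vector<double> mValues;
};

// Example calculation that re-uses the caller's stack.
double addMany(CalcStack& stack, const std::vector<double>& xs) {
    stack.reset();
    for (double x : xs) stack.push(x);
    double sum = 0.0;
    while (!stack.empty()) sum += stack.pop();
    return sum;
}
```

Keeping one such stack per thread (for example inside a per-thread context object) means repeated calculations stay within the reserved capacity and avoid contending on the allocator.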

  22. Other issues ...
      ● Where ‘GetDouble’ meets SfxItemSet ...
      ● fixed SvNumberFormatter thread safety.
