IN SUPPORT OF WORKLOAD-AWARE STREAMING STATE MANAGEMENT
Vasiliki Kalavri (vkalavri@bu.edu), John Liagouris (liagos@bu.edu)
HotStorage 2020, 14 July 2020


SLIDE 1

HotStorage 2020 14 July 2020

IN SUPPORT OF WORKLOAD-AWARE STREAMING STATE MANAGEMENT

John Liagouris liagos@bu.edu Vasiliki Kalavri vkalavri@bu.edu

slide-2
SLIDE 2

STREAMING DATAFLOWS

Nexmark Q4: “Rolling average of winning bids”

Logical Dataflow: auctions source, bids source -> join -> rolling average -> sink

Physical Dataflow: Worker 1, Worker 2

Nexmark Streaming Benchmark Suite: https://beam.apache.org/documentation/sdks/java/testing/nexmark/

SLIDE 3

LARGER-THAN-MEMORY STATE MANAGEMENT

Large operator state is backed by key-value stores

Figure: Worker 1 and Worker 2, each with put/get access to a <k,v> store

SLIDE 4

LARGER-THAN-MEMORY STATE MANAGEMENT

Large operator state is backed by key-value stores

LSM-based write-optimized store with efficient range scans

Figure: Worker 1 and Worker 2, each with put/get access to a <k,v> store

SLIDE 5

STATE REQUIREMENTS VARY ACROSS OPERATORS

Dataflow operators may have different state access patterns and memory requirements

  • Average: Read-Modify-Write a single value
  • Join: Write-heavy and can potentially accumulate large state

Logical Dataflow: auctions source, bids source -> join -> rolling average -> sink

Nexmark Q4: “Rolling average of winning bids”
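The two access patterns named on this slide can be contrasted with a minimal sketch. Everything here is illustrative: `RollingAverage`, `SymmetricJoin`, and the plain dictionaries standing in for state stores are hypothetical names, not the testbed's API, and the join logic is a simplification rather than the actual Nexmark Q4 semantics.

```python
from collections import defaultdict


class RollingAverage:
    """Read-modify-write: each record touches one small (sum, count) value."""

    def __init__(self):
        self.state = {}  # key -> (sum, count)

    def on_record(self, key, price):
        total, count = self.state.get(key, (0, 0))  # read
        total, count = total + price, count + 1     # modify
        self.state[key] = (total, count)            # write
        return total / count


class SymmetricJoin:
    """Write-heavy: every input record is stored, so state grows with the input."""

    def __init__(self):
        self.auctions = {}             # auction_id -> auction record
        self.bids = defaultdict(list)  # auction_id -> bids seen so far

    def on_auction(self, auction_id, auction):
        self.auctions[auction_id] = auction
        return [(auction, bid) for bid in self.bids.get(auction_id, [])]

    def on_bid(self, auction_id, bid):
        self.bids[auction_id].append(bid)
        auction = self.auctions.get(auction_id)
        return [(auction, bid)] if auction is not None else []

    def size(self):
        """Number of stored records across both inputs."""
        return len(self.auctions) + sum(len(b) for b in self.bids.values())


avg = RollingAverage()
first_avg = avg.on_record("category-1", 10)
second_avg = avg.on_record("category-1", 20)

join = SymmetricJoin()
early_bid = join.on_bid(1, 5)              # no matching auction yet
matched = join.on_auction(1, "auction-1")  # joins with the buffered bid
late_bid = join.on_bid(1, 7)
```

In this sketch the average keeps a single entry per key no matter how many records arrive, while the join stores every auction and bid it has seen.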

SLIDE 6

CURRENT PRACTICE: MONOLITHIC STATE MANAGEMENT

One key-value store (RocksDB) per stateful operator instance

All key-value stores in the dataflow are globally configured

Figure: Worker 1 and Worker 2 with their <k,v> stores

SLIDE 7

FLAWS OF MONOLITHIC STATE MANAGEMENT

Figure: Worker 1 and Worker 2 with their <k,v> stores

  • Oblivious store configuration
  • Unnecessary data marshaling
  • Unnecessary key-value store features

SLIDE 8

UNNECESSARY KEY-VALUE STORE FEATURES

  • State partitioning
  • State scoping
  • Concurrent access to state
  • State checkpointing

All these operations are handled by modern stream processors outside the state store

Stream processors guarantee single-thread access to state

SLIDE 9

WORKLOAD-AWARE STREAMING STATE MANAGEMENT

Multiple state stores of different types and configurations according to the requirements of the stateful operators

Streaming operators are instantiated once and are long-running: their access patterns and state sizes are largely known in advance

Figure: Worker 1 and Worker 2 with per-operator stores; labels: put/get, rmw_u64, store:u64, store:<u64,auction>, store:<u64,bid>
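One way to picture the difference between a global configuration and a per-operator one is a small declarative plan. This is a sketch under assumptions: `StoreSpec`, `monolithic_plan`, and `needs_marshaling` are hypothetical names, and reading the slide label `store:u64` as a keyless single-value store is a guess. Only the type labels (`u64`, `auction`, `bid`) and the access labels (`put/get`, `rmw`) come from the slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StoreSpec:
    key_type: Optional[str]  # None: the store holds a single value (assumption)
    value_type: str
    access: str              # "put/get" or "rmw"


# Monolithic: one global spec (opaque bytes, put/get) reused for every operator.
GLOBAL_SPEC = StoreSpec(key_type="bytes", value_type="bytes", access="put/get")


def monolithic_plan(state_names):
    """Assign the same globally configured spec to every piece of state."""
    return {name: GLOBAL_SPEC for name in state_names}


# Workload-aware: one spec per piece of operator state, using the slide's labels.
WORKLOAD_AWARE_PLAN = {
    "average": StoreSpec(key_type=None, value_type="u64", access="rmw"),
    "auctions": StoreSpec(key_type="u64", value_type="auction", access="put/get"),
    "bids": StoreSpec(key_type="u64", value_type="bid", access="put/get"),
}


def needs_marshaling(spec):
    """In this sketch, only opaque byte stores force serialization on access."""
    return spec.value_type == "bytes"


mono = monolithic_plan(WORKLOAD_AWARE_PLAN)
```

Because operators are instantiated once and run for a long time, a plan like this could be fixed when the dataflow is built.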

SLIDE 10

A FLEXIBLE TESTBED FOR STREAMING STATE MANAGEMENT

  • Implemented in Rust
  • Based on Timely Dataflow stream processor
  • Supports two key-value stores
      • RocksDB: LSM-based with efficient range scans
      • FASTER: Hybrid log with efficient lookups and in-place updates
  • Supports different window evaluation strategies

Timely Dataflow: https://github.com/TimelyDataflow/timely-dataflow
FASTER: https://github.com/microsoft/FASTER
Testbed: https://github.com/jliagouris/wassm

SLIDE 11

EXPERIMENTAL RESULTS

SLIDE 12

EVALUATION GOALS

1. Study the effect of the backend’s data layout on the evaluation of streaming windows
2. Study the effect of workload-aware configuration on queries with multiple stateful operators

SLIDE 13
EFFECT OF DATA LAYOUT ON WINDOW EVALUATION

COUNT-30s-1s

  • Query 1: Count the number of records in a 30s window that slides every 1s
  • Input rate: 10K records/s
  • Single thread execution
  • Report end-to-end latency (ms) per record
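To see why a 30s window sliding every 1s stresses the state store, it helps to compute which windows a single record falls into. The helper below is a hypothetical sketch, not testbed code. It assumes windows are half-open intervals [start, start + size) aligned to multiples of the slide, and that no window starts before time 0.

```python
def sliding_windows(ts_ms, size_ms=30_000, slide_ms=1_000):
    """Return the start times of all windows [start, start + size) containing ts_ms."""
    last_start = ts_ms - ts_ms % slide_ms          # latest window that has begun
    first_start = last_start - size_ms + slide_ms  # earliest window still open
    return [s for s in range(first_start, last_start + 1, slide_ms) if s >= 0]


windows = sliding_windows(45_500)

# If every window's count were updated eagerly per record (an assumption about
# the evaluation strategy), the slide's input rate of 10K records/s would
# translate into this many state updates per second:
eager_updates_per_sec = 10_000 * len(windows)
```

Under these assumptions a record with timestamp 45.5s belongs to the 30 windows starting at 16s through 45s.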

SLIDE 14
EFFECT OF DATA LAYOUT ON WINDOW EVALUATION

COUNT-30s-1s

Figure: complementary CDF of latency, with p90, p99, and p99.9 marked

Complementary CDF: Each point (x,y) indicates that y% of the latency measurements are at least x ms

Lower is better
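The complementary CDF definition above can be made concrete with a short sketch. The function name `ccdf` and the sample latencies are made up for illustration; they are not measurements from the talk.

```python
def ccdf(latencies_ms):
    """Return (x, y) points where y% of the measurements are at least x ms."""
    xs = sorted(latencies_ms)
    n = len(xs)
    points = []
    for i, x in enumerate(xs):
        if i > 0 and x == xs[i - 1]:
            continue  # keep one point per distinct latency value
        # xs[i:] are exactly the measurements that are >= x
        points.append((x, 100.0 * (n - i) / n))
    return points


sample_points = ccdf([1, 2, 2, 10])
```

Reading such a plot, the p99 latency is roughly the x value at which y has dropped to 1%.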

SLIDE 15
EFFECT OF DATA LAYOUT ON WINDOW EVALUATION

COUNT-30s-1s

  • RocksDB PUT/GET: On record, retrieve window contents, apply new record, and put the updated contents back to the store

Lower is better
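The PUT/GET strategy described above can be sketched against a dictionary standing in for the store. `DictStore`, `on_record_put_get`, and `count` are hypothetical names; the call counters exist only to show that every record costs one get and one put.

```python
class DictStore:
    """In-memory stand-in for a key-value store that counts its calls."""

    def __init__(self):
        self.data = {}
        self.gets = 0
        self.puts = 0

    def get(self, key):
        self.gets += 1
        return self.data.get(key)

    def put(self, key, value):
        self.puts += 1
        self.data[key] = value


def on_record_put_get(store, window, apply, record):
    """Eager evaluation: read the window contents, apply the record, write back."""
    contents = store.get(window)
    store.put(window, apply(contents, record))


def count(acc, _record):
    """COUNT aggregate: the window contents are just a running counter."""
    return (acc or 0) + 1


store = DictStore()
for record in ["r1", "r2", "r3"]:
    on_record_put_get(store, "window-0", count, record)
```

After three records the window holds the value 3, and the store has served three gets and three puts.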

SLIDE 16
EFFECT OF DATA LAYOUT ON WINDOW EVALUATION

COUNT-30s-1s

  • RocksDB MERGE: On record, put record to the store using MERGE. The record is applied to the window contents lazily on trigger

Lower is better
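The lazy MERGE strategy can be mimicked with a store that only appends operands on write and folds them when the window is read. This is a simplified stand-in: `MergeStore` is a hypothetical class, and RocksDB's real merge operator also combines operands during compaction, which this sketch does not model.

```python
class MergeStore:
    """Stand-in for a store with a merge operation: writes append, reads fold."""

    def __init__(self):
        self.operands = {}  # key -> list of pending operands
        self.merges = 0
        self.folds = 0

    def merge(self, key, operand):
        """On record: append the operand without reading existing contents."""
        self.merges += 1
        self.operands.setdefault(key, []).append(operand)

    def get(self, key, apply):
        """On trigger: apply all pending operands to produce the window result."""
        acc = None
        for operand in self.operands.get(key, []):
            acc = apply(acc, operand)
            self.folds += 1
        return acc


def count(acc, _record):
    """COUNT aggregate applied lazily at trigger time."""
    return (acc or 0) + 1


lazy_store = MergeStore()
for record in ["r1", "r2", "r3"]:
    lazy_store.merge("window-0", record)

folds_before_trigger = lazy_store.folds
result = lazy_store.get("window-0", count)
```

No aggregation work happens while records arrive; all three operands are applied only when the window is read.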

SLIDE 17
EFFECT OF DATA LAYOUT ON WINDOW EVALUATION

COUNT-30s-1s

  • FASTER performs better due to in-place updates

Figure annotation: 100X in p99

Lower is better

SLIDE 18
EFFECT OF DATA LAYOUT ON WINDOW EVALUATION

RANK-30s-30s

  • Query 2: Rank records in a 30s tumbling window
  • Input rate: 1K records/s
  • Single thread execution
  • Report end-to-end latency (ms) per record

Lower is better
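A tumbling window assigns each record to exactly one window, and a rank can only be finalized once the whole window is available. The sketch below illustrates both points. The slides do not define the ranking rule, so the choice here (1 = largest value, ties share a rank) is an assumption, and both function names are hypothetical.

```python
def tumbling_window(ts_ms, size_ms=30_000):
    """Return the start time of the single window [start, start + size) holding ts_ms."""
    return ts_ms - ts_ms % size_ms


def rank_window(values):
    """Rank on trigger: 1 is the largest value; equal values share the same rank."""
    ordered = sorted(values, reverse=True)
    first_position = {}
    for position, value in enumerate(ordered, start=1):
        first_position.setdefault(value, position)
    # Report ranks in arrival order, one (value, rank) pair per record
    return [(value, first_position[value]) for value in values]


window_start = tumbling_window(45_500)
ranked = rank_window([5, 9, 5, 1])
```

Since every rank depends on all records in the window, the sort is done once at trigger time in this sketch.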

SLIDE 19

EFFECT OF DATA LAYOUT ON WINDOW EVALUATION

RANK-30s-30s

  • RocksDB MERGE performs best due to lazy evaluation

Figure annotations: 100X, 1000X

Lower is better

SLIDE 20

THERE IS NO CLEAR WINNER

  • COUNT-30s-1s: FASTER performs better (100X in p99)
  • RANK-30s-30s: RocksDB MERGE performs best (100X, 1000X)

SLIDE 21

MONOLITHIC VS WORKLOAD-AWARE STATE MANAGEMENT

  • Experiments with six Nexmark* queries
  • Different stateful operators (joins, window aggregations, custom aggregations)
  • Simple workload-aware configuration of data types and available memory size

*Nexmark Streaming Benchmark Suite: https://beam.apache.org/documentation/sdks/java/testing/nexmark/

SLIDE 22

MONOLITHIC VS WORKLOAD-AWARE STATE MANAGEMENT

Q4: custom join and rolling aggregate

  • State store used: FASTER
  • Input rate: 10K records/s
  • Single thread execution
  • Monolithic memory configuration: 8GB
  • Workload-aware memory configuration: 6GB (bids), 1.5GB (auctions), 512MB (average)
  • Report end-to-end latency (ms) per record
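A quick arithmetic check relates the workload-aware split to the 8GB monolithic figure. The sketch assumes 1GB = 1024MB; the slide does not say whether the monolithic 8GB is a total or a per-store setting, so the comparison is only between the two numbers as stated. Variable names are illustrative.

```python
MB_PER_GB = 1024  # assumption: binary units

monolithic_mb = 8 * MB_PER_GB

# Per-store budgets for Q4, taken from the slide and converted to MB.
workload_aware_mb = {
    "bids": 6 * MB_PER_GB,
    "auctions": 1536,  # 1.5GB
    "average": 512,
}

total_mb = sum(workload_aware_mb.values())

# Share of the total budget given to each store, in percent.
shares = {name: 100.0 * mb / total_mb for name, mb in workload_aware_mb.items()}
```

Under that assumption the three budgets add up to exactly 8GB, with three quarters of it going to the bids store.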
SLIDE 25

MONOLITHIC VS WORKLOAD-AWARE STATE MANAGEMENT

Q4: custom join and rolling aggregate

6X in p99

  • State store used: FASTER
  • Input rate: 10K records/s
  • Single thread execution
  • Monolithic memory configuration: 8GB
  • Workload-aware memory configuration: 6GB (bids), 1.5GB (auctions), 512MB (average)
  • Report end-to-end latency (ms) per record

SLIDE 26

MONOLITHIC VS WORKLOAD-AWARE STATE MANAGEMENT

Q5: sliding window aggregation

  • State store used: FASTER
  • Input rate: 10K records/s
  • Single thread execution
  • Monolithic memory configuration: 8GB
  • Workload-aware memory configuration: 6GB (additions), 1GB (deletions), 512MB (accumulations), 512MB (hot items)
  • Report end-to-end latency (ms) per record

SLIDE 27

MONOLITHIC VS WORKLOAD-AWARE STATE MANAGEMENT

Q5: sliding window aggregation

14X in p99

  • State store used: FASTER
  • Input rate: 10K records/s
  • Single thread execution
  • Monolithic memory configuration: 8GB
  • Workload-aware memory configuration: 6GB (additions), 1GB (deletions), 512MB (accumulations), 512MB (hot items)
  • Report end-to-end latency (ms) per record

SLIDE 28

MONOLITHIC VS WORKLOAD-AWARE STATE MANAGEMENT

Q4: latency vs throughput with a single thread

Q7: latency with a varying number of threads

SLIDE 29

MONOLITHIC VS WORKLOAD-AWARE STATE MANAGEMENT

Q4: latency vs throughput with a single thread

Q7: latency with a varying number of threads

  • 2X higher throughput

SLIDE 30

MONOLITHIC VS WORKLOAD-AWARE STATE MANAGEMENT

Q4: latency vs throughput with a single thread

Q7: latency with a varying number of threads

  • 2X higher throughput

FASTER (monolithic) and RocksDB (monolithic) do not keep up with 2M records/s

SLIDE 31

MONOLITHIC VS WORKLOAD-AWARE STATE MANAGEMENT

Q4: latency vs throughput with a single thread

Q7: latency with a varying number of threads

  • Benefits persist in multi-worker dataflows

SLIDE 32

OPEN QUESTIONS

  • One store fits all or many?
  • Do we need new streaming benchmarks?
  • What are the desirable store features to support advanced state operations (e.g. state migration, etc.)?
  • How can we learn streaming state characteristics?

SLIDE 33

SUMMARY

Workload-aware streaming state management

  • We need to revisit current monolithic approaches
  • State store layout affects query performance significantly
  • Workload-aware state management achieves up to 14X speedup and 2X higher throughput in Nexmark queries

Testbed: https://github.com/jliagouris/wassm

Figure: Worker 1 and Worker 2 with per-operator stores; labels: put/get, rmw_u64, store:u64, store:<u64,auction>, store:<u64,bid>

John Liagouris liagos@bu.edu Vasiliki Kalavri vkalavri@bu.edu
