Lessons Learned with Cassandra & Spark_ Matthias Niehoff (PowerPoint presentation)

SLIDE 1

Lessons Learned with Cassandra & Spark_

Matthias Niehoff Apache: Big Data 2017

@matthiasniehoff @codecentric


SLIDE 2

Our Use Cases_

[Diagram: use cases built from read, join, and write steps]

SLIDE 3

Lessons Learned with Cassandra

SLIDE 4

Data modeling: Primary key_

  • Primary key defines access to a table
  • efficient access only by key
  • reading one or multiple entries by key
  • Cannot be changed after creation
  • Need to query by another key?
    => create a new table
  • Need to query by a lot of different keys?
    => Cassandra might not be a good fit
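Query-first modeling typically means duplicating data into one table per access pattern. A minimal CQL sketch, with hypothetical table and column names (not from the talk):

```sql
-- Access by id:
CREATE TABLE users_by_id (
    user_id uuid PRIMARY KEY,
    email   text,
    name    text
);

-- "Need to query by another key" => same data, second table keyed by email:
CREATE TABLE users_by_email (
    email   text PRIMARY KEY,
    user_id uuid,
    name    text
);
```

The application then writes to both tables on every change.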

SLIDE 5

Care about bucketing_

  • Strategy to reduce partition size
  • Becomes part of the partition key
  • Must be easily calculable for querying
  • Aim for evenly sized partitions
  • Do the math for partition sizes!
    • value count
    • size in bytes
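As a sketch of "doing the math", here is a hypothetical time-series table bucketed by UTC day; the numbers (rows per day, bytes per value) are illustrative assumptions, not from the talk:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class BucketMath {

    // A bucket that is easy to recompute at query time: the UTC day of the event.
    static String dayBucket(long epochMillis) {
        return DateTimeFormatter.ofPattern("yyyyMMdd")
                .withZone(ZoneOffset.UTC)
                .format(Instant.ofEpochMilli(epochMillis));
    }

    // Rough partition size estimate: values per partition times bytes per value.
    static long partitionBytes(long rowsPerBucket, long columnsPerRow, long bytesPerColumn) {
        return rowsPerBucket * columnsPerRow * bytesPerColumn;
    }

    public static void main(String[] args) {
        // Hypothetical sensor: one row per second (86,400/day), 5 columns, ~50 bytes each.
        System.out.println(dayBucket(0L));                  // 19700101
        System.out.println(partitionBytes(86_400, 5, 50));  // 21600000 (~21.6 MB per day-bucket)
    }
}
```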

SLIDE 6

Data modeling: Deletions_

  • Well known: if you delete a column or a whole row, the data is not really deleted. Instead, a tombstone is created to mark the deletion.
  • Tombstones are removed much later, during compactions.

SLIDE 7

Unexpected Tombstones: Built-in Maps, Lists, Sets_

  • Inserts / updates on collections
  • Frozen collections
    • treat the collection as one big blob
    • no tombstones on insert
    • do not support field updates
  • Non-frozen collections
    • incremental updates without tombstones
    • tombstones for every other update/insert
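The difference can be sketched in CQL (hypothetical table, illustrative names):

```sql
CREATE TABLE user_prefs (
    user_id uuid PRIMARY KEY,
    tags    set<text>,                -- non-frozen: supports incremental updates
    address frozen<map<text, text>>   -- frozen: stored and replaced as one blob
);

-- Incremental update of the non-frozen set: no tombstone is written.
UPDATE user_prefs SET tags = tags + {'new-tag'}
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;

-- Overwriting a whole non-frozen collection first writes a tombstone for the old one.
UPDATE user_prefs SET tags = {'only-tag'}
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;
```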

SLIDE 8

Debug tool: sstable2json_

  • sstable2json shows an SSTable file in JSON format
  • Usage: go to /var/lib/cassandra/data/keyspace/table
  • > sstable2json *-Data.db
  • See the individual rows of the data files
  • from Cassandra 3.6: sstabledump

SLIDE 9

Example_

CREATE TABLE customer_cache.tenant (
    name text PRIMARY KEY,
    status text
);

SELECT * FROM tenant;

 name | status
------+--------
   ru | ACTIVE
   es | ACTIVE
   jp | ACTIVE
   vn | ACTIVE
   pl | ACTIVE
   cz | ACTIVE

SLIDE 10

Example_

{"key": "ru", "cells": [["status","ACTIVE",1464344127007511]]},
{"key": "it", "cells": [["status","ACTIVE",1464344146457930, T]]},
{"key": "de", "cells": [["status","ACTIVE",1464343910541463]]},
{"key": "ro", "cells": [["status","ACTIVE",1464344151160601]]},
{"key": "fr", "cells": [["status","ACTIVE",1464344072061135]]},
{"key": "cn", "cells": [["status","ACTIVE",1464344083085247]]},
{"key": "kz", "cells": [["status","ACTIVE",1467190714345185]]}

The trailing "T" flag on the "it" row is the deletion marker (tombstone).

SLIDE 11

Bulk Reads or Writes_

  • synchronous queries introduce unnecessary delay

[Timeline diagram: client sends queries to Cassandra one after another, t to t+5]

SLIDE 12

Bulk Reads or Writes: Async_

  • parallel async queries

[Timeline diagram: client sends queries to Cassandra in parallel, t to t+5]

SLIDE 13

Example_

Session session = cc.openSession();
PreparedStatement getEntries = session.prepare(
    "SELECT * FROM keyspace.table WHERE key=?");

private List<ResultSetFuture> sendQueries(Collection<String> keys) {
    List<ResultSetFuture> futures =
        Lists.newArrayListWithExpectedSize(keys.size());
    for (String key : keys) {
        futures.add(session.executeAsync(getEntries.bind(key)));
    }
    return futures;
}

SLIDE 14

Example_

private void processAsyncResults(List<ResultSetFuture> futures)
        throws InterruptedException, ExecutionException {
    for (ListenableFuture<ResultSet> future : Futures.inCompletionOrder(futures)) {
        ResultSet rs = future.get();
        if (rs.getAvailableWithoutFetching() > 0 || rs.one() != null) {
            // do your program logic here
        }
    }
}
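The process-results-as-they-complete idea behind Guava's Futures.inCompletionOrder can be sketched with plain JDK CompletableFutures, no cluster required; all names here are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class CompletionOrder {

    // Collect results in the order they complete, not the order they were sent.
    static <T> List<T> inCompletionOrder(List<CompletableFuture<T>> futures)
            throws InterruptedException {
        BlockingQueue<T> done = new LinkedBlockingQueue<>();
        for (CompletableFuture<T> f : futures) {
            f.thenAccept(done::add);   // enqueue each result as soon as it is ready
        }
        List<T> results = new ArrayList<>();
        for (int i = 0; i < futures.size(); i++) {
            results.add(done.take());  // blocks until the next result arrives
        }
        return results;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        // The "slow" query is submitted first but finishes last.
        CompletableFuture<String> slow = CompletableFuture.supplyAsync(() -> {
            try { Thread.sleep(200); } catch (InterruptedException e) { }
            return "slow";
        }, pool);
        CompletableFuture<String> fast = CompletableFuture.supplyAsync(() -> "fast", pool);
        System.out.println(inCompletionOrder(List.of(slow, fast)));  // [fast, slow]
        pool.shutdown();
    }
}
```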

SLIDE 15

Separating Data of Different Tenants_

  • One keyspace per tenant?
  • One (set of) table(s) per tenant?
  • Our option: table per tenant
  • Feasible only for a limited number of tenants (~1000)

SLIDE 16

Monitoring_

  • Switch on monitoring
    • ELK, OpsCenter, self-built, ...
  • Avoid log level DEBUG for C* messages
    • drowning in irrelevant messages
    • substantial performance drawback
  • Log level INFO for development and pre-production
  • Log level ERROR is sufficient in production

SLIDE 17

Monitoring: Disk Space_

  • Cassandra never checks if there is enough space left on disk for writing
  • Keeps writing data until the disk is full
  • Can bring the OS to a halt
  • Cassandra error messages are confusing at this point
  • Thus monitoring disk space is mandatory

SLIDE 18

Monitoring: Disk Space_

  • A lot of disk space is required for compaction
    • e.g. SizeTieredCompaction needs up to 50% free disk space
  • Set up monitoring on disk space
  • Alert if the data-carrying disk partition fills up to 50%
  • Add nodes to the cluster and rebalance

SLIDE 19

Lessons Learned with Spark (Streaming)

SLIDE 20

Quick Recap - Spark Resources_

https://spark.apache.org/docs/latest/cluster-overview.html

  • a worker node can run multiple executors
  • executors have memory and cores
  • cores define the degree of parallelization

SLIDE 21

Scaling Spark_

  • Resource allocation is static per application
  • Streaming jobs need fixed resources over a long time
  • Unused resources for the driver
  • Overestimate resources for peak load

SLIDE 22

Scaling - Overallocating_

  • A Spark core is just a logical abstraction
  • Microbatches idle most of the time
  • Beware of overusing CPUs
  • Leave room for temporary glitches

SLIDE 23

Use back pressure mechanism_

  • Bursts of data increase processing time
  • May result in OOM

  • spark.streaming.backpressure.enabled
  • spark.streaming.backpressure.initialRate
  • spark.streaming.kafka.maxRatePerPartition
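A possible configuration sketch for these three settings; the values are illustrative assumptions and must be tuned per workload:

```properties
spark.streaming.backpressure.enabled=true
# rate used for the first batch, before the estimator has feedback
spark.streaming.backpressure.initialRate=1000
# hard cap per Kafka partition, in records per second
spark.streaming.kafka.maxRatePerPartition=2000
```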

SLIDE 24

Lookup additional data_

  • In batch: just load it when needed
  • In streaming:
    • long-running application
    • Is the data static?
    • Does it change over time? How frequently?

SLIDE 25

Lookup additional data_

  • Broadcast data
    • static data
    • load once at the start of the application
  • Use mapPartitions()
    • connection & lookup for every partition
    • high load
    • connection overhead

SLIDE 26

Lookup additional data_

  • Broadcast connection
    • lookup for every partition
    • connection created once per executor
    • still high load on the datasource
  • mapWithState()
    • maintains keyed state
    • initial state at application start
    • technical messages trigger updates
    • can only be used with a key (no "update all")

SLIDE 27

Don’t hide the Spark UI_

SLIDE 28

Don’t hide the Spark UI_

  • missing information, e.g. for streaming
  • crucial for debugging
  • do not build it yourself!
    • high frequency of events
    • not all data available via the REST API
  • use the history server to see stopped/failed jobs

SLIDE 29

Event Time Support Yet To Come_

  • Support starting with Spark 2.1
  • Still alpha
  • Concepts in place, implementation ongoing
  • Solve some problems on your own, e.g. an event time join

[Diagram: event time vs. processing time, t = 1 to 9 minutes]

SLIDE 30

Operating Spark is not easy_

  • First of all: it is distributed
  • Centralized logging and monitoring
    • availability
    • performance
    • errors
    • system load

SLIDE 31

Lessons Learned with Cassandra & Spark

SLIDE 32

repartitionByCassandraReplica_

[Diagram: token ranges 1-25, 26-50, 51-75, 76-0 spread across Node 1 to Node 4]

SLIDE 33

repartitionByCassandraReplica_

[Diagram: token ranges 1-25, 26-50, 51-75, 76-0 spread across Node 1 to Node 4]

some tasks took ~3s longer...

SLIDE 34

Spark locality_

  • Watch the Spark locality level
  • aim for process or node local
  • avoid "any"

SLIDE 35

Do not use repartitionByCassandraReplica when ...

  • the Spark job does not run on every C* node
  • # Spark nodes < # Cassandra nodes
  • # job cores < # Cassandra nodes
  • all Spark job cores are on one node
  • time for repartition > time saved through locality

SLIDE 36

joinWithCassandraTable_

  • one query per partition key
  • one query at a time per executor

[Timeline diagram: Spark sends queries to Cassandra one after another, t to t+5]

SLIDE 37

joinWithCassandraTable_

  • parallel async queries

[Timeline diagram: Spark sends queries to Cassandra in parallel, t to t+5]

SLIDE 38

joinWithCassandraTable_

  • built a custom async implementation

someDStream.transformToPair(rdd -> {
    return rdd.mapPartitionsToPair(iterator -> {
        ...
        Session session = cc.openSession();
        while (iterator.hasNext()) {
            ...
            session.executeAsync(..)
        }
        [collect futures]
        return List<Tuple2<Left, Right>>
    });
});

SLIDE 39

joinWithCassandraTable_

  • solved with SPARKC-233 (connector versions 1.6.0 / 1.5.1 / 1.4.3)
  • 5-6 times faster than the sync implementation!

SLIDE 40

Left join with Cassandra_

  • joinWithCassandraTable is a full inner join

[Venn diagram: only the overlap of the RDD and C* data survives the join]

SLIDE 41

Left join with Cassandra_

  • Might include a shuffle --> quite expensive

[Diagram: left join composed of a join, a subtract, and a union of the RDD with C* data]

SLIDE 42

Left join with Cassandra_

  • built a custom async implementation

someDStream.transformToPair(rdd -> {
    return rdd.mapPartitionsToPair(iterator -> {
        ...
        Session session = cc.openSession();
        while (iterator.hasNext()) {
            ...
            session.executeAsync(..)
            ...
        }
        [collect futures]
        return List<Tuple2<Left, Optional<Right>>>
    });
});

SLIDE 43

Left join with Cassandra_

  • solved with SPARKC-181 (connector version 2.0.0)
  • basically uses the async joinWithC* implementation

SLIDE 44

Connection keep alive_

  • spark.cassandra.connection.keep_alive_ms
  • Default: 5s
  • Streaming batch size > 5s => a new connection is opened for every batch
  • Should be several times the streaming interval!
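For example, with a hypothetical 10 s streaming interval the keep-alive could be raised to cover several batches (illustrative value, not from the talk):

```properties
# survives three 10 s batch intervals between writes
spark.cassandra.connection.keep_alive_ms=30000
```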

SLIDE 45

Cache! Not only for performance_

  • cache saves performance by preventing recalculation
  • it also helps you with regard to correctness!

val changedStream = someDStream.map(e -> someMethod(e)).cache()
changedStream.saveToCassandra("keyspace", "table1")
changedStream.saveToCassandra("keyspace", "table2")

ChangedEntry someMethod(Entry e) {
    return new ChangedEntry(new Date(), ...);
}
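The correctness issue can be reproduced without Spark: an uncached stream re-runs the map function for every output, so the two writes see different values, while caching evaluates it once. A minimal stand-in, where a counter plays the role of new Date():

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class CacheForCorrectness {

    public static void main(String[] args) {
        AtomicInteger clock = new AtomicInteger();  // deterministic stand-in for new Date()

        // Without cache: every write re-runs the map function and sees a different value.
        Supplier<Integer> uncached = clock::incrementAndGet;
        int table1 = uncached.get();
        int table2 = uncached.get();
        System.out.println(table1 == table2);  // false: the two tables disagree

        // With cache: the value is computed once and both writes reuse it.
        int cached = clock.incrementAndGet();
        int table1b = cached;
        int table2b = cached;
        System.out.println(table1b == table2b);  // true: the two tables agree
    }
}
```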

SLIDE 46

Summary_

  • Know the most important internals
  • Know your tools
  • Monitor your cluster
  • Use existing knowledge resources
  • Use the mailing lists
  • Participate in the community

SLIDE 47

Questions?

Matthias Niehoff
IT-Consultant

codecentric AG
Hochstraße 11
42697 Solingen, Germany
matthias.niehoff@codecentric.de
www.codecentric.de
blog.codecentric.de
@matthiasniehoff