Lessons Learned with Cassandra & Spark_
Matthias Niehoff Apache: Big Data 2017
@matthiasniehoff @codecentric
Our Use Cases_
[Diagram: use cases that read from, write to, and join with Cassandra]
Lessons Learned with Cassandra_
Data modeling: Primary key_
Your queries define the primary key.
New query => create a new table
No suitable model at all => Cassandra might not be a good fit
Care about bucketing_
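As an illustration, a minimal sketch of time bucketing, assuming a hypothetical keyspace sensors and table sensor_data: pulling a day column into the partition key bounds partition size to one day per sensor.

import com.datastax.driver.core.Cluster

val session = Cluster.builder().addContactPoint("127.0.0.1").build().connect()

session.execute(
  """CREATE TABLE IF NOT EXISTS sensors.sensor_data (
    |  sensor_id text,
    |  day       date,       -- the bucket: one partition per sensor and day
    |  ts        timestamp,
    |  value     double,
    |  PRIMARY KEY ((sensor_id, day), ts)
    |)""".stripMargin)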
Data modeling: Deletions_
If you delete a column or a whole row, the data is not really deleted. Rather, a tombstone is created to mark the deletion. The actual data is only removed later, during compactions (after gc_grace_seconds).
Unexpected Tombstones: Built-in Maps, Lists, Sets_
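A sketch of where these tombstones come from, using a hypothetical users table with a set column: assigning the whole collection implicitly deletes its previous contents (a range tombstone) before writing the new elements, while appending to it does not.

import com.datastax.driver.core.Cluster

val session = Cluster.builder().addContactPoint("127.0.0.1").build().connect("test")

// overwrite: writes a tombstone for the old set contents, then the new value
session.execute("UPDATE users SET emails = {'a@example.org'} WHERE id = 1")

// append: only adds a cell, no tombstone
session.execute("UPDATE users SET emails = emails + {'b@example.org'} WHERE id = 1")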
Debug tool: sstable2json_
Example_
CREATE TABLE customer_cache.tenant (
  name text PRIMARY KEY,
  status text
)

select * from tenant;

 name | status
------+--------
   ru | ACTIVE
   es | ACTIVE
   jp | ACTIVE
   vn | ACTIVE
   pl | ACTIVE
   cz | ACTIVE
Example_
{"key": "ru", "cells": [["status","ACTIVE",1464344127007511]]}, {"key": "it", "cells": [[„status“,"ACTIVE",1464344146457930, T]]}, {"key": "de", "cells": [["status","ACTIVE",1464343910541463]]}, {"key": "ro", "cells": [["status","ACTIVE",1464344151160601]]}, {"key": "fr", "cells": [["status","ACTIVE",1464344072061135]]}, {"key": "cn", "cells": [["status","ACTIVE",1464344083085247]]}, {"key": "kz", "cells": [["status","ACTIVE",1467190714345185]]}
deletion marker
Bulk Reads or Writes_
[Diagram: synchronous requests on a timeline t to t+5, the client waits for each response before sending the next request]
Bulk Reads or Writes: Async_
[Diagram: asynchronous requests on a timeline t to t+5, the client sends all requests at once and processes responses as they arrive]
Example_
import com.datastax.driver.core.*;        // Session, PreparedStatement, ResultSetFuture
import com.google.common.collect.Lists;

Session session = cc.openSession();
PreparedStatement getEntries =
    session.prepare("SELECT * FROM keyspace.table WHERE key=?");

private List<ResultSetFuture> sendQueries(Collection<String> keys) {
  List<ResultSetFuture> futures = Lists.newArrayListWithExpectedSize(keys.size());
  for (String key : keys) {               // fire all queries without blocking
    futures.add(session.executeAsync(getEntries.bind(key)));
  }
  return futures;
}
Example_
import com.google.common.util.concurrent.*;   // Futures, ListenableFuture

private void processAsyncResults(List<ResultSetFuture> futures)
    throws InterruptedException, ExecutionException {
  // process results in completion order, not in submission order
  for (ListenableFuture<ResultSet> future : Futures.inCompletionOrder(futures)) {
    ResultSet rs = future.get();
    if (rs.getAvailableWithoutFetching() > 0 || rs.one() != null) {
      // do your program logic here
    }
  }
}
Separating Data of Different Tenants_
Monitoring_
Monitoring: Disk Space_
During compactions additional free disk space is needed: up to 50% (worst case, e.g. with size-tiered compaction).
Lessons Learned with Spark_
Quick Recap - Spark Resources_
https://spark.apache.org/docs/latest/cluster-overview.html
A worker node can run multiple executors. Executors have memory and cores. The number of cores defines the degree of parallelization.
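As a sketch, the same knobs expressed as configuration; the numbers are arbitrary examples, not recommendations:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.instances", "4")  // executors per application (on YARN)
  .set("spark.executor.cores", "2")      // cores per executor => parallel tasks
  .set("spark.executor.memory", "4g")    // heap per executor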
Scaling Spark_
Scaling - Overallocating_
Use back pressure mechanism_
spark.streaming.backpressure.enabled
spark.streaming.backpressure.initialRate
spark.streaming.kafka.maxRatePerPartition
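A minimal sketch of setting these properties, assuming a Kafka-backed stream; the rate values are made-up starting points:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // let Spark adapt the ingestion rate to the processing speed
  .set("spark.streaming.backpressure.enabled", "true")
  // rate used before the first feedback is available (assumption: 1000 records/s)
  .set("spark.streaming.backpressure.initialRate", "1000")
  // hard upper bound per Kafka partition (assumption: 5000 records/s)
  .set("spark.streaming.kafka.maxRatePerPartition", "5000")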
Lookup additional data_
Don’t hide the Spark UI_
Event Time Support Yet To Come_
[Diagram: events arriving over a timeline of 1 to 9 minutes, event time vs. processing time]
Operating Spark is not easy_
Lessons Learned with Cassandra & Spark_
repartitionByCassandraReplica_
[Diagram: 4 nodes owning token ranges 1-25, 26-50, 51-75, 76-0; partitions are shuffled to the nodes that hold replicas of their keys]
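A sketch with the Scala connector API; TenantKey and the key values are hypothetical, and partitionsPerHost is a tuning knob:

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// hypothetical key class; its fields must match the table's partition key
case class TenantKey(name: String)

val conf = new SparkConf()
  .setAppName("repartition-example")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

val keys = sc.parallelize(Seq(TenantKey("de"), TenantKey("fr"), TenantKey("ru")))

// shuffle each key to a Spark partition on a node that is a replica for it,
// so the subsequent join reads node-local data only
val joined = keys
  .repartitionByCassandraReplica("customer_cache", "tenant", partitionsPerHost = 10)
  .joinWithCassandraTable("customer_cache", "tenant")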
some tasks took ~3s longer..
Spark locality_
Do not use repartitionByCassandraReplica when ...
joinWithCassandraTable_
[Diagram: synchronous join on a timeline t to t+5, one Cassandra request per element, each answered before the next is sent]
joinWithCassandraTable_
[Diagram: asynchronous join on a timeline t to t+5, several Cassandra requests in flight at once]
joinWithCassandraTable_
someDStream.transformToPair(rdd ->
  rdd.mapPartitionsToPair(iterator -> {
    ...
    try (Session session = cc.openSession()) {
      while (iterator.hasNext()) {
        ...
        session.executeAsync(..);
      }
      [collect futures]
      return ...; // List<Tuple2<Left, Right>>
    }
  })
);
Faster than the sync implementation! (as of connector 1.6.0 / 1.5.1 / 1.4.3, joinWithCassandraTable itself executes asynchronously)
joinWithCassandraTable_
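With a current connector the hand-rolled async join above is no longer necessary; a sketch of the built-in call, reusing the hypothetical TenantKey RDD from the repartition example:

import com.datastax.spark.connector._

// keys: RDD[TenantKey], as built in the repartitionByCassandraReplica sketch.
// Inner join by partition key: yields RDD[(TenantKey, CassandraRow)];
// keys without a matching row are dropped.
val joined = keys.joinWithCassandraTable("customer_cache", "tenant")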
Left join with Cassandra_
[Diagram: join the RDD with C* => matched pairs; subtract the matched keys from the RDD => unmatched entries; union both => left join result]
Left join with Cassandra_
someDStream.transformToPair(rdd ->
  rdd.mapPartitionsToPair(iterator -> {
    ...
    try (Session session = cc.openSession()) {
      while (iterator.hasNext()) {
        ...
        session.executeAsync(..);
        ...
      }
      [collect futures]
      return ...; // List<Tuple2<Left, Optional<Right>>>
    }
  })
);
(since connector 2.0.0 there is a built-in leftJoinWithCassandraTable implementation)
Left join with Cassandra_
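Since connector 2.0.0 the hand-rolled variant above can be replaced by the built-in call; a sketch with the same hypothetical keys RDD:

import com.datastax.spark.connector._

// left outer join: yields RDD[(TenantKey, Option[CassandraRow])],
// None for keys that have no matching row in Cassandra
val left = keys.leftJoinWithCassandraTable("customer_cache", "tenant")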
Connection keep alive_
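One sketch of the corresponding setting: keep the connector's Cassandra connections alive longer than the streaming batch interval, so they are reused across micro-batches; the value is an assumption for a 10s batch interval:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // the connector closes idle connections after this timeout;
  // choose it larger than the batch interval to reuse connections
  .set("spark.cassandra.connection.keep_alive_ms", "30000")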
Cache! Not only for performance_
val changedStream = someDStream.map(e => someMethod(e)).cache()

// without cache() the map, and thus new Date(), would run again for
// every output action, so the two tables could get different timestamps
changedStream.saveToCassandra("keyspace", "table1")
changedStream.saveToCassandra("keyspace", "table2")

def someMethod(e: Entry): ChangedEntry = new ChangedEntry(new Date(), ...)
Summary_
Questions?
Matthias Niehoff
IT-Consultant
codecentric AG
Hochstraße 11
42697 Solingen, Germany
matthias.niehoff@codecentric.de
www.codecentric.de
blog.codecentric.de
@matthiasniehoff