OVERVIEW OF CASSANDRA WHY WOULD YOU NAME A DATABASE AFTER A GREEK - - PowerPoint PPT Presentation



slide-1
SLIDE 1

OVERVIEW OF CASSANDRA

WHY WOULD YOU NAME A DATABASE AFTER A GREEK MYTH OF NOT BEING LISTENED TO?

slide-2
SLIDE 2

AGENDA

  • Bio
  • Basic C* Data Model
  • Replication vs. Quorum
  • Failure Recovery
  • AWS Implications
  • Monitoring
  • Version issues and tombstoning
  • Maintenance Tasks
  • The Dark Side
slide-3
SLIDE 3

QUICK BIO

  • Programming since 1981
  • Four patents
  • 2010 JavaOne Rock Star and Duke’s Choice winner
  • Frequent contributor to Pragmatic Programmer magazine, SearchAws.com and LinkinPulse News
  • briantarbox.org, log4jfugue.org, BrianTarbox@gmail.com

slide-4
SLIDE 4

C* DATA MODEL; COMPARISON

  • “When all you have is a hammer, everything looks like a thumb.” - Morgan

  • Relational (tabular) model: PostgreSQL
  • Relationship (graph) model: Neo4j
  • Document model: MongoDB
  • Time-series model: C*
slide-5
SLIDE 5

THE REAL DIFFERENCE BETWEEN SQL AND “NO-SQL”

  • In SQL we’re trained to design based on storing the data, ideally in third normal form. Queries are bolted on later.
  • In no-SQL we design based on the queries we’ll perform. “Table” structure falls out of that.
  • Queries should get top billing, because if you just store the data, who cares?

slide-6
SLIDE 6

C* DATA MODEL, EXAMPLE

  • Wide rows; wide columns; heterogeneous columns
  • For example, a row per stock, with each column being all we know about that stock for that day.
  • Designed to be easy to “select” a row and then read thousands of columns sequentially
  • Not designed to randomly select specific columns
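The wide-row idea above can be sketched in plain Python. This is only an in-memory illustration, not how C* stores data on disk; the class name `WideRowStore`, its methods, and the sample prices are all made up for the example.

```python
from collections import defaultdict


class WideRowStore:
    """Hypothetical stand-in for a wide row: one row per ticker, one column per day."""

    def __init__(self):
        self._rows = defaultdict(dict)  # row key -> {column name: value}

    def insert(self, row_key, column_name, value):
        self._rows[row_key][column_name] = value

    def read_row(self, row_key):
        """Yield (column name, value) pairs in column-name order."""
        row = self._rows.get(row_key, {})
        for name in sorted(row):
            yield name, row[name]


store = WideRowStore()
# Made-up sample data; note that the columns do not all carry the same fields.
store.insert("IBM", "2014-03-04", {"close": 186.4, "volume": 4_100_000})
store.insert("IBM", "2014-03-03", {"close": 184.3})
store.insert("IBM", "2014-03-05", {"close": 187.1, "note": "ex-dividend"})

# Select the row once, then walk its columns sequentially.
ibm_days = [day for day, _ in store.read_row("IBM")]
```

Reading the whole row in order is cheap in this layout, while picking scattered individual columns out of many rows is the access pattern the slide warns against.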
slide-7
SLIDE 7

CQL, SLICE PREDICATES

  • In Postgres you might say “select * from stock where ticker = 'IBM' and price > 100”
  • You simply cannot do that with C*
  • SQL uses indexes to speed up access to rows; indexes are very problematic in C*
  • Often the C* answer is denormalization
slide-8
SLIDE 8

SLICE PREDICATES

  • Columns have names (e.g. “date”), but columns can also contain many (hundreds of) values.
  • Slice predicates let you specify which columns to select
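As a rough illustration of what a slice predicate does, the sketch below selects a contiguous range of columns from one row whose columns are sorted by name. The function `slice_columns` and its parameter names are hypothetical, chosen for this example rather than taken from any C* client API, and the prices are made-up sample data.

```python
from bisect import bisect_left, bisect_right


def slice_columns(sorted_columns, start, finish, count=None):
    """Return the columns whose names fall in [start, finish], in name order.

    sorted_columns is a list of (name, value) tuples already sorted by name.
    """
    names = [name for name, _ in sorted_columns]
    lo = bisect_left(names, start)
    hi = bisect_right(names, finish)
    selected = sorted_columns[lo:hi]
    return selected if count is None else selected[:count]


# One row for a single ticker, one column per day (made-up sample data).
row = [
    ("2014-03-03", 184.3),
    ("2014-03-04", 186.4),
    ("2014-03-05", 187.1),
    ("2014-03-06", 185.9),
]

march_4_to_5 = slice_columns(row, "2014-03-04", "2014-03-05")
first_two = slice_columns(row, "2014-03-01", "2014-03-31", count=2)
```

Because the columns are kept sorted, a range slice is two binary searches plus a sequential read, which fits the "select a row, then read columns in order" design.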

slide-9
SLIDE 9

CLIQUE, INC.: C* ANTIPATTERN

  • My last company folded, but not before providing a C* antipattern
  • Collaboration software; many ad-hoc queries (who’s in what context, where was “x” said, etc.)
  • We ended up with 14 copies of the main data, each in its own column family.
  • Bad Dog.
slide-10
SLIDE 10

REPLICATION VS. READ/WRITE LEVEL

  • Replication refers to how many distinct copies of the data there are
  • Read/Write Level refers to how many of the replicas must respond/agree before proceeding
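The relationship between the two settings comes down to simple arithmetic, sketched below. The function names are illustrative. The rule encoded here is the usual one: a quorum is a majority of the replicas, and a read is guaranteed to overlap the latest acknowledged write when the read level plus the write level exceeds the replication factor.

```python
def quorum(replication_factor):
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor, write_level, read_level):
    """True when every set of read replicas must overlap every set of written replicas."""
    return write_level + read_level > replication_factor


rf = 3
q = quorum(rf)  # 2 of 3 replicas

quorum_both_ways = reads_see_latest_write(rf, write_level=q, read_level=q)
one_both_ways = reads_see_latest_write(rf, write_level=1, read_level=1)
```

With three replicas, quorum reads and quorum writes always share at least one replica, while reading and writing at a level of one can miss each other entirely.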

slide-11
SLIDE 11

THE WRITE PATH

  • The client picks a C* node at random, and that node becomes the coordinator (diagram). The coordinator sends the write to the replica nodes, then waits until ’n’ of them respond before returning.
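The write path can be approximated with a toy simulation. Everything here is hypothetical (the `Node` class, `coordinate_write`, and the sample values), and the sketch writes to replicas one after another, whereas a real coordinator contacts them in parallel and returns as soon as enough have answered.

```python
import random


class Node:
    """Hypothetical replica node holding key -> (timestamp, value)."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.data = {}

    def write(self, key, value, timestamp):
        if not self.up:
            return False  # a down node never acknowledges
        self.data[key] = (timestamp, value)
        return True


def coordinate_write(cluster, replicas, key, value, timestamp, required_acks):
    """Pick a random live coordinator, write to every replica, count the acks."""
    coordinator = random.choice([node for node in cluster if node.up])
    acks = sum(1 for replica in replicas if replica.write(key, value, timestamp))
    return coordinator.name, acks >= required_acks


a, b, c, d = Node("a"), Node("b"), Node("c", up=False), Node("d")
cluster = [a, b, c, d]
replicas = [a, b, c]  # three replicas for this key, one of them down

_, ok_at_two = coordinate_write(cluster, replicas, "IBM", 186.4, 1, required_acks=2)
_, ok_at_all = coordinate_write(cluster, replicas, "IBM", 186.4, 2, required_acks=3)
```

With one replica down, the write succeeds when two acknowledgements are required and fails when all three are required, even though the two live replicas stored the value in both cases.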

slide-12
SLIDE 12

THE READ PATH

  • Diagram: the coordinator sends the read to all nodes with the data and waits for ’n’ to respond
  • Read Repair
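A minimal sketch of a read with read repair, assuming each replica is a plain dict of key -> (timestamp, value) and that the newest timestamp wins. The function name and the choice to contact only the first ’n’ replicas are simplifications made for this example.

```python
def read_with_repair(replicas, key, required_responses):
    """Read from the first `required_responses` replicas and repair stale copies.

    Each replica is a dict mapping key -> (timestamp, value).
    """
    contacted = replicas[:required_responses]
    answers = [replica[key] for replica in contacted if key in replica]
    if not answers:
        return None
    newest = max(answers, key=lambda cell: cell[0])  # highest timestamp wins
    for replica in contacted:
        if replica.get(key) != newest:
            replica[key] = newest  # read repair: push the newest cell back
    return newest[1]


r1 = {"IBM": (1, "old")}
r2 = {"IBM": (2, "new")}
r3 = {"IBM": (1, "old")}

value = read_with_repair([r1, r2, r3], "IBM", required_responses=2)
```

After the call, `r1` has been brought up to date because it took part in the read, while `r3` was never contacted and still holds the stale value.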
slide-13
SLIDE 13

FAILURE RECOVERY - WHERE C* REALLY SHINES

  • What happens when a node fails?
  • How many nodes can fail w/o data loss depends on # nodes and # replicas
  • Auto-recovery vs. backup and restore
  • With the usual caveats… C* recovery “just works”
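To make the dependence on node count and replica count concrete, here is a brute-force sketch. It assumes a simplified ring in which each range is stored on its owning node plus the next nodes clockwise, with no virtual nodes or rack awareness; the function names are illustrative.

```python
from itertools import combinations


def replicas_for(owner, node_count, replication_factor):
    """Nodes holding one range: the owner plus the next RF-1 nodes clockwise."""
    return {(owner + i) % node_count for i in range(replication_factor)}


def loses_data(failed, node_count, replication_factor):
    """True if some range has every one of its replicas in the failed set."""
    failed = set(failed)
    return any(
        replicas_for(owner, node_count, replication_factor) <= failed
        for owner in range(node_count)
    )


def max_safe_failures(node_count, replication_factor):
    """Largest k such that no combination of k failed nodes loses data."""
    k = 0
    while k < node_count and not any(
        loses_data(combo, node_count, replication_factor)
        for combo in combinations(range(node_count), k + 1)
    ):
        k += 1
    return k


safe = max_safe_failures(node_count=6, replication_factor=3)
spread_out_failure = loses_data({0, 2, 4}, node_count=6, replication_factor=3)
adjacent_failure = loses_data({0, 1, 2}, node_count=6, replication_factor=3)
```

Under these assumptions, any two nodes can fail safely with three replicas. Three failures may or may not lose data, depending on whether the failed nodes happen to hold all copies of the same range.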
slide-14
SLIDE 14

RUNNING C* ON AWS

  • Scale out not up
  • More spindles is better
  • Log dir vs. data dir
  • Selecting the right instance type
  • You must run with NTP (not an AWS standard)
slide-15
SLIDE 15

CONFIGURATION

  • The main C* config file is 700 lines long
  • You really need to deeply understand most of it.
  • cluster_name, listen_address, commitlog_directory, endpoint_snitch, seed_provider, compaction_throughput_mb_per_sec, concurrent_reads, snapshot_before_compaction, phi_convict_threshold, commitlog_sync, partitioner, key_cache_size_in_mb, row_cache_save_period, tombstone_warn_threshold, read_request_timeout_in_ms, cross_node_timeout, internode_compression, inter_dc_tcp_nodelay, dynamic_snitch_badness_threshold, dynamic_snitch_update_interval, hinted_handoff_enabled, max_hints_delivery_threads, …

slide-16
SLIDE 16

MONITORING YOUR C* CLUSTER

slide-17
SLIDE 17

VERSION ISSUES AND TOMBSTONES

  • Life is better if you never delete records
  • If you delete you can end up with tombstones
  • To deal with tombstones you need to run Repair… and that is a whole nasty can of worms.
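Why tombstones and repair are tied together can be shown with a toy model. It assumes last-write-wins cells of the form (timestamp, value), a delete that writes a marker instead of removing the cell, and a grace period after which markers may be purged (in C* this is governed by the gc_grace_seconds table setting). The helper names are made up for the example.

```python
TOMBSTONE = object()  # marker written in place of a deleted value


def merge(a, b):
    """Last-write-wins merge of two replicas (dict key -> (timestamp, value))."""
    merged = dict(a)
    for key, cell in b.items():
        if key not in merged or cell[0] > merged[key][0]:
            merged[key] = cell
    return merged


def purge_tombstones(replica, now, gc_grace):
    """Drop tombstones that are at least gc_grace old."""
    return {
        key: cell
        for key, cell in replica.items()
        if not (cell[1] is TOMBSTONE and now - cell[0] >= gc_grace)
    }


def visible(replica, key):
    cell = replica.get(key)
    return None if cell is None or cell[1] is TOMBSTONE else cell[1]


# Both replicas hold the record; the delete at t=5 reaches only replica A.
replica_a = {"user": (5, TOMBSTONE)}
replica_b = {"user": (1, "alice")}

# Repair before the tombstone is purged: the delete wins everywhere.
repaired_in_time = visible(merge(replica_a, replica_b), "user")

# Tombstone purged first, repair afterwards: the deleted record comes back.
purged_a = purge_tombstones(replica_a, now=20, gc_grace=10)
repaired_too_late = visible(merge(purged_a, replica_b), "user")
```

In the second case nothing is left to tell the repair that "alice" was deleted, so the old value is copied back. That is the reason repair has to run within the grace period on clusters that delete data.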

slide-18
SLIDE 18

MAINTENANCE TASKS

  • Full (major) and minor compactions
  • snapshot your disks if using AWS/EBS
slide-19
SLIDE 19

THE DARK SIDE, PART 1

  • DataStax maintains three parallel release branches, with vastly different feature sets
  • New releases are always unstable; never accept an n.0, n.1 or n.2 release

slide-20
SLIDE 20

THE DARK SIDE, PART 2

  • C* uses schema-less design
  • Requires knowledge of slice predicates rather than SQL
  • DataStax decided to adopt schema and CQL to gain market share at the expense of their soul.
  • You can now pretend C* is relational (except no indexes and mostly no where clauses)