

SLIDE 1

University of Rome “Tor Vergata”, Department of Civil Engineering and Computer Science Engineering

Search and Time Series Databases

Course on Systems and Architectures for Big Data, Academic Year 2016/17, Valeria Cardellini

SLIDE 2

The reference Big Data stack

Valeria Cardellini - SABD 2016/17 1

Layers of the stack (figure): Resource Management, Data Storage, Data Processing, High-level Interfaces, Support / Integration

SLIDE 3

Why search platforms?

  • How to find documents that match queries?

– With text search that is faster than in RDBMSs

  • How to obtain specific features?

– Such as highlighting, spatial search, suggestions, guided navigation, …

SLIDE 4

Search engines

  • Most popular search platforms:

– Apache Solr
– Elasticsearch

  • ETL process

SLIDE 5

Apache Solr

  • Scalable, highly reliable, open-source framework for searching data
  • Built on Apache Lucene

– Open-source library for indexing and search
– Used by Solr for full-text search

  • Can index documents in XML, JSON, CSV and binary formats
  • Runs as a Java web application
  • Provides a REST-like web service that exposes services to manage the lifecycle of documents in the index (indexing, querying, …)
  • Used by some of the most popular web apps (Apple, Instagram, LinkedIn, …)

SLIDE 6

Solr: key features

  • Faceting

– To group the results based on a specific field or defined criteria, providing the count of each subset
– Example: a shopping site can provide facets to narrow search results by manufacturer or price

  • Auto-suggest

– To present list of possible query terms

  • Spell check

– To suggest corrected spelling of query terms

  • Highlighting
  • Document clustering

– To group related documents in the search results

  • Spatial search

– To filter search results based on location
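As a rough illustration of what faceting computes, the sketch below counts matching documents per field value and per price range in plain Python. It does not call Solr; the documents, field names and range bounds are hypothetical and chosen only to mirror the shopping-site example.

```python
from collections import Counter

# Hypothetical search results; names, manufacturers and prices are made up.
results = [
    {"name": "Laptop A", "manufacturer": "Acme", "price": 899},
    {"name": "Laptop B", "manufacturer": "Acme", "price": 1299},
    {"name": "Laptop C", "manufacturer": "Globex", "price": 649},
    {"name": "Laptop D", "manufacturer": "Initech", "price": 1499},
]


def facet_by_field(docs, field):
    """Count how many matching documents carry each value of a field."""
    return Counter(doc[field] for doc in docs)


def facet_by_range(docs, field, bounds):
    """Count documents per half-open range [lo, hi) of a numeric field."""
    ranges = list(zip(bounds, bounds[1:]))
    counts = {f"{lo}-{hi}": 0 for lo, hi in ranges}
    for doc in docs:
        for lo, hi in ranges:
            if lo <= doc[field] < hi:
                counts[f"{lo}-{hi}"] += 1
    return counts


manufacturer_facets = facet_by_field(results, "manufacturer")
price_facets = facet_by_range(results, "price", [0, 1000, 2000])
print(dict(manufacturer_facets))
print(price_facets)
```

A user interface would show these counts next to each manufacturer or price band, so that the user can narrow the result set with one click.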

SLIDE 7

Solr: key features

  • Pagination and ranking of search results
  • Results grouping

– To group the results based on a grouping field and return the top documents in each group

  • Near real-time search

– To search documents immediately after they have been indexed; useful for apps with dynamically changing content (e.g., news)

  • More Like This

– Identifies other documents that are similar to one in a result set

SLIDE 8

Solr feature example

SLIDE 9

Solr components

SLIDE 10

Solr components

  • Request Handlers: handle a request at a URL

– E.g.: /select

  • Search Components: parts of a Search Handler, which is a componentized request handler

– Include: Query, Faceting, Highlighting, Debug, …
– Capable of distributed search

  • Update Handlers: handle an indexing request
  • Update Processor chain: per-handler componentized chain that handles updates
  • Query Parser plugins

– Mix and match query types in a single request
– Function plugins for Function Query

  • Text Analysis plugins: Analyzers, Tokenizers, TokenFilters
  • Response Writers: serialize and stream the response to the client
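To make the request path more concrete, here is a minimal sketch that builds a URL for the /select request handler with a query, pagination, faceting and a JSON response writer. The host, port 8983 and the collection name "products" are assumptions for illustration; the code only builds the URL and does not contact a server.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumptions: a local Solr instance and a hypothetical collection "products".
BASE_URL = "http://localhost:8983/solr/products/select"


def build_select_url(query, facet_field=None, start=0, rows=10):
    """Build a /select URL: q is the query, start/rows paginate, wt picks the response writer."""
    params = [("q", query), ("start", start), ("rows", rows), ("wt", "json")]
    if facet_field is not None:
        # Ask the faceting search component for counts on one field.
        params += [("facet", "true"), ("facet.field", facet_field)]
    return f"{BASE_URL}?{urlencode(params)}"


url = build_select_url("name:laptop", facet_field="manufacturer", rows=5)
print(url)
```

Each part of the URL maps to one component in the list: the path selects the request handler, `facet` switches on a search component, and `wt` selects the response writer.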

SLIDE 11

Scaling Solr: SolrCloud

  • How to provide distributed indexing and search capabilities?

– Up to millions of users and millions of indexed documents

  • SolrCloud: deployment functionality of Solr that allows setting up clusters of Solr servers

– Enables and simplifies horizontal scaling of a search index through replication and sharding
– Sharding: incoming queries are distributed to the shards in the collection, and the per-shard results are merged into a single response
– Replication: handles a higher concurrent query load by spreading the requests across multiple servers

  • No master node to allocate nodes, shards and replicas
  • SolrCloud uses ZooKeeper for storing shared configuration files and for coordination
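The sharding and replication ideas can be sketched in a few lines of Python. This is a toy model under stated assumptions, not SolrCloud's actual routing code: the MD5-based hash, the shard count, the replica names and the hit lists are all hypothetical.

```python
import hashlib
import heapq
from itertools import cycle

NUM_SHARDS = 3  # assumed number of shards in the collection


def shard_for(doc_id, num_shards=NUM_SHARDS):
    """Map a document id to a shard with a stable hash (MD5 is used only for illustration)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards


def merge_top_k(per_shard_hits, k):
    """Merge per-shard lists of (score, doc_id) hits into one global top-k by score."""
    all_hits = (hit for hits in per_shard_hits for hit in hits)
    return heapq.nlargest(k, all_hits)


# Replication: spread queries over the replicas of a shard in round-robin order.
replicas = cycle(["replica-1", "replica-2"])  # hypothetical replica names

# Each shard answers the same query with its own local hits.
per_shard_hits = [
    [(0.9, "doc-7"), (0.4, "doc-2")],
    [(0.8, "doc-5")],
    [(0.7, "doc-1"), (0.1, "doc-9")],
]
top3 = merge_top_k(per_shard_hits, k=3)
print(top3)
```

Sharding splits the index so that each server searches less data, while replication adds copies of the same shard so that more queries can be served at once.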

SLIDE 12

Solr distributed architecture

SLIDE 13

Elasticsearch

  • Distributed, multitenant-capable and scalable full-text search engine with a REST-based interface and schema-free JSON documents
  • Search engine based on Apache Lucene
  • Developed in Java
  • Distributed

– Indices can be divided into shards, and each shard can have zero or more replicas
– Each server hosts one or more shards and acts as a coordinator to delegate operations to the correct shard(s)
– Rebalancing and routing are done automatically
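The sketch below shows what the JSON bodies for such a setup might look like: index settings with shards and replicas, a schema-free document, and a full-text query. The index name "articles" and the field names are hypothetical, and the code only prints the requests instead of sending them. The endpoint for indexing a document is left out because its path depends on the Elasticsearch version.

```python
import json

# Assumption: a hypothetical index called "articles".
index_name = "articles"

# Index settings: 3 primary shards, each with 1 replica.
create_index_body = {
    "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 1,
    }
}
settings = create_index_body["settings"]
total_shard_copies = settings["number_of_shards"] * (1 + settings["number_of_replicas"])

# A schema-free JSON document: no field definitions are declared up front.
document = {"title": "Search and time series databases", "year": 2017}

# Full-text query on the "title" field.
search_body = {"query": {"match": {"title": "time series"}}}

requests_to_send = [
    ("PUT", f"/{index_name}", create_index_body),
    ("POST", f"/{index_name}/_search", search_body),
]
for method, path, body in requests_to_send:
    print(method, path, json.dumps(body))
print("document:", json.dumps(document))
print("shard copies in the cluster:", total_shard_copies)
```

With these settings the cluster holds six shard copies in total, which Elasticsearch places and rebalances across the available servers on its own.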

SLIDE 14

Elastic (ELK) Stack

  • Elasticsearch is closely integrated with Logstash and Kibana (Elastic Stack, previously known as ELK)

  • Logstash

– Server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it, and then sends it to Elasticsearch

  • Kibana

– Data visualization platform

SLIDE 15

Solr vs. Elasticsearch

  • Solr

– Mature, widely deployed product
– Active and large developer community
– Provides a highly detailed functional environment; a wide range of plug-ins is available

  • Elasticsearch

– Newer, but already very widely used
– Focus on extracting value from data in general, not just on search
– Part of the ELK stack
– Schema-free and document-oriented

  • Elasticsearch vs. Solr on Google Trends (figure)

SLIDE 16

Time series database (TSDB)

  • How to analyze DevOps monitoring data, application metrics and IoT sensor data?

– Time series databases (TSDBs) provide an effective and lightweight solution

  • Optimized for handling high-volume time series data

– Time series: a sequence of data points (arrays of numbers) indexed by time (a date-time or a date-time range), e.g., a time series of stock prices (price curve), a time series of energy consumption (load profile), or a log of temperature values (temperature trace)

  • Optimized for providing complex logic to analyze time series data

– Queries over historical data, replete with time ranges, roll-ups and arbitrary time zone conversions, are difficult in a traditional DBMS
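A time-range query followed by a roll-up can be sketched as follows. This is a plain-Python illustration of the two operations, not how any particular TSDB implements them; the temperature trace is hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import mean

# Hypothetical temperature trace: (Unix timestamp in seconds, value).
trace = [
    (1434067200, 20.0),  # 2015-06-12 00:00 UTC
    (1434069000, 21.0),  # 00:30
    (1434070800, 23.0),  # 01:00
    (1434072600, 25.0),  # 01:30
    (1434074400, 24.0),  # 02:00
]


def select_range(points, start, end):
    """Keep the points with start <= timestamp < end."""
    return [(ts, value) for ts, value in points if start <= ts < end]


def roll_up(points, bucket_seconds):
    """Downsample by averaging all values that fall into the same time bucket."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)
    return {start: mean(values) for start, values in sorted(buckets.items())}


# Hourly averages for the first two hours of the trace.
hourly = roll_up(select_range(trace, 1434067200, 1434074400), 3600)
for start, average in hourly.items():
    print(datetime.fromtimestamp(start, tz=timezone.utc).isoformat(), average)
```

A TSDB offers such range selections and roll-ups as built-in query features, so they do not have to be hand-written on top of a general-purpose DBMS.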

SLIDE 17

TSDB: overview

  • Create, enumerate, update and destroy various time series and organize them in some fashion

– Series may be organized hierarchically and have companion metadata
– Provide basic calculations on a series as a whole (e.g., multiplying, adding, or combining various time series into a new time series)
– Filter on arbitrary patterns (e.g., day of the week, low value, high value)
– Provide additional statistical functions that are targeted to time series data
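The whole-series calculations and pattern filters listed above can be illustrated with a small sketch. The two load profiles, the site names and the threshold are hypothetical; a real TSDB would provide these operations through its query language.

```python
from datetime import datetime, timezone

DAY = 86400
T0 = 1434067200  # Friday 2015-06-12 00:00:00 UTC

# Hypothetical daily load profiles keyed by Unix timestamp (seconds).
site_a = {T0: 10.0, T0 + DAY: 12.0, T0 + 2 * DAY: 8.0}
site_b = {T0: 5.0, T0 + DAY: 6.0, T0 + 2 * DAY: 7.0}


def add_series(a, b):
    """Combine two series into a new one by summing values at shared timestamps."""
    return {ts: a[ts] + b[ts] for ts in sorted(a.keys() & b.keys())}


def scale_series(series, factor):
    """Multiply every value of a series by a constant."""
    return {ts: value * factor for ts, value in series.items()}


def filter_weekdays(series, weekdays):
    """Keep points whose UTC weekday (Monday=0 ... Sunday=6) is in weekdays."""
    return {
        ts: value
        for ts, value in series.items()
        if datetime.fromtimestamp(ts, tz=timezone.utc).weekday() in weekdays
    }


total = add_series(site_a, site_b)
weekend_total = filter_weekdays(total, {5, 6})            # day-of-week pattern
high_values = {ts: v for ts, v in total.items() if v >= 16.0}  # high-value filter
print(total, weekend_total, high_values)
```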

SLIDE 18

TSDB: some products

  • Some open-source products

– CrateDB https://crate.io
– Chronix http://www.chronix.io
– Graphite https://graphiteapp.org (stores numeric time-series data and renders graphs of this data on demand)
– InfluxDB https://www.influxdata.com
– KairosDB https://kairosdb.github.io (stores its time series in Cassandra)
– OpenTSDB http://opentsdb.net (stores its time series in HBase)
– Riak-TS http://basho.com/products/riak-ts/

SLIDE 19

InfluxDB

  • Written in Go
  • Supports high write loads and large data set storage
  • Conserves space through downsampling

– By automatically expiring and deleting unwanted data; backup and restore are also supported

  • Provides an easy-to-use SQL-like query language for interacting with data
  • Provides simple, high-performing write and query HTTP(S) APIs, e.g.:

– To create a database

curl -i -XPOST http://localhost:8086/query --data-urlencode "q=CREATE DATABASE mydb"

– To write data

curl -i -XPOST 'http://localhost:8086/write?db=mydb' --data-binary 'cpu_load_short,host=server01,region=us-west value=0.64 1434055562000000000'
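The same two HTTP calls can be prepared from Python with the standard library only. This is a sketch that mirrors the curl commands above; the helper names are hypothetical, and the requests are built but deliberately not sent, so no running InfluxDB server is needed to try it.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE = "http://localhost:8086"  # same local endpoint as in the curl commands


def create_database_request(name):
    """POST /query with a form-encoded body q=CREATE DATABASE <name>."""
    body = urlencode({"q": f"CREATE DATABASE {name}"}).encode("utf-8")
    return Request(f"{BASE}/query", data=body, method="POST")


def write_request(db, line):
    """POST /write?db=<db> with one line of point data as the raw body."""
    url = f"{BASE}/write?{urlencode({'db': db})}"
    return Request(url, data=line.encode("utf-8"), method="POST")


create_req = create_database_request("mydb")
write_req = write_request(
    "mydb",
    "cpu_load_short,host=server01,region=us-west value=0.64 1434055562000000000",
)
print(create_req.get_method(), create_req.full_url, create_req.data)
print(write_req.get_method(), write_req.full_url, write_req.data)
# With a server running, urllib.request.urlopen(create_req) would send the request.
```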

SLIDE 20

InfluxDB datastore

  • Data is organized by time series, each of which contains a measured value, like “cpu_load” or “temperature”
  • Time series have zero to many points, one for each discrete sample of the metric
  • Points consist of:

– time (a timestamp)
– a measurement (e.g., “cpu_load”)
– at least one key-value field (the measured value itself, e.g., “value=0.64” or “temperature=21.2”)
– zero to many key-value tags containing any metadata about the value (e.g., “host=server01”, “region=EMEA”, “dc=Frankfurt”)

SLIDE 21

InfluxDB datastore

  • General format of points:

<measurement>[,<tag-key>=<tag-value>...] <field-key>=<field-value>[,<field2-key>=<field2-value>...] [unix-nano-timestamp]

  • Examples of points:

– cpu,host=serverA,region=us_west value=0.64
– payment,device=mobile,product=Notepad,method=credit billed=33,licenses=3i 1434067467100293230
– stock,symbol=AAPL bid=127.46,ask=127.48
– temperature,machine=unit42,type=assembly external=25,internal=37 1434067467000000000
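The point format can be produced programmatically. The helper below is a simplified sketch, not an official client: the function names are hypothetical, backslashes inside values are not handled, and tags are emitted in sorted key order, which is a choice of this example.

```python
def escape(text, chars=", ="):
    """Backslash-escape the given special characters (commas, spaces, equals signs)."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def format_field(value):
    """Render a field value: booleans, integers (trailing i), floats, quoted strings."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    return '"' + str(value).replace('"', '\\"') + '"'


def to_line(measurement, fields, tags=None, timestamp_ns=None):
    """Build one point: measurement[,tags] fields [timestamp in nanoseconds]."""
    if not fields:
        raise ValueError("a point needs at least one field")
    head = escape(measurement, ", ")  # measurements escape commas and spaces only
    for key, value in sorted((tags or {}).items()):
        head += f",{escape(key)}={escape(value)}"
    body = ",".join(f"{escape(k)}={format_field(v)}" for k, v in fields.items())
    line = f"{head} {body}"
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line


cpu_line = to_line("cpu", {"value": 0.64}, tags={"host": "serverA", "region": "us_west"})
payment_line = to_line(
    "payment",
    {"billed": 33.0, "licenses": 3},
    tags={"device": "mobile", "product": "Notepad", "method": "credit"},
    timestamp_ns=1434067467100293230,
)
print(cpu_line)
print(payment_line)
```

When the timestamp is omitted, as in the first call, the line simply ends after the fields, matching the optional last element of the general format.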

SLIDE 22

InfluxDB datastore

  • A measurement is like a SQL table, where the primary index is time
  • With respect to a DBMS:

– No need to define schemas up-front
– Null values are not stored
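The table analogy can be sketched as a list of rows ordered by time, where a missing field is simply absent instead of being stored as NULL. The rows, the database name "mydb" and the query are illustrative assumptions; the query URL is only built, not sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Rows of a hypothetical "temperature" measurement, ordered by time.
# The second point has no "internal" field: nothing is stored for it.
rows = [
    {"time": 1434067467000000000, "machine": "unit42", "external": 25.0, "internal": 37.0},
    {"time": 1434067468000000000, "machine": "unit42", "external": 26.0},
]

# The "columns" are whatever keys the points carry; no schema is declared.
columns = sorted({key for row in rows for key in row})

# An SQL-like query restricted by a tag value and by time.
query = (
    'SELECT "external" FROM "temperature" '
    "WHERE \"machine\" = 'unit42' AND time >= '2015-06-12T00:00:00Z'"
)
url = "http://localhost:8086/query?" + urlencode({"db": "mydb", "q": query})
print(columns)
print(url)
```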

SLIDE 23

InfluxDB stack

  • Integrated with Telegraf, Chronograf and Kapacitor (TICK stack)

SLIDE 24

References

  • Apache Solr Reference Guide, http://bit.ly/2scksQF
  • InfluxDB Version 1.2 Documentation, http://bit.ly/2ryagFT
  • Dunning and Friedman, “Time Series Databases”, O’Reilly, 2015.
