Università degli Studi di Roma Tor Vergata, Dipartimento di Ingegneria Civile e Ingegneria Informatica
Search and Time Series Databases
Course: Systems and Architectures for Big Data (Sistemi e Architetture per Big Data), Academic Year 2016/17
Valeria Cardellini
The reference Big Data stack
Valeria Cardellini - SABD 2016/17 1
Layers: Resource Management, Data Storage, Data Processing, High-level Interfaces, Support / Integration
Why search platforms?
- How to find documents that match queries?
– With full-text search, which is faster than querying an RDBMS
- How to obtain specific features?
– Such as highlighting, spatial search, suggestions, guided navigation, …
Search engines
- Most popular search platforms:
– Apache Solr
– Elasticsearch
- ETL process
Apache Solr
- Scalable, highly reliable and open-source framework for searching data
- Built on Apache Lucene
– Open-source library for indexing and search
– Used by Solr for full-text search
- Can index documents written in:
– XML, JSON, CSV and binary formats
- Runs as a Java web application
- Provides a REST-like web service that exposes services to manage the lifecycle of documents in the index (indexing, querying, …)
- Used by many popular web applications (Apple, Instagram, LinkedIn, …)
Solr: key features
- Faceting
– To group the results based on a specific field or defined criteria, providing the count of each subset
– Example: a shopping site can provide facets to narrow search results by manufacturer or price
- Auto-suggest
– To present a list of possible query terms
- Spell check
– To suggest corrected spelling of query terms
- Highlighting
- Document clustering
– To group related documents in the search results
- Spatial search
– To filter search results based on location
- Pagination and ranking of search results
- Results grouping
– To group the results based on a grouping field and return the top documents in each group
- Near real-time search
– To search documents immediately after they have been indexed; useful for apps with dynamically changing content (e.g., news)
- More Like This
– To identify other documents that are similar to one in a result set
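As a rough illustration of the faceting feature listed above, the following sketch counts matching documents per value of a field. It is plain Python, not Solr code; the product list and the field names (`name`, `manufacturer`, `price`) are invented for the example.

```python
from collections import Counter

# Hypothetical product documents; the field names are assumptions.
PRODUCTS = [
    {"name": "laptop A", "manufacturer": "Acme", "price": 900},
    {"name": "laptop B", "manufacturer": "Acme", "price": 1400},
    {"name": "laptop C", "manufacturer": "Globex", "price": 1100},
    {"name": "phone D", "manufacturer": "Globex", "price": 600},
]


def search(docs, term):
    """Return the documents whose name contains the query term."""
    return [d for d in docs if term.lower() in d["name"].lower()]


def facet_counts(docs, field):
    """Count how many result documents fall under each value of `field`."""
    return Counter(d[field] for d in docs)


results = search(PRODUCTS, "laptop")
facets = facet_counts(results, "manufacturer")
```

A shopping site could show `facets` next to the result list so that the user can narrow the search to one manufacturer.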
Valeria Cardellini - SABD 2016/17 6
Solr feature example
Solr components
- Request Handlers: handle a request at a URL
– E.g.: /select
- Search Components: part of a Search Handler, a componentized request handler
– Includes: Query, Faceting, Highlighting, Debug, …
– Distributed Search capable
- Update Handlers: handle an indexing request
- Update Processors chain: per-handler componentized chain that handles updates
- Query Parser plugins
– Mix and match query types in a single request
– Function plugins for Function Query
- Text Analysis plugins: Analyzers, Tokenizers, TokenFilters
- Response Writers: serialize and stream response to client
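To make the role of the /select request handler more concrete, the sketch below only builds the URL of a query request; it does not contact a server. The base URL and the collection name `products` are assumptions for illustration, while `q`, `rows`, `wt`, `facet` and `facet.field` are common Solr query parameters.

```python
from urllib.parse import urlencode


def build_select_url(base_url, collection, query, facet_field=None, rows=10):
    """Build the URL of a request to the /select handler of a collection."""
    params = [("q", query), ("rows", rows), ("wt", "json")]
    if facet_field is not None:
        # Ask the faceting search component for counts on one field.
        params += [("facet", "true"), ("facet.field", facet_field)]
    return f"{base_url}/{collection}/select?{urlencode(params)}"


# Assumed local Solr instance and hypothetical collection name.
url = build_select_url(
    "http://localhost:8983/solr", "products", "name:laptop",
    facet_field="manufacturer",
)
```

The `wt` parameter selects the response writer, so the same request could be serialized in a different format by changing its value.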
Scaling Solr: SolrCloud
- How to provide distributed indexing and search capabilities?
– Up to millions of users and millions of indexed documents
- SolrCloud: the Solr deployment mode for setting up clusters of Solr servers
– Enables and simplifies horizontal scaling of a search index through replication and sharding
– Sharding: incoming queries are distributed to the shards in the collection, and their partial results are merged into a single response
– Replication: to handle higher concurrent query load by spreading the requests across multiple servers
- No master node to allocate nodes, shards and replicas
- SolrCloud uses ZooKeeper for storing shared configuration files and for coordination
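The sharding idea described above can be sketched in a few lines. This is a toy model under stated assumptions, not SolrCloud's actual document router or ranking: documents are assigned to shards with a CRC32 hash of a hypothetical `id` field, and the score is simply the number of occurrences of the term in a hypothetical `text` field.

```python
import zlib

NUM_SHARDS = 3  # assumed cluster size for the example


def shard_for(doc_id, num_shards=NUM_SHARDS):
    """Pick a shard by hashing the document id (illustrative router only)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def index(docs, num_shards=NUM_SHARDS):
    """Distribute the documents over the shards."""
    shards = [[] for _ in range(num_shards)]
    for doc in docs:
        shards[shard_for(doc["id"], num_shards)].append(doc)
    return shards


def score(doc, term):
    """Toy relevance score: occurrences of the term in the text."""
    return doc["text"].count(term)


def search_shard(shard, term, rows):
    """Each shard returns its own top results."""
    hits = [d for d in shard if score(d, term) > 0]
    hits.sort(key=lambda d: (-score(d, term), d["id"]))
    return hits[:rows]


def distributed_search(shards, term, rows=10):
    """Query every shard, then merge the partial results into one list."""
    merged = []
    for shard in shards:
        merged.extend(search_shard(shard, term, rows))
    merged.sort(key=lambda d: (-score(d, term), d["id"]))
    return merged[:rows]
```

Replication would add several copies of each shard, so that the per-shard query can be sent to any one of them.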
Solr distributed architecture
Elasticsearch
- Distributed, multitenant-capable and scalable full-text search engine with REST-based interface and schema-free JSON documents
- Search engine based on Apache Lucene
- Developed in Java
- Distributed
– Indices can be divided into shards and each shard can have zero or more replicas
– Each server hosts one or more shards, and acts as a coordinator to delegate operations to the correct shard(s)
– Rebalancing and routing are done automatically
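The relation between shards and replicas can be illustrated with the JSON body that describes an index. The sketch below only builds that body and counts the resulting shard copies; `number_of_shards` and `number_of_replicas` are Elasticsearch index settings, while the helper names and the values 3 and 1 are assumptions chosen for the example.

```python
import json


def index_settings(num_shards, num_replicas):
    """Build a JSON-serializable body with the shard and replica settings."""
    return {
        "settings": {
            "number_of_shards": num_shards,      # primary shards of the index
            "number_of_replicas": num_replicas,  # copies of each primary shard
        }
    }


def total_shard_copies(body):
    """Each primary shard exists once, plus once per replica."""
    settings = body["settings"]
    return settings["number_of_shards"] * (1 + settings["number_of_replicas"])


body = index_settings(num_shards=3, num_replicas=1)
payload = json.dumps(body)  # what a client would send when creating the index
```

With 3 shards and 1 replica the cluster has to place 6 shard copies on its servers, which is what the automatic rebalancing takes care of.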
Elastic (ELK) Stack
- Elasticsearch is closely integrated with Logstash and Kibana (Elastic Stack, previously known as ELK)
- Logstash
– Server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it, and then sends it to Elasticsearch
- Kibana
– Data visualization platform
Solr vs. Elasticsearch
- Solr
– Mature, widely deployed product
– Active and large developer community
– Provides a highly detailed functional environment; a wide range of plug-ins is available
- Elasticsearch
– Newer, but already very widely used
– Focus on extracting value from data generally, and not just on search
– Part of the ELK stack
– Schema-free and document-oriented
Elasticsearch vs. Solr on Google Trends
Time series database (TSDB)
- How to analyze DevOps monitoring data, application metrics, IoT sensor data?
– Time series databases (TSDBs) provide an effective and lightweight solution
- Optimized for handling high-volume time series data
– Time series: a sequence of data points (arrays of numbers) indexed by time (a timestamp or a time range), e.g.:
- Time series of stock prices (price curve)
- Time series of energy consumption (load profile)
- Log of temperature values (temperature trace)
- Optimized for providing complex logic to analyze time series data
– Queries on historical data, with time ranges, roll-ups and arbitrary time zone conversions, are difficult to express in a traditional DBMS
TSDB: overview
- Create, enumerate, update and destroy time series, and organize them in some fashion
– Series may be organized hierarchically and have companion metadata
– Provide basic calculations on a series as a whole (e.g., multiplying, adding, or combining various time series into a new time series)
– Filter on arbitrary patterns (e.g., day of the week, low value, high value)
– Provide additional statistical functions that are targeted to time series data
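The operations listed above can be sketched on a series represented as a list of (timestamp, value) pairs. This is a minimal in-memory illustration, not the API of any TSDB; the function names and the sample load values are invented for the example.

```python
from datetime import datetime, timedelta


def make_series(start, step, values):
    """Build a series as a list of (timestamp, value) pairs."""
    return [(start + i * step, v) for i, v in enumerate(values)]


def combine(a, b, op):
    """Combine two series point-wise on their common timestamps."""
    b_by_time = dict(b)
    return [(t, op(v, b_by_time[t])) for t, v in a if t in b_by_time]


def filter_points(series, predicate):
    """Keep only the points for which predicate(timestamp, value) holds."""
    return [(t, v) for t, v in series if predicate(t, v)]


def rollup(series, bucket, aggregate):
    """Group points by bucket(timestamp) and aggregate each group."""
    groups = {}
    for t, v in series:
        groups.setdefault(bucket(t), []).append(v)
    return [(key, aggregate(vals)) for key, vals in sorted(groups.items())]


# One week of daily values, starting on Monday 1 May 2017 (sample data).
start = datetime(2017, 5, 1)
load_a = make_series(start, timedelta(days=1), [10, 12, 11, 13, 9, 4, 3])
load_b = make_series(start, timedelta(days=1), [5, 6, 5, 7, 4, 2, 1])

total = combine(load_a, load_b, lambda x, y: x + y)        # adding two series
weekend = filter_points(total, lambda t, v: t.weekday() >= 5)  # day-of-week filter
weekly_peak = rollup(total, lambda t: tuple(t.isocalendar())[:2], max)
```

A TSDB offers operations of this kind natively and over far larger volumes, which is the point of the optimizations mentioned above.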
TSDB: some products
- Some open-source products
– CrateDB https://crate.io
– Chronix http://www.chronix.io
– Graphite https://graphiteapp.org: stores numeric time-series data and renders graphs of this data on demand
– InfluxDB https://www.influxdata.com
– KairosDB https://kairosdb.github.io: stores its time series in Cassandra
– OpenTSDB http://opentsdb.net: stores its time series in HBase
– Riak-TS http://basho.com/products/riak-ts/
InfluxDB
- Written in Go
- Supports high write loads and large data set storage
- Conserves space through downsampling and by automatically expiring and deleting unwanted data
- Supports backup and restore
- Provides an easy-to-use SQL-like query language for interacting with data
- Provides simple, high-performing write and query HTTP(S) APIs, e.g.:
– To create a database
curl -i -XPOST http://localhost:8086/query --data-urlencode "q=CREATE DATABASE mydb"
– To write data
curl -i -XPOST 'http://localhost:8086/write?db=mydb' --data-binary 'cpu_load_short,host=server01,region=us-west value=0.64 1434055562000000000'
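The same two HTTP calls can be prepared from a program. The sketch below uses only the Python standard library and builds the requests without sending them; the helper names are invented, and the URL, database name and data point are taken from the curl examples above.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://localhost:8086"  # local InfluxDB instance, as in the curl examples


def create_database_request(db):
    """POST to /query with the statement sent as a URL-encoded form field."""
    body = urlencode({"q": f"CREATE DATABASE {db}"}).encode("utf-8")
    return Request(f"{BASE_URL}/query", data=body, method="POST")


def write_request(db, line):
    """POST to /write with one point as the raw request body."""
    url = f"{BASE_URL}/write?{urlencode({'db': db})}"
    return Request(url, data=line.encode("utf-8"), method="POST")


create_req = create_database_request("mydb")
write_req = write_request(
    "mydb",
    "cpu_load_short,host=server01,region=us-west value=0.64 1434055562000000000",
)
```

Passing either request to `urllib.request.urlopen` would send it, which requires a running InfluxDB server at the assumed address.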
InfluxDB datastore
- Data organized by time series, which contain a measured value, like “cpu_load” or “temperature”
- Time series have zero to many points, one for each discrete sample of the metric
- Points consist of:
– time (a timestamp)
– a measurement (e.g., “cpu_load”)
– at least one key-value field (the measured value itself, e.g. “value=0.64”, or “temperature=21.2”)
– and zero to many key-value tags containing any metadata about the value (e.g. “host=server01”, “region=EMEA”, “dc=Frankfurt”)
- General format of points:
<measurement>[,<tag-key>=<tag-value>...] <field-key>=<field-value>[,<field2-key>=<field2-value>...] [unix-nano-timestamp]
- Examples of points:
– cpu,host=serverA,region=us_west value=0.64
– payment,device=mobile,product=Notepad,method=credit billed=33,licenses=3i 1434067467100293230
– stock,symbol=AAPL bid=127.46,ask=127.48
– temperature,machine=unit42,type=assembly external=25,internal=37 1434067467000000000
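The general format above can be produced with a small helper. This is a simplified sketch, not a complete implementation of the format: the function names are invented, tags are written in sorted key order, and escaping of spaces and commas in keys and tag values is deliberately left out.

```python
def format_field(value):
    """Render a field value: booleans, integers (suffix i), floats, strings."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    return '"' + str(value).replace('"', '\\"') + '"'


def format_point(measurement, fields, tags=None, timestamp_ns=None):
    """Build one point: measurement, optional tags, fields, optional timestamp."""
    if not fields:
        raise ValueError("a point needs at least one field")
    line = measurement
    for key, value in sorted((tags or {}).items()):
        line += f",{key}={value}"
    line += " " + ",".join(f"{k}={format_field(v)}" for k, v in fields.items())
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"  # nanoseconds since the Unix epoch
    return line


point = format_point(
    "cpu", {"value": 0.64}, tags={"host": "serverA", "region": "us_west"}
)
```

Here `point` reproduces the first example point listed above; a field passed as a Python `int`, such as `licenses=3`, is written with the `i` suffix.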
- A measurement is like a SQL table, where the primary index is time
- Compared with a traditional DBMS:
– No need to define schemas up-front
– Null values are not stored
InfluxDB stack
- Integrated with Telegraf, Chronograf and Kapacitor (TICK stack)