 
Università degli Studi di Roma “Tor Vergata”
Dipartimento di Ingegneria Civile e Ingegneria Informatica
Search and Time Series Databases
Corso di Sistemi e Architetture per Big Data, A.A. 2016/17
Valeria Cardellini
The reference Big Data stack
• High-level Interfaces
• Support / Integration
• Data Processing
• Data Storage
• Resource Management
Valeria Cardellini - SABD 2016/17
Why search platforms?
• How to find documents that match queries?
  – With text search, which is faster than querying an RDBMS
• How to obtain specific features?
  – Such as highlighting, spatial search, suggestions, guided navigation, …
Search engines
• Most popular search platforms:
  – Apache Solr
  – Elasticsearch
• ETL process
Apache Solr
• Scalable, highly reliable, open-source framework for searching data
• Built on Apache Lucene
  – Open-source library for indexing and search
  – Used by Solr for full-text search
• Can index documents in XML, JSON, CSV and binary formats
• Runs as a Java Web application
• Provides a REST-like web service that exposes operations to manage the lifecycle of documents in the index (indexing, querying, …)
• Used by many popular Web apps (Apple, Instagram, LinkedIn, …)
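As a rough illustration of the REST-like interface, the sketch below builds the URL of a query sent to the /select request handler. The host, port and the collection name `products` are assumptions made for the example; `q`, `rows` and `wt` are standard Solr query parameters. No request is actually sent.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, collection, query, rows=10):
    """Build the URL of a Solr /select query (nothing is sent over the network)."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{base_url}/solr/{collection}/select?{urlencode(params)}"


# Hypothetical local Solr instance with a collection named "products".
url = solr_select_url("http://localhost:8983", "products", "name:laptop")
print(url)
```

Fetching this URL from a running Solr server would return the matching documents as JSON.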
Solr: key features
• Faceting
  – To group the results based on a specific field or defined criteria, providing the count of each subset
  – Example: a shopping site can provide facets to narrow search results by manufacturer or price
• Auto-suggest
  – To present a list of possible query terms
• Spell check
  – To suggest corrected spelling of query terms
• Highlighting
• Document clustering
  – To group related documents in the search results
• Spatial search
  – To filter search results based on location
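To make the idea of faceting concrete, here is a minimal in-memory sketch of what a facet on one field computes: the number of matching documents for each value of that field. The documents and field names are made up for illustration, and this is not how Solr implements it internally. In Solr itself, faceting is requested with query parameters such as `facet=true` and `facet.field=manufacturer`.

```python
from collections import Counter


def facet_counts(docs, field):
    """Count how many matching documents fall under each value of `field`."""
    return Counter(doc[field] for doc in docs if field in doc)


# Hypothetical search results of a shopping site.
results = [
    {"name": "laptop A", "manufacturer": "Acme", "price": 900},
    {"name": "laptop B", "manufacturer": "Acme", "price": 1200},
    {"name": "laptop C", "manufacturer": "Globex", "price": 700},
]
by_manufacturer = facet_counts(results, "manufacturer")
print(by_manufacturer.most_common())
```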
Solr: key features (continued)
• Pagination and ranking of search results
• Results grouping
  – To group the results based on a grouping field and return the top documents in each group
• Near real-time search
  – To search documents immediately after they have been indexed; useful for apps with dynamically changing content (e.g., news)
• More Like This
  – To identify other documents that are similar to one in a result set
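Pagination in Solr is driven by the `start` (offset) and `rows` (page size) query parameters. The helper below is a small sketch that translates a 1-based page number into those two parameters; the function name and the default page size are assumptions made for the example.

```python
def page_params(page, page_size=10):
    """Translate a 1-based page number into Solr's start/rows parameters."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    return {"start": (page - 1) * page_size, "rows": page_size}


print(page_params(3, page_size=20))
```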
Solr feature example
Solr components
Solr components
• Request Handlers: handle a request at a URL
  – E.g.: /select
• Search Components: part of a Search Handler, a componentized request handler
  – Includes: Query, Faceting, Highlighting, Debug, …
  – Distributed Search capable
• Update Handlers: handle an indexing request
• Update Processors chain: per-handler componentized chain that handles updates
• Query Parser plugins
  – Mix and match query types in a single request
  – Function plugins for Function Query
• Text Analysis plugins: Analyzers, Tokenizers, TokenFilters
• Response Writers: serialize and stream the response to the client
Scaling Solr: SolrCloud
• How to provide distributed indexing and search capabilities?
  – Up to millions of users and millions of indexed documents
• SolrCloud: deployment functionality of Solr that allows setting up clusters of Solr servers
  – Enables and simplifies horizontal scaling of a search index through replication and sharding
  – Sharding: incoming queries are distributed to the shards in the collection, which respond with merged results
  – Replication: to handle higher concurrent query load by spreading the requests across multiple servers
• No master node to allocate nodes, shards and replicas
• SolrCloud uses ZooKeeper for storing shared configuration files and for coordination
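The sharding idea can be sketched in a few lines: documents are assigned to shards by hashing their id, a query is sent to every shard, and the partial results are merged into one ranked list. This is a toy model under stated assumptions, not SolrCloud's actual router or scoring: the MD5-based placement, the term-count score and all names are chosen only for illustration.

```python
import hashlib
import heapq


def shard_for(doc_id, num_shards):
    """Pick a shard from a stable hash of the document id (illustrative choice)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards


def index_docs(docs, num_shards):
    """Distribute documents over `num_shards` in-memory shards."""
    shards = [[] for _ in range(num_shards)]
    for doc in docs:
        shards[shard_for(doc["id"], num_shards)].append(doc)
    return shards


def search(shards, term, top_k=3):
    """Query every shard, then merge the partial hit lists by score."""
    partial = []
    for shard in shards:
        # Toy score: number of occurrences of the term in the text.
        hits = [(doc["text"].count(term), doc["id"])
                for doc in shard if term in doc["text"]]
        partial.append(hits)
    merged = heapq.nlargest(top_k, (hit for hits in partial for hit in hits))
    return [doc_id for _, doc_id in merged]


docs = [
    {"id": "d1", "text": "solr solr solr"},
    {"id": "d2", "text": "solr search"},
    {"id": "d3", "text": "lucene"},
    {"id": "d4", "text": "solr solr"},
]
shards = index_docs(docs, num_shards=2)
print(search(shards, "solr"))
```

The merged ranking does not depend on which shard each document landed on, which is the property a client of a sharded index relies on.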
Solr distributed architecture
Elasticsearch
• Distributed, multitenant-capable and scalable full-text search engine with a REST-based interface and schema-free JSON documents
• Search engine based on Apache Lucene
• Developed in Java
• Distributed
  – Indices can be divided into shards and each shard can have zero or more replicas
  – Each server hosts one or more shards, and acts as a coordinator to delegate operations to the correct shard(s)
  – Rebalancing and routing are done automatically
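The shard and replica counts are set per index. The sketch below builds the method, path and JSON body of an index-creation request using the `number_of_shards` and `number_of_replicas` settings; the index name `logs` and the helper names are assumptions for illustration, and nothing is sent to a server.

```python
import json


def create_index_request(index, shards, replicas):
    """Return (method, path, body) for creating an index with the given layout."""
    if shards < 1 or replicas < 0:
        raise ValueError("need at least one shard and zero or more replicas")
    body = {"settings": {"number_of_shards": shards,
                         "number_of_replicas": replicas}}
    return "PUT", f"/{index}", json.dumps(body)


def total_shard_copies(shards, replicas):
    """Each primary shard plus its replicas must be hosted somewhere in the cluster."""
    return shards * (1 + replicas)


method, path, payload = create_index_request("logs", shards=3, replicas=1)
print(method, path, payload)
print(total_shard_copies(3, 1))
```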
Elastic (ELK) Stack
• Elasticsearch is closely integrated with Logstash and Kibana (Elastic Stack, previously known as ELK)
• Logstash
  – Server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it, and then sends it to Elasticsearch
• Kibana
  – Data visualization platform
Solr vs. Elasticsearch
• Elasticsearch vs Solr on Google Trends
• Solr
  – Mature, widely deployed product
  – Active and large developer community
  – Provides a feature-rich environment; a wide range of plug-ins is available
• Elasticsearch
  – Newer, but already very widely used
  – Focus on extracting value from data generally, and not just on search
  – Part of the ELK stack
  – Schema-free and document-oriented
Time series database (TSDB)
• How to analyze DevOps monitoring, application metrics, IoT sensor data?
  – Time series databases (TSDBs) provide an effective and lightweight solution
• Optimized for handling high-volume time series data
  – Time series: a sequence of data points (arrays of numbers) indexed by time (a date time or a date time range), e.g.:
    • Time series of stock prices (price curve)
    • Time series of energy consumption (load profile)
    • Log of temperature values (temperature trace)
• Optimized for providing complex logic to analyze time series data
  – Queries for historical data, replete with time ranges, roll-ups and arbitrary time zone conversions, are difficult in a traditional DBMS
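A roll-up is the kind of query a TSDB is built to answer quickly. As a plain-Python sketch of the idea, the function below groups raw samples into one-hour buckets and averages each bucket; the input format (Unix seconds, value) and the function name are assumptions for the example.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hourly_mean(points):
    """Roll up (unix_seconds, value) samples into per-hour averages (UTC)."""
    buckets = defaultdict(list)
    for ts, value in points:
        hour_start = ts - ts % 3600  # truncate the timestamp to the hour
        buckets[hour_start].append(value)
    return {
        datetime.fromtimestamp(hour, tz=timezone.utc).isoformat(): sum(vals) / len(vals)
        for hour, vals in sorted(buckets.items())
    }


# Hypothetical load profile: two samples in the first hour, one in the second.
load_profile = [(0, 1.0), (1800, 3.0), (3600, 10.0)]
rollup = hourly_mean(load_profile)
print(rollup)
```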
TSDB: overview
• Create, enumerate, update and destroy various time series and organize them in some fashion
  – Series may be organized hierarchically and have companion metadata
  – Provide basic calculations on a series as a whole (e.g., multiplying, adding, or combining various time series into a new time series)
  – Filter on arbitrary patterns (e.g., day of the week, low value, high value)
  – Provide additional statistical functions that are targeted to time series data
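Two of the operations listed above, combining series and filtering on a pattern, can be sketched with ordinary dictionaries keyed by date. The data and function names are hypothetical; a real TSDB would run these operations server-side over much larger series.

```python
from datetime import date


def add_series(a, b):
    """Point-wise sum of two series over the timestamps they share."""
    return {t: a[t] + b[t] for t in sorted(a.keys() & b.keys())}


def filter_weekday(series, weekday):
    """Keep only the points whose date falls on `weekday` (Monday == 0)."""
    return {t: v for t, v in series.items() if t.weekday() == weekday}


# Hypothetical daily readings from two meters.
meter_a = {date(2017, 6, 5): 10, date(2017, 6, 6): 20}
meter_b = {date(2017, 6, 5): 1, date(2017, 6, 6): 2, date(2017, 6, 7): 3}

total = add_series(meter_a, meter_b)
mondays = filter_weekday(total, 0)
print(total)
print(mondays)
```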
TSDB: some products
• Some open-source products
  – CrateDB https://crate.io
  – Chronix http://www.chronix.io
  – Graphite https://graphiteapp.org
    • Stores numeric time-series data and renders graphs of this data on demand
  – InfluxDB https://www.influxdata.com
  – KairosDB https://kairosdb.github.io
    • Stores its time series in Cassandra
  – OpenTSDB http://opentsdb.net
    • Stores its time series in HBase
  – Riak-TS http://basho.com/products/riak-ts/
InfluxDB
• Written in Go
• Supports high write loads and large data set storage
• Conserves space through downsampling and by automatically expiring and deleting unwanted data
• Supports backup and restore
• Provides an easy-to-use SQL-like query language for interacting with data
• Provides simple, high-performing write and query HTTP(S) APIs, e.g.:
  – To create a database:
    curl -i -XPOST http://localhost:8086/query --data-urlencode "q=CREATE DATABASE mydb"
  – To write data:
    curl -i -XPOST 'http://localhost:8086/write?db=mydb' --data-binary 'cpu_load_short,host=server01,region=us-west value=0.64 1434055562000000000'
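The same write can be prepared from a program. The sketch below builds the HTTP request for the /write endpoint with Python's standard library, reusing the host, database and point from the curl example; the helper name is an assumption, and the request is only constructed, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_write_request(base_url, database, line):
    """Prepare a POST to the InfluxDB 1.x /write endpoint (not sent here)."""
    url = f"{base_url}/write?{urlencode({'db': database})}"
    return Request(url, data=line.encode("utf-8"), method="POST")


point = "cpu_load_short,host=server01,region=us-west value=0.64 1434055562000000000"
req = build_write_request("http://localhost:8086", "mydb", point)
print(req.get_method(), req.full_url)
# Passing `req` to urllib.request.urlopen would send it to a running server.
```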
InfluxDB datastore
• Data organized by time series, which contain a measured value, like “cpu_load” or “temperature”
• Time series have zero to many points, one for each discrete sample of the metric
• Points consist of:
  – time (a timestamp)
  – a measurement (e.g., “cpu_load”)
  – at least one key-value field (the measured value itself, e.g. “value=0.64”, or “temperature=21.2”)
  – and zero to many key-value tags containing any metadata about the value (e.g. “host=server01”, “region=EMEA”, “dc=Frankfurt”)
InfluxDB datastore: format of points
• General format of points:
    <measurement>[,<tag-key>=<tag-value>...] <field-key>=<field-value>[,<field2-key>=<field2-value>...] [unix-nano-timestamp]
• Examples of points:
    cpu,host=serverA,region=us_west value=0.64
    payment,device=mobile,product=Notepad,method=credit billed=33,licenses=3i 1434067467100293230
    stock,symbol=AAPL bid=127.46,ask=127.48
    temperature,machine=unit42,type=assembly external=25,internal=37 1434067467000000000
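The point format above can be produced by a small helper. This is a simplified sketch, not a client library: it follows the general format shown on the slide (integers get the `i` suffix, strings are double-quoted) but does not escape spaces or commas in measurement names, tag keys or tag values. The function names are assumptions made for the example.

```python
def format_field(value):
    """Render one field value in line-protocol syntax (simplified)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry a trailing "i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line(measurement, fields, tags=None, timestamp_ns=None):
    """Build '<measurement>[,tags] <fields> [timestamp]' for a single point."""
    if not fields:
        raise ValueError("a point needs at least one field")
    line = measurement
    for key, value in (tags or {}).items():
        line += f",{key}={value}"
    line += " " + ",".join(f"{k}={format_field(v)}" for k, v in fields.items())
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line


cpu_point = to_line("cpu", {"value": 0.64}, {"host": "serverA", "region": "us_west"})
payment_point = to_line("payment", {"billed": 33.0, "licenses": 3},
                        {"device": "mobile"}, 1434067467100293230)
print(cpu_point)
print(payment_point)
```

When the timestamp is omitted, as in the first point, the server assigns one at write time.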
InfluxDB datastore: comparison with DBMS
• A measurement is like a SQL table, where the primary index is time
• With respect to a DBMS:
  – No need to define schemas up-front
  – Null values are not stored
InfluxDB stack
• Integrated with Telegraf, Chronograf and Kapacitor (TICK stack)
References
• Apache Solr Reference Guide, http://bit.ly/2scksQF
• InfluxDB Version 1.2 Documentation, http://bit.ly/2ryagFT
• Dunning and Friedman, “Time Series Databases”, O’Reilly, 2015.