Revealing Elasticsearch Implementation, Integration, and Execution

Objective: Get access to a cluster, index documents, find them, and present them.

Target audience
● Web developers
● Data scientists
● Report developers
● Technologists
● Infrastructure/DevOps

What is Elasticsearch?
○ Written in Java
  ■ Open source
  ■ Cross platform
○ Based on Apache Lucene (the library that also underlies Apache Solr)
○ Scaled, real-time search & analytics
○ Full RESTful API
○ Plugin ecosystem
○ SDKs for Java, .NET, many more
○ Eventually consistent

An Elastic Timeline

Elasticsearch History
[Timeline figure, 2010 to 2017. Milestones shown: releases 0.x, 1.x, 2.x, 5.x; $104M in funding; Elastic Cloud; Prelert]

Getting Started

Objective: All you need is an endpoint http://localhost:9200/_search

Getting Out of the Gates
Option 1 (*nix)
● Apt-get the latest version of Elasticsearch (5.2.1) from elastic.co
● Run bin/elasticsearch
● Curl http://localhost:9200
Option 2 (Windows)
● Download the latest version of Elasticsearch from elastic.co
● Run bin\elasticsearch.bat
Option 3 (Cloud)
● Create a free account with Elastic.co
● Create a free account with Amazon Web Services
● Many other providers
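
Once a node is running, the curl check can also be done from a script. The following is a minimal Python sketch, assuming a node at http://localhost:9200; the function names and the abbreviated sample response are illustrative and were not captured from a real cluster.

```python
import json
import urllib.request


def describe_cluster(root_response: str) -> str:
    """Summarize the JSON body returned by the root endpoint."""
    info = json.loads(root_response)
    # "cluster_name" and "version.number" are fields of the root response.
    return f'{info["cluster_name"]} (Elasticsearch {info["version"]["number"]})'


def fetch_root(base_url: str = "http://localhost:9200") -> str:
    """Same request as `curl http://localhost:9200`; needs a running node."""
    with urllib.request.urlopen(base_url, timeout=5) as response:
        return response.read().decode("utf-8")


# Hand-written, abbreviated sample so the sketch runs without a cluster.
SAMPLE_ROOT = (
    '{"name": "node-1", "cluster_name": "elasticsearch", '
    '"version": {"number": "5.2.1"}}'
)

print(describe_cluster(SAMPLE_ROOT))
```

With a live node, `describe_cluster(fetch_root())` would report that node's cluster name and version instead of the sample.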

Cluster Overview

Objective: Understand how data is stored and transactions are scaled

Standard Configuration
● A typical production cluster will contain 3 nodes (installations)
  ○ Additional nodes can be brought online through discovery
● A typical index will contain 5 primary shards and 5 replica shards (one replica per primary), spread across the nodes
● Data is replicated across all nodes so loss of a node will not affect the cluster
● A master node is commonly specified to handle routing of requests
● Data is also serialized to disk and can be recovered
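
The "5 primary shards and 5 replica shards" figure corresponds to an index created with 5 primaries and 1 replica per primary. As a rough sketch, the settings body for the create-index call (for example PUT patients) and the resulting shard arithmetic could look like this; the helper names are hypothetical.

```python
import json


def index_settings(primaries: int = 5, replicas_per_primary: int = 1) -> dict:
    """Build a create-index body with explicit shard settings."""
    return {
        "settings": {
            "number_of_shards": primaries,
            "number_of_replicas": replicas_per_primary,
        }
    }


def shard_counts(body: dict) -> dict:
    """Count the shards an index with these settings is made of."""
    settings = body["settings"]
    primary = settings["number_of_shards"]
    # Every primary shard gets `number_of_replicas` copies.
    replica = primary * settings["number_of_replicas"]
    return {"primary": primary, "replica": replica, "total": primary + replica}


body = index_settings()
print(json.dumps(body))
print(shard_counts(body))
```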

Storage: A cluster with 3 nodes of 32GB RAM machines has 32GB of cache.

Questions?

Indexing Data

Objective: All you need is Postman

Inverted Indexes
Elasticsearch uses a structure called an inverted index
● Find all the unique words that appear in document
● List documents in which word (token) appears
Reduces total search size
● Find all documents in which token exists
● Ranks documents based on occurrences
Word stemming & casing
● Cases are removed in tokens
● Stemming algorithm drops “ing”, “ly”, “s”, etc
Normalization
● All inverted indexes are normalized
● Custom analyzers can be applied to documents
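
The idea can be sketched in a few lines of Python. This is a toy illustration, not Lucene's implementation: the suffix stripping is deliberately crude (it only drops the endings named on the slide), and the function names and sample documents are made up.

```python
import re
from collections import defaultdict

SUFFIXES = ("ing", "ly", "s")  # the endings named on the slide


def normalize(word):
    """Lower-case a word and strip one common suffix (very rough stemming)."""
    token = word.lower()
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def build_inverted_index(docs):
    """Map each token to {doc_id: number of occurrences in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for word in re.findall(r"[A-Za-z]+", text):
            token = normalize(word)
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return index


def search(index, word):
    """Return ids of documents containing the token, most occurrences first."""
    postings = index.get(normalize(word), {})
    return sorted(postings, key=lambda doc_id: (-postings[doc_id], doc_id))


docs = {
    1: "Walking helps back pain",
    2: "Patients walk slowly; walking daily helps",
    3: "Aspirin dosage",
}
index = build_inverted_index(docs)
print(search(index, "Walks"))
```

A lookup touches only the postings for one token rather than scanning every document, which is the "reduces total search size" point above.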

Mappings
Elasticsearch Mapping
● Elasticsearch will attempt to “guess” type mappings as each document is indexed.
● Once created, a field's mapping cannot be changed without re-creating the index.
● A custom mapping can be applied before indexing documents.
Available types:
● Boolean
● Long
● Double
● Date
● String
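
A custom mapping is useful where guessing falls short: a millisecond timestamp would be guessed as a long rather than a date, and a lat/lon object is not guessed as a geo point. Below is a hedged sketch of a create-index body for Elasticsearch 5.x, where string fields are declared as text or keyword. It assumes the patient fields sit at the top level of the document under a type named patient; the field selection and the helper are illustrative.

```python
import json

# Body for creating the index with an explicit mapping (e.g. PUT patients).
PATIENT_MAPPING = {
    "mappings": {
        "patient": {
            "properties": {
                "first_name": {"type": "text"},
                "last_name": {"type": "text"},
                "gender": {"type": "keyword"},  # exact-value field
                "dob": {"type": "date", "format": "epoch_millis"},
                "age": {"type": "long"},
                "height": {"type": "double"},
                "location": {"type": "geo_point"},
            }
        }
    }
}


def field_types(mapping, doc_type):
    """List the declared type of every field for one document type."""
    properties = mapping["mappings"][doc_type]["properties"]
    return {name: spec["type"] for name, spec in properties.items()}


print(json.dumps(field_types(PATIENT_MAPPING, "patient"), indent=2))
```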

Analyzers
● None
● Standard
  ○ Splits the input text on word boundaries
  ○ Terms are lower cased
● Whitespace
  ○ Breaks text into terms whenever it encounters a whitespace character
● Simple
  ○ Breaks text into terms whenever it encounters a character which is not a letter
  ○ Terms are lower cased
● Language
  ○ 33+ languages supported
  ○ Stems words based on language
  ○ Removes language specific “stop” words
● Custom
  ○ E.g. Remove “stop” words using a language filter
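
To make the differences concrete, here is a rough Python approximation of three of these analyzers. These are simplified stand-ins, not the real implementations: in particular the standard analyzer uses Unicode text segmentation, which the regular expression below only imitates for simple input.

```python
import re


def whitespace_analyzer(text):
    """Split on whitespace only; case and punctuation are kept."""
    return text.split()


def simple_analyzer(text):
    """Split on anything that is not a letter, then lower-case."""
    return [term.lower() for term in re.findall(r"[^\W\d_]+", text)]


def standard_analyzer(text):
    """Approximation: split on word boundaries (digits kept), lower-case."""
    return [term.lower() for term in re.findall(r"\w+", text)]


sample = "Take 2 Aspirin-tablets daily"
for analyzer in (whitespace_analyzer, simple_analyzer, standard_analyzer):
    print(analyzer.__name__, analyzer(sample))
```

On this sample the whitespace version keeps "Aspirin-tablets" as one term, while the simple version drops the digit.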

Patient Document Example
● JSON format (JavaScript Object Notation)
● Index by PUTting document to index endpoint
  ○ (PUT patients/patient/1)
  ○ Last item is unique key (1)
● Index operation automatically creates an index if it has not been created before
● Elasticsearch “guesses” types as they are posted
● Each indexed document is given a version number
● Index API optionally allows for optimistic concurrency control when the version parameter is specified
● Bulk-indexing supported (Bulk API)
● River plugins (Oracle, MSSQL, MySQL)

    {
      "patient": {
        "first_name": "John",
        "last_name": "Doe",
        "dob": 252507600000,
        "gender": "Male",
        "race": "White",
        "height": 1.8288,
        "weight": 90.7185,
        "eyes": "blue",
        "hair": "brown",
        "age": 39,
        "tobacco": "no",
        "location": {
          "lat": 40.762446,
          "lon": -73.831653
        },
        "conditions": [{
          "icd10": "M54.5",
          "description": "Low back pain"
        }, {
          "icd10": "Z91.018",
          "description": "Allergy to other foods"
        }],
        "medications": [{
          "name": "Aspirin",
          "dosage": 150,
          "units": "mg",
          "frequency": 8,
          "freq_units": "hours"
        }]
      }
    }
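
As a sketch of what the PUT and Bulk API calls look like from code, the Python below builds (but does not send) a single-document index request and a newline-delimited bulk body. The base URL, helper names, and the shortened sample documents are assumptions for illustration.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: a local node


def index_request(index, doc_type, doc_id, document, version=None):
    """Build PUT {index}/{type}/{id}; `version` enables optimistic locking."""
    url = f"{BASE_URL}/{index}/{doc_type}/{doc_id}"
    if version is not None:
        url += f"?version={version}"
    return urllib.request.Request(
        url,
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def bulk_body(index, doc_type, documents):
    """Build the Bulk API body: an action line, then the document, per item."""
    lines = []
    for doc_id, document in documents.items():
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(document))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


request = index_request("patients", "patient", 1, {"first_name": "John"}, version=3)
print(request.get_method(), request.full_url)
print(bulk_body("patients", "patient", {"1": {"age": 39}, "2": {"age": 52}}))
```

Passing the request to `urllib.request.urlopen` would send it to a running node; the bulk body would be POSTed to the `_bulk` endpoint.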

Questions?

Querying Documents

Objective: All you need is JSON

QueryDSL
Domain-Specific Language
● Leaf query clauses
  ○ Leaf query clauses look for a particular value in a particular field, such as the match, term or range queries. These queries can be used by themselves.
● Compound query clauses
  ○ Compound query clauses wrap other leaf or compound queries and are used to combine multiple queries in a logical fashion (such as the bool or dis_max query), or to alter their behavior (such as the constant_score query).
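
The leaf/compound distinction maps naturally onto a recursive structure. The toy Python evaluator below is only meant to show that shape: it is not how Elasticsearch executes queries, it handles just term, range, and bool (must/filter) over flat fields, and it ignores analysis and scoring.

```python
def matches(query, doc):
    """Check one document against a tiny subset of the query DSL."""
    (kind, body), = query.items()  # each clause has exactly one key
    if kind == "term":  # leaf: exact value in one field
        (field, value), = body.items()
        return doc.get(field) == value
    if kind == "range":  # leaf: bounded value in one field
        (field, bounds), = body.items()
        value = doc.get(field)
        if value is None:
            return False
        return bounds.get("gte", value) <= value <= bounds.get("lte", value)
    if kind == "bool":  # compound: wraps other clauses
        clauses = body.get("must", []) + body.get("filter", [])
        return all(matches(clause, doc) for clause in clauses)
    raise ValueError(f"unsupported clause: {kind}")


query = {
    "bool": {
        "filter": [
            {"term": {"gender": "Male"}},
            {"range": {"age": {"gte": 30, "lte": 50}}},
        ]
    }
}
print(matches(query, {"gender": "Male", "age": 39}))
print(matches(query, {"gender": "Male", "age": 60}))
```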

Common Query Types
● Full Text
  ○ Match All
  ○ Query String
● Term
  ○ Term
  ○ Range
  ○ Exists
  ○ Regexp
  ○ Fuzzy
● Compound
  ○ Bool
  ○ Boosting
● Joining
  ○ Nested
● Geo
  ○ Geo Shape
  ○ Geo Distance
  ○ Geo Polygon
● Specialized
  ○ More Like This
  ○ Template
  ○ Script

Sample Bool Query
● JSON format (JavaScript Object Notation)
● Search by performing GET against a specific index
  ○ GET patients/_search
● This query returns all men between the ages of 30 and 50 who use aspirin

    {
      "query": {
        "bool": {
          "must": [{
            "match": {
              "medications.name": "Aspirin"
            }
          }],
          "filter": [{
            "term": {
              "gender": "Male"
            }
          }, {
            "range": {
              "age": {
                "lte": 50,
                "gte": 30
              }
            }
          }]
        }
      }
    }
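
In application code the same query is usually assembled from parameters. A minimal Python sketch follows; the function name and its parameters are hypothetical, while the structure mirrors the slide's JSON.

```python
import json


def men_on_medication(medication, min_age, max_age):
    """Build the bool query from the slide with adjustable values."""
    return {
        "query": {
            "bool": {
                # "must" clauses have to match and contribute to the score.
                "must": [{"match": {"medications.name": medication}}],
                # "filter" clauses have to match but do not affect the score.
                "filter": [
                    {"term": {"gender": "Male"}},
                    {"range": {"age": {"lte": max_age, "gte": min_age}}},
                ],
            }
        }
    }


body = men_on_medication("Aspirin", 30, 50)
print(json.dumps(body, indent=2))
```

The serialized body is what would be sent with the search request against patients/_search.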

Query Result
Query returns a formatted JSON result indicating the search metrics
● Took
  ○ Length of time in milliseconds the query took to execute and return
● Shards
  ○ Number of shards utilized in execution of the query
● Hits
  ○ Total and max score of all results
  ○ Hits[] is an array of resulting documents, which can be limited by size

    {
      "took": 1,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
      },
      "hits": {
        "total": 1,
        "max_score": 1.3862944,
        "hits": [{
          "_index": "patients",
          "_type": "patient",
          "_id": "1",
          "_score": 1.3862944,
          "_source": {
            "first_name": "John",
            "last_name": "Doe",
            "dob": 252507600000,
            . . .
          }
        }]
      }
    }
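
Reading such a response from code is mostly dictionary access. The Python sketch below parses a copy of the slide's response (with the elided _source fields simply left out) and assumes the 5.x shape shown here, where hits.total is a plain number; the function name and summary keys are illustrative.

```python
import json

# The slide's response; fields elided on the slide are omitted here.
SAMPLE_RESPONSE = """
{
  "took": 1,
  "timed_out": false,
  "_shards": {"total": 5, "successful": 5, "failed": 0},
  "hits": {
    "total": 1,
    "max_score": 1.3862944,
    "hits": [{
      "_index": "patients",
      "_type": "patient",
      "_id": "1",
      "_score": 1.3862944,
      "_source": {"first_name": "John", "last_name": "Doe", "dob": 252507600000}
    }]
  }
}
"""


def summarize(response_text):
    """Pull the search metrics and the matching documents out of a response."""
    response = json.loads(response_text)
    hits = response["hits"]
    return {
        "took_ms": response["took"],
        "failed_shards": response["_shards"]["failed"],
        "total": hits["total"],
        # One (id, score, original document) tuple per returned hit.
        "documents": [(h["_id"], h["_score"], h["_source"]) for h in hits["hits"]],
    }


summary = summarize(SAMPLE_RESPONSE)
print(summary["total"], summary["documents"][0][2]["last_name"])
```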

Aggregates
An aggregation can be seen as a unit-of-work that builds analytic information over a set of documents.
● Bucketing
  ○ A family of aggregations that build buckets, where each bucket is associated with a key and a document criterion.
● Metric
  ○ Aggregations that keep track and compute metrics over a set of documents.
● Matrix
  ○ A family of aggregations that operate on multiple fields and produce a matrix result based on the values extracted from the requested document fields.
● Pipeline
  ○ Aggregations that aggregate the output of other aggregations and their associated metrics

    {
      "query": {
        "bool": {
          "must": [{
            "match": {
              "gender": "Male"
            }
          }]
        }
      },
      "aggs": {
        "medications": {
          "terms": {
            "field": "medications.name"
          }
        }
      }
    }

Query Result
A bucket aggregation finds all documents matching the query (in this case all males) and aggregates the results into key and doc_count fields.
Only documents matching the initial query will be considered for aggregation.

    {
      . . .
      "aggregations" : {
        "medications" : {
          "doc_count_error_upper_bound": 0,
          "sum_other_doc_count": 0,
          "buckets" : [
            {
              "key" : "Aspirin",
              "doc_count" : 2465
            },
            {
              "key" : "Omeprazole",
              "doc_count" : 1824
            },
            {
              "key" : "Lisinopril",
              "doc_count" : 1121
            }
          ]
        }
      }
    }
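
What the terms bucketing does can be imitated locally. The Python sketch below is a simplified model, not Elasticsearch's distributed implementation: it filters documents with a predicate standing in for the query, then counts documents (not occurrences) per medication name. The sample patients are invented, and ties are ordered alphabetically only to keep the output deterministic.

```python
from collections import Counter


def terms_buckets(docs, matches_query, values_of):
    """Group matching documents into {"key", "doc_count"} buckets."""
    counts = Counter()
    for doc in docs:
        if matches_query(doc):  # only matching documents are aggregated
            # A document counts once per distinct value it contains.
            counts.update(set(values_of(doc)))
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [{"key": key, "doc_count": count} for key, count in ordered]


patients = [
    {"gender": "Male", "medications": [{"name": "Aspirin"}]},
    {"gender": "Male", "medications": [{"name": "Aspirin"}, {"name": "Omeprazole"}]},
    {"gender": "Female", "medications": [{"name": "Lisinopril"}]},
]

buckets = terms_buckets(
    patients,
    matches_query=lambda doc: doc["gender"] == "Male",
    values_of=lambda doc: [m["name"] for m in doc["medications"]],
)
print(buckets)
```

The Lisinopril patient never reaches a bucket because she does not match the query.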

Statistical Aggregates
The aggregations in this family compute metrics based on values extracted in one way or another from the documents that are being aggregated. The values are typically extracted from the fields of the document (using the field data), but can also be generated using scripts.

    {
      "query": {
        "bool": {
          "must": [{
            "match": {
              "gender": "Male"
            }
          }]
        }
      },
      "aggs": {
        "age_stats": {
          "extended_stats": {
            "field": "age"
          }
        }
      }
    }
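
For readers unfamiliar with what extended_stats reports, the Python sketch below computes the same kinds of figures locally for a list of ages. It is an illustration of the statistics, not the server-side code; it assumes population variance and the default bounds of two standard deviations around the mean, and the sample ages are invented.

```python
import math


def extended_stats(values):
    """Compute count, min, max, avg, sum and spread for a list of numbers."""
    count = len(values)
    if count == 0:
        raise ValueError("no values to aggregate")
    total = sum(values)
    avg = total / count
    # Population variance: mean squared distance from the average.
    variance = sum((value - avg) ** 2 for value in values) / count
    std_deviation = math.sqrt(variance)
    return {
        "count": count,
        "min": min(values),
        "max": max(values),
        "avg": avg,
        "sum": total,
        "sum_of_squares": sum(value ** 2 for value in values),
        "variance": variance,
        "std_deviation": std_deviation,
        "std_deviation_bounds": {
            "upper": avg + 2 * std_deviation,
            "lower": avg - 2 * std_deviation,
        },
    }


age_stats = extended_stats([30, 50])
print(age_stats)
```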
