Using ElasticSearch as a fast, flexible, and scalable solution to search occurrence records and checklists



slide-1
SLIDE 1

Using ElasticSearch as a fast, flexible, and scalable solution to search occurrence records and checklists

Christian Gendreau, Canadensys
Marie-Elise Lecoq, GBIF France

slide-2
SLIDE 2

Introduction

ElasticSearch is an open source, document oriented, distributed search engine, built on top of Apache Lucene.

From ElasticSearch GitHub page

slide-3
SLIDE 3

Setup

  • Java 6 or higher
  • Download : # wget …elasticsearch-0.90.5.zip
  • Unzip
slide-4
SLIDE 4

Configuration

  • Name your cluster
  • Replication and multi-shard are enabled by default
  • Start : # bin/elasticsearch
slide-5
SLIDE 5

Add data

Using the REST API

$ curl -XPUT 'http://localhost:9200/twitter/tweet/1' -d '{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elastic Search"
}'

slide-6
SLIDE 6

Import data

Rivers

  • Document-based database (mongoDB)
  • JDBC (relational database)
  • Data source (wikipedia, Twitter)
slide-7
SLIDE 7

Mapping

  • Schema-less
  • Customize indexing
  • Customize querying
slide-8
SLIDE 8

Autocomplete

  • analyzer edge-ngram
  • wildcard query or prefix query: not a scalable solution
  • completion suggest : experimental
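
The scalability point above can be illustrated with a toy index. This is a minimal sketch in plain Java, not ElasticSearch code: the class name `EdgeNgramAutocomplete` and the gram sizes are assumptions made for illustration. The idea is that edge n-grams are computed once at index time, so a suggestion becomes a single exact-term lookup instead of a wildcard or prefix scan over every indexed term.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy in-memory autocomplete index; illustrative only, not ElasticSearch code. */
public class EdgeNgramAutocomplete {
    private final Map<String, Set<String>> index = new HashMap<String, Set<String>>();
    private final int minGram;
    private final int maxGram;

    public EdgeNgramAutocomplete(int minGram, int maxGram) {
        this.minGram = minGram;
        this.maxGram = maxGram;
    }

    /** Index time: store the name under each of its leading substrings. */
    public void add(String name) {
        String lower = name.toLowerCase();
        int longest = Math.min(maxGram, lower.length());
        for (int len = minGram; len <= longest; len++) {
            String gram = lower.substring(0, len);
            Set<String> names = index.get(gram);
            if (names == null) {
                names = new TreeSet<String>();
                index.put(gram, names);
            }
            names.add(name);
        }
    }

    /** Query time: one exact lookup, no scan over all indexed names. */
    public List<String> suggest(String typed) {
        Set<String> names = index.get(typed.toLowerCase());
        return names == null ? Collections.<String>emptyList() : new ArrayList<String>(names);
    }

    public static void main(String[] args) {
        EdgeNgramAutocomplete ac = new EdgeNgramAutocomplete(2, 10);
        ac.add("Carex feta");
        ac.add("Carex aurea");
        ac.add("Acer rubrum");
        System.out.println(ac.suggest("car")); // both Carex names, sorted
    }
}
```

The trade-off is a larger index in exchange for cheap queries; input shorter than `minGram` or longer than `maxGram` returns nothing in this sketch.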
slide-9
SLIDE 9

ElasticSearch at Canadensys

Database of Vascular Plants of Canada (VASCAN)

data.canadensys.net/vascan

slide-10
SLIDE 10

Our ElasticSearch index

Index structure for scientific names

  • autocompletion : edge_ngram filter
  • “carex” -> “ca”, “car”, “care”, “carex”
  • genus first letter : pattern_replace filter
  • “carex feta” -> “c. feta”
  • epithet : path_hierarchy tokenizer
  • “carex feta” -> “feta”
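
To make the three transformations above concrete, here is a rough plain-Java imitation of what each analysis step emits. It is a sketch, not the Lucene implementation: the class and method names are hypothetical, the gram sizes (2 to 5) are inferred from the “carex” example, and the reverse mode with a space delimiter for path_hierarchy is an assumption, since the slides do not show the actual index settings.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java imitation of the three analysis steps; not the Lucene implementation. */
public class ScientificNameAnalysis {

    /** Like an edge_ngram filter: leading substrings from minGram to maxGram characters. */
    public static List<String> edgeNgrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<String>();
        int longest = Math.min(maxGram, token.length());
        for (int len = minGram; len <= longest; len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    /** Like a pattern_replace filter: shorten the first word (the genus) to its initial. */
    public static String abbreviateGenus(String name) {
        return name.replaceFirst("^(\\p{L})\\p{L}*\\s+", "$1. ");
    }

    /** Like a path_hierarchy tokenizer in reverse mode with a space delimiter (assumed). */
    public static List<String> reversePathTokens(String name) {
        List<String> tokens = new ArrayList<String>();
        tokens.add(name);
        for (int i = name.indexOf(' '); i >= 0; i = name.indexOf(' ', i + 1)) {
            tokens.add(name.substring(i + 1));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(edgeNgrams("carex", 2, 5));       // [ca, car, care, carex]
        System.out.println(abbreviateGenus("carex feta"));   // c. feta
        System.out.println(reversePathTokens("carex feta")); // [carex feta, feta]
    }
}
```

Under these assumptions, the epithet “feta” appears as its own token next to the full name, which would let a search on the epithet alone match the record.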
slide-11
SLIDE 11

ElasticSearch at GBIF France

Data stored in ElasticSearch is updated whenever the MongoDB data changes. The search engine queries ElasticSearch using filters such as taxon, date, place, dataset and geolocation. Statistics are calculated using facets.

slide-12
SLIDE 12

ElasticSearch at GBIF France

slide-13
SLIDE 13

ElasticSearch - Solr

  • Solr and ElasticSearch both try to solve the same problem, with few differences
  • Development setup and production deployment (replication / sharding) are easier with ElasticSearch
  • By default, ElasticSearch is well configured for Lucene, and customization remains easy

slide-14
SLIDE 14

Facets

  • Similar to “GROUP BY” in SQL
  • Mostly used to calculate statistics
  • Example :

curl -XGET [...] "facets" : { "dataset" : { "terms" : { "field" : "dataset", "order" : "term" …
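
As a rough picture of what a terms facet ordered by term returns, the sketch below counts records per value of a field in memory, much like a SQL GROUP BY with COUNT. It is an illustration only: the class name `TermsFacetSketch` and the sample dataset names are hypothetical, and ElasticSearch computes this across shards rather than over a Java list.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory imitation of a terms facet; illustrative only. */
public class TermsFacetSketch {

    /** Counts records per value; TreeMap keeps the keys sorted by term. */
    public static Map<String, Integer> termsFacet(List<String> fieldValues) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (String value : fieldValues) {
            Integer current = counts.get(value);
            counts.put(value, current == null ? 1 : current + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical "dataset" values, one per occurrence record.
        List<String> datasets = Arrays.asList("herbier-b", "herbier-a", "herbier-b");
        System.out.println(termsFacet(datasets)); // {herbier-a=1, herbier-b=2}
    }
}
```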

slide-15
SLIDE 15

API and libraries

REST API

  • interoperability between different programming languages
  • HTTP request

Java API

  • more efficient than the REST API because it uses a binary protocol
  • built-in marshalling (data formatting for the network)
slide-16
SLIDE 16

Query - RESTful API

Example:

$ curl 'localhost:9200/vascan/_search?pretty=1' -d '{"query":{ "match":{ "name" :{ "query":"carex" } } } }'
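
Because the REST API is plain HTTP, the same request can be sent from any language. The sketch below does it with JDK classes only; it is one possible approach, not the presenters' code. The class name `RestSearchSketch` and its helper methods are hypothetical, and the request uses POST, which is what curl sends when `-d` is given.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Scanner;

/** Sends a match query over HTTP using only the JDK; illustrative sketch. */
public class RestSearchSketch {

    /** Minimal escaping (backslashes and double quotes only) for a JSON string value. */
    static String jsonEscape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Builds the same body as the curl example: a match query on one field. */
    public static String matchQueryBody(String field, String text) {
        return "{\"query\":{\"match\":{\"" + jsonEscape(field)
                + "\":{\"query\":\"" + jsonEscape(text) + "\"}}}}";
    }

    /** POSTs the body to the given _search URL and returns the raw JSON response. */
    public static String search(String searchUrl, String body) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(searchUrl).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        OutputStream out = conn.getOutputStream();
        try {
            out.write(body.getBytes("UTF-8"));
        } finally {
            out.close();
        }
        InputStream in = conn.getInputStream();
        try {
            Scanner scanner = new Scanner(in, "UTF-8").useDelimiter("\\A");
            return scanner.hasNext() ? scanner.next() : "";
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) throws IOException {
        String body = matchQueryBody("name", "carex");
        System.out.println(body);
        if (args.length > 0) {
            // Only contacts a server when a _search URL is passed as an argument.
            System.out.println(search(args[0], body));
        }
    }
}
```

Running it with `http://localhost:9200/vascan/_search?pretty=1` as the argument would mirror the curl call, assuming a node with a `vascan` index is listening there.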

slide-17
SLIDE 17

Query - Java API

Code example:

...
SearchRequestBuilder srb = client.prepareSearch(INDEX_NAME)
    .setQuery(QueryBuilders.boolQuery()
        .should(QueryBuilders.matchQuery("vernacular_name", text)))
    .setTypes(VERNACULAR_TYPE);
...

slide-18
SLIDE 18

Pitfalls

  • Error reporting (index creation, river creation)
  • Results may be hard to predict using complex queries
  • Documentation
  • With each mapping modification comes a free reindex from data

slide-19
SLIDE 19

Future

  • Scientific Name analyzer
  • Geospatial component
slide-20
SLIDE 20

Thank you!