Pronto Elasticsearch Extension Practice in eBay - Donggeng Yu




SLIDE 1

Donggeng Yu 12/07/2019, Pronto, eBay

Pronto Elasticsearch Extension Practice in eBay

SLIDE 2

Agenda

  1. Overview of Elasticsearch in eBay
  2. Use Cases & Challenges
  3. Tools Extension for Clusters Management
  4. Service Extension for Clusters Capability

SLIDE 3

Elastic Stack

  • ELKB
‒ Elasticsearch: search & aggregation
‒ Logstash: ETL
‒ Kibana: visualization
‒ Beats: data shipper

  • X-Pack
‒ Security, alerting, monitoring, reporting, machine learning, etc.

  • Use Cases & OOTB Solutions
‒ Logs / Metrics
‒ APM / Uptime
‒ SIEM / Endpoint Security
‒ Site Search / App Search / Enterprise
‒ Maps

SLIDE 4

Pronto Ecosystem in eBay

62%


SLIDE 5

  • 100+ clusters
  • 6k+ nodes
  • VM (OpenStack) / Container (K8s)

SLIDE 6

Agenda

  1. Overview of Elasticsearch in eBay
  2. Use Cases & Challenges
  3. Tools Extension for Clusters Management
  4. Service Extension for Clusters Capability

SLIDE 7

Use Cases in eBay

  • Use Cases:
‒ Near-real-time search / aggregation
    ‒ Virtual Shop / Tire Installation / Terapeak / SEO
    ‒ On-site traffic
‒ Metrics & logs
    ‒ UFES / Ceilometer / SRE / UMP
    ‒ More than 20 TB/day for a single cluster

SLIDE 8

Vertical Shop & Tire Installation

SLIDE 9

Terapeak - eCommerce Data Insights

  • Terapeak
‒ SaaS-based tool that provides e-commerce data insights to online sellers
‒ Acquired by eBay

  • Tech Stack
‒ Migrated from RDBMS + Solr to ELK
‒ S3 and Hadoop for data staging
‒ Spark for data ETL
‒ Kafka for the data queue
‒ Postgres for the data warehouse
‒ Elasticsearch for indexing and search
‒ ReactJS for the front-end application

SLIDE 10

UFES - Anomaly Detection for SLB

  • Goal
‒ Unified Front-End Services: move eBay closer to users so that the world shops first on eBay
‒ The UFES team built out 8 new Internet Points of Presence (PoPs) across the globe
‒ Need to route traffic via UFES PoPs by replacing the NetScaler hardware SEO load balancers with Envoy Proxy based software load balancers

  • Elastic Stack
‒ Filebeat + Kafka + Elasticsearch clusters
‒ Dashboard for monitoring and comparison
‒ Anomaly detection for SLB

SLIDE 11

Ceilometer - IT Operation Analytics

SLIDE 12

Challenges of Managing Clusters Fleets at Scale

  • Integration with eBay’s platform and adherence to its standards
‒ Configuration management & change management
‒ Full lifecycle management

  • Easy onboarding and integration
‒ Elasticsearch as a Service
‒ Free customers to focus on their domain business

  • Performance & High Availability
‒ Search: response time for site-facing applications should be less than 100 ms
‒ Ingestion: 20 TB per day for a single cluster
‒ Different deployments, such as cross-region deployment

  • Cost Control
‒ Hardware cost
‒ License fees (for features such as security, alerting, and ML)
‒ Human resources
‒ Support (24x7 on-call support, on-site support, etc.)

Performance HA Onboarding Integration Cost

SLIDE 13

Solutions for Challenges

  • From VM to Container
‒ VM (OpenStack)
    ‒ Fixed flavor
    ‒ Puppet Foreman infrastructure
    ‒ Puppet module for Elasticsearch
‒ Container (K8s)
    ‒ Flexible flavor (request/limit)
    ‒ Operator pattern
    ‒ Deployment + StatefulSet + Service

  • Best practices & different deployments
‒ Important system configuration & best practices
‒ Anti-affinity (high availability)
‒ Cross-region deployment (high availability)
‒ Flavor chosen by traffic (cost saving)
‒ Hot-warm architecture (cost saving)
‒ LB for write / read

Cluster Provision & Management

Performance HA Onboarding Integration Cost

SLIDE 14

Solutions for Challenges

Tooling and Service Extension

Performance HA Onboarding Integration Cost

SLIDE 15

Agenda

  1. Overview of Elasticsearch in eBay
  2. Use Cases & Challenges
  3. Tools Extension for Clusters Management
  4. Service Extension for Clusters Capability

SLIDE 16

Use Case Onboarding

  • Capacity planning
‒ What are the use case and usage scenarios?
    ‒ Data retention / active period
‒ Performance
    ‒ Index rate / search rate
    ‒ Document & bulk size
‒ Deployment & cost
    ‒ How many nodes?
    ‒ What’s the hardware configuration?
    ‒ What kind of deployment should be used?
‒ Best practices
    ‒ Software configuration
    ‒ Deployment in different regions
    ‒ Keep a margin so that traffic growth does not cause performance issues

Node               Storage   Memory    CPU       Network
Master             Low       Low       Low       Low
Data               Extreme   High      High      Medium
Ingest             Low       Medium    High      Medium
Coordinator        Low       Medium    Medium    Medium
Machine Learning   Low       Extreme   Extreme   Medium
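
The "how many nodes?" question above reduces to simple arithmetic once ingest volume, retention, and replica count are known. The sketch below is only a rough illustration of that arithmetic, not the Pronto sizing tool; the function name, the 30% headroom default, and the example figures are all assumptions made for illustration.

```python
import math


def estimate_data_nodes(daily_ingest_gb, retention_days, replicas,
                        disk_per_node_gb, headroom=0.3):
    """Estimate how many data nodes are needed to hold the retained data.

    All inputs are hypothetical; headroom is the fraction of each disk
    kept free as a margin for traffic growth.
    """
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    # Primary data plus one full copy per replica.
    total_gb = daily_ingest_gb * retention_days * (1 + replicas)
    usable_per_node_gb = disk_per_node_gb * (1 - headroom)
    return math.ceil(total_gb / usable_per_node_gb)


# Example: 500 GB/day, 14 days of retention, 1 replica, 1.5 TB disks.
nodes = estimate_data_nodes(500, 14, 1, 1500)
```

Storage is only one dimension. As the table shows, data nodes are also memory- and CPU-heavy, so a real estimate would take the largest of several such constraints.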

SLIDE 17

Onboarding Self-Service and Sizing Tool

Onboarding Integration

SLIDE 18

Customer Support

  • Support model
‒ Different SLAs for different use cases
    ‒ Search response time should be less than 100 ms
    ‒ Cluster should NOT be in RED
‒ 24x7 support for site-facing use cases or Tier 2 and above
    ‒ SEC call / PagerDuty

  • Support case
‒ Cluster in RED
    ‒ Node missing and replica is 0
    ‒ Dangling index
‒ Response time
    ‒ Full GC because of machine check error (MCE)
    ‒ Too many shards and fields

Onboarding Integration Cost

SLIDE 19

Data Ingestion Pipeline

  • Added value for customers
‒ Self-service, no coding/testing
‒ No onboarding required

  • Shared cluster
‒ 30+ use cases / 3 TB per day

  • Shared data assets
‒ Partitioned by application name

  • Shared dashboard
‒ 30+ dashboards
‒ 300+ charts/visualizations

Onboarding Integration

SLIDE 20

Simple Steps - service onboarding a new use case

pom.xml web.xml

SLIDE 21

Data Management & Optimization

  • Backup & Restore
‒ Snapshot lifecycle management (Swift as the repository)

  • Time series data
‒ Benefits of using time-based indices
    ‒ Deleting an index is faster than delete-by-query
    ‒ Use hot-warm architecture
    ‒ Close indices or force-merge read-only indices
‒ Time series data
    ‒ Terapeak vs. UFES (different needs)

  • Lifecycle management
‒ Central policy management / Web UI / OOTB policies

Performance Onboarding Integration Cost
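
With time-based indices, retention becomes a matter of picking whole indices to drop rather than deleting documents by query. The following is a minimal sketch of that selection step. It assumes daily indices named `<prefix>-YYYY.MM.DD`, a naming convention chosen here for illustration and not stated in the slides, and `expired_indices` is a hypothetical helper.

```python
from datetime import date, datetime, timedelta


def expired_indices(index_names, prefix, retention_days, today):
    """Return the daily indices that fall outside the retention window.

    Assumes names of the form '<prefix>-YYYY.MM.DD' (a hypothetical
    convention); names that do not match are ignored.
    """
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in index_names:
        if not name.startswith(prefix + "-"):
            continue
        try:
            day = datetime.strptime(name[len(prefix) + 1:], "%Y.%m.%d").date()
        except ValueError:
            continue  # not a daily index, leave it alone
        if day < cutoff:
            expired.append(name)
    return sorted(expired)


names = ["logs-2019.11.01", "logs-2019.12.01", "logs-2019.12.06",
         "metrics-2019.10.01"]
to_delete = expired_indices(names, "logs", 7, date(2019, 12, 7))
```

Each returned name can then be dropped in one operation with the Elasticsearch delete index API (`DELETE /<index>`), which avoids per-document deletes.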

SLIDE 22

Index Management Tool vs. Curator vs. ILM

Function                Curator   Pronto Index Mgmt. Tool   Elastic ILM
High Availability       N/A       YES                       YES
Web UI                  N/A       YES                       YES
Version Compatibility   N/A       2.x/5.x/6.x/7.x           6.8+
Multi-Clusters          N/A       YES                       N/A

SLIDE 23

Diagnostic Tool

  • Features
‒ Find improper settings or usage
‒ Job scheduler & diagnostic report for potential issues

  • Rules
‒ Too many indices / too many shards / index has too many fields
‒ Shard size check (20 GB to 40 GB)
‒ Imbalanced shards
‒ Replica count should be greater than 0
‒ Node missing / rack ID attribute missing / minimum master nodes
‒ Machine check error / server disk full
‒ Alias & index template checking

Performance Cost
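
A few of the rules above can be expressed as plain checks over per-index statistics. The sketch below is an illustration only, not the Pronto diagnostic tool: the input dict shape, the function name, and the `MAX_FIELDS` threshold are assumptions, while the 20 to 40 GB shard range and the replica rule come from the list above.

```python
MAX_FIELDS = 1000  # assumed threshold for "too many fields"


def diagnose_index(index):
    """Check one index summary against a few of the rules listed above.

    `index` is a hypothetical dict with the keys: name, primary_shards,
    replicas, primary_size_gb, field_count.
    """
    findings = []
    avg_shard_gb = index["primary_size_gb"] / index["primary_shards"]
    if not 20 <= avg_shard_gb <= 40:
        findings.append(
            f"{index['name']}: average shard size {avg_shard_gb:.1f} GB "
            "is outside 20-40 GB")
    if index["replicas"] < 1:
        findings.append(f"{index['name']}: replica count should be greater than 0")
    if index["field_count"] > MAX_FIELDS:
        findings.append(f"{index['name']}: too many fields ({index['field_count']})")
    return findings


sample = {"name": "logs-2019.12.07", "primary_shards": 10, "replicas": 0,
          "primary_size_gb": 100, "field_count": 50}
findings = diagnose_index(sample)
```

Run on a schedule across all indices, the collected findings would form the kind of diagnostic report the slide describes.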

SLIDE 24

Performance & User Scenarios

  • Many Factors:
‒ Index / Shard
‒ Query / Scripting
‒ Mapping / Setting

Behavior       Use Cases
Index heavy    Logging / Metrics / Security / APM
Search heavy   App Search / Site Search / Analytics
Update heavy   Caching / Systems of Record

SLIDE 25

Performance Issues & Optimization

  • Wildcard search
‒ Customers use patterns beginning with * or ?
‒ Avoid using * or ?

  • Stopwords & Shard Size
‒ Reindex with the stop words
‒ Use more shards to improve the throughput

  • Too many indices / shards / fields
‒ Close or delete the unused indices
‒ Improve the document modeling
‒ Disable the dynamic mapping

  • Performance Optimization
‒ Disable swapping & give memory to the file system cache
‒ Unset or increase the refresh interval
‒ Disable refresh and replicas for initial loads
‒ Use auto-generated IDs
‒ Disable the features you do not need
‒ Don’t use the default dynamic string mapping
‒ Watch your shard size / shrink index
‒ Force merge
‒ Pre-index data
‒ Avoid scripts
‒ Force-merge read-only indices
‒ Warm up global ordinals
‒ Replicas might help with throughput, but not always
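
The "disable refresh and replicas for initial loads" item can be illustrated with the request bodies one might send to the Elasticsearch update index settings API (`PUT /<index>/_settings`). This is a sketch under assumptions: the `restore_settings` helper and the post-load values (`30s`, one replica) are hypothetical, and only the setting names `refresh_interval` and `number_of_replicas` are standard Elasticsearch index settings.

```python
import json

# Applied before a large initial load: no periodic refresh, no replicas.
BULK_LOAD_SETTINGS = {
    "index": {"refresh_interval": "-1", "number_of_replicas": 0}
}


def restore_settings(refresh_interval="30s", replicas=1):
    """Settings to apply once the load finishes (values are assumptions)."""
    return {
        "index": {
            "refresh_interval": refresh_interval,
            "number_of_replicas": replicas,
        }
    }


# JSON bodies for two PUT /<index>/_settings calls, before and after the load.
before_body = json.dumps(BULK_LOAD_SETTINGS)
after_body = json.dumps(restore_settings())
```

Restoring replicas after the load lets Elasticsearch copy the finished segments once instead of indexing every document on each replica.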

SLIDE 26

Performance Testing Tool

  • Performance testing
‒ Testing data
‒ Testing scripts
‒ Test report for analysis

  • Web-based tool
‒ Built on Gatling
‒ Web UI to select the testing scripts and testing data
‒ Test report for analysis

Performance

SLIDE 27

Agenda

  1. Overview of Elasticsearch in eBay
  2. Use Cases & Challenges
  3. Tools Extension for Clusters Management
  4. Service Extension for Clusters Capability

SLIDE 28

Solution and security plugin for Elasticsearch

  • Pronto Security Plugin
‒ TLS for encrypted communications
‒ Cluster / index level RBAC control
‒ Follow eBay’s standard
    ‒ API key for applications
    ‒ 2FA for user login
    ‒ Audit logs

  • Security Considerations
‒ Authentication / RBAC
‒ Certificate retention
‒ Firewall / IP whitelist
‒ Vulnerability management

Cost

SLIDE 29

X-Pack Subscription

  • License cost
‒ License fee is based on the node count

  • How to Extend
‒ Develop the Kibana application
‒ Integrate with the alerting and anomaly detection service

Cost

SLIDE 30

Alerting Service

  • Schedule
‒ A schedule for running a query and checking the condition.

  • Query
‒ The query to run as input to the condition. Watches support the full Elasticsearch query and aggregation

  • Condition
‒ A condition that determines whether or not to execute the actions. You can use simple conditions (always true), or use scripting for more sophisticated scenarios

  • Action
‒ One or more actions, such as sending email or pushing data to 3rd party systems through a webhook
‒ Throttling

Biz Data Alert

Cost
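
The schedule / query / condition / action / throttling model described on this slide can be sketched in a few lines. This is a toy illustration under assumptions, not the eBay alerting service or the Watcher API: the `Alert` class, its parameters, and the stand-in query and action are all hypothetical.

```python
import time


class Alert:
    """A minimal schedule/query/condition/action loop with throttling."""

    def __init__(self, interval_s, query, condition, actions, throttle_s=0):
        self.interval_s = interval_s  # schedule: seconds between checks
        self.query = query            # callable returning a result payload
        self.condition = condition    # callable: payload -> bool
        self.actions = actions        # callables invoked with the payload
        self.throttle_s = throttle_s  # minimum seconds between firings
        self._last_check = None
        self._last_fired = None

    def tick(self, now=None):
        """Run one scheduler tick; return True if the actions were executed."""
        now = time.time() if now is None else now
        if self._last_check is not None and now - self._last_check < self.interval_s:
            return False  # not due yet
        self._last_check = now
        payload = self.query()
        if not self.condition(payload):
            return False
        if self._last_fired is not None and now - self._last_fired < self.throttle_s:
            return False  # condition met, but throttled
        for action in self.actions:
            action(payload)
        self._last_fired = now
        return True


sent = []
alert = Alert(
    interval_s=60,
    query=lambda: {"error_count": 12},       # stand-in for an Elasticsearch query
    condition=lambda p: p["error_count"] > 10,
    actions=[sent.append],                   # stand-in for email or a webhook
    throttle_s=300,
)
results = [alert.tick(now=t) for t in (0, 30, 60, 400)]
```

In this run the alert fires at t=0, is skipped at t=30 because the check is not yet due, is throttled at t=60, and fires again at t=400 once the throttle period has passed.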

SLIDE 31

Cost

SLIDE 32

ML for Anomaly Detection

Cost

SLIDE 33

Review

  • Easy onboarding and integration
  • High Availability & High Performance
  • Low cost for hardware / license fee / support efforts

Performance HA Onboarding Integration Cost

SLIDE 34

  • Thanks!
  • mailto:dongyu@ebay.com