Donggeng Yu 12/07/2019, Pronto, eBay
Pronto Elasticsearch Extension Practice in eBay
Agenda
1 Overview of Elasticsearch in eBay
2 Use Cases & Challenges
3 Tools Extension for Clusters Management
4 Service Extension for Clusters
‒ Elasticsearch - Search & Aggregation
‒ Logstash - ETL
‒ Kibana - Visualization
‒ Beats - Data Shipper

‒ Security, alerting, monitoring, reporting, machine learning, etc.

‒ Logs / Metrics
‒ APM / Uptime
‒ SIEM / Endpoint Security
‒ Site Search / App Search / Enterprise
‒ Maps
‒ Near real time search / aggregation
  ‒ Virtual Shop / Tire Installation / Terapeak / SEO
  ‒ On-Site Traffic
‒ Metrics & Logs
  ‒ UFES / Ceilometer / SRE / UMP
  ‒ More than 20 TB/day for a single cluster
‒ SaaS-based tool for providing ecommerce data insights to online sellers
‒ Acquired by eBay

‒ From RDBMS + Solr to ELK
‒ S3 and Hadoop for data staging
‒ Spark for data ETL
‒ Kafka for data queue
‒ Postgres for data warehouse
‒ Elasticsearch for indexing and search
‒ ReactJS for front-end application
‒ Unified Front-End Services - Move eBay Closer to Users so that the world shops first
Internet Points of Presence (PoPs) across the globe
‒ Need to route traffic via UFES PoPs by replacing the NetScaler hardware SEO load balancers with Envoy Proxy-based software load balancers.

‒ Filebeat + Kafka + Elasticsearch clusters
‒ Dashboard for monitoring and comparison
‒ Anomaly detection for SLB
‒ Configuration management & change management
‒ Full lifecycle management

‒ Elasticsearch as a Service
‒ How to free customers to focus on their domain business

‒ Search: site-facing application response time should be less than 100 ms
‒ Ingesting: 20 TB per day for a single cluster
‒ Different deployments, such as cross-region deployment

‒ Hardware cost
‒ License fee (for features such as security, alerting, and ML)
‒ Human resources
‒ Support (24/7 on-call support, on-site support, etc.)
Performance HA Onboarding Integration Cost
‒ VM (OpenStack)
  ‒ Fixed flavor
  ‒ Puppet Foreman infrastructure
  ‒ Puppet module for Elasticsearch
‒ Container (K8s)
  ‒ Flexible flavor (request/limit)
  ‒ Operator pattern
  ‒ Deployment + StatefulSet + Service

‒ Important system configuration & best practices
‒ Anti-affinity (high availability)
‒ Cross-region deployment (high availability)
‒ Flavor chosen by traffic (cost saving)
‒ Hot-warm architecture (cost saving)
‒ LB for write / read
Performance HA Onboarding Integration Cost
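The hot-warm item above can be illustrated with a minimal sketch. It assumes nodes are tagged with a custom attribute named box_type (a common convention, not something the slides confirm) and uses Elasticsearch's index-level shard allocation filtering setting. The 7-day hot window is a hypothetical value.

```python
def allocation_settings(index_age_days: int, hot_days: int = 7) -> dict:
    """Return index settings that pin an index to hot or warm nodes.

    Assumption: data nodes carry a custom attribute `box_type` set to
    "hot" or "warm". `hot_days` is an illustrative threshold only.
    """
    tier = "hot" if index_age_days < hot_days else "warm"
    # `index.routing.allocation.require.<attr>` is the shard allocation
    # filtering setting; the attribute name itself is our assumption.
    return {"index.routing.allocation.require.box_type": tier}
```

A scheduled job could send the returned dict as the body of an index settings update once an index ages out of the hot window, so older data moves to cheaper hardware.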
‒ What are the use cases and usage scenarios?
‒ Data retention / active period
‒ Performance
  ‒ Index rate / search rate
  ‒ Document & bulk size
‒ Deployment & Cost
  ‒ How many nodes?
  ‒ What’s the hardware configuration?
  ‒ What kind of deployment should be used?
‒ Best practices
  ‒ Software configuration
  ‒ Deployment in different regions
  ‒ Keep headroom so that traffic can grow without causing performance issues
Node              Storage   Memory    CPU       Network
Master            Low       Low       Low       Low
Data              Extreme   High      High      Medium
Ingest            Low       Medium    High      Medium
Coordinator       Low       Medium    Medium    Medium
Machine Learning  Low       Extreme   Extreme   Medium
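The "How many nodes?" question can be approached with a rough storage-only estimate. The sketch below is an assumption-laden illustration, not Pronto's sizing method: the function name, its parameters, and the 30% headroom default are all hypothetical, and it ignores CPU, memory, and indexing overhead.

```python
import math


def estimate_data_nodes(daily_ingest_gb: float, retention_days: int,
                        replicas: int, node_storage_gb: float,
                        headroom: float = 0.3) -> int:
    """Estimate the data-node count needed to hold the retained data.

    Storage-only estimate: primaries plus replicas over the retention
    period, with a fraction of each disk kept free as headroom.
    """
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    total_gb = daily_ingest_gb * retention_days * (1 + replicas)
    usable_per_node_gb = node_storage_gb * (1 - headroom)
    return math.ceil(total_gb / usable_per_node_gb)
```

The headroom parameter reflects the best practice of keeping a margin for traffic growth. In practice the result should be cross-checked against index rate and search rate.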
Onboarding Integration
‒ Different SLA for different use cases
  ‒ Search response time should be less than 100 ms
  ‒ Cluster should NOT be in RED
‒ 24/7 support for site-facing or Tier 2 and above use cases
  ‒ SEC call / PagerDuty
‒ Cluster in RED
  ‒ Node missing and replica is 0
  ‒ Dangling index
‒ Response time
  ‒ Full GC because of machine check error (MCE)
  ‒ Too many shards and fields
Onboarding Integration Cost
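The two SLA rules above (search under 100 ms, cluster never RED) can be expressed as a small check. This sketch assumes the caller has already fetched a cluster health document, whose "status" field is "green", "yellow", or "red", and has measured a response time. The function name and message strings are hypothetical.

```python
from typing import List


def sla_violations(health: dict, response_time_ms: float,
                   limit_ms: float = 100.0) -> List[str]:
    """Return a list of SLA violations for one cluster.

    `health` is assumed to look like a cluster health response,
    e.g. {"status": "green", ...}. `response_time_ms` is whatever
    latency measure the SLA is defined on.
    """
    violations = []
    if health.get("status") == "red":
        violations.append("cluster is RED")
    if response_time_ms >= limit_ms:
        violations.append(
            f"response time {response_time_ms} ms is not below {limit_ms} ms"
        )
    return violations
```

A non-empty result could be what triggers paging for site-facing clusters.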
‒ Self-service, no coding/testing
‒ No onboarding required

‒ 30+ use cases / 3 TB per day

‒ Partition by application name

‒ 30+ dashboards
‒ 300+ charts/visualizations
Onboarding Integration
‒ Snapshot lifecycle management (Swift as the repository)
‒ Benefits of using time-based indices
  ‒ Deleting an index is faster than delete-by-query
  ‒ Use hot-warm architecture
  ‒ Close indices or force-merge read-only indices
‒ Time series data
‒ Terapeak vs. UFES (different needs)
‒ Central policy management / Web UI / OOTB policies
Performance Onboarding Integration Cost
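The benefit of time-based indices for retention can be sketched as follows: with one index per day, expiring old data means dropping whole indices rather than running delete-by-query. The naming scheme (prefix plus YYYY.MM.DD) and the "logs-" prefix are assumptions for illustration, not Pronto's actual convention.

```python
from datetime import date, datetime, timedelta
from typing import Iterable, List


def expired_indices(index_names: Iterable[str], today: date,
                    retention_days: int, prefix: str = "logs-") -> List[str]:
    """Return daily indices whose date is older than the retention window.

    Assumes names like "logs-2019.07.01"; anything that does not match
    the prefix or the date pattern is ignored rather than deleted.
    """
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in index_names:
        if not name.startswith(prefix):
            continue
        try:
            day = datetime.strptime(name[len(prefix):], "%Y.%m.%d").date()
        except ValueError:
            continue  # not a dated index; leave it alone
        if day < cutoff:
            expired.append(name)
    return expired
```

A policy engine would then delete, close, or force-merge the returned indices depending on the policy attached to each use case.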
Function               Curator   Pronto Index Mgmt. Tool   Elastic ILM
High Availability      N/A       YES                       YES
Web UI                 N/A       YES                       YES
Version Compatibility  N/A       2.x/5.x/6.x/7.x           6.8+
Multi-Clusters         N/A       YES                       N/A
‒ Find improper settings or usage
‒ Job scheduler & diagnostic report for potential issues

‒ Too many indices / too many shards / index has too many fields
‒ Shard size check (20 GB to 40 GB)
‒ Imbalanced shards
‒ Replica count should be greater than 0
‒ Node missing / rack ID attribute missing / minimum master
‒ Machine check error / server disk full
‒ Alias & index template checking
Performance Cost
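Two of the checks listed above, shard size and replica count, are simple enough to sketch. The function below is a hypothetical illustration of what a diagnostic job might compute per index. Its name, inputs, and message wording are assumptions, and the 20 GB to 40 GB bounds come from the slide.

```python
from typing import List, Sequence


def diagnose_index(name: str, primary_shard_sizes_gb: Sequence[float],
                   replicas: int, min_gb: float = 20.0,
                   max_gb: float = 40.0) -> List[str]:
    """Return human-readable findings for one index."""
    findings = []
    if replicas < 1:
        findings.append(f"{name}: replica count is 0")
    for shard, size in enumerate(primary_shard_sizes_gb):
        if size > max_gb:
            findings.append(f"{name}: shard {shard} is {size} GB (above {max_gb} GB)")
        elif size < min_gb:
            findings.append(f"{name}: shard {shard} is {size} GB (below {min_gb} GB)")
    return findings
```

Findings from all indices could be collected into the diagnostic report on each scheduled run.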
‒ Index / Shard
‒ Query / Scripting
‒ Mapping / Setting

Behavior      Use Cases
Index heavy   Logging / Metrics / Security / APM
Search heavy  App Search / Site Search / Analytics
Update heavy  Caching / Systems of Record
‒ Customers use patterns beginning with * and ?.
‒ Avoid using * or ?.

‒ Reindex with the stop words
‒ Use more shards to improve throughput

‒ Close or delete the unused indices
‒ Improve the document modeling
‒ Disable the dynamic mapping

‒ Disable swapping & give memory to the file system cache
‒ Unset or increase the refresh interval
‒ Disable refresh and replicas for initial loads
‒ Use auto-generated IDs
‒ Disable the features you do not need
‒ Don’t use default dynamic string mapping
‒ Watch your shard size / shrink index
‒ Force merge
‒ Pre-index data
‒ Avoid scripts
‒ Force-merge read-only indices
‒ Warm up global ordinals
‒ Replicas might help with throughput, but not always
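The tip "Disable refresh and replicas for initial loads" can be sketched as a pair of settings bodies. The setting names index.refresh_interval and index.number_of_replicas are standard Elasticsearch index settings; the function names and the restore defaults ("30s", 1 replica) are hypothetical choices for illustration.

```python
def bulk_load_settings() -> dict:
    """Settings body to apply before a large initial load."""
    return {
        "index": {
            "refresh_interval": "-1",   # -1 disables periodic refresh
            "number_of_replicas": 0,    # index into primaries only
        }
    }


def post_load_settings(refresh_interval: str = "30s", replicas: int = 1) -> dict:
    """Settings body to restore once the load has finished.

    The defaults here are illustrative; use the values the use case needs.
    """
    return {
        "index": {
            "refresh_interval": refresh_interval,
            "number_of_replicas": replicas,
        }
    }
```

Each dict would be sent as the body of an update to the index settings, the first before the load starts and the second after it completes, at which point replicas are rebuilt from the primaries.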
‒ Testing data
‒ Testing scripts
‒ Test report for analysis

‒ Built on Gatling
‒ Web UI to select the testing scripts and testing data
‒ Test report for analysis
Performance
‒ TLS for encrypted communications
‒ Cluster / index level RBAC control
‒ Follow eBay’s standard

‒ API key for applications
‒ 2FA for user login
‒ Audit logs

‒ Authentication / RBAC
‒ Certificate retention
‒ Firewall / IP whitelist
‒ Vulnerability management
Cost
‒ License fee is based on the node count
‒ Develop the Kibana application
‒ Integrate with the alerting and anomaly detection service
Cost
‒ A schedule for running a query and checking the condition.
‒ The query to run as input to the condition; supports Elasticsearch queries and aggregations.
‒ A condition that determines whether to execute the actions; use simple conditions (always true), or use scripting for more sophisticated scenarios.
‒ One or more actions, such as sending email or pushing data to 3rd party systems through a webhook
‒ Throttling
Cost
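The alert components listed above (schedule, query, condition, actions, throttling) can be tied together in a minimal sketch. This is an illustration of the pattern only, not the API of the Pronto alerting service or of any Elastic feature: the Watch class and all of its fields are hypothetical. The scheduler and the query are assumed to live outside the class, which receives the query result on each scheduled run.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, List, Optional


@dataclass
class Watch:
    condition: Callable[[dict], bool]        # decides whether to act
    actions: List[Callable[[dict], None]]    # e.g. send email, call webhook
    throttle: timedelta = timedelta(minutes=5)
    last_fired: Optional[datetime] = None

    def evaluate(self, result: dict, now: datetime) -> bool:
        """Handle one scheduled run; return True if the actions fired."""
        if not self.condition(result):
            return False
        # Throttling: suppress repeat actions inside the throttle period.
        if self.last_fired is not None and now - self.last_fired < self.throttle:
            return False
        for action in self.actions:
            action(result)
        self.last_fired = now
        return True
```

Throttling matters because a condition that stays true across many scheduled runs would otherwise send the same notification on every run.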