Cruise Control: Effortless Management of Kafka Clusters – PowerPoint PPT Presentation



SLIDE 1

Cruise Control: Effortless Management of Kafka Clusters

Adem Efe Gencer

Senior Software Engineer, LinkedIn

SLIDE 2

Kafka: A Distributed Stream Processing Platform

  • High throughput & low latency
  • Message persistence on partitioned data
  • Total ordering within each partition
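As a toy illustration of per-partition ordering, the sketch below routes keyed records through a made-up hash partitioner (Kafka's real default uses murmur2); records that share a key land in one partition and stay in send order:

```python
# Toy keyed partitioner: NOT Kafka's murmur2 default, just a sketch
# of why records sharing a key keep their relative order.
NUM_PARTITIONS = 3

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    return sum(key.encode()) % num_partitions

log = {p: [] for p in range(NUM_PARTITIONS)}
for key, value in [("user-1", "a"), ("user-2", "b"), ("user-1", "c")]:
    log[partition_for(key)].append((key, value))

# All records for "user-1" share a partition and keep send order.
p = partition_for("user-1")
print([v for k, v in log[p] if k == "user-1"])  # ['a', 'c']
```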

2

SLIDE 3

Key Concepts: Brokers, Topics, Partitions, and Replicas

[Diagram: one Kafka cluster of three brokers (Broker-0, Broker-1, Broker-2), each hosting replicas of partitions from two topics; one replica of Partition-1 of the blue topic is highlighted]
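The concepts above can be captured in a toy replica-assignment map (purely illustrative, not Cruise Control's actual cluster model):

```python
# Toy assignment map: topic -> partition -> ordered broker ids
# (leader listed first). Topic names mirror the diagram's colors.
assignment = {
    "blue":   {0: [0, 1], 1: [1, 2]},
    "orange": {0: [2, 0]},
}

def replicas_on(broker: int) -> list[tuple[str, int]]:
    """All (topic, partition) replicas hosted on a broker."""
    return [(t, p) for t, parts in assignment.items()
            for p, brokers in parts.items() if broker in brokers]

print(replicas_on(0))  # [('blue', 0), ('orange', 0)]
```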

3

SLIDE 4

Key Concepts: Leaders and Followers

[Diagram: the same cluster; for each partition, one replica is marked as the leader replica and the others as follower replicas]

4

SLIDE 5

Key Concepts: Producers

[Diagram: Producer-1 and Producer-2 writing to the leader replicas on the brokers]

5

SLIDE 6

Key Concepts: Consumers

[Diagram: Consumer-1 and Consumer-2 reading from the brokers]

6

SLIDE 7

Key Concepts: Failover via Leadership Transfer

[Diagram: Broker-1 fails, leaving its partitions with one fewer replica on the surviving brokers]

7

SLIDE 8

Key Concepts: Failover via Leadership Transfer

[Diagram: leadership for the partitions led by the failed broker transfers to replicas on Broker-0 and Broker-2]
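The failover shown above can be sketched as a small function over a toy assignment map (leader listed first; illustrative only, not Kafka's controller logic):

```python
# Sketch of failover via leadership transfer. The map format is
# topic -> partition -> ordered broker ids, leader first.
assignment = {"blue": {1: [1, 2]}}  # partition 1 led by broker 1

def fail_over(assignment, failed_broker):
    """Move leadership of every partition led by the failed broker
    to its first surviving follower, dropping the failed replica."""
    for parts in assignment.values():
        for p, replicas in parts.items():
            if replicas and replicas[0] == failed_broker:
                survivors = [b for b in replicas if b != failed_broker]
                if survivors:
                    parts[p] = survivors  # new leader listed first

fail_over(assignment, failed_broker=1)
print(assignment)  # {'blue': {1: [2]}}
```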

8

SLIDE 9

Kafka Incurs Management Overhead

  • Large deployments – e.g. at LinkedIn: 2.6K+ brokers, 44K+ topics, 5M partitions, 5T messages / day
  • Frequent hardware failures
  • Load skew among brokers
  • Kafka cluster expansion and reduction

“Elephant” (CC0): https://pixabay.com/en/elephant-safari-animal-defence-1421167, "Seesaw at Evelle" by Rachel Coleman (CC BY-SA 2.0): https://www.flickr.com/photos/rmc28/4862153119, “Inflatable Balloons” (Public Domain): https://commons.wikimedia.org/wiki/File:InflatableBalloons.jpg

9

SLIDE 10

Alleviating the Management Overhead

1. Admin Operations for Cluster Maintenance
2. Anomaly Detection with Self-Healing
3. Real-Time Monitoring of Kafka Clusters

10

SLIDE 11

Admin Operations for Cluster Maintenance

  • Dynamically balance the cluster load
  • Add / remove brokers
  • Demote brokers – i.e. remove leadership of all replicas
  • Trigger preferred leader election
  • Fix offline replicas

11

SLIDE 12

Admin Operations for Cluster Maintenance

  • Dynamically balance the cluster load
  • Add / remove brokers
  • Demote brokers – i.e. remove leadership of all replicas
  • Trigger preferred leader election
  • Fix offline replicas

12

SLIDE 13

Dynamically Balance the Cluster Load

Must satisfy hard goals, including:
  • Never exceed the capacity of broker resources – e.g. disk, CPU, network bandwidth
  • Guarantee rack-aware distribution of replicas
  • Enforce operational requirements – e.g. maximum replica count per broker
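A hard goal is pass/fail. Rack awareness, for instance, can be checked as below (an illustrative sketch with a made-up broker-to-rack map, not Cruise Control's RackAwareGoal implementation):

```python
# Sketch of a rack-awareness hard-goal check. The broker-to-rack
# mapping is invented for illustration.
broker_rack = {0: "rack-a", 1: "rack-a", 2: "rack-b"}

def rack_aware(replicas: list[int]) -> bool:
    """A placement is rack-aware when no two replicas of the same
    partition share a rack."""
    racks = [broker_rack[b] for b in replicas]
    return len(set(racks)) == len(racks)

print(rack_aware([0, 2]))  # True  (rack-a, rack-b)
print(rack_aware([0, 1]))  # False (both on rack-a)
```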

13

SLIDE 14

Dynamically Balance the Cluster Load

Satisfy soft goals as much as possible – i.e. best effort:
  • Balance replica distribution
  • Balance potential outbound network load
  • Balance distribution of partitions from the same topic
  • Balance disk, CPU, and inbound/outbound network traffic utilization of brokers
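Unlike hard goals, a soft goal can be scored and improved best-effort. A minimal sketch, using the standard deviation of per-broker utilization as an assumed balance measure (Cruise Control's actual goals are more elaborate):

```python
# Soft-goal scoring sketch: lower spread of broker utilization
# means a better-balanced cluster.
from statistics import pstdev

def balance_score(utilization: dict[int, float]) -> float:
    """Lower is better: population std-dev of broker utilization."""
    return pstdev(utilization.values())

before = {0: 0.9, 1: 0.1, 2: 0.5}   # skewed cluster
after  = {0: 0.5, 1: 0.5, 2: 0.5}   # rebalanced
print(balance_score(before) > balance_score(after))  # True
```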

14

SLIDE 15

Anomaly Detection with Self-Healing

  • Goal violation – rebalance cluster
  • Broker failure – decommission broker(s)
  • Metric anomaly – demote broker(s)

15

SLIDE 16

Real-Time Monitoring of Kafka Clusters

  • Check the health of brokers, disks, and user tasks
  • Examine the replica, leader, and load distribution
  • Identify under-replicated, under-min-ISR, and offline partitions

16

SLIDE 17

Building Blocks of Management: Moving Replicas

[Diagram: a replica of a partition moves from Broker-0 to Broker-1]

Replica Move

17

SLIDE 18

Building Blocks of Management: Moving Replicas

[Diagram: the replica move completed – the moved replica now resides on Broker-1]

Replica Move

Broader impact, but expensive:
  • Requires data transfer*

* Replica swap: bidirectional reassignment of distinct partition replicas among brokers

18

SLIDE 19

Building Blocks of Management: Moving Leadership

[Diagram: leadership of a partition transfers from the replica on Broker-0 to the replica on Broker-1]

Leadership Move

19

SLIDE 20

Building Blocks of Management: Moving Leadership

[Diagram: the leadership move completed – the replica on Broker-1 now leads the partition]

Leadership Move

Cheap, but has limited impact:
  • Affects network bytes out and CPU

20

SLIDE 21

A Multi-Objective Optimization Problem

Achieve conflicting cluster management goals while minimizing the impact of required operations on user traffic

21

SLIDE 22

ARCHITECTURE

“Joy Oil Gas Station Blueprints” (Public Domain): https://commons.wikimedia.org/wiki/File:Joy_Oil_gas_station_blueprints.jpg

SLIDE 23

Cruise Control Architecture

[Architecture diagram: a REST API fronts the Monitor, Analyzer, Executor, and Anomaly Detector. The Monitor ingests reported metrics from the Kafka cluster's Metrics Reporter via an internal metrics topic, persists samples to a load-history topic through the Sample Store for backup and recovery, and obtains broker capacities via the Capacity Resolver. The Analyzer evaluates the configured goal(s); the Executor applies throttled proposal executions to the cluster; the Anomaly Detector runs goal-violation, metric-anomaly, and broker-failure finder(s) that report to an Anomaly Notifier]

Pluggable component:
  • Implements a public interface
  • Accepts custom user code
  • Created and used by Cruise Control and its metrics reporter

23

SLIDE 24

Metrics Reporter

[Diagram: each broker's Metrics Reporter produces to the Cruise Control metrics topic]

Produces selected Kafka cluster metrics to the configured metrics reporter topic with the configured frequency
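Concretely, the reporter is enabled through broker-side configuration. A sketch of the relevant properties, per the open-source project's README (names and defaults should be verified against your Cruise Control version):

```properties
# On every Kafka broker: load the Cruise Control metrics reporter
metric.reporters=com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter
# Topic the reporter produces to (assumed default shown)
cruise.control.metrics.topic=__CruiseControlMetrics
# Reporting frequency (assumed name; milliseconds)
cruise.control.metrics.reporter.metrics.reporting.interval.ms=60000
```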

24

SLIDE 25

Monitor

[Diagram: the Monitor comprises the Metric Sampler, Sample Store, and Capacity Resolver]

Generates a model to describe the cluster

25

SLIDE 26

Monitor: Cluster Model

[Chart: per-broker disk, CPU, network-in, and network-out utilization recorded over a series of monitoring windows, up to the latest utilization]

The model captures:
  • Load – current and historical utilization of brokers and replicas
  • Topology – rack, host, and broker distribution
  • Placement – replica, leadership, and partition distribution
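A minimal sketch of windowed load history per broker (illustrative only; the resource names follow the chart above, not Cruise Control's internals):

```python
# Windowed load history sketch: keep the last N monitoring windows
# of utilization snapshots per broker, oldest evicted first.
from collections import deque

class BrokerLoad:
    def __init__(self, num_windows: int):
        self.windows = deque(maxlen=num_windows)

    def record(self, disk: float, cpu: float, nw_in: float, nw_out: float):
        self.windows.append({"disk": disk, "cpu": cpu,
                             "nw_in": nw_in, "nw_out": nw_out})

    def latest(self):
        return self.windows[-1]

    def avg(self, resource: str) -> float:
        return sum(w[resource] for w in self.windows) / len(self.windows)

load = BrokerLoad(num_windows=3)
for cpu in (0.2, 0.4, 0.6, 0.8):   # 4 samples; window keeps last 3
    load.record(disk=0.5, cpu=cpu, nw_in=0.1, nw_out=0.1)

print(load.latest()["cpu"])        # 0.8
print(round(load.avg("cpu"), 2))   # 0.6
```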

26

SLIDE 27

Monitor: Metric Sampler

[Diagram: the Metric Sampler consumes reported metrics from the metrics topic]

  • Periodically (e.g. every 5 min) consumes the reported metrics to model the load on brokers and partitions

27

SLIDE 28

Monitor: Sample Store

[Diagram: the Sample Store produces to, and recovers from, the load-history topic]

  • Produces broker and partition models to the load-history topic, and uses the stored data to recover upon failure

28

SLIDE 29

Monitor: Capacity Resolver


  • Gathers the broker capacities from a pluggable resolver

29

SLIDE 30

Analyzer


Generates proposals to achieve goals via a fast and near-optimal heuristic solution

30

SLIDE 31

Analyzer: Goals


  • Priorities – custom order of optimization
  • Strictness – hard (e.g. rack awareness) or soft (e.g. resource utilization balance) optimization demands
  • Modes – e.g. kafka-assigner (https://github.com/linkedin/kafka-tools)
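Priority order plus hard/soft strictness can be sketched as a small optimization loop. The goal names below match real Cruise Control goals, but the logic is a toy stand-in, not the Analyzer's algorithm:

```python
# Priority-ordered goal optimization sketch over a per-broker
# replica-count map. A failed hard goal raises; soft is best effort.
def optimize(state, goals):
    for goal in goals:                       # goals listed by priority
        state, satisfied = goal["fn"](state)
        if goal["hard"] and not satisfied:
            raise RuntimeError(f"hard goal failed: {goal['name']}")
    return state

def cap_goal(state, cap=3):
    # hard: no broker may hold more than `cap` replicas; shed excess
    # to the currently least-loaded other broker
    for b in state:
        while state[b] > cap:
            tgt = min((o for o in state if o != b), key=state.__getitem__)
            if state[tgt] >= cap:            # nowhere left to shed
                return state, False
            state[b] -= 1
            state[tgt] += 1
    return state, True

def balance_goal(state):
    # soft: replica counts within 1 of each other
    return state, max(state.values()) - min(state.values()) <= 1

goals = [
    {"name": "ReplicaCapacityGoal", "hard": True, "fn": cap_goal},
    {"name": "ReplicaDistributionGoal", "hard": False, "fn": balance_goal},
]
print(optimize({0: 5, 1: 1, 2: 0}, goals))  # {0: 3, 1: 2, 2: 1}
```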

31

SLIDE 32

Analyzer: Proposals


Proposals – in order of priority:

  • Leadership move > Replica move > Replica swap

32

SLIDE 33

Executor

[Diagram: the Executor applies throttled proposal execution to the Kafka cluster]

Proposal execution:

  • Dynamically controls the maximum number of concurrent leadership / replica reassignments

  • Ensures only one execution at a time
  • Enables graceful cancellation of ongoing executions

Integration with replication quotas (KIP-73)
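Capping concurrent reassignments can be sketched as simple batching (conceptual only; the real Executor tracks in-flight reassignments against the cluster rather than pre-computing batches):

```python
# Throttled execution sketch: split proposals into batches so that
# no more than `max_concurrent` reassignments run at once.
def execute(proposals, max_concurrent: int):
    batches = []
    for i in range(0, len(proposals), max_concurrent):
        batches.append(proposals[i:i + max_concurrent])
    return batches

moves = ["move-p0", "move-p1", "move-p2", "move-p3", "move-p4"]
print(execute(moves, max_concurrent=2))
# [['move-p0', 'move-p1'], ['move-p2', 'move-p3'], ['move-p4']]
```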

33

SLIDE 34

Anomaly Detector

[Diagram: the Anomaly Detector's goal-violation, metric-anomaly, and broker-failure finder(s) report to the Anomaly Notifier]

Identifies, notifies, and fixes (self-healing):
  • Violation of anomaly detection goals
  • Broker failures
  • Metric anomalies
  • Disk failures (JBOD)

34

Anomalies are classified along two axes: faulty vs. healthy cluster, and reactive vs. proactive mitigation.

SLIDE 35

Anomaly Detector: Goal Violations and Self-Healing


Checks for the violation of the anomaly detection goals:
  • Identifies fixable and unfixable goal violations
  • Self-healing triggers a cluster rebalance operation
  • Avoids false positives due to broker failure, upgrade, restart, or release certification

35


SLIDE 36

Anomaly Detector: Broker Failures


Concerned with whether brokers are responsive:

  • Ignores the internal state deterioration of brokers
  • Identifies fail-stop failures

36


SLIDE 37

Anomaly Detector: Broker Failures and Self-Healing


Checks for broker failures:
  • Enables a grace period to lower false positives – e.g. due to upgrade, restart, or release certification
  • Self-healing triggers a remove operation for failed brokers
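The grace period can be sketched as a threshold on how long a broker has been unresponsive (illustrative only; timestamps are injected for clarity rather than read from a clock):

```python
# Broker-failure detection sketch with a grace period: only brokers
# that stay unresponsive past the grace period are removed.
def brokers_to_remove(first_seen_down: dict[int, float],
                      now: float, grace_period_s: float) -> list[int]:
    return sorted(b for b, t in first_seen_down.items()
                  if now - t >= grace_period_s)

down_since = {3: 100.0, 7: 950.0}   # broker id -> time first seen down
print(brokers_to_remove(down_since, now=1000.0, grace_period_s=600.0))
# [3]  (broker 7 is still within its grace period)
```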

37


SLIDE 38

Anomaly Detector: Reactive Mitigation


  • Requires immediate attention of affected services
  • Poor user experience due to frequent service interruptions
  • Cluster maintenance becomes costly

Server & network failures grow with the size of clusters, the volume of user traffic, and hardware degradation.

38

SLIDE 39

Anomaly Detector: Metric Anomaly


Checks for abnormal changes in broker metrics – e.g. a recent spike in log flush time:
  • Self-healing triggers a demote operation for slow brokers

39


SLIDE 40

Anomaly Detector: Metric Anomaly

Anomaly Detector

Anomaly Notifier Goal Violation Metric Anomaly Broker Failure Finder(s)

Compares current and historical metrics to detect slow brokers:
  • The comparison in the default finder is based on the percentile rank of the latest metric value
  • Metrics of interest are configurable – e.g. local time of produce / consume / follower fetch, log flush time
  • Supports multiple active finders
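Percentile rank against history can be sketched as follows (illustrative; the default finder's exact statistics, windows, and thresholds differ, and the 95th-percentile cutoff here is an assumption):

```python
# Percentile-rank sketch for slow-broker detection: how extreme is
# the latest metric value relative to the broker's history?
def percentile_rank(history: list[float], latest: float) -> float:
    """Fraction of historical values at or below the latest value."""
    return sum(1 for v in history if v <= latest) / len(history)

flush_ms_history = [4.0, 5.0, 4.5, 5.5, 4.8, 5.2, 4.9, 5.1, 4.7, 5.3]
latest = 9.0                       # recent spike in log flush time
rank = percentile_rank(flush_ms_history, latest)
print(rank)                        # 1.0 -> above every historical value
print(rank >= 0.95)                # True: flag the broker as slow
```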

40


SLIDE 41

Anomaly Detector: Proactive Mitigation


In-place fixing of slow / faulty brokers is non-trivial:
  • The root cause could be a hardware issue (e.g. a misbehaving disk), a software glitch, or a traffic shift
  • Hence, the mitigation strategies are agnostic to the particular issue with the broker

41

“Three-toed-sloth” (CC BY 2.5): https://en.wikipedia.org/wiki/File:MC_Drei-Finger-Faultier.jpg

SLIDE 42

REST API

Supports sync and async endpoints, plus a GUI and multi-cluster management.

GET:
  • Cluster Load
  • Partition Load
  • Proposals
  • Kafka Cluster State
  • Cruise Control State
  • User Tasks

POST:
  • Add / Remove / Demote Broker
  • Rebalance Cluster
  • Fix Offline Replicas (JBOD)
  • Stop Ongoing Execution
  • Pause / Resume Sampling
  • Admin – ongoing behavior changes
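For illustration, requests can be composed against these endpoints as below; the `load` and `remove_broker` paths and the `brokerid` / `dryrun` parameters match the open-source project's documented API, while the host and port are placeholders:

```python
# Sketch: building Cruise Control REST request URLs. Verify endpoint
# names and parameters against your deployed version.
from urllib.parse import urlencode

BASE = "http://cruise-control.example.com:9090/kafkacruisecontrol"

def endpoint(name: str, **params) -> str:
    qs = urlencode(params)
    return f"{BASE}/{name}" + (f"?{qs}" if qs else "")

# GET cluster load; POST a dry-run broker removal.
print(endpoint("load"))
print(endpoint("remove_broker", brokerid=3, dryrun="true"))
```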

42

SLIDE 43

Managing the Manager – Monitoring Cruise Control

Reported JMX metrics include:
  • Broker failure, goal violation, and metric anomaly rate
  • Cluster model and sampling performance
  • Stats on proposal generation
  • Started, stopped, and ongoing executions in different modes, and the status of balancing tasks

43

SLIDE 44

Evaluation: Remove Brokers and Rebalance

[Chart: all-topics bytes-in rate per broker – incoming data (bytes/s) vs. time (hours)]

44

SLIDE 45

Evaluation: Remove Brokers and Rebalance

[Chart: all-topics bytes-out rate per broker – outgoing data (bytes/s) vs. time (hours)]

45

SLIDE 46

Evaluation: Remove Brokers and Rebalance

[Chart: number of partitions per broker – partition count vs. time (hours)]

46

SLIDE 47

Summary

A system that provides effortless management of Kafka clusters

  • Admin Operations for Cluster Maintenance
  • Anomaly Detection with Self-Healing
  • Real-Time Monitoring of Kafka Clusters
  • Integration with Other Systems – e.g. Apache Helix

47

SLIDE 48

More…

  • Open source repository (https://github.com/linkedin/cruise-control)
  • UI (https://github.com/linkedin/cruise-control-ui)
  • Gitter room (https://gitter.im/kafka-cruise-control)

48

SLIDE 49

Rate today's session

Session page on conference website O’Reilly Events App
