Operating Multi-Tenant Kafka Services for Developers

SLIDE 1

Operating Multi-Tenant Kafka Services for Developers

Ali Hamidi - Heroku Data

Data Council SF 2019

SLIDE 2

Agenda

  • Intro
  • Motivation
  • Single Tenant Dedicated
  • Multi-tenancy
  • Configuration & Tuning
  • Testing
  • Automation
  • Limitations

Data Council SF 2019 - Heroku Data

SLIDE 3

Intro

I am… Ali Hamidi, an engineer on the Heroku Data team at Salesforce.
Heroku is… a cloud platform that lets companies build, deliver, monitor and scale apps.
Heroku Data is… the team that provides secure, scalable data services on the Heroku Platform.

SLIDE 7

Apache Kafka

  • Distributed Streaming Platform
  • Publish/Subscribe (=> Produce/Consume)
  • Durable message store (commit log)
  • Highly available

SLIDE 10

Apache Kafka on Heroku

  • Fully Managed Service
  • Opinionated
  • Configured for best practices for most users*

SLIDE 13

Use Cases

  • Decompose a monolithic app
  • Process high volume, real-time data streams
  • Power a real-time, event-driven architecture

SLIDE 14

Decompose a monolithic app

SHIFT Commerce

SLIDE 15

Quoine

  • QUOINE is a leading global fintech company that provides trading, exchange, and next generation financial services powered by blockchain technology
  • Consume real-time cryptocurrency pricing data from individual markets and exchanges

SLIDE 16

Caesars Entertainment

  • Ingest, aggregate, and process customer data in real time to provide the best customer experience

  • Real-time, event-driven architecture

SLIDE 17

The Motivation

SLIDE 18

Why Multi-tenant Kafka?

  • More accessible
  • Additional use cases
    • Development
    • Testing
    • Low volume production

SLIDE 20

Single Tenant Dedicated

SLIDE 23

Multi-tenancy

SLIDE 25

Multi-tenancy

  • Resource isolation
    • Security
    • Performance
    • Safety
  • Parity
    • Feature
    • Behaviour
  • Compatibility
  • Costs
    • Resources
    • Operational

SLIDE 27

Security

SLIDE 28

A tenant should not be able to access another tenant’s data

SLIDE 33

Security

  • Access Control Lists (ACLs)
    • User A can carry out action B on resource C
  • Namespacing
    • wabash-58779.events
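
The two mechanisms can be sketched together: every tenant resource carries the tenant's name as a prefix, and each ACL rule says that a user may carry out an action on resources inside one namespace. The snippet below is a toy in-memory model and not Kafka's authorizer API; the `Acl` class, the helper names, and the action strings are hypothetical, and only the `wabash-58779.events` topic name comes from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Acl:
    """One rule: `user` may carry out `action` on resources under `prefix`."""
    user: str
    action: str  # illustrative values: "read", "write"
    prefix: str  # tenant namespace including the trailing dot


def namespaced(tenant: str, topic: str) -> str:
    """Build the tenant-scoped topic name, e.g. wabash-58779.events."""
    return f"{tenant}.{topic}"


def is_allowed(acls, user: str, action: str, resource: str) -> bool:
    """Deny by default; allow only when some rule matches all three parts."""
    return any(
        rule.user == user
        and rule.action == action
        and resource.startswith(rule.prefix)
        for rule in acls
    )


acls = {
    Acl("wabash-58779", "read", "wabash-58779."),
    Acl("wabash-58779", "write", "wabash-58779."),
}
topic = namespaced("wabash-58779", "events")
print(is_allowed(acls, "wabash-58779", "write", topic))
print(is_allowed(acls, "wabash-58779", "read", "other-12345.events"))
```

Keeping the trailing dot in the prefix matters in this sketch: without it, a tenant named `wabash-58779` would also match resources belonging to a tenant named `wabash-587790`.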

SLIDE 34

Performance

SLIDE 35

A tenant should not adversely affect another tenant’s performance

SLIDE 36

Performance

  • Quotas
    • Produce
    • Consume

SLIDE 37

Safety

SLIDE 38

A tenant should not jeopardise the stability of the cluster

SLIDE 39

Safety

  • Limits
    • Topics
    • Partitions
    • Consumer Groups
    • Storage
    • Throughput

SLIDE 40

Capacity = Message Throughput * Retention * Replication
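
As a worked example of the formula, the sketch below multiplies the three factors for one hypothetical tenant. The 24-hour retention and replication factor of 3 match the guard-rail values shown later in the deck; the 10 KiB/s throughput is an assumed figure chosen only for illustration.

```python
GIB = 1024 ** 3


def required_capacity_bytes(throughput_bytes_per_sec, retention_seconds, replication_factor):
    """Capacity = Message Throughput * Retention * Replication."""
    return throughput_bytes_per_sec * retention_seconds * replication_factor


one_day = 24 * 60 * 60  # retention in seconds
capacity = required_capacity_bytes(10 * 1024, one_day, 3)  # assumed 10 KiB/s
print(f"{capacity / GIB:.2f} GiB")  # prints "2.47 GiB"
```

Even a modest sustained write rate reserves a few GiB once retention and replication are applied, which is why storage appears among the per-tenant limits.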

SLIDE 43

Safety

  • Limits
    • Topics
    • Partitions
    • Consumer Groups
    • Storage Capacity
    • Throughput
  • Monitoring
  • Limit enforcement!
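
Monitoring and enforcement can be pictured as a periodic comparison of each tenant's measured usage against its plan limits. The following is a minimal sketch under that assumption; the limit names and numbers are hypothetical and do not reflect Heroku's actual plans or tooling.

```python
def find_violations(usage: dict, limits: dict) -> list:
    """Return the names of every limit the tenant currently exceeds."""
    return sorted(
        name for name, limit in limits.items() if usage.get(name, 0) > limit
    )


# Hypothetical plan limits for one tenant.
limits = {
    "topics": 40,
    "partitions": 240,
    "consumer_groups": 60,
    "storage_bytes": 4 * 1024 ** 3,
    "produce_bytes_per_sec": 64 * 1024,
}

# Hypothetical usage sample collected by monitoring.
usage = {
    "topics": 12,
    "partitions": 300,
    "consumer_groups": 3,
    "storage_bytes": 5 * 1024 ** 3,
    "produce_bytes_per_sec": 1024,
}

print(find_violations(usage, limits))  # ['partitions', 'storage_bytes']
```

What happens after a violation is detected (rejecting the request, throttling, or notifying the customer) is a policy decision that the slides do not spell out.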

SLIDE 45

Parity

SLIDE 46

For the service to be useful, it needs to behave like a normal cluster

SLIDE 48

Parity

  • Access to a standard cluster
  • ...but with some limitations

SLIDE 50

Compatibility

SLIDE 51

The service needs to support standard clients

No vendor lock-in

SLIDE 52

Compatibility

  • Open Source Apache Kafka
  • Not a fork
  • No custom code required
  • Use standard clients

SLIDE 54

Costs

SLIDE 55

The service needs to be financially feasible

SLIDE 56

Resource Costs

  • Packing Density
  • Utilization

SLIDE 57

Resource Costs

  • Cluster size?
  • No over-provisioning
  • Seamless upgrading
  • Can’t move tenants (can’t migrate message offsets)

SLIDE 58

Operational Costs

  • Minimal operational burden
  • Minimize impact/blast radius

SLIDE 59

Operational Costs

  • Safe defaults
  • Clusters similar to our dedicated offering
  • Automation (kind of our thing)
  • Testing (lots)

SLIDE 60

Configuration & Tuning

SLIDE 61

Configuration & Tuning

  • Partitions
  • Quotas
  • Topics & Consumer Groups
  • Guard Rails

SLIDE 62

Partitions

  • Lots of partitions
    • 48,000
  • Max file descriptors
    • 500,000
  • Max mmap count
    • 500,000
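
One way to see why the mmap limit has to be raised: the Kafka documentation notes that each log segment needs an index file and a time index file, and each of those is memory-mapped, so a segment uses two map areas. The sketch below estimates map areas from a partition count. The segments-per-partition figure is an assumption, and the slide does not say whether 48,000 is a per-broker or per-cluster number; the sketch treats it as the partitions hosted on a single broker.

```python
MAP_AREAS_PER_SEGMENT = 2  # one .index and one .timeindex file per log segment
LINUX_DEFAULT_MAX_MAP_COUNT = 65530  # default vm.max_map_count on Linux


def estimated_map_areas(partitions: int, segments_per_partition: int) -> int:
    """Rough count of memory-mapped index files held open by one broker."""
    return partitions * segments_per_partition * MAP_AREAS_PER_SEGMENT


def fits(partitions: int, segments_per_partition: int, max_map_count: int) -> bool:
    return estimated_map_areas(partitions, segments_per_partition) <= max_map_count


# Even a single segment per partition exceeds the Linux default.
print(fits(48_000, 1, LINUX_DEFAULT_MAX_MAP_COUNT))  # False
# With the raised limit from the slide, about five segments per partition fit.
print(fits(48_000, 5, 500_000))  # True
print(fits(48_000, 6, 500_000))  # False
```

File descriptors follow a similar pattern, since the broker also keeps log segment files and client connections open, which is why both limits are raised together.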

SLIDE 63

Quotas

  • Per Broker!
  • Counterintuitive enforcement
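
Kafka byte-rate quotas are defined per broker, and each broker throttles a client independently by delaying it rather than returning an error. A client whose partition leaders are spread over several brokers can therefore move more data in aggregate than the configured figure suggests. The sketch below shows that arithmetic; the function names are hypothetical, and the even spread of load across brokers is an assumption.

```python
def aggregate_ceiling(per_broker_quota: int, brokers_serving_client: int) -> int:
    """Upper bound on a client's total bytes/sec when every broker
    enforces the same quota on its own."""
    return per_broker_quota * brokers_serving_client


def per_broker_quota_for(target_total: int, brokers_serving_client: int) -> int:
    """Per-broker quota that keeps an evenly spread client at or under
    a target aggregate rate (integer division rounds down)."""
    return target_total // brokers_serving_client


# A 1 MiB/s per-broker quota on a client talking to 8 brokers.
print(aggregate_ceiling(1024 * 1024, 8))  # 8388608 bytes/sec in aggregate
# To hold the same client near 1 MiB/s overall, each broker gets 128 KiB/s.
print(per_broker_quota_for(1024 * 1024, 8))  # 131072
```

The reverse case is just as surprising for users: a client whose traffic lands on a single broker is throttled at the per-broker figure even when the rest of the cluster is idle.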

SLIDE 64

Topics & Consumer Groups

  • Explicit Topic creation
  • Explicit Consumer Group creation

SLIDE 66

Guard Rails

  • Limit potential bad usage
  • “Customers don’t make mistakes, we make bad tools”

SLIDE 74

# Heroku Data Control Plane
min_retention_time = 24.hours
max_retention_time = 7.days
default_replication_factor = 3
min_replication_factor = 3
max_replication_factor = 3

# Kafka
num.partitions=8
replication=3
min.insync.replicas=2
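
A guard rail of this kind can be sketched as a validation step that runs before a topic request ever reaches Kafka. The bounds below are the control-plane values from the slide; the function name is hypothetical, and rejecting out-of-range requests (rather than clamping them) is an assumption, since the slides do not say which the control plane does.

```python
from datetime import timedelta

# Bounds taken from the control-plane settings on the slide.
MIN_RETENTION = timedelta(hours=24)
MAX_RETENTION = timedelta(days=7)
DEFAULT_REPLICATION_FACTOR = 3
MIN_REPLICATION_FACTOR = 3
MAX_REPLICATION_FACTOR = 3


def validate_topic_request(retention, replication_factor=DEFAULT_REPLICATION_FACTOR):
    """Return a list of guard-rail violations; an empty list means accepted."""
    errors = []
    if not MIN_RETENTION <= retention <= MAX_RETENTION:
        errors.append("retention must be between 24 hours and 7 days")
    if not MIN_REPLICATION_FACTOR <= replication_factor <= MAX_REPLICATION_FACTOR:
        errors.append("replication factor must be 3")
    return errors


print(validate_topic_request(timedelta(days=3)))  # []
print(validate_topic_request(timedelta(hours=1), replication_factor=1))
```

Pinning the minimum and maximum replication factor to the same value removes the choice entirely, which pairs with `min.insync.replicas=2` so that a topic tolerates the loss of one replica while still accepting fully acknowledged writes.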

SLIDE 75

Monitoring

  • Custom tool
  • Agentless
  • JMX is great!
  • Jolokia + JMX is great(er)!
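
Jolokia exposes JMX MBeans over HTTP with JSON payloads, so a monitoring tool can poll broker metrics without speaking the JMX remote protocol. The sketch below only builds the body of a Jolokia `read` request and sends nothing. The choice of MBean and attribute is illustrative, and the endpoint mentioned in the comment is an assumption about how an agent might be deployed, not a detail from the talk.

```python
import json


def jolokia_read_request(mbean: str, attribute: str) -> str:
    """Build the JSON body for a Jolokia 'read' operation."""
    return json.dumps({"type": "read", "mbean": mbean, "attribute": attribute})


# Broker-wide inbound byte rate, a standard Kafka broker metric.
body = jolokia_read_request(
    "kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec",
    "OneMinuteRate",
)
print(body)
# The body would be POSTed to the broker's Jolokia endpoint, for example
# http://<broker-host>:8778/jolokia/ (8778 is the JVM agent's default port).
```

The response is a JSON document whose `value` field holds the attribute, which is easier to consume from a custom tool than raw JMX.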

SLIDE 76

Testing

  • How many partitions?
  • What kind of throughput?
  • Common operations?
  • Failure scenarios?
  • Max packing density?
  • Bugs?

SLIDE 77

Testing - Automation!

  • Internal tool
  • Leverages internal Heroku Data API
  • Creates large number of “tenants”
  • Creates topics and consumer groups for each tenant
  • Spins up pool of producers and consumers for each
  • Simulates different types of users
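
The shape of such a test harness can be sketched as a plan generator: create many simulated tenants, give each its own topics and consumer group, and assign each a usage profile. This is not the internal tool described above. Every name, profile, and rate below is hypothetical, and a real harness would go on to create the resources through an API and start producer and consumer pools.

```python
import random
from dataclasses import dataclass

# Hypothetical user types, expressed as messages per second per producer.
PROFILES = {"idle": 0, "steady": 10, "bursty": 100}


@dataclass
class SimulatedTenant:
    name: str
    topics: list
    consumer_groups: list
    profile: str


def build_tenants(count: int, topics_per_tenant: int, seed: int = 0) -> list:
    """Generate a reproducible set of simulated tenants."""
    rng = random.Random(seed)  # seeded so test runs are repeatable
    tenants = []
    for i in range(count):
        name = f"tenant-{i:05d}"
        tenants.append(
            SimulatedTenant(
                name=name,
                topics=[f"{name}.topic-{t}" for t in range(topics_per_tenant)],
                consumer_groups=[f"{name}.group-0"],
                profile=rng.choice(sorted(PROFILES)),
            )
        )
    return tenants


tenants = build_tenants(count=500, topics_per_tenant=4)
print(len(tenants), sum(len(t.topics) for t in tenants))  # 500 2000
```

Seeding the random generator keeps the mix of user types identical between runs, which makes it easier to compare results across cluster configurations.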

SLIDE 78

Bugs :(

  • KAFKA-4725
  • Contributed PR
  • Fixed in 0.10.1.1

SLIDE 79

Automation

SLIDE 81

Heroku Data Control Plane

  • Kafka does a lot...but it doesn’t do everything!

SLIDE 82

Automate everything*

SLIDE 84

Automation

  • Cluster resizing
  • Node replacement
  • Storage expansion
  • Version upgrades
  • Restarts*
  • ...

SLIDE 88

Limitations

  • No AdminAPI
  • Explicit Topic creation
  • Explicit Consumer Group creation
  • No Zookeeper access

SLIDE 89

Thank you!

SLIDE 90

HEROKU.COM Ali Hamidi @ahamidi
