APACHE PULSAR - THE NEXT GENERATION MESSAGING AND QUEUING KARTHIK - PowerPoint PPT Presentation

1 APACHE PULSAR - THE NEXT GENERATION MESSAGING AND QUEUING
KARTHIK RAMASAMY, SENIOR DIRECTOR OF ENGINEERING, SPLUNK (@KARTHIKZ)

2 Connected World

3 Ubiquity of Real-Time Data Streams & Events

4 EVENT/STREAM DATA PROCESSING
✦ Events are analyzed and processed as they arrive
✦ Decisions are timely, contextual and based on fresh data
✦ Decision latency is eliminated
✦ Data in motion
[Diagram labels: Ingest/Buffer, Analyze, Act]

5 EVENT/STREAM PROCESSING PATTERNS
MONITORING, MICROSERVICES, WORKFLOWS, ANALYTICS, MODEL INFERENCE

6 STREAM PROCESSING PATTERN
[Diagram labels: Data Ingestion, Data Processing, Messaging, Compute, Data Storage, Results Storage, Serving]

7 APACHE PULSAR
Flexible Messaging + Queueing System backed by durable log storage

Key Concepts

9 Core concepts: Tenants, namespaces, topics
[Diagram: an Apache Pulsar cluster organized into Tenants, Namespaces and Topics. Labels include Marketing, Sales, Security, Data Integration, Microservices, Analytics, Campaigns, Transactions, Data transformation, Visits, Conversions, Responses, Interactions, Log events, Signatures, Accesses]
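The tenant / namespace / topic hierarchy shows up directly in fully qualified topic names of the form persistent://tenant/namespace/topic. The sketch below is a hypothetical helper (TopicNameSketch is not part of the Pulsar client library) that splits such a name into its parts. The example name is illustrative only, and the sketch assumes the three-part form, not the older four-part form that also carries a cluster name.

```java
// Hypothetical helper for illustration; not part of the Pulsar client library.
final class TopicNameSketch {
    final String domain;    // e.g. "persistent" or "non-persistent"
    final String tenant;
    final String namespace;
    final String topic;

    TopicNameSketch(String domain, String tenant, String namespace, String topic) {
        this.domain = domain;
        this.tenant = tenant;
        this.namespace = namespace;
        this.topic = topic;
    }

    // Splits "domain://tenant/namespace/topic" into its four parts.
    static TopicNameSketch parse(String fullName) {
        int sep = fullName.indexOf("://");
        if (sep < 0) {
            throw new IllegalArgumentException("Missing domain in: " + fullName);
        }
        String[] parts = fullName.substring(sep + 3).split("/", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected tenant/namespace/topic in: " + fullName);
        }
        return new TopicNameSketch(fullName.substring(0, sep), parts[0], parts[1], parts[2]);
    }

    @Override
    public String toString() {
        return domain + "://" + tenant + "/" + namespace + "/" + topic;
    }

    public static void main(String[] args) {
        // Illustrative name; the tenant/namespace/topic pairing is an assumption.
        TopicNameSketch name = parse("persistent://marketing/campaigns/conversions");
        System.out.println(name.tenant + " / " + name.namespace + " / " + name.topic);
    }
}
```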

10 Topics
[Diagram: producers publish to a topic and consumers read from it over time]

11 Topic partitions
[Diagram: a topic split into partitions P0, P1 and P2, each with its own producers and consumers over time]
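One common way for a producer to spread messages over partitions such as P0, P1 and P2 is to hash the message key, so that messages with the same key always land on the same partition, and to fall back to round-robin for messages without a key. The class below is a minimal sketch of that idea; PartitionRouterSketch and its method names are hypothetical, and the hash function used by a real Pulsar client may differ.

```java
// Hypothetical router for illustration; not the Pulsar client's own routing code.
final class PartitionRouterSketch {
    private final int numPartitions;
    private int nextRoundRobin = 0;

    PartitionRouterSketch(int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        this.numPartitions = numPartitions;
    }

    // Keyed messages: stable hash of the key. Keyless messages: round-robin.
    int choosePartition(String key) {
        if (key == null) {
            int partition = nextRoundRobin;
            nextRoundRobin = (nextRoundRobin + 1) % numPartitions;
            return partition;
        }
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        PartitionRouterSketch router = new PartitionRouterSketch(3);
        System.out.println("keyed message goes to P" + router.choosePartition("user-42"));
        System.out.println("keyless message goes to P" + router.choosePartition(null));
    }
}
```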

12 Segments
[Diagram: each partition is a sequence of segments over time. P0: Segments 1-3; P1: Segments 1-4; P2: Segments 1-3]

Architecture

14 APACHE PULSAR
SERVING (Brokers, between Producer and Consumer)
✦ Brokers can be added independently
✦ Traffic can be shifted quickly across brokers
STORAGE (Bookies)
✦ Bookies can be added independently
✦ New bookies will ramp up traffic quickly

15 APACHE PULSAR - BROKER
✦ The broker is the only point of interaction for clients (producers and consumers)
✦ Brokers acquire ownership of groups of topics and "serve" them
✦ A broker has no durable state
✦ Brokers provide a service discovery mechanism so that clients connect to the right broker
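Because a broker holds no durable state, topic ownership can simply move to another broker when one fails, and clients find the new owner through the lookup step. The sketch below illustrates that idea under simplifying assumptions: BrokerLookupSketch is a hypothetical class, and ownership is picked by hashing the topic name, whereas a real Pulsar cluster assigns groups of topics to brokers with its own load-aware logic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical lookup table for illustration; not Pulsar's actual ownership logic.
final class BrokerLookupSketch {
    private final List<String> brokers;
    private final Map<String, String> ownerByTopic = new HashMap<>();

    BrokerLookupSketch(List<String> brokers) {
        this.brokers = new ArrayList<>(brokers);
    }

    // Service discovery: returns the broker that currently owns the topic,
    // assigning an owner on first lookup.
    String lookup(String topic) {
        if (brokers.isEmpty()) {
            throw new IllegalStateException("No brokers available");
        }
        return ownerByTopic.computeIfAbsent(
            topic, t -> brokers.get(Math.floorMod(t.hashCode(), brokers.size())));
    }

    // A failed broker loses its topics; the next lookup picks a surviving broker.
    // No data has to move, because brokers keep no durable state.
    void brokerFailed(String broker) {
        brokers.remove(broker);
        ownerByTopic.values().removeIf(broker::equals);
    }

    public static void main(String[] args) {
        BrokerLookupSketch cluster =
            new BrokerLookupSketch(List.of("broker-1", "broker-2", "broker-3"));
        String owner = cluster.lookup("persistent://marketing/campaigns/conversions");
        cluster.brokerFailed(owner);
        System.out.println("new owner: "
            + cluster.lookup("persistent://marketing/campaigns/conversions"));
    }
}
```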

16 APACHE PULSAR - BROKER

17 APACHE PULSAR - CONSISTENCY
[Diagram: Producer, Broker and three Bookies]

18 APACHE PULSAR - DURABILITY (NO DATA LOSS)
[Diagram: Producer, Broker and three Bookies; each Bookie fsyncs to its Journal]

19 APACHE PULSAR - ISOLATION

20 APACHE PULSAR - SEGMENT STORAGE
[Diagram: a log of numbered entries grouped into Segments 1-4; each segment label appears three times across the storage nodes]

21 APACHE PULSAR - RESILIENCY
[Diagram: same entries and segment layout as SEGMENT STORAGE]

22 APACHE PULSAR - SEAMLESS CLUSTER EXPANSION
[Diagram: same entries and Segments 1-4, plus new Segments X, Y and Z]
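The expansion slide can be read as follows: segments that already exist stay where they are, and a newly added bookie starts receiving newly opened segments, so no data is redistributed. The sketch below models that behaviour with a simple rotating placement. SegmentPlacementSketch, its method names and the rotation rule are assumptions made for illustration; real BookKeeper placement policies are more involved.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical placement model for illustration; not BookKeeper's placement policy.
final class SegmentPlacementSketch {
    private final List<String> bookies;
    private final int replicas;
    private final Map<Integer, List<String>> placement = new LinkedHashMap<>();
    private int nextSegmentId = 1;
    private int cursor = 0;

    SegmentPlacementSketch(List<String> initialBookies, int replicas) {
        if (replicas <= 0 || replicas > initialBookies.size()) {
            throw new IllegalArgumentException("replicas must be between 1 and the bookie count");
        }
        this.bookies = new ArrayList<>(initialBookies);
        this.replicas = replicas;
    }

    // New bookies only affect segments opened after they join.
    void addBookie(String name) {
        bookies.add(name);
    }

    // Opens a new segment on `replicas` consecutive bookies, rotating the start position.
    int openSegment() {
        List<String> chosen = new ArrayList<>();
        for (int i = 0; i < replicas; i++) {
            chosen.add(bookies.get((cursor + i) % bookies.size()));
        }
        cursor = (cursor + 1) % bookies.size();
        int id = nextSegmentId++;
        placement.put(id, Collections.unmodifiableList(chosen));
        return id;
    }

    List<String> bookiesFor(int segmentId) {
        return placement.get(segmentId);
    }

    public static void main(String[] args) {
        SegmentPlacementSketch log =
            new SegmentPlacementSketch(List.of("bookie-1", "bookie-2", "bookie-3"), 2);
        int first = log.openSegment();
        log.addBookie("bookie-4");
        for (int i = 0; i < 4; i++) {
            int id = log.openSegment();
            System.out.println("segment " + id + " on " + log.bookiesFor(id));
        }
        System.out.println("segment " + first + " still on " + log.bookiesFor(first));
    }
}
```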

23 APACHE PULSAR - TIERED STORAGE
[Diagram: the same numbered log entries; Segments 1-4, with Low Cost Storage shown as an additional tier]

24 Multi-tiered storage and serving
Tailing reads: served from in-memory cache (brokers)
Catch-up reads: served from persistent storage layer (warm storage)
Historical reads: served from cold storage
[Diagram labels: Partition, Processing (brokers), Broker, Warm Storage, Cold Storage]
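The three read paths can be summarized as a decision on how far behind the tail a reader is. The sketch below is a hypothetical illustration: ReadTierSketch, the Tier names and the two boundary parameters (the oldest entry still in the broker cache and the oldest entry still held on bookies) are assumptions, not Pulsar configuration settings.

```java
// Hypothetical decision function for illustration; not Pulsar's read path.
final class ReadTierSketch {
    enum Tier { BROKER_CACHE, BOOKIE_STORAGE, COLD_STORAGE }

    // oldestCachedEntry: first entry id still held in the broker's in-memory cache.
    // oldestBookieEntry: first entry id still held in the persistent storage layer.
    static Tier tierFor(long entryId, long oldestCachedEntry, long oldestBookieEntry) {
        if (entryId >= oldestCachedEntry) {
            return Tier.BROKER_CACHE;    // tailing read
        }
        if (entryId >= oldestBookieEntry) {
            return Tier.BOOKIE_STORAGE;  // catch-up read
        }
        return Tier.COLD_STORAGE;        // historical read
    }

    public static void main(String[] args) {
        long oldestCached = 60;  // illustrative boundaries
        long oldestOnBookies = 20;
        for (long entry : new long[] {63, 41, 3}) {
            System.out.println("entry " + entry + " served from "
                + tierFor(entry, oldestCached, oldestOnBookies));
        }
    }
}
```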

25 PARTITIONS VS SEGMENTS - WHY SHOULD YOU CARE?
Legacy Architectures
# Storage co-resident with processing
# Partition-centric
# Cumbersome to scale: data redistribution, performance impact
Apache Pulsar
# Storage decoupled from processing
# Partitions stored as segments
# Flexible, easy scalability
[Diagram: in legacy architectures each broker holds a whole partition (primary or copy) with Segments 1..n under Processing & Storage; in Apache Pulsar the brokers form the Processing layer and present a logical partition view, while Segments 1..n are spread across a separate Storage layer]

26 DEPLOYMENT IN K8S
[Diagram labels: LB, LB1, LB2, LB3; S, S1, S2, S3; Broker, Broker1, Broker2, Broker3; Segments 1..n]

27 PARTITIONS VS SEGMENTS - WHY SHOULD YOU CARE?
✦ In Kafka, partitions are assigned to brokers "permanently"
✦ A single partition is stored entirely on a single node
✦ Retention is limited by the storage capacity of a single node
✦ Failure recovery and capacity expansion require expensive "rebalancing"
✦ Rebalancing has a big impact on the system, affecting regular traffic

28 UNIFIED MESSAGING MODEL - STREAMING
Subscription A: Exclusive
[Diagram: Producers 1 and 2 publish M0-M4 to a Pulsar topic/partition; Consumer 1 receives M0-M4 through Subscription A; Consumer 2 is marked with an X]

29 UNIFIED MESSAGING MODEL - STREAMING
Subscription B: Failover
[Diagram: Producers 1 and 2 publish M0-M4 to a Pulsar topic/partition; Consumer 1 receives M0-M4 through Subscription B; Consumer 2 takes over in case of failure in Consumer 1]

30 UNIFIED MESSAGING MODEL - QUEUING
Subscription C: Shared
Traffic is equally distributed across consumers
[Diagram: Producers 1 and 2 publish M0-M4 to a Pulsar topic/partition; the messages are spread across Consumers 1, 2 and 3 through Subscription C]
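The difference between the Exclusive, Failover and Shared subscriptions comes down to which attached consumer receives the next message. The sketch below simulates that dispatch decision. SubscriptionSketch and its methods are hypothetical names for illustration, and the Shared case uses plain round-robin as a simple stand-in for "equally distributed".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical dispatcher for illustration; not the broker's implementation.
final class SubscriptionSketch {
    enum Mode { EXCLUSIVE, FAILOVER, SHARED }

    private final Mode mode;
    private final List<String> consumers = new ArrayList<>();
    private int next = 0;

    SubscriptionSketch(Mode mode) {
        this.mode = mode;
    }

    // Exclusive subscriptions reject a second consumer; the other modes accept it.
    boolean attach(String consumer) {
        if (mode == Mode.EXCLUSIVE && !consumers.isEmpty()) {
            return false;
        }
        consumers.add(consumer);
        return true;
    }

    // Models a consumer failure or disconnect.
    void detach(String consumer) {
        consumers.remove(consumer);
        next = 0;
    }

    // Returns the consumer that receives the next message.
    String dispatch() {
        if (consumers.isEmpty()) {
            throw new IllegalStateException("No consumers attached");
        }
        if (mode == Mode.SHARED) {
            String chosen = consumers.get(next);
            next = (next + 1) % consumers.size();
            return chosen;
        }
        // Exclusive and Failover: the first attached consumer is the active one.
        return consumers.get(0);
    }

    public static void main(String[] args) {
        SubscriptionSketch failover = new SubscriptionSketch(Mode.FAILOVER);
        failover.attach("consumer-1");
        failover.attach("consumer-2");
        System.out.println("M0 goes to " + failover.dispatch());
        failover.detach("consumer-1"); // consumer-1 fails
        System.out.println("M1 goes to " + failover.dispatch());
    }
}
```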

31 DISASTER RECOVERY
Simple configuration to add/remove regions
Asynchronous (default) and synchronous replication
Integrated in the broker message flow
[Diagram: Topic (T1) replicated across Data Centers A, B and C, with Producers P1, P2 and P3 and Consumers on Subscription (S1)]

32 Asynchronous replication example
• Two independent clusters, primary and standby
• Configured tenants and namespaces replicate to standby
• Data published to primary is asynchronously replicated to standby
• Producers and consumers restarted in second datacenter upon primary failure
[Diagram: Datacenter 1 with ZooKeeper, Pulsar Cluster (primary), active Producers and Consumers; Datacenter 2 with ZooKeeper, Pulsar Cluster (standby), standby Producers and Consumers; Pulsar replication between the two clusters]

33 Synchronous replication example
• Each topic owned by one broker at a time, i.e. in one datacenter
• ZooKeeper cluster spread across multiple locations
• Broker commits writes to bookies in both datacenters
• In event of datacenter failure, broker in surviving datacenter assumes ownership of topic
[Diagram: one Pulsar Cluster and one ZooKeeper spanning Datacenter 1 and Datacenter 2, with Producers and Consumers in each]

34 Replicated subscriptions
[Diagram: Pulsar Cluster 1 in Datacenter 1 and Pulsar Cluster 2 in Datacenter 2, each with Producers, Subscriptions and Consumers; Markers travel over Pulsar Replication]

35 MULTITENANCY - CLOUD NATIVE
✦ Authentication
✦ Authorization
✦ Software isolation
  ๏ Storage quotas, flow control, back pressure, rate limiting
✦ Hardware isolation
  ๏ Constrain some tenants on a subset of brokers/bookies
[Diagram: one Apache Pulsar Cluster shared by several tenants, with storage figures of 5 TB, 7 TB and 10 TB. Labels include Account History, ETL, User Clustering, Customer Authentication, Microservice, Risk Classification, Campaigns, Marketing, Budgeted Spend, Demographic Classification, Location Resolution, Topic-1, Topic-2]

36 PULSAR CLIENTS
Python, Java, Go, C++, C
[Diagram: clients connecting to an Apache Pulsar Cluster]

37 PULSAR PRODUCER

    import org.apache.pulsar.client.api.*;

    // Legacy (Pulsar 1.x) client API; 2.x clients use PulsarClient.builder() instead
    PulsarClient client = PulsarClient.create(
        "http://broker.usw.example.com:8080");

    Producer producer = client.createProducer(
        "persistent://my-property/us-west/my-namespace/my-topic");

    // handles retries in case of failure
    producer.send("my-message".getBytes());

    // Async version:
    producer.sendAsync("my-message".getBytes()).thenRun(() -> {
        // Message was persisted
    });

38 PULSAR CONSUMER

    import org.apache.pulsar.client.api.*;

    PulsarClient client = PulsarClient.create(
        "http://broker.usw.example.com:8080");

    Consumer consumer = client.subscribe(
        "persistent://my-property/us-west/my-namespace/my-topic",
        "my-subscription-name");

    while (true) {
        // Wait for a message
        Message msg = consumer.receive();

        // getData() returns a byte[], so decode it before printing
        System.out.println("Received message: " + new String(msg.getData()));

        // Acknowledge the message so that it can be deleted by the broker
        consumer.acknowledge(msg);
    }

39 SCHEMA REGISTRY
✦ Provides type safety to applications built on top of Pulsar
✦ Two approaches
  ✦ Client side - type safety enforcement is left up to the application
  ✦ Server side - the system enforces type safety and ensures that producers and consumers remain in sync
✦ The schema registry enables clients to upload data schemas on a per-topic basis
✦ Schemas dictate which data types are recognized as valid for that topic
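The server-side approach can be pictured as a per-topic table of schemas that every connecting producer or consumer is checked against. The sketch below is a deliberately small model of that check: SchemaRegistrySketch is a hypothetical class, schemas are represented as plain strings, and the only rule is exact equality, whereas a real schema registry also tracks versions and compatibility rules.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry for illustration; not Pulsar's schema registry API.
final class SchemaRegistrySketch {
    private final Map<String, String> schemaByTopic = new HashMap<>();

    // The first client to connect uploads the schema for the topic.
    // Later clients are accepted only if they present the same schema.
    boolean register(String topic, String schemaDefinition) {
        String existing = schemaByTopic.putIfAbsent(topic, schemaDefinition);
        return existing == null || existing.equals(schemaDefinition);
    }

    public static void main(String[] args) {
        SchemaRegistrySketch registry = new SchemaRegistrySketch();
        // Topic name and schema strings are illustrative placeholders.
        System.out.println(registry.register("my-topic", "{\"type\":\"string\"}"));
        System.out.println(registry.register("my-topic", "{\"type\":\"int\"}"));
    }
}
```

In this model the second call is refused because the topic already has a different schema, which is the kind of mismatch the server-side approach is meant to catch before any message is exchanged.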
