SLIDE 1

Index-Free Log Analytics with Kafka

Kresten Krab Thorup, Humio CTO

Log Everything, Answer Anything, In Real-Time.

SLIDE 2

Log Analytics Wish List

  • Record everything - TBs of data per day
  • Interactive/ad-hoc search on historic data - 100s of TBs
  • Generate metrics and alerts from the logs in real-time
  • Can be installed on-premises (privacy / security)
  • Affordable - TCO (hardware, license, operations)
SLIDE 3

Humio

Data Driven SecOps

[Diagram: 30k PCs, Bro network sensors, 6 ADs, and 2k servers feeding CEP, Log Store, Alerts/Dashboards, and Incident Response - roughly 1M events/sec, 20TB/day]

SLIDE 4

Put Logs in an Index?

[Diagram: with an index, both DATA and INDEX grow as volume goes from low to high]

SLIDE 5

Index-Free

[Diagram: only DATA is stored, at low and high volume - no index]

SLIDE 6

Index-Free

[Diagram: the data is split into multiple DATA segments as volume grows from low to high]

SLIDE 7

Index-Free

[Diagram: at high volume, data is simply stored as a sequence of DATA segments - no index]

SLIDE 8

Index-Free

[Diagram: the DATA segments ordered along a time axis form the only “index” we want - a TIME “INDEX”]

SLIDE 9

Index-Free

[Diagram: high-volume data flows through Stream Queries feeding ALERTS & DASHBOARDS, and lands in DATA segments served by Ad-Hoc Queries]

SLIDE 10

[Diagram: a Query such as /error/i | count() compiles to a State Machine; fed from the live stream it reports count: 473, fed from the Event Store it reports count: 243,565]
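
A minimal Java sketch of that idea (illustrative only, not Humio's engine): the query compiles to a filter stage feeding a count stage, and the same state machine can consume either the live stream or events replayed from the store.

    import java.util.Map;
    import java.util.regex.Pattern;

    // Query "/error/i | count()" as an incremental state machine.
    final class CountQuery {
        private final Pattern filter = Pattern.compile("error", Pattern.CASE_INSENSITIVE);
        private long count = 0;

        // Feed one event through the pipeline; works for live and replayed events.
        void onEvent(long timestamp, Map<String, String> fields) {
            String raw = fields.getOrDefault("@rawstring", ""); // raw log line (field name assumed)
            if (filter.matcher(raw).find()) {
                count++; // count() stage keeps running state
            }
        }

        long currentCount() { return count; } // snapshot for alerts/dashboards
    }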

SLIDE 11

Log Store Design

  • Build minimal index and compress data: store an order of magnitude more events
  • Fast “grep” for filtering events: filtering and time/metadata selection reduces the problem space

SLIDE 12

Event Store

[Diagram: the event store is a sequence of 10GB segments, each described by (start-time, end-time, metadata)]

SLIDE 13

Event Store

[Diagram: 1GB segments, each described by (start-time, end-time, metadata), compress to ~40MB files; Bloom filters add ~4% overhead. One month at 30GB/day of ingest stores ~90GB of data with <1MB of index; one month at 1TB/day stores ~4TB of data, still with <1MB of index]
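
A sketch of the per-segment metadata, assuming Guava's BloomFilter (class and sizes here are illustrative): a time range plus a small Bloom filter is all a query needs to decide whether a segment must be scanned at all.

    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;
    import java.nio.charset.StandardCharsets;

    // Metadata kept per compressed segment: a time range and a Bloom filter
    // over the terms occurring in the segment (roughly 4% storage overhead).
    final class SegmentMeta {
        final long startTime, endTime;          // event time range covered
        final BloomFilter<CharSequence> terms;  // terms seen in this segment

        SegmentMeta(long startTime, long endTime, long expectedTerms) {
            this.startTime = startTime;
            this.endTime = endTime;
            this.terms = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), expectedTerms, 0.01);
        }

        void addTerm(String term) { terms.put(term); }

        // Scan the segment only if its time range overlaps the query window
        // and the Bloom filter cannot rule the search term out.
        boolean mightMatch(long queryStart, long queryEnd, String term) {
            return startTime <= queryEnd && endTime >= queryStart
                && terms.mightContain(term);
        }
    }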

SLIDE 14

Query

[Diagram: 1GB segments laid out along the time axis, one row per datasource: #dc1,#web / #dc1,#app / #dc2,#web]

SLIDE 15

Query

[Diagram: the same layout; a query selects one datasource row (e.g. #dc1,#web) and a time window, leaving ~10GB of segments to scan]
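
In code, the selection step might look like this (hypothetical types; the point is that tag and time filtering happen on metadata, before any data is decompressed):

    import java.util.List;
    import java.util.Map;

    // A stored segment: its datasource tags, time range, and file location.
    record Segment(Map<String, String> tags, long startTime, long endTime, String path) {}

    final class SegmentSelector {
        // Keep only segments whose tags match the query and whose time range
        // overlaps the query window; only these are brute-force scanned.
        static List<Segment> select(List<Segment> all, Map<String, String> queryTags,
                                    long from, long to) {
            return all.stream()
                .filter(s -> s.tags().entrySet().containsAll(queryTags.entrySet()))
                .filter(s -> s.startTime() <= to && s.endTime() >= from)
                .toList();
        }
    }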

SLIDE 16

Real-time Processing + Brute-Force Search

Real-time processing:
  • “Materialized views” for dashboards/alerts
  • Processed when data is in-memory anyway
  • Fast response times for “known” queries

Brute-force search:
  • Shift CPU load to query time
  • Data compression
  • Filtering, not Indexing
  • Requires “full stack” ownership to perform

#IndexFreeLogging

SLIDE 17

Humio Ingest Data Flow

[Pipeline: Agent → API/Ingest → Digest → Storage]

  • Agent: send data
  • API/Ingest: HTTP/TCP API, authenticate, field extraction
  • Digest: live queries (feeding alerts/dashboards), write segment files
  • Storage: replication

SLIDE 18

Use Kafka for the ‘hard parts’

  • Coordination
  • Commit-log / ingest buffer
  • No KSQL
SLIDE 19

Kafka 101

  • Kafka is a reliable distributed log/queue system
  • A Kafka queue consists of a number of partitions
  • Messages within a partition are sequenced
  • Partitions are replicated for durability
  • Use ‘partition consumers’ to parallelise work
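
A minimal producer illustrating the last two points (topic name is made up; with the default partitioner, records sharing a key land in the same partition and so stay ordered relative to each other):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class IngestProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Same key => same partition => sequenced relative to each other.
                producer.send(new ProducerRecord<>("ingest", "datasource-42", "log line 1"));
                producer.send(new ProducerRecord<>("ingest", "datasource-42", "log line 2"));
            }
        }
    }
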
SLIDE 20

Kafka 101

[Diagram: a producer writes to a topic with partitions #1-#3, choosing partition = hash(key); consumers read the partitions in parallel]

SLIDE 21

Coordination ‘global data’

  • Zookeeper-like system in-process
  • All cluster nodes keep the entire K/V set in memory
  • Make decisions locally/fast without crossing a network boundary
  • Allows in-memory indexes of metadata
SLIDE 22

Coordination ‘global data’

  • Coordinated via single-partition Kafka queue
  • Ops-based CRDT-style event sourcing
  • Bootstrap from snapshot from any node
  • Kafka config: low latency
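
A sketch of how such a coordination log can be consumed (not Humio's code; the 'SET'/'DEL' op encoding is made up, and props must configure bootstrap servers and string deserializers):

    import java.time.Duration;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import java.util.concurrent.ConcurrentHashMap;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    // Every node tails the single-partition "global" topic and applies each
    // op to an in-memory K/V map, so reads never cross the network.
    public class GlobalData {
        private final Map<String, String> kv = new ConcurrentHashMap<>();

        public void run(Properties props) {
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                TopicPartition global = new TopicPartition("global", 0);
                consumer.assign(List.of(global));
                consumer.seekToBeginning(List.of(global)); // or start from a snapshot offset
                while (true) {
                    for (ConsumerRecord<String, String> rec :
                             consumer.poll(Duration.ofMillis(100))) {
                        apply(rec.key(), rec.value());
                    }
                }
            }
        }

        // Ops are designed to commute/converge (CRDT-style), so replaying
        // them in partition order yields the same state on every node.
        private void apply(String op, String payload) {
            if (op.equals("SET")) {
                String[] parts = payload.split("=", 2);
                kv.put(parts[0], parts[1]);
            } else if (op.equals("DEL")) {
                kv.remove(payload);
            }
        }
    }
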
SLIDE 23

Durability

  • Don’t lose people’s data
  • Control and manage data life expectancy
  • Store, Replicate, Archive, Multi-tier Data storage
SLIDE 24

Durability

[Pipeline: Agent → Ingest → Digest → Storage, with Kafka as the commit log between Ingest and Digest]

  • Agent: send data
  • Ingest: authenticate, field extraction
  • Digest: streaming queries, write segment files
  • Storage: replication, queries on ‘old data’

SLIDE 25

Durability

[Diagram: Agent → API/Ingest → Kafka; the HTTP 200 response is sent only once Kafka has ACK’ed the store]
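
A sketch of that contract (the HTTP wiring is hypothetical; the producer calls are standard kafka-clients API, and props must configure matching serializers):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    // The ingest API returns HTTP 200 only after Kafka has acknowledged
    // the write on all in-sync replicas.
    public class IngestHandler {
        private final KafkaProducer<String, byte[]> producer;

        IngestHandler(Properties props) {
            props.put("acks", "all");              // wait for in-sync replicas
            props.put("enable.idempotence", "true");
            this.producer = new KafkaProducer<>(props);
        }

        // Returns the HTTP status for an ingest request.
        int handleIngest(String datasourceKey, byte[] payload) {
            try {
                // send() is async; get() blocks until Kafka ACKs (or fails).
                producer.send(new ProducerRecord<>("ingest", datasourceKey, payload)).get();
                return 200; // data is durably in Kafka
            } catch (Exception e) {
                return 503; // not stored; client should retry
            }
        }
    }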

SLIDE 26

Durability

[Diagram: Digest consumes from Kafka into a WIP (work-in-progress) buffer and writes segment files; each segment file records the last consumed sequence number, read back from disk after a crash. Kafka retention must be long enough to deal with a crash]
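
A sketch of the recovery step (the segment-reading helper is hypothetical; seek() is standard kafka-clients API):

    import java.util.List;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    // Each finished segment file records the last Kafka offset it contains,
    // so after a crash digest seeks past it and re-consumes only what was
    // still in the WIP buffer.
    final class DigestRecovery {
        static void resume(KafkaConsumer<String, byte[]> consumer, TopicPartition tp) {
            long lastPersisted = lastOffsetFromSegmentsOnDisk(tp);
            consumer.assign(List.of(tp));
            consumer.seek(tp, lastPersisted + 1); // replay everything not yet in a segment
            // Kafka retention must outlast any outage, or this data is gone.
        }

        static long lastOffsetFromSegmentsOnDisk(TopicPartition tp) {
            return 0L; // placeholder: read the offset stored in the newest segment file
        }
    }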

SLIDE 27

Durability

[Diagram: the same Kafka → Digest → WIP buffer → Segment pipeline, annotated with measured Kafka ingest latency at p50 and p99]
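
The latency itself is cheap to measure on the consumer side (the histogram type is a hypothetical stand-in for any metrics library):

    import org.apache.kafka.clients.consumer.ConsumerRecord;

    // Ingest latency is the gap between a record's producer timestamp and
    // the moment digest actually sees it.
    interface LatencyHistogram { void record(long millis); }

    final class IngestLatency {
        static void observe(ConsumerRecord<?, ?> rec, LatencyHistogram histogram) {
            long lagMs = System.currentTimeMillis() - rec.timestamp();
            histogram.record(lagMs); // chart p50/p99; scale out when p99 grows
        }
    }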

SLIDE 30

[Diagram, repeated from slide 20: a producer writes to topic partitions via partition = hash(key) and consumers read in parallel - but what should the hash key be?]

SLIDE 31

Consumers falling behind…

  • Reasons: data volume, and processing time for real-time processing
  • Measure ingest latency
  • Increase parallelism when running tens of seconds behind
  • Log-scale (1, 2, 4, …) randomness added to the key, as sketched below
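
A sketch of that key-spreading trick (names and the doubling policy are illustrative):

    import java.util.concurrent.ThreadLocalRandom;

    // When a datasource's partition lags, spread its Kafka key over 1, 2, 4, ...
    // sub-keys so more partitions (and digest workers) share the load.
    final class KeySpreader {
        // spreadFactor is 1 normally and is doubled while lag stays too high.
        static String kafkaKey(String datasourceKey, int spreadFactor) {
            if (spreadFactor <= 1) return datasourceKey;
            int salt = ThreadLocalRandom.current().nextInt(spreadFactor);
            return datasourceKey + "#" + salt; // hash(key) now hits several partitions
        }
    }

The cost is that strict ordering within the datasource is relaxed while spreading is active.
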
SLIDE 32

[Diagram: ~100,000 Data Sources multiplexed onto a topic with only a few partitions (#1-#3)]

SLIDE 33

Data Model

Repository → Data Source → Event (one-to-many at each step)

  • Repository: storage limits, user admin
  • Data Source: a time series identified by a set of key-value ‘tags’; the tag set is hashed, e.g. Hash(#type=accesslog,#host=ops01)
  • Event: timestamp + Map[String,String]
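
Illustrative types for this model (not Humio's source):

    import java.util.Map;
    import java.util.TreeMap;

    // An event is a timestamp plus a string map; a datasource is identified
    // by the hash of its canonical tag string.
    record LogEvent(long timestamp, Map<String, String> fields) {}

    final class DataSourceKey {
        // Canonical, sorted rendering of the tag set; hashing this string
        // (e.g. as the Kafka partition key) identifies the datasource.
        static String of(Map<String, String> tags) {
            StringBuilder sb = new StringBuilder();
            new TreeMap<>(tags).forEach((k, v) ->
                sb.append('#').append(k).append('=').append(v).append(','));
            return sb.toString(); // "#host=ops01,#type=accesslog,"
        }
    }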

SLIDE 34

High variability tags ‘auto grouping’

  • Tags (the hash key) may be chosen with a large value domain, e.g. user name or IP address
  • This causes many datasources => growth in metadata, resource issues

SLIDE 35

High variability tags ‘auto grouping’

  • Tags (the hash key) may be chosen with a large value domain, e.g. user name or IP address
  • Humio sees this and hashes the tag value into a smaller value domain before the Kafka partition hash

SLIDE 36

High variability tags ‘auto grouping’

  • For example, before Kafka ingest, hash(“kresten”): #user=kresten => #user=13
  • Store the actual value ‘kresten’ in the event
  • At query time, a search is rewritten to search the data source #user=13, and re-filter based on the stored values (see the sketch below)
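
A sketch of the rewrite (the group count of 16 is made up):

    // A high-cardinality tag value is hashed into a small domain before the
    // Kafka partition hash; the real value stays in the event.
    final class AutoGrouping {
        static final int GROUPS = 16; // small value domain, e.g. #user=0..15

        // Ingest: #user=kresten becomes e.g. #user=13 in the tag set...
        static String groupedTag(String tagName, String value) {
            int bucket = Math.floorMod(value.hashCode(), GROUPS);
            return "#" + tagName + "=" + bucket;
        }
        // ...while the event itself keeps user=kresten. A query for
        // #user=kresten is rewritten to scan datasource #user=13 and then
        // re-filter events on the stored field value.
    }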

SLIDE 37

Multiplexing in Kafka

  • Ideally, we would just have 100,000 dynamic topics that perform well and scale infinitely
  • In practice, you have to know your data and control the sharding: default Kafka configs work for many workloads, but for maximum utilisation you have to go beyond the defaults
  • Humio automates this problem for log data w/ tags
SLIDE 38

Using Kafka in an on-prem Product

  • Leverage the stability and fault tolerance of Kafka
  • Large customers often have Kafka knowledge
  • We provide kafka/zookeeper docker images
  • Only real issue is the Zookeeper dependency - it often runs out of disk space in small setups
SLIDE 39

Other Issues

  • Observed GC pauses in the JVM: Kafka and HTTP libraries compress data, and JNI/GC interactions with byte[] can block global GC
  • We replaced both with custom compression:
  • JLibGzip (gzip in pure Java)
  • Zstd and LZ4/JNI using DirectByteBuffer (see the sketch below)
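
A sketch of the off-heap approach, assuming zstd-jni's direct-ByteBuffer overload: because both buffers live off the Java heap, the JNI call never pins heap byte[]s and so cannot stall the collector.

    import java.nio.ByteBuffer;
    import com.github.luben.zstd.Zstd;

    // Compress from one direct (off-heap) buffer into another, keeping the
    // GC out of the JNI call entirely.
    final class OffHeapCompress {
        static ByteBuffer compress(ByteBuffer src) {
            ByteBuffer dst = ByteBuffer.allocateDirect(
                (int) Zstd.compressBound(src.remaining()));
            int written = Zstd.compress(dst, src, 3); // level 3; both buffers must be direct
            dst.limit(written);
            dst.position(0);
            return dst;
        }
    }
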
SLIDE 40

Resetting Kafka/Zookeeper

  • Kafka provides a ‘cluster id’ we can use as an epoch
  • All Kafka sequence numbers (offsets) are reset when the cluster is rebuilt
  • Recognise this situation and never replay beyond such a reset (see the sketch below)
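
A sketch of the check (the helper naming is ours; describeCluster() is standard AdminClient API):

    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;

    // Kafka's cluster id changes when the cluster is wiped and rebuilt, so
    // comparing it with the id recorded alongside our data detects a reset:
    // stored offsets then belong to a previous epoch and must not be replayed.
    final class KafkaEpoch {
        static boolean wasReset(Properties adminProps, String recordedClusterId)
                throws Exception {
            try (AdminClient admin = AdminClient.create(adminProps)) {
                String current = admin.describeCluster().clusterId().get();
                return !current.equals(recordedClusterId);
            }
        }
    }
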
SLIDE 41

What about KSQL?

  • Kafka now has KSQL, which is in many ways similar to the engine we built
  • Humio moves computation to the data; KSQL moves the data to the computation
  • We provide an interactive, end-user-friendly experience
SLIDE 42

Final thoughts

  • With #IndexFreeLogging you can eat your cake and have it too: fast, useful, low-footprint logging
  • Many difficult problems go away by deferring them to Kafka

SLIDE 43

Thanks for your time.

Kresten Krab Thorup, Humio CTO

SLIDE 44-48

Filter 1GB data