Better TV & Broadband with Kafka & Spark Phill Radley - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Better TV & Broadband with Kafka & Spark

Phill Radley Chief Data Architect British Telecommunications plc

slide-2
SLIDE 2

In the beginning ( 2012 )

slide-3
SLIDE 3

HaaS: Hadoop as a Service

Hadoop-Admin: Admin Group

slide-4
SLIDE 4

Early adoption

slide-5
SLIDE 5

Doug Cutting – Sep 2015

“Spark will replace map/reduce as the standard execution for Hadoop”

slide-6
SLIDE 6

HaaS 2.0

Denser Nodes

Doubled #cores, trebled RAM

Same node count 

slide-7
SLIDE 7

Cluster migration

slide-8
SLIDE 8

TV Set Top Box

Broadband Home Hub

slide-9
SLIDE 9

TV & BB Data Pipeline Overview

[Diagram labels: Gateway Firewall; ESB; Kafka Producer; Kafka Broker with "raw" and "rich" topics; atomic metrics in big XML payloads; Spark (consumer and producer, enrich and aggregate) on the YARN cluster (HaaS); CRM enrichment data; Flume; HDFS; Hive tables; Impala]
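The enrich-and-aggregate step between the raw and rich topics can be sketched in plain Python, without Kafka or Spark. This is only an illustration under assumptions: the XML layout, the CRM lookup, the field names and the window length are all hypothetical, since the slides show neither the payload schema nor the aggregation interval.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical raw payload; the real XML schema is not shown in the slides.
RAW_PAYLOAD = """
<metrics device="hub-001">
  <metric name="wifi_drops" ts="100" value="2"/>
  <metric name="wifi_drops" ts="130" value="3"/>
  <metric name="wifi_drops" ts="400" value="1"/>
</metrics>
"""

# Hypothetical CRM enrichment data keyed by device id.
CRM = {"hub-001": {"region": "north"}}


def parse_payload(xml_text):
    """Flatten one XML payload into a list of atomic metric records."""
    root = ET.fromstring(xml_text.strip())
    device = root.get("device")
    return [
        {
            "device": device,
            "name": m.get("name"),
            "ts": int(m.get("ts")),
            "value": float(m.get("value")),
        }
        for m in root.iter("metric")
    ]


def enrich(record, crm):
    """Attach CRM attributes; unknown devices get region 'unknown'."""
    extra = crm.get(record["device"], {"region": "unknown"})
    return {**record, **extra}


def aggregate(records, window_seconds):
    """Sum values per (region, metric name, window start)."""
    totals = defaultdict(float)
    for r in records:
        window_start = r["ts"] - r["ts"] % window_seconds
        totals[(r["region"], r["name"], window_start)] += r["value"]
    return dict(totals)


enriched = [enrich(r, CRM) for r in parse_payload(RAW_PAYLOAD)]
# window_seconds is an assumed value, not taken from the slides.
rich = aggregate(enriched, window_seconds=300)
```

In the pipeline on the slide, the equivalent logic would run inside Spark, reading from the raw topic and writing the result to the rich topic.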

slide-10
SLIDE 10

Data Ingest Kafka - Raw topic

slide-11
SLIDE 11

Data Serving – Impala Concurrency

slide-12
SLIDE 12

Schema Design … on read … DEVOPS approach

  • Flat (de-normalised) tables, one table per query
  • Queried with SELECT * FROM … WHERE …
  • Table dimensions (rows & columns)
  • Table file formats optimised for each table's query pattern (up to 10x difference)
  • 1. Avro for tables serving row-oriented queries
  • 2. Parquet: default, time series
  • 3. Parquet with Snappy compression for deep time queries
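One way to picture the "table per query, format per query pattern" idea is a small helper that emits the Impala SQL for a flat table. This is a sketch under assumptions, not the team's tooling: the pattern names, table name and columns are hypothetical, and only the STORED AS clause and the COMPRESSION_CODEC query option are standard Impala features.

```python
# Mapping of the slide's three cases; the pattern names are made up here.
FORMAT_BY_PATTERN = {
    "row_oriented": "AVRO",
    "time_series": "PARQUET",
    "deep_time": "PARQUET",
}


def create_table_statements(table, columns, pattern):
    """Return the Impala SQL statements for one flat, per-query table."""
    column_list = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
    statements = []
    if pattern == "deep_time":
        # COMPRESSION_CODEC is an Impala query option; it affects Parquet
        # files written by later INSERT statements in the same session.
        statements.append("SET COMPRESSION_CODEC=snappy")
    statements.append(
        f"CREATE TABLE {table} ({column_list}) "
        f"STORED AS {FORMAT_BY_PATTERN[pattern]}"
    )
    return statements


ddl = create_table_statements(
    "hub_wifi_daily",  # hypothetical table name
    [("device_id", "STRING"), ("day", "STRING"), ("wifi_drops", "BIGINT")],
    "deep_time",
)
```

Keeping the format choice in one mapping makes it easy to re-test a table under a different format when its query pattern changes.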
slide-13
SLIDE 13

Impala Tuning…

  • There are lots of options; the defaults will not be good enough
  • (it’s not as mature as an Oracle DB ;-)
  • Isolate operational tenant loads with their own dedicated Impala resource pool
  • “Dedicated SQL Queue” added to platform service portfolio
  • Chargeable platform feature (as it’s a dedicated resource)
  • Tuning Impala daemons
  • Query executor & scanner threads for MAX concurrency, shortest queue
  • HDFS caching
  • Currently in test, expecting a 2-5x speed-up; more importantly, it eliminates unnecessary physical I/O (these are hot tables, keep them in memory)
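As a rough illustration of the two tuning levers above, the helpers below build the Impala statements a tenant session might issue. The pool and table names are hypothetical; REQUEST_POOL is an Impala query option and ALTER TABLE … SET CACHED IN is Impala's HDFS caching syntax, but the resource pool and the HDFS cache pool are assumed to have been created by the platform admins beforehand.

```python
def pool_statement(request_pool):
    """Route the session's queries to a dedicated Impala resource pool."""
    return f"SET REQUEST_POOL={request_pool}"


def cache_statements(hot_tables, cache_pool):
    """Pin each hot table's data in an existing HDFS cache pool."""
    return [
        f"ALTER TABLE {table} SET CACHED IN '{cache_pool}'"
        for table in hot_tables
    ]


# Hypothetical names, for illustration only.
session_setup = pool_statement("tv_bb_dedicated")
caching = cache_statements(["hub_wifi_daily", "stb_errors_daily"], "hot_tables")
```

The caching statements are one-off DDL, while the pool statement would be issued at the start of each tenant session.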

slide-14
SLIDE 14

Conclusions after months in production….

  • Spark 1.6 very stable
  • Impala requires a lot of tuning & table design to get working
  • High demand to use the data for other customer experience work
  • This solution runs on a multi-tenant cluster running hundreds of batch loads, and dozens of ad-hoc self-service analytics and data science users

  • i.e. the isolation using cgroups seems to work ( mostly )
  • Next Steps
  • Another similar data pipeline from the internal network
  • Multi-tenant Kafka ( Topic as a Service ) to service more clients
  • Second Data centre Site with dual ingest for high availability
slide-15
SLIDE 15

Thank you 