Building Data applications with Go: from Bloom filters to Data pipelines


Sergii Khomenko, Data Scientist, sergii.khomenko@stylight.com, @lc0d3r. FOSDEM, January 31, 2016.


slide-1
SLIDE 1

Building Data applications with Go

01

from Bloom filters to Data pipelines

Sergii Khomenko, Data Scientist sergii.khomenko@stylight.com, @lc0d3r FOSDEM - January 31, 2016

slide-2
SLIDE 2

Sergii Khomenko

2

Data scientist at one of the biggest fashion communities, Stylight. Data analysis and visualisation hobbyist, working on problems not only during working hours but also in free time, for fun and personal data visualisations. First encountered Golang around 2010 and fell in love with the language's channels and core concepts. Speaker at Berlin Buzzwords 2014, ApacheCon Europe 2014, Puppet Camp London, Berlin Buzzwords 2015, Tableau Conference on Tour, Budapest BI Forum 2015, Crunchconf 2015 and others.

slide-3
SLIDE 3

3

Munich, Germany. Founded on Apr 5, 2014. Gophers: 323

slide-4
SLIDE 4

4

slide-5
SLIDE 5

5

https://www.pinterest.com/pin/38351034303708696/

slide-6
SLIDE 6

  • Profitable Leads: Stylight provides its partners with high-quality leads, enabling partner shops to leverage Stylight as an ROI-positive traffic channel.
  • Inspiration: Stylight offers shoppable inspiration that makes it easy to know what to buy and how to style it.
  • Branding & Reach: Stylight offers a unique opportunity for brands to reach an audience that is actively looking for style online.
  • Shopping: Stylight helps users search and shop fashion and lifestyle products smarter across hundreds of shops.

6

Stylight – Make Style Happen

Core Target Group: Stylight helps aspiring women between 18 and 35 to evolve their style through shoppable inspiration.

slide-7
SLIDE 7

Stylight – acting on a global scale

slide-8
SLIDE 8

Experienced & Ambitious Team

Innovative cross-functional organisation with flat hierarchy builds a unique team spirit.

  • +200 employees
  • 40 PhDs/Engineers
  • 28 years average age
  • 63% female
  • 23 nationalities
  • 0 suits

8

slide-9
SLIDE 9

Agenda

9

  • Probabilistic data structures: Bloom filters, Count-Min or HyperLogLog
  • Open Source stack
  • Amazon AWS
  • Google Cloud
  • Serverless architecture

slide-10
SLIDE 10

The Nature of Data

10

slide-11
SLIDE 11

Sources of data:

11

  • Web tracking
  • Metrics tracking
  • Behaviour tracking
  • Business intelligence ETL
  • Internal Services
  • ML tagging service
slide-12
SLIDE 12

Access patterns

12

  • Real-time
  • Nearly real-time
  • Daily batches
slide-13
SLIDE 13

Probabilistic data structures

13

slide-14
SLIDE 14

14

Data structures that use different probabilistic approaches to compactly store data

slide-15
SLIDE 15

15

slide-16
SLIDE 16

16

slide-17
SLIDE 17

Bloom filter

17

Approximate Membership

slide-18
SLIDE 18

18

A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not.

slide-19
SLIDE 19

19


slide-20
SLIDE 20

20

  • bit array of m bits
  • k different hash functions with a uniform random distribution
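
A minimal sketch of these two ingredients in Go, using only the standard library. The type and function names (`Bloom`, `NewBloom`, `Add`, `Test`) are hypothetical, and deriving the k positions from two FNV hashes (double hashing) is an illustrative assumption, not the scheme of any particular library.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Bloom is a minimal Bloom filter: a bit array of m bits and k hash functions.
type Bloom struct {
	bits []uint64 // bit array packed into 64-bit words
	m    uint64   // number of bits
	k    uint64   // number of hash functions
}

// NewBloom allocates a filter with m bits and k hash functions.
func NewBloom(m, k uint64) *Bloom {
	return &Bloom{bits: make([]uint64, (m+63)/64), m: m, k: k}
}

// indexes derives k bit positions from two FNV hashes (double hashing).
// This is an assumption made for the sketch, not a library's scheme.
func (b *Bloom) indexes(item string) []uint64 {
	h1 := fnv.New64a()
	h1.Write([]byte(item))
	h2 := fnv.New64()
	h2.Write([]byte(item))
	a, c := h1.Sum64(), h2.Sum64()|1
	idx := make([]uint64, b.k)
	for i := uint64(0); i < b.k; i++ {
		idx[i] = (a + i*c) % b.m
	}
	return idx
}

// Add sets the k bits that belong to item.
func (b *Bloom) Add(item string) {
	for _, i := range b.indexes(item) {
		b.bits[i/64] |= 1 << (i % 64)
	}
}

// Test reports whether item may be in the set.
// false means "definitely not present"; true means "possibly present".
func (b *Bloom) Test(item string) bool {
	for _, i := range b.indexes(item) {
		if b.bits[i/64]&(1<<(i%64)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	f := NewBloom(1024, 3)
	f.Add("gopher")
	fmt.Println(f.Test("gopher")) // true: added items are always found
	fmt.Println(f.Test("python")) // usually false; a false positive is possible
}
```

Because bits are only ever set and never cleared, an added element always tests positive, which is why false negatives cannot occur.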

slide-21
SLIDE 21

21

slide-22
SLIDE 22

22

slide-23
SLIDE 23

23

https://www.jasondavies.com/bloomfilter/

slide-24
SLIDE 24

Size estimation

24

  • memory usage: m = -n * ln(p) / (ln 2)^2
  • hash functions: k = (m / n) * ln 2
  • n: estimated number of elements
  • p: false positive probability
  • m: required bit array length

Example: n = 1,000,000
  • FPR 10%: ~4,800,000 bits, ~600 kByte
  • FPR 0.1%: ~14,400,000 bits, ~1.8 MByte
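
The example numbers can be reproduced with a short Go sketch. It assumes the standard sizing formulas for a Bloom filter, m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2; the function names `OptimalM` and `OptimalK` are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// OptimalM returns the bit array length m for n elements and a
// false positive probability p: m = -n * ln(p) / (ln 2)^2.
func OptimalM(n uint64, p float64) uint64 {
	return uint64(math.Ceil(-float64(n) * math.Log(p) / (math.Ln2 * math.Ln2)))
}

// OptimalK returns the number of hash functions: k = (m / n) * ln 2,
// rounded to the nearest integer.
func OptimalK(m, n uint64) uint64 {
	return uint64(math.Round(float64(m) / float64(n) * math.Ln2))
}

func main() {
	n := uint64(1000000)
	for _, p := range []float64{0.1, 0.001} {
		m := OptimalM(n, p)
		fmt.Printf("p=%v: m=%d bits (~%d kByte), k=%d\n", p, m, m/8/1000, OptimalK(m, n))
	}
}
```

Tightening the false positive rate from 10% to 0.1% roughly triples the memory, since m grows with ln(1/p) rather than with 1/p.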

slide-25
SLIDE 25

Use-cases

25

  • Caches
  • Databases
  • HBase
  • Cassandra
  • Networking

https://github.com/willf/bloom
https://github.com/reddragon/bloomfilter.go
https://github.com/seiflotfy/dlCBF
https://github.com/patrickmn/go-bloom
https://github.com/armon/bloomd
https://github.com/geetarista/go-bloomd

slide-26
SLIDE 26

Extensions

26

  • Cardinality estimate (increment a counter when adding a new element)
  • Scalable Bloom filters (add another hash function on top)
  • Counting Bloom filters (increment a counter every time we see an element)
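
A rough sketch of two of these extensions combined, assuming a counter per position instead of a single bit (counting Bloom filter) and a separate counter that is incremented whenever an added item was not seen before (cardinality estimate). All names are hypothetical, the hashing scheme is an illustrative choice, and counter overflow is ignored.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// CountingBloom keeps a counter per position instead of a single bit,
// so items can be removed again. Counter overflow is ignored here.
type CountingBloom struct {
	counters []uint32
	k        uint32
	Distinct int // incremented when an added item was not seen before
}

// NewCountingBloom allocates m counters and uses k hash positions per item.
func NewCountingBloom(m, k uint32) *CountingBloom {
	return &CountingBloom{counters: make([]uint32, m), k: k}
}

// indexes derives k positions from the two halves of one FNV-1a hash.
func (c *CountingBloom) indexes(item string) []uint32 {
	h := fnv.New64a()
	h.Write([]byte(item))
	sum := h.Sum64()
	a, b := uint32(sum), uint32(sum>>32)|1
	idx := make([]uint32, c.k)
	for i := range idx {
		idx[i] = (a + uint32(i)*b) % uint32(len(c.counters))
	}
	return idx
}

// Test reports whether item may be present (all k counters are non-zero).
func (c *CountingBloom) Test(item string) bool {
	for _, i := range c.indexes(item) {
		if c.counters[i] == 0 {
			return false
		}
	}
	return true
}

// Add increments the k counters and, for unseen items, the estimate.
func (c *CountingBloom) Add(item string) {
	if !c.Test(item) {
		c.Distinct++
	}
	for _, i := range c.indexes(item) {
		c.counters[i]++
	}
}

// Remove decrements the k counters of an item that tests as present.
func (c *CountingBloom) Remove(item string) {
	if !c.Test(item) {
		return
	}
	for _, i := range c.indexes(item) {
		c.counters[i]--
	}
}

func main() {
	f := NewCountingBloom(1024, 3)
	f.Add("dress")
	f.Add("dress")
	fmt.Println(f.Distinct) // 1: the second add was already present
	f.Remove("dress")
	fmt.Println(f.Test("dress")) // true: one occurrence is left
}
```

The distinct counter inherits the filter's false positives, so it can only undercount: a new item that happens to test as present is not counted.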
slide-27
SLIDE 27

Count-Min

27

Frequency estimator

slide-28
SLIDE 28

28

  • matrix of w columns and d rows
  • hash function associated with every row
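
A minimal Count-Min sketch in Go along these lines. The names (`CountMin`, `NewCountMin`, `Add`, `Count`) are hypothetical, and seeding one FNV-1a hash with the row number to obtain a different hash function per row is an assumption made for the example.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// CountMin is a matrix of d rows and w columns of counters,
// with one hash function per row.
type CountMin struct {
	rows [][]uint64
	w    uint64
}

// NewCountMin allocates a sketch with w columns and d rows.
func NewCountMin(w, d int) *CountMin {
	rows := make([][]uint64, d)
	for i := range rows {
		rows[i] = make([]uint64, w)
	}
	return &CountMin{rows: rows, w: uint64(w)}
}

// column hashes item for the given row. The row number is mixed in as a
// seed so that each row behaves like a different hash function.
func (c *CountMin) column(row int, item string) uint64 {
	h := fnv.New64a()
	h.Write([]byte{byte(row)})
	h.Write([]byte(item))
	return h.Sum64() % c.w
}

// Add increments one counter per row.
func (c *CountMin) Add(item string) {
	for r := range c.rows {
		c.rows[r][c.column(r, item)]++
	}
}

// Count returns the minimum over all rows. Collisions can only inflate
// counters, so the result never underestimates the true frequency.
func (c *CountMin) Count(item string) uint64 {
	lowest := c.rows[0][c.column(0, item)]
	for r := 1; r < len(c.rows); r++ {
		if v := c.rows[r][c.column(r, item)]; v < lowest {
			lowest = v
		}
	}
	return lowest
}

func main() {
	cm := NewCountMin(256, 4)
	for i := 0; i < 3; i++ {
		cm.Add("shoes")
	}
	cm.Add("bags")
	fmt.Println(cm.Count("shoes")) // at least 3
	fmt.Println(cm.Count("bags"))  // at least 1
}
```

More columns reduce the chance of collisions within a row; more rows reduce the chance that every row collides at once.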

slide-29
SLIDE 29

29

slide-30
SLIDE 30

HyperLogLog

30

Cardinality estimator

slide-31
SLIDE 31

31

HyperLogLog is an algorithm for the count-distinct problem, approximating the number of distinct elements in a multiset.

slide-32
SLIDE 32

32

The HyperLogLog algorithm is able to estimate cardinalities of > 10^9 with a typical error rate of 2%, using 1.5 kB of memory [3]

slide-33
SLIDE 33

33

hash(x) -> stream of bits {1,0,0,1,0,1..}

  • hash generates uniformly distributed values
  • every bit is independent

Hash function

slide-34
SLIDE 34

34

p(first bit - 0) = 1/2
p(second bit - 0) = 1/2
5 consecutive zeros - (1/2)^5
N consecutive zeros - (1/2)^N

Bit probability

slide-35
SLIDE 35

35

p(first bit - 0) = 1/2
p(second bit - 0) = 1/2
5 consecutive zeros - (1/2)^5
N consecutive zeros - (1/2)^N
N = 32, Odds = 1/4294967296 -> Expected 4294967296 samples

Guessing bits
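
The bit-guessing argument can be tried out with a small Go sketch. It assumes a 32-bit FNV-1a hash as a stand-in for a uniformly distributed hash function; the helper names `leadingZeros` and `expectedSamples` are hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/bits"
)

// leadingZeros hashes item to 32 bits and counts the zeros
// that appear before the first 1 bit.
func leadingZeros(item string) int {
	h := fnv.New32a()
	h.Write([]byte(item))
	return bits.LeadingZeros32(h.Sum32())
}

// expectedSamples is the number of uniformly random hashes we expect to
// look at before one of them starts with n consecutive zeros: 2^n.
func expectedSamples(n uint) uint64 {
	return 1 << n
}

func main() {
	fmt.Println(leadingZeros("gopher")) // somewhere between 0 and 32
	fmt.Println(expectedSamples(5))     // 32
	fmt.Println(expectedSamples(32))    // 4294967296
}
```

Read in reverse, this is the estimator's core idea: if the longest run of leading zeros observed in a stream is n, roughly 2^n distinct values have probably been seen.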

slide-36
SLIDE 36

36

N = 32 = {1,0,0,0,0,0} in binary = 6 bits
With 6 bits per register we can store run lengths up to 63, enough to count up to about 2^64
Where the name is coming from: log2(log2(2^64)) = log2(64) = 6

Storing bits
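
The arithmetic behind the register size can be checked in a few lines of Go. The helper names `bitsNeeded` and `logLog` are hypothetical and only illustrate the calculation.

```go
package main

import (
	"fmt"
	"math"
	"math/bits"
)

// bitsNeeded returns how many bits it takes to write n in binary.
func bitsNeeded(n uint) int {
	return bits.Len(n)
}

// logLog returns log2(log2(2^width)): the register size in bits needed
// to store a run length taken from a width-bit hash.
func logLog(width float64) float64 {
	return math.Log2(math.Log2(math.Pow(2, width)))
}

func main() {
	fmt.Println(bitsNeeded(32)) // 6 (binary 100000)
	fmt.Println(bitsNeeded(63)) // 6 (binary 111111, the largest 6-bit value)
	fmt.Println(logLog(64))     // 6
}
```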

slide-37
SLIDE 37

37

  • Create m registers
  • Partition the bit stream
  • first log2(m) bits: register index
  • remaining bits: used for the actual value

Multiple registers
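
The partitioning step could look roughly like this in Go. The sketch assumes a 32-bit hash and m = 16 registers (4 index bits); `indexBits`, `split` and `Registers` are hypothetical names, and the stored value is the position of the first 1 bit in the remaining bits.

```go
package main

import (
	"fmt"
	"math/bits"
)

// indexBits is log2(m): with 4 index bits there are m = 16 registers.
const indexBits = 4

// split partitions a 32-bit hash: the first indexBits bits select a
// register, the rest is scanned for the position of the first 1 bit.
func split(hash uint32) (index uint32, rank uint8) {
	index = hash >> (32 - indexBits)
	rest := hash << indexBits // remaining bits, left-aligned
	rank = uint8(bits.LeadingZeros32(rest)) + 1
	if rank > 32-indexBits+1 {
		rank = 32 - indexBits + 1 // all remaining bits were zero
	}
	return index, rank
}

// Registers keeps the largest rank seen so far in each register.
type Registers [1 << indexBits]uint8

// Add routes a hash to its register and keeps the maximum rank.
func (r *Registers) Add(hash uint32) {
	i, rank := split(hash)
	if rank > r[i] {
		r[i] = rank
	}
}

func main() {
	// 0xA0000001 = 1010 followed by 27 zeros and a single 1.
	index, rank := split(0xA0000001)
	fmt.Println(index, rank) // 10 28

	var regs Registers
	regs.Add(0xA0000001)
	regs.Add(0xA8000000) // same register, rank 1: does not lower the maximum
	fmt.Println(regs[10]) // 28
}
```

Splitting the stream over many registers is what tames the variance of the single-register estimate: one unlucky hash only affects one register.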

slide-38
SLIDE 38

38

HyperLogLog - add

slide-39
SLIDE 39

39

  • Given m registers
  • Estimate aggregated value
  • Min? Max? Avg? Median?
  • Geometric/Harmonic mean!
  • Estimate = A*m*H (A: bias-correction constant, H: harmonic mean of 2^register over all m registers)

HyperLogLog - size
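
A sketch of the aggregation step in Go, assuming the raw HyperLogLog estimate E = A * m * H, where H is the harmonic mean of 2^register and A is the bias-correction constant 0.7213 / (1 + 1.079/m), valid for m >= 128. The small-range and large-range corrections of the full algorithm are left out, and the function name `estimate` is hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// estimate combines m register values into a raw cardinality estimate
// E = alpha * m * H, where H is the harmonic mean of 2^register.
// Small-range and large-range corrections are omitted in this sketch.
func estimate(registers []uint8) float64 {
	m := float64(len(registers))
	sum := 0.0
	for _, r := range registers {
		sum += math.Pow(2, -float64(r))
	}
	harmonicMean := m / sum
	alpha := 0.7213 / (1 + 1.079/m) // bias correction, valid for m >= 128
	return alpha * m * harmonicMean
}

func main() {
	registers := make([]uint8, 128)
	for i := range registers {
		registers[i] = 3 // pretend every register saw a rank of 3
	}
	fmt.Printf("%.1f\n", estimate(registers))
}
```

The harmonic mean is used because it is far less sensitive to a few unusually large registers than the arithmetic mean would be.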

slide-40
SLIDE 40

40

http://content.research.neustar.biz/blog/hll.html

slide-41
SLIDE 41

Use-cases

41

  • Databases
  • Redis
  • PostgreSQL
  • Redshift
  • Impala
  • Hive
  • Spark

https://github.com/clarkduvall/hyperloglog
https://github.com/armon/hlld

slide-42
SLIDE 42

42

In computing, a pipeline is a set of data processing elements connected in series, where the output of one element is the input of the next one.
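
In Go this definition maps naturally onto goroutines connected by channels. The following is a minimal sketch; the stage names `generate`, `square` and `sum` are hypothetical and stand in for real processing steps.

```go
package main

import "fmt"

// generate is the first stage: it emits the given values and closes
// its output channel when done.
func generate(values ...int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for _, v := range values {
			out <- v
		}
	}()
	return out
}

// square is a middle stage: it reads from the previous stage,
// transforms each value and passes it on.
func square(in <-chan int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for v := range in {
			out <- v * v
		}
	}()
	return out
}

// sum is the final stage: it drains its input and returns the total.
func sum(in <-chan int) int {
	total := 0
	for v := range in {
		total += v
	}
	return total
}

func main() {
	// The output of each element is the input of the next one.
	fmt.Println(sum(square(generate(1, 2, 3)))) // 14
}
```

Each stage runs concurrently and closes its output when its input is exhausted, so shutdown propagates down the pipeline without extra signalling.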

slide-43
SLIDE 43

Open Source Stack

43

slide-44
SLIDE 44

44

http://lambda-architecture.net/

slide-45
SLIDE 45

45

Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.

slide-46
SLIDE 46

46

slide-47
SLIDE 47

47

slide-48
SLIDE 48

48

Libraries

  • Sarama is an MIT-licensed Go client library for Apache Kafka version 0.8 (and later): https://github.com/Shopify/sarama
  • Go Kafka Client: https://github.com/elodina/go_kafka_client

slide-49
SLIDE 49

49

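    // Requires the imports "log" and "github.com/Shopify/sarama".
    producer, err := sarama.NewAsyncProducer([]string{"localhost:9092"}, nil)
    if err != nil {
        panic(err)
    }

    // Close the producer on exit so that buffered messages are flushed.
    defer func() {
        if err := producer.Close(); err != nil {
            log.Fatalln(err)
        }
    }()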

slide-50
SLIDE 50

50

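    // Trap SIGINT to trigger a shutdown (requires "os" and "os/signal").
    signals := make(chan os.Signal, 1)
    signal.Notify(signals, os.Interrupt)

    var enqueued, errors int
    ProducerLoop:
    for {
        select {
        // Queue a message; this case blocks while the input buffer is full.
        case producer.Input() <- &sarama.ProducerMessage{Topic: "my_topic", Key: nil, Value: sarama.StringEncoder("testing 123")}:
            enqueued++
        // Delivery failures are reported asynchronously on Errors().
        case err := <-producer.Errors():
            log.Println("Failed to produce message", err)
            errors++
        case <-signals:
            break ProducerLoop
        }
    }

    log.Printf("Enqueued: %d; errors: %d\n", enqueued, errors)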

slide-51
SLIDE 51

51

http://www.ipponusa.com/wp-content/uploads/2014/10/spark-architecture.jpg

dmrgo is a Go library for writing map/reduce jobs. https://github.com/dgryski/dmrgo

slide-52
SLIDE 52

Results

52

  • Scalable
  • Flexible
  • High costs of maintenance
  • Not so easy to setup
slide-53
SLIDE 53

53

A programming language is low level when its programs require attention to the irrelevant.

Alan Jay Perlis / Epigrams on Programming

slide-54
SLIDE 54

Amazon AWS

54

slide-55
SLIDE 55

Kinesis Streams

slide-56
SLIDE 56

56

slide-57
SLIDE 57

57

slide-58
SLIDE 58

58

Libraries

  • AWS SDK for Go

https://github.com/aws/aws-sdk-go

slide-59
SLIDE 59

59

slide-60
SLIDE 60

60

slide-61
SLIDE 61

Kinesis Firehose

slide-62
SLIDE 62

Kinesis Analytics

slide-63
SLIDE 63

63

[Diagram labels: custom unification pipeline; Product Processing; Business Intelligence; ML/Tagging; Product events; variety of event types and structures]

slide-64
SLIDE 64

Google Cloud

64

slide-65
SLIDE 65

65

slide-66
SLIDE 66

66

slide-67
SLIDE 67

67

Libraries

  • Google APIs Client Library for Go

https://github.com/GoogleCloudPlatform/gcloud-golang

slide-68
SLIDE 68

68

slide-69
SLIDE 69

69

slide-70
SLIDE 70
slide-71
SLIDE 71

71

slide-72
SLIDE 72

Serverless architecture

72

slide-73
SLIDE 73

73

slide-74
SLIDE 74

74

slide-75
SLIDE 75

75

slide-76
SLIDE 76

76

slide-77
SLIDE 77

77

slide-78
SLIDE 78

78

slide-79
SLIDE 79

79

slide-80
SLIDE 80

80

Possibilities

  • all Lambdas in one place with version control
  • integration tests with real events
  • proper CI/CD setup
slide-81
SLIDE 81

81

slide-82
SLIDE 82

www.stylight.com

sergii.khomenko@stylight.com @lc0d3r

slide-83
SLIDE 83

Related links

83

  • 1. Burton H. Bloom. Space/Time Trade-offs in Hash Coding with Allowable Errors. 1970

  • 2. Interactive visualisation: Bloom Filters
  • 3. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm
  • 4. HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm

  • 5. HyperLogLog — Cornerstone of a Big Data Infrastructure
  • 6. Armon Dadgar on Bloom Filters and HyperLogLog
slide-84
SLIDE 84

Related links

84

  • 7. https://github.com/willf/bloom
  • 8. Google’s Cloud Pub/Sub Real-Time Messaging Service Is Now In Public Beta

  • 9. Streaming Data Processing with Amazon Kinesis and AWS Lambda

  • 10. Google Cloud Dataflow Two Worlds Become a Much Better One
  • 11. https://github.com/apex/apex