Eivind Arvesen, Javazone 2019 (presentation slides)



SLIDE 1

Eivind Arvesen, Javazone 2019

SLIDE 2

EIVIND ARVESEN

  • Developer, architect
  • M.Sc. in Computer Science
  • Consultant at Bouvet since 2017
  • Into application security, privacy, machine learning, web development

/ : @EivindArvesen

SLIDE 5

PERSONAL DATA IN APPEND-ONLY STORAGE

POSSIBLY NOT A GREAT IDEA

SLIDE 6

* CONTEXT *

SLIDE 7

Elasticsearch: an open source, near realtime distributed search engine with a REST-API.

SLIDE 8

Elasticsearch should not be used as a primary data store!

SLIDE 9

Elasticsearch is great at search

SLIDE 10

… but it is not a database

SLIDE 11

PHILADELPHIA

SLIDE 12

ELASTICSEARCH IN DEPTH

Source: https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up

SLIDE 13

Source: https://github.com/exo-archives/exo-es-search

ELASTICSEARCH IN DEPTH

SLIDE 14

ELASTICSEARCH IN DEPTH

Source: https://www.elastic.co/blog/every-shard-deserves-a-home

SLIDE 15

ELASTICSEARCH IN DEPTH

Source: https://lucidworks.com/post/segment-merging-deleted-documents-optimize-may-bad/

SLIDE 16

ELASTICSEARCH IN DEPTH

Merging can be triggered manually, but this should only be done on old indices that are no longer in active use. It merges everything into one segment, after which there is no further automatic optimization.
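
As a rough illustration, a manual merge could be requested through Elasticsearch's force merge endpoint. The sketch below only builds the request and does not send it; the host `http://localhost:9200` and the index name `logs-2018` are made-up placeholders, not taken from the talk.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_force_merge_request(base_url: str, index: str,
                              max_num_segments: int = 1) -> Request:
    """Build (but do not send) a POST to the _forcemerge endpoint of one index."""
    query = urlencode({"max_num_segments": max_num_segments})
    url = f"{base_url.rstrip('/')}/{quote(index, safe='')}/_forcemerge?{query}"
    return Request(url, method="POST")


# Hypothetical host and index name, used for illustration only.
request = build_force_merge_request("http://localhost:9200", "logs-2018")
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would actually send it, which is why the sketch stops at building it.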

SLIDE 17

ELASTICSEARCH IN DEPTH

SLIDE 18

CONTEXT

«… BUT IT ISN’T ACTUALLY DELETED UNTIL A SEGMENT MERGE OCCURS»
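
The quote can be made concrete with a toy model. This is a deliberately simplified sketch of append-only segments with deletion markers, not Lucene's actual data structures; the `Segment` class and `merge` function are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Toy stand-in for an immutable segment: docs are never edited in place."""
    docs: dict                                  # doc_id -> stored content
    deleted: set = field(default_factory=set)   # ids that are only marked deleted

    def delete(self, doc_id: str) -> None:
        # A delete only records a marker; the stored content stays where it is.
        if doc_id in self.docs:
            self.deleted.add(doc_id)

    def search(self) -> list:
        # Searches skip marked documents, so the data merely looks gone.
        return sorted(d for d in self.docs if d not in self.deleted)


def merge(segments: list) -> Segment:
    """Write a new segment from live docs only; this is when data is dropped."""
    live = {}
    for seg in segments:
        for doc_id, content in seg.docs.items():
            if doc_id not in seg.deleted:
                live[doc_id] = content
    return Segment(docs=live)


seg = Segment(docs={"1": "alice@example.com", "2": "bob@example.com"})
seg.delete("1")
print(seg.search())        # only doc "2" is visible to searches
print("1" in seg.docs)     # the deleted doc is still physically stored
print("1" in merge([seg]).docs)  # it disappears only in the merged segment
```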

SLIDE 19

SLIDE 20

GDPR

SLIDE 21

GDPR

ART. 17 - RIGHT TO ERASURE («RIGHT TO BE FORGOTTEN»)

The data subject shall have the right to obtain from the controller the erasure of personal data concerning him or her without undue delay and the controller shall have the obligation to erase personal data without undue delay…

SLIDE 22

GDPR

«UNDUE DELAY»

Considered «about a month» (EU), or «thirty days» (Information Commissioner’s Office, UK)

SLIDE 23

WHAT IS «ERASURE»?

What does it mean to «erase»?

SLIDE 24

AFFECTED?

Depending upon

  • Cluster architecture
  • Difference in data between shards on different nodes
  • Configuration (e.g. refresh-interval)
  • Merge settings*
  • Whether a new search (via side effects) leads to a «flush», which in turn leads to a merge

one can at any given point in time be in possession of data that should be deleted.
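
One way to get a feel for how much such data an index is holding is to compare live and deleted document counts from the index stats API (`GET /<index>/_stats/docs`). The sketch below parses a response of that shape; the numbers in `sample` are made up, and `deleted_ratio` is a hypothetical helper name.

```python
def deleted_ratio(stats: dict) -> float:
    """Fraction of documents on primary shards that are marked deleted
    but have not yet been merged away."""
    docs = stats["_all"]["primaries"]["docs"]
    total = docs["count"] + docs["deleted"]
    return docs["deleted"] / total if total else 0.0


# Made-up fragment in the shape returned by GET /<index>/_stats/docs.
sample = {"_all": {"primaries": {"docs": {"count": 900, "deleted": 100}}}}
print(f"{deleted_ratio(sample):.0%} of stored documents are awaiting a merge")
```

A ratio above zero does not say whose data is affected, only that some deleted documents are still physically present.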

SLIDE 25

AFFECTED?

…and when a segment reaches the maximum size (5GB by default), it can only* be merged when it accumulates 50% deletions!

* Lucene < 7.5
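
Read literally, the rule on this slide can be written as a small predicate. This is a simplified reading of the slide for Lucene < 7.5, not Lucene's actual merge policy code; the function name and parameters are invented for illustration.

```python
GB = 1024 ** 3


def can_be_merged(segment_bytes: int, deleted_fraction: float,
                  max_segment_bytes: int = 5 * GB) -> bool:
    """Simplified version of the slide's rule (Lucene < 7.5)."""
    if segment_bytes < max_segment_bytes:
        return True  # smaller segments remain candidates for ordinary merging
    # A segment at the maximum size is only picked up again at 50% deletions.
    return deleted_fraction >= 0.5


print(can_be_merged(5 * GB, 0.30))  # False
print(can_be_merged(5 * GB, 0.55))  # True
```

Under this rule, a deleted record in a full-size segment can stay on disk for as long as fewer than half of its neighbours have been deleted.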

SLIDE 26

PROBLEM

  • No obvious solution
  • Uncertain whether it is a problem in practice until an EU court takes a position

SLIDE 27

WE DON’T KNOW WHAT DATA WE HAVE

SLIDE 28

NOW WHAT?

SLIDE 29

Elasticsearch should not be used as a primary data store!

SLIDE 30

… but many do it anyway!

SLIDE 31

COMMUNICATIONS

  • Blog
  • Elastic
  • Lucene

SLIDE 32

COMMUNICATIONS

Lucene 7.5 would be released in about a week (Thanks, Jan Høydal!)

SLIDE 33

COMMUNICATIONS

  • Current ES version: 7.3 (July 2019)
  • ES versions < 6.5 do not include Lucene 7.5, and cannot be configured to the extent we need

SLIDE 34

SOLUTION!

SLIDE 35

SOLUTION

UPGRADE + BLUE/GREEN DEPLOYMENT

ES >= 6.5 (LUCENE >= 7.5)

SLIDE 36

SOLUTION

UPGRADE + BLUE/GREEN DEPLOYMENT

ANOTHER SOURCE OF GROUND TRUTH
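
A blue/green switch is commonly done by having clients query an alias and repointing it once the new index has been rebuilt from the other source of truth. The sketch below builds the body for Elasticsearch's `POST /_aliases` endpoint; the alias and index names are hypothetical, and this is one possible approach rather than the setup used in the talk.

```python
import json


def alias_swap_actions(alias: str, old_index: str, new_index: str) -> dict:
    """Body for POST /_aliases that repoints an alias in a single request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names: clients query "people"; blue is the old index, green the new.
body = alias_swap_actions("people", "people-blue", "people-green")
print(json.dumps(body, indent=2))
```

Once the alias points at the rebuilt index, the old index can be deleted as a whole, which removes its segments along with any documents that were only marked as deleted.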

SLIDE 37

SOLUTION

Source: https://www.elastic.co/blog/signal-media-optimizing-for-more-elasticsearch-power-with-less-elasticsearch-cluster

SLIDE 38

ALTERNATIVE SOLUTION

PATCHING + WEEKLY JOB

Cron expungeDelete
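
In Elasticsearch, expunging deletes corresponds to the `only_expunge_deletes` parameter of the force merge endpoint. The sketch below builds the URL such a weekly job might POST to; the host, index names, script name and cron schedule are all made-up placeholders.

```python
from urllib.parse import quote, urlencode


def expunge_deletes_url(base_url: str, indices: list) -> str:
    """URL for a force merge that only rewrites segments containing deletions."""
    targets = ",".join(quote(name, safe="") for name in indices)
    query = urlencode({"only_expunge_deletes": "true"})
    return f"{base_url.rstrip('/')}/{targets}/_forcemerge?{query}"


# Hypothetical host and index names. A weekly cron entry such as
#   0 3 * * 0  python3 expunge_deletes.py
# could run a script that sends a POST request to this URL.
print(expunge_deletes_url("http://localhost:9200", ["people", "orders"]))
```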

SLIDE 39

ONLY ELASTICSEARCH?

  • Probably also affects Solr (and other comparable solutions)
  • Kafka?

SLIDE 40

CONCLUSIONS

  • DON’T use Elasticsearch as primary data store!
  • If this strikes you as a particularly relevant risk:
      • Get legal advice
      • Upgrade your Elasticsearch version
      • Read up on configs
      • Read up on how to reindex in place (periodically)
      • Establish a cleaning job
      • … or encrypt (hard) and «throw away» the key
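
The last point can be sketched as follows: each person's data is encrypted before it reaches append-only storage, and "erasure" means deleting the key from a small mutable key store. The `KeyStore` class is hypothetical, and the one-time pad is only a standard-library stand-in so that the example runs as is; a real system would use a vetted cipher such as AES-GCM from a cryptography library.

```python
import secrets


class KeyStore:
    """Mutable store for per-person keys; the one place where deletion is real."""

    def __init__(self) -> None:
        self._pads = {}  # (person_id, record_id) -> one-time pad

    def encrypt(self, person_id: str, record_id: str, plaintext: bytes) -> bytes:
        # Stand-in cipher: a fresh random pad as long as the message, used once.
        pad = secrets.token_bytes(len(plaintext))
        self._pads[(person_id, record_id)] = pad
        return bytes(p ^ k for p, k in zip(plaintext, pad))

    def decrypt(self, person_id: str, record_id: str, ciphertext: bytes) -> bytes:
        pad = self._pads[(person_id, record_id)]  # raises KeyError once forgotten
        return bytes(c ^ k for c, k in zip(ciphertext, pad))

    def forget(self, person_id: str) -> None:
        # "Throw away the key": drop every pad belonging to this person.
        for key in [k for k in self._pads if k[0] == person_id]:
            del self._pads[key]


store = KeyStore()
append_only_log = [store.encrypt("user-42", "r1", b"alice@example.com")]
print(store.decrypt("user-42", "r1", append_only_log[0]))  # readable while the key exists
store.forget("user-42")
# append_only_log still holds the ciphertext, but decrypt() now raises KeyError.
```

The trade-off is that encrypted fields can no longer be searched as plain text, so this fits stored payloads better than indexed fields.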

SLIDE 41

TLDR

  • LUCENE < 7.5 WON'T MERGE SEGMENTS LARGER THAN 5GB (DEFAULT) UNLESS THEY ACCUMULATE 50% DELETIONS.
  • YOU SHOULD REINDEX FROM PRIMARY DATA STORE PERIODICALLY

SLIDE 42

IN SUMMARY

SLIDE 43

ELASTICSEARCH

YOU KNOW, FOR SEARCH

SLIDE 44

CATCH ME OUTSIDE

  • @EivindArvesen
  • https://github.com/eivindarvesen
  • https://eivindarvesen.com

Illustrations: Unsplash

SLIDE 45

THANK YOU!