Eivind Arvesen, Javazone 2019
EIVIND ARVESEN
• Developer, architect
• M.Sc. in Computer Science
• Consultant at Bouvet since 2017
• Into application security, privacy, machine learning, web development
@EivindArvesen
PERSONAL DATA IN APPEND-ONLY STORAGE: POSSIBLY NOT A GREAT IDEA
* CONTEXT *
Elasticsearch: an open source, near-realtime distributed search engine with a REST API.
Elasticsearch should not be used as a primary data store!
Elasticsearch is great at search
… but it is not a database
PHILADELPHIA
ELASTICSEARCH IN DEPTH Source: https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up
ELASTICSEARCH IN DEPTH Source: https://github.com/exo-archives/exo-es-search
ELASTICSEARCH IN DEPTH Source: https://www.elastic.co/blog/every-shard-deserves-a-home
ELASTICSEARCH IN DEPTH Source: https://lucidworks.com/post/segment-merging-deleted-documents-optimize-may-bad/
ELASTICSEARCH IN DEPTH Merging can be triggered manually, but this should only be done on old indices that are no longer in active use. A manual merge combines everything into one segment, after which no further automatic optimization takes place.
ELASTICSEARCH IN DEPTH
CONTEXT «… BUT IT ISN’T ACTUALLY DELETED UNTIL A SEGMENT MERGE OCCURS»
GDPR
GDPR ART. 17 - RIGHT TO ERASURE («RIGHT TO BE FORGOTTEN») «The data subject shall have the right to obtain from the controller the erasure of personal data concerning him or her without undue delay and the controller shall have the obligation to erase personal data without undue delay…»
GDPR «UNDUE DELAY» Considered «about a month» (EU), or «thirty days» (Information Commissioner’s Office, UK)
WHAT IS «ERASURE»? What does it mean to «erase»?
AFFECTED?
Depending on
• Cluster architecture
• Differences in data between shards on different nodes
• Configuration (e.g. refresh interval)
• Merge settings*
• Whether a new search (via side effects) leads to a «flush», which in turn leads to a merge
one can, at any given point in time, be in possession of data that should have been deleted.
AFFECTED?
…and when a segment reaches the maximum size (5GB by default), it can only* be merged when it accumulates 50% deletions!
* Lucene < 7.5
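The threshold above can be made concrete with a rough model. The sketch below is a simplified reading of the pre-7.5 behaviour, not Lucene's actual merge policy code: it assumes eligibility is judged on a segment's live (non-deleted) size against half the maximum segment size, which reproduces the 5GB / 50% figures. The function name and the row layout (`docs.count`, `docs.deleted`, and `size` in bytes, modelled on the output of `GET _cat/segments?format=json&bytes=b`) are assumptions for illustration.

```python
# Simplified model, not Lucene's implementation: flag segments that still hold
# deleted documents but are too large to be picked up by an automatic merge.
GIB = 1024 ** 3
MAX_MERGED_SEGMENT_BYTES = 5 * GIB  # default maximum segment size from the slide


def stuck_segments(rows, max_bytes=MAX_MERGED_SEGMENT_BYTES):
    """Return names of segments whose deletions a natural merge would not reclaim.

    `rows` is assumed to be a list of dicts with string values, e.g.
    {"segment": "_a", "docs.count": "900", "docs.deleted": "100", "size": "123"}.
    """
    stuck = []
    for row in rows:
        live = int(row["docs.count"])       # live (searchable) documents
        deleted = int(row["docs.deleted"])  # documents marked as deleted
        total = live + deleted
        if deleted == 0 or total == 0:
            continue  # nothing left behind in this segment
        live_bytes = int(row["size"]) * live / total  # size pro-rated by live docs
        if live_bytes >= max_bytes / 2:
            stuck.append(row["segment"])
    return stuck


# Hypothetical sample rows: three 5 GiB segments with 10%, 60% and 0% deletions.
rows = [
    {"segment": "_a", "docs.count": "900", "docs.deleted": "100", "size": str(5 * GIB)},
    {"segment": "_b", "docs.count": "400", "docs.deleted": "600", "size": str(5 * GIB)},
    {"segment": "_c", "docs.count": "1000", "docs.deleted": "0", "size": str(5 * GIB)},
]
print(stuck_segments(rows))  # ['_a']
```

Under this model, segment `_a` keeps its 100 deleted documents on disk indefinitely, while `_b` has passed the 50% mark and becomes a merge candidate again.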
PROBLEM
• No obvious solution
• Uncertain whether it is a problem in practice until an EU court takes a position
WE DON’T KNOW WHAT DATA WE HAVE
NOW WHAT?
Elasticsearch should not be used as a primary data store!
… but many do it anyway!
COMMUNICATIONS
• Blog
• Elastic
• Lucene
COMMUNICATIONS Lucene 7.5 would be released in about a week (Thanks, Jan Høydal!)
COMMUNICATIONS Current ES version: 7.3 (July 2019). ES versions < 6.5 do not ship Lucene 7.5 and cannot be configured to the extent we need.
SOLUTION!
SOLUTION: UPGRADE + BLUE/GREEN DEPLOYMENT
ES >= 6.5 (LUCENE >= 7.5)
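As a minimal sketch of the version requirement above: the helper below checks whether a cluster's reported version is at least 6.5, and builds a settings body that lowers the tolerated share of deleted documents. The helper names are hypothetical, and the setting name `index.merge.policy.deletes_pct_allowed` is an assumption based on the merge policy option introduced with Lucene 7.5; verify it (and its allowed range) against the documentation for your Elasticsearch version before relying on it.

```python
MIN_VERSION = (6, 5)  # first ES release line with Lucene >= 7.5, per the slide


def parse_version(version):
    """Turn a version string such as '7.3.0' into a comparable tuple (7, 3)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def can_tune_deletes(version):
    """True if the given Elasticsearch version is 6.5 or newer."""
    return parse_version(version) >= MIN_VERSION


def merge_settings_body(deletes_pct_allowed=20.0):
    """Build a body for PUT /<index>/_settings (setting name is an assumption)."""
    return {"index": {"merge": {"policy": {"deletes_pct_allowed": deletes_pct_allowed}}}}


print(can_tune_deletes("6.4.3"))  # False
print(can_tune_deletes("7.3.0"))  # True
```

A lower percentage means more merge work in exchange for less lingering deleted data, so the value is a trade-off rather than a fix in itself.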
SOLUTION: UPGRADE + BLUE/GREEN DEPLOYMENT
ANOTHER SOURCE OF GROUND TRUTH
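One way to picture the blue/green approach: rebuild a fresh index from the other source of ground truth (which no longer contains the erased records), then atomically repoint an alias from the old index to the new one. The sketch below only builds the request body for Elasticsearch's `POST /_aliases` endpoint; the index and alias names are hypothetical, and the indexing step itself is left out.

```python
import json


def alias_swap_body(alias, old_index, new_index):
    """Build a POST /_aliases body that moves `alias` from old_index to new_index.

    Both actions are applied atomically, so searches against the alias never
    see a moment without a backing index.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names: 'customers' is the alias the application searches against.
body = alias_swap_body("customers", "customers-blue", "customers-green")
print(json.dumps(body, indent=2))
```

Once the alias points at the new index, deleting the old index removes its segments, including any documents that were only marked as deleted.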
SOLUTION Source: https://www.elastic.co/blog/signal-media-optimizing-for-more-elasticsearch-power-with-less-elasticsearch-cluster
ALTERNATIVE SOLUTION: PATCHING + WEEKLY JOB
Cron expungeDelete
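A weekly cleaning job of this kind could look roughly like the sketch below, which calls the force merge endpoint with `only_expunge_deletes=true` so that only segments containing deletions are rewritten. The base URL, index name, and function names are assumptions; authentication, TLS settings, and error handling are omitted.

```python
import urllib.parse
import urllib.request


def expunge_url(base_url, index):
    """Build the URL for POST /<index>/_forcemerge?only_expunge_deletes=true."""
    path = urllib.parse.quote(index, safe="") + "/_forcemerge"
    query = urllib.parse.urlencode({"only_expunge_deletes": "true"})
    return f"{base_url.rstrip('/')}/{path}?{query}"


def expunge_deletes(base_url, index, timeout=3600):
    """Send the force merge request and return the HTTP status code."""
    request = urllib.request.Request(expunge_url(base_url, index), method="POST")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Hypothetical cluster address and index name; the call itself is left commented out.
print(expunge_url("http://localhost:9200", "customers"))
# expunge_deletes("http://localhost:9200", "customers")
```

A crontab schedule such as `0 3 * * 0` would run the job every Sunday at 03:00. It is worth checking the merge settings as well, since an expunge may skip segments that contain only a small share of deletions.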
ONLY ELASTICSEARCH?
• Probably also affects Solr (and other comparable solutions)
• Kafka?
CONCLUSIONS
• DON’T use Elasticsearch as primary data store!
• If this strikes you as a particularly relevant risk:
  • Get legal advice
  • Upgrade your Elasticsearch version
  • Read up on configs
  • Read up on how to reindex in place (periodically)
  • Establish a cleaning job
  • … or encrypt (hard) and «throw away» the key
TLDR
• Lucene < 7.5 won't merge segments larger than 5GB (default) unless they accumulate 50% deletions.
• You should reindex from the primary data store periodically.
IN SUMMARY
ELASTICSEARCH: YOU KNOW, FOR SEARCH
CATCH ME OUTSIDE
• @EivindArvesen
• https://github.com/eivindarvesen
• https://eivindarvesen.com
Illustrations: Unsplash
THANK YOU!