Eivind Arvesen, JavaZone 2019
EIVIND ARVESEN
- Developer, architect
- M.Sc. in Computer Science
- Consultant at Bouvet since 2017
- Into application security, privacy, machine learning, web development
/ : @EivindArvesen
PERSONAL DATA IN APPEND-ONLY STORAGE
POSSIBLY NOT A GREAT IDEA

* CONTEXT *
Elasticsearch: an open-source, near-real-time distributed search engine with a REST API.

Elasticsearch should not be used as a primary data store!

Elasticsearch is great at search

… but it is not a database
PHILADELPHIA

ELASTICSEARCH IN DEPTH
Source: https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up

ELASTICSEARCH IN DEPTH
Source: https://github.com/exo-archives/exo-es-search

ELASTICSEARCH IN DEPTH
Source: https://www.elastic.co/blog/every-shard-deserves-a-home

ELASTICSEARCH IN DEPTH
Source: https://lucidworks.com/post/segment-merging-deleted-documents-optimize-may-bad/

ELASTICSEARCH IN DEPTH
A merge can be triggered manually, but this should only be done on old indices that are no longer in active use. It merges everything into a single segment, which then gets no further automatic optimization.

CONTEXT
« … BUT IT ISN’T ACTUALLY DELETED UNTIL A SEGMENT MERGE OCCURS »
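The quote above can be illustrated with a toy model. This is only a sketch of the idea (immutable segments, deletes recorded as markers, merges rewriting live documents); the names `Segment` and `ToyIndex` are made up for illustration and are not Lucene or Elasticsearch APIs.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """A batch of documents that is never rewritten, plus delete markers."""
    docs: dict                                   # doc_id -> stored document
    deleted: set = field(default_factory=set)    # ids marked as deleted


class ToyIndex:
    """Hypothetical stand-in for an append-only index."""

    def __init__(self):
        self.segments = []

    def add_segment(self, docs):
        self.segments.append(Segment(dict(docs)))

    def delete(self, doc_id):
        # A delete only records a marker; the stored document stays in place.
        for seg in self.segments:
            if doc_id in seg.docs:
                seg.deleted.add(doc_id)

    def search(self):
        # Searches hide documents that carry a delete marker.
        return {i: d for s in self.segments
                for i, d in s.docs.items() if i not in s.deleted}

    def stored(self):
        # Everything physically present, marked as deleted or not.
        return {i: d for s in self.segments for i, d in s.docs.items()}

    def merge(self):
        # A merge writes a new segment containing only live documents.
        self.segments = [Segment(self.search())]
```

After `delete(1)` the document is gone from `search()` but still present in `stored()`; it only disappears from `stored()` once `merge()` has run.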
GDPR

GDPR
ART. 17 - RIGHT TO ERASURE (« RIGHT TO BE FORGOTTEN »)
The data subject shall have the right to obtain from the controller the erasure of personal data concerning him or her without undue delay and the controller shall have the obligation to erase personal data without undue delay…

GDPR
« UNDUE DELAY »
Considered «about a month» (EU), or «thirty days» (Information Commissioner’s Office, UK)
WHAT IS « ERASURE »?
What does it mean to «erase»?

AFFECTED?
Depending upon
- Cluster architecture
- Difference in data between shards on different nodes
- Configuration (e.g. refresh interval)
- Merge settings*
- Whether a new search (via side effects) leads to a «flush», which in turn leads to a merge
one can at any given point in time be in possession of data that should be deleted.
AFFECTED?
…and when a segment reaches the maximum size (5 GB by default), it can only* be merged when it accumulates 50% deletions!
* Lucene < 7.5
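To make the rule above concrete, here is a deliberately simplified model of it. It encodes the statement on the slide (Lucene < 7.5, 5 GB default), not Lucene's actual merge policy code; the function names and the treatment of smaller segments as always mergeable are assumptions made for illustration.

```python
GIB = 1024 ** 3


def is_merge_candidate(segment_bytes, deleted_fraction, max_segment_bytes=5 * GIB):
    """Simplified rule from the slide, not the real merge policy."""
    if not 0.0 <= deleted_fraction <= 1.0:
        raise ValueError("deleted_fraction must be between 0 and 1")
    if segment_bytes < max_segment_bytes:
        # Assumption: segments below the maximum size can always be merged.
        return True
    # A max-size segment is only picked up once half of it is deleted.
    return deleted_fraction >= 0.5


def lingering_deleted_bytes(segment_bytes, deleted_fraction):
    """Bytes of deleted documents still on disk in a segment."""
    return int(segment_bytes * deleted_fraction)
```

Under this model a 5 GB segment with 49% deletions is never merged on its own, so roughly 2.45 GB of data that was meant to be deleted can stay on disk indefinitely.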
PROBLEM
- No obvious solution
- Uncertain whether it is a problem in practice until an EU court takes a position

WE DON’T KNOW WHAT DATA WE HAVE

NOW WHAT?
Elasticsearch should not be used as a primary data store!
… but many do it anyway!
COMMUNICATIONS
- Blog
- Elastic
- Lucene

COMMUNICATIONS
Lucene 7.5 would be released in about a week (Thanks, Jan Høydal!)

COMMUNICATIONS
Current ES version: 7.3 (July 2019)
ES versions < 6.5 do not include Lucene 7.5, and cannot be configured to the extent we need
SOLUTION!

SOLUTION
UPGRADE + BLUE/GREEN DEPLOYMENT
ES >= 6.5 (LUCENE >= 7.5)

SOLUTION
UPGRADE + BLUE/GREEN DEPLOYMENT
ANOTHER SOURCE OF GROUND TRUTH
Source: https://www.elastic.co/blog/signal-media-optimizing-for-more-elasticsearch-power-with-less-elasticsearch-cluster
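One way a blue/green switch might look, assuming clients query through an index alias: once the new ("green") index has been rebuilt from the source of ground truth, the alias is moved in a single request to the `_aliases` endpoint, and the old ("blue") index can then be deleted along with its segments. The index and alias names below are hypothetical.

```python
import json


def alias_swap_body(alias, old_index, new_index):
    """Request body for POST /_aliases that moves an alias in one request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


def alias_swap_json(alias, old_index, new_index):
    # Serialized form, ready to send as the request payload.
    return json.dumps(alias_swap_body(alias, old_index, new_index))
```

For example, `alias_swap_json("people", "people-blue", "people-green")` gives the payload that points the `people` alias at the freshly built index.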
ALTERNATIVE SOLUTION
PATCHING + WEEKLY JOB
Cron expungeDelete
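A weekly job along these lines could call the force merge endpoint with `only_expunge_deletes=true`. The sketch below only assumes a cluster reachable at `http://localhost:9200` and an index called `people`; both, as well as the script path in the crontab comment, are placeholders. Whether a given segment is actually rewritten still depends on the merge settings of the index.

```python
import urllib.parse
import urllib.request

# Hypothetical crontab entry, every Sunday at 03:00:
#   0 3 * * 0 /usr/bin/python3 /opt/jobs/expunge_deletes.py


def build_expunge_request(base_url, index):
    """Build POST <base_url>/<index>/_forcemerge?only_expunge_deletes=true."""
    query = urllib.parse.urlencode({"only_expunge_deletes": "true"})
    url = f"{base_url.rstrip('/')}/{urllib.parse.quote(index)}/_forcemerge?{query}"
    return urllib.request.Request(url, data=b"", method="POST")


def run(base_url="http://localhost:9200", index="people"):
    # Sends the request and returns the HTTP status code.
    with urllib.request.urlopen(build_expunge_request(base_url, index)) as resp:
        return resp.status
```

The job script would simply call `run()` and log the status code.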
ONLY ELASTICSEARCH?
- Probably also affects Solr (and other comparable solutions)
- Kafka?
CONCLUSIONS
- DON’T use Elasticsearch as primary data store!
- If this strikes you as a particularly relevant risk:
  - Get legal advice
  - Upgrade your Elasticsearch version
  - Read up on configs
  - Read up on how to reindex in place (periodically)
  - Establish a cleaning job
  - … or encrypt (hard) and «throw away» the key
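The last point, encrypting and «throwing away» the key, could look roughly like this. It is a toy illustration only: it uses a one-time pad from the standard library to stay self-contained, handles one record per data subject, and the `KeyStore` class is hypothetical. A real system would use a vetted cipher and a key store that can itself delete keys for good.

```python
import secrets


class KeyStore:
    """Hypothetical per-subject key store kept outside the search index."""

    def __init__(self):
        self._keys = {}

    def encrypt(self, subject_id, plaintext):
        # One random key per record, as long as the data (one-time pad).
        key = secrets.token_bytes(len(plaintext))
        self._keys[subject_id] = key
        return bytes(p ^ k for p, k in zip(plaintext, key))

    def decrypt(self, subject_id, ciphertext):
        key = self._keys[subject_id]  # raises KeyError once the key is gone
        return bytes(c ^ k for c, k in zip(ciphertext, key))

    def forget(self, subject_id):
        # "Throwing away" the key: leftover ciphertext becomes unreadable.
        self._keys.pop(subject_id, None)
```

Only the ciphertext would be indexed; after `forget("subject-1")`, any copy of it still sitting in an unmerged segment can no longer be decrypted.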
TLDR
- LUCENE < 7.5 WON'T MERGE SEGMENTS LARGER THAN 5 GB (DEFAULT) UNLESS THEY ACCUMULATE 50% DELETIONS.
- YOU SHOULD REINDEX FROM PRIMARY DATA STORE PERIODICALLY

IN SUMMARY
ELASTICSEARCH
YOU KNOW, FOR SEARCH

CATCH ME OUTSIDE
- @EivindArvesen
- https://github.com/eivindarvesen
- https://eivindarvesen.com
Illustrations: Unsplash