Data Management in Distributed Systems
Simon Schäffner
Advisor: Stefan Liebald
Technische Universität München, Fakultät für Informatik, Lehrstuhl für Netzarchitekturen und Netzdienste
Garching, 13 July 2018
Distributed Systems
“A collection of autonomous computing elements that appears to its users as a single coherent system” [1]
Geographical Dispersion?
Data Management in Distributed Systems
Input / output for most tasks
More complex than reading from / writing to an HDD
How is data organised?
What data is stored on which nodes?
Attributes of Data Management Strategies
Scalability
Performance
Consistency
Redundancy
Overhead
Attack Resistance
Comparison of Data Management Strategies
Peer-To-Peer Filesharing
Started out with Napster and Gnutella
Legally controversial usage
Interesting for completely legal use, as well
BitTorrent
[Figure sequence with labels: Web Server, .torrent]
.torrent
{
  announce: http://bttracker.debian.org:6969/announce,
  comment: "Debian CD from cdimage.debian.org",
  creation date: 1520682848,
  httpseeds: [
    https://cdimage.debian.org/cdimage/release/9.4.0//srv/cdbuilder.debian.org/dst/deb-cd/weekly-builds/amd64/iso-cd/debian-9.4.0-amd64-netinst.iso
  ],
  info: {
    length: 305135616,
    name: debian-9.4.0-amd64-netinst.iso,
    piece length: 262144,
    …
  }
}
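To illustrate how the `length` and `piece length` fields of the .torrent excerpt relate, the following Python sketch derives the number of pieces for the Debian image. It assumes the BitTorrent v1 metainfo layout, in which each piece is covered by one 20-byte SHA-1 digest; the function and variable names are hypothetical and not taken from the slides.

```python
# Values copied from the .torrent excerpt.
LENGTH = 305135616        # total file size in bytes
PIECE_LENGTH = 262144     # size of one piece in bytes (256 KiB)
SHA1_DIGEST_SIZE = 20     # assumption: v1 metainfo stores one SHA-1 digest per piece


def piece_count(length: int, piece_length: int) -> int:
    """Number of pieces needed to cover the file (integer ceiling division)."""
    return (length + piece_length - 1) // piece_length


def last_piece_size(length: int, piece_length: int) -> int:
    """Size of the final piece; it is shorter unless the file divides evenly."""
    remainder = length % piece_length
    return remainder if remainder else piece_length


pieces = piece_count(LENGTH, PIECE_LENGTH)
hash_bytes = pieces * SHA1_DIGEST_SIZE  # size of the checksum list in the metadata file
```

For this file the division is exact, so every piece, including the last one, is a full 256 KiB.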
[Figure sequence with labels: Tracker, http://bttracker.debian.org:6969/announce, Origin, .torrent]
BitTorrent: Scalability
[Figure. Source: [4]]
BitTorrent: Attributes (1)
Scalability
- Scales well with large #nodes
Performance
- Good upload utilisation
Consistency
- Content does not change
- Checksums in metadata file
Redundancy
- Max. redundancy possible
Overhead
- Metadata file
- 1/1000 of traffic to tracker [5]
- Metainformation sent to other nodes
BitTorrent: Attributes (2)
Attack Resistance
- Download of corrupt file very unlikely
- Poisoning attacks possible
− Uploading large amount of fake files / malware
− Flooding all peers with download requests
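The consistency claim ("Checksums in metadata file") and the low likelihood of downloading a corrupt file can be sketched as follows. This is a minimal illustration, assuming SHA-1 piece hashes as in BitTorrent v1; the helper names and the sample payload are hypothetical, and real clients do considerably more (requesting the piece again, banning peers, and so on).

```python
import hashlib


def split_pieces(data: bytes, piece_length: int) -> list[bytes]:
    """Cut the content into fixed-size pieces; the last one may be shorter."""
    return [data[i:i + piece_length] for i in range(0, len(data), piece_length)]


def piece_hashes(data: bytes, piece_length: int) -> list[bytes]:
    """Checksums the publisher would place in the metadata file."""
    return [hashlib.sha1(piece).digest() for piece in split_pieces(data, piece_length)]


def verify_piece(piece: bytes, index: int, expected: list[bytes]) -> bool:
    """A downloader accepts a piece only if its hash matches the metadata."""
    return hashlib.sha1(piece).digest() == expected[index]


original = b"example payload " * 64          # 1024 bytes of sample content
expected = piece_hashes(original, 256)        # four pieces of 256 bytes each
good_piece = original[256:512]                # piece with index 1
corrupt_piece = b"X" + good_piece[1:]         # same piece with one byte changed
```

A peer that sends `corrupt_piece` is detected immediately, because `verify_piece(corrupt_piece, 1, expected)` is false while the untouched piece verifies.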
BitTorrent
Data Update: no data update
Scalability: ++ upload util. indep. of #nodes
Performance: + upload util. very good
Consistency: no data updates; corrupt data identified
Redundancy: + high
Overhead: high, but aligns with goal
Attack Resistance: - poisoning attacks possible
Kademlia
Peer-to-peer distributed hash table (DHT)
Identifiers: 160-bit (both key and node IDs)
Keys stored on “close” nodes
d(x, y) = x ⊕ y
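The XOR metric and the idea of storing keys on "close" nodes can be sketched in a few lines of Python. This is only an illustration: deriving identifiers with SHA-1 is an assumption made here to obtain 160-bit values, and the function names are hypothetical.

```python
import hashlib

ID_BITS = 160


def make_id(name: str) -> int:
    """Derive a 160-bit identifier from a string (SHA-1 is an assumption)."""
    return int.from_bytes(hashlib.sha1(name.encode("utf-8")).digest(), "big")


def distance(x: int, y: int) -> int:
    """XOR metric d(x, y) = x XOR y, interpreted as an unsigned integer."""
    return x ^ y


def closest_nodes(key: int, node_ids: list[int], k: int) -> list[int]:
    """The k node IDs with the smallest XOR distance to the key."""
    return sorted(node_ids, key=lambda node_id: distance(node_id, key))[:k]


# Tiny 4-bit example so the distances can be checked by hand.
nodes = [0b0001, 0b0100, 0b0111, 0b1100]
responsible = closest_nodes(0b0101, nodes, k=2)
```

With key 0101 the distances are 4, 1, 2 and 9, so the two responsible nodes are 0100 and 0111.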
Kademlia
Data Update: active republication
Scalability: + storage scales linearly with #nodes; lookup time scales with O(log_2 n)
Performance: ++ parallel redundant requests, efficient caching
Consistency: +/0 deleted information max. 24h in system
Redundancy: + <key, value> pairs stored in >=k nodes
Overhead: optimized process republishing <key, value> pairs every hour
Attack Resistance: + protection against node flooding
Content Delivery Networks
First idea: proxies for caching static content
Internet now very dynamic, hit rates for proxies low (25-40%) [9]
Content Delivery Networks (CDNs) more elaborate
[Figure: content delivery network. Source: [10]]
Akamai
>240,000 servers
>130 countries
Within >1,700 networks [10]
Handles flash crowds by allocating more servers to the sites that need them at the moment
Server nearby for low latency and low packet loss
Akamai: Tiered Distribution
[Figure. Source: [11]]
Akamai: Attributes (1)
Scalability
- Large amount of data available at high speed within network
Performance
- Overlay network (speed improvements of up to 30-50%)
- Border Gateway Protocol (BGP) routes are sometimes not optimal
- Increased reliability by offering alternate routes
- Reduced packet loss by sending packets over parallel routes
- Forward error correction techniques
- Transport protocol optimisations over TCP
- Application level optimisations (content compression, application logic on edge servers)
Consistency
- Standard techniques for caching (TTLs, versioned URLs)
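The two standard caching techniques named under Consistency, TTLs and versioned URLs, can be sketched generically as follows. This is not Akamai's implementation; the class, the `?v=` query parameter and the use of a SHA-256 prefix as version are assumptions chosen for illustration.

```python
import hashlib
import time


def versioned_url(path: str, content: bytes) -> str:
    """Embed a short content hash in the URL so changed content gets a new URL."""
    digest = hashlib.sha256(content).hexdigest()[:8]
    return f"{path}?v={digest}"


class TtlCache:
    """Cache whose entries expire a fixed number of seconds after insertion."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so expiry can be simulated
        self.entries = {}           # key -> (expiry time, value)

    def put(self, key, value) -> None:
        self.entries[key] = (self.clock() + self.ttl, value)

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            # Stale entry: drop it so the caller refetches from the origin.
            del self.entries[key]
            return None
        return value
```

With a TTL, a cache may serve stale content until the entry expires; with versioned URLs, an updated object is requested under a new URL, so old cache entries are simply never asked for again.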
Akamai: Attributes (2)
Redundancy
- Dependent on customer’s needs
- Tiered distribution provides balance between redundancy and fast availability
Overhead
- Proprietary overlay network
- Claims improvements over TCP (reduced setup & teardown time per connection)
Attack Resistance
- No attackers within closed network
- Engineered for high failure rate of network and equipment
Akamai
Highly distributed nature fundamental to high performance
Entire communication within overlay network optimised; the two small hops on either end should not matter
Data Update: dep. on customer
Scalability: ++ >240,000 nodes
Performance: ++ optimised transfer speed in network
Consistency: based on TTL/versioned URLs, but dep. on customer
Redundancy: ++ tiered distribution allows for any degree
Overhead: improvements over TCP, dep. on customer
Attack Resistance: ++ only attacks from outside possible, high recovery rate
Distributed Databases
Sharding: partition data over many servers
Redundancy
Online Transaction Processing (OLTP): each query may be handled by a different node
Online Analytical Processing (OLAP): network can work together on a single query
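The difference between the two query styles on sharded data can be sketched as below. This is a toy in-memory model under stated assumptions: shards are plain dictionaries, the shard is chosen by a SHA-256 hash modulo the shard count, and all names are hypothetical.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index with a stable hash (assumed scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class ShardedStore:
    def __init__(self, shard_count: int):
        # Each dict stands in for one database server.
        self.shards = [dict() for _ in range(shard_count)]

    def put(self, key: str, value: int) -> None:
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str) -> int:
        # OLTP-style point query: exactly one shard is involved.
        return self.shards[shard_for(key, len(self.shards))][key]

    def total(self) -> int:
        # OLAP-style query: every shard computes a partial result,
        # and the partial results are merged afterwards.
        partial_sums = [sum(shard.values()) for shard in self.shards]
        return sum(partial_sums)


store = ShardedStore(shard_count=4)
for i in range(100):
    store.put(f"order-{i}", i)
```

`store.get("order-42")` touches a single shard, whereas `store.total()` needs a partial sum from all four.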
CouchDB
Document storage
NoSQL database
- Non-relational
- Not only SQL (SQL-like query languages supported)
- Usually worse consistency, better
Documents: number of fields and attachments
Documents are versioned
Use case: multiple offline clients, synchronise upon reconnection
CouchDB: Attributes (1)
Scalability
- Data lookup restricted to keys
- Data can be partitioned, but nodes can still be queried individually
Performance
- Storage on nodes: B-tree
− Lookup by key: O(log N)
− Lookup by key range: O(log N + K)
- Multiversion Concurrency Control (MVCC)
− Queries see state of database at beginning of query for their whole lifetime
− Queries can run fully in parallel (as long as they do not write to the same dataset)
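The two lookup costs, O(log N) for a single key and O(log N + K) for a range of K results, can be illustrated with any ordered index. The sketch below uses a sorted list with binary search as a stand-in for a B-tree, which is an assumption for brevity (Python's standard library has no B-tree); the class and method names are hypothetical.

```python
from bisect import bisect_left, bisect_right


class SortedIndex:
    """Ordered key index; binary search plays the role of the B-tree descent."""

    def __init__(self, items: dict[str, str]):
        self.keys = sorted(items)
        self.values = [items[key] for key in self.keys]

    def lookup(self, key: str):
        # O(log N): one binary search locates the key or its insertion point.
        i = bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None

    def key_range(self, start: str, end: str) -> list[tuple[str, str]]:
        # O(log N + K): two binary searches, then K neighbouring entries are read.
        lo = bisect_left(self.keys, start)
        hi = bisect_right(self.keys, end)
        return list(zip(self.keys[lo:hi], self.values[lo:hi]))


index = SortedIndex({"doc3": "c", "doc1": "a", "doc2": "b", "doc5": "e"})
```

Because the keys are kept in order, a range query only has to find the first matching key and can then read its neighbours sequentially.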
CouchDB: Attributes (2)
Consistency
- Every node is ACID compliant
− Atomicity
− Consistency
− Isolation
− Durability
- Edit conflict triggered when two clients try to edit the same document
- Edit conflict also triggered when offline client reconnects and a document was edited both on the node and in the network
− Each node deterministically decides which version wins
− Losing versions still stored and replicated
− Every node sees conflict, can resolve it
Redundancy
- Fully configurable: able to run on single node, full replication supported, partitioning supported, …
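The point that each node deterministically decides which version wins can be sketched as follows. The revision strings of the form `<generation>-<hash>` are modelled loosely on CouchDB's revision identifiers, but the tie-breaking rule used here (highest generation, then highest hash string) is a simplifying assumption for illustration, not a statement of CouchDB's exact algorithm.

```python
def parse_rev(rev: str) -> tuple[int, str]:
    """Split '3-abc' into (3, 'abc') so revisions can be compared."""
    generation, _, digest = rev.partition("-")
    return int(generation), digest


def pick_winner(revisions: list[str]) -> str:
    """Same input set gives the same winner on every node, without coordination."""
    return max(revisions, key=parse_rev)


def resolve(revisions: list[str]) -> tuple[str, list[str]]:
    """Return the winner plus the losing revisions, which stay stored."""
    winner = pick_winner(revisions)
    losers = sorted(rev for rev in revisions if rev != winner)
    return winner, losers


winner, losers = resolve(["2-abc", "3-aaa", "2-zzz"])
```

Since the rule depends only on the set of conflicting revisions, two nodes that received them in a different order still agree on the winner, and the losers remain available for an application to merge later.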
CouchDB: Attributes (3)
Overhead
- Depends on replication
Attack Resistance
- Simple authentication model by default
- Allows for custom JavaScript function for update validation
CouchDB
Data Update: active
Scalability: ++ data lookup by key only; data can be partitioned over many nodes
Performance: ++ key lookup locally O(log N); MVCC for parallel reading queries
Consistency: ++ eventual consistency; graceful conflict handling
Redundancy: ++ dep. on use-case
Overhead: + dep. on redundancy needed
Attack Resistance: + simple auth. model by default, can be customised
Summary: Performance
Summary: Data Update
Summary: Data Allocation
Summary: Conflict Handling
Conclusion
Trend: active replication
- More network overhead
- Decreases time of system being inconsistent
Currently all systems are popular (except for Zatara)
Number of distributed systems will probably only ever increase
Bibliography
[1] M. van Steen and A. S. Tanenbaum. A brief introduction to distributed systems. Computing, 98(10):967–1009, Oct 2016.
[2] J. Pouwelse, P. Garbacki, D. Epema, and H. Sips. The BitTorrent P2P file-sharing system: Measurements and analysis. In M. Castro and R. van Renesse, editors, Peer-to-Peer Systems IV, pages 205–216, Berlin, Heidelberg, 2005. Springer Berlin Heidelberg.
[3] Z. Zhang, Y. Li, Y. Chen, P. Cao, B. Deng, and X. Li. Understand the unfairness of BitTorrent. Dec 2010.
[4] A. Bharambe, C. Herley, and V. N. Padmanabhan. Analyzing and improving BitTorrent performance. Jan 2006.
[5] B. Cohen. Incentives build robustness in BitTorrent. http://www.bittorrent.org/bittorrentecon.pdf, May 2003. Last accessed: 27.05.2018 14:51.
[6] P. Maymounkov and D. Mazières. Kademlia: A peer-to-peer information system based on the XOR metric. In P. Druschel, F. Kaashoek, and A. Rowstron, editors, Peer-to-Peer Systems, pages 53–65, Berlin, Heidelberg, 2002. Springer Berlin Heidelberg.
[7] A. Loewenstern and A. Norberg. DHT protocol. http://www.bittorrent.org/beps/bep_0005.html, Jan 2008. Last accessed: 02.06.2018 15:11.
[8] vbuterin and J. Ray. Kademlia peer selection. https://github.com/ethereum/wiki/wiki/Kademlia-Peer-Selection, Oct 2015. Last accessed: 02.06.2018 15:13 (revision ea47c31).
[9] J. Dilley, B. Maggs, J. Parikh, H. Prokop, R. Sitaraman, and B. Weihl. Globally distributed content delivery. IEEE Internet Computing, 6(5):50–58, Sep 2002.
[10] Facts & figures. https://www.akamai.com/us/en/about/facts-figures.jsp. Last accessed: 27.05.2018 20:47.
[11] E. Nygren, R. K. Sitaraman, and J. Sun. The Akamai network: A platform for high-performance internet applications. SIGOPS Oper. Syst. Rev., 44(3):2–19, Aug. 2010.
[12] B. Carstoiu and D. Carstoiu. High performance eventually consistent distributed database Zatara. In INC2010: 6th International Conference on Networked Computing, pages 1–6, May 2010.
Any Questions?
Thank you for your attention
Kademlia
d(x, y) = x ⊕ y
[Figure. Source: [6]]
Kademlia: Finding Another Node
[Figure sequence. Source: [6]]
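One ingredient of finding another node is how a node files its contacts by XOR distance. In the Kademlia paper [6], a node keeps one contact list (k-bucket) per distance range from 2^i up to 2^(i+1). The sketch below computes that bucket index; the function names are hypothetical, the bucket size limit k is left out, and the tiny 4-bit IDs are used only so the result can be checked by hand.

```python
from collections import defaultdict

ID_BITS = 160


def bucket_index(own_id: int, other_id: int) -> int:
    """Index i such that 2**i <= d(own_id, other_id) < 2**(i + 1)."""
    distance = own_id ^ other_id
    if distance == 0:
        raise ValueError("a node does not keep itself as a contact")
    return distance.bit_length() - 1


def build_buckets(own_id: int, contacts: list[int]) -> dict[int, list[int]]:
    """Group contacts by bucket index (the per-bucket size limit k is ignored here)."""
    buckets = defaultdict(list)
    for contact in contacts:
        buckets[bucket_index(own_id, contact)].append(contact)
    return dict(buckets)


# 4-bit example: own ID 0101, contacts at XOR distances 1, 2, 4, 9 and 10.
buckets = build_buckets(0b0101, [0b0100, 0b0111, 0b0001, 0b1100, 0b1111])
```

Buckets with a low index cover only a few possible IDs close to the node, while the highest bucket covers half of the identifier space, which is why each lookup step can roughly halve the remaining distance.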
Kademlia: Attributes
Scalability
- Storage scales linearly with #nodes
- #nodes to contact for lookup of a value scales with O(log_2 n)
Performance
- Parallel, redundant requests
- Caching reduces likelihood of hotspots
Consistency
- Original publisher has to republish every 24h
- Each of the k nodes the <key,value> pair is stored on has to republish every 1h
- Caching: TTL inversely proportional to d(x, y) = x ⊕ y
Redundancy
- Every <key,value> pair is stored on k nodes
Overhead
- Optimised republishing (do not also republish for 1h when republish is received)
- Caching limited to short timespans
Attack Resistance
- Protected against node flooding
Kademlia: Caching
[Figure with labels: Node with value, STORE. Source: [6]]
Kademlia
Now used in
- BitTorrent [7]
- Ethereum [8]
[Figure. Source: [7]]
[Figure. Source: [8]]
Zatara
First general use-case NoSQL database
Eventually consistent
Built to satisfy needs of modern cloud applications
Zatara: Key Management
Key types
- Cache only: stored in memory on single node, not replicated
- Persistent: eventually consistent, stored on disk, replicated
Each key mapped to single node by hashing algorithm
Client connects directly to that node to store the key
Zatara: Node Management
[Figure. Source: [12]]
Zatara: Attributes
Scalability
- Regional, asynchronous replication
Performance
- Keys cached in memory
- Higher TTL for more popular keys
- Evaluation on 196 Amazon EC2 instances [12]
[Figure: operations/sec for SETKEY, GETKEY, PUSHRKEY and PUSHLKEY; 196 nodes with 196 groups vs. 196 nodes with 98 groups]
Consistency
- Eventual consistency: guaranteed consistency after consistency window has closed
Redundancy
- Key stored on >=2 nodes
Overhead
- Asynchronous replication reduces overhead (group sizes of 2-4 recommended)
Attack Resistance
- Simple authentication mechanism (only trusted clients should have access)
Zatara
Data Update: active
Scalability: ++ asynchr. data repl.; large #groups possible; O(1) key lookup
Performance: ++ in-memory caching (least recently used)
Consistency: + eventual consistency
Redundancy: + keys are guaranteed to be stored on >=2 nodes
Overhead: regional replication
Attack Resistance: simple authentication
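The least-recently-used in-memory caching mentioned for Zatara can be sketched generically as below. This is not Zatara's code; the class name, the capacity parameter and the use of Python's `OrderedDict` are assumptions for illustration.

```python
from collections import OrderedDict


class LruCache:
    """Fixed-capacity cache that evicts the least recently used key."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()  # oldest entry first, newest last

    def get(self, key):
        if key not in self.entries:
            return None
        # Reading a key marks it as most recently used.
        self.entries.move_to_end(key)
        return self.entries[key]

    def set(self, key, value) -> None:
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            # Drop the entry that has gone unused for the longest time.
            self.entries.popitem(last=False)


cache = LruCache(capacity=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")       # "a" is now more recent than "b"
cache.set("c", 3)    # exceeds capacity, so "b" is evicted
```

Keys that are read often keep moving to the recent end and therefore stay in memory, which matches the goal of keeping popular keys cached for longer.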