Systems & Research Project
- Prof. Manos Athanassoulis
http://manos.athanassoulis.net/classes/CS591
CS 591: Data Systems Architectures
data systems: >$200B by 2020, growing at 11.7% every year [The Forbes, 2016]
complex analytics
simple queries
access data
store, maintain, update
*algorithms and data structures for organizing and accessing data
how to access data? how to store data? how to update data?
yes! {<primary_key>, <rest_of_the_row>}
example: { student_id, { name, login, yob, gpa } }
what is the caveat?
how to index these attributes?
index: { name, { student_id } }
index: { yob, { student_id1, student_id2, … } }
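The key-value layout and the two indexes above can be sketched in a few lines of Python. The student records and the helper name `build_index` are made up for illustration; this is one possible way to model the idea, not a prescribed design.

```python
from collections import defaultdict

# Primary store: {primary_key: rest_of_the_row}. The records are hypothetical.
students = {
    1: {"name": "alice", "login": "al", "yob": 1998, "gpa": 3.7},
    2: {"name": "bob", "login": "bo", "yob": 1999, "gpa": 3.1},
    3: {"name": "carol", "login": "ca", "yob": 1998, "gpa": 3.9},
}

def build_index(store, attribute):
    """Map each attribute value to the set of primary keys that carry it."""
    index = defaultdict(set)
    for key, row in store.items():
        index[row[attribute]].add(key)
    return index

name_index = build_index(students, "name")  # { name, { student_id } }
yob_index = build_index(students, "yob")    # { yob, { student_id1, student_id2, ... } }

# A lookup on a non-key attribute goes through the index, then the primary store.
born_1998 = sorted(students[k]["name"] for k in yob_index[1998])
```

Each index has to be kept in sync with the primary store on every put, update, and delete.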
basic interface
put(k,v)
{v} = get(k)
{v1, v2, …} = get(k)
{v1, v2, …} = get_range(kmin, kmax)
c = count(kmin, kmax)
{v1, v2, …} = full_scan()
deletes: delete(k)
updates: update(k,v) (is it different than put?)
get set: {v1, v2, …} = get_set(k1, k2, …)
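As a rough illustration of the interface above, here is a toy store backed by two parallel sorted lists. The class name `SortedKV`, the single-value `get`, and the inclusive range bounds are assumptions of this sketch; the slides only name the operations.

```python
import bisect

class SortedKV:
    """Toy key-value store over parallel sorted lists (illustrative only)."""

    def __init__(self):
        self.keys = []
        self.vals = []

    def _find(self, k):
        i = bisect.bisect_left(self.keys, k)
        found = i < len(self.keys) and self.keys[i] == k
        return i, found

    def put(self, k, v):
        i, found = self._find(k)
        if found:
            self.vals[i] = v          # in this sketch, put doubles as update
        else:
            self.keys.insert(i, k)
            self.vals.insert(i, v)

    def get(self, k):
        i, found = self._find(k)
        return self.vals[i] if found else None

    def delete(self, k):
        i, found = self._find(k)
        if found:
            del self.keys[i]
            del self.vals[i]

    def get_range(self, kmin, kmax):
        # Assumed semantics: both bounds are inclusive.
        lo = bisect.bisect_left(self.keys, kmin)
        hi = bisect.bisect_right(self.keys, kmax)
        return self.vals[lo:hi]

    def count(self, kmin, kmax):
        return bisect.bisect_right(self.keys, kmax) - bisect.bisect_left(self.keys, kmin)

    def get_set(self, *ks):
        return [self.get(k) for k in ks]

    def full_scan(self):
        return list(self.vals)

kv = SortedKV()
for k in (5, 1, 3):
    kv.put(k, f"v{k}")
```

Keeping the keys sorted makes point and range reads cheap, at the price of shifting list elements on every insert.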
if we have only put operations (or updates): append
simply append, avoid resorting after every update
if we mostly have get operations (point or range): sort
sort data, amortize sorting cost
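The append-versus-sort tradeoff can be made concrete with a small sketch. `AppendLog`, `sort_once`, and `sorted_get` are hypothetical names used only here.

```python
import bisect

class AppendLog:
    """put is a cheap append; get scans backwards so the newest version wins."""

    def __init__(self):
        self.entries = []

    def put(self, k, v):
        self.entries.append((k, v))

    def get(self, k):
        for key, val in reversed(self.entries):
            if key == k:
                return val
        return None

def sort_once(entries):
    """Pay the sorting cost once, keeping only the newest value per key."""
    latest = dict(entries)            # later entries overwrite earlier ones
    return sorted(latest.items())

def sorted_get(run, k):
    """Binary search over a sorted list of (key, value) pairs."""
    i = bisect.bisect_left(run, (k,))  # (k,) sorts just before any (k, value)
    if i < len(run) and run[i][0] == k:
        return run[i][1]
    return None

log = AppendLog()
for k, v in [(2, "a"), (1, "b"), (2, "c")]:
    log.put(k, v)
run = sort_once(log.entries)
```

The log never reorders data on a write but reads scan everything; the sorted run answers reads by binary search once the one-time sort has been paid.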
memory: buffer (updates)
storage: levels
sort & flush runs from the buffer to storage
sort-merge runs
levels of exponentially increasing sizes
$O(\log N)$ levels
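A minimal sketch of the buffer-and-levels structure described above, assuming one sorted run per level and a fixed size ratio between consecutive levels. The class name `TinyLSM` and its parameters are hypothetical, and deletes are left out.

```python
import bisect

class TinyLSM:
    """Toy LSM-tree: an in-memory buffer plus one sorted run per storage level."""

    def __init__(self, buffer_capacity=2, size_ratio=2):
        self.buffer = {}
        self.buffer_capacity = buffer_capacity
        self.size_ratio = size_ratio
        self.levels = []              # levels[i] is a sorted list of (key, value)

    def _capacity(self, i):
        # Level sizes grow exponentially with the level number.
        return self.buffer_capacity * self.size_ratio ** (i + 1)

    def put(self, k, v):
        self.buffer[k] = v
        if len(self.buffer) >= self.buffer_capacity:
            self._flush()

    def _flush(self):
        run = sorted(self.buffer.items())      # sort & flush
        self.buffer = {}
        i = 0
        while True:
            if i == len(self.levels):
                self.levels.append([])
            merged = dict(self.levels[i])      # older data
            merged.update(run)                 # newer data wins
            run = sorted(merged.items())       # sort-merge
            if len(run) <= self._capacity(i):
                self.levels[i] = run
                return
            self.levels[i] = []                # level overflows: push the run down
            i += 1

    def get(self, k):
        if k in self.buffer:
            return self.buffer[k]
        for run in self.levels:                # shallower levels hold newer data
            i = bisect.bisect_left(run, (k,))
            if i < len(run) and run[i][0] == k:
                return run[i][1]
        return None

lsm = TinyLSM(buffer_capacity=2, size_ratio=2)
for k in range(10):
    lsm.put(k, k * 10)
```

With these settings the ten inserts leave four entries in the first level and six in the second, and lookups probe the buffer first and then each level in turn.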
lookup X
memory: buffer, Bloom filters, fence pointers
storage: levels
Bloom filter outcomes: true negative, false positive, true positive
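The lookup path through a Bloom filter and fence pointers might look as follows for a single sorted run. The filter size, the SHA-256 based hashing, the page size, and the class names are all assumptions of this sketch rather than anything the slides specify.

```python
import bisect
import hashlib

class BloomFilter:
    """Small Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all((self.bits >> p) & 1 for p in self._positions(key))

class Run:
    """Sorted run split into pages, with fence pointers and a Bloom filter in memory."""

    def __init__(self, sorted_items, page_size=2):
        self.pages = [sorted_items[i:i + page_size]
                      for i in range(0, len(sorted_items), page_size)]
        self.fences = [page[0][0] for page in self.pages]  # smallest key per page
        self.bloom = BloomFilter()
        for k, _ in sorted_items:
            self.bloom.add(k)
        self.page_reads = 0

    def get(self, k):
        if not self.bloom.might_contain(k):
            return None                    # true negative: storage is not touched
        i = bisect.bisect_right(self.fences, k) - 1
        if i < 0:
            return None                    # k is smaller than every key in the run
        self.page_reads += 1               # exactly one page is read
        for key, val in self.pages[i]:
            if key == k:
                return val                 # true positive
        return None                        # false positive

run = Run([(1, "a"), (3, "b"), (5, "c"), (7, "d")])
hit = run.get(5)
reads_after_hit = run.page_reads
```

The Bloom filter decides whether the run is worth visiting at all, and the fence pointers narrow a visit down to a single page.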
T runs per level
sort & flush runs; merge & flush
each level is T times bigger
tuning knobs: merge policy, size ratio
lookup cost:
T runs per level: $O(T \cdot \log_T N \cdot e^{-M/N})$ (runs per level, levels, false positive rate)
1 run per level: $O(\log_T N \cdot e^{-M/N})$ (levels, false positive rate)

update cost:
T runs per level: $O(\log_T N)$ (levels)
1 run per level: $O(T \cdot \log_T N)$ (levels, merges per level)

where the two coincide:
lookup cost: $O(\log_T N \cdot e^{-M/N}) = O(\log_T N \cdot e^{-M/N})$
update cost: $O(\log_T N) = O(\log_T N)$

with size ratio N:
update cost, $O(N)$ runs per level: $O(\log_N N) = O(1)$
update cost, 1 run per level: $O(N \cdot \log_N N) = O(N)$

[figure labels: T=2 and a second, truncated "T=" label]
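To get a feel for these expressions, the sketch below evaluates them numerically with constants dropped. It assumes N counts entries and M is the filter memory in a unit that makes M/N meaningful, which the slides do not spell out; the function name `costs` is hypothetical. In the LSM literature the two merge policies are commonly called tiering (T runs per level) and leveling (1 run per level).

```python
import math

def costs(N, T, M):
    """Evaluate the asymptotic cost expressions for both merge policies."""
    num_levels = math.log(N, T)          # log_T N levels
    fpr = math.exp(-M / N)               # false positive rate term e^(-M/N)
    return {
        "T runs per level": {
            "lookup": T * num_levels * fpr,   # runs per level * levels * fpr
            "update": num_levels,             # levels
        },
        "1 run per level": {
            "lookup": num_levels * fpr,       # levels * fpr
            "update": T * num_levels,         # levels * merges per level
        },
    }

c = costs(N=2**20, T=4, M=10 * 2**20)
extreme = costs(N=1000, T=1000, M=1000)  # size ratio equal to N
```

For any T the two policies differ by a factor of T in each direction: one pays it on lookups, the other on updates. Setting T = N gives update costs of 1 and N, matching the size-ratio-N case.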
how to do range scans?
how to delete?
how to delete quickly?
what if data items come ordered?
how to allocate memory between buffer/Bloom filters/fence pointers?
what if data items come almost ordered?
study these questions and navigate the LSM design space using Facebook’s RocksDB
Research question on sortedness
if we know the workload …
LSM-Trees: memory (Buffer/BF/FP) – what about caching?
Back to column-stores: do we need to sort? partition the data? add empty slots in the column for future inserts?
find Tuning, s.t. min cost(Workload, Data, Tuning), given Workload and Data
what if workload information is a bit wrong? robust optimization (come and find me)
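One naive way to read the optimization statement is an exhaustive search over candidate tunings. Everything below is an assumption made for illustration: the cost function (a workload-weighted sum of asymptotic LSM lookup and update costs), the dictionary layout of Workload and Data, and the candidate size ratios 2 to 16.

```python
import math

POLICIES = ("T runs per level", "1 run per level")

def cost(workload, data, tuning):
    """Hypothetical cost model: weighted sum of asymptotic lookup and update costs."""
    N, M = data["N"], data["M"]
    T, policy = tuning
    num_levels = math.log(N, T)
    fpr = math.exp(-M / N)
    if policy == "T runs per level":
        lookup, update = T * num_levels * fpr, num_levels
    else:
        lookup, update = num_levels * fpr, T * num_levels
    return workload["lookups"] * lookup + workload["updates"] * update

def best_tuning(workload, data, size_ratios=range(2, 17)):
    """Return the (size ratio, merge policy) pair with the smallest modeled cost."""
    candidates = [(T, p) for T in size_ratios for p in POLICIES]
    return min(candidates, key=lambda t: cost(workload, data, t))

data = {"N": 10**6, "M": 5 * 10**6}
update_only = best_tuning({"lookups": 0.0, "updates": 1.0}, data)
lookup_only = best_tuning({"lookups": 1.0, "updates": 0.0}, data)
```

Under this model an update-only workload picks many runs per level and a lookup-only workload picks a single run per level. If the workload weights are slightly off, the chosen tuning may no longer be the best one.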
$\text{position}(val) = \text{CDF}(val) \cdot \text{array\_size}$
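The formula can be tried out on a sorted array with a deliberately crude CDF model. The straight-line CDF, the `max_error` search window, and all function names are assumptions of this sketch; a real learned index would fit a better model and track its own error bound.

```python
import bisect

def fit_linear_cdf(sorted_keys):
    """Assumed model: the CDF is a straight line between the smallest and largest key."""
    lo, hi = sorted_keys[0], sorted_keys[-1]

    def cdf(val):
        if hi == lo:
            return 0.0
        return min(1.0, max(0.0, (val - lo) / (hi - lo)))

    return cdf

def predict_position(cdf, val, array_size):
    """position(val) = CDF(val) * array_size, clamped to a valid index."""
    return min(array_size - 1, int(cdf(val) * array_size))

def lookup(sorted_keys, cdf, val, max_error):
    """Search only a small window around the predicted position."""
    guess = predict_position(cdf, val, len(sorted_keys))
    lo = max(0, guess - max_error)
    hi = min(len(sorted_keys), guess + max_error + 1)
    i = bisect.bisect_left(sorted_keys, val, lo, hi)
    if i < len(sorted_keys) and sorted_keys[i] == val:
        return i
    return None

keys = list(range(0, 1000, 10))   # evenly spaced keys suit the linear model
cdf = fit_linear_cdf(keys)
pos = lookup(keys, cdf, 500, max_error=1)
```

On evenly spaced keys the prediction is essentially exact, so a window of one slot on each side is enough. Skewed data would need a better CDF model or a wider window.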
can you learn the CDF? what is the best way to do so? how to update that?
systems project
form groups of 2 (speak to me in OH if you want to work on your own)
research project
form groups of 3-4
pick one of the subjects & read background material
define the behavior you will study and address
sketch approach and success metric
(if LSM-related get familiar with RocksDB)