Sep 20, 2023

SLIDE 1

Caching for Data Intensive Scientific Repositories

Ani Thakar, Dan Wang, Tanu Malik, Philip Little, Amitabh Chaudhary

SLIDE 2

Scientific repositories can have a large “network footprint”

[Diagram: Data Repository, Telescope, Network]

Pan-STARRS is expected to service over 10 TB of query results each day. LSST will be 150 times the size of Pan-STARRS.

SLIDE 3

Well-designed proxy caches can help reduce the network footprint

[Diagram: Data Repository, Telescope, Network]

In simulations on SDSS (static data), traffic was reduced to one-fifth.

SLIDE 4

Well-designed proxy caches are hard to design

[Diagram: Data Repository, Telescope, Network]

Three challenges —

How do we adaptively choose the best objects to cache?
How do we process queries on transient objects?
How do we move large data objects?

SLIDE 5

Cache objects have varying sizes and varying load costs

[Diagram: objects A, E, B, F; Network; objects C, D, G, H, I]

Objects can be relations, columns, horizontal partitions, vertical partitions, etc.

SLIDE 6

Caching decisions are not limited to loading and evicting objects

[Diagram: objects A, E, B, F; Network; objects C, D, G, H, I]

Q1: SELECT MAX(A.x) FROM A, B WHERE A.x = B.y

Q2: SELECT G.z FROM A, G WHERE A.x ≤ G.z

U1: INSERT INTO A VALUES (2.3, 30, ...), (4.5, 25, ...)

U2: UPDATE G SET G.z = G.z+3.34 WHERE G.z ≤ 9.6

Three types of data communication —

Query shipping
Object loading
Update shipping
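A minimal sketch of how each of the three communication types could contribute to network traffic. The cost model is an assumption made for illustration, not something stated on the slides: costs are in bytes, a query is free only when every object it reads is cached, and updates are shipped only for cached objects. All class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Query:
    objects: frozenset   # objects the query reads
    result_bytes: int    # size of the result if the query is shipped

@dataclass(frozen=True)
class Update:
    obj: str             # object the update modifies
    update_bytes: int    # size of the update if it is shipped to the cache

def query_traffic(q, cached):
    """Query shipping: pay for the result unless every object is cached."""
    return 0 if q.objects <= cached else q.result_bytes

def load_traffic(obj, sizes):
    """Object loading: pay the full size of the object once."""
    return sizes[obj]

def update_traffic(u, cached):
    """Update shipping: pay only for updates to objects held in the cache."""
    return u.update_bytes if u.obj in cached else 0

# Hypothetical object sizes and events, loosely modeled on Q1 and U2.
sizes = {"A": 500, "B": 300, "G": 800}
q1 = Query(frozenset({"A", "B"}), result_bytes=8)
u2 = Update("G", update_bytes=40)
```

Under this model, caching an object trades a one-time load plus ongoing update traffic against the query results that no longer cross the network.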

SLIDE 7

Query shipping is for answering queries without using the cache contents

[Diagram: objects A, E, B, F; Network; objects C, D, G, H, I]

Q1: SELECT MAX(A.x) FROM A, B WHERE A.x = B.y

Q2: SELECT G.z FROM A, G WHERE A.x ≤ G.z

U1: INSERT INTO A VALUES (2.3, 30, ...), (4.5, 25, ...)

U2: UPDATE G SET G.z = G.z+3.34 WHERE G.z ≤ 9.6

Q1 Result: 4.5

Q2 Result: 2.3, 4.5, 7.9, 2.1, ....

SLIDE 8

Loading is for moving frequently accessed objects

[Diagram: objects A, E, B, F, G; Network; objects C, D, G, H, I, A, B]

Q1: SELECT MAX(A.x) FROM A, B WHERE A.x = B.y

Q2: SELECT G.z FROM A, G WHERE A.x ≤ G.z

U1: INSERT INTO A VALUES (2.3, 30, ...), (4.5, 25, ...)

U2: UPDATE G SET G.z = G.z+3.34 WHERE G.z ≤ 9.6

This may require evicting other objects from the cache.
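A small sketch of loading with eviction under a fixed cache capacity. The slides do not say which objects get evicted; least-recently-used order is assumed here purely as a placeholder policy, and the class name, sizes, and capacity are hypothetical.

```python
from collections import OrderedDict

class ObjectCache:
    """Size-aware cache; evicts least-recently-used objects (assumed policy)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.objects = OrderedDict()  # name -> size, oldest first

    def used(self):
        return sum(self.objects.values())

    def touch(self, name):
        # Mark an object as recently used.
        if name in self.objects:
            self.objects.move_to_end(name)

    def load(self, name, size):
        """Load an object and return the names evicted to make room."""
        if size > self.capacity:
            raise ValueError("object is larger than the cache")
        if name in self.objects:
            self.touch(name)
            return []
        evicted = []
        while self.used() + size > self.capacity:
            victim, _ = self.objects.popitem(last=False)
            evicted.append(victim)
        self.objects[name] = size
        return evicted

cache = ObjectCache(capacity=1000)
cache.load("A", 500)
cache.load("B", 300)
evicted = cache.load("G", 600)  # does not fit next to both A and B
```

Because objects differ in size, a single load can push out several smaller objects, which is one reason the choice of what to load matters.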

SLIDE 9

Update shipping is for keeping objects up-to-date

[Diagram: objects A, E, B, F, G; Network; objects C, D, G, H, I, A, B]

Q1: SELECT MAX(A.x) FROM A, B WHERE A.x = B.y

Q2: SELECT G.z FROM A, G WHERE A.x ≤ G.z

U1: INSERT INTO A VALUES (2.3, 30, ...), (4.5, 25, ...)

U2: UPDATE G SET G.z = G.z+3.34 WHERE G.z ≤ 9.6

SLIDE 10

The objective is to keep the heavily queried objects in cache, and the heavily updated objects out of it — adaptively

[Diagram labels: Update Hotspots, Query Hotspots; objects A, B; events Q1, Q2, U1, U2; traffic types: Query Shipping, Loading, Update Shipping]

The interdependencies between objects make this even harder.
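To illustrate why interdependencies matter, here is a toy calculation under an assumed cost model (bytes for loads, shipped query results, and shipped updates). A join over A and B can be answered locally only when both objects are cached. The workload numbers and function name are hypothetical.

```python
from itertools import combinations

def total_traffic(cached, queries, updates, sizes):
    """Traffic for one fixed cache content over a whole workload."""
    load = sum(sizes[o] for o in cached)
    # A query is shipped unless every object it reads is cached.
    shipped_queries = sum(r for objs, r in queries if not objs <= cached)
    # An update is shipped only if its object is cached.
    shipped_updates = sum(b for o, b in updates if o in cached)
    return load + shipped_queries + shipped_updates

sizes = {"A": 100, "B": 100}
queries = [(frozenset({"A", "B"}), 150)] * 3  # three joins over A and B
updates = [("B", 20)] * 2                     # two updates to B

costs = {}
for k in range(len(sizes) + 1):
    for subset in combinations(sorted(sizes), k):
        costs[subset] = total_traffic(frozenset(subset), queries, updates, sizes)
```

In this toy workload, caching A alone or B alone costs more than caching nothing, while caching both is cheapest, so the value of an object cannot be judged in isolation.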

SLIDE 11

Algorithm Benefit learns from the past window, but is hard to tune

[Plot: Cumulative Network Traffic Cost (GB) versus Query and Update Events, comparing NoCache, Benefit, VCover, and SOptimal]

It greedily loads objects by the benefit of keeping them in cache.
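A rough sketch of what a greedy, window-based selection might look like. The slides give only the one-line description above, so the benefit formula here (query bytes saved minus update bytes minus load cost, per object) and the function name are assumptions, and the sketch ignores queries that span several objects.

```python
def benefit_select(window, sizes, capacity):
    """Pick objects to cache from statistics of the past window.

    window maps object -> (query_bytes, update_bytes) seen in that window.
    """
    # Assumed benefit: traffic saved on queries, minus updates and the load.
    benefit = {
        obj: q_bytes - u_bytes - sizes[obj]
        for obj, (q_bytes, u_bytes) in window.items()
    }
    chosen, free = [], capacity
    # Greedy pass: highest benefit first, skipping objects that do not fit.
    for obj in sorted(benefit, key=benefit.get, reverse=True):
        if benefit[obj] > 0 and sizes[obj] <= free:
            chosen.append(obj)
            free -= sizes[obj]
    return chosen

sizes = {"A": 400, "B": 300, "G": 500}
window = {"A": (2000, 100), "B": (900, 50), "G": (700, 900)}
chosen = benefit_select(window, sizes, capacity=800)
```

The length of the window is one parameter such a scheme has to fix in advance, which is consistent with the slide's remark that the algorithm is hard to tune.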

SLIDE 12

Algorithm VCover is conservative but performs close to the offline (static) optimal

[Plot: Cumulative Network Traffic Cost (GB) versus Query and Update Events, comparing NoCache, Benefit, VCover, and SOptimal]

Characteristics —

It is based on online algorithms for caching.
It incorporates a rent-versus-buy approach.
It captures query-update interactions in a bipartite graph (the minimum weighted vertex cover of which is the optimal solution).
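The vertex-cover formulation can be sketched on a tiny instance. The reading assumed here, which the slides do not spell out, is that queries and updates are the two sides of the graph, an edge joins a query to an update on an object it reads, each vertex is weighted by its shipping cost, and the cover is the set of events to ship. The weights are hypothetical, and brute force is used only for clarity; on bipartite graphs the same problem can be solved in polynomial time with a max-flow/min-cut reduction.

```python
from itertools import combinations

def min_weight_vertex_cover(weights, edges):
    """Brute-force minimum-weight vertex cover; suitable only for tiny graphs."""
    nodes = sorted(weights)
    best, best_cost = None, float("inf")
    for k in range(len(nodes) + 1):
        for subset in combinations(nodes, k):
            chosen = set(subset)
            # A cover must contain at least one endpoint of every edge.
            if all(q in chosen or u in chosen for q, u in edges):
                cost = sum(weights[n] for n in chosen)
                if cost < best_cost:
                    best, best_cost = chosen, cost
    return best, best_cost

# Hypothetical shipping costs (bytes) for two queries and two updates.
weights = {"Q1": 8, "Q2": 400, "U1": 50, "U2": 60}
# Q1 reads A and B, Q2 reads A and G; U1 modifies A, U2 modifies G.
edges = [("Q1", "U1"), ("Q2", "U1"), ("Q2", "U2")]

cover, cost = min_weight_vertex_cover(weights, edges)
```

Here the cheapest cover ships both updates, so both queries can be answered from the cache; shipping Q2's large result instead would cost more.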

SLIDE 13

Several open questions remain in creating an effective database cache

[Diagram labels: Update Hotspots, Query Hotspots; objects A, B; events Q1, Q2, U1, U2; traffic types: Query Shipping, Loading, Update Shipping]

Can we reduce the size of VCover data structures?
Are there better caching algorithms?
What is the best granularity for a data object?
Should we be caching query results rather than data objects?
How do we re-write queries for transient data objects?
Can we use indices on transient data objects?

SLIDE 14

In summary, clever algorithms can help build effective caching solutions for data intensive repositories, but much remains to be done
