REQUEST AGGREGATION, CACHING, AND FORWARDING STRATEGIES FOR IMPROVING LARGE CLIMATE DATA DISTRIBUTION WITH NDN: A CASE STUDY
Susmit Shannigrahi, Chengyu Fan, Christos Papadopoulos
Colorado State University
ICN 2017, Berlin

Large Scientific Data Has Transformed Modern Sciences
- Accurate climate models (CMIP5: 3.5 PB; CMIP6: several exabytes)
- Boson discovery (LHC run 2: exabyte/month)
- Universe in high resolution (LSST: 25 TB/night)
- Human genome mapping (Human Genome Project: 1 genome = 100 GB, 7 billion people)

Scientific Data is Also Becoming Unmanageable
- Large data: FedEx is faster than the network
- Ad hoc names in a flat namespace, for example /Output.g8.162/csu.hydro.PS.nc and /Output.g8.162/csu.hydro.Q.nc
- No provenance, no metadata, no reusability
- Contemporary tools and protocols require rethinking

Host-Dependent Data Discovery and Retrieval
- Instrument an experiment with a statistical model and fixed parameters
- Run simulation for 1-8 weeks and generate 10-50 TB of data
- Move useful data off the supercomputer, throw away the rest
- Browse a subset before requesting a large dataset (4-5 TB)
- Location-based naming: http://<csu>/data_name, http://<texas>/data_name
- No built-in provenance, no reusability, no transparent failover
[Diagram labels: data download request, data catalog, provenance catalog, replication catalog, atmospheric models, output, name, data; sites: CSU, Texas]

Can NDN help?
- Yes!
- We have previously shown that NDN can help with
  - Scientific data naming
  - Name-based discovery and built-in provenance
  - Retrieval, transparent failover, and subsetting
- In this study, we show how NDN can optimize data access and data transfers

Scientific data distribution options
- Option 1: Domain-specific custom-built software (ESGF, Xrootd)
  - No common framework, no reusability
- Option 2: Commercial CDNs
  - Very expensive for large data
  - Hard to rely on for very long-term data storage
  - Lack of compatibility with existing technologies and among providers

Presentation Outline
- Investigate patterns in a real climate data access log
- Create a realistic network topology from the log
- Replay the requests in real time using ndnSIM
- Quantify improvements with request aggregation and caching
- Propose and evaluate an NDN-based nearest-replica retrieval strategy
  - Easy to provide CDN-like functionality
- Summary and future work

Non-goals
- Investigate NDN’s performance for generic Internet traffic (web, voice, video)
- Quantify NDN’s performance in a resource-constrained environment
  - This study assumes no congestion, high cache capacity
- Claim this study generalizes to all scientific workflows
  - However, a separate study of a HEP access log shows similar access patterns

CMIP5 and ESGF
- CMIP5 is a modeling framework that is used to simulate the Earth's atmosphere or oceans
- ESGF is a distributed system that hosts and distributes CMIP5 data

3 years of CMIP5 data access
- We looked at one ESGF server log collected at LLNL
- Approximately three years of requests (2013-2016)
- 18.5 million total requests
- 1.5 million unique files requested
- Total request size = 1,844 TB
- Many duplicates and failed requests

User and Client Statistics
- Unique users (usernames): 5692
- Unique clients (IP addresses): 9266
- Unique ASNs: 911
[Figure: client IP addresses]

Data Statistics
- Number of total requests: 18.5 million
- Number of partial or completed downloads: 5.7 million
- Number of files: 1.8 million
- 95th-percentile file size: ~1.3 GB
- Two out of three requests are duplicates
- Individual files are small but cumulatively add up to a large size

Request Distribution
- Some files are very popular
- Popular files are candidates for aggregation and caching
- They can be served from a nearer replica

Partial Transfers
- Three distinct categories of clients according to partial transfers
- Partial transfers waste bandwidth and server resources
- Requests are often temporally close; aggregation and caching should help

Partial Transfers and duplicate requests
- All three categories contribute to duplicate requests
- Successful as well as partial transfers are repeated

Simulation Setup
- Remove all zero-byte transfers from the log
- ndnSIM uses too much memory and takes too long if we use the whole log
- Reduce the number of events
  - Randomly pick 7 weeks from the log
  - Choose clients responsible for ~95% of the traffic
- Generate topology using reverse traceroutes from server to clients
- Import them into ndnSIM, replay requests in real time
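The log-reduction steps above (drop zero-byte transfers, keep the clients that account for roughly 95% of the traffic) can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the `(client, bytes)` record layout, the function name, and the sample values are all assumptions made for the example.

```python
from collections import Counter

def top_clients_by_traffic(records, coverage=0.95):
    """Return the heaviest clients whose combined bytes reach `coverage` of the total.

    `records` is an iterable of (client_id, bytes_transferred) pairs; this
    layout is hypothetical and does not mirror the real ESGF log format.
    """
    bytes_per_client = Counter()
    for client, nbytes in records:
        if nbytes > 0:  # drop zero-byte transfers
            bytes_per_client[client] += nbytes

    total = sum(bytes_per_client.values())
    selected, running = [], 0
    # Walk clients from heaviest to lightest until the target share is covered.
    for client, nbytes in bytes_per_client.most_common():
        if running >= coverage * total:
            break
        selected.append(client)
        running += nbytes
    return selected

# Made-up sample: client "e" only has a zero-byte transfer.
records = [("a", 900), ("b", 60), ("c", 30), ("d", 10), ("e", 0)]
selected = top_clients_by_traffic(records)
```

With this sample, the two heaviest clients already cover 96% of the bytes, so the remaining clients are left out of the replay.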

Simulation Setup
- Randomly sampled seven weeks
- No loss of generality: other weeks show similar traffic volume and number of duplicate requests

Interest Aggregation
- Some weeks saw a large reduction in Interests reaching the server
- Some weeks did not see any reduction: they had fewer requests and no duplicates
- Interest aggregation can be useful during a traffic surge
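For readers unfamiliar with the mechanism, Interest aggregation can be sketched as a table of pending names: the first Interest for a name is forwarded upstream, and later Interests for the same name that arrive before the Data returns are recorded but not forwarded. The class below is a simplified model written for this illustration; it is not ndnSIM or NFD code, and the name `/cmip5/file1` and the client ids are hypothetical.

```python
class PendingInterestTable:
    """Toy model of Interest aggregation at a single forwarder."""

    def __init__(self):
        self._pending = {}   # name -> list of requesters waiting for the Data
        self.forwarded = 0   # Interests actually sent upstream

    def on_interest(self, name, requester):
        """Return True if the Interest is forwarded, False if it is aggregated."""
        if name in self._pending:
            self._pending[name].append(requester)
            return False
        self._pending[name] = [requester]
        self.forwarded += 1
        return True

    def on_data(self, name):
        """Data arrived: return every waiting requester and clear the entry."""
        return self._pending.pop(name, [])

pit = PendingInterestTable()
results = [pit.on_interest("/cmip5/file1", c) for c in ("c1", "c2", "c3")]
waiting = pit.on_data("/cmip5/file1")
```

Three temporally close requests for the same name result in a single upstream Interest, which is why aggregation helps most when duplicates cluster in time.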

Caching - How much to cache?
- Small caches are useful: even a 1 GB cache provided significant benefit
- A linear increase in cache size does not proportionally decrease traffic
- 95th-percentile inter-arrival time for duplicate requests = 400 seconds
- Caching everything on a 10 Gbps link for 400 seconds = 500 GB
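The 500 GB figure follows from simple arithmetic: a fully utilized 10 Gbps link carries 10 × 10^9 bits per second, so 400 seconds of traffic is 4 × 10^12 bits, or 500 GB at 8 bits per byte. A small sketch of that calculation, assuming decimal units (1 GB = 10^9 bytes) and a hypothetical helper name:

```python
def cache_size_gb(link_gbps, window_seconds):
    """Bytes (in decimal GB) crossing a saturated link during the caching window."""
    bits = link_gbps * 1e9 * window_seconds
    return bits / 8 / 1e9

size = cache_size_gb(10, 400)  # 10 Gbps link, 400 second window
```

This is an upper bound, since it assumes the link is saturated for the whole window.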

Caching - where to cache?
- Requests are highly localized
- Request paths do not overlap too much
- Caching at the edge provides significant benefits
- In some cases, network-wide caches provide better benefits
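To make the edge-caching idea concrete, the sketch below models a single edge cache with a byte budget and counts hits for a short request trace. The slides do not state the replacement policy used in the simulation; least-recently-used eviction is an assumption made here, and the file names and sizes are made up.

```python
from collections import OrderedDict

class EdgeCache:
    """Byte-limited cache with LRU eviction (the policy is an assumption)."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self._store = OrderedDict()  # name -> size, oldest entry first
        self.hits = 0
        self.misses = 0

    def request(self, name, size):
        """Return True on a cache hit; on a miss, admit the object if it fits."""
        if name in self._store:
            self._store.move_to_end(name)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if size <= self.capacity:
            # Evict least recently used objects until the new one fits.
            while self.used + size > self.capacity:
                _, evicted_size = self._store.popitem(last=False)
                self.used -= evicted_size
            self._store[name] = size
            self.used += size
        return False

cache = EdgeCache(capacity_bytes=100)
trace = [("a", 60), ("b", 40), ("a", 60), ("c", 50), ("b", 40)]
outcomes = [cache.request(name, size) for name, size in trace]
```

Only the repeated request for "a" is a hit in this trace; "b" is evicted to make room for "c" before it is requested again.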

Cost of caching everywhere
- Cost of network-wide caching is consistently very high
  - 7-8 times more than caching at the edge
- Caching at the edge provides reasonable benefits for our workflow

A simple CDN-like strategy
- CDN-like nearest-replica retrieval
- Hypothetical scenario with fully replicated datasets and six real ESGF server locations
- Our strategy measures the path delay and sends requests to the nearest replica
- 96% of the original requests go to the nearer replicas; the original server receives only 0.03% of the requests
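The selection step of such a latency-based strategy can be sketched as choosing the replica with the lowest measured delay. This is only an illustration of the idea, not the forwarding strategy implemented in the study; the replica labels and delay values are hypothetical, and how the delays are measured is out of scope here.

```python
def pick_nearest_replica(rtt_ms):
    """Return the replica with the lowest measured delay.

    `rtt_ms` maps a replica label to its measured round-trip time in
    milliseconds, or None when no measurement succeeded.
    """
    reachable = {replica: rtt for replica, rtt in rtt_ms.items() if rtt is not None}
    if not reachable:
        raise ValueError("no reachable replica")
    return min(reachable, key=reachable.get)

# Illustrative measurements only.
measured = {"origin": 200.0, "replica-a": 25.0, "replica-b": None}
nearest = pick_nearest_replica(measured)
```

Unreachable replicas are skipped rather than treated as infinitely far, so a client still gets an answer as long as one replica responds.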

Client latency is also reduced
- Nearer-replica strategy also reduces client-side latency
- RTT for client-3 reduced from 200 ms to 25 ms

Conclusions
- While climate data is large, individual files are small
- Requests are highly localized and can benefit from aggregation and caching
- Interest aggregation is useful in some cases
- Small caches at the edge can significantly improve data distribution
- Data need not be cached for long; the useful caching life for this data is ~400 seconds
- A simple latency-based strategy can provide CDN-like functionality
  - Reduces network and server resource consumption
  - Reduces client-side latency

Future work
- Extend the study to include logs from other ESGF nodes
- Analyze raw HTTP logs for possible insights into client behavior
- Simulate the full log in real time
- Code and Data: https://github.com/susmit85/icn17-simulation-scenario/

Thank You!
