1 Web Traffic Characterization Zipf Web Traffic Characterization - PDF document

Web Caching and Content Delivery

Caching for a Better Web

Performance is a major concern in the Web.
Proxy caching is the most widely used method to improve Web performance.
• Duplicate requests to the same document are served from cache
• Hits reduce latency, bandwidth demand, server load
• Misses increase latency (extra hops)
[Figure: clients, proxy cache, Internet, servers; hits are served at the proxy cache, misses cross the Internet to the servers. Source: Geoff Voelker]

Proxy Caching

How should we build caching systems for the Web?
• Seminal paper [Chankhunthod96]
• Proxy caches [Duska97]
• Akamai DNS interposition [Karger99]
• Cooperative caching [Tewari99, Fan98, Wolman99]
• Popularity distributions [Breslau99]
• Proxy filtering and transcoding [Fox et al]
• Consistency [Tewari, Cao et al]
• Replica placement for CDNs [et al]

Issues for Web Caching

• Binding clients to proxies, handling failover
  Manual configuration, router-based “transparent caching”, WPAD (Web Proxy Automatic Discovery)
• Proxy may confuse/obscure interactions between server and client.
• Consistency management
  At first approximation the Web is a wide-area read-only file service...but it is much more than that.
  caching responses vs. caching documents
  deltas [Mogul+Bala/Douglis/Misha/others@research.att.com]
• Prefetching, request routing, scale, performance
  Web caching vs. content distribution (CDNs, e.g., Akamai) [Voelker]

End-to-End Content Delivery

[Figure: request stream flowing from proxies across the Internet to surrogate caches (CDN servers), then into the hosting network, where a request distributor feeds a server array + storage; directions marked upstream and downstream.]

Proxy Cache Effectiveness

How to measure Web cache effectiveness (goals)?
• Hit ratio
• Savings in bandwidth or server load
• Reduction in perceived user latency
What factors determine/limit effectiveness?
• Capacity?
• User population?
• Proxy placement in the network?
• Updates and invalidations?
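The effectiveness measures listed above (hit ratio, bandwidth savings) can be made concrete with a small trace replay. The following is a minimal sketch, assuming an LRU cache bounded by total bytes; the function name `simulate_lru` and the toy trace are illustrative and do not come from the slides.

```python
from collections import OrderedDict


def simulate_lru(requests, capacity_bytes):
    """Replay (url, size_bytes) requests through a byte-bounded LRU cache.

    Returns (object_hit_ratio, byte_hit_ratio).
    """
    cache = OrderedDict()  # url -> size, least recently used first
    used = 0
    hits = hit_bytes = total_bytes = 0
    for url, size in requests:
        total_bytes += size
        if url in cache:
            hits += 1
            hit_bytes += size
            cache.move_to_end(url)  # mark as most recently used
            continue
        if size > capacity_bytes:
            continue  # object can never fit; treat it as uncacheable
        # Evict least recently used objects until the new one fits.
        while used + size > capacity_bytes:
            _, evicted_size = cache.popitem(last=False)
            used -= evicted_size
        cache[url] = size
        used += size
    n = len(requests)
    object_ratio = hits / n if n else 0.0
    byte_ratio = hit_bytes / total_bytes if total_bytes else 0.0
    return object_ratio, byte_ratio


# Hypothetical toy trace: (url, size in bytes).
trace = [("/a", 100), ("/b", 400), ("/a", 100),
         ("/c", 500), ("/a", 100), ("/b", 400)]
object_hr, byte_hr = simulate_lru(trace, capacity_bytes=600)
print(f"object hit ratio {object_hr:.3f}, byte hit ratio {byte_hr:.3f}")
```

The two ratios differ whenever hits fall mostly on small objects, which is why bandwidth savings are usually reported separately from the object hit ratio.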

Web Traffic Characterization

Research question: how do goals and traffic behavior shape strategies for deploying and managing proxy caches?
• Replacement policy: what objects to retain in cache?
  Large vs. small, relative importance of popularity and stability
• Deployment: where to place the cache?
  Close to server or client?
• How many users per cache?
• Prefetching?
Since the Web is in active deployment on a large scale, Web traffic characterization is an empirical science.
• Science of mass behavior: observe and test hypotheses.

Zipf

[Breslau/Cao99] and others observed that Web accesses can be modeled using Zipf-like probability distributions.
• Rank objects by popularity: lower rank $i$ ==> more popular.
• The probability that any given reference is to the $i$th most popular object is $p_i$.
  Not to be confused with $p_c$, the percentage of cacheable objects.
Zipf says: “$p_i$ is proportional to $1/i^{\alpha}$, for some $\alpha$ with $0 < \alpha < 1$”.
• Higher α gives more skew: popular objects are way popular.
• Lower α gives a more heavy-tailed distribution.
• In the Web, α ranges from 0.6 to 0.8 [Breslau/Cao99].
• With α = 0.8, 0.3% of the objects get 40% of requests.

Zipf-like Reference Distributions

Probability of access to the object with popularity rank $i$ [Zipf 49, Duska et al. 97, Breslau et al. 98]:

$p_i \propto 1/i^{\alpha}$ such that $\sum_i p_i = 1$

[Figure: $p_i$ versus popularity rank for α = 0.7, with the head, the tail, and the heavy tail marked.]
(This is equivalent to a power-law or Pareto distribution.)

Importance of Traffic Models

Analytical models like this help us to predict cache hit ratios (object hit ratio or byte hit ratio).
• E.g., get object hit ratio as a function of size by integrating under segments of the Zipf curve
  …assuming perfect LFU replacement
• Must consider update rate
  Do object update rates correlate with popularity?
• Must consider object size
  How does size correlate with popularity?
• Must consider proxy cache population
  What is the probability of object sharing?
• Enables construction of synthetic load generators
  SURGE [Barford and Crovella 99]

The “Trickle-Down Effect”

[Figure: clients send a flood of requests to the cache; a trickle of misses continues to the servers.]
What is the effect on “downstream” traffic?
What is the significance of this effect?
How does it impact design choices for components “behind” the caches?

A Look at the Miss Stream

[Figure: log-log plot for a synthetic, Zipf-like, SURGE-generated trace with low locality (α = 0.6); the values 1035 and 816 are marked on the plot.]
• head: flattened
• midrange: tapers
• tail: intact
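The Zipf-like model and the perfect-LFU hit ratio estimate described above can be sketched numerically. This is a minimal sketch, assuming a finite universe of `N` objects (the slides do not state one) and a cache that always holds the most popular objects; the helper names are illustrative.

```python
def zipf_probabilities(n_objects, alpha):
    """Return p_1..p_n with p_i proportional to 1 / i**alpha, summing to 1."""
    weights = [1.0 / (i ** alpha) for i in range(1, n_objects + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def lfu_hit_ratio(probs, cache_objects):
    """Object hit ratio when the cache holds the cache_objects most popular objects.

    This is the discrete form of integrating under the head of the Zipf curve.
    """
    return sum(probs[:cache_objects])


# Assumed universe size; the head share depends strongly on this choice.
N = 100_000
for alpha in (0.6, 0.7, 0.8):
    p = zipf_probabilities(N, alpha)
    head = lfu_hit_ratio(p, N // 100)  # cache holding the top 1% of objects
    print(f"alpha={alpha}: top 1% of objects draw {head:.1%} of requests")
```

Because the normalizing sum grows with the number of objects, a figure such as "0.3% of the objects get 40% of requests" only holds for a particular universe size, so the sketch prints the shares rather than asserting one.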

Effect on Server Trace (ibm.com)

[Figure: 1998 ibm.com server trace; high locality, fit to Zipf with α = 0.76; skewed: 77% / 1%.]

What’s Happening? (LRU)

Suppose the cache fills up in $R$ references. (That’s a property of the trace and the cache size.)
Then a cache miss on object with rank $i$ occurs only if $i$ is referenced…
  probability $p_i$
…and $i$ has not been referenced in the last $R$ requests.
  probability $(1 - p_i)^R$ (stack distance)
P(a miss is to object $i$) is $q_i = p_i (1 - p_i)^R$

Object Hit Ratio by Popularity (1)

[Figure: object hit ratio by popularity; plot labels: synthetic α = 0.6, IBM 1998 (32 MB).]

Miss Stream Probability by Popularity

Moderately popular objects now dominate.
[Figure: $q_i$ for $R = 10^4$, α = 0.7.]

Object Hit Ratio by Popularity (2)

[Figure: object hit ratio by popularity; plot label: IBM 1998.]

Limitations/Features of This Study

static (cacheable) objects
ignore misses caused by updates
• invalidation/expiration
LRU replacement
vary cache effectiveness by capacity
• cache intercepts all client traffic
ignore effect on downstream traffic volume
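The miss-stream expression $q_i = p_i (1 - p_i)^R$ from the LRU slide above can be evaluated directly to see why moderately popular objects dominate. This is a minimal sketch, assuming a Zipf-like popularity with α = 0.7, R = 10^4 as on the slide, and a hypothetical universe of 100,000 objects; the weights are left unnormalized.

```python
def zipf_probabilities(n_objects, alpha):
    """Return p_1..p_n with p_i proportional to 1 / i**alpha, summing to 1."""
    weights = [1.0 / (i ** alpha) for i in range(1, n_objects + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def miss_stream_weights(probs, fill_refs):
    """q_i = p_i * (1 - p_i)**R for each rank i (unnormalized)."""
    return [p * (1.0 - p) ** fill_refs for p in probs]


N = 100_000  # assumed number of objects (not given in the slides)
R = 10 ** 4  # references needed to fill the cache, as on the slide
p = zipf_probabilities(N, 0.7)
q = miss_stream_weights(p, R)

# Rank (1-based) of the object most likely to appear in the miss stream.
peak_rank = max(range(N), key=lambda i: q[i]) + 1
print(f"q_i peaks at rank {peak_rank}: "
      f"p = {p[peak_rank - 1]:.2e}, 1/(R+1) = {1 / (R + 1):.2e}")
```

Setting the derivative of $p(1-p)^R$ to zero gives $p = 1/(R+1)$, so the peak sits at whichever rank has a reference probability closest to that value: the most popular objects almost always hit, and the rarest are seldom requested at all.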

Proxy Deployment and Use

Where to put it?
How to direct user Web traffic through the proxy?
Request redirection
• Much more to come on this topic…
Must the server consent?
• Protected content
• Client identity
“Transparent” caching and the end-to-end principle
• Must the client consent?

Interception Switches

The client doesn’t know.
The server doesn’t know.
Neither side told HTTP to disable it.
Is it legal? Good thing? Bad thing?
[Figure: interception switch diverting traffic to an ISP cache array.]

Shouldn’t This Be Illegal?

[Figure: two ends communicating through a box in the middle.]
RFC 1122: The Internet Architecture (IPv4) specifies that each packet has a unique destination “host” address.
Problems
• middle boxes may be subversive
• IPsec and SSL
• dynamic routing

Cache Effectiveness

Previous work has shown that hit rate increases with population size [Duska et al. 97, Breslau et al. 98].
However, single proxy caches have practical limits
• Load, network topology, organizational constraints
One technique to scale the client population is to have proxy caches cooperate.
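The observation that hit rate increases with population size can be illustrated with a simple analytical sketch. It rests on assumptions that are not from the slides: every client draws independently from one shared Zipf-like distribution (α = 0.7 over a hypothetical 100,000 objects), each client issues 100 requests, the cache is unbounded, and objects never change, so the only misses are first references.

```python
def zipf_probabilities(n_objects, alpha):
    """Return p_1..p_n with p_i proportional to 1 / i**alpha, summing to 1."""
    weights = [1.0 / (i ** alpha) for i in range(1, n_objects + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_hit_ratio(probs, n_requests):
    """Expected hit ratio of an unbounded cache after n_requests independent draws.

    Every first reference to an object is a miss, so the hit ratio is
    1 - E[number of distinct objects referenced] / n_requests.
    """
    expected_distinct = sum(1.0 - (1.0 - p) ** n_requests for p in probs)
    return 1.0 - expected_distinct / n_requests


N = 100_000                # assumed number of objects
REQUESTS_PER_CLIENT = 100  # assumed per-client request count
p = zipf_probabilities(N, 0.7)
for clients in (10, 100, 1_000, 10_000):
    ratio = expected_hit_ratio(p, clients * REQUESTS_PER_CLIENT)
    print(f"{clients:>6} clients: expected hit ratio {ratio:.1%}")
```

Under these assumptions the gain comes purely from sharing: a larger population makes it more likely that some other client has already fetched an object. Real traces also include updates, uncacheable objects, and finite capacity, all of which lower the ratios printed here.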


