Web Cache Coherence
Adam Dingle (Charles University)
Tomas Partl (Czech Technical University)
The Caching Tradeoff
Benefit
- lower access latency
- lower bandwidth usage
Cost
- risk of returning stale information
How Distributed File Systems Maintain Coherence
- validation checks, older (Sun NFS)
- callbacks, newer (Andrew, Coda, AFS)
Callback not scalable to the Web
- too many caches would want promises (most browsers have a cache)
- the Web is unreliable
Validation Checks
- can take very long on the web
- used by today's web browsers
Staleness
Def: Given a cached copy of a page, let its LAST-GOOD TIME equal the
time at which its content was last identical to the master copy of the
page. If the cached and master copies are still identical, let the
last-good time equal the current time.
Def: Given a cached copy of a page, let its STALENESS equal the time
elapsed since its last-good time.
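A minimal sketch of these definitions in Python (the cached-copy record
and its last_validated field are illustrative assumptions, not part of
any protocol):

    import time

    def last_good_time(cached, master):
        # If the cached and master copies are still identical, the
        # last-good time is the current time.
        if cached.content == master.content:
            return time.time()
        # Otherwise it is the time the copies were last known to be
        # identical, e.g. as recorded at the cache's last validation.
        return cached.last_validated

    def staleness(cached, master):
        # Staleness is the time elapsed since the last-good time.
        return time.time() - last_good_time(cached, master)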
HTTP headers used to maintain coherence
Last-Modified:
- identifies a version of a page
If-Modified-Since:
- (also called conditional GET)
- a conditional GET returns the page only if it has been modified since
the date given
- the specified behavior is strict: always pass the request to the home server
- practice is liberal: a cache along the way may satisfy the request
Date:
- says when the page was last known to be fresh
- not used for coherence by either W3C httpd or Netscape Proxy
Pragma: no-cache
- makes the request fall through to the original site (RELOAD button)
- very expensive
Expires:
- lets the page's authority specify how long the page is valid
- rarely used
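A short sketch tying these headers together: a conditional GET issued
with Python's standard http.client (the host, path, and date are
placeholders):

    import http.client

    conn = http.client.HTTPConnection("www.example.com")
    conn.request("GET", "/page.html", headers={
        "If-Modified-Since": "Sat, 01 Jun 1996 00:00:00 GMT"})
    resp = conn.getresponse()
    if resp.status == 304:
        # Not Modified: the cached copy is still good; its last-good
        # time can be advanced to the Date: header.
        print("still valid as of", resp.getheader("Date"))
    else:
        # A newer version arrived; Last-Modified identifies it.
        body = resp.read()
        print("new version, Last-Modified:",
              resp.getheader("Last-Modified"))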
Existing Mechanisms
Naive (Netscape Navigator)
always check
- user waits for conditional GET to reach the home site
- popular sites are swamped -> useless in a hierarchy
never check
- in a large cache, pages are never naturally refreshed
- users have to RELOAD -> site swamping -> useless in a hierarchy
expiration-based coherence (W3C httpd, Netscape Proxy)
- commonly used on today's Web
- cache assigns an expiration date to a page upon its retrieval
- before expiration, pages treated as valid
- after expiration, pages are checked via conditional GET
- expiration derived from: the Expires header, a default expiration, or an
LM-factor (a fraction of the page's age since Last-Modified); see the
sketch below
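A sketch of how such a cache might compute an expiration time (the
DEFAULT_EXPIRY and LM_FACTOR values are illustrative settings chosen by
the cache administrator; all times are seconds since the epoch):

    DEFAULT_EXPIRY = 24 * 3600   # fall back to a one-day lifetime
    LM_FACTOR = 0.1              # fraction of the page's age to trust

    def expiration_time(fetched_at, expires=None, last_modified=None):
        if expires is not None:
            # The page's authority said exactly how long it is valid.
            return expires
        if last_modified is not None:
            # Heuristic: a page that has gone unchanged for a long
            # time will probably stay unchanged for a while longer.
            return fetched_at + LM_FACTOR * (fetched_at - last_modified)
        return fetched_at + DEFAULT_EXPIRY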
Problems of Expiration-Based Coherence
1) users must wait for expiration checks to occur
- the user gets no information while the check is being performed
- the check may go all the way to the original site and take tens of
seconds
- even a "stale" page could contain requested information
- even when the check returns not modified, the page can actually be
stale
2) When not satisfied with a page's staleness, users can only RELOAD
- may take very long
- goes to the original site even though a cache along the way may store
requested information
3) Expiration mechanism provides no hard guarantee on staleness
- Expires header is rare
- Both default-expiry and LM-factor can accumulate staleness
4) Users can't specify their staleness tolerance
- cache admin sets both default expiration and LM-factor
- users may often have to press RELOAD or wait unnecessarily long
New Coherence Mechanisms
3 independent extensions to existing coherence mechanisms:
- return a stream of documents in response to each request
- provide a real staleness guarantee (basically a bug fix for httpd)
- allow the user to specify a maximum staleness for each request using
a Maximum-Staleness: header
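A sketch of how a cache could honor the proposed Maximum-Staleness:
header (the cache-entry record with its last_good field is an
illustrative assumption):

    import time

    def can_serve_from_cache(entry, request_headers):
        # Serve the cached copy only if its staleness is within the
        # client's stated tolerance; absent the header, any copy will do.
        max_staleness = float(request_headers.get("Maximum-Staleness",
                                                  "inf"))
        return time.time() - entry.last_good <= max_staleness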
Returning a stream of documents for each request
The user's view
- Upon requesting a page, a user should receive some version of that
page as quickly as possible
- As fresher versions arrive, the page should be updated in the user's
Web browser (perhaps with some dialog box asking whether to really display
a newer version)
- The user should always have a clear indication of the staleness of
pages being viewed
- A user who is looking at an expired page should be given a very clear
warning to this effect
Implementation
- Netscape's server push mechanism is already adequate for displaying
successive versions of a document
- Every HTTP GET or conditional GET may return a stream of document
versions, each with its own Date: and Last-Modified: headers
- A document body in the stream may be null, indicating that the previous
document in the stream was still valid at a date later than previously
realized
- A null document may arrive before a non-null document only if the GET
request was conditional
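A sketch of the sending side of such a stream (the framing below is
illustrative; a real implementation would use the multipart framing of
Netscape's server push):

    def send_version(out, date, last_modified, body):
        # Each version in the stream carries its own Date: and
        # Last-Modified: headers.
        out.write(f"Date: {date}\r\n".encode())
        out.write(f"Last-Modified: {last_modified}\r\n".encode())
        out.write(f"Content-Length: {len(body)}\r\n\r\n".encode())
        # An empty body is a null document: the previous version in
        # the stream was still valid at the given Date.
        out.write(body)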
Browser changes needed to work well with version streams
- Users need a way to see the last-good date that comes in the Date: header
- Browsers need to respond to null documents by updating the user-visible
last-good date
- Netscape currently clears the browser window as soon as the first bytes
of a new document version are read
Proxies can be programmed to work around these limitations for the time
being.
More speculative cache coherence techniques
1. Pre-fetching
- periodically updates cached pages so that they will be less stale when
requested by the user
- more research is required to determine exactly when cached pages should
be updated
2. Replication
Replication - the distribution of popular pages to caches even before
those pages have been requested through those caches
Berners-Lee's argument for why caching alone is inadequate and
replication is necessary:
- "the nesting of proxy servers increases response time"
- but proposed extensions to HTTP minimize this inefficiency
- even Netscape 2.0 has support for keeping HTTP connections alive
- "unnested proxies will be very numerous and put a high load on
orginal servers"
- but as Web proxy hierarchies become more numerous, users' incentive
to use caches will be greater and greater
- today, the bottleneck for most page transmission is the Internet itself,
not server load
- "cache hit rates tend to be under 50%"
- yes, but the pages Berners-Lee wants to replicate (e.g. Olympics results)
are exactly those which are found in Web caches
- this is not really so bad - in a hierarchy, many caches cooperate to
provide a high user hit rate, even though each cache itself may have only
a low rate
- we feel that when sufficiently large (continental) caching systems
are built, hit rates will be at least 80-90%
A proposed mechanism for replication
- each Web cache preemptively sends popular pages to caches which sit
below it
- a cache determines a page's popularity by counting the number of (non-conditional)
GET requests for the page
- essentially, this measures the breadth of a page's popularity
- each lower-level cache can specify some threshold of GET requests for
pages it will preemptively receive
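A sketch of this mechanism (the class and method names are
illustrative):

    from collections import Counter

    class ReplicatingCache:
        def __init__(self):
            self.get_counts = Counter()   # non-conditional GETs per page
            self.children = []            # (child_cache, threshold) pairs

        def on_get(self, url, conditional):
            if conditional:
                return                    # conditional GETs don't count
            self.get_counts[url] += 1
            for child, threshold in self.children:
                if self.get_counts[url] == threshold:
                    child.push(url)       # preemptively send the page down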
Conclusion
We hope that Web caches will replace all archives and mirrors on the
net.
Theoretical work to be done:
- construct mathematical models of caching systems
- how to set expiration times?
- how cheap must net bandwidth be to make pre-fetching/replication worthwhile?
Experimental work to be done:
- construct extremely large cache hierarchies, generate log data about
page accesses
- make log data publicly available for research