Web Cache Coherence
Adam Dingle (Charles University)
Tomas Partl (Czech Technical University)
The Caching Tradeoff
Benefit
- lower access latency
- lower bandwidth usage
Cost
- risk of returning stale information
How Distributed File Systems Maintain Coherence
- validation checks, older (Sun NFS)
- callbacks, newer (Andrew, Coda, AFS)
Callback not scalable to the Web
- too many caches would want promises (most browsers have a cache)
- the Web is unreliable
Validation Checks
- can take very long on the web
- used by today's web browsers
Staleness
Def: Given a cached copy of a page, let its LAST-GOOD TIME equal the
time at which its content was last identical to the master copy of the
page. If the cached and master copies are still identical, let the
last-good time equal the current time.
Def: Given a cached copy of a page, let its STALENESS equal the time
elapsed since its last-good time.
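A minimal sketch of these definitions in Python (the cached-copy record
and its last_validated field are illustrative assumptions, not part of
any protocol):

    import time

    def last_good_time(cached, master):
        # If the cached and master copies are still identical, the
        # last-good time is the current time.
        if cached.content == master.content:
            return time.time()
        # Otherwise it is the time the copies were last known to be
        # identical, e.g. as recorded at the cache's last validation.
        return cached.last_validated

    def staleness(cached, master):
        # Staleness is the time elapsed since the last-good time.
        return time.time() - last_good_time(cached, master)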
HTTP headers used to maintain coherence
Last-Modified:
- identifies a version of a page
If-Modified-Since:
- (also called conditional GET)
- a conditional GET returns the page only if it has been modified since
the date given
- the specified behavior is strict: always pass the request to the home server
- practice is liberal: a cache along the way may satisfy the request
Date:
- says when the page was last known to be fresh
- not used for coherence by either W3C httpd or Netscape Proxy
Pragma: no-cache
- makes the request fall through to the original site (RELOAD button)
- very expensive
Expires:
- lets the page's authority specify how long the page is valid
- rarely used
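A short sketch tying these headers together: a conditional GET issued
with Python's standard http.client (the host, path, and date are
placeholders):

    import http.client

    conn = http.client.HTTPConnection("www.example.com")
    conn.request("GET", "/page.html", headers={
        "If-Modified-Since": "Sat, 01 Jun 1996 00:00:00 GMT"})
    resp = conn.getresponse()
    if resp.status == 304:
        # Not Modified: the cached copy is still good; its last-good
        # time can be advanced to the Date: header.
        print("still valid as of", resp.getheader("Date"))
    else:
        # A newer version arrived; Last-Modified identifies it.
        body = resp.read()
        print("new version, Last-Modified:",
              resp.getheader("Last-Modified"))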
Existing Mechanisms
Naive (Netscape Navigator)
always check
- user waits for conditional GET to reach the home site
- popular sites are swamped -> useless in a hierarchy
never check
- in a large cache, pages are never naturally refreshed
- users have to RELOAD -> site swamping -> useless in a hierarchy
expiration-based coherence (W3C httpd, Netscape Proxy)
- commonly used on today's Web
- cache assigns an expiration date to a page upon its retrieval
- before expiration, pages treated as valid
- after expiration, pages are checked via conditional GET
- expiration derived from: the Expires header, a default expiration, or an
LM-factor (a fraction of the page's age since Last-Modified); see the
sketch below
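A sketch of how such a cache might compute an expiration time (the
DEFAULT_EXPIRY and LM_FACTOR values are illustrative settings chosen by
the cache administrator; all times are seconds since the epoch):

    DEFAULT_EXPIRY = 24 * 3600   # fall back to a one-day lifetime
    LM_FACTOR = 0.1              # fraction of the page's age to trust

    def expiration_time(fetched_at, expires=None, last_modified=None):
        if expires is not None:
            # The page's authority said exactly how long it is valid.
            return expires
        if last_modified is not None:
            # Heuristic: a page that has gone unchanged for a long
            # time will probably stay unchanged for a while longer.
            return fetched_at + LM_FACTOR * (fetched_at - last_modified)
        return fetched_at + DEFAULT_EXPIRY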
Problems of Expiration-Based Coherence
1) users must wait for expiration checks to occur
- the user gets no information while the check is being performed
- the check may go all the way to the original site and take tens of
seconds
- even a "stale" page could contain requested information
- even when the check returns not modified, the page can actually be
stale
2) When not satisfied with a page's staleness, users can only RELOAD
- may take very long
- goes to the original site even though a cache along the way may store
requested information
3) Expiration mechanism provides no hard guarantee on staleness
- Expires header is rare
- Both default-expiry and LM-factor can accumulate staleness
4) Users can't specify their staleness tolerance
- cache admin sets both default expiration and LM-factor
- users may often have to press RELOAD or wait unnecessarily long
New Coherence Mechanisms
3 independent extensions to existing coherence mechanisms:
- return a stream of documents in response to each request
- provide a real staleness guarantee (basically a bug fix for httpd)
- allow the user to specify a maximum staleness for each request using
a Maximum-Staleness: header
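A sketch of how a cache could honor the proposed Maximum-Staleness:
header (the cache-entry record with its last_good field is an
illustrative assumption):

    import time

    def can_serve_from_cache(entry, request_headers):
        # Serve the cached copy only if its staleness is within the
        # client's stated tolerance; absent the header, any copy will do.
        max_staleness = float(request_headers.get("Maximum-Staleness",
                                                  "inf"))
        return time.time() - entry.last_good <= max_staleness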
Returning a stream of documents for each request
The user's view
- Upon requesting a page, a user should receive some version of that
page as quickly as possible
- As fresher versions arrive, the page should be updated in the user's
Web browser (perhaps with some dialog box asking whether to really display
a newer version)
- The user should always have a clear indication of the staleness of
pages being viewed
- A user who is looking at an expired page should be given a very clear
warning to this effect
Implementation
- Netscape's server push mechanism is already adequate for displaying
successive versions of a document
- Every HTTP GET or conditional GET may return a stream of document
versions, each with its own Date: and Last-Modified: headers
- A document body in the stream may be null, indicating that the previous
document in the stream was still valid at a date later than previously
realized
- A null document may arrive before a non-null document only if the GET
request was conditional
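A sketch of the sending side of such a stream (the framing below is
illustrative; a real implementation would use the multipart framing of
Netscape's server push):

    def send_version(out, date, last_modified, body):
        # Each version in the stream carries its own Date: and
        # Last-Modified: headers.
        out.write(f"Date: {date}\r\n".encode())
        out.write(f"Last-Modified: {last_modified}\r\n".encode())
        out.write(f"Content-Length: {len(body)}\r\n\r\n".encode())
        # An empty body is a null document: the previous version in
        # the stream was still valid at the given Date.
        out.write(body)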
Browser changes needed to work well with version streams
- Users need a way to see the last-good date that comes in the Date: header
- Browsers need to respond to null documents by updating the user-visible
last-good date
- Netscape currently clears the browser window as soon as the first bytes
of a new document version are read
Proxies can be programmed to work around these limitations for the time
being.
More speculative cache coherence techniques
1. Pre-fetching
- periodically updates cached pages so that they will be less stale when
requested by the user
- more research is required to determine exactly when cached pages should
be updated
2. Replication
Replication - the distribution of popular pages to caches even before
those pages have been requested through those caches
Berners-Lee's argument for why caching alone is inadequate and
replication is necessary:
- "the nesting of proxy servers increases response time"
- but proposed extensions to HTTP minimize this inefficiency
- even Netscape 2.0 has support for keeping HTTP connections alive
- "unnested proxies will be very numerous and put a high load on
orginal servers"
- but as Web proxy hierarchies become more numerous, users' incentive
to use caches will be greater and greater
- today, the bottleneck for most page transmission is the Internet itself,
not server load
- "cache hit rates tend to be under 50%"
- yes, but the pages Berners-Lee wants to replicate (e.g. Olympics results)
are exactly those which are found in Web caches
- this is not really so bad - in a hierarchy, many caches cooperate to
provide a high user hit rate, even though each cache itself may have only
a low rate
- we feel that when sufficiently large (continental) caching systems
are built, hit rates will be at least 80-90%
A proposed mechanism for replication
- each Web cache preemptively sends popular pages to caches which sit
below it
- a cache determines a page's popularity by counting the number of (non-conditional)
GET requests for the page
- essentially, this measures the breadth of a page's popularity
- each lower-level cache can specify some threshold of GET requests for
pages it will preemptively receive
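A sketch of this mechanism (the class and method names are
illustrative):

    from collections import Counter

    class ReplicatingCache:
        def __init__(self):
            self.get_counts = Counter()   # non-conditional GETs per page
            self.children = []            # (child_cache, threshold) pairs

        def on_get(self, url, conditional):
            if conditional:
                return                    # conditional GETs don't count
            self.get_counts[url] += 1
            for child, threshold in self.children:
                if self.get_counts[url] == threshold:
                    child.push(url)       # preemptively send the page down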
Conclusion
We hope that Web caches will replace all archives and mirrors on the
net.
Theoretical work to be done:
- construct mathematical models of caching systems
- how to set expiration times?
- how cheap must net bandwidth be to make pre-fetching/replication worthwhile?
Experimental work to be done:
- construct extremely large cache hierarchies, generate log data about
page accesses
- make log data publicly available for research