Main Memory Caching of Web Documents

by Evangelos P. Markatos, ICS-FORTH, Crete, Greece

For more information, see the Web Page at http://www.ics.forth.gr/proj/arch-vlsi/www.html

FO.R.T.H.

Motivation

Some servers are really popular
Up to a million requests per day - on average about one every 86 ms
Future requests will be even more frequent
Web Servers vs. File Servers
Servers should rapidly handle requests
- low processing time per request
- few disk accesses

Main Memory Caching

Servers should cache popular documents in main memory
Much like File Systems - but
- Web Documents are read only
- Web documents are read once by each client
- Clients do not reuse data
- Web Clients read entire documents

Main Memory Caching Issues

Which documents to cache?
Which Documents to Replace?
Working Set: How large caches are needed?
Do the answers depend on:
- the Web server?
- the time/date?
- types of documents provided?
- the cache size?

Methodology

Trace Driven Simulation
Gathered server access traces from
- NCSA: home of Mosaic (USA)
- University of Rochester (USA)
- University of Bergen (Norway)
- University of Crete (Greece)
- ICS-FORTH (Greece)
More than 1 million accesses in total
Several weeks of accesses
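
A trace-driven simulation of this kind can be sketched as below. This is only an illustration, not the simulator used in the study: the `simulate` function, the `(url, size)` trace format, and the LRU replacement policy are all assumptions, and the sample trace is made up.

```python
from collections import OrderedDict

def simulate(trace, cache_bytes):
    """Replay (url, size) requests against an LRU cache and return the hit rate.

    LRU replacement is an assumption made for this sketch.
    """
    cache = OrderedDict()  # url -> size, least recently used first
    used = 0
    hits = 0
    for url, size in trace:
        if url in cache:
            hits += 1
            cache.move_to_end(url)  # mark as most recently used
            continue
        if size > cache_bytes:
            continue  # document can never fit in this cache
        while used + size > cache_bytes:
            _, evicted_size = cache.popitem(last=False)  # evict LRU document
            used -= evicted_size
        cache[url] = size
        used += size
    return hits / len(trace) if trace else 0.0

# Made-up trace: two small popular documents and one large one.
trace = [("/index.html", 2000), ("/ball.gif", 300), ("/index.html", 2000),
         ("/big.ps", 900000), ("/ball.gif", 300), ("/index.html", 2000)]
hit_rate = simulate(trace, cache_bytes=1_000_000)
print(f"hit rate: {hit_rate:.2f}")
```

A real run would read the `(url, size)` pairs from the server access logs instead of a hard-coded list.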

Popular Documents are very popular

A few documents are responsible for most of the references
One document accounts for 30% of the NCSA references
Two hundred files account for 50% to 85% of the references
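
One way to measure this concentration from a trace is to count requests per document and sum the share of the top few. The sketch below is illustrative only; `top_k_share` is a hypothetical helper and the request list is made up.

```python
from collections import Counter

def top_k_share(urls, k):
    """Fraction of all requests that go to the k most requested documents."""
    counts = Counter(urls)
    top = sum(n for _, n in counts.most_common(k))
    return top / len(urls) if urls else 0.0

# Made-up request list: one document dominates.
urls = ["/a"] * 6 + ["/b"] * 3 + ["/c"]
print(top_k_share(urls, 1))
print(top_k_share(urls, 2))
```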

Cache Size

Small caches (1 MB) result in 35-70% hit rates
Because Popular Documents are small!
E.g.: The little "red/yellow" ball/bullet is one of the most popular documents

Caching only small Documents I

Cache only documents that are smaller than a THRESHOLD:
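
A minimal sketch of such an admission rule, assuming an LRU cache underneath (the slides do not name the replacement policy). The `ThresholdCache` class and its method names are hypothetical.

```python
from collections import OrderedDict

class ThresholdCache:
    """LRU cache that only admits documents smaller than a size threshold."""

    def __init__(self, capacity, threshold):
        self.capacity = capacity      # total cache size in bytes
        self.threshold = threshold    # admit only documents smaller than this
        self.docs = OrderedDict()     # url -> size, least recently used first
        self.used = 0

    def request(self, url, size):
        """Return True on a hit; on a miss, cache the document only if it is small."""
        if url in self.docs:
            self.docs.move_to_end(url)
            return True
        if size < self.threshold and size <= self.capacity:
            while self.used + size > self.capacity:
                _, evicted_size = self.docs.popitem(last=False)
                self.used -= evicted_size
            self.docs[url] = size
            self.used += size
        return False

cache = ThresholdCache(capacity=1_000_000, threshold=16 * 1024)
requests = [("/ball.gif", 300), ("/movie.mpg", 500_000),
            ("/ball.gif", 300), ("/movie.mpg", 500_000)]
results = [cache.request(url, size) for url, size in requests]
```

In this made-up example the small image is cached after its first request, while the large file is never admitted.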

Caching only small Documents II

The best value of the THRESHOLD depends on the
- size of the cache
- size of documents provided by a server

What is the optimal THRESHOLD?

NCSA performs best with a 32 KB threshold
Parallab performs best with a 4 KB threshold
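
The best static threshold for a given trace can be found by replaying the trace once per candidate value. The sketch below assumes an LRU cache and uses a made-up trace; `hit_rate` and `best_static_threshold` are hypothetical names. It also shows how admitting large documents can push small popular ones out of a small cache.

```python
from collections import OrderedDict

def hit_rate(trace, capacity, threshold):
    """Hit rate of an LRU cache that admits only documents smaller than threshold."""
    cache, used, hits = OrderedDict(), 0, 0
    for url, size in trace:
        if url in cache:
            hits += 1
            cache.move_to_end(url)
        elif size < threshold and size <= capacity:
            while used + size > capacity:
                used -= cache.popitem(last=False)[1]  # evict LRU document
            cache[url] = size
            used += size
    return hits / len(trace) if trace else 0.0

def best_static_threshold(trace, capacity, candidates):
    """Candidate threshold with the highest hit rate on this trace."""
    return max(candidates, key=lambda t: hit_rate(trace, capacity, t))

# Made-up trace: two small popular documents, two large one-off documents.
trace = [("/a", 100), ("/b", 100), ("/big1", 950), ("/a", 100),
         ("/b", 100), ("/big2", 950), ("/a", 100), ("/b", 100)]
best = best_static_threshold(trace, capacity=1000, candidates=[64, 512, 4096])
```

Here the middle candidate wins: the smallest threshold caches nothing, and the largest lets each big document evict both small ones.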

Adaptive Threshold Tuning

Start with an initial Threshold (e.g. 16 Kbytes)
Periodically increase/decrease the threshold
If performance improves, keep increasing (decreasing) it
Otherwise reverse direction and start decreasing (increasing) it
Provide barriers to filter out noise
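
The tuning loop above can be sketched as a simple hill climber. Everything beyond the steps listed on the slide is an assumption: the `AdaptiveThreshold` class, the doubling/halving step, the lower and upper bounds, and the reading of "barriers" as a dead band that ignores small hit-rate fluctuations.

```python
class AdaptiveThreshold:
    """Hill-climbing tuner for the caching size threshold (illustrative sketch)."""

    def __init__(self, initial=16 * 1024, factor=2,
                 low=1024, high=1024 * 1024, noise=0.01):
        self.value = initial          # current threshold in bytes
        self.factor = factor          # assumed step: multiply or divide by this
        self.low, self.high = low, high
        self.noise = noise            # assumed dead band: ignore smaller drops
        self.direction = +1           # +1 increasing, -1 decreasing
        self.last_rate = None

    def update(self, hit_rate):
        """Call once per period with the hit rate measured over that period."""
        # Reverse direction only if performance dropped by more than the dead band.
        if self.last_rate is not None and hit_rate < self.last_rate - self.noise:
            self.direction = -self.direction
        self.last_rate = hit_rate
        if self.direction > 0:
            self.value = min(self.value * self.factor, self.high)
        else:
            self.value = max(self.value // self.factor, self.low)
        return self.value

tuner = AdaptiveThreshold()
# Made-up per-period hit rates.
values = [tuner.update(rate) for rate in [0.40, 0.45, 0.30, 0.35, 0.345]]
```

With these made-up hit rates the threshold grows while performance holds, reverses after the sharp drop, and the last small dip is treated as noise.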

Performance of Adaptive Tuning I

Caching all documents is not a good idea
ADAPTIVE is always close to the BEST

Performance of Adaptive Tuning II

ADAPTIVE is always close to the BEST
ADAPTIVE even outperforms the BEST static threshold (Rochester)
Because the THRESHOLD also depends on time/date
E.g. after large popular documents are placed on a server, the THRESHOLD should increase

CONCLUSIONS

A small cache (1-2MB) results in significant hit rates
Only small documents should be cached
The exact value of THRESHOLD should change with time/date/server
Adaptive THRESHOLD tuning has been shown to provide stable results for several servers
Main Memory Caching using adaptive THRESHOLD is robust