Main Memory Caching of Web Documents
by Evangelos P. Markatos, ICS-FORTH, Crete, Greece
For more information, see the Web page at http://www.ics.forth.gr/proj/arch-vlsi/www.html
FO.R.T.H.
Motivation
- Some servers are really popular
- Up to a million requests per day - one every 50 ms
- Future requests will be even more frequent
- Web Servers vs. File Servers
- Servers should rapidly handle requests
- low processing time per request
- few disk accesses
Main Memory Caching
- Servers should cache popular documents in main memory
- Much like file systems, but:
- Web Documents are read only
- Web documents are read once by each client
- Clients do not reuse data
- Web Clients read entire documents
Main Memory Caching Issues
- Which documents to cache?
- Which documents to replace?
- Working set: how large must the cache be?
- Do the answers depend on:
- the Web server?
- the time/date?
- types of documents provided?
- the cache size?
Methodology
- Trace Driven Simulation
- Gathered server access traces from
- NCSA: home of Mosaic (USA)
- University of Rochester (USA)
- University of Bergen (Norway)
- University of Crete (Greece)
- ICS-FORTH (Greece)
- More than 1 million accesses in total
- Several weeks of accesses
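A trace-driven simulation of this kind can be sketched as follows. This is a minimal illustration, not the simulator used for these results: the replacement policy (LRU), the function name simulate_lru, and the toy trace are all assumptions made for the example.

```python
from collections import OrderedDict

def simulate_lru(trace, cache_bytes):
    """Replay (url, size) requests through an LRU cache; return the hit rate."""
    cache = OrderedDict()   # url -> size, least recently used first
    used = 0                # bytes currently held in the cache
    hits = 0
    for url, size in trace:
        if url in cache:
            hits += 1
            cache.move_to_end(url)      # mark as most recently used
            continue
        if size > cache_bytes:
            continue                    # document can never fit in the cache
        # Evict least recently used documents until the new one fits.
        while used + size > cache_bytes:
            _, evicted_size = cache.popitem(last=False)
            used -= evicted_size
        cache[url] = size
        used += size
    return hits / len(trace) if trace else 0.0

# Hypothetical toy trace: (document, size in bytes)
trace = [("/ball.gif", 200), ("/index.html", 3000), ("/ball.gif", 200),
         ("/big.ps", 900000), ("/ball.gif", 200), ("/index.html", 3000)]
hit_rate = simulate_lru(trace, cache_bytes=1_000_000)
```

In practice the trace would be parsed from a server access log, with one (document, size) pair per request.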
Popular Documents are very popular
- A few documents are responsible for a large share of the references
- One document accounts for 30% of the NCSA references
- Two hundred files account for 50% to 85% of the references
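One way to measure this concentration from an access trace is to count requests per document and sum the counts of the k most requested ones. The sketch below assumes the trace is simply a list of requested document names; top_k_share and the sample data are hypothetical.

```python
from collections import Counter

def top_k_share(requests, k):
    """Fraction of all requests that go to the k most requested documents."""
    counts = Counter(requests)
    top = sum(n for _, n in counts.most_common(k))
    return top / len(requests) if requests else 0.0

# Hypothetical trace: one entry per request
requests = ["/a"] * 6 + ["/b"] * 2 + ["/c", "/d"]
share_top1 = top_k_share(requests, 1)
share_top2 = top_k_share(requests, 2)
```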
Cache Size
- Small caches (1 MB) result in 35-70% hit rates
- Because Popular Documents are small!
- E.g.: The little "red/yellow" ball/bullet is one of the most popular documents
Caching only small Documents I
- Cache only documents that are smaller than a THRESHOLD
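The effect of such an admission rule can be illustrated with a small simulation. This is a sketch under assumptions: LRU replacement, the name simulate_threshold, and the toy trace are not taken from the slides. Only documents strictly smaller than the threshold are admitted to the cache.

```python
from collections import OrderedDict

def simulate_threshold(trace, cache_bytes, threshold):
    """LRU cache that admits only documents smaller than `threshold` bytes."""
    cache = OrderedDict()   # url -> size, least recently used first
    used = 0
    hits = 0
    for url, size in trace:
        if url in cache:
            hits += 1
            cache.move_to_end(url)
            continue
        if size >= threshold or size > cache_bytes:
            continue        # too large: serve it, but do not cache it
        while used + size > cache_bytes:
            _, evicted_size = cache.popitem(last=False)
            used -= evicted_size
        cache[url] = size
        used += size
    return hits / len(trace) if trace else 0.0

# Hypothetical trace: two small documents and one large one, requested in turn
trace = [("/a", 1000), ("/b", 1000), ("/big", 3500)] * 2 + [("/a", 1000), ("/b", 1000)]
cache_all = simulate_threshold(trace, 4000, float("inf"))   # no threshold
small_only = simulate_threshold(trace, 4000, 2000)          # 2000-byte threshold
```

In this toy trace the large document keeps evicting the two small ones when everything is cached, so the unrestricted cache never hits, while the thresholded cache hits on every repeat request for a small document.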
Caching only small Documents II
- The best value of the THRESHOLD depends on the
- size of the cache
- size of documents provided by a server
What is the optimal THRESHOLD?
- NCSA performs best with a 32 Kbyte threshold
- Parallab performs best with a 4 Kbyte threshold
Adaptive Threshold Tuning
- Start with an initial Threshold (e.g. 16 Kbytes)
- Periodically increase/decrease the threshold
- If performance improves, keep changing the threshold in the same direction
- Otherwise, reverse direction
- Provide barriers to suppress noise
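The tuning loop above can be sketched as a simple hill-climbing controller. Everything beyond the bullets is an assumption made for illustration: the class name, the doubling/halving step, the lower and upper bounds, and the reading of "barriers" as a tolerance on hit-rate changes so that small fluctuations do not reverse the direction.

```python
class AdaptiveThreshold:
    """Hill-climbing tuner for the caching size threshold (illustrative only)."""

    def __init__(self, initial=16 * 1024, factor=2.0,
                 lo=1024, hi=1024 * 1024, tolerance=0.01):
        self.value = initial          # current threshold in bytes
        self.factor = factor          # assumed multiplicative step
        self.lo, self.hi = lo, hi     # assumed bounds on the threshold
        self.tolerance = tolerance    # assumed noise barrier on the hit rate
        self.direction = +1           # +1: increasing, -1: decreasing
        self.last_hit_rate = None

    def update(self, hit_rate):
        """Call once per measurement period with that period's hit rate."""
        if self.last_hit_rate is not None:
            # Reverse only if the hit rate dropped by more than the tolerance.
            if hit_rate < self.last_hit_rate - self.tolerance:
                self.direction = -self.direction
        self.last_hit_rate = hit_rate
        if self.direction > 0:
            candidate = self.value * self.factor
        else:
            candidate = self.value / self.factor
        # Keep the threshold inside the assumed bounds.
        self.value = int(min(self.hi, max(self.lo, candidate)))
        return self.value

tuner = AdaptiveThreshold()
# Hypothetical hit rates observed over four consecutive periods
values = [tuner.update(r) for r in (0.40, 0.45, 0.30, 0.44)]
```

Here the threshold grows while the hit rate holds or improves, and starts shrinking after the sharp drop in the third period.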
Performance of Adaptive Tuning I
- Caching all documents is not a good idea
- ADAPTIVE is always close to the BEST
Performance of Adaptive Tuning II
- ADAPTIVE is always close to the BEST
- ADAPTIVE is even better than the BEST static (Rochester)
- Because THRESHOLD also depends on time/date
- e.g. after some large popular documents are put on a server, the THRESHOLD should increase
CONCLUSIONS
- A small cache (1-2MB) results in significant hit rates
- Only small documents should be cached
- The exact value of THRESHOLD should change with time/date/server
- Adaptive THRESHOLD tuning has been shown to provide stable results for several servers
- Main Memory Caching using adaptive THRESHOLD is robust