The influence of geographical and cultural issues on the cache proxy server workload

Vírgilio F. Almeida, Márcio G. Cesário, Rodrigo C. Fonseca,
Wagner Meira Jr., and Cristina D. Murta*

Computer Science Department, Federal University of Minas Gerais,
Belo Horizonte, MG, Brazil

virgilio@dcc.ufmg.br, magc@dcc.ufmg.br,
rfonseca@dcc.ufmg.br, meira@dcc.ufmg.br, and
cristina@dcc.ufmg.br

Abstract
A key characteristic of the Internet is its global diffusion, marked by rapid growth in the number of hosts and international links around the world. This diffusion has been accompanied by serious performance problems, such as long response times, server overload and network congestion. Caching has become a standard way to mitigate these problems. In this paper, we analyze logs of caching proxy servers and show evidence that geographical, cultural and social issues have a strong influence on the workload of a proxy server. The cultural and social context therefore provides relevant information for planning efficient caching proxy architectures.

Keywords
WWW; Caching; Internationalization; Internet; Performance

1. Introduction

A key characteristic of the Internet is its global diffusion, marked by rapid growth in the number of hosts and international links around the world [10]. The fastest growing nations between 1996 and 1997 were Japan, Malaysia, Singapore, Korea, and Brazil, which confirms the widespread penetration of the information society. However, reference [8] also shows that the exponential growth of the Internet has been accompanied by serious performance problems. Caching proxy servers have been used to reduce server overload and network congestion, since they attempt to bring the data as close to the client as possible.

The way users access the Internet depends heavily upon the telecommunication infrastructure and social context of each country. Thus, to understand the performance behavior of the WWW, one must consider geographical, cultural and social issues. Based on the analysis of different proxy server logs, this paper shows evidence of the influence of these issues on the workload of caching proxy servers. Our approach is to examine the meaning of statistics drawn from logs of a busy Brazilian proxy server in light of geographical and cultural issues.

Characteristics of Web cache workloads have been studied in several references [5,2], but none of them mentions geographical, social or cultural influences. Some studies help to understand the relevance of geographical aspects related to caching [6,9]. To our knowledge, no reference investigates the relationship between cache workload and cultural and social issues.

2. The Internet infrastructure in Brazil

The history of the Internet in Brazil dates back to 1989, with the implementation of the National Research Network's backbone (RNP), which provides Internet access throughout the country. Points of Presence (POPs) were created in most states of the country to provide universities and institutions with a link to the Internet. Like other countries, Brazil has seen exponential growth of the Internet in its territory. As of January 1997, Brazil ranks 19th in the world in number of hosts, and 3rd in the Americas, after the US and Canada. According to [11], the number of .com hosts in the Brazilian national domain (.com) has grown 1947% from January 1996 to July 1997.

3. Analyzing logs

Cache proxy servers throughout the world exhibit different access patterns. Using data available at [4], we compiled statistics for cache proxy servers in several countries. We observed that in five countries (USA, Brazil, Japan, Italy, and Taiwan) the majority of accesses are to their national domain. In other countries, such as Belgium and the Netherlands, most accesses are directed to the .com domain. The hit ratio for objects in the national domain is always higher than the hit ratio for objects from other domains.
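The per-domain figures discussed above can be derived from any proxy log that records, for each request, the URL and whether it was served from the cache. The following Python fragment is a minimal sketch of that aggregation, not the procedure actually used for this study. The record layout (a `(url, was_hit)` pair), the function names and the sample URLs are assumptions made for illustration; a real log would first have to be parsed into this form.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def top_level_domain(url):
    """Return the last label of the URL's host name, e.g. 'br' or 'com'.

    Note: for a numeric IP address this returns the last octet, so such
    requests would need separate handling in a real analysis.
    """
    host = urlsplit(url).hostname or ""
    return host.rsplit(".", 1)[-1].lower()


def per_domain_stats(records):
    """Aggregate request share and hit ratio per top-level domain.

    records: iterable of (url, was_hit) pairs (hypothetical layout).
    """
    requests = defaultdict(int)
    hits = defaultdict(int)
    for url, was_hit in records:
        tld = top_level_domain(url)
        requests[tld] += 1
        if was_hit:
            hits[tld] += 1
    total = sum(requests.values())
    return {
        tld: {
            "share": requests[tld] / total,
            "hit_ratio": hits[tld] / requests[tld],
        }
        for tld in requests
    }


# Made-up sample records, for illustration only.
sample = [
    ("http://www.dcc.ufmg.br/index.html", True),
    ("http://www.dcc.ufmg.br/index.html", True),
    ("http://www.example.com/a.gif", False),
    ("http://www.example.com/b.gif", True),
]
stats = per_domain_stats(sample)
```

With the sample above, `stats["br"]` and `stats["com"]` each hold a request share and a hit ratio, which is the pair of quantities compared across countries in this section.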

We analyze the influence of cultural characteristics on the proxy server workload by studying 4,235,311 requests to the proxy server of POP-MG. POP-MG has a total bandwidth of 7 Mbps and a measured average total traffic rate close to 5 Mbps. It is the main gateway to the Internet for twenty-five universities and a hundred business organizations, including Internet Service Providers. The requests correspond to a ten-day operation interval and come from both commercial (40%) and educational (60%) organizations. The amount of transferred data is more than 25 gigabytes.

   
Table 1. Statistics for accesses to the first level proxy of POP-MG

                     All                      .br                      .com
Requests             4,235,311 (100% Req)     2,146,625 (50.2% Req)    1,402,558 (33.2% Req)
Objects              1,079,044 (100% Obj)     319,937 (29.7% Obj)      518,256 (48.0% Obj)
Accesses/object      3.92                     6.70                     2.70
1-access objects     709,759 (65.8% Obj)      180,385 (16.7% Obj)      349,737 (32.4% Obj)
Non-first accesses   74.52%                   85.10%                   63.05%
Hit ratio            47%                      58%                      36%

Table 1 displays workload statistics. The column labeled "All" covers all requests handled by the proxy during the period, and the other two columns cover the two most accessed domains, .br and .com, which together account for more than 80% of the accesses.

Results available in [1] show that the majority of accesses in some countries (e.g., Brazil, USA, Japan, and Italy) are to their national domains. By analyzing the results of the Brazilian Internet User Survey [7] and considering the telecommunication infrastructure, we arrive at the following explanations for the high percentage of accesses to the .br domain: (1) only 58% of the users speak English and are thus able to use English-language sites; (2) most Brazilian users are interested in news (80%), scientific information (67%), music (67%), and adult entertainment (61%), which are topics heavily related to regional culture; and (3) accesses to Brazilian sites are usually faster, since they do not require traversing busy international links.

The second observation regards the average number of accesses per object and the hit ratio. Both are much higher for Brazilian objects (6.7 and 58%, respectively) than for objects from the .com domain (2.7 and 36%, respectively). This is explained not only by the number of accesses to .br sites, but also by the fact that the number of unique Brazilian objects (319,937) is significantly smaller than the number of unique objects from the .com domain (518,256).
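The derived rows of Table 1 follow directly from the request and object counts, and the short Python sketch below recomputes them as a consistency check. The function and variable names are illustrative assumptions; only the counts come from Table 1. The non-first-access share is computed as (requests - objects) / requests. Assuming the cache starts empty, the first access to each object must be a miss, so this share is an upper bound on the achievable hit ratio, which is consistent with the measured hit ratios lying below it in every column.

```python
# Request and object counts copied from Table 1.
TABLE_1 = {
    "All":  {"requests": 4_235_311, "objects": 1_079_044},
    ".br":  {"requests": 2_146_625, "objects": 319_937},
    ".com": {"requests": 1_402_558, "objects": 518_256},
}


def accesses_per_object(requests, objects):
    """Average number of requests per distinct object."""
    return requests / objects


def non_first_access_share(requests, objects):
    """Fraction of requests that are not the first access to their object."""
    return (requests - objects) / requests


derived = {
    name: {
        "accesses_per_object": accesses_per_object(**row),
        "non_first": non_first_access_share(**row),
    }
    for name, row in TABLE_1.items()
}

for name, d in derived.items():
    print(f"{name:5s} {d['accesses_per_object']:.2f} {100 * d['non_first']:.2f}%")
```

The recomputed non-first-access shares match the table to two decimal places, and the accesses-per-object ratios agree with it up to a difference of 0.01 in the last digit.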

By examining the POP-MG and NLANR logs, we found a significant difference in the popularity of HTTP-based chat sites. Accesses to sites with chat applications correspond to 4.9% of the total accesses recorded in the POP-MG log. In the US, requests to chat sites represent 1.2% of the accesses recorded at NLANR's servers. It is worth noting that Web chat sites are among the most popular sites in Brazil. This characteristic is important for caching projects, because chat pages are dynamic and cannot be cached.

Our last observation regards the telecommunication infrastructure. In Brazil, telecommunication services are more expensive than in the US. Thus, most Internet users tend to navigate the WWW in periods when telephone rates are lower, and we observed heavy peak loads in those low-rate periods. Using the logs, we calculated the hourly arrival rates for the proxy servers of NLANR and POP-MG. We noticed high variability in the load, which differs between the two servers because of the different tariff schemes in Brazil and the US. The traffic patterns seen at POP-MG follow the phone rate variations. During the least expensive period, the peak arrival rate is 116% higher than the average rate. At the NLANR servers, the corresponding figure falls to 46%. This type of information is useful for planning the capacity of proxy servers, which should be able to handle the peak load.
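The peak-versus-average comparison above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool used for the measurements: request times are assumed to be available as Unix timestamps, hours are taken in UTC for reproducibility (matching tariff periods would require the server's local time zone), and the function names and sample counts are made up.

```python
from collections import Counter
from datetime import datetime, timezone


def hourly_counts(timestamps):
    """Count requests per hour of day (0-23) from Unix timestamps, in UTC."""
    counts = Counter(
        datetime.fromtimestamp(ts, tz=timezone.utc).hour for ts in timestamps
    )
    return [counts.get(hour, 0) for hour in range(24)]


def peak_excess_over_average(counts):
    """Return how far the busiest hour exceeds the mean hourly count.

    The result is a fraction: 1.16 means the peak is 116% above the average.
    """
    average = sum(counts) / len(counts)
    if average == 0:
        raise ValueError("no requests in the observation period")
    return max(counts) / average - 1.0


# Made-up hourly counts, chosen so that the mean is 100 and the peak is 216.
example_counts = [216] + [95] * 22 + [94]
excess = peak_excess_over_average(example_counts)
```

For capacity planning, the quantity of interest is the peak itself: a proxy sized for the average hourly rate would be overloaded by more than a factor of two under a profile like the example above.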

4. Conclusions

In this paper, we have analyzed the logs of a busy cache proxy server in light of geographical and cultural issues such as language, social interaction, and cost of bandwidth. We noted a correlation between national characteristics (taking Brazil as our example) and the quantitative behavior of a cache proxy server, represented by the percentage of accesses to the national domain, the hit ratio for each domain and pronounced peaks in traffic. As noted by [3], Brazilians naturally like to chat, and this is reflected in a high percentage of accesses to chat sites, as compared to an American server. Language and interest in regional information, according to a Brazilian WWW user survey [7], as well as the limited bandwidth of international links, explain the high percentage of accesses from Brazilian users to pages on Brazilian sites and the high hit ratios observed in the cache of POP-MG. The tariff scheme adopted by the local phone company, a strong geographical factor, is found to have a significant influence on the traffic patterns of POP-MG's cache server. These conclusions are being used to define the architecture of the POP-MG cache proxy hierarchy (e.g., domain-based caching), as well as to size cache capacity to handle load peaks.

References

1
V. Almeida, M. Cesário, R. Fonseca, W. Meira Jr., and C. Murta,
The influence of geographical and cultural issues on the cache proxy server workload,
http://www.dcc.ufmg.br/anades/submissions/habits/

2
A. Bestavros, C.R. Cunha and M.E. Crovella,
Characteristics of WWW client-based traces,
Technical Report TR-95-010, Boston University Computer Science Department, 1995.

3
M. Eakin,
Brazil: The Once and Future Country,
St. Martin's Press, New York, NY, 1997.

4
National Laboratory for Applied Network Research,
Cache statistics pages,
http://ircache.nlanr.net/Cache/cache-stats-links.html

5
M. Abrams, G. Abdulla, E.A. Fox and S. Williams,
WWW proxy traffic characterization with application to caching,
Technical Report 97-04, Virginia Tech, Computer Science Department, 1997.

6
J. Gwertzman and M. Seltzer,
The case for geographical push-caching,
in: Proc. of the 5th Annual Workshop on Hot Operating Systems, May 1995, pp. 51–55.

7
IBOPE,
2a. Pesquisa Cadê?/IBOPE,
http://www.ibope.com.br/cade97/welcome.htm

8
C. Kehoe and J. Pitkow,
Surveying the territory: GVU's five WWW user surveys,
The World Wide Web Journal, 1(3), 1996.

9
M. Nabeshima,
The Japan cache project: an experiment on domain cache,
in: Proc. 6th International World Wide Web Conference, 1997.

10
L. Press,
Tracking the global diffusion of the Internet,
Communications of the ACM, 40(11): 11–17, November 1997.

11
Brazilian Science and Technology Ministry,
Hosts por Domínio,
http://www.gt-er.cg.org.br/estatisticas/hosts/tab-host.html
