Dr John Graham-Cumming
Optimal Networks, Inc.
Palo Alto, CA
jgc@optimal.com
With the growth of the Internet, many corporations have provided free access to the World Wide Web from employees' personal computers. Control of the traffic flowing into and out of the corporate network has been limited to security considerations, but a number of corporations have become wary of the amount of time employees spend surfing the Internet, the cost of the connection to the Internet, and possible legal exposure from the material brought into the company across the World Wide Web.
This paper examines the results of a year of studying the inbound and outbound Internet traffic for a number of corporations. That analysis has led to a number of surprising findings about the nature of Web traffic and to practical recommendations for network managers hoping to contain Internet connection costs and limit exposure to legal difficulties.
Using the software, we have gathered a large amount of data about the nature of World Wide Web traffic from a number of large corporations. Whilst it is always hard to define any single corporation or network as typical, we have drawn together the results of a year of watching the web to provide a picture of average Internet use.
Initial reactions to connecting a company to the Internet tend to revolve around issues of security; exactly how much the Internet connection will be used and for what purpose are generally left as an experiment. Internet connectivity is, for many companies, a new frontier and employees are the explorers.
Some corporations have become concerned about the use to which the Internet connection is put and, spurred by legal considerations, have asked for either detailed information about the sites being surfed or all-out blocking of certain undesirable areas of the Internet. The term "undesirable" has proved subjective and, with the explosive growth of web sites, maintaining and controlling lists of banned sites has become a significant burden for network managers.
Some managers chose to first understand the traffic on the Internet connection using tools already in their possession (typically information from the Internet router, logs from a proxy server, if installed, or output from a protocol analyzer). Those tools fall short of answering, in a simple and speedy manner, questions about which web sites are being visited, and, more importantly, how much network bandwidth is being consumed by those visits.
Newer tools help to ease the burden of understanding the nature of web traffic and in doing so have led to some surprising results concerning the use to which corporate Internet connections are put. In addition, many of these "problems" are easily resolved through configuration and policy management, and we lay out some of these results and solutions in this paper.
Previous studies have required user participation (see, for example, [1]). Our study was performed without user interaction, and in doing so we lost user demographic information. The interested reader should read [3] for more information surrounding measurement of web site usage and Internet traffic patterns.
The data was sanitized by removing all user information and replacing each user's IP address or DNS name with a unique integer. Data was collated in Microsoft Access and exported to Microsoft Excel for analysis. All the analysis here comes from the Excel spreadsheet, and the graphs generated were then converted to GIF files. The HTML document was prepared using Microsoft Notepad.
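The sanitization step described above can be illustrated with a minimal sketch. It is not the procedure actually used (which ran in Microsoft Access); the record layout, the `make_anonymizer` helper and the sample addresses are assumptions made purely for illustration.

```python
from itertools import count


def make_anonymizer():
    """Return a function that maps each distinct user identifier
    (IP address or DNS name) to a stable, unique integer."""
    ids = {}
    counter = count(1)

    def anonymize(identifier):
        # Normalize so that "PC17.Example.com" and "pc17.example.com" match.
        key = identifier.strip().lower()
        if key not in ids:
            ids[key] = next(counter)
        return ids[key]

    return anonymize


# Hypothetical monitor records: (user identifier, site, bytes transferred).
records = [
    ("10.0.0.5", "www.yahoo.com", 2048),
    ("pc17.example.com", "home.netscape.com", 512),
    ("10.0.0.5", "www.cnn.com", 4096),
]

anonymize = make_anonymizer()
sanitized = [(anonymize(user), site, nbytes) for user, site, nbytes in records]
print(sanitized)
```

Because the mapping is kept only in memory and never written out alongside the data, the integers preserve per-user grouping without revealing who the user was.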
The data represents around 4000 users, accessing around 95000 different URLs (the Internet Monitor version used collects only the top level of the URL, the domain name, and hence these 95000 URLs represent 95000 different DNS names and not individual pages; for example, all of Yahoo's search result pages are represented as search.yahoo.com). The total time of data collection spans around 4 months, with data from April 1996 and September through November 1996. Apart from the Atlanta InterOp, the majority of data collection was performed on the West Coast of the United States. In all, just under 100GB (gigabytes) of traffic was collected.
In addition to removing all user names from the databases, we also removed all Intranet servers (our intent was to look largely at Internet traffic) and all FTP and News traffic. The majority of users were using a version of Netscape Navigator, and a large percentage (further details are discussed below) were using the PointCast Network. The networks were connected to the Internet by a variety of means: ISDN, T1 and faster links. To compensate for possible skew caused by different media speeds, we have examined only total bytes transferred and have not made any inquiries into bandwidth utilization.
We do not claim that our sample is statistically accurate: no effort was expended to gather demographic information about the users monitored, and in most cases the users were unaware of the monitoring. We have noted that in all circumstances in which monitoring is advertised, user behavior changes drastically.
We do believe, however, that our data is genuine. No corporation has influenced the data collected in this informal study, and the data is presented without bias towards a particular web site or view. We hope to encourage further study of Internet use, patterns of user behavior and bandwidth utilization to create a more dynamic environment for the inhabitants of cyberspace and to promote the optimization of networks and web sites.
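As a rough illustration of what collecting only the domain name means, the sketch below reduces full URLs to their host names using Python's standard `urllib.parse`. The `site_key` helper and the sample URLs are hypothetical; the Internet Monitor's own implementation is not described here.

```python
from urllib.parse import urlsplit


def site_key(url):
    """Reduce a full URL to the DNS name of the server, discarding
    the path, query string and port."""
    host = urlsplit(url).hostname  # lower-cased by urlsplit, None if absent
    return host or ""


# Hypothetical requests: three different pages, two distinct DNS names.
urls = [
    "http://search.yahoo.com/bin/search?p=networks",
    "http://search.yahoo.com/bin/search?p=routers",
    "http://WWW.CNN.com:80/index.html",
]

sites = {site_key(u) for u in urls}
print(sorted(sites))
```

Under this scheme the two Yahoo search result pages collapse into the single entry search.yahoo.com, which is why the count of URLs in the study is a count of DNS names rather than of pages.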
Below we make recommendations about possible steps that can be taken to minimize the impact of the Internet on corporate networks; we have made no study of home use of the Internet and our results are currently only applicable to corporate network installations.
The first data to look at is the collection of the top 10 web sites that we have seen. Figure 1 shows a graph of the top 10 web sites listed by URL and showing the percentage of total bytes collected that each web site generated.
Figure 1. Ten most bandwidth-consuming Web sites.
It is immediately interesting to note that in our study the PointCast Network and Netscape are the biggest consumers of network bandwidth. PointCast accounts for over 17% of the network traffic and Netscape for over 12%. After PointCast and Netscape there is a significant fall in the traffic to any one web site.
To determine the distribution of interest in web sites we examine the web sites ordered by the number of users who access the sites. Figure 2 shows the top 10 web sites ordered by the percentage of users who access that site.
Figure 2. Ten most popular Web sites.
Figure 2 gives more of an indication of the activity of the user group: 70% regularly access the Netscape home page and 20% use Yahoo, whilst around 12% are using the PointCast Network. Search sites Lycos and Excite appear here also, as does the Microsoft home page. Two interesting conclusions can be drawn from these two charts: Netscape's homepage is extremely popular, accounting for 12% of network traffic generated by 70% of users, and yet the PointCast Network, in use by only 12% of users, is generating 17% of all traffic.
Two phenomena are at work. First, the small number of users connected to PointCast generate a large amount of traffic because of the continuous and uncontrolled nature of that application's Internet use. Second, the large number of users contacting Netscape would generate even more traffic were it not that the majority of hits on the Netscape homepage are generated automatically by the browser and do not result in the entire homepage being downloaded.
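The contrast between the two sites can be made concrete with a little arithmetic. The sketch below uses only the rounded percentages quoted above, so the result is a rough estimate rather than a measured figure; the `relative_load` helper is introduced here for illustration.

```python
# Rounded shares quoted in the text (fractions of all traffic / all users).
pointcast_traffic, pointcast_users = 0.17, 0.12
netscape_traffic, netscape_users = 0.12, 0.70


def relative_load(traffic_share, user_share):
    """Share of total traffic divided by share of users reaching the site.
    A value above 1 means the site's users generate a more than
    proportional amount of traffic there."""
    return traffic_share / user_share


pointcast_load = relative_load(pointcast_traffic, pointcast_users)
netscape_load = relative_load(netscape_traffic, netscape_users)
ratio = pointcast_load / netscape_load

print(f"PointCast {pointcast_load:.2f}, Netscape {netscape_load:.2f}, ratio {ratio:.1f}")
```

On these figures a PointCast user accounts for roughly eight times as much traffic at PointCast as a Netscape homepage visitor does at Netscape.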
In addition to the rather obvious inclusion of search sites in the top ten, Microsoft's influence on the Internet is clearly seen by their homepage's inclusion in both charts. The popularity of news sites (USA Today and The San Jose Mercury News) is also noted and whilst ESPN appears in Figure 1 it is also the 11th (not shown) most popular web site by our analysis. In addition both charts indicate a strong interest in finance applications with the stock quote service at Yahoo and the Quicken Financial Network's stock quote service both appearing.
Ignoring the influence of the Netscape and PointCast hits, search sites, news, sports and finance prove the most popular destinations for the web surfers in our survey.
An analysis of the sites producing 50% of all the traffic we monitored shows that the top 35 web sites are responsible for the majority of traffic monitored. Those web sites, in order of traffic generated, are shown in Figure 3. It is worth noting, as Figure 1 shows, that Netscape and the PointCast Network create a large skew in our figures, and that in order to understand which web sites users are actively seeking out, the first two entries in Figure 3 should be ignored.
Rank | URL | % of total bytes | Site Type |
---|---|---|---|
1 | pointcast.net | 17.81% | PointCast |
2 | home.netscape.com | 13.13% | Computers/High-Technology |
3 | www.yahoo.com | 2.28% | Search |
4 | www.adobe.com | 1.65% | Computers/High-Technology |
5 | espnet.sportszone.com | 1.13% | Sport |
6 | www.cnn.com | 0.85% | News |
7 | quote.yahoo.com | 0.82% | Finance |
8 | www.microsoft.com | 0.78% | Computers/High-Technology |
9 | www.usatoday.com | 0.76% | News |
10 | quotes.galt.com | 0.74% | Finance |
11 | www.excite.com | 0.70% | Search |
12 | www.dbc.com | 0.60% | Finance |
13 | www.lombard.com | 0.54% | Finance |
14 | www.martijua.com | 0.52% | Adult |
15 | www.geocities.com | 0.52% | Multiple |
16 | www.sjmercury.com | 0.52% | News |
18 | www.lycos.com | 0.49% | Search |
19 | members.aol.com | 0.45% | Multiple |
20 | ms.www.conxion.com | 0.43% | Computers/High-Technology |
21 | www.nfl.com | 0.38% | Sport |
22 | pathfinder.com | 0.38% | Search |
23 | www.grayfire.com | 0.35% | Finance |
24 | www.unitedmedia.com | 0.34% | Entertainment |
25 | quicktime.apple.com | 0.33% | Computers/High-Technology |
26 | ad.doubleclick.net | 0.33% | Advertising |
27 | www.msnbc.com | 0.33% | News |
28 | cnnfn.com | 0.32% | Finance |
29 | altavista.digital.com | 0.31% | Search |
30 | fast.quote.com | 0.30% | Finance |
31 | www.kpix.com | 0.28% | News |
32 | www.otk.com | 0.27% | Adult |
33 | www.playboy.com | 0.27% | Adult |
34 | www.bekkoame.or.jp | 0.26% | Multiple |
35 | www.sfgate.com | 0.26% | News |
36 | cnn.com | 0.25% | News |
Figure 3. Web sites representing 50% of traffic.
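A table of this kind can be derived by sorting sites by bytes transferred and accumulating their shares until the chosen fraction of traffic is covered. The following is a minimal sketch of that calculation; the function name and the byte counts are hypothetical and are not taken from the study's data.

```python
def sites_covering(bytes_by_site, fraction=0.5):
    """Return the smallest prefix of sites, ranked by bytes transferred,
    whose combined traffic reaches the given fraction of the total.
    Each entry is (site, share of total bytes)."""
    total = sum(bytes_by_site.values())
    ranked = sorted(bytes_by_site.items(), key=lambda item: item[1], reverse=True)

    selected = []
    running = 0
    for site, nbytes in ranked:
        selected.append((site, nbytes / total))
        running += nbytes
        if running >= fraction * total:
            break
    return selected


# Hypothetical byte counts per site.
traffic = {
    "a.example": 400,
    "b.example": 300,
    "c.example": 200,
    "d.example": 100,
}

for rank, (site, share) in enumerate(sites_covering(traffic), start=1):
    print(f"{rank} | {site} | {share:.2%}")
```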
To look a little further into the interests of users, we have categorized the web sites into the following 7 groups: Search, Finance, Computer and High-Technology, News, Sport, Entertainment, and Adult. Applying these categories to the sites in Figure 3 (those representing 50% of all monitored Internet traffic) gives a breakdown of areas of interest. Figure 4 shows the categories of interest to the Internet users monitored.
Figure 4. Internet use by category (based on total bytes).
Finally we look at the activities of particular users. Figure 5 shows the distribution of bytes transferred during our monitoring against the number of users. We have collated the information taking the total number of bytes transferred to one significant figure and plotted the number of users falling into each group.
Figure 5. Distribution of user activity (based on total bytes).
Fifteen users account for 10% of the data (our top user accounted for 1.5% of all traffic collected). Users in the top 10% accessed thousands of URLs. Users centered around the second peak (which is caused in part by the Netscape and PointCast traffic) seemed to have a regular collection of pages that they visited (for example, the CNN homepage showed daily interest), as did users centered around the first major peak. We have not analyzed the activity of users at the low end; their activity is unlikely to cause major concern for a network manager controlling bandwidth.
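The grouping behind Figure 5 (each user's total bytes taken to one significant figure, then users counted per group) can be sketched as follows. The original analysis was done in Excel; the helper name, the round-half-up choice and the sample totals are assumptions for illustration only.

```python
from collections import Counter


def one_significant_figure(n):
    """Round a non-negative integer to one significant figure
    (halves round up), using integer arithmetic only."""
    if n <= 0:
        return 0
    magnitude = 10 ** (len(str(n)) - 1)  # e.g. 1499 -> 1000
    return (n + magnitude // 2) // magnitude * magnitude


# Hypothetical total bytes per anonymized user.
bytes_by_user = {1: 1499, 2: 1200, 3: 2500, 4: 34, 5: 950}

# Number of users falling into each rounded group.
distribution = Counter(one_significant_figure(b) for b in bytes_by_user.values())

for group in sorted(distribution):
    print(group, distribution[group])
```

Plotting the group value against the user count gives a distribution of the same kind as Figure 5, in which clusters of users with similar totals appear as peaks.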
We define a number of different phenomena seen from the example above, give their explanation and offer possible remedies where appropriate.
This indicates that a network manager can save significant amounts of network bandwidth (in the example above, 12% of the Internet traffic is consumed by these phantom requests to the Netscape homepage). Such requests are currently counted as actual requests by the user of the Web browser, as if the user had actively selected the site; in most cases the user is uninterested in the Web site being loaded, as evidenced by the high rate of aborts on the default homepages.
Network managers should consider either disabling the homepage feature in the browser or using it to point to useful Intranet information. In organizations where the Intranet page is the default homepage, further monitoring will be required to ensure that the automatic requests to that page do not create a problem on the Intranet.
We do not recommend that corporations ban such applications outright, but these applications (and in this category we include applications providing streaming audio and video) are significant users of bandwidth; as such, they should be monitored to determine whether they have a serious impact on network performance. In addition, the use of a caching proxy server may help alleviate some bandwidth problems.
We also recommend considering software such as Packeteer's PacketShaper that can be used to restrict the bandwidth used by specific Internet applications.
Such Web sites present two dangers: information presented by the Web site might be inappropriate for internal or external consumption, and the use of freeware or shareware Web servers can cause security problems, since a badly configured Web server could provide access to proprietary information.
There are a number of caveats: we are unclear how a generic proxy might handle the PointCast Network, and a significant reduction in bandwidth use can be made by simply reconfiguring the web browser. In addition, the presence of advertising in Figure 3 (at rank 26) indicates that caching may be ineffective due to the constantly changing nature of advertising banners.
Although 50% of the traffic was generated by only 35 sites, the other 50% was generated by a mixed bag of many thousands of sites, indicating that the varied interests of Internet users make effective caching difficult to achieve. Overall we suggest sensible configuration of Internet browsers and then careful consideration of caching technologies for specific web sites or applications; these changes will have a far greater impact than a simple least recently used caching algorithm.
Whilst we encourage monitoring of networks to help tune network and browser performance, we believe it is essential that corporations inform their users that monitoring technology may be used. We suspect that the widespread use of Internet monitors would shift web browsing habits from the office to home. It is also essential that users of the Internet realize that their activities are not anonymous: even without the introduction of special Internet monitors, standard protocol analyzers can be used to read all unencrypted messages and requests on the Internet. That this fact is not more widely understood is a matter for concern.
We remain unconvinced about the need for Internet blocking technology. Such software, designed to prevent access to certain web sites or to allow access to only a few sites, seems almost impossible to administer in the ever-changing World Wide Web, and we have noted much greater success in simply monitoring Internet links for corporate policy violations than in attempting to create a rigid framework for Internet use. Users tend to be self-policing and bandwidth use can quickly be brought under control.
We have not presented here information about changing patterns of use over time. Although our software does save information about Internet use by time of day, we are limited in space and have chosen to present an "average" view. We hope to be able to present further results at a later date.
Finally, we have not looked at the sequence of steps a user of the Internet undertakes when surfing: we have not followed their "click-stream" to understand particular patterns. The Optimal Internet Monitor does keep information about the sequence of web sites visited and it may prove instructive to follow those streams to determine whether, for example, users prefer Yahoo when finding news web sites and Lycos when looking for movie reviews. We believe there is a great deal of interesting research in this area.
We took a year's worth of data and made no analysis of trends in web site use and Internet traffic; a broader study has been performed by Georgia Tech, with results presented in [1] and [2].
We designed the Internet Monitor to help corporations manage and control Internet use and bandwidth requirements. It has yielded interesting results, which we hope will stimulate further discussion and research. This paper is a small beginning.