Hits and Miss-es: A Year Watching the Web


Dr John Graham-Cumming
Optimal Networks, Inc.
Palo Alto, CA
jgc@optimal.com

Abstract

With the growth of the Internet many corporations have provided free access to the World Wide Web from employees' personal computers. Control of the traffic flowing into and out of the corporate network has been limited to security considerations, but a number of corporations have become wary of the amount of time employees spend surfing the Internet, the cost of the connection to the Internet and possible legal exposure from the material brought into the company across the World Wide Web.

This paper examines the results of a year of studying the in- and out-bound Internet traffic of a number of corporations. That analysis has led to a number of surprising findings about the nature of Web traffic and to practical recommendations for network managers hoping to contain Internet connection costs and limit exposure to legal difficulties.

1. Introduction

In April 1996 we launched an Internet monitoring program called the Optimal Internet Monitor and this paper draws on our experience with our customers and the software.

Using the software we have gathered a large amount of data about the nature of World Wide Web traffic from a number of large corporations. Whilst it is always hard to define any single corporation or network as typical, we have drawn together the results of a year of watching the web to provide a picture of average Internet use.

Initial reactions to connecting a company to the Internet tend to revolve around issues of security; exactly how much the Internet connection will be used and for what purpose are generally left as an experiment. Internet connectivity is, for many companies, a new frontier and employees are the explorers.

Some corporations have become concerned about the use to which the Internet connection is put and, spurred by legal considerations, have asked for either detailed information about the sites being surfed or all-out blocking of certain undesirable areas of the Internet. The term "undesirable" has proved subjective and, with the explosive growth of web sites, maintaining and controlling lists of banned sites has become a significant burden for network managers.

Some managers chose to first understand the traffic on the Internet connection using tools already in their possession (typically information from the Internet router, logs from a proxy server, if installed, or output from a protocol analyzer). Those tools fall short of answering, in a simple and speedy manner, questions about which web sites are being visited, and, more importantly, how much network bandwidth is being consumed by those visits.

Newer tools help to ease the burden of understanding the nature of web traffic and in doing so have led to some surprising results concerning the use to which corporate Internet connections are put. In addition, many of the problems they reveal are easily resolved through configuration and policy management, and we lay out some of these results and solutions in this paper.

Previous studies have required user participation (see, for example, [1]). Our study was performed without user interaction, and in doing so we lost user demographic information. The interested reader should read [3] for more information surrounding measurement of web site usage and Internet traffic patterns.

2. Results

The results we present are drawn from a collection of databases of Internet traffic information gathered using our software. The databases include those created at Optimal by monitoring the daily activity on the Optimal Internet connection, databases provided by customers, and two databases collected at the 1996 InterOp (Las Vegas and Atlanta) trade shows.

The data was sanitized by removing all user information and replacing each user's IP address or DNS name with a unique integer number. Data was collated in Microsoft Access and exported to Microsoft Excel for analysis. All the analysis here is from the Excel spreadsheet and graphs generated were then converted to GIF files. The HTML document was prepared using Microsoft Notepad.
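The sanitizing step described above can be sketched as follows. This is a minimal illustration and not the code used in the study: the function name `make_anonymizer` and the choice to number users in order of first appearance are assumptions made for the example.

```python
from itertools import count


def make_anonymizer():
    """Return a function mapping each IP address or DNS name to a unique integer.

    Hypothetical helper: integers are handed out in order of first
    appearance, and the same identifier always maps to the same integer.
    """
    assigned = {}
    next_id = count(1)

    def anonymize(identifier):
        # Normalize so that "Host.Example.com" and "host.example.com" match.
        key = identifier.strip().lower()
        if key not in assigned:
            assigned[key] = next(next_id)
        return assigned[key]

    return anonymize


if __name__ == "__main__":
    anonymize = make_anonymizer()
    for user in ["10.0.0.5", "host.example.com", "10.0.0.5"]:
        print(user, "->", anonymize(user))
```

Because the mapping is kept only in memory and never written out alongside the data, the integers cannot be traced back to a user once the original identifiers are discarded.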

The data represents around 4000 users, accessing around 95000 different URLs (the Internet Monitor version used only collects the top level of the URL, the domain name, and hence these 95000 URLs represent 95000 different DNS names and not individual pages; for example, all of Yahoo's search result pages are represented as search.yahoo.com). The total time of data collection spans around 4 months with data from April 1996 and September through November 1996. Apart from the Atlanta InterOp, the majority of data collection was performed on the West Coast of the United States. In all just under 100GB (gigabytes) of traffic was collected.

In addition to removing all user names from the databases we also removed all Intranet servers (our intent was to look largely at Internet traffic), and all FTP and News traffic. The majority of users were using a version of Netscape Navigator, and a large percentage (further details are discussed below) were using the PointCast Network. The networks were connected to the Internet by a variety of means: ISDN, T1 and faster links. Hence, in order to compensate for possible skew caused by different media speeds, we have examined only total bytes transferred and not made any inquiries into bandwidth utilization.

We do not claim that our sample is statistically accurate: no effort was expended to gather demographic information about the users monitored, and in most cases the users were unaware of the monitoring. We have noted that in all circumstances in which monitoring is advertised, user behavior changes drastically.

We do believe, however, that our data is genuine: no corporation has influenced the data collected in this informal study, and the data is presented without bias towards a particular web site or view. We hope to encourage further study of Internet use, patterns of user behavior and bandwidth utilization to create a more dynamic environment for the inhabitants of cyberspace and to promote the optimization of networks and web sites.

Below we make recommendations about possible steps that can be taken to minimize the impact of the Internet on corporate networks; we have made no study of home use of the Internet and our results are currently only applicable to corporate network installations.

The first data to look at is the collection of the top 10 web sites that we have seen. Figure 1 shows a graph of the top 10 web sites listed by URL and showing the percentage of total bytes collected that each web site generated.

Figure 1. Ten most bandwidth-consuming Web sites.

It is immediately interesting to note that in our study the PointCast Network and Netscape are the biggest consumers of the network. PointCast accounts for over 17% of the network traffic and Netscape for over 12%. After PointCast and Netscape there is a significant fall in the traffic to any one web site.

To determine the distribution of interest in web sites we examine the web sites ordered by the number of users who access the sites. Figure 2 shows the top 10 web sites ordered by the percentage of users who access that site.

Figure 2. Ten most popular Web sites.
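The two rankings behind Figures 1 and 2 can be sketched as follows. This is a minimal illustration that assumes each sanitized record is a `(user_id, site, bytes)` tuple; that record layout and the function name `rank_sites` are assumptions made for the example and not the format of the Internet Monitor databases.

```python
from collections import defaultdict


def rank_sites(records, top_n=10):
    """Rank sites by share of total bytes and by share of users.

    records: iterable of (user_id, site, bytes) tuples (hypothetical layout).
    Returns (by_bytes, by_users), each a list of (site, percentage) pairs
    sorted from largest to smallest and cut off at top_n entries.
    """
    bytes_by_site = defaultdict(int)
    users_by_site = defaultdict(set)
    all_users = set()

    for user_id, site, nbytes in records:
        bytes_by_site[site] += nbytes
        users_by_site[site].add(user_id)
        all_users.add(user_id)

    total_bytes = sum(bytes_by_site.values())
    if total_bytes == 0 or not all_users:
        return [], []

    # Figure 1 style: percentage of all bytes generated by each site.
    by_bytes = sorted(
        ((site, 100 * b / total_bytes) for site, b in bytes_by_site.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )[:top_n]

    # Figure 2 style: percentage of all users who accessed each site.
    by_users = sorted(
        ((site, 100 * len(u) / len(all_users)) for site, u in users_by_site.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )[:top_n]

    return by_bytes, by_users
```

A site can rank very differently in the two lists: one that is visited by few users but transfers a great deal of data rises in the first ranking and falls in the second.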

Figure 2 gives more of an indication of the activity of the user group: 70% regularly access the Netscape home page, 20% use Yahoo, whilst around 12% are using the PointCast Network. Search sites Lycos and Excite appear here also, as does the Microsoft home page. Two interesting conclusions can be drawn from these two charts: Netscape's homepage is extremely popular, accounting for 12% of network traffic generated by 70% of users, and yet the PointCast Network, in use by only 12% of users, is generating 17% of all traffic.

Two phenomena are at work. First, the small number of users connected to PointCast are generating a large amount of traffic because of the continuous and uncontrolled nature of that application's Internet use. Second, the large number of users contacting Netscape would generate more traffic were it not that the majority of hits on the Netscape homepage are generated automatically by the browser and do not result in the entire homepage being downloaded.
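The contrast can be made concrete with a little arithmetic on the rounded figures quoted above (17% of traffic from 12% of users for PointCast, 12% of traffic from 70% of users for Netscape). The "traffic share per user share" ratio below is an illustrative measure introduced for this sketch, not one used in the study.

```python
def traffic_per_user_share(traffic_pct, user_pct):
    """Ratio of a site's share of traffic to its share of users.

    A value above 1 means the site's users generate more than their
    proportional share of the traffic; below 1 means less.
    """
    return traffic_pct / user_pct


# Rounded percentages quoted in the text for Figures 1 and 2.
POINTCAST = traffic_per_user_share(traffic_pct=17, user_pct=12)
NETSCAPE = traffic_per_user_share(traffic_pct=12, user_pct=70)

if __name__ == "__main__":
    print(f"PointCast: {POINTCAST:.2f}")
    print(f"Netscape:  {NETSCAPE:.2f}")
    print(f"PointCast / Netscape: {POINTCAST / NETSCAPE:.1f}")
```

On these rounded numbers a PointCast user accounts for roughly eight times as much traffic, relative to headcount, as a user hitting the Netscape homepage.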

In addition to the rather obvious inclusion of search sites in the top ten, Microsoft's influence on the Internet is clearly seen by their homepage's inclusion in both charts. The popularity of news sites (USA Today and The San Jose Mercury News) is also noted and whilst ESPN appears in Figure 1 it is also the 11th (not shown) most popular web site by our analysis. In addition both charts indicate a strong interest in finance applications with the stock quote service at Yahoo and the Quicken Financial Network's stock quote service both appearing.

Ignoring the influence of the Netscape and PointCast hits, search sites, news, sports and finance prove the most popular destinations for the web surfers in our survey.

An analysis of the sites producing 50% of all the traffic we monitored shows that the top 35 web sites are responsible for that half of the traffic. Those web sites, in order of traffic generated, are shown in Figure 3. It is worth noting, as Figure 1 shows, that Netscape and the PointCast Network create a large skew in our figures, and that in order to understand which web sites users are actively seeking out, the first two entries in Figure 3 should be ignored.

Rank  URL                     % of total bytes  Site Type
1     pointcast.net           17.81%            Pointcast
2     home.netscape.com       13.13%            Computers/High-Technology
3     www.yahoo.com           2.28%             Search
4     www.adobe.com           1.65%             Computers/High-Technology
5     espnet.sportszone.com   1.13%             Sport
6     www.cnn.com             0.85%             News
7     quote.yahoo.com         0.82%             Finance
8     www.microsoft.com       0.78%             Computers/High-Technology
9     www.usatoday.com        0.76%             News
10    quotes.galt.com         0.74%             Finance
11    www.excite.com          0.70%             Search
12    www.dbc.com             0.60%             Finance
13    www.lombard.com         0.54%             Finance
14    www.martijua.com        0.52%             Adult
15    www.geocities.com       0.52%             Multiple
16    www.sjmercury.com       0.52%             News
18    www.lycos.com           0.49%             Search
19    members.aol.com         0.45%             Multiple
20    ms.www.conxion.com      0.43%             Computers/High-Technology
21    www.nfl.com             0.38%             Sport
22    pathfinder.com          0.38%             Search
23    www.grayfire.com        0.35%             Finance
24    www.unitedmedia.com     0.34%             Entertainment
25    quicktime.apple.com     0.33%             Computers/High-Technology
26    ad.doubleclick.net      0.33%             Advertising
27    www.msnbc.com           0.33%             News
28    cnnfn.com               0.32%             Finance
29    altavista.digital.com   0.31%             Search
30    fast.quote.com          0.30%             Finance
31    www.kpix.com            0.28%             News
32    www.otk.com             0.27%             Adult
33    www.playboy.com         0.27%             Adult
34    www.bekkoame.or.jp      0.26%             Multiple
35    www.sfgate.com          0.26%             News
36    cnn.com                 0.25%             News

Figure 3. Web sites representing 50% of traffic.

To look a little further into the interests of users we have categorized the web sites into the following 7 groups: Search, Finance, Computer and High-Technology, News, Sport, Entertainment, and Adult. Examining the sites in Figure 3, which represent 50% of all monitored Internet traffic, gives a breakdown of areas of interest. Figure 4 shows the categories of interest to the Internet users monitored.

Figure 4. Internet use by category (based on total bytes).
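A per-category breakdown of this kind can be recomputed from the rows listed in Figure 3. The sketch below is illustrative only: it simply sums the published percentages by site type, and the names `FIGURE_3_ROWS` and `category_totals` are assumptions made for the example. Note that Figure 3 lists no entry at rank 17 and uses three labels (Pointcast, Multiple and Advertising) beyond the seven groups named in the text, so the totals need not match the chart exactly.

```python
from collections import defaultdict

# Rows transcribed from Figure 3: (URL, % of total bytes, site type).
FIGURE_3_ROWS = [
    ("pointcast.net", 17.81, "Pointcast"),
    ("home.netscape.com", 13.13, "Computers/High-Technology"),
    ("www.yahoo.com", 2.28, "Search"),
    ("www.adobe.com", 1.65, "Computers/High-Technology"),
    ("espnet.sportszone.com", 1.13, "Sport"),
    ("www.cnn.com", 0.85, "News"),
    ("quote.yahoo.com", 0.82, "Finance"),
    ("www.microsoft.com", 0.78, "Computers/High-Technology"),
    ("www.usatoday.com", 0.76, "News"),
    ("quotes.galt.com", 0.74, "Finance"),
    ("www.excite.com", 0.70, "Search"),
    ("www.dbc.com", 0.60, "Finance"),
    ("www.lombard.com", 0.54, "Finance"),
    ("www.martijua.com", 0.52, "Adult"),
    ("www.geocities.com", 0.52, "Multiple"),
    ("www.sjmercury.com", 0.52, "News"),
    ("www.lycos.com", 0.49, "Search"),
    ("members.aol.com", 0.45, "Multiple"),
    ("ms.www.conxion.com", 0.43, "Computers/High-Technology"),
    ("www.nfl.com", 0.38, "Sport"),
    ("pathfinder.com", 0.38, "Search"),
    ("www.grayfire.com", 0.35, "Finance"),
    ("www.unitedmedia.com", 0.34, "Entertainment"),
    ("quicktime.apple.com", 0.33, "Computers/High-Technology"),
    ("ad.doubleclick.net", 0.33, "Advertising"),
    ("www.msnbc.com", 0.33, "News"),
    ("cnnfn.com", 0.32, "Finance"),
    ("altavista.digital.com", 0.31, "Search"),
    ("fast.quote.com", 0.30, "Finance"),
    ("www.kpix.com", 0.28, "News"),
    ("www.otk.com", 0.27, "Adult"),
    ("www.playboy.com", 0.27, "Adult"),
    ("www.bekkoame.or.jp", 0.26, "Multiple"),
    ("www.sfgate.com", 0.26, "News"),
    ("cnn.com", 0.25, "News"),
]


def category_totals(rows):
    """Sum the percentage of total bytes for each site type, largest first."""
    totals = defaultdict(float)
    for _url, pct, site_type in rows:
        totals[site_type] += pct
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))


if __name__ == "__main__":
    for site_type, pct in category_totals(FIGURE_3_ROWS).items():
        print(f"{site_type:28s}{pct:6.2f}%")
```

Dropping the first two rows before calling `category_totals` gives the view recommended in the text, with the Netscape and PointCast skew removed.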

Finally we look at the activities of particular users. Figure 5 shows the distribution of bytes transferred during our monitoring against the number of users. We have collated the information taking the total number of bytes transferred to one significant figure and plotted the number of users falling into each group.

Figure 5. Distribution of user activity (based on total bytes).
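The grouping used for Figure 5 (each user's total bytes taken to one significant figure, then a count of users per group) can be sketched as follows. This is a minimal illustration: the function names are assumptions made for the example, and halves are rounded up, a detail the text does not specify.

```python
from collections import Counter


def round_to_one_sig_fig(n):
    """Round a non-negative integer byte count to one significant figure.

    Halves round up (1500 -> 2000); this tie-breaking rule is an assumption.
    """
    if n <= 0:
        return 0
    magnitude = 10 ** (len(str(n)) - 1)  # e.g. 1000 for 2600
    return (n + magnitude // 2) // magnitude * magnitude


def user_activity_histogram(bytes_per_user):
    """Count how many users fall into each one-significant-figure group."""
    return Counter(round_to_one_sig_fig(b) for b in bytes_per_user)


if __name__ == "__main__":
    sample = [1200, 1400, 2600, 95]  # made-up per-user totals
    for group, users in sorted(user_activity_histogram(sample).items()):
        print(f"{group:>8d} bytes: {users} user(s)")
```

Grouping in this way gives buckets that widen with the size of the total, so that heavy and light users can be shown on the same chart.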

Fifteen users account for 10% of the data (our top user accounted for 1.5% of all traffic collected). Users in the top 10% accessed thousands of URLs. Users centered around the second peak (which is caused in part by the Netscape and PointCast traffic) seemed to have a regular collection of pages that they visited (for example, the CNN homepage showed daily interest), as did users centered around the first major peak. We have not analyzed the activity of users at the low end; their activity is unlikely to cause major concern for a network manager controlling bandwidth.

3. Remedies

From the results shown above we can draw a number of conclusions about the nature of Internet traffic and provide some recommendations for network managers hoping to create an accessible environment that is cost effective.

We define a number of different phenomena seen from the example above, give their explanation and offer possible remedies where appropriate.

3.1 Homepage Skew

The large number of hits, and the large number of users accessing default homepages, create a huge number of communications to Web servers that are largely aborted. In the example above the majority of hits on the Netscape homepage were aborted before completion, indicating that the Netscape homepage was being visited automatically by the browser on start-up and the user was quickly jumping to a different Web site before viewing the Netscape homepage.

This indicates that a network manager can save significant amounts of network bandwidth (in the example above 12% of the Internet traffic is consumed by these phantom requests to the Netscape homepage). Such requests are currently counted as actual requests by the user of the Web browser, as if the user had actively selected the site; in most cases the user is uninterested in the Web site being loaded, as evidenced by the high rate of aborts on the default homepages.

Network managers should consider either disabling the homepage feature in the browser or using it to point to useful Intranet information. In organizations where the Intranet page is the default homepage further monitoring will be required to ensure that the automatic requests to that page do not create a problem on the Intranet.

3.2 Drip Feed Dangers

Applications, in the case of the example above the PointCast Network, that provide a constant flow of data to individual desktop machines consume a great deal of network bandwidth over time. In our analysis the PointCast Network exceeded all other Web sites in terms of total bytes transferred, even exceeding the phantom homepage requests described above.

We do not recommend that corporations ban such applications outright, but these applications (and in this category we include applications providing streaming audio and video) are significant users of bandwidth; as such they should be monitored to determine whether they have a serious impact on network performance. In addition the use of a caching proxy server may help alleviate some bandwidth problems.

We also recommend considering software such as Packeteer's PacketShaper that can be used to restrict the bandwidth used by specific Internet applications.

3.3 Personal Homepages

Over the course of the year we discovered a number of organizations where users internal to the corporation were running Web sites on their PCs. These Web sites were not sanctioned by the company and in at least one case were accessible from the Internet.

Such Web sites present two dangers: information presented by the Web site might be inappropriate for internal or external consumption, and the use of freeware or shareware Web servers can cause security problems, since a badly configured Web server could provide access to proprietary information.

3.4 Entertainment

Surprisingly, Entertainment web sites hardly figure in our analysis of Internet traffic. Whilst there is some concern about the influence of adult-oriented web sites, as evidenced by our pejorative title, they account for around 2% of total web traffic. Although these sites are highly graphical, the small number of users contacting them from corporate locations lowers their impact on bandwidth used. However, we have made no analysis of streaming audio and video, and these may significantly alter our results, showing entertainment to have a greater impact on corporate Internet connections than our current analysis suggests.

3.5 Caching In

By far the most popular web sites in the above example (excluding the automated requests to Netscape and PointCast) were sports, news, finance and search sites. Since such a small number of web sites are being contacted by a large number of users, the potential for caching proxy servers to reduce the amount of repeat communication across a corporate Internet connection seems high.

There are a number of caveats: we are unclear how a generic proxy might handle the PointCast network and a significant reduction in bandwidth use can be made by simply reconfiguring the web browser; in addition the presence of advertising in Figure 3 (at rank 26) indicates that caching may be ineffective due to the constantly changing nature of advertising banners.

Although 50% of the traffic was generated by only 35 sites, the other 50% was generated by a mixed bag of many thousands of sites, indicating that the varied interests of Internet users make effective caching difficult to achieve. Overall we suggest sensible configuration of Internet browsers and then careful consideration of caching technologies for specific web sites or applications; these changes will have a far greater impact than a simple least-recently-used caching algorithm.
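For readers unfamiliar with the term, a least-recently-used cache can be sketched in a few lines, along with the hit rate it achieves on a sequence of requests. This is an illustration only: it caches whole site names rather than individual pages or images, ignores object sizes and expiry, and the function name `lru_hit_rate` is an assumption made for the example. It does not model any particular proxy server.

```python
from collections import OrderedDict


def lru_hit_rate(requests, capacity):
    """Fraction of requests served from a least-recently-used cache.

    requests: sequence of site names, in the order they were requested.
    capacity: maximum number of distinct sites the cache may hold.
    """
    cache = OrderedDict()  # oldest entry first, most recent last
    hits = 0
    for site in requests:
        if site in cache:
            hits += 1
            cache.move_to_end(site)        # mark as most recently used
        else:
            cache[site] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used
    return hits / len(requests) if requests else 0.0


if __name__ == "__main__":
    sample = ["a", "b", "a", "c", "b", "a"]  # made-up request sequence
    for capacity in (2, 3):
        print(capacity, lru_hit_rate(sample, capacity))
```

Replaying a real request log through a function like this at several capacities is one way to judge how much a long tail of rarely repeated sites limits what a generic cache can save.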

3.6 Look, Don't Touch

In all instances in which monitoring has been installed and users have been informed, user behavior changes drastically. In at least one instance all web browsing stopped at a company once the presence of monitoring software became common knowledge.

Whilst we encourage monitoring of networks to help tune network and browser performance, we believe it is essential that corporations inform their users that monitoring technology may be used. We suspect that the widespread use of Internet monitors would shift web browsing habits from the office to home. It is also essential that users of the Internet realize that their activities are not anonymous: even without the introduction of special Internet monitors, standard protocol analyzers can be used to read all unencrypted messages and requests on the Internet. That this fact is not more widely understood is a matter for concern.

We remain unconvinced about the need for Internet blocking technology. Such software, designed to prevent access to certain web sites or to allow access to only a few sites, seems almost impossible to administer in the ever-changing World Wide Web, and we have noted much greater success in simply monitoring Internet links for corporate policy violations than in attempting to create a rigid framework for Internet use. Users tend to be self-policing and bandwidth use can quickly be brought under control.

4. Conclusion and Future Work

Our analysis was scientific, but our data collection methodology was not: we encourage a much broader study of Internet use using monitoring technology to understand the use of the existing Internet structure. In addition that study should track other new and existing protocols, such as FTP, NNTP and RealAudio, to determine their impact.

We have not presented here information about changing patterns of use over time. Although our software does save information about Internet use by time of day we are limited in space and have chosen to present an "average" view. We hope to be able to present further results at a later date.

Finally, we have not looked at the sequence of steps a user of the Internet undertakes when surfing: we have not followed their "click-stream" to understand particular patterns. The Optimal Internet Monitor does keep information about the sequence of web sites visited and it may prove instructive to follow those streams to determine whether, for example, users prefer Yahoo when finding news web sites and Lycos when looking for movie reviews. We believe there is a great deal of interesting research in this area.

We took a year's worth of data and made no analysis of trends in web site use and Internet traffic; a broader study has been performed by Georgia Tech and its results are presented in [1] and [2].

We designed the Internet Monitor to help corporations manage and control Internet use and bandwidth requirements: it has yielded interesting results which we hope stimulate further discussion and research. This paper is a small beginning.

References

[1] Kehoe, Colleen M. and Pitkow, James E. Surveying the Territory: GVU's Five WWW User Surveys. The World Wide Web Journal, Vol. 1, no. 3, 1996.

[2] Kehoe, Colleen M. and Pitkow, James E. Emerging Trends in the WWW User Population. Communications of the ACM, Vol. 39, no. 6, 1996.

[3] Bhatia, M. Web Audience Measurement: Issues, Challenges and Solutions. IPQC conference on Performance Measurement for Web Sites, San Francisco, 1996.

URL References

Adobe Systems, Inc.                  http://www.adobe.com
Altavista                            http://altavista.digital.com
AOL Members                          http://members.aol.com
Apple Quicktime                      http://quicktime.apple.com
Bekkoame                             http://www.bekkoame.or.jp
CNN                                  http://www.cnn.com
CNNfn                                http://cnnfn.com
DBC                                  http://www.dbc.com
Doubleclick                          http://www.doubleclick.net
ESPN                                 http://espnet.sportszone.com
Excite                               http://www.excite.com
The Gate                             http://www.sfgate.com
Geocities                            http://www.geocities.com
Grayfire                             http://www.grayfire.com
KPIX                                 http://www.kpix.com
Lombard                              http://www.lombard.com
Lycos                                http://www.lycos.com
Martijua                             http://www.martijua.com
Microsoft                            http://www.microsoft.com
MSNBC                                http://www.msnbc.com
Netscape Communications Corporation  http://www.netscape.com
NFL                                  http://www.nfl.com
Optimal Networks, Inc.               http://www.optimal.com
OTK                                  http://www.otk.com
Packeteer, Inc.                      http://www.packeteer.com
Pathfinder                           http://pathfinder.com
Playboy                              http://www.playboy.com
PointCast, Inc.                      http://www.pointcast.com
Quicken Financial Network            http://www.galt.com
Quote.Com                            http://www.quote.com
San Jose Mercury News                http://www.sjmercury.com
United Media                         http://www.unitedmedia.com
USA Today                            http://www.usatoday.com
Yahoo, Inc.                          http://www.yahoo.com
Yahoo Quotes                         http://quote.yahoo.com



