Measuring the Web with Lycos

Michael L. Mauldin, Carnegie Mellon University, Pittsburgh, PA, USA

fuzzy@cmu.edu or http://fuzine.mt.cs.cmu.edu/mlm/

Keywords: Web Size, Information Discovery and Retrieval

What is the Web?

The first step in measuring the size of the web is to decide what counts as being ``on'' the web. We define the web as any document in any of

FTP space,
Gopher space, or
HTTP space.

By design, Lycos does not index ephemeral or changing data or infinite virtual spaces. Therefore, the following are not considered part of the web:

WAIS databases
USENET news
TELNET services
Email

Further, we do not count the output of CGI scripts. To codify this constraint, we ignore URLs containing either a question mark (?) or an equals sign (=), as these characters are used primarily by CGI scripts.
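The inclusion rules above can be sketched as a small URL filter. This is a minimal illustration, not the Lycos implementation: the function name `counts_as_web` and the constant `COUNTED_SCHEMES` are hypothetical, and the scheme check assumes that FTP, Gopher, and HTTP space correspond to the `ftp`, `gopher`, and `http` URL schemes.

```python
from urllib.parse import urlsplit

# Assumption: the three spaces counted as "the web" map onto these URL schemes.
COUNTED_SCHEMES = {"ftp", "gopher", "http"}


def counts_as_web(url: str) -> bool:
    """Return True if the URL falls inside the space being measured."""
    scheme = urlsplit(url).scheme.lower()
    if scheme not in COUNTED_SCHEMES:
        # WAIS, USENET news, TELNET, and email URLs are excluded here.
        return False
    # A '?' or '=' anywhere in the URL is treated as a sign of CGI output.
    return "?" not in url and "=" not in url
```

Under these assumptions, `counts_as_web("http://fuzine.mt.cs.cmu.edu/mlm/")` is accepted, while a URL such as `http://example.org/cgi-bin/search?q=lycos` (a made-up address) is rejected.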

The figure shows our taxonomy of CyberSpace: Lycos' view of the Web is a strict subset of the Internet, but is larger than the space provided by HTTP servers alone.

Sampling the space

Lycos samples the web continuously, and the search results are merged with the catalog weekly. To estimate the size of the web, we take a week's worth of new searches, assume they are an independent random sample of the web as a whole, and multiply the size of the old document set by the ratio of the size of the new sample set to the size of the intersection of the two sets.

As of February 1st, 1995, the ``old'' document set contained 1,647,617 known URLs. The new search consisted of 182,672 URLs, of which 101,260 were also in the old set. That gives a ratio of 1.804, which, multiplied by the old size, gives an estimate of 2.825 million URLs.
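The estimator described in this section can be written out directly. The sketch below is only an illustration of that arithmetic; the function name `estimate_web_size` and its parameter names are hypothetical. Note that with the three counts quoted above the product works out to roughly 2.97 million, somewhat higher than the 2.825 million reported; the source does not explain the gap, so the code simply reports what the stated inputs produce.

```python
def estimate_web_size(old_size: int, new_size: int, overlap: int) -> float:
    """Scale the old set by the ratio of new-sample size to intersection size."""
    if overlap <= 0:
        raise ValueError("the two sets must share at least one URL")
    return old_size * (new_size / overlap)


# Counts quoted in the text for February 1st, 1995.
OLD_SIZE = 1_647_617   # URLs in the old document set
NEW_SIZE = 182_672     # URLs found by the new search
OVERLAP = 101_260      # new URLs that were already in the old set

ratio = NEW_SIZE / OVERLAP
estimate = estimate_web_size(OLD_SIZE, NEW_SIZE, OVERLAP)

print(f"ratio    = {ratio:.3f}")
print(f"estimate = {estimate:,.0f} URLs")
```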

Other Measures

How many servers?

Between Nov 21, 1994 and Jan 31, 1995, Lycos successfully downloaded at least one file from 15,858 unique HTTP servers.

How big is the average document?

During that same time period, the average text file size downloaded was 6,340 characters.

So how big is it?

Multiplying the estimated 2.825 million documents by the average size of 6,340 characters gives an estimate of 17.91 billion bytes (16.7 gigabytes) for the size of the web.
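As a quick check of this arithmetic, the following sketch multiplies the two reported figures. It assumes one byte per character and takes "gigabyte" to mean 2^30 bytes, which is the reading under which the two quoted totals agree; the variable names are illustrative only.

```python
# Figures reported in the text.
ESTIMATED_DOCUMENTS = 2_825_000   # the 2.825 million URL estimate
AVERAGE_CHARACTERS = 6_340        # average text file size downloaded

# Assumption: one character occupies one byte.
total_bytes = ESTIMATED_DOCUMENTS * AVERAGE_CHARACTERS

# Assumption: a gigabyte here means 2**30 bytes.
total_gigabytes = total_bytes / 2**30

print(f"{total_bytes / 1e9:.2f} billion bytes")
print(f"{total_gigabytes:.1f} gigabytes")
```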

Sources of Error

The biggest problem with this number is that the search is almost certainly not a truly random sample. Lycos uses a biased weighting scheme to download ``popular'' documents first, so the new search will tend to overlap the old more than a truly random sample.

Because the intersection is therefore inflated, and because its size appears in the denominator of the ratio, the estimate of 2.825 million should be read as a lower bound.
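The direction of this bias can be illustrated with a toy simulation. Everything in it is hypothetical: a made-up space of 100,000 documents, a "popular" fifth of that space, and a crawler that takes 60% of each sample from the popular documents. It is not a model of the actual Lycos weighting scheme; it only shows that favoring the same documents in both samples enlarges the intersection and pulls the estimate below the true size.

```python
import random

TRUE_SIZE = 100_000               # hypothetical number of documents
POPULAR = range(0, 20_000)        # hypothetical "popular" fifth of the space
OTHER = range(20_000, TRUE_SIZE)  # everything else


def crawl(rng, n, popular_share):
    """Sample n distinct documents, taking popular_share of them from POPULAR."""
    n_popular = round(n * popular_share)
    picked = set(rng.sample(POPULAR, n_popular))
    picked.update(rng.sample(OTHER, n - n_popular))
    return picked


def overlap_estimate(old, new):
    """Old-set size times the ratio of new-sample size to intersection size."""
    return len(old) * len(new) / len(old & new)


rng = random.Random(1995)  # fixed seed so the run is repeatable

# popular_share=0.2 matches the true proportion: every document is equally likely.
unbiased_estimate = overlap_estimate(crawl(rng, 20_000, 0.2), crawl(rng, 5_000, 0.2))

# popular_share=0.6 over-samples popular documents in both the old and new sets.
biased_estimate = overlap_estimate(crawl(rng, 20_000, 0.6), crawl(rng, 5_000, 0.6))

print(f"true size:         {TRUE_SIZE:,}")
print(f"unbiased estimate: {unbiased_estimate:,.0f}")
print(f"biased estimate:   {biased_estimate:,.0f}")
```

In this setup the unbiased estimate lands close to the true size, while the biased one comes out at roughly half of it.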

Acknowledgements

Lycos is generously supported by funds from Carnegie Mellon University. Some of the hardware is re-used from the Tipster Data Extraction Project funded by ARPA. Dr. Mauldin is also funded by a research grant from the Corporation for National Research Initiatives as part of ARPA's Computer Science Technical Report project.

Lycos is a registered trademark of Carnegie Mellon University.

Last updated 15-Feb-95 by fuzzy@cmu.edu