Measuring the Web with Lycos
Michael L. Mauldin, Carnegie Mellon University, Pittsburgh, PA, USA
fuzzy@cmu.edu or http://fuzine.mt.cs.cmu.edu/mlm/
Keywords: Web Size, Information Discovery and Retrieval
What is the Web?
The first step in measuring the size of the web is to determine
what counts as being "on" the web. We define the web as any
document in
- FTP space,
- Gopher space, or
- HTTP space.
By design, Lycos does not index ephemeral or changing data or
infinite virtual spaces. Therefore, the following are not considered
part of the web:
- WAIS databases
- USENET news
- TELNET services
- Email
Further, we do not count the output of CGI scripts. To codify this
constraint, we ignore URLs containing either a question mark (?) or
an equals sign (=), as these characters are used primarily by CGI
scripts.
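The definition above can be sketched as a URL filter. This is a
minimal illustration, not the Lycos implementation: the function name
counts_as_web and the example.org URLs are hypothetical, and the
sketch assumes the access scheme alone identifies FTP, Gopher, and
HTTP space.

```python
from urllib.parse import urlsplit

# Schemes counted as part of the web in the definition above.
COUNTED_SCHEMES = {"ftp", "gopher", "http"}


def counts_as_web(url: str) -> bool:
    """Return True if the URL falls inside the web as defined above.

    Hypothetical helper: the name and structure are illustrative only.
    """
    scheme = urlsplit(url).scheme.lower()
    if scheme not in COUNTED_SCHEMES:
        # WAIS, USENET news, TELNET, and email URLs are excluded here.
        return False
    # A '?' or '=' anywhere in the URL is treated as a sign of CGI output.
    return "?" not in url and "=" not in url


if __name__ == "__main__":
    for url in (
        "http://example.org/index.html",
        "gopher://example.org/1/docs",
        "http://example.org/cgi-bin/search?q=web",
        "telnet://example.org/",
    ):
        print(url, counts_as_web(url))
```

The character test is deliberately blunt: it also drops any static
document whose name happens to contain one of the two characters.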
The figure shows our taxonomy of CyberSpace: Lycos' view of the Web
is a strict subset of the Internet, but is larger than the space
provided by HTTP servers alone.
Sampling the space
Lycos samples the web continuously, and the search results are merged
with the catalog weekly. To estimate the size of the web, we take
a week's worth of new searches, assume they are an independent
random sample of the web as a whole, and multiply the size of the old
document set by the ratio of the size of the new sample set to the
size of the intersection of the two sets.
As of February 1st, 1995, the "old" document set contained
1,647,617 known URLs. The new search consisted of 182,672 URLs,
of which 101,260 were also in the old set. That gives a ratio
of 1.804, which, multiplied by the old size, gives an estimate of
2.825 million URLs.
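The calculation can be sketched in a few lines. It has the same form
as the Lincoln-Petersen capture-recapture estimator. The function and
variable names are hypothetical; the three counts are the ones quoted
above. Run on those counts exactly as printed, the product comes out
near 2.97 million rather than 2.825 million. The text does not say
which figure accounts for the difference, so the sketch leaves all of
them as stated.

```python
def estimate_population(old_size: int, new_size: int, overlap: int) -> float:
    """Old-set size times the ratio of new-sample size to overlap size."""
    if overlap <= 0:
        raise ValueError("the two sets must share at least one URL")
    return old_size * (new_size / overlap)


# Counts quoted in the text for February 1st, 1995.
old_size = 1_647_617   # URLs in the old document set
new_size = 182_672     # URLs in the new search
overlap = 101_260      # URLs of the new search already in the old set

ratio = new_size / overlap
estimate = estimate_population(old_size, new_size, overlap)

print(f"ratio    = {ratio:.3f}")
print(f"estimate = {estimate:,.0f} URLs")
```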
Other Measures
How many servers?
Between Nov 21, 1994 and Jan 31, 1995, Lycos successfully downloaded at
least one file from 15,858 unique HTTP servers.
How big is the average document?
During that same time period, the average size of a downloaded text
file was 6,340 characters.
So how big is it?
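Multiplying the estimated 2.825 million documents by the average
size of 6,340 characters gives an estimate of 17.91 billion bytes
(16.7 gigabytes, counting 2^30 bytes per gigabyte) for the size of
the web.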
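The arithmetic behind this figure, as a small check. The variable
names are hypothetical, and the sketch assumes one byte per character
and a gigabyte of 2^30 bytes, which is what the two quoted totals
imply.

```python
estimated_documents = 2_825_000   # URL estimate quoted in the text
average_chars = 6_340             # average text file size quoted in the text

# Assumption: one byte per character.
total_bytes = estimated_documents * average_chars
gigabytes = total_bytes / 2**30   # assumption: 1 gigabyte = 2**30 bytes

print(f"{total_bytes / 1e9:.2f} billion bytes")
print(f"{gigabytes:.1f} gigabytes")
```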
Sources of Error
The biggest problem with this number is that the search is almost
certainly not a truly random sample. Lycos uses a biased weighting
scheme to download "popular" documents first, so the new search
will tend to overlap the old set more than a truly random sample
would. Since the size of the intersection is therefore inflated, and
since it is in the denominator, the estimate of 2.825 million is a
lower bound.
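The direction of this bias can be illustrated with a toy simulation.
Everything in it is an assumption made for the illustration: the
population size, the number of fetches, and above all the 1/rank
popularity weights, which stand in for the actual Lycos weighting
scheme that the text does not describe.

```python
import random


def overlap_estimate(old: set, new: set) -> float:
    """Old-set size times the ratio of new-set size to overlap size."""
    return len(old) * len(new) / len(old & new)


def simulate(weights: list, draws: int, rng: random.Random) -> float:
    """Draw two weighted samples and estimate the population from them."""
    population = list(range(len(weights)))
    # Fetches are drawn with replacement; duplicates collapse in the set.
    old = set(rng.choices(population, weights=weights, k=draws))
    new = set(rng.choices(population, weights=weights, k=draws))
    return overlap_estimate(old, new)


TRUE_SIZE = 20_000   # hypothetical number of documents
DRAWS = 4_000        # hypothetical number of fetches per sample
rng = random.Random(1995)

uniform = [1.0] * TRUE_SIZE
# Hypothetical popularity weights: the document of rank r has weight 1/r.
popular = [1.0 / rank for rank in range(1, TRUE_SIZE + 1)]

uniform_estimate = simulate(uniform, DRAWS, rng)
biased_estimate = simulate(popular, DRAWS, rng)

print(f"true size            : {TRUE_SIZE:,}")
print(f"uniform sampling     : {uniform_estimate:,.0f}")
print(f"popularity weighting : {biased_estimate:,.0f}")
```

Under these assumptions the uniform samples recover roughly the true
size, while the popularity-weighted samples overlap far more and give
a much smaller estimate, which is the effect described above.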
Acknowledgements
Lycos is generously supported by funds from Carnegie Mellon University.
Some of the hardware is re-used from the Tipster Data Extraction
Project funded by ARPA. Dr. Mauldin is also funded by a research grant
from the Corporation for National Research Initiatives as part of
ARPA's Computer Science Technical Report project.
Lycos is a registered trademark of Carnegie Mellon University.
Last updated 15-Feb-95 by fuzzy@cmu.edu