Estimating Web properties by using search engines and random crawlers

Nobuko Kishi, Akiko Kondo, Takahide Ogawa
Tsuda College, Japan
Takahiro Ohmori, Seiji Sasazuka, Masahiro Mizutani
Tokyo University of Information Science, Japan
{ kishi, m99kondo, ogawa@tsuda.ac.jp, ohmori, sasazuka, mizutani@rsch.tuis.ac.jp }

Introduction

The rapid growth of the Web has made it impossible to measure properties of the entire Web directly. We therefore need statistical methods to estimate properties such as the size of the Web (the number of Web pages), the amount of the Web (the number of bytes in Web pages), and the number of links. Lawrence et al. [1,2] proposed two different methods for estimating the number of Web pages. The first method uses search engines as a means of random sampling; it gave an estimate of 320 million pages as a lower bound on the size of the indexable Web in December 1997 [1]. The second method uses random sampling of IP addresses; it gave an estimate of 800 million pages as the size of the Web in February 1999 [2].

The purpose of this study is to see whether these two methods can be applied to a subset of the Web. As the diversity of Web users grows, we need new approaches to measure the properties of subsets of the Web written in various languages and hosted in various countries. We applied the two methods to Japanese Web pages in the JP domain. The first method gave an estimate of 88 million pages as a lower bound on the size of the Japanese indexable Web, while the second method gave an estimate of 17 million pages. These results suggest that the two methods do not measure the same set of Web pages, and that the number of Japanese indexable Web pages is much larger than the figures currently reported.

Estimates by Search Engine Coverage

Lawrence's Experiment using Search Engine Coverage

Lawrence et al. [1] analyzed the search results obtained from six major search engines, using the queries used at the NEC research laboratory. They retrieved the documents listed in the search results and checked that the query terms were present. They then computed the size of the overlap between the results of the two largest search engines and obtained an estimate of 320 million pages, with a 95% confidence interval of 34 million.
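The overlap calculation can be sketched as a capture-recapture estimate. The following Python fragment is an illustration under stated assumptions, not the exact procedure of [1]: it assumes that the fraction of engine B's validated results that engine A also returns approximates A's coverage of the indexable Web, and that the size of A's index is known. The function name, the toy URLs, and the index size of 1,000 are hypothetical.

```python
def estimate_total_pages(results_a, results_b, index_size_a):
    """Estimate the size of the indexable Web from the overlap of two engines.

    results_a, results_b: iterables of validated result URLs from engines A and B.
    index_size_a: number of pages engine A is known (or reported) to index.
    """
    urls_a = set(results_a)
    urls_b = set(results_b)
    if not urls_b:
        raise ValueError("engine B returned no documents")
    overlap = len(urls_a & urls_b)
    if overlap == 0:
        raise ValueError("no overlap between the engines; estimate undefined")
    # Fraction of B's documents that A also returned, taken as A's coverage.
    coverage_a = overlap / len(urls_b)
    return index_size_a / coverage_a


# Toy data (hypothetical): A finds 2 of the 5 documents returned by B.
results_a = ["u1", "u2", "u3", "u4"]
results_b = ["u3", "u4", "u5", "u6", "u7"]
estimate = estimate_total_pages(results_a, results_b, index_size_a=1000)
print(estimate)  # coverage 0.4, so 1000 / 0.4 = 2500.0
```

The estimate is only a lower bound when the two engines do not index independently of each other, because engines tend to share popular pages and the overlap is then larger than independence would predict.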

Our Experiment with Search Engine Coverage

Several difficulties arise in applying their approach directly to estimate the number of Web pages that contain Japanese. One is that we need Japanese terms as queries to obtain Web pages containing Japanese text. Another is that search engines based in the U.S. do not cover as many Japanese Web pages as engines based in Japan. We selected Japanese query terms from a keyword index of newspaper articles published by Mainichi Shinbun between 1997 and 1998. We used the following four major search engines in Japan: Goo (http://www.goo.ne.jp), Lycos Japan (http://www.lycos.co.jp), Excite Japan (http://www.excite.co.jp), and Infoseek Japan (http://www.infoseek.co.jp). We found 597 query terms satisfying the same conditions used by Lawrence et al. From December 27 through 29, 1999, we retrieved the documents listed in the search results and checked that the query terms were present.

By computing the size of the overlap between the results from the two largest search engines, Goo and Lycos, we obtained an estimate of 88 million pages with a 95% confidence interval of 1 million. This estimate is much larger than other estimates currently known in Japan. In fact, the Ministry of Posts and Telecommunications, Japan, reported an estimate of 29 million in 1999 [4]. We also found that the single largest search engine covers only 40% of our estimated size of the Japanese Web. Figure 1 shows the relationship between our estimate and other statistics.
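The validation step, in which each retrieved document is checked for the query term, can be sketched as follows. This is a minimal illustration, assuming the pages have already been fetched and decoded to text; the function name, the example query term, and the example.com URLs are hypothetical. Because Japanese text is not delimited by spaces, the sketch uses a plain substring test rather than word matching.

```python
def valid_hits(query_term, fetched_pages):
    """Return the URLs whose retrieved text actually contains the query term.

    fetched_pages maps each result URL to the page text, or to None when the
    page could not be retrieved (for example, a dead link).
    """
    return {
        url
        for url, text in fetched_pages.items()
        if text is not None and query_term in text
    }


# Toy data (hypothetical): one valid hit, one dead link, one stale index entry.
fetched = {
    "http://www.example.com/a": "昨日の地震について",
    "http://www.example.com/b": None,
    "http://www.example.com/c": "明日の天気予報",
}
hits = valid_hits("地震", fetched)
print(hits)
```

Only the validated hits of each engine would then enter the overlap computation, so that dead links and stale index entries do not inflate an engine's apparent coverage.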

Estimates by Random IP Address Sampling

Lawrence's Experiment For Random IP Sampling

There are currently 256^4, or about 4.3 billion, possible Internet Protocol (IPv4) addresses. By obtaining a random sample of IP addresses and testing each for a Web server at a standard port, we can estimate the number of Web servers at a standard port. Furthermore, we can estimate the number of Web pages, provided the distribution of the number of Web pages among servers is known. Lawrence et al. [2] chose 3.6 million IP addresses at random and tested them for a Web server at a standard port. They found a Web server at one in every 269 addresses and estimated the total number of Web servers as 16 million. After excluding Web servers with non-indexable pages, they produced an estimate of 2.8 million public Web servers. They then counted the indexable Web pages on 2,500 Web servers chosen from the above sample. They found the average number of Web pages per server to be 289 and produced an estimate of 800 million pages.
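The arithmetic behind these figures can be checked with a few lines of Python. This is only a worked calculation from the numbers quoted above; the variable names are ours.

```python
TOTAL_IPV4 = 256 ** 4            # 4,294,967,296 possible addresses
ADDRESSES_PER_SERVER = 269       # one Web server found per 269 addresses
PUBLIC_SERVERS = 2_800_000       # servers left after excluding non-indexable ones
PAGES_PER_SERVER = 289           # mean indexable pages per sampled server

estimated_servers = TOTAL_IPV4 / ADDRESSES_PER_SERVER
estimated_pages = PUBLIC_SERVERS * PAGES_PER_SERVER

print(f"{estimated_servers / 1e6:.1f} million servers")  # about 16 million
print(f"{estimated_pages / 1e6:.0f} million pages")      # about 809 million
```

The product of 2.8 million servers and 289 pages per server is about 809 million, which is reported as 800 million pages.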

Our Experiment For Random IP Sampling

Of the 256^4 IPv4 addresses, about 28 million are currently managed by JPNIC [3]. We chose 28 thousand of these addresses at random and tested them for a Web server at a standard port. We received 335 responses, of which 175 were successful. Among these 175 Web servers, we found that 85 hold indexable Web pages. Since the sample is one thousandth of the address space, we obtain 85,000 as an estimate of the number of public Web servers in JPNIC's address space. We then retrieved the indexable Web pages from these 85 servers and found the average number of Web pages per server to be about 200. As a result, we obtain an estimate of 17 million Web pages in JPNIC's IP address space.
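The sampling and extrapolation steps can be sketched in Python as follows. This is an illustration under stated assumptions rather than the program we used: the address blocks are documentation ranges standing in for the JPNIC blocks, the function names are hypothetical, and the population is taken as about 28 million addresses, the size implied by a sample of 28 thousand and an estimate of 85,000 servers. Probing each address for a Web server is omitted.

```python
import ipaddress
import random


def sample_addresses(blocks, k, seed=0):
    """Draw k distinct addresses uniformly at random from a list of CIDR blocks."""
    networks = [ipaddress.ip_network(block) for block in blocks]
    sizes = [net.num_addresses for net in networks]
    rng = random.Random(seed)
    addresses = []
    # Sample offsets into the concatenation of all blocks, then map each
    # offset back to the block that contains it.
    for offset in rng.sample(range(sum(sizes)), k):
        for net, size in zip(networks, sizes):
            if offset < size:
                addresses.append(net[offset])
                break
            offset -= size
    return addresses


def extrapolate(hits, sample_size, population):
    """Scale a count observed in a random sample up to the whole population."""
    return hits * population / sample_size


# Hypothetical blocks (documentation ranges), standing in for the JPNIC list.
blocks = ["192.0.2.0/24", "198.51.100.0/24"]
sample = sample_addresses(blocks, k=10)

# Figures from the text; the population of 28 million is an assumption.
servers = extrapolate(hits=85, sample_size=28_000, population=28_000_000)
pages = servers * 200  # about 200 indexable pages per server
print(servers, pages)  # 85000.0 17000000.0
```

Sampling offsets over the concatenated blocks gives every address the same chance of selection regardless of the size of the block it belongs to, which is what the extrapolation requires.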

Although the estimated number of Web pages, 17 million, seems too small, we believe that the estimated number of Web servers in JPNIC's IP address space, 85,000, is reasonable for the following reasons. The Netcraft Web Server Survey reported 70,851 Web servers in the JP domain in May 1999 [5]. The WWW-in-JP Server Survey by Hitachi Seibu Software, Ltd. reported 78,015 Web servers with names of the form www.*.*.jp in November 1999 [6].

Discussions

In the experiments by Lawrence et al., the value estimated by the first method, 320 million, is smaller than the value estimated by the second, 800 million. This result seemingly indicates that the first method estimates only a subset of the Web pages that the second method can estimate. After all, the first method is based on sampling Web pages that can be retrieved by English query terms, while the second is based on sampling Web pages regardless of their contents. However, the first method applied to the JP domain gave an estimate of 88 million, which is much larger than the 17 million estimated by the second, showing that the two methods do not measure the same portion of the Web.

One reason for the difference between the results of Lawrence et al. and ours may be that Web pages in the JP domain are less linked to each other than the Web pages in the entire Web observed by Lawrence et al. More precisely, pages in the JP domain tend to be isolated from the root document of their Web server. An example illustrates the situation. Assume a user, X, has placed their Web pages at a provider, Y. The user X's home page, http://www.Y.ne.jp/~X/, is usually not linked from the root document http://www.Y.ne.jp/, either directly or indirectly, so it cannot be reached by following links from the root document. If the user X registers their home page with a directory service, however, these pages will eventually be crawled by search engines' crawlers and become searchable.
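The effect of such isolation on a count made by following links can be illustrated with a small reachability computation. The Python sketch below assumes that pages on a server are counted by a crawl that starts from the root document and follows links; the link graph is a hypothetical toy example built around the URLs above, and the page names about.html and diary.html are invented for illustration.

```python
from collections import deque


def reachable_from(root, links):
    """Return the set of pages reachable from root by following links."""
    seen = {root}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Hypothetical link graph: user X's pages link to each other but are not
# linked from the provider's root document.
links = {
    "http://www.Y.ne.jp/": ["http://www.Y.ne.jp/about.html"],
    "http://www.Y.ne.jp/about.html": ["http://www.Y.ne.jp/"],
    "http://www.Y.ne.jp/~X/": ["http://www.Y.ne.jp/~X/diary.html"],
    "http://www.Y.ne.jp/~X/diary.html": [],
}
found = reachable_from("http://www.Y.ne.jp/", links)
missed = set(links) - found
print(sorted(missed))
```

In this toy graph the crawl from the root counts two pages and misses both of user X's pages, whereas a search engine that learned of http://www.Y.ne.jp/~X/ from a directory service would index them.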

Conclusions

We have applied Lawrence's two methods to estimate the number of Japanese Web pages in the JP domain. The first method gave an estimate of 88 million pages as a lower bound on the size of the Japanese indexable Web, showing that the number of Japanese Web pages is much larger than existing statistics indicate and that Japanese search engines cover only a part of the Japanese Web. The second method gave an estimate of about 17 million pages, suggesting that Web pages in the JP domain are less connected than Web pages in the Web overall, although more study is needed to explain the difference between the two methods.

References

  1. Lawrence, S. and Giles, C.L.: Searching the World Wide Web, Science 280, pp. 98-100 (1998)
  2. Lawrence, S. and Giles, C.L.: Accessibility of information on the Web, Nature 400, pp. 107-109 (1999)
  3. Japan Network Information Center: IP Addresses, October 31, 1999, http://www.nic.ad.jp/jp/regist/dns/doc/jp-addr-block.html
  4. Ministry of Posts and Telecommunications, Japan: Tsuushin Hakusho 1999, White Paper on Telecommunications, http://www.mpt.go.jp/policyreports/japanese/papers/99wp/99wp-0-index.html
  5. NetCraft: The NetCraft Server Survey, May 1999, Japan, http://www.netcraft.co.uk/survey/Reports/9905/bydomain/jp/
  6. Hitachi Seibu Software, Ltd.: The WWW-in-JP Server Survey, http://www.hitachi-ns.co.jp/pub/w3survey/latest/
Figure 1: Estimated Size of Japanese Web Pages