Internet Indexing
Darren Hardy
Netscape Communications Corporation
5th WWW Conference, Paris, May 1996
Harvesting the Web
Harvest: Distributed Gatherer and Broker Architecture
Gatherer:
Retrieves resources via URLs (HTTP, FTP, Gopher, etc.)
Translates between file formats (HTML, TeX, ASCII, etc.)
Generates indexing data for Broker
Broker:
Retrieves indexing data from Gatherer
Provides search interface for end-users (or other Brokers)
Growing adoption (e.g., comp.infosystems.harvest, Netscape Catalog Server, etc.)
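The division of labor between Gatherer and Broker can be illustrated with a minimal sketch. This is not Harvest's actual code: the class and method names (`Summary`, `Gatherer.summaries`, `Broker.collect`, `Broker.search`) are hypothetical, retrieval and format translation are left out, and the Broker index is just an in-memory word-to-URL map.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    """Indexing data for one resource (hypothetical structure)."""
    url: str
    attributes: dict


class Gatherer:
    """Turns already-retrieved resources into indexing data."""

    def __init__(self, resources: dict):
        # Maps URL -> plain text. Real retrieval (HTTP, FTP, Gopher) and
        # format translation are omitted from this sketch.
        self.resources = resources

    def summaries(self) -> list:
        return [Summary(url, {"body": text})
                for url, text in self.resources.items()]


class Broker:
    """Collects indexing data from a Gatherer and answers queries."""

    def __init__(self):
        self.index = {}  # word -> set of URLs

    def collect(self, gatherer: Gatherer) -> None:
        for summary in gatherer.summaries():
            for word in summary.attributes["body"].lower().split():
                self.index.setdefault(word, set()).add(summary.url)

    def search(self, word: str) -> list:
        return sorted(self.index.get(word.lower(), set()))
```

A Broker built this way never touches the original resources; it works only from what the Gatherer hands over, which is the point of the split.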
Gathering Efficiency in Harvest
SOIF: Summary Object Interchange Format
Preserves structure in indexing data
Supports arbitrary binary or textual data
Streaming format to describe many resources
Gatherer targets specific information -- doesn't roam wildly
Gatherer and Broker communicate efficiently:
Incremental session-oriented transfers of indexing data
Reduces network and CPU bottlenecks on servers
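A rough sketch of the two ideas on this slide: length-prefixed attribute records and incremental transfer. The record layout below (a `@FILE { URL` header, then `Name{byte-count}:<TAB>value` lines, then a closing brace) is a simplified approximation of SOIF and should be checked against the Harvest documentation. The helper names `to_soif` and `changed_since`, and the use of a modification timestamp to select records, are assumptions made for illustration.

```python
def to_soif(url: str, attributes: dict) -> bytes:
    """Encode one resource summary in a SOIF-like layout (approximation)."""
    parts = [b"@FILE { " + url.encode("utf-8") + b"\n"]
    for name, value in attributes.items():
        data = value if isinstance(value, bytes) else str(value).encode("utf-8")
        # The explicit byte count lets a reader consume arbitrary binary
        # or textual values without scanning for a terminator.
        size = str(len(data)).encode("ascii")
        parts.append(name.encode("utf-8") + b"{" + size + b"}:\t" + data + b"\n")
    parts.append(b"}\n")
    return b"".join(parts)


def changed_since(records, last_update):
    """Yield encoded records modified after last_update, one at a time.

    records is an iterable of (url, modified_time, attributes) tuples.
    Yielding record by record keeps the transfer streaming and sends
    only what changed since the previous session.
    """
    for url, modified_time, attributes in records:
        if modified_time > last_update:
            yield to_soif(url, attributes)
```

For example, `to_soif("http://a.example/", {"Title": "Hello"})` produces a three-line record whose attribute line reads `Title{5}:` followed by a tab and `Hello`.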
Real Distributed Gathering on the Internet?
Coordinating gathering efforts is difficult right now:
Coverage is a competitive aspect
No agreement on interchanging indexing data
Few publish results or indexing data after gathering
Few adopted mechanisms for targeting gathering (e.g., /robots.txt)
More and more autonomous robots are being deployed...
And the Web isn't getting any smaller!
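As an illustration of the /robots.txt targeting mechanism mentioned above, a gatherer could filter its candidate URLs against a site's rules before fetching anything. This sketch uses Python's standard `urllib.robotparser` module; the function name `allowed_urls` and the agent name in the usage note are hypothetical, and the rules are passed in as text so that no network access is needed.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt: str, agent: str, urls):
    """Return only the URLs that the given robots.txt permits for agent."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so the rules can come from any source.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(agent, url)]
```

With the rules `User-agent: *` and `Disallow: /private/`, calling `allowed_urls` for an agent such as "ExampleGatherer" keeps `http://a.example/index.html` and drops `http://a.example/private/x.html`.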