Internet Indexing

Darren Hardy
Netscape Communications Corporation
5th WWW Conference, Paris, May 1996

Harvesting the Web

Harvest: Distributed Gatherer and Broker Architecture
Gatherer:
- Retrieves resources via URLs (HTTP, FTP, Gopher, etc.)
- Translates between file formats (HTML, TeX, ASCII, etc.)
- Generates indexing data for Broker
Broker:
- Retrieves indexing data from Gatherer
- Provides search interface for end-users (or other Brokers)
Growing adoption (e.g., comp.infosystems.harvest, Netscape Catalog Server, etc.)
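
The Gatherer/Broker split above can be sketched in a few lines of Python. This is an illustrative toy, not Harvest's implementation: the class names, the `fetch` callable, and the keyword-only "indexing data" are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    """Indexing data for one resource (hypothetical shape)."""
    url: str
    keywords: list


class Gatherer:
    """Retrieves resources and turns them into indexing data."""

    def __init__(self, fetch):
        # fetch is any callable mapping a URL to its text content
        self.fetch = fetch

    def gather(self, urls):
        summaries = []
        for url in urls:
            text = self.fetch(url)
            keywords = sorted(set(text.lower().split()))
            summaries.append(Summary(url, keywords))
        return summaries


class Broker:
    """Collects indexing data from a Gatherer and answers queries."""

    def __init__(self):
        self.index = {}  # keyword -> set of URLs

    def collect(self, gatherer, urls):
        for summary in gatherer.gather(urls):
            for word in summary.keywords:
                self.index.setdefault(word, set()).add(summary.url)

    def search(self, word):
        return sorted(self.index.get(word.lower(), set()))
```

The Broker never touches the original resources; it only sees the summaries the Gatherer produces, which is the point of the separation.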

Gathering Efficiency in Harvest

SOIF: Summary Object Interchange Format
- Preserves structure in indexing data
- Supports arbitrary binary or textual data
- Streaming format to describe many resources
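
A minimal sketch of writing SOIF-style records follows. The layout used here (a template line carrying the URL, then one `Name{size}:<TAB>value` line per attribute, closed by `}`) is an assumption based on the commonly documented form of SOIF; the function names are hypothetical. The explicit byte count is what lets a value carry arbitrary binary or textual data.

```python
def soif_record(template, url, attributes):
    """Serialize one resource summary as a SOIF-style record (bytes)."""
    parts = [f"@{template} {{ {url}\n".encode("utf-8")]
    for name, value in attributes.items():
        data = value if isinstance(value, bytes) else str(value).encode("utf-8")
        # The size in braces is the byte length of the value that follows.
        parts.append(f"{name}{{{len(data)}}}:\t".encode("utf-8") + data + b"\n")
    parts.append(b"}\n")
    return b"".join(parts)


def soif_stream(records):
    """Concatenate many records so they can be sent as one stream."""
    return b"\n".join(soif_record(*record) for record in records)
```

For example, `soif_record("FILE", "http://example.com/", {"Title": "Hello"})` produces a three-line record whose attribute line reads `Title{5}:` followed by a tab and `Hello`.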
Gatherer targets specific information -- doesn't roam wildly
Gatherer and Broker communicate efficiently:
- Incremental session-oriented transfers of indexing data
- Reduces network and CPU bottlenecks on servers
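
The idea of incremental transfers can be sketched as follows: the Broker remembers when it last synchronized and asks only for indexing data that changed since then. This is not Harvest's actual wire protocol; the class names and the integer timestamps are assumptions made for illustration.

```python
class GathererStore:
    """Holds indexing data together with the time each record last changed."""

    def __init__(self):
        self.records = {}  # url -> (updated_at, summary)

    def put(self, url, summary, updated_at):
        self.records[url] = (updated_at, summary)

    def changes_since(self, timestamp):
        # Return only records updated after the given time.
        return {
            url: summary
            for url, (updated_at, summary) in self.records.items()
            if updated_at > timestamp
        }


class BrokerSession:
    """Pulls only the records that changed since the previous sync."""

    def __init__(self):
        self.last_sync = 0
        self.index = {}

    def sync(self, store, now):
        delta = store.changes_since(self.last_sync)
        self.index.update(delta)
        self.last_sync = now
        return len(delta)  # number of records actually transferred
```

After the first full transfer, each later session moves only the delta, which is where the savings in network and CPU load come from.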

Real Distributed Gathering on the Internet?

Coordinating gathering efforts is tough right now:
- Coverage is a competitive aspect
- No agreement on how to interchange indexing data
- Few publish results or indexing data after gathering
Few widely adopted mechanisms for targeting gathering (e.g., /robots.txt)
More and more autonomous Robots are being deployed...
And the Web isn't getting any smaller!
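
As an illustration of the /robots.txt mechanism mentioned above, a gatherer can filter its target URLs against a site's rules before fetching anything. The sketch below uses Python's standard `urllib.robotparser`; the sample rules, the agent name "ExampleGatherer", and the helper `allowed_urls` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Sample rules a site might publish at /robots.txt (made up for this example).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_urls(robots_txt, agent, urls):
    """Return only the URLs the given agent is permitted to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(agent, url)]


if __name__ == "__main__":
    candidates = [
        "http://example.com/index.html",
        "http://example.com/private/notes.html",
    ]
    print(allowed_urls(ROBOTS_TXT, "ExampleGatherer", candidates))
```

In a real gatherer the rules would be fetched from each site rather than embedded as a string.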