The GeoWeb Project
Using WAIS and the World Wide Web to aid location of Distributed Data Sets
Brandon Plewe,
State University of New York at Buffalo
The acquisition of data is almost always the most expensive and time-consuming part of
GIS analysis. While Internet tools such as FTP promise to make spatial data easier to retrieve,
it is still often very difficult to find. The National Spatial Data Infrastructure initiative of the
U.S. federal government, and its Spatial Data Clearinghouse component, give a structure and
impetus to the creation of powerful, easy-to-use tools for locating and retrieving geographic
data through the use of Metadata. However, the actual structure and administration of the
Clearinghouse have not yet been determined, and several models are possible.
The GeoWeb project provides an example of one approach to the Clearinghouse
metadata-base, as well as prototype map- and query-based interfaces to aid the user in locating
real geographic data. This pilot is successful in providing a way to find and retrieving data, and
Ask anyone involved with Geographic Information Systems (GIS) what the least favorite
part of their work is, and you will very likely get the same answer: Data Input. This first phase of
any GIS analysis, involving finding, creating, acquiring, entering, and formatting spatial data to
prepare it for analysis, is almost always the most expensive and time-consuming part of the project.
This project looks at what is being done, and what can be done, to help users more easily locate
and acquire spatial data using the Internet and its tools.
Most GIS analyses currently rely on either government- or self-produced data. The U.S.
Government produces huge amounts of information, including geographic and spatial data of
many forms, ranging from linework for basemaps produced by USGS, to thematic data from the
Census that have spatial significance. In most cases, these datasets have been too large in total to
be distributed easily. For instance, the complete set of USGS GeoData fills a warehouse of
magnetic tapes, which must be located, copied, and delivered on demand. The incurred time and
cost of this process make one wonder how "public" the data really is.
Many commercial software and data vendors make public data more convenient by
repackaging it into different forms and media, sometimes adding new information to it in the
process. However, these commercial sources can be very expensive. The data they produce is
geared heavily toward commercial GIS applications (mostly involving street address matching),
and rarely have much value for environmental or rural subjects.
If a researcher cannot find an outside source (public or private) for data, it will need to be
collected firsthand. This may involve field studies, photogrammetry or remote sensing, surveying,
scanning, or digitizing; whatever the process, it is almost always time-consuming and expensive.
One of the keys to aiding data acquisition efforts is data sharing, in which those who have
already acquired data for their own projects make it readily available to others who need it. Over
the 30-year history of GIS research, gigabytes of geographic data have been collected for projects
that are long since completed. While this valuable information is gathering dust on tape, other
people are out reproducing it from scratch for their own projects. If someone is doing research
on soils in some remote mountain area, and another group has already done a soil survey of that
area, the researcher would probably be very eager to use the data (even for a fee), and the other
group would probably be willing to sell it.
However, this concept cannot become a reality by itself. There are four necessary steps to
a successful data sharing effort:
- Data producers need to be willing to share their data with others.
- The data that producers wish to share needs to be in an open form that allows many
other people to successfully read and use it.
- There need to be tools that allow these producers to make data easily accessible to
potential users.
- There need to be tools whereby data users can learn what data is available out there that
may be useful.
The U.S. federal government has been evangelizing the data sharing concept for many
years, and helping it to become a reality. This has happened, not because of some form of
altruism, but because of mandate. Almost all information produced by the U.S. Government,
including that of a geographic nature, is in the public domain. The people of the United States
own the data, and have a right to it without cost (government producers are allowed to charge
only for production and distribution costs, not content). The Freedom of Information Act
(U.S. Congress 1966) requires the producers of information in the
federal government to do anything
they can to make that information more freely accessible to the public.
Currently, the major governmental voice in the data sharing effort is the Federal
Geographic Data Committee, set up in 1990 "to promote the coordinated development, use,
sharing, and dissemination of surveying, mapping, and related spatial data."
(OMB 1990) It is
composed of representatives of the 5 Cabinet Departments (Interior, Commerce, Agriculture,
State, Transportation) that are responsible for producing various kinds of spatial data.
As part of their mission, they are overseeing the creation of a National Spatial Data
Infrastructure (NSDI), part of the National Information Infrastructure (NII) initiative, which sets
out the government's role in developing the communication networks of the future. The NSDI
was officially mandated by an executive order signed on April 11, 1994 (EOP 1994). According
to the official definition adopted by the FGDC,
"The National Spatial Data Infrastructure is the means to assemble geographic
information that describes the arrangement and attributes of features and
phenomena on the Earth. The infrastructure includes the materials, technology, and
people necessary to acquire, process, store, and distribute such information to
meet a wide variety of needs." (MSC 1993, p. 2)
It is not concerned with the physical wiring of the computer networks involved, but how
the four criteria listed above can be met to allow the free exchange of spatial data between
producers and consumers. A major part of the NSDI is the creation of a national Spatial Data
Clearinghouse where data producers can contribute data they would like to share, and potential
data users can come to find and obtain the data they need. If built well, it will produce an
efficient, powerful means of data sharing, which will save time and money for almost anyone
involved with GIS. The following section looks at some of the issues that must be dealt with in
designing and implementing the Spatial Data Clearinghouse.
To build the Spatial Data Clearinghouse, several types of information need to come together. If
they are put together well, the system will be easy to use and maintain. If not, it could be a
nightmare to keep up-to-date, and impossible to use. The pieces can be separated into four
distinct "levels:"
- Geographic Data.
These are the spatial datasets that are used in a GIS. They could be map data from
government agencies (i.e. USGS DLG's), statistical data that has spatial significance (i.e.
Census summary files), or complete GIS projects (i.e. Arc/Info export files). With the
Internet, this data can be stored and easily retrieved, usually via anonymous FTP servers.
Thus, each dataset can be identified by a Uniform Resource Locator such as:
ftp://data.census.gov/tiger/washington/snohomish/tgr28019.f1
- Metadata.
This is information that describes the dataset, so potential users can decide if the data will
be useful to them before they spend the time or money to download it and look at it. It
should at least include information about the spatial coverage, subject matter, format, and
location (using URL's) of each dataset. To be most useful, the format should be based on
recognized standards, such as the FGDC Content Standard for Digital Geographic
Metadata (FGDC 1994b), which outlines the pieces of information
that must be included in a useful metadata record.
- Metadata Index.
This level allows users to search through the metadata for a dataset meeting desired
criteria. For spatial data, there must be provisions for at least two different types of
criteria: keywords in the text or in certain fields, and spatial queries searching for a given
region. For example, I may ask for "Transportation data produced after 1980 [keywords]
covering the coordinate 38°20' N 115°45' W [spatial]."
For maximum flexibility, the index must be based on a standard, open search protocol to
allow for multiple interfaces, and access from anywhere on the Internet. In this project,
WAIS is the protocol of choice, since it is Internet-based, well-integrated with the
protocols used to retrieve the data itself, and has simple spatial query capabilities.
- Search Interface.
This is a front end for the Index level, which allows users to enter queries in an intuitive
way, passing the criteria to the index(es). It could be anything from a query form in which
users directly enter criteria, or a menu hierarchy that allows users to make choices for
certain criteria (i.e. themes), to a map interface that allows users to point to the area
in which they wish to search. This level must be based on an open interface system that has
adequate interface-definition capabilities (including graphics), and is widely accessible
over the Internet. Currently, the only feasible solution is the World Wide Web (WWW).
Figure 1 shows how these four levels relate in a particular situation. The user sets certain criteria for
the desired geographic data using the WWW Interface Level (top). The interface software
formats this into a valid WAIS query and sends it to the Index Level. The index performs the query and
returns matching Metadata to the user (there could be several entries). After looking through these
records, the user could choose the most appropriate one(s), and download the respective Geographic
Data immediately. A process that may have taken months, if the data could be found at all, is reduced
to a few minutes.
Any system which hopes to be comprehensive (i.e. cataloging all available
spatial data in the nation) must be designed in such a way as to ensure scalability, maintainability,
usability, and flexibility. These factors depend heavily upon the physical location of each part of
the Clearinghouse on the Internet.
For example, a national government could decide to set up a monolithic Geography Server
which contained all the geographic data which exists in the country. While this would be
conceptually easy to set up, it is not at all scalable (it is currently impossible to put enough disk
space on a single computer, and even if possible, it would be extremely slow); maintaining the
system would require a large staff (including several who spend all their time trying to find new
available resources); and the monolithic design would not be very flexible (one agency would have
ultimate control over the system).
At the other end of the scale, it is also conceptually easy to require that everyone maintain
the interface, index, metadata, and dataset storage for their own spatial data. This approach
would produce a very fractured, anarchic set of dissimilar products, and locating data would still
be next to impossible (which of the 500 indexes do I look in?). It would also be far from
comprehensive, since not every data producer has the resources to do their own distribution
(usually requiring a dedicated data service computer and a fast Internet connection).
Table 1 represents the locational choices for each of the four levels. Centralized
location means that it is housed at a single national/global site; Major-only location means
that it is housed at a few (<50) large, professionally-staffed sites; Distributed location
means that it could appear anywhere on the Net.
TABLE 1. Locational Choices for Spatial Data Levels

LEVEL            Centralized  Major-Only  Distributed
-----------------------------------------------------
Spatial Data                      *           ***
Metadata              *          ***           **
Metadata Index        **         ***           *
Query Interface       ***         **           *

The number of asterisks represents the relative desirability of each choice (no asterisks means
the choice is not possible). The following is an explanation of the shown preferences:
- Spatial Data
The location of the data itself is not really up for debate; it will be distributed all over the
Net, despite any other efforts. There will certainly be a few major sites, such as
government agencies and public-access archives (where users without their own
distribution resources can place datasets) with thousands of files each, but there will also
be hundreds of small businesses and academic departments who each contribute two or
three files. This massively-distributed nature of the Internet is what makes this project so
important.
- Metadata and Index
The second and third levels must be considered together, because currently, the WAIS
software will only index information that exists on the same computer as the search server.
The trade-off here is that a more centralized approach aids the search process ("one-stop
shopping"), but the distributed approach ensures greater comprehensiveness and
completeness (since each person indexes their own data), as well as demanding fewer
resources from each server. A compromise may be to have a few (10-50) large indexes,
with an open-maintenance approach that allows data providers to add and edit their own
metadata entries interactively (possible through the use of HTML forms).
- Interface
The location of the Interface level is important because it is the level at which users
operate. A single query interface must be able to simultaneously search any and all
indexes existing in the Clearinghouse transparently, appearing to the user as though he or
she is searching one massive database(1). However, there should be at least a few different
interfaces (each comprehensive) to allow for different metaphors for querying (i.e. map-
based spatial queries, placename lookup, keyword search). Since WAIS is an open
Internet standard, the interface query tools don't have to be on the same computers as the
indexes; an interface based on one machine could search several databases existing all over
the world.
In the author's opinion, the best solution seems to be: distributed data, a few major metadata-
index servers, and three or four user interfaces, each of which is comprehensive. However, there
is currently some debate over the preferred location of the levels to achieve maximum
comprehensiveness. So far, the experimental clearinghouse projects such as this one have focused
on the implementation of each of these levels using datasets of limited size on single servers,
purposefully avoiding this dilemma of creating a comprehensive national-scale clearinghouse.
For this project, a working prototype of the clearinghouse was created, building the top
three layers on an existing spatial data archive on the Internet.
The test data archive needed to be small (<200 datasets) and exist on a single computer, so
that changes in the format and layout of the metadata could be made quickly and the scalability
problem would not have to be dealt with. However, real datasets that exemplify what will
eventually be in the NSDI are desired, with enough to be useful to users even in this pilot stage.
A good archive that meets these criteria is the popular anonymous FTP server at
spectrum.xerox.com. This site acts as a "swap meet" for geodata, where some users can
contribute public data they have acquired and others can download it. It is not very large (<200
individual datasets), but contains real data that should be a good test for the metadata, the index,
and the prototype interfaces. This site contains data of the following types:
- USGS Digital Elevation Models (from the 7.5 minute and 1 degree series)
- USGS Digital Line Graph Files (from the 1:24000, 1:100000, and 1:2million series)
- USGS Land Use/Land Cover Files
- U.S. Census Tiger files (1990 Pre- and Post-census series)
- CIA World Databank files
Of these, the DEM and 1:24000 DLG directories had a good number of datasets, while
the others did not have enough to be worth working with for now. These two groups together
list 150 datasets: five 1° DEM's, 83 7.5' DEM's, and 62 1:24000 DLG's.
Creating a complete metadata record for each one of these 150 datasets would be very
tedious. Fortunately, each directory on spectrum has a comprehensive index of the datasets there.
Although the information about each file is not complete and is sometimes inconsistent, it is
enough to build the most necessary parts of the metadata. Each
directory, which contains a certain type of geodata, has its own index file, tailored to the needs of
that dataset. These index files have slightly different formats but are fairly straightforward.
Although the FGDC Content Standard does not specify an exact format for the data, the
wording of the standard was followed fairly closely to keep the task simple. The Standard is
basically a hierarchy of information structures, where major organizational elements can contain
several more detailed elements, with bottom-level elements having attributes associated with
them. In places, the hierarchy can go 8-9 levels deep, and in full implementation, a metadata
record could include over 300 lines. For this project, the structure was modified somewhat by
treating some elements as attributes, condensing several levels; it is still compliant with the
Standard, since the same information is present, and the Standard only defines content, not
structure.
To turn the Spectrum index files into CSDGM-compliant metadata, I used a set of scripts
written in the Perl programming language. Each script parses the lines of an index file, using the
information it finds to build a metadata record. Several of the lines were in a nonstandard format,
which required me to edit some of the resulting metadata files by hand. In actual Clearinghouse
implementation, this would not be a problem, because the metadata record would be created by
the data supplier when the data was submitted.
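A rough sketch of the kind of parsing these scripts performed follows (the original scripts were in Perl; the index-line layout and metadata field names shown here are assumptions for illustration, not the actual spectrum format):

```python
def parse_index_line(line):
    # Assumed layout: <filename> <north> <south> <east> <west> <title words...>
    # (hypothetical; the real spectrum index files varied by directory)
    parts = line.split()
    return {
        "filename": parts[0],
        "nbndgcoord": float(parts[1]),
        "sbndgcoord": float(parts[2]),
        "ebndgcoord": float(parts[3]),
        "wbndgcoord": float(parts[4]),
        "title": " ".join(parts[5:]),
    }

def format_metadata(record, base_url):
    # Emit a flat, CSDGM-style tagged record (levels condensed by treating
    # elements as attributes, as described above)
    lines = [
        "Title: %s" % record["title"],
        "West_Bounding_Coordinate: %s" % record["wbndgcoord"],
        "East_Bounding_Coordinate: %s" % record["ebndgcoord"],
        "North_Bounding_Coordinate: %s" % record["nbndgcoord"],
        "South_Bounding_Coordinate: %s" % record["sbndgcoord"],
        "Online_Linkage: %s/%s" % (base_url, record["filename"]),
    ]
    return "\n".join(lines)
```

Nonstandard lines would simply fail to parse and be set aside for hand editing, as happened in practice.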
The indexing is done using the freeWAIS software, which is being developed by CNIDR,
based on the WAIS protocol created by Brewster Kahle (formerly at Thinking Machines). This
protocol provides for rapid searching of large and/or numerous text files, such as the metadata.
The actual software used was freeWAIS-sf, modified at the University of Dortmund to include
field-based searches and better boolean queries, both of which are vital for the simple spatial
queries that are done with the metadata.
The "index" produced by the WAIS indexer consists of a dictionary of words used, and an
inverse file, which lists the locations of each word in the dictionary. The fields are also parsed
into separate dictionary and inverse files to allow field searching.
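The dictionary-and-inverse-file idea can be illustrated with a minimal inverted index (a sketch of the concept only, not the actual freeWAIS file format):

```python
def build_inverted_index(documents):
    # documents: {doc_id: text}; returns {word: set of doc_ids containing it}
    index = {}
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index

def search(index, *words):
    # AND together the posting sets for each query word
    results = None
    for word in words:
        postings = index.get(word.lower(), set())
        results = postings if results is None else results & postings
    return results or set()
```

A real index also records word positions and frequencies (for relevance ranking) and keeps the structures on disk, but the lookup principle is the same.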
To handle queries, a WAIS server is executed, which runs constantly on the host
computer. As it receives a query from a client on the Internet (or on the same computer), it
parses it into separate fields, performs each search, and uses any boolean operators to narrow the
results down, returning a list of matching records (each expressed as a range of bytes in a file).
As the user selects one of the entries, the client requests the particular record, which is extracted
from the appropriate file and returned.
The WAIS client, which contains the user interface and sends entered queries to the server,
could be a program designed specifically for WAIS (such as the ones distributed with the server
and indexer), a general Internet browser such as Mosaic, or as in our case, a script running behind
a custom interface.
Spatial Searching
An important part of this database is the ability to do spatial queries. While the potential
user of geographic data may be searching for certain keywords in many fields, such as theme,
producer, or data quality, the most important single piece of information for determining the
appropriateness of a dataset is its spatial coverage: the area on the ground that it represents. The
best way to characterize this coverage is as an arbitrarily complex polygon, expressed as a list of
coordinate pairs(2).
To execute a query on this database, the client would pass a g-ring (the desired coverage)
to the server, which would execute a polygon overlay algorithm with each coverage g-ring in the
database to see if they overlap, returning any that do as matches. While this is the most accurate
way of doing spatial queries, it is currently not implemented in WAIS(3), and when it is, will
probably be fairly slow compared to field and full-text searches.
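For illustration, a naive version of such a g-ring overlap test might look like the following sketch (degenerate cases such as touching or collinear edges are ignored, and a real implementation would be both more careful and much faster):

```python
def point_in_ring(pt, ring):
    # Ray-casting test: count how many polygon edges a ray cast east
    # from the point crosses; an odd count means the point is inside.
    x, y = pt
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside
    return inside

def segments_cross(p1, p2, p3, p4):
    # Proper intersection test via orientation signs (collinear cases ignored)
    def orient(a, b, c):
        v = (b[0]-a[0])*(c[1]-a[1]) - (b[1]-a[1])*(c[0]-a[0])
        return (v > 0) - (v < 0)
    return (orient(p1, p2, p3) != orient(p1, p2, p4) and
            orient(p3, p4, p1) != orient(p3, p4, p2))

def rings_overlap(a, b):
    # Two rings overlap if either contains a vertex of the other,
    # or if any pair of edges crosses
    if any(point_in_ring(p, b) for p in a):
        return True
    if any(point_in_ring(p, a) for p in b):
        return True
    for i in range(len(a)):
        for j in range(len(b)):
            if segments_cross(a[i], a[(i+1) % len(a)], b[j], b[(j+1) % len(b)]):
                return True
    return False
```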
Another approach, raised by Doug Nebert of USGS (Nebert 1994),
is to do "fake" spatial
queries using fields and a Minimum Bounding Rectangle (MBR) query. To accomplish this, the
spatial coverage of each dataset is stored as four numbers: the easternmost and westernmost
longitudes in the dataset, and the northernmost and southernmost latitudes; this forms a rectangle
which just encloses the data. While this is a very crude approximation of the dataset's
coverage(4), it is very simple to work with.
These values are stored as separate numeric-type fields, specified in the CSDGM as
EBNDGCOORD, WBNDGCOORD, NBNDGCOORD, and SBNDGCOORD, respectively
(ASTM 1994). When the user submits a rectangle representing the desired coverage (say
ELON, WLON, NLAT, SLAT), the client merely converts this request into the following
boolean field WAIS query to submit to determine overlapping datasets:

ELON>WBNDGCOORD and WLON<EBNDGCOORD and NLAT>SBNDGCOORD and SLAT<NBNDGCOORD
This query is processed by the freeWAIS-sf server as a normal field query, just as quickly
as any other. While spurious results may be returned, this approach will work for the pilot
application.
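The rectangle-overlap test behind this boolean query can be sketched as follows (the field names come from the CSDGM; the comparison syntax in the generated query string is illustrative rather than exact freeWAIS-sf syntax):

```python
def mbr_overlaps(query, dataset):
    # query and dataset are (wlon, elon, slat, nlat) bounding rectangles.
    # Two rectangles overlap iff each one's east edge lies east of the
    # other's west edge, and each one's north edge lies north of the
    # other's south edge.
    q_w, q_e, q_s, q_n = query
    d_w, d_e, d_s, d_n = dataset
    return q_e > d_w and q_w < d_e and q_n > d_s and q_s < d_n

def mbr_query(wlon, elon, slat, nlat):
    # Build the boolean field query sent to the index server
    # (comparison syntax is an assumption for illustration)
    return ("wbndgcoord<%s and ebndgcoord>%s and "
            "sbndgcoord<%s and nbndgcoord>%s" % (elon, wlon, nlat, slat))
```

Because the four comparisons are ordinary numeric field searches, the server needs no spatial machinery at all; the price is the spurious matches noted above.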
To more intelligently handle the particular data being used, an intermediate WWW-WAIS
interface, called SFgate, is used. This perl package was written by the same people at Dortmund
University who did the work on freeWAIS-sf, with a few modifications by the author. It takes a
query as it would be entered by an HTML form (i.e.
http://...?ebndgcoord=-75&wbndgcoord=75&...), forms it into a proper boolean WAIS
expression, submits the query to the WAIS server, and formats the results. The author's
modification allows for the customization of the display of the metadata, so the plain text
record can be turned into an HTML document, with appropriate headings and hypertext links
(i.e. pointing to the actual location of the dataset).
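A sketch of this kind of metadata-to-HTML rendering follows (illustrative only; the actual SFgate code and the exact metadata field layout differ):

```python
import re

def metadata_to_html(text):
    # Turn "Field_Name: value" lines into an HTML definition list, and
    # wrap any URL value (such as an Online_Linkage field pointing at the
    # dataset itself) in a hyperlink.
    items = []
    for line in text.splitlines():
        field, _, value = line.partition(": ")
        if re.match(r"(ftp|http|gopher)://", value):
            value = '<a href="%s">%s</a>' % (value, value)
        items.append("<dt>%s</dt><dd>%s</dd>" % (field.replace("_", " "), value))
    return "<dl>\n%s\n</dl>" % "\n".join(items)
```

The key point is that because the metadata travels as plain structured text, the presentation layer can be changed without touching the index or the records.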
The nice thing about the WAIS engine (and the SFgate intermediate interface) is its
client/server Internet-based approach, which means that many client interfaces on various host
computers can be implemented, which all submit queries to a single database. In this experimental
project, two interfaces were developed which should allow geographic researchers to intuitively
perform spatial queries on the GeoWeb database.
The first interface is in the form of a gazetteer, where users specify a location by name,
and request data that covers that locale. When the user enters a placename (i.e. "columbus,
oh"), the server script looks it up in a database of populated places(5). This database returns
basic information about any matching places, including the location in latitude and longitude
(i.e. "40 23 45 N 80 43 20 W"). The script takes the returned text and formats it into an HTML list
of places, adding a link to the end of each one which, when selected, spatially searches the
database using the place's coordinates.
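The gazetteer flow, from the coordinate string returned by the name server to a spatial-search link, can be sketched like this (the URL scheme and host are hypothetical, and the coordinate format is assumed from the example above):

```python
def parse_dms(text):
    # Convert "40 23 45 N 80 43 20 W" (assumed name-server format)
    # into signed decimal degrees (lat, lon)
    p = text.split()
    lat = int(p[0]) + int(p[1]) / 60.0 + int(p[2]) / 3600.0
    if p[3] == "S":
        lat = -lat
    lon = int(p[4]) + int(p[5]) / 60.0 + int(p[6]) / 3600.0
    if p[7] == "W":
        lon = -lon
    return lat, lon

def place_query_url(text, margin=0.25):
    # Pad the point coordinate out to a small rectangle and build a
    # spatial-search link (the host and parameter names are illustrative)
    lat, lon = parse_dms(text)
    return ("http://geoweb.example/search?ebndgcoord=%.4f&wbndgcoord=%.4f"
            "&nbndgcoord=%.4f&sbndgcoord=%.4f"
            % (lon + margin, lon - margin, lat + margin, lat - margin))
```

Padding the point into a rectangle is necessary because the index stores bounding boxes; a bare point query would match only datasets whose box straddles that exact coordinate.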
The second interface implemented was a map-based approach, where users can use a map
of the United States in a WWW browser to specify the desired area. This map can be zoomed in
and out, and panned in any direction, until the user finds the region needed. This is done using a
link to the Xerox MapViewer which generates simple
GIF-format maps based on user-supplied criteria.
The mapbrowse script receives basic criteria from a query (i.e.
"http://...?lat=40&lon=-90&width=5") and generates an HTML page including the appropriate
MapViewer image, and graphical "buttons" for panning (i.e. the left button re-requests the same
mapbrowse script, but with lon=lon-width/2 to pan half a screen to the West) and zooming (i.e.
"zoom in" re-requests the script with width=width/2). A small form allows users to enter the
three pertinent criteria directly, and there is a link to the above gazetteer interface to center the
map on an actual place. Using a combination of the interactive graphics, direct entry, and
keyword lookup approaches, the user should be able to easily find the desired region.
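The pan and zoom arithmetic behind those buttons can be sketched as follows (the script URL is hypothetical):

```python
def pan_west(lat, lon, width):
    # "Pan west" re-requests mapbrowse with the center shifted half a
    # screen to the west (same idea, with signs flipped, for the other
    # three directions)
    return lat, lon - width / 2.0, width

def zoom_out(lat, lon, width):
    # Doubling the width shows more area (zoom out); halving it zooms in
    return lat, lon, width * 2.0

def mapbrowse_url(lat, lon, width):
    # Illustrative URL; the real script path on the server is assumed
    return "http://geoweb.example/mapbrowse?lat=%g&lon=%g&width=%g" % (lat, lon, width)
```

Because every button is just a link back to the same script with recomputed parameters, the interface is stateless, which suits the one-request-at-a-time nature of the Web.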
Once the proper region is displayed on the screen, the optimal way of entering a query for
the metadatabase would be for the user to draw the desired area on the map, as a rectangle or
polygon. Since this kind of complex graphical input is not yet implemented into HTML and the
World-Wide Web(6), a workaround was implemented. Four points are displayed on the map at all
times, forming a rectangle which covers about 1/3 of the map area (to give some context to the
enclosed area). A hyperlink in the HTML text below the map executes a spatial query to the
GeoWeb database, using the displayed rectangle as the query region. Again, this is crude, but it
works as a proof of concept for the map-oriented interface.
There was no rigorous testing of the effectiveness of the interface. However, several
researchers involved with similar projects were asked to look at the two interfaces and the
metadata and comment on their effectiveness. Except for a few cosmetic suggestions, the response
was very favorable; everyone felt the gazetteer and map were intuitive ways of looking for spatial
data. Suggestions for future enhancements included:
- Fields to allow users to specify non-spatial keywords for the query
- HTML Forms to allow data producers to add their own metadata entries
- A more tightly-structured metadata format (still compliant with CSDGM) to allow for
more intelligent processing and display of the metadata records.
- Ability to search other similar databases simultaneously
All of these suggestions are in harmony with the structure of the Clearinghouse described above.
However, the conceptual framework for the Clearinghouse, the physical format of the metadata,
and the design of the interface tools as presented here were all based on one person's ideas. Much
work needs to be done to build consensus on actual implementation issues such as these, so that it
can be built well and on time. Although many of the technical details need to be worked out, this
study shows that each of the pieces of the Clearinghouse can be built using accepted standards
into a very useful product for finding and accessing spatial data.
Notes

(1) The technical feasibility of having a single interface search 50 WAIS databases
simultaneously is questionable, and needs to be tested in the near future. If it is not possible,
a single monolithic metadata/index server would be the next best solution, and this is not
desirable.

(2) According to the Spatial Data Transfer Standard dictionary, a closed, ordered string of
coordinate pairs is called a "G-Ring." (Fegeas et al 1992)

(3) The spatial-query code is currently being worked into freeWAIS-sf for release in the fall of
1994. Also, a full spatial handling component is currently being developed for inclusion into
the Z39.50v3 protocol, the successor to WAIS (Nebert 1994).

(4) For example, the MBR of the state of California also includes most of Nevada and a
considerable area of ocean.

(5) The Geographic Names Server, maintained by Tom Libert at the University of Michigan, can
be found at telnet://martini.eecs.umich.edu:3000.

(6) It is currently being worked into HTML Level 3 as a "scribble" input type, although the
details of implementation have not yet been worked out.
References

89th U.S. Congress (1966). Freedom of Information Act. United States Code, Title V, Section
552 (most recently amended 1986).
gopher://eryx.syr.edu/00/Citizen%27s%20Guide/Appendix%204-Information%20Act

American Society for Testing Materials (1994). Content Standard for Digital Geospatial
Metadata. Section D18.01.05 Draft Specification, May 23, 1994.
ftp://waisqvarsa.er.usgs.gov/wais/docs/ASTMmeta83194.ps

Executive Office of the President (1994). "Coordinating Geographic Data Acquisition and
Access: The National Spatial Data Infrastructure." Executive Order 12906. Washington,
D.C.: Government Printing Office. Signed April 11, 1994.
ftp://fgdc.er.usgs.gov/gdc/html/execord.html

Federal Geographic Data Committee (1994a). The 1994 Plan for the National Spatial Data
Infrastructure. Reston, VA: FGDC. March 1994.
ftp://fgdc.er.usgs.gov/gdc/general/documents/nsdi.plan.1994.ps

Federal Geographic Data Committee (1994b). Content Standard for Digital Geospatial
Metadata. Reston, VA: FGDC. Final draft, June 8, 1994.
ftp://fgdc.er.usgs.gov/gdc/metadata/meta.6894.ps

Fegeas, Robin G., Cascio, Janette L., and Lazar, Robert A. (1992). "An Overview of FIPS 173,
The Spatial Data Transfer Standard." Cartography and Geographic Information Systems,
Vol. 19, No. 5 (Dec. 1992).
ftp://sdts.er.usgs.gov/pub/sdts/articles/ps/overview.ps

Mapping Sciences Committee, John D. Bossler, chair (1993). "Toward a Coordinated Spatial
Data Infrastructure for the Nation." Washington, DC: National Academy of Sciences Press.

Nebert, Douglas V. (1994). Personal electronic communication with the author.

Office of Management and Budget, Executive Office of the President (1990). Coordination of
Federal Surveying, Mapping, and Related Spatial Data Activities. Circular A-16 Revised.
Washington, D.C.: Government Printing Office.
ftp://fgdc.er.usgs.gov/gdc/general/documents/a-16.txt
Brandon Plewe is originally from St. George Utah,
but currently resides in Buffalo, New
York, where he is the Assistant Coordinator for Campus-Wide Information Services at the State
University of New York. He holds a B.S. in Math and Cartography from Brigham Young
University, is completing his M.A. in Geography at SUNY/Buffalo, and is working on a PhD in
the same subject.
Along with the UB Wings CWIS which he co-manages, he has been a frequent contributor
to the popularity and usability of the World Wide Web, developing, among other things, the first
Best of the Web award competition, and the popular
Virtual Tourist service. He continues to
search for ways to integrate mapping and geography into the Internet. He has been married to his
wife Jamie for 2½ years, and has one son, Spencer.
He can be contacted at plewe@acsu.buffalo.edu.