The GeoWeb Project
Using WAIS and the World Wide Web to aid location of Distributed Data Sets
Brandon Plewe,
State University of New York at Buffalo
The acquisition of data is almost always the most expensive and time-consuming part of
GIS analysis. While Internet tools such as FTP promise to make spatial data easier to retrieve,
it is still often very difficult to find. The National Spatial Data Infrastructure initiative of the
U.S. federal government, and its Spatial Data Clearinghouse component, give a structure and
impetus to the creation of powerful, easy-to-use tools for locating and retrieving geographic
data through the use of Metadata. However, the actual structure and administration of the
Clearinghouse have not yet been determined, and several models are possible.
The GeoWeb project provides an example of one approach to the Clearinghouse
metadata-base, as well as prototype map- and query-based interfaces to aid the user in locating
real geographic data. This pilot is successful in providing a way to find and retrieving data, and
Ask anyone involved with Geographic Information Systems (GIS) what the least favorite
part of their work is, and you will very likely get the same answer: Data Input. This first phase of
any GIS analysis, involving finding, creating, acquiring, entering, and formatting spatial data to
prepare it for analysis, is almost always the most expensive and time-consuming part of the project.
This project looks at what is being done, and what can be done, to help users more easily locate
and acquire spatial data using the Internet and its tools.
Most GIS analyses currently rely on either government- or self-produced data. The U.S.
Government produces huge amounts of information, including geographic and spatial data of
many forms, ranging from linework for basemaps produced by USGS, to thematic data from the
Census that have spatial significance. In most cases, these datasets have been too large in total to
be distributed easily. For instance, the complete set of USGS GeoData fills a warehouse of
magnetic tapes, which must be located, copied, and delivered on demand. The incurred time and
cost of this process make one wonder how "public" the data really is.
Many commercial software and data vendors make public data more convenient by
repackaging it into different forms and media, sometimes adding new information to it in the
process. However, these commercial sources can be very expensive. The data they produce is
geared heavily toward commercial GIS applications (mostly involving street address matching),
and rarely have much value for environmental or rural subjects.
If a researcher cannot find an outside source (public or private) for data, it will need to be
collected firsthand. This may involve field studies, photogrammetry or remote sensing, surveying,
scanning, or digitizing; whatever the process, it is almost always time-consuming and expensive.
One of the keys to aiding data acquisition efforts is data sharing, in which those who have
already acquired data for their own projects make it readily available to others who need it. Over
the 30-year history of GIS research, gigabytes of geographic data have been collected for projects
that are long since completed. While this valuable information is gathering dust on tape, other
people are out reproducing it from scratch for their own projects. If someone is doing research
on soils in some remote mountain area, and another group has already done a soil survey of that
area, the researcher would probably be very eager to use the data (even for a fee), and the other
group would probably be willing to sell it.
However, this concept cannot become a reality by itself. There are four necessary steps to
a successful data sharing effort:
- Data producers need to be willing to share their data with others.
- The data that producers wish to share needs to be in an open form that allows many
other people to successfully read and use it.
- There need to be tools that allow these producers to make data easily accessible to
potential users.
- There need to be tools whereby data users can learn what data is available out there that
may be useful.
The U.S. federal government has been evangelizing the data sharing concept for many
years, and helping it to become a reality. This has happened, not because of some form of
altruism, but because of mandate. Almost all information produced by the U.S. Government,
including that of a geographic nature, is in the public domain. The people of the United States
own the data, and have a right to it without cost (government producers are allowed to charge
only for production and distribution costs, not content). The Freedom of Information Act
(U.S. Congress 1966) requires the producers of information in the
federal government to do anything
they can to make that information more freely accessible to the public.
Currently, the major governmental voice in the data sharing effort is the Federal
Geographic Data Committee, set up in 1990 "to promote the coordinated development, use,
sharing, and dissemination of surveying, mapping, and related spatial data."
(OMB 1990) It is
composed of representatives of the 5 Cabinet Departments (Interior, Commerce, Agriculture,
State, Transportation) that are responsible for producing various kinds of spatial data.
As part of their mission, they are overseeing the creation of a National Spatial Data
Infrastructure (NSDI), part of the National Information Infrastructure (NII) initiative, which sets
out the government's role in developing the communication networks of the future. The NSDI
was officially mandated by an executive order signed on April 11, 1994 (EOP 1994). According
to the official definition adopted by the FGDC,
"The National Spatial Data Infrastructure is the means to assemble geographic
information that describes the arrangement and attributes of features and
phenomena on the Earth. The infrastructure includes the materials, technology, and
people necessary to acquire, process, store, and distribute such information to
meet a wide variety of needs." (MSC 1993, p. 2)
It is not concerned with the physical wiring of the computer networks involved, but how
the four criteria listed above can be met to allow the free exchange of spatial data between
producers and consumers. A major part of the NSDI is the creation of a national Spatial Data
Clearinghouse where data producers can contribute data they would like to share, and potential
data users can come to find and obtain the data they need. If built well, it will produce an
efficient, powerful means of data sharing, which will save time and money for almost anyone
involved with GIS. The following section looks at some of the issues that must be dealt with in
designing and implementing the Spatial Data Clearinghouse.
To build the Spatial Data Clearinghouse, several types of information need to come together. If
they are put together well, the system will be easy to use and maintain. If not, it could be a
nightmare to keep up-to-date, and impossible to use. The pieces can be separated into four
distinct "levels:"
- Geographic Data.
These are the spatial datasets that are used in a GIS. They could be map data from
government agencies (i.e. USGS DLG's), statistical data that has spatial significance (i.e.
Census summary files), or complete GIS projects (i.e. Arc/Info export files). With the
Internet, this data can be stored and easily retrieved, usually via anonymous FTP servers.
Thus, each dataset can be identified by a Uniform Resource Locator such as:
ftp://data.census.gov/tiger/washington/snohomish/tgr28019.f1
- Metadata.
This is information that describes the dataset, so potential users can decide if the data will
be useful to them before they spend the time or money to download it and look at it. It
should at least include information about the spatial coverage, subject matter, format, and
location (using URL's) of each dataset. To be most useful, the format should be based on
recognized standards, such as the FGDC Content Standard for Digital Geographic
Metadata (FGDC 1994b), which outlines the pieces of information
that must be included in a useful metadata record.
- Metadata Index.
This level allows users to search through the metadata for a dataset meeting desired
criteria. For spatial data, there must be provisions for at least two different types of
criteria: keywords in the text or in certain fields, and spatial queries searching for a given
region. For example, I may ask for "Transportation data produced after 1980 [keywords]
covering the coordinate 38°20' N 115°45' W [spatial]."
For maximum flexibility, the index must be based on a standard, open search protocol to
allow for multiple interfaces, and access from anywhere on the Internet. In this project,
WAIS is the protocol of choice, since it is Internet-based, well-integrated with the
protocols used to retrieve the data itself, and has simple spatial query capabilities.
- Search Interface.
This is a front end for the Index level, which allows users to enter queries in an intuitive
way, passing the criteria to the index(es). It could be anything from a query form in which
users directly enter criteria, or a menu hierarchy that allows users to make choices for
certain criteria (i.e. themes), to a map interface that allows users to point to the area
in which they wish to search. This level must be based on an open interface system that has
adequate interface-definition capabilities (including graphics), and is widely accessible
over the Internet. Currently, the only feasible solution is the World Wide Web (WWW).
Figure 1 shows how these four levels relate in a particular situation. The user sets certain criteria for
the desired geographic data using the WWW Interface Level (top). The interface software
formats this into a valid WAIS query and sends it to the Index Level. The index performs the query and
returns matching Metadata to the user (there could be several entries). After looking through these
records, the user could choose the most appropriate one(s), and download the respective Geographic
Data immediately. A process that may have taken months, if the data could be found at all, is reduced
to a few minutes.
Any system which hopes to be comprehensive (i.e. cataloging all available
spatial data in the nation) must be designed in such a way as to ensure scalability, maintainability,
usability, and flexibility. These factors depend heavily upon the physical location of each part of
the Clearinghouse on the Internet.
For example, a national government could decide to set up a monolithic Geography Server
which contained all the geographic data which exists in the country. While this would be
conceptually easy to set up, it is not at all scalable (it is currently impossible to put enough disk
space on a single computer, and even if possible, it would be extremely slow); maintaining the
system would require a large staff (including several who spend all their time trying to find new
available resources); and the monolithic design would not be very flexible (one agency would have
ultimate control over the system).
At the other end of the scale, it is also conceptually easy to require that everyone maintain
the interface, index, metadata, and dataset storage for their own spatial data. This approach
would produce a very fractured, anarchic set of dissimilar products, and locating data would still
be next to impossible (which of the 500 indexes do I look in?). It would also be far from
comprehensive, since not every data producer has the resources to do their own distribution
(usually requiring a dedicated data service computer and a fast Internet connection).
Table 1 represents the locational choices for each of the four levels. Centralized
location means that it is housed at a single national/global site; Major-only location means
that it is housed at a few (<50) large, professionally-staffed sites; Distributed location
means that it could appear anywhere on the Net.
TABLE 1. Locational Choices for Spatial Data Levels

LEVEL            Centralized  Major-Only  Distributed
-----------------------------------------------------
Spatial Data                      *           ***
Metadata              *          ***           **
Metadata Index        **         ***           *
Query Interface       ***         **           *

The number of asterisks represents the relative desirability of each choice (no asterisks means
the choice is not possible). The following is an explanation of the shown preferences:
- Spatial Data
The location of the data itself is not really up for debate; it will be distributed all over the
Net, despite any other efforts. There will certainly be a few major sites, such as
government agencies and public-access archives (where users without their own
distribution resources can place datasets) with thousands of files each, but there will also
be hundreds of small businesses and academic departments who each contribute two or
three files. This massively-distributed nature of the Internet is what makes this project so
important.
- Metadata and Index
The second and third levels must be considered together, because currently, the WAIS
software will only index information that exists on the same computer as the search server.
The trade-off here is that a more centralized approach aids the search process ("one-stop
shopping"), but the distributed approach ensures greater comprehensiveness and
completeness (since each person indexes their own data), as well as demanding fewer
resources from each server. A compromise may be to have a few (10-50) large indexes,
with an open-maintenance approach that allows data providers to add and edit their own
metadata entries interactively (possible through the use of HTML forms).
- Interface
The location of the Interface level is important because it is the level at which users
operate. A single query interface must be able to simultaneously search any and all
indexes existing in the Clearinghouse transparently, appearing to the user as though he or
she is searching one massive database(1). However, there should be at least a few different
interfaces (each comprehensive) to allow for different metaphors for querying (i.e. map-
based spatial queries, placename lookup, keyword search). Since WAIS is an open
Internet standard, the interface query tools don't have to be on the same computers as the
indexes; an interface based on one machine could search several databases existing all over
the world.
In the author's opinion, the best solution seems to be: distributed data, a few major metadata-
index servers, and three or four user interfaces, each of which is comprehensive. However, there
is currently some debate over the preferred location of the levels to achieve maximum
comprehensiveness. So far, the experimental clearinghouse projects such as this one have focused
on the implementation of each of these levels using datasets of limited size on single servers,
purposefully avoiding this dilemma of creating a comprehensive national-scale clearinghouse.
For this project, a working prototype of the clearinghouse was created, building the top
three layers on an existing spatial data archive on the Internet.
The test data archive needed to be small (<200 datasets) and exist on a single computer, so
that changes in the format and layout of the metadata could be made quickly and the scalability
problem would not have to be dealt with. However, real datasets that exemplify what will
eventually be in the NSDI are desired, with enough to be useful to users even in this pilot stage.
A good archive that meets these criteria is the popular anonymous FTP server at
spectrum.xerox.com. This site acts as a "swap meet" for geodata, where some users can
contribute public data they have acquired and others can download it. It is not very large (<200
individual datasets), but contains real data that should be a good test for the metadata, the index,
and the prototype interfaces. This site contains data of the following types:
- USGS Digital Elevation Models (from the 7.5 minute and 1 degree series)
- USGS Digital Line Graph Files (from the 1:24000, 1:100000, and 1:2million series)
- USGS Land Use/Land Cover Files
- U.S. Census Tiger files (1990 Pre- and Post-census series)
- CIA World Databank files
Of these, the DEM and 1:24000 DLG directories had a good number of datasets, while
the others did not have enough to be worth working with for now. These two groups together
list 150 datasets: five 1° DEM's, 83 7.5' DEM's, and 62 1:24000 DLG's.
Creating a complete metadata record for each one of these 150 datasets would be very
tedious. Fortunately, each directory on spectrum has a comprehensive index of the datasets there.
Although the information about each file is not complete and is sometimes inconsistent, it is
enough to build the most necessary parts of the metadata. Each
directory, which contains a certain type of geodata, has its own index file, tailored to the needs of
that dataset. These index files have slightly different formats but are fairly straightforward.
Although the FGDC Content Standard does not specify an exact format for the data, the
wording of the standard was followed fairly closely to keep the task simple. The Standard is
basically a hierarchy of information structures, where major organizational elements can contain
several more detailed elements, with bottom-level elements having attributes associated with
them. In places, the hierarchy can go 8-9 levels deep, and in full implementation, a metadata
record could include over 300 lines. For this project, the structure was modified somewhat by
treating some elements as attributes, condensing several levels; it is still compliant with the
Standard, since the same information is present, and the Standard only defines content, not
structure.
To turn the Spectrum index files into CSDGM-compliant metadata, I used a set of scripts
written in the Perl programming language. Each script parses the lines of an index file, using the
information it finds to build a metadata record. Several of the lines were in a nonstandard format,
which required me to edit some of the resulting metadata files by hand. In actual Clearinghouse
implementation, this would not be a problem, because the metadata record would be created by
the data supplier when the data was submitted.
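A rough sketch of the kind of parsing these scripts performed follows (the original scripts were in Perl; the index-line layout and metadata field names shown here are assumptions for illustration, not the actual spectrum format):

```python
def parse_index_line(line):
    # Assumed layout: <filename> <north> <south> <east> <west> <title words...>
    # (hypothetical; the real spectrum index files varied by directory)
    parts = line.split()
    return {
        "filename": parts[0],
        "nbndgcoord": float(parts[1]),
        "sbndgcoord": float(parts[2]),
        "ebndgcoord": float(parts[3]),
        "wbndgcoord": float(parts[4]),
        "title": " ".join(parts[5:]),
    }

def format_metadata(record, base_url):
    # Emit a flat, CSDGM-style tagged record (levels condensed by treating
    # elements as attributes, as described above)
    lines = [
        "Title: %s" % record["title"],
        "West_Bounding_Coordinate: %s" % record["wbndgcoord"],
        "East_Bounding_Coordinate: %s" % record["ebndgcoord"],
        "North_Bounding_Coordinate: %s" % record["nbndgcoord"],
        "South_Bounding_Coordinate: %s" % record["sbndgcoord"],
        "Online_Linkage: %s/%s" % (base_url, record["filename"]),
    ]
    return "\n".join(lines)
```

Nonstandard lines would simply fail to parse and be set aside for hand editing, as happened in practice.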
The indexing is done using the freeWAIS software, which is being developed by CNIDR,
based on the WAIS protocol created by Brewster Kahle (formerly at Thinking Machines). This
protocol provides for rapid searching of large and/or numerous text files, such as the metadata.
The actual software used was freeWAIS-sf, modified at the University of Dortmund to include
field-based searches and better boolean queries, both of which are vital for the simple spatial
queries that are done with the metadata.
The "index" produced by the WAIS indexer consists of a dictionary of words used, and an
inverse file, which lists the locations of each word in the dictionary. The fields are also parsed
into separate dictionary and inverse files to allow field searching.
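The dictionary-and-inverse-file idea can be illustrated with a minimal inverted index (a sketch of the concept only, not the actual freeWAIS file format):

```python
def build_inverted_index(documents):
    # documents: {doc_id: text}; returns {word: set of doc_ids containing it}
    index = {}
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index

def search(index, *words):
    # AND together the posting sets for each query word
    results = None
    for word in words:
        postings = index.get(word.lower(), set())
        results = postings if results is None else results & postings
    return results or set()
```

A real index also records word positions and frequencies (for relevance ranking) and keeps the structures on disk, but the lookup principle is the same.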
To handle queries, a WAIS server is executed, which runs constantly on the host
computer. As it receives a query from a client on the Internet (or on the same computer), it
parses it into separate fields, performs each search, and uses any boolean operators to narrow the
results down, returning a list of matching records (each expressed as a range of bytes in a file).
As the user selects one of the entries, the client requests the particular record, which is extracted
from the appropriate file and returned.
The WAIS client, which contains the user interface and sends entered queries to the server,
could be a program designed specifically for WAIS (such as the ones distributed with the server
and indexer), a general Internet browser such as Mosaic, or as in our case, a script running behind
a custom interface.
Spatial Searching
An important part of this database is the ability to do spatial queries. While the potential
user of geographic data may be searching for certain keywords in many fields, such as theme,
producer, or data quality, the most important single piece of information for determining the
appropriateness of a dataset is its spatial coverage: the area on the ground that it represents. The
best way to characterize this coverage is as an arbitrarily complex polygon, expressed as a list of
coordinate pairs(2).
To execute a query on this database, the client would pass a g-ring (the desired coverage)
to the server, which would execute a polygon overlay algorithm with each coverage g-ring in the
database to see if they overlap, returning any that do as matches. While this is the most accurate
way of doing spatial queries, it is currently not implemented in WAIS(3), and when it is, will
probably be fairly slow compared to field and full-text searches.
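For illustration, a naive version of such a g-ring overlap test might look like the following sketch (degenerate cases such as touching or collinear edges are ignored, and a real implementation would be both more careful and much faster):

```python
def point_in_ring(pt, ring):
    # Ray-casting test: count how many polygon edges a ray cast east
    # from the point crosses; an odd count means the point is inside.
    x, y = pt
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside
    return inside

def segments_cross(p1, p2, p3, p4):
    # Proper intersection test via orientation signs (collinear cases ignored)
    def orient(a, b, c):
        v = (b[0]-a[0])*(c[1]-a[1]) - (b[1]-a[1])*(c[0]-a[0])
        return (v > 0) - (v < 0)
    return (orient(p1, p2, p3) != orient(p1, p2, p4) and
            orient(p3, p4, p1) != orient(p3, p4, p2))

def rings_overlap(a, b):
    # Two rings overlap if either contains a vertex of the other,
    # or if any pair of edges crosses
    if any(point_in_ring(p, b) for p in a):
        return True
    if any(point_in_ring(p, a) for p in b):
        return True
    for i in range(len(a)):
        for j in range(len(b)):
            if segments_cross(a[i], a[(i+1) % len(a)], b[j], b[(j+1) % len(b)]):
                return True
    return False
```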
Another approach, raised by Doug Nebert of USGS (Nebert 1994),
is to do "fake" spatial
queries using fields and a Minimum Bounding Rectangle (MBR) query. To accomplish this, the
spatial coverage of each dataset is stored as four numbers: the easternmost and westernmost
longitudes in the dataset, and the northernmost and southernmost latitudes; this forms a rectangle
which just encloses the data. While this is a very crude approximation of the dataset's
coverage(4), it is very simple to work with.
These values are stored as separate numeric-type fields, specified in the CSDGM as
EBNDGCOORD, WBNDGCOORD, NBNDGCOORD, and SBNDGCOORD, respectively
(ASTM 1994). When the user submits a rectangle representing the desired coverage (say
ELON, WLON, NLAT, SLAT), the client merely converts this request into the following
boolean field WAIS query to submit to determine overlapping datasets:

ELON>WBNDGCOORD and WLON<EBNDGCOORD and NLAT>SBNDGCOORD and SLAT<NBNDGCOORD
This query is processed by the freeWAIS-sf server as a normal field query, just as quickly
as any other. While spurious results may be returned, this approach will work for the pilot
application.
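The rectangle-overlap test behind this boolean query can be sketched as follows (the field names come from the CSDGM; the comparison syntax in the generated query string is illustrative rather than exact freeWAIS-sf syntax):

```python
def mbr_overlaps(query, dataset):
    # query and dataset are (wlon, elon, slat, nlat) bounding rectangles.
    # Two rectangles overlap iff each one's east edge lies east of the
    # other's west edge, and each one's north edge lies north of the
    # other's south edge.
    q_w, q_e, q_s, q_n = query
    d_w, d_e, d_s, d_n = dataset
    return q_e > d_w and q_w < d_e and q_n > d_s and q_s < d_n

def mbr_query(wlon, elon, slat, nlat):
    # Build the boolean field query sent to the index server
    # (comparison syntax is an assumption for illustration)
    return ("wbndgcoord<%s and ebndgcoord>%s and "
            "sbndgcoord<%s and nbndgcoord>%s" % (elon, wlon, nlat, slat))
```

Because the four comparisons are ordinary numeric field searches, the server needs no spatial machinery at all; the price is the spurious matches noted above.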
To more intelligently handle the particular data being used, an intermediate WWW-WAIS
interface, called SFgate, is used. This perl package was written by the same people at Dortmund
University who did the work on freeWAIS-sf, with a few modifications by the author. It takes a
query as it would be entered by an HTML form (i.e.
http://...?ebndgcoord=-75&wbndgcoord=75&...), forms it into a proper boolean WAIS
expression, submits the query to the WAIS server, and formats the results. The author's
modification allows for the customization of the display of the metadata, so the plain text
record can be turned into an HTML document, with appropriate headings and hypertext links
(i.e. pointing to the actual location of the dataset).
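A sketch of this kind of metadata-to-HTML rendering follows (illustrative only; the actual SFgate code and the exact metadata field layout differ):

```python
import re

def metadata_to_html(text):
    # Turn "Field_Name: value" lines into an HTML definition list, and
    # wrap any URL value (such as an Online_Linkage field pointing at the
    # dataset itself) in a hyperlink.
    items = []
    for line in text.splitlines():
        field, _, value = line.partition(": ")
        if re.match(r"(ftp|http|gopher)://", value):
            value = '<a href="%s">%s</a>' % (value, value)
        items.append("<dt>%s</dt><dd>%s</dd>" % (field.replace("_", " "), value))
    return "<dl>\n%s\n</dl>" % "\n".join(items)
```

The key point is that because the metadata travels as plain structured text, the presentation layer can be changed without touching the index or the records.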
The nice thing about the WAIS engine (and the SFgate intermediate interface) is its
client/server Internet-based approach, which means that many client interfaces on various host
computers can be implemented, which all submit queries to a single database. In this experimental
project, two interfaces were developed which should allow geographic researchers to intuitively
perform spatial queries on the GeoWeb database.
The first interface is in the form of a gazetteer, where users specify a location by name,
and request data that covers that locale. When the user enters a placename (i.e. "columbus,
oh"), the server script looks it up in a database of populated places(5). This database returns
basic information about any matching places, including the location in latitude and longitude
(i.e. "40 23 45 N 80 43 20 W"). The script takes the returned text and formats it into an HTML list
of places, adding a link to the end of each one which, when selected, spatially searches the
database using the place's coordinates.
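The gazetteer flow, from the coordinate string returned by the name server to a spatial-search link, can be sketched like this (the URL scheme and host are hypothetical, and the coordinate format is assumed from the example above):

```python
def parse_dms(text):
    # Convert "40 23 45 N 80 43 20 W" (assumed name-server format)
    # into signed decimal degrees (lat, lon)
    p = text.split()
    lat = int(p[0]) + int(p[1]) / 60.0 + int(p[2]) / 3600.0
    if p[3] == "S":
        lat = -lat
    lon = int(p[4]) + int(p[5]) / 60.0 + int(p[6]) / 3600.0
    if p[7] == "W":
        lon = -lon
    return lat, lon

def place_query_url(text, margin=0.25):
    # Pad the point coordinate out to a small rectangle and build a
    # spatial-search link (the host and parameter names are illustrative)
    lat, lon = parse_dms(text)
    return ("http://geoweb.example/search?ebndgcoord=%.4f&wbndgcoord=%.4f"
            "&nbndgcoord=%.4f&sbndgcoord=%.4f"
            % (lon + margin, lon - margin, lat + margin, lat - margin))
```

Padding the point into a rectangle is necessary because the index stores bounding boxes; a bare point query would match only datasets whose box straddles that exact coordinate.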
The second interface implemented was a map-based approach, where users can use a map
of the United States in a WWW browser to specify the desired area. This map can be zoomed in
and out, and panned in any direction, until the user finds the region needed. This is done using a
link to the Xerox MapViewer which generates simple
GIF-format maps based on user-supplied criteria.
The mapbrowse script receives basic criteria from a query (i.e.
"http://...?lat=40&lon=-90&width=5") and generates an HTML page including the appropriate
MapViewer image, and graphical "buttons" for panning (i.e. the left button re-requests the same
mapbrowse script, but with lon=lon-width/2 to pan half a screen to the West) and zooming (i.e.
"zoom in" re-requests the script with width=width/2). A small form allows users to enter the
three pertinent criteria directly, and there is a link to the above gazetteer interface to center the
map on an actual place. Using a combination of the interactive graphics, direct entry, and
keyword lookup approaches, the user should be able to easily find the desired region.
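The pan and zoom arithmetic behind those buttons can be sketched as follows (the script URL is hypothetical):

```python
def pan_west(lat, lon, width):
    # "Pan west" re-requests mapbrowse with the center shifted half a
    # screen to the west (same idea, with signs flipped, for the other
    # three directions)
    return lat, lon - width / 2.0, width

def zoom_out(lat, lon, width):
    # Doubling the width shows more area (zoom out); halving it zooms in
    return lat, lon, width * 2.0

def mapbrowse_url(lat, lon, width):
    # Illustrative URL; the real script path on the server is assumed
    return "http://geoweb.example/mapbrowse?lat=%g&lon=%g&width=%g" % (lat, lon, width)
```

Because every button is just a link back to the same script with recomputed parameters, the interface is stateless, which suits the one-request-at-a-time nature of the Web.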
Once the proper region is displayed on the screen, the optimal way of entering a query for
the metadatabase would be for the user to draw the desired area on the map, as a rectangle or
polygon. Since this kind of complex graphical input is not yet implemented into HTML and the
World-Wide Web(6), a workaround was implemented. Four points are displayed on the map at all
times, forming a rectangle which covers about 1/3 of the map area (to give some context to the
enclosed area). A hyperlink in the HTML text below the map executes a spatial query to the
GeoWeb database, using the displayed rectangle as the query region. Again, this is crude, but it
works as a proof of concept for the map-oriented interface.
There was no rigorous testing of the effectiveness of the interface. However, several
researchers involved with similar projects were asked to look at the two interfaces and the
metadata and comment on their effectiveness. Except for a few cosmetic suggestions, the response
was very favorable; everyone felt the gazetteer and map were intuitive ways of looking for spatial
data. Suggestions for future enhancements included:
- Fields to allow users to specify non-spatial keywords for the query
- HTML Forms to allow data producers to add their own metadata entries
- A more tightly-structured metadata format (still compliant with CSDGM) to allow for
more intelligent processing and display of the metadata records.
- Ability to search other similar databases simultaneously
All of these suggestions are in harmony with the structure of the Clearinghouse described above.
However, the conceptual framework for the Clearinghouse, the physical format of the metadata,
and the design of the interface tools as presented here were all based on one person's ideas. Much
work needs to be done to build consensus on actual implementation issues such as these, so that it
can be built well and on time. Although many of the technical details need to be worked out, this
study shows that each of the pieces of the Clearinghouse can be built using accepted standards
into a very useful product for finding and accessing spatial data.
Notes

(1) The technical feasibility of having a single interface search 50 WAIS databases
simultaneously is questionable, and needs to be tested in the near future. If it is not possible,
a single monolithic metadata/index server would be the next best solution, and this is not
desirable.

(2) According to the Spatial Data Transfer Standard dictionary, a closed, ordered string of
coordinate pairs is called a "G-Ring." (Fegeas et al 1992)

(3) The spatial-query code is currently being worked into freeWAIS-sf for release in the fall of
1994. Also, a full spatial handling component is currently being developed for inclusion into
the Z39.50v3 protocol, the successor to WAIS (Nebert 1994).

(4) For example, the MBR of the state of California also includes most of Nevada and a
considerable area of ocean.

(5) The Geographic Names Server, maintained by Tom Libert at the University of Michigan, can
be found at telnet://martini.eecs.umich.edu:3000.

(6) It is currently being worked into HTML Level 3 as a "scribble" input type, although the
details of implementation have not yet been worked out.
References

89th U.S. Congress (1966). Freedom of Information Act. United States Code, Title V, Section
552 (most recently amended 1986).
gopher://eryx.syr.edu/00/Citizen%27s%20Guide/Appendix%204-Information%20Act

American Society for Testing Materials (1994). Content Standard for Digital Geospatial
Metadata. Section D18.01.05 Draft Specification, May 23, 1994.
ftp://waisqvarsa.er.usgs.gov/wais/docs/ASTMmeta83194.ps

Executive Office of the President (1994). "Coordinating Geographic Data Acquisition and
Access: The National Spatial Data Infrastructure." Executive Order 12906. Washington,
D.C.: Government Printing Office. Signed April 11, 1994.
ftp://fgdc.er.usgs.gov/gdc/html/execord.html

Federal Geographic Data Committee (1994a). The 1994 Plan for the National Spatial Data
Infrastructure. Reston, VA: FGDC. March 1994.
ftp://fgdc.er.usgs.gov/gdc/general/documents/nsdi.plan.1994.ps

Federal Geographic Data Committee (1994b). Content Standard for Digital Geospatial
Metadata. Reston, VA: FGDC. Final draft, June 8, 1994.
ftp://fgdc.er.usgs.gov/gdc/metadata/meta.6894.ps

Fegeas, Robin G., Cascio, Janette L., and Lazar, Robert A. (1992). "An Overview of FIPS 173,
The Spatial Data Transfer Standard." Cartography and Geographic Information Systems,
Vol. 19, No. 5 (Dec. 1992).
ftp://sdts.er.usgs.gov/pub/sdts/articles/ps/overview.ps

Mapping Sciences Committee, John D. Bossler, chair (1993). "Toward a Coordinated Spatial
Data Infrastructure for the Nation." Washington, DC: National Academy of Sciences Press.

Nebert, Douglas V. (1994). Personal electronic communication with the author.

Office of Management and Budget, Executive Office of the President (1990). Coordination of
Federal Surveying, Mapping, and Related Spatial Data Activities. Circular A-16 Revised.
Washington, D.C.: Government Printing Office.
ftp://fgdc.er.usgs.gov/gdc/general/documents/a-16.txt
Brandon Plewe is originally from St. George Utah,
but currently resides in Buffalo, New
York, where he is the Assistant Coordinator for Campus-Wide Information Services at the State
University of New York. He holds a B.S. in Math and Cartography from Brigham Young
University, is completing his M.A. in Geography at SUNY/Buffalo, and is working on a PhD in
the same subject.
Along with the UB Wings CWIS which he co-manages, he has been a frequent contributor
to the popularity and usability of the World Wide Web, developing, among other things, the first
Best of the Web award competition, and the popular
Virtual Tourist service. He continues to
search for ways to integrate mapping and geography into the Internet. He has been married to his
wife Jamie for 2½ years, and has one son, Spencer.
He can be contacted at plewe@acsu.buffalo.edu.