Jonathan A. Epstein, Management Systems Designers, Inc.
Jonathan A. Kans, National Center for Biotechnology Information
Gregory D. Schuler, National Center for Biotechnology Information
While Entrez was already available as both a CD-ROM and a true standalone client/server application ("NCSA Mosaic).
The availability of alternate implementations of Entrez provides us with a rare opportunity to investigate the strengths and weaknesses of the WWW. In particular, we can compare the ergonomics, performance and bandwidth requirements of WWW Entrez to those of the true client/server implementation. In addition, the relative ease of integrating the two network services into other molecular biology software is discussed. Finally, we present usage statistics for WWW Entrez, the Entrez CD-ROM, and Network Entrez; the degree to which these services are actually used may provide the ultimate measure of effectiveness.
The CD-ROM subscription, containing both data and software, provides an integrated view of the public DNA and protein databases, as well as a related subset of the MEDLINE medical literature database.
The volume of information that molecular biologists need to access to support their experimental work is growing at an explosive rate. The scientific literature adds some 6,000 peer-reviewed articles per month. Databases that contain the codes for gene sequences double in size every 20 months. Biologists need a retrieval tool that offers easy, accurate and complete access to this type of information, and this was the motivation for developing Entrez. Entrez is a search tool for integrated access to the biological literature and sequence data [3].
In addition to conventional indexed lookup techniques and the ability to link among the three databases, Entrez also provides the ability to link within a database by finding "related" sequences or literature abstracts. These neighboring relationships within the nucleotide database are pre-computed by comparing the entire nucleotide database against itself using the BLAST[1] sequence similarity algorithm. Proteins are also neighbored using BLAST, and the literature abstract database is neighbored against itself using a statistical text retrieval algorithm [6].
Because of these advantages, Network Entrez was written, and released in mid-1993. Network Entrez is a true client/server software package where the client and server transmit and receive data according to a well-defined Abstract Syntax Notation 1 (ASN.1 [5]) protocol, and only small amounts of data are sent across the network to facilitate smooth software operation. The Network Entrez client obtains a connection with a suitable Network Entrez server by first connecting to a central Dispatcher at a well-known address. The use of the client<->Dispatcher protocol, also an ASN.1 protocol, results in the Dispatcher instructing a suitable server to connect to the client. The desired client<->server connection is thereby established.
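The connection sequence described above can be illustrated with a small in-process simulation. This is only a sketch: the class names, the `request_service` method, and the "least loaded" selection rule are assumptions made for illustration, and no ASN.1 encoding or real networking is modelled.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Server:
    name: str
    load: int = 0  # number of clients this (hypothetical) server is handling

    def connect_to(self, client: "Client") -> None:
        # As in the text, the server opens the final connection to the client.
        self.load += 1
        client.server = self


@dataclass
class Client:
    name: str
    server: Optional[Server] = None  # set once a server has connected


@dataclass
class Dispatcher:
    servers: List[Server] = field(default_factory=list)

    def request_service(self, client: Client) -> Server:
        # "Suitable" is modelled as "least loaded"; this rule is an assumption.
        chosen = min(self.servers, key=lambda s: s.load)
        # The Dispatcher instructs the chosen server to connect to the client.
        chosen.connect_to(client)
        return chosen


dispatcher = Dispatcher([Server("server-a"), Server("server-b")])
first, second = Client("c1"), Client("c2")
dispatcher.request_service(first)
dispatcher.request_service(second)
```

The client only ever needs to know the Dispatcher's address; which server ends up handling the session is decided centrally.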
The communications protocol provides a desirable layer of abstraction between the client and server. In practice, there are two different Network Entrez servers which use different types of underlying databases, but understand the same ASN.1 protocol.
To reduce NCBI's software support requirements and to provide local support to users, a Network Entrez administrator at each site takes responsibility for local users. As of this writing (September 1994), there are over 800 registered Network Entrez sites in 32 countries.
The WWW Entrez server was constructed using a combination of Bourne shell scripts and a single C program. This C program, entrcmd, is a search engine which can perform powerful retrievals using a simple UNIX-command-line query language. For example, entrcmd can look up the entries associated with a Boolean expression, and then perform an arbitrary number of rounds of inter-database linking and intra-database neighboring using the MEDLINE, protein, and nucleotide databases. Also, given a starting term (e.g., "Jones JD") it can fetch a number of consecutive terms from a given field (e.g., 50 author names). In its role as the WWW server search engine, entrcmd only runs on a UNIX host; however, it is layered on top of the NCBI toolbox, which is portable across a wide variety of platforms. The source code for entrcmd appears within the NCBI toolbox, and can run on all the NCBI-supported platforms.
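The idea of starting from a set of entries and applying an arbitrary number of rounds of linking and neighboring can be sketched as follows. This is not entrcmd's query language, which is not reproduced here; the tables, identifiers, and function names are all invented for illustration.

```python
# Toy pre-computed tables; every identifier below is hypothetical.
NEIGHBORS = {
    ("medline", 101): [102],
    ("medline", 102): [101],
    ("protein", 201): [202],
}
LINKS = {
    ("medline", "protein", 101): [201],
    ("medline", "protein", 102): [201, 202],
    ("protein", "nucleotide", 201): [301],
    ("protein", "nucleotide", 202): [302],
}


def neighbor(db, uids):
    """Intra-database step: collect the neighbors of every entry in uids."""
    out = set()
    for uid in uids:
        out.update(NEIGHBORS.get((db, uid), []))
    return db, out


def link(db, uids, target):
    """Inter-database step: follow links from db into the target database."""
    out = set()
    for uid in uids:
        out.update(LINKS.get((db, target, uid), []))
    return target, out


def run(db, uids, rounds):
    """Apply each round in order; a round is ("neighbor",) or ("link", target)."""
    for op, *args in rounds:
        if op == "neighbor":
            db, uids = neighbor(db, uids)
        elif op == "link":
            db, uids = link(db, uids, args[0])
        else:
            raise ValueError(f"unknown round type: {op}")
    return db, uids


result = run("medline", {101},
             [("neighbor",), ("link", "protein"), ("link", "nucleotide")])
```

Because each round takes a set of identifiers and returns a set of identifiers, rounds compose freely, which is what allows a single command line to describe a multi-step retrieval.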
In Entrez, each toggle of a popup (e.g., database or field) and each entry of a datum results in an action by the Entrez application program. Furthermore, continuous scrolling of huge alphabetical lists is possible, since moderate scrolling can result in the fetching of only a small number of terms. Experience shows that this scrolling ("selection mode") is a valuable feature of Entrez, since users don't always know the exact query term or author name.
Once a Boolean query has been composed by the user, the user can fetch and view the resulting document summaries. Again, these document summaries appear as a scrolled list, and only a few entries need to be retrieved at a time. The user may then view an entire entry by double-clicking on a document summary of interest.
Links to other databases or neighboring within the current database may be performed by selecting the target database and marking one or more articles to be used as the basis for linking/neighboring. For example, by marking several MEDLINE articles a user can retrieve the related sequence data from the DNA database.
In addition to these constraints, clients with FORMS capability were rare in the fall of 1993, when the WWW Entrez server was written. Even at the time of this writing (September 1994), reliable FORMS-capable WWW browsers are not ubiquitous. Therefore, it is desirable to have servers which can take advantage of FORMS, but can still be useful to users who lack FORMS-capable browsers. WWW Entrez includes both FORMS-based and non-FORMS-based interfaces.
Given the constraints of non-FORMS HTML, the simplest approach was to decompose the Entrez query interface into many screens, each corresponding to a traditional hierarchical menu system. Thus, for example, to look up a MEDLINE article by author name, one first selects "MEDLINE" from a top-level menu, then selects "Author Name" from a menu, and finally makes a query using the browser's search window.
Given the power of FORMS, it is possible to express a complex Boolean query on a single form. The sample MEDLINE form which appears below has considerable expressive power, allowing the Boolean composition (union, intersection, and set subtraction) of several terms, where each term may use a separate indexed field.
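Independently of the form's layout, the Boolean composition it expresses can be sketched with ordinary set operations. The index contents, field names, and operator keywords below are assumptions chosen for illustration, and the left-to-right evaluation order is a simplification.

```python
# Hypothetical inverted index: (field, term) -> set of document identifiers.
INDEX = {
    ("author", "jones jd"): {1, 2, 3},
    ("mesh", "adenoviridae"): {2, 3, 4},
    ("journal", "cell"): {3, 5},
}


def lookup(field, term):
    """Return the documents indexed under term in the given field."""
    return set(INDEX.get((field, term.lower()), set()))


def evaluate(first, rest):
    """Combine terms left to right.

    first is a (field, term) pair; rest is a list of (op, field, term)
    where op is "or" (union), "and" (intersection), or "butnot"
    (set subtraction).
    """
    result = lookup(*first)
    for op, field, term in rest:
        docs = lookup(field, term)
        if op == "or":
            result |= docs
        elif op == "and":
            result &= docs
        elif op == "butnot":
            result -= docs
        else:
            raise ValueError(f"unknown operator: {op}")
    return result


hits = evaluate(("author", "Jones JD"),
                [("and", "mesh", "adenoviridae"),
                 ("butnot", "journal", "cell")])
```

Each term carries its own field, mirroring the form's ability to mix indexed fields within one query.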
Because of this unified approach, some users have stated that they prefer this query interface to that provided by Entrez/Network Entrez.
As an option on the single FORM, a user may choose to "browse" from 50 of the available terms rather than typing an exact term. For example, if a user doesn't remember how to spell "adenoviridae", typing "adeno" and using the "browse" option is helpful. This is also especially useful when searching by author name. Note that this is a partial emulation of Entrez's "selection mode", described earlier.
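A browse of this kind amounts to finding the position where the typed text would fall in a sorted term list and returning the terms that follow. The sketch below uses a binary search; the term list and the `browse` function are hypothetical and stand in for whatever index the server actually consults.

```python
import bisect


def browse(sorted_terms, typed, count=50):
    """Return up to count consecutive terms starting at the first term
    that sorts at or after the typed text."""
    start = bisect.bisect_left(sorted_terms, typed.lower())
    return sorted_terms[start:start + count]


# A tiny, invented term list; a real index would hold many thousands of terms.
terms = sorted([
    "adenine", "adenocarcinoma", "adenoma",
    "adenosine", "adenoviridae", "adrenal",
])

window = browse(terms, "adeno")
```

The user never has to type an exact term: any prefix lands the window near the right spot, and the desired entry can then be picked from the returned list.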
Hypertext is also a natural way to represent the hierarchical organism taxonomy which was added to the Entrez CD-ROM in May 1994, and has been found to be particularly useful to molecular phylogenists and other systematists [4]. This feature was subsequently added to WWW Entrez.
The visual clarity of hypertext permits the coherent presentation of rich text, compensating for other disadvantages of WWW and HTML. For example, the number of neighboring MEDLINE, protein, and nucleotide articles is presented to the user in WWW Entrez, smoothing his/her access to the available data. If these counts were computed and displayed in Entrez/Network Entrez, the screen would be cluttered, and, because of the extra time required to perform the computations, Entrez's otherwise quick performance would be compromised.
The power of the entrcmd engine allows each Entrez query to be stateless, even though a user may perform several rounds of neighboring and linking between and within the MEDLINE, protein, and nucleotide databases. This statelessness requires the recomputation of some results which could be stored, but the simplicity and power which statelessness provides is considerable. Each WWW Entrez URL is an encoding describing an originating Boolean expression or set of unique identifiers, along with the "rounds" of neighboring and linking which have been performed.
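One way to picture such a stateless encoding is a query string that carries the originating expression together with the ordered list of rounds, so that any request can be recomputed from the URL alone. The parameter names (`db`, `expr`, `rounds`) and the separator characters below are assumptions; the actual WWW Entrez URL format is not reproduced here.

```python
from urllib.parse import urlencode, parse_qs


def encode_state(db, expr, rounds):
    """Pack the starting database, Boolean expression, and rounds
    into a query string. A round is ("neighbor",) or ("link", target)."""
    rounds_str = ",".join(":".join(r) for r in rounds)
    return urlencode({"db": db, "expr": expr, "rounds": rounds_str})


def decode_state(query):
    """Recover (db, expr, rounds) from a query string made by encode_state."""
    fields = parse_qs(query, keep_blank_values=True)
    rounds_str = fields["rounds"][0]
    rounds = [tuple(r.split(":")) for r in rounds_str.split(",")] if rounds_str else []
    return fields["db"][0], fields["expr"][0], rounds


def extend(query, new_round):
    """Build the query string for a follow-up hyperlink: same origin,
    one more round appended. No server-side session is consulted."""
    db, expr, rounds = decode_state(query)
    return encode_state(db, expr, rounds + [new_round])


q0 = encode_state("medline", "jones jd [author]", [])
q1 = extend(q0, ("link", "protein"))
q2 = extend(q1, ("neighbor",))
```

The cost noted in the text is visible here: serving `q2` means replaying both rounds from the original expression, since nothing from the earlier requests is retained.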
Comparing usage of the three types of Entrez is complex because there is no way to know how many users are actually using the Entrez CD-ROM and its derivative hard-disk copies. Furthermore, there is no exact analogy between Network Entrez sessions and WWW Entrez URLs. However, a more detailed study suggests that an average Network Entrez session corresponds to roughly eight WWW Entrez URLs.
In recent months, the plateau in Entrez CD-ROM subscriptions has been triggered by the wide availability of the two Internet-based versions. In some ways, this is not a bad thing, especially because the Entrez CD-ROM subscription expanded to two CDs in October 1993, and must expand to three CDs in October 1994, due to the dramatic growth in molecular sequence data.
WWW Entrez and Network Entrez complement, rather than compete with, one another. WWW Entrez is useful for those who prefer a single software tool and who can accept the slower performance. Network Entrez is critical for high performance, smoother ergonomics, and custom-written applications which need pre-parsed data. Finally, for some users either Network Entrez or WWW Entrez fails to function for some technical reason. In these cases, the NCBI staff has usually been able to refer the user to the alternate Internet-based solution.
The WWW Entrez server is able to profit from the interconnection capabilities of the World Wide Web. First of all, Entrez's own data is highly interconnected. Secondly, for some protein data where external information is available on the Web, WWW Entrez points to the external Expasy, , and "Molecules R Us" servers. Finally, several servers point to the WWW Entrez server to obtain data, notably Expasy and the Baylor College of Medicine's sequence annotation server. The latter is especially interesting because it provides a way to annotate sequence database entries without modifying the original entry, and while pointing to the canonical Entrez sequence entry.
In the future, the Entrez WWW server will include an interface to a powerful molecular structure visualization tool (RasMol), and will contain daily-updated molecular sequence databases.
Jonathan Kans is a Research Associate at the National Center for Biotechnology Information, a division of the National Library of Medicine at NIH. He received his A.B. in biological sciences, S.M. in immunology, and Ph.D. in genetics from the University of Chicago, and was a postdoctoral fellow at the University of California at Berkeley. His past research has centered around recombination and gene rearrangements, and, as with most of the NCBI staff, he has a strong interest in applying computers to problems of biological importance. He is the developer of the VIBRANT portable user interface, is one of the principal authors of the Entrez data retrieval application, and is the developer of the Sequin direct submission program.
Gregory Schuler received his B.S. in Biochemistry & Microbiology from the University of Maryland and his Ph.D. in Molecular Biology from Princeton University. He has performed several years of laboratory research on the role of oncogene and growth factor expression in the processes of cell differentiation and transformation. For the past several years at the NCBI, he has worked on various aspects of biological sequence analysis, such as querying and searching sequence databases and performing multiple alignments on groups of related sequences.