The EDGAR Project: A Case Study in Disseminating Financial Data on the Internet

This project is supported by NSF Grant No. 9319331, Internet Access to Large Government Data Archives, and a grant from RR Donnelley and Sons.

Mark Ginsburg, Doctoral Student, NYU Stern School, Information Systems, mark@edgar.stern.nyu.edu.
Ajit Kambil, Assistant Professor, NYU Stern School, Information Systems, akambil@stern.nyu.edu.
Alan B. Eisner, Doctoral Student, NYU Stern School, Management, aeisner@stern.nyu.edu.

Abstract

In this case study of a Government application of the Internet, we describe the project evolution, current directions, and research emphasis of the EDGAR (Electronic Data Gathering, Analysis, and Retrieval) project. EDGAR is a large, heterogeneous financial data archive that has been available to Internet users since January 1994. It is composed of all forms filed electronically with the Securities and Exchange Commission (SEC) by domestic publicly traded corporations.
We present empirical analysis of access patterns to support our research in two key areas, Information Retrieval and Problem Categorization, and mention possible directions for future research.

Introduction

The EDGAR database is housed at the Internet Multicasting Service (IMS), located in Washington, D.C. (with the invaluable technical and network expertise of Carl Malamud, Brad Burdick, and the rest of the IMS staff). Since January 1994, the IMS has disseminated the corporate filings on the Internet via anonymous FTP, and since March 1994, both the IMS and New York University (NYU) have provided WWW access to the filings. The IMS also offers Gopher and e-mail service at town.hall.org. The NYU team primarily develops front-end Web applications to customize filings access; however, as we shall see, there is substantial interest in composite interdocument, intracompany profiles and interdocument, intercompany industry analyses. The project ends on December 31, 1995, and whether dissemination will continue after that date is a source of intense industry speculation; the IMS has indicated it will definitely cease its involvement on this date.

The EDGAR Archive

As of this writing, 3,196 publicly traded companies file electronically with the SEC. The Mead Data Corporation, which has exclusive dissemination rights under a contract that runs through 1997, has contracted with the NSF Project team to provide tapes of the filing data on a one-day delayed basis; the IMS then mounts the data on fast disk. Two files are written for each submission: the filing proper, and the header tags and their contents. In the section on Current Work we will discuss one use for the header tags. Most major corporations already file electronically, and the phase-in schedules are publicly available in the Federal Register. It should be noted that a corporation may control many filing entities: consider a brokerage firm which has a controlling interest in many mutual funds. Each filing entity is accorded its own CIK (filing) code, and the IMS data archive is broken into subdirectories according to these codes.

A Word on Size and Structure

The IMS data store is now nearing 6 GB, and the final storage requirement is expected to be 20 GB or greater. Thus, the EDGAR database is orders of magnitude larger than experimental hypermedia databases such as the work of Salton et al. with the Funk and Wagnalls encyclopedia {Salton}; the structure of the elements is quite different too. Whereas an encyclopedia's universe of articles has a relatively low size variance, EDGAR filings vary from small encoded filings of perhaps 800 bytes to a large corporation's annual financial statement (10-K), which typically ranges from 300 KB to 900 KB.

Furthermore, although it is difficult to group encyclopedia articles into iron-clad families prior to a user query, the EDGAR filing types are more conducive to such prior grouping. It is relatively safe, for example, to group all forms with financial ratios (10-K's, 10-Q's, etc.) into a "Financial" family when attempting to guide the user towards task completion. The semantic content of the various sections of a 10-K, for example, is well defined by various acts and regulations, and a description is available in hardcopy {Bowne}. We shall return to this in the section on user requirements.

This combination of heterogeneity and a modicum of contextual uniformity guaranteed by legal and compliance issues poses interesting challenges, particularly since the web offers a variety of intriguing tools for simplifying the user's task: Wide Area Information Search (WAIS), intradocument tagging, intelligent agents, context-sensitive help, etc.

Filing Form Types

Investors, librarians, public advocacy groups, and many others are interested in various SEC filings such as the 10-K (annual report), 8-K (Change in Material Status), 10-Q (Quarterly Report), DEF 14A (Proxy), 485APOS and 485BPOS (mutual fund prospectus), and so on. There are well over a hundred SEC form types, and they provide invaluable depth of information that "fills in the gaps" behind such mundane events as a newswire report of a company's earnings announcement. For example, the "Management Discussion and Analysis" section in the 10-K is scrutinized by investors as a key source of management rationalization of prior performance and future trends. A dictionary of form types and descriptions is online.

What Isn't in the EDGAR Archive

Not present in the archive are SEC Forms 3, 4, and 5 (officer, or "inside", purchases and sales), which are closely watched in the SEC reading rooms; these are expected to be submitted via EDGAR sometime in 1996. Also missing are photographic exhibits, which one can find in commercial products such as Disclosure's CD-ROM. There are no current plans for the SEC to upgrade its EDGAR software to accommodate electronic filings of non-ASCII files.

Thus, for the foreseeable future, we are left with ASCII text and tables. The good news, of course, is that a text-intensive data archive conserves bandwidth. The bad news is that the current incarnation of HTML does not support columnar tables; we are looking forward to this enhancement.

Project Goals

The general goal of our EDGAR development work is as follows:

To enable wide dissemination and support all levels of user access to the corporate electronic filings submitted to the Securities and Exchange Commission (SEC).

From the academic perspective, other major goals are:

To identify and understand the requirements for broad public access,
To identify and implement applications which operate on the large document database and synthesize reports based on information across multiple filings, and,
To understand patterns of access to the EDGAR database with an eye to generalizing knowledge thus acquired. Indeed, the EDGAR project is a flagship government database dissemination project and many other ambitious projects are also being launched (visit http://www.town.hall.org/ to explore the U.S. Patent Database and other interesting data sources).

We shall review our progress to date in these areas, discuss our current work, and indicate some of the most important future projects we have planned.

Empirical Analysis of EDGAR User Access Patterns

Access Methods

Supported access methods include e-mail, Gopher, FTP, and WWW browsers (e.g. Mosaic, Cello, and Lynx).

Figure 1 shows summary statistics at all levels of access in 1994. The Web servers at the IMS and NYU sites became production services in March 1994; only FTP was available in January and February.

The NYU server, which provides custom form lookup tools, forms help, and utilities such as company to ticker symbol lookup, has been quite active with 482 files transmitted daily, on average. The IMS server has, on a daily basis, transmitted 1,455 files via FTP (195 MB) and 178 files via the Web (377K). E-mail and gopher statistics are not yet available. Naturally, as more users gain Web access, the forms support and other search tools (e.g., WAIS) will cause a steady migration away from ftp, e-mail, and gopher.

Figure 2 and Figure 3 show the total transfer by client domain from the NYU and IMS web servers. Of interest is the substantial proportion of foreign usage (10.12%, NYU; 12.47%, IMS). Domestic commercial and education usage is fairly balanced for NYU (37.5% commercial; 39.35% educational) but commercial interests are the major user for IMS (38.40% versus 32.43%, educational).

Figure 4 and Figure 5 show, similarly, the total number of information requests by client domain from the NYU and IMS web servers.

Publicity generated via conventional media (newspaper and magazine articles), announcements on USENET newsgroups such as misc.invest and misc.invest.funds, and increased awareness of the Web in general have caused a steady increase in EDGAR Web access at both the NYU and IMS sites, as can be seen in Figures 2 through 5.

Strategies for Understanding User Requirements

There is no simple way to predict what a user will need a particular filing for in any given session. We have learned (via usage questionnaires which are online at both the IMS and NYU sites) that, in aggregate, the forms most often needed are the 10-K's, the Proxies (DEF 14A's), the Acquisition Filings (Schedules 13D and 13G), and Mutual Fund Prospectuses.

However, the problems faced by many users are definitely of the "ill-structured" variety {Simon}, {Newell}. Consider a hypothetical example: a user would like to know why XYZ Corp. laid off 5,000 employees last quarter. It is far from obvious which form types might contain clues, even after perusing an online or paper dictionary. One might start with 8-K's (Material changes to financial condition), but there are no ready-made hyperlinks from an 8-K to the larger 10-K's or Proxies. This problem will be addressed in the following section on inter-document linking.

A further distinction can be made between the expert user (e.g. a corporate law librarian) and the novice (e.g. an inexperienced private investor). Chi et al. suggest that the frame of reference is critical in physics problem solving and speeds the expert along {Chi}; similarly, in the EDGAR domain, experts have a built-in frame of reference that links keywords to form types. They further know what information is likely to be unlocatable in the EDGAR data archive, whereas a novice might spend many fruitless hours searching. A good example is the officer purchases and sales: they are critical to investment decisions and are reported in investor newsletters and journals (e.g. Barron's), but are not yet present on EDGAR.

Another problem stems from many users' Internet providers. Often, a filing is quite large (several hundred KB for 10-K's and Proxies) and the provider does not allow e-mail of that size, insisting that messages be chopped up (and then, according to Murphy's Law, the pieces never arrive in proper order!). Similarly, a provider might charge by the KB transferred, and thus EDGAR use might become quite pricey for some users. We discuss these bread-and-butter concerns in the next section.

Economic Considerations of the EDGAR Service

Considerable technical work has been published charting backbone congestion, backbone upgrade, and the inevitable return of congestion {Claffy92}, {Claffy94}. EDGAR is a text-based archive, as we have noted, devoid of audio, video, or photographic images. However, many of the interesting filings, such as the 10-K and Proxy filings, are quite large.

Keeping in mind our goal of low-cost information dissemination, we would like to provide enough information to satisfy the user's needs during an EDGAR session without necessarily providing entire documents. At present, many providers charge by the KB transferred and there are strong arguments for a generalized "user pays" policy to be applied to the Internet at large {MacKie93}, {MacKie94}.

As MacKie-Mason and Varian say, the Internet community at large faces the classic economic "problem of the commons" where users, given unlimited Internet access, pay no penalty for high-bandwidth usage {MacKie94}.

Therefore, how can the EDGAR service position itself as a 'good Net citizen', conserving bandwidth, while not limiting functionality? There are several approaches either in development or under consideration:

Automatic intradocument table of contents generation. We have developed shell scripts to parse the more popular filings and prefix them with a hyperlink table of contents designed for easy perusal. For example, the key financial ratios are indexed at the very top of the 10-Ks.
User choice at FTP request time. At the moment the user receives an answer to his or her query from a Web form, subdocuments are presented as an alternative to downloading the entire filing. For example, the user may opt to download only the Executive Compensation section of the Proxy. Of course, we do not want to use unnecessary disk space at the IMS site, and thus we are testing real-time extraction of subdocuments and leaning away from batch jobs which would partition the filings ahead of time.
'Intelligent Agent' help. If the user requests this service, the agent will present a picklist of typical user queries and the subsections of one or more filings that would be useful in each case. For this work, it is critical that we sit with EDGAR users and do a complete task analysis for several major user groups.
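
The first two approaches above can be illustrated with a minimal sketch. It is written in Python rather than the shell scripts mentioned above, and it assumes that section headings start a line in the form "Item N. Title"; that heading convention, the function names, and the sample text are all illustrative assumptions, not a description of the production scripts. Turning each table of contents entry into a hyperlink would be a further step on top of this.

```python
import re

# Assumption: a section heading starts a line and looks like "Item 7. Title".
HEADING = re.compile(r"^\s*(ITEM\s+\d+[A-Z]?)\.\s*(.*)$", re.IGNORECASE)


def normalize(label):
    """Upper-case a label and collapse runs of whitespace."""
    return " ".join(label.upper().split())


def find_sections(text):
    """Return (line_index, label, title) for every heading found in text."""
    sections = []
    for i, line in enumerate(text.splitlines()):
        match = HEADING.match(line)
        if match:
            sections.append((i, normalize(match.group(1)), match.group(2).strip()))
    return sections


def table_of_contents(text):
    """Build one plain-text table of contents entry per section."""
    return ["%s. %s" % (label, title) for _, label, title in find_sections(text)]


def extract_section(text, label):
    """Return only the requested section, or None if it is not present."""
    lines = text.splitlines()
    sections = find_sections(text)
    wanted = normalize(label)
    for n, (start, found, _) in enumerate(sections):
        if found == wanted:
            # The section runs until the next heading, or to the end of the filing.
            end = sections[n + 1][0] if n + 1 < len(sections) else len(lines)
            return "\n".join(lines[start:end])
    return None


# Hypothetical filing fragment, for illustration only.
SAMPLE = """Item 1. Business
XYZ Corp. makes widgets.
Item 7. Management Discussion and Analysis
Sales rose during the year.
Item 8. Financial Statements
See attached tables.
"""

if __name__ == "__main__":
    print("\n".join(table_of_contents(SAMPLE)))
    print(extract_section(SAMPLE, "Item 7"))
```

Because the extraction is computed on request from the stored filing, no partitioned copies need to be kept on disk, which matches the preference for real-time extraction over batch partitioning.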

Customizing the Front-End

Using Mosaic Common Gateway Interface (CGI) tools, it is a simple matter to provide an attractive front-end for point-and-click forms retrieval. For example, our Mutual Fund search offers a pre-written list of publicly traded funds and flags those not yet on EDGAR. We also have a Prospectus search which corresponds to the "485" form series.

We also provide a Schedule 13D application, where a user enters a company (e.g. Gabelli) and the result is all companies in which Gabelli has acquired a 5% or greater ownership position. We utilized an internal database to reverse-engineer the CIK codes of the target companies in order to provide their names to the end-user (in the case of Gabelli, they have acquired positions in Tredegar Industries, Santa Anita Operations, United Television, etc.). This application has proven popular with mutual fund aficionados, based on comments we have received from misc.invest.funds newsgroup readers. Also popular is Current Events Analysis, where users can view recent filings.

These applications are fairly straightforward: the user inputs a few variables, then we extract records from a master index file and provide a clickable answer. The last step, which takes place when the user clicks on the highlighted filing, provides the native FTP service to town.hall.org in Washington, D.C. Since the Web is a stateless connection {Berners-Lee}, the NYU server is freed in the final phase of filing retrieval.
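
The lookup cycle just described (a few user inputs, a scan of the master index, a clickable answer) can be sketched as follows. This is only an illustration under stated assumptions: the pipe-delimited record layout, the field names, the function names, and the file paths are hypothetical, since the actual index format is not described here. Only the host name town.hall.org comes from the text above.

```python
from html import escape


def parse_index(lines):
    """Parse index lines of the assumed form 'company|form|date|path'."""
    records = []
    for line in lines:
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 4:
            continue  # skip lines that do not fit the assumed layout
        company, form_type, date, path = parts
        records.append(
            {"company": company, "form": form_type, "date": date, "path": path}
        )
    return records


def search(records, company_substring, form_type=None):
    """Select records by case-insensitive company match and optional form type."""
    needle = company_substring.lower()
    return [
        r
        for r in records
        if needle in r["company"].lower()
        and (form_type is None or r["form"] == form_type)
    ]


def render_links(matches, host="town.hall.org"):
    """Render the matches as an HTML list of FTP hyperlinks."""
    items = []
    for r in matches:
        url = "ftp://%s/%s" % (host, r["path"].lstrip("/"))
        label = "%s %s (%s)" % (r["company"], r["form"], r["date"])
        items.append(
            '<li><a href="%s">%s</a></li>' % (escape(url, quote=True), escape(label))
        )
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


# Hypothetical index lines, for illustration only.
SAMPLE_INDEX = [
    "XYZ Corp|10-K|1994-03-31|edgar/data/1234/0001.txt",
    "XYZ Corp|8-K|1994-05-02|edgar/data/1234/0002.txt",
    "Acme Funds|485BPOS|1994-04-15|edgar/data/5678/0001.txt",
]

if __name__ == "__main__":
    hits = search(parse_index(SAMPLE_INDEX), "xyz", form_type="10-K")
    print(render_links(hits))
```

Because each link points directly at the FTP host, the server that produced the answer page takes no part in the transfer itself, which is the division of labor described above.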

The Web is conducive to the cycle of resource discovery, mutual server linking, and consequently increased traffic at each linked site. EDGAR sits at the intersection of law, finance, and economics and thus enjoys much visibility in the academic server community. For example, EDGAR provides reciprocal links to the Carnegie Mellon financial server, FINWeb, a financial economics WWW Server, the Indiana University School of Law, and the University of Michigan Economics Server as well as a host of commercial enterprises.

Current Work

Indexing schemes are a major focus of current work. The IMS site is experimenting with the commercial WAIS engine and plans to fully index all of the SGML header tags of each filing. This will provide the capability to perform boolean queries on data items such as the company address, CIK code, filing as-of date, etc.
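
To make the idea of boolean queries over header fields concrete, here is a minimal sketch of an inverted index with AND and OR lookups. It is not the commercial WAIS engine and makes no claim about how that engine works; the "TAG: value" header layout, the tag names, and the function names are assumptions chosen for illustration.

```python
from collections import defaultdict


def parse_header(text):
    """Parse lines of the assumed form 'TAG: value' into a dictionary."""
    fields = {}
    for line in text.splitlines():
        if ":" in line:
            tag, value = line.split(":", 1)
            fields[tag.strip().upper()] = value.strip()
    return fields


def build_index(headers):
    """Map each (tag, value) pair to the set of filing ids that carry it."""
    index = defaultdict(set)
    for filing_id, text in headers.items():
        for tag, value in parse_header(text).items():
            index[(tag, value.upper())].add(filing_id)
    return index


def lookup(index, tag, value):
    """Return a copy of the filing ids whose header has tag == value."""
    return set(index.get((tag.upper(), value.upper()), set()))


def query_and(index, *terms):
    """Filings that satisfy every (tag, value) term."""
    sets = [lookup(index, tag, value) for tag, value in terms]
    return set.intersection(*sets) if sets else set()


def query_or(index, *terms):
    """Filings that satisfy at least one (tag, value) term."""
    result = set()
    for tag, value in terms:
        result |= lookup(index, tag, value)
    return result


# Hypothetical headers keyed by a made-up filing id.
SAMPLE_HEADERS = {
    "f1": "COMPANY NAME: XYZ Corp\nFORM TYPE: 10-K\nSTATE: NY",
    "f2": "COMPANY NAME: Acme Inc\nFORM TYPE: 10-K\nSTATE: CA",
    "f3": "COMPANY NAME: XYZ Corp\nFORM TYPE: 8-K\nSTATE: NY",
}

if __name__ == "__main__":
    idx = build_index(SAMPLE_HEADERS)
    print(sorted(query_and(idx, ("FORM TYPE", "10-K"), ("STATE", "NY"))))
    print(sorted(query_or(idx, ("FORM TYPE", "8-K"), ("STATE", "CA"))))
```

Indexing only the header fields keeps the index small relative to the filings themselves, since the headers are a small fraction of each submission.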

However, since WAIS indices typically have approximately a 1:1 space requirement correspondence with the text they are indexing, the search is on to help the user in a more efficient manner. The NYU team, for example, is building a Standard Industrial Classification (SIC) database to permit intrasector analyses (e.g. extracting the key financial ratios for every aerospace firm). There is also work to identify the controlling interests behind each mutual fund in order to correlate parent firm performance with that of the fund.
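
An intrasector analysis of the kind described reduces to grouping companies by industry code and summarizing a ratio within each group. The following is a minimal sketch under stated assumptions: the record layout, the function names, the company names, and every ratio value are invented for illustration, and the industry codes are used only as example grouping keys.

```python
from collections import defaultdict
from statistics import mean


def group_by_sic(companies):
    """Group company records by their industry code."""
    groups = defaultdict(list)
    for company in companies:
        groups[company["sic"]].append(company)
    return dict(groups)


def sector_average(companies, sic, ratio_name):
    """Average one ratio over a sector; None if no company reports it."""
    values = [
        c["ratios"][ratio_name]
        for c in companies
        if c["sic"] == sic and ratio_name in c["ratios"]
    ]
    return mean(values) if values else None


# Invented records: names, codes, and ratio values are illustrative only.
SAMPLE_COMPANIES = [
    {"name": "Aero One", "sic": "3721", "ratios": {"current_ratio": 1.5}},
    {"name": "Aero Two", "sic": "3721", "ratios": {"current_ratio": 2.5}},
    {"name": "Retail Co", "sic": "5311", "ratios": {"current_ratio": 1.0}},
]

if __name__ == "__main__":
    for sic, members in sorted(group_by_sic(SAMPLE_COMPANIES).items()):
        print(sic, [m["name"] for m in members])
    print(sector_average(SAMPLE_COMPANIES, "3721", "current_ratio"))
```

A lookup table of this kind is far smaller than a full-text index, which is the efficiency argument made above.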

We are also tabulating the many responses we have received from the online usage questionnaire. Many of the problems users face, as has been noted, stem from their access providers; others encounter difficulties with idiosyncratic Web clients. Whenever possible, we attempt to modify the server's behavior suitably in such events (for example, when a client passes an unexpected token in a form response).

Future Directions

Here are some interesting issues that are being worked on in both the private and public sector.

Officer Migration Patterns. As archiving of the EDGAR data store continues, one can start to ask interesting questions that would rely on "historical" EDGAR data (recall that we have no data before January 1, 1994). Of particular interest to Management scholars is the 'flow' of officers from one publicly traded company to another. Such a database might be used to correlate resignations and hirings with firm performance.
Artificial Intelligence Applications. There are some very hard questions that users would like an EDGAR front-end to answer. For example, how do we answer the hypothetical question 'How many boards of directors does John Q. Public serve on?' Since we do not have social security numbers or other unique identifiers in the body of forms, a programmatic approach needs to be quite clever in attempting to match names.
Modification of EDGAR User Software. Customization of user software for EDGAR use is clearly desirable. For example, it could do 'smart parsing' following a local cache of the document (as opposed to having the server do it, which might cause serious delays). Typical tasks could be optimized locally, especially when the user needs an unusual combination of subdocuments for one or more companies. Customization can take place at the level of the Web browser or even at the level of FTP transfer from the IMS - there exist automated routines to do 'unmanned' FTP transfers.
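
The name-matching difficulty raised under Artificial Intelligence Applications can be made concrete with a small heuristic sketch. It normalizes names and treats two names as the same person when first and last names agree and the middle initials do not conflict. This is one simple heuristic chosen for illustration, not a solution to the problem: it will merge distinct people who share a name and miss nicknames or spelling variants. The function names, company names, and director lists are all hypothetical.

```python
import re

# Generational suffixes dropped before comparison (an illustrative list).
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}


def name_key(name):
    """Reduce a personal name to (first, middle_initial, last), or None."""
    name = name.strip()
    if "," in name:
        # Handle the inverted form "Public, John Q." by moving the surname last.
        last, rest = name.split(",", 1)
        name = rest + " " + last
    cleaned = re.sub(r"[^a-z\s]", " ", name.lower())
    tokens = [t for t in cleaned.split() if t not in SUFFIXES]
    if len(tokens) < 2:
        return None
    middle = tokens[1][0] if len(tokens) > 2 else ""
    return (tokens[0], middle, tokens[-1])


def same_person(a, b):
    """Heuristic match: same first and last name, non-conflicting middle initial."""
    key_a, key_b = name_key(a), name_key(b)
    if key_a is None or key_b is None:
        return False
    first_ok = key_a[0] == key_b[0]
    last_ok = key_a[2] == key_b[2]
    middle_ok = key_a[1] == key_b[1] or not key_a[1] or not key_b[1]
    return first_ok and last_ok and middle_ok


def count_boards(person, boards):
    """Count the companies whose director list contains a matching name."""
    return sum(
        1
        for directors in boards.values()
        if any(same_person(person, d) for d in directors)
    )


# Hypothetical director lists, for illustration only.
SAMPLE_BOARDS = {
    "Company A": ["John Q. Public", "Jane Roe"],
    "Company B": ["PUBLIC, JOHN Q.", "R. Smith"],
    "Company C": ["John A. Public"],
    "Company D": ["John Public, Jr."],
}

if __name__ == "__main__":
    print(count_boards("John Q. Public", SAMPLE_BOARDS))
```

In this sample, the entry with a different middle initial is rejected while the entry with no middle initial is accepted, which shows both the usefulness and the risk of the heuristic: without a unique identifier, the last match remains a guess.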

Concluding Comments

As the EDGAR Data Archive grows in size and complexity, so do the challenges inherent in serving a diverse community of Internet users. We must monitor emerging standards, protocols, platforms, and clients and continue to work with the various user communities to adapt the service to their needs.

References

Berners-Lee, T. and Cailliau, R. and Luotonen, A. and Nielsen, H. and A. Secret, The World-Wide Web, Communications of the ACM, 1994, 37, 76-82, August.

Bowne and Co., Appeal Securities Act Handbook, Bowne and Co., 1993.

Chi, M. T. and Feltovich, P. J. and R. Glaser, Categorization and representation of physics problems by experts and novices, Cognitive Science, 1979, 5, 121-132.

Claffy, K. C. and Polyzos, G.C. and H.-W. Braun, Traffic characteristics of the T1 NSFNET backbone, UCSD, 1992, CS92-252.

Claffy, K. C. and Polyzos, G.C. and H.-W. Braun, Tracking long-term growth of the NSFNET, Communications of the ACM, 1994, 23, 35-45.

MacKie-Mason, J. and H. Varian, Some economics of the Internet, University of Michigan, 1993, Ann Arbor, Michigan.

MacKie-Mason, J. and H. Varian, Pricing the Internet, B. Kahin and J. Keller (eds.), Prentice-Hall, 1994.

Newell, A. and H. Simon, Human Problem Solving, Prentice-Hall, 1972.

RR Donnelley and Sons, The EDGAR Handbook, Chicago, IL, 1994.

Salton, G. and Allan, J. and C. Buckley, Automatic structuring and retrieval of large text files, Communications of the ACM, 1994, 37, 97-108, February.

Securities and Exchange Commission, A User's Guide to the Facilities of the Public Reference Room, Washington, D.C., SEC Commission Office of Filings, Information, and Consumer Services, 1991.

Securities and Exchange Commission, The SEC Edgar Filer Manual Version 3.5, Washington, D.C., SEC, 1994.

Simon, H. and Newell, A. and J.C. Shaw, The processes of creative thinking, H.E. Gruber, G. Terrell, and M. Wertheimer (eds.), 63-119, Lieber-Atherton, Inc., 1962.

About the Authors

Mark Ginsburg

Mark Ginsburg is a doctoral student in the Information Systems Department, Stern School of Business, New York University. He has a B.A. from Princeton University, an M.A. from Columbia University, and was a Stern Scholar in the Statistics and Operations Research Department en route to earning an M.B.A. at NYU. He is responsible for the daily operation of NYU's EDGAR web server and is interested in the following Internet issues: evolution of standards, collaborative software, and the economics of interoperability (or lack thereof).

Ajit Kambil

Professor Kambil is an Assistant Professor of Information Systems at the Stern School of Business, New York University. He earned his undergraduate and PhD degrees at MIT. His research is centered on three inter-related areas: Information technology and the transformation of business strategy, organizations and networks; Aligning Information Technology and Business Strategies, and Communications Networks - design, use and policies.

At NYU Prof. Kambil teaches courses that introduce MBAs to the management of information systems and undergraduate students to telecommunications systems.

Alan B. Eisner

Alan B. Eisner is a doctoral student in the Management Department, Stern School of Business, New York University. He earned a B.S. in Operations Research and Industrial Engineering in 1989, and an M.Eng. in Engineering Management in 1992, both from Cornell University. His primary research interests are technology strategies and organizational learning.