The EDGAR Project: A Case Study in Disseminating
Financial Data on the Internet
This project is supported by NSF Grant No. 9319331, Internet Access to Large
Government Data Archives, and a grant from RR Donnelley and Sons.
Mark Ginsburg, Doctoral Student, NYU Stern School,
Information Systems, mark@edgar.stern.nyu.edu.
Ajit Kambil, Assistant Professor, NYU Stern School, Information Systems,
akambil@stern.nyu.edu.
Alan B. Eisner, Doctoral Student, NYU Stern School, Management, aeisner@stern.nyu.edu.
Abstract
In this case study of a government application of the Internet, we describe
the evolution, current directions, and research emphasis of the EDGAR
(Electronic Data Gathering, Analysis, and Retrieval) project.
EDGAR is a large, heterogeneous financial data archive composed of all forms
filed electronically with the Securities and Exchange Commission (SEC) by
domestic publicly traded corporations. It has been available to Internet users
since January 1994.
We present empirical analysis of access patterns to support our research in
two key areas, Information Retrieval and Problem Categorization, and mention
possible directions for future research.
Introduction
The EDGAR database is housed at the Internet Multicasting Service (IMS), located
in Washington, D.C. (with the invaluable technical and network
expertise of Carl Malamud, Brad Burdick, and the rest of the IMS staff).
Since January 1994, the IMS has disseminated the corporate filings on the
Internet via anonymous FTP, and since March 1994, both the IMS and
New York University (NYU) have provided WWW access to the filings.
The IMS also offers Gopher and e-mail service at town.hall.org.
The NYU team primarily develops front-end
Web applications to customize filings access; however, as we shall see, there is
substantial interest in composite profiles of a single company drawn from several
documents, and in industry analyses that span documents from several companies.
The project ends on December 31, 1995, and whether dissemination will continue
after that date is a source of intense industry speculation; the IMS has indicated
that it will cease its involvement on that date.
The EDGAR Archive
As of this writing, 3,196 publicly traded companies file electronically with the SEC.
The Mead Data Corporation, which holds exclusive dissemination rights under a contract
that runs through 1997, has contracted with the NSF Project team to provide
tapes of the filing data on a one-day delayed basis; the IMS then mounts the data
on fast disk. Two files are written for each submission: the filing proper, and the
header tags and their contents.
In the section on Current Work we will discuss one use for the header tags.
Most major corporations already file electronically, and the phase-in
schedules are publicly available in the Federal Register. Note that a
corporation may control many filing entities; consider a brokerage firm that
has a controlling interest in many mutual funds. Each filing entity is assigned its
own CIK (filing) code, and the IMS data archive is divided into subdirectories
according to these codes.
A Word on Size and Structure
The IMS data store is now nearing 6 GB, and the final
storage requirement is expected to be 20 GB or greater. Thus, the EDGAR database is orders of
magnitude larger than experimental hypermedia databases such as the work of
Salton et al. with the Funk and Wagnalls encyclopedia {Salton}; the structure of the elements is
quite different too. Whereas an encyclopedia's articles have a relatively low
size variance, EDGAR filings range from small encoded filings of about 800 bytes
to a large corporation's annual financial statement (10-K), which typically runs from 300 KB to
900 KB.
Furthermore, although it is difficult to group encyclopedia articles into iron-clad families
prior to a user query, the EDGAR filing types are more conducive to such prior grouping.
It is relatively safe, for example, to group all forms with financial ratios (10-Ks, 10-Qs, etc.)
into a "Financial" family when attempting to guide the user towards task completion.
The semantic content of the various sections of a 10-K, for example, is well defined
by various acts and regulations and is
documented in hardcopy {Bowne}. We shall return to this in the
section on user requirements.
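The prior grouping described above can be sketched as a simple lookup table. This is only an illustration: the "Financial" family for 10-K and 10-Q comes from the text, while the other family labels and the `family_of` and `group_by_family` helpers are hypothetical names introduced here.

```python
# Only the "Financial" grouping of 10-K and 10-Q comes from the text; the
# remaining family labels are illustrative assumptions.
FORM_FAMILIES = {
    "10-K": "Financial",
    "10-Q": "Financial",
    "DEF 14A": "Proxy",
    "485APOS": "Mutual Fund Prospectus",
    "485BPOS": "Mutual Fund Prospectus",
}


def family_of(form_type):
    """Return the family for a form type, or 'Other' when it is unclassified."""
    return FORM_FAMILIES.get(form_type.strip().upper(), "Other")


def group_by_family(form_types):
    """Group form types by family, preserving input order within each group."""
    groups = {}
    for form_type in form_types:
        groups.setdefault(family_of(form_type), []).append(form_type)
    return groups


print(group_by_family(["10-K", "8-K", "10-Q", "DEF 14A"]))
```

A table of this kind is cheap to maintain by hand, which suits a domain where the form types are fixed by regulation rather than discovered from the text.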
This combination of heterogeneity and a modicum of contextual uniformity guaranteed by
legal and compliance requirements poses
interesting challenges, particularly since the Web offers a variety of
intriguing tools for simplifying the user's task: Wide Area Information Servers
(WAIS), intradocument tagging, intelligent agents,
context-sensitive help, etc.
Filing Form Types
Investors, librarians, public advocacy groups, and many others
are interested in various SEC filings such as the 10-K (Annual Report), 8-K (Change in Material Status),
10-Q (Quarterly Report), DEF 14A (Proxy),
485APOS and 485BPOS (mutual fund prospectus), and so on. There are
well over a hundred SEC form types, and they provide
an invaluable depth of information that "fills in the gaps" behind such mundane
events as a newswire report of a company's earnings announcement. For example,
the "Management Discussion and Analysis" section in the 10-K is scrutinized
by investors as a key source of management's rationalization of prior performance and
future trends. A dictionary of form types and descriptions is
available online.
What Isn't in the EDGAR Archive
Not present in the archive are SEC Forms 3, 4, and 5 (officer, or "inside," purchases and sales), which are
closely watched in the SEC reading rooms; these are expected to be submitted via
EDGAR sometime in 1996. Also missing are photographic exhibits, which one can find
in commercial products such as Disclosure's CD-ROM. The SEC has no current plans
to upgrade its EDGAR software to accommodate electronic filings of non-ASCII files.
Thus, for the foreseeable future, we are left with ASCII text and tables. The good news, of course,
is that a text-intensive data archive conserves bandwidth. The bad news is that the current
incarnation of HTML does not support columnar tables; we are looking forward to this enhancement.
Project Goals
The general goal of our EDGAR development work is as follows:
To enable wide dissemination and support all levels of user access to the
corporate electronic filings submitted to the Securities and Exchange
Commission (SEC).
From the academic perspective, other major goals are:
- To identify and understand the requirements for broad public access,
- To identify and implement applications which operate
on the large document database and synthesize
reports based on information across multiple filings, and,
- To understand patterns of access to the EDGAR database with an eye to
generalizing knowledge thus acquired. Indeed, the EDGAR project is a flagship
government database dissemination project and many other ambitious projects
are also being launched (visit http://www.town.hall.org/
to explore the U.S. Patent Database and other interesting data sources).
We shall review our progress to date in these areas, discuss our current work, and
indicate some of the most important future projects we have planned.
Empirical Analysis of EDGAR User Access Patterns
Access Methods
The supported access methods include, but are not limited to,
e-mail, Gopher, FTP, and WWW browsers (e.g., Mosaic, Cello, and Lynx).
Figure 1
shows summary statistics at all levels of access in 1994. The Web servers
at the IMS and NYU sites became production services in March 1994;
only FTP was available in January and February.
The NYU server, which provides custom form lookup tools, forms help,
and utilities such as company to ticker symbol lookup, has been
quite active with 482 files transmitted daily, on average. The IMS server has, on a daily
basis, transmitted 1,455 files via FTP (195 MB) and 178 files via the Web (377K).
E-mail and Gopher statistics are not yet available. As more users gain Web access,
we expect the forms support and other search tools (e.g., WAIS) to cause a steady migration away
from FTP, e-mail, and Gopher.
Figure 2 and
Figure 3
show the total transfer by client domain from the NYU and IMS web servers. Of interest is
the substantial proportion of foreign usage (10.12%, NYU; 12.47%, IMS). Domestic commercial
and educational usage is fairly balanced for NYU (37.5% commercial; 39.35% educational), but commercial
users are the largest group for IMS (38.40% commercial; 32.43% educational).
Figure 4 and
Figure 5
show, similarly, the total number of information requests by client domain
from the NYU and IMS web servers.
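A breakdown by client domain of this kind can be computed from server transfer logs. The following is a minimal sketch, assuming each log record has already been reduced to a (hostname, bytes) pair; the category names, the `DOMESTIC` table, and the sample records are illustrative assumptions, not the project's actual log format or data.

```python
from collections import Counter

# Top-level domains treated as domestic categories; anything else (for
# example a two-letter country code) is counted as foreign. This mapping is
# an assumption made for illustration.
DOMESTIC = {"com": "commercial", "edu": "educational", "gov": "government",
            "org": "organization", "net": "network", "mil": "military"}


def classify(host):
    """Classify a client hostname by its top-level domain."""
    last = host.rstrip(".").rsplit(".", 1)[-1].lower()
    if last.isdigit():
        return "unresolved"  # bare IP address with no hostname available
    return DOMESTIC.get(last, "foreign")


def domain_shares(transfers):
    """Given (host, bytes) pairs, return each category's percentage of bytes."""
    totals = Counter()
    for host, size in transfers:
        totals[classify(host)] += size
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {cat: round(100.0 * n / grand_total, 2) for cat, n in totals.items()}


# Made-up sample records, not measurements from either server.
sample = [("a.stern.nyu.edu", 300), ("ws1.example.com", 500),
          ("host.example.co.uk", 150), ("192.0.2.7", 50)]
print(domain_shares(sample))
```

Passing a count of 1 in place of the byte size for each record gives the share of requests rather than the share of bytes transferred.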
Publicity in conventional media (newspaper and magazine articles), announcements on
USENET newsgroups such as misc.invest and misc.invest.funds, and increased awareness of the Web in general
have produced a steady increase in EDGAR Web access at both the NYU and IMS sites, as the
above four figures show.
Strategies for Understanding User Requirements
There is no simple way to predict what a user will need a particular filing for in any
given session. We have learned in aggregate
(via usage questionnaires, which are online at both the IMS and NYU sites)
that the forms most often needed are the 10-Ks, the Proxies (DEF 14As),
the Acquisition Filings (Schedules 13D and
13G), and Mutual Fund Prospectuses.
However, the problems faced by many users are definitely of the "ill-structured" variety {Simon},
{Newell}.
Consider a hypothetical example: a user would like to know why
XYZ Corp. laid off 5,000 employees last quarter.
It is far from obvious which form types might contain clues,
even after perusing an online or paper dictionary.
One might well start with 8-Ks (material changes to financial condition),
but there are no ready-made hyperlinks
from an 8-K to the larger 10-Ks or Proxies. This problem will be addressed
in the following section on inter-document
linking.
A further distinction can be made between the expert user (e.g. a corporate law librarian) and the novice (e.g.
an inexperienced private investor). Chi et al. suggest that the frame of reference is critical in physics problem
solving and speeds the expert along {Chi}; similarly in the EDGAR domain experts
have a built-in frame of reference that links keywords to form types.
They further know what information is likely to be unlocatable in the EDGAR data archive, whereas
a novice might spend many fruitless hours searching. A good example is officer purchases
and sales: they are critical to investment decisions and are reported in investor newsletters and
journals (e.g., Barron's), but are not yet present on EDGAR.
Another problem stems from many users' Internet providers. Often, a filing is quite large (several hundred KB for
10-Ks and Proxies) and the provider does not allow e-mail of that size, insisting that messages be split into pieces (which,
according to Murphy's Law, never arrive in proper order!).
Similarly, a provider might charge by the
KB transferred, and thus EDGAR use might become quite pricey for some users. We discuss these bread-and-butter
concerns in the next section.
Economic Considerations of the EDGAR Service
Considerable technical work has been published charting backbone congestion, backbone upgrades, and
the inevitable return of congestion {Claffy92}, {Claffy94}. EDGAR is a text-based archive, as
we have noted, devoid of audio, video, or photographic images. However, many of the interesting filings,
such as the 10-K and Proxy filings, are quite large.
Keeping in mind our goal of low-cost information dissemination, we would like to provide enough information to
satisfy the user's needs during an EDGAR session without necessarily providing entire documents. At present,
many providers charge by the KB transferred, and there are strong arguments for a generalized "user pays"
policy to be applied to the Internet at large {MacKie93}, {MacKie94}.
As MacKie-Mason and Varian say, the Internet community at large faces the classic economic
"problem of the commons," where users, given unlimited Internet access, pay no penalty for
high-bandwidth usage {MacKie94}.
Therefore, how can the EDGAR service position itself as a 'good Net citizen', conserving bandwidth,
while not limiting functionality? There are several approaches either in development or under
consideration:
- Automatic intradocument table of contents generation. We have developed shell scripts to parse
the more popular filings and prefix them with a hyperlink table of contents designed for easy perusal.
For example, the key financial ratios are indexed at the very top of the 10-Ks.
- User choice at FTP request time. At the moment the user receives an answer to his or her query
from a Web form, subdocuments are presented as an alternative to downloading the entire filing. For
example,
the user may opt to download only the Executive Compensation section of the Proxy. Of course,
we do not want to use unnecessary disk space at the IMS site, and thus we are testing real-time
extraction of subdocuments and leaning away from batch jobs that would partition the filings ahead
of time.
- 'Intelligent Agent' help. If the user requests this service, the agent will present a picklist of
typical user queries and the subsections of one or more filings that would be useful in each case. For
this work, it is critical that we sit with EDGAR users and do a complete task analysis for several major
user groups.
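The first two approaches above, table of contents generation and on-demand subdocument extraction, can be illustrated together. The sketch below is written in Python rather than as the shell scripts mentioned above, and it assumes that section headings appear on lines of their own beginning with "Item <number>."; that heading convention, the function names, and the sample filing are assumptions made for illustration.

```python
import html
import re

# Assumed heading convention: a line that starts with "Item <number>."
HEADING = re.compile(r"^\s*ITEM\s+\d+[A-Z]?\.", re.IGNORECASE)


def split_sections(text):
    """Split a filing into (title, body) pairs at each detected heading.

    Text that appears before the first heading is ignored.
    """
    sections = []
    for line in text.splitlines():
        if HEADING.match(line):
            sections.append((" ".join(line.split()), []))
        elif sections:
            sections[-1][1].append(line)
    return [(title, "\n".join(body).strip()) for title, body in sections]


def with_table_of_contents(text):
    """Return HTML in which a linked table of contents precedes the sections."""
    sections = split_sections(text)
    toc = "\n".join(
        f'<li><a href="#sec{i}">{html.escape(title)}</a></li>'
        for i, (title, _) in enumerate(sections))
    body = "\n".join(
        f'<h2><a name="sec{i}">{html.escape(title)}</a></h2>\n'
        f'<pre>{html.escape(content)}</pre>'
        for i, (title, content) in enumerate(sections))
    return f"<ul>\n{toc}\n</ul>\n{body}"


def extract_section(text, keyword):
    """Return the body of the first section whose title contains keyword."""
    for title, body in split_sections(text):
        if keyword.lower() in title.lower():
            return body
    return None


filing = """Preamble text
Item 1. Business
We make widgets.
Item 11. Executive Compensation
The CEO earned a salary.
"""
print(with_table_of_contents(filing))
print(extract_section(filing, "executive compensation"))
```

Because `extract_section` works on the filing text at request time, no pre-partitioned copies need to be stored, which matches the preference for real-time extraction over batch jobs.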
Customizing the Front-End
Using Mosaic Common Gateway Interface (CGI) tools, it is a simple matter to provide an attractive front-end
for point-and-click forms retrieval. For example,
our Mutual Fund search presents
a pre-written list of publicly traded funds and flags those not yet on EDGAR.
We also have a Prospectus search, which corresponds to the "485" form
series.
We also provide a Schedule 13D application, where a user enters a
company (e.g., Gabelli) and the
result is a list of all companies in which Gabelli has acquired a 5% or
greater ownership position. We used an internal
database to look up the target companies' names from their CIK codes
in order to present them to the end-user (in the case of Gabelli,
the positions include Tredegar Industries,
Santa Anita Operations, United Television, etc.). This application
has proven popular with mutual fund aficionados, based on comments we have
received from misc.invest.funds newsgroup readers.
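The lookup behind such an application can be sketched as a join between filing records and a CIK-to-name table. Everything in this example is hypothetical: the record layout, the filer and target names, the CIK values, and the `targets_of` helper are placeholders, and "SC 13D" is used as the form-type label for Schedule 13D filings.

```python
# Hypothetical records: (filer name, form type, CIK of the target company).
FILINGS = [
    ("Example Filer", "SC 13D", "0000000101"),
    ("Example Filer", "SC 13D", "0000000202"),
    ("Another Filer", "SC 13D", "0000000101"),
    ("Example Filer", "10-K", "0000000303"),
]

# Hypothetical CIK-to-name table standing in for the internal database;
# the CIK values and company names are made up.
CIK_NAMES = {
    "0000000101": "Example Target A",
    "0000000202": "Example Target B",
}


def targets_of(filer, filings=FILINGS, names=CIK_NAMES):
    """Return sorted names of the companies targeted by the filer's 13D filings."""
    ciks = {cik for name, form, cik in filings
            if name.lower() == filer.lower() and form == "SC 13D"}
    return sorted(names.get(cik, f"Unknown company (CIK {cik})") for cik in ciks)


print(targets_of("example filer"))
```

Collecting the CIK codes in a set first means a target is listed once even when the filer has submitted several 13D filings for it.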
Also popular is Current Events Analysis, where users can view recent filings.
These applications are fairly straightforward: the user inputs a few variables, then we extract records from a master
index file and provide a clickable answer. The last step, which takes place when the user clicks on the
highlighted filing, invokes the native FTP service at town.hall.org in Washington, D.C. Since Web
connections are stateless {Berners-Lee}, the NYU server is freed in the final phase of filing retrieval.
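The extract-and-link pattern just described can be sketched as follows. The pipe-delimited index layout, the file paths, and the function names are assumptions made for illustration; only the host name town.hall.org comes from the text.

```python
import html

FTP_HOST = "town.hall.org"  # host named in the text; the paths below are made up

# Hypothetical pipe-delimited index: company | form type | date | file path
INDEX = """\
XYZ Corp|10-K|1994-03-31|edgar/data/1111/0001.txt
XYZ Corp|8-K|1994-05-02|edgar/data/1111/0002.txt
ABC Fund|485BPOS|1994-04-15|edgar/data/2222/0003.txt
"""


def search_index(index_text, company=None, form_type=None):
    """Yield index records matching the optional company substring and form type."""
    for line in index_text.splitlines():
        if not line.strip():
            continue
        name, form, date, path = line.split("|")
        if company and company.lower() not in name.lower():
            continue
        if form_type and form != form_type:
            continue
        yield {"company": name, "form": form, "date": date, "path": path}


def render_links(records):
    """Render matching records as an HTML list of FTP links to the archive."""
    items = [
        f'<li><a href="ftp://{FTP_HOST}/{r["path"]}">'
        f'{html.escape(r["company"])} {r["form"]} ({r["date"]})</a></li>'
        for r in records]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


print(render_links(search_index(INDEX, company="xyz", form_type="10-K")))
```

Because each link points directly at the FTP host, the server that produced the page plays no part in the transfer itself.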
The Web is conducive to the cycle of resource discovery, mutual server linking, and consequently increased
traffic at each linked site. EDGAR sits at the intersection of law, finance, and economics and thus enjoys
much visibility in the academic server community.
For example, EDGAR provides reciprocal links to the Carnegie Mellon financial server,
FINWeb (a financial economics WWW server), the Indiana University School of Law,
and the University of Michigan Economics Server, as well as a host of
commercial enterprises.
Current Work
Indexing schemes are a major focus of current work. The IMS site is experimenting with the commercial
WAIS engine and plans to fully index all of the SGML header tags of each filing. This will provide the
capability to perform boolean queries on data items such as the company address, CIK code, filing as-of
date, etc.
However, since WAIS indices typically require about as much space as
the text they are indexing (roughly 1:1), the search is on for more space-efficient ways to help the user.
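To make the idea of boolean queries over header fields concrete, here is a minimal inverted-index sketch. It does not use WAIS and is not the IMS implementation; the "<TAG>value" line format, the tag names, and the sample headers are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumed header format: one "<TAG>value" pair per line.
TAG_LINE = re.compile(r"^<([A-Z][A-Z0-9-]*)>\s*(.+)$")


def parse_header(header_text):
    """Return {tag: value} for every "<TAG>value" line in a header file."""
    fields = {}
    for line in header_text.splitlines():
        match = TAG_LINE.match(line.strip())
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


def build_index(headers):
    """Map each (tag, lowercase word) pair to the set of filing ids containing it."""
    index = defaultdict(set)
    for filing_id, text in headers.items():
        for tag, value in parse_header(text).items():
            for word in value.lower().split():
                index[(tag, word)].add(filing_id)
    return index


def _postings(index, terms):
    return [index.get((tag, word.lower()), set()) for tag, word in terms]


def query_and(index, *terms):
    """Filings that match every (tag, word) term."""
    sets = _postings(index, terms)
    return set.intersection(*sets) if sets else set()


def query_or(index, *terms):
    """Filings that match at least one (tag, word) term."""
    return set().union(*_postings(index, terms))


# Made-up header files keyed by a made-up filing id.
HEADERS = {
    "f1": "<CIK>0000001111\n<STATE>NY\n<FORM-TYPE>10-K\n",
    "f2": "<CIK>0000002222\n<STATE>NY\n<FORM-TYPE>8-K\n",
    "f3": "<CIK>0000003333\n<STATE>CA\n<FORM-TYPE>10-K\n",
}
index = build_index(HEADERS)
print(query_and(index, ("STATE", "NY"), ("FORM-TYPE", "10-K")))
print(query_or(index, ("STATE", "CA"), ("FORM-TYPE", "8-K")))
```

Indexing only the header fields keeps the index far smaller than a full-text index, at the cost of not being able to search the body of a filing.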
The NYU team, for example, is building a Standard Industrial Classification
(SIC) database to permit intrasector analyses (e.g., extracting the key
financial ratios for every aerospace firm). There is also work to identify the controlling interests behind
each mutual fund in order to correlate parent firm performance with that of the fund.
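An intrasector analysis of the kind mentioned above reduces to grouping firms by SIC code and computing a ratio for each firm in the group. In the sketch below the company names, SIC assignments, and balance-sheet figures are all made up, and the current ratio is chosen only as one example of a financial ratio.

```python
from collections import defaultdict

# Hypothetical records; names, SIC assignments, and figures are made up.
COMPANIES = [
    {"name": "Aero One", "sic": "3721",
     "current_assets": 500.0, "current_liabilities": 250.0},
    {"name": "Aero Two", "sic": "3721",
     "current_assets": 300.0, "current_liabilities": 400.0},
    {"name": "Retail Co", "sic": "5311",
     "current_assets": 120.0, "current_liabilities": 100.0},
]


def by_sector(companies):
    """Group company records by SIC code."""
    groups = defaultdict(list)
    for company in companies:
        groups[company["sic"]].append(company)
    return groups


def current_ratios(companies, sic):
    """Current ratio (current assets / current liabilities) per firm in one sector."""
    return {c["name"]: c["current_assets"] / c["current_liabilities"]
            for c in by_sector(companies)[sic]}


print(current_ratios(COMPANIES, "3721"))
```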
We are also tabulating the many responses we have received to the online usage questionnaire.
Many of the problems users face, as has been noted, stem from their access providers; others arise
from idiosyncratic Web clients. Whenever possible, we modify the server's
behavior to handle such cases (for example, when a client passes an unexpected token in a form
response).
Future Directions
Here are some interesting issues that are being worked on in both the private and public sectors.
- Officer Migration Patterns. As archiving of the EDGAR data store continues, one can start to ask interesting questions that
rely on "historical" EDGAR data (recall that we have no data before January 1, 1994). Of
particular interest to Management scholars is the "flow" of officers from one publicly traded company
to another. Such a database might be used to correlate resignations and hirings with firm performance.
- Artificial Intelligence Applications. There are some very hard questions that users would like an EDGAR front-end to answer. For example,
how do we answer the hypothetical question 'How many boards of directors
does John Q. Public serve on?' Since we do not have social security numbers
or other unique identifiers in the body of forms, a programmatic approach needs to be quite clever in
attempting to match names.
- Modification of EDGAR User Software. Customization of user software
for EDGAR use is clearly desirable. For example,
it could do 'smart parsing' on a locally cached copy of the document (as
opposed to having the server do it, which might cause serious delays). Typical tasks could be
optimized locally, especially when the user needs an unusual combination of subdocuments for
one or more companies. Customization can take place at the level of the Web browser or even at the
level of FTP transfer from the IMS - there exist automated routines to do 'unmanned' FTP transfers.
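The name-matching difficulty raised under Artificial Intelligence Applications can be illustrated with a simple normalization heuristic. This is a sketch under stated assumptions, not a solution: the rules in `name_key` and `same_person` are illustrative choices, the board lists are made up, and no such heuristic can distinguish two different people who share a name.

```python
import re

SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}


def name_key(name):
    """Reduce a name to (first, middle initial or '', last).

    Case, punctuation, and generational suffixes are ignored, and the
    "Last, First" order is accepted.
    """
    if "," in name:  # "Public, John Q." becomes "John Q. Public"
        last, rest = name.split(",", 1)
        name = f"{rest} {last}"
    words = [w for w in re.sub(r"[^a-z\s]", " ", name.lower()).split()
             if w not in SUFFIXES]
    if len(words) < 2:
        return None
    middle = words[1][0] if len(words) > 2 else ""
    return words[0], middle, words[-1]


def same_person(a, b):
    """Heuristic: same first and last name, and no conflicting middle initials."""
    ka, kb = name_key(a), name_key(b)
    if ka is None or kb is None:
        return False
    names_match = (ka[0], ka[2]) == (kb[0], kb[2])
    middles_compatible = not ka[1] or not kb[1] or ka[1] == kb[1]
    return names_match and middles_compatible


def count_boards(person, board_lists):
    """Count the companies whose director list contains a name matching person."""
    return sum(any(same_person(person, d) for d in directors)
               for directors in board_lists.values())


# Made-up director lists for four made-up companies.
boards = {
    "Company A": ["John Q. Public", "Jane Roe"],
    "Company B": ["PUBLIC, JOHN Q.", "Richard Miles"],
    "Company C": ["John R. Public"],
    "Company D": ["John Public Jr."],
}
print(count_boards("John Q. Public", boards))
```

The same normalized key could serve to track an officer's name across filings from different companies, which is the starting point for the officer migration question as well.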
Concluding Comments
As the EDGAR Data Archive grows in size and complexity, so do the challenges
inherent in serving a diverse community of
Internet users. We must monitor emerging standards, protocols, platforms, and clients and
continue to work with the various user communities to adapt the service to their needs.
References
Berners-Lee, T. and Cailliau, R. and Luotonen, A. and Nielsen, H. and A. Secret,
The World-Wide Web,
Communications of the ACM,
1994,
37,
76-82,
August.
Bowne and Co.,
Appeal Securities Act Handbook,
Bowne and Co.,
1993.
Chi, M. T. and Feltovich, P. J. and R. Glaser,
Categorization and representation of physics problems by experts
and novices,
Cognitive Science,
1979,
5,
121-132.
Claffy, K. C. and Polyzos, G.C. and H.-W. Braun,
Traffic characteristics of the T1 NSFNET backbone,
UCSD,
1992,
CS92-252.
Claffy, K. C. and Polyzos, G.C. and H.-W. Braun,
Tracking long-term growth of the NSFNET,
Communications of the ACM,
1994,
23,
35-45.
MacKie-Mason, J. and H. Varian,
Some economics of the Internet,
University of Michigan,
1993,
Ann Arbor, Michigan.
MacKie-Mason, J. and H. Varian,
B. Kahin and J. Keller (eds.),
Pricing the Internet,
Prentice-Hall,
1994.
Newell, A. and H. Simon,
Human Problem Solving,
Prentice-Hall,
1972.
RR Donnelley and Sons,
The EDGAR Handbook,
Chicago, IL,
1994.
Salton, G. and Allan, J. and C. Buckley,
Automatic
structuring and retrieval of large text files,
Communications of the ACM,
1994,
37,
97-108,
February.
Securities and Exchange Commission,
A User's Guide to the
Facilities of the Public Reference Room,
Washington, D.C.,
SEC Commission Office of Filings, Information,
and Consumer Services,
1991.
Securities and Exchange Commission,
The SEC Edgar Filer Manual Version 3.5,
Washington, D.C.,
SEC,
1994.
Simon, H. and Newell, A. and J.C. Shaw,
H.E. Gruber, G. Terrell, and M. Wertheimer (eds.),
The processes of creative thinking,
63-119,
Lieber-Atherton, Inc.,
1962.
About the Authors
Mark Ginsburg
Mark Ginsburg is a doctoral student in the Information Systems
Department, Stern School of Business, New York University. He has a
B.A. from Princeton University, an M.A. from
Columbia University, and was a Stern Scholar in the Statistics and
Operations Research Department en route to earning an M.B.A. at NYU.
He is responsible for the daily operation of
NYU's EDGAR web server and is interested in the following Internet
issues: evolution of standards, collaborative software, and
the economics of interoperability (or lack thereof).
Ajit Kambil
Professor Kambil is an Assistant Professor of Information Systems at the
Stern School of Business, New York University. He earned his undergraduate and PhD degrees at
MIT. His research is centered on three inter-related areas:
information technology and the transformation of business strategy,
organizations, and networks; aligning
information technology and business strategies; and communications networks (design, use, and policies).
At NYU Prof. Kambil teaches courses that introduce MBAs to the management of information
systems and undergraduate students to telecommunications systems.
Alan B. Eisner
Alan B. Eisner is a doctoral student in the Management
Department, Stern School of Business, New York University. He earned
a B.S. in Operations Research and Industrial Engineering in 1989, and
an M.Eng. in Engineering Management in 1992, both from Cornell
University. His primary research interests are technology strategies
and organizational learning.