The EDGAR Project: A Case Study in Disseminating Financial Data on the
Internet
This project is supported by NSF Grant No. 9319331, Internet Access to Large
Government Data Archives, and a grant from RR Donnelley and Sons.
Mark Ginsburg,
Doctoral Student, NYU Stern School, Information Systems,
mark@edgar.stern.nyu.edu.
Ajit Kambil, Assistant
Professor, NYU Stern School, Information Systems, akambil@stern.nyu.edu.
Alan B. Eisner, Doctoral Student, NYU Stern School, Management,
aeisner@stern.nyu.edu.
Abstract
In this case study of a Government application of the Internet,
we describe the project evolution, current directions, and research emphasis of
the EDGAR (Electronic Data Gathering, Analysis, and Retrieval) project.
EDGAR is a large, heterogeneous financial data archive that has been available
to Internet users since January 1994. It comprises all forms filed electronically
with the Securities and Exchange Commission (SEC) by domestic publicly traded
corporations.
We present empirical analysis of access patterns to support
our research in two key areas, Information Retrieval and Problem Categorization,
and mention possible directions for future research.
Introduction
The EDGAR database is housed at the Internet Multicasting
Service (IMS), located in Washington, D.C. (with the invaluable technical and
network expertise of Carl Malamud, Brad Burdick, and the rest of the IMS staff).
Since January 1994, the IMS has provided dissemination of the corporate filings
on the Internet via anonymous FTP, and since March 1994, both the IMS and New York University (NYU) have provided
WWW access to the filings. The IMS also permits Gopher and e-mail service at town.hall.org.
The NYU team primarily develops front-end Web applications to customize filings
access; however, as we shall see, there is substantial interest in composite
interdocument, intracompany profiles and interdocument, intercompany industry
analyses. The project ends on December 31, 1995, and continuation of the
dissemination at this point is a source of intense industry speculation; the IMS
has indicated it will definitely cease its involvement on this date.
The EDGAR Archive
As of this writing, 3,196 publicly traded companies
file electronically with the SEC. The Mead Data Corporation, which has exclusive
dissemination rights under a contract that runs through 1997, has contracted with
the NSF Project team to provide tapes of the filing data on a one-day delayed
basis; the IMS then mounts the data on fast disk. Two files are written for each
submission: the filing proper, and the header tags and their
contents. In the section on Current Work we will discuss one use for the
header tags. Most major corporations already file electronically, and the
phase-in schedules are publicly available in the Federal Register. It should be
noted that corporations may control many filing entities - consider a brokerage
firm which has a controlling interest in many mutual funds. Each filing entity
is accorded its own CIK (filing) code and the IMS data archive is broken into
subdirectories according to these codes.
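The layout just described (one subdirectory per CIK code, two files per submission) can be sketched as follows. This is only an illustration: the archive root, the file extensions, and the function name are assumptions made for the example, since the text does not give the actual IMS paths.

```python
from pathlib import PurePosixPath

# Hypothetical archive root; the real IMS path is not given in the text.
ARCHIVE_ROOT = PurePosixPath("/edgar/data")


def submission_paths(cik: int, submission_id: str) -> dict:
    """Return the two files assumed to be written for one submission."""
    directory = ARCHIVE_ROOT / str(cik)  # one subdirectory per CIK code
    return {
        # The filing proper (extension is an assumption).
        "filing": directory / f"{submission_id}.txt",
        # The header tags and their contents (extension is an assumption).
        "header": directory / f"{submission_id}.hdr",
    }


# Placeholder CIK and submission identifier, invented for the example.
paths = submission_paths(12345, "sample-submission")
print(paths["filing"])
print(paths["header"])
```

Keeping the header in its own small file means that a tool interested only in the tags never has to read the much larger filing text.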
A Word on Size and Structure
The IMS data store is now nearing 6 GB, and
a final storage requirement is expected to be 20 GB or greater. Thus, the EDGAR
database is orders of magnitude greater than experimental hypermedia databases
such as the work of Salton et al. with the Funk and Wagnalls encyclopedia
{Salton}; the structure of the elements is quite different too. Whereas an
encyclopedia's universe of articles has a relatively low size variance, the
EDGAR store can vary from small encoded filings that might be 800 bytes, to a
large corporation's annual financial statement (10-K), that typically ranges
from 300 KB to 900 KB.
Furthermore, although it is difficult to group encyclopedia articles into
iron-clad families prior to a user query, the EDGAR filing types are more
conducive to such prior grouping. It is relatively safe, for example, to label
all forms with financial ratios (10-K's, 10-Q's, etc.) into a ``Financial''
family when attempting to guide the user towards task completion. The semantic
content of the various sections of a 10-K, for example, is well defined by
various acts and regulations and is available in hardcopy {Bowne}. We shall
return to this in the section on user requirements.
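A prior grouping of this kind amounts to a small lookup table. The sketch below is a minimal illustration; only the ``Financial'' family is named in the text, so the other family names and the function name are assumptions.

```python
# Illustrative families. Only "Financial" comes from the text; the other
# family names are assumptions made for this example.
FORM_FAMILIES = {
    "Financial": {"10-K", "10-Q"},
    "Proxy": {"DEF 14A"},
    "Mutual Fund Prospectus": {"485APOS", "485BPOS"},
}


def family_of(form_type: str) -> str:
    """Map a form type to its family, or to "Other" if it is not listed."""
    normalized = form_type.strip().upper()
    for family, members in FORM_FAMILIES.items():
        if normalized in members:
            return family
    return "Other"


for form in ("10-K", "def 14a", "8-K"):
    print(form, "->", family_of(form))
```

Because the families are fixed before any query is made, the table can be consulted at request time without examining the filing text at all.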
This combination of heterogeneity and a modicum of contextual uniformity
guaranteed by legal and compliance issues poses interesting challenges,
particularly since the web offers a variety of intriguing tools for simplifying
the user's task: Wide Area Information Search (WAIS), intradocument tagging,
intelligent agents, context-sensitive help, etc.
Filing Form Types
Investors, librarians, private investors, public
advocacy groups, and many others are interested in various SEC filings such as
the 10-K (annual report), 8-K (Change in Material Status), 10-Q (Quarterly
Report), DEF 14A (Proxy), 485APOS and 485BPOS (mutual fund prospectus), and so
on. There is a wealth of well over a hundred SEC form types, and they provide
invaluable depth of information that ``fills in the gaps'' behind such mundane
events as a newswire report of a company's earnings announcement. For example,
the ``Management Discussion and Analysis'' section in the 10-K is scrutinized by
investors as a key source of management rationalization of prior performance and
future trends. A dictionary of form types and descriptions is online.
What Isn't in the EDGAR Archive
Not present in the archive are SEC Forms
3, 4, and 5 (officer, or ``inside'' purchases and sales) which are closely
watched in the SEC reading rooms - these are expected to be submitted via EDGAR
sometime in 1996. Also missing are photographic exhibits, which one can find in
commercial products such as Disclosure's CD-ROM. There are no current plans for
the SEC to upgrade its EDGAR software to accommodate electronic filings of
non-ASCII files.
Thus, for the foreseeable future, we are left with ASCII text and tables. The
good news, of course, is that a text-intensive data archive conserves bandwidth.
The bad news is that the current incarnation of HTML does not support columnar
tables; we are looking forward to this enhancement.
Project Goals
The general goal of our EDGAR development work is as
follows:
To enable wide dissemination and support all levels of user access to the
corporate electronic filings submitted to the Securities and Exchange Commission
(SEC).
From the academic perspective, other major goals are:
- To identify and understand the requirements for broad public access,
- To identify and implement applications which operate on the large document
database and synthesize reports based on information across multiple filings,
and,
- To understand patterns of access to the EDGAR database with an eye to
generalizing knowledge thus acquired. Indeed, the EDGAR project is a flagship
government database dissemination project and many other ambitious projects
are also being launched (visit http://www.town.hall.org/ to explore
the U.S. Patent Database and other interesting data sources).
We shall review our progress to date in these areas, discuss our current work, and
indicate some of the most important future projects we have planned.
Empirical Analysis of EDGAR User Access Patterns
Access Methods
Access methods that are supported include, but are not
limited to, the following: e-mail, Gopher, FTP, and WWW browsers (e.g. Mosaic,
Cello, and Lynx).
Figure 1 shows
summary statistics at all levels of access in 1994. The Web servers at the IMS
and NYU sites became production services in March 1994; only FTP was available
in January and February.
The NYU server, which provides custom form lookup tools, forms help, and
utilities such as company to ticker symbol lookup, has been quite active with
482 files transmitted daily, on average. The IMS server has, on a daily basis,
transmitted 1,455 files via FTP (195 MB) and 178 files via the Web (377K).
E-mail and Gopher statistics are not yet available. Naturally, as more users
gain Web access, the forms support and other search tools (e.g., WAIS) will
cause a steady migration away from FTP, e-mail, and Gopher.
Figure 2 and Figure 3 show the total
transfer by client domain from the NYU and IMS web servers. Of interest is the
substantial proportion of foreign usage (10.12%, NYU; 12.47%, IMS). Domestic
commercial and educational usage is fairly balanced at NYU (37.5% commercial;
39.35% educational), but commercial users predominate at IMS (38.40%
commercial versus 32.43% educational).
Figure 4 and Figure 5 show, similarly,
the total number of information requests by client domain from the NYU and IMS
web servers.
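A breakdown by client domain of the kind reported above could be computed from a server log roughly as follows. This is a sketch under stated assumptions: the bucket names, the rule that any unlisted top-level domain counts as foreign, and the sample records are all invented for illustration and are not the method or data behind the figures.

```python
from collections import Counter


def classify_host(hostname: str) -> str:
    """Bucket a client host by its top-level domain (buckets are assumed)."""
    tld = hostname.rsplit(".", 1)[-1].lower()
    if tld == "com":
        return "commercial"
    if tld == "edu":
        return "educational"
    if tld in {"gov", "mil", "org", "net"}:
        return "other domestic"
    if tld.isdigit():
        return "unresolved"  # bare IP address, no domain name available
    return "foreign"  # simplification: any other suffix is a country code


def byte_share_by_domain(records):
    """records: iterable of (hostname, bytes_sent) pairs.

    Returns the percentage of total bytes transferred to each bucket.
    """
    totals = Counter()
    for hostname, bytes_sent in records:
        totals[classify_host(hostname)] += bytes_sent
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {b: round(100.0 * n / grand_total, 2) for b, n in totals.items()}


# Invented sample records, not real log data.
sample = [("host.example.com", 500), ("lab.example.edu", 300), ("site.example.uk", 200)]
print(byte_share_by_domain(sample))
```

Counting requests instead of bytes only requires passing 1 as the second element of each pair.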
Publicity generated via conventional media (newspaper and
magazine articles), announcements on USENET newsgroups such as misc.invest and
misc.invest.funds, and increased awareness of the Web in general have caused a
steady increase in EDGAR Web access at both the NYU and IMS sites, as the
four figures above show.
Strategies for Understanding User Requirements
There is no simple way to
predict what a user will need a particular filing for in any given session. We
have learned (via usage questionnaires which are online at
both the IMS and NYU sites) that, in aggregate, the forms most often needed are the 10-K's, the
Proxies (DEF 14A's), the Acquisition Filings (Schedules 13D and 13G), and Mutual
Fund Prospectuses.
However, the problems faced by many users are definitely of the
``ill-structured'' variety {Simon}, {Newell}. Consider a hypothetical example: a
user would like to know why XYZ Corp. laid off 5,000 employees last quarter. It
is far from obvious which form types might contain clues, even after
perusing an online or paper dictionary. One might well start with 8-K's
(Material changes to financial condition), but there are no ready-made hyperlinks
from an 8-K to the larger 10-K's or Proxies. This problem will be addressed in
the following section on inter-document linking.
A further distinction can be made between the expert user (e.g. a corporate
law librarian) and the novice (e.g. an inexperienced private investor). Chi et
al. suggest that the frame of reference is critical in physics problem solving
and speeds the expert along {Chi}; similarly, in the EDGAR domain, experts have a
built-in frame of reference that links keywords to form types. They further know
what information is likely to be unlocatable in the EDGAR data archive, whereas
a novice might spend many fruitless hours searching. A good example is the
officer purchases and sales, which are critical to investment decisions and are reported in
investor newsletters and journals (e.g. Barron's), but are not yet present on
EDGAR.
Another problem stems from many users' Internet providers. Often, a filing is
quite large (several hundred KB for 10-K's and Proxies) and the provider does
not allow e-mail of that size, insisting that messages be chopped up (and then, according
to Murphy's Law, the pieces never arrive in proper order!). Similarly, a provider
might charge by the KB transferred and thus EDGAR use might become quite pricey
for some users. We discuss these bread and butter concerns in the next section.
Economic Considerations of the EDGAR Service
Considerable technical work
has been published charting backbone congestion, backbone upgrades, and the
inevitable return of congestion {Claffy92}, {Claffy94}. EDGAR is a text-based
archive, as we have noted, devoid of audio, video, or photographic images.
However, many of the interesting filings, such as the 10-K and
Proxy filings, are quite large.
Keeping in mind our goal of low-cost information dissemination, we would like
to provide enough information to satisfy the user's needs during an EDGAR
session without necessarily providing entire documents. At present, many
providers charge by the KB transferred and there are strong arguments for a
generalized ``user pays'' policy to be applied to the Internet at large
{MacKie93}, {MacKie94}.
As MacKie-Mason and Varian say, the Internet community at large faces the classic
economic ``problem of the commons'', where users, given unlimited Internet
access, pay no penalty for high-bandwidth usage {MacKie94}.
Therefore, how can the EDGAR service position itself as a 'good Net citizen',
conserving bandwidth, while not limiting functionality? There are several
approaches either in development or under consideration:
- Automatic intradocument table of contents generation. We have developed
shell scripts to parse the more popular filings and prefix them with a
hyperlink table of contents designed for easy perusal. For example, the key
financial ratios are indexed at the very top of the 10-Ks.
- User choice at FTP request time. At the moment the user receives an answer
to his or her query from a Web form, subdocuments are presented as an
alternative to downloading the entire filing. For example, the user may opt to
download only the Executive Compensation
section of the Proxy. Of course, we do not want to use unnecessary disk
space at the IMS site and thus we are testing realtime extractions of
subdocuments and leaning away from batch jobs which would partition the
filings ahead of time.
- 'Intelligent Agent' help. If the user requests this service, the agent
will present a picklist of typical user queries and the subsections of one or
more filings that would be useful in each case. For this work, it is critical
that we sit with EDGAR users and do a complete task analysis for several major
user groups.
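The first two approaches in the list above can be sketched together: detect section headings, prefix the filing with a hyperlinked table of contents, and extract a single section at request time rather than partitioning filings in advance. The sketch uses Python rather than the shell scripts mentioned in the text, and the heading pattern ("Item N. Title") is an assumption; real filings vary widely, so this is illustrative only.

```python
import html
import re

# Assumed heading convention, e.g. "Item 7. Management's Discussion ...".
HEADING = re.compile(r"^\s*(ITEM\s+\d+[A-Z]?)\.?\s+(.*\S)\s*$", re.IGNORECASE)


def normalize(label: str) -> str:
    return " ".join(label.upper().split())


def find_sections(text: str):
    """Return (line_index, label, title) for each detected heading."""
    sections = []
    for index, line in enumerate(text.splitlines()):
        match = HEADING.match(line)
        if match:
            sections.append((index, normalize(match.group(1)), match.group(2)))
    return sections


def with_contents(text: str) -> str:
    """Prefix the filing with a linked table of contents and add anchors."""
    sections = find_sections(text)
    starts = {index: n for n, (index, _, _) in enumerate(sections)}
    toc = [
        f'<li><a href="#sec{n}">{label}. {html.escape(title, quote=False)}</a></li>'
        for n, (_, label, title) in enumerate(sections)
    ]
    body = []
    for index, line in enumerate(text.splitlines()):
        escaped = html.escape(line, quote=False)
        if index in starts:
            escaped = f'<a name="sec{starts[index]}">{escaped}</a>'
        body.append(escaped)
    return "<ul>\n" + "\n".join(toc) + "\n</ul>\n<pre>\n" + "\n".join(body) + "\n</pre>"


def extract_section(text: str, label: str) -> str:
    """Return only the named section, computed when the request arrives."""
    lines = text.splitlines()
    sections = find_sections(text)
    for position, (start, found, _) in enumerate(sections):
        if found == normalize(label):
            is_last = position + 1 == len(sections)
            end = len(lines) if is_last else sections[position + 1][0]
            return "\n".join(lines[start:end])
    return ""


# Invented two-section sample, not a real filing.
sample = "Cover page\nItem 1. Business\nWe make widgets.\nItem 7. Discussion\nSales rose.\n"
print(extract_section(sample, "Item 7"))
```

Extracting on demand trades a little CPU time per request for the disk space that pre-partitioned copies of every filing would consume, which matches the preference stated above.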
Customizing the Front-End
Using Mosaic Common Gateway Interface (CGI)
tools, it is a simple matter to provide an attractive front-end for point and
click forms retrieval. For example, our Mutual Fund search
presents a pre-written list of publicly traded funds and flags those
not yet on EDGAR. We also
have a Prospectus search, which corresponds to the "485" form series,
and a Schedule
13D application, where a user enters a company (e.g. Gabelli) and the result
is all companies in which Gabelli has acquired a 5% or greater ownership
position. We utilized an internal database to reverse-engineer the CIK codes of
the target companies in order to provide their names to the end-user (in the
case of Gabelli, they have acquired positions in Tredegar Industries, Santa
Anita Operations, United Television, etc). This application has proven popular
with mutual fund aficionados based on comments we have received from
misc.invest.funds newsgroup readers. Also popular is Current Events Analysis, where
users can view recent filings.
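The Schedule 13D lookup described above reduces to two steps: collect the target CIK codes for filings whose filer matches the query, then translate those codes to company names. The sketch below is a minimal illustration. The company names come from the text, but the CIK numbers, the record layout, and the function name are placeholders invented for the example.

```python
# Invented records: (filer name, target CIK) pairs as they might be read
# from Schedule 13D filings. The CIK numbers are placeholders.
FILINGS_13D = [
    ("Gabelli", 1001),
    ("Gabelli", 1002),
    ("Gabelli", 1003),
    ("Another Filer", 1001),
]

# Stand-in for the internal database that maps CIK codes to names.
CIK_TO_NAME = {
    1001: "Tredegar Industries",
    1002: "Santa Anita Operations",
    1003: "United Television",
}


def targets_of(filer_query, filings, cik_to_name):
    """Names of the companies in which the matching filer reported a position."""
    query = filer_query.lower()
    target_ciks = {cik for filer, cik in filings if query in filer.lower()}
    return sorted(
        cik_to_name.get(cik, f"CIK {cik} (name unknown)") for cik in target_ciks
    )


print(targets_of("gabelli", FILINGS_13D, CIK_TO_NAME))
```

Collecting the codes into a set first means that repeated filings against the same target produce a single line in the answer.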
These applications are fairly straightforward: the user inputs a few
variables, then we extract records from a master index file and provide a
clickable answer. The last step, which takes place when the user clicks on the
highlighted filing, provides the native FTP service to town.hall.org in
Washington, D.C. Since the Web is a stateless connection {Berners-Lee}, the NYU
server is freed in the final phase of filing retrieval.
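The lookup cycle described above (filter a master index, then return clickable links that point at the FTP archive) might look like the following. The host name town.hall.org comes from the text. The pipe-delimited record layout, the sample records, and the paths are assumptions made for illustration.

```python
import html
from urllib.parse import quote

FTP_HOST = "town.hall.org"  # named in the text; the path layout is assumed

# Assumed master-index layout: company|form type|date|path (invented records).
INDEX_LINES = [
    "XYZ CORP|10-K|1994-03-31|edgar/data/1001/a.txt",
    "XYZ CORP|10-Q|1994-06-30|edgar/data/1001/b.txt",
    "ABC FUND|485BPOS|1994-05-02|edgar/data/1002/c.txt",
]


def search_index(lines, company=None, form_type=None):
    """Return index records matching the user's inputs (None means any)."""
    results = []
    for line in lines:
        name, form, date, path = line.split("|")
        if company and company.lower() not in name.lower():
            continue
        if form_type and form_type.upper() != form.upper():
            continue
        results.append({"company": name, "form": form, "date": date, "path": path})
    return results


def as_link(record) -> str:
    """Render one record as a clickable link to the filing on the FTP host."""
    label = html.escape(f'{record["company"]} {record["form"]} ({record["date"]})')
    return f'<a href="ftp://{FTP_HOST}/{quote(record["path"])}">{label}</a>'


for hit in search_index(INDEX_LINES, company="xyz", form_type="10-K"):
    print(as_link(hit))
```

Because the answer page carries only links, the front-end server does no work when the user finally retrieves the filing; the transfer is between the client and the FTP host.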
The Web is conducive to the cycle of resource discovery, mutual server
linking, and consequently increased traffic at each linked site. EDGAR sits at
the intersection of law, finance, and economics and thus enjoys much visibility
in the academic server community. For example, EDGAR provides reciprocal links
to the Carnegie Mellon financial server; FINWeb, a financial economics WWW
server; the Indiana University School of Law; and the University of Michigan
Economics Server, as well as a host of commercial enterprises.
Current Work
Indexing schemes are a major focus of current work. The IMS
site is experimenting with the commercial WAIS engine and plans to fully index
all of the SGML header tags of each filing. This will provide the capability to
perform boolean queries on data items such as the company address, CIK code,
filing as-of date, etc.
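To make the idea of boolean queries over header tags concrete, the sketch below parses header lines of the form <TAG>value and tests AND, OR, and NOT conditions against the result. It does not use WAIS, and the tag names in the sample are hypothetical stand-ins rather than the actual EDGAR header vocabulary.

```python
import re

# Matches "<TAG>value" lines; bare container tags and closing tags are skipped.
TAG_LINE = re.compile(r"^\s*<([A-Z0-9-]+)>\s*(.*\S)\s*$")


def parse_header(text: str) -> dict:
    """Collect the first value seen for each tag in a header file."""
    fields = {}
    for line in text.splitlines():
        match = TAG_LINE.match(line)
        if match:
            fields.setdefault(match.group(1), match.group(2))
    return fields


def matches(fields, all_of=(), any_of=(), none_of=()) -> bool:
    """Boolean query: each condition is a (tag, value) pair."""

    def holds(condition):
        tag, value = condition
        return fields.get(tag, "").lower() == value.lower()

    return (
        all(holds(c) for c in all_of)
        and (not any_of or any(holds(c) for c in any_of))
        and not any(holds(c) for c in none_of)
    )


# Hypothetical header; tag names and values are invented for the example.
header = "<FILER>\n<CIK>0000012345\n<STATE>NY\n<FORM-TYPE>10-K\n</FILER>\n"
fields = parse_header(header)
print(matches(fields, all_of=[("STATE", "NY")], any_of=[("FORM-TYPE", "10-K"), ("FORM-TYPE", "10-Q")]))
```

A full index would precompute these dictionaries for every header file so that a query scans small records rather than the filings themselves.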
However, since WAIS indices typically have approximately a 1:1 space
requirement correspondence with the text they are indexing, the search is on to
help the user in a more efficient manner. The NYU team, for example, is building
a Standard Industrial Classification (SIC) database to permit intrasector analyses (e.g.
extracting the key financial ratios for every aerospace firm). There is also
work to identify the controlling interests behind each mutual fund in order to
correlate parent firm performance with that of the fund.
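An intrasector analysis of the kind mentioned above is essentially a group-by on the SIC code. The sketch below groups companies by code and averages one ratio per sector. The company names, the codes, the choice of ratio, and the values are all placeholders invented for illustration; none are drawn from EDGAR.

```python
from collections import defaultdict
from statistics import mean

# Invented records; SIC codes and ratio values are placeholders.
COMPANIES = [
    {"name": "Aero One", "sic": 3721, "current_ratio": 1.5},
    {"name": "Aero Two", "sic": 3721, "current_ratio": 2.5},
    {"name": "Bank One", "sic": 6021, "current_ratio": 1.0},
]


def ratios_by_sector(companies) -> dict:
    """Group (name, ratio) pairs under each SIC code."""
    sectors = defaultdict(list)
    for company in companies:
        sectors[company["sic"]].append((company["name"], company["current_ratio"]))
    return dict(sectors)


def sector_average(companies, sic):
    """Mean ratio for one sector, or None if the sector has no companies."""
    values = [ratio for _, ratio in ratios_by_sector(companies).get(sic, [])]
    return mean(values) if values else None


print(ratios_by_sector(COMPANIES)[3721])
print(sector_average(COMPANIES, 3721))
```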
We are also tabulating the heavy responses we have received from the online
usage questionnaire. Many of the problems users face, as has been noted, stem
from their access providers; others encounter difficulties from idiosyncratic
web clients. Whenever possible, we attempt to modify the server's behavior
suitably in such events (for example, a client passes an unexpected token in a
form response).
Future Directions
Here are some interesting issues that are being worked
on in both the private and public sector.
- Officer Migration Patterns. As archiving of the EDGAR data store
continues, one can start to ask interesting questions that would rely on
``historical'' EDGAR data (recall that we have no data before January 1,
1994). Of particular interest to Management scholars is the 'flow' of officers
from one publicly traded company to another. Such a database might be used to
correlate resignations and hirings with firm performance.
- Artificial Intelligence Applications. There are some very hard questions
that users would like an EDGAR front-end to answer. For example, how do we
answer the hypothetical question 'How many boards of directors does John Q.
Public serve on?' Since we do not have social security numbers or other unique
identifiers in the body of forms, a programmatic approach needs to be quite
clever in attempting to match names.
- Modification of EDGAR User Software. Customization of user software for
EDGAR use is clearly desirable. For example, it could do 'smart parsing'
following a local cache of the document (as opposed to having the server do
it, which might cause serious delays). Typical tasks could be optimized
locally, especially when the user needs an unusual combination of subdocuments
for one or more companies. Customization can take place at the level of the
Web browser or even at the level of FTP transfer from the IMS - there exist
automated routines to do 'unmanned' FTP transfers.
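The name-matching difficulty raised under Artificial Intelligence Applications can be illustrated with a deliberately coarse approach: reduce each name to a key of first initial and last name, then report every board on which some director shares that key. This is a sketch, not a solution. The key, the suffix list, and the sample boards are assumptions, and the method yields candidates that still need review, since distinct people can share a key.

```python
import re

SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}


def name_key(raw: str):
    """Reduce a name to (first initial, last name), or None if too short."""
    text = raw.lower()
    if "," in text:  # "Public, John Q." becomes "John Q. Public"
        last, _, rest = text.partition(",")
        text = f"{rest} {last}"
    tokens = [t for t in re.findall(r"[a-z]+", text) if t not in SUFFIXES]
    if len(tokens) < 2:
        return None
    return (tokens[0][0], tokens[-1])


def boards_for(person: str, boards: dict) -> list:
    """boards maps company -> list of director names; returns candidates."""
    target = name_key(person)
    if target is None:
        return []
    return sorted(
        company
        for company, directors in boards.items()
        if any(name_key(d) == target for d in directors)
    )


# Invented boards for illustration only.
BOARDS = {
    "Alpha Corp": ["John Q. Public", "Mary Roe"],
    "Beta Inc": ["Public, John Q., Jr."],
    "Gamma Ltd": ["Richard Doe"],
}
print(boards_for("John Q. Public", BOARDS))
```

The second sample entry shows the weakness: the key cannot tell a father from a son of the same name, which is exactly why a unique identifier would be needed for a reliable answer.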
Concluding Comments
As the EDGAR Data Archive grows in size and
complexity, so do the challenges inherent in serving a diverse community of
Internet users. We must monitor emerging standards, protocols, platforms, and
clients and continue to work with the various user communities to adapt the
service to their needs.
References
Berners-Lee, T. and Cailliau, R. and Luotonen, A. and
Nielsen, H. and A. Secret, The World-Wide Web, Communications of the ACM, 1994,
37, 76-82, August.
Bowne and Co., Appeal Securities Act Handbook, Bowne and Co., 1993.
Chi, M. T. and Feltovich, P. J. and R. Glaser, Categorization and
representation of physics problems by experts and novices, Cognitive Science,
1979, 5, 121-132.
Claffy, K. C. and Polyzos, G.C. and H.-W. Braun, Traffic characteristics of
the T1 NSFNET backbone, UCSD, 1992, CS92-252.
Claffy, K. C. and Polyzos, G.C. and H.-W. Braun, Tracking long-term growth of
the NSFNET, Communications of the ACM, 1994, 23, 35-45.
MacKie-Mason, J. and H. Varian, Some economics of the Internet, University of
Michigan, Ann Arbor, Michigan, 1993.
MacKie-Mason, J. and H. Varian, Pricing the Internet, B. Kahin and J. Keller
(eds.), Prentice-Hall, 1994.
Newell, A. and H. Simon, Human Problem Solving, Prentice-Hall, 1972.
RR Donnelley and Sons, The EDGAR Handbook, Chicago, IL, 1994.
Salton, G. and Allan, J. and C. Buckley, Automatic structuring and retrieval
of large text files, Communications of the ACM, 1994, 37, 97-108, February.
Securities and Exchange Commission, A User's Guide to the Facilities of the
Public Reference Room, Washington, D.C., SEC Commission Office of Filings,
Information, and Consumer Services, 1991.
Securities and Exchange Commission, The SEC Edgar Filer Manual Version 3.5,
Washington, D.C., SEC, 1994.
Simon, H. and Newell, A. and J.C. Shaw, The processes of creative thinking,
H.E. Gruber, G. Terrell, and M. Wertheimer (eds.), 63-119, Lieber-Atherton,
Inc., 1962.
About the Authors
Mark Ginsburg
Mark Ginsburg is a doctoral student in the Information
Systems Department, Stern School of Business, New York University. He has a B.A.
from Princeton University, an M.A. from Columbia University, and was a Stern
Scholar in the Statistics and Operations Research Department en route to earning
an M.B.A. at NYU. He is responsible for the daily operation of NYU's EDGAR web
server and is interested in the following Internet issues: evolution of
standards, collaborative software, and the economics of interoperability (or
lack thereof).
Ajit Kambil
Professor Kambil is an Assistant Professor of Information
Systems at the Stern School of Business, New York University. He earned his
undergraduate and PhD degrees at MIT. His research is centered on three
inter-related areas: information technology and the transformation of business
strategy, organizations and networks; aligning information technology and
business strategies; and communications networks (design, use and policies).
At NYU Prof. Kambil teaches courses that introduce MBAs to the management of
information systems and undergraduate students to telecommunications systems.
Alan B. Eisner
Alan B. Eisner is a doctoral student in the Management
Department, Stern School of Business, New York University. He earned a B.S. in
Operations Research and Industrial Engineering in 1989, and an M.Eng. in
Engineering Management in 1992, both from Cornell University. His primary
research interests are technology strategies and organizational learning.