VIRTUAL SHAREWARE LIBRARY - A WWW BASED SYSTEM FOR CATALOGUING SOFTWARE ON THE INTERNET

dr. Ziga Turk, University of Ljubljana, FGG, Jamova 2, Ljubljana, Slovenia.
ziga.turk@fagg.uni-lj.si , http://www.fagg.uni-lj.si/~zturk/

Keywords: software, shareware, archives, ftp, archie, resource discovery, WWW

Problem statement

Judging by Internet traffic statistics and the share attributed to FTP, downloading software and other file-oriented information remains one of the dominant uses of the Internet. With the exception of Archie, however, little has been done to simplify the search for and retrieval of files. Many users still browse FTP archives manually to pick things up. The basic problem with Archie is that users need to know more or less the exact name of the file containing the required program before they can locate it on the Net and download it. Other techniques, not integrated with Archie, are needed to find the name of the file. Several exist, but none is specialized for software in general, so these searches return a large number of hits, yet few of them point to file names.

With the invention of the World Wide Web, a new breed of users is beginning to use the Internet, and they expect the WWW to provide a user-friendly interface to other, historically older but more difficult to use services. What these users need in relation to software available on the Net is a service that can be queried by keywords and descriptions of the software available on-line and that returns a list of files and a selection of sites which store each file.

In this report we present the implemented solution, which currently replies to some 400,000 requests per month, and use this forum to discuss future developments that need broader cooperation of the authors of the software distributed on the Net and of the service providers.

Solution: The Virtual Shareware Library

As we have found out, hard data (descriptions) on some 100,000 pieces of software are available on the Internet. This software is stored in so-called software archives; the best known include SimTel, Cica, Hobbes and SunSite. An archive is a library which stores not only files but also their descriptions. Typically, archives are mirrored to reduce the load on the original site. We used that data to create the Virtual Shareware Library (VSL) at http://www.fagg.uni-lj.si/SHASE/, a catalogue which includes the large majority of described files available on the Internet. The associated search and delivery engine (SHASE) uses the WWW to provide a user-friendly interface. The structure of the database is shown in Fig. 1.

Fig. 1: VSL schema.

The VSL collects the data by mirroring selected index files of the archives and translating them to a neutral format. The database is then mirrored to several "front desks", all carrying an exact copy of the database and the same search script. A user visits such a front desk and fills in a form like the one in Fig. 2. As a result, a list of files matching the criteria is presented, each with pointers to that file on all mirror sites. An interface to Archie is provided as well. Finally, the user picks a site and downloads the file.
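The collect, translate and search steps described above can be illustrated with a minimal sketch. It is not the SHASE implementation: the index line layout (`path|YYYY-MM-DD|description`), the `Record` fields, the `MIRRORS` table and the example host names are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Record:
    """Neutral-format entry for one described file (hypothetical fields)."""
    archive: str       # short archive name, e.g. "cica"
    path: str          # file path relative to the archive root
    description: str   # free-text description taken from the index
    posted: date       # date the file was posted to the archive


# Hypothetical mirror table: archive name -> base URLs of sites carrying it.
MIRRORS = {
    "cica": ["ftp://mirror-a.example/pub/cica/", "ftp://mirror-b.example/cica/"],
}


def parse_index_line(archive, line):
    """Translate one line of an assumed 'path|YYYY-MM-DD|description' index."""
    path, posted, description = line.strip().split("|", 2)
    year, month, day = (int(part) for part in posted.split("-"))
    return Record(archive, path, description, date(year, month, day))


def search(records, keywords, since=None, archives=None):
    """Return records whose path or description contains every keyword."""
    words = [w.lower() for w in keywords]
    hits = []
    for rec in records:
        if archives is not None and rec.archive not in archives:
            continue
        if since is not None and rec.posted < since:
            continue
        text = (rec.path + " " + rec.description).lower()
        if all(w in text for w in words):
            hits.append(rec)
    return hits


def mirror_urls(record):
    """List a pointer to the file on every known mirror of its archive."""
    return [base + record.path for base in MIRRORS.get(record.archive, [])]
```

A query such as `search(records, ["html", "editor"], since=date(1994, 1, 1), archives={"cica"})` corresponds roughly to the form in Fig. 2; calling `mirror_urls` on each hit then yields the list of sites from which the user picks one.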

The VSL has been operating since spring 1994. It is quite popular and serves some 400,000 requests a day. It supports 20 archives which total around 11 gigabytes. In addition, the VSL provides interesting insights into the archives, such as statistics and age and size histograms. At the time of writing (mid Feb. 1994), two European front desks and one US front desk have opened.

Fig. 2: Search form filled in to search for HTML editors posted since Jan 1st 1994 in the Cica, Sim-Win and Microsoft archives.

The Future Enhancements of the VSL

Further enhancements of the VSL, most of them due by the end of March, include:

Support for platform-based searches: users will be able to search for files by platform (DOS, Windows, UNIX etc.) rather than by archive.
Support for batch downloads: a user will be able to select multiple files from the list of those found and (1) batch download the files to the client machine, (2) order the files to be delivered to a nearby host or (3) have the files delivered by e-mail to his or her address. The first option is planned as a special viewer which takes a list of ftp commands as a parameter. An extension to the URL syntax which would enable batch downloads would be most welcome.
Support for unarchived files: the VSL database will be matched against an Archie database. Files missing from the VSL will be used as queries into the Veronica database, from which some keywords on those files will be extracted.
Diff-based mirror site updates: this will reduce the network load of keeping the mirror front desks up to date to a few kilobytes a day.
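The idea behind diff-based updates can be sketched as follows, assuming (hypothetically) that the database is a mapping from file path to description. Only the entries that changed between two snapshots are shipped to the front desks, not the whole database. The function names and the delta layout are illustrative and not part of SHASE.

```python
def make_delta(old, new):
    """Compute what a front desk needs in order to turn `old` into `new`.

    Both arguments are dicts mapping a file path to its description.
    """
    return {
        # Paths that disappeared from the master database.
        "removed": sorted(path for path in old if path not in new),
        # Paths that are new or whose description changed.
        "upserted": {path: desc for path, desc in new.items()
                     if old.get(path) != desc},
    }


def apply_delta(db, delta):
    """Return a copy of `db` brought up to date with `delta`."""
    updated = dict(db)
    for path in delta["removed"]:
        updated.pop(path, None)
    updated.update(delta["upserted"])
    return updated
```

When only a few files change per day, the delta holds just those entries, which is what keeps the daily transfer small compared with re-mirroring the full database.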

The Future of Software Archives

Having statistical data on software archives available, we have found that for PC-related platforms which, until recently, were not networked (DOS, Windows, Mac, Amiga, OS/2), the SHASE database provides a very representative selection of the software available on-line. Nearly everything is catalogued. On the other hand, for platforms where the majority of machines have been networked, the archives are not as representative; quite a few programs are out there that are not properly archived but stored only on the author's machine. With the growing availability of the Internet, we fear that the great capital that has accumulated in the archives will start to dissolve into numerous tiny sites, each offering the works of the host owner, and a large number of WWW-based collections pointing to some of them. Searching for those files will not be as flexible and fast as for the archived files. We shall therefore extend SHASE with two additional databases:

Managers of small archives will be able to register an archive with SHASE automatically. They will need to provide an index and an archive description file in a defined, unified format. The solution will be appropriate for archives of 50 or more files.
Authors of individual programs will be allowed to register their programs directly with the VSL, providing a pointer to the file and to the file's WWW page. Coordination with services offering similar capabilities, such as Aliweb, is needed.
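An automatic archive registration could be checked along the following lines. The unified format has not been defined yet, so everything here is hypothetical: the `key: value` layout of the description file, the `path|description` layout of the index, the required fields `name` and `url`, and the decision to treat the 50-file figure as a hard minimum.

```python
MIN_FILES = 50  # size mentioned in the text; using it as a hard limit is an assumption


def parse_registration(description_text, index_text):
    """Parse an assumed 'key: value' description file and 'path|description' index."""
    meta = {}
    for line in description_text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)  # split once so URLs keep their colons
            meta[key.strip().lower()] = value.strip()
    entries = []
    for line in index_text.splitlines():
        if "|" in line:
            path, desc = line.split("|", 1)
            entries.append((path.strip(), desc.strip()))
    return meta, entries


def check_registration(meta, entries):
    """Return a list of problems; an empty list means the archive is accepted."""
    problems = []
    for field in ("name", "url"):  # hypothetical required fields
        if not meta.get(field):
            problems.append(f"missing field: {field}")
    if len(entries) < MIN_FILES:
        problems.append(f"only {len(entries)} files; at least {MIN_FILES} expected")
    return problems
```

Archives that fall below the threshold would presumably be better served by the per-program registration, where each author supplies a pointer to a single file and its WWW page.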

Conclusions

Demonstrating the flexibility of the WWW, the SHASE engine allows archived files to be searched more comfortably than ever. However, more and more files are appearing which are not archived but do have a WWW page or appear in a WWW subject index. For those, a registry system is needed, which should result from a coordinated effort.