dr. Ziga Turk, University of Ljubljana, FGG, Jamova 2, Ljubljana,
Slovenia.
ziga.turk@fagg.uni-lj.si, http://www.fagg.uni-lj.si/~zturk/
Judging by Internet traffic statistics and the share of FTP in them, downloading software and other file-oriented information remains one of the dominant uses of the Internet. With the exception of Archie, however, little has been done to simplify and facilitate the search for and retrieval of files. Many users still browse FTP archives manually to pick things up. The basic problem with Archie is that users need to know more or less the exact name of the file containing the required program before they attempt to locate it on the Net and download it. Other techniques, not integrated with Archie, are needed to find the name of the file. There are several, but none specializes in software in general, so the number of hits in those searches is high, yet few of them point to file names.
With the invention of the World Wide Web, a new breed of users is beginning to use the Internet, and they expect the WWW to provide a user-friendly interface to other, historically older but more difficult to use services. What these users need in relation to software available on the Net is a service which can be queried by keywords and descriptions of the software available on-line, and which returns a list of files and a selection of sites which store each file.
In this report we present the implemented solution, which currently replies to some 400.000 requests per month, and use this forum to discuss future developments that need broader cooperation of the authors of the software distributed on the Net and of the service providers.
As we have found out, hard data (descriptions) on some 100000 pieces of software are available on the Internet. This software is archived in so-called software archives; the best known are SimTel, Cica, Hobbes and SunSite. An archive is a library which stores not only files but also their descriptions. Typically, archives are mirrored to reduce the load on the original site. We used that data to create the Virtual Shareware Library (VSL) at http://www.fagg.uni-lj.si/SHASE/, a catalogue which includes a large majority of the described files available on the Internet. The associated search and delivery engine (SHASE) uses the WWW for a user-friendly interface. The structure of the database is shown in Fig. 1.
Fig. 1: VSL schema.
The VSL collects the data by mirroring selected index files of the archives and translating them to a neutral format. The database is then mirrored to several "front desks", all carrying an exact copy of the database and the same search script. A user visits such a front desk and fills in a form like the one in Fig. 2. As a result, a list of files matching the criteria is presented, and each entry contains pointers to that file on all mirror sites. An interface to Archie is provided as well. Finally, the user picks a site and downloads the file.
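The translation step described above can be sketched roughly as follows. This is only an illustration: the pipe-separated index layout, the `FileRecord` fields and the mirror host names are hypothetical, since the actual index formats of the archives and the VSL neutral format are not specified in this report.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FileRecord:
    """One catalogue entry in an assumed 'neutral' format."""
    archive: str
    path: str
    size: int
    posted: date
    description: str


# Hypothetical mirror table: archive name -> base URLs of its mirrors.
MIRRORS = {
    "cica": [
        "ftp://mirror-a.example.org/pub/cica/",
        "ftp://mirror-b.example.org/mirrors/cica/",
    ],
}


def parse_index_line(archive, line):
    """Translate one index line (assumed: path|size|YYYY-MM-DD|description)."""
    path, size, posted, description = line.rstrip("\n").split("|", 3)
    year, month, day = (int(part) for part in posted.split("-"))
    return FileRecord(archive, path.strip(), int(size),
                      date(year, month, day), description.strip())


def mirror_urls(record):
    """Return a pointer to the file on every known mirror of its archive."""
    return [base + record.path for base in MIRRORS.get(record.archive, [])]


record = parse_index_line(
    "cica", "util/htmled.zip|123456|1994-01-15|HTML editor for Windows")
for url in mirror_urls(record):
    print(url)
```

Keeping the mirror list separate from the file records means that a new mirror site can be added without touching the catalogue itself.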
The VSL has been operating since spring 1994. It is quite popular and serves some 400000 requests a day. It supports 20 archives which total around 11 gigabytes. In addition, the VSL provides interesting insights into the archives, such as archive statistics and age and size histograms. At the time of writing (mid Feb. 1994), two European front desks and one US front desk have opened.
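A size histogram of the kind mentioned above could be computed along these lines. The bucket width and the sample sizes are made up for illustration and do not come from the VSL data.

```python
from collections import Counter


def size_histogram(sizes, bucket=100_000):
    """Count files per size bucket; keys are the lower bound of each bucket."""
    counts = Counter((size // bucket) * bucket for size in sizes)
    return dict(sorted(counts.items()))


# Made-up file sizes in bytes.
sample_sizes = [12_000, 95_000, 150_000, 420_000, 430_000]
print(size_histogram(sample_sizes))
```

The same bucketing works for an age histogram if posting dates are first converted to an age in days or months.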
Fig. 2: Search form filled-in to search for html editors posted
since Jan 1st 1994 in the Cica, Sim-Win and Microsoft archives.
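A query like the one in Fig. 2 can be approximated by a simple filter over catalogue entries. The sketch below is not the SHASE search script: the `Entry` fields, the sample catalogue and the rule that all keywords must match are assumptions made for illustration.

```python
from datetime import date
from typing import NamedTuple


class Entry(NamedTuple):
    archive: str
    name: str
    posted: date
    description: str


def search(entries, keywords, since=None, archives=None):
    """Return entries that contain every keyword (case-insensitive) in their
    name or description, optionally limited by posting date and archive."""
    wanted = [keyword.lower() for keyword in keywords]
    hits = []
    for entry in entries:
        if archives is not None and entry.archive not in archives:
            continue
        if since is not None and entry.posted < since:
            continue
        text = (entry.name + " " + entry.description).lower()
        if all(keyword in text for keyword in wanted):
            hits.append(entry)
    return hits


# Made-up catalogue entries.
CATALOGUE = [
    Entry("cica", "htmled.zip", date(1994, 2, 1), "HTML editor for Windows"),
    Entry("cica", "oldhtml.zip", date(1993, 11, 5), "HTML editor, older release"),
    Entry("hobbes", "os2html.zip", date(1994, 3, 1), "HTML editor for OS/2"),
    Entry("sim-win", "wincalc.zip", date(1994, 1, 20), "Desktop calculator"),
]

hits = search(CATALOGUE, ["html", "editor"],
              since=date(1994, 1, 1),
              archives={"cica", "sim-win", "microsoft"})
for hit in hits:
    print(hit.archive, hit.name, hit.description)
```

A linear scan is enough for a sketch; a production catalogue of this size would more likely rely on a keyword index.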
Further enhancements of the VSL, most of them due by the end of March, include:
Having the statistical data on software archives available, we have found that for PC-related platforms which, until recently, were not networked (DOS, Windows, Mac, Amiga, OS/2), the SHASE database provides a very representative selection of the software available on-line. Nearly everything is catalogued. On the other hand, for platforms whose machines have mostly been networked, the archives are not as representative; quite a few programs are out there which are not properly archived but stored only on the author's machine. With the growing availability of the Internet, we fear that the great capital that has accumulated in the archives will start to dissolve into numerous tiny sites, each offering the works of the host owner, and a large number of WWW-based collections pointing to some of them. Searching for those files will not be as flexible and fast as for the archived files. We shall therefore extend SHASE with two additional databases:
Demonstrating the flexibility of the WWW, archived files can be searched more comfortably than ever using the SHASE engine. However, more and more files are appearing which are not archived but do have a WWW page or appear in a WWW subject index. For those, a registry system is needed, which should result from a coordinated effort.