WWW5 Fifth International World Wide Web Conference
May 6-10, 1996, Paris, France


Weblint: Quality Assurance for the World-Wide Web

Neil Bowers
Khoral Research, Inc.
neilb@khoral.com

Abstract

More and more people are creating World Wide Web (WWW) pages. This explosive growth of the WWW is made easier through a widening variety of web browsers, each implementing its own interpretation of HTML. WWW search engines which extract information from pages based on their structure have multiplied just as rapidly. All these things mean it is increasingly important that web pages are checked for legal syntax and for additional problems, such as portability across browsers. This paper develops an initial taxonomy of HTML problems and describes weblint, a tool which can be used to identify a number of these problems.


Introduction

The last two years have seen an explosive growth in the WWW. Everyone and their cat has a home page, with additional random pages providing information they imagine others want to read. Developers of WWW pages have no way of knowing (modulo CGI scripts) what browsers will be pointed at their pages, so some way to provide assurance that pages will work for everyone is necessary. Additional problems for WWW page builders exist as well: any time humans create things with a specific notation, a tool to catch mistakes is needed. Validation tools can check HTML pages and identify problems for the developer [Sanders 94].

A year and a half ago my employer started to provide information using the web. We are a small company, and couldn't justify a full-time webmaster, so a number of people were creating pages, with differing levels of HTML experience. Most of our pages were, and still are, created by hand, with the attendant potential for human error. Being a fan of lint (which is used to perform static analysis of C code [Darwin 91]), I saw the regular structure and simple syntax (or so I thought at the time) of HTML as an opportunity for an automatic checker. Enter weblint, stage left.

A quick tour of this paper

A definition of validation is developed, along with a simple taxonomy of HTML errors. This is followed by a description of weblint, which covers philosophy, design, and the categories of HTML validation performed, according to the taxonomy presented in the previous section. A summary of other validation tools is given, with comparison to weblint. Finally, the conclusion presents a summary of weblint, the contents of this paper, and future plans related to weblint and HTML validation.


Validation

Before describing tools for validation of HTML, we need a definition and scope for validation. In the strictest sense, a given page is valid if it conforms to a specific Document Type Definition (DTD) for HTML. Since there is no single definition of what constitutes HTML, strict validation must refer to a specific DTD, such as the definition for HTML 2 [HTML 2]. There are a number of services which provide strict validation, the best known being the WebTechs HTML Validation Service [WebTechs 96].

In developing weblint, I was more interested in a tool which would provide some level of assurance to developers of web pages that their pages provide the intended content to the reader. Rather than proving something is formally correct, I wanted a tool that asked: does it get the job done?

In this section a taxonomy of HTML problems is developed. The (potential) problems identified by Weblint can be classified according to the categories described below.

Syntax Problems

These are problems related to incorrect use of HTML tags or elements. There are a number of common types of syntax error.

The simplest syntactic problem is illegal elements. This might be the result of using an element which is not part of the DTD you're writing to, a typo (for example, <BLOKCQUOTE>), or the result of including literal text with something which looks like markup, such as <laugh>. The same problems can occur for element attributes.
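
The check for illegal elements can be sketched in a few lines of Python. This is only an illustration, not weblint's Perl implementation: the function name unknown_elements and the tiny KNOWN_ELEMENTS set are assumptions made for the example, and a real checker would take the list of legal elements from the DTD being targeted.

```python
import re

# Deliberately tiny, illustrative subset of element names (an assumption for
# this sketch); a real checker would derive the set from the chosen DTD.
KNOWN_ELEMENTS = {"HTML", "HEAD", "TITLE", "BODY", "H1", "H2", "H3",
                  "P", "A", "B", "STRONG", "BLOCKQUOTE", "LI", "UL"}

# Matches the element name in an opening or closing tag: <P>, </P>, <A HREF="x">.
TAG_NAME = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)")

def unknown_elements(html):
    """Return (line number, element name) pairs for unrecognised elements."""
    problems = []
    for lineno, line in enumerate(html.splitlines(), start=1):
        for match in TAG_NAME.finditer(line):
            name = match.group(1).upper()  # element names are case-insensitive
            if name not in KNOWN_ELEMENTS:
                problems.append((lineno, name))
    return problems

sample = "<P>Some text\n<BLOKCQUOTE>quoted</BLOKCQUOTE>\n<P>Ha ha <laugh>"
for lineno, name in unknown_elements(sample):
    print(f"line {lineno}: unknown element <{name}>")
```

Both the typo and the literal text which merely looks like markup are reported, since the sketch cannot tell them apart; the same lookup approach would apply to attribute names.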

Another syntactic problem is unclosed container errors, which can have a number of causes:

Other common syntax problems are:

Lexical Problems

These are problems related to use of character sets, and formatting-related problems. These problems often show up in browsers which use ad-hoc parsing techniques, rather than mechanisms driven from a DTD.

There are some places where whitespace is significant, and will change how your page is rendered by some browsers. For example, leading whitespace in a list element can result in unevenly aligned text, as shown below:

    <LI>First item
    <LI> Second item, won't be lined up with 1st and 3rd in all browsers
    <LI>Third item
Leading and trailing whitespace in anchor elements can result in hanging underscores, which looks goofy when the only content of the anchor is an image.
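
A check for stray whitespace in anchors might look like the following Python sketch. It is an assumption-laden illustration rather than weblint's own code: the function name is hypothetical, and the regular expression assumes anchors are not nested and that no attribute value contains a `>` character.

```python
import re

# Content of an anchor element: everything between <A ...> and </A>.
# \b stops the pattern from matching other elements such as <ADDRESS>.
ANCHOR = re.compile(r"<A\b[^>]*>(.*?)</A>", re.IGNORECASE | re.DOTALL)

def anchors_with_stray_whitespace(html):
    """Return the content of each anchor that starts or ends with whitespace."""
    flagged = []
    for match in ANCHOR.finditer(html):
        content = match.group(1)
        if content and content != content.strip():
            flagged.append(content)
    return flagged

page = ('<A HREF="a.html"> <IMG SRC="a.gif"> </A> and '
        '<A HREF="b.html">fine</A>')
print(anchors_with_stray_whitespace(page))
```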

HTML Usage Problems

Some aspects of HTML are optional, but can be useful to check for, such as use of the HTML, HEAD, TITLE and BODY elements.

Abuse of header elements is extremely common, the most frequent misdemeanor being use of H5 and H6 to produce small sized text, usually for things like copyright statements. This causes problems if you try and automatically process the HTML for translation to a different notation, or to generate a table of contents. For similar reasons it is a good idea to have heading levels increasing by no more than one, i.e., an <H3> should not follow an <H1>.
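
The rule that heading levels should increase by no more than one is easy to state as code. The Python below is a minimal sketch under the assumption that headings can be found with a simple pattern match; the function name heading_jumps is made up for the example.

```python
import re

# Opening heading tags <H1> .. <H6>; closing tags and <HR> do not match.
HEADING = re.compile(r"<H([1-6])\b", re.IGNORECASE)

def heading_jumps(html):
    """Return (previous level, new level) pairs where a level was skipped."""
    jumps = []
    previous = None
    for match in HEADING.finditer(html):
        level = int(match.group(1))
        # Going deeper by more than one level is suspicious; going back up
        # by any amount (for example H4 followed by H2) is fine.
        if previous is not None and level > previous + 1:
            jumps.append((previous, level))
        previous = level
    return jumps

page = "<H1>Title</H1><H3>Oops</H3><H4>Detail</H4><H2>Next</H2>"
print(heading_jumps(page))
```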

If you subscribe to the notion that HTML is intended to describe content and not presentation, then you should use logical rather than physical font elements. For example, using <STRONG> and not <B>.

Structural Integrity Problems

A single web page is not a stand-alone document, but almost certainly part of some larger document, or infostructure. This usually holds true at a number of scales. For example, this section of the paper exists as a single page, which is part of the overall document for this paper, which is itself a component of the weblint pages I maintain. The weblint pages are a subset of all the pages I maintain, which are a subset of the pages on our server. There are links across pages at all of these levels. Furthermore, there are links from these pages to URLs on other servers, and there are external documents referencing my pages.

The most common problem is dead links, where the target of an anchor no longer exists, because the page has been moved, the document has been restructured, the server doesn't exist any more, or the author has moved on. The problem might also be that the URL was mistyped.

Another user unfriendly problem is that of limbo pages, where there are no upward links, for example, to the parent document or site. Pages are often reached by way of a search engine, and it is frustrating to have to guess what the parent URL is. From a web maintenance standpoint, it is also useful to know if any pages are not referenced by any other page.
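
Given a table of which local pages link to which, both dead links and unreferenced pages fall out of simple set operations. The Python sketch below assumes such a table has already been built (for example by a robot walking the site); the function name and the sample file names are hypothetical.

```python
def check_link_structure(links):
    """links maps each existing page to the set of local pages it links to.

    Returns (dead, unreferenced): dead is a sorted list of (source, target)
    pairs whose target is not an existing page; unreferenced is a sorted
    list of existing pages which no other page links to.
    """
    pages = set(links)
    dead = sorted((src, dst)
                  for src, targets in links.items()
                  for dst in targets
                  if dst not in pages)
    # A page linking to itself does not count as being referenced.
    referenced = {dst
                  for src, targets in links.items()
                  for dst in targets
                  if dst != src}
    unreferenced = sorted(pages - referenced)
    return dead, unreferenced

site = {
    "index.html": {"paper.html", "tools.html"},
    "paper.html": {"index.html", "old-draft.html"},
    "tools.html": {"index.html"},
    "notes.html": {"index.html"},
}
dead, unreferenced = check_link_structure(site)
print("dead links:", dead)
print("unreferenced pages:", unreferenced)
```

Links to other servers are outside the scope of this sketch, since checking them means fetching the remote URL.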

Portability Problems

There are some aspects of HTML which, while they are legal, cause problems on enough browsers to warrant avoiding them.

SGML allows for attribute values to be quoted using single (') or double (") quotation marks. Very few of the most popular browsers support use of single quotations, so attribute values should always be quoted with double quotations, and the &quot; entity used to represent double quotation marks within attribute values.
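
Spotting single-quoted attribute values can be approximated with regular expressions, as in the Python sketch below. The function name is an assumption, and the approach is deliberately simple: it would be confused by a `>` character inside an attribute value.

```python
import re

# A tag: "<", a letter, then anything up to the next ">".
TAG = re.compile(r"<[A-Za-z][^>]*>")
# An attribute whose value is wrapped in single quotation marks.
SINGLE_QUOTED = re.compile(r"([A-Za-z][-A-Za-z0-9]*)\s*=\s*'[^']*'")

def single_quoted_attributes(html):
    """Return the names of attributes whose values use single quotes."""
    names = []
    for tag in TAG.finditer(html):
        # Blank out double-quoted values first, so that an apostrophe inside
        # one (as in ALT="Neil's cat") is not mistaken for a delimiter.
        stripped = re.sub(r'"[^"]*"', '""', tag.group(0))
        names.extend(m.group(1) for m in SINGLE_QUOTED.finditer(stripped))
    return names

page = "<A HREF='fun.html'>fun</A> <IMG SRC=\"cat.gif\" ALT=\"Neil's cat\">"
print(single_quoted_attributes(page))
```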

Very few browsers handle SGML comments correctly. A valid SGML comment has the following format:

     <!-- body of the comment -->
Although it is perfectly legal to include valid markup within comments, this will confuse a large number of browsers, and should be avoided.
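
A rough way to find markup inside comments is sketched below in Python. It treats a comment as everything between `<!--` and the next `-->`, which is a simplification of the SGML rules, and the function name is hypothetical.

```python
import re

# Simplified comment: "<!--", the body, then the first following "-->".
COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)
# Anything inside the body that looks like an opening or closing tag.
MARKUP = re.compile(r"</?[A-Za-z][^>]*>")

def comments_containing_markup(html):
    """Return the body of each comment that contains something tag-like."""
    return [match.group(1).strip()
            for match in COMMENT.finditer(html)
            if MARKUP.search(match.group(1))]

page = "<!-- plain note --><P>text<!-- <B>old</B> text -->"
print(comments_containing_markup(page))
```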

There are a number of features which are inherited from SGML, such as tag minimization, which allows the following notation:

    <TITLE/Weblint home page/
None of the current crop of browsers support this, or other `esoteric' notations [Sosin 96].

Stylistic Problems

There is a wide range of `stylistic problems', most of which depend on your taste [Yahoo 96a]. Examples of this category:


Weblint

Weblint is an ad-hoc HTML validation program which is written in the Perl scripting language [Wall 91]. Perl was designed for processing and generating text, and has powerful regular expression capabilities, which made it the right tool for the job. Perl is also very portable, so weblint can be used under Unix, VMS, Windows, Mac, and other platforms.

The simplest way to use weblint is to provide one or more filenames on the command line. For example, if the file foobar.html contained the following:

    <HTML>
    <HEAD><TITLE>Sample HTML Page</TITLE></HEAD>
    <BODY>
       <H1>Sample HTML Page</H2>
       Click <A href="fun.html"><B>here</A></B> for fun!
    </BODY>
    </HTML>
then weblint can be used to check it as follows:
    % weblint foobar.html
    foobar.html(4): unmatched </H2> (no matching <H2> seen).
    foobar.html(5): bad form to use `here' as an anchor!
    foobar.html(5): </A> on line 5 seems to overlap <B>, opened on line 5.
    foobar.html(6): no closing </H1> seen for <H1> on line 4.
In addition to specifying files, you can provide URLs or directory names. In the latter case weblint will recurse in the directory, checking all HTML files found. This makes it easy to check a given set of pages.
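
Three of the four warnings in the example above concern how elements nest, and that kind of check can be sketched with a stack of open elements. The Python below is an illustration only and makes no claim to be weblint's actual algorithm: the function name is made up, every element is treated as a container, and the look-ahead used to tell an overlap from a missing closing tag is an assumption of this sketch. The anchor-text warning is a separate check and is not covered.

```python
import re

TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")

def check_nesting(html):
    """Return warnings about unmatched, overlapping and unclosed elements."""
    # Flatten the page into (line number, is closing tag, NAME) tokens.
    tokens = [(lineno, bool(slash), name.upper())
              for lineno, line in enumerate(html.splitlines(), start=1)
              for slash, name in TAG.findall(line)]
    warnings = []
    stack = []  # currently open elements, as (NAME, line where opened)
    for position, (lineno, is_closing, name) in enumerate(tokens):
        if not is_closing:
            stack.append((name, lineno))
            continue
        # Find the innermost open element with the same name, if any.
        depth = next((i for i in range(len(stack) - 1, -1, -1)
                      if stack[i][0] == name), None)
        if depth is None:
            warnings.append(f"{lineno}: unmatched </{name}> "
                            f"(no matching <{name}> seen).")
            continue
        # Elements opened inside this one and still open: if their closing
        # tag turns up later they overlap, otherwise they were never closed.
        closed_later = {n for _, closing, n in tokens[position + 1:] if closing}
        kept = []
        for other, opened in stack[depth + 1:]:
            if other in closed_later:
                warnings.append(f"{lineno}: </{name}> on line {lineno} seems "
                                f"to overlap <{other}>, opened on line {opened}.")
                kept.append((other, opened))
            else:
                warnings.append(f"{lineno}: no closing </{other}> seen "
                                f"for <{other}> on line {opened}.")
        stack[depth:] = kept
    last_line = len(html.splitlines())
    for name, opened in stack:
        warnings.append(f"{last_line}: no closing </{name}> seen "
                        f"for <{name}> on line {opened}.")
    return warnings

foobar = """<HTML>
<HEAD><TITLE>Sample HTML Page</TITLE></HEAD>
<BODY>
   <H1>Sample HTML Page</H2>
   Click <A href="fun.html"><B>here</A></B> for fun!
</BODY>
</HTML>"""
for warning in check_nesting(foobar):
    print(warning)
```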

Weblint can also be used via one of a number of form-based interfaces. The user can type in a URL, and request that weblint be used to check the referenced page. The first such interface was created by Clay Webster at Unipress [Unipress 95], but there are now several different interfaces available, listed on the Weblint home page [Bowers 96].

Weblint Design Issues

There have been a number of goals, criteria, and points of philosophy I've held while developing weblint:

Configuring Weblint

It is important that a validation tool be configurable, for a number of reasons:
  1. There is no single definition of what constitutes `HTML'.
  2. Not everyone is worried about the portability of their HTML pages.
  3. HTML developers have different definitions of correctness.
  4. Some people do like rigorous, anal retentive, warnings. Myself included.
The operation of weblint can be controlled via command-line switches, or the user can create a .weblintrc file in their home directory. The weblint distribution includes a sample configuration file, which mirrors the built-in defaults, and is thus a good starting point for someone wanting to tweak weblint's operation.

All weblint warnings have an associated identifier, and a flag which specifies whether the warning should be generated. For example, the warning for potentially overlapping elements in the example above has an identifier of element-overlap. If you are not interested in seeing this warning, you can run

    % weblint -d element-overlap foobar.html
Or disable the warning from your configuration file, with the line:
    disable element-overlap
Weblint includes the concept of HTML extensions, which define additional elements and attributes. Enabling the Netscape extensions means that weblint won't warn that CENTER is non-standard markup.

A number of configuration variables can also be set in the user's configuration file:

    set message-style   = lint
    set url-get         = lynx -source
    set directory-index = index.html, welcome.html
The first variable specifies that warnings should be generated in the style of traditional lint. The url-get variable provides a command which can be used to pull down pages specified with a URL. It is also used to pull the current todo list from my ftp server if you run weblint -todo. The last variable specifies the name(s) of a valid directory index file, which weblint checks for when recursing in directories.
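
To make the shape of the configuration file concrete, here is a small Python sketch of a reader for lines in the style shown above. It is not weblint's parser: the function name is hypothetical, the `enable` keyword is assumed to exist by analogy with `disable`, and anything beyond the `disable` and `set` lines shown in this paper is a guess.

```python
def parse_weblintrc(text, enabled=()):
    """Apply configuration lines to a starting set of enabled warnings.

    Returns (enabled, variables): the resulting set of warning identifiers
    and a dictionary of configuration variables.
    """
    enabled = set(enabled)
    variables = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        keyword, _, rest = line.partition(" ")
        rest = rest.strip()
        if keyword == "disable":
            enabled.discard(rest)
        elif keyword == "enable":  # assumed counterpart of "disable"
            enabled.add(rest)
        elif keyword == "set":
            # "set name = value"; whitespace around "=" is not significant.
            name, _, value = rest.partition("=")
            variables[name.strip()] = value.strip()
    return enabled, variables

config = """disable element-overlap
set message-style   = lint
set url-get         = lynx -source
set directory-index = index.html, welcome.html
"""
# "some-other-warning" is a placeholder identifier, not a real weblint one.
enabled, variables = parse_weblintrc(
    config, enabled={"element-overlap", "some-other-warning"})
print(sorted(enabled))
print(variables)
```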

Weblint Warnings

The version of weblint publicly available at the time of writing (1.013) supports 41 different warnings, in all the different problem categories identified above. The next version will support several more, and the todo list has a lot beyond that waiting for implementation.

Syntax Warnings
  • Unknown tags and attributes.
  • Unclosed container elements.
  • Required attributes.
  • Context checks.
  • Illegally nested container elements.
Lexical Warnings
  • Leading and trailing whitespace in certain container elements, such as anchors.
  • Not using entities where appropriate.
HTML Usage Warnings
  • Not using HTML, HEAD, and BODY elements.
  • Use of `here' as anchor text.
  • Not including a DOCTYPE.
  • Not including a LINK defining the page's author.
  • Use of obsolete markup, such as XMP.
  • Headings appearing in unexpected order.
Structural Integrity Warnings
This is the weakest area of weblint, though I haven't been hurrying to rectify that, since MOMspider and other packages do that pretty well.
  • No index file for a directory.
  • Target for local anchor does not exist (buggy).
Portability Warnings
  • Markup inside comments.
  • Use of single quotation marks for attribute values.
  • Not defining ALT text for IMaGes.
  • Use of Netscape-specific markup.
Stylistic Warnings
  • Tags in upper or lower case.
  • Use of `here' as anchor text.
  • Not specifying WIDTH and HEIGHT on IMaGes.
  • Use of physical, rather than logical, font markup.
  • Empty container elements. Empty paragraphs are often used to explicitly control spacing, for example.

Other Validation Tools

MOMspider is a robot which checks the structural integrity of a specified infostructure, such as the set of web pages you maintain [Fielding 94]. MOMspider fills a gap in the coverage provided by weblint.

Henry Churchyard's htmlchek is very similar to weblint, performing ad-hoc analysis of web pages with control over the warnings generated [Churchyard 95]. Both weblint and htmlchek have warnings not available in the other, so it is often worth running both. Something I am working to rectify :-).

The WebTechs HTML validator [WebTechs 96] provides true validation against a number of selectable DTDs, using the sgmls parser by James Clark [Clark 94]. Although this provides the last word in HTML conformance, a lot of the warning messages are hard to understand, particularly for those new to HTML. The Kinder, Gentler Validator is also based on sgmls, but provides a more user friendly description of any problems found [Oskoboiny 96].

Doctor HTML is a forms-based package which performs a number of checks, including basic document structure, spell checking, and use of FORM, TABLE, and IMG elements [Tongue 95]. Results are generated as tables, and while they are thorough, it is hard to find the real warnings.

There are a number of other tools which perform a subset of the validation activities identified above. There is an index of validation tools on the weblint home page, and one at Yahoo [Yahoo 96b].


Conclusions and Future Work

Weblint is a useful tool for web developers, and provides certain categories of validation which are not provided by other tools, particularly checks for constructs which are valid HTML but problematic for other reasons. Weblint is not fully comprehensive in its coverage, and is best used in combination with additional tools, such as htmlchek and the WebTechs validator. Weblint's coverage improves with every release, which happens approximately monthly.

The todo list for weblint grows faster than items are taken off. The following provides highlights of the things I'll be working on over the next year or so:


Acknowledgements

I would like to thank the Weblint Victims for all their help, in the form of pre-release testing, bug reports and fixes, suggestions and code for new features.

I would also like to thank Leslee Richards for proofreading this paper and making many useful suggestions. Any errors are my own.


References

[Bowers 96]
Neil Bowers, Weblint Home Page.
http://www.khoral.com/staff/neilb/weblint.html

[Churchyard 95]
Henry Churchyard, Htmlchek Home Page.
http://uts.cc.utexas.edu/~churchh/htmlchek.html

[Clark 94]
James Clark, sgmls SGML Parser.

[Darwin 91]
Ian F. Darwin, Checking C Programs with lint. O'Reilly & Associates, Inc. 1991.
http://www.ora.com/gnn/bus/ora/item/lint.html

[Fielding 94]
Roy T. Fielding, Maintaining Distributed Hypertext Infostructures: Welcome to MOMspider's Web. Proceedings of the First International Conference on the World-Wide Web. Geneva, May 1994.
http://www.ics.uci.edu/WebSoft/MOMspider/docs/www94_paper.ps

[HTML 2]
W3 Consortium, HyperText Markup Language (HTML).
http://www.w3.org/pub/WWW/MarkUp/

[Oskoboiny 96]
Gerald Oskoboiny, A Kinder, Gentler Validator.
http://ugweb.cs.ualberta.ca/~gerald/validate/

[Sosin 96]
Semyon Sosin, Dark Side of the HTML.
http://www.best.com/~sem/dark_side/

[Sanders 94]
Tony Sanders, Why Validate Your HTML.
http://www.earth.com/bad-style/why-validate.html

[Tongue 95]
Thomas Tongue and Imagiware, Doctor HTML.
http://imagiware.com/RxHTML.cgi

[Unipress 95]
Unipress, Inc., Weblint Interface.
http://www.unipress.com/weblint/

[Wall 91]
Larry Wall, Programming Perl. O'Reilly & Associates, Inc. 1991.
http://www.ora.com/gnn/bus/ora/item/pperl.html

[WebTechs 96]
The WebTechs HTML Validation Service.
http://www.webtechs.com/html-val-svc/

[Yahoo 96a]
Yahoo Web Archive, HTML Guides and Tutorials.
http://www.yahoo.com/Computers_and_Internet/Software/Data_Formats/HTML/Guides_and_Tutorials/

[Yahoo 96b]
Yahoo Web Archive, HTML Validation/Checkers.
http://www.yahoo.com/Computers_and_Internet/Software/Data_Formats/HTML/Validation_Checkers/

Getting Weblint

You can get the latest version of weblint via the weblint home page [Bowers 96], or directly from our ftp server.