Fifth International World Wide Web Conference
May 6-10, 1996, Paris, France

Weblint: Quality Assurance for the World-Wide Web

Neil Bowers
Khoral Research, Inc.
neilb@khoral.com

Abstract

More and more people are creating World Wide Web (WWW) pages. This explosive growth of the WWW is made easier through a widening variety of web browsers, each implementing its own interpretation of HTML. WWW search engines which extract information from pages based on their structure have multiplied just as rapidly. All these things mean it is increasingly important that web pages are checked for legal syntax and for additional problems, such as portability across browsers. This paper develops an initial taxonomy of HTML problems and describes weblint, a tool which can be used to identify a number of these problems.

Introduction

The last two years have seen an explosive growth in the WWW. Everyone and their cat has a home page, with additional random pages providing information they imagine others want to read. Developers of WWW pages have no way of knowing (modulo CGI scripts) what browsers will be pointed at their pages, so some way to provide assurance that pages will work for everyone is necessary. Additional problems for WWW page builders exist as well:

Human error. Unfortunately humans tend to make mistakes, particularly when performing repetitive tasks, and much of the web page creation process is repetitive. A large number of non-technophiles are also creating web pages, so we cannot assume that everyone will be intimately familiar with the DTD.
Browsers are supposed to be liberal in what they accept, and do their best to render pages, no matter how badly formed they are. But a page that looks great under Netscape may be unviewable under other browsers.
Since you're spending all that time crafting the content of your page, you want people to be able to read your web pages.
You want surfers to want to read about your love life and Mother's fruitcake recipe. Poorly crafted pages send a loud subconscious signal to the surfer, as does sloppy prose. With the short attention span and low tolerance of the average surfer, you cannot afford to fail on that all-important first impression.
Search engines are becoming the first line of attack when surfing, so it is important that your pages are amenable to automatic processing, particularly for extracting titles and section headings.
Not all browsers handle non-conformant HTML the same way; some content will not appear. For example, it is possible to write a table which looks fine under Netscape, but nothing at all will appear when viewed using Mosaic 2.6.
More and more people are not writing HTML directly, but using tools or a higher level notation, then generating HTML. We need to ensure that valid HTML is being generated.

Any time humans create things with a specific notation, a tool to catch mistakes is needed. Validation tools can check HTML pages and identify problems for the developer [Sanders 94].

A year and a half ago my employer started to provide information using the web. We are a small company, and couldn't justify a full-time webmaster, so a number of people were creating pages, with differing levels of HTML experience. Most of our pages were, and still are, created by hand, with the attendant potential for human error. Being a fan of lint (which is used to perform static analysis of C code [Darwin 91]), the regular structure and simple syntax (or so I thought at the time) of HTML seemed like an opportunity for an automatic checker. Enter weblint, stage left.

A quick tour of this paper

A definition of validation is developed, along with a simple taxonomy of HTML errors. This is followed by a description of weblint, which covers philosophy, design, and the categories of HTML validation performed, according to the taxonomy presented in the previous section. A summary of other validation tools is given, with comparison to weblint. Finally the conclusion presents a summary of weblint, the contents of this paper, and future plans related to weblint and HTML validation.

Validation

Before describing tools for validation of HTML, we need a definition and scope for validation. In the strictest sense, a given page is valid if it conforms to a specific Document Type Definition (DTD) for HTML. Since there is no single definition of what constitutes HTML, strict validation must refer to a specific DTD, such as the definition for HTML 2 [HTML 2]. There are a number of services which provide strict validation, the best known being the WebTechs HTML Validation Service [WebTechs 96].

In developing weblint, I was more interested in a tool which would provide some level of assurance to developers of web pages that their pages provide the intended content to the reader. Rather than proving something is formally correct, I wanted a tool that asked: does it get the job done?

In this section a taxonomy of HTML problems is developed. The (potential) problems identified by Weblint can be classified according to the categories described below.

Syntax Problems

These are problems related to incorrect use of HTML tags or elements. There are a number of common types of syntax error.

The simplest syntactic problem is illegal elements. This might be the result of using an element which is not part of the DTD you're writing to, a typo (for example, <BLOKCQUOTE>), or the result of including literal text with something which looks like markup, such as <laugh>. The same problems can occur for element attributes.

Another syntactic problem is unclosed container errors, which can have a number of causes:

Forgetting the closing tag (most common for list elements and blockquote), or not realizing that one was needed. I have received quite a few `bug reports', because weblint was warning that <A NAME="..."> did not have a closing </A>.
Using the wrong closing tag. The most common example is mismatched header tags, as in <H2>Level 1 Heading</H1>.
Forgetting the / on the closing tag, for example <TITLE>Weblint Home Page<TITLE>.
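
The unclosed-container cases listed above can be illustrated with a small stack-based check. This is a minimal Python sketch and not weblint's actual Perl implementation; the function name `check_containers`, the message wording, and the small `CONTAINERS` table are assumptions made for illustration only.

```python
import re

# Hypothetical subset of container elements; a real checker needs a full table.
CONTAINERS = {"TITLE", "H1", "H2", "H3", "A", "B", "I", "BLOCKQUOTE", "UL"}

# Captures an optional "/" and the element name of each tag.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")

def check_containers(html):
    """Return warning strings for unclosed or mismatched container elements."""
    warnings = []
    stack = []  # (element name, line number) for each container still open
    for lineno, line in enumerate(html.splitlines(), start=1):
        for closing, name in TAG_RE.findall(line):
            name = name.upper()
            if name not in CONTAINERS:
                continue
            if not closing:
                stack.append((name, lineno))
            elif any(open_name == name for open_name, _ in stack):
                # Pop back to the matching open tag, warning about anything skipped.
                while stack:
                    open_name, open_line = stack.pop()
                    if open_name == name:
                        break
                    warnings.append(f"line {open_line}: no closing </{open_name}> seen")
            else:
                warnings.append(f"line {lineno}: unmatched </{name}>")
    # Anything left on the stack was never closed.
    for open_name, open_line in stack:
        warnings.append(f"line {open_line}: no closing </{open_name}> seen")
    return warnings
```

Run on the mismatched heading example above, this sketch reports both an unmatched closing tag and an unclosed opening tag; the missing `/` example shows up as two unclosed TITLE elements.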

Lexical Problems

These are problems related to the use of character sets, and formatting-related problems. They often arise in browsers which use ad-hoc parsing techniques, rather than mechanisms driven from a DTD.

There are some places where whitespace is significant, and will change how your page is rendered by some browsers. For example, leading whitespace in a list element with unevenly formatted text, as shown below:

    <LI>First item
    <LI> Second item, won't be lined up with 1st and 3rd in all browsers
    <LI>Third item

Leading and trailing whitespace in anchor elements can result in hanging underscores, which looks goofy when the only content of the anchor is an image.
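
The anchor whitespace problem lends itself to a simple pattern match. The following is a rough Python sketch under the assumption that anchors are not nested and that a regular expression is good enough for the job; the function name `anchors_with_stray_whitespace` is hypothetical and this is not how weblint itself is written.

```python
import re

# Captures the content between <A ...> and the next </A>.
ANCHOR_RE = re.compile(r"<A\b[^>]*>(.*?)</A>", re.IGNORECASE | re.DOTALL)

def anchors_with_stray_whitespace(html):
    """Return the content of each anchor that starts or ends with whitespace."""
    flagged = []
    for match in ANCHOR_RE.finditer(html):
        content = match.group(1)
        if content and content != content.strip():
            flagged.append(content)
    return flagged
```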

HTML Usage Problems

Some aspects of HTML are optional, but can be useful to check for, such as use of the HTML, HEAD, TITLE and BODY elements.

Abuse of header elements is extremely common, the most frequent misdemeanor being use of H5 and H6 to produce small sized text, usually for things like copyright statements. This causes problems if you try and automatically process the HTML for translation to a different notation, or to generate a table of contents. For similar reasons it is a good idea to have heading levels increasing by no more than one, i.e., an <H3> should not follow an <H1>.
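
The heading-level rule can be checked mechanically. Here is a minimal Python sketch, assuming headings are found with a regular expression rather than a real parser; the name `heading_jumps` is made up for this example.

```python
import re

# Matches opening heading tags only; "</H1>" does not match because of the "/".
HEADING_RE = re.compile(r"<H([1-6])\b", re.IGNORECASE)

def heading_jumps(html):
    """Return (previous, current) level pairs where the level rises by more than one."""
    jumps = []
    previous = None
    for match in HEADING_RE.finditer(html):
        level = int(match.group(1))
        if previous is not None and level > previous + 1:
            jumps.append((previous, level))
        previous = level
    return jumps
```

Dropping back from a deep level to a shallow one (an H3 followed by an H2, say) is not flagged, since only increases of more than one level are the concern here.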

If you subscribe to the notion that HTML is intended to describe content and not presentation, then you should use logical rather than physical font elements. For example, using <STRONG> and not <B>.

Structural Integrity Problems

A single web page is not a stand-alone document, but almost certainly part of some larger document, or infostructure. This usually holds true at a number of scales. For example, this section of the paper exists as a single page, which is part of the overall document for this paper, which is itself a component of the weblint pages I maintain. The weblint pages are a subset of all the pages I maintain, which are a subset of the pages on our server. There are links across pages at all of these levels. Furthermore, there are links from these pages to URLs on other servers, and there are external documents referencing my pages.

The most common problem is dead links, where the target of an anchor no longer exists, because the page has been moved, the document has been restructured, the server doesn't exist any more, or the author has moved on. The problem might also be that the URL was mistyped.

Another user unfriendly problem is that of limbo pages, where there are no upward links, for example, to the parent document or site. Pages are often reached by way of a search engine, and it is frustrating to have to guess what the parent URL is. From a web maintenance standpoint, it is also useful to know if any pages are not referenced by any other page.
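
Both problems reduce to simple set operations once the links between pages are known. The sketch below is in Python and assumes the link information has already been gathered into a dictionary mapping each local page to the set of local pages it links to; that data structure and the function name `find_link_problems` are assumptions for illustration, and checking links to other servers is out of scope here.

```python
def find_link_problems(links):
    """links maps each known page to the set of local pages it links to.

    Returns (dead, unreferenced): (source, target) pairs whose target is
    not a known page, and known pages that no other page links to.
    """
    pages = set(links)
    dead = {(src, dst)
            for src, targets in links.items()
            for dst in targets
            if dst not in pages}
    # A page linking to itself does not count as being referenced.
    referenced = {dst
                  for src, targets in links.items()
                  for dst in targets
                  if dst != src}
    unreferenced = pages - referenced
    return dead, unreferenced
```

Note that a top-level page will normally appear in the unreferenced set unless other pages link back up to it, which is exactly the upward link that limbo pages lack.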

Portability Problems

There are some aspects of HTML which, while they are legal, cause problems on enough browsers to warrant avoiding them.

SGML allows for attribute values to be quoted using single (') or double (") quotation marks. Very few of the most popular browsers support use of single quotations, so attribute values should always be quoted with double quotations, and the &quot; entity used to represent double quotation marks within attribute values.

Very few browsers handle SGML comments correctly. A valid SGML comment has the following format:

     <!-- body of the comment -->

Although it is perfectly legal to include valid markup within comments, this will confuse a large number of browsers, and should be avoided.
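
The two portability problems described so far, single-quoted attribute values and markup inside comments, can both be spotted with pattern matching. This is a rough Python sketch; the patterns are deliberately simple, the function name `portability_warnings` and the message strings are invented for the example, and a quoted value containing `='` would fool it.

```python
import re

COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)
# Anything that looks like an opening or closing tag.
MARKUP_RE = re.compile(r"</?[A-Za-z][^>]*>")
# A tag containing an attribute value quoted with single quotation marks.
SINGLE_QUOTED_RE = re.compile(r"<[A-Za-z][^>]*=\s*'[^']*'[^>]*>")

def portability_warnings(html):
    """Return a list of warning strings for the two checks above."""
    warnings = []
    for match in COMMENT_RE.finditer(html):
        if MARKUP_RE.search(match.group(1)):
            warnings.append("markup inside comment")
    # Remove comments first so tags inside them are not checked for quoting.
    without_comments = COMMENT_RE.sub("", html)
    if SINGLE_QUOTED_RE.search(without_comments):
        warnings.append("attribute value in single quotes")
    return warnings
```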

There are a number of features which are inherited from SGML, such as tag minimization, which allows the following notation:

    <TITLE/Weblint home page/

None of the current crop of browsers support this, or other `esoteric' notations [Sosin 96].

Stylistic Problems

There is a wide range of `stylistic problems', most of which depend on your taste [Yahoo 96a]. Examples of this category:

Using `here' and other content-free text within anchors (as in `Click here to read about the Wizard Frobozz'). This is particularly bad given that many search engines will use anchor text.
Spelling and grammatical mistakes. Sloppy text is worse than sloppy HTML, and almost as bad as use of the BLINK tag.
Discriminating against users of text-only browsers, such as lynx (are there any other text browsers?):
- Index pages where the only interface is an image map.
- Not providing ALT attribute values for <IMG> elements.
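
Two of the stylistic problems above, content-free anchor text and IMG elements without ALT text, are easy to detect automatically. The following Python sketch shows one way to do it, assuming regular expressions are adequate; the function name `style_warnings` and the message wording are hypothetical, and only the single word `here' is treated as content-free.

```python
import re

ANCHOR_RE = re.compile(r"<A\b[^>]*>(.*?)</A>", re.IGNORECASE | re.DOTALL)
IMG_RE = re.compile(r"<IMG\b[^>]*>", re.IGNORECASE)
ALT_RE = re.compile(r"\bALT\s*=", re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]*>")

def style_warnings(html):
    """Return warnings for `here' anchors and IMG elements lacking ALT."""
    warnings = []
    for match in ANCHOR_RE.finditer(html):
        # Drop nested markup such as <B> before comparing the anchor text.
        text = TAG_RE.sub("", match.group(1)).strip().lower()
        if text == "here":
            warnings.append("content-free anchor text: here")
    for match in IMG_RE.finditer(html):
        if not ALT_RE.search(match.group(0)):
            warnings.append("IMG without ALT text")
    return warnings
```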

Weblint

Weblint is an ad-hoc HTML validation program which is written in the Perl scripting language [Wall 91]. Perl was designed for processing and generating text, and has powerful regular expression capabilities, which made it the right tool for the job. Perl is also very portable, so weblint can be used under Unix, VMS, Windows, Mac, and other platforms.

The simplest way to use weblint is to provide one or more filenames on the command line. For example, if the file foobar.html contained the following:

    <HTML>
    <HEAD><TITLE>Sample HTML Page</TITLE></HEAD>
    <BODY>
       <H1>Sample HTML Page</H2>
       Click <A href="fun.html"><B>here</A></B> for fun!
    </BODY>
    </HTML>

then weblint can be used to check it as follows:

   % weblint foobar.html
   foobar.html(4): unmatched </H2> (no matching <H2> seen).
   foobar.html(5): bad form to use `here' as an anchor!
   foobar.html(5): </A> on line 5 seems to overlap <B>, opened on line 5.
   foobar.html(6): no closing </H1> seen for <H1> on line 4.

In addition to specifying files, you can provide URLs or directory names. In the latter case weblint will recurse in the directory, checking all HTML files found. This makes it easy to check a given set of pages.
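
Recursing through a directory to collect the files to check might look like the following. This is a Python sketch of the general idea, not weblint's code; the function name `find_html_files` and the choice of `.html` and `.htm` as the extensions that identify HTML files are assumptions.

```python
import os

def find_html_files(directory, extensions=(".html", ".htm")):
    """Walk the directory tree and return the paths of files that look like HTML."""
    found = []
    for root, _dirs, files in os.walk(directory):
        for name in sorted(files):
            if name.lower().endswith(extensions):
                found.append(os.path.join(root, name))
    return found
```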

Weblint can also be used via one of a number of form-based interfaces. The user can type in a URL, and request that weblint be used to check the referenced page. The first such interface was created by Clay Webster at Unipress [Unipress 95], but there are now several different interfaces available, listed on the Weblint home page [Bowers 96].

Weblint Design Issues

There have been a number of goals, criteria, and points of philosophy I've held while developing weblint:

Weblint should be a useful tool for people creating web pages.
I was not trying to produce a strict HTML validation system; one of those exists already. If someone requests a new check and warning, I will almost always add it to the todo list. If the warning is not likely to be of interest to most people, then it will be disabled by default (for example, checking to see if the first thing in a file is a DOCTYPE element).
Weblint should be easy to obtain, install, and use.
Weblint is intended to speed up the process of creating good web pages. The web is used by non-technophiles, so weblint needs to be easy to use.
Weblint should produce warnings which are easy to understand.
Validators based on DTDs often generate fairly cryptic messages, since they're based around the idea of showing that something is correct. Often one mistake will result in scads of warnings, because the parser becomes confused. Weblint is based around an ad-hoc parser, and includes code to check for known mistakes.
For example, early versions of weblint generated at least five warnings for overlapping elements. I took pains to ensure that this would only result in one warning for the basic case. For example, the following snippet was checked with weblint and the WebTechs validator, which is based on the sgmls parser:
```
    <HTML><HEAD><TITLE>sample HTML</TITLE></HEAD>
    <BODY>
    <B><I>Hello</B></I>
    </BODY>
    </HTML>
```
The WebTechs validator produced the following output
```
    sgmls: SGML error at -, line 7 at ">":
           I end-tag implied by B end-tag; not minimizable
    sgmls: SGML error at -, line 7 at "":
           I end-tag ignored: doesn't end any open element (current is BODY)
```
while Weblint generated the following:
```
    foobar.html(3): </B> on line 3 seems to overlap <I>, opened on line 3.
```
Weblint should be configurable.
It should be possible, and easy, to specify what types of warnings weblint should generate. This was important, given that different people care about different aspects of HTML validation (I understand there are people who don't consider BLINK to be a crime against the net). Configuration is considered separately, below.
Weblint should be flexible in how it can be used.
Weblint was originally developed as a command-line tool, but has had a number of features and options added to facilitate development of forms-based interfaces, and other front-ends. It should also be easy to use in batch style, for example, running from a crontab to check pages on a regular basis.
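
The overlapping-elements example above shows why an ad-hoc checker can give a friendlier message than a strict parser. One way to get a single warning is to remove only the matching open element from the stack, leaving the inner element open so that its own closing tag still matches later. The Python sketch below illustrates that idea; it is an assumption about one workable approach, not a description of weblint's Perl source, and the function name `check_overlap` is invented. The message format copies the weblint output shown above.

```python
import re

TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")

def check_overlap(html, filename="foobar.html"):
    """Return one warning per closing tag that overlaps a still-open element."""
    warnings = []
    stack = []  # (element name, line number) for each open element
    for lineno, line in enumerate(html.splitlines(), start=1):
        for closing, name in TAG_RE.findall(line):
            name = name.upper()
            if not closing:
                stack.append((name, lineno))
                continue
            # Search from the innermost open element outwards for a match.
            for i in range(len(stack) - 1, -1, -1):
                if stack[i][0] == name:
                    break
            else:
                continue  # unmatched closing tag; left to a separate check
            if i < len(stack) - 1:
                inner_name, inner_line = stack[-1]
                warnings.append(
                    f"{filename}({lineno}): </{name}> on line {lineno} seems to "
                    f"overlap <{inner_name}>, opened on line {inner_line}."
                )
            # Remove only the matched element, so the inner one can still close.
            del stack[i]
    return warnings
```

For simplicity this sketch treats every element as a container, so empty elements such as IMG would stay on the stack; a real checker needs a table of which elements take closing tags.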

Configuring Weblint

It is important that a validation tool be configurable, for a number of reasons:

There is no single definition of what constitutes `HTML'.
Not everyone is worried about the portability of their HTML pages.
HTML developers have different definitions of correctness.
Some people do like rigorous, anal-retentive warnings. Myself included.

The operation of weblint can be controlled via command-line switches, or the user can create a .weblintrc file in their home directory. The weblint distribution includes a sample configuration file, which mirrors the built-in defaults, and is thus a good starting point for someone wanting to tweak weblint's operation.

All weblint warnings have an associated identifier, and a flag which specifies whether the warning should be generated. For example, the warning for potentially overlapping elements in the example above has an identifier of element-overlap. If you are not interested in seeing this warning, you can run

    % weblint -d element-overlap foobar.html

Or disable the warning from your configuration file, with the line:

    disable element-overlap

Weblint includes the concept of HTML extensions, which define additional elements and attributes. Enabling the Netscape extensions means that weblint won't warn that CENTER is non-standard markup.

A number of configuration variables can also be set in the user's configuration file:

    set message-style   = lint
    set url-get         = lynx -source
    set directory-index = index.html, welcome.html

The first variable specifies that warnings should be generated in the style of traditional lint. The url-get variable provides a command which can be used to pull down pages specified with a URL. It is also used to pull the current todo list from my ftp server if you run weblint -todo. The last variable specifies the name(s) of a valid directory index file, which weblint checks for when recursing in directories.
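
To make the configuration model concrete, here is a small Python sketch that reads lines of the form shown above into a table of warning flags and a table of variables. It is illustrative only: the function name `parse_config` is made up, the `enable` keyword is assumed as the natural counterpart of `disable`, and accepting several identifiers on one line is also an assumption.

```python
def parse_config(text):
    """Return (flags, variables) parsed from 'enable', 'disable' and 'set' lines."""
    flags = {}       # warning identifier -> True (enabled) or False (disabled)
    variables = {}   # variable name -> value string
    for raw in text.splitlines():
        parts = raw.strip().split(None, 1)
        if not parts:
            continue  # blank line
        keyword = parts[0]
        rest = parts[1].strip() if len(parts) > 1 else ""
        if keyword in ("enable", "disable"):
            # Assumption: identifiers may be separated by commas or spaces.
            for ident in rest.replace(",", " ").split():
                flags[ident] = (keyword == "enable")
        elif keyword == "set":
            name, _, value = rest.partition("=")
            variables[name.strip()] = value.strip()
    return flags, variables
```

A checker built this way would consult the flags table before emitting each warning, which is what makes a switch like `-d element-overlap` cheap to support.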

Weblint Warnings

The version of weblint publicly available at the time of writing (1.013) supports 41 different warnings, in all the different problem categories identified above. The next version will support several more, and the todo list has a lot beyond that waiting for implementation.

Syntax Warnings

Unknown tags and attributes.
Unclosed container elements.
Required attributes.
Context checks.
Illegally nested container elements.

Lexical Warnings

Leading and trailing whitespace in certain container elements, such as anchors.
Not using entities where appropriate.

HTML Usage Warnings

Not using HTML, HEAD, and BODY elements.
Use of `here' as anchor text.
Not including a DOCTYPE.
Not including a LINK defining the page's author.
Use of obsolete markup, such as XMP.
Headings appearing in unexpected order.

Structural Integrity Warnings

This is the weakest area of weblint, though I haven't been hurrying to rectify that, since MOMspider and other packages do that pretty well.

No index file for a directory.
Target for local anchor does not exist (buggy).

Portability Warnings

Markup inside comments.
Use of single quotation marks for attribute values.
Not defining ALT text for IMaGes.
Use of Netscape-specific markup.

Stylistic Warnings

Tags in upper or lower case.
Use of `here' as anchor text.
Not specifying WIDTH and HEIGHT on IMaGes.
Use of physical, rather than logical, font markup.
Empty container elements. Empty paragraphs are often used to explicitly control spacing, for example.

Other Validation Tools

MOMspider is a robot which checks the structural integrity of a specified infostructure, such as the set of web pages you maintain [Fielding 94]. MOMspider fills a gap in the coverage provided by weblint.

Henry Churchyard's htmlchek is very similar to weblint, performing ad-hoc analysis of web pages with control over the warnings generated [Churchyard 95]. Both weblint and htmlchek have warnings not available in the other, so it is often worth running both. Something I am working to rectify :-).

The WebTechs HTML validator [WebTechs 96] provides true validation against a number of selectable DTDs, using the sgmls parser by James Clark [Clark 94]. Although this provides the last word in HTML conformance, a lot of the warning messages are hard to understand, particularly for those new to HTML. The Kinder, Gentler Validator is also based on sgmls, but provides a more user friendly description of any problems found [Oskoboiny 96].

Doctor HTML is a forms-based package which performs a number of checks, including basic document structure, spell checking, and use of FORM, TABLE, and IMG elements [Tongue 95]. Results are generated as tables, and while they are thorough, it is hard to find the real warnings.

There are a number of other tools which perform a subset of the validation activities identified above. There is an index of validation tools on the weblint home page, and one at Yahoo [Yahoo 96b].

Conclusions and Future Work

Weblint is a useful tool for web developers, and provides certain categories of validation which are not provided by other tools, particularly checks for constructs which are valid HTML, but problematic for other reasons. Weblint is not fully comprehensive in its coverage, and is best used in combination with additional tools, such as htmlchek and the WebTechs validator. Weblint's coverage improves with every release, which happens approximately monthly.

The todo list for weblint grows faster than items are taken off. The following provides highlights of the things I'll be working on over the next year or so:

A more flexible form-based interface. In the future there will be a basic interface, and a full-featured interface, with both being generated from the weblint source. A weblint form kit will be available, since I get a lot of requests from webmasters who want to provide a local web-based version of weblint.
There are still a number of problems which will result in more than one warning from weblint, such as:
```
    <H2>Leather Goddesses of Phobos</H1>
```
which results in two warnings, rather than one.
More comprehensive context checking. Weblint is table-driven, so I would like to create maintenance scripts which generate these tables directly from a DTD.
Improve structural integrity checks, such as checking for existence of URLs referenced, and whether a page has any links off it.
The weblint distribution includes a regression testsuite. It would be nice to have automatic generation of test cases from a DTD.
Check for validity of FORM items, flagging radio elements with more than one item checked, for example.
Add support for additional HTML extensions, such as those defined by Microsoft and Spinner. There are always new Netscape elements to add, partly because Netscape, Inc. never publishes a full list of supported markup, let alone a DTD.
Development of a higher level meta validation tool which uses a selection of the available validation tools to provide a `complete' validation and problem identification service. This is a better approach than trying to provide everything in one tool.
I hope browser technology will advance far enough such that some of the warnings can be removed. Don't hold your breath.

Acknowledgements

I would like to thank the Weblint Victims for all their help, in the form of pre-release testing, bug reports and fixes, suggestions and code for new features.

I would also like to thank Leslee Richards for proofreading this paper and making many useful suggestions. Any errors are my own.

References

[Bowers 96]: Neil Bowers, Weblint Home Page.
http://www.khoral.com/staff/neilb/weblint.html
[Churchyard 95]: Henry Churchyard, Htmlchek Home Page.
http://uts.cc.utexas.edu/~churchh/htmlchek.html
[Clark 94]: James Clark, sgmls SGML Parser.
[Darwin 91]: Ian F. Darwin, Checking C Programs with lint. O'Reilly & Associates, Inc. 1991.
http://www.ora.com/gnn/bus/ora/item/lint.html
[Fielding 94]: Roy T. Fielding, Maintaining Distributed Hypertext Infostructures: Welcome to MOMspider's Web. Proceedings of the First International Conference on the World-Wide Web. Geneva, May 1994.
http://www.ics.uci.edu/WebSoft/MOMspider/docs/www94_paper.ps
[HTML 2]: W3 Consortium, HyperText Markup Language (HTML).
http://www.w3.org/pub/WWW/MarkUp/
[Oskoboiny 96]: Gerald Oskoboiny, A Kinder, Gentler Validator.
http://ugweb.cs.ualberta.ca/~gerald/validate/
[Sosin 96]: Semyon Sosin, Dark Side of the HTML.
http://www.best.com/~sem/dark_side/
[Sanders 94]: Tony Sanders, Why Validate Your HTML.
http://www.earth.com/bad-style/why-validate.html
[Tongue 95]: Thomas Tongue and Imagiware, Doctor HTML.
http://imagiware.com/RxHTML.cgi
[Unipress 95]: Unipress, Inc., Weblint Interface.
http://www.unipress.com/weblint/
[Wall 91]: Larry Wall, Programming Perl. O'Reilly & Associates, Inc. 1991.
http://www.ora.com/gnn/bus/ora/item/pperl.html
[WebTechs 96]: The WebTechs HTML Validation Service.
http://www.webtechs.com/html-val-svc/
[Yahoo 96a]: Yahoo Web Archive, HTML Guides and Tutorials
http://www.yahoo.com/Computers_and_Internet/Software/Data_Formats/HTML/Guides_and_Tutorials/
[Yahoo 96b]: Yahoo Web Archive, HTML Validation/Checkers
http://www.yahoo.com/Computers_and_Internet/Software/Data_Formats/HTML/Validation_Checkers/

Getting Weblint

You can get the latest version of weblint via the weblint home page:

http://www.khoral.com/staff/neilb/weblint.html

or directly from our ftp server:

ftp://ftp.khoral.com/pub/weblint/weblint.tar.gz
