Neil Bowers
Khoral Research, Inc.
neilb@khoral.com
table which looks fine under Netscape, but nothing at all will appear when viewed using Mosaic 2.6.
A year and a half ago my employer started to provide information using the web. We are a small company and couldn't justify a full-time webmaster, so pages were created by a number of people with differing levels of HTML experience. Most of our pages were, and still are, created by hand, with the attendant potential for human error. I am a fan of lint, which performs static analysis of C code [Darwin 91], and the regular structure and simple syntax (or so I thought at the time) of HTML seemed like an opportunity for an automatic checker. Enter weblint, stage left.
In developing weblint, I was more interested in a tool which would give developers of web pages some level of assurance that their pages deliver the intended content to the reader. Rather than proving that something is formally correct, I wanted a tool that asked: does it get the job done?
In this section a taxonomy of HTML problems is developed. The (potential) problems identified by Weblint can be classified according to the categories described below.
The simplest syntactic problem is illegal elements. This might be the result of using an element which is not part of the DTD you're writing to, a typo (for example, <BLOKCQUOTE>), or the result of including literal text with something which looks like markup, such as <laugh>. The same problems can occur for element attributes.
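As a rough illustration of the illegal-element check (a sketch only, not weblint's own code), the following uses Python's standard html.parser module and a deliberately tiny, hypothetical list of known elements. A real checker would take the list from the HTML version being checked against.

```python
from html.parser import HTMLParser

# Hypothetical, deliberately incomplete list of legal elements.
KNOWN_ELEMENTS = {"html", "head", "title", "body", "p", "a", "blockquote"}


class UnknownElementFinder(HTMLParser):
    """Record the names of start tags that are not in KNOWN_ELEMENTS."""

    def __init__(self):
        super().__init__()
        self.unknown = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag not in KNOWN_ELEMENTS:
            self.unknown.add(tag)


def unknown_elements(html):
    finder = UnknownElementFinder()
    finder.feed(html)
    finder.close()
    return sorted(finder.unknown)
```

Both the typo and the literal text that merely looks like markup are reported, since the parser cannot tell them apart.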
Another syntactic problem is unclosed container errors, which can have a number of causes:
- <A NAME="..."> did not have a closing </A>.
- <H2>Level 1 Heading</H1>.
- <TITLE>Weblint Home Page<TITLE>.
- <TITLE>).
- ROWS and COLS attributes for <TEXTAREA>).
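Unclosed and mismatched containers of the kind listed above can be found with a simple stack of open elements. The sketch below is an assumption about how such a check might look, not a description of weblint's internals. The CONTAINERS set and the function name are hypothetical, and the message wording only loosely imitates weblint's.

```python
from html.parser import HTMLParser

# Hypothetical subset of container elements, chosen for illustration only.
CONTAINERS = {"a", "b", "i", "h1", "h2", "h3", "h4", "h5", "h6"}


class ContainerChecker(HTMLParser):
    """Collect warnings about unmatched and unclosed container elements."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of (tag, line) for containers still open
        self.warnings = []   # (line, message) pairs

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINERS:
            self.open_tags.append((tag, self.getpos()[0]))

    def handle_endtag(self, tag):
        if tag not in CONTAINERS:
            return
        # Close the most recently opened element with this name, if any.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                del self.open_tags[i]
                return
        self.warnings.append(
            (self.getpos()[0], f"unmatched </{tag}> (no matching <{tag}> seen)")
        )


def check_containers(html):
    checker = ContainerChecker()
    checker.feed(html)
    checker.close()
    # Anything still on the stack was never closed.
    for tag, line in checker.open_tags:
        checker.warnings.append((line, f"no closing </{tag}> seen for <{tag}>"))
    return checker.warnings
```

Note that a mismatched closing tag such as <H2>...</H1> produces two warnings here, one for the stray </H1> and one for the unclosed <H2>.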
There are some places where whitespace is significant, and will change how your page is rendered by some browsers. For example, leading whitespace in a list element with unevenly formatted text, as shown below:
<LI>First item
<LI> Second item, won't be lined up with 1st and 3rd in all browsers
<LI>Third item
Leading and trailing whitespace in anchor elements can result in hanging underscores, which looks goofy when the only content of the anchor is an image.
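The anchor case can be approximated with a regular expression. This is a minimal sketch under the assumption that anchors are not nested and that attribute values contain no > character. The function name is hypothetical.

```python
import re

# Matches an anchor and captures its content; DOTALL lets content span lines.
ANCHOR = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)


def padded_anchors(html):
    """Return the content of each anchor with leading or trailing whitespace."""
    return [
        match.group(1)
        for match in ANCHOR.finditer(html)
        if match.group(1) != match.group(1).strip()
    ]
```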
Abuse of header elements is extremely common,
the most frequent misdemeanor being use of H5 and H6 to produce small
sized text, usually for things like copyright statements.
This causes problems if you try to automatically process the HTML
for translation to a different notation,
or to generate a table of contents.
For similar reasons it is a good idea to have heading levels increasing
by no more than one,
i.e., an <H3> should not follow an <H1>.
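The heading rule lends itself to a short check. The sketch below scans heading start tags in document order and reports each place where the level increases by more than one. The function name and the return format are assumptions made for illustration.

```python
import re

# Matches heading start tags only; closing tags begin with "</" and are skipped.
HEADING = re.compile(r"<h([1-6])\b", re.IGNORECASE)


def heading_jumps(html):
    """Return (previous, current) level pairs where a heading level is skipped."""
    jumps = []
    previous = None
    for match in HEADING.finditer(html):
        level = int(match.group(1))
        if previous is not None and level > previous + 1:
            jumps.append((previous, level))
        previous = level
    return jumps
```

Moving back up the hierarchy, for example from an <H3> to an <H1>, is not reported, since only increases are constrained.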
If you subscribe to the notion that HTML is intended to describe content
and not presentation,
then you should use logical rather than physical font elements.
For example, use <STRONG> rather than <B>.
The most common problem is dead links, where the target of an anchor no longer exists, because the page has been moved, the document has been restructured, the server doesn't exist any more, or the author has moved on. The problem might also be that the URL was mistyped.
Another user-unfriendly problem is that of limbo pages, where there are no upward links, for example, to the parent document or site. Pages are often reached by way of a search engine, and it is frustrating to have to guess what the parent URL is. From a web maintenance standpoint, it is also useful to know if any pages are not referenced by any other page.
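Given a table of which local pages link to which, both dead links and unreferenced pages fall out of a single pass. This is only a sketch: the site dictionary is a hypothetical input which something else would have to build by extracting links from each page, and it covers local pages only.

```python
def audit_links(site):
    """site maps each page name to the list of local pages it links to.

    Returns (dead, unreferenced): dead is a list of (page, target) pairs
    whose target is not a known page; unreferenced is a sorted list of
    pages that no other page links to.
    """
    pages = set(site)
    referenced = set()
    dead = []
    for page, links in site.items():
        for target in links:
            if target not in pages:
                dead.append((page, target))
            elif target != page:
                # A page linking to itself does not count as a reference.
                referenced.add(target)
    return dead, sorted(pages - referenced)
```

A site's top-level page will often show up as unreferenced, so the result needs a human eye rather than being treated as a list of errors.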
SGML allows for attribute values to be quoted using single (')
or double (") quotation marks.
Very few of the most popular browsers support use of single quotations,
so attribute values should always be quoted with double quotations,
and the &quot; entity used to represent double quotation marks
within attribute values.
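A naive way to spot single-quoted attribute values is to look inside each tag for a name followed by = and a single-quoted string. The sketch below assumes that attribute values contain no > character and that no double-quoted value itself contains something that looks like a single-quoted attribute. The names are hypothetical.

```python
import re

TAG = re.compile(r"<[A-Za-z][^>]*>")
SINGLE_QUOTED = re.compile(r"([A-Za-z-]+)\s*=\s*'[^']*'")


def single_quoted_attributes(html):
    """Return the names of attributes whose values use single quotes."""
    found = []
    for tag in TAG.finditer(html):
        for attr in SINGLE_QUOTED.finditer(tag.group(0)):
            found.append(attr.group(1))
    return found
```

Apostrophes in ordinary text are ignored, because only the text of tags is searched.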
Very few browsers handle SGML comments correctly. A valid SGML comment has the following format:
<!-- body of the comment -->
Although it is perfectly legal to include valid markup within comments, this will confuse a large number of browsers, and should be avoided.
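Comments containing markup can be found with two regular expressions, one to locate comments of the simple form shown above and one to look for something tag-like inside them. This sketch does not attempt to handle the full SGML comment syntax. The function name is hypothetical.

```python
import re

COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)
MARKUP = re.compile(r"</?[A-Za-z][^>]*>")


def comments_with_markup(html):
    """Return the body of each comment that appears to contain markup."""
    return [
        match.group(1).strip()
        for match in COMMENT.finditer(html)
        if MARKUP.search(match.group(1))
    ]
```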
There are a number of features which are inherited from SGML, such as tag minimization, which allows the following notation:
<TITLE/Weblint home page/
None of the current crop of browsers support this, or other `esoteric' notations [Sosin 96].
ALT attribute values for <IMG> elements.
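A check for images without ALT text might look like the following sketch, which uses Python's standard html.parser module and reports the line number of each offending <IMG>. The class and function names are hypothetical, and this is not weblint's own code.

```python
from html.parser import HTMLParser


class AltChecker(HTMLParser):
    """Record the line number of every IMG element with no ALT attribute."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive in lower case.
        if tag == "img" and "alt" not in dict(attrs):
            self.missing.append(self.getpos()[0])


def images_missing_alt(html):
    checker = AltChecker()
    checker.feed(html)
    checker.close()
    return checker.missing
```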
The simplest way to use weblint is to provide one or more filenames
on the command line.
For example, if the file foobar.html
contained the following:
<HTML>
<HEAD><TITLE>Sample HTML Page</TITLE></HEAD>
<BODY>
<H1>Sample HTML Page</H2>
Click <A href="fun.html"><B>here</A></B> for fun!
</BODY>
</HTML>
then weblint can be used to check it as follows:
% weblint foobar.html
foobar.html(4): unmatched </H2> (no matching <H2> seen).
foobar.html(5): bad form to use `here' as an anchor!
foobar.html(5): </A> on line 5 seems to overlap <B>, opened on line 5.
foobar.html(6): no closing </H1> seen for <H1> on line 4.
In addition to specifying files, you can provide URLs or directory names. In the latter case weblint will recurse in the directory, checking all HTML files found. This makes it easy to check a given set of pages.
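The warnings in the sample run follow a file(line): message layout, which makes them easy to post-process, for example to group warnings by file when checking a whole directory. The sketch below assumes every warning line has exactly that layout. The LintWarning type and the function name are hypothetical.

```python
import re
from typing import NamedTuple


class LintWarning(NamedTuple):
    file: str
    line: int
    message: str


# file(line): message, as in the sample output.
WARNING_LINE = re.compile(r"^(?P<file>.+?)\((?P<line>\d+)\): (?P<message>.*)$")


def parse_warnings(output):
    """Turn weblint-style output into a list of LintWarning records."""
    warnings = []
    for raw in output.splitlines():
        match = WARNING_LINE.match(raw.strip())
        if match:
            warnings.append(
                LintWarning(match["file"], int(match["line"]), match["message"])
            )
    return warnings
```

Lines which do not match, such as the command line itself, are silently skipped.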
Weblint can also be used via one of a number of form-based interfaces. The user can type in a URL, and request that weblint be used to check the referenced page. The first such interface was created by Clay Webster at Unipress [Unipress 95], but there are now several different interfaces available, listed on the Weblint home page [Bowers 96].
DOCTYPE element).
For example, early versions of weblint generated at least five warnings for overlapping elements. I took pains to ensure that this would only result in one warning for the basic case. For example, the following snippet was checked with weblint and the WebTechs validator, which is based on the sgmls parser:
<HTML><HEAD><TITLE>sample HTML</TITLE></HEAD>
<BODY>
<B><I>Hello</B></I>
</BODY>
</HTML>
The WebTechs validator produced the following output:
sgmls: SGML error at -, line 7 at ">": I end-tag implied by B end-tag; not minimizable
sgmls: SGML error at -, line 7 at "": I end-tag ignored: doesn't end any open element (current is BODY)
while Weblint generated the following:
foobar.html(3): </B> on line 3 seems to overlap <I>, opened on line 3.
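One way to get a single warning for the basic overlap case is to remove the overlapped element from the stack of open elements as soon as the overlap is reported, so that the later closing tag finds a clean stack. The sketch below illustrates that idea. It is my reconstruction for this paper, not an extract from weblint, and the TRACKED set is a hypothetical subset of elements.

```python
from html.parser import HTMLParser

# Hypothetical set of elements to track, for illustration only.
TRACKED = {"a", "b", "i", "em", "strong"}


class OverlapChecker(HTMLParser):
    """Report each overlapping pair of elements exactly once."""

    def __init__(self):
        super().__init__()
        self.stack = []     # (tag, line) for tracked elements still open
        self.warnings = []

    def handle_starttag(self, tag, attrs):
        if tag in TRACKED:
            self.stack.append((tag, self.getpos()[0]))

    def handle_endtag(self, tag):
        names = [name for name, _ in self.stack]
        if tag not in names:
            return
        # Index of the most recently opened element with this name.
        index = len(names) - 1 - names[::-1].index(tag)
        if index != len(self.stack) - 1:
            inner, opened = self.stack[-1]
            self.warnings.append(
                f"</{tag.upper()}> on line {self.getpos()[0]} seems to overlap "
                f"<{inner.upper()}>, opened on line {opened}."
            )
        # Removing only the closed element leaves the inner one open, so
        # its own closing tag is matched later without a second warning.
        del self.stack[index]


def find_overlaps(html):
    checker = OverlapChecker()
    checker.feed(html)
    checker.close()
    return checker.warnings
```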
All weblint warnings have an associated identifier,
and a flag which specifies whether the warning should be generated.
For example,
the warning for potentially overlapping elements in the example
above has an identifier of element-overlap.
If you are not interested in seeing this warning,
you can run
% weblint -d element-overlap foobar.html
Or disable the warning from your configuration file, with the line:
disable element-overlap
Weblint includes the concept of HTML extensions, which define additional elements and attributes. Enabling the Netscape extensions means that weblint won't warn that CENTER is non-standard markup.
A number of configuration variables can also be set in the user's configuration file:
set message-style = lint
set url-get = lynx -source
set directory-index = index.html, welcome.html
The first variable specifies that warnings should be generated in the style of traditional lint. The url-get variable provides a command which can be used to pull down pages specified with a URL. It is also used to pull the current todo list from my ftp server if you run weblint -todo.
The last variable specifies the name(s) of a valid directory index file,
which weblint checks for when recursing in directories.
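To make the shape of the configuration file concrete, here is a sketch of a reader for the two kinds of line shown above, disable and set. It handles only those two forms, ignores anything else, and assumes one identifier per disable line. It is an illustration of the format, not weblint's actual parser.

```python
def parse_config(text):
    """Return (disabled, variables) from weblint-style configuration text.

    disabled is a set of warning identifiers; variables maps names to
    their unparsed string values.
    """
    disabled = set()
    variables = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        keyword, _, rest = line.partition(" ")
        rest = rest.strip()
        if keyword == "disable" and rest:
            disabled.add(rest)
        elif keyword == "set" and "=" in rest:
            name, _, value = rest.partition("=")
            variables[name.strip()] = value.strip()
    return disabled, variables
```

Values are kept as plain strings, so a caller wanting the individual directory-index names would split that value on commas.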
- Syntax Warnings
  - Unknown tags and attributes.
  - Unclosed container elements.
  - Required attributes.
  - Context checks.
  - Illegally nested container elements.
- Lexical Warnings
  - Leading and trailing whitespace in certain container elements, such as anchors.
  - Not using entities where appropriate.
- HTML Usage Warnings
  - Not using HTML, HEAD, and BODY elements.
  - Use of `here' as anchor text.
  - Not including a DOCTYPE.
  - Not including a LINK defining the page's author.
  - Use of obsolete markup, such as XMP.
  - Headings appearing in unexpected order.
- Structural Integrity Warnings
  This is the weakest area of weblint, though I haven't been hurrying to rectify that, since MOMspider and other packages do that pretty well.
  - No index file for a directory.
  - Target for local anchor does not exist (buggy).
- Portability Warnings
  - Markup inside comments.
  - Use of single quotation marks for attribute values.
  - Not defining ALT text for IMaGes.
  - Use of Netscape-specific markup.
- Stylistic Warnings
  - Tags in upper or lower case.
  - Use of `here' as anchor text.
  - Not specifying WIDTH and HEIGHT on IMaGes.
  - Use of physical, rather than logical, font markup.
  - Empty container elements. Empty paragraphs are often used to explicitly control spacing, for example.
Henry Churchyard's htmlchek is very similar to weblint,
performing ad-hoc analysis of web pages with control over
the warnings generated [Churchyard 95].
Both weblint and htmlchek have warnings not available in the other,
so it is often worth running both.
Something I am working to rectify :-).
The WebTechs HTML validator [WebTechs 96] provides true validation against a number of selectable DTDs, using the sgmls parser by James Clark [Clark 94]. Although this provides the last word in HTML conformance, a lot of the warning messages are hard to understand, particularly for those new to HTML. The Kinder, Gentler Validator is also based on sgmls, but provides a more user-friendly description of any problems found [Oskoboiny 96].
Doctor HTML is a forms-based package which performs a number of checks, including basic document structure, spell checking, and use of FORM, TABLE, and IMG elements [Tongue 95]. Results are generated as tables, and while they are thorough, it is hard to find the real warnings.
There are a number of other tools which perform a subset of the validation activities identified above. There is an index of validation tools on the weblint home page, and one at Yahoo [Yahoo 96b].
The todo list for weblint grows faster than items are taken off. The following provides highlights of the things I'll be working on over the next year or so:
<H2>Leather Goddesses of Phobos</H1>
which results in two warnings, rather than one.
I would also like to thank Leslee Richards for proofreading this paper and making many useful suggestions. Any errors are my own.
http://www.khoral.com/staff/neilb/weblint.html
http://uts.cc.utexas.edu/~churchh/htmlchek.html
http://www.ora.com/gnn/bus/ora/item/lint.html
http://www.ics.uci.edu/WebSoft/MOMspider/docs/www94_paper.ps
http://www.w3.org/pub/WWW/MarkUp/
http://ugweb.cs.ualberta.ca/~gerald/validate/
http://www.best.com/~sem/dark_side/
http://www.earth.com/bad-style/why-validate.html
http://imagiware.com/RxHTML.cgi
http://www.unipress.com/weblint/
http://www.ora.com/gnn/bus/ora/item/pperl.html
http://www.webtechs.com/html-val-svc/
http://www.yahoo.com/Computers_and_Internet/Software/Data_Formats/HTML/Guides_and_Tutorials/
http://www.yahoo.com/Computers_and_Internet/Software/Data_Formats/HTML/Validation_Checkers/