When editing HTML it is easy to make mistakes. Wouldn't it be nice if there were a simple way to fix these mistakes automatically and tidy up sloppy editing into nicely laid out markup? Well, now there is, thanks to Dave Raggett of HP Labs. HTML Tidy is a free utility for doing just that. It also works well on the atrociously hard-to-read markup generated by specialized HTML editors and conversion tools, and it can help you identify where you need to pay further attention to making your pages more accessible to people with disabilities.
Tidy can fix a wide range of problems and bring to your attention things that you need to work on yourself. Each item found is listed with its line number and column so that you can see where the problem lies in your markup. Tidy will not generate a cleaned-up version when there are problems that it is not sure how to handle. These are logged as "errors" rather than "warnings".
Here are just a few examples of how Tidy perfects your HTML for you:
<h1>heading <h2>subheading</h3>
is mapped to
<h1>heading</h1> <h2>subheading</h2>
<p>here is a para <b>bold <i>bold italic</b> bold?</i> normal?
is mapped to
<p>here is a para <b>bold <i>bold italic</i> bold?</b> normal?
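The repair of misnested inline tags can be illustrated with a small stack-based sketch in Python. This is not Tidy's actual code (Tidy is written in C); it assumes a simplified rule, namely that any inline end tag closes the innermost open inline element, and it uses a deliberately naive tokenizer. The names fix_inline_nesting, INLINE and TOKEN are made up for this illustration.

```python
import re

# Assumed subset of inline elements, chosen for illustration only.
INLINE = {"b", "i", "em", "strong", "u", "tt"}

# Naive tokenizer: a tag (group 2 holds its name) or a run of text.
TOKEN = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>|[^<]+")


def fix_inline_nesting(markup):
    """Rewrite inline end tags so they close in strict nesting order."""
    out = []
    open_tags = []  # stack of inline elements that are currently open
    for match in TOKEN.finditer(markup):
        closing, name = match.group(1), match.group(2)
        if name is None or name.lower() not in INLINE:
            # Text, or a tag this sketch does not handle: copy through.
            out.append(match.group(0))
        elif not closing:
            open_tags.append(name.lower())
            out.append(match.group(0))
        elif open_tags:
            # Simplified rule: whatever inline end tag was written,
            # close the innermost open inline element.
            out.append("</%s>" % open_tags.pop())
        # A stray inline end tag with nothing open is dropped.
    while open_tags:
        # Close anything the author left open.
        out.append("</%s>" % open_tags.pop())
    return "".join(out)
```

Applied to the misnested paragraph above, this sketch produces the same corrected markup that Tidy is shown to generate. Real markup needs far more care (block elements, attributes containing ">", comments), which is exactly the work Tidy does for you.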
<h1><i>italic heading</h1> <p>new paragraph
In Netscape and Internet Explorer this causes everything following the heading to be in the heading font size, not the desired effect at all!
Tidy maps the example to
<h1><i>italic heading</i></h1> <p>new paragraph
<i><h1>heading</h1></i> <p>new paragraph <b>bold text <p>some more bold text
Tidy maps this to
<h1><i>heading</i></h1> <p>new paragraph <b>bold text</b> <p><b>some more bold text</b>
<h1><hr>heading</h1> <h2>sub<hr>heading</h2>
Tidy maps this to
<hr> <h1>heading</h1> <h2>sub</h2> <hr> <h2>heading</h2>
<a href="#refs">References<a>
Tidy maps this to
<a href="#refs">References</a>
<body> <li>1st list item <li>2nd list item
is mapped to
<body> <ul> <li>1st list item</li> <li>2nd list item</li> </ul>
Tidy inserts quote marks around all attribute values. It can also detect when you have forgotten the closing quote mark, although this is something you will have to fix yourself.
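To make the quoting step concrete, here is a minimal Python sketch that adds quote marks around unquoted attribute values. It is an illustration under stated assumptions, not Tidy's implementation: the helper name quote_attributes and both regular expressions are hypothetical, and the sketch assumes that no attribute value contains a ">" character.

```python
import re

# A start tag; assumes no ">" appears inside attribute values.
TAG = re.compile(r"<[a-zA-Z][^>]*>")

# name=value, where value is double-quoted, single-quoted, or bare.
ATTR = re.compile(r'''(\s+[a-zA-Z][-a-zA-Z0-9]*\s*=\s*)("[^"]*"|'[^']*'|[^\s"'>]+)''')


def quote_attributes(markup):
    """Wrap bare attribute values in double quotes, leaving the rest alone."""

    def fix_attr(match):
        value = match.group(2)
        if value[0] in "\"'":
            return match.group(0)  # already quoted
        return '%s"%s"' % (match.group(1), value)

    def fix_tag(match):
        return ATTR.sub(fix_attr, match.group(0))

    # Only text inside start tags is touched; "x=1" in running text is not.
    return TAG.sub(fix_tag, markup)
```

Because quoted values are consumed as a whole before bare ones are considered, something that merely looks like an attribute inside a quoted value is left untouched.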
Tidy has a comprehensive knowledge of the attributes defined in the HTML 4.0 recommendation from W3C. This often allows you to spot where you have mistyped an attribute or value.
Tidy will even work out which version of HTML you are using and insert the appropriate DOCTYPE declaration, as per the W3C recommendations.
Tidy also spots tags that lack a terminating >. This is something you then have to fix yourself, as Tidy is unsure of where the > should be inserted.
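As a rough idea of how a document type might be inferred from the elements a page uses, consider the Python sketch below. It is only an approximation: it looks at element names, ignores attributes, and distinguishes only the three HTML 4.0 document types, whereas Tidy's own rules are not described here. The function name guess_doctype and the two element sets are assumptions made for this example.

```python
import re

# The three public identifiers defined by the HTML 4.0 recommendation.
DOCTYPES = {
    "strict": '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">',
    "transitional": '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">',
    "frameset": '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Frameset//EN">',
}

# Illustrative, incomplete lists of elements that rule out a stricter type.
FRAMESET_ONLY = {"frameset", "frame"}
TRANSITIONAL_ONLY = {"font", "center", "basefont", "applet", "dir",
                     "menu", "isindex", "u", "s", "strike"}


def guess_doctype(markup):
    """Pick the strictest HTML 4.0 DOCTYPE consistent with the tags used."""
    tags = {name.lower()
            for name in re.findall(r"<([a-zA-Z][a-zA-Z0-9]*)", markup)}
    if tags & FRAMESET_ONLY:
        return DOCTYPES["frameset"]
    if tags & TRANSITIONAL_ONLY:
        return DOCTYPES["transitional"]
    return DOCTYPES["strict"]
```

A page that uses CENTER would be labelled Transitional by this sketch, while one that sticks to structural markup would be labelled Strict.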
You can choose which style you want Tidy to use when it generates the cleaned up markup: for instance whether you like elements to indent their contents or not.
Tidy uses UTF-8 internally to represent character values. The full set of HTML 4.0 entities is defined. Cleaned-up output uses HTML entity names for characters when appropriate. Otherwise, characters outside the normal ASCII range are output as numeric character references. Support for a range of character encodings is under development, and offers of help are welcomed.
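The output rule described here (entity names where one exists, numeric forms otherwise) can be sketched in a few lines of Python using the HTML 4 entity table that ships with the standard library. The function name encode_text is hypothetical, and the sketch leaves ASCII characters, including markup-significant ones such as & and <, untouched.

```python
from html.entities import codepoint2name  # HTML 4 entity names by code point


def encode_text(text):
    """Replace non-ASCII characters with named or numeric references."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)  # plain ASCII is passed through unchanged
        elif cp in codepoint2name:
            out.append("&%s;" % codepoint2name[cp])  # e.g. &eacute;
        else:
            out.append("&#%d;" % cp)  # no HTML 4 name: numeric form
    return "".join(out)
```

For instance, an e with an acute accent has an HTML 4 name and comes out as &eacute;, while a character with no HTML 4 name, such as the Polish L with stroke (code point 321), falls back to the numeric form.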
Tidy offers advice on accessibility problems for people using non-graphical browsers. The most common thing you will see is the suggestion you add a summary attribute to table elements. The idea is to provide a summary of the table's role and structure suitable for use with aural browsers.
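A check of this kind, reported with line and column in the way Tidy lists its findings, might look like the following Python sketch. It relies on the standard library's HTMLParser rather than on Tidy itself, and the names SummaryChecker and check_tables, as well as the wording of the message, are invented for the example.

```python
from html.parser import HTMLParser


class SummaryChecker(HTMLParser):
    """Collects a warning for every table that has no summary attribute."""

    def __init__(self):
        super().__init__()
        self.warnings = []

    def handle_starttag(self, tag, attrs):
        if tag == "table" and "summary" not in dict(attrs):
            # getpos() gives a 1-based line and a 0-based column offset.
            line, offset = self.getpos()
            self.warnings.append(
                (line, offset + 1, "<table> lacks a summary attribute"))


def check_tables(markup):
    """Return (line, column, message) tuples for tables without a summary."""
    checker = SummaryChecker()
    checker.feed(markup)
    checker.close()
    return checker.warnings
```

Each tuple in the result can be printed as a warning, leaving it to the author to write a summary that actually describes the table.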
If you are switching to style sheets, you will no longer want FONT, NOBR and CENTER elements in your markup. Tidy will obligingly remove them if you ask.
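The simplest possible version of this clean-up just drops the tags and keeps their content, as in the Python sketch below. It makes no claim about what Tidy does with the presentation information it removes; the pattern and the function name strip_presentational are assumptions for illustration.

```python
import re

# Start and end tags of the three presentational elements, any letter case.
PRESENTATIONAL = re.compile(r"</?(?:font|nobr|center)\b[^>]*>", re.IGNORECASE)


def strip_presentational(markup):
    """Delete FONT, NOBR and CENTER tags while keeping the text inside them."""
    return PRESENTATIONAL.sub("", markup)
```

The styling these elements carried then has to be supplied by your style sheet instead.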
XML processors compliant with W3C's XML 1.0 recommendation are very picky about which files they will accept. Tidy can help you to fix errors that cause your XML files to be rejected.
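To see just how picky an XML processor is, you can hand a fragment to any conforming parser. The Python sketch below uses the standard library's ElementTree as a stand-in for "an XML processor"; the helper name is_well_formed is made up for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if an XML parser accepts the text, False if it rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False
```

Markup that browsers accept without complaint, such as a BR with no end tag or an attribute value without quote marks, is rejected outright by the parser.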
This is the indented layout style:
<html>
  <head>
  </head>
  <body>
    <p>
      para which has enough text to cause a line break, and so test
      the wrapping mechanism for long lines.
    </p>
<pre>This is <em>genuine preformatted</em> text </pre>
    <ul>
      <li>
        1st list item
      </li>
      <li>
        2nd list item
      </li>
    </ul>
    <!-- end comment -->
  </body>
</html>
and this is the default style:
<html>
<head>
</head>
<body>
<p>para which has enough text to cause a line break, and so test
the wrapping mechanism for long lines.</p>
<pre>This is <em>genuine preformatted</em> text </pre>
<ul>
<li>1st list item</li>
<li>2nd list item</li>
</ul>
<!-- end comment -->
</body>
</html>
The code is in ANSI C and uses the C standard library for I/O. The parser is thread-safe, although the code for pretty printing the parse tree is not (yet). The parser works top-down, building a complete parse tree in memory. Document text is held in an expanding character array. The code has so far been tested on Windows 95, Windows NT, Linux, SunOS, Solaris and HP-UX.
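One way to picture a parse tree whose text lives in a single growing array is sketched below in Python. This is purely an illustration of the idea: the class names Node and Document, the use of start and end offsets, and the choice of a bytearray are all assumptions, and Tidy's real data structures in C may differ.

```python
class Node:
    """A tree node; text nodes refer to a slice of the shared buffer."""

    def __init__(self, kind, name=None, start=0, end=0):
        self.kind = kind      # "element" or "text"
        self.name = name      # element name, None for text nodes
        self.start = start    # byte offsets into Document.buffer
        self.end = end
        self.children = []


class Document:
    """Holds all document text in one expanding byte array."""

    def __init__(self):
        self.buffer = bytearray()             # grows as text is appended
        self.root = Node("element", "html")

    def add_element(self, parent, name):
        node = Node("element", name)
        parent.children.append(node)
        return node

    def add_text(self, parent, text):
        start = len(self.buffer)
        self.buffer.extend(text.encode("utf-8"))  # text is stored as UTF-8
        node = Node("text", start=start, end=len(self.buffer))
        parent.children.append(node)
        return node

    def text_of(self, node):
        return self.buffer[node.start:node.end].decode("utf-8")
```

Keeping offsets rather than separate strings means each node stays small, and the whole document's text can be appended to one array as parsing proceeds.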
You can read more about Tidy and download the source code and binaries for common platforms from: http://www.w3.org/People/Raggett/tidy
Dave Raggett (dsr@w3.org) is an engineer at Hewlett Packard's UK Laboratories and works on assignment to the World Wide Web Consortium, where he is the W3C lead for HTML.