Clean up your Web pages with HP's HTML Tidy

Dave Raggett
Hewlett Packard Laboratories,
Filton Road, Stoke Gifford, Bristol BS12 6QZ, U.K.
dsr@w3.org

Keywords: HTML; Validation; Error correction; Pretty-printing

1. Introduction to Tidy

When editing HTML it is easy to make mistakes. Wouldn't it be nice if there were a simple way to fix these mistakes automatically and tidy up sloppy editing into nicely laid out markup? Well now there is, thanks to Dave Raggett of HP Labs. HTML Tidy is a free utility for doing just that. It also works well on the atrociously hard-to-read markup generated by specialized HTML editors and conversion tools, and can help you identify where you need to pay further attention to making your pages more accessible to people with disabilities.

Tidy is able to fix up a wide range of problems and to bring to your attention things that you need to work on yourself. Each item found is listed with the line number and column so that you can see where the problem lies in your markup. Tidy will not generate a cleaned-up version when there are problems that it is not sure how to handle. These are logged as "errors" rather than "warnings".
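
The distinction between warnings and errors can be pictured with a small data structure. The following Python sketch is only an illustration: the Diagnostic class, the report format and the sample messages are hypothetical and are not taken from Tidy itself, which is written in C.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Diagnostic:
    line: int       # 1-based line number in the input
    column: int     # 1-based column number in the input
    severity: str   # "warning" or "error"
    message: str


def format_diagnostic(d: Diagnostic) -> str:
    # Hypothetical report format, not Tidy's actual wording.
    return f"line {d.line} column {d.column} - {d.severity}: {d.message}"


def can_write_output(diagnostics: List[Diagnostic]) -> bool:
    # Cleaned-up markup is produced only if nothing was logged as an error.
    return all(d.severity != "error" for d in diagnostics)


# Illustrative sample report (invented messages).
report = [
    Diagnostic(2, 4, "warning", "end tag did not match, corrected"),
    Diagnostic(7, 1, "error", "tag lacks a terminating '>'"),
]

for item in report:
    print(format_diagnostic(item))
print("write cleaned-up file:", can_write_output(report))
```

With the sample report above, the single error is enough to suppress the cleaned-up output, while a report containing only warnings would allow it.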

1.1. Examples of Tidy at work

Here are just a few examples of how Tidy perfects your HTML for you:

Missing or mismatched end tags are detected and corrected.

   <h1>heading
   <h2>subheading</h3>

is mapped to

   <h1>heading</h1>
   <h2>subheading</h2>
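
The heading repair above can be sketched in Python with the standard library's html.parser module. This is an illustration of the idea only, not Tidy's own algorithm: the HeadingFixer class and fix_headings function are hypothetical names, only heading elements are repaired, and comments and declarations are dropped for brevity.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingFixer(HTMLParser):
    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.open_heading = None  # name of the heading currently open

    def close_open_heading(self):
        if self.open_heading is None:
            return
        trailing = ""
        if self.out:
            # Keep trailing whitespace outside the inserted end tag.
            stripped = self.out[-1].rstrip()
            trailing = self.out[-1][len(stripped):]
            self.out[-1] = stripped
        self.out.append(f"</{self.open_heading}>{trailing}")
        self.open_heading = None

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            # A new heading implies the previous one has ended.
            self.close_open_heading()
            self.open_heading = tag
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in HEADINGS:
            # Any heading end tag closes the heading that is actually open;
            # a stray heading end tag with nothing open is discarded.
            self.close_open_heading()
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def fix_headings(markup: str) -> str:
    fixer = HeadingFixer()
    fixer.feed(markup)
    fixer.close()
    fixer.close_open_heading()  # close a heading left open at end of input
    return "".join(fixer.out)


print(fix_headings("<h1>heading\n<h2>subheading</h3>"))
```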

End tags in the wrong order are corrected.

   <p>here is a para <b>bold <i>bold italic</b> bold?</i> normal?

is mapped to

   <p>here is a para <b>bold <i>bold italic</i> bold?</b> normal?

Fixes problems with heading emphasis.
```
   <h1><i>italic heading</h1>
   <p>new paragraph
```
In Netscape and Internet Explorer this causes everything following the heading to be in the heading font size, not the desired effect at all!

Tidy maps the example to
```
   <h1><i>italic heading</i></h1>
   <p>new paragraph
```

Recovers from mixed up tags.

   <i><h1>heading</h1></i>
   <p>new paragraph <b>bold text
   <p>some more bold text

Tidy maps this to

   <h1><i>heading</i></h1>
   <p>new paragraph <b>bold text</b>
   <p><b>some more bold text</b>

Getting the <hr> in the right place.

   <h1><hr>heading</h1>
   <h2>sub<hr>heading</h2>

Tidy maps this to

   <hr>
   <h1>heading</h1>
   <h2>sub</h2>
   <hr>
   <h2>heading</h2>

Adding the missing "/" in end tags.

   <a href="#refs">References<a>

Tidy maps this to

   <a href="#refs">References</a>

Perfecting lists by putting in the tags that were left out.

   <body>
   <li>1st list item
   <li>2nd list item

is mapped to

   <body>
   <ul>
   <li>1st list item</li>
   <li>2nd list item</li>
   </ul>

Missing quotes around attribute values are added.

Tidy inserts quote marks around all attribute values. It can also detect when you have forgotten the closing quote mark, although this is something you will have to fix yourself.

Unknown/Proprietary attributes are reported.

Tidy has a comprehensive knowledge of the attributes defined in the HTML 4.0 recommendation from W3C. This often allows you to spot where you have mistyped an attribute or value.

Proprietary elements are recognized and reported as such.

Tidy will even work out which version of HTML you are using and insert the appropriate DOCTYPE declaration, as per the W3C recommendations.

Tags lacking a terminating ">" are spotted.

This is something you then have to fix yourself as Tidy is unsure of where the ">" should be inserted.
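
Of the fixes listed above, adding quote marks around attribute values is easy to picture in code. The Python sketch below uses the standard html.parser module to re-emit each start tag with quoted values; it is an illustration rather than Tidy's implementation, the AttributeQuoter class and quote_attributes function are hypothetical names, and comments and declarations are dropped for brevity.

```python
import html
from html.parser import HTMLParser


class AttributeQuoter(HTMLParser):
    def __init__(self):
        # Keep character references in text as written.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        parts = [tag]
        for name, value in attrs:
            if value is None:
                # Attribute written without a value, e.g. "compact".
                parts.append(name)
            else:
                # The parser hands over the unquoted, decoded value,
                # so re-escape it before adding the quote marks.
                parts.append(f'{name}="{html.escape(value, quote=True)}"')
        self.out.append("<" + " ".join(parts) + ">")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def quote_attributes(markup: str) -> str:
    quoter = AttributeQuoter()
    quoter.feed(markup)
    quoter.close()
    return "".join(quoter.out)


print(quote_attributes('<img src=logo.gif width=100 alt="HP logo">'))
```

The parser lowercases tag and attribute names, so the sketch normalizes their case as a side effect.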

1.2. Layout style

You can choose which style you want Tidy to use when it generates the cleaned up markup: for instance whether you like elements to indent their contents or not.

1.3. Internationalization issues

Tidy uses UTF-8 internally to represent character values. The full set of HTML 4.0 entities is defined. Cleaned up output uses HTML entity names for characters when appropriate. Otherwise characters outside the normal ASCII range are output as numeric character references. Support for a range of character encodings is under development and offers of help are welcomed.
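
The output rule described above can be sketched in a few lines of Python. The sketch assumes the text is already decoded into a Python string and relies on the HTML 4 entity table shipped in the standard html.entities module; the function name encode_for_output is hypothetical, and ASCII characters (including markup characters) are passed through untouched.

```python
from html.entities import codepoint2name


def encode_for_output(text: str) -> str:
    pieces = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            # Plain ASCII is written as-is.
            pieces.append(ch)
        elif code in codepoint2name:
            # A named HTML 4 entity exists for this character.
            pieces.append(f"&{codepoint2name[code]};")
        else:
            # Fall back to a numeric character reference.
            pieces.append(f"&#{code};")
    return "".join(pieces)


print(encode_for_output("caf\u00e9"))   # e-acute has a named entity
print(encode_for_output("Erd\u0151s"))  # o with double acute does not
```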

1.4. Accessibility

Tidy offers advice on accessibility problems for people using non-graphical browsers. The most common thing you will see is the suggestion you add a summary attribute to table elements. The idea is to provide a summary of the table's role and structure suitable for use with aural browsers.
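
A check of this kind can be sketched with Python's standard html.parser module, which also reports where each tag starts. The SummaryChecker class, the check_tables function and the message text are hypothetical and only illustrate the idea; they do not reproduce Tidy's own report.

```python
from html.parser import HTMLParser


class SummaryChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.warnings = []

    def handle_starttag(self, tag, attrs):
        if tag == "table" and "summary" not in dict(attrs):
            # getpos() gives a 1-based line and a 0-based offset
            # for the start of the tag being handled.
            line, offset = self.getpos()
            self.warnings.append(
                (line, offset + 1, "table lacks a summary attribute")
            )


def check_tables(markup: str):
    checker = SummaryChecker()
    checker.feed(markup)
    checker.close()
    return checker.warnings


sample = (
    "<p>intro</p>\n"
    "<table border=1>\n"
    "<tr><td>x</td></tr>\n"
    "</table>\n"
    '<table summary="Sales by region"><tr><td>y</td></tr></table>'
)

for line, column, message in check_tables(sample):
    print(f"line {line} column {column}: {message}")
```

Only the first table in the sample is flagged, since the second one carries a summary attribute.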

1.5. Getting rid of those FONT tags

If you want to switch to using style sheets, you will not want FONT, NOBR and CENTER elements in your markup. Tidy will obligingly remove them if you ask.
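
Stripping these elements while keeping their content can be sketched as follows in Python, using the standard html.parser module. This is a simplified illustration, not Tidy's implementation: the names PresentationStripper and strip_presentational are hypothetical, the tags are simply dropped, and comments and declarations are not preserved.

```python
from html.parser import HTMLParser

PRESENTATIONAL = {"font", "nobr", "center"}


class PresentationStripper(HTMLParser):
    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in PRESENTATIONAL:
            # Re-emit the start tag exactly as it appeared in the input.
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in PRESENTATIONAL:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def strip_presentational(markup: str) -> str:
    stripper = PresentationStripper()
    stripper.feed(markup)
    stripper.close()
    return "".join(stripper.out)


print(strip_presentational(
    '<p><font face="Arial">Hello <nobr>big world</nobr></font> &amp; more</p>'
))
```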

1.6. Future releases

Future releases may address:

Recursion through subdirectories, so you can fix up your entire Web site at one go!
Full validation of all attribute values.
Full support for parsing XML (currently rather limited).
A way to say which XML elements should be printed "inline".
Mapping between HTML presentation attributes/elements and CSS.

1.7. Support for XML

XML processors compliant with W3C's XML 1.0 recommendation are very picky about which files they will accept. Tidy can help you to fix errors that cause your XML files to be rejected.
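
How picky a conforming XML parser is can be seen with Python's standard xml.etree.ElementTree module, which rejects markup that HTML browsers would happily accept. The helper name first_xml_error is hypothetical; the sketch only demonstrates the rejection and does not perform any repair.

```python
import xml.etree.ElementTree as ET


def first_xml_error(document: str):
    """Return (line, column) of the first well-formedness error, or None."""
    try:
        ET.fromstring(document)
    except ET.ParseError as exc:
        return exc.position
    return None


# A missing end tag and an unquoted attribute value are both fatal in XML.
print(first_xml_error("<doc><p>unclosed</doc>"))
print(first_xml_error("<doc><img src=a.gif /></doc>"))
# The repaired document is accepted.
print(first_xml_error('<doc><p>closed</p><img src="a.gif" /></doc>'))
```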

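1.8. Indenting text for a better layout

Tidy can indent the content of elements to make the structure of the markup easier to see. This is the indented style:
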
 <html>
   <head>
   </head>
   <body>
     <p>
       para which has enough text to cause a line break, and so test
       the wrapping mechanism for long lines.
     </p>
 <pre>This is
 <em>genuine
       preformatted</em>
    text
 </pre>
     <ul>
       <li>
         1st list item 
       </li>
       <li>
         2nd list item
       </li>
     </ul>
     <!-- end comment -->
   </body>
 </html>

This is the default style, without indentation:

 <html>
 <head>
 </head>
 <body>
 <p>para which has enough text to cause a line break, and so test
 the wrapping mechanism for long lines.</p>
 <pre>This is
 <em>genuine
       preformatted</em>
    text
 </pre>
 <ul>
 <li>1st list item </li>
 <li>2nd list item</li>
 </ul>
 <!-- end comment -->
 </body>
 </html>

1.9. Implementation details

The code is in ANSI C and uses the C standard library for I/O. The parser is thread-safe although the code for pretty-printing the parse tree is not (yet). The parser works top-down, building a complete parse tree in memory. Document text is held in an expanding character array. The code has so far been tested on Windows 95, Windows NT, Linux, SunOS, Solaris and HP-UX.
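
The combination of an in-memory parse tree and a single growing text array can be pictured with the following Python sketch. It rests on an assumption that the paragraph above does not state: that text nodes refer to their content by start and end offsets into the shared array. The TextBuffer and Node classes are hypothetical and do not mirror the C data structures.

```python
from dataclasses import dataclass, field
from typing import List, Optional


class TextBuffer:
    """Growable array of UTF-8 bytes shared by all text nodes."""

    def __init__(self):
        self.chars = bytearray()

    def append(self, text: str):
        # Store the text and return the (start, end) offsets it occupies.
        start = len(self.chars)
        self.chars.extend(text.encode("utf-8"))
        return start, len(self.chars)

    def slice(self, start: int, end: int) -> str:
        return self.chars[start:end].decode("utf-8")


@dataclass
class Node:
    tag: Optional[str] = None  # element name, or None for a text node
    start: int = 0             # offsets into the buffer (text nodes only)
    end: int = 0
    children: List["Node"] = field(default_factory=list)


def serialize(node: Node, buf: TextBuffer) -> str:
    # Walk the tree top-down and write it back out as markup.
    if node.tag is None:
        return buf.slice(node.start, node.end)
    inner = "".join(serialize(child, buf) for child in node.children)
    return f"<{node.tag}>{inner}</{node.tag}>"


buf = TextBuffer()
h1 = Node("h1", children=[Node(None, *buf.append("heading"))])
p = Node("p", children=[Node(None, *buf.append("caf\u00e9"))])
body = Node("body", children=[h1, p])

print(serialize(body, buf))
```

Keeping all text in one array means the tree nodes stay small, at the cost of an extra lookup whenever the text is written out.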

You can read more about Tidy and download the source code and binaries for common platforms from: http://www.w3.org/People/Raggett/tidy

Dave Raggett (dsr@w3.org) is an engineer at Hewlett Packard's UK Laboratories, and works on assignment to the World Wide Web Consortium, where he is the W3C lead for HTML.