Preprocessing
§HTML parsing to obtain the document tree
–Java Swing parser, W3C HTML 3.2 DTD
§Extract leaf <table> elements
§Pseudo-rendering to obtain accurate row/column counts
–Only considering <tr>, <td>, <th> not sufficient
–Other markup: rowspan and colspan attributes, <br> tags
(1.5)
Before feature extraction, each HTML document has to be preprocessed. There are three main steps.
First, the page is parsed to obtain the document tree, which describes the HTML structure of the page. We then extract the leaf table elements from the tree, that is, table elements that contain no nested table. We decided to concentrate on leaf table elements because, from our observations, almost all genuine tables are represented by leaf table elements. Finally, each table element is analyzed in a pseudo-rendering process to obtain accurate row and column counts. This is necessary because simply counting tr and td tags is not sufficient. Many other tags can affect the layout of a table element, including … The process is similar to what a browser uses to lay out a table element, and it gives us a more accurate description of the physical structure of the table.
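The leaf-table extraction and pseudo-rendering steps can be sketched roughly as follows. This is a minimal illustration, not the system described here: it swaps in Python's standard `html.parser` for the Java Swing parser, the names `LeafTableGrid` and `leaf_table_shapes` are hypothetical, and it only models `rowspan` and `colspan` (the effect of `<br>` and other tags is not handled). Each cell is placed on a grid of occupied positions, and the row and column counts are read off the grid rather than from the raw tag counts.

```python
from html.parser import HTMLParser


class LeafTableGrid(HTMLParser):
    """Collect (rows, cols) for every leaf <table>, honoring rowspan/colspan."""

    def __init__(self):
        super().__init__()
        self.stack = []    # one record per currently open <table>
        self.results = []  # (rows, cols) per leaf table, in closing order

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.stack:
                self.stack[-1]["leaf"] = False  # enclosing table is not a leaf
            self.stack.append({"leaf": True, "occupied": set(), "row": -1})
        elif tag == "tr" and self.stack:
            self.stack[-1]["row"] += 1
        elif tag in ("td", "th") and self.stack:
            self._place_cell(self.stack[-1], dict(attrs))

    def handle_endtag(self, tag):
        if tag == "table" and self.stack:
            table = self.stack.pop()
            if table["leaf"]:
                cells = table["occupied"]
                rows = 1 + max((r for r, _ in cells), default=-1)
                cols = 1 + max((c for _, c in cells), default=-1)
                self.results.append((rows, cols))

    @staticmethod
    def _span(attrs, name):
        # Missing or malformed span values fall back to 1.
        try:
            return max(1, int(attrs.get(name) or 1))
        except ValueError:
            return 1

    def _place_cell(self, table, attrs):
        if table["row"] < 0:
            table["row"] = 0  # cell appeared before any explicit <tr>
        r, c = table["row"], 0
        # Skip positions already taken by a rowspan from an earlier row.
        while (r, c) in table["occupied"]:
            c += 1
        for dr in range(self._span(attrs, "rowspan")):
            for dc in range(self._span(attrs, "colspan")):
                table["occupied"].add((r + dr, c + dc))


def leaf_table_shapes(html):
    """Return the pseudo-rendered (rows, cols) of each leaf table in html."""
    parser = LeafTableGrid()
    parser.feed(html)
    parser.close()
    return parser.results


sample = """
<table><tr><td>
  <table>
    <tr><th rowspan="2">A</th><th colspan="2">B</th></tr>
    <tr><td>c</td><td>d</td></tr>
  </table>
</td></tr></table>
"""
print(leaf_table_shapes(sample))
```

In the sample, the outer table is skipped because it contains another table. Counting tags alone would suggest two cells per row for the inner table, whereas placing the spanned cells on the grid yields three columns.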