A Machine Learning Based Approach for Table Detection on The Web
WWW2002, May 8, 2002
Motivation
Tables are common in HTML documents
Table understanding is important for:
Knowledge management
Web mining
Summarization and mobile access
<table></table> tags cannot be trusted
Table detection
Is a given <table> element a genuine table?
Definition
Genuine table
A document entity where a two-dimensional grid structure is semantically significant in conveying the logical relations among the cells.
Non-genuine table
A <table> element that does not represent a genuine table.
Examples
Comparison to Previous Work
Previous Work
Mostly based on heuristics
Limited testing
Tens to hundreds of samples
Domain specific (airlines, news, etc.)
Our approach
Machine learning based, trainable
Large scale testing (over 10,000 samples)
Main Contributions
Novel features capturing both layout and content
Collection/labeling of a large, diverse database
Testing of two popular classification schemes
Preprocessing
HTML parsing to obtain the document tree
Java Swing parser, W3C HTML 3.2 DTD
Extract leaf <table> elements
Pseudo-rendering to obtain accurate row/column counts
Considering only <tr>, <td>, <th> is not sufficient
Also needed: rowspan and colspan attributes, <br> tags
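The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the presented system: it swaps in Python's standard html.parser for the Java Swing parser, and the names LeafTableParser, _span, and pseudo_render are hypothetical. It assumes a leaf table is one with no nested <table>, and that pseudo-rendering means expanding rowspan and colspan into a grid. Handling of <br> is omitted.

```python
from html.parser import HTMLParser


def _span(value):
    """Parse a rowspan/colspan attribute, falling back to 1 on bad input."""
    try:
        return max(1, int(value))
    except (TypeError, ValueError):
        return 1


class LeafTableParser(HTMLParser):
    """Collect leaf <table> elements (tables containing no nested <table>)."""

    def __init__(self):
        super().__init__()
        self.stack = []        # tables currently open, innermost last
        self.leaf_tables = []  # rows of each finished leaf table

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table":
            if self.stack:
                self.stack[-1]["leaf"] = False  # parent now has a nested table
            self.stack.append({"rows": [], "leaf": True})
        elif not self.stack:
            return  # ignore everything outside tables
        elif tag == "tr":
            self.stack[-1]["rows"].append([])
        elif tag in ("td", "th"):
            rows = self.stack[-1]["rows"]
            if not rows:  # cell without an explicit <tr>
                rows.append([])
            rows[-1].append({
                "rowspan": _span(attrs.get("rowspan")),
                "colspan": _span(attrs.get("colspan")),
                "text": "",
            })

    def handle_endtag(self, tag):
        if tag == "table" and self.stack:
            table = self.stack.pop()
            if table["leaf"]:
                self.leaf_tables.append(table["rows"])

    def handle_data(self, data):
        # Attach text to the most recent cell of the innermost open table.
        if self.stack:
            rows = self.stack[-1]["rows"]
            if rows and rows[-1]:
                rows[-1][-1]["text"] += data


def pseudo_render(rows):
    """Expand rowspan/colspan so each grid position holds its cell's text."""
    grid = {}  # (row, col) -> text
    for r, cells in enumerate(rows):
        c = 0
        for cell in cells:
            while (r, c) in grid:  # skip positions filled by earlier rowspans
                c += 1
            height = min(cell["rowspan"], len(rows) - r)  # clip to the table
            for dr in range(height):
                for dc in range(cell["colspan"]):
                    grid[(r + dr, c + dc)] = cell["text"].strip()
            c += cell["colspan"]
    rendered = []
    for r in range(len(rows)):
        cols = sorted(col for (row, col) in grid if row == r)
        rendered.append([grid[(r, col)] for col in cols])
    return rendered
```

For a two-row table whose first cell has rowspan="2", pseudo_render returns two rows of two cells each, whereas counting <td> tags alone would report two cells in the first row and one in the second.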
Features: Overview
Layout features
Content type features
Text content feature
Total of 16
Features: Layout
Average # columns and standard deviation
Average # rows and standard deviation
Average overall cell length
Cumulative Length Consistency (CLC)
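The first three layout features could be computed along these lines. This is a sketch under stated assumptions rather than the exact definitions used in the talk: "average # columns" is taken as the mean number of cells per row, "average # rows" as the mean number of cells per column, cell length as character count, and the standard deviation as the population form. The function name layout_features and its dictionary keys are hypothetical, and CLC is not covered.

```python
from statistics import mean, pstdev


def _mean_std(values):
    """Return (mean, population std dev), or zeros for an empty list."""
    if not values:
        return 0.0, 0.0
    return mean(values), pstdev(values)


def layout_features(grid):
    """Compute simple layout statistics for a table.

    grid is a list of rows, each row a list of cell strings; rows may have
    different lengths (assumed input format).
    """
    row_lengths = [len(row) for row in grid]  # cells in each row
    n_cols = max(row_lengths, default=0)
    # Cells in each column: rows long enough to reach column j.
    col_lengths = [sum(1 for row in grid if len(row) > j) for j in range(n_cols)]
    cell_lengths = [len(text) for row in grid for text in row]

    avg_cols, std_cols = _mean_std(row_lengths)
    avg_rows, std_rows = _mean_std(col_lengths)
    avg_cell_len, _ = _mean_std(cell_lengths)
    return {
        "avg_cols": avg_cols,
        "std_cols": std_cols,
        "avg_rows": avg_rows,
        "std_rows": std_rows,
        "avg_cell_len": avg_cell_len,
    }
```

A perfectly regular grid yields zero for both standard deviations, so larger values indicate a ragged layout.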
Layout Features: CLC
Features: Content Type
Content Type: CTC
Features: Text Content
Word Group Feature
Word Group Feature (Cont’d)
Classifiers
Classifiers: Decision Tree
Classifiers: Support Vector Machine
Data Collection and Labeling
Experimental Setup
Results: Feature Groups
Results: Classifiers
Comparison to Rule Based
Examples: I
Examples: II
Examples: III
Future Directions
Incorporate more cases in pseudo-rendering
  (including non-table elements?)
Deeper language analysis (for both detection and interpretation)
Interpretation (title, headings, etc.)