A Machine Learning Based Approach for Table Detection on The Web


	WWW2002 May 8, 2002

Motivation


	Tables are common in HTML documents
	Table understanding is important for
		Knowledge management
		Web mining
		Summarization and mobile access
	<table> tags cannot be trusted: many are used only for page layout
	Table detection
		Is a given <table> element a genuine table?

Definition


	Genuine table
		A document entity in which a two-dimensional grid structure is semantically significant in conveying the logical relations among the cells.

	Non-genuine table
		A <table> element that does not represent a genuine table

Examples

Comparison to Previous Work


	Previous Work
		Mostly based on heuristics
		Limited testing
			Tens to hundreds of samples
			Domain specific (airlines, news, etc.)

	Our approach
		Machine learning based, trainable
		Large scale testing (over 10,000 samples)

Main Contributions



	Novel features capturing both layout and content

	Collection/labeling of a large, diverse database

	Testing of two popular classification schemes

Preprocessing


	HTML parsing to obtain the document tree
		Java Swing parser, W3C HTML 3.2 DTD
	Extract leaf <table> elements
	Pseudo-rendering to obtain accurate row/column counts
		Counting only <tr>, <td>, <th> is not sufficient
		Must also handle rowspan and colspan attributes and <br> tags
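
The two preprocessing steps above can be sketched as follows. This is a minimal illustration in Python, not the Java Swing based implementation the slides refer to. The names `LeafTableFinder`, `leaf_table_indices` and `grid_shape` are hypothetical, and the handling of spans is an assumed simplification (no <br> handling, no overlap checks).

```python
from html.parser import HTMLParser


class LeafTableFinder(HTMLParser):
    """Records, for each <table> in document order, whether it contains a nested <table>."""

    def __init__(self):
        super().__init__()
        self.stack = []      # indices of the tables that are currently open
        self.has_child = []  # has_child[i] is True if table i contains another table

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.stack:
                self.has_child[self.stack[-1]] = True
            self.stack.append(len(self.has_child))
            self.has_child.append(False)

    def handle_endtag(self, tag):
        if tag == "table" and self.stack:
            self.stack.pop()


def leaf_table_indices(html):
    """Return the document-order indices of <table> elements with no nested table."""
    finder = LeafTableFinder()
    finder.feed(html)
    finder.close()
    return [i for i, nested in enumerate(finder.has_child) if not nested]


def grid_shape(rows):
    """Return (n_rows, n_cols) of the rendered grid.

    rows: one list per <tr>, holding a (rowspan, colspan) pair per <td>/<th>.
    """
    occupied = set()  # (row, col) positions already covered by some cell
    n_rows, n_cols = len(rows), 0
    for r, cells in enumerate(rows):
        c = 0
        for rowspan, colspan in cells:
            while (r, c) in occupied:  # skip slots filled by a rowspan from above
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    occupied.add((r + dr, c + dc))
            n_rows = max(n_rows, r + rowspan)
            n_cols = max(n_cols, c + colspan)
            c += colspan
    return n_rows, n_cols
```

For example, a first row holding a cell with rowspan=2 followed by a cell with colspan=2, and a second row holding two plain cells, contains two cells per row but renders as a 2 x 3 grid. This is why counting <td> elements alone gives the wrong column count.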

Features: Overview


	Layout features

	Content type features

	Text content feature

	Total of 16 features

Features: Layout


	Average # columns and standard deviation

	Average # rows and standard deviation

	Average overall cell length

	Cumulative Length Consistency (CLC)
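
As a rough illustration of the first three layout features, the sketch below computes them for a table given as a list of rows of cell strings. It assumes one plausible reading of the slide: "columns" is the number of cells per row, "rows" is the number of cells per column, and cell length is measured in characters. The function name `layout_features` and the dictionary keys are hypothetical, and CLC is left out because its definition is not given in this outline.

```python
from statistics import mean, pstdev


def layout_features(table):
    """table: non-empty list of rows, each a list of cell-text strings (rows may be ragged)."""
    cols_per_row = [len(row) for row in table]
    width = max(cols_per_row)
    # Number of cells present in each column position (assumed meaning of "rows").
    rows_per_col = [sum(1 for row in table if len(row) > j) for j in range(width)]
    cell_lengths = [len(cell) for row in table for cell in row]
    return {
        "avg_cols": mean(cols_per_row),
        "std_cols": pstdev(cols_per_row),
        "avg_rows": mean(rows_per_col) if rows_per_col else 0.0,
        "std_rows": pstdev(rows_per_col) if rows_per_col else 0.0,
        "avg_cell_len": mean(cell_lengths) if cell_lengths else 0.0,
    }


features = layout_features([["Name", "Age"], ["Al", "30"], ["Bo"]])
```

A regular grid gives zero standard deviation for both counts, so larger values hint at the ragged structure typical of tables used for layout.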

Layout Features: CLC

Features: Content Type

Content Type: CTC

Features: Text Content

Word Group Feature

Word Group Feature (Cont’d)

Classifiers

Classifiers: Decision Tree

Classifiers: Support Vector Machine

Data Collection and Labeling

Experimental Setup

Results: Feature Groups

Results: Classifiers

Comparison to Rule Based

Examples: I

Examples: II

Examples: III

Future Directions


	Incorporate more cases in pseudo-rendering (including non-table elements?)

	Deeper language analysis (for both detection and interpretation)

	Interpretation (title, headings, etc.)