Data Collection and Labeling
§Google search engine
–Business, news and sports directories
–using keywords likely to be table-related
–   (e.g., schedule, value, results, …)
§Manually labeled 1,393 pages
–14,609 <table> elements
–11,477 leaf <table> elements: 1,740 genuine tables, 9,737 non-genuine tables
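The distinction between all <table> elements and leaf <table> elements can be sketched in code. This is a minimal illustration, not the procedure used to build the database: it assumes a leaf <table> is one that contains no nested <table>, and the names `LeafTableCounter` and `count_tables` are hypothetical. It uses only Python's standard `html.parser` module.

```python
from html.parser import HTMLParser


class LeafTableCounter(HTMLParser):
    """Count all <table> elements and those with no nested <table>.

    Assumption: "leaf" means the table contains no other <table>.
    """

    def __init__(self):
        super().__init__()
        self.total = 0
        self.leaf = 0
        # One flag per currently open <table>; set to True once a
        # nested <table> is seen inside it.
        self._has_child = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.total += 1
            if self._has_child:
                self._has_child[-1] = True  # the enclosing table is not a leaf
            self._has_child.append(False)

    def handle_endtag(self, tag):
        if tag == "table" and self._has_child:
            if not self._has_child.pop():
                self.leaf += 1


def count_tables(html):
    """Return (total <table> count, leaf <table> count) for an HTML string."""
    parser = LeafTableCounter()
    parser.feed(html)
    parser.close()
    return parser.total, parser.leaf
```

A table that is never closed in the markup is counted in the total but not as a leaf, so real-world pages with malformed HTML would need extra handling.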
Our goal here was to construct a large database covering as many different varieties of tables as possible. At the same time, we needed to ensure that the database contained a significant number of genuine tables. For this practical reason, we biased the data collection towards web pages that are more likely to contain genuine tables …
Even in this somewhat biased collection, genuine tables account for only about 15% of all leaf <table> elements (1,740 of 11,477).
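The counts on the slide can be checked with a few lines of arithmetic. The sketch below uses only the figures given above; the function name `class_shares` is hypothetical and serves only to keep the example self-contained.

```python
def class_shares(genuine, non_genuine):
    """Return the leaf-table total and the fraction in each class."""
    leaf_total = genuine + non_genuine
    return {
        "leaf_total": leaf_total,
        "genuine_share": genuine / leaf_total,
        "non_genuine_share": non_genuine / leaf_total,
    }


# Counts taken from the slide.
shares = class_shares(genuine=1_740, non_genuine=9_737)

# The two classes should add up to the reported number of leaf tables.
assert shares["leaf_total"] == 11_477

print(f"genuine:     {shares['genuine_share']:.1%}")
print(f"non-genuine: {shares['non_genuine_share']:.1%}")
```

The imbalance matters for evaluation: a classifier that labels every leaf table as non-genuine would already be correct on roughly 85% of this collection, so plain accuracy says little about how well genuine tables are detected.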