14
Word Group Feature
• Set of words:
• Word counts:
• Base weights:
(1.5)
Its definition is quite involved, so I will explain it in the simplest terms to give you a flavor of it. A more accurate description can be found in our paper.
First, we compose a set of words extracted from all the tables in a training set, after preprocessing such as stemming and stop-word removal. For each word in the set, we get the following counts … From these counts we derive two base weights for each word, one for the genuine table class and one for the non-genuine table class. Each weight is defined as the word's term frequency in its own class, modified by the ratio between its document frequency in its own class and its document frequency in the opposite class. Those familiar with IR will notice that this is inspired by the standard tf-idf measure used in IR.
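To make the idea concrete, here is a minimal Python sketch of the base weights as described in words above. It is not the exact formula from the paper. It assumes that "modified by the ratio" means multiplying the term frequency by the document-frequency ratio, and it adds a smoothing constant to avoid division by zero. The names base_weights, tokenize, STOP_WORDS and smoothing are hypothetical, and stemming is left out for brevity.

```python
from collections import Counter

# Hypothetical stop-word list; a real system would use a fuller list and a stemmer.
STOP_WORDS = {"the", "of", "a", "and", "in"}


def tokenize(text):
    """Lower-case, split on whitespace, and drop stop words (no stemming here)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def base_weights(genuine_tables, non_genuine_tables, smoothing=1.0):
    """Return {word: (weight_genuine, weight_non_genuine)}.

    Assumption: weight = tf in own class * (df in own class / df in opposite
    class), with additive smoothing on both document frequencies.
    """
    tf = {"g": Counter(), "n": Counter()}  # term frequency per class
    df = {"g": Counter(), "n": Counter()}  # number of tables containing the word
    for label, tables in (("g", genuine_tables), ("n", non_genuine_tables)):
        for text in tables:
            words = tokenize(text)
            tf[label].update(words)
            df[label].update(set(words))  # count each table at most once

    weights = {}
    for w in set(tf["g"]) | set(tf["n"]):
        w_g = tf["g"][w] * (df["g"][w] + smoothing) / (df["n"][w] + smoothing)
        w_n = tf["n"][w] * (df["n"][w] + smoothing) / (df["g"][w] + smoothing)
        weights[w] = (w_g, w_n)
    return weights
```

Each table is passed in as a plain string of its text. A word that appears often in genuine tables but rarely in non-genuine ones ends up with a large genuine weight and a small non-genuine weight, which is the same intuition as tf-idf.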