|
First we compose
a set of words extracted from all the tables in a training set, after proper
preprocessing such as stemming and stop-word removal. For each word in the
set, we get the following counts … From these counts we derive two base
weights for each word, one for the genuine table class and one for the
non-genuine table class. Each weight is defined as the term frequency in its own class,
modified by the ratio between the document frequency in its own class and the
document frequency in the opposite class. Readers familiar with information
retrieval (IR) will notice that this is inspired by the standard tf-idf measure.
|
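The weighting described above can be illustrated with a minimal sketch. The passage does not give the exact counts or formula, so everything below is an assumption made for illustration: the function name `base_weights`, the additive smoothing, and the `log(1 + ratio)` damping are hypothetical choices, not the authors' definition. The sketch only follows the stated idea, which is the term frequency in a word's own class scaled by the ratio of its own-class document frequency to its opposite-class document frequency.

```python
import math
from collections import Counter


def base_weights(genuine_docs, non_genuine_docs, smoothing=1.0):
    """Return {word: (genuine_weight, non_genuine_weight)}.

    Each document is a list of already preprocessed tokens, one list per
    table. The smoothing term and the log damping are assumptions of this
    sketch; the source passage elides the actual formula.
    """

    def counts(docs):
        # tf: total occurrences in the class; df: number of tables containing the word
        tf, df = Counter(), Counter()
        for tokens in docs:
            tf.update(tokens)
            df.update(set(tokens))
        return tf, df

    tf_g, df_g = counts(genuine_docs)
    tf_n, df_n = counts(non_genuine_docs)
    n_g, n_n = len(genuine_docs), len(non_genuine_docs)

    weights = {}
    for word in set(tf_g) | set(tf_n):
        # Relative document frequencies, smoothed so neither ratio divides by zero
        rel_g = (df_g[word] + smoothing) / (n_g + smoothing)
        rel_n = (df_n[word] + smoothing) / (n_n + smoothing)
        # Own-class term frequency, modified by own-class / opposite-class df ratio
        w_g = tf_g[word] * math.log(1 + rel_g / rel_n)
        w_n = tf_n[word] * math.log(1 + rel_n / rel_g)
        weights[word] = (w_g, w_n)
    return weights
```

A word that never occurs in one class gets a weight of zero for that class, because its term frequency there is zero. A word that is frequent in one class and rare in the other gets a larger weight for the class where it is frequent.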