How would you handle automatic text classification in noisy conditions? Here is what has been done, to my knowledge, in Automatic Web Genre Identification (AGI).
By noise here I refer to two different disturbing factors*: 1) the training sample and the test sample come from different sources/annotators; 2) the test set contains genre classes that are not present in the training set. These two types of noise reflect real-world conditions when working with genre: 1) since genre is a complex notion that has been interpreted in many different ways, the identification of the same genre class can vary depending on the research agenda or on individual preferences; 2) we cannot realistically build a genre classifier that performs well if we include all existing genres, either on the web or in another digital environment (such as a digital library). A minimal sketch of the second condition follows below.
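To make the second condition concrete, here is a minimal sketch (my own toy example, not taken from any of the studies discussed below; the documents and label names are invented): a closed-world classifier trained on two genres is forced to assign a page from an unseen genre (an FAQ) to one of its training classes, because it has no "none of the above" option.

```python
# Toy illustration of noise type 2: the test set contains a genre class
# (FAQ) that the classifier has never seen during training.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_docs = [
    "welcome to my personal home page hobbies photos",
    "about me my family my dog my holiday pictures",
    "our company products services contact investor relations",
    "corporate profile annual report shareholders press releases",
]
train_labels = ["personal_home_page", "personal_home_page",
                "corporate_home_page", "corporate_home_page"]

test_docs = [
    "my personal home page with holiday photos",              # seen genre
    "frequently asked questions how do i reset my password",  # unseen genre: FAQ
]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_docs, train_labels)

# The FAQ page is inevitably forced into one of the two training genres.
for doc, pred in zip(test_docs, clf.predict(test_docs)):
    print(f"{pred:22s} <- {doc}")
```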
The only explicit investigation of this issue was carried out by Shepherd et al. (2004). They compared the performance of two classifiers on three subgenres (i.e. 93 PERSONAL HOME PAGES, 94 CORPORATE HOME PAGES and 74 ORGANIZATIONAL HOME PAGES) with and without noise (i.e. 77 non-home pages). Predictably, their results show a deterioration of performance when noise is introduced. In their case, the “noise” was represented by documents that did not fall into the three subgenres to be identified, but belonged to other genres. In their experiment, Shepherd et al. (2004) conflated all the genre classes other than “home pages” into one single class. Decisions about the size and the proportion of this class were not motivated. I will call this type of noise structured noise, because it is represented by well-defined genre classes that should always count as negative examples for a classifier (a minimal sketch of this relabelling follows below).
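As a sketch of that conflation step (my reconstruction, not Shepherd et al.'s code; the function and label names are illustrative), structured noise amounts to relabelling every page outside the target subgenres into one catch-all negative class:

```python
# A minimal sketch, assuming a labelled corpus: every page whose genre is
# not one of the three target subgenres is mapped to a single "noise" class.
TARGETS = {"personal_home_page", "corporate_home_page", "organizational_home_page"}

def conflate_to_structured_noise(labels):
    """Map every non-target genre label to one catch-all negative class."""
    return [lab if lab in TARGETS else "noise" for lab in labels]

print(conflate_to_structured_noise(
    ["personal_home_page", "faq", "e-shop", "corporate_home_page"]))
# ['personal_home_page', 'noise', 'noise', 'corporate_home_page']
```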
A slightly different approach to structured noise is used by Kim and Ross (2010) and Vidulin et al. (2007). Kim and Ross considered the 24 classes of KRYS-01 as noise with respect to the performance of their classifier on the 7-webgenre collection. In their case, noise is represented by 24 well-defined genre classes, each of them represented by a relatively small number of documents (at most 90), while the 7 web genres are represented by 190 web pages each. The noisy classes make up around 60% of the corpus, and the 7 web genres about 40%. The size and the proportion of this structured noise are not underpinned by any hypothesis, but the accuracy results on the 7-webgenre collection are very good (see Table 5).
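As a quick sanity check on those proportions (assuming each noise class is close to its 90-document maximum): 24 × 90 = 2,160 noise pages against 7 × 190 = 1,330 target pages, i.e. roughly 62% noise and 38% target, which is consistent with the 60/40 split just mentioned.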
Interesting information on the impact of the proportion of structured noise can also be derived from Vidulin et al. (2007), though their corpus is quite small with respect to the number of classes (see the description of MGC below). Their genre palette is supposed to represent all the genres on the web (but the proportions seem to be arbitrary), and they build 20 individual subclassifiers and perform a binary classification, i.e. one class against the remaining 19. Similar to Kim and Ross (2010), these 19 classes can be considered a kind of structured noise. In this scenario, Vidulin et al.'s (2007) accuracy results are high (94%), while their F-measure averaged over the 20 genres is moderate (50%). The sketch below shows why these two figures are not in contradiction.
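This gap between accuracy and F-measure is exactly what one would expect in a one-against-19 setting, where negatives vastly outnumber positives: a classifier is right on most pages simply by rejecting them, even if it finds the positives only imperfectly. Here is a hedged illustration with invented numbers (not Vidulin et al.'s actual data):

```python
# Invented binary scenario: 1000 pages, 50 positives (the target genre),
# 950 negatives (the other 19 genres pooled together).
from sklearn.metrics import accuracy_score, f1_score

y_true = [1] * 50 + [0] * 950
# Suppose the classifier recovers 25 of the 50 positives (25 false negatives)
# and raises 25 false alarms among the 950 negatives.
y_pred = [1] * 25 + [0] * 25 + [1] * 25 + [0] * 925

print(accuracy_score(y_true, y_pred))  # 0.95 -> "high accuracy"
print(f1_score(y_true, y_pred))        # 0.50 -> "moderate F-measure"
```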
One problem with structured noise is that it requires a major annotation effort, because all the classes that the supervised classifier should treat as negative examples must be clearly defined and labelled. Additionally, the underlying assumption is quite strong, because it presupposes that all documents fall into well-defined genres, and this is not always the case with documents on the web: many web documents might simply not belong to any genre, or might embody several genres. Santini (2010) explored a simple way to deal with this unstructured noise, based on the odds-likelihood form of Bayes' theorem (sketched below). Results were encouraging, but preliminary.
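For reference, this is my rendering of the standard odds-likelihood form of Bayes' theorem (Santini (2010) should be consulted for the exact decision rule used there). For a genre $G$ and a document $d$:

$$
O(G \mid d) \;=\; \frac{P(G \mid d)}{P(\neg G \mid d)} \;=\; \frac{P(d \mid G)}{P(d \mid \neg G)} \cdot \frac{P(G)}{P(\neg G)}
$$

One natural way to use this under unstructured noise is to threshold the posterior odds per genre: a page for which $O(G \mid d)$ stays below the threshold for every genre in the palette is simply left unclassified, instead of being forced into the nearest class.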
What is your experience with noise and automatic classification?
*The concept of “noise” can be applied to different situations. E.g., in Stubbe et al. (2007) “noise” refers to orthographical errors.