AGI: Structured and Unstructured Noise

How would you handle automatic text classification in noisy conditions? Here is what has been done, to my knowledge, in Automatic web Genre Identification (AGI).

By noise here I refer to two different disturbing factors*: 1) the training sample and the test sample come from different sources/annotators; 2) the test set contains genre classes that are not present in the training set. These two types of noise reflect the following real-world conditions when working with genre: 1) since genre is a complex notion that has been interpreted in different ways, the identification of the same genre class can vary depending on the research agenda or individual preferences; 2) we cannot possibly conceive a genre classifier that performs well if we include all the genres existing on the web or in another digital environment (such as a digital library).

The only explicit investigation was carried out by Shepherd et al. (2004). They compared the performance of two classifiers on three subgenres (i.e. PERSONAL HOME PAGE (93 pages), CORPORATE HOME PAGE (94) and ORGANIZATIONAL HOME PAGE (74)) with and without noise (i.e. 77 non-home pages). Predictably, their results show a deterioration of performance when noise is introduced. In their case, the “noise” was represented by documents that did not fall into the three subgenres to be identified, but belonged to other genres. In their experiment, Shepherd et al. (2004) conflated all the genre classes that were not “home pages” into one single class. Decisions about the size and the proportion of this class were not motivated. I will call this type of noise structured noise, because it is represented by well-defined genre classes that should always count as negatives for a classifier.
A slightly different approach to structured noise is used by Kim and Ross (2010) and Vidulin et al. (2007). Kim and Ross considered the 24 classes of KRYS-01 as noise with respect to the performance of their classifier on the 7-webgenre collection. In their case, noise is represented by 24 well-defined genre classes, each represented by a relatively small number of documents (at most 90), while the 7 web genres are represented by 190 web pages each. Noisy classes represent around 60% of the collection, and the 7 web genres about 40%. The size and the proportion of this structured noise are not underpinned by any hypothesis, but accuracy results on the 7-webgenre collection are very good (see Table 5).
Interesting information on the impact of the proportion of structured noise can also be derived from Vidulin et al. (2007), though their corpus is quite small with respect to the number of classes (see the description of MGC below). Their genre palette is supposed to represent all the genres on the web (but the proportions seem to be arbitrary), and they build 20 individual subclassifiers, each performing a binary classification, i.e. one class against the remaining 19. Similarly to Kim and Ross (2010), these 19 classes can be considered a kind of structured noise. In this scenario, Vidulin et al.’s (2007) accuracy results are high (94%), while their F-measure averaged over the 20 genres is moderate (50%).
One problem with structured noise is that it requires a major annotation effort, because all the classes that the supervised classifier should consider as negative examples must be clearly defined and labelled. Additionally, the underlying hypothesis is quite strong, because it presupposes that all documents fall into well-defined genres, and this is not always the case with documents on the web. Many web documents might simply not belong to any genre, or might embody several genres. Santini (2010) explored a simple way to deal with unstructured noise based on the odds-likelihood form of Bayes’ theorem. Results were encouraging, but preliminary.
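As a rough illustration (this is a sketch of the general technique, not Santini’s actual implementation), the odds-likelihood form rewrites Bayes’ theorem as posterior odds = prior odds × a product of likelihood ratios. That gives a natural rejection rule for unstructured noise: when the odds for a known genre stay below a threshold, the document is simply left unlabelled instead of being forced into a class. All word probabilities below are invented for the example:

```python
import math

# Hypothetical toy likelihoods: P(word | genre) and P(word | not-genre),
# as they might be estimated from an annotated sample. Invented numbers.
p_word_given_genre = {"contact": 0.20, "about": 0.15, "welcome": 0.10}
p_word_given_other = {"contact": 0.02, "about": 0.05, "welcome": 0.04}
prior_odds = 0.25  # P(genre) / P(not-genre), also an assumption

def posterior_odds(words):
    """Odds-likelihood form of Bayes' theorem:
    O(genre | doc) = O(genre) * prod_w P(w | genre) / P(w | not-genre).
    Words unseen in training are skipped (likelihood ratio of 1)."""
    log_odds = math.log(prior_odds)
    for w in words:
        if w in p_word_given_genre and w in p_word_given_other:
            log_odds += math.log(p_word_given_genre[w] / p_word_given_other[w])
    return math.exp(log_odds)

def classify(words, threshold=1.0):
    """Accept the genre only when the posterior odds clear the threshold;
    otherwise the document stays unlabelled (treated as noise)."""
    return "genre" if posterior_odds(words) > threshold else "unknown"

print(classify(["welcome", "contact", "about"]))  # strong evidence -> "genre"
print(classify(["foo", "bar"]))                   # no evidence -> "unknown"
```

The point is the `"unknown"` branch: unlike a standard single-label classifier, the odds formulation has a built-in way to say “none of the classes I know”.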

What is your experience with noise and automatic classification?

*The concept of “noise” can be applied to different situations. E.g., in Stubbe et al. (2007) “noise” refers to orthographical errors.

10 comments for “AGI: Structured and Unstructured Noise”

  1. 16 March, 2012 at 12:24

    Source: The WebGenre R&D Group

    Yuehong Hu • but I still do not understand unstructured noise…

    Marina Santini • Hi Yuehong, in a fully supervised single-label machine learning approach, you have a corpus of documents where each document has been manually annotated and belongs to a single class. Then you train a classifier by dividing the corpus into a training set and test set. The test set is used to evaluate the classification performance. In this paradigm, the test set cannot contain any class that is NOT included in the training set.

    My claim is that in a large digital environment or in an evolving environment like the web this approach is hard to apply, because we WILL NEVER BE ABLE TO BUILD A comprehensive TRAINING SET THAT INCLUDES ALL THE CLASSES “OUT THERE” and in the same proportions (this is another strong assumption of the fully supervised machine learning approach).

    Imagine the situation where you have a fantastic supervised classifier trained on 50 categories that you want to use to classify a random sample of 10 GB+ of web documents. This sample will probably contain your 50 categories plus many more. How do you retrain your classifier for the classification of such a sample with a minimum annotation effort? Realistically, I do not think you are going to annotate 10 GB+ manually. And even if you manage to do this, retrain your classifier on such a sample, and then crawl a fresh random sample six months later, you will probably find different classes and completely different proportions in the new sample, so your retrained classifier will not be efficient. In these situations a supervised classifier is facing “unstructured noise”, i.e. unknown classes in unknown proportions.
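    A toy illustration of the problem described above (invented data, not from the post): a classifier trained only on the classes “blog” and “shop” must label every test document, even one whose true class was never seen in training. This minimal word-frequency scorer stands in for any fully supervised single-label classifier:

```python
from collections import Counter

# Two tiny training classes; a real system would have many more documents.
training = {
    "blog": ["i think that this post is my opinion".split(),
             "my thoughts on this post today".split()],
    "shop": ["buy now price cart checkout".split(),
             "price discount buy cart".split()],
}

def word_profile(docs):
    """Relative word frequencies over all documents of one class."""
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

profiles = {label: word_profile(docs) for label, docs in training.items()}

def classify(words):
    """Score each known class by summed word frequencies. The classifier
    has no 'none of the above' option, so it always returns a known label."""
    scores = {label: sum(prof.get(w, 0.0) for w in words)
              for label, prof in profiles.items()}
    return max(scores, key=scores.get)

# A recipe page: no training class fits, yet a label is forced anyway,
# merely because one word ("buy") happens to overlap with "shop".
print(classify("mix flour sugar and butter then buy vanilla".split()))
```

    This is exactly the unstructured-noise failure: the unseen class is silently absorbed into whichever known class it accidentally resembles.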

    Hope this helps. If not let me know. Cheers, Marina

    Yuehong Hu • Dear Marina, thank you so much!
    I think I see.
    And I have another question, about feature extraction. It is a question in my dissertation. I want to extract core concepts from a keyword set, namely the keywords in the dissertations of information science. The selected core concepts should represent the features of the domain.
    I have tried TF, TF-IDF and MI, but the results are far from satisfactory.
    I think the reason is that all the keywords can represent the features of the domain, concretely or abstractly, and the above methods cannot distinguish the degree.
    Will you please give me some advice? Thank you again ^-^


    Marina Santini • Hi Yuehong, if your problem is about ambiguity (i.e. a word can have several senses), try to see whether the solutions investigated in Word Sense Disambiguation research might help you. A recent survey is Roberto Navigli (2009), “Word sense disambiguation: A survey”.

  2. Ankul
    25 May, 2012 at 08:02

    Can you please suggest any algorithm or tool (apart from the Naive Bayes algorithm) that does document classification with unstructured noise?
    That would help me a lot.

  3. 25 May, 2012 at 14:52

    Hi Ankul,

    What is your problem exactly? Do you have unclassified documents? What is their proportion?

    • Ankul
      27 May, 2012 at 12:50


      First of all, thanks a lot for the reply.

      Actually I'm working on the unsupervised Naive Bayes algorithm for text classification.

      Here lies my problem:

      Suppose I train my algorithm for 3 categories A, B and C. If some new document comes along which does not belong to any of these, then my algorithm (Naive Bayes) will assign one of the above categories (A, B, C), which is a drawback of the Naive Bayes algorithm. Now I want my algorithm to signal that this new document is not from the trained categories.

      I tried to do that by training the algorithm on another category, say D, for the documents which do not belong to any of A, B, C (I used some threshold value for that), but the accuracy was poor.

      So, can you please suggest any other tool, algorithm or way to handle this case…

  4. Ankul
    28 May, 2012 at 12:24

    I'm extremely sorry… actually I'm working on the supervised Naive Bayes algorithm, not on the unsupervised one.

  5. 29 May, 2012 at 09:41

    Hi Ankul,

    Your negative results were predictable.

    If you want to use supervised classification (regardless of the algorithm you use), your D class must somehow represent the documents which do not belong to any of A, B, C, because your algorithm LEARNS from the classes you have set up and assigns classes to unseen documents according to what it has learned from the training set. This is the basic idea of supervised machine learning. When a document in the test set is not represented in the training set, the algorithm is confused. Additionally, the proportions of the classes in the training set also affect the results. This means that if you have a training class much larger than the others, your algorithm will tend to assign as many documents as possible to that class.
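    The class-proportion effect can be seen in a few lines (a minimal sketch with invented toy data, not Ankul’s setup): in multinomial Naive Bayes the class prior P(class) grows with the class’s share of the training set, so an ambiguous document drifts toward the bigger class, while clear word evidence can still override the prior:

```python
import math
from collections import Counter

def train_nb(docs_by_class):
    """Multinomial Naive Bayes with Laplace (add-one) smoothing."""
    vocab = {w for docs in docs_by_class.values() for d in docs for w in d}
    n_docs = sum(len(docs) for docs in docs_by_class.values())
    model = {}
    for label, docs in docs_by_class.items():
        counts = Counter(w for d in docs for w in d)
        total = sum(counts.values())
        model[label] = (
            math.log(len(docs) / n_docs),                         # log prior
            {w: math.log((counts[w] + 1) / (total + len(vocab)))  # log likelihoods
             for w in vocab},
        )
    return model

def predict(model, words):
    scores = {label: prior + sum(lik.get(w, 0.0) for w in words)
              for label, (prior, lik) in model.items()}
    return max(scores, key=scores.get)

# "the" occurs equally in both classes; "a" is specific to class A.
balanced = {"A": [["a", "the"]] * 5, "D": [["d", "the"]] * 5}
skewed   = {"A": [["a", "the"]] * 5, "D": [["d", "the"]] * 95}

print(predict(train_nb(skewed), ["the"]))  # ambiguous doc -> big class "D"
print(predict(train_nb(skewed), ["a"]))    # clear evidence still wins -> "A"
```

    With the balanced corpus the same ambiguous document is a toss-up, because the priors cancel and only the word likelihoods decide.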

    If you wish to work with noisy classification, you should look at alternative models: you can try out semi-supervised learning, adaptive learning, unsupervised learning, or other AI models.

    Good luck with your work!


  6. Ankul
    31 May, 2012 at 06:47

    Hi Marina,

    Thanks a lot for the reply.
    I need some clarification on the following statement of yours (our NB classifier is behaving the other way round):

    “This means that if you have a training class much larger than the others, your algorithm will tend to assign as many documents as possible to that class.”

    We have positive classes A, B and C (what the end user is interested in), and then we intentionally created a much larger class D of negative samples (not belonging to any of the positive classes).
    We are fine with a few “false negatives” (documents of positive classes classified as negative) but want as few “false positives” as possible (documents of the negative class classified as positive), hence the much larger negatively trained class.
    But our NB classifier gives more false positives with increasing negative class size (I don't know if the algorithm has some bug or whether this is in line with the principles of NBC?)

    After this we also experimented with multiple negative classes, none of them bigger than any of the positive classes. This way the accuracy did improve, but then we hit a brick wall where increasing the positive:negative category ratio beyond 1:4 didn't improve the accuracy anymore.
    Is there a science behind it, or is it just a freaky coincidence?

    I also suspect our algorithm itself, especially as in the experiment with one very large negative category it was not reducing false positives.

    Once again thanks a lot for your guidance.


  7. 1 June, 2012 at 08:10

    Hi Ankul,

    I am afraid there is no magic algorithm that does document classification and takes care of noise, in whatever situation, with good performance.

    First of all, you must make sure that your features well represent your classes and your documents.

    Second, you must build a training set that correctly represents your classification problem.

    Third, you must choose a classification algorithm that maximizes the power of your features and best solves your classification problem.

    My suggestion is that you try out your current dataset with many different classification algorithms, not only Naive Bayes.
    You can use a standard package, like Weka, which is very easy to use and offers a wide range of algorithms.

    If the performance is still poor regardless of the algorithm you use, there is something that must be adjusted in your feature set and in your training set. So you must revise and reflect on your classification problem.

    Weka is available here:
    Get hold of the manual:
    and read very carefully Ch 5 (Credibility: Evaluating what’s been learned) and try to understand where your problem lies.

    You can also subscribe to the Weka mailing list, see if somebody else has come across the same problem before, and ask for advice. Do not expect any miracle.

    In my experience, noise is a hard problem to solve, and it is still an open research question. Handling noise with supervised machine learning is challenging and often unrewarding. But you might have a different experience and find a good solution.

    Hope this helps a little.

    Cheers, Marina

    • Ankul
      1 June, 2012 at 09:53

      Thanks a lot, Marina, for the detailed answer. I'll definitely give Weka a try and will keep you posted as I make progress on this and other frontiers as well.

  8. 3 June, 2012 at 08:48

    A very useful blog post on machine learning is here:

    Do not miss it!
