Towards Language–Independent Web Genre Detection (2009)

Poster paper by: Philipp Scholl, Renato Domínguez García, Doreen Böhnstedt, Christoph Rensing, Ralf Steinmetz

The term web genre denotes the type of a given web resource, in contrast to the topic of its content. In this research, we focus on recognizing the web genres blog, wiki and forum. We present a set of features that exploit the hierarchical structure of the web page’s HTML mark-up and thus, in contrast to related approaches, do not depend on a linguistic analysis of the page’s content. Our results show that it is possible to achieve a very good accuracy for a fully language-independent detection of structured web genres.
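
For illustration, a minimal sketch of the kind of language-independent, markup-based features the abstract describes. The concrete feature names, tags and example page are assumptions for illustration, not the paper's actual feature set:

```python
# Sketch: derive content-agnostic structural features from HTML mark-up.
# Feature names (nesting depth, link ratio, form count, ...) are illustrative
# assumptions; the paper's real feature set may differ.
from collections import Counter
from html.parser import HTMLParser

class StructureProfiler(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.depth = 0
        self.max_depth = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        self.depth = max(0, self.depth - 1)

    def handle_data(self, data):
        self.text_chars += len(data.strip())

def structural_features(html: str) -> dict:
    """Map a web page to a small feature vector that ignores the text's language."""
    profiler = StructureProfiler()
    profiler.feed(html)
    total_tags = sum(profiler.tag_counts.values()) or 1
    return {
        "max_nesting_depth": profiler.max_depth,
        "link_ratio": profiler.tag_counts["a"] / total_tags,
        "form_count": profiler.tag_counts["form"],
        "list_item_count": profiler.tag_counts["li"],
        "chars_per_tag": profiler.text_chars / total_tags,
    }

print(structural_features("<html><body><ul><li><a href='#'>post</a></li></ul></body></html>"))
```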

Full article: <http://www2009.eprints.org/159/1/p1157.pdf>

Copyright is held by the author/owner(s).
WWW 2009, April 20–24, 2009, Madrid, Spain.

7 comments for “Towards Language–Independent Web Genre Detection (2009)”

  1. 18 November, 2011 at 12:17

    My observations in the field of statistical opinion mining only confirm what I read in this abstract regarding language independence: a linguistic analysis of the text content is not necessary. By the way, even the choice of classifier (Naive Bayes or SVM) is not significant, since the obtained results differ only slightly. In contrast, taking the text hierarchy into account can be important.
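
    One quick way to check the commenter's claim that the choice of classifier matters little is to cross-validate both on the same markup-derived features. The data set below is a purely hypothetical placeholder:

    ```python
    # Sketch: compare Naive Bayes and a linear SVM on identical structural
    # features via cross-validation; X and y are hypothetical stand-ins.
    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.random((300, 5))        # e.g. 300 pages x 5 markup features
    y = rng.integers(0, 3, 300)     # hypothetical genres: 0=blog, 1=wiki, 2=forum

    for clf in (GaussianNB(), LinearSVC()):
        scores = cross_val_score(clf, X, y, cv=5)
        print(type(clf).__name__, round(scores.mean(), 3))
    ```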

    • 19 November, 2011 at 06:56

      I completely agree – classifiers based on lexical-feature density of any kind are likely to fail at genre detection. These sorts of tools do well at detecting differences in topic, but structural differences usually characterize genre more effectively in my experience. I would even go beyond the syntactic level to document structure in some cases. For example, resumes, annual reports and court transcripts all have a very prescribed document structure that should be a strong clue for recognizing them, independent of the syntactic structure of the texts.

  2. marinasantini.ms@gmail.com
    19 November, 2011 at 07:01

    Comment source: LinkedIn group – Natural Language Processing

    Andrés Hohendahl • Just read the article; it reads like a communication of results: no corpus size, no size of the statistical evidence set, no indication of how many elements are extracted for training the SVM, no information on the number of classes, only good results. A poor article, in my impression.

    I don’t see it as a significant leap: web genre is complex information of a higher cognitive order, and there is no reason a system should recognize it based on only a small set of evidence and extraction, just because many people use HTML. What if the page is made in Flash, or, even worse, made in PHP, .NET or JSP and constructed dynamically, even with JavaScript? Or it is all an image, or a PDF containing raster information that assembles letters! Simply nothing can be told from all of those, nor from the rest of the web.

  3. marinasantini.ms@gmail.com
    19 November, 2011 at 07:11

    I am more on Andrés’ side: genre is a complex cognitive concept.
    I believe that it can be captured more thoroughly through language than through markup. My views are briefly described in this position paper: <http://coli.lili.uni-bielefeld.de/Texttechnologie/Forschergruppe/PTTR/abstracts/Abstract-Santini.pdf>. However, in my opinion it is good that as many experiments as possible are carried out with as many features as possible: we need empirical evidence to support one position or the other.
    Marina

  4. 19 November, 2011 at 10:08

    In my opinion, the reason for our disagreement is the following: we don’t agree on which means are better for classification, semantic or statistical. As far as I understood, Andrés and Marina would use semantic means; Leslie and I, statistical means.

    There are many reasons to prefer semantic means. Genre is indeed a very complex cognitive concept, and maybe it is impossible to analyze genres without semantics. However, I also know that a thorough analysis of semantics can be error-prone. You need large dictionaries, and you cannot consider the semantics of every word. In my opinion mining system (opinionated text can be considered a genre, can’t it?), I had a dictionary of 6,000 emotional words, and I am sure it is only a tiny fraction of the real set of emotional words. Do we want to build WordNets for every constellation in genre detection?
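
    For illustration, a toy version of the dictionary-based approach the commenter describes. The word lists here are a tiny illustrative stand-in, not the 6,000-entry lexicon mentioned above:

    ```python
    # Toy sketch of lexicon-based opinion mining: score a text by counting hits
    # against small emotional-word lists. The lexicon below is an assumption.
    POSITIVE = {"good", "great", "excellent", "love"}
    NEGATIVE = {"bad", "poor", "terrible", "hate"}

    def opinion_score(text: str) -> float:
        tokens = text.lower().split()
        pos = sum(t in POSITIVE for t in tokens)
        neg = sum(t in NEGATIVE for t in tokens)
        return (pos - neg) / max(pos + neg, 1)

    print(opinion_score("a poor article in my impression, but a great discussion"))
    ```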

    We need some abstraction to get a reasonably useful recognition rate that can be used in real software systems, not hypothetical ones but real ones. For this reason, we use statistical means. You are right, they are not exact, but they allow you to get a concrete result. We analyze the markup, divide the feature space into segments and hence facilitate automatic recognition.
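
    One way to read “dividing the feature space into segments” is a tree-style classifier over markup features. A hedged sketch; the feature names, values and labels are assumptions, not taken from the paper or the comment:

    ```python
    # Sketch: a decision tree literally partitions the markup-feature space into
    # axis-aligned segments. All values below are hypothetical.
    from sklearn.tree import DecisionTreeClassifier, export_text

    X = [[12, 0.40, 1],   # [max_nesting_depth, link_ratio, form_count]
         [25, 0.15, 0],
         [18, 0.55, 3],
         [30, 0.10, 0]]
    y = ["forum", "wiki", "blog", "wiki"]

    tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
    print(export_text(tree, feature_names=["max_nesting_depth", "link_ratio", "form_count"]))
    ```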

    That is why we should first define which means we are discussing! The article describes an automatic approach, meaning that the authors chose statistical means. AO

  5. 19 November, 2011 at 14:54

    @Alexander @Marina
    My point of view is not whether “I am with semantics” or not. What I mean is that some semantics are indeed captured by statistics; the hard part is selecting the abstraction features that do this in a useful way.
    Actually, CRFs are among the best and newest interdependent methods for doing something of this sort.
    But when anyone claims that ‘statistics’ are better, or that ‘parsing + semantics’ is better, sorry: they are all right. The only difference between rule-based approaches (parsing + pragmatics) and statistics is the scope: while statistics gathers lots of information and does noise pruning and best-feature extraction, the statistical evidence must exist, and hence with a low number of good (human-annotated) samples we cannot capture the whole picture!

    In my humble opinion (and hence the direction of my research), both worlds are good. When starting from scratch, we need to use rules, even bad ones; as we get more information about the real world, cross- or self-validated, we then build a statistical extraction based on the former rules, maybe twisting them in some way. This yields a “soft” statistical rule rather than a dumb statistical or hard rule system. This is my guess!
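
    A compact sketch of this rules-first, statistics-later idea. The hand-written rule, the features and the data are hypothetical illustrations, not the commenter's actual system:

    ```python
    # Sketch: seed a statistical classifier with labels produced by a crude rule,
    # then let the model generalize beyond the rule ("soft" statistical rule).
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def rule_label(features):
        """Crude hand-written rule: many forms -> forum, otherwise blog."""
        return "forum" if features[2] >= 2 else "blog"

    rng = np.random.default_rng(1)
    # Hypothetical unlabeled pages: [max_nesting_depth, link_ratio, form_count]
    unlabeled = rng.random((200, 3)) * [40, 1.0, 5]

    seed_labels = [rule_label(f) for f in unlabeled]
    model = LogisticRegression(max_iter=1000).fit(unlabeled, seed_labels)

    # The trained model now gives graded probabilities instead of a hard rule.
    print(model.predict_proba(unlabeled[:3]).round(2))
    ```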

    Also (for me) there are lots of features under the hood, like deixis and situational, dialogue-related features, which are largely ignored by the defenders of parsing, and ignored because they may lie deeply buried inside the mostly obscure statistical mechanisms and their piles of numbers. My plan is to build (and I am doing so) a dual system, capable of learning faster “with” initial rules and of discovering other rules on the fly, based on experience. This is my bet!

    Thank you!

  6. 21 November, 2011 at 14:26

    @Andres
    Thank you for your comment.

    Now I understand what you mean. As far as I understood, you rely on semantic rules in cases where you get reliable results; if the results are unreliable, you use statistics. I had something like that in my opinion mining, a hybrid approach, and used only mathematical means to combine the two kinds of approaches.

    However, your hybrid approach raises many questions. Could you explain how you assess this reliability? Do you have a special measure for it? For example, do you ask your users whether they like the classification result? I am very curious.
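
    For what it is worth, one common way to operationalize this reliability question is to treat the statistical model's own predicted probability as a confidence score and fall back to the rules below a threshold. A hypothetical sketch, not necessarily what either commenter actually does:

    ```python
    # Sketch: hybrid decision with a confidence threshold. `model` is any
    # classifier exposing predict_proba (e.g. the one sketched above); the
    # threshold and the rule-based fallback are illustrative assumptions.
    import numpy as np

    def hybrid_predict(model, features, rule_fn, threshold=0.8):
        probs = model.predict_proba([features])[0]
        best = int(np.argmax(probs))
        if probs[best] >= threshold:          # statistical result deemed reliable
            return model.classes_[best]
        return rule_fn(features)              # otherwise defer to the rule
    ```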
