Abstract: Classification of Web Sites at Super-genre Level

by Christoph Lindemann and Lars Littig

In: Genres on the Web: Computational Models and Empirical Studies
Alexander Mehler, Serge Sharoff and Marina Santini
Text, Speech and Language Technology
Volume 42, 2011, DOI: 10.1007/978-90-481-9178-9


We present an approach for the classification of Web sites at super-genre level. This approach utilizes both structure and content of Web sites in order to distinguish between eight relevant Web genres. We show that this combination of structural and content-based features considerably improves the classification performance compared to approaches solely based on structure or content. We evaluate our approach on a dataset comprising more than 16,000 Web sites with about 20 million crawled and 100 million known pages. The approach achieves an accuracy of 92% for the classification of these Web sites.
1 Introduction
The World Wide Web has developed into a central source of information, a very important marketplace, a highly noticed presentation platform, and a frequented meeting place, to mention only some. Furthermore, the ever-growing number of users and content creators leads to a rapid evolution and emergence of different Web sites. As a consequence, it is more and more difficult to identify the Web sites providing the information and services of interest.
However, while Web sites differ in their design and content, many Web sites are created for the same purpose so that they are related like the Web sites of two universities or two competing corporations. This observation directly corresponds to our notion of genre since we think that a genre is defined as a category assigned on the basis of external criteria such as purpose. As a consequence, classification into genres has to focus on the purpose the unit of analysis is created for as central point of investigation. While the concept of genre is often associated with a grouping of texts based on external criteria due to early and important work in this field of research, e.g. [2], we deal with the concept of Web genre. This concept transfers the idea of classification by purpose from texts to the Web introducing new opportunities and challenges. Thus, we have to transfer and extend appropriate methods from the field of text classification in order to embrace these opportunities and to face these challenges. The latter comprise in first place the heterogeneity of the Web and the emergence of new categories like blogs which cannot be found in traditional genres that are usually instantiated on paper. Opportunities arise especially from the exploitation of the link structure and other structural features which are useful for the classification of these new categories. Therefore, we analyze the structure of Web sites in order to examine how it reflects the purpose a Web site is created for. We also show that it is necessary to take content-related features that also reflect this purpose into account. Consequently, a Web site can be classified into a Web genre based on structure and content. The concept of genre is in general very useful in the context of classification tasks because it facilitates the identification of categories. Thus, categories can be identified by considering which purposes can be pursued or are dominant in a certain area.
The granularity of the unit of analysis and the genre granularity are other important aspects of consideration. Analysis at the level of Web sites is in general attracting more and more interest for a variety of techniques like spam and duplicate detection. One reason for this trend is based on the fact that there is a more frequent change of information and availability on page-level compared to site-level data so that the latter constitutes a solid foundation for research in several domains. Therefore, a complete Web site is the unit of analysis of our work. Considering the genre granularity, Web genres can be accounted for at subgenre, genre and super-genre level. In our opinion, the super-genre level relates to coarse-grained, general, and dominant purposes and consequently broad categories which comprise several fine-grained sub-categories. Since we focus on dominant purposes Web sites can be created for, we account for the Web genre of a Web site at a super-genre level, i.e. we assign a Web site to one of eight Web genres, namely Academic, Blog, Community, Corporate, Information, Nonprofit, Personal, and Shop. Another interesting approach for the identification of a compact system of genres is given in. In this study, genres are not identified by considering the purpose a Web site is created for but by analyzing the major aims of text production. Nevertheless, the adapted typology, which includes six categories, exhibits close relations to the Web genres analyzed in this chapter.
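To make the idea of combining structure and content concrete, here is a minimal sketch in Python. It is an illustration only: the excerpt above does not say which features or which classifier the chapter uses, so the structural descriptors, the tiny vocabulary, the example sites, and the nearest-centroid classifier below are all assumptions chosen for brevity. Only the eight genre names come from the text.

```python
import math
from collections import Counter

# The eight super-genres named in the excerpt.
GENRES = ["Academic", "Blog", "Community", "Corporate",
          "Information", "Nonprofit", "Personal", "Shop"]

# Hypothetical vocabulary for the content features (an assumption).
VOCAB = ["research", "lecture", "cart", "price", "post", "comment"]


def content_features(text):
    """Relative frequency of each vocabulary term in the site's text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return [counts[term] / total for term in VOCAB]


def structural_features(site):
    """Toy structural descriptors (assumed, not taken from the chapter):
    scaled log page count, scaled links per page, share of external links."""
    pages = max(site["pages"], 1)
    links = site["internal_links"] + site["external_links"]
    return [
        math.log10(pages) / 6,
        min(links / pages / 100, 1.0),
        site["external_links"] / max(links, 1),
    ]


def combined_features(site):
    """Concatenate structural and content features into one vector."""
    return structural_features(site) + content_features(site["text"])


def train(labeled_sites):
    """Compute one centroid per genre from (site, genre) pairs."""
    by_genre = {}
    for site, genre in labeled_sites:
        by_genre.setdefault(genre, []).append(combined_features(site))
    return {
        genre: [sum(col) / len(vectors) for col in zip(*vectors)]
        for genre, vectors in by_genre.items()
    }


def classify(model, site):
    """Assign the genre whose centroid is closest in Euclidean distance."""
    x = combined_features(site)
    return min(model, key=lambda genre: math.dist(x, model[genre]))


# Made-up example sites, purely for demonstration.
training = [
    ({"pages": 5000, "internal_links": 40000, "external_links": 2000,
      "text": "research lecture research seminar"}, "Academic"),
    ({"pages": 800, "internal_links": 9000, "external_links": 100,
      "text": "cart price price checkout"}, "Shop"),
    ({"pages": 120, "internal_links": 600, "external_links": 900,
      "text": "post comment post archive"}, "Blog"),
]
model = train(training)

unseen = {"pages": 600, "internal_links": 7000, "external_links": 80,
          "text": "price cart cart sale"}
predicted = classify(model, unseen)
```

Concatenating both feature groups into one vector is the simplest way to let a single classifier see structure and content together. A real system would need far richer features and a properly evaluated learner, and nothing here should be read as the authors' method.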
[Continue reading excerpts here or download PDF from here]

7 comments for “Abstract: Classification of Web Sites at Super-genre Level”

  1. Christophe Clugston
    4 May, 2013 at 10:30

    Some of this research parallels my own. There are some salient differences between their approach and mine: 1) I am looking at the Sub Genre Analysis (traced from the Supra level), 2) I am employing 3 aspects to all analysis levels, 3) I am concerned with only a small group of documents to test all of this. I do agree with the three layers of genre; however, I term them Supra, Meso and Sub Genre (which may be the result of a Hybrid Genre). In the case of extant genres they are followed until they cannot be followed in the offline version and then are traced via the online mating. Using three lenses 1) Purpose/ Function, 2) Structure/ Format and 3) Content is a necessary method to capture a cyber/ digital genre in my opinion. Until my work (in process) no one has attempted to do this.

  2. Marina Santini
    4 May, 2013 at 14:44

    Hi Chris,

    I understand your problem.

    However, I do not know if I am persuaded…

    Read these two posts on an interesting chapter that I reviewed some time ago:



    Cheers, Marina

  3. christophe clugston
    7 May, 2013 at 05:39

    Yes I have read the salient aspects of Devitt’s work. Genre can easily be termed by its contents. A Golf Ad is different from a Soccer Ad how? By the contents–the contents switch the genre. Structure of the ads could be the same or they could be different; however, what cannot be denied is that the subject matter is completely different. If we go to Text Books–what is it that separates them into Sub Genres? It is their contents. An Art textbook is hardly the same as a Zoology textbook. (BTW I still never get notified about responses to any of my posts on here.)

  4. Marina Santini
    7 May, 2013 at 07:33

    Hi Chris, the notification system is handled by WordPress, so… I do not know exactly why you are not notified, if you have checked the box. I will try to understand…

    About content/topic/domains and their influence on genre characterization, well… what I can see so far is that content is influential at subgenre level, not at mid-genre level, if we are talking about written documents.
    E.g.: supergenre: academic writing
    genre: academic papers (see Swales 1990)
    subgenres: see the analysis of humanities vs. scientific academic papers in one of Biber’s books (don’t have the reference handy right now).
    But if you have different results from your empirical studies, please give us some references so we can check the experiments.

    As far as film genres and literary genres are concerned, that’s a completely different issue…

  5. 7 May, 2013 at 08:23

    Since I am looking for Sub Genre Classification I am very cognizant of content. I am using a taxonomy system like Zoology. One of the things that I see for any Cyber/ digital genre is that it is the offspring of 2 Supra Genres (to use my terms), besides the spontaneous genres of FAQ, home page, etc., although their structure does contain parts of offline techniques. A thorough Genre Analysis takes into account the Supra, Meso and Sub levels. It also takes into account, in my work, the linguistic features NP, VP, verb choices, pronoun reference system, new vs old information, coherence and cohesion, etc. All of those aspects are part of text analysis or discourse analysis and clearly give an accurate view of the form and structure. My document analysis is concerned with, to use your term, Hybridism, although this existed in the offline version before it went to digital. Pure Discourse Analysis has large problems with this, I might mention. I think it important to note that unlike those working at the Macro Level (to automate Genre Recognition) I am a Micro researcher. I am looking at a specific mammal. The others are busy trying to figure out how to put reptiles, mammals, and fish in their proper container, as it were. I am busy cataloging and describing a new house cat breed, as it were.

  6. 7 May, 2013 at 08:41

    BTW I also can’t retrieve the PDF you have listed above.

  7. Marina Santini
    10 May, 2013 at 11:12

    Hi Chris,

    Springer has changed their website 🙂 I will fix the links later.

    You can buy a copy of the chapter here (if you like): http://link.springer.com/chapter/10.1007/978-90-481-9178-9_10

    Or you can browse what is available on GoogleBooks here: http://books.google.se/books?id=i3Xh1e3uVpUC&lpg=PP1&pg=PA362#v=onepage&q&f=false

    I do not know if the authors themselves have a downloadable version in their own website….
