Classification of Web Sites at Super-genre Level
by Christoph Lindemann and Lars Littig
In: Genres on the Web: Computational Models and Empirical Studies
Edited by Alexander Mehler, Serge Sharoff and Marina Santini
Text, Speech and Language Technology
Volume 42, 2011, DOI: 10.1007/978-90-481-9178-9
We present an approach for the classification of Web sites at super-genre level. This approach utilizes both the structure and the content of Web sites in order to distinguish between eight relevant Web genres. We show that this combination of structural and content-based features considerably improves classification performance compared to approaches based solely on structure or on content. We evaluate our approach on a dataset comprising more than 16,000 Web sites with about 20 million crawled and 100 million known pages. The approach achieves an accuracy of 92% for the classification of these Web sites.
The World Wide Web has developed into a central source of information, a very important marketplace, a highly noticed presentation platform, and a frequented meeting place, to mention only some of its roles. Furthermore, the ever-growing number of users and content creators leads to a rapid evolution and emergence of different Web sites. As a consequence, it is increasingly difficult to identify the Web sites that provide the information and services of interest.
However, while Web sites differ in their design and content, many Web sites are created for the same purpose and are therefore related, like the Web sites of two universities or two competing corporations. This observation corresponds directly to our notion of genre: we define a genre as a category assigned on the basis of external criteria such as purpose. As a consequence, classification into genres has to focus on the purpose for which the unit of analysis is created. While the concept of genre is often associated with a grouping of texts based on external criteria due to early and important work in this field of research, e.g. , we deal with the concept of Web genre. This concept transfers the idea of classification by purpose from texts to the Web, introducing new opportunities and challenges. Thus, we have to transfer and extend appropriate methods from the field of text classification in order to embrace these opportunities and to face these challenges. The challenges comprise, first and foremost, the heterogeneity of the Web and the emergence of new categories like blogs, which cannot be found among traditional genres that are usually instantiated on paper. Opportunities arise especially from the exploitation of the link structure and other structural features, which are useful for the classification of these new categories. Therefore, we analyze the structure of Web sites in order to examine how it reflects the purpose a Web site is created for. We also show that it is necessary to take into account content-related features that reflect this purpose as well. Consequently, a Web site can be classified into a Web genre based on structure and content. The concept of genre is in general very useful in the context of classification tasks because it facilitates the identification of categories: categories can be identified by considering which purposes can be pursued or are dominant in a certain area.
The granularity of the unit of analysis and the genre granularity are further important aspects to consider. Analysis at the level of Web sites is attracting more and more interest for a variety of techniques such as spam and duplicate detection. One reason for this trend is that information and availability change more frequently at page level than at site level, so that site-level data constitutes a solid foundation for research in several domains. Therefore, a complete Web site is the unit of analysis of our work. Considering the genre granularity, Web genres can be accounted for at sub-genre, genre and super-genre level. In our opinion, the super-genre level relates to coarse-grained, general, and dominant purposes and consequently to broad categories which comprise several fine-grained sub-categories. Since we focus on the dominant purposes Web sites can be created for, we account for the Web genre of a Web site at super-genre level, i.e. we assign a Web site to one of eight Web genres, namely Academic, Blog, Community, Corporate, Information, Nonprofit, Personal, and Shop. Another interesting approach for the identification of a compact system of genres is given in. In this study, genres are not identified by considering the purpose a Web site is created for but by analyzing the major aims of text production. Nevertheless, the adapted typology, which includes six categories, exhibits close relations to the Web genres analyzed in this chapter.
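The setting described above (a complete Web site as the unit of analysis, structural and content features combined, and one of eight super-genres as the output) can be illustrated with a minimal sketch. This excerpt does not specify the features or the classifier used in the chapter, so everything concrete here is an assumption made for illustration: the page fields `text`, `internal_links`, `external_links` and `has_form`, the two structural features, the term-frequency content features, and the nearest-centroid rule with cosine similarity.

```python
from collections import Counter
from math import sqrt

# The eight super-genres named in the text.
GENRES = ["Academic", "Blog", "Community", "Corporate",
          "Information", "Nonprofit", "Personal", "Shop"]


def site_features(pages, vocabulary):
    """Aggregate page-level data into one site-level feature vector.

    `pages` is a list of dicts with hypothetical keys 'text',
    'internal_links', 'external_links' and 'has_form'.
    """
    n_pages = max(len(pages), 1)
    internal = sum(p["internal_links"] for p in pages)
    external = sum(p["external_links"] for p in pages)
    total_links = (internal + external) or 1

    # Structural features (illustrative choices, both in [0, 1]).
    structural = [
        external / total_links,                          # share of external links
        sum(1 for p in pages if p["has_form"]) / n_pages,  # share of pages with a form
    ]

    # Content features: relative frequency of each vocabulary term on the site.
    counts = Counter(w for p in pages for w in p["text"].lower().split())
    total_words = sum(counts.values()) or 1
    content = [counts[t] / total_words for t in vocabulary]

    # Combination of both views by simple concatenation.
    return structural + content


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(vector, centroids):
    """Return the genre whose centroid is most similar to the site vector."""
    return max(centroids, key=lambda genre: cosine(vector, centroids[genre]))
```

In such a sketch, `centroids` would map each entry of `GENRES` to the mean feature vector of labeled training sites of that genre. Concatenation is only the simplest way to combine the two feature groups, and the structural features are kept in the range [0, 1] so that neither group dominates the similarity.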