
Cross-Testing a Genre Classification Model for the Web
by Marina Santini

In: Genres on the Web: Computational Models and Empirical Studies
Alexander Mehler, Serge Sharoff and Marina Santini
Text, Speech and Language Technology
Volume 42, 2011, DOI: 10.1007/978-90-481-9178-9

Abstract
The main aim of the experiments described in this chapter is to explore how to assess the robustness of genre models for the web. For this purpose, a simple genre model is presented and cross-tested with four genre collections. In this difficult experimental setting, the model shows some stability, and its results are in line with those of other current genre-enabled applications. The model provides some insights into open issues in Automatic Genre Identification (AGI) on the web. In particular, it shows that we know very little about the effect of noise on genre classification results. The set of experiments presented here offers a first baseline for noisy environments.

1.  Introduction
The main aim of the experiments described in this chapter is to investigate ways of assessing the robustness and stability of an Automatic Genre Identification (AGI) model for the web. More specifically, a series of comparisons using four genre collections is illustrated and analysed. I call this comparative approach cross-testing. Cross-testing exploits existing genre collections that are publicly available: collections built for individual needs and shared by their creators, thus allowing constructive comparative experiments. Thanks to these, big steps forward have been made in the last few years, despite the absence of official genre benchmarks and test collections. Yet, the current state of AGI is one of fragmentation and tentativeness, and automatic genre research still lingers in its initial phase.
The lack of benchmarks and test collections is only one of the reasons behind AGI's cautious progress. Other reasons are well summarized in the five points listed by Sharoff (in this book) and discussed, from different perspectives, by all the other authors contributing to this volume, namely: 1) the lack of an established genre list, 2) the unclear relation between traditional and web genres, 3) the need to classify large quantities of web documents quickly, 4) the design of the genre inventory, and 5) the problem of emerging genres.
From the outside, one might wonder why it is so difficult to reach consensus on the textual categories that we habitually employ in our everyday life. Who is not familiar with one or more of the following genres: EDITORIALS, INTERVIEWS, LETTERS TO THE EDITOR, CLASSIFIEDS, WEATHER REPORTS, etc. in newspapers and magazines; or BLOGS, FAQS, HOME PAGES, PERSONAL PROFILES, ACADEMIC PAPERS, SUGGESTIONS, HINTS, DIY GUIDES, HOW-TOS, NARRATIVES, INSTRUCTIONS, ADVERTISING, etc. on the web and in other digital environments? Surprisingly, findings show that agreeing on the genre labels to be applied to documents is not as straightforward as one might imagine (Santini, 2008; Rehm et al., 2008).
From a terminological point of view, there is great variation in the use of genre labels. There are problems with synonyms, with similarity between and across genres, with the level of generality or specificity that genre labels represent, and so on. Under these conditions it is very difficult to make decisions about the definition of a genre palette. Experience from Meyer zu Eissen and Stein (2007), Rosso (2008) and Crowston et al. (in this book) shows the problems related to the definition of genre taxonomies. Connected to this terminological elusiveness is the problem of genre evolution. New genres are spawned continuously in web communities (e.g. see Paolillo et al., in this book), and social networks certainly have novel genres in store for us. Facebook's WALL or LinkedIn's PUBLIC PROFILE seem to be good candidates in this respect. Some ideas on how to detect new genres have been put forward, e.g. by Shepherd et al. (2004), who suggest adaptive learning. However, human acknowledgment of new or evolving genres might be slower than their automatic detection, since genres require social recognition, at least within the community where a new genre is envisaged (cf. the historical analysis of the creation of the BLOG genre in Blood, 2000).
From a perceptual point of view, experiments have shown that individuals differ in how they perceive and recognize genres (Rosso, 2008; Santini, 2008). Our limited understanding of how genres are perceived and how their labels are used by humans is a major drawback for genre annotation tasks, since raters tend to disagree when deciding which genres to assign to documents. In this respect, the experiences of Berninger et al. (2008) and Sharoff (in this book) are very instructive. Both show that it is difficult both to instruct people consistently and to obtain strong agreement. Only the good will of the corpus builders, lots of dedicated time, financial resources and, last but not least, resolute and clear-cut decisions led to the finalization of KRYS-01 and Sharoff's English-Russian genre collections.
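Rater disagreement of this kind is usually quantified with a chance-corrected coefficient. The sketch below is a minimal illustration, not code from any of the studies cited above: it computes Cohen's kappa for two hypothetical raters labelling ten documents, with invented labels and counts.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: proportion of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each rater's own label distribution.
    dist_a, dist_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((dist_a[g] / n) * (dist_b[g] / n)
              for g in set(labels_a) | set(labels_b))
    return (p_o - p_e) / (1 - p_e)

# Invented genre annotations for ten web documents by two raters.
rater_1 = ["blog", "faq", "blog", "home page", "faq",
           "blog", "how-to", "blog", "faq", "home page"]
rater_2 = ["blog", "faq", "home page", "home page", "faq",
           "how-to", "how-to", "blog", "blog", "home page"]
print(f"kappa = {cohens_kappa(rater_1, rater_2):.2f}")  # -> kappa = 0.59
```

Even with only four genre labels, the two hypothetical raters reach only moderate agreement, which mirrors the pattern reported in the annotation experiences discussed above.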
There is also a problem of sheer classification. The genre classes mentioned above cannot be flattened to one dimension. There are hierarchical relations and horizontal relations (cf. Heyd, 2008). For instance, we can deal with supergenres (e.g. ADVERTISING), subgenres (e.g. WEATHER REPORTS), genres at the basic level (e.g. EDITORIALS), etc. In what way do these different levels of generality affect an AGI classifier? Some experiments have shown that an AGI classifier performs better when the level of the classes is consistent (Santini, 2006a). However, we currently know very little about the relation between the granularity of genre classes and classification performance.
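To make the levels concrete, the hypothetical sketch below represents a small genre palette as a tree and normalizes every label to the basic level, one possible way of keeping the level of the classes consistent before training a classifier. The hierarchy shown is invented for illustration and is not a proposed taxonomy.

```python
# An illustrative three-level genre hierarchy:
# supergenre -> basic-level genre -> subgenres.
HIERARCHY = {
    "advertising": {"classified": ["job ad", "property ad"]},
    "journalism":  {"editorial": [], "report": ["weather report", "traffic report"]},
}

def to_basic_level(label):
    """Normalize a label to its basic-level genre, so that all classes
    used for training sit at a consistent level of generality."""
    for supergenre, genres in HIERARCHY.items():
        for genre, subgenres in genres.items():
            if label == genre or label in subgenres:
                return genre
    return label  # a supergenre, or a label outside the hierarchy

print(to_basic_level("weather report"))  # -> 'report'
print(to_basic_level("editorial"))       # -> 'editorial'
```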
Another compelling issue concerns the ontological nature of genre. It would be intriguing to delve into the special traits that distinguish genre from other textual categories, such as topic, domain, style, register or sentiment. Although a good attempt to shed some light on these relations was made by Lee (2001) for the genre annotation of the British National Corpus (BNC), one practical solution is to conflate all textual categories into the catch-all term "text categories", in line with the Lancaster-Oslo/Bergen (LOB) and Brown corpora built about 50 years ago.
It is undoubtedly true that the term "genre" is loosely applied in everyday life to a number of conceptually heterogeneous classes, as Karlgren (in this book) underpins through his analysis of the Yahoo! directory. His claim is also supported by booksellers' catalogues. For instance, under the tab "Browse Genres", Amazon UK lists many disparate categories, from genres (e.g. BIOGRAPHIES AND MEMOIRS) to mere descriptive labels (e.g. subjects within Arts). As a matter of fact, one trend in AGI experiments to date has been the separation of genre from other textual categories and, above all, from topic. Many authors argue that topic and genre are orthogonal to each other (see Stein et al., in this volume). Others, however, like Vidulin et al. (2007), experiment with mixed textual categories, from genres like FAQS or ERROR MESSAGES, to subjects like "Children's", to functions like "Gateway", to less transparent labels like "Content delivery" (see Table 18 for the description of these categories). Some correlations between genre and other dimensions have been explored, e.g. the one between genre and tasks (Freund, 2008), through an ad hoc corpus. Conflating genre and topic under the anodyne label of "document types" is common practice in IR (e.g. see Yeung et al., 2007; Xu et al., 2007), although the collections used there are mostly topical. Interestingly, though, corpus linguists studying language variation have recently shown that subject categories are not well defined on linguistic grounds, as demonstrated by the findings of Biber and Kurjian (2006), who applied multi-dimensional analysis to two Google directory categories, i.e. Home and Science.
In short, as this abbreviated list of issues shows, there is an ongoing heated discussion in AGI research. One practical shortcoming of this debate is the absence of a common and shared genre framework, which hinders the creation of agreed-upon genre resources, thus affecting the progress of AGI.
A temporary remedy to this lack is the practice of cross-testing. This practice has been possible because some researchers have shared their own collections within the genre community. This has allowed a number of comparative experiments (some of which are listed in Section 5) that provide insights into AGI problems.
In this chapter, I leverage the practice of cross-testing to assess the robustness of a simple genre classification model, described in the next sections. This model is provocatively simple and is used maieutically here to show that genre can be captured with high accuracy (i.e. 86%-96%) with any kind of features (from high-level linguistic attributes to low-level byte n-grams) and any kind of algorithm (from the elementary inferential/rule-based approach described in this chapter to sophisticated statistical/mathematical methods) when genre models are evaluated in restricted and relatively clean in vitro settings. When one attempts to approximate the population of web genres by introducing plenty of noise and many different characterizations of genre classes, it becomes difficult to understand the significance of the results. In short, the experiments described here show that the diverse definitions of the concept of genre have a strong bearing on the characterization of genre classes, thus affecting the generalizability of AGI models as a whole. Ultimately, this chapter is nothing more than a strong encouragement to investigate more extensively, in the future, the robustness of AGI models for the web in less conventional experimental settings. […]
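As a rough illustration of what an elementary rule-based genre classifier can look like, consider the sketch below. The cue lists, genre labels and threshold are invented for illustration and do not reproduce the actual model described in the following sections.

```python
import re

# Invented surface cues per genre; a real model would use richer features
# (linguistic attributes, HTML facets, n-grams, etc.).
GENRE_CUES = {
    "faq":       [r"\bfrequently asked questions\b", r"^\s*q[:.]", r"^\s*a[:.]"],
    "blog":      [r"\bposted (?:on|by)\b", r"\bcomments?\b", r"\barchives?\b"],
    "home page": [r"\bwelcome to\b", r"\babout (?:us|me)\b", r"\bcontact\b"],
}

def classify(text, threshold=2):
    """Return the genre whose cues fire most often, or None (unclassified)."""
    text = text.lower()
    scores = {genre: sum(len(re.findall(cue, text, flags=re.MULTILINE))
                         for cue in cues)
              for genre, cues in GENRE_CUES.items()}
    best = max(scores, key=scores.get)
    # Inference step: only commit when the evidence clears a minimal threshold.
    return best if scores[best] >= threshold else None

print(classify("Frequently Asked Questions\nQ: What is genre?\nA: A class of documents."))
# -> 'faq'
```

The point of such a bare-bones design is precisely the one made above: in a clean in vitro collection, even hand-written cues of this kind can reach high accuracy, which says little about how the model behaves in the wild.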

2.  Approximating Genre Population on the Web
Since the web is in constant flux, it is almost impossible to compile a representative corpus/sample of the web as a whole (the multilingual web), or even of a single language, like the English web. There are estimates of the number of indexed web pages (in April 2005, when my genre model was designed and built, Google could search 8,058,044,651 web pages; cf. Kilgarriff and Grefenstette, 2003 for previous estimates), and this number grows daily, but we do not know anything about the proportions of the different types of text on the web, as pointed out by Kilgarriff and Grefenstette (2003). Interesting approaches have been proposed to create corpora from the web automatically, but these methods are biased towards the construction of corpora where topic or domain, rather than genre, is the priority (Ciaramita and Baroni, 2006; Sharoff, 2006). From a statistical point of view, when the composition of a population is unknown, the best solution is to extract a large random sample and draw inferences from that sample. However, deciding the size of this random sample is not a trivial issue. In this chapter I temporarily sidestep this problem by using some available genre collections to cross-test the model's performance. Although the total number of web pages in the combined genre collections used here is only 6,404 (virtually a drop in the web ocean), this is the largest amount ever used in AGI experiments, with one exception, namely the CMU genre corpus. This corpus, used in Dewdney (2001), contained 9,705 documents divided into seven genres without any noise; it is no longer available and, as far as I know, has never been used in other experiments. I conjecture that this final composite corpus of 6,404 web pages represents well a noisy environment like the web, where documents come from disparate communities, enact different genre conventions and classification schemes, and do not necessarily belong to a recognized genre.
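In outline, cross-testing amounts to training on one collection and evaluating on the others. The sketch below illustrates the procedure with scikit-learn; the collection names and toy documents are placeholders, and the bag-of-words Naive Bayes pipeline merely stands in for whatever features and learner one wishes to stress-test.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the genre collections; in practice each entry would be
# loaded from the corpus files distributed by the collection's creators.
collections = {
    "collection_A": (["welcome to my home page about me and my family",
                      "frequently asked questions q: what is this? a: a faq"],
                     ["home page", "faq"]),
    "collection_B": (["faq q: how do i register? a: click the link",
                      "welcome to our site contact us about us"],
                     ["faq", "home page"]),
}

for train_name, (train_texts, train_labels) in collections.items():
    # Train on one collection and test on every other one ("cross-testing").
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(train_texts, train_labels)
    for test_name, (test_texts, test_labels) in collections.items():
        if test_name != train_name:
            acc = accuracy_score(test_labels, model.predict(test_texts))
            print(f"train on {train_name}, test on {test_name}: accuracy {acc:.2f}")
```

Because each collection was built by a different community with its own characterization of the genre classes, accuracy across collections is typically much lower than accuracy measured within a single collection, which is exactly what makes cross-testing a useful robustness probe.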

2.1 Noise
The impact of noise on genre classification results has been little explored in AGI. The only explicit investigation was carried out by Shepherd et al. (2004). They compared the performance of two classifiers on three subgenres (93 PERSONAL HOME PAGES, 94 CORPORATE HOME PAGES and 74 ORGANIZATIONAL HOME PAGES) with and without noise (77 non-home pages). Predictably, their results show a deterioration in performance when noise is introduced.
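A minimal way to reproduce this kind of experiment, sketched below with toy data rather than Shepherd et al.'s actual corpus or features, is to train a classifier on the target subgenres only and then score it on a test set with and without out-of-genre pages. Since the classifier has no "noise" label to assign, every noise page it is forced to classify counts as an error.

```python
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC

random.seed(0)

def make_docs(phrase, label, n):
    """Toy documents built around a subgenre's cue phrase."""
    return [(f"{phrase} sample {i}", label) for i in range(n)]

genre_docs = (make_docs("my hobbies family photos", "personal", 30)
              + make_docs("our products investor relations", "corporate", 30)
              + make_docs("our mission volunteers donate", "organizational", 30))
random.shuffle(genre_docs)
train, test = genre_docs[:60], genre_docs[60:]

# Noise: pages belonging to none of the target subgenres, built here from a
# scrambled mixture of the same vocabulary so they resemble real pages.
vocab = " ".join(text for text, _ in genre_docs).split()
noise = [(" ".join(random.choices(vocab, k=8)), "noise") for _ in range(30)]

vectorizer = TfidfVectorizer()
classifier = LinearSVC().fit(vectorizer.fit_transform(t for t, _ in train),
                             [y for _, y in train])

def accuracy(docs):
    predictions = classifier.predict(vectorizer.transform(t for t, _ in docs))
    return accuracy_score([y for _, y in docs], predictions)

# The classifier knows only the three subgenres, so every noise page it is
# forced to label counts as an error and accuracy necessarily deteriorates.
print("clean test set: ", accuracy(test))
print("test plus noise:", accuracy(test + noise))
```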
