In the Garden and in the Jungle: Comparing Genres in the BNC and Internet
by Serge Sharoff
In: Genres on the Web: Computational Models and Empirical Studies
Edited by Alexander Mehler, Serge Sharoff and Marina Santini
Text, Speech and Language Technology
Volume 42, 2011, DOI: 10.1007/978-90-481-9178-9
In this chapter I present an approach to classifying the Web into genres. The goal is a compact system of categories that can be assigned with little ambiguity to almost every webpage. The proposed typology is organised from a functional viewpoint: the generalised categories for genre classification correspond to major aims of text production, such as ‘discussion’ or ‘instruction’. Using probabilistic classifiers, the chapter compares the genre distributions of automatically constructed English and Russian Internet corpora with those of their manually collected counterparts (the BNC and the RNC) in terms of these classes.
The jungle metaphor is quite common in genre studies. The subtitle of David Lee’s seminal paper on genre classification is ‘navigating a path through the BNC jungle’. According to Adam Kilgarriff, the BNC is a jungle only when compared to smaller Brown-type corpora, while it looks more like an English garden when compared to the Web. Intuitively this claim is plausible: if we consider the whole Web as a corpus, it probably contains a much greater variety of text types and genres than the 4055 texts in the BNC classified into 70 genres. However, we still need to study this jungle. Nowadays it is relatively easy to collect a large corpus from the Web, using either search engines or web crawlers, so it is easy to surpass the BNC in size. However, we know little about the domains and genres of the texts in corpora collected in this way. Even if we collect a domain-specific corpus and can be sure that all of its texts are about, e.g., epilepsy, we still do not know how many of them are research papers, newspaper articles, webpages advising parents, tutorials for medical staff, etc. Traditional corpora were annotated manually, which did not add significant overhead: such corpora were also compiled manually, so it was possible to annotate each text according to a reasonable number of parameters. Even then, manual classification can be problematic.