Any Land in Sight?
by Marina Santini, Serge Sharoff, Alexander Mehler
In: Genres on the Web: Computational Models and Empirical Studies
Alexander Mehler, Serge Sharoff and Marina Santini
Text, Speech and Language Technology
Volume 42, 2011, DOI: 10.1007/978-90-481-9178-9
Is there hope of sorting out the complex issues of genre on the web? Is there any land in sight? We think so. Genre is a multifarious concept that lends itself to many interpretations and uses. For this reason, we included as many approaches and different views as possible. We believe that the plurality and diversity of visions foster cross-fertilisation of ideas and that inter- and transdisciplinarity are the most productive approaches to increasing our understanding of this important concept.
1 Web Genre Benchmarks
Plurality, diversity, cross-fertilisation, inter- and transdisciplinarity are key points for our future projects as well. The book contains the gist of 15 years of empirical experience with genre and shows the way to the next generation of web genre research. In our view, the necessary next step is the construction of large and shared web genre benchmarks, i.e. web genre reference corpora that enable the objective assessment of the effectiveness of various empirical and computational approaches. As empiricists, we need to test our methods and ideas. In order to test them, we need some kind of reference against which our different methods or ideas can be measured. For this reason, we propose building a web genre benchmark spawned by a wide and comprehensive discussion of genres on the web. Without such a benchmark, it is hard to compare different approaches and evaluate progress.
One main challenge in the construction of web genre benchmarks is to convey the variety of genre classes that have been used so far, without cutting out genre labels that can be potentially useful for other information needs or research fields. Given that there is no lasting solution to the problem of diversity of genre labels, our plan is to produce corpora with stand-off annotation according to a fairly fine-grained genre palette and a set of mappings to other classification schemes. The exact composition of the source palette will have to be determined as a result of future discussion and research, but the starting point for it will be the set of labels listed in the WebGenreWiki. The palette in the wiki results from an agreement between several groups of genre researchers, and, by design, it is a flat list of genre classes with reasonably fine granularity. Most of the labels used in other genre palettes can be converted to this scheme without considerable ambiguity. Naturally, this genre palette will be enhanced and refined along the way.
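As a rough illustration of stand-off annotation combined with a mapping to another scheme, the following minimal sketch keeps genre labels in records separate from the document text and converts them through a lookup table. The label names, the `StandoffAnnotation` record and the `map_annotations` helper are all hypothetical placeholders; they are not taken from the WebGenreWiki palette or from any existing corpus.

```python
from dataclasses import dataclass

# Hypothetical mapping from a fine-grained palette to a coarser scheme.
# These label names are placeholders, not labels from the WebGenreWiki.
FINE_TO_COARSE = {
    "personal_blog": "blog",
    "corporate_blog": "blog",
    "news_report": "news",
    "editorial": "news",
    "faq": "help",
}


@dataclass(frozen=True)
class StandoffAnnotation:
    doc_id: str     # identifier of the document; the text itself is stored elsewhere
    label: str      # genre label assigned to the document
    annotator: str  # who assigned the label


def map_annotations(annotations, mapping):
    """Convert labels to a target scheme and collect those that have no mapping."""
    converted, unmapped = [], []
    for ann in annotations:
        target = mapping.get(ann.label)
        if target is None:
            # Keep the original record so that no label is silently dropped.
            unmapped.append(ann)
        else:
            converted.append(StandoffAnnotation(ann.doc_id, target, ann.annotator))
    return converted, unmapped


annotations = [
    StandoffAnnotation("doc-001", "personal_blog", "annotator_a"),
    StandoffAnnotation("doc-002", "editorial", "annotator_b"),
    StandoffAnnotation("doc-003", "recipe", "annotator_a"),
]
converted, unmapped = map_annotations(annotations, FINE_TO_COARSE)
```

Because the annotations only refer to documents by identifier, several mappings can be applied to the same corpus without touching the texts, and labels without an equivalent in the target scheme remain visible for later discussion.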
Previous experiments have shown how assigning one single genre per document (whatever the unit of analysis) is quite artificial. The chapters in this book have well illustrated this difficulty and reported on how existing genre collections have been annotated with a variety of approaches, following differing taxonomies and nomenclatures. As genre is a multifaceted concept, influenced by elements such as perception, terminological prestige, membership in certain communities, and the fluidity of the language itself, certainly the next step in genre annotation is to find a way to accommodate several genre labels per document, by working out techniques to establish sensible labelling thresholds. Reliable manual annotation paired with the availability of an unlimited amount of unannotated documents on the web can be leveraged by semi-supervised classification methods that will alleviate the burden of any future annotation work.
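One simple way to think about labelling thresholds is sketched below: every genre whose score reaches a cut-off is kept, so a document may receive several labels. This is only an illustration under stated assumptions. The scores, the threshold value of 0.3, the fallback rule and the function name `assign_labels` are hypothetical, and the scores could equally come from annotator agreement or from a classifier.

```python
def assign_labels(scores, threshold=0.3, fallback_to_best=True):
    """Return all genres whose score meets the threshold, highest score first.

    scores: dict mapping a genre label to a score between 0 and 1.
    If nothing reaches the threshold and fallback_to_best is True,
    the single best-scoring genre is returned instead.
    """
    selected = sorted(
        (label for label, score in scores.items() if score >= threshold),
        key=lambda label: -scores[label],
    )
    if not selected and fallback_to_best and scores:
        selected = [max(scores, key=scores.get)]
    return selected


# Hypothetical scores for two documents.
mixed_doc = {"blog": 0.55, "news": 0.35, "faq": 0.10}
weak_doc = {"blog": 0.20, "news": 0.10}

mixed_labels = assign_labels(mixed_doc)
weak_labels = assign_labels(weak_doc)
strict_labels = assign_labels(weak_doc, fallback_to_best=False)
```

Whether a document with no score above the cut-off should receive its best guess or no label at all is a design decision that annotation guidelines would need to settle; the `fallback_to_best` switch merely makes that choice explicit.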
Generally speaking, corpora are designed as samples for studying a much larger whole. With respect to genres, two questions naturally arise: Is a given corpus representative of a large number of genres? Is a given genre adequately represented in a given corpus? The first question is important, as attempts to create a very big corpus from a small number of sources normally restrict the diversity of genres. Our reference corpora will be produced from a diverse collection of webpages, as already experienced for the I-EN. For a cross-cultural concept like genre, it also makes sense to create reference web genre corpora for multiple languages. The second question is much more challenging, as a subcorpus defined for a given genre is normally much smaller and has less variation. The BNC, for instance, is representative of a variety of genres including research articles. However, as for the genre of research articles itself, its texts were mostly taken from the Journal of Gastroenterology and Hepatology, so they cannot reflect the variety within this genre. Building on this experience, one of our goals is to create genre reference corpora that aim at a better representation of each genre.
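The second question can be probed with a very simple diagnostic: for each genre, how large a share of its documents comes from the single most frequent source? The sketch below is a minimal illustration with invented data; the genre names, the source names and the function `dominant_source_share` are hypothetical, and a high share is only a warning sign, not a full measure of representativeness.

```python
from collections import Counter, defaultdict


def dominant_source_share(records):
    """Map each genre to (most frequent source, share of the genre's documents).

    records: iterable of (genre, source) pairs, one pair per document.
    """
    by_genre = defaultdict(Counter)
    for genre, source in records:
        by_genre[genre][source] += 1

    result = {}
    for genre, counts in by_genre.items():
        source, n_docs = counts.most_common(1)[0]
        result[genre] = (source, n_docs / sum(counts.values()))
    return result


# Invented sample: one genre is dominated by a single source.
records = (
    [("research_article", "journal_a")] * 3
    + [("research_article", "journal_b")]
    + [("blog", "site_x"), ("blog", "site_y")]
)
shares = dominant_source_share(records)
```

In this invented sample, three of the four research articles come from one journal, which mirrors the kind of imbalance described above for research articles in the BNC.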
2 Work Plan
The major research efforts are to:
1. Propose a characterisation of genre suitable for digital environments and empirical approaches shared by a number of genre experts working in different disciplines and following different schools of thought.
2. Define the criteria for the construction of genre benchmarks and draw up annotation guidelines.
3. Create several genre benchmarks in several languages that differ in size, corpus composition, and annotation methods, and that can be updated over time with emerging genres.
We conjecture that the construction of a shared web genre reference corpus would be the most solid legacy to future genre research.
The creation of multilingual web genre benchmarks will:
1. Help researchers avoid investing large amounts of time and money coming up with proprietary and incompatible solutions instead of working with shared resources and common standards.
2. Provide a common ground for genre-related research, spanning from information retrieval to discourse analysis.
3. Provide material to be used as training data for machine learning approaches for tasks such as automatic web genre identification, focused crawling, spam detection and web mining.
4. Allow more sophisticated computational genre modelling that builds upon genre relations at different units of analysis.