Overview: Automatic web Genre Identification (AGI)

Genre is a fundamental component of human communication, but the definition of genre is vague, as genre classes can indicate a text type, a discourse practice, a rhetorical strategy, a cognitive class, or any textual category. In this post I provide a short overview of previous and current approaches to Automatic web Genre Identification (AGI).

In its early stage, AGI built upon the seminal work of Douglas Biber (Biber, 1988). Although Biber did not perform any AGI himself, he explored linguistic variation across different genres (focusing on the difference between spoken and written language) using statistical approaches based on computable features produced by his tagger, such as the number of that-deletions or verbs in the past tense.

Subsequent AGI research is based on a wide range of more easily computable features, such as POS tags (Karlgren and Cutting, 1994), function and genre-specific words (Stamatatos et al., 2000; Dewdney et al., 2001), POS n-grams (Santini, 2007; Sharoff, 2007) or character n-grams (Kanaris and Stamatatos, 2009; Mason et al., 2009), as well as visual features (Levering et al., 2008) and web structure, i.e. connections between webpages of the same genre (Waltinger and Mehler, 2009).

All the authors cited above, implicitly or explicitly, stress the potential that genre could have for Information Retrieval (IR). The importance of genre for the Internet and web search is evident, since web applications were and are mostly based on keywords representing the topic, not the genre, of web documents. As the continuing expansion of the Internet makes it increasingly hard to find information relevant to the user’s needs, genre could be a useful selection principle.

Because web genres are expected to have great potential for information systems, in the last six years the major effort has been to find genre-revealing features that are “light” enough to be implemented in web applications. Shallow features are attributes that can be extracted without pre-processing or advanced NLP tools, such as syntactic parsers. They are claimed to be a prerequisite when the task is to deal with billions of web pages. Character n-grams (Kanaris and Stamatatos, 2009; Mason et al., 2009) are considered very useful features for web search applications, and they are also supposed to cope better with the multilinguality of the web. Findings show that web genres can be captured with high accuracy (the best results are between 86% and 97% on Santini’s corpus) with any kind of features, from high-level linguistic attributes to low-level byte n-grams.
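
To make the idea of a shallow feature concrete, here is a minimal sketch of character n-gram extraction in Python. It is only an illustration of the general technique, not the feature pipeline used in any of the studies cited above; the function names, the choice of n = 3 and the sample text are assumptions made for this example.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a string.

    No tokenizer, tagger or parser is needed, which is what makes
    this kind of feature "shallow".
    """
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def relative_frequencies(counts):
    """Turn raw n-gram counts into relative frequencies."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


if __name__ == "__main__":
    # Hypothetical page text, invented for this illustration.
    page_text = "Frequently asked questions about our service"
    profile = relative_frequencies(char_ngrams(page_text.lower(), n=3))
    # Print the five most frequent trigrams with their relative frequency.
    for gram, freq in sorted(profile.items(), key=lambda kv: -kv[1])[:5]:
        print(repr(gram), round(freq, 3))
```

A frequency profile like this could then be passed to any standard classifier. Since it works directly on characters, the same code applies unchanged to pages in different languages.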
However, the bulk of previous experiments have been carried out in restricted and relatively “clean” in vitro settings. While real-world web applications, e.g. Google (a general-purpose search engine) or WebCorp (a linguistic search engine), require approaches (features and algorithms) that can handle billions of web documents at a time, AGI research is still characterized by the following: 

  • small-size genre collections; 
  • small-scale experiments; 
  • many different genre palettes (thus making the comparison of results much more difficult); 
  • no external evaluation, i.e., on a genre benchmark independent from the AGI researchers.

Although results are promising and encouraging, under these conditions it is difficult to draw any conclusions about the generalizability and effectiveness of AGI models with larger collections and in real-world conditions.

A new area of AGI research needs to be explored, i.e. experimentation with noise. Kennedy and Shepherd (2005) started introducing non-home pages in their experiments with home page subgenres. Kanaris and Stamatatos (2009) applied a genre classifier trained on one corpus with a specific genre palette to another corpus with a different genre palette, containing some similar genres and some irrelevant genres. More recently, a study (Santini, 2010) showed some trends in the performance of a classifier when diverse genre collections are combined and co-exist with unclassified web pages, i.e. “noise”.
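
As a rough illustration of what an experiment with noise measures, the sketch below computes per-genre precision on a test set, first without and then with noise pages. It is not the evaluation protocol of any of the studies mentioned above; the genre labels, the "noise" label and the toy predictions are all invented for this example.

```python
def precision_per_genre(gold, predicted, noise_label="noise"):
    """Per-genre precision over paired gold and predicted labels.

    A genre assigned to a noise page counts as an error for that genre.
    Predictions equal to noise_label are treated as abstentions and skipped.
    """
    correct, total = {}, {}
    for gold_label, predicted_label in zip(gold, predicted):
        if predicted_label == noise_label:
            continue  # the classifier declined to assign a genre
        total[predicted_label] = total.get(predicted_label, 0) + 1
        if gold_label == predicted_label:
            correct[predicted_label] = correct.get(predicted_label, 0) + 1
    return {genre: correct.get(genre, 0) / total[genre] for genre in total}


if __name__ == "__main__":
    # Toy data (hypothetical): four clean pages, then two noise pages.
    clean_gold = ["blog", "faq", "blog", "faq"]
    clean_pred = ["blog", "faq", "faq", "faq"]
    noisy_gold = clean_gold + ["noise", "noise"]
    noisy_pred = clean_pred + ["blog", "noise"]

    print("clean:", precision_per_genre(clean_gold, clean_pred))
    print("noisy:", precision_per_genre(noisy_gold, noisy_pred))
```

In this toy run, the single noise page that is wrongly labelled as a blog lowers the precision for that genre, while the genre that attracted no noise pages is unaffected.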

In future, more experiments need to be carried out to investigate the effect of noise on AGI.
