AGI: Structured and Unstructured Noise

How would you handle automatic text classification in noisy conditions? This is what has been done, to my knowledge, in Automatic web Genre Identification (AGI). By noise I mean two different disturbing factors*: 1) the training sample and the test sample come from different sources/annotators; 2) the test set contains genre classes that are not present in the training set. These two types of noise reflect two real-world conditions of working with genre: 1) since genre is a complex notion that has been interpreted in different ways, the identification of the same genre class can vary depending on the research agenda or individual preferences; 2) we cannot possibly conceive a genre classifier that performs well if we include all existing genres either on the web or in

Overview: Automatic web Genre Identification (AGI)

Genre is a fundamental component of human communication, but the definition of genre is vague, as genre classes can indicate a text type, a discourse practice, a rhetorical strategy, a cognitive class, or any textual category. In this post I provide a short overview of previous and current approaches to Automatic web Genre Identification (AGI). In its early stages, AGI builds upon the seminal work of Douglas Biber (Biber, 1988). Although Biber did not perform any AGI, he explored linguistic variation within different genres (focusing on the difference between spoken and written language) using statistical approaches based on computable features produced by Biber's tagger, such as the number of that-deletions or verbs in the past tense.
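To make the idea of a "computable feature" concrete, here is a minimal sketch in Python. It is not Biber's tagger and does not reproduce his feature set: the function name feature_rates, the "-ed" suffix heuristic, and the normalisation per 1,000 tokens are all assumptions made for illustration. A real system would rely on part-of-speech tagging rather than a suffix match, which also catches words such as "need" or "bed".

```python
import re

# Hypothetical, crude stand-in for a tagger-based feature:
# any word ending in "-ed" is treated as past-tense-like.
PAST_TENSE_LIKE = re.compile(r"\b[a-z]+ed\b")
TOKEN = re.compile(r"[a-z']+")


def feature_rates(text, per=1000):
    """Return a rough past-tense-like rate, normalised per `per` tokens."""
    lowered = text.lower()
    tokens = TOKEN.findall(lowered)
    n_tokens = len(tokens)
    if n_tokens == 0:
        # Avoid division by zero on empty input.
        return {"tokens": 0, "past_tense_like": 0.0}
    n_past = len(PAST_TENSE_LIKE.findall(lowered))
    return {
        "tokens": n_tokens,
        "past_tense_like": n_past * per / n_tokens,
    }


if __name__ == "__main__":
    print(feature_rates("She walked home and talked to her neighbour."))
```

Rates of this kind, computed for each text, could then serve as the numeric input to a statistical comparison across genres.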

Book Outline: Automatic Identification of Genre in Web Pages (2011)

Automatic Identification of Genre in Web Pages: A new perspective [Paperback]
Marina Santini (Author)
Paperback: 332 pages
Publisher: LAP LAMBERT Academic Publishing (December 19, 2011)
Language: English
ISBN-10: 3847306871
ISBN-13: 978-3847306870

Book Overview

This book is divided into five parts: a preliminary part (Part I), three empirical parts (Parts II, III and IV) and an epilogue (Part V).

