The WebGenre Blog: The power of genre applied to digital information. By Marina Santini » Entries tagged with "automatic genre classification"

Spreading the Word about (Web)Genre Research

What is genre? Why is it useful to master genre conventions? Can we classify document genres automatically? Around the world, many researchers and scholars from a wide range of disciplines are trying to answer these and many other questions. Aristotle suggested the first genre classification scheme by dividing literature into Tragedy, Comedy and Lyrics (well, I am oversimplifying…). Aristotle smoothly classified all the knowledge of his time, so arguably classifying genres … Read entire article »

Filed under: discussions, reading suggestions, references, reflections

Book in Preparation: A Computational Theory of Digital Genre

Book in preparation: A Computational Theory of Digital Genre, by Marina Santini. The book lists, examines and develops the key concepts necessary to build a novel, intuitive and robust definition of digital genre for computational purposes. The newly proposed definition is the tenet of the computational theory underlying computational models for automatic digital genre classification. The book is divided into six parts, each one discussing in depth issues that have been neglected or considered too controversial for scholars and researchers to reach any theoretical or pragmatic agreement. The book provides not only theoretical foundations, but also a number of use cases, corpora/datasets, and computational models that readers can re-use in their own experiments to evaluate the validity of the theoretical and practical solutions proposed in this book. Preliminary Table of Contents: PART … Read entire article »

Filed under: TOC

Dissemination: Stable Classification of Text Genres (2011)

Stable Classification of Text Genres. Philipp Petrenz and Bonnie Webber (University of Edinburgh). Computational Linguistics, June 2011, Vol. 37, No. 2, Pages 385-393. Abstract: Every text has at least one topic and at least one genre. Evidence for a text’s topic and genre comes, in part, from its lexical and syntactic features, which are used in both Automatic Topic Classification and Automatic Genre Classification (AGC). Because an ideal AGC system should be stable in the face of changes in topic distribution, we assess five previously published AGC methods with respect to both performance on the same topic–genre distribution on which they were trained and stability of that performance across changes in topic–genre distribution. Our experiments lead us to conclude that (1) stability in the face of changing topical distributions should be added to the evaluation criteria … Read entire article »
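To make the idea of "stability" concrete, here is a minimal sketch in Python. It is not the evaluation procedure of the paper, which the excerpt above does not spell out; it only assumes that stability can be illustrated as the accuracy lost when a classifier moves from the topic–genre distribution it was trained on to a shifted one. The function names `accuracy` and `stability_drop` and the toy labels are hypothetical.

```python
def accuracy(gold, predicted):
    """Fraction of documents whose predicted genre matches the gold genre."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def stability_drop(acc_same, acc_shifted):
    """Illustrative stability score (an assumption, not the paper's measure):
    accuracy lost when the topic-genre distribution changes.
    A value near zero suggests a stable classifier."""
    return acc_same - acc_shifted


# Toy gold labels and predictions, invented for illustration only.
gold = ["editorial", "news", "editorial", "news"]
pred_same_distribution = ["editorial", "news", "news", "news"]
pred_shifted_distribution = ["news", "news", "news", "news"]

acc_same = accuracy(gold, pred_same_distribution)
acc_shifted = accuracy(gold, pred_shifted_distribution)
drop = stability_drop(acc_same, acc_shifted)
```

A classifier that relies on topical vocabulary would be expected to show a large drop under such a shift, while one that relies on genre-bearing features should show a small one.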

Filed under: dissemination

White Paper: Automatic Genre Identification – Testing with Noise

Automatic Genre Identification – Testing with Noise, by Efstathios Stamatatos, Serge Sharoff, Marina Santini. Copyright © 2012, All rights reserved. Citation: Stamatatos E., Sharoff S., Santini M. (2012). Automatic Genre Identification – Testing with Noise. [White paper]. Retrieved from http://www.forum.santini.se/2012/03/white-paper-automatic-genre-identification-testing-with-noise/ The genre collections used in the experiments are available here. The reference list is here. In the experiments described below, genre classes coming from three genre collections have been used: Santinis7 (Santini, 2007), KI-04 (Meyer zu Eissen and Stein, 2004), and HGC (Stubbe and Ringlstetter, 2007). These genre collections have been created by different people, in different universities, for different purposes, with different criteria, and different notions of what genre is. Since genre is a complex concept and genre classes can be characterized in different ways, we assume that having an AGI algorithm … Read entire article »

Filed under: collaborative blogging, computational models, featured, signed posts, white papers

AGI: Structured and Unstructured Noise

How would you handle automatic text classification in noisy conditions? This is what has been done, to my knowledge, in Automatic web Genre Identification (AGI). By noise here I refer to two different disturbing factors*: 1) the training sample and test sample come from different sources/annotators; 2) the test set contains genre classes that are not present in the training set. These two types of noise reflect the following real-world conditions when working with genre: 1) since genre is a complex notion that has been interpreted in different ways, the identification of the same genre class can vary depending on the research agenda or individual preferences; 2) we cannot possibly conceive a genre classifier that has a good performance if we include all existing genres either on the web or in … Read entire article »
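The two noise conditions described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each document is a dictionary with hypothetical `genre` and `source` keys, and the helper names `split_by_source` and `unseen_genres` are invented here, not taken from any of the cited experiments.

```python
def split_by_source(docs, train_source):
    """Noise type 1: train on one source/annotator, test on all the others."""
    train = [d for d in docs if d["source"] == train_source]
    test = [d for d in docs if d["source"] != train_source]
    return train, test


def unseen_genres(train, test):
    """Noise type 2: genre classes in the test set that never occur in training."""
    train_labels = {d["genre"] for d in train}
    test_labels = {d["genre"] for d in test}
    return test_labels - train_labels


# Toy documents, invented for illustration only.
docs = [
    {"text": "My day at the beach", "genre": "blog", "source": "collection_A"},
    {"text": "How do I reset my password?", "genre": "faq", "source": "collection_A"},
    {"text": "Thoughts on genre", "genre": "blog", "source": "collection_B"},
    {"text": "Add to cart", "genre": "eshop", "source": "collection_B"},
]

train, test = split_by_source(docs, "collection_A")
novel = unseen_genres(train, test)
```

In this toy setup the `blog` class is labelled by two different sources (noise type 1), and `eshop` appears only at test time (noise type 2), so a classifier evaluated this way needs some strategy for documents that belong to none of the classes it was trained on.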

Filed under: dialectic, discussions, overviews

Reading Suggestion: Adjectives and adverbs as indicators of affective language for automatic genre detection (2008)

Rittman, Robert and Nina Wacholder. (2008). Adjectives and adverbs as indicators of affective language for automatic genre detection. Proceedings of AISB 2008 Convention, Symposium on Affective Language. Aberdeen, Scotland, April 1-2, 2008. Abstract. We report the results of a systematic study of the feasibility of automatically classifying documents by genre using adjectives and adverbs as indicators of affective language. In addition to the class of adjectives and adverbs, we focus on two specific subsets of adjectives and adverbs: (1) trait adjectives, used by psychologists to assess human personality traits, and (2) speaker-oriented adverbs, studied by linguists as markers of narrator attitude. We report the results of our machine learning experiments using Accuracy Gain, a measure more rigorous than the standard measure of Accuracy. We find that it is possible to classify … Read entire article »

Filed under: reading suggestions, references

Book Outline: Automatic Identification of Genre in Web Pages (2011)

Automatic Identification of Genre in Web Pages: A new perspective. Marina Santini (Author). Paperback: 332 pages. Publisher: LAP LAMBERT Academic Publishing (December 19, 2011). Language: English. ISBN-10: 3847306871. ISBN-13: 978-3847306870. Book Overview: This book is divided into five parts: a preliminary part (Part I), three empirical parts (Parts II, III and IV) and an epilogue (Part V). … Read entire article »

Filed under: overviews

Abstract: Formulating Representative Features with Respect to Genre Classification

Formulating Representative Features with Respect to Genre Classification, by Yunhyong Kim and Seamus Ross. In: Genres on the Web: Computational Models and Empirical Studies, edited by Alexander Mehler, Serge Sharoff and Marina Santini. Text, Speech and Language Technology, Volume 42, 2011, DOI: 10.1007/978-90-481-9178-9. Abstract: Document genre (e.g. scientific article) is closely bound to the physical and conceptual structure of the document as well as the level of content depth found within the text. Hence, it is useful for comparing documents on the basis of metrics other than topical similarity (e.g. topical depth). Moreover, the structural information (e.g. conceptual flow) derived from genre classification can be used to locate target information within the text. Despite its usefulness, the success of previous attempts to automate genre classification is somewhat unsatisfactory. These attempts largely depended on the statistical analysis of some normalised … Read entire article »

Filed under: abstracts

Seminar Slides: Stockholm University

The PowerPoint presentation of today’s seminar at Stockholm University (24 Feb 2011) can be downloaded from here … Read entire article »

Filed under: references