Formulating Representative Features with Respect to Genre Classification
by Yunhyong Kim and Seamus Ross
In: Genres on the Web: Computational Models and Empirical Studies
Edited by Alexander Mehler, Serge Sharoff and Marina Santini
Text, Speech and Language Technology
Volume 42, 2011, DOI: 10.1007/978-90-481-9178-9
Document genre (e.g. scientific article) is closely bound to the physical and conceptual structure of the document, as well as to the depth of content found within the text. Genre is therefore useful for comparing documents on the basis of metrics other than topical similarity (e.g. topical depth). Moreover, the structural information (e.g. conceptual flow) derived from genre classification can be used to locate target information within the text. Despite this usefulness, previous attempts to automate genre classification have met with only limited success. These attempts largely depended on the statistical analysis of some normalised frequency of terms in the document (where "term" refers to words, phrases, syntactic units, sentences and paragraphs, as well as other patterns derived from structural analysis). Such approaches tend to neglect how these patterns change over the course of the document. Here, we report the results of automated experiments based on distributive statistics of words, presenting evidence that term distribution may be a highly promising indicator of document genre class.
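The contrast drawn above, between a normalised term frequency and the way a term's occurrences are spread through a document, can be illustrated with a small sketch. This is not the authors' method or feature set: the function names, the choice of equal-width segments, and the toy documents are all assumptions made purely for illustration.

```python
from collections import Counter


def normalised_frequency(tokens, term):
    """Share of all tokens equal to `term`; ignores where they occur."""
    return Counter(tokens)[term] / len(tokens) if tokens else 0.0


def segment_distribution(tokens, term, n_segments=4):
    """Fraction of the term's occurrences in each of `n_segments`
    equal-width slices of the document (an assumed, simplistic
    stand-in for a distribution-based feature)."""
    counts = [0] * n_segments
    n = len(tokens)
    for i, tok in enumerate(tokens):
        if tok == term:
            # Map token position i to its segment index.
            counts[i * n_segments // n] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * n_segments


# Two hypothetical toy documents with the same frequency of "method"
# but a different spread of its occurrences.
front_loaded = ["method", "method", "a", "b", "c", "d", "e", "f"]
spread_out = ["method", "a", "b", "c", "method", "d", "e", "f"]

freq_front = normalised_frequency(front_loaded, "method")
freq_spread = normalised_frequency(spread_out, "method")
dist_front = segment_distribution(front_loaded, "method")
dist_spread = segment_distribution(spread_out, "method")
```

In this toy case the two frequency values are identical, while the two segment distributions differ, which is the kind of information a purely frequency-based feature would discard.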
Document classification is one of the most fundamental steps in enabling the search, selection, and ranking of digital material according to its relevance in answering a predefined search. As such it is a valuable means of knowledge discovery and an essential part of the effective and efficient management of digital documents in a repository, library, or archive.