Book Outline: Automatic Identification of Genre in Web Pages (2011)

Automatic Identification of Genre in Web Pages: A new perspective [Paperback]

Marina Santini (Author)

  • Paperback: 332 pages
  • Publisher: LAP LAMBERT Academic Publishing (December 19, 2011)
  • Language: English
  • ISBN-10: 3847306871
  • ISBN-13: 978-3847306870

Book Overview

This book is divided into five parts: a preliminary part (Part I), three empirical parts (Parts II, III and IV) and an epilogue (Part V). Part I includes this introduction (Chapter 1) and a literature review (Chapter 2). Part II focuses on automatically extractable genre-revealing features (Chapters 3, 4, 5 and 6). Part III motivates the need for a flexible genre classification scheme (Chapters 7, 8 and 9). Part IV focuses on methods for automatic identification of zero, one, or multiple genres in web pages (Chapters 10 and 11). Finally, Part V (Chapter 12) reports conclusions and outlines future work.

Chapter 1 contains this introduction, where claims and motivation are explained.

Chapter 2 has a wide scope. First, it provides a concise overview of the concept of genre. Second, it reviews the main automatic approaches to text type and genre identification, pointing out why these approaches are inadequate when dealing with genres of web pages. Third, it presents some observations about text types. Finally, it briefly reviews the different kinds of features that have been used in automatic genre classification.

Chapter 3 highlights some open issues about the use of an NLP tool, namely a tagger-parser, applied to web pages. These issues are not easy to solve. The goal of this chapter is to point them out for further discussion, and to highlight that the results of the experiments reported in this book are conditioned by these issues.

Chapter 4 presents a first expansion of the range of features useful for automatic genre identification: POS trigrams. POS trigrams are assessed on a number of genre categories found in the BNC. The idea is that if they prove to be discriminating on a stable corpus like the BNC, then they can be applied to a more difficult corpus, such as a web page collection.
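As an illustration of what a POS trigram feature looks like, here is a minimal Python sketch. It assumes the text has already been tagged, and the Penn-style tag names in the example are an assumption for illustration only; the outline does not name the tagger-parser or the tag set used in the book.

```python
from collections import Counter


def pos_trigrams(tags):
    """Count every run of three consecutive POS tags in a tagged text."""
    return Counter(zip(tags, tags[1:], tags[2:]))


def relative_frequencies(counts):
    """Turn raw trigram counts into proportions of all trigrams in the text."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {trigram: n / total for trigram, n in counts.items()}


# Hypothetical tag sequence (Penn-style tags, assumed for illustration).
example_tags = ["DT", "NN", "VBZ", "DT", "NN", "VBZ", "JJ"]
example_counts = pos_trigrams(example_tags)
example_freqs = relative_frequencies(example_counts)
```

The relative frequencies can then serve as one feature vector per document, which keeps long and short texts comparable.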

Chapter 5 presents a second expansion of genre-revealing features: the facets. Facets are macro-features, i.e. collections of single features that highlight different aspects, or facets, of communication. The idea behind the use of facets instead of single features is that the different communication aspects highlighted by the facets can then be combined to infer text types. Facets can also be used like any other feature in any standard method for automatic genre classification. For example, a subset of facets, the syntactic patterns, fed into an SVM classifier achieves an accuracy of about 86%, vs. an accuracy of about 83% achieved by lexical subordinators.
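The notion of a facet as a collection of single features can be sketched as follows. The facet names and member words below are hypothetical and chosen only to show the grouping idea; they are not the facets defined in the book.

```python
# Hypothetical facets: each maps a name to the single features (here, word
# forms) that it groups together. These are illustrative assumptions only.
FACETS = {
    "first_person": {"i", "we", "my", "our"},
    "second_person": {"you", "your"},
}


def facet_scores(tokens, facets=FACETS):
    """Return, per facet, the share of tokens that belong to that facet."""
    n = len(tokens)
    if n == 0:
        return {name: 0.0 for name in facets}
    lowered = [t.lower() for t in tokens]
    return {
        name: sum(t in members for t in lowered) / n
        for name, members in facets.items()
    }


example_scores = facet_scores("We hope you enjoy our page".split())
```

Each facet yields a single number per document, so a handful of facets can stand in for many individual features when feeding a standard classifier.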

Chapter 6 summarizes and discusses the properties of features employed in automatic text type and genre identification studies. The chapter ends with a list of the actual types of features that will be used in the experiments presented in the next parts of the book.

Chapter 7 focuses on the detection of emerging genres. The set of experiments presented in this chapter shows to what extent it is possible to detect these forms with a standard algorithm, like cluster analysis. Although some emerging textual patterns can be identified, cluster analysis has some inherent limitations regarding the stability and the interpretation of the clusters. Additionally, the relations of these patterns with existing and acknowledged genres remain unspecified.

Chapter 8 describes a web user study showing that a single genre classification does not correspond to web users’ view of web pages. This study reports that 20 out of 25 web pages are labelled using at least three genres. It also shows that when genre conventions are unclear, web users tend to disagree more on genre assignment.

Chapter 9 suggests a characterization of genres of web pages that includes the two attributes of genre hybridism and individualisation and explains why they are useful.

Chapter 10 describes the advantages and disadvantages of the standard single-label genre classification of web pages. The main advantage is that a high accuracy (about 90%) on a single genre can be reached in a closed and static situation with a predictable population, a felicitous genre palette, mixed feature sets, a balanced corpus, a state-of-the-art classification algorithm, and prototypical web pages. The main disadvantage is that such a classification model shows a reduced generality and exportability, and heavily depends on the above-listed attributes. In brief, it cannot handle the complexity of web pages and the fluidity and fast-paced evolution of the web.

Chapter 11 proposes a new method that implements a zero-to-multi-genre classification scheme. In addition to the standard identification of a single genre, this model can return multiple genre labels or no genre label. It does this using linguistic insights from facets and text types. This approach relies on a statistical model already used in artificial intelligence, and its results are competitive: about 86% vs. 90% for a standard machine-learning model, in ideal conditions; and about 86% vs. 76% in more realistic conditions. Although such a model cannot be fully evaluated for zero-genre and multi-genre labelling given the limitations of the current state of genre research, it offers a view on genre variation within a web page.
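To make the zero-to-multi-genre output concrete, here is a minimal Python sketch of the labelling step only. It is not the book's model: the outline does not name the statistical model, so the per-genre scores, the genre names, and the threshold value below are all assumptions for illustration.

```python
def assign_genres(scores, threshold=0.5):
    """Return every genre whose score reaches the threshold, best first.

    The result may be empty (zero genres), hold one label, or hold several.
    Both `scores` and `threshold` are hypothetical inputs, not values
    taken from the book.
    """
    selected = [genre for genre, score in scores.items() if score >= threshold]
    return sorted(selected, key=lambda genre: scores[genre], reverse=True)


# Hypothetical per-genre scores for three web pages.
multi = assign_genres({"blog": 0.8, "faq": 0.6, "eshop": 0.1})
single = assign_genres({"blog": 0.9, "faq": 0.2})
none = assign_genres({"blog": 0.2, "faq": 0.3})
```

Unlike a single-label classifier, which always returns exactly one genre, this kind of decision rule lets a page receive no label when no genre scores highly and several labels when more than one does.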

Chapter 12 summarizes the main findings and outlines future work.
