Web Genre Analysis: Use Cases, Retrieval Models, and Implementation Issues
by Benno Stein, Sven Meyer zu Eissen and Nedim Lipka
In: Genres on the Web: Computational Models and Empirical Studies
Alexander Mehler, Serge Sharoff and Marina Santini (Eds.)
Text, Speech and Language Technology
Volume 42, 2011, DOI: 10.1007/978-90-481-9178-9
People who search the World Wide Web often have a multi-faceted understanding of their information need: they know what they are searching for, and they know which form or type the desired documents should have. The former aspect relates to the content of a desired document (its topic), the latter to the presentation of its content and the intended target group. Owing to the different user groups and the technical means of the World Wide Web, several favorite specializations of Web documents have emerged: a document may contain many links (e.g. a link collection), scientific text (e.g. a research article), almost no text but pictures (e.g. an advertisement page), or a short answer to a specific question (e.g. a mail in a help forum). These examples suggest that it can be of much help if the retrieval process is capable of addressing a user’s information need with regard to what is called here genre or Web genre. This chapter contributes to Web genre analysis. It presents relevant use cases, discusses existing and new technology for the construction of Web genre retrieval models, and outlines implementation aspects of a genre-enabled Web search. Special focus is put on the generalization capability of Web genre retrieval models, for which we present new evaluation measures and, for the first time, a quantitative analysis.
The genre of a Web document provides information related to the document’s form, purpose, and intended audience. Documents of the same genre can address different topics, and documents on the same topic can belong to different genres; several researchers therefore consider genre and topic as two orthogonal concepts. Though this claim does not hold without exceptions, genre information has attracted much interest as a positive or negative filter criterion for Web search results. Despite the undoubted potential of automatic genre identification for Web pages, retrieval models for genre have so far failed to convince in Web retrieval practice. The reasons for this are threefold. First, as was also observed by Santini, the proposed genre classifier technology is corpus-centered: its application within Web retrieval scenarios shows a significant degradation of the classification performance, rendering the technology largely useless for genre-enabled Web search. Second, the existing genre retrieval models are computationally too expensive to be applied in an ad-hoc manner. Third, there is no genre palette that fits all users and all purposes. Ideally, a user should be able to adapt a genre classifier to his or her information need, e.g. by labeling documents as being of an interesting genre or not. Of the mentioned deficits, the first is the most severe: in a nutshell, the existing Web genre retrieval models generalize insufficiently. The second deficit is also crucial, since it makes the important use case of a genre-enabled Web search unattractive for users who expect a result list from a search engine at the press of a button. We argue that these problems can be overcome, and this chapter will introduce elements of the necessary technological means.