Poster paper by: Philipp Scholl, Renato Domínguez García, Doreen Böhnstedt, Christoph Rensing, Ralf Steinmetz
The term web genre denotes the type of a given web resource, in contrast to the topic of its content. In this research, we focus on recognizing the web genres blog, wiki and forum. We present a set of features that exploit the hierarchical structure of the web page's HTML mark-up and thus, in contrast to related approaches, do not depend on a linguistic analysis of the page's content. Our results show that it is possible to achieve very good accuracy for a fully language-independent detection of structured web genres.
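To make the idea of structure-based features more concrete, the following is a minimal sketch of how language-independent statistics could be read from a page's mark-up without looking at its text. It is not the feature set from the paper, which the abstract does not list. The function `structural_features`, the class `StructureCollector`, and the four features chosen here (tag count, nesting depth, link ratio, class repetition) are illustrative assumptions.

```python
from collections import Counter
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class StructureCollector(HTMLParser):
    """Collects structural statistics from HTML; the text content is ignored."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0
        self.tag_counts = Counter()
        self.class_counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        # Count class names: repeated classes hint at repeated blocks
        # such as posts or comments (an assumption of this sketch).
        for name, value in attrs:
            if name == "class" and value:
                self.class_counts.update(value.split())
        if tag not in VOID_TAGS:
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def structural_features(html):
    """Return a dict of hypothetical structure-only features for one page."""
    collector = StructureCollector()
    collector.feed(html)
    collector.close()
    total = sum(collector.tag_counts.values())
    return {
        "total_tags": total,
        "max_depth": collector.max_depth,
        "link_ratio": collector.tag_counts["a"] / total if total else 0.0,
        "max_class_repetition": max(collector.class_counts.values(), default=0),
    }


if __name__ == "__main__":
    page = ('<html><body><div class="post"><a href="#">x</a></div>'
            '<div class="post"><p>y<br></p></div></body></html>')
    print(structural_features(page))
```

Feature dictionaries of this kind could then be passed to any standard classifier trained on pages labelled as blog, wiki or forum. Because only tags and attributes are inspected, the same extraction applies to pages in any language.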
Full article: <http://www2009.eprints.org/159/1/p1157.pdf>
Copyright is held by the author/owner(s).
WWW 2009, April 20–24, 2009, Madrid, Spain.