The WebGenre Blog: The power of genre applied to digital information. By Marina Santini » Archive

Summary: Looking for Corpora…

Dear All, In this post I collect all the suggestions I got for the following request: “Looking for Corpora in….” Big thanks to (hope I have not forgotten anybody): Johannes Heinecke, Dominika Rogozinska, Mohamed-Zakaria KURDI, Bartosz Ziólko, Olga Whelan, Margarita Borreguero, Zuloaga, Ayesha Zafar, Will Snellen, Katherine (Katie) Skees Hund, Anna Matyszczyk, Massinissa Ahmim, Marcin Feder, Maria Pia Montoro, Lawrence Niculescu, Jesus Vilares, Ewa Gwiazdecka, Jack Bowers, Taner Sezer, Yvonne Adesam, Kadri Muischnek, Anne Tamm, Ralf Steinberger, Ricardo Campos, Edyta Jurkiewicz-Rohrbacher, Pat, Sara Castagnoli, Adam Przepiorkowski, Hung Le Khanh, Kristian Kankainen, Norton Roman, Mansur Sayhunov. Suggestions were sent through: Mailing Lists: Corpora List, BCS IRSG. LinkedIn Groups: Corpus Linguistics, Computational Linguistics, Natural Language Processing, Applied linguistics, Terminology Services. Hope this list of corpora is useful for everybody working with multi- and cross-linguality. Please … Read entire article »

Filed under: dissemination, references, summaries

Distributional Semantics applied to Flickr® Tags

Upcoming Publications MARIANNA BOLOGNESI, International Center for Intercultural Exchange Distributional Semantics meets Embodied Cognition: Flickr® as a database of semantic features Selected Papers from the 4th UK Cognitive Linguistics Conference (in press) Distributional models such as Latent Semantic Analysis (LSA, Landauer, Dumais 1997) generate semantic spaces based on words’ co-occurrences in linguistic contexts. The semantic representations that emerge from these models are based on solely linguistic information, leaving aside the information that we retrieve from perceptual experiences. The analysis proposed applies the methods of distributional semantics to Flickr®, a corpus of images enhanced with metadata (tags), expressing a wide range of concepts, including perceptual features triggered by the experiences captured in the photographs. A case study on the domain of colors shows how a distributional analysis based on Flickr® can produce semantic … Read entire article »
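To make the idea of a distributional analysis over tags more concrete, here is a minimal sketch. It is not the method used in the paper and it is not LSA (there is no dimensionality reduction): it only counts which tags co-occur on the same photo and compares the resulting count vectors with cosine similarity. The toy photo list and the function names `cooccurrence_vectors` and `cosine` are assumptions made up for illustration.

```python
from collections import Counter
from itertools import permutations
from math import sqrt

# Hypothetical toy data: each set stands for the tags attached to one photo.
photos = [
    {"red", "rose", "flower"},
    {"red", "sunset", "sky"},
    {"orange", "sunset", "sky"},
    {"orange", "flower", "rose"},
    {"blue", "sky", "sea"},
]


def cooccurrence_vectors(tag_sets):
    """Map each tag to a Counter of the tags it shares a photo with."""
    vectors = {}
    for tags in tag_sets:
        # Every ordered pair of distinct tags on the same photo counts once.
        for target, context in permutations(tags, 2):
            vectors.setdefault(target, Counter())[context] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


vectors = cooccurrence_vectors(photos)
sim_red_orange = cosine(vectors["red"], vectors["orange"])
sim_red_blue = cosine(vectors["red"], vectors["blue"])
```

In this toy data "red" and "orange" appear with the same neighbouring tags, so their vectors end up closer to each other than "red" is to "blue". A real study would of course need far more data and a weighting scheme.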

Filed under: dissemination, reading suggestions, references

Towards a Safer Web (with Language Technology)

Last Updated: 25 June 2013 On 18 June 2013, I attended an interesting conference on cybersecurity. The conference was held in one of the conference rooms at the Police Academy in Rome*. The title of the conference was “Critical Infrastructure Protection – Telecommunications”** and Italian was the working language. The conference was organized by the I.C.S.A Foundation (Intelligence Culture and Strategic Analysis). Those who can understand Italian can read the press release. As you can imagine, there were many people working for the Police and Defence Departments, but also people coming from industry and academia. I attended this conference because, in my opinion, Language Technology (LT) can help cybersecurity in many ways. We are currently thinking of an LT project, SafeWEB, whose aim is to detect threatening, mischievous and treacherous … Read entire article »

Filed under: discussions, dissemination, reports

Dissemination: A cross-domain analysis of task and genre effects on perceptions of usefulness (2012)

A cross-domain analysis of task and genre effects on perceptions of usefulness by Luanne Freund, University of British Columbia, Vancouver, Canada Information Processing & Management, In Press, Available online 30 October 2012   Abstract Search systems are limited by their inability to distinguish between information that is on topic and information that is useful, i.e. suitable and applicable to the tasks at hand. This paper presents the results of two studies that examine a possible approach to identifying more useful documents through the relationships between searchers’ tasks and the document genres in the collection. A questionnaire and an experimental user study conducted in two domains, provide evidence that perceptions of usefulness are dependent upon information task type, document genre, and the relationship between these two factors. Expertise is also found to have an effect on … Read entire article »

Filed under: dissemination

Dissemination: Cross-Genre and Cross-Domain Detection of Semantic Uncertainty (2012)

Cross-Genre and Cross-Domain Detection of Semantic Uncertainty György Szarvas, Veronika Vincze, Richárd Farkas, György Móra, Iryna Gurevych* Computational Linguistics, June 2012, Vol. 38, No. 2, Pages 335-367   Uncertainty is an important linguistic phenomenon that is relevant in various Natural Language Processing applications, in diverse genres from medical to community generated, newswire or scientific discourse, and domains from science to humanities. The semantic uncertainty of a proposition can be identified in most cases by using a finite dictionary (i.e., lexical cues) and the key steps of uncertainty detection in an application include the steps of locating the (genre- and domain-specific) lexical cues, disambiguating them, and linking them with the units of interest for the particular application (e.g., identified events in information extraction). In this study, we focus on the genre and domain differences of … Read entire article »

Filed under: dissemination

Dissemination: Stable Classification of Text Genres (2011)

Stable Classification of Text Genres Philipp Petrenz and Bonnie Webber (University of Edinburgh) Computational Linguistics, June 2011, Vol. 37, No. 2, Pages 385-393 Abstract Every text has at least one topic and at least one genre. Evidence for a text’s topic and genre comes, in part, from its lexical and syntactic features—features used in both Automatic Topic Classification and Automatic Genre Classification (AGC). Because an ideal AGC system should be stable in the face of changes in topic distribution, we assess five previously published AGC methods with respect to both performance on the same topic–genre distribution on which they were trained and stability of that performance across changes in topic–genre distribution. Our experiments lead us to conclude that (1) stability in the face of changing topical distributions should be added to the evaluation criteria … Read entire article »

Filed under: dissemination

Thesis Review: Cross-Language Ontology Learning

Hjelm, Hans (2009) Cross-language Ontology Learning. Incorporating and Exploiting Cross-language Data in the Ontology Learning Process. Academic dissertation for the Degree of Doctor of Philosophy in Computational Linguistics at Stockholm University, 2009. Review by Marina Santini The PhD thesis “Cross-language Ontology Learning” presents a framework for automating cross-language ontology creation systems and suggests a setting in which cross-language data can be profitably integrated. The high-level task is to computerize the acquisition of semantic knowledge. In Information Science, ontology is “a way of representing knowledge or structuring the terminology within a domain” (p. 14). The thesis focuses on the learning of domain ontologies and limits its scope to studying is-a hierarchies. Ontology learning is the automated acquisition of a domain terminology from raw natural language texts. That is, “given a collection of … Read entire article »

Filed under: dissemination, reviews

Towards a Computational Theory of Digital Genre (I): Working Definition of Genres for Computational Purposes

Towards a Computational Theory of Digital Genre (I): Working Definition of Genres for Computational Purposes by Marina Santini – Last Updated: 29 Oct 2012 1. What is a (textual) genre? • A genre is a class of texts with similar communicative, textual and linguistic features. 2. What characterizes a genre? A genre: • Must have a name • Must be recognized within a community • Must be produced or retrieved during a task • Must have conventions • Must raise expectations • Can change over time. It is a cultural artifact (culture here includes society, media, technology, etc.) 3. What characterizes a digital genre? • The same characteristics listed above. • A digital genre is any kind of genre that has a digital form, such as emails, chats, online academic papers, online newspaper articles, blogs… • A digital genre can be any paper genre … Read entire article »

Filed under: dialectic, discussions, dissemination, reflections

Dissemination: Multi-Labeling Web Pages by Genre

Excerpts from: Chaker Jebari. MLICC: A Multi-Label and Incremental Centroid-Based Classification of Web Pages by Genre. NLDB 2012: 183-190. For the full version, please contact: Evaluation Corpus In our approach we used the corpus MGC. This corpus was gathered from the internet and consists of 1539 English web pages classified into 20 genres as shown in the following table. In this corpus each web page was assigned by labelers to primary, secondary and final genres. Among the 1539 web pages, 1059 are labeled with one genre, 438 with two genres, 39 with three genres and 3 with four genres. It is clear from the following table that the corpus MGC is unbalanced, meaning that the web pages are not equally distributed among the genres. … Read entire article »
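As a quick sanity check on the figures quoted in this excerpt, the short sketch below adds up the page counts and derives two simple summary numbers: the share of pages with more than one genre, and the average number of genre labels per page (often called label cardinality in multi-label classification). The variable names are mine, and the derived figures are my own arithmetic from the quoted counts, not numbers reported in the paper.

```python
# Pages in the MGC corpus grouped by how many genre labels each page carries,
# using the counts quoted in the excerpt.
pages_by_label_count = {1: 1059, 2: 438, 3: 39, 4: 3}

# Total number of pages: should match the 1539 stated in the excerpt.
total_pages = sum(pages_by_label_count.values())

# Total number of (page, genre) assignments across the corpus.
total_labels = sum(k * n for k, n in pages_by_label_count.items())

# Pages that carry more than one genre label.
multi_label_pages = total_pages - pages_by_label_count[1]

# Average labels per page, and the fraction of multi-label pages.
label_cardinality = total_labels / total_pages
multi_label_share = multi_label_pages / total_pages

print(f"pages: {total_pages}, labels: {total_labels}")
print(f"label cardinality: {label_cardinality:.3f}")
print(f"multi-label share: {multi_label_share:.1%}")
```

The four counts do sum to 1539, and roughly three pages in ten carry more than one genre, which is why a multi-label approach makes sense for this corpus.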

Filed under: dissemination, reading suggestions

Reblogging: A freely available, open source taxonomy and autoclassification tool

Clade – a freely available, open source taxonomy and autoclassification tool by Charlie Hull at Flax. One way to manage digital information is to classify it into a series of categories or a hierarchical taxonomy, and traditionally this was done manually by analysts, who would examine each new document and decide where it should fit. Building and maintaining taxonomies can also be labour intensive, as these will change over time (for a simple example, just consider how political parties change and divide, with factions appearing and disappearing). Search engine technology can be used to automate this classification process and the taxonomy information used as metadata, so that search results can be easily filtered by category, or automatically delivered to those interested in a particular area of the hierarchy. We’ve been working on an … Read entire article »

Filed under: dissemination, reblogging