Spreading the Word about (Web)Genre Research

What is genre? Why is it useful to master genre conventions? Can we classify document genres automatically? Around the world, many researchers and scholars from a wide range of disciplines are trying to answer these and many other questions. Aristotle suggested the first genre classification scheme by dividing literature into Tragedy, Comedy and Lyrics (well, I am oversimplifying…). Aristotle smoothly classified all the knowledge of his time, so arguably classifying genres

Working Definition of Digital Genre (II)

Last Updated: 22 June 2014 – 26 June 2014 – 3 July 2014 (draft in progress). In this blog post (which I will keep updating), I would like to pin down a working definition of digital genre that is appropriate for our computational experiments. The experiments I refer to are those that will be included in the forthcoming book "Computational Theory of Digital Genre", which I announced a while ago. With Michael Oakes and Georgios Paltoglou (both at the University of Wolverhampton, UK), we are setting up experiments focussing on the computational modeling of the concept of digital genre. Since the concept of genre is difficult to define in a simple way, because it inherits all the idiosyncrasies and ambiguities that characterize language and human communication in general, I

Towards a Safer Web (with Language Technology)

Last Updated: 25 June 2013. On 18 June 2013, I attended an interesting conference on cybersecurity. The conference was held in one of the conference rooms at the Police Academy in Rome*. The title of the conference was "Critical Infrastructure Protection – Telecommunications"** and Italian was the working language. The conference was organized by the I.C.S.A. Foundation (Intelligence Culture and Strategic Analysis). Those who can understand Italian can read a press release here: As you can imagine, there were many people working for the Police and Defence Departments, but also people coming from industry and academia. I attended this conference because, in my opinion, Language Technology (LT) can help cybersecurity in many ways. We are currently thinking of an LT project, SafeWEB, whose aim is to detect threatening, mischievous and treacherous

Opinion Retrieval and Ranking: the creeping and ineluctable force of Genre

Last Updated: 27 May 2013. Two fundamental principles that contribute to the definition and characterization of the concept of genre are conventions and expectations. Simply put, in textual (written or spoken) communication, genres are words that connote different types of text. For instance, on the web the home page genre is different from the blog genre; in a company, the minutes genre is different from the white paper genre; in the press the leader genre is different from the letter to the editor genre… Genres have the power of shaping information following rhetorical and discourse patterns that have become conventionalized. Genre conventions are implemented by the writer(s). When acknowledged, genre conventions raise predictable expectations in the readers or more generally in those who "process" a text… Although I am oversimplifying here, broadly speaking

Towards a Cross-Lingual Lexical Knowledge Base of Lexical Forms

Last updated: 15 May 2013. How do you overcome problems related to cross-linguality? My specific problem at the moment is caused by the poor coverage of everyday language in lexical resources. For instance, the Swedish single-word expression /egenremiss/ (14,900 hits, April 2013), or alternatively the multiword expression (MWE) /egen remiss/ (8,210 hits, April 2013), denotes a referral to a specialist doctor written by patients themselves. This expression is made up of two common Swedish words, /egen/ `own (adj)' and /remiss/ `referral'. It is a recent expression (probably coined around 2010*) and is not yet recorded in any official dictionary, in Wiktionary, or in other multilingual online lexical resources. This compound happens to be very frequent in query logs belonging to a Swedish public health service website.

Question: How to Define Criteria for Subgenre Classification?

I had an interesting email exchange with Christophe Clugston, a researcher currently located in Thailand, about the classification of a specific subgenre belonging to the Netadvertising supergenre. He says: "I am looking at classifying a very narrow sub genre. Within the domain of Netvertising I am looking at an extant, variant genre that I am terming Long Scroll Web Advertisements (as the off line version is termed Long Copy Advertising). This type of advertising is very different than the multi media image tied to a few words or few clauses. It is based entirely on the factor of extended reading (some of these ads are over 24 pages when printed). I have enclosed a link to one type of ad in this category At current I am looking only at self defense

Towards a Computational Theory of Digital Genre (I): Working Definition of Genres for Computational Purposes

by Marina Santini – Last Updated: 29 Oct 2012
1. What is a (textual) genre?
• A genre is a class of texts with similar communicative, textual and linguistic features.
2. What characterizes a genre? A genre:
• Must have a name
• Must be recognized within a community
• Must be produced or retrieved during a task
• Must have conventions
• Must raise expectations
• Can change over time. It is a cultural artifact (culture here includes society, media, technology, etc.)
3. What characterizes a digital genre?
• The same characteristics listed above.
• A digital genre is any kind of genre that has a digital form, such as emails, chats, online academic papers, online newspaper articles, blogs…
• A digital genre can be any paper genre

Impact of Sociolinguistics in Opinion Mining Systems

Signed post by Alexander Osherenko, Socioware Development. Full paper: Considering Impact of Sociolinguistic Findings in Believable Opinion Mining Systems. Proceedings of The Fifth International Conference On Cognitive Science. 2012. Kaliningrad, Russia. Opinions are a frequent means of communication in human society, and automatic approaches to opinion mining in texts have therefore attracted much attention. All in all, most approaches apply data mining techniques and extract lexical features (words) as reliable means of classification. It is noteworthy that, although the interest in opinion mining is huge, there are only a few explorations of the words extracted in opinion mining. This study addresses this drawback and elaborates on a sociolinguistic explanation. We hypothesize that an opinion mining system should be trained for classifying opinions in texts of the same language style. Hence, this contribution focuses on the following questions: 1) do sociolinguistic …

Contextify: How to Contextualize Information

Marina Santini. Copyright © 2012. Work in progress: Contextify is a metadata tagger that performs text and content enrichment. Contextify enriches information through text classification and content markup. How can we capture context from a text? I would start with genre, sublanguage, and domain, i.e. three textual dimensions that say something about the communicative context in which a text has been issued:
• A "weird" word like "Spweet" is not a typo if it belongs to a Twitter micropost (genre and sublanguage: tweet spam).
• A "normal" word like "mouse" is a specialized term if it belongs to the computer domain.
Other examples: surfing (sport, internet communication), agile (ordinary word, software), sentence (law, grammar), appeal (ordinary language: "appeal for help", or legal sublanguage: "to lodge an appeal"; genre: newspaper, court act), etc. Context helps disambiguate words and assess the …
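The idea that the same word reads differently depending on genre, sublanguage, or domain can be illustrated with a tiny lookup. This is only a sketch under stated assumptions, not how Contextify is implemented: the table layout, the context labels, the reading strings, and the function name `read_in_context` are all hypothetical, and the entries merely echo the examples in the excerpt.

```python
from typing import Dict, Tuple

# Hypothetical mini-lexicon mapping (word, context label) to a reading.
# The entries echo the examples in the post; the table layout, the labels,
# and the function name are assumptions made for illustration only.
CONTEXT_LEXICON: Dict[Tuple[str, str], str] = {
    ("spweet", "tweet spam"): "sublanguage word, not a typo",
    ("mouse", "computer"): "specialized term",
    ("sentence", "law"): "legal term",
    ("sentence", "grammar"): "grammatical term",
}

# Returned when the lexicon has no entry for the (word, context) pair.
FALLBACK = "no context-specific reading recorded"


def read_in_context(word: str, context: str) -> str:
    """Look up how `word` should be read under the given context label."""
    key = (word.lower(), context.lower())
    return CONTEXT_LEXICON.get(key, FALLBACK)


print(read_in_context("Spweet", "tweet spam"))
print(read_in_context("mouse", "computer"))
print(read_in_context("mouse", "cooking"))
```

A real tagger would presumably infer the context label from the text itself (through classification) instead of receiving it as an argument; the lookup only shows why the label matters once it is known.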

Mining Query Logs: Query Disambiguation & Understanding through a KB

Marina Santini. Copyright © 2012. Work in progress. Talking about query logs, Karlgren (2010) points out: "There are several reasons to be cautious in drawing too far-reaching conclusions: we cannot say for sure what the users were after; [...]". However, some linguistic problems can be sorted out, for example those related to sublanguage, terminology, multi-word expressions, etc. Interestingly, the use of different sublanguages has been studied by Karin Friberg Heppin in her PhD thesis: Resolving Power of Search Keys in MedEval. A Swedish Medical Test collection with User Groups: Doctors and Patients. Karin highlights how patients (laymen) and doctors (experts) use different vocabulary (or terminology) to indicate the same concept. For example, patients might use the word "painkiller" while doctors may prefer the word "analgesic" to refer to the same treatment. Different sublanguages …
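One way to picture how a knowledge base could bridge the lay and expert sublanguages is a simple query expansion step. The sketch below is an assumption made for illustration, not the method described in the post or in the thesis: the table `LAY_TO_EXPERT` and the function `expand_query` are hypothetical names, and only the painkiller/analgesic pair comes from the text.

```python
from typing import Dict, List

# Hypothetical lay-to-expert term table. Only the painkiller/analgesic pair
# comes from the post; a real table would be built from a curated resource.
LAY_TO_EXPERT: Dict[str, str] = {"painkiller": "analgesic"}


def expand_query(query: str) -> List[str]:
    """Return the query tokens, followed by expert equivalents of lay terms."""
    tokens = query.lower().split()
    expanded = list(tokens)
    for token in tokens:
        expert = LAY_TO_EXPERT.get(token)
        # Add the expert term once, and only if the user did not type it.
        if expert is not None and expert not in expanded:
            expanded.append(expert)
    return expanded


print(expand_query("strong painkiller"))
```

Keeping the original tokens and appending the expert term means a patient's query can still match lay documents while also reaching texts written in the doctors' vocabulary.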

