The WebGenre Blog: The power of genre applied to digital information. By Marina Santini » Archive

Looking for Corpora to explore Cross-Linguality

Dear All, I am looking for corpora of any genre in the following languages: English, Swedish, Polish, Italian, Finnish, Estonian, and Hungarian. I am already aware of a number of corpora (several posts in this blog are dedicated to the dissemination of corpora-related information). These corpora are mostly in English. I would now like to focus on: 1) additional languages and 2) additional genres, such as search query logs, TV scripts, emails, tweets, WhatsApp messages, etc. All genres are welcome! The only requirement is that corpora must be free and publicly available, so that everybody can replicate or extend experiments using the same corpora/datasets. The purpose of the experiments is to explore cross-linguality in different settings. Please read the use cases below to get an idea of the type of communicative situations we … Read entire article »

Filed under: featured, requests

Papageno: Predictive Models for Crisis Intelligence

Last Updated Comments: 22 July 2013 Papageno: A Pilot Study to identify suitable Predictive Models for Crisis Intelligence I need some help to jot down real-world use cases for crisis intelligence. Could you please point me to past events or previous experiences that could be useful for a pilot study? “Crisis intelligence” is a new research area that is becoming more and more crucial in medium-to-large organizations and companies. It consists of detecting an upcoming “crisis” (a scandal, general dissatisfaction, or any negative attitude) by automatically analysing electronic text documents of any kind. Many commercial and open source solutions have been proposed to identify the “mood” and the sentiment of the masses with respect to a certain event, brand, or person through tweets, blogs, etc. But very little research has been carried … Read entire article »

Filed under: queries, requests

Requests for proposal (RFP) and IR

Last Updated: 1st May 2013 I am looking for a list of functions and features buyers may use in their request for proposal (RFP) to help them acquire an enterprise search/IR platform. Any experience to share about this topic? Any reference that can help analyze this problem in depth? Thanks in advance. … Read entire article »

Filed under: queries, requests

Towards a Cross-Lingual Lexical Knowledge Base of Lexical Forms

Last updated: 15 May 2013 How do you overcome problems related to cross-linguality? My specific problem at the moment is caused by the poor coverage of everyday language in lexical resources. For instance, the Swedish single-word expression /egenremiss/ (14,900 hits, April 2013) – or, alternatively, the multiword expression (MWE) /egen remiss/ (8,210 hits, April 2013) – denotes a referral to a specialist doctor written by patients themselves. This expression is made up of two common Swedish words, /egen/ ‘own (adj)’ and /remiss/ ‘referral’. It is a recent expression (probably coined around 2010*) and is not yet recorded in any official dictionary, in Wiktionary, or in other multilingual online lexical resources. This compound happens to be very frequent in query logs belonging to a Swedish public health service website. … Read entire article »
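As a rough illustration of the two spellings discussed in this excerpt, the following Python sketch counts the closed-compound form and the spaced (MWE) form of one expression in a list of queries. It is only a minimal sketch under stated assumptions: the function name `variant_counts` and the sample queries are hypothetical and invented for illustration, they are not taken from the query logs mentioned in the post, and simple regular-expression matching stands in for whatever method the full article uses.

```python
import re
from collections import Counter


def variant_counts(queries, parts):
    """Count the closed-compound and spaced spellings of one expression.

    `parts` lists the component words, e.g. ["egen", "remiss"].
    Hypothetical helper, written only for this illustration.
    """
    compound = "".join(parts)
    spaced = " ".join(parts)
    # \b avoids matches inside longer words; \s+ tolerates repeated spaces.
    patterns = {
        compound: re.compile(r"\b" + re.escape(compound) + r"\b", re.IGNORECASE),
        spaced: re.compile(
            r"\b" + r"\s+".join(re.escape(p) for p in parts) + r"\b",
            re.IGNORECASE,
        ),
    }
    counts = Counter()
    for query in queries:
        for form, pattern in patterns.items():
            counts[form] += len(pattern.findall(query))
    return counts


# Hypothetical queries, invented for illustration only.
sample_queries = [
    "egenremiss ortoped",
    "skriva egen remiss",
    "Egenremiss blankett",
    "remiss till hudläkare",
]
counts = variant_counts(sample_queries, ["egen", "remiss"])
print(counts)
```

Treating both spellings as variants of one lexical form is a design choice made here for illustration; a real lexical knowledge base would also need to handle inflected forms, which this sketch ignores.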

Filed under: discussions, featured, queries, reflections, requests

Request: Corpus-Based Sublanguage Glossary

How to automatically build a glossary of pairs in the form: specialized term = common word? Dear all, I wonder if you have any experience or if you can provide references on how to automatically build a glossary from genre-specific corpora. The glossary should be made of pairs in the form of: sublanguage term = common/familiar word. For instance: anemi = blood deficiency; analgesic = painkiller; etc. Thanks in advance for suggestions and pointers. Marina … Read entire article »

Filed under: requests

Question: How to Define Criteria for Subgenre Classification?

I had an interesting email exchange with Christophe Clugston, a researcher currently located in Thailand, about the classification of a specific subgenre belonging to the Netadvertising supergenre. He says: “I am looking at classifying a very narrow sub genre. Within the domain of Netvertising I am looking at an extant, variant genre that I am terming Long Scroll Web Advertisements (as the off line version is termed Long Copy Advertising). This type of advertising is very different than the multi media image tied to a few words or few clauses. It is based entirely on the factor of extended reading (some of these ads are over 24 pages when printed). I have enclosed a link to one type of ad in this category. At current I am looking only at self defense … Read entire article »

Filed under: discussions, queries, reflections, requests

Actionable Corpus & Actionable Intelligence

I am trying to figure out how to predict future trends independently from entities. For example, instead of trying to guess who (Obama and Romney are two entities) will win the next American elections, I would like to predict the trend representing Americans’ confidence in a better US economy in the years 2012-2017. This is just an example that simplifies my purpose, and it has nothing to do with my actual data. I would like to start this exploration of predictive methods with the ENRON email dataset. I would like to predict – from this huge email corpus (UNSTRUCTURED BIG DATA) – whether and when (at a point in the past) the ENRON SCANDAL could have been expected to happen. The ENRON email dataset will be the “actionable corpus” that will be used to experiment on non-entity-based predictions. An … Read entire article »

Filed under: queries, requests

Request: Looking for Multi-Dimensional Social Network Datasets/Corpora/Collections

Is anyone aware of multi-dimensional social network datasets/corpora/collections where friendships are based on several attributes? For example, A is friends with B because they are co-authors. Or, A is friends with C because they play badminton. Generally, Facebook-based datasets only record that A is related to B because A and B are friends on Facebook. We are currently looking for more complex relations. We are aware of the following resources and studies: 1) Facebook Project 2) The Facebook datasets, described in K. Lewis, J. Kaufman, M. Gonzalez, A. Wimmer, and N.A. Christakis, “Tastes, Ties, and Time: A New (Cultural, Multiplex, and Longitudinal) Social Network Dataset Using,” Social Networks 30(4): 330-342 (October 2008), ARE STILL OFFLINE 3) J.H. Fowler and N.A. Christakis, “Cooperative Behavior Cascades in Human Social Networks” PNAS: … Read entire article »
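To make the kind of data requested here more concrete, the following Python sketch shows one possible way to represent a network in which the same pair of people can be linked by several named relations. It is a minimal sketch under stated assumptions: the class `MultiplexNetwork`, its method names, and the relation labels are hypothetical and chosen only for illustration; they do not describe the format of any of the datasets listed in the post.

```python
from collections import defaultdict


class MultiplexNetwork:
    """Undirected network whose ties carry one or more relation labels.

    Hypothetical structure, for illustration only.
    """

    def __init__(self):
        # Maps an unordered pair of people to the set of relations linking them.
        self._ties = defaultdict(set)

    def add_tie(self, a, b, relation):
        """Record that a and b are linked by the given relation."""
        self._ties[frozenset((a, b))].add(relation)

    def relations(self, a, b):
        """Return all relation labels linking a and b (empty set if none)."""
        return set(self._ties.get(frozenset((a, b)), set()))

    def neighbours(self, person, relation=None):
        """Return people tied to `person`, optionally via one relation only."""
        result = set()
        for pair, labels in self._ties.items():
            if person in pair and (relation is None or relation in labels):
                result |= pair - {person}
        return result


net = MultiplexNetwork()
net.add_tie("A", "B", "co-author")
net.add_tie("A", "C", "badminton")
net.add_tie("A", "B", "facebook")

print(net.relations("A", "B"))
print(net.neighbours("A", relation="badminton"))
```

A dataset that records only Facebook friendship corresponds to the special case where every tie carries the single label "facebook"; the request concerns data where the label sets are richer.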

Filed under: requests

Text/Content Analytics for Suicide Prevention (I)

An interesting topic has been brought to my attention almost simultaneously by two friends working in very different areas (namely by a linguist and a psychiatrist): the language of suicides. My mind has immediately converted their differing perspectives on the topic into a shared research issue: ***is it possible to use corpus-based automated language analysis, content analysis, style analysis, genre analysis and discourse analysis (i.e. in short Text/Content Analytics) to identify and prevent suicidal prospects?*** I have already found interesting links on the web (see breakdown below). Can anybody help me find additional relevant material? Thanks, Marina ****Material for Corpus/Collection creation: suicide letters/notes in English try to contact: interesting blogs: “love letters and suicide notes”: ****Previous language analyses: Thesis 2011: The language of suicide notes: ****Books on sentiment and discourse analyses: Synthesis Lectures on Human Language Technologies May 2012, 167 pages, Bing … Read entire article »

Filed under: queries, requests

Applying Findability to Mine Query Logs for BI: Preliminaries

Marina Santini. Copyright © 2012 Thanks for sharing pointers and for giving hints on the question: “Can anyone suggest references about mining query logs for BI and CEM?” Please feel free to add comments to the blog post if more suggestions come to mind. The question of this week is: “How can I profitably use query logs to make better business decisions and predict future trends?” Citing from (Rud, Olivia (2009). Business Intelligence Success Factors: Tools for Aligning Your Business in the Global Economy. Hoboken, N.J: Wiley & Sons. ISBN 978-0-470-39240-9.), Wikipedia states: “Business intelligence (BI) is defined as the ability for an organization to take all its capabilities and convert them into knowledge, ultimately, getting the right information to the right people, at the right time, via the right channel. This produces large amounts … Read entire article »

Filed under: discussions, featured, reflections, requests