Dissemination: What sampling size?

Dear All, I paste here an interesting discussion I read on Corpora List some days ago. I think the issue of corpus size is relevant to many of us. Here is the discussion in its entirety: %— start Daniel Elmiger 9 Aug (5 days ago) to corpora Hello, In large corpora, it is very often impossible to analyse every single occurrence of a given phenomenon. Therefore, one often needs to reduce the amount of data via (random) sampling in order to have a more qualitative look at large quantities of data. I've seen several times that samples of 200 occurrences/examples/tokens are chosen, each of which is then individually examined. An early example of this approach is Jennifer Coates' study "The Semantics of the Modal Auxiliaries" (1983). Does anybody know if this kind of sampling has inherent

Filed under: featured, queries
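
As a rough illustration of the sampling practice raised in the Corpora List discussion above, here is a minimal Python sketch that draws a simple random sample of 200 occurrences from a larger list of hits. The function name `sample_occurrences`, the fixed seed, and the made-up `hits` list are assumptions for illustration only; nothing here comes from the original discussion beyond the figure of 200.

```python
import random


def sample_occurrences(occurrences, k=200, seed=0):
    """Return a simple random sample of up to k occurrences, without replacement.

    The seed is fixed so that the same sample can be drawn again later,
    which helps when the examples are examined by hand.
    """
    rng = random.Random(seed)
    items = list(occurrences)
    if len(items) <= k:
        # Fewer hits than the requested sample size: keep everything.
        return items
    return rng.sample(items, k)


# Hypothetical input: 10,000 concordance hits for some phenomenon.
hits = [f"hit-{i}" for i in range(10_000)]
sample = sample_occurrences(hits, k=200, seed=42)
print(len(sample))
```

Sampling without replacement means no occurrence is examined twice. Whether 200 is large enough is exactly the open question of the post, and this sketch does not answer it.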

Papageno: Predictive Models for Crisis Intelligence

Last Updated Comments: 22 July 2013 Papageno: A Pilot Study to identify suitable Predictive Models for Crisis Intelligence I need some help to jot down real-world use cases for crisis intelligence. Could you please point me to past events or previous experiences that could be useful for a pilot study? "Crisis intelligence" is a new research area that is becoming more and more crucial in medium-large organizations and companies. It consists of detecting an upcoming "crisis" (a scandal or general dissatisfaction or any negative attitude) by automatically analysing text documents of any kind in electronic format. Many commercial and open source solutions are proposed to identify the "mood" and the sentiment of masses with respect to a certain event, brand, or person through tweets, blogs, etc. But very little research has been carried

Filed under: queries, requests

Requests for proposal (RFP) and IR

Last Updated: 1st May 2013 I am looking for a list of functions and features buyers may use in their request for proposal (RFP) to help them acquire an enterprise search/IR platform. Any experience to share about this topic? Any reference that can help analyze this problem in depth? Thanks in advance.

Filed under: queries, requests

Towards a Cross-Lingual Lexical Knowledge Base of Lexical Forms

Last updated: 15 May 2013 How do you overcome problems related to cross-linguality? My specific problem at the moment is caused by the poor coverage of everyday language in lexical resources. For instance, the Swedish single-word expression /egenremiss/ (14,900 hits, April 2013) – or alternatively, as a multiword expression (MWE), /egen remiss/ (8,210 hits, April 2013) – denotes a referral to a specialist doctor written by patients themselves. This expression is made up of two common Swedish words, /egen/ `own (adj)' and /remiss/ `referral'. It is a recent expression (probably coined around 2010*) and is not yet recorded in any official dictionary, in Wiktionary, or in other multilingual online lexical resources. This compound happens to be very frequent in query logs belonging to a Swedish public health service website.

Filed under: discussions, featured, queries, reflections, requests

Question: How to Define Criteria for Subgenre Classification?

I had an interesting email exchange with Christophe Clugston, a researcher currently located in Thailand, about the classification of a specific subgenre belonging to the Netadvertising supergenre. He says: "I am looking at classifying a very narrow sub genre. Within the domain of Netvertising I am looking at an extant, variant genre that I am terming Long Scroll Web Advertisements (as the off line version is termed Long Copy Advertising). This type of advertising is very different than the multi media image tied to a few words or few clauses. It is based entirely on the factor of extended reading (some of these ads are over 24 pages when printed). I have enclosed a link to one type of ad in this category. At present I am looking only at self defense

Filed under: discussions, queries, reflections, requests

Actionable Corpus & Actionable Intelligence

I am trying to figure out how to predict future trends independently from entities. For example, instead of trying to guess who (Obama and Romney are two entities) will win the next American elections, I would like to predict the trend representing Americans' confidence in a better US economy in years 2012-2017. This is just an example that simplifies my purpose, and it has nothing to do with my actual data. I would like to start this exploration with predictive methods using the ENRON email dataset. I would like to predict, from this huge email corpus (UNSTRUCTURED BIG DATA), whether and when (a point in the past) the ENRON SCANDAL could be expected to happen. The ENRON email dataset will be the "actionable corpus" that will be used to experiment on non-entity-based predictions. An

Filed under: queries, requests

Text/Content Analytics for Suicide Prevention (I)

An interesting topic has been brought to my attention almost simultaneously by two friends working in very different areas (namely by a linguist and a psychiatrist): the language of suicides. My mind has immediately converted their differing perspectives on the topic into a shared research issue: ***is it possible to use corpus-based automated language analysis, content analysis, style analysis, genre analysis and discourse analysis (i.e. in short Text/Content Analytics) to identify and prevent suicidal prospects?*** I have already found interesting links on the web (see breakdown below). Can anybody help find additional relevant material? Thanks, Marina ****Material for Corpus/Collection creation: suicide letters/notes in English try to contact: interesting blogs: "love letters and suicide notes": ****Previous language analyses: Thesis 2011: The language of suicide notes: ****Books on sentiment and discourse analyses: Synthesis Lectures on Human Language Technologies May 2012, 167 pages, Bing

Filed under: queries, requests

How to construct a taxonomy of user’s interests automatically?

What is the best way of constructing a taxonomy to classify **user's interests** from unstructured social data that internet users have input? Have you tried with commercial products? If so, what is your experience with them? Have you tried with an ad-hoc algorithm? If so, which approach would you recommend? Do you know any existing taxonomy of user's interests? Thanks in advance for your answers. Cheers, Marina

Filed under: queries

Question: How do you inject emotions into chatbots, Andres?

Post signed by: ANDRES TOMÁS HOHENDAHL, NLP Researcher Injecting emotions is a rather complicated task, so I started by using lots of heuristics, classifying the user inputs and the bot answers, trying to figure out emotional communication and frustration states, as the emotional state is a result of several ongoing acts, and somehow measures or emotional readouts of the result of those interactive acts. In other words, if the bot gets a question and knows how to answer it, he gets happier; if the user denies or objects to the answer, he gets a little frustrated; if the user says nonsense (not understandable things), the bot gets curious and asks further; if the user employs harsh terms, he might go into a defensive-first, angry-after state; all was based on the

Filed under: collaborative blogging, queries, signed posts
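
To make the heuristics in Andres's post a little more concrete, here is a minimal Python sketch of an emotional state that is updated after each exchange. It is only an illustration under stated assumptions, not his actual system: the class `EmotionState`, the function `update`, the event labels, the integer counters and the tiny `HARSH_TERMS` list are all hypothetical names and values invented for this example.

```python
from dataclasses import dataclass

# Hypothetical word list; a real system would need a far richer lexicon.
HARSH_TERMS = {"stupid", "idiot", "useless"}


@dataclass
class EmotionState:
    happiness: int = 0
    frustration: int = 0
    curiosity: int = 0
    hostility: int = 0  # 0 = calm, 1 = defensive, 2 or more = angry

    def mood(self):
        """Summarise the counters as a single mood label."""
        if self.hostility >= 2:
            return "angry"
        if self.hostility == 1:
            return "defensive"
        scores = {
            "happy": self.happiness,
            "frustrated": self.frustration,
            "curious": self.curiosity,
        }
        best = max(scores, key=scores.get)
        return best if scores[best] > 0 else "neutral"


def update(state, event, text=""):
    """Update the state after one exchange.

    event is one of the assumed labels "answered", "objected" or "nonsense".
    Harsh terms in the user's text take priority over the event label.
    """
    words = set(text.lower().split())
    if words & HARSH_TERMS:
        state.hostility += 1      # defensive first, angry if it continues
    elif event == "answered":
        state.happiness += 1      # the bot knew how to answer
    elif event == "objected":
        state.frustration += 1    # the user denied or objected to the answer
    elif event == "nonsense":
        state.curiosity += 1      # the bot should ask for clarification
    return state


state = EmotionState()
update(state, "answered")
print(state.mood())
```

The state accumulates over several exchanges rather than reacting to one message, which follows the idea in the post that the emotional state results from several ongoing acts. How the inputs are classified into events is left out entirely here.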