How can we convert massive quantities of unstructured data to structured information? What kind of “structure” do we need for a reliable interpretation of this undomesticated data? I suggest thinking of a text-analytic framework based on “context”.
Search keywords, events, entities, sentiments, attitudes, polarities, opinions, etc. carry different weight and require different assessment depending on the kind of text, the situational context, the field of discussion, and the authority of the source, as well as on the purpose of use. For official use, for example, factual texts might have more credibility than opinionated texts. In this respect, press conferences, declarations or announcements by a White House spokesman might be more reliable than newspaper speculation or op-ed articles. Conversely, if we want to take the pulse and explore feelings about a product or a politician, we might give more weight to blogs, forums or microposts on social networks.
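This purpose-dependent weighting of sources could be sketched in a few lines of code. The following is a minimal illustration, not an implementation of any particular tool: the purpose labels, genre labels and weight values are all hypothetical.

```python
# Minimal sketch: weight a document's evidence by its genre, relative to a
# purpose of use. All purposes, genres and weight values are hypothetical.
GENRE_WEIGHTS = {
    "official_reporting": {      # factual genres count more
        "press_conference": 0.9,
        "official_announcement": 0.9,
        "op_ed": 0.4,
        "micropost": 0.2,
    },
    "sentiment_monitoring": {    # opinionated, informal genres count more
        "press_conference": 0.3,
        "official_announcement": 0.3,
        "op_ed": 0.6,
        "micropost": 0.9,
    },
}

def weight_for(purpose: str, genre: str, default: float = 0.5) -> float:
    """Return the credibility weight of a genre for a given purpose."""
    return GENRE_WEIGHTS.get(purpose, {}).get(genre, default)

# For official use, a press conference outweighs a micropost...
assert weight_for("official_reporting", "press_conference") > \
       weight_for("official_reporting", "micropost")
# ...while for sentiment monitoring the ranking is reversed.
assert weight_for("sentiment_monitoring", "micropost") > \
       weight_for("sentiment_monitoring", "press_conference")
```

The point of the sketch is only that the same genre can receive different weights under different purposes; real systems would learn or calibrate such weights rather than hard-code them.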
I claim that genre, sublanguage and domain are invaluable but unexploited textual dimensions that can be used to extract the communicative context (and descriptive metadata) needed to assess the value of information with respect to a purpose (business, learning, findability, monitoring, predicting, etc.).
In my view:
- Genre gives us the compositional context. When we know the genre of a document, we know how its content is organized and where to find the most important information. For instance, when the genre of a digital text on the web is unknown or not declared explicitly, users often feel at a loss and do not know how to assess how reliable, objective or useful the information is. The same is true within business intelligence, customer care optimization, and many other practical applications.
- Sublanguage provides a situational context influenced by the medium of communication (e.g. telephone, face-to-face, chat, video-conferencing, microblogging, etc.). Sublanguage has nothing to do with terminology (specialized words, aka terms, used in a specialized domain); sublanguage is not register (e.g. cues of formality, casual conversation, etc.); sublanguage is not style. Think of the sublanguage characterizing tweets and the sublanguage used in customer care help centers or chats: they can be enormously different, though they might all be informal, conversational and polite. Sublanguage is formulaic, cross-topical and mostly domain-independent. For instance, the sublanguage used in a car rental help center is similar to the sublanguage used in a first-aid call center. In both cases, there will be a salutation (e.g. "Good morning"), investigation (e.g. "How can I help you?", "When did this happen?", "Where are you now?"), personal detail requests (e.g. "What is your name?"), and the like.
- Domain refers to a field of interest or to a subject matter. It can be medicine, politics, marketing, literary criticism, etc. A domain can have a specific terminology.
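Because sublanguage is formulaic, its recurrent "moves" (salutation, investigation, personal detail requests) can in principle be detected with simple surface patterns. The sketch below illustrates this idea on help-center utterances; the move labels and the regular expressions are illustrative assumptions, not an inventory from the original text.

```python
import re

# Minimal sketch: detect formulaic sublanguage "moves" in a help-center
# utterance. Move names and patterns are illustrative, not exhaustive.
MOVE_PATTERNS = {
    "salutation": re.compile(
        r"\b(good (morning|afternoon|evening)|hello|hi)\b", re.IGNORECASE),
    "investigation": re.compile(
        r"\b(how can i help|when did this happen|where are you)\b", re.IGNORECASE),
    "personal_details": re.compile(
        r"\b(what is your name|your (phone|account) number)\b", re.IGNORECASE),
}

def tag_moves(utterance: str) -> list[str]:
    """Return the sublanguage moves whose pattern matches the utterance."""
    return [move for move, pattern in MOVE_PATTERNS.items()
            if pattern.search(utterance)]

print(tag_moves("Good morning! How can I help you?"))
# → ['salutation', 'investigation']
```

The same pattern set would fire on a car rental call and on a first-aid call alike, which is precisely the cross-topical, domain-independent character of sublanguage claimed above.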
Together, these three textual dimensions help contextualize big unstructured data. Essentially, this means that linguistic pre-processing, textual analysis and an understanding of how human communication works on different media and across different types of text have a strong bearing on the quality, reliability and usability of "big data".
Which text-analytic tools can presently contextualize information? My guess is that we should start working toward the next generation of context-aware analytic tools.
Marina Santini, Copyright © 2012, All rights reserved.