Text Analytics and Genre Awareness to the rescue of Business Intelligence (BI) & Customer Experience Management (CEM)
by Marina Santini, Copyright © 2012, All rights reserved.
Citation: Santini, Marina (2012). Text Analytics and Genre Awareness to the rescue of Business Intelligence (BI) & Customer Experience Management (CEM). [White paper]. Retrieved from http://www.forum.santini.se/2012/02/white-paper-text-analytics-and-genre-awareness-to-the-rescue-of-bi-cem/
Business Intelligence and Customer Experience Management
Generally speaking, business intelligence (BI) refers to computer-based techniques used to analyze business data, such as sales revenues, relational database reports, etc., mainly through standard statistical packages. BI’s main aims are to support better business decision-making and planning. Many statistical packages are available to make sense of numbers and structured data, e.g. SAS, SPSS, COGNOS, etc.
However, businesses also create a huge amount of valuable information in the form of emails, memos, notes from call-centers, news, user groups, chats, reports, tweets, Facebook pages, blogs, forums, marketing material and so on. More than 80% of all business information exists in these forms (source) or, I would say, “genres”. These genres, i.e. these types of text, contain strategic but often unexploited or underexploited information. The information contained in these web genres or digital genres is called unstructured data. The exploitation of unstructured data is recognized as a major unsolved problem in the information technology industry, one that causes huge economic losses and poor decision-making.
Similar problems arise with data coming from Customer Relations and Customer Experience Management (CEM), especially now that many companies have decided to enhance user engagement and loyalty, or to track feedback, through tweets and Facebook microblogging. One example for all is the winning strategy adopted by American Airlines for its loyalty program, AAdvantage, which uses both Twitter accounts and Facebook pages to communicate with customers.
Text-based techniques (called data mining, text mining, text analytics, social media analytics, social media mining, topic detection, content discovery, content analytics and so forth) are being used to cope with this avalanche of unstructured data. Let’s use the term “text analytics” here as an umbrella term covering all these techniques.
Many commercial text analytics solutions (a short list is here) help clients and customers make educated decisions based on unstructured data analysis by extracting textual insights and organizing results in categories that are meaningful to specific business contexts.
The common tasks of a text-analytics effort typically include:
- Creation of a corpus, i.e. collecting or identifying a set of textual materials, on the web or held in a file system, database, or content management system, for analysis.
Extraction of one, some or all of the following information:
- Named Entity Recognition
- Features, such as telephone numbers, e-mail addresses, quantities (with units), etc.
- Coreference, i.e. the identification of noun phrases and other terms that refer to the same object.
- Relationship, fact, and event extraction: identification of associations among entities and other information in text
- Sentiment analysis, i.e. the identification of subjective or factual text and the extraction of various forms of attitudinal information: affect, sentiment, opinion, mood, and emotion.
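To make the "Features" item in the list above more concrete, here is a minimal sketch of pattern-based extraction of e-mail addresses, telephone numbers and quantities with units. The function name `extract_features`, the three regular expressions and the short list of units are my own illustrative assumptions; they are deliberately simplistic and would need tuning for any real corpus and genre.

```python
import re

# Illustrative patterns only (assumptions): real-world data needs stricter,
# locale-aware and genre-aware rules.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")
# A number followed by one unit from a small, hypothetical unit list.
QUANTITY_RE = re.compile(
    r"\b(\d+(?:\.\d+)?)\s?(kg|g|km|m|cm|mm|GB|MB|%)(?![A-Za-z])"
)


def extract_features(text):
    """Return the e-mail addresses, phone numbers and quantities found in text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
        # Each quantity is a (number, unit) pair, both as strings.
        "quantities": QUANTITY_RE.findall(text),
    }


if __name__ == "__main__":
    note = "Call +1 555-123-4567 or write to help@example.com about the 20 kg parcel."
    print(extract_features(note))
```

Hand-written patterns like these tend to be adequate for short, regular genres such as FAQs, while noisier genres such as chats usually call for more tolerant rules.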
Current text analytics solutions are indeed a valuable step forward. However, there are still many problems when dealing with unstructured data. For this reason, results are often disappointing, and substantial adjustments and lots of heuristics are often needed to make sense of the data.
How genre can help
The concept of genre (a.k.a. text type, type of text, text typology, type of document, etc.) is widely acknowledged in many disciplines (Corpus Linguistics, Computational Linguistics, Language Technology, NLP, Information Retrieval, Information Extraction, Sentiment Analysis, etc.) because information is organized in a different way according to the different types of texts. Essentially, the genre of a document has a bearing on the identification of relevant content. For instance, short texts like FAQs contain short sentences and simple syntax, so it is easy to extract the main information. Chats might be full of ellipses or typos and rely on the situational context, so they can be more difficult to handle. Longer texts like emails follow a more formal content organization, where the “core content” might be preceded by salutations, a short introduction and/or additional elements. FAQs, chats and emails are different genres with their own specificities that need to be taken into account when processing unstructured data.
A couple of genre-aware solutions for text analytics
1) Data Collection and Preparation (a.k.a. Corpus Construction, Corpus Cleaning, and Representativeness): common technical challenges include the development of methods to streamline data management and reuse (cleaning, documentation, standardization). Additional techniques should be worked out for corpus profiling. A primary, often missing, requirement is the addition of genre metadata. The genre of a document affects the choice of the methodology to be applied for the identification and extraction of content, topic, sentiment, emotion, etc. For example, if you have a small corpus of short genres (say from 10 to 1000 words), such as FAQs, chats, microblog posts, text messages or newswires, and your goal is topic discovery, you could apply a straightforward approach like the following: remove all the stop words (a.k.a. function words) and compute a Zipf-style n-gram (bigram or trigram) frequency list sorted by descending frequency. The most frequent and meaningful combinations of content words will be placed at the top of the list. Presumably the top combinations represent the most important topics. Generally, this approach works well for languages like English. There is no need to apply any probabilistic model, and no need to measure the distribution and decide between a parametric or non-parametric model. Basically, some simple text statistics will do the work. Conversely, if you have big data and complex genres, such as email threads, blog posts and related comments, or huge amounts of unconnected tweets, characterized by articulated syntax and complex discourse structure (e.g. see the emails included in the Enron email dataset), completely different decisions must be made to reliably carry out topic discovery.
2) Understanding the community structure and communication context: one of the most important novelties brought by the advent of digital communication, the web and social networks is the idea of customers as individuals who belong to communities. A single individual may belong to multiple communities and those communities may even partially overlap. The concept of genre provides contextual information relevant to the identification of communities. Techniques should be worked out for social network analysis and social behaviour analysis through multivariate statistics, e.g. Principal Component Analysis and Hierarchical Cluster Analysis (e.g. see Paolillo et al., 2010).
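The simple text statistics suggested in point 1) for small corpora of short genres can be sketched in a few lines of Python. This is only a minimal illustration under my own assumptions: the tiny `STOP_WORDS` set, the regex tokenizer and the function names `tokenize` and `top_ngrams` are hypothetical, and a real application would use a complete stop-word list for the language at hand.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list (an assumption); a real list is far longer.
STOP_WORDS = {
    "a", "an", "and", "the", "is", "are", "was", "to", "of", "in",
    "on", "for", "my", "i", "it", "with", "not", "but",
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def top_ngrams(documents, n=2, k=10):
    """Return the k most frequent n-grams of content words, most frequent first."""
    counts = Counter()
    for doc in documents:
        # Remove stop words first, so that n-grams combine content words only.
        content = [tok for tok in tokenize(doc) if tok not in STOP_WORDS]
        # Slide a window of size n over each document separately,
        # so that n-grams never span two documents.
        counts.update(zip(*(content[i:] for i in range(n))))
    return counts.most_common(k)


if __name__ == "__main__":
    chats = [
        "My credit card was declined",
        "The credit card is not accepted",
        "Credit card payment failed",
    ]
    for ngram, freq in top_ngrams(chats, n=2, k=3):
        print(" ".join(ngram), freq)
```

On the three toy messages above, the bigram "credit card" occurs three times and comes out on top, which is the kind of topic candidate the frequency list is meant to surface. Setting `n=3` gives trigrams instead.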
The concept of genre can add effectiveness and boost the synergy between text analytics and the information technology industry to tame unstructured data.