Comments to the post: The Path Forward: From Big Unstructured Data to Contextualized Information (www.forum.santini.se/2012/03/the-path-forward-from-big-unstructured-data-to-contextualized-information/)
Discussion on LinkedIn: American Society for Information Science & Technology (http://lnkd.in/EdqJDb)
Tom Reamy • Hi Marina, good blog – and as someone who has been dealing with the idea of context in text analytics for many years, I’m in total agreement as to its importance. There are quite a few other types of context that are important as well – but that’s another conversation.
As far as text analytics tools dealing with this go – most of them can, but the ones with a full set of operators will probably do best. Two contextual areas come to mind immediately: how to get TA software to recognize context like genre when it is not specified, and how to take context into account in categorization or extraction rules.
The former can be done with both statistical and rule-based approaches, but the latter is something that you really need rules for.
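To make the rule-based side concrete, here is a minimal sketch of what genre-indicator rules might look like (the genres, patterns, and scoring below are illustrative assumptions, not any vendor’s actual rule language):

```python
import re

# Hypothetical genre-indicator rules: each genre is signalled by
# surface patterns that tend to co-occur in documents of that genre.
GENRE_RULES = {
    "faq": [r"\bfrequently asked questions\b", r"^\s*Q[:.]", r"^\s*A[:.]"],
    "press_release": [r"\bfor immediate release\b", r"\bmedia contact\b"],
    "recipe": [r"\bingredients\b", r"\bpreheat\b", r"\d+\s*(cups?|tbsp|tsp)\b"],
}

def guess_genre(text):
    """Return the genre whose indicator patterns fire most often."""
    scores = {
        genre: sum(bool(re.search(p, text, re.IGNORECASE | re.MULTILINE))
                   for p in patterns)
        for genre, patterns in GENRE_RULES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

print(guess_genre("FOR IMMEDIATE RELEASE\nAcme Corp. announces ..."))  # press_release
```

Rules like these are cheap to write but, as noted further down, tend to be brittle: they encode the indicators of one document collection and may miss genre cues phrased differently elsewhere.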
As to who does it, there is a whole confusion of overlapping offerings (one of the services my company has been doing a lot of lately is helping companies run text analytics evaluation projects), but the top commercial names are SAS and IBM on the full-platform end; Smart Logic, Expert Systems, and Concept Searching, among others, in the mid-tier; and then sentiment analysis companies do some of this – Attensity, Clarabridge, Lexalytics, and lots of little companies like Janya and AI-One. Lastly there is open source like GATE. There are way more, but in the interest of my typing fatigue, let’s stop there.
The real trick is not the technology but building rules or statistical models that are not too brittle yet have sufficient depth – but that’s another topic entirely.
Marina Santini • Hi Tom, thanks for your useful list. My experience with some of these products is that they need substantial tweaking to identify genre and sublanguage differences…
Or as you say they are still “too brittle” 🙂
Tom Reamy • Hi Marina,
I’m not sure what you mean by tweaking – certainly none of those packages can categorize genre out of the box, but they are all designed to support developing rules that should enable them to characterize genre, as long as there are indicators of genre in the text. I’ve mostly worked on topical categorization and entity extraction using the tools I listed, so I haven’t really looked into what features you might use to create a set of genre classification rules.
If you could provide some examples of genre classification done by humans, I might be able to give you a better answer as to what kind of effort it might require to develop some automated rules – or whether it is possible at all given current technologies. For example: what kinds of features would a human typically use to categorize genre? What are some examples of genres that you find particularly useful to distinguish?
As far as the comment about “brittle” goes, I was referring to the generality of the rules you might build – it is often tricky to build rules that work with high precision on a particular set of documents without failing when applied to new documents.
Marina Santini • Hi Tom,
as for examples of genre classification done by humans, Mark Rosso worked a lot with this topic. Here are some references:
* Abstract: Identification of Web Genres by User Warrant (http://www.forum.santini.se/2011/03/abstract-identification-of-web-genres-by-user-warrant/)
* Using Genre to Improve Web Search (http://ils.unc.edu/~rossm/Rosso_dissertation.pdf)
* Mark A. Rosso: User-based identification of Web genres. JASIST 59(7): 1053-1072 (2008)
My own experiment with genre classification by humans is described here:
* Santini M. (2008). “Zero, Single, or Multi? Genres of Web Pages through the Users’ Perspective”. Information Processing & Management. Volume 44, Issue 2, March 2008, pp. 702–737.
When Mark and I tried to compare human classification and a state-of-the-art genre prototype, we got the following results:
* Testing a Genre-Enabled Application: A Preliminary Assessment (www.bcs.org/upload/pdf/ewic_fd08_paper7.pdf)
An interesting small experiment is briefly described here:
* Rehm G., Santini M., Mehler A., Braslavski P., Gleim R., Stubbe A., Symonenko S., Tavosanis M. and Vidulin V. (2008). “Towards a Reference Corpus of Web Genres for the Evaluation of Genre Identification Systems”, LREC 2008, Marrakech. <http://www.astro-susi.de/genre/lrec2008.pdf>
* Case Study: Assigning genre labels to random web pages (http://www.webgenrewiki.org/index.php5/Case_Study:_Assigning_genre_labels_to_random_web_pages)
Your experience and insights will be valuable for me, especially because you have worked with topical text categorization. Many genre researchers claim that topic, genre and sublanguage are different concepts, although overlaps exist among them. Presumably we need different features (and/or computational methods) to capture them. In my opinion, the great advantage of genre and sublanguage detection is that they can tell us something about the “context” in which information has been uttered/issued/spread, thus reducing ambiguity and misunderstandings. In my experience, most state-of-the-art packages (and approaches) are topic-oriented, which is not enough to account for genre and sublanguage.
Looking forward to hearing your opinion.
big thanx, marina
Discussion on LinkedIn: BCS Information Retrieval Specialist Group (http://lnkd.in/7XNRgX)
Alexandro G.S. Mancusi • Could you approach this as a classification problem and use SVM? We have employed such methods to work on very large document sets. We work together with Treparel Information Solutions to handle similar challenges, but it helps to break the problem down further so you can perhaps employ a mix of technologies.
Marina Santini • Hi Alexandro,
It is indeed a classification problem. In all the text analytics solutions that I have used/tried out so far, I have not seen any genre/sublanguage/domain classifications implemented. But of course I might have missed some of them. That’s why I would like to have additional suggestions. My claim is that we should add (at least) these three kinds of textual classification to text analytics tools to assess the real value of information. In simple words, we need the “context” provided by these three concepts to value information. I would say “conceptualize rather than tweak”.
As far as algorithms are concerned, SVM is very powerful and performs well with genre classification on smallish corpora/datasets (say, 5,000 documents).
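For illustration, a minimal sketch of such an SVM setup using scikit-learn (the corpus and labels below are toy assumptions, not data from any experiment discussed in this thread):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Hypothetical corpus: documents paired with genre labels.
docs = ["Q: How do I reset my password? A: Click ...",
        "FOR IMMEDIATE RELEASE: Acme Corp. today announced ...",
        "Mix the flour and sugar, then preheat the oven ...",
        "Dear editor, I am writing to express my concern ..."] * 50
labels = ["faq", "press_release", "recipe", "letter"] * 50

# Linear SVM over word n-gram features: a common baseline for
# genre classification on smallish datasets.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
scores = cross_val_score(clf, docs, labels, cv=5)
print("mean accuracy: %.3f" % scores.mean())
```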
Could you tell me your results on genre/sublanguage/domain classifications? How many documents in your datasets? Which P/R? Which features?
Alexander Osherenko • As far as I know, it doesn’t matter much which classifier you use. It is not even very important which feature evaluation method you use. For example, you can use SVM (an analytical classifier). However, NaiveBayes (a probabilistic classifier) can outperform it on particular kinds of data. You can use the frequency feature evaluation method. However, it can be outperformed by presence feature evaluation.
I assume classification results depend rather on features resulting from the nature of the information you classify, not on the classifier or the feature evaluation method. In the case of text classification, I have considered features such as text length or the grammatical correctness of texts.
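A small sketch of this kind of comparison — frequency vs. presence features crossed with Naive Bayes and an SVM — using scikit-learn (toy data, purely illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Toy two-class corpus (fabricated for illustration).
docs = ["great great great product", "terrible awful product",
        "great service", "awful awful service"] * 25
labels = ["pos", "neg", "pos", "neg"] * 25

# Frequency features count term occurrences; presence features
# (binary=True) only record whether a term occurs at all.
for feat_name, binary in [("frequency", False), ("presence", True)]:
    for clf_name, clf in [("NaiveBayes", MultinomialNB()), ("SVM", LinearSVC())]:
        pipe = make_pipeline(CountVectorizer(binary=binary), clf)
        acc = cross_val_score(pipe, docs, labels, cv=5).mean()
        print(f"{feat_name} + {clf_name}: {acc:.3f}")
```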
Concerning results, the recall and precision values in my experiments were in most cases very close, and I could establish a rule of thumb: they are about triple the chance baseline. For example, for a 5-class problem the P/R values were about 60% (20% by chance).
Discussion on LinkedIn: Text Analytics (http://lnkd.in/N2XduH)
Ron Katriel • Hi Marina. If I understood you right, you are concerned with making sense of multiple information channels, each with its own unique structure. Automatic discovery of structure within a channel is a form of unsupervised learning which, as you probably already know, is a very difficult problem (also known as density estimation in statistics). It is related to artificial intelligence concepts such as data mining, clustering, feature extraction, and dimensionality reduction. You may also want to take a look at the new work being done on Sentiment Analysis. It looks like it partly overlaps with your area of interest, even if the goals are somewhat different.
Marina Santini • Hi Ron,
I see it more as an automatic text classification problem. Genre, sublanguage and domain are TEXTUAL dimensions that we can use to understand (or infer) the context in which a document/text/conversation has been issued. Say that we want to automatically extract the most important information from a large set of telephone conversation transcripts from a call center. If we just extract the most frequent words/unigrams (with any degree of sophistication), in many cases this will not be enough and we will have to tweak a lot. In order to single out the most important information we need to know how human communication works in a certain context. Genre, sublanguage, domain (and other textual dimensions) help us to reconstruct the communicative context.
As I said to Alexandro (http://lnkd.in/7XNRgX), in my view we should add (at least) these three kinds of textual classifications to text analytic tools to assess the real value of information.
What do you think?
Ron Katriel • Hi Marina,
I can see why you would consider these a classification problem, but to be able to treat them this way you need a “teacher” (i.e., a large training set that is already classified). For example, while working at an electronic news integrator, I led an R&D effort to automatically classify (tag) news stories in real time. The Genre included news feeds from vendors such as Business Wire, PR Newswire, Reuters, AP, etc. Some of the sources were already tagged using a classification system that included concepts such as business, sports, entertainment, weather, etc. (Domains).
A key insight was that one can view a coded news channel as a teacher, using it to train a classifier that augments the coding of channels that are uncoded or have less sophisticated coding systems, thus ensuring uniformity of coding across all channels. This works because, regardless of vendor, electronic news articles share many stylistic attributes, such as having the key information show up early in the article, and have a similar style due to the common audience and editorial influence.
In the case of news stories, the word/phrase frequency approach had to be augmented to account for structure (headline vs. opening paragraph vs. body of article), word/phrase counts had to be normalized with respect to the length of the article (to obtain frequencies of occurrence), and special entities such as organizations, names, stock symbols, locations, currencies, etc., were automatically tagged to increase their effectiveness. The relative importance of such Sublanguage aspects had to be tuned via a trial and error approach since there was no analytical way of doing this.
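A rough sketch of the structure-weighting and length-normalization scheme just described (the section weights and field names are guesses for the sake of illustration, not the original system’s values):

```python
from collections import Counter

# Assumed section weights: opening paragraphs count most, the
# headline somewhat, the rest of the body least.
SECTION_WEIGHTS = {"headline": 2.0, "opening": 3.0, "body": 1.0}

def weighted_term_frequencies(article):
    """Combine per-section term counts into length-normalized,
    section-weighted term frequencies."""
    counts = Counter()
    total_tokens = 0
    for section, weight in SECTION_WEIGHTS.items():
        tokens = article.get(section, "").lower().split()
        total_tokens += len(tokens)
        for tok in tokens:
            counts[tok] += weight
    # Normalize by article length so long and short stories yield
    # comparable feature values.
    return {t: c / max(total_tokens, 1) for t, c in counts.items()}

article = {"headline": "Acme shares surge",
           "opening": "Shares of Acme Corp surged on Tuesday ...",
           "body": "Analysts said the results beat expectations ..."}
print(weighted_term_frequencies(article))
```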
I hope this helps.
Marina Santini • Hi Ron,
what an interesting answer! Yes, Ron, your answer helps me understand better how the concepts of genre and/or sublanguage are dealt with.
What you call “a key insight” is basically “genre awareness”. Your “stylistic attributes” (aka “genre features”) are normally employed for AGI (Automatic Genre Identification).
Did you use a vector representation? How did you account for the compositional structure, i.e. title, opening paragraphs, etc.? With weights? How big were your training corpus and test set? Do you have any paper describing your experience?
An experiment that might be of interest is summarized here: http://www.forum.santini.se/2011/03/abstract-formulating-representative-features-with-respect-to-genre-classification/
You say there is no analytical way to account for genre and sublanguage… I am arguing the opposite 🙂 Once you conceptualize them, the door towards the future is open…
What do you think?
Ron Katriel • Marina,
This work dates back to the mid-90s, well before genre classification became a prominent subject of research. But, ultimately, it all ties back to the underlying statistical concepts of probabilities, priors, distributions, stratification, etc., so it is no surprise that the same fundamental concepts come up again and again under different names.
Our approach was indeed vector based. We typically used between 100 and 1,000 features (words and phrases) per class, depending on the size of our training sample. We had hundreds of thousands of tagged news stories but, as expected (i.e., Zipf’s Law), a few categories were over-represented while the majority occurred less frequently.
The learning system depended on two key innovations: a clever feature selection algorithm (ranking words and phrases based on their ability to separate two distributions – news stories tagged as belonging to the class of interest and the rest, which are not) and a high-dimensionality sparse-matrix linear equation solver. After several years of tuning the code we achieved recall and precision equaling or exceeding those of human coders.
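The exact scoring function was not published, but a minimal sketch of the feature selection idea — ranking terms by how strongly their document frequency separates in-class stories from the rest — could look like this (a crude separability score, assumed purely for illustration):

```python
from collections import Counter

def rank_features(docs, labels, target_class, top_k=10):
    """Score each term by the gap between its document frequency in
    the target class and in everything else (a crude stand-in for
    the distribution-separation ranking described above)."""
    in_class = [set(d.lower().split()) for d, y in zip(docs, labels) if y == target_class]
    rest = [set(d.lower().split()) for d, y in zip(docs, labels) if y != target_class]
    df_in, df_out = Counter(), Counter()
    for d in in_class:
        df_in.update(d)
    for d in rest:
        df_out.update(d)
    scores = {t: df_in[t] / max(len(in_class), 1) - df_out[t] / max(len(rest), 1)
              for t in set(df_in) | set(df_out)}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# e.g. rank_features(stories, tags, target_class="sports")
```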
As you guessed, structure was accounted for using weights. The primary components were headline (somewhat important), opening paragraphs (very important) and the rest of the news story (less important). The weightings were arrived at empirically by trial and error but the overall performance was not highly dependent on getting this perfectly right. This is definitely a research area which could lead to improvements in performance.
Thanks for the link to the article. Would you be able to email me the PDF version? As far as I can recall, we never published our results in a mainstream journal or conference proceeding but I have copies of internal documents on the subject which I may be able to share with you after checking with the company I worked for (the system is still used as a major revenue generator so there may be concerns about sharing detailed information).
Discussion on LinkedIn: The WebGenre R&D Group (http://lnkd.in/8ucizZ)
Daniel McMichael • try Thoughtweb http://www.thoughtweb.com
Marina Santini • Thanx, Daniel.
Nitesh Khilwani • At Meshlabs, we try to solve similar problems for different domains (media, insurance, IT, banking). http://meshlabsinc.com/
Marina Santini • Thanks for the suggestion, Nitesh.
Srikant Jakilinki • @Nitesh I was of the opinion that #Meshlabs is inactive. The demo video on your website is defunct, and so is the copyright notice.
Nitesh Khilwani • Hi Srikant,
Thanks for pointing this out.
Apologies for the misunderstanding. We are restructuring our entire website and will be up and running soon.
Marina Santini • @Alexander: the machine learning algorithm does make a difference. I tried a Naive Bayes classifier and an SVM (within WEKA) on the same dataset and got quite a big, statistically significant gap between them.
Have a look at Table 1 in this paper:
Santini M. (2007). “Automatic Genre Identification: Towards a Flexible Classification Scheme”. BCS IRSG Symposium: Future Directions in Information Access 2007 (FDIA 2007) http://ewic.bcs.org/content/ConWebDoc/13731
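A minimal sketch of how such a gap can be checked for significance (a paired t-test over per-fold cross-validation scores; the corpus and vocabularies below are fabricated purely for illustration, and WEKA’s own experimenter offers corrected variants of this test):

```python
import random
from scipy.stats import ttest_rel
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Fabricated two-genre corpus with overlapping vocabulary, so the
# task is learnable but not trivially separable.
random.seed(0)
listing_vocab = "click subscribe offer price terms list buy now".split()
narrative_vocab = "hero time sword story once kingdom dark night".split()

def make_doc(main, other, k=12):
    # Mostly own-class vocabulary, with some noise from the other class.
    return " ".join(random.choice(main) if random.random() < 0.7
                    else random.choice(other) for _ in range(k))

docs = ([make_doc(listing_vocab, narrative_vocab) for _ in range(100)]
        + [make_doc(narrative_vocab, listing_vocab) for _ in range(100)])
labels = ["listing"] * 100 + ["narrative"] * 100

nb = make_pipeline(TfidfVectorizer(), MultinomialNB())
svm = make_pipeline(TfidfVectorizer(), LinearSVC())

# cross_val_score uses the same deterministic folds for both
# pipelines, so the per-fold accuracies are paired observations.
nb_scores = cross_val_score(nb, docs, labels, cv=10)
svm_scores = cross_val_score(svm, docs, labels, cv=10)
t, p = ttest_rel(svm_scores, nb_scores)
print(f"SVM {svm_scores.mean():.3f} vs NB {nb_scores.mean():.3f} (p = {p:.4f})")
```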
Alexander Osherenko • Of course, Marina, you are right — you probably mean the recognition rates for “Personal Home Pages” and “Listings”, where there is quite a big difference. Although I would be tempted to ask how your data was collected, whether you have sparse data, and what recall value your classification achieves as evidence that the classifier actually learnt the classification — but that is not the point.
In my post I was trying to answer the question: how can I improve the results of classification for a particular corpus? What means can be considered beneficial?
A popular answer would be: an appropriate classifier should be chosen. However, the list of “usual suspects” in text mining is quite short — you would probably choose SVM or NaiveBayes. Maybe you would consider ensemble classifiers; however, from my personal experience, classification results don’t become much better and such classification is time-consuming. After trying these classifiers you can try others, or other data mining means such as the method of feature evaluation. However, my findings didn’t confirm that classification results can be significantly improved by such a choice. Modern SVM and NaiveBayes are good enough, and I wouldn’t waste time looking for an even better classifier.
That’s why my answer to the improvement question would be: a significant improvement in recognition can be achieved by considering not only data mining issues but, more effectively, by extracting other features. Such features in text mining can be lexical, stylometric, deictic, grammatical, or combinations of these.
I assume that shifting the focus to feature extraction would improve classification more than trying to answer only data mining questions such as which classifier is best, which feature evaluation method is best, and so on.
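To make these feature families concrete, a minimal sketch of such extractors (the word lists and feature choices are illustrative assumptions):

```python
import re

# A small, assumed list of deictic words: terms that anchor a text
# to its speaker, place, or time.
DEICTIC = {"i", "you", "we", "here", "there", "now", "then", "this", "that"}

def extract_features(text):
    """Return a few lexical, stylometric, and deictic features."""
    tokens = re.findall(r"[a-z']+", text.lower())
    n = max(len(tokens), 1)
    return {
        # stylometric: surface statistics of the text
        "text_len": len(tokens),
        "avg_word_len": sum(map(len, tokens)) / n,
        "punct_rate": sum(c in ".,;:!?" for c in text) / max(len(text), 1),
        # deictic: rate of context-anchoring words
        "deictic_rate": sum(t in DEICTIC for t in tokens) / n,
        # lexical: type/token ratio as a crude vocabulary-richness cue
        "type_token_ratio": len(set(tokens)) / n,
    }

print(extract_features("Now, here is what we think: you should try this!"))
```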