Impact of Sociolinguistics in Opinion Mining Systems

Signed post

by Alexander Osherenko, Socioware Development

Full paper:
Considering Impact of Sociolinguistic Findings in Believable Opinion Mining Systems.
Proceedings of the Fifth International Conference on Cognitive Science, 2012, Kaliningrad, Russia.

Opinions are a frequent means of communication in human society, and automatic approaches to opinion mining in texts have therefore attracted much attention. Most approaches apply data mining techniques and extract lexical features (words) as reliable means of classification. Notably, although the interest in opinion mining is huge, there are only a few explorations of the words extracted in opinion mining. This study considers this drawback and elaborates on a sociolinguistic explanation. We hypothesize that an opinion mining system should be trained to classify opinions in texts of the same language style. Hence, this contribution focuses on the following questions: 1) do sociolinguistic aspects of corpora, for example, their colloquiality or literariness, influence classification results; 2) how should reliable opinion mining systems be trained to obtain trustworthy classification results.

In the study, four text corpora of the same (emotional) domain are considered: the Sensitive Artificial Listener (SAL) corpus [4], the Berardinelli movie reviews corpus (BMRC), the Pang movie reviews corpus (PMRC) [5], and the Corpus with product reviews (CwPR).

The table above shows results of classification of non-sociolinguistic and sociolinguistic datasets of different corpora (column Corpus), where the R0 and R values refer to recall values averaged over classes for the non-sociolinguistic and sociolinguistic datasets respectively, and the CN0 and CN columns specify the corresponding classes-number values.

Results show that sociolinguistic aspects affect classification results.

Read full paper here.

Feel free to share your comments, thoughts and different experiences with us.


3 comments for “Impact of Sociolinguistics in Opinion Mining Systems”

  1. 3 June, 2012 at 16:02

Indeed, it is crucial to select data sets of ‘like’ texts (Laver et al. 2003; Kaal, Forthcoming) for machine training. This requires genre identification: characteristics of linguistic and discursive form and function in the social context in which the communication occurs.

Kaal, B. Forthcoming. Worldviews: Spatial Ground for Political Reasoning in Dutch Election Manifestos. CADAAD 6(1).
Laver, Michael, Kenneth Benoit, and John Garry. 2003. Extracting Policy Positions from Political Texts Using Words as Data. American Political Science Review 97, pp. 311–331.

  2. 5 June, 2012 at 13:57

    Thanks for the pointer, Bertie.

  3. 10 June, 2012 at 08:32

Discussion on LinkedIn: The WebGenre R&D Group

Rajive Pai • The paper is amazing… Thanks for posting this…

Leslie Barrett • I am not sure where this paper appeared, but it is surprising that they ignore so many studies on opinion and lexical features (see for many of the most recent and commonly cited references). The research methodology itself seems sound, but there is also a lack of explanation around the key definitions of “sociolinguistic” features. For me, that invalidates the conclusion.

    Marina Santini • Maybe Alexander can give us more details about sociolinguistic features…

Alexander Osherenko • @Leslie As far as I know, most studies, including those in your paper, concentrate on the classification of information from some particular (single) corpus according to the opinion-bearing aspect. For example, the papers of Jan Wiebe introduce means to identify opinionated phrases in MPQA. To reference these papers and save space in my 2-page extended abstract, I cited my book, which contains these references.

The issue I wanted to address in this abstract was the classification of several corpora, not a single one, and what aspects have to be considered. Hence, for classification of the SAL corpus containing NL dialogues, the BMRC movie reviews corpus, the PMRC with movie review weblogs, and the CwPR with product reviews, I always extract features relying on Bag-of-Words and always use the SVM classification method. The only things that change are the sociolinguistic aspects of the corpora I classify, or their language code. “Adapted to texts, Belikov and Krysin [6] define language style in sociolinguistics roughly as texts that can be distinguished, for example, by language code, natural (English) or artificial (Esperanto), or by language subcode, literary or colloquial language.” In this abstract, it is probably the only sociolinguistic definition I had to supply to make its meaning clear.

    Cheers, Alexander
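[Editor's note] The fixed extraction procedure Alexander describes — the same Bag-of-Words features regardless of corpus — can be sketched as follows. The two snippets are invented stand-ins for a colloquial SAL-style dialogue and a literary BMRC-style review; the real study uses SVM on such vectors, which is omitted here:

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Map a text onto a fixed vocabulary as raw term counts."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocabulary]

# Toy snippets standing in for texts from two corpora of different
# language subcodes (colloquial dialogue vs. literary review).
sal_like = "yeah i i dunno it was kinda nice really"
bmrc_like = "the film is a remarkably nice piece of storytelling"

# Shared vocabulary so the same extraction applies to both corpora.
vocab = sorted(set(sal_like.split()) | set(bmrc_like.split()))
v1 = bag_of_words(sal_like, vocab)
v2 = bag_of_words(bmrc_like, vocab)
# The identical procedure yields very different feature profiles --
# the sociolinguistic variation between corpora that the abstract probes.
```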

Leslie Barrett • OK, fair enough… so you are saying that opinion detection currently has insufficient sensitivity to genre, basically? And would you use syntactic or dialogue-based features to serve as signals for that?

Alexander Osherenko • Let me put it this way: I don’t exclude the possibility that feature extraction should be performed under genre consideration, and not only under frequency consideration as is done in many modern opinion mining systems.

Regarding your questions: syntactic features can be very unreliable if you think of opinion mining in Internet texts such as weblogs; you can’t demand that your authors know syntax before they use your opinion mining system. Dialogue-based features are a good choice since both opinion mining and feature extraction are performed in a coherent way.

    Diana Maynard • But for blogs at least, sentences are typically well-formed, so it’s not really an issue. Tweets are another matter, but there are other techniques for dealing with them.

Alexander Osherenko • Anyway, if opinion mining is performed on texts that are not well formed, I would rather extract stylometric features, features frequently used in author attribution, such as the number of letters in a text or the standard deviation of word lengths. An occasional evaluation inaccuracy is then not a big problem and can be tolerated if the utilized classification method is flexible enough to “correct” this inaccuracy.
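[Editor's note] The stylometric features Alexander names (letter counts, standard deviation of word lengths) need no parsing, only whitespace tokenisation, which is why they survive ill-formed input. A minimal sketch on an invented tweet-like string:

```python
import statistics

def stylometric_features(text):
    """Shallow stylometric features of the kind used in author
    attribution: robust to ill-formed input, since they require
    no syntactic analysis, only splitting on whitespace."""
    words = text.split()
    lengths = [len(w) for w in words]
    return {
        "n_letters": sum(c.isalpha() for c in text),
        "n_words": len(words),
        "mean_word_len": statistics.mean(lengths),
        "stdev_word_len": statistics.pstdev(lengths),  # population stdev
    }

# Works even on a deliberately malformed, tweet-like string.
feats = stylometric_features("gr8 movie!! luvd it sooo much")
```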

    Diana Maynard • You really think that kind of method would work for e.g. tweets and facebook posts? I’d be incredibly surprised if it did…do you have some evidence to support that?

Alexander Osherenko • I must correct you: I am not describing a new kind of classification method but pointing out that the genre of the information in corpora influences classification results. The evidence to support that are the classification results of the corpora I describe in the abstract.

BTW, the SAL corpus with NL dialogues and the Pang movie reviews corpus I reference in the abstract contain not-well-formed texts with repetitions, vague formulations, etc. that resemble tweets or facebook posts.

Marina Santini • @Diana: I was reading a paper that you co-authored, Automatic Detection of Political Opinions in Tweets, and wondered:

* what is the current state-of-the-art in opinion mining in microposts? On what sample size?

* are the manually annotated microposts that you used in your experiments publicly available? Maybe the annotated testbed could be enlarged by somebody else if you list your annotation criteria…

* do you think that new NLP tools tailored to short texts (text messages, tweets, FB microblogs, etc.) would help in opinion mining, or are morphology and syntax insignificant for the understanding of short texts?

    Cheers, Marina

    Alexander Osherenko • I wanted to clarify my motivation for writing the abstract.

I begin with stopwords, which can be considered genre-neutral. First, in my PhD thesis on opinion mining, I found that stopwords can be used to classify opinions. However, in some approaches to opinion mining, stopwords are excluded from further consideration, apparently because they are opinion-neutral. Nevertheless, in my experiments I found that they are beneficial. What is the reason for the result improvement?

There can be two reasons for it. The first, from the point of view of psychology: stopwords are per se function words but can be used to comprehend meaning (Chung, C., and J. Pennebaker. The Psychological Functions of Function Words). My intuition for the second reason, from the point of view of computer science: features corresponding to stopwords facilitate classification since they divide the feature search space into smaller zones that are easier to handle. My abstract also confirms the dependency on genre.
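[Editor's note] One concrete way stopwords can carry opinion-relevant signal: a stopword list that contains negators loses them along with everything else. A toy sketch (the stopword list and review sentence are invented for illustration):

```python
from collections import Counter

# A tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "is", "it", "this", "of", "i", "not", "do"}

def features(text, keep_stopwords):
    """Bag-of-words counts, optionally with stopwords filtered out."""
    tokens = text.lower().split()
    if not keep_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return Counter(tokens)

review = "i do not think this is a good film"
with_sw = features(review, keep_stopwords=True)
without_sw = features(review, keep_stopwords=False)
# Dropping stopwords discards "not", losing the negation that flips
# the polarity of "good" -- function words carrying opinion signal.
```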


Hugues de Mazancourt • To confirm Alexander’s point on stylometric features, this paper from Google Research is interesting reading:

    Marina Santini • Let me add another more general but very recent pointer:

Sentiment Analysis and Opinion Mining.
New book by Bing Liu, Morgan & Claypool Publishers, May 2012.

    Diana Maynard • Marina – you may be interested in a more recent paper of mine which takes that work on political tweets a stage further: and also in the tutorial I recently gave at LREC 2012:

    To answer your questions: the annotated corpus isn’t currently available as it was only a small test set, but we plan to make our larger annotated corpora available once they’re done.

    The last question is discussed quite well in my tutorial: basically yes, improved versions of tokenisers, POS taggers and so on developed specifically for tweets can certainly help. This becomes more important as one uses deeper linguistic analysis methods. And while shallow techniques are enough for getting many of the opinions right (low-hanging fruit), I think deeper linguistic analysis methods are certainly necessary for getting (some of) the remaining hard cases. This is the area I’m currently working on.

    Marina Santini • Thanks, Diana!
