Book Review: Sequences in Language and Text (2015)

Book review by Marina Santini, published on The LINGUIST List
http://linguistlist.org/issues/27/27-1505.html

Book announced at http://linguistlist.org/issues/26/26-2205.html

EDITOR: George K. Mikros
EDITOR: Ján Mačutek
TITLE: Sequences in Language and Text
SERIES TITLE: Quantitative Linguistics [QL] 69
PUBLISHER: De Gruyter Mouton
YEAR: 2015

REVIEWER: Marina Santini, Uppsala University

Reviews Editor: Helen Aristar-Dry

SUMMARY

The volume “Sequences in Language and Text” is an edited collection of 14 chapters. The book also includes a Foreword by the editors G. Mikros and J. Mačutek, a Subject Index and an Authors’ Index.

In the Foreword, the editors briefly outline the structure of the volume, which is roughly divided into theoretically oriented chapters on the one hand, and chapters more focused on real-world problems on the other. The aim of the book is to document “the latest results of the language-in-the-line quantitative linguistics, an approach that is less prominent to the language-in-the-mass approach, but that apparently is gaining more and more visibility.” (p. v).

1) In the Introduction chapter, G. Altmann describes what sequences are: “Sequences occur in text, written or spoken, and not in language which is considered system. However, the historical development of some phenomenon in language can be considered sequence, too.” (p. 1). Many different forms of sequences are known from text analyses, but it is not always possible to explain the rise of a given sequence because “sequences are secondary units and their establishment is a step in concept formation.” (p. 3). A building block in the sequential study of texts is “repetition”. There are several types of repetition, such as uninterrupted sequences of identical elements, aggregative repetitions, or cyclic repetitions. Other types of sequences in text are: symbolic sequences (e.g. nominal classes), numerical sequences (e.g. distances between neighbours), and musical sequences (that, for example, characterize styles). Textual sequences are supposed to be regulated by laws. The effort of quantitative linguistics is to establish laws and systems of laws and to establish theories. But Altmann warns us that an overall theory does not exist. Instead, quantitative linguists “look at the running text, try to capture its properties and find the links among these sequential properties.” (p. 6).

2) In the chapter “Linguistic Analysis Based on Fuzzy Similarity Models”, S. Andreev and V. Borisov discuss the relevance of building fuzzy similarity models for linguistic analysis. These models have a complex structure and are characterized by a hierarchy of interconnected characteristics. The models aim at solving a wide range of tasks under conditions of uncertainty, such as the estimation of the degree of similarity between an original text and its translations; the estimation of the similarity of parts of the compared texts; and the analysis of individual style development in fragments of texts. The authors apply the models to the original poem by Coleridge, “The Rime of the Ancient Mariner”, and to two Russian translations, one by Gumilev and the other by Levik, and provide a detailed qualitative interpretation of the numerical results (shown in Table 6 and in the charts in Figures 7, 8, 9 and 10) returned by a similarity model based on the parts of speech of the original text by Coleridge.

3) In the third chapter, “Textual navigation and autocorrelation”, F. Bavaud, C. Cocco, and A. Xanthos introduce a unified formalism for textual autocorrelation. The term “textual autocorrelation” is defined as “the tendency for neighbouring textual positions to be more or less similar than randomly chosen positions” (p. 54). The presented approach is applicable to sequences and useful for text analysis, and is based on two factors: neighborhood-ness between textual positions and (dis-)similarity between positions. The authors present and discuss case studies to illustrate the flexibility of their approach by addressing lexical, morpho-syntactic and semantic properties of a text. The case studies include the autocorrelation between lines of a play (i.e. “Sganarelle ou le cocu imaginaire” by Molière), free navigation within documents (example text: “Sganarelle ou le cocu imaginaire” by Molière), hypertext navigation (example text: the “WikiTractatus”) and semantic autocorrelation (example text: “The masque of the red death” by Poe). In the conclusions, the authors argue that their approach is applicable “to any form of sequence and text analysis that can be expressed in terms of dissimilarity between positions or types, especially semantically-related problems” (p. 54).

4) In the chapter “Menzerath-Altmann law versus random model”, the authors, M. Benešová and R. Čech, build three random models to show that the MAL (Menzerath-Altmann law) governs human language behaviour. Menzerath’s law, or the Menzerath-Altmann law (named after Paul Menzerath and Gabriel Altmann), is a linguistic law according to which the increase of a linguistic construct results in a decrease of its constituents, and vice versa. For example, the longer a sentence (measured in terms of the number of clauses), the shorter the clauses (measured in terms of the number of words); or the longer a word (in syllables or morphs), the shorter its syllables or morphs (measured in sounds). The authors argue that the MAL is a law that governs human language but not randomness. In order to support this claim, they present three random models, each of which takes into account different text characteristics and defines randomness differently. “The results returned by the experiment show that the data generated by the three random models does not fulfill the MAL.” (p. 65). These results support the claim that randomness and human language are governed by different laws. In all three random models both the number of constructs and the number of constituents correspond to a real text (i.e. the essay “The power of the powerless” by Havel). The constructs are represented by sentences and the constituents by clauses. The sentence is defined as a sequence of words which ends with a full stop, the clause as a unit containing a predicate represented by a finite verb, and the word is defined graphically as a sequence of letters between spaces.
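
To make the law concrete for readers unfamiliar with it: in its simplified form the MAL is often written as y = A * x^(-b), where x is the construct size and y the mean constituent size. The sketch below fits this form by least squares on log-log transformed data; the data points and parameter names are invented for illustration, not taken from the chapter.

```python
# Fitting the simplified Menzerath-Altmann law y = A * x^(-b)
# (mean clause length y in words, as a function of sentence length x in clauses)
# by ordinary least squares on log-log transformed data.
import math

# hypothetical (sentence length in clauses, mean clause length in words) pairs
data = [(1, 9.1), (2, 7.8), (3, 7.0), (4, 6.5), (5, 6.1), (6, 5.8)]

# log-transform: ln y = ln A - b * ln x, then fit a straight line
xs = [math.log(x) for x, _ in data]
ys = [math.log(y) for _, y in data]
n = len(data)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = -slope                     # MAL exponent: positive if constituents shrink
A = math.exp(my - slope * mx)  # scale parameter

print(f"A = {A:.2f}, b = {b:.3f}")
```

A positive fitted b indicates the MAL-style inverse relation; data generated by a random model would typically yield a b close to zero.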

5) In “Text length and the lambda frequency structure of a text”, R. Čech presents a study that shows the dependence of the lambda indicator (a measure of the frequency structure) on text length. The frequency structure of a text is accounted for by methods such as the type-token ratio and other vocabulary richness measures, which are affected by text length. Although normalization methods have been suggested to address this problem, it seems that when certain languages are analyzed separately (namely Czech and English in the present study), a dependence of lambda on text length emerges. The chapter presents a method for the empirical determination of the interval in which lambda should be independent of text length. The author argues that within this interval, lambda can be safely used for the comparison of genres, authorship, style, etc.
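
As an illustration of what the indicator measures, the sketch below computes lambda under the assumption that it follows the Popescu-Altmann definition (the arc length of the rank-frequency sequence, scaled by log10(N)/N for N tokens); the chapter may use a variant of this formula, and the toy text is mine.

```python
# Lambda indicator of a token sequence (assumed Popescu-Altmann definition).
import math
from collections import Counter

def lambda_indicator(tokens):
    # rank-frequency sequence: frequencies sorted in decreasing order
    freqs = sorted(Counter(tokens).values(), reverse=True)
    # arc length of the polyline through the (rank, frequency) points
    L = sum(math.sqrt(1 + (freqs[i] - freqs[i + 1]) ** 2)
            for i in range(len(freqs) - 1))
    N = len(tokens)
    return L * math.log10(N) / N

text = "the cat sat on the mat and the dog sat on the rug".split()
print(round(lambda_indicator(text), 3))
```

Because the raw arc length grows with text size, the log10(N)/N factor is what is meant to neutralize length; the chapter's point is that this neutralization only holds within an empirically determined interval.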

6) In “Linguistic Motifs”, R. Köhler presents a new unit, called “motif”, that “can give information about the sequential organization of a text with respect to any linguistic unit and to any of its properties – without relying on a specific linguistic approach or grammar” (p. 89). Köhler defines the linguistic motif as “the longest continuous sequence of equal or increasing values representing a quantitative property of a linguistic unit”. Linguistic motifs are subdivided into L-motifs (i.e. a continuous series of equal or increasing length values), F-motifs (i.e. a continuous series of equal or increasing frequency values), P-motifs (i.e. a continuous series of equal or increasing polysemy values) and T-motifs (i.e. a continuous series of equal or increasing politextuality values). The author uses one of the end-of-year speeches of the Italian presidents, in which words have been replaced by their lengths measured in syllables, to show what L-motifs are. Then he converts the same speech into R-motifs using POS instead of words. In both cases, the result of the fitting is excellent. The advantages of motifs are: segmentation is unambiguous, exhaustive, scalable with respect to granularity and, last but not least, motifs show a rank-frequency distribution of the Zipf-Mandelbrot type, that is, they behave like other more traditional linguistic units. Interestingly, motifs provide a means to analyse texts in their sequential structure with respect to all kinds of linguistic units and properties; even categorical properties can be studied in this way. The author admits that the full potential of the proposed approach has not been explored yet.
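
The segmentation itself is simple enough to show in a few lines. The sketch below splits a sequence of word lengths into L-motifs following the definition quoted above (maximal runs of equal or increasing values); the example lengths are invented.

```python
# Segmenting a sequence of word lengths into L-motifs: maximal continuous
# runs of equal or increasing values, per Koehler's definition.

def l_motifs(lengths):
    """Split a sequence of word-length values into L-motifs."""
    motifs, current = [], []
    for value in lengths:
        if current and value < current[-1]:  # a drop in value closes the motif
            motifs.append(current)
            current = []
        current.append(value)
    if current:
        motifs.append(current)
    return motifs

# word lengths in syllables for a hypothetical sentence
print(l_motifs([1, 2, 2, 3, 1, 1, 4, 2]))  # → [[1, 2, 2, 3], [1, 1, 4], [2]]
```

Note that the segmentation is unambiguous and exhaustive, as the chapter claims: every position belongs to exactly one motif, with no free parameters.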

7) In “Linguistic Modelling of Sequential Phenomena: The role of laws”, R. Köhler and A. Tuzzi argue that data alone do not give an answer to a research question. It is, instead, a theoretically grounded hypothesis, tested on appropriate data, that produces new knowledge. For a linguistically meaningful and valid analysis of linguistic objects, linguistic models are required, with their laws of language and texts. The authors illustrate the usefulness of linguistic laws in a practical example (p. 111). They use a corpus of 63 end-of-year messages delivered by all the presidents of the Italian Republic over the period from 1949 to 2011. Since the corpus is a set of texts representing an Italian political-institutional discourse, the authors set up the hypothesis that the temporal behavior of the frequency of a word is discourse specific. Since a ready-made model of this kind of phenomenon is not available, they use the Piotrowski law. It turns out that some selected words follow the logistic growth function that is typical of this law.
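
For readers unfamiliar with the Piotrowski law, its core is the logistic growth function p(t) = c / (1 + a * exp(-b * t)), describing how the (relative) frequency of a form spreads over time. The sketch below evaluates it with invented parameter values, purely to show the S-shaped curve the chapter refers to.

```python
# Logistic growth function underlying the Piotrowski law:
# p(t) = c / (1 + a * exp(-b * t)); parameters here are illustrative.
import math

def piotrowski(t, a, b, c):
    return c / (1 + a * math.exp(-b * t))

# an S-shaped diffusion of a word over 60 years (hypothetical parameters)
for year in (0, 15, 30, 45, 60):
    print(year, round(piotrowski(year, a=50.0, b=0.15, c=1.0), 3))
```

Fitting a, b and c to observed yearly frequencies, as the chapter does for the presidential messages, is then an ordinary nonlinear regression problem.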

8) The chapter “Menzerath-Altmann Law for Word Length” by J. Mačutek and G. Mikros resumes the investigation of motifs. The authors emphasize that motifs are relatively new linguistic units that make possible an in-depth investigation of the sequential properties of texts. For instance, a word length motif is a continuous series of equal or increasing word lengths (often measured in syllables, morphemes or other length units) for which the MAL is valid. As explained earlier, the MAL describes the relation between the size of a construct (e.g. a word) and its constituents (e.g. syllables) and states that the larger the construct (the whole), the smaller its constituents (parts). The authors use a corpus of Modern Greek literature and also randomly generated data. For their data the following is true: the longer the motif (in number of words), the shorter the mean length of words (in number of syllables). This chapter provides another confirmation that word-length motifs behave in the same way as other more traditional linguistic units. It remains an open question whether the parameters of the MAL can be used as characteristics of languages, genres or authors. If the answer is positive, they might be applied to language classification, authorship attribution and similar fields.

9) In the chapter “Is the Distribution of L-Motifs Inherited from the Word Length Distribution?”, J. Milička points out that word length sequences can be successfully analyzed by means of L-motifs, which he considers a very promising attempt to discover syntagmatic relations of word lengths in a text. An L-motif is a “text segment which, beginning with the first word of the given text, consists of word lengths which are greater or equal to the left neighbor.” The main advantage of such segmentation is that it can be applied iteratively, i.e. L-motifs of the L-motifs can be obtained (so-called LL-motifs). Although applying the method several times may result in unintuitive sequences, these sequences follow lawful patterns and can be useful for practical applications, such as automatic text classification. However, even if L-motifs follow lawful patterns, this does not imply that L-motifs reflect syntagmatic relations, since these could be inherited from the word length distribution in a text. In order to prove that L-motifs reflect syntagmatic relations of the word lengths, the author tests the following null hypothesis: “the distribution of L-motifs measured on the text T is the same as the distribution of L-motifs measured on a pseudo text T’. The pseudo text T’ is created by the random transposition of all tokens of the text T within the text T.” (footnote 4). The author’s hypothesis is tested on three Czech texts and six Arabic texts. The null hypothesis is rejected for the L-motifs (all texts) and for LL-motifs (except one text), but it is not rejected for L-motifs of higher order (LLL-motifs, etc.) in Czech, although it is not rejected for LLL-motifs in Arabic (except one text). In conclusion, the experiment carried out by the author shows that L-motifs can be useful to examine syntagmatic relations in most cases.
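
The setup of the null hypothesis can be sketched as follows: build the pseudo-text T’ by randomly transposing the tokens of T, then compare the L-motif length distributions of the two. The code below only illustrates the mechanics with synthetic word lengths and a visual comparison; the chapter applies a proper statistical significance test.

```python
# Mechanics of the null-hypothesis test: L-motif length distribution of a
# "text" versus a shuffled pseudo-text. Word lengths here are synthetic.
import random
from collections import Counter

def l_motif_lengths(lengths):
    """Lengths of the L-motifs (maximal non-decreasing runs) in a sequence."""
    out, run = [], 1
    for prev, cur in zip(lengths, lengths[1:]):
        if cur >= prev:
            run += 1
        else:
            out.append(run)
            run = 1
    out.append(run)
    return out

random.seed(0)
text = [random.randint(1, 5) for _ in range(200)]  # stand-in word lengths
pseudo = text[:]
random.shuffle(pseudo)                             # the pseudo-text T'

print(Counter(l_motif_lengths(text)))
print(Counter(l_motif_lengths(pseudo)))
```

For a real text, a significant difference between the two distributions is what licenses the conclusion that L-motifs carry syntagmatic information rather than merely inheriting the word length distribution.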

10) In “Sequential structures in Dalimil’s Chronicle”, A. Pawłowski and M. Eder carry out a quantitative analysis of style variation. They focus on the difference between orality and literacy. The objective of their study is to investigate the phenomenon of “prosaisation”, put forward by Woronczak in 1963, by means of tests performed on a variety of sequential text structures in the Chronicle of Dalimil, the first chronicle written in the Czech language, at the beginning of the 14th century. The following data are analyzed: a series of chapter lengths (in letters, syllables and words); a series of verse lengths (in letters, syllables and words); alternations and correlations of rhyme pairs; quantity-based series of syllables (binary coding); stress-based series of syllables (binary coding). In their tests, they verify the presence of latent rhythmic patterns, and this partially confirms the hypothesis advanced by Woronczak. However, it also appears that the bare opposition of orality vs. literacy does not suffice to explain the quite complex stylistic shift in the Chronicle.

11) In “Comparative Evaluation of String Similarity Measures for Automatic Language Classification”, T. Rama and L. Borin present “the first known attempt to apply more than 20 different similarity (or distance) measures to the problem of genetic classification of languages on the basis of Swadesh-style core vocabulary lists” (p. 189). The Swadesh list is a compilation of basic concepts and is used in historical-comparative linguistics. The authors present experiments performed on the Automated Similarity Judgment Program (ASJP) database, which contains 40-item word lists for all the world’s languages. The authors examine the various measures in two respects, namely: (1) their capability of distinguishing related and unrelated languages and (2) their performance as measures for the internal classification of related languages. Results show that string similarity measures (i.e. sequence-based measures) do not contribute to improving internal classification, but they help in discriminating related languages from unrelated ones.
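
One of the simplest measures in this family, commonly used on ASJP word lists, is the Levenshtein distance normalized by the length of the longer word (often abbreviated LDN). The sketch below implements it; the example word pair is mine and merely illustrative.

```python
# Levenshtein distance normalized by the longer word's length (LDN),
# a basic sequence-based similarity measure for core-vocabulary word lists.

def levenshtein(a, b):
    """Edit distance between strings a and b (iterative two-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def ldn(a, b):
    return levenshtein(a, b) / max(len(a), len(b))

print(ldn("hand", "hant"))  # a cognate-like pair: small normalized distance
```

Averaging such distances over the 40 concepts of two word lists yields a language-pair distance, which can then feed a classification algorithm.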

12) In the chapter “Predicting Sales Trends. Can sentiment analysis on social media help?”, V. Rentoumi, A. Krithara, and N. Tzanos present a two-stage approach based on tweets’ sequential data for the prediction of sales trends in products. In this approach the sentiment values expressed through the tweets’ sequences are taken into account. The authors present experiments based on a structured model, namely Conditional Random Fields (CRF), and emphasize the benefits of their approach with respect to other approaches that are based on bag-of-words representations. A CRF is an undirected graphical model where joint probabilities for the existence of possible sequences given an observation are specified. The motivation for using a CRF for a sentiment analysis task such as predicting sales trends based on social network data is based “on the principle that the meaning a sentence can imply, is tightly bound to the ordering of its constituent words” (p. 205). The CRF model was trained using tweets derived from Sanders’ collection (a corpus of 2471 manually annotated tweets). The test set was a corpus of four subsets of tweets on four different topics, namely: iPad, Sony Xperia, Samsung Galaxy and Kindle Fire. The authors show that since the CRF exploits structural information concerning the tweets’ data, it can capture non-local dependencies that play an important role in the task of sales prediction, thus confirming the assumption that the representation of structural information, exploited by the CRF, simulates the semantic and sequential structure of the data.

13) In “Where Alice meets Little Prince”, A. Rovenchak describes a method to analyze texts using an approach inspired by a statistical-mechanical analogy. This method is applied to study translations of two novels: Alice in Wonderland and The Little Prince. The method is based on a model from which a set of parameters can be obtained to describe texts. These parameters are related to the grammar type, intended as the “analyticity level of a language” (p. 217). Results confirm that there exists a correlation between the level of language analyticity and the values of the parameters calculated using the proposed approach. More specifically, the presented study shows that, within the same language, “the dependence on a translator of a given text appears much weaker than the dependence on the text genre” (p. 228). To date, however, the exact attribution of a language with respect to parameter values has not been provided, since the influence of genre has not been studied in detail.

14) In “A Probabilistic Model for the Arc Length in Quantitative Linguistics”, P. Zörnig argues that the arc length measure is a good alternative to the usual statistical measures of variation. The author illustrates the formulae for the two most important characteristics of the random variable arc length, namely expectation and variance. Using the sequential representation of texts, in the form (x1 … xn), where xi represents the length of the i-th word of a text, he studies the sequences of 32 texts in 14 languages.
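
The arc length of a text in this representation is simply the summed Euclidean distance between consecutive points (i, xi) of the word-length sequence. The sketch below computes it for an invented sentence; the probabilistic treatment (expectation and variance of the arc length as a random variable) is the chapter's own contribution and is not reproduced here.

```python
# Arc length of a word-length sequence (x1, ..., xn): the summed Euclidean
# distance between consecutive points (i, xi), i.e. sum of sqrt(1 + diff^2).
import math

def arc_length(word_lengths):
    return sum(math.sqrt(1 + (b - a) ** 2)
               for a, b in zip(word_lengths, word_lengths[1:]))

# word lengths (in letters) of a short hypothetical sentence
print(round(arc_length([3, 5, 2, 4, 4, 1]), 3))
```

Unlike the variance, which ignores word order, the arc length is sensitive to the sequential arrangement of lengths, which is what makes it attractive as a measure of variation in this line of work.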

EVALUATION

This collection makes a good contribution to quantitative linguistics and to computational linguistics in general.

The volume presents an interesting set of models, laws and experiments that focus on linguistic sequences. Linguistic sequences are intended as linguistic units based on linear (or syntagmatic) sequences of symbols. The chapters in the book present several linguistic sequences and the different aspects that highlight their properties, the laws they are governed by, and their potential in linguistic and textual analysis. Some linguistic sequences are well-known, for example the type-token relation that accounts for the lexical richness of a text. Other linguistic sequences, such as ‘motifs’, are more recent and are introduced and explained in this volume.

Linguistic sequences and the laws they are governed by appear to have a high potential for many areas in computational linguistics and linguistic applications. For instance, I am thinking about the possible use of linguistic motifs for text classification, automatic genre identification, stylometry, authorship attribution and the like. Motifs seem to have two important qualities: on the one hand, they are easy to compute and extract automatically (I would call them easily-extractable, computationally-light features) and on the other hand, they are linguistically motivated (which is not always the case with light features such as character n-grams). Since their potential has not been explored to date, it is worth including them in the list of linguistic features that can be used in future experiments in text analysis and classification.

The idea of discovering mathematically-based language laws and systems of laws may also have good potential in computational modelling (although to date empirical studies based on these laws seem to be still limited). The concept of law is understood as “the class of law hypotheses which have been deduced from theoretical assumptions, are mathematically formulated, are interrelated with other laws in the field, and have sufficiently and successfully been tested on empirical data” (Wikipedia). As mentioned above, a law like the Menzerath-Altmann law states that the sizes of the constituents of a construction decrease with increasing size of the construction under study. For example, the longer a sentence (measured in terms of the number of clauses), the shorter the clauses (measured in terms of the number of words). Intuitively, this mathematically-based law could be exploited not only in quantitative linguistics, but also in related areas such as distributional semantics and machine learning for language technology.

Although the volume is certainly valuable, I personally missed a few elements that would have given me a more comprehensive view of the added value of the book. For instance, I missed a general overview describing the state of the art of Quantitative Linguistics (QL), its main purposes and motivations, and the importance of linguistic sequences and laws in this context. QL is considered to be a subfield of general linguistics and, more specifically, of mathematical linguistics. QL is related to Computational Linguistics, Corpus Linguistics and Applied Linguistics, and partially overlaps with these fields. Therefore, it would have been helpful to point out explicitly its specificity and its similarities to and differences from the neighboring fields.

I also missed an abstract at the beginning of each chapter delivering quick information about the aim, the motivation and the results of the study presented in the chapter itself. Since the volume is an edited collection, the presentation style of the different chapters varies a lot, and in some chapters the identification of purpose, motivation and results was not straightforward. An abstract would probably have helped gather these elements more quickly.

In conclusion, the volume is a good read not only for linguists working within QL, but also for computational linguists and language technologists who are interested in exploring and experimenting with new features and with language laws that could help model language applications.

REFERENCES

Köhler, R., Altmann, G., and Piotrowski, R. (eds.) (2005). Quantitative Linguistik / Quantitative Linguistics: Ein internationales Handbuch / An International Handbook. De Gruyter Mouton.

IQLA – International Quantitative Linguistics Association (http://www.iqla.org)

Journal of Quantitative Linguistics (http://www.tandfonline.com/toc/njql20/current)

Quantitative linguistics entry from Wikipedia (https://en.wikipedia.org/wiki/Quantitative_linguistics)

Pawłowski, Adam (1999). Language in the Line vs. Language in the Mass: On the Efficiency of Sequential Modelling in the Analysis of Rhythm. Journal of Quantitative Linguistics 6(1), 70-77.
