Book Review: Syntax-Based Collocation Extraction (2011)

Book Review by Marina Santini
Violeta Seretan, Syntax-Based Collocation Extraction. Springer, 2011, pp. 217

Broadly speaking, collocations are complex lexical items. More specifically, they are “canned phrases” characterized by selection constraints that we learn at an early age. Firth (1957: 181) was the first to identify this linguistic phenomenon and to define it. In the 1960s, the linguist Michael Halliday listed interesting examples. He pointed out that we tend to talk of “strong tea” rather than “powerful tea”, even though the two phrases make equal sense, while “rain” is much more likely to be described as “heavy” than as “strong” (Halliday 1966: 150). Simply put, collocations are meaningful “lexical chunks” that frequently appear together in more or less crystallized combinations and that we commit to memory, thus nurturing formulaic language (cf. Wray 2002). The concept of collocation makes it evident that language does not work with words taken in isolation, and that not all semantically acceptable combinations are lexically preferred. For this reason, the concept of collocations (and of other multi-word expressions) is crucial for advances in many fields of computational linguistics.
The volume Syntax-Based Collocation Extraction by Violeta Seretan is definitely a stimulating and valuable contribution to the field of automated collocation extraction. In the book, the author puts forward the claim that a syntax-based approach to collocation extraction is more reliable and more profitable than statistically-based approaches. This claim is supported by monolingual and multilingual corpus-based experiments. The corpora used in the experiments are processed by a multilingual parser, Fips, developed at the University of Geneva by Eric Wehrli, who supervised the author's research.

Syntax-Based Collocation Extraction is a monograph structured into six chapters and six appendices. The book is based on the author’s doctoral thesis (2008) at the University of Geneva.

Chapter 1 (“Introduction”) establishes that the concept of collocation, although there is no agreed-upon definition of it, is an important one, since collocations “are pervasive in texts of all genres and domains” (p. 1). The author then states the objective of the research described in the book, which is “to take advantage of recent advances achieved in syntactic parsing to propose a collocation extraction methodology that is more sensitive to the morpho-syntactic context in which collocations occur in the source corpora” (pp. 4-5). On page 3, she lists the advantages of the syntactic approach that justify a “methodological shift” from statistics to syntax-based methods. She argues that choosing candidate pairs by a syntactic proximity criterion, instead of a linear proximity criterion, substantially improves the quality of the extraction results.
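To make the contrast between linear and syntactic proximity concrete, here is a minimal sketch in Python. It is my own illustration, not the author's implementation: the function names, the `max_distance` parameter, the relation label and the hand-written dependency triple are all assumptions, and the triple merely stands in for what a parser might produce (it is not Fips output).

```python
def window_candidates(tokens, max_distance=4):
    """Linear proximity: pair each word with the words that follow it
    within max_distance positions (a syntax-free sliding window)."""
    pairs = []
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + max_distance]:
            pairs.append((left, right))
    return pairs


def syntactic_candidates(tokens, dependencies):
    """Syntactic proximity: pair words that are linked by a syntactic
    relation, however far apart they are in the sentence.
    `dependencies` holds hypothetical (head_index, dependent_index,
    relation) triples."""
    return [(tokens[h], tokens[d], rel) for h, d, rel in dependencies]


# "(the) issue that the committee has not yet addressed"
tokens = ["issue", "that", "the", "committee", "has", "not", "yet", "addressed"]

# Hand-written, hypothetical analysis: "issue" is the object of "addressed".
dependencies = [(7, 0, "verb-object")]

window_pairs = window_candidates(tokens, max_distance=4)
syntactic_pairs = syntactic_candidates(tokens, dependencies)
```

In this toy sentence the verb and its object are seven positions apart, so the window misses the pair while returning many unrelated ones (such as "that" and "the"); the syntactic criterion returns only the related pair.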

In Chapter 2 (“On Collocations”), the author presents a comprehensive overview of the definitions of collocations in the literature, but she “refrains” from providing her own working definition. She prefers to pin down the concept by distinguishing it from related concepts (e.g. collocations vs. co-occurrence), by listing compulsory features (e.g. collocations are lexical combinations that are pre-fabricated, arbitrary, unpredictable, recurrent, unrestricted in length), and by describing it in terms of syntagmatic lexical functions (pp. 26-27). Appendices A and B complement the chapter with a useful list of collocation dictionaries and a list of collocation definitions.

Chapter 3 (“Survey of Extraction Methods”) presents the state of the art in automatic collocation extraction, including a survey of extraction techniques, the language pre-processing methods used for extracting collocations, and previous work in the field. The chapter is complemented by Appendix C, which lists association measures.

Chapter 4 (“Syntax-Based Extraction”) is the heart of the book, since it explains the author's original contribution to the field. After presenting the Fips multilingual parser, the author describes the extraction method she proposes, reports on experiments and evaluation, carries out a qualitative analysis of the results, and finally discusses the findings. The extraction method has two main steps, namely candidate identification and candidate ranking. In the identification phase, plug-ins to Fips directly analyze the syntactic structures built by the parser. Candidate ranking relies on the log-likelihood ratio measure (pp. 68-69). The evaluation experiments compare the performance of the syntax-based collocation extraction method against a baseline represented by the (syntax-free) sliding window method. In both methods the same association measure, the log-likelihood ratio (pp. 36-42, 68-69), is used for the candidate ranking step (p. 73). The evaluation is performed with two experiments, both focussing on binary collocations, i.e. collocations made of two words. The first experiment is based on a monolingual corpus, the French part of the Hansard corpus of Canadian parliamentary proceedings (112 files). In this first experiment, the evaluation is performed by assessing the top 500 word pairs (bigrams) returned by each of the two methods. Each pair has been annotated by three trained linguists (two different teams of linguists, one for each method) with one of the following categories: erroneous pair, regular pair (grammatically correct but lexically uninteresting) and interesting pair (a multi-word expression that is worth storing in the lexicon). In this experiment, collocability is broadly intended as “unpredictable for non-native speakers, and therefore have to be stored in a lexicon” (p. 77).
The author concludes by stating that the method based on parsing performs better than the sliding window method in terms of grammaticality (i.e. regular pairs) and multi-word expressions (i.e. interesting pairs). In the second experiment, which relies on data from the Europarl parallel corpus in French, English, Spanish and Italian (62 parallel files), the annotation is performed by teams of two linguists with six categories, namely erroneous pair, regular pair, named entity, collocation, compound, and idiom. In total, 2000 pair types have been evaluated in this second experiment (p. 83). In the evaluation, three categories are used, namely grammatical (i.e. regular pairs), multi-word expressions (a precision measure based on named entities, collocations, compounds, and idioms) and collocability (a precision measure based on collocations). The author’s conclusion is that the results of experiment 2 are in line with those of experiment 1, that is, the method based on parsing performs better than the window-based method. The chapter also contains an interesting section with qualitative analysis and is complemented by Appendices D, E and F.
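For readers unfamiliar with the ranking step, the following Python sketch shows one common formulation of the log-likelihood ratio (the G² statistic computed over a 2x2 contingency table of pair counts) and how it can be used to order candidate pairs. This is my own illustration under stated assumptions, not the code used in the book: the function names and the toy candidate list are hypothetical, and the book's exact formulation may differ in detail.

```python
import math
from collections import Counter


def log_likelihood_ratio(o11, o12, o21, o22):
    """G2 statistic for a 2x2 contingency table.
    o11: count of (w1, w2); o12: (w1, other); o21: (other, w2);
    o22: (other, other)."""
    n = o11 + o12 + o21 + o22
    rows = (o11 + o12, o21 + o22)
    cols = (o11 + o21, o12 + o22)
    observed = ((o11, o12), (o21, o22))
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing to the sum
                expected = rows[i] * cols[j] / n
                total += o * math.log(o / expected)
    return 2.0 * total


def rank_candidates(pairs):
    """Rank candidate pair types by decreasing G2 score.
    `pairs` is a list of (w1, w2) tokens, one per extracted occurrence."""
    pair_freq = Counter(pairs)
    first_freq = Counter(w1 for w1, _ in pairs)
    second_freq = Counter(w2 for _, w2 in pairs)
    n = len(pairs)
    scored = []
    for (w1, w2), o11 in pair_freq.items():
        o12 = first_freq[w1] - o11
        o21 = second_freq[w2] - o11
        o22 = n - o11 - o12 - o21
        scored.append(((w1, w2), log_likelihood_ratio(o11, o12, o21, o22)))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical candidate occurrences, for illustration only.
candidates = (
    [("heavy", "rain")] * 6
    + [("strong", "tea")] * 5
    + [("heavy", "traffic")] * 2
    + [("green", "tea")] * 2
    + [("green", "traffic")]
)
ranking = rank_candidates(candidates)
```

Because the same measure is applied in both methods, any difference between the two ranked lists comes from the candidate identification step, which is precisely the comparison the experiments are designed to make.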

Chapter 5 (“Extensions”) extends the proposed extraction methodology in three directions: extraction of complex collocations (i.e. n-grams of more than two words), data-driven induction of syntactic patterns, and corpus-based collocation translation.

Chapter 6 (“Conclusion”) summarizes the main findings and points out directions for future research, which include a more extensive evaluation of the recall of extraction methods, a conclusive comparison with SketchEngine (Kilgarriff et al., 2004), a state-of-the-art extractor based on shallow parsing (a preliminary comparison was presented in Ch. 4), further investigation of the effect of syntactic ambiguity, and more.

Syntax-Based Collocation Extraction opens a stimulating discussion (“syntax or statistics?”) and inspires future experiments for which it sets a baseline for binary collocations, or bigrams (see note 2 on page 103), in four languages. The volume is rich in detail, references, analyses and discussions. One brave stance of the author is to advocate for the use of deep parsing and for the benefits of syntactic information for the identification of complex linguistic phenomena. A nice feature of the proposed method is the reduction of the number of candidates (e.g. see Table 4.5 on page 76).
Although detailed in many respects, the book contains some oversights (cf. also Pecina, 2011). For instance, the author does not tell us what kind of significance test she has used to assess the value of the differences in the results for experiment 1 (pp. 79-81) and experiment 2 (pp. 85-88). Additionally, the author does not report the accuracy of the parser. This variable will remain unknown if experiments with the same corpora but a different parser and a different approach are carried out in the future. Without this information, we do not know whether Fips performs better or worse than other multilingual parsers, or what impact the parser’s performance has on the extraction results. Since “the parsing field is making steady progress” (p. 125), we need this data to evaluate the performance of the different approaches. I am also a bit puzzled by the following passage: “the internal representation of syntactic structure built by the parser was used directly. The benefit for the identification process is twofold: the parse information is readily available, and it contains all the rich and complex details provided by the current analysis” (p. 65). Does it mean that the parser output is less accurate than the internal processing? If so, why?
Some choices are debatable. For example, instead of asking the annotators to evaluate the top pairs of the extracted results, it might have been more far-sighted (but of course much more time-consuming) to ask them to identify collocations in (a part of) the corpora. We would then have a small gold standard, or a more long-lasting standoff annotation, that would be useful for benchmarking future experiments. The lists of annotated pairs in Appendices D and E are indeed interesting, but more restricted in scope and use. As for the window-based approach chosen for comparison, its results seem to be lower than those obtained with parsing. This is exciting for parser-supporters like me. However, it might be the case that another statistical method would perform much better with these corpora. So I would be more cautious about the assessment of the results and would make it more relative to the experimental setting and the research design chosen by the author.

These shortcomings do not negate the overall value of the research, the rich documentation and the inspirational approach. The author of the book has shown that it is possible to apply deep parsing to fairly large quantities of running texts with promising results.
The book is recommended reading for all those working in the areas of NLP, NLG, computational linguistics, information retrieval and information extraction.

Firth, J. R. 1957. Papers in Linguistics 1934-1951. Oxford University Press.
Halliday, M.A.K. 1966. Lexis as a linguistic level. In C. E. Bazell, J.C. Catford, M.A.K. Halliday, and R.H. Robins (eds.), In memory of J.R. Firth, pp. 148-162. Longmans.
Kilgarriff, A., Rychly, P., Smrz, P., Tugwell, D. 2004. The SketchEngine. Proceedings of the 11th EURALEX, Lorient, France, pp. 105-116.
Pecina, P. 2011. Book Reviews, Syntax-Based Collocation Extraction, Computational Linguistics, Vol. 37, Number 3.
Wray, A. 2002. Formulaic language and the lexicon. Cambridge University Press.

