Review of Corpus Linguistics and the Web (also available in Corpora, Volume 4, pp. 209-211, November 2009)
Reviewer: Marina Santini
Editors: Marianne Hundt, Nadja Nesselhauf and Carolin Biewer
Title: Corpus Linguistics and the Web
Corpus Linguistics and the Web is an edited collection of articles, some originating from papers presented at the symposium “Corpus Linguistics – Perspectives for the Future” and others commissioned from leading corpus linguists. The book is a comprehensive, insightful and well-structured compendium of the advantages and disadvantages of using web data for linguistic description and corpus compilation. The main message conveyed by the book as a whole is that traditional corpora and web data can complement each other. The book is a good resource for corpus linguists who find traditional corpora too small or not representative. It can also be useful for computational linguists and information scientists interested in linguistic and textual features.
The book opens with a short introduction by the three editors (Hundt, Nesselhauf and Biewer), in which the main issues and perspectives are summarized. It is then divided into four parts, each containing a variable number of articles.
The first part, “Accessing the web as corpus”, describes the pitfalls and benefits of using data from the web. On the one hand, shortcomings such as the lack of reproducibility and the absence of metadata (Lüdeling et al.; Fletcher) must be kept in mind when assessing findings from web data. On the other hand, the richness and freshness of web material (Fletcher; Renouf et al.) seem to outweigh the downsides and encourage the development of Web as Corpus applications or initiatives, e.g. WaCky, WebKWIC and WebCorp. One major drawback of the Web as Corpus approach is its reliance on commercial search engines (like Google), which have only a very rough linguistic sensibility and select the relevant pages for a search using opaque criteria, so that the returned results require tedious refinement.
The second part, “Compiling corpora from the internet”, focuses on the construction of corpora from the web, a textual reservoir unrivalled in size and in its range of new genres and registers. More specifically, Hoffmann takes advantage of the abundance of publicly available CNN transcripts to create a specialized corpus of spoken English. Claridge builds a corpus of message boards and examines how interaction and stance markers are distributed in this genre of computer-mediated communication. Finally, Biber and Kurjian assemble a corpus using material from two Google topical directories (Home and Science), and analyse and interpret it using the multi-dimensional approach.
The third part, “Critical voices”, contains wise admonitions against a use of web data that forgets the achievements of traditional corpus linguistics to date. In particular, Leech underlines the importance of criteria such as representativeness, balance and comparability, which at present characterize neither the web as corpus approach nor the corpora built with web material (with some exceptions). He acknowledges that “while the internet is an added resource of immense potential, it does not remove the need to improve and update other textual resources, and does not render obsolete the corpus compiled according to design and systematic sampling” (p. 145). Following this line of thinking, Kennedy analyses collocations of verbs and amplifiers in the BNC and suggests that the richness of the data in this important corpus is still somewhat underexplored, both for the description of English and for exploring the nature of language learning and teaching. Additionally, he argues, research on traditional corpora is the necessary first step that can help assess subsequent research based on web data.
The fourth and final part, “Language variation and change”, describes how web data have been profitably used for a number of “recalcitrant” linguistic investigations that could not have been carried out otherwise. The authors contributing to this part unanimously emphasize that caution is an absolute requirement when dealing with web data, but that the web nonetheless remains an invaluable source of insights into language. More specifically, Rosenbach focuses on the impact of the animacy of the modifier on the choice between s-genitives and noun+noun constructions, and comments on the advantages and disadvantages of using Google and WebCorp for analysing the use of these constructions. Rohdenburg and Mondorf use web data to test linguistic hypotheses and to underpin psycholinguistic evidence, respectively. Web data can also be invaluable for analysing geographic variation and diatopic alternatives (Hundt and Biewer; Anderwald). Finally, the web can also supply data for diachronic and variational analyses, which supplement the data found in closed traditional corpora (Mair; Nesselhauf).
Overall, Corpus Linguistics and the Web helps readers orientate themselves and suggests how to handle web data. Although it is mostly based on research findings and achievements dating back to 2004 and 2005 (p. 4) and is limited to investigations of the English language, the book is a stimulating read that consolidates the research carried out by Web as Corpus and Web for Corpus practitioners. In other words, the book summarizes the state of the art and shows that corpus linguists can indeed use web data, though cautiously, especially when the web is the only possible source of material for studying certain phenomena (as in the case study described by Rosenbach). In conclusion, a paraphrase of the concluding sentence of Leech’s article (p. 145, quoted above) can express the gist of the book: while the web does not remove the need to improve and update other textual resources, and does not render obsolete the corpus compiled according to design and systematic sampling, it definitely represents an added resource of immense potential.