I would like to recommend “Building and Using Comparable Corpora” (edited by S. Sharoff, R. Rapp, P. Zweigenbaum and P. Fung) to those who are working with or are interested in multilingual and monolingual comparable corpora. The volume is an edited collection of articles covering many topics related to the compilation, measurement and use of comparable corpora. It is divided into two parts and includes 17 articles.
I found this volume useful and inspiring for my research. It is comprehensive and still up to date, although it collects extended papers from a BUCC (Building and Using Comparable Corpora) workshop held in 2011, or articles written between 2011 and 2012.
The book starts with an informative overview (article 1), which lays out the issues neatly and defines the different types of corpora. Several articles (namely articles 2, 3, 5 and 8) focus on how to design or build comparable corpora from the web. These articles offer a wide range of approaches tailored to different languages and different purposes. Other articles describe how to exploit multilingual comparable corpora for dictionary making (article 3), lexicon extraction (article 7), and the automatic extraction of multilingual chunks, such as parallel phrases (article 10), subsentential fragments (article 11) and name translation pairs (article 13); the identification and extraction of monolingual medical paraphrases is covered as well. Comparable corpora are useful not only for language technology, but also in the daily work of professional translators (article 16) and for contrastive linguistic analysis (article 17). The volume offers a range of practical approaches and also includes articles addressing theoretical issues. For instance, the notion of comparability is analyzed (article 4) and methods to measure the distance between multilingual comparable corpora are proposed (article 6). The evaluation of comparable corpora is not left out, and some articles describe approaches and results (e.g. article 5, section 4).
The volume brings the potential of the “comparable corpora” concept into full play and covers many of the topics that have been investigated in depth in subsequent BUCC workshops (the list of all BUCC workshops is available here: https://comparable.limsi.fr/bucc2017/).
With the benefit of hindsight, it is easy to say that the notion of comparable corpora was visionary, far-sighted and productive. It is also easy to say that this volume remains the optimal starting point for any research or application in Language Technology that leverages comparable corpora.
26 Feb 2017