PhD thesis reviewed by Marina Santini
The Word-Space Model by Magnus Sahlgren, Doctoral Thesis in Computational Linguistics at Stockholm University, Sweden 2006
Available online at: http://www.sics.se/~mange/TheWordSpaceModel.pdf
Contents and Research Questions
The PhD thesis The Word-Space Model by Magnus Sahlgren contains 16 chapters: an Introduction and 15 chapters distributed over four parts. Part I (Chapters 2-4) presents the theoretical background; Part II (Chapters 5-7) contains the theoretical foreground and is Sahlgren’s main original contribution; Part III (Chapters 8-15) describes the experiments; and Part IV (Chapter 16) summarizes the research and draws conclusions. Most chapters start with a quotation. Most quotations are from The Simpsons.
The main research question around which the thesis is constructed and structured is: what kind of semantic information does the word-space model acquire and represent? The answer is derived through the identification and discussion of the three main theoretical elements of the word-space model:
- the geometric metaphor of meaning
- the distributional methodology
- the structuralist theory of meaning
Sahlgren argues that the word-space model acquires and represents two different types of relations between words, namely syntagmatic relations or paradigmatic relations, depending on how the distributional patterns of words are collected to build the word spaces. These two types of relations were introduced by the Swiss linguist Ferdinand de Saussure (1857–1913) — the father of structuralism.
For Saussure, syntagmatic relations are relations on a horizontal axis between the elements of a sentence, while paradigmatic relations are relations on a vertical axis and concern all the possible elements that could take the place of a certain element. Sahlgren reinterprets these relations in terms of syntagmatic and paradigmatic uses of contexts based on co-occurrence matrices. These matrices are employed to build syntagmatic and paradigmatic word spaces with different properties. In the thesis, the difference between syntagmatic and paradigmatic word spaces is empirically demonstrated in a number of experiments.
Chapter 1 motivates the need for modelling meaning and the choice of the vector-space model to meet such a need. It explains why the vector-space model, as it stood, is not completely satisfactory and which simplifying assumptions limit the research. The core questions that guide the research are (p. 13):
- What kind of semantic information is captured in the word-space model?
- Does the vector-space model constitute a complete model of the full spectrum of meaning, or does it only convey specific aspects of meaning?
- Is it possible to extract semantic knowledge by merely looking at usage data?
Chapters 2, 3 and 4 explain the word-space model in detail. Chapter 2 illustrates the “similarity-is-proximity metaphor” and presents the distributional hypothesis; Chapter 3 explains “context vectors”; finally in Chapter 4 we are presented with the main problems that affect the model — i.e. high dimensionality and sparseness — and three different implementations of the word-space model, namely LSA, HAL and RI.
Chapter 5 asks crucial questions such as: how can we determine whether a word space is good? how can we evaluate a word space? do we even know what it is we should evaluate? It is noted that current evaluation schemes — such as bilingual lexicon acquisition, query expansion, text categorization, information retrieval, synonymy tests, word-sense disambiguation, lexical priming data, knowledge assessment — do measure if a given word space is a viable representation of meaning to some extent, but none of them confirms or disconfirms that a word space is a satisfying representation of all aspects of meaning.
Importantly, the author notes that the distributional hypothesis is not a theory of meaning. It is instead a discovery procedure for unveiling meaning similarities, and it does not make any ontological commitments about the nature of meaning (p. 55).
Chapter 6 presents the main theoretical contribution of the research: the distributional hypothesis revisited in terms of paradigmatic and syntagmatic relations. This “Saussurian refinement” affects the reformulation of the distributional hypothesis: “A word-space model accumulated from co-occurrence information contains syntagmatic relations between words, while a word-space model accumulated from information about shared neighbours contains paradigmatic relations between words.” (p.61).
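To make the reformulated hypothesis concrete, here is a minimal sketch in Python. It is not Sahlgren's implementation: the toy corpus, the function names and the choice of cosine similarity are all assumptions made for illustration. A words-by-documents matrix stands in for the syntagmatic use of context (co-occurrence), and a words-by-words matrix stands in for the paradigmatic use (shared neighbours).

```python
from collections import Counter, defaultdict
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(w * v.get(k, 0) for k, w in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def syntagmatic_space(docs):
    """Words-by-documents counts: words are similar if they co-occur."""
    space = defaultdict(Counter)
    for doc_id, doc in enumerate(docs):
        for word in doc:
            space[word][doc_id] += 1
    return space


def paradigmatic_space(docs, window=1):
    """Words-by-words counts: words are similar if they share neighbours."""
    space = defaultdict(Counter)
    for doc in docs:
        for i, word in enumerate(doc):
            lo, hi = max(0, i - window), min(len(doc), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    space[word][doc[j]] += 1
    return space


# Hypothetical two-document corpus, invented for this example.
docs = [["she", "drinks", "coffee"], ["he", "drinks", "tea"]]
syn = syntagmatic_space(docs)
par = paradigmatic_space(docs, window=1)

print(cosine(syn["coffee"], syn["tea"]))  # 0.0: never in the same document
print(cosine(par["coffee"], par["tea"]))  # 1.0: identical neighbours
```

In this toy case "coffee" and "tea" never co-occur, so they are unrelated in the syntagmatic space, yet they share the neighbour "drinks" and are therefore maximally similar in the paradigmatic space. "coffee" and "drinks" show the opposite pattern.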
In Chapter 7, the author points out that the reformulation of the distributional hypothesis makes clear that the semantic properties of the word space are determined by the choice of context. Sahlgren defines a syntagmatic use of context and a paradigmatic use of context. The syntagmatic use of context is mainly characterized by the size of the context region within which co-occurrences are counted (p. 64). The paradigmatic use of context is characterized by at least three elements (pp. 66-67), namely:
1. The size of the context region within which paradigmatic information is collected.
2. The position of the words within the context region.
3. The direction in which the context region is extended (preceding or succeeding neighbours).
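The three parameters listed above can be sketched as arguments of a small context-extraction function. This is only an illustration under stated assumptions: the function name `context_features`, its parameters, and the way position is encoded (as a relative offset attached to each neighbour) are hypothetical and not taken from the thesis.

```python
from collections import Counter


def context_features(tokens, index, size=2, direction="both", positional=False):
    """Collect the neighbours of tokens[index] as a sparse feature vector.

    size       -- how many words on each side count as context
    direction  -- "left" (preceding), "right" (succeeding) or "both"
    positional -- if True, keep the relative position of each neighbour
    """
    if direction not in ("left", "right", "both"):
        raise ValueError("direction must be 'left', 'right' or 'both'")
    features = Counter()
    for offset in range(-size, size + 1):
        if offset == 0:
            continue
        if direction == "left" and offset > 0:
            continue
        if direction == "right" and offset < 0:
            continue
        j = index + offset
        if 0 <= j < len(tokens):
            key = (tokens[j], offset) if positional else tokens[j]
            features[key] += 1
    return features


tokens = "the cat sat on the mat".split()
print(context_features(tokens, 2, size=2))
print(context_features(tokens, 2, size=2, direction="left"))
print(context_features(tokens, 2, size=1, positional=True))
```

Varying `size`, `direction` and `positional` for the same focus word ("sat") yields different feature vectors, which is the sense in which the choice of context shapes the resulting word space.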
Chapter 8 describes the experimental setup, i.e. data, pre-processing, frequency threshold, transformation of frequency counts, weighting of context windows, word-space implementation, software, tests, and evaluation metrics. I summarize these details in the table below:
|Setup element|Details|
|---|---|
|Data|TASA (Touchstone Applied Science Associates); occasionally BNC (British National Corpus) (p. 75)|
|Pre-processing|Morphological normalization (p. 76)|
|Frequency thresholding|As appropriate (p. 76)|
|Transformation|Binary, dampened, tfidf, raw (as appropriate) (p. 77)|
|Weighting|Constant and aggressive (as appropriate) (p. 78)|
|Word-space implementation|Unreduced context vectors from the words-by-documents and words-by-words matrices (p. 79)|
|Software|GSDM (Guile Sparse Distributed Memory) (p. 80)|
|Tests|Six tests: 1) direct comparison of the word spaces, 2) thesaurus comparison, 3) association test, 4) synonym test, 5) antonym test, 6) part-of-speech test|
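The transformation and weighting options in the table can be illustrated with a short sketch. The formulas below are common textbook variants and are assumptions on my part: this review does not quote the exact definitions used in the thesis, so the logarithmic dampening, the tf-idf variant and the `2 ** (1 - distance)` weighting should be read as plausible stand-ins, not as Sahlgren's settings.

```python
from math import log


def transform(freq, scheme="raw", n_docs=1, doc_freq=1):
    """Transform a raw co-occurrence count (formulas are assumed variants)."""
    if freq <= 0:
        return 0.0
    if scheme == "raw":
        return float(freq)
    if scheme == "binary":
        return 1.0                            # presence or absence only
    if scheme == "dampened":
        return 1.0 + log(freq)                # assumed logarithmic dampening
    if scheme == "tfidf":
        return freq * log(n_docs / doc_freq)  # assumed tf-idf variant
    raise ValueError(f"unknown scheme: {scheme}")


def window_weight(distance, scheme="constant"):
    """Weight a neighbour by its distance (>= 1) from the focus word."""
    if scheme == "constant":
        return 1.0                      # every position counts equally
    if scheme == "aggressive":
        return 2.0 ** (1 - distance)    # assumed: halves with each step away
    raise ValueError(f"unknown scheme: {scheme}")


for d in (1, 2, 3):
    print(d, window_weight(d, "constant"), window_weight(d, "aggressive"))
```

Under these assumptions, constant weighting treats all neighbours in the window alike, while aggressive weighting lets the closest neighbours dominate the context vector.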
Experiments (i.e. the six tests) are then described in detail in Chapters 9, 10, 11, 12, 13, and 14.
The analysis of the experiments is presented in Chapter 15, where the author discusses whether the revisited distributional hypothesis is verified successfully. The experiments carried out demonstrate that syntagmatic and paradigmatic word spaces produce consistently different results on a number of semantic tests. These differences support the hypothesis that syntagmatic and paradigmatic uses of context yield word spaces with inherently different semantic properties. What is more, the difference between syntagmatic and paradigmatic word spaces is not discrete. This means that it is not a question whether a word space contains syntagmatic OR paradigmatic relations only, but rather to what extent a word space contains these relations. Essentially, the difference between word spaces produced with different types of contexts is more like a semantic continuum, where syntagmatic and paradigmatic relations represent the extremities (p. 127).
In Chapter 16, the author summarises the results and lists the contributions. The research described in the thesis has provided the following answers to the research questions:
- the type of semantic information acquired and represented by the word-space model is syntagmatic or paradigmatic information, depending on how the context is used;
- the word-space model constitutes a complete model of meaning, if by “meaning” we refer to a structuralist dichotomy of syntagma and paradigm. In other words, the answer to this research question depends on the meaning theory underlying the research;
- the question whether it is possible to extract semantic knowledge by merely looking at usage data has a positive answer.
Sahlgren’s thesis is well-written, clear, interesting, and captivating. It is good reading for students, researchers and developers dealing with computational lexical semantics and other computational studies of language.
An important finding in Sahlgren’s research is, in my opinion, the unveiled proficiency of syntagmatic relations. Results show that for some tests, notably the association test, syntagmatic contexts perform better than paradigmatic contexts. Since syntagmatic contexts have a reduced dimensionality (pp. 69-70), this finding might have an important practical benefit: the dimensionality-reduction step can be skipped, thus streamlining application pipelines. This finding weakens some previous general claims that paradigmatic contexts capture semantic similarity better than syntagmatic ones (p. 126), and might have a positive practical effect on the efficiency of real-world applications.
What I find unsettling in the model is the volatility of the acquired “meaning” (wouldn’t “word senses” be a less engaging and more appropriate expression?). The author says: “The word-space model acquires meanings by virtue of (or perhaps despite) being based entirely on noisy, vague, ambiguous and possibly incomplete language data.” (p. 11). This means that since the revisited word-space model is grounded on usage data, it only represents what is in the current data. When meanings change or disappear in the data, the model changes accordingly. If an entirely different set of data is used, an entirely different model of meaning is built. This is a wavering scenario, and I wonder how instabilities and inconsistencies generated by the variability of data are handled in practical terms. Above all, my concern is whether and how it is possible to build stable and re-usable knowledge with this model.
From Research To Development?
Since Sahlgren’s PhD research, like any doctoral research, rests on a number of simplifying assumptions, I was wondering how much of the revisited word-space model entered Ethersource produced by Gavagai*, where Sahlgren currently works. Ethersource is a dynamic text analytics system that “reads a lot of text and retrieves actionable intelligence from dynamic data”. I read some posts on the Gavagai blog (e.g. Hyperdimensionality, semantic singularity, and concentration of distances; The Advantage of Ethersource on the TOEFL Synonym Test Compared to other Methods; The difference between Ethersource and those other models; New words in New Text), but I found little information about the underlying model, and I could not make out in what way, if it is used at all, the model has been changed to cope with multilinguality, word order, compositionality and new words. If Magnus has time to tell us more about this, it would be a great pleasure for us to know that his research has been included in the development of new products.
* Gavagai was formed in 2008 by Magnus Sahlgren and Jussi Karlgren. Gavagai (Sweden) is a spin-off of SICS (Swedish Institute of Computer Science), where Sahlgren carried out his thesis and worked for a number of years.
Related Article (in Swedish)