A brief history of word embeddings (and some clarifications)
One of the strongest trends in Natural Language Processing (NLP) at the moment is the use of word embeddings, which are vectors whose relative similarities correlate with semantic similarity. Such vectors are used both as an end in themselves (for computing similarities between terms), and as a representational basis for downstream NLP tasks like text classification, document clustering, part-of-speech tagging, named entity recognition, sentiment analysis, and so on. That this is a trend is obvious when looking at the proceedings from the recent large conferences in NLP, e.g. ACL or EMNLP. For the first time (ever), semantics was the dominant subject at EMNLP (“Empirical Methods in NLP”) this year. In fact, some people even suggested the conference be renamed “Embedding Methods in NLP”, due to the large number of papers covering various types of methods, applications and evaluations for word embeddings.
This is a positive trend (if you like semantics and embeddings), and there is a lot of progress being made in NLP currently. However, many recent publications (and talks) on word embeddings are surprisingly oblivious of the large body of previous work in fields like computer science, cognitive science, and computational linguistics. Apart from this being bad academic manners, it also risks delaying progress by reinventing the wheel rather than building on existing knowledge.
The point of this post is to provide a brief history of word embeddings, and to summarize the current state of the art. The usual caveat applies: this is by no means meant as a complete list of all relevant research on the subject; on the contrary, it is intended as a reference and starting point for those interested in exploring the field further.
First, a note on terminology. Word embedding seems to be the dominant term at the moment, no doubt because of the current popularity of methods coming from the deep learning community. In computational linguistics, we often prefer the term distributional semantic model (since the underlying semantic theory is called distributional semantics). There are also many alternative terms in use, from the very general distributed representation to the more specific semantic vector space or simply word space. For consistency, I will adhere to current practice and use the term word embedding in this post.
Word embeddings are based on the idea that contextual information alone constitutes a viable representation of linguistic items, in stark contrast to formal linguistics and the Chomsky tradition. This idea has its theoretical roots in structuralist linguistics and ordinary language philosophy, and in particular in the works of Zellig Harris, John Firth, and Ludwig Wittgenstein, all publishing important works in the 1950s (in the case of Wittgenstein, posthumously). The earliest attempts at using feature representations to quantify (semantic) similarity used hand-crafted features. Charles Osgood’s semantic differentials in the 1960s are a good example, and similar representations were also used in early works on connectionism and artificial intelligence in the 1980s.
Methods for using automatically generated contextual features were developed more or less simultaneously around 1990 in several different research areas. One of the most influential early models was Latent Semantic Analysis/Indexing (LSA/LSI), developed in the context of information retrieval, and the precursor of today’s topic models. At roughly the same time, several different models that used contextual representations were developed in research on artificial neural networks. The most well-known of these are probably Self-Organizing Maps (SOM) and Simple Recurrent Networks (SRN), of which the latter is the precursor to today’s neural language models. In computational linguistics, Hinrich Schütze developed models based on word co-occurrences, an approach that was also used in the Hyperspace Analogue to Language (HAL), which was developed as a model of semantic memory in cognitive science.
Later developments are basically only refinements of these early models. Topic models are refinements of LSA, and include methods like probabilistic LSA (PLSA) and Latent Dirichlet Allocation (LDA). Neural language models are based on the same application of neural networks as SRN, and include architectures like Convolutional Neural Networks (CNN) and Autoencoders. Distributional semantic models are often based on the same type of representation as HAL, and include models like Random Indexing and BEAGLE.
The main difference between these various models is the type of contextual information they use. LSA and topic models use documents as contexts, which is a legacy from their roots in information retrieval. Neural language models and distributional semantic models instead use words as contexts, which is arguably more natural from a linguistic and cognitive perspective. These different contextual representations capture different types of semantic similarity; the document-based models capture semantic relatedness (e.g. “boat” – “water”) while the word-based models capture semantic similarity (e.g. “boat” – “ship”). This very basic difference is too often misunderstood.
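To make the distinction concrete, here is a minimal sketch of the two context types. The toy corpus is contrived and entirely my own (it is not taken from any of the models mentioned), the function names are hypothetical, and the raw counts stand in for what a real model would weight and reduce further. The window size of 1 is likewise just an assumption chosen to keep the example small.

```python
from collections import Counter, defaultdict
from math import sqrt

# Contrived toy corpus (an assumption for illustration); each string is one document.
docs = [
    "fishermen row boat over calm water",
    "sailors steer ship over deep ocean",
    "fishermen steer boat through calm water",
    "sailors row ship through deep ocean",
]
tokenized = [d.split() for d in docs]

def document_contexts(tokenized):
    """Represent each word by the documents it occurs in (term-document counts)."""
    vectors = defaultdict(Counter)
    for doc_id, words in enumerate(tokenized):
        for w in words:
            vectors[w][doc_id] += 1
    return vectors

def word_contexts(tokenized, window=1):
    """Represent each word by the words found within `window` positions of it."""
    vectors = defaultdict(Counter)
    for words in tokenized:
        for i, w in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[w][words[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items())
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

by_doc = document_contexts(tokenized)
by_word = word_contexts(tokenized)

# Document contexts: "boat" and "water" share documents, "boat" and "ship" do not.
print(cosine(by_doc["boat"], by_doc["water"]), cosine(by_doc["boat"], by_doc["ship"]))
# Word contexts: "boat" and "ship" share neighbours, "boat" and "water" do not.
print(cosine(by_word["boat"], by_word["ship"]), cosine(by_word["boat"], by_word["water"]))
```

On this corpus the document-based vectors put “boat” next to “water” (relatedness), while the word-based vectors put “boat” next to “ship” (similarity). Real data is of course never this clean, but the tendency is the same.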
Speaking of common misunderstandings, there are two other myths that need debunking:
- There is no need for deep neural networks in order to build good word embeddings. In fact, two of the most successful and acknowledged recent models – the Skipgram and CBoW models included in the word2vec library – are shallow neural networks of the same flavor as the original SRN.
- There is no qualitative difference between (current) predictive neural network models and count-based distributional semantics models. Rather, they are different computational means to arrive at the same type of semantic model; several recent papers have demonstrated both theoretically and empirically the correspondence between these different types of models [Levy and Goldberg (2014), Pennington et al. (2014), Österlund et al. (2015)].
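As one concrete instance of this correspondence, and assuming the Levy and Goldberg (2014) reference is their analysis of Skipgram trained with negative sampling, the result can be stated roughly as follows. With $k$ negative samples and sufficiently high dimensionality, the optimum of the Skipgram objective is reached when the dot product of a word vector $\vec{w}$ and a context vector $\vec{c}$ equals the pointwise mutual information (PMI) of the pair, shifted by a constant:

```latex
\vec{w} \cdot \vec{c} \;=\; \mathrm{PMI}(w, c) - \log k,
\qquad
\mathrm{PMI}(w, c) \;=\; \log \frac{P(w, c)}{P(w)\,P(c)}
```

Here $P(w, c)$, $P(w)$ and $P(c)$ are the empirical probabilities estimated from co-occurrence counts. In other words, the predictive model implicitly factorizes the same kind of word-by-context matrix that a count-based model builds explicitly.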
So that’s a very brief history and a couple of clarifications. What about the current state of the art? The boring answer is that it depends on what task you want to solve, and how much effort you want to spend on optimizing the model. The somewhat more enjoyable answer is that you will probably do fine whichever current method you choose, since they are more or less equivalent. A good bet is to use a factorized model, either using explicit factorization of a distributional semantic model (available in e.g. the PyDSM Python library or the GloVe implementation) or using a neural network model like those implemented in word2vec, since they produce state-of-the-art results (Österlund et al., 2015) and are robust across a wide range of semantic tasks (Schnabel et al., 2015).
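For readers who want to see what explicit factorization can look like, here is a minimal sketch in plain NumPy. It is not the API of PyDSM, GloVe or word2vec; the function names `ppmi` and `factorize`, the choice of positive PMI weighting followed by truncated SVD, and the co-occurrence counts are all assumptions made for illustration.

```python
import numpy as np

def ppmi(counts):
    """Positive pointwise mutual information of a word-by-context count matrix."""
    p_wc = counts / counts.sum()
    p_w = p_wc.sum(axis=1, keepdims=True)   # marginal probability of each word
    p_c = p_wc.sum(axis=0, keepdims=True)   # marginal probability of each context
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0            # zero counts give -inf or nan; treat as no association
    return np.maximum(pmi, 0.0)             # keep only positive associations

def factorize(matrix, dim):
    """Truncated SVD: keep the `dim` largest singular values as dense word vectors."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :dim] * s[:dim]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical co-occurrence counts: rows are words, columns are context words.
words = ["boat", "ship", "water"]
counts = np.array([
    [4.0, 3.0, 0.0, 1.0],
    [3.0, 4.0, 0.0, 1.0],
    [0.0, 1.0, 5.0, 2.0],
])

vectors = factorize(ppmi(counts), dim=2)    # one 2-dimensional vector per word
print(cosine(vectors[0], vectors[1]))       # boat vs. ship
print(cosine(vectors[0], vectors[2]))       # boat vs. water
```

With these made-up counts, “boat” ends up closer to “ship” than to “water”. On a real corpus the matrix would be large and sparse, so a sparse truncated SVD would be the more sensible choice than the dense decomposition used here.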
For those interested in playing around with word embeddings, I recommend the following libraries: