This is our blog. We write about cool experiments, interesting case studies, open talks and technicalities.
Grab the RSS feed


The Advantage of Ethersource on the TOEFL Synonym Test Compared to other Methods

  • This post compares the performance of various semantic algorithms
  • Ethersource solves a synonym test with 62% correct answers, while the best runner-up only reaches 52%
  • The results demonstrate the advantage of Ethersource over other relevant methods

As part of our internal system performance monitoring, we continuously evaluate Ethersource using a number of standardized benchmark tests. One such test is the synonym part of the TOEFL (Test of English as a Foreign Language). This multiple-choice vocabulary test measures the ability of the subject (in our case, Ethersource) to identify which of four alternatives is the correct synonym to a given target word.

We use the synonym part of the TOEFL as a performance benchmark for several reasons. The first is that a synonym test is a relevant test for a system that claims to know about meaning. At Gavagai, we believe in putting our money where our mouth is; if you claim that your system extracts meaning from text, you should be able to demonstrate this in a scientific test that measures meaning (such as, e.g., a standardized synonym test). Furthermore, the synonym part of the TOEFL has been used extensively in the scientific literature, so there is an abundance of published results to compare with. Lastly, the TOEFL test is normally administered to human test subjects, so you can actually compare the performance of your system to that of humans (which is nice, if you aim at intelligence).

Since Ethersource learns from the data it sees (in technical terms, we call it an unsupervised system), we benchmark its performance in relation to other unsupervised techniques. In this post, we include results for RI (Random Indexing), LSA (Latent Semantic Analysis), HAL (Hyperspace Analogue to Language), and LDA (Latent Dirichlet Allocation), since these are the standard algorithms for state-of-the-art unsupervised semantic analysis (see below for more details about the various algorithms).

In order to facilitate comparison and replicability, we apply all algorithms to the same freely available data set: the Open American National Corpus. We apply a minimum of preprocessing (non-alphabetic and non-numeric characters are replaced with white space, all characters are down-cased, and text within <p></p> is treated as a document for LSA and LDA), and run all algorithms with default parameters (unless otherwise stated).

Below are the results. As a comparison, random guessing would generate approximately 25% correct answers, while foreign applicants to U.S. colleges average around 64% (reported by Landauer and Dumais, 1997; see reference below).

Method Result
Ethersource (generation 1) 62.25%
LDA (300 topics) 52.50%
LSA (200 dimensions) 52.50%
RI-permutations (2000 dimensions) 48.75%
RI (2000 dimensions) 46.25%
HAL (300 dimensions) 43.75%

As can be seen by these results, Ethersource clearly outperforms the other unsupervised techniques included in this comparison. It should be noted that tweaking the parameters of the algorithms (and applying more careful preprocessing of the data, such as stemming and removal of high-frequency words) will typically lead to improved results for all algorithms. It should also be noted that the OANC data is comparatively small (~11M tokens), which explains why the results presented in this post fall below the state-of-the-art for algorithmic solutions to the synonym part of TOEFL.

The reason we use the OANC in this comparison is first of all to facilitate replicability, but also to be able to include results even for algorithms that do not scale very well. Furthermore, the point of this exercise is not to beat the state-of-the-art, but to compare the performance of a number of different algorithms on the same test using the same data (and, to be honest, beating the state-of-the-art on the TOEFL synonym test using unsupervised algorithms of the type we are focusing on here is mainly a matter of using sufficiently large, and sufficiently relevant, data to build the models – the results listed on the ACL Wiki are thus not very good indicators of relative performance).

To conclude, below is a short summary of the algorithms included in the comparison:


An example of a topic model, which interprets word occurrences as a result of the activation of a small set of latent topics. Words in this model become similar to the extent that they are generated by the same topics.

LDA reference: D. Blei, A. Ng and M. Jordan (2003) Latent Dirichlet allocation. Journal of Machine Learning Research 3 (4–5): pp. 993–1022.

This comparison uses the PLDA implemantation (Z. Liu, Y. Zhang, E. Chang and M. Sun (2011) PLDA+: Parallel Latent Dirichlet Allocation with Data Placement and Pipeline Processing. ACM Transactions on Intelligent Systems and Technology, special issue on Large Scale Machine Learning.).


A words-by-documents matrix is collected by noting occurrences of words in documents. The matrix is then transformed using truncated Singular Value Decomposition. Words in this model become similar to the extent that they co-occur in the same documents, and also (which is an effect of the truncated SVD) to the extent that they co-occur with the same other words.

LSA reference: T. Landauer and S. Dumais (1997) A solution to Plato’s problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104, 211-240.

This comparison uses the S-Space Package LSA implementation (D. Jurgens and K. Stevens (2010) The S-Space Package: An Open Source Package for Word Space Models. System Papers of the Association of Computational Linguistics).


A framework for incremental and scalable word space modeling. The standard RI model computes semantic word vectors in a fixed-dimensional space by noting co-occurrences within a sliding window spanning two preceding and two succeeding words. Words in this model become similar to the extent that they occur in similar contexts. The RI-permutations variation distinguishes preceding from succeeding co-occurrences.

RI reference: M. Sahlgren and J. Karlgren (2001) From words to Understanding. In Uesaka, Y., Kanerva, P. & Asoh, H. (Eds.): Foundations of Real-World Intelligence, pp.294-308, Stanford: CSLI Publications.

RI-permutations reference: M. Sahlgren, A. Holst and P. Kanerva (2008) Permutations as a Means to Encode Order in Word Space. Proceedings of the 30th Annual Meeting of the Cognitive Science Society (CogSci’08), July 23-26, Washington D.C., USA.

This comparison uses the original SICS Random Indexing implementation.


A words-by-words matrix is collected by noting co-occurrences within a sliding window spanning ten word tokens. Semantic word vectors are produced by concatenating the row and the column for each word, and (if needed for computational reasons) dropping dimensions that are less informative. Words in this model become similar to the extent that they share contexts.

HAL reference: K. Lund and C. Burgess (1996) Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instrumentation, and Computers, 28, 203-208.

This comparison uses the S-Space Package HAL implementation (D. Jurgens and K. Stevens (2010) The S-Space Package: An Open Source Package for Word Space Models. System Papers of the Association of Computational Linguistics).

Category: technicalities