Weak signal synonym detection (in Swedish)

As we have previously discussed on this blog, Ethersource constantly and continuously learns new terminology by reading what is written on the Internet. As an example of how Ethersource picks up even weak linguistic signals, we noticed recently that Ethersource suggested the word “tutilurfräs” as a very positive Swedish term. None of us had ever encountered the term “tutilurfräs” before. We looked up the source of this linguistic invention, and found that it originates from a tweet by Swedish punk icon Kajsa Grytt, where she writes that:



A (somewhat creative) translation in English would be something like: “Oh Pelle! Oh Hives! What tutilurfräs!! I think they are genius. That band makes me absolutely happy.”

Quite obviously, Ethersource is correct in its understanding that “tutilurfräs” is a very positive word.

There are two lesson to be drawn from this example:

  1. If you do sentiment analysis in Swedish on Twitter and your model does not automatically learn new terminology, you should re-train or update your model to include the word “tutilurfräs“.
  2. If you invent a completely new word and start blogging or tweeting about it, Ethersource will learn it. It is true that in space, no one can hear you scream, but on the Internet, even if you whisper Ethersource will understand you.

Hyperdimensionality, semantic singularity, and concentration of distances

  • This post digs a bit deeper into Ethersource.
  • We discuss the problems of distance concentration and semantic singularity.
  • We argue that Ethersource is not susceptible to these problems.

As we have previously discussed in this blog, the number of unique words in social media grows at a rate that far exceeds what we are normally used to when working with collections of more traditional texts. To recapitulate, the lexical variation and growth in New Text is simply astounding; there is a constant and continuous influx of new tokens. We have also previously discussed how Ethersource is designed to handle such growth. The memory/processing model (we don’t make a distinction between these) of Ethersource does not explode in size as we add (lots and lots of) new data.

To repeat the message: if your data is highly dynamic, you’d better have a model that can handle variation.

Ethersource is based on hyperdimensional computing, which means that all operations in Ethersource are performed in fixed-dimensional spaces of very high dimensionality. Such representations have a number of very attractive features (see Kanerva’s paper in the references below for more details). One of the most useful properties of hyperdimensional representations is that the dimensionality is unaffected by the size of the data. This is the reason Ethersource seamlessly and unproblematically can handle such rapidly growing vocabularies as those encountered in social media (and in other kinds of streaming data sources).

Of central importance in Ethersource (and in other data mining systems) is the notion of similarity. Applications like social media monitoring/sentiment analysis, association analysis, etc, all boil down to questions of the type “how similar is this data point to that”? Association analysis in particular is an example of nearest neighbor search, in which the task is to find the data points that are most similar to a given query data point. Nearest neighbor search is a core functionality in many data mining applications. Examples include semantic search, pattern recognition, recommendation systems, etc. All these applications (and many more), depend on nearest neighbor searches in high-dimensional spaces.

Enter the phenomenon of distance concentration and the perils of the semantic singularity.

Imagine what the impact would be for systems that rely on the notion of similarity if this notion itself became meaningless. Clearly, not good. But is this really something we need to worry about? Could it ever happen?

Science fiction-like as it may sound, this is exactly what the phenomenon of distance concentration refers to. Essentially, this is a situation in which the distance from a query data point to the nearest neighbor approaches the distance to the farthest neighbor. In such a situation, the notion of similarity becomes useless because all distances are the same. Several recent papers (see below for references) have pointed out that this situation might actually occur in certain cases where the dimensionality of the data increases.

Remember the observation about the vocabulary growth of social media? This is a hallmark example of data with continuously increasing dimensionality. Thus, not only do you need to worry about the processing cost when dealing with such data, but you also need to worry about your representation collapsing into semantic singularity. And to make matters even worse, it has been shown that certain types of dimensionality reduction and approximate nearest neighbor search techniques can further aggravate the problem of distance concentration.

If we operate in high dimensions with vast and vastly growing data sets streaming in, we should take this problem seriously.

In the case of Ethersource, we use hyperdimensional computing to ensure that the representation remains unaffected by the size of the data. This means that Ethersource is not at risk of distance concentration due to increasing dimensionality of the representation per se. However, as the attentive reader would no doubt be wondering, what about the growth of the intrinsic dimensionality? Is there no risk of a hyperdimensional representation getting “saturated”? That is, how can we be sure that there will always be enough room, locally, in the fixed-size hyperdimensional representation when there is a continuous inflow of data?

This would be a tangible problem if we were faced with data of high intrinsic dimensionalities. In such cases, the local neighbourhood of a data point can become saturated with new neighbours, thus rendering the notion of vicinity meaningless, and thereby collapsing into semantic singularity. However, Ethersource operates on a very special type of data, which has comparatively low intrinsic dimensionality (Karlgren et al. 2008).

Thus, exit the problem of distance concentration in Ethersource.

And anyway, as someone so wisely said, “forgetting is the key to a healthy mind”, and we certainly want Ethersource to stay healthy.

To end this rather technical post, we include an illustrative example of how similarities behave when adding more data in Ethersource. The following graph shows how the pairwise similarities between semantically related and semantically unrelated words remain stable as we add more data (in this case, up to some 2 billion words).

This is exactly how we want the model to behave; related words stay related, while unrelated words stay unrelated. It would definitely not be a good thing if we saw an increase in similarity between the unrelated words as we add more data, merely as an effect of adding more data. What could happen though is that two previously unrelated words suddenly become similar as an effect of new language use. This, however, is perfectly in order, since we want the similarities to reflect actual usage patterns rather than presumed ones. The fluctuations in the graph correspond to such fluctuations in language use.

References

Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan and Uri Shaft (1999) When Is “Nearest Neighbor” Meaningful? Proceedings of the 7th International Conference on Database Theory, 1999.

Ata Kabán (2011) On the distance concentration awareness of certain data reduction techniques. Pattern Recognition, 44 (2): 265-277.

Pentti Kanerva (2009) Hyperdimensional Computing: An introduction to computing in distributed representation with high-dimensional random vectors. Cognitive Computation, 1(2): 139-159.

Jussi Karlgren, Anders Holst and Magnus Sahlgren (2008) Filaments of Meaning in Word Space. Proceedings of the 30th European Conference on Information Retrieval, 2008.

The Advantage of Ethersource on the TOEFL Synonym Test Compared to other Methods

  • This post compares the performance of various semantic algorithms
  • Ethersource solves a synonym test with 62% correct answers, while the best runner-up only reaches 52%
  • The results demonstrate the advantage of Ethersource over other relevant methods

As part of our internal system performance monitoring, we continuously evaluate Ethersource using a number of standardized benchmark tests. One such test is the synonym part of the TOEFL (Test of English as a Foreign Language). This multiple-choice vocabulary test measures the ability of the subject (in our case, Ethersource) to identify which of four alternatives is the correct synonym to a given target word.

We use the synonym part of the TOEFL as a performance benchmark for several reasons. The first is that a synonym test is a relevant test for a system that claims to know about meaning. At Gavagai, we believe in putting our money where our mouth is; if you claim that your system extracts meaning from text, you should be able to demonstrate this in a scientific test that measures meaning (such as, e.g., a standardized synonym test). Furthermore, the synonym part of the TOEFL has been used extensively in the scientific literature, so there is an abundance of published results to compare with. Lastly, the TOEFL test is normally administered to human test subjects, so you can actually compare the performance of your system to that of humans (which is nice, if you aim at intelligence).

Since Ethersource learns from the data it sees (in technical terms, we call it an unsupervised system), we benchmark its performance in relation to other unsupervised techniques. In this post, we include results for RI (Random Indexing), LSA (Latent Semantic Analysis), HAL (Hyperspace Analogue to Language), and LDA (Latent Dirichlet Allocation), since these are the standard algorithms for state-of-the-art unsupervised semantic analysis (see below for more details about the various algorithms).

In order to facilitate comparison and replicability, we apply all algorithms to the same freely available data set: the Open American National Corpus. We apply a minimum of preprocessing (non-alphabetic and non-numeric characters are replaced with white space, all characters are down-cased, and text within <p></p> is treated as a document for LSA and LDA), and run all algorithms with default parameters (unless otherwise stated).

Below are the results. As a comparison, random guessing would generate approximately 25% correct answers, while foreign applicants to U.S. colleges average around 64% (reported by Landauer and Dumais, 1997; see reference below).

Method Result
Ethersource (generation 1) 62.25%
LDA (300 topics) 52.50%
LSA (200 dimensions) 52.50%
RI-permutations (2000 dimensions) 48.75%
RI (2000 dimensions) 46.25%
HAL (300 dimensions) 43.75%

As can be seen by these results, Ethersource clearly outperforms the other unsupervised techniques included in this comparison. It should be noted that tweaking the parameters of the algorithms (and applying more careful preprocessing of the data, such as stemming and removal of high-frequency words) will typically lead to improved results for all algorithms. It should also be noted that the OANC data is comparatively small (~11M tokens), which explains why the results presented in this post fall below the state-of-the-art for algorithmic solutions to the synonym part of TOEFL.

The reason we use the OANC in this comparison is first of all to facilitate replicability, but also to be able to include results even for algorithms that do not scale very well. Furthermore, the point of this exercise is not to beat the state-of-the-art, but to compare the performance of a number of different algorithms on the same test using the same data (and, to be honest, beating the state-of-the-art on the TOEFL synonym test using unsupervised algorithms of the type we are focusing on here is mainly a matter of using sufficiently large, and sufficiently relevant, data to build the models – the results listed on the ACL Wiki are thus not very good indicators of relative performance).

To conclude, below is a short summary of the algorithms included in the comparison:

LDA

An example of a topic model, which interprets word occurrences as a result of the activation of a small set of latent topics. Words in this model become similar to the extent that they are generated by the same topics.

LDA reference: D. Blei, A. Ng and M. Jordan (2003) Latent Dirichlet allocation. Journal of Machine Learning Research 3 (4–5): pp. 993–1022.

This comparison uses the PLDA implemantation (Z. Liu, Y. Zhang, E. Chang and M. Sun (2011) PLDA+: Parallel Latent Dirichlet Allocation with Data Placement and Pipeline Processing. ACM Transactions on Intelligent Systems and Technology, special issue on Large Scale Machine Learning.).

LSA

A words-by-documents matrix is collected by noting occurrences of words in documents. The matrix is then transformed using truncated Singular Value Decomposition. Words in this model become similar to the extent that they co-occur in the same documents, and also (which is an effect of the truncated SVD) to the extent that they co-occur with the same other words.

LSA reference: T. Landauer and S. Dumais (1997) A solution to Plato’s problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104, 211-240.

This comparison uses the S-Space Package LSA implementation (D. Jurgens and K. Stevens (2010) The S-Space Package: An Open Source Package for Word Space Models. System Papers of the Association of Computational Linguistics).

RI

A framework for incremental and scalable word space modeling. The standard RI model computes semantic word vectors in a fixed-dimensional space by noting co-occurrences within a sliding window spanning two preceding and two succeeding words. Words in this model become similar to the extent that they occur in similar contexts. The RI-permutations variation distinguishes preceding from succeeding co-occurrences.

RI reference: M. Sahlgren and J. Karlgren (2001) From words to Understanding. In Uesaka, Y., Kanerva, P. & Asoh, H. (Eds.): Foundations of Real-World Intelligence, pp.294-308, Stanford: CSLI Publications.

RI-permutations reference: M. Sahlgren, A. Holst and P. Kanerva (2008) Permutations as a Means to Encode Order in Word Space. Proceedings of the 30th Annual Meeting of the Cognitive Science Society (CogSci’08), July 23-26, Washington D.C., USA.

This comparison uses the original SICS Random Indexing implementation.

HAL

A words-by-words matrix is collected by noting co-occurrences within a sliding window spanning ten word tokens. Semantic word vectors are produced by concatenating the row and the column for each word, and (if needed for computational reasons) dropping dimensions that are less informative. Words in this model become similar to the extent that they share contexts.

HAL reference: K. Lund and C. Burgess (1996) Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instrumentation, and Computers, 28, 203-208.

This comparison uses the S-Space Package HAL implementation (D. Jurgens and K. Stevens (2010) The S-Space Package: An Open Source Package for Word Space Models. System Papers of the Association of Computational Linguistics).