What do Czech, Hebrew and Italian have in common?
Answer: they have just been added to the Gavagai Living Lexicon, which is an unsupervised semantic memory that continuously learns language by reading large amounts of online news and social media. You can think of the lexicon as a brain in silico (or, equivalently, as a piece of artificial intelligence) that tirelessly reads online media and learns how terms are related to each other. As of today (2016-02-16), the Gavagai Living Lexicon contains the following 20 languages:
Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Hebrew, Hungarian, Italian, Latvian, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Russian, Spanish, Swedish
…and more are to come!
For those not yet familiar with the Gavagai Living Lexicon: it is a completely data-driven (unsupervised) lexicon that learns language by reading large amounts of online news and social media in various languages. The lexicon can be used in any application that requires a lexical component, such as keyword expansion. It can also be used to investigate how terms have been used in online media; for example, the LES project uses the lexicon to investigate how terms like “democracy” have been used in a variety of languages.
The Living Lexicon learns various kinds of relations between terms. As an example, the lexicon has learned that the term “costa” in Italian can have several different meanings: “coast”, referring to coastal regions; “cost”, referring to the cost of things; and a surname. The lexicon has also learned several multiword terms featuring the word “costa”, like “costa rica” and “costa azzurra”, and it has learned that “costa” often occurs in a sports context, in documents about soccer (predominantly when used in its surname sense). It has learned all this merely from reading online text, without any form of human supervision. For those interested in the technical aspects of the lexicon, we will present the underlying algorithms and architecture of the system at this year’s LREC conference in May.
We see the Living Lexicon as a long-term experiment that investigates:
- how unsupervised distributional semantic models (aka word embeddings) behave when continuously fed with large amounts of online media;
- how language use evolves over time.
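
For readers new to distributional semantics, the core idea can be shown with a toy example: terms that occur in similar contexts end up with similar context vectors. The sketch below is only an illustration of that general principle, not the Living Lexicon's actual algorithm; the function names, the window size and the four-sentence corpus are all made up for this example.

```python
from collections import Counter, defaultdict
from math import sqrt


def cooccurrence_vectors(sentences, window=2):
    """Map each term to a Counter of the terms seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, term in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[term][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def neighbours(term, vectors, n=3):
    """Return the n terms whose context vectors are most similar to that of `term`."""
    target = vectors.get(term, Counter())
    scores = [(other, cosine(target, vec)) for other, vec in vectors.items() if other != term]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:n]


# Hypothetical toy corpus; a real system would read far more text.
corpus = [
    "the coast of italy is beautiful",
    "the shore of italy is beautiful",
    "the cost of housing is rising",
    "the price of housing is rising",
]

vectors = cooccurrence_vectors(corpus)
print(neighbours("coast", vectors))
```

In this toy corpus “coast” and “shore” share exactly the same contexts, so “shore” comes out as the nearest neighbour of “coast”, while “cost” scores lower because it appears next to “housing” rather than “italy”. No labels or dictionaries are involved, which is what unsupervised means here.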
Apart from being an interesting experiment in its own right, the Living Lexicon also constitutes a very important building block in most of the commercial products we develop at Gavagai. Examples of products utilizing the Living Lexicon include the Gavagai Monitor, which is a tool for social media monitoring, and the Gavagai Explorer, which is a tool for exploring various types of text data, like open-ended survey questions and reviews.