Gavagai Living Lexicon online
We are proud to announce the release of the Gavagai Living Lexicon – an online lexicon that gives you access to the knowledge our distributional semantic models gather about terms in language as it is used by people in every corner of the known world.
The lexicon is based on Gavagai’s distributional semantic models, which learn continuously from live data feeds of millions of documents per day drawn from both social and news media. This means that the living lexicon is constantly evolving and always à jour with current language use. As an example, try searching for a topical term such as “earthquake” to see what the lexicon has learned over the last couple of days.
The lexicon currently provides the following information:
- the frequency rank of the term in the lexicon
- similarly spelled terms
- common left and right neighbors (i.e. left and right collocations)
- multiword units (n-grams) that include the search term
- semantically similar terms (i.e. terms that have been used in a similar way in online data)
- associatively related terms (i.e. terms that have often been used in the same documents as the search term)
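To make three of these notions concrete (frequency rank, left and right neighbors, and associatively related terms), here is a minimal Python sketch over a toy corpus. It only illustrates the definitions given in the list; the corpus, the function names, and the whitespace tokenizer are assumptions made for this example and say nothing about how Gavagai's models are actually implemented.

```python
from collections import Counter

# Toy corpus: each string stands in for one document (hypothetical data).
DOCS = [
    "the earthquake shook the city",
    "the earthquake damaged the old bridge",
    "rescue teams reached the city after the earthquake",
]


def tokenize(doc):
    """Naive whitespace tokenizer; a real system would do far more."""
    return doc.lower().split()


def frequency_rank(term, docs):
    """Return the 1-based rank of `term` by total count, or None if unseen."""
    counts = Counter(tok for doc in docs for tok in tokenize(doc))
    if term not in counts:
        return None
    ranked = [t for t, _ in counts.most_common()]
    return ranked.index(term) + 1


def neighbors(term, docs):
    """Count the tokens immediately to the left and right of `term`."""
    left, right = Counter(), Counter()
    for doc in docs:
        toks = tokenize(doc)
        for i, tok in enumerate(toks):
            if tok != term:
                continue
            if i > 0:
                left[toks[i - 1]] += 1
            if i + 1 < len(toks):
                right[toks[i + 1]] += 1
    return left, right


def cooccurring_terms(term, docs):
    """Count, per other term, the documents in which it appears with `term`."""
    related = Counter()
    for doc in docs:
        toks = set(tokenize(doc))
        if term in toks:
            related.update(toks - {term})
    return related


if __name__ == "__main__":
    print(frequency_rank("earthquake", DOCS))
    print(neighbors("earthquake", DOCS))
    print(cooccurring_terms("earthquake", DOCS).most_common(3))
```

Note that raw counts like these are dominated by function words such as "the"; a practical system would typically weight or filter them, which this sketch does not attempt.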
Both the semantically similar and associatively related terms are automatically grouped into clusters of similar and related terms, respectively. The semantic groups are also labelled with the most common collocations. You can think of the labels as an explanation for why the terms are clustered together. As an example, look up “apple”. You can see that the distributional semantic model has learned a number of different usages of “apple”, including apple as an ingredient, apple as a product, apple as a stock, and apple as a fruit. Another example is found by looking up “suit”, which demonstrates that the lexicon has learned both the garment sense and the legal sense.
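The difference between "semantically similar" and "associatively related" can be illustrated with a small Python sketch. It builds a context vector for each term from the words that appear within a small window around it, then compares terms by cosine similarity. This is a generic textbook formulation of distributional similarity, not a description of Gavagai's models; the toy corpus, the window size, and the function names are all assumptions made for the example, and the clustering and labelling steps are not shown.

```python
import math
from collections import Counter, defaultdict

# Toy corpus (hypothetical data): "apple" and "pear" never share a document,
# but they are used in similar contexts.
DOCS = [
    "she ate an apple today",
    "she ate a pear today",
    "he sold the stock today",
]


def context_vectors(docs, window=2):
    """Map each term to a Counter of the terms seen within `window` positions."""
    vectors = defaultdict(Counter)
    for doc in docs:
        toks = doc.lower().split()
        for i, tok in enumerate(toks):
            lo = max(0, i - window)
            hi = min(len(toks), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[tok][toks[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    vecs = context_vectors(DOCS)
    print(cosine(vecs["apple"], vecs["pear"]))
    print(cosine(vecs["apple"], vecs["stock"]))
```

In this toy data, "apple" scores higher against "pear" than against "stock" because the first two share the contexts "ate" and "today", even though they never occur in the same document. Document co-occurrence alone would not link them, which is why the two kinds of relation are reported separately.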
The lexicon is currently available in Arabic, Danish, English, Estonian, Finnish, French, German, Latvian, Lithuanian, Norwegian, Portuguese, Russian, Spanish, and Swedish. More languages will be added over time. The size of the vocabulary for each language depends on the amount of online data we listen to for that particular language. English is currently the largest language in the lexicon, with a vocabulary of more than 2,500,000 unique terms. The 200,000 most common of these terms have entries in the English lexicon.
If you are a developer and want to access the lexicon functionality directly through our API, simply sign up for a free developer account. Note that our developer APIs also offer multi-document summarization, tonality analysis, and keyword extraction.
We appreciate any feedback on the lexicon and our APIs. Contact us at:
(Publications describing the algorithms behind the living lexicon are under preparation and will be added to the lexicon site once published.)