This is our blog. We write about cool experiments, interesting case studies, open talks and technicalities.
Grab the RSS feed
Categories: all (95), case studies (49), events (6), featured post (10), Gavagai Explorer (18), Gavagai Lexicon (8), Gavagai Monitor (4), New language (2), projects (1), publication (1), sentiment analysis (18), talks and presentations (24), technicalities (9), Uncategorized (2)

2017-02-23

Inflection and distribution (or “are ‘lentil’ and ‘lentils’ the same word?”)

People who know about languages with rich inflectional systems often ask us how Gavagai handles morphology, i.e. the inflections and derivations words can be subject to in some languages. In English, for instance, a noun can be in singular or plural number, and each can take a genitive ending: singular “lentil” and “lentil’s”, plural “lentils” and “lentils’”, which is not too complex. In Swedish, a noun can take number, definiteness, and genitive, yielding eight surface forms for most nouns: singular “sill”, “sillen”, “sills”, “sillens” and plural “sillar”, “sillarna”, “sillars”, “sillarnas” (base form and genitive, each in non-definite and definite form), which…
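To make the Swedish paradigm above concrete, here is a minimal Python sketch of the eight surface forms of “sill” as a lookup table, together with the kind of surface-form-to-lemma mapping the lentil question hints at. The structure and names are purely illustrative assumptions; they say nothing about how Gavagai actually represents morphology.

```python
# Illustrative only: the full paradigm of the Swedish noun "sill" (herring),
# keyed by (number, definiteness, case). Not Gavagai's internal representation.
PARADIGM_SILL = {
    ("singular", "non-definite", "base"):     "sill",
    ("singular", "definite",     "base"):     "sillen",
    ("singular", "non-definite", "genitive"): "sills",
    ("singular", "definite",     "genitive"): "sillens",
    ("plural",   "non-definite", "base"):     "sillar",
    ("plural",   "definite",     "base"):     "sillarna",
    ("plural",   "non-definite", "genitive"): "sillars",
    ("plural",   "definite",     "genitive"): "sillarnas",
}

# Collapsing surface forms to one lemma is one way of asking whether
# "lentil" and "lentils" should count as the same word.
SURFACE_TO_LEMMA = {form: "sill" for form in PARADIGM_SILL.values()}

assert len(PARADIGM_SILL) == 8            # eight surface forms per Swedish noun
assert SURFACE_TO_LEMMA["sillarnas"] == "sill"
```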

2016-08-30

What’s that Odor?

I use Gavagai Lexicon extensively to look up how words are actually used in everyday online language. So, when the notion of “odor”, for some reason, popped up at the office yesterday, I turned to the Lexicon to see what our self-learning semantic memory has learned about it. The annotated screenshot below shows that the semantic memory knows three fairly distinct meanings of “odor”. The most prominent one has to do with “smell, stench, reeked”. A second meaning has to do with odor as instrumental in alerting people to something else, such as “foul odor alerted residents to dead body”. The…

2016-03-08

Business bingo – Is your text analytics system up-to-date with current affairs?

In my role as Chief Data Officer at Gavagai, I meet with lots of leads, clients, and data providers. Much of our conversation is carried out in English, and as a non-native speaker, I sometimes find the choice of wording peculiar, and at times slightly amusing. Touch base, reach out, back-to-back, and help me understand, to name but a few. In the game of buzzword bingo, players tick off pre-defined buzzwords on a bingo-like board. But what to enter as buzzwords? How would you recognize such a word? In my view, many of the business terms I’ve encountered would qualify as…
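For fun, here is a tiny Python sketch of the bingo mechanic described above: a board of pre-defined buzzwords that get ticked off as they appear in a transcript. The phrase list and function name are made up for illustration.

```python
# A toy buzzword bingo board: tick off phrases as they occur in a transcript.
# The phrase list is illustrative; pick your own favourites.
BOARD = {"touch base", "reach out", "back-to-back", "help me understand"}

def play_bingo(transcript: str, board=BOARD) -> set:
    """Return the buzzwords from the board that appear in the transcript."""
    text = transcript.lower()
    return {phrase for phrase in board if phrase in text}

ticked = play_bingo("Let's touch base after the back-to-back meetings.")
print(ticked)  # {'touch base', 'back-to-back'}
```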

2012-02-14

Hyperdimensionality, semantic singularity, and concentration of distances

This post digs a bit deeper into Ethersource. We discuss the problems of distance concentration and semantic singularity. We argue that Ethersource is not susceptible to these problems. As we have previously discussed in this blog, the number of unique words in social media grows at a rate that far exceeds what we are normally used to when working with collections of more traditional texts. To recapitulate, the lexical variation and growth in New Text is simply astounding; there is a constant and continuous influx of new tokens. We have also previously discussed how Ethersource is designed to handle such…
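For readers who have not met distance concentration before, here is a small, generic Python demonstration of the effect: as the dimensionality of a random point set grows, the gap between the nearest and the farthest neighbour shrinks relative to the distances themselves. This is a textbook illustration under assumed random data, not an Ethersource experiment.

```python
# Distance concentration: as dimensionality grows, the relative spread between
# the nearest and farthest neighbour of a random query point shrinks.
import numpy as np

rng = np.random.default_rng(0)
for dim in (2, 10, 100, 1000, 10000):
    points = rng.standard_normal((1000, dim))   # 1000 random points
    query = rng.standard_normal(dim)
    dists = np.linalg.norm(points - query, axis=1)
    relative_contrast = (dists.max() - dists.min()) / dists.min()
    print(f"dim={dim:>6}  relative contrast={relative_contrast:.3f}")
# The printed contrast drops towards zero as dim grows: distances "concentrate".
```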

2011-12-14

The Advantage of Ethersource on the TOEFL Synonym Test Compared to Other Methods

This post compares the performance of various semantic algorithms. Ethersource solves a synonym test with 62% correct answers, while the best runner-up only reaches 52%. The results demonstrate the advantage of Ethersource over other relevant methods. As part of our internal system performance monitoring, we continuously evaluate Ethersource using a number of standardized benchmark tests. One such test is the synonym part of the TOEFL (Test of English as a Foreign Language). This multiple-choice vocabulary test measures the ability of the subject (in our case, Ethersource) to identify which of four alternatives is the correct synonym to a given target…
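To show what answering such a test item involves, here is a hedged Python sketch of the usual vector-space procedure: pick the alternative whose vector lies closest (by cosine) to the target word. The toy vectors below are invented for illustration; the real evaluation of course uses the model's learned representations.

```python
# Sketch of how a vector-space model answers a TOEFL-style synonym item:
# choose the alternative whose vector is closest (by cosine) to the target.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

vectors = {                      # toy 3-dimensional "semantic" vectors
    "enormously": np.array([0.9, 0.1, 0.2]),
    "tremendously": np.array([0.8, 0.2, 0.1]),
    "appropriately": np.array([0.1, 0.9, 0.3]),
    "uniquely": np.array([0.2, 0.3, 0.9]),
    "decidedly": np.array([0.4, 0.5, 0.5]),
}

def answer(target, alternatives):
    return max(alternatives, key=lambda w: cosine(vectors[target], vectors[w]))

print(answer("enormously", ["tremendously", "appropriately", "uniquely", "decidedly"]))
# -> "tremendously"
```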

2011-11-01

We don’t do training, we do learning

We have already expressed in this blog how very pleased we are with the design of Ethersource, the technology we have developed. For Ethersource, the memory model and the processing model are the same thing: the memory model Ethersource is built on has its processing model built in. New data is projected into our memory model without confounding previous knowledge and without resizing the memory model. (We will return to the technical details here in the near future.) Ethersource delivers salient term-term relations in real time, on-line, without recomputation or postprocessing of the aggregated data. We are frequently asked how much…
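As a rough illustration of learning without a separate training phase, the sketch below updates fixed-width word vectors incrementally as text streams in, in the spirit of random-indexing-style models. This is an assumption-laden toy, not a description of Ethersource's actual memory model.

```python
# A generic random-indexing-style sketch of learning without "training":
# every word keeps a fixed-width context vector that is updated incrementally
# as new text streams in, so the dimensionality never grows with the vocabulary
# and nothing ever needs retraining. Not Ethersource's implementation.
import numpy as np
from collections import defaultdict

DIM = 2000        # fixed dimensionality, chosen up front
rng = np.random.default_rng(42)

index_vectors = defaultdict(lambda: rng.choice([-1, 0, 1], size=DIM, p=[0.05, 0.9, 0.05]))
context_vectors = defaultdict(lambda: np.zeros(DIM))

def learn(tokens, window=2):
    """Add each token's neighbours' index vectors to its context vector."""
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                context_vectors[word] += index_vectors[tokens[j]]

learn("we do not do training we do learning".split())
print(context_vectors["learning"][:10])   # vectors are usable immediately
```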

2011-10-31

The difference between Ethersource and those other models

We want to make clear what the difference is between our approach and approach X. (Substitute your favourite text analytics technology for X.) In short, Ethersource is a vector space model, with the processing convenience that comes with a vector space. But a vector space is only as good as the process used to populate it with data. We use distributional data to populate our vector space matrix: nearness in our vector space means similarity with respect to distribution. And we build the vector space handily: it is compact and remains tractable in size. Here is a brief…
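The excerpt above boils down to one idea: populate a vector space with distributional data, so that nearness means similar distribution. Here is a bare-bones Python sketch of that idea using raw co-occurrence counts and cosine similarity; it is a generic illustration, not Ethersource's construction (which, as the post notes, stays compact and tractable in size).

```python
# A bare-bones distributional vector space: each word's vector is its
# co-occurrence counts with every other word, and "nearness" is cosine
# similarity between those count vectors.
import numpy as np
from collections import Counter, defaultdict

corpus = [
    "i drink strong coffee every morning",
    "i drink strong tea every morning",
    "the cat sleeps all day",
]

cooc = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for i, w in enumerate(tokens):
        for j, c in enumerate(tokens):
            if i != j:
                cooc[w][c] += 1

vocab = sorted(cooc)

def vec(word):
    return np.array([cooc[word][c] for c in vocab], dtype=float)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

print(cosine(vec("coffee"), vec("tea")))   # high: similar distributions
print(cosine(vec("coffee"), vec("cat")))   # low: different distributions
```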

2011-10-23

New words in New Text

New Text is what we like to call the sort of spontaneous, non-edited material we spend much of our time processing. We contrast this primarily with traditional text from editorial sources. There are interesting differences between New Text and traditional text — and this has been the subject of much debate in philological, sociological, and to some extent even computational circles. Much of what has been said is interesting, much is pure piffle, and we have made our own pronouncements about what sort of changes we believe are ahead (this one pronouncement in Swedish). We expect we will have reason…

2011-10-22

Designing for scalability and other good things

Text analytics involves watching data, and most modern applications of text analytics involve vast and vastly increasing amounts of spontaneous and non-edited human-generated text. Any realistic model of human language in use must handle incoming data streams of dimensions that only a few years ago were considered intractable. This means that memory model design is central to effective processing: a naive model for term-term or term-document relations will grow for each new document, each new token, each new observed item of interest. This growth never stops. To cope, most industrial text analytics implementations use various sampling, compression, encoding, or compilation…
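To see why the naive approach breaks down, here is some back-of-the-envelope Python arithmetic on the size of a dense term-term matrix as the vocabulary grows; the byte count per cell is an assumption chosen only for illustration.

```python
# Back-of-the-envelope arithmetic for why a naive term-term matrix is untenable:
# a dense co-occurrence matrix over n terms needs n*n cells, and n keeps growing
# as new tokens stream in. Figures are illustrative (4-byte counts assumed).
BYTES_PER_CELL = 4

for vocab_size in (100_000, 1_000_000, 10_000_000):
    cells = vocab_size ** 2
    gigabytes = cells * BYTES_PER_CELL / 1e9
    print(f"{vocab_size:>10,} terms -> {gigabytes:>12,.0f} GB for a dense term-term matrix")

# 100,000 terms already costs 40 GB; 10 million terms costs 400,000 GB (400 TB).
# A memory model whose size does not grow with the vocabulary sidesteps this.
```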