Designing for scalability and other good things
Text analytics involves watching data, and most modern applications of text analytics involve vast and rapidly increasing amounts of spontaneous, unedited human-generated text. Any realistic model of human language in use must handle incoming data streams of dimensions that only a few years ago were considered intractable. This makes memory model design central to effective processing: a naive model of term-term or term-document relations grows with each new document, each new token, each new observed item of interest. This growth never stops.
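As a toy sketch of the naive approach described above, consider a term-term co-occurrence model kept as a nested count table. (The function and window size here are our own illustrative choices, not part of any particular system.)

```python
from collections import defaultdict

def update_cooccurrence(matrix, tokens, window=2):
    """Add co-occurrence counts for one incoming document.

    The matrix is a nested dict: matrix[term_a][term_b] -> count.
    Every previously unseen term adds a new row, and every new
    term pair adds a new cell -- so the model never stops growing.
    """
    for i, term in enumerate(tokens):
        for other in tokens[max(0, i - window):i]:
            matrix[term][other] += 1
            matrix[other][term] += 1
    return matrix

matrix = defaultdict(lambda: defaultdict(int))
update_cooccurrence(matrix, "the cat sat on the mat".split())
rows_after_first_doc = len(matrix)
update_cooccurrence(matrix, "a dog sat on a log".split())
# every document with new vocabulary enlarges the matrix
assert len(matrix) > rows_after_first_doc
```

Each new document with unseen vocabulary adds rows and cells, which is exactly the open-ended growth that forces the coping strategies discussed next.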
To cope, most industrial text analytics implementations use various sampling, compression, encoding, or compilation algorithms to manage the potentially explosive growth of their memory models. Updating and maintaining the model typically requires periodic recomputation, and extracting meaningful information from most models involves further generalisation and post-processing computation. If the model explodes in size, you can throw more hardware at it, perhaps with some informed sampling, and still do what you set out to do. Until it explodes again. In the meantime you will have what you need, and if you set your parameters right after training you may have data of quite high and useful quality.
But there is a better way to do it. We are very pleased with the design of Ethersource, the technology we have developed. For Ethersource, the memory model and the processing model are identical. Ethersource will not blow up in your face as data comes in, will not require demanding computation to keep its semantic model up to date, and will deliver semantic relations in real time, on-line, without recomputation or post-processing of the aggregated data.
The graph given here (taken from our more comprehensive paper on memory model growth) illustrates the orders-of-magnitude differences between two vanilla memory models and the Ethersource model, all three designed to relate words to words based on their distribution. At 11 000 000 Tweets, a word-by-word memory model requires 190 times the number of matrix cells used by Ethersource. At the same number of Tweets, a word-by-document matrix would be 5 500 times larger than the representation used by Ethersource.
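A back-of-envelope calculation makes the arithmetic behind such ratios concrete. The vocabulary size (380 000) and fixed dimensionality (2 000) below are purely hypothetical parameter choices that happen to be consistent with the reported ratios; they are not disclosed figures about Ethersource or the data set.

```python
# Illustrative cell counts for three memory-model designs.
# vocab_size and dim are hypothetical assumptions, not real figures.

def word_by_word_cells(vocab_size):
    # term-term matrix: one cell per word pair
    return vocab_size * vocab_size

def word_by_document_cells(vocab_size, n_docs):
    # term-document matrix: one cell per word per document
    return vocab_size * n_docs

def fixed_dim_cells(vocab_size, dim=2000):
    # fixed-width model: one constant-size vector per term,
    # so size is bounded by vocabulary, not by the document stream
    return vocab_size * dim

vocab, docs = 380_000, 11_000_000
print(word_by_word_cells(vocab) // fixed_dim_cells(vocab))        # 190
print(word_by_document_cells(vocab, docs) // fixed_dim_cells(vocab))  # 5500
```

The key structural point is visible in the formulas themselves: the word-by-document model grows with the number of documents, the word-by-word model grows quadratically with vocabulary, while a fixed-dimensional model grows only linearly with vocabulary.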
Our parsimonious memory footprint is not the effect of a compression operation – it is an inherent aspect of the same design that makes Ethersource dynamic, robust, and effective. We want to stress the importance of having an informed design of the knowledge representation to begin with: a memory model that is scalable and bounded, and a processing model that is habitable in the face of varied, multilingual, spontaneous, unedited data. We will return to the importance of design and to these other central characteristics of the technology in coming blog posts!