

New words in New Text

New Text is what we like to call the sort of spontaneous, non-edited material we spend much of our time processing. We contrast it primarily with traditional text from editorial sources. There are interesting differences between new text and traditional text, and this has been the subject of much debate in philological, sociological, and to some extent even computational circles. Much of what has been said is interesting, much is pure piffle, and we have made our own pronouncements about what sort of changes we believe are ahead (this one pronouncement is in Swedish). We expect we will have reason to return to this discussion.

Two of the most immediately noticeable things about new text, as compared to traditional text, are its lexical creativity and its proofreading sloppiness. From the point of view of a text-analysis processor, these amount to much the same thing: a constant influx of new tokens, never seen before. Newspeak, lolspeak, l33tspeak, new topics, puns, lesser-used sociolects and idiolects, misspellings, mistypings, and teenage angst all contribute to a vast and vastly growing symbol table.

Image 1: Number of unique words, as a function of time, in Tweets and newsprint

We are perfectly happy to cope with that sort of growth! One of the underlying principles of our design is not to be fazed by new and unexpected usage. An example: in a conversation, if your counterpart mispronounces a central term (or speaks an unexpected dialect or variety of the language), or uses a synonym you didn't know before, it might throw you the first time. The next time around, you cope with it. It might annoy you a few times, but eventually, after only a handful of observations, you will be habituated to the pronunciation or the synonym. You will not be retranslating the new observation back to your previous knowledge of the world: some perceptual and lexical process in your language analysis system simply notes that this concept appears to be subject to some variance. This happens without retraining or recompilation of your lexicon: you don't stop the conversation to figure it all out. (Or, well, you shouldn't. If you do so often, you will eventually lose friends.) This is the way we believe things should be done!
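This habituation story can be sketched as a toy online lexicon. The class name and the threshold of three observations are our illustrative choices here, not a claim about how Ethersource actually does it:

```python
from collections import Counter

class OnlineLexicon:
    """Toy lexicon that habituates to new tokens after a handful of
    observations, with no retraining or recompilation step in between."""

    def __init__(self, habituation_threshold=3):
        self.counts = Counter()
        self.threshold = habituation_threshold

    def observe(self, token):
        """Record one observation; return the token's current status."""
        self.counts[token] += 1
        n = self.counts[token]
        if n == 1:
            return "novel"        # might throw you the first time
        elif n < self.threshold:
            return "surprising"   # annoying a few more times
        else:
            return "known"        # habituated: just normal variance

lex = OnlineLexicon()
statuses = [lex.observe("lol") for _ in range(4)]
# statuses == ["novel", "surprising", "known", "known"]
```

The point of the sketch is that "learning" here is nothing more than incrementing a counter as the stream goes by: the lexicon never pauses the conversation.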

Here are some numbers we got out of our text collections (see Image 1 above). Take two years of newsprint from reputable newspapers (in this case, the first year from a major US daily and the second year from a major UK daily). That's about 100 Mwords in all, or roughly 100 kwords per day. The first few days, most words are new, but the stream settles quite rapidly into a steady 100-200 new tokens per day. (The switch across the Atlantic probably contributes slightly more of those, but not too noticeably.) In itself, an argument for a learning system!

Now, for comparison, take two months of tweets on various topics in English. (Well, mostly in English. Mixed-language tweets are in our test material, and we don't want to take them out: that's the way the world is.) That's more than 1 Gwords, working out to about 20 Mwords a day. And about 200 000 new words. Per day. Try keeping up with that manually! We firmly believe that the architecture of a system built to handle this must treat this sort of data variance as normal, not as something to be met with quick hacks or filtering. Ethersource is built with this in mind.
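The bookkeeping behind these per-day counts is straightforward. A minimal sketch (the function name and the toy data are ours, for illustration; a real run would stream tokenized newsprint or tweets):

```python
def new_tokens_per_day(daily_token_streams):
    """Given (date, tokens) pairs in chronological order, yield
    (date, n_new), where n_new counts tokens never seen on any
    earlier day -- the measurement plotted in Image 1."""
    seen = set()
    for date, tokens in daily_token_streams:
        today = set(tokens)
        new = today - seen     # tokens first observed on this day
        seen |= today
        yield date, len(new)

days = [
    ("day1", "the cat sat on the mat".split()),
    ("day2", "the cat saw a l33t dog".split()),
    ("day3", "the dog lol'd".split()),
]
counts = dict(new_tokens_per_day(days))
# counts == {"day1": 5, "day2": 4, "day3": 1}
```

Note that `seen` only ever grows: in newsprint it grows by a few hundred entries a day, in tweets by a couple of hundred thousand, which is exactly why we treat an ever-expanding symbol table as the normal case.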

And the future is near: what do you think will happen when language analytics moves on to processing speech data as well as text? Do you believe the number of distinct tokens in the data stream will converge to a smaller number?

Category: featured post, technicalities