The difference between Ethersource and those other models

We want to make clear how our approach differs from approach X. (For X, substitute your favourite text analytics technology.) In short, Ethersource is a vector space model, with the processing convenience that comes with a vector space.

But a vector space is only as good as the process used to populate it with data. We use distributional data to populate our vector space matrix: nearness in our vector space means similarity with respect to distribution. And we build the vector space handily – it is also compact and remains tractable in size.
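To make "nearness means similarity with respect to distribution" concrete, here is a minimal toy sketch. It is not Ethersource: the function names, the window size, and the tiny corpus are all assumptions made for illustration. Each word is represented by counts of the words that occur near it, and two words are compared with cosine similarity.

```python
from collections import Counter, defaultdict
from math import sqrt

def build_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Hypothetical toy corpus, invented for this example.
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases mice",
    "the dog chases cats",
    "stocks fell sharply today",
]
vectors = build_vectors(corpus)
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])        # shared contexts
sim_cat_stocks = cosine(vectors["cat"], vectors["stocks"])  # no shared contexts
```

In this toy corpus "cat" and "dog" end up close because they appear in the same kinds of contexts, while "cat" and "stocks" share none.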

Here is a brief comparison matrix.

| Challenge | Statistical | Knowledge-based | Ethersource |
|---|---|---|---|
| Vast scale | Fine (if sampling is done correctly and samples are true to data) | Fine (if processing model can be optimised) | Fine (inherent in memory model and in processing model) |
| Multilinguality | Fine (if labeled training collection is available, involves train-test-update cycle) | Problematic (involves expensive retooling of knowledge base) | Fine (inherent in memory model and in processing model) |
| Change | Problematic (new data not guaranteed to conform to estimations based on previous data) | Problematic (involves expensive retooling of knowledge base) | Fine (inherent in memory model and in processing model) |
| Variety | Fine (if labeled training collection is available, involves train-test-update cycle) | Problematic (involves expensive retooling of knowledge base) | Fine (inherent in memory model and in processing model) |
| Coverage | High recall | High precision | High recall |
| Abstraction | Strings | Concepts or logical forms | Concepts |

We view Ethersource as the base technology for any service or information process which relies on human language as an input. Any process which today uses grammars, lexica, thesauri, occurrence frequencies, or estimates of collocation likelihoods will be well served by plugging in Ethersource as a base resource or as a replacement. Also, many new services which today would be prohibitive in engineering cost will be painless to design on top of Ethersource.

We have some examples in our service palette today, but we have by no means exhausted the possibilities!

Gavagai performs Open Source Intelligence analysis of the threat level towards Sweden

Ten months ago, Taimour Abdulwahab set off a bomb in central Stockholm. The image below illustrates the weak signal Violence Propensity Index (VPI) related to Sweden that Gavagai picked up in the Swedish blogosphere between August 1 and December 21, 2010. Note that the suicide bombing in Stockholm is preceded by approximately 30 days of increased volatility and rising VPI levels, and by a significant spike on December 8, that is, three days prior to the event.

Image: Weak signal Violence Propensity Index with respect to Sweden, as measured in the Swedish blogosphere between August 2, and December 21, 2010.

Analysis of buzz, hatred, and associations during the unraveling of Håkan Juholt’s accommodation reimbursements affair

In this post, we analyze the mention frequency, the strong negative sentiment (or hatred), and the terminology associated with the leader of the Swedish Social Democrats, Håkan Juholt, during the unraveling of his accommodation reimbursements affair. The analysis covers the Swedish blogosphere between October 1 and 19, 2011, and uses Ethersource technology and our proprietary associations engine.

The short version of the story is: Juholt went into the accommodation affair in early October with a fairly low buzz in the blogosphere, and a reasonable level of strong negative sentiment or hatred expressed towards him, considering he is a leading politician in an opposition party. Towards the middle of the month, the buzz once again settles, but the hatred has reached high levels! The terms associated with him suggest that the affair will not wear off easily.

Image 1 and Image 2, below, illustrate several things. The blue curve denotes the mention frequency, that is, the number of mentions of Juholt in the Swedish blogosphere. The red curve denotes the hatred expressed in relation to Juholt. The words pinned to each day are the new prominent terms associated with Juholt with respect to the terms for the previous day.

Image 1 covers the period October 6 – 10, 2011. On October 6, the terms associated with Juholt mainly concern the shadow budget proposed by the Social Democrats the day before. The mention frequency is quite low, and the hatred expressed in relation to Juholt is not exceptional. On October 7, the Swedish newspaper Aftonbladet published an article claiming that Juholt had requested too large an allowance for his residence. The mention frequency increases markedly, while the hatred is similar to the day before. Although still influenced by online discussion of the shadow budget, the terms associated with Juholt clearly show evidence of an affair in the making; the article by Aftonbladet has gained traction in the blogosphere. Saturday, October 8, shows a further increase in mention frequency, which also tends to vary with the time of day. The graph also shows that the hatred is on the rise; bloggers are picking up on the reimbursements affair. This is also evident in the associated terms, where the affair is compared to tobleroneaffären, a financial mishap made in 1995 by the former Social Democratic leader Mona Sahlin. The terms also reflect that the affair is about Juholt’s apartment, and that he will hold a press conference. Moving on to October 9, we see that the mention frequency, now clearly varying with the time of day, levels out, as does the hatred. The associated terms concern Juholt’s solidarity and the decline in trust in him, and also refer to the Swedish Prosecution Authority. Finally, Image 1 shows that, for October 10, the mention frequency is still high, and the hatred is rising markedly; the aversion vented in relation to Juholt is reaching high levels! The associated terms are related to crime (preliminary investigation, fraud, prosecution) and to politics (voters, party, resign, party leaders, citizens).

Image 1: The period covering the onset of the affair, October 6 - 10, 2011. The terms for a given day in the image are the new terms associated with Håkan Juholt for that particular day.

Moving on to a later part of the affair, Image 2 illustrates the period October 15 – 19. The general trend for the period is that the mention frequency is declining. The divergence between frequency and hatred is interesting: the fact that the hatred rises as the mention frequency declines suggests that while people are talking less about Juholt, those who do are still very upset. Let’s look at the associated terms day by day. On October 15, the discussion in the blogosphere is mainly about the media hunt for Juholt. October 16 concerns the “obvious” rules for accommodation reimbursements (calling editor-in-chief Jan Helin). October 17, again, concerns the rules, the intent of Juholt, and Sundbyberg, where Juholt held a meeting with his fellow party members. October 18 was about cheating and the form Juholt filled out when requesting the allowance. Finally, Image 2 ends with October 19, whose terms mention netroots and politometern, both of which are portals for political blogs, as well as the Swedish Radio.

Image 2: The period covering a later part of the affair, October 15 - 19, 2011.

See http://www.gavagai.se/reports/juholt-october-2011/ for the complete list of salient terms associated with Håkan Juholt for the period October 1 – 19. Legend to the list: a blue term is new on the list, a red term is one whose association with Juholt is weaker than it was before, and a green term is one whose association with Juholt is stronger than it was before.
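The two bookkeeping steps described in this post, picking out the terms that are new relative to the previous day and colouring terms according to the legend, can be sketched as follows. This is an illustration only: the function names and the association strengths are hypothetical, and the post does not say how our associations engine computes its scores.

```python
def new_terms(today, yesterday):
    """Terms associated today that were not in yesterday's list, in today's order."""
    previous = set(yesterday)
    return [term for term in today if term not in previous]

def legend_colour(term, strength_now, strength_before):
    """Colour per the legend: blue = new, green = stronger, red = weaker, None = unchanged."""
    if term not in strength_before:
        return "blue"
    if strength_now[term] > strength_before[term]:
        return "green"
    if strength_now[term] < strength_before[term]:
        return "red"
    return None

# Hypothetical association strengths, invented for this example.
before = {"budget": 0.8, "apartment": 0.3}
now = {"budget": 0.5, "apartment": 0.6, "press conference": 0.4}

fresh = new_terms(list(now), list(before))
colours = {term: legend_colour(term, now, before) for term in now}
```

Here "press conference" is the only new term, "apartment" has strengthened, and "budget" has weakened.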

New words in New Text

New Text is what we like to call the sort of spontaneous non-edited material we spend much of our time processing. We contrast this primarily with traditional text from editorial sources. There are interesting differences between new text and traditional text, and this has been the subject of much debate in philological, sociological, and to some extent even computational circles. Much of what has been said is interesting, much is pure piffle, and we have made our own pronouncements about what sort of changes we believe are ahead (this one pronouncement is in Swedish). We expect we will have reason to return to this discussion.

Two of the most obviously noticeable things about new text as compared to traditional are its lexical creativity and its proof-reading sloppiness. From the point of view of a text analysis processor they amount to much the same thing: a constant influx of new tokens, never seen before. Newspeak, lolspeak, l33tspeak, new topics, puns, lesser used sociolects and idiolects, misspellings, mistypings, and teenage angst all contribute to a vast and vastly growing symbol table.

Image 1: Number of unique words, as a function of time, in Tweets and newsprint

We are perfectly happy to cope with that sort of growth! This is one of the underlying principles of our design – not to be fazed by new and unexpected usage. An example: in a conversation, if your counterpart mispronounces a central term (or speaks an unexpected dialect or variety of the language) or uses a synonym which you didn’t know before, it might throw you the first time. Next time around, you cope with it. It might annoy you a few times but eventually – after only a handful of observations – you will be habituated to the pronunciation or the synonym. You will not be retranslating the new observation back to your previous knowledge of the world – some perceptual and lexical process in your language analysis system simply notes that this concept appears to be subject to some variance. This happens without retraining or recompilation of your lexicon: you don’t stop conversation to figure it all out. (Or, well, you shouldn’t. If you do so often, you will eventually lose friends.) This is the way we believe things should be done!

Here are some numbers we got out of our text collections (see Image 1 above). Take two years of newsprint from a reputable newspaper (in this case, the first year from a major US daily, and the second year from a major UK daily). That’s about 100 Mwords in total, and about 100 kwords per day. The first few days, most words are new, but it settles pretty rapidly into a steady 100-200 new tokens per day. (The switch across the Atlantic probably contributes to slightly more of those, but not too noticeably.) In itself, an argument for a learning system!

Now, as comparison, take two months of tweets on various topics in English. (Well, mostly in English. Mixed-language tweets are in our test material. We don’t want to take them out – that’s the way the world is.) That’s more than 1 Gwords, working out to about 20 Mwords a day. And about 200 000 new words. Per day. Try to keep up with that, manually! We firmly believe the architecture of the system to handle this needs to view this sort of data variance as normal, not something to meet by quick hacks or filtering. Ethersource is built with this in mind.
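For readers who want to run the same kind of count on their own collections, the measurement is simple to sketch. The snippet below is a minimal illustration with a hypothetical function name, a naive whitespace tokenizer, and made-up input; it is not the tooling used to produce the numbers above.

```python
def new_tokens_per_day(daily_texts):
    """For each day, in order, count the tokens not seen on any earlier day."""
    seen = set()
    counts = []
    for text in daily_texts:
        todays = set(text.lower().split())
        fresh = todays - seen
        counts.append(len(fresh))
        seen |= fresh
    return counts

# Three made-up "days" of text; the misspelling and the slang both count as new tokens.
days = ["the cat sat", "the cat sat down", "teh cat sat lol"]
growth = new_tokens_per_day(days)
```

Note that a misspelling such as "teh" is just another new token to the counter, which is exactly why sloppy text and creative text look alike from the processor's point of view.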

And the future is near: what do you think will happen when language analytics moves to processing speech data as well as text? Do you believe the number of tokens in the data stream will converge to a smaller number?

Designing for scalability and other good things

Text analytics involves watching data, and most modern applications of text analytics involve vast and vastly increasing amounts of spontaneous and non-edited human-generated text. Any realistic model of human language in use must handle incoming data streams of dimensions that only a few years ago were considered intractable. This means that memory model design is central to effective processing: a naive model for term-term or term-document relations will grow for each new document, each new token, each new observed item of interest. This growth never stops.

To cope, most industrial text analytics implementations use various sampling, compression, encoding, or compilation algorithms to manage the potentially explosive growth of their respective memory models. To update and maintain the model, periodic recomputation is typically necessary, and extracting meaningful information from most models involves further generalisation and post-processing computation. If the model explodes in size, you can throw more hardware at it and maybe do some informed sampling, and you can still do what you set out to do. Until it explodes again. In the meantime you will have what you need, and if you set your parameters right after training you may have data of quite useful and high quality.

But there is a better way to do it. We are very pleased with the design of Ethersource, the technology we have developed. For Ethersource, the memory model and the processing model are identical. Ethersource will not blow up in your face as data comes in, will not require demanding computation to keep its semantic model up to date, and will deliver semantic relations in real time, on-line, without recomputation or postprocessing of the aggregated data.
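As an illustration of what a model with these properties can look like, here is a minimal sketch in the style of random indexing, a published technique in which every word has a context vector of fixed width that is updated in place as text arrives. This post does not say how Ethersource is implemented, so treat the sketch as an assumption-laden example of the general idea; `DIM`, `NONZERO`, and the class name are all hypothetical.

```python
import hashlib
import random

DIM = 64       # fixed width of every context vector (hypothetical value)
NONZERO = 4    # number of +1/-1 entries in each index vector (hypothetical value)

def index_vector(word):
    """Sparse pseudo-random +1/-1 vector, derived deterministically from the word."""
    seed = int.from_bytes(hashlib.sha256(word.encode("utf-8")).digest()[:8], "big")
    rng = random.Random(seed)
    positions = rng.sample(range(DIM), NONZERO)
    return {pos: rng.choice((-1, 1)) for pos in positions}

class FixedWidthModel:
    """Accumulates, per word, the index vectors of its neighbours. No recomputation step."""

    def __init__(self):
        self.context = {}  # word -> list of DIM accumulated values

    def observe(self, tokens, window=2):
        for i, word in enumerate(tokens):
            vec = self.context.setdefault(word, [0] * DIM)
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    for pos, sign in index_vector(tokens[j]).items():
                        vec[pos] += sign

model = FixedWidthModel()
model.observe("the cat drinks milk".split())
width_before = len(model.context["cat"])
model.observe("a brand new zorgle appears".split())  # unseen tokens arrive
width_after = len(model.context["cat"])
```

A never-before-seen token such as "zorgle" adds one new row of fixed width; no existing vector changes shape and nothing has to be rebuilt.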

The graph given here (taken from our more comprehensive paper on memory model growth) serves to illustrate the orders of magnitude in difference between two vanilla memory models and the Ethersource model, all three designed to relate words to words based on their distribution. At 11 000 000 Tweets, a word-by-word memory model requires 190 times the number of matrix cells used by Ethersource. At the same number of Tweets, a word-by-document matrix would be 5 500 times larger than the representation used by Ethersource.

Image 1: A log-linear graph of memory model growth.
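To get a feel for ratios of this size, the arithmetic can be sketched as below. The assumption, which this post does not state, is that all three models are dense matrices over the same vocabulary. The vocabulary size and vector width are purely illustrative values chosen so that the quoted ratios come out; they are not figures reported here.

```python
def cell_counts(vocabulary, documents, width):
    """Matrix cells needed by three word-representation layouts (dense, same vocabulary)."""
    return {
        "word_by_word": vocabulary * vocabulary,
        "word_by_document": vocabulary * documents,
        "fixed_width": vocabulary * width,
    }

# 11 000 000 documents is from the text; the other two numbers are illustrative assumptions.
cells = cell_counts(vocabulary=380_000, documents=11_000_000, width=2_000)
ratio_word = cells["word_by_word"] / cells["fixed_width"]
ratio_doc = cells["word_by_document"] / cells["fixed_width"]
```

Under these assumptions the word-by-word ratio is simply vocabulary divided by width, and the word-by-document ratio is documents divided by width, so both keep growing as data comes in while the fixed-width layout grows only with the vocabulary.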

Our parsimonious memory footprint is not the effect of a compression operation – it is an inherent aspect of the same design which makes Ethersource dynamic, robust, and effective. We want to stress the importance of having an informed design of the knowledge representation to begin with, a memory model which is scalable and bounded, and a processing model which is habitable in the face of varied, multilingual, spontaneous, unedited data. We will return to the importance of design and to these other central characteristics of the technology in coming blog posts!

Gavagai performs Open Source Intelligence analysis of the threat level towards Lars Vilks

In 2007, the artist Lars Vilks upset a large number of people by depicting the Islamic prophet Muhammad as a roundabout dog. In March 2010, eight people were arrested for plotting to murder Vilks.

We followed Vilks in Swedish social media during the spring of 2010, using a metric called the Violence Propensity Index (VPI) to quantify the expressions of violent language targeting Vilks. Image 1 illustrates how events unfolded.

Image 1: The Violence Propensity Index (VPI) for Lars Vilks in Swedish social media in May 2010. Note the rise in VPI on the days before the two attacks on Vilks.

On May 11, 2010, Vilks was assaulted while giving a lecture on free speech at Uppsala University. A few days later, on May 15, Vilks’ house was attacked by arsonists. Note the significant rise in VPI on the day before each attack! What’s more: a public appearance by Vilks planned for May 4 was cancelled. There’s a rise in VPI on May 3, too!

Now, we’re not claiming that the bloggers did it. What we do say, however, is that the attitude towards a given subject, as expressed in on-line social media, may well reflect the attitudes at large in a population, including people who are about to take action and externalize their opinions. The VPI is a means to detect (weak) signals of violent chatter, and as such, may facilitate an early warning pertaining to targets at risk.
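How one might turn a daily index into an early-warning flag can be sketched as follows. This is not how the VPI is computed or thresholded (the post does not say); it is a generic trailing-window rule with hypothetical parameters and made-up numbers, shown only to make the idea of "a significant rise" operational.

```python
from statistics import mean, pstdev

def flag_rises(series, window=7, k=2.0):
    """Indices where a value exceeds the trailing-window mean by more than k std devs."""
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        threshold = mean(history) + k * pstdev(history)
        if series[i] > threshold:
            flagged.append(i)
    return flagged

# Made-up daily index values, not actual VPI measurements.
daily_index = [1.0, 1.2, 0.9, 1.1, 1.0, 1.1, 0.9, 3.5, 1.0, 1.1]
alerts = flag_rises(daily_index)
```

Only the day with the jump to 3.5 is flagged; the days after it are compared against a window that now includes the spike, so they do not trigger.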

The killing of Mashaal Tammo through the eyes of Arabic social media

In this post, we show that Ethersource can be used:

  1. to monitor Arabic social media,
  2. to detect violent on-line chatter, and
  3. to identify the real-world events underlying the resulting signal.

On the evening of Friday, October 7, 2011, Kurdish opposition politician and founder of the Kurdish Future Movement Party Mashaal Tammo was shot dead by masked men in his home in northeastern Syria. His killing was soon attributed to the regime of Syria. The next day, Saturday, October 8, the funeral party for Tammo, with 50,000 to 100,000 attendants, turned into the largest gathering of protesters since the start of the uprising seven months prior. Syrian security forces intercepted the crowd and shot at least 5 people dead, injuring numerous others.

We’ve used Ethersource to monitor Syria in Arabic social media for quite some time. Image 1, below, illustrates the on-line violent chatter pertaining to Syria between Thursday, October 6, and Sunday, October 9, 2011; the weekend when Tammo was killed. Image 1 is annotated with the time of the killing of Tammo, as he was reportedly attacked in the evening of the 7th (Syria being in a time zone one hour ahead of the time scale of the graph), and the approximate time of his funeral. What is striking is the surge in chatter after the demise of Tammo, caused by reactions by people active in social media. At this point in time, the steep rise and the high level of violent expressions indicate that a physical manifestation related to the killing of Tammo is likely. We call it a crowd-induced event: a process sparked by a real-world event, and then fueled by an on-line crowd in a way that increases the possibility of a physical reaction to the initial event. After the attack on the funeral party, the levels of violent chatter increased even more.

Image 1: Violent chatter in Arabic social media with respect to Syria for the weekend when Kurdish opposition politician Tammo was gunned down.

Ethersource facilitates the verification of a signal by allowing the operator to inspect the individual documents contributing to it. Image 2, below, shows three screenshots representing some of the sources underlying the signal on Friday, October 7, and Saturday, October 8. The translations from Arabic to English were made using Google Translate.

Image 2: Screenshots of some of the sources contributing to the on-line violent expressions. The translations were made with Google Translate.

To sum up: By using Ethersource, we are able to aggregate the attitudes expressed in on-line media, as they are emitted, with respect to a given entity, in a given language, thus constructing a view of attitudes over time. The view facilitates the identification of time periods in which on-line activity warrants our attention. Ethersource, then, provides access to the documents contributing to the aggregated attitudes in the time period under scrutiny.
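The aggregate-then-inspect workflow summarised above can be sketched in a few lines. The record layout, field names, and scores below are hypothetical and are not Ethersource's data model; the point is only that each time bucket keeps both the aggregated score and references to the documents behind it.

```python
from collections import defaultdict
from datetime import datetime

def aggregate(documents):
    """Group scored documents by hour; keep both the summed score and the document ids."""
    buckets = defaultdict(lambda: {"score": 0.0, "doc_ids": []})
    for doc in documents:
        hour = doc["time"].replace(minute=0, second=0, microsecond=0)
        buckets[hour]["score"] += doc["score"]
        buckets[hour]["doc_ids"].append(doc["id"])
    return dict(buckets)

# Made-up documents with made-up attitude scores.
docs = [
    {"id": "a", "time": datetime(2011, 10, 7, 20, 5), "score": 0.4},
    {"id": "b", "time": datetime(2011, 10, 7, 20, 40), "score": 0.7},
    {"id": "c", "time": datetime(2011, 10, 7, 21, 10), "score": 0.9},
]
view = aggregate(docs)
peak_hour = max(view, key=lambda hour: view[hour]["score"])
peak_docs = view[peak_hour]["doc_ids"]  # the documents an operator would inspect
```

Keeping the document ids alongside the aggregate is what lets an operator go from a peak in the curve straight to the texts that produced it.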

So, with Ethersource, we can follow any target with respect to any attitude in any language. On top of that, Ethersource continuously learns from the language it is exposed to.

 

Beyond “Positive” and “Negative” – Gavagai looks beyond the first take at sentiment analysis

“Sentiment analysis” as a research field took its time to become an overnight success. Long a concern of behavioural psychologists and philologists, it entered the stage as an application for language technology and information access technology about seven years ago, at the 2004 AAAI Spring Symposium on “Exploring Attitude and Affect in Text: Theories and Applications”. The commercial potential is self-evident and straightforward, the pickings are easy and marketable, and the technology is fun for engineers to play with (way more fun than plain old topical search engines). The field has rightly exploded with commercial activity.

Some Random Emotions

Positive or negative sentiment? (Ill. by Joseph Clement Coll)

So far, applications for sentiment analysis have focussed on detecting and aggregating positive and negative sentiment in text, especially social media. But there is so much more than polarity to work with!

Here at Gavagai we have seen sentiment analysis as one of the many application areas for our base technology. But Ethersource is capable of much more than distinguishing “positive” and “negative” affect. Our poles of sentiment are tailored to the needs of our current customers, and our processing model is built to accommodate rapid change in interest. Ethersource currently tracks sentiments such as “uncertainty”, “violence”, “sexy”, “worry”, and “financial volatility”, and new sentiments can be added in minutes. This is not something we built for marketing purposes: this is based on our view of how sentiment, opinion, mood, and attitude are expressed in text.

Thus – from our point of departure – we are very happy to note that the Research Breakout Session at this year’s Sentiment Analysis Symposium – a convention for practitioners in the field – specifically focusses on questions of “Beyond Positive and Negative”.

We are preparing a longer technology paper on the topic, but in the meanwhile, do read up on and make note of what the good people are saying at the symposium! Boredom. Curiosity! Uncertainty? Skepticism. Envy. Enthusiasm! Remember all those other emotions!

How are you Sweden? Happier AND more uncertain during weekends.

We’re using Ethersource to monitor the mood of the Swedish blogosphere in terms of positivity, negativity, uncertainty, and an index we call Positivity Propensity Index (PPI).

The graphs below show two particularly interesting things.

  1. Positivity (as in PPI) is cyclic on a weekly basis; Swedes are happier during weekends (Image 1). We all knew this, but it’s good to get it in ink.
  2. At the same time, Swedes are more uncertain during Fridays, Saturdays, and Sundays, than they are during the rest of the week (Image 2).

Image 1 shows the positivity, negativity, and PPI for a three week period in 2011. Note the rise in PPI and the decline in absolute positivity and negativity during the weekends (red circles). The decline is a direct result of people blogging less during weekends, especially during weekends with good weather, as was the one on October 1-2. The rise in PPI (blue curve) indicates that people are, in fact, saying kinder things and being more positive.
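Why a relative index can rise while both absolute counts fall is easy to show with a toy calculation. The definition below, positive mentions as a share of all polar mentions, is an assumption made for illustration; this post does not give the actual formula for the PPI, and the counts are invented.

```python
def propensity(positive, negative):
    """Share of positive mentions among all positive and negative mentions."""
    total = positive + negative
    return positive / total if total else 0.0

# Made-up counts: fewer posts overall on the weekend, but a larger positive share.
weekday = {"positive": 600, "negative": 400}
weekend = {"positive": 350, "negative": 150}

ppi_weekday = propensity(**weekday)
ppi_weekend = propensity(**weekend)
```

Both absolute counts drop from weekday to weekend, yet the share of positive mentions goes from 0.6 to 0.7, which is the pattern a normalised index is designed to expose.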

Image 1: The positivity, negativity, and PPI in the Swedish blogosphere, September 22 through October 11, 2011.

Image 2 shows the uncertainty (labelled IFFY), positivity, and negativity for the weekend October 7 – 9, 2011. The graph clearly shows the early onset of uncertainty compared to those of positivity and negativity. It also gives away that uncertainty is relatively high during weekends. The same pattern holds for the weekends we’ve seen so far. Upon inspecting some of the blog posts underlying the uncertainty curve, it is evident that we Swedes are prone to ponder the big questions in life during the dark hours of the weekends.

Image 2: The expressed uncertainty during weekends is high, and the onset precedes that of the rise in positivity and negativity.

We’ll continue to monitor the Swedish blogosphere and look for other interesting bits of information. For instance, is the pattern of rise in PPI during weekends more pronounced during a particular time of the year? Do we express more uncertainty in the face of Christmas?