We presented “Understanding and Big Data: Ethersource as the Semantic Processing Layer in the Big Data Stack” at yesterday’s meet-up on the topic “Big Data & Predictive Modeling – What’s happening in Stockholm”. It was a great event, spawning interesting discussions with bright people. Thank you, Mikael at Klarna, for organizing the meet-up!
- This post gives examples of Ethersource’s learning capabilities.
- It gives examples of automatically learned topics and senses of the use of the term Al-Qaeda in English social media.
Ethersource is continuously exposed to massive text streams. On a given day, it sees millions of blog posts, tweets, and forum posts. And it learns. It gobbles up information much the same way a human picks up new ways of using new language constructs. Ethersource learns how the terms it reads are related to each other. It learns about topicality, and it learns about the different senses of the terms.
As an example, let’s have a look at what Ethersource has learned about Al-Qaeda over the past few days. Topicality-wise, the texts concerning Al-Qaeda are described by Ethersource using the following terms:
To us humans, possessing the background knowledge imposed on us in media over the past decade, these terms come as no surprise. They all make sense as describing Al-Qaeda. Ethersource, however, has learned these topics from scratch, without access to any prior knowledge.
Furthermore, Ethersource has discovered two distinct senses, or meanings, of the term Al-Qaeda, as it has been used in social media during the past couple of days.
- The first sense of Al-Qaeda was automatically labelled PKK. In this sense, Al-Qaeda is related to Turkish, terrorists, militants, and fighters.
- The second sense of Al-Qaeda was automatically labelled Syria. In this sense, Al-Qaeda is related to Iran, Libya, Turkey, Tunisia, and fighting.
Unsupervised topic detection and sense discovery are both inherent properties of the semantic representation at the core of Ethersource. This makes for a powerful tool for an analyst when forming an understanding of the use of target concepts, be it in brand management, Open Source Intelligence, or sudden swings in World Markets.
We conclude this post with the observation that Ethersource has recently learned a new synonym of “Obama”: Obameat.
Yesterday marked a year high in the number of people airing their concerns regarding Assange in terms of aggression, either toward Assange himself, the Swedish judicial system, or the possible intervention of the UK Government in order to extradite Assange to Sweden. The graph below illustrates that the steep rise in volume during the past 24 to 36 hours dwarfs most of the previous online activity. Although Assange is more or less inseparable from Wikileaks in that he is heavily associated with the organization, at the moment the public’s concern clearly lies with Assange himself.
- We use Ethersource to monitor usage of racist terminology in the Swedish blogosphere.
- We find that one of the largest demographic groups to use such terminology is young female bloggers.
- We demonstrate how we are able to cluster and profile users of racist terminology.
One of the many benefits of Ethersource is that it is not limited to the standard positive/neutral/negative sentiment palette; it can be used to analyze and monitor any type of textually manifested phenomenon. Previous examples in this blog include artist popularity, flu trends, aversive language, and positivity vs headache.
In this post, we report on some observations on using Ethersource to monitor racist expressions in the Swedish blogosphere.
The following image shows the frequency of occurrence of racist terminology in the Swedish blogosphere from late March to the end of May 2012. Obviously, racist terminology is a frequent everyday occurrence on Swedish blogs.
However, merely counting the frequency of occurrence of racist terminology is of limited usefulness for understanding what people say and mean, since there are many ways to use terminology. Some uses may signal ideological or political standpoints, but other uses may not (e.g. discussions about the terminology itself, such as the origin and appropriateness of various terms). Thus, only counting the frequency of occurrence of racist terminology in the blogosphere may lead to premature or misleading conclusions. We therefore also monitor negative or degrading usage of racist terminology, as well as aggressive or hateful usage. And there is a difference between counting frequencies and counting opinionated usage, as we can see in the image below, which shows frequency (in blue), degrading usage (in green), and aggressive usage (in red).
It is obvious that the total frequency of occurrence of racist terminology is much larger than the frequencies of degrading use and aggressive use. As a rough estimate, approximately 10% of the total number of posts containing racist terminology are negative or degrading, while approximately 5% are aggressive or hateful.
The general trends in these graphs are not of lasting value, since the time span is relatively short. What is interesting – and surprising – is the demographic profile of bloggers found in the two bottom lines. Since Ethersource enables an analyst to retrieve individual blog posts which contain a given target (in this case, racist terminology), it is possible to further analyze the material. Looking at the blog posts that use racist terms in degrading ways, we find that roughly 25% are written by young female bloggers who write about their own lives. Perhaps even more surprising, around 10% of blog posts using racist terms in aggressive ways are written by these young females. This is a surprising discovery, considering that the topical content of these blogs revolves around everyday events, lifestyle, and fashion.
Demographic clustering and stylometric profiling
The noteworthy observation above suggests that it may be interesting to look more closely also at the non-opinionated usage of racist terminology (i.e. the occurrences that are neither aggressive nor degrading). We do so by automatically clustering all the blog posts containing racist terminology during 2012. Always keeping the obvious risk of overgeneralizing in mind, we infer from manual inspection of the material that the four main clusters represent the following groups of bloggers:
Imagine that we for some reason could not inspect the material manually and therefore did not know the demographics of the clusters we found. In such cases, we can use stylometric profiling to characterize the stylistic differences between clusters, and based on these differences we can infer demographic information. As an example, consider the following comparison between the stylometric profile for the cluster containing the young female bloggers, and the stylometric profile for the cluster containing mainly political bloggers.
The comparison between these two stylometric profiles shows that the main stylistic differences between these two groups of bloggers (let’s call them group F for the young female bloggers and group P for the political bloggers) can be found in the following variables:
- Group F is more self-oriented, which indicates that this group talks mainly about things that happen to the author, stuff the author thinks or worries about, or things that the author does.
- Group F refers directly to the reader more often than does group P.
- Abstract vocabulary
- Group P tends to use more abstract and complex vocabulary than group F.
- Blog posts from group P contain more explicit temporal and spatial references than do posts from group F.
These differences suggest that authors in group F (the young female bloggers) write mainly from a subjective point of view, while authors in group P (the political bloggers) adopt a more factual perspective. Based on such differences, we may formulate hypotheses about the demographics of these two groups: since the one group writes from a more personal and immediate perspective, its members can be assumed to be younger and more personally engaged in their narration than the other group. This characterisation of author style is actually more salient than the objective notion of author age and gender, since writing style and authoring background are more interesting for understanding blog posts than age, gender, or other demographic variables.
The analysis and discussion above serve as an illustrative example of how stylometric profiling correlates well with human intuition about demographic clustering, and of how such profiles may serve as explanatory constructs for a demographic clustering solution. We conclude this blog post with the observation that the combination of attitude analysis, clustering, and profiling provides a very powerful framework for analysis of online content.
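To make the idea of a stylometric profile more concrete, here is a minimal sketch of how a few such variables could be computed. It is not the profiling used in Ethersource: the function name `stylometric_profile`, the English pronoun lists, and the use of mean word length as a crude stand-in for vocabulary complexity are all assumptions made for illustration (the blog material discussed here is Swedish).

```python
import re

# Hypothetical English word lists, used only for illustration.
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}
SECOND_PERSON = {"you", "your", "yours", "yourself"}

def stylometric_profile(text):
    """Return a few simple stylistic rates for a piece of text."""
    # Keep runs of letters only (no digits, underscores, or punctuation).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return {"self_reference": 0.0, "reader_reference": 0.0, "mean_word_length": 0.0}
    n = len(tokens)
    return {
        # Share of tokens that refer to the author (self-orientation).
        "self_reference": sum(t in FIRST_PERSON for t in tokens) / n,
        # Share of tokens that address the reader directly.
        "reader_reference": sum(t in SECOND_PERSON for t in tokens) / n,
        # Crude proxy for vocabulary complexity.
        "mean_word_length": sum(len(t) for t in tokens) / n,
    }

# Two invented example sentences, one personal and one formal.
personal = stylometric_profile("I bought my new shoes today, you would love them!")
formal = stylometric_profile(
    "The parliamentary committee postponed deliberations regarding immigration legislation."
)
```

Comparing `personal` and `formal` variable by variable mirrors the kind of profile comparison described above, although a real profile would use many more variables and far more text per cluster.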
In this post, we confirm that Loreen is well placed to win the popular vote in the Eurovision Song Contest final 2012.
- We use Twitter to measure the popularity of the contestants in ESC 2012.
- When scaling with Twitter penetration, Sweden gets the highest relative popularity score.
- This is in line with current betting odds, which unanimously rank Sweden as the most likely winner.
- Gavagai has previously made accurate forecasts of the distribution of the popular vote in the national ESC final.
We have previously shown in this blog that Ethersource monitoring of on-line sentiment can predict the popular vote in certain high-profile media events, such as the national Eurovision Song Contest. In this post, we report on some observations on using Ethersource to measure the popularity of the contestants in the international Eurovision Song Contest, based on analysis of expressions of popularity on Twitter. The following image shows the relative popularity scores of the participating countries.
It should be obvious to anyone following the pre-contest speculations about who will win the ESC 2012 that the proportions of popularity in this image do not correlate with current betting odds for the ESC final (the current odds can be found at any betting site). The image shows Ireland and the UK as the most popular contributions in the ESC final (they are ranked 11th and 5th in the current betting odds). One reason for this discrepancy can be that popularity and betting odds do not refer to the same type of measurement; popularity refers to population-wide opinion, while betting odds are estimates of who will win the actual contest (which is determined both by popular and jury votes). Other reasons are the issues identified in commentaries on other recent attempts to predict election votes based on sentiment analysis of the Tweet stream:
- Twitter users (and users of other social media) do not constitute a perfect sample of the population, which means that measurements based on Twitter may not be representative for the population as a whole.
- Twitter is a perfect medium for marketers and campaigns, which makes the analysis sensitive to ad-bots and automated Twitter campaigns.
These concerns are of course valid also for the present scenario. However, even more important when comparing measurements based on Twitter analysis across different countries are the following issues:
- There is a huge difference in population size between the European countries: Russia has a European population of more than 100 million, while Iceland has a population of a mere 300 000 inhabitants.
- The Twitter penetration (i.e. proportion of the population that uses Twitter) is very different for different countries. In the present scenario, where we measure expressions of popularity on Twitter, it means that some countries may get high popularity scores merely because a comparatively large proportion of the population in that country uses Twitter (people tend to promote their own country’s entry in the ESC).
It is somewhat difficult to find recent and reliable estimates of the Twitter penetration per country, but not so recent studies show that the Netherlands, Turkey, UK, and Ireland top the list for Twitter penetration in Europe. Perhaps this explains the results we see in the image above? Scaling the popularity scores for each country by the estimated number of Twitter users in that country produces the following image:
When scaling with Twitter penetration, Sweden gets the highest relative popularity score. This is in line with current betting odds, which unanimously rank Sweden as the most likely winner. However, the other countries that receive high normalized popularity scores do not correlate with odds rankings: Greece has the second highest popularity score (ranked 14th place in the odds rankings), followed by Denmark (ranked 8th place), Ireland (11th), and Iceland (7th). These discrepancies may be due to the issues with non-representativeness and Twitter penetration discussed above. We may also add the following issues:
- The activity level of the Twitter population in some countries may not correspond with the Twitter penetration; Twitter users may be more active in some countries than others.
- The interest in the ESC may be higher in certain countries than others, thus leading to more Tweets about the contestants from those countries.
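The scaling step described above, dividing each country’s popularity score by its estimated number of Twitter users, can be sketched as follows. This is only an illustration under stated assumptions: the function name `normalize_by_twitter_users` is hypothetical, and the countries and numbers are invented placeholders, not the data behind the images.

```python
def normalize_by_twitter_users(mention_counts, twitter_users):
    """Scale raw popularity counts by the estimated Twitter user base per country."""
    per_user = {
        country: mention_counts[country] / twitter_users[country]
        for country in mention_counts
        if twitter_users.get(country, 0) > 0  # skip countries without an estimate
    }
    total = sum(per_user.values())
    if total == 0:
        return {}
    # Express each score as a share of the total so the scores are comparable.
    return {country: score / total for country, score in per_user.items()}

# Invented placeholder figures, for illustration only.
raw_counts = {"Country A": 9000, "Country B": 3000, "Country C": 600}
users = {"Country A": 6_000_000, "Country B": 1_000_000, "Country C": 100_000}

relative = normalize_by_twitter_users(raw_counts, users)
ranking = sorted(relative, key=relative.get, reverse=True)
```

With these placeholder figures the country with the fewest raw mentions ends up ranked first after scaling, which is the kind of reordering seen between the two images. The sketch does not address the remaining issues listed above, such as differing activity levels.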
We conclude this post with the observation that Loreen seems to be the likely winner of the popular vote in the ESC final 2012. We also conclude that attempting to model population-wide opinions based on Twitter analysis is a non-trivial task that requires more than merely counting word frequencies.
As we have previously discussed on this blog, Ethersource constantly and continuously learns new terminology by reading what is written on the Internet. As an example of how Ethersource picks up even weak linguistic signals, we noticed recently that Ethersource suggested the word “tutilurfräs” as a very positive Swedish term. None of us had ever encountered the term “tutilurfräs” before. We looked up the source of this linguistic invention, and found that it originates from a tweet by Swedish punk icon Kajsa Grytt, where she writes that:
Å så Pelle!! Å så Hives! Vilket tutilurfräs!! Jag tycker de är genialiska. Blir helt jävla lycklig av det bandet.
— Kajsa Grytt (@KajsaGrytt) March 30, 2012
A (somewhat creative) translation in English would be something like: “Oh Pelle! Oh Hives! What tutilurfräs!! I think they are genius. That band makes me absolutely happy.”
Quite obviously, Ethersource is correct in its understanding that “tutilurfräs” is a very positive word.
There are two lessons to be drawn from this example:
- If you do sentiment analysis in Swedish on Twitter and your model does not automatically learn new terminology, you should re-train or update your model to include the word “tutilurfräs”.
- If you invent a completely new word and start blogging or tweeting about it, Ethersource will learn it. It is true that in space, no one can hear you scream, but on the Internet, even if you whisper Ethersource will understand you.
Despite the fact that the Swedish part of the Eurovision Song Contest final was broadcast live, as a TV viewer it was impossible to get a sense of just how popular the artists were at a given point in time. Having access to Ethersource made sifting out meaningful blog posts and Tweets in real-time a breeze! Below are two graphs outlining, minute-by-minute, the popularity of the two top contestants as expressed in Swedish on-line social media for the day of the final (click the image for a larger version). Note that Loreen’s popularity score reaches higher during her performance than does Danny’s. In fact, looking at the scale and the contents of the two graphs, it is clear that the expressions of popularity towards Loreen are consistently higher throughout the day.
The timing information for the performances of the artists is available at the official web site of the contest.
Sweden’s contribution to the Eurovision Song Contest this year was decided in yesterday’s final with ten contestants. The winner of the 2012 Swedish music festival Melodifestivalen is Loreen, with the song “Euphoria”, which landed almost 700 000 call-in votes from the at-home TV audience.
Using the Ethersource technology, Gavagai followed the on-line sentiment towards all contestants throughout the lead-up to the event. We are pleased to note that Gavagai’s forecast of the results, which was based on expressions of appreciation in blog posts and tweets and published in the paper edition of Svenska Dagbladet (SvD) on the morning before the event, was close enough to the actual outcome of the viewers’ votes to not only correctly predict the three top spots from the starting field of ten contenders but also to get their vote percentages pretty much right!
Talk comes cheap – demonstration counts!
This was fun! We will make similar public opinion forecasts in coming analyses, and return to the observations we can make from this event and these and similar data.
Swedish bloggers and tweeters are increasingly chattering about the seasonal influenza (also covered in an earlier blog post). The trend of the flu signals captured by the Ethersource barometer is clearly on the rise. This should come as no surprise since we are, in fact, looking at the seasonal flu. The interesting thing here is how well the barometer reflects what is reported by the Swedish Institute for Communicable Disease Control (SMI) in their weekly reports. Those reports are based on input from sentinels and laboratories, and by necessity, they lag behind in time: the current report is for the period of February 6 – 13, which is a week old by now.
Of the approximately 80 000 social media posts that matched our criteria for being included in the barometer on a given day in the period depicted in the image below, only a small fraction concern the flu. The flu isn’t contagious via the web, but information is. Keep your eyes open, and report symptoms to influenzanet to allow scientists to better stay on top of the flu spreading throughout Sweden.
Meanwhile, as we keep our eyes open for the spring sun, we’ll make sure to monitor the flu barometer and take note of any declining trends. We’ll keep you posted.
- This post digs a bit deeper into Ethersource.
- We discuss the problems of distance concentration and semantic singularity.
- We argue that Ethersource is not susceptible to these problems.
As we have previously discussed in this blog, the number of unique words in social media grows at a rate that far exceeds what we are normally used to when working with collections of more traditional texts. To recapitulate, the lexical variation and growth in New Text is simply astounding; there is a constant and continuous influx of new tokens. We have also previously discussed how Ethersource is designed to handle such growth. The memory/processing model (we don’t make a distinction between these) of Ethersource does not explode in size as we add (lots and lots of) new data.
To repeat the message: if your data is highly dynamic, you’d better have a model that can handle variation.
Ethersource is based on hyperdimensional computing, which means that all operations in Ethersource are performed in fixed-dimensional spaces of very high dimensionality. Such representations have a number of very attractive features (see Kanerva’s paper in the references below for more details). One of the most useful properties of hyperdimensional representations is that the dimensionality is unaffected by the size of the data. This is the reason Ethersource seamlessly and unproblematically can handle such rapidly growing vocabularies as those encountered in social media (and in other kinds of streaming data sources).
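As a rough illustration of why a fixed-dimensional representation does not grow with the vocabulary, here is a minimal sketch in the style of random indexing, one technique from the hyperdimensional computing family. Whether Ethersource works this way is an assumption on our part for the sake of the example; the names `index_vector` and `train`, and the values of `DIM` and `NONZERO`, are hypothetical.

```python
import random
from collections import defaultdict

DIM = 1000     # fixed dimensionality, chosen arbitrarily for this sketch
NONZERO = 10   # number of +1/-1 entries in each sparse index vector

def index_vector(word):
    """Return a sparse random vector for `word` as {position: +1 or -1}."""
    rng = random.Random(word)  # seeding with the word makes the vector reproducible
    positions = rng.sample(range(DIM), NONZERO)
    return {pos: rng.choice((-1, 1)) for pos in positions}

def train(sentences, window=2):
    """Accumulate, for each word, the index vectors of its neighbouring words."""
    context = defaultdict(lambda: [0] * DIM)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    for pos, sign in index_vector(tokens[j]).items():
                        context[word][pos] += sign
    return context

model = train([
    "the cat sat on the mat",
    "a brand new word tutilurfräs appeared today",
])
```

A never-before-seen word such as “tutilurfräs” simply gets its own vector of length `DIM`; no existing vector has to be resized or recomputed when the vocabulary grows.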
Of central importance in Ethersource (and in other data mining systems) is the notion of similarity. Applications like social media monitoring/sentiment analysis, association analysis, etc, all boil down to questions of the type “how similar is this data point to that”? Association analysis in particular is an example of nearest neighbor search, in which the task is to find the data points that are most similar to a given query data point. Nearest neighbor search is a core functionality in many data mining applications. Examples include semantic search, pattern recognition, recommendation systems, etc. All these applications (and many more), depend on nearest neighbor searches in high-dimensional spaces.
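For readers who prefer code, a brute-force nearest neighbor search over word vectors can be sketched in a few lines. Cosine similarity is used here as one common choice of similarity measure; the source does not say which measure Ethersource uses. The function names and the tiny three-dimensional toy vectors are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def nearest_neighbors(query, vectors, k=3):
    """Return the k items most similar to `query`, most similar first."""
    scored = [
        (cosine(vectors[query], vec), name)
        for name, vec in vectors.items()
        if name != query
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

# Invented toy vectors, for illustration only.
vectors = {
    "flu": [0.9, 0.1, 0.0],
    "fever": [0.8, 0.2, 0.1],
    "cough": [0.7, 0.3, 0.0],
    "guitar": [0.0, 0.1, 0.9],
}
neighbors = nearest_neighbors("flu", vectors, k=2)
```

A linear scan like this is only practical for small collections; large systems typically rely on indexing or approximate search, which is where the issues discussed in this post become relevant.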
Enter the phenomenon of distance concentration and the perils of the semantic singularity.
Imagine what the impact would be for systems that rely on the notion of similarity if this notion itself became meaningless. Clearly, not good. But is this really something we need to worry about? Could it ever happen?
Science fiction-like as it may sound, this is exactly what the phenomenon of distance concentration refers to. Essentially, this is a situation in which the distance from a query data point to the nearest neighbor approaches the distance to the farthest neighbor. In such a situation, the notion of similarity becomes useless because all distances are the same. Several recent papers (see below for references) have pointed out that this situation might actually occur in certain cases where the dimensionality of the data increases.
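The effect is easy to reproduce with synthetic data. The following minimal sketch draws uniformly random points and compares the distance from a query point to its nearest and farthest neighbor as the dimensionality grows. Uniform random data is a worst case with high intrinsic dimensionality, not a model of language data, and the function name and parameters are invented for this illustration.

```python
import math
import random

def nearest_to_farthest_ratio(dim, n_points=200, seed=0):
    """Ratio between the nearest and farthest distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return min(distances) / max(distances)

# A ratio close to 1 means that nearest and farthest neighbors are
# almost equally far away, so "nearest" carries little information.
ratios = {dim: nearest_to_farthest_ratio(dim) for dim in (2, 10, 100, 1000)}
for dim, ratio in ratios.items():
    print(f"dim={dim:5d}  nearest/farthest={ratio:.3f}")
```

In low dimensions the ratio is small, meaning the nearest neighbor is much closer than the farthest one. As the dimensionality increases, the ratio creeps towards 1, which is the concentration of distances described above.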
Remember the observation about the vocabulary growth of social media? This is a hallmark example of data with continuously increasing dimensionality. Thus, not only do you need to worry about the processing cost when dealing with such data, but you also need to worry about your representation collapsing into semantic singularity. And to make matters even worse, it has been shown that certain types of dimensionality reduction and approximate nearest neighbor search techniques can further aggravate the problem of distance concentration.
If we operate in high dimensions with vast and vastly growing data sets streaming in, we should take this problem seriously.
In the case of Ethersource, we use hyperdimensional computing to ensure that the representation remains unaffected by the size of the data. This means that Ethersource is not at risk of distance concentration due to increasing dimensionality of the representation per se. However, as the attentive reader would no doubt be wondering, what about the growth of the intrinsic dimensionality? Is there no risk of a hyperdimensional representation getting “saturated”? That is, how can we be sure that there will always be enough room, locally, in the fixed-size hyperdimensional representation when there is a continuous inflow of data?
This would be a tangible problem if we were faced with data of high intrinsic dimensionalities. In such cases, the local neighbourhood of a data point can become saturated with new neighbours, thus rendering the notion of vicinity meaningless, and thereby collapsing into semantic singularity. However, Ethersource operates on a very special type of data, which has comparatively low intrinsic dimensionality (Karlgren et al. 2008).
Thus, exit the problem of distance concentration in Ethersource.
And anyway, as someone so wisely said, “forgetting is the key to a healthy mind”, and we certainly want Ethersource to stay healthy.
To end this rather technical post, we include an illustrative example of how similarities behave when adding more data in Ethersource. The following graph shows how the pairwise similarities between semantically related and semantically unrelated words remain stable as we add more data (in this case, up to some 2 billion words).
This is exactly how we want the model to behave; related words stay related, while unrelated words stay unrelated. It would definitely not be a good thing if we saw an increase in similarity between the unrelated words as we add more data, merely as an effect of adding more data. What could happen though is that two previously unrelated words suddenly become similar as an effect of new language use. This, however, is perfectly in order, since we want the similarities to reflect actual usage patterns rather than presumed ones. The fluctuations in the graph correspond to such fluctuations in language use.
Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan and Uri Shaft (1999) When Is “Nearest Neighbor” Meaningful? Proceedings of the 7th International Conference on Database Theory, 1999.
Ata Kabán (2011) On the distance concentration awareness of certain data reduction techniques. Pattern Recognition, 44 (2): 265-277.
Pentti Kanerva (2009) Hyperdimensional Computing: An introduction to computing in distributed representation with high-dimensional random vectors. Cognitive Computation, 1(2): 139-159.
Jussi Karlgren, Anders Holst and Magnus Sahlgren (2008) Filaments of Meaning in Word Space. Proceedings of the 30th European Conference on Information Retrieval, 2008.