On-line Activities Indicate Increasing Flu Trend.

Swedish bloggers and tweeters are increasingly chattering about the seasonal influenza (also covered in an earlier blog post). The trend of the flu signals captured by the Ethersource barometer is clearly on the rise. This should come as no surprise since we are, in fact, looking at the seasonal flu. The interesting thing here is how well the barometer reflects what is reported by the Swedish Institute for Communicable Disease Control (SMI) in their weekly reports. Those reports are based on input from sentinels and laboratories, and by necessity, they lag behind in time: the current report is for the period of February 6 – 13, which is a week old by now.

Of the approximately 80 000 social media posts that matched our criteria for being included in the barometer on a given day in the period depicted in the image below, only a small fraction concern the flu. The flu isn’t contagious via the web, but information is. Keep your eyes open, and report symptoms to influenzanet to allow scientists to better stay on top of the flu spreading throughout Sweden.

Meanwhile, as we keep our eyes open for the spring sun, we’ll make sure to monitor the flu barometer and take note of any declining trends. We’ll keep you posted.

People active in Swedish social media are increasingly concerned about the seasonal influenza. The image shows a seven day moving average for the past two months of the influenza signal. The annotations in the graph is the Ethersource anomaly detection algorithms at work: each flag indicate a point in time where the change in trend warranted our attention.

Artist Lars Vilks Attacked. Again.

At 6:45 pm, less than a minute after the news broke on Twitter, Ethersource picked up the first aversive signal relating to tonight’s attack on Lars Vilks. (We’ve covered him previously on this blog). This time, elements in the audience threw eggs at him during an evening lecture in Karlstad.

Of the major news outlets, the branch of Swedish Radio located in Karlstad was the quickest in publishing the news, putting it on their national web site at 7:43 pm (SR). The other players were roughly 30 to 45 minutes behind (DN, SVD, SVT). Having Ethersource doing real-time attitudinal analysis of the contents of Swedish Tweets would have brought the news to your attention even earlier.

Image 1: Aversive expressions related to Lars Vilks on February 21, 2012. The red circle shows the initial peak indicative of the egging event, targeting Vilks. The Tweet in Image 2 is the first one in the list of documents associated with the highlighted peak, easily accessible via the Ethersource GUI.

Image 2: The first Tweet picked up by Ethersource pertaining to egging event: "People in the audience just threw eggs at Lars Vilks and there was chaos"

 

 

Hyperdimensionality, semantic singularity, and concentration of distances

  • This post digs a bit deeper into Ethersource.
  • We discuss the problems of distance concentration and semantic singularity.
  • We argue that Ethersource is not susceptible to these problems.

As we have previously discussed in this blog, the number of unique words in social media grows at a rate that far exceeds what we are normally used to when working with collections of more traditional texts. To recapitulate, the lexical variation and growth in New Text is simply astounding; there is a constant and continuous influx of new tokens. We have also previously discussed how Ethersource is designed to handle such growth. The memory/processing model (we don’t make a distinction between these) of Ethersource does not explode in size as we add (lots and lots of) new data.

To repeat the message: if your data is highly dynamic, you’d better have a model that can handle variation.

Ethersource is based on hyperdimensional computing, which means that all operations in Ethersource are performed in fixed-dimensional spaces of very high dimensionality. Such representations have a number of very attractive features (see Kanerva’s paper in the references below for more details). One of the most useful properties of hyperdimensional representations is that the dimensionality is unaffected by the size of the data. This is the reason Ethersource seamlessly and unproblematically can handle such rapidly growing vocabularies as those encountered in social media (and in other kinds of streaming data sources).

Of central importance in Ethersource (and in other data mining systems) is the notion of similarity. Applications like social media monitoring/sentiment analysis, association analysis, etc, all boil down to questions of the type “how similar is this data point to that”? Association analysis in particular is an example of nearest neighbor search, in which the task is to find the data points that are most similar to a given query data point. Nearest neighbor search is a core functionality in many data mining applications. Examples include semantic search, pattern recognition, recommendation systems, etc. All these applications (and many more), depend on nearest neighbor searches in high-dimensional spaces.

Enter the phenomenon of distance concentration and the perils of the semantic singularity.

Imagine what the impact would be for systems that rely on the notion of similarity if this notion itself became meaningless. Clearly, not good. But is this really something we need to worry about? Could it ever happen?

Science fiction-like as it may sound, this is exactly what the phenomenon of distance concentration refers to. Essentially, this is a situation in which the distance from a query data point to the nearest neighbor approaches the distance to the farthest neighbor. In such a situation, the notion of similarity becomes useless because all distances are the same. Several recent papers (see below for references) have pointed out that this situation might actually occur in certain cases where the dimensionality of the data increases.

Remember the observation about the vocabulary growth of social media? This is a hallmark example of data with continuously increasing dimensionality. Thus, not only do you need to worry about the processing cost when dealing with such data, but you also need to worry about your representation collapsing into semantic singularity. And to make matters even worse, it has been shown that certain types of dimensionality reduction and approximate nearest neighbor search techniques can further aggravate the problem of distance concentration.

If we operate in high dimensions with vast and vastly growing data sets streaming in, we should take this problem seriously.

In the case of Ethersource, we use hyperdimensional computing to ensure that the representation remains unaffected by the size of the data. This means that Ethersource is not at risk of distance concentration due to increasing dimensionality of the representation per se. However, as the attentive reader would no doubt be wondering, what about the growth of the intrinsic dimensionality? Is there no risk of a hyperdimensional representation getting “saturated”? That is, how can we be sure that there will always be enough room, locally, in the fixed-size hyperdimensional representation when there is a continuous inflow of data?

This would be a tangible problem if we were faced with data of high intrinsic dimensionalities. In such cases, the local neighbourhood of a data point can become saturated with new neighbours, thus rendering the notion of vicinity meaningless, and thereby collapsing into semantic singularity. However, Ethersource operates on a very special type of data, which has comparatively low intrinsic dimensionality (Karlgren et al. 2008).

Thus, exit the problem of distance concentration in Ethersource.

And anyway, as someone so wisely said, “forgetting is the key to a healthy mind”, and we certainly want Ethersource to stay healthy.

To end this rather technical post, we include an illustrative example of how similarities behave when adding more data in Ethersource. The following graph shows how the pairwise similarities between semantically related and semantically unrelated words remain stable as we add more data (in this case, up to some 2 billion words).

This is exactly how we want the model to behave; related words stay related, while unrelated words stay unrelated. It would definitely not be a good thing if we saw an increase in similarity between the unrelated words as we add more data, merely as an effect of adding more data. What could happen though is that two previously unrelated words suddenly become similar as an effect of new language use. This, however, is perfectly in order, since we want the similarities to reflect actual usage patterns rather than presumed ones. The fluctuations in the graph correspond to such fluctuations in language use.

References

Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan and Uri Shaft (1999) When Is “Nearest Neighbor” Meaningful? Proceedings of the 7th International Conference on Database Theory, 1999.

Ata Kabán (2011) On the distance concentration awareness of certain data reduction techniques. Pattern Recognition, 44 (2): 265-277.

Pentti Kanerva (2009) Hyperdimensional Computing: An introduction to computing in distributed representation with high-dimensional random vectors. Cognitive Computation, 1(2): 139-159.

Jussi Karlgren, Anders Holst and Magnus Sahlgren (2008) Filaments of Meaning in Word Space. Proceedings of the 30th European Conference on Information Retrieval, 2008.