Weak signal synonym detection (in Swedish)

As we have previously discussed on this blog, Ethersource continuously learns new terminology by reading what is written on the Internet. As an example of how Ethersource picks up even weak linguistic signals, we recently noticed that Ethersource suggested the word “tutilurfräs” as a very positive Swedish term. None of us had ever encountered the term “tutilurfräs” before. We looked up the source of this linguistic invention and found that it originates from a tweet by Swedish punk icon Kajsa Grytt, where she writes that:



A (somewhat creative) translation in English would be something like: “Oh Pelle! Oh Hives! What tutilurfräs!! I think they are genius. That band makes me absolutely happy.”

Quite obviously, Ethersource is correct in its understanding that “tutilurfräs” is a very positive word.

There are two lessons to be drawn from this example:

  1. If you do sentiment analysis in Swedish on Twitter and your model does not automatically learn new terminology, you should re-train or update your model to include the word “tutilurfräs”.
  2. If you invent a completely new word and start blogging or tweeting about it, Ethersource will learn it. It is true that in space, no one can hear you scream, but on the Internet, even if you whisper, Ethersource will understand you.

A Minute-by-minute Popularity Contest – Loreen versus Danny

Although the Swedish part of the Eurovision Song Contest final was broadcast live, a TV viewer had no way of getting a sense of just how popular the artists were at a given point in time. Having access to Ethersource made sifting out meaningful blog posts and Tweets in real-time a breeze! Below are two graphs outlining, minute-by-minute, the popularity of the two top contestants as expressed in Swedish on-line social media on the day of the final (click the image for a larger version). Note that Loreen’s popularity score reaches higher during her performance than Danny’s does during his. In fact, looking at the scale and the contents of the two graphs, it is clear that the expressions of popularity towards Loreen are consistently higher throughout the day.

The popularity of Loreen and Danny Saucedo, measured minute-by-minute during the day of the final of the Swedish part of the Eurovision Song Contest. The annotations in red denote the appearance on stage of the two artists.

The timing information for the performances of the artists is available at the official web site of the contest.

Fabulous Fest Forecast by Gavagai

Sweden’s contribution to the Eurovision Song Contest this year was decided in yesterday’s final, which had ten contestants. The winner of this year’s Swedish music fest Melodifestivalen is Loreen, with the song “Euphoria”, which received almost 700,000 call-in votes from the TV audience at home.

Using the Ethersource technology, Gavagai followed the on-line sentiment towards all contestants throughout the lead-up to the event. We are pleased to note that Gavagai’s forecast of the results, which was based on expressions of appreciation in blog posts and tweets and published in the paper edition of Svenska Dagbladet (SvD) on the morning before the event, was close enough to the actual outcome of the viewers’ votes not only to predict the three top spots correctly from the starting field of ten contenders, but also to get their vote percentages pretty much right!

Talk comes cheap – demonstration counts!

This was fun! We will make similar public opinion forecasts in coming analyses, and we will return to what we can learn from this event and from similar data.

On-line Activities Indicate Increasing Flu Trend.

Swedish bloggers and tweeters are increasingly chattering about the seasonal influenza (also covered in an earlier blog post). The trend of the flu signals captured by the Ethersource barometer is clearly on the rise. This should come as no surprise since we are, in fact, looking at the seasonal flu. The interesting thing here is how well the barometer reflects what is reported by the Swedish Institute for Communicable Disease Control (SMI) in their weekly reports. Those reports are based on input from sentinels and laboratories and, by necessity, they lag behind in time: the current report covers February 6–13 and is already a week old.

Of the approximately 80 000 social media posts that matched our criteria for being included in the barometer on a given day in the period depicted in the image below, only a small fraction concern the flu. The flu isn’t contagious via the web, but information is. Keep your eyes open, and report symptoms to influenzanet to allow scientists to better stay on top of the flu spreading throughout Sweden.

Meanwhile, as we keep our eyes open for the spring sun, we’ll make sure to monitor the flu barometer and take note of any declining trends. We’ll keep you posted.

People active in Swedish social media are increasingly concerned about the seasonal influenza. The image shows a seven-day moving average of the influenza signal for the past two months. The annotations in the graph are the Ethersource anomaly detection algorithms at work: each flag indicates a point in time where the change in trend warranted our attention.
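To make the two ideas in the caption concrete, here is a minimal Python sketch of a seven-day moving average followed by a very simple trend-change flag. The Ethersource anomaly detection algorithms are not described in this post, so the threshold rule below is only an illustrative stand-in, and the function names, the threshold, and the sample numbers are all made up for the example.

```python
def moving_average(values, window=7):
    """Trailing moving average: one value per full window of `window` days."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


def flag_trend_changes(smoothed, threshold):
    """Indices in `smoothed` where the day-over-day change exceeds `threshold`.

    Assumption: a plain threshold on the change in the smoothed signal.
    This is NOT the Ethersource algorithm, only a placeholder for it.
    """
    return [
        i
        for i in range(1, len(smoothed))
        if abs(smoothed[i] - smoothed[i - 1]) > threshold
    ]


# Hypothetical daily signal values (not real barometer data).
daily_signal = [10, 10, 10, 10, 10, 10, 10, 17, 24, 31]
smoothed = moving_average(daily_signal, window=7)
flags = flag_trend_changes(smoothed, threshold=1.5)
```

Smoothing first means that a single noisy day does not raise a flag; only a sustained change moves the average enough to cross the threshold.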

Artist Lars Vilks Attacked. Again.

At 6:45 pm, less than a minute after the news broke on Twitter, Ethersource picked up the first aversive signal relating to tonight’s attack on Lars Vilks. (We’ve covered him previously on this blog.) This time, members of the audience threw eggs at him during an evening lecture in Karlstad.

Of the major news outlets, the branch of Swedish Radio located in Karlstad was the quickest to publish the news, putting it on their national web site at 7:43 pm (SR). The other players were roughly 30 to 45 minutes behind (DN, SVD, SVT). Having Ethersource do real-time attitudinal analysis of the contents of Swedish Tweets would have brought the news to your attention even earlier.

Image 1: Aversive expressions related to Lars Vilks on February 21, 2012. The red circle shows the initial peak indicative of the egging event, targeting Vilks. The Tweet in Image 2 is the first one in the list of documents associated with the highlighted peak, easily accessible via the Ethersource GUI.

Image 2: The first Tweet picked up by Ethersource pertaining to the egging event: "People in the audience just threw eggs at Lars Vilks and there was chaos"

Tebow, Tebowed, Tebowing: Spelling Variants and Associations

The Wall Street Journal recently ran a piece on the countless ways to spell Tebow. The article reports on spelling variants such as “Teebow”, “Teeeebow”, and “Teeebowww”, all of which are easily recognized using regular expressions. Nevertheless, this is a nice example of how the productivity of the language use of Internet users may pose challenges for keyword-based systems.
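To illustrate, the stretched spellings quoted above can be matched with a single pattern. The following is a minimal Python sketch that assumes the variants differ from “Tebow” only by repeated letters; the function name and the sample sentence are ours and purely illustrative.

```python
import re

# Assumption: a "stretched" variant repeats one or more letters of "Tebow"
# but keeps the letters in their original order.
TEBOW_STRETCHED = re.compile(r"\bt+e+b+o+w+\b", re.IGNORECASE)


def find_stretched_variants(text: str) -> list[str]:
    """Return every stretched spelling of 'Tebow' found in `text`."""
    return TEBOW_STRETCHED.findall(text)


sample = "Teebow and Teeeebow and Teeebowww, but also Twbow and Teblow."
matches = find_stretched_variants(sample)
```

The pattern finds the three stretched spellings in the sample, but it misses “Twbow” and “Teblow”, since those are not produced by repeating letters. Every such deviation would need its own hand-written pattern.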

Ethersource does not use regular expressions to handle this type of variation. Instead, it learns terminological variation continuously by observing language use. This means that Ethersource will find not only the type of variants reported in the WSJ article, but also more unpredictable variants, such as:

  • Twbow
  • Tibow
  • Tebox
  • Teboq
  • Tewbow
  • Teobow
  • Teabow
  • Teblow
  • Tebowm

In addition to finding the spelling variants of a given term, Ethersource can also find associated terms that help frame its meaning. That is, they help answer the question “What is a Tebow?”.

According to our ever-changing, live data, the top terms associated with Tebow include:

  • Broncos
  • Tim
  • Denver
  • quarterback
  • Tebowing
  • Tebowed

From this, we (manually) infer that Tebow is a person whose first name is Tim, that he is a quarterback, and that he plays for the Denver Broncos. The final two terms in the list puzzled us a bit. This is what we learned. Tebowing refers to the act of getting down on one knee and starting to pray, even if everyone around you is doing something completely different. Tebowed, on the other hand, has little to do with spirituality: it denotes being run over while playing American football. Thus, we add spirituality and toughness to our notion of Tebow.
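A list of associated terms like the one above can be roughly approximated by counting which words appear in the same documents as the target term. The Python sketch below does only that; it is a deliberately naive stand-in and not a description of how Ethersource learns associations. The function name and the example documents are invented for illustration.

```python
from collections import Counter


def associated_terms(documents, target, top_n=5):
    """Rank words by the number of documents they share with `target`.

    Assumption: plain document-level co-occurrence counts, with whitespace
    tokenisation and basic punctuation stripping. Illustrative only.
    """
    target = target.lower()
    counts = Counter()
    for doc in documents:
        tokens = {tok.strip(".,!?\"'").lower() for tok in doc.split()}
        tokens.discard("")
        if target in tokens:
            # Count each co-occurring word once per document.
            counts.update(tokens - {target})
    return counts.most_common(top_n)


# Made-up example documents (not real data).
docs = [
    "Tebow leads the Broncos",
    "Broncos win with Tebow",
    "Denver Broncos lose",
]
top = associated_terms(docs, "Tebow", top_n=1)
```

A real system would also need to discount very common words such as “the”, which co-occur with everything and therefore say little about the target.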

Positiveness Correlates with Holidays, Headache Correlates with New Year’s Day

We’ve previously seen that the aggregated overall positiveness of Swedes is cyclical on a weekly basis. Swedes love their days off. We’re now happy to confirm what we’ve all suspected for a long time: during Christmas and New Year we all excel in positive thinking!

Additionally, the image below reveals that, for some reason, Swedes appear to be very concerned with headaches on the day after the New Year festivities.

Positiveness correlates with holidays (red circles, Christmas and New Year), and headache correlates with New Year's Day.

Iowa and social media sentiment

We must confess we were a bit wary of extending social media-based prediction into the minds of Iowans gathering in caucus halls around their state to select their favourite presidential candidate. Iowan politics is famously local: our measurements are global.

As it turned out, we were fairly good at picking out what matters. The results gave Mitt Romney, Ron Paul and Rick Santorum more or less equal shares of the vote, with the others (Newt Gingrich, Michele Bachmann, Rick Perry and Jon Huntsman) trailing far behind.

Our measurements of social media in the last few days showed that the three most talked-about candidates were Santorum, Romney and Paul. The six most mentioned candidates received about the same amount of appreciation. But when we compare the amount of appreciation with the amount of aversive sentiment each candidate generated, we find that Romney had the best differential, and that Gingrich and Bachmann showed strongly negative differentials.
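The differential mentioned above can be sketched in a few lines of Python. We assume here that it is computed as a candidate's share of all positive mentions minus that candidate's share of all negative mentions; the exact definition is not spelled out in this post, so treat this as one plausible reading. The candidate labels and counts are placeholders, not our measured data.

```python
def sentiment_differential(positive, negative):
    """Share of positive mentions minus share of negative mentions, per name.

    Assumption: both inputs map a candidate name to a raw mention count.
    """
    total_pos = sum(positive.values())
    total_neg = sum(negative.values())
    if total_pos == 0 or total_neg == 0:
        raise ValueError("need at least one positive and one negative mention")
    return {
        name: positive[name] / total_pos - negative.get(name, 0) / total_neg
        for name in positive
    }


# Placeholder counts for three hypothetical candidates.
positive_mentions = {"Candidate A": 30, "Candidate B": 30, "Candidate C": 40}
negative_mentions = {"Candidate A": 10, "Candidate B": 50, "Candidate C": 40}

differential = sentiment_differential(positive_mentions, negative_mentions)
best = max(differential, key=differential.get)
```

In this made-up example, Candidates A and B receive the same amount of appreciation, yet A ends up with the best differential because A attracts far fewer negative mentions.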

We will not be quite as shy in four years’ time!

Proportion of all mentions in social media for the day before the Iowa caucuses

Proportion of positive mentions in social media for the day before the Iowa caucuses

Proportion of negative mentions in social media for the day before the Iowa caucuses

GOP Hopefuls in Social Media

The blogsite amerikanskpolitik.se has published some measurements we made on the relative stature in social media for the main Republican party presidential candidates. Their blog post is in Swedish but the main observations are:

  1. Ron Paul has gained a massive boost in mentions lately and is now the most talked-about candidate. (This is likely to be partly an effect of the general libertarian and counterestablishmentarian bias of the blogosphere.)
  2. Michele Bachmann is now the candidate viewed with the most skepticism. (This is likely to be an effect of her recently expressed views on vaccination, which run counter to many health professionals’ views.)
  3. Newt Gingrich is the candidate most associated with aversive affect.
  4. Mitt Romney and Newt Gingrich are the candidates most associated with positive affect.

Aversive mentions during the week of December 22-28, 2012.

Proportion of mentions during the week of December 22-28, 2012.

Skeptical mentions during the week of December 22-28, 2012.

Positive mentions during the week of December 22-28, 2012.

Real-time Syndromic Surveillance of Social Media for Disease Symptoms related to Seasonal Influenza

  • We monitor social media for disease symptoms in real time.
  • There is still no evidence of an outbreak of the seasonal flu in Sweden.
  • We do, however, observe an increasing trend in the intensity of symptoms.

The inevitable influenza season will soon come knocking on our doors. How do we know when it has started, and how do we know just how severe it is? To this end, there are on-line tools for syndromic surveillance that help individual medical practitioners and national disease control centers alike to combat the spread of influenza. Internationally, perhaps the best-known monitoring service is Google Flu Trends. Nationally, Influensakoll keeps track of the current state of flu-related illness in Sweden. Along the same lines, research carried out at the Swedish Institute for Infectious Disease Control (SMI) shows the feasibility of using search queries submitted to the medical web site Vårdguiden for outbreak detection and monitoring. SMI also publishes weekly influenza reports based on input from labs and sentinels.

In addition, there is a growing effort in the research community to mine on-line social media, mostly Twitter, in English, and only by using keywords. The purpose is to facilitate early warning and outbreak detection that health authorities can use when planning and conducting targeted counter-measures against epidemic diseases. Another interesting approach is that taken by the Iowa Electronic Health Market, which is a prediction market for syndromic surveillance.

While the above-mentioned services and research rely either on active participation by the users, or on keyword matching in social media feeds with the purpose of finding patterns, we’ve taken a different route to finding out the state of illness of Sweden. We’ve enhanced the barometer introduced earlier with concepts (not keywords) corresponding to a range of disease symptoms such as migraine, fever, expectoration, headache, nausea, sore throat, and head cold. This facilitates the triangulation of more complex illnesses without having to wait for the bloggers, tweeters, forum participants, and facebookers out there to become so ill that they either actively seek answers related to their health condition, or start communicating using the actual name of the disease.

Our approach attempts to catch signs of illness early on, as they are expressed while the participants in social media do what they usually do, that is, communicate with their peers. By focusing on the symptoms, we believe it is possible to get an early warning of the seasonal flu, before anyone realizes that it is what they are actually talking about. The image below illustrates the discrepancy between the score for the concept of influenza (the green, nearly flat line at the bottom of the graph) and the scores for some of the symptoms of influenza: expectoration (blue line), headache (red line), and fever (yellow line). Clearly, people have not yet experienced the flu strongly enough to talk about it, although they talk loudly about some of its symptoms. Note that the graph reveals an increasing trend in the intensity of the symptoms! The Ethersource-based barometer thus serves as a complement to other surveillance tools in that it picks up on trends of (combinations of) symptoms earlier.

Expressions of the concepts expectoration, headache, fever, and influenza in Swedish social media, early December 2011. Note that while the influenza score is constantly low, the other three symptoms vary with the time of day, taking precedence over each other in various ways. Clearly, people have not yet experienced the flu strongly enough to talk about it.

Gavagai’s Ethersource technology allows for the kind of syndromic surveillance of disease symptoms described in this blog post to be carried out in real-time, in any language.