
2016-05-30

The extraordinary productivity of foul language – Do you and your text analytics solution know these bad words?

By looking into the extraordinary productivity of foul language, this post showcases the ability of Gavagai’s semantic memories to automatically learn and relate terms in a vocabulary. If you are sensitive to swearing and cursing, you should stop reading now!

Foul language, profanity, expletives, and bad words. The creativity of the human mind when it comes to inventing impolite, rude or offensive language is simply amazing. But however productive a single human being might be, she will never be able to come up with all the variants of a given foul-language concept used throughout an entire community of speakers. Of course, this goes for all types of terms in a language: the wisdom of the crowd is greater than that of any single individual. In this post, we turn to foul language for illustrative purposes simply because it seems more vivid and productive than most other areas of our vocabulary.

Take a moment and see if you can come up with 20 English synonyms for the word “fucking” and as many terms to use instead of “dumb ass”. Write them down.

How well did you do? You will be able to compare your results with those of Gavagai’s semantic memories shortly.

Gavagai’s semantic memories are exposed to millions of news documents and social media posts every day. From this stream of text, the memories learn, continuously and without supervision, the words of a language, including multi-word terms, and how they relate to each other. Thus, the semantic memories are always up to date with current language use. I will illustrate the learning capability by looking up three bad words, as well as their immediate semantic neighborhoods: “bi ch”, “fucking”, and “dumb ass”.

Is “bi ch” a “birch”?

First, consider the term “bi ch”. It consists of two separate words that the system has picked up as a single multi-word term by reading English-language social media posts. One might wonder whether the whitespace in the term should be filled with a character, and if so, which one. An “r” perhaps? To make it “birch”? By looking at the semantic neighborhood of “bi ch”, we can easily see that the missing character is not an “r”, but a “t”. The term “bi ch” is an obfuscated variant of “bitch”. The image below shows “bi ch” (marked with a red circle) and its surroundings.

[Image: Gavagai lexicon graph of the semantic neighborhood of “bi ch”]

The nodes in the graph are terms (automatically identified by the system), and the edges between nodes denote strong semantic relationships (again, automatically identified by the system – there’s no manual intervention going on). The term “bi ch” is connected to “fu k” immediately to its right by means of an edge labelled “* n gga”. The asterisk stands in for the term at either end of the edge, so the label is read as “bi ch n gga” and “fu k n gga”. The label is an automatically extracted word sequence that commonly occurs to the left or right of the terms denoted by the nodes it connects.
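To make the idea of an edge label a bit more concrete, here is a minimal Python sketch. It is not how Gavagai’s system works internally; the function names and the toy posts are invented for illustration, and only right-hand contexts are considered. It collects the word sequences that directly follow two terms and keeps the ones they share.

```python
from collections import Counter


def right_contexts(posts, term, width=2):
    """Count the `width`-word sequences that directly follow `term`."""
    term_tokens = term.split()
    n = len(term_tokens)
    counts = Counter()
    for post in posts:
        tokens = post.lower().split()
        # Slide over every position where term + context still fits.
        for i in range(len(tokens) - n - width + 1):
            if tokens[i:i + n] == term_tokens:
                counts[" ".join(tokens[i + n:i + n + width])] += 1
    return counts


def shared_context_labels(posts, term_a, term_b, width=2):
    """Return labels such as '* n gga' for contexts both terms share."""
    shared = right_contexts(posts, term_a, width) & right_contexts(posts, term_b, width)
    return ["* " + context for context, _ in shared.most_common()]


# Toy posts, made up for this example only (hypothetical data).
posts = [
    "that bi ch n gga again",
    "some fu k n gga said so",
    "bi ch please stop",
]

print(shared_context_labels(posts, "bi ch", "fu k"))  # ['* n gga']
```

A real system would of course work on far more text and would need some notion of how frequent a shared context has to be before it earns a place on an edge.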

Here’s how to understand the relationships in the graph. Start with, e.g., the “hahaha” node (lower left corner of the image). The term “hahaha” has been used in the same textual contexts as “lmao”. That is why there is an edge between them. The term “lmao” has been used in the same context as “bruh”, which has been used in the same context as “nigga”, which, in turn, has been used in the same way as “n gga”. Finally, “n gga” has been used in the same way as “bi ch”. Thus, there is a path in the graph connecting the usage of “hahaha” and “bi ch” in five steps.
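The walk from “hahaha” to “bi ch” can be reproduced with a plain breadth-first search. The following is only a sketch: the edge list contains just the handful of edges mentioned in the text, not the full graph from the image, and the helper names are my own.

```python
from collections import deque

# Only the edges spelled out in the text; the real graph has many more.
EDGES = [
    ("hahaha", "lmao"),
    ("lmao", "bruh"),
    ("bruh", "nigga"),
    ("nigga", "n gga"),
    ("n gga", "bi ch"),
    ("bi ch", "fu k"),
]


def build_graph(edges):
    """Build an undirected adjacency map from (term, term) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the list of terms on a shortest path."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in sorted(graph.get(node, ())):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None  # no connection between the two terms


path = shortest_path(build_graph(EDGES), "hahaha", "bi ch")
print(" -> ".join(path))       # hahaha -> lmao -> bruh -> nigga -> n gga -> bi ch
print(len(path) - 1, "steps")  # 5 steps
```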

Name 20 ways of conveying the meaning of “dumb ass”

Another common derogatory term is “dumb ass”. The graph below shows “dumb ass” (marked in red) and its nearest neighbors. Did you manage to come up with 20 terms to use instead of “dumb ass”? Check the closely related terms in the graph: did you think of a term that is not there? If so, please post it in the comments section.

[Image: Gavagai lexicon graph of the semantic neighborhood of “dumb ass”]

As you can see, there are four groups of terms in the image. Besides the rather tight cluster around “dumb ass”, there is a cluster representing the derogatory meaning of a person’s behind (connected to “dumb ass” via the term “fat ass”). There is also a connection between “dumb ass” and “moronic”. The “moronic” part is actually the entrance to two different structures. The first one is connected to “dumbass” (note the alternative spelling of “dumb ass”) and carries the sense of “idiotic”, “absurd”, and “ridiculous”. The second structure is indirectly connected to “dumbass” by way of the terms “moronic” and “idiotic”, and conveys the meaning represented by, e.g., “ignorant”, “gullible”, “uneducated”, and “deluded”.

Do you know 20 ways to spell “fucking”?

Finally, consider the term “fucking”. Its neighborhood forms a very tight cluster, consisting mostly of spelling variants. To make it a bit more interesting to look at, I’ve manually aligned the nodes to form a more readable pattern. How many of these variants did you come up with?

[Image: Gavagai lexicon graph of the semantic neighborhood of “fucking”]

What is this good for?

If you’re in a business segment where you need up-to-date word knowledge in multiple languages, accessible as SaaS, you should contact us! The semantic memories are currently available in 20 languages via the Gavagai Living Lexicon and an API. They are at the core of our media monitoring application, Gavagai Monitor, as well as our text intelligence platform, Gavagai Explorer.

The graphs above were made with a tool I wrote for populating a Neo4j graph database with word information extracted from the Gavagai Living Lexicon via the Gavagai API.
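For the curious, here is a rough idea of what the database side of such a tool could look like. This is a hedged sketch, not the actual tool: the `Term` label, the `RELATED_TO` relationship type and the `cypher_for_pairs` helper are assumptions made for this example, and the call to the Gavagai API is left out entirely, so the term pairs are simply typed in by hand.

```python
def cypher_for_pairs(pairs):
    """Build one parameterized Cypher statement that upserts terms and edges.

    `pairs` is a list of (term, neighbor) tuples. The label `Term` and the
    relationship type `RELATED_TO` are illustrative assumptions.
    """
    query = (
        "UNWIND $pairs AS pair "
        "MERGE (a:Term {name: pair.source}) "
        "MERGE (b:Term {name: pair.target}) "
        "MERGE (a)-[:RELATED_TO]-(b)"
    )
    params = {"pairs": [{"source": s, "target": t} for s, t in pairs]}
    return query, params


# Term pairs taken from the graphs discussed above.
pairs = [("bi ch", "fu k"), ("dumb ass", "fat ass")]
query, params = cypher_for_pairs(pairs)

print(query)
print(params)
```

Using MERGE rather than CREATE means that running the statement again does not duplicate terms or relationships. With the official Neo4j Python driver, the statement and its parameters could be handed to something like `session.run(query, params)`.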

 

 

Category: case studies, Gavagai Lexicon