Labs

RESEARCH

Our technology is based on 15+ years of research in computational linguistics and computer science. We believe that you cannot make any significant progress unless you push the boundaries and break new ground. Our lab is constantly refining and advancing our algorithms and methodologies. We work in two main areas of research:

Distributional semantics, a research area in which we develop and study theories and methods for quantifying and categorising semantic similarities between linguistic items based on their distributional properties in large samples of language data. Our interest here is manifold: we work on algorithms for the effective acquisition of understanding from text, on rich and useful representations for linguistic content and situational context, and on applications of distributional models to real-world tasks (obviously, mostly tasks of commercial interest to us).

Evaluation of learning language models, finding methods and metrics to test and compare algorithms, memory models, and processing approaches, both to benchmark improvements and to validate approaches with respect to tasks of interest.

PUBLICATIONS

Plausibility Testing for Lexical Resources

Magdalena Parks, Jussi Karlgren, Sara Stymne
This paper describes principles for evaluation metrics for lexical components using randomly scrambled sentences compared with naturally occurring ones, and sentences where some salient referent has been replaced with a liar's item, together with a pilot implementation of those principles based on requirements from practical information systems.
Presented at the 2018 CLEF Conference

Second Workshop on Search and Exploration of X-Rated Information (SEXI’16)

Vanessa Murdock, Charles L.A. Clarke, Jaap Kamps, and Jussi Karlgren.

Adult content is pervasive on the web, has been a driving factor in the adoption of the Internet medium, and is responsible for a significant fraction of traffic and revenues, yet rarely attracts attention in research. The research questions surrounding adult content access behaviours are unique, and interesting and valuable research in this area can be done ethically. WSDM 2016 features a half day workshop on Search and Exploration of X-Rated Information (SEXI) for information access tasks related to adult content. While the scope of the workshop remains broad, special attention is devoted to the privacy and security issues surrounding adult content by inviting keynote speakers with extensive experience on these topics. The recent release of the personal data belonging to customers of the adult dating site Ashley Madison provides a timely context for the focus on privacy and security.

Proceedings of the Ninth ACM International Conference on Web Search and Data Mining (WSDM '16). ACM, New York, NY, USA, 697-698. DOI=http://dx.doi.org/10.1145/2835776.2855118

Pre-presentation of the SEXI workshop in WSDM proceedings, from the ACM digital library

(report from the workshop will be published in SIGIR Forum later in 2016)

The Gavagai Living Lexicon

Magnus Sahlgren, Amaru Cuba Gyllensten, Fredrik Espinoza, Ola Hamfors, Jussi Karlgren, Fredrik Olsson, Per Persson, Akshay Viswanathan, and Anders Holst

This paper presents the Gavagai Living Lexicon, which is an online distributional semantic model currently available in 20 different languages. We describe the underlying distributional semantic model, and how we have solved some of the challenges in applying such a model to large amounts of streaming data. We also describe the architecture of our implementation, and discuss how we deal with continuous quality assurance of the lexicon.

Short paper presented at the 10th edition of the Language Resources and Evaluation Conference (LREC 2016), 23-28 May 2016, Portorož

Dead Man Tweeting

David Nilsson, Magnus Sahlgren, and Jussi Karlgren

This paper presents our first take at a text generator: Dead Man Tweeting, a system that learns semantic avatars from (dead) people’s texts, and makes the avatars come alive on Twitter. The system includes a language model for generating sequences of words, a topic model for ensuring that the sequences are topically coherent, and a semantic model that ensures the avatars can be productive and generate novel sequences. The avatars are connected to Twitter and are triggered by keywords that are significant for each particular avatar.

We will be continuing the development of this first whimsical prototype for other projects where generating topical text is of interest.

Paper presented on May 28 at the RE-WOCHAT 2016 workshop on Collecting and Generating Resources for Chatbots and Conversational Agents Development and Evaluation, in conjunction with LREC 2016 in Portorož

Evaluating Categorisation in Real Life – an argument against simple but impractical metrics

Vide Karlsson, Jussi Karlgren, and Pawel Herman

Text categorisation in commercial applications imposes several limiting constraints on the technology solutions to be employed. This paper describes how a method with some potential improvements is evaluated for practical purposes and argues for a richer and more expressive evaluation procedure. One such procedure is exemplified by a precision-recall matrix which exchanges convenience for usefulness.

Presented at the 7th CLEF 2016 Conference and Labs of the Evaluation Forum, 5-8 September 2016, Évora, Portugal.

En rekommenderad svensk språkteknologisk terminologi (A recommended Swedish language technology terminology)

Viggo Kann (KTH), Lars Ahrenberg (Linköping University), Rickard Domeij (Swedish Language Council), Ola Karlsson (Swedish Language Council), Jussi Karlgren, Henrik Nilsson (Terminologicentrum), and Joakim Nivre (Uppsala University)
In 2014 the Swedish Language Technology Terminology Group was created, with representatives from different parts of the language technology community: higher education and research, industry, and governmental agencies. In 2016 we recommended Swedish terms for the 270 language technological concepts in the Bank of Finnish Terminology in Arts and Sciences. The language technology terms are published on folkets-lexikon.csc.kth.se/LTterminology, where anyone can look up Swedish and English terms interactively and read the full list of terms. We also try to enter the most important Swedish terminology into the Swedish Wikipedia. We encourage use of these Swedish terms and welcome suggestions for improvements of the Swedish terminology.
Presented at the Sixth Swedish Language Technology Conference

Random indexing of multidimensional data

Fredrik Sandin, Blerim Emruli, and Magnus Sahlgren
This paper gives a model for how to generalise random indexing to multidimensional arrays and therefore enable approximation of higher-order statistical relationships in data. The generalised method is a sparse implementation of random projections, which is the theoretical basis also for ordinary random indexing and other randomisation approaches to dimensionality reduction and data representation. We present numerical experiments which demonstrate that a multidimensional generalisation of random indexing is feasible, including comparisons with ordinary random indexing and principal component analysis. An open source implementation of generalised random indexing is provided.
Knowledge and Information Systems (2016). doi:10.1007/s10115-016-1012-2

A proposal to use distributional models to analyse dolphin vocalization

Mats Amundin, Robert Eklund, Henrik Hållsten, Jussi Karlgren, Lars Molinder
This paper gives a brief introduction to the starting points of our upcoming experimental project, pending favourable decisions by research funding agencies, to study dolphin communicative behaviour using distributional semantics as implemented by us at Gavagai. It presents some of the challenges and conveys some of the optimism we feel is warranted given the rapid increase of available data and of processing power. This is an opportunity to test both the limits of our model and the characteristics of dolphin communication systems! Co-authors are Mats Amundin from Kolmården Wildlife Park, Robert Eklund from Linköping University, Henrik Hållsten, Jussi Karlgren, and Lars Molinder of Carnegie, one of our financial advisors, who came up with the original idea. The paper was presented by Mats Amundin at the 1st International Workshop on Vocal Interactivity in-and-between Humans, Animals and Robots.
In: 1st International Workshop on Vocal Interactivity in-and-between Humans, Animals and Robots, 2017.

Detecting Speculations, Contrasts and Conditionals in Consumer Reviews

Maria Skeppstedt, Teri Schamp-Bjerede, Magnus Sahlgren, Carita Paradis and Andreas Kerren

A support vector classifier was compared to a lexicon-based approach for the task of detecting the stance categories speculation, contrast and conditional in English consumer reviews. Around 3,000 training instances were required to achieve a stable performance of an F-score of 90 for speculation. This outperformed the lexicon-based approach, for which an F-score of just above 80 was achieved. The machine learning results for the other two categories showed a lower average (an approximate F-score of 60 for contrast and 70 for conditional), as well as a larger variance, and were only slightly better than lexicon matching. Therefore, while machine learning was successful for detecting speculation, a well-curated lexicon might be a more suitable approach for detecting contrast and conditional. 

Proceedings of the 6th Workshop on Computational Approaches to Subjectivity, Sentiment & Social Media Analysis (WASSA '15)

Factorization of Latent Variables in Distributional Semantic Models

David Ödling, Arvid Österlund and Magnus Sahlgren

This paper discusses the use of factorization techniques in distributional semantic models. We focus on a method for redistributing the weight of latent variables, which has previously been shown to improve the performance of distributional semantic models. However, this result has not been replicated and remains poorly understood. We refine the method, and provide additional theoretical justification, as well as empirical results that demonstrate the viability of the proposed approach.

EMNLP 2015

Navigating the Semantic Horizon using Relative Neighborhood Graphs

Amaru Cuba Gyllensten and Magnus Sahlgren

This paper introduces a novel way to navigate neighborhoods in distributional semantic models. The approach is based on relative neighborhood graphs, which uncover the topological structure of local neighborhoods in semantic space. This has the potential to overcome both the problem with selecting a proper k in k-NN search, and the problem that a ranked list of neighbors may conflate several different senses. We provide both qualitative and quantitative results that support the viability of the proposed method.

EMNLP 2015
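
The relative neighborhood graph idea in the abstract above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name, the toy data, and the use of Euclidean distance are assumptions made for illustration. Two items are linked only if no third item is closer to both of them than they are to each other, so no value of k has to be chosen.

```python
import math


def relative_neighborhood_graph(vectors, dist=math.dist):
    """Return the edges of the relative neighborhood graph.

    `vectors` maps a label to a coordinate tuple. The default distance is
    Euclidean (an assumption; any distance function can be passed in).
    """
    names = list(vectors)
    edges = set()
    for i, p in enumerate(names):
        for q in names[i + 1:]:
            d_pq = dist(vectors[p], vectors[q])
            # p and q are relative neighbors unless some third point r
            # is strictly closer to both of them than they are to each other.
            blocked = any(
                max(dist(vectors[p], vectors[r]), dist(vectors[q], vectors[r])) < d_pq
                for r in names
                if r != p and r != q
            )
            if not blocked:
                edges.add((p, q))
    return edges


# Hypothetical toy data: three points on a line.
points = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (2.0, 0.0)}
print(sorted(relative_neighborhood_graph(points)))
```

In the toy data, "b" lies between "a" and "c", so the direct link from "a" to "c" is dropped. This brute-force version is cubic in the number of items and is only meant for small local neighborhoods.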

Evaluating Learning Language Representations

Jussi Karlgren, Jimmy Callin, Kevyn Collins-Thompson, Amaru Cuba Gyllensten, Ariel Ekgren, David Jürgens, Anna Korhonen, Fredrik Olsson, Magnus Sahlgren, and Hinrich Schütze

This paper reports on the workshop on Evaluating Learning Language Representations hosted by Gavagai in October 2014.

Presented at the 6th CLEF 2015 Conference and Labs of the Evaluation Forum, 8-11 September 2015, Toulouse, France. This work was partially funded by the European Science Foundation through its ELIAS project.

Inferring the location of authors from words in their texts

Max Berggren, Jussi Karlgren, Robert Östling, and Mikael Parkvall

This paper describes a series of experiments to determine how positionally annotated Twitter texts can be used to learn words which indicate the location of other texts and their authors. Many texts are locatable but most have no explicit indication of place, and many applications, both commercial and academic, have an interest in knowing where a text or its author is from.

The notion of placeness of a word is introduced as a measure of how locational a word is, and we find that modelling word distributions to account for several locations, using local distributional context, and aggregating locational information in a centroid for each text gives the most useful results. The results are applied to data in the Swedish language.

Presented at the 20th NoDaLiDa, Nordic Conference on Computational Linguistics, May 11-13, 2015, Vilnius. This work was done in cooperation with Stockholm University and was partially funded by Vetenskapsrådet, the Swedish Research Council, under its grant SINUS (Spridning av innovationer i nutida svenska).

A Use Case Framework for Information Access Evaluation

Preben Hansen, Gunnar Eriksson, Anni Järvelin, and Jussi Karlgren

Although the need for a common evaluation framework for multimedia and multimodal documents for various use cases, including non-topical use, is widely acknowledged, such a framework is still not in place. Retrieval system evaluation results are not regularly validated in laboratory or field studies; the infrastructure for generalizing results over tasks, users and collections is still missing. This chapter presents a use case-based framework for experimental design in the field of interactive information access. The framework is illustrated by examples that sketch out how it can be productively used in experimental design and reporting with a minimal threshold for adoption.

In "Professional Search in the Modern World", Paltoglou, Georgios, Loizides, Fernando, Hansen, Preben (Eds.). Springer. 2014.

Semantic Topology

Jussi Karlgren, Gabriel Isheden, Martin Bohman, Ariel Ekgren, Emelie Kullmann, and David Nilsson
Gavagai and KTH

A reasonable requirement (among many others) for a lexical or semantic component in an information system is that it should be able to learn incrementally from the linguistic data it is exposed to, that it can distinguish between the topical impact of various terms, and that it knows if it knows stuff or not.

We work with a specific representation framework – semantic spaces – which well accommodates the first requirement; in this short paper, we investigate the global qualities of semantic spaces by a topological procedure – mapper – which gives an indication of topical density of the space; we examine the local context of terms of interest in the semantic space using another topologically inspired approach which gives an indication of the neighbourhood of the terms of interest. Our aim is to be able to establish the qualities of the semantic space under consideration without resorting to inspection of the data used to build it.

In: Proceedings of the 23rd ACM International Conference on Information and Knowledge Management (CIKM '14) in Shanghai, Nov 3-7. ACM, New York, NY, USA, 2014.

Språket avslöjar hur vi röstar (Language reveals how we vote)

Jussi Karlgren.

"What does the political climate of opinion look like? You can of course ask the voters. But it may be better to look at what they write. Now a computer program is on the trail of the voters' sympathies."

2014. Språktidningen. 6: 16-22.

The STAVICTA Group Report for RepLab 2014 Reputation Dimensions Task

Afshin Rahimi, Magnus Sahlgren, Andreas Kerren, and Carita Paradis

In this paper we present our experiments on the RepLab 2014 Reputation Dimensions task. RepLab is a competitive challenge for Reputation Management Systems. RepLab 2014's reputation dimensions task focuses on categorization of Twitter messages with regard to standard reputation dimensions (such as performance, leadership, or innovation). Our approach relies only on the textual content of tweets and ignores both metadata and the content of URLs within tweets. We carried out several experiments focusing on different feature sets including bag of n-grams, distributional semantics features, and deep neural network representations. The results show that bag of bigram features with minimum frequency thresholding work quite well in the reputation dimensions task, especially with regard to the average F1 measure over all dimensions, where two of our four submitted runs achieve the highest and second-highest scores. Our experiments also show that semi-supervised recursive autoencoders outperform the other feature sets used in our experiments with regard to the accuracy measure and are a promising subject of future research for improvements.

Proceedings of CLEF 2014

Issue framing and language use in the Swedish blogosphere: Changing notions of the outsider concept

Stefan Dahlberg and Magnus Sahlgren
Department of Political Science, University of Gothenburg and Gavagai, Stockholm

Issue framing has become one of the most important means of elite influence on public opinion. In this paper, we introduce a method for investigating issue framing based on statistical analysis of large samples of language use. Our method uses a technique called Random Indexing (RI), which enables us to extract semantic and associative relations to any target concept of interest, based on co-occurrence statistics collected from large samples of relevant language use. As a first test and evaluation of our proposed method, we apply RI to a large collection of Swedish blog data and extract semantic relations relating to our target concept “outsiders”. This concept is widely used in the public debate both in relation to labour market issues and socially related issues.

In: Bertie Kaal, Isa Maks and Annemarie van Elfrinkhof (eds.) From Text to Political Positions: Text analysis across disciplines, John Benjamins Publishing Company, 2014, pp. 71–92.
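
As a rough illustration of the Random Indexing technique mentioned in the abstract above, the following Python sketch accumulates co-occurrence statistics into fixed-size vectors. It is a minimal sketch of the general technique, not the system used in the paper; the function names, the dimensionality, the number of non-zero entries, the window size, and the toy sentences are all assumptions.

```python
import random
from collections import defaultdict


def make_index_vector(dim, nonzeros, rng):
    """Sparse ternary vector: a few randomly placed +1/-1 entries, zeros elsewhere."""
    vec = [0] * dim
    for pos in rng.sample(range(dim), nonzeros):
        vec[pos] = rng.choice((-1, 1))
    return vec


def random_indexing(sentences, dim=512, nonzeros=8, window=2, seed=0):
    """Build one context vector per word from tokenised sentences.

    Each word gets a fixed random index vector; a word's context vector is
    the sum of the index vectors of the words seen within `window` positions.
    """
    rng = random.Random(seed)
    index = {}
    context = defaultdict(lambda: [0] * dim)
    for tokens in sentences:
        for tok in tokens:
            if tok not in index:
                index[tok] = make_index_vector(dim, nonzeros, rng)
        for i, target in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    ctx = context[target]
                    for k, val in enumerate(index[tokens[j]]):
                        ctx[k] += val
    return dict(context)


# Hypothetical toy corpus: "cat" and "dog" occur in identical contexts.
vecs = random_indexing([["the", "cat", "sat"], ["the", "dog", "sat"]], window=1)
print(vecs["cat"] == vecs["dog"])
```

Words that occur in similar contexts end up with similar context vectors, which can then be compared with, for example, cosine similarity. Because the dimensionality is fixed in advance, new text can be added incrementally without rebuilding the model.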

Temperature in the Word Space: Sense Exploration of Temperature Expressions using Word-Space Modelling

Maria Koptjevskaja-Tamm and Magnus Sahlgren
Department of Linguistics, Stockholm University and Gavagai, Stockholm

This chapter deals with a statistical technique for sense exploration based on distributional semantics known as word space modelling. Word space models rely on feature aggregation, in this case aggregation of co-occurrence events, to build an aggregated view on the distributional behaviour of words. Such models calculate meaning similarity among words on the basis of the contexts in which they occur and represent it as proximity in high-dimensional vector spaces. The main purpose of this study is to test to what extent word-space modelling is in principle suitable for lexical-typological work by taking a first little step in this direction and applying the method for the exploration of the seven central English temperature adjectives in three corpora representing different genres. In order to better capture and account for the potentially different senses of one and the same word we have suggested and applied a new variant of this general method, “syntagmatically labelled partitioning”.

In: Benedikt Szmrecsanyi and Bernhard Wälchli (eds.) Aggregating Dialectology, Typology, and Register Analysis: Linguistic Variation in Text and Speech, Berlin, Boston: De Gruyter, 2014, pp. 231–267.

Search and Exploration of X-Rated Information (SEXI 2013)

Vanessa Murdock, Charles L.A. Clarke, Jaap Kamps, and Jussi Karlgren

Adult content is pervasive on the Web and has been a driving factor in the adoption of the Internet medium. It is responsible for a significant fraction of traffic and revenues, yet rarely attracts attention in research. We propose that the research questions surrounding adult content access behaviors are unique, and we believe interesting and valuable research in this area can be done ethically. The workshop on Search and Exploration of X-Rated Information (SEXI) addresses these issues for information access tasks related to adult content.

In: Proceedings of the sixth ACM international conference on Web Search and Data Mining (WSDM '13), ACM, New York, NY, USA, 2013, pp. 795–796; Workshop report in SIGIR Forum Vol. 47 No. 1

Frontiers, Challenges, and Opportunities for Information Retrieval – Report from SWIRL 2012, The Second Strategic Workshop on Information Retrieval in Lorne

Jussi Karlgren with 45 other senior Information Retrieval researchers

During a three-day workshop in February 2012, 45 Information Retrieval researchers met to discuss long-range challenges and opportunities within the field. The result of the workshop is a diverse set of research directions, project ideas, and challenge areas. This report describes the workshop format, provides summaries of broad themes that emerged, includes brief descriptions of all the ideas, and provides detailed discussion of six proposals that were voted "most interesting" by the participants.

Key themes include the need to: move beyond ranked lists of documents to support richer dialog and presentation, represent the context of search and searchers, provide richer support for information seeking, enable retrieval of a wide range of structured and unstructured content, and develop new evaluation methodologies.

In: SIGIR Forum, ISSN 0163-5840, Vol. 46, no 1, 2-32

Usefulness of Sentiment Analysis

Jussi Karlgren, Magnus Sahlgren, Fredrik Olsson, Fredrik Espinoza, and Ola Hamfors
Gavagai, Stockholm

What can text sentiment analysis technology be used for, and does a more usage-informed view on sentiment analysis pose new requirements on technology development?

In: Ricardo Baeza-Yates, Arjen P. de Vries, Hugo Zaragoza, B. Barla Cambazoglu, and Vanessa Murdock (eds.) Proceedings of the 34th European conference on Advances in Information Retrieval (ECIR'12), Springer-Verlag, Berlin, Heidelberg, 2012, pp. 426–435.

Profiling Reputation of Corporate Entities in Semantic Space

Jussi Karlgren, Magnus Sahlgren, Fredrik Olsson, Fredrik Espinoza and Ola Hamfors
Gavagai, Stockholm

Gavagai used its first-generation baseline system for the profiling task of the CLEF 2012 evaluation campaign for online reputation management systems. The system builds on large-scale analysis of streaming text and performed excellently on this task with standard settings.

In: P. Forner, J. Karlgren and C. Womser-Hacker (eds.) Notebook for RepLab at CLEF 2012

Exploiting Semantic Annotations in Information Retrieval (Workshop series 2010-2014)

ESAIR 2010: Jaap Kamps (University of Amsterdam), Jussi Karlgren (Gavagai), Ralf Schenkel (MPI)
ESAIR 2011: Omar Alonso (Microsoft), Jaap Kamps (University of Amsterdam), and Jussi Karlgren (Gavagai)
ESAIR 2012: Jaap Kamps (University of Amsterdam), Jussi Karlgren (Gavagai), Peter Mika (Yahoo! Research), and Vanessa Murdock (Microsoft Bing)
ESAIR 2013: Paul N. Bennett (Microsoft Research), Evgeniy Gabrilovich (Google), Jaap Kamps (University of Amsterdam), and Jussi Karlgren (Gavagai)
ESAIR 2014: Jaap Kamps (University of Amsterdam), Jussi Karlgren (Gavagai), Omar Alonso (Microsoft)

The ESAIR series of workshops takes as its starting point that there is an increasing amount of structure on the web as a result of modern web languages, user tagging and annotation, emerging robust NLP tools, and an ever growing volume of linked data. These meaningful semantic annotations hold the promise to significantly enhance information access by increasing the depth of analysis of today's systems. Currently, we have only started exploring the possibilities and have only begun to understand how these valuable semantic cues can be put to fruitful use. To complicate matters, standard text search excels at shallow information needs expressed by short keyword queries, and here semantic annotation contributes very little, if anything.

ESAIR'10 focussed on formulating a framework for viewing annotation as a linking procedure, connecting an analysis of information objects with a semantic model of some sort, expressing relations that contribute to a task of interest to end users.

ESAIR'11 brought together discussions on how unleashing the potential of semantic annotations requires us to think outside the box, by combining the insights of natural language processing (NLP) to go beyond bags of words, the insights of database technologies (DB) to use structure efficiently even when aggregating over millions of records, the insights of information retrieval (IR) in effective goal-directed search and evaluation, and the insights of knowledge management (KM) to get to grips with the greater whole.

ESAIR'12 focussed on how to leverage the rich context currently available, especially in a mobile search scenario, giving powerful new handles to exploit semantic annotations and on how to fruitfully combine classic information retrieval and knowledge intensive approaches, and for the first time work actively toward a unified view on exploiting semantic annotations.

ESAIR'13 focussed on two of the most challenging aspects to address in the coming years. First, there is a need to include the currently emerging knowledge resources (such as DBpedia, Freebase) as an underlying semantic model giving access to an unprecedented scope and detail of factual information. Second, there is a need to include annotations beyond the topical dimension (think of sentiment, reading level, prerequisite level, etc.) that contain vital cues for matching the specific needs and profile of the searcher at hand.

ESAIR'14 focussed on how to elicit more articulate queries or expressions of information need, with concepts and relations linking their statement of request to existing semantic models as offered by emerging knowledge bases. The discussion centered to a large extent on how to provide useful event and entity identification from unstructured streaming information.

ESAIR 2010: Proceedings of the 19th ACM International Conference on Information and Knowledge Management (CIKM '10), Toronto.
ESAIR 2011: Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM '11), Glasgow.
ESAIR 2012: Proceedings of the 21st ACM International Conference on Information and Knowledge Management (CIKM '12), Maui.
ESAIR 2013: Proceedings of the 22nd ACM International Conference on Information and Knowledge Management (CIKM '13), San Francisco.
ESAIR 2014: Proceedings of the 23rd ACM International Conference on Information and Knowledge Management (CIKM '14), Shanghai.