Generating bags of words from the sums of their word embeddings

Research output: Chapter in Book/Conference paper › Conference paper › peer-review

3 Citations (Scopus)

Abstract

Many methods have been proposed to generate sentence vector representations, such as recursive neural networks, latent distributed memory models, and the simple sum of word embeddings (SOWE). However, very few methods demonstrate the ability to reverse the process: recovering sentences from sentence embeddings. Amongst the many sentence embeddings, SOWE has been shown to maintain semantic meaning, so in this paper we introduce a method for moving from the SOWE representation back to the bag of words (BOW) of the original sentence. This is a partway step towards recovering the whole sentence and has useful theoretical and practical applications of its own. The conversion from vector to bag of words is done using a greedy algorithm. To our knowledge this is the first such work. It demonstrates qualitatively the ability to recreate the words from a large corpus based on its sentence embeddings. Beyond practical applications, such as allowing classical information retrieval methods to be combined with more recent methods that use sums of word embeddings, the success of this method has theoretical implications for the degree of information maintained by the sum of embeddings representation. This lends some credence to the consideration of the SOWE as a dimensionality-reduced and meaning-enhanced data manifold for the bag of words.
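The abstract states only that a greedy algorithm converts the vector to a bag of words; it gives no further detail. The following is a minimal sketch of one way such a greedy search could work, not the authors' implementation. It assumes a greedy-addition strategy: repeatedly add whichever vocabulary word moves the running sum closest to the target SOWE vector, and stop when no word improves the distance. The function name `greedy_bag_from_sum`, the `max_words` cap, and the toy random embeddings are all hypothetical and chosen for illustration.

```python
from collections import Counter

import numpy as np


def greedy_bag_from_sum(target, embeddings, max_words=50):
    """Greedily pick words whose embeddings sum towards `target`.

    `embeddings` maps each word to a 1-D NumPy vector. This is an
    illustrative assumption about the search, not the paper's method.
    """
    words = list(embeddings)
    matrix = np.stack([embeddings[w] for w in words])  # shape (V, d)
    bag = Counter()
    current = np.zeros_like(target, dtype=float)
    best_dist = np.linalg.norm(target - current)
    for _ in range(max_words):
        # Distance to the target after adding each candidate word once.
        dists = np.linalg.norm(target - (current + matrix), axis=1)
        i = int(np.argmin(dists))
        if dists[i] >= best_dist:
            break  # no single word brings the sum closer; stop
        bag[words[i]] += 1
        current = current + matrix[i]
        best_dist = dists[i]
    return bag


# Toy data: random 50-dimensional vectors stand in for real embeddings.
rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
toy_embeddings = {w: rng.normal(size=50) for w in vocab}
sentence = ["the", "cat", "sat", "on", "the", "mat"]
sowe = np.sum([toy_embeddings[w] for w in sentence], axis=0)
print(greedy_bag_from_sum(sowe, toy_embeddings))
```

A purely additive search like this can get stuck when an early choice is wrong, so a practical version would likely need some way to revisit earlier picks. With a real vocabulary the candidate scan is also far larger than in this toy setting.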

Original language: English
Title of host publication: Computational Linguistics and Intelligent Text Processing - 17th International Conference, CICLing 2016, Revised Selected Papers
Editors: Alexander Gelbukh
Place of Publication: Austria
Publisher: Springer
Pages: 91-102
Number of pages: 12
ISBN (Print): 9783319754765
Publication status: Published - 1 Jan 2018
Event: 17th International Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2016 - Konya, Türkiye
Duration: 3 Apr 2016 to 9 Apr 2016

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 9623 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 17th International Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2016
Country/Territory: Türkiye
City: Konya
Period: 3/04/16 to 9/04/16
