TY - GEN
T1 - Generating bags of words from the sums of their word embeddings
AU - White, Lyndon
AU - Togneri, Roberto
AU - Liu, Wei
AU - Bennamoun, Mohammed
PY - 2018/1/1
Y1 - 2018/1/1
N2 - Many methods have been proposed to generate sentence vector representations, such as recursive neural networks, latent distributed memory models, and the simple sum of word embeddings (SOWE). However, very few methods demonstrate the ability to reverse the process – recovering sentences from sentence embeddings. Amongst the many sentence embeddings, SOWE has been shown to maintain semantic meaning, so in this paper we introduce a method for moving from the SOWE representations back to the bag of words (BOW) for the original sentences. This is a part way step towards recovering the whole sentence and has useful theoretical and practical applications of its own. This is done using a greedy algorithm to convert the vector to a bag of words. To our knowledge this is the first such work. It demonstrates qualitatively the ability to recreate the words from a large corpus based on its sentence embeddings. As well as practical applications for allowing classical information retrieval methods to be combined with more recent methods using the sums of word embeddings, the success of this method has theoretical implications on the degree of information maintained by the sum of embeddings representation. This lends some credence to the consideration of the SOWE as a dimensionality reduced, and meaning enhanced, data manifold for the bag of words.
UR - http://www.scopus.com/inward/record.url?scp=85044432220&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-75477-2_5
DO - 10.1007/978-3-319-75477-2_5
M3 - Conference paper
AN - SCOPUS:85044432220
SN - 9783319754765
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 91
EP - 102
BT - Computational Linguistics and Intelligent Text Processing - 17th International Conference, CICLing 2016, Revised Selected Papers
A2 - Gelbukh, Alexander
PB - Springer
CY - Austria
T2 - 17th International Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2016
Y2 - 3 April 2016 through 9 April 2016
ER -