Towards geological knowledge discovery using vector-based semantic similarity

Majigsuren Enkhsaikhan, Wei Liu, Eun Jung Holden, Paul Duuring

Research output: Chapter in Book/Conference paper › Conference paper

1 Citation (Scopus)

Abstract

Large organisations and corporations routinely produce reports of many kinds, with no end in sight. Apart from archiving them and occasionally retrieving a few, very little can be done to exploit these massive resources for knowledge discovery. Such under-utilised, unstructured natural language text is often referred to as part of the “dark data”. Recent success in learning distributed representations of words in vector spaces, and especially the similarity and analogy queries that the learned word vectors enable, is driving a paradigm shift from “document retrieval” to “knowledge retrieval”. In this paper, we investigate how representation learning of words affects entity query results on a large domain corpus of geological survey reports. Extensive similarity tests and analogy queries were performed. The results demonstrate the necessity of training domain-specific word embeddings: pre-trained embeddings are good at capturing morphological relations but are inadequate for domain-specific semantic relations. Carrying out entity extraction prior to word embedding training further improves the quality of analogy query results. The framework developed in this paper can also be readily applied to other domain-specific corpora.
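The similarity and analogy queries mentioned in the abstract can be illustrated with a minimal sketch. It assumes cosine similarity as the ranking measure (it appears among the paper's keywords) and the common vector-offset formulation of analogy (b - a + c); whether the authors used exactly this formulation is not stated here. The vectors in `TOY_EMBEDDINGS` are hand-made for illustration and are not taken from the paper's Word2Vec or FastText models; the function names are likewise hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors, which have no direction
    return dot / (norm_u * norm_v)


def most_similar(query_vec, embeddings, exclude=(), topn=3):
    """Rank vocabulary entries by cosine similarity to query_vec."""
    scored = [
        (word, cosine_similarity(query_vec, vec))
        for word, vec in embeddings.items()
        if word not in exclude
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


def analogy(a, b, c, embeddings, topn=1):
    """Answer 'a is to b as c is to ?' with the vector offset b - a + c."""
    target = [
        vb - va + vc
        for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    # The three query words are excluded, as is usual for analogy queries.
    return most_similar(target, embeddings, exclude={a, b, c}, topn=topn)


# Hypothetical 3-dimensional toy vectors (assumption, not learned from any
# corpus). Axes, loosely: iron-related, copper-related, is-a-mineral.
TOY_EMBEDDINGS = {
    "iron": [1.0, 0.0, 0.0],
    "copper": [0.0, 1.0, 0.0],
    "hematite": [1.0, 0.0, 1.0],
    "magnetite": [0.9, 0.0, 1.0],
    "chalcopyrite": [0.0, 1.0, 1.0],
}

if __name__ == "__main__":
    # Similarity query: nearest neighbours of "hematite".
    print(most_similar(TOY_EMBEDDINGS["hematite"], TOY_EMBEDDINGS,
                       exclude={"hematite"}, topn=2))
    # Analogy query: iron is to hematite as copper is to ?
    print(analogy("iron", "hematite", "copper", TOY_EMBEDDINGS))
```

With real embeddings the same two queries would be run against vectors trained on the domain corpus, which is where the abstract's contrast between pre-trained and domain-specific embeddings would show up in the ranked results.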

Original language: English
Title of host publication: Advanced Data Mining and Applications - 14th International Conference, ADMA 2018, Proceedings
Editors: Guojun Gan, Xue Li, Shuliang Wang, Bohan Li
Publisher: Springer-Verlag Wien
Pages: 224-237
Number of pages: 14
ISBN (Print): 9783030050894
DOI: https://doi.org/10.1007/978-3-030-05090-0_20
Publication status: Published - 1 Jan 2018
Event: 14th International Conference on Advanced Data Mining and Applications, ADMA 2018 - Nanjing, China
Duration: 16 Nov 2018 to 18 Nov 2018
http://adma2018.nuaa.edu.cn/

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 11323 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 14th International Conference on Advanced Data Mining and Applications, ADMA 2018
Country: China
City: Nanjing
Period: 16/11/18 to 18/11/18
Internet address: http://adma2018.nuaa.edu.cn/

Fingerprint

Semantic Similarity
Knowledge Discovery
Data mining
Semantics
Query
Analogy
Geological surveys
Vector spaces
Retrieval
Document Retrieval
Natural Language
Vector space
Industry
Paradigm
Resources
Learning
Training
Similarity
Corpus

Cite this

Enkhsaikhan, M., Liu, W., Holden, E. J., & Duuring, P. (2018). Towards geological knowledge discovery using vector-based semantic similarity. In G. Gan, X. Li, S. Wang, & B. Li (Eds.), Advanced Data Mining and Applications - 14th International Conference, ADMA 2018, Proceedings (pp. 224-237). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 11323 LNAI). Springer-Verlag Wien. https://doi.org/10.1007/978-3-030-05090-0_20
Enkhsaikhan, Majigsuren ; Liu, Wei ; Holden, Eun Jung ; Duuring, Paul. / Towards geological knowledge discovery using vector-based semantic similarity. Advanced Data Mining and Applications - 14th International Conference, ADMA 2018, Proceedings. editor / Guojun Gan ; Xue Li ; Shuliang Wang ; Bohan Li. Springer-Verlag Wien, 2018. pp. 224-237 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).
@inproceedings{22f60a96df1e4ae49296d33a0d6d7c9e,
title = "Towards geological knowledge discovery using vector-based semantic similarity",
keywords = "Cosine similarity, FastText, Geological domain, Word analogy, Word embedding, Word2Vec",
author = "Majigsuren Enkhsaikhan and Wei Liu and Holden, {Eun Jung} and Paul Duuring",
year = "2018",
month = "1",
day = "1",
doi = "10.1007/978-3-030-05090-0_20",
language = "English",
isbn = "9783030050894",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer-Verlag Wien",
pages = "224--237",
editor = "Guojun Gan and Xue Li and Shuliang Wang and Bohan Li",
booktitle = "Advanced Data Mining and Applications - 14th International Conference, ADMA 2018, Proceedings",
address = "Austria",

}

Enkhsaikhan, M, Liu, W, Holden, EJ & Duuring, P 2018, Towards geological knowledge discovery using vector-based semantic similarity. in G Gan, X Li, S Wang & B Li (eds), Advanced Data Mining and Applications - 14th International Conference, ADMA 2018, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 11323 LNAI, Springer-Verlag Wien, pp. 224-237, 14th International Conference on Advanced Data Mining and Applications, ADMA 2018, Nanjing, China, 16/11/18. https://doi.org/10.1007/978-3-030-05090-0_20

Towards geological knowledge discovery using vector-based semantic similarity. / Enkhsaikhan, Majigsuren; Liu, Wei; Holden, Eun Jung; Duuring, Paul.

Advanced Data Mining and Applications - 14th International Conference, ADMA 2018, Proceedings. ed. / Guojun Gan; Xue Li; Shuliang Wang; Bohan Li. Springer-Verlag Wien, 2018. p. 224-237 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 11323 LNAI).

TY - GEN

T1 - Towards geological knowledge discovery using vector-based semantic similarity

AU - Enkhsaikhan, Majigsuren

AU - Liu, Wei

AU - Holden, Eun Jung

AU - Duuring, Paul

PY - 2018/1/1

Y1 - 2018/1/1

KW - Cosine similarity

KW - FastText

KW - Geological domain

KW - Word analogy

KW - Word embedding

KW - Word2Vec

UR - http://www.scopus.com/inward/record.url?scp=85059735663&partnerID=8YFLogxK

U2 - 10.1007/978-3-030-05090-0_20

DO - 10.1007/978-3-030-05090-0_20

M3 - Conference paper

SN - 9783030050894

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 224

EP - 237

BT - Advanced Data Mining and Applications - 14th International Conference, ADMA 2018, Proceedings

A2 - Gan, Guojun

A2 - Li, Xue

A2 - Wang, Shuliang

A2 - Li, Bohan

PB - Springer-Verlag Wien

ER -

Enkhsaikhan M, Liu W, Holden EJ, Duuring P. Towards geological knowledge discovery using vector-based semantic similarity. In Gan G, Li X, Wang S, Li B, editors, Advanced Data Mining and Applications - 14th International Conference, ADMA 2018, Proceedings. Springer-Verlag Wien. 2018. p. 224-237. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). https://doi.org/10.1007/978-3-030-05090-0_20