TY - BOOK
T1 - Learning lightweight ontologies from text across different domains using the web as background knowledge
AU - Wong, Wilson
PY - 2009
Y1 - 2009
N2 - [Truncated abstract] The ability to provide abstractions of documents in the form of important concepts and their relations is a key asset, not only for bootstrapping the Semantic Web, but also for relieving us from the pressure of information overload. At present, the only viable solution for arriving at these abstractions is manual curation. In this research, ontology learning techniques are developed to automatically discover terms, concepts and relations from text documents. Ontology learning techniques rely on extensive background knowledge, ranging from unstructured data such as text corpora, to structured data such as a semantic lexicon. Manually-curated background knowledge is a scarce resource for many domains and languages, and the effort and cost required to keep the resource abreast of time is often high. More importantly, the size and coverage of manually-curated background knowledge is often inadequate to meet the requirements of most ontology learning techniques. This thesis investigates the use of the Web as the sole source of dynamic background knowledge across all phases of ontology learning for constructing term clouds (i.e. visual depictions of terms) and lightweight ontologies from documents. To appreciate the significance of term clouds and lightweight ontologies, a system for ontology-assisted document skimming and scanning is developed. This thesis presents a novel ontology learning approach that is devoid of any manually-curated resources, and is applicable across a wide range of domains (the current focus is medicine, technology and economics). More specifically, this research proposes and develops a set of novel techniques that take advantage of Web data to address the following problems: (1) the absence of integrated techniques for cleaning noisy data; (2) the inability of current term extraction techniques to systematically explicate, diversify and consolidate their evidence; (3) the inability of current corpus construction
AB - [Truncated abstract] The ability to provide abstractions of documents in the form of important concepts and their relations is a key asset, not only for bootstrapping the Semantic Web, but also for relieving us from the pressure of information overload. At present, the only viable solution for arriving at these abstractions is manual curation. In this research, ontology learning techniques are developed to automatically discover terms, concepts and relations from text documents. Ontology learning techniques rely on extensive background knowledge, ranging from unstructured data such as text corpora, to structured data such as a semantic lexicon. Manually-curated background knowledge is a scarce resource for many domains and languages, and the effort and cost required to keep the resource abreast of time is often high. More importantly, the size and coverage of manually-curated background knowledge is often inadequate to meet the requirements of most ontology learning techniques. This thesis investigates the use of the Web as the sole source of dynamic background knowledge across all phases of ontology learning for constructing term clouds (i.e. visual depictions of terms) and lightweight ontologies from documents. To appreciate the significance of term clouds and lightweight ontologies, a system for ontology-assisted document skimming and scanning is developed. This thesis presents a novel ontology learning approach that is devoid of any manually-curated resources, and is applicable across a wide range of domains (the current focus is medicine, technology and economics). More specifically, this research proposes and develops a set of novel techniques that take advantage of Web data to address the following problems: (1) the absence of integrated techniques for cleaning noisy data; (2) the inability of current term extraction techniques to systematically explicate, diversify and consolidate their evidence; (3) the inability of current corpus construction
KW - Ontologies (Information retrieval)
KW - Data mining
KW - Technological innovations
KW - Abstracting
KW - Semantic Web
KW - Computer programming
KW - Internet programming
KW - Ontology learning
KW - Term extraction
KW - Natural language processing
KW - Text mining
KW - Corpus linguistics
KW - Term clustering
KW - Taxonomy
KW - Wikipedia mining
M3 - Doctoral Thesis
ER -