Detecting and ranking outliers in high-dimensional data

Amardeep Kaur, Amitava Datta

Research output: Contribution to journal › Article

Abstract

Detecting outliers in high-dimensional data is a challenging problem. In such data, the outlying behaviour of a data point can only be detected in locally relevant subsets of the data dimensions. These subsets of dimensions are called subspaces, and their number grows exponentially as data dimensionality increases. A data point that is an outlier in one subspace can appear normal in another. To characterise an outlier, it is therefore important to measure its outlying behaviour by the number of subspaces in which it shows up as an outlier. This additional detail can help a data analyst decide whether to remove, fix or keep an outlier in the dataset. In this paper, we propose an effective outlier detection algorithm for high-dimensional data, based on a recent density-based clustering algorithm called SUBSCALE. We also rank outliers by the strength of their outlying behaviour. Our outlier detection and ranking algorithm makes no assumptions about the underlying data distribution and can adapt to different density parameter settings. In experiments on different datasets, the top-ranked outliers were predicted with more than 82% precision and recall.
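The idea of ranking a point by the number of subspaces in which it looks like an outlier can be illustrated with a small brute-force sketch. This is not the SUBSCALE-based algorithm from the paper: it simply enumerates all 2^d - 1 subspaces, which is only feasible for tiny d, and the neighbour-count test, the function names and the parameters `epsilon` and `tau` are assumptions chosen for illustration.

```python
from itertools import combinations
from math import dist


def all_subspaces(num_dims):
    """Yield every non-empty subset of dimension indices (2**num_dims - 1 in total)."""
    for k in range(1, num_dims + 1):
        yield from combinations(range(num_dims), k)


def is_sparse(points, i, subspace, epsilon, tau):
    """True if point i has fewer than tau other points within epsilon in this subspace."""
    def project(p):
        return [p[d] for d in subspace]

    target = project(points[i])
    neighbours = sum(
        1
        for j, q in enumerate(points)
        if j != i and dist(target, project(q)) <= epsilon
    )
    return neighbours < tau


def rank_outliers(points, epsilon, tau):
    """Return (point index, number of outlying subspaces), strongest outliers first."""
    num_dims = len(points[0])
    scores = {i: 0 for i in range(len(points))}
    for subspace in all_subspaces(num_dims):
        for i in range(len(points)):
            if is_sparse(points, i, subspace, epsilon, tau):
                scores[i] += 1
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(i, s) for i, s in ranked if s > 0]


# Toy data (made up): four clustered points, one point that is far away
# only in the third dimension, and one point that is far away everywhere.
points = [
    (0.0, 0.0, 0.0),
    (0.1, 0.1, 0.0),
    (0.0, 0.1, 0.1),
    (0.1, 0.0, 0.1),
    (0.05, 0.05, 5.0),
    (5.0, 5.0, 5.0),
]
print(rank_outliers(points, epsilon=0.5, tau=2))
# → [(5, 7), (4, 4)]
```

Point 5 is sparse in all seven subspaces, while point 4 is sparse only in the four subspaces that include the third dimension, so it ranks lower. The same point can therefore look normal in one subspace and outlying in another.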

Original language: English
Pages (from-to): 75-87
Number of pages: 13
Journal: International Journal of Advances in Engineering Sciences and Applied Mathematics
Volume: 11
Issue number: 1
DOI: 10.1007/s12572-018-0240-y
Publication status: Published - Mar 2019

Cite this

@article{a6426d0e28b54d9dafb254d15fafc342,
title = "Detecting and ranking outliers in high-dimensional data",
abstract = "Detecting outliers in high-dimensional data is a challenging problem. In high-dimensional data, outlying behaviour of data points can only be detected in the locally relevant subsets of data dimensions. The subsets of dimensions are called subspaces and the number of these subspaces grows exponentially with increase in data dimensionality. A data point which is an outlier in one subspace can appear normal in another subspace. In order to characterise an outlier, it is important to measure its outlying behaviour according to the number of subspaces in which it shows up as an outlier. These additional details can aid a data analyst to make important decisions about what to do with an outlier in terms of removing, fixing or keeping it unchanged in the dataset. In this paper, we propose an effective outlier detection algorithm for high-dimensional data which is based on a recent density-based clustering algorithm called SUBSCALE. We also provide ranking of outliers in terms of strength of their outlying behaviour. Our outlier detection and ranking algorithm does not make any assumptions about the underlying data distribution and can adapt according to different density parameter settings. We experimented with different datasets, and the top-ranked outliers were predicted with more than 82{\%} precision as well as recall.",
keywords = "Data mining, Outlier detection, High-dimensional data, DATA QUALITY, ALGORITHMS",
author = "Amardeep Kaur and Amitava Datta",
year = "2019",
month = "3",
doi = "10.1007/s12572-018-0240-y",
language = "English",
volume = "11",
pages = "75--87",
journal = "International Journal of Advances in Engineering Sciences and Applied Mathematics",
issn = "0975-0770",
publisher = "Springer (India) Private Ltd.",
number = "1",

}

Detecting and ranking outliers in high-dimensional data. / Kaur, Amardeep; Datta, Amitava.

In: International Journal of Advances in Engineering Sciences and Applied Mathematics, Vol. 11, No. 1, 03.2019, p. 75-87.

TY - JOUR

T1 - Detecting and ranking outliers in high-dimensional data

AU - Kaur, Amardeep

AU - Datta, Amitava

PY - 2019/3

Y1 - 2019/3

AB - Detecting outliers in high-dimensional data is a challenging problem. In high-dimensional data, outlying behaviour of data points can only be detected in the locally relevant subsets of data dimensions. The subsets of dimensions are called subspaces and the number of these subspaces grows exponentially with increase in data dimensionality. A data point which is an outlier in one subspace can appear normal in another subspace. In order to characterise an outlier, it is important to measure its outlying behaviour according to the number of subspaces in which it shows up as an outlier. These additional details can aid a data analyst to make important decisions about what to do with an outlier in terms of removing, fixing or keeping it unchanged in the dataset. In this paper, we propose an effective outlier detection algorithm for high-dimensional data which is based on a recent density-based clustering algorithm called SUBSCALE. We also provide ranking of outliers in terms of strength of their outlying behaviour. Our outlier detection and ranking algorithm does not make any assumptions about the underlying data distribution and can adapt according to different density parameter settings. We experimented with different datasets, and the top-ranked outliers were predicted with more than 82% precision as well as recall.

KW - Data mining

KW - Outlier detection

KW - High-dimensional data

KW - DATA QUALITY

KW - ALGORITHMS

U2 - 10.1007/s12572-018-0240-y

DO - 10.1007/s12572-018-0240-y

M3 - Article

VL - 11

SP - 75

EP - 87

JO - International Journal of Advances in Engineering Sciences and Applied Mathematics

JF - International Journal of Advances in Engineering Sciences and Applied Mathematics

SN - 0975-0770

IS - 1

ER -