
Simple Numeric Outlier

This is the simplest nonparametric outlier detection method. It is used for one-dimensional datasets, and outliers are identified by means of the interquartile range (IQR). The implementation of the code is at the following GitHub link: https://github.com/AmanPriyanshu/Machine-Learning-Unsupervised/blob/master/Outliers/simple_numeric_outlier.py


Quartiles: A quartile is a type of quantile that divides the data points into four more or less equal parts, or quarters. The data must be ordered from smallest to largest before quartiles can be computed, which makes quartiles a form of order statistic.


The first quartile (Q1) is defined as the middle value between the smallest value and the median of the data set.

Formula:

Q1 = ((N + 1) / 4)th term of the ordered data

The second quartile (Q2) is the median of the data set; 50% of the data lies below this point.

Formula:

Q2 = ((N + 1) / 2)th term of the ordered data

The third quartile (Q3) is the middle value between the median and the highest value of the data set.

Formula:

Q3 = (3(N + 1) / 4)th term of the ordered data

where N is the number of data points in the dataset.


The Interquartile Range, or IQR, is the key quantity for deciding outliers here. The formula for the IQR is as follows:

IQR = Q3 - Q1

Here, we define a new factor k > 0, such that a data point x[i] is considered normal if it lies within the range:

Q1 - k(IQR) < x[i] < Q3 + k(IQR)

Any point lying outside this range is considered an outlier.
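As a quick illustration (a minimal sketch, not the repository code), the quartiles, the IQR, and the resulting outlier rule can be computed with NumPy. Note that np.percentile interpolates, so its output can differ slightly from the (N + 1) / 4 rank formula given above:

```python
import numpy as np

data = np.array([2, 4, 4, 5, 7, 9, 11, 12, 15, 40])

# Quartiles via percentiles (25th and 75th).
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

k = 1.5  # any k > 0 works; 1.5 is a conventional default
lower, upper = q1 - k * iqr, q3 + k * iqr

# Points outside (lower, upper) are outliers; here that flags 40.
outliers = data[(data < lower) | (data > upper)]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}, outliers={outliers}")
```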


Dataset Development


We will be creating a pseudo-dataset: a one-dimensional dataset generated in code. The generation code lets us create a slightly biased dataset that forms a cluster, while also producing a few outliers, so that we can evaluate our outlier-selection algorithm. The exact implementation is in the repository linked above; a sketch follows below.

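The original post shows this code as an image, so here is a hedged sketch of a generator with the same behaviour. The intervals and the probability list are assumptions chosen to match the description (roughly half the points fall between 0.6 and 0.8); the exact values are in the linked repository.

```python
import numpy as np

def generate_dataset(n, seed=0):
    """Generate a biased one-dimensional dataset with a few outliers.

    The intervals and probabilities below are assumptions chosen to
    match the post's description (about half the points land in the
    (0.6, 0.8) cluster); the exact values are in the repository.
    """
    np.random.seed(seed)
    # Candidate intervals: a dense cluster, a "normal" band, and a
    # wide band that occasionally produces outliers.
    intervals = [(0.6, 0.8), (0.5, 0.9), (0.0, 1.5)]
    probabilities = [0.5, 0.4, 0.1]  # the "probability list"
    choices = np.random.choice(len(intervals), size=n, p=probabilities)
    return np.array([np.random.uniform(*intervals[c]) for c in choices])

data = generate_dataset(1000, seed=0)
```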

Here is the output of the dataset for n = 1000 with seed(0):

[Figure: the generated dataset]

As we can see, most of the points lie between 0.6 and 0.8, close to 50% of them (~501 points), as we set in our probability list above. Now that we have a biased dataset, let us begin the implementation of Simple Numeric Outlier.


Simple Numeric Outlier Detection:


Now that we have discussed data generation, let us develop the outlier-selection algorithm. We will use the same formulae as stated above; a sketch of the implementation follows (the exact code is in the linked repository).

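A minimal sketch of the detector, following the formulae above (the function and variable names here are mine, not necessarily the repository's):

```python
import numpy as np

def detect_outliers(data, k):
    """Flag points outside (Q1 - k*IQR, Q3 + k*IQR) as outliers."""
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Boolean mask: True where the point is an outlier.
    return (data < lower) | (data > upper)

outlier_mask = detect_outliers(data, k=0.45)
print("outliers found:", outlier_mask.sum())
```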

Now that we have developed the outlier classifier, we have to choose an appropriate value for k. Let us consider that our normal cases lie within (0.5, 0.9), and draw the accuracy against the value of k.
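One way to produce such a curve is to sweep k and score each setting against the ground truth, taking a point as truly normal exactly when it lies inside (0.5, 0.9). A sketch, reusing data and detect_outliers from the snippets above:

```python
import numpy as np

# Ground truth: points inside (0.5, 0.9) are normal, the rest are outliers.
true_outliers = (data <= 0.5) | (data >= 0.9)

ks = np.arange(0.05, 2.01, 0.05)
accuracies = [np.mean(detect_outliers(data, k) == true_outliers) for k in ks]

best = ks[int(np.argmax(accuracies))]
print(f"best k = {best:.2f}, accuracy = {max(accuracies):.3f}")
```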



[Figure: accuracy plotted against the value of k]

We can clearly see that k = 0.45 has the highest accuracy, at 98.5%. Let us take a look at the accuracy for datasets with a similar bias/cluster.


For seed(1), with k = 0.45: accuracy = 0.996

For seed(2), with k = 0.45: accuracy = 0.992
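These numbers can be reproduced (up to the assumptions made in the generator sketch, so your exact figures may differ) by rerunning the pipeline with different seeds:

```python
for seed in (1, 2):
    data = generate_dataset(1000, seed=seed)
    predicted = detect_outliers(data, k=0.45)
    true_outliers = (data <= 0.5) | (data >= 0.9)
    print(f"seed({seed}): accuracy = {np.mean(predicted == true_outliers):.3f}")
```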


Here is a visualisation of this algorithm for one-dimensional outlier detection. The graph plots the dataset value against the index position.



[Figure: dataset values plotted against index position, with detected outliers visible]
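A matplotlib sketch that reproduces this kind of plot, highlighting the points the detector flags:

```python
import numpy as np
import matplotlib.pyplot as plt

mask = detect_outliers(data, k=0.45)  # detector sketch from above
idx = np.arange(len(data))

plt.scatter(idx[~mask], data[~mask], s=10, label="normal")
plt.scatter(idx[mask], data[mask], s=10, color="red", label="outlier")
plt.xlabel("index position")
plt.ylabel("dataset value")
plt.legend()
plt.show()
```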

Conclusion:


We can clearly see that this is a useful algorithm for one-dimensional datasets. It is fairly simple both algorithmically and mathematically, and it achieves good accuracy on a fairly biased dataset. It is simple in design and easy to implement; however, it cannot be applied to multi-dimensional datasets. It is important to understand that real-world data is generally multi-dimensional, with many features and complicated relations that this model may not be able to classify correctly.




