Z-Score Outlier Detection
- OnoSureiya
- Apr 22, 2020
Outlier detection is essential for accurate statistical analysis and hypothesis testing: outlier detection algorithms pick out the data points that can be treated as anomalies in a given dataset. It is therefore worth discussing specific methods for outlier detection. In this article, we will discuss the Z-Score outlier detection algorithm, using no libraries other than NumPy and Pandas. The Z-Score is a parametric method, suited to univariate datasets that follow a Gaussian distribution.
Take a look at the complete implementation here: https://github.com/AmanPriyanshu/Machine-Learning-Unsupervised/blob/master/Outliers/Z_score.py
With the kind of dataset we need established, let us begin by creating a univariate dataset.
DATASET
We will write a function that creates N data points following a pre-determined Gaussian distribution. Before generating this dataset, let us discuss the Gaussian distribution itself. A Gaussian distribution, or normal distribution, is a type of continuous probability distribution for a real-valued random variable. Here is the general formula for a Gaussian distribution:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

The parameter μ is the mean or expectation of the distribution, and σ is its standard deviation. The variance of the distribution is σ^2. With this in place, let us visualise some Gaussian distributions.

Explanation of the above image:
GREEN: Mean = 0, Variance = 1
RED: Mean = 0, Variance = 2
BLUE: Mean = 0, Variance = 3
BLACK: Mean = 2, Variance = 5
YELLOW: Mean = 3, Variance = 1
As we can see, as the variance increases the curve becomes broader, because its standard deviation (the square root of the variance) increases with it. Each curve plotted here stays between 0 and 1, although a density is not bounded by 1 in general: its peak grows without limit as σ shrinks. The global maximum of each curve is located at its mean. That covers the basics of Gaussian distributions. We can understand the plot of f(x) further using the formula above. It gives the maximum height H, reached at x = μ:

$$H = \frac{1}{\sigma\sqrt{2\pi}}$$

as well as where the curve rises and falls: the derivative of f(x) is positive for x < μ and negative for x > μ.
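The relationship between variance and peak height can be checked numerically. The following is a minimal sketch, not the author's code: the helper names gaussian_pdf and peak_height are hypothetical, and the (mean, variance) pairs are taken from the colour legend above.

```python
import math

def gaussian_pdf(x, mean, variance):
    """Density of a normal distribution with the given mean and variance at x."""
    sigma = math.sqrt(variance)
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mean) ** 2) / (2.0 * variance)
    return coefficient * math.exp(exponent)

def peak_height(variance):
    """Maximum of the density, reached at x = mean: 1 / (sigma * sqrt(2 * pi))."""
    return 1.0 / math.sqrt(2.0 * math.pi * variance)

# (mean, variance) pairs from the colour legend of the plot
curves = {
    "GREEN": (0, 1),
    "RED": (0, 2),
    "BLUE": (0, 3),
    "BLACK": (2, 5),
    "YELLOW": (3, 1),
}

for name, (mean, variance) in curves.items():
    # The density evaluated at the mean should equal the closed-form peak height.
    at_mean = gaussian_pdf(mean, mean, variance)
    print(name, round(at_mean, 4), round(peak_height(variance), 4))
```

Curves with the same variance (GREEN and YELLOW) share the same peak height regardless of their mean, while a larger variance gives a lower, broader curve.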
Finally, we can create a dataset that follows a pre-determined Gaussian distribution. We can use the gauss function from Python's random module. We will be using the following code to generate the dataset:
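A minimal sketch of such a generator is given here. The mean of 5 is suggested by the sample values discussed in this article; the standard deviation of 1, the sample size, the seed, and the helper name generate_dataset are illustrative assumptions rather than the author's actual settings.

```python
import random
import statistics

def generate_dataset(n, mean, std_dev, seed=None):
    """Return a list of n samples drawn with random.gauss(mean, std_dev)."""
    # A private generator keeps the seed from affecting global random state.
    rng = random.Random(seed)
    return [rng.gauss(mean, std_dev) for _ in range(n)]

# Assumed parameters: only the mean of 5 is hinted at by the article.
data = generate_dataset(n=1000, mean=5.0, std_dev=1.0, seed=42)

print(data[:10])                 # first 10 generated points
print(statistics.mean(data))     # should be close to 5
print(statistics.stdev(data))    # should be close to 1
```

Passing a seed makes the dataset reproducible, which helps when comparing outlier counts between runs.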

Now that we have generated a dataset, let us take a quick look at its first 10 elements:
As you can see, most of the data points lie very close to 5. However, there are some extremes, such as the value 2.20684379.
Now we can begin working on our Z-score implementation.
Z-Score:
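The Z-Score is the number of standard deviations a data point lies from the sample's mean. The method assumes that our dataset is Gaussian in nature. Here is a mathematical representation of it:

$$z = \frac{x - \mu}{\sigma}$$

where x is a data point, μ is the sample mean, and σ is the sample standard deviation.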
Let us look at the coding implementation of the above.
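A minimal NumPy sketch of the calculation, under stated assumptions: the function name z_scores is hypothetical, and the population standard deviation (NumPy's default, ddof=0) is used, which may differ from the choice made in the linked implementation.

```python
import numpy as np

def z_scores(values):
    """Return the z-score of every value: (x - mean) / standard deviation."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    std_dev = values.std()  # population standard deviation (ddof=0); an assumption
    if std_dev == 0:
        # All values are identical, so the z-score is undefined.
        raise ValueError("standard deviation is zero; z-scores are undefined")
    return (values - mean) / std_dev

sample = np.array([4.0, 5.0, 6.0, 5.0, 5.0])
print(z_scores(sample))
```

By construction, the resulting z-scores have a mean of 0 and a standard deviation of 1, so a single threshold can be applied regardless of the original scale of the data.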

The implementation is straightforward. Let us now take our dataset and remove the outliers, that is, the points that lie more than two standard deviations from the mean.
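The filtering step can be sketched as follows. This is an illustration under assumptions, not the author's listing: split_outliers is a hypothetical helper, the data is regenerated here with NumPy's random generator so the block is self-contained, and the mean, standard deviation, sample size, and seed are illustrative.

```python
import numpy as np

def split_outliers(values, threshold=2.0):
    """Split values into (inliers, outliers) using |z| > threshold as the outlier rule."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    is_outlier = np.abs(z) > threshold
    return values[~is_outlier], values[is_outlier]

# Illustrative Gaussian data; parameters are assumptions for this sketch.
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=1.0, size=1000)

inliers, outliers = split_outliers(data, threshold=2.0)
print(len(inliers), len(outliers))
```

For Gaussian data, roughly 95% of points fall within two standard deviations of the mean, so a threshold of 2 flags about 5% of a clean dataset. A threshold of 3 is a common, more conservative alternative.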

Let us take a look at its Gaussian Representation as well:

As you can see, the results are satisfactory and the outliers have been selected correctly. The complete implementation of Z-Score is available at the GitHub link given at the start of this article.
Conclusion:
Although easy to implement and simple to understand, the Z-Score can be a strong and useful tool for processing univariate or low-dimensional datasets that follow a parametric distribution. It is mainly applied to Gaussian distributions; however, it can be applied to other parametric distributions as well through methods such as scaling. For a complete understanding of the code, take a look at the GitHub page here: https://github.com/AmanPriyanshu/Machine-Learning-Unsupervised/blob/master/Outliers/Z_score.py