Are You Using Feature Distributions to Detect Outliers?
Here are three better ways.
As a data scientist, you’ve probably encountered them: data points that don’t fit in and hurt your model’s performance. How do you detect them? Do you look at boxplots or scatterplots? And after detection, do you throw the outliers away, or do you use other methods to improve the quality of the data? In this article, I will explain three ways to detect outliers.
Definition and Data
An outlier is defined as a data point that differs significantly from other observations. But what counts as significant? And what should you do if a data point looks normal on each feature separately, but the combination of feature values is rare or unlikely? In most projects the data has multiple dimensions, which makes outliers hard to spot by eye or with boxplots.
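To make the "normal per feature, unusual in combination" case concrete, here is a minimal sketch on synthetic data (not the housing data). It appends one point to two strongly correlated features and compares the per-feature boxplot rule (1.5 × IQR) with the Mahalanobis distance. The helper names, the synthetic data, and the 0.999 cutoff are assumptions made for illustration, and the Mahalanobis distance is only a stand-in for a multivariate check, not necessarily one of the three techniques covered here.

```python
import numpy as np


def iqr_flags(data, k=1.5):
    """Boxplot rule per feature: True where a value lies outside Q1 - k*IQR or Q3 + k*IQR."""
    q1, q3 = np.percentile(data, [25, 75], axis=0)
    iqr = q3 - q1
    return (data < q1 - k * iqr) | (data > q3 + k * iqr)


def mahalanobis_sq(data):
    """Squared Mahalanobis distance of every row to the sample mean."""
    centered = data - data.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(data, rowvar=False))
    return np.einsum("ij,jk,ik->i", centered, inv_cov, centered)


# Synthetic data: two features with a correlation of roughly 0.95.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.95 * x + np.sqrt(1 - 0.95**2) * rng.normal(size=500)

# One extra point whose values are ordinary on each axis,
# but whose combination goes against the correlation.
data = np.vstack([np.column_stack([x, y]), [1.2, -1.2]])

per_feature = iqr_flags(data)[-1]   # boxplot rule for the extra point
distance_sq = mahalanobis_sq(data)[-1]

# For two roughly Gaussian features, the squared distance follows a
# chi-square distribution with 2 degrees of freedom, whose 0.999 quantile
# is exactly -2 * ln(0.001).
threshold = -2 * np.log(0.001)

print("flagged by the per-feature boxplot rule:", per_feature.any())
print("flagged by the Mahalanobis distance:", distance_sq > threshold)
```

The boxplot rule looks at one feature at a time, so it cannot see that a high value on the first feature almost never occurs together with a low value on the second. A distance that uses the covariance between the features can.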
Below, you see scatterplots of the data I will use in this article. It’s housing data from OpenML (Creative Commons license). The plots and data make it easier to explain the detection techniques. I’ll only use four features (GrLivArea, YearBuilt, LotArea, OverallQual) and the target SalePrice. The data contains 1460 observations.
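If you want to follow along, the sketch below shows one possible way to get the same columns into a pandas DataFrame. It assumes the data is the OpenML dataset registered under the name `house_prices` and that scikit-learn is installed. Neither the dataset name nor the helper functions `select_columns` and `load_housing` come from this article, so treat them as hypothetical.

```python
import pandas as pd

FEATURES = ["GrLivArea", "YearBuilt", "LotArea", "OverallQual"]
TARGET = "SalePrice"


def select_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Keep only the four features and the target, in a fixed order."""
    columns = FEATURES + [TARGET]
    missing = [name for name in columns if name not in frame.columns]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    return frame[columns].copy()


def load_housing() -> pd.DataFrame:
    """Download the data from OpenML (needs network access and scikit-learn)."""
    from sklearn.datasets import fetch_openml

    # Assumption: the dataset is the one registered as "house_prices" on OpenML.
    housing = fetch_openml(name="house_prices", as_frame=True)
    return select_columns(housing.frame)
```

Calling `load_housing()` downloads the data once and caches it locally. If the assumed dataset matches the one used here, the resulting frame should have 1460 rows and five columns.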