Photo by Will Myers on Unsplash

Are You Using Feature Distributions to Detect Outliers?

Here are three better ways.

Hennie de Harder
Published in TDS Archive
7 min read · Aug 30, 2022

As a data scientist, you’ve probably encountered them: data points that don’t fit in and hurt your models’ performance. How do you detect them? Do you look at boxplots or scatterplots? And after detection, do you throw the outliers away, or do you use other methods to improve the quality of the data? In this article, I will explain three ways to detect outliers.

Definition and Data

An outlier is defined as a data point that differs significantly from other observations. But what counts as significant? And what should you do if a data point looks normal on each feature separately, but the combination of feature values is rare or unlikely? In most projects the data has multiple dimensions, which makes outliers hard to spot by eye or with boxplots.
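To make the "normal per feature, unusual in combination" idea concrete, here is a minimal sketch on synthetic data. The numbers are made up for illustration and are not the housing data. The Mahalanobis distance is used only as a familiar yardstick for this one point, and it is not presented as one of the three methods of this article. The names `living_area`, `sale_price` and `suspect` are hypothetical.

```python
import math
import random


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Sample covariance of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def z_score(value, values):
    """Distance of `value` from the sample mean, in standard deviations."""
    return (value - mean(values)) / math.sqrt(covariance(values, values))


def mahalanobis_2d(point, xs, ys):
    """Mahalanobis distance of a 2-D point from the sample (xs, ys)."""
    sxx, syy, sxy = covariance(xs, xs), covariance(ys, ys), covariance(xs, ys)
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mean(xs), point[1] - mean(ys)
    # Quadratic form d^T S^-1 d, with the 2x2 inverse written out by hand.
    squared = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(squared)


# Synthetic, strongly correlated data (assumption: purely illustrative values).
rng = random.Random(0)
living_area = [rng.gauss(1500, 400) for _ in range(500)]
sale_price = [100 * a + rng.gauss(0, 10_000) for a in living_area]

# A large house with a low price: each value on its own is unremarkable.
suspect = (2100, 90_000)

z_area = z_score(suspect[0], living_area)
z_price = z_score(suspect[1], sale_price)
distance = mahalanobis_2d(suspect, living_area, sale_price)

print(f"z-score area:  {z_area:.2f}")
print(f"z-score price: {z_price:.2f}")
print(f"Mahalanobis distance: {distance:.2f}")
```

Both z-scores stay within two standard deviations, so a boxplot per feature would not flag the point, while the distance that takes the correlation into account is far larger.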

Below, you see scatterplots of the data I will use in this article. It’s housing data from OpenML (Creative Commons license). The plots and data make it easier to explain the detection techniques. I’ll only use four features (GrLivArea, YearBuilt, LotArea, OverallQual) and the target SalePrice. The data contains 1460 observations.
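If you want to follow along, the data could be loaded roughly like this. This is a sketch under an assumption: the article only says the data comes from OpenML, and I am assuming it is the dataset published there under the name `house_prices`. The helper names `select_columns` and `load_housing` are hypothetical.

```python
import pandas as pd

# The four features and the target named in the text.
COLUMNS = ["GrLivArea", "YearBuilt", "LotArea", "OverallQual", "SalePrice"]


def select_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Keep only the columns used in this article, failing loudly if any is absent."""
    missing = [column for column in COLUMNS if column not in frame.columns]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    return frame[COLUMNS].copy()


def load_housing() -> pd.DataFrame:
    """Download the data from OpenML (needs network access)."""
    # Imported here so select_columns can be used without scikit-learn installed.
    from sklearn.datasets import fetch_openml

    # Assumption: "house_prices" is the OpenML dataset meant in the text.
    housing = fetch_openml(name="house_prices", as_frame=True)
    return select_columns(housing.frame)
```

Calling `load_housing()` downloads the file on first use. If the assumption about the dataset name holds, the resulting frame should have 1460 rows, matching the count above.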

