Balancing Regression Datasets

Ashhadul Islam
5 min read · Jun 20, 2024


Enhancing Model Performance with the KNNOR-Reg Oversampling Technique for Imbalanced Regression Problems

Changing the distribution of minority data points based on parameters. Demo hosted on Streamlit.

Time and again, we have seen that no matter how sophisticated the subsequent models are, if the foundation of data is weak, the entire machine-learning exercise is rendered futile.

Among the various challenges marring your data, one of the most detrimental is an imbalance in the target variable distribution.

Imbalanced data makes your regression model struggle to provide accurate predictions.

Imagine you have a dataset with 9900 data points representing typical house prices and only 100 data points representing luxury house prices. A regression model trained on this data could severely underpredict the prices of luxury houses, yet still appear to perform well on average due to the overwhelming presence of typical house prices.
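This effect is easy to reproduce. The sketch below is a hypothetical illustration (the prices and the always-typical "model" are made up for demonstration): a predictor that ignores the 100 luxury houses still posts a small overall error, because the 9900 typical houses dominate the average.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 9900 typical houses, 100 luxury houses.
typical = rng.normal(300_000, 30_000, 9900)
luxury = rng.normal(2_000_000, 200_000, 100)
y_true = np.concatenate([typical, luxury])

# A "model" that always predicts a typical price.
y_pred = np.full_like(y_true, 310_000)

overall_mae = np.abs(y_true - y_pred).mean()
luxury_mae = np.abs(luxury - 310_000).mean()

print(f"Overall MAE: {overall_mae:,.0f}")  # modest, dominated by typical houses
print(f"Luxury MAE:  {luxury_mae:,.0f}")   # enormous, hidden by the average
```

The overall MAE looks acceptable even though the model is off by well over a million on every luxury house, which is exactly the failure mode oversampling aims to correct.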

In order to combat this defect, together with Dr. Khelil, Dr. Samir, Dr. Ala, and Dr. Abdulsalam, we at Hamad Bin Khalifa University and Geneva School of Economics and Management have developed a novel data augmentation algorithm.

Idea:

We divide the target values into bins (a histogram).

This shows the distribution of the target column. We have a large number of data points with target values in the range 30–50 and very few with target values above 60 or below 15. So we want to oversample the data, creating new samples with target values above 60 and below 15. This is what the histogram of y values looks like after oversampling.

After oversampling

In this case our cut-off frequency was 80. As a result, data samples in all bins with a frequency of less than 80 have been oversampled until those bins reach the cut-off frequency.
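The binning step can be sketched in a few lines. This is a minimal illustration of the idea, not the library's internal code: histogram the target values, then flag every bin whose frequency falls below the cutoff as a candidate for oversampling.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic target values, with most of the mass near 30-50.
y = rng.normal(40, 8, 1000)

cutoff = 80
counts, edges = np.histogram(y, bins=10)

# Bins below the cutoff frequency are the under-represented target ranges.
for count, lo, hi in zip(counts, edges[:-1], edges[1:]):
    status = "oversample" if count < cutoff else "ok"
    print(f"bin [{lo:5.1f}, {hi:5.1f}): {count:4d} -> {status}")
```

The sparse tail bins get flagged while the dense central bins pass, mirroring the before/after histograms shown above.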

Alternative Understanding

Here is another approach to understanding the method.

Below is a plot of the data points (let us consider only two features along the x and y axes). The y values are shown on the vertical bar on the right.

Datapoints distribution along with mapping to y value

Let us now see the histogram mapped on the vertical y-value bar.

Data distribution mapped to y values on the vertical bar along with a histogram showing the distribution of target values

As we can see in the figure above, there is a dearth of representation for y values above 60 and below 25. Our goal is to create new data points whose corresponding y values fall in those ranges, as depicted in the figure below.

New data points with circled crosses having target (y) values in the range >60 and <25

The figure above shows the new data points, marked by circled crosses (X). The corresponding target (y) values of these points are also generated, thereby changing the shape of the histogram.

The process of generating new data points and their corresponding target (y) values is discussed in detail in the research paper. Below is a demonstration of the Python library for oversampling imbalanced regression data.

pip install knnor-reg

Install the knnor-reg library with the above command. Suppose you have a CSV file containing all-numerical data. The following code snippets show how to apply the knnor-reg library with and without parameters (GitHub link to Jupyter notebook).

Without any parameters

from knnor_reg import data_augment
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset; the last column is the regression target
dataset = pd.read_csv("concrete.csv")

X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Print original data shapes
print("Original Regression Data shape:", X.shape, y.shape)

# Plot original data histogram
plt.hist(y)
plt.title("Original Regression Data y values")
plt.show()
print("************************************")

# Initialize KNNOR_Reg and perform data augmentation
knnor_reg = data_augment.KNNOR_Reg()
X_new, y_new = knnor_reg.fit_resample(X, y)
y_new = y_new.reshape(-1, 1)

# Print augmented data shapes
print("KNNOR Regression Data shape:", X_new.shape, y_new.shape)

# Plot augmented data histogram
plt.hist(y_new)
plt.title("After KNNOR Regression Data y values")
plt.show()
Output

The histogram above shows the distribution of y values before augmentation. After augmentation, the number of y values in the ranges below 30 and above 50 has increased. This is because the algorithm automatically sets the number of bins to the square root of the number of data points and the cutoff frequency to the average bin frequency of the histogram.
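Those defaults are easy to compute by hand. The sketch below mirrors the description above (square root of the sample count for the bin count, mean bin frequency for the cutoff); it is a hedged reconstruction of the stated defaults, not the library's exact code.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in target column with 900 samples.
y = rng.normal(40, 8, 900)

# Default number of bins: square root of the number of data points.
n_bins = int(np.sqrt(len(y)))        # sqrt(900) -> 30

counts, _ = np.histogram(y, bins=n_bins)

# Default cutoff: the average frequency across all bins.
cutoff = counts.mean()               # 900 samples / 30 bins -> 30.0

print("bins:", n_bins)
print("cutoff frequency:", cutoff)
print("bins to oversample:", int((counts < cutoff).sum()))
```

Since the counts always sum to the sample size, the default cutoff is simply n_samples / n_bins, so roughly half the bins of a skewed histogram end up below it and get oversampled.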

With parameters

Let us say we use 15 bins. The histogram looks like the following.

Histogram with number of bins as 15

We decide that all y values with a frequency below 80 must be oversampled. The code for that is as follows (GitHub Jupyter notebook).

from knnor_reg import data_augment
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

num_bins = 15
target_freq = 80

# Load the dataset; the last column is the regression target
dataset = pd.read_csv("concrete.csv")

X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Print original data shapes
print("Original Regression Data shape:", X.shape, y.shape)

# Plot original data histogram
plt.hist(y, bins=num_bins)
plt.title("Original Regression Data y values")
plt.show()
print("************************************")

# Initialize KNNOR_Reg and perform data augmentation with explicit parameters
knnor_reg = data_augment.KNNOR_Reg()
X_new, y_new = knnor_reg.fit_resample(X, y, bins=num_bins, target_freq=target_freq)
y_new = y_new.reshape(-1, 1)

# Print augmented data shapes
print("KNNOR Regression Data shape:", X_new.shape, y_new.shape)

# Plot augmented data histogram
plt.hist(y_new, bins=num_bins)
plt.title("After KNNOR Regression Data y values")
plt.show()

The two important parameters here are num_bins and target_freq. We set num_bins to 15 and target_freq to 80. The result is as follows.

Histograms showing y values before and after augmentation with num_bins as 15 and cutoff frequency as 80

As the output figure shows, all data points whose y values have a frequency of less than 80 have been oversampled, and corresponding target (y) values have been generated.

Regressors can be trained now with less bias and more confidence.
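As a follow-up, the augmented arrays can be fed straight into any standard regressor. The sketch below uses scikit-learn (assumed installed) with synthetic stand-in arrays; in practice you would pass the X_new and y_new returned by fit_resample above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Synthetic stand-ins for the augmented arrays from fit_resample.
rng = np.random.default_rng(0)
X_new = rng.normal(size=(500, 8))
y_new = X_new @ rng.normal(size=8) + rng.normal(scale=0.1, size=500)

# Hold out a test split, then fit an off-the-shelf regressor.
X_train, X_test, y_train, y_test = train_test_split(
    X_new, y_new, test_size=0.2, random_state=0
)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

mae = mean_absolute_error(y_test, model.predict(X_test))
print("Test MAE:", mae)
```

Because the rare target ranges are now well represented, errors in those ranges carry real weight in training instead of being averaged away.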

I invite you to use this library in your day-to-day ML activities. Following are resources to help you familiarize yourself with it.

You can read the journal paper or watch a demo video. You can also find the source code and an example notebook using the KNNOR-Reg package implementing this algorithm.

If you face any issues or have any ideas for improvement, please do not hesitate to email me at ashhadulislam@gmail.com.

Until next time.
