Balancing Regression Datasets
Enhancing Model Performance with the KNNOR-Reg Oversampling Technique for Imbalanced Regression Problems
Time and again, we have seen that no matter how sophisticated the models are, if the foundation of data is weak, the entire machine-learning exercise is rendered futile.
Among the various challenges marring your data, one of the most detrimental is an imbalance in the target variable distribution.
Imbalanced data makes your regression model struggle to provide accurate predictions.
Imagine you have a dataset with 9900 data points representing typical house prices and only 100 data points representing luxury house prices. A regression model trained on this data could severely underpredict the prices of luxury houses, yet still appear to perform well on average due to the overwhelming presence of typical house prices.
To combat this defect, we at Hamad Bin Khalifa University and the Geneva School of Economics and Management, together with Dr. Khelil, Dr. Samir, Dr. Ala, and Dr. Abdulsalam, have developed a novel data augmentation algorithm.
Idea:
First, we divide the target values into bins (a histogram).
This shows the distribution of the target column. We have a large number of data points with target values in the range 30–50 and very few with target values above 60 or below 15. So we want to oversample the data by creating new samples whose target values fall above 60 or below 15. This is what the histogram of y values looks like after oversampling.
In this case our cut-off frequency was 80. As a result, every bin with a frequency below 80 has been oversampled up to the cut-off frequency.
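The binning step can be sketched by hand. The snippet below (an illustration of the idea, not the library's internal code, using a made-up target array) bins a target column and flags the bins whose frequency falls below a cut-off of 80:

```python
import numpy as np

# Hypothetical target values for illustration: most cluster around 40
rng = np.random.default_rng(0)
y = rng.normal(loc=40, scale=8, size=1000)

# Bin the target column into a histogram
counts, edges = np.histogram(y, bins=10)

# Bins whose frequency falls below the cut-off are candidates for oversampling
cutoff = 80
minority_bins = [i for i, c in enumerate(counts) if c < cutoff]

for i in minority_bins:
    print(f"bin [{edges[i]:.1f}, {edges[i + 1]:.1f}) has {counts[i]} points -> oversample")
```

The tail bins (very low and very high target values) are exactly the ones that fall below the cut-off, matching the picture described above.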
Alternative Understanding
Here is another approach to understanding the method.
Below is a plot of the data points (let us consider only two features along the x and y axes). The y values are shown on the vertical bar on the right.
Let us now see the histogram mapped on the vertical y-value bar.
As the figure above shows, there is a dearth of representation for y values above 60 and below 25. Our goal is to create new data points whose corresponding y values fall in those ranges, as depicted in the figure below.
The figure above shows the new data points, marked by circled crosses (X). Corresponding target (y) values are generated for these points, thereby changing the shape of the histogram.
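As a rough, hypothetical illustration of how such points can be synthesized (this is a generic SMOTE-style interpolation sketch, not the exact KNNOR-Reg rule), a new sample can be placed between a minority point and one of its nearest neighbors, with the target value interpolated the same way:

```python
import numpy as np

def interpolate_sample(x_a, y_a, x_b, y_b, rng):
    """Create one synthetic point on the segment between two neighboring samples.
    Generic interpolation sketch; the actual KNNOR-Reg rule is in the paper."""
    alpha = rng.uniform(0, 1)
    x_new = x_a + alpha * (x_b - x_a)  # interpolate the features
    y_new = y_a + alpha * (y_b - y_a)  # interpolate the target the same way
    return x_new, y_new

rng = np.random.default_rng(42)
x_new, y_new = interpolate_sample(np.array([1.0, 2.0]), 65.0,
                                  np.array([2.0, 4.0]), 70.0, rng)
# y_new lies somewhere between 65 and 70, in the under-represented range
```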
The process of generating new data points and their corresponding target (y) values is discussed in detail in the research paper. Below is a demonstration of the Python library for oversampling imbalanced regression data.
pip install knnor-reg
Install the knnor-reg library with the above command. Suppose you have a CSV file containing only numerical data. The following code snippets show how to apply the knnor-reg library with and without parameters (GitHub link to Jupyter notebook).
Without any parameters
from knnor_reg import data_augment
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset; the last column is the regression target
dataset = pd.read_csv("concrete.csv")
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Print original data shapes
print("Original Regression Data shape:", X.shape, y.shape)

# Plot the histogram of the original target values
plt.hist(y)
plt.title("Original Regression Data y values")
plt.show()
print("************************************")

# Initialize KNNOR_Reg
knnor_reg = data_augment.KNNOR_Reg()

# Perform data augmentation
X_new, y_new = knnor_reg.fit_resample(X, y)
y_new = y_new.reshape(-1, 1)

# Print augmented data shapes
print("After augmentation shape:", X_new.shape, y_new.shape)

# Plot the histogram of the augmented target values
plt.hist(y_new)
plt.title("After KNNOR Regression Data y values")
plt.show()
The first histogram shows the distribution of y values before augmentation. After augmentation, the number of y values below 30 and above 50 has increased. By default, the algorithm sets the number of bins to the square root of the number of data points and the cut-off frequency to the average bin frequency of the histogram.
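These defaults can be reproduced by hand. The snippet below (a sketch of the stated rule, not the library's internal code, using a stand-in target array) computes the automatic bin count and cut-off frequency:

```python
import numpy as np

# Stand-in target column of 900 values for illustration
y = np.random.default_rng(1).normal(40, 10, size=900)

# Default number of bins: square root of the number of data points
num_bins = int(np.sqrt(len(y)))  # 30 bins for 900 points

# Default cut-off: the average bin frequency of the histogram
counts, _ = np.histogram(y, bins=num_bins)
cutoff = counts.mean()  # equals len(y) / num_bins

print(num_bins, cutoff)  # -> 30 30.0
```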
With parameters
Suppose we use 15 bins. The histogram looks like the following.
We decide that all y values in bins with a frequency below 80 must be oversampled. The code is as follows (GitHub Jupyter notebook).
from knnor_reg import data_augment
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

num_bins = 15     # number of histogram bins
target_freq = 80  # cut-off frequency: bins below this are oversampled

# Load the dataset; the last column is the regression target
dataset = pd.read_csv("concrete.csv")
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Print original data shapes
print("Original Regression Data shape:", X.shape, y.shape)

# Plot the histogram of the original target values
plt.hist(y, bins=num_bins)
plt.title("Original Regression Data y values")
plt.show()
print("************************************")

# Initialize KNNOR_Reg
knnor_reg = data_augment.KNNOR_Reg()

# Perform data augmentation with explicit parameters
X_new, y_new = knnor_reg.fit_resample(X, y, bins=num_bins, target_freq=target_freq)
y_new = y_new.reshape(-1, 1)

# Print augmented data shapes
print("After augmentation shape:", X_new.shape, y_new.shape)

# Plot the histogram of the augmented target values
plt.hist(y_new, bins=num_bins)
plt.title("After KNNOR Regression Data y values")
plt.show()
The two important parameters here are num_bins and target_freq, set to 15 and 80 respectively. The result is as follows.
As the output figure shows, all bins with a y-value frequency below 80 have been oversampled, and corresponding target (y) values have been generated.
Regressors can be trained now with less bias and more confidence.
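Once X_new and y_new are available, any regressor can be trained on them exactly as on the original data. Below is a minimal sketch using a plain least-squares fit, with synthetic stand-in arrays in place of the augmented output (so it runs without the dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for the augmented arrays returned by fit_resample
X_new = rng.normal(size=(200, 3))
y_new = X_new @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Fit an ordinary least-squares model on the augmented data
X_aug = np.column_stack([X_new, np.ones(len(X_new))])  # add intercept column
coef, *_ = np.linalg.lstsq(X_aug, y_new, rcond=None)

# Evaluate the fit on the training data
preds = X_aug @ coef
rmse = np.sqrt(np.mean((preds - y_new) ** 2))
print(f"train RMSE: {rmse:.3f}")
```

In practice you would of course evaluate on a held-out test set, paying particular attention to the error on the formerly under-represented target ranges.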
I invite you to use this library in your day-to-day ML activities. The following resources can help you familiarize yourself with it.
You can read the journal paper or watch a demo video. You can also find the source code and an example notebook using the KNNOR-Reg package implementing this algorithm.
If you face any issues or have ideas for improvement, please do not hesitate to email me at ashhadulislam@gmail.com.
Until next time.