A novel augmentation technique for imbalanced data
Tipping the scales
Time and again we have seen that, no matter how sophisticated the subsequent models are, a machine learning exercise built on a weak foundation of data is rendered futile.
Among the many handicaps that can mar your data, one of the worst is an imbalance in class populations.
Imbalanced data makes your classifier wake up in the middle of the night in a pool of cold sweat.
Imagine a dataset with 9900 data points from healthy patients and 100 from patients with an illness. A classifier trained on this data could label every sick patient as healthy and still achieve 99% accuracy.
Accuracy = (9900/10000)*100 = 99%
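The arithmetic above can be reproduced in a few lines of Python (a toy illustration, not part of the library), and it also shows what accuracy hides: the recall on the sick patients is zero.

```python
import numpy as np

# 0 = healthy, 1 = sick: the imbalanced dataset described above
y_true = np.array([0] * 9900 + [1] * 100)
y_pred = np.zeros_like(y_true)  # a "classifier" that predicts healthy for everyone

accuracy = (y_pred == y_true).mean()
recall_on_sick = (y_pred[y_true == 1] == 1).mean()

print(f"accuracy: {accuracy:.0%}")                # 99%
print(f"recall on sick patients: {recall_on_sick:.0%}")  # 0%
```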
To combat this defect, we at QCRI and Hamad Bin Khalifa University, together with Dr Halima and Dr Samir, have developed a novel data augmentation algorithm.
The K-Nearest Neighbor OveRsampling (KNNOR) approach builds on oversampling algorithms like SMOTE and proposes a smarter, safer method of creating artificial data points. The KNNOR algorithm has been peer reviewed and published in the journal Applied Soft Computing. It has been tested on multiple imbalanced datasets and outperformed the top 10 state-of-the-art augmentation algorithms.
The animation below shows the algorithm in action as it produces different distributions of artificial data with varying parameters.
You can try the algorithm yourself by visiting the application hosted on Streamlit. Alter the parameters and see how the distribution changes. The algorithm uses four parameters.
- Final proportion of minority population: The algorithm needs to know how many artificial points to generate. A value of 0.5 means that, after augmentation, the original minority points plus the newly created points will together amount to 50% of the majority population. For example, if your dataset has 100 majority points and 10 minority points, a proportion of 1.0 means that 90 artificial minority points will be added, leaving the minority and majority classes in equal proportion.
- Number of neighbors: The number of minority neighbors used in the generation of each artificial minority data point.
- Distance: The maximum distance, expressed as a proportion of the distance between two minority data points, at which a new minority point will be placed.
- Proportion of minority: The proportion of critical minority points that will be used to create new minority points.
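The arithmetic behind the first parameter can be sketched in a few lines (`points_to_add` is a hypothetical helper for illustration, not part of the library):

```python
def points_to_add(n_majority: int, n_minority: int, final_proportion: float) -> int:
    """Number of artificial minority points needed so that the original plus
    artificial minority points equal final_proportion * majority points."""
    return int(final_proportion * n_majority) - n_minority

# The example above: 100 majority, 10 minority, final proportion 1.0
print(points_to_add(100, 10, 1.0))  # 90
# A final proportion of 0.5 needs only 40 new points (10 + 40 = 50% of 100)
print(points_to_add(100, 10, 0.5))  # 40
```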
Each of these parameters deserves a fuller explanation, which can be found in the research paper. Let us move on to using the algorithm as a Python library. We have open-sourced the code on GitHub and packaged it for pip. Using the library is as simple as using any other machine learning module.
pip install knnor
Run the above to install the library in your Python environment. The following code shows how to apply the library to a sample dataset from sklearn.
This is where the augmentation happens.
As can be seen, it is the knnor.fit_resample() function to which you pass the X and y values. The algorithm computes the four parameters discussed above on its own and applies the optimized values. However, you can also pass the parameters explicitly to fit_resample().
Once you pass the augmented data to your classifiers, you can expect better results, as the data will be more balanced and better represented. We tested this on 19 imbalanced datasets, and accuracy improved by varying margins.
The last thing I want is to leave this innovation in a dark corner of the world of science. I invite you to use this library in your day-to-day ML activities. The following resources will help you familiarize yourself with it.
You can read the journal paper or watch an explainer video. You can also find the source code and the implementation documentation. Here is an example notebook using the KNNOR package implementing this algorithm.
If you face any issues or have ideas for improvement, please do not hesitate to email me at ashhadulislam@gmail.com.
Finally, a vote of thanks to Mısra Turp for her simple and effective tutorial on YouTube, which enabled me to set up the visualization on Streamlit in under 6 hours. Streamlit is to data science projects what Medium is to blogging.
Until next time.