Mohammed Arebi

Experienced data scientist. I blog about machine learning and my journey. Actively building. Always learning.

Applying Over-Sampling Methods to Highly Imbalanced Data

I covered several undersampling approaches for dealing with highly imbalanced data in an earlier post, “Applying Under-Sampling Methods to Highly Imbalanced Data”. In this article, I present oversampling strategies for the same problem. Oversampling raises the weight of the minority class by replicating minority-class examples. Although it adds no new information, it introduces the risk of over-fitting, which makes the model overly specific: accuracy on the training set may be high while performance on unseen datasets is poor.
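As a minimal sketch of the simplest oversampling strategy, random oversampling, here is a plain NumPy version; the dataset, the seed, and the helper name `random_oversample` are illustrative, not from the original post:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical imbalanced dataset: 95 majority (class 0), 5 minority (class 1).
X = rng.normal(size=(100, 2))
y = np.array([0] * 95 + [1] * 5)

def random_oversample(X, y, rng):
    """Duplicate minority-class rows (sampling with replacement)
    until every class matches the majority-class count."""
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = []
    for c in classes:
        c_idx = np.flatnonzero(y == c)
        # Sampling with replacement is what reproduces minority examples.
        idx.append(rng.choice(c_idx, size=n_max, replace=True))
    idx = np.concatenate(idx)
    return X[idx], y[idx]

X_res, y_res = random_oversample(X, y, rng)
print(np.bincount(y_res))  # balanced: [95 95]
```

Because the minority rows are exact duplicates, a flexible model can memorize them, which is precisely the over-fitting risk described above.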

Applying Under-Sampling Methods to Highly Imbalanced Data

Class imbalance can bias a classifier heavily toward the dominant class, lowering classification performance and increasing the frequency of false negatives. How can we solve the problem? The most popular strategies involve data resampling: undersampling the majority class, oversampling the minority class, or a combination of the two. As a result, classification performance improves. In this article, I describe what imbalanced data is, why the Receiver Operating Characteristic (ROC) curve fails to measure performance accurately in this setting, and how to address the problem.
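The undersampling side can be sketched in the same spirit: randomly discard majority-class rows until the classes match. This is a plain NumPy illustration with made-up data, not the post's own implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced dataset: 90 majority (class 0), 10 minority (class 1).
X = rng.normal(size=(100, 2))
y = np.array([0] * 90 + [1] * 10)

def random_undersample(X, y, rng):
    """Keep all minority rows and a random subset of each other class,
    shrinking every class to the minority-class count."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    idx = np.concatenate([
        # Sampling without replacement drops the surplus majority rows.
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    return X[idx], y[idx]

X_res, y_res = random_undersample(X, y, rng)
print(np.bincount(y_res))  # balanced: [10 10]
```

The trade-off mirrors oversampling: nothing is duplicated, but potentially useful majority-class information is thrown away.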

Logistic Regression: With Application and Analysis on the 'Rain in Australia' Dataset

Introduction

The logistic model (or logit model) is a statistical model that represents the probability of an event occurring by making the log-odds of the event a linear combination of one or more independent variables. Logistic regression is another technique that machine learning borrowed from statistics. It is the go-to method for binary classification problems (problems with two classes), even though it is technically a regression algorithm (it predicts probabilities; more on that later).
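The log-odds relationship can be made concrete with a few lines of NumPy; the weights, bias, and feature values below are illustrative numbers, not fitted coefficients from the 'Rain in Australia' dataset:

```python
import numpy as np

def sigmoid(z):
    """Map a log-odds score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Log-odds as a linear combination of features (illustrative values).
w = np.array([0.8, -1.2])   # coefficients
b = 0.3                     # intercept
x = np.array([2.0, 1.0])    # one observation

log_odds = w @ x + b        # 0.8*2.0 - 1.2*1.0 + 0.3 = 0.7
p = sigmoid(log_odds)       # ≈ 0.668

# Inverting the sigmoid recovers the linear score: log(p / (1-p)) = w·x + b.
assert np.isclose(np.log(p / (1 - p)), log_odds)
print(round(p, 3))
```

This is why logistic regression counts as a regression algorithm: the model is linear in the log-odds, and the sigmoid merely maps that continuous score to a probability that can then be thresholded for classification.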