SMOTE and Edited Nearest Neighbors Undersampling for Imbalanced Classification
Imbalanced datasets are a special case of classification problems in which the class distribution is not uniform across the classes. One technique for handling imbalanced datasets is data sampling.
Synthetic Minority Oversampling Technique (SMOTE) is an oversampling technique that generates synthetic samples of the minority class to balance it against the majority class. It is used to obtain a class-balanced or nearly class-balanced training set. SMOTE works by selecting a minority-class example and one of its nearest minority-class neighbors in the feature space, drawing a line between the two, and generating a new sample at a randomly chosen point along that line.
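To make the interpolation step concrete, here is a minimal sketch of how a single synthetic sample is generated, assuming the minority examples are NumPy arrays; the variable names are illustrative, not taken from any library.

import numpy as np

rng = np.random.default_rng(42)

# Two nearby minority-class examples in feature space
x_i = np.array([1.0, 2.0])       # a minority example
x_nn = np.array([1.5, 2.5])      # one of its nearest minority neighbors

# SMOTE picks a random point on the line segment between them
lam = rng.uniform(0, 1)          # interpolation factor in [0, 1]
x_new = x_i + lam * (x_nn - x_i) # the new synthetic sample
print(x_new)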
Edited Nearest Neighbors (ENN) is an undersampling technique that removes examples, typically from the majority class, in order to reduce the imbalance. ENN works by removing samples whose class label differs from the class of the majority of their k nearest neighbors.
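The editing rule itself is simple. Below is a rough sketch of the ENN decision using scikit-learn's NearestNeighbors with k=3; the toy data is made up for illustration, and this sketch applies the rule to every sample, whereas imbalanced-learn's EditedNearestNeighbours edits only the majority class by default.

import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [0.15]])
y = np.array([0, 0, 0, 1, 1, 1])  # the last sample sits inside class 0

# Find the k=3 nearest neighbors of each sample (the first hit is the sample itself)
nn = NearestNeighbors(n_neighbors=4).fit(X)
_, idx = nn.kneighbors(X)

keep = []
for i, neighbors in enumerate(idx):
    neighbor_labels = y[neighbors[1:]]  # drop the sample itself
    # Keep the sample only if it agrees with the majority of its neighbors
    if np.sum(neighbor_labels == y[i]) >= len(neighbor_labels) / 2:
        keep.append(i)

X_edited, y_edited = X[keep], y[keep]
print(y_edited)  # the stray class-1 sample at index 5 has been removed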
Both types of sampling can be used together to handle imbalanced datasets. We can demonstrate this on a simple binary classification problem with a 1:100 class imbalance.
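A minimal sketch of such a dataset using scikit-learn's make_classification; the exact generator parameters are an assumption, chosen only to produce roughly a 1:100 class ratio.

from collections import Counter
from sklearn.datasets import make_classification

# Synthetic binary classification problem with ~1:100 class imbalance
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99],
                           flip_y=0, random_state=1)
print(Counter(y))  # roughly 9900 majority vs 100 minority examples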
We can fit a DecisionTreeClassifier model on this dataset as a benchmark model.
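A sketch of the baseline evaluation, assuming the X and y from the generator above; repeated stratified cross-validation scored with ROC AUC is one reasonable choice, not necessarily the original setup.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold

model = DecisionTreeClassifier()
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
print('Mean ROC AUC: %.3f' % np.mean(scores))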
The model achieved a ROC AUC of about 0.737 without any sampling.
Now let's apply the SMOTE and ENN sampling techniques to our example dataset.
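A sketch of combining the two samplers with imbalanced-learn, which provides SMOTEENN as a ready-made combination of SMOTE oversampling followed by ENN cleaning; this assumes the imblearn package is installed.

from collections import Counter
from imblearn.combine import SMOTEENN

# Oversample the minority class with SMOTE, then clean with ENN
resample = SMOTEENN(random_state=1)
X_res, y_res = resample.fit_resample(X, y)
print(Counter(y_res))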
Now let's fit the same DecisionTreeClassifier model on the dataset after applying the sampling methods.
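To evaluate this fairly, the resampling should happen inside each cross-validation fold rather than on the full dataset, so the sketch below uses imbalanced-learn's Pipeline, which applies samplers to the training folds only; the hyperparameters are defaults, not tuned values.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from imblearn.pipeline import Pipeline
from imblearn.combine import SMOTEENN

# Resampling is applied to the training folds only, never the test folds
pipeline = Pipeline([('sample', SMOTEENN(random_state=1)),
                     ('model', DecisionTreeClassifier())])
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
print('Mean ROC AUC: %.3f' % np.mean(scores))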
The model achieved a ROC AUC of about 0.921 with the sampling methods applied, a clear improvement over the result without sampling. Our results show that combining oversampling and undersampling can provide good results on imbalanced datasets.