#collapse
import pickle
from collections import defaultdict
from itertools import product
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm

from imblearn.datasets import fetch_datasets
from imblearn.ensemble import BalancedRandomForestClassifier, BalancedBaggingClassifier
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.pipeline import make_pipeline

from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_validate, cross_val_predict, StratifiedKFold
from sklearn.metrics import precision_recall_curve
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

Introduction

Imbalanced datasets occur in various fields and create unique challenges in model training as well as evaluation. In this post, we compare a few ensemble algorithms on imbalanced datasets. We delve deeper on a specific model: a bag of logistic regressions, and discuss why it might be a good choice for many imbalanced datasets.

Datasets

We use three datasets from different domains. All three are available from the UCI ML Database. An overview of their properties is provided below.

#collapse

abalone_fname = 'abalone.feather'

def fetch_abalone():
    """Download Abalone dataset from UCI archives and load in Pandas"""
    abalone_uci = 'http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data'
    abalone = pd.read_csv(abalone_uci, header=None)
    abalone.columns = ['sex', 'length', 'diameter', 'height', 'whole_weight', 'shucked_weight', 
                       'viscera_weight', 'shell_weight', 'rings']
    abalone = pd.get_dummies(abalone, columns=['sex'])
    abalone.to_feather(abalone_fname)  # save local copy
    return abalone

def abalone_Xy(thr):
    """For a given threshold thr, all values above the threshold become
    the positive class in binary classification"""
    return {
        'data': abalone.drop(columns='rings'),
        'target': (abalone['rings'] >= thr).astype(np.int32),
        'DESCR': f'abalone_{thr}+'
    }

def get_stats(ds):
    """Extract statistics from a dataset."""
    n_examples, n_feats = ds['data'].shape
    is_positive = ds['target'] == 1
    return {
        'name': ds['DESCR'],
        'n_examples': n_examples,
        'n_feats': n_feats,
        'prop_positive': is_positive.mean(),
        'n_positive': is_positive.sum()
    }


if Path(abalone_fname).is_file():
    abalone = pd.read_feather(abalone_fname)
else:
    abalone = fetch_abalone()
    
datasets = fetch_datasets(filter_data=('ozone_level', 'coil_2000'))
thr = 16
datasets[f'abalone_{thr}+'] = abalone_Xy(thr=thr)

ds_info = pd.DataFrame(get_stats(ds) for ds in datasets.values())
ds_info
name n_examples n_feats prop_positive n_positive
0 ozone_level 2536 72 0.028785 73
1 coil_2000 9822 85 0.059662 586
2 abalone_16+ 4177 10 0.062485 261

Ozone level

Houston pollution

This dataset contains 7 years of daily metereological (wind speed, temperature) as well as air quality indicators for Houson, Texas. The goal is to predict "ozone days" - days when the ground-level ozone concentration exceeds a certain threshold, which can be harmful to human health.

Coil 2000

A caravan trailer

This dataset was used for the CoIL data mining competition. It is an insurance dataset where the goal is to predict which customers of a Dutch insurance company own an insurance policy covering a caravan. The features include socio-demographic data (derived from postal codes) and product ownership data. Importantly, all continuous features have been discretized in advance by the company.

Abalone

The pink abalone, Haliotis corrugata

Abalone are molluscs and a type of marine snail. This dataset contains physical measurements of abalone (weight, size, length, sex) and the target variable is the number of rings (which is related to their age). The ground truth for the rings is obtained by cutting the shell and staining it, which is a time-consuming task, whereas the physical measurements are more easily obtained.

For the purposes of creating an imbalanced dataset, we take all abalone with rings >= 16 as the minority class at about 6%.

Experiments

Models

Dealing with imbalanced datasets often involves resampling. For example, we can under-sample the majority class, over-sample the minority class, or do a combination of the two. There are also methods that over-sample based on specific rules, which might involve generating synthetic examples. e.g., see the family of SMOTE methods in the imbalanced-learn library.

Bagging methods can be a particularly good fit for imbalanced datasets because resampling is already part of the algorithm. All we need to do is to modify the resampling procedure to introduce rebalancing of the dataset. This is already implemented for us in imblearn.ensemble. For example, a BalancedRandomForestClassifier(sampling_strategy=0.5) is similar to a regular Random Forest, but each tree will be fitted to a rebalanced dataset, where the ratio of minority- to majority-class samples is 0.5

The models we use are:

  • Baseline model: RandomOversampler followed by a single logistic regression, trained on the rebalanced dataset.
  • Balanced Random Forest (or a Bag of Trees)
  • Balanced Bag of Logistic Regressions. Similar to the balanced forest, but the base model is a logistic regression.

The rationale for this choice is that a balanced forest model might be able to capture non-linear relationships and interactions between variables, but the bag of logistic regressions might be more robust to overfitting in the imbalanced setting where label information is sparse.

The code below includes the model definitions:

def get_clfs(sampling_strategy, max_features=0.5):
    """Define ensemble classifiers for experiment.
    
    sampling_strategy: float or str
        Target ratio for minority / majority class when rebalancing.
        With sampling_strategy=0.5, there will be twice as many samples in the majority class
        after the rebalancing procedure.
    max_features: the proportion of features to draw at each resampling.
    """
    return {
        'single_logistic': make_pipeline(
            StandardScaler(),
            RandomOverSampler(sampling_strategy=sampling_strategy),
            LogisticRegression(penalty='l2')
        ),
        'logistic_bag': make_pipeline(
            StandardScaler(),
            BalancedBaggingClassifier(
                base_estimator=LogisticRegression(penalty='l2'),
                sampling_strategy=sampling_strategy,
                max_features=max_features, 
                n_estimators=20            
            )
        ),
        'balanced_forest': BalancedRandomForestClassifier(
            min_samples_split=10, 
            sampling_strategy=sampling_strategy,
            n_estimators=100,
            max_features=max_features)
    }

Evaluation

To evaluate the models, we will use 10-fold cross-validation, using a stratified split, that will ensure that each split receives roughly the same number of positive examples. This should produce more stable results for imbalanced dataset compared to a regular cross-validation based on a shuffled split.

We will examine the precision-recall curves obtained from the cross-validation. We also pick somewhat arbitrarily a recall threshold of 60%, and see which model will obtain better precision at that level.

#collapse
def run_exp(ds, clf, seed=42):
    """Given a dataset ds and classifier clf, run a cross-validation experiment."""
    X, y = ds['data'], ds['target']
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    scores = cross_validate(clf, X, y, cv=cv, n_jobs=-1,
                            scoring=['balanced_accuracy', 'f1', 'precision', 'recall'])
    y_probs = cross_val_predict(clf, X, y, method='predict_proba', cv=cv, n_jobs=-1)[:, 1]
    scores['y_probs'] = y_probs
    return scores

def plot_precision_recall(y_true, y_probs, ax=None, **plot_kwargs):
    ax = ax or plt.gca()
    prec, recall, thr = precision_recall_curve(y_true, y_probs)
    ax.plot(recall, prec, **plot_kwargs)
    ax.set_ylabel('Precision', fontsize=12)
    ax.set_xlabel('Recall', fontsize=12)
    ax.set_ylim(0, 1)
    return ax

def collect_results(datasets, clfs):
    """Collect experiment results.
    
    The results shape is {<dataset_name>: {'classifier_name'}: result}
    """
    results = defaultdict(dict)
    for (ds_name, ds), (clf_name, clf) in tqdm(product(datasets.items(), clfs.items()),
                                               total=len(datasets) * len(clfs)):
        results[ds_name][clf_name] = run_exp(ds, clf)
    return results
results = collect_results(datasets, get_clfs(sampling_strategy=0.5))
with open('base_results.pickle', 'wb') as f:
    pickle.dump(results, f)
100%|██████████| 9/9 [00:53<00:00,  5.92s/it]

Results

Below are the plots of the precision-recall curves.

#collapse
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(14, 3.5), sharey=True)
for i, (ds_name, ds_dict) in enumerate(results.items()):
    ax = axes[i]
    y_true = datasets[ds_name]['target']
    for clf_name, res in ds_dict.items():
        ax = plot_precision_recall(y_true, res['y_probs'], ax=ax, label=clf_name)
    ax.set_title(ds_name)
    ax.legend()
    ax.grid(True)

For the Ozone dataset, the bag of logistic regressions outperforms the balanced forest, at least for near the recall region of interest at recall=0.6. On the other hand, the balanced forest does slightly better on the Abalone dataset.

Both models do poorly on the Coil dataset. Sadly, one of the great challenges of our time, predicting caravan insurance policy, remains unsolved.

Note that the Ozone dataset includes 72 features, but it is the most imbalanced, at about 2.9%. The amount of signal available from the limited positive labels is likely quite sparse. This is possibly why the bag of logistic models does better than the balanced forest.

The Abalone dataset contains the smallest number of features (10). It is possible that the ability of the random forest to capture interactions and non-linear effects gave it a slight edge over the other model.

Bag of logistic regressions

The bag of logistic regressions performed quite well in this experiment but it might not be an obvious choice. So it is worth spending a moment to consider why it might work so well. We address a few questions about it below.

Isn't a bag of linear models simply a linear model?

It depends on the linear model, and how the individual models' predictions are combined in the bag. Indeed, for regression problems, a bag of linear regressions will produce another linear model. Logistic regression, however, includes a link function:

$$p_i(x) = \frac {1}{1+e^{-\mathbf{w x_i}}}$$

where $\mathbf{w}$ is the vector of weights fitted in the model, and $p_i$ is the predicted probability for example $x_i$.

In a BaggingClassifier, we are averaging the predicted probabilities of the base models - basically it uses the predict_proba method averages these together. For example, with two models, we obtain:

$$p_{i, bag}(x) = \frac{p_{i, 1} + p_{i, 2}}{2} = \frac{1}{2} [\frac {1}{1+e^{-\mathbf{w_1 x_i}}} + \frac {1}{1+e^{-\mathbf{w_2 x_i}}}]$$

This combination of the $p_i$ value will not necessarily produce a linear decision boundary as long as the weights $\mathbf{w_i}$ are sufficiently different. We can generate a few toy 2D datasets to illustrate this.

#collapse
def make_imbalanced_blobs(n_positive, n_negative=400, randon_state=0):
    """Creates a synthetic dataset with four blobs, where the blob at
    (0, 0) is the negative class.
    
    By setting n_positive to a small number, we can create an imbalanced dataset.
    """
    centers = [(0, 0), (5, 5), (5, -5), (-5, 5)]
    total = n_negative * len(centers)
    assert n_positive + n_negative <= total, f'n_positive + n_negative exceeds total {total}'
    X, y_blob = make_blobs(n_samples=n_negative * len(centers), centers=centers, 
                           shuffle=False, random_state=randon_state)
    X_center = X[y_blob == 0]  # negative class
    X_side   = X[y_blob != 0]  # positive class before resampling
    sample = np.random.choice(len(X_side), size=n_positive, replace=False)
    X_side = X_side[sample]

    X = np.concatenate([X_center, X_side])
    y = np.concatenate([np.zeros(len(X_center), dtype=int), 
                        np.ones(len(X_side),    dtype=int)])
    return X, y

def plot_blobs_with_boundaries(clfs, X, y):
    """Plots the decision boundaries for classifiers clfs, given 
    a blobs dataset. One plot for each classifier is created.
    """
    for clf in clfs.values(): clf.fit(X, y)

    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                         np.arange(y_min, y_max, 0.1))

    fig, axes = plt.subplots(nrows=1, ncols=len(clfs), figsize=(14, 4))

    for ax, (clf_name, clf) in zip(axes, clfs.items()):
        Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
        Z = Z.reshape(xx.shape)
        ax.contourf(xx, yy, Z, alpha=0.4)
        ax.scatter(X[:, 0], X[:, 1], c=y, s=20, edgecolor='black')
        ax.set_title(clf_name)

Below we plot a toy dataset which is not linearly separable. In the first example below we use a large number for the n_positive parameter in the make_imbalanced_blobs function (expand cell above). This effectively creates a fairly balanced dataset. We plot the decision boundaries for the two ensemble classifiers we used above.

X, y = make_imbalanced_blobs(n_positive=200)
plot_blobs_with_boundaries(get_clfs(sampling_strategy='auto'), X, y)

The bag of logistic regressions generates a near-linear decision boundary, so it is unable to effectively separate the classes. As expected, the balanced forest easily separates this dataset.

Now let's examine the imbalanced version of the dataset. The new decision boundaries are plotted below.

X, y = make_imbalanced_blobs(n_positive=20)
plot_blobs_with_boundaries(get_clfs(sampling_strategy=0.5), X, y)

We see that the decision boudary for the bag of logistic models has changed substantially, and is now non-linear. It is also better able to fit the dataset. In comparison, the balanced forest decision boundary is mostly the same.

Can we use bagging with a low-variance estimator like logistic regression?

As we see above, when we apply the bag of logistic regressions to a balanced dataset, most of the logistic models will tend to agree on the fitted coefficients, resulting in a near-linear boundary. There is little benefit of combining many nearly-identical models.

On the other hand, when using imbalanced datasets, the individual models exhibit high variance in the fitted weights, so combining them in a bag might produce a robust classifier.

Conclusion

Ensemble models provide a natural ways of dealing with imbalanced datasets, by incorporating a re-balancing procedure in the resampling. Logistic regression is not commonly used as a base model in bagging since it is a low-variance estimator. In imbalanced settings, even logistic regression can exhibit high variance. This is especially the case as we increase the dimensionality of the problem. In those cases, a bag of logistic regressions might produce competitive results.