Strategies for Imbalanced Data

There are many strategies for addressing data imbalance (a related keyword: long-tailed distribution).

  1. (Easiest) Oversampling and downsampling
    • Con of oversampling: overfitting to the duplicated minority samples
    • Con of downsampling: wastes data from the majority class
  2. (Binary classification) Adjustment of prediction threshold
    If the positive:negative ratio is 3:7, lower the probability threshold for predicting the positive class from 0.5 to 0.3.
  3. (Common) Loss weight adjustment
    Set each class's loss weight inversely proportional to its data ratio. If the ratio is 3:7 in binary classification, the weights should be 7:3.
  4. (For non-severe imbalance) Ensemble and K-fold split
    If the ratio is 1:9 in binary classification, split the majority class into 9 subsets and keep the minority class intact. Pairing each subset with the full minority class yields a balanced 1:1 ratio. Train one classifier per pairing; the final output is the ensemble of all 9 sub-classifiers.
  5. (Hard/easy sample imbalance) OHEM (Online Hard Example Mining) or focal loss
  6. Decoupling
    Train a normal CNN+FC classifier on the imbalanced data. Then freeze the CNN backbone (the learned representation) and retrain only the FC classifier head on class-balanced data.
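
Strategy 1 (oversampling) can be sketched as below — a minimal numpy example, with `random_oversample` a hypothetical helper name, that duplicates minority-class samples until all classes match the majority count:

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate samples of smaller classes until every class
    has as many samples as the largest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = []
    for c in classes:
        c_idx = np.flatnonzero(y == c)
        # Draw extra indices with replacement to fill the gap
        extra = rng.choice(c_idx, size=n_max - len(c_idx), replace=True)
        idx.append(np.concatenate([c_idx, extra]))
    idx = np.concatenate(idx)
    return X[idx], y[idx]
```

Downsampling is the mirror image: subsample each class down to the minority count instead.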
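
Strategy 2 (threshold adjustment) amounts to a one-line change at inference time; `predict_with_threshold` is an illustrative helper name:

```python
def predict_with_threshold(scores, threshold=0.3):
    """scores: predicted probabilities of the positive class.
    Lowering the threshold below 0.5 favors the rarer positive class."""
    return [1 if s >= threshold else 0 for s in scores]
```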
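
Strategy 3 (loss weights inversely proportional to class frequency) can be computed as follows; normalizing so the weights sum to the number of classes is one common convention, not the only one:

```python
def class_weights(counts):
    """Weight_c is proportional to 1 / count_c, normalized so the
    weights sum to the number of classes."""
    inv = [1.0 / c for c in counts]
    k = len(counts)
    s = sum(inv)
    return [k * w / s for w in inv]
```

For counts in ratio 3:7 this yields weights in ratio 7:3, matching item 3 above. The result can be passed to a weighted loss (e.g. the `weight` argument of a cross-entropy loss).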
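
Strategy 4 (ensemble over balanced splits) can be sketched with numpy; `balanced_subsets` is a hypothetical helper that splits the majority-class indices into k folds and pairs each fold with the full minority class:

```python
import numpy as np

def balanced_subsets(major_idx, minor_idx, k, seed=0):
    """Split majority-class indices into k folds; pair each fold
    with all minority-class indices to get k balanced training sets."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(major_idx)
    folds = np.array_split(shuffled, k)
    return [np.concatenate([fold, minor_idx]) for fold in folds]
```

One classifier is then trained per returned subset and their outputs are averaged (or voted) at inference.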
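
Strategy 6's second stage (retraining the classifier head on frozen features) can be illustrated with plain numpy logistic regression standing in for the FC head; `retrain_classifier` is an illustrative name, and in practice the features would come from the frozen CNN backbone and the data would be class-balanced:

```python
import numpy as np

def retrain_classifier(features, y, epochs=200, lr=0.5):
    """Stage 2 of decoupling: fit a fresh linear head (logistic
    regression) on fixed backbone features via gradient descent."""
    w = np.zeros(features.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1 / (1 + np.exp(-(features @ w + b)))  # sigmoid
        grad = p - y                               # dBCE/dlogit
        w -= lr * features.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b
```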