There are many strategies for addressing imbalanced data. A related keyword: long-tailed distribution.
- (Easiest) Oversampling and downsampling (see the sketch after the cons below)
  - Cons of oversampling: overfitting
  - Cons of downsampling: wasting data
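A minimal resampling sketch with NumPy, assuming a hypothetical feature matrix `X` and binary label vector `y` at a 9:1 class ratio:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([0] * 90 + [1] * 10)  # 9:1 imbalance

minor_idx = np.flatnonzero(y == 1)
major_idx = np.flatnonzero(y == 0)

# Oversampling: draw minority samples with replacement until classes match.
# The duplicated samples are what make the model prone to overfitting.
over_idx = rng.choice(minor_idx, size=major_idx.size, replace=True)
X_over = np.concatenate([X[major_idx], X[over_idx]])
y_over = np.concatenate([y[major_idx], y[over_idx]])

# Downsampling: keep only as many majority samples as there are minority ones.
# The discarded majority samples are the "wasted" data.
down_idx = rng.choice(major_idx, size=minor_idx.size, replace=False)
X_down = np.concatenate([X[down_idx], X[minor_idx]])
y_down = np.concatenate([y[down_idx], y[minor_idx]])
```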
- (Binary classification) Adjustment of the prediction threshold
If the positive:negative ratio is 3:7, the classification probability threshold should be lowered from 0.5 to 0.3 (the positive-class prior), as in the sketch below.
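A minimal sketch of the threshold shift, assuming `probs` is a hypothetical array of predicted positive-class probabilities and the positive class holds 30% of the data:

```python
import numpy as np

probs = np.array([0.25, 0.35, 0.60, 0.10])  # hypothetical model outputs
pos_prior = 0.3                             # positive-class ratio

preds_default = (probs > 0.5).astype(int)         # [0, 0, 1, 0]
preds_adjusted = (probs > pos_prior).astype(int)  # [0, 1, 1, 0]
```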
- (Common) Loss weight adjustment
The loss weights should be inversely proportional to the class frequencies. If the class ratio is 3:7 in binary classification, the weights should be 7:3, as in the sketch below.
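A minimal sketch with PyTorch's `nn.CrossEntropyLoss`, assuming class 0 holds 30% of the data and class 1 holds 70%, so the weights are 7:3:

```python
import torch
import torch.nn as nn

# Weights inversely proportional to the class frequencies [0.3, 0.7].
criterion = nn.CrossEntropyLoss(weight=torch.tensor([7.0, 3.0]))

logits = torch.randn(8, 2)           # hypothetical model outputs
labels = torch.randint(0, 2, (8,))   # hypothetical targets
loss = criterion(logits, labels)
```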
- (For non-severe imbalance) Ensemble with a K-fold style split
If the ratio is 1:9 in binary classification, split the majority-class samples into 9 subsets and keep the minority class intact, so each subset has a balanced 1:1 ratio. Assign one classifier per subset; the final output is the ensemble of all 9 sub-classifiers (see the sketch below).
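A minimal sketch of the split-and-ensemble idea, using scikit-learn's `LogisticRegression` as a stand-in base classifier on hypothetical data at a 1:9 ratio:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.array([1] * 100 + [0] * 900)  # 1:9 minority:majority

minor_idx = np.flatnonzero(y == 1)
major_idx = rng.permutation(np.flatnonzero(y == 0))

# Train one classifier per balanced 1:1 subset.
models = []
for chunk in np.array_split(major_idx, 9):
    idx = np.concatenate([minor_idx, chunk])
    models.append(LogisticRegression().fit(X[idx], y[idx]))

# Final output: average the 9 sub-classifiers' predicted probabilities.
X_test = rng.normal(size=(5, 4))
probs = np.mean([m.predict_proba(X_test)[:, 1] for m in models], axis=0)
preds = (probs > 0.5).astype(int)
```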
- (Hard/easy sample imbalance) OHEM (Online Hard Example Mining) or focal loss
Both focus training on hard examples; a focal loss sketch follows below.
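A minimal binary focal loss sketch in PyTorch, following the standard formulation with the usual `gamma`/`alpha` hyperparameters (the data here is hypothetical):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss on raw logits; targets are 0/1 floats."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)             # prob of true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # (1 - p_t) ** gamma down-weights easy (high-confidence) examples.
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

logits = torch.randn(8)
targets = torch.randint(0, 2, (8,)).float()
loss = focal_loss(logits, targets)
```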
- Decoupling
Train a normal CNN+FC classifier on the imbalanced data. Then fix (freeze) that classifier, add an additional FC layer after it, and train the whole DNN on balanced data (see the sketch below).
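A minimal two-stage sketch of this recipe in PyTorch; the module sizes, batch, and single training step are hypothetical, and stage-1 training is elided:

```python
import torch
import torch.nn as nn

num_classes = 10
backbone = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 128), nn.ReLU())
classifier = nn.Linear(128, num_classes)
stage1 = nn.Sequential(backbone, classifier)
# ... stage 1: train `stage1` on the imbalanced dataset as usual ...

# Stage 2: freeze the trained network and append a new FC layer.
for p in stage1.parameters():
    p.requires_grad = False
new_fc = nn.Linear(num_classes, num_classes)
model = nn.Sequential(stage1, new_fc)

# Only the new layer is optimized, on balanced batches (in practice built
# with e.g. torch.utils.data.WeightedRandomSampler).
optimizer = torch.optim.SGD(new_fc.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

x = torch.randn(8, 1, 32, 32)            # hypothetical balanced batch
y = torch.randint(0, num_classes, (8,))
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```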