Resampling methods improve the predictive power of modeling in class-imbalanced datasets

Lee, PH

doi:10.3390/ijerph110909776

Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/35963

Title:	Resampling methods improve the predictive power of modeling in class-imbalanced datasets
Authors:	Lee, PH
Issue Date:	2014
Source:	International journal of environmental research and public health, Sept. 2014, v. 11, no. 9, p. 9776-9789
Abstract:	In the medical field, many outcome variables are dichotomized, and the two possible values of a dichotomized variable are referred to as classes. A dichotomized dataset is class-imbalanced if it consists mostly of one class, and the performance of common classification models on this type of dataset tends to be suboptimal. To tackle this problem, resampling methods, including oversampling and undersampling, can be used. This paper aims to illustrate the effect of resampling methods using the National Health and Nutrition Examination Survey (NHANES) wave 2009-2010 dataset. A total of 4677 participants aged >= 20 without self-reported diabetes and with valid blood test results were analyzed. The Classification and Regression Tree (CART) procedure was used to build a classification model for undiagnosed diabetes. Participants were classified as having undiagnosed diabetes if they demonstrated evidence of diabetes according to WHO diabetes criteria. Exposure variables included demographics and socio-economic status. CART models were fitted using a randomly selected 70% of the data (training dataset), and the area under the receiver operating characteristic curve (AUC) was computed using the remaining 30% of the sample for evaluation (testing dataset). CART models were fitted using the training dataset, the oversampled training dataset, the weighted training dataset, and the undersampled training dataset. In addition, resampling case-to-control ratios of 1:1, 1:2, and 1:4 were examined. The effects of resampling methods on the performance of extensions of CART (random forests and generalized boosted trees) were also examined. CARTs fitted on the oversampled (AUC = 0.70) and undersampled training data (AUC = 0.74) yielded better classification power than the CART fitted on the original training data (AUC = 0.65). Resampling could also improve the classification power of random forests and generalized boosted trees. To conclude, applying resampling methods in a class-imbalanced dataset improved the classification power of CART, random forests, and generalized boosted trees.
Keywords:	Automated classifier; Data mining; Decision tree; Oversampling; Predictive power; Rare events
Publisher:	Molecular Diversity Preservation International (MDPI)
Journal:	International journal of environmental research and public health
ISSN:	1661-7827
EISSN:	1660-4601
DOI:	10.3390/ijerph110909776
Rights:	© 2014 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/). The following publication Lee, P. H. (2014). Resampling methods improve the predictive power of modeling in class-imbalanced datasets. International Journal of Environmental Research and Public Health, 11(9), 9776-9789 is available at https://dx.doi.org/10.3390/ijerph110909776
Appears in Collections:	Journal/Magazine Article
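
The resampling procedure summarized in the abstract (a random 70/30 split, then oversampling or undersampling the training data to a case-to-control ratio of 1:1, 1:2, or 1:4, with AUC computed on the untouched testing data) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the author's code: the function names `split_indices`, `resample`, and `auc` are hypothetical, the data in the demo are synthetic, and fitting the CART model itself is left out.

```python
import random

# All names here are hypothetical illustrations; none come from the paper.

def split_indices(n, train_fraction=0.7, seed=0):
    """Randomly split range(n) into training and testing index lists."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    cut = round(train_fraction * n)
    return idx[:cut], idx[cut:]

def resample(labels, controls_per_case=1, method="under", seed=0):
    """Return indices into `labels` giving roughly 1:controls_per_case cases to controls.

    method="under": keep every case, draw controls without replacement.
    method="over":  keep every control, draw extra cases with replacement.
    Labels are 1 for cases (minority class) and 0 for controls.
    """
    rng = random.Random(seed)
    cases = [i for i, y in enumerate(labels) if y == 1]
    controls = [i for i, y in enumerate(labels) if y == 0]
    if not cases or not controls:
        raise ValueError("both classes must be present")
    if method == "under":
        n_controls = min(len(controls), controls_per_case * len(cases))
        picked = cases + rng.sample(controls, n_controls)
    elif method == "over":
        n_cases = max(len(cases), len(controls) // controls_per_case)
        picked = cases + rng.choices(cases, k=n_cases - len(cases)) + controls
    else:
        raise ValueError("method must be 'under' or 'over'")
    rng.shuffle(picked)
    return picked

def auc(labels, scores):
    """AUC as the probability that a random case outscores a random control.

    Ties count as one half (the Mann-Whitney formulation).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

if __name__ == "__main__":
    # Synthetic labels with about 5% cases, purely for demonstration.
    rng = random.Random(1)
    labels = [1 if rng.random() < 0.05 else 0 for _ in range(1000)]
    train, test = split_indices(len(labels))
    train_labels = [labels[i] for i in train]
    for ratio in (1, 2, 4):
        picked = resample(train_labels, controls_per_case=ratio, method="under")
        n_cases = sum(train_labels[i] for i in picked)
        print(f"1:{ratio} undersampled: {n_cases} cases, {len(picked) - n_cases} controls")
```

Only the training indices are resampled in this sketch; the testing indices keep the original class distribution, so an AUC computed on them reflects performance on imbalanced data.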

Files in This Item:

File	Description	Size	Format
Lee_Resampling_Class-imbalanced_Datasets.pdf		574.77 kB	Adobe PDF

Open Access Information

Status	open access
File Version	Version of Record


Usage and Citation Metrics

Metric	Total	Last week	Last month	As of
Page views	194	1		Feb 9, 2026
Downloads	76			Feb 9, 2026
Scopus citations	55	0	1	May 8, 2026
Web of Science citations	49	0	0	Apr 23, 2026
