Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/35963
DC Field | Value | Language
dc.contributor | School of Nursing | -
dc.creator | Lee, PH | -
dc.date.accessioned | 2016-04-15T08:36:07Z | -
dc.date.available | 2016-04-15T08:36:07Z | -
dc.identifier.issn | 1661-7827 | -
dc.identifier.uri | http://hdl.handle.net/10397/35963 | -
dc.language.iso | en | en_US
dc.publisher | Molecular Diversity Preservation International (MDPI) | en_US
dc.rights | © 2014 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/). | en_US
dc.rights | The following publication Lee, P. H. (2014). Resampling methods improve the predictive power of modeling in class-imbalanced datasets. International Journal of Environmental Research and Public Health, 11(9), 9776-9789 is available at https://dx.doi.org/10.3390/ijerph110909776 | en_US
dc.subject | Automated classifier | en_US
dc.subject | Data mining | en_US
dc.subject | Decision tree | en_US
dc.subject | Oversampling | en_US
dc.subject | Predictive power | en_US
dc.subject | Rare events | en_US
dc.title | Resampling methods improve the predictive power of modeling in class-imbalanced datasets | en_US
dc.type | Journal/Magazine Article | en_US
dc.identifier.spage | 9776 | -
dc.identifier.epage | 9789 | -
dc.identifier.volume | 11 | -
dc.identifier.issue | 9 | -
dc.identifier.doi | 10.3390/ijerph110909776 | -
dcterms.abstract | In the medical field, many outcome variables are dichotomized, and the two possible values of a dichotomized variable are referred to as classes. A dichotomized dataset is class-imbalanced if it consists mostly of one class, and the performance of common classification models on this type of dataset tends to be suboptimal. To tackle this problem, resampling methods, including oversampling and undersampling, can be used. This paper aims to illustrate the effect of resampling methods using the National Health and Nutrition Examination Survey (NHANES) wave 2009-2010 dataset. A total of 4677 participants aged 20 or older without self-reported diabetes and with valid blood test results were analyzed. The Classification and Regression Tree (CART) procedure was used to build a classification model for undiagnosed diabetes. Participants were classified as having undiagnosed diabetes if they showed evidence of diabetes according to WHO diabetes criteria. Exposure variables included demographics and socio-economic status. CART models were fitted using a randomly selected 70% of the data (training dataset), and the area under the receiver operating characteristic curve (AUC) was computed using the remaining 30% of the sample for evaluation (testing dataset). Separate CART models were fitted using the original training dataset, the oversampled training dataset, the weighted training dataset, and the undersampled training dataset. In addition, resampling case-to-control ratios of 1:1, 1:2, and 1:4 were examined. The effects of resampling methods on the performance of extensions of CART (random forests and generalized boosted trees) were also examined. CARTs fitted on the oversampled (AUC = 0.70) and undersampled (AUC = 0.74) training data yielded better classification power than the CART fitted on the original training data (AUC = 0.65). Resampling could also improve the classification power of random forests and generalized boosted trees. To conclude, applying resampling methods in a class-imbalanced dataset improved the classification power of CART, random forests, and generalized boosted trees. | -
dcterms.accessRights | open access | en_US
dcterms.bibliographicCitation | International journal of environmental research and public health, Sept. 2014, v. 11, no. 9, p. 9776-9789 | -
dcterms.isPartOf | International journal of environmental research and public health | -
dcterms.issued | 2014 | -
dc.identifier.isi | WOS:000342027500070 | -
dc.identifier.scopus | 2-s2.0-84908081464 | -
dc.identifier.pmid | 25238271 | -
dc.identifier.eissn | 1660-4601 | -
dc.identifier.rosgroupid | 2014000616 | -
dc.description.ros | 2014-2015 > Academic research: refereed > Publication in refereed journal | -
dc.description.oa | Version of Record | en_US
dc.identifier.FolderNumber | OA_IR/PIRA | en_US
dc.description.pubStatus | Published | en_US
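
Illustrative note on the abstract: the resampling step it describes (oversampling or undersampling the training dataset to a chosen case-to-control ratio) can be sketched in a few lines. The sketch below is an assumption-laden illustration in plain Python, not the author's code: the function name `resample_to_ratio`, its parameters, the coding of cases as 1 and controls as 0, and the toy data are all hypothetical. It covers random undersampling (dropping controls) and random oversampling (duplicating cases) only; the weighted variant and the CART fitting itself are not shown.

```python
import random
from collections import Counter


def resample_to_ratio(rows, labels, controls_per_case=1, method="under", seed=0):
    """Resample so that cases:controls is roughly 1:controls_per_case.

    Hypothetical helper. Assumes label 1 = case (minority class)
    and label 0 = control (majority class).
    """
    rng = random.Random(seed)
    cases = [i for i, y in enumerate(labels) if y == 1]
    controls = [i for i, y in enumerate(labels) if y == 0]
    if not cases or not controls:
        raise ValueError("both classes must be present")

    if method == "under":
        # Keep every case; draw controls without replacement.
        n_controls = min(len(controls), len(cases) * controls_per_case)
        keep = cases + rng.sample(controls, n_controls)
    elif method == "over":
        # Keep every control; duplicate cases by drawing with replacement.
        # -(-a // b) is ceiling division.
        target_cases = max(len(cases), -(-len(controls) // controls_per_case))
        extra = [rng.choice(cases) for _ in range(target_cases - len(cases))]
        keep = cases + extra + controls
    else:
        raise ValueError("method must be 'under' or 'over'")

    rng.shuffle(keep)
    return [rows[i] for i in keep], [labels[i] for i in keep]


if __name__ == "__main__":
    # Toy training set: 10 cases and 90 controls (made-up data).
    train_rows = list(range(100))
    train_labels = [1] * 10 + [0] * 90
    for method in ("under", "over"):
        for ratio in (1, 2, 4):
            _, y = resample_to_ratio(train_rows, train_labels, ratio, method)
            print(method, f"1:{ratio}", dict(Counter(y)))
```

On the toy data, undersampling at 1:1 keeps all 10 cases and 10 randomly chosen controls, while oversampling at 1:1 keeps all 90 controls and grows the cases to 90 by duplication. As in the abstract, only the training dataset would be resampled; the testing dataset used to compute the AUC is left untouched so that the evaluation reflects the original class distribution.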
Appears in Collections: Journal/Magazine Article
Files in This Item:
File | Description | Size | Format
Lee_Resampling_Class-imbalanced_Datasets.pdf | | 574.77 kB | Adobe PDF
Open Access Information
Status: open access
File Version: Version of Record

Page views: 80 (last week: 1), as of Apr 21, 2024
Downloads: 36, as of Apr 21, 2024
Scopus citations: 44 (last week: 0), as of Apr 19, 2024
Web of Science citations: 40 (last week: 0), as of Apr 25, 2024

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.