Machine learning assisted software defect prediction

Xu, Zhou

Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/91872

Title:	Machine learning assisted software defect prediction
Authors:	Xu, Zhou
Degree:	Ph.D.
Issue Date:	2021
Abstract:	Software products have been integrated into every aspect of our daily life. However, owing to various factors in software design, development, and configuration, defects in software are inevitable. A defect hidden in a software module (a code snippet) threatens the security and decreases the reliability of the software product. It is therefore essential to detect and fix defective modules before delivering the product. As software continues to grow in scale and complexity, improving software quality becomes an increasingly challenging task for developers and testers. Because limited testing resources usually cannot support thorough code reviews, the analysis of a software product must be prioritized. In other words, developers and testers should allocate their limited resources to testing the modules that have a high probability of containing defects. To support such prioritization, researchers have proposed software defect prediction, which identifies high-risk modules for priority inspection. The most widely studied defect prediction methods are supervised models, which first train a classification model on labeled software modules and then use it to determine whether unlabeled modules contain defects. Supervised models need labeled modules from the historical data of the current project or of external projects as the training set. According to the source of the training set, supervised defect prediction can be divided into three scenarios: inner version, cross version, and cross project defect prediction. In these three scenarios, the training set comes from the same version of a project, the previous version of the project, and other external projects, respectively. 
This thesis studies new machine learning techniques that address the different difficulties faced in the three defect prediction scenarios, aiming to further improve the performance of defect prediction. The research contents are as follows: (1) To learn a more discriminative feature representation and to address the inherent class imbalance of defect data, this thesis proposes an inner version defect prediction framework that combines kernel principal component analysis with a weighted extreme learning machine. The framework first maps the training set and the test set separately into a high-dimensional feature space using kernel principal component analysis. This feature mapping makes it easier to distinguish modules that are linearly inseparable in the original feature space. The framework then uses the mapped training set to construct a classification model based on a weighted extreme learning machine, which predicts the labels of the mapped test set. This classification model addresses class imbalance by assigning different weights to defective and non-defective software modules. We conduct experiments on ten projects in the NASA dataset and five projects in the AEEEM dataset, and use six indicators to evaluate the performance of the proposed framework. The results show that the proposed inner version defect prediction framework generally performs better than its variants, several feature selection methods, and class imbalance learning methods. (2) To select, from the previous version, a subset of software modules that is best suited as a training set for the data of the current version, this thesis proposes a two-stage training subset selection framework for cross version defect prediction. 
This framework first uses the sparse modeling representation selection method to filter out uninformative software modules, keeping those that minimize the error of reconstructing the original data. Since this process does not rely on software modules from the current version, it is a self-simplification stage. Then, with the participation of data from the current version, the framework uses the dissimilarity-based sparse subset selection method to select, from the modules kept in the first stage, a further subset that effectively represents the data of the current version. The model constructed with the final module subset is therefore better targeted to the data of the current version. Since this process requires the assistance of software modules from the current version, it is an auxiliary refining stage. We conduct experiments on 67 versions from 17 projects in the PROMISE dataset and again use six indicators to evaluate the performance of the proposed framework. The results show that, across a total of 50 cross-version pairs, the overall performance of the proposed cross version defect prediction framework is superior to that of other training subset selection methods and of variants based on one-stage training subset selection. (3) To further narrow the distribution difference between the data of the two projects in a cross-project pair, this thesis proposes a new transfer learning based cross project defect prediction framework that introduces a state-of-the-art balanced distribution adaptive model. Unlike previous transfer learning based cross project defect prediction models, which consider only the marginal distribution difference across data, this model considers both the marginal and the conditional distribution differences. 
In addition, because the similarity between the data of two projects affects the relative importance of the two distribution differences, the model assigns weights to the two differences so that it can adapt to different cross-project pairs. We conduct experiments on five projects in the NASA dataset and five projects in the AEEEM dataset, and again use six indicators to evaluate the performance of the proposed framework. The results show that, across a total of 40 cross-project pairs, the proposed cross project defect prediction framework performs better overall than other transfer learning based and training data filter based cross project methods. In conclusion, this thesis aims to solve difficult problems in different software defect prediction scenarios and proposes new frameworks that improve the performance of defect prediction by combining new machine learning techniques. It expands the application of machine learning techniques in the field of software engineering and provides new solutions to the software defect prediction task, which is of great significance for software quality assurance activities.
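The weighting of the marginal and conditional distribution differences described in (3) can be illustrated with a small sketch. It is not the thesis's implementation: it assumes a simplified, linear-kernel view in which each difference is the squared distance between feature means, a balance factor `mu` in [0, 1], and pseudo-labels for the target project. All function names and the toy data are hypothetical.

```python
from typing import Dict, Hashable, List, Sequence

Vector = Sequence[float]


def mean_vector(rows: List[Vector]) -> List[float]:
    """Column-wise mean of a non-empty list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def squared_mean_gap(a: List[Vector], b: List[Vector]) -> float:
    """Squared Euclidean distance between the mean vectors of two samples."""
    return sum((x - y) ** 2 for x, y in zip(mean_vector(a), mean_vector(b)))


def group_by_label(rows: List[Vector], labels: List[Hashable]) -> Dict[Hashable, List[Vector]]:
    """Group feature vectors by their (pseudo-)label."""
    groups: Dict[Hashable, List[Vector]] = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return groups


def balanced_gap(src: List[Vector], src_labels: List[Hashable],
                 tgt: List[Vector], tgt_pseudo_labels: List[Hashable],
                 mu: float) -> float:
    """Weighted sum of marginal and conditional differences (assumed form).

    mu = 0 uses only the marginal difference; mu = 1 uses only the
    conditional (per-class) difference.
    """
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie in [0, 1]")
    marginal = squared_mean_gap(src, tgt)
    src_groups = group_by_label(src, src_labels)
    tgt_groups = group_by_label(tgt, tgt_pseudo_labels)
    # Only classes present in both projects contribute to the conditional term.
    shared = set(src_groups) & set(tgt_groups)
    conditional = sum(squared_mean_gap(src_groups[c], tgt_groups[c]) for c in shared)
    return (1.0 - mu) * marginal + mu * conditional


# Toy data (hypothetical): label 1 = defective, label 0 = non-defective.
source = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
source_labels = [0, 0, 1, 1]
target = [[1.0, 0.0], [3.0, 0.0], [1.0, 2.0], [3.0, 2.0]]
target_pseudo_labels = [0, 0, 1, 1]

for mu in (0.0, 0.5, 1.0):
    print(mu, balanced_gap(source, source_labels, target, target_pseudo_labels, mu))
```

In this simplified form, a single scalar `mu` shifts the emphasis between the two differences, which is one way to read the statement that the weights are adapted to each cross-project pair.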
Subjects:	Machine learning; Computer software -- Quality control; Software engineering; Hong Kong Polytechnic University -- Dissertations
Pages:	iv, xx, 163 pages : color illustrations
Appears in Collections:	Thesis

Access

View full-text via https://theses.lib.polyu.edu.hk/handle/200/11502
